AMHARIC SPEECH RECOGNITION USING JOINT TRANSFORMER AND CONNECTIONIST TEMPORAL CLASSIFICATION WITH CHARACTER-BASED AND SUBWORD-BASED ACOUSTIC AND LANGUAGE MODELS
DOI: https://doi.org/10.20372/zede.v42i.10187

Abstract
Sequence-to-sequence attention-based models have attracted considerable interest in recent years for automatic speech recognition. The transformer architecture has been widely applied to a variety of sequence-to-sequence transformation problems, including machine translation and automatic speech recognition. It avoids the sequential computation used in recurrent neural networks, which leads to faster iterations during training. Connectionist temporal classification, on the other hand, is widely employed to accelerate the convergence of sequence-to-sequence models by explicitly learning a better alignment between the input speech feature sequence and the output label sequence. Amharic, a Semitic language spoken by 57.5 million people in Ethiopia, is morphologically rich and poses a challenge for continuous speech recognition, since a root word can be conjugated and inflected into thousands of surface forms to reflect subject, object, tense, and quantity. In this research, connectionist temporal classification is integrated with the transformer for continuous Amharic speech recognition. Speech recognition systems using character-based and subword-based models are also investigated. The results show a best character error rate of 8.04% for the character-based model with a character-level language model, and a best word error rate of 22.31% for the subword-based model with a subword-level language model.
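To illustrate the alignment learning that the abstract attributes to connectionist temporal classification, the sketch below implements the standard CTC forward algorithm for a single utterance in NumPy, together with the usual interpolated joint objective L = λ·L_CTC + (1 − λ)·L_attention used in hybrid CTC/attention training. The function names and the weight λ = 0.3 are illustrative assumptions, not values taken from this paper.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC negative log-likelihood for one utterance via the forward algorithm.

    log_probs: (T, V) array of per-frame log-softmax outputs.
    target:    list of label ids (no blanks).
    """
    T, _ = log_probs.shape
    # Extend the target with blanks between and around labels: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)  # log-domain forward variables
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Stay, or advance from the previous extended state.
            terms = [alpha[t - 1, s]]
            if s > 0:
                terms.append(alpha[t - 1, s - 1])
            # Skip a blank only between two different non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]

    # Valid paths end on the final label or the final blank.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

def joint_loss(ctc_nll, att_nll, lam=0.3):
    """Interpolated CTC/attention objective; lam is a tunable weight."""
    return lam * ctc_nll + (1.0 - lam) * att_nll
```

With uniform per-frame distributions over 3 symbols and 4 frames, the CTC loss for a single-label target sums over all 10 valid blank/label alignments, so it equals −log(10/81); this kind of closed-form check is a convenient sanity test for the recursion.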