160x Filetype PDF File size 0.38 MB Source: aclanthology.org
Amharic-English Speech Translation in Tourism Domain Michael Melese Woldeyohannis Addis Ababa University, Addis Ababa, Ethiopia michael.melese@aau.edu.et Laurent Besacier Million Meshesha LIGLaboratory, UJF, BP53, Addis Ababa University, 38041Grenoble Cedex 9, France Addis Ababa, Ethiopia laurent.besacier@imag.fr michael.melese@aau.edu.et Abstract guageusingacomputer(Gaoetal.,2006). Speech This paper describes speech translation translation research for major and technologi- from Amharic-to-English, particularly cal supported languages like English, European Automatic Speech Recognition (ASR) languages (like French and Spanish) and Asian with post-editing feature and Amharic- languages (like Japanese and Chinese) has been English Statistical Machine Translation conducted since the 1983s by NEC Corporation (SMT). ASR experiment is conducted (Kurematsu, 1996). The advancement of speech using morpheme language model (LM) translation captivates the communication between and phoneme acoustic model (AM). people who do not share the same language. Likewise, SMTconductedusingwordand The state-of-the-art of speech translation sys- morphemeasunit. tem can be seen as the integration of three major cascading components (Gao et al., 2006; Jurafsky Morphemebased translation shows a 6.29 andMartin,2008);AutomaticSpeechRecognition BLEU score at a 76.4% of recognition (ASR), Machine Translation (MT) and Text-To- accuracy while word based translation Speech (TTS) synthesis. shows a 12.83 BLEU score using 77.4% ASR is the process by which a machine infers word recognition accuracy. Further, after spoken words, by means of talking to computer, post-edit on Amharic ASR using corpus and having it correctly understand a recorded au- based n-gram, the word recognition accu- dio signal. Beside ASR, MT is the process by racy increased by 1.42%. Since post-edit which a machine is used to translate a text from approach reduces error propagation, the one source language to another target language. wordbasedtranslation accuracy improved Finally, TTScreatesaspokenversionfromthetext by0.25(1.95%) BLEUscore. of electronic document such as text file and web We are now working towards further im- document. proving propagated errors through differ- As one major component of speech transla- ent algorithms at each unit of speech trans- tion, Amharic ASR started in 2001 (Melese lation cascading component. et al., 2016). A number of attempts have been 1 Introduction made for Amharic ASR using different methods and techniques towards designing speaker inde- Speech is one of the most natural form of com- pendent, large vocabulary, contineous speech and munication for humankind (Honda, 2003). Com- spontaneous speech recognition. puter with the ability to understand natural lan- In addition to ASR, a preliminary English- guagepromotedthedevelopmentofman-machine Amharic machine translation experiments was interface. This can be extended through different conducted using phonemic transcription on the digital platforms such as radio, mobile, TV, CD Amharic corpus (Teshome et al., 2015). The and others. Through these, speech translation fa- result obtained from the experiment shows that, cilitates communication between the people who it is possible to design English-Amharic machine speak different languages. translation using statistical method. Speech translation is the process by which spo- As the last component of speech translation, ken source phrases are translated to a target lan- a number of TTS research have been attempted 59 Proceedings of the First Workshop on Speech-Centric Natural Language Processing, pages 59–66 c Copenhagen, Denmark, September 7–11, 2017. 2017 Association for Computational Linguistics using different techniques and methods as dis- based writing system called fidel (âÔl) written cussed by (Anberbir and Takara, 2009). Among and read from left to right. Amharic graphemes these, concatenative, cepstral, formant and a sylla- are represented as a sequence of consonant vowel blebasedspeechsynthesizerswerethemainmeth- (CV)pairs,thebasicshapedeterminedbythecon- ods and techniques applied. sonant, which is modified for the vowel. All the above research works were conducted The Amharic writing system is composed of using different methods and techniques beside four distinct categories consisting of 276 different data difference and integration as a cascading symbols; 33 core characters with 7 orders (€, ∫, component. Moreover, dataset and tools used in ‚,ƒ,„,…and†),4labiovelarswith5orderssym- the above research are not accessible which makes bol (q, u, k and g), 18 labialized consonants with difficult to evaluate the advancement of research 1order(wƒ)and1labiodentalcharactersconsist- in speech technology for local languages. ing 7 orders (€, ∫, ‚, ƒ, „, … and †). However, there is no attempt to integrate ASR, In Amharic writing system, all the 276 distinct SMT and TTS to come up with speech transla- orthographic representation are indispensable due tion system for Amharic language. Thus, the main to their distinct orthographic representation. aim of this study is to investigate the possibility However, as part of speech translation, speech to design Amharic-English speech translation sys- recognition mainly deals with distinct sound. tem that controls recognition errors propagating Among those, some of the graphemes generate through cascading components. same sound like (h, M, u and Ω) pronounced as h/h/. 2 AmharicLanguage Ontheother hand, Machine translation empha- AmharicisaSemiticlanguagederivedfromGe’ez sizes on orthographic representation which result with the second largest speaker in the world the same meaning in different graphemes. As a next to Arabic (Simons and Fennig, 2017). The result, normalization is required to minimize the name Amharic (€≈r{) comes from the district graphemes variation which leads to better trans- of Amhara (€≈•) in northern Ethiopia, which is lation while minimizing the ASR model. Table 1 thought to be the historic, classical and ecclesi- presentstheAmhariccharactersetbeforeandafter astical language of Ethiopia. Moreover, the lan- normalization. guage Amharic has five dialectical variations spo- Unnormalized Normalized Difference ken named as: Addis Ababa, Gojam, Gonder, Core Character 33 27 6 Wollo and Menz. Labiovelar 4 4 0 Labialized 18 18 0 Amharic is the official working language of Labiodental 1 1 0 government of Ethiopia among the 89 languages Total 276 234 42 registered in the country with up to 200 differ- Table 1: Distribution of Amharic character set ent spoken dialects (Simons and Fennig, 2017; adopted and modified from (Melese et al., 2016) Thompson, 2016). Beside these, Amharic lan- guage is being used in governmental administra- As a result, graphemes that generate the same tion, public media and national commerce of some sound are normalized in to the seven order of core regionalstatesofthecountry. Thisincludes;Addis character. The normalization is based on the usage Ababa, Amhara, Diredawa and Southern Nations, of most characters frequency in Amharic text doc- Nationalities and People (SNNP). ument. This includes, normalization from (h, M, Amharic language is spoken by more than 25 uandΩ)toh,(…, e) to …, (U, s) to s and (Õ, Ý) million with up to 22 million native speakers. The to Õ along with order. majority of Amharic speakers found in Ethiopia even though there are also speakers in a number 3 TourisminEthiopia of other countries, particularly Italy, Canada, the USAandSweden. Tourism is the activity of traveling to and stay- Unlike other Semitic languages, such as Ara- ing in places outside their usual environment bic and Hebrew, modern Amharic script has in- for not more than one year to create a direct herited its writing system from Ge’ez (gez) (Yi- contact between people and cultures (UNWTO, mam, 2000). Amharic language uses a grapheme 2016). Ethiopia has muchtoofferforinternational 60 tourists1 ranging from the peaks of the rugged one step further helps in solving language barriers Semien mountains to the lowest points on earth problem. called Danakil Depression which is more than 400 Therefore, this study attempts to come up with feet below sea level. an Amharic-English speech translation system In addition, tourism become a pleasing sustain- taking tourism as a domain. able economicdevelopmentthatservesasanalter- native source of foreign exchange for the counties 4 DataPreparation like Ethiopia. Nowadays, Amharic language suffers from a lack Moreover, The 2015 United Nations World of speech and text corpora for ASR and SMT. Be- Tourism report (UNWTO, 2016) and the World side these, collecting standardized and annotated 2 Bank report indicate that, in 2015 a total of corpora is one of the most challenging and ex- 864,000 non-resident tourists come to Ethiopia to pensive tasks when working with under resourced visit different tourist attraction. These include; languages (Besacier et al., 2006; Gauthier et al., ancient, medieval cities and world heritages reg- 2016). istered by UNESCO as tourist attraction. Since For Amharic speech recognition training and the year 2010 until 2015, the average number of development, 20 hours of read speech corpus pre- tourist flow increase by 13.05% per year. pared by Abate et. al (2005) were used. How- 3 According to Walta Information Center , cit- ever, due to unavailability of standardized corpora ing Ethiopia Ministry of Culture and Tourism, for speech translation in tourism domain, a text Ethiopia has secured 872 million dollars in first corpus is acquired from resourced and technolog- quarter of its 2016/17 fiscal year from 223,032 ically supported languages particularly English. international tourists. The revenue was mostly Accordingly, a parallel English-Arabic text data throughconferencetourism,researchbusinessand was acquired from the Basic Traveller Expres- other activities. Majority of the tourists were from sion Corpus (BTEC) 2009 which is made avail- USA,England, Germany, France and Italy speak- able through International Workshop on Spoken ing foreign languages. Beside this, tourists ex- Language Translation (IWSLT) (Kessler, 2010). press their ideas using different languages, the ma- Aparallel Amharic-English corpus has been pre- jority of the tourists can speak and communicate pared by translating the English BTEC data using in English to exchange information about tourist a bilingual speaker. This data is used for the de- attractions. velopmentofspeechtranslationcascadingcompo- Duetothis, language barriers are a major prob- nent such as, ASR and SMT. lemfortoday’sglobalcommunication(Nakamura, 2009). As a result, they look for an alternate The corpus has a total of 28,084 Amharic- option that lets them communicate with the sur- English parallel sentences. To keep the dataset rounding. consistent, the text corpus has been further prepro- Thus, speech translation system is one of the cessed, such as typing errors are corrected, abbre- best technologies used to fill the communication viations have been expanded, numbers have been gap between the people who speak different lan- textually transcribed and concatenated words have guages (Nakamura, 2009). This is especially been separated. true in overcoming language barriers of today’s Amharic speech recognition is conducted using global communication besides supporting under- words and morphemes as a language model with resourced language. a phoneme-based acoustic model. Similarly word However, under-resourced languages such as andmorphemehavebeenusedasatranslationunit Amharic, suffer from having a digital text and for Amharic in Amharic-English machine trans- speech corpus to support speech translation. So, lation. Morpheme-based segmentation of train- after collecting text and speech corpora, moving ing, development, testing obtained by segment- 1http://www.investethiopia.gov.et/ ing word into sub-word unit using corpus-based, images/pdf/Investment_Brochure_to_ language independent and unsupervised segmen- Ethiopia.pdf tation for using morfessor 2.0 (Smit et al., 2014). 2 http://data.worldbank.org/indicator/ OncetheAmharic-EnglishBTECcorpusispre- ST.INT.ARVL?end=2015 3https://www.waltainfo.com/ pared, it is divided into training, tuning and test- FeaturedArticles/detail?cid=28751 ing set with a proportion of 69.33% (19472 sen- 61 tences), 1.78%(500 sentences) and 28.88%(8112 Unit Train Dev Test sentences), respectively. Sentence 19,472 500 8,172 Word Token 107,049 2,795 37,288 Then, the 8112 (28.38%) test set sentences Amharic Type 18,650 1,470 4,168 are recorded under a normal office environment Sentence 19,472 500 8,172 from eight (4 Male and 4 Female) native Amharic Morpheme Token 145,419 3,828 50,906 Type 15,679 1,621 4,035 speakers using LIG-Aikuma, a smartphone based Sentence 19,472 500 8,172 application tool (Blachon et al., 2016). English Word Token 157,550 4,024 55,062 Accordingly, a total of 7.43 hours read speech Type 10,544 1,227 3,775 corpus ranging from 1,020 ms to 14,633 ms with Table 3: Distribution of Amharic-English SMT an average speech time of 3,297 ms has been col- data. lected from the tourism domain. Moreover,assuggestedbyMeleseetal.,(2016), quenceshavebeenextractedafterexpandingnum- morphologically rich and under-resourced lan- bers and abbreviation. guage like Amharic provides a better recognition accuracy using morpheme based language model 5 SystemArchitecture with phoneme based acoustic model. Similarly, language model data for Amharic As discussed in Section 1, the state-of-the-art of speech recognition has been collected from differ- speech translation suggest to apply through the ent sources. A text corpus collected for Google integration of cascading components to translate project (Tachbelie and Abate, 2015) have been speech from source language (Amharic) to a tar- used in addition to BTEC SMT training data ex- get language (English). cluding the test data. Table 2 presents the train- Aspartofthecascadingcomponents,theoutput ing, development and language model data used of a speech recognizer contains more and presents for Amharic speech recognition. avariety of errors. These errors further propagates Language Model to the succeeding component of speech translation Train Test Word Morpheme which results in low performance. Sentence 10,875 8,112 261,620 261,620 Hence, in this study we propose an Amharic Token 145,404 50,906 4,223,835 5,773,282 ASR post-editing module that can detect an er- Type 24,653 4,035 328,615 141,851 ror, identify possible suggestion and finally correct Table 2: Distribution of Amharic data for ASR. based on the proposal. The correction is made us- Like speech recognition, a total of 42,134 sen- ingn-gramdatastoreusingminimumeditdisatnce tences (374,153 token of 8,678 type) English lan- and perplexity before the error heads to statistical guage model data have been used for Amharic- machine translation. English machine translation. The data is collected Figure 1 presents Amharic-English speech-to- from the same BTEC corpus excluding test data. speech translation (S2ST) architecture with and Consequently, corpus based and language in- without considering ASR post-edit. dependent segmentation have been applied on a The post-edit process mainly consists of three training, development and test set of Amharic different phases; error detection, correction pro- SMT data. Morfessor is used to segment words posal and finally suggest correction as depicted in to a sub word units. Table 3 presents summary Figure 2. of the corpus used for Amharic-English machine The first phase of post editing is to detect the translation using word and morpheme as a unit. error from ASR recognition output. Basically, to Likewise,thepost-editisconductedusingacor- detect an error, recognized morpheme units are pus based n-gram approach. Accordingly, a cor- concatenated to form a word and its existence is pus containing 681,910 sentences (11,514,557 to- checked in unigram Amharic dictionary. kens) of 582,150 type data crawled from web in- Thus, a morpheme-based speech recognition cluding news and magazine. output “Î+ -s¶³ …¡ -°È¶Û °sã €Ôr+-݆∫ 4 Then, the data is further cleaned, preprocessed ” concatenated to form a phrase “Îs¶³ …¡ - and normalized. From this data, a total of °È¶Û °sã €Ôr݆∫”. 5,057,112 bigram, 8,341,966 trigram, 9,276,600 4“+” refers to morphemes followed by other morpheme quadrigram and 9,242,670 pentagram word se- while “-” refer to leading morpheme is there. 62
no reviews yet
Please Login to review.