Amharic-English Speech Translation in Tourism Domain

Michael Melese Woldeyohannis
Addis Ababa University, Addis Ababa, Ethiopia
michael.melese@aau.edu.et

Laurent Besacier
LIG Laboratory, UJF, BP 53, 38041 Grenoble Cedex 9, France
laurent.besacier@imag.fr

Million Meshesha
Addis Ababa University, Addis Ababa, Ethiopia
michael.melese@aau.edu.et

Proceedings of the First Workshop on Speech-Centric Natural Language Processing, pages 59–66, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Abstract

This paper describes speech translation from Amharic to English, particularly Automatic Speech Recognition (ASR) with a post-editing feature and Amharic-English Statistical Machine Translation (SMT). The ASR experiment is conducted using a morpheme language model (LM) and a phoneme acoustic model (AM). Likewise, SMT is conducted using word and morpheme as the unit.

Morpheme-based translation shows a 6.29 BLEU score at 76.4% recognition accuracy, while word-based translation shows a 12.83 BLEU score at 77.4% word recognition accuracy. Further, after post-editing the Amharic ASR output using a corpus-based n-gram, the word recognition accuracy increased by 1.42%. Since the post-edit approach reduces error propagation, the word-based translation accuracy improved by 0.25 BLEU (1.95%).

We are now working towards further reducing propagated errors through different algorithms at each unit of the speech translation cascading components.

1 Introduction

Speech is one of the most natural forms of communication for humankind (Honda, 2003). Computers with the ability to understand natural language promoted the development of man-machine interfaces. This can be extended through different digital platforms such as radio, mobile, TV, CD and others. Through these, speech translation facilitates communication between people who speak different languages.

Speech translation is the process by which spoken source phrases are translated to a target language using a computer (Gao et al., 2006). Speech translation research for major and technologically supported languages like English, European languages (like French and Spanish) and Asian languages (like Japanese and Chinese) has been conducted since 1983 by NEC Corporation (Kurematsu, 1996). The advancement of speech translation enables communication between people who do not share the same language.

The state of the art in speech translation systems can be seen as the integration of three major cascading components (Gao et al., 2006; Jurafsky and Martin, 2008): Automatic Speech Recognition (ASR), Machine Translation (MT) and Text-To-Speech (TTS) synthesis.

ASR is the process by which a machine infers the words spoken to a computer, so that it correctly understands a recorded audio signal. Besides ASR, MT is the process by which a machine is used to translate a text from a source language to a target language. Finally, TTS creates a spoken version of the text of an electronic document such as a text file or web document.

As one major component of speech translation, Amharic ASR started in 2001 (Melese et al., 2016). A number of attempts have been made for Amharic ASR using different methods and techniques towards designing speaker-independent, large-vocabulary, continuous speech and spontaneous speech recognition.

In addition to ASR, preliminary English-Amharic machine translation experiments were conducted using phonemic transcription on the Amharic corpus (Teshome et al., 2015). The result obtained from the experiment shows that it is possible to design English-Amharic machine translation using a statistical method.

As the last component of speech translation, a number of TTS research efforts have been attempted
using different techniques and methods, as discussed by Anberbir and Takara (2009). Among these, concatenative, cepstral, formant and syllable-based speech synthesizers were the main methods and techniques applied.

All of the above research was conducted using different methods and techniques, as well as different data, and without integration as cascading components. Moreover, the datasets and tools used in the above research are not accessible, which makes it difficult to evaluate the advancement of research in speech technology for local languages.

However, there has been no attempt to integrate ASR, SMT and TTS into a speech translation system for the Amharic language. Thus, the main aim of this study is to investigate the possibility of designing an Amharic-English speech translation system that controls recognition errors propagating through the cascading components.

2 Amharic Language

Amharic is a Semitic language derived from Ge'ez, with the second largest number of speakers in the world next to Arabic (Simons and Fennig, 2017). The name Amharic (አማርኛ) comes from the district of Amhara (አማራ) in northern Ethiopia. It is thought to be the historic, classical and ecclesiastical language of Ethiopia. Moreover, Amharic has five dialectal variations, named Addis Ababa, Gojam, Gonder, Wollo and Menz.

Amharic is the official working language of the government of Ethiopia, among the 89 languages registered in the country with up to 200 different spoken dialects (Simons and Fennig, 2017; Thompson, 2016). Besides this, Amharic is used in governmental administration, public media and national commerce in some regional states of the country, including Addis Ababa, Amhara, Diredawa and Southern Nations, Nationalities and People (SNNP).

Amharic is spoken by more than 25 million people, with up to 22 million native speakers. The majority of Amharic speakers are found in Ethiopia, even though there are also speakers in a number of other countries, particularly Italy, Canada, the USA and Sweden.

Unlike other Semitic languages, such as Arabic and Hebrew, modern Amharic script has inherited its writing system from Ge'ez (ግዕዝ) (Yimam, 2000). Amharic uses a grapheme-based writing system called fidel (ፊደል), written and read from left to right. Amharic graphemes are represented as a sequence of consonant-vowel (CV) pairs, the basic shape determined by the consonant, which is modified for the vowel.

The Amharic writing system is composed of four distinct categories consisting of 276 different symbols: 33 core characters with 7 orders (አ, ኡ, ኢ, ኣ, ኤ, እ and ኦ), 4 labiovelars with 5 orders (ቅ, ኅ, ክ and ግ), 18 labialized consonants with 1 order (ውኣ) and 1 labiodental character with 7 orders (አ, ኡ, ኢ, ኣ, ኤ, እ and ኦ).

In the Amharic writing system, all 276 symbols are indispensable due to their distinct orthographic representation.

However, as part of speech translation, speech recognition mainly deals with distinct sounds. Among these symbols, some graphemes generate the same sound, like (ህ, ሕ, ኅ and ኽ), which are all pronounced as ህ /h/.

On the other hand, machine translation emphasizes orthographic representation, in which the same meaning can be written with different graphemes. As a result, normalization is required to minimize grapheme variation, which leads to better translation while minimizing the ASR model. Table 1 presents the Amharic character set before and after normalization.

                  Unnormalized    Normalized    Difference
Core Character         33             27             6
Labiovelar              4              4             0
Labialized             18             18             0
Labiodental             1              1             0
Total                 276            234            42
Table 1: Distribution of Amharic character set, adopted and modified from Melese et al. (2016)

As a result, graphemes that generate the same sound are normalized into the seven orders of a single core character. The normalization is based on the characters most frequently used in Amharic text documents. This includes normalization from (ህ, ሕ, ኅ and ኽ) to ህ, (እ, ዕ) to እ, (ሥ, ስ) to ስ and (ጽ, ፅ) to ጽ, along with their orders.

3 Tourism in Ethiopia

Tourism is the activity of traveling to and staying in places outside one's usual environment for not more than one year to create direct contact between people and cultures (UNWTO, 2016). Ethiopia has much to offer for international
tourists¹, ranging from the peaks of the rugged Semien mountains to one of the lowest points on earth, the Danakil Depression, which is more than 400 feet below sea level.

In addition, tourism has become a form of sustainable economic development that serves as an alternative source of foreign exchange for countries like Ethiopia.

Moreover, the 2015 United Nations World Tourism report (UNWTO, 2016) and the World Bank report² indicate that, in 2015, a total of 864,000 non-resident tourists came to Ethiopia to visit different tourist attractions. These include ancient and medieval cities and world heritage sites registered by UNESCO. From 2010 to 2015, the number of tourists increased by an average of 13.05% per year.

According to Walta Information Center³, citing the Ethiopian Ministry of Culture and Tourism, Ethiopia secured 872 million dollars in the first quarter of its 2016/17 fiscal year from 223,032 international tourists. The revenue came mostly through conference tourism, research, business and other activities. The majority of the tourists were from the USA, England, Germany, France and Italy, speaking foreign languages. Although tourists express their ideas using different languages, the majority of them can speak and communicate in English to exchange information about tourist attractions.

Due to this, language barriers are a major problem for today's global communication (Nakamura, 2009). As a result, tourists look for an alternative option that lets them communicate with their surroundings.

Thus, a speech translation system is one of the best technologies for filling the communication gap between people who speak different languages (Nakamura, 2009). This is especially true in overcoming the language barriers of today's global communication, besides supporting under-resourced languages.

However, under-resourced languages such as Amharic suffer from a lack of digital text and speech corpora to support speech translation. So, after collecting text and speech corpora, moving one step further helps in solving the language barrier problem.

Therefore, this study attempts to come up with an Amharic-English speech translation system taking tourism as a domain.

4 Data Preparation

Nowadays, Amharic suffers from a lack of speech and text corpora for ASR and SMT. Besides this, collecting standardized and annotated corpora is one of the most challenging and expensive tasks when working with under-resourced languages (Besacier et al., 2006; Gauthier et al., 2016).

For Amharic speech recognition training and development, a 20-hour read speech corpus prepared by Abate et al. (2005) was used. However, due to the unavailability of standardized corpora for speech translation in the tourism domain, a text corpus was acquired from resourced and technologically supported languages, particularly English.

Accordingly, parallel English-Arabic text data was acquired from the Basic Traveller Expression Corpus (BTEC) 2009, which is made available through the International Workshop on Spoken Language Translation (IWSLT) (Kessler, 2010). A parallel Amharic-English corpus has been prepared by translating the English BTEC data using a bilingual speaker. This data is used for the development of the speech translation cascading components, such as ASR and SMT.

The corpus has a total of 28,084 Amharic-English parallel sentences. To keep the dataset consistent, the text corpus has been further preprocessed: typing errors have been corrected, abbreviations expanded, numbers textually transcribed and concatenated words separated.

Amharic speech recognition is conducted using words and morphemes as language model units with a phoneme-based acoustic model. Similarly, word and morpheme have been used as translation units for Amharic in Amharic-English machine translation. Morpheme-based segmentation of the training, development and test sets is obtained by segmenting words into sub-word units using the corpus-based, language-independent and unsupervised segmentation of Morfessor 2.0 (Smit et al., 2014).

Once the Amharic-English BTEC corpus is prepared, it is divided into training, tuning and test sets with proportions of 69.33% (19,472 sentences), 1.78% (500 sentences) and 28.88% (8,112 sentences), respectively.

Then, the 8,112 (28.88%) test set sentences were recorded in a normal office environment by eight (4 male and 4 female) native Amharic speakers using LIG-Aikuma, a smartphone-based application tool (Blachon et al., 2016).

Accordingly, a total of 7.43 hours of read speech, with utterances ranging from 1,020 ms to 14,633 ms and an average length of 3,297 ms, has been collected from the tourism domain.

Moreover, as suggested by Melese et al. (2016), a morphologically rich and under-resourced language like Amharic achieves better recognition accuracy using a morpheme-based language model with a phoneme-based acoustic model.

Similarly, language model data for Amharic speech recognition has been collected from different sources. A text corpus collected for a Google project (Tachbelie and Abate, 2015) has been used in addition to the BTEC SMT training data, excluding the test data. Table 2 presents the training, development and language model data used for Amharic speech recognition.

                                        Language Model
              Train        Test        Word       Morpheme
Sentence     10,875       8,112     261,620        261,620
Token       145,404      50,906   4,223,835      5,773,282
Type         24,653       4,035     328,615        141,851
Table 2: Distribution of Amharic data for ASR.

Like speech recognition, a total of 42,134 sentences (374,153 tokens of 8,678 types) of English language model data have been used for Amharic-English machine translation. The data is collected from the same BTEC corpus, excluding the test data.

Consequently, corpus-based and language-independent segmentation has been applied to the training, development and test sets of the Amharic SMT data. Morfessor is used to segment words into sub-word units. Table 3 presents a summary of the corpus used for Amharic-English machine translation using word and morpheme as a unit.

Language   Unit                       Train       Dev      Test
Amharic    Word       Sentence       19,472       500     8,172
                      Token         107,049     2,795    37,288
                      Type           18,650     1,470     4,168
           Morpheme   Sentence       19,472       500     8,172
                      Token         145,419     3,828    50,906
                      Type           15,679     1,621     4,035
English    Word       Sentence       19,472       500     8,172
                      Token         157,550     4,024    55,062
                      Type           10,544     1,227     3,775
Table 3: Distribution of Amharic-English SMT data.

Likewise, the post-edit is conducted using a corpus-based n-gram approach. Accordingly, a corpus containing 681,910 sentences (11,514,557 tokens of 582,150 types) was crawled from the web, including news and magazines.

Then, the data is further cleaned, preprocessed and normalized. From this data, a total of 5,057,112 bigram, 8,341,966 trigram, 9,276,600 quadrigram and 9,242,670 pentagram word sequences have been extracted after expanding numbers and abbreviations.

5 System Architecture

As discussed in Section 1, the state of the art in speech translation is the integration of cascading components to translate speech from a source language (Amharic) to a target language (English).

As part of the cascading components, the output of a speech recognizer contains a variety of errors. These errors further propagate to the succeeding component of speech translation, which results in low performance.

Hence, in this study we propose an Amharic ASR post-editing module that can detect an error, identify possible suggestions and finally correct the error based on the proposal. The correction is made from an n-gram data store, using minimum edit distance and perplexity, before the error reaches statistical machine translation.

Figure 1 presents the Amharic-English speech-to-speech translation (S2ST) architecture with and without ASR post-editing.

The post-edit process mainly consists of three phases: error detection, correction proposal and finally correction suggestion, as depicted in Figure 2.

The first phase of post-editing is to detect errors in the ASR output. To detect an error, recognized morpheme units are concatenated to form a word, and its existence is checked in a unigram Amharic dictionary.

Thus, a morpheme-based speech recognition output “Î+ -s¶³ …¡ -°È¶Û °sã €Ôr+-Ý†∫”⁴ is concatenated to form the phrase “Îs¶³ …¡ -°È¶Û °sã €ÔrÝ†∫”.

1 http://www.investethiopia.gov.et/images/pdf/Investment_Brochure_to_Ethiopia.pdf
2 http://data.worldbank.org/indicator/ST.INT.ARVL?end=2015
3 https://www.waltainfo.com/FeaturedArticles/detail?cid=28751
4 “+” marks a morpheme that is followed by another morpheme, while “-” marks a morpheme that is preceded by one.
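The detection and proposal phases of the post-edit described in Section 5 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the "+"/"-" morpheme markers explained in footnote 4, uses a small Latin-script word set as a hypothetical stand-in for the unigram Amharic dictionary, and ranks candidates by minimum edit distance only. The perplexity-based ranking mentioned in the text is omitted, and the function names and the max_dist threshold are illustrative choices.

```python
import re


def join_morphemes(text: str) -> str:
    """Concatenate morphemes: 'stem+ -suffix' or 'stem+-suffix' -> 'stemsuffix'."""
    return re.sub(r"\+\s*-", "", text)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def detect_errors(words, unigrams):
    """Phase 1: a word is flagged if it is absent from the unigram dictionary."""
    return [w for w in words if w not in unigrams]


def propose(word, unigrams, max_dist=2):
    """Phase 2: dictionary words within max_dist edits, closest first."""
    scored = sorted((edit_distance(word, cand), cand) for cand in unigrams)
    return [cand for dist, cand in scored if dist <= max_dist]


# Hypothetical dictionary and recognizer output, in Latin script for readability.
unigrams = {"betoch", "new"}
words = join_morphemes("bet+ -och nwe").split()
flagged = detect_errors(words, unigrams)
suggestions = {w: propose(w, unigrams) for w in flagged}
```

In a fuller version, the candidates returned by propose would be re-ranked with the bigram to pentagram counts from Section 4 before one is selected.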
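The grapheme normalization described in Section 2 can be sketched in the same spirit. The sketch assumes that each homophone series occupies one contiguous row of the Unicode Ethiopic block and that "along with their orders" means all seven orders of a variant series are mapped to the matching order of the canonical series. The code points and the HOMOPHONE_SERIES table are assumptions made for illustration, not taken from the paper.

```python
# (variant series start, canonical series start) as Unicode code points.
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ETHIOPIC HHA series -> HA series
    (0x1280, 0x1200),  # ETHIOPIC XA series  -> HA series
    (0x12B8, 0x1200),  # ETHIOPIC KXA series -> HA series
    (0x12D0, 0x12A0),  # ETHIOPIC PHARYNGEAL A series -> GLOTTAL A series
    (0x1220, 0x1230),  # ETHIOPIC SZA series -> SA series
    (0x1340, 0x1338),  # ETHIOPIC TZA series -> TSA series
]
ORDERS = 7  # the seven vowel orders of a core character

# Translation table: every order of a variant maps to the same order of its target.
NORMALIZATION_TABLE = {
    variant + order: canonical + order
    for variant, canonical in HOMOPHONE_SERIES
    for order in range(ORDERS)
}


def normalize(text: str) -> str:
    """Replace homophone graphemes with their canonical counterparts."""
    return text.translate(NORMALIZATION_TABLE)
```

Under these assumptions the table holds 6 series × 7 orders = 42 entries, which agrees with the difference of 42 symbols reported in Table 1.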