NITK-UoH: Tamil-Telugu Machine Translation Systems for the WMT21 Similar Language Translation Task

Richard Saldanha, Ananthanarayana V. S. and Anand Kumar M
Department of Information Technology, National Institute of Technology Karnataka
NH 66, Srinivasnagar, Surathkal, Mangalore, Karnataka 575025, India
richardsaldanha.207it005@nitk.edu.in, anvs@nitk.edu.in, m_anandkumar@nitk.edu.in

Parameswari Krishnamurthy
Centre for Applied Linguistics and Translation Studies, University of Hyderabad
Prof. CR Rao Road, Gachibowli, Hyderabad, Telangana 500046, India
pksh@uohyd.ac.in

Abstract

In this work, two Neural Machine Translation (NMT) systems have been developed and evaluated as part of the bidirectional Tamil-Telugu similar language translation subtask in WMT21. The OpenNMT-py toolkit has been used to create quick prototypes of the systems, following which models have been trained on the training datasets containing the parallel corpus, and finally the models have been evaluated on the dev datasets provided as part of the task. Both systems have been trained on a DGX station with 4 V100 GPUs. The first NMT system in this work is a Transformer based 6 layer encoder-decoder model, trained for 100000 training steps, whose configuration is similar to the one provided by OpenNMT-py, and this is used to create a model for bidirectional translation. The second NMT system contains two unidirectional translation models with the same configuration as the first system, with the addition of utilizing Byte Pair Encoding (BPE) for subword tokenization through the pre-trained MultiBPEmb model. Based on the dev dataset evaluation metrics for both systems, the first system, i.e. the vanilla Transformer model, has been submitted as the Primary system. Since there were no improvements in the metrics during training of the second system with BPE, it has been submitted as a contrastive system.

1 Introduction

Tamil is a language predominantly spoken in Tamil Nadu, a state in Southern India, along with countries with a large Tamil speaking diaspora such as Sri Lanka, Malaysia and Singapore, to name a few. Telugu, on the other hand, is the official language of two Southern states in India, namely Andhra Pradesh and Telangana. It is also spoken among the Telugu speaking immigrant population in the USA, Canada and the UK. Both languages belong to the Dravidian family of languages, which comprises Tamil, Telugu, Kannada and Malayalam as the major languages spoken in South India. Despite belonging to the same family of languages, there are many differences between Tamil and Telugu, such as the script used for writing and linguistic differences in terms of phonology, morphology and syntax, among others. Tamil belongs to the Southern branch of Dravidian languages and has a rich literary tradition spanning more than 2000 years. Telugu, on the other hand, belongs to the South Central branch of Dravidian languages and has a considerable number of linguistic characteristics that differ from Tamil, as described by Krishnamurthy (2019).

As part of the similar language translation subtask for Dravidian languages, namely Tamil (TA) and Telugu (TE), we have attempted to build Neural Machine Translation (NMT) models using the OpenNMT-py toolkit (https://opennmt.net/OpenNMT-py/main.html), which helps to generate quick prototypes of NMT models with the desired configurations. The first NMT system (submitted as the primary system) in this work is a Transformer based 6 layer encoder-decoder model which provides a single model for bidirectional translation between Tamil and Telugu using the datasets provided for this shared task. The second NMT system (submitted as the contrastive system) consists of two unidirectional translation models with the same configuration as the first system, but with the addition of utilizing Byte Pair Encoding (BPE) for subword tokenization using the pre-trained MultiBPEmb model (Heinzerling and Strube, 2018).

The rest of the work is described in sections that cover the related work, data, system description, results and conclusion.
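As an illustration of the kind of subword tokenization used by the contrastive system, the sketch below loads a pre-trained MultiBPEmb model through the bpemb Python package (Heinzerling and Strube, 2018) and segments one Tamil and one Telugu word. This is a hedged example rather than the exact preprocessing used for the submitted system: the 100000-token multilingual vocabulary and the embedding dimensionality are assumptions based on the description later in this paper, and the example words are arbitrary.

```python
# Minimal, hedged sketch: subword tokenization with a pre-trained MultiBPEmb
# model via the bpemb package (pip install bpemb). The vocabulary size and
# dimensionality below are assumptions; the submission's preprocessing may differ.
from bpemb import BPEmb

# lang="multi" selects MultiBPEmb, a single BPE model shared across many languages.
multi_bpe = BPEmb(lang="multi", vs=100000, dim=300)

for word in ["வணக்கம்", "నమస్కారం"]:  # Tamil "vanakkam", Telugu "namaskaram"
    subwords = multi_bpe.encode(word)   # list of subword strings
    ids = multi_bpe.encode_ids(word)    # corresponding vocabulary ids
    print(word, subwords, ids)
```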
2 Rationale for Selecting the Models and Related Work

There has been a significant amount of work done on developing machine translation systems for Indian languages, with some notable examples for Dravidian languages such as Tamil and Malayalam described in Kumar et al. (2019). This shared task provides a unique challenge in terms of the constraint on the parallel aligned language pair data made available for training. The other challenges include the linguistically rich and domain specific content present in the Prime Minister of India (PMI) and the Mann ki Baat (MKB) datasets, where topics related to India's domestic and foreign policy issues can be found.

In order to address the challenge of lengthy input (samples containing more than 300 space delimited tokens), the Transformer model described by Vaswani et al. (2017) was adopted. This model provides the multi-head attention mechanism, which helps retain context for longer sentence samples. To reduce the vocabulary, reduce the training time and possibly improve the translation quality (through subword tokenization), a MultiBPEmb model trained with a vocabulary of 100000 tokens from 275 languages has been utilised (Heinzerling and Strube, 2018).

Other methods to improve translation quality that have not been explored as part of this work are the use of back translation using a monolingual corpus or corpora, on the lines of the one described by Sennrich et al. (2016). Factored NMT (which uses data tagged on the basis of morphology and Parts of Speech (POS)), such as the one described by García-Martínez et al. (2016), is another possible candidate suitable for the kind of challenge provided by the similar language translation task, as the use of POS and morphological information can reduce the number of tokens and make the models more generalizable in terms of predictions.

3 Data

The datasets used in the NMT systems for this work are the parallel aligned Tamil and Telugu (TA-TE) language pairs provided as part of the Dravidian Language subtask of the Similar Language Translation shared task (https://wmt21similar.cs.upc.edu/). Some statistics about the datasets are outlined in Tables 1 and 2.

Dataset Type | Dataset Name | Number of samples
Parallel aligned TA-TE pairs (Training) | PMIndia | 26009
Parallel aligned TA-TE pairs (Training) | News | 11038
Parallel aligned TA-TE pairs (Training) | MKB | 3100
Parallel aligned TA-TE pairs (Dev) | Dev | 1261
Non-aligned TA-TE sets (Test) | Test | 1735 (per language set)

Table 1: Dataset statistics for parallel aligned Tamil-Telugu pairs used as train and dev (validation) datasets, along with non-aligned samples used as the test set.

Dataset Type | Dataset Name | Language | Longest Line Length
Training | PMIndia | TA | 659
Training | News | TA | 1524
Training | MKB | TA | 412
Dev | Dev | TA | 923
Test | Test | TA | 1544
Training | PMIndia | TE | 718
Training | News | TE | 1356
Training | MKB | TE | 376
Dev | Dev | TE | 1004
Test | Test | TE | 757

Table 2: Dataset statistics for the longest line (in space delimited tokens) per dataset and language.
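Statistics of the kind reported in Tables 1 and 2 can be gathered with a short script. The sketch below is a hedged illustration of how per-corpus sample counts and longest-line lengths (in space delimited tokens) could be computed, assuming each corpus is stored as plain-text files with one sentence per line; the file names are placeholders, not the names distributed by the shared task.

```python
# Hedged sketch: per-corpus sample counts and longest line length in
# space-delimited tokens (the statistics reported in Tables 1 and 2).
# File names are illustrative placeholders.
from pathlib import Path

corpora = {
    "PMIndia": ("pmindia.ta", "pmindia.te"),
    "News": ("news.ta", "news.te"),
    "MKB": ("mkb.ta", "mkb.te"),
}

for name, (ta_file, te_file) in corpora.items():
    for lang, path in (("TA", ta_file), ("TE", te_file)):
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        longest = max(len(line.split()) for line in lines)  # space-delimited tokens
        print(f"{name} {lang}: {len(lines)} samples, longest line = {longest} tokens")
```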
3.1 Dataset preprocessing

Due to the moderate size of the training dataset, which contains 40147 samples, and the topic overlap of sentence samples between the training and dev datasets, as well as the test set (to a certain extent), on topics such as the Indian Prime Minister's statements on domestic issues and foreign policies in the PM India dataset, the entire training dataset has been utilized in its original form. The lengthwise statistics of the dataset (in terms of space delimited tokens) given in Table 2 were taken as the deciding factor in fixing the maximum input length at 1600 for the NMT systems developed. Tokenization for the primary system was done on space delimited tokens, which yielded a shared Tamil-Telugu vocabulary of 194860 tokens. Using the MultiBPEmb model for subword tokenization, on the other hand, gave a vocabulary of 14056 tokens for Tamil (TA) and 13170 tokens for Telugu (TE), which included some English words as well.

4 System Description

As mentioned in section 1, the PyTorch based toolkit OpenNMT-py has been used to create rapid prototypes of NMT models (the motivation for this can be seen in section 2), which have then been trained on the datasets provided, validated against the provided dev sets, and finally used to produce translations for the test sets described in section 3, which have been submitted to the committee evaluating the Similar Language Translation task.

A DGX station with 4 V100 GPUs has been used to train the models utilized in this task. A Transformer based 6 layer encoder-decoder model along the lines of the NMT system described by Vaswani et al. (2017) was trained for 100000 training steps as the first NMT system to be evaluated. The configuration for this model is the same as that provided by OpenNMT-py. In order to save time, a single bidirectional translation model for the TA-TE language pair has been created, which can translate from Tamil to Telugu and vice versa. The datasets used in this system were doubled in terms of the number of samples when compared to the second NMT system (the contrastive submission), by reversing the position of the TA-TE language pairs and appending them to the original datasets. No special tagging identifiers were used, as the Tamil and Telugu scripts are distinct.

Basic space delimited tokenization was applied to the datasets, which resulted in a combined TA-TE vocabulary of 194860 tokens being generated; the relevant key configuration values for this model are listed in Table 3.

Model Configuration Name | Model Configuration Value
Corpus weight for the PMI dataset | 23
Corpus weight for the News dataset | 19
Corpus weight for the MKB dataset | 3
Source and target sequence length | 1600
Save checkpoint after steps | 500
Number of training steps | 100000
Number of validation steps | 5000
Training batch size | 4096
Dev (validation) batch size | 16
Optimizer | Adam
Number of encoder/decoder layers | 6 (each)
Number of attention heads | 8

Table 3: Training configuration for the Transformer based Encoder-Decoder model (Primary System).

The corpus weights help assign varied importance to the particular datasets used in this task. The values for these weights were determined after visual analysis of the dev (validation) dataset, which indicated that the dev dataset's contents had a greater overlap with the PMI, News and MKB (Mann ki Baat, which roughly translates to "from the heart") datasets, in that particular order. The training time for the entire model was 18 hours.
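A configuration along the lines of Table 3 can be expressed as an OpenNMT-py YAML file. The sketch below writes such a file from Python; it is a hedged reconstruction based on Table 3 and common OpenNMT-py 2.x option names, not necessarily the exact configuration used for the submission. All file paths, the vocabulary file name and any option not listed in Table 3 are assumptions, and unspecified hyperparameters fall back to the toolkit defaults, as described in the text.

```python
# Hedged sketch: an OpenNMT-py (2.x) style configuration mirroring Table 3.
# Paths are placeholders; options not listed in Table 3 are assumptions.
import yaml  # pip install pyyaml

config = {
    "save_data": "run/ta_te",
    "src_vocab": "run/ta_te.vocab",
    "tgt_vocab": "run/ta_te.vocab",
    "share_vocab": True,  # single shared TA-TE vocabulary (194860 tokens)
    # Each corpus is assumed to already contain both translation directions
    # (reversed TA-TE pairs appended, as described above).
    "data": {
        "pmi":  {"path_src": "train.pmi.src",  "path_tgt": "train.pmi.tgt",  "weight": 23},
        "news": {"path_src": "train.news.src", "path_tgt": "train.news.tgt", "weight": 19},
        "mkb":  {"path_src": "train.mkb.src",  "path_tgt": "train.mkb.tgt",  "weight": 3},
        "valid": {"path_src": "dev.src", "path_tgt": "dev.tgt"},
    },
    "transforms": ["filtertoolong"],  # enforce the sequence-length limit below
    "src_seq_length": 1600,
    "tgt_seq_length": 1600,
    "save_model": "run/ta_te_transformer",
    "save_checkpoint_steps": 500,
    "train_steps": 100000,
    "valid_steps": 5000,
    "batch_size": 4096,
    "valid_batch_size": 16,
    "optim": "adam",
    "encoder_type": "transformer",
    "decoder_type": "transformer",
    "enc_layers": 6,
    "dec_layers": 6,
    "heads": 8,
    "position_encoding": True,
    "world_size": 4,            # 4 V100 GPUs on the DGX station
    "gpu_ranks": [0, 1, 2, 3],
}

with open("ta_te.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# The usual OpenNMT-py workflow would then be, for example:
#   onmt_build_vocab -config ta_te.yaml -n_sample -1
#   onmt_train -config ta_te.yaml
#   onmt_translate -model run/ta_te_transformer_step_100000.pt -src test.src -output pred.txt
```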
The second NMT system consists of two unidirectional translation models with the same configuration as the first system, with the addition of Byte Pair Encoding (BPE) for subwords using the pre-trained MultiBPEmb model (Heinzerling and Strube, 2018). The intuition behind using BPE was to reduce the vocabulary size through subword tokenization. The choice of the pre-trained BPE model was based on the relevance of the content used for BPE model training, the languages supported and the size of the vocabulary. Heinzerling and Strube (2018) describe a MultiBPEmb model with a 100000-token vocabulary, which was deemed suitable for this task as it supports Tamil and Telugu, was trained on WikiNews and could use a single vocabulary like the first NMT system used in this work. During training it was found that the translations for the Dev set could not distinguish between Tamil and Telugu subwords correctly, due to the failure in vocabulary matching for the candidates used in the evaluation and possibly due to the vocabulary shared between the languages. Hence, this system was trained twice, generating two unidirectional models for TA-TE and TE-TA translations. The training time for each model was 5 hours, which is less than that of the primary system due to the number of samples used (the primary system uses double the number of samples) and the vocabulary size (the contrastive system has a smaller, fixed vocabulary, as a pre-trained BPE model has been used).

5 Results

The evaluation metrics used to evaluate the systems in this task are the BiLingual Evaluation Understudy (BLEU) score as described by Papineni et al. (2002), the Rank-based Intuitive Bilingual Evaluation Score (RIBES) as described by Isozaki et al. (2010), and the Translation Error Rate (TER) as described by Snover et al. (2006).

Corpus level metrics for the dev dataset were computed using the VizSeq Python library, which is an implementation of several metrics described by Wang et al. (2019). The metrics for the dev dataset are listed in Table 4.

System Name | Source Language | Target Language | BLEU | RIBES | TER
Primary System (Transformer based) | TA | TE | 4.321 | 7.4 | 99.1
Contrastive System (Transformer based + BPE subword) | TA | TE | 0.003 | 0.0 | 130.6
Primary System (Transformer based) | TE | TA | 3.908 | 9.0 | 98.7
Contrastive System (Transformer based + BPE subword) | TE | TA | 0.029 | 3.0 | 105.0

Table 4: Dev dataset BLEU, RIBES and TER corpus level scores using the VizSeq library.

Based on the evaluation metrics of the dev (validation) dataset translations for both systems evaluated in this work, the first system, i.e. the vanilla Transformer model, has been submitted as the Primary system. Since there were no improvements in the metrics (the reason for this can be seen in section 6) during training of the second system, which consists of the Transformer model along with the MultiBPEmb model for subword tokenization, the second system has been submitted as a contrastive system.

Table 5 lists the evaluation metrics applied to the test dataset and the BLEU based system rank in the shared task provided by the evaluation committee (https://mzampieri.com/workshops/wmt/2021/TA_TE.pdf and https://mzampieri.com/workshops/wmt/2021/TE_TA.pdf).

System Name | Source Language | Target Language | BLEU | RIBES | TER | System Rank
Primary System | TA | TE | 6.09 | 17.03 | - | 1
Contrastive System | TA | TE | 0.00 | 0.03 | - | 9
Primary System | TE | TA | 6.55 | 19.61 | 98.356 | 4
Contrastive System | TE | TA | 0.04 | 1.00 | - | 9

Table 5: Test dataset BLEU, RIBES and TER scores and BLEU based system rank in the shared task. TER values for the test set translations are marked "-" when they exceed 100.0.
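The corpus level scores in Tables 4 and 5 were computed with VizSeq (Wang et al., 2019). As a hedged illustration of corpus level scoring with a different, widely used library, the sketch below computes BLEU and TER with sacrebleu; RIBES is not provided by sacrebleu, and the resulting numbers would not be expected to match the VizSeq values exactly. The file names are placeholders.

```python
# Hedged sketch: corpus-level BLEU and TER with sacrebleu (not the VizSeq
# library used in the paper). File names are illustrative placeholders.
import sacrebleu

with open("dev.ref.te", encoding="utf-8") as f:
    references = [line.strip() for line in f]
with open("dev.hyp.te", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # BLEU (Papineni et al., 2002)
ter = sacrebleu.corpus_ter(hypotheses, [references])    # TER (Snover et al., 2006)
print(f"BLEU = {bleu.score:.3f}, TER = {ter.score:.3f}")
```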
6 Conclusion and Future Work

The analysis of the evaluation metrics from section 5 on the dev dataset indicates that the primary system, which is a Transformer based Encoder-Decoder model,

Proceedings of the Sixth Conference on Machine Translation (WMT), pages 299–303, November 10–11, 2021. ©2021 Association for Computational Linguistics.