NITK-UoH: Tamil-Telugu Machine Translation Systems for the WMT21 Similar Language Translation Task

Richard Saldanha, Ananthanarayana V. S. and Anand Kumar M
Department of Information Technology, National Institute of Technology Karnataka
NH 66, Srinivasnagar, Surathkal, Mangalore, Karnataka 575025, India
richardsaldanha.207it005@nitk.edu.in, anvs@nitk.edu.in, m_anandkumar@nitk.edu.in

Parameswari Krishnamurthy
Centre for Applied Linguistics and Translation Studies, University of Hyderabad
Prof. CR Rao Road, Gachibowli, Hyderabad, Telangana 500046, India
pksh@uohyd.ac.in

Abstract

In this work, two Neural Machine Translation (NMT) systems have been developed and evaluated as part of the bidirectional Tamil-Telugu similar language translation subtask in WMT21. The OpenNMT-py toolkit has been used to create quick prototypes of the systems, following which models have been trained on the training datasets containing the parallel corpus, and finally the models have been evaluated on the dev datasets provided as part of the task. Both systems have been trained on a DGX station with 4 V100 GPUs. The first NMT system in this work is a Transformer based 6 layer encoder-decoder model, trained for 100000 training steps, whose configuration is similar to the one provided by OpenNMT-py, and this is used to create a model for bidirectional translation. The second NMT system contains two unidirectional translation models with the same configuration as the first system, with the addition of utilizing Byte Pair Encoding (BPE) for subword tokenization through the pre-trained MultiBPEmb model. Based on the dev dataset evaluation metrics for both systems, the first system, i.e. the vanilla Transformer model, has been submitted as the Primary system. Since there were no improvements in the metrics during training of the second system with BPE, it has been submitted as a contrastive system.

1 Introduction

Tamil is a language predominantly spoken in Tamil Nadu, a state in Southern India, along with countries with a large Tamil speaking diaspora such as Sri Lanka, Malaysia and Singapore, to name a few. Telugu, on the other hand, is the official language of two Southern states in India, namely Andhra Pradesh and Telangana. It is also spoken among the Telugu speaking immigrant population in the USA, Canada and the UK. Both languages belong to the Dravidian family of languages, which comprises Tamil, Telugu, Kannada and Malayalam as the major languages spoken in South India. Despite belonging to the same family of languages, there are many differences between Tamil and Telugu, such as the script used for writing and linguistic differences in terms of phonology, morphology and syntax, among others. Tamil belongs to the Southern branch of Dravidian languages and has a rich literary tradition spanning more than 2000 years. Telugu, on the other hand, belongs to the South Central branch of Dravidian languages and has a considerable number of linguistic characteristics that differ from Tamil, as described by Krishnamurthy (2019).

As part of the similar language translation subtask for Dravidian languages, namely Tamil (TA) and Telugu (TE), we have attempted to build Neural Machine Translation (NMT) models using the OpenNMT-py toolkit (https://opennmt.net/OpenNMT-py/main.html), which helps to generate quick prototypes of NMT models with the desired configurations. The first NMT system (submitted as the primary system) in this work is a Transformer based 6 layer encoder-decoder model which provides a single model for bidirectional translation between Tamil and Telugu using the datasets provided for this shared task. The second NMT system (submitted as the contrastive system) consists of two unidirectional translation models with the same configuration as the first system, but with the addition of utilizing Byte Pair Encoding (BPE) for subword tokenization using the pre-trained MultiBPEmb model (Heinzerling and Strube, 2018).

The rest of the work is described in sections that cover the related work, data, system description, results and conclusion.
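As an illustration of the kind of subword tokenization used by the contrastive system, the sketch below loads a pre-trained MultiBPEmb model through the bpemb Python package (Heinzerling and Strube, 2018) and segments one Tamil and one Telugu word. This is a hedged example rather than the exact preprocessing used for the submitted system: the 100000-token multilingual vocabulary and the embedding dimensionality are assumptions based on the description later in this paper, and the example words are arbitrary.

```python
# Minimal, hedged sketch: subword tokenization with a pre-trained MultiBPEmb
# model via the bpemb package (pip install bpemb). The vocabulary size and
# dimensionality below are assumptions; the submission's preprocessing may differ.
from bpemb import BPEmb

# lang="multi" selects MultiBPEmb, a single BPE model shared across many languages.
multi_bpe = BPEmb(lang="multi", vs=100000, dim=300)

for word in ["வணக்கம்", "నమస్కారం"]:  # Tamil "vanakkam", Telugu "namaskaram"
    subwords = multi_bpe.encode(word)   # list of subword strings
    ids = multi_bpe.encode_ids(word)    # corresponding vocabulary ids
    print(word, subwords, ids)
```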
2 Rationale for Selecting the Models and Related Work

There has been a significant amount of work done on developing machine translation systems for Indian languages, with some notable examples for Dravidian languages such as Tamil and Malayalam described in Kumar et al. (2019). This shared task provides a unique challenge in terms of the constraint on the parallel aligned language pair data made available for training. The other challenges include the linguistically rich and domain specific content present in the Prime Minister of India (PMI) and the Mann ki Baat (MKB) datasets, where topics related to India's domestic and foreign policy issues can be found.

In order to address the challenge of lengthy input (samples containing more than 300 space delimited tokens), the Transformer model described by Vaswani et al. (2017) was adopted. This model provides the multi-head attention mechanism, which helps retain context for longer sentence samples. To reduce the vocabulary, reduce the training time and possibly improve the translation quality (through subword tokenization), a MultiBPEmb model trained with a vocabulary of 100000 tokens from 275 languages has been utilised (Heinzerling and Strube, 2018).

Other methods to improve translation quality that have not been explored as part of this work are the use of back translation using a monolingual corpus or corpora, on the lines of the one described by Sennrich et al. (2016). Factored NMT (which uses data tagged on the basis of morphology and Parts of Speech (POS)), such as the one described by García-Martínez et al. (2016), is another possible candidate suitable for the kind of challenge provided by the similar language translation task, as the use of POS and morphological information can reduce the number of tokens and make the models more generalizable in terms of predictions.

3 Data

The datasets used in the NMT systems for this work are the parallel aligned Tamil and Telugu (TA-TE) language pairs provided as part of the Dravidian Language subtask of the Similar Language Translation shared task (https://wmt21similar.cs.upc.edu/). Some statistics about the datasets are outlined in Tables 1 and 2.

Dataset Type | Dataset Name | Number of samples
Parallel aligned TA-TE pairs (Training) | PMIndia | 26009
Parallel aligned TA-TE pairs (Training) | News | 11038
Parallel aligned TA-TE pairs (Training) | MKB | 3100
Parallel aligned TA-TE pairs (Dev) | Dev | 1261
Non-aligned TA-TE sets (Test) | Test | 1735 (per language set)

Table 1: Dataset statistics for parallel aligned Tamil-Telugu pairs used as train and dev (validation) datasets, along with non-aligned samples used as the test set.

Dataset Type | Dataset Name | Language | Longest Line Length
Training | PMIndia | TA | 659
Training | News | TA | 1524
Training | MKB | TA | 412
Dev | Dev | TA | 923
Test | Test | TA | 1544
Training | PMIndia | TE | 718
Training | News | TE | 1356
Training | MKB | TE | 376
Dev | Dev | TE | 1004
Test | Test | TE | 757

Table 2: Dataset statistics for the longest line (in space delimited tokens) per dataset and language.
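Statistics of the kind reported in Tables 1 and 2 can be gathered with a short script. The sketch below is a hedged illustration of how per-corpus sample counts and longest-line lengths (in space delimited tokens) could be computed, assuming each corpus is stored as plain-text files with one sentence per line; the file names are placeholders, not the names distributed by the shared task.

```python
# Hedged sketch: per-corpus sample counts and longest line length in
# space-delimited tokens (the statistics reported in Tables 1 and 2).
# File names are illustrative placeholders.
from pathlib import Path

corpora = {
    "PMIndia": ("pmindia.ta", "pmindia.te"),
    "News": ("news.ta", "news.te"),
    "MKB": ("mkb.ta", "mkb.te"),
}

for name, (ta_file, te_file) in corpora.items():
    for lang, path in (("TA", ta_file), ("TE", te_file)):
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        longest = max(len(line.split()) for line in lines)  # space-delimited tokens
        print(f"{name} {lang}: {len(lines)} samples, longest line = {longest} tokens")
```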
3.1 Dataset preprocessing

Due to the moderate size of the training dataset, which contains 40147 samples, and the topic overlap of sentence samples between the training and dev datasets, as well as the test set (to a certain extent), on topics such as the Indian Prime Minister's statements on domestic issues and foreign policies in the PM India dataset, the entire training dataset has been utilized in its original form. The lengthwise statistics of the dataset (in terms of space delimited tokens) given in Table 2 were taken as the deciding factor in fixing the maximum input length at 1600 for the NMT systems developed. Tokenization for the primary system was done on space delimited tokens, which yielded a shared Tamil-Telugu vocabulary of 194860 tokens. Using the MultiBPEmb model for subword tokenization, on the other hand, gave a vocabulary of 14056 tokens for Tamil (TA) and 13170 tokens for Telugu (TE), which included some English words as well.

4 System Description

As mentioned in section 1, the PyTorch based toolkit OpenNMT-py has been used to create rapid prototypes of NMT models (the motivation for this can be seen in section 2), which have then been trained on the datasets provided, validated against the provided dev sets, and finally used to produce translations for the test sets described in section 3, which have been submitted to the committee evaluating the Similar Language Translation task.

A DGX station with 4 V100 GPUs has been used to train the models utilized in this task. A Transformer based 6 layer encoder-decoder model along the lines of the NMT system described by Vaswani et al. (2017) was trained for 100000 training steps as the first NMT system to be evaluated. The configuration for this model is the same as that provided by OpenNMT-py. In order to save time, a single bidirectional translation model for the TA-TE language pair has been created, which can translate from Tamil to Telugu and vice versa. The datasets used in this system were doubled in terms of the number of samples when compared to the second NMT system (the contrastive submission), by reversing the position of the TA-TE language pairs and appending them to the original datasets. No special tagging identifiers were used, as the Tamil and Telugu scripts are distinct.

Basic space delimited tokenization was applied to the datasets, which resulted in a combined TA-TE vocabulary of 194860 tokens being generated; the relevant key configuration values for this model are listed in Table 3.

Model Configuration Name | Model Configuration Value
Corpus weight for the PMI dataset | 23
Corpus weight for the News dataset | 19
Corpus weight for the MKB dataset | 3
Source and target sequence length | 1600
Save checkpoint after steps | 500
Number of training steps | 100000
Number of validation steps | 5000
Training batch size | 4096
Dev (validation) batch size | 16
Optimizer | Adam
Number of encoder/decoder layers | 6 (each)
Number of attention heads | 8

Table 3: Training configuration for the Transformer based Encoder-Decoder model (Primary System).

The corpus weights help assign varied importance to the particular datasets used in this task. The values for these weights were determined after visual analysis of the dev (validation) dataset, which indicated that the dev dataset's contents had a greater overlap with the PMI, News and MKB (Mann ki Baat, which roughly translates to "from the heart") datasets, in that particular order. The training time for the entire model was 18 hours.
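A configuration along the lines of Table 3 can be expressed as an OpenNMT-py YAML file. The sketch below writes such a file from Python; it is a hedged reconstruction based on Table 3 and common OpenNMT-py 2.x option names, not necessarily the exact configuration used for the submission. All file paths, the vocabulary file name and any option not listed in Table 3 are assumptions, and unspecified hyperparameters fall back to the toolkit defaults, as described in the text.

```python
# Hedged sketch: an OpenNMT-py (2.x) style configuration mirroring Table 3.
# Paths are placeholders; options not listed in Table 3 are assumptions.
import yaml  # pip install pyyaml

config = {
    "save_data": "run/ta_te",
    "src_vocab": "run/ta_te.vocab",
    "tgt_vocab": "run/ta_te.vocab",
    "share_vocab": True,  # single shared TA-TE vocabulary (194860 tokens)
    # Each corpus is assumed to already contain both translation directions
    # (reversed TA-TE pairs appended, as described above).
    "data": {
        "pmi":  {"path_src": "train.pmi.src",  "path_tgt": "train.pmi.tgt",  "weight": 23},
        "news": {"path_src": "train.news.src", "path_tgt": "train.news.tgt", "weight": 19},
        "mkb":  {"path_src": "train.mkb.src",  "path_tgt": "train.mkb.tgt",  "weight": 3},
        "valid": {"path_src": "dev.src", "path_tgt": "dev.tgt"},
    },
    "transforms": ["filtertoolong"],  # enforce the sequence-length limit below
    "src_seq_length": 1600,
    "tgt_seq_length": 1600,
    "save_model": "run/ta_te_transformer",
    "save_checkpoint_steps": 500,
    "train_steps": 100000,
    "valid_steps": 5000,
    "batch_size": 4096,
    "valid_batch_size": 16,
    "optim": "adam",
    "encoder_type": "transformer",
    "decoder_type": "transformer",
    "enc_layers": 6,
    "dec_layers": 6,
    "heads": 8,
    "position_encoding": True,
    "world_size": 4,            # 4 V100 GPUs on the DGX station
    "gpu_ranks": [0, 1, 2, 3],
}

with open("ta_te.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# The usual OpenNMT-py workflow would then be, for example:
#   onmt_build_vocab -config ta_te.yaml -n_sample -1
#   onmt_train -config ta_te.yaml
#   onmt_translate -model run/ta_te_transformer_step_100000.pt -src test.src -output pred.txt
```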
The second NMT system consists of two unidirectional translation models with the same configuration as the first system, with the addition of Byte Pair Encoding (BPE) for subwords using the pre-trained MultiBPEmb model (Heinzerling and Strube, 2018). The intuition behind using BPE was to reduce the vocabulary size through subword tokenization. The choice of the pre-trained BPE model was based on the relevance of the content used for BPE model training, the languages supported and the size of the vocabulary. Heinzerling and Strube (2018) describe a MultiBPEmb model with a 100000-token vocabulary, which was deemed suitable for this task as it supports Tamil and Telugu, was trained on WikiNews and could use a single vocabulary like the first NMT system used in this work. During training it was found that the translations for the Dev set could not distinguish between Tamil and Telugu subwords correctly, due to the failure in vocabulary matching for the candidates used in the evaluation and possibly due to the vocabulary shared between the languages. Hence, this system was trained twice, generating two unidirectional models for TA-TE and TE-TA translations. The training time for each model was 5 hours, which is less than that of the primary system due to the number of samples used (the primary system uses double the number of samples) and the vocabulary size (the contrastive system has a smaller, fixed vocabulary, as a pre-trained BPE model has been used).

5 Results

The evaluation metrics used to evaluate the systems in this task are the BiLingual Evaluation Understudy (BLEU) score as described by Papineni et al. (2002), the Rank-based Intuitive Bilingual Evaluation Score (RIBES) as described by Isozaki et al. (2010), and the Translation Error Rate (TER) as described by Snover et al. (2006).

Corpus level metrics for the dev dataset were computed using the VizSeq Python library, which is an implementation of several metrics described by Wang et al. (2019). The metrics for the dev dataset are listed in Table 4.

System Name | Source Language | Target Language | BLEU | RIBES | TER
Primary System (Transformer based) | TA | TE | 4.321 | 7.4 | 99.1
Contrastive System (Transformer based + BPE subword) | TA | TE | 0.003 | 0.0 | 130.6
Primary System (Transformer based) | TE | TA | 3.908 | 9.0 | 98.7
Contrastive System (Transformer based + BPE subword) | TE | TA | 0.029 | 3.0 | 105.0

Table 4: Dev dataset BLEU, RIBES and TER corpus level scores using the VizSeq library.

Based on the evaluation metrics of the dev (validation) dataset translations for both systems evaluated in this work, the first system, i.e. the vanilla Transformer model, has been submitted as the Primary system. Since there were no improvements in the metrics (the reason for this can be seen in section 6) during training of the second system, which consists of the Transformer model along with the MultiBPEmb model for subword tokenization, the second system has been submitted as a contrastive system.

Table 5 lists the evaluation metrics applied to the test dataset and the BLEU based system rank in the shared task provided by the evaluation committee (https://mzampieri.com/workshops/wmt/2021/TA_TE.pdf and https://mzampieri.com/workshops/wmt/2021/TE_TA.pdf).

System Name | Source Language | Target Language | BLEU | RIBES | TER | System Rank
Primary System | TA | TE | 6.09 | 17.03 | - | 1
Contrastive System | TA | TE | 0.00 | 0.03 | - | 9
Primary System | TE | TA | 6.55 | 19.61 | 98.356 | 4
Contrastive System | TE | TA | 0.04 | 1.00 | - | 9

Table 5: Test dataset BLEU, RIBES and TER scores and BLEU based system rank in the shared task. TER values for the test set translations are marked "-" when they exceed 100.0.
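The corpus level scores in Tables 4 and 5 were computed with VizSeq (Wang et al., 2019). As a hedged illustration of corpus level scoring with a different, widely used library, the sketch below computes BLEU and TER with sacrebleu; RIBES is not provided by sacrebleu, and the resulting numbers would not be expected to match the VizSeq values exactly. The file names are placeholders.

```python
# Hedged sketch: corpus-level BLEU and TER with sacrebleu (not the VizSeq
# library used in the paper). File names are illustrative placeholders.
import sacrebleu

with open("dev.ref.te", encoding="utf-8") as f:
    references = [line.strip() for line in f]
with open("dev.hyp.te", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # BLEU (Papineni et al., 2002)
ter = sacrebleu.corpus_ter(hypotheses, [references])    # TER (Snover et al., 2006)
print(f"BLEU = {bleu.score:.3f}, TER = {ter.score:.3f}")
```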
6 Conclusion and Future Work

The analysis of the evaluation metrics from section 5 on the dev dataset indicates that the primary system, which is a Transformer based Encoder-Decoder model,

Proceedings of the Sixth Conference on Machine Translation (WMT), pages 299–303, November 10–11, 2021. ©2021 Association for Computational Linguistics.