Punjabi to Urdu Machine Translation System

Nitin Bansal¹, Ajit Kumar²
¹ Department of Computer Science, Punjabi University, Patiala, India
² Associate Professor, Multani Mal Modi College, Patiala, India
E-mail: ¹ profnitinbansal@gmail.com, ² ajit8671@gmail.com

Proceedings of the 17th International Conference on Natural Language Processing: System Demonstrations, pages 32–34 Patna, India, December 18 - 21, 2020. ©2019 NLP Association of India (NLPAI)

Abstract

Development of a Machine Translation System (MTS) for any language pair is a challenging task for several reasons. Lack of lexical resources for a language is one of the major issues that arise while developing an MTS for that language. For example, during the development of the Punjabi to Urdu MTS, many issues were recognized while preparing lexical resources for both languages. No machine-readable Punjabi to Urdu dictionary is available that can be used directly for translation, although various dictionaries are available to explain the meaning of words. Along with this, handling of out-of-vocabulary (OOV) words, handling of Punjabi words with multiple senses in Urdu, identification of proper nouns, and identification of collocations in the source sentences (Punjabi sentences in our case) are the issues we are facing during development of this system. MTSs have been in great demand over the last decade and are widely used in applications such as smartphones. Therefore, the development of such a system is more in demand and needs to be more user friendly. Their usage is mainly in large-scale and automated translation, where they act as an instrument to bridge the digital divide.

1 Introduction

Due to the many regional languages in India, machine translation in India has enormous scope. Human and machine translation each have their share of challenges. Scientifically and philosophically, machine translation results can be applied to various areas such as artificial intelligence, linguistics, and the philosophy of language. Various approaches are used in machine translation to make communication possible between two languages. These approaches can be rule-based, corpus-based, hybrid or neural-based. Here, the hybrid approach is mainly a combination of two approaches, i.e. rule-based and corpus-based. The quality of machine translation systems is mainly measured using Bilingual Evaluation Understudy (BLEU), which produces a score between 0 and 1.

Among the various regional languages in India, we have chosen Punjabi and Urdu for developing the Punjabi to Urdu Machine Translation System (PUMTS). Punjabi is the mother tongue of our state, Punjab, where it is used as an official language in government offices. Urdu was also used as an official language in Punjab before independence. Thus, PUMTS helps to make Punjabi understandable to Urdu communities who still want to be in touch with the earlier Punjab. In India, these two languages are regarded as resource-poor, because no parallel corpus is available for this language pair. Developing a parallel corpus for this language pair therefore became a challenging task for us. Further, this work also describes the types of MTSs being developed from Indian and non-Indian perspectives.

2 Methodology

An introduction to the Punjabi and Urdu languages helps in understanding the history and close proximity of this language pair. The word order of the two languages is the same, but the writing direction differs: Punjabi is written from left to right and Urdu from right to left. The mapping between the characters of the two languages was also studied during the development of PUMTS. The implementation of our methodology, including the architecture followed during development, has been documented. We proposed three approaches to develop a bilingual parallel corpus for Punjabi and Urdu. The BLEU scores favoured one final approach for corpus development, which results in higher accuracy. All the algorithms developed for PUMTS followed this final corpus approach. Lastly, a Punjabi to Urdu machine transliteration system to handle out-of-vocabulary (OOV) words has also been designed and developed, and it is currently available as a web-based system. This system was designed in two phases: first on a web-based platform using ASP.Net, and second for PUMTS, to handle OOV words during machine translation on the MOSES platform.

Chart 1: Phase-wise improvement in BLEU score for PUMTS

3 Results and Discussion

Results were evaluated starting from 10,000 parallel sentences and going up to 1 lakh (1,00,000) parallel sentences, after including the pre-processing and post-processing modules. The results have been compared with Google Translate so that the accuracy remains comparable and the required improvements can be included in PUMTS.

Human evaluation has also been conducted, with evaluators who know both languages well. Accuracy has been tested using the standard automated metrics BLEU and NIST, on PUMTS and Google Translate. The data domains covered during the development of the parallel corpus are politics, sports, health, tourism, entertainment, books & magazines, education, arts & culture, religion, and literature.

Human evaluation is still considered the most reliable and efficient method of testing a system's accuracy. However, it is impracticable in today's circumstances. Thus, we have used automatic evaluation with BLEU and NIST to quickly and inexpensively evaluate the impact of new ideas, algorithms, and data sets. During the evaluation of PUMTS, a sufficiently large bilingual parallel corpus for the Punjabi-Urdu language pair (more than 1 lakh parallel sentences) has been used on MOSES, and standard automated metric scores have been generated. Various methods were applied to increase the system's accuracy; for example, the order of the languages was changed during testing to analyze which one gives better results. Moreover, the PUMTS output has also been checked against the Google Translate output, and we have found that our system performs better than Google Translate, with an accuracy of about 82%. The following chart representation gives an idea of where PUMTS generates better results domain-wise.

As shown in Chart 1, the development of PUMTS started from 10,000 parallel sentences, and the MOSES system was set up to regularly test the accuracy on this data. Phase-wise testing and recording of BLEU and NIST scores has therefore been performed. The second phase was tested on 50,000 sentences, and the final evaluation was then performed on more than 1,00,000 sentences. The chart shows a sharp increase in accuracy when the number of sentences was increased from 10,000 to 50,000. It has also been observed that the increase in size from 50,000 to 1,00,000 sentences yields accuracy gains at a slower rate, which is due to the handling of OOV words, and that the larger corpus also gives more chances of producing meaningful sentences.
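To make the BLEU scores discussed above more concrete, the following is a minimal Python sketch of how a corpus-level BLEU score between 0 and 1 can be computed from clipped n-gram precisions and a brevity penalty. It is an illustration only, not the scoring script used with MOSES for PUMTS: the function names `ngrams` and `corpus_bleu`, the single reference per sentence, the pre-tokenised input and the absence of smoothing are all assumptions made for this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis (hypothetical helper).

    references, hypotheses: equally long lists of token lists.
    Returns a score in [0, 1]; no smoothing is applied.
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    ref_len = hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref, n)
            hyp_counts = ngrams(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a zero precision at any order gives BLEU = 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalise output that is shorter than the reference.
    if hyp_len > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Placeholder tokens stand in for tokenised Urdu output and reference.
    reference = [["a", "b", "c", "d", "e", "f"]]
    system_output = [["a", "b", "c", "d", "e", "x"]]
    print(corpus_bleu(reference, system_output))
```

A system output identical to its reference scores 1, and an output that shares no 4-gram with the reference scores 0 under this unsmoothed variant. This is one reason why scores are reported over a whole test corpus rather than per sentence.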