Punjabi to Urdu Machine Translation System

Nitin Bansal¹, Ajit Kumar²
¹Department of Computer Science, Punjabi University, Patiala, India
²Associate Professor, Multani Mal Modi College, Patiala, India
E-mail: ¹profnitinbansal@gmail.com, ²ajit8671@gmail.com
Proceedings of the 17th International Conference on Natural Language Processing: System Demonstrations, pages 32–34, Patna, India, December 18 - 21, 2020. ©2019 NLP Association of India (NLPAI)

Abstract

Developing a Machine Translation System (MTS) for any language pair is a challenging task for several reasons. A lack of lexical resources is one of the major issues that arise while developing an MTS for a language. For example, during the development of the Punjabi to Urdu MTS, many issues were recognized while preparing lexical resources for both languages. No machine-readable Punjabi to Urdu dictionary is available that can be used directly for translation, although various dictionaries are available that explain the meanings of words. In addition, handling out-of-vocabulary (OOV) words, handling Punjabi words that have multiple senses in Urdu, identifying proper nouns, and identifying collocations in the source sentences (Punjabi sentences in our case) are the issues we faced during the development of this system. MTSs have been in great demand over the last decade and are widely used in applications such as those on smartphones. The development of such systems has therefore become more demanding, with a greater emphasis on user friendliness. MTSs are used mainly for large-scale and automated translation, and they act as an instrument to bridge the digital divide.

1 Introduction

Because India has many regional languages, machine translation in India has enormous scope. Human and machine translation each have their share of challenges. Scientifically and philosophically, machine translation results can be applied to areas such as artificial intelligence, linguistics, and the philosophy of language. Various approaches are used in machine translation to make communication possible between two languages. These approaches can be rule-based, corpus-based, hybrid, or neural. Here, the hybrid approach is mainly a combination of two approaches, rule-based and corpus-based. The quality of machine translation systems is measured mainly using Bilingual Evaluation Understudy (BLEU), which produces a score between 0 and 1.

Among the various regional languages of India, we have chosen Punjabi and Urdu to develop the Punjabi to Urdu Machine Translation System (PUMTS). Punjabi is the mother tongue of our state, Punjab, where it is used as an official language in government offices. Urdu was also used as an official language in Punjab before independence. PUMTS thus helps make Punjabi understandable to Urdu-speaking communities who still want to stay in touch with the earlier Punjab. In India, these two languages are treated as resource-poor because no parallel corpus is available for this language pair. Developing a parallel corpus for this language pair therefore became a challenging task for us. This work also describes the types of MTSs being developed from Indian and non-Indian perspectives.

2 Methodology

An introduction to the Punjabi and Urdu languages helps in understanding the history of, and the close proximity between, this language pair. The two languages share the same word order but differ in writing direction: Punjabi is written from left to right and Urdu from right to left. The mapping between the characters of the two languages was also studied during the development of PUMTS. The implementation of our methodology, including the architecture followed during the development of PUMTS, has been documented. We proposed three approaches to developing a bilingual parallel corpus for Punjabi and Urdu. The BLEU scores showed that one of them, the final approach, gives higher accuracy, and all the algorithms developed for PUMTS followed this final corpus approach. Lastly, a Punjabi to Urdu machine transliteration system to handle out-of-vocabulary (OOV) words has also been designed and developed, and it is currently available as a web-based system. This system was designed in two phases: first on a web-based platform using ASP.NET, and second for PUMTS, to handle OOV words during machine translation on the MOSES platform.
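
The idea of falling back to transliteration for OOV words can be sketched as follows. This is a minimal illustration only, not the PUMTS implementation: the one-entry lexicon, the seven-character map, and the function names `transliterate`, `translate_token`, and `translate_sentence` are all hypothetical, and a real Gurmukhi to Urdu transliterator would need context-sensitive rules rather than a one-to-one character table.

```python
# Toy lexicon standing in for a learned translation table (hypothetical entry):
# Punjabi word for "water" in Gurmukhi -> Urdu word for "water".
LEXICON = {
    "\u0a2a\u0a3e\u0a23\u0a40": "\u067e\u0627\u0646\u06cc",
}

# Illustrative Gurmukhi -> Urdu character map (an assumption, not the authors' table).
CHAR_MAP = {
    "\u0a15": "\u06a9",  # GURMUKHI LETTER KA -> ARABIC LETTER KEHEH
    "\u0a28": "\u0646",  # GURMUKHI LETTER NA -> ARABIC LETTER NOON
    "\u0a2c": "\u0628",  # GURMUKHI LETTER BA -> ARABIC LETTER BEH
    "\u0a2e": "\u0645",  # GURMUKHI LETTER MA -> ARABIC LETTER MEEM
    "\u0a30": "\u0631",  # GURMUKHI LETTER RA -> ARABIC LETTER REH
    "\u0a32": "\u0644",  # GURMUKHI LETTER LA -> ARABIC LETTER LAM
    "\u0a3e": "\u0627",  # GURMUKHI VOWEL SIGN AA -> ARABIC LETTER ALEF
}


def transliterate(word):
    """Map each Gurmukhi code point to an Urdu one; unmapped characters pass through."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in word)


def translate_token(token):
    """Use the lexicon when the token is known, otherwise transliterate it."""
    if token in LEXICON:
        return LEXICON[token]
    return transliterate(token)


def translate_sentence(sentence):
    """Translate token by token; word order is kept because the pair shares it."""
    return " ".join(translate_token(t) for t in sentence.split())
```

For example, the Punjabi word for "name" (NA, vowel sign AA, MA) is absent from the toy lexicon, so `translate_token` transliterates it character by character into NOON, ALEF, MEEM. Both scripts are stored in logical order in Unicode, so the right-to-left direction of Urdu is handled when the text is displayed and needs no string reversal here.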

Chart 1: Phase-wise improvement in BLEU score for PUMTS

3 Results and Discussion

Results were evaluated on corpora ranging from 10,000 to 1 lakh (100,000) parallel sentences, after including the pre-processing and post-processing modules. The results were compared with Google Translate so that the accuracy remains comparable and the required improvements can be incorporated into PUMTS.

Human evaluation was also conducted, with evaluators who are well versed in both languages. Accuracy was tested on both PUMTS and Google Translate using the standard automated metrics BLEU and NIST. The domains covered during the development of the parallel corpus are politics, sports, health, tourism, entertainment, books & magazines, education, arts & culture, religion, and literature.

Human evaluation is still considered the most reliable and efficient method of testing a system's accuracy; however, it is impracticable in today's circumstances. We therefore used automatic evaluation with BLEU and NIST to quickly and inexpensively evaluate the impact of new ideas, algorithms, and data sets. During the evaluation of PUMTS, a sufficiently large bilingual parallel corpus for the Punjabi-Urdu language pair (more than 1 lakh parallel sentences) was used on MOSES, and standard automated metric scores were generated. Various methods were applied to increase the system's accuracy; for example, the order of the languages was changed during testing to analyze which direction gives better results. Moreover, the PUMTS output was also checked against the Google Translate output, and we found that our system performs better than Google Translate, with an accuracy of about 82%. The chart helps to give an idea of where PUMTS generates better results domain-wise.

As shown in Chart 1, the development of PUMTS started from 10,000 parallel sentences, and the MOSES system was set up to test the accuracy on this data regularly. Testing was therefore performed phase by phase, with BLEU and NIST scores recorded at each phase. The second phase was tested on 50,000 sentences, and the final evaluation was performed on more than 100,000 sentences. Chart 1 shows a sharp increase in accuracy when the number of sentences was increased from 10,000 to 50,000. The increase from 50,000 to 100,000 sentences improved accuracy at a slower rate; this gain is attributed to the handling of OOV words, and the larger corpus also gives a greater chance of producing meaningful sentences.
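
To make the BLEU metric used in the evaluation concrete, the following is a minimal sketch of a sentence-level BLEU score against a single reference. It is not the scoring tool used for PUMTS: it assumes whitespace tokenization, applies no smoothing, and the function names `ngrams` and `bleu` are chosen here for illustration. It combines clipped n-gram precisions for n = 1 to 4 by geometric mean and applies the brevity penalty, so the result lies between 0 and 1.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference, in the range [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates that are not longer than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

With placeholder tokens, `bleu("a b c d e", "a b c d e")` gives a perfect score, while a candidate that shares no words with the reference scores zero. In practice, scores are usually computed over a whole test corpus rather than sentence by sentence, and reported tools may scale them to a 0 to 100 range.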