International Journal of Advanced Research in Engineering and Technology (IJARET)
Volume 12, Issue 1, January 2021, pp. 753-759, Article ID: IJARET_12_01_068
Available online at http://iaeme.com/Home/issue/IJARET?Volume=12&Issue=1
Journal Impact Factor (2020): 10.9475 (Calculated by GISI) www.jifactor.com
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: 10.34218/IJARET.12.1.2021.068
© IAEME Publication, Scopus Indexed

VITERBI BASED PARTS OF SPEECH TAGGING FOR HINDI AND MARATHI

Vijayshri Khedkar
Research Scholar, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India

Pritesh Shah
Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India

ABSTRACT
Machine translation has expanded immensely in recent years. It can be broken into seven main steps: token generation, morphological analysis, lexeme analysis, part-of-speech (POS) tagging, chunking, parsing, and word disambiguation. Natural Language Processing (NLP) is a promising field of research that enables machines to analyze and process the meaning behind human languages. The aim of our project is to assign a specific grammatical class to each word of an input sequence in Hindi or Marathi. A major part of India's population lives in rural areas, and these people are more comfortable and better acquainted with Hindi and Marathi. Hindi is one of the official languages of India. However, because most of the material available online today is in English, it is difficult for them to understand it. Language translation eases their interaction with online portals and makes it more effective, and NLP plays a key role in it. From speech recognition to sentiment analysis, NLP is the backbone of this interaction. Furthermore, POS tagging is a necessary step in the development of any NLP application.
Tagging for English is already available, so we concentrated on POS tagging of Hindi and Marathi corpora. Although many approaches to POS tagging exist, such as rule-based tagging and lexical analysis, we chose stochastic POS tagging for this project because of its better results in other languages.

Key words: POS tagging, Marathi, Rule-based tagging, Viterbi Algorithm, stochastic taggers.

Cite this Article: Vijayshri Khedkar and Pritesh Shah, Viterbi Based Parts of Speech Tagging for Hindi and Marathi, International Journal of Advanced Research in Engineering and Technology, 12(1), 2021, pp. 753-759.
http://iaeme.com/Home/issue/IJARET?Volume=12&Issue=1

http://iaeme.com/Home/journal/IJARET 753 editor@iaeme.com

1. INTRODUCTION
Natural Language Processing is one of the fields of machine learning. It provides an approach that makes interaction between machines and humans less complicated [2]. Part-of-speech tagging is the process of assigning a specific grammatical class, such as noun, pronoun, conjunction, or preposition, to a word. It is one of the elemental steps in approaching and analyzing a natural language [3]. The task has previously been defined as: “Given a meaningful sequence of words w1...wn, the system has to assign respective POS tags t1...tn to input sequence as the output” [4]. This can be stated mathematically as Equation (1).

POS tagging is a basic tool for linguistic operations on a natural language, such as machine translation, text recognition, and named-entity recognition. Hindi and Marathi are morphologically rich, with many grammatical classes and verb forms [5]. Because of this rich morphology, resolving the ambiguity of tags is an onerous task when working with Hindi [6]. For instance, the term “” may be a conjunction, a quantifier, or an intensifier, depending on how it is used.
The contributions of this project include:
• Splitting sentences into tokens and distributing them.
• Detecting the part of speech of each token.
• Presenting the list of POS tags for the sentence.

This model works on a labeled data set (39588 sentences) and yields a precision of 92.97% and an accuracy of 92.97%.

2. VITERBI ALGORITHM
Consider an input sequence $a_1 \ldots a_n$. The goal is to find

$$\arg\max_{b_1 \ldots b_{n+1}} q(a_1 \ldots a_n, b_1 \ldots b_{n+1}) \tag{2}$$

where the arg max is taken over all sequences $b_1 \ldots b_{n+1}$ such that $b_i \in S$ for $i = 1 \ldots n$, and $b_{n+1} = \text{STOP}$. We assume that $q$ again takes the form of a product of trigram transition probabilities and emission probabilities:

$$q(a_1 \ldots a_n, b_1 \ldots b_{n+1}) = \prod_{i=1}^{n+1} P(b_i \mid b_{i-2}, b_{i-1}) \prod_{i=1}^{n} P(a_i \mid b_i) \tag{3}$$

We have assumed that $b_0 = b_{-1} = *$ and $b_{n+1} = \text{STOP}$.

The main purpose of this algorithm is to discover the optimal sequence of states given a Hidden Markov Model (HMM) and a sequence of observations. In this context, optimal refers to probability: the model deems the sequence with maximum probability optimal. The model uses a set of possible tags S, such as {Verbs, Adjectives, Nouns, Adverbs, Conjunctions, etc.}. Each word in an observation is assigned one of the tags in S [7]. Conceptually, every possible tag sequence receives a probability obtained by multiplying the trigram and emission probabilities along the sequence. The sequence with maximum probability is deemed optimal and is found using a dynamic programming approach [8].

3. PROPOSED METHODOLOGY
Our project includes a Hindi and Marathi part-of-speech tagger that has three fundamental steps. First, the input Hindi or Marathi text is split into sentences. Next, the sentences are tokenized into words. The third step allocates part-of-speech tags to the words of each sentence. The system was evaluated over a data set of 39588 sentences. The data used for training and validation contain 34588 and 5000 sentences respectively.
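The first two steps of the methodology (sentence splitting and tokenization) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `split_sentences` and `tokenize`, the choice of sentence terminators, and the token pattern are all assumptions, since the paper does not specify the splitting rules.

```python
import re

# Assumed sentence terminators: danda, double danda, '.', '?' and '!'.
# Splitting on '.' is naive and also breaks at abbreviations and decimals.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[।॥.?!])\s+")

# Assumed token pattern: a run of digits, a run of Latin or Devanagari
# letters and vowel signs, or any other single non-space character.
# The dandas (U+0964, U+0965) and the Devanagari digits are left out of the
# word class so that they become separate tokens.
_TOKEN = re.compile(r"\d+|[A-Za-z\u0900-\u0963\u0970-\u097F\u200c\u200d]+|\S")


def split_sentences(text):
    """Split raw text into sentences, keeping each terminator attached."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Split one sentence into word, number and punctuation tokens."""
    return _TOKEN.findall(sentence)


if __name__ == "__main__":
    for sentence in split_sentences("राम घर गया। सीता आई।"):
        print(tokenize(sentence))
```

An explicit Devanagari range is used instead of `\w` because Python's `re` module does not treat combining vowel signs as word characters, which would otherwise cut words apart at every dependent vowel.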
Every word in the sentences is annotated with at least one of 24 possible tags. The system has two consecutive phases. In the first phase, it trains the model using labeled words (present in the training dataset). In the second phase, it labels unlabeled words (present in the testing dataset) and delivers a tag sequence $ts_1 \ldots ts_n$ for an input series of words $w_1 \ldots w_n$. The following section details the tagset that we have implemented and the methodology that the system follows.

Figure 1 Proposed System Architecture. Input: Hindi or Marathi sentence text. Components: User Interface, Splitter, Token Generator, Word-to-Tag Mapping, Tag Generator, Trained Corpus, Viterbi Tagger. Output: Hindi or Marathi sentence text tagged with part of speech.

We have built a tagset for the Hindi and Marathi languages that includes 24 part-of-speech tags. The tagset is inspired by research at CDAC, Pune [9]. It also contains tags for numbers in several formats. The entire tagset is given in Table 1.

Table 1 Tags and Description

S.No.  Tag    Description
1      NN     Common Noun
2      PRP    Pronoun
3      NNP    Proper Noun
4      PSP    Postposition
5      JJ     Adjective
6      INTF   Intensifier
7      RP     Particles
8      NEG    Negative Word
9      RB     Adverb
10     QF     Quantifiers
11     DEM    Demonstrative
12     NST    Spatial Noun
13     SYM    Symbol
14     ECH    Echo Words
15     WQ     Question Words
16     QC     Cardinals
17     XC     Compounds
18     CC     Conjuncts
19     QO     Ordinals
20     RDP    Reduplication
21     INJ    Interjection
22     VM     Main Verb
23     VAUX   Verb Auxiliary
24     UNK    Unknown Words

4. EXPERIMENTS AND RESULTS
Various experiments have been performed to test the validity, results and precision of the proposed method.
A few observations of POS tagging from the method under discussion are stated below:

Input:
Output: ['JJ', 'NN', 'INTF', 'JJ', 'NN', 'VAUX', 'CC', 'PRP', 'NN', 'PSP', 'NN', 'RP', 'QC', 'NN', 'VM', 'VAUX']

Input:
Output: ['JJ', 'NN', 'INTF', 'JJ', 'NN', 'VAUX', 'PRP', 'NN', 'PSP', 'NN', 'RP', 'QC', 'NN', 'VM', 'VAUX']

Input: 2011 - 1102
Output: ['NN', 'PSP', 'PSP', 'XC', 'NN', 'PSP', 'NN', 'PSP', 'NNP', 'PSP', 'NNP', 'NN', 'PSP', 'NN', 'PSP', 'NN', 'PSP', 'NN', 'NN', 'VM']

Input: तुलनेत २०११ ा जनगणनेनुसार, भारतातील िबहार राातील लोकाची घनता बत चौरस बकमीवर 1102 लोक होते.
Output: ['NN', 'PSP', 'PSP', 'XC', 'NN', 'PSP', 'NN', 'PSP', 'NNP', 'PSP', 'NNP', 'NN', 'PSP', 'NN', 'PSP', 'NN', 'PSP', 'NN', 'NN', 'VM']

In these examples, the Hindi and Marathi Devanagari texts are marked with their corresponding part-of-speech classes as per Hindi and Marathi grammar. For tagging, the Viterbi algorithm is applied to tag the unknown meaningful sequence of words.
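The decoding step that produces such tag lists can be sketched as a trigram Viterbi search. This is a minimal illustration under stated assumptions, not the authors' code: the function name `viterbi_trigram`, the dictionary layout of the transition and emission tables, and the `floor` fallback for unseen events are all hypothetical, since the paper does not describe its smoothing or its handling of unknown words. The toy probabilities are invented for the demonstration only.

```python
import math

START, STOP = "*", "STOP"


def viterbi_trigram(words, tags, trans, emit, floor=1e-12):
    """Return the most probable tag sequence under a trigram HMM.

    trans[(u, v, w)] holds P(w | u, v) and emit[(word, tag)] holds
    P(word | tag). Missing or zero entries fall back to `floor`, a crude
    smoothing choice assumed here for illustration.
    """
    n = len(words)
    if n == 0:
        return []

    def log_t(u, v, w):
        return math.log(max(trans.get((u, v, w), 0.0), floor))

    def log_e(word, tag):
        return math.log(max(emit.get((word, tag), 0.0), floor))

    def tagset(k):
        # Positions 0 and -1 hold the start symbol; 1..n range over the tags.
        return [START] if k <= 0 else tags

    # pi[(k, u, v)]: best log-probability of any tag sequence for the first
    # k words whose last two tags are u (position k-1) and v (position k).
    pi = {(0, START, START): 0.0}
    back = {}
    for k in range(1, n + 1):
        for u in tagset(k - 1):
            for v in tagset(k):
                best_score, best_w = -math.inf, None
                for w in tagset(k - 2):
                    score = (pi[(k - 1, w, u)] + log_t(w, u, v)
                             + log_e(words[k - 1], v))
                    if score > best_score:
                        best_score, best_w = score, w
                pi[(k, u, v)] = best_score
                back[(k, u, v)] = best_w  # best tag at position k-2

    # Close the sequence with the transition into STOP.
    best_score, best_u, best_v = -math.inf, None, None
    for u in tagset(n - 1):
        for v in tagset(n):
            score = pi[(n, u, v)] + log_t(u, v, STOP)
            if score > best_score:
                best_score, best_u, best_v = score, u, v

    # Follow the back-pointers from the end of the sentence to its start.
    seq = [None] * (n + 1)  # index 0 is unused so positions are 1-based
    seq[n] = best_v
    if n >= 2:
        seq[n - 1] = best_u
    for k in range(n - 2, 0, -1):
        seq[k] = back[(k + 2, seq[k + 1], seq[k + 2])]
    return seq[1:]


# Toy model with invented probabilities, for demonstration only.
toy_tags = ["NNP", "VM"]
toy_trans = {
    (START, START, "NNP"): 0.8, (START, START, "VM"): 0.2,
    (START, "NNP", "VM"): 0.7, (START, "NNP", "NNP"): 0.3,
    ("NNP", "VM", STOP): 0.9,
}
toy_emit = {("राम", "NNP"): 0.5, ("गया", "VM"): 0.6}

print(viterbi_trigram(["राम", "गया"], toy_tags, toy_trans, toy_emit))
# ['NNP', 'VM']
```

Working in log space avoids numerical underflow on long sentences, and the table `pi` keeps the cost cubic in the number of tags per word rather than exponential in the sentence length.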