            GSJ: Volume 8, Issue 6, June 2020 
            ISSN 2320-9186                                                          172
                                                                            
                     GSJ: Volume 8, Issue 6, June 2020, Online: ISSN 2320-9186 
                                    www.globalscientificjournal.com 
              Supervised morphosyntactic tagging of parts of speech of Twi, a Ghanaian language 
             
            Joseph Arimiyawu, Abdulai 
            Affiliation : Odomaseman Senior High School, Odomase, Sunyani, Ghana. 
            Email : abdulaijoseph@ymail.com 
             
            Richard, Okyere Baffour 
            Affiliation : University of Energy and Natural Resources, Fiapre, Ghana 
            Email : richard.okyere@uenr.edu.gh  
             
             
            Abstract  
            In this article, we present the results of supervised automatic part-of-speech tagging of 
            Twi. We discuss the importance of part-of-speech tagging as presented by other researchers. 
            We explain the objective of the present work and why tagging the parts of speech of the Twi 
            language is useful. We present the corpus as well as the tagging tool which we adapted for the 
            Twi language. We also present the methodology and the steps involved in tagging. We 
            analyse some morphosyntactic phenomena which can be a source of difficulty for the 
            automatic tagging process, and we suggest some solutions to these problems. In conclusion, we 
            present some recommendations aimed at improving the results of this preliminary approach to 
            the automatic tagging of the Twi language.  
             
            Keywords 
            Twi, part-of-speech tagging, TreeTagger, Natural Language Processing (NLP) 
             
             
             
             
             
                                              GSJ© 2020 
                                        www.globalscientificjournal.com
         1. Introduction 
         Morphological tagging assigns a tag to each word in a text. It is important because the 
         information provided for each word and its context is necessary for linguistic analysis. 
         In the field of Natural Language Processing (NLP), morphological tagging is used for speech 
         synthesis, corpus-based linguistic searches, and translation [7]. According to a study on the 
         development of morphological tags for Arabic [11], providing a text with linguistic 
         information (morphological tags) increases its potential to be integrated into various 
         computer applications for linguistic analysis. 
         Twi is one of the most widely spoken languages in Ghana. The Akan group is made up of 
         several languages, including Twi, which is the subject of our study. According to the 
         classification carried out by [6], Akan belongs to the Kwa branch of the large Niger-Congo 
         family. According to [1], the other languages of the Akan group are Fante, Ahanta, Aowin, 
         Sefwi, Bono, Ahafo, Kwahu, Akyem, Agona, Dankyira and Asen. 
         There are two versions of the grammar of Twi: the grammar proposed by [4] and the modified 
         version of [2]. According to both versions, there are nine parts of speech in the Twi 
         language: Edin (the noun), Edin Nkyerɛkyerɛmu (the adjective), Edinnsiananmu (the pronoun), 
         Adeyɔ (the verb), Ɔkyerɛfoɔ (the adverb), Edin -akyi sibea (the postposition), Nkabomdeɛ 
         (the conjunction / connector), Nteamu (the interjection) and Nsisodeɛ (the emphasis marker). 
         The present work was carried out on these nine parts of speech.  
         In this article, we first present the literature review. Next, we present the methodology used 
         and the corpus of the study. We describe the tool we used. We also present the preprocessing 
         of the corpus and the steps we followed for tagging. Finally, we present the results, a 
         discussion of the results and perspectives for future research. 
          
         2. Literature review 
         Over time, automatic morphological tagging has developed considerably, giving rise to 
         several tagging methods as well as tools that apply these methods. 
         Figure 1 shows some of these tagging methods [7]. 
          
          Figure 1: Classification of tagging methods. 
                                          
          
          
         We find that the supervised and unsupervised tagging methods share three components: the 
         use of rules, the stochastic method and the neural method. The difference between the two 
         tagging methods lies in what they rely on: the supervised method uses a set of predefined 
         rules and a training corpus, whereas the unsupervised method uses a set of predefined rules 
         and the context in which words are used, without a training corpus. An example of the rules 
         used in this context could be as follows: a word preceded by a determiner and followed by an 
         adjective should be a noun [7]. 
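The example rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names `DET`, `ADJ` and `NOUN`, the function `apply_noun_rule` and the placeholder tokens are hypothetical and are not taken from the Twi tagset or corpus of this study.

```python
def apply_noun_rule(tagged):
    """Tag an untagged word as NOUN when it stands between a determiner and an adjective.

    `tagged` is a list of (word, tag) pairs; tag is None when the word is still untagged.
    """
    result = list(tagged)
    for i in range(1, len(result) - 1):
        word, tag = result[i]
        prev_tag = result[i - 1][1]
        next_tag = result[i + 1][1]
        # The contextual rule: DET _ ADJ  =>  the middle word is a noun.
        if tag is None and prev_tag == "DET" and next_tag == "ADJ":
            result[i] = (word, "NOUN")
    return result


# Placeholder tokens (w1, w2, w3), not real Twi words.
sentence = [("w1", "DET"), ("w2", None), ("w3", "ADJ")]
print(apply_noun_rule(sentence))
```

A real rule-based tagger would combine many such rules and resolve conflicts between them; the sketch applies a single rule and leaves every other word unchanged.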
         In the stochastic method, the tag assigned to a word is determined by calculating the 
         probability that the word is associated with a certain tag, based on the frequency of such an 
         association. This probabilistic method is used in the TreeTagger. There are also tools such as 
         Brill’s tagger [3], which uses the two components mentioned above (rules and probability 
         calculation). This tool works well for languages which do not have a sufficient corpus for 
         analysis but which have a well-established rule system. Besides the Brill tagger, other taggers 
         have been tested on several languages [11]. 
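The idea of estimating tag probabilities from frequencies can be illustrated with a minimal sketch. It assumes the simplest possible model, in which the probability of a tag depends only on the word itself; TreeTagger also takes the surrounding tags into account, so this is not its actual algorithm. The function names, the fallback tag `UNK` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_corpus):
    """Count how often each word occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return counts


def tag_probability(counts, word, tag):
    """Relative frequency estimate of P(tag | word); 0.0 for unseen words."""
    word_counts = counts.get(word, Counter())
    total = sum(word_counts.values())
    return word_counts[tag] / total if total else 0.0


def most_likely_tag(counts, word, default="UNK"):
    """Return the most frequent tag for a word, or a fallback for unseen words."""
    if word not in counts:
        return default
    return counts[word].most_common(1)[0][0]


# Toy corpus with placeholder tokens and tags.
corpus = [("w1", "N"), ("w1", "N"), ("w1", "V"), ("w2", "V")]
model = train_unigram(corpus)
print(tag_probability(model, "w1", "N"))
print(most_likely_tag(model, "w1"))
```

Because the estimate ignores context, an ambiguous word always receives its most frequent tag; contextual models address exactly this limitation.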
          
         2.1. TreeTagger 
         The TreeTagger is a supervised probabilistic tagging tool that uses decision trees. It is 
         based on the principle of the Hidden Markov Model (HMM), a model that represents the 
         probability distribution over a sequence of observations [5]. Designed by 
         Helmut Schmid for English [8], this tagger has been trained and adapted to German [9] and other 
         languages such as French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, 
         Chinese, Swahili, Slovak, Latin, Estonian, Polish, and Old French. In principle, the tagger can 
         label any other language for which a lexicon and a manually tagged training corpus are 
         available. To tag a language with TreeTagger, a training model is first created from a sample 
         of the corpus. The model is built by the “train-tree-tagger” module, which is launched from 
         the command line. This training module requires four arguments: 
          
         1. “lexicon”: a lexicon composed of words. On each line of the lexicon, there is a word and 
         its lemma separated by a tab character. 
          
         2. "open class file": a file containing the labels that are used when the tagger is dealing with 
         unknown words. 
          
         3. "input file": this is the file that contains the manually tagged corpus. This file consists of a 
         word and its appropriate label on each line. 
          
         4. “output file”: this is the name of the file where the training results are stored. 
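As an illustration, the four arguments listed above can be assembled into a command line from Python. This is a sketch under stated assumptions: all file names are hypothetical, the `train-tree-tagger` binary is assumed to be on the PATH, and the tab-separated "word, tag" layout of the training lines is assumed from the TreeTagger documentation rather than from this article.

```python
import shutil
import subprocess


def format_training_lines(pairs):
    """Render (word, tag) pairs as one 'word<TAB>tag' line each (assumed input format)."""
    return "".join(f"{word}\t{tag}\n" for word, tag in pairs)


def build_train_command(lexicon, open_class, tagged_corpus, param_file,
                        binary="train-tree-tagger"):
    """Return the argument list in the order: lexicon, open class file, input file, output file."""
    return [binary, lexicon, open_class, tagged_corpus, param_file]


def train(lexicon, open_class, tagged_corpus, param_file):
    """Run the training module; raises if the binary is missing or training fails."""
    cmd = build_train_command(lexicon, open_class, tagged_corpus, param_file)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on the PATH")
    subprocess.run(cmd, check=True)


# Hypothetical file names; only the command is printed here, nothing is executed.
print(build_train_command("twi-lexicon.txt", "twi-open-class.txt",
                          "twi-train.txt", "twi.par"))
```

Passing the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.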
          
         Following the creation of this model, the tagger is launched with another untagged sample.  
         The module that provides automatic tagging requires three arguments: 
          
         1. "parameter file": the file created at the end of the training phase (this is the "output file" of 
         the training step). 
          
         2. “input file”: this file contains the text to be automatically tagged. There is a word on each 
         line of the file. 
          
         3. "output file": the results of the automatic tagging are stored in this output file. 
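The tagging step can be sketched in the same way. The sketch assumes that the `tree-tagger` binary is on the PATH and that splitting on whitespace is an acceptable tokenization, which is a simplification that the article does not state; the file names and helper functions are hypothetical. The `-token` option asks TreeTagger to print each token next to its tag.

```python
import shutil
import subprocess


def one_token_per_line(text):
    """Naive whitespace tokenization: one token per line, as the input file requires."""
    return "".join(token + "\n" for token in text.split())


def build_tag_command(param_file, input_file, output_file,
                      binary="tree-tagger", options=("-token",)):
    """Return the argument list in the order: options, parameter file, input file, output file."""
    return [binary, *options, param_file, input_file, output_file]


def tag(param_file, input_file, output_file):
    """Run the tagger; raises if the binary is missing or tagging fails."""
    cmd = build_tag_command(param_file, input_file, output_file)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on the PATH")
    subprocess.run(cmd, check=True)


# Hypothetical file names; only the command is printed here, nothing is executed.
print(build_tag_command("twi.par", "twi-untagged.txt", "twi-tagged.txt"))
```

Real text would need a tokenizer that separates punctuation from words before the one-token-per-line file is written.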
          