GSJ: Volume 8, Issue 6, June 2020, Online: ISSN 2320-9186
www.globalscientificjournal.com

Supervised morphosyntactic tagging of parts of speech of Twi, a Ghanaian language

Joseph Arimiyawu, Abdulai
Affiliation: Odomaseman Senior High School, Odomase, Sunyani, Ghana.
Email: abdulaijoseph@ymail.com

Richard, Okyere Baffour
Affiliation: University of Energy and Natural Resources, Fiapre, Ghana
Email: richard.okyere@uenr.edu.gh

Abstract

In this article, we present the results of the supervised automatic tagging of the parts of speech of Twi. We discuss the importance of part-of-speech tagging as presented by other researchers. We explain the objective of the present work and how tagging the parts of speech of the Twi language is useful. We present the corpus as well as the tagging tool that we adapted for the Twi language. We also present the methodology and the steps involved in tagging. We analyse some morphosyntactic phenomena that can be a source of difficulty for the automatic tagging process and suggest some solutions to these problems. In conclusion, we present some recommendations aimed at improving the results of this preliminary approach to the automatic tagging of the Twi language.

Keywords

Twi, part-of-speech tagging, TreeTagger, Natural Language Processing (NLP)

1. Introduction

Morphological tagging is the process of assigning a tag to each word in a text. This is important because the information provided for each word and its surroundings is necessary for linguistic analyses. In the field of Natural Language Processing (NLP), morphological tagging is used for speech synthesis, corpus-based linguistic searches, and translation [7]. According to a study on the development of morphological tags for Arabic [11], enriching a text with linguistic information (morphological tags) increases its potential to be integrated into various computer applications for linguistic analysis.

Twi is one of the most widely spoken languages in Ghana. The Akan group is made up of several languages, including Twi, which is the subject of our study. According to the classification carried out by [6], Akan belongs to the Kwa branch of the large Niger-Congo family. According to [1], the other languages of the Akan group are Fante, Ahanta, Aowin, Sefwi, Bono, Ahafo, Kwahu, Akyem, Agona, Dankyira and Asen.

There are two versions of the grammar of Twi: the grammar proposed by [4] and the modified version of [2]. According to these two versions, there are nine parts of speech in the Twi language: Edin (the noun), Edin Nkyerɛkyerɛmu (the adjective), Edinnsiananmu (the pronoun), Adeyɔ (the verb), Ɔkyerɛfoɔ (the adverb), Edin-akyi sibea (the postposition), Nkabomdeɛ (the conjunction/connector), Nteamu (the interjection) and Nsisodeɛ (the emphasis marker). The present work was carried out on these nine parts of speech.

In this article, we first present the literature review. Next, we present the methodology used and the corpus of the study. We describe the tool we used. We also present the preprocessing of the corpus and the steps we followed for tagging. Finally, we present the results, a discussion of the results and perspectives for future research.

2. Literature review

Over time, automatic morphological tagging has developed considerably, leading to several tagging methods as well as tools that implement them. Some of these methods are presented in Figure 1 [7].

Figure 1: Classification of tagging methods.

We find that the supervised and unsupervised tagging methods share three components: the use of rules, the stochastic method and the neural method. The difference between the two approaches is that the supervised method uses a set of predefined rules together with a training corpus, whereas the unsupervised method relies on predefined rules and on the context in which words are used, without a training corpus. An example of a rule used in this context could be: a word preceded by a determiner and followed by an adjective should be a noun [7].

Regarding the stochastic method, the tag to assign to a word is determined by calculating the probability that the word is associated with a given tag, based on the frequency of that association. This probabilistic method is used in TreeTagger.
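As an illustration of the frequency-based idea behind such stochastic taggers, the following minimal Python sketch estimates P(tag | word) by counting word-tag co-occurrences in a manually tagged corpus. It is a sketch only, not part of the original study: the file name "twi_tagged.txt", the assumed "word<TAB>tag" line format and the fallback behaviour for unknown words are illustrative assumptions.

    # A minimal, illustrative sketch of the frequency counting behind a stochastic tagger.
    # Assumes a file "twi_tagged.txt" in which each line holds one "word<TAB>tag" pair.
    from collections import Counter, defaultdict

    word_tag_counts = defaultdict(Counter)

    with open("twi_tagged.txt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            word, tag = line.split("\t")
            word_tag_counts[word][tag] += 1

    def most_likely_tag(word):
        """Return the most frequent tag for a word and its relative frequency P(tag | word)."""
        tags = word_tag_counts.get(word)
        if not tags:
            return None  # unknown word: a full tagger falls back on context or open-class tags
        tag, count = tags.most_common(1)[0]
        return tag, count / sum(tags.values())

A tagger such as TreeTagger refines this idea by also taking the surrounding tags into account, rather than relying on word-tag frequencies alone.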
We also have tools such as the Brill tagger [3], which uses the two components mentioned above (rules and probability calculation). This tool works well for languages that do not have a sufficient corpus for analysis but do have a well-established rule system. Besides the Brill tagger, other taggers have been tested on several languages [11].

2.1. TreeTagger

TreeTagger is a supervised probabilistic tagging tool that uses decision trees. It is based on the principle of the Hidden Markov Model, a model that represents probability distributions over a sequence of observations [5]. Designed by Helmut Schmid for English [8], this tagger has been trained and adapted to German [9] and other languages such as French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Latin, Estonian, Polish, and Old French. The tagger is also expected to handle languages other than those mentioned above, provided they have a lexicon and a manually tagged training corpus.

To tag a language with TreeTagger, a training model is first created from a sample of the corpus. The creation of the training model is handled by the "train-tree-tagger" module, which is launched at the command line. This training module requires four arguments:

1. "lexicon": a lexicon of word forms. Each line of the lexicon contains a word and its lemma, separated by a tab.
2. "open class file": a file containing the tags that are used when the tagger deals with unknown words.
3. "input file": the file that contains the manually tagged corpus. Each line of this file contains a word and its appropriate tag.
4. "output file": the name of the file where the training results (the parameter file) are stored.
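The training step can be scripted. The sketch below (a sketch only, not the authors' actual pipeline) prepares the four arguments and launches train-tree-tagger from Python. The file names, placeholder tokens and tag abbreviations (EDIN, ADEYO, OKYEREFO) are assumptions made for illustration; note that the lexicon lines here include a tag alongside the lemma, following the format given in the TreeTagger documentation, and train-tree-tagger is assumed to be on the PATH.

    # A minimal, illustrative sketch of the training phase described above.
    import subprocess

    # 1. Lexicon: one word form per line, followed by a tag and a lemma
    #    (tag-lemma pairs, as in the TreeTagger documentation).
    with open("twi_lexicon.txt", "w", encoding="utf-8") as f:
        f.write("word1\tEDIN word1\n")
        f.write("word2\tADEYO word2\n")

    # 2. Open class file: the tags allowed for unknown words.
    with open("twi_open_class.txt", "w", encoding="utf-8") as f:
        f.write("EDIN ADEYO OKYEREFO\n")

    # 3. Input file: the manually tagged training corpus, one "word<TAB>tag" pair per line.
    with open("twi_train.txt", "w", encoding="utf-8") as f:
        f.write("word1\tEDIN\n")
        f.write("word2\tADEYO\n")

    # 4. Output file: "twi.par" will hold the parameter file produced by training.
    subprocess.run(
        ["train-tree-tagger", "twi_lexicon.txt", "twi_open_class.txt",
         "twi_train.txt", "twi.par"],
        check=True,
    )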
"output file": the results of the automatic tagging are stored in this output file. GSJ© 2020 www.globalscientificjournal.com