Source: aclanthology.org
A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew
Morphological analysis, lemmatization, vocalization, disambiguation and text-to-speech

Dror Kamir, Naama Soreq, Yoni Neeman
Melingo Ltd., 16 Totseret Haaretz st., Tel-Aviv, Israel
drork@melingo.com, naamas@melingo.com, yonin@melingo.com

Abstract

This paper presents a comprehensive NLP system by Melingo that has recently been developed for Arabic. It is based on Morfix™, an operational, previously developed and highly successful comprehensive Hebrew NLP system.

The system discussed includes modules for morphological analysis, context-sensitive lemmatization, vocalization, text-to-phoneme conversion, and a prosody (intonation) model based on syntactic analysis. It is employed in applications such as full text search, information retrieval, text categorization, textual data mining, online contextual dictionaries, filtering, and text-to-speech applications in the fields of telephony and accessibility, and could serve as a handy accessory for non-fluent Arabic or Hebrew speakers.

Modern Hebrew and Modern Standard Arabic share some unique Semitic linguistic characteristics. Yet up to now, the two languages have been handled separately in Natural Language Processing circles, both on the academic and on the applicative levels. This paper reviews the major similarities and the minor dissimilarities between Modern Hebrew and Modern Standard Arabic from the NLP standpoint, and emphasizes the benefit of developing and maintaining a unified system for both languages.

1 Introduction

1.1 The common Semitic basis from an NLP standpoint

Modern Standard Arabic (MSA) and Modern Hebrew (MH) share the basic Semitic traits: rich morphology, based on consonantal roots (Jiðr / Šoreš) [1], which depends on vowel changes and in some cases consonantal insertions and deletions to create inflections and derivations. [2]

[1] A remark about the notation: Phonetic transcriptions always appear in Italics, and follow the IPA convention, except the following: ? glottal stop, ¿ voiced pharyngeal fricative (Ayn), đ velarized d, ś velarized s. Orthographic transliterations appear in curly brackets. Bound morphemes (affixes, clitics, consonantal roots) are written between two slashes. Arabic and Hebrew linguistic terms are written in phonetic spelling beginning with a capital letter. The Arabic term comes first.

[2] For a review of the different approaches to Semitic inflections see Beesley (2001), p. 2.

For example, in MSA: the consonantal root /ktb/ combined with the vocalic pattern CaCaCa derives the verb kataba 'to write'. This derivation is further inflected into forms that indicate semantic features, such as number, gender, tense etc.: katab-tu 'I wrote', katab-ta 'you (sing. masc.) wrote', katab-ti 'you (sing. fem.) wrote', ?a-ktubu 'I write/will write', etc.

Similarly in MH: the consonantal root /ktv/ combined with the vocalic pattern CaCaC derives the verb katav 'to write', and its inflections are: katav-ti 'I wrote', katav-ta 'you (sing. masc.) wrote', katav-t 'you (sing. fem.) wrote', e-xtov 'I will write' etc.

In fact, morphological similarity extends much further than this general observation, and includes very specific similarities in terms of the NLP systems, such as usage of nominal forms to mark tenses and moods of verbs; usage of pronominal enclitics to convey direct objects; and usage of proclitics to convey some prepositions. Moreover, the inflectional patterns and clitics are quite similar in form in most cases. Both languages exhibit construct formation (Iđa:fa / Smixut), which is similar in its structure and in its role. The suffix marking feminine gender is also similar, and similarity goes as far as peculiarities in the numbering system, where the feminine gender suffix marks the masculine. Some of these phenomena will be demonstrated below.

1.2 Lemmatization of Semitic Languages

A consistent definition of lemma is crucial for a data retrieval system. A lemma can be said to be the equivalent of a lexical entry: the basic grammatical unit of natural language that is semantically closed. In applications such as search engines, it is usually the lemma that is sought, while additional information, including tense, number, and person, is dispensable.

In MSA and MH a lemma is actually the common denominator of a set of forms (hundreds or thousands of forms in each set) that share the same meaning and some morphological and syntactic features. Thus, in MSA, the forms ?awla:d and walada:ni, despite their remarkable difference in appearance, share the same lemma WALAD 'a boy'. This is even more noticeable in verbs, where forms like kataba, yaktubu, kutiba, yuktabu, kita:ba and many more are all part of the same lemma: KATABA 'to write'.

The rather large number of inflections and complex forms (forms that include clitics, see below 1.5) possible for each lemma results in a high total number of forms, which, in fact, is estimated to be the same for both languages: around 70 million [3]. The mapping of these forms into lemmas is inconclusive (see Dichy (2001), p. 24). Hence the question arises: what should be defined as a lemma in MSA and MH?

[3] For Arabic, see Beesley (2001), p. 7. For Hebrew, our own sources.

The fact that MSA and MH morphology is root-based might promote the notion of identifying the lemma with the root. But this solution is not satisfactory: in most cases there is indeed a diachronic relation in meaning among words and forms of the same consonantal root. However, semantic shifts which occur over the years rule out this method in synchronic analysis. Moreover, some diachronic processes result in totally coincidental sharing of a root by two or more completely different semantic domains. For example, in MSA, the words fajr 'dawn' and infija:r 'explosion' share the same root /fjr/ (the latter might have originally been a metaphor). Similarly, in MH the verbs pasal 'to ban, disqualify' and pisel 'to sculpt' share the same root /psl/ (the former is an old loan from Aramaic).

In Morfix, as described below (2.1), a lemma is defined not as the root, but as the manifestation of this root, most commonly as the lesser marked form of a noun, adjective or verb. There is no escape from some arbitrariness in the implementation of this definition, due to the fine line between inflectional morphology and derivational morphology. However, Morfix generally follows the tradition set by dictionaries, especially bilingual dictionaries. Thus, for example, difference in part of speech entails different lemmas, even if the morphological process is partially predictable. Similarly, each verb pattern (Wazn / Binyan) is treated as a different lemma.

Even so, the roots should not be overlooked, as they are a good basis for forming groups of lemmas; in other words, the root can often serve as a super-lemma, joining together several lemmas, provided they all share a semantic field.

1.3 The Issue of Nominal Inflections of Verbs

The inconclusive selection of lemmas in MSA and MH can be demonstrated by looking into an interesting phenomenon: the nominal inflections of verbs (roughly parallel to the Latin participle, see below). Since this issue is a good example both of a characteristic of Semitic NLP and of the similarities between MSA and MH, it is worthwhile to elaborate on it further.

Both MSA and MH use the nominal inflections of verbs to convey tenses, moods and aspects. These inflections are derived directly from the verb according to strict rules, and their forms are predictable in most cases.
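The predictability of such derived forms follows from the root-and-pattern mechanism introduced in 1.1. The following is a minimal sketch in Python, offered as an illustration only and not as the Morfix implementation. The function name apply_pattern and the use of the letter C as a slot marker are assumptions made for this sketch. The patterns CaCaCa and CaCaC come from the examples in 1.1; Ca:CiC (MSA active participle) and CoCeC (MH present participle) are standard patterns added here for demonstration. The sketch ignores phonological alternations such as the /k/ ~ /x/ alternation visible in e-xtov.

```python
def apply_pattern(root, pattern):
    """Interdigitate a consonantal root with a vocalic pattern.

    Hypothetical helper for illustration: each 'C' in the pattern is a slot
    filled by the next root consonant; every other character is copied as is.
    """
    if pattern.count("C") != len(root):
        raise ValueError("pattern slots and root consonants differ in number")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


# MSA root /ktb/
print(apply_pattern("ktb", "CaCaCa"))  # kataba
print(apply_pattern("ktb", "Ca:CiC"))  # ka:tib
# MH root /ktv/
print(apply_pattern("ktv", "CaCaC"))   # katav
print(apply_pattern("ktv", "CoCeC"))   # kotev
```

A lemmatizer has to run this mapping in reverse, which is harder: several roots and patterns may be compatible with one written form, especially when vowels are not written.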
Nonetheless, grammatically, these forms behave as nouns or adjectives. This means that they bear case marking in MSA, nominal marking for number and gender (in both languages) and they can be definite or indefinite (in both languages). Moreover, these inflections often serve as nouns or adjectives in their own right. This, in fact, causes the crucial problem for data retrieval, since the system has to determine whether the user refers to the noun/adjective or rather to the verb for which it serves as inflection.

Nominal inflections of verbs exist in non-Semitic languages as well; in most European languages participles and infinitives have nominal features. However, two Semitic traits make this phenomenon more challenging in our case. The first is the rich morphology, which creates a large set of inflections for each base form (i.e. the verb is inflected to create nominal forms and then each form is inflected again for case, gender and number). The second is that Semitic languages allow nominal clauses, namely verbless sentences, which increase ambiguity. For example, in English it is easy to recognize the form 'drunk' in 'he has drunk' as related to the lemma DRINK (V) (and not as an adjective). This is done by spotting the auxiliary 'has' which precedes this form. However in MH, the clause axi šomer could mean 'my brother is a guard' or 'my brother guards/is guarding'. The syntactical cues for the final decision are subtle and elusive. Similarly in MSA: axi ka:tibun could mean 'my brother is writing' or 'my brother is a writer'.

1.4 Orthography

From the viewpoint of NLP, especially commercially applicable NLP, it is important to note that the writing systems of both MSA and MH follow the same conventions, in which most vowels are not marked. Therefore, in MSA the form yaktubu 'he writes/will write' is written {yktb}. Similarly in MH, the form yilmad 'he will learn' is written {ylmd}. Both languages have a supplementary marking system for vocalization (written above, under and beside the text), but it is not used in the overwhelming majority of texts. In both languages, when vowels do appear as letters, letters of consonantal origin are used, consequently making these letters ambiguous (between their consonantal and vocalic readings).

It is easy to see the additional difficulty that this writing convention presents for NLP. The string {yktb} in MSA can be interpreted as yaktubu (future tense), yaktuba (subjunctive), yaktub (jussive), yuktabu (future tense passive) and even yuktibu 'he dictates/will dictate', a form that is considered by Morfix to be a different lemma altogether (see above 1.2). Furthermore, ambiguity can occur between totally unrelated words, as will be shown in section 1.7. A trained MSA reader can distinguish between these forms by using contextual cues (both syntactic and semantic). A similar contextual sensitivity must be programmed into the NLP system in order to meet this challenge.

Each language also has some orthographic peculiarities of its own. The most striking in MH is the multiple spelling conventions that are used simultaneously. The classical convention has been replaced in most texts with some kind of spelling system that partially indicates vowels, and thus reduces ambiguities. An NLP system has to take into account the various spelling systems and the fact that the classic convention is still occasionally used. Thus, each word often has more than one spelling. For example: the word shi?ur 'a lesson' can be written {š¿wr} or {šy¿wr}. The word kiven 'to direct' can be written {kwn} or {kywwn}; the former is the classical spelling (Ktiv Xaser) while the latter is the standard semi-vocalized system (Ktiv Male), but some non-standard spellings can also appear: {kywn}, {kwwn}.

MSA spelling is much more standardized and follows classic conventions. Nonetheless, some of these conventions may seem confusing at first sight. The Hamza sign, which represents the glottal stop phoneme, can be written in 5 different ways, depending on its phonological environment. Therefore, any change in vowels (a very regular phenomenon in MSA inflectional paradigms) results in a different shape of Hamza. This occurs even when the vowels themselves are not marked. Moreover, there is often more than one shape possible per form, without any mandatory convention. One could argue that all Hamza shapes should be encoded as one for our purposes. This may solve some problems, but then again it would deprive us of crucial information about the vowels in the word. Since the Hamza changes according to the vowels around it, it is a good cue for retrieving the vocalization of the word and for reducing ambiguity.

1.5 Clitics and Complex Forms

The phenomenon which will be described in this section is related both to the morphological structure of MSA and MH, and to the orthographical conventions shared by these languages. Both languages use a diverse system of clitics [4] that are appended to the inflectional forms, creating complex forms and further complications in proper lemmatization and data retrieval.

[4] The term clitics is employed here as the closest term which can describe this phenomenon without committing to any linguistic theory.

For example, in MSA, the form ?awla:dun 'boys (nom.)', a part of the lemma WALAD 'boy', can take the genitive pronominal enclitic /-ha/ 'her' and create the complex form ?awla:d-u-ha 'boys-nom.-her' (=her boys). This complex form is orthographically represented as follows: {?wladha}. Similarly in Hebrew, the form yeladim 'children' (of the lemma YELED 'child'), combined with the genitive pronominal enclitic /-ha/ 'her', yields the complex form yelade-ha 'children-her' (=her children). The orthographical representation is: {yldyh}.

Enclitics usually denote genitive pronouns for nouns (as demonstrated above) and accusative pronouns for verbs. For example, in MSA, ?akaltu-hu 'I ate it' {?klth}, or in MH axalti-v 'I ate it' {?kltyw}. It is easy to see how this phenomenon, especially the orthographic convention which conjoins these enclitics to the basic form, may create confusion in lemmatizing and data retrieval. However, the nature of clitics, which limits their position and possible combinations, helps to locate them and trace the basic form from which the complex one was created.

There are also several proclitics denoting prepositions and other particles, attached to the following form by orthographic convention. The most common are the conjunctions /w, f/, the prepositions /b, l, k/ and the definite article /al/ in MSA, and the conjunction /w/, the prepositions /b, k, l, m/ (often referred to as Otiyot Baxlam), the relative pronoun /š/ and the definite article /h/ in MH. Therefore, in MSA, the phrase wa-li-l-?awla:di 'and to the boys' will have the following orthographical representation: {wll?wlad}. In MH the phrase ve-la-yeladim 'and to the children' will be represented orthographically as: {wlyldym}. Once again, when scanning a written text, these proclitics must be taken into account in the lemmatization process.

1.6 Syntax

The syntactic structure of MSA and MH is very similar. In fact, the list of major syntactic rules is almost identical, though the actual application of these rules may differ between the languages.

A good demonstration of that is the agreement rule. Both languages demand a strict noun-adjective-verb agreement. The agreement includes features such as number, gender, definiteness and in MSA also case marking (in noun-adjective agreement). The MH agreement rule is more straightforward than the MSA one. For example: ha-yeladim ha-gdolim halxu 'the-child-pl. the-big-pl. go-past-pl.' (=The big children went). Note that all elements in the sentence are marked as plural, and the noun and the adjective also agree in definiteness.

The case of MSA is slightly different. MSA has incomplete agreement in verb-subject sentences, which are the vast majority. In this case the agreement of the verb will only be in gender but not in number, e.g. ðahaba l-?awla:du 'go-past-masc.-sing. boy-pl.' (=The boys went). MSA also distinguishes between human plural forms and non-human plural forms, i.e. if the plural form does not have a human referent, the verb or the adjective will be marked as feminine singular rather than plural, e.g. ðahabat el-kila:bu l-kabi:ratu 'go-past-fem.-sing. the-dog-masc.-pl. the-big-fem.-sing.' (=The big dogs went).

The example of the agreement rule demonstrates both the similarities and the differences between MSA and MH. Furthermore, it demonstrates how minor the differences are as far as our purposes go. As long as the agreement rule is taken into account, its actual implementation has hardly any consequences at the level of the system. This example also demonstrates a very useful cue to reduce ambiguity among forms. This cue is probably used intuitively by trained readers of MSA and MH, and encoding it into the Morfix NLP system turns out to be quite useful.

1.7 Ambiguity

Perhaps the major challenge for NLP analysis in MSA and MH is overcoming the ambiguity of