Source: aclanthology.org
Arabic Morphology Generation Using a Concatenative Strategy

Violetta Cavalli-Sforza
Carnegie Technology Education
4615 Forbes Avenue
Pittsburgh, PA, 15213
violetta@cs.cmu.edu

Abdelhadi Soudi
Computer Science Department
Ecole Nationale de L'Industrie Minerale
Rabat, Morocco
asoudi@enim.ac.ma

Teruko Mitamura
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213
teruko@cs.cmu.edu

Abstract

Arabic inflectional morphology requires infixation, prefixation and suffixation, giving rise to a large space of morphological variation. In this paper we describe an approach to reducing the complexity of Arabic morphology generation using discrimination trees and transformational rules. By decoupling the problem of stem changes from that of prefixes and suffixes, we gain a significant reduction in the number of rules required, as much as a factor of three for certain verb types. We focus on hollow verbs but discuss the wider applicability of the approach.

Introduction

Morphologically, Arabic is a non-concatenative language. The basic problem with generating Arabic verbal morphology is the large number of variants that must be generated. Verbal stems are based on triliteral or quadriliteral roots (3- or 4-radicals). Stems are formed by a derivational combination of a root morpheme and a vowel melody; the two are arranged according to canonical patterns. Roots are said to interdigitate with patterns to form stems. For example, the Arabic stem katab (he wrote) is composed of the morpheme ktb (notion of writing) and the vowel melody morpheme 'a-a'. The two are coordinated according to the pattern CVCVC (C=consonant, V=vowel).

There are 15 triliteral patterns, of which at least 9 are in common use, and 4 much rarer quadriliteral patterns. All these patterns undergo some stem changes with respect to voweling in the 2 tenses (perfect and imperfect), the 2 voices (active and passive), and the 5 moods (indicative, subjunctive, jussive, imperative and energetic).¹ The stem used in the conjugation of the verb may differ depending on the person, number, gender, tense, mood, and the presence of certain root consonants. Stem changes combine with suffixes in the perfect indicative (e.g., katab-naa 'we wrote', kutib-a 'it was written') and the imperative (e.g. uktub-uu 'write', plural), and with both prefixes and suffixes for the imperfect tense in the indicative, subjunctive, and jussive moods (e.g. ya-ktub-na 'they write, feminine plural') and in the energetic mood (e.g. ya-ktub-unna or ya-ktub-un 'he certainly writes'). There are a total of 13 person-number-gender combinations. Distinct prefixes are used in the active and passive voices in the imperfect, although in most cases this results in a change in the written form only if diacritic marks are used.²

¹ The jussive is used in specific constructions, for example, negation in the past with the negative particle lam (e.g., lam aktub 'I didn't write'). The energetic expresses corroboration of an action taking place. The indicative is common to both perfect and imperfect tenses, but the subjunctive and the jussive are restricted to the imperfect tense. The imperative has a special form, and the energetic can be derived from either the imperfect or the imperative.

² Diacritic marks are used in Arabic language textbooks and occasionally in regular texts to resolve ambiguous words (e.g. to mark a passive verb use).

Most previous computational treatments of Arabic morphology are based on linguistic models that describe Arabic in a non-concatenative way and focus primarily on analysis. Beesley (1991) describes a system that analyzes Arabic words based on Koskenniemi's (1983) two-level morphology. In Beesley (1996) the system is reworked into a finite-state lexical transducer to perform analysis and generation. In two-level systems, the lexical level includes short vowels that are typically not realized on the surface level. Kiraz (1994) presents an analysis of Arabic morphology based on the CV-, moraic-, and affixational models. He introduces a multi-tape two-level model and a formalism where three tapes are used for the lexical level (root, pattern, and vocalization) and one tape for the surface level.

In this paper, we propose a computational approach that applies a concatenative treatment to Arabic morphology generation by separating the issue of infixation from other inflectional variations. We are developing an Arabic morphological generator using MORPHE (Leavitt, 1994), a tool for modeling morphology based on discrimination trees and regular expressions. MORPHE is part of a suite of tools developed at the Language Technologies Institute, Carnegie Mellon University, for knowledge-based machine translation. Large systems for MT from English to Spanish, French, German, Portuguese and a prototype for Italian have already been developed. Within this framework, we are exploring English to Arabic translation and Arabic generation for pedagogical purposes. We generate Arabic words including short vowels and diacritic marks, since they are pedagogically useful and can always be stripped before display.

Our approach seeks to reduce the number of rules for generating morphological variants of Arabic verbs by breaking the problem into two parts. We observe that, with the exception of a few verb types, there is very little interaction between stem changes and the processes of prefixation and suffixation. It is therefore possible to decouple, in large part, the problem of stem changes from that of prefixes and suffixes. The gain is a significant reduction in the number of transformational rules, as much as a factor of three for certain verb classes. This improves the space efficiency of the system and its maintainability by reducing duplication of rules, and simplifies the rules by isolating different types of changes.

To illustrate our approach, we focus on a particular type of verb, termed hollow verbs, and show how we integrate their treatment with that of more regular verbs. We also discuss how the approach can be extended to other classes of verbs and other parts of speech.

1 Arabic Verbal Morphology

Verb roots in Arabic can be classified as shown in Figure 1.³ A primary distinction is made between weak and strong verbs. Weak verbs have a weak consonant ('w' or 'y') as one or more of their radicals; strong verbs do not have any weak radicals.

³ Grammars of Arabic are not uniform in their classification of "hamzated" verbs, verbs containing the glottal stop as one of the radicals (e.g. [sa?al] 'to ask'). Wright (1968) includes them as weak verbs, but Cowan (1964) doesn't. Hamzated verbs change the written 'seat' of the hamza from 'alif' to 'waaw' or 'yaa?', depending on the phonetic context.

[Figure 1: Classification of Arabic Verbal Roots and Mood Tense System. Triliteral roots divide into strong (regular, hamzated, doubled) and weak (weak initial radical, or assimilated; weak middle radical, or hollow; weak final radical, or defective). Tense: preterit (perfect), present (imperfect), participle. Mood: indicative, imperative, subjunctive, jussive, energetic. Voice: active, passive.]

Strong verbs undergo systematic changes in stem voweling from the perfect to the imperfect. The first radical vowel disappears in the imperfect. Verbs whose middle radical vowel in the perfect is 'a' can have 'a' (e.g., qaTa'a 'he cut' -> yaqTa'u 'he cuts'),⁴ 'i' (e.g., Daraba 'he hit' -> yaDribu 'he hits'), or 'u' (e.g., kataba 'he wrote' -> yaktubu 'he writes') in the imperfect. Verbs whose middle radical vowel in the perfect is 'i' can only have 'a' (e.g., shariba 'he drank' -> yashrabu 'he drinks') or 'i' (e.g., Hasiba 'he supposed' -> yaHsibu 'he supposes') in the imperfect. Verbs with middle radical vowel 'u' in the perfect do not change it in the imperfect (e.g., Hasuna 'he was beautiful' -> yaHsunu 'he is beautiful'). For strong verbs, neither perfect nor imperfect stems change with person, gender, or number.

⁴ In the Arabic transcription capital letters indicate emphatic consonants; 'H' is the voiceless pharyngeal fricative; the apostrophe (') is the voiced pharyngeal fricative; '?' is the glottal stop 'hamza'.

Hollow verbs are those with a weak middle radical. In both perfect and imperfect tenses, the underlying stem is realized by two characteristic allomorphs, one short and one long, whose use depends on the person, number and gender. Hollow verbs fall into four classes:

1. Verbs of the pattern CawaC or CawuC (e.g. [Tawul] 'to be long'), where the middle radical is 'w'. Their characteristic feature is a long 'uu' between the first and last radical in the imperfect. E.g.,
   From the underlying root [zawar]:
   zaara 'he visited' and yazuuru 'he visits'
   Stem allomorphs:
   Perfect: -zur- and -zaar-
   Imperfect: -zur- and -zuur-

2. Verbs of the pattern CawiC, where the middle radical is 'w'. Their characteristic feature is a long 'aa' between the first and last radical in the imperfect. E.g.,
   From the underlying root [nawim]:
   naama 'he slept' and yanaamu 'he sleeps'
   Stem allomorphs:
   Perfect: -nim- and -naam-
   Imperfect: -nam- and -naam-

3. Verbs of the pattern CayaC, where the middle radical is 'y'. Their characteristic feature is a long 'ii' between the first and last radical in the imperfect. E.g.,
   From the underlying root [baya']:
   baa'a 'he sold' and yabii'u 'he sells'
   Stem allomorphs:
   Perfect: -bi'- and -baa'-
   Imperfect: -bi'- and -bii'-

4. Verbs of the pattern CayiC, where the middle radical is 'y'. E.g.,
   From the underlying root [hayib]:
   haaba 'he feared' and yahaabu 'he fears'
   Stem allomorphs:
   Perfect: -hib- and -haab-
   Imperfect: -hab- and -haab-

In the relevant literature (e.g., Beesley, 1998; Kiraz, 1994), verbs belonging to the above classes are all assumed to have the pattern CVCVC. The pattern does not show the verb conjugation class and makes it difficult to predict the type of stem allomorph to use. To avoid these problems, we keep information on the middle radical and vowel in the base form of the verb. In generation, classes 2 and 4 of the verb can be handled as one because they have the same perfect and imperfect stems.⁵

⁵ The only exception is the passive participle. Verbs of classes 1 and 2 behave the same (e.g. Class 1: [zawar] -> mazuwr 'visited'; Class 2: [nawil] -> manuwl 'obtained'), as do verbs of classes 3 and 4 (e.g. Class 3: [baya'] -> mabii' 'sold'; Class 4: [hayib] -> mahiib 'feared').

We describe our approach to modeling strong and hollow verbs below, following a description of the implementation framework.

2 The MORPHE System

MORPHE (Leavitt, 1994) is a tool that compiles morphological transformation rules into either a word parsing program or a word generation program.⁶ In this paper we will focus on the use of MORPHE in generation.

⁶ MORPHE is written in Common Lisp and the compiled MFH and transformation rules are themselves a set of Common Lisp functions.

Input and Output. MORPHE's output is simply a string. Input is a feature structure (FS) which describes the item that MORPHE must transform. A FS is implemented as a recursive Lisp list. Each element of the FS is a feature-value pair (FVP), where the value can be atomic or complex. A complex value is itself a FS. For example, the FS for generating the Arabic zurtu 'I visited' would be:

    ((ROOT "zawar")
     (CAT V) (PAT CVCVC) (VOW HOL)
     (TENSE PERF) (MOOD IND)
     (VOICE ACT)
     (NUMBER SG) (PERSON 1))

The choice of feature names and values, other than ROOT, which identifies the lexical item to be transformed, is entirely up to the user. The FVPs in a FS come from one of two sources. Static features, such as CAT (part of speech) and ROOT, come from the syntactic lexicon, which, in addition to the base form of words, can contain morphological and syntactic features. Dynamic features, such as TENSE and NUMBER, are set by MORPHE's caller.

The Morphological Form Hierarchy. MORPHE is based on the notion of a morphological form hierarchy (MFH) or tree. Each internal node of the tree specifies a piece of the FS that is common to that entire subtree. The root of the tree is a special node that simply binds all subtrees together. The leaf nodes of the tree correspond to distinct morphological forms in the language. Each node in the tree below the root is built by specifying the parent of the node and the conjunction or disjunction of FVPs that define the node. Portions of the Arabic MFH are shown in Figures 2-4.

Transformational Rules. A rule attached to each leaf node of the MFH effects the desired morphological transformations for that node. A rule consists of one or more mutually exclusive clauses. The 'if' part of a clause is a regular expression pattern, which is matched against the value of the feature ROOT (a string). The 'then' part includes one or more operators, applied in the given order. Operators include addition, deletion, and replacement of prefixes, infixes, and suffixes. The output of the transformation is the transformed ROOT string. An example of a rule attached to a node in the MFH is given in Section 3.1 below.

Process Logic. In generation, the MFH acts as a discrimination network. The specified FS is matched against the features defining each subtree until a leaf is reached. At that point, MORPHE first checks in the irregular forms lexicon for an entry indexed by the name of the leaf node (i.e., the MF) and the value of the ROOT feature in the FS. If an irregular form is not found, the transformation rule attached to the leaf node is tried. If no rule is found or none of the clauses of the applicable rule match, MORPHE returns the value of ROOT unchanged.

3 Handling Arabic Verbal Morphology in MORPHE

Figure 2 sketches the basic MFH and the division of the verb subtree into stem changes and prefix/suffix additions.⁷ The inflected verb is generated in two steps. MORPHE is first called with the feature CHG set to STEM. The required stem is returned and temporarily substituted for the value of the ROOT feature.

⁷ The use of two parts of the same tree for the two problems is a constraint of MORPHE's implementation, which does not permit multiple trees with separate roots.
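As an illustration of the process logic described in Section 2 and the stem-only call (CHG set to STEM), the following is a minimal sketch in Python rather than MORPHE's Common Lisp. It is not MORPHE's implementation: the tree layout, the leaf names (`perf-stem-short`, `perf-stem-long`), the `IRREGULAR` placeholder entry, and the use of a single regular-expression substitution in place of MORPHE's prefix/infix/suffix operators are all assumptions made for illustration. The two rules cover only the two perfect forms of [zawar] cited in the text (the short stem of zurtu and the long stem of zaara), not the full distribution of the allomorphs.

```python
import re

# Hypothetical miniature MFH. Each node lists the feature-value pairs (FVPs)
# that define it; a leaf (no children) has a name used to index rules.
MFH = {
    "fvps": {},  # root node: only binds the subtrees together
    "children": [
        {
            "fvps": {"CAT": "V", "CHG": "STEM", "TENSE": "PERF"},
            "children": [
                {"name": "perf-stem-short",
                 "fvps": {"PERSON": 1, "NUMBER": "SG"}, "children": []},
                {"name": "perf-stem-long",
                 "fvps": {"PERSON": 3, "NUMBER": "SG", "GENDER": "M"},
                 "children": []},
            ],
        }
    ],
}

# Hypothetical rules: leaf name -> clauses of (regex on ROOT, replacement).
RULES = {
    "perf-stem-short": [(r"^(.)awa(.)$", r"\1u\2")],   # zawar -> zur
    "perf-stem-long": [(r"^(.)awa(.)$", r"\1aa\2")],   # zawar -> zaar
}

# Placeholder irregular-forms lexicon, indexed by (leaf name, ROOT value).
IRREGULAR = {("perf-stem-short", "xxx"): "IRREGULAR-FORM"}


def find_leaf(node, fs):
    """Walk the tree as a discrimination network; return a leaf name or None."""
    if any(fs.get(feature) != value for feature, value in node["fvps"].items()):
        return None
    if not node["children"]:
        return node.get("name")
    for child in node["children"]:
        leaf = find_leaf(child, fs)
        if leaf is not None:
            return leaf
    return None


def generate(fs):
    """Irregular lexicon first, then the leaf's rule, else ROOT unchanged."""
    root = fs["ROOT"]
    leaf = find_leaf(MFH, fs)
    if leaf is None:  # assumption: no matching leaf also yields ROOT unchanged
        return root
    if (leaf, root) in IRREGULAR:
        return IRREGULAR[(leaf, root)]
    for pattern, replacement in RULES.get(leaf, []):
        if re.search(pattern, root):
            return re.sub(pattern, replacement, root)
    return root


fs = {"ROOT": "zawar", "CAT": "V", "CHG": "STEM", "TENSE": "PERF",
      "PERSON": 1, "NUMBER": "SG"}
print(generate(fs))  # zur
```

A feature structure whose ROOT matches no clause, such as one with ROOT "katab" in this sketch, comes back unchanged, mirroring the fallback behavior the text describes.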