Building an HPSG-based Indonesian Resource Grammar (INDRA)

David Moeljadi, Francis Bond, Sanghoun Song
Division of Linguistics and Multilingual Studies
Nanyang Technological University, Singapore
{D001,fcbond,sanghoun}@ntu.edu.sg

Abstract

This paper presents the creation and the initial stage development of a broad-coverage Indonesian Resource Grammar (INDRA) within the framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994) and Minimal Recursion Semantics (MRS) (Copestake et al., 2005). At the present stage, INDRA focuses on verbal constructions and subcategorization since they are fundamental for argument and event structure. Verbs in INDRA were semi-automatically acquired from the English Resource Grammar (ERG) (Flickinger, 2000) via Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014). In the future, INDRA will be used in the development process of machine translation. A preliminary evaluation of INDRA on the MRS test-suite shows promising coverage.

1 Introduction to Indonesian

Indonesian (ISO 639-3: ind) is a Western Malayo-Polynesian language of the Austronesian language family. Within this subgroup, it belongs to the Malayic branch with Standard Malay in Malaysia and other Malay varieties (Lewis, 2009). It is spoken mainly in the Republic of Indonesia as the sole official and national language and as the common language for hundreds of ethnic groups living there (Alwi et al., 2014, pp. 1-2). In Indonesia it is spoken by around 22.8 million people as their first language and by more than 140 million people as their second language. The lexical similarity is over 80% with Standard Malay (Lewis, 2009).

Morphologically, Indonesian is a mildly agglutinative language, compared to Finnish or Turkish where the morpheme-per-word ratio is higher (Larasati et al., 2011). It has a rich affixation system, including a variety of prefixes, suffixes, circumfixes, and reduplication. Most of the affixes are derivational. Two important inflectional affixes are the prefix meN- which marks active voice and di- which denotes passive voice (Sneddon et al., 2010, pp. 29, 72).

Indonesian has a strong tendency to be head-initial (Sneddon et al., 2010, pp. 26-28). In a noun phrase with an adjective, a demonstrative or a relative clause, the head noun precedes the adjective, the demonstrative or the relative clause. There is no agreement in Indonesian. In general, grammatical relations are only distinguished in terms of word order. As is often the case with Austronesian languages of Indonesia, Indonesian has a basic word order of SVO with a nominative-accusative alignment pattern. Argument alternations are triggered by passive and applicative constructions.

2 Background

This section introduces the background theory, as well as an overview of the Deep Linguistic Processing with HPSG Initiative (DELPH-IN) and the tools to build and develop INDRA.

2.1 Frameworks

INDRA uses the theoretical framework of HPSG (Pollard and Sag, 1994). HPSG is monostratal, handling orthography, syntax, semantics and pragmatics in a single structure (sign), modeled through typed feature structures. HPSG is unification- and constraint-based. The words and phrases are combined according to constraints of the lexical entries based on the type hierarchy. INDRA uses MRS (Copestake et al., 2005) as its semantic framework because it is adaptable for HPSG typed-feature structure and suitable for parsing and generation. The semantic structures in MRS are underspecified for scope and thus suitable for representing ambiguous scoping.

Proceedings of the Grammar Engineering Across Frameworks (GEAF) Workshop, 53rd Annual Meeting of the ACL and 7th IJCNLP, pages 9–16, Beijing, China, July 26-31, 2015.
© 2015 Association for Computational Linguistics

There is no previous work done on Indonesian HPSG but much has been done using Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982), e.g. Arka and Manning (2008) on active and passive voice and Arka (2000) on control constructions. In addition, Arka (2012) and Mistica (2013) have worked on the computational grammar "IndoGram" which is a part of the ParGram (Sulger et al., 2013).¹ However, it is not open-source or very broad in its coverage. Further, it does not produce MRS, so it cannot be easily incorporated into our machine translation system. Thus, there is a need to build and develop a broad-coverage open-source HPSG of Indonesian.

¹ http://iness.uib.no/iness/xle-web

2.2 DELPH-IN

The DELPH-IN consortium (Deep Linguistic Processing with HPSG Initiative, http://www.delph-in.net) is a research collaboration between linguists and computer scientists which builds and develops open-source grammars, tools for grammar development and applications using HPSG and MRS. More than fifteen grammars have been created and developed within DELPH-IN, e.g. English Resource Grammar (ERG) (Copestake and Flickinger, 2000) and Japanese grammar Jacy (Siegel and Bender, 2002). DELPH-IN grammars define typed feature structures using Type Description Language (TDL) (Copestake, 2002).

We make extensive use of several open-source tools for grammar development provided by DELPH-IN: Linguistic Knowledge Builder (LKB) (Copestake, 2002), a grammar and lexicon development environment for typed feature structure grammars; The LinGO Grammar Matrix (Bender et al., 2010), a web-based questionnaire for writing new DELPH-IN grammars, providing a wide range of phenomena and basic files to make the grammars compatible with DELPH-IN parsers and generators; Answer Constraint Engine (ACE) (http://sweaglesw.org/linguistics/ace/), an efficient processor for DELPH-IN grammars; ITSDB or [incr tsdb()] (Oepen and Flickinger, 1998), a tool for testing, profiling the performance of the grammar and treebanking; Full Forest Treebanker (FFTB) (http://moin.delph-in.net/FftbTop), a treebanking tool for DELPH-IN grammars, allowing the selection of an arbitrary tree from the "full forest" without enumerating all analyses in the parsing stage; and LOGON (Oepen et al., 2007), a collection of software, grammars, and other linguistic resources for transfer-based machine translation.

3 INDRA

This section describes some preliminary work as well as the methodology.

3.1 Methodology

The methodology used in INDRA follows Bender et al. (2008). We model our analysis in HPSG and implement it by editing some TDL files after analyzing a phenomenon based on reference grammars and other linguistic literature. Afterwards, we compile the grammar and test it by parsing sample sentences or test-suites. The grammar is debugged and developed further if some gaps or problems are found according to the parse results. Afterwards, the sample sentences in test-suites are parsed again and treebanked. This process is repeated iteratively. If problems are not found or the debugging process has finished with a good result, the grammar is updated in GitHub (https://github.com/davidmoeljadi/INDRA).

3.2 Grammar Development

INDRA was first created by filling in the required sections of the online page of the LinGO Grammar Matrix questionnaire, which covers basic grammar phenomena such as word order, tense-aspect-mode, coordination, morphology, and subcategorization of nouns and verbs (http://www.delph-in.net/matrix/customize/matrix.cgi). INDRA subcategorizes nouns into three groups: common noun, pronoun and proper name. Common nouns are subcategorized into inanimate, non-human and human based on three main classifiers in Indonesian: the classifier buah (lit. fruit) for inanimate nouns, ekor (lit. tail) for non-human animate nouns and orang (lit. person) for human nouns (Sneddon et al., 2010, p. 139; Alwi et al., 2014, p. 288).

Verbs are subcategorized into three groups: intransitive which has one argument, transitive which has two arguments and optional transitive which has one obligatory subject argument and one optional object argument as in Adi makan (nasi) "Adi eats (rice)". The verb subcategorization here follows Alwi et al. (2014, pp. 95-98). Besides the number of arguments, the possibility of passivization with morphological inflection plays an important role in distinguishing intransitives from transitives in Indonesian. Examples (1) and (2a) show intransitive and transitive sentences respectively.

(1) Adi tidur.
    Adi sleep
    "Adi sleeps."

(2) a. Adi mengejar Budi.
       Adi ACT-chase Budi
       "Adi chases Budi."

    b. Budi dikejar Adi.
       Budi PASS-chase Adi
       "Budi is chased by Adi."

    c. Budi saya kejar.
       Budi 1SG chase
       "Budi is chased by me."

In Example (2a), the verb mengejar is formed from an active prefix meN- and the base kejar (the initial sound k undergoes nasalization; see Section 4.2). The active prefix meN- is changed to a passive prefix di- in passive type one (Sneddon et al., 2010, pp. 256-257) in Example (2b), and the verb is left without an affix in passive type two (Sneddon et al., 2010, pp. 257-258) in Example (2c). Sneddon et al. (2010, pp. 256-257) state that in passive type one, the actor is third person or a noun, while in passive type two, the agent is a pronoun or pronoun substitute and it comes before the unprefixed verb.

The more detailed verb subcategorization into other groups such as ditransitive will be mentioned in the next subsection. The lexical items for each noun and verb subcategory were added and the affixes to support the active-passive voice were included. However, the Matrix does not handle morphology such as the nasalization process of meN-, which thus has to be added manually (see Section 4.2).

3.3 Lexical Acquisition

The lexicon is important in the robustness of the grammar. Since inputting words or lexical entries manually into the grammar is labor intensive and time consuming, doing lexical acquisition semi-automatically is vital. In order to do this, we need good lexical resources. We attempted to extract Indonesian verbs from Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014) and group them based on syntactic types in the ERG, such as intransitive, transitive, and ditransitive, using Python 3.4 and Natural Language Toolkit (NLTK) (Bird et al., 2009). The grouping of verbs (verb frames) in Wordnet (Fellbaum, 1998) is employed as the bridge between the English and Indonesian grammars.

Each verb synset in Wordnet (also Wordnet Bahasa) contains a list of sentence frames specified by the lexicographer illustrating the types of simple sentences in which the verbs in the synset can be used (Fellbaum, 1998). There are 35 verbal sentence frames in Wordnet; some of them are shown as follows with their frame numbers:

(3) 1   Something ----s
    8   Somebody ----s something
    21  Somebody ----s something PP

Frame 1 is a typical intransitive verbal sentence frame, as in the book fell; frame 8 is a typical (mono)transitive verbal sentence frame, as in he chases his friend; and frame 21 is a typical ditransitive verbal sentence frame, as in she put a book on a table. A verb may have more than one synset and each synset may have more than one verb frame, e.g. the verb eat has six synsets with each synset having different verb frames. Three of the six synsets, together with their definition and verb frames, are presented in Table 1. These verb frames can be employed as a bridge between the verb types (also verb lexical items) in ERG and those in INDRA.

Synset       Definition                        Verb frame
01168468-v   Take in solid food                8  Somebody ----s something
01166351-v   Eat a meal, take a meal           2  Somebody ----s
01157517-v   Use up (resources or materials)   11 Something ----s something
                                               8  Somebody ----s something

Table 1: Three of six synsets of the verb "eat" and their verb frames in Wordnet

Out of 354 verb types in ERG, the top eleven most frequently used types in the corpus were chosen, excluding the specific English verb types such as be-type verbs (e.g. is, be and was), have-type verbs, verbs with prepositions (e.g. depend on, refer to and look after) and modals (e.g. would, may and need). The chosen eleven verb types are given in Table 2. The third, fifth and eighth type (v_-_unacc_le, v_-_le and v_pp_unacc_le, all marked with † in Table 2) are regarded as the same type, i.e. intransitive verb type, in INDRA.

Verb type           Freq (Corpus)   Freq (Lexicon)   Examples of verb
v_pp*_dir_le        7079            204              go, come, hike
v_vp_seq_le         3921            105              want, like, try
v_-_unacc_le †      3144            334              close, start, end
v_np_noarg3_le      2723            5                make, take, give
v_-_le †            2666            486              arrive, occur, stand
v_np-pp_e_le        2439            334              compare, know, relate
v_pp*-cp_le         2360            154              think, add, note
v_pp_unacc_le †     2307            44               rise, fall, grow
v_np-pp_prop_le     1861            135              base, put, locate
v_cp_prop_le        1600            80               believe, know, find
v_np_ntr_le         1558            10               get, want, total

Table 2: The eleven most frequently used ERG verb types in the corpus

The first type contains verbs expressing movement or direction with optional PP complements, as in B crept into the room. The verbs in the second type are subject control verbs, as in B intended to win. The third type consists of unaccusative verbs without complements as in The plate gleamed. The fourth type contains verbs having two arguments (monotransitive) although they have a potential to be ditransitive as in B took the book. The fifth type contains intransitive (unergative) verbs as in B arose. The verbs in the sixth type have obligatory NP and PP complements as in B compared C with D. The verbs in the seventh type are verbs with optional PP complements and obligatory subordinate clauses as in B said to C that D won. Unaccusative verbs with optional PP complements as in The seed grew into a tree belong to the eighth type. Ditransitive verbs with obligatory NPs and PPs with state result as in B put C on D belong to the ninth type. The tenth type consists of verbs with optional complementizers as in B hoped (that) C won and the eleventh type consists of verbs with obligatory NP complements which cannot be passivized as in B remains C.

Based on the syntactic information of each verb type mentioned above, the corresponding verb frames in Wordnet were manually chosen. For example, the first type contains intransitive verbs with optional PP; thus, the verb frames should be Sb ----s and Sb ----s PP. The intransitive verbs without complements should correspond to the verb frames Sth ----s or Sb ----s, regardless of whether the subject is a thing or a person. Table 3 shows the eleven verb types in ERG and their corresponding Wordnet verb frames.

Verb type           Verb frame
v_pp*_dir_le        2 Sb ----s & 22 Sb ----s PP
v_vp_seq_le         28 Sb ----s to INFINITIVE
v_-_unacc_le,       1 Sth ----s ||
v_-_le,             2 Sb ----s
v_pp_unacc_le
v_np_noarg3_le      8 Sb ----s sth ||
                    11 Sth ----s sth
v_np-pp_e_le        15 Sb ----s sth to sb ||
                    17 Sb ----s sb with sth ||
                    20 Sb ----s sb PP ||
                    21 Sb ----s sth PP ||
                    31 Sb ----s sth with sth
v_pp*-cp_le         26 Sb ----s that CLAUSE
v_np-pp_prop_le     20 Sb ----s sb PP ||
                    21 Sb ----s sth PP
v_cp_prop_le        26 Sb ----s that CLAUSE
v_np_ntr_le         8 Sb ----s sth ||
                    11 Sth ----s sth

Table 3: The eleven most frequently used ERG verb types in the corpus and their corresponding Wordnet verb frames (sb = somebody, sth = something, & = AND, || = OR)

First, we checked for each verb in each verb type in Table 2 whether it is in Wordnet or not. If it could be found in Wordnet, the next step was to check whether the verb includes the verb frames mentioned in Table 3 or not. This step had to be done in order to find the right synset, since a verb can have many synsets with different verb frames, as shown in Table 1. After the right synset was found, the corresponding Indonesian lemmas or translations were checked. One synset may have more than one Indonesian lemma or may not have Indonesian lemmas at all.

The next important step is to check, one by one, whether the Indonesian lemmas belonging to the same synset and verb frames can each be grouped in the same verb type or not. This manual step has to be done because grouping verbs in a particular language into types is language-specific work. Arka (2000) states that languages vary with respect to their lexical stock of "synonymous" verbs that may have different argument structures, e.g. the verb know can be both intransitive and transitive in Indonesian tahu and ketahui respectively, transitive only with an obligatory NP in Balinese² tawang, and transitive with optional NP in English know. Lastly, after the Indonesian verbs were extracted and grouped into their cor-

² Balinese (ISO 639-3: ban) is a Western Malayo-Polynesian language of the Austronesian language family. It belongs to the Malayo-Sumbawan branch. It is mainly spoken in the island of Bali in the Republic of Indonesia as a regional language (Lewis, 2009).