Building an HPSG-based Indonesian Resource Grammar (INDRA)

David Moeljadi, Francis Bond, Sanghoun Song
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Singapore
{D001,fcbond,sanghoun}@ntu.edu.sg

Abstract

This paper presents the creation and the initial stage development of a broad-coverage Indonesian Resource Grammar (INDRA) within the framework of Head Driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994) and Minimal Recursion Semantics (MRS) (Copestake et al., 2005). At the present stage, INDRA focuses on verbal constructions and subcategorization since they are fundamental for argument and event structure. Verbs in INDRA were semi-automatically acquired from the English Resource Grammar (ERG) (Flickinger, 2000) via Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014). In the future, INDRA will be used in the development process of machine translation. A preliminary evaluation of INDRA on the MRS test-suite shows promising coverage.

1    Introduction to Indonesian

Indonesian (ISO 639-3: ind) is a Western Malayo-Polynesian language of the Austronesian language family. Within this subgroup, it belongs to the Malayic branch with Standard Malay in Malaysia and other Malay varieties (Lewis, 2009). It is spoken mainly in the Republic of Indonesia as the sole official and national language and as the common language for hundreds of ethnic groups living there (Alwi et al., 2014, pp. 1-2). In Indonesia it is spoken by around 22.8 million people as their first language and by more than 140 million people as their second language. The lexical similarity is over 80% with Standard Malay (Lewis, 2009).

Morphologically, Indonesian is a mildly agglutinative language, compared to Finnish or Turkish where the morpheme-per-word ratio is higher (Larasati et al., 2011). It has a rich affixation system, including a variety of prefixes, suffixes, circumfixes, and reduplication. Most of the affixes are derivational. Two important inflectional affixes are the prefix meN- which marks active voice and di- which denotes passive voice (Sneddon et al., 2010, pp. 29, 72).

Indonesian has a strong tendency to be head-initial (Sneddon et al., 2010, pp. 26-28). In a noun phrase with an adjective, a demonstrative or a relative clause, the head noun precedes the adjective, the demonstrative or the relative clause. There is no agreement in Indonesian. In general, grammatical relations are only distinguished in terms of word order. As is often the case with Austronesian languages of Indonesia, Indonesian has a basic word order of SVO with a nominative-accusative alignment pattern. Argument alternations are triggered by passive and applicative constructions.

2    Background

This section introduces the background theory, as well as an overview of the Deep Linguistic Processing with HPSG Initiative (DELPH-IN) and the tools to build and develop INDRA.

2.1    Frameworks

INDRA uses the theoretical framework of HPSG (Pollard and Sag, 1994). HPSG is monostratal, handling orthography, syntax, semantics and pragmatics in a single structure (sign), modeled through typed feature structures. HPSG is unification- and constraint-based. The words and phrases are combined according to constraints of the lexical entries based on the type hierarchy. INDRA uses MRS (Copestake et al., 2005) as its semantic framework because it is adaptable for HPSG typed-feature structure and suitable for parsing and generation. The semantic structures in MRS are underspecified for scope and thus suitable for representing ambiguous scoping.
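
To make "unification-based" concrete, the following is a minimal sketch in Python of unification over feature structures represented as nested dictionaries. It is an illustration only, not how INDRA or the LKB is implemented: it ignores the type hierarchy and structure sharing (reentrancy) that typed feature structure grammars depend on, and the feature names (HEAD, SUBJ, ANIMACY) and their values are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves).

    Returns the merged structure, or None when two atomic values clash.
    Simplifying assumptions: no types, no reentrancy.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are left untouched
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # a clash anywhere fails the whole unification
                result[feature] = merged
            else:
                result[feature] = value  # information only present in b
        return result
    # Atomic values (or a dict against an atom) unify only if identical.
    return a if a == b else None


if __name__ == "__main__":
    # Hypothetical constraint contributed by a verb: its subject is nominal.
    verb = {"HEAD": "verb", "SUBJ": {"HEAD": "noun"}}
    # Hypothetical extra information about the subject from another source.
    subject_info = {"SUBJ": {"HEAD": "noun", "ANIMACY": "human"}}

    print(unify(verb, subject_info))
    print(unify({"HEAD": "verb"}, {"HEAD": "noun"}))  # incompatible values
```

Compatible information from both structures is accumulated, while any conflicting atomic value makes the whole combination fail; this is the sense in which constraints from lexical entries restrict how words and phrases combine.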

Proceedings of the Grammar Engineering Across Frameworks (GEAF) Workshop, 53rd Annual Meeting of the ACL and 7th IJCNLP, pages 9–16, Beijing, China, July 26-31, 2015. © 2015 Association for Computational Linguistics

There is no previous work done on Indonesian HPSG but much has been done using Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982), e.g. Arka and Manning (2008) on active and passive voice and Arka (2000) on control constructions. In addition, Arka (2012) and Mistica (2013) have worked on the computational grammar "IndoGram" which is a part of the ParGram (Sulger et al., 2013).¹ However, it is not open-source or very broad in its coverage. Further, it does not produce MRS, so cannot be easily incorporated into our machine translation system. Thus, there is a need to build and develop a broad-coverage open-source HPSG of Indonesian.

¹ http://iness.uib.no/iness/xle-web

2.2    DELPH-IN

The DELPH-IN consortium (Deep Linguistic Processing with HPSG Initiative, http://www.delph-in.net) is a research collaboration between linguists and computer scientists which builds and develops open source grammar, tools for grammar development and applications using HPSG and MRS. More than fifteen grammars have been created and developed within DELPH-IN, e.g. English Resource Grammar (ERG) (Copestake and Flickinger, 2000) and Japanese grammar Jacy (Siegel and Bender, 2002). DELPH-IN grammars define typed feature structures using Type Description Language (TDL) (Copestake, 2002).

We make extensive use of several open-source tools for grammar development provided by DELPH-IN: Linguistic Knowledge Builder (LKB) (Copestake, 2002), a grammar and lexicon development environment for typed feature structure grammars; The LinGO Grammar Matrix (Bender et al., 2010), a web-based questionnaire for writing new DELPH-IN grammars, providing a wide range of phenomena and basic files to make the grammars compatible with DELPH-IN parsers and generators; Answer Constraint Engine (ACE) (http://sweaglesw.org/linguistics/ace/), an efficient processor for DELPH-IN grammars; ITSDB or [incr tsdb()] (Oepen and Flickinger, 1998), a tool for testing, profiling the performance of the grammar and treebanking; Full Forest Treebanker (FFTB) (http://moin.delph-in.net/FftbTop), a treebanking tool for DELPH-IN grammars, allowing the selection of an arbitrary tree from the "full forest" without enumerating all analyses in the parsing stage; and LOGON (Oepen et al., 2007), a collection of software, grammars, and other linguistic resources for transfer-based machine translation.

3    INDRA

This section describes some preliminary work as well as the methodology.

3.1    Methodology

The methodology used in INDRA follows Bender et al. (2008). We model our analysis in HPSG and implement it by editing some TDL files after analyzing a phenomenon based on reference grammars and other linguistic literature. Afterwards, we compile the grammar and test it by parsing sample sentences or test-suites. The grammar is debugged and developed further if some gaps or problems are found according to the parse results. Afterwards, the sample sentences in test-suites will be parsed again and treebanked. This process is repeated. If problems are not found or the debugging process has finished with a good result, the grammar will be updated in GitHub (https://github.com/davidmoeljadi/INDRA).

3.2    Grammar Development

INDRA was created firstly by filling in the required sections of the online page of LinGO Grammar Matrix questionnaire which covers basic grammar phenomena such as word order, tense-aspect-mode, coordination, morphology, subcategorization of nouns and verbs (http://www.delph-in.net/matrix/customize/matrix.cgi). INDRA subcategorizes nouns into three groups: common noun, pronoun and proper name. Common nouns are subcategorized into inanimate, non-human and human based on three main classifiers in Indonesian: the classifier buah (lit. fruit) for inanimate nouns, ekor (lit. tail) for non-human animate nouns and orang (lit. person) for human nouns (Sneddon et al., 2010, p. 139; Alwi et al., 2014, p. 288).

Verbs are subcategorized into three groups: intransitive which has one argument, transitive which has two arguments and optional transitive which has one obligatory subject argument and one optional object argument as in Adi makan (nasi) "Adi eats (rice)". The verb subcategorization here follows Alwi et al. (2014, pp. 95-98). Besides the number of arguments, the possibility of passivization with morphological inflection plays an important role in distinguishing intransitives from transitives in Indonesian. Examples (1) and (2a) show intransitive and transitive sentences respectively.

(1)   Adi tidur.
      Adi sleep
      "Adi sleeps."

(2)   a.   Adi mengejar  Budi.
           Adi ACT-chase Budi
           "Adi chases Budi."

      b.   Budi dikejar    Adi.
           Budi PASS-chase Adi
           "Budi is chased by Adi."

      c.   Budi saya kejar.
           Budi 1SG  chase
           "Budi is chased by me."

In Example (2a), the verb mengejar is formed from an active prefix meN- and the base kejar (the initial sound k undergoes nasalization; see Section 4.2). The active prefix meN- is changed to a passive prefix di- in passive type one (Sneddon et al., 2010, pp. 256-257) in Example (2b) and without affix in passive type two (Sneddon et al., 2010, pp. 257-258) in Example (2c). Sneddon et al. (2010, pp. 256-257) state that in passive type one, the actor is third person or a noun, while in passive type two, the agent is a pronoun or pronoun substitute and it comes before the unprefixed verb.

The more detailed verb subcategorization into other groups such as ditransitive will be mentioned in the next subsection. The lexical items for each noun and verb subcategory were added and the affixes to support the active-passive voice were included. However, the Matrix does not handle morphology as in the nasalization process of meN-, which thus has to be added manually (see Section 4.2).

3.3    Lexical Acquisition

The lexicon is important in the robustness of the grammar. Since inputting words or lexical entries manually into the grammar is labor intensive and time consuming, doing lexical acquisition semi-automatically is vital. In order to do this, we need good lexical resources. We attempted to extract Indonesian verbs from Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014) and group them based on syntactic types in the ERG, such as intransitive, transitive, and ditransitive, using Python 3.4 and Natural Language Toolkit (NLTK) (Bird et al., 2009). The grouping of verbs (verb frames) in Wordnet (Fellbaum, 1998) is employed to be the bridge between the English and Indonesian grammar.

Each verb synset in Wordnet (also Wordnet Bahasa) contains a list of sentence frames specified by the lexicographer illustrating the types of simple sentences in which the verbs in the synset can be used (Fellbaum, 1998). There are 35 verbal sentence frames in Wordnet; some of them are shown as follows with their frame numbers:

(3)   1   Something ----s
      8   Somebody ----s something
      21  Somebody ----s something PP

Frame 1 is a typical intransitive verbal sentence frame, as in the book fell; frame 8 is a typical (mono)transitive verbal sentence frame, as in he chases his friend; and frame 21 is a typical ditransitive verbal sentence frame, as in she put a book on a table. A verb may have more than one synset and each synset may have more than one verb frame, e.g. the verb eat has six synsets with each synset having different verb frames. Three of the six synsets, together with their definition and verb frames, are presented in Table 1. These verb frames can be employed as a bridge between the verb types (also verb lexical items) in ERG and those in INDRA.

Synset        Definition                        Verb frame
01168468-v    Take in solid food                8  Somebody ----s something
01166351-v    Eat a meal, take a meal           2  Somebody ----s
01157517-v    Use up (resources or materials)   11 Something ----s something
                                                8  Somebody ----s something

Table 1: Three of six synsets of the verb "eat" and their verb frames in Wordnet

Out of 354 verb types in ERG, the top eleven most frequently used types in the corpus were chosen, excluding the specific English verb types such as be-type verbs (e.g. is, be and was), have-type verbs, verbs with prepositions (e.g. depend on, refer to and look after) and modals (e.g. would, may and need). The chosen eleven verb types are given in Table 2. The third, fifth and eighth type (v_-_unacc_le, v_-_le and v_pp_unacc_le, all written in bold in Table 2) are regarded as the same type, i.e. intransitive verb type, in INDRA.

Verb type          Freq (Corp)   Freq (Lex)   Examples of verb
v_pp*_dir_le       7079          204          go, come, hike
v_vp_seq_le        3921          105          want, like, try
v_-_unacc_le       3144          334          close, start, end
v_np_noarg3_le     2723          5            make, take, give
v_-_le             2666          486          arrive, occur, stand
v_np-pp_e_le       2439          334          compare, know, relate
v_pp*-cp_le        2360          154          think, add, note
v_pp_unacc_le      2307          44           rise, fall, grow
v_np-pp_prop_le    1861          135          base, put, locate
v_cp_prop_le       1600          80           believe, know, find
v_np_ntr_le        1558          10           get, want, total

Table 2: The eleven most frequently used ERG verb types in the corpus

The first type contains verbs expressing movement or direction with optional PP complements, as in B crept into the room. The verbs in the second type are subject control verbs, as in B intended to win. The third type consists of unaccusative verbs without complements as in The plate gleamed. The fourth type contains verbs having two arguments (monotransitive) although they have a potential to be ditransitive as in B took the book. The fifth type contains intransitive (unergative) verbs as in B arose. The verbs in the sixth type have obligatory NP and PP complements as in B compared C with D. The verbs in the seventh type are verbs with optional PP complements and obligatory subordinate clauses as in B said to C that D won. Unaccusative verbs with optional PP complements as in The seed grew into a tree belong to the eighth type. Ditransitive verbs with obligatory NPs and PPs with state result as in B put C on D belong to the ninth type. The tenth type consists of verbs with optional complementizers as in B hoped (that) C won and the eleventh type consists of verbs with obligatory NP complements which cannot be passivized as in B remains C.

Based on the syntactic information of each verb type mentioned above, the corresponding verb frames in Wordnet were manually chosen. For example, the first type contains intransitive verbs with optional PP; thus, the verb frames should be Sb ----s and Sb ----s PP. The intransitive verbs without complements should correspond to the verb frames Sth ----s or Sb ----s, regardless of whether the subject is a thing or a person. Table 3 shows the eleven verb types in ERG and their corresponding Wordnet verb frames.

Verb type          Verb frame
v_pp*_dir_le       2 Sb ----s & 22 Sb ----s PP
v_vp_seq_le        28 Sb ----s to INFINITIVE
v_-_unacc_le,      1 Sth ----s || 2 Sb ----s
v_-_le,
v_pp_unacc_le
v_np_noarg3_le     8 Sb ----s sth || 11 Sth ----s sth
v_np-pp_e_le       15 Sb ----s sth to sb || 17 Sb ----s sb with sth ||
                   20 Sb ----s sb PP || 21 Sb ----s sth PP ||
                   31 Sb ----s sth with sth
v_pp*-cp_le        26 Sb ----s that CLAUSE
v_np-pp_prop_le    20 Sb ----s sb PP || 21 Sb ----s sth PP
v_cp_prop_le       26 Sb ----s that CLAUSE
v_np_ntr_le        8 Sb ----s sth || 11 Sth ----s sth

Table 3: The eleven most frequently used ERG verb types in the corpus and their corresponding Wordnet verb frames (sb = somebody, sth = something, & = AND, || = OR)

First, we checked for each verb in each verb type in Table 2 whether it is in Wordnet or not. If it could be found in Wordnet, the next step was to check whether the verb includes the verb frames mentioned in Table 3 or not. This step had to be done in order to find out the right synset since a verb can have many synsets but different verb frames as shown in Table 1. After the right synset was found, the corresponding Indonesian lemmas or translations were checked. One synset may have more than one Indonesian lemma or may not have Indonesian lemmas at all.

The next important step is to check one by one the Indonesian lemmas belonging to the same synset and verb frames whether each can be grouped in the same verb type or not. This manual step has to be done because grouping verbs in a particular language into types is language-specific work. Arka (2000) states that languages vary with respect to their lexical stock of "synonymous" verbs that may have different argument structures, e.g. the verb know can be both intransitive and transitive in Indonesian tahu and ketahui respectively, transitive only with an obligatory NP in Balinese² tawang, and transitive with optional NP in English know. Lastly, after the Indonesian verbs were extracted and grouped into their cor-

² Balinese (ISO 639-3: ban) is a Western Malayo-Polynesian language of the Austronesian language family. It belongs to the Malayo-Sumbawan branch. It is mainly spoken in the island of Bali in the Republic of Indonesia as a regional language (Lewis, 2009).
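
The frame-matching step of Section 3.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' script: the names TYPE_FRAMES, EAT_SYNSETS and matching_synsets are hypothetical, the frame numbers are hard-coded from Tables 1 and 3 rather than read from Wordnet through NLTK, and the type names assume the underscore-separated spelling used in the ERG. The verb eat is used only because Table 1 lists its frames; the paper does not say which ERG type it belongs to.

```python
# Table 3 as data: ERG verb type -> (mode, Wordnet frame numbers).
# "all" encodes "&" (every frame is required); "any" encodes "||".
TYPE_FRAMES = {
    "v_pp*_dir_le":    ("all", {2, 22}),
    "v_vp_seq_le":     ("any", {28}),
    "v_-_unacc_le":    ("any", {1, 2}),
    "v_-_le":          ("any", {1, 2}),
    "v_pp_unacc_le":   ("any", {1, 2}),
    "v_np_noarg3_le":  ("any", {8, 11}),
    "v_np-pp_e_le":    ("any", {15, 17, 20, 21, 31}),
    "v_pp*-cp_le":     ("any", {26}),
    "v_np-pp_prop_le": ("any", {20, 21}),
    "v_cp_prop_le":    ("any", {26}),
    "v_np_ntr_le":     ("any", {8, 11}),
}

# Table 1 as data: three synsets of "eat" and their frame numbers.
EAT_SYNSETS = {
    "01168468-v": {8},
    "01166351-v": {2},
    "01157517-v": {11, 8},
}


def matching_synsets(verb_type, synset_frames):
    """Return the synset IDs whose frames satisfy the given ERG verb type."""
    mode, wanted = TYPE_FRAMES[verb_type]
    if mode == "all":
        # Every required frame must be listed for the synset.
        return sorted(s for s, f in synset_frames.items() if wanted <= f)
    # At least one of the alternative frames must be listed.
    return sorted(s for s, f in synset_frames.items() if wanted & f)


if __name__ == "__main__":
    for verb_type in ("v_np_noarg3_le", "v_-_le", "v_pp*_dir_le"):
        print(verb_type, matching_synsets(verb_type, EAT_SYNSETS))
```

The synsets selected this way are only candidates: as described above, the Indonesian lemmas attached to each one still have to be checked by hand before they are assigned to a verb type.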
                                                                       12