A Morphological Analyzer for Gulf Arabic Verbs

We present CALIMA GLF, a Gulf Arabic morphological analyzer currently covering over 2,600 verbal lemmas. We describe in detail the process of building the analyzer, from phonetic dictionary entries to fully inflected orthographic paradigms with an associated lexicon and orthographic variants. We evaluate the coverage of CALIMA GLF against Modern Standard Arabic and Egyptian Arabic analyzers on part of a Gulf Arabic novel. In terms of verb analysis token recall for identifying the correct POS tag, CALIMA GLF outperforms the Modern Standard Arabic and Egyptian Arabic analyzers by over 27.4% and 16.9% absolute, respectively.


Introduction
Until recently, Dialectal Arabic (DA) was mainly spoken, with little to no publicly available written content. Modern Standard Arabic (MSA), on the other hand, is the official language in more than 20 countries, where most written documents, from news articles to educational materials and entertainment magazines, are written in MSA. Hence, most of the tools available for Natural Language Processing (NLP) tasks focus on MSA. With the introduction of social media platforms, dialectal written content is being produced abundantly. Existing tools developed for MSA have proved to have limited performance on DA (Habash and Rambow, 2006; Khalifa et al., 2016). Resources specific to DA, such as morphological lexicons, are important for Arabic NLP tasks such as part-of-speech (POS) tagging and morphological disambiguation. Recently, dialects such as Egyptian (EGY) and Levantine (LEV) Arabic have been receiving increasing attention. Morphological analyzers for EGY and LEV proved to perform well when used for morphological tagging (Eskander et al., 2016). To our knowledge, there exist no full morphological analyzers for Gulf Arabic (GLF) that produce segmentation, POS analysis and lemmas, although we note the work of Abuata and Al-Omari (2015) on developing a Gulf Arabic stemmer. In this paper, we present CALIMA GLF,¹ a morphological analyzer for GLF. The current work focuses on GLF verbs only. We utilize a combination of computational techniques and explicit linguistic knowledge to create this resource. We also evaluate it against wide-coverage tools for MSA and EGY. In terms of verb analysis token recall for identifying the correct POS tag, CALIMA GLF outperforms the MSA and EGY analyzers by over 27.4% and 16.9% absolute, respectively. CALIMA GLF will be made publicly available to researchers working on Arabic and Arabic dialect NLP.²
The rest of this paper is organized as follows. In Section 2 we review related literature, then we briefly describe the main characteristics of GLF in Section 3. In Section 4 we describe our approach and the resources involved, and we evaluate the analyzer in Section 5. We conclude and discuss future work in Section 6.
Related Work

Arabic Morphological Modeling
Much work has been done on Arabic morphological modeling, covering a wide range of different system designs. Earlier systems such as BAMA, SAMA and MAGEAD (Buckwalter, 2004; Graff et al., 2009; Habash and Rambow, 2006) were entirely manually designed. Similarly, Habash et al. (2012a) developed CALIMA, a morphological analyzer for Egyptian Arabic (hence CALIMA EGY). CALIMA EGY was developed based on a lexicon of morphologically annotated data using several methods and then manually verified. Furthermore, Salloum and Habash (2011) extended the existing SAMA and CALIMA EGY resources using hand-crafted rules that add new affixes and clitics by matching on existing ones. Recently, Eskander et al. (2013) developed a technique that generates a morphological analyzer from an annotated corpus. They define inflectional classes for lexemes that represent morphosyntactic features in addition to inflected stems. They automatically 'complete' these classes in a process called paradigm completion. They also show that using manually annotated iconic inflectional classes helps overall performance. Using this paradigm completion technique, morphological analyzers for Moroccan Arabic and Sanaani Yemeni Arabic were created (Al-Shargi et al., 2016). Very recently, Eskander et al. (2016) presented a single pipeline to produce a morphological analyzer and tagger from a single annotation of a corpus; they produced resources for EGY and LEV. Other work on DA morphological modeling includes that of Abuata and Al-Omari (2015), who developed a rule-based system to segment affixes and clitics in GLF text. They compare their results to other well-known MSA stemmers.
¹ In Arabic, kalimah means 'word'. We follow the naming convention of Habash et al. (2012a), who developed CALIMA EGY, since we use the same format and analysis engine for the databases we create.
² CALIMA GLF can be obtained from http://camel.abudhabi.nyu.edu/resources/.
In this paper, we create morphological paradigms similar to the iconic inflectional classes discussed by Eskander et al. (2013). Our paradigms map from morphological features to fully inflected orthographic forms. The paradigms abstract over templatic roots, and lexical entries are specified in a lexicon as root-paradigm pairs, in a manner similar to the work of Habash and Rambow (2006). We convert the paradigms to the database representation used in MADAMIRA (Pasha et al., 2014) and CALIMA EGY (Habash et al., 2012a).

Dialectal Orthography
Due to the lack of standardized orthography guidelines for DA, and given the major differences from MSA, dialects are usually written in ways that reflect the words' pronunciation or their etymological relation to MSA cognates (Habash et al., 2012b), and even then with a lot of inconsistency. Furthermore, as with MSA, Arabic orthography ignores the spelling of short vowel diacritics, thus increasing the ambiguity of the written forms. As a result, it is rather challenging to computationally process raw DA text directly from the source, or even to agree on a common normalization. Habash et al. (2012b) proposed a Conventional Orthography for Dialectal Arabic (CODA) as part of a solution allowing different researchers to agree on a set of DA orthographic conventions for computational purposes. CODA was first defined for EGY, but has been extended to Palestinian, Tunisian, Algerian, Maghrebi and Gulf Arabic (Jarrar et al., 2014; Zribi et al., 2014; Saadane and Habash, 2015; Turki et al., 2016; Khalifa et al., 2016). We follow the conventions defined by Khalifa et al. (2016) for CODA GLF.

Dialectal Arabic Resources
In addition to the above-mentioned morphological analyzers, there exist other resources such as dictionaries and corpora for both DA and MSA. Several annotated MSA corpora have been developed (Maamouri and Cieri, 2002; Maamouri et al., 2004; Smrž and Hajič, 2006; Habash and Roth, 2009; Zaghouani et al., 2014).
Specifically for GLF, we use the Qafisheh Gulf Arabic Dictionary (Qafisheh, 1997) as well as the Gumar Corpus (Khalifa et al., 2016) in developing our analyzer.

Background
From a linguistic point of view, Gulf Arabic refers to the linguistic varieties spoken on the western coast of the Arabian Gulf, that is, in Bahrain, Qatar, and the seven Emirates of the United Arab Emirates, as well as in Kuwait and the eastern region of Saudi Arabia (Holes, 1990; Qafisheh, 1977). We extend the use of the term 'Gulf Arabic' (GLF) to include any Arabic variety spoken by the indigenous populations residing in the six countries of the Gulf Cooperation Council. In this paper, we focus specifically on Emirati Arabic.

Orthography
Similar to other dialects, GLF has no standard orthography (Habash et al., 2012b). As such, words may be written in a manner reflecting their pronunciation or their etymological relationship to MSA cognates. For example, the word for 'dawn' /al-fayr/ may be written as Alfyr (reflecting pronunciation) or as Alfjr (reflecting its MSA cognate). In this work we follow the CODA standards for GLF introduced by Khalifa et al. (2016), which extend the original CODA of Habash et al. (2012b). We use CODA in developing the morphological databases, but we also add support for non-CODA variants and evaluate on raw non-CODA input. Another challenge caused by Arabic orthography in general (for MSA and all dialects, including GLF) is that it does not require writing short vowel diacritics, which adds a lot of ambiguity.

Morphology
GLF shares many of the morphological complexities of MSA and other Arabic dialects. Arabic's rich morphology is represented templatically and affixationally, with a number of attachable clitics. This representation, together with the fact that short vowel diacritics are usually dropped in text, adds to the text's ambiguity. In comparison to MSA, EGY and LEV, GLF is similar in some aspects and different in others:
• Like MSA, but unlike EGY and LEV, GLF has no negation enclitic marker, namely the iš '[negation]' ending, as in mA qultiš in EGY and LEV as opposed to mA qilt in GLF 'I did not say'.
• The future verbal particle in GLF is b, which is different from the MSA equivalent (sa), and can be easily confused with the present progressive particle b in both EGY and LEV. GLF does not have a progressive particle.

General Approach
Our goal is to build a morphological analysis and generation model for GLF. We focus on verb forms in this paper, but plan to extend the work to other POS in the future. We employ two databases that capture the full morphological inflection space from lemmas and morphological features to fully inflected surface forms and in reverse. The two databases are (1) a collection of root-abstracted paradigms which map from features to root-abstracted stems, prefixes and suffixes; and (2) a lexicon specifying verbal entries in terms of roots and paradigm IDs. These two structures together define for any verb all the possible analyses allowed within GLF morphology. The two databases are then merged to create a full model. The merging can be done as a finite state machine. However, the implementation we chose is a variant based on the BAMA/SAMA databases following the representation used in MADAMIRA (Pasha et al., 2014) and CALIMA EGY (Habash et al., 2012a).
Next we discuss step by step the process we took to build CALIMA GLF, starting with a phonetic dictionary and ending with a fully functional morphological analyzer that even models non-CODA spelling variants.

The Qafisheh Gulf Arabic Verb Lexicon
Our starting point is the Qafisheh Gulf Arabic Verb Lexicon (QGAVL), which is a portion of the Qafisheh (1997) dictionary. Each entry in the lexicon includes a root, perfective and imperfective verb inflections, Verb Form (as in Form II or VII) and English gloss. See Table 1 for some example entries. The Arabic entries are in a phonetic representation and not in Arabic script. The verb forms are only in the third person masculine singular inflection (PV3MS and IV3MS, for perfective and imperfective aspect, respectively), and no clitics are attached. In total, there are 2,648 verb entries.

Orthographic Mapping
The first step we took was to create the orthographic spelling of all the verb entries. This included mapping to the appropriate vowel spelling as well as following the CODA spelling rules for stem consonants and morphemes. This step was first done automatically and then checked manually for every entry. See Table 2 for an example of the result of mapping the entries in Table 1. We mapped the roots in two ways, one following CODA and one reflecting a phonological spelling. This information will be used later to make the analyzer robust to non-CODA spellings.

PV-IV Pattern Extraction
Next, we identified for each verb its orthographic inflected templatic pattern, i.e., the pattern that would directly produce the surface form once the root radicals are inserted. This approach to pattern definition is most like the work of Eskander et al. (2013) in that it is a one-shot application of root-template merging to generate surface orthography. The approach differs from the work of Habash and Rambow (2006), who use a large number of rewrite rules for phonology, morphology and orthography after inserting the roots into the templates.
The pattern extraction was done automatically and then manually checked. It was only applied to the forms available in the lexicon so far (PV3MS and IV3MS). The PV-IV Pattern (perfective-imperfective pattern) uses digits (e.g., 1, 2, 3, 4, 5) to represent root radicals. In this pattern, all vowels and glottal stop (Hamza) forms are explicitly spelled because they tend to vary within single paradigms. For example, the first entry in Table 5 specifies the PV-IV Pattern 1A3-y1uw3, which when merged with the root radicals qwl generates the perfective and imperfective forms qAl and yquwl.
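The root-template merging described above can be illustrated with a minimal sketch. It assumes that each root radical is a single transliteration character and that the two halves of a PV-IV Pattern are separated by a hyphen; the function names are hypothetical and are not part of the CALIMA GLF toolkit.

```python
def merge_root_template(root: str, template: str) -> str:
    """Insert root radicals into a template: digit i stands for the i-th radical."""
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in template)


def expand_pv_iv(root: str, pv_iv_pattern: str) -> tuple:
    """Split a PV-IV Pattern into its two templates and merge each with the root.

    Assumption: the first two hyphen-separated parts are the PV3MS and IV3MS
    templates; any further part of the key is ignored in this sketch.
    """
    parts = pv_iv_pattern.split("-")
    pv_template, iv_template = parts[0], parts[1]
    return (merge_root_template(root, pv_template),
            merge_root_template(root, iv_template))


# The example from the text: root qwl with pattern 1A3-y1uw3.
print(expand_pv_iv("qwl", "1A3-y1uw3"))  # ('qAl', 'yquwl')
```

Note that the second radical w does not surface in the perfective qAl because the template 1A3 never references digit 2, which is how the pattern encodes the hollow-root alternation without a separate rewrite rule.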

Basic Paradigm Construction
We identified 72 unique PV-IV patterns in the lexicon, which represent 72 different paradigms. Arabic Verb Forms (I, II, III, etc.) are too general to capture the different variations within the paradigms. This is due to the different root classes (i.e., hamzated, hollow, defective, geminate and sound) and other root-pattern interactions, such as the different forms of Form VIII. All of these phenomena can be handled with orthographic, phonological and morphological rules, as was done by Habash and Rambow (2006). However, here we embed the result of such rule application in the paradigm directly. See Table 3 for counts of PV-IV patterns per Verb Form.
We use the PV-IV patterns as keys (indices) for the paradigms. We then proceed to build a database of Basic Paradigms (BP). A BP is defined as the complete set of possible morphological features (except for clitic features) along with the corresponding stem. The features included are Aspect (perfective PV, imperfective IV, command CV), Person (1, 2, 3), Gender (masculine M, feminine F, unspecified U), and Number (singular S, plural P). The total number of allowable feature combinations is 19. The BP is defined in a similar fashion to the iconic inflectional classes defined by Eskander et al. (2013). Each form of the BP is divided into prefix, stem template, and suffix. See Table 4 for examples of two BPs.

Affixational Orthographic Rules
While we covered most of the orthographic, phonological and morphological rules by embedding them in the BPs, there are still a small number of additional orthographic rules that apply to specific stem-suffix combinations. Specifically, suffixes beginning with t or n that attach to stems ending with the same letter are modified as a result of the orthographic gemination rule (Shadda). For example, the verb naHat+t 'I sculpted' should be written as naHat∼, and the verb Daman+nA 'we guaranteed' should be written as Daman∼A. We automatically identified all root-paradigm pairs that cause the above rules to apply, and we created new paradigms from them. For example, the root nHt is linked with the paradigm 1a2a3-yi12a3-t and the root Dmn is linked with the paradigm 1a2a3-yi12a3-n. This resulted in 32 additional paradigms, bringing the total to 104 paradigms.
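As a small illustration of the gemination rule, the sketch below attaches a suffix to a stem and replaces a doubled t or n with the Shadda symbol. This is only an illustration of the rule itself: in the approach described above the result is precompiled into dedicated paradigms and not computed at analysis time, and the function name is hypothetical.

```python
SHADDA = "∼"  # gemination mark, written as in the transliteration used in the text


def attach_suffix(stem: str, suffix: str) -> str:
    """Attach a suffix, applying the orthographic gemination rule for t and n."""
    if stem and suffix and suffix[0] in ("t", "n") and stem[-1] == suffix[0]:
        # The suffix-initial letter merges with the identical stem-final letter.
        return stem + SHADDA + suffix[1:]
    return stem + suffix


print(attach_suffix("naHat", "t"))   # naHat∼
print(attach_suffix("Daman", "nA"))  # Daman∼A
```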

Lexicon Construction
From the set of PV-IV patterns, which we used as paradigm keys, and the lexical entries converted from QGAVL, we constructed our lexicon automatically and then manually validated all the entries. The lexicon consists of 2,648 entries that are linked to the paradigms. See Table 5 for examples of the lexical entries in previous tables. Each entry specifies the root (in phonological spelling and CODA) as well as the paradigm key and gloss.

Clitic Extension of the Basic Paradigms
At this point, we have a complete inflectional model of GLF verbs, except that it does not include any of the numerous clitics that are written attached in Arabic. We define a set of rules for extending the paradigms to include the clitics. Our extensions include two types of resources.
Clitic Locations and Forms First is the list of clitics with their morpheme POS (à la Buckwalter tags), their relative location around the basic inflected verb, and any conditions for their application. For example, the future particle proclitic b appears immediately before the basic verb form, but can only occur with imperfective verbs; the conjunction proclitics wi 'and' and fa 'so' can appear as the first clitics in any series of clitics; and so on. All possible clitic combinations are then applied to each form in the paradigm along with the necessary spelling changes. The negative proclitic mA and the indirect pronominal enclitics introduced with the preposition li are treated as attached at this point (which is not CODA compliant). With this information, we are able to model the verb w+mA+b+y-ktb+hA+l+hm 'and+not+will+he-write+it+for+them' (only the substring y-ktb comes from the BP).
We extended the paradigms with a total of 25 clitics, including five proclitics: wi 'and', fa 'so', the future particle b 'will' and the two negation particles mA and lA.
For the enclitics, we extended the paradigms with all 10 possible direct object enclitics (e.g., kun 'you[FP]') and their respective 10 indirect object enclitics, formed by adding the preposition li 'for'. With all of the additional clitics and their features, the total number of allowable feature combinations (or rows in the paradigms) increases from 19 to 24,321 per paradigm.
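The clitic ordering and the aspect condition on the future particle can be sketched as follows. The slot order (conjunction, negation, future, base form, direct object, indirect object) is read off the single example above and is an assumption beyond that; the function and parameter names are hypothetical, and the sketch produces only a morpheme segmentation without the accompanying spelling changes.

```python
from typing import List, Optional


def attach_clitics(base: str, aspect: str,
                   conj: Optional[str] = None,
                   neg: Optional[str] = None,
                   future: bool = False,
                   dobj: Optional[str] = None,
                   iobj: Optional[str] = None) -> str:
    """Return a '+'-separated segmentation of a basic verb form with clitics."""
    if future and aspect != "IV":
        # The future particle b can only occur with imperfective verbs.
        raise ValueError("future particle b requires an imperfective (IV) verb")
    parts: List[str] = []
    if conj:
        parts.append(conj)       # conjunctions come first in the clitic series
    if neg:
        parts.append(neg)
    if future:
        parts.append("b")        # immediately before the basic verb form
    parts.append(base)
    if dobj:
        parts.append(dobj)
    if iobj:
        parts.extend(["l", iobj])  # indirect object introduced by the preposition
    return "+".join(parts)


print(attach_clitics("y-ktb", "IV", conj="w", neg="mA", future=True,
                     dobj="hA", iobj="hm"))
# w+mA+b+y-ktb+hA+l+hm
```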

Clitic Rewrite Rules
We apply a number of clitic rewrite rules which are mandated by CODA spelling conventions. One example is the change of the stem Alif Maqsura to Alif when it is not word final. For example, the basic verb Aštrý+hA 'he bought + it' is rewritten as AštrAhA (ý → A). Another example is the drop of the Alif of the plural suffix pronoun wA when it is not word final. For example, AštrwA+hA 'they bought + it' is rewritten as AštrwhA (wA → w).
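The two rewrite rules given as examples can be sketched as a function over a stem, an inflectional suffix and a list of enclitics. Keeping the stem and suffix separate is a design choice of this sketch (not something stated in the text) so that the wA rule fires only on the plural suffix and not on a stem that happens to end in wA; the function name is hypothetical, and only these two rules are covered.

```python
from typing import List


def apply_clitic_rewrites(stem: str, suffix: str, enclitics: List[str]) -> str:
    """Join stem, suffix and enclitics, applying two CODA-mandated rewrites."""
    if enclitics:
        if suffix == "" and stem.endswith("ý"):
            # Stem-final Alif Maqsura becomes Alif when no longer word final.
            stem = stem[:-1] + "A"
        elif suffix == "wA":
            # The Alif of the plural suffix wA is dropped when not word final.
            suffix = "w"
    return stem + suffix + "".join(enclitics)


print(apply_clitic_rewrites("Aštrý", "", ["hA"]))   # AštrAhA
print(apply_clitic_rewrites("Aštr", "wA", ["hA"]))  # AštrwhA
```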

Database Generation
To generate the database, we used the same toolkit used in (Al-Shargi et al., 2016; Eskander et al., 2016), which generates a morphological analyzer database in the representation used in MADAMIRA (Pasha et al., 2014) and CALIMA EGY (Habash et al., 2012a). The conversion was straightforward once we converted our paradigm and lexicon database to the forms expected by the database generation tool. This conversion included providing a POS tag for every prefix, stem and suffix. We use the Buckwalter POS tag style used by many other databases for Arabic morphology (Graff et al., 2009; Habash et al., 2012a).

Extending to Non-CODA Variants
The generated database at this point expects only CODA input, which is not realistic for dealing with raw dialectal text. We extended the database for the set of complex prefixes (pronoun prefixes and proclitics), complex suffixes (pronoun suffixes and enclitics) and stems. For the complex affixes we used the same extensions used in (Habash et al., 2012a), as we do not have enough annotated data to learn from. As for the stems, we inflected the phonological roots that correspond to the CODA roots in the lexicon to their respective stems, which are mapped to the CODA stems in the database. With these extensions we are able to correctly model a non-CODA input like yAbw 'they brought' as the correct CODA form jAbwA.
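The stem mapping can be sketched as follows: the phonological root and the CODA root of a lexicon entry are merged into the same template, and the resulting non-CODA stem is stored as a key pointing to the CODA stem. The sketch rests on several assumptions that do not come from the text: the phonological root yyb paired with the CODA root jyb, the template 1A3 for this verb, and the small suffix variant table are all hypothetical values chosen to reproduce the yAbw example.

```python
from typing import Dict, List, Tuple


def merge_root_template(root: str, template: str) -> str:
    """Insert root radicals into a template: digit i stands for the i-th radical."""
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in template)


def build_stem_variants(entries: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """Map every phonological-spelling stem (and CODA stem) to its CODA stem."""
    variants: Dict[str, str] = {}
    for phon_root, coda_root, template in entries:
        coda_stem = merge_root_template(coda_root, template)
        variants[merge_root_template(phon_root, template)] = coda_stem
        variants[coda_stem] = coda_stem
    return variants


def to_coda(word: str, stems: Dict[str, str], suffixes: Dict[str, str]) -> List[str]:
    """Try every stem/suffix split and return the CODA spellings that match."""
    results = []
    for i in range(1, len(word) + 1):
        stem, suffix = word[:i], word[i:]
        if stem in stems and suffix in suffixes:
            results.append(stems[stem] + suffixes[suffix])
    return results


# Hypothetical entry: (phonological root, CODA root, perfective stem template).
stems = build_stem_variants([("yyb", "jyb", "1A3")])
# Hypothetical suffix variants: the non-CODA spelling w maps to CODA wA.
suffixes = {"": "", "w": "wA", "wA": "wA"}

print(to_coda("yAbw", stems, suffixes))  # ['jAbwA']
```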

Experimental Setup
Dataset We used a part of an Emirati novel in raw text from the Gumar corpus. We contextually annotated all the verbs appearing in the first 4,000 words of the novel, a total of 620 verbs. The annotation includes identifying the CODA spelling, the full Buckwalter tag and the morphemic segmentation. Table 6 shows an annotation example of one sentence from the data.
In this work we use only one dataset for the evaluation, as we did not use any feedback from the evaluation in the current state of the work, i.e., this was a blind test.
Metrics We report token recall on verbs only. We report results in terms of CODA spelling, segmentation and POS, and in two modes of input: raw input and CODA-compliant input of the same text. Token recall is the percentage of input words for which at least one of the analyses returned by the morphological analyzer matches the gold analysis of the word in the aspect evaluated (e.g., CODA, segmentation or POS). This is similar to the evaluation carried out by Habash and Rambow (2006).
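The token recall metric can be sketched as below. The dictionary-based analysis format, the simplified tags and the toy data are invented for illustration; they are not taken from the annotated dataset or from the analyzer's actual output format.

```python
from typing import Dict, List

Analysis = Dict[str, str]


def token_recall(gold: List[Analysis],
                 returned: List[List[Analysis]],
                 aspect: str) -> float:
    """Percentage of tokens with at least one returned analysis matching gold."""
    if not gold or len(gold) != len(returned):
        raise ValueError("gold and returned must be non-empty and aligned")
    hits = sum(
        1
        for gold_analysis, candidates in zip(gold, returned)
        if any(c.get(aspect) == gold_analysis[aspect] for c in candidates)
    )
    return 100.0 * hits / len(gold)


# Toy data (invented): four verb tokens, two of which have a matching analysis.
gold_toy = [{"pos": "IV"}, {"pos": "PV"}, {"pos": "CV"}, {"pos": "PV"}]
returned_toy = [
    [{"pos": "IV"}, {"pos": "PV"}],  # match among several analyses
    [{"pos": "PV"}],                 # match
    [],                              # no analysis returned
    [{"pos": "IV"}],                 # analysis returned, but wrong POS
]

print(token_recall(gold_toy, returned_toy, "pos"))  # 50.0
```

Because a token counts as correct if any returned analysis matches, the metric measures coverage of the analyzer and not its ability to pick the right analysis in context.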
Systems We used six different analyzers for our experiments.

Results
SAMA performs worst among all systems in all aspects, which is consistent with the results reported by Habash and Rambow (2006).

Error Analysis
We conducted an error analysis on the analyzed verbs for CALIMA GLF. We identified three main sources of errors. First are typos in the raw text which lead to no possible analysis. Examples include Abls instead of Albs 'I wear' and Adxlww instead of AdxlwA 'come in'. These account for around 19% of the errors. Second are non-CODA-compliant input words that lead to different segmentations and POS tags, e.g., the word AtSly (CODA t+Sly 'she prays') is analyzed as AtSl+y 'call![FS]'. These make up around 18% of the errors. Third are the out-of-vocabulary (OOV) cases, which for us include words with lemmas not in our lexicon, or words with affixes not modeled in our paradigms. For example, we encountered some EGY-like verbal constructions that we did not expect to see in GLF. These cases account for about 63% of the errors. When we compare the performance of our best system (CALIMA GLF+CALIMA EGY) to CALIMA GLF, we note that the errors of the first two types do not change, but there is a drop of 13% absolute in the OOV error cases.

Conclusion and Future Work
We presented CALIMA GLF, a morphological analyzer for GLF currently covering over 2,600 verbal lemmas. CALIMA GLF verb analysis token recall with CODA input outperforms both SAMA and CALIMA EGY by over 27.4% and 16.9% absolute, respectively, in terms of identifying the correct POS tag. We plan to morphologically annotate a large portion of the Gumar corpus to learn different spelling variations and grow the coverage of lemmas. We also plan to extend CALIMA GLF beyond verbs using those annotations, and to use a similar building process to create morphological analyzers and lexicons for other dialects, given the availability of resources.

Table 1: Example of a Qafisheh Gulf Arabic Verb Lexicon entry.

Table 2: Orthographic mapping of the entries in the Qafisheh Gulf Arabic Verb Lexicon. The Root is orthographically spelled in two ways, reflecting phonology and etymology (CODA style); PV3MS and IV3MS refer to the perfective and imperfective third person masculine singular verb forms.

Table 4: Example of BPs for a paradigm of Form I and another of Form II, for the roots qwl and Trš respectively. The verb qAl means 'he said' and the verb Tar∼aš means 'he sent'.

Table 5: Example of lexicon entries. For each entry there is: (a) a phonological root, which will be used to model possible non-CODA variations, (b) a CODA root, (c) two verbal forms (PV3MS and IV3MS), (d) the paradigm key, (e) Verb Form, and (f) English gloss.

Table 6: Annotation example. In this sentence, there are a total of 12 verbs, marked with [[ ]]. For each verb we provide the CODA spelling, the morphemic segmentation and the full Buckwalter POS tag.

Table 7: Token recall evaluation on CODA matching, Buckwalter POS tag and morphemic segmentation. Evaluation is on verbs only. The evaluated analyzers are (1) SAMA for MSA, (2) CALIMA EGY for EGY, which includes MSA, (3) CALIMA GLF for GLF, and (4) CALIMA GLF-CODA, which is CALIMA GLF without the extension discussed in Section 4.10 (Extending to Non-CODA Variants).