A Two-Level Morphological Description of Bashkir Turkish

In recent years, Natural Language Processing (NLP) has attracted increasing interest. Many NLP applications, including machine translation, machine learning, speech recognition, sentiment analysis, semantic search and natural language generation, have been developed for most existing languages. These applications require a two-level morphological description of the language in question. However, there is no comprehensive study of Bashkir Turkish in the literature. This paper presents a two-level description of Bashkir Turkish morphology. The description, based on a root word lexicon of Bashkir Turkish, is implemented using Extensible Markup Language (XML) and added to the Nuve framework. The phonetic rules of Bashkir Turkish are encoded using 41 two-level rules. This two-level morphological description is a promising basis for Bashkir Turkish oriented NLP applications.


INTRODUCTION
Bashkir, the co-official language with Russian in the Republic of Bashkortostan, is part of the Kipchak group of the Turkic languages. According to the 2010 census data, almost 1.2 million people speak Bashkir in the Russian Federation, and the ethnic population is nearly 1.6 million. The Bashkir language has three dialects, namely Burzhan (Western Bashkir), Kuvakan (Mountain Bashkir) and Yurmaty (Steppe Bashkir) [1].
As a member of the Turkic language family, Bashkir is an agglutinative subject-object-verb language [2]. Its vocabulary consists mostly of Turkic roots, but it also has many loan words from Arabic, Russian and Persian [3,4].
In earlier times, the Bashkir people used Chagatai as their written language. In the late 19th century it was replaced by a literary Turkic language that was a regional variety of Turki. Both Turki and Chagatai were written in a variant of the Arabic script. A writing system specifically for Bashkir was created using the Arabic script in 1923. At the same time, a literary Bashkir language, initially written in a modified Arabic alphabet, took shape by moving away from Turkic influences. This Arabic alphabet was replaced with a Latin alphabet in 1930 and then with a Cyrillic alphabet in 1938 [4].
Bashkir Turkish is a bridge between Tatar and Kazakh Turkish, and its structure is almost the same as that of Tatar Turkish. Nevertheless, it diverges from Tatar Turkish in phonology. Bashkir Turkish differs from the historical written Turkic language in its distinctive lisped and fricative consonants. In addition, advanced consonant harmony is observed in Bashkir Turkish, as in Kazakh Turkish [5][6][7][8][9][10][11].
Bashkir Turkish has finite-state and highly complicated morphotactics, as does the Turkish language [12]. Words in Bashkir can be converted from a nominal structure to a verbal structure, or vice versa, by adding morphemes to a root word or a stem. These morphemes can also create adverbial structures. The phonetic rules of Bashkir Turkish constrain and alter morphological structures. To achieve vowel harmony, vowels in affixed morphemes have to agree with the preceding vowel under certain circumstances. Moreover, vowels in roots and morphemes are dropped under certain conditions. Similarly, consonants in roots or in affixed morphemes undergo certain modifications and may be deleted.
Natural Language Processing (NLP) is the field concerned with the computational modelling of various aspects of natural languages and with the development of systems based on such models [13]. To enable computer systems to understand and process languages, many NLP methods and applications have been developed in the disciplines of computer engineering, information science, linguistics and psychology. Machine learning, artificial intelligence, natural language generation, expert systems, speech recognition, machine translation, summarization, sentiment analysis and semantic search are examples of NLP applications [14,15]. To date, various studies covering the aforementioned applications have been carried out on two-level morphological descriptions of many languages. To the best of the author's knowledge, there is no other comprehensive work on Bashkir Turkish within this framework in the literature. This paper presents a two-level morphological description based on a root word lexicon of Bashkir Turkish. The description, which is a promising basis for Bashkir Turkish NLP applications, is implemented in Extensible Markup Language (XML) and added to Nuve, a two-level parser/generator framework developed for agglutinative languages.
The rest of the paper is organized as follows. Section 2 explains two-level morphology. Section 3 introduces the two-level morphological description of Bashkir Turkish. Section 4 demonstrates the implementation of the two-level rules. Finally, Section 5 summarizes the conclusions.

TWO-LEVEL MORPHOLOGY
Two-level morphology is a generic approach to describing the morphology of word structures [16][17][18][19] and has been used to analyse the morphology of various languages [12,20-35]. A two-level description consists of two levels, namely lexical and surface. The lexical level represents the structure of the functional components of a word. The surface level represents the standard orthographic realization of the word associated with the given lexical structure [12,16,26]. The rule types denoting phonetic restrictions and modifications are listed in Table 1. In a rule, a:b pairs a lexical symbol a with its surface realization b, while the left context (LC) and right context (RC) specify the environment in which the correspondence is constrained. The context restriction, surface coercion, composite and exclusion rules shown in Table 1 are each compiled into a separate Finite State Transducer (FST), which is a Finite State Machine (FSM) with a lexical tape and a surface tape. These FSTs check whether a lexical form corresponds to a given surface form [36,37]. The FST architecture for two-level morphology is shown in Figure 1.
Appropriate morpheme sequences are designated by morphotactics which are encoded as FSMs. Moreover, these FSMs utilize lexicons for roots and suffixes, and changes for obtaining suffix sequences [12]. Readers are referred to [16] for further details about two-level morphology.
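The semantics of the four rule types can be illustrated with a minimal sketch. It is not an FST compiler and is not taken from Nuve. It checks a rule directly against an aligned sequence of lexical:surface pairs. The function name rule_holds, the context predicates and the small VOWELS set are assumptions introduced only for this illustration, and the alignment for "at" is an artificial test string rather than a claimed Bashkir form.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                  # (lexical symbol, surface symbol); "0" marks a deleted symbol
Context = Callable[[List[Pair]], bool]  # predicate over the pairs on one side of the focus

VOWELS = set("aäıiuü")                  # illustrative subset only, not the paper's V set


def rule_holds(op: str, a: str, b: str, lc: Context, rc: Context,
               pairs: List[Pair]) -> bool:
    """Check one two-level rule 'a:b op LC __ RC' against an aligned pair sequence."""
    if op not in ("=>", "<=", "<=>", "/<="):
        raise ValueError(f"unknown rule operator: {op}")
    for i, (lex, surf) in enumerate(pairs):
        in_context = lc(pairs[:i]) and rc(pairs[i + 1:])
        is_ab = (lex, surf) == (a, b)
        # Context restriction: a:b may occur only inside the context.
        if op in ("=>", "<=>") and is_ab and not in_context:
            return False
        # Surface coercion: inside the context, lexical a must surface as b.
        if op in ("<=", "<=>") and lex == a and in_context and surf != b:
            return False
        # Exclusion: a:b must not occur inside the context.
        if op == "/<=" and is_ab and in_context:
            return False
    return True


def after_vowel_and_boundary(left: List[Pair]) -> bool:
    """LC 'V +:0': a surface vowel followed by a deleted morpheme boundary."""
    return len(left) >= 2 and left[-1] == ("+", "0") and left[-2][1] in VOWELS


def before_A_r(right: List[Pair]) -> bool:
    """RC 'Ar': the next two lexical symbols are A and r."""
    return [lex for lex, _ in right[:2]] == ["A", "r"]


def align(lexical: str, surface: str) -> List[Pair]:
    """Pair symbols one-to-one; both strings must have the same length."""
    return list(zip(lexical, surface))


good = align("äsä+LAr", "äsä0lär")      # the example given for the first rule
bad = align("äsä+LAr", "äsä0där")       # violates L:l after a vowel
artificial = align("at+LAr", "at0lar")  # L:l outside the context (artificial)
```

With these definitions, rule_holds("<=", "L", "l", after_vowel_and_boundary, before_A_r, good) is true and the same call on bad is false. The artificial alignment satisfies the surface coercion form "<=" but fails the context restriction form "=>", which is the difference between the first two rows of Table 1.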

TWO-LEVEL MORPHOLOGICAL DESCRIPTION
The Bashkir Turkish language is officially written in the Cyrillic alphabet, and its orthography is rendered with an adapted alphabet of 35 Latin letters. There are 9 vowels: a, ä, ı, i, η, u, ü, ȗ, u, and 26 consonants: b, v, d, g, ġ, η, j, z, y, k, q, l, m, n, η, p, r, s, š, t, f, h, ç, ş, x, w [38]. In addition, there are geminate consonants, such as "ts" and "şç", taken from Russian. There are also the sounds "yu", "yo" and "ya", which are used in Russian words. The phonetic features corresponding to the sounds denoted by these vowels and consonants are shown in Tables 2 and 3.
In order to create the two-level description of Bashkir Turkish morphology, firstly, the following letter subsets are defined:

Two-Level Rules
The two-level rules for the phonetic component of the morphological description are given below:

1. L:l <= V +:0__Ar
This rule converts L which is at the beginning of the suffix +LAr to l when the last letter of stem is V.
Lexical: äsä+LAr N(mother/anne)+PLU
Surface: äsä0lär äsälär (mothers/anneler)

This rule converts L which is at the beginning of the suffix +LAr to d when the last letter of stem is one of the consonants in the option list.

Rule type | Rule | Description
Context restriction | a:b => LC __ RC | a is realized as b only in the given LC and RC, but not necessarily always.
Surface coercion | a:b <= LC __ RC | a is always realized as b in the given LC and RC, but not necessarily only in this context.
Composite | a:b <=> LC __ RC | a is always realized as b in the given LC and RC and nowhere else.
Exclusion | a:b /<= LC __ RC | a is never realized as b in the given LC and RC.

Figure 1 FST architecture for two-level morphology [12].

This rule converts L which is at the beginning of the suffix +LAr to η when the last letter of stem is one of the consonants in the option list.

N:n <= V +:0__Hη
This rule converts N which is at the beginning of the suffix +NHη to n when the last letter of stem is V.
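As an illustration of how the first rule and its example (äsä+LAr, äsä0lär, äsälär) fit together, the following minimal sketch realizes a lexical string as a surface string. It covers only the boundary deletion +:0 and the rule L:l after a vowel. The resolution of A to a or ä by the frontness of the preceding vowel, the FRONT and BACK sets and the function name realize are assumptions made for this sketch. Lexical symbols that no implemented rule covers are deliberately left unchanged.

```python
from typing import Tuple

FRONT = set("äiü")       # assumed front vowels (illustrative subset)
BACK = set("aıu")        # assumed back vowels (illustrative subset)
VOWELS = FRONT | BACK


def realize(lexical: str) -> Tuple[str, str]:
    """Return (aligned surface with '0' for deletions, final surface form)."""
    aligned = []
    for i, sym in enumerate(lexical):
        realized = [c for c in aligned if c != "0"]   # surface symbols so far
        prev = realized[-1] if realized else ""
        at_suffix_start = i > 0 and lexical[i - 1] == "+"
        if sym == "+":
            aligned.append("0")                       # +:0, the boundary is deleted
        elif (sym == "L" and at_suffix_start and prev in VOWELS
              and lexical[i + 1:i + 3] == "Ar"):
            aligned.append("l")                       # L:l <= V +:0__Ar
        elif sym == "A":
            # Assumption: A agrees in frontness with the last realized vowel.
            last_vowel = next((c for c in reversed(realized) if c in VOWELS), "a")
            aligned.append("ä" if last_vowel in FRONT else "a")
        else:
            aligned.append(sym)                       # not covered here: left as is
    aligned_form = "".join(aligned)
    return aligned_form, aligned_form.replace("0", "")


print(realize("äsä+LAr"))  # prints ('äsä0lär', 'äsälär')
```

For a consonant-final stem the sketch leaves L unresolved, because the consonant option lists of the remaining rules are not reproduced here.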

IMPLEMENTATION OF TWO-LEVEL RULES
This study describes a two-level morphological description comprising 41 two-level rules that encode the phonetic rules of Bashkir Turkish. The description is implemented using XML and added to the Nuve framework. A lexicon of approximately 700 words was created and used for implementation and testing. After implementation, all words in the lexicon were tested, and morphological generation and parsing worked correctly for every word, which corresponds to a test accuracy of 100% on this lexicon.
Nuve [39] is a language-independent top-down morphological analyser and generator designed principally for Turkic languages, and it can be used for all agglutinative languages. It is open source and developed in C# on the .NET platform. Nuve also supports stemming, sentence boundary detection and n-gram extraction.
The implementation of the two-level description of Bashkir Turkish morphology consists of the following four steps, which are outlined in Figure 2.
Step 1: A root lexicon for Bashkir Turkish containing root type and flag attributes is created as a comma-separated values (CSV) file.
Step 2: A suffix lexicon for Bashkir Turkish including lexical form, surface form and rule type attributes is generated as a CSV file.
Step 3: An orthographic rules file containing the two-level rules for the phonetic component of the morphological description is created in XML format. Figure 3 shows a part of the orthographic rules file, which contains the Bashkir Turkish alphabet and the 1st, 2nd, 3rd and 4th two-level rules.
Step 4: A morphotactics rules file is created for Bashkir Turkish in XML format. Figure 4 demonstrates a part of the morphotactics rules file.
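The four language files from the steps above can be pictured with the following minimal sketch, which loads them using only the Python standard library. The column names, element names and attribute names are hypothetical placeholders chosen to mirror the attributes mentioned in the steps. They are not Nuve's actual file schema, and the single entries are taken from the äsä+LAr example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical file contents; the real Nuve formats may differ.
ROOTS_CSV = """root,type,flags
äsä,NOUN,
"""

SUFFIXES_CSV = """id,lexical,surface,rule_type
PLU,+LAr,lär,ORTHOGRAPHIC
"""

ORTHOGRAPHY_XML = """<orthography>
  <rule id="1" lexical="L" surface="l" operator="&lt;=" left="V +:0" right="Ar"/>
</orthography>"""

MORPHOTACTICS_XML = """<morphotactics>
  <transition source="NOUN" suffix="PLU"/>
</morphotactics>"""


def load_language(roots_csv: str, suffixes_csv: str,
                  orthography_xml: str, morphotactics_xml: str) -> dict:
    """Parse the four language files into plain Python structures."""
    return {
        "roots": list(csv.DictReader(io.StringIO(roots_csv))),
        "suffixes": list(csv.DictReader(io.StringIO(suffixes_csv))),
        "rules": [dict(r.attrib) for r in ET.fromstring(orthography_xml).iter("rule")],
        "transitions": [dict(t.attrib)
                        for t in ET.fromstring(morphotactics_xml).iter("transition")],
    }


language = load_language(ROOTS_CSV, SUFFIXES_CSV, ORTHOGRAPHY_XML, MORPHOTACTICS_XML)
print(language["rules"][0]["operator"])  # prints <=
```

Note that the rule operator has to be escaped as &lt;= inside an XML attribute. The parser returns it as the plain string "<=".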
After all language-specific files, namely the root lexicon, suffix lexicon, orthographic rules and morphotactics rules, are defined, morphological generation and parsing for Bashkir Turkish can be tested on Nuve. The user interfaces of Nuve for orthography and morphology are shown in Figures 5 and 6, respectively.
In the morphological generation stage, a desired root/stem is specified together with one or more suffixes, and Nuve then generates the surface forms in 5 phases according to the lexical form of the root/stem specified in the suffix lexicon file (Figure 5).
Conversely, morphological parsing of a chosen Bashkir Turkish word is performed by Nuve as shown in Figure 6.
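The relationship between generation and parsing can be sketched as analysis by generation: candidate root and suffix combinations are generated and compared with the input word. This is only one possible strategy and is not claimed to be how Nuve parses internally. The names parse and toy_generate, the lookup table and the output format are assumptions for illustration, and the toy generator simply looks up the one surface form given in the rule example.

```python
from typing import Callable, Dict, List


def parse(surface: str, roots: Dict[str, str], suffixes: Dict[str, str],
          generate: Callable[[str], str]) -> List[str]:
    """Return every root(+suffix) analysis whose generated form equals surface."""
    analyses = []
    for root, category in roots.items():
        if generate(root) == surface:
            analyses.append(f"{root} {category}")
        for tag, lexical_suffix in suffixes.items():
            if generate(root + lexical_suffix) == surface:
                analyses.append(f"{root} {category}+{tag}")
    return analyses


# Toy generator: a lookup of the single form from the rule example.
KNOWN_SURFACES = {"äsä+LAr": "äsälär"}


def toy_generate(lexical: str) -> str:
    return KNOWN_SURFACES.get(lexical, lexical.replace("+", ""))


ROOTS = {"äsä": "N"}          # root and its category
SUFFIXES = {"PLU": "+LAr"}    # tag and lexical form of the suffix

print(parse("äsälär", ROOTS, SUFFIXES, toy_generate))  # prints ['äsä N+PLU']
```

A round-trip check of this kind, generating a form and then parsing it back, is also a simple way to test every word of a lexicon.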

CONCLUSION
This paper has presented a two-level morphological description of Bashkir Turkish based on a root word lexicon. The description is implemented using XML and added to the Nuve framework, a language-independent two-level generator/parser for agglutinative languages developed especially for Turkic languages. The phonetic rules of Bashkir Turkish are encoded using 41 two-level rules. Furthermore, a root lexicon of about 700 words is used in the implementation and testing stages. As the first extensive two-level description of Bashkir Turkish, this morphological description is a promising basis for Bashkir Turkish NLP applications such as corpus tagging, text segmentation and semantic analysis.