Morphological segmentation method for Turkic language neural machine translation

Abstract Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be overcome by segmenting each word in the parallel source corpora into morphemes. Therefore, this study introduces a new morphological segmentation approach for Turkic languages based on the complete set of endings (CSE), which reduces the vocabulary volume of the source corpora. Herein, we demonstrate the proposed CSE-based morphological segmentation method for the Kazakh, Kyrgyz, and Uzbek languages and present the results of computational NMT experiments for the Kazakh language. The NMT experiment results show that, in comparison with byte-pair encoding (BPE)-based segmentation, the proposed CSE-based segmentation increases the bilingual evaluation understudy (BLEU) score by 0.5 and 0.2 points on average for the Kazakh–English and English–Kazakh pairs, respectively. Furthermore, in comparison with BPE-based segmentation, the proposed CSE-based segmentation approach reduced the vocabulary size in NMT by more than a factor of two. This feature of the proposed segmentation approach will be crucial for NMT as the size of the source corpora is increased to improve translation quality.

Professor Ualsher Tukeyev runs an active research group that focuses on development and investigation in the area of natural language processing (machine translation, computational linguistics, corpus linguistics) of Turkic languages. Specifically, the group's current research focuses on the development of linguistic-feature-oriented methods to support neural machine translation. The proposed segmentation method is used in the current project "Development and research of the Kazakh language neural machine translation system", financed by the Ministry of Education and Science of the Republic of Kazakhstan. This segmentation method is also used in investigations of the Uzbek and Kyrgyz languages.

PUBLIC INTEREST STATEMENT
The article proposes a novel segmentation method for agglutinative languages to be used in neural machine translation pre-processing, which decreases the volume of the machine translation vocabulary by more than a factor of two. The proposed segmentation method is based on the construction of a complete set of language endings. A smaller machine translation vocabulary makes it possible to increase the volume of the input parallel corpus used for training, which in turn improves the quality of machine translation. The proposed segmentation method for agglutinative languages could also be used in the field of information retrieval for lexicon-free stemming of words, for morphological analysis of texts, and for morphological tagging of language corpora.

Introduction
Turkic languages form one of the largest language families and are spoken by more than 160 million people; the languages in this family include Turkish, Azeri, Uzbek, Kazakh, Tatar, and Kyrgyz. Approximately 13, 4.4, and 24 million people speak the Kazakh, Kyrgyz, and Uzbek languages, respectively. The Kazakh speakers live in Kazakhstan, Russia, China, Uzbekistan, Mongolia, and Turkmenistan; the Kyrgyz speakers live in Kyrgyzstan, Uzbekistan, China, and Tajikistan; and the Uzbek speakers live in Uzbekistan, Afghanistan, and Tajikistan.
Turkic languages are agglutinative, thereby making them challenging for neural machine translation (NMT) because almost all the words from the source corpus must be included in the dictionary. NMT learning is generally improved by increasing the vocabulary size; however, if the vocabulary is significantly large, the memory will overflow, thereby resulting in system error. This system error can be avoided by word segmentation.
In this study, a new morphological segmentation method is proposed using three languages as examples: Kazakh, Kyrgyz, and Uzbek. The Kazakh language was chosen for the experiments because of the existing scientific developments and the availability of a parallel Kazakh-English corpus. The Kyrgyz language was considered herein because it belongs to the same Kypchak-Nogai subgroup of the Turkic languages as Kazakh, which enables the examination of the NMT complexity for the languages in a single subgroup. Furthermore, the Uzbek language was considered because it belongs to the Karluk subgroup of the Turkic languages, which enables the investigation of the NMT complexity for the languages in different Turkic language subgroups. All three languages considered herein are low-resource languages.
When training NMT for these language pairs, the volume of the corresponding NMT dictionary rapidly increases; therefore, it requires excessive computer memory resources. The well-known approaches for text segmentation are the BPE-based method (Sennrich et al., 2016) and Morfessor (Creutz & Lagus, 2002), both of which are unsupervised and statistics-based methods. The advantage of these two methods lies in their universal applicability to different languages.
This study proposes a novel morphological segmentation method based on the complete set of endings (CSE), that is, the sequences of suffixes of words in a language. The proposed CSE-based segmentation method can be applied to the agglutinative languages in the Turkic group. Furthermore, this study demonstrates the applicability of the proposed CSE-based morphological segmentation method to the Kazakh, Kyrgyz, and Uzbek languages and presents the results of computational experiments for the Kazakh language. This approach can be extended to the other languages in the Turkic language group. The remainder of this paper is organised as follows. Section 2 provides an overview of the previous works conducted in this field. Section 3 describes the proposed CSE-based segmentation of words in the Kazakh, Kyrgyz, and Uzbek languages. Section 4 presents the experimental NMT results for the Kazakh-English language pair obtained using the proposed CSE-based segmentation method. Finally, Section 5 presents the conclusions and suggests the direction for future work.

Related work
The research works related to segmentation for NMT can be divided into those based on the BPE method, Morfessor, and finite-state transducers (Sennrich et al., 2016; Ataman et al., 2017; Sánchez-Cartagena & Toral, 2016).
Sennrich et al. developed methods that segment corpora into frequent sequences of characters (Sennrich et al., 2016). Specifically, the well-known byte-pair encoding (BPE) compression method was applied to English-German and English-Russian NMT systems in these methods. The authors adapted the BPE algorithm for the segmentation task to create an open vocabulary. The advantage of BPE-based segmentation is that rare words are segmented into frequent subsequences, enabling the translations of unknown words to be built, which is one of the major goals of NMT. The BPE segmentation method is the dominant approach to subword segmentation (Tacorda et al., 2016).
The BPE-based method involves the splitting of words into different variations of word segments; however, this approach is not suitable for languages with rich morphologies, such as the Kazakh language. For example, in the learning phase, the words "жобалар" (projects), "жобасын" (project of), and "жобаның" (of project) are presented as "жоб алар</w>" (correct segmentation is "жобалар"), "жобас ын</w>" (correct segmentation is "жоба сын"), and "жобаның</w>" (without segmentation), respectively, by the BPE method. When the BPE method is applied to files in the test phase, these words are not split, but are rather left as whole words. The explanation is that whole words often have a higher frequency than word segments (the experiments in Section 4 confirm this assumption); therefore, the vocabulary produced by BPE-based segmentation is larger than that produced by morphological segmentation. Tacorda et al. proposed the use of the controlled byte-pair encoding (CBPE) method for English-Filipino and Filipino-English translation (Tacorda et al., 2016). CBPE is used to recognise inflected words in morphologically rich languages. The authors compared the results of BPE and CBPE and concluded that both improve the bilingual evaluation understudy (BLEU) metric; however, with CBPE, the quality metric was improved slightly.
The use of the BPE-based method has also been considered in other studies on Turkic languages. Ataman et al. predicted subword segments using an unsupervised morphology learning algorithm based on a prior morphology model (Ataman et al., 2017). They investigated morphological and BPE segmentation. Morphological segmentation was applied to the Turkish language; for fair comparison, only the source side was segmented. Their study presented two morphological segmentation methods, i.e. supervised and unsupervised. The supervised method maintained a full description of the morphological properties of subwords, whereas the unsupervised method was based on the Morfessor framework with category-based model averting. Experiments were performed separately using BPE and the developed methods, and the results showed that, in comparison with BPE segmentation, the developed methods improved the BLEU metric by 2.2.
Sánchez-Cartagena and Toral used a rule-based morphological analyser for the Finnish language to separate words at root and inflection boundaries in order to reduce the NMT vocabulary (Sánchez-Cartagena & Toral, 2016). The authors combined an NMT system and a phrase-based statistical machine translation (SMT) system enhanced with a neural language model. In SMT, the length of the segmented Finnish sentences is reduced by joining the most frequent sequences of morphs. BPE was used for the Finnish language because it has a more complex morphology than English. The authors concluded that combining BPE with morphological segmentation does not yield any clear improvement.
BPE performs the merging operations iteratively to find the most frequent character combination. BPE segmentation is conducted regardless of the morphology of the language. Therefore, the BPE output has no semantic meaning in languages with rich morphologies.
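The iterative merging can be illustrated with a minimal Python sketch in the spirit of the procedure described by Sennrich et al. (2016). The function name learn_bpe, the toy word counts, and the handling of the end-of-word marker are assumptions made for illustration; this is not the reference implementation used in the experiments.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a {word: frequency} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair is merged next
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Toy counts (hypothetical): the merges depend only on frequency, not on morphology.
merges, vocab = learn_bpe({"abab": 3, "ab": 2}, num_merges=2)
```

Because the merge order is driven purely by pair frequency, the resulting segments need not coincide with stems or affixes, which is the limitation noted above.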
There are some other segmentation approaches based on Morfessor, which is open-source software for unsupervised morphological analysis. Morfessor segments words according to their morphological structures. There are three main versions of Morfessor, which are Morfessor Baseline, Morfessor Categories-ML, and Morfessor Categories-MAP (Creutz, 2003; Creutz & Lagus, 2002, 2005; Creutz & Linden, 2004). Morfessor is a statistical morphological segmentation tool. It evaluates all possible ways by which a word can be split into two substrings, and the split with the highest probability is selected. However, like N-gram models, it does not have a preference for infrequent words. Therefore, it suffers from the out-of-vocabulary (OOV) problem. This can be resolved with the use of BPE (Banerjee & Bhattacharyya, 2018; 2007; Papli, 2017). Therefore, to avoid the appearance of several unknown tags and erroneous probabilistic segmentation, it was decided that BPE should be used for the Kazakh-English and English-Kazakh language pairs. This choice was also influenced by the fact that BPE is a dominant approach in the domain of word segmentation.
At the Conference on Machine Translation (WMT) 2019, the Kazakh language was added to the translation tasks, and the translation of Kazakh to English was considered (Briakou and Carpuat, 2019; Casas et al., 2019; Kocmi & Bojar, 2019; Littell et al., 2019; Sánchez-Cartagena et al., 2019). Briakou and Carpuat applied transfer-learning technology to Kazakh-English and English-Kazakh translations (Briakou and Carpuat, 2019). As additional data for transfer learning, parallel corpora of Turkish-English were used because Kazakh and Turkish belong to the same language group. The researchers compared different configurations of BPE and soft decoupled encoding. The texts of the Kazakh corpora were Romanised, and experiments were conducted with and without Romanisation. The dictionary volume significantly increased with Romanisation. With the BPE configuration, the BLEU score was improved by 0.20 with Romanisation, whereas that of the original text (in Cyrillic) was improved by 1.24.
Casas et al. mentioned the morphological complexity of the Kazakh and Russian languages (Casas et al., 2019). Russian-Kazakh SMT was used as a pivot system for English-Kazakh NMT. The researchers employed BPE with 10,000 comparative operations for each language in NMT, producing a BLEU score of 2.32. Kocmi  . They segmented a source text using a rule-based morphological analyser. If a word had no valid segmentation, as many segmentation variants were generated as there were known suffixes that matched the word. After morphological segmentation, BPE was applied to all the training data. For example, "университетiнiң" (of her/his university) has the morphological analysis result n. px3sp.gen. Their morphological segmentation split this term as "университет@@ iнiң", whereas BPE left the word unchanged as "университетiнiң" (of her/his university). Thus, their morphological segmentation divides the given words into only two parts. In contrast, our morphological segmentation based on the complete set of Kazakh endings performs splitting into more than two parts and conducts segmentation by using ending types defined exactly according to the grammar: "университет@@ i@@нiң".
Thus, to improve the quality of NMT, appropriate segmentation and satisfactory volume of parallel corpora are required. To achieve these objectives, this study proposes a morphological segmentation approach based on the CSE-model and special stemming algorithm. Furthermore, it demonstrates the usability of the proposed approach to the Turkic languages for creating a complete set of language endings considering the Kazakh, Kyrgyz, and Uzbek languages as examples and presents the results of computational experiments, wherein the proposed morphological segmentation method was applied to the Kazakh language.

Description of the CSE-based morphological segmentation method
Morphology refers to the structures of words in terms of minimal semantic grammatical units known as morphemes. Morphemes are usually divided into two groups, i.e. stems and affixes; the stem defines the basic meaning of a word, whereas affixes define the various forms of meaning of the word. Moreover, depending on the language type, i.e. agglutinative or inflectional (fusional), affixes can have either single or multiple grammatical meanings. Thus, for agglutinative languages, each affix has a single meaning, whereas for inflectional languages, an affix can have several grammatical roles, such as case, gender, and number. For agglutinative languages, several affixes may be added to a stem, so that the word as a whole carries several grammatical meanings. In an agglutinative language, such a sequence of affixes after the stem is called the ending of the word. Tukeyev et al. defined the complete system of endings for the Kazakh language.
The proposed study is novel because it demonstrates the applicability of the proposed CSE-based morphological segmentation method for the Turkic language family. Section 3.1 briefly shows the complete set of Kazakh endings, presents the CSE-based morphological segmentation model, and demonstrates its effectiveness for the agglutinative languages of the Turkic group, if a CSE-model of morphology is created for the language. Section 3.2 describes the morphological segmentation algorithm for words in the Kazakh language and its application to other languages in the Turkic group. Further, the possibility of constructing a CSE-model morphology for language of the Turkic group is demonstrated using Kyrgyz and Uzbek as examples.

CSE-model of morphology
This section analyses the morphology of the Turkic language group, more specifically, Kazakh, Uzbek, and Kyrgyz languages, and shows that the morphological structure of these languages enables the building of a CSE-model, which is essential for application of the proposed segmentation method.
We considered the Kazakh language, wherein the endings are divided into nominal endings (nouns, adjectives, and numerals) and verbal endings (verbs, participles, gerunds, mood, and voice).
The nominal endings in the Kazakh language have four types of base affixes, i.e. plural affixes (K), possessive affixes (T), case affixes (C), and personal affixes (J). These endings can occur in sequences of one, two, three, or four types of affixes, in the order prescribed by the morphotactics of the language. All Turkic languages have these four types of base affixes. Any ending comprising a single affix is semantically valid. The valid two-, three-, and four-affix combinations are KT, TC, CJ, KC, TJ, and KJ; KTC, KTJ, TCJ, and KCJ; and KTCJ, respectively. Thus, the total number of ending combinations for words with nominal bases is 15 (= 4 + 6 + 4 + 1).
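The fifteen nominal ending types listed above coincide with all order-preserving selections of one to four affix types from the sequence K, T, C, J. The following Python sketch enumerates them under that reading; the names AFFIX_ORDER and ending_types are illustrative and do not come from the original work.

```python
from itertools import combinations

# Fixed morphotactic order of the four base affix types:
# plural (K), possessive (T), case (C), personal (J).
AFFIX_ORDER = "KTCJ"

def ending_types(order=AFFIX_ORDER):
    """Return every non-empty, order-preserving selection of affix types."""
    return [
        "".join(combo)
        for size in range(1, len(order) + 1)
        for combo in combinations(order, size)
    ]

types = ending_types()
# Group the types by the number of affixes they contain.
by_size = {n: [t for t in types if len(t) == n] for n in range(1, len(AFFIX_ORDER) + 1)}
for n, group in by_size.items():
    print(n, group)
print("total:", len(types))
```

The group sizes are 4, 6, 4, and 1, which reproduces the count 15 = 4 + 6 + 4 + 1 given in the text.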
The system of endings for verbal bases in Kazakh includes the ending types of verbs, participles, verbal participles, moods, and voices. The system of verb endings includes the following affix types: tenses (eight), persons (three), and negation. Thus, the total number of possible types of verb endings is 25 (= 8 × 3 + 1). The system of participle endings includes participle endings (R), plural endings (K), possessive endings (T), case endings (C), and personal endings (J). The numbers of semantically acceptable ending types for participles, verbal participles, moods, and voices are 11, 1, 6, and 8, respectively. Therefore, the total number of ending types for words with verbal bases is 51 (= 25 + 11 + 1 + 6 + 8), and the total number of ending types for words with nominal and verbal bases together is 66 (= 15 + 51).
According to these ending types, finite sets of endings were constructed for all the main parts of speech in the Kazakh language. The number of endings for parts of speech with nominal bases (nouns, adjectives, and numerals) is 1,998 and that with verbal bases is 2,729 (Table 1). Hence, there are a total of 4,727 endings for all parts of speech in the Kazakh language.
Israilova and Bakasova considered the formation of Kyrgyz morphology (2018). The Kyrgyz language has ending types similar to those in the Kazakh language. Kyrgyz has ending types E1, E2, E3, and E4, which correspond to K, T, C, and J, respectively, in Kazakh. The ending types in Kazakh, Kyrgyz, and Uzbek are listed in Table 2.
The numbers of base affixes of each type are different in Kazakh, Kyrgyz, and Uzbek; therefore, the number of possible endings in each of these languages will be different. Table 3 lists the examples of Kazakh, Kyrgyz and Uzbek endings for some ending types.
All agglutinative languages have strict systems of word formation and rules for affix conjunction. Kazakh, Uzbek, and Kyrgyz, like other Turkic languages, are grammatically similar in terms of the types of endings. Our study of the ending types in Kyrgyz and Uzbek indicates that a CSE-model created for either of these languages could be used with the segmentation algorithm developed for the CSE-model of the Kazakh language. The morphological segmentation algorithms and models based on the CSE-model for the Turkic languages are discussed in the next section.

CSE-based morphological segmentation algorithm
It is possible to create a CSE-based model for each agglutinative language in the Turkic group, as shown in the previous section. Therefore, the algorithm for the morphological segmentation of words will be the same for all languages in the Turkic group. This algorithm includes two stages: 1) splitting of stems and word endings and 2) segmentation of word endings into component suffixes.
1) The stem and ending of a word can be split using a stemming algorithm, which is also based on the CSE-model of the agglutinative languages in the Turkic group. The proposed algorithm is a lexicon-free stemming algorithm based on the CSE of the Kazakh language (Tukeyev & Turganbaeva, 2016). Herein, this algorithm is proposed for the entire Turkic language group. All the endings in the set of endings of an agglutinative language in the Turkic group are divided into classes according to their length. The algorithm first looks for an ending of maximum length for the given word, which is two symbols less than the length of the word, because it is assumed that a stem cannot contain fewer than two symbols. The assumed ending of length L is searched for in the class of endings of length L. If the ending is not in this class, the length of the assumed ending is decreased by one (that is, the leftmost symbol of the assumed ending is moved to the assumed stem of the word), and the shortened ending is searched for in the corresponding ending class. This is repeated until an ending is found or it is established that the word has no ending.

Table fragment (columns: Iteration; Splitting of the word on each iteration into stem and ending; Comments). Iteration 1: śa-arydanmyn; did not find any match for "arydanmyn" in the endings list (first column of the endings table).
In the following, L(e)_max is the maximum length of endings in the set of endings for the language, e(w) is the ending of the analysed word w, st(w) is the stem of w, L(w) is the length of w, L[e(w)] is the calculated length of the ending of w, and L[e(w)]_max is the maximum length of the ending of the analysed word w.
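The search described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the cap of the starting length at the longest ending in the set, and the small Latin-script ending set are assumptions made for the example, whereas the real CSE for Kazakh contains several thousand endings.

```python
def group_by_length(endings):
    """Divide a set of endings into classes keyed by ending length."""
    classes = {}
    for ending in endings:
        classes.setdefault(len(ending), set()).add(ending)
    return classes

def split_stem_ending(word, classes, min_stem_len=2):
    """Return (stem, ending); ending is '' when no ending from the set matches."""
    longest_in_set = max(classes, default=0)
    # Start from the longest candidate that still leaves a stem of min_stem_len
    # symbols (assumption: also capped by the longest ending in the set).
    length = min(len(word) - min_stem_len, longest_in_set)
    while length > 0:
        candidate = word[-length:]
        if candidate in classes.get(length, ()):
            return word[:-length], candidate
        length -= 1  # move the leftmost symbol of the candidate to the stem
    return word, ""

# Hypothetical toy ending set in Latin transliteration.
TOY_ENDINGS = {"dan", "myn", "lar", "danmyn", "lardan"}
classes = group_by_length(TOY_ENDINGS)
print(split_stem_ending("qaladanmyn", classes))
print(split_stem_ending("kitap", classes))
```

Storing the endings in per-length classes means that each iteration is a single set lookup, so a word is processed in at most as many lookups as the length of its longest possible ending.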

Table fragment (columns: Iteration; Splitting of the word on each iteration into stem and ending; Comments). Iteration 1: qa-ladanmyn; did not find any match for "ladanmyn" in the endings list (first column of the endings table).
The steps of the algorithm for splitting the stem and ending are as follows.
2) The word ending is segmented into its component suffixes using a single state transducer, presented as a table of endings with segmented suffixes (Table 4). The columns in this table list the endings of words of the agglutinative languages in the Turkic group and suffixes corresponding to each ending. Note that Table 4 is only a fragment of the common table of the Kazakh endings with segmented suffixes.
The algorithm for segmenting the ending of a word into its component suffixes involves two steps: finding the ending of the current word in the endings table of the agglutinative language, and reading the sequence of suffixes corresponding to that ending. Tables 5-7 present examples of morphological segmentation based on CSE for Uzbek, Kyrgyz, and Kazakh. For each language, the corresponding table of endings, structured like Table 4, must be used.
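The two-step lookup can be sketched as a dictionary that maps each ending to its suffix sequence. The table entries below are hypothetical Latin-script examples rather than rows of Table 4, and the placement of the "@@" marker after every non-final piece is an assumption modelled on common subword conventions.

```python
# Hypothetical fragment of an endings table: ending -> component suffixes.
ENDING_TABLE = {
    "dan": ["dan"],
    "myn": ["myn"],
    "lar": ["lar"],
    "lardan": ["lar", "dan"],
    "danmyn": ["dan", "myn"],
}

def segment_ending(ending, table):
    """Step 1: find the ending in the table; step 2: read its suffix sequence."""
    if not ending:
        return []
    # Fallback (assumption): an ending missing from the table stays unsplit.
    return table.get(ending, [ending])

def to_subword_string(stem, suffixes, marker="@@"):
    """Join stem and suffixes, marking every non-final piece with the marker."""
    pieces = [stem, *suffixes]
    return " ".join([piece + marker for piece in pieces[:-1]] + [pieces[-1]])

print(to_subword_string("qala", segment_ending("danmyn", ENDING_TABLE)))
print(to_subword_string("kitap", segment_ending("", ENDING_TABLE)))
```

Because the suffix boundaries are read from the table rather than computed, supporting another Turkic language in this sketch only requires supplying that language's table.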
The algorithm described above separates the stem and ending of a word without using a dictionary of stems of the agglutinative languages in the Turkic group; such an algorithm is known as a lexicon-free algorithm.

Experiments and results
The proposed CSE-based segmentation method was applied to Kazakh-English NMT in a preprocessing phase. This section presents the results of the experiments comparing the proposed CSE-based segmentation and BPE-based segmentation. The choice of BPE is justified by the fact that it is the de facto standard for word segmentation in the domain of neural machine translation (Provilkov et al.). In addition, Morfessor requires a lexicon of morphemes for each language, which incurs additional expenses, whereas BPE does not require such additional data.
The Kazakh-English parallel corpora were collected from the news sections of government agency websites (Table 8). The parallel corpora contained one sentence per line, with tokens separated by spaces. The collected corpora were assembled, cleaned, and aligned. The resulting Kazakh-English parallel corpus was pre-processed through tokenisation, normalisation, and shuffling.
The resulting Kazakh-English parallel corpus comprised 109,772 sentences, of which 80,000 sentences were utilised for training, and the remaining sentences were divided into two sets, i.e. test and dev. The test and dev files included 15,000 and 14,772 sentences, respectively. We used the TensorFlow "sequence to sequence" model in all the experiments and applied the following settings for the hyperparameters: We experimented with the standard hyperparameters by calibrating the number of units and concluded that training with 1,024-dimensional hidden units leads to an improvement in the quality of translation. During training, a model checkpoint was saved every 1,000 iterations. The duration of the training was 100,000 epochs. The training corpus on the Kazakh language side was segmented into stems and affixes for each word using the proposed CSE-based segmentation method. The stems and affixes in the Kazakh side of the parallel corpora were separated by the symbols @@, similar to BPE-based segmentation. The NMT vocabulary was created based on the frequencies of occurrence of the words in the training file, wherein only words that occurred three or more times were included. The corresponding Kazakh vocabulary volumes are listed in Table 9.
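The frequency-based vocabulary construction can be sketched as follows. This is a minimal illustration under stated assumptions: the function name build_vocab, whitespace tokenisation of already segmented lines, and the ordering of the output are choices made for the example and are not taken from the experimental code.

```python
from collections import Counter

def build_vocab(lines, min_count=3):
    """Collect tokens that occur at least min_count times in the training lines."""
    counts = Counter(token for line in lines for token in line.split())
    kept = [token for token, count in counts.items() if count >= min_count]
    # Most frequent tokens first; ties are broken alphabetically for determinism.
    return sorted(kept, key=lambda token: (-counts[token], token))

# Hypothetical segmented training lines using the "@@" marker.
sample = ["qala@@ dan keldi", "qala@@ dan", "qala@@ lar", "dan"]
print(build_vocab(sample))
```

After segmentation, a frequent stem and a frequent suffix each occupy a single vocabulary entry regardless of how many surface forms combine them, which is the mechanism behind the smaller vocabulary reported for the CSE-based method.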
The increase in the vocabulary size of the baseline NMT can be explained as follows. In the NMT with BPE, some words of the Kazakh text are not segmented, whereas in the NMT with CSE segmentation, all the words with endings are segmented. For example, in the NMT with BPE, the word "Қазақстанның" (of Kazakhstan) was left as a whole without any segmentation, whereas in the proposed CSE-based segmentation method, it was segmented as "Қазақстан@@ның" (of Kazakhstan). Therefore, the volume of the vocabulary in NMT with the proposed CSE-based segmentation is less than that in the NMT with BPE-based segmentation.
In the experiments, different segmentation options were used for training, as follows: The experimental results were evaluated using the BLEU metric. The obtained values were not high enough to indicate good NMT quality. The main reason for this result is the unavailability of sufficient data for neural network training of this language pair. In practice, it is recommended that large parallel corpora be used for NMT training to achieve adequate machine translation accuracy (Koehn & Knowles, 2017; Poncelas et al., 2018). However, for the Kazakh language, as for the other languages in the Turkic family except Turkish, there are no sufficiently large parallel corpora.
In comparison with BPE-based segmentation, the proposed CSE-based segmentation increases the BLEU score by 0.5 and 0.2 points on average for the Kazakh-English and English-Kazakh pairs, respectively. Furthermore, the proposed CSE-based segmentation reduces the vocabulary volume by a factor of more than two, i.e. from 28,000 to 13,000, which will be crucial when a larger volume of source corpora is available.

Conclusion
In this study, we developed a CSE-based segmentation method and investigated its applicability to the Kazakh, Kyrgyz, and Uzbek languages. These languages, similar to all the Turkic languages, have four types of affixes for forming endings. Consequently, the proposed CSE-based segmentation approach could easily be applied to other languages in the Turkic language family. Computational experiments were conducted using the proposed CSE-based segmentation for NMT of the Kazakh language. In comparison with the BPE-based segmentation method, the proposed CSE-based segmentation method reduced the NMT vocabulary volume by more than a factor of two and increased the BLEU score by 0.5 and 0.2 points on average for the Kazakh-English and English-Kazakh pairs, respectively. This reduction in the NMT vocabulary size becomes significant when the size of the source parallel corpora is increased to improve the quality of NMT learning. However, the small size of the available corpora in Turkic languages, other than Turkish, significantly limits the application of the proposed method.
In the future, corpora for other Turkic languages should be collected, the CSE-model of morphology for other Turkic languages should be used in the segmentation task for NMT of these languages, and NMT transfer-learning experiments should be conducted for languages from other subgroups of the Turkic languages. Furthermore, the possibility of using the CSE-model for lexicon-free stemming in information retrieval for Turkic languages should be investigated, and the CSE-model and the lexicon-free stemming algorithm should be used for morphological analysis of Turkic languages and for pre-processing corpora of the Turkic languages for tagging. The proposed segmentation method will also be investigated and applied to processing the morphology of other agglutinative languages, such as Tatar and Karakalpak.