Improved Arabic–Chinese Machine Translation with Linguistic Input Features

This study presents linguistically augmented models of phrase-based statistical machine translation (PBSMT) that use different linguistic features (factors) on top of the source surface form. The architecture addresses two major problems in machine translation: the poor performance of direct translation from a highly inflected, morphologically complex language into a morphologically poor language, and data sparseness, which becomes a significant challenge under low-resource conditions. We use three factors (lemma, part-of-speech tags, and morphological features) to enrich the input side with additional information and improve the quality of direct translation from Arabic to Chinese, considering the importance and global presence of this language pair as well as the limited amount of work on machine translation between these two languages. To deal with out-of-vocabulary (OOV) words and missing words, we propose the best combination of factors and models based on alternative paths. The proposed models are compared with the standard PBSMT model, which serves as the baseline of this work, and with two enhanced approaches tokenized by a state-of-the-art external tool that has been proven useful for Arabic as a morphologically rich and complex language. The experiments were performed with the Moses decoder on freely available data extracted from a multilingual corpus of United Nations documents (MultiUN). Results of a preliminary evaluation in terms of BLEU scores show that the use of linguistic features on the Arabic side considerably outperforms the baseline and tokenized approaches; the system also consistently reduces the OOV rate.


Introduction
With the rapid development of communication technology and economic globalization, translation between languages has become increasingly frequent, thereby drawing growing attention to machine translation (MT). Consequently, as a subtopic of artificial intelligence, MT has achieved considerable progress in recent decades. The primary goal of researchers in this field is to make the translation quality of machines closer to that of humans.
Several approaches have been demonstrated to be useful in MT. Statistical MT (SMT) is one of the most widely supported approaches, and phrase-based SMT (PBSMT) is considered state-of-the-art MT [1]. PBSMT uses sequences of words, aptly called "phrases" or "blocks," rather than single words. This approach segments the source content into phrases and translates these phrases into the target language. The results of the proposed models are compared with multiple standard PBSMT systems using different preprocessing techniques.

Challenges and Approach
In this section, we explore the challenges and motivation in building an Arabic → Chinese MT model and the primary methods used in this framework.

Linguistic Issues
Arabic and Chinese belong to different language families. Arabic is a morphologically rich and complex language with a typical verb-subject-object (VSO) order. Although VSO is the typical word order of Arabic, the corpora used show a mixed distribution of VSO, subject-verb-object (SVO), and verb-object-subject (VOS) clauses (VOS order is admitted in specific contexts, for instance when a pronoun expresses the object). By contrast, Chinese lacks morphology and exhibits a systematic SVO word order. Word order in Chinese is considerably closer to that of English than to that of Arabic. Figure 1 shows an example of the typical Arabic order and the Chinese order. The different word orders of this language pair make accurate word alignment in MT difficult to achieve. This issue causes data sparseness problems, given that numerous words are misaligned or cannot be found in the training corpus, thereby leading to low-quality translation results.
Furthermore, dealing with morphologically rich languages is difficult in SMT. A word in Arabic may take various morphological forms to represent different meanings (affixes, stems, and clitics) with several types of morphological features, such as person, number, case, and tense. Moreover, optional diacritical marks in Arabic words cause further ambiguity. These features of Arabic pose additional challenges, as they worsen the data sparsity issue, increase the number of OOV words, and consequently complicate word-level alignment in translation between Arabic and another language.
In contrast, Chinese is considered a morphology-poor language. A Chinese word can be composed of a single character or multiple characters without word boundaries, and it has a complicated system of quantifier formation and verbal aspects. The authors in [18] discussed in detail the challenges of building a direct MT approach between Arabic and Chinese.
Consequently, the morphology analysis of word units becomes increasingly necessary to improve the translation results between Arabic and Chinese. In this work, we present solutions for these issues by using preprocessing tokenization schemes and injecting linguistic factors into the source side of the translation corpora.
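The OOV problem described above can be made concrete with a small measurement. The following is a minimal sketch, assuming that the OOV rate is the share of test-side words never seen in the training corpus; the function name `oov_rate` and the toy transliterated tokens are illustrative only, and the exact definition used in the experiments may differ.

```python
from collections import Counter


def oov_rate(train_tokens, test_tokens):
    """Return (type_rate, token_rate) for test words unseen in training."""
    vocab = set(train_tokens)
    test_counts = Counter(test_tokens)
    if not test_counts:
        return 0.0, 0.0
    # Distinct test words that never occur in the training data.
    oov_types = [w for w in test_counts if w not in vocab]
    # Running-word occurrences of those unseen words.
    oov_tokens = sum(test_counts[w] for w in oov_types)
    type_rate = len(oov_types) / len(test_counts)
    token_rate = oov_tokens / sum(test_counts.values())
    return type_rate, token_rate


# Toy example: the inflected form "yktb" is unseen although "ktb" is known.
train = ["ktb", "ktAb", "mn"]
test = ["ktb", "yktb", "yktb", "mn"]
type_rate, token_rate = oov_rate(train, test)
```

Splitting clitics or backing off to lemmas tends to lower both rates, because several inflected surface forms collapse onto units that already occur in the training data.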

Arabic Preprocessing
The morphological complexity of Arabic makes it difficult to achieve high-quality, accurate translation between Arabic and other languages. To facilitate the translation process, we first need to analyze the morphological and orthographic features of each Arabic word by using preprocessing techniques, aptly called preprocessing schemes. Morphological preprocessing has been proven useful for Arabic [12]; this procedure helps reduce data sparseness, thereby improving translation quality. In this work, we use three preprocessing schemes to compare the effects of different techniques on Arabic-to-Chinese translation.
Future Internet 2019, 11, 22
For the baseline approach, we use the Moses tokenization script tokenizer.perl [34], which simply separates punctuation marks on the Arabic side using the same defaults as English tokenization.
We also experiment with MADAMIRA [35], a state-of-the-art morphological analyzer for the Arabic language, to implement the Penn Arabic Treebank (ATB) scheme as morphology-aware tokenization. Except for the definite article, the ATB scheme tokenizes all clitics on the Arabic side. It also normalizes the '(' and ')' characters into "-LRB-" and "-RRB-" by default, in addition to the well-known Alif/Ya Arabic normalization. This scheme performs better in Arabic-English MT compared with other schemes [36].
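The normalization steps mentioned above can be illustrated with a minimal sketch. It is not MADAMIRA's implementation: it assumes one common convention for Alif/Ya normalization (hamzated and madda Alif forms become bare Alif, and Alif Maqsura becomes Ya) together with the Penn Treebank bracket tokens, and the function names are hypothetical.

```python
import re

# Assumed convention: Alif with madda/hamza (U+0622, U+0623, U+0625) -> bare Alif.
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
BARE_ALIF = "\u0627"
ALIF_MAQSURA = "\u0649"
YA = "\u064A"
# Penn Treebank style bracket tokens.
BRACKETS = {"(": "-LRB-", ")": "-RRB-"}


def normalize_token(token):
    """Normalize one whitespace-delimited token."""
    if token in BRACKETS:
        return BRACKETS[token]
    token = ALIF_VARIANTS.sub(BARE_ALIF, token)
    return token.replace(ALIF_MAQSURA, YA)


def normalize_line(line):
    """Normalize every token of an already tokenized sentence."""
    return " ".join(normalize_token(t) for t in line.split())
```

Collapsing these orthographic variants reduces the number of distinct surface forms, which is one reason normalization is usually applied before word alignment.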
Furthermore, we use a new tokenization scheme of MADAMIRA, the D3* scheme proposed by [22], which tends to exhibit better performance in Arabic-to-Chinese translation and in translation into several other languages. Compared with the ATB scheme, D3* first requires tokenization according to the D3 scheme, which tokenizes all clitics and splits off the definite article ال (Al) "the" (Arabic transliteration is presented according to the Habash-Soudi-Buckwalter scheme [37]), and then removes the definite article from the Arabic context.

Linguistic Input Features
Many previous works have demonstrated the benefits of the factored model as an extension of PBSMT between different languages, which annotates the source side and/or target side with linguistic features. In this work, we focus on the annotation of the source side (i.e., Arabic) using some popular linguistic features to determine whether these features will improve the quality of Arabic → Chinese translation. We use MADAMIRA, which analyzes the Arabic context by applying rule-based and supervised learning techniques to implement the morphological features of the input word. Several types of linguistic features can be obtained by MADAMIRA for Arabic. Here, we discuss additional details about the individual factors we use in our experiments.

Lemma:
Arabic is a highly inflected language, which makes Arabic lemmatization important in the preprocessing step to enhance the information extraction results. The inflections of Arabic words are generated by adding prefixes, suffixes, and vowels to the root. For example, the word ‫ﻭﺳﻴﺨﺒﺮﻭﻧﻬﺎ‬ (wsyxbrwnhA) "and they will tell her" has the prefix (ws) "and will" and suffix (wnhA) "they her", which are both attached to the root (xbr) that basically means "telling." This word also has the stem (yxbr) "tell" and the lemma (xbr) "the concept of telling".
Given the complexity of Arabic, the lemma exhibits better performance in Arabic information retrieval than the root or the stem. The root leads to low precision, and the stem varies across forms of the same word: for example, the stems of broken plurals differ from those of their singular patterns, and imperfect verbs have different stems from their perfect verbs.
Several natural language processing (NLP) tools can be used to output lemmas for Arabic words. In this work, we use the MADAMIRA morphological lemmatizer. The reported lemmatization accuracy of the MADAMIRA system for Modern Standard Arabic is 96.2%, evaluated on a dataset extracted from the Penn Arabic Treebank. Each Arabic token in our corpora has a diacritized lemma, as shown in the example of Table 1.

POS:
POS tags play an essential role in different NLP tasks, such as MT. These tags provide the linguistic knowledge and syntactic role of each token in the context, which helps in information extraction and reduces data ambiguity.
The basic POS for Arabic words has many categories: noun, e.g., كتاب (ktAb) "book," كتب (ktb) "books" in plural form, and مكتوب (mktwb) "writing or message"; verb, e.g., يكتب (yktb) "write" or كتب (ktb) "wrote" in past form; adjective, e.g., مكتوب (mktwb) "written or fated"; and particle, e.g., من (mn) "from" or إلى (Ǎlý) "to." From the preceding examples, we can see that the word كتب (ktb) can be used as the plural noun "books" and as the verb "wrote" in past form. Meanwhile, the word مكتوب (mktwb) can be used as the noun "message" or the adjective "fated." In such situations, POS tagging helps analyze and distinguish word meaning, which is aptly called "word sense" in the translation corpora.
In this work, we extract POS tags for the input tokens of Arabic by using the MADAMIRA morphological analyzer, which has been considered a state-of-the-art Arabic tagger with a POS accuracy of 96.91% [38]. As shown in Table 1, the results of the MADAMIRA tagger annotate each word by its POS tag.

Morph Features:
MT approaches suffer from data sparseness problems when translating into or from morphologically rich and complex languages, such as Arabic. Thus, morphology analysis is necessary to handle data sparseness and improve translation quality.
Different word types in the Arabic language have various sets of morph features. For example, verbs have person, gender, number, voice, and mood, whereas nouns have case, state, gender, gloss, number, and the attached proclitic DET. Concatenative speech includes affixes and stems, whereas templatic speech has root and patterns.
To enable our approach to utilize the advantage of linguistic knowledge, we use the MADAMIRA analyzer to annotate the Arabic input with morph features because it provides the structure and form of each word in the corpus (see Table 1).

Factored Translation Model
The factored model extends standard PBSMT by using a log-linear approach to combine language, reordering, translation, and generation models. The factored model is based on the integration of rich linguistic features into the translation model, where each word is no longer just a token but a vector of factors that provide various levels of knowledge.
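For reference, the log-linear combination mentioned above is conventionally written as follows. The notation is assumed here rather than taken from the original text: f is the source sentence, e a candidate translation, h_i the n component models (language, reordering, translation, and generation), and λ_i their tuned weights.

```latex
\[
p(e \mid f) =
\frac{\exp\left(\sum_{i=1}^{n} \lambda_i h_i(e, f)\right)}
     {\sum_{e'} \exp\left(\sum_{i=1}^{n} \lambda_i h_i(e', f)\right)},
\qquad
\hat{e} = \arg\max_{e} \sum_{i=1}^{n} \lambda_i h_i(e, f)
\]
```

Because the denominator does not depend on e, decoding only needs the weighted sum of feature scores, which is why additional factor-based models can be added as extra h_i terms.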
The following example is extracted from our factored corpus to show how the Arabic word الوثائق (AlwθAŷq) "documents" is integrated and aligned using the format of surface|factor1|factor2|factor3|...

الوثائق|وَثِيقَة|noun|DET+NOUN+CASE_DEF_NOM
Although these factors do not have particular meanings within the model, they enrich the translation model with general representations to overcome the problems of data sparseness in limited training data. Moreover, these factors allow the direct modeling of many translation aspects.
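The surface|factor1|factor2|factor3 format shown in the example can be produced and read back with a few lines of code. This is a minimal sketch under stated assumptions: the factor order (surface, lemma, POS, morph features) follows the example, while the function names and the validation rules are hypothetical and not part of Moses or MADAMIRA.

```python
FIELDS = ("surface", "lemma", "pos", "morph")


def to_factored(surface, lemma, pos, morph):
    """Join the four factors of one word into a single factored token."""
    factors = (surface, lemma, pos, morph)
    for value in factors:
        # A factor must be non-empty and must not contain the separator
        # or whitespace, otherwise the token could not be split again.
        if not value or "|" in value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid factor value: {value!r}")
    return "|".join(factors)


def from_factored(token):
    """Split a factored token back into a dict keyed by factor name."""
    parts = token.split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} factors, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def annotate_sentence(analyses):
    """Build one factored corpus line from per-word factor tuples."""
    return " ".join(to_factored(*analysis) for analysis in analyses)


word = ("الوثائق", "وَثِيقَة", "noun", "DET+NOUN+CASE_DEF_NOM")
line = annotate_sentence([word])
```

Keeping every factor free of spaces and of the "|" separator matters because the corpus line is split on whitespace into words and on "|" into factors.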
In this work, we extract POS tags for the input tokens of Arabic by using the MADAMIRA morphological analyzer, which has been considered a state-of-the-art Arabic tagger with a POS accuracy of 96.91% [38]. As shown in Table 1, the results of the MADAMIRA tagger annotate each word by its POS tag.

Morph Features：
MT approaches suffer from data sparseness problems when translating into or from morphologically rich and complex languages, such as Arabic. Thus, morphology analysis is necessary to handle data sparseness and improve translation quality.
Different word types in the Arabic language have various sets of morph features. For example, verbs have person, gender, number, voice, and mood, whereas nouns have case, state, gender, gloss, number, and the attached proclitic DET. Concatenative speech includes affixes and stems, whereas templatic speech has root and patterns.
To enable our approach to utilize the advantage of linguistic knowledge, we use the MADAMIRA analyzer to annotate the Arabic input with morph features because it provides the structure and form of each word in the corpus (see Table 1).

Factored Translation Model
The factored model represents the extension of standard PBSMT by using a log-linear approach to combine language, reordering, translation, and generation models. The factored model is based on the integration of rich linguistic features into the translation model, where the word form not only becomes a token but also a vector of factors that provide various knowledge levels.
The following example is extracted from our factored corpus to show how the Arabic word ‫ﺍﻟﻮﺛﺎﺋﻖ‬ (AlwθAŷq) "documents" is integrated and aligned using the format of surface|factor1|factor2|factor3|...

‫َﺔ|ﺍﻟﻮﺛﺎﺋﻖ‬ ‫ِﻴﻘ‬ ‫ﺛ‬ َ ‫|ﻭ‬noun|DET+NOUN+CASE_DEF_NOM
Although these factors do not have particular meanings within the model, they enrich the translation model with general representations to overcome the problems of data sparseness in limited training data. Moreover, these factors allow the direct modeling of many translation aspects, (mktwb) "written or fated;" and particle, e.g., Future Internet 2018, 10, x FOR PEER REVIEW 7 of 17 ‫ﻛﺘﺐ‬ (ktb) "wrote" in past form; adjective, e.g., ‫ﻣﻜﺘﻮﺏ‬ (mktwb) "written or fated;" and particle, e.g., ‫ﻣﻦ‬ (mn) "from" or ‫ﺇﻟﻰ‬ (Ǎlý) "to." From the preceding examples, we can see that the word ‫ﻛﺘﺐ‬ (ktb) "books" can be used as a plural noun and as the verb "wrote" in past form. Meanwhile, the word ‫ﻣﻜﺘﻮﺏ‬ (mktwb) can be used as a noun "message" or an adjective "fated". In such situation, POS tagging helps analyze and distinguish word meaning, which is optically called "word sense" in the translation corpora.
In this work, we extract POS tags for the input tokens of Arabic by using the MADAMIRA morphological analyzer, which has been considered a state-of-the-art Arabic tagger with a POS accuracy of 96.91% [38]. As shown in Table 1, the results of the MADAMIRA tagger annotate each word by its POS tag.

Morph Features：
MT approaches suffer from data sparseness problems when translating into or from morphologically rich and complex languages, such as Arabic. Thus, morphology analysis is necessary to handle data sparseness and improve translation quality.
Different word types in the Arabic language have various sets of morph features. For example, verbs have person, gender, number, voice, and mood, whereas nouns have case, state, gender, gloss, number, and the attached proclitic DET. Concatenative speech includes affixes and stems, whereas templatic speech has root and patterns.
To enable our approach to utilize the advantage of linguistic knowledge, we use the MADAMIRA analyzer to annotate the Arabic input with morph features because it provides the structure and form of each word in the corpus (see Table 1).

Factored Translation Model
The factored model represents the extension of standard PBSMT by using a log-linear approach to combine language, reordering, translation, and generation models. The factored model is based on the integration of rich linguistic features into the translation model, where the word form not only becomes a token but also a vector of factors that provide various knowledge levels.
The following example is extracted from our factored corpus to show how the Arabic word ‫ﺍﻟﻮﺛﺎﺋﻖ‬ (AlwθAŷq) "documents" is integrated and aligned using the format of surface|factor1|factor2|factor3|...

‫َﺔ|ﺍﻟﻮﺛﺎﺋﻖ‬ ‫ِﻴﻘ‬ ‫ﺛ‬ َ ‫|ﻭ‬noun|DET+NOUN+CASE_DEF_NOM
Although these factors do not have particular meanings within the model, they enrich the translation model with general representations to overcome the problems of data sparseness in limited training data. Moreover, these factors allow the direct modeling of many translation aspects, ‫ﻛﺘﺐ‬ (ktb) "wrote" in past form; adjective, e.g., ‫ﻣﻜﺘﻮﺏ‬ (mktwb) "written or fated;" and particle, e.g., ‫ﻣﻦ‬ (mn) "from" or ‫ﺇﻟﻰ‬ (Ǎlý) "to." From the preceding examples, we can see that the word ‫ﻛﺘﺐ‬ (ktb) "books" can be used as a plural noun and as the verb "wrote" in past form. Meanwhile, the word ‫ﻣﻜﺘﻮﺏ‬ (mktwb) can be used as a noun "message" or an adjective "fated". In such situation, POS tagging helps analyze and distinguish word meaning, which is optically called "word sense" in the translation corpora.
In this work, we extract POS tags for the input tokens of Arabic by using the MADAMIRA morphological analyzer, which has been considered a state-of-the-art Arabic tagger with a POS accuracy of 96.91% [38]. As shown in Table 1, the results of the MADAMIRA tagger annotate each word by its POS tag.

Morph Features：
MT approaches suffer from data sparseness problems when translating into or from morphologically rich and complex languages, such as Arabic. Thus, morphology analysis is necessary to handle data sparseness and improve translation quality.
Different word types in the Arabic language have various sets of morph features. For example, verbs have person, gender, number, voice, and mood, whereas nouns have case, state, gender, gloss, number, and the attached proclitic DET. Concatenative speech includes affixes and stems, whereas templatic speech has root and patterns.
To enable our approach to utilize the advantage of linguistic knowledge, we use the MADAMIRA analyzer to annotate the Arabic input with morph features because it provides the structure and form of each word in the corpus (see Table 1).

Factored Translation Model
The factored model represents the extension of standard PBSMT by using a log-linear approach to combine language, reordering, translation, and generation models. The factored model is based on the integration of rich linguistic features into the translation model, where the word form not only becomes a token but also a vector of factors that provide various knowledge levels.
The following example is extracted from our factored corpus to show how the Arabic word ‫ﺍﻟﻮﺛﺎﺋﻖ‬ (AlwθAŷq) "documents" is integrated and aligned using the format of surface|factor1|factor2|factor3|...

‫َﺔ|ﺍﻟﻮﺛﺎﺋﻖ‬ ‫ِﻴﻘ‬ ‫ﺛ‬ َ ‫|ﻭ‬noun|DET+NOUN+CASE_DEF_NOM
Although these factors do not have particular meanings within the model, they enrich the translation model with general representations to overcome the problems of data sparseness in limited training data. Moreover, these factors allow the direct modeling of many translation aspects, ‫ﻛﺘﺐ‬ (ktb) "wrote" in past form; adjective, e.g., ‫ﻣﻜﺘﻮﺏ‬ (mktwb) "written or fated;" and particle, e.g., ‫ﻣﻦ‬ (mn) "from" or ‫ﺇﻟﻰ‬ (Ǎlý) "to." From the preceding examples, we can see that the word ‫ﻛﺘﺐ‬ (ktb) "books" can be used as a plural noun and as the verb "wrote" in past form. Meanwhile, the word ‫ﻣﻜﺘﻮﺏ‬ (mktwb) can be used as a noun "message" or an adjective "fated". In such situation, POS tagging helps analyze and distinguish word meaning, which is optically called "word sense" in the translation corpora.
In this work, we extract POS tags for the input tokens of Arabic by using the MADAMIRA morphological analyzer, which has been considered a state-of-the-art Arabic tagger with a POS accuracy of 96.91% [38]. As shown in Table 1, the results of the MADAMIRA tagger annotate each word by its POS tag.

Morph Features：
MT approaches suffer from data sparseness problems when translating into or from morphologically rich and complex languages, such as Arabic. Thus, morphology analysis is necessary to handle data sparseness and improve translation quality.
Different word types in the Arabic language have various sets of morph features. For example, verbs have person, gender, number, voice, and mood, whereas nouns have case, state, gender, gloss, number, and the attached proclitic DET. Concatenative speech includes affixes and stems, whereas templatic speech has root and patterns.
To enable our approach to utilize the advantage of linguistic knowledge, we use the MADAMIRA analyzer to annotate the Arabic input with morph features because it provides the structure and form of each word in the corpus (see Table 1).

Factored Translation Model
The factored model represents the extension of standard PBSMT by using a log-linear approach to combine language, reordering, translation, and generation models. The factored model is based on the integration of rich linguistic features into the translation model, where the word form not only becomes a token but also a vector of factors that provide various knowledge levels.
The following example is extracted from our factored corpus to show how the Arabic word ‫ﺍﻟﻮﺛﺎﺋﻖ‬ (AlwθAŷq) "documents" is integrated and aligned using the format of surface|factor1|factor2|factor3|...

‫َﺔ|ﺍﻟﻮﺛﺎﺋﻖ‬ ‫ِﻴﻘ‬ ‫ﺛ‬ َ ‫|ﻭ‬noun|DET+NOUN+CASE_DEF_NOM
Although these factors do not have particular meanings within the model, they enrich the translation model with general representations to overcome the problems of data sparseness in limited training data. Moreover, these factors allow the direct modeling of many translation aspects, (ktb) "books" can be used as a plural noun and as the verb "wrote" in past form. Meanwhile, the word Future Internet 2018, 10, x FOR PEER REVIEW 7 of 17 ‫ﻛﺘﺐ‬ (ktb) "wrote" in past form; adjective, e.g., ‫ﻣﻜﺘﻮﺏ‬ (mktwb) "written or fated;" and particle, e.g., ‫ﻣﻦ‬ (mn) "from" or ‫ﺇﻟﻰ‬ (Ǎlý) "to." From the preceding examples, we can see that the word ‫ﻛﺘﺐ‬ (ktb) "books" can be used as a plural noun and as the verb "wrote" in past form. Meanwhile, the word ‫ﻣﻜﺘﻮﺏ‬ (mktwb) can be used as a noun "message" or an adjective "fated". In such situation, POS tagging helps analyze and distinguish word meaning, which is optically called "word sense" in the translation corpora.
In this work, we extract POS tags for the input tokens of Arabic by using the MADAMIRA morphological analyzer, which has been considered a state-of-the-art Arabic tagger with a POS accuracy of 96.91% [38]. As shown in Table 1, the results of the MADAMIRA tagger annotate each word by its POS tag.

Morph Features:
MT approaches suffer from data sparseness when translating into or from morphologically rich and complex languages such as Arabic. Thus, morphological analysis is necessary to handle data sparseness and improve translation quality.
Different word types in Arabic carry different sets of morph features. For example, verbs have person, gender, number, voice, and mood, whereas nouns have case, state, gender, gloss, number, and the attached proclitic DET. Concatenative morphemes comprise affixes and stems, whereas templatic morphemes comprise roots and patterns.
To let our approach take advantage of linguistic knowledge, we use the MADAMIRA analyzer to annotate the Arabic input with morph features because it provides the structure and form of each word in the corpus (see Table 1).
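The morph-feature strings used in the factored corpus, such as DET+NOUN+CASE_DEF_NOM, are sequences of tags joined by "+". As a rough sketch, such a string can be split into its component tags; the helper names `split_morph` and `has_determiner` are illustrative assumptions and not part of MADAMIRA's interface.

```python
def split_morph(tag: str) -> list:
    """Split a '+'-joined morph-feature string into its component tags."""
    return [part for part in tag.split("+") if part]


def has_determiner(tag: str) -> bool:
    """Check whether the proclitic DET is among the component tags."""
    return "DET" in split_morph(tag)


if __name__ == "__main__":
    print(split_morph("DET+NOUN+CASE_DEF_NOM"))
    print(has_determiner("DET+NOUN+CASE_DEF_NOM"))
    print(has_determiner("PREP"))
```

Splitting on "+" keeps each tag intact, so a tag such as CASE_DEF_NOM stays a single unit.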

Factored Translation Model
The factored model extends standard PBSMT by using a log-linear approach to combine language, reordering, translation, and generation models. It integrates rich linguistic features into the translation model, so that each word is no longer a single token but a vector of factors that provide different levels of knowledge.
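For reference, the log-linear combination commonly used in phrase-based SMT can be written as follows. The notation (source sentence f, target sentence e, feature functions h_i, weights λ_i) is conventional and introduced here for illustration; it is not taken from this paper.

```latex
\hat{e} = \arg\max_{e} \; p(e \mid f), \qquad
p(e \mid f) = \frac{\exp\left(\sum_{i=1}^{n} \lambda_i \, h_i(e, f)\right)}
                   {\sum_{e'} \exp\left(\sum_{i=1}^{n} \lambda_i \, h_i(e', f)\right)}
```

In a factored setup, the feature functions h_i would include the language, reordering, translation, and generation model scores, each scaled by its weight λ_i. Because the denominator does not depend on e, the search only needs to maximize the weighted sum in the exponent.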
The following example, extracted from our factored corpus, shows how the Arabic word الوثائق (AlwθAŷq) "documents" is represented in the format surface|factor1|factor2|factor3|...

الوثائق|وَثِيقَة|noun|DET+NOUN+CASE_DEF_NOM
Although these factors do not have particular meanings within the model, they enrich the translation model with general representations that help overcome data sparseness in limited training data. Moreover, these factors allow the direct modeling of many translation aspects.
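The surface|factor1|factor2|factor3 format can be sketched as a small round-trip between a structured token and its "|"-joined string. The class `FactoredToken`, its field names, and the two helper functions are hypothetical; the factor order (lemma, POS, morph features) follows the three factors named in this paper.

```python
from typing import NamedTuple


class FactoredToken(NamedTuple):
    """One input word with its three linguistic factors (names assumed)."""
    surface: str
    lemma: str
    pos: str
    morph: str


def to_factored(tok: FactoredToken) -> str:
    """Join the factors with '|' in surface|lemma|pos|morph order."""
    return "|".join(tok)


def from_factored(text: str) -> FactoredToken:
    """Split a 'surface|lemma|pos|morph' string back into its factors."""
    fields = text.split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 factors, got {len(fields)}: {text!r}")
    return FactoredToken(*fields)


if __name__ == "__main__":
    tok = FactoredToken("الوثائق", "وَثِيقَة", "noun", "DET+NOUN+CASE_DEF_NOM")
    line = to_factored(tok)
    print(line)
    print(from_factored(line) == tok)
```

Since "|" is the separator, a sketch like this assumes that no factor value contains that character; a real pipeline would need to escape or replace it first.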

Table 1 entries recovered here (POS tag, morph features): verb, IV3FS+IV_PASS; prep, PREP; noun, DET+NOUN+NSUFF_FEM_SG+CASE_DEF_GEN. One transliterated word form, Alqmh, also survives.

Factored Translation Model
The factored model represents the extension of standard PBSMT by using a log-linear approach to combine language, reordering, translation, and generation models. The factored model is based on the integration of rich linguistic features into the translation model, where the word form not only becomes a token but also a vector of factors that provide various knowledge levels.
The following example is extracted from our factored corpus to show how the Arabic word Future Internet 2018, 10, x FOR PEER REVIEW 7 of 17 ‫ﻛﺘﺐ‬ (ktb) "wrote" in past form; adjective, e.g., ‫ﻣﻜﺘﻮﺏ‬ (mktwb) "written or fated;" and particle, e.g., ‫ﻣﻦ‬ (mn) "from" or ‫ﺇﻟﻰ‬ (Ǎlý) "to." From the preceding examples, we can see that the word ‫ﻛﺘﺐ‬ (ktb) "books" can be used as a plural noun and as the verb "wrote" in past form. Meanwhile, the word ‫ﻣﻜﺘﻮﺏ‬ (mktwb) can be used as a noun "message" or an adjective "fated". In such situation, POS tagging helps analyze and distinguish word meaning, which is optically called "word sense" in the translation corpora.
In this work, we extract POS tags for the input tokens of Arabic by using the MADAMIRA morphological analyzer, which has been considered a state-of-the-art Arabic tagger with a POS accuracy of 96.91% [38]. As shown in Table 1, the results of the MADAMIRA tagger annotate each word by its POS tag.

Morph Features：
MT approaches suffer from data sparseness problems when translating into or from morphologically rich and complex languages, such as Arabic. Thus, morphology analysis is necessary to handle data sparseness and improve translation quality.
Different word types in the Arabic language have various sets of morph features. For example, verbs have person, gender, number, voice, and mood, whereas nouns have case, state, gender, gloss, number, and the attached proclitic DET. Concatenative speech includes affixes and stems, whereas templatic speech has root and patterns.
To enable our approach to utilize the advantage of linguistic knowledge, we use the MADAMIRA analyzer to annotate the Arabic input with morph features because it provides the structure and form of each word in the corpus (see Table 1).

Factored Translation Model
The factored model represents the extension of standard PBSMT by using a log-linear approach to combine language, reordering, translation, and generation models. The factored model is based on the integration of rich linguistic features into the translation model, where the word form not only becomes a token but also a vector of factors that provide various knowledge levels.
The following example is extracted from our factored corpus to show how the Arabic word ‫ﺍﻟﻮﺛﺎﺋﻖ‬ (AlwθAŷq) "documents" is integrated and aligned using the format of surface|factor1|factor2|factor3|...

‫َﺔ|ﺍﻟﻮﺛﺎﺋﻖ‬ ‫ِﻴﻘ‬ ‫ﺛ‬ َ ‫|ﻭ‬noun|DET+NOUN+CASE_DEF_NOM
Although these factors do not have particular meanings within the model, they enrich the translation model with general representations to overcome the problems of data sparseness in limited training data. Moreover, these factors allow the direct modeling of many translation aspects, such as morphological, semantic, and syntactic levels [12]. For instance, words with the same lemma allow their inflectional variants to share a better representation in the model, which helps reduce the data sparseness problem. The POS tags are also beneficial for disambiguation. Figure 2 shows the diagram of our factored model approach, in which we report on experiments using features such as lemma, POS, and morph features as additional annotations apart from the surface form. We compare the effects of several configurations based on these input factors.
Although the generation step can be used in factored models to generate the probabilities of translation components based on target factors, this framework does not use a generation step because we focus on source factors; instead, a word and its factors are assigned the same probability. The authors in [39] indicated that sparse data arise when no one-to-one mapping exists between the surface+factor on the source side and the inflected word on the target side. To bridge this gap and reduce data sparseness, we test alternative paths in the decoding step using multiple decoding paths and backoff configurations to cover occurrences wherein a word appears with a factor that it has not been trained with. Additional details about these configurations are provided in Section 5.

Data Preparation
For our experiments, we extract the Arabic-Chinese corpus from the Multi-UN corpus [40], which has been used in many MT studies. We remove suspicious sentences with wrong-language text or a considerable number of non-alphanumeric symbols. We use the Moses script remove-non-printing-char.perl to remove non-printing characters, and clean-corpus.perl to eliminate sentences that are longer than 80 tokens.
To best capture the effects of the preprocessing and linguistic features, we experiment with a medium-sized dataset, where the data sparsity problem becomes more relevant [22]. For training, we use 200,000 lines, with 1600 lines as development data and 2000 lines for evaluation. Table 2 summarizes the statistics of our parallel corpus in words and sentences.
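The cleaning and splitting steps described above can be approximated in pure Python as follows. This is a sketch under stated assumptions, not a reimplementation of the Moses scripts: the function names are hypothetical, only control characters (Unicode category Cc) are treated as non-printing, and the contiguous train/dev/test split is an assumption, since the text does not say how the lines were selected.

```python
import unicodedata
from typing import List, Sequence, Tuple

MAX_TOKENS = 80  # sentence length limit stated in the text


def strip_non_printing(line: str) -> str:
    """Replace control characters with spaces and collapse whitespace."""
    spaced = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in line)
    return " ".join(spaced.split())


def keep_pair(src: str, tgt: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Keep a sentence pair only if both sides are non-empty and short enough."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    return 0 < n_src <= max_tokens and 0 < n_tgt <= max_tokens


def clean_corpus(pairs: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    cleaned = []
    for src, tgt in pairs:
        src, tgt = strip_non_printing(src), strip_non_printing(tgt)
        if keep_pair(src, tgt):
            cleaned.append((src, tgt))
    return cleaned


def split_corpus(pairs: Sequence, n_train: int, n_dev: int, n_test: int):
    """Contiguous split (an assumption): first n_train, then n_dev, then n_test."""
    if len(pairs) < n_train + n_dev + n_test:
        raise ValueError("not enough sentence pairs for the requested split")
    train = list(pairs[:n_train])
    dev = list(pairs[n_train:n_train + n_dev])
    test = list(pairs[n_train + n_dev:n_train + n_dev + n_test])
    return train, dev, test
```

With the sizes given in the text, the call would be `split_corpus(pairs, 200_000, 1600, 2000)`. Token counts on the Chinese side are only meaningful after word segmentation.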

Chinese Segmentation
Given that Chinese words are composed of one or multiple characters without spaces between words, segmentation is an essential task in the preprocessing steps of MT [41]. Chinese segmentation separates the words in a sentence so that words can be mapped one-to-one between the parallel phrases; segmentation is therefore necessary for word alignment. However, segmentation errors caused by the segmentation scheme lead to word-mismatching problems, which in turn affect the translation quality. In this work, we use the Stanford Word Segmenter (http://nlp.stanford.edu/software/segmenter.shtml) for Chinese [42], which is a conditional random field-based segmenter that splits Chinese text into words separated by spaces according to the Chinese Treebank segmentation standard. An example of a segmentation output is as follows:

Machine Translation (MT) Systems
In all models, we build translation systems using Moses, which is an open source MT toolkit. Word alignment is performed with GIZA++ [43] and symmetrized with the grow-diag-final-and heuristic, phrase pairs are extracted from the aligned corpus, and lexical reordering uses the msd-bidirectional-fe model. On the target side of the parallel corpora, we use KenLM [44] to build a 5-gram language model with memory mapping. Tuning is performed with minimum error rate training (MERT). The experiments are discussed in detail for the standard PBSMT and factored MT systems.
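As a rough illustration of how the settings above map onto a Moses training call, the sketch below assembles an argument list for `train-model.perl` without running it. The helper name `build_training_command`, all paths, the file extensions `ar` and `zh`, and the factor specification `0,2-0` are assumptions made for the example; the exact command line used in this work is not given in the text.

```python
from typing import List, Optional


def build_training_command(
    root_dir: str,
    corpus_stem: str,
    lm_path: str,
    external_bin_dir: str,
    translation_factors: Optional[str] = None,
    input_factor_max: int = 0,
) -> List[str]:
    """Assemble (but do not execute) a Moses train-model.perl argument list."""
    cmd = [
        "train-model.perl",
        "-root-dir", root_dir,
        "-corpus", corpus_stem,            # stem of corpus_stem.ar / corpus_stem.zh
        "-f", "ar", "-e", "zh",            # assumed file extensions
        "-alignment", "grow-diag-final-and",
        "-reordering", "msd-bidirectional-fe",
        "-lm", f"0:5:{lm_path}:8",         # factor 0, order 5, type 8 (KenLM)
        "-external-bin-dir", external_bin_dir,
    ]
    if translation_factors is not None:
        # Source-side factors only, e.g. "0,2-0" maps source factors 0 and 2
        # onto target factor 0 (the surface form).
        cmd += [
            "--translation-factors", translation_factors,
            "--input-factor-max", str(input_factor_max),
        ]
    return cmd


if __name__ == "__main__":
    baseline = build_training_command("work", "corpus/train", "lm/zh.blm", "tools")
    factored = build_training_command(
        "work", "corpus/train.factored", "lm/zh.blm", "tools",
        translation_factors="0,2-0", input_factor_max=2,
    )
    print(" ".join(baseline))
    print(" ".join(factored))
```

Building the command as a list keeps the baseline and factored runs identical except for the factor options, which makes the comparison between systems easier to audit.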

Phrase-Based MT Models
In this approach, we run three different tokenizations for the Arabic side to evaluate the translation output through preprocessing schemes in the standard PBSMT system.
Baseline: This experiment is the baseline of all our work, in which we use minimal preprocessing on the Arabic side by applying the default Moses tokenizer to separate punctuation marks from Arabic words.
Tokenized-ATB: The Arabic corpus is tokenized and normalized using the ATB scheme of the MADAMIRA morphological analyzer. Except for the definite article, this process tokenizes all clitics added to an Arabic word. It also normalizes according to the default (Alif/Ya) normalization; an example is shown in Figure 3.
Tokenized-D3*: This scheme follows the D3 scheme, which tokenizes all clitics and also splits the definite article that is attached to the Arabic word. Because Chinese has no definite article, we go one step further and use MADAMIRA to remove all definite articles from the Arabic corpus, which brings the Arabic side closer to Chinese.

Figure 3. Step-wise result of the Penn Arabic Treebank (ATB) approach. The red letters are normalized and tokenized by the approach; Chinese pinyin is presented to ease readability.

Factored Models
This part discusses the core of this work. We annotate the Arabic side by adding linguistic features on top of the surface forms in the translation model when translating Arabic into Chinese to optimize the translation results. These features provide rich representation and sufficient knowledge for nearly any required transformation. In this framework, we use three features (lemma, POS, and morph) for Arabic to conduct several experiments using different settings based on these features. Each word in the Arabic factored corpus is a vector of features that represent information levels. For Chinese, we use Chinese segmented words.
Surface+Lemma: This model incorporates the diacritized lemma into the input side to reduce data sparseness and provide improved generalization based on the inflectional variants of the same word. Mapping lemma and surface onto surface enhances performance.
Surface+POS: The translation option of this model is based on adding POS tags to the input corpus. POS tags help in disambiguation and in extracting additional information about the data.
Surface+Morph: To obtain better data representation and linguistic knowledge, we annotate each input word with several morph features, namely case, state, gender, gloss, number, person, voice, and mood. These features improve generalization ability.
Surface+Lemma+POS: In this integrated model, we incorporate lemma and POS features to solve the data sparseness and disambiguation problems and evaluate whether multiple factors also increase translation performance.
Surface+Lemma+Morph: Here, we combine two features (lemma and morph) on top of the surface form. The motivation is to obtain more flexible notions and improve the precision of the translation output.
Surface+POS+Morph: Each word in this model is represented by a vector of POS tags and morph features to enrich the input through additional knowledge, thereby addressing the data sparseness problem and enhancing generalization.
Surface+All Features: We also test the effect of integrating all available features (lemma, POS, morph) to determine whether the addition of such rich information will improve the translation quality of this language pair.
Multiple Decoding Paths Lemma/Morph: The factored model allows the use of multiple paths in parallel; that is, translation options originate from different phrase tables. We set up two translation models in this configuration. First, the surface-level model maps the surface onto the surface. Second, the lemma/morph model provides morphological analysis. Translation options from multiple tables compete in the decoding step. When the same translation is found in different tables, varying scores are used to create translation options for each occurrence. The translation model that uses the multiple decoding path (MDP) strategy becomes more robust; hence the input sentence is translated with higher probability.
Lemma Backoff: For the translation of morphologically rich and complex languages, such as Arabic, into simpler languages, such as Chinese, translating lemmas instead of the words that have not been observed in the training corpus is useful. This strategy is called the backoff model. In contrast to MDP, the decoder in the backoff model looks up each phrase in the phrase tables in order of priority. The first table is used as a priority table, whereas the second table is a backoff table for translations that are not found in the first one. In this framework, the surface-level phrase table is the priority table, whereas the lemma-level phrase table is used as the backoff table, as shown in Figure 4. This model helps decrease out-of-vocabulary rates and enhances translation quality.
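The difference between the two lookup strategies can be made concrete with a toy word-level sketch. It is a simplification under stated assumptions: real phrase tables hold multi-word phrases with several feature scores, the second table here is keyed by lemma only, and the table contents, scores, and function names are hypothetical placeholders rather than data from this work.

```python
from typing import Dict, List, Tuple

# Toy phrase table: source key -> list of (target, score) options.
PhraseTable = Dict[str, List[Tuple[str, float]]]


def options_backoff(surface: str, lemma: str,
                    surface_table: PhraseTable,
                    lemma_table: PhraseTable) -> List[Tuple[str, float]]:
    """Backoff: the lemma table is consulted only if the surface form is unknown."""
    if surface in surface_table:
        return list(surface_table[surface])
    return list(lemma_table.get(lemma, []))


def options_multiple_paths(surface: str, lemma: str,
                           surface_table: PhraseTable,
                           lemma_table: PhraseTable) -> List[Tuple[str, float, str]]:
    """MDP: options from both tables are collected and compete by score."""
    options = [(tgt, score, "surface") for tgt, score in surface_table.get(surface, [])]
    options += [(tgt, score, "lemma") for tgt, score in lemma_table.get(lemma, [])]
    return sorted(options, key=lambda opt: opt[1], reverse=True)


if __name__ == "__main__":
    surface_table = {"seen_form": [("T1", 0.6), ("T2", 0.3)]}
    lemma_table = {"LEMMA": [("T1", 0.5)]}
    # An inflected form absent from the surface table is still translated via its lemma.
    print(options_backoff("unseen_form", "LEMMA", surface_table, lemma_table))
    # With multiple paths, the same target can appear once per table with different scores.
    print(options_multiple_paths("seen_form", "LEMMA", surface_table, lemma_table))
```

In the backoff case an unseen surface form stops being an OOV as long as its lemma was observed, which is the mechanism behind the OOV reduction discussed later.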

Automatic Evaluation
We conducted 12 experiments on a subset of the Multi-UN corpus to evaluate the performance of standard phrase-based MT and the factored models for the Arabic-Chinese language pair. For evaluation, we used a test set with one reference throughout all our experiments.
To make the evaluation as fair as possible, minimize the effects of Chinese segmentation, and isolate the effects of Arabic pre-processing and features, the Chinese output was post-processed before evaluation by using the WMT17 script deseg.py (https://github.com/EdinburghNLP/wmt17-scripts/blob/master/en-zh/deseg.py) for Chinese desegmentation. This process merges the Chinese output by removing all spaces (except for spaces with an ASCII letter on both sides) and then converts ASCII commas and periods to their CJK Unicode equivalents.
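The desegmentation rule described above can be sketched as follows. This is a minimal reading of the prose description, not a copy of deseg.py: the function name `desegment` is hypothetical, and details that the description leaves open (for example, periods inside numbers) are handled in the simplest possible way, by converting every ASCII comma and period.

```python
import re


def _is_ascii_letter(ch: str) -> bool:
    return ch.isascii() and ch.isalpha()


def desegment(line: str) -> str:
    """Remove spaces unless an ASCII letter sits on both sides, then map , and . to CJK."""
    text = line.strip()

    def keep_or_drop(match):
        # After strip(), every run of spaces has a character on each side.
        before = text[match.start() - 1]
        after = text[match.end()]
        return " " if _is_ascii_letter(before) and _is_ascii_letter(after) else ""

    merged = re.sub(r" +", keep_or_drop, text)
    return merged.replace(",", "\uff0c").replace(".", "\u3002")


if __name__ == "__main__":
    print(desegment("我们 通过 了 决议 ."))
    print(desegment("联合国 UN Women 报告 , 完"))
```

Scoring on desegmented output means that two systems are compared on the same character sequence regardless of how their Chinese side was segmented.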
To evaluate the effect of the pre-processing schemes and linguistic factors on data sparsity, we measured the rates of OOV unigrams on the test set in terms of tokens and types. The tokens refer to the total number of words, while the types indicate the number of unique words in the text. Table 3 summarizes the findings.

Table 3. BLEU scores with the improvement over the baseline model, along with the effects of pre-processing schemes and linguistic factors on the test set according to OOV rates. Results of the statistical significance test are also reported, where a mark indicates a significant improvement over the baseline at the specified p-value. The best result for each metric (the highest BLEU and the lowest OOV rate) is highlighted in bold.
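The token-level and type-level OOV rates defined above can be computed as in the following sketch. The function name `oov_rates` and the toy data are illustrative assumptions; in practice the inputs would be the pre-processed Arabic training and test tokens under each scheme.

```python
from typing import Iterable, Tuple


def oov_rates(train_tokens: Iterable[str], test_tokens: Iterable[str]) -> Tuple[float, float]:
    """Return (token OOV %, type OOV %) of the test set relative to the training vocabulary."""
    vocab = set(train_tokens)
    test_list = list(test_tokens)
    test_types = set(test_list)
    if not test_list:
        return 0.0, 0.0
    # Tokens: every occurrence counts. Types: each unique word counts once.
    oov_token_count = sum(1 for tok in test_list if tok not in vocab)
    oov_type_count = len(test_types - vocab)
    token_rate = 100 * oov_token_count / len(test_list)
    type_rate = 100 * oov_type_count / len(test_types)
    return token_rate, type_rate


if __name__ == "__main__":
    train = "a b c a".split()
    test = "a d d e b".split()
    # 3 of 5 test tokens and 2 of 4 test types are unseen in training.
    print(oov_rates(train, test))
```

The type rate is usually much higher than the token rate because frequent words are almost always seen in training, while rare words make up most of the unique vocabulary.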

Results and Analysis
Table 3 shows the results for Arabic-to-Chinese translation on the extracted Multi-UN corpus. In PBSMT, we observe that the tokenization strategies (ATB and D3*) have a major effect in alleviating the data sparseness problem: they improve the translation quality and achieve higher BLEU scores than the baseline, with an advantage for the D3* scheme. The differences in BLEU scores between the baseline and both tokenization schemes are statistically significant (p-value < 0.05). This result confirms that Arabic preprocessing helps address data sparseness.
All the factored systems considerably outperform the baseline and achieve better or comparable performance relative to tokenized PBSMT. The differences in BLEU scores between tokenized phrase-based MT and the factored systems are not statistically significant, whereas the differences between the baseline and all the factored systems are statistically significant (p-value < 0.05). Moreover, using a single feature provides a better result than combining different features; in particular, the POS model outperforms all the other systems. The results confirm that the translation model benefits from additional linguistic information.
The MDP lemma/morph and lemma backoff models exhibit the largest reduction in OOV rates among all models. Although the tokenized models show better OOV rates than some of the factored models, OOV reduction is not always reflected in the BLEU score [32]. Another important observation is that most OOV words are proper nouns that were not found in the training corpus. In this case, transliteration of named entities would help improve the translation quality. According to these results, the factored models perform best overall across the scenarios we have presented.
To perform a manual analysis, we randomly selected sentences from the output of the baseline, the tokenized models, and the best factored models, as shown in Figure 5, and observed the following:

(1) Alleviate issues of dropping translations
The baseline and both tokenized models ignore or drop the translation of some words in the dataset, whereas this occurs at a lower rate in the factored models. This omission is due to the poor performance of SMT systems on words unobserved in the training corpus, and worse performance (dropping to 48.6%) on words that were seen only once [15].
Although an Arabic morphological analyzer is helpful in MT, it entails complicated pre-processing that makes system training cumbersome, which may increase the incidence of the dropping problem.
Factored models include most of these words by taking advantage of the syntactic information that enriches the training pipeline and yields more coherent sentences. Sentence 3 in Figure 5 shows an example where the words "eliminate those obstacles" were dropped by the baseline system, and in sentence 2 the words "have evolved over time" are dropped by both tokenized models. Unlike the baseline and tokenized models, the factored models take those words into consideration.

(2) Overcome the sparseness problem
The tokenized and factored models help decrease the OOV rates, whereas the baseline model suffers from this problem, which degrades its translation results. We attribute this to the morphological complexity and lexical diversity of Arabic: the morphological analysis tool helps by providing a list of word-level analyses, splitting prefixes and suffixes, and removing diacritization, which should increase the matching accuracy.
The use of linguistic features provides new word forms such as the lemma, improves the generalization ability, and allows the decoder to use multiple translation options based on the MDP or backoff model.
As an example, in sentence 3 of Figure 5, the diacritized word "plans" is mapped to OOV by the baseline model, whereas removing the diacritization in the tokenized models helped the decoder find the translation. In the lemma backoff model, the decoder has two options for producing the output, namely the surface-level and lemma-level phrase tables: if the decoder does not find the surface form during translation, it backs off to the lemma form to obtain the translation, thereby decreasing the OOV rates.

(3) The improvement of D3* scheme
Since Chinese does not have the definite article "the", removing the definite article in the D3* model makes Arabic as a source language more similar to Chinese as a target language, which improves the alignment, decreases the OOV rates, and in turn yields better performance. For example, in Figure 5, removing "the" from the words "the state, the necessary" in sentence 1 and "the government" in sentence 3 enhanced the translation performance of the D3* model.

(4) The advantages of linguistic factors
From the selected examples, we notice that the factored models obtain better quality and grammaticality, owing to the strength of the linguistic features, which help to characterize the words in a sentence precisely.
POS tags help to best exploit the data, which considerably reduces the number of possible options and promotes the disambiguation of a token that has different meanings depending on context. Morph features improve generalization by drawing on the rich knowledge available for each token in context. As for the lemma backoff and MDP lemma/morph models, both show considerable results in handling the OOV issue, especially for translation from a morphologically complex language such as Arabic.
The evaluation results answer our empirical question: using linguistic factors on the Arabic side improves the quality of Arabic-to-Chinese translation. Further, different configurations of the factored model provide better translation quality than the baseline and better or comparable performance relative to both tokenized models.

Figure 5. Examples of Arabic-to-Chinese MT output over the baseline model, tokenized models, and best-factored models. The English glosses are presented to ease readability.

Conclusions and Future Work
Integrating additional linguistic knowledge is one of the core problems in the PBSMT model, which we addressed in this work. To investigate the benefits of linguistic input features for Arabic → Chinese MT, we compared a linguistically augmented MT model with PBSMT. We performed a preliminary evaluation of several deep linguistic features for Arabic, including lemmas, POS, and morph features. Several configurations were applied to evaluate the results of the factored model approach. We also empirically tested the contribution of various tokenization schemes to the PBSMT system, in addition to the effects of all models on reducing data sparsity.
Our results show that using tokenization schemes for Arabic pre-processing helps to deal with the major issue of data sparseness in translation from Arabic as a morphologically rich and complex language. Linguistic factors, which we utilized to annotate the Arabic corpus, further improved translation quality. Factored systems achieved better performance compared with the baseline and tokenized PBSMT. The best system (POS model) yielded a BLEU improvement of 2.17 points over the baseline and 0.63 points over the tokenized phrase-based model, while the lemma backoff model reduced the OOV rates from 5.74% to 0.70% for tokens and from 22.80% to 6.71% for types.
To the best of our knowledge, this work is the first to test factored MT on the Arabic-Chinese language pair. Considering the following aspects can help improve this project: (1) using a larger dataset to explore linguistic input features for Arabic-to-Chinese neural machine translation, which has proven useful for many language pairs; (2) adding factors to the target-side language (Chinese); (3) testing other preprocessing tools that may perform better in Arabic-Chinese translation; and (4) conducting experiments on Chinese-to-Arabic translation that may reveal new insights.