Multi-word Expressions in English-Latvian Machine Translation

The paper presents a series of experiments that aim to find the best method for treating multi-word expressions (MWEs) in a machine translation task. The methods have been investigated in the framework of statistical machine translation (SMT) for translation from English into Latvian. MWE candidates have been extracted using pattern-based and statistical approaches. Different techniques for MWE integration into an SMT system are analysed. The best result, +0.59 BLEU points, has been achieved by combining two phrase tables: a bilingual MWE dictionary and a phrase table created from the parallel corpus in which statistically extracted MWE candidates are treated as single tokens. When only the bilingual dictionary is used as an additional source of information, the best result (+0.36 BLEU points) is obtained by combining two phrase tables. In the case of statistically obtained MWE lists, the best result (+0.51 BLEU points) is achieved with the largest list of MWE candidates.


Introduction
Multi-word expressions (MWEs), which are often defined as "lexical items that (a) can be decomposed into multiple lexemes and (b) display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity" (Baldwin and Kim, 2010), are a well-known 'pain in the neck' (Sag et al., 2002) for human language technology researchers.
MWEs and specific classes of MWEs have been researched for many years. However, many problems are still left unsolved (Savary et al., 2015). One of them is MWE processing in the automatic translation task. In the rule-based approach, MWEs need to be encoded in a dictionary. MWEs that are not included in the dictionary are usually treated as common phrases and are often translated incorrectly. In the case of statistical machine translation (SMT), MWEs need to be learned from the text corpora during the training process. Common and frequently used phrases, including MWEs, are usually recognized during training and stored in the phrase table. However, less frequent and longer phrases are usually missing. In many cases this leads to wrong, word-by-word translation of MWEs.
Different approaches to MWE translation are described in the literature and have been applied to different language pairs. However, most existing studies concentrate on widely used languages, such as English, French (e.g., Bouamor et al., 2012) or Spanish (e.g., Lambert and Banchs, 2006). Moreover, most of these studies deal with languages with rather simple morphology and fixed word order.
The aim of this paper is to investigate possible ways of treating MWEs in the English-Latvian automated translation task in the framework of statistical machine translation. The paper continues the research presented in Skadiņa (2016) through a systematic and comprehensive analysis and comparison of different approaches to MWE identification, alignment and treatment in the automated translation task. It is the first such study for the Latvian language and could be useful for other morphologically rich under-resourced languages. The obtained evaluation results show improvement in automatic evaluation by the BLEU score, as well as an increase in the fluency and adequacy of translation.

Related work
Translation of MWEs was recognized as a problem already at a rather early stage of machine translation (Hutchins and Somers, 1992). During the decades when rule-based systems were dominant, special MWE dictionaries were created manually or semi-automatically. Such an approach was also chosen by Deksne et al. (2008) for an English-Latvian rule-based machine translation system. For MWE treatment the authors proposed to use a special dictionary of MWEs and to include an additional MWE processing step during parsing.
During the last decade statistical machine translation has become dominant, and thus ways to teach SMT systems to translate MWEs correctly have attracted the attention of researchers. Two issues have been studied: how to create or obtain an MWE dictionary if such a dictionary does not exist, and how to integrate an MWE dictionary into an SMT system.
Different techniques for integrating an MWE dictionary into an SMT system were first investigated by Carpuat and Diab (2010). They analyse two approaches for English-Arabic SMT: static (MWEs are treated as single units) and dynamic (a special feature is used to indicate the presence of an MWE). The best result was achieved with the static approach. It needs to be mentioned that the MWEs in this study were obtained from WordNet. Two years later, Bouamor et al. (2012) investigated three strategies for the integration of a bilingual MWE dictionary in an English-French SMT task: retraining with MWEs as a parallel corpus, MWEs in a phrase table, and a specific feature in the phrase table. The best results were obtained with the retraining approach.
For less resourced languages, several classes of MWEs have been investigated in the context of SMT. Pinnis and Skadiņš (2012) investigated the term translation problem for domain-specific SMT. They reported transformation of the translation model into term-aware phrase tables using a specific feature as the most successful approach. Following the recommendations of Carpuat and Diab (2010), Kordoni and Simova (2014) analysed phrasal verb translation with the help of a dictionary in an English-Bulgarian SMT task. They report dynamic integration (a specific feature) as the best approach. Recently, Cholakov and Kordoni (2016) applied word embeddings to augment the phrase table of an English-Bulgarian SMT system with new features. This approach outperformed their previous results for SMT of phrasal verbs.

Data and tools
The DGT-TM corpus (Steinberger et al., 2012) of legal documents was used in the described experiments. Although the corpus does not contain idiomatic expressions, it contains a lot of terminological units, light verb constructions and named entities that need to be treated as MWEs. The training corpus contains 1.63 million unique English-Latvian parallel sentences. Tuning and test data were selected randomly and separated from the training data before the experiments started. 1000 sentences were used as test data and 2000 sentences as tuning data. For corpus cleaning and selection of the test and tuning data, the LetsMT! platform (Vasiļjevs et al., 2012) was used.
The Moses toolkit (Koehn et al., 2007) with default settings was used for training and translation. A 5-gram language model was created with KenLM (Heafield et al., 2013), minimum error-rate training (Och, 2003) was used for tuning, and the BLEU score (Papineni et al., 2002) was used for automatic evaluation.

Identification of MWE candidates
One way to identify MWEs in a text is to apply morpho-syntactic patterns that extract all phrases matching a particular pattern. Usually this approach leads to overgeneration; thus, to filter reliable MWE candidates, association measures are applied afterwards.
214 patterns for Latvian and 61 patterns for English were created for MWE identification. Most of these patterns describe noun phrases. Latvian is an inflected language, thus more patterns are necessary to describe MWE candidates correctly and to avoid unnecessary overgeneration.
Using the mwetoolkit, 610 thousand unique MWE candidates for English and 3.68 million candidates for Latvian were identified. Such a big difference in the number of extracted candidates can be explained by the rich morphology of the Latvian language and by overgeneration. The MWE candidates were then marked in the text.
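The pattern-based step can be sketched as follows. This is a hypothetical illustration, not the actual mwetoolkit patterns used in the experiments: simple tag-sequence patterns over POS-tagged text pull out candidate noun phrases.

```python
import re

# Hypothetical illustration of pattern-based extraction: match tag-sequence
# patterns against a POS-tagged sentence and return the surface phrases.
# The real patterns (61 for English, 214 for Latvian) are more elaborate.
PATTERNS = [
    r"(?:ADJ )+NOUN",        # one or more adjectives followed by a noun
    r"NOUN (?:NOUN )*NOUN",  # noun compounds, e.g. "arbitration tribunal"
]

def extract_candidates(tagged):
    """tagged: list of (word, pos) pairs; returns surface MWE candidates."""
    tags = " ".join(pos for _, pos in tagged) + " "
    candidates = []
    for pat in PATTERNS:
        for m in re.finditer(pat + " ", tags):
            # map character offsets in the tag string back to token indices
            start = tags[:m.start()].count(" ")
            length = m.group().strip().count(" ") + 1
            candidates.append(" ".join(w for w, _ in tagged[start:start + length]))
    return candidates

sent = [("the", "DET"), ("short", "ADJ"), ("term", "ADJ"),
        ("toxicity", "NOUN"), ("test", "NOUN")]
print(extract_candidates(sent))
```

Matching over the tag string rather than the tokens keeps the sketch short; a production extractor would also handle overlapping matches and nested patterns.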

Creation of bilingual dictionary
Once MWE candidates are identified in a monolingual text, a bilingual MWE dictionary can be built through alignment of possible translation equivalents. To find translation equivalents for the monolingual MWE candidates, the MPAligner toolkit (Pinnis, 2013) was applied. The toolkit was initially designed to find translation equivalents for terminological units (single-token as well as multi-word); however, it can be applied to other alignment tasks as well. MPAligner can be used in two ways: with and without a dictionary. Without a dictionary, translation equivalents are identified using transliteration; with a dictionary, both transliteration and dictionary-based alignment are performed. In our experiments we used MPAligner with the dictionary that is included in the toolkit. The tool first extracts all possible translations of MWEs and then selects those that score above a specified threshold. For our experiments we used the default threshold of 0.7. Initially the toolkit extracted 369,506 candidate pairs (including duplicates). After filtering, 55,363 pairs were kept.
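The threshold-and-deduplication step can be sketched in a few lines; the phrase pairs and scores below are invented for illustration, only the 0.7 threshold comes from the experiments.

```python
# Sketch of the filtering step: MPAligner assigns each extracted pair an
# alignment score, and only pairs at or above a threshold (0.7 by default
# in these experiments) are kept; duplicate pairs are dropped.
def filter_pairs(scored_pairs, threshold=0.7):
    """scored_pairs: iterable of (en_phrase, lv_phrase, score) tuples."""
    kept = {}
    for en, lv, score in scored_pairs:
        if score >= threshold and (en, lv) not in kept:
            kept[(en, lv)] = score
    return kept

pairs = [
    ("japan wax", "japānas vasks", 0.93),
    ("japan wax", "japānas vasks", 0.93),   # duplicate, dropped
    ("of the", "no", 0.41),                 # below threshold, dropped
]
print(filter_pairs(pairs))
```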
Fig 2 shows a fragment of the extracted dictionary with alignment probabilities. One can notice that for a single English phrase several Latvian phrases are identified, which correspond to different inflectional forms of the phrase.

Association measures for MWE identification
Another way to identify MWEs is the application of different association measures. Such an approach makes it possible to identify MWEs that are hard to describe with patterns. However, this approach also recognizes as MWE candidates frequently used strings that are not grammatically correct phrases (e.g. 'of the').
The Collocate tool version 1.0, which "is designed to provide information about the collocations in a text or corpus" (Barlow, 2004), was used for MWE candidate extraction. The tool finds collocations using different association measures: mutual information, T-score, and log-likelihood. At first, several experiments were performed with different association measures to select the most appropriate one. After manual inspection of the most frequent collocates, it was decided to use log-likelihood for collocation extraction.
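For reference, the log-likelihood association measure for a bigram can be computed from the bigram count, the two unigram counts and the corpus size. The sketch below implements the standard log-likelihood ratio (Dunning's G²); the counts are invented and the Collocate tool's exact formulation may differ in detail.

```python
import math

# Minimal sketch of the log-likelihood ratio (G^2) for a bigram (w1, w2):
#   c12 = count of the bigram, c1/c2 = counts of w1/w2, n = corpus size.
# Higher values indicate a stronger association between the two words.
def log_likelihood(c12, c1, c2, n):
    # observed 2x2 contingency table: (w1 present/absent) x (w2 present/absent)
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    row = [c1, n - c1]
    col = [c2, n - c2]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i * 2 + j]
            e = row[i] * col[j] / n  # expected count under independence
            if o > 0:
                g2 += 2 * o * math.log(o / e)
    return g2

# a frequent, strongly associated pair scores far higher than a chance pair
print(log_likelihood(c12=30, c1=40, c2=50, n=10000))
print(log_likelihood(c12=1, c1=400, c2=500, n=10000))
```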
Several thresholds were used to extract MWE candidates. The extracted MWE candidates were then filtered using regular expressions to exclude ungrammatical phrases, e.g. a preposition followed by a determiner, phrases with numbers, etc. In addition, the top 200 phrases were manually checked. Table 1 summarizes statistics about the MWE candidates after application of the different filters.
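A filter of this kind can be sketched as a small set of reject rules. The rules below are hypothetical, mimicking only the two examples mentioned above (a preposition followed by a determiner, phrases containing numbers); the actual filter expressions are not given in the paper.

```python
import re

# Illustrative only: hypothetical reject rules for ungrammatical candidates.
BAD = [
    re.compile(r"^(of|in|on|at|by|for) (the|a|an)$"),  # preposition + determiner
    re.compile(r"\d"),                                  # contains a number
]

def keep(candidate):
    """True if the candidate passes all reject rules."""
    return not any(p.search(candidate) for p in BAD)

candidates = ["of the", "member state", "article 12", "toxicity test"]
print([c for c in candidates if keep(c)])
```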

Strategies for MWE integration into SMT system
Using the extracted MWE candidates, several strategies for integrating MWEs into an SMT system have been investigated.

Integration of bilingual MWE dictionary into SMT system
Three different strategies for integrating the bilingual MWE dictionary into the SMT system were investigated: 1. The MWE dictionary data were added to the training data and the SMT system was retrained. Automatic evaluation results for the three initial systems (A1-1, A2-1 and A3) were close to the baseline. Thus two additional experiments were performed. In the first experiment the alignment threshold of the MPAligner tool was lowered to 0.4, allowing a larger bilingual dictionary to be created (83,295 entries). This experiment (A1-2) did not lead to a significant improvement of the BLEU score.
In the second experiment (A2-2) the two translation tables were combined by scoring with either table. With this approach the best result, a +0.36 BLEU point improvement, was achieved. The obtained results differ from the results presented by Pinnis and Skadiņš (2012), who reported an additional feature as the most efficient way of terminology treatment. This could be explained by the quality of the baseline system: while the BLEU scores for the baseline systems in Pinnis and Skadiņš did not exceed 16 BLEU points, the baseline system presented in this paper reached 46.35 BLEU points.
A manual investigation, performed to find the main differences in translation, revealed some improvement in the fluency and adequacy of translations (Skadiņa, 2016). However, it also revealed some limitations. One limitation is related to the creation of the bilingual MWE dictionary: it is limited by the defined patterns, the dictionary used for alignment, and the threshold applied for dictionary filtering.

Integration of MWE lists into SMT system
As the integration of the bilingual MWE dictionary in most cases demonstrated only a small improvement in terms of the BLEU score, another approach was investigated: integration into the SMT system of monolingual MWE lists extracted with association measures. The decision to integrate monolingual lists instead of a bilingual dictionary was also made after manual inspection of the top 200 MWE candidates in each language, which showed that these lists contain different MWEs. This approach also helped to solve one of the deficiencies of the previous approach: the lack of MWEs that are translated as a single token. Table 3 summarizes automatic evaluation results for systems created from different MWE candidate lists. The best result, a +0.51 BLEU point improvement, was achieved with rather noisy data, while the results for the other systems with less data did not exceed the baseline. Similarly to the previous experiment, manual investigation of the translations revealed some improvements regarding the adequacy and fluency of the obtained translations.
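Treating an MWE as a single token can be sketched as a preprocessing pass over the training data. This is a minimal sketch assuming underscore joining, which is one common way to implement the single-unit (static) treatment; the paper does not specify the exact marking scheme, and the MWE set below is invented.

```python
# Minimal sketch, assuming MWEs become single tokens by joining their words
# with underscores before SMT training (the exact marking scheme used in
# the experiments is not specified in the paper).
def mark_mwes(sentence, mwes, max_len=5):
    tokens = sentence.split()
    out, i = [], 0
    while i < len(tokens):
        # greedily take the longest MWE starting at position i
        for span in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + span])
            if phrase in mwes:
                out.append(phrase.replace(" ", "_"))
                i += span
                break
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

mwes = {"short term toxicity test", "member state"}
print(mark_mwes("each member state runs a short term toxicity test", mwes))
```

After translation, the underscores in the output would be converted back to spaces.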

Hybrids: combining data and approaches
Several experiments combining MWEs obtained with the methods described above have been performed to find the best combination of knowledge-based and statistical approaches.
The first experiment was influenced by the best result obtained so far: treatment of MWEs as single units during training. This approach (C1) was applied to the data from the bilingual dictionary, resulting in a better result (46.68 BLEU; +0.33) when compared with most of the previous results obtained with the bilingual dictionary.
In the next experiment (C2) both data sources, MWEs from the bilingual dictionary and the largest MWE lists, were merged and marked (concatenated) in the training data. This approach led to slightly better results in terms of the BLEU score (46.72; +0.37), but did not exceed the simple concatenation method presented in the previous section.
Finally, four experiments were performed by combining two phrase tables from the previous experiments. Systems C3 and C4 combine the phrase table obtained by adding the bilingual dictionary to the training data (system D1) and the phrase table obtained by treating MWEs from the largest MWE list as single units (system S1). Systems C5 and C6 combine the phrase table created from the bilingual dictionary and the phrase table obtained by treating MWEs from the largest MWE list as single units (system S1). The best result (+0.59 BLEU points) was achieved by system C6, which uses the phrase table created from the bilingual dictionary as the first table.

Automatic evaluation results of all experiments are summarized in Table 4.

Analysis and discussion
As it has been demonstrated that an increase of the BLEU score does not always mean better translation quality (Smith et al., 2016), and to analyse the influence of the developed methods on MWE translation, a manual analysis of the output of the baseline system and the best system from each approach (D2-2, S1, and C6) was performed. 233 of the 1000 sentences in the test corpus were translated identically by all systems, including 57 sentences that were translated identically to the reference (human) translation. The baseline system generated a translation identical to the reference in 84 cases, the system that uses the bilingual dictionary (D2-2) in 80 cases, the system that uses statistically extracted MWE lists (S1) in 84 cases, and the combined system in 80 cases.
For each sentence in the test corpus, the standard deviation between the BLEU scores of these four systems was calculated. The 100 sentences with the highest standard deviation were analysed manually. The first observation was that the highest BLEU scores (and corresponding translations) in many cases are assigned to two systems: either the baseline and the D2-2 system (39 cases) or the S1 system and the hybrid system (20 cases). In 3 cases only the baseline system received the highest score, in 4 cases only the D2-2 system, in 2 cases only the single-token approach, and in 3 cases only the combined system. Although 12 cases are insufficient to make generalizations, 3 cases for each SMT system where the particular system received the highest BLEU score and where the standard deviation is large are analysed in this section.
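The sentence-selection step can be sketched as follows; the sentence-level BLEU scores below are invented for illustration.

```python
import statistics

# Sketch of the selection step: for each test sentence, compute the standard
# deviation of the four systems' sentence-level BLEU scores, then pick the
# sentences on which the systems disagree most for manual analysis.
def most_divergent(scores_per_sentence, top_n=100):
    """scores_per_sentence: list of [baseline, d2_2, s1, c6] BLEU scores;
    returns the indices of the top_n sentences by standard deviation."""
    ranked = sorted(
        range(len(scores_per_sentence)),
        key=lambda i: statistics.stdev(scores_per_sentence[i]),
        reverse=True,
    )
    return ranked[:top_n]

scores = [
    [40.0, 41.0, 40.5, 40.2],   # systems mostly agree
    [20.0, 55.0, 60.0, 58.0],   # large disagreement, selected first
    [35.0, 36.0, 34.0, 35.5],
]
print(most_divergent(scores, top_n=1))
```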
Table 5 shows three cases where the baseline system received the highest BLEU score. In all cases the output of systems S1 and C6 is identical, which can be explained by the architecture of the C6 system. In the first case, although the baseline system was automatically evaluated as the best, the translation of system D2-2 is more adequate and fluent, while the translations of systems S1 and C6 are incomprehensible. This can be explained by the more adequate translation of the MWE 'court or arbitration tribunal', where system D2-2 kept the agreement between constituents, while the baseline system lost it. In the second case the output of the baseline system is more fluent than that of the other systems. No influence on MWE translation was observed. In the third case the translation of the baseline system is identical to the reference translation, except for a colon, which is missing in the reference translation. Fluent and almost adequate translations are also generated by systems S1 and C6. The translation of the MWE 'shall duly consider the following' by these systems includes 'šādu informāciju' (following information), while the reference and the baseline system ignore the word 'following' in translation. Table 6 presents cases where the dictionary-based approach outperforms the other approaches in terms of the BLEU score. All three cases clearly demonstrate the positive influence of the bilingual MWE dictionary. In the first example, the two MWEs 'short term toxicity test' (īstermiņa toksicitātes tests) and 'sac-fry stages' (dzeltenummaisa attīstības posmos) are correctly translated only by system D2-2. It also needs to be mentioned that the output of systems S1 and C6 is more adequate and fluent compared to the baseline. Similarly, in the second example the term 'japan wax' (japānas vasks) is correctly translated only by system D2-2. Finally, in the third example, the MWE 'per tonne of product' is correctly translated by system D2-2 (the translation of the baseline system is also adequate). However, none of the systems provides a completely correct translation of 'corresponding to the tender', while the output of the improved system contains an incorrect inflection.
Table 7 summarizes cases where MWE lists treated as single tokens outperform the other approaches in terms of the BLEU score. In the first case the output of system S1 is identical to the reference translation, as it could be interpreted as an MWE. However, the translations of the baseline and D2-2 systems are also fluent and rather adequate, while the output of system C6 is not adequate, although it is fluent and is evaluated as the second best translation. In the second case, systems S1 and C6 translate the MWE 'draw up' identically to the reference translation, but fail in the translation of 'analytical accounts'. The most adequate and fluent translation is generated by the D2-2 system. The third example illustrates the translation of two MWEs, 'set out' and 'the requirement concerning professional secrecy', which are translated identically to the reference translation by systems S1 and C6. However, these translations include the additional word 'uz' (on). Thus the translations of the baseline and D2-2 systems are more fluent. It needs to be mentioned that the best BLEU score for the system combination approach (Table 8) is usually assigned to longer sentences. However, as illustrated in the examples, this does not lead to better quality. In the first example the MWE 'notes that' is translated by system C6 identically to the reference translation. This allows the system to receive the highest BLEU score; however, the MWE 'as well as' is correctly translated only by system D2-2, making its translation the most fluent and adequate. In the second example the translations of systems S1 and C6 are close in terms of the BLEU score. The most complicated part here is the translation of the MWE 'safeguard activities': the baseline and D2-2 systems lost the translation of the word 'activities', system C6 made a mistake in word ordering, and only system S1 translated it correctly, but in the wrong inflectional form. The same situation can be observed in the third example, in which only system S1 does not lose the word 'zona' (area) when translating the phrase 'in areas viif and viig'.

Conclusion
In this paper a comprehensive analysis of different methods for MWE treatment in statistical machine translation is presented. Different approaches, dynamic, static and hybrid, are analysed. The best result (+0.59 BLEU points) is achieved by the hybrid approach. The manual analysis of the obtained results demonstrated improvement in fluency and adequacy when MWE-aware systems were involved. However, the BLEU scores assigned to translations did not always highlight the best translation. A positive influence of the bilingual MWE dictionary on translation quality was observed in all analysed cases. However, this was not the case for the hybrid system. As concerns the treatment of MWEs as single tokens, it improves MWE translation, but not always the overall translation. Although the methods are analysed for English-Latvian translation, they could be applied to other languages with limited language resources as well.

Fig 2. Fragment of the extracted bilingual dictionary with reliability scores.

Table 1. Number of extracted MWE candidates using different filters and thresholds

2. An additional phrase table was created from the bilingual dictionary. Scores assigned by MPAligner were used as translation probabilities for the second phrase table. The translation tables were then combined in two different ways: by scoring with both tables and by scoring with either table. 3. An additional feature indicating the presence of an MWE was added to the phrase table. The phrase table combination technique proposed by Bisazza et al. (2011) was applied to create an MWE-aware translation table. All three approaches were evaluated with the BLEU metric. The obtained results are summarised in Table 2.

Table 2. Application of different strategies for the integration of the bilingual MWE dictionary into the SMT system

Table 3. Results of automatic evaluation

Table 4. Results of automatic evaluation

Table 5. Examples of SMT system output when the baseline system received the highest BLEU score

Table 6. Examples of SMT output when the dictionary-based system received the highest BLEU score

Table 7. Examples of SMT output when the MWE-list system received the highest BLEU score

Table 8. Examples of SMT output when the hybrid approach received the highest BLEU score