A Simple and Effective Approach to Coverage-Aware Neural Machine Translation

We offer a simple and effective method to seek a better balance between model confidence and length preference for Neural Machine Translation (NMT). Unlike the popular length normalization and coverage models, our model requires neither training nor reranking of the limited n-best outputs. Moreover, it is robust to large beam sizes, a setting that is not well studied in previous work. On the Chinese-English and English-German translation tasks, our approach yields improvements of 0.4∼1.5 BLEU points over the state-of-the-art baselines.


Introduction
In the past few years, Neural Machine Translation (NMT) has achieved state-of-the-art performance in many translation tasks. It models the translation problem using neural networks with no assumption about the hidden structures between two languages, and learns the model parameters from bilingual texts in an end-to-end fashion (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014). In such systems, target words are generated over a sequence of time steps. The model score is simply defined as the sum of the log-scale word probabilities:

$$\log P(y|x) = \sum_{j=1}^{|y|} \log P(y_j \mid y_{<j}, x) \tag{1}$$

where x and y are the source and target sentences, and $P(y_j \mid y_{<j}, x)$ is the probability of generating the j-th word $y_j$ given the previously generated words $y_{<j}$ and the source sentence x. However, the straightforward implementation of this model suffers from many problems, the most obvious one being that the system tends to choose shorter translations because log-probabilities are accumulated over time steps. The situation is worse when we use beam search, where shorter translations have more chances to beat longer ones. It is common practice to normalize the model score by translation length (length normalization) to eliminate this system bias (Wu et al., 2016).
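The length bias described above can be illustrated with a small sketch. The word probabilities below are made-up numbers, and the per-word average is used only as the simplest possible normalizer; it is an assumption for illustration, not the exact length normalization of Wu et al. (2016).

```python
import math

def model_score(word_probs):
    """Sum of log word probabilities of a hypothesis (the unnormalized model score)."""
    return sum(math.log(p) for p in word_probs)

def length_normalized_score(word_probs):
    """Model score divided by hypothesis length (illustrative normalizer only)."""
    return model_score(word_probs) / len(word_probs)

# Hypothetical hypotheses: the longer one is more confident per word,
# but accumulates more negative log-probability terms.
shorter = [0.6, 0.6]
longer = [0.7, 0.7, 0.7, 0.7]

print(model_score(shorter), model_score(longer))
print(length_normalized_score(shorter), length_normalized_score(longer))
```

With the raw sum the shorter hypothesis scores higher, while after normalization the longer one does.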
Though widely used, length normalization is not a perfect solution.
NMT systems still have under-translation and over-translation problems even with a normalized model. This is due to the lack of a coverage model that indicates the degree to which a source word is translated. As an extreme case, a source word might be translated several times, which results in many duplicated target words. Several research groups have proposed solutions to this problem (Tu et al., 2016; Mi et al., 2016). For example, Tu et al. (2016) developed a coverage-based model to measure the fractional count of how often a source word is translated during decoding. It can be jointly learned with the NMT model. Alternatively, one can rerank the n-best outputs with coverage-sensitive models, but this method affects only the final output list, which has a very limited scope (Wu et al., 2016).
In this paper we present a simple and effective approach that introduces a coverage-based feature into NMT. Unlike previous studies, we resort neither to developing extra models nor to reranking the limited n-best translations. Instead, we develop a coverage score and apply it to each decoding step. Our approach has several benefits:
• Our approach does not require training a huge neural network and is easy to implement.
Figure 1: The coverage score for a running example (Chinese pinyin-English and β = 0.8).
We test our approach on the NIST Chinese-English and WMT English-German translation tasks, and it outperforms several state-of-the-art baselines by 0.4∼1.5 BLEU points.

The Coverage Score
Given a word sequence, a coverage vector indicates whether the word at each position is translated. This is trivial for statistical machine translation (Koehn, 2009) because there is no overlap between the translation units of a hypothesis, i.e., we have a 0-1 coverage vector.
However, this is not the case for NMT, where coverage is modeled in a soft way. In NMT, no explicit translation units or rules are used. The attention mechanism is used instead to model the correspondence between a source position and a target position (Bahdanau et al., 2015). For a given target position j, attention-based NMT computes an attention score $a_{ij}$ for each source position i. $a_{ij}$ can be regarded as a measure of the strength of correspondence between i and j, and is normalized over all source positions (i.e., $\sum_{i}^{|x|} a_{ij} = 1$).
Here, we present a coverage score (CS) to describe to what extent the source words are translated. In principle, the coverage score should be high if the translation covers most words in the source sentence, and low if it covers only a few of them. Given a source position i, we define its coverage as the sum of the past attention probabilities $c_i = \sum_{j}^{|y|} a_{ij}$ (Wu et al., 2016; Tu et al., 2016). Then, the coverage score of the sentence pair (x, y) is defined as the sum of the truncated coverage over all positions (see Figure 1 for an illustration):

$$c(x, y) = \sum_{i}^{|x|} \log \max\Big(\sum_{j}^{|y|} a_{ij}, \beta\Big) \tag{2}$$

where β is a parameter that can be tuned on a development set. This model has two properties:
• Non-linearity. Eq. (2) is a log-linear model. This is desirable because the model does not benefit much from additional attention once the coverage of a source word is already high. It prevents hypotheses in which the system puts too much attention on a few words, while the others receive only a little, from obtaining relatively high scores. Beyond this, the log-scale scoring fits into the NMT model, where word probabilities are represented in logarithmic form (see Eq. (1)).
• Truncation. At the early stage of decoding, the coverage of most source words is close to 0. This may result in a value of negative infinity after the logarithm function and cause hypotheses with sharp attention distributions to be discarded, even though such hypotheses are not necessarily bad. Truncation with the lowest value β ensures that the coverage score has a reasonable value.
Here β is similar to model warm-up, which makes the model easy to run in the first few decoding steps. Note that our way of truncation is different from that of Wu et al. (2016), who clip the coverage into [0, 1] and ignore the fact that a source word may be translated into multiple target words, in which case its coverage should be larger than 1.
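The coverage score can be sketched in a few lines of plain Python. The sketch assumes the score is the sum over source positions of log max(c_i, β), where c_i is the attention mass accumulated by source position i. The function name and the attention matrices are hypothetical and chosen only for illustration.

```python
import math

def coverage_score(attention, beta):
    """Truncated log-coverage summed over source positions.

    attention[j][i] holds a_ij, the attention that target position j
    puts on source position i (each row sums to 1).
    """
    num_source = len(attention[0])
    score = 0.0
    for i in range(num_source):
        c_i = sum(row[i] for row in attention)   # coverage of source word i
        score += math.log(max(c_i, beta))        # truncate at beta, then take the log
    return score

# Three target words that each attend mainly to a different source word.
spread = [[0.8, 0.1, 0.1],
          [0.1, 0.8, 0.1],
          [0.1, 0.1, 0.8]]
# Three target words that all attend mainly to the first source word.
focused = [[0.8, 0.1, 0.1],
           [0.8, 0.1, 0.1],
           [0.8, 0.1, 0.1]]

print(coverage_score(spread, beta=0.4))
print(coverage_score(focused, beta=0.4))
```

The hypothesis that spreads its attention over all source words receives the higher score. Because of the truncation, a source word with zero accumulated attention contributes log β rather than negative infinity.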
For decoding, we incorporate the coverage score into beam search via linear combination with the NMT model score as below,

$$s(x, y) = (1 - \alpha) \cdot \log P(y|x) + \alpha \cdot c(x, y) \tag{3}$$

where y is a partial translation generated during decoding, log P(y|x) is the model score, and α is the coefficient for linear interpolation.
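A minimal sketch of this combination follows. It assumes the interpolation has the form (1 − α) · log P(y|x) + α · c(x, y), and the scores of the two hypotheses are invented numbers used only to show the effect of α.

```python
def combined_score(log_prob, coverage, alpha):
    """Linear interpolation of the model score and the coverage score (assumed form)."""
    return (1.0 - alpha) * log_prob + alpha * coverage

# Hypothetical hypotheses: A is more likely under the model but covers the
# source poorly; B is slightly less likely but covers the source well.
hyp_a = {"log_prob": -2.0, "coverage": -3.0}
hyp_b = {"log_prob": -2.5, "coverage": -0.5}

for alpha in (0.0, 0.6):
    score_a = combined_score(hyp_a["log_prob"], hyp_a["coverage"], alpha)
    score_b = combined_score(hyp_b["log_prob"], hyp_b["coverage"], alpha)
    print(alpha, score_a, score_b)
```

With α = 0 the ranking is decided by the model score alone and A wins. With α = 0.6 the better-covered hypothesis B is preferred.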
In standard implementations of NMT systems, once a hypothesis is finished, it is removed from the beam and the beam shrinks accordingly. Here we choose a different decoding strategy. We keep the finished hypotheses in the beam until decoding completes, which means that we compare the finished hypotheses with partial translations at each step. This method helps because it can dynamically determine whether a finished hypothesis is kept in the beam through the entire decoding process, and thus reduces search errors. It enables the decoder to throw away finished hypotheses if they have very low coverage but high likelihood values.
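The pruning step of this strategy might look roughly like the sketch below. The `Hypothesis` class and `prune_beam` function are hypothetical names, the scores are made up, and the rest of the decoder (expansion, scoring, stopping criterion) is omitted, so this is not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    tokens: list
    score: float     # combined score of the (partial) translation
    finished: bool   # True once the end-of-sentence token has been generated

def prune_beam(finished, expansions, beam_size):
    """Let finished hypotheses compete with new partial ones for the beam slots."""
    pool = finished + expansions
    pool.sort(key=lambda h: h.score, reverse=True)
    return pool[:beam_size]

# A finished hypothesis with a low combined score (e.g. due to poor coverage) ...
finished = [Hypothesis(["a", "</s>"], -3.0, True)]
# ... competes with partial hypotheses from the current step.
expansions = [Hypothesis(["a", "b"], -1.0, False),
              Hypothesis(["a", "c"], -2.0, False)]

beam = prune_beam(finished, expansions, beam_size=2)
print([h.tokens for h in beam])
```

In this toy step the finished hypothesis is pushed out of the beam by two better-scoring partial translations, which would not happen if finished hypotheses were set aside as soon as they end.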

Setup
We evaluated our approach on Chinese-English and English-German translation tasks. We used the 1.8M-sentence Chinese-English bitext provided within NIST12 OpenMT and the 4.5M-sentence English-German bitext provided within WMT16. For Chinese-English translation, we chose the evaluation data of NIST MT06 as the development set, and MT08 as the test set. All Chinese sentences were word segmented using the tool provided within NiuTrans (Xiao et al., 2012). For English-German translation, we chose newstest2013 as the development set and newstest2014 as the test set.
Our baseline systems were based on the open-source implementation of the NMT model presented in Luong et al. (2017). The model consisted of a 4-layer bi-directional LSTM encoder and a 4-layer LSTM decoder. The size of the embedding and hidden layers was set to 1024. We applied the additive attention model on top of the multi-layer LSTMs (Bahdanau et al., 2015). For training, we used the Adam optimizer (Kingma and Ba, 2015) with the learning rate and batch size set to 0.001 and 128, respectively. We selected the top 30k entries for both source and target vocabularies. For the English-German task, BPE (Sennrich et al., 2016) was used for better performance.
For comparison, we re-implemented the length normalization (LN) and coverage penalty (CP) methods (Wu et al., 2016). Following Wu et al. (2016), we used grid search to tune all hyperparameters on the development set. Specifically, the weights for both CP and our CS were evaluated in the interval [0, 1] with step 0.1, while the weight for LN was searched in the interval [0.5, 1.5]. In preliminary experiments, we found that the settings determined with beam size 10 can be reliably applied to larger beam sizes, and thus we tuned all systems with beam size 10. For Chinese-English translation, we used a weight of 1.0 for both LN and CP, and set α = 0.6 and β = 0.4. For English-German translation, we set the weights of LN and CP to 1.5 and 0.3, respectively, and set α = 0.3 and β = 0.2. More details can be found in the Appendix.
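The tuning procedure can be sketched as a plain grid search. The `evaluate` callback is hypothetical: it stands for decoding the development set with a given (α, β) and returning BLEU, and is replaced here by a toy function. Searching β on a 0.1-step grid that starts at 0.1 is also an assumption of this sketch.

```python
def grid(start, stop, step):
    """Evenly spaced values from start to stop (inclusive), rounded to avoid float drift."""
    n = round((stop - start) / step)
    return [round(start + k * step, 10) for k in range(n + 1)]

def grid_search(evaluate, alphas, betas):
    """Return (best_bleu, alpha, beta) over all combinations of the two grids."""
    best = None
    for alpha in alphas:
        for beta in betas:
            bleu = evaluate(alpha, beta)
            if best is None or bleu > best[0]:
                best = (bleu, alpha, beta)
    return best

def toy_evaluate(alpha, beta):
    # Stand-in for "decode the development set and compute BLEU";
    # it simply peaks at alpha = 0.6, beta = 0.4.
    return -((alpha - 0.6) ** 2) - ((beta - 0.4) ** 2)

alphas = grid(0.0, 1.0, 0.1)   # weight of the coverage score
betas = grid(0.1, 1.0, 0.1)    # truncation value; 0 is skipped since log(0) is undefined
best_bleu, best_alpha, best_beta = grid_search(toy_evaluate, alphas, betas)
print(best_alpha, best_beta)
```

In a real setup `toy_evaluate` would be replaced by a function that runs beam search with beam size 10 on the development set and scores the output.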

Results
Table 1 shows the BLEU scores of the systems under different beam sizes (10, 100, and 500). We see, first of all, that our method outperforms four of the baselines, and that the improvement is largest when the beam size is 500. For a clearer presentation, we plotted the BLEU curves against beam size. Figure 2 shows that our method improves consistently as the beam size becomes larger, while the others start to decline when the beam size is around 50. This indicates that integrating our coverage score into decoding helps to prune undesirable hypotheses when we search in a larger hypothesis space. We also see that the model gives even better results (+0.5 BLEU) after combining all these methods, which implies that our method does not overlap with the others. More interestingly, the improvement on the En-De task is smaller than that on the Zh-En task. A possible reason is that there are relatively good word correspondences between English and German, and it is not so difficult for the base model to learn word deletions and insertions in En-De translation. Hence, the baseline system generates translations with proper lengths and does not benefit much from the coverage model. An interesting phenomenon in Table 1 is that using a large beam size of 100 rather than the standard beam size (around 10) gives considerable improvements, e.g., 0.5 BLEU for Zh-En and 0.2 for En-De, yet the extremely large beam size of 500 does not help much. This might result from the fact that our method is applied to each decoding step and thus helps the model to search in a larger space and select better hypotheses, while a much larger beam size does not provide more benefits because the model already generates sufficiently good translations with a small beam size.
We also compared CP with our method by applying CP to each decoding step (Line CP †) and our method only to reranking (Line CS †) in Table 1. We noted that model performance dropped in most cases when CP was applied to each decoding step, whereas our method was helpful in reranking and obtained even better results when employed inside beam search. This implies that the way of truncation is essential for using coverage effectively inside beam search and achieving more significant improvements.
Table 3: BLEU against α and β (zh-en/en-de).
Then, Figure 3 shows that our method has a relatively better ability to handle longer sentences. It obtains a significant improvement over the baselines when we translate sentences of more than 50 words. This is expected because the coverage provides rich information from the past, which helps to address the long-term dependency issue.
Another interesting question is whether NMT systems can generate translations with appropriate lengths. To answer it, we studied the length difference between the MT output and the shortest reference. Table 2 shows that our method helps on both tasks. It generates translations whose lengths are closer to those of their references, which agrees with the BLEU results in Table 1. This is reasonable because our method encourages hypotheses with higher coverage scores and thus higher recall. It means that our method can help the model to preserve the meaning of source words, which alleviates the under-translation problem.
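Length statistics of this kind could be computed as in the sketch below. The exact definitions are assumptions of this sketch: Diff is taken as the mean absolute difference between the length of a translation and that of its shortest reference, and LR as the total translation length divided by the total length of the shortest references. The sentences are toy data.

```python
def length_stats(hypotheses, references):
    """hypotheses: list of token lists; references: one list of token lists per sentence."""
    total_hyp = total_ref = total_diff = 0
    for hyp, refs in zip(hypotheses, references):
        shortest = min(len(ref) for ref in refs)   # length of the shortest reference
        total_hyp += len(hyp)
        total_ref += shortest
        total_diff += abs(len(hyp) - shortest)
    n = len(hypotheses)
    return {"Len": total_hyp / n, "Diff": total_diff / n, "LR": total_hyp / total_ref}

hyps = [["a", "b", "c"], ["a", "b"]]
refs = [[["a", "b", "c", "d"], ["a", "b", "c", "d", "e"]],
        [["a", "b"]]]
stats = length_stats(hyps, refs)
print(stats)
```

Under these assumed definitions, a smaller Diff and an LR closer to 1 indicate translations whose lengths are closer to those of the references.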
Sensitivity analysis on α and β in Table 3 shows that the two tasks have different optimal choices of these values, which might be due to the natural need of length preference for different languages.

Related Work
The length preference and coverage problems have been discussed for years since the rise of statistical machine translation (Koehn, 2009). In NMT, several good methods have been developed. The simplest of these is length normalization, which penalizes short translations in decoding (Wu et al., 2016). More sophisticated methods focus on modeling the coverage problem with extra sub-modules in NMT and require a training process (Tu et al., 2016; Mi et al., 2016).
Perhaps the work most related to this paper is Wu et al. (2016). In their work, the coverage problem is given a probabilistic interpretation. However, their model fails to account for cases in which one source word is translated into multiple target words and thus has a total attention score > 1.
To address this issue, we remove the probability constraint and make the coverage score interpretable for different cases. Another difference is that our coverage model is applied to every beam search step, while the model of Wu et al. (2016) affects only a small number of translation outputs.
Previous work has pointed out that BLEU scores of NMT systems drop as beam size increases (Britz et al., 2017; Tu et al., 2017; Koehn and Knowles, 2017), and the existing length normalization and coverage models can alleviate this problem to some extent. In this work we show that our method can do this much better. Almost no BLEU drop is observed even when the beam size is set to 500.

Conclusion
We have described a coverage score and integrated it into a state-of-the-art NMT system. Our method is easy to implement and does not need training for additional models. Also, it performs well in searching with large beam sizes. On Chinese-English and English-German translation tasks, it outperforms several baselines significantly.

Figure 3 :
Figure 2: BLEU against beam size (legend: base, CP, LN, CS).

Table 2: Length statistics. Len = average length of translations, Diff = average length difference between translations and shortest references, LR = translation length ratio.