CUED@WMT19:EWC&LMs

Two techniques provide the fabric of the Cambridge University Engineering Department’s (CUED) entry to the WMT19 evaluation campaign: elastic weight consolidation (EWC) and different forms of language modelling (LMs). We report substantial gains by fine-tuning very strong baselines on former WMT test sets using a combination of checkpoint averaging and EWC. A sentence-level Transformer LM and a document-level LM based on a modified Transformer architecture yield further gains. As in previous years, we also extract n-gram probabilities from SMT lattices which can be seen as a source-conditioned n-gram LM.


Introduction
Both fine-tuning and language modelling are techniques widely used for NMT. Fine-tuning is often used to adapt a model to a new domain (Luong and Manning, 2015), while ensembling neural machine translation (NMT) with neural language models (LMs) is an effective way to leverage monolingual data (Gulcehre et al., 2015, 2017; Stahlberg et al., 2018a). Our submission to the WMT19 news shared task relies on ideas from these two lines of research, but applies and combines them in novel ways. Our contributions are:

• Elastic weight consolidation (Kirkpatrick et al., 2017, EWC) is a domain adaptation technique that aims to avoid degradation in performance on the original domain. We report large gains from fine-tuning our models on former English-German WMT test sets with EWC. We find that combining fine-tuning with checkpoint averaging (Junczys-Dowmunt et al., 2016a,b) yields further significant gains. Fine-tuning is less effective for German-English.
• Inspired by the shallow fusion technique of Gulcehre et al. (2015, 2017), we ensemble our neural translation models with neural language models. While this technique is effective for single models, the gains diminish for NMT ensembles trained with large amounts of back-translated sentences.
• To incorporate document-level context in a light-weight fashion, we propose a modification to the Transformer (Vaswani et al., 2017) that has separate attention layers for inter- and intra-sentential context. We report large perplexity reductions compared to sentence-level LMs under the new architecture. Our document-level LM yields small BLEU gains on top of strong NMT ensembles, and we hope to benefit even more from it in document-level human evaluation.
• Even though the performance gap between NMT and traditional statistical machine translation (SMT) is growing rapidly on the task at hand, SMT can still improve very strong NMT ensembles. To combine NMT and SMT we follow Stahlberg et al. (2017a, 2018b) and build a specialized n-gram LM for each sentence that computes the risk of hypotheses relative to SMT lattices.
• While data filtering was central in last year's evaluation (Koehn et al., 2018b; Junczys-Dowmunt, 2018b), in our experiments this year we found that a very simple filtering approach based on a small number of crude heuristics can perform as well as dual conditional cross-entropy filtering (Junczys-Dowmunt, 2018a,b).
• We confirm the effectiveness of source-side noise for scaling up back-translation as proposed by Edunov et al. (2018).
2 Document-level Language Modelling

MT systems usually translate sentences in isolation. However, there is evidence that humans also take context into account, and judge translations from humans with access to the full document higher than the output of a state-of-the-art sentence-level machine translation system (Läubli et al., 2018). Common examples of ambiguity which can be resolved with cross-sentence context are pronoun agreement or consistency in lexical choice. This year's WMT competition encouraged submissions of translation systems that are sensitive to cross-sentence context. We explored the use of document-level language models to enhance a sentence-level translation system. We argue that this is a particularly light-weight way of incorporating document-level context. First, the LM can be trained independently on monolingual target language documents, i.e. no parallel or source language documents are needed. Second, since our document-level decoder operates on the n-best lists from a sentence-level translation system, existing translation infrastructure does not have to be changed; we just add another (document-level) decoding pass. On a practical note, this means that, by skipping the second decoding pass, our system would work well even for the translation of isolated sentences when no document context is available.

Our document-level LMs are trained on the concatenations of all sentences in target language documents, separated by special sentence boundary tokens. Training a standard Transformer LM (Vaswani et al., 2017) on this data already yields significant reductions in perplexity compared to sentence-level LMs. However, the attention layers have to capture two kinds of dependencies: the long-range cross-sentence context and the short-range context within the sentence. Our modified Intra-Inter Transformer architecture (Fig. 1) splits these two responsibilities into two separate layers using masking. The "Intra-Sentential Attention" layer only allows attending to the previous tokens in the current sentence, i.e. the intra-sentential attention mask activates the tokens between the most recent sentence boundary marker and the current symbol. The "Inter-Sentential Attention" layer is restricted to the tokens in all previous complete sentences, i.e. the mask enables all tokens from the document beginning to the most recent sentence boundary marker. As usual (Vaswani et al., 2017), during training the attention masks are also designed to prevent attending to future tokens. Fig. 2 shows an example of the different masks. Note that, as illustrated in Fig. 1, both attention layers are part of the same layer stack, which allows a tight integration of both types of context. An implication of this design is that they also use the same positional embedding: the positional encoding of the first unmasked item for intra-sentential attention may not be zero. For example, 'Lonely' has position 10 in Fig. 2 although it is the first word in the current sentence.
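The two attention masks can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the boundary token name and the exact treatment of the boundary position itself are our assumptions.

```python
import numpy as np

BOUNDARY = "<s>"  # hypothetical sentence-boundary token


def intra_inter_masks(tokens):
    """Build boolean attention masks for a document-level LM.

    A True entry at (i, j) means query position i may attend to
    position j.  Intra-sentential attention sees only the causal
    prefix of the current sentence; inter-sentential attention sees
    only tokens of previous complete sentences.  Neither mask allows
    attending to future tokens.
    """
    n = len(tokens)
    # sent_start[i] = index of the most recent boundary at or before i
    sent_start, start = [0] * n, 0
    for i, tok in enumerate(tokens):
        if tok == BOUNDARY:
            start = i
        sent_start[i] = start

    intra = np.zeros((n, n), dtype=bool)
    inter = np.zeros((n, n), dtype=bool)
    for i in range(n):
        intra[i, sent_start[i]:i + 1] = True  # current sentence, causal
        inter[i, :sent_start[i]] = True       # all previous sentences
    return intra, inter
```

Both masks are applied inside the same layer stack, so the two attention layers share one positional embedding, matching the design described above.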
We use our document-level LMs to rerank n-best lists from a sentence-level translation system. Our initial document consists of the first-best sentence hypotheses. We greedily replace individual sentences with lower-ranked hypotheses (according to the translation score) to drive up a combination of translation and document LM scores. We start with the sentence with the minimum difference between the first- and second-best translation scores. We stop when the translation score difference to the first-best translation exceeds a threshold.
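The greedy reranking pass described above can be sketched as follows. The function and parameter names, and the exact linear interpolation of translation and document LM scores, are our assumptions; the paper only specifies the greedy order and the stopping threshold.

```python
def rerank_document(nbest, doc_lm_score, lam=0.5, max_gap=1.0):
    """Greedy document-level reranking sketch.

    nbest: per-sentence lists of (translation_score, hypothesis)
           pairs, best hypothesis first; higher scores are better.
    doc_lm_score: callable scoring a whole document (list of strings).
    """
    chosen = [0] * len(nbest)  # index of selected hypothesis per sentence

    def combined(sel):
        trans = sum(nbest[i][k][0] for i, k in enumerate(sel))
        doc = [nbest[i][k][1] for i, k in enumerate(sel)]
        return trans + lam * doc_lm_score(doc)

    best = combined(chosen)
    # Visit sentences by increasing gap between first- and second-best.
    order = sorted((i for i in range(len(nbest)) if len(nbest[i]) > 1),
                   key=lambda i: nbest[i][0][0] - nbest[i][1][0])
    for i in order:
        if nbest[i][0][0] - nbest[i][1][0] > max_gap:
            break  # stop: translation is too confident to overrule
        for k in range(1, len(nbest[i])):
            trial = list(chosen)
            trial[i] = k
            score = combined(trial)
            if score > best:
                best, chosen = score, trial
    return [nbest[i][k][1] for i, k in enumerate(chosen)]
```

A document LM that rewards consistent lexical choice will, for example, pull a second-best "the car" ahead of a first-best "the automobile" when the rest of the document uses "car".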

Experimental Setup
Our experimental setup is essentially the same as last year (Stahlberg et al., 2018b): Our preprocessing includes Moses tokenization, punctuation normalization, truecasing, and joint subword segmentation using byte pair encoding (Sennrich et al., 2016c) with 32K merge operations. We compute cased BLEU scores with mteval-v13a.pl that are directly comparable with the official WMT scores. Our models are trained with the TensorFlow (Abadi et al., 2016) based Tensor2Tensor (Vaswani et al., 2018) library and decoded with our SGNMT framework (Stahlberg et al., 2017b, 2018c). We delay SGD updates (Saunders et al., 2018) to use larger training batch sizes than our technical infrastructure would normally allow with vanilla SGD by using the MultistepAdam optimizer in Tensor2Tensor. We use Transformer (Vaswani et al., 2017) models in two configurations (Tab. 1). Preliminary experiments are carried out with the 'Base' configuration while we use the 'Big' models for our final system. We use news-test2017 as development set to tune model weights and select checkpoints and news-test2018 as test set.
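The delayed-update idea behind MultistepAdam is gradient accumulation: gradients from several micro-batches are accumulated and averaged before a single optimizer step, emulating a larger batch. A minimal NumPy sketch (plain SGD instead of Adam, hypothetical function name):

```python
import numpy as np


def accumulated_sgd_step(param, micro_batch_grads, lr=0.1):
    """One delayed SGD update: average the gradients collected over
    several micro-batches, then apply a single parameter update,
    emulating a larger effective batch size."""
    g = np.mean(micro_batch_grads, axis=0)
    return param - lr * g
```

In practice the accumulated step is taken every k micro-batches, so memory per step stays constant while the effective batch size grows by a factor of k.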

ParaCrawl Corpus Filtering
Junczys-Dowmunt (2018a,b) reported large gains from filtering the ParaCrawl corpus. This year, the WMT organizers made version 3 of the ParaCrawl corpus available. We compared two different filtering approaches on the new data set. First, we implemented dual cross-entropy filtering (Junczys-Dowmunt, 2018a,b), a sophisticated data selection criterion based on neural language model and neural machine translation model scores in both translation directions. In addition, we used the "naive" filtering heuristics proposed by Stahlberg et al. (2018b):

• Language detection (Nakatani, 2010) in both source and target language.
• No words contain more than 40 characters.
• Sentences must not contain HTML tags.
• The minimum sentence length is 4 words.
• The character ratio between source and target must not exceed 1:3 or 3:1.
• Source and target sentences must be identical after stripping out all non-numerical characters, i.e. they must contain the same numbers.
• Sentences must end with punctuation marks.
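The heuristics above can be sketched as a single filter function. This is an illustrative sketch, not the original scripts: language detection is omitted, and the HTML regex and the set of accepted end-of-sentence punctuation marks are our assumptions.

```python
import re

END_PUNCT = (".", "!", "?")  # assumed punctuation set


def keep_pair(src, tgt):
    """Return True iff the sentence pair passes the naive heuristics."""
    for s in (src, tgt):
        if re.search(r"<[a-zA-Z/][^>]*>", s):       # no HTML tags
            return False
        words = s.split()
        if len(words) < 4:                          # minimum length 4 words
            return False
        if any(len(w) > 40 for w in words):         # no words over 40 chars
            return False
        if not s.rstrip().endswith(END_PUNCT):      # ends with punctuation
            return False
    ratio = len(src) / max(len(tgt), 1)
    if ratio > 3.0 or ratio < 1.0 / 3.0:            # character ratio 3:1
        return False
    # Numbers must match: equal after stripping non-numerical characters.
    if re.sub(r"\D", "", src) != re.sub(r"\D", "", tgt):
        return False
    return True
```

Each rule is cheap to evaluate, which is the point of the approach: a handful of crude checks applied to hundreds of millions of sentence pairs.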
Tab. 2 indicates that our systems benefit from ParaCrawl even without filtering (rows 1 vs. 2).
Our best 'Base' model uses both dual and naive filtering.However, the difference between filtering techniques diminishes under stronger 'Big' models with back-translation (rows 6 and 7).

Back-translation
Back-translation (Sennrich et al., 2016b) is a well-established technique to use monolingual target language data for NMT. The idea is to automatically generate translations into the source language with an inverse translation model, and add these synthetic sentence pairs to the training data. A major limitation of vanilla back-translation is that the amount of synthetic data has to be balanced with the amount of real parallel data (Sennrich et al., 2016a,b; Poncelas et al., 2018). Edunov et al. (2018) overcame this limitation by adding random noise to the synthetic source sentences; we use Sergey Edunov's addnoise.py script available at https://gist.github.com/edunov/d67d09a38e75409b8408ed86489645dd. Tab. 3 shows that using noise improves the BLEU score by between 0.5 and 1.5 points on the news-test2018 test set (rows 2-4 vs. 5-7). Our final model uses a very large number (92M) of (noisy) synthetic sentences (row 9), although the same performance could already be reached with fewer sentences (row 8).
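The source-side noise of Edunov et al. (2018) combines word deletion, replacement with a filler token, and local word shuffling. A minimal sketch in that spirit (the rates, filler token name, and shuffle implementation are illustrative, not taken from the original script):

```python
import random


def add_noise(tokens, drop=0.1, blank=0.1, shuffle_k=3, rng=None):
    """Noise a back-translated source sentence: randomly delete words,
    replace words with a filler token, and shuffle words locally so
    that each token moves at most shuffle_k - 1 positions."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < drop:
            continue                                  # word dropout
        out.append("<BLANK>" if r < drop + blank else tok)
    # Local shuffle via randomly perturbed sort keys.
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(out))]
    return [t for _, t in sorted(zip(keys, out))]
```

The noise makes the synthetic source side visibly distinct from real data, which is thought to help the model exploit much larger amounts of back-translated text.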

Fine-tuning with EWC and Checkpoint Averaging
Fine-tuning (Luong and Manning, 2015) is a domain adaptation technique that first trains a model until it converges on a training corpus A, and then continues training on a usually much smaller corpus B which is close to the target domain. Similarly to Schamper et al. (2018) and Koehn et al. (2018a), we fine-tune our models on former WMT test sets (2008-2016) to adapt them to the target domain of high-quality news translations. Due to the very small size of corpus B, much care has to be taken to avoid over-fitting. We experimented with different techniques that keep the model parameters in the fine-tuning phase close to the original ones. First, we fine-tuned our models for about 1K-2K iterations (depending on the performance on the news-test2017 dev set) and dumped checkpoints every 500 steps. Averaging all fine-tuning checkpoints together with the last unadapted checkpoint yields minor gains over fine-tuning without averaging (rows 3 vs. 4 in Tab. 4). However, we obtain the best results by combining checkpoint averaging with another regularizer, elastic weight consolidation (Kirkpatrick et al., 2017, EWC), which explicitly penalizes the distance of the model parameters θ to the optimized but unadapted model parameters θ*_A. The regularized training objective according to EWC is

L(θ) = L_B(θ) + λ Σ_i F_i (θ_i − θ*_{A,i})²

where L_B(θ) is the normal cross-entropy training loss on task B, λ controls the strength of the regularizer, and F_i is an estimate of the task A Fisher information, which represents the importance of parameter θ_i to A.
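The EWC objective and the checkpoint averaging step are both straightforward to express. The sketch below uses a diagonal Fisher estimate, as is standard for EWC; function names are ours.

```python
import numpy as np


def ewc_loss(theta, theta_star, fisher, task_b_loss, lam=0.1):
    """EWC-regularized objective: task-B cross-entropy loss plus a
    quadratic penalty on the distance to the unadapted parameters
    theta_star, weighted per-parameter by the diagonal Fisher
    information of task A."""
    penalty = np.sum(fisher * (theta - theta_star) ** 2)
    return task_b_loss + lam * penalty


def average_checkpoints(checkpoints):
    """Element-wise average of a list of parameter vectors, as used
    for checkpoint averaging."""
    return np.mean(checkpoints, axis=0)
```

Parameters that were important for the general domain (large F_i) are held close to their original values, while unimportant parameters are free to adapt to the news domain.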
On English-German, fine-tuning with EWC and checkpoint averaging yields a 1.1 BLEU improvement (rows 1 vs. 6 in Tab. 4). Gains are generally smaller on German-English.

Language modelling
We introduced our new Intra-Inter Transformer architecture for document-level language modelling in Sec. 2. Tab. 5 shows that our architecture achieves much better perplexity than both a sentence-level language model and a document-level vanilla Transformer model. Tab. 6 summarizes our translation results with various kinds of language models. Adding a Transformer sentence-level LM to NMT helps for the single Base model without back-translation, but is less effective on top of (ensembles of) Big models with back-translation (row 2 vs. 3). Extracting n-gram probabilities from traditional PBSMT lattices as described by Stahlberg et al. (2017a) and using them as source-conditioned n-gram LMs yields gains even on top of our ensembles (row 4). Our document-level Intra-Inter language models improve the ensembles and the single En-De Base model, but hurt performance slightly for the single Big models (row 5).
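Scoring a hypothesis against n-gram probabilities extracted from an SMT lattice can be sketched as follows. This is a simplified illustration of the idea (summing lattice posteriors of matched n-grams); the actual risk computation of Stahlberg et al. (2017a) is more involved, and the data structures here are our assumptions.

```python
def lattice_ngram_score(hyp_tokens, ngram_posteriors, max_order=4):
    """Score a hypothesis by summing the SMT-lattice posteriors of all
    n-grams (up to max_order) it contains.  ngram_posteriors maps
    token tuples to posterior probabilities from the lattice; unseen
    n-grams contribute nothing."""
    score = 0.0
    for n in range(1, max_order + 1):
        for i in range(len(hyp_tokens) - n + 1):
            score += ngram_posteriors.get(tuple(hyp_tokens[i:i + n]), 0.0)
    return score
```

Because the n-gram table is built per source sentence, this acts as a source-conditioned n-gram LM that can be added as one more feature when rescoring NMT output.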
In our context, both methods aim to avoid catastrophic forgetting (Goodfellow et al., 2013; French, 1999) and over-fitting by keeping the adapted model close to the original, and can thus be seen as regularized fine-tuning techniques. Khayrallah et al. (2018) and Dakwale and Monz (2017) regularized the output distributions during fine-tuning using techniques inspired by knowledge distillation (Bucilu et al., 2006; Hinton et al., 2014; Kim and Rush, 2016). Barone et al. (2017) applied standard L2 regularization and a variant of dropout to domain adaptation. EWC as a generalization of L2 regularization has been used for NMT domain adaptation by Thompson et al. (2019) and Saunders et al. (2019). In particular, Saunders et al. (2019) showed that EWC is not only more effective than L2 in reducing catastrophic forgetting but even yields gains on the general domain when used for fine-tuning on a related domain.

NMT-SMT hybrid systems
Popular examples of combining a fully trained SMT system with independently trained NMT are rescoring and reranking methods (Neubig et al., 2015; Stahlberg et al., 2016b; Khayrallah et al., 2017; Grundkiewicz and Junczys-Dowmunt, 2018; Avramidis et al., 2016; Marie and Fujita, 2018; Zhang et al., 2017), although these models may be too constraining if the neural system is much stronger than the SMT system. Loose combination schemes include the edit-distance-based system of Stahlberg et al. (2016a) or the minimum Bayes-risk approach of Stahlberg et al. (2017a) we adopted in this work. NMT and SMT can also be combined in a cascade, with SMT providing the input to a post-processing NMT system (Niehues et al., 2016; Zhou et al., 2017) or vice versa (Du and Way, 2017). Wang et al. (2017b, 2018) interpolated NMT posteriors with word recommendations from SMT and jointly trained NMT together with a gating function which assigns the weight between SMT and NMT scores dynamically. The AMU-UEDIN submission to WMT16 let SMT take the lead and used NMT as a feature in phrase-based MT (Junczys-Dowmunt et al., 2016b). In contrast, Long et al. (2016) translated most of the sentence with an NMT system, and just used SMT to translate technical terms in a post-processing step. Dahlmann et al. (2017) proposed a hybrid search algorithm in which the neural decoder expands hypotheses with phrases from an SMT system.

Conclusion
Our WMT19 submission focused on regularized fine-tuning and language modelling. With our novel Intra-Inter Transformer architecture for document-level LMs we achieved significant reductions in perplexity and minor improvements in BLEU over very strong baselines. A combination of checkpoint averaging and EWC proved to be an effective way to regularize fine-tuning. Our systems are competitive on both English-German and German-English (Tab. 7), especially considering the immense speed with which our field has been advancing in recent years (Tab. 8).

Figure 1: Our modified Intra-Inter Transformer architecture with two separate attention layers.

Table 2: Comparison of ParaCrawl filtering techniques. The rest of the training data is over-sampled to roughly match the size of the filtered ParaCrawl corpus. In the 'Dual x-ent filtering' experiments we selected the 15M best sentences according to the dual cross-entropy filtering criterion of Junczys-Dowmunt (2018a).

Table 3: Back-translation results. We back-translated with a 'base' model for news-2017 and with the big single Transformer model of Stahlberg et al. (2018b) for news-2016 and news-2018.

Table 5: Language model perplexities of different neural language models. 'Intra-Inter' denotes our modified Transformer architecture from Sec. 2. The standard model has 448M parameters, Intra-Inter has 549M parameters.

Table 6: Using different kinds of language models for translation on news-test2018. The PBSMT baseline gets 26.7 BLEU on English-German and 27.5 BLEU on German-English.

Table 7: English-German and German-English primary submissions to the WMT19 shared task.

Table 8: Comparison of our English-German system with the winning submissions over the past two years.