Adapting Neural Machine Translation with Parallel Synthetic Data

Recent work has shown that neural machine translation systems can effectively exploit synthetic parallel corpora. In this paper, we propose a new method for adapting a general neural machine translation system to a specific task by exploiting synthetic data. The method consists in selecting, from a large monolingual pool of sentences in the source language, those instances that are most related to a given test set. Next, this selection is automatically translated and the general neural machine translation system is fine-tuned with these data. To evaluate the adaptation method, we first conducted experiments in two controlled domains, with common and well-studied corpora. Then, we evaluated our proposal on a real e-commerce task, yielding consistent improvements in terms of translation quality.


Introduction
Neural machine translation (NMT) (Sutskever et al., 2014; Cho et al., 2014a; Bahdanau et al., 2015) has obtained state-of-the-art performance in several domains and language pairs (Sennrich et al., 2016b; Wu et al., 2016). Given the nature of NMT paradigms, the difficulty of obtaining bilingual corpora (or their limited availability) has been one of the major obstacles faced when building competitive NMT systems. Recently, the use of synthetic corpora has shown promising results for alleviating data scarcity in NMT. Several works demonstrated that combining real parallel corpora with synthetic bilingual corpora enhances NMT translation quality (Sennrich et al., 2016a; Zhang and Zong, 2016a; Cheng et al., 2016).
Following these good results, we aim to adapt general NMT models to real, specific tasks by using synthetic parallel data. The core idea is to select the most valuable instances from a large pool of monolingual source sentences, with respect to a given test set. Next, we automatically translate them. We thereby obtain a synthetic parallel corpus related to our test set domain. Such a synthetic corpus can be used to fine-tune an NMT system to the domain at hand.
The main contributions of this paper cover the steps required to adapt an NMT system to a specific domain:
• We propose a novel method for creating the most adequate synthetic corpus. It leverages a vector-space representation of sentences, relying on the word embeddings by Mikolov et al. (2013a) and Le and Mikolov (2014).
• We describe the pipeline of our adaptation process, relating the selection, translation and fine-tuning processes.
• We study our adaptation technique on two classical domains. Additionally, we validate our technique on a real e-commerce translation task.
• Results show important improvements over a baseline system.
This paper is structured as follows. NMT technology is briefly described in Section 2. Section 3 summarizes the related work. In Section 4, we present our selection method and describe the adaptation pipeline. Section 5 presents the experimental set-up and corpora. Results are analyzed and discussed in Section 6. Finally, conclusions and future work are outlined in Section 7.

Neural Machine Translation
Neural machine translation is an instantiation of sequence-to-sequence learning: given a sequence of words in the source language, we must produce the corresponding sequence of words in the target language. This is usually done by means of the encoder-decoder architecture: the encoder computes a representation of the input sequence, while the decoder takes it and generates, word by word, the sentence in the target language (Sutskever et al., 2014). In this work, we use an NMT system featuring long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) in both the encoder and the decoder, equipped with an attention mechanism (Bahdanau et al., 2015).
The input to the system is a sequence of words in the source language. A word embedding matrix projects each word from the discrete space to a continuous one. The sequence of word embeddings is then processed by a bidirectional (Schuster and Paliwal, 1997) LSTM network, which produces a sequence of annotations by concatenating the hidden states from the forward and backward layers.
At each decoding timestep, the attention mechanism computes a weighted mean of the sequence of annotations. The weights are given by a soft alignment model that scores each annotation against the previous decoding state. The result can be seen as a joint, dynamic representation of the input sentence.
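The weighted mean described above can be sketched as follows. This is only a minimal illustration: a plain dot product stands in for the learned soft alignment model of Bahdanau et al. (2015), and the function names `softmax` and `attention_context` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw alignment scores into weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(annotations, prev_state):
    """Return the attention weights and the weighted mean of the annotations.

    Assumption: a dot product replaces the learned alignment model;
    annotations and prev_state are lists of floats of the same size.
    """
    scores = [sum(a * s for a, s in zip(annotation, prev_state))
              for annotation in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * annotation[d] for w, annotation in zip(weights, annotations))
               for d in range(dim)]
    return weights, context
```

Each decoding step recomputes the weights from the new decoder state, which is why the resulting context vector changes along the output sentence.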
The decoder is another LSTM network, conditioned on the representation computed by the attention model and on the previously generated word. Finally, a deep output layer (Pascanu et al., 2014) computes a distribution over the target language vocabulary.
The model is jointly trained by stochastic gradient descent (SGD), aiming to maximize the log-likelihood over a bilingual parallel corpus. At decoding time, the model approximates the most likely target sentence with beam search (Sutskever et al., 2014).

Related work
Since Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al. (2014b) proposed the first NMT systems, NMT has been a very active research topic. Considerable effort has been devoted to leveraging the advantages that this technology brings. One of them is the ability of NMT systems to rapidly adapt to a given domain once they have been trained on a general domain. This is useful either for creating domain-dependent NMT systems or for low-resource tasks. For instance, Luong and Manning (2015) tackled the informal speech translation task by starting from a system trained on the WMT data and adapting it to the translation task at hand.
In phrase-based statistical machine translation (SMT), synthetic bilingual corpora have been mainly proposed as a means to exploit the vast amount of monolingual data available. By applying a self-training scheme, the synthetic parallel data can be obtained by automatically translating a source-side monolingual corpus (Ueffing et al., 2007; Wu et al., 2008). Other works used target-side corpora to build the synthetic parallel corpus (Bertoldi and Federico, 2009; Lambert et al., 2011).
Inspired by these works in SMT, research on the inclusion of monolingual data in NMT is attracting growing interest. Different works have tackled the inclusion of monolingual data, both in the source language (Zhang and Zong, 2016b) and in the target language (Gulcehre et al., 2015, 2017).
Moreover, Sennrich et al. (2016a) showed that parallel data is not strictly necessary for performing domain adaptation: the usage of synthetic data has positive effects on the NMT system. To obtain the synthetic data, they automatically translated a large monolingual corpus. This synthetic-based approach obtained better results than other methods aimed at exploiting monolingual data (e.g. Gulcehre et al. (2015)). Domain adaptation in NMT is also integrated in commercial systems, such as SYSTRAN (Crego et al., 2016).

Adaptation using synthetic corpus
As described in the previous section, synthetic parallel data have been widely used to boost the translation quality of NMT. In this work, we further extend their application by adapting NMT models with synthetic parallel data. In certain language pairs or domains where parallel corpora are scarce or even non-existent, a model adjusted with synthetic data can improve the performance with respect to a more general model.
The core idea is that, once a model has been trained on a large, general corpus, we can adapt it to a new domain by fine-tuning it exclusively on the synthetic data. To do this, we create an ad-hoc, specific synthetic corpus that exhibits the features of our target-domain data. This corpus is constructed by selecting, from a large monolingual pool of sentences in the source language, those instances that are related to our in-domain dataset. Next, we automatically translate these sentences into the target language. Finally, using this synthetic corpus, we fine-tune an NMT system trained on a more general domain.
Figure 1 shows the pipeline of our adaptation process.
In this section, we describe our technique for creating adequate synthetic corpora, based on a vector-space representation of sentences, and the NMT adaptation process.

Continuous vector-space representation
The idea of representing words or sentences in a continuous vector space by means of neural networks was initially proposed by Hinton (1986) and Elman (1990). Continuous vector-space representations (CVR) of words or sentences have been widely leveraged in natural language applications and have demonstrated solid results across a variety of tasks, such as speech recognition (Schwenk, 2007), part-of-speech tagging (Socher et al., 2011), sentiment classification and identification (Glorot et al., 2011) or machine translation (Cho et al., 2014a; Mikolov et al., 2013b).
In this paper, we use a CVR of the sentences involved in our data selection method. Specifically, we follow the CVR approach presented by Le and Mikolov (2014). In that work, the authors adapted the continuous Skip-Gram model (Mikolov et al., 2013a) to generate representative vectors of sentences and documents. With this technique, we obtain a single vector that represents a complete sentence by means of the Skip-Gram architecture.

Synthetic creation method
To create an adequate synthetic corpus for adapting an NMT system, we select from a large pool of monolingual text the sentences most related to our task at hand. We present a novel selection technique, based on the CVR of the sentences.
The intuition is to select sentences whose vector-space representation is similar to the representation of our in-domain instances, assuming that similar sentences will have similar vectors (Le and Mikolov, 2014).
Having a continuous vector-space representation of the test sentences allows us to compute a centroid. This can be seen as a prototype of the sentences present in the test set.
Provided that similar sentences have similar vector-space representations (Mikolov et al., 2013b), we assume that vectors from the in-domain corpus will be clustered. On the other hand, vectors from the general pool of sentences are likely to be more dispersed. The idea of our method is to create a hypersphere in the continuous space, centered at our test set centroid and containing all sentences from the test set. Ideally, only a subset of the sentences from the general pool will fall inside this hypersphere. The hypersphere radius is set according to a similarity metric between the centroid of the test set and the furthest of the test sentences.
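As similarity metric we consider the cosine similarity, defined as:
$$\cos(F_1, F_2) = \frac{F_1 \cdot F_2}{\lVert F_1 \rVert \, \lVert F_2 \rVert}$$
where $F_1$ and $F_2$ are two $z$-dimensional vectors.
The centroid is defined as the average of the representations of the sentences from our in-domain corpus $\mathcal{T}$ (made up of $T$ sentences):
$$F_{\mathcal{T}} = \frac{1}{T} \sum_{t=1}^{T} F_{x_t}$$
where $F_{x_t} \in \mathbb{R}^z$ is the $z$-dimensional representation of the sentence $x_t$ and $F_{\mathcal{T}} \in \mathbb{R}^z$ denotes the centroid of our test set.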
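The two definitions above translate directly into code. The following is a minimal sketch in plain Python, assuming sentence vectors are non-zero lists of floats of equal dimension; the function names are illustrative only.

```python
import math


def cosine_similarity(f1, f2):
    """Cosine of the angle between two z-dimensional vectors.

    Assumption: neither vector is all zeros (the norm would be 0).
    """
    dot = sum(a * b for a, b in zip(f1, f2))
    norm1 = math.sqrt(sum(a * a for a in f1))
    norm2 = math.sqrt(sum(b * b for b in f2))
    return dot / (norm1 * norm2)


def centroid(vectors):
    """Component-wise average of a non-empty list of sentence vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / count for d in range(dim)]
```

Because cosine similarity ignores vector length, two sentences whose vectors point in the same direction are treated as equally close to the centroid regardless of their norms.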
Algorithm 1: Pseudo-code for selecting synthetic corpora.
Algorithm 1 shows the selection procedure. Here, $x_t \in \mathcal{T}$ is a sentence from our source test data $\mathcal{T}$, and $F_{x_t}$ is the vector-space representation of $x_t$. Analogously, $\mathcal{P}$ is the pool of candidate sentences, $x_p \in \mathcal{P}$ is a source candidate sentence, $F_{x_p}$ is the vector-space representation of $x_p$, and $|\mathcal{P}|$ is the number of sentences in $\mathcal{P}$. Our objective is then to select the data from $\mathcal{P}$ that is most suitable for translating the source test data $\mathcal{T}$.
$\rho$ represents the radius of the hypersphere, which is computed in lines 4 to 8 (the first forall loop) of Algorithm 1.
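The selection procedure can be sketched in Python as follows. This is a reconstruction from the prose description rather than a transcription of Algorithm 1: the radius is taken as the lowest cosine similarity between the centroid and any test sentence, and a pool sentence is kept when its similarity to the centroid is at least that value. The inclusive threshold and all function names are assumptions.

```python
import math


def _cosine(f1, f2):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(f1, f2))
    return dot / (math.sqrt(sum(a * a for a in f1)) *
                  math.sqrt(sum(b * b for b in f2)))


def _centroid(vectors):
    """Component-wise average of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def select_from_pool(test_vectors, pool_vectors):
    """Return the indices of the selected pool sentences and the radius rho."""
    center = _centroid(test_vectors)
    # First loop: rho is the similarity of the furthest (least similar)
    # test sentence, so every test sentence lies inside the hypersphere.
    rho = min(_cosine(center, f_xt) for f_xt in test_vectors)
    # Second loop: keep pool sentences at least as similar as rho
    # (assumption: the boundary itself is included).
    selected = [p for p, f_xp in enumerate(pool_vectors)
                if _cosine(center, f_xp) >= rho]
    return selected, rho
```

A looser test set (one whose sentences are spread far from their centroid) yields a lower `rho` and therefore a larger selection.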

Adapting with the selection
In our adaptation framework, we assume that we have an NMT model trained on a general domain. We also have a large monolingual pool of sentences (in the source language) and the source part of the test set.
As a first step, we compute the distributed representation of the sentences in our large pool. Next, we select sentences from the monolingual pool, given the test set, according to Algorithm 1. This subset of sentences is expected to be related to our in-domain test data. We translate them by means of machine translation (see Section 5.3 for further details). We now have a synthetic parallel corpus related to our in-domain task. Finally, we fine-tune the general NMT model with these data.
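The step that turns the selection into a synthetic parallel corpus can be sketched as below. The `translate` argument is a hypothetical callable standing in for whichever machine translation system is used; it is not part of any specific toolkit.

```python
def build_synthetic_corpus(pool_sentences, selected_indices, translate):
    """Pair each selected source sentence with its machine translation.

    pool_sentences: list of source-language strings (the monolingual pool).
    selected_indices: indices chosen by the selection step.
    translate: hypothetical callable mapping a source string to a target string.
    """
    sources = [pool_sentences[i] for i in selected_indices]
    targets = [translate(sentence) for sentence in sources]
    return list(zip(sources, targets))
```

The resulting list of (source, translation) pairs is what the general model would then be fine-tuned on.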

Experiments
In this section, we describe the experimental framework employed to assess the performance of the NMT adaptation method described in Section 4. For this purpose, we studied its behavior on three corpora. Two of them refer to controlled tasks, while the last one belongs to a real e-commerce task.

Corpora
We performed the experiments on English→Spanish translation. Our out-of-domain training data was the Common Crawl (COMMON) corpus, which was collected from web sources. We chose the 1 Billion Words corpus (Chelba et al., 2013) as the large pool of monolingual sentences. For validation, we chose the News-commentary test 2013 (dev13) dataset. For testing, we used corpora from three different domains: Xerox printer manuals (XRCE-Test) (Barrachina et al., 2009), Information Technology (IT-Test) and Electronic Commerce (E-Com-Test). This last corpus was obtained from a real e-commerce website (Cachitos de Plata). Statistics of all corpora are provided in Table 1.

Evaluation
Translation quality was assessed according to the following well-known metrics:
• BLEU (BiLingual Evaluation Understudy) (Papineni et al., 2002) measures n-gram precision with respect to a reference set, with a penalty for sentences that are too short.
• TER (Translation Error Rate) (Snover et al., 2006) is an error metric that computes the minimum number of edits (including swaps) required to modify the system hypotheses so that they match the reference.
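To make the two penalties concrete, the sketch below implements a simplified TER and the BLEU brevity penalty. It is an illustration under stated assumptions, not the official scoring tools: the TER variant counts only insertions, deletions and substitutions (the swaps of the full metric are omitted), and both functions assume non-empty token lists.

```python
import math


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def ter_without_shifts(hyp, ref):
    """Edits divided by reference length (simplification: no shifts)."""
    return edit_distance(hyp, ref) / len(ref)


def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty: 1 for long hypotheses, exp(1 - r/c) otherwise."""
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)
```

For example, `ter_without_shifts("a b c d".split(), "a b".split())` is 1.0, since two deletions are needed against a two-word reference. Overlong hypotheses are thus punished by TER, whereas the brevity penalty only punishes short ones.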
For all results, we computed their confidence intervals (p = 0.05) by means of bootstrap resampling (Koehn, 2004).
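A percentile-based version of bootstrap resampling can be sketched as follows. The `metric` argument is a hypothetical callable that scores a list of test items (for instance, hypothesis and reference pairs); the number of resamples and the percentile method are assumptions, not details taken from this paper.

```python
import random


def bootstrap_interval(items, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Return a (lower, upper) confidence interval for metric(items).

    items: non-empty list of test items.
    metric: hypothetical callable that maps a list of items to a score.
    """
    rng = random.Random(seed)
    size = len(items)
    scores = []
    for _ in range(n_resamples):
        # Draw a test set of the same size, sampling with replacement.
        sample = [items[rng.randrange(size)] for _ in range(size)]
        scores.append(metric(sample))
    scores.sort()
    # Drop alpha/2 of the resampled scores at each end.
    cut = int(round(alpha / 2 * n_resamples))
    return scores[cut], scores[n_resamples - 1 - cut]
```

Two systems whose intervals do not overlap can be considered significantly different at the chosen level.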

Machine translation systems
We used NMT-Keras (Peris, 2017) for building the NMT system, as described in Section 2. We applied joint byte pair encoding (BPE) (Sennrich et al., 2016b), learning 32,000 merge operations on the out-of-domain dataset. Following the findings from Britz et al. (2017), we used LSTM units. For practical reasons, we used single-layered LSTMs. The LSTM, word embedding and attention MLP sizes were 512 each. We applied layer normalization (Ba et al., 2016) and Gaussian noise (σ = 0.01) to the weights (Graves, 2011). We clipped the $L_2$ norm of the gradients to 1 (Pascanu et al., 2012). We used Adam (Kingma and Ba, 2014) with a learning rate of 0.0002 (Wu et al., 2016). The size of the beam was set to 6.
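Gradient norm clipping, as mentioned above, rescales the gradient whenever its norm exceeds a threshold. The sketch below is a generic illustration on a flat list of gradient values, not the NMT-Keras implementation; the function name is hypothetical.

```python
import math


def clip_by_norm(gradients, max_norm=1.0):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)  # already inside the allowed ball
    scale = max_norm / norm
    return [g * scale for g in gradients]
```

The direction of the update is preserved and only its magnitude is bounded, which limits the effect of occasional very large gradients.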
We further trained the NMT system using the selected synthetic data. For this training, we used vanilla SGD with an initial learning rate of 0.05. These hyperparameters were set according to the results observed on the development set. From this exploration, we also noticed that applying more sophisticated SGD optimizers (e.g. Adam) is tricky, as they update the model more aggressively. If the updates are excessively large, the knowledge from the general model is partly lost.
We also tested our method with ensembles of NMT systems. Ensembles were made up of 4 models sampled at different points of the training process. These points were evenly spaced (every 2,000 updates) around the single model that obtained the highest performance on the development set.

Corpus creation
The process for building synthetic parallel corpora begins with the selection from the monolingual pool. The selection method presented in Section 4.2 requires setting the dimension of the vector-space representation. We set it to 200, according to preliminary research, and kept this value for all the experiments reported in this paper.
Table 3: Selection examples from each domain.
XRCE:
• id rather send files electronically
• use current antivirus and a firewall
• images are stored on a one terabyte built in hard drive which includes a DVD burner
IT:
• the technology would also be available to ipod touch users although they would have to buy a microphone and headphones to make calls pc world reported
• if you want to find panorama archive material on delicious the easiest way to search is to use the single word on the right hand column
• my personal have is tweetdeck which although designed for photo uploading amongst other things
E-Com:
• it is perfect for your collection
• pasta is inexpensive easy and really romantic
• another shows the dust forming into clumps along magnetic lines like pearls on a necklace
Once we obtained the monolingual selections, we translated them. In order to speed up this process, we split the selection and translated it using Moses and NMT. Both systems were trained on the out-of-domain data. In the case of the NMT system, we applied the same BPE subword segmentation to all data. Therefore, the potential vocabulary differences across tasks were effectively handled by using subword units.

Results and analysis
In this section, we present and discuss the results obtained. We start by analyzing the selection obtained by Algorithm 1. Next, we present the translation results obtained in all tasks. Finally, in order to gain some insight into the system behavior, we analyze several representative examples.

Analysis of the selection
Table 2 shows the features of the selection for each corpus. Note that the average length of the sentences belonging to each selection is closely related to the sentence length of the corresponding test set (Table 1). Therefore, the selections from XRCE and E-Com had shorter sentences, while the selection obtained from the IT corpus had longer ones. As shown in the following sections, this was a key factor affecting the performance of the machine translation systems.
Moreover, Table 3 shows some samples from each domain, selected by our selection technique. We can notice that such samples are related to the corresponding test set domain. Thus, sentences from the XRCE and IT domains refer to a technological field. As illustrated in Table 2, sentences selected for the IT corpus were notably longer than those selected for XRCE. Sentences selected for the E-Com task are related to jewelry or economy. Given the E-Com domain (an online shop of silver jewelry), these sentences are also coherent.

Quantitative results
Table 4 shows the results on the XRCE and IT tasks. The general NMT model performed worse than Moses in out-of-domain tasks. The use of a 4-model ensemble was very helpful. Nevertheless, it still performed worse than Moses.
The TER values of the general NMT system in the XRCE task were unusually high. This is due to the corpus features: as shown in Table 1, the XRCE-Test set has an average sentence length of 9 words. The general NMT model generated sentences with an average of 13 words, because it was trained on general-domain data. The TER metric greatly penalizes this behavior, because the excess words must be deleted. Therefore, TER results of the NMT system were surprisingly high. In the case of Moses, the average length of the generated sentences was 9.5 words, because the generation was bounded by the phrase and language models.
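As a rough illustration of this length effect (a back-of-the-envelope bound derived from the average lengths quoted above, not a figure taken from the result tables): every hypothesis word beyond the reference length costs at least one deletion, so with an average of 13 generated words against 9 reference words,

$$\mathrm{TER} = \frac{\#\text{edits}}{\#\text{reference words}} \geq \frac{13 - 9}{9} \approx 0.44$$

That is, the length mismatch alone accounts for at least 44 TER points before any lexical error is counted. For Moses, the same bound is $(9.5 - 9)/9 \approx 0.06$.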
The addition of synthetic data significantly improved the NMT systems in all cases. Taking a single NMT model as reference, the gains ranged from 5 to 7 BLEU points. The performance of a single fine-tuned NMT model was also clearly better than that of fine-tuned ensembles.
The improvements in terms of TER were especially large. In the XRCE task, the synthetic data improved TER by almost 40 and 20 points, for the single model and the ensembles, respectively. Thanks to the synthetic data, the system learned to produce shorter translations (around 5 words shorter, on average), which greatly reduced TER. In the IT task, the synthetic data also improved TER, but to a lesser extent. This is because the IT task is closer to the out-of-domain corpus. Therefore, the adaptation benefits brought by the synthetic data were less crucial than in the XRCE task.
It is worth noting that the adaptation of the NMT system was very fast. The system only needed to be trained on ∼15,000 samples in order to achieve the best results. Using a GPU, the fine-tuning of the NMT model can be done in minutes.
Table 5 shows the results on the real E-Com task. This was a very specific task. In this case, the single NMT model also yielded worse performance in terms of BLEU than Moses, but when applying an ensemble, the results were significantly enhanced, even beating Moses in terms of BLEU. The NMT systems behaved similarly to the XRCE task in terms of TER. The E-Com corpus had similar features to XRCE-Test (in this case, 9.7 words per sentence). Therefore, we observed the same phenomenon: as we introduced in-domain-related sentences, the system learned to produce shorter sentences, reducing TER accordingly.
The use of synthetic data again greatly improved the system. The results were consistent with the previous experiments: a single fine-tuned model significantly outperformed the general system (+9 BLEU points). A single adapted system was even better than a general model ensemble. With respect to Moses, we also found major improvements in terms of BLEU.
It is also noticeable that the ensemble of systems trained with synthetic data did not improve on the performance of a single fine-tuned system. This is probably because the adaptation was performed from an already trained model and with little data. Therefore, the systems belonging to the ensemble were quite similar, all of them around the same local minimum, and the potential gains from ensembling were diluted.
Finally, we should remark that the E-Com task belongs to a real-world scenario. This corpus was not designed for experimental purposes. It contains elements that distort the experiment and may therefore lead to unpredictable results. In such open scenarios, a human evaluation should be the next step to take.

Qualitative results
Some translation examples from each corpus are shown in Table 6. In the first example, all the systems made the same error at the beginning of the translation (especificación del). This is because that was the most likely translation in our corpora, both the real and synthetic ones.
In the second example, Moses was not able to identify the right meaning of the word (windows) in the sentence to translate. It should be left untranslated, as it is a proper noun. The NMT systems were able to detect it. Also, the Moses, NMT and NMT+Synth systems made the same lexical choice error at the word (deberían).
Finally, we show the translation examples for the e-commerce domain. Moses obtained the worst translation. The NMT Σ method was not able to obtain the word (precioso) provided in the reference, and produced a synonym instead (hermoso). Note that, although this may not be an actual translation mistake, it is penalized by BLEU and TER. The NMT+Synth system obtained the closest translation to the reference. Even so, the system was unable to obtain a translation for the word (intertwined).
System outputs for the E-Com example in Table 6:
Moses: son un conjunto de pequeñas y encantadoras tiras finas plata interrelacionado .
NMT: son un precioso conjunto de tiras de película pequeña y delgada .
NMT Σ: son un hermoso conjunto de pequeñas y finas tiras de plata .
NMT+Synth: son un precioso conjunto de pequeñas y finas tiras de plata .

Conclusions
In this work, we presented an instance selection method and applied it to collect the most adequate sentences for translating a corpus from a specific domain. We selected domain-related instances from a large monolingual corpus, automatically translated them and fine-tuned an NMT system originally trained on a more general domain.
Results showed significant improvements in terms of BLEU and TER with respect to the original model. Moreover, we found that a single fine-tuned model is preferable to an ensemble of general models. It is also worth mentioning that, once the selection was performed, the adaptation of NMT systems to new domains was very fast (a few minutes).
As a byproduct of the evaluation carried out in this work, we can also draw two main conclusions. First, using a single automatic metric for evaluating machine translation is risky, as every automatic metric is likely to be distorted. To have more confidence in the performance of a machine translation system, it should be tested on several metrics. Second, when applying NMT systems to tasks whose features differ from those of the training data, we should control the length of the output sentences. This can be achieved either with some heuristics or by adapting with an in-domain corpus.
We leave the study of this control as future work.
As additional future work, we intend to test our method in more domains and different language pairs in order to establish its robustness. Moreover, we want to observe the influence of the quality and nature of the synthetic data in our pipeline. Therefore, we aim to study the influence of different translation methods or technologies when translating the monolingual corpus. We should also study whether adding source synthetic data instead of target synthetic data affects the system. Finally, given the good results obtained, we want to leverage the benefits of the synthetic data by using it in different applications.

Figure 1: The process of building an adequate synthetic parallel corpus for a given test set.

Table 2: Main figures of the selections obtained by Algorithm 1 for each test set ($\mathcal{T}$), employed for adapting the NMT system. |S| denotes number of sentences; |W|, number of words; |V|, vocabulary size; and $\overline{|W|}$, average sentence length.

Table 4: Translation results for the XRCE and IT tasks. BLEU and TER results given in percentage. Σ denotes an ensemble of 4 neural models. $\overline{|W|}$ is the average number of words per sentence.

Table 5: E-Com-Test set results. BLEU and TER results given in percentage. Σ denotes an ensemble of 4 neural models. $\overline{|W|}$ is the average number of words per sentence.