CSP: Code-Switching Pre-training for Neural Machine Translation

This paper proposes a new pre-training method, called Code-Switching Pre-training (CSP for short), for Neural Machine Translation (NMT). Unlike traditional pre-training methods, which randomly mask some fragments of the input sentence, the proposed CSP randomly replaces some words in the source sentence with their translation words in the target language. Specifically, we firstly perform lexicon induction with unsupervised word embedding mapping between the source and target languages, and then randomly replace some words in the input sentence with their translation words according to the extracted translation lexicons. CSP adopts the encoder-decoder framework: its encoder takes the code-mixed sentence as input, and its decoder predicts the replaced fragment of the input sentence. In this way, CSP is able to pre-train the NMT model by explicitly making the most of the cross-lingual alignment information extracted from the source and target monolingual corpora. Additionally, it relieves the pretrain-finetune discrepancy caused by artificial symbols like [mask]. To verify the effectiveness of the proposed method, we conduct extensive experiments on unsupervised and supervised NMT. Experimental results show that CSP achieves significant improvements over baselines without pre-training or with other pre-training methods.


Introduction
Neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015), which typically follows the encoder-decoder framework, directly applies a single neural network to transform the source sentence into the target sentence. With tens of millions of trainable parameters in the NMT model, translation tasks are usually data-hungry, and many of them are low-resource or even zero-resource in terms of training data. Following the idea of unsupervised and self-supervised pre-training methods in the NLP area (Peters et al., 2018; Radford et al., 2018, 2019; Devlin et al., 2019; Yang et al., 2019), some works are proposed to improve the NMT model with pre-training by making full use of the widely available monolingual corpora (Lample and Conneau, 2019; Song et al., 2019b; Edunov et al., 2019; Huang et al., 2019; Wang et al., 2019; Rothe et al., 2019; Clinchant et al., 2019). Typically, two different branches of pre-training approaches are proposed for NMT: model-fusion and parameter-initialization.
The model-fusion approaches seek to incorporate the sentence representation provided by a pre-trained model, such as BERT, into the NMT model (Yang et al., 2019b; Clinchant et al., 2019; Weng et al., 2019; Zhu et al., 2020; Lewis et al., 2019; Liu et al., 2020). These approaches are able to leverage publicly available pre-trained checkpoints, but they need to change the NMT model to fuse the sentence embedding calculated by the pre-trained model. The large-scale parameters of the pre-trained model significantly increase the storage cost and inference time, which makes it hard for this branch of approaches to be directly used in production. As opposed to model-fusion approaches, the parameter-initialization approaches aim to directly pre-train the whole or part of the NMT model with tailored objectives, and then initialize the NMT model with the pre-trained parameters (Lample and Conneau, 2019; Song et al., 2019b). These approaches are more production-ready since they keep the size and structure of the model the same as standard NMT systems.
While achieving substantial improvements, these pre-training approaches have two main drawbacks. Firstly, as pointed out by Yang et al. (2019), the artificial symbols like [mask] used by these approaches during pre-training are absent from real data at fine-tuning time, resulting in a pretrain-finetune discrepancy. Secondly, since each pre-training step only involves sentences from the same language, these approaches are unable to make use of the cross-lingual alignment information contained in the source and target monolingual corpora. We argue that, as a cross-lingual sequence generation task, NMT requires a tailored pre-training objective which is capable of making explicit use of cross-lingual alignment signals, e.g., word-pair information extracted from the source and target monolingual corpora, to improve the performance.
To address the limitations mentioned above, we propose Code-Switching Pre-training (CSP) for NMT. We extract word-pair alignment information from the source and target monolingual corpora automatically, and then apply the extracted alignment information to enhance pre-training. The training process of CSP can be presented in two steps: 1) perform lexicon induction to get translation lexicons by unsupervised word embedding mapping (Artetxe et al., 2018a; Conneau et al., 2018); 2) randomly replace some words in the input sentence with their translation words in the extracted translation lexicons and train the NMT model to predict the replaced words. CSP adopts the encoder-decoder framework: its encoder takes the code-mixed sentence as input, and its decoder predicts the replaced fragments based on the context calculated by the encoder. By predicting the sentence fragment which is replaced on the encoder side, CSP is able to either attend to the remaining words in the source language or to the translation words of the replaced fragment in the target language. Therefore, CSP trains the NMT model to: 1) learn how to build the sentence representation for the input sentence, as traditional pre-training methods do; 2) learn how to perform cross-lingual translation with the extracted word-pair alignment information. In summary, we mainly make the following contributions:
• We propose code-switching pre-training for NMT, which makes full use of the cross-lingual alignment information contained in the source and target monolingual corpora to improve pre-training for NMT.
• We conduct extensive experiments on supervised and unsupervised translation tasks. Experimental results show that the proposed approach consistently achieves substantial improvements.
• Last but not least, we find that CSP can successfully handle the code-switching inputs.
Related works
There have also been works on applying pre-specified translation lexicons to improve the performance of NMT. Hokamp and Liu (2017) and Post and Vilar (2018) proposed an altered beam search algorithm, which took target-side pre-specified translations as lexical constraints during beam search. Song et al. (2019a) investigated a data augmentation method, making code-switched training data by replacing source phrases with their target translations according to pre-specified translation lexicons. Recently, motivated by the success of unsupervised cross-lingual embeddings, Artetxe et al. (2018b), Lample et al. (2018a) and Yang et al. (2018) applied pre-trained translation lexicons to initialize the word embeddings of the unsupervised NMT model. Sun et al. (2019) applied translation lexicons to unsupervised domain adaptation in NMT. In this paper, we apply translation lexicons automatically extracted from monolingual corpora to improve the pre-training of NMT.

CSP
In this section, we firstly describe how to build the shared vocabulary for the NMT model; then we present how to extract the probabilistic translation lexicons; and finally we introduce the detailed training process of CSP.

Shared sub-word vocabulary
This paper processes the source and target languages with the same shared vocabulary created through sub-word toolkits, such as SentencePiece (SP) and Byte-Pair Encoding (BPE) (Sennrich et al., 2016b). We learn the sub-word splits on the concatenation of sentences equally sampled from the source and target corpora. The motivation is two-fold. Firstly, by processing the source and target languages with a shared vocabulary, the encoder of the NMT model is able to share the same vocabulary with the decoder. Sharing the vocabulary between the encoder and decoder makes it possible for CSP to replace source words in the input sentence with their translation words in the target language. Secondly, as pointed out by Lample and Conneau (2019), the shared vocabulary greatly improves the alignment of embedding spaces.
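As a toy illustration of the merge-learning loop behind BPE: the sketch below learns merge operations from sentences equally sampled from both languages, so that neither side dominates the statistics. The actual sub-word splits in this paper are produced by the subword-nmt or SentencePiece toolkits, not this code, and the sample sentences are invented.

```python
from collections import Counter

def learn_bpe_merges(sentences, num_merges):
    """Learn BPE merge operations from whitespace-tokenized sentences.

    A toy re-implementation of the merge-learning loop for illustration.
    """
    # Represent each word as a tuple of symbols, with a word-end marker.
    vocab = Counter()
    for sent in sentences:
        for word in sent.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Equal sampling: the same number of sentences from each language.
src = ["the cat sat", "the dog ran"]
tgt = ["die katze sass", "der hund lief"]
merges = learn_bpe_merges(src + tgt, num_merges=10)
```

Because the merges are learned on the mixed sample, frequent character sequences from both languages end up in one shared sub-word inventory, which is what lets the encoder and decoder share a single vocabulary.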

Probabilistic translation lexicons
Recently, some works have successfully learned translation equivalences between word pairs from two monolingual corpora and extracted translation lexicons (Artetxe et al., 2018a; Conneau et al., 2018). Following Artetxe et al. (2018a), we utilize unsupervised word embedding mapping to extract probabilistic translation lexicons with monolingual corpora only. The probabilistic translation lexicons in this paper are defined as one-to-many source-target word translations. Specifically, given separate source and target word embeddings, i.e., X_e and Y_e trained on the source and target monolingual corpora X and Y, unsupervised word embedding mapping utilizes self-learning or adversarial training to learn a mapping function f(X) = WX, which transforms the source and target word embeddings into a shared embedding space. With the word embeddings in the same latent space, we measure the similarities between source and target words with the cosine distance of their word embeddings. Then, we extract the probabilistic translation lexicons by selecting the top k nearest neighbors in the shared embedding space. Formally, considering a word x_i in the source language, its top k nearest neighbor words in the target language, denoted as y_i1, y_i2, ..., y_ik, are extracted as its translation words, and the corresponding normalized similarities s_i1, s_i2, ..., s_ik are defined as the translation probabilities.
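The lexicon-extraction step can be sketched as follows, assuming the embeddings have already been mapped into the shared space by a method such as vecmap (the mapping step itself is not shown). The function name, toy dimensions and word lists are illustrative, and softmax over the top-k cosine scores stands in for whatever normalization the authors actually used.

```python
import numpy as np

def extract_lexicon(src_emb, tgt_emb, src_words, tgt_words, k=3):
    """Extract top-k probabilistic translation lexicons from embeddings
    that already live in a shared cross-lingual space."""
    # Normalize rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                      # (n_src, n_tgt) cosine matrix

    lexicon = {}
    for i, word in enumerate(src_words):
        top = np.argsort(-sims[i])[:k]      # indices of the k nearest target words
        scores = np.exp(sims[i, top])       # softmax-normalize the similarities
        probs = scores / scores.sum()       # into translation probabilities
        lexicon[word] = [(tgt_words[j], float(p)) for j, p in zip(top, probs)]
    return lexicon

# Toy example: 2 source words, 4 target words, 3-dimensional embeddings.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(2, 3))
tgt_emb = rng.normal(size=(4, 3))
lex = extract_lexicon(src_emb, tgt_emb,
                      ["hund", "katze"],
                      ["dog", "cat", "fish", "bird"], k=3)
```

Each source word thus maps to k candidate translations with probabilities summing to one, which is exactly the "one-to-many" lexicon format that the replacement step samples from.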

Training process of CSP
CSP only requires monolingual data to pre-train the NMT model. Given an unpaired source sentence x ∈ X, where x = (x_1, x_2, ..., x_m) is the source sentence with m tokens, we denote x_[u:v] as the sentence fragment of x from u to v where 0 < u < v < m, and denote x^\u:v as the modified version of x where its fragment from position u to v is replaced with translation words according to the probabilistic translation lexicons. Formally, x^\u:v is represented as:

x^\u:v = (x_1, ..., x_{u-1}, y_u, ..., y_v, x_{v+1}, ..., x_m),   (1)

where x^\u:v_[u:v] = (y_u, ..., y_v) is sampled based on the extracted probabilistic translation lexicons presented in Section 3.2. Here, we take the replacing process from x_u to y_u as an example. Considering the source word x_u, its top k translation words y_u1, y_u2, ..., y_uk and the translation probabilities s_u1, s_u2, ..., s_uk, y_u is calculated as:

y_u = y_uj,  j ~ Multinomial(s_u1, ..., s_uk),   (2)

where y_uj is decided by performing multinomial sampling on the distribution defined by the translation probabilities s_u1, s_u2, ..., s_uk: the higher the translation probability s_uj, the more likely the translation word y_uj is to be selected. The model is then trained to predict the replaced fragment given the code-mixed input, with the log-likelihood objective:

L(θ; X) = (1/|X|) Σ_{x∈X} log P(x_[u:v] | x^\u:v; θ).   (3)

Figure 1 shows an example of CSP training, where the fragment (x_3, x_4, x_5, x_6) of the original source sentence (x_1, x_2, x_3, x_4, x_5, x_6, x_7) is replaced with the translation words (y_3, y_4, y_5, y_6) sampled from the extracted probabilistic translation lexicons. The encoder takes the code-mixed source sentence as input, and the decoder only predicts the replaced fragment (x_3, x_4, x_5, x_6).
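The fragment-replacement step above can be sketched as follows; the function and variable names are made up for illustration, and the lexicon is a toy two-entry example in the `{source word: [(translation, probability), ...]}` format of Section 3.2.

```python
import random

def make_csp_example(tokens, lexicon, frac=0.5, rng=None):
    """Build one CSP training example: a code-mixed encoder input and
    the replaced source fragment as the decoder target."""
    rng = rng or random.Random(0)
    m = len(tokens)
    span = max(1, int(m * frac))            # length of the replaced fragment
    u = rng.randrange(0, m - span + 1)      # random start position u
    v = u + span

    enc_input = list(tokens)
    for i in range(u, v):
        word = tokens[i]
        if word in lexicon:
            cands, probs = zip(*lexicon[word])
            # Multinomial sampling: translations with higher probability
            # are more likely to be chosen.
            enc_input[i] = rng.choices(cands, weights=probs, k=1)[0]
    dec_target = tokens[u:v]                # the decoder predicts the original fragment
    return enc_input, dec_target

lex = {"katze": [("cat", 0.7), ("kitten", 0.3)], "sass": [("sat", 1.0)]}
enc, dec = make_csp_example(["die", "katze", "sass", "hier"], lex)
```

Note that the decoder target is always the original source fragment, so the model must recover source words from a context that mixes both languages.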

Experiments and Results
This section describes the experimental details of CSP pre-training and fine-tuning on the supervised and unsupervised NMT tasks. To test the effectiveness and generality of CSP, we conduct extensive experiments on English-German, English-French and Chinese-to-English translation tasks.

CSP pre-training
Model configuration We choose Transformer as the basic model structure. Following the base model in Vaswani et al. (2017), we set the dimension of word embedding as 512, the dropout rate as 0.1 and the number of heads as 8. To be comparable with previous works, we use a 4-layer encoder and 4-layer decoder for unsupervised NMT, and a 6-layer encoder and 6-layer decoder for supervised NMT. The encoder and decoder share the same word embeddings.

Datasets and pre-processing
Following the work of Song et al. (2019b), we use monolingual data sampled from the WMT News Crawl datasets for English, German and French, with 50M sentences for each language. For Chinese, we choose 10M sentences from the combination of the LDC and WMT2018 corpora. For each translation task, the source and target languages are jointly tokenized into sub-word units with BPE (Sennrich et al., 2016b). The vocabulary is extracted from the tokenized corpora and shared by the source and target languages. For the English-German and English-French translation tasks, we set the vocabulary size as 32k. For Chinese-English, the vocabulary size is set as 60k since few tokens are shared by Chinese and English. To extract the probabilistic translation lexicons, we utilize the monolingual corpora described above to train the embeddings for each language independently using word2vec (Mikolov et al., 2013). We then apply the public implementation of the method proposed by Artetxe et al. (2017) to map the source and target word embeddings to a shared latent space.
Training details We replace consecutive tokens in the source input with their translation words sampled from the probabilistic translation lexicons, with a random start position u. Following Song et al. (2019b), the length of the replaced fragment is empirically set as roughly 50% of the total number of tokens in the sentence, and each replaced token in the encoder will be the translation token 80% of the time, a random token 10% of the time and the unchanged token 10% of the time. In the extracted probabilistic translation lexicons, we only keep the top three translation words for each source word, and we also investigate how the number of translation words affects the training process. All of the models are implemented in PyTorch and trained on 8 P40 GPU cards. We use the Adam optimizer with a learning rate of 0.0005 for pre-training.
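The 80/10/10 replacement policy in the training details can be sketched per token as follows (the helper name and toy vocabulary are illustrative):

```python
import random

def corrupt_token(src_token, translation, vocab, rng):
    """Apply the 80/10/10 policy: emit the sampled translation 80% of
    the time, a random vocabulary token 10% of the time, and keep the
    original source token the remaining 10% of the time."""
    r = rng.random()
    if r < 0.8:
        return translation
    elif r < 0.9:
        return rng.choice(vocab)
    return src_token

rng = random.Random(0)
vocab = ["der", "die", "das", "hund", "katze"]
outs = [corrupt_token("katze", "cat", vocab, rng) for _ in range(1000)]
frac_translated = outs.count("cat") / 1000   # should be close to 0.8
```

Keeping a small fraction of random and unchanged tokens mirrors the masking policy of BERT-style pre-training and prevents the model from assuming every token in the replaced span is a target-language word.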

Fine-tuning on unsupervised NMT
In this section, we describe the experiments on the unsupervised NMT, where we only utilize monolingual data to fine-tune the NMT model based on the pre-trained model.
Experimental settings For the unsupervised English-German and English-French translation tasks, we take experimental settings similar to Lample and Conneau (2019) and Song et al. (2019b). Specifically, we randomly sample 5M monolingual sentences from the monolingual data used during pre-training and report BLEU scores on WMT14 English-French and WMT16 English-German. For fine-tuning on the unsupervised Chinese-to-English translation task, we also randomly sample 1.6M monolingual sentences for Chinese and English respectively, similar to Yang et al. (2018). We take NIST02 as the development set and report the BLEU score averaged on the test sets NIST03, NIST04 and NIST05. To be consistent with the baseline systems, we apply the script multi-bleu.pl to evaluate the translation performance for all of the translation tasks.
Baseline systems We take the following four strong baseline systems.
Results Table 1 shows the experimental results on unsupervised NMT. From Table 1, we can find that the proposed CSP outperforms all of the previous works on the English-to-German, German-to-English, English-to-French and Chinese-to-English unsupervised translation tasks, with as high as +0.7 BLEU points improvement on the German-to-English translation task. In the French-to-English translation direction, CSP also achieves comparable results with the SOTA baseline of Song et al. (2019b). On the Chinese-to-English translation task, CSP even achieves +1.1 BLEU points improvement compared to the reproduced result of Song et al. (2019b). These results indicate that fine-tuning unsupervised NMT on the model pre-trained by CSP consistently outperforms the previous unsupervised NMT systems with or without pre-training.

Fine-tuning on supervised NMT
This section describes our experiments on supervised NMT where we fine-tune the pre-trained model with bilingual data.
Experimental settings For supervised NMT, we conduct experiments on publicly available data sets, i.e., the WMT14 English-French, WMT14 English-German and LDC Chinese-to-English corpora, which are used extensively as benchmarks for NMT systems. We use the full WMT14 English-German and WMT14 English-French corpora as our training sets, which contain 4.5M and 36M sentence pairs respectively. For the Chinese-to-English translation task, our training data consists of 1.6M sentence pairs randomly extracted from the LDC corpora. All of the sentences are encoded with the same BPE codes utilized in pre-training.
Baseline systems For supervised NMT, we consider the following three baseline systems. The first one is the work of Vaswani et al. (2017), which achieves SOTA results on the WMT14 English-German and English-French translation tasks. The other two baseline systems are proposed by Lample and Conneau (2019) and Song et al. (2019b), both of which fine-tune the supervised NMT tasks on pre-trained models. Furthermore, we compare with the back-translation method, which has shown great effectiveness in improving the NMT model with monolingual data (Sennrich et al., 2016a). Specifically, for each baseline system, we translate the target monolingual data used during pre-training back to the source language with a reversely-trained model, and get a pseudo-parallel corpus by combining each translation with its original sentence. Finally, the training data, which includes pseudo and parallel sentence pairs, is shuffled and used to train the NMT system.
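The back-translation data construction described above can be sketched as follows; `backtranslate` stands in for a reversely-trained target-to-source NMT model, and the toy sentences are invented.

```python
import random

def build_training_data(parallel_pairs, tgt_mono, backtranslate, rng=None):
    """Combine real parallel data with pseudo-parallel pairs obtained by
    back-translating target monolingual sentences with a reverse model."""
    rng = rng or random.Random(0)
    # Each pseudo pair couples a back-translated source with the real target.
    pseudo = [(backtranslate(t), t) for t in tgt_mono]
    data = parallel_pairs + pseudo
    rng.shuffle(data)                      # mix pseudo and real pairs
    return data

parallel = [("die katze", "the cat")]
mono = ["the dog", "the bird"]
# A trivial stand-in "model" that just tags its input.
data = build_training_data(parallel, mono, backtranslate=lambda t: "<bt> " + t)
```

In practice the back-translation step is a full decoding pass with the reverse model over the monolingual corpus; the pipeline shape is what matters here.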

Results
The experimental results on supervised NMT are presented in Table 2. We report the BLEU scores on the English-to-German, English-to-French and Chinese-to-English translation directions. For each translation task, we report the BLEU scores for the standard NMT model and the model trained with back-translation respectively. As shown in Table 2, compared to the baseline system without pre-training (Vaswani et al., 2017), the proposed model achieves +1.6 and +0.7 BLEU points improvements on the English-to-German and English-to-French translation directions respectively. Even compared to the stronger baseline system with pre-training (Song et al., 2019b), we also achieve +0.5 and +0.4 BLEU points improvements respectively on these two translation directions. On the Chinese-to-English translation task, the proposed model achieves +0.7 BLEU points improvement compared to the baseline system of Song et al. (2019b). With back-translation, the proposed model still outperforms all of the baseline systems. The experimental results above show that fine-tuning supervised NMT on the model pre-trained by CSP achieves substantial improvements over previous supervised NMT systems with or without pre-training. Additionally, this verifies that CSP is able to work together with back-translation.

Study the number of translation words
In CSP, the probabilistic translation lexicons only keep the top k translation words for each source word. The number of translation words k is an important hyper-parameter which needs to be set carefully for pre-training. A natural question is: how many translation words do we need to keep for each source word? Intuitively, if k is set to a small number, the model may lose generality since each source word can be replaced with only a few translation words, which severely limits the diversity of the context. Otherwise, if k is too large, the accuracy of the extracted probabilistic translation lexicons may be significantly diminished, which introduces too much noise into pre-training. Therefore, there is a trade-off between generality and accuracy. We investigate this problem by studying the translation performance of unsupervised NMT with different k, where we vary k from 1 to 10 and report both the perplexity during pre-training and the translation performance after fine-tuning on the unsupervised NMT tasks, including the English-to-German and English-to-French translation directions. For each translation direction, we firstly present the perplexity (PPL) score of the pre-trained model averaged on the monolingual validation sets of the source and target languages, and then we show the BLEU score of the fine-tuned model on the bilingual validation set. From Figure 2, it can be seen that, when k is set around 3, the pre-trained model achieves the best validation PPL scores on both the English-to-German and English-to-French translation directions. Similarly, CSP also achieves the best BLEU scores on the unsupervised translation tasks when k is set around 3.

Ablation study
To understand the importance of different components of the model pre-trained by CSP, we perform an ablation study by training multiple versions of the supervised NMT model with some components initialized randomly: the word embeddings, the encoder, the attention module between the encoder and decoder, and the decoder. Experiments are conducted on the English-to-German and English-to-French translation tasks. All models are trained without back-translation, and the results are reported in Table 3. We find that the two most critical components are the pre-trained encoder and the attention module. This shows that CSP enhances NMT not only in its ability to build sentence representations for the input sentence, but also in its ability to align the source and target languages with the help of word-pair alignment information. Additionally, the experimental results indicate that the pre-trained decoder has little effect on the translation performance. This is mainly because the decoder only predicts source-side words during pre-training but predicts target-side words during fine-tuning. This pretrain-finetune mismatch makes the pre-trained decoder less helpful for performance improvement.

System                           en-de   en-fr
No pre-trained embeddings         28.4    38.5
No pre-trained encoder            27.9    38.2
No pre-trained attention module   28.1    38.3
No pre-trained decoder            28.8    38.8
Full model pre-trained by CSP     28.9    38.8

Table 3: Ablation study on English-German and English-French translation tasks. The embeddings include the source-side and target-side word embeddings.

Code-switching translation
Code-switching, where a single input contains words from different languages, has attracted more and more attention in NMT (Johnson et al., 2017; Menacer et al., 2019). In this section, we show that the proposed CSP is able to enhance the ability of the fine-tuned NMT model to handle code-switching inputs. To present quantitative results, we build two test sets for the supervised Chinese-to-English translation task to evaluate the performance of the translation model on code-switching inputs. We randomly select 200 Chinese-English sentence pairs from NIST02, based on which we build two code-switching test sets. The first test set, referred to as test A, is built by randomly replacing some phrases in each Chinese sentence with their counterpart English phrases, where each English phrase is the translation result of feeding the corresponding Chinese phrase to the Google Chinese-to-English translator. The second test set, referred to as test B, is constructed by randomly replacing some of the words in each Chinese sentence with their nearest target words in the shared latent embedding space (the same way used by CSP in Section 3.2).

Conclusions and Future work
This work proposes a simple yet effective pre-training approach, i.e., CSP for NMT, which randomly replaces some words in the source sentence with their translation words from probabilistic translation lexicons extracted from monolingual corpora only. To verify the effectiveness of CSP, we investigate two downstream tasks, supervised and unsupervised NMT, on the English-German, English-French and Chinese-to-English translation tasks. Experimental results show that the proposed approach consistently achieves substantial improvements over strong baselines. Additionally, we show that CSP is able to enhance the ability of NMT to handle code-switching inputs. There are two promising directions for future work. Firstly, we are interested in applying CSP to other related NLP areas with code-switching problems. Secondly, we plan to investigate pre-training objectives which are more effective in utilizing cross-lingual alignment information for NMT.

Figure 1 :
Figure 1: A training example of our proposed CSP, which randomly replaces some words in the source input with their translation words based on the probabilistic translation lexicons. Identical to MASS, the token - represents the padding in the decoder. The attention module represents the attention between the encoder and decoder.

Figure 2 :
Figure 2: The performance of CSP with the probabilistic translation lexicons keeping different numbers of translation words for each source word, which includes: (a) the PPL score of the pre-trained English-to-German model; (b) the PPL score of the pre-trained English-to-French model; (c) the BLEU score of the fine-tuned unsupervised English-to-German NMT model; (d) the BLEU score of the fine-tuned unsupervised English-to-French NMT model.
Figure 2 (a) and (c) illustrate the PPL score of the pre-trained model and the BLEU score of the fine-tuned unsupervised NMT model respectively on English-to-German translation. Figure 2 (b) and (d) present the PPL and BLEU scores respectively for English-to-French translation.

Table 2 :
The translation performance of supervised NMT on the English-German, English-French and Chinese-to-English test sets. (+ BT: trains the model with the back-translation method.)

Table 4 shows the translation performance of NMT systems on the two code-switching test sets. Besides the baseline systems mentioned in Section 4.3, we also train a Chinese-English multi-lingual system (Johnson et al., 2017) based on Transformer, which has shown the ability to handle code-switching inputs. From Table 4, we can find that the proposed approach achieves significant improvements over previous works. Compared to the multi-lingual system, we achieve +2.3 and +3.0 BLEU points improvements respectively on test A and test B. A case study can be found in Appendix D.

Table 4 :
The performance of Chinese-to-English translation on in-house code-switching test sets.