Improving Pre-Trained Multilingual Models with Vocabulary Expansion

Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential. In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on the pre-trained multilingual model BERT for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that mixture mapping is the more promising of the two. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.


Introduction
It has been shown that performance on many natural language processing tasks drops dramatically on held-out data when a significant percentage of words do not appear in the training data, i.e., out-of-vocabulary (OOV) words (Søgaard and Johannsen, 2012; Madhyastha et al., 2016). A higher OOV rate (i.e., the percentage of unseen words in the held-out data) may lead to a more severe performance drop (Kaljahi et al., 2015). OOV problems have been addressed in previous work under monolingual settings, through replacing OOV words with semantically similar in-vocabulary words (Madhyastha et al., 2016; Kolachina et al., 2017), using character/word information (Kim et al., 2016, 2018; Chen et al., 2018), or using subword information such as byte pair encoding (BPE) (Sennrich et al., 2016; Stratos, 2017).
Recently, fine-tuning a pre-trained deep language model, such as Generative Pre-Training (GPT) (Radford et al., 2018) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), has achieved remarkable success on various downstream natural language processing tasks. Instead of pre-training many monolingual models, such as the existing English GPT, English BERT, and Chinese BERT, a more natural choice is to develop a powerful multilingual model such as the multilingual BERT.
However, all these pre-trained models rely on language modeling, where a common trick is to tie the weights of the softmax layer and the word embeddings (Press and Wolf, 2017). Due to the expensive computation of softmax (Yang et al., 2017) and data imbalance across different languages, the vocabulary size for each language in a multilingual model is relatively small compared to the monolingual BERT/GPT models, especially for low-resource languages. Even for a high-resource language like Chinese, its vocabulary in the multilingual BERT (roughly 10k subwords) is only half the size of that in the Chinese BERT. Just as in monolingual settings, the OOV problem also hinders the performance of a multilingual model on tasks that are sensitive to token-level or sentence-level information. For example, in the POS tagging task (Table 2), 11 out of 16 languages have significant OOV issues (OOV rate ≥ 5%) when using the multilingual BERT.
According to previous work (Radford et al., 2018; Devlin et al., 2018), it is time-consuming and resource-intensive to pre-train a deep language model over large-scale corpora. To address the OOV problem, instead of pre-training a deep model with a larger vocabulary, we aim to enlarge the vocabulary size when we fine-tune a pre-trained multilingual model on downstream tasks.
We summarize our contributions as follows: (i) We investigate and compare two methods to alleviate the OOV issue. To the best of our knowledge, this is the first attempt to address the OOV problem in multilingual settings. (ii) By using English as an interlingua, we show that bilingual information helps alleviate the OOV issue, especially for low-resource languages. (iii) We conduct extensive experiments on a variety of token-level and sentence-level downstream tasks to examine the strengths and weaknesses of these methods, which may provide key insights into future directions.

Approach
We use the multilingual BERT as the pre-trained model. We first introduce the pre-training procedure of this model (Section 2.1) and then introduce two methods we investigate for alleviating the OOV issue by expanding the vocabulary (Section 2.2). Note that these approaches are not restricted to BERT but are also applicable to other similar models.
In the pre-training stage, Devlin et al. (2018) use two objectives: masked language model (LM) and next sentence prediction (NSP). In masked LM, they randomly mask some input tokens and then predict these masked tokens. Compared to a unidirectional LM, masked LM enables representations to fuse the context from both directions. In the NSP task, given a pair of sentences, the model predicts whether the second sentence actually follows the first. The purpose of adding the NSP objective is that many downstream tasks, such as question answering and language inference, require sentence-level understanding, which is not directly captured by LM objectives.
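To make the masked-LM objective concrete, below is a minimal sketch of the input-corruption step, assuming whitespace-tokenized input. The 15% masking rate follows Devlin et al. (2018), but the function itself is our illustration and omits their refinement of sometimes keeping or randomly replacing a selected token instead of masking it.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Corrupt a token sequence for masked-LM training: each token is
    independently replaced by [MASK] with probability mask_rate, and the
    model is trained to recover the original token at masked positions."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)    # loss is computed only at these positions
        else:
            masked.append(tok)
            labels.append(None)   # no prediction target here
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split())
```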
After pre-training on large-scale corpora such as Wikipedia and BookCorpus (Zhu et al., 2015), we follow recent work (Radford et al., 2018; Devlin et al., 2018) and fine-tune the pre-trained model on different downstream tasks with minimal architecture adaptation. We show how to adapt BERT to different downstream tasks in Figure 1 (left).

Vocabulary Expansion
Devlin et al. (2018) pre-train the multilingual BERT on Wikipedia in 102 languages, with a shared vocabulary of 110k subwords generated by the WordPiece model (Wu et al., 2016). If we ignore the subwords shared between languages, each language has, on average, a 1.1k vocabulary, which is significantly smaller than that of a monolingual pre-trained model such as GPT (40k). The OOV problem tends to be less serious for languages (e.g., French and Spanish) that belong to the same language family as English. However, this is not always true, especially for morphologically rich languages such as German (Ataman and Federico, 2018; Lample et al., 2018). The OOV problem is much more severe in low-resource scenarios, especially when a language (e.g., Japanese and Urdu) uses an entirely different character set from high-resource languages.
We focus on addressing the OOV issue at the subword level in multilingual settings. Formally, suppose we have an embedding $E^{bert}$ extracted from the (non-contextualized) embedding layer of the multilingual BERT (i.e., the first layer of BERT). Suppose we also have a set of (non-contextualized) subword embeddings $\{E^{l_1}, E^{l_2}, \ldots, E^{l_n}\} \cup \{E^{en}\}$, which are pre-trained on large corpora using any standard word embedding toolkit. Specifically, $E^{en}$ represents the pre-trained embedding for English, and $E^{l_i}$ represents the pre-trained embedding for the non-English language $l_i$ at the subword level. We denote the vocabularies of $E^{bert}$, $E^{en}$, and $E^{l_i}$ by $V^{bert}$, $V^{en}$, and $V^{l_i}$, respectively. For each subword $w$ in $V^{bert}$, we use $E^{bert}(w)$ to denote the pre-trained embedding of $w$ in $E^{bert}$; $E^{l_i}(\cdot)$ and $E^{en}(\cdot)$ are defined analogously. For each non-English language $l_i \in \{l_1, l_2, \ldots, l_n\}$, we aim to enrich $E^{bert}$ with more subwords from the vocabulary of $E^{l_i}$, since $E^{l_i}$ contains a larger vocabulary of language $l_i$ than $E^{bert}$ does.
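Extracting $E^{bert}$ amounts to reading off the input embedding matrix of the pre-trained model. A minimal sketch with the Hugging Face transformers library (a tooling choice of ours, not the paper's; the checkpoint name is one currently distributed multilingual variant):

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

# E_bert: the non-contextualized input embedding matrix,
# one row per subword in V_bert (shape: vocab_size x hidden_size).
E_bert = model.get_input_embeddings().weight.detach()

def lookup(subword):
    """E_bert(w): the first-layer embedding of a single subword."""
    return E_bert[tokenizer.convert_tokens_to_ids(subword)]
```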
As there is no previous work addressing multilingual OOV issues, inspired by solutions designed for monolingual settings, we investigate the following two methods. Both can be applied at either the word or the subword level, though we find the subword level works better (Section 3).

Joint Mapping: For each non-English language $l$, we first construct a joint embedding space $\widetilde{E}^{l}$ by mapping $E^{l}$ to $E^{en}$ with an orthogonal mapping matrix $B^{l}$ (i.e., $\widetilde{E}^{l} = E^{l} B^{l}$). When a bilingual dictionary $f^{l}: V^{l} \to V^{en}$ is available or can be constructed based on shared common subwords (Section 3.1), we obtain $B^{l}$ by minimizing

$$\sum_{w \in V^{l}} \left\| E^{l}(w)\, B^{l} - E^{en}\!\left(f^{l}(w)\right) \right\|_{F},$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm. Otherwise, for a language pair (e.g., English-Urdu) that meets neither of the above two conditions, we obtain $B^{l}$ by the unsupervised word alignment method from MUSE (Conneau et al., 2018).
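Under the orthogonality constraint on $B^{l}$, this minimization is the classic orthogonal Procrustes problem and has a closed-form SVD solution. A minimal numpy sketch, assuming the dictionary pairs have already been stacked into row-aligned matrices:

```python
import numpy as np

def procrustes(X, Y):
    """Solve min_B ||X B - Y||_F subject to B orthogonal.
    Rows of X hold E_l(w) and rows of Y hold E_en(f_l(w)) for the
    dictionary pairs; the optimum is B = U V^T with U S V^T = SVD(X^T Y)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Sanity check on synthetic data: recover a known orthogonal map.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
B_true, _ = np.linalg.qr(rng.normal(size=(300, 300)))
B_hat = procrustes(X, X @ B_true)
assert np.allclose(B_hat, B_true, atol=1e-6)
```

The same solver yields the matrix $A^{l}$ in the next step, with BERT-side anchors taking the place of the dictionary targets.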
We then map $\widetilde{E}^{l}$ to $E^{bert}$ with an orthogonal mapping matrix $A^{l}$, obtained by minimizing

$$\sum_{w} \left\| \widetilde{E}^{l}(w)\, A^{l} - E^{bert}(w) \right\|_{F},$$

where $w$ ranges over the subwords shared between the joint space and $V^{bert}$. We denote this method by $M_J$ in our discussion below, where the subscript $J$ stands for "joint".

Mixture Mapping: Following the work of Gu et al. (2018), who use English as "universal tokens" and map all other languages to English to obtain subword embeddings, we represent each subword in $\widetilde{E}^{l}$ (described in joint mapping) as a mixture of English subwords that are already in the BERT vocabulary $V^{bert}$. This method, denoted by $M_M$, is also a joint mapping, but without the need to learn a mapping from $\widetilde{E}^{l}$ to $E^{bert}$. Specifically, for each $w \in V^{l}$, we obtain its embedding $e(w)$ in the BERT embedding space $E^{bert}$ as follows:
$$e(w) = \sum_{u \in T(w)} p(u \mid w)\, E^{bert}(u),$$

where $T(w)$ is a set defined below, and the mixture coefficient $p(u \mid w)$ is a softmax over CSLS scores:

$$p(u \mid w) = \frac{\exp\!\big(\mathrm{CSLS}(\widetilde{E}^{l}(w), E^{en}(u))\big)}{\sum_{v \in T(w)} \exp\!\big(\mathrm{CSLS}(\widetilde{E}^{l}(w), E^{en}(v))\big)},$$

where CSLS refers to the distance metric Cross-domain Similarity Local Scaling (Conneau et al., 2018). We select the five $v \in V^{en} \cap V^{bert}$ with the highest $\mathrm{CSLS}(\widetilde{E}^{l}(w), E^{en}(v))$ to form the set $T(w)$. In all our experiments, we set the number of nearest neighbors in CSLS to 10. We refer readers to Conneau et al. (2018) for details. Figure 1 (right) illustrates the joint and mixture mapping.
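A sketch of the mixture construction, assuming `E_en` and `E_bert_en` are row-aligned matrices over the English subwords in $V^{en} \cap V^{bert}$ and the rows of `E_l_joint` are the joint-space vectors $\widetilde{E}^{l}(w)$; all names are ours:

```python
import numpy as np

def csls(src, tgt, k=10):
    """CSLS scores (Conneau et al., 2018): cosine similarity penalized by
    each vector's average similarity to its k nearest cross-domain
    neighbors, which counteracts hubness."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                                   # (n_src, n_tgt)
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # source hubness
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # target hubness
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

def mixture_embeddings(E_l_joint, E_en, E_bert_en, top=5):
    """For every source subword w (rows of E_l_joint), build
    e(w) = sum_{u in T(w)} p(u|w) E_bert(u), where T(w) holds the `top`
    English subwords (already in V_bert) closest to w under CSLS and
    p(u|w) is a softmax over their CSLS scores."""
    scores = csls(E_l_joint, E_en)                  # (n_l, n_en)
    out = np.empty((len(E_l_joint), E_bert_en.shape[1]))
    for i, row in enumerate(scores):
        T = np.argsort(row)[-top:]                  # the set T(w)
        p = np.exp(row[T] - row[T].max())           # stable softmax
        out[i] = (p / p.sum()) @ E_bert_en[T]
    return out
```

The returned rows can then be appended to BERT's input embedding matrix for the newly added subwords (e.g., via `tokenizer.add_tokens` and `model.resize_token_embeddings` in the transformers library, one possible realization).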

Experiment Settings
We obtain the pre-trained embeddings of a specific language by training fastText (Bojanowski et al., 2017) on Wikipedia articles in that language, with a context window of 5 and 5 negative samples. Before training, we first apply BPE (Sennrich et al., 2016) to tokenize the corpus with a subword vocabulary size of 50k. For the joint mapping method $M_J$, we use the bilingual dictionaries provided by Conneau et al. (2018). For a language pair where a bilingual dictionary is not easily available, if the two languages share a significant number of common subwords (which often happens when they belong to the same language family), we construct a bilingual dictionary based on the assumption that identical subwords have the same meaning (Søgaard et al., 2018). We add all unseen subwords from the 50k vocabulary to BERT. We define a word as OOV once it cannot be represented as a single in-vocabulary token. For example, in BERT the sentence "Je sens qu'entre ça et les films de médecins et scientifiques" ("I feel that, between that and the films about doctors and scientists") is represented as "je sens qu ##' entre [UNK] et les films de [UNK] et scientifiques". Here qu' is an OOV word at the word level, since it can only be represented by two subword units (qu and ##'), but it is not OOV at the subword level; ça and médecins cannot be represented by any single token or combination of subword units, so they are OOV at both the word and the subword level.
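This word-level vs. subword-level distinction can be checked mechanically with the WordPiece tokenizer. A small sketch (the helper is ours; the checkpoint name is one currently distributed multilingual variant, and exact piece sequences may differ across tokenizer versions):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def oov_status(word):
    """Classify a word under the paper's definition: OOV at the subword
    level if some piece falls back to [UNK]; OOV at the word level if it
    is not representable as a single in-vocabulary token."""
    pieces = tokenizer.tokenize(word)
    subword_oov = "[UNK]" in pieces
    word_oov = subword_oov or len(pieces) != 1
    return {"pieces": pieces, "word_oov": word_oov, "subword_oov": subword_oov}

for w in ["sens", "qu'", "ça", "médecins"]:
    print(w, oov_status(w))
```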
We use the multilingual BERT with default parameters in all our experiments, except that we tune the batch size and the number of training epochs. To thoroughly examine the pros and cons of the explored methods, we conduct experiments on a variety of token-level and sentence-level classification tasks: part-of-speech (POS) tagging, named entity recognition (NER), machine translation quality estimation, and machine reading comprehension. See each subsection for details.

Discussions about Mapping Methods
Previous work typically assumes that a linear mapping exists between the embedding spaces of different languages if their embeddings are trained using similar techniques (Xing et al., 2015; Madhyastha et al., 2016). However, it is difficult to map embeddings learned with different methods (Søgaard et al., 2018). Considering the differences between BERT and fastText (e.g., the training objectives, the way subwords are distinguished, and BERT's much deeper architecture), it is very unlikely that the (non-contextualized) BERT embedding and the fastText embedding reside in the same geometric space. Besides, because BERT has a relatively small vocabulary for each language, when we map a pre-trained vector to its counterpart in BERT indirectly, as in previous methods, the supervision signal is relatively weak, especially for low-resource languages. In our experiments, we find that the accuracy of the mapping from our pre-trained English embedding to the multilingual BERT embedding (English portion) is lower than 30%. In contrast, the accuracy of a mapping between two regular English embeddings pre-trained with similar methods (e.g., CBOW or SkipGram (Mikolov et al., 2013)) can be above 95% (Conneau et al., 2018).
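The mapping accuracy quoted above can be measured by a simple self-retrieval probe: after mapping, every English subword that already exists in the BERT vocabulary should retrieve itself as its own nearest neighbor. A sketch (using plain cosine rather than CSLS for brevity; names are ours):

```python
import numpy as np

def self_retrieval_accuracy(E_mapped, E_bert_en):
    """Fraction of shared subwords whose nearest neighbor in the BERT
    space is themselves; rows of both matrices are aligned over the
    subwords shared by V_en and V_bert."""
    A = E_mapped / np.linalg.norm(E_mapped, axis=1, keepdims=True)
    B = E_bert_en / np.linalg.norm(E_bert_en, axis=1, keepdims=True)
    nearest = (A @ B.T).argmax(axis=1)
    return float((nearest == np.arange(len(A))).mean())
```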
Besides the joint mapping method $M_J$ (Section 2.2), another possible method for the OOV problem in multilingual settings is, for each language $l$, to map its pre-trained embedding space $E^{l}$ to $E^{bert}$ directly with an orthogonal mapping matrix $A^{l}$, obtained by minimizing

$$\sum_{w \in V^{l} \cap V^{bert}} \left\| E^{l}(w)\, A^{l} - E^{bert}(w) \right\|_{F}.$$

This approach is similar to that of Madhyastha et al. (2016) and is referred to as the independent mapping method below. However, we use examples to demonstrate why this kind of method is less promising. In Table 1, the first two rows are results obtained by mapping our pre-trained English embedding (trained with fastText) to the (non-contextualized) BERT embedding. In this new unified space, we align words with the CSLS metric, and for each subword that appears in English Wikipedia, we seek its closest neighbor in the BERT vocabulary. Ideally, each word should find itself if it exists in the BERT vocabulary. However, this is not always the case. For example, although "however" exists in the BERT vocabulary, independent mapping fails to find it as its own closest neighbor; instead, it incorrectly maps it to the irrelevant Chinese words "盘" ("plate") and "龙" ("dragon"). A similar phenomenon is observed for Chinese: "册" ("volume") is incorrectly aligned to the Chinese words "书" ("book") and "卷" ("scroll"). Furthermore, our POS tagging experiments (Section 3.3) show that joint mapping $M_J$ does not improve (and can even hurt) the performance of the multilingual BERT. Therefore, we use mixture mapping $M_M$ to address and discuss OOV issues in the remaining sections.

Monolingual Sequence Labeling Tasks

POS Tagging: We evaluate the expanded model on the Universal Dependencies v1.2 dataset (Table 2), keeping the 16 languages for which reliable embedding alignments are available (Conneau et al., 2018). We use the original multilingual BERT (without a CRF (Lafferty et al., 2001) on top of it for sequence labeling) to tune hyperparameters on the dev set, and use the fixed hyperparameters for the expanded multilingual model; we do not tune the parameters for each model separately. As shown in Table 2, at both the word and the subword level, the OOV rate in this dataset is quite high. Mixture mapping improves the accuracy on 10 out of 16 languages, leading to a 1.97% absolute gain on average. We discuss the influence of alignments in Section 3.6.
Chinese NER: We are also interested in investigating the performance gap between the expanded multilingual model and a monolingual BERT that is pre-trained on a large-scale monolingual corpus. Currently, pre-trained monolingual BERT models are available in English and Chinese. As English has been used as the interlingua, we compare the expanded multilingual BERT and the Chinese BERT on a Chinese NER task, evaluated on the Weibo NER dataset constructed from social media by Peng and Dredze (2015). In the training set, the token-level OOV rate is 2.17%, and the subword-level OOV rate is 0.54%. We tune the hyperparameters of each model based on the development set separately and then use the best hyperparameters of each model for evaluation on the test set.
As shown in Table 3, the expanded model outperforms the multilingual BERT on the Weibo NER dataset, boosting the F1 score from 59.0% to 61.4%. Compared to the Chinese BERT (66.9%), a noticeable performance gap remains. One possible reason could be the grammatical differences between Chinese and English. As BERT uses a language-model loss for pre-training, the pre-trained Chinese BERT may better capture language-specific information compared to the multilingual BERT.

Code-Mixed Sequence Labeling Tasks
As the multilingual BERT is pre-trained over 102 languages, it should be able to handle code-mixed texts. Here we examine its performance and the effectiveness of the expanded model in mixed-language scenarios, using two tasks as case studies.

Code-Switch Challenge: We first evaluate on the CALCS-2018 code-switched task (Aguilar et al., 2018), which contains two NER tracks on Twitter social data: mixed English and Spanish (en-es), and mixed Modern Standard Arabic and Egyptian (ar-eg). Compared to traditional NER datasets constructed from news, this dataset contains a significant portion of uncommon tokens like hashtags and abbreviations, making it quite challenging. For example, in the en-es track, the token-level OOV rate is 44.6% and the subword-level OOV rate is 3.1%; in the ar-eg track, the token-level OOV rate is 64.0% and the subword-level OOV rate is 6.0%. As shown in Table 4, on ar-eg we boost the F1 score from 74.7% to 77.3%. However, we do not see similar gains on the en-es dataset, probably because English and Spanish share a large number of subwords, and adding too many new subwords might prevent the model from utilizing the well pre-trained subword embeddings. See Section 3.6 for more discussion.

Machine Translation Quality Estimation: All previous experiments are based on well-curated data. Here we evaluate the expanded model on a language generation task, where the generated sentences are sometimes ill-formed.
We choose the automatic machine translation quality estimation task and use Task 2 (word-level quality estimation) of WMT18 (Bojar et al., 2018). Given a source sentence and its translation (i.e., the target), this task aims to estimate the translation quality ("BAD" or "OK") at each position: each token in the source and target sentences, and each gap in the target sentence. We use the English-to-German (en-de) SMT translation data. On all three categories, the expanded model consistently outperforms the original multilingual BERT (Table 5).

Sequence Classification Tasks
Finally, we evaluate the expanded model on sequence classification in a mixed-code setting, where results are less sensitive to unseen words.

Code-Mixed Machine Reading Comprehension: We consider a mixed-language machine reading comprehension task. Since no such publicly available dataset exists, we construct a new Chinese-English code-mixed machine reading comprehension dataset based on 37,436 unduplicated utterances obtained from the transcriptions of King-ASR-065-1, a Chinese and English mixed speech recognition corpus. We generate a multiple-choice machine reading comprehension problem (i.e., a question and four answer options) for each utterance. A question is an utterance with an English text span removed (we randomly pick one if there are multiple English spans), and the correct answer option is the removed English span. Distractors (i.e., wrong answer options) come from the top three closest English text spans appearing in the corpus, based on the cosine similarity of word embeddings trained on the same corpus. For example, given the question "突然听到 21 ，那强劲的鼓点，那一张张脸。" ("Suddenly I heard 21 , and the powerful drum beats reminded me of the players.") and four answer options {"forever", "guns", "jay", "twins"}, the task is to select the correct answer option "guns" ("21 Guns" is a song by the American rock band Green Day). We split the dataset into training, development, and test sets of size 36,636, 400, and 400, respectively.
Annotators manually clean the development and test sets and improve their quality by generating more confusing distractors, guaranteeing that these problems are error-free and challenging.
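The distractor construction above reduces to a nearest-neighbor search over span vectors. A hedged sketch, assuming each span is embedded by some fixed composition of word embeddings trained on the same corpus (the composition function and all names are our assumptions):

```python
import numpy as np

def build_distractors(answer, candidates, embed, k=3):
    """Return the k candidate spans most similar to the answer span by
    cosine similarity; `embed` maps a span to a fixed-size vector."""
    a = embed(answer)
    a = a / np.linalg.norm(a)
    scored = []
    for span in candidates:
        if span == answer:
            continue
        v = embed(span)
        scored.append((float(a @ (v / np.linalg.norm(v))), span))
    scored.sort(reverse=True)
    return [span for _, span in scored[:k]]
```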
In this experiment, for each BERT model, we follow its default hyperparameters. As shown in Table 6, the expanded model improves over the multilingual BERT (38.1%) by 1.2% in accuracy. Human performance (81.4%) indicates that this is not an easy task even for human readers.

Discussions
In this section, we first briefly investigate whether the performance boost indeed comes from the reduction of OOV, and then discuss the strengths and weaknesses of the methods we investigate. First, we argue that it is essential to alleviate the OOV issue in multilingual settings. Taking the POS tagging task as an example, we find that most errors occur at OOV positions (Table 7 in Section 3.3). With the original BERT, the accuracy on OOV words is much lower than that on non-OOV words, and with the expanded BERT we significantly boost the accuracy on OOV words. All these results indicate that the overall improvement mostly comes from the reduction of OOV.
We also observe that the following factors may influence the performance of the expanded model.

Subwords: When expanding the vocabulary, it is critical to add only frequent subwords. Currently, we add all unseen subwords from the 50k vocabulary (Section 3.1), which may not be an optimal choice. Adding too many subwords may prevent the model from utilizing the information in the pre-trained subword embeddings of BERT, especially when there is low word-level overlap between the training and test sets.

Language: We also find that languages can influence the performance of vocabulary expansion through two aspects: the alignment accuracy and the closeness between a language and English. For languages closely related to English, such as French and Dutch, it is relatively easy to align their embeddings to English, as most subword units are shared (Søgaard et al., 2018; Conneau et al., 2018). In such cases, the BERT embedding already contains sufficient information, and adding additional subwords may therefore hurt performance. On the other hand, for a more distant language such as Polish (Slavic family), which shares some subwords with English (Germanic family), adding subwords to BERT brings performance improvements. At the same time, as Slavic and Germanic are two subdivisions of the Indo-European languages, we find that the embedding alignment methods perform reasonably well. For these languages, vocabulary expansion is usually more effective, as indicated by the POS tagging accuracies for Polish, Portuguese, and Slovenian (Table 2). For more distant languages like Arabic (Semitic family) that use a different character set, adding additional subwords is necessary. However, as the grammar of such a language is very different from that of English, accurately aligning the embeddings becomes the main bottleneck.
Task: We see more significant performance gains on NER, POS tagging, and MT quality estimation, possibly because token-level understanding is more critical for these tasks, and therefore alleviating the OOV issue helps more. In comparison, for sequence-level classification tasks such as machine reading comprehension (Section 3.5), the OOV issue is less severe since the result is based on the entire sentence.

Related Work
OOV poses challenges for many tasks (Pinter et al., 2017) such as machine translation (Razmara et al., 2013; Sennrich et al., 2016) and sentiment analysis (Kaewpitakkun et al., 2014). Even for tasks such as machine reading comprehension that are less sensitive to the meaning of each individual word, OOV still hurts performance (Chu et al., 2017; Zhang et al., 2018). We now discuss previous methods in two settings.

Multilingual Setting
Addressing OOV problems in a multilingual setting is relatively under-explored, probably because most multilingual models use separate vocabularies (Jaffe, 2017; Platanios et al., 2018). While there is no direct precedent, previous work shows that incorporating multilingual contexts can improve monolingual word embeddings (Zou et al., 2013; Andrew et al., 2013; Faruqui and Dyer, 2014; Lu et al., 2015; Ruder et al., 2017).
Madhyastha and España-Bonet (2017) increase the vocabulary size for statistical machine translation (SMT). Given an OOV source word, they generate a translation list in the target language and integrate this list into the SMT system. Although they also generate translation lists (similar to ours), their approach still operates in a monolingual setting with SMT. Cotterell and Heigold (2017) train char-level taggers to predict morphological tags for high- and low-resource languages jointly, alleviating OOV problems to some extent. In contrast, we focus on dealing with the OOV issue at the subword level in the context of a pre-trained BERT model.

Conclusion
We investigated two methods (i.e., joint mapping and mixture mapping) inspired by monolingual solutions to alleviate the OOV issue in multilingual settings. Experimental results on several benchmarks demonstrate the effectiveness of mixture mapping and the usefulness of bilingual information. To the best of our knowledge, this is the first work to address and discuss OOV issues at the subword level in multilingual settings. Future work includes: investigating other embedding alignment methods such as Gromov-Wasserstein alignment (Alvarez-Melis and Jaakkola, 2018) on more languages, and investigating approaches to dynamically choose the subwords to be added.

Table 1: Alignment from the independent mapping method.

Table 3: Performance of various models on the test set of Weibo NER. BERT_zh: Chinese BERT pre-trained on Chinese Wikipedia. We use the conlleval script for NER evaluation.

Table 4: Performance (%) on the code-switch challenge. The top two rows are based on the test set, and the bottom three rows are based on the development set.

Table 5: WMT18 Quality Estimation Task 2 for the en→de SMT dataset. ♣: result from Specia et al. (2018). MT: machine translation, i.e., target sentence; SRC: source sentence. F1-OK: F1 score for the "OK" class; F1-BAD: F1 score for the "BAD" class; F1-multi: product of F1-OK and F1-BAD.

Table 7: POS tagging accuracy (%) for OOV and non-OOV tokens on the Universal Dependencies v1.2 dataset, where OOV/non-OOV is defined at the word level with the original BERT vocabulary.