On the scalability of data augmentation techniques for low-resource machine translation between Chinese and Vietnamese

ABSTRACT Neural Machine Translation (NMT) has consistently been shown to be a standard choice for building translation systems, in both academia and industry. For low-resource language pairs, data augmentation techniques have been widely used to tackle the data shortage problem in NMT. In this paper, we investigate the scaling behaviour of a transformer-based NMT model as the amount of synthetic data increases. Through experiments conducted on the Chinese-to-Vietnamese translation task, we aim to provide a guideline for applying several methods, such as back-translation, tagged back-translation, self-training and sentence concatenation, to a low-resource, less-related language pair. Our results suggest that choosing the appropriate amount of synthetic data is a crucial task when building NMT systems. In addition, when combining methods, it is recommended to tag the data sources before training.


Introduction
Machine Translation (MT) is a natural language processing task that consists of automatically translating a text written in a source language into a desired target language. There are two main approaches to the MT task: the traditional statistical approach and, more recently, the neural approach. The statistical approach generates a translated document based on statistical models obtained through the analysis of multilingual databases, while the neural approach relies on artificial neural networks to learn linguistic rules.
Despite its success, building a good NMT system is known to require a large parallel corpus, which is not always available for low-resource languages like Albanian or Romanian, or even for pairs of widely used languages such as German-Russian. To tackle this issue, many data augmentation approaches have been proposed in recent years. Some techniques aim to generate synthetic parallel corpora to be mixed with the authentic dataset during training. Others manipulate the original data to add noise during training.
This paper investigates recent advances in data augmentation on the Chinese-Vietnamese Neural Machine Translation task. Through these experiments, the efficacy and scaling behaviour of data augmentation methods are analysed, under a fairly low-resource setting and a modern network architecture. The conclusions drawn from this paper can provide recommendations for future research on the choice of different data augmentation methods and on the ratio of each source of synthetic data. Finally, our experiments show that finely tagging each data source is a simple and efficient way to increase translation quality.
The remainder of this paper is organized as follows. Section 2 introduces the Neural Machine Translation task. Following that, Section 3 presents methods for Data Augmentation in NMT and discusses scaling problems of these methods. Section 4 highlights the Chinese to Vietnamese translation problem, the languages relatedness and the previous work on this task. Section 5 gives details on the experimental setup and Section 6 shows the results of our experiments. Finally, Section 7 discusses and concludes the paper.

Neural machine translation
Neural Machine Translation is a typical Sequence to Sequence model, which translates an input sequence x = {x_1, …, x_s} into a target sequence y = {y_1, …, y_t}, where the lengths of the two sequences may differ. The most common approach to building an NMT system is the encoder-decoder architecture. The encoder is a neural network that converts the input sequence into a set of vectors. The decoder then converts this set of vectors into the translated sequence, generating one word at a time.
At each step, an NMT model produces a target word y_i by estimating its probability given the input sequence x = {x_1, …, x_s} and the previously generated words {y_1, …, y_{i−1}}, that is, P(y_i | y_1, …, y_{i−1}, x; θ). Regardless of the model architecture, an NMT system is generally trained to find a set of parameters θ that maximizes the conditional log-likelihood of the sentence-aligned bilingual data.
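Written out, the per-token probability and the training objective described above take the following form, where D denotes the sentence-aligned bilingual training data (notation reconstructed from the surrounding text):

```latex
P(y \mid x; \theta) = \prod_{i=1}^{t} P(y_i \mid y_1, \ldots, y_{i-1}, x; \theta)
\qquad
\hat{\theta} = \operatorname*{arg\,max}_{\theta} \sum_{(x, y) \in D} \log P(y \mid x; \theta)
```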
Various network architectures can be used as NMT encoders and decoders. Recently, transformer models (Vaswani et al., 2017) have become the de facto standard choice for NMT systems, in both academia and industry. This type of model has consistently outperformed other architectures such as GRU (Cho et al., 2014b), LSTM (Hochreiter & Schmidhuber, 1997) and CNN (Gehring, Auli, Grangier, Yarats, & Dauphin, 2017) in most high-resource and low-resource settings (Yang et al., 2020). The transformer model is based on multi-headed self-attention layers which, at each step, compute attention scores that determine which tokens in the input sequence to focus on. The original transformer architecture is described in Figure 1.
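To make the attention computation concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention; the multi-head variant of the transformer runs several such heads in parallel over learned linear projections. This is an illustrative sketch, not the implementation used in our experiments.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value vector by the softmax-normalized similarity
    between its key and the query (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (t_q, t_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# 3 query positions attending over 4 key/value positions, dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (3, 8): one context vector per query position
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```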
The training data of an NMT system is typically a parallel corpus, which consists of aligned sentences in the source and target languages. State-of-the-art models on high-resource language pairs such as English-French or English-German are normally trained with millions of sentences (Raffel et al., 2020). This amount of data hardly exists for most language pairs, and generating such a dataset is highly time-consuming. On the other hand, publicly available monolingual data is easy to collect and often comes in large quantities. Therefore, various studies have been conducted to leverage this massive amount of monolingual data to improve translation quality. In the next section, we present several data augmentation techniques that are currently widely used in academia and industry.

Data augmentation for NMT
Since the quality of an NMT system depends heavily on the size of the parallel corpus, many data augmentation techniques have been studied. In this work, we experiment with three of the most commonly used techniques: back-translation, self-training and sentence concatenation. This section presents these methods and discusses the scaling problem of data augmentation techniques, which motivates our work.
To perform these techniques, one needs an original bilingual corpus, usually called the authentic data D_a. In addition, some methods utilize source-side and target-side monolingual datasets (D_s and D_t respectively). The data generated by data augmentation techniques is commonly referred to as synthetic data.

Back-translation
Back-translation (Sennrich, Haddow, & Birch, 2016a) is currently used in many state-of-the-art NMT systems. It has been shown to outperform other techniques in high-resource settings, while also improving translation quality in low-resource language pairs. Back-translation involves training a backward translation model (i.e. from the target language to the source language) on the original parallel data D_a. This model is then used to generate synthetic data by translating the target-side monolingual corpus D_t into the source language. The synthetic sentence pairs are then mixed with the authentic data to train the final model (cf. Figure 2).
The quality of a model trained with back-translation depends on various factors. These include the architecture of the backward model, the quality of the target-side monolingual data, and the ratio between the back-translated data and the original data.
To further help the model differentiate between original and back-translated data, a simple yet efficient method called tagged back-translation (Caswell, Chelba, & Grangier, 2019) has been proposed. By simply prepending a tag (e.g. <bt>) to each back-translated sentence so that the model can distinguish between data sources, the approach leads to more efficient use of synthetic data. Tagged BT has consistently been shown to outperform standard BT in various settings.
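The tagging step itself is trivial to implement. The sketch below uses hypothetical sentence pairs; the `<bt>` tag and the mixing with authentic data follow the description above.

```python
def tag_back_translated(synthetic_pairs, tag="<bt>"):
    """Prepend a tag to the source side of each back-translated pair,
    so the model can tell synthetic sources from authentic ones."""
    return [(f"{tag} {src}", tgt) for src, tgt in synthetic_pairs]

# Hypothetical (Chinese source, Vietnamese target) pairs for illustration
authentic = [("你好", "Xin chào")]
back_translated = [("早上好", "Chào buổi sáng")]  # source side came from a backward model

training_data = authentic + tag_back_translated(back_translated)
print(training_data[1][0])  # "<bt> 早上好"
```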

Self-training
Another widely used data augmentation method is self-training (Zhang & Zong, 2016). Generally, a self-training algorithm performs three steps. First, the authentic data is used to train a baseline forward MT model. This baseline model then translates the monolingual corpus D_s from the source side into the target side, resulting in a synthetic bilingual corpus. As with back-translation, the synthetic and authentic data are then mixed during the training of the final model (cf. Figure 3).
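The data-generation steps of self-training can be sketched as follows. Here `dummy_forward_translate` is a stand-in stub for the trained baseline forward model (step 1), used only so that the example is self-contained; in practice it would be a real NMT model.

```python
def self_training_data(authentic_pairs, source_monolingual, forward_translate):
    """Steps 2-3 of self-training: translate source-side monolingual text
    with the baseline forward model, then mix with the authentic data."""
    synthetic = [(src, forward_translate(src)) for src in source_monolingual]
    return authentic_pairs + synthetic

# Stub standing in for the baseline forward model, for illustration only
def dummy_forward_translate(src):
    return f"vi({src})"

mixed = self_training_data([("你好", "Xin chào")], ["谢谢"], dummy_forward_translate)
print(len(mixed))  # 2: one authentic pair plus one synthetic pair
```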
Self-training enhances source-side diversity, while back-translation does the same on the target side. In general, human evaluators prefer a system that produces fluent translated sentences, so back-translation is commonly chosen over self-training. In this paper, we experiment with both methods at different ratios, as well as with their combination. A tagged version is also studied to show whether our model learns better once the data sources are separated.

Sentence concatenation
Recently, a simple yet effective technique named sentence concatenation (Kondo, Hotate, Kaneko, & Komachi, 2021) has been introduced. The proposed technique is to concatenate sentences during training. The original training set is first shuffled to avoid context overlapping. Each sentence is then concatenated with the following sentence, with a <sep> token separating them, to form the training set (cf. Figure 4). In the original paper, the authors reported that this technique alone improves the model only by a minor factor. However, when combined with BT (sentences generated with BT are also concatenated), the performance gain is more significant. All these experiments were reported using the same dataset size of 400k sentences (for each of the original, BT and concatenated data).
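One reading of the procedure can be sketched as follows; since the exact pairing scheme is not detailed here, this sketch concatenates disjoint consecutive pairs after shuffling.

```python
import random

def concatenate_pairs(pairs, sep="<sep>", seed=0):
    """Shuffle the corpus to break up document context, then join
    consecutive sentence pairs on both sides with a separator token."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    concatenated = []
    # pair each sentence with the one that follows it (disjoint pairs)
    for (s1, t1), (s2, t2) in zip(shuffled[::2], shuffled[1::2]):
        concatenated.append((f"{s1} {sep} {s2}", f"{t1} {sep} {t2}"))
    return concatenated

pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
augmented = concatenate_pairs(pairs)
print(len(augmented))  # 2 concatenated pairs built from 4 originals
```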

The scaling problem of data augmentation methods
All the above methods aim to make use of available, large-scale monolingual datasets to tackle the data shortage problem in NMT. Various aspects of these methods have been studied, such as the sampling problem (Fadaee & Monz, 2018) or the iterative back-translation problem (Hoang, Koehn, Haffari, & Cohn, 2018). However, when applying data augmentation methods, the size of the synthetic data is frequently chosen by the authors without explicit explanation. One work similar to ours, which studied the scaling behaviour of back-translation, is by Poncelas, Shterionov, Way, de Buy Wenniger, and Passban (2018). It investigates the ratio between authentic and synthetic data in high-resource settings (1M authentic sentences). Under these settings, models trained with back-translated data consistently outperform the baseline model (the authors experimented with up to 2.5M synthetic sentences). However, since back-translated data is generated by a backward NMT model and contains more mistakes than authentic data, the authors concluded that adding more and more back-translated data might, beyond a certain point, reduce model quality, forcing it to unlearn correct parameters learned from the authentic data.
Even though the conclusions drawn by Poncelas et al. (2018) are interesting, it is unclear how back-translation scales in lower-resource settings. Furthermore, the scalability of other methods such as self-training, tagged BT and sentence concatenation has, as far as we know, not been studied. This paper explores model performance against an increasing ratio of synthetic data, for each method individually. When combining all the above methods, we show that the best-performing model can be achieved by tagging each source of data separately.
To conclude the discussion of related work, the next section reviews existing studies on the Chinese-Vietnamese machine translation task. It covers the differences and similarities between the two languages and the efforts that have addressed the translation problem, using both SMT and NMT.

The Chinese-Vietnamese machine translation task
China and Vietnam have had a deeply intertwined relationship as long-time neighbours. Given the geographic and historical relations between the two countries, Chinese culture and characters have had a large impact on the Vietnamese language. After a long and complex period of development, Vietnamese has gradually grown beyond Chinese influences and demonstrated clear distinctiveness. Nonetheless, the Chinese languages (Mandarin and Cantonese) still exert a powerful influence on Vietnamese.
A clear difference between Vietnamese and Chinese is that the former is based on the Latin alphabet complemented by diacritics, while the latter is a system of logographic characters. This means that there are white spaces to separate words in Vietnamese; in contrast, written Chinese does not have spaces between the characters in a text. Additionally, as opposed to the Latin alphabet-based Vietnamese, many Chinese characters were created based on the images and meanings of words (e.g. the word 'a tree'/'wood' is written as '木', which depicts a tree). The two languages also belong to different language families: Vietnamese is an Austroasiatic language, whereas Chinese is a Sino-Tibetan language. Some clear differences in grammar can also be found between the languages. One example is the organization of the noun phrase. In Vietnamese, the head noun is the main element, and the adjectives and other nouns complementing it come after it, whereas in Chinese complementary nouns or adjectives come before the main noun.
However, there are also similarities between the two, most notably in the vocabulary. As Vietnamese has been heavily influenced by written and spoken Chinese, Sino-Vietnamese vocabulary (borrowed literary Chinese words with Vietnamese-based pronunciations) is an important part of the Vietnamese language. These words are mostly used in technical and formal speech and writing, accounting for 70% of a typical Vietnamese sentence or paragraph. In addition, in terms of phonology, both languages are tonal (6 tones in Vietnamese, 4 tones in Mandarin and 6 tones in Cantonese), which might pose challenges for machine translation. As Vietnam has been under heavy influence from China and Chinese literature since ancient times, many Chinese words were adopted into Vietnamese; the reverse direction of influence, however, hardly exists due to the asymmetrical power dynamics of the two countries.
Despite the close historical and economic relationship, there is not much research on the Chinese-to-Vietnamese translation task, and the amount of available data is still low. Most of the existing research is based on SMT (Tran, Dinh, & Nguyen, 2016) or focuses on other problems such as incorporating named entities (Tran, Dinh, Le, & Nguyen, 2013; Tran, Dinh, & Tran, 2014). Some works investigate back-translation for both SMT and NMT, but under lower-resource settings and with less recent models (Li, Sha, & Shi, 2020). Our work, which is based on a recently proposed transformer model and uses a larger dataset, is therefore hardly comparable with these existing ones.

Dataset preparation
No dataset for the Chinese-Vietnamese translation task has been published to date. Hence, we prepared a dataset by compiling bilingual articles from various sources such as news, Wikipedia and TED talks. The articles were partitioned into sentences, which were then manually aligned. The final authentic bilingual corpus has 192k sentences and is split into three parts: the training set (180k sentences), the validation set (6k sentences) and the test set (6k sentences).
To generate synthetic data, we crawled a Chinese monolingual corpus, mostly consisting of news articles. The Vietnamese monolingual data was taken from the mTet dataset (Ngo et al., 2022). This dataset has 3.3M sentences and is in line with the domain of our authentic data. We filtered out Vietnamese sentences that already existed in the authentic data.

Model configuration
In our experiments, we use the Transformer model (Vaswani et al., 2017), which has been widely adopted in NMT. Since its introduction, this model has outperformed existing architectures for the NMT task and is currently the state-of-the-art model for various language pairs. In addition, we learned Byte Pair Encoding (BPE) (Sennrich, Haddow, & Birch, 2016b) on the training set and applied it to the training, validation and test sets. Beam search (Tillmann & Ney, 2003) is used for decoding translated sentences. We use the standard configuration of the Transformer; the hyperparameters are listed in Table 1.
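As background on the BPE step, the following is a minimal sketch of the merge-learning loop from Sennrich et al. (2016b); in practice one would use the authors' released implementation or a library such as SentencePiece rather than this toy version.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Minimal BPE: repeatedly merge the most frequent adjacent
    symbol pair in the corpus vocabulary (Sennrich et al., 2016b)."""
    # represent each word as space-separated character symbols
    vocab = Counter(" ".join(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)           # most frequent pair
        merges.append(best)
        # apply the merge everywhere: "l o" -> "lo"
        vocab = Counter({w.replace(" ".join(best), "".join(best)): f
                         for w, f in vocab.items()})
    return merges

print(learn_bpe(["low", "lower", "lowest"], 2))  # [('l', 'o'), ('lo', 'w')]
```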
To evaluate the performance of the resulting models, the BLEU score is measured using SacreBLEU (Post, 2018).

Empirical results
To determine how data augmentation methods scale in low-resource translation settings, we conduct a series of experiments. In the first experiment, we scale the ratio between back-translated data (both the tagged and non-tagged versions), self-training data and authentic data. Table 2 shows the BLEU scores of models trained with the different data.
As expected, standard back-translation reaches its peak performance quickly, at a quite low ratio of 75k back-translated sentences to 180k original sentences. Beyond that point, model performance gradually decreases. From 125k synthetic sentences onwards (a ratio of 1.44:1 between the authentic and synthetic data respectively), model performance drops below the baseline.
As for tagged back-translation, since the input data is tagged so that the model can easily distinguish between authentic and synthetic data, the results suggest that a much larger amount of synthetic data can be used. The translation quality at the peak is better than with the standard version. Experiments suggest that one can use the same amount of synthetic data and still get a model that outperforms the baseline model.
Self-training seems to be the least efficient of the three data augmentation methods. Performance improves slightly at 50k self-training sentences but drops heavily afterwards. This suggests that one should prefer tagged back-translation over self-training and standard back-translation. The amount of tagged BT data used can be quite high: in our experiments, the best BLEU score is achieved by the model trained with 125k tagged BT sentences and 180k original sentences.
In the second experiment, we train our models using sentence concatenation. As proposed in the original paper (Kondo et al., 2021), sentence concatenation should be combined with BT to achieve good performance. Therefore, we experiment with different sizes of BT and tagged BT data, varying from 0 (sentence concatenation only) to 200k sentences. The results are shown in Table 3. First, we confirm the conclusion of Kondo et al. (2021) that sentence concatenation alone introduces only a slight improvement (+0.1 BLEU); it is best to combine this technique with back-translation. We observe that, with or without sentence concatenation, standard back-translation follows the same scaling behaviour, with peak performance at 75k back-translated sentences. On the other hand, tagged BT combined with sentence concatenation achieves the best performance of 26.7 BLEU (+2.2 BLEU over the baseline model) at 150k sentences.
Since the tagged version helps the model distinguish between sources of data, we took a step further and tested a model trained with back-translation, sentence concatenation and self-training data combined (150k back-translated sentences and 50k self-training sentences, plus concatenation) (cf. Table 4). The version with no tagging achieved a BLEU score of 26.3. If we tag all synthetic sentences with one simple tag (here we use <da>, which stands for data augmentation), the BLEU score rises to 26.8. Finally, fine-grained tagging (using a separate tag for each data augmentation method) results in a model with 27.0 BLEU, the best model achieved in our experiments.

Conclusion
This paper investigated how a transformer-based translation model performs when trained with augmented Chinese-to-Vietnamese data. It provides a guide on how to apply these methods to a low-resource, less related language pair. For each method, our results show that when the ratio of synthetic to authentic data becomes too high, model performance may drop below that of the baseline model trained without synthetic data. Therefore, choosing the right amount of synthetic data is a crucial part of building a good NMT system. The experiments also suggest that, among the evaluated options, combining all three data augmentation methods, with an appropriate size for each, yields the best translation quality. When implementing the combined version, we also conclude that it is best to finely tag the data sources before training. Future directions include investigating more augmentation methods, transfer learning, and experiments with different model sizes. A complete study could serve as a full guideline for choosing between methods and for deciding what amount of synthetic data to use for each model configuration.

Disclosure statement
No potential conflict of interest was reported by the author(s).