A Novel Beam Search to Improve Neural Machine Translation for English-Chinese

Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation that overcomes many weaknesses of conventional phrase-based translation systems. Although NMT-based systems have gained popularity in commercial translation applications, there is still plenty of room for improvement. As the most popular search algorithm in NMT, beam search is vital to the translation result. However, traditional beam search can produce duplicate or missing translations because of its target sequence selection strategy. To alleviate this problem, this paper proposes improvements to neural machine translation based on a novel beam search evaluation function. We also use reinforcement learning to train a translation evaluation system that selects better candidate words for generating translations. We conducted extensive experiments to evaluate our methods on the CASIA corpus and the 1,000,000-pair bilingual corpus of NiuTrans. The experimental results show that the proposed methods can effectively improve English-to-Chinese translation quality.


Introduction
Natural language processing is a comprehensive interdisciplinary subject integrating linguistics, mathematics, computer science and cognitive science. Machine translation is a flagship of the recent successes and advances in the field of natural language processing. Its practical applications and its use as a testbed for sequence transduction algorithms have spurred renewed interest in this topic. Neural machine translation (NMT) has achieved state-of-the-art translation performance [Kalchbrenner and Blunsom (2013); Sutskever, Vinyals and Le (2014); Bahdanau, Cho and Bengio (2014)] in the last several years. Furthermore, many recent studies have shown that neural networks [Zhang, Jin, Sun et al. (2018); Zhang, Wang, Lu et al. (2019)] can be successfully applied to many tasks in natural language processing (NLP). These include, but are not limited to, rule-based machine translation, memory-based translation, machine translation by the analogy principle, language modeling, paraphrase detection and word embedding extraction [Mikolov, Corrado, Chen et al. (2013)]. In the field of statistical machine translation (SMT), deep neural networks [Zhang, Xie, Sun et al. (2020)] have begun to show promising results: Schwenk et al. [Schwenk, Udani, Gupta et al. (2012)] summarize the successful application of feedforward neural networks in the framework of phrase-based SMT systems. In 2016, Google replaced SMT with a neural network for translating some languages, which brought neural machine translation from academia to industry. Their solution requires no change in the model architecture from their base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language.
Based on a review of existing translation model architectures and their advantages and disadvantages, we conclude that:
• Although many translation tools are available today, these machine translation systems usually do not produce satisfactory results.
• While recent advances have reported near human-level performance on several language pairs using neural approaches, these approaches still have shortcomings. For example, the translation may duplicate content, and some of the source statement may be missing.
To solve these problems, we add two penalties to beam search so that it selects better translations; they are based on duplicate detection and length ratio, respectively. Furthermore, we use reinforcement learning to train a translation evaluation system to select better candidate words for the translation result. In summary, our contributions are:
1. An innovative beam search evaluation function, improved with duplicate detection and length ratio penalties, is used to select better translations.
2. A novel reinforcement learning based method is used to build a selection model for translations. This method enables the translation evaluation system to select better candidate words for generating translations.
The remainder of this paper is organized as follows. Section 2 presents related work on neural machine translation and beam search. Section 3 describes the proposed method in detail. Section 4 gives a detailed description of the experiments and analyzes the experimental results. Section 5 draws conclusions.

Related work
In this section, we discuss and compare common model architectures of machine translation systems and the application of language models in machine translation, focusing on heuristic search algorithms for translation results.

Neural machine translation
A neural machine translation system is usually implemented as an encoder-decoder network with recurrent neural networks. The encoder is a bidirectional neural network with gated recurrent units that reads an input sequence $x = (x_1, \dots, x_m)$ and calculates a forward sequence of hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_m)$ and a backward sequence $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_m)$. The hidden states $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ are concatenated to obtain the annotation vector $h_j$. The decoder is a recurrent neural network that predicts a target sequence $y = (y_1, \dots, y_n)$. Fig. 1 shows the structure of a typical English-Chinese translation system, which consists of an encoder-decoder network. The input is an English sentence, which is encoded into a vector $z = (z_1, \dots, z_m)$ by the encoder, and the vector $z$ is then decoded into a Chinese sentence by the decoder. End-to-end learning for machine translation was first proposed by Chrisman [Chrisman (1991)]. Kalchbrenner and Blunsom [Kalchbrenner and Blunsom (2013)] mapped input sentences into vectors of fixed dimension for the first time. Based on this, Cho et al. [Cho, Merrienboer, Gülçehre et al. (2014)] further proposed the RNN encoder-decoder model, which improved the performance of the machine translation system. Later, Sutskever et al. [Sutskever, Vinyals and Le (2014)] proposed a new end-to-end sequence learning method using a model that combines two LSTM (long short-term memory) networks, in which the decoder uses beam search to find the best translation result. Based on Graves's LSTM equations, their model is composed of two different LSTM networks, one of which takes the reversed input sentence, to produce a "minimum time delay".

Unlike traditional feature-based approaches and long short-term memory network based models, Zeng et al. [Zeng, Dai, Li et al. (2019)] combine the strengths of linguistic resources and a gating mechanism to propose an effective convolutional neural network based model for aspect-based sentiment analysis. Bahdanau et al. [Bahdanau, Cho and Bengio (2014)] conjecture that, in the encoder-decoder architecture, the use of a fixed-length vector is a bottleneck in improving performance, and propose to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts into hard segments explicitly. The attention mechanism was improved by Luong et al. [Luong, Pham, Manning et al. (2015)], who examine two effective kinds of attention mechanism. The former approach is similar to Bahdanau's method but simpler, and the latter can be seen as an interesting mixture of hard attention and soft attention. Similar methods can also be applied to other tasks such as question classification: Liu et al. [Liu, Yang, Lv et al. (2019)] proposed an attention-based encoder-decoder model that can extract the features of Chinese questions effectively. Furthermore, Kaiser et al. [Kaiser, Gomez, Chollet et al. (2017)] applied depthwise separable convolutions [Chollet (2017)] to neural machine translation. Lample et al. [Lample, Ott, Conneau et al. (2018)] propose two model variants, a neural and a phrase-based model. Similarly, Generative Adversarial Networks (GANs) have been used for distantly supervised relation extraction by Zeng et al. [Zeng, Dai, Li et al. (2018)]. Moryossef et al. [Moryossef, Aharoni, Goldberg et al. (2019)] propose a black-box approach for injecting missing information into a pre-trained neural machine translation system, which allows control over the morphological variations in the generated translations without changing the underlying model or training data.

Beam search
Here we briefly review neural text generation and then review existing beam search algorithms. Beam search is a widely used approximate search strategy for neural network decoders, and it generally outperforms simple greedy decoding on tasks like machine translation. Beam search is an optimization of best-first search that reduces its memory requirements. Assume the input is embedded into a vector $x$, from which we generate the output sentence $y^*$, which is the best completed hypothesis:

$$y^* = \operatorname*{argmax}_{y:\,\mathrm{comp}(y)} \prod_{i \le |y|} p(y_i \mid x, y_{<i}) \quad (1)$$

where $y_{<i}$ is a popular shorthand notation for the prefix $y_0, y_1, \dots, y_{i-1}$. We say that a hypothesis $y$ is completed, notated as $\mathrm{comp}(y)$, if its last word is $\langle/s\rangle$, i.e.,

$$\mathrm{comp}(y) = \left(y_{|y|} = \langle/s\rangle\right) \quad (2)$$

in which case it will not be further expanded. Wiseman and Rush [Wiseman and Rush (2016)] introduce a model and beam-search training scheme that extends seq2seq to learn global sequence scores. Furthermore, Google's Neural Machine Translation system (GNMT) [Wu, Schuster, Chen et al. (2016)] uses a beam search technique that employs a length-normalization procedure and a coverage penalty, which encourages the generation of an output sentence that is most likely to cover all the words in the source sentence. Sennrich et al. [Sennrich, Haddow and Birch (2016)] introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. Furthermore, part-of-speech (POS) features are used to improve the language model used in translation. Liu et al. [Liu, Lin, Ren et al. (2018)] proposed a negative sampling algorithm based on POS tagging, which can optimize the negative sampling process and improve the quality of the final language model. In view of the long execution time and low efficiency of support vector machines on large-scale training samples, Chen et al. [Chen, Xiong, Xu et al. (2019)] proposed an online incremental and decremental learning algorithm based on the variable support vector machine (VSVM). Williams and Aleksic [Williams and Aleksic (2017)] present a technique that adapts the beam search algorithm to preserve hypotheses when they may benefit from rescoring. This technique makes it feasible to use one base language model and still achieve high-accuracy speech recognition results in all contexts.
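As a minimal sketch of the beam search procedure reviewed above, the following Python code keeps the `beam_size` best partial hypotheses at each step and stops expanding a hypothesis once it ends with the end-of-sentence token. The names `beam_search`, `step_fn` and `toy_model` are hypothetical and introduced only for illustration; `step_fn` stands in for a trained decoder that returns next-word probabilities.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # end-of-sentence token (assumed spelling)


def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                beam_size: int = 3,
                max_len: int = 10) -> Tuple[List[str], float]:
    """Return the best hypothesis found and its log-probability."""
    beams = [(["<s>"], 0.0)]   # (prefix, cumulative log-probability)
    completed = []             # hypotheses that already end with EOS
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, prob in step_fn(prefix).items():
                candidates.append((prefix + [word], score + math.log(prob)))
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp, score in candidates[:beam_size]:
            if hyp[-1] == EOS:
                completed.append((hyp, score))  # completed: not expanded further
            else:
                beams.append((hyp, score))
        if not beams:
            break
    pool = completed or beams
    return max(pool, key=lambda c: c[1])


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-word distribution that depends on the last word only."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"c": 0.5, EOS: 0.5},
        "b": {EOS: 1.0},
        "c": {EOS: 1.0},
    }
    return table[prefix[-1]]


best, log_prob = beam_search(toy_model, beam_size=2)
greedy, greedy_log_prob = beam_search(toy_model, beam_size=1)
```

With this toy distribution, greedy decoding (`beam_size=1`) commits to the locally best first word "a", while a beam of width 2 finds a hypothesis with a higher overall probability.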

Method
In this section, aiming at two problems that often occur in NMT, we introduce in detail two corresponding methods for improving the sequence-to-sequence (seq2seq) machine translation model. First, we propose a neural machine translation improvement based on beam search. Second, we use reinforcement learning to train a translation evaluation system that selects better candidate words for the model.

Improved beam search with duplicate detection and length ratio control

Issues in traditional beam search algorithm
There are two problems that often appear in neural machine translation:
1. The traditional machine translation model always selects the word with the highest probability at the current time step, which is not necessarily the best choice for the whole translation. As a result, some fragments of the source statement may be translated twice or more.
2. Beam search prefers short results over long ones, so the translation model may miss part of the source statement.
Both problems are shown in the following example:
Source statement: There are people living in squatter huts, temporary housing areas, "cage" apartments or private rental housing for which they pay painfully expensive rents; people whose names are on the public housing Waiting List; and people who want to buy Home Ownership Scheme flats but who are disappointed every time because they fail to get drawn in the balloting. These kinds of people, too, are likely to ask: How much of the money in the Budget will be spent on improving my life?

Translation generated by the translation model: 他们在寮屋、临屋、笼屋或私人租住公屋单位、临屋或私人租住公屋单位，他们有多少时间可以用来改善我的生活？
Both types of problems appear in the text generated by the translation model. "临屋" and "或私人租住公屋单位" have been translated twice, and a section in the middle of the reference translation, "正在轮候公屋，或每次在居屋抽签中都不能中签，失望而回的居民同样会问，今年的财政预算案中", is completely omitted. These situations decrease the quality and readability of the generated translation. As described above, beam search prefers short results over long ones, so the appearance of duplicate translation increases the possibility of missing words. We therefore add two penalty terms to the beam search evaluation function: a penalty term for duplicate translations to alleviate the first problem, and a penalty term based on the length ratio to alleviate the second.

Beam search with duplicate detection and length ratio control
The original evaluation function of the beam search is defined in terms of P(Y|X), the probability that Y appears given X. In this paper, we modify the beam search evaluation function by adding a penalty term d(x) based on duplicate detection and a penalty term l(x) based on the length ratio.

For the problem of duplicate translation, we propose the penalty d(x), which is based on duplicate detection. We obtain the similarity by comparing a range of translation fragments. The specific comparison method is shown in Fig. 2.

Figure 2: The selection of the translation fragment that has been generated

We compare adjacent fragments of different lengths one by one. That is, as shown in Fig. 2, the black part is compared with the dark gray part to find identical words. If the same item appears in both translation fragments, we multiply the number of identical words by the corresponding penalty factor. Finally, we use a weighted summation to obtain the penalty for duplicate translation. The penalty d(x) is given by Eq. (5), where c is the index of the current translated word, δ is the range of duplicate detection, and ε is the coefficient of the penalty term. It can be seen from Eq. (5) that the closer the duplicate is, the larger the corresponding coefficient, and hence the larger the corresponding penalty. In Eq. (5), the distance between two words in the translation ranges from 1 to δ. For each distance, we count the number of times that two words at this distance are equal and multiply the count by the penalty coefficient. Finally, we add the weighted terms to the evaluation function of the beam search.
To avoid missing words in translation, we use the current candidate word, the cumulative distribution function $F(x)$ of the length ratio, and the penalty factor θ as the input of the algorithm to obtain the penalty for the length ratio. To obtain a proper range of sentence length ratios, we first count the sentence lengths in the 1,000,000 pairs of the NiuTrans parallel corpus [NiuTrans (2011)]. Second, we divide the length of each target language sentence (Chinese) by the length of the corresponding source language sentence (English), which yields a series of ratios. We then remove data with obvious errors, for example ratios that are too small or too large. After excluding these extreme data, we use linear regression to fit the cumulative distribution function of the remaining data and obtain $F(x)$, the probability that the translation has ended:

$$F(x) = P(X < x) \quad (6)$$

Here $x$ is the ratio of the length of the target language sentence to the length of the source language sentence. When EOS (the end-of-sentence mark) and an ordinary word appear simultaneously among the candidate words of the beam search, $F(x)$ and $1 - F(x)$ are added to their evaluation functions, respectively. $F(x)$ is the probability that the translation has ended, and $1 - F(x)$ is the probability that the translation has not ended.
In Eq. (7), when the candidate word is EOS, the probability that the translation has ended is multiplied by the penalty factor θ to form the penalty term. When the candidate word is not EOS, the probability that the translation has not ended is multiplied by the penalty factor θ to form the penalty term. The penalty terms produce different scores for different values of l(x), so beam search can make better choices.

Algorithm description
Algorithm I: Duplicate detection penalty (accumulates the ε-weighted fragment scores into Scores and returns Scores)

Based on the construction principle of the duplicate detection penalty term described in Section 3.1.3, the computation process is given in Algorithm I. We take the candidate words of the entire beam search and the parameters δ and ε as the input to the algorithm. After dividing the translation into multiple fragments of different sizes for comparison, we calculate the penalty terms separately and finally sum them with weights.
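The duplicate detection penalty described above can be sketched as follows. This is a hedged illustration, not the paper's Algorithm I: the function name `duplicate_penalty` is hypothetical, and the `epsilon / n` weighting is an assumption chosen only so that closer duplicates receive a larger coefficient, as the text requires.

```python
from typing import List


def duplicate_penalty(words: List[str], delta: int, epsilon: float) -> float:
    """Weighted count of words repeated in adjacent fragments.

    For every fragment length n in 1..delta, the last n words (ending at the
    current candidate) are compared position by position with the n words
    immediately before them. Identical positions are counted and weighted.
    The epsilon / n weight is an assumption, not the paper's exact formula.
    """
    penalty = 0.0
    for n in range(1, delta + 1):
        if len(words) < 2 * n:
            break  # not enough history for two adjacent fragments of length n
        recent = words[-n:]
        previous = words[-2 * n:-n]
        matches = sum(1 for a, b in zip(recent, previous) if a == b)
        penalty += (epsilon / n) * matches
    return penalty


# A negative epsilon lowers the score of hypotheses that repeat themselves.
repeated = duplicate_penalty(["a", "b", "a", "b"], delta=2, epsilon=-1.0)
clean = duplicate_penalty(["a", "b", "c", "d"], delta=2, epsilon=-1.0)
```

In a beam search, the returned value would be added to the score of each candidate hypothesis, so a repeated fragment such as "a b a b" is ranked below an otherwise equally probable hypothesis without repetition.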

Algorithm II: Length ratio control penalty

Based on the construction principle of the length ratio control penalty term described in Section 3.1.3, the computation process is given in Algorithm II. We take the current candidate word, the cumulative distribution function value calculated from the current length, and the parameter θ as the input to the computation. First, we use a vector operation to determine whether each candidate word is EOS: the value is 1 if it is EOS and 0 otherwise. Then, by element-wise (dot) multiplication, we obtain the value of l(x) in the equation.
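A minimal sketch of the length ratio penalty follows, assuming the cumulative distribution function takes the sigmoid form 1 / (1 + e^(ax + b)) with the fitted values a = -5.0875 and b = 6.7187 reported in the experiments. The function names and the exact way the indicator is combined are assumptions for illustration; the paper's Algorithm II operates on vectors of candidates, while this sketch handles one candidate at a time.

```python
import math

# Fitted coefficients reported in the experiments of this paper.
A, B = -5.0875, 6.7187


def ended_probability(ratio: float, a: float = A, b: float = B) -> float:
    """Probability that the translation has ended at this length ratio."""
    return 1.0 / (1.0 + math.exp(a * ratio + b))


def length_ratio_penalty(is_eos: bool, target_len: int, source_len: int,
                         theta: float) -> float:
    """Penalty term l(x) for one candidate word (hypothetical helper).

    The 0/1 indicator selects the ended probability for EOS candidates and
    the not-ended probability for ordinary words, then scales by theta.
    """
    ratio = target_len / source_len
    ended = ended_probability(ratio)
    indicator = 1.0 if is_eos else 0.0
    return theta * (indicator * ended + (1.0 - indicator) * (1.0 - ended))


# Early in the translation (short target), EOS receives almost no bonus.
eos_early = length_ratio_penalty(True, target_len=5, source_len=10, theta=0.2)
word_early = length_ratio_penalty(False, target_len=5, source_len=10, theta=0.2)
```

With a positive θ, an EOS candidate that appears while the target is still short relative to the source gains far less than an ordinary word, which discourages ending the translation too early.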

Reinforcement learning based translation evaluation
The quality of translations generated by machine translation is generally evaluated by the BLEU score, which measures the similarity between the generated translation and the target translation. Because this evaluation has to wait until the translation is fully generated, it has some limitations; for example, it cannot improve accuracy during the translation process. If the BLEU score could be used while beam search is generating the translation, better candidate words could be selected, and the quality of the generated translation would improve accordingly.
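To make the BLEU comparison concrete, here is a simplified sentence-level BLEU with a single reference, clipped n-gram precision and a brevity penalty. It is a sketch for illustration only, without smoothing, and is not the evaluation script used in the experiments; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: List[str], reference: List[str],
                  max_n: int = 4) -> float:
    """Simplified BLEU: geometric mean of clipped precisions times brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
perfect = sentence_bleu(reference, reference)
repetitive = sentence_bleu(["the"] * 5, reference, max_n=1)
```

Clipping is what makes the score sensitive to duplicate translation: a candidate that repeats "the" five times only gets credit for the two occurrences present in the reference.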

Model description
To improve the translation evaluation system so that it chooses better candidate words, this paper uses a reinforcement learning based method to train an evaluation system. This method makes use of maximizing long-term cumulative rewards to improve the generated text during target generation. We first take the text generated by the model and the corresponding target translations as the corpus, and then feed the corpus to the Q-learning algorithm. The Q-learning algorithm evaluates the original strategy and executes the ε-greedy strategy, and a Q table is used to record the states in the algorithm. First, we postulate that each sentence begins with the Start of Sentence (SOS) symbol, which is the initial state of each sentence. Then, the first word after the initial state is taken as a selectable action. After an action is selected, the current state switches to the selected action (the selected word), and each sentence in the corpus represents one choice of route, as shown in Fig. 3. The reward r is given by the summation in Eq. (8).

Figure 3: State transition diagram
The reward score in the reward matrix R is defined as follows: a route generated by the translation model (the generated translation) receives a negative score, and the target translation receives a positive score. Since sentences differ in length, the reward for each word is the sentence score divided by the length of the sentence. After defining the reward matrix R, the Q table can be obtained by the Q-learning algorithm. The values of the Q table are updated as follows:

$$Q(s, a) = R(s, a) + \gamma \max_{\tilde{a}} Q(\tilde{s}, \tilde{a}) \quad (9)$$

where $s$ and $a$ represent the current state and action, $\tilde{s}$ and $\tilde{a}$ represent the next state and action, and $\gamma$ is the learning parameter, which takes a value in [0, 1]. Through Eq. (9), we can obtain the score for each route, that is, the score for each sentence.
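The reward construction and Q table update can be sketched as below, assuming the standard tabular update Q(s, a) = R(s, a) + γ · max Q(s', a'), where s' is the state reached by taking action a. The names `build_rewards` and `q_learning` are hypothetical, and the sketch replaces ε-greedy exploration with deterministic sweeps over all recorded transitions to keep the example short.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SOS = "<s>"  # start-of-sentence symbol, the initial state of every route


def build_rewards(routes: List[Tuple[List[str], float]]
                  ) -> Dict[Tuple[str, str], float]:
    """Spread each sentence score evenly over its word transitions.

    Target translations carry a positive sentence score and model outputs
    a negative one; each word receives score / sentence length.
    """
    rewards: Dict[Tuple[str, str], float] = defaultdict(float)
    for words, sentence_score in routes:
        per_word = sentence_score / len(words)
        state = SOS
        for word in words:
            rewards[(state, word)] += per_word
            state = word  # the chosen word becomes the next state
    return dict(rewards)


def q_learning(rewards: Dict[Tuple[str, str], float],
               gamma: float = 0.8,
               sweeps: int = 50) -> Dict[Tuple[str, str], float]:
    """Repeatedly apply the tabular update over every known transition."""
    actions = defaultdict(list)
    for state, action in rewards:
        actions[state].append(action)
    q = {key: 0.0 for key in rewards}
    for _ in range(sweeps):
        for (state, action), reward in rewards.items():
            # The next state is the word just chosen.
            next_values = [q[(action, nxt)] for nxt in actions.get(action, [])]
            q[(state, action)] = reward + gamma * max(next_values, default=0.0)
    return q


# One target translation (positive score) and one model output (negative score).
routes = [(["good", "day"], 1.0), (["good", "good"], -1.0)]
q_table = q_learning(build_rewards(routes))
```

After training, the transition that follows the target translation ("good" then "day") has a higher Q value than the repeated word produced by the model, so a decoder consulting the table would prefer it.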

Data set
The data sets used in this paper are mainly Chinese-English bilingual parallel corpora downloaded from the NiuTrans website. The corpus is divided into two files, one in Chinese and one in English, each containing 1,000,000 lines. The lines of the two files correspond one-to-one and are translations of each other. The CASIA corpus [Zhou, Li, Yin et al. (2010)] is also used in the experiments in this paper.

Experimental results on the first penalty term
We first train on the CASIA corpus for 24 epochs with δ set to 20, and compare the BLEU scores at the 24th epoch for models with ε in the intervals [-2, 2] and [-1, 1]. The scores are shown in Tab. 1. When ε=-0.7, the BLEU score is the highest, 0.1 higher than that of the model with ε=0 (the original model). We use the same coefficient on the NiuTrans corpus, again training the model for 24 epochs. The BLEU score of the model also improves, which supports the effectiveness of the method.

Experimental results on the second penalty term
First, we take the 1,000,000 sentence pairs in the NiuTrans parallel corpus. Second, for each pair, we compute the ratio of the length of the target language sentence to the length of the corresponding source sentence. Finally, we use the ratios in the interval [0, 1] to fit the distribution line chart. The results are shown in Fig. 7. By narrowing the logarithmic interval, it becomes clear that the ratios mainly fall in [0.2, 2.6]. Then, by plotting the cumulative distribution function of the data, we obtain the blue line in Fig. 8.
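The ratio statistics described above can be sketched as follows. The helper names are hypothetical, the default cut-offs 0.2 and 2.6 are taken from the interval mentioned in the text, and the way sentences are tokenized (here, pre-split token lists) is an assumption.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def length_ratios(pairs: Sequence[Tuple[List[str], List[str]]],
                  low: float = 0.2, high: float = 2.6) -> List[float]:
    """Sorted target/source length ratios with extreme values removed."""
    ratios = []
    for source, target in pairs:
        if not source:
            continue  # skip empty source sentences to avoid division by zero
        ratio = len(target) / len(source)
        if low <= ratio <= high:
            ratios.append(ratio)
    return sorted(ratios)


def empirical_cdf(sorted_ratios: List[float], x: float) -> float:
    """Fraction of ratios strictly below x, an estimate of P(X < x)."""
    return bisect_left(sorted_ratios, x) / len(sorted_ratios)


# Toy (source, target) token lists; the last pair is an obvious outlier.
pairs = [
    (["a", "b"], ["x"]),
    (["a", "b"], ["x", "y"]),
    (["a", "b"], ["x", "y", "z"]),
    (["a"], ["x"] * 10),
]
ratios = length_ratios(pairs)
```

Evaluating `empirical_cdf` on a grid of ratio values yields the points of the cumulative curve that the sigmoid is later fitted to.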

Figure 8: The cumulative curve and the fitted curve of the ratio of sentence pairs in the corpus

From the blue curve in Fig. 8, we can see that the ratio of most sentence pairs in the corpus lies in (0.5, 2.5). The shape of this curve is close to a sigmoid, so we used a linear regression algorithm to fit it.
We first let $y$ denote the empirical cumulative value at ratio $x$. Because the shape of the curve is similar to a sigmoid, we fit the model $\hat{y} = \frac{1}{1 + e^{ax + b}}$. We then define the error as the normalized squared difference $\frac{1}{n}\lvert y - \hat{y}\rvert^{2}$ and use gradient descent to find its minimum. This gives the fitted formula (corresponding to the red curve in Fig. 8):

$$\hat{y} = \frac{1}{1 + e^{ax + b}}$$

where a=-5.0875, b=6.7187, and x is the ratio of the length of the target language sentence to the length of the source language sentence. By substituting the fitted cumulative distribution function into the formula for l(x), we obtain a penalty term for the length ratio, which is used to correct the evaluation function of the beam search. We take different values of θ, use the NiuTrans corpus, and train for 24 epochs. It can be seen from the results that when θ=0.2, the BLEU score is the highest, 0.11 higher than that of the model with θ=0 (the original model). We use the same coefficient on the CASIA corpus, again training the model for 24 epochs. After the improvement, the model's BLEU score is also higher than the original model's score, which supports the effectiveness of the method.
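The fitting step can be sketched with plain gradient descent on the mean squared error of a sigmoid of the form 1 / (1 + e^(ax + b)). This is an illustrative sketch under assumptions: the learning rate, step count, zero initialization and function names are not taken from the paper, and the data below are synthetic points generated from the reported coefficients rather than the corpus statistics.

```python
import math
from typing import List, Tuple


def predict(x: float, a: float, b: float) -> float:
    """Sigmoid model of the cumulative curve."""
    return 1.0 / (1.0 + math.exp(a * x + b))


def mean_squared_error(xs: List[float], ys: List[float],
                       a: float, b: float) -> float:
    return sum((predict(x, a, b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_sigmoid(xs: List[float], ys: List[float],
                lr: float = 0.5, steps: int = 5000) -> Tuple[float, float]:
    """Fit a and b by batch gradient descent (hyperparameters are assumptions)."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = predict(x, a, b)
            # Chain rule: d p / d(ax + b) = -p * (1 - p)
            common = 2.0 * (p - y) * (-p * (1.0 - p)) / n
            grad_a += common * x
            grad_b += common
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Synthetic cumulative values on the ratio range [0.2, 2.6].
xs = [0.2 + 0.1 * i for i in range(25)]
ys = [predict(x, -5.0875, 6.7187) for x in xs]
a_fit, b_fit = fit_sigmoid(xs, ys)
```

Because a and b are strongly correlated on this range, plain gradient descent approaches the generating values slowly; the sketch is meant to show the procedure, not to reproduce the reported coefficients exactly.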

Conclusion
Traditional beam search always selects the word with the highest probability at the current time step, which is not necessarily the best choice for the whole translation and results in duplicate translation. In addition, traditional beam search prefers short results over long ones and thus may miss part of the source statement. In this paper, we proposed an innovative beam search evaluation function with two penalties, based on duplicate detection and length ratio, to address the duplicate and missing translation issues. Furthermore, we use a novel translation evaluation system based on reinforcement learning to select better candidate words for generating translations. We conducted extensive experiments to evaluate our methods on the CASIA corpus and the 1,000,000-pair bilingual corpus of NiuTrans. The experimental results show that our two methods can effectively improve the quality of the translation.