Hybrid Neural Network Alignment and Lexicon Model in Direct HMM for Statistical Machine Translation

Recently, neural machine translation systems have shown promising performance and surpassed phrase-based systems on most translation tasks. Revisiting conventional machine translation concepts while utilizing effective neural models is vital for understanding the leap accomplished by neural machine translation over phrase-based methods. This work proposes a direct HMM with neural network-based lexicon and alignment models, which are trained jointly using the Baum-Welch algorithm. The direct HMM is applied to rerank the n-best lists created by a state-of-the-art phrase-based translation system, and it provides improvements of up to 1.0% BLEU on two different translation tasks.


Introduction
The hidden Markov model (HMM) was first introduced to statistical machine translation to address the word alignment problem (Vogel et al., 1996). The HMM-based approach was then widely used along with the IBM models (Brown et al., 1993) for aligning source and target words. In the conventional approach, Bayes' theorem is used and the HMM is applied to the inverse translation model:
$$\Pr(e_1^I \mid f_1^J) \propto \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I), \qquad \Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I)$$
In this case, as a part of a noisy channel model, the marginalisation becomes intractable for every e.
This work proposes a novel concept focusing on a direct HMM for $\Pr(e_1^I \mid f_1^J)$, in which the alignment direction is from target to source positions. This specific property allows us to introduce dependencies into the translation model that take the full source sentence into account. This aspect will be important for the future decoder to be developed. The lexicon and alignment probabilities in the HMM are modeled using feed-forward neural networks (FFNN), and they are trained jointly. The trained HMM is then applied for reranking the n-best lists created by a state-of-the-art open source phrase-based translation system. The experiments are conducted on the IWSLT 2016 German→English and BOLT Chinese→English translation tasks. The FFNN-based hybrid HMM provides improvements of up to 1.0% BLEU.

Related Work
In order to discuss related work, we consider the following two key concepts that are essential for the work presented here:

• Neural lexicon and alignment models
The idea of using neural networks for lexicon modeling is not new (Schwenk, 2012; Sundermeyer et al., 2014; Devlin et al., 2014). Apart from differences in the neural network architecture, the important difference to this work is that those approaches did not include the concepts of HMM models and end-to-end training. In addition to neural lexicon modeling, Alkhouli et al. (2016) also applied a neural network for alignment modeling like this work, but their training procedure was based on the maximum approximation and on predefined GIZA++ (Och and Ney, 2003) alignments.
There were other studies that focused on feature-rich alignment models (Blunsom and Cohn, 2006; Berg-Kirkpatrick et al., 2010), but those studies did not use a neural network to automatically learn features (as we do in this work). Yang et al. (2013) used neural network-based lexicon and alignment models inside the HMM alignment model, but they model alignments using a simple distortion model that has no dependence on lexical context. Their goal was to improve the alignment quality in the context of a phrase-based translation system. However, apart from , no results on translation were reported.
The idea of using neural networks is the basis of the state-of-the-art attention-based approach to machine translation (Bahdanau et al., 2015; Luong et al., 2015). However, that approach is not based on the principle of an explicit and separate lexicon model.

• End-to-end training
The HMM in combination with the neural translation model lends itself to what is usually called end-to-end training. The training criterion is the logarithm of the target sentence posterior probability. This criterion results in a specific training algorithm that can be interpreted as a combination of the forward-backward algorithm (as in EM-style training of HMMs) and backpropagation. To the best of our knowledge, this end-to-end training has not been considered before for machine translation. In the context of signal processing and recognition, the connectionist temporal classification (CTC) approach (Graves et al., 2006) leads to a similar training procedure. Tran et al. (2016) studied neural networks for unsupervised training on a part-of-speech tagging task. In their approach, the training criterion results in a combination of the EM framework and backpropagation, which has a certain similarity to the training algorithm for translation presented in this work.

Definition of neural network-based HMM
Similar to the hidden alignments $a_j = j \to i$ between the source string $f_1^J = f_1 \ldots f_j \ldots f_J$ and the target string $e_1^I = e_1 \ldots e_i \ldots e_I$ in the conventional HMM, we define the alignments in the direct HMM as $b_i = i \to j$. The model is then defined by marginalising over all alignment sequences $b_1^I$, with a lexicon probability and an alignment probability at each target position.
Our feed-forward alignment model has the same architecture (Figure 1) as the one proposed in (Alkhouli et al., 2016). The alignment probability is modeled over the jump $b_i - b_{i-1}$ from the predecessor position to the current position. Thus, the jump over the source is estimated based on an m-word source context window and n predecessor target words.
Figure 1: A feed-forward alignment neural network with 3 target history words, a 5-gram source window, a projection layer, 2 non-linear hidden layers and a small output layer to predict jumps.
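A minimal sketch of how this definition could be written out, under stated assumptions: a first-order dependence on the predecessor alignment, a history of $n$ target words, and a hypothetical shorthand $w_m(j)$ for the $m$-word source window centred on source position $j$. The symbol $\Delta_i$ and the exact conditioning are illustrative and may differ from the authors' formulas.

```latex
\begin{align}
\Pr(e_1^I \mid f_1^J)
  &= \sum_{b_1^I} \Pr(e_1^I, b_1^I \mid f_1^J) \\
  &= \sum_{b_1^I} \prod_{i=1}^{I}
     p\bigl(e_i \mid b_i, e_{i-n}^{i-1}, f_1^J\bigr) \cdot
     p\bigl(b_i \mid b_{i-1}, e_{i-n}^{i-1}, f_1^J\bigr), \\
p\bigl(b_i \mid b_{i-1}, e_{i-n}^{i-1}, f_1^J\bigr)
  &= p\bigl(\Delta_i \mid e_{i-n}^{i-1}, w_m(b_{i-1})\bigr),
  \qquad \Delta_i = b_i - b_{i-1}.
\end{align}
```

In this form the first factor is the lexicon model and the second is the alignment model, whose output layer ranges over jumps rather than absolute source positions.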
For the lexicon model, we assume a similar dependence as in the alignment model with a shift, namely on the source words within a window centred on the aligned source word and n predecessor target words. To overcome the high cost of the softmax function for large vocabularies, we adopt a class-factored output layer consisting of a class layer and a word layer (Goodman, 2001; Morin and Bengio, 2005). The model in this case factors the word probability into a class probability and a word probability given the class, where c denotes a word mapping that assigns each target word to a single class, and the number of classes is chosen to be much smaller than the vocabulary size. The lexicon model architecture is shown in Figure 2.
Figure 2: A feed-forward lexicon neural network with the same structure as the alignment model, except a class-factored output layer.
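The class factorisation described for the lexicon model can be sketched as follows. The shorthand $h_i$ for the network input and $w_m(j)$ for the $m$-word source window centred on position $j$ are hypothetical names introduced here for illustration; only the word mapping $c$ comes from the text.

```latex
\begin{align}
p(e_i \mid h_i) &= p\bigl(c(e_i) \mid h_i\bigr) \cdot p\bigl(e_i \mid c(e_i), h_i\bigr), \\
h_i &= \bigl(e_{i-n}^{i-1},\; w_m(b_i)\bigr).
\end{align}
```

Because each word belongs to exactly one class, the softmax is computed once over the classes and once over the words of a single class, instead of once over the full vocabulary.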

Training
The training data of the direct HMM are the source and target sequences, without any alignment information. In the training of the direct HMM with neural network-based models, the weights have to be updated along with the posterior probabilities calculated by the Baum-Welch algorithm. Similar to the training procedure used in (Berg-Kirkpatrick et al., 2010), we apply the EM algorithm and define an auxiliary function over the parameters $\theta = \{\alpha, \beta\}$, with $j' = b_{i-1}$ denoting the predecessor alignment. The parameters can then be separated for the lexicon model and the alignment model. From (9) and (10) we can observe that the marginalisation of hidden alignments ($\sum_j p_i(j \mid e_1^I, f_1^J, \theta)$) is the only difference compared to the derivative of neural network training based on word-aligned data. In this approach we iterate over all source positions, and a word alignment toolkit such as GIZA++ is not required. Furthermore, the word-aligned data generated e.g. by GIZA++ might contain unaligned and multiply aligned words, which make the data difficult to use for training neural networks. Thus heuristic-based approaches (Sundermeyer et al., 2014; Devlin et al., 2014) have to be used in order to guarantee one-to-one alignments, which may negatively influence the quality of the alignments. By contrast, the neural network-based HMM is not constrained by these heuristics. In addition, even though the training process of the direct HMM takes more time than neural network training on word-aligned data, we should note that generating the word-aligned data using GIZA++ is also a time-consuming process.
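The posterior probabilities used in this training can be illustrated with a small forward-backward routine. This is a sketch under assumptions, not the authors' implementation: the names `init`, `trans` and `lex` are hypothetical, the model probabilities are passed in as plain tables instead of being computed by neural networks, and no scaling is applied, so it is only suitable for short sentences.

```python
def forward_backward(init, trans, lex):
    """Alignment posteriors for one sentence pair (illustrative sketch).

    init[j]         : probability that the first target word aligns to source j
    trans[i][jp][j] : p(b_i = j | b_{i-1} = jp); trans[0] is unused
    lex[i][j]       : p(e_i | b_i = j)
    Returns (z, gamma, xi) where z is the sentence likelihood,
    gamma[i][j] = p_i(j | e, f) and xi[i][jp][j] = p_i(jp, j | e, f).
    """
    I, J = len(lex), len(init)

    # Forward pass: alpha[i][j] = p(e_1..e_i, b_i = j | f)
    alpha = [[0.0] * J for _ in range(I)]
    for j in range(J):
        alpha[0][j] = init[j] * lex[0][j]
    for i in range(1, I):
        for j in range(J):
            s = sum(alpha[i - 1][jp] * trans[i][jp][j] for jp in range(J))
            alpha[i][j] = s * lex[i][j]

    # Backward pass: beta[i][j] = p(e_{i+1}..e_I | b_i = j, f)
    beta = [[1.0] * J for _ in range(I)]
    for i in range(I - 2, -1, -1):
        for jp in range(J):
            beta[i][jp] = sum(
                trans[i + 1][jp][j] * lex[i + 1][j] * beta[i + 1][j]
                for j in range(J)
            )

    z = sum(alpha[I - 1])  # marginal over all alignment sequences

    gamma = [[alpha[i][j] * beta[i][j] / z for j in range(J)] for i in range(I)]
    xi = [None] + [
        [
            [alpha[i - 1][jp] * trans[i][jp][j] * lex[i][j] * beta[i][j] / z
             for j in range(J)]
            for jp in range(J)
        ]
        for i in range(1, I)
    ]
    return z, gamma, xi
```

In an EM-style update, `gamma` would weight the gradient of the lexicon model's log-probability at each source position and `xi` would do the same for the alignment model, which is where the sum over source positions replaces a single fixed alignment point.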
In general, our training procedure can be summarized as follows:
1. One iteration of IBM-1 model training to create a lexicon table for initializing the forward-backward table.
2. In the first epoch, for each sentence pair, calculate and save the entire table of posterior probabilities $p_i(b \mid e_1^I, f_1^J)$ (also $p_i(b', b \mid e_1^I, f_1^J)$ for the alignment model) using the forward-backward algorithm based on the results of the IBM-1 model.
In this work the IBM-1 initialization is required. We tried to train the neural network models from scratch, but the perplexity converged towards a bad local minimum and got stuck in it. We also attempted other heuristics for initialization, such as assigning probability 0.9 to diagonal alignments and spreading the remaining 0.1 evenly among the other source positions. The resulting perplexity is much higher than with IBM-1 initialization.
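The first step above, a single IBM-1 iteration, can be sketched as below. Several points are assumptions made for illustration: the table is estimated in the direction p(e | f), no NULL source word is used, and the function name `ibm1_one_iteration` is hypothetical. With a uniform starting table, the E-step posterior of each target word is simply uniform over the source positions of its sentence.

```python
from collections import defaultdict


def ibm1_one_iteration(corpus):
    """One EM iteration of IBM-1 from a uniform start (illustrative sketch).

    corpus: list of (source_tokens, target_tokens) pairs.
    Returns a dict mapping (target_word, source_word) to p(target | source).
    """
    counts = defaultdict(float)  # expected counts for (e, f)
    totals = defaultdict(float)  # expected counts for f, summed over e

    # E-step: with a uniform table, every source position receives the
    # same posterior mass 1 / len(src) for each target word.
    for src, tgt in corpus:
        share = 1.0 / len(src)
        for e in tgt:
            for f in src:
                counts[(e, f)] += share
                totals[f] += share

    # M-step: renormalise so that the probabilities sum to 1 for each f.
    return {(e, f): c / totals[f] for (e, f), c in counts.items()}


# Tiny usage example with made-up sentence pairs.
toy_corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
]
toy_table = ibm1_one_iteration(toy_corpus)
```

The resulting table can then stand in for the lexicon probabilities when the forward-backward posteriors are computed in the first epoch.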

Experimental Results
The experiments are conducted on the IWSLT 2016 German→English and BOLT Chinese→English translation tasks, which consist of 20M and 4M parallel sentence pairs respectively. The feed-forward neural network alignment and lexicon models are jointly trained on a subset of about 200K sentence pairs. As an initial study of this topic, our new model is only applied for reranking n-best lists created by a phrase-based decoder. The maximum size of the n-best lists is 500. The translation quality is evaluated by case-insensitive BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) metrics using MultEval. The scaling factors are tuned with MERT (Och, 2003) with BLEU as the optimization criterion on the development sets. For the translation experiments, the averaged scores are presented on the development set from three optimization runs.
Our direct HMM consists of a feed-forward neural network lexicon model with the following configuration:
• Five one-hot input vectors for source words and three for target words
• Projection layer size 100 for each word
• Two non-linear hidden layers with 1000 and 500 nodes respectively
• A class-factored output layer with 1000 singleton classes dedicated to the most frequent words, and 1000 classes shared among the rest of the words
and a feed-forward neural network alignment model with the same configuration as the lexicon model, except for a small output layer with 201 nodes, which reflects that the aligned position can jump within the range from −100 to 100 (Alkhouli et al., 2016).
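The two output layers in this configuration can be illustrated by the index mappings they imply. The sketch below is built on assumptions: clipping of jumps outside the range and the round-robin assignment of infrequent words to shared classes are placeholders chosen for illustration, since the text does not say how out-of-range jumps or shared classes are handled. Both function names are hypothetical.

```python
def jump_to_index(jump, max_jump=100):
    """Map a jump in [-max_jump, max_jump] to an output node in [0, 2 * max_jump].

    Jumps outside the range are clipped (an assumption of this sketch).
    """
    clipped = max(-max_jump, min(max_jump, jump))
    return clipped + max_jump


def build_class_map(words_by_frequency, n_singleton=1000, n_shared=1000):
    """Assign each word to one class for a class-factored output layer.

    The n_singleton most frequent words each get their own class. The
    remaining words are spread over n_shared classes round-robin, which is
    only a stand-in for whatever clustering is actually used.
    """
    class_of = {}
    for rank, word in enumerate(words_by_frequency):
        if rank < n_singleton:
            class_of[word] = rank
        else:
            class_of[word] = n_singleton + (rank - n_singleton) % n_shared
    return class_of
```

With the default `max_jump=100`, the alignment output layer has 201 nodes, matching the configuration above; with the default class sizes, the class layer of the lexicon model has 2000 nodes.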
We conducted experiments on the source and target window size of both network models. Larger source and target windows could not provide significant improvements on BLEU scores, at least for rescoring experiments.
The model is applied for reranking the n-best lists created by the Jane toolkit (Vilar et al., 2010; Wuebker et al., 2012) with a log-linear framework containing phrasal and lexical smoothing models for both directions, word and phrase penalties, a distance-based reordering model, enhanced low frequency features (Chen et al., 2011), a hierarchical reordering model (Galley and Manning, 2008), a word class language model (Wuebker et al., 2013) and an n-gram language model. The word alignments used for the training of phrase tables are generated by GIZA++, which performs the alignment training sequentially for IBM-1, HMM and IBM-4. More details about our phrase-based baseline system can be found in (Peter et al., 2015).
The experimental results are shown in Table 1. The rescoring experiments are conducted by adding the HMM probability as a feature and tuning with MERT. The applied attention-based neural network is a neural machine translation system similar to (Bahdanau et al., 2015). The decoder and encoder word embeddings are of size 620, and the encoder uses a bidirectional layer with 1000 LSTMs (Hochreiter and Schmidhuber, 1997) to encode the source side. A layer with 1000 LSTMs [...] (Sennrich et al., 2016). During training a batch size of 50 is used. More details about our neural machine translation system can be found in .
Table 1: Experimental results of rescoring using the neural network-based direct HMM. The model with sum denotes the system proposed in this work, while the model with Viterbi denotes the model with the same neural network structure, which was trained on word-aligned data (alignments generated by GIZA++) (Alkhouli et al., 2016). Improvements by systems marked with * are statistically significant at the 95% level with respect to the NN-based direct HMM (Viterbi) system, whereas † denotes improvements that are statistically significant at the 95% level with respect to the attention-based system in rescoring. ¹ was used in reranking the n-best lists, while ² denotes the stand-alone attention-based decoder.
With n-best rescoring, all neural network-based systems achieve significant improvements over the phrase-based system. The neural network-based HMMs provide promising performance, even with simple feed-forward neural networks. The direct HMM trained by the EM procedure, which marginalizes over the hidden alignments, outperformed the same model trained on the word-aligned data. For the rescoring tasks, it provides performance comparable to that of the attention-based network. The neural network-based HMM also helps the phrase-based system achieve results comparable to the stand-alone attention-based system on the German→English task.
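The rescoring setup used in these experiments, where the HMM probability is added as one more feature of the log-linear model, can be sketched as follows. The feature names, the weights and the function `rerank` are hypothetical; in the experiments the scaling factors come from MERT, which is not shown here.

```python
def rerank(nbest, weights):
    """Sort an n-best list by its log-linear score (illustrative sketch).

    nbest   : list of (hypothesis, features), where features maps a feature
              name to its value, e.g. a log-probability
    weights : dict mapping feature name to scaling factor; features without
              a weight are ignored
    """
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    return sorted(nbest, key=lambda entry: score(entry[1]), reverse=True)


# Made-up two-entry list: "baseline" stands for the combined phrase-based
# score and "hmm" for the log-probability assigned by the direct HMM.
toy_nbest = [
    ("hypothesis a", {"baseline": -1.0, "hmm": -5.0}),
    ("hypothesis b", {"baseline": -1.5, "hmm": -2.0}),
]
best_with_hmm = rerank(toy_nbest, {"baseline": 1.0, "hmm": 0.5})[0][0]
best_without_hmm = rerank(toy_nbest, {"baseline": 1.0})[0][0]
```

In this toy example the added feature changes which hypothesis is ranked first, which is the mechanism by which a rescoring model can improve on the first-pass decoder output.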

Conclusion and Future Work
This work aims to close the gap between conventional word alignment models and neural machine translation. The proposed direct HMM consists of neural network-based alignment and lexicon models, which are trained jointly and without any alignment information. Even with simple feed-forward neural network models, the HMM provides promising results and significantly improves a strong phrase-based translation system.
As future work, we are searching for alternatives to IBM-1 for initializing the training. We will investigate recurrent model structures, such as the LSTM representation for source and target word embeddings (Luong et al., 2015). In addition to the network structure, we will implement a stand-alone decoder based on this novel model. The first step would be to apply the maximum approximation for the search problem as elucidated in (Yu et al., 2017). Then we plan to investigate heuristics for marginalizing the hidden alignment during search.

Acknowledgments
The work reported in this paper results from two projects, SEQCLAS and QT21. SEQCLAS has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement no. 694537. QT21 has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 645452. The work reflects only the authors' views and neither the European Commission nor the European Research Council Executive Agency is responsible for any use that may be made of the information it contains.
Tamer Alkhouli was partly funded by the 2016 Google PhD Fellowship for North America, Europe and the Middle East.