Unsupervised Source Hierarchies for Low-Resource Neural Machine Translation

Incorporating source syntactic information into neural machine translation (NMT) has recently proven successful (Eriguchi et al., 2016; Luong et al., 2016). However, this is generally done using an outside parser to syntactically annotate the training data, making this technique difficult to use for languages or domains for which a reliable parser is not available. In this paper, we introduce an unsupervised tree-to-sequence (tree2seq) model for neural machine translation; this model is able to induce an unsupervised hierarchical structure on the source sentence based on the downstream task of neural machine translation. We adapt the Gumbel tree-LSTM of Choi et al. (2018) to NMT in order to create the encoder. We evaluate our model against sequential and supervised parsing baselines on three low- and medium-resource language pairs. For low-resource cases, the unsupervised tree2seq encoder significantly outperforms the baselines; no improvements are seen for medium-resource translation.


Introduction
Neural machine translation (NMT) is a widely used approach to machine translation that is often trained without outside linguistic information. In NMT, sentences are typically modeled using recurrent neural networks (RNNs), so they are represented in a continuous space, alleviating the sparsity issue that afflicted many previous machine translation approaches. As a result, NMT is state-of-the-art for many language pairs (Bentivogli et al., 2016; Toral and Sánchez-Cartagena, 2017). Despite these successes, there is room for improvement. RNN-based NMT is sequential, whereas natural language is hierarchical; thus, RNNs may not be the most appropriate models for language. In fact, these sequential models do not fully learn syntax (Bentivogli et al., 2016; Linzen et al., 2016; Shi et al., 2016). In addition, although NMT performs well on high-resource languages, it is less successful in low-resource scenarios (Koehn and Knowles, 2017).
As a solution to these challenges, researchers have incorporated syntax into NMT, particularly on the source side. Notably, Eriguchi et al. (2016) introduced a tree-to-sequence (tree2seq) NMT model in which the RNN encoder was augmented with a tree long short-term memory (LSTM) network (Tai et al., 2015). This and related techniques have yielded improvements in NMT; however, injecting source syntax into NMT requires parsing the training data with an external parser, and such parsers may be unavailable for low-resource languages. Adding syntactic source information may improve low-resource NMT, but we would need a way of doing so without an external parser.
We would like to mimic the improvements that come from adding source syntactic hierarchies to NMT without requiring syntactic annotations of the training data. Recently, there have been some proposals to induce unsupervised hierarchies based on semantic objectives for sentiment analysis and natural language inference (Choi et al., 2018; Yogatama et al., 2017). Here, we apply these hierarchical sentence representations to low-resource neural machine translation.
In this work, we adapt the Gumbel tree-LSTM of Choi et al. (2018) to low-resource NMT, allowing unsupervised hierarchies to be injected into the encoder. We compare this model to sequential neural machine translation, as well as to NMT enriched with information from an external parser.
Our proposed model yields significant improvements in very low-resource NMT without requiring outside data or parsers beyond what is used in standard NMT; in addition, this model is not significantly slower to train than RNN-based models.

Neural Machine Translation
Neural machine translation (Cho et al., 2014; Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) is an end-to-end neural approach to machine translation consisting of an encoder, a decoder, and an attention mechanism (Bahdanau et al., 2015). The encoder and decoder are usually LSTMs (Hochreiter and Schmidhuber, 1997) or gated recurrent units (GRUs) (Cho et al., 2014). The encoder reads in the source sentence and creates an embedding; the attention mechanism calculates a weighted combination of the words in the source sentence. This is then fed into the decoder, which uses the source representations to generate a translation in the target language.
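The weighted combination computed by the attention mechanism can be sketched as follows. This is a minimal illustration using dot-product scoring; Bahdanau et al. (2015) actually score each position with a small feed-forward network, and real systems operate on batched tensors.

```python
import numpy as np

def attention(decoder_state, encoder_states):
    # Score each source position against the decoder state (dot product),
    # softmax the scores, and return the weighted combination of source
    # representations (the context vector).
    scores = encoder_states @ decoder_state           # shape: (src_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over positions
    context = weights @ encoder_states                # shape: (hidden_size,)
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))   # 5 source positions, hidden size 8
dec = rng.normal(size=8)        # current decoder state
context, weights = attention(dec, enc)
```

The context vector is what the decoder consumes at each target timestep; the weights form a distribution over source positions.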

Unsupervised Tree-to-Sequence NMT
We modify the standard RNN-based neural machine translation architecture by combining a sequential LSTM decoder with an unsupervised tree-LSTM encoder. This encoder induces hierarchical structure on the source sentence without syntactic supervision. We refer to models containing this encoder as (unsupervised) tree2seq.
In this section, we present our unsupervised tree2seq model. Section 3.1 describes the subword-level representations, while section 3.2 explains how the Gumbel tree-LSTM is used to add hierarchies in the encoder. We address top-down representations of the phrase nodes in section 3.3 and explain the attention mechanism in section 3.4.

Word Node Representations
The hierarchical encoder consists of word nodes (nodes corresponding to the subwords of the source sentence) and phrase nodes (internal nodes resulting from the induced hierarchies).In order to obtain representations of the word nodes, we run a single-layer bidirectional LSTM over the source sentence; we refer to this LSTM as the leaf LSTM.

Phrase Node Representation
Our proposed unsupervised tree-LSTM encoder uses a Gumbel tree-LSTM (Choi et al., 2018) to obtain the phrase nodes of the source sentence. This encoder introduces unsupervised, discrete hierarchies without modifying the maximum likelihood objective used to train NMT by leveraging straight-through Gumbel softmax (Jang et al., 2017) to sample parsing decisions.
In a Gumbel tree-LSTM, the hidden state h_p and memory cell c_p for a given node are computed recursively based on the hidden states and memory cells of its left and right children (h_l, h_r, c_l, and c_r). This is done as in a standard binary tree-LSTM as follows:

[ i; f_l; f_r; o; g ] = [ σ; σ; σ; σ; tanh ] ( W [h_l; h_r] + b )   (1)
c_p = f_l ⊙ c_l + f_r ⊙ c_r + i ⊙ g   (2)
h_p = o ⊙ tanh(c_p)   (3)

where W is the weight matrix, b is the bias vector, σ is the sigmoid activation function, and ⊙ is the element-wise product. However, the Gumbel tree-LSTM differs from standard tree-LSTMs in that the selection of nodes to merge at each timestep is done in an unsupervised manner. At each timestep, each pair of adjacent nodes is considered for merging, and the hidden states h_i for each candidate parent representation are computed using equation 3. A composition query vector q, which is simply a vector of trainable weights, is used to obtain a score v_i for each candidate representation as follows:

v_i = exp(q · h_i) / Σ_j exp(q · h_j)   (4)

Finally, the straight-through Gumbel softmax estimator (Jang et al., 2017) is used to sample a parent from the candidates h_i based on these scores v_i; this allows us to sample a hard parent selection while still maintaining differentiability.
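The composition step and the straight-through selection can be sketched in a few lines. This is an illustrative numpy version, not the framework implementation: the weight packing in `compose` and the `gumbel_st_select` helper are our own naming, and a real training setup would back-propagate through the soft probabilities rather than the hard one-hot vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compose(h_l, c_l, h_r, c_r, W, b):
    # Binary tree-LSTM composition: merge two children into a candidate
    # parent (h_p, c_p). W has shape (5d, 2d) and b shape (5d,), packing
    # the i, f_l, f_r, o, and g pre-activations into one affine map.
    d = h_l.shape[0]
    a = W @ np.concatenate([h_l, h_r]) + b
    i   = sigmoid(a[0*d:1*d])
    f_l = sigmoid(a[1*d:2*d])
    f_r = sigmoid(a[2*d:3*d])
    o   = sigmoid(a[3*d:4*d])
    g   = np.tanh(a[4*d:5*d])
    c_p = f_l * c_l + f_r * c_r + i * g
    h_p = o * np.tanh(c_p)
    return h_p, c_p

def gumbel_st_select(scores, temperature=1.0, rng=None):
    # Straight-through Gumbel softmax: perturb scores with Gumbel noise,
    # take a hard one-hot argmax in the forward pass; the soft
    # probabilities would carry the gradient in a real framework.
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=scores.shape)))
    soft = np.exp((scores + g) / temperature)
    soft = soft / soft.sum()
    hard = np.zeros_like(soft)
    hard[soft.argmax()] = 1.0
    return hard, soft
```

At each timestep the encoder would call `compose` on every adjacent pair, score the candidates with the query vector q, and feed those scores to `gumbel_st_select` to pick the pair that is actually merged.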
This process continues until there is only one remaining node that summarizes the entire sentence; we refer to this as the root node. At inference time, straight-through Gumbel softmax is not used; instead, we greedily select the highest-scoring candidate. See Choi et al. (2018) for a more detailed description of Gumbel tree-LSTMs. Thus, this encoder induces a binary hierarchy over the source sentence. For a sentence of length n, there are n word nodes and n − 1 phrase nodes (including the root node). We initialize the decoder using the root node; attention to word and/or phrase nodes is described in section 3.4.
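The inference-time procedure reduces to a simple greedy loop. In this sketch, `score_fn` stands in for the composition-query score q · h_i; nodes are represented as nested tuples purely to make the induced bracketing visible.

```python
def greedy_parse(words, score_fn):
    # Greedily merge the highest-scoring adjacent pair until a single
    # root node remains, yielding a binary bracketing of the input.
    nodes = list(words)
    merges = 0
    while len(nodes) > 1:
        scores = [score_fn(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]
        k = max(range(len(scores)), key=scores.__getitem__)
        nodes[k:k + 2] = [(nodes[k], nodes[k + 1])]
        merges += 1
    return nodes[0], merges

# With a constant score, ties break leftward, giving a left-branching tree.
tree, merges = greedy_parse(["others", "have", "dismissed"], lambda l, r: 0.0)
```

Note that the loop always performs exactly n − 1 merges for n leaves, matching the n − 1 phrase nodes described above.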

Top-Down Encoder Pass
In the bottom-up tree-LSTM encoder described in the previous section, each node is able to incorporate local information from its respective children; however, no global information is used. Thus, we introduce a top-down pass, which allows the nodes to take global information about the tree into account. We refer to models containing this top-down pass as top-down tree2seq models. Note that such a top-down pass has been shown to aid in tree-based NMT with supervised syntactic information (Chen et al., 2017a; Yang et al., 2017); here, we add it to our unsupervised hierarchies.
Our top-down tree implementation is similar to the bidirectional tree-GRU of Kokkinos and Potamianos (2017). The top-down root node h↓_root is defined as follows:

h↓_root = h↑_root   (5)

where h↑_root is the hidden state of the bottom-up root node (calculated using the Gumbel tree-LSTM described in section 3.2).
For each remaining node, including word nodes, the top-down representation h↓_i is computed from its bottom-up hidden state representation h↑_i (calculated using the Gumbel tree-LSTM) and the top-down representation of its parent h↓_p (calculated during the previous top-down steps) using a GRU:

z_i = σ(W_td h↑_i + U_td h↓_p + b_td)   (6)
h̃_i = tanh(W_td_h h↑_i + U_td_h h↓_p + b_td_h)   (7)
h↓_i = z_i ⊙ h↓_p + (1 − z_i) ⊙ h̃_i   (8)

where W_td, U_td, W_td_h, and U_td_h are weight matrices; b_td and b_td_h are bias vectors; and σ is the sigmoid activation function. Note that we do not use different weights for left and right children of a given parent.
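One step of this top-down pass can be sketched as below. The gated form is an assumption consistent with the parameters listed above (a single update gate interpolating between the parent's top-down state and a candidate built from the node's bottom-up state); the paper's exact variant may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def top_down_step(h_up, h_down_parent, W, U, b, W_h, U_h, b_h):
    # Update gate z decides how much of the parent's top-down state to
    # keep versus a candidate state built from the node's own
    # bottom-up representation.
    z = sigmoid(W @ h_up + U @ h_down_parent + b)
    h_tilde = np.tanh(W_h @ h_up + U_h @ h_down_parent + b_h)
    return z * h_down_parent + (1.0 - z) * h_tilde
```

Starting from the root and recursing toward the leaves, each node's `h_down_parent` is the output of its parent's `top_down_step`, so global information flows down the induced tree.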
Each node needs a final representation to supply to the attention mechanism. Here, the top-down version of each node is used, because the top-down version captures both local and global information about the node.
The decoder is initialized with the top-down representation of the root node.Note, however, that this is identical to the bottom-up representation of the root node, so no additional top-down information is used to initialize the decoder.Since the root node contains information about the entire sentence, this allows the decoder to be initialized with a summary of the source sentence, mirroring standard sequential NMT.

Attention to Words and Phrases
The standard and top-down tree2seq models take different approaches to attention. The standard (bottom-up) model attends to the intermediate phrase nodes of the tree-LSTM, in addition to the word nodes output by the leaf LSTM. This follows what was done by Eriguchi et al. (2016). We use one attention mechanism for all nodes (word and phrase), making no distinction between different node types. Note that without the attention to the phrase nodes, the bottom-up tree2seq model would be almost equivalent to standard seq2seq, since the word nodes are created using a sequential LSTM (the only difference would be the use of the root node to initialize the decoder).
When the top-down pass (section 3.3) is added to the encoder, the final word nodes contain hierarchical information from the entire tree, as well as sequential information. Therefore, in the top-down tree2seq model, we attend to the top-down word nodes only, ignoring the phrase nodes. We argue that attention to the phrase nodes is unnecessary, since the word nodes summarize the phrase-level information; indeed, in preliminary experiments, attending to phrase nodes did not yield improvements.

Data
The models are tested on Tagalog (TL) ↔ English (EN), Turkish (TR) ↔ EN, and Romanian (RO) ↔ EN. These pairs were selected because they range from very low-resource to medium-resource, so we can evaluate the models in various settings. Table 1 displays the number of parallel training sentences for each language pair.

Table 1: Number of parallel sentences for each language pair after preprocessing.
The TR↔EN and RO↔EN data is from the WMT17 and WMT16 shared tasks, respectively (Bojar et al., 2017, 2016). Development is done on newsdev2016 and evaluation on newstest2016. The TL↔EN data is from IARPA MATERIAL Program language collection release IARPA MATERIAL BASE-1B-BUILD v1.0. No monolingual data is used for training.
The data is tokenized and truecased with the Moses scripts (Koehn et al., 2007). We use byte pair encoding (BPE) with 45k merge operations to split words into subwords (Sennrich et al., 2016). Notably, this means that the unsupervised tree encoder induces a binary parse tree over subwords (rather than at the word level).
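The effect of BPE segmentation can be sketched as follows. This toy applies each learned merge in order with a single greedy pass; the merge list here is illustrative, not the 45k operations learned from the training data, and production systems use the subword-nmt tooling rather than code like this.

```python
def apply_bpe(word, merges):
    # Apply learned BPE merge operations, in order, to split a word
    # into subword units. Each merge joins adjacent symbol pairs (a, b).
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

A word covered by the learned merges stays whole, while a rarer form is split into subwords, which is why the induced trees operate over subword leaves.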

Baselines
We compare our models to an RNN-based attentional NMT model; we refer to this model as seq2seq. Apart from the encoder, this baseline is identical to our proposed models. We train the seq2seq baseline on unparsed parallel data.
For translations out of English, we also consider an upper bound that uses syntactic supervision; we dub this model parse2seq. This is based on the mixed RNN model proposed by Li et al. (2017). We parse the source sentences using the Stanford CoreNLP parser (Manning et al., 2014) and linearize the resulting parses. We parse before applying BPE, and do not add any additional structure to segmented words; thus, final parses are not necessarily binary. This is fed directly into a seq2seq model (with increased maximum source sentence length to account for the parsing tags).

Implementation
All models are implemented in OpenNMT-py (Klein et al., 2017). They use word embedding size 500, hidden layer size 1000, batch size 64, two layers in the encoder and decoder, and dropout rate 0.3 (Gal and Ghahramani, 2016). We set the maximum sentence length to 50 (150 for the parse2seq source). Models are trained using Adam (Kingma and Ba, 2015) with learning rate 0.001. For tree-based models, we use a fixed Gumbel temperature. We train until convergence on the validation set, and the model with the highest BLEU on the validation set is used to translate the test data. During inference, we set beam size to 12 and maximum length to 100.

Translation Performance
Tables 2 and 3 display BLEU scores for our unsupervised tree2seq models translating into and out of English, respectively. For the lower-resource language pairs, TL↔EN and TR↔EN, the tree2seq and top-down models consistently improve over the seq2seq and parse2seq baselines. However, for the medium-resource language pair (RO↔EN), the unsupervised tree models do not improve over seq2seq, unlike the parse2seq baseline. Thus, inducing hierarchies on the source side is most helpful in very low-resource scenarios.

Williams et al. (2017) observed that the parses resulting from Gumbel tree-LSTMs for sentence classification did not seem to fit a known formalism. An examination of the parses induced by our NMT models suggests this as well. Furthermore, the different models (tree2seq and top-down tree2seq) do not seem to learn the same parses for the same language pair. We display example parses induced by the trained systems on a sentence from the test data in Table 4.

EN→TR tree2seq: ( ( ( others have ) ( ( ( dismissed him ) as ) a ) ) ( j@@ ( oke . ) ) )
EN→TR top-down: ( ( ( ( others have ) dismissed ) ( him as ) ) ( ( a ( j@@ oke ) ) . ) )
EN→RO tree2seq: ( ( ( others have ) dismissed ) ( him ( ( ( as a ) joke ) . ) ) )
EN→RO top-down: ( others ( ( ( have ( ( dismissed him ) ( as a ) ) ) joke ) . ) )

Related Work

Inducing unsupervised or semi-supervised hierarchies in NMT is a relatively recent research area. Gehring et al. (2017a,b) introduced a fully convolutional model for NMT, which improved over strong sequential baselines. Hashimoto and Tsuruoka (2017) added a latent graph parser to the encoder, allowing it to learn dependency-like source parses in an unsupervised manner. However, they found that pre-training the parser with a small amount of human annotations yielded the best results. Finally, Kim et al. (2017) introduced structured attention networks, which extended basic attention by allowing models to attend to latent structures such as subtrees.

Conclusions
In this paper, we have introduced a method for incorporating unsupervised structure into the source side of NMT. For low-resource language pairs, this method yielded strong improvements over sequential and parsed baselines. This technique is useful for adding hierarchies to low-resource NMT when a source-language parser is not available. Further analysis indicated that the induced structures are not similar to known linguistic structures.
In the future, we plan on exploring ways of inducing unsupervised hierarchies on the decoder. Additionally, we would like to try adding some supervision to the source trees, for example in the form of pre-training, in order to see whether actual syntactic information improves our models.

Table 2 :
BLEU for the baseline and the unsupervised tree2seq systems on *→EN translation.

Table 4 :
Induced parses on an example sentence from the test data.

Table 5 :
Recombined subwords in the test data.