Debugging Translations of Transformer-based Neural Machine Translation Systems

In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimation of confidence in the output based on the attention. We take a closer look at how it can handle output from transformer-based NMT models. Its purpose is to help researchers and developers find weak and faulty translations that their NMT systems produce without the need for reference translations. We present a demonstration website of our tool with examples of good and bad translations: http://attention.lielakeda.lv.


Introduction
As one of the primary use-cases for the modern computer, automated translation of texts from one language into another, or machine translation (MT), has evolved vastly since its early days in the 1950s. There have been several large paradigm shifts that have greatly impacted the field of MT: rule-based MT (RBMT), statistical MT (SMT) and neural network MT (NMT) (Bahdanau et al., 2014). With each paradigm shift, detailed understanding of how the system produces its final translation has changed from fully clear in the case of RBMT to slightly less clear, but often still predictable in SMT, to often completely unpredictable in NMT. Many current tools for inspecting results of statistical phrase-based approaches are either not compatible or serve little purpose in dealing with neural network generated output.
To address the lack of tools for inspection and analysis of NMT translations, we propose a tool for browsing, inspecting and comparing translations specifically designed for NMT output. The tool takes the attention weights that correspond to specific token pairs, which are generated during the decoding process, and turns them into one of several visual representations that can help humans better understand how the output translations were produced. Aside from just visualising attention alignments, the tool also uses them to estimate the confidence in a translation, which makes it possible to distinguish acceptable outputs from completely unreliable ones. For this, no reference translations are required.
The structure of this paper is as follows: Section 2 summarises related work on tools for inspecting translation outputs and alignments; Section 3 introduces the key concepts of the baseline tool (how it scores translations and displays the visualisations in different environments) and outlines the improvements made to make it more useful for debugging machine translation output. In Section 4, we give an overview of how to make the most of our tool in finding odd translations, what to look for when comparing them and possible causes of errors. Section 5 discusses the challenges introduced by multi-layer models like transformers, and Section 6 describes how to deal with them. Finally, we conclude the paper in Section 7 and introduce plans for future work in the area.

Related Work
The foundation of our tool is based on the paper of Rikters et al. (2017), who introduce visualisation of NMT attention and use attention-based scoring of NMT as described by Rikters and Fishel (2017). While in general it can be useful to quickly find sentences with "scrambled" attention alignments, it gets more challenging when having to deal with output from multi-layer neural networks. This tends to mislead users when sorting data sets by confidence and looking for the highest scoring examples.

Attention Averaging
The general intuition is that transformer models, just like attentional RNNs, learn to pay more attention to specific source sentence tokens while generating translation tokens. Since each attention head in each layer shows different results, it becomes nontrivial to decipher which matrix or matrices, if any, have learnt the alignment representation. Averaging attention probabilities over all attention heads in all layers is one way to obtain a single attention matrix for a translated sentence.
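The averaging step described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in any NMT toolkit: the function name average_attention, the nested-list layout attention[layer][head][target][source] and the toy values are all assumptions made for the example.

```python
def average_attention(attention):
    """Average attention over all heads in all layers.

    `attention` is assumed to be a nested list indexed as
    attention[layer][head][target_token][source_token].
    Returns one matrix indexed as [target_token][source_token].
    """
    # Flatten layers and heads into one list of matrices.
    matrices = [head for layer in attention for head in layer]
    if not matrices:
        raise ValueError("no attention matrices given")
    count = len(matrices)
    target_len = len(matrices[0])
    source_len = len(matrices[0][0])
    return [
        [sum(m[j][i] for m in matrices) / count for i in range(source_len)]
        for j in range(target_len)
    ]


# Toy input (hypothetical values): 2 layers x 2 heads, 2 target and 2 source tokens.
sharp = [[1.0, 0.0], [0.0, 1.0]]    # a head with one-to-one alignments
diffuse = [[0.5, 0.5], [0.5, 0.5]]  # a head with uniform attention
toy = [[sharp, diffuse], [sharp, diffuse]]

averaged = average_attention(toy)
print(averaged)  # [[0.75, 0.25], [0.25, 0.75]]
```

Because every head contributes equally, heads with diffuse attention dilute the heads with sharp alignments, and each row of the result still sums to one as long as every input row does.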
We trained transformer and RNN NMT models using data from the highest-ranking English-Latvian system in WMT 2017 (Pinnis et al., 2017) and used both systems to translate formatting-rich documents. To compare the quality of attention averaging to the established RNN attention alignments, we performed a small-scale human evaluation of the formatting transfer between source and translated documents. The human evaluation showed that the averaged transformer alignments are just as acceptable as RNN alignments.

Guided Alignments
Chen et al. (2016) claim that translation of unknown out-of-vocabulary (OOV) words is linked to soft alignment dispersion and may be the source of some translation errors. To improve the alignments and the output translations, the authors propose to use the IBM model 4 Viterbi alignments as additional input data during training. They experiment with adding alignments produced by GIZA++ to RNN-based NMT systems. The authors report improvements in alignment distributions as well as overall translation quality. Liu et al. (2016) also attempt to improve attention alignments produced by RNN-based NMT systems. In addition to GIZA++ alignments, they experiment with fast_align and add several heuristics. The authors report that the significantly faster fast_align generates slightly lower quality alignments, yet these improve NMT output quality and soft alignments just as well.
As with attention averaging, we ran a similar experiment: we trained an NMT system with guided alignments (fast_align) and translated formatted documents for human evaluation. The evaluation showed that the model with guided alignments is able to transfer document formatting slightly better than both the averaged alignments and the RNN attention alignments.

Visualisation Tool
The basis of our visualisation tool is described in full detail in the baseline paper (Rikters et al., 2017). It requires source and translated sentences along with the corresponding attention alignments from NMT systems as input files and can provide a visual overview in a command line environment (Linux Terminal or Windows PowerShell) or a web browser of any modern device. It is published in a GitHub repository and open-sourced under the MIT License. In the following subsections, we outline only the core components and focus on highlighting improvements and differences.
In addition to Nematus, Neural Monkey and Marian (Junczys-Dowmunt et al., 2018), we have also added out-of-the-box support for working with attention alignments from the OpenNMT and Sockeye (Hieber et al., 2017) frameworks.

Confidence Scores
This section outlines how the confidence scores are calculated and how the final score differs from the baseline.
The four main metrics that we use for scoring translations are:
- Coverage Deviation Penalty (CDP) penalises attention deficiency and excessive attention per input token.
- Absentmindedness Penalties (AP_out, AP_in) penalise output tokens that pay attention to too many input tokens, or input tokens that produce too many output tokens.
- Overlap Penalty (OP) penalises translations that copy large fractions from source sentences. A stronger penalty is allocated to longer sentences that copy large amounts from the source, while shorter ones get more tolerance (e.g., the three-word English sentence "Thanks Barack Obama." can be perfectly translated into "Paldies Barack Obama." although 2/3 of the words in the translation are the same as in the source). A plot of how the penalty increases in relation to the source-translation overlap and source sentence length is shown in Figure 1.
- Confidence is the sum of the three main metrics (CDP, AP_in and AP_out) plus the similarity penalty, which is added only when the similarity between input and output sentences is high (similarity > 0.3).
In all of the metrics, L_s is the length of the source sentence; L_t is the length of the target sentence; S is the similarity between the source sentence and the translation on a scale of 0 to 1; and α_ji is the attention weight between source token i and translation token j.
Changes have been introduced to the final confidence score by first calculating the similarity ratio between input and output sentences and then adding a further penalty only if the similarity is high enough. The similarity is calculated by finding the longest contiguous matching sub-sequence.
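One way to find the longest contiguous matching sub-sequence is Python's standard difflib module, as in the sketch below. The function names and, in particular, the normalisation by the length of the longer sentence are assumptions made for illustration; the text above does not specify how the tool turns the match into a ratio.

```python
from difflib import SequenceMatcher


def longest_common_span(source, hypothesis):
    """Return the longest contiguous character span shared by both strings."""
    # autojunk=False keeps frequent characters from being ignored in long inputs.
    matcher = SequenceMatcher(None, source, hypothesis, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(hypothesis))
    return source[match.a:match.a + match.size]


def overlap_ratio(source, hypothesis):
    """Length of the longest shared span relative to the longer sentence.

    The choice of denominator is an assumption made for this sketch.
    """
    if not source or not hypothesis:
        return 0.0
    span = longest_common_span(source, hypothesis)
    return len(span) / max(len(source), len(hypothesis))


short_src = "Thanks Barack Obama."
short_hyp = "Paldies Barack Obama."
print(repr(longest_common_span(short_src, short_hyp)))  # 's Barack Obama.'

copied = "Kepler measures spin rates of stars in Pleiades cluster"
print(overlap_ratio(copied, copied))  # 1.0
```

A penalty built on such a ratio would additionally have to take sentence length into account, so that a short sentence with a legitimately copied name is tolerated while a long, fully copied sentence is not.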
Since the baseline confidence score considered only the attention alignments when calculating the final value, examples like the one shown in Figure 2 received particularly high values due to consistent one-to-one attention alignments. The updated score addresses this problem by penalising hypothesis sentences that are overly similar to the source sentence.

Source: Kepler measures spin rates of stars in Pleiades cluster
Hypothesis: Kepler measures spin rates of stars in Pleiades cluster
Reference: Keplers izmēra zvaigžņu griešanās ātrumu Plejādes zvaigznājā.

Web Interface
The web interface is the primary point of interaction with the tool. Aside from browsing visualisations, ordering data sets by confidence scores and exporting visualisations as images, all of which are described in the baseline paper, we introduce several significant changes to the system. The first one is a technical update to how data is served: loading is performed asynchronously in the background, thereby eliminating long wait times to view subsequent sentences in a large data set. The three major additions are: the source-translation overlap percentage alongside the four base scores (Section 3.3); the ability to provide reference translations, if available, to display next to the hypothesis and calculate BLEU scores (Section 3.4); and the ability to directly compare translations and alignments from two different NMT systems (Section 3.5).

Overlap
As mentioned in Section 3.1, the updated confidence score treats hypothesis translations that are long and overlap significantly with the source sentence as worse translations, while tolerating considerable overlap for shorter sentences. In addition to contributing to the final confidence score, the overlap ratio has been added as an individual score for sorting, navigating and comparing sentences from a data set, as shown in Figure 3. The system also underlines the longest matching sub-string between the source and translation in cases where the overlap is high enough (over 10%). An example is shown in Figure 3, where the overlap ratio is 20.19%.

References and BLEU
We believe that simply displaying the reference next to the hypothesis is helpful more often than not. Providing references also makes it possible to calculate BLEU scores for the translations, which gives yet another dimension for sorting (Figure 3). Unlike overlap, the BLEU scores do not influence the overall confidence scores.
Calculation and output of both overlap and BLEU scores have also been added to the terminal interface of the tool (Figure 4).

Comparing Translations
The final major addition to the tool is the option to directly compare two translations of the same source sentence. To perform the comparison, all source sentences for both input data sets must match, but the target sentences may differ in output token order as well as count. Comparisons may be performed between translations obtained from any two of the five currently supported NMT frameworks (Nematus, Neural Monkey, OpenNMT, Marian and Sockeye) or even an arbitrary input file, as long as it is formatted according to the specification provided in the readme.
Figure 5 shows an example comparison of a sentence translated by two different NMT systems. The top row is the source text, and the bottom rows represent output from each individual NMT system, colour-coded to match the colours of the alignment lines. The second hypothesis (in green) exhibits stronger and more reliable output alignments to the content words, while the first shows strong alignments coming from the full stop. In this example neither hypothesis matches the reference, but since the reference is only two words long for a source sentence of triple the length, this can hint at an oversimplified translation by the translator (assuming English was the original) and does not mean that both hypotheses are completely wrong. In fact, the second hypothesis is a fairly decent representation of the source sentence.
Figure 6 illustrates another example with strong attention alignments and a high overlap ratio (94.03%) between source and translated sentences from one system compared to a weak, but at least better translation from another system. The final confidence score for the second translation is strongly influenced by the high overlap, even though the sentence is not particularly long. In similar conditions, the confidence score of the second hypothesis calculated by the baseline system would be very close to 100% due to its complete disregard for the actual words of the source and hypothesis sentences.

Recipes for Debugging
In this section, we summarise several tips and tricks that may come in handy when using the tool to look for faulty translations of various kinds. We also list common causes associated with the problems. Some peculiarities to pay attention to include:
- Short sentences with a low confidence, CDP, AP_in or AP_out. Not all of the metrics need to be low, but translations for which at least one of them is under 30% are often worth looking into.
Source: the loss was by the team.
Hypothesis 1: zaudējums bija komandas biedrs.
Hypothesis 2: šis zaudējums bija komandai.
Reference: zaudē komanda.
- Long sentences with a high overlap. As stated before, for short sentences of only several words it may be completely normal to have an overlap of 50% or more, but if it occurs in sentences that are 10 or more words long, it may indicate that the system has only partially translated the source sentence.
Source: they did so just in time as Hindes emerged.
Hypothesis 1: viņi to darīja tikai toreiz , kad parādījās hinduisti.
Hypothesis 2: it did so just in time as Hindes emerged.
Reference: viņiem tas izdevās pēdējā brīdī.

An ongoing challenge is to find a better way of acquiring attention alignments generated by multi-layer neural networks. While in recurrent neural network NMT systems this is rarely a problem, more modern approaches like convolutional neural networks (Gehring et al., 2017) and transformer neural networks (Vaswani et al., 2017) require training of deeper models to achieve translation results of competitive quality. This, however, results in uncertainty about how to interpret attention weights, including whether they encompass reliable alignment information. Even when all attention matrices are summed up, the result looks as if every source token is connected to every hypothesis token, as can be seen in Figure 7.
Out of all modern NMT approaches that are built as deep multi-layer neural networks, the transformer-based NMT systems currently achieve state-of-the-art translation quality results for most language pairs (as shown by the results of the WMT shared task for news translation (Bojar et al., 2018)).Therefore, we chose to investigate how they work and how the attention information can be made useful for debugging output translations.

Transformer Models
Vaswani et al. (2017) proposed a novel neural network architecture, the Transformer, which relies only on the attention mechanism to draw global dependencies between input and output. It has an encoder-decoder structure using multiple stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. One of the big advantages of training self-attentional models is that they are highly parallelizable, as they do not employ the recurrent connections of recurrent neural networks (RNNs).
A typical transformer model consists of six layers, each with eight attention heads. This means that there are 48 source-to-target attention matrices (6 layers × 8 heads) that the neural network can use for translation purposes.
Aside from visualising and interpreting NMT output, attention alignments are also used to get hard word alignments in order to correctly translate structured documents and reconstruct the structure after translating (Pinnis et al., 2018b). To achieve similar results with transformer-based NMT models, several approaches have been explored, such as learning guided alignments, averaging attention matrices and using fast_align (Dyer et al., 2013) to generate alignments after the translation has been produced (e.g., Pinnis et al. (2018a)). The last approach is of no use for interpreting NMT output, as it uses a separate model and only attempts to guess what the alignments are after the result has been produced. The other two are worth looking into.

Experiments and Results
We used the previously mentioned averaged transformer and guided alignment transformer models to determine which approach is better suited for our debugging tool to quickly identify faulty and suspicious translations. Both models were trained using data from Tilde's unconstrained submission to the WMT 2017 shared task on news translation (Pinnis et al., 2017). The averaged transformer model was trained using the Sockeye NMT toolkit on the fully processed (including factorisation and morphology-driven word splitting) dataset of Tilde's unconstrained English-Latvian submission (46.04 million sentence pairs in total). The guided alignment model was trained using the Marian toolkit on the same dataset, but without factorisation and without morphology-driven word splitting (only byte-pair encoding was performed). Both models were trained until convergence and reached about the same quality on news domain (WMT17 development set) and general domain data (ACCURAT development corpus (Skadiņa et al., 2012)).
In order to determine the usefulness of each approach, we aimed to answer two main questions: 1) do the resulting attention alignments represent actual relations between the source sentence and the output translation, and 2) do the confidence scores produce similar results using these alignments. Regarding the first question, it is important to understand whether the alignments from transformer models for high-quality translations actually represent word-by-word source-translation alignments and/or relevant phrases. As for the second question, even if the first one is not fully confirmed, the alignments may still be useful for finding translation errors. Therefore, if the resulting transformer attention alignments help in producing distinct and sortable confidence scores, they will be considered useful.

Attention Averaging
Figure 8 exhibits rather dispersed attention alignments for an acceptable translation. A significant amount of the attention is focused on the full stop at the end of the source sentence and even more on the word "a", which clearly should not be connected to so many output tokens. Such a distribution of attention alignments for RNN-based models would indicate that the model had problems translating some or most input tokens and that an unsuccessful translation had been produced. During manual inspection, we noticed that most results exhibit similar behaviour, with an excessive amount of attention focused on one or two source tokens, while the translations themselves were good. This indicates that the answer to the first question is negative.
To answer the second question, we sorted the test set of 2000 sentences by each of the confidence scores and looked for low-scoring and relatively short sentences. All scores exhibited a large number of false positives, mainly due to dispersed attention alignments. Such behaviour means that the attention-based scores that are computed from transformer models with attention averaging cannot aid in finding poorly translated sentences.

Guided Alignments
The top part of Figure 9 shows how the same sentence is translated with the system that was trained using guided alignments. In this example, the translation is noticeably worse, but the alignment lines in the visualisation are much stronger and less scattered. The computed confidence score of 48.65% seems fairly adequate, as the sentence-level BLEU score is also quite low (12.69). This leads us to believe that the learned alignments do a better job of representing relations between source and translated tokens, which answers the first question positively. The example shows that the attention is mainly dispersed in places where words are split into subword units (ending with '@@').
Source: Some cyclists sing hymns or recite nursery rhymes as a climbing aid.
Hypothesis: Daži riteņbraucēji dzied himnas vai deklamē bērnu dziesmas kā kāpšanas palīglīdzekli.
Reference: Daži riteņbraucēji dzied himnas vai skaita bērnudārza dzejoļus, lai vieglāk tiktu kalnā.

To see how attention alignments change after joining subword units and the respective attentions into full words, we summed attention weights over source subword units and averaged attention weights over target subword units. The soft attention alignments acquired with this method for the same sentence can be seen in the lower part of Figure 9. As expected, the alignments became stronger, and this also improved the confidence scores. This made it possible to better single out several of the very worst translations of the set when sorting it by AP_out and CDP.
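The subword merging described above (summing over source subword units and averaging over target subword units) can be sketched as follows. This is an illustration under stated assumptions: the function names, the attention[target][source] layout and the toy tokens and weights are hypothetical, and the '@@' suffix is taken to mean that a token continues into the next one.

```python
def group_subwords(tokens, marker="@@"):
    """Group token indices into words; a token ending in `marker` continues."""
    groups, current = [], []
    for index, token in enumerate(tokens):
        current.append(index)
        if not token.endswith(marker):
            groups.append(current)
            current = []
    if current:  # a trailing, unfinished subword still forms a group
        groups.append(current)
    return groups


def merge_subword_attention(attention, source_tokens, target_tokens):
    """Merge attention[target][source] from subword level to word level.

    Weights are summed over the subwords of each source word and
    averaged over the subwords of each target word.
    """
    source_groups = group_subwords(source_tokens)
    target_groups = group_subwords(target_tokens)
    merged = []
    for target_group in target_groups:
        row = []
        for source_group in source_groups:
            summed = [sum(attention[j][i] for i in source_group)
                      for j in target_group]
            row.append(sum(summed) / len(summed))
        merged.append(row)
    return merged


# Toy example with hypothetical attention weights.
source = ["cyc@@", "lists", "sing"]
target = ["riteņ@@", "braucēji", "dzied"]
weights = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.0, 0.25, 0.75],
]
word_level = merge_subword_attention(weights, source, target)
print(word_level)  # [[0.75, 0.25], [0.25, 0.75]]
```

Summing on the source side and averaging on the target side keeps every row of the merged matrix a probability distribution over source words, provided the subword-level rows were distributions to begin with.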

Conclusion
In this paper, we described how our visual NMT debugging tool handles output from multi-layer neural networks, such as the recent and very popular Transformer models. We explored two scenarios of preparing attention alignments from transformer-based NMT to be compatible with our tool. We found that the guided alignment training strategy yields the best results for quickly locating better and worse translations in arbitrary test sets. Compared to other similar tools, ours relies on the confidence scores and does not require reference translations to facilitate this easier navigation, although additional features become available when references are provided. This makes it possible to integrate it, for example, into an NMT system with a web-based interface, providing users with an explanation for the result of a specific translation.
In a future version of the system, we plan to include other reference-based MT scoring metrics for more variety of scoring and sorting.Some examples of metrics may include chrF (Popović, 2015) or TER (Snover et al., 2006).Another idea for future work would be to list and order specific best, worst or interesting examples of translations.This could be done by considering the recipes from Section 4.
In addition to the reference-based metrics, there are reference-less approaches that are yet to be utilised. For instance, ideas could be borrowed from parallel corpora filtering (Pinnis et al., 2017), such as 1) source-hypothesis sentence length difference; 2) language identification for the hypothesis; 3) digit mismatch between the source and hypothesis; and 4) foreign or corrupt symbol checking for the hypothesis.
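Three of these reference-less checks are simple enough to sketch directly; language identification is left out because it needs an external model. The function names, the 0.5 length-ratio threshold and the definition of a corrupt symbol below are assumptions chosen for illustration, not settings from the cited filtering work.

```python
import re


def length_ratio(source, hypothesis):
    """Ratio of the shorter to the longer sentence, counted in tokens."""
    source_len = len(source.split())
    hypothesis_len = len(hypothesis.split())
    if source_len == 0 or hypothesis_len == 0:
        return 0.0
    return min(source_len, hypothesis_len) / max(source_len, hypothesis_len)


def digits_match(source, hypothesis):
    """True if both sentences contain the same digits, ignoring order."""
    return sorted(re.findall(r"\d", source)) == sorted(re.findall(r"\d", hypothesis))


def has_corrupt_symbols(hypothesis):
    """True if the hypothesis has a replacement or control character."""
    return "\ufffd" in hypothesis or any(
        ord(ch) < 32 and ch not in "\t\n\r" for ch in hypothesis
    )


def flag_suspicious(source, hypothesis, min_length_ratio=0.5):
    """Return the names of all checks that the sentence pair fails."""
    flags = []
    if length_ratio(source, hypothesis) < min_length_ratio:
        flags.append("length difference")
    if not digits_match(source, hypothesis):
        flags.append("digit mismatch")
    if has_corrupt_symbols(hypothesis):
        flags.append("corrupt symbols")
    return flags


print(flag_suspicious("He paid 25 euros in 2017 .", "Viņš samaksāja 52 eiro ."))
# ['digit mismatch']
```

Flags like these could be shown next to the attention-based scores as additional hints rather than folded into the confidence score itself.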

Figure 3. An example translation from Estonian into Russian, showing useful features for debugging translation outcomes: underlining of the longest matching sub-string between the source and translated sentences; sorting translations by overlap (pink bars) or BLEU score (purple bars); reference translation (grey background).

Figure 4. An example of the updated terminal interface output.

Figure 5. A direct comparison of attention alignments for translating the same sentence with two different NMT systems.

Figure 8. Attention alignment example of a translation from English into Latvian with a transformer-based NMT model and attention averaging.