Historical Text Normalization with Delayed Rewards

Training neural sequence-to-sequence models with simple token-level log-likelihood is now a standard approach to historical text normalization, albeit often outperformed by phrase-based models. Policy gradient training enables direct optimization for exact matches, and while the small datasets in historical text normalization make from-scratch reinforcement learning prohibitive, we show that policy gradient fine-tuning leads to significant improvements across the board. Policy gradient training, in particular, leads to more accurate normalizations for long or unseen words.


Introduction
Historical text normalization is a common approach to making historical documents accessible and searchable. It is a challenging problem, since most historical texts were written without fixed spelling conventions, and spelling is therefore at times idiosyncratic (Piotrowski, 2012).
Traditional approaches to historical text normalization relied on hand-written rules, but recently, several authors have proposed neural models for historical text normalization (Bollmann and Søgaard, 2016; Bollmann, 2018; Tang et al., 2018). Such models are trained using character-level maximum-likelihood training, which is inconsistent with the objective of historical text normalization; namely, transduction into modern, searchable word forms. The discrepancy between character-level loss and our word-level objective means that model decision costs are biased. Our objective, however, is reflected by the standard evaluation metric, which is computed as the fraction of benchmark words that are translated correctly.
In order to mitigate the discrepancy between the optimization method and the task objective, work has been carried out on using reinforcement learning to optimize directly for the evaluation metric (Ranzato et al., 2016; Shen et al., 2016). Reinforcement learning enables direct optimization of exact matches or other non-decomposable metrics, computing updates based on delayed rewards rather than token-level error signals. This paper contrasts maximum likelihood training and training with delayed rewards, in the context of sequence-to-sequence historical text normalization (Bollmann et al., 2017).

Contributions
We show that training with delayed rewards achieves better performance than maximum likelihood training across six different historical text normalization benchmarks, and that training with delayed rewards is particularly helpful for long words, for words on which maximum likelihood training predicts long normalizations, and for unseen words. We note that our approach differs from other applications in the NLP literature in using the mean reward as our baseline, and in comparing different reward functions; we also fine-tune relying only on rewards, rather than a mixture of cross-entropy loss and rewards.

Historical text normalization datasets
Historical text normalization datasets are rare and typically rather small. Most of them are based on collections of medieval documents. In our experiments, we include six historical text normalization datasets: the English, Hungarian, Icelandic, and Swedish datasets from Pettersson (2016); the German dataset introduced in Bollmann et al. (2017); and the Slovene "Bohorič" dataset from Ljubešić et al. (2016). We use these datasets in the form provided by Bollmann (2019), i.e., preprocessed to remove punctuation, perform Unicode normalization, replace digits that do not require normalization with a dummy symbol, and lowercase all tokens. Note the differences in the number of words that are invariant across time, i.e., where the original input word form is the correct prediction according to the manual annotations. These differences give reason to expect higher performance on English, but lower performance on Hungarian, for example, since it is easier to learn to memorize the input than to learn abstract transduction patterns. In practice, we see the differences being relatively small. Performance on English, however, is significantly higher than for the other languages (see Table 2).
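The token-level part of the preprocessing described above can be sketched as follows. This is an illustrative reconstruction, not Bollmann's (2019) actual preprocessing code; the choice of NFC normalization and of "0" as the dummy symbol are assumptions (punctuation removal operates on the token stream and is omitted here).

```python
import re
import unicodedata

def preprocess_token(token, dummy="0"):
    """Sketch of per-token preprocessing: Unicode normalization,
    lowercasing, and replacing digits with a dummy symbol.
    NFC and the dummy symbol "0" are assumptions, not the paper's spec."""
    token = unicodedata.normalize("NFC", token)  # Unicode normalization
    token = token.lower()                        # lowercase all tokens
    token = re.sub(r"\d", dummy, token)          # digits -> dummy symbol
    return token
```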

Normalization models
Our baseline model is an LSTM-based encoder-decoder model with attention. The model receives as input a sequence of characters from the source vocabulary (i_1, ..., i_N). Each character i_t is mapped to the corresponding randomly initialized embedding, which is then given as input to the bi-LSTM encoder. The decoder then uses the Bahdanau attention mechanism (Bahdanau et al., 2014) over the encoded representation to output a sequence of characters from the target vocabulary (o_1, ..., o_M). Note that the input and output sequences may differ in length. Both the encoder and the decoder are composed of three layers with dimensionality 256. The character embeddings have 128 dimensions.
For training our maximum likelihood baseline, we use the Adam optimiser initialized with a learning rate of 0.001 and default decay rates. In addition, we use a dropout probability of 20%. The model is trained with batch size 16 for 10 epochs with early stopping. All hyper-parameters were tuned on English development data.

Policy gradient fine-tuning We use policy gradient training with delayed rewards for fine-tuning our models: We use maximum likelihood pretraining for 10 epochs (see above) and update our model based on policy gradients computed using the REINFORCE algorithm (Williams, 1992; Sutton et al., 1999). This enables us to optimize for delayed rewards that are non-decomposable. Specifically, we directly minimize a distance function between strings, e.g., Levenshtein distance, by using the negative distance as a delayed reward: R(Ŷ) = −Levenshtein(Y, Ŷ). REINFORCE maximizes the expected reward under some probability distribution P(Ŷ|θ), parameterized by some θ. This way, the cost function J(θ) is defined as the negative expected reward:

J(θ) = −E_{Ŷ∼P(Ŷ|X;θ)}[R(Ŷ)]   (1)

From this cost function, the policy gradient can be derived as:

∇_θ J(θ) = −E_{Ŷ∼P(Ŷ|X;θ)}[R(Ŷ) ∇_θ log P(Ŷ|X;θ)]   (2)

We refer the reader to prior work for the full derivation (Williams, 1992; Karpathy, 2016). In Equation (2), there is no need to differentiate R(Ŷ), and policy gradient training is therefore possible with non-differentiable reward functions (Karpathy, 2016). To explore the search space, we use a stochastic sampling function S(X) that, given an input sequence X, produces k sample hypotheses Ŷ_1, ..., Ŷ_k. The hypotheses are generated by, at each time step, sampling actions based on the multinomial probability distribution of the policy. In order to reduce the search space, we sample only from the ten most likely actions at each time step. Furthermore, duplicate samples are filtered.
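The sampling procedure above can be sketched in isolation as follows. This is a minimal illustration, not the paper's implementation: `step_probs` stands in for the decoder's per-step softmax output, and the function name and interface are assumptions.

```python
import random

def sample_hypotheses(step_probs, k=5, top=10, seed=0):
    """Sketch of the stochastic sampling function S(X): at each time
    step, sample an action from the policy's multinomial distribution,
    restricted to the `top` most likely actions; duplicates are filtered.
    `step_probs` is a list of {action: probability} dicts, one per step
    (a stand-in for the decoder output; this interface is an assumption)."""
    rng = random.Random(seed)
    hypotheses = set()  # a set, so duplicate samples are filtered
    for _ in range(k):
        hyp = []
        for probs in step_probs:
            # Keep only the `top` most likely actions at this time step.
            best = sorted(probs.items(), key=lambda kv: -kv[1])[:top]
            actions, weights = zip(*best)
            # Multinomial sampling over the restricted action set.
            hyp.append(rng.choices(actions, weights=weights)[0])
        hypotheses.add("".join(hyp))
    return hypotheses
```

In the full algorithm, each sampled hypothesis would then be scored with the delayed reward R(Ŷ) and used in the policy gradient update.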
In practice, we do not optimize directly for the reward R(Ŷ). Instead, we replace it with the advantage score (Weaver and Tao, 2001; Mnih and Gregor, 2014): A(Ŷ) = R(Ŷ) − b, where b is a baseline reward (Weaver and Tao, 2001), introduced to reduce the variance in the gradients. We use the mean reward over the samples as our baseline reward. This way, the advantage scores of the samples will be centered around 0, meaning that about half of the produced samples will be encouraged and about half will be discouraged (Karpathy, 2016).
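The mean-reward baseline amounts to centering the sampled rewards, which can be sketched as:

```python
def advantages(rewards):
    """Advantage scores A(Y) = R(Y) - b with b the mean sample reward,
    so the advantages are centered around zero: samples above the mean
    are encouraged, samples below it are discouraged."""
    b = sum(rewards) / len(rewards)  # mean-reward baseline
    return [r - b for r in rewards]
```

With negative Levenshtein distances as rewards, e.g. [-1, -3, -2], the advantages become [1, -1, 0]: the best sample is pushed up, the worst pushed down.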
We also found it necessary to normalize the probability distribution P(Ŷ|X; θ) over the samples from S(X). We follow Shen et al. (2016) and define a probability distribution Q(Ŷ|X; θ, α) over the subspace of S(X):

Q(Ŷ|X; θ, α) = P(Ŷ|X; θ)^α / Σ_{Ŷ'∈S(X)} P(Ŷ'|X; θ)^α
This function is essentially a smoothing function over the sample probabilities, with a hyper-parameter α that controls the level of smoothing. We follow Shen et al. (2016) and set α = 0.005. With these alterations, our cost function and gradient can be defined as:

J(θ) = −E_{Ŷ∼Q(Ŷ|X;θ,α)}[A(Ŷ)]   (3)

∇_θ J(θ) = −E_{Ŷ∼Q(Ŷ|X;θ,α)}[A(Ŷ) ∇_θ log Q(Ŷ|X;θ,α)]   (4)

The algorithm is described in pseudocode in Algorithm 1. We optimized hyper-parameters the same way we optimized our baseline model hyper-parameters. Compared to the baseline, the policy gradient model's optimal batch size is bigger (64), and the learning rate is smaller (0.00001). Both choices reduce the scale of random fluctuations in the SGD dynamics (Smith and Le, 2018; Balles et al., 2017), which stabilizes fine-tuning.
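The renormalization over the sampled subspace can be sketched as follows; the function name is an assumption, and `sample_probs` stands in for the model probabilities P(Ŷ|X; θ) of the (deduplicated) samples.

```python
def smoothed_distribution(sample_probs, alpha=0.005):
    """Sketch of Shen et al.'s (2016) renormalization: Q is proportional
    to P^alpha over the sampled subspace, so it sums to one over the
    samples. Small alpha flattens (smooths) the distribution."""
    scaled = [p ** alpha for p in sample_probs]
    z = sum(scaled)
    return [s / z for s in scaled]
```

With α = 0.005 the resulting Q is close to uniform while preserving the ranking of the samples, which keeps the gradient magnitudes comparable across samples of very different model probability.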

Experiments
Our experiments compare maximum likelihood training and policy gradient training across six historical text normalization datasets (cf. Table 1). We optimized hyper-parameters on the English development data and used the same hyper-parameters across the board (see above).
Distance metric We also treated the distance metric used as our reward function as a hyper-parameter. Figure 1 shows a comparison of three reward functions on the Icelandic development data: (i) the Levenshtein distance, which is the number of character operations (substitute, insert, delete) needed to transform one string into another; (ii) the Hamming distance, which is the number of positions at which the corresponding characters of two strings of equal length differ (we pad the shorter of the two strings with spaces); and (iii) the Jaro-Winkler distance (Cohen et al., 2003).
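The first two distances, and their use as delayed rewards, can be sketched as follows (the Jaro-Winkler variant is omitted; the `reward` wrapper is illustrative, not the paper's code):

```python
def hamming(a, b):
    """Hamming distance, padding the shorter string with spaces (as above)."""
    n = max(len(a), len(b))
    a, b = a.ljust(n), b.ljust(n)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Standard dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete
                            curr[j - 1] + 1,          # insert
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]

def reward(gold, hyp, dist=levenshtein):
    """Delayed reward: the negative distance, R(Y_hat) = -dist(Y, Y_hat)."""
    return -dist(gold, hyp)
```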

Results
The results are presented in Table 2. Generally, we see that policy gradient fine-tuning improves results across the board. For English, the error reduction is 20%. For German, Hungarian, Icelandic, Slovene, and Swedish, the error reduction is smaller (7-16%), but still considerable and highly significant (p < 0.01). Tang et al. (2018) do show, however, that multi-headed attention architectures (Vaswani et al., 2017) generally seem to outperform sequence-to-sequence models with attention for historical text normalization. This is orthogonal to the analysis presented here, and similar improvements can likely be obtained by multi-headed attention architectures.
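Error reduction here is the standard relative reduction in word-level error; assuming that definition (the paper does not spell it out), it can be computed as:

```python
def error_reduction(acc_mle, acc_pg):
    """Relative error reduction between MLE and MLE+PG, with accuracies
    given in percent. Assumes the standard definition:
    (error_mle - error_pg) / error_mle. Example numbers below are
    illustrative, not the paper's scores."""
    return 100 * (acc_pg - acc_mle) / (100 - acc_mle)
```

For instance, improving word-level accuracy from 90.0% to 92.0% cuts the error from 10 points to 8, a 20% error reduction.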
Analysis To avoid bias from small, high-variance datasets, we limit error analysis to English, German, Hungarian, and Icelandic. In Table 3, we present correlation scores between our observed improvements and characteristics of the data. We consider the following characteristics: 1. GOLD LENGTH: Reinforcement learning with delayed rewards can potentially mitigate error propagation, and we do observe that gains from reinforcement learning, i.e., the distribution of normalizations that reinforcement learning got right but our baseline architecture got wrong, correlate significantly with the length of the input across all four languages.

2. MLE LENGTH:
The correlations are even stronger with the length of the output of the MLE model. This suggests that reinforcement learning, i.e., policy gradient training, is particularly effective on examples for which maximum likelihood training tends to predict long normalizations.

3. MLE BACKOFF:
We also correlate gains with the distribution of instances on which the MLE model backed off to predicting the original input word form. Here, we see a negative correlation, suggesting our baseline is good at predicting when a word form is invariant across time.
4. IDENTICAL: The three trends above are all quite strong. Our fourth variable is whether input and output are identical (invariant across time). Here, we see mixed results: policy gradient gains correlate negatively with invariance in English, but positively in Icelandic.
5. UNSEEN WORDS: Finally, we correlate gains with whether words had been previously seen at training time. Our policy gradient fine-tuned model performs much better on unseen words, and especially for English, German, and Hungarian, we see strong correlations between improvements and unseen words. Our predictions also exhibit smaller Levenshtein distances to the annotations compared to our baseline model: 0.11 vs. 0.14 for English, and 0.20 vs. 0.23 for German.

Conclusions
Our experiments show that across several languages, policy gradient fine-tuning outperforms maximum likelihood training of sequence-to-sequence models for historical text normalization.
Since historical text normalization is a character-level transduction task, it is feasible to experiment with reinforcement learning, and we believe our results are very promising. In our error analysis, we additionally observe that reinforcement learning is particularly beneficial for long words and unseen words, which are probably the hardest challenges in historical text normalization.

Figure 1: Different reward functions on Icelandic (dev).

Table 1 gives an overview of the datasets.

Table 2: Comparison of maximum likelihood training (MLE) and policy gradient fine-tuning (MLE+PG), given in word-level accuracy in percent, as well as the error reduction between MLE and MLE+PG.