How Do Neural Sequence Models Generalize? Local and Global Cues for Out-of-Distribution Prediction

After a neural sequence model encounters an unexpected token, can its behavior be predicted? We show that RNN and transformer language models exhibit structured, consistent generalization in out-of-distribution contexts. We begin by introducing two idealized models of generalization in next-word prediction: a lexical context model in which generalization is consistent with the last word observed, and a syntactic context model in which generalization is consistent with the global structure of the input. In experiments in English, Finnish, Mandarin, and random regular languages, we demonstrate that neural language models interpolate between these two forms of generalization: their predictions are well-approximated by a log-linear combination of lexical and syntactic predictive distributions. We then show that, in some languages, noise mediates the two forms of generalization: noise applied to input tokens encourages syntactic generalization, while noise in history representations encourages lexical generalization. Finally, we offer a preliminary theoretical explanation of these results by proving that the observed interpolation behavior is expected in log-linear models with a particular feature correlation structure. These results help explain the effectiveness of two popular regularization schemes and show that aspects of sequence model generalization can be understood and controlled.


Introduction
Neural language models (LMs) play a key role in language processing systems for tasks as diverse as machine translation, dialogue, and automated speech recognition (Baziotis et al., 2020; Sordoni et al., 2015; Mikolov et al., 2010). These LMs, which model distributions over words in context via recurrent, convolutional, or attentional neural networks, have been found to consistently outperform finite-state approaches to language modeling based on hidden Markov models (Kuhn et al., 1994) or n-gram statistics (Miller and Selfridge, 1950). But improved predictive power comes at the cost of increased model complexity and a loss of transparency. While it is possible to characterize (and even control) how finite-state models will behave in previously unseen contexts, generalization in neural LMs is not nearly as well understood.

Figure 1: We develop formal models of the predictions of neural language models in surprising contexts in which local information (e.g. the most recent token) and global information (e.g. the rest of the sentence) conflict (top). In these out-of-distribution contexts, predictors trained on both synthetic and natural languages favor either local or global information, but are best approximated by an interpolation of a local-only and global-only predictor (bottom).
Consider the following sentence prefixes:

(a) The pandemic won't end children can. . .
(b) Let him easter. . .
(c) After we ate the pizza, the pizza ate. . .

Each of these prefixes should be assigned a low probability under any reasonable statistical model of English: (a) is missing a word, (b) has a noun used in place of a verb, and (c) features a selectional restriction violation.¹ When exposed to these surprising contexts, what word will language models predict next? For finite-state models of language, the answer is clear: n-gram models back off to the shortest context in which statistics can be reliably estimated (e.g. just the final word; Katz 1987), and hidden Markov models explicitly integrate the possibility of an unexpected part-of-speech transition and an unexpected word choice (Freitag and McCallum, 1999). But in neural models, model behavior in-distribution provides little insight into behavior in novel contexts like the ones shown in (a-c).
Characterizing neural LMs' behavior on inputs like these is important for many reasons, including evaluating their robustness, characterizing their effectiveness as models of human language processing, and identifying inductive biases relevant to deployment in new tasks. This paper offers three steps toward such a characterization:

1. We present an empirical description of neural LM behavior in out-of-distribution contexts like the ones shown in (a-c). We introduce two idealized models of prediction in these contexts: a local context model, in which generalization is consistent with the last word observed (ignoring global sentence structure), and a global context model, in which generalization is consistent with the global structure of the input (ignoring unexpected words). In experiments on English, Finnish, Mandarin, and a collection of random regular languages, we show that neural LM behavior is reasonably well approximated by either the local or global context model, and even better predicted by an interpolation of the two: neural LMs reconcile conflicting information from local and global context by modeling their contributions independently and combining their predictions post-hoc (Fig. 1).
2. We further show that, in regular languages, noise introduced at training time modulates the relative strength of local and global context in this interpolation: input noise (in the form of random word substitution) encourages global generalization, while history noise (dropout applied to recurrent states or self-attention layers) encourages local generalization. These effects are small, but point toward a potential role for noise-based regularization schemes in controlling out-of-distribution behavior.
3. Finally, we offer a preliminary mathematical explanation of the observed results by demonstrating that this interpolation behavior arises in any regularized log-linear model with separate local and global context features that are individually predictive of future tokens.
Despite the complexity of current neural LMs, these results show that aspects of their out-of-distribution generalization can be characterized, controlled, and understood theoretically.

Background
Generalization in count-based LMs Before the widespread use of neural approaches in NLP, statistical approaches to language modeling were typically defined by explicit independence assumptions governing their generalization in contexts never observed in the training data. For example, n-gram models (Miller and Selfridge, 1950; Shannon, 1951) ignore global sentence structure in favor of a local context of at most n words. By contrast, latent-variable language models based on finite-state machines (Kuhn et al., 1994) (or more expressive automata; Chelba and Jelinek 1998, Pauls and Klein 2012) explicitly incorporate information from the long-range context by conditioning next-word prediction on abstract global states constrained by global sentence structure. In models of both kinds, behavior in contexts unlike any seen at training time is explicitly specified via backoff and smoothing schemes aimed at providing robust estimates of the frequency of rare events (Good, 1953; Katz, 1987; Kneser and Ney, 1995). Like past work on backoff and smoothing, our work in this paper attempts to provide a general mechanism for both prediction and control in more complex, black-box neural LMs.
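The backoff behavior described above can be sketched concretely. The following is a minimal, discounting-free variant of Katz-style backoff (the toy corpus and reliability threshold are invented for the example; Katz's actual estimator additionally discounts and redistributes probability mass):

```python
from collections import Counter

def backoff_prob(ngram, counts, k=1):
    """Use the longest context whose count clears a reliability
    threshold k; otherwise back off to a shorter context.
    A simplified sketch of Katz (1987), without discounting."""
    context, word = tuple(ngram[:-1]), ngram[-1]
    while context:
        if counts[context + (word,)] >= k:
            return counts[context + (word,)] / counts[context]
        context = context[1:]  # drop the most distant word
    total = sum(c for g, c in counts.items() if len(g) == 1)
    return counts[(word,)] / total  # unigram fallback

# Toy counts from the corpus "a b c a b d":
counts = Counter({("a",): 2, ("b",): 2, ("c",): 1, ("d",): 1,
                  ("a", "b"): 2, ("b", "c"): 1, ("b", "d"): 1,
                  ("a", "b", "c"): 1, ("a", "b", "d"): 1})
print(backoff_prob(("a", "b", "c"), counts))  # trigram observed: 0.5
print(backoff_prob(("c", "b", "d"), counts))  # backs off to p(d | b) = 0.5
```

The unseen trigram ("c", "b", "d") silently falls back to the bigram estimate, exactly the "shortest reliable context" behavior the text describes.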

Generalization in feature-based and neural LMs Such mechanisms are necessary because, with the advent of feature-rich approaches to language modeling, including log-linear models (Rosenfeld, 1996) and neural network models (Bengio et al., 2003; Mikolov et al., 2010; Vaswani et al., 2017), the kinds of structured, engineered generalization available in finite-state models of language have largely been lost. Current models clearly generalize to new linguistic contexts (including those with semantic content very different from anything seen at training time; Radford et al. 2019). But the precise nature and limits of that generalization, especially its robustness to unusual syntax and its ability to incorporate information about global sentence structure, remain a topic of ongoing study.
Current work largely focuses on controlled, linguistically motivated tests of generalization: measuring models' ability to capture long-range agreement, movement, and licensing phenomena on diagnostic datasets (Gauthier et al., 2020). For example, Linzen et al. (2016) show that while RNNs are capable of storing the information necessary to enforce subject-verb agreement, the language modeling training objective does not encourage it; McCoy et al. (2020) demonstrate that RNN models for a question formation task favor linear generalizations over hierarchical ones (roughly, lexical generalizations over syntactic ones) on out-of-distribution inputs. Rather than focusing on a specific language or class of linguistic phenomena, our work in this paper aims to provide a general-purpose framework for reasoning about generalization in neural sequence models across contexts and languages.
Generalization beyond NLP The generalizations investigated in this paper involve instances of covariate shift (a change in the distribution p(x) for a conditional model p(y | x)), which has been extensively investigated in more general machine learning settings (e.g. Storkey, 2009). Outside of NLP, there have been several attempts to describe more abstract inductive biases native to RNNs and transformers, including work focused on compositionality (Liška et al., 2018; Lake and Baroni, 2018; Weber et al., 2018) and even more generic algorithmic priors (Lan et al., 2021; Kharitonov and Chaabouni, 2020). Here we focus on the architectures and context shifts relevant to language processing tasks. We validate our models of generalization using real models trained on natural data and explain them in terms of measurable properties of these data distributions.

Models of Generalization
Consider the example contexts shown in (a-c). Each is an extremely unlikely sentence prefix, featuring text that is globally inconsistent with English syntax or semantic constraints. In such contexts, is it possible to predict a priori what a neural LM trained on language data will do next?
We can formalize the situation depicted in these examples as follows. Let p(X_1:n) = p(X_1, X_2, . . ., X_n) be a distribution over sentences with tokens X_i, and let p_LM be a learned approximation to this distribution produced by an autoregressive model of the conditional distribution p_LM(X_n | X_1:n−1). We will consider each context X_1:n−1 to comprise a global context X_G = X_1:n−2 (all but the last word) and a local context X_L = X_n−1 (the last word in the context). Then, for some thresholds ε and τ, we will call a context (X_G, X_L) surprising if

p_LM(X_L | X_G) < ε

(the juxtaposition of X_G and X_L is low-probability), while

p_LM(X_i | X_1:i−1) ≥ τ for each i ≤ n − 2

(X_G and X_L are high-probability marginally). In example (c), X_G = the pizza, X_L = ate. Given a language model p_LM, we wish to understand whether p_LM(X | X_G, X_L) has systematic or predictable structure in surprising contexts: can it be explained in terms of statistics of the underlying distribution p or the behavior of p_LM in unsurprising contexts?
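The definition of a surprising context can be sketched as a simple predicate (a minimal illustration; the probability values and thresholds below are invented):

```python
def is_surprising(p_last_given_rest, p_prefix_in_context, eps, tau):
    """A context (X_G, X_L) is surprising if the final token X_L is
    low-probability given X_G (below eps) while every earlier token
    was individually high-probability in context (at least tau)."""
    return p_last_given_rest < eps and all(q >= tau for q in p_prefix_in_context)

# Invented probabilities for "After we ate the pizza, the pizza ate":
# p(ate | ...the pizza) is tiny, but every prefix token was likely.
print(is_surprising(1e-6, [0.3, 0.2, 0.4, 0.25], eps=1e-3, tau=0.05))  # True
```

A context fails the test either when the final token is unsurprising or when some earlier token was itself already improbable.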
In the remainder of this section, we describe a set of candidate hypotheses about what this next-token distribution might look like; in Section 4 we evaluate the extent to which these hypotheses accurately predict the true behavior of p_LM.

Local and global models of generalization
We focus on two idealized models of the generalization that might be exhibited by neural language models.
Local context model In this model, we hypothesize that predictors reconcile the conflicting information from X_G and X_L by ignoring the global component of the context, making the next-token distribution locally consistent with the last token seen regardless of global sentence structure. We denote this model of generalization p_L:

p_L(X_n | X_G, X_L) = p(X_n | X_L).    (5)

p_L implements a form of backoff common in n-gram language models: faced with a long context in which the data distribution is unknown, models discard long-range information and use higher-quality estimates from a shorter context. We previously defined X_L = X_n−1, so experiments with p_L will predict that neural LMs behave like bigram models; this could be naturally generalized to local contexts consisting of more than a single word. p_L can also be viewed as the hypothesis that neural LMs implement a particular kind of lossy-context model (Futrell and Levy, 2017; Futrell et al., 2020), who note that "local contextual information plays a privileged role in [human] language comprehension"; as we will see, this appears to be the case for some neural models as well. Sequence models with backoff may also be given a hierarchical Bayesian interpretation (Teh, 2006).
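As a concrete illustration, the count-based version of p_L is just a bigram estimate (a minimal sketch; the toy corpus is invented):

```python
from collections import Counter, defaultdict

def bigram_local_model(corpus):
    """Estimate p_L(X_n | X_L) from bigram counts: the count-based
    analogue of the local context hypothesis."""
    pair_counts = defaultdict(Counter)
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            pair_counts[prev][nxt] += 1
    def p_local(nxt, prev):
        total = sum(pair_counts[prev].values())
        return pair_counts[prev][nxt] / total if total else 0.0
    return p_local

p_L = bigram_local_model([["the", "pizza", "ate"],
                          ["the", "pizza", "was", "hot"]])
print(p_L("ate", "pizza"))  # 0.5: "pizza" is followed by "ate" half the time
```

Note that p_L conditions only on the single preceding token, so it is entirely blind to the global structure that makes a context surprising.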

Global context model As an alternative, we consider the possibility that predictors rely exclusively on the global component of the context, ignoring the unexpected final token:

p_G(X_n | X_G, X_L) = p(X_n | X_G).    (6)

In the language of count-based models, this amounts to the hypothesis that neural LMs generalize as skip-gram models (Goodman, 2001; Guthrie et al., 2006), performing a kind of reverse backoff to the context prior to the most recent word. In the global context model, it is the most recent word, and not the rest of the context, that is treated as a possible source of noise to be marginalized out rather than conditioned on.

Interpolated models
Even when combined in surprising ways, both the local and global context are likely to carry useful information about the identity of the next word. Indeed, models and features implementing both kinds of context representation have been found useful in past work on language modeling (Goodman, 2001). It is thus natural to consider the possibility that neural LMs interpolate between the local context and global context models, combining evidence from p(X_n | X_L) and p(X_n | X_G) when there is no evidence for the specific context p(X_n | X_L, X_G).
We consider two ways in which this evidence might be combined.

Linear interpolation In this model,

p_+^λ(X_n | X_G, X_L) = λ p_L(X_n | X_L) + (1 − λ) p_G(X_n | X_G).    (7)

Here we predict generalization according to a direct weighted combination of p_L and p_G, with the relative importance of the two hypotheses controlled by a parameter λ ∈ [0, 1]. Informally, this hypothesis assigns non-negligible probability to next tokens that are consistent with either base hypothesis. Similar interpolation schemes were proposed for n-gram modeling by Jelinek and Mercer (1980).
Log-linear interpolation In this model,

p_×^{λ1,λ2}(X_n | X_G, X_L) = (1/Z) p_L(X_n | X_L)^{λ1} p_G(X_n | X_G)^{λ2}    (8)

for λ1, λ2 ∈ [0, 1] and some contextual normalizing constant Z that depends on X_1:n−1. Here, probabilities from the two base hypotheses are added in log-space and then renormalized; informally, this has the effect of assigning non-negligible probability to next tokens that are consistent with both base hypotheses. A similar approach was proposed for count-based language modeling by Klakow (1998).
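The two interpolation schemes can be sketched side by side (a minimal illustration over dictionary-valued distributions; the probability floor used to avoid zeros in the log-linear case is an implementation convenience of this sketch, not part of the definitions above):

```python
def linear_interp(p_local, p_global, lam):
    """Additive combination: lam * p_L + (1 - lam) * p_G."""
    support = set(p_local) | set(p_global)
    return {x: lam * p_local.get(x, 0.0) + (1 - lam) * p_global.get(x, 0.0)
            for x in support}

def loglinear_interp(p_local, p_global, lam1, lam2, floor=1e-12):
    """Multiplicative combination: proportional to p_L^lam1 * p_G^lam2,
    renormalized; concentrates mass on tokens BOTH hypotheses support."""
    support = set(p_local) | set(p_global)
    scores = {x: max(p_local.get(x, 0.0), floor) ** lam1 *
                 max(p_global.get(x, 0.0), floor) ** lam2
              for x in support}
    Z = sum(scores.values())
    return {x: s / Z for x, s in scores.items()}

p_L = {"a": 0.9, "b": 0.1}
p_G = {"b": 0.9, "c": 0.1}
print(linear_interp(p_L, p_G, 0.5))     # "a" and "c" both keep mass
print(loglinear_interp(p_L, p_G, 1, 1)) # nearly all mass moves to "b"
```

The toy distributions make the qualitative difference visible: additive interpolation behaves like an OR over the two hypotheses, while multiplicative interpolation behaves like an AND.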

Experiments
Which of these models (if any) best describes the empirical behavior of neural LMs trained on real datasets? In this section, we present two sets of evaluations. The first aims to characterize how well p_L, p_G, and combinations of the two predict the out-of-distribution behavior of RNN (Elman, 1990) and transformer (Vaswani et al., 2017) language models with standard training. The second explores whether these training procedures can be modified to control the relative strength of local and global generalization.
Both sets of experiments investigate the behavior of RNN and transformer LMs on a diverse set of datasets: first, a collection of random regular languages in which the true data distribution p(X) can be precisely modeled; second, a collection of natural language datasets from three languages (Mandarin Chinese, English, and Finnish) which vary in the flexibility of their word order and the complexity of their morphology. We begin with a more detailed discussion of models and datasets in Section 4.1; we then describe generalization experiments in Section 4.2 and control experiments in Section 4.3.

Preliminaries
Data: Formal languages The first collection of evaluation datasets consists of a family of random regular languages. We begin by generating three deterministic finite automata, each with 8 states and a vocabulary of 128 symbols. Using the algorithm in Appendix C, we randomly add edges to the DFA to satisfy the following constraints: (1) every state is connected to approximately 4 other states, and (2) each symbol appears on approximately 4 edges. States are marked as accepting with probability 1/2. Experiments on these carefully controlled synthetic languages are appealing for a number of reasons. First, because we have access to the true generative process underlying the training data, we can construct arbitrarily large training sets and surprising evaluation contexts (X_G, X_L) that are guaranteed to have zero probability under the training distribution, ensuring that our experiments cleanly isolate out-of-distribution behavior. Second, the specific construction given above means that "evidence" for the local and global models of generalization is balanced: no tokens induce especially high uncertainty over the distribution of states that can follow, and no states induce especially high uncertainty over the set of tokens they can emit, meaning that a preference for local or global generalization must arise from the model rather than the underlying data distribution.
In experiments on these datasets, we generate training examples via a random walk through the DFA, choosing an out edge (or, if available, termination) uniformly at random from those available in the current state. We generate surprising test examples by again sampling a random walk, then appending a symbol that cannot be produced along any out-edge from that random walk's final state. We compute p_L and p_G using the ground-truth distribution from each DFA.
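The sampling procedure can be sketched as follows (a minimal illustration using a hypothetical 3-state automaton rather than the 8-state, 128-symbol DFAs described above):

```python
import random

# Hypothetical miniature DFA: {state: {symbol: next_state}}.
DFA = {0: {"a": 1, "b": 2}, 1: {"c": 0, "d": 2}, 2: {"a": 0}}
ACCEPT = {0, 2}

def sample_walk(dfa, accept, start=0, max_len=20, rng=random):
    """Random walk: pick uniformly among outgoing edges, with a
    termination option added when the current state is accepting."""
    state, tokens = start, []
    for _ in range(max_len):
        options = sorted(dfa[state].items())
        if state in accept and rng.random() < 1 / (len(options) + 1):
            break
        sym, state = rng.choice(options)
        tokens.append(sym)
    return tokens, state

def surprising_suffix(dfa, state, vocab, rng=random):
    """Append a symbol emitted by no out-edge of the final state,
    yielding a zero-probability (surprising) context."""
    return rng.choice([s for s in vocab if s not in dfa[state]])

walk, final = sample_walk(DFA, ACCEPT)
print(walk + [surprising_suffix(DFA, final, vocab=["a", "b", "c", "d"])])
```

Because the appended symbol labels no out-edge of the final state, the resulting prefix has exactly zero probability under the training distribution.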
In regular languages, the local context model thus hypothesizes that lexical information governs out-of-distribution prediction, predicting that LM outputs are determined by the set of states attached to an edge labeled with the surprising symbol. Conversely, the global context model hypothesizes that structural information governs out-of-distribution prediction: LM outputs are determined by the set of states reachable from the last state visited before the surprising symbol.
RNN experiments use gated recurrent units (Cho et al., 2014).

Data: Natural languages The second collection of evaluation datasets uses natural language data. We conduct experiments on English, Finnish, and Mandarin Chinese. These languages exhibit varying degrees of morphological complexity and freedom of word order, with Finnish at one extreme (morphologically complex and freely ordered) and Mandarin at the other.
English data comes from the WMT News Crawl corpus (Barrault et al., 2019). We used a 20,000-sentence subset of sentences from articles from 2007, tokenized using the SentencePiece byte-pair encoding (Kudo and Richardson, 2018) with a vocabulary size of 2^14. We used a 2,000-sentence held-out set for validation. Finnish data comes from the Turku Dependency Treebank (Haverinen et al., 2014), and Chinese data from the Simplified GSD Treebank, both included in the Universal Dependencies corpus (Nivre et al., 2020). These datasets are already tokenized; for the Chinese data we used the existing tokenization, limited to a vocabulary size of 2^14 − 2 with added "unknown" and "end-of-sentence" tokens. For Finnish we also used the SentencePiece byte-pair encoding with a vocabulary size of 2^14.
To generate surprising natural language sentences (X_G, X_L), we first select X_G by truncating sentences from the validation set to uniformly random lengths. We then run our best trained model p_LM to determine p_LM(X_n−1 | X_G), and choose a token X_L uniformly from among the set {X : p_LM(X | X_G) < 1/198, X ∈ L}, where L is the set of the 198 most-common tokens by unigram count (200 less the "unknown" and "end-of-sentence" tokens used in Chinese). In the framework of Section 3, ε is set to 1/198 and τ to the smallest probability assigned in context to an in-distribution token.

Figure 2: Accuracy acc(p, p_LM) of predicted generalization for various hypotheses p, with black lines showing one standard deviation across 5 (GRU) or 4 (transformer) random restarts. In some cases, generalization hypotheses are nearly as predictive as new neural models trained on the same data, suggesting that they explain most of the extrapolation behavior that can be derived from data alone. Multiplicative interpolation is consistently a bit better than additive interpolation. Which of the two base hypotheses performs best (global or local generalization) varies substantially across languages.
To compute generalization model predictions on natural language data, we estimate p_L from bigram counts in the training set: p_L(X_n | X_L) = count(X_n, X_L) / count(X_L). To estimate p_G, we train a second, random restart of the model p_LM. We then estimate p_G using one step of beam search in p_LM with a beam of size 15:

p_G(X_n | X_G) ≈ Σ_i p_LM(X_n | X_G, v_i) p_LM(v_i | X_G),    (10)

where the v_i range over the 15 top predicted tokens after X_G. Given a trained model that performs well on the in-distribution validation set, we expect p_LM(v_i | X_G) to approximate p(v_i | X_G), and therefore that Eq. (10) gives a good approximation of p_G.²

For each natural language dataset, we trained GRUs with 2 hidden layers and word embedding and hidden sizes of 1024, and transformers with 4 heads, 2 layers, and hidden sizes of 512. All models were optimized with Adam using a learning rate of 3e-4 on shuffled length-aligned batches of up to 128 for 15 epochs. The model with the best held-out performance was then selected.

² We compute this quantity with a second language model in order to prevent information about the true model's out-of-distribution behavior from leaking into our prediction.
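The one-step beam approximation of p_G in Eq. (10) can be sketched as follows (a minimal illustration; `lm_next`, a function mapping a context to a next-token distribution, is a hypothetical interface, not the paper's implementation):

```python
def approx_global(lm_next, x_g, beam=15):
    """One-step beam estimate of p_G(X_n | X_G): sum over the
    top-`beam` continuations v of p_LM(X_n | X_G, v) * p_LM(v | X_G)."""
    step1 = lm_next(x_g)
    top = sorted(step1, key=step1.get, reverse=True)[:beam]
    out = {}
    for v in top:
        for x, p in lm_next(x_g + [v]).items():
            out[x] = out.get(x, 0.0) + p * step1[v]
    Z = sum(out.values())  # renormalize over the truncated beam
    return {x: p / Z for x, p in out.items()}

# Toy deterministic "language model" used only for illustration:
def toy_lm(ctx):
    table = {("g",): {"u": 0.6, "v": 0.4},
             ("g", "u"): {"x": 1.0},
             ("g", "v"): {"y": 1.0}}
    return table[tuple(ctx)]

print(approx_global(toy_lm, ["g"], beam=2))  # x gets 0.6, y gets 0.4
```

The beam truncation is what keeps this tractable: rather than marginalizing the intervening token over the full vocabulary, only its 15 most likely values are considered.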

Which model of generalization fits best?
Given a dataset of surprising contexts {(X_G, X_L)_i}, a hypothesis p, and a trained model p_LM, we compute the accuracy of the hypothesis p as acc(p, p_LM) = 1 − err(p, p_LM), where

err(p, p_LM) = E[ δ( p(· | X_G, X_L), p_LM(· | X_G, X_L) ) ]    (11)

and δ is the total variation distance

δ(p, q) = (1/2) Σ_x |p(x) − q(x)|.

In other words, we measure the accuracy of each hypothesis by computing the average ℓ1 distance between the hypothesized and true probability histograms across surprising contexts. acc(p, p_LM) is between 0 and 1; a large value indicates that p is a good approximation to p_LM.³ For hypotheses p, we use (1) the local and global context models (Section 3.1), and (2) optimal linear and log-linear interpolations between them (Section 3.2), choosing settings for λ that minimize error on the evaluation set itself. To provide context for these results, we report the accuracy of a unigram baseline, which predicts the unigram distribution p(X_n) independent of the context. We additionally report the error obtained by a random restart: a new model p_θ trained from scratch (with a different initialization) on the same data as p_LM. This model provides a rough upper bound on how much of p_LM's prediction can be explained by structural properties of the data distribution itself. err was computed from 100 samples for regular languages and 200 samples for natural languages.
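The metric is straightforward to compute from the hypothesized and true next-token histograms (a minimal sketch; `hypothesis` and `model` are assumed to map a context to a token-probability dictionary):

```python
def total_variation(p, q):
    """delta(p, q) = 0.5 * sum_x |p(x) - q(x)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def accuracy(hypothesis, model, contexts):
    """acc(p, p_LM) = 1 - average total variation distance between
    hypothesized and true next-token distributions."""
    errs = [total_variation(hypothesis(c), model(c)) for c in contexts]
    return 1.0 - sum(errs) / len(errs)

print(total_variation({"a": 1.0}, {"a": 0.5, "b": 0.5}))  # 0.5
```

Since total variation distance lies in [0, 1] for probability distributions, the resulting accuracy is also in [0, 1], with 1 meaning the hypothesis exactly matches the model on every surprising context.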
Results are shown in Fig. 2. In each language, either the local or the global model is a good fit to observed generalization behavior. There are substantial differences across languages: the local context model is a good predictor for regular languages and English, but a poor predictor for Finnish and Chinese, suggesting that generalization behavior is data-dependent but not tied to word order or morphological complexity. In general, GRU generalization is more predictable than transformer generalization. Finally, interpolation often substantially outperforms either base hypothesis (Fig. 3), with log-linear interpolation slightly more predictive than linear interpolation. In several cases, interpolated hypotheses approach the accuracy of a randomly retrained predictor, suggesting that they capture much of the generalization behavior that is determined by the underlying data alone.

What controls interpolation?
The previous section showed that p_×^{λ1,λ2} gives the best fit to the empirical distribution of neural LM predictions across contexts: out-of-distribution prediction in both RNNs and transformers involves a mix of global and local information, with the precise weighting of these two sources of information dependent on structural properties of the language being modeled. A natural next question is whether this weighting can be controlled: that is, whether modifications can be made to models or training procedures that affect the relative importance of the local and global hypotheses.
In this section, we explore noise as a possible source of this control. Models of both perceptual (local) noise and retrieval (global / contextual) noise play a key role in computational models of human sentence processing (Levy, 2008). In machine learning, various kinds of noise injected at training time (most prominently dropout (Srivastava et al., 2014), but also label noise, random word substitution, and masking) are widely used as tools to regularize model training and limit overfitting. Here, we investigate whether these noising procedures qualitatively affect the kind of generalization behavior that neural LMs exhibit in the out-of-distribution contexts explored in Section 4.2. We investigate two kinds of noise: random word substitution and hidden state dropout. In all experiments, this noise is applied at training time only; model inference is run noiselessly when evaluating fit in Eq. (11). When computing p_G with Eq. (10) in these experiments, the second model p_LM used to approximate p_G is also trained without noise.

Figure 3: Effect of the log-linear interpolation parameter λ1 (fixing λ2 = 1 − λ1; Eq. (8)) when predicting out-of-distribution behavior in language models. As shown in Fig. 2, in English, an all-local hypothesis (λ = 0) is better than an all-global hypothesis (λ = 1), but true model behavior is best approximated by a log-linear combination of the two (λ ≈ 0.5). Finnish and Chinese are also best approximated by an interpolation, but are closer to the global than the local hypothesis.
Random token substitution With probability p, input tokens are randomly replaced with samples from the unigram distribution. Random word substitution plays an important role in masking-based pretraining schemes (Devlin et al., 2018).
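A minimal sketch of this noising procedure (the sampling interface below is an illustration, not the implementation used in the experiments):

```python
import random

def substitute_tokens(tokens, vocab, weights, p, rng=random):
    """With probability p, replace each input token by a sample from
    the unigram distribution (vocab weighted by unigram counts)."""
    return [rng.choices(vocab, weights=weights)[0] if rng.random() < p else t
            for t in tokens]

print(substitute_tokens(["the", "pizza", "ate"], ["the", "a"], [3, 1], p=0.3))
```

Because substitutions corrupt individual input tokens while leaving the rest of the sequence intact, this noise forces the model to rely more on the surrounding (global) context.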
Hidden state dropout With probability p, features of context representations (RNN hidden states and transformer self-attention outputs) are randomly set to zero (Semeniuta et al., 2016).
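A minimal sketch of hidden state dropout (shown here in its inverted-dropout form, which rescales surviving features at training time; whether the experiments rescale is an assumption of this sketch):

```python
import random

def state_dropout(hidden, p, rng=random):
    """Zero each feature of a context representation with probability p;
    survivors are rescaled by 1/(1-p) (inverted dropout)."""
    if p >= 1.0:
        return [0.0 for _ in hidden]
    keep = 1.0 - p
    return [h / keep if rng.random() < keep else 0.0 for h in hidden]

print(state_dropout([0.2, -1.3, 0.7], p=0.5))  # roughly half the features zeroed
```

In contrast to token substitution, this noise corrupts the model's summary of the history, making the (local) most recent input token comparatively more reliable.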

Explaining the experiments
With empirical evidence that interpolation between local and global models is a good approximation to out-of-distribution language model behavior, we next investigate whether this behavior can be explained theoretically. While we leave a complete answer to this question for future work, we conclude with the following proposition, which describes a set of conditions under which LM generalization will be well approximated by log-linear interpolation between p_L and p_G.
Proposition 1. Let θ be the parameters of a log-linear model optimizing the ℓ2-regularized objective

θ = argmin_θ [ −Σ log p(X_n | X_1:n−1; θ) + λ‖θ‖²₂ ],

where p(X_n | X_1:n−1) ∝ exp{θ_{X_n}ᵀ φ(X_1:n−1)} and φ(·) has an indicator feature for each value of X_G, X_L, and the conjunction (X_G, X_L). Suppose further that models with only local or global features are boundedly worse than this model: specifically, that

E log p(X_n | X) ≥ E log p(X_n | X_G, X_L; θ) − ε

uniformly for training X equal to either X_G or X_L. Then, in surprising contexts, p(x_n | x_1:n−1; θ) can be approximated by p_× up to a multiplicative factor:

| p(x_n | x_1:n−1; θ) / p_×(x_n | x_G, x_L) − 1 | ≤ e^{4ε/λ} − 1,

where each factor of p_× is an ℓ2-regularized estimate of the corresponding distribution.
In other words, for log-linear language models with informative local and global features, the observed effectiveness of multiplicative interpolation is expected. The proof is given in Appendix C.
It is important to qualify this result in several ways: it relies on a feature function φ that may not be a realistic representation of the context features produced by deep network models, involves strong assumptions about the independent predictive power of local and global features at training time, and becomes vacuous for large values of ε or small values of λ. It is also weak in an absolute sense (e^{4ε/λ} − 1 grows considerably faster than ε except for very large values of λ); its function is simply to relate predictions in surprising contexts to measurable properties of the training distribution. Nevertheless, the result shows that some aspects of interpolation behavior can be predicted from the parametric form of predictors alone; future work might strengthen this claim to more directly characterize the neural network predictors studied in this paper and explain the observed differences across languages.
Conclusion

When neural sequence models are exposed to out-of-distribution contexts with conflicting local and global information, their behavior can be predicted. Across natural and synthetic data distributions, sequence model generalization appears to be well approximated by either a local (n-gram-like) or global (skip-gram-like) predictor, and best approximated by a log-linear interpolation of the two, whose weight can sometimes be controlled by noise-based regularization. This work suggests several avenues for future exploration: first, explaining data-dependent aspects of the local-global tradeoff (especially cross-linguistic differences that are not clearly explained by typological differences between languages); second, determining whether architectural improvements to standard sequence models can even more effectively target specific kinds of structured generalization.

A Noise Experiments in Other Settings
We include larger versions of the graphs from Figure 4 with a few other hypotheses added: the additive interpolation and the random restart, along with an ignore hypothesis that predicts LM generalization consistent with having never seen the surprising token (i.e. using the predictive distribution from before the surprising token was observed). We also include graphs of some settings for the noise experiments of Figure 4 that were omitted from the main paper for space reasons: Chinese GRU models and transformer models on regular languages. The trends are mostly similar.

C Proof of Proposition 1
We begin with a simple lemma relating parameter weights in regularized log-linear models with mismatched feature sets.
Lemma 1. Let φ^1 and φ^2 be distinct feature functions producing binary feature vectors, with φ^1_i(x) = φ^2_i(x) for some i. Let y_1, y_2, . . . be output classes, and let θ^1 = [θ^1_{y_1}, θ^1_{y_2}, . . .] and θ^2 be the result of optimizing the ℓ2-regularized log-linear objective (Eq. 16) under each feature set.

Proof. Eq. (16) is convex, and at optimality its gradient with respect to each component of θ is 0. Setting the gradient for the shared feature i to zero under each feature set relates the corresponding components of θ^1 and θ^2, giving the claimed bound for any θ_{v,i}.

We can then obtain the main result:

Proof of Proposition 1. First note that we can estimate both local and global models using distributions of the form p(X_n | X_G) ∝ exp{η_{X_n}ᵀ φ(X_G)}, where φ(X_G) contains only global features (and similarly for X_L, with parameters µ). If these models are trained with the same regularization constant λ, the conditions of Lemma 1 will be satisfied with respect to each local or global feature. Here θ denotes the parameters of the full model, η the parameters of the global-only model, and µ the parameters of the local-only model. Because φ has an indicator for local, global, and (local, global) values, only two features are active in surprising contexts, corresponding to X_G and X_L; we denote the indices of these weights i and j in each weight vector. Without loss of generality, assume the first term is larger than the second. By applying Lemma 1, we can rewrite θ in terms of η and µ; writing ∆ = ε/λ for shorthand, the resulting approximation error is bounded by e^{4∆} − 1 (Eq. 30).

Note that this proof is not specific to local and global feature representations, but applies generically to any pair of regularized log-linear models with overlapping features and similar predictions. The proof is adapted from Theorem 1 of Zou and Hastie (2005), and we expect that it could be strengthened (as done there) to depend only on correlations between features of φ(X_G, X_L) and each of {φ(X_G), φ(X_L)}.

(a) The pandemic won't end children can. . .
(b) Let him easter. . .
(c) After we ate the pizza, the pizza ate. . .

Figure 4: Accuracy acc(p, p_LM) of predicted generalization for English, Finnish, and regular-language GRUs when trained with token-swapping noise and state dropout noise. Token swapping sometimes improves the accuracy of the global context model, while state noising improves the accuracy of the local context model. Similar trends occur with Chinese; see Appendix A.

Figure 5: Detail version of graphs from Figure 4.

Figure 7: Detail version of graphs from Figure 4.

Graphs showing hypothesis performance for Chinese GRU language models trained under the same noise conditions as in Figure 4, with similar trends.

Figure 9: Graphs showing hypothesis performance for transformer language models for regular languages, trained under the same noise conditions as in Figure 4, with similar trends.