Testing the Predictions of Surprisal Theory in 11 Languages

Abstract
Surprisal theory posits that less-predictable words should take more time to process, with word predictability quantified as surprisal, i.e., negative log probability in context. While evidence supporting the predictions of surprisal theory has been replicated widely, much of it has focused on a very narrow slice of data: native English speakers reading English texts. Indeed, no comprehensive multilingual analysis exists. We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families. Deriving estimates from language models trained on monolingual and multilingual corpora, we test three predictions associated with surprisal theory: (i) whether surprisal is predictive of reading times, (ii) whether expected surprisal, i.e., contextual entropy, is predictive of reading times, and (iii) whether the linking function between surprisal and reading times is linear. We find that all three predictions are borne out crosslinguistically. By focusing on a more diverse set of languages, we argue that these results offer the most robust link to date between information theory and incremental language processing across languages.


Introduction
Language processing is incremental and dynamic: When a reader encounters a word, they allocate a certain amount of time to process it before moving on to the next one. One influential theory for the mechanism underlying this process is surprisal theory (Hale, 2001; Levy, 2008), which states that the time required to successfully comprehend a word is based on its predictability. Notably, predictability is often quantified as surprisal (negative log-probability given preceding context), from which the theory's name is derived. Surprisal theory is supported empirically by a number of studies which have found that surprisal is strongly correlated with psychometric measurements in large naturalistic reading corpora (Demberg and Keller, 2008; Wilcox et al., 2020; Shain, 2019, 2021; Meister et al., 2021; Pimentel et al., 2023; Hoover et al., 2022, inter alia). Put differently, a word's surprisal is a strong correlate of its processing effort, operationalized as reading time.
However, there is one serious limitation with most previous studies: While making general claims about human language processing, they predominantly investigate reading times in English. And, while a few studies have investigated surprisal effects in languages other than English, e.g., Meister et al. (2021) in Dutch and Kuribayashi et al. (2021, 2022) in Japanese, no systematic, crosslinguistic analysis has been performed. As multiple sentence processing phenomena exhibit significant crosslinguistic variation (Hillert, 1998), the open question of how far surprisal theory generalizes crosslinguistically is a nontrivial limitation of the current state of the literature.
In addition, two recent contributions which we discuss here have posited several extensions to surprisal theory, most influentially (a) that contextual entropy, i.e., expected surprisal, also correlates with reading times, and (b) that the relationship between surprisal and reading time is linear (Smith and Levy, 2013; Wilcox et al., 2020; Shain et al., 2022). Regarding (a), Pimentel et al. (2023) and Cevoli et al. (2022) have argued for what may be considered an expanded version of surprisal theory where processing difficulty is still determined by surprisal, but where people's reading behavior is additionally sensitive to expected surprisal (contextual entropy). Building off prior work that has investigated the role of entropy in language processing (Hale, 2003; Roark et al., 2009; Linzen and Jaeger, 2016; van Schijndel and Schuler, 2017), these recent studies suggest that readers may allocate reading times in advance of encountering a word, based on their expectations of how difficult the word will be to process. Regarding (b), a number of studies have found evidence that the linking function between reading times and surprisal is linear (Smith and Levy, 2013; Wilcox et al., 2020; Shain et al., 2022). However, these results have been challenged recently, with different studies coming to different conclusions about the most appropriate linking function. In the past two years, for example, investigations have concluded that this function is sublinear (Brothers and Kuperberg, 2021), linear (Shain et al., 2022), and superlinear (Meister et al., 2021; Hoover et al., 2022). Here, we will use the term surprisal theory to refer to both the core hypothesis that reading times are correlated with surprisal and the two extensions, (a) and (b), described above.
arXiv:2307.03667v3 [cs.CL] 10 Sep 2024
We address a gap in the current literature by investigating the predictions of surprisal theory in eleven languages distributed across five language families.$^{1}$ We enumerate these three predictions as hypotheses below.
Hypothesis 1 (Surprisal Hypothesis) Surprisal is predictive of reading times.
Hypothesis 2 (Contextual Entropy Hypothesis) Contextual entropy is predictive of reading times.
Hypothesis 3 (Linear Link Hypothesis) The linking function between surprisal and reading times is linear.
We facilitate crosslinguistic comparison by using the MECO dataset (Siegelman et al., 2022), which presents eye-tracking data on reading materials with the same content in each language. We estimate surprisal and contextual entropy from two types of autoregressive language models: a single, large, multilingual model (mGPT; Shliazhko et al., 2022), as well as monolingual models trained on large and small datasets, where the small dataset is the same size across languages (≈ 30 million words). We quantify the psychometric predictive power of surprisal and contextual entropy (i.e., how well each predicts reading times) by including them as variables in linear regression models. These models are then trained to predict by-word reading times; if the log-likelihood of the regression improves after including these variables, we take this as evidence that those variables have psychometric predictive power (Frank and Bod, 2011; Fossum and Levy, 2012; Goodkind and Bicknell, 2018).
We find that, in all languages tested, regression models that include surprisal are significantly better predictors of reading times than baselines which do not include surprisal, confirming the surprisal hypothesis. Additionally, we find that models which include contextual entropy are even better predictors of reading times in most languages tested, confirming the contextual entropy hypothesis. Finally, compatible with the linear link hypothesis, we find that models constrained to a linear relationship between surprisal and reading times are just as good as those that can express more complex relationships. Overall, our results provide the largest crosslinguistic analysis to date of the relationship between reading and word-level information-theoretic properties.

Psycholinguistic Predictive Power
Our behavior of interest is how long readers spend visually attending to a given word $w_t$ in its linguistic context, i.e., $w_t$'s reading time. This quantity offers a window into the psychological processes that underlie language comprehension and is typically taken as a direct reflection of the word's processing difficulty (Rayner, 1998). A word's reading time can be measured via multiple experimental modalities, including self-paced reading (Jegerski, 2013) and the maze task (Forster et al., 2009; Boyce et al., 2020). In this work, we focus on eye-tracking measurements. These measurements have high temporal resolution and exhibit smaller spillover effects than self-paced reading (Smith and Levy, 2013), where spillover is the effect of a word's properties on later words' reading behavior.
Following previous work investigating reading, we ask what factors associated with each word are helpful for predicting its reading times. Throughout this section, we use the following notation. With $w$, we denote a word taken from an alphabet $\Sigma$. With $\mathbf{w} \in \Sigma^*$, we denote a string of words over the alphabet $\Sigma$. We write $w_t$ for the word at index $t$ in a string $\mathbf{w} = w_1 \cdots w_T$ with $1 \leq t \leq T$. Additionally, let $\mathrm{EOS} \notin \Sigma$ be a distinguished end-of-string symbol not in $\Sigma$ and let $\overline{\Sigma} \overset{\mathrm{def}}{=} \Sigma \cup \{\mathrm{EOS}\}$ be an augmented alphabet that includes EOS. With each word $w_t$ in a context $\mathbf{w}_{<t}$, we associate a real column vector of predictor variables $\mathbf{x}_t$ that we believe may impact reading times. Many of these predictors are attributes of $w_t$ itself, e.g., $w_t$'s length. We use $\mathbf{x}_t$ as predictors in a regression model $f_\phi$ with parameters $\phi$. The regression model is estimated from data to predict $w_t$'s reading time. In symbols, we write that
$$y(w_t, \mathbf{w}_{<t}) \sim f_\phi(\cdot \mid \mathbf{x}_t)$$
where $y(w_t, \mathbf{w}_{<t})$ is the reading time of word $w_t$ in context $\mathbf{w}_{<t}$. To be explicit, in our formulation we treat reading times as a continuous quantity and, thus, $f_\phi$ is a probability density.
In order to contrast different theories of language processing, we compare regression models with different vectors of predictor variables $\mathbf{x}$ and with different architectures $f_\phi$, each of which is taken to instantiate a different hypothesis about what underlying factors determine reading times. We fit each regression model on a portion of our dataset and evaluate it by measuring the log-likelihood that it assigns to held-out data. Models that lead to higher log-likelihood can be said to have better predictive power or psychological accuracy for human reading, and their associated theories are then taken to be better models of the underlying psycholinguistic processes (Frank and Bod, 2011; Fossum and Levy, 2012).
Typically, for each experiment we will define a target regression model, which is trained to predict the reading times of individual words from a set of baseline predictors plus a predictor of interest (e.g., surprisal or contextual entropy). For a specific index $t$, we will refer to these predictors as our target predictors and denote them as $\mathbf{x}^{\mathrm{tgt}}_t$. We also define a baseline regression model that includes only the baseline predictors, which are a subvector of the target predictors, denoted as $\mathbf{x}^{\mathrm{base}}_t$ for a specific $t$. We denote baseline and target regression models symbolically as $f_\phi(\cdot \mid \mathbf{x}^{\mathrm{base}}_t)$ and $f_\phi(\cdot \mid \mathbf{x}^{\mathrm{tgt}}_t)$, respectively. Unless otherwise specified, the regression models that we use in this study are all linear. The choice to use linear linking functions, and whether this assumption is warranted, is addressed directly in Section 5. In order to assess whether the target predictors have contributed to better predictive power, we will inspect the (average) by-word difference in log-likelihood assigned by the two regression models to a held-out dataset (Goodkind and Bicknell, 2018; Wilcox et al., 2020). Following previous studies, we refer to this metric as the delta log-likelihood ∆, which is defined, for a specific index $t$, as
$$\Delta_t = \log f_\phi\big(y(w_t, \mathbf{w}_{<t}) \mid \mathbf{x}^{\mathrm{tgt}}_t\big) - \log f_\phi\big(y(w_t, \mathbf{w}_{<t}) \mid \mathbf{x}^{\mathrm{base}}_t\big)$$
where $y(w_t, \mathbf{w}_{<t})$ is the observed reading time of word $w_t$ in context $\mathbf{w}_{<t}$. The complete metric ∆ is the average of $\Delta_t$ over all word indices. A positive ∆ means that the target predictors contribute to psycholinguistic predictive power above the baseline predictors, whereas a ∆ of zero indicates that the added predictors either lack a robust relationship with reading times or that their functional relationship cannot be approximated by the class of models $f_\phi$ we employ.$^{2}$ Below, we briefly introduce the two target predictors associated with the theories that we wish to test: surprisal and contextual entropy.
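The delta log-likelihood can be illustrated with a minimal sketch. It assumes that each regression model yields a Gaussian predictive density, with one predicted mean per held-out word and a single residual standard deviation; the function names and the toy reading times below are hypothetical and are not taken from the study.

```python
import math


def gaussian_logpdf(y, mean, sd):
    """Log density of N(mean, sd^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (y - mean) ** 2 / (2 * sd ** 2)


def delta_loglik(y, pred_tgt, pred_base, sd_tgt, sd_base):
    """Return the per-word differences Delta_t and their average Delta.

    y         : observed held-out reading times (ms)
    pred_tgt  : means predicted by the target regression model
    pred_base : means predicted by the baseline regression model
    sd_*      : residual standard deviation assumed for each model
    """
    deltas = [
        gaussian_logpdf(yi, mt, sd_tgt) - gaussian_logpdf(yi, mb, sd_base)
        for yi, mt, mb in zip(y, pred_tgt, pred_base)
    ]
    return deltas, sum(deltas) / len(deltas)


# Hypothetical held-out data: the target model's predictions are closer to y.
y = [210.0, 250.0, 190.0]
pred_tgt = [205.0, 245.0, 195.0]
pred_base = [220.0, 230.0, 200.0]
deltas, delta = delta_loglik(y, pred_tgt, pred_base, sd_tgt=20.0, sd_base=20.0)
```

In this toy case every per-word difference is positive, so the average is positive as well, which is the pattern interpreted in the text as added predictive power.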

Surprisal
The surprisal (Shannon, 1948) of a word $w_t$ measures the information content it conveys in the context in which it appears. Using Shannon's formulation of entropy, we can define surprisal as
$$s(w_t) \overset{\mathrm{def}}{=} -\log_2 p(w_t \mid \mathbf{w}_{<t})$$
where $p(\cdot \mid \mathbf{w}_{<t})$ is the true distribution over words $w \in \overline{\Sigma}$ in context $\mathbf{w}_{<t}$, which we omit from the notation for brevity. We focus here on reading, where the relevant context to compute surprisal is $w_t$'s preceding words $\mathbf{w}_{<t}$. However, in our studies, we do not have access to the true distribution $p(\cdot \mid \mathbf{w}_{<t})$ and instead estimate it using an autoregressive language model, as is common in previous studies (Smith and Levy, 2013; Goodkind and Bicknell, 2018; Wilcox et al., 2020).

Contextual Entropy
The contextual entropy of a $\overline{\Sigma}$-valued random variable $W_t$ at index $t$ is the expected value of its surprisal, which can be expressed as
$$\mathrm{H}(W_t \mid \mathbf{W}_{<t} = \mathbf{w}_{<t}) \overset{\mathrm{def}}{=} -\sum_{w \in \overline{\Sigma}} p(w \mid \mathbf{w}_{<t}) \log_2 p(w \mid \mathbf{w}_{<t})$$
Again, as we do not have access to the true distribution $p$, we resort to estimating the contextual entropy using an autoregressive language model. Prior work has investigated the relationship between different entropy-based measures and reading behavior: A number of studies have investigated entropy reduction, or the extent to which $w_t$ reduces uncertainty over possible next words (Frank, 2010, 2013) or the possible incremental parses that can be assigned to a sentence prefix (Hale, 2003, 2006). Other researchers have investigated the effect of successor entropy, i.e., the entropy of $W_{t+1}$, on predicting the current-word reading times (Roark et al., 2009; Linzen and Jaeger, 2016; van Schijndel and Schuler, 2017).$^{3}$ In contrast, we look at the effect of $W_t$'s contextual entropy on prediction, following Pimentel et al. (2023) and Cevoli et al. (2022). As discussed in Pimentel et al. (2023), investigating contextual entropy separately from surprisal can uncover to what extent reading behavior is responsive (i.e., driven by surprisal) or anticipatory (i.e., driven by expected surprisal). Pimentel et al. (2023) specifically found that contextual entropy is a significant predictor of reading times on 3 out of 4 of their tested English eye-tracking and self-paced reading datasets.
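The relationship between the two predictors can be made concrete with a small sketch: surprisal is the negative log probability of the word that actually occurs, and contextual entropy is the probability-weighted average of surprisal over every word that could occur. The three-word next-word distribution below is hypothetical, and base-2 logarithms are assumed so that both quantities are in bits.

```python
import math


def surprisal(prob):
    """Surprisal in bits of an outcome with probability prob."""
    return -math.log2(prob)


def contextual_entropy(dist):
    """Expected surprisal (in bits) under a next-word distribution.

    dist maps each candidate word to its probability in the current context;
    zero-probability words contribute nothing to the sum.
    """
    return sum(p * surprisal(p) for p in dist.values() if p > 0)


# Hypothetical next-word distribution in some context (sums to 1).
next_word = {"cat": 0.5, "dog": 0.25, "EOS": 0.25}

s_dog = surprisal(next_word["dog"])   # surprisal of the observed word "dog"
h = contextual_entropy(next_word)     # entropy, known before the word is seen
```

Note that the entropy depends only on the context, whereas the surprisal also depends on which word is observed; this is what allows the two to be entered as separate predictors.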

Dataset
We use the Multilingual Eye Movement Corpus (MECO; Siegelman et al., 2022). MECO contains eye-tracking data from L1 speakers (between 29 and 54 per language) for 12 simplified Wikipedia-style articles in thirteen languages; these languages are from five different language families. Articles in the MECO corpus went through an iterative translation process by separate teams of translators to ensure that article content was the same across languages. The texts range from a minimum of 1,487 total words (Finnish) to a maximum of 3,021 total words (Russian). The eleven languages we include in our analysis are: Korean (Koreanic), Turkish (Turkic), Hebrew (Semitic), Finnish (Uralic), Dutch, English, German, Greek, Italian, Russian, and Spanish (Indo-European).$^{4}$ This set of languages is more diverse than those of previous studies, which have tended to focus exclusively on a single language.
The following pre-processing steps were taken: Words that were skipped on the first pass were given a reading time of zero and included in the analysis. Eye-tracking datasets report multiple different word-based measurements of reading times, of which we use three (Rayner, 1998): First fixation duration is the duration of the first fixation on a word during its first pass. Gaze duration is the sum of all first-pass fixations on a word. Total fixation time is the sum of all fixations on a word during the trial. While we report results for all three for the sake of completeness, our discussion will focus on results for gaze duration, as has been done in previous studies, e.g., Wilcox et al. (2020). First fixation times are typically associated with word identification (Clifton et al., 2007) and are not expected to reflect strong contextual influences. Total fixation times can be influenced by material from the right context (i.e., via regressive saccades). Thus, for studies that focus on progressive movement through a text, such as ours, gaze duration is expected to be most strongly associated with first-pass processing difficulty, which is our cognitive process of interest. For each of these metrics, we fit a regression model on averages of the reading time measures taken across subjects, as has been done in previous work (Smith and Levy, 2013; Wilcox et al., 2020). This step was performed to mitigate the potentially high by-participant variance present in eye-tracking data.
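The three measures defined above can be sketched for a single reader and a single text as follows. The input format (a temporally ordered list of word-index and duration pairs) and the function name are hypothetical; MECO distributes these measures precomputed, so this is only an illustration of the definitions. The sketch assumes that a word counts as read on the first pass only if it is fixated before any later word, and that a word skipped on the first pass keeps a first fixation and gaze duration of zero while still accumulating total fixation time from later regressions.

```python
def reading_measures(fixations, n_words):
    """Compute first fixation, gaze duration, and total fixation time per word.

    fixations : list of (word_index, duration_ms) in temporal order
    n_words   : number of words in the text
    """
    first_fix = [0] * n_words
    gaze = [0] * n_words
    total = [0] * n_words
    furthest = -1  # rightmost word index fixated so far
    i = 0
    while i < len(fixations):
        word = fixations[i][0]
        # Collect the run of consecutive fixations on the same word.
        run = []
        while i < len(fixations) and fixations[i][0] == word:
            run.append(fixations[i][1])
            i += 1
        total[word] += sum(run)
        if word > furthest:
            # First pass: the word is reached before any later word.
            first_fix[word] = run[0]
            gaze[word] = sum(run)
            furthest = word
    return first_fix, gaze, total


# Hypothetical scanpath: word 1 is skipped, then revisited by a regression.
scanpath = [(0, 200), (0, 50), (2, 180), (1, 120), (2, 90)]
first_fix, gaze, total = reading_measures(scanpath, n_words=3)
```

Averaging each measure over readers, word by word, would then give the by-word values that the regression models are fit to.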

Language Models
We derive surprisal and contextual entropy estimates from both monolingual and multilingual models, which we describe in greater detail below.

Monolingual Models
We train monolingual transformer models using the Wiki40B dataset (Guo et al., 2020), from which we rely on the training and validation splits from the original paper for each of our analyzed languages. We first fit language-specific UnigramLM tokenizers (Kudo, 2018) with a vocabulary size of 32k on the training portion of this dataset, which we then use to tokenize both the Wiki40B and MECO text into subword units. We then train two models per language, with different amounts of training data: For the monoT(all) variant, we train the model on the total amount of data in Wiki40B for each language; for the monoT(30m) variant, we subsample ≈ 30 million tokens from each language. For a list of the training dataset sizes for the monoT(all) models, as well as a list of language codes that will be used in figures, see Table 1. We train all our models using fairseq (Ott et al., 2019), following their recommended language modeling training hyper-parameters. We use a standard decoder-only transformer with 6 layers, a context window size of 512 tokens, and shared input-output embeddings. We train our models using Adam (Kingma and Ba, 2015), with a learning rate of $5 \times 10^{-4}$, 4000 warm-up updates, and dropout of 0.1.
For both of our monolingual models, as well as the multilingual model described below, per-word surprisals are computed by summing over subword unit surprisals, which is the appropriate procedure since surprisal decomposes additively over the units comprising a signal. Because of spurious ambiguity inherent in the tokenization scheme, an efficient algorithm to estimate contextual entropy over full words is unavailable to us; such an algorithm requires summing over an infinite number of sub-word combinations. Instead, we simplify this computation by estimating contextual entropy over one single step of sub-word tokens, as suggested in Pimentel et al. (2023). Techniques similar to this have been employed previously in studies of entropy (Frank, 2010), e.g., to account for clitics and contractions.
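The additive decomposition follows from the chain rule: if a word is split into subword units $a$ and $b$, then $-\log p(a, b) = -\log p(a) - \log p(b \mid a)$. A minimal sketch of the aggregation step is given below. It assumes that tokens use the SentencePiece-style word-initial marker "▁" (U+2581) and that per-token surprisals have already been obtained from a language model; the token strings and surprisal values are hypothetical.

```python
def word_surprisals(tokens, token_surprisals, boundary="\u2581"):
    """Sum subword surprisals into per-word surprisals.

    tokens           : subword strings; a word-initial token starts with boundary
    token_surprisals : surprisal of each token given all preceding tokens
    Returns a list of (word, surprisal) pairs.
    """
    words, totals = [], []
    for tok, s in zip(tokens, token_surprisals):
        if tok.startswith(boundary) or not words:
            # A new word begins here.
            words.append(tok.lstrip(boundary))
            totals.append(s)
        else:
            # Continuation of the current word: surprisals add up.
            words[-1] += tok
            totals[-1] += s
    return list(zip(words, totals))


# Hypothetical tokenization of "surprisal theory" with made-up surprisals (bits).
tokens = ["\u2581sur", "pris", "al", "\u2581theory"]
token_s = [3.0, 1.5, 0.5, 4.25]
per_word = word_surprisals(tokens, token_s)
```

No comparable shortcut exists for entropy over whole words, which is why the text falls back on the entropy of a single subword step.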

Multilingual Model
We use mGPT (Shliazhko et al., 2022), a multilingual autoregressive language model, which was trained with the GPT-3 architecture on 60GB of text$^{5}$ from a combination of Wikipedia and the Cleaned Common Crawl Corpus (Raffel et al., 2020).

Context Length
One recent study has hypothesized that, when deriving surprisal estimates for psycholinguistic modeling, the size of the context window can bias estimates (Hoover et al., 2022). Their reasoning is that short context windows could shift probability mass away from very low-frequency words, which would be better predicted from longer contexts. Therefore, we estimate surprisal and contextual entropy from mGPT in two contexts: In short contexts the model is given only the current sentence (up until the current word); in long contexts we use the model's full input window size of 512 characters. We use long contexts for our first analysis, and use both contexts for our second analysis, which investigates both the shape of the reading times-surprisal linking function and the influence of context length on these results.

Psychological Plausibility
Increasingly, researchers who use language models for cognitive modeling have considered their psychological plausibility as estimates of humans' internal notions of word predictability. In particular, some researchers have compared the size of the models' training data to the amount of linguistic experience of the average human child (Zhang et al., 2021). Assuming that children are typically exposed to ≈ 11 million words per year as an upper limit (Hart and Risley, 1995), the mGPT model is trained on multiple human lifetimes' worth of language data. The monoT(all) models are trained on data scales equivalent to or less than one human lifetime,$^{6}$ and the monoT(30m) models are trained on data equivalent to the linguistic exposure of a young child. However, we argue that the psychological plausibility of a model's next-word predictions is not completely determined by whether that model's training data is the same size as the amount of data a human learner is exposed to. Indeed, there is a body of evidence suggesting that, beyond a certain minimal amount of data, the more data a model is trained on, the more human-like that model's next-word predictions become (Goodkind and Bicknell, 2018; Wilcox et al., 2020). All of our models are trained on an amount of data within this range. At the other end of the scale, the relationship flips: Models trained on an extremely large amount of data seem to be slightly worse predictors of human reading (Shain et al., 2022; Oh and Schuler, 2023). For our models, training datasets are uni-modal (i.e., language only) and learning is with arguably weaker priors for language-like structure, whereas humans learn from multi-modal data with potentially much stronger priors for linguistic structures. Likely, more data makes up for the lack of multi-modal data and uninformative priors.
[Figure 1 caption, partially recovered: Stars indicate the significance of a paired permutation test. We find a consistent significant effect of surprisal across languages for language models that are both multilingual (top row) and monolingual (bottom two rows), and for both progressive gaze duration and total fixation.]
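The comparison between training-set size and human linguistic exposure is simple arithmetic, sketched below. The only figure taken from the text is the upper-limit estimate of ≈ 11 million words per year; treating the ≈ 30 million training tokens of the monoT(30m) models as words is a simplifying assumption of this sketch, and the helper name is hypothetical.

```python
# Upper-limit exposure estimate cited in the text (Hart and Risley, 1995).
WORDS_PER_YEAR = 11_000_000


def years_of_exposure(n_words, words_per_year=WORDS_PER_YEAR):
    """Years of linguistic exposure matching a corpus of n_words words."""
    return n_words / words_per_year


# About 30 million words, the size used for the monoT(30m) models.
small_model_years = years_of_exposure(30_000_000)
```

Under these assumptions the small models see roughly 2.7 years' worth of input, which is consistent with the description of their data as comparable to the exposure of a young child.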

Regression Models
All of our regression models are fit to predict the reading time $y(w_t, \mathbf{w}_{<t})$ of a word $w_t$ in a context $\mathbf{w}_{<t}$ from the predictor vector $\mathbf{x}_t$. In addition to looking at the word $w_t$, our predictor includes quantities derived from the previous two words $w_{t-1}$, $w_{t-2}$ to control for potential spillover effects.
We will refer to the three words $w_t$, $w_{t-1}$, $w_{t-2}$ as our regressor words. Following previous work in this area, all regression models include the word length and log-unigram frequency, as estimated by Speer (2022), for all regressor words in a predictor $\mathbf{x}_t$ for a specific index $t$. The predictors above constitute our (context-invariant) baseline predictors.
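A minimal sketch of how such predictor vectors could be assembled is given below. It assumes that word length is measured in characters, that log-unigram frequencies are supplied as a list aligned with the words, and that the first two words of a text, which lack a full spillover history, are simply dropped; none of these details are specified in the text, and the function name and toy values are hypothetical.

```python
def build_predictors(words, log_freq, surprisal=None, n_spill=2):
    """Build one predictor row per word from w_t and its n_spill predecessors.

    Each row lists length and log frequency (and, if given, surprisal)
    for w_t, then w_{t-1}, then w_{t-2}.
    """
    rows = []
    for t in range(n_spill, len(words)):
        row = []
        for k in range(n_spill + 1):  # k = 0 is w_t, k = 1 is w_{t-1}, ...
            row.append(len(words[t - k]))
            row.append(log_freq[t - k])
            if surprisal is not None:
                row.append(surprisal[t - k])
        rows.append(row)
    return rows


# Hypothetical four-word text with made-up frequencies and surprisals.
words = ["the", "cat", "sat", "down"]
log_freq = [7.0, 4.0, 3.5, 5.0]
surp = [1.0, 6.0, 5.0, 2.0]

x_base = build_predictors(words, log_freq)        # baseline predictors
x_tgt = build_predictors(words, log_freq, surp)   # baseline plus surprisal
```

Every baseline entry also appears in the corresponding target row, so the baseline predictors form a subvector of the target predictors.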
Regression models are trained and evaluated using 10-fold cross validation. For more information on the regressions used in each of our experiments, see Appendix A. The significance of the observed ∆ values between target and baseline models is assessed via a paired permutation test that checks whether ∆ is significantly different from zero. We use permutation tests for our comparisons because they make no assumption about the distribution of the test statistic. Instead, the test uses the empirical distribution of differences in likelihoods, as estimated using averages computed over permutations of likelihoods, in order to compute p-values.
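One common way to implement a paired permutation test on per-word log-likelihood differences is sign flipping: under the null hypothesis that target and baseline are interchangeable, swapping the two likelihoods for a word only flips the sign of its difference. The sketch below follows that scheme as an assumption; the exact permutation procedure used in the study is not described here, and the function name, number of permutations, and toy inputs are hypothetical.

```python
import random


def paired_permutation_test(deltas, n_perm=10000, seed=0):
    """Two-sided p-value for the mean of paired differences being zero.

    Each permutation randomly swaps target and baseline within a pair,
    which is equivalent to flipping the sign of that pair's difference.
    """
    rng = random.Random(seed)
    n = len(deltas)
    observed = abs(sum(deltas) / n)
    hits = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in deltas) / n
        if abs(permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-word differences that are consistently positive.
consistent = [0.10 + 0.01 * i for i in range(20)]
p_consistent = paired_permutation_test(consistent)

# Differences that cancel out exactly give no evidence against the null.
p_null = paired_permutation_test([1.0, -1.0, 1.0, -1.0])
```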

Surprisal
To test the surprisal hypothesis, we fit a target regression model whose predictors include the surprisals of our regressor words plus our baseline predictors described above. We compare this to a baseline that does not include the surprisal predictors. For this and subsequent tests, we calculate results for each language individually, as well as for the combined data from all languages. Results can be seen in Figure 1, broken down by language, model, and each of our three word-based measurements of reading time. We observe a clear pattern in the results across the languages: Positive ∆ in nearly every test for gaze duration and total fixation, and less consistently positive ∆ for first fixation, where, as noted before, we would not necessarily expect surprisal effects to show up. Looking at the results for each model, we observe the most robust results for mGPT, where ∆ is significantly greater than zero in every language for gaze duration and total fixation. For the monolingual models, we observe more robust effects for the monoT(all) model than for the monoT(30m) model, which is sensible given the latter's limited training data size.
For an aggregate test of the effects of surprisal, we fit an additional regression model on the combined data from all languages to predict gaze duration with random by-language effects. We use a fully maximal random effect structure, as advocated in Barr et al. (2013). We find that the model with surprisal leads to a ∆ significantly greater than zero in all cases (p < 0.001). Although surprisal leads to a positive ∆ across languages, we do observe some variation in the magnitude of this effect, i.e., in the predictive power obtained by the regression model. For both mGPT and monoT(all) we observe the highest predictive power in Russian and Dutch, with lower predictive power in Spanish, English, and Hebrew. One natural question to ask is whether imbalances in the models' training data lead to some of this variation: Do models make better predictions for languages where they have seen more data? However, there are converging pieces of evidence from our data suggesting that differences in dataset size are not the main cause of the by-language variation. First, both mGPT and monoT(all) show relatively lower predictive power for some large-data languages such as Spanish and English. Second, and quite interestingly, similar patterns of predictive power can be observed for our monoT(30m) models, where training dataset size is controlled across languages. Here, as with the other models, we observe larger values of ∆ in Dutch and Russian and smaller values of ∆ in English, Spanish and Hebrew. These results pose a puzzle, as the languages for which the models obtain higher ∆ are not obviously different from those for which the models obtain lower ∆, in terms of their linguistic features. For example, English (lower ∆) and Dutch (higher ∆) are both West Germanic. Further investigation is needed to determine if these patterns hold up for other crosslinguistic reading time datasets.

Contextual Entropy
To test the contextual entropy hypothesis, we first fit a single baseline regression model. Our baseline regression model includes the surprisal of all regressor words, plus baseline predictors.
We then evaluate target regression models in two variants: For the replace regression model, we replace surprisal with contextual entropy for all regressor words. For the add regression model, we add an additional term of contextual entropy for all regressor words. As results do not change much between our monolingual language models, we present results for monoT(all).
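The three predictor sets compared in this experiment can be written out explicitly. The sketch below only enumerates predictor names for the three regressor words; the naming scheme is hypothetical and no regression is fit.

```python
REGRESSOR_WORDS = ("t", "t-1", "t-2")   # w_t and the two preceding words
CONTEXT_FREE = ("length", "log_freq")   # context-invariant predictors


def predictor_names(variant):
    """List the predictors used by the baseline, replace, or add model."""
    per_word = {
        "baseline": CONTEXT_FREE + ("surprisal",),
        "replace": CONTEXT_FREE + ("entropy",),
        "add": CONTEXT_FREE + ("surprisal", "entropy"),
    }[variant]
    return [f"{name}[{pos}]" for pos in REGRESSOR_WORDS for name in per_word]


baseline = predictor_names("baseline")
replace = predictor_names("replace")
add = predictor_names("add")
```

The add model nests the baseline, since it contains every baseline predictor, whereas the replace model does not, so the two comparisons answer different questions: whether entropy helps beyond surprisal, and whether entropy can stand in for it.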
Results can be seen in Figure 2, where the replace regression is indicated with a triangle and the add regression is indicated with a circle. First, we find that replacing surprisal with entropy tends to hurt predictive power in most cases. For example, for mGPT, ∆ is negative in 6/11 languages and significantly so in two (Dutch (p < 0.05) and Italian (p < 0.05)), implying overfitting. Negative effects are even stronger for the monoT(all) model, where we find negative gaze duration ∆ in every language (results are significant in 5/11). Adding entropy as an additional predictor, on the other hand, generally improves the model's predictive power. For example, for mGPT and gaze duration, ∆ from the add regression is positive in 8/11 languages, and significantly so in 5 (English, Greek, Korean, Russian and Turkish). In addition, ∆ is significantly positive for the add regression for all three reading time measures when data is combined across languages, as shown in the 'All' column at the left of Figure 2. Results are less strong for monoT(all), where positive ∆ shows up predominantly for first fixation. As before, we run an aggregate test with data from all languages including by-language random effects.$^{7}$ For gaze duration, we find that adding contextual entropy leads to positive ∆ (mGPT, p < 0.001; monoT(all), p < 0.01) and that replacement leads to negative ∆ (mGPT, p < 0.01; monoT(all), p < 0.001). Overall, we take these results as being in line with those reported in Pimentel et al. (2023). Our findings suggest that contextual entropy has a weak, albeit consistent, effect on reading times across languages, and therefore that participants may be pre-planning their processing times based on the expected surprisal of upcoming words.

Variation Across Languages
The crosslinguistic relationship between ∆ and language model quality is relevant to current debates about whether language models can plausibly be used to understand psycholinguistic processes. As mentioned in Section 3.2, it has been observed that, within English, models with lower perplexity tend to exhibit better predictive power (Goodkind and Bicknell, 2018; Wilcox et al., 2020). However, studies on Japanese have failed to replicate these results, suggesting that the relationship does not hold for all languages (Kuribayashi et al., 2021). Further, Oh and Schuler (2023) and Shain et al. (2022) show that this relationship may not hold even in English for the most recent language models. To investigate this, we compute, for mGPT, the Pearson's correlation between ∆ and test set perplexity, as reported in Shliazhko et al. (2022), both across languages and across language families.$^{8}$ For this analysis we show results for mGPT only and leave a full analysis comparing different monolingual models for future work. The correlations can be seen in Figure 4. We do find a relatively strong negative correlation across languages; however, it is not significant (ρ = −0.497, p = 0.1). We do not find any evidence of correlation in the language family data. Although the negative by-language correlation suggests that, for languages where mGPT has lower perplexity, it may be a better model of psycholinguistic behavior, the lack of significance is in line with the negative results from Japanese.
[Figure 2 caption, partially recovered: We find that replacing surprisal with entropy tends to hurt predictive power, while adding entropy tends to help.]
Notably, there are important differences between this analysis and the studies cited above, which train a number of different language models within a single language and a single shared vocabulary, as opposed to comparing the outputs of a single multilingual language model across languages as we do here. Additionally, although mGPT does share a single vocabulary across languages, different languages might be a priori harder or easier to language-model (Cotterell et al., 2018; Mielke et al., 2019), and the quality of the tokenization might vary across languages as well. Thus, more fine-grained linguistic controls are necessary before drawing strong conclusions about the relationship between perplexity and psychometric predictive power across languages.
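The correlation analysis pairs one perplexity value with one ∆ value per language (or per language family) and computes Pearson's correlation coefficient over those pairs. A minimal sketch is given below; the four data points are hypothetical and are not the values behind the reported ρ = −0.497.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-language values: lower perplexity paired with higher delta.
perplexity = [10.0, 20.0, 30.0, 40.0]
delta_ll = [0.04, 0.03, 0.02, 0.01]
r = pearson_r(perplexity, delta_ll)
```

With only eleven languages, even a moderately strong coefficient can fail to reach significance, which is consistent with the caution expressed above.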

Model Coefficients
How do surprisal, entropy, frequency and length individually affect reading times? Figure 3 shows the estimates for each of our predictor variables, estimated across 10 folds of data. Unlike the figures presented above, effects are broken down by the coefficients for each of our regressor words from $w_t$ (on the left of each facet) to $w_{t-2}$ (on the right of each facet). Note that effect size here does not correspond to the predictive power of the model as a whole, but rather the impact of word-level properties on reading times. Because predictor variables are not normalized, units are different across rows. The top two rows indicate the estimated slowdown in milliseconds for each additional bit (of surprisal or entropy). The third row indicates slowdown for each additional occurrence per billion words of text (on a log scale). And the bottom row indicates slowdown for each additional character in the word. We find a consistent effect of surprisal for $w_t$ of between 2-4 ms/bit. There is some inter-language variability, with the smallest effect for Hebrew, and larger effects for Dutch, Russian, Greek and Italian. We find smaller effects for $w_{t-1}$, ranging between 0-2 ms/bit. There is no obvious effect of surprisal for $w_{t-2}$. Overall, these results differ slightly from those reported in Smith and Levy (2013), who investigate reading times on the English Dundee Corpus (Kennedy et al., 2003) and find a stronger effect for $w_{t-1}$ than we do. However, our results are not inconsistent with the relatively lower spillover effects traditionally observed in eye-tracking data.
Turning to contextual entropy, we find slightly smaller effects, and slightly more variance between languages. There is no obvious relationship between the effect sizes for surprisal and contextual entropy. For example, Dutch, which has a larger surprisal effect, has one of the smallest effect sizes for entropy. For frequency, we find a consistently negative effect for $w_t$, as expected: as words get more frequent, they take less time to read. For $w_{t-1}$ and $w_{t-2}$, effects are much smaller and less consistent across languages. For example, Dutch, Finnish, Italian and Russian all have consistently positive frequency effects for $w_{t-1}$, whereas in Turkish and Greek, these effects are negative.
We find consistent effects for word length, which are positive for every language on $w_t$. We also find consistent negative effects for $w_{t-1}$. This may be due to the fact that readers are likely to skip a word if it comes after a long word, which would be associated with a reading time of zero in our analysis. Overall, these coefficient estimates are in line with previous reading time studies and further highlight the crosslinguistic consistency of our results.
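To make the ms/bit reading of these coefficients concrete, the following minimal sketch computes the linear contribution of surprisal to the reading time of each word from the surprisals of $w_t$, $w_{t-1}$ and $w_{t-2}$. The slope values, the helper names (`lagged`, `surprisal_slowdown`) and the example surprisals are hypothetical: the slopes are illustrative numbers chosen inside the ranges quoted above, not estimates from this study, and the other predictors (entropy, frequency, length) are left out for brevity.

```python
from typing import List, Sequence

# Hypothetical per-bit slopes (ms/bit) for w_t, w_{t-1}, w_{t-2}.
# Illustrative values only, picked inside the ranges quoted in the text;
# they are not estimates from the paper.
SURPRISAL_SLOPES_MS_PER_BIT = (3.0, 1.0, 0.0)


def lagged(values: Sequence[float], t: int, n_lags: int = 3) -> List[float]:
    """Return [x_t, x_{t-1}, x_{t-2}], padding with 0.0 before the text start."""
    return [values[t - k] if t - k >= 0 else 0.0 for k in range(n_lags)]


def surprisal_slowdown(surprisals: Sequence[float], t: int) -> float:
    """Linear contribution (ms) of surprisal to the reading time of word t."""
    feats = lagged(surprisals, t, len(SURPRISAL_SLOPES_MS_PER_BIT))
    return sum(b * x for b, x in zip(SURPRISAL_SLOPES_MS_PER_BIT, feats))


# Surprisal (bits) of four consecutive words in a made-up sentence.
surprisals_bits = [2.0, 8.0, 5.0, 1.0]
slowdowns = [surprisal_slowdown(surprisals_bits, t) for t in range(len(surprisals_bits))]
print(slowdowns)  # [6.0, 26.0, 23.0, 8.0]
```

Under these assumed slopes, the third word (5 bits) inherits 8 ms of spillover from the 8-bit word before it, which is how a previous-word coefficient shows up in the predicted reading time of the current word.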

Surprisal-RT Linking Function
The regression models we have been using to assess ∆ have implicitly assumed a linear linking function between surprisal and reading time, a relationship that has been empirically verified in some previous studies in English (Smith and Levy, 2013; Wilcox et al., 2020; Shain et al., 2022). Other recent studies, however, have questioned linearity, including Meister et al. (2021) and Hoover et al. (2022), who argue for a superlinear relationship, and Brothers and Kuperberg (2021), who argue for a sublinear relationship. In this section, we directly test the linear link hypothesis. We compare the ∆ of our linear regression models against regression models that can capture non-linear relationships.
We present results exclusively for gaze duration for the reasons discussed in Section 3.1.

Visualizing the Link with GAMs
In order to visualize the link between surprisal and reading times, we use generalized additive models (GAMs), a class of models that can fit non-linear relationships between predictor and response variables. Given the less-constrained hypothesis space of the GAM, if the model finds a relationship that is (visually) linear, this is good first evidence that the underlying effect is linear. We fit a GAM to predict reading times from word frequency, length and surprisal, derived for short contexts (sentence level) and long contexts (document level). We include smooth terms for current and previous word surprisal, as well as tensor product terms for a nonlinear interaction between log-frequency and word length. By way of comparison, we also fit a GAM that enforces a linear effect of surprisal, following Hoover et al. (2022). For this comparison, we fit new models, all using the mgcv library, as opposed to simply comparing GAMs to our linear models from the previous section, to ensure that the effects of our baseline variables are exactly the same between models in this section. For each language and language model combination, we visualize the fitted curve using 10-fold cross validation, i.e., we train a GAM on 9 of the 10 folds and sample reading times from the trained model using the remaining fold. To sample reading times, we vary the surprisal values for $w_t$ from 0 to 20 in increments of 0.1. No other predictors are fed into the model. The visualizations of the estimated GAMs for effects on $w_t$ can be seen in Figure 5. Below the fit, we show density plots for surprisal values in the corpus. The results are consistent across languages and contexts. Visually, the non-linear GAMs capture the effect of surprisal on reading times by fitting an approximately linear curve, which sometimes falls directly on top of the linear control GAM (e.g., for Finnish and Turkish). Unlike Hoover et al. (2022), we do not find a consistent difference for fits between surprisals derived in short contexts versus long contexts. We note, however, that Hoover et al. (2022) find superlinear trends especially for the best models they examined (e.g., GPT-3), which may outperform the multilingual mGPT.
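The sweep over surprisal values described above can be sketched as follows. This is not a GAM fit: the two curves are made-up stand-ins for fitted smooths (a fitted GAM smooth would take their place), and `surprisal_grid`, `line_fit` and `max_departure_from_line` are hypothetical helper names. The sketch only shows how a curve evaluated on the 0 to 20 grid in steps of 0.1 can be compared against its best-fitting straight line, which puts a number on what "visually linear" means.

```python
from typing import Callable, List, Tuple


def surprisal_grid(lo: float = 0.0, hi: float = 20.0, step: float = 0.1) -> List[float]:
    """Evenly spaced surprisal values; integer stepping avoids float drift."""
    n = round((hi - lo) / step)
    return [lo + i * step for i in range(n + 1)]


def line_fit(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Closed-form least-squares slope and intercept for a single predictor."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def max_departure_from_line(curve: Callable[[float], float]) -> float:
    """Largest gap (ms) between a curve on the grid and its best-fitting line."""
    xs = surprisal_grid()
    ys = [curve(x) for x in xs]
    slope, intercept = line_fit(xs, ys)
    return max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


# Two made-up partial-effect curves standing in for fitted GAM smooths.
def linear_curve(s: float) -> float:
    return 3.0 * s  # constant 3 ms per bit


def superlinear_curve(s: float) -> float:
    return 0.15 * s * s  # slowdown accelerates with surprisal


grid = surprisal_grid()
print(len(grid))  # 201 grid points
print(max_departure_from_line(linear_curve) < 1e-6)     # True
print(max_departure_from_line(superlinear_curve) > 1.0)  # True
```

A curve that is truly linear departs from its best-fitting line only by floating-point error, whereas the assumed superlinear curve departs by several milliseconds at the ends of the grid.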

Testing Linearity
Although the GAM fits in Figure 5 are visually linear, we would like to test the question of linearity with a more rigorous method. To do so, we compare the ∆ of the linear and non-linear GAMs described above. ∆ is calculated by comparing each model to a shared baseline that includes only tensor product terms for frequency and length. The idea is that if the underlying relationship between surprisal and reading time is non-linear, then the non-linear GAMs should be able to achieve higher ∆, whereas if the underlying relationship is linear, then the non-linear GAMs would not have an advantage. Thus, a consistently null result across languages suggests that the relationship is linear.
The results of this comparison can be seen in Figure 6. Here, ∆ is slightly different for linear models than in Section 4.1, as we fit these models with tensor product terms for baseline predictors.
Visually, there is no consistent difference between linear and non-linear models across languages. We test the difference in ∆ statistically with permutation tests, as described in Section 3.3. Our tests do not support the alternative hypothesis at α = 0.05 for any of the models or languages. Together with the visualizations presented above, these results support a linear linking function between surprisal and reading times.
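As a rough illustration of this kind of comparison, the sketch below runs an exact paired permutation test on per-fold ∆ values for a non-linear and a linear model. It assumes a two-sided sign-flip test on the mean paired difference; the exact procedure of Section 3.3 may differ. The function name `paired_permutation_pvalue` is hypothetical and the ∆ values are invented for the example, not taken from Figure 6.

```python
from itertools import product
from typing import Sequence


def paired_permutation_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip test on the mean paired difference a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    n_extreme = 0
    n_total = 0
    # Under the null, each paired difference is equally likely to have either sign,
    # so we enumerate all 2^n sign assignments (1024 for 10 folds).
    for signs in product((1.0, -1.0), repeat=len(diffs)):
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)) / len(diffs))
        # Small tolerance so ties are not lost to floating-point rounding.
        if permuted >= observed - 1e-12:
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Invented per-fold delta values (10 folds) for one language and one model.
delta_nonlinear = [0.022, 0.018, 0.025, 0.019, 0.020, 0.023, 0.021, 0.018, 0.023, 0.021]
delta_linear = [0.020, 0.019, 0.024, 0.021, 0.019, 0.023, 0.022, 0.017, 0.024, 0.020]

p_value = paired_permutation_pvalue(delta_nonlinear, delta_linear)
print(p_value > 0.05)  # True: no evidence that the non-linear model does better
```

With these made-up numbers the non-linear model wins on some folds and loses on others, so the test does not reject the null at α = 0.05. This is the pattern that, when repeated across languages, is read as support for a linear link.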

Implications of Psycholinguistic Theories
Throughout the paper, we have mentioned that the eleven languages studied come from five different language families, but what does this mean in terms of the actual linguistic characteristics that they exhibit? At the highest organizational level, our sample includes languages with multiple different word orders and headedness including SVO (Hebrew, English), SOV (Korean, Turkish), as well as languages with no dominant word order (German and Greek; Haspelmath et al., 2005). Our sample includes languages with extensive case marking such as Finnish (15 cases), as well as languages with extremely impoverished case systems, such as English. In terms of word construction, our sample includes both agglutinating languages (Turkish, Finnish and Korean) and fusional languages (Russian, Romance languages). While this set is not close to covering all the ways that human languages can vary, we bring up these differences to highlight that it does contain important high-level parametric variations observed in human languages.
In light of this, the stability observed in our results testing the surprisal hypothesis is rather remarkable. Across language families and model types, we observe essentially consistent results, in terms of the predictive power of the models, the effect size associated with surprisal, as well as the shape of the surprisal-reading-time relationship. Focusing first on predictive power, we find a relatively tight range of ∆ values associated with surprisal. For example, for gaze duration and mGPT, all ∆ values fall between 0.012 and 0.040. Indeed, across languages and models, we find relatively little variance in the predictive power of surprisal. Turning to the effect size of surprisal, we observe a millisecond-per-bit trade-off that falls between 2 and 4 ms/bit for every language (see Figure 3). The previous estimate of 3.75 ms of slowdown per bit of surprisal reported in Smith and Levy (2013) for English falls within this range (though note that this previous work used surprisal estimates derived from an n-gram model, which will generally be higher than surprisal estimates derived from large neural language models such as the ones we consider in this study). We take these results to suggest that humans may have stable crosslinguistic preferences for the rate at which they process information during reading, i.e., not greater than 4 milliseconds per bit of information. This is consistent with previous work that has observed crosslinguistic consistency in the rate of information during speech production (Pellegrino et al., 2011; Coupé et al., 2019), as well as trade-offs between the information content of a word and the time taken to produce it (Pimentel et al., 2021).
One point of difference between these and previous results, however, is the size of the effect of the surprisal of previous words. Looking at gaze duration in the Dundee corpus of English (Kennedy et al., 2003), Smith and Levy (2013) report an effect of previous-word surprisal which is about as strong as for the current word. We find much weaker effects in this study, ranging from 0 to 2 ms/bit. Note that this lower effect for previous words is in line with other processing measures which are strongly incremental, such as the maze task, where previous-word surprisal has little to no effect on reading time of the current word (Boyce and Levy, 2020), as well as with the results reported in Pimentel et al. (2023) for eye-tracking over the Provo (Luke and Christianson, 2018) and Dundee corpora. Turning to the shape of the surprisal-reading-time relationship, our results support the linear link hypothesis and are in line with the comprehensive results recently reported in Shain et al. (2022). Unlike Hoover et al. (2022), we do not observe superlinear surprisal-reading-time relationships for larger and more data-intensive language models, or for language models that had access to longer contextual windows. Interestingly, we do observe that the one language which visually appears to be superlinear (i.e., it has an upward curve in Figure 5) is English. Thus, while we believe Hoover et al. (2022) were right to be concerned by a potential visual nonlinearity in the English relationship, this effect does not appear to exist crosslinguistically and is not borne out by our statistical testing.
Surprisal theory is attractive because it offers a general-purpose link between statistical properties of natural language and human behavior. While its domain generality gives the theory a universal-like flavor, previous literature has (in our opinion) correctly refrained from overtly discussing it as a universal of human language processing. By conducting the most comprehensive crosslinguistic assessment of surprisal theory to date, this study presents initial evidence which supports the universality of surprisal effects in naturalistic reading. That being said, further testing is a necessary next step.

Implications of Multilinguality
As the number of multilingual language models has proliferated, it has become increasingly important to understand how they differ from more traditional, monolingual models. Previous studies have produced mixed results: Some have found that the larger training data scales of multilingual models lead to better performance (Conneau et al., 2020), while others have found advantages for monolingual models (Agerri et al., 2020; Rönnqvist et al., 2019; Virtanen et al., 2019), which are often attributed to monolingual models' language-specific tokenization and vocabulary representation. The majority of these previous studies have focused on masked language models (mostly using architectures based on the BERT model) and evaluation based on performance on downstream tasks (Doddapaneni et al., 2021). This study offers a useful complement to previous work by focusing on autoregressive models, as well as on their cognitive modeling capacities. Our results are more or less in line with previous studies, insofar as we find no obvious differences between our multilingual model and our monolingual models. Our results thus suggest that for computational linguists interested in cognitive modeling, multilingual and monolingual language models may be equally viable options. However, we would like to note that we did not compare models in truly low-resource settings, as the training datasets of our smallest monolingual models still included 30 million tokens. It may be the case that, when trained on much smaller datasets, multilingual models benefit from crosslingual transfer.

Concurrent Work
We want to briefly note the differences between the work presented here and a concurrent study that also used the MECO dataset (i.e., de Varda and Marelli, 2022). While de Varda and Marelli's research questions are similar to ours, their methods and conclusions are quite different. Instead of an autoregressive language model, they use a masked language model (mBERT; Devlin et al., 2019), which has access to both left and right context. An issue with this strategy is that the surprisal values produced by this setup are not psychologically plausible estimates of actual surprisals, which are estimated from the left context alone. This weakens the ability to test psycholinguistic causal claims about the relationship between surprisal and reading times. In their experiments, de Varda and Marelli do not find significant effects of pseudo-surprisal on gaze duration in four of the 12 languages in MECO, including English, and find significant effects of pseudo-surprisal on other eye movement measures in even fewer of the languages, which they view as evidence that surprisal might not be a consistent predictor of reading times across languages. While we are aligned on the importance of de Varda and Marelli's research questions, we believe that their failure to replicate surprisal effects for English, or to find them for other languages, reflects the limitations of their methodological choices.

Limitations and Future Directions
Turning back to our own study, there are a few limitations we would like to discuss: Although our sample of languages is much larger than those of previous studies, Indo-European languages are still overrepresented. Indeed, each of our non-Indo-European language families is represented by a single language. Additionally, all the data tested here comes from high-resource languages with long traditions of writing systems, and from individuals who live in industrialized societies. Finally, the methodology we employ here requires a large corpus of (written) language on which a language model can be trained. It may be the case that, for much lower-resource languages, there is not enough linguistic data to derive the statistical estimates needed to test surprisal theory in this manner. Thus, while our methods may be able to test the predictions of surprisal theory in lower-resource settings, where corpora of a few hundred thousand words exist, they may not be suitable for a large number of the world's languages. While our results put surprisal theory on firmer empirical footing, testing its predictions beyond these settings is an important and necessary step in assessing the theory's universality.

Conclusion
This paper has presented the most comprehensive crosslinguistic evaluation of surprisal theory reported in the literature to date. Using eye-tracking data from controlled materials in eleven languages across five language families, we have tested three hypotheses: (i) the surprisal hypothesis (surprisal is predictive of reading times), (ii) the contextual entropy hypothesis (contextual entropy is predictive of reading times), and (iii) the linear link hypothesis (the relationship between surprisal and reading times is linear). We found exceptionally strong crosslinguistic stability in our results, with each prediction being borne out in every language tested. These results provide the most robust link to date between information-theoretic quantities and incremental processing.

Figure 1: Predictive Power of Surprisal Across Languages: Positive values mean surprisal contributes to predicting the reading times over a baseline where surprisal is removed. Error bars indicate 95% confidence intervals. Stars indicate the significance of a paired permutation test. We find a consistent significant effect of surprisal across languages for language models that are both multilingual (top row) and monolingual (bottom two rows), and for both progressive gaze duration and total fixation.

Figure 2: Psychometric Predictive Power of Contextual Entropy Across Languages: Positive values mean contextual entropy contributes to predicting the reading times of $w_t$. Error bars are 95% confidence intervals across the 10 folds of held-out data. Stars indicate the significance of a paired permutation test. We find that replacing surprisal with entropy tends to hurt predictive power, while adding entropy tends to help.

Figure 3: Model Coefficients: Coefficients for a linear model that includes surprisal, entropy, frequency and length. Coefficients are shown for each regressor word individually. Zero is indicated with a black line and scales differ for each row. Error bars indicate 95% CIs across folds of data.

Figure 4: Test Perplexity versus ∆ (mGPT): We do not find a significant correlation between the ∆ and mGPT's perplexity for a language or language family.

Figure 5: Surprisal versus Reading Time Relationship: Non-linear GAMs are in green while linear control GAMs are in dotted blue. Shaded regions represent bootstrapped 95% confidence intervals. Results are for gaze duration. Grey subplots indicate the distribution of surprisal values. We find that GAMs recover a linear relationship between surprisal and reading-time slowdown.

Table 1: Training data information for our monolingual transformer models, noted as monoT(all)