On the Effect of Anticipation on Reading Times

Abstract Over the past two decades, numerous studies have demonstrated how less-predictable (i.e., higher surprisal) words take more time to read. In general, these studies have implicitly assumed the reading process is purely responsive: Readers observe a new word and allocate time to process it as required. We argue that prior results are also compatible with a reading process that is at least partially anticipatory: Readers could make predictions about a future word and allocate time to process it based on their expectation. In this work, we operationalize this anticipation as a word's contextual entropy. We assess the effect of anticipation on reading by comparing how well surprisal and contextual entropy predict reading times on four naturalistic reading datasets: two self-paced and two eye-tracking. Experimentally, across datasets and analyses, we find substantial evidence for effects of contextual entropy over surprisal on a word's reading time (RT): In fact, entropy is sometimes better than surprisal in predicting a word's RT. Spillover effects, however, are generally not captured by entropy, but only by surprisal. Further, we hypothesize four cognitive mechanisms through which contextual entropy could impact RTs, three of which we are able to design experiments to analyze. Overall, our results support a view of reading that is not just responsive, but also anticipatory.


Introduction
Language comprehension-and by proxy, the reading process-is assumed to be incremental and dynamic in nature: readers take one word as input at a time, process it, and then move on to the next word (Hale, 2001, 2006; Rayner and Clifton, 2009; Boston et al., 2011). Further, as each word requires a different amount of processing effort, readers must allocate differing amounts of time to them. Indeed, this effect has been confirmed by a number of studies, which show a word's reading time is a monotonically increasing function of the word's length and surprisal (Hale, 2001; Smith and Levy, 2008; Shain, 2019, inter alia).
Most prior work (e.g., Levy, 2008; Demberg and Keller, 2008; Fernandez Monsalve et al., 2012; Wilcox et al., 2020), however, focuses only on the responsive nature of the reading process, i.e., it looks solely at how readers' behaviour is influenced by attributes of already observed words. Such analyses make the implicit assumption that readers dynamically allocate resources to a word as they read it, and that planning ahead plays no part in reading time (RT) allocation. Albeit quite elegant in its simplicity, this theory might not capture the whole picture: a closer analysis of reading time data suggests that, in addition to being responsive, reading behaviour may also be anticipatory-readers' expectations before reaching a word may influence how they process it. One strong example of this is that readers often skip words while reading-a decision that must be made while the word's identity is still unknown.
In this work, we look beyond responsiveness and investigate anticipatory reading behaviours. Specifically, we look at how a reader's expectation about a word's surprisal-operationalised as that word's contextual entropy-affects the time taken to read it. For various reasons, however, a reader's anticipation may not exactly match a word's expected surprisal value, which would make the contextual entropy a poor operationalisation of anticipation. Rather, readers may rely on skewed approximations instead, e.g., anticipating that the next word's surprisal is simply the surprisal of the most likely next word. We use the Rényi entropy (a generalisation of Shannon's entropy) to operationalise these different skewed expectation strategies. Further, in order to better grasp the anticipatory nature of the reading process, we identify four mechanisms under which anticipation may impact reading: (i) word-skipping: readers may completely omit fixating on a word; (ii) budgeting: readers may allocate reading times for a word before reaching it; (iii) preemptive processing: readers may start processing a future word based on their expectations (and before knowing its true identity); (iv) coping with uncertainty: readers may incur an additional processing load when in high uncertainty contexts.
We design several experiments to investigate these mechanisms, analysing the relationship between readers' expectations about a word's surprisal and its observed reading times. We run our analyses on four naturalistic datasets: two self-paced reading and two eye-tracking. In line with prior work, we find a significant effect of a word's surprisal on its RTs across all datasets, reaffirming the responsive nature of reading. In addition, we find the word's contextual entropy to be a significant predictor of its RTs in three of the four analysed datasets-in fact, in two of these, the entropy is a more powerful predictor than the surprisal. Moreover, we find that the Rényi entropy with α = 0.5 consistently leads to stronger predictors than the Shannon entropy. This suggests readers may anticipate a future word's surprisal to be a function of the number of available continuations (as opposed to the actual expected surprisal).

Predicting Reading Times
One behaviour of interest in psycholinguistics is reading time allocation, i.e., how much time a reader spends processing each word in a text. Reading times are important for psycholinguistics because they offer insights into the mechanisms driving the reading process, and there exists a vast literature of such analyses (Rayner, 1998; Hale, 2001, 2003, 2016; Keller, 2004; Smith and Levy, 2008, 2013; van Schijndel and Schuler, 2016; van Schijndel and Linzen, 2018; Shain, 2019, 2021; Shain and Schuler, 2021, 2022; Wilcox et al., 2020; Meister et al., 2021; Hoover et al., 2022, inter alia). (For more comprehensive introductions to computational reading time analyses, see Rayner, 1998; Rayner et al., 2005.) The standard procedure for RT analysis is to first choose a set of variables x ∈ R^d which we believe may impact reading times-e.g., it is common to choose x = [|w_t|; u(w_t)], where |w_t| is the length of word w_t and u(w_t) is its frequency (i.e., its unigram log-probability). These variables are then used to fit a regressor of reading times:

    y ≈ f_φ(x)    (1)

where f_φ : R^d → R, φ are learned parameters, and y is a word's reading time. We can then evaluate this regressor by looking at its performance, measured, e.g., as f_φ(x)'s average log-likelihood on held-out data. When comparing different theories of the reading process, each may predict a different architecture f_φ or set of variables x to be used in eq. (1). We can then compare these theories by looking at the performance of their associated regressors: a theory that leads to higher log-likelihoods on held-out data has stronger predictive power, which implies that it is a better model of the underlying cognitive mechanisms. The model f_φ(x) can then be used to understand the relationship between the employed predictors x and reading times.
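To make this setup concrete, below is a minimal sketch of eq. (1): a linear regressor fit on two illustrative predictors (word length and unigram log-probability) and scored by its average held-out log-likelihood under a Gaussian error model. The data are random stand-ins, not one of the corpora analysed in this paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: one row per word token; columns are |w_t| and u(w_t).
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 12, size=1000),       # word length |w_t|
                     rng.normal(-8.0, 2.0, size=1000)])    # unigram log-probability u(w_t)
y = 200 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 30, size=1000)  # fake RTs (ms)

train, test = slice(0, 800), slice(800, 1000)
f_phi = LinearRegression().fit(X[train], y[train])         # the regressor f_phi in eq. (1)

# Average Gaussian log-likelihood of held-out reading times under the fitted regressor.
sigma2 = np.mean((y[train] - f_phi.predict(X[train])) ** 2)
resid = y[test] - f_phi.predict(X[test])
llh = np.mean(-0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2))
print(f"average held-out log-likelihood: {llh:.3f} nats")
```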

Responsive Reading
One of the most studied variables in the above paradigm is surprisal, which measures a word's information content. Surprisal theory (Hale, 2001) posits that a word's surprisal should directly impact its processing cost, which makes sense at an intuitive level: the higher a word's information content, the more resources it should take to process that word. Surprisal theory has since sparked a line of research exploring the relationship between surprisal and processing load, where a word's processing load is typically quantified as its reading time. Formally, the surprisal (or information content) of a word is defined as its in-context negative log-likelihood (Shannon, 1948). We write it as:

    h(w_t) := − log p(w_t | w_{<t})    (2)

where p is the ground-truth probability distribution over natural language utterances. This definition has an intuitive interpretation: a word is more surprising-and thus conveys more information-if it is less likely, and vice-versa. Time and again, the surprisal has proven to be a strong predictor in reading time analyses (Smith and Levy, 2008, 2013; Goodkind and Bicknell, 2018; Wilcox et al., 2020, inter alia). In other words, adding surprisal as a predictor to models of reading time significantly increases their predictive power. Importantly, surprisal (as well as other properties of a word, like frequency or length) is a quantity that can only feasibly impact readers' behaviours after they have encountered the word in question; this property follows from standard theories of causality-Granger (1969), for instance, posits that future material cannot influence present behaviour. Thus, by limiting their analyses to such characteristics, these prior works assume reading time allocation is purely responsive to the current context a reader finds themselves in, happening on demand as needed for processing a word.

Anticipatory Reading
Not all reading behaviours, however, can be characterised as reactive. As a concrete example, readers often skip words-a decision which must be made while the next word's identity is unknown. Furthermore, prior work has shown that the uncertainty over a sentence's continuations impacts reading time (Roark et al., 2009;Angele et al., 2015;van Schijndel and Schuler, 2017;van Schijndel and Linzen, 2019). Both of these observations offer initial evidence that some form of anticipatory planning is being performed by the reader, influencing the way that they read a text.
The presence of such forms of anticipatory processing suggests that, beyond a word's surprisal, a reader's predictions about a word may influence the time they take to process it. A word's reading time, for instance, could be (at least partly) planned before arriving at it, based on the reader's expectation of the amount of work necessary for processing that word. This expectation has a formal definition, the contextual entropy, which is defined as follows:

    H(W_t) := E_{w ∼ p(· | w_{<t})} [ h(w) ] = − ∑_{w ∈ W} p(w | w_{<t}) log p(w | w_{<t})    (3)

Here we denote the random variable associated with this distribution as W_t, which takes on values from a (potentially infinite) vocabulary W.

Prior work has also investigated the role of entropy in reading times. Roark et al. (2009), Linzen and Jaeger (2014) and van Schijndel and Schuler (2017), for instance, investigate the role of successor entropy (i.e., word t+1's entropy) on RTs, while Hale (2003) investigates the role of entropy reduction (i.e., word t's minus word t+1's entropy). In this work, we are instead interested in the role of the entropy of word t itself.

Skewed Anticipations
Eq. (3) gives us a mathematically optimal guess (i.e., in terms of mean squared error) of what the surprisal of the anticipated word w_t will be, knowing only its context w_{<t}. However, a reader may employ a different strategy when making anticipatory predictions. One possibility, for instance, is that readers could be overly confident, trusting their best (i.e., most likely) guess when making this prediction. In this case, readers would instead anticipate a subsequent word's surprisal to be:

    h(w*) = − log p(w* | w_{<t}),  where  w* = argmax_{w ∈ W} p(w | w_{<t})    (4)

Another possibility is that readers could ignore each word's specific probability value when anticipating future surprisals, predicting the next word's surprisal to instead be (the logarithm of) the number of competing words with non-zero probability:

    log |supp(p(· | w_{<t}))|,  where  |supp(p)| = ∑_{w ∈ W} 1{p(w | w_{<t}) > 0}    (5)

The above anticipatory predictions can be written in a unified framework using the contextual Rényi entropy (Rényi, 1961). Formally, the Rényi entropy generalises Shannon's entropy (in eq. (3)), being defined as:

    H_α(W_t) := (1 / (1 − α)) log ∑_{w ∈ W} p(w | w_{<t})^α    (6)

It is easy to see that the Rényi entropy is equivalent to eq. (5) as α → 0. Further, it is equivalent to eq. (4) in the limit of α → ∞. Finally, the Rényi entropy is equal to Shannon's entropy (i.e., to eq. (3)) in the limit of α → 1.
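To make these quantities concrete, the following sketch computes the Shannon entropy, the Rényi entropy for α = 0.5, and the two limiting cases above (the surprisal of the most likely word and the log of the support size) from a toy next-word distribution; the five-word distribution is an illustrative stand-in.

```python
import numpy as np

# Toy next-word distribution p(. | w_<t) over a five-word vocabulary (illustrative only).
p = np.array([0.5, 0.3, 0.15, 0.05, 0.0])

def renyi_entropy(p, alpha):
    """Contextual Rényi entropy H_alpha (eq. 6), in nats; alpha -> 1 recovers Shannon."""
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))          # Shannon entropy, eq. (3)
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

print(f"Shannon entropy (alpha = 1): {renyi_entropy(p, 1.0):.3f}")
print(f"Rényi entropy (alpha = 0.5): {renyi_entropy(p, 0.5):.3f}")
print(f"alpha -> inf limit, eq. (4): {-np.log(p.max()):.3f}")              # surprisal of argmax word
print(f"alpha -> 0 limit, eq. (5):   {np.log(np.count_nonzero(p)):.3f}")   # log |supp(p)|
```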

Hypothetical Mechanisms Behind Anticipatory Effects
In this paper, we are mainly interested in the effect of anticipations on reading times, where we operationalise anticipation in terms of the contextual entropies defined above. We consider four main mechanisms under which anticipation could affect reading times: word-skipping; budgeting; preemptive processing; and coping with uncertainty. We discuss each of these in turn.
Word-skipping. The first (and perhaps most obvious) way in which anticipation could affect reading times is by allowing readers to skip words entirely, allocating the word a reading time of zero. A reader must, by definition, decide whether or not to skip a word before fixating on it-when the word's identity is not yet known (at least assuming the reader is not able to identify the following word through their parafoveal vision). The reader may thus decide to skip a word when its contextual entropy is low because they are confident in the word's identity. It follows that the contextual entropy may help us predict when readers will skip words.
Budgeting. The reading process can be described as a sequence of fixations and saccades (fixations are moments when the gaze focuses on a word, and saccades are rapid eye movements which shift gaze from one point to another; in self-paced reading, saccades are instead replaced by mouse clicks). These saccades, however, do not happen instantly: on average, they must be planned at least 125 milliseconds in advance (Reichle et al., 2009). Further, there is an average eye-to-brain delay of 50 ms (Pollatsek et al., 2008). It follows that, for a word's surprisal to be taken into account when allocating reading times, the word must be observed for at least 175 ms-if not more. Considering this delay on saccade execution, it is not unreasonable that reading times could be decided (or budgeted) further in advance, when the reader still does not know the current word's identity. If a reader indeed budgets reading times before they get to a word, reading times should be predictable from the contextual entropy. Processing costs, however, may still be driven by surprisal. In this case, we might observe budgeting effects: e.g., if a reader under-budgets reading time for a word-i.e., if the word's contextual entropy is smaller than its actual surprisal-we may see a compensation, which could manifest as larger spillover effects on the following word.
Preemptive Processing. In an influential paper, Smith and Levy (2008) posit that for a reader to achieve optimal processing times, they must minimise a trade-off between preemptive processing and actual reading costs. They then derive what a reader's optimal reading time and preprocessing effort should be. Notably, given their derivations, a reader should always allocate a constant amount of time for preprocessing future words: they argue a word's optimal reading time should be proportional to h(w_t), and its preprocessing cost proportional to e^{−t} (with t the time spent reading); plugging one equation into the other and summing over the vocabulary, preprocessing costs come out fixed, preprocess(W_t) ∝ ∑_{w ∈ W} e^{−h(w)} = 1. On another note, Goldstein et al. (2022) recently showed that the brain's processing load before a word's onset correlates negatively with its entropy. This suggests the brain starts preemptively processing a future word before reaching it-and that this is especially true in low-entropy contexts. This could mean that a smaller processing load for low-entropy words at time t should be compensated by a higher load on the previous word. Finally, both Roark et al. (2009) and van Schijndel and Schuler (2017) find that successor entropy (i.e., word t+1's entropy) has a positive impact on reading times, meaning that when a word has a smaller entropy, the previous word takes a shorter time to be read instead. Given all this prior work, preemptive processing makes no specific prediction about the effect of successor entropy on RTs; it is compatible with a negative, constant, or positive effect. We investigate which is true later in our experiments.
Coping with Uncertainty. Finally, uncertainty about a word's identity-as quantified by its contextual entropy-may directly cause an increase in processing load. For example, keeping a large number of competing word continuations under consideration may require additional cognitive resources, impacting the reader's processing load beyond the effect of the observed word's surprisal. We know of no way, however, to test this hypothesis directly under our experimental setup. Ergo, we will not consider this mechanism for the rest of this paper, leaving it out of our analyses.
Language Models. We use GPT-2 (Radford et al., 2019) as our language model p_θ in all experiments. GPT-2 predicts sub-word units at each time-step (instead of predicting full words). Ergo, we sum over sub-word units' surprisal estimates to get these measures per word. Estimating the contextual entropies per word is harder, though, as the vocabulary W is an infinite-sized set. We approximate these contextual entropies by computing them over a single step of sub-word units instead.
In practice, we thus compute a lower bound on the true contextual entropies, as we show in App. A.
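For illustration, the sketch below uses the Hugging Face transformers implementation of GPT-2 to sum sub-word surprisals into per-word surprisals and to compute the single-step next-token entropy used as the (lower-bound) approximation of the contextual entropy; the whitespace-based word alignment and the short example sentence are simplifications.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The cat sat on the mat"
enc = tokenizer(text, return_tensors="pt")
ids = enc.input_ids[0]
with torch.no_grad():
    log_probs = torch.log_softmax(model(**enc).logits[0], dim=-1)  # (seq_len, vocab)

# Sub-word surprisal: -log p(token_i | tokens_<i); the first token has no context here.
sub_surprisal = -log_probs[:-1].gather(1, ids[1:, None]).squeeze(1)
# Single-step next-token entropy after each prefix (a lower bound on the word-level
# contextual entropy; see App. A).
step_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)

# Aggregate sub-word units into words via the fast tokenizer's word alignment.
words = text.split()
word_ids = enc.word_ids()
surprisal, first_pos = [0.0] * len(words), {}
for pos, w in enumerate(word_ids):
    first_pos.setdefault(w, pos)
    if pos > 0:
        surprisal[w] += sub_surprisal[pos - 1].item()

for i, word in enumerate(words[1:], start=1):   # word 0 has no left context in this sketch
    entropy = step_entropy[first_pos[i] - 1].item()
    print(f"{word:>5}: surprisal {surprisal[i]:5.2f} nats, entropy {entropy:5.2f} nats")
```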

Data
We perform our analyses on two eye-tracking and two self-paced reading datasets. The self-paced reading corpora we study are Natural Stories (Futrell et al., 2018) and Brown (Smith and Levy, 2013), while the eye-tracking corpora are Provo (Luke and Christianson, 2018) and Dundee (Kennedy et al., 2003). We refer readers to App. B for more details on these corpora, as well as dataset statistics and preprocessing steps. For the eye-tracking data, we focus our analyses on Progressive Gaze Duration: we take a word's reading time to be the sum of all fixations on it before the reader first leaves it, and we only consider fixations in a reader's first forward pass. Further, for our first set of experiments, we treat a skipped word's reading time as zero; we will denote these datasets as Provo () and Dundee ().
In later experiments we discard skipped words, denoting these datasets with () instead. Following prior work (e.g., Wilcox et al., 2020), we average RT measurements across participants, analysing one RT value per word token.

Predictive Function f φ
Prior work has shown the surprisal-reading time relationship to be mostly linear (Smith and Levy, 2008, 2013). Assuming this linearity extends to the contextual entropy, we restrict our predictive function to the linear form:

    f_φ(x) := φ^T x

and measure its performance as the average log-likelihood it assigns to held-out reading times under a Gaussian error model:

    llh(D_test) := (1 / |D_test|) ∑_{(x, y) ∈ D_test} log N(y; f_φ(x), σ²)

where σ is a standard deviation, D_test is a held-out test set, and y are reading times.

Evaluation
As mentioned above, we evaluate the different sentence processing hypotheses by looking at the respective predictive power of the regressors that correspond to them-operationalised as their log-likelihood on held-out data. We use 10-fold cross-validation, estimating our regressors (depicted in eq. (1)) using 9 folds of the data at a time, and evaluating them on the 10th. Further, as is standard in reading time analyses, we test the predictive power of a hypothesis by comparing a target model against a baseline model. The models are the same, except that the target model contains a predictor of interest whereas the baseline model does not. For these analyses, our metric of interest is the difference in held-out log-likelihood between the two models:

    ∆_llh := llh_target(D_test) − llh_base(D_test)

Significance is assessed using paired permutation tests; we correct for multiple hypothesis testing (Benjamini and Hochberg, 1995) and mark: in green, significant ∆_llh where a variable adds predictive power (i.e., when the model with more predictors is better); in red, significant ∆_llh where a variable leads to overfitting (i.e., when the model with more predictors is worse); * p < 0.05, ** p < 0.01, *** p < 0.001. All of our models include the unigram frequency and the length of the current and three previous words, i.e., [u(w_{t−3:t}); |w_{t−3:t}|], as predictors; we omit explicit reference to these predictors for succinctness.
Notation. For notational succinctness, we will write vectors containing predictors for the current plus the three previous words as h(w_{t−3:t}). Similarly, we will write vectors including only three of these four words as h(w_{≠i}), i.e., leaving out word i. Further, we will write the concatenation of multiple predictors using vector notation, e.g., x = [H(W_t); h(w_{t−3:t})]. Finally, unless otherwise noted, we will use a regressor with all four surprisal predictors (besides frequencies and lengths) as our baseline:

    x_base = h(w_{t−3:t})
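The evaluation protocol can be sketched as follows: a baseline and a target linear regressor are compared via their held-out Gaussian log-likelihoods under 10-fold cross-validation. The predictor matrices below are placeholders for the features described above, and the paired permutation test used for significance is omitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def heldout_llh(X, y, train, test):
    """Average Gaussian log-likelihood of held-out reading times under a linear regressor."""
    model = LinearRegression().fit(X[train], y[train])
    sigma2 = np.mean((y[train] - model.predict(X[train])) ** 2)
    resid = y[test] - model.predict(X[test])
    return np.mean(-0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2))

def delta_llh(X_base, X_target, y, n_folds=10, seed=0):
    """Mean per-fold difference in held-out log-likelihood: target minus baseline."""
    diffs = []
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(y):
        diffs.append(heldout_llh(X_target, y, train, test) - heldout_llh(X_base, y, train, test))
    return float(np.mean(diffs))

# Placeholder data: X_base holds the baseline predictors, X_target adds one extra column
# (e.g., the contextual entropy of the current word).
rng = np.random.default_rng(0)
X_base = rng.normal(size=(2000, 6))
X_target = np.column_stack([X_base, rng.normal(size=2000)])
y = X_base @ rng.normal(size=6) + rng.normal(size=2000)
print(f"delta_llh: {delta_llh(X_base, X_target, y):.4f} nats")
```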

Experiment #1. Confirmatory Analysis
In this first experiment, we simply confirm prior results which show the predictive power of surprisal over reading times, estimating the ∆_llh between a model with all surprisal terms as predictors and different baselines from which we remove a single surprisal term. (Prior work has shown that a word's RTs are impacted not only by its own surprisal, but also by the surprisal of previous words; these effects are referred to as spillover effects.) We present these results in Table 1. In this table, we can see that the surprisal of the current word is a strong predictor of reading times in all four analysed datasets. Moreover, we see significant spillover effects for the surprisal of the three previous words in the self-paced reading corpora, for the two previous words in Dundee, and for the immediately previous one in Provo. Interestingly, and consistent with prior work (Smith and Levy, 2008), we find that spillover effects are stronger than the current word's effect in Brown. For the other three datasets, however, we find the main surprisal effect to be stronger than spillovers.

Experiment #2. Surprisal vs. Entropies
In this second experiment, we move on to analyse the predictive power of the contextual entropy on reading times. Specifically, Table 2 presents the ∆_llh between our full baseline model and: (i) a model where one of the surprisal terms is replaced by an entropy term; (ii) a model where we add an entropy term besides the predictors already in x_base. From this table, we see that adding the entropy of the current word (i.e., H(W_t)) significantly increases the predictive power in three out of the four analysed datasets. Furthermore, replacing the surprisal predictor with the entropy only leads to a model with worse predictive power in one of the four analysed datasets (Provo). On the other three datasets, the entropy's predictive power is the same as the surprisal's (more precisely, they are not statistically different). Together, these results suggest that the reading process is both responsive and anticipatory.
Analysing the impact of the previous words' entropies (i.e., H(W_{t−1}), H(W_{t−2}), H(W_{t−3})) on reading times, we see a somewhat different story. When adding spillover entropy terms as extra predictors we see no consistent improvements in predictive power, with a weak improvement on the self-paced reading datasets for word t−1 (specifically, this is only significant in Natural Stories) and a similarly weak one for word t−2 on the eye-tracking data (only significant in Dundee). This lack of predictive power stands out further when contrasted with surprisal spillover effects: replacing surprisal spillover terms with the corresponding entropy terms mostly leads to models with weaker predictive power. Together, these results imply the effect of expectations on reading times is mostly local-i.e., the expectation over a word's surprisal impacts its own reading time, but not future words' RTs.

Experiment #3. Skewed Expectations
We now compare the effect on reading times of expectations computed with different Rényi entropies. We follow a similar setup to before. Specifically, we compute the contextual Rényi entropy for several values of α and train regressors with two sets of input variables: (i) x_model = [H_α(W_t); h(w_{≠t})], i.e., replacing the current word's surprisal term with the Rényi entropy; and (ii) x_model = [H_α(W_t); h(w_{t−3:t})], i.e., simply adding the Rényi entropy as an extra predictor. We then plot these values in Fig. 1. Analysing this figure, we see a clear trend in three of the four datasets (here, again, Provo presents different trends from the other datasets): the predictive power of expectations seems to improve for smaller values of α. More precisely, in Brown, Natural Stories and Dundee, α = 0.5 seems to lead to stronger predictive power than α > 0.5. Notably, different values of α lead to Rényi entropies with different interpretations of a reader's anticipation strategies. Recall from §2.3 that when α = 1, Rényi's and Shannon's entropies are equivalent, and that when α = 0, the Rényi entropy measures the size of the support of p(· | w_{<t})-or, in other words, the number of competing continuations at a time step. The Rényi entropy with α = 0.5 does not have as clear an intuitive meaning, but it can be interpreted as measuring a soft version of the distribution p's support-where a word needs to have probability above some ε, as opposed to above 0, to be counted. Based on these results, we then produce a table similar to the previous experiment's, but using the Rényi entropy with α = 0.5 instead; these results are depicted in Table 3.

Table 3: ∆_llh (in 10^−2 nats) achieved after either replacing a surprisal term in the baseline with the contextual Rényi entropy (α = 0.5), or adding the Rényi entropy as an extra predictor.

Similarly to before, we still see a significant improvement in predictive power on three of the datasets when adding the entropy as an extra predictor. Unlike before, however, replacing the surprisal predictor (for time step t) with a Rényi entropy predictor significantly improves log-likelihoods in two of the analysed datasets. In other words, the Rényi entropy has stronger predictive power than the surprisal in both these datasets. We now move on to investigate why this is the case, analysing the mechanisms proposed in §3.

Experiment #4. Word-skipping
In §3, we discussed four potential mechanisms through which expectations could impact reading times. In this experiment, we analyse the impact of word-skipping effects on our results. We start this analysis by, similarly to previous experiments, looking at the ∆_llh between a baseline and an analysed model. Unlike before, though, we train a logistic regressor in this experiment, predicting whether or not a word was skipped. Our prediction function can thus be written as:

    f_φ(x) := σ(φ^T x)

where σ is the sigmoid function.
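A minimal sketch of this skip-prediction setup, with placeholder arrays standing in for the per-word predictors and skip indicators:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
# Placeholder predictors, e.g., [entropy, surprisal, length, log-frequency] per word.
X = rng.normal(size=(5000, 4))
# Fake skip labels, generated so that skipping is more likely when the first column is low.
skipped = (rng.random(5000) < 1.0 / (1.0 + np.exp(X[:, 0]))).astype(int)

train, test = slice(0, 4000), slice(4000, 5000)
clf = LogisticRegression().fit(X[train], skipped[train])   # p(skip) = sigmoid(phi^T x)

# Held-out average log-likelihood (the negated log loss), comparable across predictor sets.
llh = -log_loss(skipped[test], clf.predict_proba(X[test])[:, 1])
print(f"held-out log-likelihood: {llh:.4f} nats")
```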
Table 4: ∆_llh (in 10^−4 nats) between a target model (with predictors on the columns) and a baseline (with predictors on the rows) when predicting whether a word was skipped or not, for Provo and Dundee. All models also include the surprisal of the previous words as predictors.

Table 4 presents these results. First, we see that surprisal is a significant predictor of whether or not a word is skipped in Dundee, yet it is not significant in Provo. Second, we find that in Dundee our predictive power over whether a word was skipped is significantly stronger when using the Rényi entropy of the current word than when using its surprisal. Finally, while we find an improvement in predictive power when adding entropy (besides surprisal) as a predictor, we find no significant improvement when performing the reverse operation. This implies that, at least for Dundee, word-skipping effects are predicted solely by the entropy, with the surprisal of the current word adding no extra predictive power.
Note that we represented skipped words as having reading times of 0ms in our previous experiments on eye-tracking datasets. Thus, our previous results could be driven purely by word-skipping effects. We now run the same experiments in §5.2 and §5.3, but with skipped words removed from our analysis. These results are presented in Table 5. In short, when skipped words are not considered, the Rényi entropy is not significantly more predictive of RTs than the surprisal anymore; in fact, the surprisal seems to be a slightly stronger predictor (although not significantly in Dundee). However, adding the Rényi entropy as a predictor to a model which already has surprisal still adds significant predictive power in Dundee. In short, this table shows that, while partly driven by word-skipping, there are still other effects of anticipation on reading times.

Experiment #5. Budgeting Effects
We now analyse budgeting effects: if reading times are affected by the entropy through a budgeting mechanism, we may expect to see budgeting spillover effects when a reader under-budgets-i.e., when the entropy is smaller than a word's surprisal, causing less time to be allocated to the word than required for processing. Here, we operationalise under-budgeting as any positive difference between surprisal and entropy. Similarly, we may expect over-budgeting to lead to negative spillover effects, since spending extra time on a word might allow the reader to work through some of their processing debt (i.e., the still unprocessed spillover effects of that and of previous words). We operationalise several potential budgeting effects as:

    budget_t = H(W_t) − h(w_t)
    over-budget_t = ReLU(H(W_t) − h(w_t))
    under-budget_t = ReLU(h(w_t) − H(W_t))
    abs-budget_t = |H(W_t) − h(w_t)|

where ReLU(·) zeroes negative values, while returning positive ones unchanged. We then compute the ∆_llh of adding these effects as predictors of RT on top of a baseline with the current word's entropy, as well as all four surprisal terms, as predictor variables. Unlike in previous experiments, thus, our baseline here already contains the entropy as a predictor. Further, for this and future analyses, we show results for the eye-tracking datasets both including () and excluding () skipped words.

Table 6: ∆_llh (in 10^−2 nats) achieved when predicting RTs after adding budgeting effect predictors on top of a baseline with entropy and surprisal as predictors.

Table 6 presents these results. In short, we do find budgeting effects of word t−1 on RTs in our two analysed self-paced reading datasets, and in Dundee (). We do not, however, find them in Dundee (). This may imply budgeting effects impact word-skipping, but not actual RTs once the word is fixated. Further, we also find weak budgeting effects of word t−2 in our () eye-tracking datasets; these, however, are only (weakly) significant in Dundee. We conclude that these results do not provide concrete evidence of a budgeting mechanism influencing RTs, but only of it influencing word-skipping instead. We further analyse these effects in our discussion section (§6).
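For reference, the budgeting predictors defined above can be computed from per-word surprisal and entropy arrays as in the sketch below (the array names are assumptions for illustration):

```python
import numpy as np

def budgeting_predictors(surprisal, entropy):
    """Budgeting effects from per-word surprisal h(w_t) and contextual entropy H(W_t)."""
    budget = entropy - surprisal                      # signed budgeting error
    return {
        "budget": budget,
        "over_budget": np.maximum(budget, 0.0),       # ReLU(H(W_t) - h(w_t))
        "under_budget": np.maximum(-budget, 0.0),     # ReLU(h(w_t) - H(W_t))
        "abs_budget": np.abs(budget),
    }

# Example with toy values (nats); in the experiments these come from the language model.
print(budgeting_predictors(np.array([3.0, 1.0]), np.array([2.0, 2.5])))
```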

Experiment #6. Preemptive Processing
In our analysis of preemptive processing, we will analyse the impact of the successor entropy (i.e., H_α(W_{t+1})) on RTs. Notably, while prior work has analysed this impact, the results in the literature are contradictory.

Table 7: ∆_llh (in 10^−2 nats) achieved after adding the top predictor on top of a baseline with the predictors in the column. All models include surprisal as a predictor.

Table 7 presents the results of our analysis. In short, this table shows that the successor entropy is only significant in Natural Stories. (We note this is the same dataset previously analysed by van Schijndel and Linzen (2019), who found a significant effect of the successor entropy.) In contrast, the current word's contextual entropy is a significant predictor of RTs in three of the four analysed datasets, even when added to a model that already has the successor entropy. Further, while most of our results suggest readers rely on skewed expectations for their anticipatory predictions-i.e., the Rényi entropy with α = 0.5 is in general a stronger predictor than Shannon's entropy-the successor Shannon entropy seems more predictive of RTs than the Rényi. Our full model, though, still has a larger log-likelihood when using Rényi entropies. Overall, our results support the analysis of Smith and Levy (2008), who suggested preemptive processing costs should be constant w.r.t. the successor entropy.

Discussion
We wrap up our paper with an overall discussion of our results. To make this discussion more concrete, we plot the values of the parameters φ from our best regressor per dataset, showing the effect of predictor variables not included in a dataset's regressor as zero. These parameters are depicted in Fig. 2. As the contextual Rényi entropy models provide overall higher log-likelihoods, we focus on them here. First, Fig. 2 shows that-for Brown, Natural Stories and Dundee-not only does the entropy have similar (or stronger) psychometric predictive power than the surprisal, it also has a similar (or even stronger) effect size on RTs. In other words, an increase of 1 bit in the contextual entropy leads to a similar or larger increase in RTs than a 1-bit increase in surprisal.
Second, Fig. 2 also shows that in Natural Stories-the only dataset where it is significantthe successor entropy has a larger effect on RTs than the surprisal, and its impact is positive. If these parameters implied a causal effect (which we note they do not), this would mean an increase in the next word's entropy leads to an increase in the current word's RT. This could suggest that readers are preemptively processing future words, and that they need more time to do this when there are more plausible future alternatives. Moreover, we see the successor Rényi entropy has a similar (or slightly smaller) effect on RTs than the current word's Rényi entropy. Why the successor entropy is only significant in the Natural Stories dataset is left as an open question.
Third, Fig. 2 shows the effect of over-budgeting on RTs in Brown, Natural Stories, and Dundee. (While over-budgeting is not a significant predictor in Brown, it leads to slightly stronger models, and we add it to this dataset's regressor for an improved comparison.) We see that our operationalisation of over-budgeting leads to a negative effect on RTs in Dundee (), but to no effect in Dundee (). Together, these results suggest that when a reader over-budgets time for a word, they are more likely to skip the following one. In Brown and Natural Stories, however, over-budgeting seems to lead to a positive effect on the next word's RT. As this is only the case in the self-paced reading datasets, we suspect this could be related to specific properties of this experimental design; e.g., a reader's attention could lapse when they become idle due to over-budgeting reading time for a specific word.
Fourth, while we get roughly consistent results in Brown, Natural Stories, and Dundee, our analyses on Provo show a starkly different story. Further, while we note that Provo is the smallest of our analysed datasets (in terms of its number of annotated word tokens; see Table 8 in App. B), this is likely not the whole story behind these different results. As it is non-trivial to diagnose the source of these differences, we leave this task open for future work.
As a final remark, we note that, throughout our analyses, the Rényi entropy with α = 0.5 leads to stronger regressors (i.e., to higher log-likelihoods on held-out test data) than α = 1, i.e., the Shannon entropy. Recall that different values of α lead to Rényi entropies with different interpretations (see §2); Rényi entropies with α < 1 can be roughly associated with a soft measure of the number of competing alternatives considered at a time step, which suggests a mechanism by which contextual entropy may affect RTs.

Conclusion
This work investigates the anticipatory nature of the reading process. We examine the relationship between expected information content-as quantified by contextual entropy-and reading times in four naturalistic datasets, specifically looking at the additional predictive power over surprisal that this quantity provides. While our results confirm the responsive nature of reading, they also highlight its anticipatory nature. We observe that contextual entropy has significant predictive power in models of reading time, suggesting that readers may skip words based on their expectations and preemptively process future words. Such results give evidence of a significant anticipatory component to reading behaviour.

Limitations and Caveats
Throughout this paper, we have discussed the effect of anticipation on reading times (and on the reading process, more generally)-where we quantify a reader's anticipation as a contextual entropy. We do not, however, have access to the true distribution p, which is necessary to compute this entropy. Rather, we rely on a language model p θ to approximate it (as we note in §4.1). How this approximation impacts our results is a non-trivial question-especially since we do not know which errors our approximator is likely to commit. If we assume p θ to be a noisy version of p, for instance, we could have good estimates of p(· | w <t ) on average, while not-as-good estimates of each word w t 's probability. If this were true, our estimates of the entropy could be better than our estimates of the surprisal-which would bias our analyses towards preferring the entropy as a predictor.
We believe this not to be the main reason behind our results, for two reasons. First, if the entropy helped predict reading times simply because we have noisy estimates of the surprisal, the same should be true for predicting spillover effects. We do not see this happening in general, though: a word's entropy mainly helps in predicting its own reading time, and not future words' RTs. Second, even if our estimates are noisy, a noisy estimate of the surprisal should better approximate the true surprisal than an estimate of the contextual entropy does. Since replacing the surprisal with the contextual entropy at times leads to better predictions of reading times, this is likely not the only mechanism on which our results rely.
Another limitation of our work is that we always estimate the contextual entropy and surprisal of a word w t while considering its entire context w <t .
Modelling surprisal and entropy effects while considering skipped words, however, would be an important future step for an analysis of anticipation in reading. As an example, van Schijndel and Schuler (2016) show that when a word w_{t−1} is skipped, the subsequent word w_t's reading time is proportional not only to its own surprisal (i.e., h(w_t)), but to the sum of both words' surprisals (i.e., to h(w_t) + h(w_{t−1})). They justify this by arguing that a reader would need to incorporate both words' information at once when reading. Another model of the reading process, however, could predict that readers simply marginalise over the distribution of w_{t−1}, computing the surprisal of word w_t directly as:

    − log ∑_{w_{t−1} ∈ W} p(w_t | w_{<t}) p(w_{t−1} | w_{<t−1})    (11)

We leave it to future work to disentangle the effects that using a model p_θ-as well as the effects of skipped words-has on our results.
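For intuition, eq. (11) can be checked numerically with toy distributions; the sketch below marginalises over a skipped word w_{t−1} using a three-word vocabulary whose probabilities are illustrative stand-ins.

```python
import numpy as np

# Toy distributions over a three-word vocabulary (illustrative only).
q = np.array([0.6, 0.3, 0.1])          # p(w_{t-1} | w_{<t-1})
M = np.array([[0.7, 0.2, 0.1],         # p(w_t | w_{<t-1}, w_{t-1}); rows index w_{t-1}
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

w_t = 0                                 # index of the observed word w_t
marginal_surprisal = -np.log(q @ M[:, w_t])   # eq. (11): marginalising over the skipped word
print(f"marginalised surprisal of w_t: {marginal_surprisal:.3f} nats")
```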

A Rényi Entropy's Lower Bound
Theorem 1. Assume a language model with a deterministic tokeniser. Estimating the Rényi entropy over its vocabulary of sub-word units, as opposed to the (potentially infinite) vocabulary of full words W, leads to a lower bound on the Rényi entropy.

Proof. We will write a language model's distribution over sub-word units as p_θ(s | s_{t,<i}, w_{<t}). Notably, for most recent tokenisers, there may be multiple sequences of sub-word units s which would lead to the same word w. As we assume our model's tokeniser is deterministic, though, only one such sequence s will be produced when natural text data is tokenised. In this case, if the model assigns probability mass to sequences with other tokenisations, it will technically lose probability mass, as those sequences will never occur. Ergo, we focus here on a version of this model for which non-standard sub-word unit sequences (which are never generated by the tokeniser) are reassigned probability zero; this should lead to models which are as good as or better than the originals, in terms of cross-entropy. Now consider the set of all words W_s which start with a specific sub-word symbol s. We have that:

C Surprisal vs. Entropy
The surprisal and the contextual entropy are bound to be strongly related, as one is the other's expected value. To see the extent of their relation, we compute their Spearman correlation per dataset and display it in Fig. 3. This figure shows that these values are indeed strongly correlated, and that Shannon's entropy is more strongly correlated to the surprisal than the Rényi entropy with α = 0.5. Given that the Rényi entropy is in general a stronger predictor of RTs than the Shannon entropy, this finding provides further evidence that our results do not only rely on the entropy "averaging out" the noise in our surprisal's estimates.