An information-theoretic analysis of targeted regressions during reading

Regressions, or backward saccades, are common during reading, accounting for between 5% and 20% of all saccades. And yet, relatively little is known about what causes them. We provide an information-theoretic operationalization for two previous qualitative hypotheses about regressions, which we dub reactivation and reanalysis. We argue that these hypotheses make different predictions about the pointwise mutual information or pmi between a regression's source and target. Intuitively, the pmi between two words measures how much more (or less) likely one word is to be present given the other. On one hand, the reactivation hypothesis predicts that regressions occur between words that are associated, implying high positive values of pmi. On the other hand, the reanalysis hypothesis predicts that regressions should occur between words that are not associated with each other, implying negative, low values of pmi. As a second theoretical contribution, we expand on previous theories by considering not only pmi but also expected values of pmi, E[pmi], where the expectation is taken over all possible realizations of the regression's target. The rationale for this is that language processing involves making inferences under uncertainty, and readers may be uncertain about what they have read, especially if a previous word was skipped. To test both theories, we use contemporary language models to estimate pmi-based statistics over word pairs in three corpora of eye tracking data in English, as well as in six languages across three language families (Indo-European, Uralic, and Turkic). Our results are consistent across languages and models tested: Positive values of pmi and E[pmi] consistently help to predict the patterns of regressions during reading, whereas negative values of pmi and E[pmi] do not. 
Our information-theoretic interpretation increases the predictive scope of both theories and our studies present the first systematic crosslinguistic analysis of regressions in the literature. Our results support the reactivation hypothesis and, more broadly, they expand the number of language processing behaviors that can be linked to information-theoretic principles.


Introduction
Reading offers a unique window into the cognitive mechanisms that support real-time language comprehension. It consists of fixations, where the gaze focuses on an individual word or word region, and saccades, where vision is suppressed and the gaze moves. Saccades are typically progressive, i.e., rightward moving in English. However, about 5%-20% are regressive, going against the general flow of reading (Rayner, 1998). Regressive saccades, or more simply, regressions, have played an important role in psycholinguistic theory. First, they provide clues about what types of structures people maintain in the reading process (Rayner, 1998). Second, regressions are understood to facilitate comprehension: regression rate increases in texts that are difficult or inconsistent (Rayner, Chace, Slattery, & Ashby, 2006) and suppressing regressions can make texts more difficult to understand (Schotter, Tran, & Rayner, 2014). Third, they are understood to support additional information gathering in the form of re-reading (Booth & Weger, 2013; Sturt & Kwon, 2018; although, see Mitchell, Shen, Green, & Hodgson, 2008 for a contrasting view).
However, despite this high-level agreement, there is considerable disagreement in the literature about the mechanism through which regressions support comprehension. The most influential hypothesis suggests that readers use regressions for selective reanalysis (Frazier & Rayner, 1982), i.e., to intentionally target words or sentence regions about which they want additional information. Evidence in favor of this view comes from the fact that regressions can be highly accurate, even when the saccade is quite long (Kennedy & Murray, 1987; Murray & Kennedy, 1988), as well as several studies that find readers make regressions to specific regions when this would help resolve processing of certain ambiguous syntactic structures (Christianson, Luke, Hussey, & Wochna, 2017; Meseguer, Carreiras, & Clifton, 2002; Schotter et al., 2014). However, other studies have questioned the extent to which regressions can be characterized as selective: First, many short regressions are likely due to low-level ocular-motor factors, such as corrections for longer saccades which may have overshot their intended target (Inhoff, Kim, & Radach, 2019) or as learned responses to certain visual or motor factors (O'Regan, 1987; Yang & McConkie, 2001). Second, when it comes to longer regressions, some studies suggest that they are used to support unselective re-reading of an entire sentence (Von der Malsburg & Vasishth, 2011, 2013), especially when this choice must be made consciously (Paape & Vasishth, 2022). A third perspective suggests that regressions may not support re-reading at all, but are merely time-buying operations (Inhoff & Weger, 2005; Mitchell et al., 2008), which is compatible with an observed lack of connection between selective regressions and increased comprehension (Christianson et al., 2017). Through this debate, a picture emerges of regressions as heterogeneous: Some are due to low-level ocular-motor factors, some are used to reread sentences, or sentence chunks, from the beginning, and some are used to target specific sentence regions.
In this paper, we are not primarily concerned with evaluating whether, or when, regressions are selective. Rather, given the evidence, we take the perspective that regressions can be selective and ask: when they are selective, how do they support online language comprehension during reading? First, we provide an information-theoretic operationalization for two hypotheses about the purpose of selective regressions. Second, we test these hypotheses by evaluating their ability to predict regressions in large, naturalistic reading corpora. We chose this setting because we do not have a way of a priori filtering selective from unselective regressions, and a large number of regressions may fall into the latter category. Thus, large amounts of data are needed to successfully test our theoretical predictions in an inherently noisy dataset. We focus on two different, although not necessarily mutually exclusive, perspectives, which we dub the reactivation hypothesis and the reanalysis hypothesis.
Under the reactivation hypothesis, regressions facilitate comprehension by re-activating material that helps the reader gain more confidence about the current word. In this case, regressions target sentence regions about which the reader has high confidence already, and help the reader find evidence that can confirm a structural or lexical candidate, such as the identity of the word from which the regression originated. Although not previously discussed as a unified proposal, we group several previous works (Kennedy, 1992; Kennedy & Murray, 1987; Lopopolo, Frank, van den Bosch, & Willems, 2019) under the reactivation hypothesis based on their shared idea that regressions are used to confirm or support current-word processing.
In contrast with the reactivation hypothesis, the reanalysis hypothesis posits that regressions occur when new material causes a loss of confidence about a previous interpretive choice, such as a previous word's identity (Bicknell & Levy, 2010, 2011), or a structural interpretation of the sentence as a whole. In this case, regressions are initiated to re-inspect the sentence region about which the reader is no longer confident. Reanalysis has historically played a more dominant role in the psycholinguistics literature, where it is discussed in relation to the processing of garden-path sentences, sentences that have an initially locally probable, but globally impossible structural interpretation (Altmann et al., 1992; Frazier & Rayner, 1982). (We will discuss garden-path sentences at greater length in Section 2.2.)

The present contribution is motivated by two types of limitations in the prior literature on regressions. The first type of limitation is theoretical and stems from the fact that, as originally framed, both the reactivation and reanalysis hypotheses are qualitative. Both were formulated to explain regressions during the processing of specific linguistic phenomena and do not make broad-coverage predictions about where and when regressions occur during naturalistic reading. Despite some previous efforts to provide quantitative interpretations (Bicknell & Levy, 2011; Lopopolo et al., 2019), the qualitative nature of both theories presently poses serious limitations to theory evaluation and theory comparison. The second type of limitation is empirical and stems from the fact that the majority of large-scale statistical studies of regressions, which might provide some answers to questions about theoretical predictive power, were conducted over ten years ago (Bicknell & Levy, 2011; Kliegl, Grabner, Rolfs, & Engbert, 2004; Rayner, Ashby, Pollatsek, & Reichle, 2004; Vitu & McConkie, 2000) and face several challenges enumerated below. First, such studies were conducted before
accurate estimates of word predictability were available and, thus, either use predictability ratings from human participants (Kliegl et al., 2004; Rayner et al., 2004), or statistical estimators that have a limited context window (Bicknell & Levy, 2011). This means that the contexts that trigger long regressions cannot be represented by these previous statistical models. Second, previous studies have analyzed only a single corpus at a time, which makes it difficult to determine if their conclusions are specific to the dataset being tested, or if they are more general properties of reading. Third, the vast majority of studies have been conducted only in English, making it difficult to interpret whether previous results are due to language-specific or potentially cognitively universal processing considerations.
The goal of the present work is to provide a quantitative interpretation of the reactivation and reanalysis hypotheses and to evaluate both on a more diverse set of languages and corpora. Our contributions are both theoretical and empirical. Theoretically, we argue that the reactivation and reanalysis hypotheses can be operationalized information-theoretically as contextual pointwise mutual information (pmi; Fano, 1961) between a regression's source w_s and its target w_t. Intuitively, the pmi between two words measures how much more (or less) likely one word is to be present in a sentence given the other, and can be thought of as a measure of word-to-word relatedness, when taking the surrounding context into account. On one hand, the reactivation hypothesis predicts that regressions should occur between words that are associated with each other. Under this information-theoretic lens, we should thus expect positive (high) pmi between a regression's source and target. On the other hand, the reanalysis hypothesis predicts that regressions occur between words that are not associated with each other, implying negative (low) pmi. Because pmi can be computed between any two words, under this interpretation the reactivation and reanalysis hypotheses make generalized predictions about the incidences of regressions over arbitrary word pairs in any sentence.
As a second theoretical contribution, in addition to looking at the pmi between observed words, we further consider expectations over possible word identities. The rationale for this is that language processing involves making inferences under uncertainty (Bicknell & Levy, 2010; Kleinschmidt & Jaeger, 2015), and readers may be uncertain about what they have read, especially if a previous word was skipped. In such cases, instead of basing the decision to regress on a previous word's realization, readers may regress based on the expectation about its likely identity. Thus, in addition to investigating the ability of pmi to explain regressions, our studies test the role of its expected value, where the expectation is taken solely over w_t. This value can also be thought of as a half-pointwise mutual information and will be denoted, for simplicity, E_t[pmi]. We argue that this measure more accurately reflects the types of decisions under uncertainty that people must make during naturalistic reading, at least vis-à-vis previous studies.
We use contemporary language models to estimate our pmi-based statistics over word pairs in three corpora of naturalistic reading in English, as well as for semantically controlled materials in six languages across three different language families (Indo-European, Uralic, and Turkic). To assess the predictive power of our pmi-based values, we include them as predictors in a zero-inflated Poisson model trained to predict the number of regressions that occur between pairs of words within each sentence of a corpus. We conduct two types of analyses: First, we fit a single model that includes pmi, as well as a number of baseline statistics, and inspect the coefficient associated with pmi. We ask whether it is negative or positive, with the former taken to support reanalysis and the latter reactivation. For our second analysis, we fit models with a separate pmi-based statistic for each hypothesis. To instantiate the reactivation hypothesis, we include positive pointwise mutual information (ppmi), i.e., max(0, pmi), as its associated regressor, while for the reanalysis hypothesis, we include negative pointwise mutual information (npmi), i.e., min(0, pmi). Here, we compare our pmi-based models to models that only include baseline statistics. If the model with a certain predictor leads to a significant gain in log-likelihood over baselines, it supports that predictor's corresponding theory of regressions.
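To make the construction of these two regressors concrete, the split of pmi into its positive and negative parts can be sketched as follows (the numeric pmi values are hypothetical, purely for illustration):

```python
def split_pmi(pmi_value):
    """Split a pmi estimate into the two regressors used in our analyses:
    ppmi = max(0, pmi) instantiates reactivation, and
    npmi = min(0, pmi) instantiates reanalysis."""
    return max(0.0, pmi_value), min(0.0, pmi_value)

# Hypothetical contextual pmi estimates for three source-target pairs.
for pmi in [1.7, -0.4, 0.0]:
    ppmi, npmi = split_pmi(pmi)
    # The two parts are complementary: at most one is nonzero,
    # and they always sum back to the original pmi.
    assert ppmi + npmi == pmi
```

Because ppmi and npmi partition the range of pmi, including each separately as a regressor lets a fitted model weight positive and negative word-to-word associations differently.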
Our results are consistent across languages and testing environments. First, for baseline predictors, our results indicate that regressions occur between words that are close together, infrequent, and unlikely in context. These results largely confirm those of previous studies. For our pmi-based statistics, first, we find that models fit a positive coefficient for pmi, indicating that as the pmi between words increases, so does the number of regressions between them. Second, we find that ppmi and E_t[ppmi] consistently lead to predictive power above baselines, but npmi and E_t[npmi] do not. In addition, we find that E_t[ppmi] tends to demonstrate stronger predictive power than vanilla ppmi.
Our contributions are thus three-fold: We first provide a specific quantitative formulation of reactivation and reanalysis as theories of regressions and generalize them to consider uncertainties over previously observed word identities. We improve on the methods relative to prior work by using crosslinguistic datasets and contemporary natural language processing methods for uncertainty estimation. Our results support reactivation and, more broadly, they expand the number of language-processing behaviors that can be linked to information-theoretic principles. We view this contribution as in line with recent work that has argued for an information-theoretic basis for explaining word skipping (Pimentel, Meister, Wilcox, Levy, & Cotterell, 2023), as well as other structural processing phenomena (Futrell, 2019; Hahn, Futrell, Levy, & Gibson, 2022).
The rest of this paper will proceed as follows: In Section 2, we present our two contrasting hypotheses, reactivation and reanalysis, in greater detail. In Section 3, we discuss how both can be operationalized as making predictions about the pmi between the source and target of a regression. In Section 4, we introduce our methods. Section 5 presents results from our first study, investigating regressions in English. Section 6 presents results from our second study, investigating regressions in six different languages. Section 7 discusses the implications of these results, both for psycholinguistic theories as well as information-theoretic theories of language processing. Section 8 concludes.

Reactivation
The reactivation hypothesis for regressions grows out of work on language processing that investigated how reading is shaped by its relationship to a physical object, i.e., a computer screen or a printed page (Kennedy, 1992; Kennedy, Brooks, Flynn and Prophet, 2003; Kennedy & Murray, 1987). The primary purpose of this work was to demonstrate that readers store a spatial representation of a word's location during reading, using evidence from regressive saccades. For example, Kennedy, Hill and Pynte (2003) discuss an experiment where readers were given sentences, followed by a single word, such as in (1), and asked whether the word had appeared in the sentence.
(1) The novels in the library had started to go mouldy with the damp.
novels

They found that a portion of participants directed highly accurate saccades backward to the word novels. What is the purpose of these saccades? Because sentences were constructed not to contain syntactic ambiguities, it is unlikely that the regression was due to confusion about the overall sentence structure. The accuracy of the saccade further suggests that participants were not confused about whether novels was present, either. After all, those participants who launched targeted regressions remembered precisely where it was located on the page. While Kennedy, Hill et al. acknowledge that regressions can be used for reanalysis, they propose that backward saccades, such as the ones observed in their experiments, can serve alternative functions. Specifically, they suggest that some long-range, targeted regressions can be thought of as similar to pointing, or co-speech gesture: a way to ease the memory burden during language processing.
Building on this idea, Lopopolo et al. (2019) hypothesize that regressions are used to support memory processes during the online construction of syntactic dependencies by reactivating previously read information. They offer the hypothesis that regressions "allow re-reading and cueing of previous words, as an aid to memory, when this is required for a successful construction of a syntactic representation of the text." Specifically, they predict that regressions should occur between a word and its syntactic head. Conducting an analysis on a corpus of naturalistic reading in English, they find that regressions are more likely to occur from a word when its syntactic head is in the previous context, which is consistent with their hypothesis. Although not discussed in Lopopolo et al. (2019), previous work has also found that between a dependent and its head there tend to be high values of pmi (Futrell, 2019). Thus, taken together, these findings suggest a relationship between pmi and regressions, something which we formalize and test thoroughly.
Under the reactivation hypothesis, why might a reader want to reactivate previous words? While Kennedy and Murray (1987) and Kennedy, Hill et al. (2003) propose that reactivating previous material serves as a memory aid, moving forward, we will frame things in terms of word-to-word cueing (cueing is also used by Lopopolo et al. 2019). We opt for cueing because it proposes a specific mechanism through which reactivations help memory processes. Concretely, we will treat the reactivation hypothesis as saying that regressions facilitate the processing of some word w_s by activating words that cue it. We will adopt a simple notion of cueing based on probability. This notion draws on the neuro-psychological literature, which treats a cue as something that is statistically related to a given property of the environment (Fetsch, Pouget, DeAngelis, & Angelaki, 2012; Martin, 2016). We will think about reading as taking place in some overall context, which we will discuss in greater detail in the subsequent section. We say that a particular word w_t cues another word, w_s, if including it in the context makes the other word more likely.² We would like to note that, under our interpretation of the reactivation hypothesis, another way to think about regressions is as being about confirmation. Readers regress from w_s to w_t when reactivating the identity of w_t can help confirm or build more confidence about the identity of w_s. This notion of regressions as being about confirmation has been discussed recently in the literature on regressions (Christianson et al., 2017; Paape & Vasishth, 2022); however, these works are primarily concerned with regressions as a mechanism to confirm syntactic structures, i.e., a more abstract linguistic representation, rather than word identity itself.

Reanalysis
The reanalysis approach to regressions grows out of the literature on garden paths: sentences that have an initial locally possible interpretation which is rendered impossible at a later, disambiguating region. For example, (2) illustrates the well-known Main Verb/Reduced Relative (MV/RR) garden-path effect.
(2) The artist painted a portrait was impressed by its beauty.
The sentence has an initially possible interpretation in which painted is a past-tense matrix verb. This becomes impossible at the disambiguating region was impressed, in favor of a globally possible structural interpretation where painted starts off a reduced relative clause. (In this case, the sentence becomes semantically equivalent to The artist who was painted a portrait was impressed by its beauty.) Since Frazier and Rayner (1982) argued that readers use regressions to selectively re-analyze ambiguous portions of garden-path sentences, several studies have found strong links between garden-pathing and regressive saccades (Meseguer et al., 2002), and some have suggested, or implicitly assumed, as in the regression-contingent analysis proposed by Altmann et al. (1992), that regressions are a necessary consequence of garden-pathing. However, other work has argued that there are numerous strategies through which reanalysis could be pursued (Lewis, 1998). And further experimental studies demonstrate that garden-path effects can be observed even when trials with regressions are eliminated from a dataset (Rayner & Sereno, 1994). Even in Frazier and Rayner's (1982) original study, only a subset of participants regressed to the ambiguous sentence region. Some just continued reading, albeit with longer fixation times, until they had apparently reanalyzed the sentence.
Furthermore, as Bicknell and Levy (2011) point out, garden paths cannot be frequent enough to explain all regressions in naturalistic reading. Bicknell and Levy propose an alternative, but related, hypothesis in which regressions are directed towards a target not because its initial structural interpretation has become impossible (as is argued with garden paths) but because the reader has reduced confidence about its identity, which they dub the falling confidence theory. Note that falling confidence can potentially explain reading behavior in garden-path sentences as well. Turning back to (2), readers may have lower confidence in the identity of painted after they encounter was impressed. The crucial difference, we argue, between reactivation and reanalysis is that, in the former, the reader is assumed to have high confidence about the target of the regressive saccade and re-fixates on it to build more confidence about the identity of the source. Reanalysis assumes the opposite, namely that regressions are initiated when the reader has low (or lower) confidence about the target and fixates on it to gain more information about its possible identity.

² One potential challenge for the reactivation hypothesis is the following: If the reader's goal is to process a given word, why would they regress to another word, even if it is highly related, rather than continuing to fixate on the original word itself? One potential answer to this comes from the notion of cue integration. More robust representations of the underlying state of affairs can be derived when multiple cues that reflect complementary aspects of the environment are integrated together (Ernst & Bülthoff, 2004). Thus, even though regressing to a previous word may take effort, it may be the optimal strategy to achieve confidence about the current word, especially when that word is unlikely or the input is particularly noisy.
How do we know when one word causes a loss of confidence about a subsequent word? Again, we adopt an intuitive notion that confidence is impacted by probability. Consider two words w_t and w_s, in a situation where the reader has already read w_t, moved on, and is about to encounter w_s. We will say that confidence about w_t falls if the probability of w_t is less when w_s is known compared to when it is not known.

Modeling assumptions
We now argue that both reactivation and reanalysis can be naturally operationalized using different hypotheses about the probabilistic dependence between the source and target of a regression. Before we discuss our hypotheses in greater detail, however, we briefly describe and justify the foundational choices we make about the modeling space. Following extensive previous work in psycholinguistics and cognitive science, we assume that one fundamental capacity of the language system is to learn and keep track of the statistical regularities of language (Piantadosi, Tily, & Gibson, 2011; Saffran, Aslin, & Newport, 1996; Zipf, 1949). Specifically, we assume that humans have access to a probability distribution over strings of linguistic input, and, moreover, use statistics derived from this distribution during real-time language comprehension. We note that these modeling assumptions are not unique to our approach, and are shared among multiple well-developed theories in psycholinguistics, such as, e.g., surprisal theory (Hale, 2001; Levy, 2008).
The place where our modeling assumptions do differ from many previous contributions is in the fundamental probabilistic quantity we are interested in measuring. Much previous work has investigated the relationship between human behavior and the probability of an individual word given its preceding context, i.e., p(w_t | w_[1⋯t−1]). Instead, the fundamental quantity we are interested in is the joint probability of two words, given their surrounding context. In the subsequent sections, to make this quantity explicit and to make the surrounding context visually clear, we will write it as p(w_t, w_s | w_[1⋯t) □ w_(t⋯s) □ w_(s⋯n]). Example (1) gives a visual representation: Here, w_[1⋯n] = w_1 ⋯ w_n is a sequence of words, and any given word w_i is the realization of a random variable W_i which has support over a vocabulary 𝒱. In our interval notation, w_[1⋯t) = w_1 ⋯ w_{t−1} is the substring of words from the start of the sentence (inclusive) up until the target w_t (not inclusive), w_(t⋯s) = w_{t+1} ⋯ w_{s−1} is the substring of words from the target w_t (not inclusive) up until the source w_s (not inclusive), and w_(s⋯n] = w_{s+1} ⋯ w_n is the substring of words from the source w_s (not inclusive) up until the end of the sentence (inclusive).
When considering the content of w_(s⋯n] in the context of regressions, we must be careful. Often, readers will regress with no, or only partial, information about upcoming material, in which case w_(s⋯n] will be empty. However, readers can also regress on a second-pass reading of the sentence, in which case w_(s⋯n] would include all the words from the source word until the end of the sentence. For now, we will talk about w_(s⋯n] abstractly, and instantiate different possible values for it in our analyses.

The grey boxes, □, indicate the locations of masked words (or simply, masks), which are words whose position is known to the comprehender but whose identity is not. Cases of masked words abound in daily language use. For example, masked words may come about when a sound occludes a word during speech comprehension, or an object occludes the hands during sign comprehension. In the domain of reading, masks could come about when a reader skips over a word. To make the relations between terms clear, we will sometimes condition probabilities on a sequence of words with only one hole in it; for example, p(w_t | w_[1⋯t) □ w_(t⋯s) w_s w_(s⋯n]) gives the probability of w_t conditioned on all the other words in the sequence, and p(w_t | w_[1⋯t) □ w_(t⋯s) □ w_(s⋯n]) gives the probability of w_t conditioned on all the other words in the sequence, except for w_s.
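The interval conventions above can be made concrete with a short sketch (the sentence and the index choices are illustrative; indices are 1-based, as in the text):

```python
def decompose(words, t, s):
    """Split a 1-indexed word sequence around target index t and source
    index s (t < s), following the paper's interval notation: round
    brackets exclude an endpoint, square brackets include it."""
    assert 1 <= t < s <= len(words)
    before = words[: t - 1]     # w_[1...t): start (inclusive) to target (exclusive)
    target = words[t - 1]       # w_t
    between = words[t : s - 1]  # w_(t...s): between target and source, both exclusive
    source = words[s - 1]       # w_s
    after = words[s:]           # w_(s...n]: after source (exclusive) to end (inclusive)
    return before, target, between, source, after

sentence = "the novels in the library had started to go mouldy".split()
before, target, between, source, after = decompose(sentence, t=2, s=5)
print(target, source)  # prints: novels library
```

Note the off-by-one care required: because the intervals around w_t and w_s are open, the target and source words themselves appear in none of the three context substrings.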
We argue that the joint probability is an intuitive and interesting quantity to model. Just as people have opinions about what words are likely to come next given a prefix, so too do they have opinions about what pairs of words are jointly likely to fill in the grey boxes. For example, in the example above, some word pairs are intuitively likely fillers for the two boxes, while others are not. Such intuitions suggest that some function of the joint probability of words is easy to access for native language users. We want to note that we are not the first to discuss joint probabilities between words in the psycholinguistics literature: various studies have implicitly made these same modeling assumptions when using measures of probabilistic dependence between words to predict various linguistic phenomena such as syntactic dependencies (Futrell, Qian, Gibson, Fedorenko, & Blank, 2019) or word order (Futrell, Gibson, & Levy, 2020; Futrell & Hahn, 2022); and Qian and Levy (2022) have used materials like Example (1) in a sentence infilling task. However, they allow participants to fill in multiple words, thus looking at the joint probability of spans instead of words.
We now describe how to link the probability distribution p to our behavior of interest, namely regressions. One option, in our case, would be to simply use the raw joint probability itself, or other metrics of similarity such as the expected co-occurrence. However, here we opt to use pointwise mutual information (pmi) between words instead. We do this for several reasons: First, pmi has been used previously in the study of natural language (Futrell, 2019; Pimentel, Roark, Wichmann, Cotterell, & Blasi, 2021; Williams et al., 2020). Second, using pmi invites connections between the present work and Information Theory as well as Natural Language Processing (NLP), where it is a commonly used measure of word-to-word association (Church & Hanks, 1990; Jurafsky & Martin, 2000, Chapter 6). Formally, pmi is defined in the following way: Let X and Y be two random variables with event spaces 𝒳 and 𝒴, respectively, and with joint distribution p(x, y). The pmi between the events x ∈ 𝒳 and y ∈ 𝒴 is defined as the log ratio of their joint probability to what it would be under independence, i.e.

pmi(x; y) = log [ p(x, y) / (p(x) p(y)) ] = log [ p(x | y) / p(x) ] = log [ p(y | x) / p(y) ]    (2)
with the latter two expressions being equivalent by Bayes' Theorem. As the name suggests, pmi is a pointwise variant of mutual information (MI). The two have a tight relationship; specifically, the expectation of pmi gives us mutual information.³ Both mutual information and pmi are symmetric, i.e., pmi(x; y) = pmi(y; x). In contrast to mutual information, however, which is always non-negative, pmi can yield a negative value.
Instead of looking at pmi between abstract events x and y, from here on out we will discuss the pmi between words w_t and w_s, conditioned on their surrounding context. Taking the right-hand formulation of pmi from Eq. (2), the conditional pmi we are interested in is:

pmi(w_t; w_s) = log [ p(w_t | w_[1⋯t) □ w_(t⋯s) w_s w_(s⋯n]) / p(w_t | w_[1⋯t) □ w_(t⋯s) □ w_(s⋯n]) ]    (4)

The main theoretical hypothesis of this work is that just as language comprehension is sensitive to the conditional probability of a word, given its preceding context, so too is it sensitive to the conditional pmi between pairs of words in a sentence. In particular, we are interested in the predictive ability of pmi to explain regressions during naturalistic reading. In light of this, one additional reason we choose pmi over other metrics of probabilistic dependence, such as expected co-occurrence, is that pmi can be easily linked to concrete predictions of the reactivation and reanalysis hypotheses, which we turn to in the following subsections.

³ The formal definition of mutual information is as follows: MI(X; Y) = Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} p(x, y) pmi(x; y), i.e., the expectation of pmi under the joint distribution.
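These definitions can be checked numerically on a toy joint distribution (the probabilities below are made up for illustration; they are not model estimates):

```python
import math

# Toy joint distribution p(x, y) over two binary variables, with its marginals.
p_xy = {("a", "c"): 0.4, ("a", "d"): 0.1, ("b", "c"): 0.2, ("b", "d"): 0.3}
p_x = {"a": 0.5, "b": 0.5}
p_y = {"c": 0.6, "d": 0.4}

def pmi(x, y):
    # log p(x, y) / (p(x) p(y))
    return math.log(p_xy[(x, y)] / (p_x[x] * p_y[y]))

def pmi_cond(x, y):
    # Equivalent conditional formulation, log p(x | y) / p(x), via Bayes.
    return math.log((p_xy[(x, y)] / p_y[y]) / p_x[x])

assert abs(pmi("a", "c") - pmi_cond("a", "c")) < 1e-12

# MI is the expectation of pmi and is non-negative,
# while pmi itself can be negative for anti-associated events.
mi = sum(p * pmi(x, y) for (x, y), p in p_xy.items())
assert mi >= 0
assert pmi("a", "d") < 0  # log(0.1 / (0.5 * 0.4)) = log(0.5) < 0
print(round(mi, 4))  # prints 0.0863
```

The same ratio structure carries over to the contextual case: in Eq. (4) the two conditional probabilities play the roles of p(x | y) and p(x), with the surrounding sentence held fixed.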

The reactivation hypothesis
We start first with the reactivation hypothesis, which posits that readers regress to w_t because it cues or supports the source, w_s. We build on the intuitive notion of cueing introduced in Section 2, in which two words cue each other if knowing the identity of one makes the other more likely. That is, according to the reactivation hypothesis, the following should hold in cases where regressions happen:

p(w_s | w_[1⋯t) w_t w_(t⋯s) □ w_(s⋯n]) > p(w_s | w_[1⋯t) □ w_(t⋯s) □ w_(s⋯n])

By taking the log of both sides and transforming the difference into a ratio, this inequality can be re-written as the following:

log [ p(w_s | w_[1⋯t) w_t w_(t⋯s) □ w_(s⋯n]) / p(w_s | w_[1⋯t) □ w_(t⋯s) □ w_(s⋯n]) ] > 0    (6)

i.e., the log ratio of the conditional probability of the source in a context that includes the target compared to a context that does not include the target should be greater than zero. Looking back to Eq. (4), we can see that the left-hand side of this inequality is just the conditional pmi between w_s and w_t given the surrounding sentential context, i.e., pmi(w_s; w_t). What the reactivation hypothesis predicts, then, is that regressions should occur between words when the pmi between them is greater than zero, or, using the symmetry of the measure, when pmi(w_t; w_s) > 0.

However, notions of cueing are not typically binary. If, instead, we assume that the strength of a cue is proportional to the ratio in Eq. (6), the reactivation hypothesis predicts that not only should the pmi between a source and target be positive, but that higher values of pmi will correspond to stronger cues, and should therefore be more strongly associated with regressions. This leads to a natural interpretation of the reactivation hypothesis in terms of positive pointwise mutual information (ppmi). To derive the ppmi from pmi, one simply zeros out all the negative values of pmi, i.e., ppmi(w_t; w_s) = max(0, pmi(w_t; w_s)). What the reactivation hypothesis predicts, then, is that as the ppmi between w_t and w_s increases, so too will the number of regressions between these words.

The reanalysis hypothesis
Now we turn to the reanalysis hypothesis, which posits that regressions occur between w_s and w_t when the presence of w_s causes the reader to lose confidence about the identity of w_t. Using the probabilistic notion of falling confidence introduced in the previous section, we get that confidence falls when the presence of w_s causes w_t to be less likely:

p(w_t | w_s, c) < p(w_t | c)

Or, using the same logic as above, when:

log2 [ p(w_t | w_s, c) / p(w_t | c) ] < 0

Following this reasoning, the reanalysis hypothesis makes the opposite prediction of the reactivation hypothesis; namely, that regressions should occur when there is a negative pmi between two words.
As with cueing, notions of falling confidence are not typically thought of in purely binary terms. Under our probability-based operationalization, we will say that the degree to which confidence falls is proportional to the value of the pmi between w_s and w_t. Thus, we argue that under this framework, the reanalysis hypothesis predicts not only that the pmi between a source and target should be negative, but that lower values of pmi should be more strongly associated with regressions. This leads to a natural interpretation of reanalysis in terms of negative pointwise mutual information (npmi), where npmi(w_s; w_t | c) = min(0, pmi(w_s; w_t | c)). Contrasting with the reactivation hypothesis, the reanalysis hypothesis thus predicts that the probability of a regression should be a function of the npmi between the source and target.
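The two clipped variants can be written as one-line transforms. The following is a minimal sketch of our own (the function names are ours, not from any released codebase):

```python
def ppmi(pmi: float) -> float:
    """Positive pointwise mutual information: zero out negative pmi values."""
    return max(0.0, pmi)

def npmi(pmi: float) -> float:
    """Negative pointwise mutual information: zero out positive pmi values."""
    return min(0.0, pmi)
```

Note that for any value v, ppmi(v) + npmi(v) == v, so the two predictors partition pmi into its positive and negative ranges.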

A combined theory
As outlined so far in this section, the reactivation and reanalysis theories are mutually exclusive. In the former, pmi is associated with regressions only in the positive range, while in the latter, pmi is associated with regressions only in the negative range. However, as articulated in the literature, these theories are not necessarily in conflict. If both theories are correct, then regressions could vary as a function of both ppmi and npmi, possibly to different degrees. We explore this hypothesis in our subsequent studies.

Expectation vs. Realization
So far, we have discussed regressions between two words in a context where their actual realizations are known to the reader. However, this assumption may not always hold. For example, it may be the case that the reader skipped over w_t in their initial pass. Or it may be the case that the reader initially read w_t but has forgotten its identity due to memory limitations. In such a circumstance, readers would have to decide to regress not based on the actual identity of w_t, but rather on their expectations about w_t given the context.
Thus, in addition to the operationalization presented above, we formulate an expectation-based view of each theory. Under an expectation-based view, regressions are predicted between words that have a high expected value of pmi across all possible realizations of the target word, i.e., over all words in a vocabulary V. As noted above, if we were to take the full expectation over both w_s and w_t, this would be the mutual information. However, because we are taking the expectation with respect to just one of the words, we derive the half-pointwise mutual information, or the half-expectation of pmi. As this expectation is taken over the target, for simplicity, we will refer to it in text as E_t[pmi], where, depending on the theory, ppmi (in which case we will use E_t[ppmi]) or npmi (in which case we will use E_t[npmi]) is swapped in for pmi. In cases of extreme uncertainty, a reader may not only have low confidence about w_t, but also about w_s. As noted above, if we were to take the expectation over both the target and the source of a regression, this would give us the mutual information between the possible realizations of these two words. For our purposes, we assume that readers do know the identity of w_s, and we investigate only the half-pointwise mutual information in the following studies.
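As a sketch of how the half-expectation can be computed, the snippet below (our own illustration, with toy dictionaries standing in for a language model's distributions; w_s is the regression's source) averages pmi over candidate realizations of the masked target:

```python
import math

def pmi(p_source_given_target: float, p_source: float) -> float:
    """pmi as the log ratio of the source's probability with vs. without the target."""
    return math.log2(p_source_given_target / p_source)

def expected_pmi(target_dist, p_source_given, p_source):
    """Half-expectation of pmi, taken over possible realizations of the target.

    target_dist:    dict mapping candidate word -> p(word | context, target masked)
    p_source_given: dict mapping candidate word -> p(source | context, target=word)
    p_source:       p(source | context, target masked)
    """
    return sum(p_w * pmi(p_source_given[w], p_source)
               for w, p_w in target_dist.items())
```
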

Methodology
We assess the reactivation and reanalysis hypotheses for regressions by asking how well pmi-based statistics predict regressions in datasets of naturalistic reading. We estimate pmi values using a large, pretrained bidirectional language model. For each study, we define a set of variables that are hypothesized to impact regressions, x = [1, x_1, …, x_k]^T, and use them as predictors in a zero-inflated Poisson (ZIP) model trained to predict the total number of regressions that occur between pairs of words within a sentence. ZIP models (Lambert, 1992; Mullahy, 1986) assume that the observed data are produced by two regimes, one of which produces only zeros. They are therefore useful for modeling data with an inflated number of zeros, as is the case in our datasets, where readers do not regress between most pairs of words within the same sentence. Formally, ZIPs are two-component mixture models that consist of a point mass at zero and a Poisson distribution. Following Lambert (1992), the mixture between these two components is given by a logistic model:

p(y_{s,t} = y) = π · 1[y = 0] + (1 − π) · Poisson(y; λ)

The parameters of the Poisson and logistic components are:

λ = exp(β^T x),  π = logistic(γ^T x)

where 0 ≤ π ≤ 1; y_{s,t} ∈ Z≥0 is the total number of regressions that occur between w_s and w_t in a dataset (summed across all participants); β = [β_0, β_1, …, β_k]^T ∈ R^{k+1} are the parameters of the Poisson model; and γ = [γ_0, γ_1, …, γ_k]^T ∈ R^{k+1} are the parameters of the logistic model. Although the predictor variables x need not be shared across the Poisson and logistic models, we choose to share them for our studies.
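For concreteness, the ZIP likelihood can be sketched as follows. This is a minimal, self-contained illustration of the mixture, not the pscl implementation we actually fit:

```python
import math

def zip_pmf(y: int, lam: float, pi: float) -> float:
    """Probability of observing y regressions under a zero-inflated Poisson.

    pi:  mixture weight of the always-zero regime (the logistic component)
    lam: rate of the Poisson component
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # zeros can come from either regime
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson
```

With pi = 0 this reduces to an ordinary Poisson; larger pi inflates the mass at zero, matching data in which most word pairs are never linked by a regression.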
Each of our studies consists of two analyses. In the first, we simply fit a ZIP model on our data that includes pmi as well as a number of baseline predictors. Baseline predictors are properties of words that are known to have a broad impact on reading behavior, as shown in the prior literature on reading (Frank & Bod, 2011; Goodkind & Bicknell, 2018; Smith & Levy, 2013; Wilcox, Gauthier, Hu, Qian, & Levy, 2020; Wilcox, Pimentel, Meister, Cotterell, & Levy, 2023), or that have been used in previous statistical analyses of regressions (Bicknell & Levy, 2011; Lopopolo et al., 2019). We then inspect the sign of the coefficient associated with pmi for both the count and zero-inflated model components. Generally, we look for cases where the signs of the coefficients are flipped between the two components of the ZIP model. If, for example, the model fits a negative coefficient for the count portion and a positive coefficient for the zero-inflation portion, this would support the reanalysis hypothesis: as the pmi between w_s and w_t decreases, the number of regressions increases, while the likelihood that there are zero regressions decreases (and vice versa if the signs were opposite). Positive (or negative) coefficients for both portions of the ZIP model do not lead to a clear interpretation of a particular predictor variable. Taking pmi as an example, positive coefficients for both the Poisson and zero-inflation portions would mean that as the pmi between two words goes up, the number of regressions is expected to increase, compatible with the reactivation hypothesis, but also that the likelihood of no regressions increases as well, which is not consistent with the reactivation hypothesis. From here on out, we say that a predictor has a negative or positive effect based on its coefficient in the count portion of the model, but, following the logic above, only if the signs are flipped between the two model components.
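The decision rule just described (a clear effect only when the signs of the two components flip) can be summarized in a small helper; the sketch and its labels are ours:

```python
def effect_direction(count_coef: float, zero_coef: float) -> str:
    """Interpret a predictor's ZIP coefficients: only opposite signs across
    the count and zero-inflation components yield a clear direction, which
    is then read off the count component."""
    if count_coef > 0 and zero_coef < 0:
        return "positive"
    if count_coef < 0 and zero_coef > 0:
        return "negative"
    return "unclear"
```
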
While the analysis described above presents a good baseline, it only tells us about the overall direction of the effect of pmi. It therefore cannot tell us anything about the relative predictive power of each theory, and it is incapable of offering support for the combined theory suggested in Section 3.4. In addition to this baseline analysis, we thus present a second analysis designed to tease out the relative predictive power of the reactivation and reanalysis hypotheses. For this analysis, we define a baseline ZIP model with the same baseline variables as in our first analysis. In addition, we define a critical ZIP model that includes the baseline predictors plus one or more of our pmi-based predictors. Following previous studies that investigate naturalistic reading behavior (Frank & Bod, 2011; Goodkind & Bicknell, 2018), to assess the psycholinguistic power of our pmi-based predictors, we report the average by-word delta log-likelihood (Δllh) between baseline and critical models, where llh is the joint log-likelihood of a test dataset and Δllh is the mean by-token difference in log-likelihood between the critical and baseline models. A positive Δllh means that models that include pmi-based predictors obtain a higher likelihood on held-out data than baselines. We take positive Δllh as empirical support for the underlying theory associated with those predictors. In addition to asking whether Δllh is greater than zero, to compare different theories against each other, we also look at the difference in Δllh between our different ZIP models. If, for example, ZIP models that include ppmi consistently obtain higher Δllh than those fit with npmi (and ppmi has a positive coefficient in the count-based portion of the model), then we would take this as additional empirical support for the reactivation hypothesis.
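Computed naively, the comparison reduces to a mean of per-observation log-likelihood differences, as in this sketch (ours, for illustration only):

```python
def delta_llh(llh_critical, llh_baseline):
    """Mean per-observation difference in held-out log-likelihood between a
    critical model (baseline predictors + pmi statistics) and a baseline model."""
    assert len(llh_critical) == len(llh_baseline)
    diffs = [c - b for c, b in zip(llh_critical, llh_baseline)]
    return sum(diffs) / len(diffs)
```
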

Zero-inflated Poisson models
For ZIP models, the response variable is the total number of regressions between a pair of words across all participants in a given dataset. We choose to look at total regressions, as opposed to, say, the probability that a subject will regress, because this allows us to capture multiple regressions that may occur between the same words from a single participant. While multiple regressions or second-pass regressions between the same words are not common, neither are they rare in our data. For example, in the Provo dataset, about 3.5% of all regressions are second pass. The total dataset comprises all pairs of words within the same sentence, meaning we only look at within-sentence regressions. While longer, inter-sentential regressions are undeniably interesting, we choose to look at within-sentence regressions for two reasons: First, because the number of word pairs grows quadratically with sequence length, if we chose longer units of analysis (such as whole passages or texts) we would suffer from extreme data sparsity. Second, although some studies have investigated inter-sentential dependencies (e.g., Kennedy, Brooks et al., 2003), many psycholinguistic studies have analyzed regressions within single target sentences. Thus, given that a cutoff must be chosen, we believe that the sentence is an appropriate unit of analysis. We do, however, acknowledge that our inability to capture between-sentence regressions is a limitation of the current work.
In our first analysis, ZIP models are trained on our entire dataset. In the second analysis, ZIP models were trained on five different folds of data, fit using the zeroinfl function from the pscl library in R, and tested on held-out data. As baseline predictors we included: the number of words between the source and the target; the length (in characters) of both the source and the target; log10 unigram frequency as reported in Speer (2022); and surprisal estimated from a large autoregressive language model (see Section 4.3). We estimate surprisal with GPT-2 (Radford et al., 2019) for English and with mGPT (Shliazhko et al., 2024), a large multilingual variant of GPT-2, for the other languages.
We fit critical ZIP models with four pmi-based predictors: npmi, ppmi, E_t[npmi], and E_t[ppmi]. For npmi and E_t[npmi] we invert the signs, so that model coefficients can be interpreted similarly to those for ppmi and E_t[ppmi]. However, because npmi and ppmi include many zeros, their coefficients cannot be interpreted as linear trends. To remedy this, for these two critical models, we include an additional indicator predictor, which takes on a value of 1 if ppmi or npmi is 0, and 0 otherwise. This is a well-used strategy (see, e.g., He et al., 2014) that allows us to model the log-linear effect of ppmi and npmi separately from their presence or absence. In addition to ZIP models that include a single pmi-based statistic, we train two ensemble models: one fit with both ppmi and npmi, and one fit with both E_t[ppmi] and E_t[npmi]. These models instantiate the combined theory introduced at the end of Section 3.4. Note that our ensemble models include each pmi variant as a separate predictor. This allows the model to fit (potentially) different coefficients for the effects of ppmi and npmi, meaning that each predictor can have an impact on regressions, albeit of a different magnitude. For between-model comparisons, as well as comparisons between critical and baseline models, we report the result of a pairwise permutation test on 1000 samples using the jmuOutlier package in R. When reporting the statistical significance of the tests, we use standard star notation (i.e., * indicates p < 0.05; ** indicates p < 0.01; *** indicates p < 0.001).
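The zero-indicator strategy amounts to pairing each clipped pmi value with a flag, roughly as below (our own sketch; for npmi the sign is assumed to have been inverted already):

```python
def clipped_pmi_features(value: float):
    """Pair a clipped pmi predictor (ppmi, or sign-inverted npmi) with an
    indicator for exact zeros, so the log-linear effect of the predictor is
    modeled separately from its mere presence or absence."""
    return value, 1 if value == 0.0 else 0
```
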

Multicollinearity between variables
A high degree of collinearity between predictor variables makes the results of a regression model essentially impossible to interpret. To test for the presence of adverse multicollinearity in our data, we run two assessments. First, we compute the Pearson correlation coefficient between our variables. In general, we observe very low correlations, with |r| < 0.1. For a few variables, however, we do observe stronger correlations, including between frequency and length, as well as between surprisal and length. To assess whether these higher correlations were adversely impacting the interpretation of our ZIP models, we computed the Variance Inflation Factor (VIF) for each of our predictor variables using the check_collinearity function in the performance package in R. We find that for our English corpora all variables have VIF < 3.2, and for our multilingual dataset all variables have VIF < 3.7, which is well below the typical cutoff of 5. Thus, we conclude that the correlations that do exist between our predictor variables are not adversely impacting the interpretation of our ZIP models. For the full results and discussion of these analyses, see Appendix A.
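As a reminder of what the VIF measures: for predictor j, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the others. With exactly two predictors this reduces to 1 / (1 − r²), with r the Pearson correlation, as in the sketch below (ours, in pure Python, rather than the R performance package used in the analyses):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with exactly two predictors, R^2 is just r^2."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)
```

Uncorrelated predictors give a VIF of 1; strongly correlated predictors push the VIF past the usual cutoff of 5.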

Datasets
In Study One, we predict human regression behavior in three datasets of naturalistic reading in English: the Provo Corpus (Luke & Christianson, 2018), the UCL Corpus (Frank, Monsalve, Thompson, & Vigliocco, 2013), and the Dundee Corpus (Kennedy, Hill et al., 2003). The Provo Corpus includes data from 84 subjects reading 55 short paragraphs, or ≈3,000 words of English text. The UCL Corpus contains data from 43 subjects reading 205 individual sentences, totaling ≈5,000 words of English. The Dundee Corpus contains data from 10 subjects reading ≈55,000 words of newspaper editorials, which we downsample to include only the first five articles (≈7,600 words). Overall, the Dundee Corpus contains fewer participants than the others, but more reading data per participant. All subjects are self-reported native English speakers; however, Provo subjects are American English speakers, whereas subjects in the other two corpora are British English speakers. In Study Two, we predict human regression behavior from a subset of the Multilingual Eye Movement Corpus (MECO; Siegelman et al., 2022). MECO contains eye-tracking data on 12 simplified Wikipedia-style articles in 13 different languages, collected from between 29 and 54 native speakers of each language. Articles in the corpus went through an iterative translation and back-translation process by different teams of translators to ensure that content was the same across languages. We present results for a subset of six languages across three language families: German, Russian, Spanish, and Italian (Indo-European), Finnish (Uralic), and Turkish (Turkic). These languages were chosen because they have high-quality monolingual BERT-style models and are supported by mGPT, which we use to derive autoregressive surprisal estimates for baseline ZIPs.

Estimating PMI
We estimate the conditional pmi between w_s and w_t using a masked language model. In contrast with traditional autoregressive language modeling, where an algorithm is trained to predict the next word given a previous context, in masked language modeling, the model is given a whole sentence with some words' identities replaced by a special [MASK] token. The training objective of the model is to predict the identity of the masked word. The mask tokens can be thought of in the same way as our masked words, discussed in Section 3: as cases where the model knows a word is present in a particular sentence slot but is unaware of its identity. We derive the pmi by taking the difference in the log probability of w_s when w_t is unmasked vs. when w_t is masked. We turn pmi into our predictors of interest by simply zeroing out negative values (to produce ppmi) or positive values (to produce npmi). That is, following the definition of pmi presented in Eq. (4), for probabilities p_M derived from a masked language model:

pmi(w_s; w_t | c) = log2 p_M(w_s | c, w_t) − log2 p_M(w_s | c, w_t = [MASK])

This is the same method for deriving contextual pmi that was recently proposed and validated by Hoover, Du, Sordoni, and O'Donnell (2021). All of our language models use sub-tokenization schemes that split some words into smaller sub-word pieces; we exclude such words from our analysis. In Study One we use the base BERT model (Devlin, Chang, Lee, & Toutanova, 2019). In Study Two we use different BERT-style models trained on corpora in German (Chan, Schweter, & Möller, 2020), Spanish (Cañete et al., 2020), Finnish (Virtanen et al., 2019), Italian (Schweter, 2020b), Turkish (Schweter, 2020a), and Russian (Kuratov & Arkhipov, 2019). All of our models are implemented in the Huggingface Transformers library (Wolf et al., 2020).
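The estimation step can be sketched independently of any particular model. Below, score is any callable obeying the stated contract; a real implementation would wrap a BERT-style model, and the toy scorer in the test is purely hypothetical:

```python
import math

MASK = "[MASK]"

def pmi_mlm(score, tokens, s_idx, t_idx):
    """pmi(w_s; w_t | c) as a difference of masked-LM log probabilities:
    log2 p(w_s | c, w_t) minus log2 p(w_s | c, w_t masked).

    score(tokens, i) is assumed to return p(original word at position i |
    tokens, with position i itself masked).
    """
    with_target = list(tokens)          # target visible to the model
    without_target = list(tokens)
    without_target[t_idx] = MASK        # target hidden from the model
    return (math.log2(score(with_target, s_idx))
            - math.log2(score(without_target, s_idx)))
```
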
We would like to highlight three features of this strategy for deriving pmi values. First, the interpretation of pmi as a difference in log probabilities gives an intuitive understanding of the metric: pmi tells us how much knowledge of w_t reduces the information content (i.e., surprisal) of w_s. If the pmi is positive, learning the identity of w_t makes w_s less surprising in context; if the pmi is negative, the opposite is true, i.e., learning the identity of w_t makes w_s more surprising. Second, unlike traditional methods for estimating pmi, ours is context dependent. In count-based methods for estimating the pmi between two words, one compares the total number of times the two words appear in a corpus to the number of times they occur in the same context, which is often taken to be a single sentence, paragraph, or document. This method is blind to the position of the words in the context and the relationship between them. Because we feed the model inputs with [MASK] tokens at specified locations, our pmi values between two words are conditioned on those words' locations within the sentence. We do not bring this up to say that our method is necessarily better than count-based methods, but simply to highlight the differences. Third, we want to emphasize that pmi is a relative metric. The absolute probability of w_s does not matter for deriving pmi, only its relative probability given the presence or absence of w_t. To give a brief, toy example, let us take an extremely frequent English word, a, in the context where it is the first word in a bigram. In such a context, a will have high contextual pmi with nouns because, for a given bigram, if you know that the second word is a noun, it is likely that the first word is the indefinite determiner. For example, given the input a dog, BERT assigns pmi(a; dog) of 3.9, meaning that a is 3.9 bits less surprising if we know that it is followed by dog. However, a will have negative pmi with verbs because, if you know that the second word of a bigram is a verb, it is unlikely that the first word is a. If we give BERT the input a see, for example, it assigns a pmi of −0.27. Note that if we reverse these inputs, then the pmi flips: BERT assigns a pmi of −0.87 to the input dog a, but a pmi of 1.2 to the input see a. These toy examples are a far cry from the full sentences we input to models in our studies; however, they illustrate two key properties of our metric: pmi is determined by relative and not absolute word probabilities, and our version of pmi is sensitive to word order.

Masking settings
The procedure described above produces one serious discrepancy between model estimates and the information available to people when they choose whether or not to regress. Consider the sentence The dog chased the cat around the yard and barked at it. Here, the conditional pmi between dog and cat is likely impacted by the presence of the subsequent word barked. However, people read sentences incrementally and thus cannot factor future information into decisions about whether or not to regress. To account for this, we derive model estimates in three settings, examples of which are given in Eq. (12).
In the allRC setting, we give the full right context to the model. In the dropRC setting, we truncate the sentence at the source word. And in the maskRC setting, we mask all words after the source word. In the maskRC setting, the model is thus aware that the sentence continues for a known number of words after the source word, but unaware of the individual word identities; in the dropRC setting, the model does not have such information. In Study One we derive model estimates in all three settings. As results do not change much between settings, in Study Two we derive model estimates only in the maskRC setting.
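The three settings amount to three transformations of the token sequence to the right of the source word, as in this sketch (ours; token-level, ignoring sub-word pieces):

```python
MASK = "[MASK]"

def right_context_variants(tokens, source_idx):
    """Build the three model inputs for a given source position: allRC keeps
    the full right context, dropRC truncates after the source word, and
    maskRC replaces every word after the source with [MASK]."""
    all_rc = list(tokens)
    drop_rc = tokens[: source_idx + 1]
    mask_rc = tokens[: source_idx + 1] + [MASK] * (len(tokens) - source_idx - 1)
    return all_rc, drop_rc, mask_rc
```
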
To give a sense of our models' behavior, Fig. 1 shows the pmi and expected pmi under each of our three masking variants for a Main-Verb/Reduced-Relative-Clause (MVRR) garden-path sentence. This is the same sentence given in Example (2). There are a couple of things to note: First, we tend to find strong positive values of pmi and E_t[pmi] between words that are close to each other, i.e., along the off-diagonal in Fig. 1. pmi and E_t[pmi] are especially strong between a noun and a function word that immediately precedes it, such as in a portrait or its beauty. This is because the presence of the noun severely constrains the space of possible preceding words, making the probability of a function word much higher. One possible concern is that because regressions most frequently occur between neighboring words, ppmi may appear to be an overly good predictor; this is why distance is included as a baseline regressor. Second, we find negative pmi between was and many of the other words in the sentence, with especially low pmi between was and painted, as we might expect given that these two words serve as the ambiguous region and the disambiguator of the garden path. Looking at the differences between our masking setups (i.e., between the three columns in Fig. 1), we find that the three are roughly comparable, although the model does seem to produce larger-magnitude estimates in the dropRC and maskRC cases. Overall, the outputs in Fig. 1 provide a basic validation of our modeling approach.

Regression accuracy
One additional concern with the current approach is that it assumes that readers successfully land on the intended targets of their regressions. That is, we measure the pmi between w_s and w_t because we assume that w_t is the word that the reader intends to reinspect. While it is true that readers are capable of making highly accurate saccades (Kennedy, Brooks et al., 2003), it is also the case that many saccades undershoot or overshoot their intended targets. For example, when regressing to the beginning of a line, readers often need to make a short, corrective saccade in order to fixate on the first word of the line. To check for the potential impact of undershooting and overshooting on our results, we conduct a version of Study One where, instead of using raw pmi(w_s; w_t), we use two aggregate pmi-based metrics, based on a one-word window around w_t. In the first, we re-run the analysis using the mean pmi of the window around w_t. In the second, we use the pmi value with the maximum absolute value in the window around w_t, while preserving its sign. We find that these aggregations do not change the results, and thus we report the non-aggregated pmi(w_s; w_t) values in the main body of the text. These additional analyses can be found in Appendix B.
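The two window aggregates can be sketched as follows (our own illustration; values holds the pmi scores in the one-word window around the presumed target):

```python
def window_mean(values):
    """Mean pmi over the window around the presumed regression target."""
    return sum(values) / len(values)

def window_signed_max(values):
    """Largest-magnitude pmi in the window, with its sign preserved."""
    return max(values, key=abs)
```
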

Effect of PMI
In this first analysis, for each corpus, we fit ZIP models on the baseline variables plus pmi and inspect their coefficients. To reveal patterns that are invariant to the choice of masking setting, we pool the data across masking settings and include a random intercept for masking type (allRC, maskRC, and dropRC). The results are consistent across corpora: For Provo, we find that the count portion of our ZIP model fits a positive coefficient for pmi (β = 0.37, p < 0.001) and the zero-inflation portion fits a negative coefficient (β = −0.2, p < 0.001). For UCL, the pmi coefficient in the count portion of the model is, again, positive (β = 0.02, p < 0.001) and the zero-inflation coefficient is negative (β = −0.3, p < 0.001). And the results are similar for Dundee: the pmi coefficient in the count portion is positive (β = 0.02, p < 0.001) and negative in the zero-inflation portion (β = −0.09, p < 0.001). This means that, as the pmi between two words increases, the number of regressions between them is predicted to increase. And, for Provo and UCL, as the pmi decreases, the likelihood that there are no regressions increases. These results are thus compatible with the reactivation hypothesis for regressions, which predicts that higher pmi between two words should be positively correlated with regressions.

Comparing PPMI and NPMI
The results from our second analysis comparing ppmi and npmi can be seen in Fig. 2, with models that include ppmi in red, npmi in teal, and ensemble models in grey. We also present fitted coefficients for the pmi-based predictors in the sub-plots. The results are consistent across corpora and masking types: ppmi and E_t[ppmi] lead to significantly positive Δllh in every case except one (p > 0.05 for UCL/maskRC; p < 0.05 for UCL/dropRC; p < 0.001 otherwise). However, results are not consistent for npmi and E_t[npmi]: In the UCL corpus, E_t[npmi] and npmi do not lead to positive Δllh in any case, and the latter actually leads to negative Δllh in the maskRC setting. In the Provo corpus, npmi did lead to positive Δllh for all masking settings (p < 0.001). Additionally, E_t[npmi] did lead to extremely small but significant increases in Δllh in two cases (dropRC, p < 0.01; maskRC, p < 0.001). However, the coefficient in the count-based portion of the regression model was not found to be positive across folds of data: the 95% confidence intervals for E_t[npmi] cross zero for all masking settings. This means that the results for E_t[npmi] cannot be interpreted as providing straightforward evidence in favor of the reanalysis hypothesis. Finally, in the Dundee corpus, npmi led to positive Δllh in two settings (dropRC and maskRC, p < 0.01), but E_t[npmi] did not lead to positive Δllh in any setting. One other way to investigate the impact of npmi and E_t[npmi] is to ask whether ensemble models, which include these terms as additional predictors, obtain higher Δllh than ppmi-only models. Here, the results are also mixed. We find that ensemble models have higher Δllh than ppmi-only models for Provo in the dropRC and maskRC settings (p < 0.001), for UCL in the allRC and maskRC settings (p < 0.001), and for Dundee in all settings (p < 0.05). We find that ensemble models have higher Δllh than E_t[ppmi]-only models for the Provo corpus in the dropRC setting (p < 0.01) and for UCL in the allRC setting (p < 0.01). In total, 9/18 ensemble models have significantly higher Δllh than the variant that includes only ppmi or E_t[ppmi], which, again, provides only mixed evidence in favor of the reanalysis hypothesis.
Comparing ppmi to E_t[ppmi], we find that models with E_t[ppmi] obtain significantly higher Δllh in all settings (p < 0.001), suggesting that readers are sensitive to expectations of pmi. Overall, these results paint a picture in which the ppmi, and especially the E_t[ppmi], between words is predictive of regressive saccades, whereas npmi and E_t[npmi] do not consistently help to predict regressions. These results are, therefore, compatible with the results presented in Section 5.1, which also supported the reactivation hypothesis for regressions.

Baseline model coefficients
Fig. 3 shows the coefficients for the baseline predictors (frequency, length, surprisal, and distance) in our ZIP models. We show averages across all models that include the variable, and we scale values so they are comparable across predictors. Although results are fairly consistent between different models and masking types, we do observe much larger errors for the Dundee models. This is likely due to the relatively sparse nature of that dataset, which results in more variation across the different folds of our train/test splits and therefore more variation in the models' coefficients. That being said, a number of high-level trends are clear: First, we find a relatively large positive effect of surprisal for both source and target words, indicating that as the predictability of either word goes down, the likelihood of a regression between them increases. For frequency, we find a negative effect across all corpora, but only for the source word, indicating that low-frequency words are associated with regression initiation. For length, we find a mixed result, with positive effects only in Provo and Dundee and only for source words. Finally, for distance, we find negative effects, which are about an order of magnitude stronger than those of our other predictors.

Discussion of results
The positive coefficient for pmi, as well as the consistent, positive Δllh for ppmi and E_t[ppmi], supports the reactivation hypothesis for regressions. Additionally, the lack of an obvious advantage for ensemble models does not support the combined theory for regressions, suggesting that it is mainly ppmi or E_t[ppmi] that drives Δllh in these models. Finally, the highest Δllh was achieved by models with E_t[ppmi], supporting the expectation-based version of the reactivation hypothesis articulated in Section 3.5.
In addition to the novel pmi-based statistics, this study tested several baseline predictors that have been previously investigated in the literature, largely confirming previous findings. We place our discussion of these results here, rather than in the general discussion, because previous work has focused largely on English. Starting with cases where the prior literature is in consensus: prior work has found that the target words of regressions tend to have low predictability (Kliegl et al., 2004; Rayner et al., 2004) or higher surprisal (Bicknell & Levy, 2011), which is in line with our positive effect of target surprisal. Prior work is mixed, however, when it comes to the surprisal of the source: Bicknell and Levy (2011) find a negative effect of predictability (i.e., a positive effect of surprisal), but Lopopolo et al. (2019) find a negative effect of surprisal. Although the magnitude of the surprisal effect is greater for targets than for sources, we do find a significantly positive effect of source-word surprisal for all corpora and probability-estimation strategies, which supports the results of Bicknell and Levy (2011).
Prior studies are mixed when it comes to frequency and length. Focusing first on target words: two studies have found no effect of target frequency (Kliegl et al., 2004; Rayner et al., 2004), while another has found a positive effect (Bicknell & Levy, 2011). Our results are mixed as well; we find negative effects for target-word frequency only in Provo. For word length, some studies have found that regressions target short words (Engbert et al., 2005; Kliegl et al., 2004) and others that they target longer words (Bicknell & Levy, 2011; Vitu & McConkie, 2000). The results of this study, again, align with the mixed nature of the prior literature: we find no consistent results for target-word length. Effects of source words on regressions are less common in the literature. Our results support the negative effect of source-word frequency found previously by Lopopolo et al. (2019). Note that Bicknell and Levy (2011) find a positive effect of source-word frequency, although the result is not significant. Overall, where previous studies agree, our results are in line with previous work (e.g., on the positive effect of surprisal and the negative effect of source-word frequency).
In sum, these results suggest that regressions are likely to occur when the source or target word is difficult to process (i.e., when it has high surprisal, is long, or has low frequency), although this seems to be more critical for the source word than for the target word.

Study two: Regressions in multiple languages
In this study, we ask whether the results obtained for English hold up in a broader sample of languages. We investigate German, Spanish, Finnish, Italian, Turkish, and Russian, which represent three different language families: Indo-European, Uralic (Finnish), and Turkic (Turkish). For this study, we use the maskRC setup from the previous study. We chose this setup because it is, intuitively, the closest to human language processing: People receive only limited information about their right context via parafoveal vision (Schotter, Angele, & Rayner, 2012); however, they are aware that the sentence continues, a situation that is better captured by the maskRC setting than by the dropRC or allRC settings.

Effect of PMI
For this first analysis, we fit a ZIP model with baseline variables plus pmi. We pool data from across all our languages and add a random by-language intercept. The results are consistent with those from the English-only study. We find a positive coefficient for pmi in the count-based portion of the model (β = 0.07, p < 0.001), and a negative coefficient for the zero-inflation portion of the model (β = −0.16, p < 0.001). As before, these results support the reactivation hypothesis for regressions.
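As an illustration of how the two portions of such a model interact, the ZIP likelihood can be sketched as follows. This is a toy sketch: the assumption that predictors enter through log and logit links is the standard ZIP parameterization, not a claim about the exact implementation used in these studies.

```python
import math

def zip_pmf(y, lam, pi):
    """P(Y = y) under a zero-inflated Poisson.

    pi  -- probability of a 'structural' zero (zero-inflation portion)
    lam -- Poisson rate (count-based portion)
    """
    poisson = math.exp(-lam) * lam**y / math.factorial(y)
    if y == 0:
        return pi + (1 - pi) * poisson  # excess zeros mix with Poisson zeros
    return (1 - pi) * poisson

# In the reported models, both portions depend on the predictors, e.g.
#   log(lam)  = b0 + b1 * pmi + ...   (count-based portion)
#   logit(pi) = g0 + g1 * pmi + ...   (zero-inflation portion)
# so a positive count coefficient and a negative zero-inflation coefficient
# for pmi both point the same way: higher pmi, more regressions.
```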

Effect of NPMI and PPMI
The results from this study can be seen in Fig. 4. The presentational paradigm is the same as in the first study, except that the different languages are shown in the facets. Consistent with the results from the previous study, we find that ppmi and E[ppmi] lead to a significant increase in Δllh across all languages tested (p < 0.001 in all cases). However, npmi and E[npmi] do not consistently lead to better predictive power in the models. For E[npmi] we find positive Δllh in 4/6 languages (German and Spanish, p < 0.001; Italian, p < 0.01; Turkish, p < 0.05), and for npmi we find positive Δllh in only one language (German, p < 0.001). Also, as before, ensemble models do not show consistently higher Δllh than models with only ppmi or E[ppmi]: Compared to E[ppmi], ensemble models are better for three languages (German and Italian, p < 0.001; Finnish, p < 0.05), and compared to ppmi, ensemble models are better in only one language (Italian, p < 0.01). Models with E[ppmi] have higher Δllh than those with vanilla ppmi in all languages, and the difference between them is significant in four: Spanish, Italian, and Turkish (p < 0.001), and Russian (p < 0.01). In sum, as with Study One, we take these results as broadly supportive of the reactivation hypothesis, while providing only mixed support for the reanalysis hypothesis.
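The significance statements above come from paired permutation tests over per-fold Δllh values. A generic sketch of such a test (sign-flipping the paired differences under the null) is shown below; the inputs in the usage example are made up and do not correspond to any reported result.

```python
import random

def paired_permutation_test(diffs, n_permutations=10000, seed=0):
    """Two-sided p-value for H0: the mean of paired differences is zero.

    diffs -- per-fold differences between two models' held-out
             log-likelihoods (one value per fold)
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Under H0 the sign of each paired difference is arbitrary.
        flipped = [d * rng.choice((-1, 1)) for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / n_permutations
```

With consistently positive per-fold differences the test returns a small p-value; with differences centered on zero it does not.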

Model coefficients
The coefficients for the baseline variables of our ZIP models are shown in Fig. 5. Results vary somewhat between languages but are largely in line with the results from Study One. The biggest difference between this study and Study One is that, here, we observe a much smaller effect of distance. While the distance coefficient was a full order of magnitude larger than any other predictor in Study One, its magnitude is relatively small in this study, and it is positive for Russian in the count portion of the model. In line with Study One, we find positive effects of surprisal for a regression's target, as well as negative effects of source-word frequency, in every language. Effect directions are mixed for target-word frequency, as well as for word length, which is not necessarily surprising given the mixed findings in the prior literature testing these variables.

Discussion of results
We believe that the crosslinguistic consistency evident in Fig. 4 is rather striking. It provides compelling evidence that the same information-theoretic principles are at play during language processing in multiple languages, suggesting that readers regress between words that are contextually associated with each other. There are, however, some intriguing differences between these results and those from Study One. Primarily, these have to do with differences in the coefficient estimates between languages. First, we observe positive coefficients for source-word surprisal in English. In MECO, however, the effect of source-word surprisal is much weaker, being positive in only 3/6 languages. Second, we observe that distance has a much weaker effect in MECO than in our English-language corpora. In Study One, we found an effect of distance that was a full order of magnitude larger than any other predictor. In this study, distance was still found to have a negative effect in most languages, but its magnitude was relatively low. We hypothesize that this has to do with wrap-up effects. Wrap-up effects are regressions or longer reading times that occur at the ends of sentences and are thought to be caused by the reader consolidating the information contained in the sentence before moving on to the next one (Just & Carpenter, 1980; Meister, Pimentel, Clark, Cotterell, & Levy, 2022; Warren, White, & Reichle, 2009). Because wrap-up regressions are, by definition, initiated from sentence-final positions, they are likely longer, on average, than non-wrap-up regressions, which can also be initiated from the middle or beginning of a sentence. We hypothesize that, due to the nature of its materials (longer, Wikipedia-style articles), MECO contains a larger portion of wrap-up regressions, changing the relative effect of distance on our results. To test this hypothesis, we re-ran the analysis of Study Two excluding all regressions that originated from sentence-final words. After this exclusion, we observed a negative effect of distance in every language, with larger coefficients for distance, providing some support for this hypothesis. That being said, the new coefficients are all roughly of the same order of magnitude as the original coefficients, meaning there is still some unexplained difference between the multilingual results and the English-only results. We note that, after this exclusion, the overall pattern of main results did not change; models with ppmi and E[ppmi] still showed consistently positive Δllh, whereas models with npmi and E[npmi] did not.

General discussion
The main results from our study can be summarized as follows: First, we find clear and consistent evidence in favor of the reactivation hypothesis. pmi is positively correlated with regressions, and ppmi and E[ppmi] help to predict regressions above baselines in all settings and languages tested. Second, we do not find consistent evidence in favor of the reanalysis hypothesis. In particular, npmi and E[npmi] help to predict regressions above baselines in fewer than half of the settings tested (i.e., the combinations of masking variants and corpora in Study One, and the combinations of languages in Study Two). Third, we found that including E[ppmi] leads to larger Δllh than vanilla ppmi in most cases, which is compatible with the hypothesis that readers are, in some cases, sensitive to the expectation of ppmi. Finally, for the baseline variables, we found consistent effects across datasets and models for distance, surprisal, and frequency of the source word. Below, we discuss each of these results in reverse order.

Effect of length, frequency, surprisal, and distance
Although we are primarily concerned with the ability of our pmi-based statistics to predict regressions during reading, it is important to take stock of our baseline variables as well. Here, we will refer to an effect as ''consistent'' if its associated coefficient has the same sign across models, datasets, and languages, and as ''strong'' if the magnitude of the scaled coefficient is large relative to those of the other predictor variables. Overall, we found that distance produced the strongest and most consistent effects of any baseline variable, indicating that the more distal two words are from each other in the sentence, the fewer regressions occur between them. Besides distance, in both Studies One and Two, we found consistent positive effects of target-word surprisal as well as consistent negative effects of source-word frequency. These results indicate that infrequent words are likely to trigger regressions and that regressions are likely to land on words that are surprising given their context. Additionally, we found consistent effects of source-word surprisal in our English-language corpora; however, this effect was not consistent in the MECO corpus. Largely in line with previous studies, we found mixed results for word length, as well as for target-word frequency, suggesting that these properties are not responsible for driving regressions.

The role of expectations
We hypothesized that readers may not have perfect knowledge of previous material and, thus, may decide to regress based on the expected identity of w_t, given the context. In our studies, we find that including E[ppmi] in ZIP models results in higher values of Δllh compared to ppmi, and that this difference is significant in all settings and corpora in Study One, as well as in 4/6 languages in Study Two, suggesting that, indeed, expected values of ppmi are better predictors of regressions. These results are in line with, and supportive of, theories that characterize language processing as something that happens under uncertainty. Our findings suggest that, in some cases, readers may be relying on partial knowledge about previous word identity, either because they initially skipped over that word, or because of imperfect memory representations during language comprehension. That being said, there are two points worth clarifying: First, we do not mean to suggest that readers rarely commit to words as they read and rely only on expectations. There is a large body of experimental evidence indicating that readers often strongly commit to a word identity, sometimes even before fixating on the word itself. Further, maintaining high levels of uncertainty over all words in a sentence would likely be too taxing, due to the large amount of information that would need to be stored in memory. Second, in the studies presented here, we only explore two possible strategies: one that involves no uncertainty (i.e., uses ppmi) and one in which readers maintain a distribution over all possible words. However, it is likely that maintaining this whole probability distribution in mind would, again, be too taxing on memory, as it would involve remembering the probabilities for thousands (or tens of thousands) of extremely low-likelihood words. Therefore, it is likely that what readers represent is an annealed or re-normalized probability distribution, for example over the top k candidates. Such annealed probability distributions have been observed to best explain other aspects of human reading behavior. For example, Pimentel et al. (2023) find that such skewed probability distributions are better for explaining a reader's choice to skip words. Exploring the space of such resource-limited strategies in greater detail should form the basis for future work.
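The contrast between the full expectation and a top-k renormalized ('annealed') one can be sketched as follows. This is a toy illustration: the candidate distribution and the pmi function are hypothetical, standing in for a language model's beliefs about the target word's identity.

```python
def ppmi(pmi_value):
    """Positive PMI: negative associations are clipped to zero."""
    return max(pmi_value, 0.0)

def expected_ppmi(candidates, pmi_fn, top_k=None):
    """E[ppmi] over a reader's belief about the target word's identity.

    candidates -- dict mapping candidate word -> probability
    pmi_fn     -- function returning pmi(candidate, source) (hypothetical)
    top_k      -- if set, truncate to the k most likely candidates and
                  renormalize, mimicking a resource-limited reader
    """
    items = sorted(candidates.items(), key=lambda kv: -kv[1])
    if top_k is not None:
        items = items[:top_k]
    total = sum(p for _, p in items)  # renormalize after truncation
    return sum((p / total) * ppmi(pmi_fn(w)) for w, p in items)
```

Under truncation, only the most probable candidates contribute, so the statistic is dominated by the reader's best guesses rather than the long tail of low-likelihood words.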

Implications for reanalysis
Initially, it may seem that our results are incompatible with the reanalysis hypothesis for regressions: we find that npmi and E[npmi] are not consistently helpful for predicting regressions in English or in our multilingual corpus. But rather than taking this as disconfirmatory evidence, we believe our results are in line with the contemporary perspective on regressions that views them as an available, but not required, strategy during reanalysis. Ever since Frazier and Rayner's (1982) initial study, evidence from targeted experiments has demonstrated that when readers need to reanalyze a sentence due to syntactic ambiguities, only a subset will make use of regressions, and only a subset of those regressions may be targeted. Some readers may simply regress to a previous point in the sentence from which they can commence re-reading (e.g., to the sentence's beginning, although not necessarily). These data suggest that, while present, targeted regressions used for syntactic reanalysis are relatively rare. Our results extend these findings beyond the case of syntactic reanalysis. They are consistent with the idea that when the presence of a new word causes a loss of confidence about a previous word, w_t, readers may sometimes regress to reinspect w_t. However, these types of regressions are not common enough to make npmi or E[npmi] broadly predictive of regressions during reading.
In sum, evidence from targeted studies suggests that the need for reanalysis triggers regressions, leading to the hypothesis that the need to reanalyze could be broadly predictive of regressions during reading. Our results provide compelling evidence against this ''broadly predictive'' version of the hypothesis. However, they leave the door open for perspectives that view regressions as a strategy used occasionally for selective reanalysis.

Implications for reactivation
The main finding of our studies is that pmi is positively correlated with regressions and that ppmi and E[ppmi] are predictive of regressions in natural reading. In fact, we observed only a single case where adding ppmi or E[ppmi] to the regression model did not increase its predictive power above a baseline. These results offer firm support for the reactivation hypothesis, suggesting that readers regress to reinspect previous material that is associated with the word currently being processed. In this section, we expand on this result, discussing two perspectives on why readers might regress between associated words. We will assume that, when people fixate on a word during reading, they activate several candidates for its identity, and may maintain distributions over these candidates even after they have moved on to process subsequent words. Both of the explanations discussed below view regressions as confirmatory and as a response to rising candidate probabilities. That is, readers regress because, after taking in some new information, one particular word candidate has a higher probability than it did previously, and the reader is using the regression to confirm that this new candidate is, indeed, correct. The two perspectives differ in terms of where the 'rising candidate' comes from.
The first perspective, which we discussed in Section 2.1 when the reactivation hypothesis was introduced, is that regressions occur between words with high pmi because the regression's target can serve as a cue for the word currently being processed. After a reader has initially activated a number of candidates for a word's identity, regressions can be used to build confidence in favor of one of the candidates, by reactivating a word that boosts its relative probability. In this case, we assume that the reader already has high confidence about the previous word, and is using the regression in response to a rising candidate associated with new words. The rising candidate is thus located at the source of the regression.
In contrast, the second perspective associates the 'rising candidate' with the regression's target. When the reader identifies a new word, it causes a previous word candidate to gain probability, and the reader then regresses to check that this candidate is, indeed, correct, by inspecting the word a second time. Thus, regressions do not occur when the reader has high confidence about the previous word, but rather when they initially had low confidence, which has now been increased by subsequent information. This explanation is very similar to the one offered by Bicknell and Levy (2011); however, rather than a response to falling confidence, it assumes that regressions are triggered in response to rising confidence about a previous word's identity. We note that these two explanations of where the 'rising candidate' is located are not contradictory, and that both are likely contributing to the correlations observed in our studies.
We close this section by briefly discussing our results in light of recent work on the role of pmi in online language comprehension more generally (Futrell et al., 2020). By incorporating models of memory degradation into theories of online processing, Futrell et al. predict a linguistic pressure to maximize information locality. They cash this out in terms of pmi, suggesting that words with high pmi should cluster together, as this allows nearby words to reduce each other's processing burden. Futrell et al. show that this general principle of information locality is predictive of phenomena such as adjective ordering, and can be used as a formal underpinning for other linguistic theories, such as dependency locality effects (Gibson, 1998). Empirically, we observe the predictions of this theory in English in Fig. 1, where adjacent words tend to have large positive values of pmi. We believe that our results are broadly supportive of this theory and suggest that one way of understanding regressions is as a tool for artificially inducing information locality. When reading a text, it is not always the case that words that cue each other will be adjacent, and therefore cueing words may have weak representations in memory. Regressions are one way to reactivate these representations, allowing the reader to process words with high pmi at the same time, even if they are spatially distal on the page.

Conclusion
The information-theoretic interpretation offered here contextualizes and advances the state of theories of regressions by forging explicit links to the literature on information theory and by providing a quantitative interpretation of what has typically been framed in purely qualitative terms. Our operationalization greatly increases the predictive power of previous theoretical explanations of regressions, enabling each to make predictions about the frequency of regressions between word pairs in arbitrary sentences. Our contributions create a setting conducive to direct comparison between theories, something which has been lacking in the prior literature. Empirically, we measure the predictive power of each theory in a variety of novel settings, greatly expanding the scope of languages and datasets tested. The crosslinguistic consistency of our results raises the intriguing possibility that the cause and purpose of regressions are universal across reading in different languages. At the same time, the studies presented here have several limitations and raise several unanswered questions, which should serve as the basis for future work in this area. In particular, we have studied patterns of regressions in preexisting corpora, rather than in controlled experimental settings. Balancing the corpus-based approach presented here with studies that include causal manipulations would be a logical next step in validating and extending our conclusions. Additionally, although we do observe strong crosslinguistic consistency, there is some subtle variation between languages. For example, we observe that ppmi produces higher Δllh in German and Finnish compared to other languages, suggesting that it is more useful for predicting regressions in these languages. Whether these findings are due to the specifics of the MECO dataset or to broader crosslinguistic trends is another question that demands further investigation.
of the correlations are quite low. All this being said, we cannot infer whether these correlations are affecting the interpretation of our ZIP models from the Pearson correlation coefficients alone. Therefore, we also compute the Variance Inflation Factor (VIF) for our predictors.
The VIF for the jth predictor is defined as VIF_j = 1 / (1 − R_j^2), where R_j^2 is the coefficient of determination one obtains from regressing the jth predictor variable against all the others (Wooldridge, 1996, Chapter 3). Generally, a VIF of 1 is taken to imply that variables are uncorrelated, a VIF of up to 5 is taken to imply moderate, though tolerable, correlations, and a VIF > 5 is taken to imply adverse collinearity, although sometimes a cutoff of 10 is also used. Regardless, we observe VIF values below 4 for all variables; these are given in Table 1.
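This computation can be sketched directly from the definition above; the implementation below is a generic one, not the analysis code used in the paper.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of design matrix X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of
    determination from regressing column j on all remaining columns
    (with an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

Independent predictors yield VIF values near 1; a near-duplicate column drives its VIF far above the cutoff of 5.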

A.2. MECO
The Pearson correlation coefficients for the predictor variables in MECO are shown in Fig. 7, with each language on a different facet. As with the English data, we observe moderate correlations between surprisal, frequency, and length for a given word (both source and target). We report the Variance Inflation Factors for the predictor variables in Table 2. As with English, we observe VIF values well below the typical threshold of 5.

Appendix B. Aggregation study with window
As mentioned in the main body of the text, by measuring the pmi between w_s and w_t directly, we make the assumption that the word on which the reader lands is their intended target. However, while readers can make very accurate saccades, they may also frequently undershoot or overshoot their targets. To account for this potential confound, we re-run Study One with two different aggregations. In the first, we take the mean pmi between the trigger and a three-word window around the regression's target. In the second, we take the maximum absolute value of the pmi between the trigger and the same three-word window. (While we ignore the sign when computing the max, we preserve the sign of the pmi value in the subsequent analysis.) The idea here is that regressions to w_t likely include regressions intended to land on w_t, as well as regressions intended to land on w_{t+1} or w_{t−1}; these aggregate metrics should therefore give us a more accurate measure of why a reader chooses to regress to the particular spatial location associated with w_t. The results of this analysis can be seen in Fig. 8 for Provo and Fig. 9 for UCL. Overall, the aggregations do not change the picture of the results and are therefore compatible with the idea that when readers do regress, they are fairly accurate with their saccades, often landing on their intended targets.
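The two aggregations can be sketched as follows. These are illustrative helper functions with names of our own choosing; the inputs in the test are made-up pmi values.

```python
def window_pmis(pmi_by_position, t, radius=1):
    """pmi values between the trigger and a window around target position t.

    With radius=1 this is the three-word window described above,
    clipped at the sentence boundaries.
    """
    lo = max(0, t - radius)
    hi = min(len(pmi_by_position), t + radius + 1)
    return pmi_by_position[lo:hi]

def mean_agg(vals):
    """First aggregation: mean pmi over the window."""
    return sum(vals) / len(vals)

def signed_max_abs_agg(vals):
    """Second aggregation: the value with the largest magnitude,
    keeping its original sign for the subsequent analysis."""
    return max(vals, key=abs)
```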

Appendix C. Regression modeling code
Here, we give more details on the regression formulae used in the various studies reported in the main text. Our notation is as follows: Our dependent variable, n_regressions, is the number of regressions that occur between w_s and w_t across all participants in a given corpus.
For our baseline predictor variables, dist is the distance between w_s and w_t in number of words; target_len is the length of w_t in number of characters; target_freq is the log10 unigram frequency of w_t as reported in Speer (2022); target_surp is the surprisal of w_t given all the previous words in the sentence, as estimated by GPT-2; and the notation is similar for properties associated with w_s, except that source is used instead of target. For our PMI-based predictor variables, pmi is the pmi between w_s and w_t; ppmi is the ppmi; npmi is the npmi; e_ppmi is the expected value of ppmi, i.e., E[ppmi]; e_npmi is the expected value of npmi; ppmi_0 is the indicator variable associated with ppmi; and npmi_0 is the indicator variable associated with npmi.
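Putting this notation together, the model specifications can be assembled as formula strings. The sketch below is a hypothetical reconstruction: the random-effect and zero-inflation terms of the actual ZIP models are omitted.

```python
# Baseline predictors, using the variable names defined above.
BASELINE = [
    "dist",
    "target_len", "target_freq", "target_surp",
    "source_len", "source_freq", "source_surp",
]

def zip_formula(pmi_predictors):
    """Formula string for one model: baselines plus pmi-based terms."""
    rhs = " + ".join(BASELINE + list(pmi_predictors))
    return f"n_regressions ~ {rhs}"

# e.g., an ensemble model with expectations and indicator variables:
ensemble = zip_formula(["e_ppmi", "e_npmi", "ppmi_0", "npmi_0"])
```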

Fig. 1 .
Fig. 1. Example PMI Estimates: Estimates from BERT are shown for a Main-Verb/Reduced-Relative-Clause garden-path sentence (The artist painted a portrait was impressed by its beauty). The bottom row shows pmi; the top row shows E[pmi], taken over all possible realizations of the target (i.e., the sentence position indicated on the y-axis).

Fig. 2 .
Fig. 2. Results (Study One): All models also include baseline regressors. Error bars are 95% CIs estimated across 5 folds of data. Stars show the significance of paired permutation tests between baselines/conditions. Sub-plots show coefficient estimates for pmi-based predictors, with error bars showing 95% CIs of coefficients across five folds of data. Coefficients for NPMI/E[NPMI] have inverted signs to facilitate direct comparison with PPMI/E[PPMI]. In the sub-plots, circles indicate coefficient estimates for the count-based portion and triangles for the zero-inflation portion. For ensemble models that include both ppmi and npmi, coefficients for the different predictors are differentiated by color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 3 .
Fig. 3. Baseline coefficients (Study One): Scaled estimates for model coefficients. Error bars are 95% CIs across 5 folds of data. Results are similar across corpora.

Fig. 4 .
Fig. 4. Results (Study Two): All models also include baseline regressors. Error bars are 95% CIs estimated across 5 folds of data. Stars show the significance of a paired permutation test assessing whether results are different from zero. Sub-plots show coefficient estimates for pmi-based predictors, with error bars showing 95% CIs of coefficients across ten folds of data. Again, the signs of NPMI/E[NPMI] coefficients have been inverted to facilitate direct comparison with PPMI/E[PPMI]. In the sub-plots, circles indicate coefficient estimates for the count-based portion and triangles for the zero-inflation portion. For ensemble models that include both ppmi and npmi, coefficients for the different predictors are differentiated by color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 5 .
Fig. 5. Model coefficients: Scaled estimates for model coefficients. Error bars are 95% confidence intervals across 5 folds of data.

Fig. 8 .
Fig. 8. Results for Provo w/ aggregation window: Error bars are 95% confidence intervals. Stars indicate the result of a paired permutation test assessing whether the Δllh is significantly above zero.

Fig. 9 .
Fig. 9. Results for UCL w/ aggregation window: Error bars are 95% confidence intervals. Stars indicate the result of a paired permutation test assessing whether the Δllh is significantly above zero.

Table 1
VIF for English datasets: All values are below the threshold for adverse multicollinearity. Columns: Dist; PMI; and Freq., Len., and Surp. for the source and target words.

Table 2
VIF for MECO: All values are below the threshold for adverse multicollinearity. Columns: Dist; PMI; and Freq., Len., and Surp. for the source and target words.