Lexical Processing Strongly Affects Reading Times But Not Skipping During Natural Reading

Abstract. In a typical text, readers look much longer at some words than at others, and skip many words altogether. Historically, researchers explained this variation via low-level visual or oculomotor factors, but today it is primarily explained via factors determining a word's lexical processing ease, such as how well word identity can be predicted from context or discerned from parafoveal preview. While the existence of these effects is well established in controlled experiments, the relative importance of prediction, preview and low-level factors in natural reading remains unclear. Here, we address this question in three large naturalistic reading corpora (n = 104, 1.5 million words), using deep neural networks and Bayesian ideal observers to model linguistic prediction and parafoveal preview from moment to moment in natural reading. Strikingly, neither prediction nor preview was important for explaining word skipping: the vast majority of explained variation was captured by a simple oculomotor model using just fixation position and word length. For reading times, by contrast, we found strong but independent contributions of prediction and preview, with effect sizes matching those from controlled experiments. Together, these results challenge dominant models of eye movements in reading, and instead support alternative models that describe skipping (but not reading times) as largely autonomous from word identification, and mostly determined by low-level oculomotor information.

Figure A.6. Average skipping rate in each dataset. Average rate of skipping across all words included in the skipping analysis (see Methods) in all datasets. Large dots with error bars show the group mean plus bootstrapped 95% confidence interval; small dots show individual participants.

Figure A.1. GPT-2 predictabilities outperform cloze predictabilities. Results of a cross-validated model comparison in the full reading times model, using either GPT-2-derived surprisal as the lexical predictability metric, or a cloze-task-derived probability (A) or log-probability (B) value, evaluated on PROVO, since this corpus includes cloze-norm-derived probability values for each word. In both cases we used the full reading times model, similar to the model comparison in Figure 6. The regression model with GPT-2 predictability values performs much better (bootstrap: P < 0.00001). This is not surprising, because cloze probabilities are not sensitive to small probability values, and hence cannot distinguish between subtle differences in predictability (e.g. between 0.01 and 0.001 or 0.0001), which are known to be important for modelling predictability effects in human language processing (Shain et al., 2022; Smith & Levy, 2013). Hence, this analysis confirms that for word-by-word predictability estimates in natural texts, where constraint is generally low (Luke & Christianson, 2016), language-model-derived predictabilities are superior to cloze-task-derived probability estimates.
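The contrast above between graded log-probabilities and cloze estimates can be sketched in a few lines. This is an illustrative toy example, not the paper's analysis code; the pool of 40 cloze respondents is a hypothetical assumption chosen only to show the granularity problem.

```python
import math

def surprisal(p):
    """Surprisal in bits: -log2(p). Separates small probabilities cleanly."""
    return -math.log2(p)

def cloze_estimate(p, n_respondents=40):
    """Expected cloze proportion: a finite respondent pool can only
    resolve multiples of 1/n_respondents (hypothetical pool size)."""
    return round(p * n_respondents) / n_respondents

# Three words whose true predictabilities differ by an order of magnitude each:
probs = [0.01, 0.001, 0.0001]

# On a log scale the three are clearly distinct...
print([round(surprisal(p), 1) for p in probs])   # → [6.6, 10.0, 13.3]

# ...but a cloze norm collapses all three to zero.
print([cloze_estimate(p) for p in probs])        # → [0.0, 0.0, 0.0]
```

This granularity floor is exactly why low-constraint natural text, where most words have small but non-zero probability, favours language-model-derived estimates.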

Figure A.2. Encoding and inference scheme of the ideal observer analysis. Visualisation of the ideal observer, following the formulation in Duan and Bicknell (2020). A word at a given eccentricity is converted into a noisy visual percept, after which a posterior probability of the identity of the word given the noisy percept is computed using Bayesian inference. The uncertainty of this posterior (expressed in terms of Shannon entropy) is then used to quantify the expected uncertainty in the parafoveal percept, or, inversely, a word's parafoveal identifiability. In this scheme, words are represented as a concatenation of one-hot encoded letter vectors. Visual information (I) is sampled from a multivariate Gaussian centred on the word vector y_w with a diagonal covariance matrix Σ, the values of which (σ²) are inversely related to the integral under the visual acuity function around each letter. The posterior is then computed by combining the likelihood of the visual information I given a particular word with the prior probability of that word, p(w) (e.g. derived from lexical frequency). This computation was performed using a log-odds formulation that exploits the proportionality in Bayes' rule to perform belief updating without renormalisation (see Methods).
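The inference scheme above can be sketched in simplified form. This is a hedged toy implementation, not the paper's code: it uses a single scalar noise variance rather than the per-letter acuity integral, an explicitly renormalised posterior rather than the log-odds formulation, and a made-up four-word lexicon with a uniform prior.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def one_hot(word):
    """Concatenate one-hot letter vectors (26 dimensions per letter)."""
    vec = []
    for ch in word:
        v = [0.0] * len(ALPHABET)
        v[ALPHABET.index(ch)] = 1.0
        vec.extend(v)
    return vec

def noisy_percept(word, sigma2, rng):
    """Sample visual input I ~ N(y_w, sigma2 * Identity) around the word vector."""
    return [y + rng.gauss(0.0, math.sqrt(sigma2)) for y in one_hot(word)]

def posterior(percept, lexicon, priors, sigma2):
    """Posterior p(w | I): Gaussian log-likelihood plus log-prior, renormalised."""
    logps = []
    for w in lexicon:
        y = one_hot(w)
        loglik = sum(-(i - yi) ** 2 / (2.0 * sigma2) for i, yi in zip(percept, y))
        logps.append(loglik + math.log(priors[w]))
    m = max(logps)                      # subtract max for numerical stability
    unnorm = [math.exp(lp - m) for lp in logps]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def entropy(p):
    """Shannon entropy (bits) of the posterior = parafoveal uncertainty."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Toy lexicon of same-length words with a uniform prior; larger sigma2
# corresponds to lower visual acuity (greater eccentricity).
rng = random.Random(0)
lexicon = ["house", "horse", "mouse", "plant"]
priors = {w: 1.0 / len(lexicon) for w in lexicon}
percept = noisy_percept("house", sigma2=0.5, rng=rng)
print(entropy(posterior(percept, lexicon, priors, sigma2=0.5)))
```

At very low noise the posterior collapses onto the true word (entropy near 0 bits); at high noise it approaches the prior (here, 2 bits), mirroring the eccentricity curves in Figure A.3.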

Figure A.3. Modulation of parafoveal identifiability by visual and linguistic features, and their interaction. The parafoveal entropy for a given word (Fig A.2) is a complex function that integrates linguistic and visual characteristics, and can account for various known effects, such as the effect of lexical frequency and orthographic neighbourhood on visual word recognition. To illustrate this, we simulated some characteristic effects of eccentricity, frequency (a,b) and orthographic distinctiveness (c,d). For frequency (a), we randomly sampled 20 'rare' and 20 'frequent' 5-letter words (based on a quartile split), and computed the parafoveal identifiability (quantified via posterior entropy) at increasing eccentricities. As can be seen, the percept becomes uncertain more quickly with increasing eccentricity for low-frequency words, showing that lexical frequency boosts parafoveal identifiability. For orthography (c), we similarly sampled 20 7-letter words that were classified as orthographically common or uncommon based on the first three letters. Here, commonality was again defined using a quartile split, but now on the number of alternative words starting with the same three letters. For instance, the letters 'awk' in the word 'awkward' are highly uncommon and allow the entire word to be identified with high confidence based on just those three letters. As can be seen, the model predicts that orthographic uniqueness boosts parafoveal identifiability, as observed in experiments (see Schotter et al. (2012)). Notably, when we consider the difference between the two classes of words (b,d), an inverted U shape is apparent: the effects are strongest at intermediate visibility. This demonstrates the well-established fact that the effects of prior (linguistic) knowledge are strongest at intermediate levels of perceptual uncertainty (see Norris (2006) for discussion). (Note that, while both the orthography and frequency effects are effects of 'prior linguistic knowledge', only the frequency effect is technically an effect of the prior, since the orthography effect is driven by the generative model.) In all plots, thick lines represent the mean entropy across words; shaded regions indicate bootstrapped 95% confidence intervals.

Figure A.4. Grid search to establish ideal observer parameters. Grid search results: grand average (top) and individual results for different corpora and analyses (bottom). To decide on the values for σ and Λ, a grid search was performed on a random subset of 25% of the Dundee and GECO corpora; we did not apply it to PROVO because there was not enough data per participant. For both skipping and reading times, we performed a 10-fold cross-validation with the full model, using parafoveal entropy as computed with different visual acuity parameters σ and Λ (Equation 6). To avoid biasing the contextual vs non-contextual model comparison (Figure 6), we used both the contextual and non-contextual prior and averaged the results to obtain the results for each analysis in each corpus. To ensure that different analyses and corpora are weighted equally in the grand average, the prediction scores (R² or R²McF) were normalised by dividing the prediction score of each parameter combination by the highest score (i.e. the score of the best parameter combination) for each subject, for each analysis. This resulted in σ = 3 and Λ = 1, which we have used in all analyses. Note that σ determines the perceptual span (see Figure A.2) and that σ = 3 corresponds well to what is known about the size of the perceptual span and is close to default parameters in other models (see Methods).
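The per-subject normalisation step described above (dividing each parameter combination's score by that subject's best score) can be sketched as follows. The (σ, Λ) combinations and R² values here are hypothetical, chosen only to show that subjects with very different raw score scales contribute equally after normalisation.

```python
def normalise_scores(scores_by_subject):
    """Divide each parameter combination's prediction score by that
    subject's best score, so every subject's best combination maps to 1.0
    and subjects are weighted equally in the grand average."""
    normed = {}
    for subject, scores in scores_by_subject.items():
        best = max(scores.values())
        normed[subject] = {params: s / best for params, s in scores.items()}
    return normed

# Hypothetical cross-validated R^2 per (sigma, lambda) combination for two
# subjects whose raw scores differ by a factor of four:
scores = {
    "s1": {(3, 1): 0.20, (4, 1): 0.16},
    "s2": {(3, 1): 0.05, (4, 1): 0.04},
}
print(normalise_scores(scores))
# both subjects now peak at 1.0 for (3, 1), despite very different raw scales
```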
Figure A.5. Distributions of reading times (gaze durations). Kernel density estimate of the distribution of reading times across all datasets, both on average (left column) and in individual participants (right column).
Figure A.7. Distribution of (forward) saccade lengths in each dataset. Kernel density estimate of the distribution of saccade lengths (amplitudes) of first-pass, forward saccades in all datasets, both on average (left column) and per individual participant (right column). Note that for this visualisation we only included progressive, forward saccades within the same line (excluding saccades that cross lines), up to a maximum amplitude of 24 characters, and excluded saccades made during periods when participants were not actually reading.
Figure A.8. Saccade lengths are tailored to word lengths and exhibit a preferred landing position. Left column: kernel density estimate of the saccade lengths, estimated separately for target words of different lengths. Colours indicate word lengths; vertical lines indicate the mode of each distribution. Right column: kernel density estimate (plus mode) of the relative landing position, averaged across words of different lengths. Saccades are longer for longer words, such that a systematic 'preferred landing position' is maintained, slightly left of the centre of the word (indicated by the vertical dashed line); see McConkie et al. (1988); Rayner (1979).

Figure A.9. Skipping variation partitioning for all participants. Explained cross-validated variation for each partition, for each participant, in the skipping analysis (see Fig 2). Models for the baseline, parafoveal preview and linguistic prediction are indicated by 'base', 'para', and 'ling', respectively. Unions are indicated by ∪, intersections by ∩; for the relative complement we use the asterisk notation: e.g. 'para*' indicates variation explained uniquely by parafoveal preview. Note that due to cross-validation, the amount of variation explained can become negative in some partitions for individual participants (see Methods).
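The relative-complement partitions ('base*', 'para*', 'ling*') reported above can be sketched as follows: the variation explained uniquely by one predictor set is the full model's cross-validated score minus the score of the model that omits that set. The R² values below are hypothetical, and this sketch covers only the unique partitions, not the full union/intersection decomposition.

```python
def unique_partitions(r2):
    """Unique (relative-complement) contributions of three predictor sets,
    given cross-validated R^2 values keyed by frozensets of set names."""
    names = {"base", "para", "ling"}
    full = r2[frozenset(names)]
    unique = {}
    for name in sorted(names):
        # e.g. 'para*' = full model minus the model without 'para'
        unique[name + "*"] = full - r2[frozenset(names - {name})]
    return unique

# Hypothetical cross-validated R^2 of the full model and each two-set model:
r2 = {
    frozenset({"base", "para", "ling"}): 0.30,
    frozenset({"base", "para"}): 0.29,   # dropping 'ling' costs little
    frozenset({"base", "ling"}): 0.27,   # dropping 'para' costs a bit more
    frozenset({"para", "ling"}): 0.18,   # dropping 'base' costs the most
}
print(unique_partitions(r2))
```

Because each term is a difference of cross-validated scores, a partition can come out negative for an individual participant, exactly as noted in the caption.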

Figure A.11. Reading times variance partitioning. Explained cross-validated variation for each partition, for each participant, in the reading times analysis (see Fig 3). Models for the baseline, parafoveal preview and linguistic prediction are indicated by 'base', 'para', and 'ling', respectively. Unions are indicated by ∪, intersections by ∩; for the relative complement we use the asterisk notation: e.g. 'para*' indicates variation explained uniquely by parafoveal preview (see Methods). Note that due to cross-validation, the amount of variation explained can become negative in individual participants (see Methods).

Table A.1. Explanatory variables for the 3-way reading times analysis, comparing explanations for variation in reading times based on two contextual sources of information about a word's identifiability (parafoveal preview and linguistic prediction) and on non-contextual attributes of the fixated word.

Table A.2. Explanatory variables for the 3-way skipping analysis, contrasting explanations for skipping based on a word's prior identifiability from parafoveal preview, a word's prior identifiability from constraint or contextual prediction, and low-level visual or oculomotor information. Note that when we refer to the 'full model' we simply mean the joint model combining all explanatory variables of the partial explanatory models.

Table A.3. Explanatory variables for the 2-way skipping analysis, contrasting explanations for skipping based on factors determining a word's lexical processing ease (i.e. how well it can be predicted from context or discerned from a parafoveal preview) and explanations based on low-level visual or oculomotor information.
OPEN MIND: Discoveries in Cognitive Science