Human attention during goal-directed reading comprehension relies on task optimization

The computational principles underlying attention allocation in complex goal-directed tasks remain elusive. Goal-directed reading, that is, reading a passage to answer a question in mind, is a common real-world task that strongly engages attention. Here, we investigate what computational models can explain attention distribution in this complex task. We show that the reading time on each word is predicted by the attention weights in transformer-based deep neural networks (DNNs) optimized to perform the same reading task. Eye tracking further reveals that readers separately attend to basic text features and question-relevant information during first-pass reading and rereading, respectively. Similarly, text features and question relevance separately modulate attention weights in shallow and deep DNN layers. Furthermore, when readers scan a passage without a question in mind, their reading time is predicted by DNNs optimized for a word prediction task. Therefore, we offer a computational account of how task optimization modulates attention distribution during real-world reading.


Introduction
Attention profoundly influences information processing in the brain [1,2], and a large number of studies have been devoted to studying the neural mechanisms of attention.
From the perspective of David Marr, the attention mechanism can be studied at three levels, i.e., the computational, algorithmic, and implementational levels [3]. At the computational level, attention is traditionally viewed as a mechanism to allocate limited central processing resources [4-7]. More recent studies, however, propose that attention is a mechanism to optimize task performance, even in conditions where the processing resource is not clearly constrained [8,9]. The optimization hypothesis can explain the attention distribution in a range of well-controlled learning and decision-making tasks [10,11], but is rarely tested in complex processing tasks for which the optimal strategy is not obvious. Nevertheless, complex tasks are critical conditions to test whether the attention mechanisms abstracted from simpler tasks can truly explain real-world attention behaviors.
Reading is one of the most common and most sophisticated human behaviors [12,13], and it is strongly regulated by attention: Since readers could only recognize a couple of words within one fixation, they have to overtly shift their attention to read a line of text [14]. Computational modeling of the reading behavior has mostly focused on normal reading of single sentences. At the computational level, it has been proposed that the eye movements are programed to, e.g., minimize the number of eye movements [15]. At the algorithmic and implementational level, models such as the E-Z reader [16] can accurately predict the eye movement trajectory with high temporal and spatial resolution. Everyday reading behavior, however, often engages reading of a multi-line passage and generally has a clear goal, e.g., information retrieval or inference generation [17]. Few models, however, have considered how the reading goal modulates reading behaviors. Here, we address this question by analyzing how readers allocate attention when reading a passage to answer a specific question in mind. The question may require, e.g., information retrieval, inference generation, or text summarization (Fig. 1). We investigate whether the task optimization hypothesis can explain the attention distribution in such goal-directed reading tasks.
Finding an optimal solution for the goal-directed reading task, however, is computationally challenging since the information related to question answering is sparsely located in a passage and their orthographic forms may not be predictable.
Recent advances in DNN models, however, provide a potential tool to solve this computational problem, since DNN models equipped with attention mechanisms have approached and even surpassed mean human performance on goal-directed reading tasks [18,19]. Attention in DNNs is a mechanism to selectively extract useful information, and therefore is conceptually similar to the human attention mechanism at the computational level. Furthermore, recent studies have provided strong evidence that task-optimized DNNs can indeed explain the neural response properties in a range of visual and language processing tasks [20-27]. Therefore, although the DNN attention mechanism certainly deviates from the human attention mechanism in terms of its algorithms and implementation, we employ it to probe the computational-level principle underlying human attention distribution during real-world goal-directed reading.
Here, we employed DNNs to derive the optimal attention distribution for the goal-directed reading task, and tested whether such optimal distribution could explain human attention measured by eye tracking. Critically, we investigated how the attention distribution evolved along the processing hierarchy for both humans and DNNs, e.g., how text properties and the top-down task differentially modulated attention at each processing stage. Additionally, we recruited both native and nonnative readers to probe how language proficiency contributed to the computational optimality of attention distribution.

Experiment 1: Task and Performance
In Experiment 1, the participants (N = 25 for each question) first read a question and then read a passage based on which the question should be answered (Fig. 1A). After reading the passage, the participants chose from 4 options which option was the most suitable answer to the question. In total, 800 question/passage pairs were adapted from the RACE dataset [28], a collection of reading comprehension questions designed for Chinese high school students who learn English as a second language.
The questions fell into 6 types (Fig. 1BC): Three types of questions required attention to details, e.g., retrieving a fact or generating an inference based on a fact, which were referred to as local questions. The other 3 types of questions concerned the general understanding of a passage, e.g., summarizing the main idea or identifying the purpose of writing, which were referred to as global questions. None of the questions directly appeared in the passage, and the longest string that overlapped in the passage and question was 1.8 ± 1.5 words on average. Participants in Experiment 1 were Chinese college or graduate students who had relatively high English proficiency. The participants correctly answered 77.94% of the questions on average and the accuracy was comparable across the 6 types of questions (Fig. 1B). We employed computational models to analyze what kinds of computations were required to answer the questions. The simplest heuristic model chose the option that best matched the passage orthographically (Fig. S1A). This orthographic model achieved 25.6% accuracy (Fig. 1B). Another simple heuristic model only considered word-level semantic matching between the passage and option, and achieved 27.3% accuracy (Fig. 1B). The low accuracy of the two models indicated that the reading comprehension questions could not be answered by word-level orthographic or semantic matching.
Next, we evaluated the performance of 4 context-dependent DNN models, i.e., Stanford Attentive Reader (SAR) [29], BERT [30], ALBERT [18], and RoBERTa [19], which could integrate information across words to build passage-level semantic representations. The SAR used the bi-directional recurrent neural network (RNN) to integrate contextual information (Fig. S1B) and achieved 47.6% accuracy. The other 3 models, i.e., BERT, ALBERT, and RoBERTa, were transformer-based models that were trained in 2 steps, i.e., pre-training and fine-tuning (Fig. 1D). Since the 3 models had similar structures, we averaged the performance over the 3 models (see Fig. S2 for the results of individual models). The model performance on the reading task was 37.08% and 73%, respectively, after pre-training and fine-tuning (Fig. 1B).

Computational Models of Human Attention Distribution
In Experiment 1, participants were allowed to read each passage for 2 minutes, but the reward they would receive decreased with the reading time, which encouraged them to develop an effective reading strategy. The results showed that the participants spent, on average, 0.7 ± 0.2 minutes reading each passage (Fig. 1C), corresponding to a reading speed of 457 ± 142 words/minute when divided by the number of words per passage. The speed was almost twice the normal reading speed for native readers [14], indicating a specialized reading strategy for the task.
Next, we employed eye tracking to quantify how the readers allocated their attention to achieve effective reading and analyze which computational models could explain the reading time on each word, i.e., the total fixation duration on each word during passage reading. In other words, we probed into what kind of computational principles could generate human-like attention distribution during goal-directed reading. A simple heuristic strategy was to attend to words that were orthographically or semantically similar to the words in the question (Fig. S1A). The predictions of the heuristic models were not highly correlated with the human word reading time (Fig. S3A, prediction accuracy around 0.2).
The DNN models analyzed here, i.e., SAR, BERT, ALBERT, and RoBERTa, all employed the attention mechanism to integrate over context to find optimal question answering strategies. Roughly speaking, the attention mechanism applied a weighted integration across all input words to generate a passage-level representation and decide whether an option was correct or not, and the weight on each word was referred to as the attention weight (see Fig. S1B and Fig. 2B for illustrations about the attention mechanisms in the SAR and transformer-based models, respectively). When the attention weights of the SAR were used to predict the human word reading time, the prediction accuracy was about 0.1 (Fig. 3A, Table S1).
In contrast to assigning a single weight on a word, the transformer-based model employed a multi-head attention mechanism: Each of the 12 layers had 12 parallel attention modules, i.e., heads. Consequently, each word had 144 attention weights (12 layers × 12 heads), which were used to model the word reading time of humans based on linear regression. Since the attention weights of 3 transformer-based models showed comparable power to predict human word reading time, we reported the prediction accuracy averaged over models (see Fig. S3A for the results of individual models). When the attention weights of pre-trained transformer-based models were used to predict the human word reading time, the prediction accuracy was around 0.5, significantly higher than the prediction accuracy of heuristic models and the SAR (Fig. 3A, Table S1). The prediction accuracy was further boosted for local but not global questions when the models were fine-tuned to perform the goal-directed reading task (Fig. 3A, Table S1). These results suggested that the human attention distribution was consistent with the attention weights in transformer-based models that were optimized to perform the same goal-directed reading task.
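As an illustration of the regression just described, the following minimal sketch maps 144 attention weights per word (12 layers × 12 heads) onto word reading time with ordinary least squares. The data are synthetic, and the held-out split, the function name `prediction_accuracy`, and the use of Pearson correlation as the accuracy measure are assumptions made for illustration rather than details taken from the study.

```python
import numpy as np

def prediction_accuracy(attn, reading_time, train_frac=0.8, seed=0):
    """Fit a linear map from attention weights to reading time on a training
    split and return the Pearson correlation on held-out words.
    (Hypothetical helper; the split and the metric are assumptions.)"""
    rng = np.random.default_rng(seed)
    n_words = attn.shape[0]
    order = rng.permutation(n_words)
    n_train = int(train_frac * n_words)
    train, test = order[:n_train], order[n_train:]
    # Append an intercept column to the 144 attention features.
    X = np.hstack([attn, np.ones((n_words, 1))])
    coef, *_ = np.linalg.lstsq(X[train], reading_time[train], rcond=None)
    pred = X[test] @ coef
    return float(np.corrcoef(pred, reading_time[test])[0, 1])

# Synthetic stand-in data: one row per word, one column per (layer, head) pair.
rng = np.random.default_rng(1)
n_words, n_layers, n_heads = 2000, 12, 12
attn = rng.random((n_words, n_layers * n_heads))
true_w = rng.normal(size=n_layers * n_heads)
reading_time = attn @ true_w + rng.normal(scale=1.0, size=n_words)

r = prediction_accuracy(attn, reading_time)
print(f"held-out correlation: {r:.2f}")
```

With real data, `attn` would hold the attention weight each head assigns to a word and `reading_time` the total fixation duration on that word.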

Factors Influencing Human Word Reading Time
The attention weights in transformer-based DNN models could predict the human word reading time. Nevertheless, it remained unclear whether such predictions were purely driven by basic text features that were known to modulate word reading time.
Therefore, in the following, we first analyzed how basic text features modulated the word reading time during the goal-directed reading task, and then checked whether transformer-based DNNs could capture additional properties of the word reading time that could not be explained by basic text features.
Here, we further decomposed text features into visual layout features, i.e., position of a word on the screen, and word features, e.g., word length, frequency, and surprisal.
Layout features were features that were mostly induced by line changes, which could be extracted without recognizing the words, while word features were finer-grained features that could only be extracted when the word or neighboring words were fixated. Linear regression analyses revealed layout features could significantly predict the word reading time (Fig. 3B, Table S2). Furthermore, the prediction accuracy was higher for global than local questions (P = 9 × 10⁻⁵, bootstrap, FDR corrected), suggesting a question-type-specific reading strategy. Word features could also significantly predict human reading time, even when the influence of layout features was regressed out. The predictive accuracy of the layout and word features, however, was lower than the predictive accuracy of attention weights of transformer-based models (P = 9 × 10⁻⁵, bootstrap, FDR corrected).
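The idea of regressing out one set of features before testing another can be sketched as follows. This is a generic residualization example on synthetic data, not the study's analysis code; the helper `regress_out` and the variable names are hypothetical.

```python
import numpy as np

def regress_out(y, nuisance):
    """Return the residual of y after a least-squares fit on the nuisance
    columns plus an intercept (hypothetical helper for illustration)."""
    X = np.column_stack([nuisance, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Synthetic words: two layout features and one word feature drive reading time.
rng = np.random.default_rng(0)
n = 1000
layout = rng.normal(size=(n, 2))
word_len = rng.normal(size=n)
reading_time = 2.0 * layout[:, 0] + 0.5 * word_len + rng.normal(scale=0.5, size=n)

# Remove the layout contribution, then ask what the word feature still explains.
residual = regress_out(reading_time, layout)
r_layout = np.corrcoef(residual, layout[:, 0])[0, 1]
r_word = np.corrcoef(residual, word_len)[0, 1]
print(f"residual vs layout: {r_layout:.3f}, residual vs word feature: {r_word:.3f}")
```

By construction the residual is uncorrelated with the regressed-out layout features, so any remaining correlation with the word feature cannot be attributed to layout.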
When the layout and word features were regressed out, the residual word reading time was still significantly predicted by the attention weights in transformer-based models (Fig. S3B, prediction accuracy about 0.3). This result indicated that what the transformer-based models extracted were more than basic text features. Next, we analyzed whether the transformer-based models, as well as the human word reading time, were sensitive to task-related features. To characterize the relevance of each word to the question answering task, we asked another group of participants to annotate which words contributed most to question answering. The annotated question relevance could significantly predict word reading time, even when the influences of layout and word features were regressed out (Fig. 3B, Table S2). When the question relevance was also regressed out, the residual word reading time was still significantly predicted by the attention weights in transformer-based models (

Attention in Different Processing Stages for Humans and DNNs
Next, we investigated whether humans and DNNs attended to different features in different processing stages. The early stage of human reading was indexed by the gaze duration, i.e., duration of first-pass reading of a word, and the later stage was indexed by the counts of rereading. Results showed the influence of layout features increased from early to late reading stages for global but not local questions (Fig. 4A, Table S3).
Consequently, the passage beginning effect differed between global and local questions only for the late reading stage (Fig. S5A). The influence of word features did not strongly change between reading stages, while the influence of question relevance significantly increased from early to late reading stages (Fig. 4A, Fig. S5B).
These results suggested that attention to basic text features developed early, while the task mainly influenced late reading processes.
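The two stage-specific measures used above, gaze duration and rereading counts, can be derived from a fixation sequence roughly as in the sketch below. It assumes fixations are already mapped to word indices, and it treats every return to a word after leaving it as one rereading; the function name and the toy data are hypothetical, and the study's exact operationalization may differ.

```python
from collections import defaultdict

def first_pass_and_rereads(fixations):
    """fixations: list of (word_index, duration_ms) in temporal order.
    Returns (gaze_duration, reread_count) dictionaries keyed by word index.
    Assumption: a 'pass' is a run of consecutive fixations on the same word."""
    gaze = defaultdict(float)
    passes = defaultdict(int)
    prev_word = None
    for word, dur in fixations:
        if word != prev_word:
            passes[word] += 1      # the eyes have just entered this word
        if passes[word] == 1:
            gaze[word] += dur      # only first-pass fixations count as gaze duration
        prev_word = word
    rereads = {w: n - 1 for w, n in passes.items()}
    return dict(gaze), rereads

# Toy sequence: word 0 is fixated twice, left, and revisited once.
fixations = [(0, 200), (0, 50), (1, 180), (0, 120), (2, 210), (1, 90), (1, 60)]
gaze, rereads = first_pass_and_rereads(fixations)
print(gaze)     # first-pass duration per word
print(rereads)  # number of later passes per word
```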
In the following, we further investigated whether transformer-based DNNs attended to different features in different layers, which represented different processing stages.
This analysis did not include layout features that were not available to the models.
The attention weights in shallow layers were sensitive to word features in both pre-trained and fine-tuned models (Fig. 4BC). Only in the fine-tuned models, however, were the attention weights in deep layers sensitive to question relevance (see Figs. S6 & S7 for results of individual models). Therefore, the shallow and deep layers separately evolved text-based and goal-directed attention, and goal-directed attention was induced by fine-tuning on the task.

Experiment 2: Question-Type-Specificity of the Reading Strategy
In Experiment 1, different types of questions were presented in blocks which encouraged the participants to develop question-type-specific reading strategies. Next, we ran Experiment 2 in which questions from different types were mixed and presented in a randomized order. Since it was time consuming to measure the response to all 800 questions, we randomly selected 96 questions for Experiment 2 (16 questions per type). In Experiment 2, the reading speed was on average 298 ± 123 words/minute, lower than the speed in Experiment 1 (P = 6 × 10⁻⁴, bootstrap, FDR corrected), but still much faster than normal reading speed [14].
The word reading time was better predicted by fine-tuned than pre-trained transformer-based models (Fig. 5A, Table S4). For the influence of text and task-related features, compared to Experiment 1, the prediction accuracy in Experiment 2 was higher for layout and word features, but lower for question relevance (Fig. 5B, Table S5). The passage beginning effect was higher for global than local questions

Experiment 3: Effect of Language Proficiency
Experiments 1 and 2 recruited L2 readers. To investigate how language proficiency influenced task modulation of attention, we ran Experiment 3, which was the same as Experiment 2 except that the participants were native English readers. In Experiment 3, the reading speed was on average 506 ± 155 words/minute, higher than that in Experiment 2 (P = 6 × 10⁻⁴, bootstrap, FDR corrected). The question answering accuracy was comparable to that of L2 readers (Fig. 1B).
The word reading time for native readers was slightly better predicted by fine-tuned than pre-trained transformer-based models (Fig. 5A, Table S4). For the influence of text and task-related features, compared to Experiment 2, the prediction accuracy in Experiment 3 was higher for word features, but lower for layout features and question relevance (Table S5). The passage beginning effect was higher for global than local questions, but the difference was smaller than in Experiment 2 (FDR corrected). These results showed that the word reading time of native readers was significantly modulated by the task, but the effect was weaker than that on L2 readers.

Experiment 4: General-Purpose Reading
In the goal-directed reading task, participants read a passage to answer a question that they knew in advance, and the eye tracking results revealed that participants spent more time reading question-relevant words. Question-relevant words, however, were generally longer content words (Fig. S3CD) that were often associated with longer reading time even without a task [14]. Therefore, to validate the question relevance effect, we ran Experiment 4 in which the participants read the passages without knowing the question to answer. The experiment used the same 96 questions as in Experiments 2 and 3, but adopted a different experimental procedure: Participants previewed a passage before reading the question, and were allowed to read the passage again to answer the question. We analyzed the reading pattern during passage preview, which was referred to as general-purpose reading.
The participants were given 1.5 minutes to preview the passage, and the reading speed was on average 225 ± 40 words/minute, lower than that in Experiments 1-3 (P = 6 × 10⁻⁴, bootstrap, FDR corrected). Before question answering, they were given another 0.5 minutes to reread the passage, but on average they spent only 0.04 minutes on rereading it. During passage preview, the word reading time was similarly predicted by the pre-trained and fine-tuned transformer-based models (Fig. 5A, Table S4).
Furthermore, the word reading time was significantly predicted by layout and word features, but not question relevance (Fig. 5B, Table S4). The passage beginning effect was not significantly different between local and global questions (

Discussion
Attention is a crucial mechanism to regulate information processing in the brain and it has been hypothesized that a common computational role of attention is to optimize task performance. Previous support for the hypothesis mostly comes from tasks for which the optimal strategy can be easily derived. The current study, however, considers a real-world reading task in which the participants have to actively sample a passage to answer a question that cannot be answered by simple word-level orthographic or semantic matching. In this challenging task, it is demonstrated that human attention distribution can be explained by the attention weights in transformer-based DNN models that are optimized to perform the same reading task but blind to the human eye tracking data. Furthermore, when participants scan a passage without knowing the question to answer, their attention distribution can also be explained by transformer-based DNN models that are optimized to predict a word based on the context. Furthermore, we demonstrate that both humans and transformer-based DNN models achieve task-optimal attention distribution in multiple steps: For humans, basic text features strongly modulate the duration of the first reading of a word, while the question relevance of a word only modulates how many times the word is reread, especially for high-proficiency L2 readers compared to native readers. Similarly, for DNN models, basic text features mainly modulate the attention weights in shallow layers, while the question relevance of a word modulates the attention weights in deep layers, reflecting hierarchical control of attention to optimize task performance.

Computational models of attention
A large number of computational models of attention have been proposed. According to Marr's 3 levels of analysis [3], some models investigate the computational goal of attention [8,15] and some models provide an algorithmic implementation of how different factors modulate attention [16,31]. Computationally, it has been hypothesized that attention can be interpreted as a mechanism to optimize learning and decision making, and empirical evidence has been provided that the brain allocates attention among different information sources to optimally reduce the uncertainty of a decision [8,9,15]. The current study provides critical support to this hypothesis in a real-world task that engages multiple forms of attention, e.g., attention to visual layout features, attention to word features, and attention to question-relevant information. These different forms of attention, which separately modulate different eye tracking measures (Fig. 4A), jointly achieve an attention distribution that is optimal for question answering.
The transformer-based DNN models analyzed here are optimized in two steps, i.e., pre-training and fine-tuning. The results show that pre-training leads to text-based attention that can well explain general-purpose reading in Experiment 4, while the fine-tuning process leads to goal-directed attention in Experiments 1-3 (Fig. 4B & Fig. 5A). Pre-training is also achieved through task optimization, and the pre-training task used in all the three models analyzed here is to predict a word based on the context. The purpose of the word prediction task is to let models learn the general statistical regularity in a language based on large corpora, and this process is crucial for model performance on downstream tasks [18,19,30]. Previous eye-tracking studies have suggested that the predictability of words, i.e., surprisal, can modulate reading time [32], and neuroscientific studies have also indicated that the cortical responses to language converge with the representations in pre-trained DNN models [22,23]. The results here further demonstrate that the DNN optimized for the word prediction task can evolve attention properties consistent with the human reading process.
A separate class of models investigates which factors shape human attention distribution. A large number of models have been proposed to predict bottom-up visual saliency [33,34], and recently DNN models have also been employed to model top-down visual attention. It is shown that, through either implicit [35,36] or explicit training [37], DNNs can predict which parts of a picture relate to a verbal phrase, a task similar to goal-directed visual search [38]. The current study differs from these studies in that the DNN model is not trained to predict human attention. Instead, the DNN models naturally generate human-like attention distribution when trained to perform the same task that humans perform, suggesting that task optimization is a potential cause for human attention distribution during reading.

Models for human reading and human attention to question-relevant information
How human readers allocate attention during reading is an extensively studied topic, mostly based on studies that instruct readers to read a sentence in a normal manner, not aimed to extract a specific kind of information [39]. Previous eye tracking studies have shown that the readers fixate longer upon, e.g., longer words, words of lower frequency, words that are less predictable based on the context, and words at the beginning of a line [14]. A number of models, e.g., the E-Z reader [16] and SWIFT [40], have been proposed to predict the eye movements during reading based on basic oculomotor properties or lexical processing [16]. Some models also view reading as an optimization process that minimizes the time or the number of saccades required to read a sentence [15,41]. These models can generate fine-grained predictions, e.g., which letter in a word will be fixated first, for the reading of simple sentences, but have only been occasionally tested for complex sentences or multi-line texts [42] or to characterize different reading tasks, e.g., z-string reading and visual searching [43]. A recent model has also considered the specific reading goal of the participants [44], and can explain the word reading time when the readers read a passage to answer a relatively simple question that can be answered using a word-matching strategy [45].
Future studies can potentially integrate classic eye movement models with DNNs to explain the dynamic eye movement trajectory, possibly with a letter-based spatial resolution.
When human readers read a passage with a particular goal or perspective, previous studies have revealed inconsistent results about whether the readers spent more time reading task-relevant sentences [46-48]. To explain the inconsistent results, it has been proposed that the question relevance effect weakens for readers with a higher working memory and when readers read a familiar topic [49]. Similarly, here, we demonstrate that non-native readers indeed spend more time reading question-relevant information than native readers do (Fig. 5D & Fig. S8B). Therefore, it is possible that when readers are more skilled and when the passage is relatively easy to read, their processing is so efficient that they do not need extra time to encode task-relevant information.

DNN attention to question-relevant information
A number of studies have investigated whether the DNN attention weights are interpretable, but the conclusions are mixed: Some studies find that the DNN attention weights are positively correlated with the importance of each word [50,51], while other studies fail to find such correlation [52,53]. The inconsistent results are potentially caused by the lack of a gold standard to evaluate the contribution of each word to a task. A few recent studies have used the human word reading time as the criterion to quantify word importance, but these studies do not reach consistent conclusions either. Some studies find that the attention weights in the last layer of transformer-based DNN models correlate better with human word reading time than basic word frequency measures do [54], and that integrating human word reading time into DNNs can slightly improve task performance [55]. Other studies, however, find no meaningful correlation between the attention weights in transformer-based DNNs and human word reading time [56].
The current results provide a potential explanation for the discrepancy in the literature: The last layer of transformer-based DNNs is tuned to task-relevant information (Fig. 4B), but the influence of task relevance on word reading time is rather weak for native readers (Fig. 5B). Consequently, the correlation between the last-layer DNN attention weights and human reading time may not be robust. The current results demonstrate that the reading time of both native and non-native readers is reliably modulated by basic text features, which can be modeled by the attention weights in shallower DNN layers.
Finally, the current study demonstrates that transformer-based DNN models can automatically generate human-like attention, in the absence of any prior knowledge about the properties of the human reading process. Simpler models that fail to explain human performance also fail to predict human attention distribution. It remains possible, however, that different models can solve the same computational problem using distinct algorithms, and that only some algorithms generate human-like attention distribution. In other words, human-like attention distribution may not be a unique solution to optimize the goal-directed reading task. Sharing similar attention distribution with humans, however, provides a way to interpret the attention weights in computational models. From this perspective, the dataset and methods developed here provide an effective probe of the biological plausibility of NLP models, which can be easily applied to test whether a model evolves human-like attention distribution.

Participants
In total, 162 participants took part in this study (19- In Experiments 1, 2 and 4, participants were native Chinese readers. They were college students or graduate students from Zhejiang University, and were thus above the level required to answer high-school-level reading comprehension questions.
English proficiency levels were further guaranteed by the following criterion for screening participants: a minimum score of 6 on IELTS, 80 on TOEFL, or 425 on

Experimental materials
The reading materials were selected and adapted from the large-scale RACE dataset, a collection of reading comprehension questions in English exams for middle and high schools in China [28]. We selected 800 high-school level questions from the test set of The experiment procedure in Experiment 1 was illustrated in Fig. 1A. In each trial, participants first read a question, pressed the space bar to read the corresponding passage, pressed the space bar again to read the question coupled with 4 options, and chose the correct answer. The time limit for passage reading was 120 s. To encourage the participants to read as quickly as possible, the bonus they received for a specific question would decrease linearly from 1.5 to 0.5 RMB over time. They did not receive any bonus for the question, however, if they gave a wrong answer.
Furthermore, before answering the comprehension question, the participants reported whether they were confident that they could correctly answer the question (yes or no). Participants selected yes for 90.47% of questions (89.62% and 92.04% for local and global questions, respectively). After answering the question, they also rated their confidence about their answer on a scale of 1-4 (low to high). The mean confidence rating was 3.25 (3.28 and 3.18 for local and global questions, respectively), suggesting that the participants were confident about their answers.
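The bonus rule of Experiment 1 (a linear decrease from 1.5 to 0.5 RMB over time, and no bonus for a wrong answer) can be written as a small function. The sketch below assumes that the decrease spans the full 120-s reading limit, which the text does not state explicitly; the function name and parameter names are hypothetical.

```python
def bonus_rmb(reading_time_s, correct, time_limit_s=120.0,
              max_bonus=1.5, min_bonus=0.5):
    """Bonus for one question under the assumed linear schedule.
    Assumption: the bonus falls from max_bonus at 0 s to min_bonus at the
    time limit; a wrong answer earns nothing."""
    if not correct:
        return 0.0
    # Clamp the reading time to the allowed window.
    t = min(max(reading_time_s, 0.0), time_limit_s)
    return max_bonus - (max_bonus - min_bonus) * t / time_limit_s

# A correct answer after 42 s of reading (0.7 minutes, the mean reading time).
print(round(bonus_rmb(42.0, correct=True), 2))
```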

Experiments 2 and 3: Experiments 2 and 3 included 96 reading passages and questions that were randomly selected from the questions used in Experiment 1 and included 16 questions for each question type. The 6 types of questions were mixed and presented in a randomized order. The trial structure, as well as the familiarization procedure, in Experiments 2 and 3 was identical to that in Experiment 1. Experiments 2 and 3 were identical except that Experiment 2 recruited high-proficiency L2 readers while Experiment 3 recruited native English readers.

Experiment 4: Experiment 4 included the 96 questions presented in Experiments 2 and 3, which were presented in a randomized order. The trial structure in Experiment 4 was similar to that in Experiments 1-3, except that a 90-s passage preview stage was introduced at the beginning of each trial. During passage preview, participants had no prior information about the relevant question. The participants could press the space bar to terminate the preview and to read a question. Then, participants read the passage again with a time limit of 30 s, before proceeding to answer the question. The payment method was similar to that in Experiment 2, and the bonus was calculated based on the duration of second-pass passage reading.

Stimulus presentation and eye tracking
The text was presented using the bold Courier New font, and each letter occupied 14 … Eye tracking data were recorded from the left eye at a 500-Hz sampling rate (Eyelink Portable Duo, SR Research). The experiment stimuli were presented on a 24-inch monitor (1920 × 1080 resolution; 60 Hz refresh rate) and administered using MATLAB Psychtoolbox [58]. Each experiment started with a 13-point calibration and validation of the eye tracker, and the validation error was required to be below 0.5 degrees of visual angle. Furthermore, before each trial, a 1-point validation was applied, and if the validation error was higher than 0.5 degrees of visual angle, a recalibration was carried out. Head movements were minimized using a chin and forehead rest.

Word-level reading comprehension models
The orthographic and semantic models probed whether the reading comprehension questions could be answered based on word-level orthographic or semantic information. Both models calculated the similarity between each content word in the passage and each content word in an option, and averaged the word-by-word similarity across all words in the passage and all words in the option (Fig. S1A). The option with the highest mean similarity value was chosen as the answer. For the orthographic model, similarity was quantified using the edit distance [59]. For the semantic model, similarity was quantified by the correlation between vectorial representations of word meaning, i.e., the GloVe model [60]. Performance of the models remained similar if the answer was chosen based on the maximal word-by-word similarity instead of the mean similarity.
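The decision rule of the orthographic model can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes that content words have already been extracted, and it converts the edit distance into a similarity as 1 minus the length-normalized distance, which is an assumption because the text only states that the edit distance was used. All function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """ASSUMPTION: similarity = 1 - edit distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def option_score(passage_words, option_words, similarity=orthographic_similarity):
    """Mean of the passage-by-option word similarity matrix."""
    sims = [similarity(p, o) for p in passage_words for o in option_words]
    return sum(sims) / len(sims)


def choose_option(passage_words, options, similarity=orthographic_similarity):
    """Index of the option with the highest mean similarity."""
    scores = [option_score(passage_words, opt, similarity) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)
```

Passing a different `similarity` function, e.g., a correlation between word embeddings, would give the semantic variant under the same decision rule.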

RNN-based reading comprehension models
The SAR was a classical RNN-based model for the reading comprehension task [29]. In contrast to the word-level models, the SAR was context sensitive and employed bi-directional RNNs to integrate information across words (Fig. S1B). Independent bi-directional RNNs were employed to build a vectorial representation for the question and each option. An additional bi-directional RNN was applied to construct a vectorial representation for each word in the passage, and a passage representation was built by a weighted sum of the representations of individual words in the passage. The weight on each word, i.e., the attention weight, captured the similarity between the representation of the word and the question representation using a bilinear function. Finally, based on the passage representation and each option representation, a bilinear dot layer calculated the likelihood that the option was the correct answer.
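The attention step of the SAR can be illustrated with a small sketch. It assumes that the word and question vectors are already available (the bi-directional RNN encoders are omitted) and that the bilinear similarities are normalized with a softmax, which the text does not state explicitly. The names `bilinear`, `softmax`, and `attend` are hypothetical.

```python
import math


def bilinear(u, W, v):
    """Bilinear form u^T W v for plain Python lists."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_vecs, question_vec, W):
    """Return attention weights over passage words and the pooled passage vector.

    ASSUMPTION: weights are a softmax over bilinear word-question similarities.
    """
    weights = softmax([bilinear(w, W, question_vec) for w in word_vecs])
    dim = len(word_vecs[0])
    passage = [sum(a * w[d] for a, w in zip(weights, word_vecs))
               for d in range(dim)]
    return weights, passage
```

In this sketch, the returned `weights` play the role of the word-level attention weights that are compared with human reading time.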

Transformer-based reading comprehension models
We tested 3 popular transformer-based DNN models, i.e., BERT [30], ALBERT [18], and RoBERTa [19], which were all reported to reach high performance on the reading comprehension task. ALBERT and RoBERTa were both adapted from BERT and had the same basic structure. RoBERTa differed from BERT in its pre-training procedure [19], while ALBERT applied factorized embedding parameterization and cross-layer parameter sharing to reduce memory consumption [18]. Following previous studies [18,19], each option was independently processed. For the $i$th option ($i$ = 1, 2, 3, or 4), the question and the option were concatenated to form an integrated option. As shown in the left panel of Fig. 2B, for the $i$th option, the input to the models was the following sequence:
$$\mathrm{CLS}_i,\ P_1,\ P_2,\ \ldots,\ P_N,\ \mathrm{SEP},\ O_{i,1},\ O_{i,2},\ \ldots,\ O_{i,M},\ \mathrm{SEP},$$
where CLS and SEP denoted special tokens, $P_n$ denoted the $n$th of the $N$ words in the passage, and $O_{i,m}$ denoted the $m$th of the $M$ words in the $i$th integrated option.
Following previous studies [18,19], we calculated a score for each option, which indicated the likelihood that the option was the correct answer. The score was calculated by first applying a linear transform to the final representation of the CLS token, i.e.,
$$s_i = \Phi^{\top}\,\mathrm{CLS}_i^{12},$$
where $\mathrm{CLS}_i^{12}$ was the final output representation of CLS and $\Phi$ was a vector learned from data. The score was independently calculated for each option and then normalized using the following equation:
$$\mathrm{score}_i = \frac{\exp(s_i)}{\sum_{j=1}^{4}\exp(s_j)}.$$
The answer to a question was determined as the option with the highest score, and all the models were trained to maximize the logarithmic score of the correct option.
The transformer-based models were trained in two steps (Fig. 1D). The pre-training process aimed to learn general statistical regularities in a language based on large corpora, while the fine-tuning process trained models to perform the reading comprehension task. All models were implemented based on HuggingFace [61], and all hyperparameters for fine-tuning were adopted from previous studies [18,19,62,63] (see Table S6).
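The scoring step can be sketched in a few lines. The sketch assumes that the normalization across the four options is a softmax, the usual choice for this kind of multiple-choice setup, and that the final-layer CLS vectors have already been produced by a model. The function names and the toy vectors are hypothetical.

```python
import math


def option_scores(cls_vectors, phi):
    """Normalized scores, one per option, from final-layer CLS vectors.

    ASSUMPTION: normalization is a softmax over the raw linear scores.
    """
    raw = [sum(p * c for p, c in zip(phi, cls)) for cls in cls_vectors]
    m = max(raw)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in raw]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_score(scores, correct_index):
    """Training loss: minimizing it maximizes the log score of the correct option."""
    return -math.log(scores[correct_index])


def predicted_answer(scores):
    """Index of the option with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, with toy CLS vectors `[[1, 0], [0, 1], [0, 0], [1, 1]]` and `phi = [1, 2]`, the raw scores are 1, 2, 0, and 3, so the fourth option is selected.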

Attention in transformer-based models
The transformer-based models we applied had 12 layers, and each layer had 12 parallel attention heads. Each attention head calculated an attention weight between any pair of inputs, including words and special tokens. The vectorial representation of each input was then updated by the weighted sum of the vectorial representations of all inputs [64]. Since only the CLS token was directly related to question answering, here we restricted the analysis to the attention weights that were used to calculate the vectorial representation of CLS (Fig. 2B, right panel). In the $h$th head, the vectorial representation of CLS was computed using the following equations. For the sake of clarity, we did not distinguish the input words and special tokens and simply denoted them as $X_i$.
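$$\mathrm{CLS}^{(h)} = \sum_i \alpha_i \left(W^V X_i + b^V\right),$$
$$\alpha_i = \mathrm{softmax}_i\!\left(\frac{\left(W^Q\,\mathrm{CLS} + b^Q\right)^{\top}\left(W^K X_i + b^K\right)}{\sqrt{d}}\right),$$
where $W^V$, $W^Q$, $W^K$, $b^V$, $b^Q$, and $b^K$ were parameters to learn from the data, $d$ was the dimensionality of the query and key vectors, and $\alpha_i$ was the attention weight between CLS and $X_i$. The attention weight between CLS and the $n$th word in the passage, i.e., $\alpha_{P_n}$, was compared to human attention. Here, we only considered the attention weights associated with the correct option. Additionally, DNNs used byte-pair tokenization, which split some words into multiple tokens. We converted the token-level attention weights to word-level attention weights by summing the attention weights over tokens within a word [54,65].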
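The two operations described above, computing the attention weights of CLS and summing token-level weights into word-level weights, can be sketched as follows. The sketch assumes standard scaled dot-product attention and takes already-projected query and key vectors as inputs. The `word_ids` list, which maps each token to a word index and marks special tokens with `None`, is a hypothetical input format.

```python
import math


def cls_attention(query, keys):
    """Scaled dot-product attention weights from the CLS query to all keys."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(logits)                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tokens_to_words(token_weights, word_ids):
    """Sum token-level attention weights within each word.

    word_ids[i] is the word index of token i, or None for special tokens,
    whose weights are dropped.
    """
    n_words = max(w for w in word_ids if w is not None) + 1
    word_weights = [0.0] * n_words
    for weight, wid in zip(token_weights, word_ids):
        if wid is not None:
            word_weights[wid] += weight
    return word_weights
```

Because weights on special tokens are dropped, the word-level weights in this sketch need not sum to 1.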

Eye tracking measures
We analyzed eye movements during passage reading in Experiments 1-3 and during the passage preview in Experiment 4. For each word, the total fixation time, gaze duration, and run counts were extracted using the SR Research Data Viewer software. The total fixation time of a word was referred to as the word reading time. The gaze duration was how long a word was fixated before the gaze moved to other words, and it reflected first-pass processing of a word. To characterize late processing of a word, we further calculated the counts of rereading, which were defined as the run counts minus 1. Words that were not reread were excluded from the analysis of counts of rereading. Each eye-tracking measure was averaged across all participants who correctly answered the question.

Regression models
We employed linear regression to analyze how well each model, as well as each set of text/task-related features, could explain human attention measured by eye tracking. In all regression analyses, each regressor and the eye-tracking measure were normalized within each passage by taking the z-score. The prediction accuracy, i.e., the correlation between the predicted eye-tracking measure and the actual eye-tracking measure, was calculated based on five-fold cross-validation.
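The analysis pipeline can be sketched for the simplest case of a single regressor. This is an illustration under stated assumptions, not the analysis code of the study: folds are formed by interleaving words, which is an assumption because the text does not describe how folds were assigned, and the regression has one predictor instead of a feature set. All function names are hypothetical.

```python
import statistics


def zscore(values):
    """Normalize the values of one passage to zero mean and unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def fit_line(x, y):
    """Ordinary least squares with one regressor: returns (slope, intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def cv_prediction_accuracy(x, y, n_folds=5):
    """Correlation between held-out predictions and the actual measure."""
    predicted = [0.0] * len(x)
    for fold in range(n_folds):
        # ASSUMPTION: interleaved folds (every n_folds-th word is held out).
        train = [i for i in range(len(x)) if i % n_folds != fold]
        test = [i for i in range(len(x)) if i % n_folds == fold]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            predicted[i] = slope * x[i] + intercept
    return pearson(predicted, y)
```

In use, `zscore` would be applied separately to each passage before the normalized words from all passages are pooled and passed to `cv_prediction_accuracy`.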

Statistical tests
In the regression analysis, we employed a one-sided permutation test to test whether a set of features could predict an eye-tracking measure significantly better than chance. Chance-level prediction accuracy was calculated by predicting the eye-tracking measure after it was shuffled across all words within a passage: The eye-tracking measure to predict was shuffled but the features were not. The procedure was repeated 500 times, creating 500 chance-level prediction accuracies. If the actual correlation was greater than N out of the 500 chance-level correlations, the significance level was (N + …
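A minimal sketch of a one-sided permutation test with within-passage shuffling is given below. Two simplifications are assumptions rather than the procedure of the study: the test statistic is the plain correlation between a single feature and the measure instead of a cross-validated prediction accuracy, and the p-value uses the common convention (number of chance-level values at least as large as the actual value, plus 1) divided by (number of permutations plus 1). All names are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def permutation_p_value(feature, measure, passage_ids, n_perm=500, seed=0):
    """One-sided p-value; the measure is shuffled within each passage only."""
    rng = random.Random(seed)
    actual = pearson(feature, measure)

    # Group word indices by passage so that shuffling never crosses passages.
    groups = {}
    for idx, pid in enumerate(passage_ids):
        groups.setdefault(pid, []).append(idx)

    n_at_least = 0
    for _ in range(n_perm):
        shuffled = list(measure)
        for idxs in groups.values():
            values = [measure[i] for i in idxs]
            rng.shuffle(values)
            for i, v in zip(idxs, values):
                shuffled[i] = v
        if pearson(feature, shuffled) >= actual:
            n_at_least += 1

    # ASSUMPTION: add-one convention for permutation p-values.
    return (n_at_least + 1) / (n_perm + 1)
```

With 500 permutations, the smallest p-value this convention can return is 1/501.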

This PDF file includes:
Figures S1 to S8
Tables S1 to S6

Fig. S1. (A) … forming a similarity matrix. The similarity measures used in the orthographic and semantic models are the edit distance and correlation between word embeddings, respectively. For each option, the similarity matrix is averaged across all rows and all columns to form a scalar decision score. The option with the largest decision score is chosen as the answer. (B) The SAR model uses bi-directional RNNs to encode contextual information. A vectorial representation for the passage is created using the weighted sum of the vectorial representation of each word, and the weight on each word, i.e., the attention weight, is calculated based on its similarity to the vectorial representation of the question. The summarized passage representation and the option representation are used to form the decision score with a bilinear dot layer.

Fig. S2. Question answering accuracy for individual transformer-based models.
Human results and other computational models are also plotted for comparison.

Question-relevant words are more often content words.

The question relevance effect was quantified by the ratio between the mean word reading time on the line that was most relevant to the question and that on lines that were more than 5 lines away. See Fig. 1 for an explanation of the box plots. **P < 0.01; ***P < 0.001.