Individual Differences in Sensitivity to Style During Literary Reading: Insights from Eye-Tracking

Style is an important aspect of literature, and stylistic deviations are sometimes labeled foregrounded, since their manner of expression deviates from the stylistic default. Russian Formalists have claimed that foregrounding increases processing demands and therefore causes slower reading – an effect called retardation. We tested this claim experimentally by having participants read short literary stories while measuring their eye movements. Our results confirm that readers indeed read slower and make more regressions towards foregrounded passages as compared to passages that are not foregrounded. A closer look, however, reveals significant individual differences in sensitivity to foregrounding. Some readers in fact do not slow down at all when reading foregrounded passages. The slowing down effect for literariness was related to a slowing down effect for high perplexity (unexpected) words: those readers who slowed down more during literary passages also slowed down more during high perplexity words, even though no correlation between literariness and perplexity existed in the stories. We conclude that individual differences play a major role in processing of literary texts and argue for accounts of literary reading that focus on the interplay between reader and text.


Introduction
Literary reading can be distinguished from other types of reading in a number of aspects. Some differences can be attributed to characteristics of the text, such as the frequent and systematic use of rhetorical devices, whereas the reader and the reading context play an important role as well. How and to what extent those factors influence the literariness of the reading experience has been a matter of debate. According to the text-oriented perspective (e.g., [1]), text features are independent of readers and can be more or less literary. The reader-oriented perspective (e.g., [2]), however, claims that the (perceived) literariness depends on a reader's attention to certain aspects of the text. Interactional approaches emphasize that an author can manipulate text characteristics so that the text fulfills certain necessary conditions of being literary, but the reader also needs to react in a certain way to those manipulations for the literary experience to emerge (e.g., [3][4][5][6][7][8]).
An important characteristic of literary reading is foregrounding. The term foregrounding is a translation by Garvin [9] of the Czech term aktualisace, actualization in English. The term refers to words, expressions or structures that stand out from their textual context, because they deviate stylistically in one or more features from the text. It is assumed that foregrounding causes readers to shift their attention from the content to the style of a text [10]. There has been much speculation about the effects foregrounding may have on the reader and the reading process. Mukařovský argued that foregrounded structures cause de-automatization of reading, which means that the text structure is processed less automatically.
Shklovsky [11] referred, much earlier, to the same process as defamiliarization. He explicitly links defamiliarization to aesthetic appreciation: In order for aesthetic appreciation to emerge, the time it takes for the process of perception to be completed must be prolonged. The slowing down in response to foregrounded passages is sometimes called retardation [11]. It is important to note that Shklovsky does not claim that aesthetic appreciation itself causes the increase in processing time. Rather, the longer processing times that result from increased difficulty allow for aesthetic experience to arise. Relating this idea to characteristics of the reader, we expect that experience with reading can influence the effects foregrounding has on reading. More experienced readers are expected to experience fewer problems with texts high on foregrounding because they are more used to language that deviates from the stylistic norm. Retardation may therefore depend on individual experience with reading.
Empirical research has shown that foregrounding generally influences aesthetic appreciation (e.g., [12]), as well as reading times. But the effect of foregrounding has been shown to be subject to individual differences as well. A landmark study by Miall and Kuiken [13][14][15][16] confirmed that foregrounded passages are read more slowly than passages that are not foregrounded. Importantly, they found that a reader's level of experience influences which type of foregrounding (phonetic versus semantic) the reader is sensitive to.
The main goal of the current study is to directly test the hypothesis posed by the Russian Formalists (e.g., [11]) who proposed that readers slow down during the reading of literary passages, compared to non-literary passages. While several reading times studies have investigated this issue before, we extend the empirical literature on this matter by measuring word by word reading times with eye-tracking, and by an explicit focus on individual differences in our analysis. Foregrounded passages are expected to decrease reading speed in the majority of readers, as compared to non-foregrounded passages.
Using eye-tracking, a decrease in reading speed can be measured in numerous ways, but because the chance of a Type I error increases with using multiple dependent measures, we will restrict the analysis to two representative measures, namely gaze duration and chance of regression. Gaze duration is the total fixation time on a word during the first time a word is fixated [17]. So when a word is consecutively fixated multiple times, or when it is fixated only once, gaze duration consists of the sum of the fixation times. Out of the available reading time metrics, we consider gaze duration the best candidate because (i) it takes into account all fixations on a word during the first pass, rather than just the first fixation; (ii) it takes all words into account rather than just the ones that have been fixated only once; and (iii) it allows for the distinction between progressive fixations and regressions. The chance of regression (henceforth simply regressions) is based on a simple binary measure that indicates whether or not a reader fixates on a word $w_i$ after having fixated on any of the words $w_{i+1} \ldots w_n$. This measure represents cognitive difficulty experienced with a word only after the word has either been fixated or skipped.
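To make the two dependent measures concrete, the following is a minimal sketch of how they could be derived from an ordered list of fixations. The input format (pairs of word index and fixation duration) and the function names are assumptions made for illustration; they do not describe the EyeLink export format or the processing pipeline actually used.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical input format: one (word_index, duration_ms) pair per fixation,
# listed in the order in which the fixations occurred.
Fixation = Tuple[int, int]


def gaze_durations(fixations: List[Fixation]) -> Dict[int, int]:
    """Sum the consecutive fixations that make up the first visit to each word."""
    durations: Dict[int, int] = {}
    open_word = None  # word whose first visit is still in progress
    for word, duration in fixations:
        if word not in durations:
            durations[word] = duration   # first fixation on this word
            open_word = word
        elif word == open_word:
            durations[word] += duration  # refixation within the first visit
        else:
            open_word = None             # later revisit: not part of gaze duration
    return durations


def regressed_words(fixations: List[Fixation]) -> Set[int]:
    """Words fixated after some later word in the text had already been fixated."""
    regressed: Set[int] = set()
    furthest = -1  # highest word index fixated so far
    for word, _ in fixations:
        if word < furthest:
            regressed.add(word)
        furthest = max(furthest, word)
    return regressed


fixations = [(0, 200), (1, 150), (1, 100), (3, 250), (2, 180), (4, 210)]
print(gaze_durations(fixations))   # {0: 200, 1: 250, 3: 250, 2: 180, 4: 210}
print(regressed_words(fixations))  # {2}
```

In this toy sequence, word 1 receives two consecutive first-pass fixations (150 + 100 ms), and word 2 is only fixated after word 3, so it counts as the target of a regression even though it was initially skipped.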
Since foregrounding draws attention to the wording of the text rather than its content, it may not only cause readers to slow down, but it may also enhance readers' memory of the surface form of the text. Verbatim memory for a text has often been claimed to be short-lived (e.g., [18][19][20]), but many studies do find higher-than-chance scores on surprise surface form recognition tests [14,21,22]. Moreover, when an element that is in focus (through, e.g., syntactic devices like cleft structures, or through the use of italics) is changed in between two text readings, the change is more likely to be noticed than when an element that is not in focus is changed [23,24]. Therefore, it seems likely that the increase in attention to surface form caused by foregrounding results in improved recognition of the text's surface form.
The above considerations lead us to the following hypotheses:
H1. Gaze durations and regressions are increased for words that are foregrounded compared to words that are not foregrounded.
Differences between individuals are expected to play a role in the effect of foregrounding on reading behavior. We expect that previous exposure to literary language is a crucial factor influencing reading behavior. Frequent readers have more experience with foregrounding than infrequent readers, and are less likely to slow down when reading foregrounded parts of a text, compared to infrequent readers:
H2. Infrequent fiction readers show a larger retardation effect when exposed to foregrounding than frequent fiction readers.
The increased focus on style caused by foregrounding may have additional effects, besides affecting reading behavior. If readers pay more attention to the surface structure during foregrounded passages, then this increased attention is likely to result in enhanced memory for the surface form. This leads to the following hypothesis:
H3. The memory for the surface form is better for foregrounded passages than for non-foregrounded passages.
Literary fiction, as compared to non-narrative texts, often causes the reader to get immersed into the story and construct multimodal situation models [25]. Being immersed or transported in(to) a story is linked to mental simulation (e.g., [7,8,26,27]) and is defined as "the state of feeling cognitively, emotionally, and imaginally immersed in a narrative world" ([28]; see also [29,30], and [6] on disportation). Immersion is associated with enjoyment [31,32], meaning that the more we engage with a story, the more we enjoy it. In order to take effects of immersion into account, we added a self-report measure of transportation into a narrative and story liking as exploratory factors hypothesized to be related to foregrounding. It is possible that the degree of immersion and story liking affect readers' reactions to foregrounding. For instance, readers who are more immersed, or like the story more, may pay less attention to the style of the text, and thus be less affected by foregrounding. Alternatively, immersed readers, or those that like the story more, may attribute higher importance to the text, which could facilitate an intention for thorough understanding, and hence result in longer reading times for foregrounded passages. The hypotheses are therefore unspecified with regard to the direction of the effect:
H4a. Readers who are more immersed in the story react differently to foregrounded words than readers who are less immersed in the story.
H4b. Readers who like the story more react differently to foregrounded words than readers who like the story less.
To test our hypotheses, we had thirty participants read three short stories from Dutch literature, while measuring their eye-movements. After reading the stories, participants filled in a questionnaire measuring how strongly they were immersed in each story, scored how much they liked each story, and filled out questions about personal reading habits, and a multiple choice test that measured the recognition performance of the surface structure of sentences in the stories.

Participants
Thirty healthy participants (25 females) without language impairments were recruited from the participant database of the Max Planck Institute for Psycholinguistics. Age ranged from 18 to 28 years (M = 21.90, SD = 2.63). The participants' native language was Dutch. Participants had normal (N = 17), or corrected-to-normal visual acuity with glasses (N = 4) or contact lenses (N = 9). Near vision was tested with a near vision test. None of the participants made an error when character size was above 1mm at a distance of 40cm, or 0.14° of the visual field (cf. the characters in the experiment, which were 3.3mm at a distance of 97cm, or 0.18° of the visual field). None of the participants studied literary studies.

Materials
Three short stories from Dutch literature were selected (see Table 1). The selection was made on the basis of the length of the stories, as well as their potential to be of interest to the target group. All participants read all three stories (in randomized order). Story 1 had been read before by two of the participants and Story 3 had been read before by three of the participants. None of the participants had read Story 2 before. Because there were so few second-time story readings, those data were not excluded from the analysis. (Separate analyses in which second-time readings were excluded led to qualitatively similar results for all models.)

Pretest
Sixteen participants from the same participant database, who did not take part in any other part of the study, read each of the stories twice. The first time, they were instructed to read the story as they normally would. For the second reading, they were instructed to underline all the words, sentences and passages that they considered to be "literary". We will use the terms foregrounding and literariness interchangeably here. The two terms are not wholly synonymous, but our empirical operationalization of literariness ensures that only those literary passages that capture attention are included in the measure, whereas passages that might be considered literary exactly because they do not draw attention to themselves (i.e., they are backgrounded) are not included, since they are by definition unlikely to be noticed by our participants. In line with the idea that secondary processing leads to better appreciation of the qualities of literary texts (e.g., [12]), the instructions should lead to an adequate measure of the intersubjective perception of literariness. This pretest resulted in a literariness score between 0 and 16 for every word in each of the three stories: a score of 0 if none of the participants had underlined it and a score of 16 if every participant had underlined it.
On average, participants underlined 1106.19 words (SD = 573.36, ranging from 37 to 1989 words) in all three stories combined. That amounts to an average of 12.36% (SD = 6.64%) of the total number of words. A repeated measures ANOVA showed that there were no statistically significant differences between stories in the percentage of words that was underlined, F(2,30) = 1.18, p = 0.32. Note that this operationalization of literariness does not allow us to distinguish between different types of foregrounding/literariness (e.g., phonological or semantic foregrounding).
As shown in Figure 1, there was a high autocorrelation between the scoring of a word $w_i$ and the scoring of the word that immediately followed it, $w_{i+1}$. The size of the correlation gradually decreased as the distance between the words increased. This reflects the fact that participants mainly underscored sentences and passages rather than single words.
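The autocorrelation described here can be illustrated with a short sketch that correlates each word's literariness score with the score of the word a given number of positions later. The score sequence below is invented for illustration and is not taken from the pretest data; the function name is likewise an assumption.

```python
from statistics import mean
from typing import List


def lag_correlation(scores: List[float], lag: int) -> float:
    """Pearson correlation between scores[i] and scores[i + lag]."""
    x = scores[:-lag]
    y = scores[lag:]
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (ss_x * ss_y) ** 0.5


# Invented literariness scores (0-16) for 20 consecutive words,
# mimicking two underlined passages separated by unmarked text.
scores = [0, 0, 1, 5, 9, 12, 12, 8, 3, 0, 0, 0, 2, 7, 11, 10, 4, 1, 0, 0]

for lag in (1, 2, 4):
    print(lag, round(lag_correlation(scores, lag), 2))
```

Because underlining tends to cover whole passages, neighbouring words receive similar scores, so the correlation is highest at lag 1 and falls off as the lag grows.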
Participants did not show perfect consensus on what they considered the beginning and end of a literary passage. For instance, almost all participants agreed that the passage "Een zonnige metalen waterdruppel" ("A sunny metal water droplet") was literary. However, some participants also included the preceding parts of the text, whereas others did not. Figure 2 visualizes this gradual change from low to high literariness scores. The size of the characters shows literariness scoring, the bigger characters being underlined more often.
To test whether there was consensus among the participants regarding their literariness judgments (i.e., their judgments of what counts as literary language were not completely idiosyncratic), we estimated a chance distribution by simulating 1000 distributions of underlining scores with the same (stationary) transition probabilities as the data using MCMC sampling. The probability of underlining a word given the underlining of the previous word, $P(\text{underlining } w_i \mid \text{underlining } w_{i-1})$, was calculated separately for each participant. A two-sample Kolmogorov-Smirnov test indicated that the underlining scores differed significantly from the estimated chance distribution, D = 0.94, p < .0001. Figure 3 shows the density plot for underlining in the three stories combined, compared to the mean percentages from the estimated chance distribution. All participants agreed on the non-literary status of 33.36% of the words, cf. 11.05% in the chance distribution. The percentage of words decreased as agreement upon literariness increased, but not as rapidly as might be expected purely based on chance. Therefore, the perception of literariness in these three stories was partially idiosyncratic, but there was also consensus.
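The logic of the chance distribution can be sketched as follows. This is a plain forward simulation of a two-state Markov chain per participant, which may differ in detail from the sampling procedure actually used. The function names, the toy underlining data and the choice to draw the first word from each participant's base rate are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, List


def transition_probs(marks: List[int]) -> Dict[int, float]:
    """P(underline current word | previous word's state), for states 0 and 1."""
    counts = {0: [0, 0], 1: [0, 0]}  # counts[previous][current]
    for previous, current in zip(marks, marks[1:]):
        counts[previous][current] += 1
    probs = {}
    for previous in (0, 1):
        total = sum(counts[previous])
        probs[previous] = counts[previous][1] / total if total else 0.0
    return probs


def simulate_chain(marks: List[int], rng: random.Random) -> List[int]:
    """Random underlining sequence with this participant's transition probabilities."""
    probs = transition_probs(marks)
    base_rate = sum(marks) / len(marks)  # assumption: start from the base rate
    state = 1 if rng.random() < base_rate else 0
    chain = [state]
    for _ in range(len(marks) - 1):
        state = 1 if rng.random() < probs[state] else 0
        chain.append(state)
    return chain


def chance_score_distribution(all_marks: List[List[int]],
                              n_sim: int = 1000, seed: int = 0) -> Dict[int, float]:
    """Proportion of words at each summed score when participants underline independently."""
    rng = random.Random(seed)
    n_words = len(all_marks[0])
    totals: Counter = Counter()
    for _ in range(n_sim):
        chains = [simulate_chain(marks, rng) for marks in all_marks]
        for i in range(n_words):
            totals[sum(chain[i] for chain in chains)] += 1
    return {score: count / (n_sim * n_words) for score, count in sorted(totals.items())}


# Invented underlining data: 3 participants, 12 words (1 = underlined).
toy_marks = [
    [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
]
print(chance_score_distribution(toy_marks, n_sim=200))
```

The simulated chains keep each participant's tendency to underline runs of words but remove any agreement between participants about where those runs fall, which is what makes the resulting score distribution a baseline for consensus.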

Expert's ratings
As a validation of the non-expert ratings by the participants in the pretest, we compared them to an expert's (one of the authors, MB) rating of foregrounding in the stories. The expert performed the same task as the participants in the pretest, while being blind to the participants' ratings. The point-biserial correlation between the expert's scorings and the mean of the participants' scorings was significant, r = .39, p < .0001, thus corroborating our operationalization of foregrounding.
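As an illustration of the statistic used here, a point-biserial correlation relates a binary variable (the expert underlined a word or not) to a continuous one (the mean pretest score for that word). The sketch below uses invented values and a hypothetical function name; it only shows how the coefficient is defined, not the analysis code that was run.

```python
from statistics import mean, pstdev
from typing import List


def point_biserial(binary: List[int], values: List[float]) -> float:
    """Point-biserial correlation between a 0/1 variable and a continuous variable."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    # Difference in group means, scaled by the population SD and the group proportions.
    return (mean(group1) - mean(group0)) / pstdev(values) * ((n1 * n0) / n ** 2) ** 0.5


# Invented data: expert judgment per word and mean participant score per word.
expert = [0, 0, 0, 1, 1, 1, 0, 0]
participant_mean = [0.1, 0.0, 0.3, 0.6, 0.8, 0.5, 0.2, 0.4]
print(round(point_biserial(expert, participant_mean), 2))
```

The coefficient is numerically identical to a Pearson correlation computed with the binary variable coded as 0 and 1.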

Additional measures
The participants filled in an immersion questionnaire after reading each story, and an additional test battery at the end of the experiment. The immersion questionnaire was based on the Story World Absorption Scale (SWAS, [36]), and selected items from the 30-item version of the narrative engagement questionnaire (NEQ) developed by Busselle and Bilandzic [32]. Both questionnaires measure four dimensions of story engagement and show considerable overlap. SWAS measures attention, transportation, emotional engagement and mental imagery. NEQ measures narrative understanding, attentional focus, emotional engagement, and narrative presence. Narrative understanding is the only dimension not covered by SWAS, and relevant items from this subscale of the NEQ were included.
Participants were also asked to give a general score for Story Liking on a 10-point response scale (1 = extraordinarily bad, 10 = extraordinarily good). In addition, participants were asked to indicate whether they had read the story before and to answer four simple multiple choice questions regarding story content, specific to each story. The goal of these content questions was merely to ensure that the participants processed the stories to a sufficient degree to understand what they were about. Participants who failed to answer at least two of the four questions for a story correctly would be excluded from the analysis.
A principal components analysis based on eigenvalues greater than 1 was conducted to extract the underlying factors from the immersion questionnaire. Promax rotation was used (an oblique rotation method), with a kappa of 4. Appendix A shows the final five components (Empathy, Self-loss, Imagery, Compassion and Understanding) and the items that loaded on them. In cases where an item loaded highly on multiple factors, the item was included in the factor with the closest match on a conceptual level. None of the items failed to load highly on any factor. Reliability was sufficient for all components (Cronbach's α ≥ .82 for all components). The mean score of all items within a factor was taken to represent each participant's score on that factor.
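For readers unfamiliar with the reliability coefficient reported here, Cronbach's α for a component can be computed from the item variances and the variance of the summed score. The sketch below uses invented responses and a hypothetical function name, and it is not the code used for the reported values.

```python
from statistics import variance
from typing import List


def cronbach_alpha(items: List[List[float]]) -> float:
    """Cronbach's alpha; items[j][p] is the response of participant p to item j."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]  # summed score per participant
    item_variance = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Invented responses of five participants to three items of one component.
component_items = [
    [4, 5, 3, 6, 2],
    [5, 5, 2, 6, 3],
    [4, 6, 3, 7, 2],
]
print(round(cronbach_alpha(component_items), 2))  # 0.96
```

Values close to 1 indicate that the items of a component vary together across participants, which is the sense in which reliability is called sufficient above.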
An additional test was conducted to measure recognition of the exact wording of a selection of foregrounded and non-foregrounded passages for each story (the sentence recognition test). From each story, three foregrounded and three non-foregrounded passages were selected. Foregrounded passages included words that were both underlined at least 6 times in the pretest and scored as literary by the expert, whereas non-foregrounded passages were underlined neither by the participants nor by the expert. For the sentence recognition test, we generated alternative items for each sentence, which diverged from the original formulation in terms of their semantics, their syntax, or both. The participants' task was to recognize the original formulation in a multiple-choice test. (See Appendix B for the set of items.)
Finally, we measured three aspects of differences in reading behavior. First, participants indicated how much they liked fiction on a response scale ranging from 1 (not at all) to 7 (very much). Second, they indicated how many fiction books they read per year (0; 1-3; 4-6; 7-9; 10-12; more than 12). As a final measure we used a Dutch version [37] of the Author Recognition Test (ART). This is an indirect measure of print exposure ([38], updated by [39]). The test assesses the participant's ability to recognize popular authors from a list. The test consists of 30 real author names and 12 foils. Every real author that was recognized increased the participant's score by 1, and every foil that was falsely recognized decreased it by 1, so that the potential total score on the ART ranged from -12 to 30.
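The ART scoring rule amounts to hits minus false alarms. A minimal sketch, with placeholder names standing in for the real authors and foils (the actual test items are not reproduced here, and the function name is an assumption):

```python
from typing import Iterable


def art_score(selected: Iterable[str], real_authors: Iterable[str],
              foils: Iterable[str]) -> int:
    """Number of real authors ticked minus number of foils ticked."""
    selected_set = set(selected)
    hits = len(selected_set & set(real_authors))
    false_alarms = len(selected_set & set(foils))
    return hits - false_alarms


# Placeholder item lists with the same sizes as the test described above.
real_authors = [f"author_{i:02d}" for i in range(30)]
foils = [f"foil_{i:02d}" for i in range(12)]

ticked = ["author_00", "author_05", "author_17", "foil_03"]
print(art_score(ticked, real_authors, foils))  # 2
```

Subtracting false alarms penalizes participants who tick names indiscriminately, which is why the score can become negative.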

Apparatus
A monocular tower-mounted EyeLink 1000 eye-tracking system with a 25mm lens was used to collect eye-movement data. A head stabilizer minimized head movements. Eye position was recorded with a sampling rate of 1000Hz. Two separate DELL Precision 390 workstations were used for the presentation of the stimuli and data acquisition. Stimuli were presented on a 20'' Acer AL2023 LCD monitor with a refresh rate of 60Hz.

Stimulus presentation
SR Research's Experiment Builder software was used for the presentation of the stimuli. The stories were presented in 28-39 sections of on average 90.4 words (SD = 25.0). The division of the story into sections was kept as closely as possible to the author's original division of the story into paragraphs. The text was presented in the font Calisto MT, 15pt, in black color on a light grey background. The margins were 120 pixels on all sides. Interest areas for the eye-movement data were automatically defined by the Experiment Builder software. Each word corresponded to an interest area, and the limits for the interest areas were centered between adjacent words, leaving no space in between. Interest area margins on all sides of the text were 10 pixels. The monitor was 40.7cm by 30.5cm and the participants were seated at a distance of 97cm from the monitor.

Procedure
Participants were paid €16 as compensation for taking part in the study. Prior to the experiment, participants were informed about the procedure, and about possible contents of the story. The study was approved by the Ethics Committee Social Sciences of Radboud University Nijmegen (Ethics Approval Number ECG2013-1308-120). Participation was voluntary and participants could withdraw at any time without having to state their reasons. All participants gave written informed consent in accordance with the Declaration of Helsinki.
Participants performed an eye dominance test, so that the dominant eye could be tracked. For reasons not reported in the current article, skin conductance response electrodes were attached to the index and middle finger of their non-dominant hand.
The experiment took place in a sound proof cabin. Participants first read a practice story of 428 words, so that they could get used to the experimental setting and the task. They were informed that following each of the stories, they would have to answer questions regarding the content of the story and regarding their experience of reading the story. It was made clear that the content questions were not difficult and could be answered with ease without the need to remember trivial details of the story. Participants were instructed to move as little as possible. There was no time restriction and participants were encouraged to read the stories the way they would read them outside the laboratory.
The stories were presented in random order. A 9-point calibration preceded the beginning of each story. Every five to ten slides, a drift check was performed to make sure that the calibration was still valid. If this was not the case (four times in total), calibration was repeated. Prior to every slide, participants fixated on a fixation cross at the top left of the screen (where the first character of the text would appear) for 1000ms. After every slide, they pressed a button to continue to the next slide.

Eye-movement data preprocessing
Several variables that are known to influence reading times were controlled for in the analysis. We controlled for lexical frequency [40], word length [41], position on the screen, perplexity, orthographic and phonological neighborhood size, age of acquisition, word prevalence, and semantic relation.
The log-transformed lexical frequency per word was taken from the SUBTLEX-NL database [42]. Word length was measured in number of characters and position on the screen was measured as the horizontal distance from the left side of the screen measured in pixels, divided by 100 to make the scales of the measures more homogeneous.
Perplexity is a measure closely related to word surprisal, which indicates how unpredictable an incoming word is [43,44]. A trigram model was trained to assign probabilities to words given their context in a large corpus of Dutch. Perplexity values for the words in our stories were then calculated by taking 2 to the power of the negative base-2 logarithm of the probability the model assigned to the current word given the preceding context. This means that in the case of high perplexity, the model was very surprised to encounter the word that was just encountered.
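The per-word perplexity defined here is the inverse of the probability the language model assigns to the word, since $2^{-\log_2 p} = 1/p$. The sketch below pairs that formula with a toy trigram model estimated by raw counts on a few English words. The corpus, the absence of smoothing and the function names are assumptions for illustration and do not reflect the model or the Dutch corpus actually used.

```python
import math
from collections import Counter
from typing import List, Tuple


def trigram_probability(tokens: List[str], context: Tuple[str, str], word: str) -> float:
    """Unsmoothed estimate of P(word | two preceding words) from raw counts."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter((a, b) for a, b, _ in trigrams.elements())
    return trigrams[(context[0], context[1], word)] / contexts[context]


def word_perplexity(probability: float) -> float:
    """2 to the power of the negative base-2 log probability, i.e. 1 / probability."""
    return 2 ** (-math.log2(probability))


corpus = "the cat sat on the mat the cat sat on the sofa the cat ran".split()

p_mat = trigram_probability(corpus, ("on", "the"), "mat")   # 1 of 2 continuations
p_ran = trigram_probability(corpus, ("the", "cat"), "ran")  # 1 of 3 continuations
print(word_perplexity(p_mat), round(word_perplexity(p_ran), 2))
```

The less probable continuation ("ran" after "the cat") receives the higher perplexity, matching the interpretation that the model is more surprised by it.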
We refer to word prevalence as the log odds of correctly identifying a letter string as a word rather than a non-word. These were obtained from Keuleers, Stevens, Mandera and Brysbaert [47]; http://crr.ugent.be/archives/1494.
All words' semantic relations to the previous content words in the sentence, a measure of semantic priming, were calculated as in Frank and Willems [48]. Semantic vectors were obtained from http://zipf.ugent.be/snaut/ [49].
We controlled for all of these variables by including them as predictors in the mixed effects model, if they were significant.
All fixations were checked to make sure they did not diverge so far from the line being read that they entered a different interest area, and were manually aligned using SR Research's EyeLink Data Viewer. Data for 10 entire story-readings (including all three from one participant) were excluded from the analysis due to inaccuracy of the eye-movement data. Another 74 individual slides were excluded for the same reason, amounting to the exclusion of 11.9% of all slides in total. Fixations on the first word of each slide were also excluded, because they were disproportionately long, reflecting the aftereffects of fixating on the fixation cross prior to each slide. This led to the exclusion of 1.1% of the data. Reading times that deviated more than 3.5 times the standard deviation from the mean were removed from the dataset, as were reading times of 0ms (0.6% in total).
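The final trimming step can be sketched as below. The text does not specify whether the mean and standard deviation were computed over the whole dataset or per participant, nor whether zero values were dropped before or after computing them; this sketch assumes a single pooled computation after removing zeros. The function name and the data are invented.

```python
from statistics import mean, stdev
from typing import List


def trim_reading_times(times_ms: List[float], n_sd: float = 3.5) -> List[float]:
    """Drop 0 ms values, then drop values more than n_sd SDs from the mean."""
    nonzero = [t for t in times_ms if t > 0]
    center, spread = mean(nonzero), stdev(nonzero)
    return [t for t in nonzero if abs(t - center) <= n_sd * spread]


# Invented gaze durations: 30 typical values, one 0 ms value and one extreme value.
times = [200] * 15 + [220] * 15 + [0, 5000]
trimmed = trim_reading_times(times)
print(len(times), len(trimmed))  # 32 30
```

A cutoff as wide as 3.5 standard deviations only removes a value when the sample is reasonably large, because a single extreme value also inflates the standard deviation it is compared against.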
Although the word-related datasets from which we retrieved our predictors (age of acquisition, orthographic and phonological neighborhood size, prevalence and semantic relation) cover a substantial part of the Dutch lexicon, not all words in our stories were in all of these datasets. Words for which one of the predictors was lacking were not included in the analysis (22581 data points; 9.6% of the data). For this reason, valence and arousal, two potentially interesting predictors, were not included in the analysis: these norms existed for only 35.8% of the word tokens in the data.
Two dependent variables were analyzed per word: gaze durations and regressions. Words that were skipped were treated as missing data.

Data analysis
All data were analyzed using the statistical software package R v3.0.2 [50]. A linear mixed model was created to analyze gaze durations, using the lmer function from the lme4 library [51]. First, a model with all fixed effects terms, random intercepts per participant and per story, and random slopes per participant for literariness and perplexity was constructed to predict gaze duration. Models with more elaborate random effects structures did not converge. Subsequently, fixed effects were deleted one by one. If the model fit did not deteriorate after their exclusion (i.e., the likelihood ratio test was not significant), the simpler model was chosen. The p-values for individual predictors were likewise determined on the basis of the change in model fit (in $\chi^2$) when the individual predictors were excluded from the model. In a separate model, the scores for the predictors log frequency, log perplexity, orthographic and phonological neighborhood size, age of acquisition, word prevalence, semantic relation and literariness were taken from the previous word (word $w_{i-1}$) rather than from word $w_i$ to account for spillover effects [52]. This analysis yielded similar results, but with stronger effects of literariness and semantic relation and less strong effects of frequency and age of acquisition. It is likely that this was because literariness and semantic relation were highly autocorrelated, so word $w_i$ and word $w_{i-1}$ would have similar values for these predictors, whereas this was not the case for the other predictors. The only reason literariness and semantic relation did better in the spillover model was probably that less variance was explained by the factors we wanted to control for. We consider the model with values for word $w_i$ to be more valid and we only report this model here (but see the supplementary materials for details of the spillover model).
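The likelihood ratio comparison behind the backward elimination can be illustrated independently of the model-fitting software. The sketch below takes the log-likelihoods of two nested models that differ by a single fixed effect and returns the $\chi^2$ statistic and its p-value for one degree of freedom. The log-likelihood values are invented, the function name is hypothetical, and the analysis itself was run in R with lme4, not with this code.

```python
import math
from typing import Tuple


def likelihood_ratio_test(loglik_full: float, loglik_reduced: float) -> Tuple[float, float]:
    """Chi-square statistic and p-value for dropping one fixed effect (1 df)."""
    chi_square = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Invented log-likelihoods of a full model and a model without one predictor.
chi_square, p_value = likelihood_ratio_test(-1000.00, -1006.50)
print(round(chi_square, 2), round(p_value, 5))

# Decision rule described in the text: keep the simpler model if fit does not deteriorate.
keep_simpler_model = p_value >= 0.05
print(keep_simpler_model)
```

With these invented values the fit deteriorates significantly when the predictor is removed, so the predictor would be retained. The closed-form p-value holds only for one degree of freedom; comparisons involving more parameters would need the general chi-square distribution.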
Regressions were analyzed using generalized mixed effects logistic regression (from the lme4 library, [51]), following the same procedure as for the gaze durations. All predictors were z-transformed to overcome problems with convergence. The p-values for individual predictors resulted from asymptotic Wald tests.
The recognition test data were likewise analyzed using generalized mixed effects logistic regression including random intercepts per participant. Random slopes for the predictors gaze duration and foregrounding were not included due to convergence problems.

Gaze duration
Appendix C shows the step-by-step results of the model selection process for the gaze duration model. Log word frequency, position on the screen, phonological neighborhood size, age of acquisition and word prevalence were all significant predictors of gaze duration. Figure 4 shows the standardized beta weights of the fixed effects terms. Literariness as perceived by the participants of the pretest (henceforth simply literariness) and log perplexity were also significant predictors of gaze duration. The slopes of these latter two predictors were allowed to vary per participant. Figure 5 shows the differences between participants in the effect of literariness on gaze duration without random slopes (dashed lines) and when allowing the slopes to vary across participants (solid lines). The results are plotted for each participant separately. The figure shows that there are differences between participants in how they reacted to literary passages: Some slowed down whereas others sped up their reading when encountering foregrounded passages.
Table 2 shows the coefficients for the mixed effects logistic regression model fit to the regression data. Log word frequency, position on the screen, phonological neighborhood size, orthographic neighborhood size, age of acquisition, word prevalence and semantic relation were all significant predictors of the chance of regressing to the word, as were literariness and log perplexity, also when the slopes of these latter predictors were allowed to vary per participant. The model did not improve after the exclusion of any of the predictors.

The relation between Immersion, Story Liking and the Retardation Effect
The analysis of the gaze duration data resulted in a score for each participant that indicated to which degree literariness affected reading times. This score equaled the slope of the regression line. It could be positive or negative, indicating slowing down or speeding up during reading of foregrounded passages. We will call these scores Retardation Effects (though for some participants there was no slowing down but speeding up for the more literary words).
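The idea of a per-participant slope can be made concrete with a simplified sketch. Note the assumption: the scores above come from a mixed effects model in which the literariness slope varies per participant, whereas the sketch below fits a separate ordinary least squares line for each participant, which is only a rough stand-in. The function names and the data rows are hypothetical.

```python
from collections import defaultdict

def ols_slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def retardation_effects(rows):
    """rows: iterable of (participant, literariness, gaze_duration_ms) tuples."""
    by_participant = defaultdict(lambda: ([], []))
    for participant, literariness, gaze in rows:
        by_participant[participant][0].append(literariness)
        by_participant[participant][1].append(gaze)
    return {p: ols_slope(x, y) for p, (x, y) in by_participant.items()}

# Hypothetical data: p1 slows down on more literary words, p2 speeds up.
rows = [
    ("p1", 0, 200), ("p1", 1, 220), ("p1", 2, 240),
    ("p2", 0, 230), ("p2", 1, 220), ("p2", 2, 210),
]
effects = retardation_effects(rows)
```

A positive slope corresponds to slowing down on more literary words and a negative slope to speeding up, which is the sign convention used for the Retardation Effect.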
In this section we discuss how Retardation Effects related to Immersion and Story Liking. Mean scores and standard deviations for the five factors of the immersion questionnaire that resulted from the factor analysis (Empathy, Self-loss, Imagery, Compassion and Understanding) and for Story Liking are presented in Table 3.
The scores from the immersion questionnaire were included in a correlation matrix, together with the general score for Story Liking and Retardation Effects. For this correlation analysis Retardation Effects were calculated per participant-story pair instead of per participant, in order to avoid correlating each single Retardation Effect score with three Immersion or Story Liking scores. Table 4 shows the correlation matrix. None of the factors from the immersion questionnaire correlated significantly with the Retardation Effect (nor did the mean of the subscales, Overall Immersion). The score on Story Liking did not correlate significantly with the Retardation Effect either.

The relation between reading experience and the Retardation Effect
Scores on the ART ranged from 0 to 14 (M = 8.14, SD = 3.59), indicating that most participants were able to recognize several authors from the list. Fiction reading scores ranged from 0 fiction books per year to more than 12. Most participants indicated they read 1-3 books per year. Fiction liking scores ranged from 3 to 7 on the 7-point scale (M = 5.59, SD = 1.23), indicating that most participants enjoyed reading fiction.
Scores relating to fiction reading were included in a correlation matrix (see Table 5), together with the Retardation Effect and random slopes for perplexity. Scores on the ART and amount of fiction books read per year showed a significant positive relationship, corroborating the reliability of the ART as an index of print exposure. The negative correlations between scores on the ART and reading experience on the one hand and the Retardation Effect on the other did not reach significance after Bonferroni correction. Fiction liking did not correlate significantly with the Retardation Effect either. There was a significant correlation, however, between the Retardation Effect and random slopes for perplexity. This correlation is illustrated in Figure 6.
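The Bonferroni correction mentioned above multiplies each p-value by the number of tests in the family and caps the result at 1. The following is a minimal sketch; the p-values are hypothetical and are not the values from Table 5, and how many tests entered the correction is not stated here, so the family size of four is an assumption made purely for illustration.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: p * number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical uncorrected p-values for a family of four correlations.
raw_p = [0.03, 0.0001, 0.5, 0.2]
adjusted_p = bonferroni(raw_p)

# Which correlations remain significant at alpha = .05 after correction?
significant = [p < 0.05 for p in adjusted_p]
```

In this illustration a correlation with an uncorrected p of .03 no longer passes the .05 threshold once the correction is applied, which is the pattern described for the correlations with ART and reading experience.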

Recognition
On average, participants recognized 8.3 out of 18 passages correctly (SD = 2.6; range = 4-15), significantly higher than chance, t(29) = 8.23, p < .0001. Passages that deviated both semantically and syntactically from the original were less often falsely recognized than those that deviated from the original in only one dimension, but the latter two did not differ from one another (see Table 6).
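The comparison with chance can be sketched as a one-sample t-test computed from the summary statistics. The chance level is an assumption: with four response options per passage, guessing would yield 18 / 4 = 4.5 correct answers. Because the mean and standard deviation used here are the rounded values reported above, the result is expected to be close to, but not identical with, the reported t-value.

```python
from math import sqrt

def one_sample_t(sample_mean, sample_sd, n, mu0):
    """t statistic for testing whether a sample mean differs from mu0."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - mu0) / standard_error

n_participants = 30
n_passages = 18
n_response_options = 4  # assumption: four answer options per passage

chance_level = n_passages / n_response_options  # 4.5 correct by guessing
t_value = one_sample_t(8.3, 2.6, n_participants, chance_level)
degrees_of_freedom = n_participants - 1
```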
Inspection of the data in Table 6 shows a slight preference for correct answers in the foregrounded condition compared to the non-foregrounded condition, but a generalized logistic regression analysis showed that literariness was not a significant predictor of correct recognition, b = -0.282, SD = 0.188, p = .134. Neither literariness, nor the total amount of time spent reading a passage significantly improved the chance of recognizing the correct surface structure of the passage.

Discussion
This study investigated the effects of foregrounding on reading behavior as measured by gaze durations, regressions and recognition of surface structure. Foregrounding was operationalized by having laypeople underscore literary passages. Thirty participants read three short stories from Dutch literature, while their eye-movements were recorded. In addition, we measured story immersion, reading behavior and recognition of the exact wording of sentences from the stories that were read.

Reading behavior
We investigated the effect of foregrounding on gaze durations and regressions, while controlling for several other variables. Lexical frequency, age of acquisition, word prevalence, word length, phonological neighborhood size, position on the screen and perplexity all had a significant effect on gaze duration and the chance of regression. Orthographic neighborhood size and semantic priming did have a significant effect on the chance of regression, but not on gaze duration. In accordance with the Russian Formalists' notion of retardation, it was found that in general, foregrounded words were indeed read slower than words that were not foregrounded, confirming H1 (see also [13,14]). However, when we zoom in on the level of individual participants, the picture becomes more complex [53]. Clearly, there is significant variability between readers; some indeed slow down when reading foregrounded words, but others show the reverse effect: They speed up (Figure 5). The chance of regressing to a word was higher for words that are foregrounded as compared to words that are not foregrounded. Individual differences in sensitivity to text features have been observed before in behavioral (e.g. [54]) and neuroimaging studies [50,55], and our current findings highlight the importance of taking individual differences seriously in the study of literary reading (e.g. [6,56]).

Foregrounded passages can be more difficult than non-foregrounded passages in a number of ways. For example, some passages include ellipsis, or other syntactic deviations that raise linguistic expectations that are not met. A pronoun at the beginning of a sentence creates the expectation that a verb will follow. If this expectation is not met, as in "Zij niet, zij zei dat nú niet" ("She not, she said that now not"), readers potentially regress to the part where a verb was expected in order to exclude the possibility that they simply missed a word.

Table 5: Correlation matrix for the measures relating to reading experience, ART, fiction reading, fiction liking, random slopes for perplexity and the Retardation Effect. Note: *p < .05, **p < .01, ***p < .001. p-values are Bonferroni-corrected.
It seems likely, in line with Shklovsky's [11] position, that both regressions and slowing down are due to difficulty in processing, but they may equally be due to aesthetic feelings the text provokes. Figure 7 shows a working model of the cognitive processes underlying the observed effect. In this model, the aesthetic response is triggered by the slowing down, and vice versa. Foregrounding deautomatizes perception in this framework not only by appealing to aesthetic preferences directly, but also by increasing processing demands. During foregrounded passages, readers can no longer rely on their usual expectations about the text as it unfolds and need to pay more attention. This comes down to increased awareness of the surface form of the text, and it may lead to increased appreciation (as long as the linguistic input is not too difficult). Conversely, however, aesthetic appreciation, in the form of being interested in, concerned about and fascinated by the text (as in [8]), can also influence reading times directly, as fascinated readers will be motivated to read the text more carefully. We want to point out that this is a hypothesized scenario; the directionality of the effect cannot be determined on the basis of the present data.
Our findings and our interpretation of them are largely in line with Jacobs' [7,8,57] recently developed Neurocognitive Poetics Model (NCPM) of literary reading. The NCPM is a dual-route model that predicts that foregrounded text elements are processed more slowly than backgrounded elements. Whereas backgrounded passages allow the reader to become immersed into the story because they consist of familiar words and do not draw attention to the surface structure, foregrounded passages evoke, among other things, attention ("explicit processing"), as well as aesthetic feelings. The aesthetic feelings are in turn predicted to cause slower reading.
With regard to our study, the NCPM correctly predicts not only that readers should be sensitive to foregrounding, but also that there should be no strong relation between immersion and sensitivity to foregrounding, since the two depend on different modes of processing. Of course we should be careful in interpreting a null result, but we have found no evidence that readers who are more immersed in a story also slow down more during foregrounding.
What the NCPM does not explicitly include, however, is a direct link between the modulation of attention and reading pace. We believe that increased attention, which may be a result of the violation of expectations that is brought about by deviations in the style of the text, can also directly affect the readers' pace, without the need for them to experience aesthetic feelings. This interpretation is supported by the strong correlation between response to foregrounding and response to perplexity. The slowing down response to high perplexity, unexpected words, cannot easily be explained with reference to increased aesthetic feelings, but is more likely due to general difficulty with reading passages that deviate from the norm in the language (in the parole sense), regardless of their literary status. Yet there is a correlation between sensitivity to foregrounding and sensitivity to perplexity (and both slowing down effects are stronger for those participants who read less often, but these latter correlations were not statistically significant after Bonferroni correction, so we should interpret them with caution). This suggests that part of the retardation effect is also simply due to literary language being more difficult than non-literary language, in ways that are not captured very well by any of our control variables.

                                   Condition
Response option                    Foregrounded   Non-foregrounded
Correct                            136            114
Semantic deviation                 55             60
Syntactic deviation                56             59
Semantic and syntactic deviation   23             37
More supporting evidence for our interpretation comes from a study by Song and Schwarz [58]. In their study, participants read the question "How many animals of each kind did Moses take on the Ark?" and either gave an answer, indicated they did not know the answer, or indicated that they could not say because the question was ill-formed (it was Noah who took animals on the Ark, not Moses). There were two conditions: one in which typeface was easy to read and one in which it was difficult to read. The difficult typeface led to significantly more discoveries of the anomaly than the easy typeface. Song and Schwarz conclude that this simple increase of low-level processing demands resulted in deeper processing of the text.
In an fMRI study by Bohrn, Altmann, Lubrich, Menninghaus, and Jacobs [59], it was found that reading unfamiliar proverbs, which are assumed to be more difficult to understand, increased both cognitive and affective processing compared to familiar proverbs. These results are in line with the theoretical considerations of Mukařovský [10] and Shklovsky [11]. If processing is too fluent, or automatic, there is little chance for appreciating the poetic dimension of the text (see [60,61]). Clearly, there is a limit to this: Very idiosyncratic language use can hinder the flow of information so much that it impairs comprehension. Such a scenario was not the case for the texts that we selected in the present study.
The NCPM makes another interesting prediction with regard to eye-movements during literary reading that we did not have hypotheses about. The model predicts readers to exhibit smaller saccades during foregrounded passages than during backgrounded passages. We here report the results from a mixed effects regression analysis of all rightward saccades in our dataset. We included in the model the same fixed effect predictors that were used for the gaze duration and regression analyses. The effect of literariness was allowed to vary per participant and story, and the effect of perplexity was allowed to vary per participant as well. Predictor values were based on the word that formed the launch site of the saccade. Predictors were z-transformed to overcome convergence problems. The results of the final model are shown in Table 7.
Only prevalence did not have a significant effect on saccade size. Literariness did have a significant effect, but in the opposite direction from what the NCPM predicts: saccades launched from more literary words are generally longer than those launched from less literary words. A closer look at the individual participants' effect of literariness on saccade size tells us something about how we might interpret this effect: The slope of the effect of literariness on saccade length shows a strong negative correlation with the Retardation Effect, r = -.66, p < .001, and a moderate negative correlation with the individuals' slope of perplexity on gaze duration, r = -.37, p < .05. Participants who slow down more during literary passages also display smaller saccades during literary passages. This reading behavior is in line with earlier findings that readers show distinct reading profiles, or "strategies" [62][63][64]. According to the "Risky Reader Hypothesis" [59,60], more proactive, "risky" readers display long saccades and many regressions. They rely relatively strongly on guessing which words are in the parafovea, but often need to regress to an earlier word when this strategy fails. Conservative readers, on the other hand, display shorter saccades and fewer regressions. It seems that the readers who slow down more during literary passages are the rather conservative readers, whereas those who slow down less are more proactive. This leaves open the possibility that, as a reviewer suggests, some of these more proactive participants may have simply skipped over the text during foregrounded passages because it was too difficult for them, leading to both shorter gaze durations and longer saccades.

Reading experience
We cannot confirm H2 (infrequent fiction readers show a larger retardation effect when exposed to foregrounding than frequent fiction readers). After Bonferroni correction, neither the correlation between retardation and ART/fiction reading nor the correlation between perplexity and ART/fiction reading was significant. The sample size for these correlations was rather small (N = 29), so a follow-up study with a larger number of participants is needed to say anything conclusive about this issue.

Recognition
According to H3, the memory for the surface form is better for foregrounded passages than for non-foregrounded passages. This hypothesis cannot be confirmed: we found no effect of foregrounding on the chance of recognition after controlling for total reading times. Correct phrases were recognized above chance-level (46% of the time), but there was no difference between foregrounded and non-foregrounded passages (cf. [14]).
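The 46% figure can be recovered from the response counts reported per condition in the Results. The short check below is an illustrative sketch rather than part of the original analysis; the variable names are hypothetical, and the number of participants (30) is taken from the description of the sample.

```python
# Response counts per condition: (foregrounded, non-foregrounded).
counts = {
    "Correct": (136, 114),
    "Semantic deviation": (55, 60),
    "Syntactic deviation": (56, 59),
    "Semantic and syntactic deviation": (23, 37),
}

foregrounded_total = sum(fg for fg, _ in counts.values())
non_foregrounded_total = sum(nfg for _, nfg in counts.values())
correct_total = sum(counts["Correct"])

# Overall proportion of correct recognitions across both conditions.
proportion_correct = correct_total / (foregrounded_total + non_foregrounded_total)

# Mean number of correctly recognized passages per participant (30 participants).
mean_correct_per_participant = correct_total / 30
```

The counts sum to the same total in both conditions, and the overall proportion and per-participant mean agree with the reported 46% and 8.3 out of 18.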

Story liking and immersion
We cannot confirm H4a (readers who are more immersed in the story react differently to foregrounded words than readers who are less immersed in the story). None of the factors of the immersion questionnaire significantly correlated with the effect of foregrounding on reading times. Because this is a null effect, we have to interpret it with caution, but immersion does not seem to play a role in slowing down during reading of foregrounded passages, as predicted by the NCPM [8,54]. Story Liking did not show a significant positive correlation with the Retardation Effect either: The participants who liked the story better were not more likely to slow down during reading of foregrounded passages. Therefore H4b (readers who like the story more react differently to foregrounded words than readers who like the story less) cannot be confirmed either.
It should be noted that the immersion questionnaire does not necessarily capture immersion as it is experienced during reading. Participants need to recall their feeling of immersion after the story is already finished, and their memories need not be accurate. In future research, this issue can be partially overcome by, for instance, splitting the story into parts and collecting immersion ratings per story part (see [8]).

Conclusions
This study partially confirmed the Russian Formalists' idea of retardation, the idea that foregrounding makes readers slow down. By using direct measurements of eye movements combined with advanced statistical modeling, our study allows a more differentiated understanding of Miall and Kuiken's [13] initial findings. That is, readers do not always slow down during foregrounded passages; whether they do depends on the reader.
We cannot say with certainty what it is that determines whether readers slow down or speed up. Slowing down is not related to how much readers appreciate the story, nor does it correlate with any aspect of immersion, be it empathy with the characters, self-loss, imagery, compassion or understanding, even though all of these factors contribute to how much readers appreciate the story. Nor can we conclude from this study that the degree of retardation depends on experience with reading fiction, although the correlation between retardation and slowing down during high-perplexity words suggests that reading proficiency may play a role. Which variables cause the differences in effects between readers is therefore still an open question. Relating back to the introduction, our results provide evidence in favor of an interactional account of literary reading, an account that focuses on the reader as well as the text, and the interaction between the two.

Additional Files
The additional files for this article can be found as follows: