Eye Movement Evidence for Context-Sensitive Derivation of Scalar Inferences

A scalar expression like some can optionally have an enriched interpretation (approximately meaning “some, but not all”) depending on the context in which it appears. Numerous experiments using the self-paced reading method have found evidence that context has an online effect on the interpretation of a scalar term, resulting in faster or slower reading times for a later phrase whose comprehension is dependent on the interpretation of some. The present study used eye movements to isolate the time course of this process. We find evidence that the reading time facilitation observed in previous studies was driven by early reading measures, with little reading time evidence for an immediate inference-based processing cost at the scalar expression itself, consistent with previous studies. Our results suggest that comprehenders can rapidly commit to enriched interpretations online without cost and that these enriched interpretations are then used to guide the processing of upcoming sentence material.


Introduction
A substantial part of language comprehension is inferring messages that were not explicitly said. One of the most intensely investigated types of such inferences is scalar inference, whereby a speaker uttering a weaker or less informative expression is believed to mean that a stronger or more informative alternative is false. For instance, "all of the cookies" is more informative (more specific) than "some of the cookies" (which, logically speaking, means any nonzero number of the cookies, up to and including all of them); accordingly, the interpretation of some is often enriched, such that a person saying "I ate some of the cookies" will often be understood as meaning "I ate some, but not all, of the cookies" (Grice, 1975; for recent reviews see Chemla & Singh, 2014; Noveck & Reboul, 2008; Sauerland, 2012; Sauerland & Schumacher, 2016).
This sort of scalar inference is generally believed to be context-sensitive: in certain contexts, it is less likely to arise, or the inference is made more slowly and effortfully. This has frequently been shown in on-line psycholinguistic studies in which the context-dependent interpretation of a scalar expression, like some of the, modulates the processing of some downstream expression. Consider, for instance, the two vignettes below (based on Politzer-Ahles and Fiorentino, 2013):
1) a. Context highly supportive of scalar inference: Yousef asked Fatima whether all of the students had passed the test. Fatima said that some of them had. She added that the rest were planning to retake the class.
b. Context less supportive of scalar inference: Yousef asked Fatima whether any of the students had passed the test. Fatima said that some of them had. She added that the rest were planning to retake the class.
In (1a), because the context explicitly introduces a question about whether all is true, some of them is likely to be interpreted as meaning "not all of them". Subsequently, a person reading this passage will be aware that there are still some students who have not passed the test. When this reader later reads the rest, they will be able to comprehend it quickly (either because they have expected this expression already, or because they are able to more easily integrate it into a discourse model which already has a salient group of students who have not passed the test). On the other hand, in (1b), the context raises a scenario in which knowing that all is not true does not provide relevant information to answer Yousef's question (since Yousef only wants to know if the number of students who passed is greater than zero). Thus, some of them is less likely to be interpreted as meaning "not all of them", given that this interpretation does not answer the question under discussion. Accordingly, a reader may be less aware of a salient group of students who have not passed the test, and therefore will not comprehend the rest so quickly.
The above is one example of a sort of contextual manipulation that may influence the derivation of scalar inferences, but there are also many others. For example, scalar inferences are less likely to be realized when the speaker has incomplete information (Bergen & Grodner, 2012; Breheny et al., 2013; Goodman & Stuhlmuller, 2013; among others), given that a pragmatic derivation of a scalar implicature requires an assumption that the speaker is knowledgeable, i.e., that the speaker knows how many students passed the test. Scalar inferences are also less likely (or unable) to be realized in downward entailing contexts (Chierchia et al., 2012; Hartshorne & Snedeker, ms.; Hartshorne et al., 2015; among others), as these contexts make the non-enriched interpretation (i.e., "at least one of, up to and possibly including all of") more informative than the enriched "not all" interpretation. More global aspects of the context, e.g., what types of alternatives are available in the experimental context, also modulate the availability of scalar implicatures (Degen & Tanenhaus, 2015, among others). It is possible that different types of contextual information are employed in qualitatively different ways during the online comprehension of scalar expressions; the present study focuses on one of the cases mentioned above, the comprehension of scalar expressions in upward vs. downward entailing contexts.
Upward entailing contexts are those in which a proposition about a set entails a proposition about its superset (e.g., if it's true that "A black dog came", then it's necessarily true that "A dog came"); downward entailing contexts are those in which a proposition about a set entails a proposition about its subset (e.g., if it's true that "No dog came", then it's necessarily true that "No black dog came"). The antecedent of a conditional (i.e., an if clause) creates a downward entailing context, and it has been frequently observed that a scalar expression is less likely to be enriched (e.g., some is less likely to be interpreted as meaning "not all") when it appears in such a context, compared to when it appears in an upward entailing context (Chierchia et al., 2012; Hartshorne & Snedeker, ms.; Hartshorne et al., 2015; among others). For example, some should be less likely to be interpreted as "not all" in (2b) than in (2a):
2) a. Upward entailing: Some of the students passed the class, and the rest need to retake it.
b. Downward entailing: If some of the students passed the class, then the rest need to retake it.
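The two definitions above can be checked mechanically on a toy model. The following is a minimal sketch, assuming a hypothetical three-dog domain (the names and the helper functions `a_came`, `no_came`, and `entails` are invented for illustration): it enumerates every possible set of arrivals and tests whether one statement is true in all scenarios where another is true.

```python
from itertools import combinations

DOGS = {"rex", "fido", "spot"}   # hypothetical toy domain
BLACK_DOGS = {"rex"}             # a subset of DOGS

def all_subsets(items):
    """Yield every subset of items (each is one possible set of arrivals)."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

def a_came(group, came):
    """'A <member of group> came': at least one member of group arrived."""
    return bool(group & came)

def no_came(group, came):
    """'No <member of group> came': no member of group arrived."""
    return not (group & came)

def entails(premise, conclusion):
    """True if the conclusion holds in every scenario where the premise holds."""
    return all(conclusion(c) for c in all_subsets(DOGS) if premise(c))

# Upward entailing: the subset statement entails the superset statement only.
up = entails(lambda c: a_came(BLACK_DOGS, c), lambda c: a_came(DOGS, c))
up_reverse = entails(lambda c: a_came(DOGS, c), lambda c: a_came(BLACK_DOGS, c))

# Downward entailing: the superset statement entails the subset statement only.
down = entails(lambda c: no_came(DOGS, c), lambda c: no_came(BLACK_DOGS, c))
down_reverse = entails(lambda c: no_came(BLACK_DOGS, c), lambda c: no_came(DOGS, c))

print(up, up_reverse, down, down_reverse)
```

The sketch only illustrates the set-theoretic direction of entailment; it makes no claim about how comprehenders compute it.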
In (2b), the if clause is actually more informative if the meaning of some is not enriched. That is to say, "if at least one and possibly all of the students passed the class" is a stronger generalization, covering more possible situations, than "if some but not all of the students passed the class" because the former case entails the latter case. 1 Therefore, the interpretation of some is often not enriched in this situation, since it would lead to a less informative rather than a more informative utterance. 2 The contextual manipulations shown in (1) and (2), as well as others mentioned above, all modulate both the availability of the "not all" interpretation of some, and the ease of processing a later anaphor like the rest, as described above. Variations of this paradigm have been widely used in experimental pragmatics (see Table 1), and almost all of them have found an effect of context on the comprehension of a downstream anaphor after a scalar expression. That is to say, almost all the above studies found that the rest is read more quickly after having read some in a context that is more supportive of scalar implicatures (e.g., 1a and 2a), compared to after having read some in a context that is less supportive (e.g., 1b and 2b). However, there are still open questions about the nature of the processes underlying this effect. The majority of these experiments have used self-paced reading (Just, Carpenter, & Woolley, 1982), which is a relatively coarse measure of how long it takes a participant to read a given word or phrase. This method provides only one measure of reading time per word or phrase, and is somewhat unnatural, as participants are only shown part of a sentence at a time and must repeatedly press a button to continue reading. By comparison, measuring reading time by recording eye movements allows for both more natural reading and, importantly, multiple measures of reading time (Rayner & Sereno, 1994).
For instance, different cognitive processes may result in faster initial reading times on a word, more time spent re-reading a word before moving on, or more time looking back to re-read a word after moving on; such differences are not detectable with self-paced reading. For these reasons, using eye movement measures to shed light on the specific locus of the abovementioned reading time slowdowns is a valuable means to better understand the kind of processing that underlies this effect. Currently, there are multiple possible explanations for why reading times might speed up at the rest in the context that is more supportive of scalar inferences. While it seems relatively uncontroversial that this effect is attributable to facilitation by the enriched interpretation of some (i.e., the effect is a downstream consequence of having realized the scalar inference), it is not yet known exactly how that interpretation is eventually deployed to ultimately result in faster reading times at the rest. On the one hand, the effect might be related to prediction of this particular expression or concept (i.e., after interpreting some as meaning "not all", the reader expects that the next sentence will explain what the situation is for the remaining referents) or facilitation of lexical access of the rest, in which case this difference might mainly influence measures of early reading processes. On the other hand, the effect might be related to the difficulty of integrating the rest into the discourse model, or even to revision of the interpretation of some as meaning "not all" (i.e., enriching the meaning at this late point in the sentence). In these cases, the difference might mainly influence measures of late processes (although this assumption is not uncontroversial; see Discussion); in the latter case, it might also result in more eye movements from the rest back to some as readers reconsider their interpretation of the quantifier.
Thus far, only Lewis (2013) has tested the context-sensitivity of scalar inferences using eye-tracking with this paradigm. In that study, the reading time facilitation at the rest was driven by differences in late reading measures: specifically, re-reading time (the sum of the durations of every fixation on the rest after it had been passed once) and total time (the sum of the durations of all fixations on the rest). However, as that study had somewhat different aims (in addition to measuring this effect, it was also focused on testing other contextual manipulations and other types of linguistic scales), there are still open questions regarding how much of this effect was due to scalar inferences in particular. In the paradigm typified in (1) and also used by Lewis (2013), there are multiple differences between the highly inference-supporting context (1a) and the less inference-supporting context (1b). For instance, reading time differences later in the sentence might be due to some other downstream effects elicited by the different contexts themselves, rather than being due to the interpretation of "some of" per se. In many experiments using this paradigm or variations thereof (Bergen & Grodner, 2012; Hartshorne & Snedeker, ms.; Hartshorne, Liem Azar, Snedeker, & Kim, 2015), control conditions are included to replicate the context difference while removing the scalar inference difference; e.g., (3a-b) has the same context manipulation as (1a-b), but uses the critical quantifier only some, the interpretation of which is not dependent on context (unlike some).
In these experiments, evidence for a context-based effect of scalar inferences takes the form of an interaction, such that there is a context effect on reading times for the rest in some passages but not in only some passages.
3) a. Yousef asked Fatima whether all of the students had passed the test. Fatima said that only some of them had. She added that the rest were planning to retake the class.
b. Yousef asked Fatima whether any of the students had passed the test. Fatima said that only some of them had. She added that the rest were planning to retake the class.
The purpose of such only some control conditions is to attempt to rule out the possibility that a context effect (i.e., different reading times for some or for the rest in all vs. any contexts, or in upward vs. downward entailing contexts) is due to general effects of the context itself, rather than specifically to the context's effects on scalar implicatures. While this type of control may still not fully eliminate potential confounding differences between the contexts (Barbet & Thierry, 2016), it at least allows for a stronger argument that the observed effects are based on scalar implicatures, compared to the argument that could be made without such a control. Thus, while Lewis (2013) provides useful prior information about which reading time measures we might predict to show the effect of interest, it remains an open question whether the facilitation effect at the rest in this particular study can be attributed to scalar inferences. The present study, therefore, aims to examine which reading time measures are modulated by scalar inference processing in the full factorial research design used by the majority of other experiments in this area. A secondary goal of the present study is to examine whether context influences the reading time for some itself. There is a longstanding debate over whether scalar inferences are realized rapidly and effortlessly, or slowly and with a cognitive cost (see Chemla & Singh, 2014, among others, for review). Under the latter hypothesis, some of them is expected to be read more slowly in the context that strongly supports scalar inferences (1a), given that such a context requires readers to realize a more specific interpretation of this expression. Under the former hypothesis, on the other hand, some of them is not expected to be read more slowly in this context. Empirical results regarding this question are mixed. 
Three studies have observed such a pattern (see Table 1), but these results are also subject to alternative explanations. The finding from Breheny and colleagues (2006) was probably due to a repeated name penalty evoked by their stimuli, as has been argued previously (Hartshorne & Snedeker, ms.; Lewis, 2013). Regarding the finding from Bergen and Grodner (2012), Lewis (2013) has suggested that it may be due to having to infer a relevant set, rather than to realizing a scalar inference per se. Many stimuli were of the form "This morning, I took attendance at an important meeting with the manager. Some of the company's accountants were there." Thus, the referent in question was not explicitly identified earlier in the discourse and the process of connecting the referent to the discourse may have been costly. Finally, in Lauter (2013), the scalar inference was made explicit by orthographically stressing the quantifier (i.e., SOME of them as opposed to some of them); thus, the longer reading times may have been due to orthography rather than due to the cost of making a scalar inference. Overall, even if these three results are taken at face value, the state of the field is still such that evidence is mixed regarding whether or not realizing a scalar inference elicits an immediate processing cost. Therefore, in addition to providing fine-grained detail about the reading time effects at the rest, the present study will also provide additional data regarding whether or not this context manipulation elicits a reading time slowdown at some itself, in a design that allows a potentially more direct comparison than these previous studies did.

2a. Participants
Data were collected from 51 native English speakers at the University of Oxford and the Oxford community. Data from three participants who frequently dozed off during the experiment were excluded from analysis, as were data from one participant who reported having mild dyslexia, leaving a total of 47 participants (35 women, 12 men; age 18-55, mean age 24.1) in the analysis. Individual demographic information for the participants is available in Supplementary File 1. All participants provided their informed consent and were paid for their participation. Methods were approved by the Central University Research Ethics Committee at the University of Oxford.

2b. Materials
The present study used a manipulation of entailment polarity (2a-b), following Hartshorne and colleagues (Hartshorne et al., 2015; Hartshorne & Snedeker, ms.), rather than a manipulation of information structure (1a-b). The reason for this was that this manipulation allows for single-sentence stimuli, rather than multi-sentence stimuli like those shown in (1, 3); this made it easier to mix these stimuli with single-sentence stimuli from other experiments within the same recording session.
The critical stimuli (listed in Supplementary File 2) were 48 sentences adapted from Hartshorne and Snedeker (ms., Experiment 1), following the template shown below. "^" indicates where the sentence was segmented into regions of interest; this character was not actually shown in the experiment.
a) Upward-entailing, some: Isabella recommended^ some of^ the applicants^ to the hiring director,^ and the rest^ didn't meet her criteria.
b) Downward-entailing, some: If^ Isabella recommended^ some of^ the applicants^ to the hiring director,^ then the rest^ didn't meet her criteria.
c) Upward-entailing, only some: Isabella recommended^ only some of^ the applicants^ to the hiring director,^ and the rest^ didn't meet her criteria.
d) Downward-entailing, only some: If^ Isabella recommended^ only some of^ the applicants^ to the hiring director,^ then the rest^ didn't meet her criteria.
As described in the Introduction, the rest is predicted to be read more slowly in the downward-entailing than the upward-entailing condition, but mainly in some sentences, not only some sentences. A reading time slowdown on the rest can only be attributed to an enriched interpretation of some if it appears only in some sentences, or is greater in some than in only some sentences. On the other hand, if both kinds of sentences show similar reading time slowdowns on the rest in the downward entailment condition, then that might be occurring just because the conditional itself causes sustained processing cost over the rest of the sentence, for whatever reason. Downward-entailing clauses were used to provide a context in which scalar implicatures are less supported. It should be noted that there is disagreement on how scalar implicatures could be derived in such contexts. Under a purely pragmatic account, a "not all" inference in this context cannot be derived via Gricean conversational implicatures (because the sentence with the enriched "not all" reading would be less, rather than more, informative than the sentence without it); it can be realized via other routes, however (Geurts & van Tiel, 2013). A "not all" reading realized in the downward-entailing clause is presumably not an implicature at all, given that it may be derived by other mechanisms. Nonetheless, it seems to be an empirical fact that, for whatever reason, the "not all" reading is less supported or less available in this context than in the upward entailing context, all else being equal. The primary goal of the study is to examine how downstream reading times on the rest are affected when the preceding scalar implicature was more or less available (under the assumption that in downward-entailing contexts the scalar implicature is less available or completely unavailable), and thus this manipulation was considered appropriate for that purpose.
The factors Quantifier (some vs. only some) and Entailment (upward-entailing ["clause 1, and clause 2"] vs. downward-entailing ["If clause 1, then clause 2"]) were factorially manipulated to yield four conditions. And/then and the rest were combined into a single region since the connective was short and frequently skipped (on 51% of trials, in a preliminary analysis of the data from 35 participants, no fixation occurred on and/then the first time it was passed), and because when viewing the connective the reader was likely also able to get a parafoveal preview of the critical region. 3 There were also 83 filler stimuli, including 48 items with the same structure and manipulations of Quantifier and Entailment as the critical sentences but not including the rest, and 35 items using other quantifiers in the place of some or only some: nine each of all and none in upward-entailing contexts, nine of all in downward-entailing contexts, and eight of none in downward-entailing contexts. These fillers served to make sure that the rest and some or only some were not completely predictable in their respective positions, and to establish a contrast between relevant quantifiers in the experimental context. The session also included 48 items from a separate experiment on semantic enrichment and 104 items from an experiment on morphosyntactic prediction; none of these items included if-then constructions, the rest, or the quantifiers used in this experiment.

2c. Procedure
The experiment was conducted on an Eyelink 1000 system with a chin rest. Before the beginning of the experiment, and during the experiment whenever necessary, the participant completed a nine-point calibration. Viewing was binocular, but only the right eye was tracked (except for one participant, for whom the left eye was tracked because the right eye was not tracked well by the system). Each trial began with a drift correction, during which the participant had to fixate on a dot (located at the left boundary of where the sentence would appear) before the trial proceeded. All sentences fit on a single line on the screen, and were presented in black Courier New text against a light gray background. To finish reading the sentence and reveal the comprehension question, the participant had to fixate a dark gray box in the upper-right corner of the monitor.
Each sentence was followed by a comprehension question, which appeared after the participant fixated the gray box. The question was presented along with two possible choices and the participant made their response using the arrow keys on the keyboard. The experiment began with 6 practice trials to acclimate the participant to the procedure, after which the 283 remaining items (critical items and fillers from this experiment, as well as items from the two other experiments) were presented in a fully random order, divided into eight blocks with self-paced breaks in between. The stimuli were organized into 24 lists according to a Latin Square design, such that each participant saw 48 critical sentences (12 per condition). 4 Overall the experiment session lasted from 50 to 80 minutes, including setup and debriefing.
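The Latin Square assignment described above can be sketched as a simple rotation of conditions over items. This is only an illustration under stated assumptions: it builds four lists (one per condition rotation), whereas the study used 24 lists for reasons not detailed in this section; the condition labels and the function name `latin_square_lists` are invented here.

```python
from collections import Counter

# Hypothetical labels for the four Quantifier x Entailment conditions.
CONDITIONS = ["UE-some", "DE-some", "UE-only-some", "DE-only-some"]

def latin_square_lists(n_items, conditions):
    """Return one dict per list, mapping item index to its condition.

    Item i appears in condition (i + list_index) mod n, so every list
    contains each item exactly once and the conditions are balanced.
    """
    n = len(conditions)
    return [
        {item: conditions[(item + list_idx) % n] for item in range(n_items)}
        for list_idx in range(n)
    ]

lists = latin_square_lists(48, CONDITIONS)

# Each list has 48 critical items, 12 per condition.
per_list_counts = [Counter(lst.values()) for lst in lists]

# Across lists, every item is seen in all four conditions.
conditions_per_item = [{lst[item] for lst in lists} for item in range(48)]

print(per_list_counts[0])
```

With 48 items and four conditions, each participant assigned to one such list reads 12 sentences per condition, matching the counts reported above.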

2d. Eye movement measures
Data were cleaned in four steps (SR Research, 2014): first, fixations of 80 ms or shorter were merged into a neighboring fixation of greater than 80 ms within 0.5 degrees horizontally (if both the preceding and following fixation were longer than 80 ms, the short fixation was merged into the longer of the two); second, the same process was repeated with a duration threshold of 40 ms and a distance threshold of 1.25 degrees; third, in interest areas that had at least three fixations of 140 ms or shorter and none longer than 140 ms, the short fixations were merged into one; and last, remaining fixations shorter than 140 ms or longer than 800 ms were deleted. These values were based on the defaults in the Eyelink Data Viewer program.
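To make the cleaning logic concrete, here is a simplified sketch of the first, second, and last steps only (the third, interest-area-based step is omitted). It is not the Data Viewer implementation: the tuple representation, the function names `merge_short` and `drop_outliers`, the example values, and the assumption that a merged fixation's duration is added to the absorbing fixation are all illustrative choices.

```python
def merge_short(fixations, max_dur, max_dist):
    """One merging pass over (x_deg, duration_ms) tuples in temporal order.

    A fixation of max_dur ms or shorter is merged into an adjacent fixation
    that is longer than max_dur and within max_dist degrees horizontally;
    if both neighbours qualify, the longer one absorbs it.
    """
    result = [list(f) for f in fixations]
    i = 0
    while i < len(result):
        x, dur = result[i]
        if dur <= max_dur:
            candidates = [
                j for j in (i - 1, i + 1)
                if 0 <= j < len(result)
                and result[j][1] > max_dur
                and abs(result[j][0] - x) <= max_dist
            ]
            if candidates:
                target = max(candidates, key=lambda j: result[j][1])
                result[target][1] += dur  # assumption: durations are summed
                del result[i]
                continue  # index i now holds the next fixation
        i += 1
    return [tuple(f) for f in result]

def drop_outliers(fixations, min_dur=140, max_dur=800):
    """Delete fixations shorter than min_dur or longer than max_dur."""
    return [f for f in fixations if min_dur <= f[1] <= max_dur]

# Hypothetical trial: (horizontal position in degrees, duration in ms).
raw = [(1.0, 200), (1.3, 60), (5.0, 30), (5.5, 250), (9.0, 100), (12.0, 900)]
step1 = merge_short(raw, max_dur=80, max_dist=0.5)
step2 = merge_short(step1, max_dur=40, max_dist=1.25)
cleaned = drop_outliers(step2)
print(cleaned)
```

In this invented trial, the 60 ms and 30 ms fixations are absorbed by their close neighbours in the first pass, and the isolated 100 ms and 900 ms fixations are deleted in the last step.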
We analyzed the following eye movement measures, mainly based on Lewis (2013):
• First pass time (also known as gaze duration): The sum of all fixations within a region from when the region was first entered until when the region was exited in either direction.
• Go-past time (also known as regression path duration): The sum of all fixations (including fixations in previous regions) from when the region was first entered until when the region was exited to the right (i.e., until a fixation at a later region was made).
• Selective go-past time (also known as right-bounded time): The sum of all fixations on the region in question until the region was exited to the right. In other words, go-past time without including fixations on previous regions.
• Re-read time: The sum of all fixations on the region in question after it has been exited to the right; in other words, total time minus selective go-past time. (Note that in the literature "re-reading time" is also sometimes used to refer to a different measure, the sum of fixations after exiting the region in either direction, i.e., total time minus first-pass time.) We only included regions with nonzero re-read times in this analysis.
• Total time: The sum of all fixations on the region in question.
• Regressions in: Whether or not the region was refixated after being exited to the right. (While this measure and the regressions-out measure are reported as percentages in the results below, they were treated as binomial variables in the statistical analysis.) This measure was not used in Lewis (2013), but we included it here to account for trials that were not re-fixated (given that it is possible, for example, for a given region to take equal amounts of time to be re-read in two conditions, but to be re-read more often in one condition than another). Furthermore, this was also included to test the possibility, mentioned above, that reading the rest after a less inference-supporting context triggers participants to realize the scalar inference late and, possibly, look back at some more frequently in the process of making this re-interpretation.
• Regressions out: Whether or not other regions to the left were re-fixated after this region was viewed. This measure was not used in Lewis (2013), but it is a relatively commonly analyzed measure.
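The definitions above can be expressed as a short computation over one trial's fixation sequence. The following is a minimal sketch, assuming fixations are given as (region index, duration) pairs in temporal order with regions numbered left to right; the function name `reading_measures` and the example trial are hypothetical. Regressions out is implemented under one reading of the definition (a fixation on an earlier region before the region is exited to the right), which is an assumption.

```python
def reading_measures(fixations, region):
    """Compute reading measures for one region of one trial.

    fixations: list of (region_index, duration_ms) in temporal order.
    Returns None if the region was never fixated or was first reached
    only after a later region had been fixated (i.e., it was skipped).
    """
    first = next((i for i, (r, _) in enumerate(fixations) if r == region), None)
    if first is None or any(r > region for r, _ in fixations[:first]):
        return None

    # First pass: consecutive fixations on the region from first entry.
    first_pass = 0
    for r, dur in fixations[first:]:
        if r != region:
            break
        first_pass += dur

    # Go-past window: first entry up to (not including) the first
    # fixation on any region to the right.
    end = next(
        (i for i in range(first, len(fixations)) if fixations[i][0] > region),
        len(fixations),
    )
    window = fixations[first:end]

    go_past = sum(dur for _, dur in window)
    selective_go_past = sum(dur for r, dur in window if r == region)
    total = sum(dur for r, dur in fixations if r == region)
    reread = total - selective_go_past

    return {
        "first_pass": first_pass,
        "go_past": go_past,
        "selective_go_past": selective_go_past,
        "reread": reread,
        "total": total,
        "regression_in": reread > 0,
        "regression_out": any(r < region for r, _ in window),
    }

# Hypothetical trial with five regions (0-4); region 2 is the target.
trial = [(0, 200), (1, 220), (2, 180), (1, 150), (2, 210),
         (3, 240), (2, 160), (4, 300)]
measures = reading_measures(trial, region=2)
skipped = reading_measures([(0, 200), (3, 250), (2, 180)], region=2)
print(measures)
```

In the example, the reader leaves region 2 leftward once before moving on, so first pass time (180 ms) is shorter than selective go-past time (390 ms), and the later return from region 3 contributes only to re-read time and total time.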
Trials in which the comprehension question was answered incorrectly were excluded from analysis. Regions for which the first fixation in that trial was not progressive (i.e., regions that were skipped, such that the first incoming saccade [if any] came from a later region rather than an earlier region) were also excluded from analysis.

2e. Statistical analysis
The factors Quantifier and Entailment were each sum-coded (with values of −0.5 for only some and for downward entailment, and 0.5 for some and for upward entailment) so that their coefficients would correspond to main effects. This means that faster reading times in the upward entailing context correspond to negative coefficients for Entailment, and if that pattern is larger in some than only some sentences the Quantifier * Entailment interaction will have a negative coefficient. For linear models, the outcome variables were transformed if necessary (models were calculated with raw, square-root, log, or reflected-reciprocal transformed data, and whichever model had the least skewed residuals was used; the analysis code in Supplementary File 4 shows which transform was ultimately used for which measure) 5 and then z-scored; z-scoring was done so that the coefficients would be in standardized units, making it possible to compare the effect sizes of the terms from different reading time measures. Our interest in standardized effect sizes was due to the fact that the research question was less about whether a significant effect would appear (given that we already expected a particular pattern in the reading times at the rest, based on the previous literature) than about which measures would show the largest effect. Coefficients were estimated with linear mixed-effects models (Baayen, Davidson, & Bates, 2008) with fixed effects of Quantifier, Entailment, and their interaction. By-subject, by-item, and by-list random effects were fitted, including intercepts and slopes for all model terms (Barr, Levy, Scheepers, & Tily, 2013). Analysis code can be seen in Supplementary File 4.
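The sign conventions implied by this coding can be verified on a toy example. The sketch below uses four invented cell means (not data from the study) and ordinary least squares on those means, with no random effects, so it illustrates only the fixed-effects coding and not the mixed-effects models actually fitted. Because the sum-coded columns are mutually orthogonal in a balanced 2 x 2 design, each coefficient reduces to a simple projection.

```python
# Hypothetical z-scored cell means, invented purely for illustration.
cells = [
    # (quantifier, entailment, mean reading time)
    ("some", "upward", -0.30),
    ("some", "downward", 0.20),
    ("only some", "upward", -0.05),
    ("only some", "downward", 0.15),
]

def code(value, positive_level):
    """Sum coding as described in the text: +0.5 or -0.5."""
    return 0.5 if value == positive_level else -0.5

q = [code(quant, "some") for quant, _, _ in cells]
e = [code(ent, "upward") for _, ent, _ in cells]
qe = [qi * ei for qi, ei in zip(q, e)]
y = [rt for _, _, rt in cells]

def projection(x, y):
    """OLS coefficient for a predictor orthogonal to all other columns."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

b_quantifier = projection(q, y)
b_entailment = projection(e, y)
b_interaction = projection(qe, y)

print(b_quantifier, b_entailment, b_interaction)
```

Here the Entailment coefficient equals the upward-minus-downward difference averaged over quantifiers (−0.35), and the interaction coefficient equals the difference between the context effect in some sentences (−0.50) and in only some sentences (−0.20), i.e., −0.30. Both are negative, as the text describes for the predicted pattern.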

Results
The data are available in Supplementary File 3, and the analysis code in Supplementary File 4. Reading measures for each region are shown in Table 2.
Accuracy was high overall (median 95%, range 88-98%, minimum number of correct trials per subject per condition = 8) and we do not analyze it further.

3a. "and/then the rest" region
and/then the rest was read more quickly after some in an upward-entailing context than after some in a downward-entailing context in all reading time measures, as can be seen in Table 2. The same pattern, however, also held for only some, which indicates that this effect is not wholly due to scalar inferences. The focus of the present study (and the reason for including the only some control sentences) was to identify which reading measures showed an interaction such that the facilitation was larger in some sentences than in only some sentences.
Results from the statistical model are shown in Table 3. At this region, the crucial interaction was significant (in a one-tailed test, given that we were only interested in one pattern of interaction) in first pass times and not in other measures; first pass times also had the numerically highest coefficient. This interaction is shown in Figure 1.
The presence of such an interaction, where the context effect in some sentences is significantly larger than that in only some sentences, conceptually replicates the results observed in self-paced reading (Bergen & Grodner, 2012; Hartshorne & Snedeker, ms.) and event-related brain potentials (Hartshorne et al., 2015); the results further suggest that this commonly observed effect may have been driven by first-pass reading processes. 6 As Hartshorne and Snedeker (ms.) propose that the reading time effects on the rest in this paradigm may be modulated by the amount of time readers have between the scalar inference trigger some and this critical region, we also measured the amount of time between readers' first fixation on the quantifier and their first fixation on the critical and/then the rest region. On average this latency was 1813 ms. (For comparison, for the long sentences in Hartshorne & Snedeker, ms., which showed reading time facilitation on the rest, the average time was about 2500 ms. For the short sentences, which did not show facilitation on the rest, it was about 900 ms.)

3b. Quantifier region
As noted above, a secondary aim of the study was to test whether reading some in a context that supports scalar inferences would trigger a processing cost. For this question, we are only interested in measures that correspond to reading times before moving on past the quantifier (first pass time, go-past time, and selective go-past time); later times could be driven by re-reading that happened after the rest was encountered, and thus would not be evidence for a processing cost that occurred when the quantifier was first read. We analyzed both the quantifier region and the following region; results from the statistical model are shown in Table 3. At the quantifier, the interaction effect was negligible for go-past times and selective go-past times, and for first pass times it was negligible and in the opposite of the predicted direction (with a larger context effect on only some than on some). None of these effects was statistically significant. At the region following the quantifier, the interaction effect showed numerical trends in the direction consistent with a processing cost (i.e., slower reading times in upward-entailing than downward-entailing contexts, with this effect larger in some than only some sentences), although this pattern did not reach statistical significance in any of the three measures.
Thus, the present dataset does not provide strong evidence against the hypothesis that scalar inferences are realized effortlessly. This is consistent with Lewis (2013), Hartshorne & Snedeker (ms.), and Politzer-Ahles & Fiorentino (2013), who also did not find significant reading time slowdowns at the quantifier itself (see, however, Bergen & Grodner, 2012). Nonetheless, it is worth noting that at the region following the quantifier, all three measures showed numerical trends towards slower reading times for some in upward-entailing compared to downward-entailing contexts, with smaller or no trends in that direction for only some sentences. On the other hand, in the previous studies that did not find effects at the quantifier, there generally was not even a trend in this direction. Thus, while the data are overall most consistent with the hypothesis that scalar inference does not engender a processing cost at the moment it is realized, they also do not cast serious doubt on the hypothesis that it does (i.e., the study may simply have not had sufficient power to detect such an effect).

Table note (beginning missing): …tion's mean, the two conditions are likely (but not guaranteed) to be significantly different in a mixed effect model. For the Subject region, go-past time and selective go-past times are not shown in upward-entailing contexts since this region was the first region in the sentence (and thus go-past time and selective go-past times are the same as first pass time, except in cases where the participant made regressions to somewhere on the screen outside the sentence). For the final region, re-read times are not shown because they are not possible (except in cases where the participant fixated outside the sentence), and total times are not shown because they are the same as selective go-past times. Note that total time does not equal the sum of selective go-past time and re-read time, because the mean for re-read time does not include observations with re-read times of 0. For regression measures, intervals are not shown since these are binomial data.
To quantify the extent to which the data did or did not support an inference-specific processing cost, we performed a post-hoc analysis using Bayesian mixed models. Unlike frequentist null hypothesis tests, Bayesian models yield a posterior distribution for each parameter, allowing one to make inferences about the likely values of parameters in question. Models, with the same terms as in the analysis above, were fit using the {brms} package in R (Bürkner, in press). The prior for all fixed effects was a normal distribution with a mean of 0 and standard deviation of 0.1, such that most of the prior lay within 0.3 standardized units of zero. The prior and posterior distributions for the Quantifier*Entailment interaction, for first pass, go-past, and selective go-past times, are shown in Figure 2, along with 95% credible intervals for the interaction coefficient and Bayes factors, which quantify how much the data changed one's confidence in a hypothesis (see, e.g., Wagenmakers, 2007, among others; cf. van der Linden & Chryst, 2017). As can be seen in the figure, while the posterior distributions for the interaction are all slightly positive, they are not extremely so; only 60% of the distribution is positive for first-pass time, 67% for go-past time, and 68% for selective go-past time. Given that perfect certainty in the sign of the interaction would correspond to having 100% of the posterior on one side of zero, and perfect uncertainty would correspond to 50%, this suggests that we cannot be very certain about the sign of the interaction (we are closer to being perfectly uncertain than perfectly certain); in other words, the evidence in favor of an interaction effect is not strong.
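The two quantitative claims in this paragraph can be checked with a short Python sketch. It computes how much of a Normal(0, 0.1) prior lies within 0.3 units of zero, and it rescales the reported posterior proportions so that 0 corresponds to perfect uncertainty about the sign (50% positive) and 1 to perfect certainty (100%). The function names and the rescaling are illustrative assumptions introduced here; they are not part of the original analysis, and the short list of draws at the end is invented solely to show how a proportion of positive posterior draws is obtained.

```python
import math
from typing import Dict, Sequence


def normal_mass_within(bound: float, sd: float) -> float:
    """P(|X| < bound) for X ~ Normal(0, sd)."""
    return math.erf(bound / (sd * math.sqrt(2)))


def proportion_positive(draws: Sequence[float]) -> float:
    """Share of posterior draws that fall above zero."""
    return sum(d > 0 for d in draws) / len(draws)


def sign_certainty(p_positive: float) -> float:
    """Illustrative rescaling: 0.5 maps to 0 and 1.0 (or 0.0) maps to 1."""
    return abs(2 * p_positive - 1)


# Prior from the text: mean 0, SD 0.1, so 0.3 is three standard deviations.
prior_mass = normal_mass_within(0.3, 0.1)  # roughly 0.997

# Proportions of the posterior above zero, as reported in the text.
reported: Dict[str, float] = {
    "first pass": 0.60,
    "go-past": 0.67,
    "selective go-past": 0.68,
}
certainty = {measure: sign_certainty(p) for measure, p in reported.items()}

# Invented draws, only to demonstrate proportion_positive.
toy_draws = [-0.10, 0.20, 0.30, -0.05, 0.10]
toy_share = proportion_positive(toy_draws)  # 3 of 5 draws are positive
```

On this rescaled measure all three values fall below 0.5, which is one way of expressing the statement that the posteriors are closer to perfect uncertainty than to perfect certainty about the sign of the interaction.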
Likewise, the Bayes factors (the ratios of the height of the posterior distribution at a particular point hypothesis, to the height of the prior distribution at that point) for the hypothesis of a zero effect are all close to 1, indicating that the new data has only negligibly changed the confidence in this hypothesis. (Commonly, Bayes factors above 3 (or below 1/3) are taken as indicating that the data have substantially increased (or decreased) confidence in a hypothesis.) Thus, while the eye movements at this spillover region do show a numerical pattern in the direction that would be expected if scalar inferences elicited an immediate processing cost, overall we conclude that there is little evidence that such an effect exists in the population.

… Figure 1D). The x-axis represents the context effect (reading times in upward-entailing contexts minus reading times in downward-entailing contexts) in only some sentences, and the y-axis represents this context effect in some sentences. Thus, points in the negative range represent subjects/items for whom and/then the rest had faster first-pass reading times in upward-entailing than downward-entailing contexts. Most importantly, points below the diagonal line represent subjects/items for whom this context effect was larger in some sentences than in only some sentences (i.e., the predicted interaction pattern). The red and blue error bars indicate the 95% confidence intervals for the mean of the subject-wise differences and the mean of the item-wise differences, respectively. (Panel B) A more traditional visualization of the same interaction pattern, using connected points to show each subject's or item's context effect in some and only some sentences. Within each condition, the error bar on the left side is the 95% confidence interval of the subject-wise differences, and the error bar on the right side is the 95% confidence interval of the item-wise differences. (Because this is a repeated-measures comparison, these confidence intervals can be compared against zero to evaluate whether each simple effect is significant, but they cannot be compared against one another to evaluate whether the simple effects are different; Baguley, 2012; Loftus & Masson, 1994).

Figure 2:
Posterior distributions for the critical interaction effect on first pass times, go-past times, and selective go-past times at the region following the quantifier. Red shaded regions represent the 95% credible interval of the coefficient. The solid black curve represents the posterior distribution and the dashed green curve the prior distribution; the ratio of these two distributions' densities at 0 is the Bayes factor.
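The density-ratio definition of the Bayes factor given above can be illustrated with a short Python sketch. It assumes, purely for illustration, that the posterior for the interaction coefficient is approximately normal; the posterior mean and standard deviation used below are hypothetical values, not estimates from the present data. Only the Normal(0, 0.1) prior is taken from the text.

```python
import math


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def density_ratio_bf(
    post_mean: float,
    post_sd: float,
    prior_mean: float = 0.0,
    prior_sd: float = 0.1,
    point: float = 0.0,
) -> float:
    """Bayes factor for a point hypothesis as posterior/prior density there.

    Values above 1 mean the data increased confidence in the point
    hypothesis; values below 1 mean the data decreased it.
    """
    posterior_height = normal_pdf(point, post_mean, post_sd)
    prior_height = normal_pdf(point, prior_mean, prior_sd)
    return posterior_height / prior_height


# Hypothetical posterior: slightly positive, a little narrower than the prior.
bf_null = density_ratio_bf(post_mean=0.02, post_sd=0.08)

# Sanity check: if the data leave the prior unchanged, the ratio is exactly 1.
bf_unchanged = density_ratio_bf(post_mean=0.0, post_sd=0.1)
```

With a posterior that has moved only slightly away from the prior, as in this hypothetical case, the ratio stays well inside the conventional 1/3 to 3 range, which corresponds to the situation described in the text where the data have barely changed confidence in the zero-effect hypothesis.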

Discussion
The present study used eye-tracking while reading to identify the locus of reading time facilitation effects that have commonly been observed downstream of a scalar expression. We replicated the observation of faster reading times for the rest after some appeared in an upward-entailing context that is more supportive of scalar inferences, compared to when it appeared in a downward-entailing context that is less supportive. Crucially, this pattern was larger in some sentences, where the interpretation of some is subject to pragmatic context effects, than in only some sentences, where the interpretation is semantically fixed. This interaction provides evidence that at least part of the reading time facilitation for the rest in the upward entailment condition is due to an increased rate of scalar inference realization, rather than just due to declaratives being overall easier to process than conditionals or to other general differences between the upward-entailing and downward-entailing conditions. This is, to our knowledge, the first study to replicate this pattern of results with eye movement measures. Furthermore, the results suggest that this pattern is driven by early eye movement measures (first pass time). Finally, we failed to find strong evidence for a slowdown related to inference-making at the quantifier itself.
The observation that the reading time facilitation is driven by early rather than late reading measures is potentially informative for explanations of the computational locus of this effect. As noted in the introduction, this effect could be explained by early prediction (or facilitation of lexical access) or by late integration, or even by the assumption that encountering the rest after the downward-entailing context triggers an enriched interpretation to be realized late.
The present results suggest that the effect is likely to be driven by early processes; it is possible that this is related to prediction, although there may be other candidate explanations as well. Such a pattern would indicate that processing measures at the rest in this sort of paradigm do not directly reflect scalar inference-making or meaning enrichment per se, but its downstream consequences (i.e., predictions of upcoming words based on a different interpretation of the scalar expression some). This has also been argued to be the case in event-related potential experiments that use brain responses to downstream words to make indirect inferences about the processing of some, rather than directly measuring the response to some itself (Hunt, Politzer-Ahles, Gibson, Minai, & Fiorentino, 2013;Nieuwland, Ditman, & Kuperberg, 2010;Noveck & Posada, 2003). It should be noted, however, that there was a significant interaction on regressions to the quantifier, such that there were more regressions to some in downward-entailing contexts than in upward-entailing contexts (see Tables 2 and 3); this may be consistent with the conjecture that seeing the rest causes participants to look back at the quantifier and re-evaluate its interpretation (perhaps by enriching the meaning with "not all"). However, further study (or re-analysis of the present dataset) is necessary to confirm whether these additional regressions to the quantifier are triggered by seeing the rest, as opposed to coming from other parts of the sentence.
A limitation of this conclusion is that the link between various eye movement measures and various cognitive processes is not completely clear (Clifton et al., 2007;Boland, 2004), especially for a topic like conversational implicatures, which has received substantially less attention in eye-tracking research than topics like lexical access and syntactic ambiguity resolution. Therefore, our assumption that early processing measures like first pass time are likely to reflect processes like prediction, and that late measures are more likely to reflect integrative processes, is not uncontroversial. There is some evidence that some discourse-level processing may affect only late reading measures and not early reading measures (e.g., Boland & Blodgett, 2001), or that they may affect both late (including spillover) and early measures, whereas more lexical processing may be mostly limited to early measures (Clifton et al., 2007;Staub, 2015). There is substantial variability, however, in which measures are implicated across various studies, and many studies operationalize prediction, integration, lexical processing, discourse processing, etc., in different ways. There is also still general debate regarding how quickly various processes occur in comprehension, not just in the eye movement literature but also in many other psycholinguistic methods. Thus, while the present study provides evidence that the facilitation from scalar inferences on the comprehension of the rest happens in early reading measures and presumably early cognitive processes, it is difficult to say precisely which cognitive processes these are.
The failure to find a significant processing cost at the quantifier itself adds an additional piece of evidence to the currently equivocal literature regarding this question. While many reviews assume that there is convincing evidence that scalar inferences are delayed and/or elicit processing costs (e.g., Chemla & Singh, 2014, among others), the vast majority of studies supporting this claim are those based on end-of-sentence judgments. Many of these are unable to distinguish whether the observed processing costs are directly due to the process of realizing a scalar inference itself, or to subsequent processes (such as ambiguity resolution, or evaluating the inference-derived interpretation of the sentence relative to the context); see, e.g., Bott and colleagues (2012), Chemla and Singh (2014), and Politzer-Ahles and Fiorentino (2013) for review of these issues. When it comes to studies measuring processing costs on-line at the moment the quantifier is read or heard, the results are fairly equivocal. As noted in the introduction, three of the seven extant experiments using this paradigm have observed processing costs at the quantifier, although there may be alternative explanations for each of these effects. It is also possible that the presence or absence of processing cost is moderated by other experimental factors; for instance, there may be reasons why Bergen and Grodner's (2012) manipulation of speaker's epistemic state would elicit measurable processing costs that were not observed in the other experiments manipulating information structure or entailment polarity. In addition to these, Politzer-Ahles and Gwilliams (2015), using magnetoencephalography to measure neural responses in a paradigm very similar to the present one, found the opposite pattern of processing cost at the quantifier, with greater neural activity elicited in the context that is less supportive of scalar inferences.
They argued that, rather than an across-the-board cost for making scalar inferences, which is either wholly present or wholly absent, there may rather be a gradient processing cost that can be reduced as a function of context. Studies using other paradigms have also provided little evidence for an across-the-board processing cost. Barbet and Thierry (2016), using a single-word oddball paradigm, did not find significant evidence from event-related brain potentials that the inference-based interpretation of some was more costly to compute, and an unpublished experiment by Politzer-Ahles (ms.), also using a single-word paradigm, failed to find any robust differences between the processing of some in a context that required a scalar inference versus a context that did not. Overall, then, the present dataset joins several previous ones in suggesting a parser in which scalar inferences are context-sensitive but not necessarily costly, which is in line with gradient constraint-based proposals of inference processing (e.g., Degen & Tanenhaus, 2015), among others. Nonetheless, there are several potential alternative explanations for the failure to observe an immediate processing cost at the quantifier. In addition to the possibility that there simply is not such a processing cost, it is also possible that there is a processing cost but eye movements are not sensitive to it (e.g., because the costs do not immediately influence the planning and control of eye movements) or that the costs do not occur immediately at the quantifier but rather unfold gradually over the course of the sentence (but, crucially, before the rest is read).
Overall, the present study provides a conceptual replication of the observation that context influences the realization of scalar inferences and subsequent processing of a related downstream expression, extending this paradigm into the eye-tracking method. Furthermore, it sheds light on the locus of this effect by revealing that the downstream facilitation may be driven by early reading processes rather than late reading processes. There are many open questions remaining about this effect, such as whether it generalizes to other populations of readers and to other types of scalar expressions and context manipulations.

Data Accessibility Statement
All the stimuli, participant demographic information, analysis scripts, and pre-processed reading measures are included in the Supplementary Materials in the online version of this article. Raw eye movement data can be found on this paper's project page on the Open Science Framework (https://osf.io/wrzjq/).

… statistical significance or which one fit our hypotheses the best; the only piece of information used to choose the model was the skewness of the residuals. As the data and analysis code are available in Supplementary Files 3 and 4, the data can easily be re-analyzed using different transforms.

6 An anonymous reviewer raised the concern that participants reading so many sentences with some in a conditional followed by a context that disambiguates to a "not all" reading might behave unnaturally. Thus, we also did an exploratory examination of reading times on only the first trial of this experiment for each participant. When looking at only the first trials, there is a large difference between first-pass times on the rest preceded by some in downward-entailing contexts (798 ms, SE 141 ms) and those preceded by some in upward-entailing contexts (688 ms, SE 129 ms), whereas the difference at the rest is smaller for sentences with only some (downward entailment: 757 ms, SE 110 ms; upward entailment: 729 ms, SE 99 ms). This pattern is numerically consistent with the pattern found in the original analysis (although the overall reading times are substantially slower, as is to be expected early in the experiment before participants have gotten used to the stimuli).