Degree of incrementality is modulated by experimental context – ERP evidence from German quantifier restriction

ABSTRACT The current ERP study investigates the role of non-linguistic context in incremental semantic comprehension. Using picture-sentence verification, we examined the neurophysiological correlates of contextual adaptation effects. We manipulated experiment-inherent frequency and tested whether a relatively high ratio of experimental to filler sentences constrains the contextual restriction of the German quantifier alle (“all”). While previous results indicate that the truth evaluation may be postponed when the experimental setting is completely ambiguous with respect to a potentially following restriction, the current study used a higher ratio of non-restricted vs. restricted sentences than the previous one (4:1 vs. 1:1) and thereby tested whether a low restriction probability increases the likelihood of an immediate truth evaluation. Our results show that experiment-inherent frequency distributions immediately modulate the N400 amplitude, analogous to previous studies on speaker reliability.


Introduction
Language understanding involves rapidly combining information types from various sources. A case in point is sentence comprehension, where structure building and meaning assignment may be determined by linguistic and non-linguistic contextual cues. In linguistic theory, contextual information is essential to what is called utterance meaning (e.g. Searle, 1978;Weigand, 1993): the context may include information about the preceding discourse or the utterance situation and often determines the interpretation of deictic (e.g. you, then or there), referential (e.g. definite descriptions like the waitress) or anaphoric expressions (e.g. pronouns like he, she or they that refer back to previously introduced discourse referents). Besides such discourserelated effects, contextual cues may also contain information that is more indirectly related to linguistic theory and the linguistic processing proper, for example, world knowledge, plausibility, and also completely non-linguistic information, such as information about the speaker or about the environment and the frequency with which particular constructions are encountered (Fine et al., 2013;Hagoort & Van Berkum, 2007;Van Berkum et al., 2008). The current study examines the nature of such non-linguistic frequency effects on compositional-semantic processing.
One way in which contextual cues quite generally affect the interpretation of an utterance is by determining what its parts can denote: In certain contexts, the possible denotations of an expression may be more restricted than in others. As an illustration, consider a situation at the counter of a fast food restaurant in which you are asked: Would you like something to drink? In that kind of situation, the set of options to choose from is usually rather limited. The knowledge about restrictions like these comes from world knowledge, e.g. in the present case, by considering the fact that there is usually not so much choice concerning the drinks in fast food locations, and it is rather implausible to find drinks like homemade lemonade in fast food places. A strikingly different situation would arise when being asked the same question in the hippest burger place in town. The set of potential options would presumably be much less restricted, and you might want to acquire further information, if it is not provided right away, e.g.: …from our selection of local craft beers and homemade lemonades.
While a majority of these recent studies indicate that immediate effects of contextual cues on the processing of semantic and pragmatic information are principally possible, relatively little is known about the effects of non-linguistic context information. The current ERP study aims to add to these recent approaches by examining non-linguistic context effects on the on-line compositional-semantic interpretation of sentences containing the quantifier alle ("all") in German. Quantifiers like all, none, some, etc., express quantitative relations between properties and relations mentioned in a sentence. One theoretically challenging characteristic of quantifiers is that their semantic interpretation is heavily context-dependent (e.g. Geurts & Van der Sandt, 1999). For instance, in the fast food context introduced above, the phrase all lemonades would be contextually restricted to the available options in the specific restaurant where they are offered. Whereas previous studies have shown that linguistically-related contextual cues may indeed affect the online processing of quantified sentences (see Augurzky et al., 2017;Freunberger & Nieuwland, 2016;Hunt et al., 2013;Nieuwland, 2016;Politzer-Ahles et al., 2013;Spychalska et al., 2016; but see also Augurzky, Schlotterbeck, et al., 2020;Urbach et al., 2015, for mixed evidence), the present study examines how non-linguistic context effects during online comprehension interact with such restrictive processes associated with the universal quantifier alle ("all"). Specifically, we tested whether the probability of a potential meaning shift that may be induced by delayed restrictive cues occurring at a later position in a sentence affects the incremental interpretation of the universal quantifier.
Therefore, our main aim is to test whether expectations affect compositional semantic processes. More specifically, we investigate whether non-linguistic information modulates the expectation of whether an incomplete utterance will be continued or not. For this reason, the distinction between linguistic and non-linguistic context effects is crucial for defining our present purpose. There are context effects that are quite uncontroversially considered to be linguistic in nature like, e.g. effects due to met vs. unmet presuppositions (Tiemann et al., 2011). Similarly, there are other effects that are clearly non-linguistic, e.g. effects of background noise on processing difficulty (Wendt et al., 2016). However, further examples of context effects are less obviously classifiable as linguistic or non-linguistic. One way of dealing with these latter cases could be to consider specific properties of the context and determine whether their effects are in some way encoded in the linguistic representation derived from the linguistic input during processing. A similar stance has been taken by Chambers et al. (2004), who distinguished between linguistic and non-linguistic context effects in a visual world eye-tracking study. In particular, they argue that context effects on language understanding that are mediated through lexical representations that encode certain selectional restrictions (i.e. restrictions on properties of suitable arguments of verbs) can be viewed as linguistic. By contrast, contextual effects are viewed as non-linguistic if the relevant contextual properties are not encoded in the linguistic representation. For instance, a verb like pour selects for arguments that denote liquid entities, whereas a verb like put imposes only weak restriction on its arguments. As a consequence, the selection of possible referents has to be restricted by contextual properties that are not encoded in the lexical restrictions of a particular verb. When distinguishing between possible referents that are available in a context, both lexically mediated and purely contextual restrictions are accessed, but, according to Chambers et al. (2004), the underlying representation may be different, namely linguistic vs. nonlinguistic. In the case of pour, the decision if a boiled or an unboiled (liquid) egg is considered to be a possible referent is linguistically-driven, as the information about the aggregate state is encoded in the selectional restrictions of the verb. By contrast, the effect associated with put is considered as non-linguistic as the relation between the verb and the possible contextually specified referents is much more arbitrary and therefore not specified in selectional restrictions. Similar distinctions between linguistic and non-linguistic context effects may affect other representational levels, such as representations of complex expressions (e.g. sentence meaning or phrase structure) that are derived through compositional interpretation or related combinatorial processes.
As a matter of fact, we consider the frequency of certain constructions in the context of an experiment (i.e. our present manipulation) as non-linguistic in exactly this way: The linguistic representations that are commonly assumed in syntactic parsing and compositional interpretation do not contain a record of recent encounters with different utterance types. This is also in line with influential computational cognitive models in which it is assumed that aspects like frequency of encounter with some piece of information or its salience are only implicitly encoded in memory representations via their increased activity but usually not themselves part of the stored representations (e.g. spreading activation models or Anderson's, 1990, memory model, implemented in the ACT-R architecture, and its psycholinguistic applications, e.g. Lewis & Vasishth, 2005). By contrast, we would, for example, classify effects of Questions Under Discussion (QUDs; Roberts, 1996Roberts, , 2012 on quantifier processing like reported by Politzer-Ahles and Gwilliams (2015) as being potentially mediated linguistically, because QUDs are part of commonly assumed representations of the discourse context (see Roberts, 1996Roberts, , 2012. One method that is particularly well-suited to demonstrate that contextual cues not only can guide syntactic ambiguity resolution but can also contribute to the online, incremental compositional-semantic meaning assignment during the comprehension process is the ERP method. ERPs have been shown to exhibit a high temporal resolution for capturing semantic and pragmatic effects. In particular, one well-studied ERP component, namely the N400 (Kutas & Hillyard, 1984), has been repeatedly associated with context effects on online sentence comprehension. The N400 is a negative potential with a centro-parietal maximum between 200 and 600 ms post stimulus onset. A wide range of different phenomena has been shown to constitute such context effects on the N400 amplitude, amongst them cloze probability (e.g. Nieuwland, 2016), world knowledge, plausibility, speaker identity, and pragmatic inferencing (Filik & Leuthold, 2008;Grey & van Hell, 2017;Hagoort & Van Berkum, 2007;Kutas & Federmeier, 2011;Nieuwland et al., 2010;Van Berkum et al., 2005).
With regard to non-linguistic context effects, such as effects of frequency, initial evidence suggests that even factors that are not related to the linguistic processing proper at all, like, for instance, the properties of an experiment itself may affect on-line sentence interpretation. Such properties may include details of the experimental setting, paradigm or procedure as well as the composition of the whole set of stimuli that are presented during the course of the experiment. One way these kinds of contextual properties can be viewed is as a kind of microcosm that can be actively controlled to simulate or modulate statistical regularities and pragmatic properties of the environment in which a sentence is processed. For example, it is common practice to include filler trials into psycholinguistic experiments in order to counterbalance necessary imbalances in the experimental design or to distract participants from the properties of the experimental items and the experimental manipulation. Recent studies suggest that fillers affect the experimental results in significant and theoretically relevant ways (e.g. Brothers et al., 2019;Fernandes et al., 2018;Fine et al., 2013;Grodner et al., 2010). These studies show that comprehenders adapt rapidly to changing frequencies in the language input, even during the course of an experiment, to facilitate expectation-based language processing. Consequently, such adaptations (or lack thereof) can affect the interpretation of the experimental results and they should thus be taken into account explicitly.
So far, only few sentence-level ERP studies have investigated the neurophysiological correlates of such adaptation processes. For instance, Brothers et al. (2019), examined whether statistical regularities in an experiment may have an effect on predictive sentence comprehension and may thus constrain the amplitude of the N400. In particular, by manipulating speaker reliability, the authors investigated cue validity effects on predictive sentence comprehension. To this end, participants were listening to sentences that were spoken by either a male or a female voice. Experimental sentences differed with respect to the predictability of specific target words (e.g. Eric sued the taxi driver and took him to court vs. Eric picked up his friend and took him to court) as determined by a cloze probability task. Both of the speakers produced an equal amount of predictable and unpredictable experimental target sentences. Crucially, the experiment also involved filler sentences. For one of the speakers, the reliable speaker, these fillers only involved highly predictable continuations, whereas for the other speaker, the unreliable speaker, the fillers only involved highly unpredictable continuations. By hypothesis, the authors expected a larger N400 predictability effect when sentences were uttered by a reliable speaker if statistical regularities are considered a valid cue for constraining predictive strategies. These expectancies were borne out by the data. For both speakers, unexpected lexical items elicited the larger N400, but, as revealed by difference waves, the amplitude difference between expected and unexpected items was significantly decreased in the unreliable speaker condition. This experiment thus shows that statistical regularities in an experimental context may lead to quick adaptation effects. According to the authors, the results can be taken as evidence for an active prediction account, according to which the amount of anticipatory processing a comprehender engages in depends on the reliability of contextual cues. These results thus indicate that predictions are not triggered automatically by the preceding sentence context. Instead, participants strategically adapt their use of predictive processing to the experimental context. If the validity of predictive cues is low in the experimental context, predictive processing is used to a lesser degree. Finally, comparable adaptation effects were also found in a reading time study by Brothers et al. (2017). Taken together, these findings offer an initial evidence that non-linguistic statistical cues may cause immediate adaptation towards the statistical regularities in the environment.
Further evidence for immediate adaptation effects comes from three visual-world eye-tracking studies that consider the effects of alternatives in the processing of scalar implicatures. Grodner et al. (2010) recorded participants' eye movements while they were listening to sentences like Click on the girl with summa the socks in a context showing multiple potential referents. These sentences contain a prosodically reduced form (i.e. summa) of the scalar expression some of, whose lexical meaning (i.e. some and possibly all) is often enriched to some but not all. This enrichment is often assumed to involve an extra computational step that is associated with increased processing cost. Grodner et al. challenged previous findings by Huang and Snedeker (2009), who used filler items that contained small exact numerals like two or three. These numerals constitute salient lexical alternatives to some, and may thus have led to delays in the interpretation of some because pragmatic enrichment of scalar words like some is assumed to be fundamentally rooted in implicit comparison to lexical alternatives. In contrast to the results of Snedeker (2009), Grodner et al. (2010) found no delay in implicature calculation when fillers containing numerals were removed from the experiment. However, the studies by Grodner et al. (2010) and Huang and Snedeker (2009) differed across two dimensions: first, the use of numerals, and second, the use of prosodically reduced (Grodner et al.) vs. full (Huang & Snedeker) versions of some. A follow-up study by Huang and Snedeker (2018), therefore, investigated which of these factors was responsible for the more immediate effects in the Grodner et al. study and showed that it was indeed the presence vs. absence of numeral fillers that had caused the difference. Huang and Snedeker (2018) propose what they call the "lexical encoding hypothesis" to explain these findings in terms of the predictability of the critical lexical item some as a description of "subsets" (e.g. scenarios in which a girl has two of four socks shown in the display): When only quantifiers like some or all were presented in an experiment, the description of a referent (e.g. the girl with two of the four socks) was highly predictable and, as a result, this referent was identified faster than in an experiment where alternative lexical forms like the numeral two compete with some for referent description. Thus, when only a few alternatives are present, top-down predictions allow listeners to anticipate how the various referents will be described and to use these expectations to identify referents faster once they have processed the quantifier. In their view, such top-down processes may coincide with simultaneous bottom-up processing during online comprehension. In fact, Huang and Snedeker (2018) consider it possible that the frequency of encounter with competing expressions in the context may also affect bottomup pragmatic inferencing in addition to top-down predictions. The reason is that informativity is often considered one main driver of such inferencing (e.g. Frank & Goodman, 2012) and expressions that are more predictable than others as descriptions of some context tend to also be more informative about that context.
In terms of the distinction between linguistic and non-linguistic context effects, one could, arguably, consider effects of sentential filler items as still linguistic in nature. However, determining the set of alternatives that underlie pragmatic enrichment processes and their relative salience has also been shown to involve non-linguistic properties of the context, e.g. purely perceptual ones (see Frank & Goodman, 2012;Qing & Franke, 2015;van Tiel et al., 2021 for discussion and for data-driven approaches to determining salience of alternatives). In contrast, we focus on the question of whether the recent frequency of encounter with certain utterance types, e.g. in the context of an experiment, affects the time course of semantic commitments regarding quantificational restriction the parser is willing to make during online comprehension of quantified sentences.
By investigating frequency effects, our studies are also closely related to studies investigating adaptation effects in syntactic ambiguity resolution. Initial support for rapid adaptation to frequencies in the stimulus set was obtained by Fine et al. (2013), who investigated the online resolution of syntactic ambiguities during the processing of garden-path sentences. Their results indicate that rapid adaptation to the frequency with which certain syntactic structures are presented during the course of an experiment may affect how similar structures are processed in subsequent trials and may thus also affect the conclusions that are drawn from the experimental results. Whether this initial indication of adaption effects in syntactic parsing does withstand further experimental scrutiny is, however, still undecided. In a recent study, Harrington Stack et al. (2018) failed to replicate the adaptation effects of Fine et al. (2013) in experiments using slightly modified designs. Whether, or to which degree, the data of Harrington Stack et al. (2018) challenge conclusions presented by Fine et al. (2013) is currently debated (see the preprint in Jaeger et al., 2019).
Finally, Fernandes et al. (2018) studied the resolution of anaphoric reference and modulated the frequencies of different kinds of antecedents. By comparing two closely related languages, they showed how prior statistical knowledge is integrated with statistical knowledge obtained during the course of the experiment.
In sum, previous studies have shown that contextual adaptations may proceed rapidly, and that, as a consequence, non-linguistic contextual cues may have an effect on on-line sentence comprehension. The current experiment examines whether such effects, namely adaptation to frequency, may also interact with on-line compositional-semantic processing, as evidenced by the processing of quantifier domain restriction related to the German quantifier alle ("all"). Moreover, as data about the neurolinguistic correlates of adaptation effects is still scarce, we carried out an ERP study testing whether such effects immediately constrain the sentences truth evaluation, as evidenced by the amplitude of the N400 component.

Experiment: adaptive revision sensitivity?
The present ERP study investigates the effects of nonlinguistic contextual information on on-line compositional-semantic processing. We examined experiment-inherent adaptation effects on the processing of quantifier restriction in German. To this end, we conducted a follow-up study to a previous experiment . In the current experiment, we changed the frequency in which particular constructions were presented relative to this earlier study. Specifically, we tested whether an increased proportion of experimental items (i.e. short sentences without extra restrictive cues) to filler items (i.e. long sentences with restrictive cues in the form of extraposed relative clauses) could be a salient contextual cue for eliciting incremental effects in sentences that have been shown to be processed in a delayed fashion in Augurzky et al. (2017). In the previous experiment, frequency effects were counterbalanced by presenting a 1:1 ratio of experimental items to fillers, and in the current study, we induced a bias against restriction by means of a 4:1 ratio between the two sentence types. 1 Thus, by comparing the results from the present study with the original experiment, we examined whether the on-line comprehension of sentences involving the quantifier all is affected by statistical regularities in the experimental context.
In Augurzky et al. (2017), the processing of questions like (1) was investigated that were presented in the context of a picture showing coloured objects, e.g. triangles, placed inside or outside a container shape, e.g. a circle. At the beginning of each trial, one of four visual contexts ( Figure 1) was presented, which was followed by a question containing the quantifier alle ("all"). Half of the sentences were followed by an extraposed restrictive relative clause. The other half ended earlier in the sentence, i.e. on the colour adjective.
In that kind of setting, a truth-value judgment is, in principle, possible on the colour adjective blue, but it may have to be revised later on if a further restriction is given in the form of a restrictive relative clause. Augurzky et al. (2017) found evidence for incremental truth evaluation on the colour adjective in the form of a reduced N400 for true vs. false sentences (i.e. questions involving true vs. false answers) 2 . However, this effect was sensitive to the risk of a potential revision, i.e. a change in truth-value: Whenever an extraposed relative clause could lead to a change in truth-value (e.g. when all and only the triangles outside the container shape were blue), the N400 amplitude was in between unambiguously true and false sentences (i.e. questions involving unambiguously true vs. false answers), suggesting a non-incremental truth evaluation for the ambiguous conditions. Augurzky et al. (2017) interpreted this pattern of results as an indication of revision-sensitive incrementality: The processor plans ahead and only commits to an interpretation if there is no substantial risk to revise that interpretation later on.
Our specific motivation for the present study was thus to test whether the N400 effects observed by Augurzky et al. (2017) are sensitive to the overall frequency of restrictive relative clauses presented during the course of the experiment. To test this, we replicated their experiment with a reduced proportion of extraposed relative clauses. In particular, the only difference between the two experimental designs was the number of restricted vs. non-restricted sentences. The present experiment only differed from the original experiment in the relative frequency of sentences with and without extraposed restrictive relatives. 3 In our current study, we again presented one of the four visual contexts in Figure 1, and then a question containing the quantifier alle ("all"). In each picture, geometrical objects, e.g. triangles, were presented inside and outside of a shape containing these objects (e.g. a circle). As in Augurzky et al. (2017), the visual contexts were either simple (B, C), or complex (A, D). In the simple pictures, all geometrical objects were of identical colour, and in the complex pictures, objects inside and outside of the container shape differed in colour and either matched or did not match the colour mentioned in the sentence. Specifically, while the contexts in Figure 1(a) were always presented with the colour adjective blue as in Alle Dreiecke sind blau ("All triangles are blue"), we also included a second item set (Figure 1(b)), in which the sentences were presented with the colour adjective red as in Alle Dreiecke sind rot ("All triangles are red"). Contrary to Augurzky et al. (2017), 80% of all questions ended directly on the colour adjective, and only 20% involved a further restriction. As we were particularly interested in the effects of our frequency manipulation on the processing of the colour adjective across studies, we averaged ERPs from the onset of the colour adjective (e.g. blue) in the current experiment and compared these results with the previous results reported in Augurzky et al. (2017). In Table 1, the complete sentence materials used in the current study are summarised and compared to the reference study of Augurzky et al. (2017).
The comparison between the current study and the previous one by Augurzky et al. allows us to compare the predictions of two theoretical accounts of frequency-driven adaptation effects. First, under a truly revision-sensitive account, one would expect an increasing likelihood of incremental effects in conditions involving a potential change in truth value if the risk of a revision is reduced. In the experiment reported below, we tested this prediction by comparing the current results with those reported in Augurzky et al. (2017). Second, we also considered the predictions derived from an alternative account that is based on predictive processing. In this second alternative, the N400 is not taken as an indication of the commitment to a semantic interpretation but rather of expectations about upcoming lexical items, as was already sketched above.

Hypotheses and predictions
The purpose of the present experiment was to contrast two hypotheses. The first hypothesis, which we call STRICT REVISION SENSITIVITY, is derived from Augurzky et al. (2017). According to this hypothesis, the processor only commits to a semantic interpretation (i.e. a truth value or answer in the context of the present experiment) if it does not run the risk of a later revision to that interpretation. For example, consider the position of the colour adjective in (1). If sentence (1) is processed in the context of a picture that shows only red triangles (e.g. context C: the simple, false condition), the truth value is unambiguously false. Thus, a semantic commitment (i.e. a commitment to a truth value or answer) is safe and can be made without the risk of a later revision. In contrast, revising the interpretation of the sentence may be necessary if the sentence is processed in the context of a picture like context D (complex, match outside) that shows red triangles within a circle but blue ones outside of it. In that case, the local truth value on the colour adjective would again be false. However, if the sentence is continued as shown in (1) where the interpretation of the quantifier is restricted to the triangles outside of the circle, the local truth value would have to be revised from false to true on the colour adjective. Note that STRICT REVISION SENSITIVITY depends on the meaning of the quantifier, the utterance context and also on the possibilities a language offers for delayed restriction (e.g. in the form of an extraposed restrictive relative clause in German), but it is completely independent of frequency information about how likely a later revision in interpretation actually is in a particular context.
By contrast, our second hypothesis, which we call FRE-QUENCY-DRIVEN ADAPTIVITY takes frequency information into account and states that incremental processing is sensitive to how likely revisions in interpretation actually are in the specific context in which a sentence is processed. For instance, according to this view, comprehension processes associated with the processing of the adjective in (1) in visual contexts C (simple, false) vs. A (complex, match inside) and D (complex, match outside) are hardly distinguishable from each other if the probability of encountering further restriction, e.g. in the form of a relative clause, is very low. There are various possible ways of spelling out FREQUENCY-DRIVEN ADAPTIVITY. In the following, we will focus on two of these possible refinements. First, we could think of interpretation processes in terms of probabilistic (in contrast to deterministic) semantic commitments and assume that a lower risk of revision would lead to a higher probability of a local semantic commitment to a truth value at the position of the adjective. We call this first version FREQUENCY-DRIVEN ADAPTIVITY IN JUDGMENTS. Second, we could spell out FREQUENCY-DRIVEN ADAPTIVITY in terms of expectation-based pragmatic processing. In particular, we could draw on the uncontroversial assumption that during language comprehension, expectations for upcoming words are constantly updated based on prior knowledge and the current linguistic input. But how does the meaning of the current linguistic input determine such expectations? One common and straightforward assumption would be that listeners expect true utterances. This assumption can be motivated by basic pragmatic principles (e.g. the Gricean, 1975, maxim of quality or a tendency to communicate about salient aspects of the environment, e.g. Frank & Goodman, 2012). If we do in fact assume that true utterances are expected, or in the case of yes-no questions, questions asking about propositions about the world that are actually the case, we arrive at another version of FREQUENCY-DRIVEN ADAPTIVITY. In that version, words are the more expected, the more 'true' sentence continuations they still allow. We call this second version FREQUENCY-DRIVEN ADAPTIVITY IN EXPEC-TATIONS. These alternative versions of FREQUENCY-DRIVEN ADAPTIVITY are elaborated and related to the results of the current experiment in the discussion section below. To preview one aspect of this discussion, we would like to highlight that the assumption of an expectation for 'true' utterances is somewhat challenging to relate to the processing of questions. This is because questions are, by themselves, not true or false in a given context but rather, usually, ask for information about that context that is unknown but relevant to the speaker. Nevertheless, we would like to maintain that, if we consider listeners' expectations about continuations of questions, continuations that are in some sense related to the context are plausible candidates and that this relatedness between a question and the context is, in turn, in a general sense rooted in propositions that are true of the context. 4 In the discussion section and in Appendix C, we compare several alternatives of how this relatedness between question and context could be spelled-out under FREQUENCY-DRIVEN ADAPTIVITY IN EXPECTATIONS. These alternatives range from expectations due to superficial lexical matching or salience (e.g. blue is expected if blue objects are shown in the context picture) to strictly truthvalue based expectations (i.e. continuations are expected that trigger a "yes, true" response) and we discuss how well these various options explain our experimental results. Taking the importance of truth in deriving expectations about continuations of questions for granted for the moment, the shared prediction of FREQUENCY-DRIVEN ADAPTIVITY IN JUDGMENTS and FRE-QUENCY-DRIVEN ADAPTIVITY IN EXPECTATIONS is that comprehension processes should be sensitive to how likely revisions in semantic interpretation are in the specific context in which a sentence is processed.
In sum, comprehension processes should be unaffected by frequency information under STRICT REVI-SION SENSITIVITY. As a matter of fact, the basic pattern of results of Augurzky et al. (2017), in particular the intermediate N400 amplitude of the complex in comparison to the simple contexts, should replicate in the setting of the present experiment even though the probability of encountering a restrictive relative clause is substantially reduced. By contrast, FREQUENCY-DRIVEN ADAPTIVITY lets us expect that comprehension processes quickly adapt to changing frequencies in the language input. As a consequence, processes associated with reading the colour adjective in sentences like (1) should diverge less strongly between contexts like C (simple, false) and contexts like A (complex, match inside) or D (complex, match outside) in the current as compared to the reference study because continuation that would still lead to a "yes, true" response are less likely in the present experiment. Thus, FREQUENCY-DRIVEN ADAPTIVITY lets us expect that the difference in N400 amplitude between the simple, false and complex conditions should be reduced in the present as compared to the reference study by Augurzky et al. (2017).

Participants
Twenty-four right-handed students from the University of Tübingen took part in the study. They were native speakers of German with normal or corrected-tonormal vision and were paid for their participation. Two of the participants were excluded from the final analyses due to excessive ocular or muscular artefacts, so that 22 participants entered the final analyses (mean age: 23.4; SD = 3.1, years, 9 male).

Materials
A one-factorial within-subjects design was used with the factor CONTEXT (levels: A (complex, match inside), B (simple, true), C (simple, false) and D (complex, match outside)). Pictures were generated as quadruplets via MS PowerPoint. Each picture contained geometrical objects like triangles and a container shape like a circle. Geometrical objects were both inside and outside of this shape. For each quadruplet, the geometrical forms were identical but differed in colour. A total set of 40 quadruplets were generated. Per condition, 40 experimental picture-question pairs were presented, resulting in 160 experimental sentences. Of these experimental sentences, eight sentences per condition were presented with a further restriction, as shown in Table 1. Four of these sentences were presented with the preposition innerhalb ("inside-of"), and four with the preposition außerhalb ("outside-of"). These long sentences were treated as filler items. As a result, 32 short sentences per condition entered the final EEG analyses, resulting in a total of 128 experimental sentences in which ERPs were aggregated. Conditions were evenly spread over 4 blocks. To control for positional effects, two experimental versions were generated. The first block of the first version corresponded to the final block of the second version, the second block of version 1 corresponded to the third block of version 2 and so on.

Procedure
Prior to the experiment, participants received written instructions informing them about the experimental task. According to these instructions, participants had to perform a truth evaluation, i.e. they had to decide whether the preceding sentence truly or falsely reflected the content of the picture. After processing the written instructions, participants were seated in a dimly-lit, sound-shielded booth in front of a 17" computer screen. Stimuli appeared in a pseudo-randomised fashion. The experimental session was divided into four blocks (40 trials per block), with breaks between each of the blocks. At the beginning of a trial, the context picture appeared in the centre of the screen for 1500 ms. After the context picture was removed from the screen, the question was shown in a wordby-word manner via RSVP (500 ms per word). The following question mark or comma was presented as a separate segment (also 500 ms) in order to avoid the predictability of a further restriction. After each sentence, participants performed their truth evaluation: After the final word had disappeared, three question marks were shown, indicating that participants now should answer with "wahr" (true) or "falsch" (false) by pressing one of two buttons ("F" or "J") on the keyboard. The keys for true and false answers were counterbalanced across participants. Participants were encouraged to make their truth evaluations as quickly as possible. The initial above timeout was 1200 ms and was adapted to the participants' response speed by employing an exponentially weighted moving average (Leonhard et al., 2011). When participants' reaction times exceeded the current timeout, they received visual feedback (Schneller!, "faster") on the screen. Following each judgment, a blank screen appeared for 500 ms and three exclamation marks in yellow (1200 ms) signalled that participants now could blink until the next picture was presented. A practice session consisting of eight trials preceded the experimental session. Practice trials were designed analogously to the target items in the experiment but contained different geometrical shapes. Moreover, they contained an equal number of yes and no answers and of short and long sentences. Including electrode application, the experimental session lasted between 1 and 1.5 hours.

EEG recording
The EEG was continuously recorded using a BIOSEMI Active-Two amplifier system. Thirty-two Ag/AgCl electrodes were recorded: FP1, FP2, AF3, AF4, F7, F3, Fz,  F4, F8, FC5, FC1, FC2, FC6, T7, C3, Cz, C4, T8, CP5, CP1,  CP2, CP6, P7, P3, Pz, P4, P8, PO3, PO4, O1, Oz, O2. Six further electrodes were registered: The electrooculogram (EOG, 4 electrodes) was registered by means of electrodes placed at the outer canthus of each eye (horizontal EOG) and above and below the participant's left eye (vertical EOG). In addition, two electrodes were put on the left and right mastoid for off-line referencing. EEG and EOG recordings were sampled at 1024 Hz during recording and downsampled to 256 Hz for data analysis. Off-line, electrode sites were referenced to averaged mastoids. Before entering the ERP analysis, individual participant data were automatically and manually screened for each trial in order to exclude trials with eye movement and muscular artefacts. Data below or above a threshold of 40 microvolts were automatically excluded from the analyses, and muscular and other artefacts that were not covered by this exclusion were manually rejected. The averages were aligned to a 100 ms pre-stimulus baseline. Data per participant and per condition were aggregated from the onset of the colour adjective to 1000 ms post onset. Afterwards, grand averages were calculated over all participants.

Data analysis
To analyse reaction times (RT), a repeated measures analysis of variance (ANOVA) with the within-subject factor CONTEXT (levels: A (complex, match inside), B (simple, true), C (simple, false) and D (complex, match outside)) was computed. To analyse the truth-value judgments, logit mixed-effects models with the same factor as fixed effect and random intercepts for participants and items as well as random slopes for items were sed. 5 ERPs were aggregated from the onset of the colour adjective. EEG data were analysed statistically by carrying out repeated-measures ANOVAs for mean amplitude values per condition within a 300-400 ms time window which was chosen based on previous studies on truth evaluation tasks related to quantifier processing (e.g Augurzky et al., 2017;Augurzky, Schlotterbeck, et al., 2020). Individual conditions are only reported if the omnibus ANOVA revealed significant interactions or in case of significant main effects of the factor CONTEXT. The detailed results of all ANOVAs are provided in the Appendix A.
In the text, we will focus on the most relevant results. Only significant interactions (p<.05) were resolved. Corrected p-values (Huynh & Feldt, 1970) were chosen when the analysis involved more than one degree of freedom in the numerator. P-values were Bonferroni adjusted for the planned comparisons between the four levels of the factor CONTEXT.

Behavioural data
Mean RT and accuracy for the verification task are given in Tables 2(a,b), respectively (within-subjects standard errors were computed using the R function summary-SEwithin from the Rmisc package; Hope, 2013). Mean error rates in the truth evaluation task were 6.8%, reflecting overall good task performance. The logit mixed-effects model analysis revealed a significant effect of CONTEXT (x 2 (3) = 182.06; p , .001). This effect was computed on the basis of a model comparison with a simpler model that did not contain the fixed effect. Bonferroni-corrected pairwise comparisons revealed significant differences between all conditions (all corrected p<.05). While comparisons between binary judgment data that are so close to ceiling may be difficult to interpret, we take the difference between the true simple (B) and all the false (A, C and D) conditions as an indication of an overall bias to respond "false". Such a bias may have been induced by the fact that "false" was the correct response in 70% of the trials. That participants indeed showed a bias to respond "false" in the current experiment was corroborated in an analysis of indices from Signal Detection Theory (SDT; see Wickens, 2001, for an introduction and Huang & Ferreira, 2020 for a recent psycholinguistic application). 6 We also computed d ′ as an index of sensitivity and l center (see e.g. Wickens, 2001, p. 27; same index called c by Huang & Ferreira, 2020) as an index of bias. These two indices were computed separately for the short sentences, the long sentences in combination with complex contexts and the long sentences in combination with simple contexts (see Table 2(d)). The reason for collapsing conditions into groups in this way was that expected responses differed between them correspondingly (see Table 1). As indicated by d ′ values between 2 and 3, sensitivity was good across all three groups of conditions. The long sentences, in combination with simple contexts, led to the lowest sensitivity. In comparison, a simple linear mixed-effects model analysis revealed that the long sentences in combination with complex contexts led to significantly higher and the short sentences to marginally higher sensitivity (t = 3.05; p = .003 and t = 1.93; p = .061, respectively). Concerning bias, positive values of l center indicate a bias to say "no" (in our setting "no" corresponds to the judgment "false"). The long sentences in combination with complex contexts showed no bias towards either response at all, whereas the short sentences and long sentences with simple contexts showed the "false" bias we had conjectured and both differed significantly form the former conditions (t = 7.23; p<.001 and t = 3.01; p = .004, respectively). Coming back to the accuracy data in Table 2(b), the difference between the simple (B, C) and complex (A, D) conditions can be explained if we assume that participants in some trials misremembered the colours shown in the pictures. For the short sentences, which we analysed here, mixing up colours (e.g. blue vs. red) would lead to an incorrect response in the simple but not the complex conditions. RT did not differ reliably between conditions (F(3, 63) = 1.54; p = .21). RT are, however, difficult to interpret in our task because responses could be prepared even before the end of a trial and the onset of RT measurement.  Figure 2 shows ERPs from the onset of the colour adjective. Conditions differed within the early (300-400 ms) N400 time window (cf. Augurzky et al., 2017Augurzky et al., , 2019Augurzky, Hohaus, et al., 2020;Augurzky, Schlotterbeck, et al., 2020). 7 The detailed results of the ANOVA are provided in Appendix A. Statistical analyses revealed a significant main effect of CONTEXT (F(3, 63) = 9.90; p , .001). Complex conditions (A, D) did not differ from each other (p = 1) and from the simple, false condition (C; both p = 1), each of these conditions differed from the simple, true condition B (A vs. B: F(1, 21) = 15.21; p , .01; B vs. D: F(1, 21) = 16.01; p , .01). Moreover, the simple, true and the simple, false conditions also differed from each other (B vs. C: F(1, 21) = 28.77; p , .001). Additionally, an interaction between ROI and CONTEXT was found (F(9, 189) = 4.03; p , .01), showing that the observed main effects were most pronounced in the posterior Rois 3 and 4 (see Appendix A, for the complete statistical analyses). 8 Augurzky et al. (2017) To enable direct comparison with the results from the reference study by Augurzky et al. (2017), we included a plot showing ERPs from the onset of the colour adjective in that study (Figure 3). In the N400 time window, Augurzky et al. (2017) found significant differences between the two complex conditions (contexts A and D) and the simple false condition (context C), on the one hand, as well as the simple true condition (context B), on the other. 9 FREQUENCY-DRIVEN ADAPTIVITY predicted a decreased difference in N400 amplitude between simple false (C) and complex (A & D) conditions in the current experiment as compared to the reference study. This comparison is thus most directly relevant to our present research question. The absence of a significant difference in the present but not the previous experiment that we observed is compatible with FREQUENCY-DRIVEN ADAPTIV-ITY. However, comparison between p-values obtained from two hypothesis tests conducted on two separate data sets is known to be problematic (cf. Wagenmakers, 2007) and does, in itself, not constitute a direct statistical test. For this reason, we carried out a direct statistical test of FREQUENCY-DRIVEN ADAPTIVITY in an analysis comparing effects between the two experiments. 10 To test for the predicted interaction between the factors EXPERIMENT and CONTEXT, we conducted a linear mixed-effects model analysis on the basis of participants' mean voltage in the N400 time window. In this model, fixed effects of CONTEXT and ROI were included as within-participants effects and a fixed effect of the EXPERIMENT was included as a between-participants effect. The main prediction that followed from FRE-QUENCY-DRIVEN ADAPTIVITY was that the difference between the simple, false context (C) and the two complex contexts (A & D) would be smaller in the present experiment than in the original study of Augurzky et al. (2017). We tested this prediction using orthogonal custom contrasts: In the first step, we Figure 2. Grand average ERPs from the onset of colour adjective. As before and throughout, context label A stands for complex, match inside; B stands for simple, true; C for simple, false; and D for complex, match outside. simplified the global model, including all simple effects and interactions, as long as such simplification did not worsen model fit significantly. In the second step, we performed a hypothesis test for the specific interaction between the experiment and a contrast encoding the comparison between the simple, false (C) and complex (A & D) conditions (see Appendix B). Crucially, this interaction was highly significant in the final model (t = 7.18, p<.001) because, as predicted by FREQUENCY-DRIVEN ADAPTIVITY, the difference between the simple, false context (C) and the two complex contexts (A & D) was smaller in the present experiment than in the reference study. Thus, the main prediction of FREQUENCY-DRIVEN ADAPTIVITY is borne out, rendering an alternative explanation of the present findings, e.g. in terms of insufficient statistical power, implausible.

Discussion
The current ERP study examined the neurophysiological correlates of contextual adaptation effects in the on-line comprehension of quantified sentences following visual contexts. On a general level, our main finding is that non-linguistic experiment-inherent frequency distributions can immediately modulate the amplitude of the N400, analogous to previous studies on speaker reliability (Brothers et al., 2019). Our study adds to these previous findings by focusing on compositionalsemantic processing. Based on previous experiments, we tested whether experiment-inherent frequency distributions may exert an immediate effect on processes related to determining the restriction (i.e. the domain of quantification) of sentences containing the German universal quantifier alle ("all"). In contrast to previous studies, we directly compared the current results to another experiment  that was completely identical except for the frequency manipulation. Our results indicate that non-linguistic contextual cues from beyond the sentence boundary immediately affect restrictive processes in quantifier interpretation. The time course of these processes appears to depend on how often further restriction is provided in the experimental context, e.g. in the form of extraposed restrictive relative clauses. Only if this happens relatively infrequently, is the comprehender ready to commit to a semantic interpretation immediately, as reflected in the observed modulation of the N400 amplitude.
In principle, our results seem compatible with recent considerations on the online processing of scalar implicatures. Based on visual-world eye-tracking, Huang and Snedeker (2018) argued that, depending on the context, top-down and bottom-up processes differentially contribute to the interpretation of sentences containing the scalar quantifier some. When filler items like the numeral two are included in an experiment, they may constitute highly salient and even more informative alternatives for referent description than scalar some. As a consequence, the expectancy for two as a description for "subsets" is stronger than for its competitor some in such cases. This results in a local slow-down in the processing of the scalar, and in particular a slow-down in reference resolution, because the scalar cannot simply  Augurzky et al., 2017, baseline-corrected). As before and throughout, context label A stands for complex, match inside; B stands for simple, true; C for simple, false; and D for complex, match outside. be matched to the anticipated referent description. By contrast, in the absence of numerals, some is anticipated as a "subset" description already before the position of the quantifier, and no such slow-down is observed. There are, however, a number of differences between the current study and the study of Huang and Snedeker (2018). Most importantly, we did not investigate lexical alternatives but the effects of larger constituents (namely extraposed restrictive relative clauses) and their frequency. Despite this difference, one could argue that the probability of encountering a particular construction might have affected the relative contribution of top-down vs. bottom-up processing from early on in the sentence. For instance, one could assume that in the current experiment, top-down predictions of an early sentence closure were particularly strong.
The more specific purpose of our study was to decide between two hypotheses that could explain the previous results from Augurzky et al. (2017). In that study, the same picture-sentence pairs as in the current one, albeit with a different proportion of sentences containing restrictive relative clauses, were used to investigate whether and how the truth evaluation of quantified sentences affects the N400. In contrast to the present study, extraposed relatives providing extra restrictive cues appeared in half of the trials. At the position of the colour adjective, a reduced N400 for true vs. false sentences was observed for the simple conditions, where an additional restriction could not trigger a meaning shift in the locally assigned truth value. By contrast, the amplitude for the adjective in complex conditions, which involved such a potential meaning shift, was in between the true and the false simple conditions. The authors interpreted this result as being indicative of a non-incremental truth evaluation for ambiguous conditions and proposed that it can be explained by an account based on revision-sensitive interpretation.
While Augurzky et al. (2017) investigated whether there is a general tendency towards revision-sensitivity, the current study addresses a refinement of this question by examining whether revision-sensitivity can be affected by how large the risk of a potential revision actually is in a specific context. In order to address this open question, we manipulated experiment-inherent frequencies, and we contrasted two hypotheses (see Section 2.1) regarding ERPs on the colour adjective in the present study.
Our first hypothesis, STRICT REVISION SENSITIVITY (see Section 2.1), is derived from Augurzky et al. (2017) and states that the processor only commits to a semantic interpretation (i.e. it performs a truth-value judgment) if it does not run any risk of a later revision to that interpretation. However, while Augurzky et al. applied this hypothesis to an experimental setting in which frequencies of all the possible continuations were counterbalanced, the current study manipulated the ratio of sentences that ended on the colour adjective and sentences that continued with a restrictive relative clause. We reduced the number of relative clause continuations to test whether the parser more often commits to a risky interpretation if the chance of a following meaning shift is reduced. While STRICT REVISION SENSITIVITY offered an explanation of the effects in Augurzky et al., it does not account for the present pattern of results without refinement. Though the experimental context in the present study was set up in such a way that the probability of potentially revision-inducing restrictive relative clauses was reduced relative to the experiment of Augurzky et al., it was still non-zero and extraposed restrictive relative clauses still occurred in 20% of the trials. As a consequence, the risk of a meaning revision was also reduced but still not removed completely. As STRICT REVISION SENSITIVITY only takes into account whether a revision is possible at all, its predictions for the current experiment were the same as for Augurzky et al.. In contrast to these predictions, the present results differ from those from Augurzky et al.: In the current study, the N400 amplitude was indistinguishable between the complex conditions and the simple, false condition. Comparing the two experiments leads us to assume that the comprehension processes, as reflected by the N400, were affected by the extra-linguistic context, in particular by how large the risk of a later meaning revision actually was in the specific context. More specifically, the effects observed locally on the colour adjective indicate that comprehension processes are sensitive to how probable various sentence continuations and the associated global semantic interpretations actually are in a specific context.
The second hypothesis introduced above, FREQUENCY-DRIVEN ADAPTIVITY, is consistent with the obtained results. The general idea behind this hypothesis is that incremental processing quickly adapts to various aspects of the linguistic and non-linguistic contexts, including frequency information about various types of potential sentence continuations. This makes the processor sensitive to how likely a revision in interpretation actually is in the specific context in which a sentence is processed. In the context of the present study, FREQUENCY-DRIVEN ADAP-TIVITY can be related to the concept of speaker reliability (cf. Brothers et al., 2019). Instead of adjusting to the surface frequencies of potential sentence continuations, we could also think of comprehenders as being adaptive to speaker reliability. In particular, there are two ways in which the (imagined) speaker in the present experiment could be considered unreliable. The first is by producing sentences that are not related in a straightforward way to the visual contexts, e.g. sentences that ask about aspects of the world that are not reflected in the visual contexts (more on this below). The second way in which the speaker can be unreliable is by producing sentences that contain shifts in meaning, i.e. questions in which the correct answer changes during incremental processing. Due to the semantic properties of the sentence materials we used, these two types of unreliablity are opposed to each other in the present experiment. We highlight connections between FREQUENCY-DRIVEN ADAPTIVITY and the speaker reliability where they are relevant.
In Section 2, we briefly introduced two versions of FREQUENCY-DRIVEN ADAPTIVITY, namely FREQUENCY-DRIVEN ADAPTIVITY IN JUDGMENTS and FREQUENCY-DRIVEN ADAPTIVITY IN EXPECTATIONS. In the following, we discuss these hypotheses in light of our experimental results. The first version, FREQUENCY-DRIVEN ADAPTIVITY IN JUDGMENTS is based on the distribution of sentence-final truthvalue judgments across the experiment and, more specifically, on expectations about what the final truthvalue judgment will be in a given trial. According to that account, judging a sentence locally as true at a given sentence position during incremental processing will lead to a decreased N400 amplitude as compared to a local "false"-judgment. 11 In addition, the decision about whether a commitment to a semantic interpretation is made, and thus about whether a truth-value judgment is performed, at a given sentence position, generally depends on what the sentence-final truthvalue judgment is expected to be. In the present design, the final truth-value judgment is "false" in the majority of the trials. Under the assumption that the parser predicts this final truth-value at each sentential position, the experiment-inherent frequency distribution provides a strong contextual cue, which is expected to already affect the pattern observed midsentence, i.e. on the colour adjective. Therefore, the finding that the N400 amplitude in the complex conditions does not differ from the amplitude elicited by the simple false conditions could simply be explained in terms of a bias for "false"-judgments that leads to a higher proportion of local truth-value judgements. If it is true that frequency manipulations in the present study function in a similar manner as speaker reliability, the current results could reflect a scenario in which the comprehender adapts to a speaker who is unreliable in terms of producing a high proportion of false sentences, or rather questions that lead to the answer "false". In the results section (Section 2.3), it was already established, using indices from SDT, that there was bias for "false"-judgments in the present experiment. A post hoc comparison to the same bias parameter c for the short sentences in Augurzky et al. (2017) further revealed a significantly higher bias in the present vs. the previous study (means (and sds): .150 (.269) vs. .289 (.109); t = 2.244; p = .030). Thus, we observed an adaptation in response bias that could plausibly also have led to an increased probability of committing to local "false"-judgments already when encountering the colour adjective during online comprehension.
Similarly, the observed effects could also be explained by taking into account how high the risk of a meaning revision is, irrespective of whether the final truth-value judgment will be "true" or "false". Such an explanation would be based on a type of adaptivity that is driven by meaning shifts: Comprehenders in the present experiment could adapt to a speaker that produces a relatively small proportion of sentences containing meaning shifts. If we relate this type of adaptivity to speaker reliability, comprehenders would adapt to a speaker that is actually more reliable than in Augurzky et al. (2017), in the sense that meaning shifts are avoided. Readers may be cautious in making early commitments if a later semantic revision is highly likely. But if revisions are unlikely in a given context, they may be more inclined to commit to an interpretation early on as this is likely to match the final interpretation. Interestingly, a speaker who is reliable in terms of a large proportion of sentence-final "true"-judgments is, in the context of the present experiment, at the same time unreliable in terms of mid-sentence meaning shifts. As our current design does not allow us to differentiate between these two types of reliability, both alternatives remain viable explanations that may, in fact, both simultaneously play a role in incremental semantic processing.
At any rate, the prediction of a truth-value-based version of FREQUENCY-DRIVEN ADAPTIVITY would be an increased number of local "false"-judgments in the complex conditions of the present experiment since locally, on the colour adjective, the complex conditions are false. This type of account not only explains the results of the present experiment but is compatible with the entire set of results from Augurzky et al. (2017) and from the present experiment: For the simple conditions, which did not involve any risk of a later revision, readers were expected to commit to an interpretation early on. In both studies, this was reflected in a decreased N400 amplitude for true vs. false conditions. By contrast, in the complex conditions, the likelihood of local commitments depends on expectations about sentence-final truth-value judgments, which we manipulated by controlling the frequency of extraposed restrictive relative clauses across experiments. In Augurzky et al., extraposed relative clauses appeared with a probability of 50%. As a consequence, the risk of a semantic revision was relatively high and the probability of sentences that induce global "false"judgments was relatively low. This may have led to a decreased likelihood of local semantic commitments in the complex conditions, explaining why the N400 amplitude for the simple conditions was in between the N400 elicited by the simple true and simple false conditions. In the present experiment, in which relative clauses appeared only with a probability of 20%, local semantic commitments to "false"-judgments were, by contrast, expected to occur more frequently and thus N400 amplitude in the complex conditions were expected to fall closer together with the amplitudes in the simple false conditions.
The second version of FREQUENCY-DRIVEN ADAPTIVITY, i.e. FREQUENCY-DRIVEN ADAPTIVITY IN EXPECTATIONS, provides a conceptually different explanation of the observed effects in terms of expectation-based pragmatic processing. In particular, it is assumed that the N400 amplitude reflects the probability of encountering a specific sentence continuation in context, and an N400 effect is expected if there is a sufficiently large difference in the probabilities assigned to two potential sentence continuations (e.g. Augurzky et al., 2019). In order to assign probabilities to sentence continuations, we distinguish between (i) prior probabilities, i.e. expectations that are based solely on statistical regularities in the language but do not take into account contextual information, and (ii) posterior probabilities which are updated according to contextual cues (cf. Kuperberg & Jaeger, 2016, and references therein). Contextual cues include linguistic cues like the local sentence context as well as non-linguistic cues like the visual context and also the context of an entire experiment, including frequency information about the distribution of different types of sentence continuations in the experiment. In the following paragraphs, we sketch how the observed effects can be explained along these lines (details are given in Appendix C).
We start with a rather simple model in which the set of potential sentence continuations is restricted to those that actually occurred in the experiments (cf. Augurzky et al., 2019). We assume that a relatively low prior probability, p(sc), is assigned to sentence continuations, sc, with extraposed relative clauses (cf. Figure C1 in the appendix). Moreover, since relative clause continuations did, in fact, occur with a relatively low frequency in the present experiment, we assume that these prior probabilities are not modulated strongly, such that the prior probabilities p(sc) are roughly identical to the probabilities of the various sentence continuations in the present experiment, p(sc|e). In Augurzky et al. (2017) we assume a larger divergence between p(sc|e) and p(sc), such that extraposed restrictive relative clauses are assigned a relatively high probability.
Next, we assume that the posterior probabilities over possible sentence continuations in visual context c are derived from the probabilities p(sc|e) according to the following formula, where d [[s]] c =1 is equal to 1 if the complete sentence s is true (resp. the answer to the question is "yes") in context c and 0 otherwise.
Deriving expectations about sentence continuations in this way is not uncommon (e.g. Augurzky et al., 2019, for further motivation see Appendix C), but it is not sufficient to account for the differences between our current findings and Augurzky et al. (2017). The reason is that only continuations that were actually presented in the experiment are considered and among these only those that lead to "yes"-answers remain viable options. As a consequence, the only possible continuations after having processed the sentence beginning Are all the triangles… following the complex context A would be either blue that are inside of the circle or red that are outside of the circle. Both of these continuations are assigned a probability of 0.5, irrespective of how probable the extraposed relative clauses would have been in the absence of a visual context.
We would like to discuss two ways in which this simple expectation-based model can be extended in order to capture the observed differences between  and the present experiment. The first is by integrating adaptations to speaker reliability and the second is by incorporating sentence continuations that never actually occurred in the experiments.
Adaptations to speaker reliability can be integrated into the model by modulating how strongly p(sc|c, e) depends on sentence meaning (see the appendix for details). We could assume that comprehenders in the present experiment adjust to a speaker who is unreliable in the sense of producing relatively many non-matching sentences (i.e. the first type of reliability from above). For illustration, consider context C (i.e. the simple false condition). If readers pay little attention to sentence meaning, then, in that context, a relatively high probability would still be assigned to sentence continuations starting with the colour adjective blue. This would lead to a decreased N400 amplitude in that condition and, in turn, decrease the difference in N400 amplitudes between condition C and the complex conditions A/D, as was actually observed in the present experiment.
Alternatively, our results could also be explained by incorporating continuations into the model that never actually occurred in the experiments. An example of such a continuation following the sentence beginning Are all the triangles… would be of comparable size. If an increasing probability is assigned to continuations of this kind, readers' expectations of colour adjectives in the complex contexts vanish. The reason is that among our sentence continuations (all starting with colour adjectives) only those containing relative clauses match the visual contexts and relative clauses are by themselves relatively improbable continuations. Under these assumptions, it does indeed make a difference how probable continuations with relative clauses actually are in the context of the experiment. The more probable they are the more expected are the colour adjectives. Thus, differences between N400 amplitudes in the simple false and in the complex conditions depend on the frequency of relative clauses and the observed effects are explained.
At this point, we would like to point out one noteworthy consequence of the just described version of FRE-QUENCY-DRIVEN ADAPTIVITY. Recall that, in the introduction, the experimental context was considered a microcosm that can be controlled by the experimenter to simulate adaptation effects in language processing. Note, however, that the situation would be a bit more complicated, if the just described version of FREQUENCY-DRIVEN ADAPTIVITY IN EXPECTATIONS is on the right track: How likely sentence continuations that never actually occur in an experiment are expected to be and, more importantly, how these expectations are affected by experiment internal frequency distributions are something that are difficult or maybe even impossible to control in an experiment.
Our current findings also highlight the relation between frequencies in more naturalistic settings, e.g. in corpora, and frequencies within an experiment. Taking for granted, as the current results indicate, that experiment-inherent frequencies affect processes during online language comprehension, the question arises of what role such frequency distributions should play in experimental design (see also Brothers & Kuperberg, 2020, for related discussion). On the one hand, one could aim for distributions that mimic natural frequencies, much like we did in the present study. This does, however, often lead to imbalances in experimental design, e.g. in terms of frequencies of different sentence types, word lengths, response biases or biases in interpretation. On the other hand, one may aim for balanced experimental designs. In that case, deviation from natural frequencies may, however, affect comprehension processes in undesirable ways, calling into question the generalisability of the experimental results. As long as explicit models of adaptation effects as reported above have not yet been established, we recommend to compare results obtained using both types of designs whenever possible.
Before moving on to our conclusions, we would like to address one potential concern regarding the validity of the present results and their interpretation, namely the possibility that participants had used an artificial lexical match-to-sample strategy rather than incrementally processing the sentences in a natural and compositional fashion. This point of criticism was already addressed in Augurzky et al. (2017), who carried out a second ERP study to test this assumption. Crucially, the second study involved probe detection instead of a verification task. As a consequence, attention was guided to different positions across the entire sentence, and for successful task fulfilment, readers needed to process the sentence as a whole, as they could not know in advance which of the sentence positions would be probed. Crucially, the N400 effects in this study were replicated, suggesting that they could not have been the result of simple lexical processing strategy (see Augurzky et al., 2017, for further details). The possibility of lexical priming has also been extensively discussed in further previous studies, and we refer the reader to this literature, which involves highly comparable experimental designs and procedures. As a whole, the arguments discussed in these studies were based on considerations about potential quasi-compositional meaning representations that would be required to solve the experimental task (Augurzky et al., 2019;Augurzky, Hohaus, et al., 2020;Augurzky, Schlotterbeck, et al., 2020).
For the sake of the present discussion, we would like to additionally point out that the comparison between the current experiment and Augurzky et al. (2017) in itself actually provides one of the strongest arguments against a non-compositional processing strategy. In particular, the mere finding that restriction frequency modulated the N400 amplitude on the colour adjective cannot be straightforwardly explained by referring to a simple lexical match-to-sample strategy. Moreover, if we do assume that participants in the Augurzky et al. (2017) study waited for further information to combine it with the linguistic representation assembled thus far, at least some degree of compositionality is needed to explain the results we obtained. To see this, consider how the two properties expressed by the colour adjective and the preposition could be combined in principle. The simplest way to combine them is presumably their conjunction (e.g. blue & inside). If the processor would have integrated this conjunction into the sentence representation in a way that does not necessarily obey standard rules of compositional interpretation, representations like, e.g. ALL(dots, blue & inside) would have been possible. Such representations would, however, have yielded completely unexpected truth-value judgments that would not match our observations. In particular, such representations would have led participants to respond "false" in all conditions with restrictive relative clauses, amounting to an error rate of 100% in these conditions. By contrast, the type of representation that obeys standard rules of compositional interpretation, e.g. ALL(dots & inside, blue), does explain the truth-value judgments we observed. Taken together, our results thus indicate that participants were more likely to delay their interpretation in order to wait for additional restrictive cues in the experiment of Augurzky et al. (2017) as compared to the present one, and in case additional restriction was in fact provided, they integrated it into the sentence representation in a way that obeys the rules of compositional interpretation.
In conclusion, our present findings show that experiment inherent frequency distributions affect compositional semantic processes during online comprehension. We propose that, in the broadest sense, these effects are driven by expectations about upcoming linguistic material and the associated interpretations. We conclude that such expectations can actually be modulated by controlling the frequencies with which various types of sentences occur in an experiment. This type of sensitivity to frequencies may actually be an essential aspect of adaptations to different types of speakers or different kinds of contexts that comprehenders routinely perform during natural language processing. We have presented four possible accounts of these observations (two versions of FREQUENCY-DRIVEN ADAPTIVITY IN JUDGMENTS and two versions of FREQUENCY-DRIVEN ADAP-TIVITY IN EXPECTATIONS of lexical items in sentence continuations). Which of these accounts, if any, is correct cannot be decided on the basis of the data presented here and thus has to be left open for future research. from the original Augurzky et al. study, as we had to manually re-analyse the data to exclude trials with extreme slow drifts. ERPs yielded by that analysis led to the same effects as in the results reported in Augurzky et al. (2017) (see Figure 3). 10. We would like to thank an anonymous reviewer for pointing out to us that the absence of a significant effect in the present study is by itself hard to interpret. It does not imply the non-existence of the effect but could, instead, also be due to a lack of statistical power. We would like to add to this that FREQUENCY-DRIVEN ADAPTIVITY actually does not predict the absence but rather a decreased magnitude of the effect in the present as compared to the previous experiment. This is one reason, besides others that are discussed in Froman and Shneyderman (2004), Levine and Ensom (2001) and Miller (2009), we refrain from computing a power analysis but decided to carry out a direct statistical test, instead. 11. Asymmetries in processing true vs. false sentences could be explained based on fundamental ideas in logic, in particular computational logic, distinguishing between positive and negative judgments. Such asymmetries can be linked to the closed world assumption (see e.g. Lifschitz, 1985Lifschitz, , 1996, which equates the negation of a proposition with the failure to derive that proposition from a given knowledge base. Interpreting negation in this way is central to efficient logic programming and has also been used successfully in modelling human reasoning and language understanding (Stenning & van Lambalgen, 2008;van Lambalgen & Hamm, 2005; and see also Baggio, 2018 for related research). It leads to an asymmetry between positive and negative judgments absent from classical logic and may thus provide a theoretical explanation for difficulties in processing false sentences that is based on the negative judgments themselves.

Disclosure statement
No potential conflict of interest was reported by the authors.

Data availability statement
The data that support the findings of this study are openly available in the Tübingen Archive of Language Resources at https://hdl.handle.net/11022/0000-0007-EC4A-D.