Are words pre-activated probabilistically during sentence comprehension? Evidence from new data and a Bayesian random-effects meta-analysis using publicly available data

Several studies (e.g., Wicha et al., 2003; DeLong et al., 2005) have shown that readers use information from the sentential context to predict nouns (or some of their features), and that predictability effects can be inferred from the EEG signal in determiners or adjectives appearing before the predicted noun. While these findings provide evidence for the pre-activation proposal, recent replication attempts together with inconsistencies in the results from the literature cast doubt on the robustness of this phenomenon. Our study presents the first attempt to use the effect of gender on predictability in German to study the pre-activation hypothesis, capitalizing on the fact that all German nouns have a gender and that their preceding determiners can show an unambiguous gender marking when the noun phrase has accusative case. Despite having a relatively large sample size (of 120 subjects), both our preregistered and exploratory analyses failed to yield conclusive evidence for or against an effect of pre-activation. The sign of the effect is, however, in the expected direction: the more unexpected the gender of the determiner, the larger the negativity. The recent, inconclusive replication attempts by Nieuwland et al. (2018) and others also show effects with signs in the expected direction. We conducted a Bayesian random-effects meta-analysis using our data and the publicly available data from these recent replication attempts. Our meta-analysis shows a relatively clear but very small effect that is consistent with the pre-activation account and demonstrates a very important advantage of the Bayesian data analysis methodology: we can incrementally accumulate evidence to obtain increasingly precise estimates of the effect of interest.


Introduction
Consider the following sentence, with two possible continuations:
(1) The grandmother wants to pick mushrooms in the forest. She takes …
a. [HIGH-PROB.] a basket.
b. [LOW-PROB.] a cloth.
It is well known that the predictability of a word affects sentence comprehension. Words with low predictability are read more slowly than words with high predictability (e.g., Ehrlich and Rayner, 1981). In the EEG literature, it has been shown that words with low predictability are accompanied by a relative negativity that peaks around 300-500 ms after word onset over centro-parietal scalp sites in comparison with highly predictable words; this difference between conditions is usually referred to as an N400 effect (first noticed in Kutas and Hillyard, 1980, for semantic anomalies; and in Kutas and Hillyard, 1984, for low-predictability words; for a review, see Kutas and Federmeier, 2011). 1 In (1), for example, the continuation "a basket" has higher predictability than the continuation "a cloth", and thus we would expect a more negative signal for "a cloth" in (1b) in comparison with "a basket" in (1a).
There is no doubt that language comprehension is at least minimally predictive: The human parsing system seems to be a left-corner parser (Resnik, 1992) and the context clearly influences the state of the system (see the discussion in Kuperberg and Jaeger, 2016). However, the fact that a word is predictable does not necessarily entail that the lexical item and/or its semantic (and maybe phonological or grammatical) features are always going to be automatically predicted: The N400 effects seen when words with different predictability are compared (e.g., "basket" vs. "cloth" in (1a) and (1b)) could be due to differences in the difficulty in integrating their meaning with the semantic context rather than differences in preactivation: Any factors that facilitate pre-activation could correspondingly be argued to make semantic integration easier (Van Berkum et al., 2005).
There are, in fact, two views regarding the relationship between predictability and the amplitude of the EEG signal in the N400 spatiotemporal window: (i) the access view (Kutas and Federmeier, 2000), which can be linked to the idea of prediction error (Rabovsky and McRae, 2014; Kuperberg and Jaeger, 2016), and (ii) the semantic integration view (e.g., Brown and Hagoort, 1993). These views, however, are not mutually exclusive, and both access/prediction error and integration may play a role (Hagoort et al., 2009; Nieuwland et al., 2019), but they can be disentangled. We briefly describe these two views below.
(i) Access view. Before a word w is read, its context triggers the pre-activation in memory of semantic features (e.g., Kutas and Federmeier, 2000; Rabovsky and McRae, 2014; Kuperberg and Jaeger, 2016) or of the entire lexical item (e.g., DeLong et al., 2005). Reading the word w triggers a memory access. The amplitude of the evoked N400 is a function of the mismatch between the pre-activation due to the context and the activation due to the word w. While the access view is compatible with both lexical pre-activation and the pre-activation of features, recent computational modeling work shows that most effects reported in the literature are more easily accounted for by the pre-activation of (semantic) features (Rabovsky and McRae, 2014).
(ii) Semantic integration view. After the word w is read and retrieved from memory, it needs to be integrated into the partial semantic representation that has been built so far. The N400 is a function of the difficulty of integrating a word into its global and local context (Van Berkum, Hagoort and Brown, 1999). The reason why predictable words are easier to integrate is that they are also more plausible in the given context and fit the reader/listener's world knowledge (Baggio and Hagoort, 2011; Hagoort et al., 2004).

Evidence for the access view
One of the pieces of evidence that appeared to settle the debate in favor of the access view comes from a series of experiments that manipulated the gender of a determiner or adjective (in Spanish: Wicha et al., 2003a, 2003b, 2004; and in Dutch: Van Berkum et al., 2005; Otten et al., 2007; Otten and Van Berkum, 2008, 2009) or the phonological form of a determiner (DeLong et al., 2005) appearing prior to a predicted noun. Although these studies varied considerably in methodology (see Table 1), what they had in common was that they all intended to provide evidence in favor of the access view. Wicha et al. (2003b) used ERPs to examine the role of grammatical gender and semantic congruity in Spanish. In their experiment, native Spanish speakers read sentences in which a drawing depicting a target noun was either congruent or incongruent with the sentence meaning, and either agreed or disagreed in gender with the preceding article. The most novel finding in the study was that when determiners with gender markings that were unexpected based on the context up to that point were compared with expected determiners, as in (2b) vs. (2a), an N400 effect was found (notice that BASKET was presented as a picture in (2)). It is important to note that a determiner of either gender is syntactically, semantically, and pragmatically correct at this point, since Red Riding Hood in sentence (2) could have been carrying food in a sack[masc] ("un costal") or in a basket[fem] ("una canasta"), though the basket was more expected according to an offline cloze task. It can be argued that studies using this type of gender manipulation provide evidence for the access view and against semantic integration, because gender information is largely arbitrary, and synonymous words that fit the context equally well can have different genders.
(2) a. Caperucita Roja llevaba la comida para su abuela en una CANASTA muy bonita …
Little Red Riding Hood carried the food for her grandmother in a[fem] BASKET[fem] very pretty …
b. Caperucita Roja llevaba la comida para su abuela en un CANASTA muy bonita …
Little Red Riding Hood carried the food for her grandmother in a[masc] BASKET[fem] very pretty …
However, other experiments using gender in Spanish and Dutch did not show a consistent pattern of results at the prenominal determiner or adjective; see Table 1 for a summary of published results. In all these studies, although the ERP elicited by the unexpected noun consistently showed an N400, no consistent ERP effect was found at the determiner or adjective, not even when the experiment was repeated within the same laboratory. As discussed below, either a negativity or a positivity was found at the determiner/adjective.

Table 1
List of EEG studies with a manipulation before the predicted noun. Note. In each study, the EEG signal is analyzed from the onset of the region of interest (except if indicated with (*), where it is measured from the inflection); the results are based only on native speakers (except when indicated with (**), where 29 bilinguals were included in the analysis); the reference for the EEG is the average of the left and right mastoids (except when indicated with (***), where the reference was a global average); the experiment was conducted with serial visual presentation (except if indicated with 'audio'); and the critical region is always presented alone. The following abbreviations are used: Pres.: presentation duration; ISI: interstimulus interval; ROI: region of interest; F: frontal; C: central; M: medial; P: parietal; W: widely distributed; R: right; L: left; Neg: negative; Pos: positive; ns: non-significant.
Spanish shows a more consistent pattern of results than Dutch. However, while both Wicha et al. (2003a) and Wicha et al. (2003b) found an N400 effect for determiners with an unpredictable gender in comparison with a predictable one, Wicha et al. (2004) reported a late positivity. While there certainly are differences between the studies (see Table 1), there is no clear explanation for the difference in the patterns. Interestingly, Foucart et al. (2014) did find an N400 effect for differences in predictability of determiners in Spanish sentences presented to Spanish native and non-native speakers with a design similar to Wicha et al. (2004), where a positivity was reported. One important difference between the studies is that in contrast to the studies led by Wicha, Foucart et al. (2014) showed no sentences with agreement or semantic violations. Similar results to Foucart et al. (2014) were found by Martin et al. (2018) with only native speakers.
Dutch, in contrast, shows more mixed results. While Van Berkum et al. (2005) detected a very early positivity at adjectives with unpredictable gender marking in comparison with predictable ones (from 50 ms after the gender inflection), Otten et al. (2007) detected a negativity (300-600 ms after the gender inflection) with a very similar design (and auditory presentation in both experiments). Furthermore, in a similar experiment with visual presentation, Otten and Van Berkum (2008) showed a very late negativity: about 900 ms after the onset of the critical adjective. Otten and Van Berkum (2009) used gender-marked determiners and visual presentation, in contrast to Otten et al. (2007) (with adjectives and auditory presentation). Kochari and Flecken (2019) attempted a replication of Otten and Van Berkum (2009) with a larger sample size (N=70 vs. N=38), and while they failed to find a significant effect at the gender-marked determiner, they did observe a pattern consistent with the original experiment. Recently, Fleur et al. (2019) manipulated gender expectancy and whether the expected noun phrase would be definite or not in Dutch, and detected an N400-like effect for the gender manipulation.

DeLong et al. (2005) capitalized on the fact that the form of the indefinite determiner in English depends on the phonological form of the following word ("an" precedes nouns beginning with vowel sounds, whereas "a" precedes nouns beginning with consonant sounds).
In their experiment, participants were presented with sentences ending with a predictable noun phrase, such as "a kite" in (3a), or an unpredictable one, such as "an airplane" in (3b).
(3) a. The day was breezy so the boy went outside to fly a kite.
b. The day was breezy so the boy went outside to fly an airplane.
Since the determiners preceding the critical noun have identical minimal meaning, there should be no difference between the fit of "a" and "an" to the semantic context (i.e., both determiners should incur the same integration costs). However, DeLong et al. (2005) showed that the amplitude of the negativity was smaller with increasing cloze probability of the word, both at the noun and, critically, at the determiner. This is evidence for the access view.
However, recent replication attempts (Ito, Martin and Nieuwland, 2017a; Nieuwland et al., 2018) have called this finding into question. Ito et al. (2017a) used materials and a procedure similar to those of DeLong et al. (2005), but failed to find a significant N400 effect at the articles in two experiments with different presentation rates (reported for native English speakers in Ito et al., 2017b), although unexpected articles did elicit numerically larger N400 amplitudes than expected ones.
It might be that the differences between DeLong et al. (2005) and Ito et al. (2017a) have to do with low statistical power: The original study was conducted with 32 participants and the attempted replications with 23 participants. It is likely that both studies had relatively low power. It is well known (Gelman and Carlin, 2014) that when power is low, statistically significant results will be overestimates (Type-M(agnitude) errors) that will be difficult to replicate. However, Nieuwland et al. (2018) attempted a large-scale direct replication (nine laboratories with a total of 334 participants) of DeLong et al. (2005). They also failed to find the statistically significant effect claimed in the original study using the same statistical analysis, although the sign of the effect was in the expected direction in a linear mixed-model analysis. Nieuwland et al. (2018) noted a possible reason for the failure to find evidence for the access view with the DeLong et al. manipulation: The implicit assumption of the DeLong et al. study was that people expect the initial phoneme of the predicted noun immediately after the determiner. However, unexpected determiners do not necessarily disconfirm the upcoming noun; they might be indicating that another word will come first (e.g., "an old kite"). Nieuwland and colleagues argue that given that corpus data (the Corpus of Contemporary American English and the British National Corpus) show that "a"/"an" is followed by a noun only one third of the time, participants might only occasionally be actively predicting a determiner matching the initial phoneme of the noun.
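The Type-M error logic can be illustrated with a small Monte Carlo sketch: when power is low, the subset of estimates that happen to reach significance systematically exaggerates the true effect. The numbers below are made up for illustration and are not the parameters of any of the studies discussed.

```python
import random
import statistics

def type_m_exaggeration(true_effect=0.5, sd=10.0, n=30, sims=2000, seed=1):
    """Monte Carlo sketch of a Type-M(agnitude) error (Gelman & Carlin, 2014):
    under low power, the mean *significant* estimate overshoots the true effect."""
    rng = random.Random(seed)
    significant = []
    for _ in range(sims):
        sample = [rng.gauss(true_effect, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if abs(mean / se) > 1.96:  # nominally significant at alpha = .05
            significant.append(abs(mean))
    # Exaggeration ratio: average significant estimate relative to the truth
    return statistics.fmean(significant) / true_effect
```

With these illustrative numbers the ratio comes out far above 1: any estimate that clears the significance threshold necessarily overstates the true effect severalfold, which is exactly why a significant result from an underpowered study is hard to replicate.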

Some potential reasons for the inconsistencies in the literature
In the above-mentioned studies, even if there is a difference in the N400 spatiotemporal window at the determiner as a function of gender marking, it might be the case that it is too small and/or too noisy to be reliably detected. We identify two closely related issues that may be contributing to the inconsistencies across studies.
The first issue is that the statistical comparison across conditions for the determiner or adjective is performed across different words with different lengths and frequencies. In English, "a" is shorter and more frequent than "an". In languages with gender-marked determiners/adjectives, some genders are more frequent than others. For example, in Dutch, common-gender words are much more frequent than neuter-gender words (Van Berkum, 1996). Even though comparisons across different words are fairly common in the ERP literature on prediction, comparing different words might be problematic when one is dealing with small and subtle effects. One countermeasure adopted to mitigate the effect of comparing different words is to counterbalance the appearance of the words in the high- and low-predictability conditions. However, different words contribute their own idiosyncratic EEG signal, and this may increase the variability of the signal, making it more difficult to find the expected N400 or any other effect. The problem of small effects is compounded by the fact that in many studies in the prediction literature, the ERP analysis is performed after visual inspection of the data, and thus it is unclear to what extent the results hinge on the specific choices made during the analysis (see also Luck and Gaspelin, 2017; Nieuwland, 2019). 2

2 Martin et al. (2013) also conducted a similar study; however, Martin et al. (2013) used an average reference instead of the linked mastoids that are common in studies investigating the N400, making the comparison with the previous findings less straightforward (see Nieuwland et al., 2018; Ito, Martin and Nieuwland, 2017b). A reviewer points out that it is not too difficult to compare results from different references, although admittedly the average reference, which reduces the data rank, is slightly more challenging than the other unipolar references, which are just affine transformations.
A second issue is that in all the previously mentioned studies, predictability is manipulated by making one continuation highly predictable and the other unpredictable. For example, given a context sentence like "The day was breezy so the boy went outside to fly …", "a kite" is highly predictable because it has a high cloze probability, and "an airplane" is unpredictable because it has a low cloze probability. However, in the case of the unpredictable "an airplane", not only does the comprehender have to process an unpredictable continuation, but in addition the comprehender experiences a violation of a highly expected continuation ("a kite") in a constraining context. The context "The day was breezy so the boy went outside to fly …" is constraining because it has a potential best completion with high cloze probability, regardless of the cloze probability of the actual continuation. In terms of information theory, a constraining context has lower entropy (in other words, lower uncertainty) than a less constraining one, because in constraining contexts the cloze probability mass is concentrated in one (or a few) responses. 3 The low cloze probability coupled with the violation of a highly expected continuation (i.e., a low-cloze word in a highly constraining context) may lead to a biphasic response: an N400 followed by a late positivity. In fact, Van Petten and Luka (2012) show that biphasic ERP responses (negativities followed by frontal positivities) are fairly common when low- vs. high-cloze words are examined in high-constraint contexts, and it has been hypothesized that the late positivity is a response to the violation of a strong expectation (Federmeier et al., 2007; Van Petten and Luka, 2012).
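The entropy notion can be made concrete with a short sketch. Using hypothetical completion counts (not actual cloze data from any of the studies discussed), the code below estimates cloze probabilities as response shares and compares the Shannon entropy of a constraining and a non-constraining context.

```python
import math
from collections import Counter

def cloze_probabilities(completions):
    """Cloze probability of a continuation = its share of the responses."""
    counts = Counter(completions)
    n = len(completions)
    return {word: c / n for word, c in counts.items()}

def entropy_bits(probs):
    """Shannon entropy (in bits) of a cloze distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# Hypothetical completions for "... the boy went outside to fly a ...":
constraining = ["kite"] * 9 + ["balloon"]                   # mass on one word
non_constraining = ["kite", "balloon", "plane", "drone", "flag"] * 2

low = entropy_bits(cloze_probabilities(constraining))       # ~0.47 bits
high = entropy_bits(cloze_probabilities(non_constraining))  # ~2.32 bits
```

The constraining context yields lower entropy because the probability mass sits on a single best completion; the non-constraining context, with its flat distribution over five responses, reaches the maximal entropy log2(5).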
Although the late positivity has been studied only for nouns and not for the unexpected gender (or phonological marking) of the determiner or adjective, it might be the case that the effect of the low cloze probability and the violation of a strong expectation are confounded at the determiner/adjective.
A potential solution to this confound is to introduce two conditions that appear in a non-constraining context. Consider for example the design of the studies led by Otten (Otten et al., 2007;Otten and Van Berkum, 2008;Otten and Van Berkum, 2009), where a non-constraining context was also included. This is exemplified with (4) and (5) from Otten et al. (2007) translated from Dutch to English, where the two Dutch genders, neuter and common, are indicated with neut and com.
(4) Constraining context: Anne had finally found a quiet place for studying. She sat down and grabbed a …
a. … big[neut] and pretty well-thumbed book[neut] out of her bag.
b. … big[com] and pretty well-thumbed novel[com] out of her bag.
(5) Non-constraining context: After studying Anne had found a quiet place in the park. She sat down and grabbed a …
a. … big[neut] and pretty well-thumbed book[neut] out of her bag.
b. … big[com] and pretty well-thumbed novel[com] out of her bag.
In this experimental design, the constraining context (4) contains a highly predictable adjective big[neut] and an unpredictable adjective big[com] that also involves a violation of a strong expectation. A non-constraining context (5) is added to this design; here, the same adjectives big[neut] and big[com] appear as continuations, but both are unexpected. Thus, the non-constraining context is intended to serve as a baseline for the constraining context: the prediction effect at the adjective would be expressed as an interaction between constraint (constraining vs. non-constraining) and completion (big[neut] vs. big[com]). That is, the focus is on the difference [(4a)-(4b)] - [(5a)-(5b)]. This is because no difference due to predictability is expected between the baseline continuations big[neut] and big[com] in the non-constraining context (5), but a difference due to predictability is expected between the two continuations in the constraining context (4).
We argue that modifying the design to have a constraining and a non-constraining context and examining the interaction in the resulting 2×2 design may not be enough to rule out an overlap between a negativity and a late positivity (since this has been reported for nouns at least; Van Petten and Luka, 2012). This is so because the first term [(4a)-(4b)] might be eliciting both a negativity and a late positivity, while the second term [(5a)-(5b)] is not expected to differ (at least not as much as the first set of sentences), and thus is not acting as a baseline. We suggest that the relevant comparisons would be (5a) vs. (4a), where only an N400 effect is expected, and (5b) vs. (4b), where only a late positivity but not an N400 might appear.
Given these concerns, in our experiment we follow the 2×2 design by Otten and colleagues, including a constraining and a non-constraining context and two possible continuations. But instead of looking for an interaction in the 2×2 design, we planned a nested comparison (Schad et al., 2018) within the two levels of the factor constraint: we compare (5a) vs. (4a), where we expect an N400 effect, and (5b) vs. (4b), where we expect a late positivity.
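The nested comparison can be translated into contrast codes following the hypothesis-matrix approach of Schad et al. (2018): write each comparison as a weight vector over the condition means, then take the generalized inverse to obtain the contrast matrix for the regression model. A minimal sketch (the condition ordering is our own illustration, not taken from the preregistration):

```python
import numpy as np

# Condition means ordered as: (4a), (4b), (5a), (5b)
# Each row of the hypothesis matrix is one comparison over condition means.
hypothesis = np.array([
    [1/4, 1/4, 1/4, 1/4],   # intercept: grand mean
    [-1,  0,   1,   0],     # N400 comparison: (5a) vs. (4a)
    [0,  -1,   0,   1],     # late-positivity comparison: (5b) vs. (4b)
])

# The contrast matrix to use in the model is the generalized inverse
# of the hypothesis matrix (Schad et al., 2018).
contrasts = np.linalg.pinv(hypothesis)
```

Multiplying the hypothesis matrix by the resulting contrast matrix recovers the identity, confirming that each model coefficient estimates exactly the intended nested comparison and nothing else.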

The current experiment
Our EEG experiment takes advantage of the fact that all German nouns have a gender (i.e., masculine, feminine, or neuter) and that the determiners that precede the nouns can show an unambiguous gender marking when the noun phrase has accusative case. 4 This allows us to create stimuli such as (6)-(7), where the critical region is the determiner (in bold) preceding the hypothesized (un)predicted target noun. We preregistered the following design, and two models for the determiner region and the noun region at predetermined time windows and electrodes (see https://osf.io/vdx8g/). For the determiners, one model examines a potential N400 effect modulated by the gender probability on the determiners of (6a) and (7a), and another model focuses on a potential late positivity modulated by the constraint of the context on the determiners of (6b) and (7b). As discussed in the introduction, both our models use a different approach than the one commonly used in the studies reviewed in Table 1 (and their replications). In those earlier studies, the context was kept constant (i.e., constraining) and the critical determiner and noun were manipulated, changing the cloze probability. In our study, the context changes (i.e., constraining vs. non-constraining) and the target word is kept constant. We detail our preregistered models in section 2.5.1. In an exploratory analysis, we investigate a more traditional design that keeps the context constant (i.e., fixed to constraining).
(6) Constraining context: Die Großmutter möchte heute Pilze im Wald sammeln.
The grandmother wants today mushrooms in.the forest gather.
'Today the grandmother wants to pick mushrooms in the forest.'
a. High-probability continuation: Sie nimmt einen großen Korb mit und macht sich auf den Weg.
She takes a.masc.acc big.masc.acc basket with and makes herself on the way.
'She takes a big basket and starts on her way.'
b. Low-probability continuation: Sie nimmt ein großes Tuch mit und packt es in einen Korb.
She takes a.neut.acc big.neut.acc cloth with and packs it in a.masc basket.
'She takes a big cloth and packs it in "a basket".'
(7) Non-constraining context: Die Großmutter möchte heute keine Pilze im Wald sammeln.
The grandmother wants today no mushrooms in.the forest gather.
'Today the grandmother doesn't want to pick mushrooms in the forest.'
a. Low-probability continuation: Sie nimmt einen großen Korb mit und bereitet ein Picknick vor.
She takes a.masc.acc big.masc.acc basket with and prepares a.neut picnic for.
'She takes a big basket and prepares a picnic.'
b. Low-probability continuation: Sie nimmt ein großes Tuch mit und packt es in den Picknickkorb.
She takes a.neut.acc big.neut.acc cloth with and packs it in the picnic.basket.
'She takes a big cloth and packs it in the picnic basket.'

Participants
For the EEG experiment, we preregistered a sample size of 100 participants: 50 university students and 50 non-students. Due to time constraints, before the first submission of the manuscript we had only recruited 89 participants: 54 university students and 35 non-students. No power analysis to determine the required sample size was performed before the initial data collection. Instead, in order to maximize the chances of detecting an effect, we aimed to collect data from a sample much larger than in typical language ERP studies (N = 30), given the time and resources available. Because of a bug in the presentation script, one of the four experimental lists was not presented. To balance the design, we recruited 31 more participants (27 students and 4 non-students) at a later stage. This means that in total, 120 participants completed the EEG experiment.
Informed consent was obtained from the participants before the session. The study was approved by the Research Ethics Committee of the Faculty of Psychology and Human Movement of the University of Hamburg.
Although psycholinguistic studies have mainly utilized the college subject-pool population, extending our research to a community-based sample is important because it is more representative of the population at large. We excluded non-native German speakers, left-handed subjects, subjects with dreadlocks, subjects with self-reported neurological abnormalities and/or language impairments, and participants of the related sentence completion tasks detailed below. All participants had normal or corrected-to-normal vision.
Students were recruited from the Vasishth Lab subject pool at the University of Potsdam. Non-student participants were recruited using fliers placed at public transportation hubs and local retail locations, and through referrals from other recruited participants and from university and administrative staff. We excluded non-student participants below the age of 18 and above the age of 35 so that their ages matched those of the student population.

Sentence completion tasks
We first conducted a pilot sentence completion task to create the potential experimental items. Next, we conducted a second sentence completion task to establish the predictability of the gender of the target nouns before the gender-marked determiner was read for all the experimental items. Finally, we conducted a third sentence completion task to establish the predictability of the nouns conditional on their previous (gender agreeing) determiner and adjective. Participants of the sentence completion tasks were recruited from the same subject pool used for the EEG experiment; these participants had not taken part in the other tasks mentioned above. We detail the tasks below.

Pilot task.
We originally created 400 potentially constraining and non-constraining contexts together with sentences truncated before a potential accusative determiner. Some of these items were based on the stimuli of experiment 1B of Otten and Van Berkum (2008). We presented these items in a Latin-square design to 40 participants (so that each participant only saw either the constraining or the non-constraining context version of each item) using Linger (https://tedlab.mit.edu/~dr/Linger/).

Sentence completion task for the gender probability and constraint.
We refined the truncated sentences based on the results of the pilot task, and we discarded problematic items. We then built 295 potentially constraining and non-constraining contexts together with sentences truncated before a potential accusative determiner. These sentences were presented in a Latin-square design to 65 native-speaker participants using Linger.
Based on the results of this task, we selected 283 constraining and non-constraining contexts, and we built items similar to (6)-(7). We present a summary of the means and 90% quantiles for the cloze probabilities of the genders in the stimuli and the best completion in Table 2. The best completion is the continuation that has the highest cloze probability for a given context. We show gender characteristics of the items in Table 3.

Sentence completion task for the noun probability and constraint.
We ran an additional sentence completion task in which we truncated the experimental items before the noun to calculate the probability of each noun conditional on the complete context. We presented these items in a Latin-square design to 83 participants using Linger. In this way, we were able to calculate the predictability of the nouns conditional on the context that includes the gender-marked determiner and adjective. 5 These cloze probabilities were used for the analysis at the noun region.

Table 2
Mean and 90% quantiles of the gender probability for each condition for sentences truncated before the determiner. The column called gender cloze probability refers to the cloze probability of the gender of the actual noun phrase that completed the sentence in the experimental items. The column called gender best completion shows the cloze probability for the most likely gender as a continuation for a given context.

Procedure
Participants were tested in a single session, seated approximately 60 cm from a 56 cm presentation screen. Before starting the EEG experiment, participants performed a stop-signal task (Lappin and Eriksen, 1966; Logan and Cowan, 1984) that closely followed the design of Verbruggen et al. (2008). This task was run using OpenSesame (Mathôt et al., 2012), and its results will be part of a future exploratory analysis of the data.
Following the stop-signal task, participants performed the EEG experiment, also run in OpenSesame (Mathôt et al., 2012). Participants were instructed to read sentences appearing in serial visual presentation for comprehension and to answer yes/no questions (which followed a quarter of the trials) by pressing "f" or "j" on a keyboard.
Following the electrode cap preparation, the experiment started with five practice trials. The stimuli of the experiment were presented using a Latin-square design. At the beginning of each trial, a fixation cross appeared for 500 ms followed by a blank screen for 500 ms. After that, the context was presented broken into short phrases (e.g., Die Großmutter | möchte | heute | keine Pilze | im Wald | sammeln.). Each phrase appeared in the center of the screen for 190 ms × number of words + 20 ms × number of characters, with a minimum of 250 ms, followed by a blank screen for 250 ms. Then the continuation appeared word by word. Each word appeared in the center of the screen for 190 ms + 20 ms × number of characters (with a minimum of 250 ms), followed by a blank screen for 250 ms. Although this means that target determiner presentation times differed across the gender conditions, as we explain later, the preregistered comparisons examined the same determiner (the determiner in completions (a) in one model, and the determiner in completions (b) in the other model).
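The presentation-time formulas above can be written as two small helpers. This is a sketch of the formulas as stated; whether spaces count as characters in the phrase formula is our assumption, not specified in the text.

```python
def word_duration_ms(word):
    """Continuation words: 190 ms + 20 ms per character, minimum 250 ms."""
    return max(250, 190 + 20 * len(word))

def phrase_duration_ms(phrase):
    """Context phrases: 190 ms x number of words + 20 ms x number of
    characters, minimum 250 ms.  Assumption: spaces are not counted."""
    n_words = len(phrase.split())
    n_chars = len(phrase.replace(" ", ""))
    return max(250, 190 * n_words + 20 * n_chars)
```

For example, the masculine determiner "einen" is shown for 290 ms while the neuter "ein" hits the 250 ms floor, which illustrates why presentation times differ across gender conditions and why the preregistered comparisons contrast the same determiner across contexts rather than different determiners.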

EEG recording and data processing

Recording
EEG was recorded from 32 scalp sites by means of Ag/AgCl active electrodes mounted in an elastic electrode cap according to the standard 10-20 system (Jasper, 1958), using an ANT Neuro amplifier manufactured by TMSi. As is standard in ERP research, eye movements and blinks were monitored with bipolar electrodes placed at the outer canthus of each eye (horizontal EOG) and above and below the right eye (vertical EOG). EEG and EOG were recorded with a sampling rate of 512 Hz and a low-pass filter of 138 Hz. Recordings were initially referenced to the left mastoid and re-referenced offline to the average of the left and right mastoid channels.

Preprocessing
The data were pre-processed with the R (R Core Team, 2018) package eeguana (Nicenboim, 2018). The EEG data were filtered using a zero-phase band-pass finite impulse response (FIR) filter with pass-band edge frequencies of 0.1 and 30 Hz. The widths of the transition bands for the low and high edges were 0.10 Hz and 7.50 Hz, respectively. This filter was adapted from the default filters in the Python package MNE (Gramfort et al., 2013, v. 0.17.1).
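As a rough illustration of such a band-pass FIR filter (a minimal windowed-sinc sketch, not the actual eeguana/MNE implementation; at this kernel length the very low 0.1 Hz edge is only approximated):

```python
import numpy as np

def bandpass_fir(low_hz, high_hz, fs, numtaps=1025):
    """Windowed-sinc band-pass kernel built as the difference of two
    Hamming-windowed low-pass kernels. The kernel is symmetric, so a
    centered convolution applies it with zero phase shift."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    w = np.hamming(numtaps)

    def lowpass(fc):
        h = 2 * fc / fs * np.sinc(2 * fc / fs * n) * w
        return h / h.sum()  # normalize to unit gain at 0 Hz

    return lowpass(high_hz) - lowpass(low_hz)

# Toy check: a 10 Hz component survives, a 60 Hz component is suppressed.
fs = 512
t = np.arange(0, 20, 1 / fs)
h = bandpass_fir(0.1, 30, fs)
kept = np.convolve(np.sin(2 * np.pi * 10 * t), h, mode="same")
removed = np.convolve(np.sin(2 * np.pi * 60 * t), h, mode="same")
```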
Eye movements were corrected using independent component analysis (ICA; Jung et al., 2000) with the deflation-based FastICA algorithm (Hyvärinen and Oja, 1997; Nordhausen et al., 2011; Miettinen et al., 2014). After this, we rejected segments containing a voltage difference of over 100 μV within a time window of 150 ms, or containing a voltage step of over 50 μV/ms. The corrected signal was segmented and then baseline-corrected relative to the 100 ms interval preceding the stimulus. The preprocessing of the data was done as preregistered, with two exceptions: we did not downsample the EEG data to 500 Hz, and, as suggested by a reviewer, we did not use a notch filter.
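The two rejection criteria can be sketched as follows (a simplified single-channel illustration under our assumptions, not the preregistered implementation):

```python
import numpy as np

def reject_segment(x, fs=512, window_ms=150, max_pp_uv=100.0, max_step_uv_ms=50.0):
    """Flag a single-channel segment (in microvolts) if (i) any window of
    `window_ms` contains a peak-to-peak voltage difference above
    `max_pp_uv`, or (ii) any sample-to-sample voltage step exceeds
    `max_step_uv_ms` (microvolts per millisecond)."""
    w = int(fs * window_ms / 1000)
    peak_to_peak = any(
        x[i:i + w].max() - x[i:i + w].min() > max_pp_uv
        for i in range(len(x) - w + 1)
    )
    step = np.abs(np.diff(x)).max() * fs / 1000 > max_step_uv_ms
    return peak_to_peak or step
```

A flat or slowly drifting segment passes, while a sudden 150 μV jump is rejected by both criteria.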
Data from five participants were excluded from the analysis due to technical problems with the recording. The average comprehension question accuracy of the participants who were included in the analysis was 90% (range: 77-100%). We lost data from 0.51% of trials due to technical problems, and we rejected 5% of trials due to large voltage differences. The data, code, and scripts of this study can be found at https://osf.io/w85gc/.

Preregistered data analysis
Taking our literature review as a starting point, the pre-registered analyses focus on two ERP effects that have been associated with predictions: (i) the so-called N400 effect, that is, a negativity with a centroparietal distribution over the electrode sites Cz, CP1, CP2, P3, Pz, P4, and POz, analyzed here in the 250-600 ms time window; and (ii) a late positivity over the electrode sites F7, F3, Fz, F4, F8, FC5, FC1, FC2, FC6, C3, and C4, analyzed here in the 500-900 ms time window. Given the variability in the time windows reported in the literature (see Table 1), we chose a rather wide time window for both effects. Notice that even though there is an overlap between the time windows, we expect the effects in different models (see also section 1.2).
For the data analysis, we define the ERP amplitudes as the average scalp potential over the pre-specified time windows and electrode sites for each trial (as was done in, for example, Frank et al., 2015;Nieuwland et al., 2018). This allows us to use a Bayesian linear mixed model of the single-trial data and to simultaneously model variance associated with each subject and with each item.
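Computing the single-trial dependent variable then amounts to averaging over the window's channels and samples; a toy sketch (the array layout and channel list are our assumptions, not the actual analysis code):

```python
import numpy as np

def trial_amplitude(eeg, channels, roi, t0_ms, t1_ms, fs, baseline_ms=100):
    """Mean amplitude per trial over a spatiotemporal window.
    `eeg` has shape (trials, channels, samples); samples start at the
    beginning of the `baseline_ms` pre-stimulus interval, and
    t0_ms/t1_ms are relative to stimulus onset."""
    idx = [channels.index(ch) for ch in roi]
    s0 = int((baseline_ms + t0_ms) * fs / 1000)
    s1 = int((baseline_ms + t1_ms) * fs / 1000)
    return eeg[:, idx, s0:s1].mean(axis=(1, 2))  # one value per trial

# Toy data: 40 trials, 8 channels, 1 s epochs at 512 Hz.
fs = 512
channels = ["Cz", "CP1", "CP2", "P3", "Pz", "P4", "POz", "F3"]
eeg = np.random.default_rng(1).normal(size=(40, len(channels), fs))
n400_roi = ["Cz", "CP1", "CP2", "P3", "Pz", "P4", "POz"]
a_n400 = trial_amplitude(eeg, channels, n400_roi, 250, 600, fs)
```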
We use a Bayesian data analysis approach implemented in the probabilistic programming language Stan (Stan Development Team, 2018b) using the model wrapper package brms (Bürkner, 2017, 2018) in R (R Core Team, 2018).⁶ An important motivation for using the Bayesian approach is that it is easy to fit fully hierarchical models with fully specified variance-covariance matrices for all random effects, which provide the most conservative estimates of uncertainty (Schielzeth and Forstmeier, 2009). We use regularizing priors for all our parameters, except for the parameters of interest in the models used for the Bayes factor calculation; see below. Regularizing priors are minimally informative and have the objective of yielding stable inferences (Chung et al., 2013; Gelman et al., 2008; Gelman et al., 2017). We fit the models with four chains and 3000 iterations each, 1000 of which were the burn-in or warm-up phase. In order to assess convergence, we verified that there were no divergent transitions, that the R̂ values (the ratios of between- to within-chain variance) were close to one, and that the effective sample sizes were at least 10% of the number of post-warmup samples (i.e., at least 300 out of 3000); we also visually inspected the chains. All these convergence checks seemed adequate for the models reported in this paper, but some variance parameters had a relatively low effective sample size (the lowest was 488); see the complete models in https://osf.io/w85gc/. Nicenboim and Vasishth (2016) discuss the Bayesian approach in detail in the context of linguistic and psycholinguistic research. We report mean estimates and 95% quantile-based Bayesian credible intervals. A 95% Bayesian credible interval has the following interpretation: it is an interval containing the true value with 95% probability (see, for example, Jaynes and Kempthorne, 1976; Morey et al., 2016).
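The between- to within-chain comparison behind R̂ can be illustrated with a basic (non-split, non-rank-normalized) version of the statistic; the actual brms/Stan diagnostics are more elaborate:

```python
import numpy as np

def rhat(chains):
    """Basic potential scale reduction factor for draws of shape
    (n_chains, n_draws): the square root of the ratio of a pooled
    variance estimate to the mean within-chain variance. Values close
    to one indicate that the chains are mixing."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()
    between = n * chain_means.var(ddof=1)
    pooled = (n - 1) / n * within + between / n
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 2000))                      # well-mixed chains
stuck = mixed + np.array([[0.0], [0.0], [0.0], [5.0]])  # one shifted chain
```

For well-mixed chains R̂ is close to one; a chain stuck in a different region inflates the between-chain variance and pushes R̂ well above one.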

Additional (non-preregistered) inference criteria
In the pre-registration, we intended to carry out inference using the posterior probability of the estimate being positive given the data (P(β > 0|data)). However, we realized later that this quantity can be confused with the evidence against β = 0. Given that the research question is often framed in terms of the presence or absence of an effect, we use the Bayes factor to quantify the extent to which the data support a null model (M0), which assumes that a particular parameter of interest, β, is 0, over an alternative model (M1), which assumes that the parameter is not zero. For example, if we want to test the effect of predictability, our alternative model would contain the parameter representing the effect of predictability, and the null model would not have this parameter.
The Bayes factor quantifies the weight of the evidence in favor of the null model M0 relative to some alternative model. The Bayes factor is similar to the likelihood ratio test, with the difference that the likelihood is integrated over the parameter space defined by the priors, instead of being maximized. In Bayes factor computations, the priors can be very influential; vague or uninformative priors on the parameter of interest can strongly bias the Bayes factor in favor of the null model (Lee and Wagenmakers, 2014). Consequently, for each model, we carry out a sensitivity analysis, showing a range of Bayes factors obtained by keeping the model specification constant but defining increasingly informative priors for the parameter of interest (e.g., the effect of predictability on the amplitude of the N400). Defining increasingly informative priors amounts to defining smaller standard deviations for the relevant prior. The different Bayes factors BF01 represent the evidence against the alternative model (or in favor of the null), contingent on the magnitude of the effect assumed a priori via the priors. In the Bayes factor plots, we indicate what we consider well-calibrated priors, namely, priors that are agnostic regarding the direction of the effect, but relatively informative regarding the size of the possible effect. For these well-calibrated priors we used Normal(0, SD) with a standard deviation SD between 0.10 and 0.50. These standard deviations are based on the N400 estimates at the noun region in Nieuwland et al. (2018).
Bayes factor values BF01 larger than one favor the null model, and values below one favor the alternative model. Values close to one indicate that the evidence is inconclusive (see Wagenmakers et al., 2010). In order to use the Bayes factor to make decisions about the evidence for or against a particular hypothesis, Jeffreys (1939/1998) suggested approximate criteria; these should not be regarded as hard and fast rules with clear boundaries. Here, we assume that we are comparing two models, labeled M0 and M1.
To calculate the Bayes factor, we used bridge sampling (Bennett, 1976;Meng and Wong, 1996;Gronau et al., 2017) with eight chains and 30,000 iterations, 1000 of which were the warm-up phase.
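The sensitivity of BF01 to the prior standard deviation can be seen in a simplified setting. The sketch below is not the bridge-sampling computation used for the actual models; it assumes the effect estimate is approximately normal, which makes both marginal likelihoods available in closed form:

```python
import math

def bf01_normal(effect_hat, se, prior_sd):
    """BF01 for H0: beta = 0 vs. H1: beta ~ Normal(0, prior_sd), given an
    estimate effect_hat that is approximately Normal(beta, se^2).
    Under H0 the marginal likelihood of effect_hat is Normal(0, se^2);
    under H1 it is Normal(0, se^2 + prior_sd^2)."""
    def normal_pdf(x, sd):
        return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return normal_pdf(effect_hat, se) / normal_pdf(
        effect_hat, math.sqrt(se ** 2 + prior_sd ** 2))

# Sensitivity analysis for a small, uncertain effect: the wider the
# prior on beta under H1, the more BF01 favors the null.
bfs = [bf01_normal(0.17, 0.17, sd) for sd in (0.1, 0.5, 2.0)]
```

This is why vague priors on the parameter of interest bias the comparison toward the null, and why the text reports Bayes factors across a range of prior standard deviations.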
2.5.2.1. Effect of the predictions at the determiner. We fit a linear mixed model to the average amplitude of the N400 spatiotemporal window with respect to the determiner using the predictor gender predictability and the nuisance predictor gender. We used the data from the same determiners pooled from items with a constraining context and a high probability gender, and from items with a non-constraining context and a low probability gender: "einen" in examples (6a) and (7a). Gender predictability (pred in the model) is the scaled, smoothed gender cloze probability transformed to log2 (log to the base two). We use additive smoothing (also called Laplace or Lidstone smoothing; Lidstone, 1920; Chen and Goodman, 1999) with pseudocounts set to one. This means that the smoothed probability is calculated as the number of responses with a given gender plus one, divided by the total number of responses plus two. The effect of predictability is thus assumed to be continuous and non-linear. The estimate represents the change in amplitude for a one-point change on this non-linear scale: to make this concrete, it can be understood as the average change in amplitude when one compares a determiner with a gender probability of 0.74 with a determiner with a gender probability of 0.26, but also when one compares 0.26 with 0.09. Notice that we use gender probability rather than determiner probability based on the idea that the pre-activation might be at the feature level rather than lexical (Rabovsky and McRae, 2014). However, we cannot disentangle these two possibilities because of the high correlation (0.77) between determiner cloze probability and gender probability (but see Fleur et al., 2019, for an experimental design where word probability and gender probability are disentangled).
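The smoothing step can be illustrated as follows (a toy sketch; in the analysis the log probabilities are additionally centered and scaled):

```python
import math

def smoothed_log2_prob(count, total, pseudocount=1, outcomes=2):
    """Additive (Laplace/Lidstone) smoothing of a cloze probability,
    then log to the base two. With pseudocount 1 and two outcomes this
    is (count + 1) / (total + 2), as described in the text; zero counts
    therefore still receive a finite log probability."""
    p = (count + pseudocount) / (total + pseudocount * outcomes)
    return math.log2(p)
```

For instance, with 20 cloze responses, a gender produced 15 times gets log2(16/22), and a gender never produced gets log2(1/22) instead of an undefined log2(0).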
The motivation for using log-transformed cloze probability is twofold: First, there is evidence, at least from reading time experiments, that log-transformed probability shows a better fit to the data than other transformations, including no transformation (Smith and Levy, 2013). Second, log-transformed probabilities have been shown to be useful in accounting for reading time data (among others: Levy, 2008; Hale, 2001) and EEG/ERP data (Frank et al., 2015; Delaney-Busch et al., 2017). Indeed, model comparison using the EEG data of the noun region of Nieuwland et al. (2018) shows support for the use of log-transformed cloze probabilities; see Appendix A. Gender is a Helmert-coded predictor (Schad et al., 2018) that compares masculine vs. feminine, and masculine together with feminine vs. neuter (gender). We did not have specific predictions for the effect of the gender predictor; we chose Helmert coding here because it gives us orthogonal comparisons. For this model and the following ones, we specify the following weakly informative (regularizing) priors in brms: reg_priors <- c(prior(normal(0, 10), class = Intercept), prior(normal(0, 1), class = b), prior(normal(0, 10), class = sd), prior(normal(0, 10), class = sigma), prior(lkj(2), class = cor)). We specify the model in brms as follows: brm(A_N400 ~ pred + gender + (1 | item) + (pred + gender | subj), prior = reg_priors, data_a, …). In the code above, … refers to some other arguments that are needed for sampling.

Effect of the disconfirmation of strong predictions at the determiner.
We fit a linear mixed model, similar to the previous one, to the average amplitude of the late positivity with respect to the determiner. We use the predictor gender context constraint (constr) and the nuisance predictors gender predictability (pred) and gender. We used the data from the same low-probability-gender determiners pooled from items with a constraining context and from items with a non-constraining context: "ein" in examples (6b) and (7b). Gender context constraint (constr) is operationalized as the best gender completion, that is, the highest cloze probability of a gender for a given context; it is calculated as the highest number of responses for a given gender divided by the total number of responses, and then scaled. In contrast to predictability, and since there are no data in favor of (or against) log-transforming best-completion values, we left them on the raw scale. We specify this model in brms as follows: brm(A_PNP ~ constr + pred + gender + (1 | item) + (constr + pred + gender | subj), prior = reg_priors, data_b, …)

Effect of the predictions at the noun.
We fit a linear mixed model similar to the one presented in 2.5.2.1 using noun cloze probability instead of gender probabilities. The linear mixed model is fit to the average amplitude of the N400 spatiotemporal window with respect to the target noun using the predictor noun predictability (pred_noun), which is the scaled, smoothed noun cloze probability transformed to log to the base two, taking into account the context that includes the gender information. We used the data from the same nouns pooled from conditions (a): "Korb", "basket", in examples (6a) and (7a). We specify this model in brms as follows: brm(A_N400 ~ pred_noun + (1 | item) + (pred_noun | subj), prior = reg_priors, data_a_noun, …)

Table 4. Mean and 90% quantiles of the noun probability for each condition for sentences truncated before the noun. The column called noun cloze probability refers to the cloze probability of the actual noun that completed the sentence in the experimental items. The column called noun best completion shows the cloze probability of the most likely noun continuation for a given context.

Fig. 1. Figure (A) shows the posterior distributions of the effect of predictability on the N400 spatiotemporal window for the same determiners pooled from items with a constraining context and a high probability gender, and from items with a non-constraining context and a low probability gender; see examples (6a) and (7a).

Effect of the disconfirmation of "strong" predictions at the noun.
For completeness, we present here one model that was preregistered but that might be problematic because of the restricted range of noun constraint values in our stimuli; see Table 4. This is because noun cloze probabilities and best completions are updated given the entire context in which the noun appears, which includes gender information. For this model, we fit a linear mixed model to the average amplitude of the late positivity with respect to the target noun using the predictors context constraint and noun predictability. We used the data from the same nouns pooled from conditions (b): "Tuch", "cloth", in examples (6b) and (7b). Context constraint (constr_noun) is operationalized as the best noun completion, that is, the highest cloze probability of a noun for a given context; it is calculated as the highest number of responses for a given noun divided by the total number of responses, and then scaled. We specify this model in brms as follows: brm(A_PNP ~ constr_noun + pred_noun + (1 | item) + (constr_noun + pred_noun | subj), prior = reg_priors, data_b_noun, …)

Results of the preregistered analysis at the determiner
Results were inconclusive for the preregistered models at the determiner region. For the determiners pooled from conditions (a) of items with a constraining context (i.e., high probability gender) and of items with non-constraining contexts (i.e., low probability gender), the estimate for the effect of predictability on the N400 spatiotemporal window was β = 0.17 μV, 95% CrI = [−0.17, 0.51]; the signal was more negative for more unexpected genders, although the estimate has a large degree of uncertainty. For determiners pooled from conditions (b) of items with a constraining context and of items with non-constraining contexts (low probability gender in both cases), the estimate for the effect of the constraint of the context on the late positivity was β = 0.03 μV, 95% CrI = [−0.17, 0.23]; the signal was more positive when the context was more constraining (controlling for gender probability), although this estimate again has a large degree of uncertainty. Analyses using the Bayes factor show that for well-calibrated priors (Normal(0, SD) with SD between 0.10 and 0.50) that represent realistic effect sizes, the evidence ranges between inconclusive and weakly supporting the null; see Fig. 1. By-condition averaged ERP waveforms on determiners for both preregistered models are shown in Figs. 2 and 3 and in Appendix C (Figures C1 and C2); scalp topographies are shown in Fig. 4.

Fig. 2. ERP for the grand average across the N400 spatial window (Cz, CP1, CP2, P3, Pz, P4, POz) elicited by the presentation of the same determiners pooled from items with a constraining context and a high probability gender, and from items with a non-constraining context and a low probability gender; see examples (6a) and (7a). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).

Results of the preregistered analysis at the noun
Results were also inconclusive for the preregistered models at the noun region. First, there is no clear N400 effect of predictability at the noun (6a-7a): β = 0.05 μV, 95% CrI = [−0.2, 0.3]; the signal was, as expected, more negative for more unexpected nouns, although the estimate again has a large degree of uncertainty. Analyses using the Bayes factor show that for well-calibrated priors, the evidence is between inconclusive and weakly supporting the null; see Fig. 5. By-condition averaged ERP waveforms on nouns following a constraining context are shown in Fig. 6 and in Appendix C (Figure C3); the scalp topography of the difference between conditions is shown in Fig. 7(A). We speculate that this could be due to the fact that the non-constraining context was in fact somewhat constraining after the determiner and adjective were read. Both types of context use similar words and, for most experimental items, the main difference between the contexts is either whether an event happens or does not happen (e.g., "the grandmother wanting to pick mushrooms in the forest" vs. "the grandmother not wanting to pick mushrooms in the forest") or the temporal order of events (e.g., "someone wanting to buy sport shoes" vs. "someone having already bought sport shoes"). For this reason, both types of context are likely to be semantically associated with the critical noun. Kuperberg (2016) argues that in the presence of multiple competing cues, those combinations of cues that provide more reliable evidence for a prediction will outweigh other combinations, including the full set of contextual cues provided by the semantic and syntactic structure. This entails that comprehenders may temporarily ignore the syntactic structure of the non-constraining (or somewhat constraining) context, reducing the mismatch of the low cloze probability noun that follows it. In fact, there is evidence that under some configurations N400 amplitudes depend only on semantic associations between the target word and the context, without regard to the full syntactic and semantic structure (e.g., in negation: Fischler et al., 1983; in atypical thematic roles: Chow et al., 2016), and that negative quantifiers (such as few) pose a particular challenge to incremental comprehension in plausible but low-cloze continuations (Nieuwland, 2016).

Fig. 3. ERP for the grand average across the late positivity spatial window (F7, F3, Fz, F4, F8, FC5, FC1, FC2, FC6, C3, C4) elicited by the presentation of the same determiners with low probability gender pooled from items with a constraining context and from items with a non-constraining context; see examples (6b) and (7b). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).

Fig. 4. Figure (A) shows a topographic plot with the difference between ERP amplitudes in the time window of 250-600 ms following the presentation of the same determiners pooled from items with a constraining context and a high probability gender, and from items with a non-constraining context and a low probability gender; see examples (6a) and (7a). Figure (B) shows a topographic plot with the difference between ERP amplitudes in the time window of 500-900 ms following the presentation of the same determiners with low probability gender pooled from items with a constraining context and a non-constraining context; see examples (6b) and (7b).
Second, the linear mixed model investigating the effect of constraint at the noun region for the late positivity shows puzzling results: the model shows an effect of constraint in an unexpected direction; as the constraint of the context increases, the average signal becomes more negative: β = −0.27 μV, 95% CrI = [−0.53, 0.0032]. By-condition averaged ERP waveforms on nouns following a constraining context are shown in Fig. 8 and in Appendix C; the scalp topography of the difference between conditions is shown in Fig. 7(B). Analysis with the Bayes factor shows weak to no support against the null model; see Fig. 5(C) and (D). In addition, the more predictable the presented noun, the more positive the average signal: β = 0.089 μV, 95% CrI = [−0.17, 0.35]. As we mentioned before, one possibility is that this manipulation did not work as we expected because of the restricted range of noun constraint values in our stimuli; see Table 4. Once the gender of the upcoming noun is known, the predictability of potential upcoming nouns changes, which means that the stimuli were not built appropriately for this comparison. We therefore avoid overinterpreting the effect of the constraint of the context at the nouns.

Exploratory analysis
While the preregistered analysis at the determiner showed no evidence for an effect, it could be the case that the effect of the manipulation was not strong enough. One reason to suspect this is that the non-constraining contexts were at least slightly constraining, and their completions not entirely unexpected. Indeed, the difference between the mean cloze probabilities of the gender of the determiners in conditions (a) following a constraining context vs. those in conditions (a) following a non-constraining context is 0.31. In contrast, the difference between the mean cloze probabilities of the gender of the determiner following constraining contexts in conditions (a) vs. (b) is 0.72. While there is a lot of variability among the items, and the model included cloze probability as a continuous predictor (and not as a factorial comparison), it might be the case that a larger range of cloze probabilities would reveal an effect. Another potential issue is that, to keep identical target regions in each item, we relied on varying contexts. This means that the brain activity before the determiner is not identical (whether or not there is prediction), and this might increase the noise in the signal.
Given that a comparison holding the (constraining) context fixed and varying the completion is the most common in the literature, and the one present in all the studies of Table 1, we also explored the effect of predictability under a constraining context using gender probability as a predictor (and gender as a nuisance predictor as before) on the N400 spatiotemporal window. While we compare different determiners in this model, some of the idiosyncratic differences between determiners should be captured by the gender predictor and its interaction with predictability. We specify this model in brms as follows: brm(A_N400 ~ pred * gender + (1 | item) + (pred * gender | subj), prior = reg_priors, data_Cab, …). Another possibility is to treat each context as an item and explore the effect of predictability for our entire experimental stimuli. This is the same model as before, but replacing item with context and run on the entire data set for the determiner region: brm(A_N400 ~ pred * gender + (1 | context) + (pred * gender | subj), prior = reg_priors, data, …). Furthermore, as a sanity check that our experiment can elicit an N400 effect, we fit another model using only data from the target nouns appearing in the constraining context. We specify the model in brms as follows: brm(A_N400 ~ pred_noun + (1 | item) + (pred_noun | subj), prior = reg_priors, data_noun_Cab, …). We include two more exploratory analyses, focusing on the adjective region and on the determiner but replacing gender probability with determiner probability, in Appendix B.

Results of the exploratory analysis
At the determiner region, the results were still inconclusive for the exploratory models. The estimate for the effect of predictability on the N400 spatiotemporal window for the determiners pooled from conditions (a) and (b) of items with a constraining context was β = 0.11 μV, 95% CrI = [−0.0069, 0.23], and the effect of predictability on the N400 spatiotemporal window for the determiners pooled from all the conditions, treating context as item, was β = 0.12 μV, 95% CrI = [0.015, 0.22]; the signal was more negative for more unexpected genders, although the estimate has a large degree of uncertainty in both cases. Analysis using the Bayes factor for both models shows that the evidence is inconclusive, slightly favoring the null for a priori larger effect sizes (Fig. 9A and C); notice that this is the case even though the 95% credible interval of the model that pools all the items does not overlap with zero (see Rouder et al., 2018, for a discussion of the difference between estimation and Bayes factor approaches). By-condition averaged ERP waveforms on determiners following a constraining context are shown in Figures C5 and 10, and the scalp topography of the difference between conditions is shown in Fig. 11.
By contrast, at the noun region there is a clear N400 effect: β = 0.57μV, 95% CrI = [0.42,0.72]. Analyses using the Bayes factor show that for well-calibrated priors that represent realistic effect sizes, there is very strong evidence for an effect; see Fig. 12. By-condition averaged ERP waveforms on nouns following a constraining context are shown in Figures C6 and 13, and the scalp topography of the difference between conditions is shown in Fig. 14.

Discussion
The results from our preregistered and exploratory models show no clear support for the access view, which posits that before a word is read, its context triggers the pre-activation of semantic features (and perhaps phonological and grammatical features as well). We found no evidence for an N400 effect or a late positivity due to differences in predictability or in the level of constraint of the context, respectively. Thus, from our experiment alone, we should conclude that there is no evidence for the pre-activation of features of expected nouns, measured on gender-marked articles in German. Given that results from recent studies are either inconclusive (Kochari and Flecken, 2019; Ito et al., 2017a) or weakly support the null (Nieuwland et al., 2018), if there is such pre-activation, its effect is too small to be detected reliably by a single experiment, even if the sample size is very large by EEG standards (e.g., 334 participants in Nieuwland et al., 2018). Furthermore, a self-paced reading study with a relatively large sample size (120 participants and 60 items) has shown that unexpected gender marking on determiners in Spanish does not lead to a slowdown in comparison to expected gender marking; thus, if pre-activation of gender marking occurs, it may have no measurable impact on reading times (Guerra et al., 2018).

Fig. 6. ERP for the grand average across the N400 spatial window (Cz, CP1, CP2, P3, Pz, P4, POz) elicited by the presentation of the same nouns pooled from items with a constraining context and a high probability gender, and from items with a non-constraining context and a low probability gender; see examples (6a) and (7a). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).
However, as in recent replication attempts (Nieuwland et al., 2018; Ito et al., 2017a; Kochari and Flecken, 2019), the patterns we found are in the expected direction in terms of polarity: the more unexpected the gender of the determiner, the larger the negativity of the average amplitude over the typical electrodes and time window of the N400 effect. It is therefore unclear from a qualitative literature review alone how strong the evidence is in favor of (or against) pre-activation.

Fig. 7. Figure (A) shows a topographic plot with the difference between ERP amplitudes in the time window of 250-600 ms following the presentation of the same critical nouns pooled from items with a constraining context and a high probability cloze, and from items with a non-constraining context and a low probability cloze; see examples (6a) and (7a). Figure (B) shows a topographic plot with the difference between ERP amplitudes in the time window of 500-900 ms following the presentation of the same critical nouns with low probability cloze pooled from items with a constraining context and a non-constraining context; see examples (6b) and (7b).

Fig. 8. ERP for the grand average across the late positivity spatial window (F7, F3, Fz, F4, F8, FC5, FC1, FC2, FC6, C3, C4) elicited by the presentation of the same nouns with low probability cloze pooled from items with a constraining context and from items with a non-constraining context; see examples (6b) and (7b). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).
In order to quantitatively synthesize the results of the aforementioned studies and the current one, we conducted a Bayesian random effects meta-analysis. A random effects meta-analysis assumes that there is a unique underlying effect (i.e., the effect of pre-activation on the N400 spatiotemporal window) to be estimated from all the studies, while at the same time it allows for the estimates in the individual studies to vary because of differences between the specific studies and because of sampling variability.
It would also be possible to conduct a frequentist random effects meta-analysis instead of a Bayesian one to synthesize the evidence from multiple studies. A frequentist meta-analysis would certainly be an improvement over examining whether the p-value of each study is below 0.05. However, the end result of a frequentist meta-analysis is a maximum likelihood estimate with its confidence interval and/or a p-value. Reliance on p-values has been criticized on several grounds: it has been argued that p-values promote the dichotomization of evidence into the artificial categories of "statistically significant" and "not statistically significant", which encourages erroneous scientific reasoning, and that, in addition, p-values are usually misinterpreted (among others: Cumming, 2012; Vasishth and Nicenboim, 2016; McShane et al., 2019). Even though some frequentist approaches do discourage examining whether p-values are below 0.05 (e.g., the new statistics of Cumming, 2012; the continuously cumulating meta-analysis of Braver et al., 2014) and instead advocate employing only continuous descriptive criteria such as effect size estimates and their confidence intervals, confidence intervals are also often misinterpreted (Hoekstra et al., 2014). Morey et al. (2016) identify several common misconceptions regarding confidence intervals. In particular, the following two properties, often wrongly attributed to confidence intervals, hold for Bayesian credible intervals but not for frequentist confidence intervals: (i) a confidence interval cannot be used to assess the certainty with which a parameter lies in a particular range, and (ii) there is no necessary connection between the precision of an estimate and the size of a confidence interval.
In contrast, a Bayesian meta-analysis inherits all the desirable characteristics of the Bayesian framework, such as the possibility of yielding (i) credible intervals to quantify the uncertainty (given the data and model) over a parameter and (ii) Bayes factors to quantify the weight of the evidence in favor of the null model relative to a specific alternative model.

Meta-analysis
We conducted a Bayesian random-effects meta-analysis that allows for heterogeneity between the different studies. That is, each individual study enters the model through its estimated effect and its estimated standard error (see also Vasishth et al., 2013; Jäger et al., 2017). Given that there might be differences between studies manipulating the phonological features of the determiner (Nieuwland et al., 2018; Ito et al., 2017a) and those manipulating gender marking (Kochari and Flecken, 2019, and the current study), we included a sum-coded predictor that indicates the type of manipulation (1 for phonological features and −1 for gender marking).
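This type of model can be sketched as follows (a standard random-effects formulation; the notation here is ours). The observed estimate of study i, β̂_i, with standard error SE_i, is assumed to be generated as

β̂_i ~ Normal(θ_i, SE_i²),    θ_i ~ Normal(μ + γ·x_i, τ²),

where x_i is the sum-coded manipulation-type predictor (1 for phonological features, −1 for gender marking), μ is the overall meta-analytic effect, γ is the adjustment due to manipulation type, and τ quantifies between-study variability.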
To anticipate our conclusion: even though the studies, analyzed individually, were "inconclusive", as we show below, the meta-analysis does present evidence in favor of an effect consistent with the pre-activation account, albeit a very small one.

Estimates of the individual studies
For the meta-analysis, we included the publicly available data from Nieuwland et al. (2018), Ito et al. (2017a), and Kochari and Flecken (2019), and a subset of our own data (only data from items with constraining contexts). We reanalyzed the publicly available data using Bayesian linear mixed models that were as similar as possible: for all the studies, the predictor labeled predictability is smoothed, transformed to log base two, and then scaled using the same mean and standard deviation as in our data.7 This means that for all the studies the estimate of the effect of predictability represents the change in the average EEG amplitude for a change of one point on the same logarithmic scale. It is, for example, the average change in amplitude when one compares a determiner with a cloze probability of 0.74 with a determiner with a cloze probability of 0.26. In the case of Nieuwland et al. (2018) and Ito et al. (2017a), predictability refers to the log cloze probability of the article form ("a" vs. "an"). In contrast, in the case of Kochari and Flecken (2019) (and our own study), predictability refers to the log cloze probability of the gender of the upcoming noun. For the cloze task data of Kochari and Flecken (2019), we determined the gender of the noun phrases using frog (Desmet and Hoste, 2014) and udpipe (Wijffels, 2019). Since the studies included had only two conditions with a constraining context, for the estimate of our own study we also use data only from items with constraining contexts, that is, the estimate from our exploratory analysis (β = 0.11 μV, 95% CrI = [−0.0069, 0.23]).
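As an illustration of this transformation, the predictor can be computed from cloze counts as sketched below. Only the log-base-2 transform and the scaling by a common reference mean and SD are as described above; the add-one smoothing constant, the function name, and the numbers in the example are our own illustrative assumptions, not the exact procedure of the published analyses.

```python
import math

def scaled_log2_cloze(k, n, ref_mean, ref_sd):
    """Illustrative transformation of cloze counts into the predictability
    predictor: smooth, take log base two, then scale with a reference
    mean/SD so that all studies share the same scale.
    k: number of cloze responses matching the gender/form; n: total responses.
    The add-one smoothing is a hypothetical choice to avoid log2(0)."""
    p_smoothed = (k + 1) / (n + 2)       # smoothing: no probability is exactly 0 or 1
    log2_p = math.log2(p_smoothed)       # log base two, as in the text
    return (log2_p - ref_mean) / ref_sd  # scale with reference mean and SD

# Hypothetical example: a gender produced by none of 30 cloze respondents,
# with an assumed reference mean of -2.0 and SD of 1.0:
z = scaled_log2_cloze(0, 30, ref_mean=-2.0, ref_sd=1.0)
```

A one-unit change in this scaled predictor then corresponds to the same change in log cloze probability across all the reanalyzed datasets.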
For Ito et al. (2017a), we used their data from native speakers averaged across the central parietal electrodes (C1, Cz, C2, CP1, CPz, CP2, P1, Pz, P2, PO3, POz, and PO4) in a time window of 250-400 ms after the determiner for each trial (https://osf.io/5m4br). We fit the data of their two experiments with a single model, where we assume heterogeneity of variance across the experiments, which were run with different modes of presentation (i.e., both experiments can have different variances). The model is specified in brms as follows:

Fig. 10. ERP for the grand average across the N400 spatial window (Cz, CP1, CP2, P3, Pz, P4, POz) elicited by the presentation of the critical determiners with high and low probability genders from items with a constraining context; see examples (6a) and (6b). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).

Results of the meta-analysis
The meta-analysis shows a relatively clear effect of predictability: β = 0.11 μV, 95% CrI = [0.05, 0.16], and no clear suggestion of a difference between studies that manipulated gender and studies that manipulated phonological features: β = 0.0088 μV, 95% CrI = [−0.046, 0.064]. Fig. 15 shows the 95% credible interval of the meta-analytic estimate (at the top), and of the non-pooled and partially-pooled estimates of the original studies, that is, the 95% credible intervals estimated either without taking into account the other studies or as part of the random-effects meta-analysis. The Bayes factor analysis shows that, for well-calibrated (i.e., informative) priors, there is evidence for an overall effect in the expected direction: the more unexpected the determiner, the larger the average amplitude in the N400, for the given time window and electrodes. As Fig. 16 shows, if we assume a priori that the effect is relatively small (but not too small, since priors with an SD lower than 0.1 show between weak support for the alternative and some support for the null), the data furnish strong evidence for an effect. Conversely, if we assume a priori that the effect could be large, the evidence is weaker.

General discussion
We conducted a study investigating pre-activation, capitalizing on the fact that all German nouns have a gender (masculine, feminine, or neuter) and that their preceding determiners can show an unambiguous gender marking when the noun phrase has accusative case. We used a relatively large sample size (N = 120). Nevertheless, both our preregistered and our exploratory analyses failed to yield conclusive evidence for an effect of pre-activation (or for the absence of one). However, the sign of the effect is in the expected direction: the more unexpected the gender of the determiner, the larger the negativity in the average amplitude of the N400 spatiotemporal window. This also seems to be the case for the recent replication attempts (Nieuwland et al., 2018; Ito et al., 2017a; Kochari and Flecken, 2019).
We see two possible explanations for the small N400 effect at the determiner:

1. Under the view that the N400 indexes the prediction error for only semantic features, as suggested by Rabovsky and McRae (2014) and Kuperberg (2016), we would interpret the N400 effect at the determiner as follows. Prior to the determiner, the context pre-activates semantic features to different levels and, incidentally, the noun with the highest cloze probability (e.g., basket.masc) is the word whose features overlap the most with the most highly pre-activated features in memory. Seeing a determiner with a grammatical gender (or a phonological feature in English) that is incompatible (e.g., a.fem) with the word with the highest cloze probability changes the context for the upcoming noun, and thus provokes an update of the activation of semantic features in memory. The difference between the previous levels of activation and the updated ones would correspond to the N400 effect.

2. An alternative that may explain the very small effect is that the prediction error is not limited to semantic features but extends to grammatical and phonological features. A very small effect at the determiner may indicate that the mismatch in grammatical gender (or phonological) features does not affect other semantic features that are also activated: for example, an unexpected gender at a German determiner can still lead to a different noun with a similar meaning (i.e., similar semantic features) as the most predictable word (e.g., Couch and Sofa have different genders in German). Even though the N400 has been linked to access to semantic memory, there is evidence of purely phonological features playing a role: words that do not rhyme with a prime word produce a greater negativity in the N400 time range than rhyming words (Praamstra and Stegeman, 1993; Perrin and García-Larrea, 2003).
Furthermore, although the N400 has been generally characterized as purely semantic, there is evidence that a mismatch in grammatical features (such as argument structure) can elicit an N400 effect (Chow et al., 2016); the N400 has also been found when syntactic reanalysis occurs (Bornkessel et al., 2004). A theoretical advantage of assuming that feature activation is not restricted to semantic features is that this extension allows for a single mechanism responsible for memory access both for prediction generation and for the creation of dependencies with previous words in the sentence (as in Lewis and Vasishth, 2005; Engelmann et al., 2019; Jäger et al., 2019; Vasishth et al., 2019).
Interestingly, preliminary work by Fleur et al. (2019) points in the direction of a possible dissociation between the prediction of gender and phonological features (namely, the lexical form of the determiner). However, without formal computational modeling it is not trivial to distinguish whether the N400 at the determiner can be explained by the grammatical or phonological context updating the activation of semantic features (Rabovsky and McRae, 2014; Kuperberg, 2016), or by a direct change in the activation of semantic, grammatical, or phonological features (in line, to some extent, with Chow et al., 2016).
To sum up, our meta-analysis shows a relatively clear but very small effect in the direction consistent with the prediction error account (or access view), which assumes that before a word is read, its context leads to the probabilistic pre-activation of semantic (and maybe phonological and grammatical) features in memory. It is noteworthy that almost every experiment in our meta-analysis had a 95% credible interval overlapping zero, and that only very weak conclusions could be drawn from the individual studies in isolation. Given the small effect present in this type of experimental manipulation, it seems unlikely that a single study could provide closure on this topic. We were only able to gather more conclusive evidence once we quantitatively synthesized these same studies. This highlights the fact that science is and should be a cumulative enterprise, one which benefits from data and materials that are publicly available.

Fig. 16. Panel (A) shows the posterior distribution of the meta-analytic estimate of the effect of predictability at a determiner prior to a noun; a positive effect is consistent with the pre-activation account (more expected determiners lead to a reduction of the N400 effect). Panel (B) shows the Bayes factors in favor of the null (BF01) for different standard deviations (SD) of normally distributed priors centered at zero for the effect of predictability. The grey area indicates well-calibrated priors: Normal(0, SD) with SD between 0.10 and 0.50. A Bayes factor below one indicates evidence against the null.

Fig. 15. Forest plot of the estimates of the effect of predictability at a determiner prior to a noun; a positive effect is consistent with the pre-activation account (i.e., a more expected determiner leads to a reduction of the N400 effect). Horizontal lines represent 95% credible intervals.
The cross at the top of the plot represents the mean of the meta-analytic estimate, blue circles are the estimates reconstructed from the individual studies, and black triangles are the shrinkage estimates of the individual studies delivered by the random-effects meta-analysis (i.e., partially pooled estimates).

Potential concerns and limitations
There are some potential concerns and some limitations in our meta-analysis. We address these issues below:

Previous studies had much larger effects of the prenominal manipulation
A potential objection to our conclusion of a small effect of the prenominal manipulation across the studies of the meta-analysis is that it is hard to reconcile our results with studies that did show larger N400 effects with similar manipulations: the original study in English with the a/an manipulation (DeLong et al., 2005), some studies manipulating gender in Dutch (Otten et al., 2007; Otten and Van Berkum, 2009; and recently Fleur et al., 2019), and most studies manipulating gender in Spanish (Wicha et al., 2003a, 2003b; Foucart et al., 2014; Martin et al., 2018; but cf. Wicha et al., 2004). For English and Dutch (except perhaps Fleur et al., 2019; see below), it is likely that the larger effects are overestimates due to Type-M(agnitude) errors (Gelman and Carlin, 2014). This might be the case because the original studies had low power to detect small effects, and thus only overestimates could be detected and yield significant results. The fact that studies attempting to replicate these findings failed to find significant effects (and had much smaller estimates) supports this idea (e.g., Kochari and Flecken, 2019; Nieuwland et al., 2018). The results of Fleur et al. (2019) might be different, since in their experiment the larger N400 effect might have been driven by both the mismatch of gender and of lexical features (or lexical form).
Studies with Spanish speakers seem to show large effects relatively consistently. Possibly, the effects are genuinely stronger for the gender manipulation in Spanish. One explanation is that, due to the transparency of gender marking and the reliability of gender cues in Spanish, readers (and listeners) can rely more strongly on the gender feature, assigning more weight to it. With some exceptions, there are clear phonological patterns mostly associated with either feminine or masculine nouns (Beatty-Martínez and Dussias, 2019), and the identity of the determiner reliably indicates the gender of the following noun (except for the cases where the following noun starts with /a/; see Eddington and Hualde, 2015). The transparency of gender marking and the reliability of gender cues in Spanish seem to affect the acquisition of gender, making it easier to acquire than gender in German and Dutch (Eichler et al., 2012; Audring, 2014). The general reliability of gender cues and the transparency of gender marking might be relevant both to their acquisition and to their weight in predictive processes (as reflected in the amplitude of the EEG signal in the N400 spatiotemporal window when there is a mismatch). This could explain the large effects for Spanish and the small effect found in our experiment in German. Nouns in German are not as clearly gender-marked as in Spanish: there are semantic, phonological, pseudo-suffix, and suffix regularities on the basis of which gender can be assigned to German nouns, but there is a group of nouns that does not show any gender-marking regularities (Köpcke, 1982; Hohlfeld, 2006). Furthermore, while the critical regions in our experimental stimuli are singular accusative indefinite determiners, which are unambiguously gender-marked, this is not the case for the complete paradigm.
The complete determiner paradigm in German displays a high degree of homophony since gender marking in German is intertwined with case and number (Eichler et al., 2012). A more complete meta-analysis with studies in different languages that vary in the regularity of their gender-marking might elucidate this point.

The results are very sensitive to the prior used in the Bayes factor analysis
Regarding the sensitivity analysis for the Bayes factor, a potential objection was pointed out by a reviewer: doesn't varying the priors amount to simply fishing for an effect under different assumptions, analogous to frequentist analyses that use terms like "marginally significant" to claim significance? The goal behind varying the priors, however, is to provide a full picture of what the researcher should believe under different assumptions; i.e., under different priors. Our meta-analysis showed that only if we assume a priori that the effect is small do we have relatively strong evidence for an effect. Using informative vs. uninformative priors to understand the range of conclusions one can draw under different assumptions is one of the strengths of the Bayesian framework.
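This prior dependence can be illustrated with a simple analytic sketch. Under the (simplifying) assumption that the meta-analytic estimate is summarized by a normal likelihood with a known standard error, the Bayes factor comparing a point null against a Normal(0, SD) alternative has a closed form. The function name and the numbers below are our own illustrative choices, not the actual Bayes factor computation reported in the paper.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf01(estimate, se, prior_sd):
    """Analytic BF01 for H0: effect = 0 versus H1: effect ~ Normal(0, prior_sd),
    given a normal likelihood summary (estimate, se). Under H1 the marginal
    likelihood of the estimate is Normal(0, sqrt(se^2 + prior_sd^2)).
    Values below 1 favor the alternative."""
    marginal_null = normal_pdf(estimate, 0.0, se)
    marginal_alt = normal_pdf(estimate, 0.0, math.sqrt(se**2 + prior_sd**2))
    return marginal_null / marginal_alt

# Hypothetical summary: a small positive estimate with a small standard error.
# A moderate prior SD yields stronger evidence against the null than a very
# wide ("the effect could be large") prior, mirroring the pattern in Fig. 16:
bf_moderate = bf01(0.11, 0.03, prior_sd=0.3)
bf_wide = bf01(0.11, 0.03, prior_sd=5.0)
```

The sketch makes the logic of the sensitivity analysis concrete: widening the prior spreads its mass over implausibly large effects, which penalizes the alternative model and weakens the evidence for an effect.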

The meta-analysis should be treated as exploratory
Whereas the meta-analysis suggests evidence in favor of a very small effect and provides no evidence for systematic differences between the "a/an" manipulation and the gender manipulation in Dutch or German, the case is by no means closed. The current meta-analysis was based on only the few studies with publicly available data (Ito et al., 2017a; Nieuwland et al., 2018; Kochari and Flecken, 2019) and our own exploratory results, and only one covariate (manipulation type) was investigated. Thus, future work can still inform new and more comprehensive meta-analyses. It is certainly possible that our estimates are biased because the studies included are not representative. However, a meta-analysis is certainly an improvement over the conventional approach, which simply counts the number of significant and non-significant effects across studies in order to decide whether an effect is present or not. Even though the only compelling solution to the pervasive problem of bias is the use of large-scale preregistered replication attempts, a meta-analysis provides a valuable starting point and recommendations for future studies with respect to sample size and potential covariates (Van Elk et al., 2015). Ideally, a large number of studies should be investigated, and studies could be clustered according to their language and paradigm, leading to a precise estimate of both the commonality between the studies and the effect of different factors.

Conclusion
In closing, this study makes several novel contributions. First, this work is to our knowledge the first attempt at investigating the effect of gender on predictability in German. Second, the meta-analysis demonstrates a very important advantage of the Bayesian data analysis methodology: we can incrementally accumulate evidence to obtain increasingly precise estimates of the effect of interest. The approach we present, of assimilating existing evidence into the data analysis, has wide applicability in psycholinguistic research and has the potential to significantly advance our understanding of different phenomena of interest. Finally, although we cannot resolve the question of whether the "a/an" vs. gender manipulation at the determiner shows any difference in pre-activation, the evidence available so far indicates that regardless of the type of manipulation there is a small pre-activation effect at the determiner.

Model comparison between different measures of predictability as predictors for the average amplitude of the N400 spatiotemporal window
We compare here the differences in predictive accuracy between smoothed cloze probabilities, log-transformed cloze probabilities, and a factorial predictor (1 for high cloze and −1 for low cloze) using the noun region of Nieuwland et al. (2018). Since this dataset shows a clear effect of predictability, it might be useful for comparing different measures of predictability. We compare the out-of-sample predictive accuracy of the different models using the Pareto-smoothed importance sampling approximation to leave-one-out cross-validation implemented in the package loo (Vehtari et al., 2015, 2017). Table A1 shows the results of the comparison: there is a modest advantage in predictive accuracy for log-transformed cloze probability in comparison with both non-transformed cloze scores and a factorial predictor.
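For reference, the quantity being compared is the expected log pointwise predictive density under leave-one-out cross-validation (the standard definition; see Vehtari et al., 2017):

elpd_loo = Σ_{i=1}^{n} log p(y_i | y_{−i}),

where p(y_i | y_{−i}) is the predictive density for observation i under a model fit to all the data except that observation. PSIS approximates these n terms from a single model fit, using importance weights stabilized by Pareto smoothing, instead of refitting the model n times.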

Additional exploratory analyses
For completeness, we report here additional exploratory analyses that might be of interest.

Model for the adjective region with a constraining context. For this model, we analyzed the adjective region under a constraining context using gender probability as a predictor (and gender as a nuisance predictor, as before) on the average amplitude of the N400 spatiotemporal window. We specify the model in brms as follows:

brm(A_N400 ~ pred * gender + (1 | item) + (pred * gender | subj), prior = reg_priors, data_Cab_adjectives, …)

The estimate for the effect of predictability on the N400 for the adjectives pooled from conditions (a) …

Model for the determiner region with a constraining context using the cloze probability of the determiner as a predictor. For this model, we analyzed the determiner region under a constraining context using the cloze probability of the determiner, rather than its gender probability, as a predictor (and gender as a nuisance predictor, as before) on the average amplitude of the N400 spatiotemporal window. We specify the model in brms as follows:

brm(A_N400 ~ pred_det * gender + (1 | item) + (pred_det * gender | subj), prior = reg_priors, data_Cab_adjectives, …)

The estimate for the effect of predictability on the N400 for the determiners pooled from conditions (a) and (b) of items with a constraining context, using the cloze probability of the determiner as a predictor, was β = 0.084 μV, 95% CrI = [−0.043, 0.21].

Additional figures
For completeness, we report here additional figures.

Note (Table A1). The table is ordered by the expected log predictive density (elpd) score of the models, with a higher score indicating better predictive accuracy. The highest-scoring model is used as the baseline for the difference in elpd and the standard error (SE) of the difference.

Fig. C1.
ERPs elicited by the presentation of the same determiners pooled from items with a constraining context and a high probability gender, and from items with a non-constraining context and a low probability gender; see examples (6a) and (7a). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).

Fig. C2.
ERPs elicited by the presentation of the same determiners with low probability pooled from items with a constraining context and a non-constraining context; see examples (6b) and (7b). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).

Fig. C3.
ERPs elicited by the presentation of the same nouns pooled from items with a constraining context and a high probability noun, and from items with a non-constraining context and a low probability noun; see examples (6a) and (7a). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).

Fig. C4.
ERPs elicited by the presentation of the same critical nouns with low cloze probability pooled from items with a constraining context and a non-constraining context; see examples (6b) and (7b). The x-axis indicates time in seconds and the light grey square represents the pre-specified time window used in the analysis. The y-axis indicates voltage in microvolts; note that negative polarity is plotted downwards. The dark grey areas around the traces represent 95% confidence intervals based on crossed random effects; see Politzer-Ahles (2017).