Limits on prediction in language comprehension: A multi-lab failure to replicate evidence for probabilistic pre-activation of phonology

In the last few decades, the idea that people routinely and implicitly predict upcoming words during language comprehension has turned from a controversial hypothesis into a widely accepted assumption. Current theories of language comprehension1–3 posit prediction, or context-based pre-activation, as an essential mechanism occurring at all levels of linguistic representation (semantic, morpho-syntactic and phonological/orthographic) and facilitating the integration of words into the unfolding discourse representation. The strongest evidence to date for phonological pre-activation comes from DeLong, Urbach and Kutas4, who monitored participants’ electrophysiological brain responses as they read sentences, presented one word at a time, with expected/unexpected indefinite article + noun combinations such as “The day was breezy so the boy went outside to fly a kite/an airplane”. The sentences varied expectations (‘cloze’ probability) for a consonant- or vowel-initial noun, as determined in a sentence-completion task with other participants. As expected, the amplitude of the N400 event-related potential (ERP) decreased (became less negative) with increasing cloze, reflecting ease of processing5–6. The decreased N400 at the noun could be due either to its pre-activation or to high-cloze nouns being easier to integrate. Crucially, however, N400s at the immediately preceding article a or an showed the same relationship with cloze: encountering an indefinite article that mismatched a highly expected word (e.g., an when expecting kite) also elicited a larger N400. This led to the claim that participants pre-activated highly expected nouns, including their initial phonemes, based on the preceding context, with larger N400s on mismatching articles reflecting disconfirmation of this prediction.

The DeLong et al. study warranted stronger conclusions than related results available at the time. Unlike previous work, it did not rely on prior visual depiction of upcoming nouns, clearly de-confounded prediction and integration effects, and tested for graded phonological pre-activation of specific word form. Correspondingly, the study has been enthusiastically received as strong evidence for probabilistic phonological pre-activation, receiving over 650 citations to date and featuring in authoritative reviews2–3. However, there is good cause to question the soundness of the original finding (and the appropriateness of the analysis used). Attempts to replicate the critical article-effect have failed7. Moreover, an earlier, alternative analysis of the same data by the authors8 failed to reach statistical significance, but was omitted from the published report.

To obtain more definitive evidence, we conducted a direct replication study spanning 9 laboratories (Ntotal = 334). We pre-registered one replication analysis that was faithful to the original, and one single-trial analysis that modeled subject- and item-level variance using linear mixed-effects models. Applying the replication analysis to our article data (Figure 1a), the original finding did not replicate: no laboratory observed a significant negative relationship between cloze and N400 at central-parietal electrodes. In contrast, the negative relationship was successfully replicated for the nouns: 6 laboratories observed such an effect and 2 laboratories observed relatively strong but non-significant effects in the expected direction (range r = .30 to .50). In the single-trial analysis (Fig. 1b-c), there was no statistically significant effect of cloze on article-N400s, nor was there one with stricter control for pre-article voltage levels (Supplementary Fig. 1). Crucially, there was a strong and significant cloze effect on noun-N400s (in all laboratories), which was significantly different from that on article-N400s. We observed no significant differences between laboratories for article or noun effects. Exploratory Bayesian analyses with priors based on DeLong et al. further support our conclusions (Fig. 1d, Supplementary Fig. 2). Finally, a control experiment confirmed our participants’ sensitivity to the a/an rule during online language comprehension (Supplementary Fig. 3).

Figure 1. A multi-lab failure to replicate evidence for probabilistic pre-activation of phonology. (a) Pre-registered replication analysis: Pearson’s r correlations between ERP amplitude and article/noun cloze probability per EEG channel (* P < 0.05) and per laboratory. (b, c) Pre-registered single-trial analysis: (b) Grand-average ERPs elicited by relatively expected and unexpected words (cloze higher/lower than 50%) at electrode Cz, with standard deviations shown as dotted lines, and (c) the relationship between cloze and N400 amplitude as illustrated by the mean ERP values per cloze value (number of observations reflected in circle size), along with the regression line and 95% confidence interval. A change in article cloze from 0 to 100 is associated with a change in amplitude of 0.296 µV (95% confidence interval: −.08 to .67), χ2(1) = 2.31, p = .13. A change in noun cloze from 0 to 100 is associated with a change in amplitude of 2.22 µV (95% confidence interval: 1.75 to 2.69), χ2(1) = 56.5, p < .001. The effect of cloze on noun-N400s was statistically different from its effect on article-N400s, χ2(1) = 31.38, p < .001. (d) Bayes factor analysis associated with the replication analysis, quantifying the obtained evidence for the null hypothesis (H0) that N400 is not impacted by cloze, or for the alternative hypothesis (H1) that N400 is impacted by cloze with the size and direction of effect reported by DeLong et al. Scalp maps show the common logarithm of the replication Bayes factor for each electrode, capped at log(100) for presentation purposes. Electrodes that yielded at least moderate evidence for or against the null hypothesis (Bayes factor of ≥ 3) are marked by an asterisk. At posterior electrodes where DeLong et al. found their effects, our article data yielded strong to extremely strong evidence for the null hypothesis, whereas our noun data yielded extremely strong evidence for the alternative hypothesis (upper graphs). These results were also found when applying a 500 ms pre-word baseline correction (lower graphs).

Despite a sample size 10 times larger than the original and improved statistical analysis, we observed no statistically significant effect of cloze on article-N400s, while replicating the strong and statistically significant effect of cloze on noun-N400s4,6. The effect of cloze on article-N400s, if it exists, must be very small to evade detection given our expansive approach. Whether such an effect would constitute convincing evidence for routine phonological pre-activation as assumed in theories of language comprehension3 can be questioned, but, more generally, such an effect cannot be meaningfully studied in typical small-scale studies. Consequently, current theoretical positions may be based on potentially unreliable findings and require revision. In particular, the strong prediction view, which claims that pre-activation routinely occurs across all (including phonological) levels3, can no longer be viewed as having strong empirical support.

Our results do not constitute evidence against prediction in general. We note a lack of convincing evidence specifically for phonological pre-activation, which would have to be measured before a noun appears and unobscured by processes instigated by the noun itself. However, our results neither support nor necessarily exclude phonological pre-activation. Unlike gender-marked articles9 (e.g., in Dutch or Spanish) that agree with nouns irrespective of intervening words, English a/an articles index the subsequent word, which is not always a noun. Maybe our participants did not use mismatching articles to disconfirm predicted nouns, possibly because it was not a viable strategy (American and British English corpus data show a mere 33% chance that a noun follows such articles). Perhaps a revision of the predicted meaning is required to trigger differential ERPs.

DeLong et al. recently described filler-sentences in their experiment10, cf. 7, which were omitted from their original report, and were neither provided nor mentioned to us upon our request for their stimuli. DeLong et al. used the existence of these filler-sentences to dismiss an alternative explanation of their results, namely that an unusual experimental context wherein every sentence contains an article-noun combination leads participants to strategically predict upcoming nouns. Importantly, we failed to replicate their article-effects despite an experimental context that could inadvertently encourage strategic prediction. Therefore, the difference between their experiment and ours cannot explain the different results, and may even strengthen our conclusions.

In sum, our findings do not support a strong prediction view involving routine and probabilistic pre-activation of phonological word form based on preceding context. Moreover, our results further highlight the importance of direct replication, large-sample studies, transparent reporting and pre-registration to advance reproducibility and replicability in the neurosciences.

Article cloze and noun cloze ratings were obtained from a separate group of native speakers of English who were students at the University of Edinburgh and did not participate in the ERP experiment. They were instructed to complete the sentence fragment with the best continuation that comes to mind [1]. We obtained article cloze ratings from 44 participants for 80 sentence contexts truncated before the critical article. Noun cloze ratings were obtained by first truncating the sentences after the critical articles, and presenting two different, [...]

The replication experiment was followed by a control experiment, which served to detect sensitivity to the correct use of the a/an rule in our participants. Participants read 80 relatively short sentences (average length 8 words, range 5-11) that contained the same critical words as the replication experiment, preceded by a correct or incorrect article. As in the replication experiment, each critical word was presented only once, and was followed by at least one more word. All words were presented at the same rate as the replication [...]

[...] (in a few participants with a noisy mastoid channel, only one mastoid channel was used), and segmented the continuous data into epochs from 500 ms before to 1000 ms after word onset. We then performed visual inspection of all data segments and rejected data with amplifier blocking, movement artifacts, or excessive muscle activity. Subsequently, we performed [...]

Pre-registered single-trial analysis. In this analysis we did not apply the 0.2-15 Hz band-pass filter, which carries the risk of inducing data distortions [5][6]. For each trial, we performed baseline correction by subtracting the mean voltage of the −100 to 0 ms time window from the data. This common procedure corrects for spurious voltage differences before word onset, generating confidence that observed effects are elicited by the word rather than differences in brain activity that already existed before the word. Baseline correction is a standard procedure in ERP research [5], and although it was not used or not reported in DeLong [...]

In addition to using a 100 ms pre-stimulus baseline, we also computed the replication Bayes factors using a 500 ms pre-stimulus time window for baseline correction. Results are shown in Figure 1.
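
The per-trial baseline correction described above can be sketched as follows. This minimal Python example is an illustration only: the function name, the 50 ms sample spacing and the toy voltages are hypothetical, and treating the baseline window as half-open (excluding the sample at word onset) is an assumption, not a detail taken from the analysis.

```python
def baseline_correct(epoch, times_ms, baseline_start_ms, baseline_end_ms=0):
    """Subtract the mean pre-stimulus voltage from every sample of one epoch.

    Assumption: the baseline window is half-open, i.e. it includes samples
    with baseline_start_ms <= t < baseline_end_ms.
    """
    baseline = [
        voltage
        for voltage, t in zip(epoch, times_ms)
        if baseline_start_ms <= t < baseline_end_ms
    ]
    if not baseline:
        raise ValueError("no samples fall inside the baseline window")
    baseline_mean = sum(baseline) / len(baseline)
    return [voltage - baseline_mean for voltage in epoch]


# Hypothetical toy epoch from -500 ms to 1000 ms in 50 ms steps:
# a slow drift before word onset, then a flat -2.0 microvolt response.
times_ms = list(range(-500, 1001, 50))
epoch = [-t / 100 if t < 0 else -2.0 for t in times_ms]

corrected_100 = baseline_correct(epoch, times_ms, baseline_start_ms=-100)
corrected_500 = baseline_correct(epoch, times_ms, baseline_start_ms=-500)

sample = times_ms.index(400)  # a sample inside a typical N400 window
print(corrected_100[sample], corrected_500[sample])
```

In this toy epoch the pre-onset drift makes the two baseline windows yield different post-onset values, which is one reason to check whether results hold under both baseline choices.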

Supplementing the single-trial analyses, we performed Bayesian mixed-effects model analysis using the brms package for R [11], which fits Bayesian multilevel models using the Stan programming language [12]. We used a prior based on the DeLong et al. observed effect size at Cz for a difference between 0% cloze and 100% cloze (1.25 µV and 3.75 µV for articles and nouns, respectively) and a prior of zero for the intercept. Both priors had a normal distribution and a standard deviation of 0.5 (given the a priori expectation that average ERP voltages in this window generally fluctuate on the order of a few microvolts; note that these units are expressed in terms of the z-scored cloze values, rather than the original cloze values, such that μ for the cloze prior was 0.45, which corresponds to a raw cloze effect of 1.25). We computed estimates and 95% credible intervals for each of the mixed-effects models we tested, and transformed these back into raw cloze units. The credible interval is the range of values such that one can be 95% certain that it contains the true effect, given the data, priors and the model. The results from these analyses did not