Ignoring the alternatives: The N400 is sensitive to stimulus preactivation alone

The N400 component of the event-related brain potential is a neural signal of processing difficulty. In the language domain, it is widely believed to be sensitive to the degree to which a given word or its semantic features have been preactivated in the brain based on the preceding context. However, it has also been shown that the brain often preactivates many words in parallel. It is currently unknown whether the N400 is also affected by the preactivations of alternative words other than the stimulus that is actually presented. This leaves a weak link in the derivation chain: how can we use the N400 to understand the mechanisms of preactivation if we do not know what it indexes? This study directly addresses this gap. We estimate the extent to which all words in a lexicon are preactivated in a given context using the predictions of contemporary large language models. We then directly compare two competing possibilities: that the amplitude of the N400 is sensitive only to the extent to which the stimulus is preactivated, and that it is also sensitive to the preactivation states of the alternatives. We find evidence of the former. This result allows for better grounded inferences about the mechanisms underlying the N400, lexical preactivation in the brain, and language processing more generally.


Introduction
Perhaps the best studied neural signal of language comprehension, the N400 is a negative component of the event-related brain potential peaking roughly 400 msec after the presentation of a stimulus (Kutas & Federmeier, 2011; Kutas & Hillyard, 1980, 1984). Studying the amplitude of the N400 has provided key evidence about language processing, most notably that words and their meanings are preactivated in the brain before they are encountered during online language comprehension, and that this preactivation is correlated with the extent to which the words are contextually predictable (Federmeier, 2021; Kuperberg et al., 2020; Kutas et al., 2011; Kutas & Federmeier, 2011; Kutas & Hillyard, 1984; Van Petten & Luka, 2012). Specifically, the amplitude of the N400 response is large (more negative) by default, and is reduced in proportion to the extent that the word is predictable (Dambacher et al., 2006; Federmeier, 2021; Payne et al., 2015; Van Petten, 1993; Van Petten & Kutas, 1990, 1991; Van Petten & Luka, 2012). The predictability effect has been replicated numerous times when predictability is operationalized as cloze probability (Kutas & Federmeier, 2011; Kutas & Hillyard, 1984), the proportion of participants in a norming study who fill in a gap in a sentence with a specific word (Taylor, 1953, 1957). More recently, this has also been found to be the case when predictability is operationalized using the predictions of language models (Aurnhammer & Frank, 2019; Frank et al., 2015; Merkx & Frank, 2021; Michaelov et al., 2022, 2023; Szewczyk & Federmeier, 2022; Yan & Jaeger, 2020), computational systems designed to predict the probability of a word in context based on the statistics of language (Jurafsky & Martin, 2023).
However, while it is by now widely accepted that the amplitude of the N400 response to a word reflects its preactivation, there is a weak link in the derivation chain: exactly how the N400 indexes this preactivation is not clear. The current general consensus is that the amplitude of the N400 response to a word only reflects the extent to which the word or its semantic content were preactivated before the word was encountered (DeLong et al., 2014; DeLong & Kutas, 2020; Federmeier, 2021; Federmeier et al., 2007; Kuperberg et al., 2020; Kutas et al., 2011; Thornhill & Van Petten, 2012; Van Petten & Luka, 2012). We refer to this as the stimulus-dependent account.
The main kind of evidence supporting this idea comes from the N400's resilience to variability. A key line of research in this area involves looking at the effect of sentence constraint on the N400. The term sentence constraint in this context refers to the cloze probability of the highest-cloze continuation of a sentence: if the highest-cloze continuation has a high cloze probability, the sentence has a high constraint, while if it has a low probability, the sentence has a low constraint. The central finding is that with cloze probability as a metric of contextual predictability, sentence constraint does not impact N400 amplitude at all; only the cloze probability of the stimulus word itself does (Federmeier et al., 2002, 2007; Federmeier, 2007; Otten & Berkum, 2008; Van Petten et al., 1999; Vissers et al., 2006; Wlotko & Federmeier, 2007; for review see Federmeier, 2021; Kuperberg et al., 2020; Van Petten & Luka, 2012). For example, Federmeier et al. (2007) find that if a word such as 'look' has a low cloze probability, it elicits a large N400 response no matter whether the preceding context is strongly constraining, such as in 'The children went outside to look' (highest-cloze completion: 'play'), or only weakly constraining, such as in 'Joy was too frightened to look' (highest-cloze completion: 'move'). The reliability of the effect across contexts with different degrees of constraint suggests that only the contextual predictability of the stimulus that is presented, and not the predictability of the most likely alternate word, impacts N400 amplitude.
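As a minimal sketch of these definitions, the following Python snippet computes cloze probabilities and sentence constraint from a set of norming responses. The response counts are invented for illustration and are not taken from Federmeier et al. (2007); the function names are likewise hypothetical.

```python
from collections import Counter

def cloze_probabilities(responses):
    """Map each completion to the proportion of participants who produced it."""
    counts = Counter(responses)
    total = len(responses)
    return {word: n / total for word, n in counts.items()}

def sentence_constraint(responses):
    """Constraint is the cloze probability of the most frequent completion."""
    return max(cloze_probabilities(responses).values())

# Hypothetical norming data (invented counts, 20 participants per sentence frame).
high_constraint = ["play"] * 17 + ["run"] * 2 + ["swim"]
low_constraint = (["move"] * 5 + ["speak"] * 4 + ["scream"] * 4
                  + ["run"] * 4 + ["sleep"] * 3)

for name, responses in [("high", high_constraint), ("low", low_constraint)]:
    cloze = cloze_probabilities(responses)
    # A word that nobody produced has a cloze probability of zero.
    print(name, "constraint:", sentence_constraint(responses),
          "| cloze of 'look':", cloze.get("look", 0.0))
```

In this toy example the stimulus 'look' has the same cloze probability (zero) in both frames, while constraint differs sharply, which is the contrast the constraint manipulation relies on.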
However, this kind of finding still does not rule out the possibility that preactivation of other words can impact N400 amplitude. The aforementioned experiments only consider the extent to which two words (the highest-cloze continuation and the stimulus word) are preactivated. But many candidate words are typically possible in any position. Lexical prediction has been theorized to involve the graded preactivation of more than two words, ranging from a few candidates, as proposed by Brothers and Kuperberg (2021), to 'large portions of [the] lexicon', as proposed by Smith and Levy (2013). If the N400 truly does index processing difficulty, this processing difficulty might include not only the effort required to activate neural representations associated with the actual stimulus, but also inhibition of the neural representations associated with other possible stimuli, as some researchers have argued (Debruille, 2007; Fitz & Chang, 2019; Hale, 2001; Hoeks et al., 2004). We refer to this as the distribution-dependent account, in line with the idea that the N400 reflects the full distribution of stimulus preactivation across possible next words.
One approach to evaluating whether a larger cohort of predicted words affects the N400 is to create an aggregate metric, such as entropy, derived from the cloze probabilities of all completions generated in the cloze task (as in Stone et al., 2022). However, cloze has its limitations. For example, it is well-established that words with cloze probabilities of zero can vary in their degree of preactivation (see, e.g., DeLong et al., 2019; Ito et al., 2016; Metusalem et al., 2012). An alternative approach is to include information about potential preactivation across the entire lexicon (and thus provide a more complete assessment of alternate word predictability) by modeling preactivation with language models, which, given any context, can provide a probability distribution over all words in their vocabulary (Jurafsky & Martin, 2023).
While language models have been successfully used to predict N400 amplitudes recorded from experimental participants, thus far this has only involved stimulus-dependent metrics, namely surprisal and probability (Aurnhammer & Frank, 2019; Frank et al., 2015; Merkx & Frank, 2021; Michaelov et al., 2022, 2023; Szewczyk & Federmeier, 2022; Yan & Jaeger, 2020). To the best of our knowledge, no study has thus far attempted to directly test whether N400 amplitude can be predicted by the probability assigned to any word other than the stimulus itself by a language model, and only one (Frank et al., 2015) has tested a metric even in part derived from the whole probability distribution. Because language models are currently the only way to calculate the contextual probability of all words in the lexicon, the question of whether the amplitude of the N400 is affected by the extent to which words other than the stimulus itself were predicted has not been investigated sufficiently for any conclusions to be drawn. This severely limits the inferences we can draw from the N400 effect. Namely, we do not know whether the N400 indexes the preactivation of the stimulus alone, or also its alternatives.
This presents a problem for theoretical advancement. Making progress on neural mechanisms of language comprehension relies on reliable and sensitive signals such as the N400. Researchers hope to draw inferences from effects like the N400 about, for instance, what is preactivated during comprehension. But to do this requires a precise account of what affects those signals. In addition to presenting an obstacle to our understanding of language comprehension more generally (for example, whether language processing fits into our general understanding of predictive processing in the brain), the weak derivation link presents a challenge for investigating how certain linguistic features impact preactivation. The majority of contemporary work on the N400 investigates how the context preceding a stimulus impacts the extent to which the stimulus is preactivated in the brain (for review, see, e.g., Federmeier, 2021; Kuperberg et al., 2020), but uncertainty about whether the N400 reflects only the preactivation of the stimulus drastically reduces the scope of what we can hope to understand. This issue is especially important in a field where noise and small effect sizes can often lead to inconsistent findings across studies (for a recent discussion, see Nicenboim et al., 2020).
The aim of this study, therefore, is to test whether, to the extent that this can be evaluated using current methods, the amplitude of the N400 response solely reflects the preactivation of the stimulus presented, or whether it in some way also reflects the inhibition of alternatives. To do this, we use state-of-the-art large language models. This is because, as previously stated, the conventional cloze approach fails to capture preactivation that varies systematically between different words with a cloze probability of zero (e.g., DeLong et al., 2019; Ito et al., 2016; Metusalem et al., 2012). This may not just be a methodological issue; as discussed in subsection 2.1, it is likely that the task itself (which asks for the best completion of a sentence) may preclude more anomalous words being filled in. But even if the issue is purely methodological, human vocabularies are very large, on the order of tens of thousands of words (Brysbaert et al., 2016), making it impractical to collect judgments from enough participants for every possible word. There is also reason to believe that the probabilities derived from language models are actually more informative than cloze. In addition to being more clearly interpretable from an information-processing perspective (they reflect the contextual probabilities of words based on the statistics of language alone), recent work has shown that the predictions of contemporary language models can outperform cloze probability as predictors of N400 amplitude (Michaelov et al., 2022). Thus, even if it were possible to collect and calculate cloze values for all words in the vocabulary, it might still be preferable to use language models.

Constraint
Since early work on the N400 (Kutas & Hillyard, 1984), cloze probability has been used to operationalize the extent to which words are preactivated such that their preactivation impacts N400 amplitude. Most subsequent work explicitly or implicitly assumes that the amplitude of the N400 is only (or at least, most importantly) correlated with the extent to which the stimulus itself is preactivated. However, more recently, there have been attempts to consider how the broader, distributed 'landscape of activation' (Federmeier, 2021, p. 1) impacts N400 amplitude. An exemplary case is the study carried out by Federmeier et al. (2007), who test whether sentence constraint (the cloze probability of the most probable word in context) impacts N400 amplitude. The idea is that if inhibition does impact N400 amplitude, one should expect to see it most clearly with low-probability stimuli in high-constraint sentences. Under a distribution-dependent account, the high-probability completion is preactivated to a large extent, and thus, when this prediction is violated, we should expect a strong inhibition response. But as discussed, Federmeier et al. (2007) did not find any effect of constraint, leading them, and many other researchers (DeLong & Kutas, 2020; Federmeier, 2007, 2021; Federmeier et al., 2002; Kuperberg et al., 2020; Kutas et al., 2011; Otten & Berkum, 2008; Thornhill & Van Petten, 2012; Van Petten et al., 1999; Van Petten & Luka, 2012; Vissers et al., 2006; Wlotko & Federmeier, 2007), to argue that N400 amplitude does not reflect inhibition. Under these accounts, N400 amplitude only reflects new activation elicited by the stimulus, that is, the activation of neural representations that were not already preactivated by the context.
However, as argued earlier, this approach does not speak to failed predictions for words other than the best completion, since it only takes into account the activation of the highest-probability item. Moreover, word prediction might not linearly impact N400 amplitude: it might or might not be ten times harder to inhibit a word with a probability of 50% than a word with a probability of 5%. And finally, this approach assumes that cloze probability actually reflects the proportion of activation given to a specific candidate word (as argued by Brothers & Kuperberg, 2021; Staub et al., 2015). While it may intuitively seem a given that cloze probability should be directly proportional to the relative activation level of each word, this is not necessarily the case, especially given that the cloze task may have specific deforming effects on the probability distribution. One possible example of this can be illustrated by looking at the related anomaly effect, where an anomalous word that is semantically related to the best (highest-cloze) completion of a sentence elicits a smaller N400 response than an anomalous word that is not (for review, see Amsel et al., 2015; DeLong et al., 2019; Federmeier & Kutas, 1999; Ito et al., 2016; Kutas & Hillyard, 1984; Metusalem et al., 2012). In such cases, both semantically related and unrelated anomalous words have a cloze probability of zero (or almost zero) but elicit N400 responses of different amplitudes; when we look at language model predictions, however, we see that the semantically related words have a higher probability (Michaelov & Bergen, 2022a). This suggests that such semantically related anomalous words are in fact more likely than their unrelated counterparts, but this is not detectable by looking at cloze probability. In this case, it is likely that the cloze task discourages participants from filling in anomalous words, even if they are more likely in the context, and thus more strongly preactivated (for related discussion, see Michaelov et al., 2022; Smith & Levy, 2011).

Surprisal
One attempt to consider the full distribution of prediction is that of Levy (2008). Levy (2008) frames lexical processing difficulty as involving the effort required to reallocate neurocognitive resources upon encountering a stimulus, based on altering the entire predicted probability distribution. To do this, Levy (2008) proposes that the relevant metric should be the Kullback-Leibler divergence (Kullback & Leibler, 1951) between the probability distribution of predictions and the 'true' probability distribution, that is, a distribution where the actual next word (i.e., the stimulus word) has a probability of 1, and all other words have a probability of 0. It should be noted that while Levy's (2008) account is based on considering reading times as an index of lexical processing difficulty, it may in fact be even more applicable to the N400. As discussed, the N400 is frequently thought to reflect the extent to which encountering a stimulus shapes the activation of neurocognitive representations, or more specifically, to index the processing difficulty associated with updating the activation states of the brain to bring the total landscape of activation in line with the new stimulus.
The Kullback-Leibler divergence thus appears to reflect both the extent to which the true stimulus was predicted and the extent to which other words were predicted. The problem, however, is that Levy (2008) finds that the Kullback-Leibler divergence between the probability distribution that is the output of language models and the true probability distribution is mathematically equivalent to the surprisal $S$ of the stimulus itself, that is, the negative logarithm of the probability $p$ of a word $w_i$ given its preceding context, shown in Equation (1).

$$S(w_i) = -\log p(w_i \mid w_1, \ldots, w_{i-1}) \tag{1}$$
Thus, while under an information-theoretic account surprisal may be a good characterization of processing difficulty envisioned as the updating of activation states in the brain (and indeed, Hale (2001) proposes surprisal as a metric of lexical processing difficulty that reflects the difficulty of disconfirming alternatives), it is critically determined solely by the predicted probability of the stimulus word. From a theoretical perspective, this is not a problem. The fact that the Kullback-Leibler divergence between the true and predicted probability distributions is equivalent to surprisal may actually help to explain the finding that the N400 does not appear to be sensitive to constraint: if the brain reflects information-theoretic principles, the effort required to update our probability distribution might indeed only be determined by the probability of the stimulus (with a logarithmic linking function). Empirically, surprisal has also been highly successful in the prediction and modeling of the N400 (Aurnhammer & Frank, 2019; Frank et al., 2015; Merkx & Frank, 2021; Michaelov & Bergen, 2020; Parviz et al., 2011; Szewczyk & Federmeier, 2022), with one recent study even finding the surprisal of the GPT-3 language model (Brown et al., 2020) to be the best predictor of the N400 measured thus far, beating other language models and even cloze probability, the canonical metric of word probability (Michaelov et al., 2022). Nonetheless, because surprisal is not affected at all by the extent to which other words are preactivated, it cannot be used to investigate whether the preactivation of non-stimulus words impacts N400 amplitude.
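The equivalence between Kullback-Leibler divergence and surprisal can be checked numerically. The sketch below is illustrative only: the two five-word distributions are invented, the function names are hypothetical, and natural logarithms are assumed. Both distributions give the stimulus (index 0) a probability of 0.10 but distribute the remaining mass very differently.

```python
import numpy as np

def surprisal(p_hat, stimulus):
    """Negative log probability of the stimulus word (in nats)."""
    return float(-np.log(p_hat[stimulus]))

def kl_true_vs_predicted(p_hat, stimulus):
    """D_KL(q || p_hat), where q puts probability 1 on the stimulus."""
    q = np.zeros_like(p_hat)
    q[stimulus] = 1.0
    support = q > 0  # terms with q = 0 contribute nothing, by convention
    return float(np.sum(q[support] * np.log(q[support] / p_hat[support])))

# Invented predicted distributions over a toy five-word vocabulary.
one_strong_competitor = np.array([0.10, 0.80, 0.05, 0.03, 0.02])
mass_spread_evenly = np.array([0.10, 0.225, 0.225, 0.225, 0.225])

for p_hat in (one_strong_competitor, mass_spread_evenly):
    print(surprisal(p_hat, 0), kl_true_vs_predicted(p_hat, 0))
```

Under these assumptions the two quantities coincide for each distribution, and both distributions yield the same value, because only the probability of the stimulus enters the calculation.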

$L_1$ distance
Another metric that ostensibly includes information about the preactivation states of non-stimuli is developed by Fitz and Chang (2019). Fitz and Chang (2019) propose that rather than simply indexing prediction error of some kind, the N400 has a functional significance in itself as a learning signal used to update our neurocognitive representations of the statistics of language for use in production (for related accounts, see, e.g., Federmeier, 2021; Fitz & Chang, 2019; Kuperberg et al., 2020; MacDonald, 2013; Pickering & Garrod, 2013). For this reason, Fitz and Chang (2019) take the true and predicted probabilities for each word in their model's vocabulary, and then model N400 amplitude as the sum of absolute error for each word, that is, the sum of the absolute difference between the true and predicted probability of each word. This is equivalent to the Manhattan distance or $L_1$ norm between the predicted and true probability distributions. However, like surprisal, this metric is in fact only dependent on the probability of the stimulus, as we show in Appendix A. Specifically, $L_1$ distance has the relationship to $p(w_i)$ shown in Equation (2).

$$L_1 = 2\,\bigl(1 - p(w_i)\bigr) \tag{2}$$
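The dependence on $p(w_i)$ alone can be seen in a few steps. The sketch below assumes, as in the text, that the true distribution assigns a probability of 1 to the stimulus $w_i$ and 0 to every other word; the symbol $q$ for that true distribution and the index $j$ over the vocabulary are notation introduced here for illustration, and the full treatment remains the one in Appendix A.

```latex
\begin{aligned}
L_1 &= \sum_{j} \lvert q(w_j) - p(w_j) \rvert \\
    &= \lvert 1 - p(w_i) \rvert + \sum_{j \neq i} \lvert 0 - p(w_j) \rvert \\
    &= \bigl(1 - p(w_i)\bigr) + \bigl(1 - p(w_i)\bigr) \\
    &= 2\,\bigl(1 - p(w_i)\bigr)
\end{aligned}
```

The third line uses the fact that the predicted probabilities sum to 1, so the total probability mass on the non-stimulus words is $1 - p(w_i)$ however it is divided among them.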
Like surprisal, $L_1$ distance is a metric based on the distance between the true and predicted probability distributions, and like surprisal, it is in fact only dependent on the predicted probability of the stimulus. Again, this is a theoretically meaningful result. If we take the idea of proportional preactivation (that is, the idea that words are preactivated in proportion to probability) seriously, and expect the processing difficulty indexed by the N400 to reflect the sum of the absolute error between the true and predicted probabilities of words, then this mathematical result suggests that we only need to calculate the probability of the stimulus itself in order to understand the N400 response. Indeed, Fitz and Chang (2019) are successful in using $L_1$ distance to model N400 amplitude, though it should be noted that Fitz and Chang's (2019) main model is not a language model in the strict sense because it is trained using structured semantic information (though its output is still a probability distribution over words).
However, as is the case with Kullback-Leibler divergence, this means that $L_1$ distance cannot be used to investigate the question of whether the possible inhibition of preactivated stimuli impacts the processing difficulty indexed by the N400. But by the same token, what it does tell us is that if the distribution-dependent account of the N400 is true, the mathematical relationship between the true and predicted probability distributions cannot be $L_1$ distance. The same is true for Kullback-Leibler divergence. However, this does not rule out the possibility that other difference metrics between the true and predicted probability distribution could capture the effect, even including other $L_k$ distance metrics. For example, it could be that the $L_1$ distance metric under-estimates the difficulty of inhibiting high-probability items relative to low-probability items, something which might be detectable using the $L_2$ (Euclidean) distance as the relevant metric. On the other hand, it might be that using $L_1$ distance under-estimates the difficulty in inhibiting low-probability items relative to high-probability items, something that could be addressed by using the $L_{0.5}$ distance as a metric.
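The contrast between these distance metrics can be illustrated with a small numerical sketch. It assumes the usual Minkowski form, $(\sum_j |q_j - p_j|^k)^{1/k}$, which may differ in detail from the definitions used in the study (for $k < 1$ this is not a true metric); the toy distributions and the function name are invented for illustration.

```python
import numpy as np

def lk_distance(p_hat, stimulus, k):
    """Minkowski-style L_k distance between a one-hot true distribution
    and the predicted distribution p_hat (assumed form, see lead-in)."""
    q = np.zeros_like(p_hat)
    q[stimulus] = 1.0
    return float(np.sum(np.abs(q - p_hat) ** k) ** (1.0 / k))

# Same stimulus probability (0.10 at index 0), different alternatives.
one_strong_competitor = np.array([0.10, 0.80, 0.05, 0.03, 0.02])
mass_spread_evenly = np.array([0.10, 0.225, 0.225, 0.225, 0.225])

for k in (1, 2, 0.5):
    print(f"k={k}:",
          lk_distance(one_strong_competitor, 0, k),
          lk_distance(mass_spread_evenly, 0, k))
```

With these toy values, the $L_1$ distance is identical for the two distributions, the $L_2$ distance is larger when a single alternative is strongly predicted, and the $L_{0.5}$ distance is larger when the remaining mass is spread over several weaker alternatives.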

Entropy
A final metric that has been used to predict N400 amplitude (Stone et al., 2022), and which does in fact take into account the full probability distribution of preactivation, is entropy (Shannon, 1948). The equation for entropy $H$ is given in Equation (3), where $\hat{p}(w_i)$ is the predicted probability of $w_i$ in context.

$$H = -\sum_{i} \hat{p}(w_i) \log \hat{p}(w_i) \tag{3}$$
Entropy reflects uncertainty: given a probability distribution over words, the distribution with the highest possible entropy would be a uniform distribution, and the lowest-entropy distribution is one where one word has a probability of 1 and the remaining words have a probability of 0. A theoretical account of how entropy should influence N400 amplitude is not necessarily intuitive. In line with work on constraint, one might expect that in cases with low-probability stimuli, a low-entropy distribution might lead to the most processing difficulty, as this would result from a probability distribution where one very high-probability word is greatly preactivated. On the other hand, Stone et al. (2022) hypothesize that we might be less likely to make predictions in situations with higher entropy (where there are a larger number of possible continuations of a sentence), and thus, higher entropy should be associated with larger N400 responses. In this way, either a positive or negative relationship between entropy and N400 amplitude is plausible based on previous work.
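The two extremes described above can be verified with a short sketch. The distributions are invented, the function name is hypothetical, and the logarithm is taken in base 2 (bits) purely as an assumption; a different base only rescales the values.

```python
import numpy as np

def entropy_bits(p_hat):
    """Shannon entropy in bits; zero-probability words contribute nothing."""
    p = p_hat[p_hat > 0]
    return float(-np.sum(p * np.log2(p)))

vocab_size = 8
uniform = np.full(vocab_size, 1 / vocab_size)        # maximal uncertainty
certain = np.array([1.0] + [0.0] * (vocab_size - 1))  # one word has probability 1
peaked = np.array([0.80, 0.10, 0.05, 0.03, 0.02, 0.0, 0.0, 0.0])

for name, dist in [("uniform", uniform), ("certain", certain), ("peaked", peaked)]:
    print(name, entropy_bits(dist))
```

Note that the function takes only the predicted distribution as input: which word is eventually presented plays no role in the calculation.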
Of course, the fact that previous work on the N400 and language comprehension more generally can lead to multiple predictions is not in itself an issue: this is something that could be resolved empirically, if indeed it is the case that entropy impacts N400 amplitude. But there does remain a fundamental problem with entropy as a metric of processing difficulty: it does not take into account the actual stimulus. Specifically, it only reflects the activation state before the word is encountered. Thus, if stimulus preactivation itself impacts processing difficulty, entropy alone cannot be used to model it. In the one study that directly tests the effect of entropy on N400 amplitude, Stone et al. (2022) do not find it to be a significant predictor, either as a main effect or in interaction with word probability. However, it is worth noting that Stone et al. (2022) calculate their entropy based on cloze probabilities, and thus only a limited number of possible preactivations are considered: the maximum number of different responses to filling in the blank in the cloze task in their study is 8 (Stone et al., 2021). If there are differences in levels of preactivation based on contextual probability beyond that reflected by cloze, as previously discussed, then this approach does not take into account the full distribution of preactivation. Thus, despite the aforementioned theoretical problems with entropy, it is still valuable to directly test how well entropy calculated from the full distribution of predictions (for example, by using probabilities derived from a language model) can predict N400 amplitude, which we do in the present work. This is especially so given the recent findings that entropy appears to correlate with some of the neural activity that occurs during language comprehension when measured using magnetoencephalography (Brodbeck et al., 2022; Huizeling et al., 2022).
One metric that at least at first glance would appear to be better suited to testing whether N400 amplitude is sensitive to the probability of words other than the stimulus is cross-entropy. Cross-entropy is a measure of the difference between two distributions that is often used as a loss function (Goodfellow et al., 2016; Jurafsky & Martin, 2023), and thus is in line with some theories of the N400 (e.g., Fitz & Chang, 2019). However, cross-entropy is the sum of the Kullback-Leibler divergence between the true and predicted probability distributions and the entropy of the true probability distribution (Goodfellow et al., 2016, p. 73). Given that the entropy of the true probability distribution is zero, this means that, at least for language models, the cross-entropy is equivalent to Kullback-Leibler divergence, and thus, surprisal. And so this metric is also only dependent on the probability of the stimulus.
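Written out, the argument runs as follows. This is a sketch in notation introduced here for illustration: $q$ denotes the true distribution (probability 1 on the stimulus $w_i$, 0 elsewhere), $\hat{p}$ the predicted distribution, and $j$ indexes the vocabulary.

```latex
\begin{aligned}
H(q, \hat{p}) &= -\sum_{j} q(w_j) \log \hat{p}(w_j)
               = H(q) + D_{\mathrm{KL}}(q \,\|\, \hat{p}) \\
H(q) &= 0 \quad \text{because } q(w_i) = 1 \text{ and } q(w_j) = 0 \text{ for } j \neq i \\
H(q, \hat{p}) &= D_{\mathrm{KL}}(q \,\|\, \hat{p}) = -\log \hat{p}(w_i) = S(w_i)
\end{aligned}
```

Every term of the sum with $j \neq i$ is multiplied by $q(w_j) = 0$, so only the stimulus term survives, and the probabilities assigned to the alternatives drop out.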
There are also several other related metrics that bear mentioning. Frank et al. (2015) and Aurnhammer and Frank (2019) test how well next-word entropy, the difference between entropy and next-word entropy, and two forms of what they refer to as Lookahead Information Gain predict N400 amplitude as well as reading time. However, next-word entropy in this case refers to the entropy of the probability distribution of the predictions for the word after the stimulus, and thus does not take into account the preactivation at the time that the stimulus is encountered, or the actual stimulus itself. The two Lookahead Information Gain metrics are also both based on this probability distribution for the following word. Finally, it should also be noted that none of these four metrics were found to be good at modeling the N400 (Aurnhammer & Frank, 2019; Frank et al., 2015).

3. Language models and the N400
Using the predictions of language models rather than a human-derived metric such as cloze probability can evoke skepticism. As articulated above, language models allow us to test hypotheses about how the full distribution of preactivation may impact N400 amplitude, but this is naturally only a viable strategy if language model predictions bear a clear relationship to this preactivation. Intuitively it may seem problematic to use the predictions derived from systems trained only on text data with no grounding in sensorimotor experience of the world or explicit propositional knowledge to model the kinds of predictions that humans may make during language comprehension. However, as discussed, recent work has shown that the predictions of language models can model N400 amplitude very well (Aurnhammer & Frank, 2019; Frank et al., 2015; Merkx & Frank, 2021; Michaelov & Bergen, 2020, 2022a; Michaelov et al., 2021, 2023; Szewczyk & Federmeier, 2022). Thus, at worst, language models appear to make predictions in line with the preactivation that underlies the N400 response. This in itself would not necessarily be surprising. The language we use encodes information about the world and our understanding of it to such an extent that its statistics can be used to calculate the semantic similarity of words (Landauer et al., 1998), identify structured semantic relations between words (Mikolov, Sutskever, et al., 2013), and even identify cultural biases (Bolukbasi et al., 2016). Thus, it may be that the statistics of language are able to approximate the statistics of the world: we are more likely to talk about more likely things. Therefore, even if the preactivation that occurs during online language comprehension is in fact largely based on our knowledge of the world (direct or indirect), this may be approximated well enough by the statistics of language that those statistics may be informative about neurocognitive systems underlying language comprehension. However, there is a stronger alternative possibility: humans may actually be using the statistics of language in preactivation as part of language comprehension. Given the amount of information contained in the statistics of language (contemporary language models continue to improve performance at increasingly impressive tasks, see, e.g., Nie et al., 2020; Srivastava et al., 2022; Wang, Pruksachatkun, et al., 2019; Wang, Singh, et al., 2019), it would not in principle be surprising if the human language comprehension system took advantage of this. In fact, this would bring language processing in line with evidence for predictive coding in other domains, in which statistical learning is thought to play a key role. For example, in visual processing, there is evidence that environmental statistics are relevant from the level of neurons in the primary visual cortex to the overall encoding of scenes (de Lange et al., 2018; Rao & Ballard, 1999; Sherman & Turk-Browne, 2020).

Cortex 168 (2023) 82-101

The present study
The aim of the present study is to explore whether the amplitude of the N400 response is impacted not only by the extent to which a given stimulus was preactivated by its preceding context, but also by the extent to which other possible stimuli were preactivated. Most contemporary theoretical accounts of the N400, and by extension, the neurocognitive processes underlying language comprehension, assume that solely the stimulus word matters. But this has not yet been convincingly demonstrated.
To investigate this, we use language models to calculate several distribution-dependent metrics (that is, metrics that operationalize the difference between the true and predicted probability distributions), specifically $L_{0.5}$ distance, $L_2$ distance, Hellinger distance, $\chi^2$ distance, and cosine distance, as well as the previously-investigated constraint and entropy metrics (the equations for all metrics are presented in Table 2). We then test whether any of these can account for variance in N400 amplitude above and beyond that explained by predictability alone. We test this on the large N400 dataset made available by Szewczyk and Federmeier (2022), comprising data from four published studies (Federmeier et al., 2007; Hubbard et al., 2019; Szewczyk et al., 2022; Wlotko & Federmeier, 2012) and one previously-unpublished ERP study.
We divide our study into two experiments. In the first, we test how well the predictability metrics calculated using seven contemporary language models predict N400 amplitude. Because our study tests whether metrics operationalizing the whole landscape of word preactivation predict N400 amplitude above and beyond the predictability of the stimulus itself, our first task is to find the best operationalization of predictability to compare these to. Previous work shows that surprisal is overall a better predictor of N400 amplitude than probability is (Szewczyk & Federmeier, 2022; Yan & Jaeger, 2020), especially for the best-performing models (Michaelov & Bergen, 2022b). However, Szewczyk and Federmeier (2022), analyzing the same dataset that we analyze, found that untransformed probability can also explain additional variance in N400 amplitude, especially for higher-probability items. As a result, we use both metrics as predictors in our linear mixed-effects regressions assessing how well different language models predict N400 amplitude.
In Experiment 2, we run our tests on the predictions of the best-performing language model: GPT-J (Wang & Komatsuzaki, 2021). First, we test whether any of the distribution-dependent metrics, entropy, or constraint out-perform predictability as predictors of N400 amplitude on their own, using the overall fit of linear mixed-effects regressions. We then test whether adding any of these to regressions already including the stimulus-only predictability variables improves model fit. If so, this would suggest that they explain variance not explained by predictability, and thus would provide evidence that the amplitude of the N400 response is impacted by the effort required to inhibit the activation of words other than the eliciting stimulus itself. If not, this would add to the evidence from research on sentence constraint suggesting that only the probability of the stimulus itself impacts N400 amplitude. The collection of metrics of each type that we use has, to the best of our knowledge, not been used previously to model N400 amplitude.

Introduction
The overall purpose of the current study is to model the full landscape of neural preactivation using the probabilities predicted by language models, and to use these probability distributions to investigate whether the amplitude of the N400 response to a stimulus is sensitive not only to the extent to which it is preactivated, but also to the extent to which alternatives are preactivated. To do this, in Experiment 1, we first select a language model that makes predictions that are highly correlated with word preactivation.
Previous work shows that surprisal from transformers, the current state-of-the-art language model architecture, correlates most closely with N400 amplitude compared with other model architectures (Merkx & Frank, 2021; Michaelov et al., 2022). In fact, the surprisals calculated using some of the most powerful models tested (ALBERT, RoBERTa, and GPT-3) have been found to out-perform cloze probability as predictors of N400 amplitude on one dataset (Michaelov et al., 2022). Given that the full probability distribution of GPT-3 is not directly accessible, it is not suitable for the present study. However, in recent work by Michaelov and Bergen (2022b), a much larger selection of contemporary transformer language models (including ALBERT, RoBERTa, and a number of models released after Michaelov et al. (2022)) is evaluated in terms of how well their probability and surprisal predict N400 amplitude. Because surprisal appears to be a better predictor than probability overall, for the present study, we also include the two monolingual (i.e., trained only on English) transformer language models that generate surprisals which Michaelov and Bergen (2022b) find to be better correlated with N400 amplitude than ALBERT and RoBERTa, namely, GPT-J and OPT 6.7B. Since the publication of Michaelov and Bergen (2022b), a number of new language models have been released, and thus, we include three additional language models with a similar number of parameters as GPT-J and OPT 6.7B that have also been trained on datasets of the same order of magnitude: Pythia 6.9B (Biderman et al., 2023), Cerebras-GPT 6.7B (Dey et al., 2023), and StableLM-Base-Alpha 7B (Stability AI, 2023).
One thing that should be noted is that the set of models used comprises both autoregressive language models, those trained to predict a word based on only the preceding context; and masked language models, those trained to also predict based on the following context. In the present study, all models are only presented with the preceding context as humans were in the original N400 experiments, but it is unclear whether the fact that masked language models are also trained to 'postdict' (Huettig, 2015) makes them more or less human-like. While it would be impossible for us to use such postdictions during online comprehension, it is possible that we might still learn these reverse probabilities. Thus, in addition to the more practical question of which language model is best able to make predictions that correlate with the preactivation of neural representations during online language comprehension, the results of the present study may also shed light on what kinds of language statistics may be learned by humans.

Dataset
The experimental stimuli and N400 data used in the present study come from a large dataset recently made available online by Szewczyk and Federmeier (2022) at https://osf.io/were selected to be plausible and vary 'continuously through the full range of cloze probability' (Wlotko & Federmeier, 2012, p. 359). This experiment contributed data from 4440 trials (300 stimuli; 16 experimental participants) to the dataset. Third is a dataset from a study carried out by Hubbard et al. (2019). The stimuli in this study were 192 sentences selected from the Federmeier et al. (2007) experiment with the same 2 × 2 design: half of the sentences were high-constraint and half were low-constraint; and each sentence had either the best completion or a low-cloze completion as the critical word. The data from this experiment included 5705 trials (32 experimental participants).
The final previously published study included in the dataset is that of Szewczyk et al. (2022). The stimuli in this study were based on 168 sentence frames from previously published studies including Federmeier et al. (2007), with high- and low-cloze completions for each sentence frame. Stimuli were then expanded by adding an adjective before the completion that either increased the cloze probability of the low-cloze completion or further increased the cloze probability of the high-cloze completion. Thus there were four experimental conditions for each item, totaling 672 stimuli. Data from 4939 trials (32 experimental participants) were included from this study.
As previously discussed, the dataset also includes data from an unpublished study. The stimulus selection procedure is not mentioned in the paper (Szewczyk & Federmeier, 2022); however, looking at the data, we can see that all stimuli are present in one of the other four previously published studies, and that the stimuli are comprised of a higher-cloze (mean = 57%) and lower-cloze (mean = 1%) critical word for each sentence frame. This study contributed 4822 trials (600 stimuli; 26 experimental participants) to the dataset.
Thus, the total dataset provided by Szewczyk and Federmeier (2022) was made up of 27,762 trials (138 experimental participants). Because of the overlap in stimuli between the different experiments, the total number of unique experimental stimuli was 1330. After removing data for stimuli where critical words are not tokens in all models' vocabulary, our analysis includes data from 25,506 trials (1238 stimuli; 138 experimental participants).

Models
The details of the seven models tested are provided in Table 1. All models are pretrained transformer language models, five of which are autoregressive (trained to predict the next word given the preceding context) and two of which are masked language models (trained to predict a word given the previous and following context). Note that in this study, we present all language models with only the preceding context. We used the PyTorch (Paszke et al., 2019) versions of all models made available through the transformers (Wolf et al., 2020) Python (Van Rossum & Drake, 2009) package.

Statistical analysis
All data manipulation, statistical analyses, and graphs were carried out and produced in R (R Core Team, 2020) using RStudio (RStudio Team, 2020) and the tidyverse (Wickham et al., 2019) and lme4 (Bates et al., 2015) packages. In this paper, we report how we determined all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, and all measures in the study. The sample size and all experimental manipulations were decided by the researchers who ran the original studies comprising the dataset (Federmeier et al., 2007; Hubbard et al., 2019; Szewczyk et al., 2022; Szewczyk & Federmeier, 2022; Wlotko & Federmeier, 2012). No part of the study procedures and no part of the analyses were pre-registered prior to the research being conducted. All data, code, and statistical analyses are available at https://osf.io/jrsgh.

Results
We ran each of the preprocessed stimulus contexts through the seven language models, calculating the probability and surprisal for each critical word. We then combined this data with the single-trial ERP data provided by the original authors, using linear mixed-effects regressions to predict N400 amplitude, with each regression including the probability and surprisal calculated using each language model as predictors.
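The two predictability measures can be illustrated with a minimal sketch. It assumes a toy four-word vocabulary and made-up logits (both hypothetical, not taken from any of the seven models) and uses the natural logarithm; the base used in the original analysis is an assumption here.

```python
import math

def softmax(logits):
    """Turn raw model scores into a probability distribution over the vocabulary."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probability_and_surprisal(logits, vocab, critical_word):
    """Return the probability of the critical word and its surprisal, -log(p)."""
    probs = softmax(logits)
    p = probs[vocab.index(critical_word)]
    return p, -math.log(p)

# Hypothetical next-word scores for a context such as "She drank a cup of ..."
vocab = ["coffee", "tea", "water", "gravel"]
logits = [3.0, 2.0, 1.0, -2.0]

probability, surprisal = probability_and_surprisal(logits, vocab, "tea")
print(f"p = {probability:.4f}, surprisal = {surprisal:.4f}")
```

In the study itself the distribution comes from a pretrained transformer rather than hand-written logits; the sketch only shows how one distribution yields both predictors for a critical word.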
Following Szewczyk and Federmeier (2022), regressions also included baseline voltage, word frequency (log-transformed), concreteness, and orthographic neighborhood distance (OLD20) as covariates, all of which were provided by Szewczyk and Federmeier (2022). We also included random intercepts for each subject and sentence frame (each sentence frame in each experiment was treated as a separate sentence frame), as well as random slopes of the covariates (baseline voltage, word frequency, and orthographic neighborhood distance) for each of these. Following Michaelov et al. (2022), all variables were z-scored. In order to evaluate the performance of each metric, we compared each regression's Akaike Information Criterion (AIC) (Akaike, 1973), a metric of regression fit, where a lower AIC indicates a better fit. Results are presented in Fig. 1, where AICs are shown relative to the AIC of a baseline null model with the same predictors as the other regressions except without surprisal or probability. As can be seen, the best-performing model is GPT-J (AIC = 58549.22), followed by Pythia 6.9B (AIC = 58567.19), OPT 6.7B (AIC = 58568.82), Cerebras-GPT 6.7B (AIC = 58590.77), RoBERTa Large (AIC = 58627.10), StableLM-Base-Alpha 7B (AIC = 58708.78), and finally, ALBERT XXL (AIC = 58761.13).
Table 1 – Details of all the models used in the present study. Note that the ALBERT model uses shared parameters, and so the model is larger than the parameter counts suggest. The number of tokens for RoBERTa is estimated based on the fact that the dataset is 10 times larger than that on which ALBERT was trained.
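To make the comparison concrete, the reported AICs can be ranked and expressed as differences from the best-fitting regression. This is only an illustrative sketch: the values are copied from the text above, while the differences are computed relative to GPT-J rather than to the null model used in Fig. 1, whose AIC is not reported here.

```python
# AIC values as reported in the text (lower is better).
aic = {
    "GPT-J": 58549.22,
    "Pythia 6.9B": 58567.19,
    "OPT 6.7B": 58568.82,
    "Cerebras-GPT 6.7B": 58590.77,
    "RoBERTa Large": 58627.10,
    "StableLM-Base-Alpha 7B": 58708.78,
    "ALBERT XXL": 58761.13,
}

best = min(aic, key=aic.get)         # model whose regression fits best
ranking = sorted(aic, key=aic.get)   # best fit first
delta = {name: round(value - aic[best], 2) for name, value in aic.items()}

for name in ranking:
    print(f"{name:24s} AIC = {aic[name]:.2f}  delta = {delta[name]:.2f}")
```

Under the rule of thumb the paper cites from Burnham and Anderson (2004), a difference of 4 or more is treated as substantial, so each of the other six regressions fits substantially worse than the GPT-J regression.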

Discussion
The language model that best predicts N400 amplitude for this dataset is GPT-J, suggesting that its probability distributions most closely correlate with the preactivation state underlying the N400 response. We thus use metrics calculated from GPT-J for the remainder of our analyses. The results of this experiment differ from the single-token results of Michaelov and Bergen (2022b) in that all but one of the autoregressive models tested here (StableLM-Base-Alpha 7B) performed better than the masked language models. It should be noted, however, that this result is in line with Michaelov and Bergen's (2022b) findings when analyzing the performance of language models at predicting N400 amplitude for stimuli including those made up of more than one token. Given this and the far larger number of experimental stimuli in the present study (1238 stimuli with single-token critical words compared to 37 single-token critical words and even 160 total critical words in Michaelov & Bergen, 2022b), it is likely that the results of the present study are more representative of the performance of the models at predicting N400 amplitude. Whether this is because the autoregressive architecture is more human-like or because the autoregressive models were trained on far more data than the masked language models is a question for future research.

Experiment 2
Equipped with a best-performing language model, we can now address the main research question, namely, whether the preactivation of possible stimuli other than the stimulus that elicits the N400 response can impact the amplitude of the response. To do this, we select a number of metrics that reflect the difference between the true and predicted probability distributions (that is, distribution-dependent metrics) as calculated using GPT-J.
Many metrics relating the predicted and observed probability distributions across words were unsuitable for our analysis. Some, as discussed earlier, are linearly related to a metric of stimulus-dependent predictability. For example, total variation distance (as given in Gibbs & Su, 2002) is equivalent to half of the L_1 distance between the two distributions and thus is linearly related to probability. Similarly, because they involve element-wise multiplication between the distributions, Rényi divergence (as given in van Erven & Harremoës, 2014) and Bhattacharyya distance (as given in Jain, 1976) simplify such that they become the logarithm of the stimulus probability multiplied by a constant, and thus, are directly proportional to surprisal. Other metrics are incalculable because in the true probability distribution, all words have a probability of zero with the exception of the true stimulus, which has a probability of 1. Because the zeros in the true distribution are meaningful, we do not use smoothing, and thus, we do not use any metrics that would involve dividing by or taking the logarithm of zero, e.g., Kullback-Leibler divergence in the opposite direction or information radius (as given in Manning & Schütze, 1999). We therefore selected two metrics that were both calculable and not linearly related to predictability: χ² distance and Hellinger distance. Beyond the aforementioned restrictions on suitable metrics, these specific metrics were not in themselves chosen for any theoretical reason beyond reflecting a difference between the true and predicted probability distributions. As discussed, the aim of the study is to test whether there is an effect of the full probability distribution on N400 amplitude at all rather than necessarily to precisely characterize such an effect. If either χ² or Hellinger distance successfully operationalizes the difficulty of inhibiting false predictions, then we should expect a negative correlation between the metric and N400 amplitude, indicating a stronger N400 response when there is a greater difference between the true and predicted probability distributions.
Other metrics were selected based on the theoretical perspective presented by Fitz and Chang (2019), which considers the probability distributions generated by predictive models to reflect the relative differences in preactivation between candidate stimuli, but also considers that these need not be meaningful as probabilities in themselves. Fitz and Chang (2019) operationalize the difference in the activation across all words before and after encountering a stimulus as L_1 distance; but as discussed, this is only dependent on the probability of the true stimulus itself. However, this is not the case for other L_k distance metrics. It may be the case, for example, that L_1 distance underestimates the extent to which lower-probability false predictions impact N400 amplitude, something which could be tested using a fractional L_k distance (in fact, fractional L_k norms are generally argued to be preferable for high-dimensional data; see Aggarwal et al., 2001). Conversely, if it is relatively more difficult to inhibit higher-probability false predictions than is operationalized by L_1 distance, it may be that an L_k distance with k > 1 is a more suitable way to operationalize this. In the present study, we test one of each of these: L_0.5 and L_2 distance.
In addition to L_k distance, we also choose another distance metric that has had a large degree of success as a metric of the distance between two vectors in computational linguistics and psycholinguistics (Chwilla & Kolk, 2005; Dumais et al., 1988; Deerwester et al., 1990; Ettinger et al., 2016; Landauer et al., 1998; Mikolov, Chen, et al., 2013; Mikolov, Sutskever, et al., 2013; Parviz et al., 2011; Van Petten, 2014): cosine distance. As with other distribution-dependent metrics, if L_k or cosine distance successfully models the effect of inhibition on N400 amplitude, we should expect a negative correlation between the two, with a greater distance between the true and predicted probability distribution resulting in a stronger N400 response.
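The claim that several candidate metrics reduce to functions of the stimulus probability alone can be checked with a short derivation. It assumes the standard textbook definitions of each quantity (the normalization in the cited versions may differ), a one-hot true distribution p with p(w_i) = 1, and a predicted distribution written as p-hat; the Rényi line holds for any order α > 0 with α ≠ 1.

```latex
% p(w_i) = 1 and p(w) = 0 for every other word w; \hat{p} is the predicted distribution.
\begin{align*}
L_1(p, \hat{p})
  &= \sum_{w} \lvert p(w) - \hat{p}(w) \rvert
   = \bigl(1 - \hat{p}(w_i)\bigr) + \sum_{w \neq w_i} \hat{p}(w)
   = 2\bigl(1 - \hat{p}(w_i)\bigr) \\
\mathrm{TV}(p, \hat{p})
  &= \tfrac{1}{2}\, L_1(p, \hat{p})
   = 1 - \hat{p}(w_i) \\
D_B(p, \hat{p})
  &= -\log \sum_{w} \sqrt{p(w)\,\hat{p}(w)}
   = -\tfrac{1}{2} \log \hat{p}(w_i) \\
D_\alpha(p \,\|\, \hat{p})
  &= \frac{1}{\alpha - 1} \log \sum_{w} p(w)^{\alpha}\, \hat{p}(w)^{1-\alpha}
   = -\log \hat{p}(w_i)
\end{align*}
```

The first two are linear in probability and the last two are constant multiples of surprisal, which is why none of them can add information beyond the stimulus-only predictors.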
We also compare these metrics (that to the best of our knowledge have not previously been used to predict N400 amplitude), with both constraint and entropy, also calculated from GPT-J. For constraint, we record the probability of the highest-probability continuation in a given context, analogous to the Best Completion calculated with cloze probabilities. To account for the possibility of a logarithmic linking function between constraint and the N400 (as there appears to be for predictability), we also convert these probabilities into surprisal, and test both metrics.

Data
For this experiment, we used experimental data from all stimuli in the dataset that have critical words that are in the vocabulary of the GPT-J language model (i.e., the data from all single-token critical words). Because we include constraint as a metric in our analysis, we also restrict our analysis to stimuli that are not the best completions in their context, following Federmeier et al. (2007); that is, we exclude cases where the surprisal variant of the constraint metric is identical to stimulus surprisal. Our analysis thus includes data from 17,892 trials (873 stimuli; 138 experimental participants). Note that these exclusion criteria were decided before the analyses were carried out.
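The best-completion exclusion can be sketched as a simple filter. The item identifiers, words, and probabilities below are hypothetical placeholders, and the data layout is an assumption made for illustration rather than the structure of the released dataset.

```python
def is_best_completion(predicted, critical_word):
    """True when the critical word is the highest-probability word in its context."""
    return max(predicted, key=predicted.get) == critical_word

# Hypothetical items: each pairs a critical word with a predicted distribution.
items = [
    {"id": "hypothetical-1", "critical": "tea",
     "predicted": {"coffee": 0.60, "tea": 0.30, "gravel": 0.10}},
    {"id": "hypothetical-2", "critical": "coffee",
     "predicted": {"coffee": 0.60, "tea": 0.30, "gravel": 0.10}},
]

# Keep only stimuli whose critical word is not the best completion, so that
# constraint (the best completion's probability) differs from stimulus probability.
kept = [item for item in items
        if not is_best_completion(item["predicted"], item["critical"])]
print([item["id"] for item in kept])
```

Without this filter, constraint and stimulus probability would be identical on every best-completion trial, making their contributions impossible to separate there.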

Metrics
All metrics used in this analysis are defined in Table 2. The correlations between all metrics are shown in Fig. 2.
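As a rough illustration of how such metrics behave, the sketch below implements commonly used definitions of each distance against a one-hot true distribution. These are assumptions: the exact equations and normalizations in Table 2 may differ (for instance, some definitions of Hellinger distance include a factor of 1/2), and the two toy distributions are hypothetical.

```python
import math

def one_hot(vocab, word):
    """True distribution: probability 1 for the presented word, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def lk_distance(p, q, k):
    """L_k distance; k = 0.5 gives the fractional variant, k = 2 the Euclidean one."""
    return sum(abs(a - b) ** k for a, b in zip(p, q)) ** (1.0 / k)

def hellinger_distance(p, q):
    """Hellinger distance without the optional 1/2 normalization (an assumption)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def chi2_distance(p, q):
    """Chi-squared distance with the predicted distribution q in the denominator.
    Assumes every predicted probability is strictly positive, as after a softmax."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def cosine_distance(p, q):
    """One minus the cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

def entropy(q):
    """Shannon entropy of the predicted distribution (natural log)."""
    return -sum(b * math.log(b) for b in q if b > 0)

def constraint(q):
    """Probability of the highest-probability continuation."""
    return max(q)

vocab = ["coffee", "tea", "water", "gravel"]
true_dist = one_hot(vocab, "tea")
peaked = [0.60, 0.30, 0.08, 0.02]   # hypothetical: one strong competitor
flat = [0.25, 0.30, 0.25, 0.20]     # hypothetical: same p("tea"), spread-out competitors

for name, q in (("peaked", peaked), ("flat", flat)):
    print(name,
          round(lk_distance(true_dist, q, 1), 4),
          round(lk_distance(true_dist, q, 2), 4),
          round(lk_distance(true_dist, q, 0.5), 4),
          round(cosine_distance(true_dist, q), 4))
```

Both toy distributions assign the same probability to the presented word, so their L_1 distances coincide, while L_2, L_0.5, and cosine distance differ because they also depend on how probability is spread over the alternatives.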

Results
First, we compared how well each of the metrics performs compared to surprisal, probability, and an overall predictability regression that includes both variables. We compared the AIC of linear mixed-effects models with each metric as a predictor and with the same covariates and random effects structure as those in Experiment 1, where, as in Experiment 1, all variables were z-scored. The results can be seen in Fig. 3, which shows that the aggregate predictability regression best fits the N400 data, followed by (in order of increasing AIC, and thus, decreasing fit) surprisal, Hellinger distance, probability, cosine distance, L_2 distance, constraint operationalized as probability, and χ² distance. On their own, constraint operationalized as surprisal, entropy, and L_0.5 distance appear to reduce model fit, compared to a model including just the covariates and random effects structure. This result demonstrates that no distribution-dependent metric is a better predictor of N400 amplitude than a combination of surprisal and probability, or even surprisal alone. However, the question we seek to address is whether these variables can explain any variance in N400 amplitude not explained by predictability alone. Thus, in the final, critical step, we test whether adding any of the distribution-dependent metrics to the predictability regression improves fit. The results are shown in Fig. 4. As can be seen, the only metric that improves model fit numerically if added to the regression is cosine distance; the rest decrease model fit. However, as discussed in Experiment 1, generally only a difference in AIC of 4 or more is considered to reflect a substantial difference in model fit (Burnham & Anderson, 2004), suggesting that the improvement due to cosine distance is not meaningful.
Table 2 – The names of the metrics used in the present study and the equations used to calculate them. All equations are based on the version given in the citation, but have been adapted for consistency. p̂ refers to the predicted probability, p to the true probability (i.e., 0 or 1), w_i to the critical word, and w_BC to the best completion (i.e., the word with the highest probability in a given context).
Metric Name | Equation | Citation
Surprisal | −log(p̂(w_i)) | Levy (2008)
… | … | Gibbs and Su (2002)
Hellinger Distance | … | Gibbs and Su (2002)
Cosine Distance | 1 − … | …
In order to test directly and to verify whether there is indeed a lack of improvement from adding the other metrics, we run likelihood ratio tests comparing the predictability regression with regressions also including each distribution-dependent variable. We find that cosine distance does not improve model fit [χ²(1) = 3.0969, p = .0784], and neither does χ² distance [χ²(1) = 1.8036, p = .1793], entropy [χ²(1) = .5557, p = .4560], L_0.5 distance [χ²(1) = .4025, p = .5258], Hellinger
Fig. 2 – The Pearson correlation r between all variables of interest in our study for all critical words that were single tokens for GPT-J.
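The relationship between the reported test statistics and p-values can be reproduced with a short sketch. It assumes each likelihood ratio test has one degree of freedom (one added predictor), in which case the upper tail of the χ² distribution has a closed form in terms of the complementary error function; the function names are illustrative and not from the original analysis code, which was written in R.

```python
import math

def lrt_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic: twice the gain in log-likelihood of the fuller model."""
    return 2.0 * (loglik_full - loglik_reduced)

def lrt_p_value_df1(chi2_stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    Uses P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z."""
    return math.erfc(math.sqrt(chi2_stat / 2.0))

# Test statistics as reported in the text.
reported = {
    "cosine distance": 3.0969,
    "chi-squared distance": 1.8036,
    "entropy": 0.5557,
    "L_0.5 distance": 0.4025,
}

p_values = {name: lrt_p_value_df1(stat) for name, stat in reported.items()}
for name, p in p_values.items():
    print(f"{name:22s} chi2(1) = {reported[name]:.4f}  p = {p:.4f}")
```

None of the resulting p-values falls below the conventional .05 threshold, matching the conclusion that adding these predictors does not significantly improve fit.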
Second, like Szewczyk and Federmeier (2022), we find that including untransformed probability as a predictor in addition to surprisal improves fit to the N400 data in this dataset. However, we extend this finding to also include GPT-J, a model that appears to calculate probabilities that more closely correlate with N400 amplitude both when used directly and transformed into surprisal (Michaelov & Bergen, 2022b) compared to GPT-2 (Radford et al., 2019), the model used by Szewczyk and Federmeier (2022). Finally, as in previous work, neither constraint (Federmeier, 2007; Federmeier et al., 2002, 2007; Otten & Berkum, 2008; Van Petten et al., 1999; Vissers et al., 2006; Wlotko & Federmeier, 2007) nor entropy (Stone et al., 2021) predicts N400 amplitude above and beyond predictability. Crucially, our study extends these findings to probabilities derived from language models in addition to cloze probability.
In this experiment we set out to investigate whether the preactivation of stimuli other than the actually occurring stimuli impacts the amplitude of the N400 response using metrics operationalizing the difference between the true distribution for each critical word and the distribution predicted by GPT-J. We found that neither the variables that treat this difference as a difference between probability distributions (χ² distance and Hellinger distance) nor the metrics that treat it as the distance between two vectors (cosine distance, L_0.5 distance, and L_2 distance) explain any variance in N400 amplitude not explained by predictability alone, as operationalized by probability and surprisal.
Fig. 3 – The AICs of all regressions including a single metric of interest as a predictor, as well as one including both predictability metrics (probability and surprisal).
Fig. 4 – The AICs of all regressions including a single metric of interest as a predictor, as well as one including both predictability metrics (probability and surprisal).

General discussion
It has long been widely believed (with a few exceptions, e.g., Debruille, 2007; Fitz & Chang, 2019; Hoeks et al., 2004) that the N400 is only sensitive to the preactivation of the stimulus that it is elicited by, and not the rest of the landscape of activation elicited by its context. This premise forms the basis of the majority of contemporary accounts of the effect (e.g., Brouwer et al., 2012; Brouwer & Hoeks, 2013; Delogu et al., 2019; DeLong et al., 2014; Federmeier, 2021; Kuperberg et al., 2020; Kuperberg & Jaeger, 2016; Kutas et al., 2011; Van Petten & Luka, 2012). But, as discussed in section 1, this has never been fully tested: previous work has looked at constraint (Federmeier, 2007; Federmeier et al., 2002, 2007; Otten & Berkum, 2008; Van Petten et al., 1999; Vissers et al., 2006; Wlotko & Federmeier, 2007), or in one more recent study, entropy based on the words generated by the cloze task (Stone et al., 2022). In both cases, the approaches only consider a small subset of the full landscape of preactivation at the time when the stimulus is encountered: in the case of constraint, only the extent to which the most predictable word is expected, and in the case of the cloze-derived entropy study (Stone et al., 2022), the degree to which at most 8 predictable words are expected. Thus, prior to the current study, a key link in the derivation chain was weak. Do metrics that consider the full probability distribution predict variance in the amplitude of the N400 not captured by metrics that consider only the probability of the stimulus itself? Our results suggest that they do not: no distribution-dependent metric on its own predicts N400 amplitude better than surprisal, and like constraint and entropy, none of the distribution-dependent metrics explain a significant amount of the variance in N400 amplitude above and beyond that explained by predictability alone.
In our experiments, no distribution-dependent metric significantly predicts N400 amplitude once predictability has been accounted for. In addition, no individual distribution-dependent metric is a better predictor of N400 amplitude than surprisal. These results are consistent with the account that the amplitude of the N400 response is dependent only on the extent to which the stimulus itself was preactivated. The present study is the first to directly test whether the full distribution of preactivation can impact N400 amplitude. The finding that no distribution-dependent metric better correlates with N400 amplitude than surprisal (which only reflects the preactivation of the stimulus itself) suggests that the extent to which a word is preactivated is still the best predictor of N400 amplitude; and this is further strengthened by the fact that no distribution-dependent metric explains variance not explained by either surprisal or probability. Thus, the derivation chain is strengthened, and we can more confidently make inferences directly from N400 effects about the degree to which the neural representations associated with given stimuli are activated before they are encountered. It is therefore possible to investigate exactly which factors impact and modulate this; one example is the line of research investigating whether the amplitude of the N400 response, and hence, preactivation, is sensitive to the animacy features of entities under discussion (Kim & Osterhout, 2005; Kuperberg, 2007; Kuperberg et al., 2003; Nieuwland et al., 2013; Nieuwland & Van Berkum, 2005; Paczynski & Kuperberg, 2011, 2012; Szewczyk & Schriefers, 2011, 2013; Vega-Mendoza et al., 2021; Wang et al., 2020).

Surprisal and predictive coding
The research carried out in the present study is compatible with most contemporary accounts of the N400. However, as noted in section 3, a strong interpretation of the study and results uses the predictive coding framework, under which the neurocognitive system responsible for the preactivation underlying the N400 response is a predictive system (Bornkessel-Schlesewsky & Schlesewsky, 2019; Kuperberg et al., 2020; Lewis & Bastiaansen, 2015). As shown in the current work, language models can serve as computational-level cognitive models of at least part of this proposed system. The results of the present study also provide evidence to support the predictive coding account of the N400. Under a predictive coding account, the functional significance of neural metrics of processing difficulty is twofold: the new activation is information that allows the current stimulus to be correctly processed by the system; and the new activation is a learning signal (Clark, 2013; Huang & Rao, 2011; Rao & Ballard, 1999). In the language domain, this learning signal is thought to allow the neurocognitive system underlying language comprehension (and under some accounts also production, see, e.g., Fitz & Chang, 2019; Kuperberg et al., 2020) to learn and adapt, either long-term as part of continual language learning, or to a specific situation (Bornkessel-Schlesewsky & Schlesewsky, 2019; Hodapp & Rabovsky, 2021; Kuperberg et al., 2020).
Cortex 168 (2023) 82–101
While all metrics tested in the present study could conceivably fulfill both of these roles, it is striking that surprisal, the best-performing metric, also seems best suited to fulfilling the role of learning signal. As discussed, when comparing the true and predicted probabilities generated by language models, surprisal is equivalent to cross-entropy. This is interesting because cross-entropy is precisely the loss function used to train virtually all language models (Jurafsky & Martin, 2023). In other words, if we were to determine what would be the best loss function for a neurocognitive system engaging in lexical prediction to use, based on current research, it would be cross-entropy, and thus, surprisal. For this reason, the fact that surprisal is the metric most correlated with N400 amplitude is striking. In this way, our results provide indirect evidence to support the predictive coding account of the N400.
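The equivalence between surprisal and cross-entropy follows in one step from the definition of cross-entropy, given that the true distribution p places all of its mass on the presented word w_i. The notation matches that of Table 2, with the predicted distribution written as p-hat:

```latex
% p(w_i) = 1 and p(w) = 0 for every other word w, so only one term of the sum survives.
\begin{equation*}
H(p, \hat{p})
  = -\sum_{w} p(w) \log \hat{p}(w)
  = -\log \hat{p}(w_i)
\end{equation*}
```

Every term for a word other than w_i is multiplied by zero, so the cross-entropy loss on a single token is exactly the surprisal of that token.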

Mechanistic implications
Predictability alone explaining variance in N400 amplitude is consistent with two specific mechanistic accounts of how the preactivation that occurs as part of online language comprehension is indexed by the N400 response.
The first is that the processing difficulty indexed by the N400 is only due to the activation of the neural representations associated with the stimulus that were not already activated due to the preceding context. That is, the amplitude of the N400 response is not just stimulus-dependent, but also only reflects this stimulus-driven activation. This is in line with most contemporary accounts of the N400 (DeLong et al., 2014; DeLong & Kutas, 2020; Federmeier, 2021; Kuperberg et al., 2020; Kuperberg & Jaeger, 2016; Kutas et al., 2011; Kutas & Federmeier, 2011; Van Petten & Luka, 2012). So what happens to words that are preactivated but not encountered? One possibility is that the metabolic resources required for preactivation (see, e.g., Brothers & Kuperberg, 2021; Levy, 2008) are constantly required to be expended to maintain preactivation, and thus, simply stopping doing so is enough to suppress them. Alternatively, there may not be any active suppression or inhibition at all: the evidence suggests that highly probable words that are not presented as stimuli can remain activated over the course of an experiment (Rommers & Federmeier, 2018).
The other mechanistic account consistent with the results is that inhibition does indeed contribute to the processing difficulty indexed by the N400 response, but that the effort required to do this is dependent on the extent to which the stimulus was preactivated. Under such an account, it is simply the case that surprisal, or another metric that is only dependent on the preactivation state of the stimulus, mathematically expresses the combined processing difficulty of activating the representations associated with the stimulus and inhibiting others. Indeed, given the number of metrics of the difference between the true and predicted probability distributions that simplify to a stimulus-dependent metric (Kullback-Leibler divergence, Rényi divergence (a generalization of Kullback-Leibler divergence), Bhattacharyya distance, total variation distance, and L_1 distance), perhaps it would not be surprising if this were the case. This idea is in line with the account of Hale (2001), who envisions surprisal as reflecting the difficulty of disconfirming predictions, and perhaps implicitly in line with the account of Fitz and Chang (2019), who argue that N400 amplitude reflects the activation and inhibition effort and present L_1 distance as the metric to express this, which, as we show, is a stimulus-dependent metric. If this is the case, however, it does not diminish the importance of determining whether the amplitude of the N400 response is sensitive to the preactivation of the stimulus only or to the whole distribution (i.e., the whole landscape of activation in long-term memory). The weak link in the derivation chain has still been strengthened (we can be more comfortable in using the N400 to understand exactly how much a given stimulus was preactivated under one experimental condition relative to another), but further work would need to be carried out to investigate exactly to what extent the activation and inhibition contribute to the final amplitude measured.

Conclusions
In this study, we used computational methods to investigate the question of whether the amplitude of the N400 response to a word is impacted only by the degree to which the word was preactivated or also by the entire landscape of activation elicited by the preceding context. We found that across the data from the five experiments modeled, surprisal was the best single predictor of N400 amplitude. Furthermore, no metrics reflecting the extent to which words other than the stimulus were preactivated explained any variance in N400 amplitude beyond that explained by surprisal and probability. This result supports the idea that N400 amplitude is only sensitive to the degree to which the stimulus itself was preactivated at the point at which it was encountered. Based on this and another property of surprisal (its equivalence with cross-entropy for language model predictions), we argue that the results of the present study support a predictive coding account of the N400.

Human studies statement
No experiments involving human subjects were carried out by the authors for the study reported in this manuscript. The study involves data collected from experiments using computational language models and the reanalysis of published data.
(Note: The original N400 data were gathered and preprocessed by the original authors of the respective studies whose data are re-analyzed in the present study. The above CRediT roles apply only to the work carried out as part of the present study and not these previously published experiments.)

Open practices
The study in this article earned the Open Material badge for transparent practices. The materials used in this study are available at: https://osf.io/jrsgh.

Declaration of competing interest
None.

Fig. 1 – AICs of regressions including the probability and surprisal calculated from the indicated model as predictors. A lower AIC indicates a better fit.