Speech register influences listeners’ word expectations

We used the N400 effect to investigate the influence of speech register on predictive language processing. Participants listened to long stretches (4-15 min) of naturalistic speech from different registers (dialogues, news broadcasts, and read-aloud books), totalling approximately 50,000 words, while the EEG signal was recorded. We estimated the surprisal of words in the speech materials with a statistical language model in such a way that it reflected different predictive processing strategies: generic, register-specific, or recency-based. The N400 amplitude was best predicted by register-specific word surprisal, indicating that the statistics of the wider context (i.e., register) influence predictive language processing. Furthermore, adaptation to speech register cannot be explained merely by recency effects; instead, listeners adapt their word anticipations to the presented speech register.


Introduction
Human perception of sensory input involves more than passive registration. A rich body of research (e.g., Bar, 2007; Friston, 2005, 2012, 2018; De Lange, 2018) shows that prediction is a core aspect of perception. Similarly, humans engaged in reading or listening show sensitivity to the statistical structure of the language input (e.g., Ellis, 2002). Importantly, as studies investigating register variation show (e.g., Staples et al., 2015), patterns of language use differ extensively between registers, influencing the statistical distributions of words of the different varieties (e.g., Bentum et al., 2019). Consequently, expectations on the occurrences of words that are valid for one register might be invalid for a different register. In the current study, we utilize the N400 effect to investigate whether listeners adapt their word expectations as a function of the speech register they are listening to.

Register variation
The three examples below illustrate a range of registers: chatting with friends (1), coverage from a news reporter (2) and a novelist telling a story (3).
(1) It just irritated me and then Joanne, Joanne's like "did you hear someone page Dan's brother-in-law?" I said "he wouldn't give his name." And she just started laughing. (Barbieri, 2005).
(2) The leader's gunshot wounds are taking their toll, complicating efforts to persuade him to surrender. (Biber, 1999).
(3) Last summer, a short time before my son was due to leave home for college, my wife woke me in the middle of the night. (Nicholls, 2014).
The examples illustrate that language use varies in relation to the communicative context (Borrillo, 2000) and purpose (Biber & Conrad, 2001): Conversational speech is produced in real time, without much time to prepare; disfluencies are prevalent, and it is characterized by a lower type-token ratio and a frequent use of pronouns. News reportage is typically prepared and intended to convey information about a certain event, which results in frequent use of time and place adverbials as well as proper nouns. A novel is typically written with ample time to revise and refine the language use, affording a rich vocabulary and complex sentence structure.
Differences in language use between registers are well documented (see Biber & Conrad, 2009, for an overview). For example, word choice differs between registers (Biber, 1999): the use of like in (1) is typical of informal conversation (Barbieri, 2005). Grammatical constructions vary as well (Staples et al., 2015), such as the retention of the complementizer in that-clauses, as in (4), which differs between conversational speech and academic prose. In conversation, that-omission is typical, while academic prose typically retains it (Biber, 1999).
The lexical and grammatical variation between registers gives rise to register-specific word co-occurrence statistics. Bentum et al. (2019a) indeed found that word predictability differs between speech registers. The probability of a word thus not only depends on the directly preceding words but also on the wider context of register. This raises the research question of the current paper, namely, whether listeners adapt their word expectations based on the register of the speech input.

Predictive language processing and the N400 effect
Evidence for predictive language processing is well established in the literature (see, e.g., Elman, 2009; Huettig, 2015; Kuperberg & Jaeger, 2016; Pickering & Gambi, 2018, for overviews). Importantly, there is converging evidence from many different experimental paradigms. For example, self-paced reading studies show that unlikely words are read more slowly compared to more likely words (Rayner, 1998; Kliegl et al., 2006). The visual world paradigm used in eye-tracking studies shows that listeners more often gaze in anticipation at a picture that matches the verb of the sentence, among multiple objects (e.g., they more often look at a picture of a cake when they hear The boy eats compared to The boy moves; Altmann & Kamide, 1999).
The so-called N400 effect is also associated with anticipatory language processing (see Kutas & Federmeier, 2011, for an overview). The N400 is a negative deflection of the event-related potential (ERP), which peaks about 400 ms after word onset at central posterior electrode sites. When participants read short sentences, such as (5) and (6), with occasionally an anomalous final word, as in (6), the semantically incongruous word (here socks) results in a more negative deflection of the ERP compared to the congruent word (here work; Kutas & Hillyard, 1980). Later experiments revealed that semantic incongruency is not required for an N400 effect (e.g., Hagoort & Brown, 1994). For example, constraining sentence pairs such as (7), which raise a strong expectation for a specific word (i.e., palms), elicit a graded N400 effect, with the unexpected but semantically related word pines resulting in an attenuated N400 amplitude compared to the unexpected and unrelated tulips (Federmeier & Kutas, 1999). Importantly, the different sentence-final words (i.e., palms, pines and tulips) are all possible non-anomalous endings, indicating the N400 effect is not dependent on semantic anomaly.
(7) They wanted to make the hotel look more like a tropical resort.
So, along the driveway, they planted rows of [palms / pines / tulips].
Several experiments document that the N400 also provides evidence for anticipatory activation of words (e.g., Wicha et al., 2004; Van Berkum et al., 2005). These experiments use a paradigm where the anticipatory effects are measured before the expected word is presented. For example, when participants read sentence (8), the determiner an resulted in a more negative deflection of the N400 waveform compared to a, indicating that readers expected the following word to start with a consonant (DeLong et al., 2005; see also Yan et al., 2017, and Nieuwland et al., 2018). Furthermore, DeLong et al. found that word predictability (as estimated with a cloze test) correlates with the N400 amplitude, indicating that people generate probabilistic expectations of upcoming language input.
(8) The day was breezy so the boy went outside to fly [a kite / an airplane] ….
The findings for the determiner reported in DeLong et al. (2005) failed to be replicated in a large-scale study conducted by nine different labs, reported in Nieuwland et al. (2018). Nieuwland et al. (2018) argue that this failure to replicate challenges an empirical cornerstone of the 'strong prediction' view (i.e., people pre-activate words at all levels of representation in a routine and implicit fashion, and pre-activation is thus not limited to a word's meaning but includes its grammatical features and orthographic or phonological form). Yan et al. (2017) discuss both DeLong et al. (2005) and a preprint version of Nieuwland's study and argue that the findings from this replication study can also be interpreted as in line with a prediction account; for example, the correlation between cloze values and the N400 was replicated. We interpret the combined literature (see also Wicha et al., 2004; Van Berkum et al., 2005) as supportive of anticipatory processing, especially because new evidence supporting phonological pre-activation can be found in Bentum et al. (2019b; see also Poulton & Nieuwland, 2022, for a critical view).
Despite the wealth of evidence, predictive language processing remains (to some extent) controversial. For example, Huettig (2015) notes that much evidence for prediction is based on studies that only use the extremes of predictability and questions whether prediction plays an important role during natural language perception across the entire range of word probabilities. For example, the N400 effect is typically elicited by comparing highly likely versus highly unlikely words (e.g., Hagoort & Brown, 2000), which does not reflect normal language use.
We follow Kuperberg & Jaeger (2016) and use prediction to mean graded probabilistic prediction, whereby multiple candidates (e.g., words) have probabilities assigned based on the preceding context. In our interpretation, predictive language processing at the lexical level can be conceived of as generating a probability distribution over all words in the mental lexicon. Consequently, there will (almost) always be prediction error since not all probability is assigned to a single word. This prediction error is indexed by the N400-effect. For example, it is well attested that even with highly constraining sentences (e.g., Federmeier & Kutas, 1999), words other than the expected word show an attenuated N400 compared to unrelated words.

Word predictability estimation, cloze tests, and statistical language modeling
Word predictability is typically established with cloze tests, whereby participants fill in blanks in sentences, such as So, along the driveway, they planted rows of … The percentage of participants that fill in a specific word, such as palms, is referred to as the word's cloze probability. This percentage measures the 'expectedness' of a word given the context. This approach has two drawbacks: It is labor intensive to gather cloze probabilities and the method cannot distinguish among the predictability of low cloze probability words (Yan et al., 2017).
A different approach to estimating word predictability is statistical language modelling. Work on statistical language modelling shows that, given a set of n preceding words, it is possible to assign a probability to each word to be the next word (e.g., Chen & Goodman, 1999; Och & Ney, 2003; Kilgarriff, 2001). In its most basic implementation, a statistical language model (SLM) is based on counting 'word n-grams' (henceforth n-grams) in corpora. An n-gram is a sequence of n consecutive units. For example, the fast horse is a word trigram with the bigrams the fast and fast horse, and the unigrams the, fast and horse. Based on counts of these n-grams in a large body of text, context-dependent word probabilities can be estimated: $P(W_i \mid W_{i-n}, \ldots, W_{i-1})$ denotes the conditional probability of word $W_i$ given a sequence of n preceding words. The automation of word predictability estimation allows for the investigation of predictability effects for many words across the whole predictability spectrum. For example, Smith & Levy (2013) used this approach to determine that reading time is log-linearly related to the probability of a word on the basis of a dataset of approximately 50,000 words.
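The counting step described above can be illustrated with a minimal sketch. It uses unsmoothed relative frequencies on a toy English word list; the function names and the toy corpus are illustrative assumptions and not part of the materials used in this study.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every sequence of n consecutive tokens (stored as tuples)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(tokens, context, word):
    """Unsmoothed estimate of P(word | context) as a ratio of n-gram counts."""
    context = tuple(context)
    full_counts = ngram_counts(tokens, len(context) + 1)
    context_counts = ngram_counts(tokens, len(context))
    if context_counts[context] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return full_counts[context + (word,)] / context_counts[context]

# Toy corpus (hypothetical): "the fast" occurs twice, once followed by "horse".
corpus = "the fast horse saw the fast car and the slow horse".split()
p = mle_probability(corpus, ["the", "fast"], "horse")  # 1 / 2 = 0.5
```

Raw count ratios assign zero probability to any unseen sequence, which is why practical SLMs add smoothing on top of this counting scheme.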
The log-linear relation between word probability and reading time fits well with Surprisal Theory of language processing (Hale, 2001; Levy, 2008), according to which language processing costs relate to surprisal. Surprisal is an information-theoretic measure that captures the amount of Shannon information an item (i.e., word) in a message conveys. It is defined as the negative logarithm of the probability of a word given its pre-context and can informally be thought of as the 'unexpectedness' of a word given its pre-context. Frank et al. (2015) used statistical language modelling to estimate word surprisal for all content words in sentences from several novels. In this manner they could analyze a large set of approximately 30,000 word tokens. They used these sentences in an EEG experiment. Participants read sentences word-by-word while their EEG was recorded. Less expected words (i.e., words yielding high surprisal) elicited a larger negative amplitude in the N400 time window compared to more expected words. Pickering & Gambi (2018) argue that surprisal effects (e.g., Smith & Levy, 2013; Frank et al., 2015) do not constitute evidence of predictive language processing, since surprisal and the experimental effects correlated with it are measured on the perceived word. We disagree with this interpretation. Word surprisal is based on the preceding context (i.e., words), by computing the probability for each word in a lexicon to follow that context. The surprisal value is therefore not a static attribute of a specific word but a value derived from preceding context with respect to all words in the lexicon. The prediction consists of distributing probability over all words in the lexicon. In the 'activation' vernacular, each word in the mental lexicon is 'pre-activated' to the extent the context makes the word a probable continuation.
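As a small worked illustration of the definition above, the sketch below converts a next-word probability distribution into surprisal values. The probabilities are invented for illustration, and the base-2 logarithm (surprisal in bits) is an assumption; the definition itself does not fix the base.

```python
import math

def surprisal(probability):
    """Surprisal in bits: the negative base-2 logarithm of a probability."""
    return -math.log2(probability)

# Hypothetical distribution over continuations of "... they planted rows of"
next_word = {"palms": 0.5, "pines": 0.25, "tulips": 0.125, "<other>": 0.125}

surprisals = {word: surprisal(p) for word, p in next_word.items()}
# palms: 1.0 bit, pines: 2.0 bits, tulips: 3.0 bits
```

Because the probabilities sum to one over the whole lexicon, every word receives some surprisal value before it is heard, which is the sense in which the distribution constitutes a graded prediction.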

Discourse based ERP research
Most ERP studies investigating language processing use sentences presented in isolation. However, there have been discourse-level studies, whereby discourse is typically interpreted as anything more than one sentence, for example, short narratives such as (9 & 10).
(9) The brave knight saw that the dragon threatened the benevolent sorcerer.
He quickly reached for a [sword / lance] …. (10) The benevolent sorcerer saw that the dragon threatened the brave knight.
The short narratives were carefully matched on prime words. In the examples (9) and (10), only brave knight and benevolent sorcerer switched position. The sentence He quickly reached for a … by itself does not constrain in favor of either sword or lance. The preceding sentence in (9) favors sword while in (10) it does not. The unexpected words (e.g., lance in (9)) resulted in a larger N400 effect than the expected words (e.g., sword in (9)), whereas the words elicited similar N400s in less constraining sentences (e.g., 10) (Otten & Van Berkum, 2007). This and other results (see Van Berkum, 2012 for an overview) show that readers and listeners use the wider context of discourse to build up predictions of upcoming input.
One understudied aspect of predictive language is the effect of discourse beyond multi-sentence short narratives. In more natural communication situations, readers or listeners are engaged with reading or listening within a much wider context, which is itself modulated by the register. In the following section, we explain how we studied the influence of register variation on listeners' language processing.

Current study
In the current study we investigate whether listeners' word anticipations are influenced by speech register information. We test long stretches (4-15 min) of natural speech from different registers. Following Frank et al. (2015), we use statistical language modeling to estimate the word surprisal of all content words in our language materials and use word surprisal to predict the N400 amplitude for the content words. We estimate and compare four different 'types' of word surprisal: register-specific, register-mismatch, generic, and recency-based word surprisal.
The different 'types' of word surprisal reflect different processing strategies, which we compare to investigate the role of register in predictive language processing. Register-specific surprisal reflects the word predictability in a specific register. We hypothesize that if listeners adapt their word expectations based on register information, this register-specific surprisal will best predict the N400 amplitude. Register-mismatch word surprisal is used as a sanity check and reflects the word predictability based on an incorrect (mismatching) register. It should therefore predict the N400 amplitude less accurately than a register-specific model if listeners adapt their predictions to the register at hand. Generic word surprisal reflects the word predictability of register-unspecific, average language use. If listeners do not adapt to a register, this word surprisal should perform at least on par with register-specific word surprisal. Finally, recency-based word surprisal reflects generic word surprisal updated with information on recent words, of which the likelihood of recurring is temporarily boosted. If listeners do not use register characteristics, but instead recent language input, recency-based word surprisal may better predict the N400 amplitude.
The different word surprisal 'types' can be estimated by training SLMs on a specific set of language materials, as the estimated word surprisal depends crucially on the selected language materials the SLM is trained on. For example, an SLM trained on a book corpus will perform worse when tested on news materials as compared to when tested on an unseen book corpus. We therefore train SLMs on register-specific language materials to estimate register-specific word surprisal. Register-mismatch word surprisal is estimated by using an SLM trained on language materials from a mismatching register (see Section 2.2).
Generic word surprisal is more difficult to operationalize, because sampling language materials always introduces bias in some manner (Kilgarriff, 2007; Biber & Conrad, 2001); i.e., there is no 'general' corpus to train a bias-free SLM. To address this issue, we train an SLM on a large corpus (see Section 2.1.1) that does not overlap with the register-specific language materials. The resulting SLM can be considered generic (register-unspecific) to the extent that the register-specific SLMs are expected to show improved performance on the register-specific materials, i.e., the register-specific SLMs can be expected to assign overall higher probabilities to the next words in register-specific texts as compared to the generic SLM. Lastly, we estimate recency-based word surprisal with a cached SLM, a standard extension of the generic SLM, whereby the SLM is updated with the most recent n unigrams (i.e., words).
The current study also tests whether the effect that word surprisal predicts the N400 amplitude (for reading, Frank et al., 2015) generalizes to a listening study. There are two methodological reasons why this effect may be difficult to detect in a listening study. First, word onsets are harder to accurately determine in connected speech compared to the onsets of visually presented words. The uncertainty in word onset determination could potentially lead to temporal 'smearing' of the ERP (Van Berkum et al., 2005) and thereby to less clear temporal patterning of ERP components. Second, while it is possible to use fixed-paced presentation for a reading experiment (with a predetermined pause between words), this is neither feasible nor desirable with auditory presentation of natural speech. For example, due to co-articulation in speech, it would sound wholly unnatural to insert pauses between the words of a recorded sentence. The continuous nature of speech therefore likely results in overlapping, temporally smeared word effects in the EEG signal. As a result, the N400 could be attenuated when this ERP is elicited with all content words in long stretches of natural connected speech.
To counterbalance the issue of smaller expected effect sizes, we collected a large amount of data. We used audio recordings of speech from three different speech registers: dialogues, (read-aloud) books, and (broadcast) news. The registers were selected to be distinct in word predictability, based on the findings by Bentum et al. (2019a), and were assigned to three separate experiments. The reasons for conducting separate experiments are twofold. First, an experiment dedicated to one register allows the participant to adapt their anticipations to that speech register. Second, it is possible to present more materials of each register by spreading them over three experiments, fulfilling our requirement of a large dataset.
In summary, in this study we test whether listeners anticipate words in long stretches (4-15 min) of natural speech, sampled from three speech registers. We estimate word surprisal and test whether this predicts the N400 amplitude and compare how well register-specific, register-mismatched, generic, and recency-based word surprisal estimates predict the N400 amplitude. With this comparison, we test whether listeners adapt their anticipations of upcoming words based on speech register; i.e., whether register-specific word surprisal is a better predictor of the N400 amplitude compared to the other word surprisal estimates.

Participants
Forty-eight neurologically unimpaired right-handed native speakers of Dutch (18-29 years, mean age = 21.7 years), 14 men and 34 women, participated in the three EEG experiments of the study. All participants gave informed consent for the experiments and the subsequent publication of the EEG recordings. They were paid 80 Euros for their participation.

Materials
The stimuli for the EEG experiments consisted of audio recordings of Dutch speech from different registers, with approximately 90 min of speech materials for each register. The recordings were extracted from two corpora: the Spoken Dutch Corpus (Oostdijk, 2001) and the Institute of Phonetic Sciences Amsterdam Dialogue Video Corpus, henceforth IFADV (Van Son et al., 2008; see also Section 2.2.1). The books and the news speech materials were extracted from the Spoken Dutch Corpus; the dialogues were extracted from IFADV.
Six distinct dialogues of approximately 15 min each were included for the dialogues experiment. Each dialogue was between two well-acquainted interlocutors (e.g., friends, colleagues), who freely talked about any topic that came to mind (see Van Son et al., 2008, for details). Seven 12-minute excerpts from read-aloud Dutch books were included in the books experiment. Finally, the news experiment consisted of 21 sections of approximately four minutes each. Each section contains multiple news items presented by the same broadcaster. We inserted 0.9 s of silence between news items and combined the four-minute sections into seven 12-minute blocks.
All recordings used in the experiments were orthographically and phonemically annotated, which allowed for the time-locking of each individual word to the EEG-recording. All recordings were equalized at 60 dB with Praat (Boersma & Weenink, 2018). See Table 1 for an overview of the speech materials presented in the EEG experiments.

Estimating generic, register-mismatch, recency and register-specific word surprisal

Training and test materials
To estimate word surprisal of each content word in the experimental materials we trained and applied multiple statistical language models (SLMs). To train these SLMs, we used language materials from four corpora: NLCOW14, SoNaR, the Spoken Dutch Corpus, and IFADV. The NLCOW14 corpus, henceforth COW (Schäfer, 2015; Schäfer & Bildhauer, 2012), is a collection of web-crawled Dutch texts consisting of approximately 4.7 billion words. The SoNaR corpus (Oostdijk et al., 2013) is a collection of written Dutch texts of approximately 500 million words. From this corpus, we used a subset of the Dutch teleprompt texts (SoNaR news) and Dutch books (SoNaR books). The Spoken Dutch Corpus (Oostdijk, 2001) is a corpus of recorded and transcribed Dutch speech from different registers containing approximately 10 million word tokens. We used three components from the Spoken Dutch Corpus: the spontaneous dialogue component (CGN dialogues), the news broadcasts (CGN news) and the read-aloud books (CGN books). Finally, we used the IFADV corpus (Van Son et al., 2008), a collection of recorded and transcribed dialogues, containing approximately 70,000 word tokens.
We preprocessed the COW corpus by excluding sentences with three or more word or character repetitions, or with characters not used in standard Dutch orthography. The following preprocessing steps were performed for all language materials from all corpora. We replaced characters with diacritics with the equivalent characters without diacritics, and mapped all numbers, websites and tagged words (e.g., #tag#) to special word codes. We removed punctuation, except for commas. We normalized shortened words with apostrophes to a standard spelling (e.g., 't becomes het 'the'). For an overview of the processed text materials see Table 2.
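The normalization steps listed above could be sketched as follows. This is an illustration under stated assumptions, not the script used for this study: the code words NUMCODE, URLCODE and TAGCODE, the regular expressions, and the function names are all hypothetical.

```python
import re
import unicodedata

def strip_diacritics(text):
    """Replace characters with diacritics by their plain equivalents (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(sentence):
    """Apply the normalization steps in sequence (all patterns are assumptions)."""
    text = strip_diacritics(sentence)
    text = re.sub(r"(?<!\w)'t\b", "het", text)              # shortened form -> standard spelling
    text = re.sub(r"(https?://|www\.)\S+", "URLCODE", text)  # websites -> code word
    text = re.sub(r"#\w+#", "TAGCODE", text)                 # tagged words -> code word
    text = re.sub(r"\d+([.,]\d+)*", "NUMCODE", text)         # numbers -> code word
    text = re.sub(r"[^\w\s,]", " ", text)                    # drop punctuation, keep commas
    return " ".join(text.split())                            # collapse whitespace

example = normalize("Ze zag 't café om 10.30, zie www.voorbeeld.nl!")
# "Ze zag het cafe om NUMCODE, zie URLCODE"
```

The order matters in this sketch: the apostrophe form and the websites are handled before punctuation is removed, since both rely on characters that the last substitution would delete.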
IFADV, CGN news and CGN books contain language materials used in the EEG experiment. For the purpose of SLM training, we removed these particular materials. Subsequently, we created register-specific sets by combining CGN dialogues with IFADV, CGN books with SoNaR books and, finally, CGN news with SoNaR news. We will refer to these sets as dialogues, books, and news respectively. Each set was split randomly into a training set with 80 % of the materials and a test set with the remaining 20 % of the text materials. We used all preprocessed materials from COW for training purposes.
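A random 80/20 split of this kind can be sketched as below. The unit that is shuffled (here simply list items, e.g., sentences or files) and the fixed random seed are assumptions made for the sake of a reproducible illustration.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of the items and cut it into a training and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical register-specific set of 100 units
train, test = split_train_test(range(100))
```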

Statistical language modeling
We trained the SLMs with the aid of the SRILM toolkit (Stolcke, 2002) and used the same settings for each language model: a tetragram (4-gram) SLM with Kneser-Ney discounting (Chen & Goodman, 1999) for smoothing.
We trained separate SLMs on the following training materials: COW, dialogues, news, and books. The SLM trained on the COW materials will be referred to as the generic SLM. This SLM was also used for the computation of the recency-based SLM and as the background language model which we interpolated with the SLMs trained on the dialogues, news and book training materials to create register-specific SLMs.
To find the best interpolation weights for the register-specific SLMs, we interpolated each with the background SLM (trained on the COW corpus) and tested a series of weights. We chose the weight resulting in the lowest perplexity on the register-specific held-out test materials (perplexity is a performance metric for SLMs whereby a lower score indicates better performance). The optimal weights for the background model were 0.3 for both news and books and 0.13 for the dialogues model.
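The weight search described above can be illustrated with a simplified sketch. It assumes word-level linear interpolation of the probabilities that two models assign to the same held-out words, together with a plain grid search; the probabilities and the weight grid are invented, and the actual interpolation was carried out on the language models themselves.

```python
import math

def perplexity(probabilities):
    """Perplexity of a word sequence, given the probability assigned to each word."""
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / len(probabilities))

def interpolate(p_register, p_background, background_weight):
    """Linear mix of register-specific and background (generic) probabilities."""
    return [(1 - background_weight) * r + background_weight * b
            for r, b in zip(p_register, p_background)]

def best_background_weight(p_register, p_background, weights):
    """Return the candidate weight that yields the lowest held-out perplexity."""
    return min(weights,
               key=lambda w: perplexity(interpolate(p_register, p_background, w)))

# Hypothetical per-word probabilities on held-out register-specific text
p_register = [0.20, 0.001, 0.10, 0.30]
p_background = [0.05, 0.05, 0.02, 0.10]
weights = [i / 10 for i in range(1, 10)]
best = best_background_weight(p_register, p_background, weights)
```

The second word shows why interpolation helps: the register-specific model assigns it a very low probability, and mixing in the background model prevents that single word from dominating the perplexity.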
Finally, we created a recency-based SLM, based on the generic model trained on the COW materials. We determined the optimal cache size (number of preceding words used to update the SLM) by testing different sizes (i.e., 2, 4, 8, …, 512, 1024 words) on the test materials of the different registers. The SLM performance asymptotes quickly with increasing cache sizes and we therefore selected a cache size of 64.
Table 3 shows an overview of the perplexity scores for each SLM on the materials used in the EEG experiment and (between brackets) the score on the test materials. We observe that each register-specific model performs better on the corresponding register material compared to the other materials (the mismatching register materials), and the recency-based model performs better still. The book SLM performed worse on the materials used in the EEG experiment than on the test materials, indicating a discrepancy between the training and test materials on the one hand and the language materials used in the EEG experiment on the other. However, this model still performs better than the generic SLM (i.e., a perplexity of 714 versus 1736).
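The idea behind the recency-based (cached) SLM can be sketched as a mix of the generic probability of a word with its relative frequency among the most recent words. The mixing weight, the use of precomputed generic probabilities, and the function name are assumptions for illustration; only the cache size of 64 is taken from the text.

```python
from collections import Counter, deque

def cached_probabilities(words, generic_probs, cache_size=64, cache_weight=0.1):
    """Mix generic word probabilities with a unigram cache of recent words."""
    cache = deque(maxlen=cache_size)  # the most recent words
    counts = Counter()                # word counts inside the cache
    result = []
    for word, p_generic in zip(words, generic_probs):
        if cache:
            p_cache = counts[word] / len(cache)
            p = (1 - cache_weight) * p_generic + cache_weight * p_cache
        else:
            p = p_generic             # nothing heard yet: generic model only
        result.append(p)
        if len(cache) == cache.maxlen:
            counts[cache[0]] -= 1     # oldest word is evicted by the append below
        cache.append(word)
        counts[word] += 1
    return result

# Toy example with a tiny cache so the boost for repeated words is visible
probs = cached_probabilities(["de", "kat", "de", "kat"], [0.1] * 4,
                             cache_size=2, cache_weight=0.5)
```

In the toy example the repeated words receive a higher probability than their generic estimate, which is the temporary boost for recently heard words that the recency-based surprisal is meant to capture.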

Word surprisal estimation
To estimate word surprisal, we used the generic, recency-based, and register-specific SLMs described in Section 2.2.2. The different SLMs were used to estimate the surprisal of each word in the experimental speech materials. We used the generic and recency-based SLM to estimate generic and recency-based word surprisal, respectively. The register-specific SLMs were used to compute register-specific word surprisal for the different register-specific materials, i.e., the dialogues SLM was used to estimate word surprisal in the dialogue materials, etcetera. Finally, we used mismatching pairs of registers, e.g., the books SLM to estimate probability for words in the news materials. We used the following mismatch pairs: books-news, news-dialogues, news-books. We refer to this as register-mismatch word surprisal.

Table 4
Comparisons between the generic, register-mismatch, and recency probability estimates on the one hand and the register-specific word probability estimates on the other hand, based on the AIC of LME models (AIC difference in parentheses). The p-value indicates the probability that a model with generic, mismatch, or recency estimates is a better fit than the register-specific word surprisal.
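One common way to turn an AIC difference into a probability of this kind is the relative likelihood, exp((AIC_best − AIC_candidate) / 2). Whether the values in Table 4 were computed in exactly this way is an assumption; the sketch below only illustrates the arithmetic, with invented AIC values.

```python
import math

def relative_likelihood(aic_candidate, aic_best):
    """Plausibility of the candidate model relative to the best (lowest-AIC) model."""
    return math.exp((aic_best - aic_candidate) / 2)

# Hypothetical AIC values for two LME models fitted to the same data
aic_register_specific = 1000.0
aic_generic = 1012.0
p_generic_better = relative_likelihood(aic_generic, aic_register_specific)
```

With these invented numbers the AIC difference is 12, so the generic model is exp(−6) times as probable as the register-specific model to minimize the information loss.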

Procedure
Participants came to the lab on three separate occasions. Consecutive visits were a week or more apart. Participants were fitted with the correct size electrode cap and the electrodes were placed. The participants were seated in a sound-attenuating booth and listened to approximately 90 min of speech from a specific register (i.e., dialogues, books, or news), 270 min in total for the three experiments. The order of the speech registers was counterbalanced across participants. The audio materials were presented via in-ear earphones (Etymōtic ER1) at a comfortable listening volume. To this end, a short speech fragment (corresponding to the register but not part of the further experiment) was played to check the volume. When necessary, the earplugs or volume were adjusted. The participants were asked to sit still and keep eye movements and blinks to a minimum.
The audio materials were presented in blocks of approximately 15 min. The order of block presentation was counterbalanced across participants. After each block, the participant could take a break before the experiment continued.
To ensure participants listened attentively, yes-no comprehension questions were visually presented during breaks in the experiment and participants responded with a button box. For example, a participant could be asked: Heeft ze het British Museum bezocht? 'Did she visit the British Museum?' For both the dialogues and books the questions were presented at the end of each block. For the news experiment the questions were presented at approximately 4-minute intervals, to compensate for the higher information density of the materials compared to the other registers. During the dialogues, books and news experiments 36, 42 and 84 yes-no comprehension questions were presented, respectively.

Fig. 2.
Grand average plots of the ERP response averaged over all content words, participants and the channel set, but split between speech registers: dialogues, books and news. The solid line shows the average of words in the highest tertile of register-specific surprisal, the dotted line the lowest tertile. The blue shaded areas indicate the analysis window (300-500 ms from word onset).

EEG recording
The electroencephalogram (EEG) was recorded from 26 silver-chloride cap-mounted electrodes. The electrodes were placed according to the Standard International 10-20 System (Fp2, Fz, F3, F4, F7, F8, FC1, FC2, FC5, FC6, Cz, C3, C4, T7, T8, P3, Pz, P4, P7, P8, CP1, CP2, CP5, CP6, O1, O2). Four additional electrodes were used to monitor eye-related artifacts (eye movements and blinks), placed at the outer left and right canthi, and below and above the left eye (converted offline to horizontal and vertical electro-oculogram (EOG) signals). Two additional electrodes were placed on the left and right mastoids. All electrodes were referenced to the left mastoid electrode and electrode impedances were below 15 kΩ before recording started. The EEG data were amplified with an Easycap system, band-pass filtered with 0.01 and 100 Hz cut-off frequencies, and digitized at a 1000 Hz sample frequency.

Preprocessing
The data were re-referenced off-line to the mean of the left and right mastoids and filtered with a 5th order Butterworth bandpass filter with cut-off frequencies at 0.05 and 30 Hz. We removed sections containing artefacts from the data in a semi-automatic fashion. To automatically detect artefacts in the EEG materials, we first manually annotated 60 hours (out of a total of 207 hours) of EEG materials for artefacts. Based on the manual annotations we trained a convolutional neural network classifier with Tensorflow (Abadi et al., 2016) to detect these artefacts. The classifier was trained such that it was very sensitive to artefacts, erring on the side of classifying more EEG materials as artefact to find as many as possible. The classifier achieved an F1 score of 0.89 on unseen EEG materials. We used this classifier to classify all EEG materials for artefacts. Subsequently, we manually checked all found artefacts and made corrections when needed.
Individual channels were removed when a channel was contaminated with artefacts for at least 40 % of an experimental block. Otherwise, we removed the section (all channels) where one or more channels showed artefact corruption. The Fp2 channel was removed for all recordings, due to poor overall signal quality.
After artefact removal, independent component analysis (ICA) was used to filter out activity related to eye blinks and eye movement. Following Winkler et al. (2015), the ICA was computed on the EEG data band-pass filtered with cut-off frequencies of 1 and 30 Hz. Subsequently, components were selected that reflected eye blinks and eye movements based on visual inspection of topographic and time-course plots. The ICA solution was then used to recompose the EEG data (band-pass filtered with cut-off frequencies of 0.05 and 30 Hz) without the eye-activity-related components. This approach attenuates the sensitivity of ICA to slow drift (Winkler et al., 2015) without adversely affecting ERP analysis (see Tanner et al., 2015).
We extracted EEG data in the time window −300 to 1000 ms relative to word onset, for each content word (i.e., nouns, verbs, adverbs and adjectives) in our dataset. We used the following exclusion criteria to construct the dataset. We excluded items that overlapped with artefactual EEG data or in which the signal exceeded ±75 μV within the previously defined time window of the word. We excluded all data from nine participants because less than 40 % of their data remained after artefact removal. We excluded all words on a stop list (see Appendix A), and excluded words that occurred in overlapping speech (only relevant in the dialogues experiment). We excluded the first word of each sentence, to lower the correlation between word surprisal and word frequency (a covariate in our statistical model, see below). We excluded words shorter than 50 ms or longer than 700 ms and words that occurred fewer than 24 times in the dataset, to lower the number of word types in the experiment (from 5,866 to 3,423). A smaller set of word types was needed to achieve convergence of the statistical model. Across all experiments, these steps resulted in a dataset of 568,931 word epochs.
No participants were excluded based on the yes-no comprehension results who had not already been excluded based on EEG data quality. Overall, the participants performed well on the yes-no comprehension questions, with 83 % correct for the news experiment, 96 % correct for the books experiment and 94 % correct for the dialogues experiment.
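The exclusion criteria above can be read as a single per-epoch filter. The following is a minimal sketch of that reading, not the pipeline used in the study: the `WordEpoch` fields, the function name and the toy data are hypothetical, and it assumes that type counts are computed once over the dataset and that the threshold values themselves are not excluded.

```python
from dataclasses import dataclass

# Thresholds are taken from the text; all names below are hypothetical.
MAX_ABS_AMPLITUDE_UV = 75.0
MIN_DURATION_MS = 50.0
MAX_DURATION_MS = 700.0
MIN_TYPE_COUNT = 24
CONTENT_POS = {"noun", "verb", "adverb", "adjective"}


@dataclass
class WordEpoch:
    word: str
    pos: str                     # part-of-speech label
    duration_ms: float
    peak_abs_uv: float           # largest absolute amplitude in the epoch window
    overlaps_artefact: bool
    in_overlapping_speech: bool
    sentence_initial: bool


def keep_epoch(epoch, type_counts, stop_list):
    """Return True if the epoch passes every exclusion criterion."""
    if epoch.pos not in CONTENT_POS:
        return False
    if epoch.overlaps_artefact or epoch.peak_abs_uv > MAX_ABS_AMPLITUDE_UV:
        return False
    if epoch.word in stop_list or epoch.in_overlapping_speech:
        return False
    if epoch.sentence_initial:
        return False
    if not (MIN_DURATION_MS <= epoch.duration_ms <= MAX_DURATION_MS):
        return False
    # Assumption: counts are per word type over the whole dataset.
    return type_counts.get(epoch.word, 0) >= MIN_TYPE_COUNT


# Toy usage with invented values: the second word is too infrequent.
counts = {"museum": 30, "bezocht": 12}
good = WordEpoch("museum", "noun", 420.0, 41.5, False, False, False)
rare = WordEpoch("bezocht", "verb", 380.0, 30.0, False, False, False)
kept = [e.word for e in (good, rare) if keep_epoch(e, counts, stop_list=set())]
```

The order in which the criteria are applied does not change which epochs survive, but it would change the count attributed to each criterion if those were reported separately.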

Analysis
Based on previous literature (see Frank et al., 2015), we defined the N400 amplitude as the average of the channel set C3, C4, Cz, CP5, CP1, CP2, CP6, P7, P3, Pz, P4, P8, O1, O2 within the time window 300-500 ms after word onset. Following Frank et al. (2015), we did not subtract the baseline from the ERP. Instead, the baseline was used as a covariate in the statistical model. We computed the baseline by averaging the amplitude over the time window −150 to 0 ms (relative to word onset) and the same channel set.
We estimated several linear mixed-effects (LME) models (Bates et al., 2015) with the statistical package R (R Core Team, 2015) to predict the N400 amplitude. We first estimated a null LME model with the following standardized covariates: the aforementioned baseline, the log word frequency (based on the COW corpus), the word duration, the word position in the sentence and finally a factor for experiment (with three levels, one for each register). In addition, we added participant and word as random effects with random slopes for surprisal for both participant and word.
The predictor of interest (word surprisal) was added to the null model to create a generic, recency-based, register-specific, and register-mismatch LME model, based on the corresponding word surprisal type (i.e., generic word surprisal corresponds with a generic LME model). We also added an interaction term between word surprisal and experiment to allow for differences between speech registers. We considered including a random slope for word surprisal by participant, but this resulted in convergence issues.
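As an illustration of the two quantities defined above, the sketch below averages an epoch over the channel set and a time window. It is a simplified stand-in rather than the analysis code: it assumes epochs stored as a mapping from channel name to a list of samples, a 1000 Hz sampling rate with no downsampling, an epoch starting at −300 ms, and half-open windows; the function name and the toy epoch are hypothetical.

```python
from statistics import mean

# Channel set and windows are from the text; everything else is assumed.
N400_CHANNELS = ["C3", "C4", "Cz", "CP5", "CP1", "CP2", "CP6",
                 "P7", "P3", "Pz", "P4", "P8", "O1", "O2"]
SFREQ_HZ = 1000         # assumption: recording rate, no downsampling
EPOCH_START_MS = -300   # epoch runs from -300 to 1000 ms around word onset


def window_mean(epoch, start_ms, end_ms, channels=N400_CHANNELS):
    """Mean amplitude over the channels and the window [start_ms, end_ms)."""
    first = (start_ms - EPOCH_START_MS) * SFREQ_HZ // 1000
    last = (end_ms - EPOCH_START_MS) * SFREQ_HZ // 1000
    # Channels missing from the epoch (e.g. removed ones) are skipped.
    return mean(mean(epoch[ch][first:last]) for ch in channels if ch in epoch)


# Toy epoch: 1 µV before word onset, -2 µV afterwards, on every channel.
toy = {ch: [1.0] * 300 + [-2.0] * 1000 for ch in N400_CHANNELS}
baseline = window_mean(toy, -150, 0)   # enters the model as a covariate
n400 = window_mean(toy, 300, 500)      # dependent variable
```

Keeping the baseline as a separate value, instead of subtracting it from `n400`, mirrors the choice to model it as a covariate.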

Results
Model comparison with the anova likelihood-ratio test revealed that the LME model with generic word surprisal fit the data better than the null model (χ² = 553.46, p < .001). The N400 amplitude is more negative with increasing values of word surprisal (see Fig. 1).
Subsequently we compared the generic, register-mismatch, recency-based, and register-specific LME models. For these comparisons, we were precluded from using the anova likelihood-ratio test since these models were not nested versions of each other. We therefore compared the AIC of each LME model and computed the corresponding relative likelihood. This comparison revealed that the register-specific word surprisal values best predict the N400 amplitude (see Table 4, left). The recency-based model performed better than both the generic model and the register-mismatch model.
In the register-specific LME model (see Table 5), the interaction term for the news materials and word surprisal has a t-value of 5.46. To further investigate this interaction effect, we split the data according to register (dialogues, books and news) and fitted LME models to each subset. The LME models for the news materials failed to converge. We therefore computed the news LME models without random slopes for surprisal for participant and word. Model comparison with the anova likelihood-ratio test revealed that the LME model with register-specific word surprisal fit the data better than the null model for both dialogues (χ² = 260.73, p < .001) and books (χ² = 279.97, p < .001), while this was not the case for the news materials (χ² = 0, see also Fig. 2).
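For readers unfamiliar with the measure, the relative likelihood of a model given its AIC is commonly computed as exp((AIC_min − AIC_i) / 2), where AIC_min is the lowest AIC in the comparison set. The sketch below shows this computation; the function name is hypothetical and the AIC values are invented for illustration and are not those of Table 4.

```python
import math


def relative_likelihoods(aic_by_model):
    """Relative likelihood of each model with respect to the lowest-AIC model."""
    best = min(aic_by_model.values())
    return {name: math.exp((best - aic) / 2.0)
            for name, aic in aic_by_model.items()}


# Invented AIC values, for illustration only (not the values in Table 4).
toy_aic = {"generic": 1010.0, "recency-based": 1006.0, "register-specific": 1000.0}
rel = relative_likelihoods(toy_aic)
```

The best model always receives a relative likelihood of 1; values close to 0 indicate that a model is much less likely than the best one to minimize information loss.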

Discussion
In the current study, we recorded the EEG signal from participants who listened to long (4-15 min) stretches of natural speech sampled from different speech registers: dialogues, news broadcasts, and read-aloud books. The speech materials were analyzed with statistical language models (SLM) estimating word surprisal. We found that the N400 amplitude was more negative for words with higher surprisal (i.e., unexpected words).
We investigated the influence of speech register on prediction in speech comprehension by estimating and comparing different word surprisal 'types'. We compared generic with register-specific word surprisal and found that register-specific word surprisal best predicted the N400 amplitude. This finding indicates that listeners are sensitive to the specific statistical structure of the speech register they listen to, and that they adjust their anticipations accordingly. To test whether the adaptation of word anticipations was the result of register, we also compared register-specific word surprisal with register-mismatch word surprisal. This comparison provided a sanity check to test whether any 'specific' word surprisal would better predict the N400 amplitude compared to generic word surprisal. Register mismatch was defined as the surprisal estimated on mismatching register materials, e.g., the SLM was trained on books but used to estimate surprisal for the news materials. We found that register-mismatch word surprisal did not improve upon generic word surprisal, providing further evidence that register-specific information influenced participants' word expectations.
Furthermore, we tested whether the register-specific effects could be explained merely by tracking recent input. In theory, listeners could adapt their expectations not based on register characteristics, but just on recent language input. We therefore also compared the register-specific word surprisal to recency-based word surprisal. The recency-based word surprisal is computed by updating the generic SLM with caching of a number (n = 64) of recent words. As Table 4 shows, the recency-based word surprisal better predicts the N400 amplitude compared to the generic word surprisal. Importantly, the register-specific word surprisal does better still. This finding indicates that listeners do not only use recent language input to adjust their predictions of upcoming words but also register information. Listeners may have stored representations of the statistical structure of registers, whereby different expectations are generated when listening to a story than when listening to a dialogue.
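To make the idea of recency-based surprisal concrete, the sketch below interpolates a generic word probability with a unigram cache over the 64 most recent words. It only illustrates the general technique: the interpolation weight, the context-free `generic_prob` stand-in for the SLM, the log base and the function names are all assumptions, and the caching scheme used in the study may differ.

```python
import math
from collections import Counter, deque

CACHE_SIZE = 64      # number of recent words, as stated in the text
CACHE_WEIGHT = 0.1   # hypothetical interpolation weight, not from the paper


def cached_surprisals(words, generic_prob, cache_size=CACHE_SIZE,
                      weight=CACHE_WEIGHT):
    """Surprisal in bits per word: generic model interpolated with a cache."""
    cache = deque(maxlen=cache_size)
    counts = Counter()
    surprisals = []
    for w in words:
        p = generic_prob(w)
        if cache:
            # Mix in the relative frequency of w among the recent words.
            p = (1.0 - weight) * p + weight * counts[w] / len(cache)
        surprisals.append(-math.log2(p))
        # Update the cache after scoring, so a word never predicts itself.
        if len(cache) == cache.maxlen:
            counts[cache[0]] -= 1   # the oldest word is about to drop out
        cache.append(w)
        counts[w] += 1
    return surprisals


# Toy generic model: uniform over a four-word vocabulary.
uniform = lambda w: 0.25
s = cached_surprisals(["a", "b", "a"], uniform)
```

In this toy run the repeated word "a" receives a lower surprisal on its second occurrence than on its first, while the new word "b" receives a slightly higher one, which is the qualitative behaviour a recency-based account predicts.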
Our results are also relevant for the question whether prediction occurs during normal language processing (Huettig, 2015; Nieuwland et al., 2018). In our experiments, we used long stretches (4-15 min) of naturalistic speech. Therefore, there are no artificial pauses between the presentation of words, which could potentially influence predictive processing (Luka & Van Petten, 2014). Our finding shows that listeners anticipate words in normally-paced language input. Furthermore, we investigated most words in the speech materials, which allows for the investigation of predictive language processing across the whole probability spectrum, from very unexpected to highly expected words. This is relevant in light of Huettig's (2015) criticism that most evidence for prediction comes from comparing extremes of predictability.
Do the presented results provide evidence for prediction? Pickering & Gambi (2018) note that effects found on the target word, such as surprisal effects, can alternatively be explained by an integration account. This interpretation might be influenced by the traditional assumption that prediction entails one or at most a few words. However, we argue that surprisal effects should be interpreted in a different way (leading, in our opinion, to a clearer interpretation of the underlying mechanism). The surprisal value is not a static attribute of a single word. Instead, it derives from a probability distribution that is computed over an entire lexicon based on the preceding context. The prediction is not a specific word or small set of words. The prediction is the specific probability distribution over the lexicon. To state it differently, each word in the mental lexicon is 'pre-activated' to the extent it is a probable continuation given the preceding context.
In our experiment, each target word (with corresponding surprisal value) samples the predicted probability distribution of the lexicon. The word does not have this surprisal value in isolation: a probability can only be defined over a set of options, in this case a lexicon. Sampling these predicted probability distributions explains variance in the amplitude of the N400: it is more negative when sampling less probable continuations and more positive when sampling more probable continuations. Therefore, our results can be interpreted as evidence for these predicted word probability distributions, and we propose that our data are therefore best explained by a predictive account.
Furthermore, probability effects found during language perception, especially those found over a wide probability spectrum such as in Smith & Levy (2013), Frank et al. (2015), and the effects reported in this article, fit comfortably within more general and well-attested predictive perception frameworks, such as predictive coding (Friston, 2005), in which top-down predictions result in prediction error from bottom-up input. And while it might be a logical possibility that an integration account could explain these specific effects, it would need further specification beyond the bare claim that more probable words are easier to integrate (see Pickering & Gambi, 2018). Furthermore, if we apply Occam's razor, a predictive account is preferable, since it is more parsimonious when we assume language perception is similar to more general perception mechanisms.
We want to emphasize that the preceding argument does not exclude the N400 effect as an index for integration mechanisms. Indeed, several studies (e.g., Frank & Willems, 2017; Nieuwland et al., 2020) suggest that these processes are dissociable, where the variance of the N400 effect is partly captured by word predictability and partly by semantic plausibility operationalized by either participant rating tests or automatic techniques such as latent semantic analysis.
The current results do show that listeners engage in predictive language processing while listening to natural everyday speech (without artificially constraining sentences). This result is in line with the results reported for reading by Smith & Levy (2013) and Frank et al. (2015). Interestingly, a recent article by Heilbron et al. (2022) reported similar findings with speech materials using a state-of-the-art GPT-2 language model (i.e., a language model based on neural networks).
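The point that surprisal is a property of a word relative to a predicted distribution, and not of the word in isolation, can be made concrete with a toy example. The lexicon, the probabilities and the use of base-2 logarithms below are illustrative assumptions, not values or choices taken from the study.

```python
import math


def surprisal(distribution, word):
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(distribution[word])


# Hypothetical predicted distribution over a toy lexicon for one context.
predicted = {"museum": 0.5, "park": 0.25, "bakery": 0.125, "volcano": 0.125}
# A different context yields a different distribution over the same lexicon.
other_context = {"museum": 0.125, "park": 0.125, "bakery": 0.25, "volcano": 0.5}

per_word = {w: surprisal(predicted, w) for w in predicted}
```

The same word, for example "museum", has a surprisal of 1 bit under `predicted` and 3 bits under `other_context`, because the value depends on how probability mass is spread over the whole lexicon.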
We found an unexpected difference between the speech registers: we did not observe an N400 effect for the news broadcast speech materials (see Fig. 2). It is unlikely that this difference was caused by news broadcasts being less predictable than the other speech materials: the perplexity scores for SLMs tested on news materials were comparable to scores for dialogues and books (see Table 3), indicating that the SLMs could predict upcoming words in the news materials about as well as in the materials from the other registers. If news broadcasts were less predictable, the SLMs' performance should drop accordingly.
An explanation for the interaction effect between word surprisal and register could be participants' attention to the speech materials. Participants possibly found it harder to concentrate on the news materials compared to dialogues and book materials. Attention difficulties for the news materials could be caused by the high topic density in this register. The news materials consisted of sequences of short news items on many different topics. In fact, because of this high density of topics, we decided to segment the news materials into 4-minute sections, while books and dialogues materials were segmented into 12- and 15-minute sections, respectively. Still, the participants performed worse on average on the comprehension questions for news (83 % correct) than for books (96 % correct) and dialogues (94 % correct), indicating that they indeed found it harder to pay attention to the news materials. There is evidence that attention can modulate the N400 (for a discussion, see Kutas & Federmeier, 2011), but it is unclear to what extent lack of attention would completely suppress the N400 effect.
We found an unexpectedly high correlation between word surprisal and log word frequency. A high correlation between the predictor of interest (word surprisal) and a covariate makes statistical results less reliable (e.g., effects can flip, because the variance can be ascribed to either of the variables). An explanation for the unexpectedly high correlation is related to the first word in a sentence. The dialogues materials contain a high number of very short sentences, resulting in a relatively high proportion of first words. Statistical language models (SLM) generally do not use cross-sentence-boundary pre-context. Therefore, the word surprisal of the first word in a sentence will tend to the frequency of that word. We therefore removed the first word of each sentence for our analyses. In future studies, it would be interesting to test whether SLMs could be used that take cross-sentence-boundary pre-context into account.
Our study raises questions for future research. First, how do listeners adjust their expectations to a specific register? Our results show that simply using the most recent words to adjust anticipations does worse in modelling N400 amplitude in listeners compared to using register-specific information. This indicates that listeners do not merely use recent context to adjust expectations, and could imply that registers are represented in the listener's mind in some form and can be utilized to adapt expectations to upcoming input. This could mean that multiple generative models (e.g., registers, schemas) are represented and language perceivers switch between these models (see also Kuperberg, 2016).
A second question for future research is whether speech register provides the correct level of granularity for a predictive model of language. The current study found evidence that listeners can use register-specific information to adjust their anticipations. However, register is a high-level construct that correlates with, for example, topic. It could be that topic differences are also an important factor in structuring language perceivers' expectations.
Third, how should the success of SLMs in modelling language perceivers' processing costs be interpreted? SLMs are an implausible cognitive model for language prediction. For example, an SLM could not model prediction effects produced by humans with sentences 9 & 10 (Section 1.4), because these effects are based on long-range dependencies. What aspects of predictive human language processing do SLMs capture that make them successful in modelling processing costs, and when would they fail?

Conclusion
We analyzed ERPs elicited by spoken words from long stretches (4-15 min) of naturalistic speech and found that word surprisal predicts the N400 amplitude. Listeners anticipate words while listening to natural speech that is neither highly constrained nor limited to very likely or very unlikely words. Moreover, by comparing generic, recency-based, and register-specific word surprisal, we showed that listeners broadly adapt their expectations to the register of the speech they are perceiving, which indicates that listeners also use cues from the wider context to predict upcoming words.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.

Acknowledgements
We would like to thank Lou Boves and Tineke Snijders for helpful discussions and Tim Zee for his invaluable help with the EEG annotations.