Entrainment to speech prosody influences subsequent sentence comprehension

ABSTRACT Speech processing is subserved by neural oscillations. Through a mechanism termed entrainment, oscillations can maintain speech rhythms beyond speech offset. We here tested whether entrainment affects higher-level language comprehension. We conducted four online experiments on 80 participants each. Our paradigm combined acoustic entrainment to repetitive prosodic contours with subsequent visual presentation of ambiguous target sentences (e.g. “Max sees Tom and Karl laughs”). We aimed to elicit faulty segmentations through the duration of the preceding contour (e.g. the segment “Max sees Tom and Karl” leads to an error at “laughs”). Across experiments, self-paced reading data showed that participants employed the duration of the initial prosodic contour to predict the duration of the upcoming segments. Prosody entrainment may thus serve a predictive function during language comprehension, not only helping the reader to segment the current speech input, but also inducing temporal predictions about upcoming segments.


Introduction
Prosody is critical to sentence comprehension. Pauses, pitch modulations, and duration changes constitute intonational phrase boundaries (IPBs) that delineate multi-word constituents (Frazier et al., 2006). In auditory comprehension, IPBs trigger the segmentation of speech into multi-word units. During reading, where acoustic cues are unavailable, readers actively construct implicit prosody that facilitates comprehension (Breen, 2014; Breen et al., 2016; Fodor, 2002).
The processing of IPBs and the construction of implicit prosody depend on prosodic context. IPBs are not processed in terms of their absolute acoustic magnitude, but relative to the magnitude of preceding IPBs (Clifton et al., 2002; Snedeker & Casserly, 2010). Even more distant prosodic cues can influence the prediction of upcoming material, such that prosody at sentence onset affects subsequent segmentation and word recognition (Brown et al., 2011; Dilley & McAuley, 2008). Context effects have also been reported across domains; for instance, IPBs can prime segmentation of subsequent visual sentences (Steinhauer & Friederici, 2001). By priming, we mean a phenomenon whereby exposure to a preceding stimulus influences a response to a subsequent stimulus, without conscious guidance or intention (Branigan, 2007; Pickering & Ferreira, 2008). In the Steinhauer & Friederici (2001) study, structural priming can be considered an explanation of the obtained results; that is, exposure to preceding prosodic contours can pre-activate representations of upcoming structures, which would facilitate the processing of upcoming visual sentences.
Instead of priming, distant context effects might be elicited by rhythmicity of the prior prosodic context. Periodic amplitude- or frequency-modulated sounds at a given frequency improve the auditory detection of subsequent targets that arrive at the same frequency (Henry & Obleser, 2012; Hickok, Farahbod, & Saberi, 2015). Likewise, it has been shown that during speech processing, the syllable rate of a lead-in sentence can affect detection of subsequent target syllables: After a fast-rate sentence, subjects miss short target syllables that they do perceive when the lead-in sentence is presented at a slow rate (Bosker, 2017; Dilley & Pitt, 2010). In neuroscience, such transfer effects are attributed to an electrophysiological mechanism termed entrainment, by which neural oscillations inherit a stimulation rhythm and continue it after stimulus offset (Kösem et al., 2018; Luo & Poeppel, 2007). Importantly for the current study, it has been suggested that entrainment can also be triggered by prosodic contours, dominated by modulations of pitch and/or amplitude at frequencies below 4 Hertz (Bourguignon et al., 2013; Mai et al., 2016; for review, see Meyer, 2018).
The usage of the term entrainment in auditory and cognitive neuroscience must be distinguished from its usage in pragmatics. Weidman et al. (2016) use the term to denote a "general tendency of two speakers to demonstrate similarity in aspects of their speech over the course of a conversation" (cf. Lehnert-LeHouillier et al., 2020; Edlund, 2011). This definition applies both to lexical units (Brennan, 1996) and to features of intensity, pitch, speech rate, and voice quality. Some of these features might be rhythmic enough to trigger entrainment of electrophysiological oscillators, while others are not. In the current study, we refer only to rhythmic features that mark IPBs, relying on recent corpus evidence for a periodicity of prosodic boundaries (Inbar et al., 2020; Stehwien & Meyer, 2022).
Our series of four experiments combined an initial rhythm of a prosodic contour repeated three times with a subsequent visual target sentence. In Experiment 1, the frequency of the prosodic rhythm was hypothesised to drive participants into a specific segmentation option when processing a target sentence that contained a syntactic ambiguity (i.e. a coordination ambiguity; e.g. Max sees Tom and Karl laughs; Hoeks et al., 2002). Results showed that indeed, the initial contour affected downstream reading times. To test whether this apparent entrainment effect would generalise, Experiment 2 combined the same trial structure with shorter, non-ambiguous target sentences. While still significant, the original transfer effect was weakened, suggesting that the duration of the target sentence is critical for the strength of the entrainment effect. This intuition was further confirmed by Experiments 3 and 4, which presented progressively longer target sentences.

Methods (experiments 1-4)
Participants
Every experiment was run on 80 participants (German native speakers, right-handed, age range = 18-35 years; for the demographic data of participants in every experiment see Appendix 1). In order to determine the sample size, we conducted a power analysis using G*Power 3 software (Faul et al., 2007) based on effect sizes obtained from previous research with a similar paradigm (Dilley & Pitt, 2010). Our estimate yielded a sample size of 32; however, we adopted a more conservative sample of 80 to ensure reproducibility of the effects (Aarts et al., 2015). Participants had normal or corrected-to-normal vision and no reported history of neurological or hearing disorders. They were recruited through the online platform Prolific (www.prolific.co); participation was reimbursed with £6 per hour. All participants were naive as to the purpose of the study. Written informed consent was obtained prior to the experiment. The study conformed to the guidelines of the Declaration of Helsinki and was approved by the local ethics committee of the University of Leipzig, DE.

Stimuli
In order to investigate the influence of prosodic entrainment on subsequent sentence comprehension, our experiments began with initial auditory presentation of a prosodic contour repeated three times. This prosodic contour always belonged to one of two conditions, SLOW or FAST, characterised by different durations. Prosodic contour exposure was followed by visual word-by-word presentation of a target sentence. Contour and sentence durations varied across the experiments. Critically, in every experiment the presentation of the first several visual words was adjusted to contour duration, whereby word presentation rate was calculated from the same auditory sentences from which the contour was extracted. Minor rate changes across experiments result from different sentence lengths across experiments. Presentation of the last words of the sentence, however, was self-paced; that is, the words remained on screen until participants pushed a button (cf. Figure 1).
It was important to match the auditory and visual presentations exactly, so that the critical word (e.g. third noun (N3) in Experiment 1), which was already self-paced, fell potentially within the time interval spanned by one SLOW contour, but outside that of the FAST contour. That is, if the contours elicited subsequent transfer effects, the reaction time at this word should reflect the timing imposed by the prosodic entrainment: in the SLOW condition, the participant would try to fit the critical word into the time interval defined by the SLOW contour, so the reaction time would be shorter than in the FAST condition. This allowed us to interpret the self-paced reaction time at this critical word as a measure of the transfer effect of the preceding contours on sentence comprehension, potentially via entrainment.
After sentence presentation and a delay, some trials were followed by comprehension questions (see Procedure). The transfer effect was inferred from self-paced reading (SPR) times at critical words, as well as from reaction times (RTs) and accuracy on comprehension questions. An overview of the stimuli for the four experiments is given in Table 1 (self-paced words are indicated in bold; arrows mark the ends of the contours if potentially imposed on the sentence timing). Below, we describe the stimuli for every experiment in more detail.

Experiment 1
Two TYPEs of German target sentences were used (see (1)-(4) for English translations). In the first sentence TYPE, LONG sentences (1), Karl is the subject of the verb laughs, whereas Tom is the object of sees. However, if a participant implicitly constructed an IPB after Karl, this would trigger the wrong interpretation (2), where both Tom and Karl together are the object. This error would be recognised at laughs. A similar effect could be expected in the second type of target stimuli, SHORT sentences (3) and (4). For construction, we used 32 monosyllabic first names of 3-6 characters to balance word-by-word presentation (New et al., 2006). Noun frequencies were normally distributed (Heister et al., 2011). Since male and female first names differ in length, we used male first names only. We also selected 75 transitive and 75 intransitive German verbs in 3rd person singular present tense. Length was matched (1-2 syllables, 5-8 characters). Verb frequencies were also normally distributed. Pairs of transitive and intransitive verbs were made based on semantic fit (e.g. expect-come, wake up-sleep). Combination of verb pairs and names yielded 6,000 sentences. A different name triplet was used for each of these. Name triplets were selected to not contain similar-sounding names (e.g. Frank and Franz).
To elicit the two segmentation options, prosodic contours of two different SPEED values were presented before the target sentences. Contours were made by averaging the pitch tracks of the visual sentences, which were extracted from synthetic recordings (Oord et al., 2016) in Praat (Boersma & van Heuven, 2001). We used a female voice (minimum pitch = 116 Hz, maximum pitch = 267 Hz, average pitch = 191.5 Hz) because of its broad pitch range and high variability. The two entrainment SPEED values were SLOW, based on the 3,000 SHORT sentences (e.g. Max sees Tom and Karl), and FAST, based on 3,000 additional sentences (e.g. Max sees Tom). Thus, in LONG sentences, the SLOW prosodic contour matched the duration of Max sees Tom and Karl, aimed at eliciting an error at laughs (2). In contrast, the FAST contour matched the duration of Max sees Tom, aimed at eliciting the correct segmentation (1). For SHORT sentences, on the contrary, correct segmentations (3) were expected in the case of SLOW entrainment, but errors (4) were expected in the case of FAST entrainment (Figure 2).
Figure 1. Procedure (all experiments). Participants first listen to the audio contour. Then the sentence is presented word-by-word, apart from the last one/two words, which are self-paced. Finally, in 75% of the trials, a comprehension question follows.
For averaging, contour durations were adjusted to the average duration of the respective sentence recordings (SLOW: 1,570 ms; FAST: 942 ms). For full delexicalisation, the PURR (Prosody Unveiling through Restricted Representation) method was used (Sonntag & Portele, 1998). This pipeline was recommended in previous studies for constructing a delexicalised prosodic contour (Steinhauer & Friederici, 2001). The method involves extracting the pitch values from the original contour and constructing a sound by adding a sine wave at pitch, its second harmonic at 1/4 of the amplitude, and its third harmonic at 1/16 of the amplitude (suggested by Klasmeyer, 1997). Therefore, out of the original spectral characteristics of the speech signal, only the pitch modulations are retained, which makes it possible to disentangle prosodic modulations from other speech components. Prosodic pitch has been shown to contribute substantially to entrainment separately from other acoustic and phonetic features (Teoh et al., 2019). PURR has been tested extensively and compared to other methods, proving over a variety of experiments to have the best functionality and acceptability (listeners recognising the signal as coming from natural human speech) for speech delexicalisation (Kotz et al., 2003; Meyer et al., 2002; Pannekamp et al., 2005). Contours were further normalised to 65 dB and lowered in pitch by 55 Hz to ensure a comfortable hearing level.
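The harmonic construction described for PURR can be illustrated with a minimal sketch. It is not the original implementation: the sampling rate, frame length, overall amplitude, and the example pitch track are assumptions made for illustration only; the harmonic weights (1, 1/4, 1/16) are the ones stated above.

```python
import math

def synthesise_contour(f0_track, sample_rate=16000, frame_s=0.01, amplitude=0.5):
    """Render a pitch track (one f0 value in Hz per frame) as a hummed signal.

    Each sample is a sine at f0 plus its second harmonic at 1/4 and its third
    harmonic at 1/16 of the fundamental's amplitude. sample_rate, frame_s and
    amplitude are illustrative assumptions, not values from the study.
    """
    samples_per_frame = int(round(sample_rate * frame_s))
    phase = 0.0  # running phase of the fundamental, kept continuous across frames
    signal = []
    for f0 in f0_track:
        step = 2.0 * math.pi * f0 / sample_rate
        for _ in range(samples_per_frame):
            phase += step
            signal.append(amplitude * (math.sin(phase)
                                       + 0.25 * math.sin(2.0 * phase)
                                       + 0.0625 * math.sin(3.0 * phase)))
    return signal

# Hypothetical rise-fall pitch track: 100 frames of 10 ms each (1 s of signal)
track = [180.0 + 40.0 * math.sin(math.pi * i / 100) for i in range(100)]
audio = synthesise_contour(track)
```

Accumulating the phase, rather than recomputing it per frame, avoids clicks at frame boundaries when the pitch changes.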
Average word duration for timed visual presentation, calculated from the length of the synthesised contours, was 314 ms. Presentation of the third noun (N3; e.g. Karl) and the second verb (V2; e.g. laughs; LONG sentences only) was self-paced; these words stayed on the screen until the participant pressed a button.
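The relation between contour duration, word duration, and the position of the critical word can be made explicit with a short worked example. The contour durations and example sentences are those reported for Experiment 1; the assumption that each word occupies exactly one word duration, counted from sentence onset, is ours.

```python
FAST_MS, SLOW_MS = 942, 1570  # contour durations reported for Experiment 1
FAST_WORDS = ["Max", "sees", "Tom"]
SLOW_WORDS = ["Max", "sees", "Tom", "and", "Karl"]

def word_duration(contour_ms, words):
    """Average presentation time per word if the words fill the contour."""
    return contour_ms / len(words)

def onset_ms(word_index, word_ms):
    """Onset of a word relative to sentence onset (assumes back-to-back words)."""
    return word_index * word_ms

word_ms = word_duration(SLOW_MS, SLOW_WORDS)             # 314.0 ms
n3_onset = onset_ms(SLOW_WORDS.index("Karl"), word_ms)   # 1256.0 ms

# N3 starts inside the interval spanned by one SLOW contour, outside one FAST contour
n3_inside_slow = n3_onset < SLOW_MS
n3_inside_fast = n3_onset < FAST_MS
```

Both contours yield the same word duration (942 / 3 = 1570 / 5 = 314 ms), which is why a single presentation rate could be used across conditions.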
The 6,000 sentences and contours were combined into 20 experimental lists of 300 trials each. Within each list, every verb pair was used 4 times, once within each condition (i.e. SHORT-FAST, SHORT-SLOW, LONG-FAST, and LONG-SLOW). Pairs and conditions did not repeat across subsequent trials. We disallowed adjacent name triplets with identical or similar names. Identical triplets did not repeat within a list.
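One way to obtain such a trial order is a greedy random draw with restarts, sketched below. This is an illustration under assumptions, not the procedure actually used: "condition" is taken to mean the full TYPE-SPEED combination, name triplets are left out, and the function name and seed are hypothetical.

```python
import random

CONDITIONS = [("SHORT", "FAST"), ("SHORT", "SLOW"), ("LONG", "FAST"), ("LONG", "SLOW")]

def build_list(n_pairs=75, seed=0, max_restarts=1000):
    """Order n_pairs x 4 trials so that neither the verb pair nor the
    condition repeats on adjacent trials (greedy draw, restart on dead end)."""
    rng = random.Random(seed)
    for _ in range(max_restarts):
        remaining = [(pair, cond) for pair in range(n_pairs) for cond in CONDITIONS]
        order = []
        while remaining:
            if order:
                prev_pair, prev_cond = order[-1]
                candidates = [t for t in remaining
                              if t[0] != prev_pair and t[1] != prev_cond]
            else:
                candidates = remaining
            if not candidates:
                break  # dead end: discard this attempt and start over
            choice = rng.choice(candidates)
            remaining.remove(choice)
            order.append(choice)
        if not remaining:
            return order
    raise RuntimeError("no valid order found within max_restarts")

trial_list = build_list(seed=1)  # 75 verb pairs x 4 conditions = 300 trials
```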

Experiment 2
Our second experiment tested the entrainment hypothesis on short non-ambiguous sentences (see Results Experiment 1). Prosodic contours and average visual word duration were identical to Experiment 1. The two conditions were LONG (e.g. Max sees Tom and Karl) and SHORT (e.g. Max sees Tom). In the SHORT condition, the final word (second noun (N2); e.g. Tom) was self-paced; in the LONG condition, the three final words (N2, "und" (U), and third noun (N3); e.g. Tom, und, and Karl) were self-paced (see Figure 3). List generation was identical to Experiment 1.

Experiment 3
Our third experiment tested the entrainment hypothesis on longer sentences (see Results Experiment 2). Sentences were modified from Experiment 1. We added surnames to the names to increase sentence duration. We used 153 monosyllabic German surnames matched for length to 3-7 characters; word frequency was normally distributed. Similar-sounding surnames (e.g. Wolff and Wolf) were not included. On the basis of this list, we constructed surname triplets, none of which contained repeating items, and added one triplet to each of the 6,000 sentences from Experiment 1. Every sentence was combined with a different surname triplet. Adjacent triplets with repeating surnames were avoided. List generation followed the previous experiments. In LONG sentences (e.g. Max Scholz sees Tom Schmidt and Karl Weiss laughs), the last three words (third noun (N3), third surname (SUR3), and second verb (V2); e.g. Karl, Weiss, and laughs) were self-paced. In SHORT sentences (e.g. Max Scholz sees Tom Schmidt and Karl Weiss), the last two words (N3 and SUR3; e.g. Karl and Weiss) were self-paced. The duration of the preceding prosodic contour and thus the frequency of the prior prosodic entrainment was adjusted to the duration of the critical sentence segments: The SLOW contour matched the duration of Max Scholz sees Tom Schmidt and Karl Weiss; the FAST contour matched the duration of Max Scholz sees Tom Schmidt (Figure 4). Contour generation was identical to Experiments 1 and 2. The duration of the SLOW contour, based on the average duration of the respective recordings, was 3,040 ms; the duration of the FAST contour was 1,900 ms. Average word duration for visual presentation, calculated from the length of the synthesised contours, was 320 ms.
Figure 2. Paradigm (Experiment 1). Prosodic contours repeated 3 times to induce entrainment, followed by time-matched rapid serial visual presentation (RSVP) of target sentence. FAST entrainment aims at inducing correct segmentation of LONG sentence, but incorrect segmentation of SHORT sentence. SLOW entrainment, on the contrary, aims at inducing correct segmentation of SHORT sentence, but incorrect segmentation of LONG sentence.

Experiment 4
Our fourth experiment tested the entrainment hypothesis on even longer sentences (see Results Experiment 3). Sentences were modified from Experiment 3. We added prepositional phrases (e.g. "in the city") after the verb to increase sentence duration. Every phrase consisted of three words: preposition, article, and noun, all of which were 1-syllable and 3-8 letters in length. Frequency and length of the nouns were normally distributed; the prepositions were 2-3 letters in length (in, an, auf, vor, bei) and articles were 3 letters in length (dative case: der or dem). Each phrase was semantically matched to the corresponding verb pair. List generation matched the previous experiments.
In LONG sentences (e.g. Max Scholz sees in the city Tom Schmidt and Karl Weiss laughs), the last three words (third noun (N3), third surname (SUR3), and second verb (V2); e.g. Karl, Weiss, and laughs) were self-paced. In SHORT sentences (e.g. Max Scholz sees in the city Tom Schmidt and Karl Weiss), the last two words (N3 and SUR3; e.g. Karl and Weiss) were self-paced. The duration of the preceding prosodic contour and thus the frequency of the prior prosodic entrainment was adjusted to the duration of the critical sentence segments: The SLOW contour matched the duration of Max Scholz sees in the city Tom Schmidt and Karl Weiss; the FAST contour matched the duration of Max Scholz sees in the city Tom Schmidt (Figure 5). Contour generation matched Experiments 1-3. The duration of the SLOW contour, based on the average duration of the respective recordings, was 3,450 ms; the duration of the FAST contour was 2,600 ms. Average word duration for visual presentation, calculated from the length of the synthesised contours, was 380 ms.
Figure 3. Paradigm (Experiment 2). Prosodic contour repeated 3 times to induce entrainment, followed by time-matched RSVP of the target sentence. According to the prediction hypothesis, FAST entrainment is thought to facilitate the comprehension of SHORT sentences, but inhibit the comprehension of LONG sentences. SLOW entrainment, on the contrary, is expected to facilitate comprehension of LONG sentences, but hinder comprehension of SHORT sentences.
Figure 4. Paradigm (Experiment 3). Prosodic contour repeated 3 times to induce entrainment, followed by time-matched RSVP presentation of the target sentence. FAST entrainment aims at inducing correct segmentation of LONG sentence, but incorrect segmentation of SHORT sentence. SLOW entrainment, on the contrary, aims at inducing correct segmentation of SHORT sentence, but incorrect segmentation of LONG sentence.

Procedure
The procedure was identical across the four experiments. Experiments were run online utilising the Gorilla Experiment Builder (www.gorilla.sc; Anwyl-Irvine et al., 2020). Each trial started with a visual fixation cross and auditory presentation of one of the two prosodic contours. Each contour was repeated three times with pauses of 160 ms between repetitions (Ghitza, 2017). For the duration of auditory stimulation in every experiment, see the respective Stimuli section. A pause corresponding to the difference of duration between the FAST and SLOW contours was added before the FAST contour to equalise entrainment phase duration across conditions. Following a contour, a target sentence (either SHORT or LONG) was presented word by word. The first several words were presented in rapid serial visual presentation (RSVP; Young, 1984). Critically, presentation rate, albeit identical across conditions, was adjusted to contour duration to elicit prosodic transfer effects. Presentation of the final words was self-paced in order to measure those transfer effects (see also Stimuli).
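The timing of the entrainment phase can be sketched with the Experiment 1 values. The sketch rests on one reading of the description above, namely that the silent lead-in before the FAST contours equals the difference between the total durations of the two three-contour streams; under that assumption, both conditions end at the same time after trial onset.

```python
PAUSE_MS = 160      # pause between contour repetitions, as reported
REPETITIONS = 3     # each contour was played three times

def stream_ms(contour_ms):
    """Duration of three contours plus the two pauses between them."""
    return REPETITIONS * contour_ms + (REPETITIONS - 1) * PAUSE_MS

def lead_in_pause_ms(fast_ms, slow_ms):
    """Silence before the FAST stream that equalises the phase duration
    (assumption: the total stream durations are matched)."""
    return stream_ms(slow_ms) - stream_ms(fast_ms)

slow_total = stream_ms(1570)                             # 5030 ms
fast_total = stream_ms(942)                              # 3146 ms
fast_with_pause = lead_in_pause_ms(942, 1570) + fast_total
```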
After sentence presentation and a delay of 500 ms, comprehension questions were presented in 75% of trials. To avoid strategy build-up, questions requiring a yes answer and questions requiring a no answer were both included for each condition. Three kinds of questions were used: Type 1, negative for LONG and positive for SHORT (e.g. Exp. 1: Did Max see Karl?); Type 2, positive for both LONG and SHORT (e.g. Exp. 1: Did Max see Tom?); Type 3, negative for both LONG and SHORT (e.g. Exp. 1: Did Tom see Max?). To balance the proportion of expected yes and no answers, Type 1 and Type 2 were each used in 50% of questions for the LONG sentences. For SHORT, we used Type 1 in 50% of cases and Type 3 in the other 50%.
Participants were instructed to listen to the audio contour, read the sentences presented word-by-word, press a button to continue for the last word(s) of the sentence, and then answer a comprehension question in 75% of trials. There was a response timeout of 2,000 ms. In case of timeout, a screen stating "Please answer faster" appeared, and the experiment advanced to the next trial. Transfer effect was inferred from self-paced reading (SPR) times, as well as from reaction times (RTs) and accuracy on comprehension questions.

Data analysis (experiments 1-4)
Data analysis was performed in R (R Core Team, 2019). For the self-paced reading data, in each experiment we focused on two critical words: The first was the last word of a potential segment hypothetically induced by the previous contour (N3 for Experiments 1, 3, and 4; N2 for Experiment 2), and the second was the final word of the sentence, where the hypothetical garden-path effect could be detected (V2 for Experiments 1, 3, and 4; N3 for Experiment 2).
Figure 5. Paradigm (Experiment 4). Prosodic contour repeated 3 times to induce entrainment, followed by time-matched RSVP presentation of the target sentence. FAST entrainment aims at inducing correct segmentation of LONG sentence, but incorrect segmentation of SHORT sentence. SLOW entrainment, on the contrary, aims at inducing correct segmentation of SHORT sentence, but incorrect segmentation of LONG sentence.
Outliers were removed separately within the two critical self-paced words and within the comprehension questions, across conditions, according to the interquartile range (outer fences, using 1.5 as a multiplier; Tukey, 1977). Across Experiments 1-4, rejection rates for self-paced reading were 10%, 9.48%, 10.85%, and 11.3%, respectively; rejection rates for comprehension questions were 0%, 2.16%, 0.02%, and 0.11%, respectively. RTs were normalised utilising the Box-Cox method (Box & Cox, 1964). The effect of entrainment on sentence processing was modelled using a series of mixed-effects regression models, implemented in the lme4 package (Bates et al., 2015). Separate models were built for self-paced reading data at the two critical words as well as for comprehension questions.
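The two preprocessing steps can be sketched as follows. The sketch is illustrative rather than a reproduction of the original R code: the quartile definition (linear interpolation, matching R's default) and the example RTs are assumptions, and the Box-Cox parameter lambda, which is normally estimated from the data, is simply passed in.

```python
import math
import statistics

def tukey_filter(rts, multiplier=1.5):
    """Keep RTs within [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 as stated above."""
    q1, _, q3 = statistics.quantiles(rts, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [rt for rt in rts if low <= rt <= high]

def box_cox(value, lam):
    """Box-Cox transform of a strictly positive value for a given lambda."""
    if value <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(value)
    return (value ** lam - 1.0) / lam

# Hypothetical RTs in ms; the last value lies far above the upper fence
raw_rts = [300, 310, 320, 330, 340, 2000]
clean_rts = tukey_filter(raw_rts)
transformed = [box_cox(rt, 0.5) for rt in clean_rts]  # lambda = 0.5 is arbitrary
```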
For SPR, the predictor was entrainment SPEED (SLOW versus FAST), while the dependent variable was RT. Note that sentence TYPE (LONG versus SHORT) was not included as a factor in the SPR analysis: At the first critical word, the sentences were identical for LONG and SHORT, while the second critical word was only present in LONG sentences. For the comprehension questions, predictors were entrainment SPEED, sentence TYPE, and their interaction. Separate model comparisons were performed for RT and accuracy.
Predictors were coded using mean-centered effects coding. Random intercepts were included for subjects and verb pairs. The models with the maximal random-effects structure were attempted initially, but failed to converge in all cases. We therefore used a forward best-path method to determine which random slopes to include, based on established inclusion criteria (α = 0.2; Barr et al., 2011). The best-fitting models included by-subject and by-pair-number adjustments to the slopes of entrainment duration and sentence type in specific combinations. The fixed-effects structure and the model comparison path were determined via the predictor structure and the results of initial model comparison. In case of a single predictor (i.e. SPR), we compared the model that included both the predictor as a fixed effect and random effects to the random-effects-only null model. In case of two predictors (i.e. comprehension questions), we first compared single-predictor models separately to the null model. If any of these comparisons was significant, we then compared the single-predictor model to the model that included the main effects of both predictors, to see whether the second predictor explained additional variance. If the combination of both main effects improved model fit, we compared this model against a final model that included their interaction.
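The effects coding of the two predictors can be illustrated with a small sketch. The specific values (-0.5 and +0.5) and the assignment of levels to signs are assumptions; the text only states that the coding was mean-centered.

```python
SPEED_LEVELS = ("FAST", "SLOW")   # assumed coding: FAST = -0.5, SLOW = +0.5
TYPE_LEVELS = ("SHORT", "LONG")   # assumed coding: SHORT = -0.5, LONG = +0.5

def effects_code(level, levels):
    """Map the two levels of a factor to -0.5 and +0.5."""
    first, second = levels
    if level == first:
        return -0.5
    if level == second:
        return 0.5
    raise ValueError(f"unknown level: {level}")

def design_row(speed, sentence_type):
    """Fixed-effects columns for one trial, including the interaction term."""
    s = effects_code(speed, SPEED_LEVELS)
    t = effects_code(sentence_type, TYPE_LEVELS)
    return {"speed": s, "type": t, "speed_x_type": s * t}

# One row per cell of the balanced 2 x 2 design
rows = [design_row(s, t) for s in SPEED_LEVELS for t in TYPE_LEVELS]
```

In a balanced design, each coded column sums to zero, so the intercept estimates the grand mean and each main effect is evaluated at the average of the other factor.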

Experiment 1
We hypothesised that SLOW entrainment should accelerate SPR at N3 (third noun), that is, button presses would be faster when this word appears on the screen, as it would fall into a single segment with the preceding words (e.g. [Max sees Tom and Karl]). Conversely, SPR should decelerate under SLOW entrainment at V2 (second verb), which would be unexpected (e.g. [Max sees Tom and Karl] laughs). For comprehension questions, we expected higher RTs for LONG sentences, as they would produce a garden-path effect, and for this effect to be further strengthened under SLOW entrainment, which would reinforce the incorrect segmentation.
While these results show an influence of prosodic entrainment on processing, they do not support a direct influence on segmentation as such: If the sentence had been expected to end after N3, we should have obtained increased SPR times at V2 due to unexpectedness.
One possible alternative interpretation is that entrainment triggers an approximate temporal prediction of the duration of the upcoming segment or sentence. This converges on an earlier report that found prosodic context to enable the prediction of the count of remaining upcoming words within a sentence (Grosjean, 1983). This study presented subjects with sentences cut off at different timepoints of a target word. Participants could determine the word count of the entire sentence based on the preceding prosodic cues. Critically, count predictions were only approximate, covering a range of multiple counts around the correct one. The "duration prediction" account therefore claims that subjects respond faster to stimuli at the end of approximately predicted temporal intervals, irrespective of the sentence content and segmentation. Furthermore, this effect is not exclusive to sentence comprehension: it has also been validated for low-level stimuli, like acoustic tones, and recently further replicated (Herbst et al., 2022b). Moreover, the same studies have shown that the reaction times in response to these stimuli correlate with the phase of the delta-band oscillation in the brain. Importantly in the context of the current study, the prediction mechanism is not always precise in terms of segments; it constitutes a rough bet on whether the upcoming stimuli will be shorter or longer (cf. Grosjean, 1983), unlike the segmentation mechanism, which is precise in the length of the upcoming constituents. In the context of our study, participants may have defaulted to always expecting a LONG sentence (i.e. 50% of sentences continued after N3), as evident from their lower RTs on comprehension questions to LONG sentences. SLOW entrainment may have strengthened this default. Under FAST entrainment, however, the prediction for a longer sentence is weakened; this is demonstrated by higher RTs at N3.
Our pattern could also be captured by a second alternative interpretation, under which the rhythmic prosodic context directs the attentional focus towards a specific future time point (Calderone et al., 2014; Lakatos et al., 2007, 2019). This "temporal attention" account also presupposes temporal prediction, which is, however, not directed to segment duration, but rather to specific points in time. In our case, this account could capture the pattern of an effect at the N3 time point in spite of no detectable garden-path effect at the time point of V2.
To test the temporal attention hypothesis further, we conducted Experiment 2, which employed short non-ambiguous sentences without involving a hypothesis on segmentation. Additionally, this second experiment helped to investigate whether the effect of prosodic entrainment is confined to a certain range of sentence durations.

Experiment 2
Under the temporal attention account, for the SPR, we expected a speed-up at N2 (second noun) for the FAST contour, similar to the N3 (third noun) effect in Experiment 1. In addition, we hypothesised a speed-up at N3 (third noun) for SLOW contours in LONG sentences. For comprehension questions, we expected lower RTs for LONG sentences after SLOW contours and higher RTs after FAST contours, since the SLOW contour would draw attentional focus to the end of the LONG sentence, facilitating its processing. On the contrary, for the SHORT sentences we expected lower RTs under FAST entrainment and higher RTs under SLOW entrainment, since the attentional focus of the FAST contour would be precisely at the end of a SHORT sentence. We expected the accuracy data to follow the same pattern as RTs, due to the speed-accuracy trade-off we found in Experiment 1.
Again, the results confirm an influence of the rhythmic prior prosodic stimulus on sentence processing. The specific time point of the effect within the sentence, however, argues against an explanation in terms of temporal attention. Rather, it appears to support the duration prediction account: The N2 speed-up for the SLOW condition appears to indicate that subjects expected the sentence to continue into a LONG sentence when they had heard a SLOW prosodic contour before. In contrast, the N2 effect for the SLOW contour falsifies the temporal attention account, which predicted this effect for the FAST contour instead. Moreover, the temporal attention account does not predict a speed-up for the SLOW contour at N2; instead, it would have required a speed-up at N3. As to the isolated main effect of TYPE for accuracy without an RT decrease, it might be an artefact of the experimental design: as the SHORT sentences consisted of only three words in Experiment 2 (e.g. Max sees Tom), they were very easy to process for the participants.
The numeric decrease of the SPR effect relative to Experiment 1 may indicate that entrainment depends on sentence duration. Alongside testing our working hypothesis of a duration prediction, we thus reconsider our original segmentation hypothesis in Experiment 3: Insufficient sentence duration in Experiments 1 and 2 could have decreased potential garden-path effects at the verb. Consequently, we employed longer versions of the sentences from Experiment 1 again. Should we observe an SPR effect at the verb, we would conclude that prosodic entrainment affects segmentation only at a certain minimum sentence duration.

Experiment 3
Should the segmentation account be valid, FAST entrainment would induce correct segmentation of LONG sentences (= lower RTs), but incorrect segmentation of SHORT sentences (= higher RTs). SLOW entrainment, in turn, would induce correct segmentation of SHORT sentences (= lower RTs), but incorrect segmentation of LONG sentences (= higher RTs). On the contrary, should the duration prediction account be correct, the processing of the LONG sentences would be facilitated by the SLOW contour (= lower RTs) and inhibited by the FAST contour (= higher RTs), while the processing of the SHORT sentences would be facilitated by the FAST contour (= lower RTs) and hindered by the SLOW contour (= higher RTs). We also expected the accuracy to be proportional to RTs according to the speed-accuracy trade-off effect found in Experiment 1, no matter the supported account. Regarding the self-paced reading data, both accounts would predict a speed-up effect at N3 (third noun) for the SLOW contour. However, only the segmentation account generates a strong hypothesis for a slow-down at V2 (second verb) for the SLOW contour.
The speed-up at N3 is consistent with a strengthened prediction for a LONG sentence under SLOW entrainment. Still, this is also consistent with the segmentation hypothesis, since under SLOW entrainment N3 would putatively be in the same segment as the previous words, which could reduce processing time as well. Across-experiment comparisons suggest that the speed-up increases with sentence duration. The V2 trend, together with the increase of the V2 difference across Experiments 1 and 3, does provide some limited evidence for the segmentation account, indicating a possible garden-path effect. To test the hypothesis that the garden-path effect emerges only for longer sentences, we conducted Experiment 4 with increased sentence length.

Experiment 4
In Experiment 4, should the segmentation account be valid, FAST entrainment would induce correct segmentation of LONG sentences ( = lower RTs), but incorrect segmentation of SHORT sentences ( = higher RTs). SLOW entrainment, in turn, would induce correct segmentation of SHORT sentences ( = lower RTs), but incorrect segmentation of LONG sentences ( = higher RTs). On the contrary, should the duration prediction account be correct, the processing of the LONG sentences would be facilitated by the SLOW contour ( = lower RTs) and inhibited by the FAST contour ( = higher RTs), while the processing of the SHORT sentences would be facilitated by the FAST contour ( = lower RTs) and hindered by the SLOW contour ( = higher RTs). We also expected the accuracy to be proportional to RTs according to the speed-accuracy trade-off effect found in Experiment 1, no matter the supported account. Regarding the self-paced reading data, both accounts would predict a speed-up effect at N3 (third noun) for the SLOW contour. However, only the segmentation account would predict a slow-down at V2 (second verb) for the SLOW contour.
For self-paced reading, the speed-ups at N3 (Experiments 1 and 3) and N2 (Experiment 2) were replicated at N3 in Experiment 4: On average, participants were 14.57 ms faster under SLOW as compared to FAST entrainment; SPEED improved the model fit (χ²(1) = 15.85, p < 0.001, Figure 9a left). No effects were found at V2 (χ²(1) = 0.52, p = 0.471, Figure 9a right); numerically, there was not a delay, but a speed-up of 0.4 ms for the SLOW entrainment, contrary to the predictions of the segmentation account. The results of Experiment 4 disrupted the pattern observed in the previous three experiments: the average RT difference between SLOW and FAST entrainment is smaller for both N3 and V2 in Experiment 4 (14.57 ms and 0.4 ms, respectively) than in Experiment 3 (18.61 ms and 1.9 ms, respectively).
In comprehension questions, TYPE improved model fit for RTs (χ²(1) = 199.18, p < 0.001): SHORT sentences were processed more slowly than LONG sentences (Figure 9b left). Neither the factor of SPEED (χ²(1) = 0.006, p = 0.94) nor the TYPE × SPEED interaction (χ²(1) = 0.13, p = 0.71) improved the model fit for question RTs. For accuracy (Figure 9b right), we also found a significant improvement under the inclusion of TYPE (χ²(1) = 10.97, p < 0.001): participants were more accurate at SHORT sentences as compared to LONG, replicating the speed-accuracy trade-off found in previous experiments. No significant effects were found for the inclusion of SPEED (χ²(1) = 1.58, p = 0.209). We found, however, model improvement under the inclusion of the TYPE × SPEED interaction (χ²(1) = 5.43, p = 0.0198). Taken together with the above-mentioned speed-accuracy trade-off, this supports the duration prediction account: It was harder for participants to process LONG sentences under FAST entrainment, and SHORT sentences under SLOW entrainment (Figure 8b right).
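For readers who wish to check the reported p-values, the following minimal sketch converts a χ² statistic with one degree of freedom into its upper-tail probability. It assumes that the reported statistics stem from likelihood-ratio comparisons of nested models differing by a single parameter, as the phrasing "improved the model fit" suggests; the function name `chi2_df1_pvalue` is hypothetical, and the sketch is not the analysis code used in the study. It relies on the identity that, for one degree of freedom, the upper-tail probability equals erfc(√(x/2)).

```python
import math


def chi2_df1_pvalue(statistic: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For df = 1, P(X > x) = erfc(sqrt(x / 2)), because X is the square of a
    standard normal variable.
    """
    if statistic < 0:
        raise ValueError("A chi-square statistic cannot be negative.")
    return math.erfc(math.sqrt(statistic / 2.0))


if __name__ == "__main__":
    # Statistics as reported in the text for Experiment 4 (all with df = 1).
    reported = {
        "SPR, SPEED at N3": 15.85,
        "SPR, SPEED at V2": 0.52,
        "Accuracy, TYPE": 10.97,
        "Accuracy, SPEED": 1.58,
        "Accuracy, TYPE x SPEED": 5.43,
    }
    for label, stat in reported.items():
        print(f"{label}: chi2(1) = {stat:.2f}, p = {chi2_df1_pvalue(stat):.4f}")
```

Small deviations from the reported values are to be expected, since the χ² statistics in the text are rounded.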
The replicated N3 speed-up can be explained by both the duration prediction and segmentation accounts. However, the segmentation account would predict a slowdown at V2, which we failed to observe in Experiments 1 and 4. While these null effects do not allow us to draw a definitive conclusion, their combination with the time point of the SPR effect in Experiment 2 rather speaks in favour of a tentative interpretation in terms of duration prediction. Critically, since the results of Experiment 4 do not fall in line with the trend observed in previous experiments (increase of the speed-up/delay in RTs together with the sentence length), we can no longer claim that the prediction effect is strengthened or a garden-path effect is emerging as the duration of the sentence increases.
Nevertheless, we should note that another type of pattern emerges across the four experiments. As shown in Table 2, the duration prediction effect appears to be strongest at a prosodic contour duration of around 2.5 s. The 2.5-2.7 s limit has been previously associated with other important constraints in psycholinguistics, such as median sentence duration in speech (Vollrath et al., 1992) and a general time window for perception, simultaneity, or information integration (Baddeley et al., 1975;Pöppel, 1997;White, 2017). A possible explanation of this preference could be linked to working memory constraints; yet another interpretation could be connected to constraints at the electrophysiological level. In particular, entrainment might only be possible within a particular electrophysiological frequency band. Further electrophysiological research is needed to investigate this hypothesis.

General discussion
We investigated here whether entrainment to prosody influences the processing of a subsequent sentence. We initially hypothesised that entrainment would facilitate the segmentation of upcoming sentences, in line with the function of prosody in bottom-up segmentation. While the results suggest that prosodic entrainment does affect the processing of subsequent sentences, the specific pattern speaks against this hypothesis. The results rather suggest that prosodic entrainment allows for a prediction of the duration of the upcoming sentence; potentially, it even allows for a prediction of the duration of syntactic segments.
In general, a duration prediction interpretation converges on earlier psycholinguistic conceptualisations (Breen, 2014;Frazier et al., 2006;Grosjean, 1983): According to the rational speaker hypothesis, listeners interpret strong IPBs as predictive of longer upcoming segments (Carlson et al., 2001;Watson & Gibson, 2004). While our behavioural results do not allow for strong links to possible neuronal underpinnings of such effects, they are consistent with the finding that oscillations in the delta range entrain to rhythmic speech stimuli, triggering the estimation of the duration of subsequent time intervals between stimulation offset and the occurrence of upcoming auditory targets (Breska & Deouell, 2017;Herbst et al., 2022;Stefanics et al., 2010). Future work using electrophysiological methodology is required to ascertain that the repetitive prosodic carrier wave in the current study elicited a persisting oscillation that continued after the offset of the auditory stimulus, facilitating the comprehension of subsequent visual sentences through a prediction of their duration.
As already mentioned in the Introduction, we distinguish the current usage of the term entrainment from its usage to denote the general adjustment of two speakers to each other in a dialogue. From a broader perspective, however, the results of our study are also relevant to face-to-face communication between speakers. As known from previous work, prosodic context is rhythmic (Stehwien & Meyer, 2022) and can therefore facilitate speech processing by providing tentative predictions of upcoming sentence/segment length (Fodor, 2002;Frazier et al., 2006;Steinhauer & Friederici, 2001). It has also been previously shown at the syllable level that rhythm in the context may constitute a tool for the listener to facilitate processing in a dialogue, providing conversational entrainment (Wilson & Wilson, 2005). As to neural entrainment, it is known to operate across levels, such as phonetic, syntactic, prosodic, and finally, discourse- and referential-level processing (Meyer et al., 2019). Moreover, the latest evidence shows that conversational entrainment operates at the prosodic level as well (Levitan et al., n.d.;Levitan & Hirschberg, n.d.;Reichel et al., 2018). Therefore, speech rhythms could facilitate conversational entrainment via neural entrainment: speakers may use rhythmic prosodic cues to predict the duration of the interlocutor's upcoming phrases, thus improving and speeding up the communication process. Future studies could address the question of whether speakers also actively manage prosodic cues, as well as the possible neural underpinnings of this effect. Another interesting research direction could examine which of the speech features that speakers align during conversation are periodic enough to be exploited by periodic and possibly predictive processes.
The current results do not provide strong support for our initial segmentation hypothesis: If entrainment affected segmentation, sentence-final verbs in LONG sentences should have been harder to process under SLOW entrainment. This so-called garden-path effect has been observed previously for the coordination ambiguities used in the present study (Henke & Meyer, 2020;Hoeks, Hendriks, Vonk, Brown, & Hagoort, 2006;Hoeks et al., 2002). In Experiment 3, we obtained a trend for a modulation of this effect via entrainment; however, that was subsequently refuted by Experiment 4, where even longer sentences did not show such an effect. Our results also do not support an explanation in terms of priming (see Introduction), as in the event of direct priming of the sentence length by the prosodic contour, we would have had a garden-path effect for the second verb in the LONG sentences under the SLOW prosodic condition, which was not the case for any of the experiments in the series.