Linguistic constraints modulate speech timing in an oscillating neural network

Neuronal oscillations putatively track speech in order to optimize sensory processing. However, it is unclear how isochronous brain oscillations can track pseudo-rhythmic speech input. Here we investigate how top-down predictions flowing from internal language models interact with oscillations during speech processing. We show that word-to-word onset delays are shorter when words are spoken in more predictable contexts. A computational model including oscillations, feedback, and inhibition is able to track the natural pseudo-rhythmic word-to-word onset differences. As the model processes speech, it generates temporal phase codes, a candidate mechanism for carrying information forward in time in the system. Intriguingly, the model's response is more rhythmic for non-isochronous than for isochronous speech when onset times are proportional to the predictions of the internal model. These results show that oscillatory tracking of temporal speech dynamics relies not only on the input acoustics, but also on the linguistic constraints flowing from knowledge of language.


Introduction
Speech is a biological signal characterized by a plethora of temporal information. The temporal relation between subsequent speech units allows for the online tracking of speech in order to optimize processing at relevant moments in time [1-7]. Neural oscillations are a putative index of such tracking [3,8]. The existing evidence for neural tracking of the speech envelope is consistent with such a functional interpretation [9,10]. In these accounts, the most excitable, optimal phase of an oscillation is aligned with the most informative time point within a rhythmic input stream [8,11-14]. However, the range of onset-time differences between speech units appears more variable than fixed oscillations can account for [15-17]. As such, it remains an open question how oscillations can track a signal that is at best only pseudo-rhythmic [16].

Oscillatory accounts tend to focus on prediction in the sense of predicting "when" rather than predicting "what": oscillations function to align the optimal moment of processing, given that timing is predictable in a rhythmic input structure. If rhythmicity in the input stream is violated, oscillations must be modulated to retain optimal alignment to incoming information. This can be achieved through phase resets [15,18], direct coupling of the acoustics to oscillations [19], or the use of many oscillators at different frequencies [2]. However, the optimal or effective time of processing stimulus input might not only depend on when you predict something to occur, but also on what stimulus is actually being processed [20-23].

What and when are not independent, and certainly not from the brain's-eye view. If continuous input arrives at a node in an oscillatory network, the exact phase at which this node reaches threshold activation depends not only on the strength of the input, but also on how sensitive the node was to begin with. The sensitivity of a node in a language network is naturally affected by predictions in the what domain generated by an internal language model [24-27]. If a node represents a speech unit that is likely to be spoken next, it will be more sensitive and therefore active earlier, that is, at a less excitable phase of the oscillation. In the domain of working memory, this type of phase precession has been shown in the rat hippocampus [28,29] and more recently in human electroencephalography [30]. In speech, phase of activation and perceived content are also associated [31-34], and phase has been implicated in the tracking of higher-level linguistic structure [18,35,36]. However, a direct link between phase and the predictability flowing from a language model has yet to be established.

One speech phenomenon illustrating the interdependence of what and when is that the interpretation of an ambiguous vowel depends on the rate of the surrounding speech (in Dutch; [43-45]). Specifically, when speech rates are fast the stimulus is interpreted as a long vowel, and vice versa for slow rates. Modulating the entrainment rate, however, effectively changes the phase at which the target stimulus, which is presented at a constant speech rate, arrives (but this could not be confirmed in [46]). A second speech phenomenon shows a direct phase dependency of content [31]: ambiguous /da/-/ga/ stimuli are interpreted as /da/ at one phase and as /ga/ at another. An oscillatory theory of speech tracking should account for how temporal properties in the input stream can alter what is perceived.
In the speech production literature, there is strong evidence that the onset time (as well as the duration) of an uttered word is modulated by the frequency of that word in the language [47-49], showing that internal language models modulate the access to, or sensitivity of, a word node [24,50]. This word-frequency effect relates to access to a single word. However, it is likely that during ongoing speech internal language models use the full context to estimate upcoming words [51]. If so, the predictability of a word in context should provide additional modulation of speech timing. We therefore predict that words with high predictability in the producer's language model will be uttered relatively early. In this way, word-to-word onset times map onto the predictability of each word within the internal model. Thus, not only does processing time depend on the predictability of a word (faster processing for predictable words; see [52,53]), but so does production time (earlier utterance of predicted words).

Language comprehension involves the mapping of speech units from a producer's internal model onto the speech units of the receiver's internal model. In other words, one will only understand what someone else is writing or saying if one's language model is sufficiently similar to the speaker's (and if we speak in Dutch, fewer people will understand us). If the producer's and receiver's internal models match, even the timing of non-isochronous speech can be fully anticipated by the receiver (Figure 1).

Figure 1. Proposed interaction between speech timing and internal linguistic models. A) Isochronous production and expectation when there is a weak internal model; all speech units arrive around the most excitable phase. B) When the internal model of the producer does not align with the model of the receiver, temporal alignment and optimal communication fail. C) When both producer and receiver have a strong internal model, speech is non-isochronous and not aligned to the most excitable phase, but fully expected by the brain. D) Expected time is a constraint distribution whose center can be shifted by linguistic constraints.

Here, we investigate this interaction between speech timing and internal language models in an oscillatory network. First, we show that speech timing in the Corpus Gesproken Nederlands (CGN), a Dutch spoken-speech corpus, depends on the constraints flowing from the internal language model (estimated through a recurrent neural network). Next, we use a computational model to investigate how well a stable oscillator can process speech when it is combined with top-down linguistic predictions. The proposed model can explain speech timing in natural speech as well as how the phase/time of presentation can influence content perception. Intriguingly, the model shows stronger rhythmic responses (higher power at the mean presentation rate) for non-isochronous compared to isochronous stimuli, as long as the timing is proportional to the predictions of the internal model. Our results reveal that tracking of speech must be viewed as an interaction between ongoing oscillations and constraints flowing from an internal language model [21,24]. In this way, oscillations do not have to shift their phase after every speech unit and can remain at a relatively stable frequency, as long as the internal model of the speaker matches the internal model of the perceiver.

Results

Word frequency influences word duration
We used the Corpus Gesproken Nederlands (CGN; version 2.0.3) to extract the temporal properties of naturally spoken speech. This corpus contains elaborate annotations of over 900 hours of spoken Dutch and Flemish. We focus here on the subset of the data for which onset and offset timings were manually annotated at the word level in Dutch. Cleaning of the data included removing all dashes and backslashes. Only words that were part of a Dutch word2vec embedding (github.com/coosto/dutch-word-embeddings; needed for later modeling) and that occurred at least 10 times in the corpus were included. All other words were replaced with an <unknown> label. This resulted in 574,726 annotated words with 3096 unique words. Of these, 2848 words were recognized in the Dutch Wordforms database in CELEX (version 3.1), from which we extracted word frequency as well as the number of syllables per word. Mean word duration was 0.392 seconds, with an average standard deviation of 0.094 seconds (Supporting Figure 1A). By splitting the data into sequences of 10 sequential words we could extract the average word, syllable, and character rate (Supporting Figure 1B). The observed rates fall within the generally reported ranges for syllables (5.2 Hz) and words (3.7 Hz; [5,54]).

We predicted that knowledge of language statistics influences the duration of speech units; specifically, more prevalent words should on average have a shorter duration (also reported in [49]). In Figure 2A the durations of several mono- and bi-syllabic words are listed together with their word frequencies. From these examples it appears that words with higher word frequency generally have a shorter duration. To test this statistically we entered word frequency into an ordinary least squares regression with number of syllables as a control. Both number of syllables (coefficient = 0.1008, t(2843) = 75.47, p < 0.001) and word frequency (coefficient = -0.022, t(2843) = -13.94, p < 0.001) significantly influenced word duration. Adding an interaction term did not significantly improve the model (F(1,2843) = 1.320, p = 0.251; Figure 2B+C). The effect is so strong that low-frequency words can last three times as long as high-frequency words (even among mono-syllabic words). This indicates that word frequency could be an important part of an internal model that influences word duration.

The previous analysis prompted us to expand on the relation between word duration and word length. Obviously, there is a strong correlation between word length and mean word duration (number of characters: ρ = 0.824, p < 0.001; number of syllables: ρ = 0.808, p < 0.001; the latter already shown above; Figure 2D+E). This correlation is present, but much weaker, for the standard deviation of word duration (number of characters: ρ = 0.269, p < 0.001; number of syllables: ρ = 0.292, p < 0.001). A strong correlation does not imply that for every unit increase in word length the duration increases by the same amount; i.e., bi-syllabic words need not last twice as long as mono-syllabic words. Therefore, we converted word duration to a rate, taking into account the number of syllables/characters of the word. Thus a 250 ms mono- versus bi-syllabic word would have a rate of 4 versus 8 Hz, respectively.
We then correlated the character/syllabic rate with the number of characters/syllables in the word. If word duration increased proportionally with character/syllable count, there should be no correlation. We found that the syllabic rate varies between 3 and 8 Hz, as previously reported (Figure 2E right; [5,54]). However, the more syllables there are in a word, the higher this rate (ρ = 0.676, p < 0.001). This increase was weaker for the character rate (ρ = 0.499, p < 0.001; Figure 2D right).

Figure 2. Word frequency modulates word duration. A) Examples of mono- and bi-syllabic words of different word frequencies (in brackets; van=from, zijn=be, snel=fast, stem=voice, hebben=have, eten=eating, volgend=next, toekomst=future). Text in the graph indicates the mean word duration. B) Relation between word frequency and duration. Darker colors indicate more data points. C) Same as B) but separately for mono- and bi-syllabic words. D) Relation between character count and word duration. The longer the word, the longer the duration (left). The increase in word duration does not follow a fixed amount per character, as duration measured as a rate increases (right). E) Same as D) but for number of syllables. Red dots indicate the mean.
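To make the analysis concrete, the following is a minimal sketch of the duration regression and the interaction F-test reported above, assuming a pandas DataFrame with hypothetical column names (the synthetic data merely illustrate the expected pattern; this is not the authors' analysis code):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the per-word CGN aggregates (illustrative only):
# longer durations for more syllables, shorter for higher log frequency.
rng = np.random.default_rng(0)
n = 500
words = pd.DataFrame({
    "n_syllables": rng.integers(1, 4, n),
    "log_freq": rng.normal(3.0, 1.0, n),
})
words["duration"] = (0.1 * words.n_syllables - 0.02 * words.log_freq
                     + rng.normal(0.25, 0.05, n))

# OLS with syllable count as control; expect a negative frequency coefficient.
base = smf.ols("duration ~ n_syllables + log_freq", data=words).fit()
print(base.params)

# Interaction term compared against the base model with an F-test.
with_ix = smf.ols("duration ~ n_syllables * log_freq", data=words).fit()
print(with_ix.compare_f_test(base))  # returns (F, p, df_diff)
```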
These results show that the syllabic/character rate depends on the number of characters/syllables within a word and is not an independent temporal unit [37]. This effect is easy to explain when assuming that the prediction strength of an internal model influences word duration: transitional probabilities of syllables are simply more constrained within a word than across words [55]. This reduces the time it takes to utter/perceive any syllable that occurs later in a word.

Contextual predictions influence word onset delays

To estimate contextual predictions beyond single-word frequency, we trained a recurrent neural network (RNN) to predict the next word from the ten preceding words (Figure 3A). The output of the RNN reflects a probability distribution: the output values sum to one and each word has its own predicted value (Figure 3A). As such, we can extract the predicted value of the uttered word and compare it with the stimulus onset delay relative to the previous word. We entered word prediction into a regression model together with control predictors (word frequency and mean duration of the previous word, and a bigram model) and transformed the data accordingly (see Table 1; results were robust to changes in these transformations).

All predictors except the word frequency of the previous word showed a significant effect (Table 1). The variance explained by word frequency was likely captured by the mean duration of the previous word, which is correlated with word frequency. The RNN predictor captured more variance than the bigram model, suggesting that word onset delays are modulated by contextual predictions extending beyond the immediately preceding word.

Figure 3. RNN output influences word onset differences. A) Sequences of ten words were entered into an RNN in order to predict the content of the next word. Three examples are provided of input data with the label (bold word) and the probability output for three different words. The regression model showed that the last word in the sequence was systematically shorter depending on the probability of the RNN output (illustrated here with the shortened black boxes). B) Regression line estimated at the mean value of word duration and bigram. C) Scatterplot of prediction and onset difference for data within ±0.5 standard deviations of word duration and bigram. Note that for B and C the axes are linear on the transformed values.

STiMCON: Speech Timing in a Model Constrained Oscillatory Network

In order to investigate how much of these duration effects can be explained using an oscillator model, we created STiMCON (Speech Timing in a Model Constrained Oscillatory Network). STiMCON in its current form is not exhaustive; however, it can quantify how well an oscillating network can cope with asynchronies by using its own internal model, illustrating how the brain's language model and speech timing interact [56]. The current model can explain how top-down predictions influence processing time, and provides an explanation for two known temporal illusions in speech.

STiMCON consists of a network of semantic nodes in which the activation A at each level l is governed by:

A(l,T) = C(l+1→l)·A(l+1,T) + C(l−1→l)·A(l−1,T) + os(T) + I(Ta(l))    (1)

in which C represents the connectivity patterns between the levels, T the time in a sentence, and Ta the vector of times since activation of the individual nodes entering the inhibition function (in milliseconds). The inhibition function is a gate function:

I(Ta) = BaseInhib if the node is below threshold; suprathreshold activation for 0 ≤ Ta < 20 ms; increased inhibition relaxing back to BaseInhib for Ta ≥ 20 ms    (2)

in which BaseInhib is a constant for the base level of inhibition (a negative value, set to -0.2). Nodes are thus inhibited by default; as soon as a node is activated above threshold (activation threshold set at 1), Ta is reset to zero. The node then shows suprathreshold activation, which after 20 milliseconds gives way to increased inhibition until the base level of inhibition is restored.
The oscillation is a constant oscillator:

os(T) = Am·sin(2πωT + φ)    (3)

in which Am is the amplitude of the oscillator, ω the frequency, and φ the phase offset. We thus assume a stable oscillator that is already aligned to the average speech rate (see [15,19] for phase-alignment models).
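As a compact illustration of equations (1)-(3), the following sketch implements the node update for a single millisecond time step. Parameter names and the exact shape of the post-activation inhibition are assumptions for illustration; the published implementation may differ in detail:

```python
import numpy as np

BASE_INHIB = -0.2   # base level of inhibition (equation 2)
THRESHOLD = 1.0     # activation threshold
SUPRA_MS = 20       # duration of suprathreshold activation (ms)

def oscillation(t_ms, amp=1.0, freq_hz=4.0, phase=0.0):
    """Equation (3): os(T) = Am * sin(2*pi*omega*T + phi)."""
    return amp * np.sin(2 * np.pi * freq_hz * (t_ms / 1000.0) + phase)

def inhibition(ta_ms):
    """Equation (2), as a gate: base inhibition by default, 20 ms of
    suprathreshold activation after crossing threshold, then stronger
    inhibition relaxing back to the base level (assumed linear decay)."""
    if ta_ms is None:                 # node has not crossed threshold
        return BASE_INHIB
    if ta_ms < SUPRA_MS:              # suprathreshold window
        return THRESHOLD
    return min(BASE_INHIB, -1.0 + 0.01 * (ta_ms - SUPRA_MS))

def update_activation(feedback, sensory, t_ms, ta_ms):
    """Equation (1): top-down feedback + bottom-up input + oscillation
    + inhibition, evaluated per node at time t_ms."""
    return feedback + sensory + oscillation(t_ms) + inhibition(ta_ms)
```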

Language models influence time of activation
To illustrate how STiMCON can explain how processing time depends on the predictions of internal language models, we instantiated a language model that had seen only three sentences, comprising five words, presented at different probabilities (/I eat cake/ at 0.5 probability, /I eat nice cake/ at 0.3 probability, /I eat very nice cake/ at 0.2 probability; Table 2). This language model serves as the feedback arriving from the l+1 level to the l level. The l level consists of five nodes that each represent one of the words and receive proportional feedback from l+1 according to Table 2, at a delay of 0.9 of an oscillatory cycle, which then decays at 0.01 units per millisecond and influences the l level at a proportion of 1.5. This feedback is only initiated when suprathreshold activation arrives due to bottom-up input from the l−1 level. The l−1-level input is modelled as a linear function representing increasing sensory confidence over a length of 125 milliseconds (half a cycle). φ is set such that the peak of a 4 Hz oscillation aligns with the peak of the sensory input of the first word. Sensory input is presented at a base stimulus onset asynchrony of 250 milliseconds (i.e., 4 Hz).

When we present this model with different sensory input at an isochronous rhythm of 4 Hz, it is evident that the time at which different nodes reach activation depends on the level of feedback provided (Figure 4). For example, while the /I/ node takes a while to become activated after the initial sensory input, the /eat/ node is activated earlier, as it is pre-activated by feedback. After presenting /eat/, feedback arrives at three different nodes, and the activation timing depends on the stimulus that is presented (earlier activation for /cake/ than for /very/).
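A sketch of this toy language model, written as conditional next-word probabilities (the dictionary mirrors Table 2 as described; the delay, gain, and decay values follow the text, with the 0.9-of-a-cycle delay being our reading of the original parameterization):

```python
# Next-word probabilities implied by the three training sentences
# (0.5 "I eat cake", 0.3 "I eat nice cake", 0.2 "I eat very nice cake").
NEXT_WORD_PROB = {
    "I":    {"eat": 1.0},
    "eat":  {"cake": 0.5, "nice": 0.3, "very": 0.2},
    "very": {"nice": 1.0},
    "nice": {"cake": 1.0},
}

FEEDBACK_GAIN = 1.5            # feedback influences level l at proportion 1.5
FEEDBACK_DECAY = 0.01          # decay per millisecond after arrival
FEEDBACK_DELAY_MS = 0.9 * 250  # 0.9 of a 4 Hz cycle (assumed)

def feedback_strength(prev_word, node, t_since_prev_activation_ms):
    """Top-down pre-activation of `node` after the previous word's node
    reached suprathreshold activation (illustrative parameterization)."""
    if t_since_prev_activation_ms < FEEDBACK_DELAY_MS:
        return 0.0
    p = NEXT_WORD_PROB.get(prev_word, {}).get(node, 0.0)
    elapsed = t_since_prev_activation_ms - FEEDBACK_DELAY_MS
    return max(0.0, FEEDBACK_GAIN * p - FEEDBACK_DECAY * elapsed)
```

With this table, the /cake/ node receives the strongest pre-activation after /eat/, which is why it crosses threshold earlier than /very/ in Figure 4.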

Time of presentation influences processing efficiency
To investigate how the time of presentation influences processing efficiency, we presented the model with /I eat XXX/, in which the last word was varied in content (either /I/, /very/, /nice/, or /cake/), intensity (linearly ranging from 0 to 1), and onset delay (ranging from -125 to +125 ms relative to isochronous presentation). We extracted the time at which the node matching the presented stimulus first reached activation threshold (relative to stimulus onset, and relative to isochronous presentation).

Figure 5A shows the output. When there is no prediction strength, i.e., for the /I/ presentation, a classical efficiency map is found in which processing is most optimal (possible at the lowest stimulus intensities) at isochronous presentation and drops off to either side. For nodes that receive feedback, input processing is possible at earlier times relative to isochronous presentation and varies parametrically with prediction strength (earlier for /cake/ at 0.5 probability than for /very/ at 0.2 probability). Additionally, the activation function is asymmetric. This is a consequence of the interaction between the supra-activation caused by the feedback and the sensory input. As soon as supra-activation is reached due to feedback, sensory input at any intensity will lead to suprathreshold activity (thus at early stages of the linearly increasing confidence of the input). This is why, for the /very/ stimulus, activation is still reached at later delays compared to /nice/ and /cake/: the /very/ node reaches supra-activation due to feedback at a later time point.

Figure 5. Model output on processing efficiency and rhythmicity. A) Time of presentation influences efficiency. The dashed line is presented to ease comparison between the four content types. White indicates that threshold is never reached. B) Same as A, but estimated at a threshold of 0.53, showing that oscillations regulate feedforward timing. Panel A shows that the earlier stimuli are presented (on a weaker point of the ongoing oscillation), the longer it takes until suprathreshold activation is reached. This figure shows that timing relative to the ongoing oscillation is regulated such that stimulus activation timing is closer to isochronous. Line discontinuities are a consequence of stimuli never reaching threshold for a specific node. C) Strength of 4 Hz power depends on predictability in the stream. When predictability alternates between low and high, activation is more rhythmic when the predictable odd stimulus arrives earlier, and vice versa. D) Slices of C at intensities of 0.8 and 1.0. E) Magnitude spectra at three different odd-word offsets at 1.0 intensity. To illustrate the differences more clearly, the magnitude to the power of 20 is plotted.
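The logic behind the efficiency map of Figure 5A can be sketched with a single pre-activated node; the dynamics below are a simplification of the full model (no feedback loop or inhibition reset), with illustrative parameters:

```python
import numpy as np

THRESHOLD, BASE_INHIB = 1.0, -0.2

def first_crossing(onset_shift_ms, intensity, pred, freq_hz=4.0):
    """Time (ms, relative to stimulus onset) at which a node with
    pre-activation `pred` first crosses threshold; NaN if never."""
    cycle = 1000.0 / freq_hz
    for t in range(int(cycle)):
        # oscillation peaks at the sensory peak (125 ms) when the shift is 0
        osc = np.sin(2 * np.pi * freq_hz * (t + onset_shift_ms) / 1000.0
                     - np.pi / 2)
        # linearly rising sensory confidence over half a cycle
        sensory = intensity * min(t / (cycle / 2), 1.0)
        if pred + sensory + osc + BASE_INHIB >= THRESHOLD:
            return t
    return np.nan

# Stronger predictions permit threshold crossing at earlier onset shifts:
for pred in (0.0, 0.2, 0.3, 0.5):
    row = [first_crossing(d, intensity=0.6, pred=pred)
           for d in (-100, -50, 0, 50, 100)]
    print(pred, row)
```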

When we investigate timing differences in stimulus presentation, it is important to also consider what this means for timing in the brain. Above, we showed that the amount of prediction can influence timing in our model. It is also evident that the earlier a stimulus is presented, the more time it takes (relative to the stimulus) for the nodes to reach threshold (more yellow colors at earlier delays). This is a consequence of the oscillation still being at a relatively low-excitability point at stimulus onset for stimuli presented early in the cycle. However, when we translate these activation-threshold timings to the timing of the ongoing oscillation, the variation is strongly reduced (Figure 5B). A stimulus timing that varies over about 130 milliseconds (e.g., from -59 to +72 ms on the /cake/ line, excluding the non-linear section of the line) leads to only 19 milliseconds of variation in the first supra-threshold response of the model (a reduction from 53% to 8% of the cycle of the ongoing oscillation, i.e., a 1:6.9 ratio). This means that within this model (and any oscillating model) the activation of nodes is robust to some timing variation in the environment. This effect was weaker when no prediction was present (for the /I/ stimulus the ratio was around 1:3.5; note that when determining the /cake/ range using the full line, the ratio would be 1:3.4).
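The compression figures quoted above follow directly from the 250 ms cycle of a 4 Hz oscillation; a quick check of the arithmetic:

```python
cycle_ms = 250                        # one 4 Hz cycle
onset_range = 72 - (-59)              # ~130 ms of stimulus-onset variation
response_range = 19                   # ms of variation in first response
print(onset_range / cycle_ms)         # ~0.52 -> ~53% of the cycle
print(response_range / cycle_ms)      # ~0.08 -> ~8% of the cycle
print(round(onset_range / response_range, 1))  # ~6.9 -> a 1:6.9 ratio
```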

Top-down interactions can provide rhythmic processing for non-isochronous stimulus input
The previous simulations demonstrate that oscillations provide a temporal filter and that processing itself can actually be closer to isochronous than what can be extracted solely from the stimulus input. Next, we investigated whether, depending on changes in top-down prediction, processing within the model becomes more or less rhythmic. To do this, we created stimulus input of 10 sequential words at a base rate of 4 Hz with constant (low at 0 and high at 0.8 predictability) or alternating word-to-word predictability. In the alternating conditions, word-to-word predictability alternated either from low to high or from high to low (sequences in which words are alternately predicted at 0 and 0.8 probability). For this simulation we used Gaussian sensory input (with a standard deviation of 42 ms, aligning the mean to the peak of the ongoing oscillation; see Supporting Figure 5 for output with linear sensory input). We then varied the onset time of the odd words in the sequence (shifting from -100 up to +100 ms) and the stimulus intensity (from 0.2 to 1.5). We extracted the overall activity of the model and computed the fast Fourier transform of the resulting time course (using a Hanning taper and only including data from 0.5-2.5 seconds, to exclude onset responses).

The first thing that is evident is that the model with no content predictions has overall stronger power, specifically around isochronous presentation (odd-word offset of 0 ms) at high stimulus intensities (Figure 5C-E). Adding overall high predictability reduces the power, but here too the power is symmetric around zero. The spectra of the alternating predictability conditions look different. For the low-to-high predictability condition the curve is shifted to the left, such that 4 Hz power is strongest when the predictable odd stimulus is shifted to an earlier time point. This is reversed for the high-to-low condition. At intermediate stimulus intensities there is a specific temporal specificity window at which the 4 Hz power is particularly strong. This window is earlier for the low-to-high than for the high-to-low alternation (Figure 5D, Figure 5E, and Supporting Figure 6). These results show that even though the stimulus input is non-isochronous, the interaction with the internal model can still create a rhythmic structure in the brain (see [57,58]). Note that the direction in which the brain response is more rhythmic matches the natural onset delays in speech (shorter onset delays for more predictable stimuli).

STiMCON can explain word onset delays in natural speech

Finally, we investigated whether STiMCON's dynamics can explain the word-to-word onset delays found in the CGN. In the model, a node pre-activated by a prediction reaches threshold when the sum of the prediction and the oscillation crosses the activation threshold:

threshold = pred + Am·sin(2πωT + φ)    (4)

We can solve this formula in order to investigate the relative time shift (T) in processing that is a consequence of the strength of the prediction (ignoring that the exact timing will also depend on the strength of the input and the inhibition):

T = (arcsin((threshold − pred)/Am) − φ) / (2πω)    (5)

ω was set to the syllable rate of each sentence; Am and φ were systematically varied. We again fitted a linear model; however, as we were interested in how well non-transformed data could predict the exact onset timing, we did not perform any normalization besides equation (5).
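A direct implementation of equation (5) (as reconstructed above) shows the key property used in the fit: more strongly predicted words reach threshold at earlier relative times:

```python
import numpy as np

def relative_time_shift(pred, amp, freq_hz, phase, threshold=1.0):
    """Equation (5): solve threshold = pred + Am*sin(2*pi*omega*T + phi)
    for the relative time shift T (in seconds)."""
    x = np.clip((threshold - pred) / amp, -1.0, 1.0)  # arcsin domain
    return (np.arcsin(x) - phase) / (2 * np.pi * freq_hz)

# Higher prediction -> earlier (smaller) relative time shift:
for pred in (0.1, 0.3, 0.5):
    print(pred, relative_time_shift(pred, amp=1.0, freq_hz=4.0, phase=0.0))
```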
The results show a modulation of the R² depending on the amplitude and phase offset of the oscillation (Figure 6A), which was stronger than the non-transformed R² (0.389). This suggests that oscillatorily modulated top-down predictions influence word-by-word duration. This was even stronger for specific oscillatory alignments (around a -0.25π offset), suggesting an optimal alignment phase relative to the ongoing oscillation [3,8]. Interestingly, the optimal transformation automatically converted a highly skewed prediction distribution (Figure 6B) into a more normal distribution of relative time shifts (Figure 6C).

STiMCON can model content-dependent temporal illusions

STiMCON can also capture two temporal illusions in speech perception: the finding that ambiguous /da/-/ga/ stimuli are interpreted differently depending on the phase at which they arrive [31], and the finding that the interpretation of ambiguous vowel duration depends on the preceding speech rate [45]. The only assumption that has to be made is that there is an uneven base prediction balance between the ways the ambiguous stimulus can be interpreted.

To model the phase effect, we presented the model with ambiguous input (different proportions of two words mixed together) at different delays and determined which node reached suprathreshold activation first (Figure 7A).

To reproduce the speech-rate effect, we repeated the previous simulation but fixed the onset of the target word at 300 milliseconds and varied the frequency of presentation (2-6 Hz). Indeed, for ambiguous stimuli, content interpretation depended on the speech rate (Figure 7B; but see [46]).

Figure 7. Results for temporal illusions. A) Modulations due to ambiguous input at different times: illustration of the node that is active first. B) Different proportions of the /very/ stimulus show activation-timing modulations at different delays. Outlines indicate the node that is active first at that specific delay.
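The mechanics of the rate illusion rest on a simple fact: with a fixed 300 ms target onset, changing the presentation rate changes the oscillatory phase at which the target arrives. A quick illustration (phase offset assumed zero):

```python
import numpy as np

onset_s = 0.300  # fixed target onset
for rate_hz in (2, 3, 4, 5, 6):
    phase = (2 * np.pi * rate_hz * onset_s) % (2 * np.pi)
    print(f"{rate_hz} Hz -> target arrives at phase {phase/np.pi:.2f}*pi")
```

Given an uneven base prediction between the two interpretations, different arrival phases favor different nodes, which is what flips the percept in Figure 7B.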

Discussion
In the current paper, we combined an oscillatory model with a proxy for linguistic knowledge, an internal language model, in order to investigate the model's capacity to process onset-timing differences in natural speech. We show that word-to-word speech onset differences relate to predictions flowing from the internal language model (estimated through an RNN). Fixed oscillations aligned to the mean speech rate are robust against natural temporal variations, and are even optimized for temporal variations that match the predictions flowing from the internal model. Strikingly, when the pseudo-rhythmicity in speech matched the predictions of the internal model, responses were more rhythmic for matched pseudo-rhythmic than for isochronous speech input. These results show that part of the pseudo-rhythmicity of speech is expected by the brain, which is even optimized to process speech in this manner, but only when it follows the internal model.

Speech timing is variable, and in order to understand how the brain tracks this pseudo-rhythmic signal we need a better understanding of how this variability arises. Here, we isolated one of the components explaining speech-timing variation, namely the constraints posed by an internal language model. This goes beyond extracting the average speech rate [5,19,54] and might be key to understanding how a predictive brain uses temporal cues. We show that speech timing depends on the predictions made by an internal language model, even when those predictions are reduced to something as simple as word predictability. While syllables generally follow a theta rhythm, there is a systematic increase in syllabic rate as more syllables occur in a word. This is likely a consequence of the more constrained transitional probabilities of syllables within a word, which reduce the onset differences of later-uttered syllables [55]. An oscillatory model constrained by an internal language model is not only robust to these temporal variations, it is actually capable of processing them optimally.

The oscillatory model we pose here has three components: oscillations, feedback, and inhibition. The oscillations allow for the parsing of speech and provide windows in which information is processed [3,38,60,61]. Importantly, the oscillation acts as a temporal filter, such that the activation time of any incoming signal is confined to the highly excitable window and is thereby relatively robust against small temporal variations (Figure 5B). The feedback allows for differential activation times depending on the sensory input (Figure 5A). As a consequence, the model is more sensitive to more predictable speech input, which therefore becomes active earlier in the duty cycle. The inhibition allows the network to be more sensitive to less predictable speech units when they arrive later (the more predictable nodes get inhibited at some point during the oscillation; best illustrated by the simulation in Figure 7A). In this way speech is ordered along the duty cycle according to its predictability [42,62]. This form of inhibition, in combination with an oscillatory model, can explain speech-rate and phase-dependent content effects. Moreover, it constitutes an automatic temporal code that can use time of activation as a cue for content [41].
The three components of the model are common brain mechanisms [29,41,63-66] and follow many previously proposed organizational principles (e.g., temporal coding and parsing of information). While we implement these components at an abstract level (not veridical to the exact parameters of neuronal interactions), they illustrate how oscillations, feedback, and inhibition interact to optimize sensitivity to natural pseudo-rhythmic speech.

The current model is not exhaustive and does not provide a complete explanation of all details of speech processing in the brain. For example, it is likely that primary auditory cortex is still mostly modulated by the acoustic pseudo-rhythmic input and that only later brain areas follow more closely the constraints posed by the brain's language model. Therefore, more hierarchical levels need to be added to the current model (which is possible following equation (1)). Moreover, the current model does not allow for phase or frequency shifts. This was intentional, in order to investigate how much a fixed oscillator could explain. We show that onset times matching the predictions from the internal model can be explained by a fixed oscillator processing pseudo-rhythmic input. However, when the internal model and the onset timings do not match, phase and/or frequency shifts are still required and need to be incorporated (see, e.g., [15,19]). Still, any coupling between brain oscillations and speech acoustics [19] needs to be extended with the coupling of brain oscillations to brain-activity patterns of internal models [67].

In the current paper we used an RNN to represent the internal model of the brain. However, it is unlikely that the RNN captures the full complexity of the brain's language model. The decades-long debate about the origin of a language model in the brain remains ongoing and controversial. Using the RNN as a proxy for our internal language model makes the tacit assumption that language is fundamentally statistical or associative in nature, and does not posit the derivation or generation of knowledge of grammar from the input [68,69]. In contrast, our brain could just as well store knowledge of language that functions as fundamental interpretation principles guiding our understanding of language input [21,24,50,61,70]. Knowledge of language and linguistic structure could be acquired through an internal self-supervised comparison process extracting environmental invariants and statistical regularities from the stimulus input [71-73]. Future research should investigate which language model best accounts for the temporal variations found in speech.

A natural feature of our model is that time can act as a cue for content, implemented as a phase code [42,62]. This code unravels as an ordered list of the predictability strengths of the internal model. We predict that if speech nodes have different base activity, ambiguous stimulus interpretation should depend on the time/phase of presentation (see [31,59]). Indeed, we could model two temporal speech illusions (Figure 7). There have also been null results regarding the influence of phase on ambiguous stimulus interpretation [46,74]; studies testing this account should therefore specifically use predictions in their design.

Figure 8. Predictions of the model. A) Acoustic signals will be more rhythmic when a producer has a weak versus a strong internal model (top right).
When the producer's strong model matches the receiver's model, the brain response will be more rhythmic for less rhythmic acoustic input. B) When a producer realizes that the model of the receiver is weak, it might transform its model, and thereby its speech timing, to match the receiver's expectations.

The temporal dynamics of speech signals need to be integrated with the temporal dynamics of brain signals. However, it is unnecessary (and unlikely) that the exact durations in speech match the exact durations of brain processes. Temporal expansion or compression of stimulus input occurs regularly in the brain [75,76], and this temporal morphing also maps onto duration [77-79] or order illusions [80]. Our model predicts increased rhythmic responses for non-isochronous speech matching the internal model. The perceived rhythmicity of speech could therefore also be an illusion generated by a rhythmic brain signal somewhere in the brain.

Consistent with this account, speech appears more isochronous when a strong internal model is lacking either on the producer's or on the receiver's side, as in infant-directed speech. Infant-directed speech also illustrates that a producer might proactively adapt its speech rhythm to align better with the predictions from the receiver's internal model (Figure 8B; similar to when you are speaking to somebody who is just learning a new language). Other situations in which speech is more isochronous include poems, emotional conversation [83], and noisy situations [84]. While speculative, it is conceivable that in these circumstances one puts more weight on a different level of the hierarchy than the internal linguistic model. In the case of poems and emotional conversation, an emotional route might get more weight in processing; in the case of noisy situations, stimulus input has to pass the first hierarchical level of primary auditory cortex, which effectively gets more weight than the internal model.

Table 3. Predictions from the current model. The theoretical account provides various predictions:
- The more predictable a word, the earlier this word is uttered.
- When there is a flat constraint distribution over an utterance (e.g., when probabilities are uniform over the utterance), the acoustics of speech should naturally be more rhythmic (Figure 8A).
- If speech timing matches the internal language model, brain responses should be more rhythmic even if the acoustics are not (Figure 8A).
- The more similar the internal language models of two speakers, the more effective they are in 'entraining' each other's brains.
- If speakers suspect their listener to have a flatter constraint distribution than themselves (e.g., the environment is noisy, or the speakers are in a second-language context), they adjust to that distribution by speaking more rhythmically (Figure 8B).
- One adjusts the weight of the constraint distribution across hierarchical levels when needed. For example, when there is noise, speakers adjust to the rhythm of primary auditory cortex instead of higher-order language models; as a consequence, they speak more rhythmically.

Conclusions
We argued that the pseudo-rhythmicity of speech is in part a consequence of top-down predictions flowing from an internal model of language. This pseudo-rhythmicity is created by a speaker and expected by a receiver if they have overlapping internal language models. Oscillatory tracking of this signal need not be hampered by the pseudo-rhythmicity, but can use temporal variations as a cue to extract content information, since the phase of activation parametrically relates to the likelihood of an input relative to the internal model. Brain responses can even be more rhythmic for pseudo-rhythmic than for isochronous speech if the speech follows the temporal delays imposed by the internal model. This account provides various testable predictions, which we list in Table 3 and Figure 8. We believe that by integrating neuroscientific explanations of speech tracking with linguistic models of language processing [21,24], we can better explain temporal speech dynamics. This will ultimately aid our understanding of language in the brain and provide a means to improve the temporal properties of synthesized speech.