Oscillatory tracking of pseudo-rhythmic speech is constrained by linguistic predictions

Neuronal oscillations putatively track speech in order to optimize sensory processing. However, it is unclear how isochronous brain oscillations can track pseudo-rhythmic speech input. Here we propose that oscillations can track pseudo-rhythmic speech when considering that speech time is dependent on predictions flowing from internal language models. We show that the temporal dynamics of speech are dependent on the predictability of words in a sentence. A computational model including oscillations, feedback, and inhibition is able to track the natural pseudo-rhythmic speech input. As the model processes the input, it generates temporal phase codes, which are a candidate mechanism for carrying information forward in time. The model is optimally sensitive to the natural temporal speech dynamics and can explain empirical data on temporal speech illusions. Our results reveal that speech tracking does not rely solely on the input acoustics but instead entails an interaction between oscillations and constraints flowing from internal language models.


Introduction
Speech is a biological signal that is characterized by a plethora of temporal information. The temporal relation between subsequent speech units allows for the online tracking of speech in order to optimize processing at relevant moments in time [1-7]. Neural oscillations are a putative index of such tracking [3,8]. The existing evidence for neural tracking of the speech envelope is consistent with such a functional interpretation [9,10]. In these accounts, the most excitable, optimal phase of an oscillation is aligned with the most informative time point within a rhythmic input stream [8,11-14]. However, the range of onset time differences between speech units seems more variable than fixed oscillations can account for [15-17]. As such, it remains an open question how oscillations can track a signal that is at best only pseudo-rhythmic [16].

Oscillatory accounts tend to focus on prediction in the sense of predicting "when," rather than predicting "what": oscillations function to align the optimal moment of processing given that timing is predictable in a rhythmic input structure. If rhythmicity in the input stream is violated, oscillations must be modulated to retain optimal alignment to incoming information. This can be achieved through phase resets [15,18], direct coupling of the acoustics to oscillations [19], or the use of many oscillators at different frequencies [2]. However, the optimal or effective time of processing stimulus input might not only depend on when you predict something to occur, but also on what stimulus is actually being processed [20-23].

What and when are not independent, and certainly not from the brain's-eye-view. If continuous input arrives at a node in an oscillatory network, the exact phase at which this node reaches threshold activation does not only depend on the strength of the input, but also on how sensitive this node was to start with. The sensitivity of a node in a language network (or any neural network) is naturally affected by predictions in the what domain generated by an internal language model [24-27]. If a node represents a speech unit that is likely to be spoken next, it will be more sensitive and therefore active earlier, that is, on a less excitable phase of the oscillation. In the domain of working memory, this type of phase precession has been shown in rat hippocampus [28,29] and more recently in human electroencephalography [30]. In speech, phase of activation and perceived content are also associated [31-35], and phase has been implicated in the tracking of higher-level linguistic structure [18,36,37]. However, the direct link between phase and the predictability flowing from a language model has yet to be established.

… stream can alter what is perceived.

In the speech production literature, there is strong evidence that the onset time (as well as duration) of an uttered word is modulated by the frequency of that word in the language [48-52], showing that internal language models modulate the access to or sensitivity of a word node [24,53]. This word-frequency effect relates to the access to a single word. However, it is likely that during ongoing speech internal language models use the full context to estimate upcoming words [54]. If so, the predictability of a word in context should provide additional modulations of speech time.
Therefore, we predict that words with a high predictability in the producer's language model should be uttered relatively early. In this way, word-to-word onset times map onto the predictability level of that word within the internal model. Thus, not only does processing time depend on the predictability of a word (faster processing for predictable words; see [55,56] and [57] showing that speech time in noise matters), but so does production time (earlier uttering of predicted words). … Figure 1D). If this is true, pseudo-rhythmicity is fully natural to the brain, and it provides a means to use time or arrival phase as a content indicator. It also allows the receiver to be sensitive to less predictable words when they arrive relatively late. Current …

Figure 1. Proposed interaction between speech timing and internal linguistic models. A) Isochronous production and expectation when there is a weak internal model (even distribution of node activation). All speech units arrive around the most excitable phase. B) When the internal model of the producer does not align with the model of the receiver, temporal alignment and optimal communication fail. C) When both producer and receiver have a strong internal model, speech is non-isochronous and not aligned to the most excitable phase, but fully expected by the brain. D) Expected time is a constraint distribution whose center can be shifted due to linguistic constraints.
Here, we propose that neural oscillations can track pseudo-rhythmic speech by taking into account that speech timing is a function of linguistic constraints. As such, we need to demonstrate that speech statistics are influenced by linguistic constraints, as well as show how oscillations can be sensitive to this property of speech. We approach this hypothesis as follows: First, we demonstrate that timing in natural speech depends on linguistic predictions (temporal speech properties). Then, we model how oscillations can be sensitive to these linguistic predictions (modeling speech tracking). Finally, we validate that this model is optimally sensitive to the natural temporal properties of speech and displays temporal speech illusions (model validation). Our results reveal that tracking of speech needs to be viewed as an interaction between ongoing oscillations and constraints flowing from an internal language model [21,24]. In this way, oscillations do not have to shift their phase after every speech unit and can remain at a relatively stable frequency as long as the internal model of the speaker matches the internal model of the perceiver.

Results
Temporal speech properties: word frequency influences word duration

To extract the temporal properties of naturally spoken speech we used the Corpus Gesproken Nederlands (CGN; Version 2.0.3, 2014). This corpus consists of elaborate annotations of over 900 hours of spoken Dutch and Flemish words. We focus here on the subset of the data for which onset and offset timings were manually annotated at the word level in Dutch. Cleaning of the data included removing all dashes and backslashes. Only words that were part of a Dutch word2vec embedding (github.com/coosto/dutch-word-embeddings; needed for later modeling) and that had a frequency of at least 10 in the corpus were included. All other words were replaced with an <unknown> label. This resulted in 574,726 annotated words with 3096 unique words. 2848 of the words were recognized in the Dutch Wordforms database in CELEX (Version 3.1), from which we extracted the word frequency as well as the number of syllables per word. Mean word duration was 0.392 seconds, with an average standard deviation of 0.094 seconds (Supporting Figure 1A). By splitting up the data in sequences of 10 sequential words we could extract the average word, syllable, and character rate (Supporting Figure 1B). The reported rates fall within the generally reported ranges for syllables (5.2 Hz) and words (3.7 Hz; [5,58]).

We predict that knowledge about the statistics of the language influences the duration of speech units. Specifically, we predict that more prevalent words will on average have a shorter duration (also reported in [50]). In Figure 2A the durations of several mono- and bi-syllabic words are listed with their word frequencies. From these examples it seems that words with a higher word frequency generally have a shorter duration. To test this statistically we entered word frequency into an ordinary least squares regression with number of syllables as a control. Both number of syllables (coefficient = 0.1008, t(2843) = 75.47, p < 0.001) and word frequency (coefficient = -0.022, t(2843) = -13.94, p < 0.001) significantly influenced the duration of the word. Adding an interaction term did not significantly improve the model (F(1,2843) = 1.320, p = 0.251; Figure 2B+C). The effect is so strong that words with a low frequency can last three times as long as high-frequency words (even within mono-syllabic words). This indicates that word frequency could be an important part of an internal model that influences word duration.

The previous analysis prompted us to expand on the relation between word duration and word length. Obviously, there is a strong correlation between word length and mean word duration (number of characters: ρ = 0.824, p < 0.001; number of syllables: ρ = 0.808, p < 0.001; for number of syllables already shown above; Figure 2D+E). This correlation is present, but much weaker, for the standard deviation of word duration (number of characters: ρ = 0.269, p < 0.001; number of syllables: ρ = 0.292, p < 0.001). Finding a strong correlation does not imply that for every unit increase in word length the duration of the word increases by the same amount; i.e., bi-syllabic words do not necessarily have to last twice as long as mono-syllabic words. Therefore, we converted word duration to a rate, taking into account the number of syllables/characters of the word.
Thus, a 250 ms mono- versus bi-syllabic word would have a rate of 4 versus 8 Hz, respectively. We then correlated character/syllabic rate with word duration. If word duration increased proportionally with character/syllable length, there should be no correlation. We found that the syllabic rate varies between 3 and 8 Hz, as previously reported (Figure 2E right; [5,58]). However, the more syllables there are in a word, the higher this rate (ρ = 0.676, p < 0.001). This increase was less strong for the character rate (ρ = 0.499, p < 0.001; Figure 2D right).

Figure 2. Word frequency modulates word duration. A) Examples of mono- and bi-syllabic words of different word frequencies, in brackets (van=from, zijn=be, snel=fast, stem=voice, hebben=have, eten=eating, volgend=next, toekomst=future). Text in the graph indicates the mean word duration. B) Relation between word frequency and duration. Darker colors indicate more values. C) Same as B), but separately for mono- and bi-syllabic words. D) Relation between number of characters and word duration. The longer the word, the longer the duration (left). The increase in word duration does not follow a fixed amount per character, as duration measured as a rate increases (right). E) Same as D), but for number of syllables. Red dots indicate the mean.
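For concreteness, the duration analyses above can be sketched as follows. This is a minimal illustration, not the authors' analysis code; the DataFrame `words` and its column names (`duration`, `word_freq`, `n_syllables`) are hypothetical stand-ins for the CGN-derived word table:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import spearmanr

# Hypothetical stand-in for the CGN-derived word table.
words = pd.DataFrame({
    "duration":    [0.21, 0.35, 0.48, 0.30, 0.55, 0.42],  # seconds
    "word_freq":   [50000, 1200, 150, 30000, 90, 400],    # corpus counts
    "n_syllables": [1, 1, 2, 1, 2, 2],
})
words["log_freq"] = np.log10(words["word_freq"])

# OLS regression: duration as a function of word frequency,
# with number of syllables as control.
base = smf.ols("duration ~ n_syllables + log_freq", data=words).fit()
print(base.params)

# Does an interaction term improve the model? (F-test, as reported.)
inter = smf.ols("duration ~ n_syllables * log_freq", data=words).fit()
print(inter.compare_f_test(base))  # (F, p, df_diff)

# Rate conversion: a 250 ms mono- vs. bi-syllabic word -> 4 vs. 8 Hz.
words["syll_rate"] = words["n_syllables"] / words["duration"]
print(spearmanr(words["n_syllables"], words["syll_rate"]))
```

The `compare_f_test` call mirrors the reported F-test for the interaction term, and the final correlation corresponds to the rate analysis of Figure 2E.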
These results show that the syllabic/character rate depends on the number of characters/syllables within a word and is not an independent temporal unit [38]. This effect is easy to explain when assuming that the prediction strength of an internal model influences word duration: transitional probabilities of syllables are simply more constrained within a word than across words [59]. This will reduce the time it takes to utter/perceive any syllable that occurs later in a word.

Temporal speech properties: word-by-word predictability predicts word onset differences

An internal language model carries predictions about upcoming linguistic representations, and possibly about which specific units, such as words, to expect next when listening to ongoing speech [21,24]. As such, it is also expected that word-by-word onset delays are shorter for words that fit the internal model (i.e., those that are expected; [54]). To investigate this possibility, we created a simplified version of an internal model predicting the next word using recurrent neural networks (RNNs). We trained an RNN to predict the next word from ongoing sentences (Figure 3A; a sketch of this architecture follows at the end of this section). The model consisted of an embedding layer (pretrained; github.com/coosto/dutch-word-embeddings), a recurrent layer with a tanh activation function, and a dense output layer with a softmax activation. To prevent overfitting, we added a 0.2 dropout to the recurrent layer and the output layer. An adam optimizer was used with a 0.001 learning rate and a batch size of 32. We only used sequences in which <unknown> was < 50% of the words. <unknown> was allowed in the input, but not in the output. Validation consisted of a randomly chosen 2% of the data.

The output of the RNN reflects a probability distribution in which the values of the RNN sum up to one and each word has its own predicted value (Figure 3A). As such, we can extract the predictability of the next word in ongoing sentences. We related this predictability, together with control predictors (including the syllable rate, duration, and word frequency of the previous word), to word-by-word onset differences. Many of the variables were skewed to the right; therefore, we transformed the data accordingly (see Table 1; results were robust to changes in these transformations).

All predictors except the word frequency of the previous word showed a significant effect (Table 1). The variance explained by word frequency was likely captured by the mean duration of the previous word.

Figure 3. RNN output influences word onset differences. A) Sequences of ten words were entered into an RNN in order to predict the content of the next word. Three examples are provided of input data with the label (bold word) and probability output for three different words. The regression model showed a relation between the duration of the last word in the sequence and the predictability of the next word, such that words were systematically shorter when the next word was more predictable according to the RNN output (illustrated here with the shortened black boxes). B) Regression line estimated at the mean value of word duration and bigram. C) Scatterplot of prediction and onset difference of data within ± 0. …
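A minimal Keras sketch of such a next-word RNN, under stated assumptions: the hidden size (128), the embedding dimension, and the loss function are not specified in the text and are illustrative guesses; only the layer types, the dropout (0.2), the optimizer (adam, learning rate 0.001), the batch size (32), and the validation fraction (2%) come from the description above:

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 3097  # 3096 unique corpus words plus <unknown>
embed_dim = 300    # assumed; fixed in practice by the pretrained word2vec embedding

model = keras.Sequential([
    # In the actual model the embedding weights were pretrained
    # (github.com/coosto/dutch-word-embeddings).
    layers.Embedding(vocab_size, embed_dim),
    layers.SimpleRNN(128, activation="tanh", dropout=0.2),  # recurrent layer, tanh
    layers.Dropout(0.2),                                    # dropout on the output layer
    layers.Dense(vocab_size, activation="softmax"),         # next-word distribution
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",  # assumed loss for next-word labels
)
# Training would look like:
# model.fit(contexts, next_word_ids, batch_size=32, validation_split=0.02)
```

The softmax output is the probability distribution over the vocabulary from which the predictability of the upcoming word is read out.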
Modeling speech tracking: STiMCON architecture

In order to investigate how much of these duration effects can be explained using an oscillator model, we created the model Speech Tracking in a Model Constrained Oscillatory Network (STiMCON). STiMCON in its current form is not exhaustive; however, it can quantify how well an oscillating network can cope with asynchronies by using its own internal model. Activation of the nodes at level l is governed by:

$A_{l,T} = C_{l-1\rightarrow l}A_{l-1,T} + C_{l+1\rightarrow l}A_{l+1,T} + os_T + I_{T_a}$  (1)

in which C represents the connectivity patterns between different hierarchical levels, T the time in a sentence, and $T_a$ the vector of times of the individual nodes entering an inhibition function (in milliseconds). The inhibition function is a gate function:

$I_{T_a} = \begin{cases} \text{BaseInhib} & \text{node below threshold} \\ \text{suprathreshold activation} & 0 \le T_a \le 20\text{ ms} \\ \text{increased inhibition decaying back to BaseInhib} & T_a > 20\text{ ms} \end{cases}$  (2)

in which BaseInhib is a constant for the base level of inhibition (negative value, set to -0.2). As such, nodes are inhibited by default; as soon as they are activated above threshold (activation threshold set at 1), $T_a$ is set to zero. The node then has suprathreshold activation, which after 20 milliseconds returns to increased inhibition until the base level of inhibition is reached again. The oscillation is a constant oscillator:

$os_T = A_m \cos(2\pi\omega T + \varphi)$  (3)

in which $A_m$ is the amplitude of the oscillator, ω the frequency, and φ the phase offset. As such, we assume a stable oscillator which is already aligned to the average speech rate (see [15,19] for phase-alignment models). The model used for the current simulations has an input layer (l-1 level) and a single layer of semantic word nodes (l level) that receives feedback from a higher-level layer (l+1 level). As such, only the word (l) level is modeled according to equations (1)-(3); the other levels form fixed input and feedback connection patterns.

Modeling speech tracking: language models influence time of activation

To illustrate how STiMCON can explain how processing time depends on the predictions of internal language models, we instantiated a language model that had only seen three sentences, containing five unique words, presented at different probabilities (/I eat cake/ at 0.5 probability, /I eat nice cake/ at 0.3 probability, /I eat very nice cake/ at 0.2 probability; Table 2). This language model serves as the feedback arriving from the l+1 level to the l level. The l level consists of five nodes that each represent one of the words and receive proportional feedback from l+1 according to Table 2, with a delay of 0.9 cycles (0.9/ω seconds), which then decays at 0.01 unit per millisecond and influences the l level at a proportion of 1.5. This feedback is only initiated when supra-threshold activation arrives due to l-1-level bottom-up input. Each word at the l-1-level input is modeled as a linear ramp to the individual nodes lasting 125 milliseconds (half a cycle, ranging from 0 to 1 arbitrary units). As such, the input is not the acoustic input itself but rather reflects a linear increase representing the increasing confidence that a word matches the specific node. φ is set such that the peak of a 4 Hz oscillation aligns with the peak of the sensory input of the first word. Sensory input is presented at a base stimulus onset asynchrony of 250 milliseconds (i.e., 4 Hz).

Table 2. This model has seen three sentences at different probabilities. Rows represent the prediction for the next word; e.g., /I/ predicts /eat/ at a probability of 1, but after /eat/ there is a wider distribution.

When we present this model with different sensory input at an isochronous rhythm of 4 Hz, it is evident that the timing at which different nodes reach threshold activation depends on the level of feedback that is provided (Figure 4). For example, while the /I/-node needs a while to get activated after the initial sensory input, the /eat/-node is activated earlier as it is pre-activated by feedback. After presenting /eat/, the feedback arrives at three different nodes, and the activation timing depends on the stimulus that is presented (earlier activation for /cake/ compared to /very/).
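The following is a minimal, self-contained sketch of these dynamics for a single word node; it is not the published implementation. The oscillator form, the base inhibition (-0.2), the threshold (1), the 125 ms input ramp, and the feedback proportion (1.5 times the Table 2 probabilities) follow the text; the amplitude, the supra-threshold value, and the post-activation inhibition course are simplified assumptions:

```python
import numpy as np

BASE_INHIB = -0.2    # base inhibition level (from the text)
THRESHOLD = 1.0      # activation threshold (from the text)
AM, FREQ = 1.0, 4.0  # oscillator amplitude (assumed) and frequency (Hz)
PHI = -np.pi         # phase offset: oscillation peaks at the input peak (125 ms)

def oscillation(t_ms):
    """Equation (3): constant oscillator."""
    return AM * np.cos(2 * np.pi * FREQ * t_ms / 1000.0 + PHI)

def inhibition(t_since_active_ms):
    """Equation (2), gate function: inhibited by default; transiently
    supra-threshold for 20 ms after activation; then extra inhibition
    (assumed -0.4) decaying back to the base level."""
    if t_since_active_ms is None:          # node has never been active
        return BASE_INHIB
    if t_since_active_ms <= 20:
        return THRESHOLD
    return min(BASE_INHIB, -0.4 + 0.001 * (t_since_active_ms - 20))

def first_crossing(feedback, onset_ms=0.0):
    """Time (ms) at which a node first crosses threshold for a 125 ms
    linear input ramp starting at onset_ms (equation (1), one node)."""
    for t in np.arange(onset_ms, onset_ms + 250.0):
        ramp = min(max(t - onset_ms, 0.0) / 125.0, 1.0)  # bottom-up input
        act = ramp + feedback + oscillation(t) + inhibition(None)
        if act >= THRESHOLD:
            return t
    return None

# Feedback = 1.5 * prediction probability (Table 2): /cake/ 0.5, /very/ 0.2.
print(first_crossing(1.5 * 0.5))  # strongly predicted -> crosses earlier
print(first_crossing(1.5 * 0.2))  # weakly predicted  -> crosses later
```

Running this sketch, the strongly predicted node crosses threshold earlier in the cycle than the weakly predicted one, which is the phase code the model exploits.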
Modeling speech tracking: time of presentation influences processing efficiency

To investigate how the time of presentation influences processing efficiency, we presented the model with /I eat XXX/, in which the last word was varied in content (either /I/, /very/, /nice/, or /cake/), intensity (linearly ranging from 0 to 1), and onset delay (ranging between -125 and +125 ms relative to isochronous presentation). We extracted the time at which the node matching the stimulus presentation first reached activation threshold (relative to stimulus onset, and relative to isochronous presentation).

Figure 5A shows the output. When there is no feedback (i.e., at the presentation of the first word, /I/), a classical efficiency map can be found in which processing is most optimal (possible at the lowest stimulus intensities) at isochronous presentation (in phase with the stimulus rate) and then drops off to either side. For nodes that have feedback, input processing is possible at earlier times relative to isochronous presentation and varies parametrically with prediction strength (earlier for /cake/ at 0.5 probability, then /very/ at 0.2 probability). Additionally, the activation function is asymmetric. This is a consequence of the interaction between the supra-activation caused by the feedback and the sensory input. As soon as supra-activation is reached due to the feedback, sensory input at any intensity will reach supra-activity (thus at early stages of the linearly increasing confidence of the input). This is why for the /very/ stimulus activation is still reached at later delays compared to /nice/ and /cake/, as the /very/-node reaches supra-activation due to feedback at a later time point.

Figure 5. Model output on processing efficiency and rhythmicity. A) Time of presentation influences efficiency. The outcome variable is the time at which the node reached threshold activation (supra-time). The dashed line is presented to ease comparison between the four content types. White indicates that threshold is never reached. B) Same as A), but estimated at a threshold of 0.53, showing that oscillations regulate feedforward timing. Panel A shows that the earlier the stimuli are presented (on a weaker point of the ongoing oscillation), the longer it takes until supra-threshold activation is reached. This panel shows that timing relative to the ongoing oscillation is regulated such that the stimulus activation timing is closer to isochronous. Line discontinuities are a consequence of stimuli never reaching threshold for a specific node. C) Strength of 4 Hz power depends on predictability in the stream. When predictability alternates between low and high, activation is more rhythmic when the predictable odd stimulus arrives earlier, and vice versa. D) Slice of C at intensities of 0.8 and 1.0. E) Magnitude spectra at three different odd-word offsets at 1.0 intensity. To illustrate the differences more clearly, the magnitude raised to the power of 20 is plotted.

When we investigate timing differences in stimulus presentation, it is important to also consider what this means for timing in the brain. Above, we showed that the amount of prediction can influence timing in our model. It is also evident that the earlier a stimulus was presented, the more time it took (relative to the stimulus) for the nodes to reach threshold (more yellow colors for earlier delays).
This is a consequence of the oscillation still being at a relatively low-excitability point at stimulus onset for stimuli that are presented early in the cycle. However, when we translate these activation-threshold timings to the timing of the ongoing oscillation, the variation is strongly reduced (Figure 5B). A stimulus timing that varies over 130 milliseconds (e.g., from -59 to +72 ms in the /cake/ line; excluding the non-linear section of the line) reaches the first supra-threshold response with only 19 milliseconds of variation in the model (a reduction from 53% to 8% of the cycle of the ongoing oscillation, i.e., a 1:6.9 ratio; a numerical illustration follows below). This means that within this model (and any oscillating model) the activation of nodes is robust to some timing variation in the environment. This effect seemed weaker when no prediction was present (for the /I/ stimulus this ratio was around 1:3.5; note that when determining the /cake/ range using the full line, the ratio would be 1:3.4).
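This temporal-filter effect can be reproduced with the toy node from the earlier sketch. This is an illustration with assumed parameters, so the exact compression ratio differs from the 1:6.9 reported for the full model:

```python
import numpy as np

AM, FREQ, PHI = 1.0, 4.0, -np.pi   # 4 Hz oscillator peaking at t = 125 ms
FEEDBACK = 0.75                    # assumed prediction-related pre-activation

def crossing_time(onset_ms):
    """Absolute time (ms) of the first threshold crossing for a 125 ms
    input ramp starting at onset_ms (same toy dynamics as above)."""
    for t in np.arange(onset_ms, onset_ms + 250.0):
        ramp = min(max(t - onset_ms, 0.0) / 125.0, 1.0)
        osc = AM * np.cos(2 * np.pi * FREQ * t / 1000.0 + PHI)
        if ramp + FEEDBACK + osc - 0.2 >= 1.0:   # -0.2 = base inhibition
            return t
    return np.nan

onsets = np.arange(-59.0, 73.0)                  # ~130 ms of onset variation
crossings = np.array([crossing_time(o) for o in onsets])
print(f"onset range: {onsets.max() - onsets.min():.0f} ms")
print(f"crossing-time range: {np.nanmax(crossings) - np.nanmin(crossings):.0f} ms")
```

The crossing times span a much narrower range than the stimulus onsets, because early stimuli wait for the rising flank of the oscillation while late stimuli are pushed through quickly.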
Modeling speech tracking: top-down interactions can provide rhythmic processing for non-isochronous stimulus input

The previous simulations demonstrate that oscillations provide a temporal filter and that processing itself can actually be closer to isochronous than what can be extracted from the stimulus input alone. Next, we investigated whether, depending on changes in top-down prediction, processing within the model becomes more or less rhythmic. To do this, we presented the model with stimulus input of 10 sequential words at a base rate of 4 Hz with constant (low, at 0, or high, at 0.8 predictability) or alternating word-to-word predictability. In the alternating conditions, word-to-word predictability alternated from low to high (words predicted at 0 or 0.8 predictability, respectively) or from high to low. For this simulation we used Gaussian sensory input (with a standard deviation of 42 ms, aligning the mean to the peak of the ongoing oscillation; see Supporting Figure 5 for the output with linear sensory input). Then, we varied the onset time of the odd words in the sequence (shifting from -100 up to +100 ms) and the stimulus intensity (from 0.2 to 1.5). We extracted the overall activity of the model and computed the Fast Fourier transform of the resulting time course (using a Hanning taper, only including data from 0.5-2.5 seconds to exclude the onset responses).

The first thing that is evident is that the model with no content predictions has overall stronger power, specifically around isochronous presentation (odd-word offset of 0 ms) at high stimulus intensities (Figure 5C-E). Adding overall high predictability reduces the power, but here too the power seems symmetric around zero. The spectra of the alternating-predictability conditions look different. For the low-to-high predictability condition, the curve seems to be shifted to the left, such that 4 Hz power is strongest when the predictable odd stimulus is shifted to an earlier time point (low-high condition). This is reversed for the high-low condition. At middle stimulus intensities there is a specific temporal window at which the 4 Hz power is particularly strong. This window is earlier for the low-high than for the high-low alternation (Figure 5D, Figure 5E, and Supporting Figure 6). The effect only occurs at specific middle intensities, as at high intensities the stimulus dominates the responses and at low intensities the stimulus does not reach threshold activation. These results show that even though the stimulus input is non-isochronous, the interaction with the internal model can still create a rhythmic structure in the brain (see [61,62]). Note that the direction in which the brain response is more rhythmic matches the natural onset delays in speech (shorter onset delays for more predictable stimuli).

Model validation: optimal sensitivity to natural temporal speech properties

Next, we aimed to investigate whether STiMCON would be optimally sensitive to the speech input timings found naturally in speech. Therefore, we fitted STiMCON's expected word-to-word onset differences to the word-to-word onset differences we found in the CGN. At a stable level of input intensity and inhibition, the only aspect that changes the timing of the interaction between top-down predictions and bottom-up input within STiMCON is the ongoing oscillation.
Considering that we only want to model, for individual words, how much the prediction ($C_{l+1\rightarrow l}A_{l+1,T}$) influences the expected timing, we can set the contribution of the other factors in equation (1) to zero, leaving the relative contribution of the prediction:

$A_{l,T} = C_{l+1\rightarrow l}A_{l+1,T} + A_m\cos(2\pi\omega T + \varphi)$  (4)

We can solve this formula in order to investigate the expected relative time shift (T) in processing that is a consequence of the strength of the prediction (ignoring that the exact timing will also depend on the strength of the input and the inhibition), taking the threshold crossing on the rising flank of the oscillation:

$T = \dfrac{-\varphi - \cos^{-1}\!\big((A_{l,T} - C_{l+1\rightarrow l}A_{l+1,T})/A_m\big)}{2\pi\omega}$  (5)

ω was set to the syllable rate of each sentence; $A_m$ and φ were systematically varied. We fitted a linear model between STiMCON's expected time and the actual word-to-word onset differences. This model was similar to the model described in the section Word-by-word predictability predicts word onset differences and included the predictors syllable rate and duration of the previous word. However, as we were interested in how well non-transformed data match the natural onset timings, we did not perform any normalization besides equation (5). … The fit showed that STiMCON's expected time durations match the actual word-by-word durations. This was even more strongly so for specific oscillatory alignments (around a -0.25π offset), suggesting that an optimal alignment phase relative to the ongoing oscillation is needed for optimal tracking [3,8]. Interestingly, the optimal transformation seemed to automatically alter a highly skewed prediction distribution (Figure 6B) towards a more normal distribution of relative time shifts (Figure 6C).

Model validation: temporal speech illusions

… (analogous to Table 2). The model consists of four nodes (N1, N2, Nda, and Nga), in which N1 activation predicts a second unspecific stimulus (S2), represented by N2, at a predictability of 1. N2 activation predicts either /da/ or /ga/, at 0.2 and 0.1 probability, respectively. We then present STiMCON (same parameters as before) with /S1 S2 XXX/. XXX is varied to have different proportions of the stimuli /da/ and /ga/ (ranging from 0% to 100% /da/ in 12 steps; these reflect relative proportions that sum to 1, such that at 30% the intensity of /da/ would be at most 0.3 and that of /ga/ 0.7), and its onset is varied relative to the second-to-last word. We extract the time at which a node reaches suprathreshold activity after stimulus onset. If both nodes were active at the same time, the node with the highest total activation was chosen. Results showed that for some ambiguous stimuli the delay determines which node is activated first, modulating the ultimate percept of the participant (Figure 7A; also see Supplementary Figure 7A). The same type of simulation can explain how speech rate can influence perception (Supplementary Figure 7B; but see [47]).

To further scrutinize this effect, we fitted our model to the behavioral data of Ten Oever & Sack [31]. As we used an iterative approach in the simulations of the model, we optimized the model using a grid search (a sketch follows below). We varied the proportion of the stimulus being /da/ or /ga/ (ranging from 10% to 80% in steps of 20%), the onset time of the feedback (0.1 to 1.0 cycle in steps of 0.2), the speed of the feedback decay (0 to 0.1 in steps of 0.02), and a temporal offset of the final sound to account for the time it takes to interpret a specific ambiguous syllable (ranging from -0.05 to 0.05 s in steps of 0.02).
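A schematic of this grid-search fit. The STiMCON simulation itself is stubbed out here by a toy percept function (`simulate_percept` and its internals are placeholders, as are the behavioral data), so only the grid construction and the selection by explained variance mirror the procedure described above:

```python
import itertools
import numpy as np

def simulate_percept(delay, da_prop, fb_onset, fb_decay, offset):
    """Placeholder for a full STiMCON run: returns the model percept for
    one stimulus delay (1 = /da/, 0 = /ga/, 0.5 = ambiguous)."""
    bias = (da_prop - 0.5) + 0.3 * np.cos(
        2 * np.pi * 6.25 * (delay + offset) - 2 * np.pi * fb_onset) - fb_decay
    return 1.0 if bias > 0.05 else (0.0 if bias < -0.05 else 0.5)

delays = np.arange(0.0, 0.32, 0.02)                               # probed delays (s)
behavior = (np.cos(2 * np.pi * 6.25 * delays) > 0).astype(float)  # fake data stand-in

# Parameter grid as described in the text (start:step:end ranges).
grid = itertools.product(
    np.arange(0.10, 0.81, 0.20),    # proportion /da/
    np.arange(0.1, 1.01, 0.2),      # feedback onset (cycles)
    np.arange(0.0, 0.101, 0.02),    # feedback decay
    np.arange(-0.05, 0.051, 0.02),  # temporal offset (s)
)
best_r2, best_params = -np.inf, None
for params in grid:
    pred = np.array([simulate_percept(d, *params) for d in delays])
    ss_res = np.sum((behavior - pred) ** 2)
    ss_tot = np.sum((behavior - behavior.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    if r2 > best_r2:
        best_r2, best_params = r2, params
print(best_r2, best_params)
```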
Our outcome variable was the node that showed the first suprathreshold activation (Nda = 1, Nga = 0). If both nodes were active at the same time, the node with the highest total activation was chosen. If both nodes had equal activation or never reached threshold activation, we coded the outcome as 0.5 (i.e., fully ambiguous). These outcomes were fitted to the behavioral data of the 6.25 Hz and 10 Hz presentation rates (the two rates showing a significant modulation of the percept). These data were normalized to a range between 0 and 1 to account for the model outcomes being discrete (0, 0.5, or 1).

We found that our model could fit the data at an average explained variance of 43% (30% and 58% for 6.25 Hz and 10 Hz, respectively; Figure 7B+C). This explained variance was higher than that of the original sine fit (40% for a three-parameter sine fit [amplitude, phase offset, and mean]). Note that our fit cannot account for variance ranging in between 0-0.5 and 0.5-1, while the sine fit can. If we correct for this (by setting the sine fit to the closest of 0, 0.5, or 1 and performing a grid search to optimize the fit), the average fit of the sine is 21%. The average AICs of the model and sine fits are -27.0 and -24.1, respectively, suggesting that the STiMCON model has the better fit. Thus, STiMCON does better than a fixed-frequency sine fit. This is a likely consequence of the sine fit not being able to explain the dampening of the oscillation at later delays (i.e., the perception bias is stronger for shorter compared to longer delays).

Finally, we investigated the relevance of the three key features of our model for this fit: inhibition, feedback, and oscillations. We repeated the grid-search fit but set either the inhibition to zero, the feedback matrix equal for both /da/ and /ga/ (both 0.15), or the oscillation to an amplitude of zero. Results showed that especially the oscillation and the differential feedback were essential to reach a good fit (Figure 7B). Without the oscillation the model could not even fit better than the mean of the data (R² < 0). Removing the inhibition had the least influence on the fit. This suggests that all features (to a lesser extent the inhibition) are required to model the data, and that oscillatory tracking depends on linguistic constraints flowing from the internal language model.
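To close the loop on equation (5), a worked sketch of the expected time shift as a function of prediction strength. The threshold value of 1 and the rising-flank (negative arccos) branch follow the reconstruction of equation (5) above, and the parameter values are illustrative:

```python
import numpy as np

def expected_shift(prediction, am=1.0, phi=0.0, omega=4.0, threshold=1.0):
    """Equation (5): relative time shift (s) at which prediction plus
    oscillation reach threshold, on the rising flank of the cosine."""
    arg = np.clip((threshold - prediction) / am, -1.0, 1.0)
    return (-phi - np.arccos(arg)) / (2 * np.pi * omega)

# Stronger predictions (C_{l+1->l} * A_{l+1,T}) yield earlier expected onsets,
# mirroring the shorter onset delays of predictable words in the CGN.
for p in (0.0, 0.3, 0.75):
    print(f"prediction {p:.2f}: shift {1000 * expected_shift(p):+.0f} ms")
```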

Discussion

In the current paper, we combined an oscillatory model with a proxy for linguistic knowledge, an internal language model, in order to investigate the model's processing capacity for onset timing differences in natural speech. We show that word-to-word speech onset differences in natural speech are indeed related to predictions flowing from the internal language model (estimated through an RNN). Fixed oscillations aligned to the mean speech rate are robust against natural temporal variations and even optimized for temporal variations that match the predictions flowing from the internal model. Strikingly, when the pseudo-rhythmicity in speech matches the predictions of the internal model, responses are more rhythmic for matched pseudo-rhythmic than for isochronous speech input. Our model is optimally sensitive to natural speech variations, can explain phase-dependent speech categorization behavior [31,35,44,63], and naturally comprises a neural phase code [40,42,43]. These results show that part of the pseudo-rhythmicity of speech is expected by the brain, and that the brain is even optimized to process speech in this manner, but only when it follows the internal model.

Speech timing is variable, and in order to understand how the brain tracks this pseudo-rhythmic signal we need a better understanding of how this variability arises. Here, we isolated one of the components explaining speech time variation, namely, the constraints that are posed by an internal language model. This goes beyond extracting the average speech rate [5,19,58], and might be key to understanding how a predictive brain uses temporal cues. We show that speech timing depends on the predictions made by an internal language model, even when those predictions are reduced to something as simple as word predictability. While syllables generally follow a theta rhythm, there is a systematic increase in syllabic rate as soon as more syllables are in a word. This is likely a consequence of the higher cloze probability of syllables within a word, which reduces the onset differences of the later-uttered syllables [59]. However, an oscillatory model constrained by an internal language model is not merely sensitive to these temporal variations; it is actually capable of processing them optimally.

The oscillatory model we pose here has three components: oscillations, feedback, and inhibition. The oscillations allow for the parsing of speech and provide windows in which information is processed [3,39,64,65]. Importantly, the oscillation acts as a temporal filter, such that the activation time of any incoming signal will be confined to the highly excitable window and thereby is relatively robust against small temporal variations (Figure 5B). The feedback allows for differential activation times dependent on the sensory input (Figure 5A). As a consequence, the model is more sensitive to more predictable speech input, which is therefore active earlier in the duty cycle (this also means that oscillations are less robust against temporal variations when the feedback is very strong). The inhibition allows the network to be more sensitive to less predictable speech units when they arrive later (the more predictable nodes get inhibited at some point in the oscillation; best illustrated by the simulation in Figure 7A). However, adding inhibition only slightly improved the modeling fit (Figure 7B).
In this way, speech is ordered along the duty cycle according to its predictability [43,66]. The feedback in combination with an oscillatory model can explain speech-rate and phase-dependent content effects. Moreover, it constitutes an automatic temporal code that can use time of activation as a cue for content [42]. The three components of the model are common brain mechanisms [29,42,67-70] and follow many previously proposed organizational principles (e.g., temporal coding and parsing of information). While we implement these components on an abstract level (not veridical to the exact parameters of neuronal interactions), they illustrate how oscillations, feedback, and inhibition interact to optimize sensitivity to natural pseudo-rhythmic speech.

The current model is not exhaustive and does not provide a complete explanation of all the details of speech processing in the brain. For example, it is likely that the primary auditory cortex is still mostly modulated by the acoustic pseudo-rhythmic input and that only later brain areas follow more closely the constraints posed by the language model of the brain. Therefore, more hierarchical levels need to be added to the current model (this is possible following equation (1)). Moreover, the current model does not allow for phase or frequency shifts. This was intentional, in order to investigate how much a fixed oscillator could explain. We show that onset times matching the predictions from the internal model can be explained by a fixed oscillator processing pseudo-rhythmic input. However, when the onset timings do not match the internal model, phase and/or frequency shifts are still required and need to be incorporated (see, e.g., [15,19]). Still, any coupling between brain oscillations and speech acoustics [19] needs to be extended with the coupling of brain oscillations to brain activity patterns of internal models [71].

In the current paper we use an RNN to represent the internal model of the brain. However, it is unlikely that the RNN captures the full complexity of the language model in the brain. The decades-long debate about the origin of a language model in the brain remains ongoing and controversial. Utilizing the RNN as a proxy for our internal language model makes the tacit assumption that language is fundamentally statistical or associative in nature, and does not posit the derivation or generation of knowledge of grammar from the input [72,73]. In contrast, our brain could just as well store knowledge of language that functions as fundamental interpretation principles to guide our understanding of language input [21,24,53,65,74]. Knowledge of language and linguistic structure could be acquired through an internal self-supervised comparison process.

Figure 8. Predictions of the model. A) Acoustic signals will be more rhythmic when a producer has a weak versus a strong internal model (top right). When the producer's strong model matches the receiver's model, the brain response will be more rhythmic for less rhythmic acoustic input. B) When a producer realizes the model of the receiver is weak, it might transform its model, and thereby its speech timing, to match the receiver's expectations.

… When modifying the time of presentation with a neutral entrainer (summed sinusoids with random phase), no obvious phase effect was reported [47].
A second null result relates to a study in which participants were specifically instructed to maintain a specific percept in different blocks, which likely increases the pre-activation and thereby the phase [78]. Future studies need to investigate the use of temporal/phase codes to disambiguate speech input and should specifically use predictions in their design.

The temporal dynamics of speech signals need to be integrated with the temporal dynamics of brain signals. However, it is unnecessary (and unlikely) that the exact duration of speech matches the exact duration of brain processes. Temporal expansion or compression of stimulus inputs occurs regularly in the brain [79,80], and this temporal morphing also maps onto duration [81-83] or order illusions [84]. Our model predicts increased rhythmic responses for non-isochronous speech matching the internal model. The perceived rhythmicity of speech could therefore also be an illusion generated by a rhythmic brain signal somewhere in the brain. … model is lacking either on the producer's or on the receiver's side, respectively. Infant-directed speech also illustrates that a producer might proactively adapt its speech rhythm to the expectations
of the internal model of the receiver, to align better with the predictions from the receiver's model (Figure 8B; similar to when you are speaking to somebody who is just learning a new language). Other examples in which speech is more isochronous are poems, emotional conversation [87], and noisy situations [88]. While speculative, it is conceivable that in these circumstances one puts more weight on a different level of the hierarchy than the internal linguistic model. In the case of poems and emotional conversation, an emotional route might get more weight in processing. In the case of noisy situations, stimulus input has to pass the first hierarchical level of the primary auditory cortex, which effectively gets more weight than the internal model.

Table 3. Predictions from the current model. The theoretical account provides various predictions, which are listed in this table.
- The more predictable a word, the earlier this word is uttered.
- When there is a flat constraint distribution over an utterance (e.g., when probabilities are uniform over the utterance), the acoustics of speech should naturally be more rhythmic (Figure 8A).
- If speech timing matches the internal language model, brain responses should be more rhythmic even if the acoustics are not (Figure 8A).
- The more similar the internal language models of two speakers, the more effective they are in 'entraining' each other's brain.
- If speakers suspect their listener to have a flatter constraint distribution than themselves (e.g., the environment is noisy, or the speakers are in a second-language context), they adjust to the distribution by speaking more rhythmically (Figure 8B).
- One adjusts the weight of the constraint distribution to a hierarchical level when needed. For example, when there is noise, participants adjust to the rhythm of the primary auditory cortex instead of higher-order language models. As a consequence, they speak more rhythmically.

Conclusions
We argued that pseudo-rhythmicity in speech is in part a consequence of top-down predictions flowing from an internal model of language. This pseudo-rhythmicity is created by the speaker and expected by the receiver if they have overlapping internal language models. Oscillatory tracking of this signal need not be hampered by the pseudo-rhythmicity, but can use temporal variations as a cue to extract content information, since the phase of activation parametrically relates to the likelihood of an input relative to the internal model. Brain responses can even be more rhythmic for pseudo-rhythmic compared to isochronous speech if the input follows the temporal delays imposed by the internal model. This account provides various testable predictions, which we list in Table 3 and Figure 8. We believe that by integrating neuroscientific explanations of speech tracking with linguistic models of language processing [21,24], we can better explain temporal speech dynamics. This will ultimately aid our understanding of language in the brain and provide a means to improve temporal properties in speech synthesis.