Expectations about word stress modulate neural activity in speech-sensitive cortical areas

A recent dual-stream model of language processing proposed that the postero-dorsal stream performs predictive sequential processing of linguistic information via hierarchically organized internal models. However, it remains unexplored whether the prosodic segmentation of linguistic information involves predictive processes. Here, we addressed this question by investigating the processing of word stress, a major component of speech segmentation, using probabilistic repetition suppression (RS) modulation as a marker of predictive processing. In an event-related acoustic fMRI RS paradigm, we presented pairs of pseudowords having the same (Rep) or different (Alt) stress patterns, in blocks with varying Rep and Alt trial probabilities. We found that the BOLD signal was significantly lower for Rep than for Alt trials, indicating RS in the posterior and middle superior temporal gyrus (STG) bilaterally, and in the anterior STG in the left hemisphere. Importantly, the magnitude of RS was modulated by repetition probability in the posterior and middle STG. These results reveal the predictive processing of word stress in the STG areas and raise the possibility that word stress processing is related to the dorsal "where" auditory stream.


Introduction
The human brain is best viewed as an inference machine, actively predicting and explaining its sensations through internal representations that model the dynamic sensory context (Friston, 2010). One human-specific cognitive faculty in which predictive processing may be especially important is linguistic communication (Donhauser and Baillet, 2019; Kuperberg and Jaeger, 2016; Lau et al., 2016; Willems et al., 2016). Predictive inference has been specifically integrated into a recent neurobiological model of language processing (Bornkessel-Schlesewsky et al., 2015; Bornkessel-Schlesewsky and Schlesewsky, 2013). Like previous models (Friederici, 2011; Hickok and Poeppel, 2007; Rauschecker and Scott, 2009; Saur et al., 2008; Scott et al., 2000), this model proposes a dual auditory stream network involving ventral and dorsal streams, but it assigns somewhat different functions to these streams. The antero-ventral or "what" stream of the linguistic network (including the primary auditory cortex, the anterior superior temporal cortex, and anterior and ventral parts of the inferior frontal cortex) is thought to be responsible for the recognition of linguistic elements in an order-insensitive way, while the postero-dorsal or "where" stream (including the primary auditory cortex, the posterior superior temporal cortex, the inferior parietal lobule, the premotor cortex, and posterior and dorsal parts of the inferior frontal cortex) performs predictive sequential processing of linguistic information in successively larger temporal windows related to different linguistic levels (sounds, words, sentences, discourse). This predictive sequential processing is suggested to be based on hierarchically organized internal models, corresponding to temporal receptive windows that allow the processing of linguistic information at different time scales.
The model suggests that one of the dorsal stream's functions is the prosodic segmentation of input. Prosody, the melodic and rhythmic aspect of speech (Cutler et al., 1997), contributes to speech understanding at different levels: at the sentence level, intonation modifies the interpretation of the sentence (Friederici et al., 2007; Männel and Friederici, 2011; Sammler et al., 2015; Steinhauer et al., 1999; van der Burght et al., 2019), while at the word level, word stress plays a major role in the segmentation of continuous speech input into words (Cutler and Norris, 1988; Mattys et al., 2005; Norris et al., 1995; van Donselaar et al., 2005).
In accordance with the dual auditory stream model, previous research has provided evidence that the postero-dorsal stream contributes to the prosodic segmentation of linguistic input. In particular, intonation and discourse processing elicited increased BOLD responses in the posterior superior temporal gyrus (STG) and inferior frontal gyrus (IFG) (Geiser et al., 2008; Inspector et al., 2013; Ischebeck et al., 2008; Kandylaki et al., 2016; Meyer et al., 2004; Sammler et al., 2015, 2018). Furthermore, word stress processing has been associated with activations in the STG/superior temporal sulcus (STS), together with other areas such as the IFG, the supplementary motor area (SMA), and areas in the parietal (angular gyrus, superior parietal gyrus, parietal lobule) and frontal lobes (precentral, postcentral, and middle frontal gyrus), most of which can be considered part of the dorsal stream (Aleman et al., 2005; Domahs et al., 2013; Heisterueber et al., 2014; Kandylaki et al., 2017; Klein et al., 2011).
It remains an open question whether predictive processes are involved in the prosodic segmentation of linguistic input, and specifically in word stress processing. Previous studies indicated that the processing of word prominence at the sentence level was guided by the acoustic and lexical predictability of words (Kakouros and Räsänen, 2016; Magne et al., 2005). In the short term, the perception of prominence was modified by preceding prosodic exposure (Kakouros et al., 2018). Moreover, ERP evidence suggested a role of long-term expectations in the processing of stress at the word level (Honbolygó and Csépe, 2013). However, a direct investigation of the cortical regions involved in the predictive processing of word stress is still missing.
To address this question, we used a possible neural marker of prediction, the probabilistic modulation of the fMRI repetition suppression (RS) effect (Summerfield et al., 2008). fMRI RS refers to reduced BOLD responses to repeated sensory stimuli (Henson and Rugg, 2003; Grill-Spector et al., 2006). The neural basis of RS is still debated (Kovács and Schweinberger, 2016); the most widely accepted explanation is provided by predictive theories, according to which RS reflects reduced prediction error in a Bayesian multi-stage model of cortical function (Auksztulewicz and Friston, 2016; Friston, 2005, 2010; Rao and Ballard, 1999; Summerfield et al., 2008). Indeed, there is increasing neuroimaging evidence that higher-order contextual expectations modulate the magnitude of RS for visual (Grotheer and Kovács, 2014, 2016; Kovács et al., 2013; Larsson and Smith, 2012; Mayrhauser et al., 2014) as well as for acoustic stimuli (Andics et al., 2013a, 2013b; Todorovic et al., 2011; Todorovic and de Lange, 2012).
Based on these results, we hypothesized that cortical regions involved in the processing of speech stimuli might be the primary locus of RS effects related to the processing of word stress information (H1). We further assumed that word stress is encoded by predictive mechanisms. Therefore, we expected the RS evoked by the repetition of the legal stress pattern to be modulated by the predictability of stress violations in areas related to the dorsal auditory stream (H2).
To test these hypotheses, we followed the paradigm of Summerfield et al. (2008), which has since been widely used with visual and acoustic stimuli (for a review, see Grotheer and Kovács, 2016). Briefly, we embedded pseudoword pairs with repeated and alternating word stress in longer blocks in which repetitions were either likely (and thus expected) or rare (and thus surprising). We measured RS by comparing the repeated and alternating pseudoword pairs. Stimuli in the repeated pairs had the same stress pattern, i.e., stress on the first syllable, which is the only existing word stress pattern in Hungarian. In the alternating pairs, the two stimuli differed only in the position of stress: stress was on the first syllable for the first stimulus (legal stress pattern) and on the second syllable for the second stimulus (illegal stress pattern). We chose this design because in our previous ERP study (Honbolygó and Csépe, 2013), only the illegal stress pattern elicited the Mismatch Negativity (MMN) component when presented as a deviant. The legal stress pattern did not elicit an MMN as a deviant, arguably because it did not violate predictions about the native stress pattern. Accordingly, to address the primary question of the present study, i.e., the predictive processing of word stress, we focused on conditions in which the illegal stress pattern in the alternating pairs violated the prediction established by the legal stress pattern in the repeated pairs.

Participants
Twenty-three healthy adults took part in the experiment. Three of them were excluded from further analysis: one because the overall hit rate was lower than 80% and the overall false alarm rate was higher than 10% in the behavioral task, and two because of excessive head movement during MRI scanning. Therefore, 20 participants remained in the final sample (13 females; all right-handed; mean age = 28.6 years, SD = 6.1 years; mean years of education = 18.4, SD = 2.1). Note that the final sample size varies between 15 and 20 in the fMRI data analyses, depending on the number of regions of interest (ROIs) successfully identified in the functional localizer runs (see sections fMRI data analysis and fMRI results). All participants were native speakers of Hungarian, had normal or corrected-to-normal vision, and reported normal hearing. None of them reported a history of any neurological and/or psychiatric condition. All participants provided written informed consent before enrolment and received no compensation for taking part in the experiment. The study was approved by the Ethical Board of the Medical Research Council, Hungary, and was conducted in accordance with the Declaration of Helsinki.
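The behavioral exclusion check above can be expressed in a few lines. The Python sketch below is illustrative only. It assumes that hits are counted over target trials and false alarms over non-target trials (96 and 384 per session, derived from the block design of 4 blocks with 24 target and 96 non-target trials each). It also treats the two conditions reported for the excluded participant as one joint criterion, which is an interpretation rather than a stated rule. The function names and the example counts are hypothetical.

```python
# Per-session counts derived from the design: 4 blocks x 24 target trials and
# 4 blocks x 96 non-target trials (assumed denominators for the two rates).
N_TARGETS = 4 * 24
N_NONTARGETS = 4 * 96


def detection_rates(hits, false_alarms, n_targets=N_TARGETS, n_nontargets=N_NONTARGETS):
    """Hit rate over target trials and false alarm rate over non-target trials."""
    return hits / n_targets, false_alarms / n_nontargets


def is_excluded(hit_rate, fa_rate, min_hit=0.80, max_fa=0.10):
    """Assumed joint criterion: hit rate below 80% AND false alarm rate above 10%."""
    return hit_rate < min_hit and fa_rate > max_fa


# Hypothetical participant: 87 hits and 19 false alarms.
hit_rate, fa_rate = detection_rates(87, 19)
print(f"hits {hit_rate:.1%}, false alarms {fa_rate:.1%}, "
      f"excluded: {is_excluded(hit_rate, fa_rate)}")
```

With these made-up counts, the participant has a hit rate of about 90.6% and a false alarm rate of about 4.9%, so they would be retained.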

Experimental task
For the main experimental task, we used disyllabic pseudowords as auditory stimuli, uttered with legal (stress on the first syllable) and illegal (stress on the second syllable) stress patterns according to the stress assignment rules of Hungarian (Siptár and Törkenczy, 2007). Hungarian is ideal for studying the predictive mechanisms of word stress processing because it is a fixed-stress language (Siptár and Törkenczy, 2007). This means that, in contrast to a variable-stress language such as English, every disyllabic word has the same stress pattern without exception, i.e., stress always falls on the first syllable. Therefore, Hungarian speakers can be expected to be especially sensitive to any violation of this highly regular stress pattern. As shown previously, Hungarian speakers detect stress pattern violations pre-attentively in both meaningful words (Garami et al., 2017; Honbolygó et al., 2004) and meaningless pseudowords (Honbolygó and Csépe, 2013).
All pseudowords had a consonant-vowel-consonant-vowel structure (e.g., /bidi/, /divi/, /sipi/, /tiki/) and contained the same vowel /i/ (pronounced like the "e" in the word "me") to ease clear pronunciation. From all possible combinations of Hungarian consonants and the vowel /i/, 47 pseudowords were selected, excluding meaningful words and pseudowords that sounded odd. The average length of the pseudowords was 594 ms (SD = 85 ms; range: 410–863 ms).
To avoid potential confounds of acoustic features on the experimental effects, we created trial-unique auditory stimuli, randomized across participants. To obtain trial-uniqueness, we recorded each of the 47 pseudowords from 4 different female speakers (native Hungarian speech therapists and/or linguists trained to produce the required stress patterns). Speakers naturally produced both the legal and illegal stress patterns, and no post-processing was applied to artificially enhance the difference between them. After checking all recorded tokens, we selected 40 legal-stressed and 40 illegal-stressed pseudowords from each speaker for the experiment. These were the 40 best stimuli in terms of intelligibility and clarity of the stress pattern, as judged by two of the authors (F.H., A.K.) and another colleague (B.Cs.). Next, we manipulated the acoustic parameters of the stimuli using the Praat software (Boersma and Weenink, 2007): we shifted the overall fundamental frequency (f0) of each stimulus to 90%, 100%, and 110% of the original. This technique, previously applied by Dupoux et al. (2001) and by our group (Honbolygó et al., 2019), increases the acoustic variability of the stimuli. Consequently, we obtained 480 (40 stimuli × 4 speakers × 3 f0 shifts) legal-stressed pseudowords and 480 illegal-stressed pseudowords.
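The stimulus counts above follow from fully crossing four factors: stress pattern, speaker, recorded token, and f0 shift. The minimal Python sketch below builds that crossing. The speaker and token labels are hypothetical placeholders, because the source reports only the counts.

```python
from itertools import product

# Counts from the Stimuli description; the labels themselves are hypothetical.
N_TOKENS_PER_SPEAKER = 40
SPEAKERS = ["speaker1", "speaker2", "speaker3", "speaker4"]
F0_SHIFTS = [0.90, 1.00, 1.10]          # proportion of the original f0
STRESS_PATTERNS = ["legal", "illegal"]  # stress on syllable 1 vs. syllable 2


def build_stimulus_set():
    """Return one record per (stress, speaker, token, f0 shift) combination."""
    return [
        {"stress": stress, "speaker": spk, "token": tok, "f0_shift": shift}
        for stress, spk, tok, shift in product(
            STRESS_PATTERNS, SPEAKERS, range(N_TOKENS_PER_SPEAKER), F0_SHIFTS
        )
    ]


stimuli = build_stimulus_set()
legal = [s for s in stimuli if s["stress"] == "legal"]
illegal = [s for s in stimuli if s["stress"] == "illegal"]
print(len(legal), len(illegal))  # 480 480
```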
We also created target stimuli by shifting the f0 of the original stimuli to 110%, 120%, and 130% of the original (i.e., each target's f0 exceeded that of the corresponding shifted non-target stimulus by 20% of the original f0). Targets served to maintain participants' attention and were not included in the analysis. The target stimuli required a perceptual difference between the members of a pair that was easy to detect and distinct from the difference under investigation (i.e., the stress difference). Fundamental frequency is one of the most prominent and easily detectable acoustic features of speech, which is why we chose to manipulate it.
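A short calculation shows how non-target and target f0 values relate. The sketch below uses the shift factors given above and assumes a hypothetical original f0 of 200 Hz for a single recording. In absolute terms, each target is raised by 20% of the original f0. Relative to its shifted non-target counterpart, this amounts to a ratio between roughly 1.18 and 1.22.

```python
# Shift factors from the Stimuli section, as proportions of a recording's original f0.
NONTARGET_SHIFTS = [0.90, 1.00, 1.10]
TARGET_SHIFTS = [1.10, 1.20, 1.30]


def target_relations(original_f0_hz):
    """Return (non-target Hz, target Hz, target/non-target ratio) for each shift pair."""
    rows = []
    for base, target in zip(NONTARGET_SHIFTS, TARGET_SHIFTS):
        nontarget_hz = original_f0_hz * base
        target_hz = original_f0_hz * target
        rows.append((nontarget_hz, target_hz, target_hz / nontarget_hz))
    return rows


# Hypothetical original f0 of 200 Hz (not a value reported in the study).
for nontarget_hz, target_hz, ratio in target_relations(200.0):
    print(f"{nontarget_hz:.0f} Hz -> {target_hz:.0f} Hz (ratio {ratio:.3f})")
```

For this example, the output pairs are 180 → 220 Hz (1.222), 200 → 240 Hz (1.200), and 220 → 260 Hz (1.182).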
Finally, we equalized the loudness of all stimuli using RMS (root mean square) normalization and applied rise and fall amplitude ramps to the onset and offset of each sound to avoid audible clicks. The acoustic characteristics of the recorded stimuli are summarized in Table 1.
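RMS normalization and onset/offset ramps are simple to express in code. The sketch below uses only the Python standard library, whereas the study itself used Praat. The target RMS value, the 10-ms ramp duration, the linear ramp shape, and the test tone are all assumptions for illustration, since the source reports none of them.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def rms_normalize(samples, target_rms):
    """Scale samples so that their RMS equals target_rms."""
    current = rms(samples)
    if current == 0:
        raise ValueError("cannot normalize a silent signal")
    gain = target_rms / current
    return [x * gain for x in samples]


def apply_ramps(samples, sample_rate, ramp_ms=10.0):
    """Apply linear rise and fall ramps of ramp_ms (assumed value) to both ends."""
    n_ramp = min(int(round(sample_rate * ramp_ms / 1000.0)), len(samples) // 2)
    out = list(samples)
    for i in range(n_ramp):
        weight = i / n_ramp      # 0 at the outermost sample, rising toward 1
        out[i] *= weight         # onset ramp
        out[-1 - i] *= weight    # offset ramp
    return out


# Example: a 600-ms, 1-kHz test tone sampled at 44.1 kHz (hypothetical signal).
fs = 44100
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs * 600 // 1000)]
processed = apply_ramps(rms_normalize(tone, target_rms=0.05), fs)
```

Because the ramps are applied after normalization, the final RMS is very slightly below the target. This ordering follows the order of the steps described above.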

Functional localizer
For the independent functional localizer scans, we created four 15-s speech segments, each consisting of a sequence of disyllabic pseudowords. The pseudowords conformed to the phonotactic rules of Hungarian and were uttered by two male speakers. Each segment consisted of 15–16 pseudowords randomly selected from both speakers, with a 200–300 ms pause between successive pseudowords. From the four original speech segments, we created two types of distorted, unintelligible segments that served as baseline conditions: signal correlated noise (SCN) and spectrally rotated speech (SRSP). Both manipulations were performed using Praat scripts (the SCN script was written by Matt Davis, MRC Cognition and Brain Sciences Unit; the SRSP script by Holger Mitterer, University of Malta). The SCN was created by extracting the amplitude envelope of the original recordings and applying it to noise obtained by randomizing the phase spectrum of the original recording, i.e., to a pink noise with the same spectral profile as the recording. This resulted in amplitude-modulated, noise-like stimuli that retained the temporal characteristics of speech but lost its spectral detail, making them unintelligible and completely dissimilar to speech. The SRSP was created by inverting the spectrum of the original recordings within the band up to 3600 Hz, so that low-frequency spectro-temporal information became high-frequency information and vice versa. This resulted in alien-like speech with temporal and spectral complexity very similar to the original speech, but unintelligible (see also Scott et al., 2000).
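The two baseline manipulations can be illustrated with FFT-based operations in NumPy. This is a sketch under stated assumptions, not a reproduction of the Praat scripts the study used. Spectral rotation is implemented by reversing the spectral bins up to `f_max` (3600 Hz, from the text), which maps a frequency f to about 3600 − f. SCN is approximated as phase-randomized noise scaled by a moving-average envelope of |x|. The window length and all function names are hypothetical.

```python
import numpy as np


def spectrally_rotate(x, fs, f_max=3600.0):
    """Mirror the 0..f_max band of x so that frequency f maps to about f_max - f.

    Energy above f_max is discarded. FFT-based illustration only; the study used
    a Praat script whose exact algorithm is not described here.
    """
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = freqs <= f_max
    rotated = np.zeros_like(spectrum)
    rotated[band] = spectrum[band][::-1]     # reverse the order of in-band bins
    return np.fft.irfft(rotated, n=len(x))


def phase_randomize(x, rng):
    """Noise with the same magnitude spectrum as x but random phases."""
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    phases[0] = 0.0                          # keep the DC bin real
    if len(x) % 2 == 0:
        phases[-1] = 0.0                     # keep the Nyquist bin real
    return np.fft.irfft(np.abs(spectrum) * np.exp(1j * phases), n=len(x))


def signal_correlated_noise(x, rng, win_len=160):
    """Phase-randomized noise scaled by a moving-average envelope of |x|."""
    kernel = np.ones(win_len) / win_len
    envelope = np.convolve(np.abs(x), kernel, mode="same")
    noise = phase_randomize(x, rng)
    noise /= np.sqrt(np.mean(noise ** 2))    # unit-RMS carrier
    return envelope * noise
```

For example, a 1000 Hz tone passed through `spectrally_rotate` comes out with its energy near 2600 Hz. The phase-randomized carrier keeps the recording's long-term magnitude spectrum.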

Experimental task
The design and procedure of the present experiment were based on previous studies testing RS for human voices (Andics et al., 2013a) and for various visual stimuli such as faces and letters (Kovács et al., 2013; Summerfield et al., 2008). The trial and block structure of the experiment is shown in Fig. 1.
Stimuli were presented in pairs with a stimulus onset asynchrony (SOA) varying randomly between 800 ms and 1000 ms. The stress pattern of the first stimulus (S1) was either identical to (Repetition Trial, RepT) or different from (Alternation Trial, AltT) that of the second stimulus (S2). In RepTs, both S1 and S2 had the legal stress pattern (stress on the first syllable). In AltTs, S1 and S2 differed only in the position of stress: stress was on the first syllable for S1 (legal stress pattern) and on the second syllable for S2 (illegal stress pattern); the two stimuli were otherwise identical. Stimulus pairs were separated by a randomized inter-trial interval (ITI) of 4 or 6 s.
Besides the different trial types, two types of blocks were presented to test the modulation of repetition probability: Repetition Blocks (RepBs) and Alternation Blocks (AltBs). In each block, 20% of the trials were target trials, which were AltTs or RepTs with equal probability (i.e., 10% of all trials in each block were target AltTs and 10% were target RepTs). In target trials, the f0 of S2 was higher than that of S1 by 20% of the original f0 (see Stimuli section). In RepBs, including the target trials, 70% of the trials were RepTs and 30% were AltTs. In AltBs, including the target trials, 70% of the trials were AltTs and 30% were RepTs. The first four trials of each block were always non-target trials of the more frequent trial type of that block (RepT in RepBs, AltT in AltBs). The order and identity of AltTs and RepTs in each block were random and unique for each participant, with the constraint that target trials were separated by at least two non-target trials. Specifically, stimulus presentation in the experimental task was determined by twelve randomly mixed, participant-unique stimulus files: four of them coded the S1 in each block, another four coded the trial type (RepT, AltT, Repetition target, or Alternation target), and the remaining four coded the S2 determined by the S1 and the trial type (e.g., if the given trial was an AltT, the same pseudoword as the S1 was presented as the S2, but with the illegal stress pattern). There were 120 trials (stimulus pairs) in each block, and altogether four blocks (i.e., 480 stimulus pairs) were presented during a scanning session.
Table 1. Acoustical characteristics of the stimuli. The maximum (highest value measured within the syllable) and slope (increase and direction of the change measured within the syllable) values of the f0 and intensity, averaged across all tokens and speakers, are summarized.
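The block composition and ordering constraints described in the Experimental task section can be expressed as a small randomization routine. The Python sketch below is one way to satisfy them; it is not the authors' script. The trial labels and the function name `make_block` are hypothetical, and the routine generates only the trial-type sequence, not the S1/S2 stimulus files. The counts follow from 120 trials per block: 12 target RepTs, 12 target AltTs, 72 non-target trials of the frequent type, and 24 non-target trials of the rare type.

```python
import random

N_TRIALS = 120            # trials (stimulus pairs) per block
N_TARGETS_PER_TYPE = 12   # 10% target RepTs and 10% target AltTs
N_LEADING = 4             # first four trials: non-targets of the frequent type


def make_block(block_type, rng):
    """Return one block's trial-type sequence for block_type 'RepB' or 'AltB'."""
    frequent, rare = ("RepT", "AltT") if block_type == "RepB" else ("AltT", "RepT")
    # 70% / 30% of all trials, minus the targets of each type: 72 and 24.
    n_frequent = N_TRIALS * 7 // 10 - N_TARGETS_PER_TYPE
    n_rare = N_TRIALS * 3 // 10 - N_TARGETS_PER_TYPE

    rest = [frequent] * (n_frequent - N_LEADING) + [rare] * n_rare
    rng.shuffle(rest)
    non_targets = [frequent] * N_LEADING + rest

    targets = ["RepT_target", "AltT_target"] * N_TARGETS_PER_TYPE
    rng.shuffle(targets)

    # A gap g means "insert a target after g non-targets". Gaps must be >= 4
    # and differ by >= 2 so that at least two non-targets separate targets.
    # Sampling strictly increasing h_k and setting g_k = h_k + k enforces this.
    n_nt, n_t = len(non_targets), len(targets)
    reduced = sorted(rng.sample(range(N_LEADING, n_nt - n_t + 2), n_t))
    target_at = {h + k: t for k, (h, t) in enumerate(zip(reduced, targets))}

    sequence = []
    for i in range(n_nt + 1):
        if i in target_at:
            sequence.append(target_at[i])
        if i < n_nt:
            sequence.append(non_targets[i])
    return sequence


block = make_block("RepB", random.Random(1))
print(block[:8])
```

Placing targets into constrained gaps avoids rejection sampling. Every run of the function satisfies the spacing rule by construction.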
To obtain a stronger repetition probability effect, the different block types were not mixed within a functional run (cf. Andics et al., 2013a). Instead, RepBs and AltBs were presented in separate functional runs, with block (run) order counterbalanced across participants following a Latin square design. Each experimental run lasted 10 min 44 s. Breaks between runs lasted until the next run was initiated (approx. 1 min). During these breaks, communication between the experimenter and the participant was one-way, and only basic information was provided (e.g., the number of remaining runs). The participants' task during these runs was unrelated to the stress pattern manipulation of the S1 and S2 stimuli; instead, it required a decision about a phonetic characteristic of the stimulus pair. Namely, participants had to signal the detection of the target stimulus, i.e., a second stimulus with a higher overall pitch than the first one, by pressing a button with their right index finger as quickly and accurately as possible. They were also instructed to keep their gaze on the central fixation cross displayed on a screen throughout the experiment.

Functional localizer
Speech-sensitive cortical regions were defined in separate functional localizer runs, whose structure was based on the paradigm described by Stoppelman et al. (2013). Those authors presented blocks of continuous speech, reversed speech, and SCN, and found that while SCN served as an effective baseline for contrasting speech stimuli, reversed speech removed much of the speech-related response in speech-specific areas. Taking these results into account, we applied two baseline conditions: SCN and SRSP. The latter was selected because previous studies (Obleser et al., 2006; Scott et al., 2000) used it effectively as a baseline condition and because, in contrast to SCN, it retains much of the complex spectro-temporal structure of speech while being unintelligible.
Consequently, three conditions (Speech, SCN, and SRSP) were presented in the functional localizer scans using a block design. Each block was 15 s long and was followed by a 12, 14, or 16 s silent interval. Two localizer runs were presented with 12 blocks each (24 blocks altogether). The order of blocks was pseudorandomized such that two successive blocks always belonged to different conditions (cf. Stoppelman et al., 2013). During the localizer runs, participants were instructed to attend to all auditory stimuli without performing any task. Each functional localizer run lasted 6 min 34 s.
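The pseudorandomization rule for the localizer (no two successive blocks of the same condition) can be met with simple rejection sampling. The Python sketch below makes two assumptions that the source does not specify: four blocks per condition in each run, and silent intervals drawn uniformly from 12, 14, and 16 s. It also ignores any lead-in time at the start of a run. The name `localizer_run` is hypothetical.

```python
import random

CONDITIONS = ["Speech", "SCN", "SRSP"]
BLOCK_S = 15                  # block duration in seconds
SILENCES_S = (12, 14, 16)     # possible silent intervals after each block


def localizer_run(rng, n_blocks=12):
    """Return (block order, block onsets in s) with no condition repeated back to back."""
    per_condition = n_blocks // len(CONDITIONS)   # assumed: equal counts per run
    while True:                                   # rejection sampling
        order = CONDITIONS * per_condition
        rng.shuffle(order)
        if all(a != b for a, b in zip(order, order[1:])):
            break
    onsets, t = [], 0
    for _ in order:
        onsets.append(t)
        t += BLOCK_S + rng.choice(SILENCES_S)     # assumed: uniform choice of silence
    return order, onsets


order, onsets = localizer_run(random.Random(0))
print(list(zip(order, onsets)))
```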

Stimulus presentation
Stimulus presentation was controlled via MATLAB 2013b (The MathWorks Inc., Natick, MA, USA) using the Psychophysics Toolbox Version 3 (PTB-3) extensions (Brainard, 1997; Pelli, 1997). Auditory stimuli were delivered binaurally via MRI-compatible headphones (MR Confon, Magdeburg, Germany) at a comfortable volume (set based on the pilot scans and kept constant throughout the experiment). Written instructions, feedback on performance after each block (number of hits in the given block), and a central fixation cross were displayed on an MRI-compatible LCD screen (32-inch NNL LCD Monitor, NordicNeuroLab, Bergen, Norway; refresh rate: 60 Hz) placed 142 cm from the observer and viewed via a mirror attached to the top of the head coil.
The four experimental runs (two runs with RepBs, two runs with AltBs), the structural run, and the two functional localizer runs were administered in that order in one scanning session. The full scanning session lasted around 1 h 5 min. During scanning, the presentation of S1s was synchronized to the trigger pulses of the MRI scanner. Before scanning, participants practiced the target detection task with eight trials outside the scanner. They were not informed about the different presentation probabilities of RepTs and AltTs in the two block types. The entire experimental procedure lasted about 2 h.
Fig. 1. Trial and block structure of the experiment. A. … Note that different pseudowords were used, and during trial presentation participants saw only a fixation cross on the screen. Bold capital letters indicate the stressed syllable. Examples of a repetition (RepT), an alternation (AltT), and a target trial are illustrated; in the target trial, bold letters signal the higher f0 of the target stimulus. B. The structure of the repetition (RepB) and alternation (AltB) blocks. Note that in each block, half of the target trials (10%) were AltTs and the other half (10%) were RepTs. Thus, altogether, in the RepBs, 70% of the trials were RepTs (60% non-target RepTs plus 10% target RepTs) while 30% were AltTs (20% non-target AltTs plus 10% target AltTs). Similarly, in the AltBs, 70% of the trials were AltTs and 30% were RepTs. C. Acoustic waveform of a typical repetition and alternation trial.

fMRI data analysis

Preprocessing
Preprocessing and analysis of the fMRI data were performed using SPM12 (Wellcome Trust Centre for Neuroimaging, Institute of Neurology, University College London, UK) running under MATLAB 2013b. The functional images were realigned in two passes: first to the first volume, then to the mean of the images realigned in the first pass. The structural images were coregistered to the mean functional image. To spatially normalize the realigned functional images to MNI space, we applied the deformation field parameters obtained during the normalization of the anatomical T1-weighted image. After normalization, the functional images were spatially smoothed with an 8-mm full-width at half-maximum (FWHM) isotropic Gaussian kernel. The same preprocessing steps were applied to the functional images of the localizer runs.
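The 8-mm smoothing kernel is specified by its FWHM. The corresponding Gaussian standard deviation is σ = FWHM / (2√(2 ln 2)) ≈ FWHM / 2.355. The short Python check below is a worked example of that standard conversion, not part of the SPM pipeline itself.

```python
import math


def fwhm_to_sigma(fwhm_mm):
    """Standard deviation (mm) of a Gaussian kernel with the given FWHM (mm)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


sigma = fwhm_to_sigma(8.0)
print(f"sigma = {sigma:.3f} mm")   # sigma = 3.397 mm

# Sanity check: the kernel falls to half its peak at +/- FWHM/2 = 4 mm.
half = math.exp(-(4.0 ** 2) / (2.0 * sigma ** 2))
print(f"{half:.3f}")               # 0.500
```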

Single-subject analysis
The separate functional localizer runs were used to define regions of interest (ROIs; see Fig. 2), which were analyzed using the MARSBAR 0.44 toolbox for SPM (Brett et al., 2002). Previous studies investigating RS and repetition probability effects for different visual stimuli found that only the analysis of specific ROIs was sensitive enough to reliably detect these effects, and whole-brain analyses testing the same effects revealed no significant activations in additional brain regions (Kovács et al., 2013; Summerfield et al., 2008). Therefore, in this study, we focus primarily on the analysis of specific ROIs.
We used two localizer contrasts (Speech > SRSP and Speech > SCN) to determine the location of the left posterior (p), middle (m), and anterior (a) STG as well as the right pSTG, mSTG, and aSTG. By default, we used the Speech > SRSP contrast, which identified all ROIs in 13 participants. In addition, we used the Speech > SCN contrast to identify the left mSTG in one participant, the right mSTG in two, and the right aSTG in one, because these ROIs could not be determined with the default Speech > SRSP contrast in their cases. Neither contrast identified the left aSTG in two participants, the right mSTG in one, and the right aSTG in three. Accordingly, the left pSTG, left mSTG, and right pSTG were defined in all participants, while the left aSTG, right mSTG, and right aSTG were defined in 18, 19, and 17 participants, respectively. All ROIs were successfully identified in 15 participants (see Table 2).
The location of the three areas (posterior, middle, and anterior STG) in both hemispheres was determined individually for each participant as responding more strongly to speech than to SRSP or SCN stimuli in the localizer runs (p uncorrected < .001). At the individual level, we selected the local activation maximum closest to the corresponding reference cluster (defined by the whole-brain random-effects analysis of the Speech > SRSP or Speech > SCN contrast; p FWE < .05, cluster extent >100 voxels). The individual and average MNI coordinates of these areas are presented in Table 2.
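The selection rule above (take the individual local maximum nearest to the group reference cluster) can be sketched as a nearest-neighbour search in MNI space. The Python sketch below has three assumptions. First, the candidate peaks are already thresholded at p uncorrected < .001. Second, "closest" means Euclidean distance to a single reference coordinate. Third, the coordinates are made up and not taken from Table 2. `nearest_peak` is a hypothetical name.

```python
import math


def nearest_peak(reference_mm, peaks_mm):
    """Return the individual local maximum closest (Euclidean, in mm) to the reference."""
    if not peaks_mm:
        return None          # no suprathreshold peak: ROI not identified
    return min(peaks_mm, key=lambda peak: math.dist(peak, reference_mm))


# Made-up MNI coordinates for illustration only; they are not taken from Table 2.
group_reference = (-60, -24, 8)
subject_peaks = [(-52, -30, 10), (-62, -20, 4), (-48, -10, -4)]
print(nearest_peak(group_reference, subject_peaks))   # (-62, -20, 4)
```

Returning `None` when no peak survives thresholding mirrors the cases above in which an ROI could not be identified for a participant.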
To analyze the fMRI data of the experimental task, we used MARSBAR to extract the mean percent signal change and the time series of voxel values within an 8-mm radius sphere around each ROI's center. For the general linear model analysis, regressors for the four experimental conditions (AltB_AltT, AltB_RepT, RepB_AltT, and RepB_RepT) and for the target trials were modeled at the onset of the S1 stimuli using delta functions convolved with the canonical hemodynamic response function of SPM12. Low-frequency components were removed using a high-pass filter with a 128 s cut-off. Temporal autocorrelations were corrected using an autoregressive AR(1) model, and movement-related variance was accounted for by the spatial parameters resulting from the realignment procedure.
Fig. 2. … (… Table 2). Colored areas show t-values (p FWE < .05, with an additional cluster extent of >100 voxels). SRSP: spectrally rotated speech; SCN: signal correlated noise.
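The regressor construction described above convolves delta functions at S1 onsets with a hemodynamic response function. The pure-Python sketch below approximates SPM's canonical HRF as a double gamma: a response gamma with shape 6, minus an undershoot gamma with shape 16 scaled by 1/6, sampled over 32 s and normalized to unit sum. It omits SPM's microtime resolution, the high-pass filter, and the AR(1) model. The onsets and the time grid are hypothetical.

```python
import math


def gamma_pdf(t, shape, scale=1.0):
    """Gamma probability density at time t (seconds)."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t / scale) / (math.gamma(shape) * scale ** shape)


def canonical_hrf(dt=0.1, length_s=32.0, peak=6.0, undershoot=16.0, ratio=6.0):
    """Double-gamma HRF sampled every dt seconds and scaled to unit sum."""
    n = int(round(length_s / dt)) + 1
    h = [gamma_pdf(i * dt, peak) - gamma_pdf(i * dt, undershoot) / ratio for i in range(n)]
    total = sum(h)
    return [v / total for v in h]


def event_regressor(onsets_s, duration_s, dt=0.1):
    """Delta functions at the given onsets convolved with the HRF."""
    n = int(round(duration_s / dt))
    sticks = [0.0] * n
    for onset in onsets_s:
        idx = int(round(onset / dt))
        if 0 <= idx < n:
            sticks[idx] += 1.0
    hrf = canonical_hrf(dt)
    return [
        sum(sticks[i - k] * hrf[k] for k in range(min(i + 1, len(hrf))))
        for i in range(n)
    ]


# Hypothetical S1 onsets (s) for one condition within a 60-s stretch of a run.
regressor = event_regressor([10.0, 30.0], duration_s=60.0)
```

With these defaults, the modeled response to a single event peaks about 5 s after its onset.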

Multi-subject analysis
The mean percent signal change values obtained in the experimental task for all ROIs were analyzed in two steps. First, a five-way repeated measures analysis of variance (ANOVA) was conducted with Hemisphere (right, left), Region (posterior, middle, anterior), Run (1, 2), Block (Alternation, Repetition), and Trial type (Alternation, Repetition) as within-subject factors. Second, to analyze each ROI separately, three-way repeated measures ANOVAs with Run (1, 2), Block (AltB, RepB), and Trial type (AltT, RepT) as within-subject factors were conducted. In all ANOVAs, partial eta-squared (ηp²) is reported as the measure of effect size. To control for Type I error, we used Tukey HSD tests for pair-wise comparisons.
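Partial eta-squared can be recovered from a reported F statistic and its degrees of freedom. Because ηp² = SS_effect / (SS_effect + SS_error) and F = (SS_effect / df1) / (SS_error / df2), it follows that ηp² = F·df1 / (F·df1 + df2). The Python check below applies this identity to the left aSTG Run * Trial type interaction reported in the Control effects section, F(1, 17) = 4.56. The result, ≈ .2115, agrees with the reported .211 up to the rounding of F.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic: F*df1 / (F*df1 + df2)."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# Run * Trial type interaction in the left aSTG (Control effects): F(1, 17) = 4.56.
print(f"{partial_eta_squared(4.56, 1, 17):.4f}")   # 0.2115
```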

Behavioral performance
Participants detected the target stimuli with an average accuracy of 90.1% (SE = 1.4%) and an average RT of 843 ms (SE = 34 ms). Their false alarm rate was on average 4.9% (SE = 1.1%). Behavioral measures (accuracy and RT) calculated for the target stimuli were entered into two-way repeated measures ANOVAs with Block and Trial type as within-subject factors. The Block * Trial type ANOVA performed on accuracy data revealed a significant main effect of Trial type, F(1, 19)

fMRI results
In the analyses below, RS is indicated by the main effect of Trial type (the overall difference between AltTs and RepTs), while the repetition probability modulation of RS is indicated by the Block * Trial type interaction (Summerfield et al., 2008). Because the following sections report only significant main effects and interactions, the full results of the ANOVAs performed on the mean percent signal change in each ROI are given in Table 3.
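The two effects defined above can be written as per-subject contrasts on the four cell means. The Python sketch below computes the RS index (AltT − RepT, averaged over blocks) and the modulation index (RS in RepBs minus RS in AltBs). A positive modulation value means larger RS when repetitions are expected, which is the direction reported by Summerfield et al. (2008). For a single-df within-subject effect, the repeated measures F equals the squared one-sample t of the per-subject contrast. The sketch collapses over Run, which simplifies the three-way ANOVA actually used, and the cell values in the usage example are hypothetical.

```python
import math
from statistics import mean, stdev


def rs_effects(cells):
    """Per-subject RS and its repetition probability modulation from cell means.

    cells maps 'AltB_AltT', 'AltB_RepT', 'RepB_AltT', 'RepB_RepT' to percent
    signal change (hypothetically already averaged over Run).
    """
    rs_altb = cells["AltB_AltT"] - cells["AltB_RepT"]   # RS in Alternation Blocks
    rs_repb = cells["RepB_AltT"] - cells["RepB_RepT"]   # RS in Repetition Blocks
    return {
        "rs": (rs_altb + rs_repb) / 2,      # Trial type main effect (AltT - RepT)
        "modulation": rs_repb - rs_altb,    # Block * Trial type interaction contrast
    }


def single_df_f(contrasts):
    """F(1, n - 1) for a one-df within-subject contrast, computed as t squared."""
    n = len(contrasts)
    t = mean(contrasts) / (stdev(contrasts) / math.sqrt(n))
    return t * t, (1, n - 1)


# Hypothetical cell means (percent signal change) for one subject.
example = rs_effects({"AltB_AltT": 0.50, "AltB_RepT": 0.40,
                      "RepB_AltT": 0.50, "RepB_RepT": 0.20})
```

In practice, one such dictionary per participant would be collected, and the `modulation` values passed to `single_df_f` to test the interaction.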
… between Run 1 and Run 2 only in the anterior regions (see the control effects for the aSTG below), but no other significant main effects or interactions involving Hemisphere, Region, or Run were found.

Repetition suppression (H1) in each ROI
We … (… Fig. 3c). In sum, all ROIs showed an RS effect, although it was weaker in the right aSTG.
In sum, we found significant repetition probability modulations in the pSTG and mSTG bilaterally, and no such effect in the bilateral aSTG. Although the interaction effect was only marginally significant in the left mSTG and right pSTG, pair-wise tests suggested repetition probability modulations in these ROIs, and these trends were also confirmed by the overall ANOVA.

Control effects
No other main effects or interactions (Run, Block, Run * Block, Run * Trial type, and Run * Block * Trial type) were significant in any of the ROIs, except for a significant Run * Trial type interaction in the left aSTG, F(1, 17) = 4.56, p = .048, ηp² = .211 (see Table 3). Here, the RepT < AltT difference was larger in Run 1 (0.13%, p < .001) than in Run 2 (0.08%, p < .001).

Whole-brain analyses
To test whether areas beyond those identified in the ROI analysis reflected the repetition probability modulation effect, we performed a whole-brain random-effects analysis. We tested the main effect of Block (AltB > RepB) and the main effect of Trial type (AltT > RepT). These analyses did not yield significant activations in any brain region at the threshold of p FWE < .05, nor at a more liberal threshold (p uncorrected < .0001, cluster extent of >20). The opposite contrasts (RepB > AltB, RepT > AltT) did not yield significant activations either.
The contrast testing the Block * Trial type interaction [(AltT_AltB vs. RepT_AltB) vs. (AltT_RepB vs. RepT_RepB)] did not yield significant activations in any brain region, even at the liberal threshold of p uncorrected < .0001 (cluster extent of >20).

Discussion
In the present study, we used RS to investigate the processing of word stress in speech-sensitive regions of the superior temporal cortex identified with an independent functional localizer. The results revealed RS effects related to word stress processing in several superior temporal cortical areas: in the bilateral pSTG and mSTG, as well as in the left aSTG. Comparing the magnitude of RS across ROIs showed that it decreased along the posterior-anterior axis, suggesting that the pSTG and mSTG were more actively involved in word stress encoding than more anterior regions. In addition, fatigue effects and possible changes in the magnetic field did not substantially influence the experimental results. Crucially, the RS effect was modulated by repetition probability in the speech-specific pSTG and mSTG regions, providing evidence for the predictive processing of word stress in the human cortex.
To select the regions specifically sensitive to word stress processing, we developed a functional speech localizer. Several regions of the STG were activated bilaterally by meaningless speech stimuli compared with both SCN and SRSP. Previous studies using SRSP stimuli as a contrast for localizing speech-specific brain areas found activity in the STG/STS regions (Bautista and Wilson, 2016; Golden et al., 2015; Halai et al., 2015; Sabri et al., 2008) and also in IFG regions (Halai et al., 2015; Sabri et al., 2008). The lack of IFG activity in our case could be due to our use of pseudowords: IFG regions have previously been found to be involved in the processing of semantically and syntactically more complex stimuli (Halai et al., 2015; Newman et al., 2003; Thompson-Schill et al., 1997). These results confirm the effectiveness of SRSP stimuli as a baseline for identifying speech-specific regions.

Note. p-values below .05 are boldfaced, and below .10 are italicized.

Fig. 3. Time course (mean ± SE) and average peak activation profiles (± SE) in the pSTG (a), mSTG (b), and aSTG (c), separately for the different trial and block types. Short horizontal lines denote the significance of RS effects; long horizontal lines denote the significance of repetition probability effects. Note: +p < .10; *p < .05; **p < .01; ***p < .001.
Previous studies of language comprehension have found evidence for predictive mechanisms at several levels of the linguistic hierarchy: prediction of upcoming words has been shown for phonological (DeLong et al., 2005), morphosyntactic (Van Berkum et al., 2005; Wicha et al., 2003, 2004), lexical-semantic/discourse (Federmeier and Kutas, 1999; Hasson et al., 2006; Lau et al., 2016; Orfanidou et al., 2006; Otten and Van Berkum, 2008; Poppenk et al., 2016; Van Petten et al., 1999), and syntactic contexts (Arai and Keller, 2013; Bornkessel-Schlesewsky and Schlesewsky, 2013; Kuperberg and Jaeger, 2016; Kutas et al., 2011; Matchin et al., 2016; Rohde et al., 2011; Weber et al., 2016). Predictive mechanisms are relevant to word stress processing because the representation of word stress is thought to involve hierarchical rules, i.e., rules that assign stress to certain syllables at the word level or to certain words at the sentence level (Hayes, 1995; Liberman and Prince, 1977). Even in languages where stress is specified in the mental lexicon, rule-based mechanisms exist to compute the stress pattern of unknown words (Colombo, 1992; Cutler and Isard, 1980).
One important aspect of the present study was that it investigated word stress processing in Hungarian, a fixed-stress language with stress on the first syllable of words. We have argued that Hungarian speakers might be especially sensitive to changes in this highly regular stress pattern, because any other stress pattern could be considered illegal (Garami et al., 2017; Honbolygó and Csépe, 2013). This raises the question of whether the present results are language specific and valid only for fixed-stress languages. As argued above, predictions about word stress are based on phonological rules, particularly for unknown words whose stress pattern is not specified in the mental lexicon. Therefore, listeners of variable-stress languages can be expected to show results similar to those of Hungarian listeners, involving similar brain regions, when processing pseudowords. A crucial difference between languages could, however, emerge when processing known words, as the lexical specification of stress probably differs between fixed-stress and variable-stress languages. This is an especially interesting question for further brain imaging studies, because it is unclear whether the stress pattern of words is specified in the mental lexicon of listeners of fixed-stress languages.
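The fixed word-initial stress rule discussed above can be made concrete in a few lines. The sketch below is a simplification under stated assumptions: it encodes only primary stress (secondary stress is ignored), represents a stress pattern as one boolean per syllable, and uses hypothetical function names; it is not the authors' stimulus-generation procedure. It also labels a pseudoword pair as Rep or Alt depending on whether the two stress patterns match, mirroring the trial definition of the paradigm.

```python
from typing import Sequence, Tuple


def fixed_initial_stress(n_syllables: int) -> Tuple[bool, ...]:
    """Primary stress pattern predicted by a fixed word-initial stress rule."""
    if n_syllables < 1:
        raise ValueError("a word needs at least one syllable")
    return tuple(i == 0 for i in range(n_syllables))


def is_legal(pattern: Sequence[bool]) -> bool:
    """True if only the first syllable carries primary stress, as the rule requires."""
    return tuple(pattern) == fixed_initial_stress(len(pattern))


def trial_type(first: Sequence[bool], second: Sequence[bool]) -> str:
    """'Rep' if both pseudowords share a stress pattern, otherwise 'Alt'."""
    return "Rep" if tuple(first) == tuple(second) else "Alt"


initial = fixed_initial_stress(3)   # stress on the first syllable (legal)
second = (False, True, False)       # stress on the second syllable (illegal)
print(is_legal(initial), is_legal(second))
print(trial_type(initial, initial), trial_type(initial, second))
```

Because the rule needs only the number of syllables, it yields a stress prediction even for a pseudoword with no lexical entry, which is the situation the paragraph above describes.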
Regarding the neurobiological background of word stress processing, the superior temporal lobe (STG/STS regions) has previously been found to be active (Aleman et al., 2005; Domahs et al., 2013; Heisterueber et al., 2014; Kandylaki et al., 2017; Klein et al., 2011). These studies also found several other regions to be active, including the IFG, the SMA, areas of the parietal lobe (angular gyrus, superior parietal gyrus, parietal lobule), and areas of the frontal lobe (precentral, postcentral, and middle frontal gyrus). One possible reason for this diversity is that these studies used various paradigms (discrimination, imagery, recall, and well-formedness judgement tasks), which might have tapped different cognitive functions and consequently engaged different brain areas. In our study, the RS paradigm, being an implicit and passive task, allowed us to investigate areas directly and specifically related to word stress processing.
Moreover, the paradigm allowed us to demonstrate the importance of the STG in processing word stress on the basis of long-term memory traces (reflected in the observed repetition probability effect). In our previous ERP study (Honbolygó and Csépe, 2013), we suggested that word stress processing relies on so-called stress templates, assumed to be pre-lexical, speech-specific long-term traces of word stress patterns. In that study, the MMN component appeared only when the illegal stress pattern (stress on the second syllable) was the deviant and the legal stress pattern (stress on the first syllable) was the standard. There was no MMN in the reversed condition, when the legal stress pattern was the deviant and the illegal stress pattern was the standard, indicating that a legal stress pattern in the deviant position did not violate the predictions based on the long-term traces of the native stress pattern. Here, we show that the neural substrate of word stress processing based on these putative stress templates is likely to involve the bilateral pSTG and mSTG. Although some studies have suggested a right-hemisphere dominance in prosodic processing (Gandour et al., 2004; Meyer et al., 2004), studies of word stress have shown either left-hemisphere dominance (Aleman et al., 2005) or bilateral activation (Domahs et al., 2013; Heisterueber et al., 2014; Kandylaki et al., 2017; Klein et al., 2011). Our study further supports the bilateral nature of word stress processing.
The results also fit the assumptions of the neurobiological language model proposed by Bornkessel-Schlesewsky and Schlesewsky (2013). The model suggests that speech is processed along a dual auditory stream network, consisting of an antero-ventral stream responsible for the recognition of linguistic elements and a postero-dorsal stream responsible for the predictive sequential processing of linguistic information. The model assumes that the postero-dorsal stream, among other tasks, engages in prosodic segmentation, which includes the detection of word stress. Word stress processing is a time-sensitive mechanism, and, as discussed above, its representation is based on hierarchical rules, which allow the formation of expectations about upcoming stress information (whether a syllable will be stressed or unstressed). Given that repetition suppression is closely connected to predictive processes, the RS effects found here indicate that word stress information is processed predictively. Furthermore, we found some indication that the repetition probability modulation of RS differed across STG regions: although the overall ANOVA did not show a significant Block * Trial type * Region interaction, the analyses of individual ROIs revealed that the RS effect was modulated by repetition probability in the pSTG and mSTG but not in the aSTG. This might indicate that the probability effects on RS are related to the posterior part of the STG, which is assumed to belong to the dorsal stream. Nevertheless, further studies are required to determine whether word stress processing is indeed specifically associated with the dorsal stream.
Contrary to previous studies investigating the neural background of word stress processing, we did not find any IFG activation in our whole-brain analysis. As mentioned above, the lack of IFG activity in our case could be due to our use of pseudowords. Bornkessel-Schlesewsky and Schlesewsky (2013) suggest that the IFG is involved mostly in cognitive control and conflict resolution, and that it integrates the representations generated by the two auditory streams. Since only pairs of meaningless pseudowords were presented in our study, their processing presumably did not require the involvement of the IFG. Furthermore, IFG activation has been found in studies investigating the role of prosody in sentence processing and prosodic structure building (Sammler et al., 2018; van der Burght et al., 2019), or when participants had to make same-different judgements about stress pairs (Klein et al., 2011), that is, when prosodic information had to be used explicitly. This was not the case in our study, which might also have contributed to the lack of IFG activation.
In summary, the present study provides further evidence that, like other linguistic features, word stress is processed via predictive mechanisms, as shown by the RS and repetition probability effects found bilaterally in the posterior and middle parts of the STG. Further studies are needed to clarify whether these predictive representations are similar for meaningful and meaningless words, and whether they differ between languages with different stress systems. Moreover, more data are needed on the role of the dorsal auditory stream and of predictive processes in prosody perception at both the word and sentence levels.

Declaration of competing interest
The authors claim no conflict of interest.

Data availability
Data will be made available on request.
