Multivariate analysis of speech envelope tracking reveals coupling beyond auditory cortex

The systematic alignment of low-frequency brain oscillations with the acoustic speech envelope is well established and has been proposed to be crucial for actively perceiving speech. Previous studies investigating speech-brain coupling in source space are restricted to univariate pairwise approaches between brain and speech signals, and therefore speech tracking information in frequency-specific communication channels might be lacking. To address this, we propose a novel multivariate framework for estimating speech-brain coupling in which neural variability from source-derived activity is taken into account along with the rate of the envelope's amplitude change (derivative). We applied it to magnetoencephalographic (MEG) recordings while human participants (male and female) listened to one hour of continuous naturalistic speech, showing that the multivariate approach outperforms the corresponding univariate method at low and high frequencies across frontal, motor, and temporal areas. Systematic comparisons revealed that the gain at low frequencies (0.6-0.8 Hz) was related to the envelope's rate of change, whereas at higher frequencies (0.8-10 Hz) it was mostly related to the increased neural variability from source-derived cortical areas. Furthermore, following a non-negative matrix factorization approach, we found distinct speech-brain components across time and cortical space related to speech processing. We confirm that speech envelope tracking operates mainly on two timescales (δ and θ frequency bands), and we extend those findings by showing shorter coupling delays in auditory-related components and longer delays in higher-association frontal and motor components, indicating temporal differences in speech tracking and providing implications for hierarchical stimulus-driven speech processing.


Introduction
Our senses are confronted with signals that often exhibit regular or semi-regular patterns over time. Similarly, fluctuations of synchronized excitatory and inhibitory cortical activity drive rhythmic patterns of brain activity (Bishop, 1932; Buzsáki and Draguhn, 2004) which modulate the processing of incoming sensory signals (Arieli et al., 1996; Romei et al., 2010). Brain dynamics can temporally align to the rhythmic structure of sensory signals (Henry and Obleser, 2012; Obleser and Kayser, 2019; Schroeder and Lakatos, 2009), a process that is considered to facilitate structuring and gating of incoming information (Arieli et al., 1996; Lakatos et al., 2019; Romei et al., 2010) while forming predictions about future incoming events in space and time (Arnal and Giraud, 2012). In the case of audition, rhythmic auditory input can coordinate the phase of ongoing oscillations in the auditory cortex (Peelle and Davis, 2012; Rimmele et al., 2021). Notably, while the spectral content of speech is crucial for comprehension (Lorenzi et al., 2006; Obleser et al., 2012; Scott and McGettigan, 2012), recent work proposed that disrupting entrainment through electrical stimulation leads to compromised speech intelligibility (Asamoah et al., 2019; Riecke et al., 2018; Wilsch et al., 2018; Zoefel et al., 2018). Speech tracking is also modulated by attention (Obleser and Kayser, 2019; Zion Golumbic et al., 2013) and correlated with semantic context (Broderick et al., 2019; Koskinen et al., 2020), indicating an interaction between bottom-up (that is, feedforward) and top-down (that is, feedback) processes (Assaneo et al., 2019; Barczak et al., 2018). Still, a complete whole-brain characterization of speech tracking across relevant frequencies, delays, and cortical areas is lacking.

Figure 1. Multivariate speech tracking pipeline. (A) Participants (n = 24) listened to ∼1 hour of an audiobook, divided into 6 blocks, while magnetoencephalographic (MEG) measurements were acquired. (B) The amplitude envelope (black) and its derivative (purple) were extracted from the continuous speech signal (see Methods). (C) Individual MRIs were used to estimate source models per participant, which were interpolated to a template volumetric grid. (D) Cortical areas were divided into 362 anatomical parcels according to the parcellation from the Human Connectome Project (Glasser et al., 2016). (E) For each parcel, estimated source time series were extracted. (F) Acoustic signals (amplitude envelope and its derivative, see (B)) and source activity (three time series per parcel, see Methods) were subjected to a continuous wavelet transform. After the transformation to frequency space, we computed mutual information between the multivariate time series (acoustic signals and source time series per parcel; see Methods for details).
To date, studies investigating speech-brain coupling in cortical space have used a univariate pairwise approach between source estimates and speech signals. Voxel-based studies typically use the dominant source direction for each voxel, and atlas-based studies often use the first component of a principal component analysis (PCA) of all voxel time series for each atlas parcel. Both approaches can miss relevant information, as there is no fundamental justification for the assumption that the strongest source orientation (which is based on power) or the strongest PCA component (again based on power) captures the brain activity with the strongest coupling to the speech envelope (Jaworska et al., 2022). Here, we aimed to alleviate this limitation by using a novel atlas-based multivariate approach.
First, we aimed to provide a comprehensive characterisation of whole-brain speech tracking in data with high signal-to-noise ratios based on a novel multivariate mutual information analysis (Ince et al., 2017). To this end, we developed an analytical multivariate approach in which neural variability across and within parcels was taken into account for the estimation of speech-brain coupling, along with information from the speech envelope (Env) and its rate of change (derivative; EnvD, see Figure 1). We show that multivariate speech tracking is superior to univariate speech tracking in all brain areas and helps to uncover the diverse spectro-temporal structure of speech tracking across cortical areas. Then, we followed an unsupervised non-negative matrix factorization (NMF) approach to uncover spectral components across cortical areas and temporal delays coupled to the acoustic envelope during passive listening using MEG (Baillet, 2017; Gross, 2019).

Participants and data acquisition
A total of 24 volunteers (12 females; mean age = 24.0 years, age range 18-35 years) participated in this study. The study was approved by the College of Science and Engineering Ethics Committee at the University of Glasgow (application number: 300170024). While participants listened to a 55-minute audiobook, brain activity was monitored with a 248-magnetometer whole-head MEG system (MAGNES 3600, 4-D Neuroimaging) in a magnetically shielded room. Data were acquired at a sampling rate of 1017.25 Hz for 10 participants and 2035.51 Hz for 14 participants. Individual head shapes were digitized before each recording via five coils attached to the head. Each MEG session was separated into six blocks of ∼9.16 min. To allow participants to better comprehend the story, the last 10 seconds of each block were repeated in the following block. If head movement exceeded 5 mm for a block, the measurement was repeated. The stimulus was delivered using PsychToolBox (Brainard, 1997) with two Etymotic ER-30 insert earphones. To assess whether participants paid attention to the story, they had to answer 18 multiple-choice questions (with three response options each), with the number of correct options varying between 1 and 3 per question (mean performance 0.95; SD 0.05; range 0.78-1). A different analysis of this dataset has been reported elsewhere (Daube et al., 2019).

Speech envelope extraction
We extracted the amplitude envelope from the continuous speech signal. To this end, 31-channel log-mel spectrograms (124.1-7284.1 Hz) were computed and absolute values were summed across bands to obtain a wideband speech envelope (Schädler et al., 2012).
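The extraction above can be sketched in Python; this is a minimal approximation (the study used MATLAB and the exact filterbank of Schädler et al. (2012)), so the STFT window length and the triangular mel-filter construction here are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, fmin, fmax):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def wideband_envelope(audio, fs, n_bands=31, fmin=124.1, fmax=7284.1, n_fft=512):
    # Short-time spectrum -> mel bands -> log -> sum across bands.
    _, _, Z = stft(audio, fs=fs, nperseg=n_fft)
    power = np.abs(Z) ** 2
    fb = mel_filterbank(n_bands, n_fft, fs, fmin, fmax)
    env = np.log(fb @ power + 1e-12).sum(axis=0)
    # First temporal derivative approximates the envelope's rate of change.
    env_d = np.gradient(env)
    return env, env_d
```

Here `env` and `env_d` correspond to the Env and EnvD signals used throughout the paper.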

Data preprocessing
MEG data were processed using the FieldTrip toolbox (Oostenveld et al., 2011) for MATLAB 2021a (The MathWorks, Inc.) and in-house MATLAB routines. Data were preprocessed using the same scripts as in Daube et al. (2019). Briefly, for each block, continuous data starting at the onset of the story were denoised using the denoise_pca function of FieldTrip, and bad channels were manually detected and spherical-spline interpolated from neighboring channels (mean number of rejected channels per block M = 3.07; SD = 3.64). SQUID jumps were replaced with DC patches. Continuous data were filtered offline with a fourth-order forward-reverse zero-phase Butterworth high-pass filter with a cutoff frequency of 0.5 Hz and downsampled to 100 Hz for computational efficiency. Independent components arising from heartbeats and eye movements (mean number of rejected components per block M = 5; SD = 5.3) were visually identified and removed using the runica ICA algorithm.
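The filtering and downsampling steps can be sketched with `scipy.signal` equivalents of the FieldTrip/MATLAB operations; an integer fs/fs_new ratio is assumed here for simplicity (the actual recordings were sampled at 1017.25 or 2035.51 Hz):

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def preprocess_block(data, fs, hp_cutoff=0.5, fs_new=100):
    # data: (n_channels, n_samples) continuous MEG block.
    # Fourth-order Butterworth high-pass, applied forward and reverse
    # (filtfilt) for zero phase distortion, then polyphase downsampling
    # with built-in anti-aliasing.
    b, a = butter(4, hp_cutoff / (fs / 2.0), btype='high')
    filtered = filtfilt(b, a, data, axis=-1)
    return resample_poly(filtered, 1, int(fs // fs_new), axis=-1)
```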

Source localization
Individual T1-weighted MRIs were coregistered to the MEG coordinate system, aligned with the digitized head shapes using the iterative closest point algorithm (Besl and McKay, 1992), and segmented (into white matter, gray matter, and cerebrospinal fluid) to generate single-shell volume conductor models (Nolte, 2003). For group analyses, individual MRIs were linearly transformed to an MNI template provided by FieldTrip. Source activity was estimated by computing LCMV beamformer coefficients from the MEG time series for each voxel on a 5 mm grid (Van Veen et al., 1997). The sensor covariance matrix was computed across all trials. The lambda regularization parameter was set to 0%, and time series were extracted for each dipole orientation, resulting in three time series per voxel. To reduce the dimensionality of the data, we applied an atlas-based parcellation of cortical space, resulting in 181 ROIs per hemisphere (Glasser et al., 2016). Source time series for each parcel were concatenated across voxels and orientations, and we extracted the principal components along with their explained variance.
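A minimal sketch of the unit-gain LCMV spatial filter for a single grid point, following Van Veen et al. (1997); the leadfield `L` and sensor covariance `C` are assumed to be given (FieldTrip computes these internally), and regularization is off (lambda = 0) as in the study:

```python
import numpy as np

def lcmv_weights(L, C, lam=0.0):
    # LCMV beamformer filter for one grid point.
    # L: (n_sensors, 3) leadfield for the three dipole orientations;
    # C: (n_sensors, n_sensors) sensor covariance;
    # lam: regularization fraction (0 in the study).
    n = C.shape[0]
    Ci = np.linalg.inv(C + lam * np.trace(C) / n * np.eye(n))
    # W = (L' C^-1 L)^-1 L' C^-1, so that W @ L = I (unit gain)
    W = np.linalg.solve(L.T @ Ci @ L, L.T @ Ci)
    return W  # (3, n_sensors)

# Applying W to the sensor data yields the three time series per voxel:
# source_ts = W @ meg_data  # shape (3, n_samples)
```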

Multivariate Mutual Information
Statistical dependencies between the speech envelope together with its first derivative (Ince et al., 2017) and the source-space parcels were computed on the basis of information theory (Shannon, 1948). We estimated mutual information (MI) using Gaussian Copula MI (GCMI) between multivariate speech signals and source time series (Ince et al., 2017). GCMI was estimated for various delays (−300 ms to 300 ms in steps of 10 ms). To identify frequency-specific interactions, we applied a continuous wavelet transformation (CWT; cwtfilterbank.m in MATLAB; wt.m performs the actual transformation into the frequency domain) for 64 frequencies (from 0.1 Hz to 40 Hz). With the L1-normalization implemented in the algorithm, oscillatory components of equal amplitude across different scales have equal magnitude in the CWT, providing a more accurate depiction of the signal.
Our multivariate analysis capitalized on the inherent ability of GCMI to estimate dependencies between multidimensional data (Ince et al., 2017). We included three source time series per parcel and two speech signals (envelope and derivative), making the estimation multivariate (a 3 × 2 analytical framework) for each parcel and frequency. While we are not aware of any multivariate methods quantifying speech-brain coupling in the frequency domain, multivariate methods based on regression have been used in the time domain for this type of analysis (Crosse et al., 2016; Daube et al., 2019). Our multivariate framework differs from these methods in three important ways. First, as mentioned, our method operates in the frequency domain and not the time domain. Second, our analysis is based on mutual information and is therefore sensitive to nonlinear dependencies between speech and brain signals, in contrast to linear regression models. Third, regression models work on n × 1 data, meaning that decoding models reconstruct 1-dimensional stimulus data from n-dimensional brain data, or encoding models predict 1-dimensional brain data from n-dimensional stimulus data. Our framework operates on n × m data and allows the quantification of dependencies between n-dimensional stimulus data and m-dimensional brain data.
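The n × m estimation can be sketched as Gaussian-copula MI between an n-dimensional stimulus and an m-dimensional brain signal (following Ince et al., 2017); note that the published GCMI implementation additionally applies a small-sample bias correction, omitted in this sketch:

```python
import numpy as np
from scipy.special import ndtri   # inverse standard normal CDF
from scipy.stats import rankdata

def copnorm(x):
    # Copula normalization: rank each row (variable), map ranks to
    # uniform (0, 1), then through the inverse Gaussian CDF.
    r = np.apply_along_axis(rankdata, 1, x)
    return ndtri(r / (x.shape[1] + 1.0))

def gcmi(x, y):
    # Gaussian-copula MI (in bits) between x (n_dims_x, n_samples)
    # and y (n_dims_y, n_samples).
    x, y = copnorm(x), copnorm(y)
    p = x.shape[0]
    cxy = np.cov(np.vstack([x, y]))
    _, ld_xy = np.linalg.slogdet(cxy)
    _, ld_x = np.linalg.slogdet(cxy[:p, :p])
    _, ld_y = np.linalg.slogdet(cxy[p:, p:])
    # I(X;Y) = 0.5 * log(|Cx||Cy| / |Cxy|), converted from nats to bits
    return 0.5 * (ld_x + ld_y - ld_xy) / np.log(2)
```

For spectrally resolved coupling, such an estimator is typically applied at each wavelet frequency with the real and imaginary parts of the coefficients stacked as additional dimensions.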
For the GCMI estimation we additionally computed 500 surrogate MI values on the basis of random temporal shifting of the speech signal with respect to the source time series, with circular wrapping around the edges as proposed by Andrzejak et al. (2003). This way, we created a distribution of 500 surrogate MI values for each frequency, delay, and parcel. From the surrogate distribution we obtained normalized MI values for each frequency and parcel by subtracting the mean of the distribution and dividing by its standard deviation, thus correcting for auto-correlation across frequencies. We obtained normalized MI values for each participant, frequency, delay, and parcel. Significance of normalized MI values at the group level was determined with cluster-based permutation tests (Maris and Oostenveld, 2007) comparing empirical MI values with the 95th percentile of the 500-sample surrogate distribution. We note that in statistical comparisons maximum MI values across delays were used. This involved multiple steps: a series of one-tailed t-tests of individual MI per parcel and frequency were conducted and thresholded at p = 0.05, comprising 64 × 61 × 362 comparisons (Frequencies × Delays × Parcels). To control for multiple testing, a non-parametric cluster-based permutation test was applied: spectrally and spatially adjacent data were clustered together and assigned a cluster-level statistic, the sum of t-values within each cluster. Each cluster was then subjected to significance testing through Monte-Carlo approximation. For that, individual MI spectra were randomly interchanged with the 95th percentile of the initial surrogate distribution and t-tests were recomputed at the cluster level. This procedure was applied 5000 times, and the original cluster statistics were compared with the distribution of the 5000 randomized null statistics. Significance for the original clusters was reached when they exhibited a higher test statistic than 95% of the randomized null data.
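The surrogate normalization can be sketched generically: a coupling statistic is recomputed under random circular shifts of the speech signal, and the observed value is z-scored against the resulting null distribution (`stat_fn` below is a stand-in for the GCMI computation):

```python
import numpy as np

def surrogate_normalized(stat_fn, speech, brain, n_surr=500, rng=None):
    # Null distribution from circular time-shifts of the speech signal:
    # wrapping around the edges preserves its autocorrelation while
    # destroying alignment with the brain signal (Andrzejak et al., 2003).
    rng = np.random.default_rng() if rng is None else rng
    n = speech.shape[-1]
    surr = np.array([stat_fn(np.roll(speech, rng.integers(1, n), axis=-1),
                             brain) for _ in range(n_surr)])
    # z-score: subtract the surrogate mean, divide by its std
    return (stat_fn(speech, brain) - surr.mean()) / surr.std()
```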

Non-negative Matrix Factorization
As MI values are inherently positive, we sought to further describe the spatial, spectral, and temporal characteristics of cortical responses to continuous speech. To this end, we applied non-negative matrix factorization (NMF) to mutual information values across parcels, frequencies, and delays. NMF factors an n-by-m matrix A into W (n-by-k) and H (k-by-m) factors, such that the root mean square error between A and W×H is minimized (Berry et al., 2007). This method has been used previously in speech processing (Hamilton et al., 2018) and for identifying distinct spectral and spatial patterns in source-reconstructed data (Ince et al., 2015; Kluger and Gross, 2020; Schoffelen et al., 2017), as it extracts features that reduce the dimensionality of the data while preserving distinct profiles across a predefined number of components. The optimal number of components (k) to be extracted from the NMF was determined with a rank optimization using singular value decomposition (Qiao, 2015), resulting in 15 components.
NMF is implemented as an iterative optimisation whose convergence may vary with the random initial values. To ensure reproducibility, we repeated the NMF estimation 300 times. Each time, the NMF was initiated using the multiplicative algorithm with 10 iterations, and the best solutions based on residuals were used as starting points for 1000 more NMF iterations using the alternating least squares algorithm. This combination of algorithms exploits the computational efficiency of the multiplicative algorithm and the stable convergence of alternating least squares. NMF components of Frequencies × Delays × Parcels were estimated per subject. To identify group-level consistent effects, we applied a cluster-based permutation statistical analysis of the spectral profiles for each component. Statistical significance was determined with one-sample t-tests, after permuting the Frequencies × Delays × Parcels matrix 5000 times and thresholding at p < 0.05 (FDR-corrected).
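The restart-then-refine logic can be approximated with scikit-learn, which offers multiplicative-update ('mu') and coordinate-descent ('cd') solvers rather than MATLAB's multiplicative/alternating-least-squares pair, so this sketch substitutes 'cd' for the refinement stage:

```python
import numpy as np
from sklearn.decomposition import NMF

def two_stage_nmf(A, k, n_restarts=10):
    # Stage 1: short multiplicative-update runs from random starts;
    # keep the factors with the smallest reconstruction residual.
    best = None
    for seed in range(n_restarts):
        m = NMF(n_components=k, init='random', solver='mu',
                max_iter=10, tol=0.0, random_state=seed)
        W = m.fit_transform(A)
        if best is None or m.reconstruction_err_ < best[0]:
            best = (m.reconstruction_err_, W, m.components_)
    # Stage 2: refine the best candidate with a longer run.
    _, W0, H0 = best
    m = NMF(n_components=k, init='custom', solver='cd', max_iter=1000)
    W = m.fit_transform(A, W=W0.copy(), H=H0.copy())
    return W, m.components_
```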

Multivariate speech tracking: Differential modulation across cortical areas
Before estimating speech-brain tracking, we first quantified the amount of variance that the principal component time series capture after the PCA of source time series within each parcel. Specifically, in each parcel we stacked the time series (corresponding to the three source orientations [x, y, z]) across all voxels in this parcel and computed PCA components of this matrix. We used the grand average of variance per component, parcel, and participant. Figure 2 summarizes the main findings. The first component captures maximum variance (up to 80%) in early auditory areas but only around 40% of the total variance in higher-association, frontal, and motor areas (Figure 2a). Further investigation of the variance explained for the primary auditory and motor cortex reveals that at least the first three components contribute non-negligible variance to the total signal (Figure 2b). More importantly, the first three components together capture 66-100% of the total variance across cortical space (Figure 2c).
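The per-parcel variance computation can be sketched as PCA via an SVD of the stacked, mean-centred voxel-by-orientation time series:

```python
import numpy as np

def parcel_explained_variance(voxel_ts):
    # voxel_ts: (n_voxels * 3 orientations, n_samples) time series
    # stacked for one parcel. Returns the fraction of total variance
    # explained by each principal component, in descending order.
    X = voxel_ts - voxel_ts.mean(axis=1, keepdims=True)
    _, s, _ = np.linalg.svd(X, full_matrices=False)
    var = s ** 2
    return var / var.sum()
```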
Thus, we simultaneously used three time series to represent each brain area. The resulting time series represent the optimal (in the sense of explained variance) three-dimensional representation of brain activity in a given parcel, resulting from a weighted mixing of time series across different voxels and orientations. For the speech signal, we used the speech envelope and its derivative. Although the speech envelope is critical for comprehension, recent evidence suggests that acoustic landmarks (local maxima in the envelope's rate of change) are encoded in the human superior temporal gyrus (Hertrich et al., 2012; Oganian and Chang, 2019). Thus, along with the speech envelope, we included the rate of amplitude change, estimated by the first derivative. In summary, we followed an information-theoretic approach building on previously validated and systematically compared approaches (Gross et al., 2021), quantifying multivariate mutual information (MI) between speech signals (Env and EnvD) and three source estimates per parcel (3 PCA components from each of 362 parcels; parcellation by Glasser et al. (2016)).
As we were interested in frequency-specific speech tracking, we transformed source estimates and speech signals with a continuous wavelet transform from 0.1 to 40 Hz before computing MI. We tested phase alignment of cortical areas to speech signals and compared it to the 95th percentile of a surrogate distribution (see Methods).
Figure 3 summarizes the group-level results obtained with our multivariate MI analysis. First, areas that are phase-aligned to speech signals (0.1-15 Hz) were localized to broad clusters including areas in temporal, parietal, frontal, and motor cortex in both hemispheres (shown as t-values, p < .05, cluster-corrected; see Figure 3c and the Methods section). MI spectra show stronger phase alignment in the right hemisphere. Our approach replicates previous findings showing phase alignment of cortical oscillations in low δ and θ bands (1 and 6 Hz), lateralized to the right hemisphere (Boemio et al., 2005; Gross et al., 2013; Luo and Poeppel, 2007), and extends those findings by revealing differential modulations of cortical oscillations across frequencies and areas. Before we describe these differential modulations in more detail, we first compare the multivariate approach to a standard univariate approach for a better understanding of the information gained with multivariate data.

Neural variability in cortical components captures speech-tracking
We hypothesized that accounting for higher-dimensional representations in each brain area would improve our estimate of coupling between neural and speech signals in frequencies and areas relevant to speech tracking, compared to univariate analyses. We further hypothesized that this improvement is not uniform across frequencies and cortical areas. Instead, we expected that brain areas showing more complex activity patterns would benefit more from the higher dimensionality used in the multivariate analysis. Therefore, we computed cortical maps quantifying the difference between multivariate and univariate analysis. The analysis is based on a statistical comparison of the multivariate MI maps used in the previous section (based on three PCA components) and univariate results where only the first PCA component is used for each anatomical parcel.
In Figure 4a, we show the normalized MI values for the multivariate and univariate approaches (top panels) and the corresponding t-spectra from parcels where coupling with the speech envelope is significantly higher in the multivariate compared to the univariate analysis (left hemisphere: 0.1-12 Hz, 52 parcels; right hemisphere: 0.1-14.1 Hz, 70 parcels; group statistics; p < 0.05, cluster-corrected; bottom panels). Visual inspection of the t-value spectra indicates that the benefit of our multivariate approach differed across parcels. To further unravel their spectral characteristics, we proceeded with an unsupervised clustering of spectra into an optimal number of clusters (k = 3; see Methods). In Figure 4b we show the spectra of the three corresponding clusters (bottom) and their rendering on the cortical surface (top). We find that bilateral auditory and left motor areas (Cluster #1) show increased MI values both at low frequencies (peaks at 0.8 and 1.2 Hz) and higher frequencies (peak at 7 Hz). Bilateral motor, temporal, and frontal areas exhibit a spectral peak at 7 Hz (Cluster #2), whereas prefrontal and visual areas show no peak within the significant frequencies (Cluster #3).
Thus, we find that the investigation of speech tracking benefits from the inclusion of more components in the estimation, as this captures more neuronal activity relevant to speech tracking. At the same time, there is no disadvantage to applying multivariate speech tracking analysis since, as expected, we did not find significantly reduced speech tracking for multivariate compared to univariate coupling in any brain parcel.

Rate of amplitude change in the envelope drives low-frequency oscillations
Having established the benefit of using multivariate brain signals for speech tracking analysis, we proceeded with a similar analysis for the speech signal. Figure 1 illustrates that our multivariate approach represents the speech signal with both the speech envelope (Env) and the derivative of the envelope (EnvD). Here, we aimed to contrast the individual contributions of both signals (Env and EnvD) to speech tracking. As rapid changes in the speech envelope (corresponding to peaks in EnvD) provide important temporal cues, we hypothesized that these acoustic landmarks coordinate excitatory states of ongoing neural activity by re-aligning the phase of low-frequency oscillations. This realignment would not be evident considering only the amplitude modulation of speech represented in Env. Thus, we expected to find increased phase alignment of slow oscillatory activity when using EnvD compared to Env.
Consequently, we statistically compared normalized MI values based on Env to those based on EnvD using a two-tailed nonparametric cluster test (see Methods). The results of this analysis are summarized in Figure 5. We found that acoustic landmarks (EnvD) modulate the phase of low-frequency oscillations in temporal areas of the right hemisphere significantly more strongly than the speech envelope (Env; 0.1-0.8 Hz; group statistics; p < 0.05, cluster-corrected; see Figure 5b). Interestingly, this effect is localized to these low frequencies, whereas higher frequencies (0.8-20 Hz) show the opposite effect: here, brain activity shows higher phase synchronization with the speech envelope than with its derivative.
In sum, we find evidence that acoustic landmarks in the broadband speech signal provide temporal cues to ongoing low-frequency oscillations (below 1 Hz) in right auditory areas, indicating a phase realignment. For the rest of the spectrum (0.8-20 Hz), we find that the phase of cortical oscillations is modulated more strongly by the speech envelope.

Cortical coupling to speech consists of multiple spectral modes across delays
Speech tracking has been extensively studied with respect to the temporal cortex (Ding and Simon, 2014; Hamilton et al., 2018; Oganian and Chang, 2019; Yi et al., 2019). However, other brain areas in the frontal, parietal, and motor cortex have been implicated as well (Assaneo and Poeppel, 2018; Park et al., 2015). Different areas are likely engaged at different frequencies and at different delays relative to the incoming speech stream, reflecting bottom-up and top-down processes at different timescales.
We aimed to utilize the superior performance of our multivariate approach (compared to univariate analysis) to comprehensively characterize the spatial and spectral structure of speech tracking across delays. We applied our multivariate MI approach (using three PCA components for each parcel and Env and EnvD for speech, see Figure 1) for 61 delays between brain and speech signals (brain signal following speech signal with a delay of −300 to 300 ms in steps of 10 ms). As MI values are inherently positive, we applied a non-negative matrix factorization (NMF) to mutual information spectra across participants. This way, we sought to reduce the dimensionality of our data (Parcels × Frequencies × Delays) into a fixed number of spatial components, each exhibiting distinct neural tracking at each time lag. This approach is motivated by NMF's inherent clustering property (Ding et al., 2005; Lee and Seung, 1999), without a priori assumptions about the spectral or topological properties of the components. Using rank optimization, we found that our speech-brain mutual information spectra can be divided into an optimal number of 15 anatomical components (see Methods). We used one-sample non-parametric statistical testing to identify consistent effects at the group level. The results of this analysis are summarized in Figure 6 (p < .05, FDR-corrected). We note that Supplementary Figure 2 shows an identical figure but with a constant y-axis for the t-value spectra across all components.
Each subpanel corresponds to one component. The localisation of each component is displayed in the cortical rendering together with a frequency-delay map of group-level t-values. To illustrate how speech-tracking spectra change with delay, we plot t-value spectra for selected delays (10, 100, 200, and 300 ms; see Supplementary Figure 1 for GCMI spectra). A number of observations can be made.
First, NMF components show a striking variability of spectral patterns and delay dependencies. This variability suggests the existence of multiple, partly independent processes that go well beyond the two speech tracking channels at δ and θ frequencies predominantly described in the literature (Ding and Simon, 2014).
Second, t-values at low frequencies show a consistent profile independent of delay. This is expected, because our speech-tracking measure captures phase synchronization at any given delay [e.g., 50 ms leads to very small changes in phase difference for low-frequency signals (such as 1 Hz) compared to higher-frequency signals (such as 5 Hz)]. Still, low frequencies are informative and show three different patterns across components. Components localized partly in parietal and temporal areas show no consistent peak at frequencies below about 2 Hz (see Components 6, 7, 8, 9, and 12). In contrast, components with contributions from motor areas show strong speech tracking at these frequencies, peaking at the lowest computed frequency (Components 10, 11), even at negative delays (e.g., Component 10; see the frequency-delay map). Still other components show a distinct peak at frequencies below 3 Hz and are mostly localized in the temporal cortex and around the lateral fissure (Components 2, 3, 4, 13, and 14).
Third, two types of delay dependencies can be seen, mostly at frequencies of about 5 Hz and above, and are clearly evident in Components 2 and 8. While Component 8 shows the highest t-values for longer delays (300 ms) at frequencies between about 5-10 Hz, Component 2 shows the highest t-values for short delays. More generally, components reflecting the strongest speech tracking at short delays are localized near early auditory areas (Components 1, 2, 3, and 12), also extending to motor areas (Component 5). Components with maxima at longer delays include higher-order areas in the left frontal and bilateral parietal cortex (Components 7, 15). However, NMF also identifies components with a preference for long delays in the temporal cortex (Components 3, 4, 8, and 9).
All in all, we report that speech tracking is associated with partly distinct spectral components operating in lower- and higher-association areas with distinct faster and slower temporal modulations.

Discussion
"Speech tracking" operates at multiple timescales and arguably reflects both feed-forward and top-down processing, while orchestrating computations towards understanding. This study aimed to provide a novel multivariate framework in which neural variability from multiple cortical areas, along with the rate of speech signal amplitude change, is utilized for the assessment of speech-brain coupling. After establishing the information gain of the multivariate approach in speech-relevant timescales in auditory, frontal, and motor areas, we proceeded with a data-driven spectro-temporal characterization of cortical components coupled to the acoustic envelope. In summary, consistent with previous reports (Ding and Simon, 2014; Gross et al., 2013; Poeppel and Assaneo, 2020), we find auditory components in two distinct timescales at low frequencies [δ (0.6-3 Hz) and θ (4-7 Hz)], and we extend those findings by providing a characterisation of speech-brain coupling across the cortex, exhibiting higher-association bilateral motor, frontal, and temporal components operating in distinct temporal and frequency channels.
Our novel multivariate analysis was applied to investigate speech-brain coupling, but we propose that it is of relevance to any study using atlas-based source localisation. We find that representing the activity of a brain area with a single time series derived from dimension-reduction techniques (such as PCA) is generally not sufficient, as the first principal component is computed to provide the linear combination of the original time series (all time series along all three orientations of all voxels/vertices in a parcel) that explains most of the total variance. Since signal amplitude in MEG/EEG is strongest at low frequencies, the first SVD component mostly accounts for low-frequency activity at the expense of higher frequencies. Therefore, multivariate analysis can preserve higher-frequency components that are lost in the univariate case. Indeed, Cluster 2 in Figure 4b shows a distinct boost of MI at frequencies between 5-10 Hz.
However, even frequencies below 5 Hz benefit significantly from the multivariate approach, indicating the existence of several independent components in the data. Here we chose a three-dimensional representation per brain area, but future studies will need to assess the optimal dimensionality for a multivariate approach and its dependence on the size and location of the respective brain area. Similarly, it will be interesting to compare multivariate mutual information to other multivariate methods such as multivariate pattern dependence, distance correlation, representational connectivity analysis, or canonical correlation analysis (Basti et al., 2020; de Cheveigné et al., 2018; Lankinen et al., 2014). We report significantly higher stimulus-brain coupling (as measured with GCMI) compared to univariate analysis across 122 out of 362 brain areas and, as expected, no significantly reduced coupling in any brain area (Figure 4). Therefore, we can confidently posit that there is no drawback to using multivariate analysis for studying speech tracking, a finding that awaits replication in other MEG and EEG data. Instead, significantly increased performance can be seen for all frequencies up to 15 Hz and in temporal, parietal, and frontal brain areas. Regarding superior temporal areas, we note that, as shown in Figure 3b, a univariate analysis might still be sufficient, as we did not find significant improvement with the multivariate approach there. Higher-order areas are those that might benefit more, as seen in the lower variance captured by their first principal component (Figure 2, Panel A), possibly related to higher gradients of myelin density reflected in hierarchical gradients of timescales (Chien and Honey, 2020; Gao et al., 2020; Glasser and Van Essen, 2011) and thus indicative of more multifaceted relations to speech processing.
We have similarly extended the multivariate approach to the speech signal by including the speech envelope and its derivative. Recent studies have demonstrated that the speech envelope and its derivative account for partly different components of the speech signal and of neural activity during listening (Brodbeck et al., 2018; Daube et al., 2019). Whereas the envelope quantifies the instantaneous overall amplitude of speech, the derivative quantifies the rate of change of speech amplitude. The derivative is potentially informative since acoustic onset edges cue syllabic nucleus onsets, their slopes are related to syllabic stress (Oganian and Chang, 2019), and they are represented in the auditory cortex (Brodbeck et al., 2020). Moreover, sharp amplitude modulations of the speech signal provide acoustic indices for segmentation and encoding of continuous speech at faster timescales (Doelling et al., 2014; Oganian and Chang, 2019). We speculate that acoustic edges coded in the envelope's derivative could serve as an update of sensory gain for the attended speech (Obleser and Kayser, 2019). Supporting this notion, we found that the derivative was more informative than the envelope at around 0.6 Hz. Low-frequency oscillations have been traditionally associated with sensory selection (Schroeder and Lakatos, 2009), attentional orientation (Lakatos et al., 2013, 2008), and temporal anticipations (Herbst and Obleser, 2019; Stefanics et al., 2010), which, in line with the active sensing framework (Bajcsy, 1988; Bajcsy et al., 2018; Prescott et al., 2011; Schroeder et al., 2010), can facilitate language processing (Giraud, 2020; Kandylaki and Kotz, 2020; Meyer et al., 2020). At this frequency (around 0.5 Hz), neural tracking was also found to be disrupted after altering the temporal distribution of speech pauses (Kayser et al., 2015). Future studies would need to assess the extent to which endogenous δ-band activity is associated with acoustic fluctuations of speech for sensory (Boucher et al., 2019) and linguistic (Bourguignon et al., 2013; Keitel et al., 2018) chunking of continuous speech.
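For illustration, a minimal single-band version of the two acoustic features could look as follows; the study's actual extraction pipeline may differ (e.g., averaging sub-band envelopes from an auditory filterbank), so this is only a sketch.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def envelope_and_derivative(speech, fs, lp_hz=50.0):
    """Wideband amplitude envelope via the Hilbert transform (low-pass
    filtered), plus its first temporal derivative (rate of amplitude change)."""
    env = np.abs(hilbert(speech))
    b, a = butter(3, lp_hz / (fs / 2), btype="low")
    env = filtfilt(b, a, env)
    deriv = np.gradient(env) * fs  # amplitude change per second
    return env, deriv

# Toy example: a 200 Hz tone amplitude-modulated at a 4 Hz "syllabic" rate.
fs = 1000
t = np.arange(0, 2, 1 / fs)
carrier = np.sin(2 * np.pi * 200 * t)
am = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))
env, deriv = envelope_and_derivative(carrier * am, fs)
```

Here the recovered envelope closely follows the 4 Hz modulator, while the derivative peaks at its rising edges, the acoustic onset cues discussed above.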
How complex sounds such as speech are integrated and processed in the human brain remains a central question in neuroscientific research (Brodbeck et al., 2018; Ding et al., 2016; Jin et al., 2020). To that end, analyses of both acoustic [i.e., spectrotemporal modulations; (Daube et al., 2019; Hullett et al., 2016)] and category-specific [i.e., linguistic elements; (Davis and Johnsrude, 2003; Di Liberto et al., 2019, 2015; Gwilliams et al., 2020)] computations are potentially informative. How these, presumably parallel (Hamilton et al., 2021), computational streams interact remains unclear. Our multivariate characterization of speech-brain coupling across temporal delays, cortical areas, and frequencies adds to the understanding of the stimulus-driven hierarchical organization of speech processing. We show varying coupling delays, ranging from short negative to longer positive ones, going beyond the fixed delay of around 100 ms previously reported in similar studies (Brodbeck et al., 2018; Broderick et al., 2019; Ding and Simon, 2014; Gross et al., 2013). We speculate that the short negative delays found in motor areas (see Components 10, 11) might be indicative of predictive processes that serve as a top-down facilitation of early auditory processing during speech comprehension (Park et al., 2015). We note that while the GCMI used here for assessing speech tracking preserves sensitivity despite changes in phase distributions (Gross et al., 2021; Ince et al., 2017), a natural concern is whether the degree of temporal smoothing (and hence the precision of delay estimates) varies as a function of frequency and consequently confounds our findings. For instance, it is apparent in Figure 5 that at lower frequencies the analyses appear relatively insensitive to changes in delay, whereas they become more sensitive at higher frequencies. That said, delayed GCMI has previously been successful in recovering ground-truth delays from broadband-filtered time series at various frequencies (Daube et al., 2022). As previously reported, we observe two main temporal timescales (δ and θ) in auditory areas (Donhauser and Baillet, 2020; Teng and Poeppel, 2020), extending to frontal and motor areas and showing distinct computations across cortical areas, as previously reported for primary and nonprimary auditory areas (Norman-Haignere and McDermott, 2018). In a similar vein, substantially longer integration windows (longer than 200 ms) were found in non-primary compared to primary auditory areas. Interestingly, we observed low-δ tracking of the acoustic envelope in motor areas (see Components 5, 13, and 15), possibly reflecting excitability-related activity during segments of high sensory gain (Kayser et al., 2015). Here, speech-brain coupling was assessed during continuous speech, without dissociating temporal segments of speech. For example, a spatial dissociation in the STG has been reported between speech onsets (i.e., the start of a sentence) and sustained speech (Hamilton et al., 2018). In the future, it would be interesting to investigate whether such a distinction is reflected in excitability-related responses within the δ and θ frequency bands during attentive speech.
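The delay scan underlying such estimates can be sketched in a simplified one-dimensional form. Below, a ground-truth 100 ms stimulus-to-response delay is recovered as the lag maximizing Gaussian-copula MI; this is toy data with our own function names, whereas the study's analysis is multivariate and frequency-resolved.

```python
import numpy as np
from scipy.special import ndtri  # inverse standard normal CDF

def copnorm(x):
    """Copula-normalize a 1-D signal: ranks mapped to a standard normal."""
    ranks = np.argsort(np.argsort(x))
    return ndtri((ranks + 1) / (len(x) + 1))

def lagged_mi(stim, resp, fs, max_lag_ms=300):
    """Scan stimulus-to-response delays; 1-D Gaussian-copula MI (bits) per lag."""
    lags = np.arange(0, int(max_lag_ms / 1000 * fs) + 1)
    mi = []
    for lag in lags:
        n = len(stim) - lag
        a = copnorm(stim[:n])
        b = copnorm(resp[lag:lag + n])
        r = np.corrcoef(a, b)[0, 1]
        mi.append(-0.5 * np.log2(1 - r**2))
    return lags / fs * 1000, np.array(mi)

# Ground-truth check: the response is the stimulus delayed by 100 ms plus noise.
fs = 200
rng = np.random.default_rng(2)
stim = rng.normal(size=6000)
delay = int(0.1 * fs)  # 20 samples = 100 ms
resp = np.r_[np.zeros(delay), stim[:-delay]] + rng.normal(size=6000)
lags_ms, mi = lagged_mi(stim, resp, fs)
best = lags_ms[np.argmax(mi)]
```

Because the toy stimulus is white noise, the MI peak is sharp at the true lag; for narrowband-filtered signals the peak would broaden, which is precisely the frequency-dependent smoothing concern raised above.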

Figure 2. Explained variance of principal components in cortical space. (A) Grand-average cortical maps depicting the amount of variance (%) explained by components 1-6 after principal component analysis of the source time series of all voxels within a parcel. (B) Amount of total variance (%) explained per participant in the Left Primary Auditory Cortex (Upper Left), Right Primary Auditory Cortex (Upper Right), Left Primary Motor Cortex (Bottom Left), and Right Primary Motor Cortex (Bottom Right). (C) Cortical maps of the grand-average cumulative variance explained per parcel with 2 to 6 components (from top to bottom).

Figure 3. Significant speech tracking. (A) Upper graphs show the averaged mutual information spectra of significant parcels for the left (Left) and right (Right) hemisphere after a non-parametric cluster permutation test, comparing normalized mutual information values with the 95th percentile of the surrogate distribution. For the comparison, maximum MI values across computed delays were selected. The shaded area depicts the bootstrapped standard deviation around the mean. Accordingly, lower graphs show t-values belonging to significant clusters (see Methods; n = 184, p < .05). Lines depict t-values for each parcel that are statistically significant (p < .05, cluster corrected) and bold lines depict the average. (B) Grand-average of MI across significant parcels from panel (A) and frequencies at a fixed delay of 100 ms. (C) Cortical maps of t-values averaged across significant frequencies.

Figure 4. Multivariate versus univariate speech-brain coupling. (A) Mutual information spectra of significant parcels with one PCA component (univariate; black) and three PCA components (multivariate; red), averaged across participants. Here, multivariate and univariate refer to the neural time course at each parcel [three PCA components (multivariate); one PCA component (univariate)]. The shaded area depicts the bootstrapped standard deviation around the mean. On the bottom, t-values of significant clusters for the left and right hemisphere. Thin lines depict t-values belonging to significant clusters (see Methods; n = 122, p < .05, cluster corrected) for each parcel, and thick lines represent the average. For the comparison, maximum MI values across computed delays were selected. (B) Cortical maps of averaged t-values across frequencies. Significant areas and spectra were clustered with k-means (k = 3; see Methods). The colormap indicates the clusters on the surface. On the right, mutual information spectra clustered with k-means, extracted from significant t-values.

Figure 5. Comparing speech tracking using the derivative vs the envelope. (A) Top row: averaged mutual information spectra of significant parcels for the envelope (green) and the derivative of the envelope (black). The shaded area depicts the bootstrapped standard deviation around the mean. Bottom: t-value spectra for the left and right hemisphere. Only t-values that reached significance at the cluster level are shown (n = 33, p < .05, cluster corrected; see Methods). For the comparison, maximum MI values across computed delays were selected. (B) Cortical maps of t-values averaged across frequencies where derivative > envelope.

Figure 6. Spectral modes of speech coupling across delays: MI spectra clustered into 15 components as estimated from non-negative matrix factorization (NMF). Each subpanel (1-15) represents a component of mutual information spectra across delays. For each component, we illustrate the cortical rendering of the component (top left), the frequency-delay plot (top right), and t-value spectra for 4 delays (10, 100, 200, 300 ms; bottom). In the center of the figure, we plot the cortical rendering of all estimated components with the same color code as in subpanels (1-15).