Contributions of local speech encoding and functional connectivity to audio-visual speech integration

Seeing a speaker's face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR, speech encoding by entrained brain activity was strong in temporal and inferior frontal cortex, whereas during low SNR strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioural benefit arising from seeing the speaker's face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-motor interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.


Introduction
When communicating in challenging acoustic environments we profit tremendously from visual cues arising from the speaker's face. Movements of the lips, tongue or the eyes convey significant information that can boost speech intelligibility and facilitate the attentive tracking of individual speakers (Ross et al., 2007; Sumby and Pollack, 1954). This multisensory benefit is strongest for continuous speech, where visual signals provide temporal markers to segment words or syllables, and provide linguistic cues (Grant and Seitz, 1998). Previous work has identified the synchronization of brain rhythms between interlocutors as a potential neural mechanism underlying the visual en- […] enhancement remains unclear.

[Figure 1 caption, panels B-C] (B) Experimental conditions lasted 1 (SNR) or 3 (VIVN) minutes, and were presented in pseudo-randomized order. (C) Analyses were carried out on the band-pass filtered speech envelope and MEG signals. The MEG data were source-projected onto a grey-matter grid (LCMV beamformer). One analysis quantified speech entrainment, i.e. the mutual information (MI) between the MEG data and the speech envelope, and the extent to which this was modulated by the experimental conditions. A second analysis quantified directed functional connectivity (DI) between seeds and the extent to which this was modulated by the experimental conditions. A final analysis assessed the correlation of either MI or DI with word-recognition performance.
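The first preprocessing step named in the figure caption — extracting the speech amplitude envelope and band-pass filtering it alongside the MEG signals — can be sketched as follows. The delta-range band, filter order, and 150 Hz sampling rate are illustrative assumptions, not the paper's exact analysis settings.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def speech_envelope(audio, fs, band=(0.25, 1.0)):
    """Broadband amplitude envelope, band-pass filtered.

    The delta-range `band` and the filter order are illustrative
    assumptions, not the paper's exact settings.
    """
    env = np.abs(hilbert(audio))                      # amplitude envelope
    sos = butter(3, band, btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, env)                      # zero-phase band-pass

fs = 150.0                                            # MEG-style sampling rate
rng = np.random.default_rng(0)
audio = rng.standard_normal(int(10 * fs))             # 10 s toy "speech"
env = speech_envelope(audio, fs)
print(env.shape)
```

Second-order sections (`sosfiltfilt`) keep the filter numerically stable at such low normalized cutoff frequencies; zero-phase filtering avoids shifting the envelope relative to the MEG signal.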

Previous work has implicated many brain regions in the visual enhancement of speech, including […] one needs to manipulate both factors to fully address this question. Overcoming these problems, we capitalized on the statistical and conceptual power offered by naturalistic speech to study the network mechanisms that underlie the visual facilitation of speech perception.

Using source-localized MEG activity we systematically investigated how local speech representa- […] To study the brain activity underlying this behavioural benefit we analyzed source-projected MEG data using information-theoretic tools to quantify the fidelity of local neural representations of the speech envelope (speech-to-brain entrainment), as well as the directed causal connectivity between relevant regions. For both coding and connectivity, we (1) modelled the extent to which they were modulated by the experimental conditions and (2) asked whether they correlated with behavioural performance across conditions and with the visual benefit (VI-VN) across SNRs (Fig. 1C).
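The condition-modulation analysis — a GLM predicting condition-specific MI (or DI) from SNR, visual informativeness, and their interaction — can be sketched per voxel as follows. The condition grid, z-scoring of regressors, and toy MI values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical condition grid: 4 SNR levels x 2 visual conditions.
snr = np.tile([2.0, 4.0, 6.0, 8.0], 2)
vivn = np.repeat([1.0, 0.0], 4)          # 1 = visual informative, 0 = not

def zscore(x):
    return (x - x.mean()) / x.std()

# Design matrix: SNR, VIVN, their interaction, and an intercept.
X = np.column_stack([zscore(snr), zscore(vivn),
                     zscore(snr) * zscore(vivn), np.ones_like(snr)])

mi = np.random.rand(8, 50)               # toy data: 8 conditions x 50 voxels
beta, *_ = np.linalg.lstsq(X, mi, rcond=None)
print(beta.shape)                        # one coefficient row per regressor
```

Group-level significance of each coefficient map would then be assessed with the permutation statistics described in the Methods.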

Widespread speech-to-brain entrainment at multiple time scales

Speech-to-brain entrainment was quantified by the mutual information (MI) between the MEG time […]

Figure 3. Modulation of speech-to-brain entrainment by acoustic SNR and visual informativeness. Changes in speech entrainment with the experimental factors were quantified using a GLM for the condition-specific speech MI based on the effects of SNR (A), visual informativeness (VIVN) (B), and their interaction (SNRxVIVN) (C). The figures display the cortical-surface projection onto the Freesurfer template (proximity = 10 mm) of the group-level significant statistics for each GLM effect (FWE = 0.05). Graphs show the average speech MI values for each condition (mean ± SEM) for local and global (red asterisk) peaks of the T maps. Lines indicate the across-participant average regression model and numbers indicate the group-average standardized regression coefficient for SNR in the VI and VN conditions. (D) T maps illustrating the opposite SNR effects within voxels with significant SNRxVIVN effects. MI graphs for the peaks of these maps are shown in (C) (IFGor-R and SFG-R = global T peaks for SNR effects in VI and VN, respectively). (E) Location of global and local seeds of GLM T maps, used for the analysis of directed connectivity. (F) Correlation between condition-specific behavioural performance and speech MI (perform. r) and between visual enhancement of performance and MI (vis. enhanc. r; see inset) in pSTG-R and IFGt-R. Error bars = ± SEM. See also Tables 1 and 3.

[…] we observed significant speech-to-brain entrainment not only within temporal cortices but across […]
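Mutual information between a source-projected MEG time series and the speech envelope can be estimated in several ways; below is a minimal sketch using a Gaussian-copula estimator, which is in the spirit of estimators commonly used for speech entrainment but not necessarily the paper's exact implementation. It is a lower bound on the true MI.

```python
import numpy as np
from scipy.stats import norm, rankdata

def copnorm(x):
    """Rank-based transform of a 1-D signal to a standard normal."""
    return norm.ppf(rankdata(x) / (x.size + 1))

def gcmi_1d(x, y):
    """Gaussian-copula mutual information (nats) between two 1-D signals.

    A lower bound on the true MI; a sketch, not the paper's exact code.
    """
    cx, cy = copnorm(x), copnorm(y)
    r = np.corrcoef(cx, cy)[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)

rng = np.random.default_rng(0)
env = rng.standard_normal(3000)                 # toy speech envelope
meg = 0.4 * env + rng.standard_normal(3000)     # toy entrained MEG signal
print(f"MI (nats): {gcmi_1d(env, meg):.3f}")    # clearly above zero
```

The copula normalisation makes the estimate robust to the marginal distributions of the two signals, so only their rank dependence matters.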

Since visual benefits for perception emerge mostly when acoustic signals are degraded (Fig. 2) […]

We observed significant condition-averaged DI between multiple nodes of the speech network […]

The present study provides a comprehensive picture of how acoustic signal quality and visual con- […]
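Directed functional connectivity from one seed to another can be illustrated with a linear-Gaussian, Granger-style proxy for directed information: how much the sender's past improves prediction of the receiver's present beyond the receiver's own past. This is only a sketch of the general idea, not the paper's DI estimator, and the 5-sample lag is arbitrary.

```python
import numpy as np

def gaussian_di(x, y, lag=5):
    """Granger-style directed-information proxy (nats) from x to y.

    Linear-Gaussian sketch of directed connectivity, not the
    paper's estimator; `lag` (in samples) is illustrative.
    """
    yt, yp, xp = y[lag:], y[:-lag], x[:-lag]
    # Residuals of y_t given only its own past...
    res_r = yt - np.polyval(np.polyfit(yp, yt, 1), yp)
    # ...versus its own past plus the sender's past.
    A = np.column_stack([yp, xp, np.ones_like(yp)])
    res_f = yt - A @ np.linalg.lstsq(A, yt, rcond=None)[0]
    return 0.5 * np.log(res_r.var() / res_f.var())

rng = np.random.default_rng(1)
x = rng.standard_normal(5000)
y = 0.7 * np.roll(x, 5) + 0.5 * rng.standard_normal(5000)  # y lags x by 5
print(gaussian_di(x, y, lag=5) > gaussian_di(y, x, lag=5))  # forward > reverse
```

Because y is driven by the 5-sample-delayed x, the forward estimate is large while the reverse estimate stays near zero, capturing the asymmetry that distinguishes DI from undirected coupling measures.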

Entrained speech representations in temporal, parietal and frontal lobes

We observed functionally distinct patterns of speech-to-brain entrainment along the auditory path- […]

We observed significant intra-hemispheric connectivity between right temporal, parietal and […]

[…] acoustic quality and the visual relevance in a block design within each text (Fig. 1B). The visual relevance was manipulated by either presenting the video matching the respective speech (visual informative, VI) or presenting a 3 s babble sequence that was repeated continuously (visual not informative, VN), and which started and ended with the mouth closed to avoid transients. The signal-to-noise ratio (SNR) of the acoustic speech was manipulated by presenting the speech on a background cacophony of natural sounds and scaling the relative intensity of the speech while keeping the intensity of the background fixed. We used relative SNR values of +8, +6, +4 and +2 dB RMS intensity levels. The acoustic background consisted of a cacophony of naturalistic sounds, created by randomly superimposing various naturalistic sounds from a larger database (using about 40 sounds at each moment in time; Kayser et al., 2016). This resulted in a total of 8 conditions (four SNR levels; visual informative or not informative) that were presented in a block design (Fig. 1B). The SNR changed from minute to minute in a pseudo-random manner (12 one-minute blocks per SNR level). Visual relevance was manipulated within 3-minute sub-blocks. Texts were presented with self-paced pauses. Subjects performed a de- […]
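The stimulus construction described above — scaling the speech against a fixed-level background to reach a target RMS SNR — can be sketched as follows; the signal contents and lengths are toy assumptions.

```python
import numpy as np

def mix_at_snr(speech, background, snr_db):
    """Scale speech relative to a fixed-level background to reach a
    target RMS SNR in dB. A sketch of the stimulus construction;
    the signals here are placeholders, not the actual stimuli."""
    rms = lambda s: np.sqrt(np.mean(s ** 2))
    gain = 10 ** (snr_db / 20) * rms(background) / rms(speech)
    return gain * speech + background   # background intensity stays fixed

rng = np.random.default_rng(2)
speech = rng.standard_normal(48000)     # toy 1 s speech at 48 kHz
babble = rng.standard_normal(48000)     # toy background cacophony
for snr in (8, 6, 4, 2):                # the four relative SNR levels
    mixed = mix_at_snr(speech, babble, snr)
```

Scaling only the speech, with the background held constant, keeps the overall scene level roughly comparable across conditions while varying intelligibility.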

To quantify the entrainment of brain activity to the speech envelope we first determined the […] shuffled estimates (using the same randomization procedure as for MI). DI was computed for speech lags between 0 and 500 ms and brain lags between 0 and 250 ms, at steps of one sample (1/150 Hz). We estimated DI on the frequency range of 0.[…]

We used a permutation-based RFX approach to assess (1) whether an increase in condition-specific speech MI or DI was associated with an increase in behavioural performance, and (2) whether the visual enhancement (VI-VN) of MI or DI was associated with stronger behavioural gains. We focused on the 8 regions used as seeds for the DI analysis. For speech MI we initially tested whether the participant-average Fisher Z-transformed correlation between condition-specific performance and speech MI was significantly larger than zero. Uncorrected p-values were computed using the percentile method, while p-values corrected across regions (FWE = 0.05) were computed using maximum statistics. We subsequently tested the positive correlation between SNR-specific visual gains (VI-VN) in speech MI and behavioural performance using the same approach, but considered only those regions characterized by a significant condition-specific MI/performance association. For DI, we focused on those lags characterized by a significant SNR, VIVN, or SNRxVIVN DI modulation.
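The maximum-statistics FWE correction used above can be sketched as follows: for each permutation, the maximum statistic across regions is retained, and each observed value is compared against that null distribution. The null data and the choice of 8 regions are illustrative.

```python
import numpy as np

def max_stat_pvalues(obs, null):
    """FWE-corrected p-values across regions via maximum statistics.

    obs:  observed statistic per region, shape (n_regions,)
    null: permutation null distribution, shape (n_perm, n_regions)
    """
    max_null = null.max(axis=1)          # max over regions, per permutation
    return (max_null[None, :] >= obs[:, None]).mean(axis=1)

rng = np.random.default_rng(3)
null = rng.standard_normal((1000, 8))    # toy null for 8 seed regions
obs = np.array([4.0, 0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 0.1])
p_fwe = max_stat_pvalues(obs, null)
print(p_fwe.round(3))                    # only the strong first region survives
```

Taking the maximum over regions inside each permutation is what controls the family-wise error rate: a single threshold is calibrated against the most extreme value the whole family can produce under the null.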

Significance testing proceeded as for speech MI, except that Z-transformed correlations were com- […]

Figure S1. Entrainment of rhythmic MEG activity to the speech envelope. (A) Projection of significant speech MI maps, which quantify the entrainment of MEG source activity to the speech envelope, onto the Freesurfer template (FWE = 0.05; proximity = 10 mm; surface-projected significant MI maps rescaled within volume from the minimum significant MI to the 99.5th percentile of the surface projection). (B) Peak MI in the two hemispheres as a function of frequency (mean ± SEM).

Figure S2. Directed functional connectivity within the speech-entrained network. (A) Significant condition-averaged directed information (DI) values between all seed-target pairs as a function of the speech and brain lags. (B) Group-level statistical maps for the GLM effects on DI of acoustic signal quality (SNR), visual informativeness (VIVN) and their interaction.