Neuropsychologia, Volume 109, 31 January 2018, Pages 126-133

Electrophysiological evidence for audio-visuo-lingual speech integration

https://doi.org/10.1016/j.neuropsychologia.2017.12.024

Highlights

  • The EEG study examined whether visual tongue movements integrate with speech sounds.

  • A modulation of P2 AEPs was observed for AV compared to A + V conditions.

  • Dynamic and phonetic informational cues appear sharable across sensory modalities.

Abstract

Recent neurophysiological studies demonstrate that audio-visual speech integration partly operates through temporal expectations and speech-specific predictions. From these results, one common view is that the binding of auditory and visual, lipread, speech cues relies on their joint probability and prior associative audio-visual experience. The present EEG study examined whether visual tongue movements integrate with relevant speech sounds, despite little associative audio-visual experience between the two modalities. A second objective was to determine possible similarities and differences in audio-visual speech integration between the unusual audio-visuo-lingual and the classical audio-visuo-labial modalities. To this end, participants were presented with auditory, visual, and audio-visual isolated syllables, with the visual presentation showing either a sagittal view of the speaker's tongue movements or a facial view of the speaker's lip movements, previously recorded with an ultrasound imaging system and a video camera, respectively. In line with previous EEG studies, our results revealed an amplitude decrease and a latency facilitation of P2 auditory evoked potentials in both the audio-visuo-lingual and audio-visuo-labial conditions compared to the sum of the unimodal conditions. These results argue against the view that auditory and visual speech cues integrate solely on the basis of prior associative audio-visual perceptual experience. Rather, they suggest that dynamic and phonetic informational cues are sharable across sensory modalities, possibly through a cross-modal transfer of implicit articulatory motor knowledge.

Introduction

Audio-visual speech perception is a specific case of multisensory processing that interfaces with the linguistic system. Like most natural perceptual events in which information from different sensory sources is merged, bimodal integration of the acoustic and visual speech signals depends on their perceptual saliency, their spatial and temporal relationships, as well as their predictability and joint probability of occurrence (Campbell and Massaro, 1997, Jones and Munhall, 1997, Green, 1998, Schwartz et al., 2004). When combined with the acoustic speech signal, visual information from the speaker's face is known to enhance sensitivity to acoustic speech information by lowering the auditory detection threshold, and to improve auditory speech intelligibility and recognition, notably when the acoustic signal is degraded or noisy (Benoît et al., 1994, Grant and Seitz, 2000, Schwartz et al., 2004, Sumby and Pollack, 1954). Audio-visual speech perception is also known to facilitate the understanding of a semantically complex statement (Reisberg et al., 1987) or of a foreign language (Navarra and Soto-Faraco, 2005), and to benefit hearing-impaired listeners (Grant et al., 1998). Beyond the studies demonstrating a perceptual gain for bimodal compared to unimodal speech perception, one of the most striking pieces of evidence for audio-visual speech integration is the so-called McGurk illusion, in which adding incongruent visual movements interferes with auditory perception and creates an illusory speech percept (McGurk and MacDonald, 1976).

Complementing these psychophysical and behavioral findings, a number of neurophysiological studies have provided new advances in the understanding of audio-visual speech binding, its neural architecture and the time course of its neural processing. One major finding is that activity within both unisensory auditory and visual cortices as well as the posterior superior temporal sulcus (pSTS) is modulated during audio-visual speech perception when compared with auditory and visual speech perception (Calvert et al., 2000, Callan et al., 2003, Callan et al., 2004, Skipper et al., 2005, Skipper et al., 2007). Since the pSTS displays supra-additive and sub-additive haemodynamic responses during congruent and incongruent audio-visual speech perception, it has been proposed that visual and auditory speech cues are integrated within this heteromodal brain region (Calvert et al., 2000, Beauchamp et al., 2004). Complementing this finding, it has been consistently shown that adding lip movements to auditory speech modulates activity quite early in the supratemporal auditory cortex, with the latency and amplitude of N1/M100 and/or P2 auditory evoked responses attenuated and speeded up during audio-visual compared to unimodal speech perception (Arnal et al., 2009, Baart et al., 2014, Baart and Samuel, 2015, Besle et al., 2004, Frtusova et al., 2013, Ganesh et al., 2014, Hisanaga et al., 2016, Huhn et al., 2009, Kaganovich and Schumaker, 2014, Klucharev et al., 2003, Paris et al., 2016, Pilling, 2009, Schepers et al., 2013, Stekelenburg and Vroomen, 2007, Stekelenburg et al., 2013, Treille et al., 2014a, Treille et al., 2014b, Treille et al., 2017a, van Wassenhove et al., 2005, Vroomen and Stekelenburg, 2010, Winneke and Phillips, 2011; for a recent review and discussion, see Baart, 2016). The latency facilitation of auditory evoked responses, but not the amplitude reduction, also appears to be a direct function of the visemic information: the better the visual recognition of the syllable, the larger the latency facilitation (van Wassenhove et al., 2005, Arnal et al., 2009). In light of these studies, recent theoretical proposals postulate a fast, direct feedforward neural route between motion-sensitive and auditory brain areas that helps tune auditory processing to the incoming speech sound, thanks to the information available from the speaker's articulatory movements, which precede sound onset in these studies (Chandrasekaran et al., 2009; but see Schwartz and Savariaux, 2014), and a slower, indirect feedback pathway from the posterior superior temporal sulcus to sensory-specific regions that functions as an error signal between visual prediction and auditory input (Hertrich et al., 2007, Arnal et al., 2009).

The above-mentioned studies and theoretical proposals support the view that audio-visual speech integration partly operates through visually-based temporal expectations and speech-specific predictions. This can be encompassed in a more general Bayesian perspective, with auditory and visual speech cues likely integrated based on their joint probability distribution derived from prior associative audio-visual perceptual experience (for recent discussions, see van Wassenhove, 2013; Rosenblum et al., 2016). A number of experimental findings, however, pose a challenge to this probabilistic perceptual account. Indeed, bimodal speech interaction has been shown to occur not only for the well-known auditory and lipread, visuo-labial, modalities but also for other modalities with little, if any, associative perceptual experience.

A first example comes from a set of behavioral and electrophysiological studies showing that bimodal speech interaction can occur between the auditory and haptic modalities, even with participants inexperienced with the haptic speech modality. In these studies, orofacial speech gestures were felt and monitored through manual tactile contact with the speaker's face. When the auditory and haptic modalities were presented simultaneously, a felt syllable affected the judgment of an ambiguous auditory syllable, and vice versa (Fowler and Dekle, 1991). In the case of a noisy or degraded acoustic speech signal, adding the haptic modality enhanced recognition of the auditory speech stimulus (Gick et al., 2008, Sato et al., 2010a). A similar perceptual gain was also observed when adding the haptic modality to lipreading (Gick et al., 2008). Further, an audio-haptic McGurk-type illusion has also been observed (Fowler and Dekle, 1991; but see Sato et al., 2010a for inconclusive results). Finally, two recent electroencephalographic studies showed that N1/P2 auditory evoked potentials are speeded up and attenuated not only during audio-visuo-labial but also during audio-haptic speech perception, when compared to unimodal auditory perception (Treille et al., 2014a, Treille et al., 2014b). By providing evidence for cross-modal influences between the auditory and haptic modalities, for a perceptual gain for audio-haptic compared to unimodal speech perception, and for cross-sensory speech modulation of the auditory cortex, these studies draw an exquisite parallel between audio-visual and audio-haptic speech perception. Given that participants were inexperienced with the haptic speech modality, these findings clearly argue against the view that prior associative bimodal, and even unimodal, speech perceptual experience is needed for the two sensory sources to interact.

Other tactile stimuli can also affect heard speech. When a small, inaudible puff of air is applied in synchrony to the skin of participants' hands, neck (Gick and Derrick, 2009), or ankles (Derrick and Gick, 2013), aspirated and unaspirated syllables embedded in white noise are more often perceived as aspirated, causing participants to mishear /ba/ as /pa/, or /da/ as /ta/. These results suggest that perceivers integrate tactile-relevant information during auditory speech perception without prior training and even without frequent or robust location-specific experience. A final example comes from a study by Ito et al. (2009), who showed that the identification of ambiguous auditory speech stimuli can be modified by stretching the facial skin of the listener's mouth, using a robotic device that induced cutaneous/kinesthetic changes, and that these perceptual changes only occurred in conjunction with speech-like patterns of skin stretch. A subsequent study showed the reverse effect, with the somatosensory perception of facial skin stretch modified by auditory speech sounds (Ito and Ostry, 2012).

Altogether, these haptic and tactile instances of multisensory speech perception provide strong support for a supramodal view of multisensory speech perception. They nicely exemplify the way lawful and speech-relevant information from many distinct sources, including sources that are hardly used at all, can be extracted to give rise to an integrated speech percept. In an attempt to reconcile these findings with a Bayesian, associative probabilistic account of multisensory perception, speech theorists have argued that prior experience and learning should be sharable across modalities, and that the dynamic and phonetic informational cues available across sensory modalities partly derive from the listener's knowledge of speech production (Fowler, 2004, Rosenblum et al., 2016). This appears in line with the longstanding, albeit debated, proposal of a functional coupling between speech production and perception systems in the speaking and listening brain, and of a common currency between motor and perceptual speech primitives (Liberman et al., 1967, Liberman and Mattingly, 1985, Fowler, 1986, Liberman and Whalen, 2000, Galantucci et al., 2006, Skipper et al., 2007, Rauschecker and Scott, 2009, Schwartz et al., 2012, Skipper et al., 2016).

The present electroencephalographic (EEG) study capitalizes on these findings and theoretical proposals with the aim of determining whether visual tongue movements, which are audible but not visible in daily life, might integrate with relevant speech sounds. A second objective was to examine possible similarities and differences in audio-visual speech integration between the unusual audio-visuo-lingual and the classical audio-visuo-labial modalities. To this end, participants were presented with auditory, visual, and audio-visual isolated syllables, with the visual presentation showing either a sagittal view of the speaker's tongue movements or a facial view of the speaker's lip movements, previously recorded with an ultrasound imaging system and a video camera, respectively. In line with previous EEG studies, audio-visual integration was estimated using an additive model (i.e., AV ≠ A + V; for a recent review, see Baart, 2016) by comparing the latency and amplitude of N1/P2 auditory evoked potentials in both the audio-visuo-lingual and audio-visuo-labial conditions with the sum of those observed in the unimodal conditions.
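To make the additive-model logic concrete, the sketch below contrasts the P2 peak of a bimodal (AV) evoked response with the P2 peak of the summed unimodal (A + V) response for a single subject. This is an illustrative sketch only, not the authors' analysis pipeline: the sampling rate, epoch limits, the 150-250 ms P2 search window and the random arrays standing in for real averaged ERPs are all assumptions.

```python
import numpy as np

# Assumed epoching: -100 ms to +500 ms around acoustic onset, sampled at 500 Hz,
# one trial-averaged ERP per condition at a fronto-central channel (hypothetical).
SFREQ = 500.0
TIMES = np.arange(-0.1, 0.5, 1.0 / SFREQ)  # seconds relative to sound onset


def p2_peak(erp, times, window=(0.15, 0.25)):
    """Return (latency_s, amplitude) of the most positive peak in the assumed P2 window."""
    mask = (times >= window[0]) & (times <= window[1])
    idx = np.argmax(erp[mask])
    return times[mask][idx], erp[mask][idx]


def additive_model_contrast(erp_av, erp_a, erp_v, times):
    """Compare the bimodal response (AV) against the sum of the unimodal responses (A + V)."""
    erp_sum = erp_a + erp_v
    lat_av, amp_av = p2_peak(erp_av, times)
    lat_sum, amp_sum = p2_peak(erp_sum, times)
    return {"p2_latency_shift_ms": (lat_av - lat_sum) * 1000.0,   # negative = AV earlier
            "p2_amplitude_diff": amp_av - amp_sum}                # negative = AV attenuated


# Toy usage: random arrays stand in for one subject's averaged ERPs.
rng = np.random.default_rng(0)
erp_a, erp_v, erp_av = (rng.normal(0.0, 1.0, TIMES.size) for _ in range(3))
print(additive_model_contrast(erp_av, erp_a, erp_v, TIMES))
```

Under this reading, a negative latency shift corresponds to the P2 facilitation and a negative amplitude difference to the P2 attenuation reported for AV compared to A + V.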

Audio-motor associations for tongue movements are frequently experienced in daily life (for instance, when speaking or eating). However, despite this implicit articulatory motor knowledge of tongue movements, only a few recent studies have explored the influence of visual tongue movements on heard speech. Using virtual tongue movements or ultrasound images of tongue movements, they showed that visual tongue feedback can strengthen the learning of novel speech sounds (Katz and Mehta, 2015) and enhance and/or speed up auditory speech discrimination when compared with unimodal auditory or incongruent audio-visual speech perception (Badin et al., 2010, d'Ausilio et al., 2014). A recent functional magnetic resonance imaging (fMRI) study by our team further revealed that audio-visuo-lingual and audio-visuo-labial speech perception share a common sensorimotor neural network, with stronger motor and somatosensory activations observed during audio-visuo-lingual perception (Treille et al., 2017b). However, for all their importance, these studies did not reveal the time course at which auditory speech sounds and visual tongue movements may truly integrate. In keeping with the above-mentioned EEG studies on audio-visuo-labial speech integration, and taking advantage of the temporal resolution of EEG, electrophysiological evidence for early audio-visuo-lingual integration, as well as possible similarities between audio-visuo-lingual and audio-visuo-labial integration mechanisms, would argue against the view that auditory and visual speech cues integrate solely on the basis of prior associative audio-visual perceptual experience.

Section snippets

Participants

Eighteen healthy adults (11 females and 7 males; mean age 25 ± 7 years, range 20-52 years) participated in the study after giving their informed consent. All participants were right-handed according to a standard handedness inventory (Oldfield, 1971) and were native French speakers. They all had normal or corrected-to-normal vision and reported no history of hearing, speech or language disorders. The protocol was carried out in accordance with the

Accuracy (see Fig. 2, left)

Overall, the mean proportion of correct responses was 86%. As expected, the ANOVA revealed a main effect of modality (F(2,34) = 35.5; p < 0.001), with the percentage of correct responses for visual stimuli (74%) lower than for auditory (91%) and audio-visual stimuli (92%). Neither the main effect of articulator (F(1,17) = 0.2) nor the modality × articulator interaction (F(2,34) = 1.0) was significant, with no difference observed between labial and lingual stimuli whatever the modality (on average, A
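For readers who want to reproduce this kind of analysis, a 3 (modality) × 2 (articulator) repeated-measures ANOVA on per-subject accuracy could be set up as sketched below. This is not the authors' code: the data frame layout, factor labels and simulated accuracy values are assumptions introduced purely for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)

# Hypothetical long-format table: one mean accuracy per subject x modality x articulator
# cell (18 subjects x 3 modalities x 2 articulators = 108 rows); values simulated here.
rows = [{"subject": s, "modality": m, "articulator": a,
         "accuracy": rng.uniform(0.7, 1.0)}
        for s in range(1, 19)
        for m in ("A", "V", "AV")
        for a in ("lips", "tongue")]
df = pd.DataFrame(rows)

# 3 (modality) x 2 (articulator) repeated-measures ANOVA on accuracy.
res = AnovaRM(df, depvar="accuracy", subject="subject",
              within=["modality", "articulator"]).fit()
print(res)
```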

Discussion

The present EEG study investigated possible audio-visual speech integration between the auditory and visuo-lingual modalities, despite little associative audio-visual experience between these two sensory sources in daily life. To further determine the impact of visual experience on bimodal speech integration, similarities and differences between the unusual audio-visuo-lingual and the classical audio-visuo-labial modalities were also tested. Several findings were observed. First, both visuo-lingual and

Acknowledgments

This study was supported by research funds from the European Research Council (FP7/2007-2013 Grant Agreement No. 339152). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agency.

Conflict of interest

The authors declare no competing financial interests.

References (84)

  • Z. Huhn et al., Perception based method for the investigation of audiovisual integration of speech, Neurosci. Lett. (2009)
  • N. Kaganovich et al., Audiovisual integration for speech during mid-childhood: electrophysiological evidence, Brain Lang. (2014)
  • V. Klucharev et al., Electrophysiological indicators of phonetic and non-phonetic multisensory interactions during audiovisual speech perception, Brain Res. Cogn. Brain Res. (2003)
  • A.M. Liberman et al., The motor theory of speech perception revised, Cognition (1985)
  • V. Ojanen et al., Processing of audio-visual speech in Broca's area, NeuroImage (2005)
  • R.C. Oldfield, The assessment and analysis of handedness: the Edinburgh inventory, Neuropsychologia (1971)
  • T. Paris et al., Using EEG and stimulus context to probe the modelling of auditory-visual speech, Cortex (2016)
  • J. Pekkola et al., Perception of matching and conflicting audio-visual speech in dyslexic and fluent readers: an fMRI study at 3T, NeuroImage (2006)
  • M. Sato et al., A mediating role of the premotor cortex in phoneme segmentation, Brain Lang. (2009)
  • M. Sato et al., Auditory-tactile speech perception in congenitally blind and sighted adults, Neuropsychologia (2010)
  • M. Sato et al., On the tip of the tongue: modulation of the primary motor cortex during audiovisual speech perception, Speech Commun. (2010)
  • I.M. Schepers et al., Noise alters beta-band activity in superior temporal cortex during audiovisual speech processing, NeuroImage (2013)
  • M. Scherg et al., Evoked dipole source potentials of the human auditory cortex, Electroencephalogr. Clin. Neurophysiol. (1986)
  • J.L. Schwartz et al., The Perception for Action Control Theory (PACT): a perceptuo-motor theory of speech perception, J. Neurolinguist. (2012)
  • J.I. Skipper et al., Listening to talking faces: motor cortical activation during speech perception, NeuroImage (2005)
  • J.J. Stekelenburg et al., Deficient multisensory integration in schizophrenia: an event-related potential study, Schizophr. Res. (2013)
  • A. Treille et al., Haptic and visual information speed up the neural processing of auditory speech in live dyadic interactions, Neuropsychologia (2014)
  • P. Tremblay et al., On the context-dependent nature of the contribution of the ventral premotor cortex to speech perception, NeuroImage (2011)
  • V. van Wassenhove et al., Temporal window of integration in auditory-visual speech perception, Neuropsychologia (2007)
  • K.E. Watkins et al., Seeing and hearing speech excites the motor system involved in speech production, Neuropsychologia (2003)
  • L.H. Arnal et al., Dual neural routing of visual facilitation in speech processing, J. Neurosci. (2009)
  • M. Baart, Quantifying lip-read induced suppression and facilitation of the auditory N1 and P2 reveals peak enhancements and delays, Psychophysiology (2016)
  • M. Baart et al., Electrophysiological evidence for speech-specific audiovisual integration, Neuropsychologia (2014)
  • C. Benoît et al., Effects of phonetic context on audio-visual intelligibility of French speech in noise, J. Speech Hear. Res. (1994)
  • J. Besle et al., Bimodal speech: early suppressive visual effects in human auditory cortex, Eur. J. Neurosci. (2004)
  • D.E. Callan et al., Neural processes underlying perceptual enhancement by visual speech gestures, NeuroReport (2003)
  • D.E. Callan et al., Multisensory integration sites identified by perception of spatial wavelet filtered visual speech gesture information, J. Cogn. Neurosci. (2004)
  • G.A. Calvert et al., Reading speech from still and moving faces: the neural substrates of visible speech, J. Cogn. Neurosci. (2003)
  • C.S. Campbell et al., Perception of visible speech: influence of spatial quantization, Perception (1997)
  • C. Chandrasekaran et al., The natural statistics of audiovisual speech, PLoS Comput. Biol. (2009)
  • D. Derrick et al., Aerotactile integration from distal skin stimuli, Multisens. Res. (2013)
  • C. Fowler et al., Listening with eye and hand: crossmodal contributions to speech perception, J. Exp. Psychol. Hum. Percept. Perform. (1991)