Benefit of visual speech information for word comprehension in post-stroke aphasia
Cortex 165 (2023) 86–100

Aphasia is a language disorder that often involves speech comprehension impairments affecting communication. In face-to-face settings, speech is accompanied by mouth and facial movements, but little is known about the extent to which they benefit aphasic comprehension. This study investigated the benefit of visual information accompanying speech for word comprehension in people with aphasia (PWA) and the neuroanatomic substrates of any benefit. Thirty-six PWA and 13 neurotypical matched control participants performed a picture-word verification task in which they indicated whether a picture of an animate/inanimate object matched a subsequent word produced by an actress in a video. Stimuli were either audiovisual (with visible mouth and facial movements) or auditory-only (still picture of a silhouette) with audio being clear (unedited) or degraded (6-band noise-vocoding). We found that visual speech information was more beneficial for neurotypical participants than PWA, and more beneficial for both groups when speech was degraded. A multivariate lesion-symptom mapping analysis for the degraded speech condition showed that lesions to superior temporal gyrus, underlying insula, primary and secondary somatosensory cortices, and inferior frontal gyrus were associated with reduced benefit of audiovisual compared to auditory-only speech, suggesting that the integrity of these fronto-temporo-parietal regions may facilitate cross-modal mapping. These findings provide initial insights into our understanding of the impact of audiovisual information on comprehension in aphasia and the brain regions mediating any benefit.


Introduction
Post-stroke aphasia is a language disorder most frequently associated with difficulties with speech production and/or comprehension (Stroke Association UK, 2021). However, face-to-face communication goes beyond speech, as it also involves processing a great deal of other communicative information, including mouth and facial movements. We know very little about whether these movements benefit the comprehension of people with aphasia (PWA) and whether particular brain regions mediate any benefit. Studies with neurotypical individuals have shown that observing mouth movements facilitates auditory comprehension, particularly when speech is challenging to process due to message complexity (Arnold & Hill, 2001; Reisberg et al., 1987) or additional noise (Krason et al., 2021; Ma et al., 2009; Ross et al., 2007; Schwartz et al., 2004; Sumby & Pollack, 1954; Tye-Murray et al., 2007). This benefit is thought to occur because mouth movements support temporal and phonological encoding of the auditory speech information, as well as constrain lexical competition (for a review see Peelle & Sommers, 2015). For instance, during a conversation in a busy restaurant, mouth movements inform the listener about when to attend to others' speech and complement auditory signals by disambiguating the place of articulation of a consonant (e.g., /bæt/ versus /kæt/). Studies with PWA have primarily investigated audiovisual speech processing using the McGurk and MacDonald paradigm (McGurk & MacDonald, 1976). In this paradigm, simultaneous mismatching information from speech acoustics (e.g., "pa") and mouth movements (e.g., "ka") induces an audiovisual illusion in which individuals perceive a fused percept (e.g., "ta"; McGurk & MacDonald, 1976).
Despite great individual variability in susceptibility to the McGurk effect (Brown et al., 2018), most neuroanatomically healthy individuals and PWA perceive a fused percept during mismatching presentations, which has been interpreted in terms of audiovisual integration mechanisms (see Alsius, Paré, & Munhall, 2018 for a review). However, processing mismatching information from mouth and auditory speech is of unknown relevance to word comprehension and may be driven by different cognitive mechanisms (Hickok et al., 2018; Van Engen et al., 2017).
Notably, and of greater potential relevance to comprehension, PWA also benefit from mouth movements when acoustic and visual speech cues match, e.g., when "pa" is produced both auditorily and visually, relative to when "pa" is produced auditorily only (Andersen & Starrfelt, 2015; Baum et al., 2012; Campbell et al., 1990; Hessler et al., 2012; Hickok et al., 2018; Michaelis et al., 2020; but see also Youse et al., 2004). However, in a study assessing the ability of individuals with left hemisphere stroke to extract visual speech information, Schmid and Ziegler (2006) showed that PWA did not benefit from audiovisual relative to auditory-only stimuli and were impaired in matching asynchronous stimuli across auditory and visual modalities. This was particularly the case for individuals with poor verbal repetition skills and apraxia of speech (i.e., a motor speech planning disorder), suggesting that these factors may be important for successful encoding of phonological information from mouth movements and integration with auditory speech. However, as with studies of the McGurk and MacDonald illusion, the relevance of these findings for naturalistic speech comprehension may be limited: the stimuli were nonsense syllables and non-speech sounds (e.g., whistling), and the tasks involved matching asynchronous cross-modal information. Finally, associations between lesion site and behavior were not assessed.
Although lesion information is often not available in behavioral studies with clinical populations, it may strongly influence performance. Functional neuroimaging studies with neurotypical individuals generally, but not exclusively, report that three brain regions play central roles in audiovisual speech processing (for a review see Peelle, 2019). The left posterior superior temporal sulcus/gyrus (STS/STG) displays enhanced activation for audiovisual speech (with visible mouth and facial movements) relative to combined responses to auditory-only and visual-only stimuli (Calvert & Campbell, 2003; Calvert et al., 2000; Erickson et al., 2014; Nath & Beauchamp, 2012; Sekiyama et al., 2003; Skipper et al., 2005, 2007; Venezia et al., 2017; Wright et al., 2003), suggesting that it contributes to multisensory integration, including cross-modal integration for speech (Amedi et al., 2005; Beauchamp, 2005; Beauchamp et al., 2004; Baum et al., 2012; see also Hocking & Price, 2008; Olson, Gatenby, & Gore, 2002 for contradictory results). Some fMRI studies have also reported increased activation in the auditory cortex, including primary auditory cortex (A1), for visual speech relative to silent non-speech movements (Calvert et al., 1997; Pekkola et al., 2005). Similar findings from electrophysiological experiments show that visual cues modulate oscillations in A1 (Crosse et al., 2015; Luo et al., 2010) and that this modulation starts early, i.e., approximately 100–300 ms before speech onset, which is often related to mouth opening/closing (Chandrasekaran et al., 2009). Finally, the left inferior frontal cortex, including ventral premotor cortex (PMv) and inferior frontal gyrus (IFG), has also been associated with audiovisual processing (Calvert & Campbell, 2003; Erickson et al., 2014; Skipper et al., 2007; Watkins et al., 2003).
These inferior frontal regions have been argued to play a role in mapping articulatory gestures to phoneme representations (Hickok & Poeppel, 2007;Rauschecker & Scott, 2009), with some suggesting that observing mouth movements while listening to speech evokes activity in similar frontal brain regions as during speech production (see Skipper et al., 2017 for a review).
There is very little converging evidence from PWA that these regions are involved in audiovisual processing, and the studies that exist focus on perception rather than comprehension. Hickok et al. (2018) conducted a large-scale voxel-based lesion-symptom mapping study assessing performance of PWA with McGurk-type stimuli. They found that left posterior superior and middle temporal regions, the insula (INS), and parts of the occipital cortex, but not the IFG, were associated with audiovisual integration (Hickok et al., 2018). More recently, Michaelis et al. (2020) tested audiovisual integration abilities of PWA using asynchronous auditory and visual signals. Lesions to the left supramarginal gyrus (SMG) and planum temporale of the STG were associated with reduced temporal sensitivity to the asynchronous audiovisual signal, indicating that these regions are important for the temporal perception that mediates audiovisual integration.
Although these findings provide important initial insights into the mechanisms driving audiovisual processing in PWA, both studies used the McGurk paradigm and are therefore subject to the criticisms raised above, i.e., they investigated syllable perception rather than comprehension.

The current study
This study is the first to investigate, using both lesion-symptom mapping and behavioral methods, the benefit of visual speech information for spoken word comprehension in PWA. We assessed 36 PWA and 13 neurotypical controls with a computer-based picture-verification task requiring judgements about whether a spoken word from a video matched a previously seen picture. We manipulated the presence of mouth and facial movements, and speech clarity. As face-to-face interactions are typically embedded in noise (e.g., a conversation on a busy street) and such adverse listening conditions increase reliance on visual speech information in neurotypical individuals (e.g., Krason et al., 2021; Ma et al., 2009; Ross et al., 2007; Sumby & Pollack, 1954), we compared clear speech to 6-band noise-vocoded stimuli. Finally, we assessed the neural regions associated with any benefit of visual speech information during word comprehension using Support-Vector Regression Lesion-Symptom Mapping (SVR-LSM; Zhang et al., 2014). Based on the current literature on the processing of audiovisual speech by neurotypical individuals, we predicted that performance of both groups would improve in the degraded condition when mouth movements were present, thanks to the support they provide to phonological encoding of degraded auditory signals (e.g., Ross et al., 2007; Sumby & Pollack, 1954). Given the limited number of studies on audiovisual speech processing beyond the syllable level in individuals with post-stroke aphasia, it is unclear whether PWA would benefit from observing mouth and facial movements to a larger extent than neurotypical individuals. It is possible that PWA would use visual speech information to overcome noise (similarly to neurotypical individuals), but also to remedy auditory speech deficits caused by aphasia.
It may also be the case, however, that integrating visual and auditory channels is more challenging for PWA than for healthy adults, thus, resulting in a smaller audiovisual benefit. We hypothesized that any benefit from observing mouth and facial movements to PWA would depend on individuals' lesion location. That is, we predicted that a reduced audiovisual benefit should be observed in patients with lesions to the posterior STS/STG, a key region for multisensory integration in studies with neurotypicals (e.g., Beauchamp et al., 2004). As we considered comprehension of real words with visible mouth and facial movements, other regions including A1 and inferior frontal cortices (PMv and IFG; e.g., Calvert et al., 1997;Watkins et al., 2003) may also contribute to visual speech benefit.

Methods
In the methods section, we report how we determined our sample size, all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study.

Participants
Forty-nine native speakers of North American English were recruited from the Moss Rehabilitation Research Institute (MRRI) Research Registry (Schwartz et al., 2005) to participate in the study. Participants included (i) 36 individuals at least 6 months post a single left hemispheric cerebrovascular accident who exhibited aphasia and, to ensure that they would be able to understand experimental task instructions, had a score of at least 5 (out of 10) on the auditory comprehension subtest of the WAB (Kertesz, 1982), and (ii) 13 matched neurotypical control participants.

Neuroimaging acquisition
Twenty-nine participants with aphasia received research-quality structural MRI (26) or, if MRI was medically contraindicated, CT (3) scans. The MRI scans included whole-brain T1-weighted images acquired on a 3 T Siemens Trio (Erlangen, Germany) scanner with a repetition time of 1620 ms, echo time of 3.87 ms, field of view of 192 × 256 mm, and 1 × 1 × 1 mm voxels, using a Siemens eight-channel head coil. The CT scans were obtained without contrast (60 axial slices, 3–5 mm slice thickness) on a 64-slice Siemens SOMATOM Sensation scanner. Lesions were manually segmented on each patient's high-resolution T1-weighted structural images. Lesioned voxels were assigned a value of 1 and preserved voxels a value of 0; both could contain grey and white matter. Binarized lesion masks were then registered to an MNI template (Montreal Neurological Institute "Colin27") using a symmetric diffeomorphic registration algorithm (Avants et al., 2008; www.picsl.upenn.edu/ANTS). First, volumes were registered to an intermediate template of healthy brain images acquired on the same scanner, and they were then mapped onto the "Colin27" template. Lesion maps were subsequently inspected by an experienced neurologist (H.B. Coslett), naive to the behavioral results of the study, to ensure mapping accuracy. The same neurologist drew the lesions from the CT scans directly onto the "Colin27" template using MRIcron (Rorden & Brett, 2000).

¹ We tested an accuracy model including all predictors of interest (see Data Analysis section) but excluding the 3 participants who did not pass the audiometry screening. The results are consistent with the results from the accuracy model with the full sample, suggesting that the hearing factor did not influence our findings. All results are presented in the Supplementary Materials for comparison.
To ensure maximum accuracy with high intra- and inter-rater reliability (>.85), the pitch of the template was rotated to approximate the slice plane of each participant's scan (see e.g., Schnur et al., 2009).

Materials
In the experimental picture-word verification task participants indicated whether a spoken stimulus matched a previously seen picture. Experimental materials for the study consisted of 120 words, a corpus of 480 pictures with high name agreement, and 240 video-clips. The list of words and the video-clips are publicly available at https://osf.io/fuscq/. The picture materials could not be publicly archived due to copyright concerns.

Words
All words were concrete (Mn. 3.5 out of 5 on a concreteness scale; Brysbaert et al., 2014) and referred to common objects and living things. Words were grouped into sets of four (e.g., "cow", "ear", "egg", "pie") and items within a set were matched on number of syllables and as closely as possible on number of phonemes, lexical frequency (Brysbaert & New, 2009), age of acquisition (AoA; Kuperman et al., 2012), and phonological neighborhood density (Luce & Pisoni, 1998). Each participant saw all 120 words, but the words within a group were presented in different conditions (see below) to different participants. For example, participant 1 heard the word "cow" in the clear condition with visible mouth movements, whereas participant 2 heard the same word in the clear condition, but with no visible facial cues. The sets of four words remained constant across participants and experimental conditions.

Video-clips
The video-clips were recorded in a professional, well-lit soundproof booth at University College London. They depicted a female native speaker of American English with visible head and shoulders uttering target words. The videos were further manipulated in iMovie (version 10.1.12) in the following way. First, we extracted the audio from the video files and combined it with a still image of a female silhouette. As a result, each video had two versions: with (audiovisual) and without (auditory-only) visual cues. This contrast is analogous to real-life scenarios in which interlocutors have face-to-face versus telephone conversations. The decision to use a still image of a silhouette as an auditory-only baseline, rather than a still picture of a speaker or a video of a speaker with a blurred lip area, was driven by the concern that the auditory-visual mismatch would create expectancy conflicts that would actively disrupt processing rather than serving as a truly neutral condition. In addition, blurring different parts of the face to control for their role in speech processing is ecologically less valid. Second, we moderately noise-vocoded the audio in Praat (Boersma, 2021) using a 6-band pass filter following Drijvers and Özyürek (2017) and Krason et al. (2021). Noise-vocoded speech is a type of degraded speech in which pitch-related information is manipulated to simulate the listening experience of someone with a cochlear implant (Shannon et al., 1995). Six-band filtering makes the speech challenging, but still intelligible (to a certain degree), and has previously been shown to increase neurotypical individuals' use of visual cues in word recognition tasks (Drijvers & Özyürek, 2017; Krason et al., 2021).
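For readers unfamiliar with the technique, the noise-vocoding scheme can be sketched as follows: the speech is split into frequency bands, each band's amplitude envelope is extracted, and the envelopes are used to modulate band-limited noise. This is an illustrative Python sketch only; the study used a Praat script, and the band edges, filter order, and envelope method below are generic assumptions rather than the study's exact parameters.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(signal, sr, n_bands=6, lo=50.0, hi=8000.0):
    """Illustrative n-band noise vocoder (hypothetical parameters).

    Splits the input into log-spaced frequency bands, extracts each
    band's amplitude envelope, and uses it to modulate band-limited
    white noise, discarding fine spectral (pitch-related) detail.
    """
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(signal))
    edges = np.geomspace(lo, hi, n_bands + 1)   # log-spaced band edges
    out = np.zeros(len(signal))
    for low, high in zip(edges[:-1], edges[1:]):
        sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
        band = sosfilt(sos, signal)
        envelope = np.abs(hilbert(band))        # Hilbert amplitude envelope
        out += envelope * sosfilt(sos, noise)   # envelope-modulated noise
    return out
```

With more bands the envelope detail increases and intelligibility rises; six bands yields speech that is degraded but still partly intelligible, which is the point of the manipulation.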
The final set of videos was therefore presented in one of the following conditions: clear audiovisual (clear audio + visible mouth and facial movements), degraded audiovisual (noise-vocoded audio + visible mouth and facial movements), clear auditory-only (clear audio + no visual cues), and degraded auditory-only (noise-vocoded audio + no visual cues). Fig. 1 depicts the experimental conditions and trial types used in the study.
The stimuli were displayed on a 24-inch monitor with 1920 × 1080 resolution. The videos occupied the upper 2/3 of the screen, and the pictures occupied the lower part.

Procedure
The experiment was programmed in Gorilla (https://gorilla.sc/). Participants wore high-quality headphones during the experiment. Participants' task was to indicate whether a spoken stimulus matched a previously seen picture. Each trial started with a still image of an actress (or a female silhouette in the auditory-only modality) and a fixation cross beneath it. After 500 ms, a picture of an object or living thing appeared in place of the fixation cross. After another 1500 ms, a 200 ms beep tone was played indicating the beginning of a ~1500 ms video. The picture remained in view until the end of the video, after which a new screen with a question ("Does the speech match the picture?") and two response boxes appeared. Participants used their left hand (i.e., the unaffected hand in the PWA) to indicate their responses using the "z" and "x" buttons on the keyboard with corresponding colored stickers ("z" [yellow sticker] = "yes", "x" [blue sticker] = "no"). See Fig. 2 for an example of the trial sequence. Prior to the main task, participants were presented with four practice trials illustrating all possible conditions (i.e., clear audiovisual, degraded audiovisual, clear auditory-only, degraded auditory-only). The practice trials were repeated as necessary to ensure participants understood the task. Both visual and oral feedback was provided during the practice phase. In the main task, participants were exposed to all the target words twice, resulting in 240 trials, with 50% of the trials requiring a "yes" response (matching trials) and the other 50% requiring a "no" response (mismatching trials). None of the mismatching pictures appeared as targets. The second presentation of each word always appeared in a different experimental condition and in the second half of the experiment (after a 10-min break). The trials were pseudo-randomized into eight blocks of 30 trials. Each session lasted approximately 1.5 h.

Behavioral analysis
The lme4 package (Bates et al., 2015) was used to perform a set of mixed-effects analyses in RStudio (RStudio Team, 2015). We carried out generalized logistic mixed-effect regressions (glmer) on accuracy separately for the matching and mismatching trials.² The decision to analyze matching and mismatching trials separately was driven by findings from neurotypical individuals showing that different cognitive processes may be involved when responding yes/no to matching and mismatching picture-word pairs, with matching trials being overall more reliable (see, e.g., Stadthagen-Gonzalez et al., 2009). Specifically, matching trials have been suggested to reflect conceptual (semantic) matching, i.e., individuals access the meaning of both spoken words and pictures (Stadthagen-Gonzalez et al., 2009). In contrast, mismatching trials have been shown to elicit much more variability in how people respond, which may be related to the number of additional "checks" one has to perform to decide that a word and a picture mismatch (Krueger, 1978). Potential cognitive mechanisms that may be triggered during mismatching, but less so during matching, trials are cognitive control and priming. Finally, assessing the benefit of congruent visual information is of clinical relevance.
Prior to the analyses, we removed trials with technical difficulties and trials with the phonologically related word "gauge", because its visual speech information is identical to that of its matching target word "cage" (21 trials in total). We entered the following predictors and up to three-way interactions between them in our models: Speech Clarity (clear, degraded), Modality (audiovisual, auditory-only) and Group (PWA, neurotypicals), as well as Relation Type (semantic, phonological, unrelated) in the mismatching trial analysis. Following a design-driven approach (Barr et al., 2013), we included by-Participant and by-Item random intercepts to account for participant and item variability. We also entered random slopes for Speech Clarity and Modality both by-Participant and by-Item to better control for type I error. Random slopes of Modality were removed from the analysis of the mismatching trials due to a singular model fit. The control variables entered in the models included the Number of Syllables of the target words, Log Frequency (Brysbaert & New, 2009), AoA (Kuperman et al., 2012), and Phonological Neighborhood Density (Luce & Pisoni, 1998). We applied the "bobyqa" algorithm to optimize model convergence and speed of iterations (Powell, 2009). There was no obvious multicollinearity, with Variance Inflation Factors (VIFs) below 2.7 and 4.8 for the matching and mismatching trial analyses, respectively. Finally, the coefficients were used to interpret the size and direction of effects (Jaeger, 2008), and significance values were assessed with Laplace approximation using the lmerTest package (Kuznetsova et al., 2017). Plots were created using the ggplot2 package (Wickham, 2009). The R code for the analyses is available on OSF at https://osf.io/fuscq/.

² Reaction time data were unreliable due to a number of responses prior to the response window, i.e., while videos played.
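The multicollinearity check reported above can be illustrated with a short sketch. The study computed VIFs on the R model predictors; the function below is a generic Python re-implementation of the standard VIF formula (regress each predictor on the others and take 1/(1 − R²)), offered only to make the diagnostic concrete.

```python
import numpy as np

def vifs(X):
    """Variance inflation factor for each column of design matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j is obtained by regressing
    column j on the remaining columns (plus an intercept). Values
    near 1 indicate little collinearity; large values flag trouble.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        result.append(1.0 / (1.0 - r2))
    return np.array(result)
```

Against the conventional rule of thumb that VIFs above 5 to 10 signal problematic collinearity, the reported maxima of 2.7 and 4.8 are unremarkable.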
No part of the study procedures or analyses was preregistered.
Finally, we calculated d' and c, using the psycho package (Makowski, 2018), to check for task sensitivity and response bias, respectively. D' was calculated by taking the difference in z-scores between hits (correct responses to "yes" trials) and false alarms (incorrect responses to "no" trials). Larger d' values indicate better sensitivity to the task, and d' values closer to 0 signify performance approximating chance level (Stanislaw & Todorov, 1999). C was calculated as the number of standard deviations from the point where neither response is preferred (the so-called "neutral point"), with positive values indicating a tendency towards "no" responses and negative values indicating a tendency towards "yes" responses (Stanislaw & Todorov, 1999). The d' values in our study varied between .62 and 4.35, suggesting that task sensitivity was good and all participants responded above chance level. The c values ranged from −.72 to .66 and fell well within ±3 SD of the neutral point, suggesting that participants were not biased towards "yes" or "no" responses.
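The d' and c computations described above can be sketched as follows. The study used the psycho package in R; this Python version is an illustrative re-implementation, and the log-linear correction for extreme rates is an assumption (the exact correction applied by psycho may differ).

```python
from statistics import NormalDist

def d_prime_and_c(hits, misses, false_alarms, correct_rejections):
    """Signal-detection sensitivity (d') and response bias (c).

    Rates are corrected by adding .5 to each cell (a log-linear
    correction) so that z-scores stay finite at 0% or 100%.
    """
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    d_prime = z(hit_rate) - z(fa_rate)          # sensitivity
    # Positive c: tendency towards "no"; negative c: towards "yes".
    c = -(z(hit_rate) + z(fa_rate)) / 2.0
    return d_prime, c
```

For example, a participant with equally high hit and correct-rejection rates gets a large d' and a c of zero, matching the interpretation given in the text.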

Lesion-symptom mapping analysis
Support Vector Regression Lesion-Symptom Mapping (SVR-LSM) was performed in MATLAB using a toolbox developed by DeMarco and Turkeltaub (2018). SVR-LSM is a multivariate machine learning technique that uses a nonlinear function to determine the association between a map of lesioned voxels in the brain (rather than single voxels) and patients' behavior (Zhang et al., 2014). As compared with its predecessor, voxel-based lesion-symptom mapping (VLSM), it offers better specificity and sensitivity (Mah et al., 2014) by controlling for type I and type II errors caused by correlations between neighboring voxels (Pustina et al., 2018) and multiple comparisons (Bennett et al., 2009), respectively. Importantly, SVR-LSM also outperforms VLSM if a particular behavior is associated with multiple brain regions (Herbet et al., 2015; Mah et al., 2014), as may be the case for speech comprehension.
Here, we focused on the lesions associated with reduced audiovisual benefit (as compared to auditory-only speech) in the matching trials (i.e., those requiring a "yes" response) because of their clinical relevance. Based on the accuracy data distribution, we looked at the degraded speech condition only. We used residuals of the audiovisual condition with auditory-only scores regressed out as the dependent variable. We excluded any voxels that were lesioned in fewer than three patients (~10% of the total number of patients). We regressed lesion volume out of both the individual lesion masks and participants' behavioral scores to control for total lesion volume, following DeMarco and Turkeltaub (2018). We generated a null distribution using 10,000 Monte Carlo permutations to determine voxelwise statistical significance. We cross-validated our model by dividing our sample into five folds, with four subgroups used to create a regression model and the fifth subgroup used to validate it. The resulting map was then thresholded at p < .05, and any clusters smaller than 500 voxels were excluded, following Garcea et al. (2019). Finally, we used the Johns Hopkins DTI-based probabilistic white matter tractography atlas (Mori et al., 2008) to determine the overlap between significant voxels from the SVR-LSM analysis and major white matter tracts at a 75% probability threshold (Baldo et al., 2012; Schwartz et al., 2012).
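Two of the preprocessing steps described above, computing the dependent variable as residuals of the audiovisual condition after regressing out auditory-only scores, and excluding voxels lesioned in fewer than three patients, can be sketched as follows. This is an illustrative Python sketch, not the MATLAB toolbox code; the function `lsm_inputs` and its argument names are hypothetical.

```python
import numpy as np

def lsm_inputs(av_scores, ao_scores, lesion_masks, min_patients=3):
    """Illustrative preparation of lesion-symptom mapping inputs.

    Dependent variable: residuals of audiovisual accuracy after
    regressing out auditory-only accuracy (i.e., AV benefit beyond
    what AO performance predicts). Voxels lesioned in fewer than
    `min_patients` patients are dropped from the analysis.
    """
    av = np.asarray(av_scores, dtype=float)
    ao = np.asarray(ao_scores, dtype=float)
    X = np.column_stack([np.ones_like(ao), ao])       # intercept + AO
    beta, *_ = np.linalg.lstsq(X, av, rcond=None)
    residuals = av - X @ beta                          # AV benefit residuals
    masks = np.asarray(lesion_masks)                   # patients x voxels, 0/1
    keep = masks.sum(axis=0) >= min_patients           # coverage threshold
    return residuals, masks[:, keep], keep
```

Residualizing, rather than using a raw AV minus AO difference, removes any component of audiovisual accuracy that is linearly predictable from auditory-only performance, so the mapped variable reflects benefit specific to the visual cues.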

Matching trials
We found significant main effects of Speech Clarity (b = 1.29, SE = .22, z = 5.92, p < .001) and Group (b = .50, SE = .19, z = 2.59, p = .01), with participants performing better on clear relative to degraded speech and the control group performing more accurately than the PWA group. There was also a significant interaction between Speech Clarity and Modality (b = −.30, SE = .13, z = −2.38, p = .02). Pairwise comparisons with Holm's corrections showed that when the speech was degraded, participants made fewer errors on audiovisual compared to auditory-only presentations (p < .001). There was no difference between audiovisual and auditory-only modalities when the speech was clear, likely because performance was at ceiling (p > .05). One control variable was also significant (Number of Syllables: b = 1.10, SE = .29, z = 3.76, p < .001, with participants performing better on longer words). Fig. 3 (A) shows mean accuracy scores per group for the matching trials.

Mismatching trials
Follow-up pairwise comparisons with Holm's corrections showed that although both groups performed better in the audiovisual modality compared to auditory-only (ps < .04), the difference between conditions was larger for the control group (p = .05). This effect was further examined in a post-hoc analysis with only the PWA for whom lesion information was available (29), including a new variable, Lesion Volume (in mm³), to establish whether lesion size is a significant predictor of smaller benefit from the audiovisual modality. There was no effect of lesion volume on the behavioral results (see Supplementary Materials for full results). There was also a significant interaction between Speech Clarity and Relation Type (for the phonological type with the unrelated type as a reference: b = −.59, SE = .10, z = −6.00, p < .001).
Pairwise comparisons showed significantly better performance for the clear relative to degraded speech for phonological and unrelated pictures (ps < .01), but not semantically related pictures (p > .05). When the speech was clear, participants were also more accurate on the phonological than semantic pictures (p = .004), but when the speech was degraded, they were more accurate on the semantic trials compared to phonological ones (p < .001). One control variable (Phonological Neighborhood Density) was also significant (b = −.02, SE = .01, z = −2.05, p = .04, with participants performing better on words with smaller phonological neighborhood density). Fig. 3 (B) shows mean accuracy scores per group for the mismatching trials.

Lesion-symptom mapping results
To assess which brain areas, when lesioned, are associated with reduced benefit of visual speech cues, we carried out an SVR-LSM analysis in the 29 PWA who had scans (see Table 1). The overlap map with regions lesioned in at least three participants is depicted in Fig. 4. The dependent variable was the residuals of the audiovisual condition with the auditory-only condition regressed out, for degraded speech in the matching condition. The SVR-LSM analysis showed several significant clusters, including parts of the superior temporal pole (TPOsup) and STG, postcentral gyrus (PoCG), SMG, INS, and IFG (pars triangularis and pars orbitalis). Table 2 and Fig. 5 summarize the results. Finally, based on the Johns Hopkins DTI probabilistic atlas (Mori et al., 2008), we found an overlap between significant clusters and the superior longitudinal fasciculus (SLF). The probabilistic location of the SLF and the overlap are presented in Fig. 6. The percentage overlap between the SLF and the SVR-LSM results is presented in Table 3.

Discussion
The current study is the first to investigate the benefit of mouth and facial movements in word comprehension of people with aphasia using both behavioral and lesion-symptom mapping methods. In contrast to previous studies, we used a picture-verification task and manipulated the presence of visual speech information and the clarity of the auditory signal to assess the extent to which these factors impact speech comprehension in adults with post-stroke aphasia and a neurotypical control group. We also conducted exploratory SVR-LSM to investigate the neural regions associated with any benefit of visual speech for word comprehension.

[Table 2. SVR-LSM results with X, Y and Z centers of mass associated with reduced benefit from audiovisual speech relative to auditory-only in the degraded listening condition for the matching trials. Regions with clusters of >500 voxels were identified by Automated Anatomical Labeling (AAL).]

[Fig. 5. SVR-LSM results depicting significant voxels (in red) which, when lesioned, are associated with reduced benefit from audiovisual presentation relative to auditory-only presentation in the degraded listening condition for the matching trials. Voxelwise threshold set to p < .05 with 10,000 Monte Carlo permutations and 5-fold cross-validation. Clusters of <500 contiguous 1 mm³ voxels were excluded.]

[Fig. 6. Probabilistic location of the white matter tracts based on the JHU white matter atlas overlaid onto the SVR-LSM results. The dependent variable was the amount of benefit from audiovisual speech relative to auditory-only speech in the degraded condition for the matching trials. White matter tract probability threshold: 75%.]

In line with previous studies assessing audiovisual comprehension in neurotypical individuals, we found that visual information accompanying speech benefits comprehension in challenging listening conditions, and that this benefit is larger for controls than for PWA regardless of speech clarity. Our SVR-LSM and tractographic analyses indicated that TPOsup, STG, SMG, PoCG, INS, IFG, and the SLF may mediate the benefit of audiovisual information for comprehension.

Benefit of visual speech for aphasic comprehension
Potential benefits of visual speech information were assessed separately for matching (i.e., the speech matched a previously-seen picture) and mismatching (i.e., the speech mismatched a previously-seen picture) trials. For the matching trials, we showed that comprehension of degraded speech was easier when the speech was accompanied by mouth and facial movements than when this visual information was absent. This result is in line with previous findings in neurotypical adults (Krason et al., 2021; Ma et al., 2009; Ross et al., 2007; Schwartz et al., 2004; Sumby & Pollack, 1954; Tye-Murray et al., 2007), indicating that visual speech information plays a role particularly when phonological processing is more difficult. In such challenging listening conditions, mouth movements are likely to be beneficial because they support temporal predictions of the upcoming auditory speech information and constrain lexical competition (Peelle & Sommers, 2015). Our findings are also consistent with previous research showing similar performance of PWA and neurotypical adults under adverse listening conditions (Kittredge et al., 2006; Healy, Moser, Morrow-Odom, Hall, & Fridriksson, 2007). The lack of audiovisual benefit for PWA in the clear speech condition is likely driven by a ceiling effect; that is, like controls, these individuals with mild-to-moderate aphasia performed relatively well in the clear condition. For this reason, the present findings may not generalize to individuals with more severe comprehension impairments. Additionally, we found effects of visual speech for the mismatching trials. Both groups benefited from seeing mouth and facial movements in addition to hearing speech, but the control group showed a larger advantage than the aphasic group, which may be related to the involvement of additional cognitive processes during mismatching presentations (such as cognitive control, which is often impaired in PWA; Brownsett et al., 2014).
To our knowledge, only one recent unpublished study has investigated audiovisual speech benefit in PWA, using a sentence repetition task, and it found a similar pattern of larger audiovisual advantage for neurotypical individuals in one of its experimental conditions (i.e., at a very high noise level of 0 dB SNR; Raymer, Ringleb, Sandberg, & Schwartz, 2021). Although the reported methods and data analysis are insufficiently detailed to allow strong comparisons with our findings, both our study and that of Raymer et al. (2021) suggest that PWA may have difficulty integrating the visual and auditory streams of information into a coherent percept, as would be required for mouth movements to be useful in disambiguating speech (Massaro & Jesse, 2007; Schmid & Ziegler, 2006).
Moreover, it is interesting to note that the control group in the present study showed an audiovisual benefit for the mismatching trials both when the speech was clear and when it was degraded. This is a different pattern than we observed in the matching trials; however, in the mismatching trials performance was "off-ceiling" in the auditory-only condition, leaving room for a benefit of visual information. Finally, neurotypical and aphasic individuals also responded more accurately to unrelated trials than to both phonologically and semantically related trials, and performance on the latter two relation types depended on speech clarity: individuals performed equally well on semantic trials whether the auditory signal was clear or degraded, whereas they made more errors on phonological trials when speech was degraded than when it was clear. Altogether, this finding demonstrates that phonological discriminability is reduced by noise, whereas semantic discriminability is not.

Neural substrates of visual speech benefit
Our exploratory lesion-symptom mapping analysis identified several clusters in the left hemisphere that appear to be involved in audiovisual speech comprehension. These include perisylvian regions in temporal (TPOsup, STG), insular (INS) and inferior frontal (IFG) cortices, as well as parts of parietal (SMG) and somatosensory (PoCG) cortices. Although the SVR-LSM was conducted on a relatively small sample (see Ivanova et al., 2021) and replication is needed, our results are consistent with previous findings in neurotypical populations suggesting the involvement of a large fronto-temporo-parietal network, including STG, STS, INS, superior and inferior frontal cortex, as well as SMG and IPL, in sensorimotor speech interactions (Calvert et al., 2001; Campbell, 2008; Dick et al., 2010; Peelle, 2019; Bernstein & Liebenthal, 2014). Our findings also indicate that both ventral and dorsal streams may contribute to the benefit of visual speech for word comprehension. Portions of the ventral stream, in particular posterior superior and middle temporal cortex, have been associated with sound-to-meaning mapping. In the present study, we found a cluster of regions distributed along the lateral and medial surfaces of the STG to be associated with audiovisual speech comprehension. The STG is known for its multifunctionality and heteromodality (Hein & Knight, 2021; Venezia et al., 2017), and previous studies have found that posterior STG/STS play a crucial role in audiovisual and visual speech processing (Calvert & Campbell, 2003; Calvert et al., 2000; Erickson et al., 2014; Nath & Beauchamp, 2012; Okada & Hickok, 2009; Sekiyama et al., 2003; Skipper et al., 2005, 2007; Venezia et al., 2016; Wright et al., 2003), likely because of its multisensory integration properties (Amedi et al., 2005; Beauchamp, 2005; Beauchamp et al., 2004). Less is known about the involvement of the temporal pole in the processing of visual speech cues.
The temporal pole has primarily been linked with higher-order cognitive processes, such as naming (e.g., Rice, Hoffman, & Lambon Ralph, 2015), word retrieval (e.g., Damasio, Tranel, Grabowski, Adolphs, & Damasio, 2004), and semantic processing (Lambon Ralph, 2013; Patterson, Nestor, & Rogers, 2007). A few studies have suggested a role for the anterior STG in audiovisual speech processing (Hertrich et al., 2011; Lee & Noppeney, 2011; Ozker et al., 2017). For example, Hertrich et al. (2011) showed that relatively anterior parts of STG are linked with the processing of visual speech information (e.g., the syllables "pa" and "ta"), whereas more posterior STG is associated with cross-modal integration of non-speech stimuli (e.g., moving shapes and tones). Although the stimuli in these studies were not directly relevant to comprehension, it is of interest to note the convergence of our results with these findings. The dorsal stream, by contrast, including portions of posterior-frontal and parietal-temporal cortices, has previously been associated with sound-to-articulatory mapping in speech production. Here, we showed that insular regions medial to the superior temporal surface and fronto-parietal regions of the dorsal stream may play a role in visual speech comprehension, in line with previous literature (Calvert et al., 2001; Hickok et al., 2018; Skipper et al., 2007). For instance, Hickok et al. (2018) found associations between INS and susceptibility to perceiving fused percepts with McGurk stimuli, while Callan et al. (2003) reported the INS to be involved in mouth movement processing when the auditory signal is degraded or absent. Although the precise role of the insula in audiovisual speech comprehension is still debated, these findings indicate that it may act as a mediator during cross-modal interactions and/or in executive processing under challenging conditions (Calvert et al., 2001; Hickok et al., 2018; Skipper et al., 2007).
Given that our stimuli consisted of videos of a speaker's full face rather than solely mouth movements, the involvement of the insular cortex found in the current study may also be related to processing socio-emotional facial cues (Rae et al., 2018).
We also found that primary somatosensory cortex (PoCG) and parietal association areas (SMG) appear to mediate the benefit of visual speech information for comprehension. These regions may be engaged in encoding phonological information from mouth movements (Skipper et al., 2005, 2007) and binding it with the auditory signal (Bernstein, Auer, Wagner, & Ponton, 2008; Bernstein & Liebenthal, 2014; Michaelis et al., 2020). Additionally, we showed that IFG may be associated with the benefit of mouth movements, which is in line with Skipper et al. (2005, 2007) and Watkins et al. (2003), but not with other recent studies with PWA (Andersen & Starrfelt, 2015; Hickok et al., 2018). These findings may be discrepant because the involvement of IFG in audiovisual processing is task-specific (for a review see Peelle, 2019). For example, when speech encoding is more challenging, IFG may play a compensatory role in supporting the extraction of visual information from the mouth. It is important to note that although our findings are consistent with the prior literature in identifying multiple fronto-temporal brain regions involved in audiovisual processing, our sample was small for a robust SVR-LSM analysis, and future studies may identify additional regions.
Another limitation of the present study was that our sample of chronic patients largely consisted of anomic aphasics and lacked individuals with Wernicke's or transcortical sensory aphasia. Although these aphasia types are less common in the chronic than acute phases of recovery, future research may benefit from a more diverse sample of PWA.
Finally, our results are also in line with a recent study by Zhang and Du (2022) showing involvement of the dorsal stream, including PMv, IFG, SMG and the underlying white matter tracts of the arcuate fasciculus, in phonological encoding from mouth movements during audiovisual speech perception. Their findings are also consistent with our white matter tractographic analysis demonstrating that the SLF is associated with the benefit of visual speech information. In particular, the parts of the SLF connecting superior temporal with inferior frontal regions have been found to be critical for phonological processing (Dick et al., 2014; Glasser & Rilling, 2008). Thus, a disruption to phonological processing caused by lesions to the SLF may lead to cross-modal integration failure, which could explain the reduced benefit from audiovisual speech relative to the auditory signal alone. Future studies should investigate how the connectivity between these fronto-temporo-parietal regions, as well as between these regions and the right hemisphere, impacts audiovisual speech comprehension in aphasia.

Conclusions
The current study brings together behavioral and lesion-symptom mapping profiles of people with aphasia to establish the benefit of visual speech information for word comprehension. We have demonstrated that mouth and facial movements are more beneficial for the comprehension of neurotypical individuals than of adults with aphasia, and are more beneficial for both groups when listening conditions are challenging. We have also provided preliminary evidence that the integrity of specific inferior frontal, temporal, and parietal regions, as well as of fronto-temporal connections via the superior longitudinal fasciculus, may be associated with this benefit, consistent with the previously demonstrated role of these regions in cross-modal mapping. Although studies of spoken word comprehension have typically focused on the auditory modality, our findings suggest that future investigations should consider whether and how visual speech information impacts comprehension in aphasia.

Open Practices Section
The study in this article earned an Open Data badge for transparent practices. The data from this study are publicly available on the Open Science Framework (OSF) at https://osf.io/fuscq/.

Author information
The authors declare no conflict of interest.