Revealing the spatiotemporal brain dynamics of covert speech compared with overt speech: A simultaneous EEG-fMRI study

, who generated words both overtly and covertly. By integrating spatial constraints from fMRI into EEG source localization, we precisely estimated the spatiotemporal dynamics of neural activity. During CS, EEG source activity was localized in three regions: the left precentral gyrus, the left supplementary motor area


Introduction
Covert speech (CS), also known as inner speech, refers to speaking internally to oneself without producing any sound or movement.CS is believed to be associated with multiple cognitive functions, such as memory, control, planning, reading, writing, and self-awareness (Alderson- Day and Fernyhough, 2015;Lidstone et al., 2010;Marvel and Desmond, 2012).CS also plays an important role in self-regulation, in the situations like impulse and emotional regulation and delayed reinforcement (Barkley, 1997;Day and Smith, 2013;Tullett and Inzlicht, 2010).The ability of CS can be evaluated by self-reports or by the performance during specialized covert language tests, and the deficit of CS underlies speech disorders like stuttering, auditory verbal hallucinations, and aphasia (Brocklehurst and Corley, 2011;Fernyhough et al., 2019;Langland-Hassan et al., 2015), and mental diseases like autism and schizophrenia (Albein-Urios et al., 2021;Barber et al., 2021).Reconstructing CS content by brain-computer interface (BCI) also offers the potential to restore communication functions when hindered by disease or environmental factors (Cooney et al., 2018;Gonzalez-Lopez et al., 2020;Proix et al., 2022).
It is necessary to understand the neural mechanisms of CS.Current theories posit that, from behavioral view, CS can be seen as truncated overt speech (OS) without motor execution and actual sensory feedback (Oppenheim and Dell, 2010).However, the neural mechanisms underlying CS compared with OS remain poorly understood (Cooney et al., 2018).
Neuroimaging techniques, such as functional magnetic resonance imaging (fMRI) and positron emission tomography (PET), have identified shared brain regions between CS and OS (including typical dominant-hemispheric language regions), which suggests that CS and OS exist within the same continuum of neural processes, although the difference in neural activity strengths is debated (Alderson- Day and Fernyhough, 2015;Basho et al., 2007;Perrone-Bertolotti et al., 2014).In contrast, other studies argue that CS might involve independent neural processes based on the preservation of CS ability in aphasia patients (Geva et al., 2011;Stark et al., 2017).Beyond the debates of involved regions, neuroelectrophysiology studies detect distinct brain dynamics patterns between CS and OS, although in lower spatial precision (Burle et al., 2015).For example, different neural activity strengths were found between CS and OS in EEG (Stephan et al., 2020) and magnetoencephalography (MEG) (Goto et al., 2011;Voigtlaender et al., 2023) during various stages of speech production.Furthermore, different patterns were detected by electrocorticogram (ECoG) (Martin et al., 2014;Pei et al., 2011).In summary, a more inclusive view between CS and OS is that they can engage a shared network in different extents and might involve independent pathways with distinct dynamics.
However, as suggested in a previous review (Martínez-Manrique and Vicente, 2015), more comprehensive neuroimaging data are needed to precisely understand the distinct neural mechanisms between CS and OS.This is because single modalities have limitations in spatial resolution (EEG, MEG), temporal resolution (fMRI) and feasibility (the invasive ECoG) (Kim et al., 1997;Lopes da Silva, 2013;Parvizi and Kastner, 2018).A promising approach forward is multi-modal analysis, using simultaneous EEG-fMRI imaging with the techniques such as fMRI-informed EEG source localization (Huster et al., 2012).Currently, although there are some non-simultaneous EEG-fMRI studies investigating the decoding of CS content by BCI (Simistira Liwicki et al., 2023;Wilson et al., 2023;Yoshimura et al., 2016), the temporal signal-to-noise ratio is significantly lower as compared with the simultaneous EEG-fMRI (Schrooten et al., 2018).To the best of our knowledge, the neural mechanisms of CS have not been directly explored through the technique of simultaneous EEG-fMRI, which can provide precise spatiotemporal insights into the ongoing debate of distinct CS and OS mechanisms, and might help to optimize the timing and targeting of future language-related treatments (Abo et al., 2012;Musso et al., 2022;Stoeckel et al., 2014).In this study, we hypothesized that simultaneous EEG-fMRI can potentially reveal the spatiotemporal dynamics of CS.If CS and OS differ in activations observed in specific regions or specific time window, it will support the involvement of unique CS pathways, and vice versa.To test this hypothesis, we performed a simultaneous fMRI-EEG experiment including both CS and OS tasks.We extracted spatiotemporal features and analyzed their effect pathways using functional connectivity and multivariate path analysis.We expect that these findings might offer insights into the mechanisms of CS and potential neural features for the future treatments of speech disorders and the development of BCI speech applications.

Participants
We applied the following inclusion criteria to maintain data homogeneity.Only males were included, as numerous studies reported significant sex differences in neural activity during speech (Sato, 2020;Shaywitz et al., 1995;Zaidi, 2010;Hirnstein et al., 2019).Participants aged between 20 and 40 years were selected to ensure homogeneity in the healthy status of brain anatomy and function.Only right-handed native English speakers were included to account for homogeneity in language-related brain features.
The exclusion criteria included: 1. MR contraindications.2. A lifetime history of psychotic, bipolar, or substance-use disorders.3. Depressive or anxiety disorders within the past year.4. Any neurological or significant somatic illnesses.5.The use of psychotropic medications or other agents known to affect cognitive functions.
As a result, 34 healthy male participants were initially included in this experiment.Two of the participants were excluded because they did not complete the study.Finally, 32 participants completed the experiment, falling within the age range of 24.1 ± 3.9 years.

Hierarchical structure of the experiment design
We performed a word-speaking experiment with simultaneous EEG- fMRI recording.The structure of the experimental protocol is hierarchical (Fig. 1a).The experiment is organized into three main types of units, from largest to smallest: blocks, runs, and trials: 1) The experiment comprised 1 block (5.5 min) of resting state (RS), 5 blocks (5 min each) of CS tasks, and 5 blocks (5 min each) of OS tasks.CS and OS blocks alternated with each other during the experiment.
Rest interval between blocks is 45 s. 2) Each CS/OS block consisted of 20 runs.Rest interval between runs is 3 s.3) Each run consists of 5 trials.Rest interval between trials is 0.5 s. 4) A trial is the smallest unit of the experiment, lasting 1.5 seconds, during which a single word is displayed and spoken.

Word speech task implementation
In each run, a word was randomly selected from a set of five words and repeated across 5 trials within that run.
During the trials within OS blocks, participants were directed to speak each word loudly and clearly, while maintaining the natural tone of their daily speech.During the trials within CS blocks, participants were instructed to imagine the muscle movements of speaking the word without any physical movement or producing any sound.
The five-word set was designed based on the following criteria, as suggested in a previous review (Cooney et al., 2018).1. Appropriate length with multiple syllables.2. Distinct pronunciations between words.3. Frequent usage in BCI scenarios, mirroring real-world BCI applications.The chosen words were "Go there", "Distract Target", "Follow me", "Explore Here", and "Terminate".In total, each word was repeated 100 times in random order separately during CS and OS.

Experimental settings
For the experimental settings, participants were situated inside the MRI scanner with simultaneous EEG recording devices.A mirror was mounted at their forehead level within the scanner, allowing them to view the screen in a straight-ahead gaze.The screen, which was positioned behind the scanner bore, displayed speech task cues and experiment instructions using PsychoPy (Peirce et al., 2019).The screen instructions of the whole experiment are detailed in Supplementary Material 1 with both figure and video demos.
The study was in accordance with the Declaration of Helsinki and was conducted in Cognitive Neuroimaging Centre (CoNiC) in Nanyang Technological University (NTU) in Singapore.All the participants provided written informed consent.Ethics committee approval was obtained from the Institutional Review Board (IRB) of NTU, with the approval number: IRB-2022-040.A trial is the smallest unit of the experiment, lasting 1.5 seconds, during which a single word is displayed and spoken.Rest intervals are set between blocks, runs, and trials.(b) The analyzing pipeline of the simultaneous EEG-fMRI data.Except for the basic single-modality analyses, we performed the method of fMRI-informed EEG source localization for multi-modal analysis, which is constraining and screening the EEG dipole sources using the regions of interest (ROIs) extracted from fMRI analysis.At the final step, we performed the path analysis to statistically examine the hypothesized multivariate associations between various features.
W. Zhang et al.

Data acquisitions 2.3.1. EEG acquisitions
EEG was recorded continuously throughout the experiment alongside fMRI acquisition, using the same EEG hardware settings as the previous study (Andreou et al., 2017).We used MR-compatible devices of BrainVision Recorder (Version 1.10, Brain Products, Munich, Germany), AC-amplifiers specifically designed for EEG in MRI scanner (BrainAmp MRplus; Brain Products), and an electrode cap with 64 ring-type electrodes (BrainCapMR 64, Brain Products).The cap was equipped with 62 EEG electrodes based on a modified 10/10 system, with the reference electrode of FCz, the ground electrode of AFz, and an electrocardiogram (ECG) electrode on the participants' back.The cable connecting the electrode wires to the amplifiers was positioned using sand weights atop foam pads to decrease artifacts from the scanner's oscillations.The impedance between the electrode and skin remained under 10 kΩ by manual inspection.Data were initially sampled at a high frequency of 5000 Hz for the MRI artifact removal during preprocessing.
The amplitude resolution is 0.5 μV.Event and fMRI marker information were automatically embedded in EEG data as annotations.

Preprocessing of EEG
We summarized the whole processes and data format of EEG preprocessing in Table 1.
First, the fMRI-induced gradient artifacts and pulse artifacts were corrected in the raw EEG data using an average artifact template subtraction (AAS) algorithm.Ballistocardiogram (BCG) artifacts were corrected based on the ECG channel.The EEG data were average rereferenced and further filtered between 0.5 -70 Hz.The preprocessing is based on BrainVision Analyzer 2.2 (BrainVision Analyzer, 2021).
Next, EEG data were resampled to 250 Hz, and the independent component analysis (ICA) was performed using EEGLAB (Delorme & Makeig, 2004).The ICLABEL (Pion-Tonachini et al., 2019) was employed to automatically clear independent components (IC) associated with non-brain artifacts (especially speech-motor artifacts).Then, the EEG data were segmented into repeated overt and covert trials, which was followed by excluding bad trials characterized as containing EEG with abnormal signal gradient (> 50µV / ms) using BrainVision Analyzer 2.2.
After the above segmentation and exclusion of trials with bad EEG quality, in total 67.79 % (10,847 out of 16,000) of OS trials and 77.57% (12,411 out of 16,000) of CS trials were retained, and the trials retained for each participant were shown in Supplementary Material 2.

Preprocessing of fMRI
Task-fMRI images were processed based on Statistical Parametric Mapping 12 (SPM12) (Ashburner et al., 2014), following the procedure of a previous simultaneous EEG-fMRI study (Andreou et al., 2017).It included the processes of removing the first 10 time points, slice timing, realignment, affine registration, spatial normalization to Montreal Neurological Institute (MNI) space based on individual T1-MRI, and smoothing with a width at half maximum (FWHM) = 4 mm Gaussian kernel.No nuisance covariates regression or linear trend was removed because they were usually used in resting-state fMRI and might result in the removal of the task effects.

Data analysis 2.5.1. Identifying regions of interest using fMRI with general linear model and T-test
General linear model (GLM) was performed for estimating the effect of different conditions on each series of fMRI images.In GLM, for each voxel of fMRI: Where t refers to the time points in a fMRI series, S t refers to the BOLD signal at t, β 0 refers to the intercept of the signal, R t refers to the label of RS at t, O t refers to the label of OS at t, C t refers to the label of CS at t, β 1− 3 refer to the coefficients of different conditions, and ε t refers to the residuals.The condition labels were adjusted using the canonical Hemodynamic Response Function (HRF) to fit the BOLD response.
We performed voxel-wise T-tests to identify areas with significant BOLD changes during CS and OS compared with RS.A second-level group analysis summarized these findings across all participants, including age as a covariate to account for its effect on BOLD signal (Richter and Richter, 2003).Voxels with P < 0.05 (false discovery rate (FDR) corrected for multiple comparisons) were considered significant and used to define the regions of interest (ROIs), with the labeling from Automated Anatomical Labeling (AAL) atlas (Tzourio-Mazoyer et al., 2002).Any regions outside the AAL brain mask or smaller than 500 mm 3 were excluded to minimize false positives.

FMRI-informed EEG source localization analysis
The data analysis pipeline for simultaneous EEG-fMRI is shown in Fig. 1b.We performed fMRI-informed EEG source localization (Huster et al., 2012) to estimate EEG dipole sources guided by the ROI  (Oostenveld et al., 2011).The EEG pipeline of the source localization process is also included in Table 1.First, we utilized the standard boundary element model (BEM) template (Oostendorp and Van Oosterom, 1989;Oostenveld et al., 2001) in the MNI space, which was spatially coregistered with the individual T1-MRI image to model the head anatomy.The electrode system was also aligned with the MNI template.
Second, ICA was performed on the concatenated EEG data across all trials for each participant separately on the covert and overt conditions, resulting in 63 ICs per participant.The optimal dipole locations and orientations for each EEG IC was estimated based on the best fit of its equivalent dipole's projection on the scalp electrodes, while single dipole but not symmetric double dipoles was used to better fit the lateralization of speech (Price, 2012).This IC-based source localization approach ensured that the dipole fitting results were based on the full experimental data rather than individual trials, thus improving the signal-to-noise ratio and the reliability of the results.The IC activations were then Z-scored across the entire experiment and baseline-corrected using the mean value during the 0.25 s period before each trial onset.Finally, to identify the EEG dipole sources that were relevant to speech, we used the ROIs from fMRI GLM analysis as the spatial constraints to screen the speech-relevant EEG dipole sources that were contained within, following the implementations of the previous studies (Bak et al., 2011;Meyer et al., 2012;Ullsperger & Von Cramon, 2001).
After identifying EEG sources by fMRI-ROIs, we compared the following EEG features between CS and OS: To identify significant event-locked EEG activation, we performed Ztests among all valid trials from all participants, separately in six discrete time bins.These time bins were 0.25 s long, and were distributed within the task period (0 -1.5 s after the stimulus).These Z-tests aimed to identify time bins where the amplitude was greater than zero with a 95 % confidence level, and correcting P-values by FDR.Z-test was more suitable for the large sample size of repeated EEG trials (Abebe, 2019).Both bootstrapping and permutation tests, as non-parametric methods, were performed to re-verify the Z-test results for CS and OS data.Event locked time-frequency analysis was described in Supplementary Material 4.
Spectral analysis of each dipole source was performed by the periodogram power spectral density (PSD) estimate from Matlab.We used 1000 sampling points that were logarithmically spaced between 0.5-30 Hz.The upper frequency limit was set as 30 Hz to avoid muscular artifacts of OS (Muthukumaraswamy, 2013).

FMRI network analysis
We also conducted functional connectivity (FC) analyses to investigate the network-level differences between CS and OS, complementing the regional findings from fMRI-informed EEG source localization.FC was calculated using Pearson's correlation coefficient between the average BOLD signals within the ROIs.The ROI list was defined by the group-wise T-tests on GLM.We calculated FC within three periods of the tasks: OS, CS, and RS.We performed T-tests on the FC values of "OS -RS", "CS -RS", and "OS -CS" across all participants.Permutation tests were also performed to detect significant FC pairs as a non-parametric reverification for the T-tests.
We calculated the FC degree of each ROI as a quantifiable measure of the connection centrality of a specific region within the entire network.It is computed by summing the absolute FC values between the chosen ROI and all other ROIs within the network.Hub nodes were defined as the nodes with the Z-scored degree > 2.5 (Power et al., 2013).
To identify the community structures within the FC network that group together regions with stronger FC, we utilized the unsupervised method based on modularity optimization (Blondel et al., 2008).This method was implemented using the Community Detection Toolbox (Community Detection Toolbox -File Exchange -MATLAB Central, n.d.) in MATLAB.The algorithm was applied to achieve optimal clustering of all regions, with the objective of maximizing the modularity metric across the FC network.The FC strength towards each cluster was calculated by averaging across its regions.

Path analysis
Path analysis (PA) is a statistical method based on structural equation modeling that examines hypothesized multivariate associations between features, potentially suggesting causal relationships (Duffy et al., 1981).PA is implemented using maximum likelihood estimation using Lavaan (Version 0.6-16) (Rosseel, 2012).All EEG and fMRI features were adjusted for age covariates before entering the PA model.We only included significant pathways with the absolute standardized regression weight (β) > 0.1 and P < 0.05.This exclusion is necessary to consider both effect size and statistical significance (Cohen, 2013), and control the complexity of PA model to avoid overfitting.
The goodness of fit of PA model is measured by commonly used metrics: Confirmatory fit index (CFI), Tucker-Lewis index (TLI), standardized root mean square residual (SRMR), and root mean square error of approximation (RMSEA) with its 90 % CI, P-value for the test of a close fit (the null hypothesis is RMSEA < 0.05).The cutoff of these measurements for a good fit is based on the recommendation (Schermelleh-Engel et al., 2003).
Furthermore, we performed the non-parametric permutation test to verify that the PA model significantly fit the real data, by randomly shuffling the features and evaluating the fit after shuffling.

Identifying regions of interest for covert/overt speech based on fMRI
To identify regions of interest (ROIs) where neural activity strength changed during OS or CS relative to RS as spatial prior for further EEG analyses, we performed voxel-wise T-tests on fMRI images, separately for two contrasts (Fig. 2): "Overt Speech versus Resting State" ("OS -RS") and "Covert Speech versus Resting State" ("CS -RS").For both contrasts, only voxels with increased BOLD signals passed the T-tests (P < 0.05, FDR corrected).Detailed names, coordinates, and related statistical data of these ROIs are extensively listed in Table 2.For OS, multiple areas including primary motor and sensory regions were significantly activated.In contrast, for CS, significant activations were observed in the bilateral supplementary motor areas (SMA).Additionally, ANOVA results were provided in Supplementary Material 3, which verified significant ROIs detected by T-tests, as a complementation that showing difference across all three conditions.Moreover, by averaging the BOLD signals within the ROIs identified in Fig. 2a, we presented the time course of these signals for both CS and OS in Fig. 2b.The BOLD signal fluctuations observed across 20 runs within the OS block, exhibiting clear periodic patterns.In contrast, the average BOLD signal in the CS block did not demonstrate similar periodic fluctuations as observed in OS.After normalization by the rest blocks, the BOLD signal strength for both CS and OS remained above zero, indicating relatively stronger activation compared with the nearest rest intervals.

Difference in fMRI-informed EEG sources between covert and overt speech
To further analyze the event-locked EEG features in fMRI-defined ROIs, we performed fMRI-informed EEG sources localization and compared various EEG features between CS and OS in these ROIs.CS and OS can partially involve common brain regions as addressed in the introduction section.Therefore, we used all the ROIs for both CS and OS (Table 2) as the constraining regions for EEG source localization.Given the study's focus on CS, only three EEG dipole sources with significant event-locked activations during CS were shown (Fig. 3), and other insignificant dipole sources for CS were listed in Supplementary Material 4. For the sample size in the event-locked EEG activity analysis, 15 participants with 1288 OS trials and 3282 CS trials had significant activation at the left precentral gyrus (PCG) (Fig. 3a); 11 participants with 3064 OS trials and 1929 CS trials had significant activation at the left SMA (Fig. 3b); 10 participants with 1226 OS trials and 2907 CS trials had significant activation at the left putamen (Fig. 3c), demonstrating considerable trial number for statistical analysis between CS and OS.
Here, we presented the event-locked activation, the PSD, and the topographical map (topo-plot) as the key features of the EEG dipole sources for CS in Fig. 3 and compared them with OS.
Significant event-locked activations were identified using Z-tests in six time bin within the trials segmented by task events (middle column in Fig. 3).The plot of a trial ranged from -0.25 s (pre-stimulus baseline) to 0 s (word stimulus to start speech) to 1.5 s (end of the speech period).The left putamen showed earlier event-locked activations for CS compared with OS (Fig. 3c).Specifically, CS activations were significant in the 0.25 -0.5 s time bin (Z-value = 1.896,P = 0.0382, corrected by FDR), reaching the peak at 262 ms.In contrast, OS showed significant activation in the subsequent time bin of 0.5 -1.5 s (Z-value = 3.419, P < 0.0001, corrected by FDR), reaching the peak at 1170 ms.In the left PCG and SMA, event-locked activations during CS were consistently weaker than those during OS, with no unique activations specific to CS observed across the time bins.We also performed time-frequency analysis as a supplement to the event-locked activation (see Supplementary Material 4), with significant difference between CS and OS in event-related spectral perturbation (ERSP) and inter-trial coherence (ITC) in the three key regions of Fig. 3.
Additionally, bootstrapping tests were performed to re-verify the significant activations from Z-tests (Supplementary Material 5).Permutation tests were also performed to assess the group difference of amplitudes between CS and OS (Supplementary Material 5).Both nonparametric tests corroborated the Z-test findings.Fig. 2. Neural activity differences between speech tasks and the resting state.(a) BOLD signal difference identified by T-tests on fMRI images.The difference maps are separately as "Overt Speech versus Resting State" ("OS -RS" on the left panel) and "Covert Speech versus Resting State" ("CS -RS" on the right panel).We used the threshold of P < 0.05, corrected by false discovery ratio (FDR).No regions with a negative T-value were significant in the T-tests.Significant regions are also listed in Table 2. (b) The average and 95 % confidence interval of BOLD signal strength during OS and CS blocks, calculated by averaging across all repetitions within these blocks and across all participants.The ROIs are defined by the combined set identified in Fig. 2a.For each participant, BOLD signal values are normalized using the values from the rest blocks following each speech block.
For the PSD analysis (right column in Fig. 3), CS had weaker PSD than OS in a wide range of frequency (0.5 -30 Hz) in all the three regions in Fig. 3 and in most of the insignificant regions (Supplementary Material 4).We also re-verified the power difference by bar plots and Ttests in each power band separately (Supplementary Material 6).
The topo-plots of EEG dipole sources between OS and CS were highly different in the left putamen (left column in Fig. 3c), showing a strong frontal-occipital contrast in OS and a moderate bilateral-temporal activation in CS.In contrast, the topo-plots for the left PCG and SMA showed more subtle differences between CS and OS.

Difference in fMRI functional connectivity between overt and covert speech
To explore the differential functional network patterns underlying CS and OS, we calculated functional connectivity (FC) between the 16 ROIs (Table 2) derived from fMRI.T-tests were performed to compare the differences in FC values between OS/CS and RS, as well as between OS and CS (Fig. 4a).Significant FCs were identified using a threshold of Bonferroni-corrected P < 0.05.Permutation tests for FC were performed between speech tasks and RS, which re-verified the T-test results in Fig. 4a (Supplementary Material 7).These FC analyses aimed to test our hypothesis that regions with distinct roles in CS should exhibit different connectivity patterns compared with OS, providing further evidence for their specific involvement in CS networks.
Within the two FC networks, nodes with Z-scored degrees > 2.5 are defined as hub nodes (Power et al., 2013).The left putamen emerged as a prominent hub node in the FC networks of both CS and OS, with Z-scored degrees of 2.589 during OS and 2.755 during CS (lower panel in Fig. 4a).Notably, other regions all showed Z-scored degrees < 1.
Through unsupervised algorithm of community clustering, ROIs were divided into two distinct functional clusters with strong intraconnectivity.The groupings of the two clusters were the same after performing the algorithm separately in CS and OS networks.As shown in Fig. 4b, the left cluster (L_CLU) includes left-hemispheric regions of precentral, postcentral, Rolandic operculum, inferior frontal, Heschl and superior temporal gyrus.The right cluster (R_CLU) includes other regions mainly in the right hemisphere, along with the left SMA, insula and putamen.Notably, the left putamen bridged both clusters by significant FCs, highlighting its role as the hub node (a cluster view in Fig. 4c).
After identifying the central role of the left putamen and the two separate functional clusters, we compared all the potential contrasts of FC strength between OS and CS.In Fig. 4d, when using two-sample Ttests and considering multiple comparisons by FDR, only FCs between the left putamen and the L_CLU passed the T-test with t(31) = 2.868 and corrected P-value = 0.0365.FCs between other regions had insignificant corrected P-values of 0.0595, 0.0893, 0.1526, and 0.0893.

Exploring multivariate associations between EEG dynamics, fMRI functional connectivity and covert speech
To explore the associations between the key features and CS, we performed path analysis (PA) on the features of EEG dipole sources and FC measures.We built the hypothesized PA model in a data-driven manner and used the backward steps to explore the valid pathways.The detailed steps for building PA model were shown in Supplementary Material 8.
For the valid pathways in the PA model (Fig. 5), first, we identified direct positive effect from late EEG activation in the left putamen towards the OS (standardized regression weight (β) = -0.248,P < 0.001), but insignificant effect from early EEG activation in the left putamen towards the CS label (|β| < 0.1).Second, the early EEG activation in the left putamen was negatively associated with its FC towards bilateral functional clusters (β = -0.103,P < 0.01; β = -0.158,P < 0.001), while late EEG activation had no significant effect on the FC (|β| < 0.1).Third, we identified the indirect positive effect from early EEG activation on CS label, through the mediation of the FC strength between the left putamen and the left cluster (β = -0.441,P < 0.001), but no indirect mediation effect through the FC strength between the left putamen and the right cluster (|β| < 0.1).
The resulting PA model fit the data well, based on the suggested goodness-of-fit cutoff (Schermelleh-Engel et al., 2003).Comparative Fit Index (CFI) = 0.988, Tucker-Lewis Index (TLI) = 0.970, Root Mean Square Error of Approximation (RMSEA) = 0.050 (90 % confidence interval = 0.033 -0.068, with P = 0.466 for the close fit null hypothesis), and Standardized Root Mean Square Residual (SRMR) = 0.024.Permutation tests on the PA model were performed 1000 times by randomly shuffling the original data and estimating in the same model.In 1000 permutation iterations, 32 shuffled data yielded lower chi-square values, while 968 yielded higher chi-square values.Therefore, the significance that the model fits real data is P = 32/1000 = 0.032.A visualization of the fit metrics across permutations is in Supplementary Material 9.
For the details of the PA model, we provided the raw model including all possible pathways in Supplementary Material 8.The details of regression weight and covariance of the proposed PA model with only significant pathways (Fig. 5) were also listed in Supplementary Material 8.

Table 2
Significant regions of interest (ROIs) in speech tasks relative to resting state, identified by T-tests on fMRI images.This table lists the ROIs in Fig. 2a that exhibited significant changes in the blood oxygen level-dependent (BOLD) signal of fMRI images during both overt and covert speech tasks, compared with resting state.We used the threshold of P < 0.05, corrected by false discovery ratio (FDR).Each ROI is defined based on the Automated Anatomical Labelling (AAL) atlas (Tzourio-Mazoyer et al., 2002)

Discussion
Our study aims to reveal the spatiotemporal brain dynamics underlying CS compared with OS.We performed a simultaneous EEG-fMRI study on 32 participants, where they repeatedly generated words both overtly and covertly.Based on the multi-modal analysis, we compared the spatiotemporal features between CS and OS, and investigated their effect pathways through functional connectivity and multivariate path analysis.Due to the limited prior knowledge about the neural mechanisms of CS, our statistical analyses were data-driven and exploratory among the multimodal features, inspired by previous studies (Dehghani et al., 2023;Golkowski et al., 2017).
The major findings of this study can be summarized in four aspects: 1.We identified three EEG dipole sources with significant event-locked activations during CS, located in the left PCG, left SMA and left putamen.2. Although OS involved more brain regions with stronger activations, CS was characterized as the earlier event-locked EEG activation in the left putamen.3. FMRI FC analysis identified the left putamen as the only hub node of the FC networks during both CS and OS, while showing weaker FC strength towards speech-related regions in the dominant hemisphere during CS. 4. Path analysis focusing on the left putamen verified the indirect effect from early EEG activation towards CS, through the mediation effect of reduced FC towards speech-related regions.Details of results are summarized as table in Supplementary Material 10.
In summary, we propose the potential CS mechanism that the earlier activation of the left putamen might transform OS into CS by inhibiting the FC towards the speech-related network in the dominant hemisphere.

Identifying regions of interest for both speech types by fMRI
Significant ROIs for both types of speech were identified by T-tests on fMRI (Fig. 2a) and treated as the spatial constraints for the subsequent analysis of EEG dipole source localization (as shown in Fig. 1b).For OS, we detected a list of typical speech regions (Price, 2010(Price, , 2012)).For CS, the bilateral SMA was the only significant region.The SMA has been identified in previous CS studies, with some hypothesized functions for the planning and controlling of speech motor (Hertrich et al., 2016), modulation effect towards auditory areas (Linden et al., 2011), and a sign of high neural demand of attention and planning (Shergill et al., 2002).Because BOLD signal from fMRI was an indirect estimation of neural activity with lower temporal resolution, it was not surprising that the BOLD changes were observed only in the SMA but not across a broader functional network during CS.Therefore, we further deepened the exploration by the subsequent multi-modal analysis.

Spatiotemporal dynamic differences between covert speech and overt speech based on fMRI-informed EEG source localization
The fMRI-informed EEG source localization method integrated the spatiotemporal features of both CS and OS.Three ROIs activated during CS were detected: The left PCG, left SMA, and left putamen (Fig. 3).These ROIs were found to be directly connected in neuroanatomy studies as the major motor-related neural circuit through basal ganglia (Alexander et al., 1986).This circuit was also in line with a motor imagery fMRI study (Makary et al., 2017), which detected higher activations in the circuit during motor imagery than during motor execution.
Furthermore, we explored the spatiotemporal dynamics that could hardly be detected by single modalities with limited resolution like EEG or fMRI.For ECoG, although it offers integrated spatiotemporal resolution, it is invasive and can hardly measure deeper signals from the subcortical regions like the putamen (Parvizi & Kastner, 2018).
In the view of event-locked EEG activation (middle column in Fig. 3), CS had weaker and shorter activations in the left PCG and the left SMA compared with OS.However, we found that CS was characterized as the earlier event-locked activation in the left putamen (peaking at 262 ms versus 1170 ms), and this difference was still valid after non-parametric re-verifications of bootstrapping and permutation tests (Supplementary Material 5).This activation time of 262 ms in the left putamen was classified into the "lemma retrieval stage" in previous reviews of the temporal dynamics speech production (Indefrey, 2011;Indefrey and Levelt, 2004), regarding as the early brain abstraction stage of word, while in contrast, activation peaks in other regions in this study were all classified into the stage of "phonetic encoding" (onset at ~455 ms), or even the final stage of articulation (onset at ~600 ms).These differences suggest that the early activation of the left putamen might represent the initial temporal stage in the neural dynamics of CS.Our results also revealed a temporal series of neural activity during CS, beginning in the left putamen and subsequently in the left PCG and the left SMA.This sequence of activations partially aligns with the proposed neural model of proactive inhibition (Aron, 2011), which is thought to be associated with CS (Nalborczyk et al., 2022).We also noticed that the left putamen played a key role in the early modulation of motor in previous studies.For example, some epileptic symptoms were interpreted as the dysfunction of putamen during the early regulation of movements (Bartolini et al., 2014;Moeller et al., 2014).Moreover, a neuronal projection analysis in monkeys found that the activation in the putamen was earlier than in the motor cortex and actual movement, meaning an action selection process occurs in the basal ganglia including the putamen (Bronfeld et al., 2011).After combining previous findings with ours, we propose that this dynamic pattern, induced by the early activity in left putamen, might lead to inhibitive modulation during the early stage of speech production, which is a key factor that transforming OS to CS.
For the power spectrum analysis, we consistently detected the larger power in OS than CS in most of the ROIs, which was in line with the previous review (Perrone-Bertolotti et al., 2014), implying apparently weaker neural activity strengths related to motor execution and potential auditory feedback during CS (Oppenheim and Dell, 2010).
The topo-plot in the left putamen revealed stronger differences between CS and OS, compared with the topo-plots of left PCG and SMA.Despite topo-plots providing limited spatial details, this distinction in the left putamen can imply stronger difference in the neural patterns between CS and OS.

Functional connectivity difference between covert speech and overt speech based on fMRI
Although fMRI-informed EEG source localization has spatiotemporal resolution advantage, it can not detect enough sources data across all ROIs.Therefore, to achieve a comprehensive analysis of functional network, we choice to analyze FC based on the typical fMRI pipeline.
To better interpret the networks, we performed the unsupervised clustering method for identifying the communities (clusters) of regions that had strong intra FC.Notably, CS and OS networks yielded the same cluster structures by this data-driven method, supporting the previous hypothesis that CS and OS partially shared similar functional networks (Alderson- Day and Fernyhough, 2015).
Our FC analysis further emphasized the role of the left putamen as the only hub node in both FC networks during CS and OS (Fig. 4), regarded as a stable connector that consistently facilitating dynamic interactions with multiple language networks (Fedorenko and Thompson-Schill, 2014).As a hub node, the left putamen bridged the left and right clusters (Fig. 4c), aligning with previous findings (Vigneau et al., 2011) about the coactivations in the left putamen towards both hemispheres, associated with both direct and high-order language functions.The lateralization distribution of clusters also justified our inclusion criteria with only male participants for a better data homogeneity, because the sex difference in language lateralization was commonly acknowledged (Sato, 2020;Shaywitz et al., 1995;Zaidi, 2010;Hirnstein et al., 2019).
After quantifying the difference in FC strength, we only detected the lower FC in CS than OS between the left putamen and the left cluster (Fig. 4d).The coactivation between the left putamen and the left cluster has been detected in previous speech studies (Viñas-Guasch and Wu, 2017), which is believed to be directly related to language processing.This FC difference only at the dominant hemisphere is also in line with the previous fMRI-lateralization study between CS and OS (Partovi et al., 2012).The putamen is also observed with achieving motor inhibition by the variance of FC from a Go-NoGo fMRI study (Akkermans et al., 2018) and in a reading-aloud task (Seghier and Price, 2010).As the inhibition of motor function in CS is still debatable (Perrone-Bertolotti et al., 2014), our FC findings provided direct spatiotemporal evidence that the left putamen, as a hub node among the functional network, achieved the inhibition of motor execution by influencing the downstream FC towards the speech-related network, especially at the dominant hemisphere.

Verifying the effect pathways between multi-modal features using path analysis and proposing the potential mechanisms of CS
We performed PA as a statistical tool to investigate the effect pathways between multi-modal features underlying CS.As a hypothesisbased model, we have explained the data-driven steps of constructing the hypothesized PA model (Fig. 5) in Supplementary Material 8. We constructed PA model by data-driven manner because it is more sensitive than model-driven for detecting relationships between features (Marrelec et al., 2007;Warren et al., 2021), thereby facilitating a comprehensive understanding of the neural mechanisms in CS.
In summary, our model, with a focus on the left putamen, identified both direct and indirect pathways influencing the transformation between CS and OS: 1.Late event-locked EEG activation demonstrated a direct positive association with OS. 2. Conversely, early EEG activations did not show a direct association with the label of CS/OS.Instead, these early activations indirectly contribute to CS through a mediating effect by reduce FC between the left putamen and the left cluster.
Here, by combining the EEG dynamics, FC, and PA results, we propose a potential mechanism interpretation of both pathways in the PA model: the left putamen can be involved in two distinct phases during the neural process of speech.In the early phase (peak at 262 ms), eventlocked activation occurs earlier during CS than during OS, involving abstract language processing and inhibition of speech motor functions.This is mediated through the inhibition of FC strength towards the left cluster.In the late phase (peak at 1170 ms), particularly for OS, it might be more involved in the auditory feedback after articulation (Kearney and Guenther, 2019), as the putamen was identified in the related neural circuits (Jürgens, 2009).This was inferred because the late event-locked activation (1170 ms) in the left putamen was only detected during OS but not during CS, which occurred later than the normal reaction time for articulation (~600 ms) (Indefrey, 2011;Indefrey and Levelt, 2004;Jarick and Jones, 2009).It was also later than the activations in most other ROIs in this study (as observed in Fig. 3 and Supplementary Material 4).Additionally, we further double-checked the reliability of the pathways by permutation tests, meaning that the proposed pathways in the model in Fig. 5 fit the real data well, which significantly outperformed the randomly permuted data.

The left putamen in self-regulation, speech dysfunctions, and BCI
Our CS findings might explain, from a neural mechanistic perspective, why CS plays such a critical role in self-regulation, as neural inhibitory pathways are considered to be the key to self-regulation (Schel et al., 2014).The involvement of the left putamen in the rapid initiation and efficient execution of inhibitory control processes might be essential for successful self-regulation (Jahanshahi et al., 2015).
Based on this hypothesis, our EEG-fMRI findings offer insights into potential spatial targets and temporal windows for future neurolinguistic and self-regulation-related behavioral therapies, particularly for deficits in self-regulation and language.For example, these findings suggest future targeted interventions focusing on CS pathways and reinforcing early activation of key regions, which might enhance inhibitory control and facilitate flexible brain FC.
Our findings of CS are complementary with many previous speechrelated dysfunctions, implying the shared mechanisms related to the left putamen.Therefore, The CS mechanisms that we proposed in this study might offer the potential treatment for these speech disorders.For example, for auditory verbal hallucination, previous study detected hyper FC between the left putamen and the Wernicke's area (which refers to the left superior temporal gyrus (L_STG) within the left cluster in Fig. 4) (Hoffman and Hampson, 2012).For stuttering speakers, some fMRI researches observed higher activation (Chang et al., 2009), influenced FC (Lu et al., 2010;Yang et al., 2016), and also larger gray matter volume (Lu et al., 2010) in the left putamen.For speech impairment in Parkinson's disease, a previous study detected an influenced strength of FC related to the left putamen and cortical speech regions (Manes et al., 2018) and lesions in the left putamen during aphasia (Barat et al., 1981).For the second language speakers (especially for non-fluent speakers), their left putamen has higher cerebral blood flow by the H 2 O 15 -PET scan (Klein et al., 1994) and larger gray matter volume by MRI (Abutalebi et al., 2013).Higher functional activity by fMRI was observed in the left putamen when they were not highly proficient in a second language (Abutalebi et al., 2013;Liu et al., 2010).
Other than speech disorders, the choice of ROIs and frequencies for decoding CS brain signals into content is important but debatable in the BCI field (Proix et al., 2022), and our study suggests focusing on the brain dynamics between the regions of the left PCG, SMA and putamen, and the functional interactions between the left putamen and speech-related regions, which might be helpful in extracting the CS content from neural measurements like EEG.

Limitations
1.The inclusion criteria are limited to only male participants to increase sample homogeneity, given plenty of reports had emphasized the gender differences in the brain production of speech (Sato, 2020;Shaywitz et al., 1995;Zaidi, 2010;Hirnstein et al., 2019).Our results also showed strong brain lateralization during speech, which was also acknowledged to be highly different between genders.This limitation can possibly make the findings less generalizable to females.
2. As for the design of the experiment, implementation of speech can have multiple ways: such as word repetition, naming, word generation, reading, and rhyme judgement, which can yield different neural patterns.In our study, we designed a five-word repetition task to focus on the direct effect of CS.The features might be shifted if the experimental manner of speech is changed.We did not try to classify the neural difference between words, as this refers to the speech decoding study in BCI that is beyond our research objective of neural mechanism.
3. The use of fMRI-informed EEG offers good spatiotemporal resolution, but as an optimization analysis, it also presents a potential limitation of higher variability of the results.We should also consider that participants could occasionally become absent-minded, miss some of the speech trials, while during CS trials we could not spot the misses.Based on these two reasons, we could reasonably expect the limitation that not all of the 32 participants contained valid EEG dipole sources within the speech-related ROI, and we calculated the results based on valid EEG dipole sources, inspired by a previous EEG-fMRI study (Timmermann et al., 2023).However, it is noteworthy that our results were still significant based on substantial data amount of thousands of repeated EEG trials and re-verified by bootstrapping and permutation tests.Although EEG-fMRI can substantially improve the spatial precision of neural activity (Michel and Brunet, 2019), decoding EEG-based brain activity from deep subcortical regions like putamen would benefit from Ageing Research Institute for Society and Education (ARISE/2017/16) and DSO National Laboratories (DSOCL21193).

Fig. 1 .
Fig. 1.Flow chart of the study.(a) The structure of the experiment in hierarchical order.The experiment consists of 1 resting state (RS) block, 5 covert speech (CS) blocks, 5 overt speech (OS) blocks.CS and OS blocks alternate with each other during the experiment.Each block consists of 20 runs, and each run consists of 5 trials.A trial is the smallest unit of the experiment, lasting 1.5 seconds, during which a single word is displayed and spoken.Rest intervals are set between blocks, runs, and trials.(b) The analyzing pipeline of the simultaneous EEG-fMRI data.Except for the basic single-modality analyses, we performed the method of fMRI-informed EEG source localization for multi-modal analysis, which is constraining and screening the EEG dipole sources using the regions of interest (ROIs) extracted from fMRI analysis.At the final step, we performed the path analysis to statistically examine the hypothesized multivariate associations between various features.

Fig. 3 .
Fig. 3. Differences in EEG features between covert and overt speech tasks, including three regions of fMRI-informed EEG dipole sources that significantly activated during covert speech.Left column: The averaged scalp plot of EEG dipole sources, with outliers beyond triple standard deviations removed; Middle column: Z-scored amplitude of the EEG dipole sources, normalized by the pre-stimulus baseline period (-0.25 to 0 s); Right column: Power spectral density of the EEG dipole sources.Blue and red represent covert and overt speech, respectively.Blue and red asterisks mark significant activations for both speech types (*: P < 0.05, **: P < 0.01, ***: P < 0.001, ****: P < 0.0001, corrected by false discovery ratio (FDR)).Line plots were smoothed with 20 adjacent points.Shadow indicates the 95 % confidence intervals by standard error of mean (SEM).(a) Features of EEG sources in the left precentral gyrus.(b) Features of EEG sources in the left supplementary motor area.(c) Features of EEG sources in the left putamen.

Fig. 4 .
Fig. 4. Differences in fMRI functional connectivity (FC) between overt speech (OS) and covert speech (CS) tasks.Region abbreviations are detailed in Table 2. (a) Upper panel: Adjacency matrices display the FC networks for OS and CS, which are represented by the T-values compared with resting state (RS), as well as the comparison between OS and CS.White asterisks highlight region pairs that have significantly greater FC than RS under T-tests (Bonferroni-corrected P < 0.05).Lower panel: Bar plots showing the degree of each node within the adjacency matrices, for OS and CS separately.The left putamen showed as the only hub node in both FC networks.(b) Spatial representation of two identified functional clusters in Fig. 4a, which distributed bilaterally in the brain.(c) Cluster views of network structures highlight the left putamen as a hub node in both OS and CS networks, which also bridged the two clusters by significant FCs.Edge colors are under the same color bar as in Fig. 4a.(d) T-tests showed that FC strengths between the left putamen (L_PUT) and left cluster (L_CLU) was significantly reduced in CS compared with OS (t(31) = 2.868 and corrected P-value = 0.0365).Other comparisons were not significant, including between L_PUT and right cluster (R_CLU), within L_CLU, within R_CLU, and between L_CLU and R_CLU.Error bars represent standard error of mean (SEM).

Fig. 5 .
Fig. 5. Path analysis showing the significant associations from EEG dynamics and fMRI functional connectivity (FC) in the left putamen towards the binary classification of covert speech (CS) and overt speech (OS).The model fit the data well based on the suggested cutoff.This model hypothesized and fit the associations among three types of features: 1.Early and late EEG dipole source activations in the left putamen (L_PUT) identified in Fig. 3c; 2. FC between the L_PUT with left (L_CLU) and right clusters (R_CLU) identified in Fig. 4; 3. The binary classification labels of speech trials as covert (marked as 1) or overt (marked as 0).Significant associations (solid blue arrows in the figure) suggest that late EEG activations in the L_PUT have direct effect towards OS.The early EEG activations leads to CS by the mediation effect of reduced FC between the L_PUT and L_CLU.Significance indicators: *** denotes P < 0.001, ** denotes P < 0.01, and N.S. denotes associations with absolute standardized regression weights below 0.1 or P-values greater than 0.05.

Table 1
EEG data processing pipeline, including detail steps from preprocessing to fMRIinformed source localization and statistical analysis.