Separate neural subsystems support goal-directed speech listening

How do humans make sense of goal-relevant speech amid competing voices? This feat places great demands on the collaboration between speech processing and goal-related regulatory functions. Here, we propose that separate subsystems with different cross-task dynamic activity properties and distinct functional purposes support goal-directed speech listening. We adopted a naturalistic dichotic speech listening paradigm in which listeners were instructed to attend to only one narrative from two competing inputs. Using functional magnetic resonance imaging with inter- and intra-subject correlation techniques, we discovered a dissociation in response consistency in temporal, parietal and frontal brain areas as the task demand varied. Specifically, some areas in the bilateral temporal cortex (SomMotB_Aud and TempPar) and lateral prefrontal cortex (DefaultB_PFCl and ContA_PFCl) always showed consistent activation across subjects and across scan runs, regardless of the task demand. In contrast, some areas in the parietal cortex (DefaultA_pCunPCC and ContC_pCun) responded reliably only when the task goal remained the same. These results suggest two dissociated functional neural networks, which were independently validated by a data-driven clustering analysis of voxelwise functional connectivity patterns. A subsequent meta-analysis revealed distinct functional profiles for these two brain correlation maps. The different-task correlation map was strongly associated with language-related processes (e.g., listening, speech and sentences), whereas the same-task versus different-task correlation map was linked to self-referencing functions (e.g., default mode, theory of mind and autobiographical topics). Altogether, these three-pronged findings reveal two anatomically and functionally dissociated subsystems supporting goal-directed speech listening.


Introduction
Making sense of speech from a single voice in a multitalker noisy background might be difficult for artificial intelligence but is well within the competence of humans. The power to voluntarily select, control and interpret what we perceive to form a coherent conscious perception is perhaps one of the most prominent signatures of human cognition. Studies aiming to determine how the human brain reacts to ambiguous or competing sensory inputs that give rise to conscious experience are therefore important. The neural representation of auditory input is not a simple reflection of the external acoustic environment but a reconstructed reality that is biased toward goal-relevant perceptual aspects and tunes out the irrelevant components (Bizley and Cohen, 2013; Mesgarani and Chang, 2012), probably to enable fundamental cognitive functions such as attention and working memory (Campbell and Tyler, 2018; Fedorenko et al., 2012, 2013; Zanto and Gazzaley, 2013). In addition, studies using correlational analyses have reported that regions without above-baseline responses to language show great cross-subject activity consistency during narrative comprehension, including the precuneus and posterior cingulate cortex (Honey et al., 2012; Lerner et al., 2011). Intact versus scrambled versions of narrative listening can cause a dynamic reconfiguration of the default mode network, indicating its role in integrating speech information over minutes (Simony et al., 2016). However, the division of labor among the engaged brain regions and their networks in goal-directed speech listening remains unclear.
The present study aimed to identify a possible higher-level control subsystem that is highly goal-driven (e.g., determining which speaker to attend to) and separate from a stimulus-driven subsystem whose activity is mostly driven by ongoing acoustic inputs. Presumably, an effective system should be highly aware of what the goal is, to maintain focus on the task at hand and neglect distractions from irrelevant information. This internal conscious goal endogenously guides cognitive and linguistic processing toward the intended representation. For instance, trying to keep track of a teacher's speech is the volitional motivation for students to behave in class. On the other hand, previous studies have shown that attentional modulation of speech-evoked responses may reach as far as the nonprimary auditory cortex in the superior temporal gyrus (Mesgarani and Chang, 2012; O'Sullivan et al., 2019). This penetration of goal-driven activity into the auditory regions suggests a highly intertwined relationship between linguistic and top-down processes. Thus, empirical investigations are needed to examine whether there are separate neural subsystems or a more general system that may be either difficult to subdivide or only loosely composed of multiple brain regions involved in goal-directed speech listening (Sporns and Betzel, 2016).
To achieve this objective, we adapted a classical dichotic listening paradigm with minute-long speech materials, first examining brain activity consistency while the listening target changed within unchanged mixed acoustic streams. This approach provides a fine-grained measure of the extent to which brain activity is driven by the goal or by the stimuli. Behaviorally, when two different acoustic stimuli are presented simultaneously, one to the left ear and the other to the right ear, the perceptual outcomes are substantially influenced by the focus of attention (Moray, 1959). The forced-attention paradigm in dichotic listening has proven effective in studying the interaction between stimulus-driven and goal-related processes (Hugdahl and Andersson, 1986; Hugdahl et al., 2009; Westerhausen et al., 2009). Attentional modulation during dichotic listening of syllables and words involves a wide range of frontal, parietal and temporal regions (Falkenberg et al., 2011; Lipschutz et al., 2002; Pugh et al., 1996; Thomsen et al., 2004). Moreover, evidence has revealed a hierarchical topography for listening materials of various time scales (e.g., word, sentence versus paragraph), with reliable responses in higher-order brain regions and networks evoked only by meaningful paragraphs (Lerner et al., 2011).
Specifically, during functional magnetic resonance imaging (fMRI) scanning, subjects received ∼12 min of mixed narratives, with face- and architecture-based contents monaurally presented to the left and right ears, respectively. In a face-task run, the listener was required to attend to face contents in one ear while ignoring the architecture speech presented in the other ear (see Fig. 1). The task requirements were reversed in the architecture-task run. A repeated face- or architecture-task session was implemented across subjects. The activity of stimulus-driven brain regions will remain the same, regardless of the task goal, and thus the different-task response time courses will be strongly correlated (face-run and architecture-run correlations). The activity of goal-driven brain regions, in contrast, will vary as the goal changes: the response time courses will be strongly correlated under same-task conditions but weakly correlated under different-task conditions. Intersubject correlation (inter-SC) and intrasubject correlation (intra-SC) analyses of the fMRI data were performed in parallel (Chen et al., 2017; Hasson et al., 2009; Nastase et al., 2019). Combining these two measurements offers two benefits: it cross-verifies the reliability of our findings and enables a comprehensive investigation by quantifying the corresponding brain response invariance along two different dimensions: across subjects and across within-subject repetitions. Furthermore, we conducted a data-driven clustering analysis to identify the dissociation of multiple synchronized functional networks and a meta-analysis to provide external validity by assessing the corresponding cognitive profiles of our fMRI findings.
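The predicted correlation pattern can be illustrated with a toy simulation. All signals here are synthetic stand-ins (not the study's data): a hypothetical stimulus-driven voxel tracks the identical mixed acoustic input in both runs, whereas a hypothetical goal-driven voxel tracks whichever narrative is attended.

```python
# Toy illustration of the design logic, using synthetic signals only.
import numpy as np

rng = np.random.default_rng(5)
T = 360                              # ~12 min at TR = 2 s
acoustic = rng.standard_normal(T)    # mixed input, identical in both runs
face = rng.standard_normal(T)        # face-narrative content signal
arch = rng.standard_normal(T)        # architecture-narrative content signal

def noisy(sig):
    # voxel response = underlying signal + measurement noise
    return sig + 0.4 * rng.standard_normal(T)

# Stimulus-driven voxel: tracks the acoustic input regardless of task.
stim_face_run, stim_arch_run = noisy(acoustic), noisy(acoustic)
# Goal-driven voxel: tracks whichever narrative is attended.
goal_face_run, goal_arch_run = noisy(face), noisy(arch)

r = lambda a, b: np.corrcoef(a, b)[0, 1]
print(r(stim_face_run, stim_arch_run))  # high even across different tasks
print(r(goal_face_run, goal_arch_run))  # near zero across different tasks
```

Under this toy model, only the stimulus-driven voxel yields a strong different-task correlation, which is exactly the contrast the inter- and intra-SC analyses exploit.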

Subjects
Twenty-three right-handed native Mandarin speakers from South China Normal University participated in the experiment (13 females; aged 18-40 years; including one author, M.M.). Data from two subjects were excluded from further analysis, as these subjects failed to complete the whole experiment, leaving 21 valid subjects. All subjects reported normal hearing abilities and no known neurological illness. All subjects provided written informed consent before the experiments. This study was approved by the Ethics Committee of the School of Psychology at South China Normal University.

Stimuli
Our stimuli consisted of 3 pairs of 4-min (240 s) narrative stories. Each pair included a series of narratives about 3 architectural structures (e.g., an art museum, historic residences of celebrities and an ancient bell tower located in one district) and 3 faces (e.g., a family of three). Overall, 9 architectures and 9 faces were included in the listening materials. In the architecture narratives, the exteriors of multiple architectures were characterized along with the spatial/trajectorial relationships among them. For example (translated to English), "The art museum is an ellipsoidal building, and it looks like a huge silver egg…The bell tower is located to the northeast of the art museum…Four protruding corners are located on the eaves, and a copper bell hangs under each corner…". Face narratives described the facial appearances of several individuals in a particular scenario. For example, "The lady with a high ponytail is the mother. Her face is oval-shaped…She has slender eyes with single eyelids and brown pupils…She is holding a seven- or eight-year-old boy by the hand…His lips look quite fleshy and a few gaps are exposed between his teeth…". All narratives were recorded from a single female speaker (one of the authors, D.Z.) with Nuendo 4 software (Steinberg Media Technologies, Germany). Reading speed and tone were deliberately controlled. After adjusting the average volume of each narrative to a similar level, we combined the architecture and face narratives by placing one narrative in the left-ear sound channel and the other in the right-ear sound channel using GoldWave Digital Audio Editing software (GoldWave Inc., Canada). Thus, we generated three mixed narratives that lasted 12 min (720 s) in total. An audio sample of the mixed narrative is available in the Supplementary Materials. We chose 6 architecture and 6 face images (namely, 2 architectures and 2 faces from each 4-min narrative) that were tagged as target images.
Then, 3 distracting images were generated for each target image. Distractor architectures had architectural styles similar to those of the target. Distractor faces were of the same sex as the target and were similar in age to the target. Thus, the test pool included 12 sets of questions in total. All images were downloaded from online search engines commonly available on the internet.

Task and experimental design
In each run, a portion of a 4-min continuous mixed narrative was presented binaurally. The narrative was sandwiched between two 16-second rest periods. During the rest period and speech presentation, participants were instructed to keep their eyes open and look at the central fixation cross on the screen. Verbal instructions were presented before each run to let the participant know which narrative to attend to. In an architecture-attending run, for example, the instruction was as follows: "Please listen to the architecture narrative in your left ear". After attending to architecture narratives (3 architectures) and ignoring face narratives (3 persons), participants were asked to select a matching image among a set of four that best agreed with a particular item in the previously attended narrative. After the second rest period, text instructions were presented on the screen together with four image choices. An example instruction would be "Please choose the image that best fits the previously described art museum with a button press". The questions were randomized across subjects, and no question was asked twice of any subject. The time window for behavioral responses was 20 s. Notably, the experimental settings tried to mimic realistic complex scenarios in which multiple cognitive functions (i.e., attention, memory, cognitive control, etc.) were demanded. Although the goal instruction focused on attention orienting without explicit demands on memory, subjects were likely to adopt various memory strategies (e.g., episodic memory encoding) for successful speech understanding and better performance on the subsequent image test.

[Fig. 1 caption (partial): …Similarly, each 4-min architecture narrative included a series of descriptions of the external appearances of 3 architectures (A1/A2/A3). The face and architecture narratives were presented monaurally, one to the left ear and the other to the right ear. After the mixed speech presentation, participants were asked to choose the best-fitting image of the facial appearance of one particular person from the attended speech under the face-attending condition. (B) Illustration of inter- and intra-SC analyses. For each voxel, the inter-SC of the BOLD time courses was measured across subjects performing the same or different tasks (dotted black line), whereas the intra-SC of the BOLD time courses was calculated within-subject across two runs (solid black line). Samples of BOLD time courses from two representative subjects are shown in the bottom middle panel of the figure (green and red lines).]
We presented all mixed narratives three times over the course of a typical scanning session, for a total of 9 runs. In addition to a first-time face-task run and a first-time architecture-task run, subjects performed a repeated face-task run (12 subjects) or a repeated architecture-task run (9 subjects). In the repeated run, all parameters and procedures were the same, except that an alternative image-choice task was presented to prevent subjects from simply memorizing their previous response. Run 1, run 4 and run 7, as Narrative Set 1, shared the same face/architecture narrative contents but were followed by different questions, depending on the task and presentation time. Correspondingly, run 2, run 5 and run 8 constituted Narrative Set 2 with new face/architecture mixed narratives, and run 3, run 6 and run 9 constituted Narrative Set 3. The ear of the attended speech presentation switched after each run to exclude the effect of the left/right ear channels. For instance, in the first 3 runs that a subject would typically complete, he or she would be instructed to attend to the "face narrative in your left ear", "face narrative in your right ear" and "face narrative in your left ear". In addition, the task sequences and the correspondences between face/architecture narratives and left/right ear channels were counterbalanced across subjects.

MRI acquisition
MRI data were acquired with a 3 T Trio MRI scanner (Siemens, Erlangen, Germany) using a 32-channel head coil at the Brain Imaging Center of South China Normal University. BOLD signals were collected using an echo-planar imaging (EPI) sequence (repetition time (TR) = 2000 ms, echo time (TE) = 30 ms, flip angle = 90°, field of view (FOV) = 192 mm, voxel size = 3 × 3 × 3 mm, and 32 slices). A high-resolution magnetization-prepared rapid acquisition gradient-echo (MPRAGE) anatomical scan was acquired for each subject (TR = 2300 ms, TE = 3.24 ms, flip angle = 9°, FOV = 256 mm, and voxel size = 1 × 1 × 1 mm). Stimuli were presented using the Psychophysics Toolbox ( Brainard, 1997 ;Pelli, 1997 ). Visual stimuli were projected onto a screen located at the back of the scanner via an LCD projector (1280 × 960 pixel resolution). Subjects viewed the screen through a mirror placed within the head coil and wore MR-compatible dual-channel headphones that provided noise reduction and audio stimuli.

Inter-SC analysis
We conducted pairwise inter-SC analyses to assess the similarity of brain activity between subjects and quantify the invariance of BOLD responses across different subjects when attending to the same face or architecture narratives (Nastase et al., 2019). After preprocessing, we extracted the time courses for each voxel and each subject from the first-time face-task run. For each voxel, Pearson's correlation coefficients were calculated between one subject's time course and that of every other subject. Each subject was correlated with every other subject, leading to N(N − 1)/2 = 210 correlation coefficients per voxel. The coefficients were then transformed into Fisher's z within each voxel. Finally, the inter-SC for each voxel k was calculated as the average of the Fisher-z-transformed pairwise correlations r_ij(k) between subject i (i = 1, …, N − 1) and subject j (j = i + 1, …, N) as follows:

$$\mathrm{interSC}(k) = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \operatorname{arctanh}\, r_{ij}(k)$$

For statistical analyses, we used a parametric linear mixed-effects (LME) model with crossed random effects to measure the shared variance between subjects via 3dISC in AFNI (FDR corrected, p < .05). The LME approach has the advantage of achieving proper control for false positives with a relatively low computational cost (Chen et al., 2017). Using a similar procedure, we calculated the inter-SC for the scan from the first-time architecture-task run. Then, because similar networks of brain regions showed invariant responses across subjects for both face-attending and architecture-attending tasks, we defined the inter-SC same-task as the averaged inter-SC of the two tasks.
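The pairwise averaging above can be sketched in a few lines of NumPy. This is a minimal illustration with toy arrays standing in for preprocessed BOLD time courses; the function and variable names are illustrative, not from the study's code.

```python
# Sketch of the per-voxel inter-SC computation with synthetic data.
import numpy as np

def inter_sc(time_courses):
    """Average Fisher-z pairwise correlation across subjects.

    time_courses: array of shape (n_subjects, n_timepoints) for one voxel.
    Returns the mean Fisher-z over all N(N-1)/2 subject pairs.
    """
    n = time_courses.shape[0]
    z_vals = []
    for i in range(n - 1):
        for j in range(i + 1, n):
            r = np.corrcoef(time_courses[i], time_courses[j])[0, 1]
            z_vals.append(np.arctanh(r))  # Fisher's z transform
    return np.mean(z_vals)

# Toy example: 5 "subjects" sharing a common signal plus noise.
rng = np.random.default_rng(0)
shared = rng.standard_normal(360)                  # ~12 min at TR = 2 s
subs = shared + 0.5 * rng.standard_normal((5, 360))
print(inter_sc(subs) > 0)  # shared signal yields positive inter-SC
```

In practice this loop would run over every voxel, with the resulting maps passed to 3dISC for group inference as described above.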
We quantified the invariance of BOLD responses across different subjects when listening to the same stimuli but attending to different narratives by calculating the inter-SC for different task runs (inter-SC diff-task). For each voxel and each subject, Pearson's correlation coefficients were calculated based on the time series from the first-time face-task and first-time architecture-task runs. The subsequent procedures were the same as in the inter-SC same-task analysis. Finally, we mapped the inter-SC over the MNI template to delineate the extent of response similarity across subjects when attending to the same or different narratives.

Intra-SC analysis
We performed intra-SC analyses to assess the similarity of brain activity across two presentations of the same mixed stimuli with the same task (Golland et al., 2007; Hasson et al., 2009). For each voxel and each subject, we calculated the intra-SC same-task as the Pearson correlation coefficient between the two time courses from his or her first-time and repeated face-task runs or first-time and repeated architecture-task runs. Similarly, we calculated intra-SC diff-task values across two presentations of the same mixed stimuli with different tasks: Pearson's correlation coefficients for each voxel and each subject were calculated between the two time courses from his or her first-time face-task and repeated architecture-task runs or first-time architecture-task and repeated face-task runs. Each subject's coefficient was transformed into Fisher's z, and the intra-SC for each voxel k was calculated as the average of the transformed pairwise correlations r_i(k) between first-time and repeated runs across all subjects (i = 1, …, N) as follows:

$$\mathrm{intraSC}(k) = \frac{1}{N} \sum_{i=1}^{N} \operatorname{arctanh}\, r_{i}(k)$$

At the group level, we performed voxelwise one-sample t tests and identified clusters with a cluster-based control for the familywise error rate (voxel-level threshold p < .05, cluster-level threshold p < .05) using 3dClustSim in AFNI. Finally, we mapped the intra-SC same-task and intra-SC diff-task values over the MNI template to delineate the extent of within-subject response similarity across runs when the subjects attended to the same or different narratives.
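The within-subject counterpart can be sketched analogously; again, the data below are synthetic stand-ins for preprocessed voxel time courses, and the names are illustrative.

```python
# Minimal sketch of the per-voxel intra-SC computation with synthetic data.
import numpy as np

def intra_sc(run1, run2):
    """run1, run2: arrays of shape (n_subjects, n_timepoints).

    Returns the across-subject mean of the per-subject Fisher-z
    correlations between the two runs, for one voxel.
    """
    z = []
    for a, b in zip(run1, run2):
        r = np.corrcoef(a, b)[0, 1]
        z.append(np.arctanh(r))  # Fisher's z transform
    return np.mean(z)

rng = np.random.default_rng(1)
stim = rng.standard_normal(360)                       # stimulus-locked component
first = stim + 0.6 * rng.standard_normal((21, 360))   # 21 subjects, first run
repeat = stim + 0.6 * rng.standard_normal((21, 360))  # same task repeated
print(intra_sc(first, repeat))  # high when both runs track the same signal
```

A same-task pair of runs sharing a stimulus-locked component yields a high intra-SC, while unrelated runs hover near zero, mirroring the same-task versus different-task contrast used in the analysis.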

Clustering analysis
A purely data-driven approach was adopted to cluster voxels in the observed stimulus-driven (different-task correlation) and goal-driven (same-task versus different-task correlation) activation maps from our fMRI experiment. First, we selected overlapping voxels from the brain mappings produced by the inter- and intra-SC methods, considering the shared similarity of the result patterns under every experimental condition. Notably, 1867 stimulus-driven voxels and 196 goal-driven voxels survived both correlation measurements, indicating that their activity was not only significantly consistent across subjects (FDR-corrected p < .05) but also significantly consistent over time (voxelwise p < .05, clusterwise p < .05). For each voxel, we extracted its activity time course from the first-time face-task run and the first-time architecture-task run. Then, we utilized a k-means algorithm to cluster the 2063 (1867 + 196) voxels based on the measured mean activity across subjects under the face-attending and architecture-attending conditions separately. Specifically, the cosine distance measure was used to estimate dissimilarities between voxels. The Davies-Bouldin criterion was used as described in a recent study to determine the optimal number of clusters (Vergara et al., 2020) in a search of cluster numbers ranging from 2 to 9. Afterward, for each of the identified clusters, we calculated its Dice coefficient with our predefined stimulus-/goal-driven clusters. A Dice coefficient of 0 indicates no overlap, and a Dice coefficient of 1 indicates perfect overlap. We also performed a Monte-Carlo simulation test to exclude the null hypothesis that the stimulus-/goal-driven voxels were not grouped into subsystems. Specifically, for each identified cluster, we calculated the Dice coefficient with voxels randomly labeled stimulus-/goal-driven, and this routine was repeated 2000 separate times for the face-attending and architecture-attending tasks.
The p value was estimated as the proportion of the simulations with a Dice coefficient exceeding the original Dice coefficient.
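The clustering-validation logic can be sketched as follows. This is a hedged illustration on synthetic data with a hand-rolled 2-means (the study searched k = 2-9 with the Davies-Bouldin criterion); the voxel counts, signals and function names are placeholders, not the study's data or code.

```python
# Sketch: cosine-distance 2-means, Dice overlap with predefined labels,
# and a Monte-Carlo permutation test, all on synthetic placeholder data.
import numpy as np

def dice(a, b):
    """Dice coefficient between two boolean label vectors."""
    return 2 * np.sum(a & b) / (np.sum(a) + np.sum(b))

def cosine_two_means(X, iters=25):
    """Minimal 2-means on unit-normalised rows (cosine distance)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    c0 = Xn[0]                        # init: first row ...
    c1 = Xn[np.argmin(Xn @ c0)]       # ... and the row least similar to it
    centers = np.stack([c0, c1])
    for _ in range(iters):
        assign = (Xn @ centers.T).argmax(axis=1)
        for k in (0, 1):
            m = Xn[assign == k].mean(axis=0)
            centers[k] = m / np.linalg.norm(m)
    return assign

rng = np.random.default_rng(2)
t = np.linspace(0, 9, 50)
stim = np.sin(t) + 0.3 * rng.standard_normal((180, 50))  # "stimulus-driven"
goal = np.cos(t) + 0.3 * rng.standard_normal((20, 50))   # "goal-driven"
X = np.vstack([stim, goal])
labels = np.array([True] * 180 + [False] * 20)           # True = stimulus-driven

assign = cosine_two_means(X)
# Align cluster indices with the stimulus-driven label before scoring.
pred = assign == np.bincount(assign[labels]).argmax()
dc = dice(pred, labels)

# Monte-Carlo test: compare against Dice under randomly permuted labels.
null = np.array([dice(pred, rng.permutation(labels)) for _ in range(2000)])
p = np.mean(null >= dc)
```

Note that with heavily imbalanced classes the null Dice distribution is already high, so the permutation test, rather than the raw Dice value, carries the inference.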

Meta-Analysis
We performed automatic meta-analyses using Neurosynth (Yarkoni et al., 2011, www.neurosynth.org) to further explore the cognitive functions of the cortical regions and networks identified in our fMRI experiment. With over 14,000 published functional MRI studies in the database, we were able to decode the cognitive functions of activity widely distributed across the whole brain. First, we linked each brain activation consistency map from our results with the featural brain maps in Neurosynth by calculating Pearson's correlation coefficients across all voxels. This method generated a rank-ordered list of psychological concepts and brain regions associated with the whole-brain activity patterns discovered in our fMRI experiment. Then, we removed terms for anatomical structures and words that shared the same morphological root or similar meanings (e.g., "language" and "linguistic"). Words that ranked at the top of the remaining lists included "listening", "speech", "language", "sentences" and "comprehension" for the different-task mapping and "default mode", "theory mind", "autobiographical", "episodic", "memory retrieval" and "working memory" for the same-task versus different-task mapping. We also generated reverse inference maps based on these twelve selected words (FDR p < .01). Finally, to highlight the shared and unique functional "fingerprints" linked to each condition, we constructed radar plots to characterize the functional preference profile for each brain map under same-task, different-task, and same-minus-different-task conditions.
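The decoding step amounts to ranking terms by the spatial correlation between our map and each term's meta-analytic map. The sketch below shows only that ranking logic; the term maps are random placeholders, not real Neurosynth feature maps.

```python
# Conceptual sketch of Neurosynth-style decoding with placeholder maps.
import numpy as np

rng = np.random.default_rng(4)
n_vox = 10000
our_map = rng.standard_normal(n_vox)  # e.g., a thresholded consistency map
term_maps = {term: rng.standard_normal(n_vox)
             for term in ("listening", "speech", "default mode", "theory mind")}
# Make one placeholder term map resemble our map, so it should rank first.
term_maps["listening"] = our_map + 0.5 * rng.standard_normal(n_vox)

# Rank terms by voxelwise Pearson correlation with our map.
ranked = sorted(term_maps,
                key=lambda t: np.corrcoef(our_map, term_maps[t])[0, 1],
                reverse=True)
print(ranked[0])  # listening
```

With real feature maps, the top-ranked terms form the rank-ordered concept list described above.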

Behavioral results
A follow-up question asked which one of a set of four images best represented the described person or architecture in each 4-min audio segment, ensuring that subjects focused on the instructed narrative. The group-average accuracy for the image-choice task was 86 ± 12% (face task: 87 ± 18%; architecture task: 82 ± 16%), consistent with the subjectively reported higher difficulty of the architecture task relative to the face task.

Mapping the Inter- and Intra-SC under same-task conditions
We began by identifying the regions with consistent responses across subjects during dichotic speech listening when the task instructions were the same. The voxel-by-voxel inter-SC FaceTask was calculated using brain activity from the first-time face-task run of each participant. Similarly, the voxel-by-voxel inter-SC ArchitectureTask was calculated using brain activity from the first-time architecture-task run of each participant. The cortical regions with high activity consistency were very similar in the inter-SC FaceTask and inter-SC ArchitectureTask results. For simplicity, we combined the results from these two conditions into the measurement inter-SC SameTask. Consistent with previous studies (Honey et al., 2012; Lerner et al., 2011; Regev et al., 2019), we found that a wide range of temporal, parietal and frontal regions showed shared responses across subjects under same-task conditions (Fig. 2A). First, the highest consistency was observed in the auditory temporal cortex (inter-SC peak: x = 58.5, y = 16.5, z = 4.5; intra-SC peak: x = 61.5, y = 13.5, z = 4.5). Second, the consistent activation pattern spread across the superior and middle temporal gyri and the inferior frontal gyrus, covering the traditional language-specific Broca's and Wernicke's areas. Third, the activation reached a wide range of nonlinguistic regions across the frontal and parietal cortices, including regions commonly considered task-negative, such as the precuneus, posterior cingulate cortex and medial prefrontal areas.
Next, we investigated the regions that showed consistent responses in goal-directed speech listening across runs within each subject. This assessment was accomplished by performing the intra-SC analysis between first-time and repeated runs under same-task conditions. Mapping the intra-SC SameTask over the MNI template revealed a wide extent of regions with consistent activity over repeated runs ( Fig. 2B ). We observed great consistency in the activity patterns between the inter-SC and intra-SC mapping results, indicating that the response dynamics in these cortical areas were highly reliable when tracking the same speech from mixed dichotic inputs.

Mapping the Inter- and Intra-SC under different-task conditions
We performed inter- and intra-SC analyses between runs under different-task conditions to examine the functional subdivision within the brain network that supports goal-directed speech listening. The motivation for this analysis was to investigate the existence of a stimulus-driven subsystem for automatic linguistic processing. We assumed that the activity dynamics of a voxel reflect the combination of the spatiotemporal properties of the external inputs and top-down modulation from the task demand. Thus, for a stimulus-driven voxel, whose activity might be mainly driven by external inputs and less influenced by the tasks, we would expect a high degree of correlation of responses between the face-attending and architecture-attending conditions. In contrast, the response of a goal-driven voxel may fluctuate independently across tasks, leading to low correlations under different-task conditions.
We mapped the voxel-by-voxel inter-SC (Fig. 3A) and intra-SC (Fig. 3B) results under different-task conditions (face-architecture correlation) and observed consistent responses in bilateral temporal and frontal regions. To better characterize the specific brain functions and avoid misunderstandings due to broad nomenclature, we report the results with reference to a recent Schaefer-Yeo brain parcellation atlas (Supplementary Figure 1, 17 networks and 400 parcellations; Schaefer et al., 2018). The consistent activation across and within subjects was most robust in the bilateral superior temporal regions (i.e., SomMotB_Aud and TempPar). The overall patterns of the results were similar for these two correlation methods, indicating the robustness of our findings across subjects and across scan runs. As subjects were always presented with exactly the same dichotic inputs, the significant cross-task correlation values in these areas indicated a possibly automatic language processing subsystem that might be unsupervised and bottom-up stimulus-driven. Consistent activation under different-task conditions also occurred in dorsolateral prefrontal areas (i.e., DefaultB_PFCl, DefaultB_PFCd, and ContA_PFCl). This finding might indicate an automatic involvement of executive function in linguistic processing (Fedorenko and Thompson-Schill, 2014), validating our hypothesis about adaptive executive controls that support speech comprehension. Nevertheless, a simple visual comparison of these data with the correlation map under same-task conditions (Fig. 2) revealed decreases in correlated activation in both frontal and parietal regions under different-task conditions.

Comparing correlations between Same- and Different-task conditions
Next, we aimed to identify regions relevant to specific goals from the network of regions generally engaged in goal-directed speech selection and tracking ( Fig. 2 ). For each voxel, we calculated the difference in the correlation score between same-and different-task conditions. The voxel should not only have engaged consistently in dichotic speech listening (high correlation under same-task conditions) but should also have been influenced by task goals (low correlation under different-task conditions) to obtain a large difference in the correlations. Voxels that were not consistently involved in dichotic speech listening (low correlation under same-task conditions) or showed invariance to specific goals  (high correlation under different-task conditions) presented low differences in the correlations.
Using both inter-and intra-SC approaches, we again observed similar activation patterns for the goal-driven regions engaged in dichotic speech listening ( Fig. 4A and 4B ). The activity in the posterior parietal areas (DefaultA_pCunPCC and ContC_pCun) was strongly modulated by the task demand. In addition, the intra-SC results suggested that additional parietal and temporal regions (ContB_IPL, DorsAttnA_SPL, and TempPar) might be involved in this goal-driven process. Indeed, these areas showed highly consistent responses under same-task conditions ( Fig. 2 ) but low consistency of responses across different tasks ( Fig. 3 ).

Networks verified by data-driven cluster modeling
Do the observed stimulus-driven/goal-driven activation maps indeed represent separate neural networks rather than several isolated brain areas? We adopted a data-driven voxel-clustering approach to answer this question. The logic is that voxels within a coherent cognitive network would be coactivated and thus show synchronized temporal activity patterns, whereas voxels from distinct functional networks might not. We focused on the temporal synchronization (indexed by Pearson's correlation coefficients) between the time courses of voxels extracted from the activation maps. Data from 1867 stimulus-driven voxels and 196 goal-driven voxels were gauged together, leading to 2063 voxels in total. As shown in Fig. 5, the connectivity pattern appeared to fit well with the partition model based on stimulus-driven/goal-driven labels, as indicated by the orange dashed lines. Using a purely data-driven k-means method, the results revealed that the 2063 voxels were optimally classified into 2 clusters under the face-attending condition.
A follow-up analysis of the Dice coefficient (DC) showed almost perfect overlap between Cluster 1 (1773 voxels) and our predefined stimulus-driven cluster (DC = 0.96, Monte-Carlo test p < .001, same below), as well as remarkable overlap between Cluster 2 (290 voxels) and our predefined goal-driven cluster (DC = 0.73, p < .001). Under the architecture-attending condition, the 2063 voxels were also separated into two clusters, with a high DC value of 0.96 (p < .001) between Cluster 1 (1748 voxels) and the stimulus-driven cluster and a DC value of 0.70 (p < .001) between Cluster 2 (315 voxels) and the goal-driven cluster. Notably, a similar pattern was obtained using a non-k-means-based modularity analysis (see the Supplementary Materials). Taken together, distinct temporal fluctuation patterns were observed among the activated voxels, and their clustering states quantitatively aligned with our separate-network models based on the experimental findings.

Meta-analysis results
To further explore the functional roles of the significant areas in the correlation maps observed above, we characterized functional preference profiles by comparing the maps with twelve Neurosynth feature maps based on selected psychological terms (Fig. 6A). Again, we observed similar inter- and intra-SC meta-analytic results (Fig. 6B and C). This approach revealed a functional dissociation between the subnetwork that showed high response consistency under the different-task conditions (red shaded area) and the subnetwork that showed high response consistency under the same-task minus different-task conditions (blue shaded area). In particular, the former subnetwork was associated with linguistic topics (comprehension, language, speech, sentences, and listening), whereas the latter subnetwork was associated with topics related to first-person perspective-taking (default mode, theory of mind, autobiographical, episodic, memory retrieval, working memory and mentalizing).
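At its core, this profile comparison is a voxelwise Pearson correlation between each result map and each term map. The sketch below uses random arrays as stand-ins for the real Neurosynth maps; the voxel count and term list are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n_vox = 5000  # illustrative voxel count

# Stand-ins for the whole-brain maps: one consistency map and several term maps.
consistency_map = rng.standard_normal(n_vox)
terms = ["language", "speech", "listening", "default mode", "theory of mind"]
term_maps = {t: rng.standard_normal(n_vox) for t in terms}

# Functional preference profile: Pearson r between the result map and each term map.
profile = {t: np.corrcoef(consistency_map, m)[0, 1] for t, m in term_maps.items()}
```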
Combined with the empirical imaging and data-driven clustering results above, these meta-analytic results validated our predictions that at least two separate subsystems underlie goal-directed dichotic speech listening: 1) an automatic language-related processing subsystem that was more time-locked to external acoustic stimuli and less influenced by task goals, and 2) a high-level goal-driven subsystem that processed specific information about the task goal and may involve self-referencing functions through the default mode network (DMN).

Summary of the findings
At least two functionally separate networks were identified and validated to be involved in goal-directed speech listening. When participants kept track of one part of a speech stream while ignoring another in a dichotic listening paradigm, a wide range of cortical regions from the early auditory cortex to higher-level parietal and frontal areas responded reliably across subjects and over time, consistent with previous narrative listening studies (Honey et al., 2012; Regev et al., 2019). More interestingly, we identified two sets of brain regions constituting separate subnetworks with distinct cross-task reliability properties and cognitive functions. The response reliability in auditory areas (SomMotB_Aud and TempPar) and the prefrontal cortex (DefaultB_PFCl, DefaultB_PFCd, and ContA_PFCl) was not affected by task demands, suggesting that the activation in these regions was mainly driven by external acoustic input. In contrast, many parietal regions (DefaultA_pCunPCC and ContC_pCun) responded reliably only when the task goal was the same (e.g., attending to the face narrative repeatedly).

Fig. 5. Correlation heatmaps under the face-attending and architecture-attending conditions. The heatmaps represent the pairwise Pearson correlation coefficients between the time courses from the 2062 voxels in total. Stimulus-driven voxels were indexed as voxels 1-1867, and goal-driven voxels were indexed as voxels 1868-2062 (partition indicated by the dashed lines). The dots below the heatmap show the voxels that were classified as Cluster 1 (orange) or Cluster 2 (blue). DCs indicate the extent to which each identified cluster overlapped with the experimentally defined stimulus-driven and goal-driven activation clusters. In addition, a subsequent Monte-Carlo simulation test was conducted to refute the null hypothesis that the stimulus-/goal-driven voxels were not grouped into subsystems. The DC values all exceeded either the upper limits (solid black lines) or lower limits (dotted black lines) of the 95% confidence interval from the Monte-Carlo simulation test.
Moreover, using a data-driven clustering analysis, we independently validated the functional connectivity network underlying the activated voxels observed in the fMRI results. The neural clusters identified by the clustering analysis showed remarkable overlap with the organization of experimentally defined goal-driven/stimulus-driven voxels, supporting the validity of the fMRI correlational findings. Finally, a subsequent meta-analysis provided a deeper understanding of the psychological significance of the two separate neural subsystems. The functional profile based on the different-task correlation map exhibited a clear shift toward language-related processes, leaving self-referencing as the main functional characteristic of the remaining correlation map (same-task minus different-task).

Separate neural subsystems involved in goal-directed speech listening
Based on the three-pronged results, we propose that multiple anatomically and functionally dissociated subsystems are involved in goal-directed speech listening in a noisy background. The bilateral temporal and lateral prefrontal regions, which showed time-locked responses to the acoustic signals and low sensitivity to goal changes, constituted one stimulus-driven subsystem contributing to the real-time linguistic analysis. The parietal regions, including the precuneus and posterior cingulate cortex, which exhibited distinct temporal dynamics across goal conditions, constituted another goal-driven subsystem that was associated with self-referencing regulation functions ( Yeshurun et al., 2021 ).
For better visualization of the functional division of labor among the engaged brain regions/networks, we further calculated a goal-driven index (GDI; the same-task minus different-task correlation coefficient divided by the same-task correlation coefficient) for each significant voxel from the goal-directed listening map (Fig. 2). Higher GDI values indicate that the activity of the voxel was more strongly influenced by the goal. As shown in Fig. 7, a spectrum of goal sensitivity was observed across the temporal, frontal and parietal regions. Task goals strongly modulated the activity dynamics in the posterior parietal regions (higher GDI values in red) but showed much less of a modulatory effect on the temporal and dorsolateral prefrontal regions (lower GDI values in blue), consistent with the goal-driven and stimulus-driven activity maps (Figs. 3 and 4).
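Under this reading of the definition, the GDI can be written as a one-line ratio; the sketch below is a minimal formalization, and the example correlation values are invented for illustration.

```python
import numpy as np

def goal_driven_index(r_same, r_diff):
    """GDI: (same-task r minus different-task r) divided by same-task r.

    GDI near 1 -> the voxel responds reliably only under the same goal (goal-driven);
    GDI near 0 -> the voxel is equally reliable regardless of goal (stimulus-driven).
    """
    r_same = np.asarray(r_same, float)
    r_diff = np.asarray(r_diff, float)
    return (r_same - r_diff) / r_same

# Invented example: an auditory-like voxel and a parietal-like voxel.
gdi = goal_driven_index([0.60, 0.50], [0.55, 0.05])
```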
We noted that a large portion of the frontal regions (including the inferior frontal gyrus) showed intermediate levels of goal sensitivity (in a greenish color). Considering the neighboring relationship of language-specific and domain-general regions in the frontal lobe (Fedorenko et al., 2012), those frontal areas may serve as an interface where bottom-up linguistic information and high-level goal modulation meet and undergo adaptive computations, e.g., to resolve the mismatch between upcoming inputs and the task demand. The mid-level goal-driven regions also overlapped with the frontoparietal control network. Adaptive task control by the frontoparietal network has been proposed to be accomplished by flexibly altering its functional connectivity with other process-specific networks (Cole et al., 2013; Zanto and Gazzaley, 2013). Recent studies have further identified anatomical and functional coupling between the frontoparietal network and the default mode network, which may jointly enable internally directed attention to attain certain task goals (Dixon et al., 2018; Kam et al., 2019). Researchers have proposed that the activity of the frontoparietal network and the default mode network is mediated by the cingulo-opercular systems (Bressler and Menon, 2010; Cocchi et al., 2013), which are presumed to be associated with task-set maintenance (Dosenbach et al., 2007, 2008). Future studies may further investigate how extrinsic and intrinsic control-type networks conjointly interact with the language-specific network to realize the intended speech representation.

A further meta-analysis of the functional profiles of whole-brain activity confirmed our multicomponent hypothesis. A clear functional shift toward language functions was observed in the different-task correlation map, suggesting that the regions with reliable cross-task activity may undertake a major role in processing linguistic information from the mixed acoustic waveforms. The stimulus-driven characteristics also indicated that the linguistic circuit may operate unconsciously and reflect how far task-irrelevant speech information is propagated in the processing hierarchy (Beaman et al., 2007). In contrast, the contrast map between the same- and different-task conditions was functionally associated with default mode and self-referencing cognition. Parietal nodes in the default mode network (e.g., the posterior cingulate cortex) are closely related to episodic memory retrieval (Sestieri et al., 2011, 2017).

Fig. 6. Meta-analysis results. (A) Neurosynth feature maps based on twelve psychological terms. Functional preference profiles of brain maps from the inter-SC (B) and intra-SC (C) analyses under same-task conditions (goal-directed listening maps, in gray), different-task conditions (stimulus-driven maps, in red) and same-task minus different-task conditions (goal-driven maps, in blue). The values indicate the Pearson correlation coefficients, across all voxels, between the brain activation consistency maps of our results and the twelve depicted feature maps from Neurosynth.

Fig. 7. GDI maps for the inter-SC (A) and intra-SC (B) results. GDI was calculated for each significant voxel in the goal-directed dichotic listening map (Fig. 2). The index was calculated by dividing the same-task minus different-task mean correlation coefficient by the mean correlation coefficient from the same-task condition. A higher GDI represented stronger effects of the task goal on the activity of the voxel.
Converging evidence has shown that the precuneus plays an important role in the internal mental process of self-representation, as implicated in first-person perspective-taking tasks and episodic memory tasks (Cavanna and Trimble, 2006; Lou et al., 2004; Ye et al., 2018). In addition, the intraparietal sulcus is considered a central hub for perceptual organization and is associated with stimulus-driven auditory figure-ground segregation (Teki et al., 2011, 2016). Our data were consistent with the role of the intraparietal sulcus in the top-down goal modulation of perceptual outcomes (Cusack, 2005). Taken together, these parietal regions may contribute to high-level goal representation and support internally directed attention, memory retrieval and introspective processes, consistent with previous cognitive control studies (Esterman et al., 2009; Zanto and Gazzaley, 2013).

To relate the separate neural subsystems identified in the current speech listening scenario to brain network studies in the broader field, the brain patterns for stimulus- and goal-driven activity were overlaid onto eight commonly referenced functional brain networks (Yeo et al., 2011; Zhang et al., 2017): the frontoparietal network (FPN), dorsal attention network (DAN), ventral attention network (VAN), somatomotor network (SMN), visual network (VN), affective network (AFN), subcortical network (SCN) and DMN. Mapping activity onto these functional network atlases revealed that the stimulus-driven subnetwork in dichotic speech listening primarily engaged the SMN (49% and 39% for the inter- and intra-SC results, respectively; the same comparison applies below), whereas the goal-driven subnetwork supporting dichotic speech listening was distributed across the DMN (57% and 29%), FPN (29% and 21%) and DAN (5% and 27%).
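The atlas-overlap bookkeeping behind such percentages can be reproduced schematically as below. The atlas assignment and the goal-driven mask are random stand-ins, so the resulting percentages are meaningless except as a demonstration of the computation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_vox = 2062
networks = ["FPN", "DAN", "VAN", "SMN", "VN", "AFN", "SCN", "DMN"]

# Stand-in atlas: each voxel assigned to one of the eight networks.
atlas = rng.integers(0, len(networks), size=n_vox)
goal_driven_mask = rng.random(n_vox) < 0.1     # hypothetical goal-driven voxels

# Percentage of goal-driven voxels falling within each network.
counts = np.bincount(atlas[goal_driven_mask], minlength=len(networks))
pct = {net: 100.0 * c / counts.sum() for net, c in zip(networks, counts)}
```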
Furthermore, using meta-analytic maps generated by Neurosynth (https://neurosynth.org), we compared the average response consistency of different functional networks across the same-task and different-task conditions. The leave-one-out inter-SC calculation (Nastase et al., 2019) was adopted for statistical analyses. A significant decrease in response consistency under different-task conditions was observed using the "DMN" meta-analytic map (p < .001), and the magnitude of the decrease was significantly larger than that using the "speech network" meta-analytic map (p < .001). Overall, our results suggest that goal-directed speech listening involves multidimensional cognitive processes that require coordination among multiple neural regions and networks.
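The leave-one-out inter-SC computation referenced above can be sketched as follows, using synthetic subjects who share a common stimulus-locked signal; the subject count, run length, and noise level are arbitrary choices for illustration.

```python
import numpy as np

def loo_isc(data):
    """Leave-one-out inter-subject correlation: correlate each subject's
    time course with the average time course of all other subjects."""
    data = np.asarray(data, float)        # subjects x time points
    n = data.shape[0]
    total = data.sum(axis=0)
    return np.array([
        np.corrcoef(data[s], (total - data[s]) / (n - 1))[0, 1]
        for s in range(n)
    ])

rng = np.random.default_rng(4)
shared = rng.standard_normal(200)                         # stimulus-locked signal
subjects = shared + 0.5 * rng.standard_normal((10, 200))  # plus idiosyncratic noise
isc = loo_isc(subjects)
```

Averaging the remaining subjects before correlating suppresses their idiosyncratic noise, which is why the leave-one-out form yields higher and more stable values than averaging pairwise correlations.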

Inter-SC and intra-SC approaches
Correlation techniques with naturalistic narrative or movie materials have allowed us to investigate widely distributed brain activity both across scan runs within subjects and across idiosyncrasies between subjects (Chen et al., 2017; Cui et al., 2021; Nastase et al., 2019). Although the inter- and intra-SC methods may appear similar, they are conceptually different in logic and might be complementary to each other. Inter-SC measures responses shared across individuals, which are isolated from spontaneous and idiosyncratic responses within individuals. However, the psychological meaning of these idiosyncratic activities remains an open question (Hahamy et al., 2015; Hasson et al., 2009). Intra-SC reflects repeatable within-subject responses across two experimental sessions. However, measurements over time may introduce other practical issues, such as changes in familiarity and motivation. These trade-offs are probably why the observed intra-SC results have wider spatial propagation and why the inter-SC results have higher correlation values. Additionally, differences in statistical testing methods may be another factor that affects the inter- and intra-SC results. Nevertheless, the brain activity patterns obtained using these two methods were strikingly similar in our study, indicating the great reliability of this naturalistic paradigm.
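The conceptual distinction between the two measures can be illustrated with a toy voxel. Everything here is simulated: in this simple generative model both measures pick up only the shared stimulus-locked component, whereas in real data intra-SC would additionally capture idiosyncratic-but-repeatable responses that inter-SC discards.

```python
import numpy as np

rng = np.random.default_rng(5)
n_tr = 200
stim_signal = rng.standard_normal(n_tr)      # stimulus-locked component

# Two scan runs from the same subject share the signal plus run-specific noise.
run1 = stim_signal + 0.7 * rng.standard_normal(n_tr)
run2 = stim_signal + 0.7 * rng.standard_normal(n_tr)

# The same voxel in another subject shares only the stimulus-locked part.
other_subject = stim_signal + 0.7 * rng.standard_normal(n_tr)

intra_sc = np.corrcoef(run1, run2)[0, 1]             # within-subject, across runs
inter_sc = np.corrcoef(run1, other_subject)[0, 1]    # across subjects
```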
Although the correlation approaches allow us to leverage naturalistic and complex stimuli, they are limited in revealing the stimulus features that modulate brain activity. For instance, an interesting question is what level of features in the mixed narrative stimuli drives the consistent cortical activity when the task goal shifts (Fig. 3). The identification of features that contribute to stimulus-driven brain activity requires integrating results from other approaches. A large number of electrophysiological studies have identified representations of spectrogram features for both attended and unattended speech streams in auditory areas, even when the streams are mixed in a single channel (Ding and Simon, 2012a, 2012b; Golumbic et al., 2013; Mesgarani and Chang, 2012; Puvvada and Simon, 2017). Thus, low-level acoustic features likely contribute to the stimulus-driven pattern of brain activity. However, the stimulus-driven pattern in the current study was distributed widely across the superior temporal cortex and extended up to the prefrontal areas. This pattern indicated the involvement of higher-level linguistic processes (Fedorenko and Thompson-Schill, 2014). Recently, an fMRI study adopted a feature modeling approach and distinguished multilevel speech representations during a cocktail party task (Kiremitci et al., 2021). The authors reported that the representation of unattended speech extends to the linguistic level and localizes to early-to-intermediate stages of auditory processing. Despite the methodological differences, the converging evidence suggests that speech representations may carry some high-level linguistic information even when the speech stream is meant to be neglected as a distractor.
The correlational approaches also have some limitations in revealing which exact cognitive processes underlie the highly consistent activity. As suggested by an anonymous reviewer, the goal-driven activity observed in our study might be driven by memory retrieval, working memory, attentional demand or a combination of multiple cognitive processes. Although we decoded the functional profiles of the goal-driven activity maps through a meta-analysis, how these specific functions participate in goal-directed speech listening still requires investigation. Potential directions for future studies include examining the network subdivisions for each functional domain with experiments specifically designed to test these cognitive functions.
Finally, but very importantly, our results suggest that the correlation analysis reveals consistent patterns of signal fluctuations even in voxels with below-baseline activation, which would fill in the gaps left by traditional activation-based fMRI analysis. For instance, the involvement of the DMN in speech processing has been viewed as an epiphenomenon of task disengagement (Anticevic et al., 2012) due to its task-induced deactivation (Cuevas et al., 2019; Rodriguez Moreno et al., 2015) and its anticorrelation with task-positive networks (Fox et al., 2005; Smith et al., 2012). However, our results did not support a task disengagement account of the DMN in speech processing. If the DMN had been suppressed for task-related processes, it should not have shown any significant difference in activity patterns between the same-task and different-task conditions, which is inconsistent with our observations in the precuneus, a key component of the DMN (Utevsky et al., 2014). According to a recent study, listener-speaker neural coupling in the posterior DMN predicts listeners' speech comprehension, indicating its active role in narrative comprehension (Liu et al., 2022). That study also reported that the neural coupling of the posterior DMN depends on its anticorrelation strength with the executive control network rather than its level of activation. The elucidation of the functional involvement of below-baseline activity might be a valuable direction for future neuroimaging studies using inter- or intra-SC approaches.

Conclusions
Humans can successfully track a single speech stream over an extended period in the presence of other distracting speech in the background. The current study investigated the functional division of labor among the engaged brain regions/networks through the cross-task dynamic activity properties inferred from a goal-directed dichotic listening task with naturalistic speech materials. Converging evidence showed that two anatomically and functionally dissociated subsystems supported goal-directed speech listening: a stimulus-driven subsystem that consisted of the bilateral superior temporal cortex and lateral prefrontal regions and was functionally associated with acoustic-to-linguistic analysis, and a goal-driven subsystem that included a set of posterior parietal regions and was functionally linked with default mode and self-referencing functions.

Author contributions
D. Z., X. C. and M. M. designed the experiment. D. Z., X. C. and B. G. performed the experiment. L. F. Z., D. Z., B. G., F. Z., C. F., J. W. and M. M analyzed the data. L. F. Z. and D. Z. wrote the manuscript in consultation with C. F., J. W. and M. M.

Data and code availability statement
The data, experimental materials, as well as the codes that support the findings of this study are available from the corresponding author, MM, upon reasonable request.