Probing the neural signature of mind wandering with simultaneous fMRI-EEG and pupillometry


Introduction
Humans pervasively shift their attentional focus from demands in the environment toward self-generated, task-unrelated trains of thought (TUTs), leading to performance errors during tasks that require sustained vigilance (Smallwood and Schooler, 2015). Although this phenomenon, also termed mind wandering, has attracted increasing interest in the past decades, its underlying neural signature remains an open question.
Converging evidence from functional magnetic resonance imaging (fMRI) studies indicates an association between activity in areas of the default mode network (DMN) and mind wandering (Mason et al., 2007; Christoff et al., 2009). These areas behave antagonistically with a task-positive, or anticorrelated, network (ACN) that generally comprises regions of the frontoparietal control (FPCN) and dorsal attention (DAN) networks (Fox et al., 2005; Mittner et al., 2014). Although these findings support a major role for the DMN in internal mentation, more (ICNs) that demonstrate a stable functional organization across individuals and mental states when measured statically (Gratton et al., 2018), studies investigating the dynamic FC (at a temporal resolution of seconds) between them have described opposite associations with behavior, with greater DMN/ACN anticorrelation during vigilant attention (Thompson et al., 2013) as well as during periods of mind wandering (Mittner et al., 2014).
Cortical dynamics during internal states have also been examined with more temporally precise measures, including electroencephalography (EEG). A robust finding from these studies concerns the decrease in amplitude of event-related potentials (ERPs) prior to performance errors and self-reported TUTs (Smallwood et al., 2008; Kam et al., 2011), supporting the idea that attention is perceptually detached from external input during mind wandering episodes. Since the attenuation of sensory processing may arise from concurrent increases in alpha power that have been observed over widespread cortical areas, alpha-band activity may serve as a reliable electrophysiological correlate of mind wandering (O'Connell et al., 2009; Compton et al., 2019; Jin et al., 2019).
New lines of research suggest that fluctuations in attention are modulated through the locus coeruleus/norepinephrinergic (LC/NE) system (Aston-Jones and Cohen, 2005; Mittner et al., 2016). Specifically, changes in tonic and phasic NE levels are proposed to facilitate transitions between exploratory and exploitative states in order to optimize behavior. These dynamics have been derived from changes in pupil size at baseline and in response to stimuli (Gilzenrat et al., 2010). Whereas (phasic) pupil responses seem consistently smaller during TUTs, changes in (tonic) baseline pupil size have yielded different results across experiments (Smallwood et al., 2012a; Grandchamp et al., 2014; Mittner et al., 2014; Konishi et al., 2017). This suggests that there are distinct forms of mind wandering characterized by varying levels of tonic arousal and neural gain (Mittner et al., 2016; Robison, 2016, 2018).
The possibility of detecting mind wandering episodes has been examined with machine learning techniques using neural markers from different imaging modalities. For example, non-linear support vector machines (SVMs) built for EEG data were trained on mind wandering probes during SART and visual search tasks (Jin et al., 2019, 2020) and live lectures (Dhindsa et al., 2019). These studies demonstrate that EEG markers can be used to predict TUTs, and that this predictive ability can be generalized across tasks and settings. In another classification study, Mittner et al. (2014) successfully predicted self-reported TUTs across subjects with a non-linear SVM based on single-trial fMRI activity, functional connectivity, and pupillometric measures. Rather than excluding all measures that cannot be directly related to a self-reported attentional state, machine learning allows examination of data that are not interrupted by thought probing and offers a powerful tool for single-trial detection of latent cognitive processes. However, the predictive power of classifiers based on multimodal imaging datasets remains unexplored.
The interplay between temporally well-defined neural responses and spatially localized functional networks can be assessed by multimodal neuroimaging. Although studies have combined EEG with resting-state MRI to determine the electrophysiological correlates of the DMN (Neuner et al., 2014; Bowman et al., 2017; Marino et al., 2019), to our knowledge none have investigated the neural substrate of TUTs during a cognitive task. We expected that the complementary contributions of neural modalities would offer unique spatial and temporal information for detecting TUT episodes. Therefore, we present the first study of mind wandering that utilizes simultaneous fMRI-EEG and pupillometry measures during task performance. By combining multimodal neural information with machine learning, we aimed to explore the markers sensitive to the fluctuations in attention that underlie mind wandering, to ultimately gain a better understanding of its neural mechanisms. Specifically, we aimed to replicate the methods previously employed by Mittner et al. (2014), with the addition of exploring more temporally refined features from EEG.

Overview
Simultaneous fMRI-EEG and pupillometry data were collected during performance of a sustained attention task with probe-caught experience sampling. Features of interest were selected based on prior findings and extracted from each modality after preprocessing. We aimed to extend the single-trial analysis approach introduced by Mittner et al. (2014) by exploring activity and synchronicity within and between ICNs as well as changes in EEG markers and pupil size in relation to fluctuations in attentional focus. To this end, we employed a supervised learning algorithm trained to classify single trials as either 'on-task' or 'off-task' states. We then analyzed and compared the spatiotemporal signatures of the respective states. Additionally, we performed recursive feature elimination procedures across different combinations of modalities to assess the relative importance of individual features in distinguishing between on-task and off-task states. Data and code are publicly available at https://osf.io/43dp5 .

Participants
Ethical approval was obtained from the ethics review board of the University of Amsterdam. Thirty healthy adult volunteers (25 female, aged 21 ± 2.51 years) were recruited and screened for MRI compatibility with a standard safety questionnaire. Participants were eligible when none of the following exclusion criteria were met: having a (record of) neurological or psychiatric disease, impaired vision, or any contraindication for MRI such as certain medical implants or prostheses. Written informed consent was obtained prior to the experiment. Participation was compensated with a €20 reward for a total duration of approximately 1.5 h. Two participants were excluded because they ended the experiment prematurely. Therefore, we performed data analysis on 28 datasets, of which two were incomplete (one without EEG and another without eye-tracking) due to technical issues.

Sustained attention to response task
Participants performed a fast-paced sustained attention to response task (SART) that consisted of a series of non-target and target digits at an average 9:1 ratio. Stimuli were presented on a 32 inch BOLD screen using the Presentation software (Neurobehavioral Systems, Inc., Berkeley, CA). The SART was divided into two runs of 700 trials each, with a 1.4 s trial duration. At the start of each trial, a centered fixation cross was presented on a gray background for 400 ms before it was replaced by a random stimulus (digits 1 to 9) for 400 ms. Participants were instructed to respond to every digit with a button press using their right index finger unless the target stimulus (digit 3) appeared. The train of stimuli was occasionally interrupted by a thought probe to track ongoing fluctuations in attentional focus, which was formulated as the following question: "Where was your attention during the previous trials?". To respond to the probe, participants used left and right response buttons to move an arrow above a 4-point slidebar ranging from 1 (off-task) to 4 (on-task). After a fixed duration of 6 s, the location at which the arrow was positioned was registered as the response to the probe and the task continued. Participants were instructed to respond with 'off-task' when their attention was not primarily focused on the task or environmental distractions but on internal processes such as memories or personally relevant thoughts.
An online iterative algorithm was implemented to optimize the onsets of thought probes in order to maximize the probability of capturing off-task thought episodes throughout the task. To achieve this, the reaction time coefficient of variability (RT CV) was tracked as a continuous index of attentional focus based on previous findings relating mind wandering to increases in RT CV (Bastian and Sackur, 2013). For every trial that returned an RT, the RT CV was computed over the previous eight trials (RT SD / RT mean). When the RT CV crossed a threshold of either above 80% or below 20% of the entire RT CV history, the algorithm searched for a peak or trough, respectively, in the previous RT CV values. Specifically, a peak was identified if the RT CV of the second last trial (T-2) was higher than that of the third last trial (T-3) and the last trial (T-1), and the RT CV of T-1 was also higher than that of the current trial (T). Similarly, a trough was identified if the RT CV of T-2 was lower than that of T-3 and T-1, and the RT CV of T-1 was also lower than that of T. If such a pattern was detected, a probe onset was triggered. The algorithm was not activated when the current trial did not return an RT or when the RT CV did not cross the initial threshold. Thought probe onsets were constrained to have no fewer than 15 trials (21 s) and no more than 45 trials (63 s) between them. Thus, a probe onset was omitted if one had occurred within the past 15 trials but forced if one had not occurred for 45 trials, regardless of whether the current trial's RT CV reached threshold. On average, 22 thought probes (min = 19, max = 25) were presented per SART run. A short practice run of the task was completed prior to the experiment to ensure participants understood all task instructions.
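The probe-scheduling rule described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the "80%/20% of the history" criterion is assumed to be a percentile rank of the current RT CV within all previous RT CV values, and the sample standard deviation is assumed for the RT SD.

```python
from statistics import mean, stdev

WINDOW = 8                  # trials per RT CV value (from the text)
MIN_GAP, MAX_GAP = 15, 45   # allowed number of trials between probes (from the text)

def rt_cv(rts):
    """RT CV = RT SD / RT mean over the most recent WINDOW reaction times."""
    recent = rts[-WINDOW:]
    return stdev(recent) / mean(recent)   # assumption: sample SD

def percentile_rank(history, value):
    """Assumed threshold rule: fraction of RT CV values lying below `value`."""
    return sum(v < value for v in history) / len(history)

def is_peak(cv):
    # T-2 exceeds both T-3 and T-1, and T-1 exceeds the current trial T
    return cv[-3] > cv[-4] and cv[-3] > cv[-2] and cv[-2] > cv[-1]

def is_trough(cv):
    # mirror image of is_peak
    return cv[-3] < cv[-4] and cv[-3] < cv[-2] and cv[-2] < cv[-1]

def should_probe(cv_history, current_rt, trials_since_probe):
    """cv_history ends with the RT CV of the current trial T."""
    if trials_since_probe < MIN_GAP:      # too soon after the last probe: omit
        return False
    if trials_since_probe >= MAX_GAP:     # too long without a probe: force
        return True
    if current_rt is None or len(cv_history) < 4:
        return False                      # algorithm not activated
    rank = percentile_rank(cv_history, cv_history[-1])
    if rank > 0.8:
        return is_peak(cv_history)
    if rank < 0.2:
        return is_trough(cv_history)
    return False

peak_history = [0.1] * 20 + [0.2, 0.5, 0.4, 0.3]
trough_history = [0.9] * 20 + [0.8, 0.2, 0.3, 0.4]
```

With these toy histories, a probe is triggered for both the peak and the trough pattern as long as at least 15 trials have passed since the previous probe.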

Behavioral analysis
Thought probe responses were dichotomized by collapsing response options 1 and 2 into 'off-task' and response options 3 and 4 into 'on-task'. Behavioral indices of mind wandering were calculated for windows spanning 10 pre-probe trials (14 s) separately for off-task and on-task thought probes and included: (i) RT coefficient of variability (RT CV = RT SD / RT mean); (ii) omission error rate (failure to respond to non-targets); and (iii) commission error rate (failure to withhold a response to targets). We selected a 10-trial window a priori based on the assumption that mind wandering occurs in slowly fluctuating episodes spanning multiple seconds and to include sufficient data for detecting differences in error rates, which are relatively low in this experimental paradigm.
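As a minimal sketch of these three indices, the following Python snippet computes them for one pre-probe window. The trial representation (a dictionary with hypothetical keys `is_target` and `rt`, where `rt` is `None` when no response was given) is an assumption made for illustration, as is the choice to include every recorded RT in the RT CV.

```python
from statistics import mean, stdev

def dichotomize(response):
    """Collapse the 4-point probe scale: 1-2 -> off-task, 3-4 -> on-task."""
    return "off-task" if response <= 2 else "on-task"

def behavioral_indices(window):
    """Return (RT CV, omission error rate, commission error rate) for a trial window."""
    rts = [t["rt"] for t in window if t["rt"] is not None]
    nontargets = [t for t in window if not t["is_target"]]
    targets = [t for t in window if t["is_target"]]
    cv = stdev(rts) / mean(rts) if len(rts) > 1 else None
    # omission: no response to a non-target
    omission = (sum(t["rt"] is None for t in nontargets) / len(nontargets)
                if nontargets else None)
    # commission: response given to a target
    commission = (sum(t["rt"] is not None for t in targets) / len(targets)
                  if targets else None)
    return cv, omission, commission

# Toy 10-trial window: nine non-targets (one missed) and one target (not withheld)
window = [{"is_target": False, "rt": 400.0} for _ in range(8)]
window.append({"is_target": False, "rt": None})
window.append({"is_target": True, "rt": 400.0})
cv, omission, commission = behavioral_indices(window)
```

Note that a 10-trial window contains on average only one target at the 9:1 ratio, which illustrates why commission error rates are coarse at the single-window level.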

Preprocessing
Standard image preprocessing was performed in FSL (v6.0; Jenkinson et al., 2012) with custom Python scripts (v2.7.15; Van Rossum and Drake, 2011) using the Nipype framework (v1.1.8; Gorgolewski et al., 2011). Each of the two functional BOLD runs was spatially smoothed with a 6 mm full-width half-maximum Gaussian kernel using SUSAN (Smith and Brady, 1997), motion-corrected with MCFLIRT (Jenkinson et al., 2002), and slice-time corrected with slicetimer. The signal was then high-pass filtered at 1/44 Hz to remove slowly fluctuating noise such as scanner drift. The brain was extracted from T1w images with BET (Smith, 2002) and segmented into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF) with FAST (Zhang et al., 2001). To investigate task-unrelated patterns of brain activity, a general linear model (GLM) was constructed using FEAT (Woolrich et al., 2001) and included: (i) task regressors that were prepared by convolving stimulus, thought probe, and response onsets with a standard hemodynamic response function to model task-dependent BOLD signal; and (ii) nuisance regressors including six motion (direction and amplitude) parameters as well as mean time courses in WM and CSF masks. The modeled data were obtained via ordinary least-squares linear regression and subtracted from the preprocessed signal. The residual time-series were then merged across the two runs for each subject, normalized, and used for further analyses.

Feature extraction
We followed the procedure described by Mittner et al. (2014) to determine regions of interest (ROIs) by performing a seed-based correlation analysis with a prior mask of the posterior cingulate cortex (PCC; Van Maanen et al., 2011) as seed region. First, the mask was transformed to native space using FLIRT (Jenkinson and Smith, 2001) and the mean time course of voxels within the mask was correlated with all other voxels in the brain, yielding a connectivity map for each subject. Next, individual connectivity maps were registered with FLIRT to MNI space, Fisher z-transformed, and averaged to create a group-level connectivity map. The group-level map was then thresholded to locate the voxels with the 5% strongest positive and 5% strongest negative correlations with the PCC to determine the DMN and ACN, respectively (Fig. 1). Automated segmentation of the group-level thresholded maps into spatial clusters resulted in seven nodes for the DMN and six nodes for the ACN. The thresholded ROI maps were projected back to native space in order to extract the mean time-series from a 3 × 3 × 3 voxel cube centered on the peak-correlation voxel of each ROI for each subject. These individual time-series were linearly interpolated to find the signal at stimulus onset at every trial, resulting in 13 single-trial node-activity features per subject. Additionally, the mean time-series of every ROI was correlated with that of every other ROI using sliding-window correlations of 45 s, resulting in another 78 single-trial node-connectivity features per subject (i.e., 21 pairs for intra-DMN connectivity, 15 pairs for intra-ACN connectivity, and 42 pairs for inter-network connectivity).
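The feature counts above follow directly from the number of nodes, and the sliding-window correlation itself is simple to express. The sketch below uses placeholder node labels (`DMN1`, `ACN1`, ...) and a window width given in samples; both are assumptions for illustration, and the snippet is not the authors' pipeline.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")

def sliding_correlation(x, y, width):
    """Correlation of x and y within every window of `width` consecutive samples."""
    return [pearson(x[i:i + width], y[i:i + width])
            for i in range(len(x) - width + 1)]

dmn = [f"DMN{i}" for i in range(1, 8)]   # seven DMN nodes (placeholder labels)
acn = [f"ACN{i}" for i in range(1, 7)]   # six ACN nodes (placeholder labels)
pairs = list(combinations(dmn + acn, 2))
intra_dmn = [p for p in pairs if p[0] in dmn and p[1] in dmn]
intra_acn = [p for p in pairs if p[0] in acn and p[1] in acn]
inter = [p for p in pairs if (p[0] in dmn) != (p[1] in dmn)]

window_r = sliding_correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10], width=3)
```

Thirteen nodes give 13 × 12 / 2 = 78 unique pairs, which split into 21 intra-DMN, 15 intra-ACN, and 7 × 6 = 42 inter-network pairs.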

Acquisition
Continuous EEG data were concurrently acquired with an MRI-compatible, 64-channel HydroCel Geodesic Sensor system and Net Amps 300 amplifier (Electrical Geodesics, Inc., Eugene, OR, USA) and processed with Net Station (v4.5.2; Eugene, OR, USA). The cap was fitted with carbon-wire loops sensitive to movement-induced variations in the magnetic field, serving as a reference for cardioballistic artifacts (Van der Meer et al., 2016). The signal was collected at a sampling rate of 1000 Hz, online high-pass filtered at 0.1 Hz, and referenced to the Cz electrode. Electrooculography (EOG) was recorded from four electrodes positioned above and below the eyes and at the outer canthi.

Preprocessing
Data were analyzed in EEGLAB (v14.1.2; Delorme and Makeig, 2004) using MATLAB (R2018b; Mathworks, Natick, MA, United States) and BrainVision Analyzer (v2.1.2; Brain Products GmbH, Gilching, Germany). First, data were filtered with a fourth-order zero phase-shift Butterworth filter (24 dB/oct) with a low cut-off of 0.33 Hz followed by a high cut-off of 125 Hz. Next, average artifact subtraction (AAS; Allen et al., 2000) with a sliding window of 21 artifacts was used to correct for MR gradient artifacts. In addition, cardioballistic artifacts were removed with the regression-based method described by Van der Meer et al. (2016), and artifacts related to eye movement were removed with independent component analysis (ICA). Bad EEG channels were interpolated before re-referencing data to the average reference. Subsequently, data were high-pass and low-pass filtered at 1 Hz and 30 Hz, respectively, segmented into epochs from −1000 ms to 600 ms relative to stimulus onset, and DC trends were removed.
Note: the color index does not refer to specific labels but serves to aid the visual distinction of region borders.

Feature extraction
Based on previous findings, we were interested in local changes in prestimulus oscillatory power across multiple frequency bands. To extract prestimulus frequency power, data were first baseline corrected (1000 ms prestimulus) and pooled into four channel clusters centered above frontal, bilateral parietal, and occipital scalp locations (Supplementary Figure). Furthermore, we were interested in differences in amplitudes of event-related EEG signals across midline occipital (MidOcc), occipitotemporal (OccTem), midline parietal (MidPar), and midline frontal (MidFro) channel clusters, roughly corresponding to the scalp distributions of P1, N1, P300, and associated frontal ERPs, respectively (Supplementary Figure A.1B). Whereas the posterior P1 and N1 are believed to signal early perceptual processes in the visual domain, the later P300 component is thought to index working memory and related cognitive processes (Shendan and Lucia, 2010). We used an offset of 8 ms to correct for the delay from the anti-aliasing filter of the Net Amps 300 amplifier. Data were baseline corrected (100 ms prestimulus) and pooled into the aforementioned ERP clusters. Semi-automatic artifact correction was performed (gradient threshold 50 µV/ms, amplitude criterion ±100 µV, and low activity criterion 0.5 µV/100 ms) and applied to the full epoch after visual verification. The 0 to 600 ms post-stimulus time window was then subdivided into 24 bins of 25 ms and the mean of raw amplitudes was extracted for each bin at each ERP channel cluster, which generated 96 single-trial ERP features per subject.
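The binning step that yields the 96 ERP features can be illustrated as follows. This is a sketch under stated assumptions: each cluster's post-stimulus epoch is taken to be a plain list of 600 samples (0 to 600 ms at 1000 Hz), and the function name `bin_means` is hypothetical.

```python
def bin_means(epoch, bin_ms=25, srate=1000):
    """Mean amplitude in consecutive, non-overlapping bins of `bin_ms` milliseconds."""
    step = bin_ms * srate // 1000          # samples per bin
    return [sum(epoch[i:i + step]) / step
            for i in range(0, len(epoch) - step + 1, step)]

clusters = ["MidOcc", "OccTem", "MidPar", "MidFro"]

# Toy single-trial data: one 600-sample post-stimulus trace per cluster
trial = {c: [float(v) for v in range(600)] for c in clusters}

# One feature per (cluster, bin) combination
features = {(c, b): m
            for c in clusters
            for b, m in enumerate(bin_means(trial[c]))}
```

Four clusters times 24 bins gives the 96 single-trial ERP features mentioned in the text.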

Acquisition
Pupil diameter (PD) of the left eye was continuously recorded with EyeLink 1000 and EyeLink 1000 Plus tracking systems (SR Research, Ottawa, Canada) at a sampling rate of 1000 Hz.

Preprocessing
Blinks were identified using EyeLink's built-in online saccade and blink detection algorithm and linearly interpolated using the start-saccade and end-saccade markers as start and end points of each blink, respectively. Visual inspection showed that blink offset was registered prematurely across the majority of blinks, so a correctional buffer of 70 ms was added to the end-saccade markers. If blink duration exceeded 1500 ms, data between the start-saccade and end-saccade markers were removed. Remaining artifacts were identified by thresholding single-trial PD ranges (−400 ms to 1000 ms relative to stimulus onset) at the 95th percentile. Most of these extreme PD ranges were caused by large eye movements or technical issues with pupil tracking rather than physiological changes in pupil size. Trials containing such artifacts or with more than 40% missing data were excluded from further analysis (12.7% of trials across all subjects).
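A minimal sketch of the blink handling described above is given below. It assumes a 1000 Hz trace (so one sample per millisecond), blink boundaries given as sample indices, and that the 1500 ms duration check is applied before the 70 ms buffer is added; the function name and the use of `None` for removed samples are illustrative choices, not the authors' code.

```python
def interpolate_blinks(pd_trace, blinks, buffer=70, max_len=1500):
    """Linearly interpolate over blinks; remove blinks longer than `max_len` samples.

    blinks: list of (start, end) sample indices from the saccade markers.
    """
    out = list(pd_trace)
    for start, end in blinks:
        if end - start > max_len:
            # blink too long to interpolate: mark the samples as missing
            for i in range(start, end + 1):
                out[i] = None
            continue
        end = min(end + buffer, len(out) - 1)   # extend the premature offset marker
        if end == start:
            continue
        y0, y1 = out[start], out[end]
        for i in range(start, end + 1):
            out[i] = y0 + (y1 - y0) * (i - start) / (end - start)
    return out

# Toy trace: a 20-sample dropout to zero in an otherwise constant signal
trace = [5.0] * 10 + [0.0] * 20 + [5.0] * 100
cleaned = interpolate_blinks(trace, [(9, 30)])

long_trace = [5.0] * 2000
removed = interpolate_blinks(long_trace, [(100, 1700)])
```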

Feature extraction
Due to the tempo at which stimuli were presented, we found that baseline pupil fluctuations were contaminated by evoked dilations from preceding trials, preventing selection of single-trial time windows for determining baseline PD. We therefore developed a novel method for modeling pupillometric changes for fast-paced task designs, which is documented in detail in the recently developed package Pypillometry (Mittner, 2020). First, the preprocessed signal was low-pass filtered with a zero-phase shift second-order Butterworth filter, preserving signal fluctuations slower than 2 Hz. The lower peaks in the signal were then identified based on their prominence and connected through cubic spline interpolation. This resulted in a lower-peak envelope that was used as an estimate of the tonic baseline fluctuations on which the phasic pupil responses are superimposed. Consequently, single-trial baseline pupil diameter (BPD) was defined as the value of the lower-peak envelope at stimulus onset for each trial. To determine evoked pupil diameter (EPD), single-trial regressors with a delta-peak at each stimulus and response onset (if any) were prepared and convolved with an Erlang gamma function: $h(t) = s \cdot t^{n} \cdot e^{-nt/t_{\max}}$, where $s = 1/10^{24}$ is a scaling constant and $n = 10$ and $t_{\max} = 930$ are empirically determined constants (Hoeks and Levelt, 1993). After subtraction of the baseline signal, the data were fitted with a linear regression model. Since pupil diameter cannot physiologically reach a value below zero, the beta coefficients of the model were constrained to be positive with a non-negative least-squares solver as implemented in scipy.optimize.nnls() by using the formula: $\arg\min_{b} \lVert Xb - y \rVert_{2}$ subject to $b \geq 0$ (Lawson and Hanson, 1987). Single-trial estimators for EPD were then defined as the estimated $b$ coefficients at each trial.
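To make the evoked-response model concrete, the sketch below builds the Erlang gamma kernel, places one shifted copy per event in a design matrix, and recovers non-negative coefficients with `scipy.optimize.nnls`. It is a simplified illustration, not the Pypillometry implementation: the kernel duration, the toy onsets, the noise-free signal, and the helper names are assumptions, and time is taken to be in milliseconds at 1000 Hz.

```python
import numpy as np
from scipy.optimize import nnls

def pupil_kernel(duration_ms=4000, n=10.0, t_max=930.0, s=1e-24):
    """Erlang gamma response h(t) = s * t^n * exp(-n * t / t_max), t in ms."""
    t = np.arange(duration_ms, dtype=float)
    return s * t**n * np.exp(-n * t / t_max)

def design_matrix(onsets_ms, n_samples, kernel):
    """One column per event: a delta peak at the onset convolved with the kernel."""
    X = np.zeros((n_samples, len(onsets_ms)))
    for j, onset in enumerate(onsets_ms):
        stop = min(n_samples, onset + len(kernel))
        X[onset:stop, j] = kernel[: stop - onset]
    return X

kernel = pupil_kernel()
onsets = [0, 1400, 2800]                 # toy stimulus onsets, 1.4 s apart
X = design_matrix(onsets, 8000, kernel)

true_b = np.array([1.0, 0.5, 2.0])       # toy response amplitudes
y = X @ true_b                           # baseline-subtracted signal (noise-free here)

b_hat, residual_norm = nnls(X, y)        # argmin ||Xb - y||_2 subject to b >= 0
```

Because successive responses overlap at a 1.4 s trial duration, fitting all events jointly is what allows the per-trial amplitudes to be separated; the kernel peaks at t = t_max = 930 ms.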

Supervised machine learning
Following previous machine learning studies of mind wandering (e.g., Mittner et al., 2014; Jin et al., 2019), we used a non-linear support vector machine (SVM) with radial basis functions (RBF) as kernel to classify single trials into on-task or off-task attentional states with scikit-learn (sklearn.svm; Pedregosa et al., 2011). SVM classifiers attempt to separate classes with a hyperplane that is optimized by maximizing its margin. Besides generally being well understood and effective in high-dimensional settings, SVMs do not require a linear relationship between target labels and predictor variables and were shown to outperform (linear) logistic regression analysis when predicting mind wandering with EEG (Jin et al., 2019). The SVM-RBF was trained on a dataset containing the three trials (4.2 s) preceding each thought probe, resulting in n = 3655 trials that were assigned the dichotomized probe responses as target labels. Training was based on a total of 205 single-trial features that could be grouped into five modalities: (i) activation in seven DMN and six ACN nodes; (ii) intra-network and inter-network dynamic functional connectivity; (iii) prestimulus frequency power at four channel clusters; (iv) ERP amplitudes at four channel clusters [MidFro, MidPar, OccTem, MidOcc] in 24 time windows; and (v) baseline and evoked PD. Features in the fMRI and pupil modalities were standardized (z-scored) within each subject, whereas the frequency power features were standardized within subjects and channel clusters. The ERP features were first baseline corrected by subtracting the mean at stimulus onset at each trial for each ERP within subjects and then standardized by dividing by the standard deviation across trials for each subject.
First, tuning parameters for the SVM-RBF were optimized through grid search over a large range of values ($2^{-1}$ to $2^{15}$ for the soft-margin parameter $C$ and $2^{-20}$ to $2^{0}$ for the kernel width $\gamma$) and leave-one-subject-out cross-validation (LOSOCV), using the F1 metric as objective function. In this procedure, the classifier was trained on all possible combinations of datasets of size n − 1 in order to predict the one dataset that was left out. Classification performance was measured as the accuracy, recall, and precision averaged across all folds, where recall (sensitivity) reflects the ability to detect positive cases and precision (positive predictive value) is the proportion of predicted positive cases that were truly positive. Second, the optimal set of features was determined with recursive feature elimination (RFE), in which all possible feature sets of size n − 1 were evaluated with LOSOCV. The feature set with the highest cross-validated (CV) mean F1 score was then selected, resulting in the elimination of one feature at every iteration. This process was repeated until the size of the feature set was n = 1. The feature set that produced the highest mean CV accuracy across all iterations was then selected as the final set and used to classify the remaining, unlabeled data.
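The elimination loop described above amounts to a greedy backward search. The following sketch shows that loop in isolation; the scoring callables are placeholders standing in for the LOSOCV mean F1 (used to choose which feature to drop) and the LOSOCV mean accuracy (used to pick the final set). The function name and the toy scoring rule are hypothetical and serve only to make the example runnable.

```python
def backward_elimination(features, step_score, final_score=None):
    """Greedy backward feature elimination.

    step_score:  scores a candidate feature set at each iteration
                 (in the text: cross-validated mean F1).
    final_score: scores the winner of each iteration to choose the final set
                 (in the text: cross-validated mean accuracy).
    """
    final_score = final_score or step_score
    current = list(features)
    best_set, best_value = list(current), final_score(current)
    while len(current) > 1:
        # all candidate sets of size n - 1 (one feature dropped at a time)
        candidates = [[f for f in current if f != dropped] for dropped in current]
        current = max(candidates, key=step_score)
        value = final_score(current)
        if value > best_value:
            best_set, best_value = list(current), value
    return best_set, best_value

# Toy scoring rule: features "a" and "b" are informative, everything else is noise
def toy_score(feature_set):
    informative = len({"a", "b"} & set(feature_set))
    noise = len(set(feature_set) - {"a", "b"})
    return informative - 0.1 * noise

best_set, best_value = backward_elimination(["a", "b", "x", "y"], toy_score)
```

With 205 features, each iteration requires one full cross-validation per remaining feature, so the procedure is computationally demanding; the sketch makes no attempt to cache or parallelize these evaluations.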
Additionally, we performed a cross-modality RFE procedure for each of the five modalities separately (node activity, functional connectivity, frequency power, ERP amplitudes, and pupil diameter), for each combination of modalities (all doubles, triples, and quadruples), as well as for the full five-modality classifier described above. This resulted in a total of 31 independent classifiers that allowed assessment of the pattern of feature elimination across different combinations of modalities. The proportion of times a feature survived elimination in a classifier relative to the number of times the modality was represented was used to indicate a feature's importance (0 being always eliminated and 1 being never eliminated), or the amount of predictive information as perceived by the classifier with respect to distinguishing off-task from on-task trials.
Fig. 2. The effect of dropping modalities from support vector machines on cross-validated (CV) classification performance. Averages and error bars (SE) are calculated across all 31 fits from the cross-modality RFE procedure. Classification performance increases as a function of the number of modalities added to the classifier. Note that exclusion of dynamic functional connectivity features (red) results in the lowest accuracy scores, suggesting that classification of attentional state was mostly driven by information contained in this modality.
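The count of 31 classifiers and the importance score can be reproduced with a short sketch. The modality labels and the dictionary that maps each modality combination to its surviving features are hypothetical data structures chosen for illustration; the toy survival pattern is made up and does not reflect the reported results.

```python
from itertools import combinations

modalities = ["node_activity", "connectivity", "frequency_power", "erp", "pupil"]

# Every non-empty combination of the five modalities: 2**5 - 1 = 31 classifiers
subsets = [c for k in range(1, len(modalities) + 1)
           for c in combinations(modalities, k)]

def importance(feature, modality, survivors_by_subset):
    """Proportion of classifiers containing `modality` in which `feature` survived RFE."""
    fits = [s for s in survivors_by_subset if modality in s]
    return sum(feature in survivors_by_subset[s] for s in fits) / len(fits)

# Toy survival pattern: "BPD" survives in every second classifier that includes pupil
toy_survivors = {s: set() for s in subsets}
with_pupil = [s for s in subsets if "pupil" in s]
for i, s in enumerate(with_pupil):
    if i % 2 == 0:
        toy_survivors[s].add("BPD")

bpd_importance = importance("BPD", "pupil", toy_survivors)
epd_importance = importance("EPD", "pupil", toy_survivors)
```

Each modality is represented in 16 of the 31 classifiers, so a feature's importance is the number of those 16 fits it survived divided by 16.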

Behavioral performance is impaired during mind wandering
During the SART, participants indicated on 42.6% of total thought probes that their attention was focused on internal trains of thought rather than on the task or external distractions. In line with our expectations, behavioral performance was significantly worse preceding off-task reports, with higher RT CV

Modalities contribute to the prediction of mind wandering episodes
The optimized SVM-RBF performed single-trial classification with a mean accuracy of 65% (F1 = 0.51, 57% recall and 54% precision) based on a set of 74 features (36.1% of total), indicating an above chance-level ability to predict the incidence of TUT episodes. The cross-modality RFE procedure furthermore revealed a linear increase in accuracy with increasing number of modalities added to the classifier, suggesting that features from each modality contribute unique spatial and temporal information that improves the prediction of TUTs ( Fig. 2 ). Collectively, intra-network and inter-network functional connectivity features carried most of this predictive information, as all classifiers performed worse when this modality was excluded. Individual feature importance scores from the cross-modality RFE procedure are presented in Supplementary Figure A.2.

The multimodal neural signatures of mind wandering
After supervised classification learning, all features were standardized and averaged separately for all trials classified as either off-task or on-task (Fig. 3). Whether a feature survived the elimination procedure of the optimized five-modality SVM-RBF was interpreted as an indication of that feature's significance in predicting TUT episodes. Contrary to expectations, all nodes of the DMN showed a stronger mean signal in on-task trials compared to off-task trials. In contrast, all nodes of the ACN were more active during off-task trials, with the exception of the right-SMG (Fig. 3A). Whereas most nodes were selected in the optimized classifier, the PCC and right-SFG (DMN) and right-DLPFC (ACN) did not survive feature elimination, suggesting that signal fluctuations within these regions were not predictive of TUT episodes.
For both networks, nodes were more often positively correlated with each other during on-task trials compared to off-task trials (28 of 36 node-pairs; Fig. 3 B). Interestingly, four of five intra-DMN connections that were positively correlated during off-task were connected to the PCC, including: left-MTG, right-AG, and bilateral SFG. From these, the PCC to left-MTG connection was the strongest, whereas the connections with the SFG and the remaining connection (right-SFG to left-MTG) were weakest and did not survive feature elimination. For the ACN, all three node-pairs that were positively correlated in off-task trials were selected by the optimized SVM-RBF (from strongest to weakest: right-SMG to right-INS, left-INS to SMA, and right-SMG to left-DLPFC [visible in the coronal view of the ACNxACN plot in Fig. 3 B]).
Whereas most of the intra-network connections were positively correlated during on-task, the majority of inter-network node-pairs were positively correlated during off-task (38 of 42 node-pairs; Fig. 3 C). Thus, whereas information in the PCC and right-SFG themselves did not distinguish between on-task and off-task states, their functional interregional connections seem important for predicting TUT episodes. Similar roles for the SMA and left-INS are unsurprising given their high anatomical and functional heterogeneity and their involvement in domain-general cognitive processes ( Uddin et al., 2017 ;Cona and Semenza, 2017 ;Ruan et al., 2018 ).
With respect to the pupil features, BPD was selected in the optimized SVM-RBF and showed more dilation in off-task compared to on-task trials, indicating higher levels of tonic NE (Fig. 3D). Pupillary response, however, did not seem to differentiate between the two states and was eliminated. Similarly, we observed that early positive and negative peaks reflecting P1 and N1 components, respectively, were more pronounced in off-task states, indicating the absence of attenuated early perceptual processing (Fig. 3E). However, decreased amplitudes, especially at the midline frontal and parietal clusters from 250 to 300 ms onward, implicate reduced information processing during off-task states at later latencies. Although several early bins did survive feature elimination, the majority of retained features occurred after the 200 ms post-stimulus mark (8 of 13 bins), suggesting that late rather than early event-related signals were predictive of mind wandering.
The frequency power analysis revealed a global increase in prestimulus alpha, theta, and delta power during mind wandering, with the exception of delta power over the right parietal cortex ( Fig. 3 F). In contrast, beta power was consistently reduced in off-task compared to on-task trials across the scalp. Although bilateral parietal alpha and beta features also survived elimination, the greatest changes in power were observed over the occipital cortex. None of the theta features were selected in the optimized SVM-RBF, suggesting that theta power itself did not contribute to classification and that the predictive information contained in theta fluctuations was instead carried by other features.

Discussion
The detection of ongoing covert cognitive processes in humans poses significant methodological challenges. The present study provides new insights into the neural markers that reflect the attentional shift from externally oriented cognition toward self-generated trains of thought. By integrating single-trial features across multiple neural modalities in a classification learning algorithm, we showed that specific patterns of fMRI activity and connectivity, EEG markers, and baseline pupil size were predictive of TUTs. Although each neural modality provided unique information that improved classification performance, the greatest predictive power was carried by dynamic interactions within and between intrinsic connectivity networks (ICNs), including the DMN and ACN.
Our results indicate recruitment of ACN nodes during TUTs. This finding is not surprising given the growing body of evidence advocating a role for these regions in spontaneous thought processes ( Christoff et al., 2009 ;Fox et al., 2015 ;Dixon et al., 2018 ). Specifically, their recruitment has been suggested to reflect a mechanism in which top-down control systems exert deliberate constraints on the stream of internally-oriented thoughts in order to guide them toward motivationally relevant or rewarding goals ( Christoff et al., 2016 ;Shepard, 2019 ). According to this view, mind wandering may be characterized by the redistribution of executive and attentional resources toward the internal environment driven by the prioritization of relevant information ( Turnbull et al., 2019b ).
In line with this, it has been argued that attentional decoupling in the form of suppression of sensory inputs may serve adaptive functions by insulating the stream of thought from external interference ( Kam and Handy, 2013 ;Smallwood, 2013 ). Although we did not find evidence for deficits in early sensory processing, our results may be interpreted as cognitive disengagement from task-relevant information as reflected in reduced amplitudes of P300 and midfrontal ERPs prior to self-reported TUTs. Correspondingly, task performance was significantly affected as indexed by increased RT variability and error rates. This corroborates an earlier finding ( O'Connell et al., 2009 ) and may imply that the shallow processing of visual information remains relatively unimpaired during mind wandering, whereas later cognitive and decision-making processes involved in assimilating the deeper meaning of stimuli needed to accurately perform the task are disrupted.
Contrary to expectations, we did not observe any increase in DMN activity during mind wandering. Although this finding seems counterintuitive, previous studies have reported a similar association between the recruitment of DMN regions and optimized behavior ("in-the-zone"), whereas suboptimal behavioral performance ("out-the-zone") was instead associated with DAN activation ( Esterman et al., 2014 ;Kucyi et al., 2017 ;Yamashita et al., 2020 ). Although speculative, together these findings may point to DMN activity during task-focused attention as representing a weaker engagement in goal-directed behavior or attentional stability needed to accurately perform the task. Indeed, it is generally assumed that habitual response tendencies are developed early during repetitive tasks such as the SART and thus stable performance may rely more heavily on automatic processes ( Hawkins et al., 2019 ). As previous work has suggested a role for the DMN in automated cognition as opposed to mindful, focused attention ( Shamloo and Helie, 2016 ;Vatansever et al., 2017 ;Scheibner et al., 2017 ), our findings may be tentatively interpreted as a lesser engagement of top-down resources during the (more automated) task-focused state compared to the (more goal-directed) mind wandering state ( Christoff et al., 2016 ;Seli et al., 2016 ).
An alternative explanation may be that parts of the DMN, specifically its core nodes (PCC and mPFC), are not directly involved in mind wandering but rather function as a "global workspace" by tailoring their activity to the temporal dynamics of other ICNs ( Mittner et al., 2016 ). Thus, when attention is focused either externally (oriented to the task) or internally (mind wandering), functionally specific networks are recruited to support goal-directed behavior whereas converging network activity is lowered, resulting in deactivation of the PCC and mPFC. While we did not observe that single-trial activity in the PCC itself was predictive of TUTs, our results indicate high importance of the dynamic coupling between the PCC and other nodes of the DMN and ACN during both task-related and task-unrelated thought. Together with previous work ( Leech et al., 2012 ;Kucyi and Davis, 2014 ;Lin et al., 2016 ;Zhou et al., 2019 ), this finding supports the intriguing possibility that the PCC is involved in the coordination of network interactions to regulate shifts in attentional focus by maintaining or suppressing ongoing trains of thought.
Importantly, previous work has demonstrated the significance of context for the role that different networks play in ongoing thought. Activity in both the DMN and ACN has been associated with task-related as well as task-unrelated cognitive operations, depending on task difficulty ( Turnbull et al., 2019a, 2019b ;Konu et al., 2020 ). These findings align with the context-regulation hypothesis, which states that mind wandering instances are adaptively regulated depending on environmental demands in order to minimize the negative impact on maintaining task performance ( Smallwood and Andrews-Hanna, 2013 ). Thus, to better understand how complex large-scale network activity gives rise to mind wandering, specific task effects need to be considered. One such task characteristic that varies among studies is the pacing of trials. Compared to previous studies showing a link between the DMN and mind wandering, the SART design in the current study was faster paced (versus the stop-signal paradigm of Mittner et al., 2014 ), contained a lower proportion of target trials, and was shorter in overall duration (versus the SART of Christoff et al., 2009 ). Therefore, the role that the DMN plays in mind wandering during a sustained task may depend heavily on such task characteristics.
Previous work indicates that the interactions within and between ICNs dynamically reconfigure to transient changes in ongoing cognitive processes such as mind wandering ( Thompson et al., 2013 ;Mittner et al., 2014 ). Accordingly, we observed high importance of information contained in functional connectivity compared to other modalities. Specifically, our results indicate that mind wandering is associated with overall decreased connectivity within and increased connectivity between the DMN and ACN. Thus, whereas these networks are intrinsically anticorrelated at rest ( Fox et al., 2005 ), the dynamic coupling between them during sustained attentional demands may support spontaneous fluctuations in ongoing internal trains of thought ( Smallwood et al., 2012b ;Dixon et al., 2018 ).
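As an illustration of what "dynamic coupling" between two networks means at a temporal resolution of seconds, the following minimal sketch computes a sliding-window Pearson correlation between two time series. It is one common way to estimate dynamic FC and is not necessarily the procedure used in this study; the function names, the window length, and the toy `dmn` and `acn` signals are all hypothetical and chosen only for demonstration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a flat window
    return cov / sqrt(var_x * var_y)


def sliding_window_fc(ts_a, ts_b, window):
    """One correlation value per window position (step of one sample)."""
    if len(ts_a) != len(ts_b):
        raise ValueError("time series must have equal length")
    if window < 2 or window > len(ts_a):
        raise ValueError("window must be between 2 and the series length")
    return [
        pearson(ts_a[i:i + window], ts_b[i:i + window])
        for i in range(len(ts_a) - window + 1)
    ]


# Hypothetical toy signals (not data from the study): the two networks are
# anticorrelated in the first half and positively coupled in the second half.
dmn = [0, 1, 2, 3, 4, 3, 2, 1, 0, 1, 2, 3]
acn = [4, 3, 2, 1, 0, 1, 2, 1, 0, 1, 2, 3]

fc = sliding_window_fc(dmn, acn, window=4)
print([round(v, 2) for v in fc])
```

In this toy case the windowed correlation moves from strongly negative to strongly positive, whereas a single static correlation over the whole series would average the two regimes together. Per-trial values of this kind are the sort of single-trial connectivity feature a classifier could use.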
The electrophysiological origin of this coupling may concern theta-band oscillations ( Kam et al., 2019 ), which is in line with our observation of a widespread increase in theta power during TUTs, even though theta power itself was not found to be predictive of mind wandering. We also replicated increases in alpha power and reduced beta power across the cortex ( Jin et al., 2019 ;Compton et al., 2019 ;Van Son et al., 2019 ). Although the functional significance of alpha oscillations remains ambiguous, our data imply a role in active mind wandering that may involve inhibition of irrelevant representations and top-down interference ( Palva and Palva, 2011 ;Benedek et al., 2011 ). In addition, the increase in synchronized delta-band activity over frontal, left parietal, and occipital areas may have been involved in the maintenance of ongoing trains of thought by inhibiting interfering information ( Harmony, 2013 ).
Similarly, our findings indicate increases in baseline pupil size during mind wandering compared to task-focused attention, which may reflect higher levels of tonic NE and has been proposed to underlie the reduced sensitivity to external interference favoring mental exploration ( Murphy et al., 2011 ;Smallwood et al., 2012a ). Consequently, as exploitation of task-relevant information is no longer prioritized, the cognitive capacity for pursuing alternative goals that are motivationally salient is enhanced ( Bouret and Richmond, 2015 ). Possibly, the low incentive of the SART may warrant the adaptive redistribution of intrinsic motivation, regardless of its detrimental effect on performance. Together with our observations in other modalities, this implies that TUTs in our study were characterized by effortful and guided cognition rather than a state of low alertness or arousal. Although previous work also suggests a linear relationship between phasic NE and task performance, we did not observe any contributions from evoked pupil responses in differentiating attentional state.
One continuing challenge concerns the differences in measuring mind wandering, complicating the comparison of findings across studies. Research has shown that mind wandering is a non-uniform construct that varies along dimensions of intentionality ( Seli et al., 2016 ), meta-awareness ( Christoff et al., 2009 ), temporal locus ( Liefgreen et al., 2020 ), emotional valence ( Banks et al., 2016 ), self-relevance ( Bocharov et al., 2019 ), and arousal ( Unsworth and Robison, 2018 ), which likely contributes to the divergent patterns of neural activation. The current study is likewise limited by the use of unidimensional experience sampling followed by a coarse dichotomy of attentional state. Therefore, our attempt to capture the spatiotemporal dynamics of TUTs within one signature based on a single task may compromise the generalizability of our results. Although the SART is an attractive and widely used paradigm to study mind wandering, more complex designs are necessary to disentangle the effect of TUTs on other cognitive processes and behavior ( Boayue et al., 2020 ).
The low complexity of the paradigm combined with individual biases in self-report due to variation in meta-awareness or thought content may have negatively influenced classification performance. Although we achieved above chance-level detection of attentional state with 65% accuracy across subjects, a previous study reported 79% accuracy based on fMRI and pupil measures alone ( Mittner et al., 2014 ). However, other EEG classifiers showed similar detection levels of TUTs ( Dhindsa et al., 2019 ;Jin et al., 2019 ) which substantially improved when models were fitted to individual datasets, suggesting that high inter-individual variability in EEG markers can affect cross-subject classification.
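The effect of inter-individual variability on cross-subject classification can be made concrete with a toy sketch. The example below assumes a single hypothetical feature (for instance, alpha power in arbitrary units) and a simple midpoint-threshold classifier, which stands in for the SVM used in the study; all numbers are invented for illustration and are not results from this or any other dataset. Two simulated subjects show the same on-task versus off-task difference but differ in their baseline level.

```python
def fit_threshold(trials):
    """Midpoint between the mean feature value of on-task (0) and off-task (1) trials."""
    on = [x for x, label in trials if label == 0]
    off = [x for x, label in trials if label == 1]
    return (sum(on) / len(on) + sum(off) / len(off)) / 2


def accuracy(threshold, trials):
    """Fraction of trials where 'feature above threshold' matches the off-task label."""
    hits = sum((x > threshold) == bool(label) for x, label in trials)
    return hits / len(trials)


# Hypothetical trials as (feature value, label); label 1 = off-task.
base = [(1.0, 0), (0.9, 0), (2.0, 1), (1.9, 1),
        (1.2, 0), (1.1, 0), (2.2, 1), (2.1, 1)]
data = {
    "A": base,
    "B": [(x + 2.0, label) for x, label in base],  # same effect, higher baseline
}

# Cross-subject: fit on all other subjects, test on the held-out subject.
cross_scores = []
for held_out in data:
    train = [t for s, trials in data.items() if s != held_out for t in trials]
    cross_scores.append(accuracy(fit_threshold(train), data[held_out]))
cross_acc = sum(cross_scores) / len(cross_scores)

# Within-subject: fit on even-indexed trials, test on odd-indexed trials.
within_scores = []
for trials in data.values():
    train, test = trials[0::2], trials[1::2]
    within_scores.append(accuracy(fit_threshold(train), test))
within_acc = sum(within_scores) / len(within_scores)

print(f"cross-subject accuracy:  {cross_acc:.2f}")
print(f"within-subject accuracy: {within_acc:.2f}")
```

In this deliberately extreme toy case, a threshold learned from one subject falls entirely outside the range of the other, so cross-subject accuracy drops to chance even though the effect is identical within each subject. Real data are far less clear-cut, but the same mechanism could plausibly contribute to the gap between cross-subject and individually fitted models.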

Conclusion
Although proven to be detrimental to maintaining attention to task-relevant events, the capability to engage in internal trains of thought is integral to human neurocognitive functioning. More accurate detection of mind wandering episodes will lead to a more profound understanding of its effect on other cognitive processes. However, such detection is complicated as cognition evolves dynamically in complex spatiotemporal patterns. Multimodal classification enabling single-trial analyses may provide effective means to gain mechanistic insights into the neural basis of attentional fluctuations. We hope that our findings will motivate future studies to consider an agnostic, whole-brain approach to better disentangle the respective contributions of dynamic interactions. Furthermore, paradigms that allow continuous tracking of attentional intensity, combined with neuroimaging, are better suited to investigate the evolution of task-unrelated trains of thought with higher temporal precision.

Declaration of Competing Interest
None.

Supplementary materials
Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.neuroimage.2020.117412 .