Lack of evidence for predictive utility from resting state fMRI data for individual exposure-based cognitive behavioral therapy outcomes: A machine learning study in two large multi-site samples in anxiety disorders

CBT treatment outcomes in patients with anxiety disorders remains yet to be delivered.


Introduction
Precision medicine represents the idea of tailoring treatment to individual patient characteristics rather than offering a "one size fits all" solution (Ozomaro et al., 2013).It bears high potential for improving treatment outcomes for the cost-intensive group of patients with mental disorders not responding to first-line treatments such as cognitive behavioral therapy (CBT).Data-based predictions of individual treatment outcomes are a fundamental step towards precision medicine, as these predictions are needed to select the most suitable treatment option (Lueken and Hahn, 2020).
With machine learning, a set of tools is available that excels at detecting complex patterns and interactions in multiple predictor variables and translating them into one prediction for the individual patient (Bzdok et al., 2017;Janssen et al., 2018).These approaches compliment those using classical, univariate analyses that have already been able to detect a range of response predictors on a group-level.For example, (Pico-Perez et al., 2022) show that it is possible to associate therapy response using task-based function magnetic resonance imaging (fMRI) data; (Fullana et al., 2017) and (Cyr et al., 2021) found evidence that resting state (rs-)fMRI data can be used to associate response in therapy patients with obsessive-compulsive disorder.Consequently, a variety of studies in recent years has examined a plethora of different potential predictor variables from numerous data processing pipelines and machine learning algorithms in order to identify the most promising approaches, including clinical routine and neuroimaging data (Vieira et al., 2022).
Recent efforts to associate individual-level CBT outcomes with clinical routine data alone have only resulted in moderate prediction accuracies (58− 69%) (Hilbert et al., 2021;Hilbert et al., 2020;Hornstein et al., 2021;Taubitz et al., 2022;Wallert et al., 2022;Leehr et al., 2021;Symons et al., 2019;Symons et al., 2020).On the contrary, task-based and resting state neuroimaging data was found informative for predictive models in anxiety disorders with accuracies between 81− 92% and might provide incremental accuracy (Hahn et al., 2015;Månsson et al., 2015;Frick et al., 2020;Whitfield-Gabrieli et al., 2016).However, the neuroimaging studies have largely been plagued by very limited sample sizes (Vieira et al., 2022), which are prone to overfitting and only allow for weaker cross-validation strategies with increased risk of biased and overestimated accuracy estimates (Varoquaux, 2018;Varoquaux et al., 2017;Flint et al., 2021).In addition, past reviews emphasized that some clinical predictors could be replicated quite consistently, whereas fMRI-based predictors were reported rather in exploratory studies and replicability is not always given (Deckert and Angelika, 2019).It remains unclear whether neuroimaging data really facilitate superior prediction accuracies compared to clinical routine data, and how much incremental accuracy neuroimaging data provides when both data modalities are included in a common prediction model.
In this study, we established two independent teams that applied sophisticated machine learning analysis pipelines on separate resting state functional MRI data from two multicenter large-scale trials in anxiety disorders.Each team analyzed one dataset, but teams did not directly replicate each other.Our aim was to investigate individual CBT outcome prediction performances (e.g.reduction in relevant questionnaire scores) to examine whether the promise of neuroimaging data to provide good to very good prediction accuracy holds true in these large clinical samples, and to compare the relative contributions of neuroimaging and sociodemographic / clinical features to the final prediction performance.Our approach with two independent analysis teams allowed for separate choices regarding specific predictors (e.g., regarding data extraction, preprocessing, and feature reduction) and machine learning pipelines for both datasets.Hence, we applied distinct powerful methodological approaches mirroring the heterogeneity of analysis pipelines in real-world settings.We hypothesized i) outcome prediction accuracies both immediately after and six months after treatment based on both resting state as well as demographic and clinical data significantly exceeding chance level in both datasets, ii) resting state data providing incremental predictive power beyond the sociodemographic and clinical data alone.

Datasets
We analyzed data from two German multicenter studies on exposurebased CBT for anxiety disorders, the Protect-AD (NIMH Protocol Registration System 01EE1402A; ClinicalTrials.govID NCT02605668; German Register of Clinical Studies DRKS00008743) and SpiderVR (ClinicalTrials.govID NCT03208400) studies.Protect-AD was conducted in Berlin, Bochum, Cologne, Dresden, Greifswald, Marburg, Münster and Würzburg.Patients with a diagnosis of panic disorder, agoraphobia, social anxiety disorder, or multiple specific phobias received exposurebased CBT in vivo across twelve sessions (Heinig et al., 2017).This was a randomized controlled study with two arms, where patients received the exposure sessions in a temporally intensified fashion delivered within 2 weeks or a standard less intensive way delivered within 6 weeks.Overall, participants did improve clinically with large effect sizes (Pittig et al., 2021).As patients in both arms improved comparably in the primary outcome, we consequently did not differentiate for treatment condition for the current outcome prediction analysis and used all patients together.SpiderVR was conducted in Münster and Würzburg.Patients with a diagnosis of spider phobia received exposure-based CBT in virtuo in a single-session (Schwarzmeier et al., 2020).This was a prospective longitudinal study with only one arm, where all patients received an exposure session with a maximum duration of 2.5 h.Overall, participants did improve clinically with large effect sizes (Leehr et al., 2021) that were in line with other studies investigating virtual reality exposure treatment (Opris et al., 2012;Powers and Emmelkamp, 2008).
All participants provided written informed consent before study participation.
Both studies were approved by local Ethics Committees and performed according to the Declaration of Helsinki.Protect-AD: TUD-Ethics Review Committee (EK 234,062,014, 11/14/2014)

Patients and outcomes
Among all patients included in the Protect-AD trial, a subset of n = 309 patients completed the fMRI assessment at baseline and was thus available for the current analyses.Of these, we excluded n = 66 patients due to data loss and insufficient quality in the rs-fMRI data (Langhammer et al.), and an additional n = 23 patients due to missing primary outcome data, which is needed for the predictions.This resulted in a final sample of n = 220 patients for Protect-AD.For SpiderVR, all n = 207 patients were assessed in the scanner.Due to loss and insufficient quality in the rs-fMRI data, we excluded 17 participants, resulting in a final sample of n = 190 patients for SpiderVR.Eleven patients did not take part in the 6-month follow up (FU) assessment and thus were not included in the prediction of FU outcomes, resulting in a sample of n = 179 patients for these sub-analyses.Note that this sample is largely the same sample used in (Leehr et al., 2021), but differs slightly: we included patients who were assessed after the analyses for the previous study were completed, but had to exclude patients with bad or missing rs-fMRI images that were included in the previous study.The data used in the analysis of this manuscript have previously been analyzed univariately elsewhere (Leehr et al., 2024).
Treatment response for Protect-AD was defined as a reduction of at least 50% in the Hamilton Anxiety Rating Scale (HAMA-A; administered with the Structured Interview Guide for the Hamilton Anxiety Scale (Shear et al., 2001)) score between baseline and post-treatment assessment.This results in patients having improved their severity by at least one category post-therapy compared to pre-therapy (according to the severity cut-offs of Matza et al., 2010).This response definition was also used as the primary outcome in the clinical trial (Pittig et al., 2021).For SpiderVR, a reduction of at least 30% in the spider phobia questionnaire (SPQ (Hamm, 2006)) score between baseline and post-treatment assessment and between baseline and FU-assessment (primary outcome) was seen as treatment response.We chose this cutoff criterion as such a reduction typically leads to a post-or FU-treatment SPQ score of <20, which is defined as the cutoff for clinically significant symptoms (Hamm, 2006).This is in accordance with all other analyses of the Spider-VR project (Schwarzmeier et al., 2020).As a secondary outcome, a reduction of at least 50% in approach distance in a Behavioral Avoidance Test with a living bird spider (BAT, see (Schwarzmeier et al., 2020) for a thorough description) was seen as treatment response.Patients' demographics and clinical characteristics split between responders and non-responders are presented in Table 1.

Image acquisition and preprocessing
Protect-AD.MRI scans were acquired on harmonized scanning sequences at seven sites with 3-Tesla MRI scanners (3x Siemens TrioTim, 1x Siemens Verio, 1x Siemens Prisma, 1x Siemens Skyra, 1 x Philips Achieva).Rs-fMRI images were collected using a T2-weighted gradientecho echoplanar imaging (EPI) sequence (31-33 axial slices of 3.8 mm thickness with 10% gap per volume, TR = 2000 ms, TE = 29-30 ms, flip angle = 90 • , resolution = 3.3 × 3.3 × 3.8 mm, matrix size = 64 × voxel, field of view = 210 mm).237 vol scans were acquired in one run with a total length of approx.8 min.Patients were instructed to remain still and to close their eyes.The screen was black and the lights inside the MRI scanning room were switched off.Additionally, a T1-weighted magnetization-prepared rapid gradient-echo (MPRAGE) sequence (176 sagittal slices, 1 mm slice thickness, no gap, TR = 1900 ms, TE = 2.26 ms, flip angle = 9 • , resolution = 1 × 1 × 1 mm, matrix size = 256 × 256, field of view = 256 mm 3 ) was used to acquire a structural scan.To minimize carry-over effects, we conducted the RS fMRI paradigm before all other paradigms that were also part of our MRI sessions.Hence, there were only structural MRI/T1-weighted sequences before the RS fMRI.
Rs-fMRI data were preprocessed using the CONN toolbox and the CONN default preprocessing pipeline (Whitfield-Gabrieli, 2012) implemented in MATLAB (R2019b; The MathWorks Inc., MA, USA) and SPM12 (Penny et al., 2006).The pipeline included functional realignment and unwarping, slice timing correction, structural segmentation and normalization, functional normalization, smoothing, outlier identification and denoising.This was done centrally in order to rule out methodological differences between sites.More information on the preprocessing can be found elsewhere (Langhammer et al.).We used a selection of bilateral regions of interest (ROIs) implicated in clinical anxiety in recent meta-analyses (Chavanne and Robinson, 2021;Santos et al., 2019;Lueken and Hahn, 2016): the dorsal, pregenual, and subgenual part of the anterior cingulate cortex (ACC), the dorsomedial prefrontal cortex (DMPFC), the ventromedial prefrontal cortex (VMPFC), the dorsolateral prefrontal cortex (DLPFC), the ventrolateral prefrontal cortex (VLPFC), the orbitofrontal prefrontal cortex (OFPFC), the amygdala, the anterior and posterior insula, the hippocampus, the thalamus, the periacqueductal gray (PAG), and the bed nucleus of the stria terminalis (BNST).ROI definitions were taken from the Brainnetome atlas (Fan et al., 2016) (all ROIs except PAG and BNST; PAG based on masks created by Keuken and colleagues (Keuken et al., 2017), BNST based on the atlas for the hypothalamic region from Neudorfer and colleagues (Neudorfer et al., 2020)).ROI-to-ROI connectivity was extracted as the Fisher's-z transformed correlation between mean time series per ROI.
Additionally, graph metrics were calculated as they can characterize the properties of large-scale brain networks and their parts, which may be more predictive than classical ROI-ROI connectivity.We used ROIs as nodes and z-correlations >0.3 as edges.Four graph metrics were calculated for each node (ROI) and for the whole graph by averaging the results of all nodes: cost, global efficiency, betweenness centrality, and clustering coefficient.An introduction to graph-theory including an explanation of these metrics can be found here (Bullmore and Sporns, 2009).
Rs-fMRI data were again preprocessed with the CONN toolbox and the CONN default preprocessing pipeline (Whitfield-Gabrieli, 2012), implemented in MATLAB (R2020b) and SPM12 (Penny et al., 2006).The preprocessing pipeline was similar to the one described above.For the first ("static resting state") and second analysis ("combined static resting state and clinical data"), we extracted the ROI-to-ROI connectivity as the Fisher's-z transformed correlation between mean time series per ROI for all the n = 116 ROIs in the "automatic anatomic labeling" (AAL) atlas (Rolls et al., 2020) (http://www.gin.cnrs.fr/en/tools/aal/).This resulted in a 116 × 116 matrix.To obtain the correlations between every ROI-ROI pair excluding duplicates and self-correlations, we extracted the upper triangular part (without the diagonal), resulting in 116×115/2 = 6670 correlation values per participant.

Clinical and demographic predictor variables
Protect-AD.In addition to the neuroimaging data, we used a minimal sociodemographic and clinical data predictor set that included only age, sex, and baseline severity (HAM-A values).Particularly age and baseline severity have been consistently shown to yield predictive utility in prior CBT outcome prediction studies (Hilbert et al., 2021;Hilbert et al., 2020;Hornstein et al., 2021;Leehr et al., 2021).For a more profound analysis of the clinical data, a separate analysis using the full breadth of clinical data in the Protect-AD dataset is currently under preparation.
Spider-VR.To test whether the combination of clinical and demographic data with Rs-fMRI data would improve predictive accuracy, we repeated the data predictor set described and used in a previous study on SpiderVR data (Leehr et al., 2021), in which we could find a significant, but relatively small predictive value of the clinical and demographic alone.

Machine learning pipeline
Since the two teams worked separate from each other, the pipelines differ from each other: Protect-AD.We employed a stacked ensemble learning approach, with three first-level learners providing separate predictions based on i) the clinical and demographic predictors, ii) the ROI-to-ROI resting state

Table 1
Core sociodemographic and clinical characteristics of Protect-AD and SpiderVR datasets at pre-treatment, for the final used samples of patients and for responders and non-responders separately.Means (SD), except where noted.The categorization as responders (vs.non-responders) is based on the primary outcome measure, i.e.HAMA-A reductions of 50% (Protect-AD) and SPQ reductions of 30% (SpiderVR) from pre-to post-treatment assessment.Then, these separate first-level predictions were in turn used as features for the second-level learner that produced the final predictions.The whole procedure was conducted over 100 iterations to get robust estimates of the performance metrics independent from the current traintest split.
The first-level learners employed a train-test split with 80% of the data going into the train and 20% of the data going into the test sets, i.e., serving as out-of-sample validation samples.Patients of the minority outcome label were oversampled for a balanced outcome frequency.Missing features were imputed with their mode, median and mean as appropriate and rescaled to z-scores.For feature selection, we used an elastic net-regularized logistic regression model with stochastic gradient descent learning; features were selected for the prediction if they had above-average absolute feature weights with the help of scikit-learn's "feature_selection.SelectFromModel" function.For the prediction, we used a random forest (Breiman, 2001) with 1000 estimators and otherwise standard scikit-learn (Pedregosa et al., 2011) hyperparameters.Predictions by first-level learners in the train set were used as features to train the second-level learner and predictions by first-level learners in the test set were used as features for the prediction on the test set (see supplemental figure 1).The primary second-level learner was also a random forest with 1000 estimators and otherwise standard scikit-learn hyperparameters, while we also employed logistic regression, majority voting, softmax voting and weighted softmax voting as exploratory second-level learners (see supplemental methods for details).Significance of the predictive models was assessed by comparing the balanced accuracies of the classifier against a pseudo-classifier with 0.5 balanced accuracy (derived from the balanced outcome frequencies after resampling) over 100 iterations with Nadeau & Bengio's corrected resampled t-test (Nadeau and Bengio, 2003;Bouckaert and Frank, 2004).Due to the oversampling procedure the samples for the train and test split varied somewhat across the iterations, for the corrected resampled t-test we used the mean sample sizes across all iterations.Given the lack of significant results for this main approach, we conducted additional exploratory post-hoc analyses that varied specific aspects of the machine learning pipeline, particularly the feature selection approach, the examined outcome and the second-level learner.This also included one approach using all available ROIs from the whole brain instead the preselected ROIs based on the previous literature.Additionally, we repeated the main approach with treatment response as a dimensional measurement (change of scores/distance from pre to post treatment in percent) in a regressional machine learning approach and conducted a sanity check with synthetic data.These approaches are described in the supplemental methods.
SpiderVR.We first used only the clinical and demographic data as features, effectively replicating the analysis by Leehr et al. (Leehr et al., 2021).We then used only "static resting state" data, followed by the combination of static resting state and clinical and demographic data.Given the lack of significant results for these main approaches, we repeated the first approach with treatment response this time as a dimensional measurement (change of scores/distance from pre to post treatment in percent) in a regressional machine learning approach.We then conducted additional exploratory post-hoc analyses that used time-dependent features: a "sliding window" and an edge-functional connectivity analysis.Additionally, we repeated the sanity check of the Protect-AD part with our methodology.These exploratory approaches are described in detail in the supplemental methods.All built pipelines mainly followed a shared identical structure.We used the PHOTONAI toolbox (see https://www.photon-ai.com(Leenings et al., 2021)).Following guidelines by Poldrack and colleagues (Poldrack et al., 2020), we applied a nested cross-validation scheme in which 10 inner validation loops were used to optimize hyperparameters and 10 outer validation loops were used to estimate model performance (with 10% of the samples as test set in every step).Thus, ten percent of the sample were used as out-of-sample validation sample in every outer validation loop.Model performance was optimized for balanced accuracy.The pipeline consisted of an imbalanced data transformer to achieve balanced outcome frequency, a standard scaler to z-score the data and a principal component analysis (PCA) and/ or an F-test based univariate feature selection implemented in PHOTONAI to reduce the dimensionality of the data.Hyperparameter optimization was performed based on grid search and included a test-wise disabling of each pre-processing algorithm (Scaling, PCA, F-test based feature selection).Hence, the grid search for the inner cross-validation included a pipeline variant without the algorithm, or in other words: whether to use the respective algorithm at all was itself a hyperparameter.This approach resulted in different numbers of features available for the classification algorithm, depending on whether PCA or F-test based univariate feature selection was used (see supplemental methods for details).As classifier, we used either a support vector machine (with the parameter c as well as the choice of the kernel, either linear or radial basis function, as hyperparameters and all other parameters at their default value set by scikit learn) or a random forest classifier (with maximum depth of each tree as a hyperparameter and all other parameters again at their default value).To test for statistical significance of the models, we repeated the analyses 1000 times with permuted labels.
There were only small deviations from this approach in the different pipelines.For detailed parameters and differences between the pipelines, see supplemental methods.

Results
Predictions across both datasets performed never outperformed chance.
Protect-AD.For Protect-AD, the full prediction approach resulted in a mean balanced accuracy of 0.51 on the second-level learner, with all separate first-level learners likewise performing on chance level only (table 2).Despite being on chance level on average, we found that the corresponding area-under-the-curve for classification varied considerably across iterations.Calibration plots also indicated an inability of the second-level learner to classify responders and non-responders correctly (supplemental figure 2).Results for the exploratory analyses including hyperparameter optimization were similarly non-significant, except for the sanity check that reached a mean balanced accuracy of 99.9% (supplemental Table 1).Given that no learner performed significantly better than chance, we did not examine the importance of individual features for classification.
SpiderVR.For the Spider-VR dataset we were able to replicate the findings of the previous manuscript (Leehr et al., 2021) when using only clinical and demographic data (balanced accuracy of 0.613; see supplemental methods and table 3).However, using "static" resting state data, balanced accuracy never significantly differed from chance (between 0.498 and 0.546; see table 3), with only little difference between outer validation loops and the different approaches concerning pre-to post-treatment or pre-treatment to FU prediction, as well as predicting SPQ or BAT response.Additionally, combining the static resting state data and clinical and demographic data did not yield significant prediction results either(between 0.479 and 0.571; see table 3).This did not change when using sliding-window data (between 0.488 and 0.534) or edge-functional connectivity data (between 0.454 and 0.553; see supplemental table S3), or dimensional targets (R 2 between − .098and − .15;see supplemental table S5) as basis for the predictions.This lack of predictive accuracy is not due to the pipeline, as the pipelines with the synthetic data (sanity check) reached a balanced accuracy of 1. Supplemental table S4 shows detailed results of all ten outer folds for all classification approaches.Supplemental table S6 shows this information for the dimensional approaches.Again, we did not examine the importance of individual features for classification given the lack of classifiers performing significantly better than chance.

Discussion
In these independent studies, we applied sophisticated machine learning analysis pipelines on resting state data from two large-scale trials in anxiety disorders in order to investigate whether neuroimaging (resting state) data provide good to very good prediction accuracies in large clinical samples, and to examine the incremental contribution of neuroimaging beyond clinical and demographic data to the final prediction performance.Contrary to our hypotheses, resting state data was not able to predict exposure-based CBT responses above chance level, and did not provide incremental predictive power beyond the sociodemographic and clinical data.This was true for both datasets, for the main analyses as well as for a substantial number of exploratory analyses conducted on these datasets.Of note, prediction accuracy was even reduced when adding RS data to the feature set of clinical and demographic data in the Spider-VR sample instead of improving, suggesting that RS data introduced considerable noise in the feature set.Accounting for the considerable heterogeneity in methodological approaches in the field, we were able to show a robust null-result of predictive performance in two separate powerful analysis pipelines.
Predicting with an accuracy higher than chance is a comparably easy target in machine learning and substantially below clinical utility, which most likely requires accuracies beyond mere statistical significance to inform clinical decisions.This is true particularly for datasets with relatively equal group sizes (responders and non-responders) such as these.It was surpassed by the vast majority of published prediction studies using clinical and sociodemographic data (Vieira et al., 2022;Hilbert et al., 2021;Hilbert et al., 2020;Hornstein et al., 2021;Taubitz et al., 2022;Wallert et al., 2022;Leehr et al., 2021;Symons et al., 2020) and outperformed by previously published studies using neuroimaging data (Hahn et al., 2015;Månsson et al., 2015;Frick et al., 2020).Considering this background literature, the inability of rs-fMRI data to surpass chance level for the prediction of CBT outcomes in two large-scale datasets was surprising.This is even more the case given the high plausibility of resting state data for constructing outcome prediction models: examining resting state functional connectivity is a reliable approach to robust large-scale brain networks (Yang et al., 2020), several of which play major roles in psychopathology (Menon, 2011): For anxiety disorders, this traditionally includes the salience network, central executive network, default-mode network and ventral attention network (Sylvester et al., 2012), while alternative taxonomies add for example a negative affect network (Williams, 2016).The brain areas implicated in these networks are largely in agreement with results from recent and historical meta-analyses (Chavanne and Robinson, 2021;Etkin and Wager, 2007).On a conceptual level, network dysfunction has been related to functional changes in perception, cognition, emotion, and behavior (Sylvester et al., 2012;Williams, 2016).The high plausibility of resting state connectivity data for predicting psychotherapy outcomes has been further underscored by several studies reporting high prediction accuracies ranging between 70-81% for social anxiety disorder, post-traumatic stress disorder and obsessive-compulsive disorder (Whitfield-Gabrieli et al., 2016;Reggente et al., 2018;Zhutovsky et al., 2019;Zhutovsky et al., 2021).However, all of these studies used samples with n < 50.This severely limits the ability to adequately evaluate classifiers, as this process is particularly dependent on sufficiently large test sets (Flint et al., 2021) which cannot be provided by these sample sizes.On the contrary, both datasets used in this study are substantially larger and include neuroimaging data collected at more than one site, making the datasets overall more similar to real-world use-cases.The inability to predict outcomes in these datasets therefore calls for caution regarding the interpretation of promising results from small samples and regarding the potential of rs-fMRI for treatment outcome prediction.This result and interpretation are further in line with a recent study aiming to predict treatment response to escitalopram in a moderately sized multi-site sample: in this study, the prediction based on resting state connectivity from baseline alone also was not able to surpass  chance level, while changes in connectivity from baseline to week two achieved between 64.0-69.4% accuracy (Harris et al., 2022).The authors related this finding to early change.Including information from the first few sessions of a treatment may be a promising avenue for further research on theranostic biomarkers.
It is important to note that the null-findings in this study were achieved using a powerful methodological approach which mirrored the heterogeneity of analysis pipelines in real-world settings.Random forests and Support Vector Classifiers are particularly well-suited for tabular data such as ours, with random forests even outperforming deep learning approaches under such settings (Grinsztajn et al.).Moreover, in both datasets, we combined different data modalities including fMRI, clinical, and demographic data, as it has been recommended in the literature (Chekroud et al., 2021).On the Protect-AD dataset, we even integrated first-level classifiers in an ensemble learning approach, which often outperforms individual classifiers (Polikar, 2006).The convincing results of the concluding sanity check analysis conducted for this dataset generated further trust in the overall performance of the analytic pipeline.Given the inability to predict CBT outcomes in our main approach, we conducted additional exploratory post-hoc analyses on the Protect-AD dataset.As prediction accuracies did not increase for the approach using ROIs from the whole-brain, we can reasonably conclude that the lack of meaningful prediction was not related to an inappropriate selection of ROIs.For Protect-AD, our selection of ROIs based on the literature (Chavanne and Robinson, 2021;Santos et al., 2019;Lueken and Hahn, 2016) constitutes feature selection by prior knowledge, which has been demonstrated to outperform data-driven feature selection approaches in some cases in the neuroimaging literature (Chu et al., 2012).We initially combined this approach with data driven feature selection methods in the analysis pipeline.As feature weights and selected features showed substantial variability across iterations on the Protect-AD dataset, we employed further feature selection methods, including the extraction of particularly stable features (Nogueira et al., 2018;Meinshausen and Bühlmann, 2010) and a complementary approach using all available ROIs from the whole brain in additional exploratory analyses.As predictive performances did not differ between both hypothesis-based and complementary data-driven feature selection, we concluded that the lack of satisfactory results from feature selection was grounded in the lack of predictive information in the underlying resting state in both datasets.This argument is further corroborated by the lack of significant prediction when combining resting state and clinical and demographic data in the Spider-VR dataset: this can be interpreted as the resting state data decreasing the signal-to-noise ratio in the feature set, thus deteriorating the predictive value of the overall dataset.
Across both datasets, we also implemented sophisticated methodology on the feature level itself by using graph metrics (main analysis), dynamic resting state functional connectivity (sliding windows analysis; exploratory) and edge-centric functional connectivity (exploratory).Graph metrics have received increasing attention as they excel at describing overall characteristics and topology of the large scale networks in the brain (Farahani et al., 2019).Sliding window approaches try to address a commonly raised concern regarding resting state paradigms, namely their lack of temporal resolution: two brain areas may be strongly correlated for part of the paradigm time, but not at all during the rest of the time, making the correlation non-significant for the entire paradigm.Separating different time periods from each other and estimating correlations within these periods might thus show meaningful correlations otherwise overlooked (Yan et al., 2020;Allen et al., 2014).Going even one step further, avoiding temporal blurring altogether, edge-functional connectivity examines co-fluctuations, i.e. shared patterns of activity from one MRI image to the next (Faskowitz et al., 2020;Esfahlani et al., 2020;Novelli and Razi, 2022).Thus, this approach should be even more successful in uncovering dynamic functional connectivity.A growing body of literature corroborates the feasibility and relevance of this approach (Jo et al., 2021;Chumin et al., 2022).
However, none of the approaches mentioned above resulted in a meaningful pattern for prediction.
Despite this methodological rigor, the current investigation has some limitations.Although our sample-sizes considerably surpass those of previous studies, they are still inferior to comparable studies on sociodemographic, clinical and routine data that used more than thousand (Hilbert et al., 2020;Hornstein et al., 2021;Symons et al., 2020) to tens of thousands (Wolff et al., 2020) of patients.As model training and evaluation are dependent on the train and test set sizes, larger samples allow for the construction of more robust models and a less biased assessment of model performance, especially when working with high numbers of potential features, as in the case of the sliding-window and edge-functional connectivity approaches.However, sample sizes such as ours may be suitable for an initial assessment of model performance and considerably exceed previous investigations.Furthermore, both included studies were conducted at different sites and MRI data was collected on different scanners, adding additional variance in our data set.One way to tackle this problem would have been to harmonize the data across the sites (Eshaghzadeh Torbati et al., 2021;Yamashita et al., 2019;Yu et al., 2018).However, since harmonization are population-based, they bear the risk of data leakage, so that test set subject information is used during harmonization and thus during classifier training.Alternatively, harmonization could be included in the train-test-split, however this would lead to increased variability between splits.Additionally, in real-life scenarios, a prediction algorithm might be deployed to new sites not included in the training procedure, thus a successful prediction algorithm should not be dependent on the data collection location or hardware.We therefore argue that the heterogeneity adds to our external validity, preventing overfitting.Second, by employing analytic teams that applied different powerful machine learning pipelines on the data, we mirrored the heterogeneity of analysis pipelines in real-world settings.Here, each team analyzed its own dataset, but teams did not directly replicate each other's analysis.Previous research has shown that the specific analysis strategy can have a large impact on the analysis outcome for MRI data (Botvinik-Nezer et al., 2020).But the fact that we found very similar results for different analysis conducted by different teams in different samples, indicates an overall lack of meaningful information for CBT outcome prediction within the broad feature set of resting state functional connectivity data of both datasets.Therefore, an exact replication of analytic strategies across teams and datasets would have made sense only if we had found a significant predictive model in the first place, to ensure the robustness of this analysis.Third, there are several other ways of analyzing resting state activity in addition to the ones applied in our approaches (e.g.decomposition into brain networks using ICA).As the possible combinations between resting-state and machine learning analysis is endless, we cannot definitively rule out that another way of analyzing resting state would have yielded feature sets that would have led to more accurate prediction.But given the amount of applied state-of-the-art approaches and the similarity in results across two independent teams and datasets suggests that the data does not exhibit an apparent predictive signal.
In conclusion, we were not able to predict individual CBT outcomes based on resting state functional connectivity data from two large clinical trials in anxiety disorders.Across a variety of approaches, prediction accuracies were much lower than in comparable yet smaller previous studies.This finding urges caution regarding the interpretation of promising results from small samples and re-iterates that some of the prediction accuracies from these studies may result from overestimation due to homogeneous data and weak cross-validation schemes (Varoquaux, 2018;Varoquaux et al., 2017;Flint et al., 2021).While neuroimaging may still set out to prove its added value for the prediction of treatment outcomes and in precision psychotherapy, adequately powered samples are a necessary prerequisite for an initial evaluation of predictive performance and for a subsequently identification of the most promising candidates.
PD: panic disorder, AG: agoraphobia, SAD: social anxiety disorder, SPH: specific phobia, SIGH-A: Structured Interview Guide for the Hamilton Anxiety Rating Scale, SPQ: Spider Phobia Questionnaire, CGI: Clinical Global Impression scale, BAT: Behavioral Avoidance Test, BDI-II: Beck Depression Inventory-II, ASI: Anxiety Sensitivity Index, PAS: Panic and Agoraphobia Scale, LSAS: Liebowitz Social Anxiety Scale, DSM-5 SP: Dimensional Specific Phobia Scale for DSM-5, PROMIS-SPH: Patient-Reported Outcomes Measurement Information System.1 data from n = 5 responder and n = 1 non-responder missing in Protect-AD.connectivity and iii) the graph-derived metrics.These three first-level learners initially used k = 3, k = 435, and k = 124 features as inputs.

Table 2
Mean prediction measurements of first and second level learner across 100 iterations in the Protect-AD dataset.

Table 3
Mean prediction measurements of the different pipelines based on clinical and demographic data, resting-state and combined resting-state and clinical data in the Spider-VR dataset.Classifier is the one chosen in the best performing outer fold.