Predicting treatment outcome based on resting-state functional connectivity in internalizing mental disorders: A systematic review and meta-analysis

Predicting treatment outcome in internalizing mental disorders prior to treatment initiation is pivotal for precision mental healthcare. In this regard, resting-state functional connectivity (rs-FC) and machine learning have often shown promising prediction accuracies. This systematic review and meta-analysis evaluates these studies, considering their risk of bias through the Prediction Model Study Risk of Bias Assessment Tool (PROBAST). We examined the predictive performance of features derived from rs-FC, identified features with the highest predictive value


Introduction
Internalizing mental disorders, including depressive disorders, anxiety disorders, obsessive compulsive disorders, and post-traumatic stress disorder, are highly debilitating, ranking among the top ten causes of global years lived with disability (GBD 2019 Mental Disorders Collaborators, 2022), and are associated with a substantial reduction in quality of life (Mack et al., 2015). These disorders are often grouped together (e.g., Hettema et al., 2006; Wergeland et al., 2021) because their symptoms have been shown to load on a shared latent factor, commonly referred to as the internalizing factor (e.g., Andrews, 2018; Kotov et al., 2017). This factor is mainly characterized by distress and fear and also underlies their high comorbidity (Kessler et al., 2011).
Recent decades of research have yielded effective treatments for these disorders, including psychotherapy, pharmacotherapy, electroconvulsive treatment (ECT), and repetitive transcranial magnetic stimulation (rTMS; e.g., see meta-analyses of Carpenter et al., 2018; Cuijpers et al., 2013; Dalhuisen et al., 2022; Mutz et al., 2019). However, each of these treatments comes with a high proportion of patients whose condition does not improve after treatment (non-responders, e.g., see reviews of Fitzgerald, 2020; Fonseka et al., 2018; Loerinc et al., 2015; Papakostas and Fava, 2009). These high rates of non-response across treatments may indicate that there is no one-size-fits-all treatment and that suitability varies among individuals or subgroups of patients. Following the concept of precision mental healthcare (DeRubeis, 2019), allocating patients a priori to the treatment most promising for them could reduce non-response rates. A necessary condition for such a treatment allocation is a sufficiently accurate prediction of treatment outcome on a single-subject level.
From a methodological standpoint, machine learning approaches are particularly well-suited for this endeavor. In contrast to conventional statistical modeling, which predominantly aims at explaining existing data, the core objective of machine learning is the accurate prediction of new data (Sidey-Gibbons and Sidey-Gibbons, 2019; Yarkoni and Westfall, 2017). This shift in focus gives rise to two central distinctions between statistical modeling and machine learning: the assessment of the model's performance and the models or algorithms employed. First, both approaches substantially diverge in their criteria and their procedure for determining a well-performing model. In statistical modeling, a well-performing model is one that effectively explains the data (e.g., a logistic regression model with a high R-squared indicating goodness-of-fit). In machine learning, on the other hand, a well-performing model is one that discriminates effectively between two or more classes in new data (e.g., achieving high predictive accuracy). Hence, instead of evaluating a model's performance on the data set on which it has been trained, machine learning approaches ideally apply the fitted model to new data and assess its performance there. Since entirely new data are often unavailable, cross-validation techniques have been developed, which iteratively divide the dataset into a training set for model fitting and a test set for model evaluation. Several metrics exist to evaluate a model's classification performance on the test set(s), mainly combining the numbers of correctly and falsely predicted cases. One of the most general and most frequently used metrics is accuracy, which summarizes the proportion of correctly classified (positive and negative) cases in relation to the total number of cases. However, its interpretability diminishes when it is based on imbalanced classes (e.g., 60% nonresponders, 40% responders), a factor overlooked in several studies. In such cases, other metrics are recommended, including the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and the balanced accuracy (see Thölke et al., 2023 for a more detailed discussion).
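The cross-validation idea described above can be illustrated with a minimal sketch. The function name `k_fold_indices`, the number of folds, and the sample of 10 patients are illustrative assumptions and are not taken from any of the reviewed studies.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assign every k-th shuffled index to the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical sample of 10 patients split into 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    # A model would be fitted on train_idx only and evaluated on test_idx.
    assert not set(train_idx) & set(test_idx)
```

In each iteration the held-out fold plays the role of new data, so any step that learns from the data (including feature selection) would have to be fitted on the training indices only.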
The second central distinction between machine learning and statistical modelling is the models or algorithms employed. Seeking to explain the observed data in a clear manner, statistical models often determine possible relationships between variables and employ a small number of dependent variables. For instance, logistic and linear regression assume a linear relationship between independent and dependent variables, with interaction effects only considered when explicitly added to the model. In contrast, machine learning approaches include a diverse array of algorithms capable of effectively handling numerous variables and capturing nonlinear relationships. Some of the most common algorithms are support vector machines (SVM), random forests, and neural networks. To conclude, machine learning approaches are particularly well-suited for testing the possibility of pre-treatment prediction of treatment outcomes, given their inherent design for this application, their validation of models by performance on unseen data, and their use of versatile algorithms (see Sidey-Gibbons and Sidey-Gibbons, 2019 for a more in-depth introduction to machine learning).
A large number of studies have employed machine learning approaches to predict treatment outcome in internalizing mental disorders, using a wide range of modalities including demographic, clinical, EEG, and (f)MRI data (e.g., see reviews of Cohen et al., 2021; Karvelis et al., 2022; Vieira et al., 2022). One promising subtype of fMRI data is resting-state functional connectivity, which relies on BOLD (blood-oxygenation level dependent) contrast imaging to measure the local blood supply in the brain as a proxy of neural activity. The basis for calculating resting-state functional connectivity is a resting-state scan, in which participants are directed to remain motionless for approximately 8-10 minutes without engaging in any specific task or receiving visual stimuli. Based on these data, functional connectivity between grey matter brain regions is calculated as the statistical correlation of their BOLD signals over time (for an accessible introduction to resting-state functional connectivity, see Lv et al., 2018). Neuroimaging data, and more specifically resting-state functional connectivity, appeared to be particularly valuable in previous reviews and meta-analyses comparing prediction performances across different modalities (Del Fabro et al., 2023; Vieira et al., 2022). Moreover, resting-state functional connectivity seems promising due to its shared alterations across internalizing disorders, including disorder-specific variations (e.g., Williams, 2017). The size and type of these alterations may impact treatment response likelihood, independent of the type of treatment. Additionally, compared to other neuroimaging modalities, resting-state data can be assessed relatively consistently across sites and with a low burden on patients, facilitating the generation of larger samples, as required for machine learning.
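As a minimal sketch of how functional connectivity is obtained as a correlation of BOLD signals over time, the following example computes Pearson correlations between region time series. The region names and the toy signal values are hypothetical, and real pipelines include preprocessing steps that are not shown here.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long time series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def connectivity(timeseries):
    """Map each unordered pair of regions to the correlation of their signals."""
    names = list(timeseries)
    return {
        (a, b): pearson(timeseries[a], timeseries[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Hypothetical BOLD signals for three regions over five time points.
bold = {
    "region_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "region_B": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises together with region_A
    "region_C": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls while region_A rises
}
fc = connectivity(bold)
```

Each value in `fc` would be one candidate feature for a classifier, which is why the number of features grows quickly with the number of regions.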
However, the combination of resting-state functional connectivity and machine learning comes with several methodological challenges. One of the biggest challenges is the high dimensionality of resting-state functional connectivity data (Khosla et al., 2019). In theory, an extensive number of functional connectivities can be computed from a resting-state scan in typical resolution. A normalized brain scan with an isotropic voxel size of 2 mm has around 124,000 voxels of grey matter. Thus, when calculating functional connectivity between all voxels, more than 15 billion (124,000 × 123,999) functional connections would be initially available to predict treatment outcome. Such a large number of predictive variables (called features in machine learning) cannot be handled by current machine learning classifiers, especially not with sample sizes of a few hundred patients, which is usually the upper limit for longitudinal interventional studies. This problem is also known as the curse of dimensionality or the small-n-large-p problem (Mwangi et al., 2014).
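The arithmetic behind this estimate can be checked with a short sketch. The voxel count is the one given in the text; the variable names are illustrative. Note that the product counts every voxel pair in both directions, so under the assumption that connectivity is a symmetric correlation, the number of unique pairs is half of it.

```python
n_voxels = 124_000  # approximate number of grey matter voxels given in the text

# Product used in the text: every voxel paired with every other voxel.
ordered_pairs = n_voxels * (n_voxels - 1)

# A correlation is symmetric, so each pair appears twice in the product above.
unique_pairs = ordered_pairs // 2

print(f"{ordered_pairs:,} ordered pairs")
print(f"{unique_pairs:,} unique pairs")
```

Either way, the number of candidate features exceeds a sample of a few hundred patients by many orders of magnitude.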
The current status of addressing challenges specific to resting-state functional connectivity has not been summarized, as previous reviews have rather examined the predictive ability across different neuroimaging modalities in general (Cohen et al., 2021; J. Lee et al., 2022; Y. Lee et al., 2018; Vieira et al., 2022). Furthermore, the majority of these reviews lacked a comprehensive quality control that thoroughly examined the employed machine learning approach (Cohen et al., 2021; J. Lee et al., 2022; Y. Lee et al., 2018). The reviews of Y. Lee et al. (2018) and Cohen et al. (2021) did address the overall risk of bias to some extent by investigating publication bias. However, Y. Lee et al. (2018) did not conduct any assessment of study quality. In contrast, Cohen et al. (2021) evaluated study quality, including risk of bias, with the QUADAS-2 tool, but overlooked the bias introduced by the design of the machine learning pipeline, as the tool primarily focuses on the validation of diagnostic tests. This lack of attention to the risk of bias introduced by the design of the machine learning pipeline is problematic, considering the common occurrence of inappropriate applications of machine learning in the field (Meehan et al., 2022). Finally, existing reviews did not systematically assess which features contributed to a successful prediction.
To fill these gaps, the aim of this systematic review was three-fold: First, to examine how well treatment outcomes in internalizing disorders can be predicted by features based on resting-state functional connectivity (research question 1 = RQ 1), taking into account the studies' risk of bias using the Prediction Model Study Risk of Bias Assessment Tool (PROBAST). Second, to assess how features with high predictive value were identified (RQ 2.1) and which features were particularly important for the prediction (RQ 2.2). Third, to provide an overview of how the different studies addressed the curse of dimensionality, i.e., how they reduced the large number of theoretically available functional connectivities to a small set of features to be used in the final classifier(s) (RQ 3). By addressing these questions, we aimed to give a realistic estimate of the potential of machine learning and resting-state functional connectivity (RQ 1), to assist future researchers in making informed methodological decisions by summarizing current practices and identifying methodological shortcomings (RQ 2.1, RQ 3), and to provide evidence for an a priori selection of brain regions which might be relevant for predicting treatment outcome in future machine learning studies (RQ 2.2).

Search strategy and inclusion criteria
The electronic databases Scopus, PubMed and PsycINFO were searched for relevant studies from inception up to the 12th of December 2022. Search terms encompassed keywords for resting state, primary disorder, treatment, and machine learning (see supplement S1 for the specific search terms in all databases). Additionally, reference lists of eligible studies and review articles were screened. The following inclusion criteria were applied: 1) publication in a peer-reviewed journal, written in English; 2) analyzing a sample of patients with one of the following disorders as primary disorder: unipolar depressive disorders, anxiety disorders, obsessive compulsive disorder, or post-traumatic stress disorder; 3) predicting outcome to any treatment (behavioral, pharmacological, placebo, or neuroscience-informed) that aimed to improve the patients' condition; 4) using a machine learning approach; 5) predicting treatment outcome as a categorical outcome; 6) reporting at least one classifier whose input features are exclusively based on resting-state functional connectivity. Anticipating a limited number of studies meeting our inclusion criteria, we refrained from delineating any additional criteria beyond those specified. After an initial abstract screening by the first author (CM), all remaining studies were submitted to a full-text screening, independently performed by two authors (CM, KH). Disagreements were resolved by discussion. The review part has been preregistered with PROSPERO (CRD42022370949). The meta-analytic summary of classification accuracies was not part of the preregistration as it was conducted in response to the reviewers' recommendations.

Data extraction
The following data, ordered by research questions, were extracted. Study characteristics and RQ 1: first author, year, primary disorder, age group, treatment, definition of response and/or remission, sample size, way of estimating the underlying functional connectivities, type of functional-connectivity-based input features, algorithm(s) of the final classifier(s), validation method, classification metrics of the best model reported. RQ 2: way of detecting features with high predictive value, level of resolution of investigating high predictive values, features that showed high predictive value. RQ 3: approaches to reduce the number of input features. Data extraction was performed by CM and checked by KH. The original table of data extraction as well as an R script for reproducing all our analyses and plots in R (R Core Team, 2020) can be found here: https://osf.io/y69ke/.

Risk of bias assessment
The best model of each study was assessed for risk of bias using PROBAST (Prediction model study risk of bias assessment tool; Wolff et al., 2019), a tool developed for predictive modelling in healthcare.Based on 20 signaling questions, each model was judged as having low or high risk of bias, in each of four domains (participants, predictors, outcome, analysis) and in total.

Data synthesis

RQ 1 Meta-analysis on balanced classification accuracies
To answer RQ 1 (How well can treatment outcome in internalizing disorders be predicted by features based on resting-state functional connectivities?), we estimated the mean balanced classification accuracy in a meta-analysis using the classification accuracy of each study's best model. We focused on each study's best model as sufficient performance metrics were mostly only reported for those. This has been a common procedure in systematic reviews and/or meta-analyses on machine learning (e.g., Bondi et al., 2023; Vieira et al., 2022). We chose accuracy instead of other metrics such as precision, recall/sensitivity, or specificity, which focus on the prediction of one of two classes (either response or nonresponse), for two reasons. First, given that current models are far from any clinical application, it is unclear whether predicting one class is more crucial than the other. Therefore, prioritizing the evaluation of overall model performance, summarized by accuracy, seemed most pertinent. Second, opting for accuracy was more practicable, as it was the only metric consistently reported across all studies. In contrast, sensitivity and specificity, the most frequently reported metrics among those focusing on the prediction of one of the two classes, were absent in 4 out of the 13 studies reviewed. In addition, aggregating these metrics across studies would have been challenging, as it was not always apparent to which class the metrics referred. For instance, in some studies it was unclear whether specificity described the ability to predict response or nonresponse.
However, using the classification accuracy has the disadvantage that its meaningfulness diminishes when classes are imbalanced, as accuracies above 50% can easily be reached by a model that systematically predicts the more frequent class (Thölke et al., 2023). For instance, consider a binary classification scenario where nonresponders constitute 70% of the cases. In this context, a classification accuracy of 70% might not truly signify high predictive performance. Instead, it could be attributed to a model lacking genuine predictive ability, merely predicting a nonresponder status for all cases. Some studies with imbalanced classes took this into account by reporting the balanced accuracy or additional evaluation metrics. The balanced accuracy is commonly calculated as the mean of sensitivity and specificity (e.g., Brodersen et al., 2010). Hence, where sensitivity and specificity were reported, we calculated missing balanced accuracy values based on these metrics. However, this could not be done for all studies with imbalanced classes. Therefore, we estimated a proxy of balanced accuracy for the remaining studies, using the following formula: Proxy of balanced accuracy = (raw accuracy − relative frequency of the more frequent class) + 50%. This proxy is based on the idea that the accuracy achieved by a dummy classifier always predicting the more prevalent class (= the relative frequency of the more frequent class) represents the chance level. The improvement above chance level is thus calculated by subtracting the chance level from the raw accuracy. To obtain the final proxy of balanced accuracy, this improvement is added to a chance level of 50%, as it would exist when classes are balanced. This formula is not a prevalent method in machine learning for taking class imbalances into account, as more suitable metrics such as balanced accuracy and AUC exist when evaluating a model on original data. Its relevance only emerges when summarizing accuracy values across studies when other performance metrics controlling for class imbalance are lacking.
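The two quantities described above can be written out as a minimal sketch. The function names are illustrative assumptions; the formulas follow the definitions given in the text, with proportions expressed on a 0 to 1 scale instead of percentages.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2

def proxy_balanced_accuracy(raw_accuracy, majority_share):
    """Improvement over always predicting the more frequent class, added to 0.5.

    majority_share is the relative frequency of the more frequent class.
    """
    return (raw_accuracy - majority_share) + 0.5

# Example from the text: 70% nonresponders and a model that always predicts
# "nonresponder". Its raw accuracy is 0.70, but it detects no responder.
dummy_balanced = balanced_accuracy(sensitivity=0.0, specificity=1.0)
dummy_proxy = proxy_balanced_accuracy(raw_accuracy=0.70, majority_share=0.70)

# Hypothetical study reporting 80% raw accuracy with the same class imbalance.
study_proxy = proxy_balanced_accuracy(raw_accuracy=0.80, majority_share=0.70)
```

For the dummy classifier both quantities equal the chance level of 0.5, whereas the hypothetical study's raw accuracy of 0.80 shrinks to a proxy of about 0.60.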
Similar to previous meta-analyses (Y. Lee et al., 2018; Vieira et al., 2022), we conducted a meta-analysis for proportions, treating accuracy values as proportions of correctly classified cases. The R package meta was employed for these analyses (Balduzzi et al., 2019). As we anticipated considerable heterogeneity in classification accuracy between studies, we fitted a random-effects model and applied Knapp-Hartung adjustments (Knapp and Hartung, 2003) to calculate the confidence interval around the mean estimated accuracy. All analyses were conducted using Freeman-Tukey double arcsine-transformed proportions to stabilize error variances (Barendregt et al., 2013). Between-group heterogeneity in the random-effects model was estimated with the restricted maximum likelihood estimator (Viechtbauer, 2005). The mean estimated accuracy under the random-effects model was calculated by pooling the studies' accuracies, weighted by the inverse of their error variance. Individual study confidence intervals were calculated using the Clopper-Pearson (i.e., exact binomial interval) method. Heterogeneity between studies was evaluated using the I² statistic and Cochran's Q test. We interpreted I² following the recommendations from Higgins and Thompson (2002), with 25%, 50%, and 75% as low, moderate, and high, respectively. To explore potential sources of heterogeneity, we performed subgroup analyses for treatment, diagnosis, and risk of data leakage (PROBAST question 4.8) as well as a meta-regression on sample size.
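The general logic of pooling transformed proportions can be sketched as follows. This is not the authors' analysis, which was run with the R package meta: as simplifying assumptions, the sketch uses the closed-form DerSimonian-Laird estimator of between-study variance instead of restricted maximum likelihood, omits the Knapp-Hartung adjustment and confidence intervals, and back-transforms with a plain squared sine. The study counts are hypothetical.

```python
import math

def freeman_tukey(correct, n):
    """Double arcsine-transformed proportion and its approximate sampling variance."""
    t = 0.5 * (math.asin(math.sqrt(correct / (n + 1)))
               + math.asin(math.sqrt((correct + 1) / (n + 1))))
    return t, 1.0 / (4 * n + 2)

def pool_random_effects(studies):
    """studies: list of (correctly classified cases, sample size) tuples."""
    ts, vs = zip(*(freeman_tukey(c, n) for c, n in studies))
    w = [1.0 / v for v in vs]  # inverse-variance weights
    fixed = sum(wi * ti for wi, ti in zip(w, ts)) / sum(w)
    q = sum(wi * (ti - fixed) ** 2 for wi, ti in zip(w, ts))  # Cochran's Q
    df = len(studies) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # DerSimonian-Laird estimate of between-study variance (assumption: not REML).
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in vs]
    pooled_t = sum(wi * ti for wi, ti in zip(w_re, ts)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Simplified back-transformation to the proportion scale.
    return {"pooled_accuracy": math.sin(pooled_t) ** 2, "tau2": tau2, "Q": q, "I2": i2}

# Hypothetical studies: (correctly classified cases, sample size).
hypothetical_studies = [(30, 40), (55, 80), (70, 120)]
result = pool_random_effects(hypothetical_studies)
```

The returned `I2` is a proportion between 0 and 1 and would be multiplied by 100 to compare it against the 25%, 50%, and 75% benchmarks.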

RQ 2 Features with high predictive value
To answer RQ 2.1 (Which approaches are taken to draw inferences about the predictive value of specific features?), we extracted the approaches that were used to assess which features or feature sets contributed or led to a prediction above chance level and then clustered them qualitatively into suitable groups. To account for the expected heterogeneity of approaches, we used the term "predictive value" instead of "feature importance", as the latter is often used to describe the contribution of a feature in a final classifier, which represents only one of various methods.
To answer RQ 2.2 (Which features have high predictive value for the prediction of treatment outcome?), we applied a five-step procedure. First, we extracted which features were reported to have high predictive value. Second, we examined the level of resolution at which these features were reported. For instance, certain studies indicated that a single functional connectivity between two brain regions holds predictive value, whereas others reported the entire array of functional connectivities between a particular brain region and a subset of other regions as having high predictive value. Since the majority of studies fell into the second category, we decided to summarize findings across studies at the level of brain regions (instead of, for example, the level of single functional connectivities). Thus, in a third step, we collected those brain regions whose functional connectivity had high predictive value. Fourth, we grouped these brain regions into larger areas to provide a more comprehensive overview, based on the 22-region Human Connectome Project multimodal brain parcellation (Glasser et al., 2016; Huang et al., 2022), common subcortical areas, and findings from previous literature. This approach resulted in the following coarse brain areas: visual areas, sensorimotor areas, inferior temporal gyrus, middle temporal gyrus, superior temporal gyrus, parahippocampal gyrus, superior parietal lobule, inferior parietal lobule, posterior cingulate cortex, anterior cingulate cortex (ACC), medial prefrontal cortex (PFC), precuneus, orbitofrontal cortex, ventrolateral PFC, dorsolateral PFC (DLPFC), amygdala, hippocampus, insula, basal ganglia, and thalamus. Glasser's 22-region parcellation served solely as inspiration for categorizing brain regions. We did not align coordinates between brain regions and the Glasser parcellation; instead, we relied on the labels provided by the studies for grouping. However, it is important to highlight three key distinctions from Glasser's parcellation in our approach: First, we merged Glasser's visual areas 1-5 into one visual area, as this parcellation seemed too fine-grained for our endeavor. Second, we split Glasser's area 19 (ACC and medial prefrontal cortex) into ACC, medial PFC, and precuneus, as these regions and their role in psychopathology and treatment outcome have been discussed separately. Third, we parcellated the temporal lobe in an anatomical way, because Glasser's more functionally based parcellation could not be imposed on our studies' findings. Then, in a fifth step, in order to account for the fact that not all studies employed whole-brain analyses, we assessed for each study and each brain area whether there was an initial opportunity to demonstrate high predictive value. This was, for example, not the case for all coarse brain areas when the analysis focused solely on functional connectivities within a subset of brain regions or when whole-brain analyses were confined to cortical areas.

RQ 3 Approaches to reduce the number of features
To answer RQ 3 (Which approaches are taken to reduce the large amount of initially available functional connectivities to a small set of features to be used in the final classifier(s)?), the extracted approaches were grouped into suitable categories (here: approaches that served an initial reduction preceding feature generation and approaches that served feature reduction).

Search results and study characteristics
The initial search identified 240 unique records. After screening for eligibility by title and abstract, 49 studies underwent a full-text screening. Finally, 13 studies, each using a different sample, were included in the systematic review (see flowchart in Fig. 1).
Fig. 1. PRISMA Flowchart. We excluded one study because it did not utilize any baseline features to predict treatment outcomes, even though the explicit criterion "Using only baseline features" was not included in our exclusion criteria. However, we interpreted inclusion criterion 3) "Predicting outcome to any treatment" as encompassing the use of only baseline features, given our understanding that prediction occurs before treatment.
Although the search terms covered a variety of internalizing disorders, most of the finally selected studies predicted treatment outcome in patients with unipolar depressive disorders (n = 11). Two studies focused on patients with post-traumatic stress disorder (PTSD), while there was no study on patients with obsessive compulsive disorder or anxiety disorders fulfilling the inclusion criteria (see Table 1 for study characteristics). More specifically, all patients with unipolar depressive disorders met the DSM-IV criteria for a major depressive episode, mostly evaluated through (semi-)structured interviews such as the SCID or the MINI. Patients in studies on PTSD satisfied the DSM-IV criteria for PTSD either in full (Zhutovsky et al., 2019) or at least partially (Zhutovsky et al., 2021). Criteria were assessed using PTSD-specific semi-structured interviews such as the CAPS, or, focusing on children and adolescents, the CAPS-CA and the ADIS-P (Zhutovsky et al., 2021). Except for Zhutovsky et al. (2021), all studies exclusively included adult participants. Regarding symptom severity, the majority of patients exhibited moderate symptoms, although certain individual studies specifically included patients with more severe symptoms (Hopman et al., 2021; van Waarde et al., 2015). The treatment patients underwent varied largely across studies: 5 studies employed medication (Harris et al., 2022; Kang and Cho, 2020; Kong et al., 2021; Pei et al., 2020; Tian et al., 2020; H. Wu et al., 2022), 3 studies ECT (Moreno-Ortega et al., 2019; Sun et al., 2020; van Waarde et al., 2015), 2 studies psychotherapy (cognitive behavioral therapy and eye movement desensitization and reprocessing; Zhutovsky et al., 2019; Zhutovsky et al., 2021), 2 studies rTMS (Drysdale et al., 2017; Hopman et al., 2021), and 1 study mixed medication and psychotherapy (Schultz et al., 2018). The diversity in treatments and their duration contributed to a large variability in the timing of the assessment of post-treatment outcome, ranging from 7 weeks (Schultz et al., 2018) to 8 months (Zhutovsky et al., 2019). Furthermore, it is noteworthy that two studies specifically concentrated on the early response to medication, measuring and predicting outcomes after a brief period of 2 weeks of treatment (Pei et al., 2020; Tian et al., 2020). Treatment outcomes were always measured in terms of symptom severity, mostly using clinician-rated measures such as the HDRS (HDRS-6, HDRS-17, and HDRS-24), MADRS, and CAPS (more information on the definition of treatment outcome can be found in Table 1).

RQ 1 Meta-analysis of balanced classification accuracy
The meta-analysis was based on 13 studies with n = 972 observations. To summarize balanced instead of raw classification accuracies, we calculated balanced accuracies from sensitivity and specificity for n = 5 studies and estimated them using the proposed proxy for n = 4 studies. Fig. 2a depicts the difference between raw and balanced accuracies for these studies.
In subgroup analyses, neither treatment type, primary disorder, nor the presence of data leakage could account for the observed between-study heterogeneity, potentially due to limited subgroup sizes. Including sample size as a moderator, however, reduced the between-study heterogeneity to 57.8%. The meta-regression analysis revealed that lower sample sizes were associated with higher classification accuracies (β based on transformed proportions = −0.0017, t(11) = −2.5, p = 0.0280), explaining around 46% of the observed variance. This relationship is depicted in Fig. 2b. Given the significant between-study heterogeneity, we chose not to create a funnel plot or perform asymmetry analyses, following various recommendations in the field (Ioannidis and Trikalinos, 2007; Terrin et al., 2003).

RQ 2.1 Approaches to evaluate predictive value
We grouped the approaches used to draw inferences on the features' predictive values into three categories: model comparison, selection frequency in feature selection, and feature importance in final classifier. The majority of studies (n = 6) used model comparison. They built and compared models with different sets of input features to assess which set showed the best model performance and thus had the highest predictive value. The compared feature sets included connectivities of different brain regions (Schultz et al., 2018), different combinations of single connectivities (Hopman et al., 2021; Moreno-Ortega et al., 2019), or subject-specific spatial maps of different independent components (van Waarde et al., 2015; Zhutovsky et al., 2019; Zhutovsky et al., 2021).
The "selection frequency in feature selection" was used by 4 studies to draw inferences on the features' predictive value. Employing feature selection techniques is a common practice to narrow down the initially available features to a more compact set, which is then utilized for the final classifier. When using an internal cross-validation technique, as almost all included studies did, feature selection is applied in each iteration. Thus, features which are selected in most iterations are considered as having high predictive value. The studies here used different techniques of feature selection such as the Wilcoxon rank sum test (Drysdale et al., 2017), correlation analysis (Sun et al., 2020), SVM with recursive feature elimination (H. Wu et al., 2022), and univariate feature selection (Zhutovsky et al., 2019).
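The selection-frequency idea can be sketched as follows. As a stand-in for the filters used in the studies, this sketch ranks features by the absolute difference in class means, which is an illustrative assumption; the data, the fold assignment, and the function names are hypothetical as well.

```python
from collections import Counter

def mean(values):
    return sum(values) / len(values)

def select_top_features(X, y, train_idx, n_keep):
    """Rank features by absolute difference in class means, on training data only."""
    scores = {}
    for f in range(len(X[0])):
        responders = [X[i][f] for i in train_idx if y[i] == 1]
        nonresponders = [X[i][f] for i in train_idx if y[i] == 0]
        scores[f] = abs(mean(responders) - mean(nonresponders))
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]

def selection_frequency(X, y, k, n_keep):
    """Share of cross-validation iterations in which each feature was selected."""
    n = len(X)
    counts = Counter()
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # simple interleaved folds
        train_idx = [i for i in range(n) if i not in test_idx]
        counts.update(select_top_features(X, y, train_idx, n_keep))
    return {f: counts[f] / k for f in range(len(X[0]))}

# Hypothetical data: 8 patients, 3 connectivity features, 1 = responder.
X = [[2.1, 0.5, 0.3], [0.1, 0.4, 0.2], [1.9, 0.6, 0.1], [0.2, 0.5, 0.4],
     [2.0, 0.4, 0.3], [0.0, 0.6, 0.2], [2.2, 0.5, 0.2], [0.1, 0.5, 0.3]]
y = [1, 0, 1, 0, 1, 0, 1, 0]
freq = selection_frequency(X, y, k=4, n_keep=1)
```

In this toy data set only the first feature separates responders from nonresponders, so it is selected in every iteration and its selection frequency is 1.0.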
Feature importances in the final classifier were used by 4 studies to assess the features' predictive value. The category "feature importance in the final classifier" comprises approaches which used measures of feature importance in the final classifier to investigate the features' predictive value. In general, most classifiers have model-specific measures of feature importance, but there also exists a wide variety of model-agnostic approaches. Here, the measures of feature importance varied across studies, with feature weights for SVM (Tian et al., 2020; Zhutovsky et al., 2021), position ranking for SVM with recursive feature elimination (Pei et al., 2020), and feature weights in a spatio-temporal graph convolutional network (Kong et al., 2021). Please note that some of the 12 studies which examined the features' predictive value applied multiple approaches. More information on the specific approaches and categorization per study can be found in Table S1.

RQ 2.2 Important brain regions
As described above, we first examined the level at which features with high predictive value were reported in order to identify the most suitable level for summarizing findings across studies. Most studies (6/12) reported that the entire array of functional connectivities of a specific brain region had high predictive value and did not focus on, for example, single connectivities. The remaining studies reported high predictive value on levels that could be easily transferred to a brain region level: three studies reported high predictive value for single connectivities, and three other studies reported high predictive value for independent components that were described in terms of common brain regions. In addition, H. Wu et al. (2022) reported high predictive value for specific emotion regulation networks. Please see Table S2 for the categorization per study. It is noteworthy that no study reported high predictive value for common functional connectivity networks like the default mode or salience network. Thus, we decided to summarize features which had high predictive value on a brain region level, applying the procedure described above in the methods section. Fig. 4 provides a summary of both the absolute and relative frequency with which a brain region demonstrated high predictive value. The DLPFC was the region whose connectivities were most frequently predictive across studies, both in terms of absolute and relative numbers. Other important brain regions included sensorimotor areas, visual areas, and the basal ganglia.

RQ 3 Approaches to reduce the large amount of theoretically initially available functional connectivities
We categorized approaches taken to reduce the theoretically large number of initially available functional connectivities to a more manageable set of features into two levels: those serving an initial reduction before feature generation and those serving feature reduction after feature generation. Both were part of our research question. The approaches each study took and their categorization can be seen in table S3.
Table note. Only the balanced accuracy (Acc) of the best model of each study is reported. The asterisk denotes studies for which balanced accuracy was calculated from sensitivity and specificity. The double asterisk denotes studies for which a proxy of balanced accuracy was used. Abbreviations: N = Sample size, FC = functional connectivity, Acc = Accuracy, MDD = major depressive disorder, BPD = bipolar disorder, PTSD = post-traumatic stress disorder.
Around one third of studies (n = 5) sidestepped the theoretically large amount of initially available connectivities by focusing a priori on specific brain areas and/or connectivities selected according to prior literature and theoretical assumptions. Other studies explored whole-brain functional connectivity but streamlined the number of connectivities for investigation. This was achieved by transitioning from the theoretically available voxel level to either a brain region level, employing atlas-based parcellations (n = 6), or an independent-component level, utilizing data-driven parcellations (n = 3).
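To illustrate why such a reduction matters, the following minimal sketch counts the unique pairwise connectivities for different numbers of nodes. The node counts (50,000 voxels, 200 atlas regions, 20 independent components) are hypothetical values chosen for illustration and are not taken from the included studies.

```python
def n_connectivities(n_nodes: int) -> int:
    """Number of unique undirected pairwise connectivities among n_nodes."""
    return n_nodes * (n_nodes - 1) // 2


# Hypothetical node counts, chosen only for illustration.
examples = [
    ("voxel level", 50_000),
    ("atlas-based parcellation", 200),
    ("independent components", 20),
]
for label, n_nodes in examples:
    print(f"{label} ({n_nodes} nodes): {n_connectivities(n_nodes):,} connectivities")
```

Because the count grows quadratically with the number of nodes, moving from voxels to regions or components reduces the candidate feature set by several orders of magnitude.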
In terms of feature reduction after feature generation, the majority of studies employed feature selection techniques (n = 11). Among them, seven studies utilized so-called filter techniques, which use traditional statistical measures such as correlation coefficients or t-tests to rank features based on their capacity to differentiate between groups. Two other studies employed wrapper techniques, wherein the final classifier is trained in an inner loop to select features based on their importance in the classifier (see Brakowski et al., 2017; Guyon and Elisseeff, 2000; Mwangi et al., 2014 for an overview of feature selection techniques). In a separate set of studies, the number of input features for the final classifier was diminished by distributing the features across multiple models (n = 7). Furthermore, three studies implemented diverse methods of dimensionality reduction post feature generation, including principal component analysis, layering within a convolutional graphical network, or aggregation.
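A filter technique of the kind described above can be sketched as follows. This is an illustrative example under our own assumptions (Welch's t-statistic as the ranking criterion, made-up toy data, hypothetical function names) and does not reproduce the pipeline of any included study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def filter_select(features, labels, k):
    """Return the indices of the k features with the largest |t| between groups.

    features: one list of feature values per subject
    labels:   1 = responder, 0 = non-responder
    """
    n_features = len(features[0])
    scores = []
    for j in range(n_features):
        responders = [row[j] for row, y in zip(features, labels) if y == 1]
        non_responders = [row[j] for row, y in zip(features, labels) if y == 0]
        scores.append(abs(welch_t(responders, non_responders)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (made up): only feature 1 differs between the two groups.
X = [[0.1, 2.0, 0.5], [0.3, 2.2, 0.4], [0.2, 1.9, 0.6],
     [0.2, 0.1, 0.5], [0.1, 0.3, 0.6], [0.3, 0.2, 0.4]]
y = [1, 1, 1, 0, 0, 0]
print(filter_select(X, y, k=1))
```

Each feature is scored in isolation, which is what makes the approach univariate: a pattern that is only informative in combination with other features receives a low score.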

Risk of bias
All studies were rated as having a high risk of bias. The most common reasons were in the analysis domain, including small sample size, univariate feature selection, and data leakage. A summary of risk of bias is depicted in figure S1; the PROBAST rating for each study is presented in table S4. The most frequent problem was a small sample size, as none of the studies met the PROBAST criterion which requires a number of non-responders that is 10 times larger than the number of candidate features. A further, potentially very severe problem was data leakage, which occurs in internal validations when information from the test set "leaks" into the training set and is thus used to train the model. Data leakage greatly increases the risk of overestimation, as the model can use information which would not be available in a naturalistic setting. Here, data leakage occurred as feature selection (Hopman et al., 2021; Moreno-Ortega et al., 2019) and independent component analysis (van Waarde et al., 2015) were performed on the whole data set. Another reason for high risk of bias was the use of univariate feature selection methods. Univariate feature selection methods include any procedure testing single features for statistically significant relations or group differences without taking multivariability into account. Univariate feature selection can cause both under- and overestimations of performance accuracies (Jong et al., 2021). Underestimation might result because multivariable patterns with high predictive value, which machine learning algorithms are able to exploit, might not be selected (Jong et al., 2021). Overestimation can emerge as univariate selection is more biased to singularities in the data (Jong et al., 2021). Another procedure that was not assessed with the PROBAST rating but can also increase risk of bias is the simultaneous testing of several final models. Most studies (8/13) tested more than one final model, varying feature subsets (e.g., Zhutovsky et al., 2019), classifiers (e.g., Tian et al., 2020), and machine-learning pipelines (e.g., Harris et al., 2022). As most studies reported sufficient performance metrics only for their best model(s), we quantitatively summarized the metrics of the studies' best models to assess the predictability of treatment outcome, a common procedure in systematic reviews and/or meta-analyses on machine learning (e.g., Bondi et al., 2023; Y. Lee et al., 2018; Vieira et al., 2022). However, performance metrics of the best one of several final models are likely to overestimate the model's performance: In a naturalistic setting, the best model cannot be chosen retrospectively, as it should inform the practitioner's further actions before the beginning of treatment.
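The effect of data leakage through feature selection can be made concrete with a small simulation. The sketch below is our own illustration, not a re-analysis of any included study: it assumes pure noise features, a simple nearest-centroid classifier, leave-one-out cross-validation, and hypothetical sizes (20 subjects, 2000 features, 10 selected features).

```python
import random


def centroid(rows, cols):
    """Mean of each selected column over the given rows."""
    return [sum(r[c] for r in rows) / len(rows) for c in cols]


def top_k_features(X, y, k):
    """Filter selection: k features with the largest absolute class-mean difference."""
    all_cols = range(len(X[0]))
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    mp, mn = centroid(pos, all_cols), centroid(neg, all_cols)
    return sorted(all_cols, key=lambda j: abs(mp[j] - mn[j]), reverse=True)[:k]


def predict(train_X, train_y, x, cols):
    """Nearest-centroid classifier restricted to the selected columns."""
    pos = [r for r, lab in zip(train_X, train_y) if lab == 1]
    neg = [r for r, lab in zip(train_X, train_y) if lab == 0]
    cp, cn = centroid(pos, cols), centroid(neg, cols)
    dist_pos = sum((x[c] - m) ** 2 for c, m in zip(cols, cp))
    dist_neg = sum((x[c] - m) ** 2 for c, m in zip(cols, cn))
    return 1 if dist_pos < dist_neg else 0


def loo_accuracy(X, y, k, leaky):
    """Leave-one-out accuracy; leaky=True selects features on the whole data set."""
    global_cols = top_k_features(X, y, k) if leaky else None
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Without leakage, selection only ever sees the training subjects.
        cols = global_cols if leaky else top_k_features(train_X, train_y, k)
        hits += predict(train_X, train_y, X[i], cols) == y[i]
    return hits / len(X)


random.seed(0)
n_subjects, n_features, k = 20, 2000, 10
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_subjects)]
y = [1] * 10 + [0] * 10  # labels are unrelated to the noise features

leaky_acc = loo_accuracy(X, y, k, leaky=True)
clean_acc = loo_accuracy(X, y, k, leaky=False)
print(f"selection on the whole data set: {leaky_acc:.2f}")
print(f"selection inside each fold:      {clean_acc:.2f}")
```

Since the features carry no information about the labels, any accuracy clearly above chance in the first setting stems from the selection step having already seen the held-out subject.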

Discussion
The present review and meta-analysis aimed to give an overview of studies using resting-state functional connectivity to predict treatment outcome in internalizing mental disorders. An extensive literature search resulted in 13 studies which predicted outcome to a wide range of treatments, including medication, ECT, psychotherapy, and rTMS, and focused mainly on patients with depression. The estimated mean balanced classification accuracy was 77%. A close examination of the connectivities which led to a successful prediction showed that the connectivity of the dorsolateral prefrontal cortex had high predictive value across treatments. The PROBAST rating revealed that all studies suffered from high risk of bias, mainly caused by inappropriate methodological choices.
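For reference, balanced accuracy is the mean of sensitivity and specificity and can differ markedly from raw accuracy when classes are imbalanced. The confusion matrix in the following sketch is hypothetical and serves only to illustrate the difference.

```python
def raw_accuracy(tp, fn, tn, fp):
    """Proportion of all subjects classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical confusion matrix: 30 responders, 10 non-responders.
tp, fn, tn, fp = 27, 3, 4, 6
print(f"raw accuracy:      {raw_accuracy(tp, fn, tn, fp):.3f}")
print(f"balanced accuracy: {balanced_accuracy(tp, fn, tn, fp):.3f}")
```

In this made-up case, a classifier that favors the majority class looks considerably better on raw accuracy than on balanced accuracy.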

Model performance in the light of high risk of bias
The estimated mean balanced classification accuracy of 77% of the studies' best models reflects that treatment outcome can be predicted better than chance level. However, the PROBAST rating revealed that all included studies suffered from high risk of bias. The most frequent reasons for high risk of bias were in the analysis domain, including small sample sizes (range: 18-144), univariate feature selection methods, and data leakage. Moreover, the simultaneous testing of multiple models represented another source of risk of bias that has not been considered by the PROBAST rating (see a more detailed explanation of these factors in the results section).
Fig. 2. Difference between raw and balanced accuracies (A) and sample size as moderator in meta-regression (B). A) The difference between balanced and raw accuracy is only depicted for studies that did not report balanced accuracy in the presence of imbalanced classes. The asterisk (*) denotes studies where a proxy of balanced accuracy was calculated due to missing information. B) The size of the dots represents the weight of the studies in the meta-analysis. The line is the fitted regression line. Please note that the original meta-regression was performed on the double arcsine transformed proportions. Abbreviations: ECT = electroconvulsive treatment, rTMS = repetitive transcranial magnetic stimulation.
Fig. 3. Forest plot of a random-effect meta-analysis on the balanced accuracy values of the studies' best models.
Fig. 4. Absolute and relative frequencies of studies in which a brain area had predictive value. Only brain areas that had predictive value in more than 30% of studies are depicted. The numbers in represent the absolute number of studies in which the brain area had predictive or no predictive value. The brain areas are arranged in descending order following their relative frequency. Abbreviation: PFC = prefrontal cortex.
Additionally, further exploration of the meta-analysis indicated that small sample sizes were associated with elevated classification accuracies, implying a potential overestimation of predictive performance in studies with small sample sizes. This pattern is consistent with findings in other reviews (e.g., Steele et al., 2018; Vieira et al., 2022). In general, small sample sizes in cross-validation may cause both over- and underestimation of predictive performance on unseen data, as the precision of the cross-validated model performance, which serves as the estimator of model performance on unseen data, decreases with smaller sample sizes (Varoquaux, 2018). The observed association with higher accuracies, both in this study and others, can be attributed to the scientific community's tendency to present high classification accuracies (the so-called filter effect; Varoquaux, 2018). This interplay is similar to the impact observed in classical statistics with small sample sizes (Button et al., 2013).
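A rough sense of how precision depends on sample size can be obtained from the binomial standard error of an accuracy estimate. The sketch below is a simplification under stated assumptions: it treats each test prediction as an independent trial, which does not strictly hold in cross-validation, and it plugs in the pooled accuracy of 77% together with the smallest and largest sample sizes of the included studies (18 and 144).

```python
from math import sqrt


def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy estimated from n independent predictions."""
    return sqrt(acc * (1 - acc) / n)


for n in (18, 144):
    se = accuracy_standard_error(0.77, n)
    print(f"n = {n:3d}: SE = {se:.3f}, approx. 95% interval = 0.77 +/- {1.96 * se:.3f}")
```

Under these assumptions, the interval around an observed accuracy is almost three times wider for the smallest study than for the largest, leaving ample room for chance results that pass the filter effect described above.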
Thus, given the high risk of bias across studies and the overrepresentation of small studies likely overestimating the true performance, we consider the estimated mean classification accuracy of 77% as an optimistic upper bound of potential prediction performance rather than a proof of principle. Whether machine learning and functional connectivity are able to predict treatment outcome remains an open question which can only be answered by further studies applying state-of-the-art machine learning methods lege artis, using larger sample sizes and employing external validation.

Far from clinical application
Given the high risk of bias previously discussed and particularly the lack of external model validations, our results indicate that models predicting treatment outcome are still far from any clinical application. A similar picture emerges for other mental disorders or neuroimaging variables, with reviews or meta-analyses reporting a wide range of prediction performance across studies, a lack of external validation studies, and, if assessed, a high risk of bias for the included studies (Del Fabro et al., 2023; Vieira et al., 2022; Watts et al., 2022). Thus, our review underscores that the application of machine learning and neuroimaging variables to predict treatment outcome is still far from any clinical application, regardless of mental disorder and neuroimaging modality.
However, even if a model based on resting-state functional connectivity was successfully validated on multiple external datasets, numerous considerations would still precede a clinical application. One of those would be a thorough cost-benefit analysis that considers the specific context and purpose of the application, the model performance, and the monetary costs of collecting the fMRI data. Moreover, aligning with the Research Domain Criteria (RDoC) framework (Cuthbert, 2014), it should be explored whether the functional connectivity patterns which drove the final predictions could be reflected on other units of analysis, such as behavior or physiology, which are more readily assessable. Furthermore, it is important to note that a model which predicts the outcome to a single treatment, as those included here, may not directly contribute to treatment allocation. The direct utility for treatment allocation arises when combining models for different treatments, as demonstrated in approaches like the personalized advantage index (DeRubeis et al., 2014), or when generating models that explicitly recommend one treatment over others. Nevertheless, a model which predicts the outcome to a single treatment holds value by aiding in developing new treatments or add-ons for patients unlikely to respond. Additionally, it could protect patients from undergoing an invasive and high-risk treatment with a low likelihood of success.
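The idea of combining single-treatment models for treatment allocation, as in the personalized advantage index mentioned above, can be sketched as the difference between a patient's predicted outcomes under two treatments. The predicted response probabilities, the decision margin, and the function names below are hypothetical and serve illustration only.

```python
def personalized_advantage_index(pred_a, pred_b):
    """Difference between the predicted outcomes under treatment A and treatment B."""
    return pred_a - pred_b


def recommend(pred_a, pred_b, margin=0.1):
    """Recommend a treatment only if the predicted advantage exceeds the margin.

    The margin is an illustrative assumption, not part of the index itself.
    """
    pai = personalized_advantage_index(pred_a, pred_b)
    if pai > margin:
        return "A"
    if pai < -margin:
        return "B"
    return "no clear preference"


# Hypothetical predicted response probabilities from two single-treatment models.
print(recommend(pred_a=0.70, pred_b=0.40))
print(recommend(pred_a=0.52, pred_b=0.50))
```

A recommendation of this kind is only as trustworthy as the two underlying models, so each would need to be validated on its own before being combined.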

Important brain regions
Connectivities of the DLPFC (here including BA 6, 8, 9, and 46 following the parcellation of Glasser et al., 2016) had high predictive value in the highest number of studies, both in terms of absolute and relative frequency. The DLPFC is part of the central executive network (CEN, also called frontoparietal network; Seeley et al., 2007) that supports decision-making, emotion regulation, and working memory (Menon, 2011). Connectivity of DLPFC and CEN has been associated with depression (see meta-analysis of Brandl et al., 2022) and has shown treatment-induced changes after ECT and rTMS (see reviews of Brakowski et al., 2017; Porta-Casteràs et al., 2021). Moreover, although less consistently, pretreatment connectivity of DLPFC and CEN has been associated with treatment outcome in several studies (see review of Taylor et al., 2021).
Additional support for the hypothesis that the DLPFC might play an important role in the etiology and maintenance of depression comes from lesion-based network mapping, showing that lesions associated with depression can be mapped to a common circuit that is centered in the DLPFC (Padmanabhan et al., 2019). In a similar vein, a recent analysis of task-based fMRI data targeting altered emotional and cognitive processing in depression revealed two robust circuits of altered emotional and cognitive processing which both included the DLPFC (Cash et al., 2023). Interestingly, the abnormal emotion circuit included the left DLPFC, while the abnormal cognition circuit included the right DLPFC, suggesting that a closer look at the DLPFC might be beneficial, both in terms of lateralization and parcellation into subparts (e.g., Cieslik et al., 2013).
Other brain areas with a relative frequency larger than 50% were visual and sensorimotor areas. Visual and sensorimotor areas included both lower sensory processing areas, such as V3 (Drysdale et al., 2017; Tian et al., 2020) and primary sensorimotor cortices (Drysdale et al., 2017; van Waarde et al., 2015; H. Wu et al., 2022), as well as higher sensory processing areas, such as the fusiform face complex (Sun et al., 2020; H. Wu et al., 2022) and the (pre-)supplementary motor area (Tian et al., 2020; Zhutovsky et al., 2019). Aberrant low- and high-level visual processing and sensorimotor functioning in depression have been reported in several studies, both on a behavioral (e.g., Bennabi et al., 2013; Brakowski et al., 2017; Bubl et al., 2010) and neural level (e.g., Chen et al., 2022; Liu et al., 2022; Ray et al., 2021; Zeng et al., 2012). Moreover, a recent study describing the brain's functional connectivity profile in terms of a principal functional similarity gradient showed that gradient differences between patients with depression and healthy controls were mainly rooted in areas of the visual, sensorimotor and default-mode network (Xia et al., 2022). However, in most recent meta-analyses, alterations in visual or sensorimotor neural processing in depression did not reach significance (e.g., meta-analyses of Brandl et al., 2022; Gray et al., 2020). Additionally, studies investigating and/or reporting associations between pretreatment neural processing in visual or sensorimotor areas and treatment outcome have been rare. Although Dichter et al. (2015) reported pretreatment connectivity differences in visual recognition circuits between responders and non-responders as one of their key findings, this pattern only emerged in 4 of the 21 studies reviewed. Thus, together with current literature, the relatively high predictive value of visual and sensorimotor areas suggests that functional connectivity of these areas plays a role in psychopathology and treatment outcome of depression but might be more difficult to detect or might only be relevant for a subgroup of patients.
The basal ganglia, which include subcortical nuclei like caudate, putamen (striatum) and globus pallidus, showed predictive value in half of the possible studies. Interestingly, three out of the four studies in which the basal ganglia had no predictive value used independent component analysis (ICA), suggesting that ICA might be less suitable to detect connectivities of the basal ganglia. Indeed, even though a basal ganglia network, comprising basal ganglia components and the thalamus, can be detected in ICA (Robinson et al., 2009), this is often not the case, possibly because of the low proportion of variance it explains (Robinson et al., 2009). Thus, considering the basal ganglia's involvement in cognitive, emotional, and reward processing (e.g., Chakrabarty et al., 2016; Pierce and Péron, 2020), their association with anhedonia (e.g., Borsini et al., 2020; Brandl et al., 2022; Gray et al., 2020) and the target role of ventral striatum and nucleus accumbens in deep brain stimulation (Drobisz and Damborská, 2019; Y. Wu et al., 2021), we suggest exploring the predictive value of basal ganglia functional connectivity with alternative methods.
It is noteworthy that the functional connectivity of other areas commonly involved in the etiology of depression, such as amygdala, insula and anterior cingulate cortex, showed no high predictive value in our analysis. This underscores that neurological correlates of mental disorders may not inherently predict treatment outcome. Other factors, such as plasticity and compensatory neurological mechanisms, could play a more pivotal role. Additionally, capacities like emotion regulation, with distinct correlates from the disorder, may significantly contribute to predicting treatment outcome.
Furthermore, it is important to note that our examination was limited to brain areas whose connectivity demonstrated high predictive value across treatments. The investigation of treatment-specific connectivities with high predictive value was not feasible due to the limited number of studies and high treatment heterogeneity. As a result, our analysis focused solely on identifying brain regions whose connectivities might predict treatment outcome regardless of treatment type. Following the concept outlined by Simon and Perlis (2010), these connectivities could be characterized as general predictors of prognosis or general predictors of treatment response. According to Simon and Perlis (2010), distinguishing between these two groups requires studies predicting response to placebo treatment. If predictors also showed high predictive value in placebo studies, they would be considered general predictors of prognosis; if not, they could be seen as general predictors of treatment response. However, as none of the studies in our analysis used a placebo, we were unable to make this distinction. Nevertheless, irrespective of the accurate characterization, which is less crucial from a machine learning perspective, functional connectivities of the DLPFC, visual, and sensorimotor areas appear to contribute significantly to the correct prediction of machine learning models and should therefore be considered in future models.

Methodological approaches and recommendation
We summarized current methodological approaches to assist future researchers in making informed decisions regarding two questions: (1) Which approaches are taken to draw inferences about the predictive value of specific features? (RQ 2.1) (2) Which approaches are taken to reduce the large amount of theoretically initially available functional connectivities to a small set of features to be used in the final classifier(s)? (RQ 3) Regarding the first question (RQ 2.1), approaches taken to draw inferences about the features' predictive value, we identified three distinct groups of methodologies that were mostly employed exclusively: model comparison, selection frequency in feature selection, and feature importance in the final classifier. As these approaches evaluate the predictive value of features on distinct levels, we refrain from favoring one over the other and view them as complementary. For instance, a high selection frequency in feature selection indicates that a feature was included in the final model in most iterations but does not necessarily imply that this feature also drove the final prediction. We recommend, therefore, deviating from common practice by assessing predictive value on multiple levels when suitable. We advise consistently evaluating feature importance in the final classifier and assessing selection frequency when employing feature selection, even when the primary goal is to compare the predictive value of different features by comparing distinct models (for a review of different measures of feature importance see Mi et al., 2020). This approach ensures a comprehensive understanding of predictive value and helps to detect potential errors or unexpected model behavior.
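Selection frequency, one of the three approaches named above, can be computed by counting how often each feature is selected across cross-validation iterations. The following sketch is a minimal illustration; the connectivity labels and fold contents are hypothetical.

```python
from collections import Counter


def selection_frequency(selected_per_fold):
    """Fraction of cross-validation folds in which each feature was selected."""
    counts = Counter(f for fold in selected_per_fold for f in set(fold))
    n_folds = len(selected_per_fold)
    return {feature: count / n_folds for feature, count in counts.items()}


# Hypothetical features selected in each of four cross-validation folds.
folds = [
    ["DLPFC-V3", "DLPFC-SMA"],
    ["DLPFC-V3", "putamen-SMA"],
    ["DLPFC-V3", "DLPFC-SMA"],
    ["DLPFC-V3", "putamen-V3"],
]
freq = selection_frequency(folds)
for feature, value in sorted(freq.items(), key=lambda item: -item[1]):
    print(f"{feature}: selected in {value:.0%} of folds")
```

A feature selected in every fold has entered every model, but, as noted above, this alone does not show that it drove the predictions; inspecting its importance in the final classifier addresses that second question.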
Concerning the second methodological question (RQ 3), approaches taken to reduce the large amount of theoretically initially available functional connectivities, we observed a common procedure that accommodated a diverse array of approaches. First, all studies performed some kind of initial reduction before feature generation by either selecting specific brain areas and/or connectivities or parcellating the whole-brain data in a data- or atlas-based manner. Second, after generating functional-connectivity-based features (e.g., functional connectivities themselves, node flexibilities, or subject-specific spatial maps), the majority of studies further diminished the number of input features by employing feature selection techniques (mainly filter techniques) and/or distributing features to multiple models. The specific approaches varied among all studies, even when calculating the same type of features. This methodological heterogeneity underscores the lack of standards in dealing with the large amount of theoretically available functional connectivities, making the comparison of study findings more challenging. Therefore, further studies are needed to systematically explore the effects of various methodological choices across different data sets in order to provide general recommendations.
Regarding approaches that involve an initial reduction before feature generation, we refrain from providing explicit recommendations. Both methods, namely a priori selection of functional connectivities and whole-brain analysis with a parcellation technique, have their respective advantages and drawbacks. While the number of features generated after a priori selection may be more manageable for machine learning within the current scope, this approach may only be valid if the selection is thoroughly justified, which is often not the case.
In contrast, regarding approaches that serve feature reduction after feature generation, we would like to highlight several shortcomings that should be addressed in future studies. First, the most commonly employed approach, feature selection via filter techniques, increases the risk of bias due to the inherently univariate nature of filter techniques. This risk of bias emerges as these techniques are unable to select multivariate patterns with high predictive value (Mwangi et al., 2014). Hence, we recommend employing more sophisticated feature selection techniques such as wrapper and embedded methods, which are more suitable for multivariate data (Mwangi et al., 2014). Second, the other frequently utilized approach, allocating features to different models, also amplifies the risk of bias in its typical implementation. Most studies tested several models in parallel and reported the prediction accuracy of the best model as the estimate of prediction performance. As pointed out in the results section, this procedure might induce bias, as in a naturalistic setting, the best model cannot be chosen retrospectively; it should inform the practitioner's further actions before the beginning of treatment. To reduce the risk of bias without employing an additional external validation, we recommend training one final second-level model on the predictions of several first-level models, as applied by Pei et al. (2020).
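The optimism introduced by reporting only the best of several models can be illustrated with a simple simulation. The sketch below rests on our own simplifying assumptions: all models guess at chance level, their predictions are independent of each other (models trained on the same data would be correlated, which reduces the effect), and the sizes (40 subjects, 10 models, 2000 repetitions) are hypothetical.

```python
import random


def best_of_k_accuracy(n_subjects, k_models, rng):
    """Highest accuracy among k models that all guess at chance level."""
    best = 0.0
    for _ in range(k_models):
        hits = sum(rng.random() < 0.5 for _ in range(n_subjects))
        best = max(best, hits / n_subjects)
    return best


rng = random.Random(42)
n_repetitions = 2000
mean_single = sum(best_of_k_accuracy(40, 1, rng) for _ in range(n_repetitions)) / n_repetitions
mean_best = sum(best_of_k_accuracy(40, 10, rng) for _ in range(n_repetitions)) / n_repetitions
print(f"one model, mean accuracy:          {mean_single:.3f}")
print(f"best of ten models, mean accuracy: {mean_best:.3f}")
```

Although no model has any predictive ability, the best of ten exceeds chance on average, which is why a single prespecified or second-level model gives a more honest performance estimate.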

Limitations
Our review has several limitations. First, even though commonly applied (e.g., Y. Lee et al., 2018; Vieira et al., 2022), the suitability of using meta-analysis for proportions in synthesizing cross-validation accuracies is not conclusively established. Potential issues emerge in terms of estimating the error variance (the square of the standard measurement error), which is typically used to weight the studies' results in the meta-analysis. In general, there might be no unbiased estimator of error variance for classification accuracy in cross-validation (for an overview of methods to estimate the error variance in cross-validation see Bates et al., 2023). Moreover, the use of an error variance estimator originally developed for proportions overlooks the impact of certain cross-validation types, like leave-one-out, on increasing error variance (Varoquaux, 2018). Additionally, assumptions underlying the estimation of error variance for proportions, such as each subject having an equal chance of being a case, do not hold in cross-validation, where the probability of being a case (= being correctly classified) varies for each training-test split. Despite these limitations, we maintain that this meta-analysis remains valuable in providing a numerical estimate of the current predictive capability for treatment outcome. Future research should dedicate attention to addressing this issue and developing guidelines for conducting meta-analyses of classification accuracies based on cross-validation.
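To make the weighting issue tangible, the sketch below shows one common form of the double arcsine (Freeman-Tukey) transformation for proportions and its approximate sampling variance, which depends only on the sample size. Which exact variant and software were used in the present meta-analysis is not restated here, so the formulas and the example counts should be read as illustrative assumptions.

```python
from math import asin, pi, sqrt


def double_arcsine(x, n):
    """Freeman-Tukey double arcsine transform of x successes out of n (one common form)."""
    return asin(sqrt(x / (n + 1))) + asin(sqrt((x + 1) / (n + 1)))


def double_arcsine_variance(n):
    """Approximate sampling variance of the transformed proportion."""
    return 1 / (n + 0.5)


# Hypothetical studies: (number of correctly classified subjects, sample size).
studies = [(14, 18), (100, 144)]
for correct, n in studies:
    weight = 1 / double_arcsine_variance(n)  # inverse-variance weight
    print(f"n = {n:3d}: transformed = {double_arcsine(correct, n):.3f}, weight = {weight:.1f}")
```

Because the variance is a function of sample size alone, the weights take no account of the cross-validation scheme, which is exactly the limitation discussed above.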
Second, all studies included had a high risk of bias. Therefore, the results presented here should be interpreted with caution. Please note, however, that this caveat applies to most systematic reviews in precision healthcare, as the PROBAST rating has revealed high risk of bias of predictive modelling in healthcare, notably due to methodological weaknesses in the analysis domain (Jong et al., 2021; Meehan et al., 2022). Third, even though we initially intended to give an overview of the current state of treatment outcome prediction in a wide spectrum of internalizing mental disorders, including also anxiety disorders and obsessive-compulsive disorders, we eventually only included studies on depression and post-traumatic stress disorder. This was not due to a lack of studies predicting treatment outcome in anxiety disorders and obsessive-compulsive disorders per se, but due to a lack of studies fulfilling our inclusion criteria, such as using a machine learning approach (Chen et al., 2022; Göttlich et al., 2015) or applying a model being only based on resting-state functional connectivity (Reggente et al., 2018; Whitfield-Gabrieli et al., 2016). The sample of included studies is thus not fully representative of the entire spectrum of internalizing mental disorders, and it is not clear to which extent our results are also valid for the other, initially targeted internalizing disorders. Fourth, we summarized important brain regions based on study-specific labels instead of using more sophisticated approaches such as coordinate-based meta-analysis, because a substantial proportion (4 out of 12) of the studies included lacked adequate coordinate information, mainly due to using methods that typically do not provide such information. Additionally, it is important to note that the initial screening was conducted by a single individual, which may not adhere to gold-standard practices.

Conclusion and further directions
The objective of this review was to provide a comprehensive overview of studies utilizing resting-state functional connectivity to predict treatment outcomes in internalizing mental disorders across a spectrum of treatments, encompassing psychotherapy, pharmacotherapy, rTMS, and ECT. Our meta-analysis indicated that treatment outcome can be predicted based on resting-state functional connectivity, with a mean estimated balanced accuracy of 77% (95% CI: [72%-83%]). However, aiming to give a realistic estimate of the potential of machine learning and resting-state functional connectivity, we underscored the need to interpret these values cautiously, considering them more as an optimistic upper limit of potential prediction performance. This caution stems from the influence of small sample sizes systematically biasing the results and a notable risk of bias, as evaluated through PROBAST. A closer look at connectivities which drove a successful prediction highlighted the important role of the dorsolateral prefrontal cortex and raised awareness of two other, previously rather neglected groups of brain areas: visual and sensorimotor areas. In future studies conducting an a priori feature selection for predicting treatment outcomes, it is advisable to contemplate the inclusion of functional connectivities from these specific areas. Moreover, summarizing current methodological practices and employing PROBAST, we have identified several methodological choices that should be considered in future studies. These include the use of larger sample sizes, potentially through collaborative efforts, evaluating predictive value on several levels, opting for multivariable instead of univariate feature selection techniques, and generating one final second-level model when comparing models based on different feature sets.
Besides these methodological choices, current developments might further leverage the predictive ability of resting-state functional connectivity. First, state-of-the-art methods to estimate functional connectivities might lead to better and more robust predictions (see review of Colclough et al., 2018), mitigating the problems of the typically used Pearson correlations, which suffer from a low signal-to-noise ratio (Pervaiz et al., 2020) and lack a distinction between direct and indirect connectivities (Smith et al., 2011). Second, another promising avenue might be measures which summarize a region's or network's connectivity in an informative manner, such as graph metrics (Rubinov and Sporns, 2010), circuit scores (e.g., Goldstein-Piekarski et al., 2022), and functional-similarity gradients (Haak et al., 2018). Lastly, another fruitful development could involve characterizing single-subject connectivities by framing them as deviations from those observed in healthy controls. Both straightforward methods, such as quantifying measures as z-deviations (Goldstein-Piekarski et al., 2022), and more complex methods, such as normative modelling (Marquand et al., 2019), might be beneficial. These approaches have the potential to better capture inter-subject heterogeneity of functional connectivity and to reduce the impact of noise.
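The simplest of the deviation-based approaches mentioned above, a z-deviation from healthy controls, can be sketched as follows. The connectivity values are made up, and the function is a generic z-score relative to a control sample rather than the exact procedure of the cited work.

```python
from statistics import mean, stdev


def z_deviation(patient_value, control_values):
    """Deviation of a patient's connectivity from the healthy-control distribution."""
    return (patient_value - mean(control_values)) / stdev(control_values)


# Hypothetical connectivity strengths of one connection in healthy controls.
controls = [0.30, 0.35, 0.25, 0.40, 0.20]
patient = 0.55
print(f"z-deviation: {z_deviation(patient, controls):.2f}")
```

Expressed this way, every feature is on a common scale that directly indicates how unusual a patient's connectivity is relative to the reference group.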

Table 1
Characteristics of the included studies.