Introduction

Obsessive-compulsive disorder (OCD) is a common, often chronic psychiatric disorder, affecting 1.0–1.5% of the global population over their lifetime [1]. Extensive neuroimaging research suggests structural and functional abnormalities in cortico-striato-thalamo-cortical (CSTC) circuits in OCD [2,3,4,5,6,7]. The field has also started to address the question of whether multivariate analyses of neuroimaging data can be used to classify OCD [8, 9].

Prior OCD studies with relatively small to modest samples show mixed findings, with OCD classification accuracies varying from 66% to 100% [8]. However, the generalizability of such findings has rarely been tested, and reproducibility failures have been a major challenge in psychiatric neuroimaging [9,10,11,12]. Indeed, typical single-site neuroimaging studies seeking brain-wide associations with psychopathology using small sample sizes of tens to hundreds of individuals may report inflated effect sizes, decreasing reproducibility [13].

The ENIGMA-OCD consortium has allowed rigorous mega-analyses and meta-analyses based on the largest international multisite neuroimaging datasets to date [9]. A machine learning analysis of regional measures of cortical thickness, surface area, and subcortical volume found that model performance did not exceed chance level, but that classification performance improved when individuals with OCD were grouped according to medication status.

Altered white matter pathways have been implicated in the neurobiology of OCD [14]. An ENIGMA-OCD study using diffusion tensor imaging reported significantly lower fractional anisotropy (FA) in the sagittal stratum (SS) and posterior thalamic radiation (PTR), higher mean diffusivity (MD) in the SS, and higher radial diffusivity (RD) in the SS and PTR [15]. However, whether white matter diffusion tensor imaging measures can be used to classify OCD has not yet been explored in large multisite studies.

In this study, we therefore used ENIGMA-OCD diffusion tensor imaging data to test the classification power of white matter measures in a large multisite sample of individuals with OCD and healthy controls. We tested several machine learning algorithms to distinguish individuals with OCD from healthy controls, unmedicated individuals with OCD from healthy controls, and medicated from unmedicated individuals with OCD. We also assessed the site variability and reproducibility of predictive models using leave-one-site-out cross-validation and evaluated the utility of a post-processing harmonization tool (i.e., NeuroComBat). Finally, we employed a machine learning interpretation framework to assess which features were most relevant to the various classifications.

Participants and methods

Participants

We used data collected by the ENIGMA-OCD Working Group at 18 international research institutes. We analyzed data from 1653 participants, including 1336 adult participants (429 unmedicated OCD, 261 medicated OCD, 646 healthy controls (HC)) and 317 pediatric participants (70 unmedicated OCD, 105 medicated OCD, 142 HC) (Table 1). Here, we defined pediatric participants as those under 18 years of age, consistent with previous work from the ENIGMA-OCD working group [2, 9]. Diagnoses of OCD and comorbid conditions (i.e., anxiety disorders and major depressive disorder) were assessed using DSM-IV criteria (American Psychiatric Association, 2000). Clinical characteristics included medication status, childhood onset, disease duration (in years), symptom severity (total scores ranging from 0 to 40 on the (Child) Yale-Brown Obsessive-Compulsive Scale ((C)Y-BOCS)) [16, 17], and current or lifetime history of symptom dimensions (i.e., aggression/checking, cleaning/contamination, sexual/religion, hoarding, ordering/symmetry). Participants without medication information were excluded from the medication classification analysis.

Table 1 Demographic and clinical characteristics of patients with obsessive-compulsive disorder (OCD) and healthy controls (HCs).

Image acquisition and processing

Image preprocessing, including brain extraction, eddy current correction, movement correction, echo-planar imaging-induced distortion correction, and tensor fitting, was conducted at each site, and Tract-Based Spatial Statistics (TBSS) was performed using protocols and quality control pipelines provided by the ENIGMA-DTI working group (http://enigma.ini.usc.edu/protocols/dti-protocols/) [15]. For the entire skeleton in each hemisphere, four DTI measures (FA, MD, axial diffusivity (AD), and RD) were estimated within 25 tract-wise regions of interest (ROIs) based on the Johns Hopkins University (JHU) white matter parcellation atlas [15].

OCD classification with machine learning

We conducted automated machine learning (AutoML) with H2O Driverless AI (DAI, version 1.8.7.1) using white matter anisotropy and diffusivity estimates (FA, MD, AD, RD; N = 252 features, i.e., 4 metrics × [19 fascicles × 3 (left, right, total) + 5 fascicles with a total value only (e.g., corpus callosum, fornix) + 1 average across all fascicles]) and biological variables (age, sex). Three classification models were built separately in the adult and pediatric samples: (1) OCD vs. HC; (2) unmedicated OCD vs. HC, to test the effects of OCD on white matter without confounding by medication; and (3) medicated OCD vs. unmedicated OCD, to test medication effects on white matter. To prevent data leakage and reduce overfitting, we split the data into a discovery set (80%) and a replication set (20%), stratified by diagnosis. In the discovery set, we used leave-one-site-out (LOSO) cross-validation (11 sites for adults, seven sites for pediatrics) (Supplementary Fig. 1). With this scheme, the discovery set was used to evaluate cross-site variability (i.e., generalizability across sites), and the replication set was used to test overall model generalizability in the presence of potential site variability. The test samples of the discovery data were not used during model optimization.

The AutoML pipeline estimates several base models (e.g., XGBoost, LightGBM, and a generalized linear model (GLM)) and stacked ensemble models [18] derived from them. The pipeline performs random hyperparameter tuning along with feature transformation (e.g., interaction encoding, numeric-to-categorical target encoding). In each iteration, models first learn and update the weights of the features and select important features based on the prior iteration. The pipeline then searches for the best feature transformations and model parameters using a genetic algorithm [19]; in DAI, this procedure is called “feature evolution”. In a genetic algorithm, evolution can be seen as a competition between mutating parameters to find the best “individuals”, where each individual encodes a set of feature transformations and hyperparameters. Feature evolution is not entirely random: it is informed by the variable importance and interaction information obtained from the modeling algorithms.

This training procedure, including feature selection, transformation, and hyperparameter tuning, was performed within an 11-fold cross-validation scheme. In each fold, 10 folds were used to train the model, and the remaining fold was used to (cross-)validate the best trained model. Finally, the best cross-validation models from each fold were combined and tested on the held-out replication set. In this way, the validation data within the 11-fold cross-validation were not used for model optimization or feature evolution. Likewise, the replication data were not used for data preprocessing, model training, or optimization. We used the ROC-AUC as the primary performance metric, with accuracy, sensitivity, and specificity as additional metrics. Metrics were calculated with pROC v. 1.16.2 in the R programming language [20].
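The leave-one-site-out scheme and the ROC-AUC metric described above can be illustrated with a minimal sketch. This is not the DAI pipeline or pROC: the helper names (`leave_one_site_out`, `roc_auc`), the toy FA values, and the stand-in "model" (which learns only the direction of the group difference from the training sites) are assumptions made purely for illustration.

```python
from collections import defaultdict


def leave_one_site_out(sites):
    """Yield (held_out_site, train_indices, test_indices), one fold per site."""
    by_site = defaultdict(list)
    for index, site in enumerate(sites):
        by_site[site].append(index)
    for held_out in sorted(by_site):
        test_idx = by_site[held_out]
        train_idx = sorted(
            i for site, idx in by_site.items() if site != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one case from each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): one FA-like feature, 1 = OCD, 0 = HC, three sites.
sites = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
labels = [1, 1, 0, 0] * 3
fa = [0.40, 0.42, 0.47, 0.45, 0.41, 0.46, 0.44, 0.48, 0.39, 0.43, 0.46, 0.44]

auc_by_site = {}
for held_out, train_idx, test_idx in leave_one_site_out(sites):
    # Stand-in "model": learn the sign of the OCD-vs-HC difference on the
    # training sites only, then score the unseen site with that sign.
    train_ocd = [fa[i] for i in train_idx if labels[i] == 1]
    train_hc = [fa[i] for i in train_idx if labels[i] == 0]
    ocd_higher = sum(train_ocd) / len(train_ocd) > sum(train_hc) / len(train_hc)
    direction = 1.0 if ocd_higher else -1.0
    test_scores = [direction * fa[i] for i in test_idx]
    test_labels = [labels[i] for i in test_idx]
    auc_by_site[held_out] = roc_auc(test_labels, test_scores)

print(auc_by_site)
```

Because every fold holds out a whole site, the spread of per-site AUC values gives a direct view of cross-site variability, which a random k-fold split would hide by mixing sites across folds.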

NeuroComBat harmonization

To reduce potential biases caused by site and scanner effects, we employed NeuroComBat harmonization [21]. ComBat, a short name for combatting batch effects when combining multiple batches [21, 22], corrects potential scanner/site effects on brain data by harmonizing the mean and variance of brain measures across scanners. We harmonized the diffusivity measures in the discovery and replication data separately while also including age and sex as covariates in the model matrix. Non-parametric empirical Bayes adjustments were used to adjust for batch effects.
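The core idea of harmonizing the mean and variance of a brain measure across sites can be sketched as a simple location-scale adjustment. The sketch below is a deliberately simplified stand-in and not the NeuroComBat algorithm: it omits the covariate model (age, sex) and the empirical Bayes adjustment described above, and the function name and toy values are hypothetical.

```python
from statistics import mean, pstdev


def harmonize_location_scale(values, sites):
    """Shift and rescale each site's values to the pooled mean and spread."""
    by_site = {}
    for value, site in zip(values, sites):
        by_site.setdefault(site, []).append(value)

    grand_mean = mean(values)
    # Pooled within-site variance, weighted by the number of scans per site.
    pooled_var = sum(len(v) * pstdev(v) ** 2 for v in by_site.values()) / len(values)
    pooled_sd = pooled_var ** 0.5

    harmonized = []
    for value, site in zip(values, sites):
        site_mean = mean(by_site[site])
        site_sd = pstdev(by_site[site])
        # Standardize within site, then map back onto the pooled scale.
        z = (value - site_mean) / site_sd if site_sd > 0 else 0.0
        harmonized.append(grand_mean + pooled_sd * z)
    return harmonized


# Toy FA values (made up): site B has a higher mean and a wider spread.
fa = [0.40, 0.42, 0.44, 0.50, 0.54, 0.58]
site = ["A", "A", "A", "B", "B", "B"]
fa_h = harmonize_location_scale(fa, site)
print([round(v, 4) for v in fa_h])
```

After the adjustment both sites share the same mean and standard deviation. An adjustment this naive would also remove genuine between-site differences in age, sex, or diagnosis, which is why the method used in the study includes covariates in the model matrix.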

Model interpretation

To interpret the machine learning classifiers, we determined the relative weights of DTI features contributing to OCD classification in two steps. First, we calculated feature importance within each base model according to the model-specific algorithm; for LightGBM and XGBoostGBM, DAI computed the average reduction in impurity across all trees. Second, the feature importance from each base model was multiplied by that model's weight and normalized. We further implemented a machine learning interpretation framework, K-Local Interpretable Model-agnostic Explanation (K-LIME) [23]. This method fits surrogate linear models to the data to extract the important features that are positively or negatively associated with a target outcome: (1) OCD vs. HC, (2) unmedicated OCD vs. HC, and (3) medicated OCD vs. unmedicated OCD.
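The two-step weighting can be sketched as follows. This is an illustrative reading of the procedure rather than DAI's internal code: the feature names, importance values, and ensemble weights are made up, and scaling the result so that the top feature equals 1 is an assumption about what "normalized" means here.

```python
def combine_importances(model_importances, model_weights):
    """Weight each base model's feature importances and rescale to a top value of 1."""
    combined = {}
    for model, importances in model_importances.items():
        weight = model_weights[model]
        for feature, importance in importances.items():
            combined[feature] = combined.get(feature, 0.0) + weight * importance
    top = max(combined.values())
    return {feature: value / top for feature, value in combined.items()}


# Hypothetical per-model importances (step 1) and ensemble weights (step 2).
model_importances = {
    "LightGBM": {"SCR_MD": 0.5, "age": 0.3, "PTR_FA": 0.2},
    "XGBoostGBM": {"SCR_MD": 0.2, "age": 0.6, "PTR_FA": 0.2},
}
model_weights = {"LightGBM": 0.75, "XGBoostGBM": 0.25}

relative = combine_importances(model_importances, model_weights)
ranking = sorted(relative, key=relative.get, reverse=True)
print(ranking, relative)
```

A feature that ranks highly only in a low-weight base model contributes little to the combined ranking, so the result reflects what the ensemble as a whole relied on.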

Statistical analysis

To assess the effects of sites on diffusion white matter estimates, we performed principal component analysis (PCA). We tested the association between predicted OCD probabilities and clinical variables (e.g., medication status, childhood-onset) using stepwise regression models [24]. Additionally, we tested site effects on individual classification performances (i.e., whether participants were correctly classified as OCD or HC). To adjust for potential confounding factors, we included the following variables as covariates: age, sex, site, and average DTI metrics (i.e., mean FA, AD, RD, MD).
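As a simplified illustration of testing site effects on individual classification performance, the sketch below computes an unadjusted Pearson chi-square statistic for a site-by-outcome table. It does not include the covariate adjustment used in the study, and the counts are invented for the example.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if correctness were independent of site.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Rows: three hypothetical sites; columns: correctly / incorrectly classified.
counts = [
    [30, 10],
    [20, 20],
    [10, 30],
]
stat, dof = chi_square_statistic(counts)
print(stat, dof)
```

A large statistic relative to its degrees of freedom indicates that the proportion of correctly classified participants differs between sites.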

Results

Demographic characteristics

This study included 1336 adult participants (690 OCD, 646 HC) and 317 pediatric participants (175 OCD, 142 HC). Of the adult OCD sample, 37.8% were taking medication, compared with 60% of the pediatric OCD sample. OCD patients showed comorbidity with lifetime anxiety disorders (adult: 11.02%, pediatric: 27.4%) and major depressive disorder (adult: 12.2%, pediatric: 10.3%). Table 1 and Supplementary Table 1 contain detailed demographic and clinical characteristics of the participants. Demographic characteristics were not significantly different between OCD and HC (ps > 0.45). However, clinical characteristics varied across sites in adults, including childhood onset (\(\chi^{2}\) = 93.66, p < 0.001) and the symptom dimensions aggression/checking (\(\chi^{2}\) = 64.33, p < 0.001), contamination/cleaning (\(\chi^{2}\) = 53.02, p = 0.002), sexual/religious (\(\chi^{2}\) = 46.33, p = 0.012), hoarding (\(\chi^{2}\) = 73.06, p < 0.001), and symmetry/ordering (\(\chi^{2}\) = 145.03, p < 0.001). Illness duration also varied across sites in the pediatric samples (F = 13.20, p < 0.001).

Classification of OCD

The principal component analysis (PCA) of the four diffusion metrics (FA, MD, AD, RD) across the 18 international sites revealed site variability (Fig. 1). In the PCA biplot, two sites, one adult and one pediatric, were distinct from the other sites. We then performed three classification tasks using the stacked ensemble machine learning models (LOSO cross-validation): (1) OCD vs. HC, (2) unmedicated OCD vs. HC, and (3) unmedicated OCD vs. medicated OCD (Tables 2 and 3, Fig. 2).

Fig. 1: A biplot of principal component analysis (PCA) using the diffusion tensor estimates of the major white matter fascicles across the 18 international sites.

(A) PCA biplot before applying NeuroComBat (left: adult; right: pediatric). Some sites (e.g., site B) show apparent clusters distinct from the rest of the sites. (B) PCA biplot after applying NeuroComBat (left: adult; right: pediatric).

Table 2 Performance of classification of OCD clinical outcomes in (A) adult samples, (B) adult samples after NeuroComBat harmonization, (C) pediatric samples, and (D) pediatric samples after NeuroComBat harmonization. Values are reported as mean with 95% confidence interval.
Table 3 The association between brain-predicted OCD risk probabilities and clinical features in a discovery set (stepwise regression).
Fig. 2: Classification of OCD diagnosis and medication status using diffusion tensor estimates.

(A) Classification performance in adult samples. (B) Classification performance in pediatric samples.

In adult samples, the models minimally-to-modestly classified participants with OCD diagnosis from healthy controls in the discovery set (N = 1068, ROC AUC = 67.29 ± 0.26) and the replication set (N = 268, ROC AUC = 57.19 ± 3.47). The models also minimally-to-modestly distinguished unmedicated OCD versus healthy individuals in the discovery set (N = 854, ROC AUC = 63.96 ± 0.43) and the replication set (N = 214, ROC AUC = 62.67 ± 3.84). Finally, the models distinguished medicated OCD versus unmedicated OCD participants in the discovery set (N = 437, ROC AUC = 60.22 ± 0.40) and the replication set (N = 137, ROC AUC = 76.72 ± 3.97).

In pediatric samples, the models classified participants with OCD diagnosis versus healthy controls in the discovery set (N = 270, ROC AUC = 69.54 ± 8.59) and the replication set (N = 64, ROC AUC = 59.80 ± 7.39). The models also classified unmedicated OCD versus healthy individuals in the discovery set (N = 151, ROC AUC = 65.96 ± 12.33) and the replication set (N = 38, ROC AUC = 48.51 ± 10.14). Finally, the models classified medicated OCD versus unmedicated OCD participants in the discovery set (N = 140, ROC AUC = 61.82 ± 15.50) and the replication set (N = 35, ROC AUC = 72.45 ± 8.87) (Table 2C).

In classifying OCD and HC, the ROC AUC ranged from 51.6% (site C) to 79.1% (site F) in adult samples and from 35.9% (site M) to 63.2% (site L) in pediatric samples. Mean values of DTI metrics across all ROIs also differed significantly across sites (Fs > 97.4, ps < 0.001). Site was significantly associated with classification performance in OCD patients (\(\chi^{2}\) = 57.19, p < 0.001) and HCs (\(\chi^{2}\) = 50.30, p < 0.001) after adjusting for the covariates (Fig. 3).

Fig. 3: Sample characteristics and prediction performance (ROC AUC) across sites.

(A) Adult samples. (B) Pediatric samples. Left: violin plots of sociodemographic, clinical, and brain features across sites. Right: box plot of the area under the receiver operating characteristic curve (ROC AUC) for the leave-one-site-out (LOSO) cross-validation in the diagnosis classification task (OCD vs. HC).

Classification of OCD with NeuroComBat-harmonized data

Considering the site variability (Fig. 1), we repeated the machine learning analysis with NeuroComBat-harmonized data to correct for site effects. Compared with the unharmonized data, the NeuroComBat-harmonized data showed slightly lower performance in the adult samples (Table 2B) and slightly higher performance in the pediatric samples (Table 2D).

Variables associated with OCD classification

Results of stepwise regression analysis indicated that, in adults, site (e.g., site H, site I), higher age, hoarding symptoms, and adult-onset were significantly associated with estimated OCD probabilities (t > 2.04, p < 0.05) (Table 3). In pediatric samples, site (e.g., site M, site S), lifetime diagnosis of depression, and aggression/checking symptoms significantly correlated with predicted OCD probabilities (t > 2.15, p < 0.05).

Machine learning interpretation

Our machine learning interpretation models showed that various specific diffusion white matter features contributed to the OCD classification (Figs. 4 and 5, Supplementary Fig. 2). For the classification of OCD from HC in adult samples, the top 10 features included the superior corona radiata (MD), age, posterior thalamic radiation (FA), and posterior limb of the internal capsule (FA, AD). In the pediatric samples, the cingulum (MD, AD), uncinate fasciculus (MD), fornix (FA), corticospinal tract (FA), and anterior corona radiata (AD) were important in classifying OCD diagnosis (Supplementary Fig. 2). In classifying unmedicated OCD and HC, the internal capsule contributed in both adult (FA, AD of posterior limb) and pediatric samples (FA of the retrolenticular part, AD of anterior limb, FA of posterior limb) (Fig. 4). In classifying medicated OCD and unmedicated OCD in adult samples, the top 10 features included the corpus callosum (total, genu), average FA, and average RD (Supplementary Fig. 2). For the pediatric samples, the top 10 features included the fornix and stria terminalis and the cingulum (cingulate gyrus, hippocampus) (Supplementary Fig. 2).

Fig. 4: Top 10 features of classification models in adults.

(A) Top 10 features contributing to the classification of OCD from HC in adults. (B) Top 10 features contributing to the classification of unmedicated OCD from HC in adults. (C) Top 10 features contributing to the classification of medicated OCD from unmedicated OCD in adults. Note: The color legend represents DTI measures: red for FA, yellow for MD, green for AD, and blue for RD. Regions with multiple DTI measures are highlighted in purple.

Fig. 5: Top 10 features of classification models in pediatrics.

(A) Top 10 features contributing to the classification of OCD from HC in pediatrics. (B) Top 10 features contributing to the classification of unmedicated OCD from HC in pediatrics. (C) Top 10 features contributing to the classification of medicated OCD from unmedicated OCD in pediatrics. Note: The color legend represents DTI measures: red for FA, yellow for MD, green for AD, and blue for RD. Regions with multiple DTI measures are highlighted in purple.

Discussion

In this study, we tested how accurately machine learning can classify the diagnosis or medication status of OCD patients based on white matter diffusion estimates obtained using the ENIGMA-matched image analysis pipeline across 18 international sites. Our results showed low-to-moderate accuracy in predicting OCD diagnosis and medication status. Classification of medicated versus unmedicated OCD had the best performance (ROC-AUC of 76.72 in the adult replication set), followed by unmedicated OCD versus healthy control classification (ROC-AUC of 63.96 in the adult discovery set and 62.67 in the replication set) and all OCD versus HC (ROC-AUC of 57.19 in the adult replication set). In all OCD versus HC classifications, performance varied significantly across sites, with cross-validated ROC AUC ranging from 51.6 to 79.1 in adults and from 35.9 to 63.2 in children. Diffusion white matter features contributing to OCD classification (compared with HC) included anisotropy and diffusivity estimates of white matter in the internal capsule, thalamic radiation, and uncinate fasciculus.

The low-to-moderate accuracy of our machine learning models is consistent with prior work. OCD machine learning studies using structural MRI have found that accuracy in classifying OCD and HC ranges from 60 to 90%, all in small datasets (N < 150) [8, 10]. However, classification performance from such small studies is likely to be inflated and not generalizable, while the true effect size (i.e., the brain-psychopathology association, regardless of the choice of analysis) may be smaller [13]. Indeed, a recent large-scale ENIGMA-OCD study found that machine learning models trained on gray matter morphometric estimates from structural MRI resulted in poor classification of OCD vs. HC (ROC AUC, 51-54; leave-one-site-out CV) [9]. Our model based on white matter features showed improved classification performance compared with the gray matter morphometry model in adult and pediatric samples, though a direct comparison may not be warranted because of the different machine learning pipelines and subsamples used in this study. Future studies should determine whether multimodal machine learning using structural and functional MRI can increase classification accuracy [25,26,27,28].

We observed significant site variability in classification performance. First, this may be related to variability in the quality of diffusion MRI across sites. The aggregated ENIGMA MRI data were harmonized at the post-acquisition processing stage (e.g., TBSS) but not at data acquisition. Although this harmonization approach was best practice when raw image data could not be shared, diffusion MRI is sensitive to acquisition conditions (e.g., magnet types and pulse sequence parameters such as the number of gradient directions or b-values), in contrast to gray matter morphometry, which has been validated across scanners, sites, and pulse sequence designs [29]. Our approach is therefore limited in controlling potential confounding factors and their impact on the quality of the diffusion white matter metrics. Our application of another post-processing harmonization method, NeuroComBat, was effective in matching the distributions of the data across sites (as shown in our PCA results). However, this method did not yield a performance gain in OCD classification (slightly higher AUC in pediatric samples, slightly lower AUC in adult samples) or a reduction in cross-site variability. The covariate modeling with NeuroComBat also did not demonstrate a gain in performance. Second, our international multisite clinical samples show variability in clinical characteristics such as symptom severity, age, adult onset, and duration of illness. This sampling variability may have added complexity to the already challenging task of OCD classification.

Our analysis of the machine learning model indicated that predicted OCD probability was significantly associated with several sociodemographic and clinical characteristics. Adults with OCD who were older, had adult onset, or had greater hoarding or depressive symptoms were more likely to be predicted as having OCD. The significant correlation of age and adult onset with the OCD likelihood might reflect age-dependent patterns in the diffusion white matter estimates. Although there were no significant group differences in age between OCD and HC, the neurobiology of OCD might be related to abnormal aging effects on the diffusion white matter estimates. Indeed, some literature shows that psychiatric disorders, including OCD and anxiety disorders, are linked to accelerated brain aging [30, 31]. However, the potential association between the neuropathophysiology of OCD and age appears more relevant to adults than to children because, despite the similar effect sizes of age on the OCD likelihood, only the adult samples showed statistical significance (probably owing to the larger sample size). This may reflect the effects of chronicity in adult samples [32].

Our machine learning interpretation is consistent with prior white matter studies that have relied on univariate analyses and/or small sample sizes [33]. For example, the well-known CSTC pathway includes the internal capsule (posterior limb (FA, AD) in adults and retrolenticular part (MD) in children), which has been implicated in habit formation and cognitive control in OCD [34]. In the classification model of unmedicated OCD and HC, the corpus callosum, which connects the two cerebral hemispheres, was important in adult and pediatric samples alike. This finding is in line with the previous ENIGMA-OCD study [15] indicating that adult OCD was characterized by lower volume in the genu of the corpus callosum than HC. However, careful interpretation is needed because of differences in the brain metrics used, which here are based on tensor modeling (FA, MD). In addition, we found that the cingulum bundle contributed to the classification of unmedicated OCD and medicated OCD in both adult and pediatric samples. The cingulum bundle contains short and long connections between the frontal, parietal, and temporal lobes. In short, our machine learning findings suggest common patterns of white matter abnormalities in adult and pediatric OCD, as well as distinct patterns consistent with prior work [2].

The classification model of unmedicated OCD versus HC showed greater accuracy than the model classifying all OCD versus HC. This suggests that medication status likely confounds the white matter microstructure of OCD patients. In the literature, the causal effects of medication, namely selective serotonin reuptake inhibitors (SSRIs), on white matter microstructure remain unclear, as no randomized controlled trial exists. Nevertheless, given the key role of serotonin in neurodevelopment, including gliogenesis [35], changes in extracellular serotonin levels in the brain owing to SSRIs may impact the integrity of white matter fibers. Prior correlational research supports this. A cross-sectional study showed lower FA in the sagittal stratum associated with medication use in adults with OCD compared with unmedicated OCD [15]; longitudinal clinical studies showed a decrease in MD of the midbrain white matter bundles after 12-week administration of SSRI [36] and a decrease in MD in the frontal regions and the corpus callosum [37]. Although some of these correlational findings might indicate causal effects of SSRIs on white matter, without direct causal evidence it remains unclear whether the associations result from the neurobiological effects of SSRIs, symptom improvement, or both. A practical implication of our finding is that the diffusion white matter-based model may be particularly useful for classifying medication-naïve individuals with OCD from healthy individuals. Although the models do not yet reach clinical utility (e.g., an AUC of around 80%), with further research (perhaps integrating brain, genetic, and behavioral multimodal data [38]), white matter diffusion estimates might be used to predict the risk for OCD. Future research may test whether models trained on medication-naïve OCD patients, which may be capable of learning the neurobiological patterns underlying OCD without medication confounding, can be used for related tasks (e.g., via representation learning [39]).

This study has limitations. First, image acquisition was not harmonized across sites, so we could not test whether the suboptimal model performance or the cross-site variability resulted from data quality issues. Given the sensitivity of anisotropy and diffusivity estimates to pulse sequence design (e.g., the number of directions, b-values) [40], remaining data quality and validity issues may have worked against model performance despite the harmonized image processing method (TBSS). Second, since only image-derived phenotypes, and not raw images, were available from the ENIGMA consortium, our results are limited to a single type of analysis (TBSS) and a single set of metrics (diffusivity and anisotropy). Third, our adult samples were larger than the pediatric samples, so our machine learning methods may have resulted in more optimized learning outcomes for adult samples.

In conclusion, using the largest multisite DTI dataset with harmonized image processing, our investigation indicates that machine learning models currently offer only poor-to-modest classification power, but that they capture meaningful multivariate patterns of white matter features relevant to the neurobiology of OCD. Accuracy is largely constrained by site variability, indicating room for future improvement.