Automated classification of depression from structural brain measures across two independent community‐based cohorts

Abstract Major depressive disorder (MDD) has been the subject of many neuroimaging case–control classification studies. Although some studies report accuracies ≥80%, most have investigated relatively small samples of clinically‐ascertained, currently symptomatic cases, and did not attempt replication in larger samples. We here first aimed to replicate previously reported classification accuracies in a small, well‐phenotyped community‐based group of current MDD cases with clinical interview‐based diagnoses (from STratifying Resilience and Depression Longitudinally cohort, ‘STRADL’). We performed a set of exploratory predictive classification analyses with measures related to brain morphometry and white matter integrity. We applied three classifier types—SVM, penalised logistic regression or decision tree—either with or without optimisation, and with or without feature selection. We then determined whether similar accuracies could be replicated in a larger independent population‐based sample with self‐reported current depression (UK Biobank cohort). Additional analyses extended to lifetime MDD diagnoses—remitted MDD in STRADL, and lifetime‐experienced MDD in UK Biobank. The highest cross‐validation accuracy (75%) was achieved in the initial current MDD sample with a decision tree classifier and cortical surface area features. The most frequently selected decision tree split variables included surface areas of bilateral caudal anterior cingulate, left lingual gyrus, left superior frontal, right precentral and paracentral regions. High accuracy was not achieved in the larger samples with self‐reported current depression (53.73%), with remitted MDD (57.48%), or with lifetime‐experienced MDD (52.68–60.29%). Our results indicate that high predictive classification accuracies may not immediately translate to larger samples with broader criteria for depression, and may not be robust across different classification approaches.


S1.1.1 Scanning Details
Brain imaging and scanning sequence details for UK Biobank participants were previously described elsewhere (please see Alfaro-Almagro et al., 2018;Smith, Alfaro-Almagro, & Miller, 2018;UK Biobank, 2014). We here describe the T1-weighted and DTI scanning sequence details for STRADL participants, scanned in Aberdeen and in Dundee.
Aberdeen participants in STRADL were imaged on a 3T Philips Achieva TX-series MRI system (Philips Healthcare, Best, Netherlands) with a 32 channel phased-array head coil with a back facing mirror (software version 5.1.7; gradients with maximum amplitude 80 mT/m and maximum slew rate 100 T/m/s). For T1-weighted imaging, 160 sagittal slices were acquired with repetition time 8.3 ms, echo time 3.8 ms, inversion time 1031 ms, 8˚ flip angle, field of view 240 mm, matrix size 240 × 240, and voxel size 0.9 × 0.9 × 1.0 mm 3 . Total acquisition time was 5 minutes and 38 seconds. For DTI imaging, there were 60 axial slices with repetition time 7010 ms, echo time 90 ms, 90˚ flip angle, field of view 220 mm, matrix size 96 × 94, voxel size 2.3 × 2.3 × 2.3 mm 3 , 64 non-collinear gradient directions (b = 1200 s/mm 2 ), and eight diffusion unweighted images. Total acquisition time was 9 minutes and 28 seconds.
The FreeSurfer processed scans were visually inspected and minor errors were manually corrected. Errors included incorrect skull stripping, exclusion of grey or white matter in tissue segmentation maps, or incorrect brain parcellation into separate regions (Neilson et al., 2019, supplementary material). Participants were excluded when there was at least one major error that could not be corrected or when there were multiple minor errors (N = 6). Additional N = 16 participants had at least one cortical measure missing after processing and were also excluded. As an additional quality control step, we also excluded N = 6 participants who were more than three standard deviations different from the sample mean in at least one of three global cortical measures -range (standard deviation) of cortical thickness across brain regions, sum of cortical region volumes, or sum of regional surface areas. N = 622 individuals were available for the STRADL dataset of brain morphometric measures ( Figure   S1A).

S1.1.3 UK Biobank FreeSurfer Processing Details
T1-weighted scans for N = 10,109 participants were processed with FreeSurfer version 5.3. Participants were excluded in cases of general FreeSurfer processing failure, one or more major processing errors, or multiple minor errors as described above for STRADL (N = 1,029). We additionally excluded N = 121 participants as outliers in global cortical metrics Classification of Depression with Structural Brain Measures Supplementary 3 (as above for STRADL), resulting in a dataset of N = 8,959 subjects in total ( Figure S2A).

S1.1.4 STRADL DTI Processing Details
Diffusion-weighted images for 980 participants were corrected for eddy current-related distortions and head movements ('eddy_correct' function in FSL), which was followed by skull stripping and computation of FA and MD maps. Skull stripping was performed with BET with a threshold of 0.2. FA and MD images were computed with DTIFIT component of FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FDT/UserGuide#DTIFIT). As part of the ENIGMA protocol, images were first slightly eroded to remove brain-edge artefacts, and then nonlinearly registered to the ENIGMA template and transformed into 1 × 1 × 1 mm standard space. N = 12 participants with FA image distortions or poor template registration were excluded after visual inspection. White matter skeleton was calculated as the mean of all registered FA images. FA data for each participant was then projected onto the skeleton with a threshold of FA > -0.049. Tracts for FA and MD measure ROI extraction were based on the Johns-Hopkins University (JHU) DTI-based white matter atlas (Mori & Crain, 2006). At the time of the study demographic data was available for N = 884 of N = 968 processed participants. As an additional quality control step, we excluded participants where global FA or global MD measures were more than three standard deviations different from the entire sample means. We here consider first principal components for all 43 FA and MD measures as representative of global FA and MD. Outlier exclusion resulted in data for N = 873 participants being available for STRADL dataset ( Figure S1B).

S1.1.5 UK Biobank DTI Processing Details
As part of the UK Biobank DTI processing protocol, diffusion-weighted images were corrected for head motion and eddy currents and processed with the TBSS toolkit to extract FA and MD skeletons (Smith et al., 2006;Smith et al., 2018, sections 3.10 and 3.10.1). FA  Figure S2B).

S1.2.1 STRADL Diagnostic Criteria
As described in the main text, diagnoses for STRADL participants were established using Structured Clinical Interview for DSM Disorders (SCID), and were based on the Diagnostic and Statistical Manual of Mental Disorders (American Psychiatric Association, 2000;First, Gibbon, Spitzer, & Williams, 2002). Participants were classed as currently depressed (cMDD-STR) if they had an ongoing MDD episode, and as remitted (rMDD-STR) if they had at least one past episode of MDD, but were not depressed at the time of the scan.

S1.2.2 UK Biobank Probable Current MDD (cMDD-UKB) Diagnostic Definition
Criteria for probable current MDD (cMDD-UKB) in UK Biobank was based on the three diagnostic categories defined in Smith et al. (2013), combined with a screen of symptoms at the time of the scan. Briefly, the categories in Smith et al. (2013) were based on self-reported past symptoms of depression (low mood or anhedonia lasting for at least two weeks at any time in their life), and self-reported history of seeing a psychiatrist or a GP for nerves, anxiety, tension or depression. Based on the self-reported participant data, Smith et al.
defined three diagnostic categories -single-episode, moderate recurrent or severe recurrent past (lifetime) depression. We classed participants as cMDD-UKB if they met criteria for either of the three categories, and also reported current symptoms. Participants screened positive for current symptoms if they fulfilled at least one of the following criteria: 1) Reported depressed mood over the past two weeks for more than half of the days or 2) Reported lack of interest or pleasure in daily activities over the past two weeks for more than half of the days or nearly every day (UKB touchscreen questionnaire, data item #2060); 3) Reported in general feeling very unhappy or extremely unhappy (UKB touchscreen questionnaire, data item #4526); 4) Reported at least one symptom for at least three of four symptom groups related to depression -mood symptoms, sleep problems, psychomotor symptoms or interpersonal symptoms.
Mood symptoms mentioned above included the following items: -Often feeling miserable (UKB touchscreen questionnaire, data item #1930); -Often feeling fed-up (UKB touchscreen questionnaire, data item #1960); -Experiencing depressed mood for several days over the past two weeks (UKB touchscreen questionnaire, data item #2050); -Experiencing lack of interest or pleasure for several days over the past two weeks (UKB touchscreen questionnaire, data item #2060); -Feeling moderately unhappy in general (UKB touchscreen questionnaire, data item #4526).
Sleep problems were defined by the following items: -Experiencing difficulty getting up in the morning (UKB touchscreen questionnaire, data item #1170); -Usually experiencing trouble in falling asleep, or waking up in the middle of the night (UKB touchscreen questionnaire, data item #1200).
Psychomotor symptoms included the following items: -Often experiencing restlessness over the past two weeks (UKB touchscreen questionnaire, data item #2070); -Experiencing tiredness or lack of energy nearly every day over the past two Interpersonal symptoms were defined by the following items: -Often being irritable (UKB touchscreen questionnaire, data item #1940); -Experiencing hurt feelings easily (UKB touchscreen questionnaire, data item #1950); -Often feeling lonely (UKB touchscreen questionnaire, data item #2020); -Often being troubled by feelings of guilt (UKB touchscreen questionnaire, data item #2030).
Participants were excluded from both cases and controls if they reported having Parkinson's disease, bipolar disorder, multiple personality disorder, schizophrenia, autism, intellectual disability, multiple sclerosis or cognitive impairment. Participants were also excluded from control samples if they reported depression, anxiety or other mood disorder, use of anxiolytic, antidepressant or antipsychotic drugs, had nervous breakdown or suicide attempt in the past, or had seen a GP or a psychiatrist about nerves, anxiety or depression.

S1.2.3 UK Biobank CIDI-SF Lifetime MDD (pMDD-UKB-CIDI) Diagnostic Definition
Composite International Diagnostic Interview assessment (CIDI-SF, Kessler, Andrews, Mroczek, Ustun, & Wittchen, 1998) was administered in UK Biobank as part of an online mental health questionnaire (UK Biobank, 2017). Participants were first asked if they had ever experienced a period of two weeks in which they had low or depressed mood, or a lack of interest or pleasure in daily activities. If they responded positively to either of the two questions, they were asked six additional questions about whether they experienced other symptoms of depression according to the DSM criteria at the same time (American Psychiatric Association, 2000). The assessed symptoms were related to feelings of worthlessness, tiredness, difficulty in concentrating, thoughts of death, changes in weight, and changes in sleeping patterns (UK Biobank, 2017). Participants were classed as having -Experienced at least five of the eight depression symptoms at the same time -Experienced low mood or lack of interest every day or almost every day during the episode, with feelings lasting most of the day or all day -Reported a level of psychosocial impairment (study / employment / childcare / housework or leisure) during the episode Participants were excluded from being controls for pMDD-UKB-CIDI definition if they reported a diagnosis of depression or had a score above 5 in PHQ-9 according to the online assessment (Kroenke & Spitzer, 2002).

S1.2.4 UK Biobank ICD Lifetime MDD (pMDD-UKB-ICD) Diagnostic Definition
Some participants in UK Biobank had a formal past diagnosis of depression, established by a clinician, reported in their hospital record. The diagnosis was established according to the ICD (World Health Organisation, 1992), but was only available for participants who were depressed during a hospital admission. These participants were classed as pMDD-UKB-ICD cases. Participants who did not have a hospital record available were not included as either cases or controls. Participant who were included as controls may have been depressed at some point in their life, but not during a hospital admission.
Penalised logistic regression was applied with elastic net penalty, which has performed relatively well in a previous study (Yang et al., 2018;Zou & Hastie, 2005). Splitting criterion for the decision tree classifier was Gini's Diversity Index, which is the default criterion in MATLAB R2015b. Features used for SVM and PLR classifier training and testing were standardised -centred by feature means and scaled by standard deviations in the training data. Classification was attempted both with and without hyperparameter optimisation for SVM and DT classifiers, and only with optimised hyperparameters for PLR classifier.

S1.3.2 Fixed Hyperparameters
SVM classifier has two main hyperparameters -box constraint (regularisation) and kernel scale. When no optimisation was applied, box contraint was set to the canonical value of 1, which the default heuristic implemented in MATLAB R2015b and other machine learning toolkits (Chang & Lin, 2011;Pedregosa et al., 2018). Kernel scale parameter was set to the square root of the number of features for each dataset, which is the heuristic implemented in LibSVM toolkit (Chang & Lin, 2011). Rationale for this heuristic is that the optimal kernel scale depends on the distance between data points from different classes, which is in turn bounded by the number of features, when the features are standardised.
Decision tree classifier has three core hyperparameters -maximum number of splits, minimum parent size, and minimum leaf size. Maximum number of splits was set to 20 and minimum parent size was set to 10 following the default MATLAB R2015b heuristics for medium-sized trees. There was no MATLAB heuristic for the minimum leaf size and this parameter was set to the value of 4, following the heuristic implemented in 'rpart' R package (Therneau, Atkinson, & Ripley, 2019). The 'rpart' package heuristic suggests specifying the minimum leaf size as 1 3 of the minimum parent size to reduce possibilities for overfitting. and hence analyses with fixed hyperparameters were not attempted.

S1.3.3 Hyperparameter Optimisation
Hyperparameters were optimised through grid search with inner cross-validation accuracy as the criterion for optimal hyperparameter value combinations. For decision tree, only minimum leaf size was optimised as the most important classifier hyperparameter (Mantovani et al., 2019). Larger minimum leaf sizes simultaneously constrain maximum number of decision tree splits, and some combinations of these two hyperparameters (e.g. low minimum leaf size and low maximum number of splits) can lead to less balanced trees which could be less generalisable. Maximum number of splits was therefore fixed and constrained by the sample size (N -1). Search grid for the minimum leaf size followed the default MATLAB R2015b heuristic and included 10 values in logarithmic scaled space from two to half the sample size (log(2) to log(N/2)) with duplicates excluded.
Minimum parent size was not optimised to reduce computation time, and also because it was shown previously that optimisation of this parameter is less effective compared to optimisation of the minimum leaf size (Mantovani et al., 2019).
In penalised logistic regression, alpha parameter controls the weight of L1 (lasso) versus L2 (ridge) regularisation. Alpha is a higher-level hyperparameter and was specified to the default value of 0.5, which equally balances ridge and lasso regularisation. The main optimised PLR hyperparameter was the regularisation coefficient (lambda). Search grid for lambda was specified following the heuristic implemented in MATLAB R2015b. The grid consisted of 20 values in a geometric sequence between the largest lambda value which results in a nonnull model ( λ max ), and the value of λ max 1000 .

S1.3.4 Filter Feature Selection
The p-value threshold in the t-test filter was optimised through inner cross-validation.
Search grid consisted of nine p-value thresholds between 0.01 and 0.05 with step of 0.005 and was the following: Upper boundary in the above range was specified as the standard threshold for Classification of Depression with Structural Brain Measures Supplementary 11 statistically significant differences ( p ≤ 0.05 , Bross, 1971). Lower boundary was specified as 0.01 because there were generally only few features which were significantly different between cases and controls at significance level p ≤ 0.01 (uncorrected) across all datasets (Tables S1 -S10). During optimisation, filter threshold value with the highest inner crossvalidation accuracy and lowest filtered number of features was selected for testing in outer cross-validation.
Classification analyses with filter feature selection was not performed for rMDD-STR sample with FA feature subset, because only one FA measure was significant at p ≤ 0.05 (Table S7).

S1.3.5 Sequential Feature Elimination
In sequential feature elimination, inner cross-validation accuracy was used as the optimisation criterion. To enable reasonable computation times, sequential feature elimination was performed with elements of parallelisation as implemented in MATLAB R2015b. Each optimisation was performed on 8 cores of an Intel Xeon based computing cluster node with 2.4 GHz clock speed per core.

S1.3.6 Cross-validation Partitioning
Cross-validation was repeated 10 times with pre-determined random fold partitions in the smaller datasets (rMDD-STR and pMDD-UKB-ICD diagnostic criteria). This was not feasible for the larger datasets due to high computational complexity (cMDD-UKB and pMDD-UKB-CIDI diagnostic criteria). Cross-validation in the larger datasets was therefore performed only once for each classification method with the deterministically predefined fold partitions. Fold partitions for these datasets were defined separately for male and female cases and controls, with a greedy algorithm which aimed to maximally balance the folds with respect to age. The algorithm applied to define fold partitions was following: The folds were defined to be deterministically balanced with respect to age and sex with the above algorithm, and were thus non-random.

S1.3.7 Comparison of Classification Methods
For smaller datasets (rMDD-STR and pMDD-UKB-ICD diagnostic criteria) there were 100 accuracy estimates for each classification approach (10 cross-validation repetitions × 10 folds). The approaches were compared between each other using paired t-test with correction

S2.1 Brain Structure Differences
Correction for false discovery rate was performed separately for measures of cortical thickness, cortical surface areas, cortical or subcortical volumes, FA and MD.
Tables S1-S5 outline corrected and uncorrected significant ( p < 0.05 ) case-control differences in brain morphometric measures in the five analysed samples. Where no differences where found for a sample, the related column in the table is omitted.
Tables S6-S10 outline corrected or uncorrected significant ( p < 0.05 ) case-control differences in white matter integrity measures in the five samples. For white matter integrity, significant differences after FDR correction were only found in the three UK Biobank samples. Effects in light blue in Tables S8-S10 overlap between all three UKB samples, effects in light yellow overlap between cMDD-UKB and pMDD-UKB-CIDI, effects in light green overlap between pMDD-UKB-CIDI and pMDD-UKB-ICD samples.

S2.2 Classification Results
Results of cross-validation with sequential feature elimination with decision tree classifier and combined brain morphometric feature set were not obtained for cMDD-UKB and pMDD-UKB-CIDI samples because each optimisation took longer than five days to run on 8 parallel cores of an Intel Xeon based computing cluster node (at 2.4 GHz clock speed per core). In addition, due to long optimisation times, classification analyses with decision tree classifier, sequential feature elimination and combined brain morphometric feature set were only performed once for rMDD-STR and pMDD-UKB-ICD samples, with predefined balanced fold partitions (section S1.3.6, no repeated cross-validation). This was also the case for decision tree classifier, sequential feature elimination and combined white-matter integrity feature set for pMDD-UKB-ICD sample.
For results of all classification analyses with brain morphometric and white-matter integrity features in cMDD-STR samples please see main text (Tables 4 and 5). Tables S11-S14 outline accuracies and ROC AUC measures for all classification attempts with brain morphometric features in rMDD-STR, cMDD-UKB, pMDD-UKB-CIDI and pMDD-UKB-ICD samples. Tables S15-S18 outline accuracies and ROC AUC measures for all classification attempts with white-matter integrity features in the four samples.
Classification analyses for cMDD-STR dataset with brain morphometric measures were repeated with a replaced set of control participants. The replaced controls were again matched to cases for age and sex (mean age 54.87, mean QIDS score 2.8), however matching for age was slightly worse compared to the original sample ( Table 2 in the main text). Brief description of the main analysis results can be found in the results section of the main text.
Table S19 outlines accuracies and ROC AUC measures for all classification attempts with the replaced set of controls in the cMDD-STR dataset with brain morphometric measures.
We additionally attempted classification with the two sets of control participants combined (original and replaced, twice as many controls compared to cases), with synthetic minority oversampling to compensate for unbalanced class data (SMOTE, Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Despite application of the SMOTE technique, this did not improve the original classification results and resulted in largely unbalanced sensitivities and specificities. Results for these classification attempts can be found in Table S20.  (7)                Table S10

Note:
Top accuracies for SVM, PLR and DT classifiers, and score for the best overall approach are highlighted in light blue. Optimisation with sequential feature elimination for decision tree and combined feature set was not performed due to high computational complexity (section S2.2).

Note:
Top accuracies for SVM, PLR and DT classifiers, and score for the best overall approach are highlighted in light blue. Optimisation with sequential feature elimination for decision tree and combined feature set was not performed due to high computational complexity (section S2.2).