Personalized prediction of repetitive transcranial magnetic stimulation clinical response in medication-refractory depression data

This article describes a dataset that was generated as part of the article: Personalized prediction of transcranial magnetic stimulation clinical response in patients with treatment-refractory depression using neuroimaging biomarkers and machine learning (DOI: 10.1016/j.jad.2021.04.081). We collected resting-state functional Magnetic Resonance Imaging data from 70 medication-refractory depressed subjects before undergoing four weeks of repetitive transcranial magnetic stimulation targeting the left dorsolateral prefrontal cortex. The data presented here include information about the seed-based analyses such as regions of interest, individual/group functional connectivity maps and contrast maps. The contrast maps are controlled for age, gender, duration of the current depressive episode, duration since the first depressive episode, and symptom scores. Demographics, clinical characteristics, and categorical treatment response variables are reported as well. Further, the individual connectivity values of the identified neuroimaging biomarkers of long-term clinical response were used as features in the support vector machine models are presented in combination with the trained classifiers of the support vector machine models. Post hoc analyses that were not published in the original analyses are presented as well. Finally, the R or MATLAB code scripts for all figures published in the co-submitted paper are included.

characteristics, and categorical treatment response variables are reported as well. Further, the individual connectivity values of the identified neuroimaging biomarkers of long-term clinical response were used as features in the support vector machine models are presented in combination with the trained classifiers of the support vector machine models. Post hoc analyses that were not published in the original analyses are presented as well. Finally, the R or MATLAB code scripts for all figures published in the co-submitted paper are included.

Value of the Data
• The data provide information about biomarkers of left dorsolateral prefrontal cortex rTMS clinical response in medication-refractory depressed subjects and the impact of two restingstate fMRI denoising strategies, including component based noise correction and global signal regression. • Researchers may benefit from this data by using the biomarkers and trained support vector machine classifiers as the foundation for a prospective clinical trial to examine the validity of these biomarkers. • Researchers can benefit from this data for meta-analyses that focus on biomarkers of any antidepressant treatments. Table 1 Demographics and clinical characteristics at baseline for all participants (n = 70), short-term (n = 63) and long-term (n = 61) rTMS treatment response and repeated measures ANOVA and binomial logistic regression results.

Methods
The derived data described in this article are shared at Mendeley Data [1] and were used to support the findings presented in the article: "Personalized prediction of transcranial magnetic stimulation clinical response in patients with treatment-refractory depression using neuroimaging biomarkers and machine learning" [2] .

Participants
Participants ( n = 70) were referred by psychiatrists from the specialist outpatient clinics in the public sector funded by the local government of Hong Kong Special Administrative Region. Participants were right-handed, aged 18-57 years, met the criteria for major depressive disorder (MDD) based on the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV; principle axis I diagnosis), moderate or severe episode defined by a scored of ≥ 20 on the Montgomery-Å sberg Depression Rating Scale (MADRS; [3] ) and had failed to respond adequately to at least one full course ( > 6 weeks) of antidepressant medication or were medication intolerant ( Table 1 ). Participants were screened to exclude confounding factors such as significant head trauma, active abuse of alcohol or illegal substances, other DSM-IV axis II disorders, and neurological disorders. Participants were also excluded if they had received other neuromodulation treatments in the preceding year. For safety reasons, participants that reported suicide ideation or recent suicide attempts, or with a contraindication for the use of functional Magnetic Resonance Imaging (fMRI, e.g. pacemakers, metal implants, pregnancy; for details refer to [4] ) and repetitive Transcranial Magnetic Stimulation (rTMS, e.g. history of seizures/epilepsy; for details refer to [5] were also excluded). Finally, participants with psychotic symptoms were excluded, because research showed poor response rates in this group [6] . Each participant provided written informed consent and received a travel cost compensation.

Measures
Several demographic and clinical variables were collected during the pre-treatment measurement, including age, gender, handedness, years of education, duration of the current depressive episode in weeks, total duration since the first depressive episode, the number of depressive episodes, medication, and the level of treatment refractoriness [7] . Clinical assessments were administered by a research psychiatrist. The first assessment was before the pre-treatment brain scan and consisted of the Structured Clinical Interview for DSM-IV to ascertain current and lifetime Axis I and II psychiatric diagnoses [8] . The MADRS [3] , Hamilton Depression Rating Scale [9] , Clinical Global Impression scale [10] , and Global Assessment of Functioning score [11] were also administered. Further, the Chinese version of the Beck Depression Inventory-II [12] was completed by the participants. The baseline symptom score measures were validated before the first rTMS session and reassessed at week 2, 4, 6, 8, and 12 using the MADRS [3] , Hamilton Depression Rating Scale [9] , Clinical Global Impression Scale [10] , and Beck Depression Inventory-II [12] . In this paper, the MADRS is used as the primary outcome measure. This clinician-administered scale has high inter-rater reliability and was designed to be sensitive to antidepressant treatment effects in patients with MDD [13] . The MADRS consists of ten items that are rated on a 0-6 continuum (0 = no abnormality, 6 = severe). A last observation carried forward approach was applied for participants with a missing week 12 outcome variable using the week 6 ( n = 1) or week 8 measurement ( n = 1) instead. The percentage change in MADRS symptom scores at week 4 (MADRS baseline -MADRS wk4 )/MADRS baseline * 100%) and week 12 (MADRS baseline -MADRS wk12 )/MADRS baseline * 100%) were calculated. Response was defined as a minimum reduction of 50% in symptom score measured with the MADRS [3] immediately after the last treatment and two months post-treatment compared to baseline. These two time points will be further referred to as short-term and long-term categorical treatment response. Responders were divided into two groups for additional analyses to examine pre-treatment connectivity associated with relapse. The relapse group refers to participants that only showed short-term categorical treatment response. The sustained response group refers to patients that showed both short-term and long-term categorical treatment response.

Brain scans 2.4.1. Image acquisition
MRI scans were acquired up to two weeks before the start of the rTMS treatment on a 3.0T Philips Achieva Medical Scanner with an eight-channel SENSE head coil (Philips Healthcare, The Netherlands) at the Prince of Wales Hospital, Hong Kong Special Administrative Region in China. The first scan was a high resolution T1-weighted structural scan covering the whole brain acquired with the following parameters: repetition time = 7.54 ms, echo time = 3.53 ms, flip angle = 8 0 , 1.1 × 1.1 × 0.6 mm voxels, number of slices = 285, slice orientation = sagittal, slice thickness = 1.2 mm, Field of View = 250 mm 3 , and matrix size = 240 × 240. This scan was used to register with the resting-state fMRI data, and for segmentation into grey matter, white matter, and cerebrospinal fluid, and normalization to template space. The T1-structural scan was followed by a six-minute resting-state fMRI scan consisting of 170 volumes with the following parameters: repetition time = 2050ms, echo time = 25ms, flip angle = 90 0 , 3.2 mm 3 voxels, slice thickness = 3.2 mm, Field of View = 205 mm ², and matrix size = 6 4 × 6 4. Research has shown that six minutes of resting-state fMRI results in moderate to strong reliability for functional connectivity measures [14,27] . Further, a T2-weighted scan and diffusion-weighted imaging scan were collected, but these scans are beyond the scope of this paper. The total scan duration was 25 min and 20 s.

Image pre-processing
Resting-state fMRI data were pre-processed using the default pipeline of the CONN toolbox v18.b [15] . The pre-processing steps included realignment and unwarping, temporal slice time correction, functional outlier detection (ART-based identification of outlier scans), segmentation, normalization, and smoothing (8 mm Gaussian kernel). Two denoising strategies were separately examined including component based noise correction with and without global signal regression [15,16] . For component based noise correction, the following parameters were regressed out: white matter (10 dimensions), cerebrospinal fluid (5 dimensions), realignment parameters (6 dimensions), scrubbing (61 dimensions), and the effect of pre (1 dimension). For global signal regression, the same procedure was performed, but additionally, a brain mask of the entire brain was added as a region of interest to measure the average brain signal, which was also added as a regressor (1 dimension). Subsequently, a default temporal band-pass filter (.008-09 Hz) and detrending were applied for both pre-processing methods.
The region of interests for the seed-based analyses were defined in MNI standardized space. For the left DLPFC, a 20-mm sphere was drawn around previously determined optimum stimulation Montreal Neurological Institute (MNI coordinates: X = -46, Y = 45, Z = 38 [17,18] . The sphere was masked by a cortical brain mask to exclude voxels outside the brain. The size of this ROI was based on previous research that showed that figure-of-eight shaped coils stimulated neurons in a cortical area of 2-3 cm 2 and to a depth of approximately 2 cm [19] . For sgACC, the same method described by Fox and colleagues [17] was used, i.e. a 10-mm sphere was drawn around the MNI coordinates 6, 16, -10, and masked by a cortical brain mask to exclude subcortical voxels. The resting motor threshold was defined as the minimum TMS intensity that elicited a motorevoked potential of ≥ 50 μV peak to peak in the contralateral abductor pollicis brevis in 5 out of 10 trials. The motor threshold was measured before the first treatment and after 10 sessions. The stimulation output was 120% of the motor threshold. The stimulation output was adjusted to 100% for three participants that could not tolerate 120%. The post hoc analysis showed that the inclusion or exclusion of these participants did not influence the results.

Statistical analyses 2.6.1. Demographics and clinical characteristics
Repeated measures analysis of variance (ANOVA) and binomial logistic regression were performed to examine differences in demographics and clinical characteristics. For the continuous variables, repeated measures ANOVA were performed with two main terms including Time (2 levels: short-term, long-term) and Response (2 levels: responder, nonresponder) and one interaction term (Time x Response). For the categorical variables, binomial logistic regression analyses were performed with Response (0 or 1) as outcome variable. Time and each categorical variable were added as predictors and dummy coded. For time (2 levels: short-term, longterm), short-term was the reference group. For gender (2 levels: male, female), the male group was the reference group. For medication (3 levels: none, antidepressants, antidepressant + psychotropic), the none group was the reference group, antidepressants was dummy 1 and antidepressant + psychotropic was dummy 2. For treatment refractoriness (3 levels: 1, 2, 3 + 4), level 1 was the reference group, level 2 was dummy 1 and level 3 + 4 was dummy 2. Model comparison was performed using Akaike Information Criteria (smaller is better criterion) and Chi-square test ( Table 1 ).

Seed-based analyses
Subject-level bivariate Pearson's correlations between the mean time series within each seed (DLPFC/sgACC) and the blood-oxygen-level-dependent time series of each voxel in the brain were extracted and converted to normally distributed Fisher transformed z-scores to conform to the assumptions of generalized linear models using the CONN toolbox [15] with short-term and long-term treatment response as the outcome variable, respectively. Analyses were controlled for age, gender, MADRS symptom score at baseline, the duration of the current depressive episode, and total duration since the first depressive episode. The resulting individual seed maps for each region of interest (left DLPFC and sgACC) were used for the second-level analyses comparing responders versus nonresponders at both time points ( Figs. 1 A-E and 5 A-C). Further, seed-based analyses were performed to examine the pre-treatment connectivity differences between patients that showed sustained treatment response and patients that relapsed after the end of the treatment ( Fig. 1 F and G). The whole-brain results for all seed-based analyses were thresholded twice [20] ; voxel-level threshold p < .005, cluster threshold p-FDR < .05, two-sided. The effect-sizes correspond to the beta Table 2 Significant clusters of the seed-based analyses with left DLPFC and SgACC as Seeds. The contrasts were long-term rTMS responders (n = 33) versus nonresponders (n = 28) and sustained response (n = 24) versus relapse (n = 15) (Voxel Threshold p < .005, cluster threshold p-FDR < .05, Two-tailed).  ( Tables 2 and 4 ).

Supervised machine learning
Machine learning was applied to examine whether combining the identified biomarkers could increase the accuracy of categorical rTMS treatment response prediction. Our sample was split into a training/validation dataset (70%) and a test dataset (30%). We used MATLAB's Machine Learning toolbox (The Mathworks, Natick, MA) to search for the best classification model type, including decision trees, discriminant analysis, support vector machines, logistic regression, nearest neighbors, naive Bayes, and ensemble classification. Hyperparameters optimization was au-  (8) .31 tomated by the toolbox. Subject-level Fisher transformed z-scores identified by the seed-based analyses above were entered as features, and long-term categorical treatment response was entered as a binary outcome (responders/nonresponders). The average accuracy scores and prediction speed from the 5-fold cross-validation procedures were used to determine the best classification model type. This classification model type was used to train classifiers for all feature combinations in the training/CV dataset. The trained classifiers subsequently were used to examine performance in the independent test dataset. A large decrease in performance in the test dataset compared to the training/CV dataset suggests overfitting [21] . The SVM classifiers returned the validationPredictions and validationScores for the training data, which were used to create confusion matrices and perform ROC curve analyses, respectively. The ROC curve analyses were performed with MATLAB's perfcurve function. This function returned the optimal operating point of the ROC curve as an array of size 1-by-2 with False Positive Rate and True Positive Rate values for the optimal ROC operating point. The optimal operating point was obtained by finding the slope, S, using S = (cost(P|N)-cost(N|N))/(cost(N|P)-cost(P|P)) * N/P where cost(I|J) is the cost of assigning an observation of class J to class I, and P = True Positive + False Negative and N = True Negative + False Positive are the total observation counts in the positive and negative class, respectively. This function subsequently identified the optimal operating point by moving the straight line with slope S from the upper left corner of the ROC plot (False Positive Rate = 0, True Positive Rate = 1) down and to the right until intersecting the ROC curve ( Fig. 2 A). According to the literature, AUC is an effective summarize measure of the diagnostic ability of a test. An AUC between 0.7 and 0.8 is considered acceptable, between 0.8 and 0.9 is considered excellent, and higher than 0.9 is considered outstanding [22] . Confusion matrices were used to calculate classification metrics for the optimal operating points of each model ( Fig. 2 B), including sensitivity, specificity, positive predictive value, negative predictive value, and accuracy ( Fig. 2 C and Table 4 Significant clusters of the seed-based analyses with left DLPFC and SgACC as seeds after global signal regression. the contrasts were long-term rTMS responders (n = 33) versus nonresponders (n = 28) and sustained response (n = 24) versus relapse (n = 15) (voxel threshold p < .005, cluster threshold p-FDR < .05, two-tailed). . Sensitivity is the proportion of responders that is correctly classified, while specificity is the proportion of nonresponders that is correctly classified [23,24] . The positive and negative predictive values reflect the probability that participants with a positive or negative test truly become responders or nonresponders, respectively [18,19] . The 95% binominal confidence intervals (CI) were calculated to determine statistical significance [25] .

Post hoc analyses
Based on the reviewers' comments, two post hoc analyses were performed which were not included in the original article.

Correlational analyses anticorrelation
Correlational analyses were performed to directly examine the predictive value of the most anticorrelated areas. Functional connectivity maps were masked with the automated anatomical labeling atlas (AAL) regions of interest, including left middle frontal gyrus and subcallosal cortex ( Fig. 3 A-D). Then, the voxel coordinates of the lowest connectivity value were extracted within each masked area. The fslmaths function (FMRIB Software Library v6.0.2) was used to draw a 5mm kernel around the most anticorrelated coordinates resulting in two new regions of interest; DLPFC (A) and sgACC (A). Connectivity values between the DLPFC and sgACC seeds and new anticorrelated regions of interest were extracted and correlational analyses were performed to examine its relationship with rTMS treatment response ( Fig. 3 E and F).

Seed-based analyses training data
In the original article [2] , validity of the results were questioned because the selected features for machine learning were based on the functional connectivity group differences of the full sample. It was suggested by one of the reviewers to examine whether the seed-based analyses results would still hold by using the training dataset only. Therefore, we performed additional second-level analysis comparing long-term responders and nonresponders using the training dataset. Analyses were controlled for age, gender, MADRS symptom score at baseline, the duration of the current depressive episode, and total duration since the first depressive episode. As the DLPFC seed analysis did not reveal any significant clusters, it was examined whether this was the result of the decreased power due to a smaller sample size by using a more liberal cluster threshold (cluster-size p-uncorrected < .05 instead of p-FDR < 0.05; Fig. 4 , Table 3 ).

Declaration of Competing Interest
S. Neggers holds a minority share in Brain Science Tools BV, a company manufacturing stereotactic navigation technology for TMS. This did not influence the design, analysis, or reporting of the current manuscript in any way. No potential conflict of interest was reported by the other authors.