Predictive modeling of treatment resistant depression using data from STAR*D and an independent clinical study

Identification of risk factors of treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work in predictive modeling of treatment resistant depression (TRD) by partitioning the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort, RIS-INT-93, as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model TRD outcome at level 2. For TRD defined using QIDS-C 16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and the RIS-INT-93 independent dataset with an area under the receiver operating characteristic curve (AUC) of 0.70–0.78 and 0.72–0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. A model developed using the top 30 features identified by a feature selection technique (k-means clustering followed by a χ2 test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using the overlapping features between STAR*D and RIS-INT-93 achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in the STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response, i.e., symptom severity at week 2. These results indicate that prediction of TRD prior to undergoing a second round of antidepressant treatment could be feasible even in the absence of biomarker data.


Introduction
Treatment resistant depression (TRD) and relapse/recurrence of major depressive episodes (MDE) are important sources of morbidity and mortality. Early identification of risk factors of resistance using baseline characteristics and initial response may be useful to guide treatment selection, avoid inefficient trial-and-error, alter disease course, and improve major depressive disorder (MDD) care. The terms initial response and early response are used interchangeably in this article and refer to response at week 2.
In the literature, there are multiple TRD definitions [1] and staging models such as the Antidepressant Treatment History Form (ATHF), the Thase and Rush Model, the European Staging Model, the Massachusetts General Hospital Staging Model, the Maudsley Staging Model (MSM), [2] and the Dutch Measure for quantification of TRD (DM-TRD), an extension of MSM. [3] From the health authority's perspective, the September 17, 2009 Committee for Medicinal Products for Human Use (CHMP) Concept Paper on "The Need For Revision of Note for Guidance on Clinical Investigation of Medicinal Products In The Treatment of Depression With Regard To Treatment Resistant Depression" stated that "In a clinical pragmatic view a patient has been considered suffering from TRD when consecutive treatment with two products of different pharmacological classes, used for a sufficient length of time at an adequate dose, fail to induce a clinically meaningful effect (inadequate response)." However, since there is no evidence that switching treatment within one class is less effective than switching to a different pharmacological class, the May 23, 2013 CHMP Guideline on "clinical investigation of medicinal products in the treatment of depression" stated that "For the purpose of this guideline TRD is considered, when treatment with at least two different antidepressant agents (of the same or a different class) prescribed in adequate dosages for adequate duration and adequate affirmation of treatment adherence showed lack of clinically meaningful improvement in the regulatory setting." We adopted the spirit of the 2013 CHMP definition in this study. A failed antidepressant trial could be defined by either failure to respond (less than 50% reduction in depression severity score) or failure to reach remission (such as having a 16-item Quick Inventory of Depressive Symptomatology, clinician-rated (QIDS-C 16 ) score > 5 [4]).
The risk factors for TRD as reviewed by Bennabi et al., 2014 [5] include clinical risk factors such as comorbid anxiety disorder, current suicidal risk, non-response to the first antidepressant received in the patient's lifetime and presence of melancholic features [6]; bipolarity, early onset of first depressive episode, high rate of depressive recurrences, and lack of full remission after a previous episode [7]; low reward dependence and low cooperativeness [8]; high neuroticism, low extraversion, low openness, and low conscientiousness [9]. In addition, a subset of patients with diagnosis of "unipolar" treatment resistant depression may have a bipolar diathesis [10].
Although there are many studies using baseline clinical risk factors/symptom clusters and/or early treatment response to predict longer-term antidepressant treatment outcome within the same treatment regime in clinically ascertained samples [11][12][13][14], studies using baseline clinical characteristics and initial response to an earlier treatment regime to predict response to the next treatment option are limited. Perlis et al. [15] developed a clinical decision tool to predict the risk of TRD based on clinical or socio-demographic variables available by, or readily amenable to, self-report, using a logistic regression model with an area under the receiver operating characteristic curve (AUC) exceeding 0.71 in training, testing, and validation datasets, all derived from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study. In the work by Perlis et al., non-TRD and TRD subjects were defined as individuals reaching remission with a first or second pharmacological treatment trial and those not reaching remission despite two trials, respectively. In our work, we extended the work by Perlis et al. by partitioning the data from the STAR*D cohort into a training and a testing dataset, and included a completely independent cohort, RIS-INT-93, as an external test dataset. In addition, we explored the definition of non-TRD and TRD as individuals responding with a first or second treatment trial and those not responding despite two trials, respectively. In the work by Perlis et al., the performance of three machine learning models, namely, the naïve Bayes classifier, support vector machine (SVM), and random forest, was less consistent. In our work, we further explored other machine learning models, such as XGBoost, l 2 penalized logistic regression, gradient boosted decision tree (GBDT), and the elastic net, and compared their performance in the prediction of TRD.

Cohorts
STAR*D cohort. The multicenter antidepressant effectiveness STAR*D study (ClinicalTrials.gov number NCT00021528) has been described elsewhere. [16,17] Briefly, patients meeting DSM-IV criteria for MDD went through up to four levels of treatment options, for up to 12 weeks at each level. A patient exited the study, with the option to enter a naturalistic follow-up study, upon achieving remission at the end of a level of treatment; otherwise the patient had the option to enter the next level of treatment. All subjects signed written informed consent before participation, with the protocol approved by institutional review boards at participating institutions.
RIS-INT-93 cohort. Data were drawn from a Janssen clinical study, RIS-INT-93 (ClinicalTrials.gov number NCT00044681) [18]. Patients met DSM-IV criteria for MDD, had a history of resistance to therapy with antidepressant medication, and were treated (open-label) prospectively with citalopram for up to 6 weeks. Since the first level of antidepressant failure was retrospective, baseline (week 0) and early response at week 2 from the level 2 prospective treatment were used in TRD modeling. All subjects signed written informed consent before participation, with the protocol approved by institutional review boards at participating institutions.

Design

STAR*D training and testing, and an external test dataset. Data from the STAR*D cohort were divided into the training and the testing datasets. The training dataset was drawn from ~80% of the STAR*D data (regional centers 1–8 and 10–12), while the testing dataset was drawn from the rest (~20%) of the STAR*D data (regional centers 13–15), also termed the hold-out testing dataset. Internal cross-validation was performed using the STAR*D training dataset to tune model parameters. Data from the RIS-INT-93 cohort served as an independent external test dataset.
Target variable for predictive modeling. The predictive modeling objective is to differentiate TRD from non-TRD as early as possible. By the remission criteria, non-TRD and TRD subjects were defined as individuals reaching remission after level one or level two of treatment and those not reaching remission despite two trials, respectively; by the response criteria, as individuals responding after level one or level two of treatment and those not responding despite two trials, respectively. For the STAR*D cohort, remission was defined as a score of ≤ 5 on QIDS-C 16 (primary analysis) or the Quick Inventory of Depressive Symptomatology, Self-Report (QIDS-SR 16 , secondary analysis) at last observation carried forward (LOCF). Response was defined as a reduction of ≥ 50% from the baseline QIDS-C 16 or QIDS-SR 16 score. For both remission and response criteria, patients must have had a QIDS score greater than 5 at week 0 (baseline) of treatment level one and have remained in the study for at least four weeks. Only data from the first two levels of treatment in the STAR*D study were used.
For RIS-INT-93, remission was defined as a score of ≤ 7 on HAM-D 17 , while response was defined as a reduction of ≥ 50% from the baseline HAM-D 17 score after 6 weeks of citalopram treatment. Since all patients were retrospectively reported to have failed one round of antidepressant treatment, all non-remitters/non-responders from the prospective treatment were classified as TRD, while the others were classified as non-TRD. In addition, subjects must have remained in the study for at least six weeks to be included in the analysis.
Input features for predictive modeling. Three sets of features were included in the modeling: the full set of features (n ~ 700) (referred to later as the full set of features), the top n features from a feature selection technique (referred to later as the top n features), and the set of 22 overlapping features between the STAR*D and RIS-INT-93 datasets (referred to later as the overlapping features).
Full set of features: All the features from enrollment, including information from the Cumulative Illness Rating Scale (CIRS), demographics (DM), psychiatric history (PHX), medication history (MHX), the Patient Rated Inventory of Side Effects (PRISE), and the Psychiatric Diagnostic Screening Questionnaire (PDSQ), as well as from baseline and week 2 of level 1 treatment, which include records from the Clinic Visit Form, QIDS-C 16 , and QIDS-SR 16 . Each attribute related to medications was expanded into as many features as the number of medications that appeared in the data, with each feature representing information collected on a specific medication. Additional features were derived based on prior knowledge and included in predictive modeling. For comorbid conditions, any anxiety disorder (including posttraumatic stress disorder, panic disorder with or without agoraphobia, specific phobia, social phobia, and generalized anxiety disorder) was derived. In addition, studies suggest that the unidimensional subscales of the multidimensional HAM-D 17 , [19,20] which capture the core depressive symptoms, outperform the full HAM-D 17 . We included five unidimensional subscales as discussed by Boessen et al., [21] including the Bech melancholia scale, [22][23][24] the Maier-Phillipp severity subscale, [25][26][27] the Santen subscale, [28] the Gibbons' global depression severity scale, [29] and the 7-item abbreviated version (HAM-D 7 ). [30] The five factors derived from HAM-D 17 (retardation (the sum of the scores for items 1, 7, 8, and 14, with a score ranging from 0 to 14), anxiety/somatization (the sum of the scores for items 10, 11, 12, 13, 15, and 17, with a score ranging from 0 to 18), sleep disturbance (the sum of the scores for items 4, 5, and 6, with a total score ranging from 0 to 6), depression, and guilt association) were also included as candidate features and tested to see whether any of them was a better predictor than the HAM-D 17 item-level data or the full HAM-D 17 score.
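As a concrete illustration, the three fully specified HAM-D 17 factors above are simple sums of item scores. The sketch below assumes item scores are stored as a mapping from item number to score; the subject's scores are hypothetical.

```python
# Derive HAM-D17 factor scores as sums of item scores, following the
# item groupings described in the text.
RETARDATION_ITEMS = [1, 7, 8, 14]                       # total range 0-14
ANXIETY_SOMATIZATION_ITEMS = [10, 11, 12, 13, 15, 17]   # total range 0-18
SLEEP_DISTURBANCE_ITEMS = [4, 5, 6]                     # total range 0-6

def factor_score(item_scores, items):
    """Sum the scores of the listed HAM-D17 items for one subject."""
    return sum(item_scores[i] for i in items)

# Hypothetical item scores for a single subject (item number -> score).
subject = {i: 1 for i in range(1, 18)}  # all 17 items scored 1 for illustration

print(factor_score(subject, RETARDATION_ITEMS))           # 4
print(factor_score(subject, ANXIETY_SOMATIZATION_ITEMS))  # 6
print(factor_score(subject, SLEEP_DISTURBANCE_ITEMS))     # 3
```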
Because both IDS-C 30 and HAM-D 17 were measured only at level entry and exit, while QIDS-C 16 and QIDS-SR 16 were measured every two weeks, we derived an exploratory variable, IDS-C 5 , that was definable from QIDS-C 16 and overlapped with five out of the six items defining IDS-C 6 /HAM-D 6 [24,31], the Bech melancholia scale. Of the six items (mood sad, involvement, fatigue, anxious mood, outlook (self), and psychomotor slowing) defining IDS-C 6 , all except anxious mood were definable from QIDS-C 16 or QIDS-SR 16 . We included the percentage change of IDS-C 5 and QIDS-C 16 at week 2, as well as QIDS-C 16 at week 2, in the predictive model. For each dataset, we eliminated all the features that had a variance less than 10^-8 across the selected set of samples. After this preprocessing, we were left with around 700 starting features for each dataset.
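The low-variance filtering step described above can be sketched with scikit-learn's VarianceThreshold; the 10^-8 threshold is from the text, while the toy feature matrix is illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Illustrative feature matrix: 4 subjects x 3 features; the middle
# column is constant and should be removed.
X = np.array([
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.8],
    [3.0, 5.0, 0.5],
    [4.0, 5.0, 0.9],
])

# Drop features whose variance across subjects falls below 1e-8.
selector = VarianceThreshold(threshold=1e-8)
X_kept = selector.fit_transform(X)

print(X_kept.shape)            # (4, 2)
print(selector.get_support())  # [ True False  True]
```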
Top n features: None of the machine learning approaches applied feature selection prior to model fitting. Many of the features were correlated with each other (for example, QIDS-C 16 was correlated with QIDS-SR 16 and HAM-D 17 ), and most of the machine learning approaches considered in this work do not handle correlated features well. To mitigate this, and to facilitate model interpretation and application, we applied two strategies to select limited sets of features for the machine learning models. The primary feature selection strategy used an approach similar to that proposed by Bühlmann et al. [32]. It is a two-step process, as follows: Step 1: k-means clustering was first applied to all features to group them into k groups (k = 50, 75, and 100 in our experiment). The feature closest to the centroid of each cluster was chosen as the representative feature for the cluster.
Step 2: A χ2 score as proposed by Liu et al. [33] was then calculated as a measure of feature importance to select the top n features for model creation. Its implementation is available at http://featureselection.asu.edu/software.php. The method discretizes each numeric attribute based on χ2 statistics, continuously merging the intervals of each attribute while keeping the inconsistency rate under a certain threshold. We adopted this method due to the ordinal nature of the majority of the features, such as depression symptom severity scores as measured by QIDS-C 16 and HAM-D 17 and those measuring how depression affects patients' daily life, as exemplified by items from the Work and Social Adjustment Scale (WSAS), which quantifies the extent to which depression impairs activities related to social and work life. The χ2 statistic was calculated using the STAR*D training dataset or 90% of the training dataset (during the 10-fold cross-validation procedure within the training dataset for optimization of model parameters). This procedure was applied for each target outcome and for each feature clustering configuration (k = 50, 75, 100). This feature selection strategy is called clustering-χ2 throughout the article.
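A minimal sketch of the two-step clustering-χ2 procedure follows. For brevity it uses scikit-learn's χ2 scoring in place of the Chi2 discretization method of Liu et al., and the data, k, and n are illustrative rather than the values used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import chi2
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)

# Illustrative data: 200 subjects x 40 non-negative ordinal features.
X = rng.integers(0, 4, size=(200, 40)).astype(float)
y = rng.integers(0, 2, size=200)  # TRD (1) vs non-TRD (0)

# Step 1: cluster the *features* (columns) into k groups and keep the
# feature closest to each cluster centroid as the representative.
k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X.T)
rep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X.T)

# Step 2: rank the representative features by a chi-squared score
# against the outcome and keep the top n.
scores, _ = chi2(X[:, rep_idx], y)
n = 5
top_n = rep_idx[np.argsort(scores)[::-1][:n]]
print(sorted(top_n))
```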
The second feature selection strategy used the elastic net technique [34,35]; it corresponds to the n features with non-zero beta coefficients, with the number of non-zero coefficients optimized to ~30. This feature selection strategy is termed the ELNET features throughout the article. As optimizing the elastic net to approximately 30 features could result in sub-optimal performance, we also selected the top n features optimized according to the model's performance (area under the receiver operating characteristic curve (AUC)) via 10-fold internal cross-validation in the training dataset, and the results were comparable. If not explicitly stated, the ELNET features refer to the ~30 features selected by the elastic net.
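Tuning an elastic net toward ~30 non-zero coefficients can be sketched as below. The synthetic data and the grid of regularization strengths are assumptions for illustration, not the study's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 100))  # illustrative: 300 subjects x 100 features
y = (X[:, :5].sum(axis=1) + rng.normal(size=300) > 0).astype(int)

Xs = StandardScaler().fit_transform(X)

def n_nonzero(C):
    """Fit an elastic-net logistic regression and count non-zero coefficients."""
    model = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, C=C, max_iter=5000)
    model.fit(Xs, y)
    return int(np.count_nonzero(model.coef_))

# Walk a grid of regularization strengths and keep the C whose model
# retains a feature count closest to the ~30 target in the text.
grid = [0.01, 0.03, 0.1, 0.3, 1.0]
best_C = min(grid, key=lambda C: abs(n_nonzero(C) - 30))
print(best_C, n_nonzero(best_C))
```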
Overlapping features: There were 22 overlapping (non-exhaustive) features between STAR*D and RIS-INT-93 (the week 0 scores of the 17 HAM-D items, the week 0 HAM-D 17 total score, the HAM-D retardation subtotal score, the percentage change of the HAM-D 17 total score at week 2 from baseline, any anxiety disorder, and recurrent depression), which represented the third set of features for the prediction of TRD.

Predictive modeling
Machine learning methods. Five different machine learning approaches, namely l 2 penalized logistic regression, [36,37] random forest, [38] GBDT, XGBoost, and the elastic net, were applied in this study. The implementations of l 2 penalized logistic regression, random forest, GBDT, and the elastic net available from scikit-learn [39], and of XGBoost from https://xgboost.readthedocs.org/en/latest/, were used in this study.
Handling class imbalance. In our prediction tasks, subjects from different categories (remission vs. non-remission; response vs. non-response, as determined by either QIDS-C 16 or QIDS-SR 16 ) were imbalanced to different degrees. A direct application of the machine learning models introduced above would bias the obtained model towards the majority class. The class imbalance handling technique used in Nie et al., 2015 [40] was therefore applied to the training data in the STAR*D cohort. Basically, during the training stage, we used all the samples from the minority class in the training dataset and subsampled, with replacement, an equal number of samples from the majority class to build one model. We repeated the process t times and thus obtained t models (t = 30). In obtaining predictions for the test dataset, an ensemble prediction corresponding to the t models was obtained by averaging the t model predictions (probabilities), and the performance was assessed against the target (Fig 1).
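The balanced-subsample ensemble described above can be sketched as follows. The data are synthetic (mirroring an imbalanced class ratio), GBDT stands in for any of the base learners, and t is reduced from the study's 30 for speed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def balanced_bagging_predict(X_train, y_train, X_test, t=30, seed=0):
    """Train t models on balanced subsamples and average their
    predicted probabilities (the ensemble scheme described in the text)."""
    rng = np.random.default_rng(seed)
    minority = 1 if (y_train == 1).sum() <= (y_train == 0).sum() else 0
    idx_min = np.flatnonzero(y_train == minority)
    idx_maj = np.flatnonzero(y_train != minority)

    probs = np.zeros(len(X_test))
    for _ in range(t):
        # All minority samples plus an equal-sized subsample (with
        # replacement) of the majority class.
        sub_maj = rng.choice(idx_maj, size=len(idx_min), replace=True)
        idx = np.concatenate([idx_min, sub_maj])
        model = GradientBoostingClassifier(random_state=0)
        model.fit(X_train[idx], y_train[idx])
        probs += model.predict_proba(X_test)[:, 1]
    return probs / t

# Illustrative imbalanced data (~3:1, roughly STAR*D's class ratio).
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 10))
y = (rng.random(400) < 0.25).astype(int)
p = balanced_bagging_predict(X[:300], y[:300], X[300:], t=5)
print(p.shape)  # (100,)
```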

Model characterization.
Internal cross-validation (10-fold) performance metrics were reported for the training dataset using the full set of features and the top n features (n = 2, 3, . . ., k, where k = 50, 75, and 100 clusters). The model derived from the STAR*D training dataset was externally tested in the hold-out STAR*D testing dataset. For the overlapping features, the models trained using the STAR*D training dataset were both internally cross-validated in the training dataset and externally validated in the hold-out STAR*D testing dataset and the independent RIS-INT-93 dataset using the machine learning approaches. Relevant measures of model performance, including AUC, sensitivity, specificity, accuracy, balanced accuracy, positive predictive value (PPV), and negative predictive value (NPV), were determined.
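The threshold-dependent metrics listed above can all be computed from a confusion matrix; a minimal helper (with toy scores for illustration) might look like this:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def performance_summary(y_true, y_score, threshold=0.5):
    """AUC plus the threshold-dependent metrics reported in the text."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "auc": roc_auc_score(y_true, y_score),
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (sens + spec) / 2,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy labels and predicted scores.
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]
summary = performance_summary(y_true, y_score)
print(round(summary["auc"], 2))  # 0.78
```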
Permutation test. To verify that the observed model performance was not obtained by chance, the target label was randomly permuted 1,000 times in the STAR*D training dataset. The model training on the STAR*D training dataset with permuted target labels and the prediction process in the STAR*D testing dataset were repeated 1,000 times, giving rise to a null distribution of AUC values. The observed AUC value was plotted in reference to the null AUC distribution, and a permutation p-value was calculated. [41]

Interpretation of the relationship between features and outcome via a multiple logistic regression procedure. Classical logistic regression with forward selection was used to facilitate model interpretation, although only linear relationships could be captured by the model. With this approach, only the top n features from the STAR*D training data (n = 30 from clustering-χ2) were included in the model as candidates for forward selection. The features selected were included in a multiple logistic regression model, and features with a p-value less than 0.1 were examined to understand the relationship between features and outcome. Classical logistic regression was fitted using PROC LOGISTIC in SAS 9.2 (SAS Institute Inc., Cary, NC).

Results
Using the remission criteria, there were 501 TRD cases and 1463 non-TRD controls in the STAR*D training dataset and 141 TRD cases and 349 non-TRD controls in the STAR*D testing dataset using the QIDS-C 16 assessment (primary analysis, S1 Table). To reduce the correlation between features, we used k-means clustering to group features into clusters and selected a candidate feature representing each cluster. To explore the approximate number of top features that should be included in the predictive model without compromising the model performance, the top n features (n = 2, 3, . . ., 75) from k-means clustering (k = 75, as results from k = 50 and 100 were comparable) were used to train and test the model (as shown in Fig 2); the model's performance plateaued at ~10–30 predictors. The variance of the AUC obtained from internal cross-validation in the STAR*D training dataset for each fold was averaged across the 30 models corresponding to the 30 subsamples. The variance for the top n (n = 30) features was 3.9%, 3.3%, and 3.3% for random forest, l 2 penalized logistic regression, and GBDT, respectively. The results for the full set of features and the top n features (n = 30) were reported for subsequent analyses. We also used the ELNET features generated by the elastic net (optimizing for a number of features of ~30) and compared the results to the earlier top n features. In addition, the results for the overlapping features were also reported.
For the TRD phenotype defined by remission criteria using the QIDS-C 16 assessment, the AUC for the 10-fold internal cross-validation in the training dataset ranged from 0.73 to 0.79 with the full set of features, with GBDT performing the best (S1A Fig). The out-of-bag (OOB) score for random forest was 0.71. The model derived from the training data was externally validated in the STAR*D testing dataset with an AUC ranging from 0.70 to 0.78 across the different machine learning approaches, as shown in Fig 3 and S1D Fig. The permutation testing for all the machine learning approaches is shown in Fig 4. Model AUC, accuracy, specificity, sensitivity, PPV, and NPV for selected models were also reported in Table 1, S2 Table, and S3 Table, using the predicted target (ranging from -1 to 1) with 0 as the classification threshold. Similar results were observed for the target variable using responder criteria defined using the QIDS-C 16 assessment (S2 Fig) and for the target variable using remission and responder criteria defined using the QIDS-SR 16 assessment (S3 Fig and S4 Fig). Overall, the models achieved an AUC ≥ 0.7 in at least one machine learning approach across the STAR*D testing dataset and the RIS-INT-93 dataset, independent of whether the phenotype was defined by remission or response criteria or using the QIDS-C 16 or QIDS-SR 16 assessment, although performance for TRD defined using remission criteria was generally better than that using response criteria, and the full set of features and the top n features outperformed the overlapping features, as expected. For simplicity of reporting, we report models with TRD defined using remission criteria and the QIDS-C 16 assessment moving forward, if not explicitly stated otherwise. Results from alternative models can be found in the supplemental content.
The results for the ELNET features (elastic net optimized for a number of features of 30 for ease of comparison with the top n features selected by clustering-χ2) were comparable and are described in S5 Table. The top groups of features identified as important based on the clustering-χ2 score and most consistent between outcome definitions (remission and response) and symptom severity measurements (QIDS-C 16 and QIDS-SR 16 ) were represented by: QIDS total score at week 2 (self-reported and clinician-rated QIDS total scores were strongly correlated and grouped together with the percentage change in symptom severity score at week 2 from week 0 into the same cluster by the k-means clustering procedure; see S6 Table for the top 20 groups of features clustered using k-means clustering), Quality of Life Enjoyment and Satisfaction Questionnaire (QLESQ) total score, Work and Social Adjustment Scale (WSAS) total score, and Short Form Health Survey (SFHS, SF-12) physical component. The top n (n = 10) representative features from clustering-χ2 are also shown in Fig 5 and S5 Fig. They represent feature importance in a univariate sense. The ELNET features (S7 Table) were largely consistent with the univariate features of importance, except that multiple correlated features were selected, as expected.
To facilitate the interpretation of the relationship between features and target, and to consider multiple features simultaneously (in contrast to the χ2 statistic), both a classical logistic regression model with forward selection and the elastic net were evaluated. For the classical logistic regression, starting from the top 30 features from the STAR*D training data, 8 variables were selected in the forward selection procedure and 6 made it into the final logistic regression model (Table 2, S7 Table). In the STAR*D training dataset, a more severe QIDS-C 16 symptom severity total score at week 2 [or a higher percentage change in QIDS-C 16 at week 2 from baseline (i.e., less improvement), since both variables were clustered together in the k-means clustering procedure (S6 Table), although the QIDS-C 16 symptom severity total score at week 0 would add additional explanatory value if the percentage change at week 2 from week 0 were included in the model], anxiety/chronicity (PDSQ questions), a lower Physical Health Composite Score (PCS), a nervous system symptom (PRISE question), and the WSAS total score were associated with increased odds of being TRD (Table 2). For PCS (ranging from 0 to 100), the odds ratio is less than 1, since a zero score indicates the lowest level of health measured by the scale and 100 indicates the highest. Similar results for TRD defined using responder criteria are reported in S8 Table. When testing these six variables in the STAR*D testing dataset, all variables predicted outcome in the same direction as in the STAR*D training data, and both the QIDS-C 16 symptom severity total score at week 2 and the Physical Health Composite Score (PCS) reached statistical significance (p-value < 0.05). For the elastic net, the sign of the β coefficients (S8 Table) was consistent with the directionality of the odds ratios, and most variables with non-zero β coefficients were consistent, except that correlated variables were also grouped in the model.
Lastly, we considered an intentionally simplified model using early response (defined by having less than 20% improvement at week 2) to predict TRD outcome. In the STAR*D training dataset, the model accuracy for predicting TRD was 0.62 (sensitivity = 0.52, specificity = 0.77, PPV = 0.75, NPV = 0.54). Model performances in the STAR*D testing dataset and the RIS-INT-93 dataset were comparable and are shown in S9 Table.
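This simple early-response rule amounts to a one-line classifier over the week 0 and week 2 severity scores; the example scores below are hypothetical.

```python
def predict_trd_early(qids_week0, qids_week2, threshold=0.20):
    """Flag TRD when week-2 improvement from baseline is below 20%."""
    improvement = (qids_week0 - qids_week2) / qids_week0
    return improvement < threshold

print(predict_trd_early(qids_week0=20, qids_week2=18))  # True  (10% improvement)
print(predict_trd_early(qids_week0=20, qids_week2=10))  # False (50% improvement)
```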

Discussion
We developed a series of machine learning models to predict TRD after two trials of treatment regimens using clinical and socio-demographic data. The models were trained using three sets of features, namely, the full set of features, the top n features, and the overlapping features. Using all available STAR*D features from enrollment onward would be required to deploy the full model in a real-world setting. A reduced set of top n (n = 30) features from the clustering-χ2 procedure is more practical if we were to design a new study and collect features to predict TRD. In order to use data from existing studies as an independent testing dataset, we could only work within the limitation of the overlapping features.
The overall accuracy of the GBDT model using the top 30 features (k = 75) was 0.70 for the STAR*D testing dataset (predicted outcome ranging from -1 to 1, using the mid-point 0 as the threshold to infer TRD vs. non-TRD status). The accuracy in predicting TRD and non-TRD was comparable (0.72 and 0.68, respectively). Assuming there may be a population of ambiguous subjects who are difficult to predict based on clinical and socio-demographic features alone, we focused only on the extreme 10th and 20th percentiles of the predicted outcome distribution (of being TRD or being non-TRD). The accuracy for the extreme 10th and 20th percentiles for inferring non-TRD was 0.98 (1 error out of 49 subjects) and 0.95 (5 errors out of 98 subjects), while the accuracy for the extreme 10th and 20th percentiles for inferring TRD was 0.69 (15 errors out of 49 subjects) and 0.60 (39 errors out of 98 subjects), respectively, suggesting that it is easier to predict non-TRD than TRD in the STAR*D testing dataset, especially among those with extreme predicted outcomes. The default choice of the mid-point of the predicted score to classify subjects is arbitrary and not necessarily the optimal threshold. We prefer to report AUC, as it takes different threshold cut points into consideration. For a predictive model to be useful in predicting individual outcomes, however, a pre-specified threshold shall be used. We therefore also explored specific cut points, such that a predicted score less than -0.6 predicts TRD, while a predicted score greater than 0.6 predicts non-TRD. Using these thresholds, the accuracy for non-TRD and TRD prediction was 0.90 and 0.53, respectively, in the STAR*D testing dataset. For the model using the overlapping features, the overall accuracy of the GBDT model was 0.82 for the RIS-INT-93 dataset. The accuracy in predicting TRD and non-TRD was 0.88 and 0.40, respectively.
Using the same -0.6 and 0.6 thresholds for predicting TRD and non-TRD, the accuracy for TRD and non-TRD prediction was 0.93 (11 errors out of 154 predictions) and 0.40 (15 errors out of 25 predictions), respectively, in the RIS-INT-93 dataset. It is unlikely that the model performance and these observations merely reflect the group make-up (STAR*D samples had approximately 2.5 times more non-TRD subjects than TRD subjects, while the RIS-INT-93 study had more TRD than non-TRD subjects), as the actual model outperformed the null models in all cases of the permutation testing. It is somewhat surprising that the performance of the models in the independent RIS-INT-93 dataset was generally superior to that in the STAR*D testing dataset. In the STAR*D training dataset, the symptom severity score at week 2 was selected over the symptom severity score at week 0 and the percentage change in symptom score, suggesting that the closer the symptom measurement is to the future outcome, the more predictive the feature. In RIS-INT-93, level 1 failure was retrospectively reported and therefore only level 2 measurements were available. It is therefore possible that the RIS-INT-93 prediction problem was an easier test case compared to the STAR*D study design. Furthermore, the outcome