Data mining algorithm predicts a range of adverse outcomes in major depression

Background: Course of illness in major depression (MD) is highly varied, which might lead to both under- and overtreatment if clinicians adhere to a 'one-size-fits-all' approach. Novel opportunities in data mining could lead to prediction models that can assist clinicians in treatment decisions tailored to the individual patient. This study assesses the performance of a previously developed data mining algorithm to predict future episodes of MD based on clinical information in new data. Methods: We applied a prediction model utilizing baseline clinical characteristics in subjects who reported lifetime MD to two independent test samples (total n = 4226). We assessed the model's performance to predict future episodes of MD, anxiety disorders, and disability during follow-up (1–9 years after baseline). In addition, we compared its prediction performance with well-known risk factors for a severe course of illness. Results: Our model consistently predicted future episodes of MD in both test samples (AUC 0.68–0.73, modest prediction). Equally accurately, it predicted episodes of generalized anxiety disorder, panic disorder and disability (AUC 0.65–0.78). Our model predicted these outcomes more accurately than risk factors for a severe course of illness such as family history of MD and lifetime traumas. Limitations: Prediction accuracy might be different for specific subgroups, such as hospitalized patients or patients with a different cultural background. Conclusions: Our prediction model consistently predicted a range of adverse outcomes in MD across two independent test samples derived from studies in different subpopulations and countries, using different measurement procedures. This replication study holds promise for application in clinical practice.


Introduction
The course of major depression (MD) can be highly varied (Eaton et al., 2008), which may lead to either over- or undertreatment in clinical practice if a generic treatment regimen is adopted. Data mining techniques offer opportunities to develop prediction algorithms for clinically relevant outcomes such as course of illness (Hastie et al., 2009). Data mining uses pattern recognition techniques to extract important patterns and trends from data, for example with the aim to predict outcomes. If sufficiently accurate, the resulting prediction models could assist clinicians in identifying patients with a distinct course of illness, and thus support more specific treatment allocation (Darcy et al., 2016), for instance on decisions whether to continue or discontinue treatment after recovery from MD. In different medical disciplines, these opportunities are now being explored (Jiang et al., 2017), in order to move from a 'one-size-fits-all' approach to treatment assignments that are more tailored to the individual patient.
In a recently published study, we developed a prediction model for recurrence of MD in an attempt to address these previous limitations. We used prospective data, mostly derived from structured clinical interviews, from a sample of 653 participants who reported an episode of MD in the last year. We used a broad range of clinical characteristics assessed at baseline to optimally predict future episodes of MD. The resulting prediction model showed promising prediction performance in the training data.
Before implementation in clinical practice, prediction models need to be evaluated in new data, preferably in multiple samples representing the target population, i.e. patients who recovered from MD (Hastie et al., 2009;Perlis, 2013). It is crucial to determine whether estimates of prediction performance are reliable and replicable, as estimates derived from initial training data might be overly optimistic due to overfitting (i.e., the model capitalizes on idiosyncratic features of the training data) (Hastie et al., 2009). The primary aim of this study is to validate our previously developed multivariate prediction model in new data. Our primary research question is: how accurately does our previously developed model predict future episodes of MD in two independent test samples? In addition, we test how well the model predicts a broader set of course-related outcomes (future episodes of generalized anxiety disorder (GAD), panic disorder, disability). This replication study is a necessary step towards implementation in clinical practice.

Training sample
In data mining, multiple independent samples are commonly used to train and test a prediction model. The 'training sample' is used to train or discover a prediction model describing the relation between the predictors and the outcome. Then, this model is tested using new data, the 'test sample', to obtain reliable estimates of prediction performance (see also Supplemental methods).
We previously developed a prediction model for recurrence of MD based on a large number of clinical characteristics at baseline, using training data from a longitudinal study of male-male and male-female twin pairs from the Virginia Adult Twin Study of Psychiatric and Substance Use Disorders (VATSPSUD). This training sample included 653 male and female twins who reported an episode of MD (DSM-III-R) in the year prior to baseline interview, and who also participated in the follow-up interview, which was carried out at least one year later. All participants reported a period of >60 days of (partial) remission or recovery (Frank, 1991), in order to focus on MD recurrence instead of chronicity. To minimize recall bias, we used data from participants who reported an MD episode in the last year rather than lifetime (Supplemental Methods). This selection was done to increase the quality of reports about the specific symptoms during the episode of MD, the duration, and other severity indices, which we expected to be higher for participants who recently experienced an episode of MD than for participants who had an episode more than one year ago.

Model discovery
In the previous study, we analysed a total of 70 potential risk factors using Cox models with elastic net regularization (R-package glmnet) to predict the outcome recurrence of MD using time-to-event data. Regularized regression methods include a penalty for model complexity. This penalty results in the selection of predictors via the shrinkage of weaker predictor beta-coefficients towards zero. Regularized methods are useful for studies examining large numbers of predictors as they reduce overfitting and yield sparser models (Hastie et al., 2009).
The elastic net penalty controlled the selection and effect sizes of predictors to increase prediction performance and model interpretation (Zou and Hastie, 2005). The final model was selected based on minimal prediction error as assessed in 10-fold cross-validation (Friedman et al., 2010; Simon et al., 2011). This model retained 24 out of the 70 initial predictors and was highly multifactorial, including diverse risk factors such as comorbid anxiety symptoms and disorders, maternal MD, and childhood traumas (Supplemental Table 1). Prediction performance in the training sample was good (AUC ~0.75), but the model was not evaluated in independent test data, because the sample was too small to set aside test data.
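To make the shrinkage mechanism concrete, the following minimal Python sketch shows how an elastic net penalty sets weak coefficients exactly to zero while shrinking stronger ones. It is a simplified illustration for a single standardized predictor in a least-squares setting, not the penalized Cox model fitted with glmnet in the original study; the function names and the example values of `lam` and `alpha` are assumptions chosen for illustration.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_penalty(betas, lam, alpha):
    """Penalty lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(b) for b in betas)
    l2 = sum(b * b for b in betas)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


def shrink(z, lam, alpha):
    """Elastic net estimate for one standardized predictor (least squares).

    z is the unpenalized coefficient; the lasso part (alpha) can set it to
    exactly zero, the ridge part (1 - alpha) shrinks it proportionally.
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Hypothetical unpenalized coefficients (illustrative values only).
unpenalized = [0.60, 0.05, -0.30]
penalized = [shrink(z, lam=0.2, alpha=0.5) for z in unpenalized]
print(penalized)  # the weak middle coefficient is dropped from the model
```

In practice the penalty strength is not fixed in advance but chosen by cross-validation, as described above.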
In this previous study, we also studied sex differences in prediction models for recurrence of MD. Since no prominent sex differences were identified, we selected the model built on training data including both sexes. For detailed information on the discovery phase of the prediction model, including its predictors, and their effect sizes, see Supplemental Table 1 and van .

Test samples
We used two independent test samples from VATSPSUD and the Netherlands Study of Depression and Anxiety (NESDA) to assess the prediction performance and generalizability of the previously derived prediction model. These studies were selected because of their longitudinal designs and high-quality data: data were primarily based on structured interviews administered by trained interviewers, assessed a large set of the risk factors included in the prediction model, and had relatively few missing observations or drop-outs during follow-up. The study designs, samples, and data collection are described in detail in earlier publications (Kendler and Prescott, 2006; Penninx et al., 2008).
The first test sample combined data from the female-female twins (FF, n = 757) and the male-male/male-female twins (MM-MF, n = 1544) from the VATSPSUD study who were not included in the original training sample used to develop the prediction model. Thus, we created an independent test sample including 2301 Caucasian twins who reported a lifetime episode of MD at baseline assessment (American Psychiatric Association, 1987), and who were re-interviewed at follow-up at least 1 year after baseline interview. Previous studies showed that the VATSPSUD sample is broadly characteristic of the Caucasian general population in the USA in terms of demographic features and rates of psychopathology (Kendler and Prescott, 2006).
The second test sample was drawn from the Netherlands Study of Depression and Anxiety (NESDA). NESDA is a longitudinal cohort study including 2981 subjects from the Dutch general population, primary care, and specialized mental health care, aged 18-65 at baseline assessment (2004–2007). From this sample, we included 1925 subjects who reported a lifetime episode of MD (American Psychiatric Association, 2000) at baseline, and who were re-interviewed approximately 2, 4, 6 or 9 years after baseline (waves 3, 4, 5, and 6).
All participants provided written informed consent, and the studies were approved by Institutional Review Boards of VCU and VU University Medical centre (Kendler et al., 2008;Penninx et al., 2008).

Assessment and imputation of predictors
Most predictors retained in the prediction model (Table 1) were assessed at baseline in both test samples. All 24 predictors were present in VATSPSUD; 21 of 24 predictors were available in NESDA (Supplemental Table 1). Some predictors were assessed with different instruments in NESDA. In these cases, items that were most equivalent to the predictors used in VATSPSUD were selected, and if needed, transformed or categorized to increase comparability. All predictors were assessed at baseline, except for four predictors which were assessed during follow-up in part of the participants. These predictors concerned childhood sexual abuse in NESDA and VATSPSUD-FF; and maternal MD, lifetime GAD, and low marital satisfaction in VATSPSUD-MM-MF. As these predictors concerned retrospective reports, or were assessed at baseline in the majority of the sample, we decided not to exclude these predictors in order not to bias the prediction performance downward. Missingness on most predictors was limited: on average 2.8% of the values in VATSPSUD, and 9.8% in NESDA were missing (Supplemental Table 1). Values for missing predictors were multiply imputed in 10 datasets using Multivariate Imputation by Chained Equations (R-package mice, 20 iterations) (van Buuren and Groothuis-Oudshoorn, 2011). All predictors needed in the prediction model were included in these imputations; variables concerning lifetime diagnoses of panic disorder and social phobia were used in NESDA as extra predictors in the imputation to improve imputation results (Carpenter and Kenward, 2013).

Risk score for recurrence of MD
First, we applied the prediction model for recurrence of MD to 10 imputed datasets to create 10 risk scores for each subject. The risk scores were constructed as the sum of the subject's risk factor values multiplied by the corresponding beta weight of that risk factor (as estimated in the VATSPSUD-training sample, Supplemental Table 1), i.e. the linear predictor or prognostic index (Royston and Altman, 2013). We created a single risk score for each subject by averaging their risk scores from each of the 10 imputed datasets.
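The construction of the risk score described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the predictor names and beta weights below are invented and are not the values reported in Supplemental Table 1, and three imputed datasets are used instead of ten to keep the example short. The original analyses were carried out in R.

```python
def linear_predictor(values, betas):
    """Sum of predictor values multiplied by their beta weights."""
    return sum(betas[name] * values[name] for name in betas)


def pooled_risk_score(imputed_rows, betas):
    """Average one subject's linear predictor over the imputed datasets."""
    scores = [linear_predictor(row, betas) for row in imputed_rows]
    return sum(scores) / len(scores)


# Hypothetical beta weights (NOT the estimates from the training sample).
betas = {"lifetime_gad": 0.30, "maternal_md": 0.20, "n_traumas": 0.10}

# One subject whose 'maternal_md' value was missing and was imputed
# differently in each of three imputed datasets.
imputed_rows = [
    {"lifetime_gad": 1, "maternal_md": 0, "n_traumas": 2},
    {"lifetime_gad": 1, "maternal_md": 1, "n_traumas": 2},
    {"lifetime_gad": 1, "maternal_md": 1, "n_traumas": 2},
]

risk_score = pooled_risk_score(imputed_rows, betas)
print(round(risk_score, 3))
```

Averaging the linear predictor over imputed datasets yields a single score per subject, which is what the subsequent performance analyses operate on.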
Because multiple imputation of missing values will often not be feasible in clinical practice, we performed a sensitivity analysis with a risk score where missing observations were replaced by sample means. The sample means of VATSPSUD and NESDA are provided in Supplemental Table 1. In this case, we created one single risk score for each subject by summing all the subject's predictor values (or the sample mean in case the value was missing) multiplied by the predictors' corresponding beta weights. Note that this is a conservative approach to missingness which might bias predictive power downward.
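The mean-substitution variant of the risk score could look as follows. This is a hedged sketch: the predictor names, beta weights, and sample means are hypothetical placeholders (the actual values are in Supplemental Table 1), and a missing value is represented here by `None`.

```python
def risk_score_mean_substitution(values, betas, sample_means):
    """Linear predictor in which missing values are replaced by sample means."""
    total = 0.0
    for name, beta in betas.items():
        x = values.get(name)
        if x is None:  # missing observation: fall back to the sample mean
            x = sample_means[name]
        total += beta * x
    return total


# Hypothetical weights and sample means, for illustration only.
betas = {"lifetime_gad": 0.30, "maternal_md": 0.20, "n_traumas": 0.10}
sample_means = {"lifetime_gad": 0.20, "maternal_md": 0.25, "n_traumas": 1.50}

subject = {"lifetime_gad": 1, "maternal_md": None, "n_traumas": 2}
score = risk_score_mean_substitution(subject, betas, sample_means)
print(round(score, 3))
```

Because every subject with a missing value receives the same substituted value, this approach removes between-subject variation on that predictor, which is why it is expected to be conservative.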

Assessment of prediction performance
We selected several outcomes during follow-up to test the predictive performance of the risk score. The primary outcome was any episode of MD during follow-up, since the prediction model was trained to predict this outcome. Given that patients recovered from MD are not only at risk of future episodes of MD, but also of anxiety disorders and disability (Lamers et al., 2011; Moffitt et al., 2007), the presence of which could also inform decisions on monitoring and treatment (e.g., continuation of antidepressant medication after recovery from MD), we also tested the predictive value of the risk score with secondary outcomes. These concerned GAD and panic disorder (Kessler et al., 2005) and severe disability as assessed by the World Health Organization Disability Assessment Schedule (WHODAS-II). All outcomes were dichotomous (Supplemental Table 2; note that time-to-event data were not available).
All statistical analyses were performed in R (R Core Team, 2017; Wickham, 2009). Logistic regression models were used to estimate the association between the risk score at baseline and the outcomes during follow-up (R-packages stats, rcompanion) (Mangiafico, 2017; R Core Team, 2017). Model discrimination was assessed using areas under the receiver operating characteristic curve (AUC, R-package Epi) (Carstensen et al., 2017; Royston and Altman, 2013). The AUC is a measure of model discrimination or "separation": do patients predicted to be at higher risk exhibit higher event rates than those predicted to be at lower risk? (Royston and Altman, 2013). We derived two values from the AUC to facilitate interpretation of the effect size. The success rate difference (SRD = 2 × AUC − 1) is equal to Somers' D or Kendall's tau and thus interpretable as a correlation coefficient. The number needed to take (NNT = 1/SRD) represents the number that one would need to test to have one more 'success' (i.e. adverse outcome) in the higher risk group than in the lower risk group (Kraemer, 2014).
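The relation between the AUC and the two derived effect sizes can be illustrated with a short Python sketch. It computes the AUC as the proportion of case/non-case pairs in which the case has the higher risk score (ties count as one half), and then derives SRD and NNT as defined above. The toy scores and outcomes are invented for illustration; the paper's own AUCs were obtained with the R-package Epi.

```python
def auc(scores, outcomes):
    """AUC as the probability that a random case outscores a random non-case."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def success_rate_difference(auc_value):
    """SRD = 2 * AUC - 1."""
    return 2.0 * auc_value - 1.0


def number_needed_to_take(auc_value):
    """NNT = 1 / SRD; undefined when the AUC is exactly 0.5."""
    srd = success_rate_difference(auc_value)
    if srd == 0:
        raise ValueError("NNT is undefined for AUC = 0.5")
    return 1.0 / srd


# Toy data: six subjects, three of whom had the outcome during follow-up.
scores = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
outcomes = [0, 0, 1, 0, 1, 1]
a = auc(scores, outcomes)
print(round(a, 3), round(success_rate_difference(a), 3),
      round(number_needed_to_take(a), 3))
```

For an AUC of 0.70, for example, these definitions give SRD = 0.40 and NNT = 2.5.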
To assess the absolute risk for different levels of the risk score, both test samples were split in quartiles and the proportion of observed adverse outcomes for each quartile was determined.
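A minimal Python sketch of this quartile analysis is given below. It assumes that subjects are split into four rank-based groups of (nearly) equal size, with ties broken by input order; the exact handling of ties in the original R analysis is not specified in the text. The data are invented for illustration.

```python
def outcome_rate_by_quartile(scores, outcomes):
    """Observed proportion of adverse outcomes in each risk score quartile."""
    n = len(scores)
    if n < 4:
        raise ValueError("need at least four subjects")
    # Indices of subjects sorted from lowest to highest risk score.
    order = sorted(range(n), key=lambda i: scores[i])
    rates = []
    for q in range(4):
        group = order[q * n // 4:(q + 1) * n // 4]
        rates.append(sum(outcomes[i] for i in group) / len(group))
    return rates


# Toy data: eight subjects, outcome coded 1 if present during follow-up.
scores = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
rates = outcome_rate_by_quartile(scores, outcomes)
print(rates)
```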

Comparison of risk score with other risk factors for severe course of illness
To further validate the prediction model, we assessed whether our risk score outperformed other risk factors of a severe course of MD: measures of genetic risk, environmental risk, and neuroticism.
First, we used three measures reflecting genetic risk for MD, i.e. family history of MD, age at onset, and polygenic risk score for MD. These measures have been shown to be associated with a more severe course of illness (Eaton et al., 2008). Family history of MD was calculated as the number of first-degree relatives affected with MD divided by the number of first-degree relatives. We calculated a polygenic risk score for MD for 1662 NESDA participants whose genome wide association study (GWAS) data were available. We used GWAS summary statistics for MD publicly released by the Psychiatric Genomics Consortium, obtained for a subset of 59,851 cases and 113,154 controls after the exclusion of data from 23andMe (Wray et al., 2018). Furthermore, since NESDA data were part of the meta-analysis, we re-ran the meta-analysis after removal of overlapping data (~3K samples). LDpred was used to compute polygenic risk scores (Supplemental Methods 1) (Vilhjálmsson et al., 2015).
Second, we used two measures of environmental risk (early or lifetime traumas and childhood sexual abuse), which are also associated with a more severe course of illness (Gopinath et al., 2007; Hardeveld et al., 2013b; Paterniti et al., 2017). Third, we assessed the association between the risk score and the personality trait neuroticism. Neuroticism strongly reflects liability for MD (Jeronimus et al., 2016), and is associated with a more severe course of MD (Xia et al., 2011). For details about how these risk factors were assessed, we refer to Supplemental Table 2.
We calculated Pearson's correlation coefficient between our risk score and these other risk factors, and we determined the AUC's of the other risk factors for all adverse outcomes.

Sample characteristics
On average, participants in the test sample from NESDA had a higher risk score for recurrence of MD at baseline than participants in the VATSPSUD test sample (Table 1). NESDA participants also more often reported lifetime episodes of GAD and alcohol dependence at baseline. This may reflect differences in disease severity due to differences in sample ascertainment. Whereas VATSPSUD is based on birth records of twins in Virginia (Kendler and Prescott, 2006), NESDA sampled from the general population, primary care, and specialized mental health care (Penninx et al., 2008). In addition, NESDA included a large proportion of subjects with current depressive or anxiety disorders, whereas in VATSPSUD, 653 cases with last year/current MD were excluded from the test sample since they were included in the training sample drawn from VATSPSUD. The larger proportion of participants with current/recent episodes of MD (instead of lifetime) in NESDA might also explain the similarity between the risk score in the test sample from NESDA and the training sample from VATSPSUD, which included exclusively subjects with a last year MD episode (Table 1).
As expected, there were high rates of co-occurrence between the different outcomes (Table 2). Correlations between MD at follow-up and GAD, panic disorder, and severe disability at follow-up ranged between 0.34 and 0.76.

Prospective prediction performance
In both test samples, the risk score significantly predicted future episodes of MD, and also the other adverse course-related outcomes, viz. episodes of GAD, panic disorder, and disability (Table 3). Despite the fact that the risk score was specifically trained to predict MD recurrence, its associations with future episodes of MD, GAD, panic disorder, and disability were equally strong. A standard deviation (SD) increase in risk score corresponded with roughly doubled odds of these adverse outcomes (mean OR = 2.1). In addition, the AUC's for future episodes of MD (range 0.68-0.73) were comparable with AUC's for the other outcomes (range 0.65-0.78) as indicated by the overlapping 95% confidence intervals.
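To illustrate what an odds ratio of about 2.1 per SD means on the probability scale, the following Python sketch converts a baseline risk into the risk expected after a given increase in the standardized risk score. The baseline risk of 20% is a hypothetical value chosen for illustration, and the calculation assumes the logistic model holds (a constant odds ratio per SD).

```python
def risk_after_increase(baseline_risk, odds_ratio, sd_units=1.0):
    """Probability after multiplying the baseline odds by odds_ratio ** sd_units."""
    if not 0.0 < baseline_risk < 1.0:
        raise ValueError("baseline_risk must lie strictly between 0 and 1")
    odds = baseline_risk / (1.0 - baseline_risk)
    new_odds = odds * odds_ratio ** sd_units
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline risk of 20% and the mean OR of 2.1 reported above.
one_sd = risk_after_increase(0.20, 2.1, sd_units=1.0)
two_sd = risk_after_increase(0.20, 2.1, sd_units=2.0)
print(round(one_sd, 3), round(two_sd, 3))
```

The example shows that doubling the odds does not double the probability unless the outcome is rare: from a 20% baseline, one SD raises the expected risk to roughly 34%, not 40%.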
Prediction performance for future episodes of MD in both test samples was comparable to the performance in the training sample (AUC's 0.68-0.73 versus AUC 0.75; confidence intervals were overlapping), indicating that there was little overfitting of the prediction model in the training data. This showed that the predictive performance of this model was not specific to our first study, but that the model also predicted adverse outcomes of MD across samples from different subpopulations, two different countries, in which different measurement procedures were used.
Using sensitivity analyses, we assessed to what extent prediction performance decreased when multiple imputation was not used to construct the risk score, but missing values were replaced by sample means, because in clinical practice multiple imputation will often not be feasible. Prediction performance was very comparable for this alternative risk score, i.e. AUC's were at most 0.01 attenuated (Supplemental Table 3).
Dividing participants in quartiles based on their risk score, subjects in the lower risk groups consistently reported fewer adverse outcomes than individuals in the higher risk groups (Fig. 1). ROC-curves showed that the optimal cutpoint of the risk score (i.e., resulting in the maximum sum of sensitivity and specificity) was ~0.8 in VATSPSUD and ~1.0 in NESDA, resulting in a mean sensitivity of 74% (range 65-82) and mean specificity of 60% (range 46-76) across the different outcomes (Supplemental Fig. 1). The mean negative predictive value was 63% (range 16-83) and the mean positive predictive value was 15% (range 4-51). This means that at the optimal cutpoint, the score performs better in detecting the true negatives than the true positives in a population. This can be attributed to the relatively low prevalence of some of the outcomes. If outcomes are rare (e.g., panic disorder occurred in only 7% of VATSPSUD participants), diagnostic tests will more often result in false positives, and less often in false negatives. From a clinical standpoint, a negative test result could in this case be more valuable than a positive test result (e.g., a negative test result could support a decision to reduce antidepressant use).
Table 2. Co-occurrence of the outcomes.
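The cutpoint selection and the dependence of predictive values on prevalence can be sketched in Python as follows. The first function picks the cutpoint that maximizes the sum of sensitivity and specificity (Youden's index); the second applies Bayes' rule to obtain predictive values from sensitivity, specificity, and prevalence. The toy data and the plugged-in values are illustrative assumptions and are not intended to reproduce the predictive values reported above.

```python
def youden_cutpoint(scores, outcomes):
    """Cutpoint maximizing sensitivity + specificity - 1 (score >= cut is positive)."""
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one non-case")
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 1)
        tn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # keep the lowest cutpoint in case of ties
            best_cut, best_j = cut, j
    return best_cut, best_j


def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value via Bayes' rule."""
    tp = sensitivity * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    tn = specificity * (1.0 - prevalence)
    fn = (1.0 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)


# Toy data for the cutpoint search.
scores = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
outcomes = [0, 0, 1, 0, 1, 1]
cut, j = youden_cutpoint(scores, outcomes)

# Same test characteristics at two hypothetical prevalences.
ppv_rare, npv_rare = predictive_values(0.74, 0.60, 0.07)
ppv_common, npv_common = predictive_values(0.74, 0.60, 0.40)
print(cut, round(j, 3))
print(round(ppv_rare, 3), round(ppv_common, 3))
```

With sensitivity and specificity held constant, the positive predictive value falls sharply as the outcome becomes rarer, which is the pattern described in the text for low-prevalence outcomes.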

Comparing the model with other risk factors for severe course of illness
We compared the prediction performance of our risk score with several measures of genetic risk, environmental risk, and neuroticism, which are well-known risk factors for a severe course of MD. We did this to assess to what extent the more complex risk score outperformed these simpler risk factors. Three of these risk factors were included as a predictor in our risk score (family history, traumas, childhood sexual abuse), the other three risk factors (age at onset, polygenic risk score for MD, neuroticism) were not.
All these risk factors for a severe course of illness were significantly correlated with our risk score in the expected direction (Supplemental Table 4, all P-values<0.002). Subjects with a higher risk score tended to have more first-degree relatives with MD, a higher polygenic risk for MD, an earlier age at onset, higher neuroticism scores, and reported a higher number of traumas and a history of childhood sexual abuse. Neuroticism was particularly highly correlated with the risk score (r~0.5).
The risk score predicted the outcomes more accurately (AUC's 0.65-0.78) than logistic regression models based on one of the other risk factors. Prediction performance of family history of MD, age at onset, polygenic risk score for MD, and childhood sexual abuse was in the same range (AUC's ~0.5-0.6), and lifetime traumas performed slightly better (AUC's 0.55-0.66) (Table 4). However, the model including only neuroticism predicted the outcomes almost as accurately as our risk score (AUC's 0.64-0.76). The confidence intervals of the AUC's were overlapping for most outcomes, except for episodes of MD and GAD in VATSPSUD (Supplemental Table 5). For these two outcomes, the risk score had a significantly higher AUC. Of note, neuroticism was not included in our risk score. While it was included in the model discovery phase, it was not retained in the elastic net penalized model, which could be due to multicollinearity between neuroticism and the other predictors.
Because of the relatively strong prediction performance of neuroticism, we performed post hoc analyses to investigate whether neuroticism could further enhance prediction performance of the risk score. We performed an unpenalized Cox regression analysis including both our risk score and neuroticism as independent variables to predict MD recurrence in the training data (VATSPSUD, n = 653). In this model, our risk score significantly predicted MD recurrence (HR 2.1, CI 1.8-2.4) but neuroticism's effect was attenuated to non-significance (HR 0.9, CI 0.8-1.1). The addition of neuroticism to the risk score did not improve prediction performance: AUC's based on this model were similar to or lower than those based on the risk score alone.

Principal findings
We tested a data mining algorithm for predicting future episodes of MD in subjects with lifetime MD using baseline clinical characteristics. The model consistently predicted future episodes of MD in two independent test samples, despite differences in sample composition, study design and assessment of predictors. In addition, the model predicted future episodes of GAD, panic disorder, and disability comparably. Furthermore, the algorithm outperformed several known risk factors for a more severe course of illness, viz. measures of genetic risk, and environmental risk. Only neuroticism predicted the adverse outcomes with nearly equal performance. However, prediction was not improved when we combined the risk score and neuroticism.
Table 3 (caption and notes). … Table 1) and several adverse outcomes during follow-up. The strength of association was tested using logistic regression analyses in which the standardized recurrence risk score (M = 0, SD = 1) was the independent variable, and the measures of MD, anxiety, and disability were the dependent variables. All outcomes were binary; psychiatric disorders were coded as "1" if the subject reported at least one episode in the time interval. Disability was coded as "1" if the participant's level of disability was high (WHODAS > 40, corresponding roughly with the top 25% in NESDA). Years indicate the approximate number of years after baseline assessment (e.g., 0-2 years concerns the first two years after baseline, etc.). For ROC-curves see Supplemental Figure 1.
1 All odds ratios are highly significant with P-values < 3 × 10⁻⁹ (Bonferroni corrected alpha 0.05/21 = 0.002).
2 Success rate difference (SRD) equals 2 × AUC − 1 (equal to Somers' D or Kendall's tau and thus interpretable as a correlation coefficient). Number needed to take (NNT) equals 1/SRD, and represents the number one would need to sample from the subgroup with the higher risk to have one more 'success' (i.e. adverse outcome) than the lower risk group.
3 Number (N) of subjects with available data on the dependent variable.
4 Proportion of subjects reporting this outcome.
5 Outcome assessed in the 12 months prior to interview wave(s).
6 In VATSPSUD, we tested the association of the recurrence risk score with episodes of GAD with a duration of ≥1 month instead of ≥6 months in the year prior to interview, because only 4% of cases reported GAD with a duration ≥6 months, which might limit reliability of estimated associations.

Scientific and clinical relevance
First, estimates of prediction performance were similar across two differently ascertained test samples. This indicates that the combination of risk factors predicting future episodes of MD is to some extent shared rather than being unique across subjects with MD sampled from the general population, primary care, and specialized mental health care, across subjects from different countries, twins vs. non-twins, and measured with different procedures. This is a promising finding for clinical practice since a prediction model derived in one sample could be relevant for clinical populations, rather than being restricted to the training sample used to develop the model. Second, the model predicted a broader range of adverse outcomes than it was originally developed for: it did not only predict future episodes of MD but also episodes of anxiety disorders and disability. Thus, the model could give clinicians an estimate of the risk of multiple outcomes instead of only one, which might facilitate treatment decisions. For example, one could think of decisions on the intensity of monitoring, or continuing treatment in patients who recovered from depression to prevent future episodes of MD or anxiety disorders (Coplan et al., 2015). Third, the model predicted future episodes of MD, anxiety disorders, and disability very similarly. Partly, this was expected because of the high rates of co-occurrence between MD, anxiety disorders and severe disability, and the overlap in their risk factors. However, it was surprising how similarly the model predicted across these outcomes. Future studies are needed to investigate whether more specific prediction models can be identified with larger training samples.

Relation to previous studies
In previous studies using independent test data, estimates of prediction performance for models predicting course of MD were quite similar. In our study, the average AUC across future episodes of depression, anxiety, and disability was 0.71, whereas in previous studies predicting course-related outcomes in MD the AUC ranged from 0.63 to 0.76 (for ≥12 weeks follow-up) (Chekroud et al., 2016; de Vries et al., 2018; Kessler et al., 2016; Perlis, 2013; Wang et al., 2014). Interestingly, estimates of prediction performance for prediction models in other medical disciplines are not very different. For instance, similar AUC's have been found in models predicting mortality after myocardial infarction (0.75-0.77) (van Loo et al., 2014b), other outcomes in cardiology (0.7-0.8) (Siontis et al., 2012), melanoma (0.7-0.8) (Usher-Smith et al., 2014), or bleeding when using antiplatelet therapy after percutaneous coronary intervention (0.64) (Yeh et al., 2016).
Despite its potential relevance for clinical practice, few previous studies assessed the predictive value of a model for a wider range of outcomes than the model was originally trained for. Only one study externally validated one data mining risk score across multiple outcomes: MD persistence and chronicity, hospitalization for depression, attempted suicide, disability due to depression at time (Kessler et al., 2016). Similar to our study, this risk score predicted these multiple outcomes (AUC's 0.63-0.76), but its predictive value for anxiety disorders was not assessed, so we cannot compare these results.
Fig. 1 (caption). In each external validation sample, subjects were stratified in quartiles based on their recurrence risk score: quartile 1 includes the 25% of subjects with the lowest recurrence risk scores and quartile 4 includes the 25% of subjects with the highest scores. The subjects in quartiles 2 and 3 had intermediate scores. The y-axis shows the proportion of subjects reporting the outcome during follow-up. (a) Presents the results of the VATSPSUD test sample, (b) presents the results of the NESDA test sample. The number of cases (N) with present data for each outcome are described in Table 3.
Our risk score performed only modestly better than neuroticism in predicting adverse outcomes in MD. However, the effect of neuroticism was attenuated to nonsignificant when added to the risk score in a multiple predictor model, while our risk score remained strongly predictive. Two previous studies found a similar attenuation of neuroticism's effect to predict recurrence of MD in a model including multiple predictors such as stressful life events and childhood traumas (Gopinath et al., 2007;Hardeveld et al., 2013b). One study found that neuroticism did not predict MD recurrence even in a univariate context (Hardeveld et al., 2013a). Given the inconsistent findings, future studies are warranted to investigate whether neuroticism is a consistent predictor of MD recurrence, and how it compares to our risk score.

Strengths and limitations
First, although our model's prediction performance was comparable to that of other models predicting course of MD, and other medical conditions, its performance is moderate (AUC~0.7), with relatively low positive predictive values for some of the rare outcomes. The model also needs information on 24 predictors, which may limit its value for clinical practice. Future studies are needed to assess whether the model can be improved by using larger training samples, other types of statistical learning techniques (Chekroud et al., 2016), or other types of data such as neuro-imaging, biomarkers, and molecular genetic data (Gillan and Whelan, 2017). However, the fact that our model exclusively utilizes readily available clinical information also is a strength as this reduces its associated costs and burden to patients.
Second, although our study showed consistent prediction performance in two independent test samples from different populations, prediction might be different for specific subgroups of patients. How well does this model predict course of MD in hospitalized patients, or in patients from different cultures? Furthermore, are results generalizable to situations in which less high-quality baseline data are available? This study showed in part that not all predictors need to be available or assessed with the exact same instruments, which promotes the model's applicability in clinical practice, but more work is needed to confirm this.
Third, not all our outcomes or predictors were optimally assessed. For instance, outcomes were assessed over the course of several years, instead of over a period of months or decades. Assessment over months and over decades would have provided more fine-grained information to test the risk score's prediction performance in the short and long term. More longitudinal studies are needed to collect these data. In addition, the polygenic score for MD only explains a limited percentage of the variance of MD (Wray et al., 2018), and different genetic variants might be implicated in MD onset than in MD recurrence.
Fourth, calculating the risk score by hand in clinical practice is labor intensive. We are working on a digitalized version of this prediction model, which facilitates implementing and testing this algorithm in clinical samples.
Fifth, as with all probabilistic decision tools, the interpretation of probabilistic estimates is challenging. For instance, low probabilities are generally overweighted, whereas high probabilities are underweighted (Kahneman and Tversky, 1979). Thus, randomized controlled trials should carefully study whether the application of these probabilistic decision support tools indeed improves clinical decision making (Gillan and Whelan, 2017).
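The first limitation above, relatively low positive predictive values for rare outcomes, follows from Bayes' rule: at a fixed sensitivity and specificity, the positive predictive value (PPV) falls as the outcome becomes rarer. The sketch below illustrates this relationship. The sensitivity, specificity, and prevalence values are assumptions chosen for illustration and are not estimates from this study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(outcome | positive prediction), computed with Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical operating point of a moderately discriminating classifier.
SENSITIVITY = 0.70
SPECIFICITY = 0.70

# Same classifier, applied to a rare and a common outcome (illustrative prevalences).
ppv_rare = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence=0.04)
ppv_common = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence=0.30)

print(f"PPV at 4% prevalence:  {ppv_rare:.2f}")
print(f"PPV at 30% prevalence: {ppv_common:.2f}")
```

Under these assumed values, most positive predictions for the rare outcome are false positives even though discrimination is unchanged, which is why PPV should be read together with outcome prevalence.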

Conclusion
A prediction model based on 24 clinical characteristics consistently predicted multiple outcomes related to a more severe course of MD. Future studies are needed to test whether this risk prediction tool can serve as an extra source of information to differentiate high-risk from low-risk patients in clinical practice. The final aim would be to leverage the opportunities of data mining to improve insight into individual disease risk, and tailor treatment decisions to the individual patient.

Table notes: AAO, age at onset; AUC, area under the receiver operating characteristic curve; CSA, childhood sexual abuse; MD, major depression; N, number; n.a., not available; PRS, polygenic risk score. This table presents the AUCs of logistic regression models.
1 Ratio of family members with MD, calculated by dividing the number of first-degree relatives with MD by the number of first-degree relatives.
2 Age at onset of MD (years). To increase the comparability with the other predictors, we multiplied the age at onset by −1 to estimate its AUC, because a lower age at onset is associated with a higher risk of adverse outcomes.
3 Polygenic risk score for MD; GWAS data are not available for VATSPSUD.
4 Traumas during lifetime in VATSPSUD; traumas during childhood in NESDA.
5 Outcome assessed in the 12 months prior to interview wave(s).
6 In VATSPSUD, we tested the association of the recurrence risk score with episodes of GAD with a duration of ≥1 month instead of ≥6 months in the year prior to interview, because only 4% of cases reported GAD with a duration ≥6 months, which might limit reliability of estimated associations.
7 Because of the similarity of the AUCs for the risk score and neuroticism, 95% confidence intervals were calculated for the AUCs of neuroticism (see Supplemental Table 5). All AUC CIs of neuroticism overlapped with the AUC CIs of the risk score, except for MD 0-1 year VATSPSUD: AUC 0.65 (CI 0.61-0.68) and GAD 0-1 year VATSPSUD: AUC 0.64 (CI 0.59-0.68).
Confidence intervals of the risk score are presented in Table 3. For more details on the outcomes or competing predictors, we refer to Supplemental Table 2.
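The sign reversal in table note 2 can be illustrated with a rank-based AUC, which equals the probability that a randomly chosen case scores higher than a randomly chosen non-case. This is a minimal sketch on invented data. It uses the Mann-Whitney formulation rather than the logistic regression models reported in the table, and the function name `auc` and the toy values are hypothetical.

```python
def auc(scores, labels):
    """Rank-based AUC: P(score of a case > score of a non-case), ties count 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    non_cases = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not non_cases:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in non_cases)
    return wins / (len(cases) * len(non_cases))


# Toy data (invented): earlier onset tends to go with the adverse outcome.
age_at_onset = [15, 18, 22, 30, 35, 40]
adverse_outcome = [1, 1, 0, 1, 0, 0]

# Raw age at onset: higher values indicate LOWER risk, so the AUC falls below 0.5.
auc_raw = auc(age_at_onset, adverse_outcome)

# Multiplying by -1 makes higher values indicate HIGHER risk, as for the other predictors.
auc_flipped = auc([-a for a in age_at_onset], adverse_outcome)

print(f"AUC, raw age at onset:     {auc_raw:.2f}")
print(f"AUC, age at onset times -1: {auc_flipped:.2f}")
```

Reversing the sign mirrors the AUC around 0.5 without changing how well the predictor discriminates, so the flipped value can be compared directly with predictors for which higher values mean higher risk.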

Author contributions
The study was designed by HMvL and KSK. HMvL, SHA, TBB, and YM analysed the data. All authors contributed to interpretation of the results. HMvL drafted the manuscript; all other authors critically revised the manuscript, and approved its final version.

Role of the funding source
The infrastructure for the NESDA study (www.nesda.nl) is funded through the Geestkracht program of the Netherlands organization for Health Research and Development (NWO, ZonMw, grant number 10-000-1002) and financial contributions by participating universities and mental health care organizations (VU University Medical Center, GGZ inGeest, Leiden University Medical Center, Leiden University, GGZ Rivierduinen, University Medical Center Groningen, University of Groningen, Lentis, GGZ Friesland, GGZ Drenthe, Rob Giel Onderzoekscentrum).
Further funding is provided by the Center for Medical Systems Biology (CSMB, NWO Genomics), Biobanking and Biomolecular Resources Research Infrastructure (BBMRI-NL), VU University's Institutes for Health and Care Research (EMGO+) and Neuroscience Campus Amsterdam, University Medical Center Groningen, Leiden University Medical Center, National Institutes of Health (NIH, R01D0042157-01A, MH081802, Grand Opportunity grants 1RC2 MH089951 and 1RC2 MH089995). Part of the genotyping and analyses were funded by the Genetic Association Information Network (GAIN) of the Foundation for the National Institutes of Health. Computing was supported by BiG Grid, the Dutch e-Science Grid, which is financially supported by NWO.

Declaration of Competing Interest
None.