Predicting the 9-year course of mood and anxiety disorders with automated machine learning: A comparison between auto-sklearn, naïve Bayes classifier, and traditional logistic regression



Introduction
Despite a large body of epidemiological research, the course and onset of mood and anxiety disorders remain difficult to predict. Improving the ability to predict the onset and course of mood and anxiety disorders can be clinically relevant for prevention, early detection, staging, and personalized treatments (McGorry, 2010). In clinical settings, most decision making is based on clinical-care guidelines and experience (AEgisdóttir et al., 2006). However, even experienced clinicians may ignore relevant information or may put too much emphasis on clinically salient cues (Odeh et al., 2006). Information on demographic characteristics and on clinician-rated and self-reported measures is increasingly collected as part of routine outcome monitoring (ROM; Carlier et al., 2012), but this information is underused in clinical decision making. The literature suggests that automated statistical prediction of current diagnoses and course may improve clinical decision making (AEgisdóttir et al., 2006;Grove et al., 2000), particularly through modern machine learning (ML) approaches.
ML may be more time efficient, better suited for large and complex datasets, and better able to detect complex patterns in the data than current data-modelling approaches that rely heavily on human decision making (Iniesta et al., 2016;Wang et al., 2018). Most clinical data thus far have been analyzed by selecting only specific putative predictors. It is possible that more complex (including nonlinear and higher dimensional) patterns exist in the data, which can efficiently be detected when analyzing all available data simultaneously using ML (Chekroud et al., 2016;Hahn et al., 2016). These approaches are able to examine huge numbers of potential predictors in an unbiased manner while preventing overfitting (Hastie et al., 2009).
Thus far, ML studies in the field of psychiatry have been promising. A recent meta-analysis, which included 20 studies that predicted the therapeutic outcome of depression using ML algorithms, found an overall accuracy of .82 (95% confidence interval [CI] .77-.87; Lee et al., 2018). Another ML study used an extensive set of baseline variables in a subset of 805 depressed patients from the Netherlands Study of Depression and Anxiety (NESDA) cohort, including biological and psychological variables (e.g., personality traits; Dinga et al., 2018). The study achieved an accuracy significantly greater than chance of 66% for predicting persistent depression over the course of 2 years. A similar study, performed in a subset of the NESDA cohort of 887 anxiety patients, found an accuracy of predicting anxiety recovery of 62% (p < .05) and an accuracy of predicting recovery of all common mental disorders of 63% (p < .05; Bokma et al., 2020). Clinical severity measures were the most important predictor variables, which is in line with previous reports (Bokma et al., 2020;Dinga et al., 2018;Lee et al., 2018). Although these studies seem promising, recently published papers have demonstrated only limited added value of ML over traditional regression analyses (Christodoulou et al., 2019;van Mens et al., 2020). Additionally, other studies found that when predicting suicide, ML did not outperform regression analysis and resulted in positive predictive values below 0.01, thus limiting the practical utility of these predictions (Belsher et al., 2019;Kessler et al., 2017). Despite the increasing number of publications in this field, ML has yet to move towards clinical application (Tran et al., 2019).
Although ML incorporates less human decision making than traditional methods, most ML methods are still not fully automated. Feature selection has been standardized as much as possible, but cut-off values that determine which features to include or exclude are somewhat arbitrarily selected. One solution would be to fully automate the selection of features, as is done in the Auto-sklearn system (Waring et al., 2020). Auto-sklearn is a next generation ML system that automatically selects the learning algorithm that best suits the data and automatically optimizes the hyperparameter settings of this algorithm. It has proved effective when analyzing a diverse range of datasets and is considered to be an efficient and robust system for use by both ML novices and experts (Feurer et al., 2015;Feurer et al., 2019).
We aimed to study and to compare the performance of traditional multinomial logistic regression, a basic probabilistic ML algorithm (naïve Bayesian classifier; Jayant and Safari, 2020) and a more advanced automated ML method (Auto-sklearn) to predict DSM-IV-TR psychiatric diagnoses at a 2-, 4-, 6-, and 9-year follow up with different sets of predictors. We incorporated predictor variables that can be easily and inexpensively collected in clinical practice, such as demographic variables, clinician-rated psychiatric diagnoses, and self-reported depression and anxiety. Our hypothesis was that Auto-sklearn would be better at detecting complex patterns in the data and therefore would outdo a naïve Bayesian classifier, which in turn would outdo traditional regression analysis techniques in achieved level of accuracy. Moreover, we hypothesized that Auto-sklearn would be particularly efficient when single items and follow-up measures were included.

Study sample and procedures
For the current study, we included participants from the NESDA cohort, which investigated the course and consequences of depressive and anxiety disorders. A detailed description of the NESDA design and sampling procedures is published elsewhere (Penninx et al., 2008).
The first wave (baseline) lasted from 2004 to September 2007, and the sixth wave of measurement at the 9-year follow up finished in October 2016. NESDA is a cohort study that recruited from the community (n = 564; 18.9%), general practice (n = 1,610; 54.0%), and secondary mental healthcare (n = 807; 27.1%; Penninx et al., 2008) and included patients with a current or lifetime depressive or anxiety disorder as well as healthy controls (see supplementary Table 1). A limited number of exclusion criteria were applied, namely not being fluent in Dutch and the presence of other clinically overt psychiatric disorders (e.g., addiction, psychotic, bipolar). With this method, NESDA aimed for a cohort that is representative for diverse populations of healthy controls and patients with depression and anxiety (Penninx et al., 2008). Due to missing outcome data (mainly due to attrition), we included 2,596 (87.1%) participants to predict 2-year outcomes, 2,402 (80.6%) to predict 4-year outcomes, 2,256 (75.7%) to predict 6-year outcomes, and 2,068 (69.4%) to predict 9-year outcomes.
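The percentages in the paragraph above can be checked with a few lines of arithmetic. The sketch below assumes that all percentages are taken relative to the full baseline cohort (the sum of the three recruitment sources); the variable and function names are illustrative only.

```python
# Recruitment counts as reported in the text (Penninx et al., 2008).
recruited = {
    "community": 564,
    "general practice": 1610,
    "secondary mental healthcare": 807,
}
total = sum(recruited.values())  # assumed denominator: full baseline cohort

# Participants with outcome data at each follow-up year, as reported in the text.
analysed = {2: 2596, 4: 2402, 6: 2256, 9: 2068}


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the text."""
    return round(100 * part / whole, 1)


recruitment_pct = {source: pct(n, total) for source, n in recruited.items()}
retention_pct = {year: pct(n, total) for year, n in analysed.items()}

print(total)
print(recruitment_pct)
print(retention_pct)
```

Under this assumption the computed values reproduce the reported recruitment shares and the share of participants available at each follow-up wave.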

Independent variables
An overview of the independent variables within each predictor set can be found in Table 1 in the supplementary material. Independent variables comprised baseline demographics, lifetime and baseline DSM-IV-TR diagnoses, self-reported depression, and anxiety symptomatology. Demographic variables included gender, age, ethnicity (North European heritage: yes/no), level of education (1 = elementary or less; 2 = general intermediate/secondary education; 3 = college/university), partner status (no partner, with partner [not married], married, living apart/no partner, divorced/no partner, widowed/no partner), and working status (employed/unemployed). The Composite International Diagnostic Interview (CIDI WHO, version 2.1) was used to assess the presence of mood and anxiety disorders according to the DSM-IV-TR. This included current dysthymia, major depressive disorder (MDD), lifetime depressive disorder, social phobia, panic with agoraphobia, panic without agoraphobia, agoraphobia without panic, generalized anxiety disorder, and lifetime anxiety disorder. Future CIDI-based diagnoses were used as outcome variables at 2-, 4-, 6-, and 9-year follow up, and past and current CIDI-based diagnoses were used as independent variables. Thus, diagnoses at baseline and at Years 2, 4, and 6 were used to predict the diagnosis at the 9-year follow up (see Section 2.2.2).
Anxiety and depressive severity as well as symptoms at baseline and 1-year follow up were assessed using the Fear Questionnaire (FQ; Marks and Mathews, 1979), the Beck Anxiety Inventory (BAI; Beck et al., 1988), and the Inventory of Depressive Symptomatology (IDS-SR; Rush et al., 1996). These measures were entered into the models as either sum scores only or as a combination of sum scores and individual items. Detailed (psychometric) information about the measures can be found in the supplementary material.

Outcome variable: clinical diagnoses
The CIDI WHO, version 2.1 was used to assess clinical diagnoses according to the DSM-IV-TR. The CIDI is a fully standardized diagnostic interview with extensively validated psychometric characteristics (Penninx et al., 2008;Wittchen, 1994) and may be considered a gold standard for psychiatric diagnostic classification (Haro et al., 2006;Kessler et al., 2009).
At the 2-, 4-, 6-, and 9-year follow up, CIDI-based outcomes were coded both as a binary variable (psychiatric disorder absent vs. present) and as a categorical variable with four categories: healthy, mood disorder (i.e., major depression and/or dysthymia), anxiety disorder (i.e., general anxiety, social phobia, panic with agoraphobia, panic without agoraphobia, and/or agoraphobia without a panic disorder), and comorbid mood and anxiety disorders.
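The two outcome codings described above can be sketched as a small function. This is only an illustration: the function name, the argument names, and the category labels are assumptions, not the variable names used in NESDA.

```python
def code_outcome(mood, anxiety):
    """Return (binary code, four-category label) for one participant.

    mood    -- True if any mood disorder is present (major depression and/or dysthymia)
    anxiety -- True if any anxiety disorder is present
    """
    binary = int(mood or anxiety)  # 0 = disorder absent, 1 = disorder present
    if mood and anxiety:
        category = "comorbid"
    elif mood:
        category = "mood"
    elif anxiety:
        category = "anxiety"
    else:
        category = "healthy"
    return binary, category


# Hypothetical participants: (mood disorder present, anxiety disorder present)
for flags in [(False, False), (True, False), (False, True), (True, True)]:
    print(flags, code_outcome(*flags))
```

Note that the binary outcome is a coarsening of the categorical one: the three disorder categories all map to "present".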

Statistical analysis
A total of 96 models were tested. We compared three methods, over four sets of predictor variables, over two outcome sets, and over four follow-up waves. The three methods were multinomial logistic regression (Menard, 2002), naïve Bayes classifier (Jayant and Safari, 2020), and Auto-sklearn (Feurer et al., 2015). The four sets of predictor variables (all including sociodemographic variables and baseline diagnoses) were (a) baseline sum scores only; (b) baseline sum scores and 1-year follow up sum scores; (c) baseline sum scores, 1-year follow up sum scores, and individual items at baseline; and (d) sum scores and individual items at baseline and 1-year follow up. For an overview of the predictor Sets A-D, see Table 1 in the supplementary material. Missing item values (0.54%-13.1%) were replaced by the mean of the available cases. The two outcomes were binary (healthy/mood or anxiety disorder) and multinomial (healthy [A], mood disorder [B], anxiety [C], or comorbid mood and anxiety disorder [D]). The follow-up waves occurred at 2, 4, 6, and 9 years.
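The count of 96 models follows from crossing the four design factors, and the mean replacement of missing item values is a one-line operation. The sketch below illustrates both; the labels and the example item scores are hypothetical, and the text does not state whether the means were computed on the full sample or on the training data only.

```python
from itertools import product
from statistics import mean

methods = ["logistic regression", "naive Bayes", "Auto-sklearn"]
predictor_sets = ["A", "B", "C", "D"]
outcomes = ["binary", "multinomial"]
waves = [2, 4, 6, 9]  # follow-up years

# Every combination of method, predictor set, outcome type, and wave is one model.
grid = list(product(methods, predictor_sets, outcomes, waves))
print(len(grid))  # 3 * 4 * 2 * 4


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


item = [0, 2, None, 1, None, 3]  # hypothetical scores on one questionnaire item
print(impute_mean(item))
```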
Auto-sklearn is an automated ML system that addresses both the problem of choosing which ML algorithm is best suited to analyze a specific application scenario (i.e., the model/algorithm selection problem) and the problem of determining which parameter setting leads to high performance (i.e., the hyperparameter optimization problem). Auto-sklearn considers a wide range of feature selection methods including all classification approaches implemented within the Python scikit-learn package, spanning 15 classifiers (e.g., random forests, decision tree, gradient boosting, etc.), 14 feature preprocessing methods (e.g., feature agglomeration, polynomial, nystroem sampler, etc.), and four data preprocessing methods (i.e., one-hot encoding, imputation, balancing, and rescaling), giving rise to a structured hypothesis space with 110 hyperparameters. Auto-sklearn features preprocessing methods that can be mainly categorized into feature selection, kernel approximation, matrix decomposition, embeddings, feature clustering, polynomial feature expansion, and methods that use a classifier for feature selection (for more details see Feurer et al., 2019). Previous research shows that the classification performance is often much better than using standard selection/hyperparameter optimization methods (Feurer et al., 2015), and researchers believe Auto-sklearn to be a promising system for use by both ML novices and experts (Feurer et al., 2019). Auto-sklearn won six out of 10 phases of the first ChaLearn AutoML challenge. Furthermore, a comprehensive analysis of over 100 diverse datasets, while taking into account time and computational resource constraints, demonstrated that Auto-sklearn outperformed the previous state of the art in AutoML (Feurer et al., 2019). More details about Auto-sklearn can be found elsewhere (Feurer et al., 2015;Feurer et al., 2019; https://automl.github.io/auto-sklearn/master/api.html, accessed at 2019-12-10).
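To make the combined model selection and hyperparameter optimization problem concrete, the following toy sketch searches jointly over two hand-written classifiers and their single hyperparameter. It is not Auto-sklearn's procedure: it uses plain random search on hypothetical one-dimensional data, whereas Auto-sklearn uses a more sophisticated search strategy over a far larger space (see Feurer et al., 2015). All names, the data, and the search space are assumptions made for illustration.

```python
import random


def make_data(n, rng):
    """Hypothetical 1-D data: class 0 centred at 0, class 1 centred at 3."""
    data = [(rng.gauss(0, 1), 0) for _ in range(n)]
    data += [(rng.gauss(3, 1), 1) for _ in range(n)]
    rng.shuffle(data)
    return data


def fit_threshold(train, t):
    """Cut-off rule: predict class 1 when x exceeds t (ignores the training data)."""
    return lambda x: int(x > t)


def fit_knn(train, k):
    """k-nearest-neighbour classifier on one feature (majority vote, k odd)."""
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return int(2 * sum(label for _, label in nearest) > k)
    return predict


# Joint search space: each algorithm is paired with a sampler for its hyperparameter.
SPACE = {
    "threshold": (fit_threshold, lambda rng: rng.uniform(-1.0, 4.0)),
    "knn": (fit_knn, lambda rng: rng.choice([1, 3, 5, 7, 9])),
}


def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)


def random_search(train, valid, n_trials, rng):
    """Sample (algorithm, hyperparameter) pairs; keep the best on validation data."""
    best = None
    for _ in range(n_trials):
        name = rng.choice(sorted(SPACE))
        fit, sample = SPACE[name]
        param = sample(rng)
        score = accuracy(fit(train, param), valid)
        if best is None or score > best[0]:
            best = (score, name, param)
    return best


rng = random.Random(0)
data = make_data(100, rng)
train, valid = data[:100], data[100:]
best_score, best_name, best_param = random_search(train, valid, 30, rng)
print(best_name, best_param, round(best_score, 2))
```

The point of the sketch is that the algorithm and its settings are chosen together from one structured space, based on held-out performance, rather than being fixed by the analyst in advance.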
Naïve Bayes classifier is a basic ML method that can predict class membership probabilities, such as the probability that a given MDD patient is still depressed after 2 years, with the underlying assumption that the effect of an attribute value on a given class is independent of the values of the other attributes. It aims to simplify the computation involved and, in this sense, is considered naïve (Jayant and Safari, 2020). For the present study, we used the Gaussian Naïve Bayes Classifier provided in the scikit-learn package with the var_smoothing hyper-parameter. According to the scikit-learn manual, by using this implementation a researcher need not choose the probability cut off. Several hyper-parameter settings were tried in the preliminary analysis, resulting in no significant differences. Therefore, the default hyper-parameter setting was used (i.e., setting the value of var_smoothing to 1e-9). More details about scikit-learn can be found elsewhere (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB, accessed at 2019-12-10).
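The independence assumption and the role of variance smoothing can be illustrated with a from-scratch Gaussian naïve Bayes in plain Python. This is a didactic sketch, not the scikit-learn implementation used in the study; the smoothing term follows the documented meaning of var_smoothing (a portion of the largest feature variance added to all variances), and the function names and data are hypothetical.

```python
import math


def _variance(col):
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate class priors plus per-class feature means and variances."""
    # Smoothing term: a fraction of the largest overall feature variance.
    eps = var_smoothing * max(_variance(col) for col in zip(*X))
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        variances = [_variance(c) + eps for c in cols]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict_proba(model, x):
    """Posterior class probabilities for one feature vector x."""
    log_post = {}
    for label, (prior, means, variances) in model.items():
        lp = math.log(prior)
        # Independence assumption: per-feature log-likelihoods are simply summed.
        for v, m, s2 in zip(x, means, variances):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        log_post[label] = lp
    top = max(log_post.values())  # subtract the maximum for numerical stability
    unnorm = {k: math.exp(v - top) for k, v in log_post.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


# Hypothetical sum scores on two questionnaires; 0 = healthy, 1 = disorder.
X = [[5, 3], [8, 4], [6, 2], [30, 20], [35, 25], [28, 22]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
p_low = predict_proba(model, [7, 3])
p_high = predict_proba(model, [32, 21])
print(p_low, p_high)
```

Because the classifier returns a probability for every class, the predicted class is simply the one with the highest posterior probability, which is why no separate probability cut off has to be chosen.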
Logistic regression is a classification method used for binary or multinomial outcome variables. Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems (Menard, 2002). We used the R package nnet (R Foundation for Statistical Computing, Vienna, Austria, 2016. https://www.R-project.org/; Ripley et al., 2016).
We computed all models by randomly splitting (50:50) the dataset into a training and a test dataset using Scikit-learn data split (Pedregosa et al., 2011). The training dataset was used to select the best fitting regression model or ML algorithm. For the present study, models were optimized for overall accuracy. Auto-sklearn feature selection and preprocessing were based on the training data. Auto-sklearn selected "multinomial_nb" as its classifier for the binary outcome analysis and "random forest" for the multinomial outcome analyses. Subsequently, we tested and compared the accuracy of how well these models/algorithms predicted outcomes in the test data with a 95% CI (i.e., percentage of correctly predicted individuals). We also tested and compared their balanced accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. For the multinomial outcomes, this was computed using a one-versus-all approach. For each model, we tested the significance of accuracy relative to the no-information rate. The no-information rate is the accuracy that would be obtained if the model always chose the most frequent outcome group (healthy), that is, the proportion of correct predictions when all patients are predicted to be healthy. Auto-sklearn and naïve Bayes classifier were implemented using the Python programming language (Rossum, 1995). For logistic regression, R was used (R Foundation for Statistical Computing, Vienna, Austria, 2016. https://www.R-project.org/; Ripley et al., 2016).
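The comparison of test-set accuracy with the no-information rate can be sketched as follows. The text does not specify how the 95% CI or the p value was computed, so the Wilson score interval and the one-sided exact binomial test used here are assumptions chosen for illustration, as are all function names and the toy labels.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def no_information_rate(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def wilson_ci(k, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion k / n."""
    p = k / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


def binom_p_greater(k, n, p0):
    """One-sided exact p value: P(X >= k) for X ~ Binomial(n, p0), 0 < p0 < 1."""
    def log_pmf(i):
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p0) + (n - i) * math.log(1 - p0))
    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(k, n + 1)))


# Hypothetical test set: 60 healthy and 40 with a disorder; 80 correct predictions.
y_true = ["healthy"] * 60 + ["disorder"] * 40
y_pred = ["healthy"] * 50 + ["disorder"] * 10 + ["disorder"] * 30 + ["healthy"] * 10

n = len(y_true)
k = sum(t == p for t, p in zip(y_true, y_pred))
acc = accuracy(y_true, y_pred)
nir = no_information_rate(y_true)
low, high = wilson_ci(k, n)
p_value = binom_p_greater(k, n, nir)
print(acc, nir, (round(low, 3), round(high, 3)), p_value)
```

A model is only informative in this sense when its accuracy exceeds what would be achieved by labelling everyone as healthy, which becomes a demanding benchmark when healthy outcomes dominate the sample.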

Sociodemographic and clinical characteristics at baseline
Characteristics of the study population are presented in supplementary Table 2. Age at baseline ranged from 18 to 64 years (M = 42.2, SD = 13.1), and 1,975 (66.5%) participants were women. At baseline, 26.8% of the sample suffered from MDD (n = 796), 9.3% of the sample from dysthymia (n = 241), and 43.7% from a (comorbid) anxiety disorder (n = 1,299), of which social anxiety disorder was the most common (18.6%; n = 483). Of the participants in our sample, 46.1% did not meet DSM-IV-TR criteria for a mood or anxiety diagnosis within the preceding 6 months (n = 1,368), of whom 54.2% had never been diagnosed with a psychiatric disorder (n = 742).

Prediction of health status as binary outcome
Figs. 1 and 2 and supplementary material Figure 1 and Table 3 contain the prediction of health status as a binary outcome (i.e., mentally healthy vs. any anxiety or mood disorder) at the 2-, 4-, 6-, and 9-year follow up using either logistic regression, naïve Bayes classifier, or Auto-sklearn. Fig. 1 demonstrates the correctly predicted health status at the 2-year follow up (true negatives and true positives). With optimized overall accuracy, the three methods had different sensitivity and specificity levels. As further demonstrated in Fig. 2, the accuracy values ranged from .75 through .79. Logistic regression, naïve Bayes classifier, and Auto-sklearn were all significantly (p < .001) more accurate than the no-information rate (level of accuracy when only predicting a healthy status). Regarding logistic regression, the level of accuracy was significantly higher when only sum scores, and not individual item scores, were included as predictor variables (predictor Set A; acc .79 [95% CI .76-.81]), compared to logistic regression predictor Set B (acc .75 [95% CI .72-.77]). The level of accuracy of naïve Bayes classifier and Auto-sklearn did not significantly decrease or improve when individual items were added as predictor variables. At 4-, 6-, and 9-year follow up, accuracy values ranged between .73-.78, .71-.77, and .76-.79 for logistic regression, naïve Bayes classifier, and Auto-sklearn, respectively. Of 16 tests per method (of which eight are presented in Fig. 2 and eight in supplementary Table 3), Auto-sklearn had significantly higher accuracy levels than the no-information rate for all tests, compared to eight out of 16 for naïve Bayes classifier and eight out of 16 for logistic regression. Auto-sklearn thus performed adequately within each of the four different datasets.

Prediction of health status as categorical outcome
The results of predicting health status as a categorical outcome (i.e., healthy, mood disorder, anxiety disorder, or comorbid mood and anxiety disorder) at the 2-, 4-, 6-, and 9-year follow up using either Auto-sklearn, naïve Bayes classifier, or logistic regression are shown in Figs. 1, 3, and 4 and in the supplementary material Fig. 1 and Tables 4 and 5. Fig. 1 demonstrates the correctly predicted health status at 2-year follow up (true positives and true negatives). When the models were optimized for overall accuracy, their performance for predicting the disorder categories was low. When predicting with logistic regression, balanced accuracy values were .53 for mood disorders, .62 for anxiety disorders, and .61 for comorbidity. When predicting with Auto-sklearn, balanced accuracy values were .50 for mood disorders, .60 for anxiety disorders, and .61 for comorbidity. Comparatively, these figures were .70 and .66 when predicting a healthy health status with logistic regression and Auto-sklearn, respectively (see Fig. 3 outcome year 2). Mood disorder (n = 91 cases in the test data set) was predicted the least often, resulting in sensitivity values ranging from .00-.32 and specificity values ranging from .89-1.00. Further inspection of Fig. 1 in the supplementary material demonstrates that both logistic regression and Auto-sklearn mostly predicted a healthy health status instead of mood disorders (n = 55 and n = 68, respectively).
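The pattern of low sensitivity and high specificity for a rarely predicted category follows directly from the one-versus-all definitions. The sketch below computes these measures for one category at a time; the labels and counts are hypothetical and are not the study data.

```python
def one_vs_all(y_true, y_pred, positive):
    """Sensitivity, specificity, PPV, NPV, and balanced accuracy for one category.

    The category `positive` is treated as the positive class and all other
    categories are pooled into the negative class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        return num / den if den else float("nan")  # undefined when nothing to divide by

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical test set in which most mood cases are predicted as healthy.
y_true = ["healthy"] * 70 + ["mood"] * 10 + ["anxiety"] * 20
y_pred = (["healthy"] * 65 + ["anxiety"] * 5      # true healthy
          + ["healthy"] * 8 + ["mood"] * 2        # true mood
          + ["anxiety"] * 12 + ["healthy"] * 8)   # true anxiety

mood = one_vs_all(y_true, y_pred, "mood")
healthy = one_vs_all(y_true, y_pred, "healthy")
print(mood)
print(healthy)
```

In this toy example the mood category has perfect specificity simply because it is almost never predicted, while its sensitivity is poor; balanced accuracy averages the two and therefore stays close to .50-.60.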
As further demonstrated in Figs. 3 and 4, the accuracy values when predicting health status at 2-year follow up ranged from .63 to .72. Both logistic regression (acc .70 [95% CI .68-.73]; p = .003) and Auto-sklearn (acc .72 [95% CI .69-.74]; p < .001) were significantly more accurate than the no-information rate when predicting health status with sum scores at 2-year follow up (see Fig. 3), but only Auto-sklearn was significantly more accurate than the no-information rate when individual item scores were also included (acc .71 [95% CI .69-.74]; p < .001; see Fig. 4). Again, the level of accuracy of logistic regression was significantly lower when individual item scores were included as predictor variables (predictor Set B; p > .99), compared to only sum scores (predictor Set A; acc .70 [95% CI .68-.73]; p = .003) when predicting health status at 2-year follow up. Auto-sklearn demonstrated similar predictive performance when using sum scores as well as individual item scores (see Tables 4 and 5 in the supplementary material). Naïve Bayes classifier did not achieve levels of accuracy above the no-information rate. Achieving significantly accurate predictions became more difficult at later follow-ups. None of the models achieved accuracy levels that exceeded the no-information rate when predicting health status at 4-, 6-, and 9-year follow up.
Fig. 1. Percentages of train and test dataset values, as well as those correctly predicted at 2-year follow up, using the three data models. All predictor sets included baseline psychiatric diagnoses and demographic variables. Predictor Set A further includes baseline and 1-year follow-up sum scores. Predictor Set B additionally includes baseline and 1-year follow-up individual items.

Discussion
Our aim was to assess and compare the predictive performances and clinical usefulness of Auto-sklearn, naïve Bayes classifier, and logistic regression to predict mood and anxiety disorders at follow up. Furthermore, we assessed the effects of different sets of predictors. Although we hypothesized that Auto-sklearn would outperform the two other data models, this could not be concluded unequivocally. In fact, only moderate levels of accuracy were found, with correct prediction percentages of up to 79% and 75% when using either binary or categorical outcomes, respectively. Yet, Auto-sklearn outperformed both logistic regression and naïve Bayes when predictor sets included individual item scores. Categorical outcomes were more difficult to predict than binary outcomes, compared to the no-information rate; in particular, mood disorders could not be distinguished well.
Fig. 2. Predicting health status (binary outcome) at 2-, 4-, 6-, and 9-year follow up. All predictor sets included baseline psychiatric diagnoses and demographic variables. Predictor Set A further includes baseline and 1-year follow-up sum scores. Predictor Set B additionally includes baseline and 1-year follow-up individual items. The grey vertical line denotes the no-information rate for the 2-, 4-, 6-, and 9-year outcomes, respectively. Accuracy values were compared to the no-information rate by using a one way ANOVA test of which the p values are as follows: * p < .05; ** p < .01; *** p < .001.
Fig. 3. Predicting health status (multinomial outcome) at 2-, 4-, 6-, and 9-year follow up with baseline and 1-year sum scores (predictor Set A). All predictor sets included baseline psychiatric diagnoses and demographic variables. Predictor Set A further includes baseline and 1-year follow-up sum scores. Predictor Set B additionally includes baseline and 1-year follow-up individual items. PPV denotes positive predictive value. NPV denotes negative predictive value. The grey vertical line denotes the no-information rate for the 2-, 4-, 6-, and 9-year outcomes, respectively. Accuracy values were compared to the no-information rate by using a one way ANOVA test of which the p values are as follows: ** p < .01; *** p < .001.
Fig. 4. Predicting health status (multinomial outcome) at 2-, 4-, 6-, and 9-year follow up with baseline and 1-year sum scores and individual item scores (predictor Set B). All predictor sets included baseline psychiatric diagnoses and demographic variables. Predictor Set B further includes baseline and 1-year follow-up sum scores and individual items. PPV denotes positive predictive value. NPV denotes negative predictive value. The grey vertical line denotes the no-information rate for the 2-, 4-, 6-, and 9-year outcomes, respectively. Accuracy values were compared to the no-information rate by using a one way ANOVA test of which the p values are as follows: *** p < .001.
Our results support those of previous ML studies that reported 60% to 82% of correctly predicted mood and anxiety diagnoses when using a broad spectrum of predictor variables (Bokma et al., 2020;Chekroud et al., 2016;Dinga et al., 2018;Kessler et al., 2016;Lee et al., 2018;Nie et al., 2018). One of these studies used a subset of the NESDA dataset that included patients with depression at baseline and a more extensive set of clinical, behavioral, and biological baseline-only variables in order to predict the course of depression, resulting in accuracy levels of 62-66% (Dinga et al., 2018). A similar study, within a subset of anxiety patients in NESDA (again using an extensive set of predictors), found an accuracy for predicting anxiety recovery of 62% and an accuracy of predicting recovery of all common mental disorders of 63% (Bokma et al., 2020). In contrast to these prior studies, we only used data that could be easily collected in clinical practice, including 1-year follow-up data as predictor variables. Despite our dataset not being as rich and diverse, we achieved a higher overall accuracy, which was significantly higher than the no-information rate (Bokma et al., 2020;Dinga et al., 2018). However, these results cannot be compared easily. Our often higher accuracy values were likely in part due to our inclusion of healthy participants. The predictive performance when predicting the disorder value was similar, and the large proportion of healthy outcomes resulted in unbalanced sensitivity and specificity values when models were optimized for maximum overall accuracy. Prior studies lacked thorough comparisons to (logistic) regression models, and thereby failed to address the additional value of ML methods over "traditional" data-modelling methods.
Previous ML studies in the field of psychiatry used a wide variety of ML methods, ranging from regression trees to gradient boosting machines, methods that were included in Auto-sklearn (Chekroud et al., 2016;Kessler et al., 2016). In line with an earlier study, we found that depending on the predictor set, more complex ML methods do not necessarily result in higher levels of accuracy when predicting future outcomes of mood disorders (Nie et al., 2018). Two previous studies found that when optimized on overall level of accuracy, ML methods were about 1-6% more accurate compared to regression analysis and needed fewer predictor variables when predicting the persistence of mood disorders at a 12-week follow up (Chekroud et al., 2016;Kessler et al., 2016). Although level of accuracy was higher for ML, this difference was not found to be significant in either study (Chekroud et al., 2016;Kessler et al., 2016). Several studies found that ML was of only limited added value in research (Belsher et al., 2019;Christodoulou et al., 2019;van Mens et al., 2020) and clinical usefulness (Tran et al., 2019). Although we did not find any published reviews within the field of psychiatry, within other fields the added value of ML has been notably criticized (e.g., Christodoulou et al., 2019;Desai et al., 2020;Frizzell et al., 2017). However, it is possible that ML does outperform traditional methods when more complex (large) datasets are used (Iniesta et al., 2016;Wang et al., 2018). More advanced ML methods have the capability to distinguish which variables in large datasets are relevant or irrelevant for prediction, whereas traditional (regression) models rely on the researcher or clinician to select variables of interest to a particular analysis. ML therefore requires less human input.
Although regression models sequentially analyze the relationship between variables, ML approaches can iteratively and contemporaneously analyze multiple interacting associations between variables or variable sets. Indeed, ML approaches may potentially be better suited to complex datasets with a large number of predictors, while limiting the risk of overfitting (Lee et al., 2018). These advantages were confirmed by our findings. Auto-sklearn outperformed the other two models when our predictor sets included more variables, that is, when they were more complex.
ML, especially when automated, has the potential for use in mental healthcare. Deciding what information to collect from patients and making predictions on the micro and macro level based on that information are important aspects of a clinician's skill set. This includes predictions regarding suicide risk, violence, the efficacy of treatment options, and the prognoses on the course of disorders (AEgisdóttir et al., 2006). The accuracy of these predictions is of vital importance for individual patients. Two major approaches to predict clinical outcomes can be identified: the clinical and the statistical method. The clinical approach refers to an informal and intuitive process in which the clinician combines and integrates patient data. A clinician's experience, interpersonal sensitivity, and theoretical perspective combined with a patient's characteristics and circumstances determine how that clinician recalls, synthesizes, and interprets all these bits of information (AEgisdóttir et al., 2006). With a statistical approach, statistical methods are applied on objectively measured variables in order to make predictions and prognoses based on probabilities (AEgisdóttir et al., 2006). Two meta-analyses demonstrated that statistical approaches were more accurate than clinical methods (AEgisdóttir et al., 2006;Grove et al., 2000). Our study found that moderate levels of accuracy can be accomplished based on data that can be easily collected in clinical practice, confirming that integrating statistical methods into clinical decision making could provide added benefits. Current mental healthcare is already partly digitalized, and the development of automated digital tools to assist clinicians should be attainable, providing clinicians fast and cheap support in decision making. 
Automated ML can be developed into such a tool because its automated techniques can match or improve upon expert human performance in certain ML tasks, often in a shorter amount of time (Waring et al., 2020). Moreover, Auto-sklearn demonstrated that it can perform even under rigid time and computational resource constraints (Feurer et al., 2015). Automated ML is already demonstrating its usefulness in healthcare practice (Waring et al., 2020).
There are several study limitations that need to be discussed. First, despite the marginal differences between DSM-IV-TR and DSM-5 criteria for mood and anxiety disorders, the diagnostic classifications used in this study were slightly outdated but were chosen to be kept constant during the follow-up waves (Regier et al., 2013). Despite our relatively large sample size, our analyses could not be carried out for each diagnosis separately (e.g., dysthymia, panic disorder, etc.) because the samples would have become too small. Second, in contrast with other studies, we did not replicate our findings with an independent dataset (Chekroud et al., 2016;Nie et al., 2018). Although we made use of a training and testing dataset, it is possible that the results from the ML methods and regression analyses differed in generalizability to other datasets, which could not be assessed with our current study design. Third, NESDA is an observational cohort study, and different types of pharmacological and psychotherapeutic treatment were not taken into account as predictor variables. Fourth, we included both healthy participants and patients, testing concomitantly the prediction of the course and onset of depression and anxiety. The proportion of healthy controls may have influenced the predictive models because their homeostatic responses to internal or external stimuli do not represent that of psychopathologic disorders (Regier et al., 1998). The large proportion of the healthy health status outcomes resulted in unbalanced sensitivity and specificity values when models were optimized to maximum overall accuracy. Fifth, differentiating depression, anxiety, and comorbid disorders as multinomial variables was especially poor and may have been unrealistic because anxiety disorders and depression have overlapping risk factors and high levels of (subclinical) comorbidity (Jacobson and Newman, 2017;Shorter and Tyrer, 2003). 
Sixth, ML may have more added value when the dataset is more complex, such as imaging or genetic data (Iniesta et al., 2016;Lee et al., 2018;Wang et al., 2018). Although our data was easy to collect in clinical practice, it may have lacked the complexity that is needed for ML methods to excel. Finally, because of its automated features, Auto-sklearn acts like a black box, which made it difficult for us to examine which individual features were most predictive. Nevertheless, significant levels of accuracy were achieved when predictor sets included sociodemographic, baseline diagnoses, and self-reported sum scores, which did not significantly improve when variables were added, suggesting that these were the most important predictor variables.
In conclusion, we found that moderately high levels of accuracy could be achieved when predicting dichotomous outcomes with easy-to-collect data. Auto-sklearn did not achieve the highest level of accuracy in every set of predictors, compared to traditional logistic regression and a naïve Bayes classifier. However, it was most consistent regardless of the set of predictor variables, and it outperformed the other models when the predictor sets were more complex (i.e., individual item scores). In time, clinical practice may benefit from integrating next generation automated ML methods into clinical decision making.

Author statement: contributors
BP is principal investigator of NESDA. Author WE wrote and performed the methodology, and wrote and edited the manuscript. CL performed the methodology. WE had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. EG contributed by frequent supervision of the writing of this manuscript. EG and HH contributed by supervision of the methodology. CL, EG, AH, IC, BP, KW, and HH contributed by review and editing of the early stages as well as final stages of the manuscript. All authors had access to the data. All authors have approved the final manuscript.

Role of the funding source
The funding source had no role in the design of this study, its execution, analyses, interpretation of the data, or decision to submit results.

Declaration of Competing Interests
BP has received (non-related) research funding from Boehringer Ingelheim and Jansen Research. All remaining authors report no financial interests or potential conflicts of interest.