Predicting Depression in Community Dwellers Using a Machine Learning Algorithm

Depression is one of the leading causes of disability worldwide. Given the socioeconomic burden of depression, appropriate depression screening for community dwellers is necessary. We used data from the 2014 and 2016 Korea National Health and Nutrition Examination Surveys. The 2014 dataset was used as a training set, whereas the 2016 dataset was used as the hold-out test set. The synthetic minority oversampling technique (SMOTE) was used to control for class imbalances between the depression and non-depression groups in the 2014 dataset. The least absolute shrinkage and selection operator (LASSO) was used for feature reduction and as the classifier in the final model. Data obtained from 9488 participants were used for the machine learning process. The depression group had poorer socioeconomic, health, functional, and biological measures than the non-depression group. From the initial 37 variables, 13 were selected using LASSO. All performance measures were calculated based on the raw 2016 dataset without the SMOTE. The area under the receiver operating characteristic curve and overall accuracy in the hold-out test set were 0.903 and 0.828, respectively. Perceived stress had the strongest influence on the classifying model for depression. LASSO can be practically applied for depression screening of community dwellers with a few variables. Future studies are needed to develop a more efficient and accurate classification model for depression.


Introduction
Depression causes emotional, cognitive, vegetative, and somatic symptoms, which lead to functional impairment in everyday activities [1]. The prevalence of depression is as high as 10.8% worldwide [2], and it is the single most significant contributor to non-fatal health loss globally [3].
Thus far, increasing evidence indicates that genetic [4], neurogenetic [5], biological [6], and environmental [7] factors contribute to depression. In particular, biological factors such as the levels of pro-inflammatory cytokines and brain-derived neurotrophic factor have long been investigated in the field of depression [8][9][10]. However, the presence of such risk factors does not necessarily lead to the future onset of depression. Predictive models capable of indicating who may or may not develop depression are needed. With an emphasis on the practical usefulness of such models in real-world practice, individual-level analyses, rather than group-level analyses, are increasingly important in the field of medicine [11]. Owing to its practical utility, machine learning has received a substantial amount of attention in the field of medicine, including psychiatry [12].
Treatment of individuals with depression is often unsatisfactory. For example, the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study showed that only one third of the total sample entered remission following initial treatment. In that study, less than 30% of patients achieved remission throughout four consecutive therapeutic regimens [13]. The STAR*D study is not the only clinical study of antidepressants for depression; however, given its large scale and longitudinal style, the influence of the STAR*D study continues to this day [14][15][16]. Therefore, it is necessary to intervene before the onset of a depressive disorder. If we can identify who is more likely to suffer from depression in the near term, we can more effectively prevent depression by focusing on those most at risk.
However, most studies have focused on diagnosing and predicting the prognosis of depression in clinical samples [17,18]. In addition, studies with neuroimaging modalities, such as MRI, largely feature an extremely small sample size, typically less than 100 [18].
Some studies have investigated depression in non-clinical samples using modalities other than machine learning. For example, social media has been widely used, particularly in non-clinical adolescents and youths [19][20][21]. These studies reported that social media usage patterns could meaningfully predict the severity or onset of depression. However, social media can overrepresent young people's characteristics. As the age at onset of depression extends from adolescence into the early 40s, across almost all sociocultural contexts [22], solely investigating data from social media would limit its applicability to all age groups.
Recent reviews have suggested that machine learning-based approaches have shown some promise in the diagnosis and treatment of depression [17,18,23]. One of the most promising aspects of machine learning is that it provides individual-level results, rather than group-level estimation, of the risk for depression and/or response to treatment. However, many of the machine learning studies that were included in the above reviews suffer from small sample sizes and a lack of separate test sets. These shortcomings increase the risk of overfitting. In addition, the usefulness of focusing on clinical samples could be limited by the low treatment response rate, as shown by the STAR*D study.
In the present study, we built a predictive model for depression using a machine learning algorithm based on national survey data. Moreover, we identified which variables were the most important for predicting depression.

Participants and Data
The Korea National Health and Nutrition Examination Survey (KNHANES) is an annual nationwide survey that collects a variety of data on health behaviors, the prevalence of chronic diseases, and food and nutrition status. A detailed description of the KNHANES can be found in Kweon et al. [24]. According to guidelines established by the Korean Centers for Disease Control and Prevention (KCDC), depression has been measured every two years since 2014 [25]. We used data from 2014 (n = 7550) and 2016 (n = 8150).
Only participants who responded to questions that focused on depression and its predictive factors were included in this study. All participants received a full explanation of the aims and protocol of the KNHANES and provided written informed consent. All data processing procedures were approved by the Institutional Review Board of the KCDC (2013-12EXP-03-5C).

Depression and Other Variables
The nine-item version of the Patient Health Questionnaire (PHQ-9) was used to measure depression [26]. As suggested by the KCDC [27], the presence of depression was defined as a score of 10 or higher on the PHQ-9.
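The case definition above can be written as a short sketch. It assumes the standard PHQ-9 scoring, in which each of the nine items is rated from 0 to 3 and the ratings are summed; the function names and the example respondent are hypothetical and are not taken from the KNHANES data.

```python
from typing import Sequence

PHQ9_CUTOFF = 10  # threshold stated in the text: a score of 10 or higher


def phq9_total(item_scores: Sequence[int]) -> int:
    """Sum the nine PHQ-9 item ratings (each 0-3), giving a total of 0-27."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be rated 0, 1, 2, or 3")
    return sum(item_scores)


def has_depression(item_scores: Sequence[int], cutoff: int = PHQ9_CUTOFF) -> bool:
    """Label a respondent as 'depression' when the total reaches the cutoff."""
    return phq9_total(item_scores) >= cutoff


# Hypothetical respondent, for illustration only.
example = [1, 2, 1, 1, 2, 1, 1, 1, 0]
print(phq9_total(example), has_depression(example))
```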
Other variables included sociodemographic characteristics (e.g., age, sex, marital status, family income, basic living allowance, and private medical insurance), health (e.g., the prevalence of chronic diseases such as hypertension, diabetes mellitus, and arthritis), quality of life (EuroQol EQ-5D), and laboratory findings (e.g., hemoglobin, hematocrit, white blood cell count, platelet count, blood urea nitrogen level, and urine specific gravity).

Data Preprocessing and Machine Learning
All machine learning processes were conducted using the scikit-learn library in Python 3.7. The 2014 dataset was used as the training and validation sets. Given the unbalanced ratio of depression and non-depression, a synthetic minority oversampling technique (SMOTE) was used [28]. To tune the hyperparameters, 10-fold cross-validation was conducted within the training set. The 2016 dataset was used as a test set to estimate the performance of the classification algorithms built from the 2014 dataset. Categorical variables were converted to dummy variables, whereas continuous variables were transformed into z-scores so that they could be entered into a linear model such as regularized logistic regression.
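The oversampling step can be illustrated with a simplified sketch of the SMOTE idea: each synthetic case is placed at a random point on the line segment between a minority-class case and one of its nearest minority-class neighbors. The function name `smote_sketch`, the neighbor count k = 5, the random choice of base cases, and the toy data are assumptions made for illustration; the implementation and settings actually used in the study are not described in the text.

```python
import random
from typing import List, Sequence

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sketch(minority: Sequence[Vector], n_new: int,
                 k: int = 5, seed: int = 0) -> List[Vector]:
    """Generate n_new synthetic minority cases by neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority cases")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other cases
    synthetic: List[Vector] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors of the base case.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: squared_distance(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbors)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy training data: 4 minority cases versus 10 majority cases.
minority_cases = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n_majority = 10
new_cases = smote_sketch(minority_cases, n_new=n_majority - len(minority_cases))
print(len(minority_cases) + len(new_cases), "minority cases after oversampling")
```

In line with the design described here, such oversampling would be applied to the training data only, so that the test set keeps its original class ratio.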
Regularizing the logistic regression model attenuated the overfitting and allowed the classifying model to learn from the training data, not just copy it. Both L1 regularization (also called the least absolute shrinkage and selection operator (LASSO)) and L2 regularization (also called ridge regression) provide a practical solution for overfitting. In a linear regression model, $y = \omega_0 + \sum_{k=1}^{l} \omega_k x_k$, LASSO uses a regularization term, $\lambda \sum_{k=1}^{l} |\omega_k|$ [29]. As the coefficients of weak predictive variables decrease to zero, LASSO can also be practically used as a feature reduction method.
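For the logistic case, the penalized objective can be written out explicitly. The form below follows the convention used in the scikit-learn documentation for L1-penalized logistic regression, with the outcome coded as $y_i \in \{-1, 1\}$; the sample index $i$, the sample size $n$, and the notation $x_{ik}$ (the value of predictor $k$ for participant $i$) are introduced here for illustration and do not appear in the original text.

```latex
\min_{\omega_0,\,\omega}\;
C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\left(\omega_0 + \sum_{k=1}^{l} \omega_k x_{ik}\right)\right)\right)
+ \sum_{k=1}^{l} \lvert \omega_k \rvert
```

Dividing the objective by $C$ shows that, under this convention, $C$ acts as the inverse of the regularization strength ($\lambda = 1/C$): a smaller $C$ penalizes the coefficients more heavily and drives more of them to exactly zero.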
The regularized logistic regression model has low computing costs and an easy-to-understand algorithm, in contrast to most other machine learning algorithms, which have high computing costs and operate as black-box models.
In this study, we first applied LASSO with the initial 37 contributing variables for feature reduction. Subsequently, we re-entered the resultant 13 variables with non-zero coefficients in the final model. The hyperparameter C, which inversely reflects the strength of the regularization parameter λ, was set to 0.0076. As we used LASSO, the penalty option was set to "l1". Other hyperparameters were left at the default values of the LogisticRegression class in scikit-learn.
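The two-stage procedure can be sketched as follows. This is an illustration on synthetic data, not the study's code: the data generator, the sample sizes, and the value C = 0.05 are assumptions chosen so that the toy example behaves sensibly (the value 0.0076 reported above was tuned on the KNHANES data). The solver is set to "liblinear" as a further assumption, because the default solver depends on the scikit-learn version and the "lbfgs" solver does not support the L1 penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT KNHANES): 37 features, only 3 informative.
n_train, n_test, n_features = 2000, 1000, 37
X = rng.normal(size=(n_train + n_test, n_features))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.8 * X[:, 2] - 2.0
y = (rng.random(n_train + n_test) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# z-score the features using training-set statistics only.
scaler = StandardScaler().fit(X_train)
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)

# Stage 1: L1-penalized logistic regression on all features for selection.
selector = LogisticRegression(penalty="l1", C=0.05, solver="liblinear",
                              random_state=0)
selector.fit(X_train_z, y_train)
selected = np.flatnonzero(selector.coef_[0])  # features with non-zero weights

# Stage 2: refit the same model type on the selected features only.
final_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear",
                                 random_state=0)
final_model.fit(X_train_z[:, selected], y_train)

# Evaluate on the untouched test portion.
test_scores = final_model.predict_proba(X_test_z[:, selected])[:, 1]
auc = roc_auc_score(y_test, test_scores)
print(f"{len(selected)} of {n_features} features kept; test AUC = {auc:.3f}")
```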

Performance Metrics
The area under the receiver operating characteristic curve (AUC) was used as the primary performance metric. Generally, an AUC of 0.8 to 0.9 is considered good, and an AUC above 0.9 is regarded as excellent [30]. Other performance metrics, such as overall accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC), were also calculated. The MCC has the advantage of using all four components of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). As the MCC is a discretized form of Pearson's correlation, its value can also be interpreted in the same way as Pearson's correlation coefficient r [31]. Hence, the MCC ranges from −1 to 1, unlike other performance metrics, which range from 0 to 1. A value of −1 indicates total disagreement between the actual and predicted values, which coincides with 0 for accuracy. A value of 1 indicates complete agreement between the actual and predicted values, corresponding to 1 for accuracy.
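The relationship between the MCC and accuracy can be checked with a short sketch based on the standard MCC formula. The confusion-matrix counts below are invented for illustration and are not results from this study; returning 0 when the denominator is zero is a common convention that is assumed here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical sample of 1000 people, 50 of whom have depression.
perfect = dict(tp=50, tn=950, fp=0, fn=0)          # every case correct
inverted = dict(tp=0, tn=0, fp=950, fn=50)         # every case wrong
all_negative = dict(tp=0, tn=950, fp=0, fn=50)     # always predicts "no depression"

for name, cells in [("perfect", perfect), ("inverted", inverted),
                    ("all negative", all_negative)]:
    print(name, round(accuracy(**cells), 3), round(mcc(**cells), 3))
```

The third case shows why the MCC is informative under class imbalance: a classifier that never predicts depression still reaches high accuracy, while its MCC is 0.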

Participants
After excluding missing cases from the initial 37 variables, 4186 of 7550 (55.4%) participants in 2014 and 5302 of 8150 (65.1%) participants in 2016 were included in the machine learning (Table 1). Table 2 shows the differences in the variables between the depression and non-depression groups. The prevalence of the minority class (i.e., depression) was 6.16% (584 out of 9488) in the total sample, 6.45% (270 out of 4186) in the 2014 dataset, and 5.92% (314 out of 5302) in the 2016 dataset.
The number of older adults (i.e., age ≥ 65 years) was 2074 (21.86%). The depression group was significantly older and had significantly higher proportions of females and of divorced or separated participants than the non-depression group. The depression group had significantly lower values than the non-depression group in the socioeconomic domain, such as the number of houses, the number of private insurance policies, receiving a basic living allowance, and household income. The depression group also had a significantly higher prevalence of chronic diseases such as hypertension, dyslipidemia, cerebrovascular disease, cardiovascular disease, thyroid disease, diabetes mellitus, and arthritis compared to the non-depression group. Regarding the quality of life, the depression group had lower scores than the non-depression group on all five domains of the EQ-5D.

Classifying Performance
As shown in Figures 1 and 2 and Table 3, LASSO showed good classification performance (AUC = 0.903; overall accuracy, sensitivity, and specificity were 0.828). The total number in the confusion matrix of Figure 1 was 5474 because the number of variables was reduced from 37 to 13; accordingly, the number of missing cases decreased. The LASSO model with 13 variables showed a slightly better performance than the model with 37 variables.

Feature Importance
Feature importance was obtained from the magnitude of the coefficients. The variables with the greatest importance were perceived stress, subjective health, anxiety/depression in the EQ-5D, and divorced/separated status (Table 4).
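Ranking by coefficient magnitude can be sketched as below. Comparing magnitudes is reasonable here because the continuous predictors were converted to z-scores before fitting. The feature names and coefficient values are placeholders invented for illustration; they are not the values reported in Table 4.

```python
from typing import Dict, List, Tuple


def rank_by_abs_coefficient(coefs: Dict[str, float]) -> List[Tuple[str, float]]:
    """Drop zero coefficients, then sort the rest by absolute value, largest first."""
    nonzero = {name: weight for name, weight in coefs.items() if weight != 0.0}
    return sorted(nonzero.items(), key=lambda item: abs(item[1]), reverse=True)


# Placeholder coefficients (hypothetical, not from the fitted model).
hypothetical_coefs = {
    "feature_a": 0.42,
    "feature_b": -0.77,  # a negative weight can still be the most influential
    "feature_c": 0.0,    # removed by the L1 penalty
    "feature_d": 0.10,
}

ranking = rank_by_abs_coefficient(hypothetical_coefs)
for name, weight in ranking:
    print(f"{name}: {weight:+.2f}")
```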

Discussion
We built a machine learning-based model for predicting future depression. The AUC (0.903), overall accuracy (0.828), sensitivity (0.828), and specificity (0.828) showed that this model could be practically used for screening community-dwelling individuals who may develop depression.
In the final set of variables, perceived stress was the strongest predictor of depression. Stress is generally categorized as either eustress or distress. Eustress represents positive aspects of stress, whereas distress refers to its negative aspects. Perceived stress measures distress by using questions such as "In the last month, how often have you felt nervous and stressed?" The negative effects of stress have a well-documented relationship with the pathophysiology of psychiatric disorders, such as depression [32,33]. As most screening instruments for depression do not contain the term "stress," perceived stress should be included in screenings of community-dwelling individuals. Moreover, subjective health was ranked as the second most predictive variable for classifying depression. The concept of subjective health reflects the quality of life or well-being [34,35]. Subjective health plays an important role in the pathophysiology of depression [36]. Although depression might contribute to perceived stress and poor subjective health, these factors should be considered important for the early detection of depression.
Our study had several strengths. First, we built a model to classify depression among community dwellers. Although depression causes substantial disability, the treatment of clinical depression is difficult [13]. Hence, early screening and detection of depression among community dwellers are particularly important, and many countries have focused on screening for depression in community settings before the clinical stages of the disease [37,38]. Thus, we believe our model could be practically used in community mental health institutions for accurate and prompt screening of depression.
Second, we used various types of variables. As depression is based on a complex interaction among biopsychosocial variables [39][40][41], clinicians must utilize the possible correlates of depression to improve classification. We included peripheral biomarkers (e.g., thyroid hormone, hemoglobin, white blood cells, platelets, aspartate aminotransferase, and alanine aminotransferase), psychosocial functioning (e.g., EQ-5D), and sociodemographic variables (e.g., age, sex, marital status, educational level, and economic status) to classify depression.
Third, we used LASSO to reduce features and build a final model to classify depression. We found that a model with fewer variables resulted in a performance comparable to one with more variables. We believe that practicality is necessary for such a machine learning model, and from a practical perspective, a questionnaire with too many questions might not be suitable for use in routine screening settings. If the performance between the two models is not substantially different, one with fewer variables could be practically used with the benefits of a short screening time and effort. As we developed this model for use in community health institutions, rather than higher-level facilities, we presumed that low computing costs with fewer variables are an important point. The reasonable computing costs of LASSO facilitate its deployment in community health institutions.
Fourth, we used the 2014 dataset as the training set and the 2016 dataset as the test set, rather than randomly splitting the pooled data into training and test sets. We wanted to test whether an algorithm built from past data (i.e., the 2014 dataset) could be applied to future data (i.e., the 2016 dataset). The frequency or severity of the variables is likely to change from one survey wave to the next. For an algorithm to remain useful in the real world over time, it should be robust to future data. In addition, there were statistical differences in many of the variables between the 2014 and 2016 datasets, whereas there was no statistical difference in the severity of depression between the two datasets. We interpreted these results mainly in terms of sample size and standard deviation. Generally, as the total sample size increases, the p-value decreases [42]. As the sample size was large (n = 9488), even negligible differences were statistically significant (p < 0.05). Moreover, as the standard deviation (i.e., the degree of spread) increases, the p-value increases [43]; thus, the non-significant difference in the severity of depression (i.e., PHQ score) likely resulted from a high standard deviation. As the participants of this study were from the general population, the distribution of the PHQ score would be severely positively skewed, which is associated with a high standard deviation.
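The dependence of the p-value on sample size and standard deviation can be illustrated with a simple two-sample z-test. The sketch assumes two groups of equal size with a common standard deviation and a normal approximation; the function name and the numbers are hypothetical and do not correspond to any variable in the KNHANES datasets.

```python
import math


def two_sample_p_value(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    # For a standard normal variable, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Same small mean difference (0.1) under three hypothetical conditions.
p_small_n = two_sample_p_value(mean_diff=0.1, sd=1.0, n_per_group=100)
p_large_n = two_sample_p_value(mean_diff=0.1, sd=1.0, n_per_group=5000)
p_large_sd = two_sample_p_value(mean_diff=0.1, sd=5.0, n_per_group=5000)

print(f"small n, sd = 1: p = {p_small_n:.4f}")
print(f"large n, sd = 1: p = {p_large_n:.2e}")
print(f"large n, sd = 5: p = {p_large_sd:.4f}")
```

With the mean difference held fixed, enlarging the sample pushes the p-value below 0.05, whereas a larger spread pushes it back above that threshold.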
This study had several limitations. First, although we included biopsychosocial factors for depression, neuroimaging and genetic variables were not available. Neuroimaging markers, such as structural volumes and functional activity, have long been used to classify depression [44,45]. Genetic studies have also provided information for understanding and classifying depression [4]. As this study sought to create a prompt and accurate tool to classify depression, such expensive tests do not seem applicable for a screening test. Nonetheless, we should consider whether biological factors are, indeed, helpful for discriminating depression. For example, a previous study revealed that the singular use of biomarkers to predict depression prognosis resulted in a poor performance (AUC < 0.6) [46]. The small effects of biological factors were confirmed in our study; only blood urea nitrogen was retained in the final model after LASSO. Second, due to the limited sample size, we could not subdivide the study population by age group (e.g., youth, middle-aged adults, and older adults); instead, we grouped all ages to build a machine learning model. Given the different contributors to depression across different age groups [47,48], future studies with larger sample sizes are needed. Third, the survey data may not sufficiently reflect respondents' interpersonal relationships. For example, a recent study revealed that Facebook entries predicted future clinical depression [49]. Although the sample size was small (n = 683), and the outcome measure was only moderately predictive (AUC = 0.69 to 0.72), such an approach should be used to supplement future surveys and help construct a more comprehensive dataset.
In summary, we successfully built a model for classifying depression using the LASSO algorithm and sociodemographic, psychosocial, and laboratory data obtained from community dwellers. We believe that this model may help improve the accuracy of depression screening among community-dwelling individuals.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study. All participants received a full explanation of the aims and protocol of the KNHANES and provided written informed consent.
Data Availability Statement: The Korea National Health and Nutrition Examination Survey (KNHANES) is an annual nationwide survey that collects a variety of data on health behaviors, the prevalence of chronic diseases, and food and nutrition status. We used data from 2014 and 2016. Data are available in a publicly accessible repository. The data in this study are available in Kaggle at https://www.kaggle.com/seoeuncho/predicting-depression-in-community-dwellers (accessed on 20 June 2021).