Prediction models for high risk of suicide in Korean adolescents using machine learning techniques

Objective Suicide in adolescents is a major problem worldwide, and a previous history of suicidal ideation or attempt is one of the strongest predictors of future suicidal behavior. The aim of this study was to develop prediction models that identify Korean adolescents at high risk of suicide (defined as a history of suicidal ideation or attempt in the previous year) using machine learning techniques. Methods A nationally representative dataset from the Korea Youth Risk Behavior Web-based Survey (KYRBWS) was used (n = 59,984 middle and high school students in 2017). Classification was performed using logistic regression (LR), random forest (RF), support vector machine (SVM), artificial neural network (ANN), and extreme gradient boosting (XGB). Results A total of 7,443 adolescents (12.4%) had a previous history of suicidal ideation/attempt. In the multivariable analysis, sadness (odds ratio [OR], 6.41; 95% confidence interval [95% CI], 6.08–6.87), violence (OR, 2.32; 95% CI, 2.01–2.67), substance use (OR, 1.93; 95% CI, 1.52–2.45), and stress (OR, 1.63; 95% CI, 1.40–1.86) were associated factors. With 26 variables as predictors, the accuracy of the machine learning models in predicting high risk of suicide was comparable with that of LR; accuracy was highest for XGB (79.0%), followed by SVM (78.7%), LR (77.9%), RF (77.8%), and ANN (77.5%). Conclusions The machine learning techniques showed performance comparable with that of LR in classifying adolescents with a previous history of suicidal ideation/attempt. These models will hopefully serve as a foundation for decreasing future suicides by enabling early identification of adolescents at risk of suicide and modification of risk factors.


Introduction
In South Korea, suicide in adolescents has been emerging as a major public health problem. The suicide rate among adolescents has increased annually and is not only one of the highest but also the most rapidly increasing among Organization for Economic Cooperation and Development (OECD) countries.
Although several studies have identified risk factors for suicide [1][2][3][4][5], a recent meta-analysis revealed that the ability to predict suicidal behavior has remained limited [6]. New applications of machine learning techniques are gaining attention as a way to identify suicide risk in various clinical settings [7]. Passos et al. classified individuals with a history of suicide attempt among patients with mood disorders based on demographic and clinical data [8]. Oh et al. distinguished suicide attempters from non-attempters among patients with depression or anxiety disorders by applying an ANN to multiple psychiatric scales and sociodemographic data [5]. Using general characteristics and insurance data from the National Health Insurance Service cohort in Korea, one recent study analyzed the probability of death by suicide [9].
Since previous suicidal ideation/attempt is one of the strongest predictors of future suicidal behavior and death by suicide [6], it is important to identify adolescents who have a history of suicidal ideation/attempt. The purpose of this study was therefore to establish prediction models for high risk of suicide in Korean adolescents using machine learning techniques.

Data collection and preparation
Data used in this study were obtained from the Korea Youth Risk Behavior Web-based Survey (KYRBWS) XIII, conducted in 2017. The KYRBWS is a self-administered online survey, and it was approved by the Institutional Review Board (Certificate Number: 11758) of the Korea Centers for Disease Control and Prevention (KCDC).
The survey is intended to capture South Korean adolescents' health-risk behaviors such as smoking, alcohol use, obesity, physical activity, eating habits, injury prevention, mental health, sexual behaviors, oral health, allergic disorders, personal hygiene, internet addiction, and health equity. Participants were provided with identification numbers and were guaranteed anonymity, and all participants completed an online, self-reported questionnaire in a school computer room after the survey had been fully explained. All data used in this study had been fully anonymized before we accessed them. All procedures complied with the terms and conditions of the survey and were performed in accordance with the 7th version of the Declaration of Helsinki, and informed consent was obtained from all participants. The test-retest reliability of the KYRBWS questionnaire has been reported to be stable [10]. The dataset and questionnaire are provided, together with guidelines for calculating health-related indices, through the KCDC online site (http://www.cdc.go.kr/CDC/eng/main.jsp).
In 2017, the KYRBWS dataset included a total of 62,276 adolescents from 799 middle and high schools (response rate: 95.8%), selected using a complex sampling design involving stratification, clustering, and multistage sampling.

Suicide
High risk of suicide, the dependent variable, was defined as having had either suicidal ideation or a suicidal attempt in the previous year. Suicidal ideation was defined as a yes response to the question, "Did you consider suicide in the last 12 months?" and suicidal attempt was defined as a yes response to the question, "Did you attempt suicide in the last 12 months?" Respondents who reported either suicidal ideation or suicidal attempt were assigned to the high risk of suicide group.
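The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `is_high_risk`, the field names, and the toy responses are assumptions, not the variable names used in the KYRBWS dataset.

```python
def is_high_risk(considered_suicide: bool, attempted_suicide: bool) -> int:
    """Return 1 if either 12-month item was answered yes, otherwise 0."""
    return int(considered_suicide or attempted_suicide)


# Hypothetical responses; "ideation" and "attempt" are illustrative field names.
responses = [
    {"ideation": False, "attempt": False},
    {"ideation": True, "attempt": False},
    {"ideation": False, "attempt": True},
    {"ideation": True, "attempt": True},
]

labels = [is_high_risk(r["ideation"], r["attempt"]) for r in responses]
print(labels)
```

Only the respondent who answered no to both items falls outside the high risk group.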

Independent variables
Independent variables included socio-demographic variables (sex, grade, city type, academic achievement, family structure, family socioeconomic status, and education level of father and mother), health-related lifestyle factors (current smoking, current alcohol consumption, substance use, physical activity, obesity, sexual experience, and internet addiction), and psychological stress factors (sadness, stress, self-rated health, sleep satisfaction, self-rated weight, distorted weight perception, school injury, and violence). Comorbidities included asthma, allergic rhinitis, and atopic dermatitis.
School grade was divided into middle school (Grades 1-3, corresponding age 12-15 years) and high school (Grades 4-6, corresponding age 16-18 years). City type was categorized as big cities, small and medium-sized cities, and countryside. Academic achievement was categorized as high, high middle, middle, low middle, and low. Family structure was categorized as having both parents, having either parent, and having neither parent. Family socioeconomic status (SES) was categorized as high, high middle, middle, low middle, and low. Education level of father and mother was categorized as unknown, middle school graduate or less, high school graduate, and college or graduate degree.
Current smoking, current alcohol consumption, and substance use were defined as a yes response to the questions: "Did you smoke or drink alcohol more than once within the last 30 days?" and "Have you ever used any substance or sniffed glue or butane habitually on purpose?" Physical activity was categorized as "active" (vigorous physical activities more than two days among the last seven days) or "inactive." Vigorous physical activities were defined as those that make one sweat or feel breathless for 20 minutes or more in the questionnaire.
Body mass index (BMI) was calculated based on the self-reported height and weight, and was categorized as underweight (< 5th percentile), normal (5-85th percentile), overweight (85-95th percentile), and obesity (≥ 95th percentile or BMI ≥ 25 kg/m²). Self-rated weight was categorized as very fat, fat, normal, thin, and very thin. Distorted weight perception was defined as answering "very fat" or "fat" to the self-rated weight question while the respondent's actual weight was categorized as underweight or normal.
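A minimal sketch of the BMI categorization and the distorted weight perception rule follows. It assumes the thresholds are read as below the 5th percentile for underweight and at or above the 95th percentile (or BMI of 25 kg/m² or more) for obesity, and that the age- and sex-specific BMI percentile is supplied from an external growth reference rather than computed here. The function names and the handling of exact boundary values (for example, exactly the 85th percentile) are illustrative assumptions.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2 from self-reported weight and height."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def weight_category(bmi_value: float, bmi_percentile: float) -> str:
    """Map BMI and its age/sex percentile to one of four categories.

    Assumption: bmi_percentile comes from an external growth reference.
    """
    if bmi_percentile >= 95 or bmi_value >= 25:
        return "obesity"
    if bmi_percentile >= 85:
        return "overweight"
    if bmi_percentile >= 5:
        return "normal"
    return "underweight"


def distorted_weight_perception(self_rated: str, category: str) -> bool:
    """True when a respondent feels fat although classified underweight or normal."""
    return self_rated in {"very fat", "fat"} and category in {"underweight", "normal"}


# Hypothetical respondent: 50 kg, 160 cm, 40th BMI percentile, self-rated "fat".
value = bmi(50, 160)
category = weight_category(value, 40)
print(round(value, 2), category, distorted_weight_perception("fat", category))
```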
Information regarding sexual experience, school injury, and internet addiction was also collected. For sadness, the adolescents were asked, "In the last 12 months, has a feeling of sadness interrupted your daily activities for at least two weeks?" In addition, stress, self-rated health, and sleep satisfaction were categorized in five levels by the extent of these symptoms.

Models to predict high risk of suicide
To prevent learning bias resulting from an imbalanced dataset (the non-suicide group was about 7 times larger than the suicide group in the entire dataset), a balanced dataset was created from the preprocessed data by down-sampling: for the suicide group, an equal number of age- and sex-matched adolescents was selected from the non-suicide group (n = 7,647 for each group) (Fig 1). To limit overfitting, the dataset was split into five equally sized random groups for 5-fold cross-validation; in each fold, one group was used as the test set and the other four groups were used as the training set for the machine learning prediction models. Five machine learning methods were trained: logistic regression (LR), random forest (RF), support vector machine (SVM), artificial neural network (ANN), and extreme gradient boosting (XGB). Optimal parameters for each machine learning method were selected through a grid search (Table 1). All variables used in the models were categorical; hence, each was converted to 0/1 indicator values by one-hot encoding.
The discrimination of LR and of the other machine learning models was compared in terms of sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy in predicting adolescents who had a history of suicidal ideation or attempt. For the test dataset, the area under the receiver operating characteristic curve (AUC) was also calculated for each model to evaluate general prediction performance.
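The preparation steps described above (matched down-sampling, 5-fold splitting, and one-hot encoding) can be sketched with the Python standard library alone. This is a simplified illustration under stated assumptions, not the authors' pipeline: the function names, field names, and toy records are hypothetical, matching is done by exact age and sex strata, and model training and grid search are omitted.

```python
import random
from collections import defaultdict


def downsample_matched(records, label_key="high_risk", match_keys=("age", "sex"), seed=0):
    """Within each (age, sex) stratum, keep equal numbers of cases and controls."""
    rng = random.Random(seed)
    cases, controls = defaultdict(list), defaultdict(list)
    for rec in records:
        stratum = tuple(rec[k] for k in match_keys)
        target = cases if rec[label_key] == 1 else controls
        target[stratum].append(rec)
    balanced = []
    for stratum, case_list in cases.items():
        pool = controls.get(stratum, [])
        n = min(len(case_list), len(pool))  # drop cases that have no match
        balanced.extend(rng.sample(case_list, n))
        balanced.extend(rng.sample(pool, n))
    return balanced


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def one_hot(records, columns):
    """Expand each categorical column into one 0/1 indicator per observed level."""
    levels = {c: sorted({rec[c] for rec in records}) for c in columns}
    names = [f"{c}={level}" for c in columns for level in levels[c]]
    rows = [
        [1 if rec[c] == level else 0 for c in columns for level in levels[c]]
        for rec in records
    ]
    return names, rows


# Toy data (hypothetical): 80 records, 10 of them labeled high risk.
toy = [
    {
        "age": 15 + i % 2,
        "sex": "F" if i % 4 < 2 else "M",
        "stress": ["low", "mid", "high"][i % 3],
        "high_risk": 1 if i % 8 == 0 else 0,
    }
    for i in range(80)
]

balanced = downsample_matched(toy)
folds = list(k_fold_indices(len(balanced), k=5))
names, rows = one_hot(balanced, ["sex", "stress"])
print(len(balanced), [len(test) for _, test in folds], names)
```

In this sketch each encoded row contains exactly one 1 per original column, which is the property that lets purely categorical survey items feed models that expect numeric input.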
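For reference, the evaluation measures named above can be computed from a confusion matrix as in the following sketch. The labels and scores are toy values, the 0.5 decision threshold is an assumption, and the AUC is obtained from the pairwise (Mann-Whitney) formulation rather than from any particular library; the sketch also assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def diagnostic_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, PPV, NPV, and accuracy from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = diagnostic_metrics(*confusion_counts(y_true, y_pred))
print(metrics, auc(y_true, scores))
```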

Statistical analysis
Results are presented as percentages for categorical variables and as means (± standard deviation) for continuous variables. Categorical and continuous variables were compared between adolescents with and without risk of suicide using the chi-square test and Student's t-test, respectively. Multivariate regression analysis with backward stepwise selection was used to identify factors associated with previous suicidal ideation or attempt.
The statistical analysis, machine learning modeling, and evaluation of diagnostic performance were carried out using the open-source software Python version 3.6.0. P-values of less than 0.05 (two-sided) were considered significant.
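To make the odds ratios and confidence intervals reported in this study easier to interpret, the sketch below computes an unadjusted odds ratio with a Wald-type 95% confidence interval from a 2x2 table. This is a simplification: the study itself used multivariable regression with backward stepwise selection, which adjusts for other factors, and the counts below are hypothetical rather than taken from the KYRBWS data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases (all assumed > 0).
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for one binary factor versus the high risk outcome.
odds_ratio, lower, upper = odds_ratio_ci(a=40, b=60, c=20, d=80)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes 1 corresponds to a two-sided P-value below 0.05 for that factor in this unadjusted setting.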

Results
The clinical characteristics of the 59,984 subjects with valid information regarding previous history of suicidal ideation/attempt are summarized in Table 2. The high risk of suicide group showed higher proportions of girls, low school grade, low academic achievement, not living with both parents, low family SES, low parental education level, current smoking, current alcohol drinking, substance use, physical inactivity, sexual experience, internet addiction, sadness, high stress, poor self-rated health, low sleep satisfaction, high self-rated weight, distorted weight perception, experience of school injury and violence, and comorbid diseases (asthma, allergic rhinitis, atopic dermatitis).
For the test dataset, the confusion matrix and receiver operating characteristic (ROC) curve show that the diagnostic performance of the machine learning techniques is comparable with that of LR (Table 4 and Fig 2). XGB showed the best performance, with a sensitivity of 78.5%, specificity of 79.4%, PPV of 79.2%, NPV of 78.7%, classification accuracy of 79.0%, and AUC of 0.863.
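As a quick consistency check on the XGB figures, the PPV, NPV, and accuracy implied by the reported sensitivity and specificity can be recomputed. The sketch assumes the test data contain equal proportions of the two groups (prevalence of 0.5), which is what the down-sampled design suggests but is an assumption here; the function name is illustrative.

```python
def implied_metrics(sensitivity, specificity, prevalence=0.5):
    """PPV, NPV, and accuracy implied by sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": tp + tn,
    }


# Reported XGB sensitivity and specificity; balanced classes are assumed.
xgb = implied_metrics(0.785, 0.794)
print({name: round(100 * value, 2) for name, value in xgb.items()})
```

Under that assumption the implied values agree, to within rounding, with the reported PPV of 79.2%, NPV of 78.7%, and accuracy of 79.0%. Because PPV and NPV depend on prevalence, they would differ in the unmatched population, where about 12% of adolescents were in the high risk group.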

Discussion
Machine learning techniques offer promise for improving risk prediction for suicide. A systematic review found that machine learning predicted self-injurious thoughts and behaviors more accurately than previous studies using traditional statistical methods [7]. Machine learning techniques have advantages over traditional statistical approaches in psychological research [11]. For example, traditional approaches greatly restrict the number of variables and impose linearity on relationships that likely have more complex associations. Machine learning approaches, on the other hand, enable the simultaneous testing of numerous variables and their complex interactions and allow for non-linearity in producing predictive models [11]. The purpose of this study was to develop models to identify adolescents at risk of suicide from a nationally representative survey dataset in Korea using machine learning methods. In this study, we applied LR and several other machine learning algorithms, and XGB showed the best performance in the test dataset, with an accuracy of 79.0% (AUC = 0.863). XGB is highly efficient and flexible and can easily be used on distributed platforms for further computational efficiency [12]. Ensemble learning is also possible by combining XGB with another algorithm, and future studies might show better performance if XGB is combined with various algorithms rather than used as a single-algorithm model. However, the machine learning techniques showed overall diagnostic performance comparable with that of LR. The main reason might be the type of dataset used in the present study. The KYRBWS survey data are composed of general health-risk behaviors, and we arbitrarily selected 26 categorical variables to develop the prediction models. Further study is warranted to explore whether accuracy can be increased using latent variables.
The present study has several limitations. First, the KYRBWS was developed to cover general health-risk behaviors, including psychological status and previous suicidal behavior, which were examined by simple questions and scales. If the survey had included more detailed questions regarding suicidal behavior or psychological status, the performance of the models might have improved. Second, because the models were developed using the KYRBWS dataset, the same diagnostic performance is not guaranteed with other datasets or populations. In the present study, we used matching for the imbalanced outcome and cross-validation to mitigate the problems of "limited generalization" and "overfitting." Despite these limitations, this is the first study to apply machine learning techniques to a large (n = 59,984), nationally representative sample of Korean adolescents.
In conclusion, this study showed that machine learning techniques have the potential to identify Korean adolescents at risk of suicide using a nationally representative survey dataset of general health-risk behaviors. Several machine learning models performed comparably with the conventional LR method and have potential for further development. Establishing accurate prediction models through additional studies would facilitate early screening of high-risk adolescents and correction of modifiable risk factors, so that society can prevent future suicidal behavior and death by suicide.