A Comparison of Random Forest and Decision Tree for Suicide Ideation Classification


Background: Suicide results from a complex interaction of factors. Most classical statistical methods are not sufficient to capture this complexity. With statistical/machine learning, a newer branch of statistics, complex relationships between risk factors and responses can be modeled. Methods: We aimed to identify high-risk groups for suicide using three classification methods: logistic regression (LR), decision tree (DT), and random forest (RF). The prediction accuracy of the models was also compared. This study used data from a cross-sectional study conducted at the Hamadan University of Medical Sciences from 2015 to 2016 to investigate the prevalence of suicidal ideation and related risk factors among university students. The LR, DT, and RF models were used to identify the high-risk group for suicide. Finally, the three models were compared using sensitivity (SE), specificity (SP), and the area under the receiver operating characteristic (ROC) curve. Results: In the training sample, the area under the ROC curve of the DT was greater than that of the LR and RF. In the validation sample, however, the RF model had the best performance and the DT the worst among these methods. Discussion: In this study, the risk factors for suicide differed between men and women. According to the results of the DT, substance abuse, average, general health score, faculty of education, and depression were risk factors for suicidal ideation in both genders. Despair about the future and residence (parents' house/dormitory) were among the factors contributing to suicidal ideation in men, whereas parents' education, interest in the discipline, and anxiety influenced suicidal ideation in women. The results of the RF indicated that depression, general health score, average, anxiety, and substance abuse were important risk factors for suicidal ideation in both genders. In addition, faculty of education and age were risk factors for suicidal ideation in women. Conclusions: In the training sample, the DT performed better, but in the validation sample, the RF model provided better results. The LR was the best model for identifying affected individuals, and the DT and RF were the best models for identifying healthy individuals.


Introduction
The term suicide was first defined in 1642, as a combination of the Latin terms sui and caedere, meaning self and killing, respectively (1). If someone attempts suicide but survives, the behavior is defined as attempted suicide (2), and suicide ideation is defined as the thought of injuring or killing oneself (3). In the last decade, the suicide rate among university students worldwide has increased significantly (4). Suicide is the third leading cause of death in people aged 15 to 24 years (5), and the second leading cause of death among college students (6).
Although suicide is observed in all age groups, it is of greater importance in young adults because of their loss of life expectancy (7,8). The university years are the most vulnerable period for these people because of the pressure to succeed, the financial burden, and the responsibility that comes with the transition to adulthood (9). Suicide is a multidimensional phenomenon that results from a complex interaction of biological, genetic, psychological, and environmental factors (10).
Suicide ideation is closely related to suicide; therefore, identifying the risk factors of suicide ideation can be very important in reducing suicide rates (11)(12)(13)(14)(15). Because the factors that shape behavior are related in complex ways, most classical statistical methods, which assume a predetermined and simple functional form for the relationships between factors, are not sufficiently accurate to capture this complexity. Today, with statistical/machine learning, a newer branch of statistics, complex relationships between factors and responses can be modeled. These methods partition people according to different risk factors and thereby help identify high-risk groups. This study aimed to identify the high-risk group for suicide using three classification methods: logistic regression (LR), decision tree (DT), and random forest (RF). We also evaluated the prediction performance of these models in identifying high-risk individuals.

Data and settings:
This study used data obtained from a cross-sectional study conducted at the Hamadan University of Medical Sciences from 2015 to 2016 to investigate the prevalence of suicidal ideation, suicide attempt, and related risk factors among university students (16). We enrolled students who had passed at least one semester of their education at the university. The response of interest in this study is suicide ideation, which was recorded as a binary variable. The three methods, LR, DT, and RF, were fitted to the data separately by gender. The data were divided into two subsamples, training and validation: the training sample is used to fit the model, and the validation sample is used to test the performance of the trained model. Because the causes of suicide differ between men and women, the risk factors of suicidal ideation were analyzed separately for men and women. The predictors analyzed were age, marital status, residence (parents' house/dormitory), mother's education, father's education, educational level, faculty of education, average, interest in the discipline, despair about the future, substance abuse, general health score, depression, anxiety, boyfriends/girlfriends, emotional breakdown, illegitimate heterosexual and/or homosexual intercourse, cigarette smoking, city, and birth order.
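The partitioning described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not report the split proportion or how the split was drawn, so the 70/30 fraction, the random seed, the "gender" field name, and the function name are all hypothetical.

```python
import random

def split_by_gender(records, train_fraction=0.7, seed=0):
    """Split records into training and validation samples within each gender.

    train_fraction and seed are illustrative assumptions, not values
    reported by the study.
    """
    rng = random.Random(seed)
    splits = {}
    for gender in sorted({r["gender"] for r in records}):
        group = [r for r in records if r["gender"] == gender]
        rng.shuffle(group)  # random order before cutting
        cut = int(round(train_fraction * len(group)))
        splits[gender] = {"training": group[:cut], "validation": group[cut:]}
    return splits

# Toy records (hypothetical): 10 men and 20 women with a binary response.
records = (
    [{"id": i, "gender": "male", "ideation": i % 2} for i in range(10)]
    + [{"id": 100 + i, "gender": "female", "ideation": i % 2} for i in range(20)]
)
splits = split_by_gender(records)
```

Each model is then fitted on the training part of one gender and assessed on the validation part of the same gender.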

Statistical models:
Logistic Regression
LR is a modeling technique that describes the relationship between multiple predictor variables and a binary dependent variable. LR is one of the most commonly used methods in the analysis of epidemiological data when the dependent variable is dichotomous (17). One reason for its popularity is that the estimated probabilities lie in the range of zero to one and that the model describes the S-shaped combined effect of several risk factors on disease risk (17). The logit (log odds) is a linear function of the predictors (18).
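In standard notation, this linear relation for the logit can be written as below. The symbols are assumptions introduced here for illustration: p is the probability of suicidal ideation, x_1, ..., x_k are the predictors, and \beta_0, ..., \beta_k are the regression coefficients.

```latex
\mathrm{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k ,
\qquad
p = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\right)\right]}
```

The second expression is the first solved for p; it gives the S-shaped curve and keeps the fitted probability between zero and one.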

Decision Tree
In this analysis, the response variable is dichotomous. The model was designed for quantitative variables but is applicable to any type of variable. DTs are among the most powerful classification algorithms and have gained popularity with the growth of data mining in various fields. Popular DT algorithms include Quinlan's ID3, C4.5 (19,20), C5, and CART (21). The technique recursively partitions the observations into branches to construct a tree that improves predictive accuracy. At each step, a splitting criterion identifies a variable and an associated threshold that divide the input observations into two or more subgroups; the aim is to find the variable and threshold pair that maximizes the homogeneity (order) of the resulting subgroups. The criteria used include Information Gain in the ID3, C4.5, and C5 trees, the Gini index in CART, and the Chi-Square test in CHAID (22). This process is repeated on each leaf node until the complete tree is constructed. The CART algorithm is one of the best-known diagnostic and predictive classification methods in the medical sciences (23). In the CART model, the classification tree is pruned on the basis of cost-complexity. The appealing properties of DTs are that they are easy to interpret, handle both numerical and categorical data, and make no assumptions about the nature of the data (24).
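The search for a variable and threshold pair can be illustrated with the Gini index used by CART. The sketch below is a simplified example for a single numeric predictor, not the implementation used in the study; the function names and the toy depression scores are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels (0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the split 'value <= threshold'
    with the lowest weighted Gini impurity, or None if no split is possible."""
    n = len(values)
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Toy data (hypothetical): a depression score and a binary ideation label.
depression = [2, 3, 5, 8, 9, 11]
ideation = [0, 0, 0, 1, 1, 1]
split = best_threshold(depression, ideation)
```

A full tree repeats this search over every predictor at each node and then recurses into the two subgroups.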

Random forest algorithm
Breiman introduced the RF classification algorithm in 2001 (25). The RF is an ensemble of unpruned classification or regression trees (26). To construct an RF, a bootstrap sample is first drawn from the data, and recursive partitioning is applied to this sample. At each node, q predictors are randomly selected from the p available predictors. The recursive partitioning is run to completion, which forms a tree. These steps are repeated until a forest is formed. The forest classifies an observation by the majority vote of all its trees (27). In general, RF performs significantly better than single-tree classifiers such as CART and C4.5 (24). Unlike a single tree, an RF is too large to interpret directly. One way to summarize the information it contains is to identify the forest's most important predictors.
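The construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the Gini-based split search, the toy data, and the values chosen for the number of trees and for q are all hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (feature, threshold) among the given features, or None if no
    split lowers the weighted Gini impurity of the node."""
    best, best_score, n = None, gini(labels), len(rows)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, q, rng):
    """Grow an unpruned tree, sampling q of the p predictors at each node."""
    features = rng.sample(range(len(rows[0])), q)
    split = best_split(rows, labels, features)
    if split is None:  # pure node, or the sampled predictors do not help
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], q, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], q, rng))

def predict_tree(node, row):
    """Follow the splits until a leaf (a plain class label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def grow_forest(rows, labels, n_trees=25, q=1, seed=0):
    """Grow each tree on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], q, rng))
    return forest

def predict_forest(forest, row):
    """Classify by the majority vote of all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Toy data (hypothetical): two predictors; the label depends on the first one.
rows = [[x, x % 3] for x in range(20)]
labels = [1 if x >= 10 else 0 for x in range(20)]
forest = grow_forest(rows, labels, n_trees=15, q=2, seed=1)
prediction = predict_forest(forest, [18, 0])
```

In this toy call q equals p, so every predictor is considered at every node; choosing q smaller than p adds the extra randomization between trees that distinguishes an RF from plain bagging.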
Evaluation of model performance: The results derived from the training and validation samples were evaluated using sensitivity (SE), specificity (SP), and the area under the curve (AUC).
The accuracy of diagnostic tests is assessed with two conditional probabilities: sensitivity and specificity (28). Sensitivity is the ability of a test to correctly identify those who have the disease (or characteristic) of interest (29). Specificity is the ability of a test to correctly identify those who do not have the disease (or characteristic) of interest (29). The receiver operating characteristic (ROC) curve plots sensitivity against 1-specificity over a series of cutoff points and so displays the complete picture of the trade-off between the two. The area under the ROC curve is a measure of the intrinsic validity of the diagnostic test (30).
Software: The data were analyzed with the 'tree', 'RF', and 'ggplot2' packages of R software version 3.5.1 (31)(32)(33). The control argument in the tree package was used to set the minimum number of observations per node, the smallest size of each node, and the within-node deviance to 10, 20, and 0.01, respectively. In addition, a cut-off point of 0.3 was used in the DT for allocating the response to each node; that is, if the response value at a final node is less than 0.3, then "no" is allocated to that node.
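The three performance measures and the 0.3 cut-off described above can be illustrated with a short Python sketch. It is not the code used in the study (the analysis was run in R); the function names and the toy scores are hypothetical, and the AUC is computed here with the pairwise-comparison (Mann-Whitney) formulation, which is equivalent to the area under the empirical ROC curve.

```python
def classify(scores, cutoff=0.3):
    """Allocate 'yes' (1) when the predicted value reaches the cut-off, else 'no' (0)."""
    return [1 if s >= cutoff else 0 for s in scores]

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Assumes both classes are present in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Share of (positive, negative) pairs in which the positive case has the
    higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical values): 1 = suicidal ideation, 0 = none.
y_true = [1, 1, 0, 0, 0]
scores = [0.80, 0.25, 0.40, 0.10, 0.05]
y_pred = classify(scores, cutoff=0.3)
se, sp = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```

Lowering the cut-off below the conventional 0.5 trades specificity for sensitivity, which can matter when the positive class is rare, as with the roughly 12% prevalence of suicidal ideation in this sample.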

Results
The study population consisted of 1259 students; after data cleaning, 1247 individuals were included in the final analysis. Of the 1247 participants, 491 (39.4%) were men and 756 (60.6%) were women.
The age of the students ranged from 18 to 49 years, with a mean of 22.52 (3.33). Of the participants, 1044 (83.8%) were single, 160 (12.8%) were married, and 43 (3.4%) were divorced. Of the university students who participated in the study, 146 (around 12%) had suicidal ideation during the past year and 63 (5%) had attempted suicide at least once in the past year. Suicide ideation was more common among men than among women. For the analyses, LR, DT, and RF were fitted to suicidal ideation separately by gender.
The LR results indicated that in women, father's education, depression, and substance abuse had a significant relationship with suicidal ideation. The DT for women's suicidal ideation has 16 terminal nodes. Of the 21 features entered into the decision tree, depression, age, average, substance abuse, mother's education, general health score, interest in the discipline, faculty of education, father's education, and anxiety are the important attributes in the DT. The misclassification error rate is 0.58. A total of 500 trees were used to construct the RF of suicide ideation in women.
According to the Gini importance index, the most important predictors were, in order, depression, general health score, average, anxiety, faculty of education, age, and substance abuse.

Men's suicide
The LR results showed a significant relationship between suicidal ideation in men and age, marital status, substance abuse, and depression. The DT for suicidal ideation in men has 14 terminal nodes. Of the 21 features entered, depression, substance abuse, average, faculty of education, despair about the future, residence (parents' house/dormitory), and general health score were selected by the DT. The misclassification error rate is 0.079. The important predictors based on the Gini importance index in the RF were depression, general health score, anxiety, and substance abuse.
Comparing the results of logistic regression, depression and substance abuse have a significant relationship with the response in both genders. Comparing the DTs, five features (faculty of education, general health score, average, substance abuse, and depression) are the effective factors in both genders. The results of the RF indicated that depression, general health score, anxiety, substance abuse, and average were the important variables.
The area under the ROC curve, the sensitivity, and the specificity of the three methods (LR, DT, and RF), by gender, are shown in Table 1. In both genders, the area under the ROC curve of the DT in the training sample is greater than that of the LR and RF. That is, on the data used for fitting, the DT performs well, and the RF has the lowest training-sample performance among the three methods. In the validation sample, the RF model has the best performance and the DT has the weakest performance among the three methods; in a new sample, therefore, RF performs better than the other two methods. The sensitivity of LR in both the training and validation samples is higher than that of the DT and RF. Among the three methods, RF has the highest specificity, followed by the DT.

Discussion
The purpose of this study was to identify significant gender-specific risk factors associated with suicidal ideation.
In this study, the risk factors for suicide differed between men and women. According to the results of the DT, substance abuse, average, general health score, faculty of education, and depression were risk factors for suicidal ideation in both genders. Despair about the future and residence (parents' house/dormitory) were among the factors contributing to suicidal ideation in men, whereas parents' education, interest in the discipline, and anxiety influenced suicidal ideation in women. The results of the RF indicated that depression, general health score, average, anxiety, and substance abuse were important risk factors for suicidal ideation in both genders. In addition, faculty of education and age were risk factors for suicidal ideation in women. The relationship between depression and suicide among students has been reported in other studies (34), and the relation between depression and suicidal ideation is well known (35,36). Our results were also consistent with studies that have reported a relationship between anxiety and suicide (37)(38)(39). The relationship between substance abuse and suicidal ideation in medical students has likewise been confirmed in previous studies (40,41). The second purpose of this study was to compare the performance of LR, DT, and RF based on sensitivity, specificity, and the area under the ROC curve. The predictive power of the classifiers was assessed by the area under the ROC curve (AUC). For women's suicidal ideation, the area under the ROC curve of the DT in the training sample was higher than that of the LR and RF; in a similar study, the area under the ROC curve of the DT was better than that of LR and discriminant analysis (42). In the validation sample, however, the area under the ROC curve of the RF was higher than that of the LR and DT; in the study by Tian, the area under the ROC curve was likewise better for RF than for DT and LR (43).
The sensitivity ranged from 0.2 (the validation sample of the RF for women's suicidal ideation) to 0.85 (the training sample of the LR for women's suicidal ideation).
The minimum specificity was 0.78 (the validation sample of the LR for men's suicidal ideation) and the maximum specificity was 0.99 (the validation sample of the RF for women's suicidal ideation and the training sample of the RF for men's suicidal ideation). The area under the ROC curve of the DT in the training sample was better than that of the LR and RF, whereas the area under the ROC curve of the RF in the validation sample, and the specificity of the RF in both the validation and training samples, were better than those of the other two methods. For women's suicidal ideation, the sensitivity of the LR model in the training and validation samples was better than that of the other two methods. These results were also confirmed for men's suicidal ideation.
In the training sample, although the area under the curve of the DT was higher than that of the LR, the sensitivity of the DT and RF for suicidal ideation in men and women, in both the training and validation samples, was lower than that of the LR.
The LR therefore detects individuals with suicidal ideation better than the RF and DT. In terms of specificity, however, the DT and the RF perform better than the LR in both the training and validation samples; that is, the DT and RF are better than the LR at identifying people who do not have suicidal ideation.

Conclusion
In binary classification, the DT has better performance in the training sample, but in the validation sample, the RF provides better results. Among the three classifiers, LR is the best model for identifying affected individuals, and the DT and the RF are the best models for identifying healthy individuals. Comparing the results of the three methods (LR, DT, and RF) for suicidal ideation in men and women, depression and substance abuse were identified as important risk factors.

Availability of data and materials
Readers who wish to access the data can send an email to poorolajal@umsha.ac.ir.

Competing interests
The authors declare that they have no competing interests.