Identifying populations at ultra-high risk of suicide using a novel machine learning method

Background: Targeted interventions for suicide prevention rely on adequate identification of groups at elevated risk. Several risk factors for suicide are known, but little is known about the interactions between risk factors. Interactions between risk factors may aid in detecting more specific sub-populations at higher risk. Methods: Here, we use a novel machine learning heuristic to detect sub-populations at ultra high-risk for suicide based on interacting risk factors. The data-driven and hypothesis-free model is applied to investigate data covering the entire population of the Netherlands. Findings: We found three sub-populations with extremely high suicide rates (i.e. > 50 suicides per 100,000 person years, compared to 12/100,000 in the general population), namely: (1) people on unfit for work benefits that were never married, (2) males on unfit for work benefits, and (3) those aged 55 – 69 who live alone, were never married and have a relatively low household income. Additionally, we found two sub-populations where the rate was higher than expected based on individual risk factors alone: widowed males, and people aged 25 – 39 with a low level of education. Interpretation: Our model is effective at finding ultra-high risk groups which can be targeted using sub-population level interventions. Additionally, it is effective at identifying high-risk groups that would not be considered risk groups based on conventional risk factor analysis.


Introduction
In the Netherlands alone, an average of five people die by suicide each day [1]. Every case of suicide marks a personal tragedy, both for the victim and for those left behind. Therefore, it is of utmost importance to implement effective suicide prevention programmes at multiple levels, including interventions directed at the entire population (e.g., public awareness campaigns), interventions targeting high-risk groups or sub-populations (e.g., training gatekeepers among professionals encountering individuals with financial difficulties) and interventions targeting at-risk individuals (e.g., cognitive behavioural therapy for individuals with suicidal thoughts) [2].
Interventions at the second level, targeting sub-populations, require adequate identification and detection of groups at elevated risk of suicide. Multiple studies have been performed to detect risk factors for suicide [3][4][5][6][7]. Not unexpectedly, the most important predictor of death by suicide is a prior non-fatal suicide attempt or prior psychiatric hospitalization [6]. Experiencing stressful life events and mental health problems including depression and substance use problems substantially increase the risk for suicide attempts and suicidal ideation, which in turn increases the risk of suicide [6]. In addition, certain socio-demographic groups are at elevated risk, including but not limited to men, people of middle age, people of lower socio-economic status and people living alone [6,1].
In complex and multifactorial outcomes such as mental illness, risk factors are known to interact or accumulate. For instance, stressful life events may trigger a depressive episode in persons with a genetic vulnerability to depression [8]. To our knowledge, however, little is known about interacting socio-demographic risk factors for suicide. In a hypothetical example, one might expect that unemployment might increase the risk of suicide more for men living alone than for the rest of the population. The detection of relevant interacting socio-demographic risk factors will allow the identification of more specific sub-populations at elevated risk of suicide. This may increase the efficacy of targeted preventive interventions and has the potential to reduce suicide rates.
Machine learning methods offer new possibilities for flexible, datadriven, hypothesis-free and robust investigation of accumulating risk factors for suicide. A recent study performed such analyses using predominantly healthcare data and succeeded in identifying multiple relevant interactions [9]. Risk of suicide was higher, for instance, in men and women who had recently attempted suicide and were not being treated with pharmacotherapy. In a second study, including over 15,000 features (including but not limited to: demographics, diagnostic codes, procedure codes, and medication prescriptions) in the initial model and retaining 117 of them, researchers were able to develop a risk prediction model with acceptable performance parameters to stratify hospital patients by suicide risk [10].
An important limitation of the above studies is their complexity, hampering translation of their results to actionable recommendations for clinical practice. Moreover, as Kirtley et al. have recently emphasized [11], current machine learning methods have limited capabilities to support decisions and interventions at the individual level, as falsepositive rates as well as false-negative rates are typically high. Thus, there is a need for more actionable and transparent machine-learning models to aid detection of high-risk subgroups rather than individuals.
In this paper, we present a new machine learning model that allows for investigation of complex interactions of socio-demographic risk factors whilst retaining interpretability. This model is applied to predict suicide risk groups in a dataset spanning the entire population of the Netherlands over a period of nine years, thereby mitigating sampling bias and sample size limitations. Our model yields detailed and interpretable results to aid the identification of sub-populations of individuals at relatively high risk for suicide, which may aid targeted preventive interventions.

Data
Statistic Netherlands (CBS) is a national administrative authority aiming to collect and provide reliable information that advances the understanding of social issues. CBS maintains a high-quality database containing, among others, socio-demographic and medical information regarding every inhabitant of the Netherlands. Analyses on CBS data are to be performed via a remote access connection to their computational servers. All results are verified prior to release, ensuring compliance with privacy laws.
For the current paper, we included data regarding all inhabitants of the Netherlands on the 31st of December of nine consecutive years (2011 to 2019), adding up to a total of 137,666,515 person years. Of those, 16,417 person years ended by suicide in the year following observation and 137,650,098 person years did not end by suicide in the year following observation.

Features of interest
The following socio-demographic predictor variables were measured on the 31st of December of the year preceding the outcome: sex, age, immigration background, household income, personal income, household wealth or debts, level of education, physical healthcare costs, place in household, marital status, short-term unemployment benefits, longterm unemployment benefits and unfit for work benefits. For details, see Table 1. Categorical variables were one-hot-encoded for use in machine learning analyses, meaning that for each category a new variable was introduced which has value 1 if the individual was in said category and has value 0 otherwise. Continuous variables were split into mutually exclusive response categories (e.g., quartiles) and also one-hot-encoded.

Model
A heuristic algorithm was devised to obtain interacting features which provide additional risk of suicide or reduce the risk. The obtained interaction features were prioritised on statistical significance as well as model improvement. The algorithm comprises four steps.
Step 1: the data is divided into three disjoint partitions: a training set, a validation set and a test set. The training set includes fifty percent of person years ending in suicide (N = 8,214) and one percent of all other person years (N = 1,377,055) and is used to detect significant interactions between features of interest. The validation set includes forty percent of person years ending in suicide (N = 6,512) and one percent of all other person years (N = 1,377,870) and is used to estimate the final logistic regression model. The test set includes ten percent of person years ending in suicide (N = 1,691) and one percent of all other person years (N = 1,375,966) and is used to evaluate the performance of the final model.
Step 2: the algorithm identifies significant interactions between features of interest in the training dataset. For details, see Appendix A. In short, the algorithm defines a main-effects logistic regression model including all features listed in Table 1 (hereafter referred to as basic features). Next, interaction terms are added in an iterative manner. The algorithm looks at combinations of the form "X and Y", where X is a feature already present in the model, and Y is a basic feature. So the new combination feature "X and Y" would have value 1 if both feature X and feature Y have value 1. For each of these combinations, it calculates the rate at which it would improve the log-likelihood. Then we corrected for sub-population size, since larger sub-populations without an underlying effect on suicide risk will still have a large effect on log-likelihood simply due to variance. The significant interactions that came out of this analysis were listed and for the further analyses we focused on interactions of features that had the largest effects and also included at least 200 suicides. This was done because for suicide prevention interventions the primary interest is in sub-populations with a substantial number of suicides. After this, a check was performed to ascertain whether this (interaction of) feature(s) truly improved the model. If it did not, it was removed. The process was stopped when the ratio at which removals needed to be performed exceeded 10% and at least 30 interactions were tested.
Step 3: a logistic regression model was estimated on the validation dataset including all significant interactions detected in step two. As the data in the validation set is disjoint from the training set, the notion of over-fitting is removed and regular test statistics such as t-tests and pvalues can be interpreted.
Step 4: the following performance statistics were computed on the test set: log-likelihood as an indicator of model fit, and area under the receiver operating characteristics curve (AUC) as an indicator of the model's ability to distinguish between those who died by suicide and those who did not.

Statistics
For each significant feature of interest and interaction between two or more features of interest, we report the logistic regression model β parameters, odds ratios and corresponding confidence intervals. For interaction terms, we also report the compound odds ratios (CORs) and their confidence intervals, reflecting the summed effect of features when combined (e.g., exp(β male + β widowed + β male and widowed )). Also reported are the number of suicides in the corresponding sub-populations for the validation set as well as the relative rate in said sets (per 100,000 inhabitants per year), which are corrected for the sampling procedure (number of suicides is scaled up by a factor of 2.5, and number of nonsuicides by a factor of 100).

Table 2
Interaction terms found by the algorithm as tested on the validation set. With corresponding Beta parameters, Odds-Ratios, Compound Odds Ratios, absolute and relative number of suicides within the sub-population within the validation set. Sub-populations with ⩾30 suicides per 100,000 are in bold.  . For confidence intervals of the differences between non-reference groups (i.e. 40-54 vs 10-24), see Appendix C. Among the general population there is a suicide rate of 11.8 per 100,000. When considering relative suicide rates among the subpopulations corresponding to the various features, the highest rate among the basic features is among the people who are unfit for work with a suicide rate of 47.0 per 100,000 on the validation set, with the second highest rate being among the long-term unemployed with a suicide rate of 32.1 per 100,000 on the validation set, and the rest of the sub-populations having rates below 30.0 per 100,000. Table 2 lists all twenty interaction terms included in the final logistic regression model. Of those, seventeen yielded significant effects in the validation dataset (p < 0.05). Among the interaction features there are ten sub-populations identified with relative risks higher than 30.0 per 100,000 on the validation set.

Interaction effects
Broadly, three categories of interacting risk factors can be distinguished (with minor crossover): (1) interactions related to age, (2) interactions related to sex, and (3) interactions related to marital status. Two significant interactions did not fit any of these categories.
Interactions involving age: among people of young working age (25-39 years old), but not in the other age groups, a low level of education is an important risk factor for suicide (

Model performance
The baseline logistic regression model without interaction terms had a log-likelihood of − 12184.54 and an AUC of 0.75. In comparison the logistic regression model with interaction terms had a log-likelihood of − 12119.24 and an AUC of 0.76. See Fig. 1 for the curves themselves.

Discussion
Effective suicide prevention programs include, among others, interventions targeting subgroups of people at particularly high-risk of suicide. Here, we designed a heuristic model to detect such subgroups based on interactions between risk factors, and applied it to data covering the entire population of the Netherlands. We identified three sub-populations at ultra-high risk for suicide, with relative suicide rates of 50/100,000 person years or higher. In addition, we identified several factors that when combined increase the risk of suicide, while in isolation they do not increase the risk of suicide. These risk factors would not be detected using traditional prediction models.
We identified three sub-populations at ultra-high risk of suicide, with social isolation and socio-economic hardship as common denominators. Compared to suicide rates in the general population of the Netherlands (11.8 suicides per 100,000 person years), people who were never married and unfit for work -and among them those with low healthcare costs -were up to 7.4 times more likely to die by suicide (88 suicides per 100,000 person years). Despite the relatively small size of this group in the Dutch population, in 2012-2020 more than 100 suicides (7% of all suicides within that period) occurred in this group each year. The second ultra-high risk group concerns males who are unfit for work, with 59 suicides per 100,000 person years. These findings urge professionals in regular contact with individuals receiving unfit for work benefits, including occupational healthcare professionals, community service providers and municipal workers, to pay particular attention to males and people who were never married. The third ultra-high risk group comprises individuals aged 55-69, who were never married, are living alone and have a relatively low income, with 57 suicides per 100,000 person years. Further studies, including longitudinal and qualitative studies, are needed to investigate how the combination of these specific risk factors culminates in extreme high-risk profiles.
In addition to the extreme high-risk group, we identified several risk factors that increase the risk of suicide only in the presence of other risk factors. First, while neither young age (25-39 years old) nor lower level of education was found to be a risk factor in itself, together they constituted a major risk profile. Among individuals of young adult age, those with a lower level of education presented with a relative suicide rate more than double that of their peers with a medium or higher level of education (20.1 vs. 8.8 suicides per 100,000 person years). Our data does not provide insights into mechanisms that might underlie the elevated risk of suicide among young adults with lower education. In keeping with our prior observation that socioeconomic hardship may be a common denominator, we speculate that, among many factors, job insecurity might play a role: young adults in the Netherlands, and especially those with lower levels of education, are more likely than other age groups to be offered temporary employment [12]. Job insecurity has been linked to poorer mental health [13], which in turn is linked to a higher suicide risk [4]. To substantiate this hypothesis or find alternative explanations, we recommend research into risk factors for suicide in this group, including socio-economic factors, external stressors, psycho-social circumstances and psychological vulnerabilities.
Second, widowhood did not increase the risk of suicide in the general population in our study, yet it did when combined with the known risk factor male sex. Among widowed males, the suicide rate is more than twice the rate observed in general male population. Previous studies including males only have reported a higher risk of suicide among widowed individuals [14][15][16], but to our knowledge the combined risk of widowhood and male gender has not previously been reported. The current study does not allow characterisation of the suicidal process within male widowed individuals. A recent study showed that male widows, compared to female widows, are generally protected from income loss yet are more likely to experience negative emotional consequences such as loneliness and depression [17]. Our findings underline the need for social support for males who lost their partner, and urge training of gatekeepers among professionals encountering these males.
Finally, we wish to draw the readers attention to two risk factors that each appear in a large number of significant interaction terms: (1) being of middle age (55-69 years old) and (2) having never been married. The large number of significant interactions involving these factors suggests risk profiles within the sub-populations of middle-aged individuals and individuals who were never married that differ from risk profiles in the general population.
Several limitations to our approach should be considered when interpreting our findings. First, death by suicide is a relatively rare event, limiting our statistical power to find associations with risk factors. To achieve reliable model performance, we included all suicides that occurred in the Netherlands between 2012 and 2020. We are unable to assess whether results are stable over time. Second, the model is constructed bottom-up. A top-down approach starting with all possible highest-level interactions might allow detection of more high-risk subgroups, however such approaches are also known to generate more false-positives. Third, adding interaction terms to the model improved model performance only slightly (AUC = 0.76 vs. AUC = 0.75). While the validity of the identification of high-risk groups is not affected (AUC between 0.7 and 0.8 is generally deemed 'acceptable'), it does suggest that even with highly complex statistical modelling predicting death by suicide remains challenging. Fourth, we did not have data regarding family history of suicide, nor mental disorder diagnoses. These are both substantial risk factors which might explain some of the associations. Lastly, since suicide rates differ substantially across nations, there might be a limit to generalisability, especially with regard countries with substantially different cultures.
Our approach has many strengths. First, since we sampled from the entire population in a controlled manner, we avoid sampling bias. Second, our model is hypothesis-free, allowing identification of previously unidentified risk groups. Third, our model has flexible settings, allowing the user to adjust the trade-off between good model performance and statistically robust results. Finally, and in contrast to existing machine learning methods such as artificial neural networks, our model is open and readily interpretable.

Conclusions
In summary, we performed a heuristic machine learning method to find interactions. We found disproportionately high suicide rates among people who were never married and received unfit for work benefits, among males who received unfit for work benefits, and among those aged 55-69 who lives alone, were never married and whose household income was low. Additionally, we found high suicide rates among those aged 25-39 with a low level of education and among males who lost their partner. Our findings may have important implications for suicide prevention policies and are generalizable to other (similar) countries.

Funding
This paper was funded by 113 Suicide Prevention, which is in turn funded by the Dutch Ministry of Health, Welfare, and Sport.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Results based on calculations by 113 Suicide Prevention using nonpublic microdata from Statistics Netherlands. Under certain conditions, these microdata are accessible for statistical and scientific research. For further information: microdata@cbs.nl.

A.3. Elaboration flowchart
In what follows we will outline the full details of every step within the global flowchart, further splitting the "Add Interaction" step into the substeps shown in the second flowchart. Start: To start with we specify our hyper-parameters N added , θ, t, and S min whose functions shall be explained as they become relevant. Additionally, we initialize n added = n removed = 0 and T as an empty list. These will be updated throughout the procedure.
We define x → i for i ∈ {1, 2, …, N} to be our one-hot encoded basic features. We define y → i for i ∈ {1, 2, …, L} to be all the features in our model. The amount of basic features, N, is fixed. However, since we will be adding features throughout our model, the total amount of features, L, will vary.
Split data: We split our training set into two subsets: a searching set (80% of cases), and a control set (containing the remaining 20%).

Init. model:
Using the searching set we estimate an initial logistic regression model specified by → is the feature corresponding to "died by suicide" and with the β i being the parameters to be estimated. Estimation is done through log-likelihood maximization via gradient descent methods. Set LL to be equal to the log-likelihood of the model on the control set.  Full results logistic regression on validation set including both basic features and interaction terms. With corresponding Beta parameters, Odds-Ratios, Compound Odds Ratios, absolute and relative number of suicides within the sub-population within the validation set as well as the training set. With N(val)=absolute number of suicides within validation set, N(train)=absolute number of suicides within training set, Rel(val) = relative number of suicides within the validation set (corrected for sampling procedure, per 100,000), Rel(train)=relative number of suicides within the training set (corrected for sampling procedure, per 100,000).
where N p is the total number of cases in our searching set. Note that under the assumption that the "true" value of β n,m on the underlying probability process is 0 (i.e. feature z → m,n is irrelevant) the value of this expression scales to the order of where hyper-parameter t describes the trade-off between optimization of the log-likelihood and statistical significance, with a value of 0 completely prioritizing the former, and a value of 0.5 completely prioritizing the latter. We used t = 0.3. Add argmax|d t (m, n)|to model: We then select and add the corresponding feature to our model by setting y → L+1 = z → m * ,n * and set L←L + 1. We add (m * , n * ) to the list T. We also set n added ←n added + 1. Re-estimate model: We re-estimate the model with the new feature and set LL to the log-likelihood of this new model on the control set.

Check LL:
We check whether or not the performance on the control set has improved by looking at LL − LL old . If this is negative we once again remove the added feature from our model and set n removed ←n removed + 1n added ⩾N added : Here N added functions as a minimum number of iterations before stopping. If we have not yet run that many iterations, we return to the "Add interaction" step. If we have we move on to the next step. We used N added = 30. nremoved nadded ⩾θ: Here θ functions as a minimum amount of false positives before terminating. If the proportion of false positives is less that θ we return to the "Add interaction" step. If it is at least θ we end our algorithm. We used θ = 0.1.