Predicting crime during or after psychiatric care: Evaluating machine learning for risk assessment using the Danish patient registries

Background: Structural changes in psychiatric systems have altered treatment opportunities for patients in need of mental healthcare. These changes are possibly associated with an increase in post-discharge crime, reflected in the growth of forensic psychiatric populations. As current risk-assessment tools are time-consuming to administer and offer limited accuracy, this study aims to develop a predictive model designed to identify psychiatric patients at risk of committing crime leading to a future forensic psychiatric treatment course. Method: We utilized the longitudinal quality of the Danish patient registries, identifying the 45,720 adult patients who had contact with the psychiatric system in 2014, of whom 474 committed crime leading to a forensic psychiatric treatment course after discharge. Four machine learning models (Logistic Regression, Random Forest, XGBoost and LightGBM) were applied over a range of sociodemographic, judicial, and psychiatric variables. Results: This study achieves an F1-macro score of 76%, with precision = 57% and recall = 47%, reported by the LightGBM algorithm. Our model was therefore able to identify 47% of future forensic psychiatric patients, while making correct predictions in 57% of cases. Conclusion: The study demonstrates how a clinically useful initial risk-assessment can be achieved using machine learning on data from patient registries. The proposed approach offers the opportunity to flag potential future forensic psychiatric patients while they are in contact with the general psychiatric system, hereby allowing early-intervention initiatives to be activated.


Introduction
Demand for forensic psychiatric services has seen an increase in Western European countries, occurring concurrently with reforms intended to downsize psychiatric hospital inpatient wards (Jansman-Hart et al., 2011). This association is known as Penrose's Law (Kalapos, 2016) and offers the hypothesis that deinstitutionalisation in the general psychiatric system is associated with the premature discharge of high-risk individuals (Lamb, 2015). Offering these citizens advanced mental healthcare while in contact with the psychiatric system is thus an increasingly pivotal task for Western European countries. Such enhanced treatment may ultimately lead to better quality of life for the individual and the avoidance of future criminal episodes, with the myriad of benefits enjoyed by society in general.
Qualified risk-assessment prior to discharge from psychiatric care hereby emerges as a valuable instrument for identifying individuals for whom a criminal trajectory is likely. The current approach to risk-assessment in psychiatry relies on a combination of structured interviews and clinical decision-making.
Structured risk-assessment interviews remain time-consuming to administer (Viljoen et al., 2010) and have questionable predictive power (Douglas et al., 2017). Such questionnaires have further been reported to be insufficiently specific, limiting their general usefulness (Large et al., 2011). Clinical judgement is additionally limited by decision-making bias (Murray and Thomson, 2010), exacerbated by large patient quantities and fewer resources. With the advent of advanced prediction techniques such as machine learning (ML), the opportunity to perform an initial risk-assessment to assist clinical decision-making appears as an interesting path for the future psychiatric system. Current studies attempting such demarcation of risk via advanced statistical techniques are mainly concerned with small population sizes, including only individuals who are already in secure or forensic psychiatric hospitals (Watts et al., 2021; Wolf et al., 2018). A gap in the literature is therefore whether statistical risk-assessment using large sample sizes enables the identification of this group of high-risk individuals before treatment in the forensic system is required. Such an approach could offer clinicians invaluable and easy-to-interpret estimates of which patients are in need of follow-up care, in an inexpensive and efficient manner (Passos et al., 2019).
In this study, we attempt the development of a supervised ML framework designed to perform clinically useful risk-assessment of individuals in the general psychiatric system, according to their future probability of requiring treatment in a forensic institution. Specifically, this is done by predicting whether individuals will commit a crime, during or after an out-patient treatment course or after discharge from in-patient care, that leads to the issuing of a court-ordered psychiatric admission (COPA). A COPA is the judicial verdict used in Denmark to assign mentally ill offenders to forensic psychiatric treatment. In addition to the quantitative findings in this paper, we also aim to analyse and discuss the clinical implications, including the effects of varying the risk-threshold used to classify patients (Chen et al., 2020; Douglas et al., 2017). The current study employs the Danish patient registries, hereby also investigating whether satisfactory estimates can be achieved using readily available and non-invasive data types (Bjerre-Nielsen et al., 2021), ensuring cheap and fast predictions.

Data foundation
This study employs the Danish national psychiatric patient register, offering information on citizens in contact with Danish mental healthcare institutions. Each observation includes a pseudo-anonymous social security number, enabling linkage to socioeconomic and criminal justice registries.

Sample
The resulting study sample consists of 45,720 patients, of whom 474 (1%) committed a crime leading to a COPA. The study sample contains all individuals between the ages of 20 and 60 who were admitted as a psychiatric in- or out-patient in 2014. The subset of these patients who committed a crime leading to the issuing of a COPA, during or after out-patient treatment or after in-patient discharge, in either 2014 or 2015 constitutes class 1. The remaining patients constitute class 0. The aim of this study is thus to distinguish the two groups from one another, i.e. future recipients of COPA (minority group/class 1) vs. non-future recipients (majority group/class 0). For fitting the ML models, our dataset was split into two sets for training and evaluation: an 80% training set containing 36,576 patients, of whom 362 had received a COPA within the two-year time span, and a 20% hold-out evaluation set containing 9,144 individuals with 112 COPA recipients.
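The split described above can be sketched as a stratified partition that preserves the rare COPA class in both sets. The following is a minimal illustration with toy data; the function name and implementation are ours, not the study's actual code:

```python
import random

def stratified_split(indices, labels, test_frac=0.2, seed=42):
    """Partition indices so each class keeps roughly the same
    proportion in the training and hold-out sets."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i in indices if labels[i] == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# Toy sample: 100 patients, the first 5 are COPA recipients (class 1).
labels = [1] * 5 + [0] * 95
train, test = stratified_split(list(range(100)), labels)
print(len(train), len(test))         # 80 20
print(sum(labels[i] for i in test))  # 1 (minority proportion preserved)
```

Without stratification, a random 20% hold-out could easily contain almost none of the 1% minority class, making evaluation unreliable.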

Selection of predictors
The dataset comprises 39 predictors, all identified from research on characteristics of forensic psychiatric patients (Kirchebner et al., 2020; Soyka et al., 2007). The predictors are divided into three groups: socioeconomic, psychiatric history, and criminal history. The variables within each group are briefly described in the following paragraphs, and further details can be found in the appendix. All categorical variables have been one-hot-encoded (Hastie et al., 2009).

Sociodemographic information
This group of predictors includes information regarding gender, age, highest attained education, civil status, and area of residence. Age is included as four dummy variables, each representing a decade from the 20s to the 50s. This is done to capture differences between age groups, as opposed to the effect of a one-year increase.
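As a minimal illustration (not the study's code, and with hypothetical column names), the decade encoding described above might look like:

```python
def age_dummies(age):
    """One dummy variable per decade, 20s through 50s
    (the study's inclusion range)."""
    return {f"age_{d}s": int(d <= age < d + 10) for d in (20, 30, 40, 50)}

print(age_dummies(37))
# {'age_20s': 0, 'age_30s': 1, 'age_40s': 0, 'age_50s': 0}
```

Exactly one dummy is active for any age inside the 20-59 range, so the model learns a separate coefficient per age group rather than a single linear age effect.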

Psychiatric information
This group of predictors includes diagnoses as well as information regarding the patients' previous treatment trajectory. The majority of patients included had multiple contacts with the Danish psychiatric system prior to their contact in 2014. In order to maximize the amount of information, the data from these contacts has been accumulated since 1995. This in turn means that the predictors lose temporal information, since the dataset is unable to account for when a given interaction occurred. In addition, we compute the number of unique disorders an individual has been diagnosed with over time. Regarding the treatment trajectory, we include the total number of treatment courses an individual has had, the total number of days admitted to in-patient treatment, and the total number of days admitted to out-patient treatment.

Criminal history
The information about an individual's criminal history is also accumulated since 1995. It includes the total number of convictions, the total number of criminal offences, the total number of sentences served in prison, the number of days sentenced to unconditional and conditional prison sentences, respectively, and the number of previous court-ordered psychiatric admissions, if applicable.

Missing values
Missing values were detected for 443 individuals. These were all members of the majority group and were dropped from the study sample. Missing values in the Danish patient registries often occur for short-term visitors, and the overall influence of this group on the outcomes of this study can therefore be considered minuscule.

Sampling strategies
Since the minority group comprises just 1% of the total sample, the dataset is highly imbalanced, a property which is known to cause poor model performance (Fernández et al., 2018). Several techniques have been investigated to address this issue. SMOTE and One-Sided Selection were attempted (Fernández et al., 2018), but due to poor performance compared to the random sampling approaches, these were not included in the final analyses. The sampling strategies used are implemented on the training data with the purpose of changing its class distribution (Buda et al., 2018). This study compares two techniques: 1) Random Undersampling of the majority class, and 2) Random Oversampling of the minority class. When implementing the undersampling strategy, observations without a COPA were randomly eliminated until a desired class balance was achieved. This ratio was tested in the range from 99:1 to 50:50. When implementing Random Oversampling, observations with a COPA were randomly duplicated until the desired class balance was reached. This ratio was tested in the range from 99:1 to 95:5.
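The two sampling strategies can be sketched as follows; the function names and pure-Python implementation are illustrative, not the study's actual pipeline, and the default ratios mirror the extremes of the tested ranges (50:50 for undersampling, 95:5 for oversampling):

```python
import random

def random_undersample(X, y, minority_frac=0.5, seed=0):
    """Randomly drop majority-class (class 0) rows until the minority
    class makes up `minority_frac` of the training data."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(y) if lab == 1]
    majority = [i for i, lab in enumerate(y) if lab == 0]
    n_major = int(len(minority) * (1 - minority_frac) / minority_frac)
    kept = minority + rng.sample(majority, n_major)
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]

def random_oversample(X, y, minority_frac=0.05, seed=0):
    """Randomly duplicate minority-class rows until the requested balance."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(y) if lab == 1]
    majority = [i for i, lab in enumerate(y) if lab == 0]
    n_minor = int(len(majority) * minority_frac / (1 - minority_frac))
    extra = [rng.choice(minority) for _ in range(max(0, n_minor - len(minority)))]
    kept = majority + minority + extra
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]

# Toy data: 10 COPA cases among 1,000 patients (1%).
X = list(range(1000))
y = [1] * 10 + [0] * 990
_, y_under = random_undersample(X, y, 0.5)  # 50:50 balance
print(len(y_under), sum(y_under))  # 20 10
```

Note that both strategies are applied to the training data only; the evaluation set keeps the original 99:1 distribution.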

Risk thresholding
When performing classification analyses, the standard approach is to assign observations with a predicted probability equaling or exceeding 0.5 to class 1, and those with a probability less than 0.5 to class 0. When working with highly imbalanced datasets, however, it is not a given that this standard threshold is the optimal one (Fernández et al., 2018). To address this, our analyses also investigate the models' performance over varying classification thresholds, using 0.1, 0.3, 0.5, 0.7, and 0.9 as the demarcation points. These findings will be used to qualify our discussion concerning the clinical implications of employing risk prediction tools in the general psychiatric system.
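Varying the classification threshold amounts to moving the cut-off applied to the predicted probabilities. A minimal sketch with toy probabilities (not study output):

```python
def classify(probabilities, threshold=0.5):
    """Flag a patient as high-risk (class 1) when the predicted
    probability equals or exceeds the chosen threshold."""
    return [int(p >= threshold) for p in probabilities]

probs = [0.05, 0.2, 0.45, 0.6, 0.95]
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(t, classify(probs, t))
# 0.1 flags four of the five patients, 0.9 only one: lowering the
# threshold trades precision for recall.
```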

Machine learning models
A total of four machine learning models were trained: a conventional logistic regression model was included as a baseline; Random Forest (Breiman, 2001), since it is used in comparable studies (Chen et al., 2020; Watts et al., 2021) and has generally shown good performance (Fernández-Delgado et al., 2014); XGBoost, which employs gradient boosting and is recommended for imbalanced datasets (Wang et al., 2020); and LightGBM (Ke et al., 2017), which has recently proved very potent when dealing with tabular data (Borisov et al., 2021). For a detailed description of the models' inner workings, we refer to the original articles (Breiman, 2001; Ke et al., 2017; Wang et al., 2020). All models were implemented in Python 3 (Pedregosa et al., 2011).
To determine the optimal set of hyperparameters for each model, a grid search with fivefold cross-validation was applied to the training set. The sampling strategies were also applied within each fold to minimize the risk of overfitting. Generally, the grid search contained five to ten different values per hyperparameter; the ranges can be found in the appendix. For Random Forest, we tuned the tree depth, the maximum number of features to consider before a split, and the minimum number of observations allowed in a node before a split as well as in leaf nodes. Tuning the number of observations in the nodes suggests improved performance, particularly in situations where the sample size is large and the dimension is relatively low (Lin and Jeon, 2006).
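The key methodological point, applying the sampling strategy inside each fold rather than once before splitting, can be sketched as follows. The helper names are illustrative, and the stubs in the usage note stand in for a real model and scorer:

```python
import random
from statistics import mean

def kfold(n, k=5, seed=0):
    """Shuffle indices once, then deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, score, resample, k=5):
    """Resample the training fold only, so each validation fold keeps
    the true (imbalanced) class distribution seen at deployment."""
    scores = []
    for fold in kfold(len(y), k):
        held_out = set(fold)
        tr = [i for i in range(len(y)) if i not in held_out]
        X_tr, y_tr = resample([X[i] for i in tr], [y[i] for i in tr])
        model = fit(X_tr, y_tr)
        scores.append(score(model, [X[i] for i in fold], [y[i] for i in fold]))
    return mean(scores)
```

Inside a grid search, `cross_validate` would be called once per hyperparameter combination, and the combination with the best mean score retained. Resampling before the split would leak duplicated minority rows into the validation folds and inflate the estimated performance.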
For LightGBM and XGBoost, we tuned comparable hyperparameters, such as the maximum tree depth, as well as a range of parameters particularly suitable for highly imbalanced datasets, such as the minimum number of samples in each child for LightGBM and the minimum sum of instance weights in a child for XGBoost.

Evaluation metrics
Given the imbalanced class distribution, the proposed model can be expected to predict the majority of observations as true negatives, i.e. non-future recipients of COPA. As the conventional metrics (such as Accuracy and AUROC) are inflated by this, we instead validate our model according to the F1-Macro score. The F1 score is a weighted combination of precision (also known as the Positive Predictive Value, PPV) and recall (Lewis and Gale, 1994), with 1 being the best score achievable for a classifier and 0 the worst. The F1-Macro score is the arithmetic mean of the F1 scores for each class in the data (Opitz and Burst, 2019). Recall, also known as sensitivity, is a measure of the share of positive cases that the model can actually identify, i.e. how many of the individuals who receive a COPA after discharge the model flags. Precision reports the share of flagged cases that are correct and is thus deflated by many false positives. When evaluating the results, the best model will be identified from the highest achievable F1-Macro score, a metric recommended for highly imbalanced datasets (Fernández et al., 2018). For evaluating the models' performance on the evaluation set, we also present the Negative Predictive Value (NPV), Specificity, Accuracy, Balanced Accuracy and the AUROC at different classification thresholds.
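The F1-Macro score can be computed directly from the confusion counts of each class; a minimal pure-Python sketch (illustrative, not the study's evaluation code):

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def f1_macro(y_true, y_pred):
    """Unweighted (arithmetic) mean of the per-class F1 scores."""
    scores = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

print(round(f1_macro([0, 0, 0, 1, 1], [0, 0, 1, 1, 0]), 3))  # 0.583
```

Because both classes contribute equally to the mean, a classifier that predicts only the majority class scores poorly on F1-Macro even though its Accuracy would be high.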

General results
Table 1 and Table 2 present the results from our ML models. Our study reports the LightGBM algorithm combined with Random Undersampling (Table 1) as the best performing approach, yielding an overall F1-Macro score of 76%. The related tree-based models, Random Forest and XGBoost, perform roughly equally well, reporting F1-Macro scores of 75% and 74%, respectively. Minor, yet noteworthy differences between the three models are the recall and precision scores. The highest recall is achieved by the LightGBM model, though Random Forest reports the highest precision score. Additionally, it should be noted that the very high rate of true negatives in our study inflates the NPV, the Specificity, the Accuracy and the AUROC, which makes these metrics incomplete to rely on alone. Our study reports Random Undersampling as the best performing sampling strategy for all models in the battery.
The results reported in Tables 1 and 2 have been computed using the standard risk-threshold of 0.5, meaning that the model will assign a patient to the high-risk group (class 1) if the predicted probability equals or exceeds 0.5. In the context of this specific study, the adverse outcome of a false positive is less severe than that of a false negative. As such, to qualify adoption in a clinical setting, we investigated how the best performing LightGBM model would perform over varying risk-thresholds.

Risk thresholding
Table 3 presents the results for the LightGBM model when investigating a range of different classification thresholds. Risk-thresholds for the other models can be found in the appendix.
A trade-off can be identified between recall and precision. When the threshold for classification is high, i.e. 0.9, only patients with a very high probability of future forensic treatment are classified as class 1. In this case, the precision score is 0.96, meaning that the vast majority of flagged patients were true future offenders. This approach, however, omits far too many individuals, reflected in the low recall. Conversely, with a low threshold (0.1), many patients are assigned to the high-risk group, as only a predicted probability of at least 0.1 is required. This approach identifies 81% of future offenders. The optimal classification threshold will be discussed next.

Clinical usefulness
This study has demonstrated how an ML approach can offer useful risk-estimates of psychiatric patients' probability of a future admission to the forensic psychiatric system. With a precision score of 57%, the results offer notable improvements over current structured interview approaches, as our model is correct in 57% of cases when assigning a patient to the high-risk group. When using conventional methods, 50%-99.7% of the patients categorized as high-risk would not go on to commit future crime, underscoring the highly sensitive character of these tools (Large et al., 2011). Further, when compared to current statistical approaches (Watts et al., 2021; Wolf et al., 2018), our model offers the ability to identify psychiatric patients before a crime leading to forensic psychiatric treatment has been committed. Next to the myriad of benefits enjoyed by a host of societal actors, this offers vastly different mental healthcare trajectories for the individuals, as the type and quality of care expected in psychiatric services are of a wholly different kind than the care available at forensic clinics (Franke et al., 2020). Our approach additionally allows the development of early-intervention initiatives in the intersection between mental healthcare and crime, initiatives which have been invited by other authors (Ford, 2015). Such efforts could critically reform current treatment procedures in the psychiatric system, as advanced treatment can be targeted at the most risk-prone individuals. It should, however, also be noted that the current results from our analyses offer a recall metric of 47%, indicating that the majority of COPA-recipients are omitted by the model. When coupled with the precision metric of 57%, we argue that, as a starting point for research, this remains a crucial first step, given both the highly imbalanced nature of our dataset and the extremely complex phenomenon (crime leading to COPA) we are attempting to predict.

Table 1
General results for all models with the Random Undersampling (RU) sampling strategy, according to the F1-macro score.
As reported in Table 3, we investigated a range of different risk thresholds to determine the optimal marginal value for classifying patients. As can be seen in Table 3, LightGBM, XGBoost and Random Forest report roughly the same scores across varying thresholds, indicating the robustness of our results. Currently, no official guidelines exist as to which risk threshold is recommended when considering clinical implementation. Choosing this value will depend on the quality and availability of intervention strategies employed to address the needs of risk-prone patients (Chen et al., 2020). Our results demonstrate that using the risk threshold of 0.9 (i.e., only classifying patients to class 1 when their predicted probability equals or exceeds 0.9) offers a highly precise tool, only flagging the most severe cases, though omitting many future offenders. A risk-assessment tool utilizing this threshold offers little clinical utility, as it only classifies individuals who display many and strong forensic psychiatric characteristics; mental healthcare personnel can be expected to appreciate the needs of this patient group themselves.
For a threshold of 0.7, the LightGBM model can identify roughly 34% of future forensic patients, with a precision score of 76%. An approach using this threshold would offer conservative outcomes, flagging few patients but being correct in 76% of cases. In the context of discussing the appropriate risk threshold, it can be argued that the impact of wrongly classifying an individual as posing a future risk (false positive) is less severe than that of a false negative (Douglas et al., 2017), countering the usefulness of this approach. Following this logic, the results reported at the threshold of 0.3 are noteworthy. Here, 58% of future forensic patients are identified, thus reducing the share of false negatives compared to the 0.5 threshold. This approach values the need for finding more future forensic patients at the expense of false positives.
As a tool of this kind is only considered an initial risk-screening prior to clinical evaluations by healthcare personnel, a misclassified individual would only be subjected to further assessment before discharge. This creates consequentialist harms for the individual (Vayena et al., 2018), which are to be balanced against the benefits enjoyed by society in general (Douglas et al., 2017). Depending on the availability of further risk-assessment by clinicians, the results reported at the 0.1 threshold could also be useful. This approach allows the identification of 81% of forensic psychiatric patients, though resulting in a precision score of 18%.
These reflections pertain to the still exploratory nature of setting an appropriate risk threshold. A range of different individual and societal outcomes are to be considered, embracing the usefulness of data-driven approaches while not neglecting ethical concerns of privacy (Starke et al., 2021).
Another quality of the proposed model is the fact that the data foundation contains only already available, high-quality administrative registry data, arguably of a less invasive nature than biometric information or results from questionnaires (Viljoen et al., 2010). Recently, scholars have argued that high-level administrative data types can, in some instances, perform equally well to personality and behavioral data (Bjerre-Nielsen et al., 2021). These authors also suggest that the incorporation of more privacy-intense data types may in fact be redundant. When risk-assessment tools are to be implemented in psychiatric services, the opportunity to avoid highly invasive data types is beneficial in order to address privacy concerns.

Limitations
A number of limitations to the proposed approach deserve to be highlighted. First of all, our models are only designed to flag patients who committed crime leading to the issuing of a COPA. As such, the tool is unable to target the whole population of individuals committing any crime after a psychiatric treatment course. We chose to limit our sample to recipients of COPA, as one can only receive such a verdict if mentally incapacitated in the act of crime. This increases the probability that the subsequent act of crime is related to the mental state of the individual, and thus that additional treatment in the psychiatric system could have prevented the incident. If all acts of crime were included, it is possible that some or many of them would be completely unrelated to the mental state of the individual and additional treatment futile. In the same vein, the inclusion criterion for the study sample was psychiatric in- or out-patient treatment in 2014. Therefore, the tool is also unable to target the whole forensic psychiatric population, as some COPA recipients in the years 2014-2015 did not have contact with the psychiatric system in 2014.
An overarching assumption of the chosen approach is further that any patient assessed to be at high risk of forensic psychiatric treatment is first offered additional assessment by clinicians and, if necessary, prolonged treatment opportunities. Scholars argue that risk-assessment without appreciable benefits for the individual is unethical (Douglas et al., 2017). As such, the ethically valid implementation of statistical risk-assessment techniques rests on sufficient opportunities for both additional clinical screening and treatment. Additionally, while the model remains a valuable first attempt to assess risk in the psychiatric system, external validation studies are needed. New developments in computational data methods can likely expand the current performance level of models. For instance, advanced techniques to overcome the problem of class imbalance in psychiatric data have been proposed in the context of predicting in-patient readmission (Du et al., 2021a, 2021b), an outcome which is comparable in terms of complexity to the one presented here. Additionally, deep learning models used extensively elsewhere in mental healthcare research (Su et al., 2020) also constitute a promising quantitative trajectory for new studies. Such contributions are left for future research to address. An additional factor which should be adequately studied before implementation is the effect of algorithmic assessments on clinicians' decision-making processes. As the valid application of this and related tools assumes that human decision-making is never supplanted, it is crucial to understand how implementation ensures that humans remain responsible. While initial studies suggest that clinicians value and trust risk-assessment from algorithms in a forensic psychiatric context (Zhong et al., 2021), more detailed and comprehensive qualitative studies are needed.

Table 2
General results for all models with the Random Oversampling (RO) sampling strategy, according to the F1-macro score.

Table 3
Evaluating the LightGBM model at different classification thresholds under Random Undersampling.