Predicting Suicidal and Self-Injurious Events in a Correctional Setting Using AI Algorithms on Unstructured Medical Notes and Structured Data



Introduction
Suicidal and self-injurious incidents occur considerably more often in correctional settings than in the general population (Frühwald and Frottier, 2005). In addition to costing lives, these incidents drain institutional and healthcare resources and create disorder and stress for staff and other inmates (DeHart et al., 2009; Smith and Kaminski, 2010). Many studies have identified and analyzed the factors associated with these events using statistical methods such as Zero-Inflated Poisson regression, Zero-Inflated Negative Binomial regression, and matched studies (Dye, 2010; Li et al., 1999; Way et al., 2005; Wichmann et al., 2002). However, the true causes of suicidal and self-injurious behaviors, as with many other social behaviors, can be far more complicated than a handful of structured variables can explain. The results of these studies provide some guidance but are often hardly actionable. For example, some studies found that younger, single, White male inmates with less education and no children are more likely to make suicidal or self-injurious attempts, and that those who are divorced or separated from their spouses have a higher risk (Arboleda-Florez and Holley, 1989; Smith and Kaminski, 2010). However, a large population falls within this description, and providing enhanced observation or services to everyone who fits these criteria in an attempt to prevent suicidal and self-injurious behavior would be impractical if not impossible. Other studies involved interviewing inmates to collect additional information, such as levels of stress, depression, hopelessness, and vulnerability to suicide and self-harm, that is generally not readily available in the database. Moreover, collecting this information is expensive and time-consuming given the large population in correctional institutions (Dear et al., 2001; Naud and Daigle, 2013; Perry and Gilbody, 2009).
As reported in the publications above, structured variables such as demographic information have statistically significant effects on the risk of suicidal and self-injurious behaviors, but additional data have to be collected to improve predictive power. Inmates who are prone to suicidal or self-injurious behaviors may exhibit signs that are not captured by traditionally collected structured variables, such as their facial expressions, the way they interact with others, and the things they complain about to healthcare staff (In-Albon et al., 2013; Morriss et al., 2013). In addition to general demographic information about the inmates, most correctional institutions use an Electronic Health Record (EHR) system and keep digital records of their observations of the inmates during clinical encounters. These records are often unstructured text and contain rich information about the inmates. The emergence of advanced Natural Language Processing (NLP) deep learning models, such as sequence neural networks and Transformer-based models, has made it possible to extract information from such unstructured text data (Girgis et al., 2018; Onan, 2020; Vaswani et al., 2017; Yang et al., 2020).
The Correctional Health Services (CHS) at the Orange County Jails adopted an EHR in 2014. Since then, CHS has been keeping observational descriptions of the inmates they encounter in Progress Notes in their EHR database. In this study, we employ a compact but powerful NLP deep learning model, the Transformer Encoder, to extract information from the unstructured Progress Notes and predict the tendency toward suicidal and self-injurious behaviors, in the hope of narrowing down the target population so that the high-risk group can be triaged more efficiently for close monitoring to prevent suicidal and self-injurious behaviors.

Data
The structured and unstructured data were extracted from the Orange County Jail CHS EHR database. Two types of Progress Notes (namely, SOAP notes and Quick notes) are recorded for the inmates in the Orange County Jails. SOAP notes contain Subjective, Objective, Assessment, and Plan sections (Podder et al., 2021). They are typically used to document a clinical encounter in which a healthcare professional records the patient's subjective complaints and concerns, their objective observations during the encounter, their assessment of the complaint, and a plan to treat or address it. Quick notes, in contrast, are typically used to document a brief interaction with the patient's chart, to record or acknowledge receipt of information and/or some action taken for that individual. While each section of the SOAP notes was designed for different content, in practice some doctors write the assessment and plan together in either the Assessment section or the Plan section, leaving the other section blank. Rather than trying to separate them, which is difficult, we combined the Assessment and Plan sections into a single section for all data points. This resulted in a total of four types of notes: Quick, Subjective, Objective, and Assessment-and-Plan.
A total of 335 inmates have had one or more documented suicidal or self-injurious behaviors at the Orange County Jail since 2014. Of these inmates, 249 have at least one type of Progress Note recorded before the time of the suicidal or self-injurious incident. Of the inmates without documented suicidal or self-injurious incidents since 2014, a total of 89,096 have at least one type of Progress Note. Of the inmates who made suicidal or self-injurious attempts, 11.3% had more than one booking (admission to the jails). For these inmates, only the booking during which the suicidal or self-injurious event happened was used in this study, and only the notes recorded before the event in that booking were used. Of the inmates who did not exhibit suicidal or self-injurious behaviors, 36.9% had more than one booking; for these inmates, we randomly sampled one booking per inmate. Hereinafter, we refer to the group of inmates who exhibited suicidal or self-injurious behaviors as the positive class, positive group, or positive cases, and those who did not as the negative class, negative group, or negative cases.
The notes data need to be preprocessed before they can be fed into the model. First, we randomly sampled and examined some of the notes and noticed many abbreviations, such as "bx" for "biopsy" and "AH" for "auditory hallucination". With the help of domain experts, we built a dictionary to replace these abbreviations with their complete forms for all four types of notes. The dictionary does not cover all abbreviations in the data, since it is not practical to collect every abbreviation from all four types of notes given the large sample size, but it likely captures the majority, or at least a good portion, of the most frequently used abbreviations. Second, all words were converted to lower case, so that "complaint" and "Complaint" are treated as the same word. Third, punctuation, numbers, and single letters were removed. Lastly, standard English stop words (the most common words in a natural language, which add no value in NLP modeling), such as "of", "the", "this", and "that", were removed. Table 1 shows the descriptive statistics of the lengths of the Subjective, Objective, Assessment-and-Plan, and Quick notes after preprocessing.
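The four preprocessing steps above can be sketched in Python. This is an illustrative sketch, not the study's implementation; the abbreviation dictionary and stop-word list shown here are tiny hypothetical stand-ins for the full lists compiled with domain experts.

```python
import re

# Hypothetical stand-in for the expert-built abbreviation dictionary
# (keys are lower-case because text is lower-cased before expansion).
ABBREVIATIONS = {"bx": "biopsy", "ah": "auditory hallucination"}

# Hypothetical stand-in for a standard English stop-word list.
STOP_WORDS = {"of", "the", "this", "that", "a", "an", "and", "is", "to"}

def preprocess_note(text: str) -> str:
    """Apply the preprocessing steps described above to one note."""
    # Lower-case everything so "Complaint" and "complaint" match.
    text = text.lower()
    # Expand known abbreviations (whole words only).
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    # Remove punctuation and numbers.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Drop single letters and stop words.
    tokens = [w for w in text.split() if len(w) > 1 and w not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess_note("Pt c/o AH; bx scheduled 10/02."))
# -> pt auditory hallucination biopsy scheduled
```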
In addition to the notes data, structured data relating to the characteristics of the inmates or the seriousness of the offense that were available in the Orange County CHS database, such as Race, Sex, Marital Status, Age, and AB109, were also extracted. Table 2 shows the counts of each category of the discrete structured variables, and Table 3 shows the descriptive statistics of the age variable. AB109 indicates whether an inmate meets the criteria of the AB109 Bill, also known as the California Public Safety Realignment Act of 2011. The Bill shifted incarceration and supervision responsibility for many lower-level felons who committed non-violent, non-serious, and non-sex offenses to the local county level ("California Public Safety Realignment Act AB109," n.d.).
After initial preprocessing, the dataset was randomly split 10 times, with stratification, into training and test sets at a 70:30 ratio. Since there were only 249 positive cases, the stratification avoided the scenario in which no or too few positive cases appeared in either the training or the test set. The notes data in the training sets were tokenized (converted to integers ranging from 1 to the number of unique words in the training set), and then the tokenizer was applied to the test sets. The tokenizer was fitted on the training sets only to avoid information leakage, since the model was not supposed to be exposed to any information from the test set during training. After tokenization, the notes data became sequences of integers of varying lengths. Longer sequences had to be truncated, and shorter sequences padded with zeros, to make them the same length. According to Wen et al., model performance increases with the length of the input sequences and roughly levels off after reaching the mean length (Wen et al., 2016). Therefore, we forced all sequences to a length of 162, the largest mean length among the four types of notes.
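The tokenize-then-pad step can be sketched as follows. This is a minimal sketch, assuming a frequency-ordered vocabulary fitted on the training texts only (index 0 reserved for padding, unknown test-set words dropped); it is not the study's actual tokenizer.

```python
from collections import Counter

SEQ_LEN = 162  # largest mean note length among the four note types

def fit_tokenizer(train_texts):
    """Build a word-to-integer map from the TRAINING texts only,
    so no information leaks from the test set."""
    counts = Counter(w for t in train_texts for w in t.split())
    # Most frequent word gets index 1; index 0 is reserved for padding.
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_padded(texts, vocab, seq_len=SEQ_LEN):
    """Map words to integers (unseen words dropped), then
    truncate long sequences and zero-pad short ones to seq_len."""
    out = []
    for t in texts:
        seq = [vocab[w] for w in t.split() if w in vocab][:seq_len]
        out.append(seq + [0] * (seq_len - len(seq)))
    return out

vocab = fit_tokenizer(["pain worse pain", "meds pain"])
print(texts_to_padded(["pain new meds"], vocab, seq_len=5))
```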

Methods
Unstructured text data are considered sequence data, since the order of words matters. Typical sequence neural network models include the Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (Bi-LSTM). Our previous work investigated the performance of these sequence models as well as the CNN, the Transformer Encoder, and the BERT-Base model (Lu et al., 2022). The results showed that the Transformer Encoder outperformed all other models, especially on tasks with highly imbalanced classes. Since the proportion of positive cases in our data was 0.28% (coded as Pct 0.28 in this study), an extreme imbalance, we chose the Transformer Encoder with 4 attention heads for this classification task. The Transformer Encoder can be considered the first encoder block of Transformer-based models, which are well known for language translation, question-answering, and classification tasks. Transformers for translation and question-answering have both encoders and decoders, while those for classification have encoders only. An encoder in a Transformer typically has two layers, a multi-head self-attention layer and a feed-forward layer. The first encoder also has a token embedding and positional embedding layer (Vaswani et al., 2017). The token embeddings represent the relationships among words in the data and are learned during training (Wang et al., 2020). They are vectors of real numbers, and the length of the vectors (i.e., the dimension of the token embeddings) is specified as a hyperparameter; in our study, we set the dimension to 200. The positional embeddings are of the same dimension and take into account the order of the words. In the Transformer model proposed by Vaswani et al., the positional embeddings are calculated with a sine function for the even columns and a cosine function for the odd columns of the embeddings for each position in the input sequence. The positional embedding matrix is then added to the token embedding matrix, and the resulting sum matrix is passed to the self-attention layer.
In the self-attention layer, the sum matrix is multiplied by three randomly initialized weight matrices to become the Q (Query), K (Key), and V (Value) matrices, to which the function Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V is applied, where d_k is the dimension of Q and K (Vaswani et al., 2017). When there are multiple self-attention heads, each head implements the same function but with Q, K, and V of smaller dimension. The outputs of the multiple self-attention heads are then concatenated and passed to the Add and Norm layer, where the concatenated output is added to the embedding sum matrix; the result is normalized and fed to the feed-forward layer. Another Add and Norm layer is applied to the sum of the output of the first Add and Norm layer and the output of the Feed Forward layer. Fig. 1 shows the architecture of the Transformer Encoder block used in this study.
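The scaled dot-product attention function above can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation; the shapes mirror the study's setup (sequence length 162, per-head dimension 200 / 4 heads = 50), and the random weight matrices stand in for the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (seq_len, d_v)

# One head: 162-token sequence, per-head dimension 50.
rng = np.random.default_rng(0)
X = rng.normal(size=(162, 50))
W_q, W_k, W_v = (rng.normal(size=(50, 50)) for _ in range(3))
out = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)  # (162, 50)
```

In the multi-head case, four such outputs of dimension 50 would be concatenated back to the full embedding dimension of 200.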
Deep learning models can take inputs from multiple sources. Since four types of notes data were available, and each may contribute different information, we wanted to use all four. However, if we concatenated all four types of notes for each inmate, we might lose the last one or two types after truncating whenever all four were lengthy. We avoided this by using the four types of notes as separate inputs, each fed into its own Transformer Encoder block in parallel. The outputs from the Transformer Encoders were flattened and then concatenated to form a single input for the fully connected layer. After the token embedding and positional embedding, each input became a matrix of 162 by 200 (162 is the length of the input sequences after truncating and padding, and 200 is the token and positional embedding dimension). After passing through its Transformer Encoder block, each matrix was flattened into a vector of length 32,400 (the product of 162 and 200). The vectors produced from the Transformer Encoders of the four types of notes were then concatenated to form a vector of length 129,600 (32,400 times 4). This vector was fed into a fully connected layer before reaching the output layer. Fig. 2 shows the architecture of the model with the four unstructured text inputs.
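The dimension bookkeeping in the flatten-and-concatenate step can be checked with a few lines of NumPy. The zero matrices here are placeholders for the actual encoder outputs; this sketch only verifies the shapes.

```python
import numpy as np

SEQ_LEN, EMBED_DIM, N_NOTE_TYPES = 162, 200, 4

# Stand-ins for the four Transformer Encoder outputs (one per note type).
encoder_outputs = [np.zeros((SEQ_LEN, EMBED_DIM)) for _ in range(N_NOTE_TYPES)]

# Flatten each (162, 200) output to a vector of length 162 * 200 = 32,400 ...
flat = [m.reshape(-1) for m in encoder_outputs]
# ... then concatenate into the single input for the fully connected layer.
merged = np.concatenate(flat)
print(len(flat[0]), len(merged))  # 32400 129600
```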
To make use of the structured data available in the database, we also experimented with adding the structured data as a fifth input to the model, feeding it into two dense layers parallel to the Transformer Encoder blocks and then concatenating the output with the flattened outputs of the Transformer Encoder blocks. Fig. 3 shows the architecture of the model with the five inputs.
Another way of incorporating the information extracted from the notes data is to use the predicted probabilities from the notes data as a feature alongside the structured data in traditional Machine Learning models such as LASSO, Random Forest, XGBoost, KNN, and SVM. Because some of the predicted probabilities are nearly zero, we took the log of the predicted probabilities from the Transformer Encoder model and then standardized them. To quantify the improvement in model performance resulting from the information in the notes data, the same models were also run on the structured data only. The standardized predicted probabilities from the Machine Learning models were then used as features in a LASSO model, as an ensemble method, in an attempt to further improve model performance.
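The log-then-standardize transformation of the predicted probabilities can be sketched as follows. The clipping value `eps` is an assumption on our part, added only to guard against a probability of exactly zero.

```python
import numpy as np

def log_standardize(probs, eps=1e-12):
    """Log-transform probabilities (some are nearly zero),
    then z-score standardize for use as an ML feature."""
    logp = np.log(np.clip(probs, eps, None))
    return (logp - logp.mean()) / logp.std()

feature = log_standardize(np.array([1e-9, 0.02, 0.4, 0.97]))
print(feature.mean(), feature.std())  # ~0 and ~1 after standardization
```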
Since our data were extremely imbalanced, we adopted under-sampling to address the imbalance. Drummond et al. showed that under-sampling can be more effective than over-sampling with decision tree learners (Drummond et al., 2003). For each of the 10 training sets split upfront, we under-sampled the majority class in the training set to 10%, 20%, 30%, 40%, and 50% prevalence (the number of positive cases divided by the total number of cases, coded in this study as Pct 10, Pct 20, Pct 30, Pct 40, and Pct 50, respectively), leaving the 10 test sets unchanged. All of the aforementioned models were run on the original 10 training sets as well as on the 10 under-sampled training sets, and then tested on the 10 original test sets. The workflow of this study is illustrated in Fig. 4. The boxes with green borders were chosen to show detailed processes; the detailed processes for the other boxes are omitted due to space limitations.
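Under-sampling the majority class to a target prevalence can be sketched as below. This is a minimal sketch, assuming all positives are kept and negatives are randomly dropped; it is not the study's exact sampling code.

```python
import numpy as np

def undersample(X, y, prevalence, seed=0):
    """Drop majority-class (y == 0) rows until positives make up
    `prevalence` of the training set; all positives are kept."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    # Solve n_pos / (n_pos + n_neg) = prevalence for n_neg.
    n_neg = int(len(pos) * (1 - prevalence) / prevalence)
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000, dtype=int)
y[:10] = 1  # 1% prevalence before under-sampling
X2, y2 = undersample(X, y, prevalence=0.2)
print(len(y2), y2.mean())  # 50 rows, 20% positive
```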

Results
The model performance was evaluated in terms of AUC-ROC, Sensitivity, and Specificity. In this study, it is more important to identify as many individuals with high suicidal and self-injurious tendency (positive cases) as possible; misidentifying those who are unlikely to commit suicidal or self-injurious acts (negative cases) would incur some cost from unnecessary preventive measures, but that cost is lower than the cost of missing positive cases. Therefore, instead of using a measurement such as the F1 Score, which weights Sensitivity and Specificity equally, we used Sensitivity and Specificity separately to evaluate how accurately the model identifies the positive and negative cases, respectively, and used AUC-ROC as an overall measure of model performance. In addition, the AUC-ROC is much lower for all models on structured data only. When the predicted probabilities from the Transformer Encoder model (on notes data only) were added as a feature to the Machine Learning models on structured data, the AUC-ROC of all models improved substantially in all under-sampling scenarios. All Machine Learning models (on structured data only and on structured data alongside predicted probabilities, respectively) yielded similar performance, except for SVM, which predicted all cases to be negative in all but the Pct 50 under-sampling scenario.

Fig. 6 shows the Sensitivity of the Transformer Encoder model on notes data alone and on notes alongside structured data, respectively, as well as the Sensitivity of the Machine Learning models on structured data alone and on structured data alongside predicted probabilities from the Transformer Encoder model on notes data only. Sensitivity increased steeply as the classes were made more balanced by under-sampling the majority class. The highest Sensitivity, 0.93, was produced by SVM on Pct 50 under-sampled structured data alongside predicted probabilities, but the corresponding Specificity was only 0.56. The highest Sensitivity with decent Specificity was 0.821, from XGBoost on Pct 40 under-sampled structured data alongside predicted probabilities, with a Specificity of 0.718. The ensemble model on the same data produced similar results, with a Sensitivity of 0.824 and a Specificity of 0.713. Fig. 7 shows the Specificity of the same models on the same data. Specificity decreased as the classes were made more balanced by under-sampling the majority class, because as more positive cases were identified, more false positives were produced. The Transformer Encoder model on notes data alone without under-sampling yielded the highest Specificity because it predicted nearly all cases as negative, which resulted in very low Sensitivity and very high Specificity.
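The Sensitivity and Specificity figures reported throughout this section follow directly from the confusion matrix; a minimal sketch:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 2 positives (1 caught), 3 negatives (1 false alarm).
sens, spec = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(sens, spec)  # 0.5 and 2/3
```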
To better demonstrate the contribution of the structured data to the predictive power, Fig. 8 compares model performance with and without the structured data. It shows that the improvement from adding the structured data to the Transformer Encoder model was negligible, which suggests that most of the information pertaining to suicidal and self-harm tendency comes from the notes data. One could argue, however, that the improvement was negligible because the dense layers were not effective at extracting information from the structured data. To test this conjecture, we compared the performance of the 6 machine learning models (LASSO, Random Forest, XGBoost, KNN, SVM, and Ensemble) on structured data alone against the Transformer Encoder model on notes data alone.
Fig. 9 shows the performance of the Transformer Encoder model on notes data alone and of the machine learning models on structured data alone. Using notes data alone produced substantially better performance on all three measurements in all under-sampling scenarios, confirming that the notes data contain more information pertaining to suicidal or self-harm risk than the structured data available in the Orange County Jail database. The machine learning models on structured data alone produced higher Specificity in some under-sampling scenarios because they made more negative predictions, which yields higher Specificity but lower Sensitivity.
To better demonstrate the difference between the two ways of incorporating information from the notes data, Fig. 10 compares using notes data alongside structured data in the Transformer Encoder model against using structured data alongside the predicted probabilities in the Machine Learning models. The Transformer Encoder model produced better AUC-ROC and Specificity in all under-sampling scenarios, but the machine learning models on structured data alongside the predicted probabilities produced higher Sensitivity. This is because the Transformer Encoder model produced fewer true positives (lower Sensitivity) and fewer false positives (higher Specificity), while the machine learning models produced more true positives (higher Sensitivity) but also more false positives (lower Specificity).

Discussion
The results show that model performance is substantially higher with the information from the notes data than without it, in both approaches (notes data alongside structured data, and structured data alongside predicted probabilities). This confirms that the notes data contain more information pertaining to suicidal and self-injurious tendency in the correctional setting than the structured data available in the Orange County Jail CHS database. The information from the notes data can either be used directly in an NLP algorithm to make predictions or incorporated into machine learning models as an additional feature. In all under-sampling scenarios, the first approach produced the highest AUC-ROC, and the second approach produced the highest Sensitivity.
With the ensemble model, we experimented with all five models (LASSO, Random Forest, XGBoost, KNN, and SVM) on the predicted probabilities (removing the probabilities from SVM, since it predicted no positives at all except on the Pct 50 under-sampled data), with and without 2-way interactions. The ensemble LASSO performed slightly better with 2-way interactions, but it did not improve model performance as much as expected. This may be because the individual models captured similar information from the data, so combining them in the ensemble added little value.
The results also show that under-sampling helps improve model performance. As the classes were made more balanced by under-sampling the majority class in the training sets, Sensitivity increased, reaching 93.1% when the two classes were completely balanced. However, this improvement is accompanied by decreasing Specificity, because more false positives were produced. If the per-inmate cost of taking preventive measures against suicidal or self-injurious behaviors is known, a cost-benefit analysis can be carried out to choose the under-sampling strategy and the threshold (the cut-off value that determines whether a case is classified as 1 or 0 based on the predicted probability) that minimize cost while targeting the most inmates with potential suicidal or self-injurious behaviors.
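The cost-benefit threshold selection described above can be sketched as a simple search over candidate cut-offs. The cost values are assumptions for illustration; in practice, `cost_fp` would be the per-inmate cost of an unnecessary preventive measure and `cost_fn` the (much larger) cost of a missed positive case.

```python
import numpy as np

def best_threshold(y_true, probs, cost_fp, cost_fn):
    """Pick the classification cut-off that minimizes total cost:
    cost_fp per false positive + cost_fn per false negative."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    best_t, best_cost = 0.5, float("inf")
    for t in np.unique(probs):          # each observed probability is a candidate
        pred = probs >= t
        cost = (cost_fp * np.sum(pred & (y_true == 0))
                + cost_fn * np.sum(~pred & (y_true == 1)))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Toy example: missing a positive costs 10x a false alarm.
t = best_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.9], cost_fp=1, cost_fn=10)
print(t)  # 0.35: catches both positives at the price of one false alarm
```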
To tackle the class imbalance, we also experimented with augmenting the minority class in the training sets using the text data augmentation methods investigated in our previous work (Lu and Rakovski, 2022). Unfortunately, they did not help model performance. Most text data augmentation methods are intended to mimic the noise in the data, such as typos and spelling errors. In theory, if the noise introduced by an augmentation method is similar to the noise in the data, it may improve model performance; in practice, trial and error is needed to find an appropriate augmentation method for a specific dataset and task.
Although the medical and mental health Progress Notes contain rich information about an inmate, and those at high risk of suicidal or self-injurious behaviors may share common characteristics in their mental state and history, some mentally healthy inmates make deliberate self-injurious attempts to obtain special attention, such as a single cell, and these cases are harder to identify (Lohner and Konrad, 2006). In addition, the structured data used in this study are very limited. Collecting more structured data (especially variables that have been identified as statistically important, such as education level, socioeconomic status, and type of offense) may further improve model performance (Favril et al., 2020). Another limitation of this study is that it applies only to inmates who have Progress Notes recorded in the database. A considerable portion (18.2%) of the suicidal and self-injurious incidents happened within the first day of booking, before any Progress Notes had been recorded.

Conclusion
The results of this study show that inmates' medical and mental health notes contain more information pertaining to suicidal or self-injurious behaviors than the structured data available in the database at the Orange County Jail. NLP algorithms such as the Transformer Encoder can extract information from the notes data and improve model performance in predicting suicidal or self-injurious events in correctional settings. In this study, using the notes data directly in the Transformer Encoder model produced the highest AUC-ROC, while incorporating the information extracted from the notes data into traditional Machine Learning models, by adding the predicted probabilities from the Transformer Encoder model (on the notes data alone) as a feature alongside the structured data, yielded better Sensitivity. In addition, under-sampling is an effective approach to mitigating the impact of extremely imbalanced classes: more under-sampling produced higher Sensitivity but lower Specificity. The XGBoost model on under-sampled structured data alongside predicted probabilities from the notes data produced 82.1% Sensitivity and 71.8% Specificity.
The results convincingly show that the Progress Notes contain more information associated with suicidal and self-injurious behaviors than the structured data available in the Orange County Jails. Since the methods proposed here can incorporate both the Progress Notes and structured data in the same model, for correctional facilities with more structured variables available in their databases, we expect that adding more structured variables to the models could further improve Sensitivity and Specificity. With these models, which run on data readily available in the database of any correctional facility, screening high-risk inmates and taking preventive actions would be much more efficient and less costly.

Ethics approval and consent to participate
This study was approved by the County of Orange Health Care Agency Human Subjects Review Committee.
Fig. 5 shows the AUC-ROC of the Transformer Encoder model on notes data and on notes alongside structured data, respectively, as well as the AUC-ROC of the Machine Learning models on structured data alone and on structured data alongside predicted probabilities from the Transformer Encoder model on notes data. The black line represents the AUC-ROC of the Transformer Encoder model on notes data alone without under-sampling; it serves as a reference for all other models and under-sampling scenarios. The Transformer Encoder model clearly produced the highest AUC-ROC, and adding the structured data as extra inputs made little difference. The highest AUC-ROC was produced by the Transformer Encoder model (0.863 on Pct 50 under-sampled notes alongside structured data, and 0.862 on Pct 50 under-sampled notes data alone).

Table 1
Descriptive statistics of number of words after preprocessing.

Table 2
Counts of each category in discrete structured variables.

Table 3
Descriptive statistics of variable age.
Fig. 1. Architecture of the Transformer Encoder block.