Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting

Inpatient violence is a common and severe problem within psychiatry. Knowing who might become violent can influence staffing levels and mitigate severity. Predictive machine learning models can assess each patient's likelihood of becoming violent based on clinical notes. Yet, while machine learning models benefit from having more data, data availability is limited, as hospitals typically do not share their data for privacy reasons. Federated Learning (FL) can overcome this limitation by training models in a decentralised manner, without disclosing data between collaborators. However, although several FL approaches exist, none of them trains Natural Language Processing models on clinical notes. In this work, we investigate the application of Federated Learning to clinical Natural Language Processing, applied to the task of Violence Risk Assessment, by simulating a cross-institutional psychiatric setting. We train and compare four models: two local models, a federated model and a data-centralised model. Our results indicate that the federated model outperforms the local models and performs similarly to the data-centralised model. These findings suggest that Federated Learning can be used successfully in a cross-institutional setting and is a step towards new applications of Federated Learning based on clinical notes.


Introduction
Inpatient violence is a serious problem in clinical psychiatry, causing short- and long-term damage to property as well as people (van Leeuwen & Harte, 2017; Inoue et al., 2006; Nijman et al., 2005; Havaei et al., 2019). Violence Risk Assessment (VRA) has been used in mental healthcare to inform medical decisions and mitigation strategies (Singh et al., 2014; Conroy & Murrie, 2012).
Several manual VRA methods have been proposed and evaluated (Almvik et al., 2000;Douglas et al., 2014;Ogloff & Daffern, 2006), yet these methods are time-consuming and subjective, and some of them require advanced training to use (Nicholls et al., 2006). Machine Learning (ML) methods promise to address these limitations, developing fast and objective predictions based on patient data present in Electronic Health Records (EHR).
In the psychiatry domain, a particularly promising ML approach is Natural Language Processing (NLP), since EHRs contain large amounts of unstructured clinical notes written by nurses and psychiatrists. The information in these notes could be employed in decision-support systems to aid psychiatrists in predicting aggression, diagnosing patients, predicting side-effects from medication, and predicting suicide attempts, among others. The information is reported in subtle and nuanced ways, and often includes typographical errors, abbreviations and technical terms. Not surprisingly, a common problem encountered by ML researchers in the clinical domain is datasets that are small (Pestian et al., 2010) or too specific (Suchting et al., 2018). Thus, increasing dataset size and diversity is desirable for the performance of ML models, in particular NLP models used in psychiatry.
Combining datasets from multiple departments and institutions would be a natural way to enlarge datasets for various tasks. Yet, medical institutions are usually not allowed to combine their data (Flikweert et al., 2020). Thus, instead of sharing data, machine learning models can be shared amongst institutions, using local data for training and/or fine-tuning. This is the basis of Federated Learning (FL). Through FL, multiple parties collaborate in solving an ML task under the coordination of a central server, where data are never allowed to leave a party's device (Kairouz et al., 2021). Though some losses are expected with respect to a data-centralised approach, it has been shown that these can be quite small and acceptable given the gain in privacy (Sheller et al., 2019). FL has been gaining traction in recent years, and applications within the medical domain are slowly emerging (Kairouz et al., 2021; Deist et al., 2020). However, none of the clinical applications of FL so far employ clinical texts.
In this work, we employ clinical texts for FL, examining violence risk assessment. We seek to find how FL compares to centrally and locally trained models. For this comparison, we use free texts in EHRs. Since we do not have access to data from multiple institutions, we use "mock" institutions, created from the data of a single location using nursing-ward-based partitioning.
We train four machine learning models: a federated model, a data-centralised model and two local models (A and B). Here, A and B are the names of the mock institutions we created. Then, we compare the performance of these four models on institutions A and B separately and on the combined test dataset.
Our main contributions are:
• We demonstrate that FL applied to NLP models and trained on clinical texts has performance similar to a centralised model, and better than locally trained models.
• We highlight the potential of FL for clinical psychiatry.
The remainder of this paper is structured as follows. Section 2 discusses related work regarding FL in the medical domain. Section 3 describes the dataset, and explains the method for obtaining the empirical results. Sections 4 & 5 state and discuss the empirical results. Finally, Section 6 provides the conclusions drawn from the results.

Related Work and Background
Multiple Machine Learning (ML) methods have been proposed to tackle the problem of Violence Risk Assessment (VRA). Bader & Evans (2015) attempted to differentiate between patients perpetrating severe and repeated aggression and non-aggressive patients, using common risk factors as predictor variables.
In a retrospective study, Raja & Azzoni (2005) found some factors that seemed to correlate with inpatient violence. Menger et al. (2019), Le et al. (2018), and Cook et al. (2016) exploited the abundant free text in EHRs to employ Natural Language Processing (NLP) to this task. Beyond VRA, Pestian et al. (2010) used NLP to classify suicide notes as legitimate or elicited.
Two limiting factors in the development of fair and accurate ML models for the healthcare domain are dataset size and diversity. Of the studies mentioned above, only one had more than a few thousand data points (Le et al., 2018). Its limitation, however, was that it predicted existing VRA instrument scores, not real violence incidents. Suchting et al. (2018) had nearly 30 thousand data points, yet they report being limited both by dataset diversity (due to the nature of their facility) and by dataset size (due to the imbalanced nature of the dataset, as most patients do not engage in violence). Aggregating data from multiple institutions would tackle both problems. However, as medical data often resides in secure data silos across institutions (Lehne et al., 2019), aggregating these data is not possible.
Federated Learning (FL) is a novel technique for training ML models on decentralised data (Konečný et al., 2016). It began with the question of how one can train an ML model in a setting where data is unevenly distributed across a large number of devices, and the data cannot be shared among devices or with the central server. FL provides a solution to this question through decentralised training, orchestrated by a central server. The server initialises and sends a model to each participating institution or data silo. Each institution trains the model on their own data, and shares the updated model's parameters with the central server. The server then aggregates all models and creates a new global model. A widely used algorithm for creating a new model is FedAvg (McMahan et al., 2016), which performs a weighted average over the parameters of all models to create a new model. Other algorithms have been proposed to allow the use of adaptive optimisers, such as FedAdagrad, FedYogi, and FedAdam (Reddi et al., 2021).
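The FedAvg aggregation step described above can be sketched in a few lines of plain Python (an illustrative sketch, not the McMahan et al. implementation; `fed_avg` and its arguments are hypothetical names):

```python
def fed_avg(client_params, client_sizes):
    """FedAvg-style weighted average of client parameter vectors.

    client_params: one flat list of parameters per client.
    client_sizes: number of local training samples per client,
                  used as aggregation weights.
    """
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[j] * size / total
            for params, size in zip(client_params, client_sizes))
        for j in range(n_params)
    ]

# Two clients with unequal data sizes: the larger client dominates.
global_params = fed_avg([[1.0, 0.0], [3.0, 4.0]], client_sizes=[1, 3])
# → [2.5, 3.0]
```

The adaptive variants (FedAdagrad, FedYogi, FedAdam) keep this client-side averaging but replace the server-side update with an adaptive optimiser step.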
FL has brought promising results in recent literature, where federated models perform nearly on par with data-centralised models for medical classification tasks, such as brain tumour segmentation (Sheller et al., 2019; Li et al., 2019) and in-hospital mortality prediction (Choudhury et al., 2019). The technique has also been applied to private medical data by utilising the Personal Health Train (PHT) to classify post-treatment survival chances in lung cancer patients, in a collaboration with eight medical institutions (Deist et al., 2020).
PHT is a platform aiming to provide healthcare data from various sources to researchers while ensuring privacy protection. FL has also been used to predict suicidal ideation in online social care texts (Ji et al., 2019). There are several methods to convert texts into vectorial representations, including bag-of-words, TF-IDF, Word2Vec and Doc2Vec. Following previous work (Mosteiro et al., 2021), in this paper we use Doc2Vec (Le & Mikolov, 2014), which generates a fixed-length vector for a piece of text of arbitrary length. In this study's context, a document is the collection of notes of one admission period of a patient.
Through this method, the vector representations aim to keep the semantics within each document intact. The representations can then be fed into an ML model such as a neural network for a classification task.

Method
In this section, we outline the method for conducting the FL experiment for predicting inpatient violence. First the data and the processing steps are described in Section 3.1 and Section 3.2, respectively. Then the setup and training procedure are described in Section 3.3, and the method for validation of the classification models is given in Section 3.4. Thereafter, more detail is provided regarding the implementation of FL in the experiment in Section 3.5.

Data
The data made available by the psychiatry department of UMC Utrecht for this study is the violence incident dataset prepared for violence risk assessment among admitted patients by Mosteiro et al. (2020, 2021). Each data point corresponds to an admission period of a patient, and contains the concatenation of clinical notes from a maximum of 28 days before up until and including the 1st day after admission. Data points are labelled by whether a violence incident took place during the 27 days following the first day of admission (positive/negative outcome). The clinical notes, which are written in Dutch, have been vectorised using Doc2Vec (Le & Mikolov, 2014), with a feature vector dimensionality of 300. No structured features such as gender or age were used, as they did not provide significant discriminatory power in previous work (Mosteiro et al., 2020). There are four nursing wards in the psychiatry department at the UMC Utrecht, and each data point belongs to one nursing ward. The characteristics of the dataset are shown in Table 2.

Table 2: Dataset characteristics. Each data point is an admission period, i.e., the period that a patient spends while admitted to a given nursing ward of the psychiatry department. Age refers to the age of the patients in the nursing ward. Positive and Negative data points are defined by whether the patient is involved in a violence incident during the first 27 days after the first day of the admission period.
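The input and outcome windows described above can be sketched as follows (an illustrative reconstruction, not the authors' code; function and variable names are assumptions):

```python
from datetime import date, timedelta

def build_data_point(admission_date, notes, incidents):
    """Assemble one admission-period data point.

    notes: dict mapping note date -> note text.
    incidents: set of dates on which violence incidents occurred.
    """
    # Input window: 28 days before admission up to and including
    # the 1st day after admission.
    start = admission_date - timedelta(days=28)
    end = admission_date + timedelta(days=1)
    text = " ".join(notes[d] for d in sorted(notes) if start <= d <= end)
    # Outcome window: the 27 days following the first day after admission.
    label = any(end < d <= end + timedelta(days=27) for d in incidents)
    return text, int(label)
```

The concatenated text would then be fed to Doc2Vec to produce the 300-dimensional feature vector.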

Data Processing
To simulate two institutions (A & B) based on one dataset, and to allow for hyper-parameter tuning, a data processing procedure was designed to ensure the split-up datasets meet the following requirements. First, each of the four nursing wards is assigned to either institution A or B, in such a way that the numbers of data points in A and B are as even as possible. Second, both datasets are split into a train/validation set and a test set, and the train/validation set is split into 5 folds for cross-validation (CV). Third, no patient IDs may overlap between cross-validation folds, or between the train/validation set and the testing set; overlapping patient IDs could result in validating/testing on training data. Such overlap sometimes occurs when a patient is moved to a different nursing ward, and the new nursing ward copies the notes taken by the previous nursing ward. Fourth, it should be possible to combine the folds between institutions to form patient-independent folds for federated and data-centralised training. Fifth, both testing sets may only include new data based on the admission timestamp, to ensure we test the final models on new data points exclusively. These requirements are visualised as a top-down procedure in Figure 1.
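The third requirement, patient-disjoint folds, can be sketched as follows (a minimal sketch with hypothetical names; the authors' actual procedure is the one illustrated in Figure 1):

```python
import random

def patient_disjoint_folds(records, n_folds=5, seed=0):
    """Assign admission records to CV folds so that no patient ID
    appears in more than one fold.

    records: list of dicts with at least a 'patient_id' key.
    Returns a list of n_folds lists of records.
    """
    # Shuffle the unique patient IDs, then deal them round-robin to folds,
    # so all admissions of a patient land in the same fold.
    patients = sorted({r["patient_id"] for r in records})
    random.Random(seed).shuffle(patients)
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    folds = [[] for _ in range(n_folds)]
    for r in records:
        folds[fold_of[r["patient_id"]]].append(r)
    return folds
```

Because fold membership is decided per patient rather than per admission, no patient can appear in both a training and a validation fold.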

Treatment Design
In this study, four treatments are designed and compared to test all scenarios derived from the research goals mentioned in Section 1, based on Wieringa's design cycle (Wieringa, 2014). Each treatment performs a grid search with 5-fold cross-validation (CV) to find the best hyper-parameters for training a neural network on its respective dataset. Based on this outcome, each treatment delivers a final model by training on the data from all five folds, which is then tested against a held-out testing set. These four final models are compared as part of the statistical difference-making experiment.
The treatments differ in the data they are applied to and in the training method. Two treatments are trained on data from the two simulated institutions A and B, respectively. The other two treatments, data-centralised and federated, train on data from both institutions. The data-centralised treatment trains on all data without restrictions, to show what performance would be achievable if privacy regulations could be ignored. It therefore acts as a gold standard in terms of performance, as we expect the non-restrictive training environment to deliver the best performance. The federated treatment trains a neural network on both institutional datasets through FL.

Classification Model
The classification model used across treatments is a feed-forward neural network, consisting of an input layer, one hidden layer, and output layer. The size of the input layer corresponds to the number of elements in the Doc2Vec vectors in the dataset (300). The hidden layer size is given by the variable h, whose values are optimised through hyper-parameter tuning. Furthermore, the hidden layer uses the Rectified Linear Unit (ReLU) activation function, chosen for its fast computation time. The output layer has a single neuron with a sigmoid activation function for providing the classification.
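A minimal pure-Python sketch of the forward pass of this 300-h-1 network follows (illustrative only; the study's actual implementation is not shown, and the parameter names are assumptions):

```python
import math

def forward(x, w1, b1, w2, b2):
    """Forward pass of a feed-forward net with one ReLU hidden layer
    and a single sigmoid output neuron.

    x: input vector (in the study, a 300-dim Doc2Vec vector).
    w1: h x len(x) weight matrix, b1: length-h bias vector.
    w2: length-h weight vector, b2: scalar bias.
    """
    # Hidden layer with ReLU activation.
    hidden = [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bi)
              for row, bi in zip(w1, b1)]
    # Sigmoid output neuron yields the violence-risk probability.
    logit = sum(wj * hj for wj, hj in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))
```

In practice the output is thresholded (e.g. at 0.5) to obtain the binary classification.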
The model uses the Binary Cross Entropy (BCE) with logits loss function to compute its gradients. We use mini-batch gradient descent. Equation 1 gives the average loss per data point for a single batch n, given inputs x and outcomes y. Batch n contains T_n data points. For each data point i, we apply a sigmoid activation function σ to the input x_i. To mitigate class imbalance, when the binary outcome y_i is positive, we weight its loss term by a factor p equal to the ratio of negative to positive samples in the dataset.
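Equation 1 itself is not reproduced in this excerpt; based on the description above (weighted BCE-with-logits averaged over the batch), a plausible reconstruction is:

```latex
\mathcal{L}_n = -\frac{1}{T_n} \sum_{i=1}^{T_n}
  \Big[ \, p \, y_i \log \sigma(x_i)
        + (1 - y_i) \log\big(1 - \sigma(x_i)\big) \Big]
```

Here p multiplies only the positive-class term, so positive samples contribute roughly as much total loss as the more numerous negative samples.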
An exponential learning rate scheduler is used for training, which updates the learning rate through lr = lr_0 · γ^(n_e), where lr_0 is the starting learning rate, γ is the decay factor, and n_e is the index of the current epoch. A γ of 0.975 is used for the experiment. This value causes the learning rate to be divided by approximately 10 at epoch 100. It is a relatively quick drop, but as there is a computational constraint on the number of epochs we can use, we aim for models with an initially quick convergence, and use the remaining epochs for more fine-grained model updates. The maximum number of epochs is 120.
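The schedule above amounts to a one-line function (a sketch; the variable names mirror the text):

```python
def scheduled_lr(lr0, gamma, epoch):
    """Exponential learning-rate decay: lr = lr0 * gamma ** epoch."""
    return lr0 * gamma ** epoch

# With gamma = 0.975 the learning rate shrinks by roughly a factor
# of 10 around epoch 100 (0.975**100 ≈ 0.08).
```

PyTorch users would typically get the same behaviour from `torch.optim.lr_scheduler.ExponentialLR`.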
An early stopping mechanism tracks the validation loss of the model at each epoch, and saves a checkpoint of the model whenever a new lowest validation loss is reached. If the validation loss has not decreased in the last seven epochs, the early stopping mechanism kicks in and stops the training. It then loads the model checkpoint with the lowest validation loss. This checkpoint model is used for model evaluation.
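The mechanism can be sketched as a small class (illustrative, not the authors' code; the patience of seven epochs is taken from the text):

```python
class EarlyStopper:
    """Early stopping with checkpointing of the best model."""

    def __init__(self, patience=7):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None          # checkpoint of the best model
        self.epochs_since_best = 0

    def step(self, val_loss, model_state):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss:
            # New lowest validation loss: checkpoint the model.
            self.best_loss = val_loss
            self.best_state = model_state
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        return self.epochs_since_best >= self.patience
```

After `step` returns True, training stops and `best_state` is the checkpoint used for evaluation.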

Hyper-parameter Tuning
Each treatment follows the hyper-parameter tuning cycle and testing procedure shown in Figure 2, which aims to find hyper-parameter values optimal for training the treatment's final model. The tuning happens through a grid search in steps 1 through 4 of the figure, where a model is trained for each possible combination from a fixed set of hyper-parameters to reveal its performance. To ensure a good error estimate, the grid search is performed with 5-fold CV; thus, for each hyper-parameter combination, five models are trained. To compute a performance measure for a combination, the ground truth labels and the predicted labels from the five models are concatenated and used as input for the performance measure calculations. The performance measure used for tuning is the F1-score, which assigns importance to correctly classifying the positive class. The violence risk assessment dataset is strongly imbalanced, and correctly classifying patients exhibiting violence is deemed more important than correctly classifying patients who do not exhibit such behaviour, so accurately evaluating true positives among the positive predictions is key. When the combination with the best F1-score is determined in step 5, the final model is trained on the data from all five folds in step 6. After training, it is evaluated on the held-out testing data in step 7.
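The selection criterion above — F1 computed over the concatenated out-of-fold predictions — can be sketched as (hypothetical names; a minimal sketch rather than the actual tuning code):

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class over concatenated CV predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if 2 * tp + fp + fn == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def best_combination(grid_results):
    """grid_results maps each hyper-parameter combination to its
    concatenated (y_true, y_pred) pair; pick the highest F1."""
    return max(grid_results, key=lambda c: f1_score(*grid_results[c]))
```

Concatenating the five folds' predictions before scoring gives one F1 value per combination, rather than averaging five per-fold scores.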

Treatment Validation
Each treatment is validated by testing its final model on the held-out testing data, which contains only new data points based on the admission timestamp; this corresponds to the final step of the hyper-parameter tuning cycle (Figure 2). The testing data from institutions A and B, and the combined testing data, are fed into each final model. Confidence intervals for the differences between treatments are computed following Cumming & Finch (2005). This too will yield a distribution with 95% confidence intervals. If the confidence interval whiskers exclude zero, then the difference is statistically significant. An important side-note for this method is that bootstrapping is ideally performed on the training set. Due to computational limitations, we performed it on the test set, to see how much our specific trained models vary in their performance when the test set is modified slightly through bootstrapping.

Federated Learning Implementation
The Python library PySyft is used for simulating a federated setting on a single device. We simulate two institutional devices and a central server.
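PySyft's API has changed considerably across versions, so rather than reproduce library calls, the following plain-Python sketch illustrates one federated round with two simulated institutions on a toy one-parameter least-squares model (all names are hypothetical; this stands in for the neural network used in the study):

```python
def local_update(w, data, lr=0.1, epochs=1):
    """One institution's local training: gradient descent on a
    one-parameter least-squares model, y ≈ w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, institutions, lr=0.1):
    """One FL round: broadcast the global weight, train locally at
    each simulated institution, then aggregate with a size-weighted
    (FedAvg-style) average."""
    updates = [local_update(w_global, data, lr) for data in institutions]
    sizes = [len(data) for data in institutions]
    total = sum(sizes)
    return sum(w * n / total for w, n in zip(updates, sizes))
```

Only the updated parameters travel between the institutions and the server; the raw (x, y) data never leave their institution, which is the property that makes FL attractive here.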

Data Splitting
The violence risk assessment dataset contains a total of 4005 data points after removing overlapping patients from the training/validation set.

Table 4 displays the relationship between the cross-validation F1-scores and the F1-scores obtained by applying the treatments to the held-out test set. The goal of the grid search is to find the combination with the highest F1-score (CV Max F1), and thereby a combination with a comparably high F1-score on the held-out testing set. For the final models of the data-centralised, federated, and institution A treatments, the F1-score on their own test set is higher than the CV Max F1-score. This indicates that the hyper-parameters picked during cross-validation provide a decent F1-score on the held-out test set. Only for institution B was the opposite true, as the F1-score on its own test set is lower. In an ideal situation, a similar F1-score is preferred, as the cross-validation would then provide the most realistic carry-over value.

Performance measures
We observe in Table 5

Bootstrapped F1-scores comparison
For a given performance measure, the confidence intervals are calculated in two ways. Both methods rely on bootstrapping of the ground truth labels and the predictions, based on 10 000 resamplings of the testing sets. The first method computes a performance measure for each bootstrapped sample. This results in a distribution of a given measure's scores with 10 000 data points. Then the two-tailed confidence intervals are calculated using percentiles (CI: 95%). The confidence intervals of this method, using the F1-score as a performance measure, are illustrated in Figure 3a. The second method compares the bootstrapped F1-score distribution of each non-federated treatment to that of the federated treatment. Given two treatments, the difference in a performance measure is calculated for each bootstrapped sample. These differences provide a new distribution for which the confidence intervals are calculated (CI: 95%). This method is illustrated in Figure 3b.
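The second method — a percentile bootstrap over paired differences — can be sketched as follows (a minimal sketch with hypothetical names, not the authors' analysis code):

```python
import random

def bootstrap_diff_ci(y_true, y_pred_a, y_pred_b, metric,
                      n_resamples=10_000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for metric(A) - metric(B) on a
    shared test set. Both models are evaluated on the SAME
    resampled indices, so the differences are paired."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [y_pred_a[i] for i in idx]
        b = [y_pred_b[i] for i in idx]
        diffs.append(metric(t, a) - metric(t, b))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi  # an interval excluding zero suggests significance
```

With `metric` set to the F1-score, this reproduces the scheme behind Figure 3b: if the 95% interval of differences excludes zero, the two treatments differ significantly on this test set.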

Prediction comparisons
To provide a more in-depth comparison of each treatment's predictions, the confusion matrices and contingency tables are displayed in Tables 6 and 7, respectively. We observe significant differences between the predictions of the two local models and those of the data-centralised and federated models. Comparing the predictions of the data-centralised and federated models alone reveals highly similar predictions. To provide more insight into the data structure and the classifications, t-SNE and PCA analyses were performed on the testing dataset of the federated model.

Statistical significance
When tested on the combined testing data, the federated model achieved an F1-score of 0.388 and the data-centralised model achieved 0.397. This is in line with our expectation that both models would perform on par with one another.
We also report on

Model Differences
The confusion matrices of all models are displayed in Table 6. Tables 7a and 7b show that there is significant disagreement between the two local models and the federated model. Table 7c shows that the highest level of agreement is between the data-centralised and federated models, which disagree on only 36 (21 + 15) data points.
To see the extent to which all models agree with one another, Table 8 shows the data points misclassified by all models, raising the question of whether these data points have anything in common that causes them to be classified incorrectly.

Limitations
The first limitation of this study concerns the Doc2Vec model. Another limitation relates to privacy: FL on its own can be complemented with privacy-preserving techniques such as differential privacy (Dwork, 2006), homomorphic encryption (Gentry, 2009), and secure multiparty computation (Yao, 1982). These techniques might alleviate additional privacy concerns, but could also negatively impact model performance. To guarantee a high level of privacy to admitted patients, combining FL with these techniques might be required.
Lastly, we observed a large variance in the performance measures and believe this can be attributed to the small test set with a low number of positive samples.
Because of this, there is a high probability that a bootstrapped sample contains a skewed class distribution, which has a high impact on the variance of F1-scores.

Conclusions
Violence Risk Assessment (VRA), like many other clinical tasks, can be tackled with Machine Learning methods. In the psychiatry domain, NLP methods are particularly interesting thanks to the abundance of clinical notes containing valuable information. NLP models benefit enormously from bigger and more diverse datasets, such as can be acquired by working across multiple institutions. Since data sharing among institutions is not possible, we have developed a Federated Learning (FL) pipeline for training an algorithm for VRA. We found no performance loss from using FL as opposed to a data-centralised approach. Moreover, FL appears to improve on locally trained models when they are tested on another institution's data. To the best of our knowledge, this is the first application of FL to NLP on clinical texts.
The results suggest that there are benefits to using federated models and this should be investigated further with cross-institutional datasets. Not only would this provide insights into real-life deployment, it would also lead to more data points for training and testing and could help to decrease performance measure variance.
In future work, we plan to train document embeddings in a federated environment. Furthermore, we will investigate how FL can help solve other clinical tasks, such as text de-identification. Finally, we plan to investigate the possibility of incorporating additional privacy-preserving technologies, such as differential privacy.