Evaluation of machine learning-based models for prediction of clinical deterioration: A systematic literature review

Background and objective: Early identification of patients at risk of deterioration can prevent life-threatening adverse events and shorten length of stay. Although numerous models have been applied to predict patient clinical deterioration, most are based on vital signs and have methodological shortcomings that prevent them from providing accurate estimates of deterioration risk. The aim of this systematic review is to examine the effectiveness, challenges, and limitations of using machine learning (ML) techniques to predict patient clinical deterioration in hospital settings.
Methods: A systematic review was performed in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines using EMBASE, MEDLINE Complete, CINAHL Complete, and IEEE Xplore databases. Citation searching was carried out for studies that met inclusion criteria. Two reviewers used the inclusion/exclusion criteria to independently screen studies and extract data. To address any discrepancies in the screening process, the two reviewers discussed their findings and a third reviewer was consulted as needed to reach a consensus. Studies focusing on use of ML in predicting patient clinical deterioration that were published from inception to July 2022 were included.
Results: A total of 29 primary studies that evaluated ML models to predict patient clinical deterioration were identified. After reviewing these studies, we found that 15 types of ML techniques have been employed to predict patient clinical deterioration. While six studies used a single technique exclusively, several others utilised a combination of classical techniques, unsupervised and supervised learning, as well as other novel techniques. Depending on which ML model was applied and the type of input features, ML models predicted outcomes with an area under the curve from 0.55 to 0.99.
Conclusions: Numerous ML methods have been employed to automate the identification of patient deterioration. Despite these advancements, there is still a need for further investigation to examine the application and effectiveness of these methods in real-world situations.


Introduction
Clinical deterioration can occur during the course of a patient's hospitalisation. "A deteriorating patient is one who moves from one clinical state to a worse clinical state which increases their individual risk of morbidity, including organ dysfunction, protracted hospital stay, disability, or death" [23], p. 1033. Early detection and rapid response to hospitalised deteriorating patients may result in achieving optimal patient outcomes and minimising interventions required to stabilise patients' conditions [6]. In recent years, proactive clinical processes and systems have been developed in many countries to support the provision of appropriate and timely care to patients whose conditions are acutely deteriorating [3]. An early warning system (EWS) is commonly used to predict the likelihood of patient deterioration in hospital by employing vital signs such as heart rate, respiratory rate, blood pressure, peripheral oxygen saturation, temperature, and sometimes the level of consciousness [14]. An aggregate-weighted EWS assigns weights to each of these vital signs and characteristics based on pre-defined trigger thresholds. An overall aggregate score is calculated by the summation of each score multiplied by its weight. However, aggregate-weighted EWSs have some limitations in predicting patient deterioration. For example, they are not able to define complex relationships or patterns in empirical data, and the score for each input is calculated independently [13]. In contrast, machine learning (ML) involves algorithms that learn from patterns and complex relationships in data rather than relying on a rule-based approach to enable users to make informed decisions.
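The aggregate-weighted scoring described above can be illustrated with a minimal Python sketch. The thresholds, band scores, and weights below are hypothetical placeholders chosen only for illustration; they are not taken from any published EWS or from the studies reviewed, and the names `Band`, `BANDS`, `WEIGHTS`, `band_score`, and `aggregate_score` are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    low: float   # inclusive lower bound of the trigger band
    high: float  # exclusive upper bound of the trigger band
    score: int   # score assigned when the value falls in this band


# Hypothetical trigger thresholds (illustrative only, not clinical guidance).
BANDS = {
    "heart_rate": [
        Band(0, 40, 3), Band(40, 51, 1), Band(51, 101, 0),
        Band(101, 111, 1), Band(111, 130, 2), Band(130, float("inf"), 3),
    ],
    "respiratory_rate": [
        Band(0, 9, 3), Band(9, 12, 1), Band(12, 21, 0),
        Band(21, 25, 2), Band(25, float("inf"), 3),
    ],
    "spo2": [
        Band(0, 92, 3), Band(92, 94, 2), Band(94, 96, 1), Band(96, 101, 0),
    ],
}

# Hypothetical per-vital weights (illustrative only).
WEIGHTS = {"heart_rate": 1.0, "respiratory_rate": 2.0, "spo2": 1.0}


def band_score(vital: str, value: float) -> int:
    """Return the score of the band containing `value` for one vital sign."""
    for band in BANDS[vital]:
        if band.low <= value < band.high:
            return band.score
    raise ValueError(f"{vital}={value} is outside all defined bands")


def aggregate_score(observations: dict) -> float:
    """Sum each vital sign's band score multiplied by its weight."""
    return sum(WEIGHTS[v] * band_score(v, x) for v, x in observations.items())
```

Calling `aggregate_score({"heart_rate": 115, "respiratory_rate": 22, "spo2": 97})` scores each vital sign on its own and then sums the weighted scores. Because no term depends on more than one input, the sketch also makes the stated limitation concrete: interactions between vital signs cannot be represented.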
In recent years, the number of studies using ML to predict patient deterioration has grown rapidly, although there is no general model that can be used reliably in practice yet. There have been several review studies of ML research that aimed to assess and evaluate the employment of models for the prediction of patient deterioration [10,11,27,31,33,34]. Supplementary File 1 (Appendix A) provides a summary of these studies based on type, aim, setting, primary outcomes, number reviewed, and period. These review studies have not provided a focused evaluation of ML models for various primary outcomes of patient clinical deterioration across settings. However, general ML models have been developed and validated for a wider range of wards to which patients are admitted. With this in mind, we performed an umbrella investigation to identify and analyse quantitative studies developing, utilising and/or integrating ML to detect and/or predict clinical deterioration in all hospital settings based primarily on routinely collected data from patients. Specific review questions included:
• Have ML algorithms been able to detect real-time and/or predict future patient clinical deterioration risks with high accuracy?
• What types of ML algorithms have demonstrated better performance?
• How has previous research optimised the performance of ML algorithms in detection and/or prediction of clinical deterioration?
• What types of input features have been utilised to detect and/or predict patient clinical deterioration with high accuracy?
• Do ML models have the potential to detect and/or predict patient clinical deterioration in order to incorporate them into the process of decision making?
• If a patient clinical deterioration model has the potential to detect and/or predict patient deterioration, why is there a gap between development and deployment of algorithms?
• What are the barriers to effective implementation of patient clinical deterioration ML models in clinical settings?
• Which patient clinical deterioration detection/prediction models have described practical aspects of model design, such as actionability, safety and utility of the model?
• Has user-experience design considered the merging of a developed algorithm into clinical practice?
• What are the technical and non-technical barriers related to the creation, validation, and deployment of ML models in hospitals along with the development of these methods, as well as the ethical and societal implications of their adoption?

Methods
This systematic literature review was conducted following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [30] and the protocol was registered on PROSPERO (registration number CRD42022340731).

Search strategy and data sources
A three-step search process was undertaken in conducting this review. In the first step, with the assistance of a health research librarian, initial index terms (i.e., thesaurus and subject headings) and MeSH terminology were adapted to suit different databases. A primary limited search of EMBASE, CINAHL Complete, and MEDLINE Complete was then used to analyse text words in titles and abstracts, as well as keywords. Following this analysis, the final keywords related to patient clinical deterioration and ML were chosen to develop the search strategy.
The search strategy was tailored to each database according to its requirements and followed a general pattern. In the second step, a comprehensive systematic search was conducted using the search strategy in four electronic bibliographic databases: EMBASE, MEDLINE Complete, CINAHL Complete, and IEEE Xplore (refer to Supplementary File 1, Appendix B for details). The searches were undertaken on 21 July 2022, with no time limitations, and the authors continued to add new results up to August 2022. In the third step, a cited reference search was conducted to identify relevant studies.

Inclusion criteria, exclusion criteria, and data extraction
Studies were included if:
• They were published, peer-reviewed research (e.g., original research, systematic reviews, and meta-analyses), including review papers, that developed, evaluated and implemented ML models for application in patient clinical deterioration prediction in hospital settings.
• They used at least one ML technique for patient deterioration prediction in hospital settings.
• They were conducted in adult patients (aged 18 years or older).
• They were published in English (due to the team's inability to translate articles).
• Electronic database searches for all peer-reviewed research in English were completed by August 2022.
Exclusion criteria were:
• Literature reviews that incorporated theoretical studies or opinions as primary sources of evidence.

Study selection and data collection
Search results were imported into Rayyan (https://www.rayyan.ai/) where duplicates were removed. Two reviewers (SJ and GO) used the inclusion/exclusion criteria to screen titles and abstracts and then to identify potentially relevant articles. Full texts for all potentially relevant articles were retrieved for full-text screening. Any discrepancies in screening between reviewers were resolved through discussion, and a third reviewer (BS) was available if agreement was not reached. Once included articles were identified, their reference lists were scrutinised for other relevant articles. A PRISMA study flow diagram was used to visually represent the selection process at each stage (Fig. 1).
A data abstraction form was developed and used to record standardised information from each article. The key information included authors, year, location and setting of study, aims, objectives, methods (including ML algorithms used, input features, output classes, data sources, inclusion and exclusion criteria, handling of missing data, size of training set, size of validation set/validation method, size of test set, prediction type, and statistical measures used for evaluation), findings (predictor variables and performance metrics), and limitations of the study. Using this form, we categorised each article based on the type of ML algorithm, as well as outcomes reported. A SWOT (Strengths, Weaknesses, Opportunities, and Threats) analysis of the most frequently used ML models in the studies reviewed was also performed. Recognition of these items will facilitate a reasonable use, application, and interpretation of all the opportunities that ML offers in the field of clinical deterioration.

Assessment of research quality
The risk and sources of potential bias of all selected articles were assessed by two researchers (SJ and GO) using a standard questionnaire consisting of four items [35]. Once completed, assessment results were compared, and discrepancies were discussed and moderated by the two reviewers until agreement was achieved. Studies that met all criteria were awarded a score of 2, those that moderately satisfied criteria were awarded a score of 1 and studies that did not satisfy any criteria were awarded a score of 0. Scores were aggregated across items. Studies that received an overall score of 5 or higher on this scale were deemed high quality, while those with a score of 4 were rated average. Studies with a score lower than 4 (out of a total of 8) were considered to be of unacceptable quality and were eliminated from consideration in this systematic literature review. The 4-point threshold was based on the questionnaire guidelines and was decided through discussions involving two reviewers to reduce the risk of bias in the quality assessment.
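The scoring rule above can be written out as a short Python sketch. It assumes, as the text implies, that each of the four questionnaire items receives a score of 0, 1, or 2 and that the item scores are summed to a total out of 8; the function name `rate_quality` and the category labels are hypothetical.

```python
def rate_quality(item_scores):
    """Aggregate four item scores (each 0, 1, or 2) and map the total to a rating.

    Thresholds follow the text: total >= 5 is high quality, total == 4 is
    average, and total < 4 leads to exclusion from the review.
    """
    if len(item_scores) != 4:
        raise ValueError("expected exactly four item scores")
    if any(score not in (0, 1, 2) for score in item_scores):
        raise ValueError("each item score must be 0, 1, or 2")

    total = sum(item_scores)
    if total >= 5:
        rating = "high"
    elif total == 4:
        rating = "average"
    else:
        rating = "excluded"
    return total, rating
```

For example, a study scoring 2, 2, 2, and 1 on the four items totals 7 and is rated high quality, whereas a study scoring 1, 1, 1, and 0 totals 3 and is excluded.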

Results
Among 1,330 studies identified, the titles and abstracts of 1,018 unique papers were screened after removing duplicates. After excluding 1,236 records, the remaining 55 articles were assessed in full-text form for eligibility, and a total of 29 articles were considered eligible to be included in this review (Supplementary File 1, Appendix C). Sample size in the studies varied between 60 and 644,257 EMRs (Supplementary File 1, Appendix D). Patient deterioration predictions involved a variety of ML techniques. Given that the review focused on advanced ML-based algorithms, traditional statistical modelling methods, such as logistic regression and variable selection by regularisation, were excluded from the review where they were utilised.

Study population and patient deterioration outcomes
Included studies have suggested different definitions of deterioration. Patient deterioration outcomes were defined as intensive care unit (ICU) transfers, emergency surgery, in-hospital sepsis, cardiopulmonary arrest, chronic heart failure, organ dysfunction or combined outcomes. Cardiac arrest was the primary outcome in five studies, while general cardiorespiratory deterioration or decompensation was the primary outcome in three studies. Another commonly predicted outcome in five studies was sepsis, severe sepsis, septic shock and sepsis-related mortality. Other outcomes explored within the studies included unanticipated ICU admissions, development of critical illnesses, general physiological deterioration, and mortality (Supplementary File 1, Appendix D).

Variables included in ML models
Studies were conducted in a variety of hospital settings. Therefore, the number and type of predictors differed across studies. Regarding study settings, six studies were conducted in ICUs [2,5,16,44,45,51], one study was conducted in an emergency department (ED) [48], one study was conducted in ICUs and EDs [46], two studies were conducted in general wards [26,49], and 16 studies were conducted in non-specified locations (Supplementary File 1, Appendix D). For comparative reasons, variables were categorised into the following patient domains: demographic data, vital signs, laboratory values, images, text data, examination reports, and the Glasgow Coma Scale.

Use of ML
Various ML techniques were applied in the included studies. A total of 22 studies applied more than one ML technique, and the details of these ML techniques are summarised in Table 2. The most common algorithms were regression, artificial neural network (ANN), random forest (RF), and support vector machine (SVM) models (Table 2).

Model performance
A total of 19 studies reported ML algorithms with AUC above 0.70, which was an indication of modest to high discrimination ability. The AUC values reported by these studies ranged from 0.44 to 0.99. In addition to AUC, other performance measures including accuracy, AUC-PR, F1 score, positive/negative predictive value, precision, recall, sensitivity, and specificity have also been reported in the selected studies (Table 3). A total of 20 studies also reported a method to handle missing data (Table 1). As Table 3 shows, accuracy, sensitivity, and specificity are the most used measures in the studies.
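For readers less familiar with these measures, the following Python sketch shows how they are conventionally computed from binary labels, thresholded predictions, and risk scores. It uses the standard textbook definitions rather than any procedure reported by the included studies; the function names are hypothetical, and AUC is computed by the pairwise-comparison (Mann-Whitney) formulation.

```python
def summary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels (1 = deteriorated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # also called recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),          # also called precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive case receives a higher score than a randomly chosen negative
    case, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this formulation an AUC of 0.5 corresponds to chance-level ranking, so a reported value below 0.5 indicates a model that ranks deteriorating patients lower than non-deteriorating ones more often than not.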

Analysis of content results
The results of the analysis indicate that ML-based models have better prediction abilities when compared to early warning systems used in forecasting patient deterioration [24,26]. The majority of the previous studies that examined the applications of ML in patient deterioration analysed the prediction of patient deterioration characteristics. RF models were predominantly adopted in those studies, followed by SVM, LR and ANN, as presented in Table 2. Genetic algorithm and restricted cubic spline models were the least adopted in predicting patient deterioration in the studies included in this review. Hu et al. [21] also suggested that the performance of prediction models may be improved by developing different models for different wards within a hospital instead of developing an overall model for an entire hospital to reflect the different types of patients and care requirements.
In relation to opportunities in adopting ML-based models, this study found that when ML technologies are appropriately implemented into the everyday care of patients, they could help in the development of personalised care models and could supplement clinical decision making, especially in contexts with less experienced clinicians [36]. ML algorithms can be converted into software programs to assist clinicians in making decisions. Regarding the challenges presented by ML-based models, analysis of the content results shows that there are difficulties encountered in adopting the appropriate ML model in patient deterioration. The study found that a majority of the articles reviewed were direct applications of ML models intended to predict various outcomes of patient deterioration. No studies were personalised or tailored, and none of them implemented ML applications in healthcare settings. Several studies reported that designing specific ML models with distinctive features of patient deterioration requires robust collaborative research between multidisciplinary teams of computer scientists and clinicians, with a strong emphasis on a highly-qualified ML technical team. Thus, the development and implementation of ML applications and technologies have been particularly challenging in the area of patient deterioration due to the lack of robust engineering practices to ensure safe and reliable applications. As ML models and technologies are becoming more pervasive and universal, organisations in various sectors, including hospitals, have raised concerns regarding the accountability, transparency, fairness, ethical soundness and comprehensiveness of ML models and technologies. A challenge of ML models and technologies is the continual emergence of new security threats [50].
Consistent with the findings of Jeffery et al. [22], inadequate data has been recognised as a major obstacle affecting the implementation of ML models. Machine learning prediction systems generally must be trained on large amounts of retrospective data from a given hospital, and this process can be burdensome, as it can delay the implementation of ML models. The literature reviewed in this study was undermined by the deficiency and incompleteness of patient deterioration data. The data was non-current and the studies reviewed did not draw strong conclusions about prospective performance in clinical settings. Regardless of the challenges identified in this study, ML-based patient deterioration models are a feasible solution to predict patient deterioration. Adopting patient deterioration ML technologies will not only help in providing clinical teams more time to intervene, but also contribute to improving patient outcomes. Many health conditions can have high costs and eventual mortality in their later stages. However, they can be readily treated in their early stages.
While there are various technologies used in the prediction of patient clinical deterioration, this study has focused on the four most frequently adopted ML applications for modelling and optimisation of patient deterioration processes, including: (1) regression, (2) artificial neural network (ANN), (3) random forest (RF), and (4) support vector machine (SVM) models (Table 3). Table 4 provides a SWOT analysis of the most frequently used ML models in the studies reviewed.
Table fragment (ML techniques and studies, one study per entry, listed in extraction order): Lee et al. [26]; DEWS (consists of 3 recurrent neural network layers with long short-term memory unit); Kwon et al. [25]; combined classic machine learning (ML) and end-to-end deep learning (DL); Gjoreski et al. [19]; Gaussian mixture models (GMM) (Total 1), N/A; Gultepe et al. [20]; Hidden

Quality assessment
All studies were of high quality based on the appraisal of four domains of the adopted questionnaire. Total quality scores ranged from 7 to 8. The full description of the quality assessment for all included studies is summarised in Supplementary File 1, Appendix E.

Discussion
In this systematic literature review, 29 studies involving ML prediction models for patient clinical deterioration were evaluated. These models were developed and tested in a variety of settings and populations using healthcare data from EMR, administrative databases, and clinical data warehouses. Regression, artificial neural network (ANN), random forest (RF) and support vector machine (SVM) models were the most common ML approaches used to predict patient clinical deterioration. There was variation in model performance in terms of AUC across these prediction models. Most studies applied multiple methods to predict patient clinical deterioration. The domain of input variables included demographic variables, admission information, and assessment information in most of the studies for the development of ML prediction models. Overall, quality in most of the studies was high. The high diversity in study settings, samples, input features, ML models, and performance measures among the included literature makes their results incomparable. Therefore, no relative success rate can be obtained for the considered studies.
To our knowledge, this is the first review to provide a focused evaluation of ML models for patient clinical deterioration in all settings. This review suggests the growing importance of ML methods in the area of patient clinical deterioration. Therefore, this systematic review provides insight into incorporating the cutting-edge applications of ML methods into patient clinical deterioration programs. However, the utilisation of ML to promote patient safety is a developing area and most of these applied algorithms have not been externally validated nor tested. Promising performance based on development or internal validation samples may not translate into improvements in real-world healthcare practice since algorithms may be limited in generalisability and affected by the healthcare contexts in which they are implemented [4,9]. In addition to this, lack of methodological reporting and critical appraisal guidelines, particularly with regard to the choice of method and validation of predictions, could be considered a barrier in translation of the studies' findings into clinical practice.
Another important aspect to consider in the development and implementation of ML models for patient clinical deterioration is the ethical and legal implications [15]. There are concerns regarding the potential biases in ML models, especially in healthcare where the consequences of a false positive or false negative prediction can have severe impacts on patients. Additionally, there are issues around the transparency and interpretability of ML models, as healthcare providers may be hesitant to rely on predictions if they cannot understand how the model arrived at its decision. It is important for researchers and developers to prioritise the ethical and legal implications of their ML models, and work towards developing models that are transparent and interpretable. This will not only help to ensure patient safety, but also increase trust and adoption of these models in healthcare practice.
This study also explored the main challenges and limitations in ML adoption in healthcare services. Researchers must be cautious about evaluating clinical decision support systems based on ML analytics before widespread algorithm implementation to guarantee safety and accuracy [18,28,41,47]. From a practical perspective, applied algorithms and tools should be validated across different healthcare systems to report differential performance in various healthcare settings. Furthermore, papers describing model development and performance assessments should report consistent information about validity, biases, and generalisability in other settings. There are also gaps in the identification of both key barriers and facilitators to the implementation of ML-based tools and in the use of users' experience in applying these tools in healthcare settings.

Conclusion
This systematic review showed that the included research used a variety of features, depending on the aim of the research and the availability of datasets, and that various types of algorithms have been applied to develop prediction models. Accordingly, the findings show that no specific conclusion can be drawn regarding the best model characteristics. However, the selected research shows that some ML models have been used more than others. Some of the studies used more than one ML model to identify the model with the best performance.
There have been growing advances in applying ML in the context of healthcare settings. However, there are barriers related to the development, validation, and deployment of machine learning models in hospitals along with the development of these methods. Evaluation of the implementation of ML in healthcare settings to ensure that ML tools are safe and effective, and how these tools could be effective in improving patient safety, requires further study.

Table 3. Performance measures used in included studies (N = 29). Columns: Performance measures; Times used.

Summary table
• The review suggests the growing importance of ML methods in the area of patient clinical deterioration, but there are gaps in the identification of both key barriers and facilitators to the implementation of ML-based tools and in the use of users' experience in applying these tools in healthcare settings.
• Evaluation of the implementation of ML in healthcare settings to ensure that ML tools are safe and effective, and how these tools could be effective in improving patient safety, requires further study.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 4 (fragment)
RF:
• Black box nature of RF models is a fundamental issue in the widespread application of such techniques because they are used to find relationships between given data and results, not to create a rule based on knowledge. When alarms sound, medical staff are unsure of what immediate action to take until the patient is checked (cannot describe relationships within data).
• Required >200 events per variable to achieve stability and perform better in larger datasets.
• Potential impacts of decision support interventions on workflows, nurses' roles, and patients' outcomes.
• Feasibility of implementing an RF algorithm for real-time analysis of electronic health record data demonstrated.
• Missing values and potential variation in accuracy of data. The amount of missing data limits the trustworthiness and clinical applicability of the models.
Kia et al. [24], Giannini
SVM:
• Highly sensitive to selected kernels and tuning variables.
• Feasibility of using SVM classification with feature selection.
• Feasibility of implementing an SVM algorithm with a high precision.
• Missing values and potential variation in accuracy of data. The amount of missing data limits the trustworthiness and clinical applicability of the models.
[46], and Gultepe et al. [20]
Abbreviations: ANN: Artificial Neural Network; RF: Random Forest; and SVM: Support Vector Machine.