Development and Validation of a Novel Scoring System

Rationale: Late recognition of patient deterioration in hospital is associated with worse outcomes, including higher mortality. Despite the widespread introduction of early warning score (EWS) systems and electronic health records, deterioration still goes unrecognized.
Objectives: To develop and externally validate a Hospital-wide Alerting via Electronic Noticeboard (HAVEN) system to identify hospitalized patients at risk of reversible deterioration.
Methods: This was a retrospective cohort study of patients 16 years of age or above admitted to four UK hospitals. The primary outcome was cardiac arrest or unplanned admission to the ICU. We used patient data (vital signs, laboratory tests, comorbidities, and frailty) from one hospital to train a machine-learning model (gradient boosting trees). We internally and externally validated the model and compared its performance with existing scoring systems (including the National EWS, the laboratory-based acute physiology score, and the electronic cardiac arrest risk triage score).
Measurements and Main Results: We developed the HAVEN model using 230,415 patient admissions to a single hospital. We validated HAVEN on 266,295 admissions to four hospitals. HAVEN showed substantially higher discrimination (c-statistic, 0.901 [95% confidence interval, 0.898–0.903]) for the primary outcome within 24 hours of each measurement than other published scoring systems (which ranged from 0.700 [0.696–0.704] to 0.863 [0.860–0.865]). At a precision of 10%, HAVEN identified 42% of cardiac arrests or unplanned ICU admissions up to 48 hours in advance, compared with 22% for the next best system.
Conclusions: The HAVEN machine-learning algorithm for early identification of in-hospital deterioration significantly outperforms other published scores such as the National EWS.

Over 60,000 patients annually deteriorate on UK hospital wards to the extent that they require ICU admission (1). Late or missed recognition of deterioration is associated with worse patient outcomes, including higher mortality (2-4). Over the past 20 years, healthcare systems worldwide have implemented alerting systems to improve the detection of patients at risk of deterioration (5-7). Most are based on abnormalities in patients' vital signs, usually by combining them into an early warning score (EWS). Clinicians are alerted when the EWS rises above a given threshold. Many hospitals also employ rapid response teams to respond to the most critically unwell patients (8). However, there is conflicting evidence that implemented EWS systems or rapid response teams improve patient outcomes (8-11).
Current EWSs were designed to be calculated easily at the bedside when most hospitals recorded observations on paper charts. This simplicity means EWSs cannot account for trends over time, patients with chronically abnormal physiology, or other indicators of deterioration (e.g., acute kidney injury). Consequently, EWSs commonly generate false alerts, risking alarm fatigue and increasing the likelihood that deteriorating patients are missed (12).

At a Glance Commentary
Scientific Knowledge on the Subject: Late recognition of patient deterioration in hospital is associated with worse patient outcomes. Current early warning score systems based purely on vital sign measurements still do not identify the majority of deteriorations without also generating many false alerts.
What This Study Adds to the Field: We used a machine-learning algorithm to combine patients' vital signs with additional physiological measurements, comorbidities, and frailty to create the Hospital-wide Alerting via Electronic Noticeboard scoring system. This model substantially increased the precision with which deteriorating patients could be identified when compared with previously published scores.
Increased uptake of electronic health records (EHRs) facilitates the development of more sophisticated EWSs incorporating additional routinely collected patient data. For example, our group and others have shown that combining laboratory results with vital sign measurements increases the precision with which deteriorating patients can be detected (13-19). Many newer risk scores exploit machine-learning algorithms (13,15,17,20-24). However, few are externally validated (25-27) and fewer still are implemented in the EHR (23). Those that have been are often subject to proprietary licenses, which can limit the research community's ability to validate them (22,23,28,29). Some algorithms also use data, such as detailed nursing assessments, that are not routinely recorded in the EHR (28). A key reason predictive machine-learning models are not clinically implemented is the failure to consider whether they add value in clinical practice (15,30,31). Indeed, we previously argued that even current EWS systems are not optimized to identify patients with reversible deterioration; namely, where intervention is likely to change patient outcomes (32).
In this study, we describe the development and external validation of the Hospital-wide Alerting via Electronic Noticeboard (HAVEN) system to identify patients with potentially reversible deterioration. HAVEN provides a real-time risk assessment, which is continuously updated using patients' vital signs, laboratory test results, and medical histories.

Study Design
This was a multicenter retrospective development and external validation study of a prognostic model, reported in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines (33). Portsmouth Hospitals NHS Trust is a large, acute, district general hospital (hospital A) with approximately 1,250 beds, which provides a full range of elective and emergency medical and surgical services to a local population of around 675,000 (34). Oxford University Hospitals NHS Trust is a hospital group with approximately 1,465 beds, which serves a local population of around 655,000. We included the tertiary referral center for trauma, cardiology, and neurosurgery, which also provides general acute medical and surgical services (hospital B); the specialist renal transplant and cancer referral center (hospital C); and the district general hospital (hospital D). We excluded a hospital performing predominantly elective orthopedic procedures.

Data Sources
The routinely collected data stored across different clinical information systems in all four hospitals were extracted. Data included administrative information for each admission (including dates and times of admission, discharge, and any transfers within the hospital site), diagnoses coded using the International Statistical Classification of Diseases and Related Health Problems, 10th revision (ICD-10), laboratory results (including hematology, biochemistry, and microbiology results), vital signs, and patient demographics.

Participants
We included all patients (aged 16 years or above) admitted to hospital A from January 2012 to December 2017 or admitted to hospitals B-D from January 2016 to December 2017.
Admissions with no recorded vital signs were excluded to ensure a minimum required data set for score computation.
The training cohort comprised admissions to hospital A from January 2012 to December 2015. The primary test cohort combined admissions to hospitals A-D between January 2016 and December 2017.

Outcomes
Our primary outcome was a composite of in-hospital cardiac arrest and unplanned admission to the ICU not preceded by surgery within the previous 24 hours. ICU admissions shortly after surgery were excluded because deterioration may occur during the procedure rather than on the ward. Secondary outcomes were unplanned admission to the ICU not preceded by surgery within the previous 24 hours and in-hospital cardiac arrest, considered separately. A third secondary outcome of all unplanned ICU admissions was included to determine the effect of including unplanned ICU admissions preceded by surgery within 24 hours.

Predictors
We identified potential variables for inclusion in the model through a systematic literature search (35) and expert suggestions, followed by an expert panel review. The expert panel comprised critical care nurses and doctors, alongside a statistician and a senior general physician. The panel undertook a modified Delphi process to consider additional variables useful in determining patients' risk of deterioration. Consensus was reached after two discussion rounds, resulting in a final list of 76 candidate variables.
Each patient admission was represented by static (time-invariant) and dynamic (time-varying) variables.
As static variables, we included the patient's age and sex at admission to the hospital. We also encoded the presence or absence of comorbidities using ICD-10 diagnosis codes. Because diagnostic coding in the United Kingdom typically occurs after discharge from the hospital, this information is not available electronically unless the patient has previously been admitted to the same hospital. We extracted unique diagnostic codes from previous admissions over the 2 years before the hospital admission under study. Diagnostic codes were grouped into 30 categories according to Elixhauser (36), giving 30 binary features that encode whether patients had common chronic diseases, such as congestive heart failure or chronic lung disease. We further calculated smoking status (using the ICD-10 codes F17, Z716, and Z720), the Hospital Frailty Risk Score (37), and the total length of all hospitalizations in the 2 prior years.
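For illustration, the following minimal Python sketch (not part of the study code) shows how such binary static features might be derived from prior-admission ICD-10 codes; the smoking prefixes are those listed above, whereas the heart failure prefix and the variable `prior_codes` are hypothetical.

```python
# Sketch only (not the study code): binary static features derived from a
# patient's prior-admission ICD-10 codes. The smoking prefixes are those
# named in the text; the heart failure prefix is an illustrative example
# of one Elixhauser category.
SMOKING_PREFIXES = ("F17", "Z716", "Z720")
CHF_PREFIXES = ("I50",)  # hypothetical example prefix for congestive heart failure

def has_any_code(prior_codes: set[str], prefixes: tuple[str, ...]) -> int:
    """Return 1 if any prior diagnostic code starts with one of the prefixes."""
    return int(any(code.startswith(p) for code in prior_codes for p in prefixes))

# `prior_codes` would hold the unique ICD-10 codes from admissions in the
# preceding 2 years, e.g. {"I500", "F171"}.
prior_codes = {"I500", "F171"}
static_features = {
    "smoker": has_any_code(prior_codes, SMOKING_PREFIXES),
    "congestive_heart_failure": has_any_code(prior_codes, CHF_PREFIXES),
}
```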
As dynamic variables, we included commonly measured laboratory values and vital signs and the estimated inspired oxygen fraction. A variable list is provided in the online supplement (SECTION D).
We designed HAVEN to recalculate a patient's deterioration risk each time a new variable is recorded. When one time-varying variable is measured, other variables often are not. We therefore included the most recent measured value for each physiological and laboratory result variable at each time point (equivalent to a last value carried forward imputation). To capture how variables change over time, we also calculated two derived features before imputation: a 24-hour variability index for physiological variables (38) (defined as the difference between the maximum and minimum values over the preceding 24 h) and the maximum and minimum values of laboratory results recorded during the patient's admission before the time point (both including the current measurement).
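A minimal pandas sketch of this feature construction for a single admission is shown below; it assumes a hypothetical wide data frame `obs` indexed by measurement time, with one column per variable and missing values where a variable was not measured, and it is illustrative rather than the authors' implementation.

```python
import pandas as pd

def dynamic_features_for_admission(obs: pd.DataFrame) -> pd.DataFrame:
    """Illustrative feature construction for one admission.

    `obs` is assumed to be indexed by measurement time (DatetimeIndex), with
    one column per vital sign or laboratory variable and NaN where a
    variable was not measured at that time point.
    """
    obs = obs.sort_index()

    # Most recent value of every variable at each time point
    # (last value carried forward imputation).
    feats = obs.ffill()

    for col in obs.columns:
        # 24-hour variability index: range of values observed in the
        # preceding 24 hours (including the current measurement).
        window = obs[col].rolling("24h")
        feats[f"{col}_var24h"] = window.max() - window.min()

        # Running extremes since admission, as used for laboratory results.
        feats[f"{col}_max"] = obs[col].cummax().ffill()
        feats[f"{col}_min"] = obs[col].cummin().ffill()

    return feats
```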

Missing Data
Distributions of variables were inspected manually. A clinical expert panel identified "biologically implausible" ranges, with values outside these ranges defined as missing.
The remaining missing values were imputed with the median (or mode for dichotomous variables) of each variable from the training set. Although other methods, such as multiple imputation (39), were considered, we used the median and mode to simulate the HAVEN implementation within a live clinical system.
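As an illustration (assuming hypothetical frames `train_X` and `test_X`), the imputation constants would be learned on the training set only and reused unchanged at prediction time, mirroring a live clinical deployment:

```python
import pandas as pd

# Imputation constants are learned on the training set only and then applied
# unchanged at prediction time, mirroring a live deployment.
numeric_cols = train_X.select_dtypes(include="number").columns
binary_cols = train_X.columns.difference(numeric_cols)

fill_values = pd.concat([
    train_X[numeric_cols].median(),       # median for continuous variables
    train_X[binary_cols].mode().iloc[0],  # mode for dichotomous variables
])

train_X = train_X.fillna(fill_values)
test_X = test_X.fillna(fill_values)       # same constants reused on the test set
```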

Statistical Analysis
Model development. We trained the HAVEN system by generating the set of features for each time point in which a new measurement (vital sign or laboratory test) occurred. We labeled each time point as "positive" if the primary outcome occurred within 24 hours. We used a gradient boosting machine with decision trees, as implemented in the XGBoost library (40). XGBoost has a number of hyperparameters (e.g., the depth of the component decision trees) that are modifiable to produce the best model. One of these hyperparameters changes the relative weighting between the positive and negative classes, which can improve model performance in unbalanced data sets. To discover the optimal hyperparameters, we used a random search (500 permutations) and selected the model with the highest c-statistic (using a fivefold cross-validation procedure), using the first 3 years of data in the training set.
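A hedged sketch of this training procedure, using the scikit-learn and XGBoost interfaces, is given below; the parameter ranges and variable names (e.g., `X_train_first3y`) are illustrative assumptions rather than the published configuration.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Illustrative search space only; the published hyperparameter ranges are not
# reproduced here.
param_distributions = {
    "max_depth": randint(3, 10),           # depth of the component trees
    "n_estimators": randint(100, 1000),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.5, 0.5),
    "colsample_bytree": uniform(0.5, 0.5),
    "scale_pos_weight": uniform(1, 99),    # re-weights the rare positive class
}

search = RandomizedSearchCV(
    XGBClassifier(objective="binary:logistic", eval_metric="auc"),
    param_distributions,
    n_iter=500,          # 500 random permutations, as described above
    scoring="roc_auc",   # model selected on the c-statistic
    cv=5,                # fivefold cross-validation
    n_jobs=-1,
)
search.fit(X_train_first3y, y_train_first3y)  # first 3 years of training data
best_model = search.best_estimator_
```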
Optimal model predictions were recalibrated on the training set's final year of data to reflect the frequency of observed outcomes using isotonic regression (41). Uncalibrated and calibrated predictions were compared using reliability plots (41).
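The recalibration step could be sketched as follows (variable names assumed; not the authors' code), mapping the raw model scores onto observed outcome frequencies with isotonic regression fitted on the final year of training data:

```python
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression

# Raw scores from the selected model on the final year of training data.
raw_scores = best_model.predict_proba(X_train_final_year)[:, 1]

# Isotonic regression maps raw scores onto observed outcome frequencies.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, y_train_final_year)

# Calibrated risk for new observations.
calibrated_risk = calibrator.predict(best_model.predict_proba(X_test)[:, 1])

# Inputs for a reliability plot: observed event rate per bin of predicted risk.
frac_positive, mean_predicted = calibration_curve(y_test, calibrated_risk, n_bins=10)
```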
In addition to the gradient boosting machine, we trained, optimized, and validated four alternative machine-learning models: a Random Forest, a Generalized Additive Model, and both L1-regularized (Lasso regression) and L2-regularized logistic regression models (see Table EA6 in the online supplement).
Model evaluation. We evaluated risk prediction model performance using the test set containing data from all four hospitals. In line with Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis guidance, we report results for individual hospitals and for the three hospitals not used to develop HAVEN (33). We report model performance using discrimination and calibration metrics computed at both the "observation" and "patient admission" levels. We designed HAVEN to identify patients at risk of deterioration on hospital wards rather than to identify direct admissions from the emergency department; for this reason, scores generated from emergency department measurements were excluded.
At the observation level, we calculated the area under the curve (AUC) for the receiver operating characteristic (ROC) curve for our outcome measures occurring within the subsequent 12-, 24-, and 48-hour periods of each measurement (i.e., each time a measurement is recorded). The ROC AUC (c-statistic) measures discrimination, corresponding to the probability that patients who experience the outcome will be ranked above those who do not. As the outcomes are relatively rare (there are many more patients who go home without an event than there are patients who have an unplanned ICU admission or a cardiac arrest), we also computed the AUC for the precision-recall (PR) curve, which can be informative in class-imbalanced data sets (42,43). The PR AUC shows the trade-off between precision (positive predictive value) and recall (sensitivity) at each threshold. The closer the PR AUC is to 1, the greater the ability of the score or model to detect true cases (recall) with high precision over the range of thresholds. Calibration curves for selected models were determined for outcome occurrence within 24 hours of each measurement.
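In code, the observation-level metrics reduce to standard library calls, shown here as a short sketch with hypothetical variable names:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Each row is one measurement time point, labelled positive if the outcome
# occurred within the chosen window (12, 24, or 48 h).
roc_auc = roc_auc_score(y_within_24h, calibrated_risk)           # c-statistic
pr_auc = average_precision_score(y_within_24h, calibrated_risk)  # common PR AUC summary
```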
The sequential nature of predictions means the total number of positive time steps (in which the outcome occurs within n hours) does not directly correspond to the number of patients experiencing the outcome. Multiple positive time steps may be associated with a single adverse event. To assess the clinical applicability of the proposed model, we calculated the "patient admission sensitivity" at different degrees of precision (5%, 10%, 20%). These precisions correspond to evaluating 20, 10, and 5 patients, respectively, for each true-positive result, also known as the number needed to evaluate (NNE) (44). For each degree of precision, a patient admission was considered a false-positive result if they had at least one score above the threshold and no adverse event occurred. True-positive results were patient admissions with at least one score above the threshold in the n hours before an adverse event. We examined the sensitivity of the model over different prediction time windows preceding the event (up to 48 h). To further evaluate clinical utility, we performed a decision curve analysis (45-48).
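One possible implementation of this admission-level evaluation is sketched below; the data frame layout and column names are assumptions made for illustration, not the study code.

```python
import numpy as np
import pandas as pd

def admission_sensitivity_at_precision(scores: pd.DataFrame,
                                       target_precision: float) -> float:
    """Rough sketch of the admission-level evaluation described above.

    `scores` is assumed to hold one row per scored time point with
    hypothetical columns: 'admission_id', 'risk', 'alert_in_window'
    (score falls within the n hours before an adverse event), and
    'had_event' (the admission had any adverse event).
    """
    n_event_admissions = scores.loc[scores["had_event"] == 1,
                                    "admission_id"].nunique()
    # Sweep thresholds from lowest to highest and take the most sensitive
    # operating point that still achieves the target precision (e.g., 10%,
    # i.e., a number needed to evaluate of 10).
    for threshold in np.sort(scores["risk"].unique()):
        flagged = scores[scores["risk"] >= threshold].groupby("admission_id")
        tp = (flagged["alert_in_window"].max() == 1).sum()  # alert before an event
        fp = (flagged["had_event"].max() == 0).sum()        # alert, no event at all
        if (tp + fp) > 0 and tp / (tp + fp) >= target_precision:
            return tp / n_event_admissions                  # admission-level sensitivity
    return 0.0
```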
All 95% confidence intervals (CIs) were calculated using bootstrapping (200 samples) (49). We used the Shapley additive explanation algorithm (50) to calculate the relative "importance" of each predictor in the final model (see SECTION F of the online supplement).
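A brief sketch of both steps (bootstrap CIs and Shapley-based importance) follows; variable names are assumed, and the resampling loop presumes each bootstrap sample contains both outcome classes.

```python
import numpy as np
import shap
from sklearn.metrics import roc_auc_score

# Bootstrap 95% CI for the c-statistic (200 resamples, as in the study);
# assumes `y_test` is a pandas Series and `calibrated_risk` a NumPy array.
rng = np.random.default_rng(0)
aucs = []
for _ in range(200):
    idx = rng.integers(0, len(y_test), len(y_test))  # resample with replacement
    aucs.append(roc_auc_score(y_test.iloc[idx], calibrated_risk[idx]))
ci_low, ci_high = np.percentile(aucs, [2.5, 97.5])

# Shapley additive explanations for the tree-based model: the mean absolute
# contribution of each predictor gives its relative importance.
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
importance = np.abs(shap_values).mean(axis=0)
```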

Comparison with Published Risk Scoring Systems
We compared HAVEN score performance with established EWS systems: the centile-based EWS (51), the modified EWS (52), the standardized EWS (53), the National EWS (NEWS) (54), and the cardiac arrest risk triage (CART) score (55). We also compared it with three physiological scoring systems: the NEWS:LDTEWS (13), the electronic CART (eCART) score (56), and the laboratory-based acute physiology score (LAPS-2) (57). We excluded scoring systems in which the coefficients were unpublished or where data (e.g., nursing assessments) were not routinely recorded in our study sites (22,58). Further details of EWSs and other scoring systems are shown in the online supplement (SECTION C).

Results
After exclusions, we included 496,710 unique admissions to four hospitals. The training set included 230,415 admissions (from 113,450 patients) to hospital A.
There were 266,295 admissions (159,182 patients) to the four hospitals (A-D) in the test set. The two cohorts had similar patient characteristics (Table 1), both with a slightly higher proportion of female patients (around 53%) and a median age of 62-63 years.

In the test cohort, 31% of admissions to the four hospitals (A-D) were elective, with a median hospital stay of 1.36 (interquartile range, 0.36-4.76) days. Hospital mortality was approximately 3%. In approximately 1% of admissions, patients had an unplanned ICU admission without visiting the operating theater in the preceding 24 hours. A cardiac arrest occurred during 647 admissions (0.2%). There was some variability in patient characteristics across the four hospitals (see Table EA1). Hospital C had a higher proportion of elective admissions (55.6%), a lower mortality rate (1.9%), and a higher rate of unplanned ICU admissions (3.9%) than the other hospitals. Class imbalance and the extent of missing data are reported in the online supplement (SECTION E).
The calibration curve in the combined test set is shown in Figure 1. Table 2 shows HAVEN model performance on the test set for predicting the observation-level primary outcome (unplanned ICU admission or cardiac arrest) within different time windows.
ROC AUC values increase as the time window moves closer to the event, from 0.881 (95% CI, 0.879-0.883) within the following 48 hours to 0.921 (95% CI, 0.919-0.924) within the following 12 hours. A similar trend in ROC AUC values occurs for the individual secondary outcomes (Table 2). HAVEN model performance (by either ROC or PR AUC) was higher for predicting unplanned ICU admissions than for cardiac arrests (Table 2). The average contributions ("feature importance") of individual predictors are shown in the online supplement (SECTION F).
HAVEN performance was higher than that of all other published EWSs and risk scores when predicting the primary outcome, measured by either the ROC AUC or the PR AUC (Table 3). For example, for a time window of 24 hours, HAVEN had a ROC AUC of 0.901 (95% CI, 0.898-0.903), whereas LAPS-2, the next best-performing scoring system, had a ROC AUC of 0.863 (0.860-0.865). This improved performance remained when testing was restricted to individual hospitals (Tables EA2 and EA4) and to the three test hospitals (B-D) where HAVEN had not been developed (Table EA5). HAVEN performed as well as or better than all other EWSs and risk scores for the individual secondary outcomes (see Table EA3). Figure 2 shows the patient admission-level sensitivity of HAVEN for different prediction time windows at three fixed degrees of precision. A greater proportion of events was correctly predicted as outcomes closer to the prediction point were included. At 10% precision (NNE = 10), HAVEN identified 42% of adverse events occurring in the subsequent <1 to 48 hours and 27% of adverse events occurring between 12 and 48 hours after the prediction point. In comparison, LAPS-2 identified 22% and 14% of adverse events in the corresponding time periods (Figure EB1). NEWS and LAPS-2 performed similarly. The total number of events becomes smaller as the window duration decreases. Nearly all patients were in the hospital for an hour before an event, but progressively fewer were hospitalized as the prediction horizon increased (roughly 60% of events occurred more than 24 h after admission to a general ward). Decision curve analysis showed HAVEN had a higher net benefit than all other scoring systems over a range of risk thresholds (see Figures EB3 and EB4). Including unplanned ICU admissions preceded by a theater visit decreased the performance of HAVEN and all other scoring systems (Table EA4).

Main Findings
In this large, retrospective, observational study, we developed a novel risk score (HAVEN) to identify hospitalized patients at risk of potentially reversible deterioration. HAVEN had higher discrimination than all previously published EWSs and physiological scoring systems we tested (Tables 2 and 3) and was well calibrated (Figure 1). At 10% precision, the model identified nearly twice as many adverse outcomes in advance of the event as the next best scoring system, LAPS-2, depending on the prediction horizon (Figure 2 and Figure EB1).

Strengths and Limitations
Our study used data from four large hospitals and followed the latest recommendations for developing and validating prediction models and EWSs (45,59). We used a composite primary outcome of unplanned admission to the ICU and in-hospital cardiac arrest as a proxy for potentially reversible clinical deterioration, as no well-defined indicator of "reversible" deterioration is recorded. This contrasts with other studies that either used only one of these two outcomes or used in-hospital mortality (60-62). We deliberately excluded in-hospital mortality from our composite outcome. In the United Kingdom, 40-50% of deaths occur in hospitals and only 3.6% of these are estimated to be avoidable (63,64). Excluding in-hospital mortality reduces the risk that our model would be optimized to predict inevitable, rather than potentially preventable, deterioration. The importance of outcome selection has been noted previously by ourselves and others (32,61). LAPS-2 was optimized to predict in-hospital mortality, which may have impacted its performance in our study.
We excluded unplanned ICU admissions preceded by an operating theater visit from the primary outcome. We assessed the impact of this exclusion on HAVEN performance, finding (as with other scoring systems) lower performance when including unplanned ICU admissions preceded by a theater visit. This decrease was particularly marked in hospital C, a dedicated center for cancer and renal services (including transplants). Notwithstanding the case-mix differences in comparison with the other three hospitals (see Table EA1 and Figure EB2), certain surgical procedures are undertaken on physiologically stable patients, who are routinely transferred to the ICU postoperatively and coded as an unplanned ICU admission. This again demonstrates the importance of selecting the appropriate outcome when evaluating risk scoring systems.

Figure 2. Patient admission-level sensitivity: the average proportion of (candidate) adverse events identified within each window (left) and the average proportion of adverse events identified ahead of time by Hospital-wide Alerting via Electronic Noticeboard at different precision levels (5%, 10%, and 20%) (right). Error bars denote 1 SD.
There are limitations to using unplanned ICU admission and cardiac arrest as outcomes. These outcomes are affected by existing treatment-limitation plans and "do not attempt cardiopulmonary resuscitation" decisions. Electronic coding of these decisions varies between hospitals and is currently insufficiently robust for inclusion in a generalizable model. A recent systematic review found that ICU admission can be affected by clinicians' experience, the perception of benefit, and organizational factors (e.g., bed availability) (65). Training our model on retrospective data risks incorporating these potential "cultural biases." We sought to reduce bias (e.g., against older patients) by including a broad range of patient factors (comorbidities, frailty) in our model. Indeed, Figure EF3 shows that although, on average, patients aged over 80 years have a decreasing likelihood of either cardiac arrest or ICU transfer, there is wider variation in the overall predicted risk for each age value above 80 years.
To further evaluate HAVEN's predictive performance, we computed the percentage of adverse events identified ahead of time (Figure 2). We used a patient-level approach to determine the sensitivity of the model at different degrees of precision. As HAVEN was targeted at patients who deteriorate on general wards (rather than direct ICU admissions), we only included time periods after patients were transferred to a general ward. Our results therefore cannot be applied to patients who deteriorated in the emergency department. Despite the low prevalence of the outcome, the HAVEN model identified 42% of adverse events up to 48 hours in advance at an NNE of 10. Although this is nearly twice as good as the next best system (LAPS-2), seeing 10 patients to detect 1 would still create a significant workload. However, decision curve analysis (Figures EB4 and EB5) showed that HAVEN has a higher net benefit than the next three highest scoring systems (including NEWS, which is in common use in the United Kingdom). Together, these findings suggest that implementing the HAVEN model should improve patient care.
Studies of EWSs and other risk scores for identifying deteriorating patients vary in the outcomes and statistical methods used to validate their performance, making comparisons difficult (22,43,45,66). A large retrospective study of the Advanced Alert Monitor (AAM) score showed that the AAM score had a discrimination (ROC AUC) of 0.82 in comparison with discrimination of 0.79 and 0.76 for electronic CART and NEWS, respectively, for predicting unplanned ICU admissions within 12 hours (22). In contrast, the discrimination of HAVEN was 0.92 for predicting unplanned ICU admissions within 24 hours, outperforming the AAM score over a longer prediction horizon.

Conclusions
HAVEN performed significantly better than other published scores, such as NEWS and LAPS-2, when externally validated on an independent test set. By using an ensemble of "weak learners" (gradient boosting decision trees) as our machine-learning algorithm, we were better able to model patients' physiological measurements in the context of their known comorbidities and frailty. We plan further external validation to ensure HAVEN model performance is sustained in other hospitals before a prospective evaluation of its effect on patient outcomes.