Mortality prediction model for the triage of COVID-19, pneumonia, and mechanically ventilated ICU patients: A retrospective study

Rationale Prediction of patients at risk for mortality can help triage patients and assist in resource allocation. Objectives Develop and evaluate a machine learning-based algorithm that accurately predicts mortality in COVID-19, pneumonia, and mechanically ventilated patients. Methods Retrospective study of 53,001 total ICU patients, including 9166 patients with pneumonia and 25,895 mechanically ventilated patients, performed on the MIMIC dataset. An additional retrospective analysis was performed on a community hospital dataset containing 114 patients positive for SARS-CoV-2 by PCR test. The outcome of interest was in-hospital patient mortality. Results When trained and tested on the MIMIC dataset, the XGBoost predictor obtained area under the receiver operating characteristic (AUROC) values of 0.82, 0.81, 0.77, and 0.75 for mortality prediction on mechanically ventilated patients at 12-, 24-, 48-, and 72-hour windows, respectively, and AUROCs of 0.87, 0.78, 0.77, and 0.73 for mortality prediction on pneumonia patients at 12-, 24-, 48-, and 72-hour windows, respectively. The predictor outperformed the qSOFA, MEWS and CURB-65 risk scores at all prediction windows. When tested on the community hospital dataset, the predictor obtained AUROCs of 0.91, 0.90, 0.86, and 0.87 for mortality prediction on COVID-19 patients at 12-, 24-, 48-, and 72-hour windows, respectively, again outperforming the qSOFA, MEWS and CURB-65 risk scores at all prediction windows. Conclusions This machine learning-based algorithm is a useful predictive tool for anticipating patient mortality at clinically useful timepoints, and is capable of accurate mortality prediction for mechanically ventilated patients as well as those diagnosed with pneumonia or COVID-19.


Introduction
Infection prevention and control recommendations from the World Health Organization (WHO) stress that early detection, effective triage, and isolation of potentially infectious patients are essential to prevent unnecessary exposures to COVID-19 [1]. However, the rapid spread of COVID-19 has outpaced US healthcare facilities' ability to administer the diagnostic tests needed to guide the quarantine and triage of COVID-19 patients [2][3][4][5]. The outbreak has significantly strained the availability of necessary hospital resources (e.g. respirators [6] and mechanical ventilators [7][8][9][10][11][12]). COVID-19 can be lethal, with a variable case fatality rate considered to lie between that of severe acute respiratory syndrome (SARS; 9.5% [13]) and influenza (0.1%) [14][15][16], and with the potential to develop into severe respiratory disease [17][18][19]. During this period of unprecedented health crisis, clinicians must prioritize care for at-risk individuals to maximize limited resources. Mortality prediction tools aid in triage and resource allocation by providing advance warning of patient deterioration. Our prior work has validated machine-learning (ML) algorithms for their ability to predict mortality and patient stability in a variety of settings and on diverse patient populations [20][21][22][23][24].

Theory
Of particular interest during the COVID-19 pandemic is mortality prediction for COVID-19 patients, as well as for those who have developed respiratory complications such as pneumonia and conditions requiring mechanical ventilation. Some prior studies predicting mortality in the mechanically ventilated subpopulation have used logistic regression models that, when applied on day 21 [25] or day 14 [26,27] of mechanical ventilation, provide a probability of 1-year mortality. These studies were designed to determine the long-term prognosis of patients receiving prolonged mechanical ventilation. Here we present a mortality prediction tool applied to intensive care unit (ICU) patients requiring mechanical ventilation as well as to those diagnosed with pneumonia, with mortality prediction windows of 12, 24, 48 and 72 h prior to death. We apply this algorithm for the same mortality prediction windows in COVID-19 patients.

Data sources
Patient records were collected from the Medical Information Mart for Intensive Care (MIMIC) dataset, an openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~60,000 intensive care unit admissions [28]. It includes demographics, vital signs, laboratory tests, medications, and more. Data collection was passive with no impact on patient safety. MIMIC data has been de-identified in compliance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
Patient records of COVID-19 polymerase chain reaction (PCR) positive patients were collected from a community hospital and formatted in the same manner as the MIMIC dataset. A total of 114 patient encounters were collected between 12 March and 12 April 2020. Data collection was passive with no impact on patient safety. Dascena establishes de-identification by removing all protected health information (PHI) identifiers and by jittering all timestamps (including date of birth (DOB)) randomly either forwards or backwards in time. Studies performed on de-identified patient data constitute non-human subjects research, and this study was therefore determined by the Pearl Institutional Review Board to be exempt according to FDA 21 CFR 56.104 and 45 CFR 46.104(b)(4) (Secondary Research Uses of Data or Specimens), under study number 20-DASC-119.

Data processing
For the MIMIC and community hospital datasets, we included only records for patients aged 18 years or older. We excluded patient records for which there were no raw data or no discharge or death dates. We then filtered by length of stay (LOS) for the different lookahead windows of 12, 24, 48, and 72 h. Table 1 lists the number of patients meeting each inclusion criterion from the MIMIC dataset. Inclusion criteria for the community hospital dataset are listed in Table 2. We minimally processed raw electronic health record (EHR) data to generate features. Following imputation of missing values, we averaged measurements into one value per hour for up to 3 h preceding prediction time. We also calculated differences between the current hour and the prior hour, and between the prior hour and the hour before that. We concatenated these values for each measurement into a feature vector. For the MIMIC dataset, pneumonia patients were identified by International Classification of Diseases (ICD) codes, while those requiring mechanical ventilation and their corresponding start times were determined by chart measurements indicative of a mechanical ventilation setting. In the community hospital dataset, COVID-19 patients were identified by positive SARS-CoV-2 PCR tests.
Data were discretized into 1-h intervals beginning at the time of the first recorded patient measurement, and hourly measurements were required for each input variable. When multiple observations of the same patient measurement were taken within a given hour, they were averaged to produce a single value. This ensured that the measurement rate was the same across patients and across time. Where no measurement of a clinical variable was available for a given hour, missing values were imputed by carrying forward the most recent past measurement. For some patients with infrequent measurements of one or more vital signs, this simple imputation resulted in many consecutive hours with identical values.
Our publication on the use of gradient boosted trees for sepsis detection and prediction describes the data processing in detail [29]. Predictions were generated for all experiments using the following variables: Age, Heart Rate, Respiratory Rate, Peripheral Oxygen Saturation (SpO2), Temperature, Systolic Blood Pressure, Diastolic Blood Pressure, White Blood Cell Counts, Platelets, Lactate, Creatinine, and Bilirubin, over an interval of 3 h and their corresponding differentials in that interval.
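As a minimal sketch of this processing pipeline (not the study's actual implementation, which is detailed in [29]; the column names and window boundaries here are illustrative assumptions), hourly averaging, forward-fill imputation, and differencing might be expressed as:

```python
import numpy as np
import pandas as pd

def build_feature_vector(obs: pd.DataFrame, pred_time: pd.Timestamp) -> np.ndarray:
    """Average measurements into hourly bins over the 3 h preceding
    `pred_time`, forward-fill missing hours, compute hour-to-hour
    differences, and concatenate everything into one feature vector.
    Column names in `obs` are hypothetical placeholders."""
    # Keep only the 3 h window ending at prediction time.
    window = obs.loc[pred_time - pd.Timedelta(hours=3): pred_time]
    # Average multiple observations within the same hour into one value,
    # then carry the most recent past measurement forward for empty hours.
    hourly = window.resample("1h").mean().ffill().tail(3)
    # Differences: current vs. prior hour, and prior hour vs. the one before.
    diffs = hourly.diff().iloc[1:]
    # Flatten hourly levels and their differentials into a single vector.
    return np.concatenate([hourly.to_numpy().ravel(), diffs.to_numpy().ravel()])
```

In the real pipeline this vector would hold three hourly values plus two differentials for each of the 12 input variables listed above.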

Gold standard
The outcome of interest was in-hospital patient mortality, determined retrospectively for each patient. In the MIMIC dataset, we used the expire_flag field to identify the final stays of patients who died in hospital. Similarly, the community hospital dataset contains a Boolean deceased flag used to determine mortality.

The machine learning algorithm
The classifier was created using the XGBoost method for fitting "boosted" decision trees. We applied the XGBoost package for Python to the patient age and vital sign measurements and their temporal changes, where temporal changes included hourly differences between each measurement beginning 3 h before prediction time. Gradient boosting, which XGBoost implements, is an ensemble learning technique that combines results from multiple decision trees to create prediction scores. Each tree successively splits the patient population into smaller and smaller groups. Each branch splits the patients who enter it into two groups, based on whether their value of some covariate is above or below some threshold; for instance, a branch might divide patients according to whether their temperature is above or below 100 °F. After some number of branches, the tree ends in a set of "leaves." Each patient falls into exactly one leaf, according to the values of his or her measurements. Each leaf of the tree is predicted to have the same risk of mortality. The covariate involved in each split and the threshold value are selected by an algorithm designed to trade off fit to the training data against accuracy on out-of-sample data, using cross-validation to avoid over-fitting. We restricted tree depth to a maximum of six branching levels, set the learning rate parameter of XGBoost to 0.1, and restricted the tree ensembles to 1000 trees to limit the computational burden.
Hyperparameter optimization was performed using cross-validated grid search. We included a hyperparameter for early stopping of the iterative tree-addition procedure to prevent overfitting of the model to the training data, and optimized across this hyperparameter using fivefold cross-validation. Due to computational and time constraints, the grid search was performed across a sparse parameter grid, with candidate hyperparameter values chosen to span large ranges of viable parameter space. Although XGBoost has a large number of trainable parameters, the same constraints limited the set of tuned parameters to those with the largest impact on performance on the training data and the greatest relevance to the prediction task.
To validate the boosted tree predictor when training and testing were performed on data from the same institution, we used fivefold cross-validation. For each model, four-fifths of the patients were randomly selected to train the model and the remaining one-fifth were used as a hold-out set to test the predictions. To account for the random selection of the training set, reported performance metrics are the averages across the five separately trained models. For AUROC, we also report the standard deviation of the five AUROC values obtained from cross-validation.
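This evaluation scheme can be sketched as follows, with synthetic data and scikit-learn's gradient boosting standing in for the study's XGBoost model:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the patient feature matrix and outcomes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Five separately trained models: each trained on four-fifths of the
# patients and tested on the held-out fifth.
aucs = []
for train_idx, test_idx in StratifiedKFold(
        n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

# Reported metrics: mean AUROC across folds, plus its standard deviation.
mean_auc, sd_auc = float(np.mean(aucs)), float(np.std(aucs))
```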
For patients who died, we modeled mortality 12, 24, 48, and 72 h before death to evaluate performance at a variety of lead times. For mechanically ventilated encounters, the reference time point was the start of ventilation for both the positive and negative classes. Predictors were trained independently for each distinct lookahead time. For 12-, 24-, 48-, and 72-h lookahead predictions following a 3-h window of measurements, patients must have data for 15, 27, 51, or 75 h, respectively, preceding the time of in-hospital mortality or the time of discharge. Accordingly, we selected patients with stays of the appropriate length for the training and testing of each lookahead.
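The stay-length requirement above is simple arithmetic (the 3-h observation window plus the lookahead); a minimal sketch:

```python
def required_stay_hours(lookahead_h: int, obs_window_h: int = 3) -> int:
    """Minimum hours of data a patient must have preceding the time of
    in-hospital mortality or discharge: the observation window (3 h in
    this study) plus the lookahead itself."""
    return obs_window_h + lookahead_h

# Eligibility thresholds for the four lookahead windows used in the study.
minimum_stays = {h: required_stay_hours(h) for h in (12, 24, 48, 72)}
```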

Comparison to rule-based methods
To calculate the AUROC for rule-based predictors of mortality, we calculated quick Sepsis-Related Organ Failure Assessment (qSOFA), Modified Early Warning Score (MEWS) and CURB-65 scores for patients in the MIMIC database. qSOFA has also been used to predict poor outcomes in pneumonia patients, including the need for mechanical ventilation, and has been shown to either match or outperform other outcome predictors such as SOFA, CRB, CRB-65 and the pneumonia severity index (PSI) [30,31]. Among more generally used mortality prediction scores, qSOFA has been shown to have predictive performance similar to that of Acute Physiology and Chronic Health Evaluation (APACHE) II or SOFA, as evidenced by a lack of statistically significant difference between AUROCs [32]. The MEWS and CURB-65 scores have also been validated for mortality prediction in general patient populations [33,34] and in those with community-acquired pneumonia [35] or COVID-19 [36], respectively. Scores were calculated using the entire dataset. We calculated the qSOFA score using systolic blood pressure, respiratory rate, and Glasgow Coma Scale (GCS) values from EHR data. MEWS was calculated using systolic blood pressure, heart rate, respiratory rate, and temperature, with GCS used as a proxy for evaluating AVPU. CURB-65 scores were computed using age, blood urea nitrogen (BUN), respiratory rate, and systolic and diastolic blood pressure; a GCS of less than or equal to 14 was used as a proxy for confusion. Comparator score calculations for patients in the community hospital dataset were modified based on available data.
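Of these comparators, qSOFA is the simplest to compute. A minimal sketch, using the standard published qSOFA criteria (one point each for respiratory rate ≥ 22/min, systolic blood pressure ≤ 100 mmHg, and altered mentation, i.e. GCS < 15):

```python
def qsofa(resp_rate: float, sbp: float, gcs: float) -> int:
    """quick SOFA score: one point each for respiratory rate >= 22/min,
    systolic blood pressure <= 100 mmHg, and altered mentation (GCS < 15).
    Returns an integer from 0 to 3."""
    return int(resp_rate >= 22) + int(sbp <= 100) + int(gcs < 15)
```

Each rule-based score is then treated as a continuous predictor when computing its AUROC for comparison against the ML algorithm.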

Results
XGBoost model training and testing was performed on the MIMIC dataset. Patient demographic information for all ICU encounters as well as each subpopulation are presented in Tables 3-5. Patient demographic information for all encounters from the community hospital data set are listed in Table 6.
The XGBoost ML algorithm predicted mortality in all ICU patients as well as in mechanically ventilated and pneumonia patients more accurately than qSOFA, MEWS and CURB-65 at all prediction windows (Tables 7 and 8 and Supplementary Table S5). When trained and tested on the MIMIC dataset, the XGBoost predictor obtained AUROCs of 0.82, 0.81, 0.77, and 0.75 for mortality prediction on mechanically ventilated patients at 12-, 24-, 48-, and 72-hour windows, respectively, and AUROCs of 0.87, 0.78, 0.77, and 0.73 for mortality prediction on pneumonia patients at 12-, 24-, 48-, and 72-hour windows, respectively (Fig. 1). Feature importance statistics are listed in Supplementary Tables S1-S4.
Detailed performance metrics for the XGBoost predictor on pneumonia and mechanically ventilated patients are presented in Tables 7 and 8, and on COVID-19 patients in Table 9. Predictor training and testing for the pneumonia and mechanically ventilated populations was performed on the MIMIC dataset. The diagnostic odds ratio (DOR), a measure for comparing diagnostic accuracy between tools, is calculated as (True Positive/False Negative)/(False Positive/True Negative); it represents the ratio of the odds of a true positive prediction of mortality in patients who died within a given prediction window to the odds of a false positive prediction of mortality in patients who did not. For all prediction windows, the XGBoost predictor had a higher DOR than qSOFA.
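The DOR follows directly from the confusion-matrix counts; a minimal sketch (the counts in the usage check are hypothetical, not values from this study):

```python
def diagnostic_odds_ratio(tp: int, fn: int, fp: int, tn: int) -> float:
    """Diagnostic odds ratio: (TP/FN) / (FP/TN), i.e. the odds of a true
    positive mortality prediction among patients who died, divided by the
    odds of a false positive among patients who survived."""
    return (tp / fn) / (fp / tn)
```

For example, with 80 true positives, 20 false negatives, 30 false positives, and 70 true negatives, the DOR is (80/20)/(30/70) ≈ 9.33.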
These results suggest that the XGBoost predictor is capable of predicting mortality in pneumonia, mechanically ventilated, and COVID-19 patients and outperforms the qSOFA, MEWS and CURB-65 mortality risk scores.

Discussion
Accurate mortality prediction can assist with the allocation of limited hospital resources and optimize patient management. Additionally, advance mortality prediction can facilitate decision making with family and caregivers. The commonly used MEWS [37], the APACHE [38], Simplified Acute Physiology Score (SAPS II) [39], Sepsis-Related Organ Failure Assessment (SOFA) [40], and the quick SOFA (qSOFA) score [41] provide rough estimates of mortality risk; however, the specificity and sensitivity of these tools are limited for COVID-19 and mechanically ventilated populations [42]. Machine learning (ML) has previously been broadly applied to predictive tasks within the biosciences [43][44][45][46]. ML-based tools for mortality prediction have been applied to sepsis [47,48], cardiac arrest [49], coronary artery disease [50], and extubation [51] patient populations, and have been implemented in a broad range of clinical settings, including the emergency department (ED) [48] and the intensive care unit (ICU) [52][53][54][55]. Studies of mortality prediction on pneumonia and mechanically ventilated patients are particularly relevant for COVID-19 related lung complications. We have demonstrated that machine learning algorithms are useful predictive tools for anticipating patient mortality at clinically useful windows of 12, 24, 48, and 72 h in advance and have validated mortality prediction accuracy for COVID-19, pneumonia, mechanically ventilated, and all ICU patients (Fig. 1), demonstrating that for all prediction types and windows, our ML algorithm outperforms the qSOFA, MEWS and CURB-65 severity scores (Tables 7-9).
A meta-analysis of studies focusing on predicting mortality in pneumonia patients showed that, of the three commonly used prognostic scores, the Pneumonia Severity Index (PSI) had the highest AUROC at 0.81. However, this index was used for predicting 30-day mortality specifically among patients with community-acquired pneumonia [56]. When trained and tested on the MIMIC dataset, the XGBoost predictor obtained AUROCs of 0.87, 0.78, 0.77, and 0.73 for mortality prediction on pneumonia patients at 12-, 24-, 48-, and 72-hour windows, respectively (Fig. 1, Table 7). When trained and tested on the community hospital dataset, the XGBoost predictor obtained AUROCs of 0.91, 0.90, 0.86, and 0.87 for mortality prediction on COVID-19 PCR-positive patients at 12-, 24-, 48-, and 72-hour windows, respectively (Table 9). The algorithm outperformed the qSOFA, MEWS and CURB-65 risk scores at all prediction windows (Table 9). This ML algorithm can be used to automatically monitor patient populations without incurring additional data entry or impeding clinical workflow, and patient alerts can be set to the thresholds of alerting sensitivity and specificity desired in different care settings. As a clinical decision support tool, the machine learning algorithm presented in this study may assist clinicians in navigating the complexities surrounding COVID-19 related resource allocation. During a pandemic, accurate triage of patients is essential for improving patient outcomes, effectively utilizing clinical care teams, and efficiently allocating resources. When our machine learning algorithm is implemented in clinical ICU settings, healthcare providers can potentially identify patients at risk of significant COVID-19 related decompensation before they deteriorate, thus facilitating effective resource allocation and identifying those patients most likely to benefit from increased care.
There are several limitations to our study. The ML algorithm developed on the MIMIC dataset used only data from the ICU; further research is therefore required to evaluate the performance of the algorithm in other patient care settings. Further, because the algorithm utilized only laboratory data and vital signs as inputs, it did not account for actions undertaken by the care team. These actions could signify aggressive treatment or withdrawal of treatment and could cause changes to algorithm inputs, potentially leading to variations in the algorithm's prediction score. On one hand, incorporating care team actions into algorithm inputs could provide useful feedback to the care team, in that it may aid them in determining whether a given intervention was harmful or beneficial. On the other hand, accounting for actions undertaken by the care team may complicate the interpretation of what it means to "anticipate" mortality, given that the current state of knowledge of the care team is unknown. Finally, because this is a retrospective study, we cannot determine the performance of the mortality prediction algorithm in a prospective clinical setting. Prospective validation is required to determine how clinicians may respond to risk predictions as well as whether predictions can affect patient outcomes or resource allocation.

Table 8. Comparison of AUROC, average precision (APR), sensitivity, specificity, F1, diagnostic odds ratio (DOR), positive and negative likelihood ratios (LR+ and LR-), accuracy and recall obtained by the machine learning algorithm (MLA) and the qSOFA score for mortality prediction at 12-, 24-, 48-, and 72-hour windows on mechanically ventilated patients using the MIMIC dataset. Standard deviations are listed in parentheses. For AUROC and APR the operating point was set near a sensitivity of 0.800.

Conclusion
The ML algorithm presented in this study is a useful predictive tool for anticipating patient mortality at clinically useful windows up to 72 h in advance, and is capable of accurate mortality prediction for COVID-19, pneumonia, and mechanically ventilated patients.

Patient and public involvement statement
Patients and the public were not involved in the design and conduct of the study, choice of outcome measures, or recruitment to the study due to the nature of data collection.

Dissemination declaration
Transparency declaration: RD affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as originally planned (and, if relevant, registered) have been explained.

Ethical approval
Data has been deidentified and, as such, does not constitute human subjects research.

Conflicts of interest
All authors who have affiliations listed with Dascena (San Francisco, California, USA) are employees or contractors of Dascena.

Trial registration
This study has been registered on ClinicalTrials.gov under study number NCT04358510.

Provenance and peer review
Not commissioned, externally peer reviewed.

Guarantor
The Guarantor is the one or more people who accept full responsibility for the work and/or the conduct of the study, had access to the data, and controlled the decision to publish.

Table S5
Comparison of AUROC, average precision (APR), sensitivity, specificity, F1, diagnostic odds ratio (DOR), positive and negative likelihood ratios (LR+ and LR-), accuracy and recall obtained by the machine learning algorithm (MLA) and the qSOFA score for mortality prediction at 12-, 24-, 48-, and 72-hour windows.