Interpretable Machine Learning Model for Mortality Prediction in ICU: A Multicenter Study

Background: Researchers have long struggled to improve disease severity scores for mortality prediction in the ICU. The digitalization of medical health records and advances in computational power have promoted the use of machine learning in critical care. This study aimed to develop an interpretable machine learning model using multicenter datasets, and to compare it with the APACHE IV, in predicting hospital mortality of patients admitted to the ICU. Method: The datasets were assembled from the eICU database, comprising 136,145 patients across 208 hospitals throughout the U.S., and from 5 ICUs in Hong Kong, comprising 10,909 patients. The two datasets were first combined into one large dataset before an 80:20 stratified split into the training set and the test set. The XGBoost machine learning algorithm was chosen to predict hospital mortality. The variables in the model were the same as those included in the APACHE IV score. The discrimination and calibration of the model were assessed, and the model was interpreted using SHapley Additive exPlanations (SHAP) values. Results: Of the 147,054 patients in the whole cohort, the hospital mortality was 9.3%. The area under the precision-recall curve was 0.57 for the XGBoost algorithm and 0.49 for the APACHE IV. Similarly, the XGBoost algorithm reached an area under the receiver-operating characteristic curve (AUROC) of 0.90, while the APACHE IV had an AUROC of 0.87. Additionally, the XGBoost algorithm showed better calibration than the APACHE IV. The three most important variables were age, heart rate, and whether the patient was on a ventilator. Conclusions: The severity score developed by a machine learning model using multicenter datasets outperformed the APACHE IV in predicting hospital mortality for patients admitted to the ICU.


Introduction
Over the years there have been evolutions in the development of disease severity scores in Intensive Care Units (ICU). In 1985, the Acute Physiology and Chronic Health Evaluation (APACHE) II was designed by a group of experts led by Dr Knaus, who subjectively chose variables and assigned weights to them based on their expert clinical judgement and documented physiologic relationships from 5,815 ICU admissions. [1] Twenty years later, APACHE IV was published by Zimmerman et al., formulated using non-linear logistic regression on data from 131,618 ICU admissions. These scores are now commonly used in many ICUs to quantify disease severity, to characterise organ dysfunction, to predict patient outcome and to facilitate resource allocation. [2] Although the use of statistical inference to characterise relationships between variables remains the cornerstone of medical research, statistical tests are not primarily built for making predictions. The difficulty lies in the complexity of critical illness, which makes it unrealistic to form an accurate statistical model while reasonably fulfilling the rigid assumptions behind these statistical tests. In contrast, machine learning models excel in their flexibility, which makes them particularly suitable for making predictions. The superior performance of machine learning has been well demonstrated in mortality prediction over the SAPS II score, in the prediction of unplanned extubations, and in the prediction of ICU readmissions. [3][4][5] However, the achievements of machine learning have been slow to be recognized in the medical community. Many see machine learning as a 'black-box' model and question how the model derives its results. It is important to explain how a model works because if users do not trust the working mechanism of a model or prediction, they will not use it no matter how accurate it is. It has been shown that providing an explanation increases the acceptance of automation systems.
[6,7] In this study, to demonstrate the flexibility of a machine learning model in making predictions, the very same APACHE IV variables would be used. As the prediction of SAPS II has been shown to be less accurate than that of the APACHE IV [8], it was expected that it would be more difficult for the machine learning model to beat the record of the APACHE IV. The objective of this study was to develop an interpretable machine learning model and to compare it with the APACHE IV in predicting hospital mortality of patients admitted to ICU.

Datasets
The characteristics of the two datasets were compared (Additional file 1: Table S1). Additional file 1: Figure S1 visualized the distribution of the APACHE IV score, with the eICU dataset having a lower mean APACHE IV score, approximating a normal distribution, whereas the distribution of the Hong Kong dataset was rather right-skewed. Furthermore, Additional file 1: Table S2 listed the top ten admission diagnoses in each dataset. The eICU dataset contained a significant proportion of patients admitted for medical diseases, such as acute myocardial infarction, cerebrovascular accidents, congestive heart failure, and rhythm disturbances. Conversely, the Hong Kong dataset contained more patients admitted to ICU after neurosurgical operations, hepatobiliary operations, and emergency operations after gastrointestinal perforation. Therefore, to maintain the generalizability of the model across centers, the training dataset was built from both the eICU and the Hong Kong datasets. A stratified 80:20 split of the data was performed: 80% of the eICU dataset and 80% of the Hong Kong dataset were combined into the training set, while the rest joined to form the test set. The datasets were preprocessed using the one-hot encoder and standard scaler.
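The per-dataset stratified split and preprocessing described above can be sketched as follows. This is a minimal illustration using synthetic stand-in data; the column names (`age`, `heart_rate`, `admit_source`, `hospital_mortality`) and cohort sizes are hypothetical, not taken from the study.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

def make_cohort(n, mortality_rate):
    # Synthetic stand-in for one center's dataset (illustrative columns only)
    return pd.DataFrame({
        "age": rng.integers(18, 95, n),
        "heart_rate": rng.integers(30, 180, n),
        "admit_source": rng.choice(["ED", "OT", "ward"], n),
        "hospital_mortality": rng.binomial(1, mortality_rate, n),
    })

eicu, hk = make_cohort(1000, 0.09), make_cohort(200, 0.12)

def split80_20(df):
    # 80:20 split stratified on hospital mortality, done per dataset
    return train_test_split(df, test_size=0.2,
                            stratify=df["hospital_mortality"], random_state=42)

eicu_tr, eicu_te = split80_20(eicu)
hk_tr, hk_te = split80_20(hk)

# Both centers contribute to the training set and to the test set
train = pd.concat([eicu_tr, hk_tr], ignore_index=True)
test = pd.concat([eicu_te, hk_te], ignore_index=True)

# One-hot encode categorical variables, standard-scale numeric ones
prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["admit_source"]),
    ("num", StandardScaler(), ["age", "heart_rate"]),
])
X_train = prep.fit_transform(train)
X_test = prep.transform(test)
```

Splitting each center separately before combining, as above, keeps the center mix (and the mortality rate within each center) the same in the training and test sets.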

XGBoost model
The primary outcome was hospital mortality. The predicted mortality by APACHE IV was extracted from the datasets. Then the same variables used in the APACHE IV were inputted to compute the machine learning model using the extreme gradient boosting (XGBoost) algorithm. [10] The XGBoost model was trained with stratified K-fold cross-validation. The optimal combination of hyperparameters was sought by grid search with cross-validation, selecting for the highest area under the precision-recall curve (AUPRC).
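The tuning loop described above can be sketched with scikit-learn. A `GradientBoostingClassifier` stands in here for the xgboost package's `XGBClassifier` (both expose the same estimator API), and the toy data and grid values are illustrative, not those used in the study; scikit-learn's `"average_precision"` scorer corresponds to the AUPRC criterion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Imbalanced toy data standing in for the ICU cohort (~10% positives)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9], random_state=0)

# Grid search with stratified K-fold CV, selecting by AUPRC
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),   # XGBClassifier in the study
    param_grid={"max_depth": [2, 3], "learning_rate": [0.1, 0.2]},
    scoring="average_precision",                  # AUPRC
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
model = search.best_estimator_                    # refit on all training data
```

Stratifying the folds keeps the ~9% mortality rate roughly constant across folds, so each validation fold contains enough positive cases for the AUPRC to be stable.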

Performance measures
While the most common measure of model discrimination has been the area under the receiver-operating characteristic curve (AUROC), it can mask poor model performance in the case of imbalanced datasets. [11,12] Therefore, the AUPRC, instead of the AUROC, would be used as the primary performance measure in this study. Meanwhile, the AUROC would also be shown for comparison.
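The difference between the two metrics on imbalanced data can be seen in a small simulation. With a prevalence near the cohort's 9.3%, the no-skill baseline for the AUROC stays at 0.5 while the no-skill baseline for the AUPRC equals the prevalence, which is why the AUPRC is the more demanding measure here. The score model below is purely synthetic.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(10000) < 0.093).astype(int)   # ~9.3% prevalence, as in the cohort
y_score = 0.6 * y_true + rng.random(10000)         # weakly informative synthetic score

auroc = roc_auc_score(y_true, y_score)             # no-skill baseline: 0.5
auprc = average_precision_score(y_true, y_score)   # no-skill baseline: ~0.093
```

On such data the AUROC looks flattering while the AUPRC remains visibly further from 1, reflecting the many false positives hidden among the majority class.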
Calibration of the model would be evaluated by the calibration curve and the Hosmer-Lemeshow goodness-of-fit test. Furthermore, with reference to the original APACHE IV study, the predicted mortality and the actual mortality in specific subgroups would be compared to calculate the standardized mortality ratio (SMR). [2]

Model interpretation
Lundberg and Lee proposed SHapley Additive exPlanations (SHAP) to explain the output of machine learning models. [13] SHAP values would be calculated and visualised in a SHAP summary plot, a concise plot combining feature importance with feature effect. Each point on the plot represents the Shapley value for the corresponding feature in a particular instance, and its colour represents the value of the feature. The features are ordered on the y-axis in descending order of importance. [14] Furthermore, the relationship between individual variables and their SHAP values would be visualised using dependence plots: scatter plots reflecting the effect a single variable has on the predictions made by the model.
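The calibration curve and the SMR computation can be sketched as below, assuming an array of predicted mortality probabilities and the observed binary outcomes. The data here are simulated from a perfectly calibrated model, so the SMR comes out near 1; the subgroup cut-off of 0.2 is arbitrary and illustrative.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
p_pred = rng.random(5000) * 0.4          # hypothetical predicted mortality risks
y_obs = rng.binomial(1, p_pred)          # outcomes simulated from those risks

# Calibration curve: observed rate vs. mean predicted mortality per risk bin
obs_rate, pred_mean = calibration_curve(y_obs, p_pred, n_bins=10)

# Standardized mortality ratio: observed deaths / predicted deaths in a subgroup
def smr(y, p, mask):
    return y[mask].sum() / p[mask].sum()

smr_high_risk = smr(y_obs, p_pred, p_pred > 0.2)   # ~1.0 for a calibrated model
```

Plotting `pred_mean` against `obs_rate` and comparing with the diagonal gives the calibration plot; an SMR above 1 in a subgroup indicates more deaths than the model predicted there.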

Results
There were 147,054 patients included in this study. Table 1 showed their baseline characteristics. The hospital mortality of the whole cohort was 9.3%. The number of patients with data available for each variable, the p-values and the confidence intervals were shown in Table 7 in Appendix 6.4. a The percentages of admission source did not add up to 100% because of missing data. b High-dependency units were counted as step-down units in the Hong Kong dataset.

Discrimination
The AUPRC was 0.57 for the XGBoost algorithm and 0.49 for the APACHE IV in the whole cohort (Fig. 1). Looking individually at the eICU and Hong Kong datasets, the XGBoost algorithm had a higher AUPRC than the APACHE IV score (eICU: 0.55 vs. 0.45; Hong Kong: 0.71 vs. 0.66) (Additional file 1: Table S2). The XGBoost algorithm reached an AUROC of 0.90, and the APACHE IV had an AUROC of 0.87 in the combined test set (Additional file 1: Table S3).

Calibration
Figure 2 showed the calibration plot of the whole cohort. The curve of the XGBoost algorithm lay closer to the diagonal reference line, suggesting better calibration than the APACHE IV score. The calibration plots of the individual datasets were shown in Additional file 1. Table 4 summarized the discrimination and calibration of the XGBoost model and the APACHE IV, showing the superior performance of the former.

Model interpretation
Figure 3 showed the SHAP variable importance plot. It was made up of individual dots, each representing one training instance. Feature importance was shown by the descending order of the variables. The x position of a dot reflected the impact of that feature on the prediction: a positive SHAP value was associated with higher mortality, and vice versa. The color of the dots represented the value of that variable for the prediction. For example, older age (shown in red) was positively associated with mortality (on the right side of the axis), whereas patients who were not on ventilators (shown in blue, as this was encoded as 0 in the data, compared with 1 for patients on a ventilator) were associated with a lower risk of mortality (left side of the axis). Therefore, age contributed most in predicting hospital mortality in the XGBoost model, followed by other factors such as heart rate, whether the patient was on a ventilator, bilirubin level, and whether the patient suffered from sepsis (non-urinary tract).
Furthermore, the SHAP variable importance plot visualized the data in an intuitive way. For example, looking at the right side of the plot (SHAP value > 0.0), both a high heart rate and a low heart rate were positively associated with mortality, but the effect of a low heart rate was stronger than that of a high heart rate. To investigate the effect of an individual variable such as the heart rate, the dependence plot generated by the SHAP model elegantly illustrated the U-shaped relationship (Fig. 4). The SHAP value was lowest with the heart rate ranging from about 50 to 100 beats per minute. The effect of severe bradycardia on mortality prediction was greater than that of tachycardia. The effect of interactions with other variables was also shown. For example, among patients with a heart rate of 150 bpm, younger patients had much lower SHAP values than older patients. Different dependence plots between SHAP values and independent variables, and the interactions among these variables, could also be plotted and studied from the system if needed.
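As a concrete illustration of what one of these SHAP values is, the toy sketch below computes exact Shapley values for a hypothetical two-feature risk function by enumerating both feature orderings and averaging each feature's marginal contribution. The risk function, baseline patient, and feature values are invented for illustration; the shap library performs an equivalent computation efficiently for tree ensembles such as XGBoost.

```python
from itertools import permutations

background = {"age": 60, "heart_rate": 80}    # hypothetical baseline patient

def risk(x):
    # Toy risk score: rises with age and with deviation from a heart rate of 75
    return 0.02 * x["age"] + 0.001 * (x["heart_rate"] - 75) ** 2

def shap_values(x):
    feats = list(x)
    orders = list(permutations(feats))
    phi = {f: 0.0 for f in feats}
    for order in orders:
        current = dict(background)
        prev = risk(current)
        for f in order:                       # reveal features one at a time
            current[f] = x[f]
            phi[f] += (risk(current) - prev) / len(orders)
            prev = risk(current)
    return phi

patient = {"age": 80, "heart_rate": 150}
phi = shap_values(patient)
# Local accuracy: risk(background) + sum(phi.values()) == risk(patient)
```

The "local accuracy" property in the final comment is what makes the summary and dependence plots readable: each prediction decomposes exactly into a baseline plus one additive contribution per feature.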

Discussion
This study demonstrated the superior performance of model prediction based on big data using a machine learning algorithm over traditional statistical inference, by utilizing the same variables as the original APACHE IV score. While the original APACHE IV model was built using solely data from the U.S. population, the model in this study was built on a combination of data from the eICU and multiple centers in Hong Kong. This machine learning algorithm outperformed the original APACHE IV score in the whole cohort, as well as in the individual eICU and Hong Kong populations. Such improved generalisability is an important quality of an outcome prediction scoring system, which serves as a tool to compare different patient populations in medical research and to compare the quality of care across ICUs worldwide.
The use of SHAP has increased the interpretability of the model. The machine learning model is no longer a 'black box': although the machine itself has no idea what the variables mean, unlike the clinicians who subjectively assigned weights to individual variables in the APACHE II era, the interpretation of the model by SHAP has shown that it is biologically sound and clinically plausible. This message is paramount in boosting the confidence of clinicians who would consider utilizing this model.
The strength of this study was the enormous dataset built from the eICU dataset and multiple centers in Hong Kong. The XGBoost algorithm used in this model is one of the state-of-the-art machine learning models with outstanding performance. This study also demonstrated the use of the AUPRC in addition to the AUROC to reveal the true performance of the model in the face of an imbalanced dataset. Model discrimination was evaluated in multiple facets, and the calibration plot was used to avoid the pitfalls of the Hosmer-Lemeshow chi-square test. The use of SHAP improved the interpretability of the model.
The limitation of this study was that most of the patients came from the eICU dataset, representing U.S. data. Based on the figures, the two groups of patients obviously differed in terms of disease severity, admission sources and admission diagnoses. Behind the scenes there might be differences in case referral patterns and case management. The thresholds for ICU admission varied across centers and countries, and the pathophysiology of disease might not be the same between Chinese and Caucasian patients. This was also the reason why the eICU dataset and the Hong Kong dataset were combined before splitting into the training and test sets. Furthermore, the variables used in this study were limited to those used in the original APACHE IV, because the aim of this study was to show the superior performance of a machine learning model versus traditional statistical methods. The mortality prediction model might be improved by using data from a more complex case-mix or by recruiting more data from academic centers and rural hospitals worldwide. In particular, there is a need to establish a global database recruiting ICUs across the continents in order to produce a generalisable prediction model. Other formats of data, such as waveforms (electrocardiogram, photoplethysmography, arterial blood pressure), imaging data, and clinical texts, might also be employed to boost the accuracy of the mortality prediction model. On the other hand, one needs to strike a balance in determining the number of variables in a model, because increasing the number of variables increases the difficulty of data collection. Lastly, other outcomes can also be predicted in future studies, such as ICU and hospital length of stay, ventilator-free days, dialysis independence, and long-term mortality.

Conclusion
Using the same variables as the APACHE IV, the XGBoost algorithm outperformed the APACHE IV score in hospital mortality prediction for patients admitted to the ICU, based on data from the eICU and Hong Kong datasets. There is an emerging need to establish a global database of patients in critical care.

List Of Abbreviations
ICU, intensive care unit; APACHE, Acute Physiology and Chronic Health Evaluation; APS, Acute Physiology Score; XGBoost, eXtreme Gradient Boosting; AUPRC, area under the precision-recall curve; AUROC, area under the receiver-operating characteristic curve; SMR, standardized mortality ratio; SHAP, SHapley Additive exPlanations