Artificial neural networks improve and simplify intensive care mortality prognostication: a national cohort study of 217,289 first-time intensive care unit admissions

Purpose We investigated whether early intensive care unit (ICU) scoring with the Simplified Acute Physiology Score (SAPS 3) could be improved using artificial neural networks (ANNs). Methods All first-time adult intensive care admissions in Sweden during 2009–2017 were included. A test set was set aside for validation. We trained ANNs with two hidden layers with random hyper-parameters and retained the best ANN, determined using cross-validation. The ANNs were constructed using the same parameters as in the SAPS 3 model. The performance was assessed with the area under the receiver operating characteristic curve (AUC) and Brier score. Results A total of 217,289 admissions were included. The developed ANN (AUC 0.89 and Brier score 0.096) was found to be superior (p < 10⁻¹⁵ for AUC and p < 10⁻⁵ for Brier score) in early prediction of 30-day mortality for intensive care patients when compared with SAPS 3 (AUC 0.85 and Brier score 0.109). In addition, a simple, eight-parameter ANN model was found to perform just as well as SAPS 3, but with better calibration (AUC 0.85 and Brier score 0.106, p < 10⁻⁵). Furthermore, the ANN model was superior in correcting mortality for age. Conclusion ANNs can outperform the SAPS 3 model for early prediction of 30-day mortality for intensive care patients.


Introduction
Outcome prediction on admission to the intensive care unit (ICU) is a difficult task as patients are admitted with a wide array of diseases with varying severity in addition to patients' diversity in terms of age and comorbidities. In this study, we investigate if the current gold standard of early (within 1 h of admission) ICU-scoring, the proceeds. The weight increases or decreases the strength of the signal at a connection [3].
Advances in computing speed and the development of efficient algorithms have led to a renaissance for machine learning techniques such as ANNs during the last decade. Machine learning has proven valuable in a wide variety of medical fields, from the interpretation of cardiac magnetic resonance imaging for mortality prediction in pulmonary hypertension to the detection of skin cancer [4,5]. Machine learning has also been found to be a promising technique in prognostication of the critically ill, but only with data available after 24 h and in comparison with the Acute Physiology And Chronic Health Evaluation (APACHE) model. In a study from 2015, Pirracchio et al. found that an ensemble of machine learning techniques could improve ICU prediction [6]. Similarly, Kim et al. [7] used different machine learning algorithms to estimate ICU mortality from data collected within the first 24 h of ICU admission.
Current ICU prediction models such as the APACHE, used for scoring within the first 24 h, the Mortality Prediction Model (MPM), used for scoring on admission or after 24 h, and the SAPS 3 [8] are based on multivariable logistic regression models. The SAPS 3 uses characteristics such as comorbidities before ICU admission, the reason for ICU admission, physiological parameters, and laboratory findings within 1 h of ICU admission to calculate an estimated mortality risk (EMR) [1,2]. The SAPS 3 has been re-calibrated several times to improve its performance [9]. To our knowledge, machine learning has not yet been used to improve early prognostication (with data prospectively registered within the first hour of admission), nor has it been applied to the massive data repositories of a national intensive care registry.
The aim of this study was to improve the 30-day mortality prognostication within the first hour of ICU admission using ANN modelling on data prospectively gathered within the first hour of admission (for SAPS 3 prognostication), as well as to identify the smallest possible subset of the more-than-twenty SAPS 3 parameters that can retain the same performance as the SAPS 3 model.

Materials and methods
We identified all first-time adult ICU admissions (excluding cardiothoracic ICU admissions, as these use a different scoring system) with follow-up of at least 30 days during 2009–2017 from the Swedish Intensive Care Registry (SIR). Both SAPS 3 parameters and 30-day mortality were used in this study. Physiological parameters and laboratory findings were prospectively recorded within 1 h of ICU admission, and an estimated mortality risk (EMR) was calculated according to the latest Swedish calibration from 2016. This calculation estimates the 30-day mortality, in contrast to the original SAPS 3 model, which estimates the in-hospital mortality [9]. In Sweden, the Reaction Level Scale (RLS85) is often used instead of the more widespread Glasgow Coma Scale (GCS). For the studied admissions, 80% had RLS85 recorded, 20% had GCS recorded, whereas 2.5% had neither. Instead of translating GCS to RLS85, we chose to transform both scales to the central nervous system (CNS) scale used by APACHE II [10] and then use CNS scores in our ANN. See Table 1 for a comprehensive list of the SAPS 3 parameters.
To select an appropriate network, we constructed 200 single-output ANNs with two hidden layers, where the number of nodes in each layer was log-sampled between 5 and 400. These networks were constructed using TensorFlow [11], a Python-based open-source machine learning framework developed by Google LLC (Mountain View, USA). To improve convergence, training speed, and accuracy, we normalised each layer using batch normalisation, so that its outputs have zero mean and unit variance [12]. The loss function was optimised using the Adam implementation of stochastic gradient descent (SGD) [13], using a learning   [14]. Regularisation was performed using weight decay, with the decay parameter, λ, log-sampled from 10⁻⁷ to 10⁻³.
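The log-sampling of hyper-parameters described above can be illustrated with a minimal Python sketch. It assumes that "log-sampled" means drawing uniformly on a logarithmic scale; the function names (`log_uniform`, `sample_hyperparameters`), the dictionary keys, and the fixed seed are illustrative and are not taken from the study's code.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a value uniformly on a logarithmic scale between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_nodes(rng, low=5, high=400):
    """Log-sample an integer node count, clamped to the closed range [low, high]."""
    return min(high, max(low, round(log_uniform(rng, low, high))))


def sample_hyperparameters(rng):
    """Sample one candidate configuration for a two-hidden-layer network."""
    return {
        "nodes_layer_1": sample_nodes(rng),
        "nodes_layer_2": sample_nodes(rng),
        # Weight decay parameter (lambda), log-sampled from 1e-7 to 1e-3.
        "weight_decay": log_uniform(rng, 1e-7, 1e-3),
    }


rng = random.Random(0)  # fixed seed, used only to make the sketch reproducible
candidates = [sample_hyperparameters(rng) for _ in range(200)]
```

Sampling on a logarithmic scale spreads the candidates evenly across orders of magnitude, so that small values of λ are explored as densely as large ones.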
To increase feature selection capabilities and to further improve regularisation, dropout was used, where the dropout rate, p, was log-sampled from 5% to 20% on the input layer and from 40% to 60% on the hidden layers [15]. The network was trained for 100 epochs with a batch size of 512 using ReLU activation functions on the hidden layers [14]. To select the network, fivefold cross-validation was used, which yielded the hyper-parameters of our network: 158 first-layer nodes and 67 second-layer nodes with a weight decay of λ = 5.04 × 10⁻⁶. The dropout rates were 0.073 (input) and 0.501 (hidden). Data were randomly divided into six portions, with one portion set aside for independent validation purposes (the test set). Simple mean and mode substitution turned out to perform just as well as the more advanced methods for imputation, such as autoencoders [16].
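The data partitioning can be sketched as follows. This is a minimal illustration on toy indices, assuming that the five portions remaining after the test set is set aside serve as the five cross-validation folds; the helper names (`split_into_portions`, `cross_validation_folds`) are hypothetical.

```python
import random


def split_into_portions(n_samples, n_portions, seed=0):
    """Randomly partition the indices 0..n_samples-1 into near-equal portions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_portions] for i in range(n_portions)]


def cross_validation_folds(portions):
    """Yield (train_indices, validation_indices), holding out each portion once."""
    for k, validation in enumerate(portions):
        train = [i for j, p in enumerate(portions) if j != k for i in p]
        yield train, validation


# Toy example with 60 admissions: one portion is the test set, five remain.
portions = split_into_portions(60, 6)
test_set, training_portions = portions[0], portions[1:]
folds = list(cross_validation_folds(training_portions))
```

The test set is never seen during the hyper-parameter search, so performance measured on it is not biased by model selection.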
To evaluate the performance of the ANN model, we examined the receiver operating characteristic (ROC) curve, which plots sensitivity against 1 − specificity for various threshold settings. We used the area under the ROC curve (AUC) as a performance measure [17]. Differences in AUC were tested with the method of DeLong et al. [18]. Furthermore, we computed the Brier score, which is a measure of the calibration of a set of probabilistic predictions; in effect, it is the mean squared error of the forecast [19]. Differences in Brier scores were tested with an approximate permutation test with 50,000 permutations [20]. We evaluated our ANN models with the AUC of the ROC and the Brier score for the calibration error on the test set. The ratio between the observed 30-day mortality and the EMR is the standardised mortality ratio (SMR), which is a morbidity-adjusted mortality measure. The SMR is only interesting as a group measure, as individual SMRs are either 0 (if the individual has survived) or 1/EMR_i, where EMR_i is the EMR of individual i (who has not survived). However, an individual (or local) SMR can be defined using smoothing techniques. We applied local polynomial regression using the default settings of the loess function of R [21] on mortality and EMR (and then interpolated evenly over the whole range). We subsequently calculated the ratio of the smoothed mortality and the smoothed EMR to obtain smoothed (local) estimates of SMR [22]. One possible interpretation of the SMR is that the closer the SMR is to 1, the better the EMR prognosticates the mortality.
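The three performance measures can be written out in a few lines of Python. The sketch below is illustrative only: it uses a pairwise (Mann-Whitney) formulation of the AUC rather than any particular library routine, computes the group-level SMR without smoothing, and runs on hypothetical toy data that are not taken from the registry.

```python
def brier_score(outcomes, predictions):
    """Mean squared error between predicted risks and binary outcomes."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predictions)) / len(outcomes)


def auc(outcomes, predictions):
    """Probability that a random non-survivor is ranked above a random survivor.

    Ties count as one half. This pairwise form is O(n^2) and is meant only
    for small illustrative inputs.
    """
    positives = [p for y, p in zip(outcomes, predictions) if y == 1]
    negatives = [p for y, p in zip(outcomes, predictions) if y == 0]
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))


def standardised_mortality_ratio(outcomes, predictions):
    """Group SMR: observed mortality divided by the mean estimated mortality risk."""
    observed = sum(outcomes) / len(outcomes)
    expected = sum(predictions) / len(predictions)
    return observed / expected


# Hypothetical toy data: 1 = died within 30 days, 0 = survived.
died = [0, 0, 1, 0, 1, 1]
emr = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]

print(auc(died, emr), brier_score(died, emr), standardised_mortality_ratio(died, emr))
```

A lower Brier score and a higher AUC indicate better predictions, whereas an SMR above 1 means that more deaths were observed than the model expected.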

Results
A total of 217,289 first-time admissions were identified, of which one sixth (n = 36,214) were randomly allocated to the test set and five sixths (n = 181,075) to the training set. The median age was 65 years (interquartile range, IQR 48–76 years), the median SAPS 3 score was 53 (IQR 42–65), and the 30-day mortality was 18.5%. Baseline characteristics, including SAPS 3 parameters of the study population, are shown in Table 1. There were no differences between the test set and the training set (after correction for multiple testing) in any of the parameters shown in Table 1. In Fig. 3, we see that the calibration error (that is, the difference between OMR and EMR) in the high EMR range (0.7–1) was reduced in the ANN model. The improvement in AUC using the ANN model over the SAPS 3 model for different primary ICU diagnoses can be seen in Table 2 (means, 95% confidence intervals, and p values were obtained using the method of DeLong [18]). The ANN model outperformed the SAPS 3 model for all the top primary diagnoses. In our study, an eight-parameter subset of the SAPS 3 parameters was the smallest subset that achieved better performance than the SAPS 3 model. The eight parameters were (in order of importance for AUC) age, level of consciousness, neurological cause, cardiovascular cause, cancer, temperature, pH, and leukocytes. The eight-parameter model had an AUC of 0.851 (95% CI 0.845–0.857) and a Brier score of 0.106 (95% CI 0.106–0.107). In Fig. 4, the SMR is displayed as a function of age, the most important prognostic factor. The ANN model was superior in correcting mortality (with respect to age as a prognostic factor) compared to the SAPS 3 model, which underestimated the mortality in the elderly ICU population. Conversely, the SAPS 3 model overestimated the mortality in the younger ICU population.

Discussion
We have shown that a well-designed neural network model can outperform the SAPS 3 model in the prediction of 30-day mortality while using the same parameters obtained within 1 h of admission. The ANN model was better with regard to both sensitivity and specificity, as measured by the AUC of the ROC curve (0.89 vs. 0.85, p < 10⁻¹⁵), and notably in the calibration (Brier score of 0.096 vs. 0.109; p < 10⁻⁵). As seen in Fig. 3, the ANN model was better at predicting 30-day mortality in the sickest patients, specifically those with a very high EMR (over 0.70). We noted in Fig. 4 that the ANN model was superior in correcting for the most important prognostic factor, namely age. This single improvement in detecting a nonlinear relationship may very well have been the major contributor to the improved performance of the ANN model. The improvement in AUC using the proposed   [6]. They used a super learner algorithm that performs at least as well as the best performing of its 12 constituent algorithms, one of which was an ANN. Their finding was that a random forest algorithm performed best, and they reached a cross-validated AUC of 0.88 (95% CI 0.87–0.89), as compared to 0.82 reached by APACHE II. In Pirracchio's study, they had access to SAPS II data and APACHE II data, both of which are registered within the first 24 h of admission (in contrast with SAPS 3, which only uses data from the first hour). It is important to note that their AUC would be expected to be higher, as it is considerably easier to prognosticate mortality with data obtained within 24 h than with data obtained within 1 h of ICU admission.
Kim and colleagues compared a range of machine learning techniques for the identification of ICU mortality with APACHE III, using data recorded within the first 24 h, which makes it difficult to compare their AUCs with those of our study [7]. They reached an AUC of 0.87 with 15 parameters, which was the same as APACHE III, based on data from 23,446 ICU patients at Kentucky University Hospital in the USA during 1998–2007. Our AUC of 0.89, using data from only the first hour of admission, is thus higher than those of other models that rely on more information recorded during the first 24 h. It is also worth mentioning that some other studies report AUCs on the training data and not the test data, a practice that should be discouraged because overfitting can produce misleading AUCs; such studies are therefore not discussed here.
The main limitation of our study, as with all neural network models, is that such models can be viewed as "black box" models, i.e. there is little insight into how individual parameters contribute to the prediction. This problem is somewhat alleviated by ranking the predictors by their contribution to the total AUC. It is, however, inherent to many non-linear problems that the complex interactions found within the data are not easily expressed and interpreted. We believe that the primary aim of a good predictor is to be just that: a good predictor (of mortality).
ICU prognostication is an ongoing process and will most likely improve significantly over the next decade due to an increasing amount of patient-level data. Based on this study, we believe logistic regression-based predictive modelling should be abandoned and replaced with machine learning algorithms such as ANNs.

Conclusion
Our ANN model outperformed the SAPS 3 model (using the same data) in early (within 1 h of admission) prediction of 30-day mortality for intensive care patients in both AUC and calibration on a massive (217,289 admissions) dataset from the Swedish Intensive Care Registry. The superiority of our ANN model was also seen in the fact that an eight-parameter ANN model still outperformed the SAPS 3 model, which uses more than 20 parameters. Perhaps the most important result was that the ANN model was superior in correcting for the most important prognostic parameter, age. We thus encourage intensive care registries to use ANN models for short-term mortality predictions in quality control and research.

Availability of data and materials
The data are available from the Swedish Intensive Care Registry after an approval process.