Can we reliably automate clinical prognostic modelling? A retrospective cohort study for ICU triage prediction of in-hospital mortality of COVID-19 patients in the Netherlands

Background Building Machine Learning (ML) models in healthcare may suffer from time-consuming and potentially biased pre-selection of predictors by hand, which can result in a limited or trivial selection of suitable models. We aimed to assess the predictive performance of automating the process of building ML models (AutoML) for in-hospital mortality prediction of COVID-19 patients triaged at ICU admission, versus expert-based predictor pre-selection followed by logistic regression. Methods We conducted an observational study of all COVID-19 patients admitted to Dutch ICUs between February and July 2020. We included 2,690 COVID-19 patients from 70 ICUs participating in the Dutch National Intensive Care Evaluation (NICE) registry. The main outcome measure was in-hospital mortality. We assessed model performance (at admission and after 24h, respectively) of AutoML compared to the more traditional approach of predictor pre-selection followed by logistic regression. Findings The AutoML models with variables available at admission showed fair discrimination (average AUROC = 0·75-0·76 (sdev = 0·03), PPV = 0·70-0·76 (sdev = 0·1) at cut-off = 0·3, the observed mortality rate) and good calibration. This performance is on par with a logistic regression model with patient variables selected by three experts (average AUROC = 0·78 (sdev = 0·03) and PPV = 0·79 (sdev = 0·2)). Extending the models with variables available at 24h after admission resulted in models with higher predictive performance (average AUROC = 0·77-0·79 (sdev = 0·03) and PPV = 0·79-0·80 (sdev = 0·10-0·17)). Conclusions AutoML delivers prediction models with fair discriminatory performance and good calibration and accuracy, as good as regression models with expert-based predictor pre-selection.
In the context of the restricted availability of data in an ICU quality registry, extending the models with variables available at 24h after admission showed a small (but significant) performance increase.


Introduction
The prevalent approach to clinical prediction modeling often involves the manual selection of potentially relevant variables by experts, followed by regression analysis. Recent advancements in Machine Learning (ML) render this classical approach restrictive (it uses only one model type), inefficient (labor-intensive manual selection), and potentially biased (predictor pre-selection). Automated Machine Learning (AutoML) is the automation of the ML design process, which includes, among others, automatic model and variable selection and hyperparameter tuning [1]. The promise of AutoML is to remove or lessen the burden of manual ML design tasks. In this study, we assess the predictive performance of AutoML for clinical prognosis modeling by comparing classical modeling (manual variable selection followed by regression) and AutoML modeling approaches. In particular, we assess the performance of AutoPrognosis [2] for the prediction of in-hospital mortality of COVID-19 patients admitted to the ICU. AutoPrognosis is an AutoML tool developed for clinical prognostic modeling that learns 20 ML models (e.g., regression, neural networks, and linear discriminant analysis) simultaneously. The case study is particularly relevant for challenging the classical modeling approach, because (1) the largest proportion of prediction models for diagnosis and prognosis of COVID-19 were developed in the classical way (as of July 2021: 89 out of 238 models used regression); [3] and (2) efficient automated approaches might be part of a rapid response strategy in a crisis situation.
The classical approach to developing prediction models, based on expert-based predictor pre-selection followed by logistic regression, can be time- and labor-intensive and may be biased. In the case of new and yet unknown diseases, such predictor selection is not even possible. New and highly infectious diseases with high chances of leading to pandemic outbreaks, like COVID-19, require a rapid response in order to obtain and disseminate new information about the disease. It is unclear whether automated clinical prognostic modelling approaches based on different machine learning algorithms, which are more rapid and less labor-intensive, are able to reliably predict in-hospital mortality for COVID-19 patients [2].
The aim of this study is twofold. First, to assess the performance of prognostic models to predict in-hospital mortality of COVID-19 patients admitted to Dutch ICUs using automated clinical prognostic modelling versus using the more traditional approach with expert-based predictor preselection followed by logistic regression. Second, to assess the performance of these models based on data available at ICU admission versus data available after 24h of ICU admission.

Data
This study used prospectively collected data on all patients admitted between February 15th and July 1st 2020 with confirmed COVID-19 to a Dutch ICU extracted from the Dutch National Intensive Care Evaluation (NICE) registry. This NICE dataset contains, amongst other items, demographic data, minimum and maximum values of physiological data in the first 24h of ICU admission, diagnoses (reason for admission as well as comorbidities), ICU as well as in-hospital mortality data and length of stay [4]. This data collection takes place in a standardized manner according to strict definitions and stringent data quality checks to ensure high data quality [5].
Patients were considered to have COVID-19 when the RT-PCR of their respiratory secretions was positive for SARS-CoV-2 or when their CT-scan was consistent with COVID-19 (i.e. a CO-RADS score of ≥ 4 in combination with the absence of an alternative diagnosis) [6]. All analyses were performed on two variants of the NICE dataset: (1) when including only variables available at ICU admission (0h) and (2) when including all variables available after the first 24h of ICU admission (24h).

Outcome measurements
The primary outcome of this study was in-hospital mortality. During the peak of COVID-19 there was a shortage of ICU beds in some hospitals, and many patients were transferred to other ICUs. For transferred patients, we could follow their transfers throughout the Netherlands (because all Dutch ICUs participate in the registry used) and used the survival status of the last hospital the patient was admitted to during the same COVID-19 episode.

Analyses
We applied AutoPrognosis to build prognostic models for prediction of in-hospital mortality using an automated machine learning (AutoML) process [2]. Supplementary Section 1 provides a brief technical overview of how AutoPrognosis works.
Comparative design - In our study, we compared three different approaches (see Table 1) to develop a prognostic model to predict the in-hospital mortality of confirmed COVID-19 patients. Additionally, as a reference, we applied a recalibrated version of the Acute Physiology and Chronic Health Evaluation IV (APACHE IV) regression model, [7] one of the most commonly used prognostic models in intensive care, to our COVID-19 patient population. Such a reference enabled us to verify whether developing an ad hoc model makes sense at all (independently of the approach used).
Statistical Analysis - All analyses were performed using Python v3.6 and R version 3.5.1 x64 with publicly available software packages.
For the reporting of this study, we followed the TRIPOD statement (https://www.tripod-statement.org) and the IJMEDI checklist for assessment of medical AI (https://zenodo.org/record/4835800) [8]. The file is available in an Open Science Foundation (osf.io) repository (https://osf.io/d68cr/). Table 2 includes an overview of the processing operations that were performed. For the expert-selection approach, three intensivists (DD, DdL, SA) independently preselected predictors from a list of available variables in the NICE registry. Discrepancies were resolved by discussion and based on consensus. The APACHE III acute physiology score [9] and the overall Glasgow Coma Scale (GCS) score [10] were included, and the raw predictors that these scores take into account were excluded (we tried adding the raw predictors, but this did not improve results). A further selection of the predictors was done with a backward stepwise AIC selection model.

Table 1 Model approaches.

Fully-automated
We performed an AutoPrognosis analysis on all available patient variables; these variables were not processed, i.e., not selected or transformed.

Semi-automated
We performed an AutoPrognosis analysis on patient variables that were selected by means of stepwise regression and subsequently transformed (capped and normalized); see Table 2 for details.

Expert-selection
We performed a more traditional logistic regression analysis on patient variables that were selected based on the opinions of experts (i.e., intensivists) and by means of stepwise regression.
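The backward stepwise AIC selection used in the semi-automated and expert-selection approaches can be illustrated with a minimal sketch. The study used established statistical software for this step; the function names and toy data below are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aic(X, y, cols):
    """AIC = 2k - 2*logL for an (approximately) unpenalized logistic model."""
    model = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, cols], y)
    p = np.clip(model.predict_proba(X[:, cols])[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    k = len(cols) + 1  # coefficients plus intercept
    return 2 * k - 2 * loglik

def backward_stepwise_aic(X, y):
    """Greedily drop the predictor whose removal lowers the AIC,
    until no single removal improves the AIC any further."""
    cols = list(range(X.shape[1]))
    best = aic(X, y, cols)
    improved = True
    while improved and len(cols) > 1:
        improved = False
        for c in list(cols):
            candidate = [v for v in cols if v != c]
            score = aic(X, y, candidate)
            if score < best:
                best, cols, improved = score, candidate, True
                break
    return cols

# Toy data: two informative predictors and one pure-noise column.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
selected = backward_stepwise_aic(X, y)
```

The noise column tends to be dropped because removing it lowers the penalty term 2k while barely changing the log-likelihood.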

Model performance
We measured (1) discrimination: Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), sensitivity, positive predictive value (PPV), negative predictive value (NPV), and Brier score (i.e., the mean squared error of the prediction); (2) calibration: calibration curves; and (3) interpretation: model coefficients. AUROC and AUPRC were provided by AutoPrognosis; we computed the other required measurements separately. For PPV, NPV, and sensitivity, the decision threshold was set to 0.3, which is the average mortality rate in this patient population, corresponding to outcome prevalence [11]. For some models built by AutoPrognosis (e.g., neural networks), interpretation is not readily available but requires more elaborate techniques such as SHAP [12] or LIME [13]; for these models, interpretation was not measured.
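At a fixed cut-off, these threshold-based measures follow directly from the confusion matrix; a minimal sketch (illustrative data, not the study's code):

```python
import numpy as np

def threshold_metrics(y_true, y_prob, cutoff=0.3):
    """PPV, NPV, and sensitivity at a fixed decision threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
    }

# Toy example with five patients (1 = died in hospital).
m = threshold_metrics([1, 0, 1, 0, 0], [0.8, 0.2, 0.4, 0.35, 0.1], cutoff=0.3)
```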

Validation
The model performance was evaluated as the average performance over a five-fold cross-validation (this is the default validation in AutoPrognosis). For all three approaches, the folds were kept identical to enable a fair comparison. The original APACHE IV model, as a baseline, was first-level recalibrated with the same five folds to achieve a better fit with our specific population, and was then also evaluated with the same five folds. Following Moreno and Apolone, [14] recalibration was done by computing a new intercept α_new and an overall calibration slope β_new by fitting a logistic regression model with the APACHE IV linear predictor (lp_APIV) as the only covariate: lp_APIV,recal = α_new + β_new · lp_APIV.
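This first-level recalibration can be sketched as follows. The synthetic logits below stand in for the APACHE IV linear predictor, and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recalibrate(lp_apiv, y):
    """Fit lp_recal = alpha_new + beta_new * lp_APIV by logistic regression
    with the original linear predictor as the only covariate."""
    lr = LogisticRegression(C=1e6, max_iter=1000).fit(lp_apiv.reshape(-1, 1), y)
    return lr.intercept_[0], lr.coef_[0, 0]

def predict_recalibrated(lp_apiv, alpha_new, beta_new):
    """Map the recalibrated linear predictor back to a probability."""
    return 1.0 / (1.0 + np.exp(-(alpha_new + beta_new * lp_apiv)))

# Toy data: original logits that are three times too extreme, so the
# fitted calibration slope should come out near 1/3.
rng = np.random.default_rng(1)
lp = 3.0 * rng.normal(size=1000)
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-lp / 3.0))).astype(int)
alpha_new, beta_new = recalibrate(lp, y)
probs = predict_recalibrated(lp, alpha_new, beta_new)
```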

Approach comparison
Performance measures for discrimination and calibration were assessed by averaging the mean predicted values and the fraction of positives of the best models per fold. To determine the best model per fold, we performed a model comparison within AutoPrognosis. The best model is the one that achieved the highest average AUROC over the five folds. We used the 5x2 cross-validation (CV) F-test to determine whether differences between models were statistically significant [15,16]. For interpretation, we provided feature importance results for the best performing model within each approach. The interpretation results were judged on clinical relevance by intensivists (DdL, SA, DD).
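The combined 5x2cv F-test can be computed directly from the ten per-fold score differences (a sketch following Alpaydin's formulation; the differences below are made-up numbers, not results from the study):

```python
import numpy as np
from scipy import stats

def five_by_two_cv_ftest(diffs):
    """Combined 5x2cv F-test. `diffs` is a 5x2 array of per-fold score
    differences between two models (5 replications of 2-fold CV)."""
    diffs = np.asarray(diffs, dtype=float)
    mean_per_rep = diffs.mean(axis=1, keepdims=True)
    s2 = ((diffs - mean_per_rep) ** 2).sum(axis=1)  # variance per replication
    f = (diffs ** 2).sum() / (2 * s2.sum())
    p = stats.f.sf(f, 10, 5)                        # F(10, 5) under H0
    return f, p

# Toy example: consistent AUROC differences of about 0.02 between two models.
diffs = np.array([[0.021, 0.019], [0.022, 0.018], [0.020, 0.020],
                  [0.023, 0.017], [0.019, 0.021]])
f, p = five_by_two_cv_ftest(diffs)
```

Because the differences are consistent across replications relative to their within-replication variance, the statistic is large and the null hypothesis of equal performance is rejected.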

Study population
In total, 2,706 confirmed COVID-19 patients from 70 ICUs were included, of whom 2,690 (99⋅4%) could be followed up until hospital discharge; 796 patients (29⋅6%) died during their hospital stay. Table 3 (data at admission) and Supplementary Table 1 (data at 24h) show the descriptive summary statistics of the patient population stratified by hospital survival status.

Models' performance
Discrimination - Tables 4a (models with data at admission; referred to as 0h models from here on) and 4b (models with data after 24h; referred to as 24h models from here on) show the AUROC, AUPRC, PPV, NPV, and Brier scores of the three approaches. The obtained 0h and 24h models have fair discriminatory performance (AUROC = 0⋅75-0⋅78). For both the 0h and 24h models, there is a significant difference in discriminatory performance in terms of AUROC, AUPRC, and Brier score between the fully- and semi-automated approaches (AUROC 0h: p < 0⋅05, AUROC 24h: p < 0⋅01, AUPRC and Brier score both 0h and 24h: p < 0⋅01, for the 5x2 CV F-test). Additionally, for the 24h models the results of the APACHE IV model are significantly different from all other models for all measures but NPV (p < 0⋅01 for the 5x2 CV F-test). The best 0h and 24h models obtained by the fully-automated approach were linear discriminant analysis (LDA) models. The best 0h model of the semi-automated approach was an LDA model; the best 24h model was a logistic regression (logR) model. In the context of triage, the PPV is most important, as one does not want to falsely identify non-survivors and withhold ICU care from them. The 0h model PPVs range between 0⋅70 (fully-automated) and 0⋅79 (expert-selection); there is no significant difference in PPV between the three approaches (p > 0⋅05 for the 5x2 CV F-test).
Calibration - Fig. 1a (data at admission: 0h) and 1b (data at 24h: 24h) show the calibration curves of the three approaches. The 0h and 24h models were well calibrated (calibration curves closely follow the 45° line) and the 24h models outperformed the calibration of the APACHE IV model.
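A calibration curve of this kind (observed event fraction versus mean predicted probability per bin) can be computed with, for example, scikit-learn; an illustrative sketch with simulated, perfectly calibrated predictions:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Simulated predictions that are perfectly calibrated by construction:
# the event occurs with exactly the predicted probability.
rng = np.random.default_rng(3)
y_prob = rng.random(2000)
y_true = (rng.random(2000) < y_prob).astype(int)

# Fraction of observed positives and mean predicted value per bin;
# a well-calibrated model tracks the 45-degree line.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
```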
Interpretation of the models - Fig. 2a (data at admission: 0h) and Fig. 2b (data at 24h: 24h) show the coefficients of the best performing models of the fully-automated approach (linear discriminant analysis). Supplementary Table 2 includes the 0h-model description for the fully-automated approach. Supplementary Table 3 includes the 0h-model description for the semi-automated approach. For both LDA models, the major harmful risk factor for mortality was the patient's age, and the major protective factor was the date at which the patient was admitted to the ICU (a later date implies a lower mortality risk). Fig. 3a (data at admission) and Fig. 3b (data at 24h) show the coefficients of the best performing models of the semi-automated approach (best 0h model: LDA, best 24h model: logR). Again, age (harmful) and ICU admission date (protective) were found to be the most important risk factors. Fig. 4a (data at admission) and Fig. 4b (data at 24h) show the coefficients of the logR models of the expert-selection approach. The most important factors were again age (harmful) and ICU admission date (protective). Supplementary Table 4 includes the model description of the 0h logR model.
Variable selection - Supplementary Table 5 shows the variable selections. For the 0h models, the semi-automated approach selected the fewest variables (16 versus 25 selected by the experts), and there is a major overlap (13 out of 16 variables) between the variable selections of the semi-automated and expert-selection approaches. For the 24h models, the semi-automated approach selected more variables (34) than the experts did (30), but the overlap (13 variables) is the same as for the 0h models.

Table 2 Overview of data processing operations.

All approaches

Missings
Missing values for numerical variables were imputed using fast k-nearest neighbour (kNN) imputation [27]; mode imputation was used for categorical variables. Multiple imputation by chained equations (MICE) [28] yielded similar results.

Derived variables
In addition to the original patient variables as collected and described above, we included a derived variable for the body mass index (BMI), computed as weight divided by squared height.

Variable selection
Variables were selected with a backward stepwise selection model based on the Akaike information criterion (AIC) [29] before application of AutoPrognosis.

Extreme values
Extreme values were removed by capping numerical variables (below the 1st percentile and above the 99th percentile).

Rescaling
All variables were rescaled to the range [0,1] by min-max normalization: x' = (x - min(x)) / (max(x) - min(x)), where x is the original value and x' is the normalized value.
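The processing operations in Table 2 that apply to numerical variables (kNN imputation, capping at the 1st/99th percentiles, and min-max rescaling) can be sketched as a small pipeline. This is illustrative code with synthetic data, not the study's implementation.

```python
import numpy as np
from sklearn.impute import KNNImputer

def preprocess(X, k=5, lo_pct=1, hi_pct=99):
    """Impute missing numerical values with kNN, cap at the 1st/99th
    percentiles, then min-max rescale each column to [0, 1]."""
    X = KNNImputer(n_neighbors=k).fit_transform(X)
    lo = np.percentile(X, lo_pct, axis=0)
    hi = np.percentile(X, hi_pct, axis=0)
    X = np.clip(X, lo, hi)                 # cap extreme values per column
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)      # x' = (x - min) / (max - min)

# Synthetic data with about 5% missing values.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.05] = np.nan
Xp = preprocess(X)
```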

Table 4a
Comparison of the automated, semi-automated, and expert-selection approaches using data available on admission (0h). We outline the average results for the five-fold cross-validation with the standard deviation between brackets, considering the best model per fold. For both PPV and NPV, the decision threshold was set to 0⋅3.

Discussion
In this study, we assessed the predictive performance of automated clinical prognostic modelling (AutoML) for in-hospital mortality of ICU-admitted confirmed COVID-19 patients by comparing two automated modelling approaches using AutoML (fully-automated and semi-automated) and one expert-selection approach, in which intensivists selected potentially relevant variables and a logistic regression analysis was performed. In addition, we compared the predictive performance of models that had access only to variables available at admission (0h) with models that had access to variables available at 24h after ICU admission (24h). Overall, predictive performance in terms of discrimination (AUROC) was fair (0⋅7-0⋅8).
For the 0h models, there was no significant difference in discrimination (AUROC) between the automated and manual approaches. The LDA model constructed by the semi-automated approach (the best model of that approach) did significantly outperform the LDA model constructed by the fully-automated approach (the best model of that approach), but the difference was too small to be clinically relevant. There was no significant difference in PPV between the three approaches.
The 24h models performed similarly in terms of discrimination (AUROC), PPV, and calibration. The selected best model for the semi-automated approach differed between 0h and 24h (0h: LDA, 24h: logR); for the fully-automated approach the best 0h and 24h models were the same (both LDA).
The 24h models were found to perform significantly better than the 0h models (an AUROC improvement of 0⋅02); since this is only a small improvement, it may not be clinically relevant.

Table 4b
Comparison of the automated, semi-automated, and expert-selection approaches and the APACHE IV baseline using data from the first 24h after admission (24h). We outline the average results for the five-fold cross-validation with the standard deviation between brackets, considering the best model per fold. For both PPV and NPV, the decision threshold was set to 0⋅3.

Fig. 1a. Calibration curves of the fully-automated, semi-automated, and expert-selection approaches using data available on admission. Below, the distribution of predicted values is shown.

Related work
The studies most closely related to our work focus on the development and assessment of prognostic models of mortality among COVID-19 infected patients [17,18] and the identification of prognostic factors for severity and mortality in patients infected with COVID-19. [19][20][21][22][23] As for the development of prognostic models, reported predictive performance varies from fair (AUROC 0⋅7-0⋅8) to very good (AUROC > 0⋅9); performance measures other than AUROC (e.g., calibration) are rarely assessed; and the studies show a high risk of bias and concern sample sizes of at most 577 (Table 1 in Wynants et al. [3]).
As for finding strong prognostic factors, similar to other studies we found age, sex and patient history (comorbidities) to be predictors of mortality among COVID-19 patients.
Additionally, other indicative predictors were found in other studies, such as body temperature, disease signs and symptoms (such as shortness of breath and headache), blood pressure, features derived from CT images, oxygen saturation on room air, hypoxia, diverse laboratory test abnormalities, and biomarkers of end-organ dysfunction. [17,18,20,21,23] Most of these other predictors were not included in our dataset (mainly because the registry data used does not include detailed individual patient information). For some of the included comorbidities, we have no explanation for why these were not selected as predictors in our models, other than that this results from dependencies and correlations that are specific to our set of predictors. Our best performing models included CPR, gastrointestinal bleeding, and neoplasm, which were not mentioned in other studies. This may be because these data items are not systematically recorded in other datasets, or because the combination of COVID-19 with another important reason for ICU admission cannot be identified in other studies. A poor prognosis of ICU patients with cancer and after CPR, even independent of COVID-19, is expected and known. [24,25]
Fig. 1b. Calibration curves for the fully-automated, semi-automated, and expert-selection approaches using data from the first 24h from admission. Below, the distribution of predicted values is shown.

Strengths
The sample size of our study is large (i.e., it contains many confirmed COVID-19 patients), and the dataset is comprehensive (i.e., it contains many features per patient). As for the analysis, our evaluation is rigorous in that we use multiple performance measures. In general, our approach enables the rapid development of prediction models in a crisis such as the COVID-19 epidemic, since the registry data that we use are readily available and we use an automated machine learning approach.

Limitations
Regarding the model development, we enabled the logistic regression model to perform better to some degree (e.g., with/without variable selection, inclusion of either aggregate (APACHE, GCS) scores or the raw predictors), but this was not done exhaustively. Boosting logR performance further is still possible, for example by allowing it to use the best form of the predictors (e.g., transformations with restricted cubic splines [26]). We considered further model tweaking to be out of scope, because we primarily compare (automated versus traditional) approaches and not models.
As for the data, the NICE registration data used does not include all laboratory or other individual patient variables, but a specific selection, and sometimes an aggregation, of routinely collected data. Other studies do include more, and different, individual patient information, such as time series of laboratory values and features derived from CT images, which may explain their higher predictive performance.

Implications
Our study shows the value of automated modelling. After further development and extensive validation, such models could be of great value in assisting medical staff in making decisions on ICU admittance and treatment, thereby supporting the most efficient use of ICU capacity.
Since we did not find clinically relevant differences between models using data at admission time and models using data from 24h after admission, this may affect the triage process itself as well: when considering predicted mortality under high pressure on ICU capacity, it may not be effective to admit patients only to see how they develop in the first 24h. However, when limited ICU capacity is not the main driver for triage, one might argue that 24h is not long enough to accurately estimate an individual's survival chances.

Future work
The models achieve fair (AUROC 0⋅7-0⋅8) but not good (AUROC > 0⋅8) predictive performance. The addition of more individual patient information, such as more detailed laboratory values (instead of the min/max values that we included) and findings from CT images obtained from the electronic patient record, may increase performance, since other COVID-19 models that include those predictors show better performance than ours; this is therefore worthwhile to investigate.

Conclusions
This study shows that automated clinical prognostic modelling (AutoML) delivers prediction models with fair predictive performance in terms of discrimination, calibration, and accuracy. The model performance is as good as that of models developed using the more time-consuming regression analysis with expert-based predictor pre-selection. Models including data from the first 24h of ICU admission did significantly outperform models based on admission data, but the clinical relevance of the difference is small. These results pave the way to serve as a baseline for rapid automated model development in times of pandemics or other enduring crises that affect ICU capacity and hence increase the need for patient triage.

Fig. 2a. Coefficients of the linear discriminant analysis model of the fully-automated approach using data available at admission.

Other declarations
The investigators were independent from the funders; IV, SB and MCS had full access to the data, have verified the data, and take responsibility for the integrity of the data and the accuracy of the data analysis; the lead author (the manuscript's guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

Summary Table
What was already known on the topic:
• Classical prediction models (i.e., regression models with manual predictor selection) yield good performance for clinical diagnosis and prognosis, but the modeling process is potentially biased and limited.
• Automated prognostic modelling (AutoML) facilitates automatic model and variable selection and hyperparameter tuning, and can lessen the burden of carrying out manual design tasks for prediction modeling.
• The largest proportion of prediction models for diagnosis and prognosis of COVID-19 were developed in the classical way (regression with manual predictor selection).
What this study added to our knowledge:
• Automated modeling can deliver clinical prediction models that perform on par with more classical models (regression models with manual predictor selection).
• Automated modelling can assist decision-making on ICU admittance and treatment, and can support efficient use of ICU capacity.
• Admitting COVID-19 patients to the ICU to see how they develop in the first 24 hours may not be effective.

Ethics approval and consent to participate
The study protocol was reviewed by the Medical Ethics Committee of the Amsterdam Medical Center, the Netherlands. This committee provided a waiver from formal approval (W20_273 # 20.308) and informed consent since this trial does not fall within the scope of the Dutch Medical Research (Human Subjects) Act.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data is available under stringent conditions as described on the NICE website https://www.stichting-nice.nl/extractieverzoek_procedure.jsp (in Dutch).