Assess and validate predictive performance of models for in-hospital mortality in COVID-19 patients: A retrospective cohort study in the Netherlands comparing the value of registry data with high-granular electronic health records

Purpose To assess, validate and compare the predictive performance of models for in-hospital mortality of COVID-19 patients admitted to the intensive care unit (ICU) over two different waves of infections. Our models were built with high-granular Electronic Health Records (EHR) data versus less-granular registry data. Methods Observational study of all COVID-19 patients admitted to 19 Dutch ICUs participating in both the national quality registry National Intensive Care Evaluation (NICE) and the EHR-based Dutch Data Warehouse (hereafter EHR). Multiple models were developed on data from the first 24 h of ICU admissions from February to June 2020 (first COVID-19 wave) and validated on prospective patients admitted to the same ICUs between July and December 2020 (second COVID-19 wave). We assessed model discrimination, calibration, and the degree of relatedness between development and validation population. Coefficients were used to identify relevant risk factors. Results A total of 1533 patients from the EHR and 1563 from the registry were included. With high-granular EHR data, the average AUROC was 0.69 (standard deviation of 0.05) for the internal validation, and the AUROC was 0.75 for the temporal validation. The registry model achieved an average AUROC of 0.76 (standard deviation of 0.05) in the internal validation and 0.77 in the temporal validation. In the EHR data, age and respiratory-system-related variables were the most important risk factors identified. In the NICE registry data, age and chronic respiratory insufficiency were the most important risk factors. Conclusion In our study, prognostic models built on less-granular but readily-available registry data had similar performance to models built on high-granular EHR data and showed similar transportability to a prospective COVID-19 population. Future research is needed to verify whether this finding can be confirmed for upcoming waves.


Introduction
The coronavirus disease 2019 (COVID-19) has challenged global health and society at large. Most countries have experienced multiple COVID-19 waves in recent years. Models that estimate the risk of in-hospital mortality of COVID-19 patients in the intensive care unit (ICU) could be valuable for decision making on treatment (intensify or withhold) and for capacity planning. Many prognostic models have been developed, often using data purposely collected from electronic health records (EHR) [1]. Various existing ICU data registries or specifically developed COVID-19 data collections improved our understanding of the relation between patient characteristics and disease progress at the ICU [2][3][4].
EHR data typically have high granularity (multiple variables and measurements over time). They potentially support the application of advanced methods, but combining these data from multiple centers, each using different data models and coding lists, requires considerable effort and time. In a sudden pandemic or crisis situation, a rapid response is needed; therefore, waiting to collect, curate and aggregate EHR data might not be possible. In contrast, high-quality registry data are already collected, more uniform, quality-controlled, and thereby readily usable. Thus, they may enable a faster response, although they have lower granularity and possibly-delayed availability. A comparison of the value of registry data with high-granular electronic health records for building prognostic models is still lacking.
This study aims to assess, validate over successive waves of infections, and compare the predictive performance of models for in-hospital mortality of COVID-19 patients admitted to the intensive care unit (ICU), when such models are developed with high-granular ICU data collected from various hospitals' EHR or with low-granular registry data.

Study design and population
This was a multi-center observational study on prospectively collected EHR data on patients from 19 ICUs participating in both the Dutch ICU Data Warehouse [5] and the Dutch National Intensive Care Evaluation (NICE) registry [6,7]. We included all patients who were 18 years or older and were admitted between February 15th, 2020 and January 1st, 2021 with confirmed COVID-19. Thirteen of those 19 ICUs uploaded EHR data during the second wave. For the NICE registry, we selected data from the same 19 ICUs.
COVID-19 was defined as a positive real-time reverse transcriptase polymerase chain reaction (RT-PCR) assay for SARS-CoV-2 or, in the early phase of the pandemic, as a CT-scan consistent with COVID-19, i.e. a COVID-19 Reporting and Data System (CO-RADS) score of ≥ 4 in combination with the absence of an alternative diagnosis [8].

Data collection
NICE is a quality registry with national coverage since 2016. For all their patients, ICUs extract a predefined dataset from the routinely collected data in their EHR and upload it each month after manual validation and completion. This predefined dataset includes demographics, minimum and maximum values of physiological data in the first 24 h of ICU admission, diagnoses, ICU and in-hospital mortality, and length of stay [6]. Data collection is standardized with strict definitions and stringent data-quality checks [7]. Hereafter we call this data source REG.
The Dutch ICU Data Warehouse includes high-granular data of critically-ill patients with COVID-19 in the Netherlands. The raw data were extracted from the participating hospitals' EHR. Parameters were mapped to a common nomenclature by a team of clinicians, data entry errors were filtered, and derived parameters were added (e.g., body-mass index) when not directly provided [9]. Data included demographics, administrative variables, comorbidities, physiological data, and information on patient positioning and ventilation characteristics. Hereafter we call this data source EHR. EHR patients transferred to another ICU were linked when data from both the referring and the receiving hospital were available; otherwise they were excluded, as their final in-hospital outcome was unknown.

Outcome and predictors
The outcome of this study was in-hospital mortality. The variables available in the two data sources within the first 24 h of admission were included as predictors, as the models are intended to be used 24 h after admission. The predictors finally included are listed in Tables 1 and 2.

Data preprocessing
The data preprocessing was the same for both datasets, unless specified otherwise. We removed administrative variables with no clinical value (such as identifiers), variables regarding discharge (date, destination), and variables with zero variance. In REG, which had less missing data than EHR, we removed variables with over 45 % missing data; in EHR, we removed variables with over 85 % missing data. For the remaining numerical variables, missing values were imputed using multiple imputation by chained equations (MICE) [10]. Mode imputation was used for the remaining categorical variables. Backward stepwise variable selection was used with the Akaike information criterion [11]. Numerical variables were capped below the 1st percentile and above the 99th percentile. All variables were rescaled to the range [0,1] with min-max scaling. In EHR, the average, minimum, maximum, difference between the last and first measurement, and slope were computed from the repeated measurements of each numerical variable available in the first 24 h. For categorical variables, the mode was selected.
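As an illustration, the capping, scaling, and first-24-h aggregation steps could be sketched as follows in plain NumPy. The function names and implementation are our own for illustration, not the study's actual code (which additionally used MICE imputation and backward selection):

```python
import numpy as np

def cap_and_scale(x, lo_pct=1, hi_pct=99):
    """Cap a numeric column at the 1st/99th percentiles, then min-max scale to [0, 1]."""
    lo, hi = np.percentile(x, [lo_pct, hi_pct])
    x = np.clip(x, lo, hi)
    return (x - x.min()) / (x.max() - x.min())

def aggregate_24h(times_h, values):
    """Summarise repeated measurements within the first 24 h of admission:
    average, minimum, maximum, last-minus-first difference, and slope."""
    t = np.asarray(times_h, dtype=float)
    v = np.asarray(values, dtype=float)
    mask = t <= 24.0
    t, v = t[mask], v[mask]
    # Slope of a least-squares linear fit over time; 0 if only one measurement.
    slope = np.polyfit(t, v, 1)[0] if len(t) > 1 else 0.0
    return {"mean": v.mean(), "min": v.min(), "max": v.max(),
            "delta": v[-1] - v[0], "slope": slope}
```

In this sketch each repeated-measurement variable thus yields five derived features, matching the five summaries described above.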

Analyses
We developed several prognostic models to predict in-hospital mortality with each of the two datasets. We used AutoPrognosis, an automated machine learning process [12,13]. The best model per dataset was chosen based on predictive performance, variability of performance, and interpretability, and it was then internally and temporally validated.

Performance measures, internal validation, and temporal validation
We measured discrimination with the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), positive predictive value (PPV), negative predictive value (NPV), and the Brier score. We also assessed model calibration with calibration curves and provided model coefficients to interpret the models. The Brier score was used to assess discrimination rather than calibration, given its known limitations for assessing calibration [14]. A calibration curve gives better insight into areas of risk prediction with larger deviation between predicted and true risk. For both PPV and NPV, the decision threshold was set to 0.3, the average mortality rate in this patient population.
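A minimal sketch of these discrimination measures using scikit-learn; the function name and its composition are our own illustration, not the study's code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def discrimination_report(y_true, y_prob, threshold=0.3):
    """Discrimination measures as described above; the default threshold of 0.3
    matches the average mortality rate in this patient population."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "auprc": average_precision_score(y_true, y_prob),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "brier": brier_score_loss(y_true, y_prob),
    }
```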
Model performance was internally validated by the average performance over a fivefold cross validation on all COVID-19 patients admitted to Dutch ICUs between February 15th and June 30th, 2020 (first wave).
Various factors (virus mutations, treatment strategies, etc.) may impact model performance over time. To validate our models over time [15], we validated on a prospective dataset of all COVID-19 patients admitted to 13 of the same 19 Dutch ICUs between July 1st, 2020 and January 1st, 2021 (second wave), as the other 6 ICUs did not provide data in this period. Following Debray et al. [16], we assessed the degree of relatedness between development and validation population to understand whether the temporal validation was estimating the model's reproducibility or its transportability. Model reproducibility means that a model performs sufficiently accurately across new samples from the same target population. Transportability is the ability of a model to perform well across samples from different but related populations. To assess the degree of relatedness between development and validation population, we evaluated their case-mix differences: we built a logistic-regression membership model that uses the same predictors used by the in-hospital mortality models plus the in-hospital mortality outcome. The outcome of the membership model was the predicted probability that a patient belongs to the development or the validation population. When such a model performed poorly, the development and validation populations had a similar case-mix, and therefore the temporal validation assessed the reproducibility of the model. When such a model performed well, the development and validation populations had a different case-mix, and therefore the temporal validation tested the transportability of the model. The membership model performance was assessed with the AUROC and interpreted according to Hosmer and Lemeshow [17].
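A membership model of this kind could be sketched as follows; this is an illustrative reconstruction under our own naming, not the study's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def membership_auroc(X_dev, y_dev, X_val, y_val):
    """Fit a logistic-regression 'membership' model that tries to distinguish
    development patients from validation patients, using the mortality-model
    predictors plus the mortality outcome as features. A high AUROC means the
    case-mix differs, so temporal validation assesses transportability rather
    than reproducibility."""
    X = np.vstack([np.column_stack([X_dev, y_dev]),
                   np.column_stack([X_val, y_val])])
    # Label 0 = development population, 1 = validation population.
    membership = np.concatenate([np.zeros(len(X_dev)), np.ones(len(X_val))])
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X, membership, cv=5, scoring="roc_auc")
    return scores.mean()
```

An AUROC near 0.5 would indicate indistinguishable populations (similar case-mix); values well above 0.5, as found here, indicate a case-mix shift.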
Statistical difference between the performance results of the models built on EHR and REG was assessed with a paired Student's t-test for dependent samples after bootstrapping each measure over 300 iterations.
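The bootstrap-then-paired-t-test comparison could be sketched like this, with AUROC as the example measure; the function name and details are our own illustration:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import roc_auc_score

def bootstrap_paired_test(y_true, prob_a, prob_b, n_boot=300, seed=42):
    """Bootstrap a performance measure (AUROC here) over n_boot resamples for
    two models evaluated on the same patients, then compare the paired samples
    with Student's t-test for dependent samples."""
    rng = np.random.default_rng(seed)
    y_true, prob_a, prob_b = map(np.asarray, (y_true, prob_a, prob_b))
    a_scores, b_scores = [], []
    n = len(y_true)
    while len(a_scores) < n_boot:
        idx = rng.integers(0, n, n)  # resample patients with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # a resample must contain both classes for AUROC
        a_scores.append(roc_auc_score(y_true[idx], prob_a[idx]))
        b_scores.append(roc_auc_score(y_true[idx], prob_b[idx]))
    return ttest_rel(a_scores, b_scores)
```

Because the two models score the same bootstrap resamples, the paired test accounts for the dependence between the two series of measures.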

Table 1
Descriptive summary statistics of the EHR patient population used in the development (and internal validation) as well as temporal validation, stratified by in-hospital mortality. We only show the variables selected after the variable selection with backward elimination. APTT stands for activated partial thromboplastin time, FiO2 for fraction of inspired oxygen, PaCO2 for partial pressure of carbon dioxide, PaO2 for partial pressure of oxygen.

Study population
In EHR, the development population included 992 confirmed COVID-19 patients admitted to 19 ICUs, who could be followed until hospital discharge. In total, 316 patients (31.9 %) died during their hospital stay. Survivors were significantly younger (61.5 vs 68.6 years) and less often male (71.0 % vs 78.8 %) than non-survivors. For the temporal-validation population, 541 confirmed COVID-19 patients of 13 ICUs were included; 181 patients (33.5 %) died during their hospital stay. As in the development population, survivors were significantly younger (61.5 vs 70.5 years) and less often male (70.6 % vs 74.0 %) than non-survivors. Table 1 shows the descriptive summary statistics of each patient population.
In REG, 972 patients admitted to the same 19 ICUs as EHR were included in the development population. In total, 322 patients (33.1 %) died during their hospital stay. Survivors were significantly younger (61.0 vs 68.2 years) and less often male (69.8 % vs 77.6 %) than non-survivors. For the temporal-validation population, 591 confirmed COVID-19 patients of the same 13 ICUs as EHR were included; 181 patients (30.6 %) died during their hospital stay. As in the development population, survivors were significantly younger (61.2 vs 70.0 years), but were more often male (71.7 % vs 68.5 %) than non-survivors. Table 2 shows the descriptive statistics of each population.

Results
Among the 20 models built with AutoPrognosis, the best model on both datasets was logistic regression. Table 3 shows the discrimination of the EHR and REG models in terms of AUROC, AUPRC, PPV, NPV, and Brier score. Both models had fair discriminatory performance in the internal validation (AUROC = 0.69 vs 0.74). On all measures, the best model developed on REG data performed significantly better (p < 0.01) than the best model developed on EHR data. Fig. 1 shows the calibration curves of the models for the internal and temporal validation. In the internal validation, the REG model and the EHR model were similarly calibrated.

Fig. 2 shows the coefficients of the EHR model. Age, fraction of inspired oxygen and glucose were the most important risk factors; the estimated glomerular filtration rate and erythrocytes were other important risk factors. The EHR membership model showed acceptable performance (AUROC = 0.71), as illustrated in Table 4. The EHR model yielded better results in the temporal validation than in the internal validation for all measures but NPV, which slightly decreased (AUROC = 0.75, see Table 3). In the temporal validation the model calibration was slightly worse than in the internal validation (Fig. 1).

Fig. 3 shows the coefficients of the REG model. Age and chronic respiratory insufficiency were the most important risk factors; the fraction of inspired oxygen and chronic obstructive pulmonary disease (COPD) were other relevant risk factors. The REG membership model showed acceptable performance (AUROC = 0.76), as outlined in Table 4. The predictive performance of the REG model in the temporal validation improved for all measures compared to the internal validation, except AUPRC, which slightly decreased (AUROC = 0.75, see Table 3). The REG model showed better calibration than in the internal validation for most of the predictions (Fig. 1).
The results of the temporal validation are significantly better for the REG model than for the EHR model on all measures (p < 0.01), although the EHR model improved more than the REG model from the internal to the temporal validation in terms of AUROC (from 0.69 to 0.75 vs from 0.74 to 0.75, respectively). The calibration is similarly good for both models (Fig. 1).

Main findings
We assessed the predictive performance of clinical prognostic models for in-hospital mortality of ICU patients with confirmed COVID-19 using high-granular EHR data and low-granular REG data. The predictive performance in the internal validation was fair (AUROC of 0.69-0.74). In the temporal validation, the performance improved (AUROC from 0.69 to 0.75 for EHR and from 0.74 to 0.75 for REG). The membership models' results on both datasets indicate that the case-mix was different, and therefore the temporal validation assesses the transportability of the models. For temporal validation, transportability means that the models are stable over time. Both models transported well to the temporal-validation population, since their performance in the temporal validation increased. Such an increase may also be due to the use of 5-fold cross-validation in the internal validation, which yields conservative performance estimates: the model is trained five times on 80 % of the data, as every fold corresponds to one fifth (20 %) of the data. We selected the final model used in the temporal validation by retraining on the whole development dataset. The better performance of the REG model in the internal validation, and the similar performance of both models in the temporal validation despite the EHR model having more predictors, may be due to the greater number of missing values in the EHR dataset.
Age, fraction of inspired oxygen and glucose were the strongest predictors in the EHR model. In REG, age and chronic respiratory insufficiency were the most important predictors. Different risk factors are identified in the two data sources because of the different sets of variables they include. Additionally, some variables, such as the lowest verbal response in the first 24 h, although available in both datasets, were selected by the variable selection in one model but not in the other.

Related work
Similar to [18], we found age and respiratory-system variables to be predictors of mortality among COVID-19 patients. Other predictors found in other studies ranged from diverse laboratory tests to comorbidities [1,19,20,21,22,23]. Izcovich et al. identified 49 valuable predictors [24], including various laboratory tests that we also identified in our EHR data, such as neutrophils, or in REG, e.g., COPD (see Figs. 2 and 3). Among the predictors found by other studies, some were not included in our dataset (the participating hospitals did not collect or share such information). Some of the earlier-found comorbidities, medications (notably steroids, anticoagulants, vasopressors) and other predictors, e.g., lung compliance, ventilator volumes and pressures, that were included in our EHR dataset were not selected as predictors in our models. This might result from dependencies and correlations that are specific to our set of predictors. Various prognostic models of mortality among patients with COVID-19 have been proposed [1,18,25,26,27,28,29]. Their predictive performance varied from fair (AUROC 0.7-0.8) to excellent (AUROC > 0.9). However, many studies showed a high risk of bias [1] and only a few temporally validated their models [27][28][29]. Our performance in the temporal validation is in line with another externally-validated model developed on EHR data [30], as well as with other studies that take readily-available data into account [31,32]. The importance of external validation and of continuous monitoring and updating of machine-learning models to ensure their long-term safety and effectiveness has also been underlined by previous studies [33,34]. Table S2 includes the model description.

Table 4
Discrimination of the EHR and REG membership models. We outline the average results for the fivefold cross validation with the standard deviation in between brackets.

Strengths and weaknesses
We temporally validated COVID-19 prognostic models, which is an important aspect of model evaluation [15], especially for a new disease. We used data from multiple centers and two different data sources, each with its own benefits and limitations. The EHR data source includes raw EHR data and hence more variables and more measurements per variable compared to the REG dataset. However, it was time-consuming to join all the different EHR data schemas and, perhaps accordingly, missing values were frequent in the EHR data. The REG dataset is less rich in the number of variables and measurements per variable, but more standardized and quality-controlled.
Our study also has some limitations that need consideration in its interpretation. First, due to privacy regulations we were not able to join the two datasets and link the same patients. Although we used the same ICUs, there were small differences regarding the included patients.
Second, the EHR data source included data only up to January 1st, 2021, so it was not possible to temporally validate its model on later waves of infection, when vaccinations became available. To compare models built on EHR and REG data, we limited the REG data to the same period. Six of the 19 ICUs did not provide data after June 2020 and were not included in the temporal validation. The EHR data, although a dump of several EHRs, did not include all available laboratory or other individual patient variables, and not all hospitals provided all variables; e.g., D-dimer was collected by only a few hospitals. Although repeated measurements of the same variable (time series) were available, we aggregated them, reducing the granularity in time and potentially losing useful information. Other studies include more and different individual patient information, such as time series of laboratory values and features derived from CT images, which may explain their higher predictive performance [1].
Third, because we used cross-validation, we applied imputation and normalization before data splitting, which may introduce bias. Performing these steps within the 5-fold cross-validation would have required five different imputations (each itself a multiple imputation) as well as five different normalizations. First, this would have made it difficult to define the final imputed and normalized data: it is not straightforward which of the five imputations or normalizations to use in the final model, or how to properly aggregate them. Second, it would have made the computation time explode for the EHR data, which hold over 2000 variables (before selection). After in-depth discussions and preliminary analyses, we nested only the variable selection in the cross-validation, using majority voting to select variables across the five folds (a variable needed to be selected in 3 out of 5 folds to enter the final model). Given that we also temporally validated the models, we believe this bias has a low impact.
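The majority-voting rule for nesting variable selection in the cross-validation could be sketched as follows (an illustration with our own naming):

```python
from collections import Counter

def majority_vote_selection(fold_selections, min_votes=3):
    """Combine per-fold variable selections: keep a variable in the final model
    only if backward elimination selected it in at least min_votes folds
    (3 out of 5 in the study described above)."""
    # Count in how many folds each variable was selected.
    votes = Counter(v for fold in fold_selections for v in set(fold))
    return sorted(v for v, c in votes.items() if c >= min_votes)
```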
Fourth, we do not exclude that extensive parameter tuning of single models may provide slightly better results than automated machine learning with AutoPrognosis. However, a gap exists between the potential and actual use of machine learning in prognostic research, because classical model development and tuning requires greater time and effort (and may become unfeasible in the healthcare domain). More importantly, it is hard for clinicians with little or no expertise in machine learning to do so [35]. Automated machine learning tackles these issues, and our study shows how it can be successfully used in the healthcare domain.
Finally, removing transferred patients from the EHR dataset may have introduced bias, because transferred patients may be healthier (fit for transport) or more severely ill (needing treatment in a better-equipped ICU). Whenever data from both the referring and the receiving hospital were available, the data were linked to limit these exclusions.

Implications
Registries, with less-granular but readily-available and controlled data, provided better performance than high-granular EHR data in the internal validation and showed similar results in the temporal validation. Independently of the data source, model performance remained stable over time. This is an important finding, because longstanding ICU registries require less effort for data collection, integration, and processing than setting up a specific research dataset from multiple EHRs with different data models, unless such a data platform is already in place, which is currently rare but upcoming (giviti.marionegri.it/portfolio/covid-19/, last access 11/02/2022) [5,36,37]. When EHRs move to using information standards and/or FAIR data, many of the current disadvantages of EHR data may disappear, since collection, integration and processing will be eased. However, validation and quality control of the data, as is in place in registries, may still remain a challenge. High-granular EHR data could still be beneficial for other problems, such as finding the optimal combination of ventilator settings or drugs for individual patients.

Future works
We aggregated the repeated measurements of each available numerical variable. Exploiting these repeated measurements should be investigated. COVID-19 patients typically have long ICU stays, and twenty-four hours might be too short an interval to estimate patients' survival; determining the best 'ICU trial time' requires further research. Other interesting directions of research would be exploiting models other than logistic regression for the population-membership discrimination, e.g., kernel-based methods, as well as tracking the presence or absence of data with additional variables instead of imputing missing values, as done by a recent study [38].

Conclusions
In our study, temporally-validated models built on less-granular but readily-available registry data performed closely to models developed with higher-granular EHR data and showed the same transportability to a prospective COVID-19 population. Readily-available registry data might be a valuable resource when a rapid response is needed. Future research is needed to verify whether this finding can be confirmed for upcoming COVID-19 waves and for models focusing on other ICU patient categories.

Summary Table
What was already known on the topic:
• Electronic health records (EHR) typically have high granularity (multiple variables and measurements over time).
• EHR data can enable the application of advanced machine learning methods, but combining these data from multiple centers requires considerable effort and time.
• In a sudden pandemic or crisis situation, a rapid response is needed; therefore, waiting to collect, curate and aggregate EHR data is undesirable.
What this study added to our knowledge:
• Prognostic models built on less-granular but readily-available registry data can, in particular cases as in this study, achieve performance similar to models built on high-granular EHR data.
• Prognostic models built on high- and less-granular data of COVID-19 patients show equal transportability to a prospective COVID-19 population.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data and code availability
All participating hospitals have access to the Dutch ICU Data Warehouse and NICE data. The NICE registry data are available under conditions as described on the NICE website at stichting-nice.nl/extractieverzoek_procedure.jsp (in Dutch). External researchers can get access to the Dutch ICU Data Warehouse in collaboration with any of the participating hospitals. The list of collaborators is available in the coauthor list and in the collaborators section, through the corresponding author, and through the contact details on amsterdammedicaldatascience.nl. Research questions have to be in line with the DSA: to investigate the course of COVID-19 in the ICU and to research potential treatments. Researchers have to sign a code of conduct before accessing the data.
The code used for our analyses is publicly available at bitbucket.org/aumc-kik/automl4covid.