Developing and validating COVID-19 adverse outcome risk prediction models from a bi-national European cohort of 5594 patients

Patients with severe COVID-19 have overwhelmed healthcare systems worldwide. We hypothesized that machine learning (ML) models could be used to predict risks at different stages of management and thereby provide insights into drivers and prognostic markers of disease progression and death. From a cohort of approx. 2.6 million citizens in Denmark, SARS-CoV-2 PCR tests were performed on subjects suspected of having COVID-19; 3944 cases had at least one positive test and were subjected to further analysis. SARS-CoV-2 positive cases from the United Kingdom Biobank were used for external validation. The ML models predicted the risk of death with a Receiver Operating Characteristic Area Under the Curve (ROC-AUC) of 0.906 at diagnosis, 0.818 at hospital admission and 0.721 at Intensive Care Unit (ICU) admission. Similar metrics were achieved for predicted risks of hospital and ICU admission and use of mechanical ventilation. Common risk factors included age, body mass index and hypertension, although the top risk features shifted towards markers of shock and organ dysfunction in ICU patients. The external validation indicated fair predictive performance for mortality prediction, but suboptimal performance for predicting ICU admission. ML may be used to identify drivers of progression to more severe disease and for prognostication in patients with COVID-19. We provide access to an online risk calculator based on these findings.


Methods
All methods were carried out in accordance with relevant guidelines and regulations. The study was approved by the relevant legal and ethics boards, including the Danish Patient Safety Authority (Styrelsen for Patientsikkerhed, approval #31-1521-257) and the Danish Data Protection Agency (Datatilsynet, approval #P-2020-320) as well as the UK Biobank (Application ID #60861) COVID-19 cohort. Under Danish law, approval from these agencies is required for access to and handling of patient sensitive data, including EHR records. Legal approval for the study was furthermore obtained from the Danish Capital Region (Region Hovedstaden). Patients from the UK biobank have provided informed consent prior to enrolment in the biobank. Under Danish law, informed consent for patient chart access for research purposes can be waived, provided approval from the Danish Patient Safety Authority (see approval number above) is obtained prior to data access.
We conducted a prospective study by including all individuals undergoing a SARS-CoV-2 test (nasal and/or pharyngeal swab subjected to Real-Time Polymerase Chain Reaction testing) in the Capital and Zealand Regions (approximately 2.6 million citizens) of Denmark between March 1st, 2020 and June 16th, 2020. Data inclusion was censored on June 16th. Patients were identified through their Central Person Registry (CPR) number, a unique numerical combination given to every Danish citizen, enabling linking of electronic health records (EHRs) with nationwide medical registry data.
During the study period, all SARS-CoV-2 tests were performed at regional hospitals, with patients referred for testing based on presence of symptoms albeit test strategies shifted towards the end of the inclusion period to include a wider screening indication.
For cases with at least one positive test, we extracted data from the bi-regional EHR system, including demographics, comorbidities and prescription medication. In-hospital data included laboratory results and vital signs.
Supplementary Table S1 lists extracted comorbidities with their definitions, Supplementary Table S2 extracted laboratory values, and Supplementary Table S3 extracted temporal features (vital signs).
For the purpose of external validation of the ML models, we extracted data from the UK biobank COVID-19 cohort. The UK biobank contains detailed healthcare information on 500,000 UK citizens, of whom 1650 have tested positive for SARS-CoV-2. This cohort has recently been made available for the purpose of COVID-19 research by the UK biobank consortium 12, and contains COVID-19 diagnostic test data, in-hospital data as well as general practitioner data. For further information on the UK biobank COVID-19 cohort, please see Ref. 13. The rationale behind choosing the UK biobank dataset as an external validation cohort was primarily the need for model validation on an international dataset from a comparable health care system. The specific choice of the UK biobank was due to the curated and high-quality nature of this dataset, encompassing data from 500,000 UK citizens.
Prediction models. ML models were trained and validated on the Danish dataset. A subset of models sharing identical data fields (e.g. age, comorbidities etc.) between the Danish and UK cohorts were subsequently externally validated on the UK biobank dataset.
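We constructed ML prediction models by including available data for patients up to and including the selected time frames or time points. These time frames or points were (see Table 3):
• At diagnosis.
• At hospital admission, including the first 12 h of admission.
• The 12 h leading up to ICU admission.
• The 12 h after ICU admission.
For each task, we trained with different feature sets to study how incrementally adding data affects model performance as well as to gain insight into drivers of disease progression:
• Base models: Age, sex and body mass index (BMI).
• Comorbidities models: Base features plus comorbidity information.
• Temporal models: Comorbidities features plus temporal features (e.g. vital signs).
• Models additionally including hospital laboratory tests, where applicable.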
For the purpose of external validation, data points were available in the UK biobank matching those of the base and comorbidities models. In-hospital models could not be externally validated due to lack of availability of these data points in the UK biobank.
ML models. We used random forests (RFs) 14, implemented in the open-source machine learning library scikit-learn 15. Because each individual tree was trained on a bootstrap sample, there were out-of-bag (OOB) samples from the training data that could be used to estimate the performance of the RF, a method considered superior to nested cross-validation 16.
All models were evaluated on the Danish set using fivefold cross-validation. The folds were stratified to ensure that the splits were representative of the full cohort. For each split, we conducted a grid search on the available training data fold to tune the hyperparameters of the RF models for each prediction task. As selection criterion, we computed the Receiver Operating Characteristic Area Under the Curve (ROC-AUC) on the OOB samples. Each model used 1000 decision trees, while we varied the maximum number of features considered in each split (all features or square root of all features) and either the maximum depth (5, 10 or unlimited) or the minimum samples for a split (2, 5 or 10). For the full model evaluation, we combined the outputs for all test folds, resulting in predictions for the entire data set, on which we report the ROC-AUC and the precision/recall AUC (PR-AUC).
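The evaluation scheme described above can be sketched as follows. This is a minimal illustration and not the authors' code: the data are synthetic (the patient data are not public), the helper name `fit_best_rf` is hypothetical, and the forest is reduced to 50 trees so the example runs quickly, whereas the text specifies 1000. Average precision is used here as one common estimate of the PR-AUC, which is an assumption about how that metric was computed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import ParameterGrid, StratifiedKFold

# Synthetic stand-in for the cohort (assumption: 8 features, ~20% events).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# Grid from the text: max_features in {all, sqrt}, combined with EITHER
# max_depth in {5, 10, unlimited} OR min_samples_split in {2, 5, 10}.
GRID = ParameterGrid([
    {"max_features": [None, "sqrt"], "max_depth": [5, 10, None]},
    {"max_features": [None, "sqrt"], "min_samples_split": [2, 5, 10]},
])


def fit_best_rf(X_train, y_train, n_trees=50, seed=0):
    """Pick the grid point with the highest ROC-AUC on the out-of-bag samples."""
    best_auc, best_rf = -np.inf, None
    for params in GRID:
        rf = RandomForestClassifier(n_estimators=n_trees, bootstrap=True,
                                    oob_score=True, random_state=seed, **params)
        rf.fit(X_train, y_train)
        # Column 1 holds the OOB probability estimate of the positive class.
        oob_auc = roc_auc_score(y_train, rf.oob_decision_function_[:, 1])
        if oob_auc > best_auc:
            best_auc, best_rf = oob_auc, rf
    return best_rf


# Stratified fivefold CV; test-fold predictions are pooled before scoring.
pooled = np.zeros(len(y))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    rf = fit_best_rf(X[train_idx], y[train_idx])
    pooled[test_idx] = rf.predict_proba(X[test_idx])[:, 1]

roc_auc = roc_auc_score(y, pooled)
pr_auc = average_precision_score(y, pooled)
print(f"ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}")
```

Selecting hyperparameters on OOB samples means that no additional inner validation split of the training fold is needed, while the held-out test fold stays untouched until the final scoring.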
For evaluation on the UK data, models were trained on the entire Danish data set. As before, for each model, a grid search based on the OOB ROC-AUC was performed on the same parameter grid. Each model was then evaluated on the entire UK cohort.
For each task, we also applied standard logistic regression as a baseline model, but omit the results here, since logistic regression exhibited sub-par or equal performance compared to the RF in most cases, the only exception being prediction of death when not including in-hospital tests.
Post-hoc analysis of the use of the predictive variables across all decision trees in the RF allowed us to derive a measure of feature importance. Feature importance was calculated by the mean decrease in impurity (MDI). This measure considers how often a feature is used when classifying the training data points and how well it splits the training data points when being used. The predictive variable importance was computed for the models trained on the entire data set. The top-10 or top-20 features (depending on the model) were extracted and visualized. The correlations of these features were then computed across the entire dataset.
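To make the MDI measure concrete, the sketch below computes the impurity decrease credited to a feature at a single split. It assumes Gini impurity (the scikit-learn default, which the text does not confirm was used), and the node, its labels and the "age > 70" split are hypothetical toy values. For a fitted forest, scikit-learn exposes the aggregated, normalized result as the `feature_importances_` attribute.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0/1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease credited to the feature used at this split.

    The decrease is weighted by the fraction of all training samples that
    reach the node, so splits near the root count for more.
    """
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - children)


# Hypothetical root node with 10 training patients, split on "age > 70".
parent = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # 1 = died, 0 = survived (toy labels)
left = [1, 1, 1, 0, 1]                   # age > 70
right = [0, 0, 0, 0, 0]                  # age <= 70

decrease = impurity_decrease(parent, left, right, n_total=len(parent))
print(round(decrease, 3))
```

Summing these decreases over every node that splits on a given feature, and averaging over all trees, gives that feature's MDI. A feature therefore scores highly when it is used often, used near the root, and separates the outcome classes well.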
Missing data. Missing data was considered missing at random.
Percentages of available data in the Danish cohort are presented in Supplementary Table S4. Missing values for BMI were imputed by k-nearest neighbour imputation using age and sex 17.
Data presentation and statistical testing. Continuous data is presented as medians (interquartile range) and compared using the Mann-Whitney U test. Categorical data is presented as percentages and compared using the Chi-square test. ML model performances are presented as ROC-AUC for Positive and Negative predictive value and precision/recall. Model comparisons were performed by the DeLong test 18.
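The k-nearest neighbour imputation of BMI from age and sex mentioned above could look roughly like the following sketch. The function name `knn_impute_bmi`, the value of k, the scaling of age and the toy records are all assumptions made for illustration; the text does not state these details.

```python
def knn_impute_bmi(records, k=2):
    """Fill missing BMI (None) with the mean BMI of the k nearest complete records.

    Distance is Euclidean on age and sex only. Age is scaled by its observed
    range so that it does not swamp the 0/1 sex indicator (an assumption).
    """
    ages = [r["age"] for r in records]
    span = (max(ages) - min(ages)) or 1.0

    def dist(a, b):
        d_age = (a["age"] - b["age"]) / span
        d_sex = a["sex"] - b["sex"]
        return (d_age * d_age + d_sex * d_sex) ** 0.5

    complete = [r for r in records if r["bmi"] is not None]
    imputed = []
    for r in records:
        if r["bmi"] is None:
            nearest = sorted(complete, key=lambda c: dist(r, c))[:k]
            bmi = sum(c["bmi"] for c in nearest) / len(nearest)
            imputed.append({**r, "bmi": bmi})
        else:
            imputed.append(dict(r))
    return imputed


# Toy records (hypothetical values); sex is coded 0/1.
patients = [
    {"age": 30, "sex": 0, "bmi": 22.0},
    {"age": 35, "sex": 0, "bmi": 24.0},
    {"age": 70, "sex": 1, "bmi": 28.0},
    {"age": 75, "sex": 1, "bmi": 30.0},
    {"age": 72, "sex": 1, "bmi": None},
]
filled = knn_impute_bmi(patients, k=2)
print(filled[-1]["bmi"])
```

In the toy data the two nearest complete records to the 72-year-old are the 70- and 75-year-olds of the same sex, so the imputed value is the mean of their BMIs. scikit-learn offers a comparable `KNNImputer` class for array data.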
The p-values for comparisons of outcome groups are provided for reference only. As these comparisons are not part of the study hypotheses, p values are presented without post-hoc correction for multiple testing and should be interpreted as such.
In addition, calibration curves are presented for the combined test folds and the external validation data. For each calibration plot, the predictions were grouped using quantile-based binning.
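Quantile-based binning for a calibration plot can be sketched as below: predictions are sorted and cut into bins holding equal numbers of patients, and each bin contributes one point (mean predicted risk, observed event rate). The function name, the number of bins and the toy predictions are assumptions for illustration only.

```python
def calibration_points(y_true, y_prob, n_bins=5):
    """Return (mean predicted risk, observed event rate) per quantile bin."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        # Equal-count slices of the sorted predictions act as quantile bins.
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(t for _, t in chunk) / len(chunk)
        points.append((mean_pred, obs_rate))
    return points


# Toy predictions and outcomes (hypothetical values).
y_prob = [0.1, 0.1, 0.2, 0.2, 0.8, 0.8, 0.9, 0.9]
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
points = calibration_points(y_true, y_prob, n_bins=2)
for mean_pred, obs_rate in points:
    print(f"predicted={mean_pred:.2f}  observed={obs_rate:.2f}")
```

A well-calibrated model yields points close to the diagonal. Compared with equal-width bins, quantile bins keep the number of patients per point constant, which helps when most predicted risks are low. scikit-learn provides the same idea through `calibration_curve` with `strategy="quantile"`.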
Online models. The risk prediction model for SARS-CoV-2 positive patients admitted to hospital is available in an online version at https://cope.science.ku.dk.

Results
A total of 3944 individuals had at least one positive SARS-CoV-2 test in the two Danish regions and were included in the study. These were supplemented by the 1650 patients from the UK biobank used for external model validation. Figure 1 depicts patient identification and selection in a flowchart form for the Danish cohort.
Among the Danish cases, 1359 (34.5%) required hospitalization, and 181 (4.6%) intensive care. A total of 324 patients (8.2%) died. Demographics information for the UK biobank external validation cohort is presented in Supplementary Table S5.
Table 1. Demographic information on the group of SARS-CoV-2 positive patients, including information on pre-existing comorbidities. Supplementary Table S2 holds information on diagnoses codes included in the individual comorbidity classifications. The table presents information on the full cohort (admitted and non-admitted SARS-CoV-2 positive patients) as well as subgroups admitted to a hospital and Intensive Care Unit (ICU) respectively. Furthermore, differential demographics between survivors and non-survivors (in-hospital mortality) are presented. Continuous variables are presented as medians with (interquartile range). COPD chronic obstructive pulmonary disease. **p < 0.001 when subgroups are compared (e.g. hospitalized vs. non-hospitalized, ICU vs. non-ICU, survivors vs. non-survivors).
When compared to non-hospitalized patients, hospital admitted patients were older and more likely to be male, and a number of comorbidities were overrepresented in the admitted subgroup. These included hypertension, diabetes, ischemic heart disease, heart failure, arrhythmias, stroke, chronic obstructive pulmonary disease (COPD) or asthma, osteoporosis, neurological disease, cancer, chronic kidney failure and use of dialysis. Hospitalized patients were more likely to be smokers (Table 2).
For hospitalized patients requiring ICU admission vs. hospitalized patients without ICU admission, only male sex, Body Mass Index (BMI), dementia and hypertension differed between groups. ICU-admitted patients were furthermore more likely to be smokers, older and male (Table 1).
Non-survivors were furthermore more likely to suffer from hypertension, diabetes, ischemic heart disease, heart failure, arrhythmias, stroke, COPD or asthma, osteoporosis, dementia, mental disorders, neurological disease, cancer, chronic kidney failure and use of dialysis.
When compared to non-admitted, admitted patients differed significantly in all measured values (Table 2). Among those hospitalised, those admitted to the ICU had derangements in many variables (Table 2). The same was observed for non-survivors compared with survivors (Table 2).

ML models prediction. ML models are presented in Table 3 and graphically depicted in supplementary […]. Base models deployed on the time of diagnosis were able to predict hospital admission with a ROC-AUC of 0.820, ICU admission 0.802, ventilator treatment 0.815 and death 0.902 (Table 3 and Supplementary Fig. S1).
Adding information on patient comorbidities increased the predictive ability for all outcomes. Models deployed at hospital admission achieved ROC-AUC scores ranging from 0.675 to 0.818 for the selected outcomes (Table 3 and Supplementary Fig. S2). Adding information on comorbidities, temporal features […]
Table 3. Main results from the prediction models. Predictions were performed with data available from four different time frames in the patient disease trajectories (left column): On diagnosis (Diagnoses model), On hospital admission and 12-h into admission (Admission model), 12 h leading up to Intensive Care Unit (ICU) admission (Pre-ICU model) and 12 h after ICU admission (post-ICU model). Models were trained to predict risk of hospital admission, ICU admission, ventilator treatment and death (top row). All models were trained with incremental data, starting with age, gender and Body Mass Index, then adding comorbidity information, temporal features (e.g. vital signs) and finally by adding hospital laboratory tests where applicable. Please see supplementary tables S1 and S2 for data definitions. Performance metrics are presented as the Receiver Operating Characteristics Area Under the Curve (ROC-AUC) for True/False positive rates (TPR/FPR) and Precision/Recall (Pre/Rec). *Model is significantly (p < 0.01) better than the base prediction model (Age + gender + Body Mass Index, BMI). # Model is significantly (p < 0.01) better than the comorbidities model. § Model is significantly (p < 0.01) better than the temporal model. --: Insufficient data available at the time point, or prediction irrelevant (e.g. predicting hospital admission for patients already in the ICU).
[…] show that the models are well calibrated when looking at all diagnosed subjects and at patients admitted to the hospital. When restricted to patients admitted to ICU, the calibration gets worse as expected due to smaller sample size.
Ventilator treatment could not be predicted accurately, and the calibration curves reflect this.
External validation results (Supplementary Table S6) on UK data indicated an overall reduction in model classification ability. For diagnosed patients, ROC-AUCs were 0.661 for predicting hospital admission, 0.529 for predicting ICU admission and 0.742 for predicting mortality. Inspection of the calibration curves (Supplementary Fig. S7) shows that the models are only slightly worse calibrated for the UK data, meaning that the model outputs approximately the correct probability for individual patients, despite the degradation in ROC-AUC.
As patients progressed through the disease severity trajectories, mortality prediction remained in the area of 0.617-0.722 (Supplementary Table S6).

Detection of important features and drivers of disease progression.
Results of the feature detection analysis of drivers of disease progression for each of the selected timepoints are depicted in Fig. 2 (diagnoses model) and Fig. 3 (admission model) as well as Supplementary Fig. S4 (pre-ICU model) and Supplementary Fig. S5 (post-ICU model).
For diagnosed patients (Fig. 2), age and BMI were among the most relevant features for all targets, and indeed the most important features for predicting hospital admission and ventilator treatment.
Hypertension was the most important feature for predicting ICU admission, and indeed an important feature for all models.
For admitted patients (Fig. 3) the most relevant drivers of disease progression were age, BMI, hypertension and the presence of dementia. When the full dataset was analysed, lab tests indicating aspects of cell dysfunction (Lactic dehydrogenase, LDH), kidney dysfunction (Blood urea nitrogen and Creatinine), the inflammatory

Discussion
In this study, we analyse prognostic factors and factors associated with disease progression in 3944 SARS-CoV-2 positive patients by constructing an interpretable ML framework. In contrast to previous studies 1,2, this cohort included diagnosed patients outside hospitals, and thus covered the entire spectrum of SARS-CoV-2 positive patients in the 2.6 million regional population. Results indicate that by focusing on a limited number of demographic variables, including age, gender and BMI, it is possible to predict the risk of hospital and ICU admission, use of mechanical ventilation and death as early as at the time of diagnosis. Using these parameters only, our model achieved a ROC-AUC of 0.902 for mortality prediction, which is slightly inferior to a model reported by Gao et al. achieving a ROC-AUC of 0.962 using more complex clinical data points on admission 3.
Adding information on comorbidities to the model increased performance, indicating that these features play a prognostic role in the outcome of patients as they progress through the disease trajectory.
As such, results from the ML feature detection indicate that comorbidity factors such as hypertension and diabetes are driving factors of adverse outcome, which is in line with reports from other cohort studies [19][20][21]. The role of hypertension is further underlined by reports indicating a role of the angiotensin converting enzyme 2 (ACE2) receptor as an entry point for SARS-CoV-2 22. Whether COVID-19 interacts unfavourably with hypertension per se, or whether this risk is simply a manifestation of reduced tolerance to severe infection and hypoxia is currently debated 23.
Furthermore, BMI was identified as a major feature of adverse outcome, as also reported by others [24][25][26] . Whether this is due to a reduced respiratory capacity or chronic impairment of the immune system through alterations in tumour necrosis factor and interferon secretion associated with obesity, is also currently debated 27 . Caution should, however, be taken when analysing these results, as the median observed differences between groups were minor and may not be clinically relevant. Furthermore, data imputation may have impacted on these results.
The addition of more data points, including temporal features and lab tests improved the model's predictive value for hospitalized patients, with group comparisons indicating alterations of a plethora of laboratory tests for admitted patients, including features of immune activation and organ dysfunction. Interestingly, laboratory tests differed to a lesser extent between ICU and non-ICU patients, except for CRP levels, lymphocyte counts, LDH, ALT, neutrophil, D-dimer and ferritin levels as well as arterial blood gas values. As expected, ICU admitted patients had lower oxygen saturation and higher respiratory rates, likely reflecting the acute respiratory distress from COVID-19 pneumonia.
Feature analysis indicated that strong prognostic markers expectedly included CRP levels, but also markers of organ damage, including kidney injury (creatinine and blood urea nitrogen), liver injury (ALAT), cell damage (LDH), anaemia (haemoglobin levels) as well as ferritin levels. These, as well as vital signs and arterial blood gas values, superseded many of the comorbidities in feature importance once the patient progressed through hospital and ICU admission. This again indicates that drivers and prognostic markers of adverse outcomes represent a dynamic field affected by the patient's current point on the disease trajectory, and that differential values should be considered when risk-assessing COVID-19 patients depending on their current status (e.g. in hospital, in ICU etc.). A caveat is, however, that multiple comorbidities and advanced age may have resulted in decisions by patients, relatives or clinicians limiting the use of life-support, thus potentially precluding these patients from ICU admission and reducing the effect of comorbidities and age on model predictions.
Kidney injury has previously been reported in patients with COVID-19 28,29 and our finding that markers of kidney injury may be important at hospital admission supports the notion that COVID-19 associated kidney injury plays an important pathophysiological role.
The importance of LDH for COVID-19 patients has previously been reported in other ML 1 as well as clinical studies 30 and these results are supported by the feature detection from this study, indicating that LDH levels serve as an important prognostic marker on hospital admission, although its value is superseded by other biomarkers when the patient advances to the ICU stage. As LDH can be seen as a general marker of cell and organ damage with a reported prognostic value for mortality in ICU patients 31, these findings likely indicate a general organ affection associated with COVID-19 disease progression.
Abnormal liver function tests, including ALAT, have previously been associated with COVID-19 disease severity 32 and these reports have indicated the presence of elevated liver enzymes in both severe and non-severe COVID-19 cases 33. Whether this is a function of viral infection, shock or a consequence of hepatotoxic pharmaceuticals deployed during treatment is still not clear 34.
Ferritin levels have previously been associated with COVID-19 35 , presumably due to its role in immunomodulation and association with the cytokine storm response seen in critical illness 36 .
Taken together, the feature importance of laboratory tests indicating affection of several organ systems indicates that COVID-19 disease severity follows a predictable pattern characterized by multi-organ affection (albeit not always dysfunction), which is in line with previous findings 37 .
Once patients progress to the ICU stage, feature detection indicated a switch towards vital signs and biomarkers indicating that the severity of respiratory failure, shock and inflammatory markers were the most important features of risk of death ( Supplementary Figs. S5 and S6).
When the feature importance of all models is analysed, the results indicate that COVID-19 outcomes are at the time of diagnoses largely predictable through a relatively limited number of features, dominated by age, BMI and comorbidities, effectively proxies for frailty.
As patients follow their disease trajectories, differential features supersede each other in prognostic importance, and prognostic models should thus consider the patient's place in the disease trajectory. The results of the external validation did, however, show an overall reduction in the model's classification ability when the UK biobank cohort was analysed, thus impacting on the generalizability of the presented models, but results should be interpreted with caution.
As such, the UK cohort was assembled for the purpose of biobanking studies, and thus comprises a highly selected subset of patients, whereas the Danish cohort was population wide in the two analysed geographical regions. Demographic data also highlight differences between the two populations, including an age difference between groups. Indeed, when predicting death for ICU patients, where demographics are similar, we do not observe a reduction in model performance.
The differences in results can be explained by the change of the underlying data distribution, demonstrating that caution should be exercised when evaluating whether ML models are useful for local health care practitioners if developed on other cohorts, especially when developed on early phase COVID-19 data. As such, significant variations in national factors such as isolation policies and triage for ICU and mechanical ventilation, population demographics etc. may impact on results. This notion is supported by the finding that our model retained reasonable classification ability for mortality in UK patients, but failed to predict ICU admission risk.
These results could thus indicate that potential users of ML models for COVID-19 patients should carefully examine the generalizability of the training cohort and healthcare infrastructure where patients originated from and compare these with local features prior to model usage.
Our study has several limitations. The number of patients available for this analysis was limited, and additional patient data could change the results. This is especially evident when performing predictions in the ICU setting, where the number of patients was limited. A larger and preferably multinational dataset would be required to address this issue.
Secondly, we have extracted a subset of clinical variables from the EHR system and analysing other features could affect the model. Furthermore, the changing criteria for SARS-CoV-2 testing associated with the course of the pandemic likely also affect the results.
For external validation, our results are limited by the fact that the UK biobank data did not offer datapoints allowing for external validation of advanced features models.
Even with these limitations, we may conclude that ML may be leveraged to perform outcome prediction in COVID-19 patients, as well as serve as a potential tool for identifying drivers and prognostic markers.

Data availability
Patient data from this study has not been made available to the public due to patient confidentiality constraints.