Intubation and mortality prediction in hospitalized COVID-19 patients using a combination of convolutional neural network-based scoring of chest radiographs and clinical data

Objective: To predict short-term outcomes in hospitalized COVID-19 patients using a model incorporating clinical variables with automated convolutional neural network (CNN) chest radiograph analysis. Methods: A retrospective single-center study was performed on patients consecutively admitted with COVID-19 between March 14 and April 21, 2020. Demographic, clinical and laboratory data were collected, and automated CNN scoring of the admission chest radiograph was performed. The two outcomes of disease progression were intubation or death within 7 days and death within 14 days following admission. Multiple imputation was performed for missing predictor variables and, for each imputed data set, a penalized logistic regression model was constructed to identify predictors and their functional relationship to each outcome. Cross-validated areas under the receiver operating characteristic curve (AUC) were estimated to quantify the discriminative ability of each model. Results: 801 patients (median age 59 years; interquartile range 46–73 years; 469 men) were evaluated. At 7 days, 36 patients were deceased and 207 were intubated; at 14 days, 65 were deceased. Cross-validated AUC values for predictive models were 0.82 (95% CI, 0.79–0.86) for death or intubation within 7 days and 0.82 (95% CI, 0.78–0.87) for death within 14 days. Automated CNN chest radiograph score was an important variable in predicting both outcomes. Conclusion: Automated CNN chest radiograph analysis, in combination with clinical variables, predicts short-term intubation and death in patients hospitalized for COVID-19 infection. Chest radiograph scoring of more severe disease was associated with a greater probability of adverse short-term outcome. Advances in knowledge: Model-based predictions of intubation and death in COVID-19 can be performed with high discriminative performance using admission clinical data and convolutional neural network-based scoring of chest radiograph severity.


INTRODUCTION
Widespread vaccination for COVID-19 is underway; yet despite this, healthcare systems throughout many parts of the world continue to be overwhelmed by an escalating caseload. [1][2][3][4][5] The upward trajectory of COVID-19 cases places an inexorable strain on hospital resources and is likely to continue to do so with each surge cycle of the pandemic and with new viral variants. In caring for patients hospitalized for severe COVID-19 infection, clinical risk prediction tools that identify those most likely to decompensate in the short term would aid in optimizing the allocation of limited resources and in minimizing the morbidity and mortality associated with the disease.
Numerous clinical prediction tools have been developed in a bid to better manage COVID-19 infection. Varying permutations have been modeled, but most use a combination of readily available clinical parameters, including laboratory tests (e.g., white blood cell count, D-dimer, platelets), demographic data (e.g., age), and medical history data (e.g., vital signs and comorbidities), to identify patients with symptomatic COVID-19 who are most at risk for decompensation. [6][7][8] While most of these models have not been validated for generalizability, they have identified some common features associated with the disease course, including age and pulmonary and cardiovascular status. [6][7] Given that pulmonary infection is a hallmark of the illness, chest imaging has also been shown to correlate with outcomes. Models developed using chest CT assessment of disease burden along with clinical variables are reported as reasonably accurate, with AUC estimates greater than 0.8 [9][10]; but, in most practice settings, chest CT is obtained only in a small subset of COVID-19 patients, usually as a secondary assessment in cases of negative RT-PCR results but persistent clinical suspicion of SARS-CoV-2 infection. Even when patients are symptomatic and hospitalized with severe disease, CT imaging is typically performed to assess associated complications, such as pulmonary embolism, secondary infections, or sequelae of barotrauma. [11] In contrast, chest radiographs are obtained routinely when patients suspected of or known to be infected with COVID-19 present with symptoms; a prediction model based on them is therefore applicable to most hospitalized COVID-19 patients. Models using chest radiograph data in combination with clinical variables have been reported, but their performance overall is less robust than that of models using chest CT. [12][13][14][15][16][17][18] While there are many possible explanations for this difference, one reason may be that chest radiograph assessment of disease severity is subject to greater observer variability. However, a convolutional neural network (CNN)-based algorithm that calculates a COVID-19 severity score for a chest radiograph based on the density and extent of lung opacities has been developed, is publicly available, and has been shown to correlate with disease severity assessment by multiple radiologists. [19][20][21] Such an automated tool is by design not subjective and is therefore more reproducible than human readers; it better lends itself to a model for outcome prediction that can eventually be scaled and tested in large cohorts in various clinical settings.
For the purpose of the study, we defined two different outcomes relative to hospital admission: death or intubation within 7 days and death within 14 days. Death and intubation are readily available and objective endpoints that indicate a severe disease course in patients with COVID-19. For the short-term outcome of 7 days, as some patients decompensated so quickly that they died before intubation, the two states were combined to describe a single outcome of critical COVID-19 illness. Such a marker indicating a high likelihood of rapid decline would enable management planning, such as triage to more intensive monitoring and possibly prophylactic therapy. An assessment of the likelihood of death within 14 days encompasses the entirety of the two-week period over which most COVID-19 positive patients presenting with symptoms decompensate into severe disease. Thus, it is a useful marker for healthcare resource allocation.
A CNN algorithm for automated chest radiograph analysis has been developed, validated as a surrogate for radiologist assessment, and previously reported. [19][20][21] The algorithm automates chest radiograph interpretation, yielding a reproducible, numerical output of the imaging information. With this tool in hand, we set out to determine whether demographic, clinical, and laboratory variables, in combination with a chest radiograph severity score from the CNN algorithm, could predict outcomes in a way that could guide the management of hospitalized COVID-19 positive patients.

METHODS AND MATERIALS
Cohort definition and follow-up
Institutional review board (IRB) approval was obtained and the requirement for informed consent waived for this HIPAA compliant study. We performed a retrospective analysis of consecutive adult patients (age ≥18 years) admitted to our hospital system between 14 March 2020 and 21 April 2020 who were diagnosed with COVID-19 by reverse transcription-polymerase chain reaction assay before or within four days after admission. Patients either presented through the emergency department or were transferred from outside our hospital system. All patients were followed to the date of discharge or death. Patients who were alive and not discharged were followed until 15 September 2020.
We used International Classification of Diseases, Tenth Revision (ICD-10) codes to extract comorbidities (Supplementary Material 1), including diabetes, cancer, hypertension, cardiac disease, and respiratory disease (chronic obstructive pulmonary disease or emphysema and asthma). Laboratory values available within 7 days of admission were extracted; for each test, only the value closest to the date of admission was included. All Patients Refined Diagnosis-Related Groups (APRDRG) and ICD-10 codes were used to extract ventilation status (Supplementary Material 1). These variables were cross-referenced with thorough manual review of the electronic health record (EHR).

Outcome measures
Two outcome variables of short-term disease progression were defined: death or intubation within 7 days of admission and death within 14 days of admission.
Chest radiograph scoring
Chest radiographs included in this analysis were those obtained from 2 days before to 5 days after admission. If more than one exam was available, the chronologically earliest radiograph within this interval was analyzed. Chest radiographs with an endotracheal tube in place were excluded from modeling for intubation. An automated severity score, determined from the density and extent of lung opacities, was generated for each chest radiograph using a CNN algorithm previously validated in multiple patient populations using manual assessments of disease severity by multiple radiologists as a reference standard. [19] This algorithm receives raw DICOM pixel data from frontal chest radiographs as inputs and calculates a numeric score for lung disease severity. While the score is continuous, as a guide for interpretation, the following ranges of scores reflect different gradations of severity as established by our radiologists: ≤2.5 no or minimal disease, >2.5 and ≤5.0 mild disease, >5.0 and ≤9.0 moderate disease, and >9.0 severe disease. Examples of chest radiograph scores for representative images are shown in Figure 1.
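As an illustration only, the interpretive ranges listed above can be written as a small lookup. This is a minimal sketch: the function name `severity_category` is hypothetical and is not part of the published CNN algorithm, which outputs only the continuous score.

```python
def severity_category(score):
    """Map a continuous CNN severity score to the interpretive range in the text.

    Hypothetical helper for illustration; thresholds are those stated in the
    Methods (<=2.5, <=5.0, <=9.0, >9.0).
    """
    if score <= 2.5:
        return "no or minimal disease"
    if score <= 5.0:
        return "mild disease"
    if score <= 9.0:
        return "moderate disease"
    return "severe disease"


if __name__ == "__main__":
    # Boundary values fall in the lower category because the ranges are
    # defined with "less than or equal to" upper bounds.
    for example_score in (1.0, 2.5, 4.21, 9.0, 12.3):
        print(example_score, severity_category(example_score))
```

The categories are a reading aid; the models in this study use the continuous score as the predictor.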

Statistical analysis
Descriptive summaries were generated for continuous and categorical variables; continuous variables were summarized as median and interquartile range (IQR) and categorical variables as frequencies (percentages). The percentage of missing data was also calculated for each variable.
Missing data were present in at least one of the candidate predictors in 23% of the study population. Multiple imputation (MI) was used to account for the missing exposure values. Missing exposure data were "filled in" using observed exposure data via a chained equation approach. [22] To account for the uncertainty in the exposure data, this imputation process was repeated 100 times, resulting in the creation of 100 complete data sets. Outcome data were excluded from the imputation process.
For each imputed data set, a penalized (LASSO) logistic regression model was constructed. The model was chosen over a standard logistic model because of its ability to shrink parameter estimates that are not, or at most weakly, associated with the outcome of interest to zero. Thus, this approach simultaneously performs variable estimation and variable selection (e.g., variables whose estimates are zero are effectively dropped from the model). 10-fold cross-validation was performed to estimate all model parameters (both covariate effects and the tuning parameter). Parameter estimates and cross-validated model predictions were stored for each imputed data set. Additional details about the missing data patterns, the analysis approach (e.g., tuning parameter selection, cross-validation), comparator models (linear vs non-linear coding of continuous variables, main-effects only vs interactions), and estimation methods (LASSO, random forests) are provided in the supplemental documentation.
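To make the shrink-to-zero behavior concrete, the following is a minimal sketch of L1-penalized (LASSO) logistic regression in plain Python. It is not the study's implementation: the analysis used the glmnet library in R, whereas this sketch swaps in a simple proximal gradient (soft-thresholding) solver, and the function names, the fixed penalty values, and the toy data are all assumptions made for illustration.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(rows, labels, penalty, step=0.1, iterations=2000):
    """L1-penalized logistic regression by proximal gradient descent.

    Illustrative solver only (glmnet uses coordinate descent). The intercept
    is left unpenalized; predictors are assumed to be on comparable scales.
    """
    n_samples = len(rows)
    n_features = len(rows[0])
    intercept = 0.0
    coefs = [0.0] * n_features
    for _ in range(iterations):
        grad_intercept = 0.0
        grad = [0.0] * n_features
        for x, y in zip(rows, labels):
            linear = intercept + sum(c * v for c, v in zip(coefs, x))
            error = 1.0 / (1.0 + math.exp(-linear)) - y
            grad_intercept += error / n_samples
            for j in range(n_features):
                grad[j] += error * x[j] / n_samples
        intercept -= step * grad_intercept
        # Gradient step followed by soft-thresholding: weak effects become exactly zero.
        coefs = [soft_threshold(c - step * g, step * penalty)
                 for c, g in zip(coefs, grad)]
    return intercept, coefs


# Hypothetical toy data: the first predictor tracks the outcome, the second does not.
toy_rows = [(-2.0, 0.5), (-1.5, -0.5), (-1.0, 0.5), (-0.5, 0.5),
            (0.5, -0.5), (1.0, -0.5), (1.5, 0.5), (2.0, -0.5)]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]

if __name__ == "__main__":
    for penalty in (0.2, 0.6):
        _, fitted = fit_lasso_logistic(toy_rows, toy_labels, penalty)
        print(penalty, fitted)
```

On this toy data the moderate penalty keeps the informative predictor and sets the uninformative one to exactly zero, while the larger penalty zeroes both. In the study the penalty was not fixed in advance but tuned by 10-fold cross-validation.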
Parameter estimates and cross-validated predictions were averaged, or bagged, across imputed data sets. The importance of each variable was assessed by ranking the absolute value of the average estimates as well as by estimating the percentage of imputed data sets in which a non-zero estimate was obtained. Average predictions and the true outcome status were then used to estimate the cross-validated ROC curves, their AUCs, and the corresponding 95% confidence intervals. A schematic diagram of the data collection process and data analysis is presented in Figure 2.
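The averaging step and the AUC calculation can be sketched as follows. This is an illustrative example under stated assumptions: the function names and the toy predictions are hypothetical, the AUC is computed with the rank-based (Mann-Whitney) definition, and confidence intervals, which the study obtained with the pROC library, are not reproduced here.

```python
def average_predictions(predictions_by_imputation):
    """Average each patient's predicted probability across imputed data sets."""
    n_sets = len(predictions_by_imputation)
    n_patients = len(predictions_by_imputation[0])
    return [sum(preds[i] for preds in predictions_by_imputation) / n_sets
            for i in range(n_patients)]


def auc(scores, outcomes):
    """Probability that a patient with the event scores higher than one without.

    Ties count as half a win. Outcomes are coded 1 (event) and 0 (no event).
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical cross-validated predictions for 4 patients from 3 imputed data sets.
toy_predictions = [
    [0.9, 0.2, 0.6, 0.3],
    [0.7, 0.4, 0.2, 0.1],
    [0.8, 0.3, 0.7, 0.2],
]
toy_outcomes = [1, 1, 0, 0]

if __name__ == "__main__":
    bagged = average_predictions(toy_predictions)
    print(auc(bagged, toy_outcomes))
```

Averaging before computing the AUC yields a single ROC curve per outcome rather than one curve per imputed data set.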
Cross-validation, rather than a single split of the dataset into training and test sets, was chosen for internal validation as this represents the current standard regardless of sample size. Cross-validation avoids the possibility that the resultant validation metric (AUC, RMSE, etc.) is biased simply by the choice of the test set.
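A minimal sketch of how 10-fold cross-validation partitions a cohort is given below. The function names and the random seed are assumptions for illustration; the study performed cross-validation in R, and the exact fold assignment used there is not stated in the text.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def train_test_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists; each sample is held out exactly once."""
    folds = k_fold_indices(n_samples, n_folds, seed)
    for held_out, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


if __name__ == "__main__":
    # 801 is the cohort size reported in this study.
    for train, test in train_test_splits(801):
        print(len(train), len(test))
```

Because every patient appears in exactly one held-out fold, each patient receives one out-of-sample prediction, and no single arbitrary test set determines the validation metric.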
The study cohort includes 315 patients (39%) whose chest radiographs were used to develop the CNN algorithm. [19] To quantify the possible effect of data leakage (or double dipping), the proposed analyses were performed on both the full cohort and the subgroup of patients whose chest radiographs were not used in developing the CNN algorithm.
All analyses were performed using R 4.0.2 (R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/) and the mice, glmnet, pROC, randomForest, and caret libraries. [22][23][24][25][26][27]

Figure 1. Chest radiographs with their corresponding severity scores from the CNN algorithm. The automated severity score is generated by a Siamese CNN that uses raw DICOM pixel data to calculate a numeric score for lung disease. An increasing score (left to right) corresponds to increasing parenchymal lung opacity and extent. The score is continuous; however, a severity scale proposed from radiologist interpretation is as follows: ≤2.5 no or minimal disease, >2.5 and ≤5.0 mild disease, >5.0 and ≤9.0 moderate disease, and >9.0 severe disease. [19][20][21]

RESULTS
Cohort description
Table 1 describes the diagnostic variables for the entire cohort of 801 patients (median age 59, interquartile range 46-73 years; 469 men). The median BMI for admitted patients was 28.9 (IQR, 25.1-33.6), which implies that many of these patients were classified as either overweight or obese. Cardiac disease, hypertension and diabetes were the most frequently encountered comorbidities, present in 62%, 59%, and 38% of patients, respectively. The admission chest radiograph was obtained on the day of hospitalization for 691 patients (86%) and within 1 day for 722 (90%) of the cohort. The median score for the admission chest radiograph was 4.21 (IQR 2.21-6.81), corresponding to mild disease. Fibrinogen, prothrombin time and D-dimer values were available within 2 days of admission for 43%, 68%, and 89% of the cohort, respectively. GFR, LDH, PLT, WBC and CRP values were available within 2 days of admission for 98%, 95%, 99%, and 95% of the cohort, respectively.

Intubation or death within 7 days
A total of 243 (30.3%) patients were either intubated or died by day 7. Of these, 36 patients (4.5%) died and 207 (25.8%) had been intubated (Table 2). Table 3 summarizes the discriminative performance of the prediction models. The cross-validated AUCs were approximately 0.80 for all evaluated models and estimation methods. Given the similarity of these results, only the details of the penalized logistic regression model with main effects only (no interactions) and continuous variables modeled as linear terms are summarized below.
The average standardized regression coefficients of this model are presented in Table 4. Chest radiograph severity score was positively associated with death or intubation within 7 days. Clinical variables that also demonstrated a positive association included the presence of cardiovascular disease, hypertension, and diabetes. Laboratory values that demonstrated a positive association included CRP, LDH, WBC count, and D-dimer, whereas eGFR, SpO2 (oxygen saturation) and platelets demonstrated a negative association, indicating a protective effect of a higher value. Figure 3A illustrates the distribution of average standardized regression coefficients for each variable and the percentage of times, across the imputed data sets, each variable had a non-zero estimate. Defining the importance of a predictor in terms of both the absolute value of its estimate and the percentage of times a non-zero estimate was seen, cardiac disease, chest radiograph severity score (CXR), CRP, SpO2, WBC, and eGFR were deemed important predictors of death or intubation within 7 days of hospital admission. The cross-validated ROC curve for this model is presented in Figure 3B and its associated AUC is 0.82 (95% CI, 0.79-0.86) (Table 3).

Death within 14 days
Sixty-five (8.1%) patients died by day 14 (Table 2). Chest radiograph severity score was positively associated with this outcome (Table 4). The only clinical variable that also demonstrated a positive association was age. eGFR and SpO2 values showed a negative association, thereby indicating a protective effect of a higher value.
Age, eGFR, CXR, and SpO2 values were all deemed important predictors of death within 14 days of hospital admission (Figure 4A). The cross-validated ROC curve for this model is presented in Figure 4B and its associated AUC is 0.82 (95% CI, 0.78-0.87) (Table 3).

Figure 2. Data collection and analysis. A numeric score of the severity of COVID-19 on each chest radiograph was obtained using the convolutional neural network (CNN) algorithm. Demographic, clinical, laboratory, and chest radiograph score data were combined to impute 100 datasets. Penalized logistic regression modeling was performed on each dataset to identify relevant predictors of each outcome. 10-fold cross-validation was performed to identify all model parameters and to estimate model predictions. Cross-validated predictions were averaged, or bagged, across datasets to estimate cross-validated ROC curves and AUC estimates.

Table 5 summarizes the cross-validated AUC estimates by estimation method (LASSO, random forests) and by model complexity for the subgroup of 486 patients not included in the CNN algorithm development. We treated the random forest results as a surrogate for the gold standard because this estimation approach does not require the explicit modeling of covariate effects or their interactions. For intubation or death within 7 days, the cross-validated AUCs for the full and subgroup cohorts were 0.81 (95% CI, 0.78-0.84) and 0.80 (95% CI, 0.76-0.85), respectively. Similarly, for death within 14 days, the cross-validated AUCs for the full and subgroup cohorts were 0.83 (95% CI, 0.79-0.87) and 0.83 (95% CI, 0.78-0.88), respectively. The similarity of these estimates suggests that data leakage was adequately accounted for by the analysis approach (i.e., via the use of cross-validation). Additional summaries for this subgroup are presented in the Supplementary Materials.

The percentage (%) of missing data for each variable collected is provided.
c In these patients, the admission chest radiograph demonstrated an endotracheal tube in situ, which was an exclusionary criterion.

DISCUSSION
Our study showed that in patients hospitalized for COVID-19, readily available data obtained early in admission (i.e., demographics, major co-morbidities, vital signs, laboratory values, and a severity score of the chest radiograph generated by a CNN-based algorithm) can predict the likelihood of decompensation to severe illness. Models of intubation or death within 7 days and of death within 14 days both showed a cross-validated AUC of 0.82. As stated above, 315 patients in our cohort overlapped with the cohort used to develop the CNN algorithm. To address concerns regarding data leakage, analyses were performed in the full cohort and in a subgroup excluding the overlapping subjects. In each model, including modeling of the subgroup not used to develop the CNN algorithm, the automated chest radiograph severity score was identified as an important predictor that was positively associated with patient outcome. Oxygen saturation and eGFR were also seen as important predictors for both outcomes.
Previous studies on the usefulness of a chest radiograph in prognosticating early progression of COVID-19 to critical illness have produced varying results. A UK-based study that examined the chest radiographs of over 1000 patients admitted to a tertiary academic hospital from mid-March to mid-April 2020 failed to demonstrate any significant or clinically meaningful association between chest radiograph findings and 30-day outcomes of death or hospital discharge. [28] Elsewhere, semi-quantitative scoring systems for chest radiographs have been used successfully to predict the likelihood of admission to hospital and/or death. [17] Typically this is performed by dividing the chest radiograph into separate zones and scoring each section. For young patients (aged 21-50 years), a higher chest radiograph score (>3) was positively associated with hospital admission. [17] In another study, a higher chest radiograph score was positively associated with death. [29] However, both studies involved a small number of radiologists in a single institution. [17][29] In contradistinction to reader-based scoring systems, AI-based scoring of chest radiographs offers the advantages of greater reproducibility and scalability. At the outset of the pandemic, AI-based systems were compared with reader-based chest radiograph scoring systems, and both were shown to be independent and comparable predictors of adverse outcomes in patients with COVID-19. [30] More recent studies have built on this initial experience to develop AI-based radiograph analysis that outperforms radiologist-derived scores and clinical variables. [31][32] A notable shortcoming of the study by Jiao and colleagues was the dichotomization of outcomes into critical and non-critical.
In contrast, our study aimed to assist in the stratification of all hospitalized patients with COVID-19 and in the allocation of critical care resources by identifying those most at risk for decompensation. When integrated with clinical or laboratory values, predictive models with modest discriminative performance for hospitalization outcomes including critical illness and death have been reported elsewhere, with AUCs of 0.66 and 0.59, respectively. [15] Our model adds to the growing evidence that AI scoring of chest radiographs is an important variable in models that assess COVID-19 severity early in the course of illness.
Models using reader-based scoring of chest radiographs, even when successful, cannot be easily generalized. Scaling such models for validation in larger cohorts and in other clinical settings would be challenging given the manual and subjective nature of the chest radiograph input. If validated, implementation into a clinical workflow would involve chest radiograph severity scores from numerous radiologists, introducing observer variability as a factor in the model's performance. In contrast, our models depend upon a chest radiograph severity score automatically derived from a CNN-based algorithm that analyzes the DICOM inputs of the radiographic dataset. Thus, the imaging output becomes objective, reproducible, and scalable and is now amenable to high-throughput dissemination, similar to laboratory testing. The use of a CNN to generate a numeric and continuous variable also provides an opportunity to dynamically model outcomes as the clinical profile of patients with severe COVID-19 evolves with viral mutations, public immunization, and novel therapies. [33]
With respect to other prediction models that have investigated the association between critical illness and clinical and/or laboratory values, our results are mostly congruent. Established risk factors for critical illness in larger cohorts from the UK and US suggest a similar positive association between demographic variables (age, BMI, and co-morbidities), laboratory variables (CRP, D-dimer), and clinical variables such as admission oxygen saturation. [34][35][36] The importance of eGFR in modeling outcomes has not been previously noted, but eGFR has not been included in many previous modeling studies. It may serve as a surrogate marker of cardiovascular disease, hypertension, or diabetes, and as a ...

Figure 3. (A.) ... imputed data sets are shown. Average standardized estimates (e.g., higher LDH, CRP or CXR score and lower SpO2, eGFR, platelets (PLT) values) were associated with an increased odds of being intubated or death within 7 days. LDH, cardiac disease, CXR, CRP, SpO2, WBC, and eGFR values were deemed important, as defined by both the absolute value of the standardized regression coefficient and the percentage of times the estimate was non-zero (reported on the X-axis below each variable). The asterisk corresponds to the average estimate, including those reduced to zero via the LASSO algorithm (Table 4). (B.) ROC curve of intubation or death within 7 days of hospital admission using bagged cross-validated predictions and the true outcome status (AUC: 0.82 [95% CI: 0.79-0.86], Table 3).

Table 5. Cross-fold validated AUC estimates (95% confidence intervals) by model parameterization (e.g., continuous variables modeled as linear or non-linear terms, and all two-way interactions or no interactions between predictors) and estimation method for the subgroup of patients not included in the CNN algorithm development. Columns: intubation or death within 7 days; death within 14 days.
Our study has several important limitations. Most important is the single-center design and absence of an external test set. Indeed, these concerns have been highlighted extensively in more recent reviews examining the available body of evidence for deep-learning-based assessment of COVID-19 chest radiographs, which note that single-center bias, differences in technical parameters, and the presence of chest radiograph artifacts may hamper the reliability of many available deep learning models. [37] Moreover, the data were collected from patients presenting early in the pandemic, when COVID-19 management and prognosis were quite different, at least as they have evolved within our practice setting. These considerations limit the generalizability of our prediction model. However, given the heterogeneity of the COVID-19 surges and of medical practice over the course of this pandemic, obtaining sufficiently complete and homogeneous datasets with outcome data to validate generalizability has not yet been feasible. [38] The CNN algorithm, however, has been validated previously in separate populations. [21] While clinical and demographic variables were retrieved in the majority of patients, these values were missing in some. We assumed the data were missing at random and accounted for variable missingness using multiple imputation. If this assumption is not satisfied, then additional data would need to be collected to properly handle the missing data. We did not gather patient symptoms at presentation as they were variably recorded. Thus, it is possible that patients without COVID-19-related symptoms hospitalized for other conditions could have been included in our cohort. However, our institution was undergoing a COVID-19 surge during the accrual period.
Patients requiring non-COVID related hospitalizations were being diverted, if possible, to other centers, non-emergent procedures requiring hospitalization had been canceled, and those testing positive for COVID-19 but without symptoms were being monitored as outpatients virtually. Thus, nearly all of the patients with COVID-19 positive tests hospitalized during this period were admitted for this diagnosis. Finally, other factors previously noted as important predictors, such as duration of symptoms prior to presentation and chest CT findings, were not included in our modeling as the data were not available on most of the cohort.
Our models identify COVID-19 patients at risk of progressing to intubation or death within the first two weeks of hospitalization. Predictors are clinical, laboratory, and radiographic data routinely obtained at the point of admission. The chest radiograph scoring of disease severity is an important predictor and, as it is automated and CNN-based, is numerical and readily scaled into a high throughput clinical workflow, similar to the other laboratory values. If validated, the model could be used to help inform resource allocation and clinical practice algorithms in settings where a surge in case burden strains hospital resources.

ACKNOWLEDGMENT
AOS is supported by a scholarship from the Faculty of Radiologists, Ireland and Massachusetts General Hospital.

CONFLICTS OF INTEREST:
JK reports grants from GE Healthcare, non-financial support from AWS, and grants from Genentech Foundation, outside the submitted work. BL reports royalties from Elsevier, Inc for prior work as an academic textbook associate editor, outside the submitted work. AS has no relevant disclosures but is the co-founder and board member of a digital health company, CareSignal Health, that specializes in deviceless remote patient monitoring.