Comparison of a Target Trial Emulation Framework vs Cox Regression to Estimate the Association of Corticosteroids With COVID-19 Mortality

Key Points Question How do modern methods for statistical inference compare with approaches common in the clinical literature when estimating the association of corticosteroids with mortality for patients with moderate to severe COVID-19? Findings In a cohort study using retrospective data for 3298 hospitalized patients with COVID-19, target trial emulation using a doubly robust estimation procedure successfully recovers a benchmark from a meta-analysis of randomized clinical trials . In contrast, analytic approaches common in the clinical research literature generally cannot recover the benchmark. Meaning These findings suggest that clinical research based on observational data can be used to estimate findings similar to those from randomized clinical trials; however, the correctness of these estimates requires designing and analyzing the data set on principles that are different from the current standard in clinical research.

This supplemental material has been provided by the authors to give readers additional information about their work.

eMethods. 1 Per-protocol effect versus intention-to-treat
Intention-to-treat, or the comparison of patients who were randomized to each treatment arm regardless of whether they actually received the assigned treatment, is the primary estimand of interest in many RCTs. However, it is rarely possible to emulate this contrast using observational data. 1 The closest possible analog of an intention-to-treat design for our corticosteroid question may have been analyzing the difference in patients who were prescribed corticosteroids, whether or not they were actually administered, if that data were available. 1 Instead of pursuing intention-to-treat, we design a hypothetical target trial primarily interested in the per-protocol effect, or the difference in patients who actually follow each of the regimes of interest. To analyze the closest possible analog of a per-protocol effect, we will account for baseline and post-baseline (time-dependent) confounders in the administration of corticosteroid treatment over time, since corticosteroid administration and adherence depends on changing physiological characteristics of patients. 1 See "Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available" by Hernan et al. for a more detailed description of the intentionto-treat versus per-protocol comparisons in target trials. 1

Study time frame
The time frame and location of this target trial emulation, March-May 2020 at a large hospital system in New York City, lends itself well to studying the effect of corticosteroids on mortality. The dates of the study are during the peak of NYC's initial pandemic surge conditions, meaning there were thousands of patients admitted within a short period of time, all infected with the same COVID-19 variant and treated with a relatively equal, albeit limited, level of provider knowledge on the disease. This time period was prior to vaccines and other treatments now known to be effective, such as monoclonal antibodies or antiviral treatments. A key aspect of this time period is that the effectiveness of corticosteroids was not yet established; in fact, there was some evidence early in the pandemic that corticosteroids could be harmful. The largest and most influential study, Randomized Evaluation of COVID-19 Therapy (RECOVERY), published preliminary results in June 2020 2 and significantly changed clinical practice afterwards. Our pre-RECOVERY time period resulted in high variability in provider practice regarding the timing and administration of steroids, as shown in eFigure 1. Although variability in provider practice can complicate simpler analyses of observational data, it is ideal for estimating causal effects in combination with an advanced study design because there is natural experimentation already occurring in the data.

Confounders
The full list of baseline confounders includes age, sex, race, ethnicity, body mass index (BMI), comorbidities (cerebral vascular event, hypertension, diabetes mellitus, cirrhosis, chronic obstructive pulmonary disease, active cancer, asthma, interstitial lung disease, chronic kidney disease, immunosuppression, human immunodeficiency virus (HIV)-infection, and home oxygen use), hospital admission location, and mode of respiratory support within 3 hours of hospital admission. Time-dependent confounders include vital signs (heart rate, pulse oximetry percentage, respiratory rate, temperature, systolic/diastolic blood pressure), laboratory results (blood urea nitrogen (BUN)creatinine ratio, creatinine, neutrophils, lymphocytes, platelets, bilirubin, blood glucose, D-dimers, C-reactive protein, activated partial thromboplastin time, prothrombin time, arterial partial pressures of oxygen and carbon dioxide), supplemental oxygen, and concurrent pharmaceutical treatments. Supplemental oxygen support was defined by no supplemental oxygen, non-invasive mechanical ventilation (IMV) supplemental oxygen, and IMV. The concurrent treatments included as confounders were vasopressors, diuretics, Angiotensin-converting enzyme (ACE) and Angiotensin receptor blockers (ARBs), hydroxychloroquine, and tocilizumab. Of note, none of these treatments have been established as effective therapies for COVID-19. In addition, physiological changes from these treatments should be captured in the plethora of vital signs and laboratory results we include as time-varying confounders. A sensitivity analysis without these experimental treatments as confounders is provided in the eResults, section 3.

Decision to bin data in 24-hour time intervals
This study required binning our data into time intervals, since we needed a standard time to recalibrate dosing corticosteroid dosing information across patients. For this research question, we chose to bin the data in 24-hour intervals. In cases of multiple laboratory results or vital signs in a 24-hour period, the clinically worst value was used. The 24-hour window was specified a priori by clinical guidance: we defined a corticosteroid exposure as a methylprednisolone-equivalent reaching >0.5 mg/kg body weight over a 24-hour period. Thus, patients could "qualify" as having a corticosteroids exposure up until the end of each 24-hour binning period. The 24-hour time period is how all of the RCTs used in the WHO meta-analysis also define their treatment regimes. The biological reason we (and also presumably the RCTs) used for choosing 24 hours is that it should adequately capture the mechanistic action timeline of corticosteroids. In a randomized controlled trial (RCT) looking at hydrocortisone for mechanically ventilated septic shock patients, the mean arterial pressure between hydrocortisone and placebo patients does not begin to noticeably diverge until study day 2 ( Figure S4b of Adrenal et al.). 3 This led us to the decision that 24-hour windows were sufficient to capture the physiological dynamics for corticosteroids. We can definitively say that the outcome of death occurred after the covariates and exposures were measured on a causal pathway. However, this binning strategy of covariates and exposures could theoretically lead to measured time-varying covariates (laboratory tests or vital signs) occurring concurrently or posteriorly to corticosteroid administration. The impact on this analysis of the data should be minimal, especially due to the relatively slower biological time of action time of corticosteroids.

Informative measurement
The measurement, or lack thereof, of covariates in this analysis is often highly prognostic. The analytical strategy for handling missing variables was to fully account for the measurement process by including an indicator of whether the variable was measured. Missing baseline categorical variables were handled using an indicator for informative measurement (e.g., missing race was its own category). Missing time-varying continuous variables were also handled with an indicator for informative measurement. In order to make the software work for estimating our dynamic regimes, we then substituted unmeasured continuous variables with an arbitrary value (median). Of note, the same results could be obtained with any arbitrary value, such as 9999, after adding the indicator of the measurement process. Further details on the informative measurement process are shown in eTable 5. A sensitivity analysis in which Multiple Imputation Chained Equations 4 (MICE) was used sequentially at every time point to substitute missing values (after including the informative measurement variable) was also performed. These results are in eResults, section 4.

Estimand
The g-computation formula can be conceptualized as a sequential regression of the outcome as follows. 5 First, a model for the probability of mortality a day 28 is estimated amongst patients who have not been lost to follow-up, with predictors given by all the data prior to day 28. This model fit is then used to produce predictions of the probability of mortality for all patients had the hypothetical intervention of interest been implemented at day 28. This probability is then used as a pseudo-outcome in a regression model where the predictors are all variables measured up to day 27. This model is then used to obtain new predictions for all patients had the hypothetical intervention of interest been implemented at day 27. This process is iterated until day 1, at which point the predictions are averaged. This average is the estimated mortality rate under the hypothetical treatment rule. The G-computation formula is one possible way to identify our estimand of interest. A key assumption when estimating the g-computation formula is that treatment and loss-to-follow-up are independent of the counterfactual outcome at every time point conditional on measured confounders. Other identification strategies which do not require unconfoundedness, but which we did not use in this application, include double negative controls, identification of bounds, sensitivity analyses, or front-door criterion. 6-9

Transportability
The comparability of our hypothetical target trial to the RCTs in the WHO working group's corticosteroid metaanalysis as a benchmark is not exact. 10 This is partially because we did not have the sample size to look at individual corticosteroid types, such as dexamethasone or hydrocortisone. It is unknown how the heterogeneity in treatment regimes between the trials and our cohort may affect our use of the WHO's meta-analysis as a benchmark. We summarize the treatment regimes for the seven RCT's included in the WHO meta-analysis in eTable 1.
It is also unclear how differences in patient eligibility and participation characteristics may affect the use of the meta-analysis as a benchmark. We show characteristics of RECOVERY's study design and patient characteristics (eTable 2 and 3) as a reference to demonstrate how the RCTs compare to our own research. Although our hypothetical intervention did generally resemble the RCTs in overall study design, we looked at different contrasts of interest (see Section 1 -Intention to treat versus per-protocol effect). There are also some differences in study population, notably race and some comorbidities. Differences such as study location (e.g. RECOVERY took place in the United Kingdom) may also play a role in comparability of our results from a study cohort in New York City, either from RCTs to the general population of COVID-19 patients, or between our study and the RCTs. Our hypothetical target trial protocol also deviates, as discussed in Section 1 of the eMethods (Intention-to-treat versus per-protocol estimand), from the RCTs because it does not allow for physicians to discontinue corticosteroids before the assigned six days, or for patients in the "standard of care" arm to receive corticosteroids as part of standard of care. We are measuring the effect of patients receiving and continue to receive their assigned treatment. This may also limit comparability, since the RCTs were often interested in intention-to-treat as the primary outcome. However, patients in the RCTs who were given corticosteroids when assigned to the treatment arm or not given corticosteroids when assigned should bias results towards the null. These comparability issues, which are discussed under the names of transportability, generalizability, and internal/external validity in the causal inference literature, are open problems and require analytical tools outside the scope of this paper. 11

Data analysis plan
The Sequentially Double Robust (SDR) estimator uses the probability of receiving the intervention, the probability of experiencing the outcome, and the probability of censoring at each time point to estimate the final survival probability under a given intervention. For longitudinal studies, the SDR estimator estimates these probabilities iteratively at discrete time points. At each time point, only the intervention or outcome model need be correctly specified for the final estimate to be consistent. 12 We fit each model using an ensemble machine learning via the super learner algorithm. 13,14 The super learner algorithm creates a weighted combination of algorithms in a usersupplied library to obtain a final prediction that is guaranteed to outperform each algorithm in the library as sample size grows. 14 We use a super learner library containing candidate learners of multivariate adaptive regression splines (MARS) with varying parameters, Bayesian Additive Regression Trees (BART), penalized linear regression with penalties of 0, 0.5, and 1 (Ridge, Elastic Net, and LASSO), and an intercept only (mean) model. To fit these regressions, we assume that outcomes at day t only depend on baseline confounders and time-varying covariates measured in the previous two days. We employ 10-fold cross-validation on the super learner estimation and 10-fold cross-fitting on the SDR estimator. Hypothesis tests for both the target trial emulation and model-first approaches are two-sided with a Type I error rate.
9 Analytical file eFigure 2 shows an illustrated example of the analytical file. Time-dependent confounders, censoring indicators, and outcome indicators are binned in 24-hour windows from time of randomization. The final analytical file contains one row per patient with one column for each baseline confounder and one column for each time-dependent confounder, outcome, and censoring indicator at all 28 time points. Further details on the analytical file creation are publicly available in a working Github tutorial: https://github.com/kathoffman/steroids-trial-emulation. A template for creating the illustrative figures from this report is also included on the Github repository.

Participants
There were 3,737 unique adult COVID-19 patients admitted during the study time frame. In cases of multiple patient-encounters, the first visit was used. 165 patients were excluded due to transfer in from an outside hospital. An additional 274 were excluded due to chronic corticosteroid use, leaving a final cohort of 3,298.

Informative measurement
Of the baseline confounders, Body Mass Index (BMI) was missing for 5.8% patients, race was missing for 218 (6.6%), and ethnicity was missing for 596 (18%). All other baseline variables were complete. For the time-varying covariates, vital signs, co-treatment, and supplemental oxygen data were complete. For routine laboratory results such as platelets, nearly all patients had complete data (e.g. 90% completeness for day 0 platelets). Measurement of other COVID-specific laboratory results, e.g. D-dimers, was less common and highly prognostic as discussed in eMethods section 5, Informative measurement. A table showing the percentage of the at-risk population with informative measurement for the time-varying covariates is provided in eTable 5.

Sensitivity Analysis 1: Removing co-treatments as confounders
A sensitivity analysis removing co-treatments as confounders yielded the result that under the no corticosteroids regime, 28-day mortality is estimated to be 32.2% (30. 8-33.5). Under the corticosteroids regime, 28-day mortality is estimated to be 25.9% (24.7-27.0%). This yields an estimated decrease in 28-day mortality of 6.3% (5.4-7.2).

Sensitivity Analysis 2: MICE
A sensitivity analysis using MICE to sequentially substitute missing values at every time point after adding a variable for informative missingness yielded the result that under the no corticosteroids regime, 28-day mortality is estimated to be 32.9% (31.7-34.2). Under the corticosteroids regime, 28-day mortality is estimated to be 26.4% (25. 4-27.5). This yields an estimated decrease in 28-day mortality of 6.5% (5.7-7.3

eFigure 2. Illustration of Analytical Data Structure
The analytical file contained one row per patient and one column per baseline confounder. For every time point, there was one column per time-dependent confounder, a treatment status indicator, censoring indicator, and outcome indicator. The data frame was prepared in R and used in combination with the open-source R package lmtp.