Controlling Confounding in a Study of Oral Anticoagulants: Comparing Disease Risk Scores Developed Using Different Follow-Up Approaches

Purpose: Little is known about how disease risk score (DRS) development should proceed under different pharmacoepidemiologic follow-up strategies. In an analysis of dabigatran vs. warfarin and risk of major bleeding, we compared the results of DRS adjustment when models were developed under “intention-to-treat” (ITT) and “as-treated” (AT) approaches. Methods: We assessed DRS model discrimination, calibration, and ability to induce prognostic balance via the “dry run analysis”. AT treatment effects stratified on each DRS were compared with each other and with a propensity score (PS) stratified reference estimate. Bootstrap resampling of the historical cohort at 10 percent–90 percent sample size was performed to assess the impact of sample size on DRS estimation. Results: Historically-derived DRS models fit under AT showed greater decrements in discrimination and calibration than those fit under ITT when applied to the concurrent study population. Prognostic balance was approximately equal across DRS models (–6 percent to –7 percent “pseudo-bias” on the hazard ratio scale). Hazard ratios were between 0.76 and 0.78 with all methods of DRS adjustment, while the PS stratified hazard ratio was 0.83. In resampling, AT DRS models showed more overfitting and worse prognostic balance, and led to hazard ratios further from the reference estimate than did ITT DRSs, across sample sizes. Conclusions: In a study of anticoagulant safety, DRSs developed under an AT principle showed signs of overfitting and reduced confounding control. More research is needed to determine if development of DRSs under ITT is a viable solution to overfitting in other settings.


Introduction
Disease risk scores (DRSs) have a long history in the confounding control literature [1] and are increasingly being used for confounding adjustment in pharmacoepidemiological studies [2][3][4][5][6][7][8][9][10]. Unlike the more commonly-used propensity score (PS), which models the probability of treatment conditional on confounders, the DRS, or prognostic score, models the expected outcome under the comparator treatment conditional on the confounders. When PS modeling is difficult, such as in studies with small numbers in one treatment group, DRS modeling can provide an alternative dimension-reduction method when the number of confounders is large [3][4].
Hansen developed the theoretical framework for DRSs, showing that their ability to reduce or remove confounding is based on the ability to induce "prognostic balance", an independence between the outcome that would be observed under the comparator treatment and the confounders, conditional on the DRS [11]. In contrast, conditioning on the PS induces "propensity balance", an independence between the study treatment and confounders. Hansen compares prognostic and propensity balance to the type of balance that is obtained in laboratory research and clinical trials, respectively. The prognostic balance that results from conditioning on the DRS mimics the tight control of experimental conditions, while the propensity balance that results from conditioning on the PS mimics randomization [2].
Considerable thought has been given to methods for DRS development, including questions of the appropriate time period (i.e., use of historical or concurrent data) and treatment group (i.e., use of the total population or of the comparator population only) for estimation [7] and how to select variables to enter the model [12]. However, little attention has been given to the question of how DRS development should proceed under different models of follow-up for the study outcome.
Commonly, pharmacoepidemiological studies employ one of two follow-up strategies: "intention-to-treat" (ITT) or "as-treated" (AT). The ITT strategy follows patients from therapeutic initiation to the occurrence of an outcome event or loss to follow-up, treating treatment status as fixed at baseline and ignoring post-initiation adherence. In contrast, the AT strategy follows patients until outcome event occurrence, loss to follow-up, or discontinuation of the initial treatment, whichever comes first. This terminology differs from that used in the clinical trial literature, where the approach we describe as AT would often be described as "per-protocol" [13]. However, we retain the ITT/AT nomenclature for consistency with previous pharmacoepidemiological studies.
The choice of ITT or AT follow-up for the study outcome is generally dictated by the research question at hand, but it should be noted that these strategies yield treatment effect estimates representing different causal contrasts, which are subject to different sources of bias [13]. The choice of which approach to use when developing a DRS is less studied, but investigators will likely mirror the approach used for the main outcome analysis. However, should the investigator decide to estimate a so-called "as-treated effect", following patients for DRS development under the same AT strategy reduces the number of observed outcome events, and the practical issues that this reduction introduces are currently not well understood. In this paper, we explore the issue of overfitting in DRS models fit under AT follow-up, propose a potential practical solution, and assess its performance in the setting of comparative safety of oral anticoagulants.
Under AT, both the total accrued person-time and number of outcome events observed will be lower than in the ITT approach, and the extent of this decrease will depend on the extent and timing of treatment discontinuation. This in turn has implications for the development of DRS models, since model dimensionality, or correspondingly model goodness of fit, will be affected by the number of events available [14]. Therefore, if the desired final analysis is to be performed under AT, it is possible that one may be unable to obtain robust estimates of DRS model coefficients using an AT approach due to low event counts. This is likely problematic in pharmacoepidemiological studies, which generally rely on proxy adjustment for a large number of confounders defined in secondary data sources [15] and may deal in infrequent clinical outcomes.
One potential solution to this problem is to develop the DRS under an ITT strategy, allowing events occurring after the discontinuation of treatment to be included. This may solve the issue of event scarcity, but at a cost of misspecification of the true target of estimation: the patient-specific expected outcome that would occur while taking the comparator treatment, conditional on the confounders. While an ITT DRS may not represent the correct DRS with which to adjust an AT treatment effect, such a DRS may facilitate better confounding control than an overfit AT DRS, or a robust AT DRS that omits relevant confounders. That is, a good estimate of the risk among initiators of the comparator treatment may be better than a bad estimate of the risk among current users of the comparator treatment.
In order to inform DRS practice in a comparative safety and effectiveness setting, we sought to compare the ability of DRS models developed under ITT and AT approaches to control confounding in a comparison of dabigatran vs. warfarin for the risk of major bleeding events. More specifically, we sought to: (a) describe how different follow-up strategies impact the discrimination and calibration of DRS models, (b) assess DRS models for prognostic balance after fitting under various follow-up strategies, (c) determine the impact of follow-up strategy choice on DRS-based confounding control, and (d) use resampling to explore the impact of sample size on DRS-based confounding control. Finally, in light of these results, we describe the assumptions inherent in choosing different follow-up strategies for DRS development.

Study population
We identified all new users of dabigatran and warfarin in the Truven Marketscan Commercial Database between October 2010 (the month of dabigatran's approval in the US) and December 2013. At the time of initiation, eligible patients were required to: (a) have been continuously enrolled in their health plan for at least 365 days, (b) be 18 years of age or older, (c) have documented evidence of atrial fibrillation, (d) have a CHA2DS2-VASc score of at least 1, and (e) have no recorded dispensing of an oral anticoagulant in the preceding 365 days. Patients were excluded if they: (a) were missing information on age and sex, (b) initiated both warfarin and dabigatran on the same date, (c) had a recorded nursing home stay within the 365 days before initiation, or (d) had documented evidence of valvular disease. We refer to this cohort as the "concurrent cohort". We additionally enumerated a cohort of individuals initiating warfarin between January 2009 and September 2010 for disease risk score estimation, which we refer to as the "historical cohort". All inclusion/exclusion criteria for this group were the same as those for the concurrent cohort. Follow-up for both cohorts ended in December of 2013.
In both the concurrent and historical cohorts, we used claims from the 365 days preceding initiation to define 69 confounders measuring various aspects of demographics, comorbidity, concomitant and historical medication use, and health care utilization. Clinical risk scores including the HAS-BLED [16], CHA2DS2-VASc [17], and CHADS2 [18] were likewise defined with limited adaptations for use with claims data and included as covariates. Counting multi-level categorical variables (and the categorization of continuous measures such as days hospitalized), these covariates accounted for 91 total model terms.

Cohort follow-up
The outcome of interest was the occurrence of any major bleeding event, excluding hemorrhagic stroke. In both the concurrent and historical cohorts, we followed patients under two broad strategies. Under ITT, patients were followed from the day after treatment initiation until the occurrence of a major bleeding event, study termination (December 31, 2013), or end of enrollment, whichever came first. Under AT, patients were additionally censored upon discontinuation of the index treatment (either due to switching treatments or stopping). We also followed patients under three time-limited ITT approaches, censoring patients at 180, 365, and 730 days of follow-up respectively. Figure 1 outlines the general follow-up strategy.
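The censoring rules above can be sketched in code. The following is a minimal illustration, not the study's actual implementation: the `Patient` fields, the function name `follow_up`, and the example patient are hypothetical, and times are assumed to be measured in days from the start of follow-up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Patient:
    # Hypothetical record layout; all times are days since start of follow-up.
    event_day: Optional[float]            # first major bleeding event, None if none recorded
    end_of_data_day: float                # earlier of study termination and end of enrollment
    discontinuation_day: Optional[float]  # stop or switch of index treatment, None if never


def follow_up(p: Patient, strategy: str = "ITT",
              max_days: Optional[float] = None) -> Tuple[float, bool]:
    """Return (follow-up time, event observed) under the chosen strategy."""
    censor = p.end_of_data_day
    # AT additionally censors at discontinuation of the index treatment.
    if strategy == "AT" and p.discontinuation_day is not None:
        censor = min(censor, p.discontinuation_day)
    # Time-limited ITT caps follow-up at 180, 365, or 730 days.
    if max_days is not None:
        censor = min(censor, max_days)
    if p.event_day is not None and p.event_day <= censor:
        return p.event_day, True
    return censor, False


# A patient who stops warfarin on day 200 and bleeds on day 400.
example = Patient(event_day=400, end_of_data_day=900, discontinuation_day=200)
itt_result = follow_up(example, "ITT")
at_result = follow_up(example, "AT")
itt_1y_result = follow_up(example, "ITT", max_days=365)
```

In this example the bleeding event counts toward the full ITT outcome but not toward the AT or 1-year ITT outcomes, which is the mechanism by which AT follow-up reduces the number of events available for model fitting.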

Disease risk score development
All DRSs were estimated within the historical cohort of warfarin initiators. While estimation of DRSs within the full concurrent cohort [3], or among only warfarin initiators in the concurrent cohort, has been proposed, historical estimation has several advantages, including the reduced potential for overfitting [2,7] and for bias amplification [19]. Due to patients' variable lengths of follow-up, we chose to fit Cox proportional hazards models to estimate DRSs, including terms for the 91 main effects of the covariates. Five separate models were fit, following patients for occurrence of the major bleeding outcome under each of the five follow-up strategies: AT, 6-month ITT, 1-year ITT, 2-year ITT, and full ITT. Using the coefficients from these five models, we defined DRSs on the linear predictor scale for patients in both the concurrent and historical cohorts.
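To make "DRS on the linear predictor scale" concrete, the sketch below applies a set of Cox model coefficients to one patient's covariates. The coefficient values, term names, and function name are hypothetical and chosen only for illustration; fitting the Cox models themselves is not shown.

```python
from typing import Mapping


def drs_linear_predictor(coefs: Mapping[str, float],
                         covariates: Mapping[str, float]) -> float:
    """Sum of coefficient * covariate value; terms absent for a patient count as 0."""
    return sum(beta * covariates.get(term, 0.0) for term, beta in coefs.items())


# Hypothetical coefficients from two of the five historical models.
models = {
    "AT": {"age_75_plus": 0.40, "prior_gi_bleed": 0.90},
    "ITT": {"age_75_plus": 0.35, "prior_gi_bleed": 0.70},
}

# One patient from the concurrent cohort (hypothetical covariate values).
patient = {"age_75_plus": 1.0, "prior_gi_bleed": 1.0}

# One DRS per follow-up strategy for the same patient.
scores = {name: drs_linear_predictor(c, patient) for name, c in models.items()}
```

Because only the ranking of patients matters for stratification, the baseline hazard does not need to be estimated to define the score on this scale.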

Disease risk score assessment
The discriminatory ability of the DRS models was assessed by the concordance index described by Harrell [20]. In survival analysis, the "C-index" is, informally, the proportion of all pairs of subjects in which the subject with the higher predicted survival actually survived longer, excluding pairs in which both subjects are censored or one is censored before the other fails. Model calibration was assessed by calibration plots constructed with the modified Greenwood-Nam-D'Agostino (G-N-D) method, comparing unadjusted Kaplan-Meier risk estimates to the DRS model average predicted risk at 180 days within deciles of the DRS linear predictor [21]. For all models, discrimination and calibration were assessed in both the historical and concurrent cohorts with the AT outcome serving as the reference.
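The informal definition of the C-index can be written out directly. The following is a minimal sketch, assuming a higher DRS means higher risk (shorter expected survival), counting tied scores as half-concordant, and skipping pairs with tied follow-up times for simplicity; production implementations handle ties more carefully and run much faster than this quadratic loop.

```python
from typing import Sequence


def harrell_c(times: Sequence[float], events: Sequence[bool],
              risk: Sequence[float]) -> float:
    """Concordance index over usable pairs of subjects."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is usable only if the shorter time ends in an event
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier failure had the higher score
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Toy data: the third subject is censored at day 3.
toy_times = [1, 2, 3, 4]
toy_events = [True, True, False, True]
c_perfect = harrell_c(toy_times, toy_events, [4, 3, 2, 1])
c_reversed = harrell_c(toy_times, toy_events, [1, 2, 3, 4])
```

Model "optimism" as used in this paper is then the difference between the C-index in the historical development cohort and the C-index among concurrent warfarin initiators.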
We also utilized the "dry run analysis" framework proposed by Hansen to assess the ability of a DRS model to induce prognostic balance [11,22]. The dry run procedure involves sampling the untreated (i.e., concurrent warfarin) population based on their PS to form "pseudo-treated" and "pseudo-untreated" groups of untreated individuals with differences in covariate distributions similar to those of the actual treated and untreated groups. An effect of "pseudo-treatment" can then be estimated comparing these groups, conditional on the DRS. If the DRS induces prognostic balance and the propensity score is correctly specified then this treatment effect must be null since both groups are actually untreated. Hansen therefore argues for using the divergence of this estimate from its null value, which he terms the "pseudo-bias", as a diagnostic for DRSs that can be performed without reference to the planned, full cohort treatment effect estimate. More information on the dry run approach can be found in Appendix 1.

Treatment effect estimation
In order to compare rates of major bleeding between concurrent dabigatran and warfarin users, we estimated AT hazard ratios using Cox proportional hazards models, with stratification on deciles of the five DRSs. No DRS trimming was performed. Throughout, we used as a reference estimate the AT hazard ratio comparing dabigatran to warfarin initiators, stratified on deciles of the PS. The same set of covariates was included in the PS and DRS models, with the exception of two variables (history of acute renal disease and number of prothrombin tests ordered) omitted from the PS model for extreme negative associations with dabigatran initiation.
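Stratification on deciles requires assigning each patient to one of ten strata defined by the score distribution. The sketch below shows one way to do this; the choice of population used to define the cut points and the quantile interpolation method are assumptions, since the text does not specify them, and the stratified Cox model itself is not shown.

```python
import bisect
import statistics
from typing import List, Sequence


def decile_strata(scores: Sequence[float]) -> List[int]:
    """Assign each score to a stratum 0..9 using decile cut points of the same scores."""
    cuts = statistics.quantiles(scores, n=10)  # nine cut points
    return [bisect.bisect_right(cuts, s) for s in scores]


# Hypothetical scores for 100 patients.
example_scores = list(range(1, 101))
strata = decile_strata(example_scores)
stratum_sizes = [strata.count(k) for k in range(10)]
```

The hazard ratio is then estimated within strata, so that treated and comparator patients are compared only with others of similar predicted risk.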

Resampling study
To assess the impact of sample size on DRS development, we resampled the historical cohort 1,000 times at each of the following sample sizes: 10 percent, 25 percent, 50 percent, 75 percent, and 90 percent of its full sample size. All sampling was done with replacement. In each resampled historical cohort, we fit the five aforementioned DRS models in the manner described above, using the same covariates and methodology. Coefficients from these models were used to define DRSs on the linear predictor scale in both the given resampled historical cohort and in the concurrent cohort (which was not resampled). For each DRS model in each resampled cohort, we recorded: (a) the number of major bleeding events in the resampled historical cohort, (b) the model C-Index as assessed in the resampled historical cohort, (c) the model C-Index for each treatment group as assessed in the concurrent cohort, and (d) the estimated AT hazard ratio stratified on deciles of the given DRS in the concurrent cohort. We also estimated a dry run "pseudo-bias" for each DRS in each resampled cohort using the dry run approach described in Appendix 1. In this case, only a single dry-run "pseudo-bias" was estimated per resampled historical cohort, since performing a large number of dry run resamples for every resampled DRS model would be computationally prohibitive.
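The resampling design can be sketched as a nested loop over sample sizes and replicates. This is a minimal illustration using the standard library; the function names, the seed, and the placeholder cohort are assumptions, and the model fitting and evaluation performed on each resample are omitted.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def resample_cohort(cohort: Sequence, fraction: float,
                    rng: random.Random) -> List:
    """Draw round(fraction * N) patients from the cohort with replacement."""
    size = round(fraction * len(cohort))
    return rng.choices(cohort, k=size)


def resampling_plan(cohort: Sequence, fractions: Sequence[float],
                    n_reps: int, seed: int = 0
                    ) -> Iterator[Tuple[float, int, List]]:
    """Yield (fraction, replicate index, resampled cohort) for every combination."""
    rng = random.Random(seed)
    for fraction in fractions:
        for rep in range(n_reps):
            yield fraction, rep, resample_cohort(cohort, fraction, rng)


# Placeholder cohort of patient identifiers (hypothetical).
historical = list(range(1000))
one_sample = resample_cohort(historical, 0.25, random.Random(1))
```

In the study's design, `fractions` would hold the sample sizes listed above and `n_reps` would be 1,000, with the five DRS models refit in each resampled cohort and then applied to the unresampled concurrent cohort.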

Patient characteristics and event rates
There were 79,265 patients followed for a total of 66,109 person-years in the concurrent cohort, of whom 22,809 (28 percent) initiated dabigatran. Among historical initiators of warfarin, there were 3,936 major bleeding events (48 events per 1,000 person-years) observed, which was reduced to 1,059 (48 events per 1,000 person-years) after AT restrictions were applied (Table 1). The incidence of major bleeding events was consistently higher among concurrent initiators of warfarin than among concurrent initiators of dabigatran or historical initiators of warfarin. Table 2 shows the distribution of selected covariates among dabigatran and warfarin initiators in the concurrent and historical cohorts. Dabigatran initiators tended to be younger than warfarin initiators, with lower HAS-BLED and CHA2DS2-VASc scores, less renal dysfunction, less coronary artery disease, and less upper GI bleeding. Warfarin initiators in the concurrent period tended to be older and less healthy than those in the historical period.
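These event counts can be related to model size through the events-per-variable ratio. The short calculation below assumes that the 3,936 events correspond to full ITT follow-up and that all 91 model terms are counted as variables; both are readings of the text rather than figures reported by the authors.

```python
def events_per_variable(n_events: int, n_terms: int) -> float:
    """Number of outcome events available per model term."""
    return n_events / n_terms


N_TERMS = 91  # total model terms in each DRS model

epv_itt = events_per_variable(3936, N_TERMS)  # roughly 43 events per term
epv_at = events_per_variable(1059, N_TERMS)   # roughly 12 events per term
```

Under these assumptions, AT follow-up leaves roughly a quarter of the events per model term that are available under ITT.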

Discrimination
All DRS models showed moderate discriminatory ability among the historical warfarin initiators that comprised the DRS development cohort (C-index 0.653-0.678, Table 3). When the DRS was applied in the concurrent cohort, C-indices were consistently higher among dabigatran initiators than among warfarin initiators. The DRS model "optimism" (the difference between the discriminatory ability measured in the historical warfarin initiators and the concurrent warfarin initiators) was lowest for the model developed under ITT and highest for the model developed under AT.

Calibration
Among historical warfarin initiators, DRS model-predicted 180-day risk was close to observed risk in the models developed under all strategies, indicating acceptable calibration with nominally insignificant G-N-D tests throughout (Figure 2). When assessed in the concurrent cohort, there was evidence of overestimation of risk, particularly in the upper two deciles of the DRS, across all models, and all G-N-D tests were highly significant at the 0.001 level. This overestimation was more severe among concurrent warfarin initiators than among concurrent dabigatran initiators, and was greater for the model developed under AT than the model developed under ITT.

Pseudo-bias
The "pseudo-bias" from the dry run analysis was between -6 percent and -7 percent for all DRS models, with confidence intervals overlapping the null value (0 percent), indicating that some downward bias is likely to remain when conditioning on any of the five DRSs.
Table 3 notes. Abbreviations: DRS, disease risk score; AT, as-treated; ITT, intention-to-treat; HR, hazard ratio; CI, confidence interval. a) DRS model optimism is the difference between the C-Index for the model in the historical cohort and the C-Index for the model among warfarin initiators in the concurrent cohort. All C-indices were estimated within the AT follow-up experience. b) The dry run "pseudo-bias" is a measure of the ability of a DRS to induce prognostic balance. These are presented on the scale of percent bias in the hazard ratio. Values closer to 0% indicate better prognostic balance.

Resampling smaller historical cohorts
Across DRS follow-up approaches in the resampling study, DRS model optimism decreased predictably with increasing events-per-variable (Table 4), and was greater for models developed under AT than those developed under ITT. Dry run "pseudo-biases" ranged between -19 percent and -8 percent and decreased with increasing sample size. "Pseudo-bias" values were also consistently closer to the null value of 0 percent bias for the ITT DRS than for the AT DRS, with time-limited ITT DRS values falling in between. Likewise, across historical cohort sample sizes, stratification on the ITT DRS led to more adjustment away from the crude estimate and toward the reference estimate than did stratification on the AT DRS (Figure 3). For the time-limited ITT DRS approaches, the amount of adjustment toward the reference estimate was between that of the AT DRS and the full ITT DRS, with longer follow-up leading to more adjustment.

Discussion
As-treated follow-up, which restricts observation to person-time accrued by subjects while continuously receiving their initial treatment, is a common strategy in comparative effectiveness and safety research of medical products. However, such limited follow-up may be problematic for the development of robust DRS models, which rely on large numbers of outcome events to accommodate many covariates. Developing DRS models under an ITT approach allows investigators to take advantage of more study outcomes in the historical cohort, potentially permitting more confounders to be adjusted for and reducing overfitting. The comparison of DRS models fit using a variety of follow-up strategies based on the dry-run analysis may be a viable approach to identify the model likely to control the most confounding, though this requires the estimation of a valid PS. In an example comparative safety study of two oral anticoagulants, we found that issues of DRS model overfitting were more acute when the DRS was developed under AT than when it was developed under ITT, but that these differences did not necessarily lead to appreciably worse confounding control in the analysis using the full historical cohort. However, results from our resampling study suggest that substituting DRSs developed under various ITT strategies for those developed under AT may be advantageous when the number of outcomes in the development cohort is small, since ITT DRSs generally had less "pseudo-bias" than AT DRSs in dry run analyses, and since ITT DRS-adjusted estimates were consistently closer to the reference estimate than AT-adjusted estimates. However, these differences were small, and additional investigation should consider whether these results hold in alternate contexts.
Table 4 notes. a) The size of the resampled historical cohorts is given as a percentage of the full historical cohort sample size (N = 39,209 warfarin initiators). b) Number of events refers to the number of events corresponding to the follow-up strategy used for disease risk score development (column 1). This has been averaged across the 1,000 resampled cohorts of each size. c) C-index among the AT experience of warfarin initiators in the resampled historical cohorts. This has been averaged across the 1,000 resampled cohorts of each size. d) C-index among the AT experience of warfarin initiators in the concurrent cohort. This has been averaged across the 1,000 DRS models fit in the resampled cohorts of each size. e) C-index among the AT experience of dabigatran initiators in the concurrent cohort. This has been averaged across the 1,000 DRS models fit in the resampled cohorts of each size. f) C-index among the AT experience of all initiators in the concurrent cohort. This has been averaged across the 1,000 DRS models fit in the resampled cohorts of each size. g) Pseudo-bias estimated with the dry-run analysis is a measure of prognostic balance. Here it is presented in terms of the percent bias in the hazard ratio. This has been averaged across the 1,000 DRS models fit in the resampled historical cohorts of each size. Confidence intervals are 95% empirical intervals based on the 2.5th and 97.5th percentiles.
Interestingly, we found that DRS models fit among historical warfarin initiators showed better discrimination and calibration among concurrent dabigatran initiators than among concurrent warfarin initiators. This may indicate channeling of sicker patients to warfarin once dabigatran became available, leading to a higher prevalence of unmeasured risk factors or contraindications, such as frailty, among concurrent warfarin initiators and thus better fit among dabigatran initiators. The consequences of such channeling on historical development of DRSs warrant further attention.
Throughout our empirical assessment, all methods of DRS adjustment were biased toward the crude by between 6 percent and 9 percent compared to the reference PS-stratified estimate. While there are several reasons PS- and DRS-adjusted treatment effect estimates might differ, the consistent direction of bias observed here suggests that DRS methods were subject to more residual confounding than PS methods. This is consistent with previous studies of DRS development in the anticoagulant setting [23]. More empirical research is needed to determine the comparability of PS and DRS-based confounding control in settings where both are possible.
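The 6 to 9 percent range can be checked approximately against the rounded hazard ratios reported in the abstract (0.76 to 0.78 for DRS stratification and 0.83 for PS stratification). The calculation below assumes the difference is expressed relative to the reference estimate; because the inputs are rounded, the result is only indicative.

```python
def percent_difference(hr: float, reference_hr: float) -> float:
    """Percent difference of an adjusted hazard ratio from the reference estimate."""
    return 100.0 * (hr - reference_hr) / reference_hr


REFERENCE_HR = 0.83  # PS-stratified AT hazard ratio from the abstract

smallest_gap = percent_difference(0.78, REFERENCE_HR)  # about -6 percent
largest_gap = percent_difference(0.76, REFERENCE_HR)   # about -8 percent
```

The negative sign reflects DRS-adjusted hazard ratios that lie below the PS-stratified estimate, that is, on the side of the crude estimate.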
This study may be limited by reliance on results from a single empirical assessment. It is possible that the use of ITT DRSs to adjust AT effect estimates may lead to substantially less confounding control than was observed here in settings with different mechanisms of treatment discontinuation. Likewise, it should be noted that the reference PS-stratified treatment effect is not known to be true, and thus may not provide the ideal benchmark against which to compare DRS-adjusted estimates. However, previous clinical trials [24] and observational studies [25][26][27][28] indicate similar effects of dabigatran vs. warfarin on the risk of major bleeding events.
We advise serious consideration of the assumptions inherent in DRS adjustment before choosing the method of DRS development. In choosing to control confounding of an AT contrast with a DRS developed under ITT, we must assume that the ITT DRS ranks patients in roughly the same order as a correctly specified, robust AT DRS would. This assumption may not be tenable if some covariates have substantially different effects on the outcome depending on whether patients are adherent or non-adherent to the comparator treatment. For example, if diabetes is not itself a risk factor for bleeding, but diabetics are at a higher risk of warfarin-related bleeding, then the DRS model coefficient for diabetes may be null under ITT but positive under AT. This may adversely affect confounding control with the ITT DRS if many patients have diabetes, diabetes dramatically increases the risk of warfarin-related bleeding, and many patients discontinue warfarin soon after initiation, since the comparative ranking of patients with respect to bleeding risk may be completely altered.
[Figure caption, opening words missing] ... cohorts from 10%-100% sample size. Estimated as-treated (AT) hazard ratio (HR) vs. sample size, averaged across estimates stratified on deciles of DRS from models fit to 1,000 resampled cohorts at 10%, 25%, 50%, 75%, 90%, and 100% of the full historical cohort sample size. For reference, the crude AT HR and PS-stratified AT HR are presented below and above the plotted curves, respectively.
In any analysis that censors patients based on treatment discontinuation, consideration of informative censoring is important. If treatment discontinuation, and thus censoring, is prognostic for the study outcome, a Cox proportional hazards model will produce biased estimates of the survival function [29]. The direction of this bias will correspond to the direction of the correlation between censoring and failure, and the degree of bias will correspond to the prevalence of censoring. This in itself will not adversely affect DRS-based confounding control since subjects will still be ranked appropriately, but if censoring is differential with respect to the confounders, the model coefficients themselves may be biased in ways that change the ranks [30]. Therefore, we might assume that development of a DRS model under AT follow-up will yield the worst confounding control when censoring is informative, common, and differential with respect to important confounders. In practice, censoring patients upon discontinuation of treatment may also induce selection bias in the treatment effect estimate. While methods to address selection bias exist [20,21], we did not make use of them in this example study. This should not be problematic, as all hazard ratios were estimated in the same AT population and therefore are subject to the same degree of selection bias, with differing amounts of residual confounding attributable to method of DRS adjustment.
Additional research is needed to explore alternate methods of DRS development and use for the purpose of confounding control. Potential alternate strategies for increasing DRS power might include obtaining additional historical data from increasingly early time periods, broadening cohort eligibility criteria to include similar patients who did not initiate the comparator treatment, or reducing the dimensionality of the DRS model in a principled manner. As with altering the follow-up approach for DRS development, each of these strategies also carries disadvantages. For example, developing DRSs in a historical cohort followed from long before introduction of the treatment of interest increases the likelihood of secular changes in covariate measurement (e.g., due to coding practices). Broadening eligibility for the DRS development cohort may increase the quantity of available events, but may also allow the inclusion of patients who failed to initiate the comparator treatment due to unmeasured contraindications. Likewise, reducing a DRS model's dimensionality may require the omission of important confounders. Finally, methods of DRS adjustment other than stratification, including regression adjustment with smooth functional forms and matching, as well as use of the DRS as a probability as opposed to a linear predictor, warrant further attention.
In conclusion, development of the DRS without censoring patients based on discontinuation may be a reasonable approach to reduce overfitting and enhance confounding control when DRS adjustment of AT effects is desired. This example may motivate further study in alternate contexts, which can help establish general recommendations for DRS development.

Additional Files
The additional files for this article can be found as follows: • Appendix 1. Details on the dry run analysis. DOI: https://doi.org/10.5334/egems.254.s1