A new likelihood model for analyses of pharmacoepidemiologic case–control studies which avoids decision rules for determining latent exposure status

Background Case–control studies based on pharmaco-epidemiological databases typically use decision rules to determine exposure status from information on dates of prescription redemptions, although this induces misclassification. The reverse Waiting Time Distribution has been suggested as a likelihood based model to estimate the latent exposure status, and we therefore suggest to extend this into a joint likelihood based model, which incorporates both the latent exposure status and the exposure-outcome association. This will achieve consistency and efficiency of the estimates, i.e. they can be expected to be asymptotically unbiased and have optimal precision. Methods We established a joint likelihood for the observed case–control status and last prescription redemption before the index date. The likelihood combines the ordinary logistic regression likelihood and the reverse Waiting Time Distribution, and allows inclusion of covariates in both parts to adjust for observed confounders. We conducted a simulation study of the new model and standard models based on decision rules for exposure and the probability of being exposed, respectively, to assess the relative bias and variability of estimates. Lastly, we applied the models to a case–control study on use of nonsteroidal anti-inflammatory drugs and risk of upper-gastrointestinal bleeding. Results In simulation studies the new model had low relative bias (< 1.4%) and largely retained nominal coverage probabilities (90.2% to 95.1% of nominal 95% confidence intervals), also when moderate misspecification was introduced. All standard methods generally had substantial bias (-21.1% to 17.0%) and low coverage probabilities (0.0% to 68.9%). When analyzing the empirical case–control study, the new method estimated the effect of nonsteroidal anti-inflammatory drugs on risk of with upper-gastrointestinal bleeding hospitalization to 2.52 (1.59 – 3.45), whereas the other methods had estimates ranging from 3.52 (2.19 – 5.65) to 5.17 (2.40 – 11.11). Conclusions Unlike standard methods, the proposed model gave nearly unbiased estimates with adequate coverage probabilities in simulation studies. The developed model demonstrates the potential for the reverse Waiting Time Distribution to be integrated with existing likelihood based analyses in pharmacoepidemiological studies. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-021-01312-y.


Background 1
Pharmacoepidemiologic case-control studies are commonly based on routinely collected register 2 data, as this is an efficient strategy for assessing the risk of adverse drug effects in real-world 3 settings. Typically in such studies, exposure information is obtained from databases recording all 4 individual prescription redemptions for a population and then linked with information on for 5 example hospital admissions for each individual in the population. Since the actual treatment after 6 redeeming a prescription is not recorded, such studies routinely rely on decision rules for defining 7 drug exposure status at the index time for both cases and controls. In a recent study, we suggested to 8 use the reverse Waiting Time Distribution (WTD) to estimate the probability of being exposed on 9 the index date, and use this probability as exposure variable.(1) The objective was to better allow 10 for the fact that the actual exposure status is not recorded, and that the analyst may therefore 11 introduce misclassification bias when forcing a dichotomy upon exposure. The approach was 12 intended to directly represent the analyst's uncertainty about actual exposure by using probability of 13 being exposed as exposure covariate. We found that the approach improved statistical efficiency 14 when studying the risk of upper gastrointestinal bleeding (UGIB) associated with use of a 15 nonsteroidal anti-inflammatory drug (NSAID).(1) The suggested method consisted of two steps: 16 First, a parametric reverse WTD model was fitted to the last observed NSAID prescription 17 redemption among controls before their index date to estimate their probability of being users on 18 the index date. Second, we used the predicted probability as covariate in a logistic regression with 19 case-control status as outcome.

21
However, both the traditional decision rule based approach and using the probability of exposure as 22 covariate are susceptible to misclassification bias. Also, they are not able to incorporate the 23 uncertainty regarding the actual exposure into the subsequent analysis estimating the effect of being 24 exposed with respect to the outcome of interest. This will bias uncertainty estimates. To overcome 25 these biases, we here propose to base estimation on a joint parametric likelihood for the last 1 observed prescription redemption in the year preceding the index date, if any, together with the 2 observed case-control status. From a theoretical point of view, formulation of a full likelihood 3 model is attractive as it provides estimates based on a coherent and transparent model of what is 4 actually observed, i.e. the case-control status and the time of last redemption before the index date. 5 Since the observed data are conceptually linked together via the unobserved (i.e. latent) exposure 6 status, we can obtain estimates for the exposure-outcome relation by maximizing the likelihood. 7 Based on general likelihood theory, such estimates will have optimal large sample properties. (2) 8 Note that when no prescription redemption is observed in the year before the index date, we can 9 safely assume that the person is unexposed, and thus such persons can also be readily included in 10 the analysis.

12
In this paper, we first establish the new joint likelihood, which is a synthesis of a reverse WTD 13 model and the ordinary logistic regression of exposure status among cases and controls. In a 14 simulation study we then compare the performance of the new joint likelihood model with classic 15 decision-rule based analyses as well as the previous suggestion of a two-step model based on 16 exposure probabilities, and we investigate the impact of misspecification. We finally apply the 17 models to data on NSAID prescription redemptions and hospitalization with UGIB in a Danish 18 case-control study.

20
As a starting point, let us assume that if a person is in continued treatment with a drug, then (s)he 21 will redeem prescriptions following a renewal process with ( ) as the inter-arrival density (IAD) 22 between subsequent redemptions. When this renewal process is intercepted at an index date, is 23 the time from the last renewal (redemption) before the index date until the index date ( Figure 1). 24 1 0) after the last observed prescription redemption before the index date. Note that we have in 2 previous papers shown that even if the assumption of redemptions following a renewal process is 3 violated, then the backward recurrence time (time from last prescription redemption before the 4 index date to the index date) can still be reasonably modelled as if originating from a renewal Concerning the outcome and exposure in the case-control study, a person can either be a case, 10 = 1, or a control, = 0, at the index date 0 , and be exposed to treatment, = 1, or not, = 0. 11 We will assume that the case-control status of a person depends on the true, but unobserved binary 12 exposure status, , in the ordinary fashion of logistic regression, i.e. 13 ( = 1| = ) = exp( 0 + 1 ) 1 + exp( 0 + 1 ) In this formulation, exp( 1 ) is the odds ratio of interest when studying the association between 14 being exposed and becoming a case. To maintain mathematical feasibility, we will assume that if 15 the true exposure status is known, then whether a person is a case or a control is unrelated to the 16 distribution of time since last prescription redemption. In other words, we assume and are 17 independent given .

19
Let us now consider the relation between redemptions and exposure status. When a person 20 continues treatment after the index date with one or more subsequent prescription redemptions 21 ( = 1 at the index date), then the person is by definition exposed at the index date ( = 1).

22
However, when the last prescription redemption observed before the index date is also the last 23 redemption in the treatment of an individual ( = 0), such individuals can be either treated ( = 1 1) or untreated ( = 0) on the index date ( Figure 1). This is because such individuals will remain 2 exposed for some time after their last prescription redemption before they enter a state without 3 treatment. We will assume that their duration of exposure after the prescription redemption, , 4 follows the ordinary IAD of treated patients continuing treatment, i.e. the probability density 5 function of is ( ). Finally, we will assume that persons without a prescription redemption in an 6 interval before the index date, ( 0 − ; 0 ), are unexposed. To accommodate this, we will let be 7 an indicator variable, which is 1 if an individual has at least one redemption observed in ( 0 − ; 0 ) 8 and 0 otherwise. Figure 1 gives an overview of the variables introduced and how they are related to 9 four different characteristic patient types.

11
We can now write the likelihood contributions for each patient type by conditioning on , , and .

17
With obvious notation, we can parametrize the probabilities as follows: In sum, the likelihood contribution for each individual becomes 1 ( , , 0 , 1 , | , ) = ⋅ ( ) exp( 0 + 1 ) 1 + exp( 0 + 1 ) (1 − ( ))) The full likelihood for all individuals is the product of the individual likelihood contributions. We 2 call this the joint likelihood as it is based on the joint density function for both the last prescription 3 redemption before the index date, , and the case-control status, . In the joint likelihood, 4 denotes the parameters which the IAD, and thus the BRD, depends upon and which must be where Φ is the standard normal cumulative distribution function. Note, that whether a case patient continues having prescription redemptions or not after the index 10 date may be hypothetical in the sense that case status may prohibit any further treatment. The model 11 does, however, only rely on the distribution of which is in principle unaffected by any termination 12 of treatment after = 1 has been observed (the case becomes a case).

14
Analogous with previous implementations of parametric WTDs, we have logit-transformed the 1 probability parameters, and , and log-transformed the parameter which is required to be 2 positive. Instead of a Log-Normal distribution, other parametric distributions with support on the 3 positive axis could be considered, for example the Weibull or Gamma distributions.

5
We have previously shown how parameters of the parametric reverse WTD can be allowed to 6 depend upon covariates in a regression-like setup,(5) and the above formulation allows completely 7 analogous inclusion of covariates. In particular, confounders which would ordinarily be included in 8 an logistic regression analysis of the case-control study can be included in an equation for 0 . Also 9 note that different parameters may depend on different sets of covariates.

11
Simulation study 12 The aim of the simulation studies was to examine the bias and precision of estimates. In the 13 simulation study of the finite sample properties, we generated data compatible with the model 14 developed above. More specifically, we used the following steps to create each dataset: values it is possible to derive the probabilities of exposure among cases and controls, 25 respectively, from which the exposure status can be drawn for all individuals in the dataset 1 (see Additional file 1, Section A1 for details). Also for each individual it is randomly 2 determined who has a prescription redemption in the observation window, and such that it is 3 compatible with the random exposure statusexposed individuals must have a redemption 4 in the observation window, whereas only some unexposed individuals will have such a 5 redemption.
6 2. For the individuals assigned to continue treatment, and thus be exposed on the index date,  3. For individuals identified to stop treatment after index date, we randomly draw a uniformly 13 distributed time of prescription redemption in the observation window, but such that for 14 those still exposed on the index date their redemptions follow the above specified Log-

15
Normal BRD. We used rejection sampling to satisfy the requirement of the marginal 16 distribution being uniform among all treatment stoppers when redemptions of the exposed 17 sub-population followed a Log-Normal BRD (see Additional file 1, Section A2 for details).

19
To examine the impact of a misspecified BRD, we also generated data with a Weibull BRD and 20 analyzed the data with a model based on a Log-Normal BRD. To make the settings comparable, we 21 used parameters such that the Weibull IAD had the same mean and variance as the Log-Normal

22
IAD corresponding to the and values stated above. The probability is estimated from an ordinary rWTD analysis. to exclude individuals that are not classified as non-exposed (i.e. used as reference) or exposed (i.e. For each method and setting we estimated the median relative bias of the estimated OR, the median 2 standard error of the log-OR, the coverage probability of nominal 95% confidence intervals and 3 finally the relative variance inflation factor with respect to the optimal analysis when exposure 4 status is directly observed. As case material we used a case-control study also employed in previous publications, which was 10 described in detail there.(1,6) Briefly, our source population was the residents of Funen County,

11
Denmark, during a study period of August 1, 1995 to July 31, 2006. We included as cases all 12 subjects who were admitted to a hospital on Funen with UGIB, validated by single case review.

13
Controls were selected by a risk-set sampling strategy, i.e. for each case, we randomly selected ten 14 controls among the subjects in our source population who matched the case by sex and birth year.

15
Controls were assigned an index date identical to the outcome defining date of the corresponding 16 case. We required that both cases and controls had been residents of Funen for at least one year on 17 the index date. As some of the very old cases had less than 10 eligible controls, the final control to 18 case ratio deviated slightly from 10:1.

Data analysis 21
Although data was collected with matching of controls to cases, we did not analyze the data 22 conditional on match sets. Each matched set consisted of at most 11 individuals, which did not 23 contain sufficient information to reliably estimate parameters of a WTD for each matched set.

24
Consequently, we ignored the matching and instead included the matching variables age and sex as 25 covariates in the association between exposure and case-control status, although this will likely 1 introduce a small bias towards the null. (7) 2 3 As reference analyses we considered logistic regression with case-control status as outcome and the 4 exposure covariate with respect to NSAID defined as follows:

5
 WTD probabilitythe probability of being exposed as estimated by a reverse waiting time 6 distribution among controls in the year prior to their index date. We let the WTD parameters nitrates, spironolactone, calcium antagonists, bisphosphonates; 2) any history of the following 25 events: NVUGB, Helicobacter pylori (HP) eradication, peptic ulcer, chronic obstructive pulmonary 1 disease (COPD), diabetes, ischemic heart disease, heart failure, stroke, hypertension, inflammatory 2 bowel disease, malignant disease, renal failure; and 3) prescription or diagnosis markers of smoking 3 or excessive alcohol consumption. For all drugs used as covariates, current drug use was defined by 4 the redeeming of a prescription within less than 120 days before the index date.

6
For the joint likelihood we let the parameters associated with the reverse WTD ( , and ) depend 7 on the same covariates as the reverse WTD described above when using probability of exposure as

20
Simulation study 21 Results of the simulation study are shown in Table 1  The new method based on a joint likelihood for the case status and last observed prescription 2 redemption before the index date was virtually unbiased (range 0.3% to 1.7%) with coverage 3 probabilities close to the nominal value (range 90.2% to 95.1%) and relatively low VIFs (range 1.09 4 to 1.24). With 1:1 matching and a low value (median prescription duration of 1.5 months) there 5 were a small number of datasets for which the likelihood estimation did not converge (< 70 of 6 2,500).  Results were largely similar for the joint likelihood method when the sample size was halved to 11 19,800, although with higher coverage probabilities (see Additional file 1, Table A1) or when the 12 fraction of patients continuing treatment was reduced to 40% (see Additional file 1, Table A2).

13
VIFs were however increased when fewer continued treatment (range 1.23 to 1.53). The other 14 methods remained biased with low coverage probabilities, although their internal rank of 15 performance changed in some settings.

17
In all settings there was substantial bias associated with the method of choosing a fixed exposure 18 window, be it 30 or 90 days (relative bias ranged from -6.7% to -21.5%). Correspondingly, the 19 coverage probabilities of nominal 95% confidence intervals for the odds ratio were low (range 0.3% 20 to 40%). The variance inflation factor (VIF) varied considerably relative to the optimal analysis in 21 which the true exposure status of individuals was directly observed (range 1.06 to 2.48), but given 22 the substantial bias this is less relevant.

24
The analysis based on estimating the probability of exposure with the rWTD and then use this 1 probability as exposure covariate also led to substantial relative bias (range 6.1% to 17.1%), which 2 indicates that this approach in general overestimated the association. This approach also had low 3 coverage probabilities (range 0.0% to 68.9%), but consistently it had high VIFs relative to the 4 optimal analysis (range 1.56 to 2.00). The low coverage probabilities were thus not so much a 5 consequence of small estimated standard errors, but rather of the magnitude of the bias. When the BRD was misspecified (data generated from a Weibull BRD and analyzed with a model 8 based on a Log-Normal BRD), the relative bias increased slightly for the joint likelihood (≤ 1.7%), 9 coverage probabilities remained near the nominal level (≥ 90.1%), and VIFs were as above (≤ 10 1.25). Similar results were seen with the smaller sample size of 19,800 (see Additional file 1, Table   11 A3). The bias, coverage and VIFs for the other analytic approaches were similar to the settings 12 where a Log-Normal BRD was used for generating the data.  Table A4). The joint likelihood model provided an unadjusted estimate of the NSAID effect on risk 17 of UGIB similar to the other models (OR = 5.57 (5.08 -6.05)), although the estimate based on the 18 probability of being exposed was higher (Table 3). However, when the association was adjusted for 19 confounders, the estimated effect (2.52 (1.59 -3.45)) was substantially lower for the joint model 20 than estimates from the three other models. The joint model had better precision than the methods 21 based on a fixed exposure window, but lower than the exposure probability approach. As in the 22 previous papers, when reporting estimates based on a fixed exposure window we omitted patients 23 who had an NSAID prescription outside the exposure window resulting in a lower sample size for 24 these two analyses (30 days: n=9,453 and 90 days: n=12,662 vs the full sample size of 39,119). The 25 gain in precision for the models based on the WTD is thus due to their ability to include the entire 1 sample in the estimation. Estimates were largely unchanged, whether we used conditional logistic 2 regression to account for the age-and sex-matched design or included sex and age as covariates in 3 ordinary logistic regression with all three standard analyses. However, for the adjusted analyses, 4 estimated ORs were smaller for the two analyses based on a fixed exposure window (30 days or 90 5 days).  We have shown how it is possible to establish a likelihood for the effect of a latent exposure status  Theoretically, a full likelihood-based model provides the optimal analysis in terms of statistical 1 efficiency, since it for large samples provides unbiased estimates with minimal standard errors 2 (Cramér-Rao lower bound).
(2) Other methods may yield smaller standard errors, but are then not 3 consistent as they are asymptotically biased. We therefore joined the likelihood of the reverse WTD 4 and that of a logistic regression to allow direct estimation of the OR in case-control studies with 5 exposure information based on registry data. We based our approach on noting that logistic 6 regression in a case-control study models the relation between exposure and case-control status.

7
When using pharmacoepidemiologic register data, exposure status is not directly observed (did the 8 patient actually take the pill on the index date?), but this latent exposure status is modelled in the 9 reverse WTD. The reverse WTD links information on last observed prescription redemption before 10 the index date to two different sub-groups of patients, those who are continuing treatment on the 11 index date and those who are stopping treatment. In our approach we allowed for the fact that We are not aware of other methods that allow estimation of the exposure OR in a case-control study 23 in which exposure status is latent and linked to observed prescription redemptions. Abrahamowicz 24 and co-authors investigated how different models for the exposure-outcome relation over time could 25 be used in prospective studies on the risk of adverse effects with pharmacoepidemiological registry 1 data.(11) They relied on model diagnostic criteria to identify the correct model among many 2 candidates and found substantial bias associated with misspecification of the exposure-outcome 3 relation with respect to timing. In our model we assumed that it was exposure on the index date that 4 induced a potential effect on case-control status.

6
While theoretically superior, our developed method has several limitations. First, the method does 7 not allow taking into account any matching. This is expected to dilute effect estimates. (12) 8 Including matching variables as covariates in the estimation may however remedy this, at least 9 partially, and we did indeed see that this strategy provided similar results with the standard analyses 10 in our application. Second, even when the model is correctly specified, it is complex and the 11 maximum likelihood estimation procedure may occasionally not converge. Third, the method relies 12 on the simplifying assumptions of the BRD being similar between cases and controls, and that where we expect recent initiators to have a higher risk than later in the treatment phase. (13) 20 Addressing this will require further modelling, but we have likely underestimated the effect of 21 NSAID on UGIB in our analysis based on the developed joint likelihood. Finally, the developed 22 model only estimates the effect of current use, i.e. the effect of being exposed on the index date.  predict the probability of an individual being exposed and this exposure probability is used as 10 covariate in logistic regression.

11
90 daysindividuals are considered exposed if they have a redemption < 90 days before index 12 date.
13 30 daysindividuals are considered exposed if they have a redemption < 30 days before index 14 date, logistic regression.  The datasets had a sample size of 39,600, and on average 80% of patients continue treatment at the 3 index date, 25% of patients have a prescription redemption in the year before the index date and the 4 true OR is 3. For each setting 2,500 datasets were generated and analyzed. The Weibull Backward 5 Recurrence Density used to generate data corresponded to a Weibull distribution with the same 1 mean and variance as a Log-Normal distribution with and as its parameters, see text for details. predict the probability of an individual being exposed and this exposure probability is used as 10 covariate in logistic regression. 11 90 daysindividuals are considered exposed if they have a redemption < 90 days before index 12 date. 13 30 daysindividuals are considered exposed if they have a redemption < 30 days before index 14 date, logistic regression. logistic regression (unconditional), but where sex and age are included as covariates, both in the 6 crude and adjusted analyses.  Black bullets represent prescription redemptions included in the likelihood, grey bullets are redemptions not included in the likelihood. The thicker black lines represent periods of treatment, whereas dashed lines are periods without treatment. T is duration of the prescription and R is time from the last prescription redemption before the index date t_0 until the index date. X indicates whether a patient continues treatment, Z exposure status on the index date, and V whether a prescription is redeemed in the interval (δ;t_0). For simplicity, index dates have been aligned on the time scale. Type refers to patient type, see text for details.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.