Hypothetical Estimands in Clinical Trials: A Unification of Causal Inference and Missing Data Methods

Abstract The ICH E9 addendum introduces the term intercurrent event to refer to events that happen after treatment initiation and that can either preclude observation of the outcome of interest or affect its interpretation. It proposes five strategies for handling intercurrent events to form an estimand but does not suggest statistical methods for estimation. In this article we focus on the hypothetical strategy, where the treatment effect is defined under the hypothetical scenario in which the intercurrent event is prevented. For its estimation, we consider causal inference and missing data methods. We establish that certain “causal inference estimators” are identical to certain “missing data estimators.” These links may help those familiar with one set of methods but not the other. Moreover, using potential outcome notation allows us to state more clearly the assumptions on which missing data methods rely to estimate hypothetical estimands. This helps to indicate whether estimating a hypothetical estimand is reasonable, and what data should be used in the analysis. We show that hypothetical estimands can be estimated by exploiting data after intercurrent event occurrence, which is typically not used. Supplementary materials for this article are available online.


Introduction
The analysis of randomized trials is often complicated by the occurrence of certain events that affect the interpretation of the treatment effect or preclude the observation of the outcome of interest. Such occurrences, termed "intercurrent events" (ICE) by the recently-published ICH E9 addendum on estimands (International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use 2019), include treatment discontinuation, addition of rescue medication, or death prior to measurement of the outcome of interest. In the presence of such ICEs, the importance of clear specification of a trial's treatment effect "estimand" and how the statistical analysis targets this estimand has been increasingly recognized over the last decade.
The U.S. National Research Council report on the Prevention and Handling of Missing Data in Clinical Trials highlighted the importance of trials clearly specifying the target estimand(s), and how the trial design and statistical analysis should be chosen to support its reliable estimation (National Research Council 2010). Since then, a number of authors have considered the complex questions involved in how to choose and specify an estimand and how to select an appropriate statistical method to estimate it (Mallinckrodt et al. 2012(Mallinckrodt et al. , 2019Carpenter, Roger, and Kenward 2013;Holzhauer, Akacha, and Bermann 2015).
According to the ICH E9 framework, strategies for dealing with intercurrent events must be specified when choosing and defining the target estimand of a clinical trial. The addendum CONTACT Camila Olarte Parra cop23@bath.ac.uk Department of Mathematical Sciences, University of Bath, Bath, UK. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/SBR. does not however specify how these might or should correspond to statistical analysis methods. One of the proposed strategies is labeled as hypothetical. Under the hypothetical strategy, the causal effect is targeting what would have happened if the ICE had (somehow) been prevented from occurring. For instance, if rescue medication was not available or patients could be prevented from discontinuing treatment. For patients in the trial for whom the ICE did not occur, their observed outcome corresponds to the outcome of interest under the hypothetical strategy, whereas for those who experienced the ICE, the potential outcome of interest is missing. Consequently, the existing literature has almost exclusively focused on tackling the problem of estimation of hypothetical estimands from the perspective of missing data, by deleting any outcomes observed after ICE occurrence and applying methods such as direct likelihood (e.g., using linear mixed models) or multiple imputation (Qu et al. 2020).
Until recently Bowden et al. 2020;Qu et al. 2020;Michiels et al. 2021;Qu, Luo, and Ruberg 2021), relatively little has been published on the topic of estimation of estimands from the perspective of modern casual inference. Indeed, perhaps surprisingly, the ICH E9 addendum itself does not explicitly mention causal inference concepts or methods, although these are clearly relevant.
The hypothetical strategy has been used when the ICE is addition of rescue medication. A recent systematic review on rescue medication in published trials of asthma and eczema found that its use was not routinely reported or accounted for, even when there was evidence of an imbalance in rescue medication between arms (Ster, Cornelius, and Cro 2020). When analyses aiming to account for rescue medication were reported, the authors of the review considered that they were mainly targeting a hypothetical estimand with suboptimal methods and concluded that further guidance was warranted.
In diabetes trials, rescue medication is usually available for ethical reasons. An example of such a trial compared dapagliflozin, dapagliflozin plus saxagliptin, and glimepiride in patients with type 2 diabetes who were using metformin (Müller-Wieland et al. 2018). Insulin therapy was available as rescue medication. The analysis of the primary endpoint was performed using a linear mixed model with fixed effects for treatment group and covariates after deleting values beyond the first use of rescue medication. Here, we will discuss alternative approaches and their corresponding underlying assumptions, using either only the values prior to rescue medication or the full observed data including values after the ICE occurrence and how they relate to this approach.
In this article, we review concepts from causal inference to characterize precisely the conditions under which hypothetical estimands can be estimated from trial data. We describe statistical estimators of these arising from both the causal inference and missing data literatures, and establish that for each missing data estimator there is a corresponding numerically identical causal inference estimator, thereby unifying the sets of methods.
We begin in Section 2 with a review of the concepts and tools in causal inference, first for a setting with a treatment decision at a single time point, and then for a generalized setting where treatment changes can occur at multiple times. In Section 3, we consider the definition and estimation of a hypothetical estimand in a simplified setting in which the ICE can only occur at a single time point, linking it to the concepts and methods reviewed in Section 2. In Section 4, we consider the more general setup in which an ICE can occur at multiple time points. Finally, we give conclusions in Section 5.

A Brief Review of Causal Inference Concepts, Assumptions, and Estimators
In this section, we review the key concepts, assumptions and estimation methods from causal inference for studies with timevarying treatments, drawing on Hernan and Robins (2020), Hernán (2009), Tsiatis et al. (2020), Daniel et al. (2013), and Ding and Li (2018).

Time-Fixed Treatment
Clinical trials usually compare two or more treatments for a given condition and evaluate their effects on an outcome of interest. The potential outcomes framework provides a formal definition for such causal effects and the assumptions required to estimate them (Rubin 1974). "Potential outcome" refers to the response that would have been observed on a patient had they been assigned a particular treatment. Thus, there is a potential outcome for each patient for every treatment we might feasibly assign to them. Except in certain special situations, patients only receive one treatment and therefore only one of their potential outcomes is observed.
The potential outcome, denoted Y a , expresses the outcome Y under assignment to treatment a. We can then define the target causal effect of interest (the estimand) as a contrast of the distributions of such potential outcomes. For a dichotomous treatment, A, we may for example be interested in the mean In RCTs, the treatment at baseline is assigned at random but the occurrence of the ICE is not. As we will see in the next section, the ICE can be considered a treatment that is not randomly assigned. The causal effect of interest can be estimated if certain identifiability assumptions are satisfied. First, the interventions have to be sufficiently well defined to ensure consistency, which states that the observed outcome corresponds to the potential outcome under the treatment received: where A denotes the variable recording the treatment a given patient receives (VanderWeele 2009). Consider an oncology trial where we compare chemotherapy versus no chemotherapy. The chemotherapy treatment would be considered ill-defined if the type of chemotherapy and regimen are not specified. Also no chemotherapy can imply no treatment or follow up at all or just standard of care or many other options. As we would not expect that different types of chemotherapy regimens would yield similar outcomes, then the potential outcomes Y a are not sufficiently well defined.
As patients can usually receive only one treatment, we compare different groups of patients receiving the different treatments of interest. Randomization ensures that the different groups of patients have similar prognostic factor distributions. In the absence of randomization, the effect estimate has to be adjusted for a sufficient set of confounders to ensure that patients are comparable in terms of their prognostic factors. This is the second identifiability condition known as conditional exchangeability. In other words, each of the potential outcomes Y a for the different possible values of a has to be independent of the actual treatment received A, given the confounders L: Y a | A|L. Finally, the last identifiability assumption is positivity. This means that for every patient, on the basis of their confounder values L, there is a nonzero probability that they could receive each of the treatments under study (Petersen et al. 2012). It would not be sensible to consider patients who, on the basis of one or more of their confounder values, would always receive a given treatment. This could happen if, say, a particular confounder level implies contraindication for one of the treatments. Therefore, all patients should have a nonzero probability of receiving the different treatments P(A = a|L = l) > 0 for all values of a and l such that P(L = l) > 0.
Directed acyclic graphs (DAGs) are a useful tool for encoding causal assumptions. These graphs are composed of nodes that represent random variables, including treatment, outcome and covariates, and edges connecting the nodes (Figure 1). DAGs  are said to be directed because the edges have arrows indicating the direction of the causal effect and acyclic because a directed path that is, edge or series of consecutive edges, cannot lead back to an initial node. The absence of an edge between two nodes encodes the assumption that there is no direct causal effect between them. In Figure 1, we would omit the edge L → A in an RCT to show that A is assigned at random. Briefly, paths between nodes are made up of (1) chains (e.g., L → A → Y), (2) forks (e.g., A ← L → Y) and (3) inverted forks (e.g., A → Y ← L). The variable at the center of an inverted fork, in this case Y, is known as a collider on that path. Associations are transmitted along chains and forks but not along inverted forks; thus, paths consisting only of chains and forks are called open paths, but those containing at least one inverted fork are closed. The association transmitted along an open path can be blocked by conditioning on a central variable in a chain or fork, but a closed path is unblocked by conditioning on a collider, and correspondingly a conditional association induced. The composition of these 3-variable phenomena lead to an algorithm for deciding whether or not a set of covariates is sufficient to control for confounding. For example, in Figure 1, were the arrow from A to Y to be removed, A and Y would be marginally associated (via L) but conditionally independent given L, which indicates that adjusting for L is sufficient to control for confounding under the assumptions of this simple DAG.

Time-Varying Treatment
The preceding setup can be extended to a more general one where treatment can change over time. Figure 2 shows a possible DAG for a study where treatment can change at three time points, with A 0 , A 1 , A 2 denoting variables for treatment at each time. The treatment at a given time point k can depend on the earlier treatment values and also the earlier values of the variables L. The final outcome of interest is denoted Y.
In the setting with time-varying treatments we are typically interested in comparing different treatment regimes, that is different (hypothetical) ways of assigning treatments. A static treatment regime is one in which the decision on which treatment to assign at each time point does not depend on the timevarying variables L, whereas dynamic regimes are those where the treatment decisions can be based on the hypothetical values of L that would be observed under that regime. As we will see later, the hypothetical estimand of "no ICE" corresponds to a static regime. For the DAG in Figure 2 a static regime is defined by specifying particular values for the three treatment variables. For example, assuming there are two treatments available, coded 0 and 1, a particular regime isā = (0, 0, 0), which corresponds to assigning the first treatment at each of the three time points. The potential outcome under this regime is denoted Yā.
The identifiability conditions described previously for the setting with a single treatment assignment can be extended to this more general setting with time-varying treatments. The no unmeasured confounding or exchangeability condition has a number of different versions in the time-varying treatment setting. For our purposes we use the following version, which states for k = 0, 1, . . . , K and all static treatment regimesā, wherē A k−1 = (A 0 , . . . , A k−1 ) denotes the history of treatment received through to time k − 1, and similarly forā k−1 andL k . This condition is satisfied if treatment assignment at each time point depends only on previous treatments and measured timevarying confounders, as in the DAG in Figure 2. We will see later that this assumption plays a critical role regarding which variables are included in L k in order to provide valid inferences. The positivity condition is similarly extended in the timevarying treatment setting to the following In words, this says that for all combinations of treatment histories and time-varying confounders up to time k − 1 that can occur in the study, there is a positive probability of each of the possible treatments being given at time k. In fact, if we are interested in a particular treatment strategy, this condition only needs to hold for treatment histories compatible with the strategy of interest (see Technical Point 19.2 of Hernan and Robins 2020). For example, if we are interested in the treatment strategy of giving treatment zero throughout,ā = (0, . . . , 0), we require only that We will see later that the positivity assumption plays an important role in the feasibility of estimation of hypothetical estimands where whether an ICE occurs depends deterministically on time-varying confounders.
In what follows, where there is no ambiguity introduced, we use f (x|a, b, c) as shorthand for f X|A,B,C (x|a, b, c). In the case that

G-Formula
We now review the two most commonly adopted approaches for estimation of the causal effects of treatment in the time-varying treatment setting. The first is G-formula or G-computation. Under the previously stated identification conditions, for a given treatment regimeā the density function of the potential outcomes under this regime can be shown (sec. 5.4 of Tsiatis et al. 2020) to be equal to where we have taken K = 2 for concreteness,ā = (a 0 , a 1 , a 2 ) andā 1 = (a 0 , a 1 ). Often we will be interested in the mean outcome under a given treatment regime, which can then be shown to equal To implement G-formula we can specify and fit models for the conditional distributions If we are only interested in the mean outcome (as opposed to other aspects of the distribution) under the treatment strategy, then the conditional model for Y can be replaced with a model for its conditional expectation. When the L k are multivariate, these regressions become multivariate, which is more difficult, particularly if the components of L k are a mixture of continuous and discrete variables. An issue for this approach is that it is not usually clear which order should be chosen. We note that a similar issue arises in the specification of imputation models for multivariate missing data (Erler, Rizopoulos, and Lesaffre 2019).
Having specified and fitted the models, the G-formula identification Equations (3) or (4) can be used. However, evaluation of the integrals involved in general is difficult. To circumvent this, a Monte Carlo integration approach can be used, in which the values of the time-varying confounders L k are simulated from the fitted models sequentially. For further details, see, Daniel, De Stavola, and Cousens (2011).
The resulting G-formula estimates of E(Yā) are consistent provided the previously stated identification assumptions hold and the models for the conditional distributions of the timevarying confounders and the model for the outcome (5), are correctly specified.

Inverse Probability of Treatment Weighting
A different approach is to use inverse probability of treatment weighting to create a pseudopopulation in which the timevarying treatment assignment is independent of the values of the (time-varying) covariates. In other words, we create a pseudopopulation in which there are no longer arrows into the treatment nodes from any other node, but all other relationships remain unaltered. This is achieved by weighting each patient by the inverse of the probability of receiving the treatment they in fact received at each time point given the covariate and treatment history. This is the time-varying treatment extension of the propensity score (the probability of treatment given covariates). In the time-varying treatment setting, the (unstabilized) weight for patient i is where patient i's treatment history is (A 0,i , A 1,i , . . . , A K,i ), their time-varying confounder history is (L 0,i , L 1,i , . . . , L K,i ), and A −1,i is taken to be zero for all i by definition. The potential mean outcome under the treatment strategy of interest can then be estimated as the weighted average of the outcomes among those patients whose treatment history matches the treatment strategyā we are targeting: where I(·) denotes the indicator function.
In practice, the conditional distributions of the time-varying treatment variables involved in the definition of the weights are not known, but must instead be estimated. When there are only two treatment options at each time, such that each A k is binary, these can be logistic regression models. The resulting estimator of E(Yā) is consistent provided these models are correctly specified, along with the same identification assumptions stated previously.

Definition and Estimation of a Hypothetical Estimand in a Simplified Setting
In this section, we apply the assumptions and methods introduced in Section 2 to the problem of estimating a hypothetical estimand in a randomized trial in which the ICE can take place at only a single fixed time point. We then compare and contrast the causal inference estimators with missing data estimators which are currently more commonly adopted in practice for the estimation of hypothetical estimands. We consider inference under the setup depicted by the DAG in Figure 3, which is a special case of the general time-varying treatment setting described in Section 2.1.2. Here A 0 represents randomized treatment at baseline. Unlike Figure 2, there is no arrow from L 0 to A 0 due to the fact that treatment at baseline is randomly assigned. The second "treatment" variable A 1 represents whether or not the ICE occurs for a particular patient. The ICE can be considered a "treatment" because we want to intervene on it to set it to 0.
The potential outcomes Y a 0 ,a 1 denote the outcome were we to assign treatment a 0 and intervene (somehow) on the ICE to set it to level a 1 . The effect of treatment can be expressed as the expected outcome of assigning the full population to treatment 1 versus assigning them to treatment 0 in the hypothetical scenario were the ICE prevented from occurring: In some cases, the post-baseline variables (L 1 ) can include measurements of the outcome at intermediate visits. For instance in a diabetes trial, L 1 might consist of measurements of glycemic control, like glycated hemoglobin (HbA1c) and fasting plasma glucose (FPG), assessed at an intermediate follow-up visit, while Y denotes HbA1c at the second (final) visit.

Causal Inference Estimation Approaches
The sequential exchangeability assumption (Equation (1)) holds here, given covariates L 0 , L 1 , since the DAG in Figure 3 is a special case of the general DAG in Figure 2. Informally this can be seen to hold here because the treatment A 0 is randomly assigned and the ICE A 1 (which is analogous to the "second treatment") is only influenced by assigned treatment A 0 and the measured variables L 0 and L 1 . Thus, Y a 0 ,a 1 =0 | A 0 and Y a 0 ,a 1 =0 | A 1 |A 0 , L 0 , L 1 . We emphasize that L 0 and L 1 consist of all variables which influence the ICE occurrence (A 1 ) and the outcome Y. In particular, it is not sufficient to specify L 0 and L 1 as simply the baseline and intermediate measures of the outcome variable unless these are truly the only variables influencing the occurrence of the ICE and outcome.
For the positivity assumption (Equation (2)), first since in randomized trials P(A 0 = 0) = P(A 0 = 1) = 0.5, it is clearly the case that P(A 0 = a 0 |L 0 = l 0 ) = P(A 0 = a 0 ) > 0 for both a 0 = 0 and a 0 = 1. Second, the occurrence of the ICE should also satisfy that P(A 1 = 0|A 0 = a 0 , L 0 = l 0 , L 1 = l 1 ) > 0 for all possible l 0 and l 1 values in both treatment arms. This means that for all possible combinations of L 0 , L 1 and A 0 , the probability of not having the ICE must be nonzero. Note that as we are only interested in the potential outcomes in the absence of the ICE, then P(A 1 = 1|A 0 = a 0 , L 0 = l 0 , L 1 = l 1 ) > 0 is not required, which implies that for some values of a 0 , l 0 and l 1 it can be that P(A 1 = 0|A 0 = a 0 , L 0 = l 0 , L 1 = l 1 ) = 1 without representing a violation of the positivity assumption.

G-formula
From Equation (4) the G-formula estimator for the mean potential outcome under a general treatment regimeā is given by Unlike in the general observational study setting, in our setting the first treatment A 0 is randomly assigned. This means in particular that A 0 | L 0 , which in turn means that f (l 0 ) = f (l 0 |a 0 ). Thus, the preceding equation can in our setting be written as To use this to estimate E(Yā) we could specify models for E(Y|ā,l) and f (l 1 , l 0 |a 0 ). To avoid having to specify a model, we can use the empirical distribution of L 0 and L 1 among those randomized to A 0 = a 0 . This motivates the G-formula estimator: which relies only on a model for E (Y|ā,l), that is, an appropriate model for the mean of Y with A 0 , A 1 , L 0 , and L 1 as covariates, noting that our estimator then requires predictions from this model where A 1 is set to a 1 . If Y is continuous we might for example choose a linear regression model with main effects of randomized treatment A 0 , occurrence of ICE A 1 , L 0 , and L 1 . This model would make use of all the observed data, including outcomes Y which take place after an ICE. Through the inclusion of A 1 as a covariate, it models how the ICE influences the final outcome Y. When the ICE is receipt of rescue medication, this corresponds to use of the post-rescue outcomes with adjustment for rescue, and this type of approach has been discussed previously by Holzhauer, Akacha, and Bermann (2015).
As described previously, the hypothetical estimand corresponds to the contrast of the treatment regimesā = (1, 0) and a = (0, 0), where we set A 1 to 0. A G-formula estimator for the hypothetical estimand under no ICE is thus, The G-formula estimator is thus predicting, for every patient randomized to treatment group a 0 , what their outcome would be were the ICE set to not occur.
Since here we are only interested in regimes which set a 1 = 0, we do not need in fact to model how occurrence of the ICE influences Y, since we only need to predict outcomes under no ICE a 1 = 0. Thus, an alternative G-formula approach for the hypothetical estimand is to only specify and fit a model among those patients who did not experience an ICE (A 1 = 0), that is, for E(Y|A 0 = a 0 , A 1 = 0, L 0 , L 1 ). For example if we are interested in the potential outcome under a particular value of the treatment A 0 = a 0 we might assume that E(Y|A 0 = a 0 , A 1 = 0, L 0 , L 1 ) = β a 0 0 + β a 0 1 L 0 + β a 0 2 L 1 . This model can be fitted by ordinary least squares to those randomized to A 0 = a 0 and for whom A 1 = 0, giving estimatesβ a 0 0 ,β a 0 1 , andβ a 0 2 . The G-formula estimator is then equal to .
(11) The approach which specifies and fits a model to only those patients who did not experience the ICE makes fewer assumptions than the one which specifies and fits a model to all observed data. The approach which uses all the data has the potential to give improved precision of estimates (and hence greater power), but is more vulnerable to model misspecification. Similarly there is a tradeoff between precision and model misspecification when fitting a single model and including the randomized treatment A 0 as covariate or fitting separate models by treatment arm to relax modeling assumptions. The different modeling alternatives can also be combined for example, separate models by treatment arm among ICE-free.

Inverse Probability of Treatment Weighting
We now apply the general IPW estimator described in Section 2.2.2. Applied to the current setting, the IPW estimator of the mean of Y a 0 ,a 1 =0 from Equation (7) is given by This is a weighted mean of the outcomes from those patients who were randomized to treatment a 0 and in whom the ICE did not occur. Under 1:1 randomization P(A 0 = a 0 |L 0 ) = 0.5, and so the weights W i are given by Since the occurrence of the ICE is typically not under the investigator's full control, we must postulate and fit a model for P(A 1 = Figure 4. Single-world intervention graph (SWIG) resulting from the DAG shown in Figure 3, intervening to set the ICE to a 1 = 0.
0|A 0 , L 0 , L 1 ). For example, we could fit a logistic regression for A 1 with A 0 , L 0 , and L 1 as covariates. Here the logistic models to estimate the weights could also be fitted separately by treatment arm. As with G-formula, there is a similar trade off of making the models less vulnerable to model misspecification but at the expense of precision.

Likelihood Based Missing Data Approaches
As described in Section 1, currently a commonly adopted estimation approach when targeting hypothetical estimands with continuous endpoints is to fit a linear mixed model to the repeated measures of the outcome variable using maximum likelihood, after deleting any outcome measurements which were made after the ICE took place. These are based on assuming the resulting missing data are missing at random (MAR). We will use the theory of DAGs to check whether MAR holds for the missing hypothetical outcomes under the DAG shown in Figure 3. To do so, we make use of an extension of DAGs: single world intervention graphs (SWIGs), as described by Richardson and Robins (2013). A SWIG takes as its input a DAG, and shows the graph that would result under an intervention which fixes the values of certain variables. Figure 4 shows the SWIG that results if we intervene to set the ICE a 1 = 0. Each variable intervened on is split into two parts, the first (in capitals) which denotes the original variable, taking whatever value it would naturally take (without intervention), and the second part (lower case) indicating the intervened value. Variables which are affected by those variables intervened on are changed to their potential outcome value under the specified values of the intervened variables. Thus, when we intervene to set a 1 = 0, Y becomes Y a 1 =0 .
Unlike the DAG, the SWIG in Figure 4 contains the partially observed potential outcomes Y a 1 =0 of interest under the hypothetical estimand. To check MAR, we note that the indicator of missingness in the hypothetical outcomes Y a 1 =0 is identical to the ICE variable A 1 , since it is those individuals with A 1 = 1 for which Y a 1 =0 is missing. MAR here means that the missingness indicator A 1 is independent of the partially observed Y a 1 =0 conditional on A 0 , L 0 and L 1 . This conditional independence can be read off from the SWIG, since once we condition on A 0 , L 0 and L 1 there are no open paths from A 1 to Y a 1 =0 . Analogous to the sequential exchangeability assumption, we emphasize that we must ensure the variables used as L 0 and L 1 in estimation do include all common causes of the ICE A 1 and the outcome Y. For example, while L 0 and L 1 may typically need to include measurements of the outcome variable at baseline and the intermediate time point, there will generally be additional common causes of the ICE and outcome Y, and these must be included in L 0 and L 1 for the DAG in Figure 3 to be correct.
Since the potential outcomes Y a 1 =0 which are of interest for the hypothetical estimand are MAR given A 0 , L 0 and L 1 , it follows that an observed data likelihood analysis assuming the missingness is ignorable will give consistent estimates in this scenario under the previously stated assumptions and provided the full data model assumed is correctly specified. These conclusions are in agreement with those of Holzhauer, Akacha, and Bermann (2015), who proposed fitting a joint mixed effects model for L 0 , L 1 , Y conditional on A 0 , after deleting post ICE outcomes, or alternatively use of multiple imputation to impute Y a 1 =0 in those with A 1 = 1, again ensuring that L 0 , L 1 , A 0 are included in the imputation model.
We now show that particular observed data likelihood based estimators are identical to particular G-formula estimators. Consider the data on L 0 , L 1 and Y in treatment group A 0 = a 0 , after deleting any Y values for individuals with A 1 = 1. Suppose that L 1 is a single variable and that we fit a bivariate normal model (a type of linear mixed model) to the resulting (L 1 , Y a 1 =0 ) data in group A 0 = a 0 assuming MAR, with the means of L 1 and Y a 1 =0 depending linearly on L 0 but with separate coefficients, and an unstructured covariance matrix. The bivariate normal model implies that E(Y a 1 =0 |A 0 = a 0 , L 0 , L 1 ) = β a 0 20 + β a 0 21 L 0 + β a 0 22 L 1 E(L 1 |A 0 = a 0 , L 0 ) = β a 0 10 + β a 0 11 L 0 . The observed data likelihood function under MAR factorizes (sec. 7.2 of Little and Rubin 2019) such that the MLEs of the parameters in these two models are obtained by fitting the Y a 1 =0 model among those with Y a 1 =0 observed (here meaning A 1 = 0) and for the L 1 model using all patients. Then we have that

Taking expectations of this conditional on
). The model for L 1 is fitted to all those with A 0 = a 0 . A property of ordinary least squares estimators is that the sample mean of the dependent variable (here L 1 ) is equal to the predicted value of the dependent variable when the covariate is set to its sample mean, such that .

It follows that
which is identical to the G-formula estimator given in Equation (11). More commonly a linear mixed model is fitted which assumes a common covariance matrix for (L 1 , Y a 1 =0 ) across the two randomized groups with mean effects of A 0 and L 0 . A similar argument to the one above shows that the resulting estimator is a G-formula estimator where we fit a single model for Y a 1 =0 to both randomized groups, with A 0 as a covariate (in addition to L 0 and L 1 ).
Finally, we note that a multiple imputation (MI) analysis assuming MAR and based on the same modeling assumptions as the likelihood based analysis is (up to Monte Carlo noise) equivalent to the likelihood based analysis. Thus, corresponding multiple imputation estimates which delete data after the ICE occurs are equivalent to particular G-formula implementations. The equivalence of imputation approaches and G-formula was previously discussed by Westreich et al. (2015) and Qu et al. (2020).

Inverse Probability of Missing Weighting
The potential outcomes of interest among those randomized to A 0 = a 0 are Y a 1 =0 . Since the event that A 1 = 0 is precisely the indicator of observation of the potential outcome of interest, the standard IPW missing data estimator (Seaman and White 2013) for E(Y a 1 =0 |A 0 = a 0 ) can be seen to be identical to the "causal inference" IPW estimator given in Equation (12).

Deterministic Intercurrent Events
In some trials the intercurrent event could be discontinuation of randomized treatment or addition of rescue treatment. In some therapy areas, for example, diabetes, the protocol specifies that rescue treatment be given at/following a visit at time k if and only if a measurement of glycemic control (e.g., via FPG or HbA1c) exceeds some threshold. This has been termed a deterministic MAR situation (Holzhauer, Akacha, and Bermann 2015), with missing data estimation approaches advocated.
In such situations the positivity assumption is violated if the protocol was followed. This implies that the hypothetical estimand cannot be nonparametrically identified, which essentially means it cannot be estimated without making untestable modeling assumptions. Parametric G-formula, likelihood and MI approaches can provide consistent estimates, but only by extrapolating beyond the data. In particular, they must predict the no ICE potential outcomes for those individuals who did in fact have an ICE. When intercurrent events occur deterministically, there are no similar patients in terms of A 0 , L 0 , L 1 who did not have the ICE and hence have Y a 1 =0 observed. If the extrapolation implied by the model is correct, we can obtain consistent estimates. However, from the data alone, we have no basis on which to judge whether the extrapolation is justified. In such cases, the extrapolation can arguably only be justified on the basis of external evidence, since the data offer no information about whether the extrapolation is reliable.
In contrast, if the IPW approach is used, and the model for ICE/missingness correctly incorporates the deterministic ICE mechanism, for those with A 1 = 0 because P(A 1 = 0|A 0 , L 0 , L 1 ) = 1 their true weight will be 2 as per Equation (13), as well as for those with A 1 = 1 because P(A 1 = 1|A 0 , L 0 , L 1 ) = 1. In this case, the estimator in Equation (12) would simply be the unweighted average of outcomes in those with A 1 = 0, which in general will give a biased estimate.

Intercurrent Events at Multiple Time Points
We now consider the more realistic setting where the ICE can occur at multiple time points. We consider the case where the ICE can occur at two time points, denoted A 1 and A 2 , and note that our conclusions in this situation can be easily extended to the general setup with more time points. The hypothetical potential outcomes of interest now are Y a 0 ,a 1 =0,a 2 =0 for a 0 = 0 and a 0 = 1. Figure 5 shows the DAG for this situation.

G-formula
Using Equation (4) we have that As before because of randomization f (l 0 ) = f (l 0 |a 0 ) and so we can write To construct an estimator based on this, we specify and fit a model for E(Y|Ā,L) and for f (L 2 |Ā 1 , L 0 , L 1 ), while for f (L 1 , L 0 |A 0 = a 0 ) we can again avoid modeling it by using the empirical distribution of (L 1 , L 0 ) in those randomized to treatment a 0 . As before, we could fit models for E(Y|Ā,L) and f (L 2 |Ā 1 , L 0 , L 1 ) using data from all patients with suitable covariate effect specification, or instead restrict them to those whose treatment historyĀ equals the hypothetical estimand regime of interestā = (a 0 , 0, 0). The former may be more efficient, but it requires specification of more modeling assumptions. Taking the latter approach, suppose we assume that E(Y|A 0 = a 0 , A 1 = 0, A 2 = 0, L 0 , L 1 , L 2 ) = β a 0 30 + β a 0 31 L 0 + β a 0 32 L 1 + β a 0 33 L 2 E(L 2 |A 0 = a 0 , A 1 = 0, L 0 , L 1 ) = β a 0 20 + β a 0 21 L 0 + β a 0 22 L 1 . (15) Then we have that In words, for each patient randomized to treatment a 0 , this Gformula estimator first predicts L a 0 ,a 1 =0 2 under the hypothetical no ICE scenario and then predicts Y a 0 ,a 1 =0,a 2 =0 . Finally, it averages these predictions across the patients randomized to a 0 .

Inverse Probability of Treatment Weighting
For this setting when the ICE can occur at two time points, the IPW estimator for the mean of Y a 0 ,a 1 =0,a 2 =0 is a weighted mean of the outcomes from those patients who were randomized to treatment a 0 and in whom the ICE did not occur at either of the two intermediate time points. The weights are as defined in Equation (6).
Unlike in the setting considered in Section 3, we could now choose to model the occurrence of ICE at each time point using all patients, with earlier occurrence of ICE as a covariate. However, since in the end we only need weights for those patients who did not experience the ICE at either time point, we might choose instead to model the occurrence of the ICE at time k only among those who had not up to time k experienced an ICE. Thus, like the G-formula, one has some flexibility and choice about which data to use and what modeling assumptions to make. Indeed, one may wish to avoid modeling how the occurrence of an ICE depends on the past among those who have already experienced an ICE.

Likelihood Based Missing Data Approaches
As noted earlier, missing data methods have previously been advocated to and applied for estimating hypothetical estimands by fitting mixed models to the repeated measurements of outcomes after excluding any post ICE outcomes. This leads to a so called monotone missingness pattern. We now show using a SWIG derived from the DAG in Figure 5 that the hypothetical potential outcomes of interest are again MAR.
The full data under the hypothetical estimand is now (L 1 , L a 1 =0 2 , Y a 1 =0,a 2 =0 ). There are missing values in L a 1 =0 2 and Y a 1 =0,a 2 =0 , and if L a 1 =0 2 is missing for an individual then so is Y a 1 =0,a 2 =0 . This is analogous to monotone dropout in a longitudinal study. In this context, MAR can be expressed as saying that at any given time, among those subjects who have not yet dropped out, the probability of dropout before the next follow-up visit is independent of future outcomes given the past outcomes (Daniels and Hogan 2008). Figure 6 shows the SWIG resulting from the DAG in Figure 5 if we intervene to set a 1 = 0 and a 2 = 0, and we can use this to check MAR is satisfied. First we can immediately confirm from the SWIG that For this, note that in those with A 1 = 0, by the consistency assumption L 2 = L a 1 =0 2 and A 2 = A a 1 =0 2 , and so this assumption is , L 1 , L 0 , and the SWIG shows that this conditional independence condition indeed holds. We emphasize again, that if there are, as there typically would be, common causes of ICE occurrence and final outcome additional to the repeated measurements of outcome, they must be included in L 0 , L 1 , L 2 . Others have previously discussed the use of missing data methods assuming MAR whereby data after the ICE occurs are excluded from the analysis (Holzhauer, Akacha, and Bermann 2015;Mallinckrodt et al. 2020). By using the machinery of causal diagrams, we are able to clarify the conditions under which the MAR assumption would be satisfied-namely that we have measured and conditioned on all common causes of the ICE variables (here A 1 and A 2 ) and the final outcome of interest (here Y).
As was the case with an ICE occurring at a single time point, in the supplementary material we show that particular "missing data estimators" and particular "causal inference estimators" are equivalent. The fact that missing data approaches are valid and are identical to particular causal inference estimators in this setting is perhaps at first sight surprising, since it is often said that a setting with time-dependent confounding affected by treatment is one in which "standard approaches" are invalid, with more complex g-methods required instead. For further details, see the supplementary materials.

Inverse Probability of Missing Weighting
As before we note that we are considering the ICE as a "treatment" and it is also the missing indicator because we are only interested in the potential outcomes Y a 1 =0,a 2 =0 which would occur in the absence of the ICE. However, when the ICE can occur at more than one time point, the implementation of IPW for missingness is equivalent to the one described in Section 4.1.2, where we fit the models to estimate the weights at time point k restricted to those with A 1:k−1 =0.

Conclusion
To target a hypothetical estimand under no ICE, we have considered different estimators arising from the causal inference and missing data literature. In doing so, we have shown that there are close connections between estimators in these groups. Indeed, we have shown that "missing data" likelihood based estimators (observed data likelihood or multiple imputation) applied after data post-ICE are deleted are implementations of the G-formula method from causal inference. Similarly, inverse probability of missing estimators for hypothetical estimands are also inverse probability of treatment estimators from causal inference.
We believe this unification is helpful to those analyzing clinical trial data not least because the assumptions required for estimation, expressed via causal inference language, are arguably more easily understood than missing data assumptions. In particular, we believe DAGs can be very useful tools for graphically encoding what we may view as plausible for the causal relationships between variables, and from this, the validity of the sequential exchangeability assumption can be assessed.
The causal inference lens also brings to the fore the importance of the positivity assumption. In trials where ICE occurrence is a deterministic function of biomarker values, such that the assumption is violated, estimation via likelihood methods or G-formula relies on extrapolation. The reasonableness of this extrapolation should be assessed on a case by case basis, taking into account the type of ICE, the disease context, and the extent of the extrapolation being made. Sensitivity analyses may be required given that the reliability of the extrapolation cannot be assessed from the observed data. Alternatively, in such cases one may choose to target a different estimand (Michiels et al. 2021).
It is worth noting that, in a given context, different hypothetical estimands can be defined for a particular ICE. For instance, in the case of use of rescue medication, we could conceive a hypothetical scenario where rescue medication was not available for the early stage of the trial that is, no rescue medication in the first 6 months in a 36 months follow-up trial. The alternatives could be to set a shorter/longer period of followup or to consider not allowing rescue medication use in the protocol. The hypothetical strategy could be regarded even more broadly and instead of considering an intervention to set the ICE to 0 for everyone, this could be set to a different value. For example, this could mean assessing the randomized treatment versus control treatment where everyone was to receive rescue medication.
While we have shown that commonly used "missing data estimators" for hypothetical estimands correspond to certain causal inference estimators, we have also seen that there are additional implementations of G-formula and IPW which could be used instead. For example, contrary to current common practice, it is possible to use the full data, including intermediate and final outcomes assessed after the occurrence of the ICE, so long as suitable adjustment is made for past ICE occurrence. As seen in the simulations (supplementary material), this approach offers the potential for more precise estimates, at the expense of relying on more modeling assumptions.
There can be settings where it is plausible to borrow information from patients with ICEs. Consider a diabetes trial assessing the impact of a novel treatment with standard of care on achieving normal HbA1c levels. If a patient discontinues the novel treatment and still achieves normal HbA1c levels, they would probably have also had a positive outcome had they continued on treatment.
It is worth noting that the hypothetical estimand is particularly relevant to deal with ICEs that can be intervened on. For instance we could conceive a trial where we could intervene to avoid treatment interruptions due to administrative reasons such as the ones derived from government enforced closures during the current COVID-19 pandemic. In contrast, it would probably be less sensible to consider a world were we could intervene to avoid treatment discontinuation due to adverse events. For these cases, a different strategy to deal with the ICE is more reasonable.
We have assumed there is only one ICE under consideration. In practice there is typically more than one ICE. With multiple types of ICE which are all chosen to be handled using the hypothetical strategy, the methods described here could be applied with the A 1 , . . . , A K now denoting occurrence of at least one of the ICEs. However, in order to avoid model misspecification it may be preferable to define A k as a vector indicating occurrence or not of each of the ICE types at time k. If some ICE are to be handle with using the hypothetical strategy and some using treatment policy, if the treatment policy ICE precedes the hypothetical ICE in time, it may be possible to consider the treatment policy ICE as an additional time-varying covariate (i.e., as part of L 1 , . . . , L K ). We will address in more detail estimation in the case of two or more ICE types in a subsequent paper.
The validity of the estimates given by the methods described depends on the models being correctly specified, as shown by the simulations (supplementary material). To make the estimates more robust to model misspecification, so-called doublyrobust estimators were developed (Bang and Robins 2005). The idea is that the estimates are derived using two separate models and only one of the models needs to be correctly specified to obtain consistent estimates. The first model typically concerns the treatment assignment (propensity score model) while the second model is a model for the outcome. Section 21.3 of Hernan and Robins (2020) describes calculation of a doubly robust estimator in the time-varying treatment setting, which could be applied for estimation of hypothetical estimands. To provide further robustness to model misspecification, machine learning approaches could also be explored (Van Der Laan and Rubin 2006).
For the methods covered in the article, we did not discuss variance estimation. First, the purpose of the article was to provide feasible ways to implement existing estimators to target hypothetical estimands, so the main focus was on the treatment effect estimate. More importantly, as these are well established estimators, there is already existing literature proposing different ways of estimating the corresponding variance, including bootstrapping and sandwich estimators. In fact, there are different packages available to implement them in standard software (van der Wal and Geskus 2011;McGrath et al. 2020).
We also did not discuss G-estimation (Hernan and Robins 2020). The difference between G-estimation and the other G-methods is that G-estimation estimates conditional treatment effects within levels of L. In settings where conditional causal effects are of interest, G-estimation may be of relevance.
In practice, there may be data missing for example, due to missed visits. As we have discussed, certain estimators do not use data after the occurrence of an ICE, and so missingness in variables at follow-up visits occurring after ICEs present no difficulties for these estimators. For patients with missing values before occurrence of the ICE, their missing hypothetical (no ICE) outcomes could be imputed, for example using MI. There would be no need to differentiate between such missing values and dataset missing because of the ICE, provided both common causes of missingness and outcome, and common causes of the ICE and outcome are conditioned on in the imputation models. Alternatively, MI could first be used to impute any missing actual (as opposed to hypothetical counterfactuals) data, following which any of the methods described previously (e.g., Gformula or IPW) could be applied to estimate the hypothetical no ICE estimand. Dealing with missing actual data with IPW is less attractive, since its implementation when missingness is non-monotone (as it often is), is difficult (Sun and Tchetgen Tchetgen 2018). Moreover, use of IPW may require one to specify separate models for the missingness and occurrence of ICE processes.
We hope that by drawing parallels between causal inference and missing data methods and describing how the different estimators work, the assumptions required for valid estimates and showing feasible ways to implement them, researchers involved in the design and analysis of clinical trials will be able to successfully apply these methods to their trials.

Supplementary Materials
Equivalence of G-formula and likelihood based missing data approaches -Simulations.