Empirical likelihood inference for non-randomized pretest-posttest studies with missing data

Abstract: Pretest-posttest studies are commonly used for assessing the effect of a treatment or an intervention. We propose an empirical likelihood based approach to both testing and estimation of the treatment effect in non-randomized pretest-posttest studies where the posttest outcomes are subject to missingness. The proposed empirical likelihood ratio test and the estimation procedure are multiply robust in the sense that multiple working models are allowed for the propensity score of treatment assignment, the missingness probability and the outcome regression, and the validity of the test and the estimation requires only a certain combination of those multiple working models to be correctly specified. An empirical likelihood ratio confidence interval can be constructed for the treatment effect and has better coverage probabilities than confidence intervals based on the Wald statistic. Simulations are conducted to demonstrate the finite-sample performance of the proposed methods.


Introduction
Pretest-posttest studies are commonly used to evaluate the effect of a treatment or an intervention. In randomized pretest-posttest studies, subjects are randomly assigned to either the treatment group or the control group. The outcome of interest is first measured at the baseline prior to the treatment (pretest), along with certain auxiliary variables, and then again at the end of the study after the treatment (posttest). The main parameter of interest is the treatment effect, defined as the difference in the change of mean responses between the treatment and control groups. Inferences on the treatment effect for randomized pretest-posttest studies can be made with classical statistical methods such as the two-sample t-test, the paired t-test, analysis of covariance, or generalized estimating equations. See, for instance, Brogan and Kutner [2], Laird [16], Stanek [25], Follmann [9], Singer and Andrade [24], Yang and Tsiatis [28], Bonate [1], among others. By adopting the framework of potential outcomes and treating the unobservable counterfactual outcomes as missing data, Leon et al. [17] constructed semiparametric estimators of the treatment effect based on the missing data theory of Robins et al. [21]. Chen et al. [5,6] considered an imputation-based approach under the same framework.
The posttest outcomes are often subject to missingness. Under the assumption of missing at random as defined by Rubin [23], Davidian et al. [8] studied semiparametric estimation of the treatment effect and constructed a class of augmented inverse probability weighted estimators (Robins et al. 21) by deriving the influence functions for all regular and asymptotically linear estimators under the setting they considered. The augmented inverse probability weighted estimators require working models for both the missingness probability and the outcome regression. Huang et al. [15] proposed an empirical likelihood method for randomized pretest-posttest studies with missing data.
The randomization step in pretest-posttest research designs is often not feasible in practice. In many social studies on the effectiveness of an intervention, it is mandatory that participants are allowed to choose to take the intervention or not. This is also the case for many medical studies involving two alternative treatments. For such self-selection-based treatment assignments, inference procedures developed for randomized designs are no longer suitable.
This paper presents an empirical likelihood approach to non-randomized pretest-posttest studies when the posttest outcomes are also subject to missingness. We develop a unified framework for both testing and estimation of the treatment effect. The inferential procedures are multiply robust in the sense that multiple working models are allowed for the propensity score of treatment assignment, the missingness probability and the outcome regression, and the validity of testing and estimation requires only a certain combination of those multiple working models to be correctly specified. Multiply robust inference was first introduced by Han and Wang [14] for missing data problems. The methods have attracted considerable research interest due to the added layers of protection against possible misspecification of working models. See, for instance, Chan and Yam [3], Han [11,10,12,13], Chen and Haziza [7], among others. Multiply robust methods usually lead to smaller biases under complete misspecification of all working models (Han 11, 12). Our general framework follows the two-sample empirical likelihood formulation with estimating equations (Qin and Lawless 19, Owen 18, Qin and Zhang 20, Huang et al. 15, Wu and Yan 27). However, the use of multiple working models for the propensity score of treatment assignment, the missingness probability and the outcome regression makes the theoretical development for our proposed approach considerably more challenging.
In addition to multiple robustness, the proposed empirical likelihood ratio test leads to data-adaptive confidence intervals for the treatment effect that give better finite-sample coverage probabilities compared to those based on the Wald statistic, and the proposed estimator of treatment effect always falls into the corresponding parameter space by construction.

Notation and setup
The setup for this paper generalizes that of Davidian et al. [8] and Huang et al. [15] from randomized trials to non-randomized trials. Let T denote the treatment indicator, with T = 1 if the subject chooses treatment and T = 0 if the subject selects control. Let Y_1 and Y_0 denote, respectively, the posttest potential outcomes that would have been observed had a subject chosen treatment (T = 1) or control (T = 0). Let Y = T Y_1 + (1 − T) Y_0 be the actually observed posttest outcome. Let Z denote a vector of variables, including the pretest outcome as well as auxiliary variables, collected at the baseline before the treatment or intervention. After the treatment assignment but prior to the end of the study, some additional variables X_t for T = t, t = 0, 1, are also measured during the follow-up, including possible intermediate outcome measures. To accommodate possible missingness in Y, let R_t denote the indicator of observing Y for subjects in group T = t, t = 0, 1, with R_t = 1 if Y is observed and R_t = 0 otherwise. For a random sample of size n at the baseline, let n_1 = Σ_{i=1}^{n} T_i and n_0 = n − n_1 be the numbers of subjects in the treatment group and in the control group, respectively. In addition, let n_11 = Σ_{i: T_i = 1} R_{1i} and n_01 = Σ_{i: T_i = 0} R_{0i} be the numbers of subjects in the treatment group and in the control group with Y observed, respectively. The data structure and the notation for all associated variables are shown in Table 1, which is similar to Table 1 of Huang et al. [15], after suitable re-ordering of subjects within each group. The parameter of interest is the treatment effect defined as δ = E(Y_1) − E(Y_0). We make the following standard assumptions on the treatment assignment and the missing data mechanism.
Assumption 1 (No unmeasured confounders). The treatment assignment is independent of the potential outcomes given the baseline variables, i.e., T ⊥ (Y_1, Y_0) | Z.

Assumption 2 (Missing at random). Missingness of the posttest outcome is independent of the outcome itself given all the baseline and intermediate measurements and the treatment assignment, i.e., R_t ⊥ Y_t | (Z, T = t, X_t), t = 0, 1.
Assumption 3 (Positivity). The propensity score for treatment assignment and the missingness probability satisfy (i) 0 < P(T = 1 | Z) < 1 and (ii) P(R_t = 1 | Z, T = t, X_t) > 0, t = 0, 1.

Let π(Z) = P(T = 1 | Z) be the propensity score of treatment assignment and ε_t(Z, X_t) = P(R_t = 1 | Z, T = t, X_t) be the non-missingness probability for a subject in the group with T = t, t = 0, 1. The combination of Assumption 1 and Assumption 3(i) is referred to as strongly ignorable treatment assignment by Rosenbaum and Rubin [22]. Assumption 3(ii) implies that there is a positive probability of observing the complete data in each of the treatment and control groups.

Known propensity scores
We propose empirical likelihood ratio tests for H_0 : δ = δ* against H_1 : δ ≠ δ*, where δ* is a pre-specified value. Testing for the existence of a treatment effect corresponds to the special case δ* = 0. For ease of presentation and to facilitate the asymptotic development, we first consider, in Section 3.1, the less practical scenario where the propensity score π(Z) is assumed to be known. The scenario with an unknown propensity score is considered in Section 3.2.
(i) Single working model for the missingness probability. Suppose that the missingness probability ε_t(Z, X_t) is correctly modeled by ε_t(Z, X_t; α_t), so that ε_t(Z, X_t; α_t*) = ε_t(Z, X_t) for some α_t*, t = 0, 1. Let α̂_t be the estimator of α_t* obtained by maximizing the binomial likelihood (3.1), where for notational simplicity we write ε_ti(α_t) for ε_t(Z_i, X_ti; α_t). It is easy to verify that the resulting inverse probability weighted estimator is consistent for δ. This motivates the use of constraint (3.4) in the discussion below for our proposed empirical likelihood method.
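As an illustration of the estimator this constraint is built from, the following is a minimal sketch of an inverse probability weighted estimate of δ; the function name and inputs are ours, with propensity scores and missingness probabilities supplied as arrays (in practice the latter would come from the fitted working model ε_t(α̂_t)).

```python
import numpy as np

def ipw_treatment_effect(T, R, Y, pi, eps1, eps0):
    """Inverse probability weighted estimate of delta = E(Y1) - E(Y0).

    T: 0/1 treatment indicators; R: 0/1 response indicators (R = 1 when
    the posttest outcome is observed); Y: posttest outcomes (entries
    with R == 0 are ignored); pi: propensity scores P(T = 1 | Z);
    eps1, eps0: (fitted) missingness probabilities for the treatment
    and control groups."""
    n = len(T)
    Yc = np.where(R == 1, Y, 0.0)  # zero out unobserved outcomes
    mu1 = np.sum(T * R * Yc / (pi * eps1)) / n
    mu0 = np.sum((1 - T) * R * Yc / ((1 - pi) * eps0)) / n
    return mu1 - mu0
```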
Let {w_i : T_i = 1, R_1i = 1} be a discrete probability measure over the set of subjects {i : T_i = 1, R_1i = 1} with observed posttest outcomes, and let {v_i : T_i = 0, R_0i = 1} be a discrete probability measure over the set of subjects {i : T_i = 0, R_0i = 1}. The empirical likelihood function L(w, v) for the combined sample is defined as in (3.2). Maximizing L(w, v) with respect to w_i and v_i under the normalization constraints alone leads to ŵ_i = 1/n_11 and v̂_i = 1/n_01. The constraint induced by the parameter of interest, the treatment effect δ, is constructed as (3.4) for the given δ*.
The formulation described above follows the general framework of empirical likelihood with estimating equations (Qin and Lawless 19). It provides a powerful platform for incorporating additional constraints induced by working models for the treatment assignment, the missingness probability and the outcome regression to construct multiply robust test and estimation procedures.
Let w̃_i and ṽ_i be the maximizers of L(w, v) under both the normalization constraint (3.3) and the parameter constraint (3.4). The empirical likelihood ratio statistic for testing H_0 : δ = δ* is computed as in (3.5), where the dependence of W(δ*) on δ* is through the dependence of w̃_i and ṽ_i on δ*. Let χ²_1 denote the chi-square distribution with one degree of freedom. The following result gives the asymptotic distribution of W(δ*) under H_0; the proof and the exact expression for the positive scaling factor σ_1 are provided in the Appendix.

Theorem 3.1. Under Assumptions 1-3 with known π(Z) and correctly specified ε_t(α_t) for ε_t(Z, X_t), t = 0, 1, the empirical likelihood ratio statistic W(δ*) has asymptotic distribution σ_1 χ²_1 under H_0 : δ = δ*.

A (1 − α)-level empirical likelihood ratio confidence interval for δ can be constructed as {δ : W(δ) ≤ σ̂_1 χ²_{1,1−α}}, where σ̂_1 is an estimate of σ_1 and χ²_{1,1−α} is the 100(1 − α)th percentile of the chi-square distribution with one degree of freedom. An advantage of the empirical likelihood ratio confidence interval over the Wald confidence interval is that it is data-driven and range-respecting, and it has better coverage probability for δ at moderate sample sizes, as shown in our simulation studies.
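A confidence interval of the form {δ : W(δ) ≤ cutoff} can be obtained by numerically inverting the test. A minimal sketch, assuming W is continuous and increases as δ moves away from its minimizer δ̂ (function and argument names are ours):

```python
def el_confidence_interval(W, delta_hat, cutoff, step=0.1, iters=100):
    """Invert an EL ratio test: return the endpoints of
    {delta : W(delta) <= cutoff}, where cutoff is the estimated scale
    factor times the chi-square critical value.  Each endpoint is found
    by doubling a bracket outward from delta_hat and then bisecting on
    W(delta) = cutoff."""
    def endpoint(sign):
        width = step
        while W(delta_hat + sign * width) < cutoff:  # bracket the root
            width *= 2.0
        lo, hi = 0.0, width
        for _ in range(iters):                       # bisection on the bracket
            mid = 0.5 * (lo + hi)
            if W(delta_hat + sign * mid) < cutoff:
                lo = mid
            else:
                hi = mid
        return delta_hat + sign * 0.5 * (lo + hi)
    return endpoint(-1.0), endpoint(1.0)
```

For example, with the toy function W(δ) = (δ − 1)², δ̂ = 1 and cutoff 4, the routine recovers the endpoints −1 and 3.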
The major computational task in calculating W(δ*) is to maximize (3.2) subject to (3.3) and (3.4) with the given δ*. To derive w̃_i and ṽ_i, we follow Wu and Yan [27] and reformulate the constrained maximization problem through the standard Lagrange multiplier method: the maximizers have a closed form in which a single Lagrange multiplier ρ̃ solves the corresponding estimating equation. The solution ρ̃ can be computed using the modified Newton-Raphson algorithm in Chen et al. [4] or Han [11].
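For the simplest case of a single scalar constraint, the computation can be sketched as follows: given constraint values g_i for the subjects with observed outcomes, the tilted weights take the form 1/[m(1 + ρ g_i)] with ρ solving Σ_i g_i/(1 + ρ g_i) = 0. This is a simplified stand-in for, not a reproduction of, the modified Newton-Raphson algorithms cited above.

```python
import numpy as np

def el_weights(g, max_iter=100, tol=1e-10):
    """Empirical likelihood weights w_i = 1 / (m * (1 + rho * g_i))
    satisfying sum_i w_i = 1 and sum_i w_i g_i = 0, found by Newton's
    method on f(rho) = sum_i g_i / (1 + rho * g_i) = 0, with step
    halving to keep all denominators positive."""
    m = len(g)
    rho = 0.0
    for _ in range(max_iter):
        d = 1.0 + rho * g
        f = np.sum(g / d)
        if abs(f) < tol:
            break
        fp = -np.sum(g ** 2 / d ** 2)   # derivative of f in rho
        step = f / fp
        while np.any(1.0 + (rho - step) * g <= 0.0):
            step /= 2.0                  # step halving for feasibility
        rho -= step
    return 1.0 / (m * (1.0 + rho * g))
```

At the solution the weights sum to one automatically, since Σ_i w_i = 1 − ρ f(ρ)/m and f(ρ) = 0.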
(ii) Multiple working models for the missingness probability. We now consider the case where there are multiple working models for the missingness probability and propose an empirical likelihood ratio test for H 0 that is valid if one of the working models in each of the treatment and control groups is correctly specified.
Let P_t = {ε_t^(j)(α_t^(j)), j = 1, . . . , J_t} be a set of working models for ε_t(Z, X_t), with α̂_t^(j) the corresponding maximum likelihood estimators, t = 0, 1. It can be verified, by following arguments similar to those in Han and Wang [14], that for any functions h_1(Z, X_1) and h_0(Z, X_0), assuming all relevant expectations exist, the moment equalities in (3.6) hold. It follows that multiple working models in P_1 and P_0 can be accommodated simultaneously by suitable choices of h_1(Z, X_1) and h_0(Z, X_0), which yield the empirical versions of (3.6) as constraints for the empirical likelihood inference; the corresponding sample averages are consistent estimators of their population counterparts. Under the current setting, the empirical likelihood ratio statistic for the treatment effect δ is computed as W(δ) specified in (3.5), with ŵ_i and v̂_i maximizing L(w, v) subject to the normalization constraint (3.3), the model constraints (3.7) and (3.8), and the parameter constraint (3.9). Note that the parameter constraint (3.9) has a different form from (3.4) because of the inclusion of the model constraints (3.7) and (3.8). The asymptotic distribution of W(δ*) under H_0 : δ = δ* is presented in the following theorem; the proof and the exact expression for the positive scaling factor σ_2 are provided in the Appendix.

Theorem 3.2. Under Assumptions 1-3 with known π(Z), and assuming that both P_1 and P_0 contain a correctly specified model, the empirical likelihood ratio statistic W(δ*) has asymptotic distribution σ_2 χ²_1 under H_0 : δ = δ*.

The empirical likelihood ratio confidence interval for the treatment effect can be constructed as before. Theorem 3.2 states that the proposed test is multiply robust in the sense that it is valid if one of the working models in each of P_1 and P_0 is correctly specified. It can be shown that ŵ_i and v̂_i admit closed-form expressions analogous to those in Section 3.1.

S. Zhang et al.
It follows from Han and Wang [14] that, when both P_1 and P_0 contain a correctly specified model, the resulting weighted estimator is consistent for δ, which justifies the use of the constraint (3.9) in defining W(δ). Following the same arguments as in Wu and Yan [27], the weights w̃_i and ṽ_i have closed-form expressions involving three Lagrange multipliers ρ̃_1, ρ̃_0 and ρ̃, which can all be calculated using a Newton-Raphson-type algorithm similar to the one described in Chen et al. [4] and Han [11]. For the special case J_1 = J_0 = 1, the scaling constant σ_2 in Theorem 3.2 does not reduce to σ_1 in Theorem 3.1, due to the different formulations of the constraints. In this special case, the formulation in Section 3.1 involves computing only w̃_i and ṽ_i, whereas the formulation in Section 3.2 involves computing ŵ_i, v̂_i, w̃_i and ṽ_i. The former formulation can deal with only one model for the missingness probability, while the latter can deal with multiple models.

Unknown propensity scores
(i) Single working model for the propensity score. We now consider the more practical scenario where the propensity score of treatment assignment π(Z) is unknown. Let π(Z; γ) denote a parametric model for π(Z) with parameter γ, and write π_i(γ) = π(Z_i; γ). Let γ̂ be the estimator of γ obtained by maximizing the likelihood function (3.10). With a single correctly specified model π(Z; γ), the empirical likelihood ratio statistic W(δ) can be calculated in the same way as in Section 3.1 but with π(Z_i) replaced by π_i(γ̂). Depending on whether there is a single working model or there are multiple working models for the missingness probability, we have the following two theorems, parallel to Theorems 3.1 and 3.2. The exact expressions for σ_3 and σ_4 are given in the Appendix; detailed proofs are omitted due to their similarity to the proofs of Theorems 3.1 and 3.2.
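For a concrete instance, a logistic working model π(Z; γ) = expit(γ_0 + γ_1 Z) can be fitted by maximizing the binomial likelihood by Newton-Raphson. A self-contained sketch under that assumed model form (the function names are ours, not the paper's):

```python
import numpy as np

def fit_logistic_propensity(Z, T, n_iter=25):
    """Maximum likelihood fit of a logistic working model
    pi(Z; gamma) = expit(gamma0 + gamma1 * Z) for P(T = 1 | Z),
    maximizing the binomial likelihood by Newton-Raphson.
    Z: 1-d covariate array; T: 0/1 treatment indicators."""
    X = np.column_stack([np.ones_like(Z), Z])     # design matrix with intercept
    gamma = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ gamma))
        grad = X.T @ (T - p)                      # score vector
        hess = -(X * (p * (1 - p))[:, None]).T @ X  # Hessian of the log-likelihood
        gamma -= np.linalg.solve(hess, grad)      # Newton update
    return gamma

def propensity(Z, gamma):
    """Fitted propensity scores pi(Z_i; gamma-hat)."""
    X = np.column_stack([np.ones_like(Z), Z])
    return 1.0 / (1.0 + np.exp(-X @ gamma))
```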

Theorem 3.3. Under Assumptions 1-3, and assuming that π(γ) and ε_t(α_t) are correctly specified models for π(Z) and ε_t(Z, X_t), respectively, t = 0, 1, the empirical likelihood ratio statistic W(δ*) has asymptotic distribution σ_3 χ²_1 under H_0 : δ = δ*.

Theorem 3.4. Under Assumptions 1-3, and assuming that π(γ) is a correctly specified model for π(Z) and that both P_1 and P_0 contain a correctly specified model, the empirical likelihood ratio statistic W(δ*) has asymptotic distribution σ_4 χ²_1 under H_0 : δ = δ*.

(ii) Multiple working models for the propensity score. When multiple working models are considered for the unknown propensity score π(Z), the construction of an empirical likelihood ratio test becomes significantly more challenging, since using the estimated propensity scores from one particular working model will not lead to valid results. The general concept of multiple robustness assumes that one of the multiple working models is correctly specified but does not require identifying which model is correct.
We propose a two-step strategy to construct the empirical likelihood ratio test for the treatment effect. The first step accommodates the multiple working models for π(Z) and produces weights whose scaled inverses consistently estimate π(Z) when one of the working models is correctly specified. The second step incorporates multiple working models for the missingness probability, similar to the procedures presented in Section 3.1, using the weights obtained from the first step. This two-step procedure ensures the multiple robustness of the proposed test.
Let Q = {π^(l)(γ^(l)), l = 1, . . . , L} denote a set of working models for π(Z). An estimator γ̂^(l) of γ^(l) can be obtained by maximizing (3.10) with π(γ) replaced by π^(l)(γ^(l)). For the first step, we consider empirical likelihood weights p_i for subjects in the treatment group {i : T_i = 1} and q_i for subjects in the control group {i : T_i = 0}. The estimated weights p̂_i and q̂_i are obtained by maximizing the empirical likelihood function L(p, q) = Π_{i: T_i = 1} p_i Π_{i: T_i = 0} q_i subject to the constraints (3.11), the last two sets of which are, respectively, the empirical versions of moment equalities involving the working propensity scores. The most important consequence of the proposed first step is that, if Q contains a correctly specified model for π(Z), then (n p̂_i)^{-1} and (n q̂_i)^{-1} are consistent for π(Z_i) and 1 − π(Z_i), respectively, as shown in Han and Wang [14]. This leads to the second step, which obtains the final weights ŵ_i and v̂_i similarly to the procedures described in Section 3.1 but with π(Z_i) for {i : T_i = 1} and 1 − π(Z_i) for {i : T_i = 0} replaced by (n p̂_i)^{-1} and (n q̂_i)^{-1}, respectively. More specifically, we replace the constraints (3.7) and (3.8) based on the missingness probability working models by (3.12) and (3.13). It is possible to derive the asymptotic distribution of W(δ*) under H_0 : δ = δ* when each of Q, P_1 and P_0 contains a correctly specified model for π(Z), ε_1(Z, X_1) and ε_0(Z, X_0), respectively. However, such a derivation is extremely tedious, and the resulting scaling factor, similar to those appearing in Theorems 3.1-3.4, has a very complex form. We propose instead a bootstrap method, presented in Section 3.3 below, to bypass the derivation and estimation of the scaling factor. Simulation results show that the bootstrap method performs well in finite samples.
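The first-step weights can be computed with the same Lagrange-multiplier device as in Section 3.1, now with one calibration constraint per working model. In the sketch below (our notation), row i of G holds π̂^(l)(Z_i) minus its full-sample average, for l = 1, . . . , L, so that Σ_i p_i = 1 and Σ_i p_i g_i = 0 mirror the structure of the constraints in (3.11):

```python
import numpy as np

def calibration_weights(G, max_iter=100, tol=1e-10):
    """EL weights p_i = 1 / (m * (1 + rho . g_i)) over one group,
    satisfying sum_i p_i = 1 and sum_i p_i g_i = 0, where g_i is the
    i-th row of G.  rho solves sum_i g_i / (1 + rho . g_i) = 0 by a
    multivariate Newton iteration with step halving to keep all
    denominators positive."""
    m, K = G.shape
    rho = np.zeros(K)
    for _ in range(max_iter):
        d = 1.0 + G @ rho
        f = (G / d[:, None]).sum(axis=0)       # K-vector of constraint residuals
        if np.max(np.abs(f)) < tol:
            break
        J = -(G / d[:, None] ** 2).T @ G       # Jacobian of f with respect to rho
        step = np.linalg.solve(J, f)
        while np.any(1.0 + G @ (rho - step) <= 0.0):
            step /= 2.0                         # step halving for feasibility
        rho -= step
    return 1.0 / (m * (1.0 + G @ rho))
```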

A bootstrap calibration
We present a bootstrap calibration for the proposed empirical likelihood ratio test, which bypasses the derivation and estimation of the scaling factors. Let δ̂ denote our proposed estimator of the treatment effect δ, given in (4.1) and studied in Section 4. The bootstrap procedure is as follows.
Step 1. Take a random sample of size n with replacement from the original data set.
Step 2. Compute the empirical likelihood ratio statistic W(δ) at δ = δ̂ based on the data set from Step 1 to obtain a bootstrap copy of W(δ̂).
Step 3. Repeat Steps 1 and 2 B times to obtain the bootstrap copies {W^(1)(δ̂), . . . , W^(B)(δ̂)}.
Step 4. Take the 100(1 − α)th percentile of the bootstrap copies as the critical value.
Step 5. The empirical likelihood ratio test rejects H_0 : δ = δ* at significance level α if W(δ*) exceeds the critical value.
As one referee pointed out, an alternative is to replace Steps 4 and 5 above by two steps that first compute, from the bootstrap copies, an estimate σ̂ of the scale parameter appearing in all the theorems, and then compare W(δ*) with σ̂ times the corresponding χ²_1 critical value. Our simulation studies in Section 5 indicate that these two bootstrap procedures have very similar performance.
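Steps 1-4 can be sketched as follows; `elr_stat` is a placeholder (our name) for a routine that computes W(δ) on a given data set, stored here as a dict of equal-length arrays:

```python
import numpy as np

def bootstrap_critical_value(data, delta_hat, elr_stat, B=1000, alpha=0.05, seed=0):
    """Bootstrap calibration of the EL ratio test.  `elr_stat(data, delta)`
    is assumed to compute W(delta) on a data set; the (1 - alpha) quantile
    of the bootstrap copies W^(b)(delta_hat) serves as the critical value,
    so H0: delta = delta* is rejected when W(delta*) exceeds it."""
    rng = np.random.default_rng(seed)
    n = len(next(iter(data.values())))
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)           # resample subjects with replacement
        boot = {k: v[idx] for k, v in data.items()}
        stats[b] = elr_stat(boot, delta_hat)       # bootstrap copy of W(delta_hat)
    return np.quantile(stats, 1 - alpha)
```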

Estimation of the treatment effect
Our proposed maximum empirical likelihood estimator of the treatment effect is δ̂ = Σ_{i: T_i = 1, R_1i = 1} ŵ_i Y_1i − Σ_{i: T_i = 0, R_0i = 1} v̂_i Y_0i, given in (4.1), which is consistent when one of the multiple working models for each of π(Z), ε_1(Z, X_1) and ε_0(Z, X_0) is correctly specified. In the following we discuss how to further improve the robustness of δ̂ by accommodating working models for the outcome regression; the outcome regression models can also lead to efficiency gains in estimation. We first consider the case where π(Z) is either known or modeled by a single working model π(γ).
Let A_t = {a_t^(k)(β_t^(k)) : k = 1, . . . , K_t} be a set of working models for the outcome regression E(Y_t | Z, X_t), t = 0, 1. We allow the working models to be postulated separately for the treatment and control groups, and the numbers of models K_1 and K_0 can differ. By Assumptions 1 and 2, each working model can be fitted using the complete cases, and the fitted values yield consistent estimators of the corresponding population means, which provide the additional constraints (4.2) for the empirical likelihood weights. When π(Z) is unknown but is modeled by π(γ), the π(Z) in all the constraints is replaced by π(γ̂). The multiple robustness of δ̂ is summarized in the theorem below, whose proof is given in the Appendix.

Theorem 4.1. Suppose that π(Z) is either known or correctly modeled by π(γ). If (i) P_1 contains a correctly specified model for ε_1(Z, X_1) or A_1 contains a correctly specified model for E(Y_1 | Z, X_1); and (ii) P_0 contains a correctly specified model for ε_0(Z, X_0) or A_0 contains a correctly specified model for E(Y_0 | Z, X_0), then δ̂ is a consistent estimator of δ.
When π(Z) is unknown and there are multiple working models Q = {π^(l)(γ^(l)), l = 1, . . . , L}, the two-step strategy described in Section 3.2 can be used to construct the estimator of δ. Let p̂_i and q̂_i be obtained through (3.11) based on the models in Q. Let ŵ_i and v̂_i be obtained by maximizing L(w, v) given in (3.2) subject to (3.3), (3.12), (3.13) and the additional constraints (4.2), but with the model averages in (4.2) redefined using the first-step weights p̂_i and q̂_i. Theorem 4.2 summarizes the property of δ̂; the proof is given in the Appendix.

Theorem 4.2. If (i) Q contains a correctly specified model for π(Z); (ii) P_1 contains a correctly specified model for ε_1(Z, X_1) or A_1 contains a correctly specified model for E(Y_1 | Z, X_1); and (iii) P_0 contains a correctly specified model for ε_0(Z, X_0) or A_0 contains a correctly specified model for E(Y_0 | Z, X_0), then δ̂ is a consistent estimator of δ.
One might think that modeling E(Y t | Z), in addition to π(Z), t (Z, X t ) and E(Y t | Z, X t ), could further improve the robustness of estimation. This is not the case in general. As pointed out in Davidian et al. [8], because E(Y t | Z) = E(Y t | Z, T = t, R t = 1), fitting a model for E(Y t | Z) cannot be based on a complete-case analysis but rather requires modeling other quantities, which complicates achieving multiple robustness since the parameters in a model for E(Y t | Z) may not be consistently estimated even if this model is correctly specified. For this reason, we do not consider modeling E(Y t | Z).
A standard error for δ̂ is often needed for inference on δ. A practically useful approach to computing the standard error is the bootstrap method. It can be implemented easily by taking repeated bootstrap samples, as in Step 1 of the method described in Section 3.3, to obtain bootstrap copies of δ̂, from which the bootstrap standard error follows. The reliability of the bootstrap method for standard error calculation for multiply robust estimators has been well documented in the missing data literature (Han 11, 12). We examine its numerical performance under the current setting in Section 5.
In addition to the multiple robustness property, the proposed maximum empirical likelihood estimator δ̂ has other advantages over existing estimators. The weights ŵ_i and v̂_i are positive and sum to one, so both Σ_{i: T_i = 1, R_1i = 1} ŵ_i Y_1i and Σ_{i: T_i = 0, R_0i = 1} v̂_i Y_0i are convex combinations of the observed responses. The estimator δ̂ therefore always falls within the parameter space for the treatment effect. Issues with the conventional inverse probability weighted and augmented inverse probability weighted estimators, which arise when some estimated values of π(Z) and/or ε_t(Z, X_t) are close to zero, are also significantly mitigated under our proposed empirical likelihood approach. Han [11] contains more detailed discussion and simulation results on this particular aspect of missing data analysis.
Note that the empirical likelihood ratio test described in Section 3 can also accommodate outcome regression models in the same way as for estimation, and the test based on the bootstrap calibration as in Section 3.3 remains multiply robust.

Simulation study
We conducted simulation studies to evaluate the finite-sample performance of the proposed empirical likelihood ratio test and the maximum empirical likelihood estimator of the treatment effect. The simulation samples were generated as follows. At the baseline, a pretest measurement was generated as Z ~ Uniform(−2.5, 2.5), and the propensity score of treatment assignment was set to π(Z) = {1 + exp(1 − 0.8Z²)}^{-1}, which puts approximately 60% and 40% of the subjects in the treatment (T = 1) and control (T = 0) groups, respectively. For the scenario of no treatment effect (δ = 0), the intermediate covariate was generated as X_t ~ N(1 + Z, 1) and the posttest potential outcome as Y_t | X_t ~ N{a_t(X_t), 2X_t² + 2}, where a_t(X_t) = 1 + 4X_t², t = 0, 1. For the scenarios with non-zero δ, the intermediate covariate was generated as X_t ~ N(1 + t + Z, 1), t = 0, 1, and the posttest potential outcome as Y_t | X_t ~ N{a_t(X_t), 2X_t² + 2}, where a_1(X_1) = 1 + 4X_1², a_0(X_0) = β_00* + 2X_0² and β_00* = 1, 5, 9 and 13, which leads to δ = 20.2, 16.2, 12.2 and 8.2, respectively. For all scenarios, the missingness probabilities ε_1(Z, X_1) and ε_0(Z, X_0) were specified to yield a missingness rate of 29% for the control group in all scenarios, and of 49% for the treatment group when δ = 0 and 37% when δ ≠ 0. We considered two parametric working models for each of π(Z), ε_t(Z, X_t) and E(Y_t | Z, X_t); the first working model in each pair is correctly specified. Here we demonstrate the steps for computing the multiply robust estimator and the empirical likelihood ratio test that use all of the above working models; estimators and tests using other combinations of these working models can be computed similarly.
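The data-generating process for the δ = 0 scenario can be sketched as below. Since the distributions of X_t and Y_t do not depend on t in this scenario, we generate the realized X and Y directly; the missingness mechanism is omitted, so the sketch produces complete data only.

```python
import numpy as np

def generate_sample(n, seed=0):
    """One simulated data set for the delta = 0 scenario:
    Z ~ Uniform(-2.5, 2.5), pi(Z) = 1 / (1 + exp(1 - 0.8 Z^2)),
    X ~ N(1 + Z, 1), and Y | X ~ N(1 + 4 X^2, 2 X^2 + 2)."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-2.5, 2.5, n)
    pi = 1.0 / (1.0 + np.exp(1.0 - 0.8 * Z ** 2))    # true propensity score
    T = rng.binomial(1, pi)                          # treatment assignment
    X = rng.normal(1.0 + Z, 1.0)                     # intermediate covariate
    Y = rng.normal(1.0 + 4.0 * X ** 2,               # posttest outcome
                   np.sqrt(2.0 * X ** 2 + 2.0))
    return Z, T, X, Y
```

With this propensity function, roughly 60% of subjects end up in the treatment group, matching the rate stated above.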

Step 6. The multiply robust estimator is calculated as in (4.1).
Step 7. Calculate the weights w̃_i and ṽ_i under the additional parameter constraint for the given δ*.
Step 8. The empirical likelihood ratio statistic is calculated as in (3.5).
In the case of only one working model π(Z; γ) for the propensity score, one should skip Step 4 and replace (n p̂_i)^{-1} and (n q̂_i)^{-1} in Step 5 with π(Z_i; γ̂) and 1 − π(Z_i; γ̂), respectively.
Results reported in Tables 2 and 3 are based on 1000 repeated simulation samples, and for each sample 1000 bootstrap samples were used to calculate the critical value of the empirical likelihood ratio test or the standard error of the point estimator. Table 2 contains results on the performance of the empirical likelihood ratio test and the Wald test under four different combinations of models for π(Z) and ε_t(Z, X_t); the outcome regression models for E(Y_t | Z, X_t) were not used for these tests. It can be seen that the empirical likelihood ratio test has type I error close to 5% and the empirical likelihood ratio confidence interval has coverage probability close to 95%, and both perform better than the methods based on the Wald statistic. The Wald test has higher power than the empirical likelihood ratio test. Another observation is that the two bootstrap procedures described in Section 3.3 have similar performance.

Table 2. Simulation results (in %) of tests on H_0 : δ = 0 and confidence intervals. The type I error and power are summarized at significance level 5%, and the nominal level for the confidence intervals is 95%. Models in brackets are the ones used.

Table 3 summarizes the simulation results on point estimation of the treatment effect. We focused on the scenario with δ = 20.2 and sample size n = 400. We considered four different scenarios for the propensity score: (I) π(Z) is known; (II) π(Z) is correctly modeled by π^(1)(γ^(1)); (III) π(Z) is incorrectly modeled by π^(2)(γ^(2)); and (IV) both models π^(1)(γ^(1)) and π^(2)(γ^(2)) are used. Our proposed multiply robust estimator, the inverse probability weighted estimator and the augmented inverse probability weighted estimator were included in the simulation. The working models used for the missingness probability and the outcome regression are indicated as part of the notation in the first column; for instance, the multiply robust estimator MR-ε(1,2)a(1,2) was computed with both models for the missingness probability and both models for the outcome regression. Performance of the estimators is evaluated through relative bias and root mean squared error. To evaluate the performance of the bootstrap method, we also included, under scenario (IV), the empirical standard error and the bootstrap-based standard error. From Table 3, the proposed multiply robust estimator has negligible relative bias under the combinations of working models outlined in Theorems 4.1 and 4.2; the inverse probability weighted estimator and the augmented inverse probability weighted estimator cannot accommodate multiple working models. Another interesting observation is that, compared with these two estimators, the multiply robust estimator has smaller bias when the combination of working models does not satisfy the specification outlined in Theorems 4.1 and 4.2. The small bias of multiply robust estimators when they are not theoretically consistent has been documented previously in the missing data literature (Han 11, 12, Chen and Haziza 7) and is due to the calibration constraints used to derive the weights ŵ_i and v̂_i; Han [12] contains more discussion of this intriguing observation. We also observe from Table 3 that the bootstrap standard error is very close to the empirical standard error, indicating that the bootstrap method is reliable.

Table 3. Simulation results (×10²) for point estimation (δ = 20.2 and n = 400).

Additional remarks
Pretest-posttest studies are commonly used by social science, medical and health researchers, for whom non-randomized treatment assignment and missing data are two important issues. Valid statistical analyses depend on proper handling of the models for the propensity score of treatment assignment, the missingness probability and the outcome regression. Our proposed empirical likelihood approach to testing and estimation of the treatment effect provides a unified inference tool for incorporating models from different sources. The theoretical and simulation results presented in this paper show that the proposed approach is promising. It is worth pointing out that the proposed method can still be applied when there is only a single model for each quantity, and thus it provides a powerful alternative to existing methods. In our experience from simulation studies, the proposed method is very robust to high collinearity among the multiple working models for the same quantity. In practice, when predictions from different working models are so close that collinearity may be an issue, an ad hoc solution is to remove some of those models, because they provide mostly redundant information. Developing a formal way to address the potential collinearity problem is of future research interest.
This paper focuses on the robustness of estimation. In the missing data literature, estimation efficiency is another important issue (Robins et al. 21, Tsiatis 26). However, a theoretical investigation of efficiency in our setting with multiple working models is both challenging and of limited practical implication. The typical framework for semiparametric efficiency theory in the missing data literature (e.g., Tsiatis 26) assumes that both π(Z) and ε_t(Z, X_t) are correctly modeled and that the correct models have been identified. For non-randomized pretest-posttest studies such an assumption is often unrealistic, which was the major motivation for our consideration of multiple working models. The simulation results presented in this paper, as well as findings from the existing missing data literature, such as Han [11], Han [12] and Chen and Haziza [7], among others, show that multiply robust estimators typically have similar or higher efficiency compared with other estimators when the same working models are used.

Expressions for Theorem 3.1
The scaling constant σ_1 for the asymptotic distribution in Theorem 3.1 involves S_t, t = 0, 1, the score function of the binomial likelihood (3.1).

Expressions for Theorem 3.2
Without loss of generality, suppose

Expressions for Theorem 3.3
The scaling constant σ_3 for the asymptotic distribution in Theorem 3.3 involves S_γ, the score function of the likelihood (3.10).

Expressions for Theorem 3.4
Without loss of generality, suppose ε_1^(1)(α_1^(1)) and ε_0^(1)(α_0^(1)) are correctly specified models for ε_1(Z, X_1) and ε_0(Z, X_0), respectively. Then the scaling constant σ_4 for the asymptotic distribution in Theorem 3.4 can be expressed accordingly.

Now suppose A_1 contains a correctly specified model a_1^(1)(β_1^(1)) for E(Y_1 | Z, X_1). An argument similar to that in the proof of Theorem 4.1 gives the consistency of μ̂_1MR by noticing that Σ_{i: T_i = 1} p̂_i a_1i^(1)(β̂_1^(1)) is consistent for E(Y_1).