BY-NC-ND 3.0 license Open Access Published by De Gruyter November 15, 2016

Entropy Balancing is Doubly Robust

  • Qingyuan Zhao and Daniel Percival

Abstract

Covariate balance is a conventional key diagnostic for methods estimating causal effects from observational studies. Recently, there has been emerging interest in directly incorporating covariate balance in the estimation. We study a recently proposed entropy maximization method called Entropy Balancing (EB), which exactly matches the covariate moments for the different experimental groups in its optimization problem. We show that EB is doubly robust with respect to linear outcome regression and logistic propensity score regression, and that it reaches the asymptotic semiparametric variance bound when both regressions are correctly specified. This is surprising to us because there is no attempt to model the outcome or the treatment assignment in the original proposal of EB. Our theoretical results and simulations suggest that EB is a very appealing alternative to the conventional weighting estimators that estimate the propensity score by maximum likelihood.

1 Introduction

Consider a typical setting of an observational study in which two conditions (“treatment” and “control”) are not randomly assigned to the units. Deriving a causal conclusion from such observational data is intrinsically difficult because the treatment exposure may be related to covariates that are also related to the outcome. In this case, those covariates may be imbalanced between the treatment groups and the naive mean causal effect estimator can be severely biased.

To adjust for the covariate imbalance, the seminal work of Rosenbaum and Rubin [1] points out the essential role of the propensity score, the probability of exposure to treatment conditional on observed covariates. This quantity, rarely known in an observational study, may be estimated from the data. Based on the estimated propensity score, many statistical methods have been proposed to estimate the mean causal effect. The most popular approaches are matching [2, 3], stratification [4], and weighting [5, 6]. Theoretically, propensity score weighting is the most attractive of these methods: Hirano et al. [7] show that nonparametric propensity score weighting can achieve the semiparametric efficiency bound for the estimation of the mean causal effect derived by Hahn [8]. Another desirable property is double robustness. The pioneering work of Robins et al. [5] augments propensity score weighting with an outcome regression model. The resulting estimator has the so-called double robustness property:

Property 1. If either the propensity score model or the outcome regression model is correctly specified, the mean causal effect estimator is statistically consistent.

In practice, the success of any propensity score method hinges on the quality of the estimated propensity score. The weighting methods are usually more sensitive to model misspecification than matching and stratification; moreover, this bias can even be amplified by a doubly robust estimator, as brought to attention by Kang and Schafer [9]. To avoid model misspecification, applied researchers usually increase the complexity of the propensity score model until a sufficiently balanced solution is found. This cyclical process of modeling the propensity score and checking covariate balance is criticized as the “propensity score tautology” by Imai et al. [10] and, moreover, has no guarantee of eventually finding a satisfactory solution.

Recently, there has been emerging interest, particularly among applied researchers, in directly incorporating covariate balance into the estimation procedure, so there is no need to check covariate balance repeatedly [11, 12, 13, 14]. In this paper, we study a method of this kind called Entropy Balancing (hereafter EB), proposed by Hainmueller [15]. In a nutshell, EB solves a convex entropy maximization problem under the constraint of exactly balancing the covariate moments. Due to its easy interpretation and fast computation, EB has already gained some popularity in applied fields [16, 17]. However, little is known about the theoretical properties of EB. The original proposal in Hainmueller [15] did not give a condition under which EB is guaranteed to give a consistent estimate of the mean causal effect.

In this paper, we shall show that EB is indeed a very appealing propensity score weighting method. We find that EB simultaneously fits a logistic regression model for the propensity score and a linear regression model for the outcome. The linear predictors of these regression models are the covariate moments being balanced. We shall prove that EB is doubly robust (Property 1), in the sense that if at least one of the two models is correctly specified, EB is consistent for the Population Average Treatment effect on the Treated (PATT), a common quantity of interest in causal inference and survey sampling. Moreover, EB is sample bounded [18], meaning the PATT estimator is always within the range of the observed outcomes, and it is semiparametrically efficient if both models are correctly specified. Lastly, the two linear models have an exact correspondence to the primal and dual optimization problems used to solve EB, revealing an interesting connection between doubly robust estimation and convex optimization.

Our discoveries can be summarized in the diagram in Figure 1. Conventionally, the recipe given by Robins and his coauthors is to fit separate models for the propensity score and the outcome regression and then combine them in a doubly robust estimator (see e. g. [5, 9, 19, 20]). In contrast, Entropy Balancing achieves this goal through enforcing covariate balance. The primal optimization problem of EB amounts to an empirical calibration estimator [21, 22], which is widely popular in survey sampling but perhaps not sufficiently recognized in causal inference [23]. The balancing constraints in this optimization problem result in unbiasedness of the PATT estimator under a linear outcome regression model. The dual optimization problem of EB fits a logistic propensity score model with a loss function different from the negative binomial likelihood. The Fisher consistency of this loss function (also called a proper scoring rule in statistical decision theory, see e. g. Gneiting and Raftery [24]) ensures the other half of double robustness – consistency under a correctly specified propensity score model. Since EB essentially just uses a different loss function, other types of propensity score models, for example generalized additive models [25], can also easily be fitted. A forthcoming article by Zhao [26] offers more discussion and extension to other weighted average treatment effects.

Figure 1: The role of covariate balance in doubly robust estimation. Dashed arrows: conventional procedure to achieve double robustness. Solid arrows: double robustness of Entropy Balancing via covariate balance.

2 Setting

First, we fix some notation for the causal inference problem considered in this paper. We follow the potential outcome language of Neyman [27] and Rubin [28]. In this causal model, each unit i is associated with a pair of potential outcomes: the response Yi(1) that is realized if Ti=1 (treated), and another response Yi(0) realized if Ti=0 (control). We assume the observational units are independent and identically distributed samples from a population, for which we wish to infer the treatment’s effect. The main obstacle is that only one potential outcome is observed: Yi = Ti·Yi(1) + (1−Ti)·Yi(0), which is commonly known as the “fundamental problem of causal inference” [29].

In this paper we focus on estimating the Population Average Treatment effect on the Treated (PATT):

(1) $$\gamma = E[Y(1) \mid T=1] - E[Y(0) \mid T=1] \overset{\Delta}{=} \mu(1|1) - \mu(0|1).$$

The counterfactual mean μ(0|1)=E[Y(0)|T=1] also naturally occurs in survey sampling with missing data [21, 22] by viewing Y(0) as the only outcome of interest (so T=1 stands for non-response).

Along with the treatment exposure Ti and outcome Yi, each unit i is usually associated with a set of covariates, denoted by Xi, measured prior to the treatment assignment. In a typical observational study, both the treatment assignment and the outcome may be related to the covariates, which can cause serious confounding bias. The seminal work by Rosenbaum and Rubin [1] suggests that it is possible to correct the confounding bias under the following two assumptions:

Assumption 1

(strong ignorability). $(Y(0), Y(1)) \perp T \mid X$.

Assumption 2

(overlap). 0<P(T=1|X)<1.

Intuitively, the first assumption says that the observed covariates contain all the information that may cause the selection bias, i. e. there is no unmeasured confounding variable, and the second assumption ensures that the bias-correction information is available across the entire domain of X.

Since the covariates X contain all the information of confounding bias, it is important to understand the relationship between T,Y and X. Under Assumption 1 (strong ignorability), the joint distribution of (X,Y,T) is determined by the marginal distribution of X and two conditional distributions given X. The first conditional distribution e(X)=P(T=1|X) is often called the propensity score and plays a central role in causal inference [1]. The second conditional distribution is the density of Y(0) and Y(1) given X. Since we only consider the mean causal effect in this paper, it suffices to study the mean regression functions g0(X)=E[Y(0)|X] and g1(X)=E[Y(1)|X].

To estimate the PATT defined in (1), a conventional weighting estimator based on the propensity score is the inverse probability weighting (IPW) estimator

(2) $$\hat\gamma_{\mathrm{IPW}} = \frac{1}{n_1}\sum_{T_i=1} Y_i - \sum_{T_i=0} \frac{\hat e(X_i)\,(1-\hat e(X_i))^{-1}}{\sum_{T_i=0} \hat e(X_i)\,(1-\hat e(X_i))^{-1}}\, Y_i.$$

Here $\sum_{T_i=t}$ is short-hand for summation over all units i such that Ti=t; this notation will be used repeatedly throughout the paper. In (2), the control units are weighted proportionally to ê(Xi)(1−ê(Xi))⁻¹ so that they resemble the full population. The most popular choice of propensity score model is logistic regression, where logit(e(x)) = log[e(x)/(1−e(x))] is modeled by $\sum_{j=1}^p \theta_j c_j(x)$ and the cj(x) are functions of the covariates.
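As a concrete illustration, the normalized IPW estimator in (2) can be sketched in a few lines of numpy. The function name and interface below are ours, and the estimated propensity scores are assumed to be supplied externally (for example, from a fitted logistic regression):

```python
import numpy as np

def ipw_patt(y, t, e_hat):
    """Normalized IPW estimator of the PATT, as in eq. (2).

    Control units are weighted by e_hat/(1 - e_hat), and the weights
    are normalized to sum to one over the control group.
    """
    y, t, e_hat = map(np.asarray, (y, t, e_hat))
    treated = t == 1
    control = ~treated
    w = e_hat[control] / (1.0 - e_hat[control])  # odds weights
    w /= w.sum()                                 # normalize over controls
    return y[treated].mean() - np.sum(w * y[control])
```

With equal control weights, the estimator reduces to the simple difference of group means, which is a useful sanity check.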

3 Entropy balancing

Entropy Balancing (EB) is an alternative weighting method proposed by Hainmueller [15] to estimate PATT. EB operates by maximizing the entropy of the weights under some pre-specified balancing constraints:

(3) $$\begin{aligned} \underset{w}{\text{maximize}}\quad & -\sum_{T_i=0} w_i \log w_i \\ \text{subject to}\quad & \sum_{T_i=0} w_i c_j(X_i) = \bar c_j(1) = \frac{1}{n_1}\sum_{T_i=1} c_j(X_i), \quad j=1,\dots,p, \\ & \sum_{T_i=0} w_i = 1, \quad w_i > 0, \quad i=1,\dots,n. \end{aligned}$$

Hainmueller [15] proposes to use the weighted average $\sum_{T_i=0} w_i^{\mathrm{EB}} Y_i$ to estimate the counterfactual mean E[Y(0)|T=1]. This gives the Entropy Balancing estimator of PATT

(4) $$\hat\gamma_{\mathrm{EB}} = \frac{1}{n_1}\sum_{T_i=1} Y_i - \sum_{T_i=0} w_i^{\mathrm{EB}} Y_i.$$

The functions $\{c_j(\cdot)\}_{j=1}^p$ in (3) are called moment functions of the covariates. They can be any transformations of X, not necessarily polynomial functions. We use c(X) and $\bar c(1)$ to stand for the vectors of cj(X) and $\bar c_j(1)$, j=1,…,p. We shall see that the functions $\{c_j(\cdot)\}_{j=1}^p$ indeed serve as the linear predictors in the propensity score model and the outcome regression model, although at this point it is not even clear that EB attempts to fit any model.

First, we give some heuristics that allow us to view EB as a propensity score weighting method. Since EB seeks to empirically match the control and treatment covariate distributions, we connect EB with density estimation. Let m(x) be the density function of the covariates X for the control population. The minimum relative entropy principle estimates the density of the treatment population by

(5) $$\underset{\tilde m}{\text{maximize}}\quad -H(\tilde m \,\|\, m) \qquad \text{subject to} \quad E_{\tilde m}[c(X)] = \bar c(1),$$

where $H(\tilde m \,\|\, m) = E_{\tilde m}[\log(\tilde m(X)/m(X))]$ is the relative entropy between $\tilde m$ and m. As an estimate of the distribution of the treatment group, the optimal $\tilde m$ of (5) is the “closest” to the control distribution among all distributions satisfying the moment constraints. Now let $w(x) = [P(T=1)\,\tilde m(x)]/[P(T=0)\,m(x)]$ be the population version of the inverse probability weights in (2). Applying a change of measure, we can rewrite (5) as an optimization problem over w(x):

(6) $$\underset{w}{\text{maximize}}\quad -E_m[w(X)\log w(X)] \qquad \text{subject to} \quad E_m[w(X)\,c(X)] = \bar c(1).$$

The EB optimization problem (3) is the finite-sample version of (6), where the population distribution m is replaced by the empirical distribution of the control units.

Using Lagrange multipliers, one can show the solution of (5) belongs to the family of exponentially tilted distributions of m [30]:

$$m_\theta(x) = m(x)\exp\left(\theta^T c(x) - \psi(\theta)\right).$$

Here, ψ(θ) is the logarithm of the moment generating function of c(X) under m, which normalizes this exponential family. Consequently, the solution of the population EB problem (6) is

$$\frac{e(x)}{1-e(x)} = \frac{P(T=1 \mid X=x)}{P(T=0 \mid X=x)} = w(x) = \exp\left(\alpha + \theta^T c(x)\right),$$

where $\alpha = \log(P(T=1)/P(T=0)) - \psi(\theta)$. This is exactly the logistic regression model with predictors c(x).

Notice that EB is different from the maximum likelihood fit of the logistic regression. The dual optimization problem of (3) is

(7) $$\underset{\theta}{\text{minimize}}\quad \log\left(\sum_{T_i=0} \exp\Big(\sum_{j=1}^p \theta_j c_j(X_i)\Big)\right) - \sum_{j=1}^p \theta_j \bar c_j(1),$$

whereas the maximum likelihood solves

(8) $$\underset{\theta}{\text{minimize}}\quad \sum_{i=1}^n \log\left(1 + \exp\Big(-(2T_i-1)\sum_{j=1}^p \theta_j c_j(X_i)\Big)\right).$$

It is apparent from (7) and (8) that EB and maximum likelihood use different loss functions. As a remark, the estimating equations defined by (7) are used to augment the estimating equations defined by (8) in the covariate balancing propensity score (CBPS) approach of Imai and Ratkovic [13]. We will compare the empirical performance of these methods in Section 5.

The optimization problem (7) is strictly convex, and the unique solution $\hat\theta^{\mathrm{EB}}$ can be efficiently computed by Newton’s method. The EB weights (the solution to the primal problem (3)) are given by the Karush-Kuhn-Tucker (KKT) conditions: for any i such that Ti=0,

(9) $$w_i^{\mathrm{EB}} = \frac{\exp\left(\sum_{j=1}^p \hat\theta_j^{\mathrm{EB}} c_j(X_i)\right)}{\sum_{T_i=0} \exp\left(\sum_{j=1}^p \hat\theta_j^{\mathrm{EB}} c_j(X_i)\right)}.$$
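To make the primal-dual relation concrete, here is a small numerical sketch (our own code, not Hainmueller’s `ebal` implementation) that minimizes the dual objective (7) with scipy and recovers the primal weights via (9):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def entropy_balancing_weights(C0, cbar1):
    """Entropy Balancing weights for the control units.

    Solves the dual problem (7),
        min_theta  log sum_{T_i=0} exp(theta' c(X_i)) - theta' cbar(1),
    and recovers the primal weights via the KKT conditions (9).

    C0    : (n0, p) matrix of moment functions c(X_i) for the control units.
    cbar1 : (p,) vector of treated-group moment averages cbar(1).
    """
    def dual(theta):
        return logsumexp(C0 @ theta) - theta @ cbar1

    def grad(theta):
        eta = C0 @ theta
        w = np.exp(eta - logsumexp(eta))  # softmax, i.e. eq. (9)
        return C0.T @ w - cbar1           # imbalance = dual gradient

    res = minimize(dual, np.zeros(C0.shape[1]), jac=grad, method="BFGS")
    eta = C0 @ res.x
    return np.exp(eta - logsumexp(eta))
```

By the KKT conditions, the resulting weights automatically sum to one and reproduce the treated moment averages exactly (up to the optimizer’s tolerance), which can be checked numerically; the dual gradient is precisely the covariate imbalance.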

As a final remark, Entropy Balancing bridges two existing approaches of estimating the mean causal effect:

  1. The calibration estimator that is very popular in survey sampling [21, 22, 23];

  2. The empirical likelihood approach that significantly advances the theory of doubly robust estimation in observational studies [18, 31, 32, 33].

EB is a special case of these two approaches. The main distinction is that it uses the (negative) Shannon entropy $\sum_{i=1}^n w_i \log w_i$ as the discrepancy function, resulting in an easy-to-solve convex optimization problem. Due to its easy interpretation, Entropy Balancing has already gained some ground in practice (e. g. [16, 17]).

4 Properties of entropy balancing

We give some theoretical guarantees for Entropy Balancing to justify its usage in real applications. The following is the main theorem of this paper, which shows EB is doubly robust even though its original form (3) contains neither a propensity score model nor an outcome regression model.

Theorem 1.

Let Assumption 1 (strong ignorability) and Assumption 2 (overlap) be given. Additionally, assume the expectation of c(X) exists and Var(Y(0)) < ∞. Then Entropy Balancing is doubly robust (Property 1) in the sense that

  1. If logit(e(x)) or g0(x) is linear in cj(x), j=1,…,p, then γ̂_EB is statistically consistent.

  2. Moreover, if logit(e(x)), g0(x) and g1(x) are all linear in cj(x), j=1,…,p, then γ̂_EB reaches the semiparametric variance bound of γ derived in Hahn [8] with unknown propensity score.

We give two proofs of the first claim in Theorem 1. The first proof reveals an interesting connection between the primal-dual optimization problems (3) and (7) and the statistical property of double robustness, which motivates the interpretation in Figure 1. The second proof uses a stabilization trick in Robins et al. [34].

First proof (sketch). The consistency under the linear model of logit(P(T=1|X)) is a consequence of the dual optimization problem (7). See Section 3 for a heuristic justification via the minimum relative entropy principle and Appendix A for a rigorous proof using M-estimation theory.

The consistency under the linear model of Y(0) can be proved by expanding E[Y(0)|X] and $\sum_{T_i=0} w_i Y_i$. Here we provide an indirect proof by showing that augmenting EB with a linear outcome regression does not change the estimator. Given an estimated propensity score model ê(x), the corresponding weights ê(x)/(1−ê(x)) for the control units, and an estimated outcome regression model ĝ0(x), a doubly robust estimator of PATT is given by

(10) $$\hat\gamma_{\mathrm{DR}} = \frac{1}{n_1}\sum_{T_i=1}\left(Y_i - \hat g_0(X_i)\right) - \frac{1}{n_1}\sum_{T_i=0}\frac{\hat e(X_i)}{1-\hat e(X_i)}\left(Y_i - \hat g_0(X_i)\right).$$

This estimator satisfies Property 1, i. e. if ê(x) → e(x) or ĝ0(x) → g0(x), then γ̂_DR is statistically consistent for γ. To see this, in the case that ĝ0(x) → g0(x), the first sum in (10) is consistent for γ and the second sum in (10) has mean going to 0 as n → ∞. In the case where ĝ0(x) does not converge to g0(x) but ê(x) → e(x), the second sum in (10) is consistent for the bias of the first sum (as an estimator of γ).
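A numpy sketch of a normalized version of the doubly robust estimator (10) follows; the interface is ours, and `g0_hat` is assumed to hold fitted values from any control-outcome regression:

```python
import numpy as np

def dr_patt(y, t, e_hat, g0_hat):
    """Doubly robust PATT estimator in the spirit of eq. (10).

    Control units are weighted by their normalized odds e_hat/(1-e_hat);
    g0_hat are fitted values of the control outcome regression g0.
    """
    y, t, e_hat, g0_hat = map(np.asarray, (y, t, e_hat, g0_hat))
    treated = t == 1
    control = ~treated
    resid = y - g0_hat                           # regression residuals
    w = e_hat[control] / (1.0 - e_hat[control])  # odds weights
    w /= w.sum()                                 # normalize over controls
    return resid[treated].mean() - np.sum(w * resid[control])
```

Setting `g0_hat` to zero recovers the normalized IPW estimator, illustrating that (10) is IPW applied to regression residuals.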

When the estimated propensity score model ê(x) is obtained from the EB dual problem (7) and the estimated outcome regression model is $\hat g_0(x) = \sum_{j=1}^p \hat\beta_j c_j(x)$, we have

$$\begin{aligned} \hat\gamma_{\mathrm{DR}} - \hat\gamma_{\mathrm{EB}} &= \sum_{T_i=0} w_i^{\mathrm{EB}} \hat g_0(X_i) - \frac{1}{n_1}\sum_{T_i=1} \hat g_0(X_i) \\ &= \sum_{T_i=0} w_i^{\mathrm{EB}} \sum_{j=1}^p \hat\beta_j c_j(X_i) - \frac{1}{n_1}\sum_{T_i=1} \sum_{j=1}^p \hat\beta_j c_j(X_i) \\ &= \sum_{j=1}^p \hat\beta_j \left( \sum_{T_i=0} w_i^{\mathrm{EB}} c_j(X_i) - \frac{1}{n_1}\sum_{T_i=1} c_j(X_i) \right) = 0. \end{aligned}$$

Therefore, by enforcing the covariate balancing constraints, EB implicitly fits a linear outcome regression model and is consistent for γ under this model.□

Second proof. This proof was pointed out by an anonymous reviewer. In a discussion of Kang and Schafer [9], Robins et al. [34] show that one can stabilize the standard doubly robust estimator in a number of ways. Specifically, one trick suggested by Robins et al. [34] is to estimate the propensity score, say ẽ(x), by the following estimating equation:

(11) $$\sum_{i=1}^n \left(\frac{(1-T_i)\,\tilde e(X_i)/(1-\tilde e(X_i))}{\sum_{i=1}^n (1-T_i)\,\tilde e(X_i)/(1-\tilde e(X_i))} - \frac{T_i}{\sum_{i=1}^n T_i}\right)\hat g_0(X_i) = 0.$$

Then one can estimate PATT by the IPW estimator (2) with ê(Xi) replaced by ẽ(Xi). This estimator is sample bounded (it always lies within the range of the observed values of Y) and doubly robust with respect to the parametric specifications ẽ(x)=ẽ(x;θ) and ĝ0(x)=ĝ0(x;β). The only problem with (11) is that it may not have a unique solution. However, when logit(e(x)) and g0(x) are assumed linear in c(x), (11) corresponds to the first-order condition of the EB dual problem (7). Since (7) is strictly convex, it has a unique solution, and ẽ(X;θ) is the same as the EB estimate ê(X;θ). As a consequence, γ̂_EB is also doubly robust.

To prove the second claim in Theorem 1, we compute the asymptotic variance of γˆEB using the M-estimation theory. To state our results, we need to introduce four differently weighted covariance-like functions for two random vectors a1 and a2 of length p:

$$\begin{aligned} H_{a_1,a_2} &= \mathrm{Cov}(a_1, a_2 \mid T=1), \\ G_{a_1,a_2} &= E\left[\frac{e(X)}{1-e(X)}\,(a_1 - E[a_1 \mid T=1])(a_2 - E[a_2 \mid T=1])^T \,\Big|\, T=1\right], \\ K_{a_1,a_2} &= E\left[(1-e(X))\, a_1 a_2^T \mid T=1\right], \\ K^m_{a_1,a_2} &= E\left[(1-e(X))\, a_1 (a_2 - E[a_2 \mid T=1])^T \mid T=1\right]. \end{aligned}$$

It is obvious that H ≠ K and usually G ≠ H. To make the notation more concise, c(X) will be abbreviated as c and Y(0) as 0 in subscripts. For example, H_{c,0} = H_{c(X),Y(0)}, G_{c,1} = G_{c(X),Y(1)} and K_c = K_{c(X),c(X)}.

Theorem 2.

Assume the logistic regression model of the propensity score is correct, i. e. logit(P(T=1|X)) is a linear combination of $\{c_j(X)\}_{j=1}^p$. Let π = P(T=1); then we have $\hat\gamma_{\mathrm{EB}} \overset{d}{\to} N(\gamma, V_{\mathrm{EB}}/n)$ and $\hat\gamma_{\mathrm{IPW}} \overset{d}{\to} N(\gamma, V_{\mathrm{IPW}}/n)$, where

(12) $$V_{\mathrm{EB}} = \pi^{-1}\left[H_1 + G_0 - H_{c,0}^T H_c^{-1}\left(2G_{c,0} - H_{c,0} - G_c H_c^{-1} H_{c,0} + 2H_{c,1}\right)\right],$$
(13) $$V_{\mathrm{IPW}} = \pi^{-1}\left[H_1 + G_0 - H_{c,0}^T K_c^{-1}\left(H_{c,0} - 2K^m_{c,0} + 2K^m_{c,1}\right)\right].$$

The proof of Theorem 2 is given in Appendix A. The H, G and K matrices in Theorem 2 can be estimated from the observed data, yielding approximate sampling variances for γˆEB and γˆIPW. Alternatively, variance estimates may be obtained via the empirical sandwich method (e. g. [35]). In practice (particularly in simulations where we compare to a known truth), we find that the empirical sandwich method is more stable than the plug-in method, which is consistent with the suggestion in Lunceford and Davidian [20] for PATE estimators.

To complete the proof of the second claim in Theorem 1, we compare these variances with the semiparametric variance bound of γ with unknown e(X) derived by Hahn ([8], Theorem 1):

$$V = \frac{1}{\pi^2} E\left[e(X)\,\mathrm{Var}(Y(1) \mid X) + \frac{e(X)^2}{1-e(X)}\,\mathrm{Var}(Y(0) \mid X) + e(X)\left(g_1(X) - g_0(X) - \gamma\right)^2\right].$$

After some algebra, one can express V in terms of the H and G matrices defined above:

$$V = \pi^{-1}\left[H_1 + G_0 - 2H_{g_0,g_1} - G_{g_0} + H_{g_0}\right].$$

Now assume logit(P(T=1|X)) = θᵀc(X) and E[Y(t)|X] = β⁽ᵗ⁾ᵀc(X) for t = 0, 1; it is then easy to verify that

$$H_{c,t} = \mathrm{Cov}(c(X), Y(t) \mid T=1) = \mathrm{Cov}(c(X), \beta^{(t)T} c(X) \mid T=1) = H_c \beta^{(t)}, \quad t = 0, 1.$$

Similarly, $G_{c,t} = G_c \beta^{(t)}$, t = 0, 1. From here it is easy to check that V_EB and V are the same. Since Entropy Balancing reaches the efficiency bound in this case, V_EB ≤ V_IPW when both models are linear.

If logit(P(T=1|X)) = θᵀc(X) is true but E[Y(t)|X] = β⁽ᵗ⁾ᵀc(X) is not true for some t ∈ {0, 1}, there is no guarantee that EB has the smaller asymptotic variance. In practice, however, the features c(X) in the outcome regression models are almost always correlated with Y. This correlation compensates for the slight efficiency loss of not maximizing the likelihood function in the logistic regression. As a consequence, the variance V_EB in (12) is usually smaller than V_IPW in (13). This efficiency advantage of EB over IPW is verified in the next section using simulations.

5 Simulations

5.1 Kang-Schafer example

We use the simulation example in Kang and Schafer [9] to compare EB weighting with IPW (after maximum likelihood logistic regression) and the over-identified Covariate Balancing Propensity Score (CBPS) proposed by Imai and Ratkovic [13]. The simulated data consist of {(Xi, Zi, Ti, Yi), i=1,…,n}. Xi and Ti are always observed, Yi is observed only if Ti=1, and Zi is never observed. To generate this data set, Xi is distributed as N(0, I₄) and Zi is computed by first applying the following transformation:

$$Z_{i1} = \exp(X_{i1}/2), \quad Z_{i2} = X_{i2}/(1+\exp(X_{i1})) + 10, \quad Z_{i3} = (X_{i1}X_{i3}/25 + 0.6)^3, \quad Z_{i4} = (X_{i2} + X_{i4} + 20)^2.$$

Next we normalize each column such that Zi has mean 0 and standard deviation 1.

In one setting, Yi is generated by Yi = 210 + 27.4Xi1 + 13.7Xi2 + 13.7Xi3 + 13.7Xi4 + εi, εi ∼ N(0,1), and the true propensity scores are ei = expit(−Xi1 + 0.5Xi2 − 0.25Xi3 − 0.1Xi4). In this case, both Y and T can be correctly modeled by a (generalized) linear model of the observed covariates X.

In the other settings, at least one of the propensity score model and the outcome regression model is incorrect. In order to achieve this, the data generating process described above is altered such that Y or T (or both) is linear in the unobserved Z instead of the observed X, though the parameters are kept the same.
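For reference, the correct-model setting of this data-generating process can be sketched as follows; the signs in the propensity score and the /25 factor in Z_{i3} follow the original Kang and Schafer [9] specification:

```python
import numpy as np
from scipy.special import expit

def kang_schafer(n, rng):
    """One draw of the Kang-Schafer simulation (correct-model setting)."""
    x = rng.standard_normal((n, 4))
    # Nonlinear transformations of X, then column-wise normalization
    z = np.column_stack([
        np.exp(x[:, 0] / 2),
        x[:, 1] / (1 + np.exp(x[:, 0])) + 10,
        (x[:, 0] * x[:, 2] / 25 + 0.6) ** 3,
        (x[:, 1] + x[:, 3] + 20) ** 2,
    ])
    z = (z - z.mean(axis=0)) / z.std(axis=0)
    # True propensity scores and treatment/response indicators
    e = expit(-x[:, 0] + 0.5 * x[:, 1] - 0.25 * x[:, 2] - 0.1 * x[:, 3])
    t = rng.binomial(1, e)
    # Outcome linear in X with intercept 210
    y = (210 + 27.4 * x[:, 0] + 13.7 * (x[:, 1] + x[:, 2] + x[:, 3])
         + rng.standard_normal(n))
    return x, z, t, y, e
```

The misspecified settings are obtained by handing the analyst Z in place of X, as described above.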

For each setting (4 in total), we generated 1000 simulated data sets of sizes n = 200 and n = 1000, and then applied the methods discussed earlier, including

  1. IPW, CBPS: the IPW estimator in (2) with propensity score estimated by logistic regression or CBPS (since the estimand is the overall mean, we use the CBPS weights tailored for estimating PATE);

  2. EB: the Entropy Balancing estimator (the EB weights are used to estimate the unobserved mean E[Y|T=0]);

  3. IPW+DR, CBPS+DR: the doubly robust estimator in (10) with propensity score estimated by logistic regression or CBPS.

The simulation results are presented in Figure 2 (n = 200) and Figure 3 (n = 1000). In each figure, panel (a) shows the covariate imbalance before adjustment in terms of standardized difference, and panel (b) shows the mean estimates given by the five different methods. First, notice that the doubly robust estimator “IPW+DR” performs poorly when both models are misspecified (bottom-right panels in Figures 2 and 3). In fact, all three doubly robust methods are worse than just using IPW. Second, the three doubly robust estimators have exactly the same variance if the Y model is correct (top two panels in Figures 2 and 3). It seems that how one fits the propensity score model has no impact on the final estimate. This is related to the observation in Kang and Schafer [9] that, in this example, the plain OLS estimate of Y actually outperforms any method involving the propensity score model. Discussion articles such as Robins et al. [34] and Ridgeway and McCaffrey [36] find this phenomenon very uncommon in practice; it is most likely due to the estimated inverse probability weights being highly variable, which is a bad setting for doubly robust estimators.

Figure 2: Kang-Schafer example: sample size n = 200. Both propensity score model and outcome regression model can be correct or incorrect, so there are four scenarios in total. We generate 1,000 simulations in each scenario. (a) Covariate imbalance before adjustment. (b) Mean estimates. The methods are: Inverse Propensity Weighting (IPW), Covariate Balancing Propensity Score (CBPS), Entropy Balancing (EB), and doubly robust versions of the first two (IPW+DR, CBPS+DR). Target mean is 210 and is marked as a black horizontal line to compare the biases. Numbers printed at Y = 230 are the sample standard deviations to compare efficiency.

Figure 3: Kang-Schafer example: sample size n = 1000. Both propensity score model and outcome regression model can be correct or incorrect, so there are four scenarios in total. We generate 1,000 simulations in each scenario. (a) Covariate imbalance before adjustment. (b) Mean estimates. The methods are: Inverse Propensity Weighting (IPW), Covariate Balancing Propensity Score (CBPS), Entropy Balancing (EB), and doubly robust versions of the first two (IPW+DR, CBPS+DR). Target mean is 210 and is marked as a black horizontal line to compare the biases. Numbers printed at Y = 230 are the sample standard deviations to compare efficiency.

Regarding Entropy Balancing (EB), we find that:

  1. If both the T and Y models are misspecified, EB has smaller bias than the conventional “IPW+DR” or “CBPS+DR”. EB thus seems less affected by this unfavorable setting.

  2. When the T model is correct but the Y model is wrong (bottom-left panels in Figures 2 and 3), EB has the smallest variance among all estimators. This supports the conclusion of our efficiency comparison of IPW and EB in Section 4.

Finally, notice that the same simulation setting is used in Tan [18] to study the performance of a number of doubly robust estimators. The reader can compare Figures 2 and 3 with the results there. The performance of Entropy Balancing is comparable to the best estimator in Tan [18].

5.2 Lunceford-Davidian example

We provide another simulation example, by Lunceford and Davidian [20], to verify the claims in Theorems 1 and 2. In this simulation, the data still consist of {(Xi, Zi, Ti, Yi), i=1,…,n}, but all of them are observed. Both Xi and Zi are three-dimensional vectors. The propensity score is related to X only through:

$$\mathrm{logit}(P(T_i=1)) = \beta_0 + \sum_{j=1}^3 \beta_j X_{ij}.$$

Note the above does not involve elements of Zi. The response Y is generated according to

$$Y_i = \nu_0 + \sum_{j=1}^3 \nu_j X_{ij} + \nu_4 T_i + \sum_{j=1}^3 \xi_j Z_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1).$$

The parameters here are set to be ν = (0, −1, 1, −1, 2)ᵀ, and β takes one of three values:

$$\beta_{\mathrm{no}} = (0, 0, 0, 0)^T, \quad \beta_{\mathrm{moderate}} = (0, 0.3, -0.3, 0.3)^T, \quad \text{or} \quad \beta_{\mathrm{strong}} = (0, 0.6, -0.6, 0.6)^T.$$

The choice of β sets the level of association between T and X. Similarly, ξ sets the level of association between Y and Z:

$$\xi_{\mathrm{no}} = (0, 0, 0)^T, \quad \xi_{\mathrm{moderate}} = (-0.5, 0.5, 0.5)^T, \quad \text{or} \quad \xi_{\mathrm{strong}} = (-1, 1, 1)^T.$$

The joint distribution of (Xi, Zi) is specified by taking Xi3 ∼ Bernoulli(0.2) and then generating Zi3 as Bernoulli with

$$P(Z_{i3} = 1 \mid X_{i3}) = 0.75 X_{i3} + 0.25(1 - X_{i3}).$$

Conditional on Xi3, (Xi1, Zi1, Xi2, Zi2) is then generated as multivariate normal N(a_{Xi3}, B_{Xi3}), where a₁ = (1, 1, −1, −1)ᵀ, a₀ = (−1, −1, 1, 1)ᵀ and

$$B_0 = B_1 = \begin{pmatrix} 1 & 0.5 & -0.5 & -0.5 \\ 0.5 & 1 & -0.5 & -0.5 \\ -0.5 & -0.5 & 1 & 0.5 \\ -0.5 & -0.5 & 0.5 & 1 \end{pmatrix}.$$

Figure 4 shows the covariate imbalance in the three settings before adjustment.

Figure 4: Lunceford-Davidian example: covariate imbalance before adjustment.

The data generating model implies that the true PATT is γ = 2. Since the outcome Y depends on both X and Z, we always fit a full linear model of Y using X and Z whenever an outcome model is needed. T only depends on X, so it is not necessary to include Z in the propensity score model. However, as pointed out by Lunceford and Davidian [20], it can actually be beneficial to “overmodel” the propensity score by including Z. Here we try both possibilities: “full” modeling of T using both X and Z, and “partial” modeling of T using only X. Since the estimand is PATT in this case, we use the over-identified CBPS weights tailored for estimating PATT.

We generated 1000 simulated data sets; the results are shown in Figure 5 for “full” propensity score modeling and in Figure 6 for “partial” propensity score modeling. We make the following comments about these two plots:

  1. IPW and all other estimators are always consistent, no matter what level of association is specified. This is because the propensity score model is always correctly specified.

  2. When using the “full” propensity score modeling, all doubly robust estimators (EB, IPW+DR, CBPS+DR and EB+DR) have almost the same sample variance. This is because all of them are asymptotically efficient.

  3. CBPS, to our surprise, does not perform very well in this simulation. It has smaller variance than IPW, but this comes at the price of some bias. If we use the partial propensity score model (involving only X, Figure 6), this bias is smaller but still not negligible. While it is not clear what causes this bias, one possible reason is that the optimization problem of CBPS is nonconvex, so the local solution used to construct the γ estimator could be far from the global solution. Another possibility is that CBPS uses GMM or empirical likelihood to combine the likelihood with an imbalance penalty, which is less efficient than maximizing the likelihood directly. Thus, although the estimator is asymptotically unbiased, its convergence to the true γ is much slower than IPW. CBPS combined with outcome regression (CBPS+DR) fixes the bias and inefficiency issues that occur in CBPS without outcome regression.

  4. EB, in contrast, performs quite well in this simulation. It has relatively small variance, particularly if we use the “full” model in which both X and Z are balanced.

  5. The difference between EB and EB+DR is that while EB only balances the “partial” or “full” covariates, EB+DR additionally combines an outcome linear regression model on all the covariates. As shown in the first proof of Theorem 1, when the “full” covariates are used, EB is exactly the same as EB+DR. We can observe this in Figure 5. When EB only balances the “partial” covariates, the two methods are different and indeed EB+DR is more efficient in Figure 6, since it fits the correct Y model.

  6. Using the “full” propensity score model improves the efficiency of the pure weighting estimators (IPW, CBPS and EB) considerably, but has very little impact on the estimators that involve an outcome regression model (IPW+DR and CBPS+DR) compared to “partial” propensity score modeling. Although EB can be viewed as implicitly fitting an outcome model, the “partial” EB estimator only uses X in the outcome model, which is precisely why it is not efficient. Thus there are both robustness and efficiency reasons to include all relevant covariates in EB, even if a covariate affects only one of T and Y.

Figure 5: Results of the Lunceford–Davidian example (full propensity score modeling). The propensity score model and the outcome regression model, if applicable, are always correctly specified, but the level of association between T or Y and X or Z varies, resulting in 9 different scenarios. X are confounding covariates and Z only affects the outcome. We generate 1,000 simulations of sample size 1,000 in each scenario and apply five different estimators. The true PATT is 2 and is marked as a black horizontal line to compare the biases of the methods. Numbers printed at Y = 5 are the sample standard deviations of each method, to compare their efficiency.

Figure 6: Results of the Lunceford–Davidian example (partial propensity score modeling). The settings are exactly the same as in Figure 5 except that the methods here do not use Z in their propensity score models.

In summary, EB outperforms IPW in all the simulations, making it an appealing alternative to the conventional propensity score weighting methods.
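For readers who want to experiment, EB weights are straightforward to compute by solving the dual problem derived in the Appendix, where the weights take the exponential tilting form $w_i \propto \exp(\theta^T c(X_i))$ on the control group. Below is a minimal sketch in Python; the function name and the choice of SciPy's BFGS solver are our own, not part of the original proposal:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def entropy_balancing_weights(C0, target_means):
    """Entropy balancing weights for control units with covariate
    functions C0 (n0 x p), matching the treated moments target_means.

    Solves the EB dual: w_i proportional to exp(theta' c(X_i)), with
    theta chosen so the weighted control moments equal target_means."""
    C0 = np.asarray(C0, dtype=float)

    def dual(theta):
        # log-sum-exp dual objective; its gradient is exactly the
        # moment imbalance, so it vanishes when constraints hold
        return logsumexp(C0 @ theta) - target_means @ theta

    res = minimize(dual, np.zeros(C0.shape[1]), method="BFGS")
    w = np.exp(C0 @ res.x)
    return w / w.sum()
```

The dual is smooth and convex, so a quasi-Newton solver converges quickly whenever the target moment vector lies inside the convex hull of the control points (the existence condition discussed in the Appendix).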

Acknowledgement

This work was completed while Qingyuan Zhao was a Ph.D. student at the Department of Statistics, Stanford University. We would like to thank Jens Hainmueller, Trevor Hastie, Hera He, Bill Heavlin, Diane Lambert, Daryl Pregibon, Jean Steiner and one anonymous reviewer for their helpful comments.

Appendix

A Theoretical proofs

We first describe the conditions under which the EB problem (3) admits a solution. The existence of $w^{EB}$ depends on the solvability of the moment matching constraints

$$\sum_{T_i=0} w_i c_j(X_i) = \bar{c}_j(1), \quad j = 1,\dots,p, \qquad w_i > 0, \quad \sum_{T_i=0} w_i = 1. \tag{14}$$

As one may expect, this is closely related to the existence condition for the maximum likelihood estimate in logistic regression [37, 38]. An easy way to obtain such a condition is through the dual problem of (8):

$$\begin{aligned} \underset{w}{\text{maximize}} \quad & -\sum_{i=1}^n \big[ w_i \log w_i + (1 - w_i) \log(1 - w_i) \big] \\ \text{subject to} \quad & \sum_{T_i=0} w_i c_j(X_i) = \sum_{T_i=1} w_i c_j(X_i), \quad j = 1,\dots,p, \\ & 0 < w_i < 1, \quad i = 1,\dots,n. \end{aligned} \tag{15}$$

Thus, the existence of $\hat\theta^{MLE}$ is equivalent to the solvability of the constraints in (15), which is the overlap condition first given by Silvapulle [37].

Intuitively, in the space of $c(X)$, the solvability of (14), i.e. the existence of $w^{EB}$, means there is no hyperplane separating $\{c(X_i)\}_{T_i=0}$ and $\bar c(1)$. In contrast, the solvability of (15), i.e. the existence of $\hat\theta^{MLE}$, means there is no hyperplane separating $\{c(X_i)\}_{T_i=0}$ and $\{c(X_i)\}_{T_i=1}$. Hence the existence of EB requires a stronger condition than the logistic regression MLE.
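This separating-hyperplane characterization can also be checked numerically: $w^{EB}$ exists precisely when $\bar c(1)$ lies in the convex hull of the control points, which is a linear feasibility problem. A sketch under our own naming (note we test feasibility with $w \ge 0$, ignoring the strict $w > 0$ boundary case):

```python
import numpy as np
from scipy.optimize import linprog

def eb_weights_exist(C0, target):
    """Feasibility of the moment constraints (14): does `target` lie in
    the convex hull of the rows of C0 (control covariate functions)?
    Solves for w >= 0, sum(w) = 1, w' C0 = target via an LP."""
    n = C0.shape[0]
    # equality constraints: C0' w = target and 1' w = 1
    A_eq = np.vstack([C0.T, np.ones((1, n))])
    b_eq = np.append(target, 1.0)
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    return res.status == 0  # status 0 means a feasible point was found
```

For example, with controls at the four corners of the unit square, the point (0.5, 0.5) is inside the hull while (2, 2) is not, so only the former admits EB weights.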

The next proposition shows that the existence of $w^{EB}$ (and hence of $\hat\theta^{MLE}$) is guaranteed by Assumption 2 (overlap) with high probability.

Proposition 1.

Suppose Assumption 2 (overlap) is satisfied and the expectation of $c(X)$ exists. Then $P(w^{EB} \text{ exists}) \to 1$ as $n \to \infty$. Furthermore, $\sum_{i=1}^n (w_i^{EB})^2 \to 0$ in probability as $n \to \infty$.

Proof

Since the expectation of $c(X)$ exists, the weak law of large numbers says $\bar c(1) \overset{p}{\to} \bar c^*(1) = E[c(X) \mid T = 1]$. Therefore:

Lemma 1.

For any $\varepsilon > 0$, $P(\|\bar c(1) - \bar c^*(1)\|_\infty \geq \varepsilon) \to 0$ as $n \to \infty$.

Now condition on $\|\bar c(1) - \bar c^*(1)\|_\infty \leq \varepsilon$, i.e. $\bar c(1)$ is in the box of side length $2\varepsilon$ centered at $\bar c^*(1)$. We want to prove that, with probability going to 1, there exists $w$ such that $w_i > 0$, $\sum_{T_i=0} w_i = 1$ and $\sum_{T_i=0} w_i c(X_i) = \bar c(1)$. Equivalently, this says the convex hull generated by $\{c(X_i)\}_{T_i=0}$ contains $\bar c(1)$. We indeed prove a stronger result:

Lemma 2.

With probability going to 1, the convex hull generated by $\{c(X_i)\}_{T_i=0}$ contains the box $B_\varepsilon(\bar c^*(1)) = \{c(x) : \|c(x) - \bar c^*(1)\|_\infty \leq \varepsilon\}$ for some $\varepsilon > 0$.

Proposition 1 follows immediately from Lemma 1 and Lemma 2. Now we prove Lemma 2. Denote the sample space of $X$ by $\Omega(X)$. Assumption 2 (overlap) implies that $\bar c^*(1)$, and hence $B_\varepsilon(\bar c^*(1))$, is in the interior of the convex hull of $\Omega(X)$ for sufficiently small $\varepsilon$. Let $R_i$, $i = 1,\dots,3^p$, be the $3^p$ boxes centered at $\bar c^*(1) + \frac{3}{2}\varepsilon b$, where $b \in \{-1, 0, 1\}^p$ is a vector whose entries are each $-1$, $0$, or $1$. It is easy to check that the sets $R_i$ are disjoint and that the convex hull of $\{x_i\}_{i=1}^{3^p}$ contains $B_\varepsilon(\bar c^*(1))$ if $x_i \in R_i$, $i = 1,\dots,3^p$. Since $0 < P(T = 0 \mid X) < 1$, $\rho = \min_i P(X \in R_i \mid T = 0) > 0$. This implies

$$P\big(\exists\, i' \text{ such that } X_{i'} \in R_i \text{ and } T_{i'} = 0, \text{ for all } i = 1,\dots,3^p\big) \geq 1 - \sum_{i=1}^{3^p} \big(1 - P(X \in R_i \mid T = 0)\big)^n \geq 1 - 3^p (1 - \rho)^n \to 1 \tag{16}$$

as $n \to \infty$. This proves the lemma, because the event on the left-hand side implies the convex hull generated by $\{c(X_i)\}_{T_i=0}$ contains the desired box. Note that (16) also tells us how many samples we actually need to ensure the existence of $w^{EB}$. Indeed, if $n \geq \rho^{-1}(p \log 3 + \log \delta^{-1}) \geq \frac{p \log 3 + \log \delta^{-1}}{-\log(1 - \rho)}$, then the probability in (16) is greater than $1 - \delta$. Usually we expect $\rho = O(3^{-p})$. If this is the case, the number of samples needed is $n = O(p\, 3^p)$.

Now we turn to the second claim of the proposition, i.e. $\sum_{T_i=0} (w_i^{EB})^2 \overset{p}{\to} 0$. To prove this, we only need to find a sequence (with respect to growing $n$) of feasible solutions to (3) such that $\max_i w_i \to 0$. This is not hard to show, because the failure probability in (16) decays exponentially as $n$ increases. Let $N(\delta, p, \rho)$ denote the sample size required in (16) for failure probability $\delta$. We can pick $n_1 \geq N(\delta, p, \rho)$ such that the probability that the convex hull of $\{x_i\}_{i=1}^{n_1}$ contains $B_\varepsilon(\bar c(1))$ is at least $1 - \delta$, and then pick $n_{i+1} \geq n_i + 3^i N(\delta, p, \rho)$ so that the convex hull of $\{x_i\}_{i=n_i+1}^{n_{i+1}}$ contains $B_\varepsilon(\bar c(1))$ with probability at least $1 - 3^{-i}\delta$. This means for each block $\{x_i\}_{i=n_i+1}^{n_{i+1}}$, $i = 0, 1, \dots$, we have a set of weights $\{\tilde w_i\}_{i=n_i+1}^{n_{i+1}}$ such that $\sum_{i=n_i+1}^{n_{i+1}} \tilde w_i x_i = \bar c(1)$. Now suppose $n_k \leq n < n_{k+1}$; the choice $w_i = \tilde w_i / k$ if $i \leq n_k$ and $w_i = 0$ if $i > n_k$ satisfies the constraints and $\max_i w_i \leq 1/k$. As $n \to \infty$, this implies $\max_i w_i \to 0$ and hence $\sum_i w_i^2 \to 0$ with probability tending to 1. □

Now we turn to the main theorem of the paper (Theorem 1). The first claim in Theorem 1 follows immediately from the following lemma:

Lemma 3.

Under the assumptions in Theorem 1, and suppose $\mathrm{logit}(P[T = 1 \mid X]) = \sum_{j=1}^p \theta_j^* c_j(X)$. Then as $n \to \infty$, $\hat\theta^{EB} \overset{p}{\to} \theta^*$. As a consequence,

$$\sum_{T_i=0} w_i^{EB} Y_i \overset{p}{\to} E[Y(0) \mid T = 1].$$

Proof

The proof is a standard application of M-estimation theory [35]. We will follow the estimating equations approach described in [35] to derive the consistency of $\hat\theta^{EB}$. First we note that the first-order optimality condition of (7) is

$$\sum_{i=1}^n (1 - T_i)\, e^{\sum_{k=1}^p \theta_k c_k(X_i)} \big(c_j(X_i) - \bar c_j(1)\big) = 0, \quad j = 1,\dots,p. \tag{17}$$

We can rewrite (17) as estimating equations. Let $\phi_j(X, T; m) = T(c_j(X) - m_j)$, $j = 1,\dots,p$, and $\psi_j(X, T; \theta, m) = (1 - T) \exp\{\sum_{k=1}^p \theta_k c_k(X)\} (c_j(X) - m_j)$; then (17) is equivalent to

$$\sum_{i=1}^n \phi_j(X_i, T_i; m) = 0, \qquad \sum_{i=1}^n \psi_j(X_i, T_i; \theta, m) = 0, \qquad j = 1,\dots,p. \tag{18}$$

Since $\phi(\cdot)$ and $\psi(\cdot)$ are all smooth functions of $\theta$ and $m$, all we need to verify is that $m_j^* = E[c_j(X) \mid T = 1]$ and $\theta^*$ is the unique solution to the population version of (18). It is obvious that $m^*$ is the solution to $E[\phi_j(X, T; m)] = 0$, $j = 1,\dots,p$. Now take the conditional expectation of $\psi_j$ given $X$:

$$\begin{aligned} E[\psi_j(X, T; \theta, m^*) \mid X] &= (1 - e(X))\, e^{\sum_{k=1}^p \theta_k c_k(X)} \big(c_j(X) - m_j^*\big) \\ &= \Big(1 - \frac{e^{\sum_{k=1}^p \theta_k^* c_k(X)}}{1 + e^{\sum_{k=1}^p \theta_k^* c_k(X)}}\Big)\, e^{\sum_{k=1}^p \theta_k c_k(X)} \big(c_j(X) - m_j^*\big) \\ &= \frac{e^{\sum_{k=1}^p \theta_k c_k(X)}}{1 + e^{\sum_{k=1}^p \theta_k^* c_k(X)}} \big(c_j(X) - E[c_j(X) \mid T = 1]\big). \end{aligned}$$

The only way to make $E[\psi_j(X, T; \hat\theta, m^*)] = 0$ for all $j$ is to have

$$\frac{e^{\sum_{k=1}^p \hat\theta_k c_k(X)}}{1 + e^{\sum_{k=1}^p \theta_k^* c_k(X)}} = \mathrm{const} \cdot P(T = 1 \mid X),$$

i.e. $\hat\theta = \theta^*$. This proves the consistency of $\hat\theta^{EB}$.

The consistency of $\hat\gamma^{EB}$ is proved by noticing that

$$w_i^{EB} = \frac{\exp\big(\sum_{j=1}^p \hat\theta_j^{EB} c_j(X_i)\big)}{\sum_{T_{i'}=0} \exp\big(\sum_{j=1}^p \hat\theta_j^{EB} c_j(X_{i'})\big)},$$

which converges in probability to the weight proportional to $P(T_i = 1 \mid X_i) / (1 - P(T_i = 1 \mid X_i))$, i.e. the IPW-NR weight defined in (2).
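Lemma 3 can be checked in a small simulation: when the propensity score is truly logistic in $c(X)$, the EB weights obtained from the dual problem approach the normalized true odds $e(X_i)/(1-e(X_i))$. A sketch (the simulation design and all names are ours):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n, theta_star = 5000, np.array([-1.0, 1.0])

# c(X) = (1, X) with a logistic propensity score, as assumed in Lemma 3
X = rng.normal(size=n)
C = np.column_stack([np.ones(n), X])
e = 1.0 / (1.0 + np.exp(-C @ theta_star))
T = rng.binomial(1, e)

C0, target = C[T == 0], C[T == 1].mean(axis=0)

# EB dual: weights proportional to exp(theta' c(X_i)) on the controls
dual = lambda th: logsumexp(C0 @ th) - target @ th
theta_hat = minimize(dual, np.zeros(2), method="BFGS").x
w_eb = np.exp(C0 @ theta_hat)
w_eb /= w_eb.sum()

# normalized true odds: the IPW-NR weights EB converges to
odds = e[T == 0] / (1.0 - e[T == 0])
w_nr = odds / odds.sum()
```

The EB weights balance the covariate moments exactly in finite samples, while their distance to the true-odds weights shrinks at the usual parametric rate.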

The second claim is a corollary of Theorem 2, which is proved below. For simplicity we denote $\xi = (m^T, \theta^T, \mu(1|1), \gamma)^T$ and the true parameter as $\xi^*$. Throughout this section we assume $\mathrm{logit}(e(X)) = \sum_{j=1}^p \theta_j^* c_j(X)$. Denote $\tilde c(X) = c(X) - \bar c^*(1)$, $e(X) = e(X; \theta^*)$, and $l(X) = \exp\{\sum_{j=1}^p \theta_j^* c_j(X)\} = e(X) / (1 - e(X))$. Let

$$\begin{aligned} \phi_j(X, T; m) &= T\big(c_j(X) - m_j\big), \quad j = 1,\dots,p, \\ \psi_j(X, T; \theta, m) &= (1 - T)\, e^{\sum_{k=1}^p \theta_k c_k(X)} \big(c_j(X) - m_j\big), \quad j = 1,\dots,p, \\ \varphi_{1|1}(X, T, Y; \mu(1|1)) &= T\big(Y - \mu(1|1)\big), \\ \varphi(X, T, Y; \theta, \mu(1|1), \gamma) &= (1 - T)\, e^{\sum_{j=1}^p \theta_j c_j(X)} \big(Y + \gamma - \mu(1|1)\big), \end{aligned}$$

and let $\zeta(X, T, Y; m, \theta, \mu(1|1), \gamma) = (\phi^T, \psi^T, \varphi_{1|1}, \varphi)^T$ collect all the estimating equations. The Entropy Balancing estimator $\hat\gamma^{EB}$ is the solution to

$$\frac{1}{n} \sum_{i=1}^n \zeta(X_i, T_i, Y_i; m, \theta, \mu(1|1), \gamma) = 0. \tag{19}$$

There are two forms of “information” matrix that need to be computed. The first is

$$\begin{aligned} A^{EB}(\xi^*) &= E\Big[\frac{\partial}{\partial \xi^T} \zeta(X, T, Y; \xi^*)\Big] \\ &= E \begin{pmatrix} -T I_p & 0 & 0 & 0 \\ -(1-T)\, l(X)\, I_p & (1-T)\, l(X) \big(c(X) - \bar c^*(1)\big) c(X)^T & 0 & 0 \\ 0^T & 0^T & -T & 0 \\ 0^T & (1-T)\, l(X) \big(Y(0) - \mu(0|1)\big) c(X)^T & -(1-T)\, l(X) & (1-T)\, l(X) \end{pmatrix} \\ &= \pi \begin{pmatrix} -I_p & 0 & 0 & 0 \\ -I_p & \mathrm{Cov}[c(X) \mid T = 1] & 0 & 0 \\ 0^T & 0^T & -1 & 0 \\ 0^T & \mathrm{Cov}(Y(0), c(X) \mid T = 1) & -1 & 1 \end{pmatrix}. \end{aligned}$$

A very useful identity in the computation of these expectations is

$$E[f(X, Y) \mid T = 1] = \pi^{-1} E[e(X) f(X, Y)] = \frac{P(T = 0)}{\pi}\, E\Big[\frac{e(X)}{1 - e(X)}\, f(X, Y) \,\Big|\, T = 0\Big].$$

The second information matrix is the covariance of $\zeta(X, T, Y; \xi^*)$. Denote $\tilde Y(t) = Y(t) - \mu(t|1)$, $t = 0, 1$. Then

$$B^{EB}(\xi^*) = E[\zeta \zeta^T] = E \begin{pmatrix} T \tilde c(X) \tilde c(X)^T & 0 & T \tilde Y(1) \tilde c(X) & 0 \\ 0 & (1-T)\, l(X)^2 \tilde c(X) \tilde c(X)^T & 0 & (1-T)\, l(X)^2 \tilde Y(0) \tilde c(X) \\ T \tilde Y(1) \tilde c(X)^T & 0^T & T \tilde Y(1)^2 & 0 \\ 0^T & (1-T)\, l(X)^2 \tilde Y(0) \tilde c(X)^T & 0 & (1-T)\, l(X)^2 \tilde Y(0)^2 \end{pmatrix}.$$

The asymptotic distribution of $\hat\gamma^{EB}$ is $N(\gamma^*, V_{EB}(\xi^*)/n)$, where $V_{EB}(\xi^*)$ is the bottom-right entry of $A^{EB}(\xi^*)^{-1} B^{EB}(\xi^*) A^{EB}(\xi^*)^{-T}$. Let us denote

$$H_{a_1, a_2} = \mathrm{Cov}(a_1, a_2 \mid T = 1), \qquad G_{a_1, a_2} = E\big[l(X)\, (a_1 - E[a_1 \mid T = 1])(a_2 - E[a_2 \mid T = 1])^T \,\big|\, T = 1\big],$$

and $H_a = H_{a,a}$, $G_a = G_{a,a}$. So

$$A^{EB}(\xi^*) = \pi \begin{pmatrix} -I_p & 0 & 0 & 0 \\ -I_p & H_{c(X)} & 0 & 0 \\ 0^T & 0^T & -1 & 0 \\ 0^T & H_{Y(0), c(X)} & -1 & 1 \end{pmatrix}, \qquad A^{EB}(\xi^*)^{-1} = \pi^{-1} \begin{pmatrix} -I_p & 0 & 0 & 0 \\ -H_{c(X)}^{-1} & H_{c(X)}^{-1} & 0 & 0 \\ 0^T & 0^T & -1 & 0 \\ H_{c(X), Y(0)}^T H_{c(X)}^{-1} & -H_{c(X), Y(0)}^T H_{c(X)}^{-1} & -1 & 1 \end{pmatrix},$$

and

$$B^{EB}(\xi^*) = \pi \begin{pmatrix} H_{c(X)} & 0 & H_{c(X), Y(1)} & 0 \\ 0 & G_{c(X)} & 0 & G_{c(X), Y(0)} \\ H_{Y(1), c(X)} & 0^T & H_{Y(1)} & 0 \\ 0^T & G_{Y(0), c(X)} & 0 & G_{Y(0)} \end{pmatrix}.$$

Thus

$$V_{EB} = \pi^{-1} \Big[ H_{c,0}^T H_c^{-1} \big( H_{c,0} + G_c H_c^{-1} H_{c,0} - 2 G_{c,0} - 2 H_{c,1} \big) + H_1 + G_0 \Big],$$

where we abbreviate $H_c = H_{c(X)}$, $H_{c,t} = H_{c(X), Y(t)}$, $H_t = H_{Y(t)}$, and similarly for $G$.

It would be interesting to compare $V_{EB}(\xi^*)$ with $V_{IPW}(\xi^*)$, the asymptotic variance of $\hat\gamma^{IPW}$. The IPW PATT estimator (2) is equivalent to solving the following estimating equations:

$$\sum_{i=1}^n \Big( T_i - \frac{1}{1 + e^{-\sum_{k=1}^p \theta_k c_k(X_i)}} \Big)\, c_j(X_i) = 0, \quad j = 1,\dots,p,$$

$$\frac{1}{n} \sum_{i=1}^n \varphi_{1|1}(X_i, T_i, Y_i; \theta, \mu(1|1), \gamma) = 0, \qquad \frac{1}{n} \sum_{i=1}^n \varphi(X_i, T_i, Y_i; \theta, \mu(1|1), \gamma) = 0.$$

If we let $K_{a_1, a_2} = E[(1 - e(X))\, a_1 a_2^T \mid T = 1]$ and $K_a = K_{a,a}$, we have

$$A^{IPW}(\xi^*) = E \begin{pmatrix} -e(X)(1 - e(X))\, c(X) c(X)^T & 0 & 0 \\ 0^T & -T & 0 \\ (1-T)\, l(X) \tilde Y(0) c(X)^T & -(1-T)\, l(X) & (1-T)\, l(X) \end{pmatrix} = \pi \begin{pmatrix} -K_{c(X)} & 0 & 0 \\ 0^T & -1 & 0 \\ H_{Y(0), c(X)} & -1 & 1 \end{pmatrix},$$

$$A^{IPW}(\xi^*)^{-1} = \pi^{-1} \begin{pmatrix} -K_{c(X)}^{-1} & 0 & 0 \\ 0^T & -1 & 0 \\ H_{Y(0), c(X)} K_{c(X)}^{-1} & -1 & 1 \end{pmatrix}.$$

Letting $q(X) = e(X)\, l(X)$,

$$\begin{aligned} B^{IPW}(\xi^*) &= E \begin{pmatrix} (T - e(X))^2 c(X) c(X)^T & T(T - e(X)) \tilde Y(1) c(X) & -(1-T)\, q(X) \tilde Y(0) c(X) \\ T(T - e(X)) \tilde Y(1) c(X)^T & T \tilde Y(1)^2 & 0 \\ -(1-T)\, q(X) \tilde Y(0) c(X)^T & 0 & (1-T)\, l(X)^2 \tilde Y(0)^2 \end{pmatrix} \\ &= \pi \begin{pmatrix} K_{c(X)} & K_{c(X), \tilde Y(1)} & K_{c(X), \tilde Y(0)} - H_{c(X), Y(0)} \\ K_{c(X), \tilde Y(1)}^T & H_{Y(1)} & 0 \\ \big(K_{c(X), \tilde Y(0)} - H_{c(X), Y(0)}\big)^T & 0 & G_{Y(0)} \end{pmatrix}. \end{aligned}$$

$V_{IPW}$ can then be computed analogously; the details are omitted.
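The sandwich computations above follow the standard M-estimation recipe of Stefanski and Boos [35]: estimate $A$ by the average Jacobian of the estimating function, $B$ by the average outer product, and form $A^{-1} B A^{-T}/n$. A generic numerical sketch (the helper function is our own illustration, using central finite differences for the Jacobian):

```python
import numpy as np

def sandwich_variance(zeta, xi_hat, data, eps=1e-5):
    """Generic M-estimation sandwich variance.

    zeta(obs, xi) returns the stacked estimating function for one
    observation; xi_hat solves (1/n) sum_i zeta(obs_i, xi) = 0.
    Returns the estimated covariance matrix of xi_hat."""
    n, k = len(data), len(xi_hat)
    A = np.zeros((k, k))
    B = np.zeros((k, k))
    for obs in data:
        z = zeta(obs, xi_hat)
        B += np.outer(z, z)
        # numerical Jacobian of zeta with respect to xi, column by column
        for j in range(k):
            step = np.zeros(k)
            step[j] = eps
            A[:, j] += (zeta(obs, xi_hat + step)
                        - zeta(obs, xi_hat - step)) / (2 * eps)
    A /= n
    B /= n
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv.T / n
```

As a sanity check, for the sample mean with $\zeta(x; \xi) = x - \xi$, the sandwich formula reduces to the usual variance estimate $\hat\sigma^2 / n$.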

References

1. Rosenbaum P, Rubin D. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70(1):41–55.

2. Abadie A, Imbens GW. Large sample properties of matching estimators for average treatment effects. Econometrica 2006;74(1):235–267.

3. Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985;39(1):33–38.

4. Rosenbaum P, Rubin D. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc 1984;79:516–524.

5. Robins JM, Rotnitzky A, Zhao L. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 1994;89:846–866.

6. Hirano K, Imbens G. Estimation of causal effects using propensity score weighting: an application to data on right heart catheterization. Health Serv Outcomes Res Method 2001;2:259–278.

7. Hirano K, Imbens GW, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 2003;71(4):1161–1189.

8. Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 1998;66(2):315–332.

9. Kang JD, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007;22(4):523–539.

10. Imai K, King G, Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. J Royal Stat Soc Ser A 2008;171(2):481–502.

11. Diamond A, Sekhon JS. Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies. Rev Econ Stat 2013;95(3):932–945.

12. Graham BS, Pinto CC, Egel D. Inverse probability tilting for moment condition models with missing data. Rev Econ Stud 2012;79(3):1053–1079.

13. Imai K, Ratkovic M. Covariate balancing propensity score. J Royal Stat Soc Ser B 2014;76(1):243–263.

14. Zubizarreta JR. Stable weights that balance covariates for estimation with incomplete outcome data. J Am Stat Assoc 2015;110(511):910–922.

15. Hainmueller J. Entropy balancing for causal effects: a multivariate reweighting method to produce balanced samples in observational studies. Political Anal 2011;20:25–46.

16. Ferwerda J. Electoral consequences of declining participation: a natural experiment in Austria. Elect Stud 2014;35:242–252.

17. Marcus J. The effect of unemployment on the mental health of spouses: evidence from plant closures in Germany. J Health Econ 2013;32(3):546–558.

18. Tan Z. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 2010;97(3):661–682.

19. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61(4):962–973.

20. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med 2004;23(19):2937–2960.

21. Särndal C-E, Lundström S. Estimation in surveys with nonresponse. John Wiley & Sons, 2005.

22. Deville J-C, Särndal C-E. Calibration estimators in survey sampling. J Am Stat Assoc 1992;87(418):376–382.

23. Chan KC, Yam SC, Zhang Z. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. J Royal Stat Soc Ser B 2016;78:673–700.

24. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 2007;102(477):359–378.

25. Hastie TJ, Tibshirani RJ. Generalized additive models. Vol. 43. CRC Press, 1990.

26. Zhao Q. Covariate balancing propensity score by tailored loss functions. 2016. arXiv:1601.05890.

27. Neyman J. Sur les applications de la théorie des probabilités aux expériences agricoles: essai des principes. Excerpts reprinted in English (1990). Stat Sci 1923;5:463–472.

28. Rubin D. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974;66(5):688–701.

29. Holland PW. Statistics and causal inference. J Am Stat Assoc 1986;81:945–960.

30. Cover TM, Thomas JA. Elements of information theory. John Wiley & Sons, 2012.

31. Qin J, Zhang B. Empirical-likelihood-based inference in missing response problems and its application in observational studies. J Royal Stat Soc Ser B 2007;69(1):101–122.

32. Tan Z. A distributional approach for causal inference using propensity scores. J Am Stat Assoc 2006;101:1619–1637.

33. Wang Q, Rao JNK. Empirical likelihood-based inference under imputation for missing response data. Ann Stat 2002;30(3):896–924.

34. Robins J, Sued M, Lei-Gomez Q, Rotnitzky A. Comment: performance of double-robust estimators when inverse probability weights are highly variable. Stat Sci 2007;22(4):544–559.

35. Stefanski LA, Boos DD. The calculus of M-estimation. Am Stat 2002;56(1):29–38.

36. Ridgeway G, McCaffrey DF. Comment: demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007;22(4):540–543.

37. Silvapulle MJ. On the existence of maximum likelihood estimators for the binomial response models. J Royal Stat Soc Ser B 1981;43(3):310–313.

38. Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984;71(1):1–10.

Published Online: 2016-11-15

© 2017 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
