Robust propensity score weighting estimation under missing at random

Missing data is frequently encountered in many areas of statistics. Propensity score weighting is a popular method for handling missing data. The propensity score method employs a response propensity model, but correct specification of the statistical model can be challenging in the presence of missing data. Doubly robust estimation is attractive, as the consistency of the estimator is guaranteed when either the outcome regression model or the propensity score model is correctly specified. In this paper, we first employ information projection to develop an efficient and doubly robust estimator under indirect model calibration constraints. The resulting propensity score estimator can be equivalently expressed as a doubly robust regression imputation estimator by imposing the internal bias calibration condition in estimating the regression parameters. In addition, we generalize the information projection to allow for outlier-robust estimation. Some asymptotic properties are presented. The simulation study confirms that the proposed method allows robust inference against not only the violation of various model assumptions, but also outliers. A real-life application is presented using data from the Conservation Effects Assessment Project.


Introduction
Missing data represents a fundamental challenge in statistics. Ignoring missing data can lead to biased estimates of parameters, loss of information, decreased statistical power, increased standard errors, and weak generalizability of findings, as highlighted by Dong and Peng (2013). In addition, the missing data framework is very useful in defining problems in other research areas, such as causal inference or offline policy evaluation. Propensity score (PS) weighting is a popular method to handle missing data. It often employs a response propensity model, but correct specification of the statistical model can be challenging in the presence of missing data. How to make the propensity score weighting method less dependent on the response propensity model is an important practical problem.
In the quest for robust PS estimation, we find two distinct avenues. One approach embraces flexibility, employing nonparametric or semiparametric models to construct robust propensity scores. The nonparametric kernel method (Hahn, 1998), the sieve logistic regression method (Hirano et al., 2003), the general calibration method of Chan et al. (2016) using an increasing dimension of basis functions, the generalized covariate balancing estimator using tailored loss functions (Zhao, 2019), and the random forest approach of Dagdoug et al. (2023) are examples of robust propensity score estimation methods using flexible models. The other path, which we explore in this paper, leans on the outcome regression model explicitly to achieve doubly robust estimation. The literature is replete with investigations in this arena, with notable contributors including Bang and Robins (2005), Cao et al. (2009), Kim and Haziza (2014), Han and Wang (2013), Chen and Haziza (2017), and Yang et al. (2020).
Our journey takes the latter route as we construct a unified framework for doubly robust PS estimation under the setup of missing at random (Rubin, 1976). To achieve the goal, we apply the information projection (Csiszár and Shields, 2004) to the PS weighting problem, while adhering to covariate-balancing constraints. Specifically, the initial PS weights are constructed from the working PS model, but the balancing constraints, crucial for achieving double robustness in the PS weighting estimator, are derived from the working outcome regression model. This maneuver can be perceived as indirect model calibration. Information projection is used to obtain the augmented PS model.
Remarkably, the PS weighting estimator obtained from information projection can also be cast as a special type of regression imputation estimator, incorporating the balancing functions as covariates in the outcome regression model, with the regression coefficients computed using the final PS weights. This algebraic equivalence, known as self-efficiency, forms the cornerstone of our journey toward outlier-robust PS estimation. In practice, outliers do exist, and their presence can significantly undermine the efficiency of estimators. While outlier-robust procedures are well studied in the literature on regression or classification (Stefanski et al., 1986; Wu and Liu, 2007; Zhang et al., 2010), their application in the context of missing data remains underexplored. To the best of our knowledge, the construction of an outlier-robust PS weighting estimator has not been investigated in the literature.
To fill this important research gap, we embark on the creation of an outlier-robust regression imputation technique and subsequently align it with a PS weighting estimator by imposing the self-efficiency condition. In doing so, we set forth on a path previously untraveled in the literature: a journey toward constructing an outlier-robust PS estimator. Specifically, the information projection approach to developing the augmented PS model and obtaining doubly robust PS estimation is based on the Kullback-Leibler divergence. The γ-power divergence, originally proposed by Basu et al. (1998) and further developed by Eguchi (2021), is a generalization of the Kullback-Leibler divergence that expands the class of statistical models by introducing an additional scale parameter γ. Furthermore, it is well known that the statistical model derived from the γ-power divergence produces robust inferences against outliers. We adopt the γ-power divergence to develop an outlier-robust regression imputation estimator. By self-efficiency, the regression estimator can be expressed as a PS weighting estimator. Therefore, we obtain an outlier-robust PS weighting estimator. The resulting estimator is also robust against misspecification of the response probability model.
The structure of the paper is as follows. In Section 2, the basic setup and the research problem are introduced. In Section 3, we develop a balancing constraint to adjust propensity weights and construct an augmented propensity model using information projection. In Section 4, we establish the asymptotic properties of the proposed estimator. In Section 5, we present the use of the γ-power divergence to enlarge the class of propensity score models and develop outlier-robust doubly robust estimators. Results from two limited simulation studies are presented in Section 6. A real-life application using data from the Conservation Effects Assessment Project is presented in Section 7. Some concluding remarks are made in Section 8. All required proofs are presented in the Appendix.

Basic Setup
Suppose that there are n independently and identically distributed realizations of (X, Y, δ), denoted by {(x_i, y_i, δ_i) : i = 1, . . . , n}, where x_i is a vector of observed covariates and δ_i is the missingness indicator associated with unit i. In particular, δ_i = 1 if y_i is observed and δ_i = 0 otherwise. Thus, instead of observing (x_i, y_i, δ_i), we only observe (x_i, δ_i, δ_i y_i) for i = 1, . . . , n. We are interested in estimating θ = E(Y) from the observed sample.
Suppose that we have a working outcome regression (OR) model given by

E(Y | x) = β_0 b_0(x) + β_1 b_1(x) + · · · + β_L b_L(x)  (2.1)

for some given basis functions b_1(x), . . . , b_L(x) and b_0(x) = 1. The model (2.1) can be expressed equivalently as y_i = Σ_{j=0}^{L} β_j b_j(x_i) + ε_i, where ε_i is an error term that is independent of x_i and satisfies E(ε_i) = 0. Furthermore, the assumption of missing at random (Rubin, 1976) implies that ε_i and δ_i are conditionally independent given x_i. We are interested in using the following propensity score weighting (PSW) estimator:

θPSW = n^{-1} Σ_{i=1}^{n} δ_i ω_i y_i,  (2.2)

where ω_i is the propensity weight assigned to unit i. The following lemma provides a sufficient condition for the unbiasedness of the propensity score weighting estimator under the model (2.1).
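As a toy illustration of the PSW estimator in (2.2), the following sketch (a hypothetical data-generating process, not the paper's simulation design) compares the inverse-probability-weighted mean with the complete-case mean when the true response probability π(x) is known:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (illustration only):
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)       # outcome model, E(Y) = 1
pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))        # true response probability
delta = rng.binomial(1, pi)                  # missingness indicator

# PSW estimator with weights omega_i = 1 / pi(x_i)
theta_psw = np.mean(delta / pi * y)
theta_cc = y[delta == 1].mean()              # complete-case mean, biased here

print(theta_psw, theta_cc)
```

Because units with large x respond more often and also have larger y, the complete-case mean is biased upward, while the weighted mean recovers E(Y) = 1.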
Lemma 2.1. Assume that the response mechanism is missing at random. If the weights ω_i satisfy the covariate-balancing condition

Σ_{i=1}^{n} δ_i ω_i b(x_i) = Σ_{i=1}^{n} b(x_i),  (2.3)

where b(x) = (b_0(x), b_1(x), . . . , b_L(x))^T, then the PSW estimator in (2.2) is unbiased for θ under the OR model in (2.1).
By Lemma 2.1, condition (2.3) is the key condition that incorporates the OR model into the PSW estimator. Condition (2.3) is often called the covariate balancing condition (Imai and Ratkovic, 2014) in the missing data literature. It is closely related to calibration estimation in survey sampling (Deville and Särndal, 1992). Under the OR model in (2.1), the covariate-balancing condition in (2.3) implies that Σ_{i=1}^{n} δ_i ω_i m(x_i) = Σ_{i=1}^{n} m(x_i), where m(x) = E(Y | x), which is often referred to as model calibration (Wu and Sitter, 2001). To distinguish it from the direct model calibration of Wu and Sitter (2001), we may call (2.3) indirect model calibration.
The indirect model calibration condition also implies the following algebraic equivalence.
Lemma 2.2. If the weights ω_i satisfy (2.3), then we have

n^{-1} Σ_{i=1}^{n} δ_i ω_i y_i = n^{-1} Σ_{i=1}^{n} b^T(x_i) β,  (2.5)

where β satisfies

Σ_{i=1}^{n} δ_i ω_i {y_i − b^T(x_i) β} b(x_i) = 0.  (2.6)

Lemma 2.2 means that, under (2.6), the PSW estimator that satisfies the covariate-balancing condition is algebraically equivalent to a regression imputation estimator using the balancing functions as covariates in the outcome regression model. In other words, regression imputation under the outcome regression model in (2.1) can be viewed as a PSW estimator in which the propensity score weights ω_i and the estimated regression coefficients β are related by equation (2.6). Equation (2.5) means that the final propensity score weights ω_i do not directly use the outcome regression model (2.1) for imputation, but implement regression imputation indirectly. Condition (2.5) is referred to as the self-efficiency condition.
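The algebraic equivalence in Lemma 2.2 can be checked numerically. The sketch below (hypothetical data; initial weights d_i = 1 for simplicity) tilts the weights as ω_i = d_i exp{b^T(x_i)λ}, one convenient family that can satisfy (2.3), solves for λ by Newton's method, and then verifies that the PSW estimator equals the regression imputation estimator built from the weighted normal equations (2.6):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
delta = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))

b = np.column_stack([np.ones(n), x])      # basis functions b(x) = (1, x)
d = np.ones(n)                            # hypothetical initial weights

# Newton: find lam with sum_i delta_i d_i exp(b_i^T lam) b_i = sum_i b_i
lam = np.zeros(2)
target = b.sum(axis=0)
for _ in range(50):
    w = d * np.exp(b @ lam)
    g = (delta * w) @ b - target          # balancing residual
    H = (b * (delta * w)[:, None]).T @ b  # Jacobian (positive definite)
    lam -= np.linalg.solve(H, g)
omega = d * np.exp(b @ lam)

# regression coefficients solving (2.6): sum delta_i omega_i (y_i - b_i^T beta) b_i = 0
W = delta * omega
beta = np.linalg.solve((b * W[:, None]).T @ b, (b * W[:, None]).T @ y)

theta_psw = np.mean(delta * omega * y)    # PSW form (2.2)
theta_imp = np.mean(b @ beta)             # imputation form (2.5)
print(theta_psw, theta_imp)               # identical up to floating point
```

The equality follows exactly as in the lemma: the b_0 = 1 component of (2.6) plus the balancing condition (2.3) force the two forms to coincide.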
We now wish to achieve double robustness by including the PS model. Suppose that we have the following working PS model for some ϕ_0 with a known function π_1(•) ∈ (0, 1). An example is the logistic regression model with logit{π_1(x; ϕ_0)} = x^T ϕ_0. Under the working propensity score model, the PS weights are computed as ω_i = {π_1(x_i; φ)}^{-1} := d_i, where φ is the maximum likelihood estimator (MLE) of ϕ. However, the PS weight d_i does not necessarily satisfy the balancing condition in (2.3). Therefore, to reduce the bias due to model misspecification in the propensity score model, it makes sense to impose the balancing condition in the final weighting. To achieve this goal, Hainmueller (2012) proposed the so-called entropy balancing method that minimizes the distance measure in (2.7) subject to the balancing constraint in (2.3). Chan et al. (2016) generalized this idea further to develop a general calibration weighting method that satisfies the covariate balancing property with increasing dimensions of the basis functions b(x). However, the choice of distance measure is somewhat unclear. In the next section, we adopt the information projection approach to develop a unified approach to doubly robust propensity score weighting.

Proposed method
We now develop an alternative approach to modifying the initial propensity score weights to satisfy the covariate balancing condition in (2.3). The proposed approach consists of two parts. One is the modeling part and the other is the parameter estimation part. In the modeling part, we use information projection innovatively to obtain the propensity score model incorporating the moment constraints from the outcome regression model. Information projection is a powerful tool for incorporating the moment constraints into the final model.

Information projection
Instead of using a distance measure for propensity weights, as in (2.7), we use the Kullback-Leibler divergence measure to develop the information projection under a moment constraint obtained from the outcome regression model. Because the Kullback-Leibler divergence is defined in terms of density functions, we need to formulate the weighting problem as an optimization problem for density functions.
To apply the information projection, using Bayes' theorem, we obtain

π_1(x) = P(δ = 1 | x) = p f_1(x) / {p f_1(x) + (1 − p) f_0(x)},

where p = P(δ = 1), and f_k(x) is the density function for x given δ = k, for k = 0, 1. Thus, the propensity score (PS) weight can be written as

ω(x) = 1/π_1(x) = 1 + c f_0(x)/f_1(x),  (3.1)

where c = 1/p − 1. For a fixed f_1, the propensity weight function ω(x) is completely determined by f_0. Furthermore, assume that the basis function b(x) in the outcome regression model in (2.1) is integrable in the sense that

E{b(X)} = p E_1{b(X)} + (1 − p) E_0{b(X)} < ∞,  (3.2)

where E_k{b(X)} = ∫ b(x) f_k(x) dμ(x) for k = 0, 1 and μ is the Lebesgue measure. Thus, for a given f_1, equation (3.2) can serve as a constraint on f_0 when E{b(X)} is known. Let π^(0)_1(x) be the PS function corresponding to a "working" model for π_1(x) = P(δ = 1 | x). For a fixed f_1, by (3.1), we can define

f^(0)_0(x) = c^{-1} f_1(x) {1/π^(0)_1(x) − 1}.

We can treat f^(0)_0 as the baseline density for f_0 derived from the working PS function π^(0)_1(x). Our objective is to modify f^(0)_0 to satisfy (3.2). Finding the density function f_0 satisfying the balancing condition in (3.2) can be formulated as an optimization problem using the Kullback-Leibler divergence. Given the baseline density f^(0)_0, the Kullback-Leibler divergence of f_0 at f^(0)_0 is defined by

D_KL(f_0 ∥ f^(0)_0) = ∫ f_0(x) log{f_0(x)/f^(0)_0(x)} dμ(x).

Given f^(0)_0, we wish to find the minimizer of D_KL(f_0 ∥ f^(0)_0) subject to (3.2). The Lagrangian function can be formulated as in (3.5), where λ is the Lagrange multiplier. As f_1 is fixed and E{b(X)} is known, by taking the derivative with respect to f_0 in (3.5), one arrives at f*_0(x; λ) ∝ f^(0)_0(x) exp{b^T(x)λ}. Furthermore, the density constraint ∫ f*_0(x; λ) dμ(x) = 1 leads to

f*_0(x) = f^(0)_0(x) exp{b^T(x)λ} / ∫ f^(0)_0(z) exp{b^T(z)λ} dμ(z),  (3.6)

with abuse of notation for λ, since b_0(x) = 1 allows the normalizing constant to be absorbed into λ_0. By (3.6), we obtain

ω*(x) = 1 + {1/π^(0)_1(x) − 1} exp{b^T(x)λ}.  (3.7)

Combining the above results, we can establish the following lemma.

Figure 1: Graphical illustration of obtaining π*_1(x) in (3.8) using the information projection for f_0, where the line running halfway across the right circle represents the space of densities satisfying the covariate-balancing constraints.
Lemma 3.1. The augmented PS function corresponding to the minimizer f*_0 in (3.6) is

π*_1(x) = [1 + {1/π^(0)_1(x) − 1} exp{b^T(x)λ}]^{-1}.  (3.8)

In (3.8), the Lagrange multiplier λ is determined to satisfy the balancing condition (3.2). If the working propensity score model is correct, then the balancing constraint (3.2) is already satisfied. In this case, we have λ_j ≡ 0 for all j = 1, . . . , L. Thus, the information projection is used to obtain the augmented propensity score model (Kim and Riddles, 2012).
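The I-projection itself is easy to visualize on a discrete support: tilting a baseline density f_0 by exp{λ b(x)} and solving for λ reproduces the required moment exactly. A minimal sketch with b(x) = x and an assumed target moment (toy numbers, not from the paper):

```python
import numpy as np

# discrete baseline density f0 on a grid (toy example)
xs = np.linspace(-3, 3, 61)
f0 = np.exp(-0.5 * xs**2)
f0 /= f0.sum()

m_target = 0.5                          # assumed required moment E{b(X)} with b(x) = x

# I-projection: f*(x) proportional to f0(x) exp(lam * x); solve for lam by Newton
lam = 0.0
for _ in range(50):
    f = f0 * np.exp(lam * xs)
    f /= f.sum()
    mean = (f * xs).sum()
    var = (f * xs**2).sum() - mean**2   # derivative of the tilted mean in lam
    lam += (m_target - mean) / var
f_star = f0 * np.exp(lam * xs)
f_star /= f_star.sum()

print((f_star * xs).sum())              # matches m_target = 0.5
```

The tilted density stays as close as possible to f_0 in Kullback-Leibler divergence while meeting the moment constraint, which is exactly the mechanism behind (3.6).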
Figure 1 presents a graphical summary of obtaining the augmented PS model in (3.8). Using the relationship in (3.1), we can transform π_1(x) to f_0(x) and then apply the information projection to obtain f*_0(x) in (3.6), which in turn yields π*_1(x) in (3.8) by applying (3.1) again.
In practice, the basis functions for the "working" outcome model are chosen by standard model selection techniques. For example, Shortreed and Ertefaie (2017) used the adaptive Lasso to select the covariates in the outcome regression model.

Model parameter estimation
We now discuss parameter estimation in the augmented propensity score model in (3.8). Suppose that the working propensity score model is given by π^(0)_1(x; ϕ) with unknown parameter ϕ. We can estimate ϕ by maximizing the log-likelihood of the working model, and the final PS weights are

ω_i = 1 + (d_i − 1) exp{b^T(x_i) λ},  (3.9)

where d_i = {π^(0)_1(x_i; φ)}^{-1} and λ_0, λ_1, . . . , λ_L are computed from the calibration equation

Σ_{i=1}^{n} δ_i ω_i b(x_i) = Σ_{i=1}^{n} b(x_i).  (3.10)

By construction, the PS weights in (3.9) satisfy the covariate-balancing property on the basis functions in the outcome regression model in (2.1). By Lemma 2.1, the covariate balancing property implies unbiasedness of the PSW estimator under the OR model. On the other hand, if the working PS model is correct, then λ_j → 0 in probability as n → ∞ for j = 0, 1, . . . , L. In this case, ω_i will converge to d_i and the PSW estimator is consistent under the PS model. As discussed in Section 2, the propensity score estimator satisfying the indirect model calibration condition can be expressed as a regression imputation estimator. Since the ω_i satisfy (3.10), by Lemma 2.2, we can obtain

θAPSW := n^{-1} Σ_{i=1}^{n} δ_i ω_i y_i = n^{-1} Σ_{i=1}^{n} ŷ_i,  (3.11)

where ŷ_i = b^T_i β and

β = {Σ_{i=1}^{n} δ_i ω_i b_i b^T_i}^{-1} Σ_{i=1}^{n} δ_i ω_i b_i y_i.  (3.12)

Note that, by (3.9), β in (3.12) can be expressed as the minimizer of

Q(β) = Σ_{i=1}^{n} δ_i {1 + (d_i − 1) ĝ_i} (y_i − b^T_i β)²,  (3.13)

where ĝ_i = exp(b^T_i λ). The first term, d_i − 1, is the adjustment term that incorporates the working PS model into the regression imputation. The second term, ĝ_i = exp(b^T_i λ), achieves covariate balance in the PS weights, which is a sufficient condition for self-efficiency. Now, if g_i = 1 is used in (3.13), then the self-efficiency in (3.11) will not be satisfied. Instead, we only obtain

n^{-1} Σ_{i=1}^{n} δ_i d_i (y_i − ŷ_i) = 0.  (3.14)

Condition (3.14) is called the internal bias calibration (IBC), a term originally coined by Firth and Bennett (1998) in the context of design-consistent estimation of model parameters under complex sampling. The imputation estimator satisfying the IBC condition (3.14) achieves consistency even when the outcome regression model is incorrect, as long as the working PS model is correct. Thus, the IBC condition is a sufficient condition for the double robustness of the regression imputation estimator.
For the choice of ĝ_i = exp(b^T_i λ), under the PS model, we obtain ĝ_i → 1 in probability as n → ∞. Thus, the IBC condition in (3.14), or the double robustness of the regression imputation estimator, holds approximately.
Equation (3.11) deserves further discussion. In the PSW estimation, for a given φ, the final estimator θPSW is a function of the estimated nuisance parameter λ, while the regression imputation estimator is a function of another estimated nuisance parameter β. Thus, λ and β are in one-to-one correspondence with each other through the estimating equation

Σ_{i=1}^{n} δ_i ω_i(λ) {y_i − b^T_i β} b(x_i) = 0.

In summary, the proposed method can be implemented in the following steps.
1. Specify a working PS model π^(0)_1(x; ϕ) and estimate ϕ using the ML estimation method to get d_i = {π^(0)_1(x_i; φ)}^{-1}.
2. Specify a working OR model in (2.1) to construct the augmented PS model in (3.8).
3. Solve the calibration equation in (3.10) for λ to obtain the final PS weights ω_i in (3.9), and compute the augmented PSW estimator θAPSW in (3.11).
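The steps above can be sketched end-to-end as follows (hypothetical data; the working logistic PS model is deliberately misspecified while the OR model is correct, so the calibration step is what restores consistency):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)                       # OR model holds, b(x) = (1, x)
pi_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.7 * np.abs(x))))     # true PS, NOT linear-logistic
delta = rng.binomial(1, pi_true)

Z = np.column_stack([np.ones(n), x])   # covariates of the (misspecified) working PS model
b = np.column_stack([np.ones(n), x])   # basis functions of the OR model

# Step 1: MLE of the working logistic PS model by Newton-Raphson.
phi = np.zeros(2)
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-(Z @ phi)))
    phi += np.linalg.solve((Z * (p * (1 - p))[:, None]).T @ Z, Z.T @ (delta - p))
d = 1.0 + np.exp(-(Z @ phi))           # d_i = 1 / pi^(0)(x_i; phi-hat)

# Steps 2-3: augmented weights omega_i = 1 + (d_i - 1) exp(b_i^T lam),
# with lam solving the calibration equation (3.10) by Newton's method.
lam = np.zeros(2)
target = b.sum(axis=0)
for _ in range(50):
    g_i = np.exp(b @ lam)
    omega = 1.0 + (d - 1.0) * g_i
    resid = (delta * omega) @ b - target
    J = (b * (delta * (d - 1.0) * g_i)[:, None]).T @ b
    lam -= np.linalg.solve(J, resid)
omega = 1.0 + (d - 1.0) * np.exp(b @ lam)

theta_apsw = np.mean(delta * omega * y)
print(theta_apsw)   # close to E(Y) = 1 even though the PS model is misspecified
```

Because the OR model is correct and the final weights satisfy (3.10), Lemma 2.1 guarantees (approximate) unbiasedness despite the wrong working PS model; this is one half of the double robustness.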

Statistical properties
We now formally describe the asymptotic properties of the augmented PSW estimator using the final propensity score weight (3.9) with the calibration constraint in (3.10). By the algebraic equivalence established in Lemma 2.2, the result is directly applicable to the regression imputation estimator using the regression coefficient in (3.12).
Let π^(0)_1(x; ϕ) be the working propensity score model, where ϕ ∈ R^p. We can estimate ϕ by solving

Û_1(ϕ) := n^{-1} Σ_{i=1}^{n} {δ_i − π^(0)_1(x_i; ϕ)} h(x_i; ϕ) = 0  (4.1)

for some h(x; ϕ) such that the solution to (4.1) exists uniquely almost everywhere. The estimating equation (4.1) for ϕ includes the score equation for ϕ as a special case. For a given φ, let Û_2(λ | φ) be the estimating equation for λ. By (2.3), we can estimate λ by solving

Û_2(λ | φ) := n^{-1} Σ_{i=1}^{n} {δ_i ω_i(λ | φ) − 1} b(x_i) = 0,  (4.2)

where ω_i(λ | φ) = 1 + (d_i − 1) exp{b^T(x_i) λ}. We introduce the following assumptions before the statement of our main theorem.
Assumptions 4.1-4.3 are commonly used in estimating equation theory and also ensure the existence of λ* and ϕ*; see Newey and McFadden (1994) and Kim and Rao (2009) for more details. Note that λ* and ϕ* are the probability limits of λ and φ, respectively. Assumption 4.3 also guarantees the existence of β*, which is the probability limit of β in (3.12). Assumption 4.4 ensures the existence of the nuisance parameter κ* defined in the following Theorem 4.1, which helps us represent the influence function for the proposed estimator. If maximum likelihood estimation is used to estimate ϕ, then ϕ* can be understood as the minimizer of the cross-entropy

−E[π_1(X) log π^(0)_1(X; ϕ) + {1 − π_1(X)} log{1 − π^(0)_1(X; ϕ)}]

with respect to ϕ, where π_1(x) = P(δ = 1 | x) is the true response probability. Now, using (4.3), we can treat θAPSW as a function of (φ, λ) and apply the standard Taylor linearization to obtain the following theorem, whose proof is presented in the Appendix.
In (4.6), η(x, y, δ) is called the influence function of θAPSW (Tsiatis, 2006). Note that, in the expansion of η(x, y, δ), the second term is equal to zero by the definition of λ* and the third term is equal to zero by the definition of ϕ*. If the outcome regression model is correctly specified, the influence function simplifies further, where the second equality in the simplification follows from the definition of λ*. Furthermore, the correct specification of the outcome regression model gives κ* = 0. Thus, we can summarize the asymptotic result under the outcome regression model as follows.
Corollary 4.1. Suppose that the regularity conditions in Theorem 4.1 hold and the outcome regression model is correctly specified. Then √n(θAPSW − θ) is asymptotically normal, with the influence function reduced to its form under the correct OR model.

Now, consider the case where the propensity score model is correctly specified. In this case, we obtain λ* = 0 and π^(0)_1(x; ϕ*) = π_1(x). Therefore, we can obtain the following result.
Corollary 4.2. Suppose that the regularity conditions in Theorem 4.1 hold and the propensity score model is correctly specified. Then √n(θAPSW − θ) is asymptotically normal, with the influence function reduced to its form under the correct PS model. Thus, the efficiency gain from using the estimated propensity score function over the true one is visible only when the PS model is true but the OR model is incorrect.
If the two models, the OR model and the PS model, are both correctly specified, then the two variance forms coincide and are equal to the lower bound of the semiparametric efficient estimator considered in Robins et al. (1994). The influence function in (4.6) can also be used to develop a linearized variance estimation formula for θAPSW, regardless of whether the outcome regression model or the propensity score model holds.
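For reference, the semiparametric efficiency bound of Robins et al. (1994) under MAR is the standard one below, stated here in the document's notation with m(x) = E(Y | x) and π_1(x) = P(δ = 1 | x):

```latex
\eta_{\mathrm{eff}}(x, y, \delta)
  = m(x) - \theta + \frac{\delta}{\pi_1(x)}\,\{y - m(x)\},
\qquad
V_{\mathrm{eff}}
  = \operatorname{Var}\{m(X)\}
  + E\!\left[\frac{\operatorname{Var}(Y \mid X)}{\pi_1(X)}\right].
```

Both corollaries' variances reduce to V_eff when the corresponding working model coincides with the truth.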
In summary, the influence function in (4.6) reduces to the limits displayed in (4.11) under each model, where π_{1,i} = P(δ = 1 | x_i): the upward arrows point to the limits under the correct OR model, and the downward arrows point to the limits under the correct response probability (RP) model.
In the next section, we address how to obtain the robust propensity score weighting estimator against model misspecification of the propensity score model and outliers in the outcome regression model.

Adding outlier-robustness
In many real cases, outliers exist in addition to missingness. In this case, imposing robustness against outliers is an important practical problem. This section discusses how to allow robust inference against outliers in the outcome variable in the context of doubly robust estimation. In the presence of outliers, one may use a heavy-tailed distribution, such as the t-distribution (Lange et al., 1989), to allow robust inference against outliers. However, it is not straightforward to extend the indirect model calibration condition to the t-distribution. Basu et al. (1998) introduced the density power divergence as a generalization of the Kullback-Leibler divergence that expands the class of statistical models by introducing an additional scale parameter γ. The density power divergence is also called the γ-power divergence (Eguchi, 2021). We can use the γ-power divergence to develop an outlier-robust imputation method. Specifically, we employ the following M-estimator to handle the misspecification of the propensity score model:

β = arg min_β Σ_{i=1}^{n} δ_i {1 + (d_i − 1) g_i} Ψ_γ(y_i − b^T_i β),  (5.1)

where Ψ_γ is an objective function that reduces the effect of outliers, and g_i = exp(b^T_i λ) is the adjustment term that achieves self-efficiency. Thus, under a suitable choice of Ψ_γ, the resulting estimator is not only doubly robust, but also outlier-robust. Note that a Huber-type loss function could be used in (5.1), but it is computationally less attractive as the Huber loss function is non-smooth.
To incorporate the γ-divergence, we consider the minimization problem in (5.2), in which the squared loss is replaced by the γ-power loss of the normal working model; that is, in (5.1), we use the choice of Ψ_γ given in (5.3). As a result, the optimization problem in (5.2) can be solved by the iteratively reweighted least squares method: for a given λ, we update β and σ² by solving the weighted least-squares equations (5.4) and (5.5), with the weights q̂_{γ,i} defined in (5.6). Note that, during the iteratively reweighted least squares procedure, if |y_i − b^T_i β| ≫ 0, then the effect of the i-th subject on the estimation of β is greatly mitigated, as ŵ_{γ,i} becomes smaller, indicating the robustness of our proposed method. The parameter γ > 0 plays the role of a tuning parameter for robust estimation. As the value of γ increases, the estimator becomes more robust but less efficient.
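The reweighting idea can be sketched as follows. Under a normal working model, the γ-power loss yields multiplicative weights proportional to exp{−γ(y_i − b^T_i β)²/(2σ²)}, so large residuals are exponentially down-weighted. The sketch below is a simplified stand-alone version (hypothetical data; σ is fixed at a robust MAD estimate within each iteration rather than updated jointly via (5.5)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[rng.random(n) < 0.1] += 15.0            # roughly 10% gross outliers

b = np.column_stack([np.ones(n), x])
gamma = 0.5

beta = np.linalg.lstsq(b, y, rcond=None)[0]   # OLS start, pulled by outliers
for _ in range(100):
    r = y - b @ beta
    # robust scale: median absolute deviation of the residuals
    sigma = 1.4826 * np.median(np.abs(r - np.median(r)))
    w = np.exp(-gamma * r**2 / (2.0 * sigma**2))   # gamma-power down-weighting
    beta_new = np.linalg.solve((b * w[:, None]).T @ b, (b * w[:, None]).T @ y)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

print(beta)   # near (1, 2) despite the contamination
```

The outliers receive weights that are numerically negligible after a few iterations, so the fit essentially ignores them, which is the behavior the text describes for ŵ_{γ,i}.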
Let β_q and σ²_q be the solution to the estimating equations (5.4) and (5.5), and let q̂_{γ,i} = q_{γ,i}(β_q, σ²_q). We can easily construct a robust imputation with ŷ_i = b^T_i β_q. Furthermore, we can use the robust imputation to construct robust propensity score weights with calibration constraints. Specifically, by (5.4), we can obtain (5.7). By a derivation similar to that of Lemma 2.2, we obtain (5.8), where ŷ_i = b^T_i β_q and the PS weights are given in (5.9). Note that (5.8) is the self-efficiency of the propensity score weights in (5.9). The final PS weights satisfy the calibration constraints in (5.10). Thus, for given q̂_{γ,i}'s, we can compute λ from the calibration constraints in (5.10). That is, we can treat (5.4), (5.5), and (5.10) as the estimating equations for β, σ², and λ. Note that (5.9) is equivalent to (3.9) for q_{γ,i} = 1. The additional factor q_{γ,i}(β, σ²) controls the effect of outliers in the final weighting. Let β* and σ*² be the probability limits of β_q and σ²_q, respectively. The following theorem presents the Taylor linearization for our proposed estimator in (5.8).
Note that the first and last lines have the same structure as in (4.6) of Theorem 4.1. The other two lines are the additional parts of the influence function that reflect the effect of the γ-divergence model: the second line reflects the estimation effect of β_q, and the third line reflects the estimation effect of σ²_q. Regarding the choice of the tuning parameter γ, we can use cross-validation. Specifically, we can randomly partition the sample into K groups (S_1, . . . , S_K) and compute

MSPE(γ) = Σ_{k=1}^{K} Σ_{i∈S_k} δ_i {y_i − b^T_i β^{(−k)}_γ}² / Σ_{i=1}^{n} δ_i,

where β^{(−k)}_γ is obtained by solving the same estimating equations using the sample in S^c_k. The minimizer of MSPE(γ) can be used as the optimal value of the tuning parameter. Furthermore, instead of cross-validation, we can use other approximation-based methods (Sugasawa and Yonekura, 2021; Basak et al., 2020) that are computationally less expensive. As shown in the simulation study in Section 6, the optimal choice of γ is not critical: the simulation results show similar performances across different values of γ.
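A minimal sketch of the K-fold selection of γ (hypothetical fully observed data; the robust fit is a simplified MAD-scale IRLS stand-in for the full estimating equations (5.4)-(5.5)):

```python
import numpy as np

def gamma_fit(b, y, gamma, iters=50):
    """Simplified gamma-power IRLS fit (normal working model, MAD scale)."""
    beta = np.linalg.lstsq(b, y, rcond=None)[0]
    for _ in range(iters):
        r = y - b @ beta
        sigma = 1.4826 * np.median(np.abs(r - np.median(r)))
        w = np.exp(-gamma * r**2 / (2.0 * sigma**2))
        beta = np.linalg.solve((b * w[:, None]).T @ b, (b * w[:, None]).T @ y)
    return beta

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[rng.random(n) < 0.1] += 15.0                 # contaminated outcomes
b = np.column_stack([np.ones(n), x])

K = 5
folds = rng.integers(0, K, size=n)             # random K-fold partition
grid = [0.1, 0.3, 0.5, 0.7, 1.0]
mspe = {}
for g in grid:
    errs = []
    for k in range(K):
        beta = gamma_fit(b[folds != k], y[folds != k], g)   # fit on S_k^c
        errs.append((y[folds == k] - b[folds == k] @ beta) ** 2)
    mspe[g] = float(np.mean(np.concatenate(errs)))
gamma_opt = min(mspe, key=mspe.get)
print(gamma_opt, mspe)
```

As the text notes, the criterion tends to be flat in γ, so nearby values of γ give similar held-out errors.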
For variance estimation, we can use the influence function in (5.12) to construct a linearized variance estimator. Specifically, the linearized variance estimator for θAPSW,γ for a given γ is given in (5.13). Note that β_q, σ²_q, and λ are estimated from (5.4), (5.5), and (5.10), while μ, κ, ν, and ζ can be estimated from the procedure listed in the Appendix. A flowchart of the inference procedure for the robust augmented PSW estimator is presented in Figure 2, where equations (E.2) and (E.6) are the estimating equations presented in Appendix E.


Empirical studies

Simulation study
We examine the performance of the proposed methods under various propensity score models. Let X = (X_1, X_2, X_3, X_4, X_5)^T, where X ∼ N(0, I_{5×5}) follows the standard multivariate normal distribution. The simulation experiment can be described as a 2 × 2 factorial design with two factors. The sample size is 100 and the Monte Carlo sample size is 1,000. The first factor is the outcome regression model (OM1, OM2) and the second factor is the propensity score model (PM1, PM2).
For each sample, we compute the following propensity score estimators.
(i) Mean of the complete cases (CC): θCC = (Σ_{i=1}^{n} δ_i)^{-1} Σ_{i=1}^{n} δ_i Y_i.
(ii) Maximum likelihood estimation (MLE): the classical propensity score estimator θPSW = n^{-1} Σ_{i=1}^{n} δ_i d_i Y_i using d_i = 1/π^(0)_1(X_i; φ) as the estimated propensity score weight, where φ is the maximum likelihood estimate of ϕ from the working propensity score model in (6.1).
(iv) Regularized calibration method of Tan (2019), where we take the identity mapping for f(•) in this simulation.
Figure 3 shows the results of the simulation study above, where the dashed line represents the true parameter θ = E(Y). Table 1 presents the corresponding bias, standard error, and root mean square error for each method. Since the consistency of all estimators is guaranteed under OM1PM1, all estimators give similar performances except for the mean of the complete cases. However, under OM1PM2, the consistency of the generalized linear model method is no longer guaranteed, while that of the other estimators remains valid. In OM1PM2, the covariate balancing condition becomes critical, and so all methods satisfying covariate balancing are unbiased. In OM2PM1, the proposed augmented PSW estimator is still consistent, as the propensity score model is correctly specified. Our proposed augmented PSW estimator with γ-divergence has a relatively lower root mean square error than the other estimators, although it is biased in this scenario due to the misspecification of the outcome model, whereas the other estimators are built under the correctly specified response model. In OM2PM2, our proposed augmented PSW estimator with γ-divergence is robust relative to the other estimators, with fewer outliers and the smallest root mean square error. We also investigated the performance of the proposed linearized variance estimator in (5.13) and computed confidence intervals based on asymptotic normality. The results are presented in Table 3. The performances are satisfactory when the outcome regression model is specified correctly.
In addition, we check the robustness of the augmented propensity score weighting estimator with γ-divergence against outliers. After the data are generated under the same 2 × 2 factorial design as in the previous simulation setting, additive noise equal to eight times a random variable generated from Student's t distribution with 8 degrees of freedom is added to 20% of the observed outcomes. Again, the sample size is 100 and the Monte Carlo sample size is 1,000.
Figure 4 shows the performance of the various estimators, where the dashed line represents the true mean of the y_i's. Table 2 presents the corresponding bias, standard error, and root mean square error for each method. Compared to the other estimators, in all four scenarios, our proposed augmented propensity score weighting estimator with γ-divergence clearly gives the smallest root mean square error, which validates our robustness claim in Section 5.
Table 2: Bias, standard error (S.E.), and root mean square error (RMSE) of the compared estimators with sample size 100 and 1,000 Monte Carlo samples; 20% of the samples are contaminated with additive noise equal to eight times a random variable generated from Student's t distribution with 8 degrees of freedom. The errors from the outcome models are generated from an i.i.d. standard Gaussian distribution. All criteria are multiplied by 10. CC: mean of the complete cases; MLE: maximum likelihood estimation; APS: augmented propensity score weighting estimator; APS_γ: augmented propensity score weighting with γ-divergence estimator, for γ = 0.3, 0.5, 0.7, 1; HM: entropy balancing method in Hainmueller (2012).

Real data application
We further check our proposed estimators under artificial missingness with real data from the California API program (http://api.cde.ca.gov/, or the survey package (Lumley, 2010) in R (R Core Team, 2022)). Within the data set, standardized student tests are used to calculate the API for California schools. Specifically, we select the API for year 2000 (api00) as the response variable. For covariates, we set the API for year 1999 (api99) as X1, the percentage of students eligible for subsidized meals (meals) as X2, the percentage of English language learners (ell) as X3, the average level of parental education (avg) as X4, the percentage of fully qualified teachers (full) as X5, and the number of students enrolled (enroll) as X6, where the words in parentheses are the variable names in the data set. We artificially create the missingness with the following two response mechanisms: PM3, where δ | X follows the Bernoulli distribution with logit{P(δ = 1 | X)} = ϕ0 + 2X1 + X2 + 0.5X3, and ϕ0 is chosen to achieve a 60% response rate; and PM4, where δ | X follows the Bernoulli distribution with a piecewise-defined response probability, and a is chosen to achieve a 60% response rate. We compare the same estimators as in the previous subsections with 1,000 Monte Carlo samples. The mean of the complete cases does not behave well in either case and would distort the scale of the resulting box plots, so we do not show its performance. Furthermore, we adopt the MSPE as the 5-fold cross-validation criterion for the selection of γ, as described in Section 5. For each Monte Carlo sample, we select the best γ from {0.1, 0.2, . . ., 1} and denote the corresponding estimator as APSγ. The results are summarized with box plots in Figure 5, where the dashed line denotes the true population mean. As we can see, the estimator APSγ always gives an approximately unbiased estimate with a relatively low root mean square error, outperforming the other estimators.

Application to CEAP data

The Conservation Effects Assessment Project (CEAP) is a program initiated by the Natural Resources Conservation Service (NRCS) of the United States Department of Agriculture (USDA). CEAP collects and analyzes data from various sources, including field studies, monitoring sites, and modeling, to evaluate the impact of water and wind erosion. Further details of the CEAP data can be found in Berg and Yu (2019). The farmer interview data, together with the NRI data, serve as input to the Revised Universal Soil Loss Equation (RUSLE2) to generate a measure of sheet and rill erosion. RUSLE2 is an advancement of a traditional approximation called the Universal Soil Loss Equation (USLE). For the CEAP sample, RUSLE2 suffers from missingness due to refusals by the interviewed farmers, while the corresponding USLE is available. We are interested in estimating the population mean of RUSLE2 in Arkansas, using USLE as an auxiliary variable for calibration weighting. Specifically, there are 1,509 sampled points in total, of which 406 are fully observed, an observed rate of 26.9%. The normal quantile-quantile (Q-Q) plot generated from a linear model fitted to the complete cases is presented in Sub-figure (a) of Figure 6. The Q-Q plot reveals numerous outlying data points, and the residuals deviate from the normal distribution assumption. Therefore, in this scenario, we have strong reasons to place greater trust in the proposed robust approach. The corresponding mean estimates are presented in Sub-figure (b) of Figure 6, where the 95% confidence intervals for the APS method are constructed by (4.5) and (4.6), and those for APSγ by (5.13) and (5.14). The proposed method using γ-divergence exhibits a narrow 95% confidence interval and yields a low RUSLE2 value.
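As a minimal numerical sketch of calibration weighting with a single auxiliary variable, as in the CEAP setting above (a USLE-like auxiliary and a RUSLE2-like outcome), the following uses generic exponential-tilting calibration solved by Newton iterations. All data and the tilting form here are synthetic illustrations, not the paper's exact augmented PS weights.

```python
import numpy as np

# Synthetic stand-ins: x plays the role of the auxiliary (USLE),
# y the partially observed outcome (RUSLE2), delta the response indicator.
rng = np.random.default_rng(2)
N = 1509
x = rng.gamma(shape=2.0, scale=1.5, size=N)
y = 0.8 * x + rng.normal(scale=0.5, size=N)
delta = rng.binomial(1, 0.27, size=N).astype(bool)   # ~27% observed

B = np.column_stack([np.ones(N), x])                 # calibration basis (1, x)
target = B.mean(axis=0)                              # full-sample benchmarks

def calib_weights(lam):
    """Exponential-tilting weights omega_i = exp(lam' b_i) for respondents."""
    return np.exp(B[delta] @ lam)

# Newton iterations on the calibration equations:
#   (1/N) * sum_{delta_i = 1} omega_i * b_i = (1/N) * sum_i b_i
lam = np.zeros(2)
for _ in range(50):
    w = calib_weights(lam)
    U = B[delta].T @ w / N - target                  # calibration residual
    J = (B[delta] * w[:, None]).T @ B[delta] / N     # Jacobian of U in lam
    lam -= np.linalg.solve(J, U)

w = calib_weights(lam)
theta_hat = np.sum(w * y[delta]) / N                 # calibrated mean estimate
```

Because the intercept is in the basis, the weights sum to N, so `theta_hat` is a genuine weighted mean of the observed outcomes that matches the full-sample auxiliary totals.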

Concluding Remarks
We have applied information projection to obtain an augmented PS model that allows for doubly robust estimation. We have introduced self-efficiency to obtain the algebraic equivalence between the PSW estimator and the regression imputation estimator. In addition, γ-power divergence can be used to obtain an outlier-robust regression imputation estimator, which can be expressed as a PSW estimator under self-efficiency. In practice, an efficient and outlier-robust PSW estimator is very attractive, as the existence of outliers can often damage the efficiency of the resulting estimator. The tuning parameter is determined to balance the trade-off between statistical efficiency and robustness against outliers. Further theoretical investigation of the effect of the choice of γ on the final estimator is left for future work.
There are several directions for further extensions of the proposed method. The proposed method is directly applicable to calibration weighting in survey sampling (Fuller, 2009; Tillé, 2020). Other divergence measures, such as Hellinger divergence (Antoine and Dovonon, 2021), can also be considered in the information projection. The proposed method is based on the assumption of missing at random; extension to nonignorable nonresponse can be an interesting research direction. Furthermore, the proposed method can be used for causal inference, including the estimation of the average treatment effect from observational studies (Yang and Ding, 2020; Chen et al., 2021, 2023). Developing tools for causal inference using the proposed method will be an important extension of this research.

A Proof of Lemma 2.1
If the balancing condition (2.3) holds, then the propensity score weighting estimator can be expressed in the stated form. Under the MAR assumption, we can evaluate its expectation, and we conclude that θPSW is unbiased for θ.
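In generic notation (here ωi is the calibrated weight, m(x) the outcome regression mean assumed to lie in the span of the balancing functions, and ei = yi − m(xi); these symbols are stand-ins that may differ from the paper's), the unbiasedness argument can be sketched as:

```latex
\hat{\theta}_{\mathrm{PSW}} - \frac{1}{N}\sum_{i=1}^{N} y_i
  = \frac{1}{N}\sum_{i=1}^{N} (\delta_i \omega_i - 1)\, m(x_i)
  + \frac{1}{N}\sum_{i=1}^{N} (\delta_i \omega_i - 1)\, e_i ,
\qquad y_i = m(x_i) + e_i .
```

The first term vanishes by the balancing condition when m(x) lies in the span of the balancing functions, and the second term has mean zero under MAR because E(ei | xi, δi) = E(ei | xi) = 0.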

B Proof of Lemma 2.2
Note that we can express the covariate-balancing condition in (2.3) equivalently as (B.1). Thus, as long as (B.1) is satisfied, the identity holds for any β, and as long as (2.6) holds, we can obtain (2.5).

C Proof of Theorem 4.1
Since λ satisfies (4.2), we can obtain the stated equivalence. Thus, for the choice of β in (18), the uncertainty associated with λ is asymptotically negligible at β = β*, and the stated expansion follows. Now, to apply Taylor linearization with respect to ϕ, we define the quantities below and choose κ* to satisfy the stated condition, which gives the final linearization form in (4.5).

D Regularity Conditions for Theorem 5.1
Before stating the regularity conditions for Theorem 5.1, we first define a few quantities. Let the quantity in (D.1) be defined as given. Further, define the associated limiting quantities. In addition to the regularity conditions in Theorem 4.1, we need the following additional assumptions.

F Computational Complexity
The main computation in the proposed method is solving the estimating equations, for which we apply the Newton-Raphson algorithm. At each step, the linear solve costs O(N^3), where N is the dimension of the root; recall that λ ∈ R^{L+1} and ϕ ∈ R^p. As illustrated in Figure 2, in each iteration the computational complexity for the augmented PSW estimator is O(n max{p, L} + p^3 + L^3), where the first term is the cost of evaluating the estimating equations and the last two terms are the cost of solving them. Therefore, if K is the maximum number of Newton-Raphson iterations, the total computational complexity is O(K(n max{p, L} + p^3 + L^3)). In a similar fashion, the computational complexity for the robust augmented PSW estimator is also O(K(n max{p, L} + p^3 + L^3)). As a result, the computational complexity is linear in n.
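The per-iteration cost structure described above can be sketched with a generic Newton-Raphson solver for an estimating equation; the logistic-score example below is a hypothetical illustration, not the paper's estimating equations.

```python
import numpy as np

def newton_raphson(U, J, theta0, max_iter=20, tol=1e-10):
    """Solve the estimating equation U(theta) = 0 by Newton-Raphson.

    Each iteration evaluates U and its Jacobian J, then solves a
    p x p linear system costing O(p^3); K iterations therefore cost
    O(K * (cost of U and J + p^3)), matching the count in the text.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(J(theta), U(theta))   # O(p^3) solve
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Toy example: the score equation of a logistic regression on synthetic data.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
beta_true = np.array([0.2, 1.0, -0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

U = lambda b: X.T @ (y - 1 / (1 + np.exp(-X @ b)))   # score function

def J(b):
    """Derivative of the score: -X' W X with W = diag(p(1-p))."""
    w = 1 / (1 + np.exp(-X @ b))
    return -(X * (w * (1 - w))[:, None]).T @ X

beta_hat = newton_raphson(U, J, np.zeros(3))
```

Evaluating U and J is linear in the sample size, while the solve depends only on the parameter dimension, which is why the overall complexity is linear in n.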
Given the working PS model, the final PS function minimizing the Kullback-Leibler divergence in (3.4) subject to the balancing condition (3.2) is obtained with respect to ϕ. Note that this propensity score model has nothing to do with the OR model in (2.1). To incorporate the OR model in (2.1), we use the augmented propensity score model in (3.8) to obtain the final propensity score weights

ω̂i = 1 + (d̂i − 1) exp{ Σ_{j=0}^{L} λ̂j bj(xi) }, (3.9)

Assumption 4.1. The estimating equation U1,n(ϕ) = 0 has a unique solution φ, and there exists a function U1(ϕ) such that U1,n(ϕ) → U1(ϕ) uniformly as n → ∞ and U1(ϕ) = 0 has a unique solution ϕ*. Further, the estimating equation U2,n(λ | ϕ*) = 0 has a unique solution λ, and there exists a function U2
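A minimal numerical sketch of the weight map in (3.9), treating d̂i as the inverse working-model propensity (an assumption about the notation) and the basis as intercept-only:

```python
import numpy as np

def augmented_weights(d, B, lam):
    """Augmented PSW weights of the form (3.9):
    omega_i = 1 + (d_i - 1) * exp(sum_{j=0}^{L} lam_j * b_j(x_i)),
    where d_i is assumed to be the inverse working-model propensity and
    B is the n x (L+1) matrix whose columns are the basis functions b_j."""
    return 1.0 + (d - 1.0) * np.exp(B @ lam)

d = np.array([2.0, 1.25, 4.0])   # hypothetical inverse propensities
B = np.ones((3, 1))              # intercept-only basis b_0(x) = 1
omega = augmented_weights(d, B, np.zeros(1))
# With lam = 0 the exponential tilt is 1, so the weights reduce to d itself.
```

The λ adjustment tilts the baseline inverse-propensity weights so that the balancing condition can be enforced without altering fully observed units (d = 1 gives ω = 1 for any λ).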

Figure 2: Illustration of the inference procedure for the robust augmented PSW estimator.

Figure 4: Box plots for estimator comparison with sample size 100 and 1,000 Monte Carlo samples; a proportion of the samples are contaminated with additive noise equal to eight times a random variable generated from Student's t distribution with 8 degrees of freedom. The errors from the outcome models are generated from an i.i.d. standard Gaussian distribution. CC: mean of the complete cases; MLE: maximum likelihood estimation; APS: augmented propensity score weighting estimator; APSγ: augmented propensity score weighting with γ-divergence estimator, for γ = 0.3, 0.5, 0.7, 1; HM: entropy balancing method in Hainmueller (2012); Tan: regularized calibration method in Tan (2019).

Figure 6: Quantile-quantile plot and mean estimation with 95% confidence band for CEAP data in Arkansas.

Table 1: Bias, standard error (S.E.), and root mean square error (RMSE) for estimator comparison with sample size 100 and 1,000 Monte Carlo samples. All criteria are multiplied by 10. The errors from the outcome models are generated from an i.i.d. standard Gaussian distribution. CC: mean of the complete cases; MLE: maximum likelihood estimation; APS: augmented propensity score weighting estimator; APSγ: augmented propensity score weighting with γ-divergence estimator.

Table 3: Linearized variance estimation for the γ-divergence method, where the errors are i.i.d. from the standard Gaussian distribution, under scenarios s11, s12, s13, s21, s22, s23, s31, s32, s33.