Targeted Estimation of Nuisance Parameters to Obtain Valid Statistical Inference

Mark J. van der Laan

doi:10.1515/ijb-2012-0038

Publicly Available Published by De Gruyter February 11, 2014

From the journal The International Journal of Biostatistics

https://doi.org/10.1515/ijb-2012-0038

Abstract

In order to obtain concrete results, we focus on estimation of the treatment specific mean, controlling for all measured baseline covariates, based on observing independent and identically distributed copies of a random variable consisting of baseline covariates, a subsequently assigned binary treatment, and a final outcome. The statistical model only assumes possible restrictions on the conditional distribution of treatment, given the covariates, the so-called propensity score. Estimators of the treatment specific mean involve estimation of the propensity score and/or estimation of the conditional mean of the outcome, given the treatment and covariates. In order to make these estimators asymptotically unbiased at any data distribution in the statistical model, it is essential to use data-adaptive estimators of these nuisance parameters such as ensemble learning, and specifically super-learning. Because such estimators involve optimal trade-off of bias and variance w.r.t. the infinite dimensional nuisance parameter itself, they result in a sub-optimal bias/variance trade-off for the resulting real-valued estimator of the estimand. We demonstrate that additional targeting of the estimators of these nuisance parameters guarantees that this bias for the estimand is second order and thereby allows us to prove theorems that establish asymptotic linearity of the estimator of the treatment specific mean under regularity conditions. These insights result in novel targeted minimum loss-based estimators (TMLEs) that use ensemble learning with additional targeted bias reduction to construct estimators of the nuisance parameters. In particular, we construct collaborative TMLEs (C-TMLEs) with known influence curve allowing for statistical inference, even though these C-TMLEs involve variable selection for the propensity score based on a criterion that measures how effective the resulting fit of the propensity score is in removing bias for the estimand. 
As a particular special case, we also demonstrate the required targeting of the propensity score for the inverse probability of treatment weighted estimator using super-learning to fit the propensity score.

Keywords: asymptotic linearity; cross-validation; efficient influence curve; influence curve; targeted minimum loss based estimation

1 Introduction and overview

This introduction provides an atlas for the contents of this article. It starts with formulating the role of estimation of nuisance parameters to obtain asymptotically linear estimators of a target parameter of interest. This demonstrates the need to target this estimator of the nuisance parameter in order to make the estimator of the target parameter asymptotically linear when the model for the nuisance parameter is large. The general approach to obtain such a targeted estimator of the nuisance parameter is described. Subsequently, we present our concrete example to which we will apply this general method for targeted estimation of the nuisance parameter, and for which we establish a number of formal theorems. Finally, we discuss the link to previous articles that concerned some kind of targeting of the estimator of the nuisance parameter, and we provide an organization of the remainder of the article.

1.1 The role of nuisance parameter estimation

Suppose we observe n independent and identically distributed copies of a random variable O with probability distribution P0. In addition, assume that it is known that P0 is an element of a statistical model M and that we want to estimate ψ0=Ψ(P0) for a given target parameter mapping Ψ:M→ℝ. In order to guarantee that P0∈M one is forced to only incorporate real knowledge, and, as a consequence, such models M are always very large and, in particular, are infinite dimensional. We assume that the target parameter mapping is path-wise differentiable and let D∗(P) denote the canonical gradient of the path-wise derivative of Ψ at P∈M [1]. An estimator ψn=Ψˆ(Pn) is a functional Ψˆ applied to the empirical distribution Pn of O1,…,On and can thus be represented as a mapping Ψˆ:MNP→ℝ from the non-parametric statistical model MNP into the real line. An estimator Ψˆ is efficient if and only if it is asymptotically linear with influence curve D∗(P0):

$$\psi_n - \psi_0 = \frac{1}{n}\sum_{i=1}^{n} D^*(P_0)(O_i) + o_P(1/\sqrt{n}).$$

The empirical mean of the influence curve D∗(P0) represents the first-order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so-called functional delta-method for statistical inference based on functionals (i.e. Ψˆ) of the empirical distribution [2–4].

Suppose that Ψ(P) only depends on P through a parameter Q(P) and that the canonical gradient depends on P only through Q(P) and a nuisance parameter g(P). The construction of an efficient estimator requires the construction of estimators Qn and gn of these nuisance parameters Q0 and g0, respectively. Targeted minimum loss-based estimation (TMLE) represents a method for construction of (e.g. efficient) asymptotically linear substitution estimators Ψ(Qn∗), where Qn∗ is a targeted update of Qn that relies on the estimator gn [5–7]. The targeting of Qn is achieved by specifying a parametric submodel {Qn(ε):ε}⊂{Q(P):P∈M} through the initial estimator Qn and a loss function O↦L(Q)(O) for Q0=argmin_Q P0L(Q), where P0L(Q)≡∫L(Q)(o)dP0(o), so that the generalized score (d/dε)L(Qn(ε))|ε=0 spans a desired user-supplied estimating function D(Qn,gn). In addition, one may decide to target gn by specifying a parametric submodel {gn(ε1):ε1}⊂{g(P):P∈M} and loss function O↦L(g)(O) for g0=argmin_g P0L(g), so that the generalized score (d/dε1)L(gn(ε1))|ε1=0 spans another desired estimating function D1(gn,ηn) for some estimator ηn of nuisance parameter η. The parameter ε is fitted with the MLE εn=argmin_ε PnL(Qn(ε)), providing the first-step update Qn1=Qn(εn), and similarly ε1,n=argmin_ε1 PnL(gn(ε1)). This updating process, which maps a current fit (Qn,gn) into an update (Qn1,gn1), is iterated until convergence, at which point the TMLE (Qn∗,gn∗) solves PnD(Qn∗,gn∗)=0, i.e. the empirical mean of the estimating function equals zero at the final TMLE (Qn∗,gn∗). If one also targeted gn, then it also solves PnD1(gn∗,ηn)=0. The submodel through Qn will depend on gn, while the submodel through gn will depend on another nuisance parameter ηn. By setting D(Q,g) equal to the efficient influence curve D∗(Q,g), the resulting TMLE solves the efficient influence curve estimating equation PnD∗(Qn∗,gn)=0 and thereby will be asymptotically efficient when (Qn,gn) is consistent for (Q0,g0) under appropriate regularity conditions, where the targeting of gn is not needed.

The latter is shown as follows. By the property of the canonical gradient (in fact, any gradient) we have Ψ(Qn∗)−Ψ(Q0)=−P0D∗(Qn∗,gn)+Rn(Qn∗,Q0,gn,g0), where Rn involves integrals of second-order products of the differences (Qn∗−Q0) and (gn−g0). Combined with PnD∗(Qn∗,gn)=0, this implies the following identity:

Ψ(Qn∗)−Ψ(Q0)=(Pn−P0)D∗(Qn∗,gn)+Rn(Qn∗,Q0,gn,g0).

The first term is an empirical process term that, under empirical process conditions (mentioned below), equals (Pn−P0)D∗(Q,g), where (Q,g) denotes the limit of (Qn∗,gn), plus an oP(1/√n)-term. This then yields

$$\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0) D^*(Q, g) + R_n(Q_n^*, Q_0, g_n, g_0) + o_P(1/\sqrt{n}).$$

To obtain the desired asymptotic linearity of Ψ(Qn∗) one needs Rn=oP(1/√n), which in general requires at a minimum that both nuisance parameters are consistently estimated: Q=Q0 and g=g0. However, in many problems of interest, Rn only involves a cross-product of the differences Qn∗−Q0 and gn−g0, so that Rn converges to zero if either Qn∗ is consistent or gn is consistent: i.e. Q=Q0 or g=g0. In this latter case, the TMLE is so-called double robust. Either way, the consistency of the TMLE now relies on one of the nuisance parameter estimators being consistent, thereby requiring the use of non-parametric adaptive estimation such as super-learning [8–10] for at least one of the nuisance parameters. If only one of the nuisance parameter estimators is consistent, and we are in the double robust scenario, then it follows that the bias is of the same order as the bias of the consistent nuisance parameter estimator. However, if the nuisance parameter estimator is not based on a correctly specified parametric model, but instead is a data-adaptive estimator, then this bias will converge to zero at a rate slower than 1/√n: i.e. √n Rn converges to infinity as n→∞. Thus, in that case, the estimator of the target parameter may be overly biased and thereby will not be asymptotically linear.

1.2 Targeting the fit of the nuisance parameter: general approach

In this article, we demonstrate that if Q≠Q0, then it is essential that the consistent nuisance parameter estimator gn be targeted toward the estimand so that the bias for the estimand becomes second order: that is, in our new TMLEs relying on consistent estimation of g0 presented in this article one simultaneously updates gn into a gn∗ so that certain smooth functionals of gn∗, derived from the study of Rn, are asymptotically linear under appropriate conditions. Even if both estimators Qn∗ and gn∗ are consistent, but Qn∗ might be converging at a slower rate than gn∗, this targeting of the nuisance parameter estimator may still remove finite sample bias for the estimand. In addition, we also present such a TMLE when only relying on one of the nuisance parameters being consistently estimated, but not knowing which one: i.e. either Q=Q0 or g=g0. The same argument applies to other double robust estimators, such as estimating equation based estimators and inverse probability of treatment weighted (IPTW) estimators [11–16]. In fact, we demonstrate such a targeted IPTW-estimator in our next section.

The current article concerns the construction of such targeted IPTW and TMLE that are asymptotically linear under regularity conditions, even when only one of the nuisance parameters is consistent and the estimators of the nuisance parameters are highly data adaptive. In order to be concrete in this article, we will focus on a particular example. In such an example we can concretely present the second-order term Rn mentioned above and thereby develop the concrete form of the TMLE.

The same approach for construction of such TMLE can be carried out in much greater generality, but that is beyond the scope of this article. Nonetheless, it is helpful for the reader to know that the general approach is the following (considering the case that g=g0, but Q can be misspecified): (1) approximate Rn(Qn∗,Q0,gn∗,g0)=Φ0,n(gn∗)−Φ0,n(g0)+R1,n for some mapping Φ0,n that depends on P0 (e.g. through Q0) and the data (e.g. Qn∗,gn∗), and where R1,n is a second-order term so that it is reasonable to assume R1,n=oP(1/√n); (2) approximate Φ0,n(gn∗)−Φ0,n(g0)=Φn(gn∗)−Φn(g0)+R2,n, where R2,n is a second-order term and Φn is now a known (only based on data) mapping approximating Φ0,n; (3) construct gn∗ so that it is a TMLE of the target parameter Φn(g0) thereby allowing an expansion Φn(gn∗)−Φn(g0)=(Pn−P0)D1,n(P0)+R3,n with D1,n(P0) being the efficient influence curve of Φn(g0). That is, in step 3, gn∗ is iteratively updated to solve PnD1,n(gn∗,ηn)=0 with D1,n(P0) depending on P0 through g0 and a nuisance parameter η0, so that Φn(gn∗) is an asymptotically linear estimator of Φn(g0) under regularity conditions. After these three steps, we have that Rn(Qn∗,Q0,gn∗,g0)=(Pn−P0)D1,n(P0)+R1,n+R2,n+R3,n, where R1,n+R2,n+R3,n=oP(1/√n), and these steps provide us with the parameter Φn(g0) that needs to be targeted by gn∗, thereby telling us how to target gn∗ in the TMLE of ψ0. In addition, we can then conclude that this TMLE is asymptotically linear with known influence curve D∗(Q,g0)+D1(P0), where D1(P0) represents the limit of the efficient influence curve D1,n(P0) of Φn(g0): Ψ(Qn∗)−Ψ(Q0)=(Pn−P0){D∗(Q,g0)+D1(P0)}+oP(1/√n).

1.3 Concrete example covered in this article

Let us now formulate the concrete example we will cover in this article. Let O=(W,A,Y)∼P0, where W denotes baseline covariates, A a binary treatment, and Y a final outcome. Let M be a model that makes at most some assumptions about the conditional distribution of A, given W, but leaves the marginal distribution of W and the conditional distribution of Y, given A,W, unspecified. Let Ψ:M→ℝ be defined as Ψ(P)=EPEP(Y|A=1,W), the so-called treatment specific mean controlling for the baseline covariates. The canonical gradient, also called the efficient influence curve, of Ψ at P is given by D∗(P)(O)={A/g(1|W)}(Y−Qˉ(1,W))+Qˉ(1,W)−Ψ(P), where g(1|W)=P(A=1|W) is the propensity score and Qˉ(a,W)=EP(Y|A=a,W) is the outcome regression [13]. Let Q=(QW,Qˉ), where QW is the marginal distribution of W, and note that Ψ(P) only depends on P through Q=Q(P). For convenience, we will denote the target parameter with Ψ(Q) in order to not have to introduce additional notation. A targeted minimum loss-based estimator (TMLE) is a plug-in estimator Ψ(Qn∗), where Qn∗ is an update of an initial estimator Qn that relies on an estimator gn of g0, and it has the property that it solves PnD∗(Qn∗,gn)=0, where we used the notation Pf=∫f(o)dP(o).

For this particular example, such TMLE are presented in Scharfstein et al. [17]; van der Laan and Rubin [7]; Bembom et al. [18–21]; Rosenblum and van der Laan [22]; Sekhon et al. [23]; van der Laan and Rose [6, 24]. Since P0D∗(Q,g)=ψ0−Ψ(Q)+P0(Qˉ0−Qˉ)(gˉ0−gˉ)/gˉ [25, 26], where we use the notation gˉ(W)=g(1|W) and Qˉ(W)=Qˉ(1,W), and PnD∗(Qn∗,gn)=0, we obtain the identity:

$$\Psi(Q_n^*) - \psi_0 = (P_n - P_0) D^*(Q_n^*, g_n) + P_0 (\bar{Q}_0 - \bar{Q}_n^*)(\bar{g}_0 - \bar{g}_n)/\bar{g}_n. \tag{1}$$

The first term equals (Pn−P0)D∗(Q,g)+oP(1/√n) if D∗(Qn∗,gn) falls in a P0-Donsker class with probability tending to 1, and P0{D∗(Qn∗,gn)−D∗(Q,g)}²→0 in probability as n→∞ [4, 27]. If Qˉn∗ and gˉn are consistent for the true Qˉ0 and gˉ0, respectively, then the second term is a second-order term. If one now assumes that this second-order term is oP(1/√n), it has been proven that the TMLE is asymptotically efficient. This provides the general basis for proving asymptotic efficiency of TMLE when both Q0 and g0 are consistently estimated.
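The identity P0D∗(Q,g)=ψ0−Ψ(Q)+P0(Qˉ0−Qˉ)(gˉ0−gˉ)/gˉ behind eq. (1) can be checked numerically on a toy distribution. The following is a minimal sketch under stated assumptions: W takes three values, Y is binary, and every number and every name (`P_W`, `G0`, `Q0`, `Q0_A0`, `psi`, `mean_of_eic`, `remainder`) is hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical discrete data-generating distribution P0 (all numbers are made up).
P_W = {0: 0.5, 1: 0.3, 2: 0.2}      # marginal distribution of W under P0
G0 = {0: 0.2, 1: 0.5, 2: 0.7}       # true propensity score gbar_0(W)
Q0 = {0: 0.3, 1: 0.6, 2: 0.8}       # true Qbar_0(1, W) = P0(Y = 1 | A = 1, W)
Q0_A0 = {0: 0.1, 1: 0.2, 2: 0.4}    # P0(Y = 1 | A = 0, W); D* does not depend on it


def psi(q_w, q_bar):
    """Treatment specific mean Psi(Q) = E_{Q_W} Qbar(1, W)."""
    return sum(q_w[w] * q_bar[w] for w in q_w)


def mean_of_eic(q_w, q_bar, g_bar):
    """P0 D*(Q, g), computed by enumerating every (w, a, y) under P0."""
    psi_q = psi(q_w, q_bar)
    total = 0.0
    for w, a, y in product(P_W, (0, 1), (0, 1)):
        p_a = G0[w] if a == 1 else 1.0 - G0[w]
        p_y1 = Q0[w] if a == 1 else Q0_A0[w]
        p_y = p_y1 if y == 1 else 1.0 - p_y1
        d_star = a / g_bar[w] * (y - q_bar[w]) + q_bar[w] - psi_q
        total += P_W[w] * p_a * p_y * d_star
    return total


def remainder(q_bar, g_bar):
    """Second term of the identity: P0 (Qbar_0 - Qbar)(gbar_0 - gbar) / gbar."""
    return sum(P_W[w] * (Q0[w] - q_bar[w]) * (G0[w] - g_bar[w]) / g_bar[w]
               for w in P_W)


# Deliberately misspecified limits for both nuisance parameters.
q_bar_wrong = {0: 0.5, 1: 0.5, 2: 0.5}
g_bar_wrong = {0: 0.4, 1: 0.4, 2: 0.4}
psi0 = psi(P_W, Q0)

lhs = mean_of_eic(P_W, q_bar_wrong, g_bar_wrong)
rhs = psi0 - psi(P_W, q_bar_wrong) + remainder(q_bar_wrong, g_bar_wrong)
```

In this sketch `lhs` and `rhs` agree up to floating-point error, and `remainder` vanishes as soon as either argument is replaced by its true value (`Q0` or `G0`), which is the double robustness of the second term.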

However, if only one of these nuisance parameter estimators is consistent, then the second term is still a first-order term, and it remains to establish that it is also asymptotically linear with a second-order remainder. For sake of discussion, suppose that Qˉn∗ converges to a wrong Qˉ while gˉn is consistent. In that case, this remainder behaves in first order as P0(Qˉ0−Qˉ)(gˉn−gˉ0)/gˉ0. To establish that such a term is asymptotically linear requires that gˉn solves a particular estimating equation: that is, gˉn needs to be a TMLE itself targeting the required smooth functional of g0. This is naturally achieved within the TMLE framework by specifying a submodel through gn and loss function with the appropriate generalized score, so that a TMLE update step involves both updating Qn and gn, and the iterative TMLE algorithm now results in a final TMLE (Qn∗,gn∗), not only solving PnD∗(Qn∗,gn∗)=0 but also these additional equations that allow us to establish asymptotic linearity of the desired smooth functional of gn∗: see general description of TMLE above.

In this article, we present TMLE that targets gn in a manner that allows us to prove the desired asymptotic linearity of the second term in the right-hand side of eq. (1) when either gˉn or Qˉn is consistent, under conditions that require specified second-order terms to be oP(1/√n). The latter type of regularity conditions are typical for the construction of asymptotically linear estimators and are therefore considered appropriate for the sake of this article. Though it is of interest to study cases in which these second-order terms cannot be assumed to be oP(1/√n), this is beyond the scope of this article.

1.4 Relation to current literature on targeted nuisance parameter estimators

The construction of TMLE that utilizes targeting of the nuisance parameter gn has been carried out in earlier papers. For example, in van der Laan and Rubin [7], we target gn to obtain a TMLE that, beyond being double robust locally efficient, also equals the IPTW-estimator. In Gruber and van der Laan [29] we target gn to guarantee that the TMLE, beyond being double robust locally efficient, also outperforms a user-supplied given estimator, based on the original idea of Rotnitzky et al. [28]. In that sense, the distinction of the current article from these previous articles is that gn∗ is now targeted to guarantee that the TMLE remains asymptotically linear when Qn∗ is misspecified. This task of targeting gn∗ appears to be one step more complicated than in these previous articles, since the smooth functionals of gn∗ that need to be targeted are themselves indexed by parameters of the true data distribution P0, and thus unknown. As mentioned above, our strategy is to approximate these unknown smooth functionals by an estimated smooth functional and develop the targeted estimator gn∗ that targets this estimated parameter of g0.

The TMLEs presented in this article are always iterative and thereby rely on convergence of the iterative updating algorithm. Since the empirical risk decreases at each updating step, such convergence is typically guaranteed by the existence of the MLE at each updating step (e.g. an MLE of the coefficient in a logistic regression). Either way, in this article, we assume this convergence to hold. Since the assumptions of our theorems require gn∗(1|W) to be bounded away from zero, we demonstrate how this property can be achieved by using submodels for updating gn that guarantee this property. Detailed simulations will appear in a future article.

1.5 Organization

The organization of this paper is as follows. In Section 2, we introduce a targeted IPTW-estimator that relies on an adaptive consistent estimator of g0, and we establish its asymptotic linearity with known influence curve, allowing for the construction of asymptotically valid confidence intervals based on this adaptive IPTW-estimator. In the remainder of the article, we focus on construction of TMLE involving the targeting of gn to establish the asymptotic linearity of the resulting TMLE under appropriate conditions. In Section 3, we introduce a novel TMLE that assumes that the targeted adaptive estimator gn∗ is consistent for g0, and we establish its asymptotic linearity. In Section 4, we introduce a novel TMLE that only assumes that either the targeted Qˉn∗ or the targeted gˉn∗ is consistent, and we establish its asymptotic linearity with known influence curve. This TMLE needs to protect the asymptotic linearity under misspecification of either gn∗ or Qˉn∗, and, as a consequence, relies on targeting of gn (in order to preserve asymptotic linearity when Qˉn∗ is inconsistent), but also extra targeting of Qˉn (in order to preserve asymptotic linearity when Qˉn∗ is consistent, but gn is inconsistent). The explicit form of the influence curve of this TMLE allows us to construct asymptotic confidence intervals. Since this result allows statistical inference in the statistical model that only assumes that one of the estimators is consistent, we refer to this as “double robust statistical inference”. Even though double robust estimators have been extensively presented in the current literature, double robust statistical inference in these large semi-parametric models has been a difficult topic: typically, the non-parametric bootstrap has been suggested, but there is no theory supporting that the non-parametric bootstrap is a valid method when the estimators rely on data-adaptive estimation.

In Section 5, we extend the TMLE of Section 3 (that relies on gn∗ being consistent for g0) to the case that gn∗ converges to a possibly misspecified g but one that suffices for consistent estimation of ψ0 in the sense that Ψ(Qˉn∗) will be consistent. We present a corresponding asymptotic linearity theorem for this TMLE that is able to utilize the so-called collaborative double robustness of the efficient influence curve which states that Ψ(Q)=ψ0 if P0D∗(Q,g)=0 and g∈G(Q,P0) for a set G(Q,P0) (including g0). In order to construct a collaborative estimator gn∗ that aims to converge to an element in G(Qn∗,P0) in collaboration with Qn∗, we use the framework of collaborative targeted minimum loss-based estimator (C-TMLE) [20, 29–35]. Our asymptotic linearity theorem can now be applied to this C-TMLE. Again, even though C-TMLEs have been presented in the current literature, statistical inference based on the C-TMLEs has been another challenging topic, and Section 5 provides us with a C-TMLE with known influence curve. We conclude this article with a discussion. The proofs of the theorems are presented in the Appendix.

1.6 Notation

In the following sections, we will use the following notation. We have O=(W,A,Y)∼P0∈M, where M is a statistical model that makes only assumptions on the conditional distribution of A, given W. Let g0(a|W)=P0(A=a|W), and gˉ0(W)=P0(A=1|W). The target parameter is Ψ:M→ℝ defined by Ψ(P0)=EQW,0Qˉ0(1,W), where Qˉ0(1,W)=EP0(Y|A=1,W), which will also be denoted with Qˉ0(W), and QW,0 is the distribution of W under P0. We also use the notation Ψ(Q), where Q=(QW,Qˉ). In addition, D∗(Q,g) denotes the efficient influence curve of Ψ at (Q,g). We also use the following notation:

$$H_A(\bar{Q}^r, \bar{g}) = \bar{Q}^r / \bar{g}$$

$$H_0^r = \bar{Q}_0^r / \bar{g}_0$$

$$D_A(\bar{Q}^r, \bar{g})(A, W) = H_A(\bar{Q}^r, \bar{g})(W)\,(A - \bar{g}(W))$$

$$H_Y(\bar{g})(A, W) = A / \bar{g}(W)$$

$$\bar{Q}_0^r(\bar{Q}, \bar{g}) = E_{P_0}(Y - \bar{Q} \mid A = 1, \bar{g})$$

$$\bar{Q}_0^r = \bar{Q}_0^r(\bar{Q}, \bar{g}_0)$$

$$\bar{g}_0^r(\bar{g}, \bar{Q}) = E_0(A \mid \bar{g}, \bar{Q})$$

$$\bar{Q}_0^r(\bar{g}) = E_0(Y \mid A = 1, \bar{g}) \quad \text{(IPTW only, Section 2)}$$

$$\bar{Q}_0^r = \bar{Q}_0^r(\bar{g}_0) \quad \text{(IPTW only, Section 2)}$$

$$\|f\|_0 = (P_0 f^2)^{0.5}.$$

2 Statistical inference for IPTW-estimator when using super-learning to fit treatment mechanism

We first describe an IPTW-estimator that uses super-learning to fit the treatment mechanism g0. Subsequently, we present this IPTW-estimator but now using an update of the super-learning fit of g0, and we present a theorem establishing the asymptotic linearity of this targeted IPTW-estimator under appropriate conditions. Finally, we discuss how this targeted IPTW-estimator compares with an IPTW-estimator that relies on a parametric model to fit the treatment mechanism.

2.1 An IPTW-estimator using super-learning to fit the treatment mechanism

We consider a simple IPTW-estimator Ψˆ(Pn)=PnD(gˆ(Pn)), where D(g)(O)=YA/gˉ(W), and gˆ:MNP→G is an adaptive estimator of g0 based on the log-likelihood loss function L(g)(O)≡−log g(A|W). For a general presentation of an IPTW-estimator, we refer to Robins and Rotnitzky [11], van der Laan and Robins [13], and Hernan et al. [36]. We wish to establish conditions under which reliable statistical inference based on this estimator of ψ0 can be obtained. One might wish to estimate g0 with ensemble learning, and, in particular, super-learning in which cross-validation [37] is used to determine the best weighted combination of a library of candidate estimators: van der Laan and Dudoit [8]; van der Laan et al. [9, 38, 39]; van der Vaart et al. [10]; Dudoit and van der Laan [40]; Polley et al. [41]; Polley and van der Laan [42]; van der Laan and Petersen [43]. The super-learner is a general template for construction of an adaptive estimator based on a library of candidate estimators, a loss function whose expectation is minimized over the parameter space by the true parameter value, and a parametric family that defines “weighted” combinations of the estimators in the library. We will start with presenting a succinct description of a particular super-learner. Consider a library of estimators gˆj:MNP→G, j=1,…,J, and a family of weighted (on logistic scale) combinations of these estimators Logit gˆα(1|W)=∑_{j=1}^{J} αj Logit gˆj(1|W), indexed by vectors α for which αj∈[0,1] and ∑j αj=1. Consider a random sample split Bn∈{0,1}^n into a training sample {i:Bn(i)=0} of size n(1−p) and validation sample {i:Bn(i)=1} of size np, and let P^1_{n,Bn} and P^0_{n,Bn} denote the empirical distribution of the validation sample and training sample, respectively. Define

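$$\alpha_n = \arg\min_{\alpha} E_{B_n} P^1_{n,B_n} L\big(\hat{g}_\alpha(P^0_{n,B_n})\big) = \arg\min_{\alpha} E_{B_n} \frac{1}{np} \sum_{i:\,B_n(i)=1} L\big(\hat{g}_\alpha(P^0_{n,B_n})\big)(O_i),$$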

as the choice of estimator that minimizes cross-validated risk. The super-learner of g0 is defined as the estimator gˆ(Pn)=gˆαn(Pn).
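The selection of αn can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the implementation used in the cited super-learner literature: the library has only two hypothetical candidates (`fit_marginal`, `fit_stratified`), W is a single binary covariate, the expectation over Bn is taken over the five splits of a V-fold scheme, and α is searched over a coarse grid rather than optimized exactly. The clipping bounds and the simulated data are likewise made up.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def clip(p, lo=0.01, hi=0.99):
    return min(max(p, lo), hi)


def fit_marginal(train):
    """Candidate 1: ignores W and returns the training proportion treated."""
    p = clip(sum(a for _, a in train) / len(train))
    return lambda w: p


def fit_stratified(train):
    """Candidate 2: proportion treated within each level of a binary W."""
    p = {}
    for level in (0, 1):
        rows = [a for w, a in train if w == level]
        p[level] = clip(sum(rows) / len(rows)) if rows else 0.5
    return lambda w: p[w]


def combine(fits, alpha):
    """Weighted combination on the logit scale, with weights alpha summing to 1."""
    return lambda w: expit(sum(al * logit(f(w)) for al, f in zip(alpha, fits)))


def cv_risk(data, learners, alpha, n_folds=5):
    """Cross-validated risk of the alpha-combination under L(g) = -log g(A|W)."""
    total = 0.0
    for v in range(n_folds):
        train = [o for i, o in enumerate(data) if i % n_folds != v]
        valid = [o for i, o in enumerate(data) if i % n_folds == v]
        g = combine([fit(train) for fit in learners], alpha)
        loss = sum(-math.log(g(w) if a == 1 else 1.0 - g(w)) for w, a in valid)
        total += loss / len(valid)
    return total / n_folds


def super_learner(data, learners, grid_size=10):
    """Pick alpha on a grid (two learners only), then refit on all the data."""
    grid = [(k / grid_size, 1.0 - k / grid_size) for k in range(grid_size + 1)]
    alpha_n = min(grid, key=lambda alpha: cv_risk(data, learners, alpha))
    return alpha_n, combine([fit(data) for fit in learners], alpha_n)


# Simulated data in which treatment depends strongly on W.
random.seed(0)
data = []
for _ in range(1000):
    w = random.randint(0, 1)
    a = 1 if random.random() < (0.8 if w == 1 else 0.2) else 0
    data.append((w, a))

alpha_n, g_hat = super_learner(data, [fit_marginal, fit_stratified])
```

Here `alpha_n` holds the selected weights (marginal first, stratified second) and `g_hat(w)` returns the fitted propensity score gˆ(Pn)(1|W=w). With treatment depending this strongly on W, most of the weight is expected to fall on the stratified candidate.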

2.2 Asymptotic linearity of a targeted data-adaptive IPTW-estimator

The next theorem presents an IPTW-estimator that uses a targeted fit gn∗ of g0, involving the updating of an initial estimator gn, and conditions under which this IPTW-estimator of ψ0 is asymptotically linear. For example, gn could be defined as a super-learner of the type presented above. In spite of the fact that such an IPTW-estimator uses a very data adaptive and hard to understand estimator gn, this theorem shows that its influence curve is known and can be well estimated.

Theorem 1. We consider a targeted IPTW-estimator Ψˆ(Pn)=PnD(gn∗), where D(g)(O)=YA/g(A|W), and gn∗ is an update of an initial estimator gn of g0∈G defined below.

Definition of targeted estimator gn∗: Let Qˉnr∗ be obtained by non-parametric estimation of the regression function EP0(Y|A=1,gˉn(W)) treating gˉn as a fixed covariate (i.e. function of W). This yields an estimator Hnr∗≡Qˉnr∗/gˉn of H0r=Qˉ0r/gˉ0, where Qˉ0r=EP0(Y|A=1,gˉ0). Consider the submodel Logit gˉn(ε)=Logit gˉn+εHnr∗, and fit ε with the MLE

$$\varepsilon_n = \arg\max_{\varepsilon} P_n \log g_n(\varepsilon).$$

We define gn∗=gn(εn) as the corresponding targeted update of gn. This TMLE gn∗ satisfies

$$P_n D_A(\bar{Q}_n^{r*}, \bar{g}_n^*) = 0.$$

Empirical process condition: Assume that D(gn∗), DA(Qˉnr∗,gˉn∗) fall in a P0-Donsker class with probability tending to 1.

Negligibility of second-order terms: Define Qˉ0,nr≡EP0(Y|A=1,gˉ0(W),gˉn∗(W)). Assume gˉn∗>δ>0 with probability tending to 1 and assume

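$$\|\bar{Q}_n^{r*} - \bar{Q}_0^r\|_0 = o_P(1)$$

$$\|\bar{g}_n^* - \bar{g}_0\|_0^2 = o_P(1/\sqrt{n})$$

$$\|\bar{Q}_n^{r*} - \bar{Q}_0^r\|_0 \, \|\bar{g}_n^* - \bar{g}_0\|_0 = o_P(1/\sqrt{n})$$

$$\|\bar{Q}_{0,n}^r - \bar{Q}_0^r\|_0 \, \|\bar{g}_n^* - \bar{g}_0\|_0 = o_P(1/\sqrt{n}).$$

Then,

$$\hat{\Psi}(P_n) - \psi_0 = (P_n - P_0) IC(P_0) + o_P(1/\sqrt{n}),$$

where

$$IC(P_0)(O) = YA/g_0(A \mid W) - \psi_0 - H_0^r(W)\,(A - \bar{g}_0(W)).$$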
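The fluctuation step that defines gn∗ in the theorem is a one-parameter logistic regression with offset Logit gˉn and single covariate Hnr∗. A minimal sketch follows, assuming the fitted values gˉn(Wi) and Hnr∗(Wi) are already available as lists. The function name `target_propensity`, the Newton iteration, and the toy numbers are illustrative choices that the theorem itself does not prescribe.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def target_propensity(a, g_init, h, n_iter=50, tol=1e-12):
    """Fit epsilon in logit g(eps) = logit g_init + eps * h by Newton's method.

    a      : list of 0/1 treatment indicators A_i
    g_init : initial fitted values gbar_n(W_i), strictly inside (0, 1)
    h      : covariate values H_i, an estimate of Qbar^r / gbar at W_i
    Returns (eps_n, updated fitted values gbar_n^*(W_i)).
    """
    offset = [logit(g) for g in g_init]
    eps = 0.0
    for _ in range(n_iter):
        g_eps = [expit(o + eps * hi) for o, hi in zip(offset, h)]
        # Score and Fisher information of the Bernoulli log-likelihood in eps.
        score = sum(hi * (ai - gi) for hi, ai, gi in zip(h, a, g_eps))
        info = sum(hi * hi * gi * (1.0 - gi) for hi, gi in zip(h, g_eps))
        if info == 0.0:
            break
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps, [expit(o + eps * hi) for o, hi in zip(offset, h)]


# Toy inputs (made up): eight observations with initial fits and covariate values.
a = [1, 0, 1, 1, 0, 1, 0, 1]
g_init = [0.4, 0.3, 0.6, 0.5, 0.2, 0.7, 0.5, 0.6]
h = [1.2, 0.8, 1.5, 1.0, 0.6, 1.8, 1.1, 1.4]

eps_n, g_star = target_propensity(a, g_init, h)
# Empirical mean of H_i * (A_i - gbar_n^*(W_i)), which the MLE sets to zero.
mean_score = sum(hi * (ai - gi) for hi, ai, gi in zip(h, a, g_star)) / len(a)
```

At the fitted `eps_n` the empirical mean `mean_score` is zero up to numerical tolerance, which is the score equation solved by the update for the covariate used in the fluctuation.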

So under the conditions of this theorem, we can construct an asymptotic 0.95-confidence interval ψn±1.96σn/√n based on this targeted IPTW-estimator ψn=Ψˆ(Pn), where

$$\sigma_n^2 = P_n IC_n^2 = \frac{1}{n}\sum_{i=1}^{n} IC_n(O_i)^2,$$

and ICn(O)=YA/gˉn∗(W)−ψn−Hnr∗(W)(A−gˉn∗(W)) is the plug-in estimator of the influence curve IC(P0) obtained by plugging in gn or gn∗ for g0 and Qˉnr∗ for Qˉ0r.
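The point estimate, the plug-in influence curve, and the resulting interval translate directly into code. The sketch below assumes the targeted fits gˉn∗(Wi) and Hnr∗(Wi) have already been computed; the function name `targeted_iptw_inference` and the toy inputs are hypothetical.

```python
import math


def targeted_iptw_inference(y, a, g_star, h_star, z=1.96):
    """Targeted IPTW estimate, influence-curve standard deviation, Wald interval.

    y, a   : outcomes Y_i and 0/1 treatment indicators A_i
    g_star : targeted propensity fits gbar_n^*(W_i), bounded away from zero
    h_star : fitted values H_n^{r*}(W_i)
    """
    n = len(y)
    # psi_n = P_n D(g_n^*) with D(g)(O) = Y A / gbar(W).
    psi_n = sum(yi * ai / gi for yi, ai, gi in zip(y, a, g_star)) / n
    # Plug-in influence curve IC_n(O_i).
    ic = [yi * ai / gi - psi_n - hi * (ai - gi)
          for yi, ai, gi, hi in zip(y, a, g_star, h_star)]
    sigma_n = math.sqrt(sum(v * v for v in ic) / n)
    half_width = z * sigma_n / math.sqrt(n)
    return psi_n, sigma_n, (psi_n - half_width, psi_n + half_width)


# Toy call with made-up inputs.
psi_n, sigma_n, ci = targeted_iptw_inference(
    y=[1, 0, 1, 1], a=[1, 0, 1, 0], g_star=[0.5] * 4, h_star=[0.0] * 4)
```

The interval is only asymptotically valid under the conditions of the theorem, so a four-observation call like the one above illustrates the arithmetic and nothing more.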

Regarding the displayed second-order term conditions, we note that these are satisfied if gˉn∗−gˉ0 converges to zero w.r.t. L2(P0)-norm at rate oP(n^{−1/4}), gˉn∗>δ>0 for some δ>0 with probability tending to 1 as n→∞, and the product of the rates at which gˉn∗ converges to gˉ0 and (Qˉnr∗,Qˉ0,nr) converges to Qˉ0r is oP(1/√n).

Regarding the empirical process condition, we note that an example of a Donsker class is the class of multivariate real-valued functions with uniform sectional variation norm bounded by a universal constant [44]. It is important to note that if each estimator in the library falls in such a class, then also the convex combinations fall in that same class [4]. So this Donsker condition will hold if it holds for each of the candidate estimators in the library of the super-learner.

2.3 Comparison of targeted data-adaptive IPTW and an IPTW using parametric model

Consider an IPTW-estimator using an MLE gn,1 according to a parametric model for g0, and let us contrast this IPTW-estimator with an IPTW-estimator defined in the above theorem based on an initial super-learner gn that includes gn,1 as an element of the library of estimators. Let us first consider the case that the parametric model is correctly specified. In that case gn,1 converges to g0 at the parametric rate 1/√n. From the oracle inequality for cross-validation [8, 10, 38], it follows that gn also converges at the rate 1/√n to g0, possibly up to a log n-factor in case the number of algorithms in the library is of the order n^q for some fixed q. As a consequence, all the consistency and second-order term conditions for the IPTW-estimator using a targeted gn∗ based on gn hold. If one uses estimators in the library of algorithms that have a uniform sectional variation norm smaller than some M<∞ with probability tending to 1, then also a weighted average of these estimators will have uniform sectional variation norm smaller than M<∞ with probability tending to 1. Thus, in that case we will also have that D(gn∗),DA(Qˉnr∗,gˉn∗) fall in a P0-Donsker class. Examples of estimators that control the uniform sectional variation norm are any parametric model with fewer than K main terms that themselves have a uniform sectional variation norm, but also penalized least-squares estimators (e.g. Lasso) using basis functions with bounded uniform sectional variation norm, and one could map any estimator into this space of functions with universally bounded uniform sectional variation norm through a smoothing operation. Thus, under this restriction on the library, the IPTW-estimator using the super-learner is asymptotically linear with influence curve IC(P0)(O) as stated in the theorem. We note that IC(P0) is the efficient influence curve for the target parameter EP0EP0(Y|A=1,gˉ0(W)) if the observed data were (gˉ0(W),A,Y) instead of O=(W,A,Y).

The parametric IPTW-estimator is asymptotically linear with influence curve O↦YA/g0(A|W)−ψ0−Π(YA/gˉ0(W)|Tg), where Tg is the tangent space of the parametric model for g0, and Π(f|Tg) denotes the projection of f onto Tg in the Hilbert space L_0^2(P0) [13]. This IPTW-estimator could be less or more efficient than the IPTW-estimator using the targeted super-learner depending on the actual tangent space of the parametric model.

For example, if the parametric model happens to have a score equal to O ↦ Qˉ0(W)(A/gˉ0(W)−1), then the parametric IPTW-estimator would be asymptotically efficient. Of course, a standard parametric model is not tailored to correspond with such optimal scores, but this shows that we cannot claim superiority of one versus the other in the case that the parametric model for g0 is correctly specified.

If, on the other hand, the parametric model is misspecified, then the IPTW-estimator using gn,1 is inconsistent. However, the super-learner gn will be consistent if the library contains a non-parametric adaptive estimator, and will perform asymptotically as well as the oracle selector among all the weighted combinations of the algorithms in the library. To conclude, the IPTW-estimator using super-learning to estimate g0 will be as good as the IPTW-estimator using a correctly specified parametric model (included in the library of the super-learner), but will remain consistent and asymptotically linear in a much larger model than the parametric IPTW-estimator relying on the true g0 being an element of the parametric model.

3 Statistical inference for TMLE when using super-learning to consistently fit treatment mechanism

In the next subsection, we present a TMLE that targets the fit of the treatment mechanism, analogous to the targeted IPTW-estimator presented above. In addition, this subsection presents a formal asymptotic linearity theorem demonstrating that, under reasonable conditions, this TMLE will be asymptotically linear even when Qˉn∗ is inconsistent. We conclude this section with a subsection showing how the iterative updating of the treatment mechanism can be carried out in such a way that the final fit of the treatment mechanism is still bounded away from zero, as required to obtain a stable estimator.

3.1 Asymptotic linearity of a TMLE using a targeted estimator of the treatment mechanism

The following theorem presents a novel TMLE and corresponding asymptotic linearity with specified influence curve, where we rely on consistent estimation of g0. The TMLE still uses the same updating step for the estimator of Qˉ0 as the regular TMLE [7], but uses a novel updating step for the estimator of g0, analogous to the updating step of the IPTW-estimator in the previous section. We remind the reader of the importance of using the logistic fluctuations as working submodels for Qˉ0 in the definition of the TMLE, guaranteeing that the TMLE update stays within the bounded parameter space (see, e.g. Gruber and van der Laan [19]).
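The regular logistic-fluctuation update of Qˉ that the theorem below reuses can be sketched as follows. The sketch assumes Y is scaled to [0,1], takes the covariate to be A/gˉ(W), and fits ε by Newton's method on the quasi-log-likelihood loss. The function name `tmle_update_q`, the solver, and the toy data are illustrative assumptions; only a single update of Qˉ for a fixed gˉ is shown, without the targeting of g.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def tmle_update_q(y, a, q_init, g, n_iter=50, tol=1e-12):
    """Fluctuate Qbar(1, W) along logit Qbar(eps) = logit Qbar + eps * A / gbar(W).

    y      : outcomes Y_i in [0, 1]
    a      : 0/1 treatment indicators A_i
    q_init : initial fitted values Qbar_n(1, W_i), strictly inside (0, 1)
    g      : fitted propensity scores gbar_n(W_i), bounded away from zero
    Returns (eps_n, updated Qbar_n^*(1, W_i), plug-in estimate Psi(Q_n^*)).
    """
    offset = [logit(q) for q in q_init]
    eps = 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for yi, ai, oi, gi in zip(y, a, offset, g):
            if ai == 1:  # the covariate A / gbar is zero whenever A_i = 0
                qi = expit(oi + eps / gi)
                score += (yi - qi) / gi
                info += qi * (1.0 - qi) / (gi * gi)
        if info == 0.0:
            break
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    q_star = [expit(oi + eps / gi) for oi, gi in zip(offset, g)]
    # Plug-in estimator with the empirical distribution of W.
    psi_n = sum(q_star) / len(q_star)
    return eps, q_star, psi_n


# Toy inputs (made up).
y = [0.9, 0.2, 0.7, 1.0, 0.1, 0.8, 0.3, 0.6]
a = [1, 0, 1, 1, 0, 1, 1, 0]
q_init = [0.5, 0.4, 0.6, 0.5, 0.3, 0.7, 0.5, 0.6]
g = [0.5, 0.4, 0.6, 0.5, 0.3, 0.7, 0.5, 0.6]

eps_n, q_star, psi_n = tmle_update_q(y, a, q_init, g)
# Empirical mean of the efficient influence curve D*(Q_n^*, g_n).
mean_eic = sum(ai / gi * (yi - qi) + qi - psi_n
               for yi, ai, qi, gi in zip(y, a, q_star, g)) / len(y)
```

After the update, `mean_eic` is zero up to numerical tolerance, i.e. the fitted pair solves PnD∗(Qn∗,gn)=0 in this toy example.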

Theorem 2

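Iterative targeted MLE of $\psi_0$:

Definitions: Given $\bar{Q},\bar{g}$, let $\bar{Q}_n^r(\bar{Q},\bar{g})$ be a consistent estimator of the regression $\bar{Q}_0^r(\bar{Q},\bar{g})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$ of $(Y-\bar{Q})$ on $\bar{g}(W)$ and $A=1$. Let $(g_n,\bar{Q}_n)$ be an initial estimator of $(g_0,\bar{Q}_0)$.

Initialization: Let $g_n^0=g_n$, $\bar{Q}_n^0=\bar{Q}_n$, and $\bar{Q}_n^{r,0}=\bar{Q}_n^r(\bar{Q}_n^0,\bar{g}_n^0)$. Let $k=0$.

Updating step for $g_n^k$: Consider the submodel $\operatorname{Logit}\bar{g}_n^k(\epsilon)=\operatorname{Logit}\bar{g}_n^k+\epsilon H_A(\bar{Q}_n^{r,k},\bar{g}_n^k)$, and fit $\epsilon$ with the MLE

$$\epsilon_n=\arg\max_{\epsilon}P_n\log g_n^k(\epsilon).$$

We define $g_n^{k+1}=g_n^k(\epsilon_n)$ as the corresponding update of $g_n^k$. This $g_n^{k+1}$ satisfies

$$\frac{1}{n}\sum_{i=1}^n H_A(\bar{Q}_n^{r,k},\bar{g}_n^k)(W_i)\,\bigl(A_i-\bar{g}_n^{k+1}(W_i)\bigr)=0.$$

Updating step for $\bar{Q}_n^k$: Let $-L(\bar{Q})(O)\equiv Y\log\bar{Q}(A,W)+(1-Y)\log(1-\bar{Q}(A,W))$ be the quasi-log-likelihood loss function for $\bar{Q}_0=E_0(Y\mid A=1,W)$ (allowing that $Y$ is continuous in $[0,1]$). Consider the submodel $\operatorname{Logit}\bar{Q}_n^k(\epsilon)=\operatorname{Logit}\bar{Q}_n^k+\epsilon H_Y(g_n^k)$, and let $\epsilon_n=\arg\min_{\epsilon}P_nL(\bar{Q}_n^k(\epsilon))$. Define $\bar{Q}_n^{k+1}=\bar{Q}_n^k(\epsilon_n)$ as the resulting update. Define $\bar{Q}_n^{r,k+1}=\bar{Q}_n^r(\bar{Q}_n^{k+1},\bar{g}_n^{k+1})$.

Iterating till convergence: Now, set $k\leftarrow k+1$, and iterate this updating process mapping a $(g_n^k,\bar{Q}_n^k,\bar{Q}_n^{r,k})$ into $(g_n^{k+1},\bar{Q}_n^{k+1},\bar{Q}_n^{r,k+1})$ till convergence or till large enough $K$ so that the estimating equations (2) below are solved up to an $o_P(1/\sqrt{n})$-term. Denote the limit of this iterative procedure with $(g_n^*,\bar{Q}_n^*,\bar{Q}_n^{r*})$.

Plug-in estimator: Let $Q_n^*=(Q_{W,n},\bar{Q}_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.

Estimating equations solved by TMLE: This TMLE $(Q_n^*,g_n^*,\bar{Q}_n^{r*})$ solves

$$\begin{aligned}
P_nD^*(Q_n^*,g_n^*)&=0,\\
P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*)&=0.
\end{aligned}\tag{2}$$

Empirical process condition: Assume that $D^*(Q_n^*,g_n^*)$ and $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n\to\infty$.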

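Negligibility of second-order terms: Define

$$\begin{aligned}
\bar{Q}_{0,n}^r(W)&\equiv E_{P_0}\bigl(Y-\bar{Q}(1,W)\mid A=1,\bar{g}_n^*(W),\bar{g}_0(W)\bigr),\\
\bar{Q}_0^r(W)&\equiv E_{P_0}\bigl(Y-\bar{Q}(1,W)\mid A=1,\bar{g}_0(W)\bigr),\\
H_{0,n}^r&=\bar{Q}_{0,n}^r/\bar{g}_n^*,\\
H_0^r&=\bar{Q}_0^r/\bar{g}_0,
\end{aligned}$$

where $\bar{g}_n^*(W)$ is treated as a fixed covariate (i.e. function of $W$) in the conditional expectation $\bar{Q}_{0,n}^r$. Assume that there exists a $\delta>0$ so that $\bar{g}_n^*>\delta>0$ with probability tending to 1, and

$$\begin{aligned}
\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1),\\
\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0&=o_P(1),\\
\|\bar{g}_n^*-\bar{g}_0\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{Q}_{0,n}^r-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}_0\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{g}_n^*-\bar{g}_0\|_0^2&=o_P(1/\sqrt{n}),\\
\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}_0\|_0&=o_P(1/\sqrt{n}).
\end{aligned}$$

Then,

$$\Psi(Q_n^*)-\psi_0=(P_n-P_0)IC(P_0)+o_P(1/\sqrt{n}),$$

where $IC(P_0)=D^*(Q,g_0)-D_A(\bar{Q}_0^r,\bar{g}_0)$.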

Thus, under the assumptions of this theorem, an asymptotic 0.95-confidence interval is given by $\psi_n^*\pm 1.96\,\sigma_n/\sqrt{n}$, where $\sigma_n^2=P_nIC_n^2$ and $IC_n=D^*(Q_n^*,g_n^*)-D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$.
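As a minimal sketch of this interval, assume the values of the estimated influence curve at the observations are already available as a list. The function name `wald_ci` and its arguments are hypothetical and not part of the article; the toy input is made up purely to show the arithmetic.

```python
import math

def wald_ci(psi_n, ic_values, z=1.96):
    """Return (lower, upper) = psi_n -/+ z * sigma_n / sqrt(n).

    ic_values holds the estimated influence curve evaluated at each
    observation; sigma_n^2 is the empirical mean of its square.
    """
    n = len(ic_values)
    sigma_n = math.sqrt(sum(v * v for v in ic_values) / n)
    half_width = z * sigma_n / math.sqrt(n)
    return psi_n - half_width, psi_n + half_width

# Toy input (hypothetical numbers): sigma_n = 1 and n = 4, so the
# half-width is 1.96 / 2 = 0.98.
lower, upper = wald_ci(0.5, [1.0, -1.0, 1.0, -1.0])
```

Because the influence curve is a by-product of the fitted nuisance parameters, the interval requires no resampling.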

3.2 Using a δ-specific submodel for targeting g that guarantees the positivity condition

The following is an application of the constrained logistic regression approach of the type presented in Gruber and van der Laan [19] for the purpose of estimating $\bar{g}_0$ while respecting the constraint that $\bar{g}_0>\delta>0$ for a known $\delta>0$. Recall that $A\in\{0,1\}$. Suppose that it is known that $\bar{g}_0(W)\in(\delta,1]$ for some $\delta>0$, a condition the asymptotic linearity of our proposed estimators relies upon. Define $A_\delta\equiv(A-\delta)/(1-\delta)$. We have $\bar{g}_0(W)=\delta+(1-\delta)\bar{g}_{0,\delta}(W)$, where $\bar{g}_{0,\delta}=E_0(A_\delta\mid W)$ is a regression that is known to take values in $[0,1]$. Let $g_{\delta,n}^0$ be an initial estimator of the true conditional distribution $g_{\delta,0}$ of $A_\delta$, given $W$, which implies an estimator $\bar{g}_n^0=\delta+(1-\delta)\bar{g}_{\delta,n}^0$ of $\bar{g}_0$. Let $k=0$. Consider the following submodel for the conditional distribution of $A_\delta$, given $W$, through a given estimator $g_{\delta,n}^k$:

$$\operatorname{Logit}\bar{g}_{\delta,n}^k(\epsilon)=\operatorname{Logit}\bar{g}_{\delta,n}^k+\epsilon H_A(\bar{Q}_n^{r,k},\bar{g}_{\delta,n}^k).$$

The MLE is simply obtained with logistic regression of $A_\delta$ on $W$ (see, e.g. Gruber and van der Laan [19]) based on the quasi-log-likelihood loss function:

$$\epsilon_n=\arg\min_{\epsilon}P_nL(\bar{g}_{\delta,n}^k(\epsilon)),$$

where

$$-L(\bar{g}_\delta)(O)=A_\delta\log\bar{g}_\delta(W)+(1-A_\delta)\log(1-\bar{g}_\delta(W))$$

is the quasi-log-likelihood loss. The update $\bar{g}_{\delta,n}^{k+1}=\bar{g}_{\delta,n}^k(\epsilon_n)$ implies an update $\bar{g}_n^{k+1}=\delta+(1-\delta)\bar{g}_{\delta,n}^{k+1}$ of $\bar{g}_n^k=\delta+(1-\delta)\bar{g}_{\delta,n}^k$, and, by construction, $\bar{g}_n^{k+1}>\delta>0$. The above submodel $\bar{g}_n^k(\epsilon)=\delta+(1-\delta)\bar{g}_{\delta,n}^k(\epsilon)$ and corresponding loss function $L(\bar{g})=L(\bar{g}_\delta)$ generate the same score equation as the submodel and loss function used in Theorem 2. Therefore, the TMLE algorithm presented in Theorem 2, but now using this $\delta$-specific logistic regression model, solves the same estimating equations, so that Theorem 2 immediately applies. However, using this submodel we have now guaranteed that $\bar{g}_n^k>\delta>0$ for all $k$ in the iterative TMLE algorithm, and thereby that $\bar{g}_n^*>\delta>0$.
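The following is a minimal sketch of one such $\delta$-specific targeting step, not the article's implementation. It assumes the clever-covariate values are supplied by the caller and fits $\epsilon$ by Newton-Raphson on the score of the quasi-log-likelihood; the function name `delta_targeted_update`, the simulated data, and the choice of covariate `H` are all hypothetical.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def delta_targeted_update(A, g_bar, H, delta, max_iter=100, tol=1e-12):
    """One targeting step for g_bar that keeps the update above delta.

    A, g_bar, H are equal-length lists: binary treatments, current fitted
    probabilities in (delta, 1), and clever-covariate values (assumed given).
    """
    n = len(A)
    A_delta = [(a - delta) / (1.0 - delta) for a in A]             # rescaled outcome
    offset = [logit((g - delta) / (1.0 - delta)) for g in g_bar]   # logit of rescaled fit
    eps = 0.0
    for _ in range(max_iter):                                      # Newton-Raphson in eps
        p = [expit(o + eps * h) for o, h in zip(offset, H)]
        score = sum(h * (a - pi) for h, a, pi in zip(H, A_delta, p)) / n
        info = sum(h * h * pi * (1.0 - pi) for h, pi in zip(H, p)) / n
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    # Map the rescaled update back to the original scale.
    g_new = [delta + (1.0 - delta) * expit(o + eps * h) for o, h in zip(offset, H)]
    return g_new, eps

# Simulated toy data, for illustration only.
random.seed(0)
n = 400
W = [random.random() for _ in range(n)]
A = [1 if random.random() < 0.2 + 0.6 * w else 0 for w in W]
g_bar = [0.5] * n                 # deliberately crude initial fit
H = [1.0 + w for w in W]          # stand-in for the clever covariate
delta = 0.05
g_new, eps = delta_targeted_update(A, g_bar, H, delta)
```

Since $A_\delta-\bar{g}_\delta=(A-\bar{g})/(1-\delta)$, solving the score equation on the rescaled scale also solves the empirical mean of $H(A-\bar{g}^{k+1})$ on the original scale, while the back-transformation keeps every fitted value above $\delta$.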

4 Double robust statistical inference for TMLE when using super-learning to fit outcome regression and treatment mechanism

In this section, our aim is to present a TMLE that is asymptotically linear with known influence curve if either g0 or Q0 is consistently estimated, but we do not need to know which one. Again, this requires a novel way of targeting the estimators gn∗,Qˉn∗ in order to arrange that the relevant smooth functionals of these nuisance parameter estimators are indeed asymptotically linear under appropriate second-order term conditions. In this case, we also need to augment the submodel for the estimator of Qˉ0 with another clever covariate: that is, our estimator of Qˉ0 needs to be double targeted, once for solving the efficient influence curve equation, but also for achieving asymptotic linearity in the case that the estimator of g0 is misspecified.

Theorem 3

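Definitions: For any given $\bar{g},\bar{Q}$, let $\bar{g}_n^r(\bar{g},\bar{Q})$ and $\bar{Q}_n^r(\bar{g},\bar{Q})$ be consistent estimators of $\bar{g}_0^r(\bar{g},\bar{Q})=E_{P_0}(A\mid\bar{Q},\bar{g})$ and $\bar{Q}_0^r(\bar{g},\bar{Q})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$, respectively (e.g. using a super-learner or other non-parametric adaptive regression algorithm). Let $\bar{Q}_n^{r*}=\bar{Q}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ and $\bar{g}_n^{r*}=\bar{g}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ denote these estimators applied to the TMLEs $(\bar{g}_n^*,\bar{Q}_n^*)$ defined below.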

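Iterative targeted MLE of $\psi_0$:

Initialization: Let $(g_n,\bar{Q}_n)$ be an initial estimator of $(g_0,\bar{Q}_0)$. Let $g_n^0=g_n$, $\bar{Q}_n^0=\bar{Q}_n$, and let $k=0$. Let $\bar{g}_n^{r,k}=\bar{g}_n^r(\bar{g}_n^k,\bar{Q}_n^k)$ be obtained by non-parametrically regressing $A$ on $\bar{Q}_n^k,\bar{g}_n^k$. Let $\bar{Q}_n^{r,k}=\bar{Q}_n^r(\bar{g}_n^k,\bar{Q}_n^k)$ be obtained by non-parametrically regressing $Y-\bar{Q}_n^k$ on $A=1,\bar{g}_n^k$.

Updating step: Consider the submodel $\operatorname{Logit}\bar{g}_n^k(\epsilon)=\operatorname{Logit}\bar{g}_n^k+\epsilon H_A(\bar{Q}_n^{r,k},\bar{g}_n^k)$, and fit $\epsilon$ with the MLE

$$\epsilon_{A,n}=\arg\max_{\epsilon}P_n\log g_n^k(\epsilon).$$

Define the submodel $\operatorname{Logit}\bar{Q}_n^k(\epsilon)=\operatorname{Logit}\bar{Q}_n^k+\epsilon_1H_Y(\bar{g}_n^k)+\epsilon_2H_Y^1(\bar{g}_n^{r,k},\bar{g}_n^k)$, where

$$H_Y^1(\bar{g}^r,\bar{g})\equiv\frac{A}{\bar{g}^r}\,\frac{\bar{g}^r-\bar{g}}{\bar{g}}.$$

Let $\epsilon_{Y,n}=\arg\min_{\epsilon}P_nL(\bar{Q}_n^k(\epsilon))$ be the MLE, where $L(\bar{Q})$ is the quasi-log-likelihood loss.

We define $g_n^{k+1}=g_n^k(\epsilon_{A,n})$ as the corresponding targeted update of $g_n^k$, and $\bar{Q}_n^{k+1}=\bar{Q}_n^k(\epsilon_{Y,n})$ as the corresponding update of $\bar{Q}_n^k$. Let $\bar{g}_n^{r,k+1}=\bar{g}_n^r(\bar{g}_n^{k+1},\bar{Q}_n^{k+1})$ and $\bar{Q}_n^{r,k+1}=\bar{Q}_n^r(\bar{g}_n^{k+1},\bar{Q}_n^{k+1})$.

Iterate till convergence: Now, set $k\leftarrow k+1$, and iterate this updating process mapping a $(g_n^k,\bar{Q}_n^k,\bar{g}_n^{r,k},\bar{Q}_n^{r,k})$ into $(g_n^{k+1},\bar{Q}_n^{k+1},\bar{g}_n^{r,k+1},\bar{Q}_n^{r,k+1})$ till convergence or till large enough $K$ so that the following three estimating equations are solved up to an $o_P(1/\sqrt{n})$-term:

$$\begin{aligned}
P_nD^*(Q_n^K,g_n^K)&=o_P(1/\sqrt{n}),\\
P_nD_A(\bar{Q}_n^{r,K},\bar{g}_n^K)&=o_P(1/\sqrt{n}),\\
P_nD_Y(\bar{Q}_n^K,\bar{g}_n^{r,K},\bar{g}_n^K)&=o_P(1/\sqrt{n}),
\end{aligned}$$

where

$$D_Y(\bar{Q},\bar{g}_0^r,\bar{g})=H_Y^1(\bar{g}_0^r,\bar{g})(Y-\bar{Q}).$$

Final substitution estimator: Denote the limits of this iterative procedure with $\bar{Q}_n^{r*},\bar{g}_n^{r*},g_n^*,\bar{Q}_n^*$. Let $Q_n^*=(Q_{W,n},\bar{Q}_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.

Equations solved by TMLE:

$$\begin{aligned}
o_P(1/\sqrt{n})&=P_nD^*(Q_n^*,g_n^*),\\
o_P(1/\sqrt{n})&=P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*),\\
o_P(1/\sqrt{n})&=P_nD_Y(\bar{Q}_n^*,\bar{g}_n^{r*},\bar{g}_n^*).
\end{aligned}$$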

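Empirical process condition: Assume that $D^*(Q_n^*,g_n^*)$, $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$, and $D_Y(\bar{Q}_n^*,\bar{g}_n^{r*},\bar{g}_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n\to\infty$.

Negligibility of second-order terms: Define $\bar{Q}_{0,n}^r=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g},\bar{g}_n^*)$ and $\bar{g}_{0,n}^r=E_{P_0}(A\mid\bar{g},\bar{Q},\bar{Q}_n^*)$. Assume that there exists a $\delta>0$ so that $\bar{g}_n^*>\delta>0$ with probability tending to 1, that $\bar{g}_n^*,\bar{Q}_n^*$ are consistent for $\bar{g},\bar{Q}$ w.r.t. the $\|\cdot\|_0$-norm, where either $\bar{g}=\bar{g}_0$ or $\bar{Q}=\bar{Q}_0$, and assume that the following consistency and second-order conditions hold:

$$\begin{aligned}
\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1),\\
\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0&=o_P(1),\\
\|\bar{g}_n^{r*}-\bar{g}_0^r\|_0&=o_P(1),\\
\|\bar{g}_n^*-\bar{g}\|_0^2&=o_P(1/\sqrt{n}),\\
\|\bar{g}_n^*-\bar{g}\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{Q}_{0,n}^r-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{g}_n^{r*}-\bar{g}_0^r\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}).
\end{aligned}$$

Then,

$$\Psi(Q_n^*)-\psi_0=(P_n-P_0)IC(P_0)+o_P(1/\sqrt{n}),$$

where

$$IC(P_0)=D^*(Q,g)-D_A(\bar{Q}_0^r,\bar{g})-D_Y(\bar{Q},\bar{g}_0^r,\bar{g}).$$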

Note that consistent estimation of the influence curve $IC(P_0)$ relies on consistency of $\bar{g}_n^{r*},\bar{Q}_n^{r*}$ as estimators of $\bar{g}_0^r,\bar{Q}_0^r$, and on the estimators $\bar{Q}_n^*,\bar{g}_n^*$ converging to a $\bar{Q},\bar{g}$ for which either $\bar{Q}=\bar{Q}_0$ or $\bar{g}=\bar{g}_0$. These estimators imply an estimated influence curve $IC_n$. An asymptotic 0.95-confidence interval is given by $\psi_n^*\pm 1.96\,\sigma_n/\sqrt{n}$, where $\sigma_n^2=P_nIC_n^2$.

If gˉ=gˉ0, then EP0(A|gˉ,Qˉ)=gˉ, and therefore DY(Qˉ,gˉ0r,gˉ)=0 for all Qˉ. If Qˉ=Qˉ0, then it follows that Qˉ0r=0, and thus that DA(Qˉ0r,gˉ)=0 for all gˉ. In particular, if both gˉ=gˉ0 and Qˉ=Qˉ0, then IC(P0)=D∗(Q0,g0). We also note that if gˉ≠gˉ0, but gˉ is a true conditional distribution of A, given some function Wr of W for which Qˉ(W) is only a function of Wr, then it follows that EP0(A|gˉ,Qˉ)=gˉ and thus DY=0.

As shown in the final remark of the Appendix, the condition of Theorem 3 that either g=g0 or Qˉ=Qˉ0 can be weakened to (gˉ,Qˉ) having to satisfy P0(Qˉ−Qˉ0)(gˉ−gˉ0)/gˉ=0, allowing for the analysis of collaborative double robust TMLE, as discussed in the next section. However, as shown in the next section, if one arranges in the TMLE algorithm that gˉn∗=gˉnr∗ (i.e. gˉn∗ already non-parametrically adjusts for Qˉn∗), then there is no need for the extra targeting in Qˉnk, and the influence curve will be D∗(Q,g)−DA(Qˉ0r,gˉ).

5 Collaborative double robust inference for C-TMLE when using super-learning to fit outcome regression and reduced treatment mechanism

We first review the theoretical underpinning for collaborative estimation of nuisance parameters, in this case, the outcome regression and treatment mechanism. Subsequently, we explain that the desired collaborative estimation can be achieved by applying the previously established template for construction of a C-TMLE to a TMLE that solves certain estimating equations when given an initial estimator of $(Q_0,g_0)$. This C-TMLE template involves (1) creating a sequence of TMLEs $((g_{n,k}^*,Q_{n,k}^*):k=1,\ldots,K)$ constructed in such a manner that the empirical risk of both $g_{n,k}^*$ and $Q_{n,k}^*$ is decreasing in $k$, and (2) using cross-validation to select the $k$ for which $Q_{n,k}^*$ is the best fit of $Q_0$. Next, we present this TMLE that maps an initial estimator of $(Q_0,g_0)$ into targeted estimators solving the desired estimating equations and establish its asymptotic linearity under appropriate conditions, including that the initial estimator of $(Q_0,g_0)$ is collaboratively consistent. Finally, we present a concrete C-TMLE algorithm that uses this TMLE algorithm as its basis: a C-TMLE is still a TMLE, but one based on a data-adaptively selected initial estimator that is collaboratively consistent, so that the same theorem can be applied to this C-TMLE.

5.1 Motivation and theoretical underpinning of collaborative double robust estimation of nuisance parameters

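We note that $P_0D^*(Q,g)=P_0\bigl\{\frac{A}{\bar{g}}(\bar{Q}_0-\bar{Q})+\bar{Q}-\Psi(Q)\bigr\}$. If $Q_W=Q_{W,0}$, this reduces to

$$P_0D^*(Q,g)=P_0\frac{A}{\bar{g}}(\bar{Q}_0-\bar{Q})=\Psi(Q_0)-\Psi(Q)+P_0\frac{A-\bar{g}}{\bar{g}}(\bar{Q}_0-\bar{Q}).$$

Let $\mathcal{G}$ be the class of all possible distributions of $A$, given $W$, and let $g_0\in\mathcal{G}$ be the true conditional distribution of $A$, given $W$. We define the set $\mathcal{G}(P_0,\bar{Q})\equiv\bigl\{g\in\mathcal{G}:0=P_0(A-\bar{g})\frac{\bar{Q}_0-\bar{Q}}{\bar{g}}\bigr\}$. For any $g\in\mathcal{G}(P_0,\bar{Q})$, we have $P_0D^*(Q,g)=\Psi(Q_0)-\Psi(Q)$. Suppose we have an estimator $(Q_n^*,g_n^*)$ satisfying $P_nD^*(Q_n^*,g_n^*)=0$ and converging to a $(Q,g)$ so that $g\in\mathcal{G}(P_0,\bar{Q})$. Then it follows that $P_0D^*(Q,g)=0$ and $P_0D^*(Q,g)=\Psi(Q_0)-\Psi(Q)$, thereby establishing that $\Psi(Q_n^*)$ is a consistent estimator of $\Psi(Q_0)$. Let us state this crucial result as a lemma.

Lemma 1 (van der Laan and Gruber [33]) If $P_0(A-\bar{g})(\bar{Q}_0-\bar{Q})/\bar{g}=0$ and $P_0D^*(Q,g)=0$, then $\Psi(Q)=\psi_0$. More generally, $P_0D^*(Q,g)=\Psi(Q_0)-\Psi(Q)+P_0(A-\bar{g})(\bar{Q}_0-\bar{Q})/\bar{g}$.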
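The identity in Lemma 1 can be checked exactly on a small discrete example. The sketch below assumes a covariate $W$ with three support points and takes $Q_W=Q_{W,0}$; every number and name in it is hypothetical and chosen only for illustration.

```python
# Discrete toy distribution: W takes three values (all numbers hypothetical).
p_w = [0.2, 0.5, 0.3]   # P0(W = w)
g0 = [0.3, 0.5, 0.7]    # true P0(A = 1 | W = w)
Q0 = [0.2, 0.4, 0.9]    # true E0(Y | A = 1, W = w)
g = [0.4, 0.4, 0.6]     # a misspecified treatment mechanism
Q = [0.3, 0.3, 0.8]     # a misspecified outcome regression

def mean_w(values):
    """Expectation over W of a function given by its values on the support."""
    return sum(p * v for p, v in zip(p_w, values))

psi0 = mean_w(Q0)       # Psi(Q0)
psi = mean_w(Q)         # Psi(Q), with Q_W equal to the true Q_{W,0}

def remainder(g_bar):
    # P0 (A - g_bar)(Q0 - Q)/g_bar, using E0(A | W) = g0
    return mean_w([(a - b) * (q0 - q) / b
                   for a, b, q0, q in zip(g0, g_bar, Q0, Q)])

def p0_dstar(g_bar):
    # P0 D*(Q, g_bar), using E0(A (Y - Q) | W) = g0 (Q0 - Q)
    return mean_w([a / b * (q0 - q) + q - psi
                   for a, b, q0, q in zip(g0, g_bar, Q0, Q)])

lhs = p0_dstar(g)
rhs = psi0 - psi + remainder(g)
```

Here `lhs` and `rhs` agree up to floating-point error, and `remainder(g0)` is exactly zero, so with the true treatment mechanism $P_0D^*(Q,g_0)$ reduces to $\Psi(Q_0)-\Psi(Q)$ even though $\bar{Q}$ is misspecified.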

We note that G(P0,Qˉ) contains the true conditional distributions g0r of A, given Wr, for which (Qˉ−Qˉ0)/gˉ0r is a function of Wr, i.e. for which Qˉ−Qˉ0 only depends on W through Wr. We refer to such distributions as reduced treatment mechanisms. However, it contains many more conditional distributions since any conditional distribution g for which (A−gˉ(W)) is orthogonal to (Qˉ0−Qˉ)/gˉ in L02(P0) is an element of G(P0,Qˉ). We refer to van der Laan and Gruber [33] and Gruber and van der Laan [29] for the introduction and general notion of collaborative double robustness.

5.2 C-TMLE

The general C-TMLE introduced in van der Laan and Gruber [33] provides a template for construction of a TMLE (gn∗,Qˉn∗) satisfying PnD∗(Qn∗,gn∗)=0 and converging to a (g,Qˉ) with g∈G(P0,Qˉ) so that P0D∗(Q,g)=0 and thereby Ψ(Q)−Ψ(Q0)=0. Thus C-TMLE provides a template for construction of targeted MLEs that exploit the collaborative double robustness of TMLEs in the sense that a TMLE will be consistent as long as (Qn∗,gn∗) converges to a (Q,g) for which g∈G(P0,Qˉ). The goal is not to estimate the true treatment mechanism, but instead to construct a gn∗ that converges to a conditional distribution given a reduction Wr of W that is an element of G(P0,Qˉ). We could state that, just as the propensity score provides a sufficient dimension reduction for the outcome regression, so does, given Qˉ, (Qˉ−Qˉ0) provide a sufficient dimension reduction for the propensity score regression in the TMLE. The current literature appears to agree that propensity score estimators are best evaluated with respect to their effect on estimation of the causal effect of interest, not by metrics such as likelihoods or classification rates [45–48], and the above-stated general collaborative double robustness provides a formal foundation for such claims.

The general C-TMLE has been implemented and applied to point treatment and longitudinal data [20, 29–33, 35]. A C-TMLE algorithm relies on a TMLE algorithm that maps an initial $(\bar{Q}_n,g_n)$ into a TMLE $(\bar{Q}_n^*,g_n^*)$. It combines this algorithm with a targeted variable selection algorithm for generating candidate models for the propensity score, producing a sequence of candidate TMLEs $(g_n^{*k},\bar{Q}_n^{*k})$ that is increasingly non-parametric in $k$, and finally uses cross-validation to select the best TMLE among these candidate estimators of $\bar{Q}_0$.

5.3 A TMLE that allows for collaborative double robust inference

Our next theorem presents a TMLE algorithm and a corresponding influence curve under the assumption that the propensity score correctly adjusts for the possibly misspecified Qˉ and Qˉ0−Qˉ=E0(Y−Qˉ(W)|A=1,W). The presented TMLE algorithm already arranges that this TMLE indeed non-parametrically adjusts for Qˉ. In the next subsection, we will present an actual C-TMLE algorithm that generates a TMLE for which the propensity score is targeted to adjust for Qˉ−Qˉ0, so that this theorem can be applied.

Theorem 4

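Definitions: For any given $\bar{g},\bar{Q}$, let $\bar{g}_n^r(\bar{g},\bar{Q})$ and $\bar{Q}_n^r(\bar{g},\bar{Q})$ be consistent estimators of $\bar{g}_0^r(\bar{g},\bar{Q})=E_{P_0}(A\mid\bar{g},\bar{Q})$ and $\bar{Q}_0^r(\bar{g},\bar{Q})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$, respectively (e.g. using a super-learner or other non-parametric adaptive regression algorithm). Let $\bar{Q}_n^{r*}=\bar{Q}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ and $\bar{g}_n^{r*}=\bar{g}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ denote these estimators applied to the TMLE $(\bar{g}_n^*,\bar{Q}_n^*)$ defined below.

“Score” equations the TMLE should solve: Below, we describe an iterative TMLE algorithm that results in estimators $\bar{g}_n^{r*},\bar{Q}_n^{r*},g_n^*,\bar{Q}_n^*$ that solve the following equations:

$$\begin{aligned}
0&=P_nD^*(Q_n^*,g_n^*),\\
0&=P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*).
\end{aligned}\tag{3}$$

Iterative targeted MLE of $\psi_0$:

Initialization: Let $\bar{Q}_n$ and $g_n$ (e.g. aiming to adjust for $\bar{Q}_n-\bar{Q}_0$) be initial estimators.

Let $\bar{Q}_n^0=\bar{Q}_n$, $\bar{g}_n^0=\bar{g}_n^r(\bar{g}_n,\bar{Q}_n^0)$, and $\bar{Q}_n^{r,0}=\bar{Q}_n^r(\bar{g}_n^0,\bar{Q}_n^0)$.

Updating step: Consider the submodel $\operatorname{Logit}\bar{g}_n^k(\epsilon)=\operatorname{Logit}\bar{g}_n^k+\epsilon H_A(\bar{Q}_n^{r,k},\bar{g}_n^k)$, and fit $\epsilon$ with the MLE

$$\epsilon_{A,n}=\arg\max_{\epsilon}P_n\log g_n^k(\epsilon).$$

Define the submodel $\operatorname{Logit}\bar{Q}_n^k(\epsilon)=\operatorname{Logit}\bar{Q}_n^k+\epsilon H_Y(g_n^k)$ and let $L(\bar{Q})$ be the quasi-log-likelihood loss function for $\bar{Q}_0$. Let $\epsilon_{Y,n}=\arg\min_{\epsilon}P_nL(\bar{Q}_n^k(\epsilon))$ be the MLE. Let $\bar{Q}_n^{k+1}=\bar{Q}_n^k(\epsilon_{Y,n})$, $\bar{g}_n^{k+1}=\bar{g}_n^r(\bar{g}_n^k(\epsilon_{A,n}),\bar{Q}_n^{k+1})$, and $\bar{Q}_n^{r,k+1}=\bar{Q}_n^r(\bar{g}_n^{k+1},\bar{Q}_n^{k+1})$.

Iterating till convergence: Now, set $k\leftarrow k+1$ and iterate this updating process mapping a $(g_n^k,\bar{Q}_n^k,\bar{Q}_n^{r,k})$ into $(g_n^{k+1},\bar{Q}_n^{k+1},\bar{Q}_n^{r,k+1})$ till convergence or till large enough $K$ so that the following estimating equations are solved up to an $o_P(1/\sqrt{n})$-term:

$$\begin{aligned}
o_P(1/\sqrt{n})&=P_nD^*(Q_n^K,g_n^K),\\
o_P(1/\sqrt{n})&=P_nD_A(\bar{Q}_n^{r,K},\bar{g}_n^K).
\end{aligned}$$

Final substitution estimator: Denote these limits (in $k$) of this iterative procedure with $g_n^*,\bar{Q}_n^*,\bar{Q}_n^{r*}$. Let $Q_n^*=(Q_{W,n},\bar{Q}_n^*)$, where $Q_{W,n}$ is the empirical distribution estimator of $Q_{W,0}$. The TMLE of $\psi_0$ is defined as $\Psi(Q_n^*)$.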

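Assumption on limits $\bar{g},\bar{Q}$ of $\bar{g}_n^*,\bar{Q}_n^*$: Assume that $(\bar{g}_n^*,\bar{Q}_n^*)$ is consistent for $(\bar{g},\bar{Q})$ w.r.t. the $\|\cdot\|_0$-norm, where $\bar{g}(W)=E_{P_0}(A\mid W^r)$ for some function $W^r(W)$ of $W$ for which $\bar{Q}$ only depends on $W$ through $W^r$, and assume that $P_0\frac{\bar{Q}-\bar{Q}_0}{\bar{g}}(A-\bar{g})=0$, where the latter holds, in particular, if $\bar{Q}-\bar{Q}_0$ only depends on $W$ through $W^r$ (e.g. $\bar{g}_n^*$ involves non-parametric adjustment by $\bar{Q},\bar{Q}_0$). As a consequence, we have $\bar{g}=\bar{g}_0^r$.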

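Empirical process condition: Assume that $D^*(Q_n^*,g_n^*)$ and $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$ fall in a $P_0$-Donsker class with probability tending to 1 as $n\to\infty$.

Negligibility of second-order terms: Define

$$\bar{Q}_{0,n}^r\equiv E_{P_0}\bigl(Y-\bar{Q}\mid A=1,\bar{g},\bar{g}_n^*\bigr).$$

We assume that $\bar{g},\bar{g}_n^*$ are bounded from below by some $\delta>0$ with probability tending to one, and

$$\begin{aligned}
\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0&=o_P(1),\\
\|\bar{g}_n^*-\bar{g}\|_0^2&=o_P(1/\sqrt{n}),\\
\|\bar{Q}_{0,n}^r-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{Q}_n^*-\bar{Q}\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\,\|\bar{Q}_n^*-\bar{Q}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\,\|\bar{g}_n^*-\bar{g}\|_0&=o_P(1/\sqrt{n}),\\
\|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\,\|\bar{Q}_n^{r*}-\bar{Q}_0^r\|_0&=o_P(1/\sqrt{n}).
\end{aligned}$$

Then,

$$\Psi(Q_n^*)-\psi_0=(P_n-P_0)IC(P_0)+o_P(1/\sqrt{n}),$$

where

$$IC(P_0)=D^*(Q,g)-D_A(\bar{Q}_0^r,\bar{g}).$$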

Thus, consistency of this TMLE relies upon the consistency of Qˉnr∗ as an estimator of Qˉ0r, and estimator (Qˉn∗,gˉn∗) converging to a (Qˉ,gˉ) for which gˉ equals a true conditional mean of A, given Wr, and Qˉ0−Qˉ,Qˉ only depend on W through Wr. Since Qˉ0,nr−Qˉ0r depends on how well gˉn∗ approximates gˉ, Qˉnr∗−Qˉ0r depends on how well (Qˉn∗,gˉn∗) approximates (Qˉ,gˉ), beyond the behavior of the non-parametric regression defining Qˉnr. In addition, gˉ0,nr−gˉ0r depends on either how well gˉn∗ approximates gˉ or how well Qˉn∗ approximates Qˉ. As a consequence, it follows that each of the second-order terms displayed in the theorem involves square differences of approximation errors gˉn∗−gˉ and Qˉn∗−Qˉ.

It is also interesting to note that the algebraic form of the influence curve of this TMLE is identical to the influence curve of the TMLE of Theorem 2 that relied on gˉn∗ being consistent for gˉ0.

5.4 A C-TMLE algorithm

The TMLE algorithm presented in Theorem 4 maps an initial estimator $(Q_n^0,g_n^0)$ into an updated estimator $(Q_n^*,g_n^*)$ that solves the two estimating equations (3), allowing for statistical inference with known influence curve if the initial estimator $(Q_n^0,g_n^0)$ is collaboratively consistent (i.e. the limits of $(Q_n^*,g_n^*)$ satisfy the condition in the theorem). The updating algorithm results in a $g_n^*$ that non-parametrically adjusts for $\bar{Q}_n^*$ itself, and thus, in the limit, for $\bar{Q}$. The condition on the limit $g$ was that it should non-parametrically adjust not only for $\bar{Q}$ but also for $\bar{Q}-\bar{Q}_0$. If the initial estimator $g_n^0$ already adjusted for an approximation of $\bar{Q}_n^0-\bar{Q}_0$, for example because $(g_n^0,Q_n^0)$ is already a C-TMLE, then this condition might hold approximately. Nonetheless, we want to present a C-TMLE algorithm that simultaneously fits $g$ in response to $\bar{Q}-\bar{Q}_0$ and carries out the non-parametric adjustment by $\bar{Q}$. The latter is normally not part of the C-TMLE algorithm, but we want to enforce it in order to be able to apply Theorem 4 and thereby obtain a known influence curve. We achieve this goal in this subsection by applying the C-TMLE algorithm as presented by van der Laan and Gruber [49] to the particular TMLE algorithm presented in Theorem 4.

First, we compute a set of K univariate covariates W1,…,WK, i.e. functions of W, which we will refer to as main terms, even though a term could be an interaction term or a super-learning fit of the regression of A on a subset of the components of W. Let Ω={W1,…,WK} be the full collection of main terms. In the previous subsection, we defined an algorithm that maps an initial (Q,g) into a TMLE (Q∗,g∗). Let O ↦L(Q)(O) be the loss function for Q0.

The general template of a C-TMLE algorithm is the following: given a TMLE algorithm that maps any initial (Q,g) into a TMLE (Q∗,g∗), the C-TMLE algorithm generates a sequence of increasing sets Sk⊂Ω of k main terms, where each set Sk has an associated estimator gk of g0, and simultaneously it generates a corresponding sequence of Qk, k=1,…,K, where both gk and Qk are increasingly non-parametric in k. Here increasingly non-parametric means that the empirical mean of the loss function of the fit is decreasing in k. This sequence (gk,Qk) maps into a corresponding sequence of TMLEs (gk∗,Qk∗) using the TMLE algorithm presented in Theorem 4. In this variable selection algorithm, the choice of the next main term to add, mapping Sk into Sk+1, is based on how much the TMLE using the g-fit implied by Sk+1, using Qk as initial estimator, improves the fit of the corresponding TMLE Qk∗ for Q0. Cross-validation is used to select k among these candidate TMLEs Qk∗, k=1,…,K, where the last TMLE QK∗ uses the most aggressive bias reduction by being based on the most non-parametric estimator gK implied by Ω.

In order to present a precise C-TMLE algorithm we will first introduce some notation. For a given subset of main terms S⊂Ω, let Sc be its complement within Ω. In the C-TMLE algorithm, we use a forward selection algorithm that augments a given set Sk into a next set Sk+1 obtained by adding the best main term among all main terms in the complement Sk,c of Sk. Each choice S corresponds with an estimator of g0. In other words, the algorithm iteratively updates a current estimate gk into a new estimate gk+1, but the criterion for g does not measure how well g fits g0; it measures how well the TMLE of Q0 that uses this g (and as initial estimator Qk) fits Q0.

Given a set Sk, an initial gk−1,Qk−1, we define a corresponding gk obtained by MLE-fitting of β in the logistic regression working model

Logit gˉk=Logit gˉ0rgˉk−1,Qˉk−1+∑j∈SkβjWj,

where we remind the reader of the definition gˉ0r(gˉ,Qˉ)=E0(A|Qˉ(W),gˉ(W)). Thus, this estimator gk involves non-parametric adjustment by gˉk−1,Qˉk−1, augmented with a linear regression component implied by Sk. This function mapping Sk,gk−1,Qk−1 into a fit gk will be denoted with g(Sk,gk−1,Qk−1). This also allows us to define a mapping from (Qk,Sk,Qk−1,gk−1) into a TMLE (Qk∗,gk∗) defined by the TMLE algorithm of Theorem 4 applied to initial Qk and gk=g(Sk,gk−1,Qk−1). We will denote this mapping into Qk∗ with TMLE(Qk,Sk,Qk−1,gk−1).

The C-TMLE algorithm defined below generates a sequence $(Q_k,S_k)$ and thereby corresponding TMLEs $(Q_k^*,g_k^*)$, $k=0,\ldots,K$, where $Q_k$ represents an initial estimate, $S_k$ a subset of main terms that defines $g_k$, and $(Q_k^*,g_k^*)$ the corresponding TMLE that starts with $(Q_k,g_k)$. These TMLEs $Q_k^*$ represent subsequent updates of the initial estimator $Q_0$. The corresponding main term set $S_k$ that defines $g_k$ in this $k$-specific TMLE increases in $k$, one unit at a time: $S_0$ is empty, $|S_{k+1}|=|S_k|+1$, $S_K=\Omega$. The C-TMLE uses cross-validation to select $k$, and thereby to select the TMLE $Q_k^*$ that yields the best fit of $Q_0$ among the $K+1$ $k$-specific TMLEs $(Q_k^*:k=0,\ldots,K)$ that are increasingly aggressive in their bias-reduction effort. This C-TMLE algorithm is defined as follows and uses the same format as presented in Wang et al. [35]:

Initiate algorithm: Set initial TMLE. Let k=0, and Qk=Q0, gstart be initial estimates of Q0, g0, and let S0 be the empty set. Let gk=g(S0,Q0,gstart). This defines an initial TMLE

Q0∗=TMLE(Q0,S0,Q0,g0).

Determine next TMLE. Determine the next best main term to add:

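$$S_{k+1,cand}=\arg\min_{\{S_k\cup W_j:\,W_j\in S_{k,c}\}}P_nL\bigl(\mathrm{TMLE}(Q_k,S_k\cup W_j,Q_{k-1},g_{k-1})\bigr).$$

If

$$P_nL\bigl(\mathrm{TMLE}(Q_k,S_{k+1,cand},Q_{k-1},g_{k-1})\bigr)\le P_nL(Q_k^*),$$

then $(S_{k+1}=S_{k+1,cand},\,Q_{k+1}=Q_k)$, else $Q_{k+1}=Q_k^*$, and

$$S_{k+1}=\arg\min_{\{S_k\cup W_j:\,W_j\in S_{k,c}\}}P_nL\bigl(\mathrm{TMLE}(Q_k^*,S_k\cup W_j,Q_{k-1},g_{k-1})\bigr).$$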

[In words: If the next best main term added to the fit of EP0(A|W) yields a TMLE of EP0(Y|A,W) that improves upon the previous TMLE Qk∗, then we accept this best main term, and we have our next (Qk+1,Sk+1) and corresponding TMLE Qk+1∗,gk+1∗ (which still uses the same initial estimate of Q0 as Qk∗ uses). Otherwise, reject this best main term, update the initial estimate in the candidate TMLEs to the previous TMLE Qk∗ of EP0(Y|A,W), and determine the best main term to add again. This best main term will now always result in an improved fit of the corresponding TMLE of Q0, so that we now have our next TMLE Qk+1∗,gk+1 (which now uses a different initial estimate than Qk∗ used).]

Iterate. Run this from k=1 to K at which point SK=Ω. This yields a sequence (Qk,gk) and corresponding TMLE (Qk∗,gk∗), k=0,…,K.

This sequence of candidate TMLEs Qk∗ of Q0 has the following property: the estimates gk are increasingly non-parametric in k and PnL(Qk∗) is decreasing in k, k=0,…,K. It remains to select k. For that purpose we use V-fold cross-validation. That is, for each of the V splits of the sample in a training and validation sample, we apply the above algorithm for generating a sequence of candidate estimates (Qk∗:k) to a training sample, and we evaluate the empirical mean of the loss function at the resulting Qk∗ over the validation sample, for each k=0,…,K. For each k we take the average over the V splits of the k-specific performance measure over the validation sample, which is called the cross-validated risk of the k-specific TMLE. We select the k that has the best cross-validated risk, which we denote with kn. Our final C-TMLE of Q0 is now defined as Qn∗=Qkn∗, and the TMLE of ψ0 is defined as ψn∗=Ψ(Qn∗).
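The final selection step can be sketched as follows, assuming the validation-sample risks of the $k$-specific candidates have already been computed for each of the $V$ splits. The function name `select_k` and the risk table are hypothetical; the sketch covers only the averaging and selection, not the fitting of the candidate TMLEs.

```python
def select_k(validation_risk):
    """Pick the candidate index with the smallest cross-validated risk.

    validation_risk[v][k] is the empirical mean of the loss over validation
    sample v for the k-th candidate fitted on the matching training sample.
    """
    V = len(validation_risk)
    n_candidates = len(validation_risk[0])
    # Average the k-specific validation risk over the V splits.
    cv_risk = [sum(fold[k] for fold in validation_risk) / V
               for k in range(n_candidates)]
    k_n = min(range(n_candidates), key=lambda k: cv_risk[k])
    return k_n, cv_risk

# Hypothetical risks for V = 3 splits and candidates k = 0, ..., 3.
risks = [
    [0.70, 0.62, 0.60, 0.65],
    [0.72, 0.60, 0.61, 0.66],
    [0.69, 0.63, 0.58, 0.64],
]
k_n, cv_risk = select_k(risks)
```

In this toy table the empirical risk on the training data would keep decreasing in $k$, whereas the cross-validated risk is smallest at an intermediate candidate, which is why the selection is made on validation samples.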

Fast version of above C-TMLE: We could carry out the above C-TMLE algorithm with the TMLE that maps an initial $(Q,g)$ into $(Q^*,g^*)$ replaced by the first step of the TMLE, which maps $(Q,g)$ into $(Q^1,g^1)$. In that manner, the selection of the sets $S_k$ is based on the bias reduction achieved in the first step of the TMLE algorithm, and most bias reduction occurs in the first step. After having selected the final one-step TMLE $Q_{k_n}^1$ and corresponding $g_{k_n}$, one should still carry out the full TMLE algorithm so that the final $(Q_n^*,g_n^*)=(Q_{k_n}^*,g_{k_n}^*)$ is a real TMLE solving the estimating equations of Theorem 4.

Statistical inference for C-TMLE: Let $\bar{Q}_n^{r*}=\bar{Q}_n^r(\bar{g}_n^*,\bar{Q}_n^*)$ be the final estimator of $\bar{Q}_0^r=\bar{Q}_0^r(\bar{g},\bar{Q})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$, a by-product of the TMLE algorithm. An estimate of the influence curve of $\psi_n^*$ is given by

$$IC_n=D^*(Q_n^*,\bar{g}_n^*)-D_A(\bar{Q}_n^{r*},\bar{g}_n^*).$$

The asymptotic variance of $\sqrt{n}(\psi_n^*-\psi_0)$ can thus be estimated with $\sigma_n^2=\frac{1}{n}\sum_{i=1}^nIC_n(O_i)^2$. An asymptotically valid 0.95-confidence interval for $\psi_0$ is given by $\psi_n^*\pm 1.96\,\sigma_n/\sqrt{n}$.

6 Discussion

Targeted minimum loss-based estimation allows us to construct plug-in estimators $\Psi(Q_n^*)$ of a path-wise differentiable parameter $\Psi(Q_0)$ utilizing the state of the art in ensemble learning such as super-learning, while guaranteeing that the estimator $Q_n^*$ and an estimator $g_n^*$ of the nuisance parameter the TMLE utilizes in its targeting step solve a set of user-supplied estimating equations (empirical means of estimating functions). These estimating functions can be selected so that the resulting TMLE of $\psi_0$ has certain statistical properties such as being efficient, or guaranteed to be more efficient than a given user-supplied estimator [28, 29], and so on. However, most importantly, these estimating equations are necessary to make the TMLE asymptotically linear, i.e. to make the TMLE unbiased enough so that the first-order linear expansion can be used for statistical inference. For example, by selecting the estimating functions to be equal to the canonical gradient of $\Psi:\mathcal{M}\to\mathbb{R}$ one arranges that $\Psi(Q_n^*)$ is asymptotically efficient under conditions that assume consistency of $Q_n^*$ and $g_n^*$.

However, we noted that this level of targeting is insufficient if one only relies on consistency of gn∗, even when that suffices for consistency of Ψ(Qn∗). Under such weaker assumptions, additional targeting is necessary so that a specific smooth functional of gn∗ is asymptotically linear, which requires that an unknown smooth function of gn∗ is itself a TMLE. The joint targeting of Qn∗ and gn∗ is achieved by a TMLE that also solves the extra equations making this smooth function of gn∗ asymptotically linear, allowing one to establish asymptotic linearity of Ψ(Qn∗) under milder conditions that assume that the second-order terms are negligible relative to the first-order linear approximation.

In this article we also pushed this additional level of targeting to a new level by demonstrating how it allows for double robust statistical inference, and that even if we estimate the nuisance parameter in a complicated manner that is based on a criterion that cares about how it helps the estimator to fit ψ0, as used by the C-TMLE, we can still determine a set of additional estimating equations that need to be targeted by the TMLE in order to establish asymptotic linearity and thereby valid statistical inference based on the central limit theorem. This allows us now to use the sophisticated but often necessary C-TMLE while still preserving valid statistical inference under regularity conditions.

It remains to evaluate the practical benefit of the modifications of IPTW, TMLE, and C-TMLE as presented in this article for both estimation and assessment of uncertainty. We plan to address this in future research.

Even though we focussed in this article on a particular concrete estimation problem, TMLE is a general tool and our TMLE and theorems can be generalized to general statistical models and path-wise differentiable statistical target parameters.

We note that this targeting of nuisance parameter estimators in the TMLE is not only necessary to get a known influence curve but also necessary to make the TMLE asymptotically linear. So it does not suffice to run a bootstrap as an alternative to influence-curve-based inference, since the bootstrap can only work if the estimator is asymptotically linear, so that it has a limit distribution. In addition, the established asymptotic linearity with known influence curve has the important by-product that one now obtains statistical inference at no extra computational cost. This is particularly important in these large semi-parametric models that require aggressive machine learning methods in order to cover the model space, making the estimators by necessity very computer intensive, so that a (disputable) bootstrap method might simply be too computer intensive.

Acknowledgments

This research was supported by an NIH grant R01 AI074345-06. The author is grateful for the excellent, helpful, and insightful comments of the reviewers.

Appendix

Proof of Theorem 1

To start with we note:

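$$\begin{aligned}
P_nD(g_n^*)-P_0D(g_0)&=(P_n-P_0)D(g_0)+P_n(D(g_n^*)-D(g_0))\\
&=(P_n-P_0)(D(g_0)-\psi_0)+P_0(D(g_n^*)-D(g_0))+(P_n-P_0)(D(g_n^*)-D(g_0)).
\end{aligned}$$

The first term of this decomposition yields the first component $D(g_0)-\psi_0$ of the influence curve. Since $g_n^*$ falls in a Donsker class, the rightmost term is $o_P(1/\sqrt{n})$ if $P_0(D(g_n^*)-D(g_0))^2\to 0$ in probability. So it remains to analyze the term $P_0(D(g_n^*)-D(g_0))$. We now note

$$\begin{aligned}
P_0\{D(g_n^*)-D(g_0)\}&=P_0YA\left(\frac{1}{g_n^*}-\frac{1}{g_0}\right)=P_0YA\,\frac{g_0-g_n^*}{g_n^*g_0}\\
&=P_0YA\,\frac{g_0-g_n^*}{g_0^2}+P_0YA\,\frac{(g_0-g_n^*)^2}{g_0^2g_n^*}.
\end{aligned}$$

By our assumptions, the last term satisfies

$$P_0YA\,\frac{(g_0-g_n^*)^2}{g_0^2g_n^*}=P_0\bar{Q}_0\,\frac{(\bar{g}_n^*-\bar{g}_0)^2}{\bar{g}_0\bar{g}_n^*}=o_P(1/\sqrt{n}).$$

So it remains to study:

$$P_0YA\,\frac{g_0-g_n^*}{g_0^2}=P_0\bar{Q}_0\,\frac{\bar{g}_0-\bar{g}_n^*}{\bar{g}_0}.$$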

Note that this equals $-\{\Psi_1(g_n^*)-\Psi_1(g_0)\}$, where $\Psi_1(g)=P_0(\bar{Q}_0/\bar{g}_0)\bar{g}$ is an unknown smooth parameter of $g$. Our strategy is to first approximate this parameter by an easier (still unknown) parameter $\Psi_1^r(g)=P_0(\bar{Q}_0^r/\bar{g}_0)\bar{g}$, at the cost of a second-order term: $\Psi_1(g_n^*)-\Psi_1(g_0)=\Psi_1^r(g_n^*)-\Psi_1^r(g_0)+o_P(1/\sqrt{n})$. This is carried out in the next lemma. The efficient influence curve of a target parameter $\Phi:\bar{g}\mapsto P_0H\bar{g}$ (which treats $P_0$ as known) at $g_0$ is given by $H(A-\bar{g}_0)$. Thus, one would like to construct $\bar{g}_n^*$ so that it solves the empirical mean of $H_0^r(A-\bar{g}_n^*)$ for $H_0^r=\bar{Q}_0^r/\bar{g}_0$, so that $\bar{g}_n^*$ targets the parameter $\Psi_1^r(g_0)$. However, $H_0^r$ is unknown. Therefore, $\bar{g}_n^*$ is instead constructed to solve the empirical mean of an estimate $H_n^{r*}(A-\bar{g}_n^*)$ of the efficient influence curve $H_0^r(A-\bar{g}_n^*)$, and we will show that this indeed suffices to establish the asymptotic linearity of $\Psi_1^r(\bar{g}_n^*)$.

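Lemma 2 Define $\Psi_1(g)=P_0(\bar{Q}_0/\bar{g}_0)\bar{g}$, $\Psi_1^r(g)=P_0(\bar{Q}_0^r/\bar{g}_0)\bar{g}$, $\bar{Q}_{0,n}^r\equiv E_{P_0}(Y\mid A=1,\bar{g}_0(W),\bar{g}_n^*(W))$, and $\bar{Q}_0^r=E_{P_0}(Y\mid A=1,\bar{g}_0(W))$, where $\bar{g}_n^*(W)$ is treated as a fixed function of $W$ when calculating the conditional expectation. Assume

$$R_{1,n}\equiv P_0(\bar{Q}_{0,n}^r-\bar{Q}_0^r)(\bar{g}_n^*-\bar{g}_0)/\bar{g}_0=o_P(1/\sqrt{n}).$$

Then,

$$\Psi_1(g_n^*)-\Psi_1(g_0)=\Psi_1^r(\bar{g}_n^*)-\Psi_1^r(\bar{g}_0)+R_{1,n}.$$

Proof of Lemma 2: Note that

$$\begin{aligned}
\Psi_1(g_n^*)-\Psi_1(g_0)&=P_0YA\,\frac{g_n^*-g_0}{g_0^2}=P_0\bar{Q}_{0,n}^rA\,\frac{g_n^*-g_0}{g_0^2}=P_0\bar{Q}_{0,n}^r\,\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}_0}\\
&=P_0\bar{Q}_0^r\,\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}_0}+P_0(\bar{Q}_{0,n}^r-\bar{Q}_0^r)\,\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}_0}.\qquad\Box
\end{aligned}$$

Since we assumed $R_{1,n}=o_P(1/\sqrt{n})$, it remains to prove that $\Psi_1^r(g_n^*)-\Psi_1^r(g_0)=P_0\bar{Q}_0^r(\bar{g}_n^*-\bar{g}_0)/\bar{g}_0$ is asymptotically linear. Recall that $H_0^r=\bar{Q}_0^r/\bar{g}_0$ and $H_n^{r*}=\bar{Q}_n^{r*}/\bar{g}_n^*$, where $\bar{Q}_n^{r*}$ is obtained by regressing $Y$ on the initial estimator $\bar{g}_n(W)$ and $A=1$.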

The next step of the proof is the following series of equalities

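$$\begin{aligned}
P_0D_A(\bar{Q}_n^{r*},\bar{g}_n^*) &= P_0\{D_A(\bar{Q}_n^{r*},\bar{g}_n^*)-D_A(\bar{Q}_0^r,\bar{g}_n^*)\}+P_0D_A(\bar{Q}_0^r,\bar{g}_n^*)\\
&= \int(H_n^{r*}-H_0^r)(W)\,(A-\bar{g}_n^*(W))\,dP_0(W,A)+P_0D_A(\bar{Q}_0^r,\bar{g}_n^*)\\
&= \int(H_n^{r*}-H_0^r)(W)\,(\bar{g}_0-\bar{g}_n^*)(W)\,dP_0(W)+P_0D_A(\bar{Q}_0^r,\bar{g}_n^*)\\
&= R_{2,n}+P_0D_A(\bar{Q}_0^r,\bar{g}_n^*),
\end{aligned}$$

where, by assumption, $R_{2,n}=o_P(1/\sqrt{n})$. We now note that

$$P_0D_A(\bar{Q}_0^r,\bar{g}_n^*)=\int H_0^r(W)\,(A-\bar{g}_n^*(W))\,dP_0(A,W)=\int H_0^r(W)\,\bar{g}_0(W)\,dP_0(W)-\int H_0^r(W)\,\bar{g}_n^*(W)\,dP_0(W)\equiv\Psi_1^r(\bar{g}_0)-\Psi_1^r(\bar{g}_n^*).$$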

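Thus, we have

$$-P_0D_A(\bar{Q}_n^{r*},\bar{g}_n^*)=\Psi_1^r(\bar{g}_n^*)-\Psi_1^r(\bar{g}_0)-R_{2,n},$$

from which we deduce, by Lemma 2 and $P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*)=0$, that

$$\begin{aligned}
\Psi_1(\bar{g}_n^*)-\Psi_1(\bar{g}_0) &= -P_0D_A(\bar{Q}_n^{r*},\bar{g}_n^*)+R_{1,n}+R_{2,n}\\
&= (P_n-P_0)D_A(\bar{Q}_n^{r*},\bar{g}_n^*)+R_{1,n}+R_{2,n}\\
&= (P_n-P_0)D_A(\bar{Q}_0^r,\bar{g}_0)+R_{1,n}+R_{2,n}+R_{3,n},
\end{aligned}$$

where we defined

$$R_{3,n}=(P_n-P_0)\{D_A(\bar{Q}_n^{r*},\bar{g}_n^*)-D_A(\bar{Q}_0^r,\bar{g}_0)\}.$$

By our assumptions, $R_{3,n}=o_P(1/\sqrt{n})$, so that it follows that $\Psi_1(g_n^*)-\Psi_1(g_0)=(P_n-P_0)D_A(\bar{Q}_0^r,\bar{g}_0)+o_P(1/\sqrt{n})$. $\Box$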
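As a summary sketch that only recombines the displays of this proof (under the assumptions already stated, with all remainders $o_P(1/\sqrt{n})$), the pieces fit together as follows:

$$\begin{aligned}
P_nD(g_n^*)-\psi_0 &= (P_n-P_0)\{D(g_0)-\psi_0\}-\{\Psi_1(g_n^*)-\Psi_1(g_0)\}+o_P(1/\sqrt{n})\\
&= (P_n-P_0)\{D(g_0)-\psi_0-D_A(\bar{Q}_0^r,\bar{g}_0)\}+o_P(1/\sqrt{n}).
\end{aligned}$$

Read this way, the influence curve has two components: $D(g_0)-\psi_0$ from the first term of the initial decomposition, and $-D_A(\bar{Q}_0^r,\bar{g}_0)$ from the targeted estimation of $g_0$.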

Proof of Theorem 2

One easily checks that

$$\begin{aligned}
\Psi(Q_n^*)-\Psi(Q_0) &= -P_0D^*(Q_n^*,g_0)\\
&= -P_0D^*(Q_n^*,g_n^*)+P_0\{D^*(Q_n^*,g_n^*)-D^*(Q_n^*,g_0)\}\\
&= (P_n-P_0)D^*(Q_n^*,g_n^*)+P_0\{D^*(Q_n^*,g_n^*)-D^*(Q_n^*,g_0)\},
\end{aligned}$$

because $P_nD^*(Q_n^*,g_n^*)=0$ by eq. (2). If $D^*(Q_n^*,g_n^*)$ falls in a $P_0$-Donsker class and $P_0\{D^*(Q_n^*,g_n^*)-D^*(Q,g_0)\}^2=o_P(1)$ for some possibly misspecified limit $Q$ of $Q_n^*$, then the first term on the right-hand side equals $(P_n-P_0)D^*(Q,g_0)+o_P(1/\sqrt{n})$, giving us the first component $D^*(Q,g_0)$ of the influence curve of $\Psi(Q_n^*)$. The second term can be written as $A+B$ with

$$A=P_0\{D^*(Q_n^*,g_n^*)-D^*(Q_n^*,g_0)\}-P_0\{D^*(Q,g_n^*)-D^*(Q,g_0)\},\qquad B=P_0\{D^*(Q,g_n^*)-D^*(Q,g_0)\}.$$

The first term $A$ equals

$$-P_0(H_Y(g_n^*)-H_Y(g_0))(\bar{Q}_n^*-\bar{Q}),$$

where $H_Y(g)(A,W)=A/\bar{g}(W)$. By our assumptions, this term is $o_P(1/\sqrt{n})$. Thus, it suffices to establish asymptotic linearity of $\Psi_1(g_n^*)=P_0D^*(Q,g_n^*)$ as an estimator of $\Psi_1(g_0)=P_0D^*(Q,g_0)$. We have

$$\Psi_1(g_n^*)-\Psi_1(g_0)=-P_0(Y-\bar{Q})\frac{A}{\bar{g}_n^*\bar{g}_0}(\bar{g}_n^*-\bar{g}_0)=-P_0\bar{Q}_{0,n}^r\frac{A}{\bar{g}_n^*\bar{g}_0}(\bar{g}_n^*-\bar{g}_0)=-P_0\bar{Q}_{0,n}^r\frac{1}{\bar{g}_n^*}(\bar{g}_n^*-\bar{g}_0),$$

where $\bar{Q}_{0,n}^r$ appeared by writing the expectation w.r.t. $P_0$ as an expectation of the conditional expectation, given $A,\bar{g}_n^*(W),\bar{g}_0(W)$. Let $H_{0,n}^r=\bar{Q}_{0,n}^r/\bar{g}_n^*$ and recall $H_0^r=\bar{Q}_0^r/\bar{g}_0$, where $\bar{Q}_0^r=E_{P_0}(Y-\bar{Q}(W)\mid A=1,\bar{g}_0(W))$. The last term can be written as

$$-P_0H_0^r(\bar{g}_n^*-\bar{g}_0)-P_0(H_{0,n}^r-H_0^r)(\bar{g}_n^*-\bar{g}_0).$$

By our assumptions, the second term above is $o_P(1/\sqrt{n})$. Thus, in order to establish asymptotic linearity of $\Psi_1(g_n^*)$, it suffices to establish asymptotic linearity of $\Psi_1^r(\bar{g}_n^*)=P_0H_0^r\bar{g}_n^*$ as an estimator of $\Psi_1^r(\bar{g}_0)=P_0H_0^r\bar{g}_0$, where $P_0$ and $H_0^r$ are treated as known.

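The estimator $\bar{g}_n^*$ was instead constructed to target $P_0H_n^{r*}\bar{g}_0$, where we recall that $H_n^{r*}=\bar{Q}_n^{r*}/\bar{g}_n^*$. That is, our targeted estimator $g_n^*$ solves the efficient influence curve equation $P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*)=0$ for the parameter $P_0H_n^{r*}\bar{g}_0$ of $\bar{g}_0$. We now note that

$$P_0D_A(\bar{Q}_0^r,\bar{g}_n^*)=\int H_0^r(W)\,(A-\bar{g}_n^*(W))\,dP_0(A,W)=\int H_0^r(W)\,\bar{g}_0(W)\,dP_0(W)-\int H_0^r(W)\,\bar{g}_n^*(W)\,dP_0(W)\equiv\Psi_1^r(\bar{g}_0)-\Psi_1^r(\bar{g}_n^*).$$

We have

$$\begin{aligned}
P_0D_A(\bar{Q}_n^{r*},\bar{g}_n^*) &= P_0\{D_A(\bar{Q}_n^{r*},\bar{g}_n^*)-D_A(\bar{Q}_0^r,\bar{g}_n^*)\}+P_0D_A(\bar{Q}_0^r,\bar{g}_n^*)\\
&= \int(H_n^{r*}-H_0^r)(W)\,(A-\bar{g}_n^*(W))\,dP_0(W,A)+P_0D_A(\bar{Q}_0^r,\bar{g}_n^*)\\
&= \int(H_n^{r*}-H_0^r)(W)\,(\bar{g}_0-\bar{g}_n^*)(W)\,dP_0(W)+P_0D_A(\bar{Q}_0^r,\bar{g}_n^*)\\
&\equiv R_{2,n}+P_0D_A(\bar{Q}_0^r,\bar{g}_n^*),
\end{aligned}$$

where $R_{2,n}=o_P(1/\sqrt{n})$, by assumption. Combining the last two equations yields:

$$\Psi_1^r(\bar{g}_n^*)-\Psi_1^r(\bar{g}_0)=-P_0D_A(\bar{Q}_n^{r*},\bar{g}_n^*)+R_{2,n}=(P_n-P_0)D_A(\bar{Q}_n^{r*},\bar{g}_n^*)+R_{2,n}=(P_n-P_0)D_A(\bar{Q}_0^r,\bar{g}_0)+R_{2,n}+R_{3,n},$$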

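where we defined

$$R_{3,n}=(P_n-P_0)\{D_A(\bar{Q}_n^{r*},\bar{g}_n^*)-D_A(\bar{Q}_0^r,\bar{g}_0)\}.$$

We have that $R_{3,n}=o_P(1/\sqrt{n})$ if $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)-D_A(\bar{Q}_0^r,\bar{g}_0)$ falls in a $P_0$-Donsker class with probability tending to 1, and $P_0\{D_A(\bar{Q}_n^{r*},\bar{g}_n^*)-D_A(\bar{Q}_0^r,\bar{g}_0)\}^2\to 0$ in probability as $n\to\infty$. Thus, we have proven that $\Psi_1^r(g_n^*)-\Psi_1^r(g_0)=(P_n-P_0)D_A(\bar{Q}_0^r,\bar{g}_0)+o_P(1/\sqrt{n})$. Thus,

$$\Psi_1(g_n^*)-\Psi_1(g_0)=-(P_n-P_0)D_A(\bar{Q}_0^r,\bar{g}_0)+o_P(1/\sqrt{n}).\quad\Box$$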

Proof of Theorem 3

As outlined in Section 1, we have

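$$\begin{aligned}
\Psi(Q_n^*)-\Psi(Q_0) &= -P_0D^*(Q_n^*,g_n^*)+P_0(\bar{Q}_0-\bar{Q}_n^*)\frac{\bar{g}_0-\bar{g}_n^*}{\bar{g}_n^*}\\
&= (P_n-P_0)D^*(Q_n^*,g_n^*)+P_0(\bar{Q}_0-\bar{Q}_n^*)\frac{\bar{g}_0-\bar{g}_n^*}{\bar{g}_n^*}\\
&= (P_n-P_0)D^*(Q,g)+P_0(\bar{Q}_n^*-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}_n^*}+o_P(1/\sqrt{n}),
\end{aligned}$$

if $D^*(Q_n^*,g_n^*)$ falls in a Donsker class with probability tending to 1, and $P_0\{D^*(Q_n^*,g_n^*)-D^*(Q,g)\}^2\to 0$ in probability as $n\to\infty$. The first term on the right-hand side gives us the first component $D^*(Q,g)$ of the influence curve of $\psi_n^*$.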

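It suffices to analyze the second term. Initially, we note that

$$P_0(\bar{Q}_n^*-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}_n^*}=P_0(\bar{Q}_n^*-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}}+R_{1,n},$$

where

$$R_{1,n}=-P_0(\bar{Q}_n^*-\bar{Q}_0)(\bar{g}_n^*-\bar{g}_0)\frac{\bar{g}_n^*-\bar{g}}{\bar{g}\,\bar{g}_n^*}.$$

By assumption, $R_{1,n}=o_P(1/\sqrt{n})$.

Now, we note

$$\begin{aligned}
P_0(\bar{Q}_n^*-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}} &= P_0(\bar{Q}_n^*-\bar{Q}+\bar{Q}-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}+\bar{g}-\bar{g}_0}{\bar{g}}\\
&= P_0(\bar{Q}_n^*-\bar{Q})\frac{\bar{g}_n^*-\bar{g}}{\bar{g}}+P_0(\bar{Q}_n^*-\bar{Q})\frac{\bar{g}-\bar{g}_0}{\bar{g}}+P_0(\bar{Q}-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}}{\bar{g}}+P_0(\bar{Q}-\bar{Q}_0)\frac{\bar{g}-\bar{g}_0}{\bar{g}}.
\end{aligned}$$

By our assumptions, the first term $R_{2,n}=P_0(\bar{Q}_n^*-\bar{Q})(\bar{g}_n^*-\bar{g})/\bar{g}$ satisfies $R_{2,n}=o_P(1/\sqrt{n})$. In addition, the last term equals zero by assumption: $\bar{Q}=\bar{Q}_0$ or $\bar{g}=\bar{g}_0$.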

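So it suffices to analyze the second and third terms of this last expression. In order to represent the second and third terms we define

$$\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}_n^*)=P_0\bar{Q}_n^*\,\frac{\bar{g}-\bar{g}_0}{\bar{g}},\qquad\Psi_{1,\bar{g},\bar{Q},\bar{Q}_0}(\bar{g}_n^*)=P_0\frac{\bar{Q}-\bar{Q}_0}{\bar{g}}\,\bar{g}_n^*.$$

The sum of the second and third terms can now be represented as:

$$I(\bar{Q}=\bar{Q}_0)\{\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}_n^*)-\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q})\}+I(\bar{g}=\bar{g}_0)\{\Psi_{1,\bar{g},\bar{Q},\bar{Q}_0}(\bar{g}_n^*)-\Psi_{1,\bar{g},\bar{Q},\bar{Q}_0}(\bar{g})\}.$$

For notational convenience, we will suppress the dependence of these mappings on the unknown quantities, and thus use $\Psi_1,\Psi_2$.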
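The four-term expansion used above is a purely algebraic identity, and the double robustness of the last term is easy to check numerically. The sketch below does so in Python, replacing $P_0$ by an average over simulated points. All arrays are hypothetical stand-ins for the functions $\bar{Q}_n^*,\bar{Q},\bar{Q}_0,\bar{g}_n^*,\bar{g},\bar{g}_0$ evaluated at values of $W$; nothing here is data from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 1000

# Hypothetical function values at m points of W.
Qn, Q, Q0 = rng.uniform(0.0, 1.0, size=(3, m))
gn, g, g0 = rng.uniform(0.2, 0.8, size=(3, m))

# Left-hand side: average of (Qn - Q0)(gn - g0) / g.
lhs = np.mean((Qn - Q0) * (gn - g0) / g)

# The four terms obtained by inserting +/- Q and +/- g.
terms = [
    np.mean((Qn - Q) * (gn - g) / g),   # second-order term (R_{2,n})
    np.mean((Qn - Q) * (g - g0) / g),   # second term: Psi_2 difference
    np.mean((Q - Q0) * (gn - g) / g),   # third term: Psi_1 difference
    np.mean((Q - Q0) * (g - g0) / g),   # last term
]

# The last term vanishes if either limit is correct.
last_if_Q_correct = np.mean((Q0 - Q0) * (g - g0) / g)
last_if_g_correct = np.mean((Q - Q0) * (g0 - g0) / g0)
```

Here `lhs` agrees with `sum(terms)` up to floating-point error, and both `last_if_Q_correct` and `last_if_g_correct` are exactly zero.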

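Analysis of $\Psi_1(\bar{g}_n^*)$ if $\bar{g}=\bar{g}_0$: Recalling the definition of $\bar{Q}_{0,n}^r$, we have

$$\begin{aligned}
\Psi_1(\bar{g}_n^*)-\Psi_1(\bar{g}) &= P_0\frac{\bar{Q}-\bar{Q}_0}{\bar{g}_0}(\bar{g}_n^*-\bar{g}_0)\\
&= -P_0\frac{(Y-\bar{Q})A}{\bar{g}_0^2}(\bar{g}_n^*-\bar{g}_0)\\
&= -P_0\frac{E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g}_0,\bar{g}_n^*)}{\bar{g}_0}(\bar{g}_n^*-\bar{g}_0)\\
&= -P_0\frac{\bar{Q}_{0,n}^r}{\bar{g}_0}(\bar{g}_n^*-\bar{g}_0)\\
&= -P_0\frac{\bar{Q}_{0,n}^r-\bar{Q}_0^r}{\bar{g}_0}(\bar{g}_n^*-\bar{g}_0)-P_0\frac{\bar{Q}_0^r}{\bar{g}_0}(\bar{g}_n^*-\bar{g}_0).
\end{aligned}$$

By our assumptions,

$$R_{3,n}\equiv P_0\frac{\bar{Q}_{0,n}^r-\bar{Q}_0^r}{\bar{g}_0}(\bar{g}_n^*-\bar{g}_0)=o_P(1/\sqrt{n}),$$

so that it remains to analyze $-P_0H_0^r(\bar{g}_n^*-\bar{g}_0)$, where $H_0^r=\bar{Q}_0^r/\bar{g}_0$. Let $H_n^{r*}=\bar{Q}_n^{r*}/\bar{g}_n^*$, and recall that by construction $P_nD_A(\bar{Q}_n^{r*},\bar{g}_n^*)=P_nH_n^{r*}(A-\bar{g}_n^*)=0$. We then proceed as follows: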

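$$P_0H_0^r(\bar{g}_n^*-\bar{g}_0)=P_0H_n^{r*}(\bar{g}_n^*-\bar{g}_0)+P_0(H_0^r-H_n^{r*})(\bar{g}_n^*-\bar{g}_0)\equiv P_0H_n^{r*}(\bar{g}_n^*-\bar{g}_0)+R_{4,n},$$

where, by our assumptions,

$$R_{4,n}=P_0(H_0^r-H_n^{r*})(\bar{g}_n^*-\bar{g}_0)=o_P(1/\sqrt{n}).$$

In addition,

$$P_0H_n^{r*}(\bar{g}_n^*-\bar{g}_0)=-P_0H_n^{r*}(A-\bar{g}_n^*)=(P_n-P_0)H_n^{r*}(A-\bar{g}_n^*)=(P_n-P_0)H_0^r(A-\bar{g}_0)+R_{4,n}+R_{5,n},$$

where $R_{5,n}=o_P(1/\sqrt{n})$ if $P_0\{D_A(\bar{Q}_n^{r*},\bar{g}_n^*)-D_A(\bar{Q}_0^r,\bar{g})\}^2=o_P(1)$ and $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$ falls in a Donsker class with probability tending to 1. This proves that, if $\bar{g}=\bar{g}_0$, then $\Psi_1(\bar{g}_n^*)-\Psi_1(\bar{g}_0)=-(P_n-P_0)D_A(\bar{Q}_0^r,\bar{g})+o_P(1/\sqrt{n})$.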

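Analysis of $\Psi_2(\bar{Q}_n^*)$ if $\bar{Q}=\bar{Q}_0$: Recall the definitions of $H_Y(\bar{g}^r,\bar{g})$, $\bar{g}_{0,n}^r=E_{P_0}(A\mid\bar{g},\bar{Q}_n^*,\bar{Q})$, $\bar{g}_n^{r*}$ (an estimator of $\bar{g}_0^r=E_{P_0}(A\mid\bar{g},\bar{Q})$), and that, by construction, $P_nH_Y(\bar{g}_n^{r*},\bar{g}_n^*)(Y-\bar{Q}_n^*)=0$. We have

$$\begin{aligned}
\Psi_2(\bar{Q}_n^*)-\Psi_2(\bar{Q}_0) &= P_0\frac{\bar{g}-\bar{g}_0}{\bar{g}}(\bar{Q}_n^*-\bar{Q}_0)\\
&= -P_0\frac{A-\bar{g}}{\bar{g}}(\bar{Q}_n^*-\bar{Q}_0)\\
&= -P_0\frac{\bar{g}_{0,n}^r-\bar{g}}{\bar{g}}(\bar{Q}_n^*-\bar{Q}_0)\\
&= P_0\frac{A}{\bar{g}_{0,n}^r}\,\frac{\bar{g}_{0,n}^r-\bar{g}}{\bar{g}}(Y-\bar{Q}_n^*)\\
&= P_0H_Y(\bar{g}_{0,n}^r,\bar{g})(Y-\bar{Q}_n^*)\\
&= P_0H_Y(\bar{g}_n^{r*},\bar{g}_n^*)(Y-\bar{Q}_n^*)+P_0\{H_Y(\bar{g}_{0,n}^r,\bar{g})(\bar{Q}_0-\bar{Q}_n^*)-H_Y(\bar{g}_n^{r*},\bar{g}_n^*)(\bar{Q}_0-\bar{Q}_n^*)\}.
\end{aligned}$$

Here we used that $\bar{g}_0$ is a conditional expectation of $A$, allowing us to first replace $\bar{g}_0$ by $A$ and then retake the conditional expectation, now conditioning only on what is needed to fix all other terms within the expectation w.r.t. $P_0$. As a result of this trick, we were able to replace the hard-to-estimate $\bar{g}_0$, which conditions on all of $W$, by the easier $\bar{g}_{0,n}^r$. Similarly, we used this to replace $\bar{Q}_0$ by $Y$. The last term is a second-order term involving the products of differences $(\bar{Q}_n^*-\bar{Q}_0)(\bar{g}_n^*-\bar{g})$ and $(\bar{Q}_n^*-\bar{Q}_0)(\bar{g}_{0,n}^r-\bar{g}_n^{r*})$. By our assumptions, this last term is $o_P(1/\sqrt{n})$. We now proceed as follows: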

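$$-P_0H_Y(\bar{g}_n^{r*},\bar{g}_n^*)(Y-\bar{Q}_n^*)=(P_n-P_0)H_Y(\bar{g}_n^{r*},\bar{g}_n^*)(Y-\bar{Q}_n^*)=(P_n-P_0)H_Y(\bar{g}_0^r,\bar{g})(Y-\bar{Q})+o_P(1/\sqrt{n}),$$

where we assumed that $H_Y(\bar{g}_n^{r*},\bar{g}_n^*)(Y-\bar{Q}_n^*)$ falls in a Donsker class with probability tending to 1, and

$$P_0\{H_Y(\bar{g}_n^{r*},\bar{g}_n^*)(Y-\bar{Q}_n^*)-H_Y(\bar{g}_0^r,\bar{g})(Y-\bar{Q})\}^2\to 0,$$

in probability. This proves $\Psi_2(\bar{Q}_n^*)-\Psi_2(\bar{Q}_0)=-(P_n-P_0)D_Y(\bar{Q},\bar{g}_0^r,\bar{g})+o_P(1/\sqrt{n})$. $\Box$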
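The re-conditioning step used in the analysis of $\Psi_2$ above (replace $\bar{g}_0$ by $A$, then take the conditional expectation of $A$ given only a coarser summary) is an instance of the tower property. The sketch below checks it exactly on a small discrete example. The covariate distribution, the summary `S`, and the function `f` are hypothetical choices made only for illustration.

```python
import numpy as np

# Hypothetical discrete covariate W in {0, ..., 5} with probabilities p_w.
p_w = np.array([0.10, 0.20, 0.15, 0.25, 0.10, 0.20])
g0 = np.array([0.30, 0.50, 0.40, 0.70, 0.20, 0.60])  # g0(w) = P(A = 1 | W = w)
S = np.array([0, 0, 1, 1, 2, 2])                     # coarse summary S = s(W)
f = np.array([1.5, -0.7, 2.0])                       # any function of S alone

# E[g0(W) f(S)], which involves g0 as a function of all of W.
full = np.sum(p_w * g0 * f[S])

# Reduced regression g0r(s) = E[A | S = s] = E[g0(W) | S = s].
p_s = np.bincount(S, weights=p_w)
g0r = np.bincount(S, weights=p_w * g0) / p_s

# E[g0r(S) f(S)], which only involves the reduced regression.
reduced = np.sum(p_s * g0r * f)
```

The quantities `full` and `reduced` coincide, which is why the proof can trade $\bar{g}_0$ for a regression of $A$ on the low-dimensional summaries that appear in the remaining factors.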

Proof of Theorem 4

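As in the proof of the previous theorem, we start with

$$\Psi(Q_n^*)-\Psi(Q_0)=(P_n-P_0)D^*(Q,g)+P_0(\bar{Q}_n^*-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}_n^*}+o_P(1/\sqrt{n}),$$

where we use that $D^*(Q_n^*,g_n^*)$ falls in a Donsker class with probability tending to 1, and $P_0\{D^*(Q_n^*,g_n^*)-D^*(Q,g)\}^2\to 0$ in probability as $n\to\infty$. The first term yields the first component $D^*(Q,g)$ of the influence curve of $\psi_n^*$.

As in the proof of the previous theorem, we decompose the second term as follows:

$$\begin{aligned}
P_0(\bar{Q}_n^*-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}_0}{\bar{g}_n^*} &= P_0(\bar{Q}_n^*-\bar{Q}+\bar{Q}-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}+\bar{g}-\bar{g}_0}{\bar{g}_n^*}\\
&= P_0(\bar{Q}_n^*-\bar{Q})\frac{\bar{g}_n^*-\bar{g}}{\bar{g}_n^*}+P_0(\bar{Q}_n^*-\bar{Q})\frac{\bar{g}-\bar{g}_0}{\bar{g}_n^*}+P_0(\bar{Q}-\bar{Q}_0)\frac{\bar{g}_n^*-\bar{g}}{\bar{g}_n^*}+P_0(\bar{Q}-\bar{Q}_0)\frac{\bar{g}-\bar{g}_0}{\bar{g}_n^*},
\end{aligned}$$

resulting in four terms, which we will denote by Terms 1–4. We will now analyze these four terms.

Term 1: The first term $P_0(\bar{Q}_n^*-\bar{Q})(\bar{g}_n^*-\bar{g})/\bar{g}=o_P(1/\sqrt{n})$, by assumption.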

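Term 4: Due to our assumption that $P_0(\bar{Q}-\bar{Q}_0)(\bar{g}-\bar{g}_0)/\bar{g}=0$, this last term equals

$$-P_0\frac{(\bar{Q}-\bar{Q}_0)(\bar{g}-\bar{g}_0)(\bar{g}_n^*-\bar{g})}{\bar{g}_n^*\,\bar{g}}=-P_0\frac{(\bar{Q}-\bar{Q}_0)(\bar{g}-\bar{g}_0)(\bar{g}_n^*-\bar{g})}{\bar{g}^2}+R_{1,n},$$

where, by assumption,

$$R_{1,n}=P_0\frac{(\bar{Q}-\bar{Q}_0)(\bar{g}-\bar{g}_0)(\bar{g}_n^*-\bar{g})^2}{\bar{g}^2\,\bar{g}_n^*}=o_P(1/\sqrt{n}).$$

We proceed as follows:

$$-P_0\frac{(\bar{Q}-\bar{Q}_0)(\bar{g}-\bar{g}_0)(\bar{g}_n^*-\bar{g})}{\bar{g}^2}=-P_0\frac{(\bar{Q}-\bar{Q}_0)(\bar{g}-A)(\bar{g}_n^*-\bar{g})}{\bar{g}^2}=-P_0\frac{\bar{Q}-\bar{Q}_0}{\bar{g}}(\bar{g}_n^*-\bar{g})+P_0\frac{(\bar{Q}-\bar{Q}_0)A}{\bar{g}^2}(\bar{g}_n^*-\bar{g}).$$

The first term is asymptotically equivalent to minus Term 3, which shows that Term 3 is canceled out by a component of Term 4 up to a second-order term that is $o_P(1/\sqrt{n})$, by assumption. The second term equals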

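$$\begin{aligned}
P_0\frac{(\bar{Q}-\bar{Q}_0)A}{\bar{g}^2}(\bar{g}_n^*-\bar{g}) &= -P_0\frac{(Y-\bar{Q})A}{\bar{g}^2}(\bar{g}_n^*-\bar{g})\\
&= -P_0E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g},\bar{g}_n^*)\frac{A}{\bar{g}^2}(\bar{g}_n^*-\bar{g})\\
&= -P_0E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g},\bar{g}_n^*)\frac{\bar{g}_{0,n}^r}{\bar{g}^2}(\bar{g}_n^*-\bar{g})\\
&= -P_0E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})\frac{\bar{g}_0^r}{\bar{g}^2}(\bar{g}_n^*-\bar{g})-P_0(H_1(\bar{g}_n^*)-H_1(\bar{g}))(\bar{g}_n^*-\bar{g}),
\end{aligned}$$

where $H_1(\bar{g}_n^*)\equiv E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g},\bar{g}_n^*)\,\bar{g}_{0,n}^r/\bar{g}^2$ approximates $H_1(\bar{g})=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})\,\bar{g}_0^r/\bar{g}^2$, $\bar{g}_{0,n}^r=E_{P_0}(A\mid\bar{g},\bar{g}_n^*)$, and $\bar{g}_0^r=E_{P_0}(A\mid\bar{g})$. Let $\bar{Q}_{0,n}^r=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g},\bar{g}_n^*)$ and $\bar{Q}_0^r=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$. We assumed $\|\bar{g}_{0,n}^r-\bar{g}_0^r\|_0\|\bar{g}_n^*-\bar{g}\|_0=o_P(1/\sqrt{n})$ and $\|\bar{Q}_{0,n}^r-\bar{Q}_0^r\|_0\|\bar{g}_n^*-\bar{g}\|_0=o_P(1/\sqrt{n})$, which implies that

$$R_{2,n}=P_0(H_1(\bar{g}_n^*)-H_1(\bar{g}))(\bar{g}_n^*-\bar{g})=o_P(1/\sqrt{n}).$$

By assumption, $\bar{g}(W)=E_0(A\mid W_1)$ for some $W_1$ that is a function of $W$. Therefore, $\bar{g}_0^r(W)=E_{P_0}(A\mid\bar{g}(W))=E_{P_0}(E_{P_0}(A\mid W_1)\mid\bar{g}(W))=\bar{g}(W)$. Thus, it remains to analyze

$$-P_0\frac{E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})}{\bar{g}}(\bar{g}_n^*-\bar{g}).\tag{4}$$

This term is analyzed below, where it is shown to equal

$$-(P_n-P_0)D_A(\bar{Q}_0^r,\bar{g})+o_P(1/\sqrt{n}).$$

To conclude, we have then shown that the fourth term equals the latter expression minus the third term.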

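We now analyze (4), which can be represented as $-P_0(\bar{Q}_0^r/\bar{g})(\bar{g}_n^*-\bar{g})$, where $\bar{Q}_0^r=E_{P_0}(Y-\bar{Q}\mid A=1,\bar{g})$. In this proof, we will use the notation $H_0^r=\bar{Q}_0^r/\bar{g}$, $H_n^{r*}=\bar{Q}_n^{r*}/\bar{g}_n^*$. Since $\bar{g}(W)=E_0(A\mid W_1)$ for some $W_1$, and $\bar{Q}_0^r$ is thus also a function of $W_1$, we have

$$P_0\frac{\bar{Q}_0^r}{\bar{g}}\,\bar{g}=P_0\frac{\bar{Q}_0^r}{\bar{g}}\,A.$$

We now proceed as follows:

$$\begin{aligned}
-P_0H_0^r(\bar{g}_n^*-\bar{g}) &= -P_0H_0^r(\bar{g}_n^*-A)\\
&= -P_0H_n^{r*}(\bar{g}_n^*-A)-P_0(H_0^r-H_n^{r*})(\bar{g}_n^*-A)\\
&= P_0H_n^{r*}(A-\bar{g}_n^*)-P_0(H_0^r-H_n^{r*})(\bar{g}_n^*-E_0(A\mid\bar{g},\bar{Q}_n^{r*},\bar{g}_n^*)).
\end{aligned}$$

For the second term, denoted $R_{4,n}$, we can substitute

$$\bar{g}_n^*-E_0(A\mid\bar{g},\bar{Q}_n^{r*},\bar{g}_n^*)=(\bar{g}_n^*-\bar{g})+E_0(A\mid\bar{g},\bar{Q}_0^r)-E_0(A\mid\bar{g},\bar{Q}_n^{r*},\bar{g}_n^*),$$

by noting that $\bar{g}=E_0(A\mid\bar{g},\bar{Q}_0^r)$. Thus, this second term results in two terms, one that can be bounded by $\|H_n^{r*}-H_0^r\|_0\|\bar{g}_n^*-\bar{g}\|_0$ and another that is bounded by

$$\|H_n^{r*}-H_0^r\|_0\,\|E_0(A\mid\bar{g},\bar{Q}_0^r)-E_0(A\mid\bar{g},\bar{Q}_n^{r*},\bar{g}_n^*)\|_0.$$

By assumption, both terms are $o_P(1/\sqrt{n})$ and thus $R_{4,n}=o_P(1/\sqrt{n})$.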

Since, by construction of $g_n^*$, $P_nH_n^{r*}(A-\bar{g}_n^*)=0$, the first term can be written as follows:

$$P_0H_n^{r*}(A-\bar{g}_n^*)=-(P_n-P_0)H_n^{r*}(A-\bar{g}_n^*)=-(P_n-P_0)H_0^r(A-\bar{g})+R_{5,n},$$

where $R_{5,n}=o_P(1/\sqrt{n})$ if $P_0\{D_A(\bar{Q}_n^{r*},\bar{g}_n^*)-D_A(\bar{Q}_0^r,\bar{g})\}^2=o_P(1)$ and $D_A(\bar{Q}_n^{r*},\bar{g}_n^*)$ falls in a Donsker class with probability tending to 1, and we recall that $D_A(\bar{Q}_0^r,\bar{g})=H_0^r(A-\bar{g})$. This completes the proof for the fourth term.

Term 3: Our analysis of Term 4 showed that Term 3 cancels out and thus that the sum of the third and fourth terms equals $-(P_n-P_0)D_A(\bar{Q}_0^r,\bar{g})+o_P(1/\sqrt{n})$, which yields the second component $-D_A(\bar{Q}_0^r,\bar{g})$ of the influence curve of $\psi_n^*$.

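Analysis of Term 2: Up to a second-order term that can be bounded by $\|\bar{g}_n^*-\bar{g}\|_0\|\bar{Q}_n^*-\bar{Q}\|_0=o_P(1/\sqrt{n})$, we can represent Term 2 as

$$\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}_n^*)-\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}),$$

where

$$\Psi_{2,\bar{g},\bar{g}_0}(\bar{Q}_n^*)=P_0\bar{Q}_n^*\,\frac{\bar{g}-\bar{g}_0}{\bar{g}}.$$

We have

$$\Psi_2(\bar{Q}_n^*)-\Psi_2(\bar{Q})=P_0\frac{\bar{g}-\bar{g}_0}{\bar{g}}(\bar{Q}_n^*-\bar{Q})=-P_0\frac{A-\bar{g}}{\bar{g}}(\bar{Q}_n^*-\bar{Q})=-P_0\frac{E_{P_0}(A\mid\bar{g},\bar{Q}_n^*,\bar{Q})-\bar{g}}{\bar{g}}(\bar{Q}_n^*-\bar{Q}).$$

Recall that, by our assumption, $\bar{g}=E_{P_0}(A\mid\bar{g},\bar{Q})$. Let $\bar{g}_{0,n}^r=E_{P_0}(A\mid\bar{g},\bar{Q}_n^*,\bar{Q})$. By our assumptions,

$$P_0\frac{\bar{g}_{0,n}^r-\bar{g}}{\bar{g}}(\bar{Q}_n^*-\bar{Q})=o_P(1/\sqrt{n}).\tag{5}$$

This proves that $\Psi_2(\bar{Q}_n^*)-\Psi_2(\bar{Q})=o_P(1/\sqrt{n})$. $\Box$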

Remark: Proof of additional result. In this analysis of Term 2, we assumed $\bar{g}=E_{P_0}(A\mid\bar{g},\bar{Q})$ and condition (5). Let us now provide a different type of analysis for Term 2, relying on different conditions. We have

$$\Psi_2(\bar{Q}_n^*)-\Psi_2(\bar{Q})=P_0\frac{\bar{g}-\bar{g}_0}{\bar{g}}(\bar{Q}_n^*-\bar{Q})=P_0\frac{A-\bar{g}}{\bar{g}}(\bar{Q}-\bar{Q}_n^*)=P_0\frac{\bar{g}_{0,n}^r-\bar{g}}{\bar{g}_{0,n}^r\,\bar{g}}A(Y-\bar{Q}_n^*),$$

where $\bar{g}_{0,n}^r=E_0(A\mid\bar{g},\bar{Q},\bar{Q}_n^*)$, and where the last step uses the assumption that $P_0\frac{\bar{g}_{0,n}^r-\bar{g}}{\bar{g}_{0,n}^r\,\bar{g}}A(Y-\bar{Q})=0$. The latter equality holds if, in the TMLE algorithm, we target $\bar{Q}_n^*$ with the clever covariate $H_Y(\bar{g}_n^{r*},\bar{g}_n^*)=A(\bar{g}_n^{r*}-\bar{g}_n^*)/(\bar{g}_n^{r*}\bar{g}_n^*)$, where $\bar{g}_n^{r*}$ estimates a non-parametric regression of $A$ on $\bar{Q}_n^*,\bar{g}_n^*$, exactly as in Theorem 3. Under that assumption one can now show that we obtain another influence curve component $D_Y$ defined by

$$D_Y(\bar{Q},\bar{g}_0^r,\bar{g})=\frac{\bar{g}_0^r-\bar{g}}{\bar{g}_0^r\,\bar{g}}A(Y-\bar{Q}),$$

where $\bar{g}_0^r=E_0(A\mid\bar{g},\bar{Q})$. Thus, we now have that $\Psi(Q_n^*)$ is asymptotically linear with influence curve $D^*(Q,g)-D_A(\bar{Q}_0^r,\bar{g})-D_Y(\bar{Q},\bar{g}_0^r,\bar{g})$. However, note that if $\bar{g}_0^r=\bar{g}$, i.e. if $\bar{g}=E_0(A\mid\bar{g},\bar{Q})$, then $D_Y=0$. To conclude, one can remove the condition that $\bar{g}_n^*$ needs to non-parametrically adjust for $\bar{Q}_n^*$, as arranged by the TMLE algorithm in Theorem 4, by adding the clever covariate $H_Y(\bar{g}_n^{r*},\bar{g}_n^*)$ to the submodel for $\bar{Q}_n^*$ in the TMLE algorithm, and the influence curve will then have another component $D_Y(P_0)$, as in Theorem 3. This results in a generalization of Theorem 3 which does not require that either $Q_n^*$ or $g_n^*$ is consistent, but only requires that their limits $\bar{g},\bar{Q}$ satisfy $P_0\{(\bar{g}-\bar{g}_0)/\bar{g}\}(\bar{Q}-\bar{Q}_0)=0$. Thus, this latter generalization of Theorem 3 would provide an appropriate theorem for a C-TMLE that does not enforce the non-parametric adjustment for $\bar{Q}_n^*$, but still needs to adjust for $\bar{Q}_n^*-\bar{Q}_0$.
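As a small illustration of the extra clever covariate and of the component $D_Y$ described in this remark, the sketch below codes both as plain functions and checks numerically that $D_Y$ vanishes when $\bar{g}_0^r=\bar{g}$. The function names and the simulated inputs are assumptions made for illustration only; they are not part of the article.

```python
import numpy as np

def clever_covariate_y(A, g_r, g):
    """H_Y(g_r, g)(A, W) = A * (g_r - g) / (g_r * g)."""
    return A * (g_r - g) / (g_r * g)

def d_y(Y, A, Q_bar, g_r, g):
    """D_Y(Q_bar, g_r, g) = H_Y(g_r, g) * (Y - Q_bar)."""
    return clever_covariate_y(A, g_r, g) * (Y - Q_bar)

# Hypothetical inputs evaluated at n observations.
rng = np.random.default_rng(2)
n = 500
g = rng.uniform(0.2, 0.8, size=n)     # stand-in for the limit g_bar(W)
g_r = rng.uniform(0.2, 0.8, size=n)   # stand-in for g_bar_0^r(W)
A = rng.binomial(1, 0.5, size=n)
Y = rng.normal(size=n)
Q_bar = np.zeros(n)

dy_general = d_y(Y, A, Q_bar, g_r, g)
dy_when_equal = d_y(Y, A, Q_bar, g, g)   # the case g_bar_0^r = g_bar
```

In this sketch `dy_when_equal` is identically zero, matching the statement that $D_Y=0$ when $\bar{g}=E_0(A\mid\bar{g},\bar{Q})$, and `dy_general` is zero for every observation with `A == 0`.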


Published Online: 2014-2-11

Published in Print: 2014-5-1
