Abstract
In order to obtain concrete results, we focus on estimation of the treatment specific mean, controlling for all measured baseline covariates, based on observing independent and identically distributed copies of a random variable consisting of baseline covariates, a subsequently assigned binary treatment, and a final outcome. The statistical model only assumes possible restrictions on the conditional distribution of treatment, given the covariates, the so-called propensity score. Estimators of the treatment specific mean involve estimation of the propensity score and/or estimation of the conditional mean of the outcome, given the treatment and covariates. In order to make these estimators asymptotically unbiased at any data distribution in the statistical model, it is essential to use data-adaptive estimators of these nuisance parameters such as ensemble learning, and specifically super-learning. Because such estimators involve optimal trade-off of bias and variance w.r.t. the infinite dimensional nuisance parameter itself, they result in a sub-optimal bias/variance trade-off for the resulting real-valued estimator of the estimand. We demonstrate that additional targeting of the estimators of these nuisance parameters guarantees that this bias for the estimand is second order and thereby allows us to prove theorems that establish asymptotic linearity of the estimator of the treatment specific mean under regularity conditions. These insights result in novel targeted minimum loss-based estimators (TMLEs) that use ensemble learning with additional targeted bias reduction to construct estimators of the nuisance parameters. In particular, we construct collaborative TMLEs (C-TMLEs) with known influence curve allowing for statistical inference, even though these C-TMLEs involve variable selection for the propensity score based on a criterion that measures how effective the resulting fit of the propensity score is in removing bias for the estimand. 
As a particular special case, we also demonstrate the required targeting of the propensity score for the inverse probability of treatment weighted estimator using super-learning to fit the propensity score.
1 Introduction and overview
This introduction provides an atlas for the contents of this article. It starts with formulating the role of estimation of nuisance parameters to obtain asymptotically linear estimators of a target parameter of interest. This demonstrates the need to target this estimator of the nuisance parameter in order to make the estimator of the target parameter asymptotically linear when the model for the nuisance parameter is large. The general approach to obtain such a targeted estimator of the nuisance parameter is described. Subsequently, we present our concrete example to which we will apply this general method for targeted estimation of the nuisance parameter, and for which we establish a number of formal theorems. Finally, we discuss the link to previous articles that concerned some kind of targeting of the estimator of the nuisance parameter, and we provide an organization of the remainder of the article.
1.1 The role of nuisance parameter estimation
Suppose we observe n independent and identically distributed copies of a random variable O with probability distribution
The empirical mean of the influence curve
Suppose that
The latter is shown as follows. By the property of the canonical gradient (in fact, any gradient) we have
The first term is an empirical process term that, under empirical process conditions (mentioned below), equals
To obtain the desired asymptotic linearity of
1.2 Targeting the fit of the nuisance parameter: general approach
In this article, we demonstrate that if
The current article concerns the construction of such targeted IPTW and TMLE that are asymptotically linear under regularity conditions, even when only one of the nuisance parameters is consistent and the estimators of the nuisance parameters are highly data adaptive. In order to be concrete in this article, we will focus on a particular example. In such an example we can concretely present the second-order term
The same approach for construction of such TMLE can be carried out in much greater generality, but that is beyond the scope of this article. Nonetheless, it is helpful for the reader to know that the general approach is the following (considering the case that
1.3 Concrete example covered in this article
Let us now formulate the concrete example covered in this article. Let
For this particular example, such TMLE are presented in Scharfstein et al. [17]; van der Laan and Rubin [7]; Bembom et al. [18–21]; Rosenblum and van der Laan [22]; Sekhon et al. [23]; van der Laan and Rose [6, 24]. Since
The first term equals
However, if only one of these nuisance parameter estimators is consistent, then the second term is still a first-order term, and it remains to establish that it is also asymptotically linear with a second-order remainder. For the sake of discussion, suppose that
In this article, we present TMLE that targets
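As a worked illustration of the expansion discussed in this subsection, the following block writes out the standard identity for the treatment specific mean. The notation is an assumption made for this illustration and may differ in detail from the displays of this article: \bar{Q}(W) denotes the outcome regression E(Y | A=1, W), g(W) the propensity score P(A=1 | W), \Psi(Q) the mean of \bar{Q}(W) under the covariate distribution of Q, and a subscript 0 marks the true data distribution.

```latex
\begin{align*}
D^*(Q,g)(O) &= \frac{A}{g(W)}\bigl(Y-\bar{Q}(W)\bigr) + \bar{Q}(W) - \Psi(Q),\\
P_0 D^*(Q,g) &= \Psi(Q_0)-\Psi(Q) + P_0\,\frac{g_0-g}{g}\,\bigl(\bar{Q}_0-\bar{Q}\bigr),\\
\Psi(Q)-\Psi(Q_0) &= -P_0 D^*(Q,g) + P_0\,\frac{(g-g_0)\bigl(\bar{Q}-\bar{Q}_0\bigr)}{g}.
\end{align*}
```

The last term is a product of the errors in the two nuisance parameters. It is therefore second order when both estimators are consistent, but it remains a first-order term when only one of them is.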
1.4 Relation to current literature on targeted nuisance parameter estimators
The construction of TMLE that utilizes targeting of the nuisance parameter
The TMLEs presented in this article are always iterative and thereby rely on convergence of the iterative updating algorithm. Since the empirical risk decreases at each updating step, such convergence is typically guaranteed by the existence of the MLE at each updating step (e.g. an MLE of a coefficient in a logistic regression). Either way, in this article, we assume this convergence to hold. Since the assumptions of our theorems require
1.5 Organization
The organization of this paper is as follows. In Section 2, we introduce a targeted IPTW-estimator that relies on an adaptive consistent estimator of
In Section 5, we extend the TMLE of Section 3 (that relies on
1.6 Notation
In the following sections, we will use the following notation. We have
2 Statistical inference for IPTW-estimator when using super-learning to fit treatment mechanism
We first describe an IPTW-estimator that uses super-learning to fit the treatment mechanism
2.1 An IPTW-estimator using super-learning to fit the treatment mechanism
We consider a simple IPTW-estimator
as the choice of estimator that minimizes cross-validated risk. The super-learner of
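To make the cross-validation selector concrete, the following Python sketch picks, from a small library of candidate propensity score estimators, the one with the smallest V-fold cross-validated log-likelihood risk. It is a minimal illustration under stated assumptions, not the implementation used in this article: the two-candidate library, the helper names (fit_logistic, cv_selector, intercept_only, main_terms) and the simulated data are hypothetical, and it returns a single selected candidate (a discrete super-learner) rather than a weighted combination of candidates.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, a, n_iter=25):
    """Logistic-regression MLE by Newton-Raphson (X includes the intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (a - p))
    return beta

def neg_log_lik(a, p):
    """Log-likelihood loss for a binary treatment indicator."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -(a * np.log(p) + (1 - a) * np.log(1 - p))

# Hypothetical library: each candidate maps covariates W to a design matrix.
def intercept_only(W):
    return np.ones((len(W), 1))

def main_terms(W):
    return np.column_stack([np.ones(len(W)), W])

def cv_selector(library, W, A, n_folds=5, seed=0):
    """Return the index of the candidate with minimal cross-validated risk."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(A)) % n_folds
    cv_risk = []
    for design in library:
        losses = []
        for v in range(n_folds):
            train, valid = fold != v, fold == v
            beta = fit_logistic(design(W[train]), A[train])
            pred = expit(design(W[valid]) @ beta)
            losses.append(neg_log_lik(A[valid], pred).mean())
        cv_risk.append(float(np.mean(losses)))
    return int(np.argmin(cv_risk)), cv_risk

# Simulated data in which treatment depends on the first covariate.
rng = np.random.default_rng(1)
W = rng.normal(size=(2000, 2))
A = rng.binomial(1, expit(1.5 * W[:, 0]))
best, cv_risk = cv_selector([intercept_only, main_terms], W, A)
print(best, cv_risk)
```

In this simulation the main-terms candidate is expected to win, because treatment assignment truly depends on a covariate that the intercept-only candidate ignores.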
2.2 Asymptotic linearity of a targeted data-adaptive IPTW-estimator
The next theorem presents an IPTW-estimator that uses a targeted fit
Theorem 1. We consider a targeted IPTW-estimator
Definition of targeted estimator
We define
Empirical process condition: Assume that
Negligibility of second-order terms: Define
Then,
where
So under the conditions of this theorem, we can construct an asymptotic 0.95-confidence interval
and
Regarding the displayed second-order term conditions, we note that these are satisfied if
Regarding the empirical process condition, we note that an example of a Donsker class is the class of multivariate real-valued functions with uniform sectional variation norm bounded by a universal constant [44]. It is important to note that if each estimator in the library falls in such a class, then also the convex combinations fall in that same class [4]. So this Donsker condition will hold if it holds for each of the candidate estimators in the library of the super-learner.
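The influence-curve-based confidence interval discussed in this section can be illustrated with a short Python sketch. It computes a plain IPTW estimate of the treatment specific mean and a Wald-type interval whose standard error is the sample standard deviation of an estimated influence curve divided by the square root of n. This is only a sketch under stated assumptions: the propensity score is fitted with a main-terms logistic regression instead of a super-learner, the influence curve treats the fitted propensity score as if it were known and therefore omits the additional term that appears in Theorem 1, and the function names and simulated data are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, a, n_iter=25):
    """Logistic-regression MLE by Newton-Raphson (X includes the intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (a - p))
    return beta

def iptw_with_ci(W, A, Y, z=1.96):
    """IPTW estimate of E[Y_1] with a Wald-type interval from an estimated influence curve."""
    n = len(A)
    X = np.column_stack([np.ones(n), W])
    g1 = expit(X @ fit_logistic(X, A))      # fitted P(A = 1 | W)
    psi = np.mean(A * Y / g1)               # IPTW estimate
    ic = A * Y / g1 - psi                   # influence curve treating g as known (assumption)
    se = ic.std(ddof=1) / np.sqrt(n)
    return psi, (psi - z * se, psi + z * se)

# Simulated data with true treatment specific mean E[Y_1] = 2.
rng = np.random.default_rng(0)
n = 5000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, expit(0.5 * W[:, 0]))
Y = 1.0 + W[:, 0] + A + rng.normal(size=n)
psi, ci = iptw_with_ci(W, A, Y)
print(psi, ci)
```

The interval costs nothing beyond the estimate itself, because the influence curve values are computed from quantities that are already available.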
2.3 Comparison of targeted data-adaptive IPTW and an IPTW using parametric model
Consider an IPTW-estimator using a MLE
The parametric IPTW-estimator is asymptotically linear with influence curve
For example, if the parametric model happens to have a score equal to
If, on the other hand, the parametric model is misspecified, then the IPTW-estimator using
3 Statistical inference for TMLE when using super-learning to consistently fit treatment mechanism
In the next subsection, we present a TMLE that targets the fit of the treatment mechanism, analogous to the targeted IPTW-estimator presented above. In addition, this subsection presents a formal asymptotic linearity theorem demonstrating that this TMLE will be asymptotically linear even when
3.1 Asymptotic linearity of a TMLE using a targeted estimator of the treatment mechanism
The following theorem presents a novel TMLE and corresponding asymptotic linearity with specified influence curve, where we rely on consistent estimation of
Theorem 2
Iterative targeted MLE of
Definitions: Given
Initialization: Let
Updating step for
We define
Updating step for
Iterating till convergence: Now, set
Plug-in estimator: Let
Estimating equations solved by TMLE: This TMLE
Empirical process condition: Assume that
Negligibility of second-order terms: Define
where
Then,
where
Thus, under the assumptions of this theorem, an asymptotic 0.95-confidence interval is given by
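For orientation, the following Python sketch shows the standard, non-iterated TMLE update of the outcome regression for a binary outcome, followed by the plug-in estimate and a Wald-type interval based on the usual efficient influence curve. It deliberately does not implement the additional targeting of the treatment mechanism described in Theorem 2, and its interval is shown only to illustrate the mechanics. The parametric initial fits, the function names and the simulated data are assumptions made for this illustration. The initial outcome regression omits the covariates on purpose, so that consistency rests on the fit of the treatment mechanism.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, n_iter=25):
    """Logistic-regression MLE by Newton-Raphson (X includes the intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (y - p))
    return beta

def tmle_treated_mean(W, A, Y, z=1.96):
    n = len(A)
    Xg = np.column_stack([np.ones(n), W])
    g1 = expit(Xg @ fit_logistic(Xg, A))        # fitted P(A = 1 | W)
    XQ = np.column_stack([np.ones(n), A])       # initial outcome fit ignores W (misspecified)
    bQ = fit_logistic(XQ, Y)
    Q_obs = expit(XQ @ bQ)                      # initial fit at the observed treatment
    Q1 = np.full(n, expit(bQ[0] + bQ[1]))       # initial fit at A = 1
    H = A / g1                                  # clever covariate
    eps = 0.0
    for _ in range(50):                         # Newton steps for the fluctuation parameter
        p = expit(logit(Q_obs) + eps * H)
        eps += np.sum(H * (Y - p)) / np.sum(H**2 * p * (1 - p))
    Q1_star = expit(logit(Q1) + eps / g1)       # updated outcome regression at A = 1
    psi = Q1_star.mean()                        # plug-in estimator
    score = np.mean(A / g1 * (Y - Q1_star))     # should be numerically zero after the update
    ic = A / g1 * (Y - Q1_star) + Q1_star - psi
    se = ic.std(ddof=1) / np.sqrt(n)
    return psi, (psi - z * se, psi + z * se), score

rng = np.random.default_rng(3)
n = 5000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, expit(0.5 * W[:, 0]))
Y = rng.binomial(1, expit(-0.5 + W[:, 0] + A))
psi, ci, score = tmle_treated_mean(W, A, Y)
print(psi, ci, score)
```

The returned score is the empirical mean of the outcome-regression component of the efficient influence curve, which the fluctuation step is designed to set to zero.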
3.2 Using a δ-specific submodel for targeting g that guarantees the positivity condition
The following is an application of the constrained logistic regression approach of the type presented in Gruber and van der Laan [19] for the purpose of estimation of
The MLE is simply obtained with logistic regression of
where
is the quasi-log-likelihood loss. The update
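One way to realize a bounded fluctuation of this kind is sketched below in Python. It is an illustration under stated assumptions, not the exact submodel of this article: the fitted treatment probability is kept in the interval [δ, 1] by rescaling it to the unit interval, fluctuating on the logit scale along a covariate H, and mapping back. The fluctuation parameter is obtained by Newton steps on the quasi-log-likelihood with the rescaled pseudo-outcome (A − δ)/(1 − δ). The function name bounded_update, the choice H = W and the simulated data are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def bounded_update(g_init, H, A, delta, n_iter=50):
    """Fluctuate g_init along H while keeping the update inside [delta, 1]."""
    offset = logit((g_init - delta) / (1.0 - delta))   # initial fit on the rescaled logit scale
    a_star = (A - delta) / (1.0 - delta)               # rescaled pseudo-outcome
    eps = 0.0
    for _ in range(n_iter):                            # Newton steps for the quasi-log-likelihood
        p = expit(offset + eps * H)
        eps += np.sum(H * (a_star - p)) / np.sum(H**2 * p * (1.0 - p))
    g_eps = delta + (1.0 - delta) * expit(offset + eps * H)
    return eps, g_eps

rng = np.random.default_rng(2)
n, delta = 4000, 0.1
W = rng.uniform(-1.0, 1.0, size=n)
A = rng.binomial(1, expit(0.8 * W))
g_init = np.full(n, A.mean())        # deliberately crude initial fit
H = W                                # hypothetical fluctuation covariate
eps, g_eps = bounded_update(g_init, H, A, delta)
score = np.mean(H * (A - g_eps))     # score equation targeted by the update
print(eps, g_eps.min(), g_eps.max(), score)
```

Setting the derivative of the quasi-log-likelihood to zero is algebraically the same as solving the score equation mean(H (A − g_eps)) = 0, so the update solves the same equation as an unconstrained logistic fluctuation would, while the fitted probabilities stay above δ by construction.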
4 Double robust statistical inference for TMLE when using super-learning to fit outcome regression and treatment mechanism
In this section, our aim is to present a TMLE that is asymptotically linear with known influence curve if either
Theorem 3
Definitions: For any given
Iterative targeted MLE of
Initialization: Let
Updating step: Consider the submodel
Define the submodel
Let
We define
Iterate till convergence: Now, set
where
Final substitution estimator: Denote the limits of this iterative procedure with
Equations solved by TMLE:
Empirical process condition: Assume that
Negligibility of second-order terms: Define
Then,
where
Note that consistent estimation of the influence curve
If
As shown in the final remark of the Appendix, the condition of Theorem 3 that either
5 Collaborative double robust inference for C-TMLE when using super-learning to fit outcome regression and reduced treatment mechanism
We first review the theoretical underpinning for collaborative estimation of nuisance parameters, in this case, the outcome regression and treatment mechanism. Subsequently, we explain that the desired collaborative estimation can be achieved by applying the previously established template for construction of a C-TMLE to a TMLE that solves certain estimating equations when given an initial estimator of
5.1 Motivation and theoretical underpinning of collaborative double robust estimation of nuisance parameters
We note that
Let
Lemma 1 (van der Laan and Gruber [33]) If
We note that
5.2 C-TMLE
The general C-TMLE introduced in van der Laan and Gruber [33] provides a template for construction of a TMLE
The general C-TMLE has been implemented and applied to point treatment and longitudinal data [20, 29–33, 35]. A C-TMLE algorithm relies on a TMLE algorithm that maps an initial
5.3 A TMLE that allows for collaborative double robust inference
Our next theorem presents a TMLE algorithm and a corresponding influence curve under the assumption that the propensity score correctly adjusts for the possibly misspecified
Theorem 4
Definitions: For any given
“Score” equations the TMLE should solve: Below, we describe an iterative TMLE algorithm that results in estimators
Iterative targeted MLE of
Initialization: Let
Let
Updating step: Consider the submodel
Define the submodel
Iterating till convergence: Now, set
Final substitution estimator: Denote these limits (in k) of this iterative procedure with
Assumption on limits
Empirical process condition: Assume that
Negligibility of second-order terms: Define
Assume that the following conditions hold for each of the following possible definitions of
We assume
Then,
where
Thus, consistency of this TMLE relies upon the consistency of
It is also interesting to note that the algebraic form of the influence curve of this TMLE is identical to the influence curve of the TMLE of Theorem 2 that relied on
5.4 A C-TMLE algorithm
The TMLE algorithm presented in Theorem 4 maps an initial estimator
First, we compute a set of K univariate covariates
The general template of a C-TMLE algorithm is the following: given a TMLE algorithm that maps any initial
In order to present a precise C-TMLE algorithm we will first introduce some notation. For a given subset of main terms
Given a set
where we remind the reader of the definition
The C-TMLE algorithm defined below generates a sequence
Initiate algorithm: Set initial TMLE. Let
Determine next TMLE. Determine the next best main term to add:
If
then
[In words: If the next best main term added to the fit of
Iterate. Run this from
This sequence of candidate TMLEs
Fast version of above C-TMLE: We could carry out the above C-TMLE algorithm but replacing the TMLE that maps an initial
Statistical inference for C-TMLE: Let
The asymptotic variance of
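The greedy construction of candidate fits of the treatment mechanism can be sketched in Python as follows. This is a simplified illustration, not the C-TMLE algorithm of this section: at each step it adds the covariate whose inclusion in a logistic fit of the treatment mechanism gives the smallest empirical log-likelihood loss of the correspondingly updated outcome regression. It omits the re-initialization of the outcome regression when no addition improves the fit, the cross-validation choice among the candidates, and the additional targeting required by Theorem 4. The raw columns of W stand in for the K univariate covariates, and all function names and the simulated data are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, n_iter=25):
    """Logistic-regression MLE by Newton-Raphson (X includes the intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (y - p))
    return beta

def tmle_loss(Q_obs, g1, A, Y):
    """Empirical log-likelihood loss of Q after one fluctuation along H = A / g1."""
    H = A / g1
    eps = 0.0
    for _ in range(50):
        p = expit(logit(Q_obs) + eps * H)
        eps += np.sum(H * (Y - p)) / np.sum(H**2 * p * (1 - p))
    p = np.clip(expit(logit(Q_obs) + eps * H), 1e-9, 1 - 1e-9)
    return float(-np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p)))

def greedy_path(W, A, Y, Q_obs):
    """Forward selection of covariates for g, scored by the loss of the updated Q."""
    n, K = W.shape
    selected, path = [], []
    while len(selected) < K:
        scores = {}
        for j in range(K):
            if j in selected:
                continue
            X = np.column_stack([np.ones(n), W[:, selected + [j]]])
            g1 = np.clip(expit(X @ fit_logistic(X, A)), 0.025, 1.0)
            scores[j] = tmle_loss(Q_obs, g1, A, Y)
        best = min(scores, key=scores.get)
        selected.append(best)
        path.append((best, scores[best]))
    return path

# Simulated data: only the first covariate affects both treatment and outcome.
rng = np.random.default_rng(4)
n = 3000
W = rng.uniform(-1.5, 1.5, size=(n, 3))
A = rng.binomial(1, expit(W[:, 0]))
Y = rng.binomial(1, expit(-0.5 + 1.5 * W[:, 0] + A))
XQ = np.column_stack([np.ones(n), A])
Q_obs = expit(XQ @ fit_logistic(XQ, Y))   # initial outcome fit that ignores W
path = greedy_path(W, A, Y, Q_obs)
order = [j for j, _ in path]
print(order, [round(loss, 4) for _, loss in path])
```

Because the selection criterion is the fit of the outcome regression rather than the fit of the treatment mechanism, the confounder is expected to enter first in this simulation, while covariates unrelated to the outcome offer almost no gain.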
6 Discussion
Targeted minimum loss-based estimation allows us to construct plug-in estimators
However, we noted that this level of targeting is insufficient if one only relies on consistency of
In this article we also took this additional targeting a step further by demonstrating how it allows for double robust statistical inference, and that even if we estimate the nuisance parameter in a complicated manner that is based on a criterion that cares about how it helps the estimator to fit
It remains to evaluate the practical benefit of the modifications of IPTW, TMLE, and C-TMLE as presented in this article for both estimation and assessment of uncertainty. We plan to address this in future research.
Even though we focussed in this article on a particular concrete estimation problem, TMLE is a general tool and our TMLE and theorems can be generalized to general statistical models and path-wise differentiable statistical target parameters.
We note that this targeting of nuisance parameter estimators in the TMLE is not only necessary to obtain a known influence curve but also necessary to make the TMLE asymptotically linear. It therefore does not suffice to run a bootstrap as an alternative to influence-curve-based inference, since the bootstrap can only work if the estimator is asymptotically linear, so that it has a limit distribution. In addition, the established asymptotic linearity with known influence curve has the important by-product that one obtains statistical inference at no extra computational cost. This is particularly important in these large semi-parametric models that require aggressive machine learning methods in order to cover the model space, making the estimators by necessity very computer intensive, so that a (disputable) bootstrap method might simply be too computationally expensive.
Acknowledgments
This research was supported by an NIH grant R01 AI074345-06. The author is grateful for the excellent, helpful, and insightful comments of the reviewers.
Appendix
Proof of Theorem 1
To start with we note:
The first term of this decomposition yields the first component
By our assumptions, the last term
So it remains to study:
Note that this equals
Lemma 2. Define
Then,
Proof of Lemma 2: Note that
Since we assumed
The next step of the proof is the following series of equalities
where, by assumption,
Thus, we have
from which we deduce that, by Lemma 2 and
where we defined
By our assumptions,
Proof of Theorem 2
One easily checks that
because
The first term A equals
where
where
By our assumptions, the second term above is
The estimator
We have
where
where we defined
We have that
Proof of Theorem 3
As outlined in Section 1, we have
if
It suffices to analyze the second term. Initially, we note that
where
By assumption,
Now, we note
By our assumptions, the first term
So it suffices to analyze the second and third terms of this last expression. In order to represent the second and third terms we define
The sum of the second and third terms can now be represented as:
For notational convenience, we will suppress the dependence of these mappings on the unknown quantities, and thus use
Analysis of
By our assumptions,
so that it remains to analyze
where, by our assumptions,
In addition,
where
Analysis of
Here we used that
where we assumed that
in probability. This proves
Proof of Theorem 4
As in the proof of the previous theorem, we start with
where we use that
As in the proof of the previous theorem, we decompose this second term as follows:
resulting in four terms, which we will denote with Terms 1–4. We will now analyze these four terms.
Term 1: The first term
Term 4: Due to our assumption that
where, by assumption,
We proceed as follows:
The first term is asymptotically equivalent with minus Term 3, which shows that Term 3 is canceled out by a component of Term 4 up to a second-order term that is
where
By assumption,
This term is analyzed below and it is shown that this term equals
To conclude, we have then shown that the fourth term equals the latter expression minus the third term.
We now analyze (4) which can be represented as
We now proceed as follows:
For the second term
by noting that
By assumption, both terms are
Since, by construction of
where
Term 3: Our analysis of Term 4 showed that Term 3 cancels out and thus that the sum of the third and fourth terms equals
Analysis of Term 2: Up to a second-order term that can be bounded by
where
We have
Recall that, by our assumption,
This proves that
Remark: Proof of additional result. In this analysis of Term 2, we assumed
where
where
References
1. Bickel PJ, Klaassen CA, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. Springer-Verlag, 1997.
2. Gill RD. Non- and semiparametric maximum likelihood estimators and the von Mises method (part 1). Scand J Stat 1989;16:97–128.
3. Gill RD, van der Laan MJ, Wellner JA. Inefficient estimators of the bivariate survival function for three models. Ann Inst Henri Poincaré 1995;31:545–97.
4. van der Vaart AW, Wellner JA. Weak convergence and empirical processes. New York: Springer-Verlag, 1996. DOI: 10.1007/978-1-4757-2545-2.
5. van der Laan MJ. Estimation based on case-control designs with known prevalence probability. Int J Biostat 2008. Available at: http://www.bepress.com/ijb/vol4/iss1/17/. DOI: 10.2202/1557-4679.1114.
6. van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2012.
7. van der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat 2006;20. DOI: 10.2202/1557-4679.1043.
8. van der Laan MJ, Dudoit S. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical report, Division of Biostatistics, University of California, Berkeley, CA, November 2003.
9. van der Laan MJ, Polley E, Hubbard A. Super learner. Stat Appl Genet Mol Biol 2007;6:Article 25. DOI: 10.2202/1544-6115.1309.
10. van der Vaart AW, Dudoit S, van der Laan MJ. Oracle inequalities for multi-fold cross-validation. Stat Decis 2006;24:351–71. DOI: 10.1524/stnd.2006.24.3.351.
11. Robins JM, Rotnitzky A. Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS epidemiology. Methodological issues. Basel: Birkhäuser, 1992:297–331. DOI: 10.1007/978-1-4757-1229-2_14.
12. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. J Am Stat Assoc 1995;900:122–9. DOI: 10.1080/01621459.1995.10476494.
13. van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer-Verlag, 2003. DOI: 10.1007/978-0-387-21700-0.
14. Robins JM, Rotnitzky A, van der Laan MJ. Comment on “on profile likelihood” by S.A. Murphy and A.W. van der Vaart. J Am Stat Assoc – Theory Methods 2000;450:431–5.
15. Robins JM. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, 2000.
16. Robins JM, Rotnitzky A. Comment on the Bickel and Kwon article, “inference for semiparametric models: some questions and an answer”. Stat Sin 2001;110:920–36.
17. Scharfstein DO, Rotnitzky A, Robins JM. Adjusting for non-ignorable drop-out using semiparametric nonresponse models, (with discussion and rejoinder). J Am Stat Assoc 1999;940:1096–120 (1121–46).
18. Bembom O, Petersen ML, Rhee S-Y, Fessel WJ, Sinisi SE, Shafer RW, et al. Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection. Stat Med 2009;28:152–72. DOI: 10.1002/sim.3414.
19. Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Int J Biostat 2010;6:Article 26. Available at: www.bepress.com/ijb/vol6/iss1/26. DOI: 10.2202/1557-4679.1260.
20. Gruber S, van der Laan MJ. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat 2010;60. DOI: 10.2202/1557-4679.1182.
21. Gruber S, van der Laan MJ. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. Technical Report 265, UC Berkeley, CA, 2010. DOI: 10.2202/1557-4679.1260.
22. Rosenblum M, van der Laan MJ. Targeted maximum likelihood estimation of the parameter of a marginal structural model. Int J Biostat 2010;60. DOI: 10.2202/1557-4679.1238.
23. Sekhon JS, Gruber S, Porter K, van der Laan MJ. Propensity-score-based estimators and C-TMLE. In: MJ van der Laan and S Rose, editors. Targeted learning: prediction and causal inference for observational and experimental data, chapter 21. New York: Springer, 2011.
24. Gruber S, van der Laan MJ. Targeted minimum loss based estimation of a causal effect on an outcome with known conditional bounds. Int J Biostat 2012;8. DOI: 10.1515/1557-4679.1413.
25. Zheng W, van der Laan MJ. Asymptotic theory for cross-validated targeted maximum likelihood estimation. Technical Report 273, Division of Biostatistics, University of California, Berkeley, CA, 2010. DOI: 10.2202/1557-4679.1181.
26. Zheng W, van der Laan MJ. Cross-validated targeted minimum loss based estimation. In: MJ van der Laan and S Rose, editors. Targeted learning: causal inference for observational and experimental data, chapter 21. New York: Springer, 2011:459–74.
27. van der Vaart AW. Asymptotic statistics. New York: Cambridge University Press, 1998.
28. Rotnitzky A, Lei Q, Sued M, Robins J. Improved double-robust estimation in missing data and causal inference models. Biometrika 2012;99:439–56. DOI: 10.1093/biomet/ass013.
29. Gruber S, van der Laan MJ. Targeted minimum loss based estimator that outperforms a given estimator. Int J Biostat 2012;80:Article 11. DOI: 10.1515/1557-4679.1332.
30. Gruber S, van der Laan MJ. Marginal structural models. In: MJ van der Laan and S Rose, editors. C-TMLE of an additive point treatment effect, chapter 19. New York: Springer, 2011.
31. Porter KE, Gruber S, van der Laan MJ, Sekhon JS. The relative performance of targeted maximum likelihood estimators. Int J Biostat 2011;70:1–34. DOI: 10.2202/1557-4679.1308.
32. Stitelman OM, van der Laan MJ. Collaborative targeted maximum likelihood for time to event data. Int J Biostat 2010:Article 21. DOI: 10.2202/1557-4679.1249.
33. van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. Int J Biostat 2010;60. DOI: 10.2202/1557-4679.1181.
34. van der Laan MJ, Rose S. Targeted learning: prediction and causal inference for observational and experimental data. New York: Springer, 2011. DOI: 10.1007/978-1-4419-9782-1.
35. Wang H, Rose S, van der Laan MJ. Finding quantitative trait loci genes. In: MJ van der Laan and S Rose, editors. Targeted learning: causal inference for observational and experimental data, chapter 23. New York: Springer, 2011.
36. Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000;110:561–70. DOI: 10.1097/00001648-200009000-00012.
37. Györfi L, Kohler M, Krzyżak A, Walk H. A distribution-free theory of nonparametric regression. New York: Springer-Verlag, 2002.
38. van der Laan MJ, Dudoit S, van der Vaart AW. The cross-validated adaptive epsilon-net estimator. Stat Decis 2006;24:373–95. DOI: 10.1524/stnd.2006.24.3.373.
39. van der Laan MJ, Dudoit S, Keles S. Asymptotic optimality of likelihood-based cross-validation. Stat Appl Genet Mol Biol 2004;3:Article 4. DOI: 10.2202/1544-6115.1036.
40. Dudoit S, van der Laan MJ. Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat Methodol 2005;20:131–54. DOI: 10.1016/j.stamet.2005.02.003.
41. Polley EC, Rose S, van der Laan MJ. Super learning. In: MJ van der Laan and S Rose, editors. Targeted learning: causal inference for observational and experimental data, chapter 3. New York: Springer, 2011.
42. Polley EC, van der Laan MJ. Super learner in prediction. Technical report 200. Division of Biostatistics, UC Berkeley, Working Paper Series, 2010.
43. van der Laan MJ, Petersen ML. Targeted learning. In: Zhang C, Ma Y, editors. Ensemble machine learning. New York: Springer, 2012:117–56. ISBN 978-1-4419-9326-7.
44. van der Laan MJ. Efficient and inefficient estimation in semiparametric models. Center for Mathematics and Computer Science, CWI-tract 114. 1996. DOI: 10.1214/aos/1032894470.
45. Lee BK, Lessler J, Stuart EA. Improved propensity score weighting using machine learning. Stat Med 2009;29:337–46. DOI: 10.1002/sim.3782.
46. Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology 2009;20:512–22. DOI: 10.1097/EDE.0b013e3181a663cc.
47. Vansteelandt S, Bekaert M, Claeskens G. On model selection and model misspecification in causal inference. Stat Methods Med Res 2010;21:7–30. DOI: 10.1177/0962280210387717.
48. Westreich D, Cole SR, Funk MJ, Brookhart MA, Sturmer T. The role of the c-statistic in variable selection for propensity scores. Pharmacoepidemiol Drug Saf 2011;20:317–20. DOI: 10.1002/pds.2074.
49. van der Laan MJ, Gruber S. Collaborative double robust penalized targeted maximum likelihood estimation. Int J Biostat 2009;6. DOI: 10.2202/1557-4679.1181.
© 2014 by Walter de Gruyter Berlin / Boston