Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling
Introduction
Endogeneity of regression predictors is a common problem in many areas of applied economics, including health economics and health services research, as these fields rely heavily on observational data. Endogeneity arises owing to problems such as omitted confounder variables, simultaneity between a predictor and the outcome, and errors in regression covariates. Instrumental variables (IV) methods form a common body of approaches to handling such endogeneity. The theoretical and methodological literature guiding the use of IVs in linear regression models is large and serves as the basis for most practitioners’ understanding of the assumptions and implementation of IV models. In many health economics and health services research problems, however, linear regression models are now being replaced by nonlinear regression models, including generalized linear models, as these models are often more appropriate for limited-dependent variables, count variables and skewed distributions such as healthcare costs.
Despite the growing use and appreciation of nonlinear models among empirical researchers in health economics and health services research, there appears to be some confusion surrounding the application of IV methods in the context of these models. The goal of the present paper is to address this concern while at the same time unifying earlier results under a common nonlinear modeling framework. We carefully examine two IV-based approaches to correcting for endogeneity bias in nonlinear models, two-stage residual inclusion (2SRI) and two-stage predictor substitution (2SPS), focusing especially on a class of nonlinear models that have been widely exploited in empirical health economics and health services research. We show the consistency of the 2SRI estimator in this class of models and reemphasize the inconsistency of the alternative 2SPS approach. Our goal is to demonstrate the superiority of the 2SRI method, to guide applied researchers in carrying out 2SRI estimation when they are trying to address endogeneity in nonlinear models, and to help them understand why they should steer away from the popular 2SPS approach.
2SPS is the rote extension to nonlinear models of the popular linear two-stage least squares (2SLS) estimator. In the first stage of 2SPS, auxiliary (reduced form) regressions are estimated, and the results are used to generate predicted values for the endogenous variables. The second-stage regression is then conducted for the outcome equation of interest after replacing the endogenous variables with their predicted values. The 2SRI estimator has the same first stage as 2SPS. In the second-stage regression, however, the endogenous variables are not replaced. Instead, the first-stage residuals are included as additional regressors in second-stage estimation. This method was first suggested by Hausman (1978) in the linear context as a means of testing for endogeneity. We focus on these two methods because both have been applied in empirical studies in health economics and health services research. Indeed, both can be easily implemented using any modern statistical software package.
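The two procedures described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulation design: it assumes a logistic conditional mean, a linear first stage with a single instrument `z`, no additional exogenous regressors beyond an intercept, and hypothetical parameter values (a true coefficient of 1.0 on the endogenous regressor). The function names `fit_ols`, `fit_logit`, `simulate`, and `two_stage` are made up for this example.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_logit(X, y, iters=25):
    """Logit maximum likelihood via Newton-Raphson (illustrative, no safeguards)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def simulate(n, rng, beta_e=1.0, beta_u=1.0):
    """Assumed design: xe = z + xu, P(y=1) = logistic(-0.5 + beta_e*xe + beta_u*xu)."""
    z = rng.normal(size=n)    # instrument
    xu = rng.normal(size=n)   # unobservable confounder
    xe = z + xu               # endogenous regressor (linear auxiliary equation)
    index = -0.5 + beta_e * xe + beta_u * xu
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-index))).astype(float)
    return y, xe, z

def two_stage(y, xe, z):
    """Return the 2SPS and 2SRI estimates of the coefficient on xe."""
    ones = np.ones_like(z)
    W = np.column_stack([ones, z])
    xe_hat = W @ fit_ols(W, xe)   # first stage: predicted values
    resid = xe - xe_hat           # first stage: residuals
    # 2SPS: replace xe with its first-stage prediction.
    b_2sps = fit_logit(np.column_stack([ones, xe_hat]), y)[1]
    # 2SRI: keep xe and add the first-stage residual as an extra regressor.
    b_2sri = fit_logit(np.column_stack([ones, xe, resid]), y)[1]
    return b_2sps, b_2sri

rng = np.random.default_rng(0)
y, xe, z = simulate(50_000, rng)
b_2sps, b_2sri = two_stage(y, xe, z)
print(f"true = 1.0, 2SPS = {b_2sps:.3f}, 2SRI = {b_2sri:.3f}")
```

Under these assumptions the 2SRI estimate should land close to the true value, while the 2SPS estimate is attenuated, because substituting the prediction pushes the confounder into the error term of a nonlinear model.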
We begin, in the next section, with detailed descriptions of the methods within a unified modeling framework. This framework extends the two-stage least squares (2SLS) linear modeling approach for instrumental variables to nonlinear outcome and/or auxiliary models, encompassing many parametric nonlinear models that are commonly used in empirical health economics and health services research. The statistical properties of the two alternative methods are more formally examined in Section 3. There, we note that although the two methods produce identical results in the fully linear model (a special case of the broader class of models we consider), they do not coincide in the generic nonlinear model. Moreover, we show why 2SRI is generally statistically consistent in this broader class, but 2SPS is not. In Section 4, we compare the methods using simulated data in the context of two interesting nonlinear models involving endogenous regressors. The results reflect the theoretical consistency of 2SRI and the lack thereof for 2SPS. Further comparisons are drawn between the methods in Section 5, wherein we re-estimate Mullahy's (1997) exponential regression model of the effect of prenatal smoking on birthweight using a more flexible functional form. The 2SPS and 2SRI estimates differ substantially. The final section summarizes and concludes. The theoretical consistency of 2SRI, the results of the simulation analyses, and the findings from re-estimation of Mullahy's (1997) model all support the use of 2SRI over 2SPS.
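The statement that the two methods coincide in the fully linear model can be checked numerically. The sketch below assumes a hypothetical linear data-generating process (the coefficient values and variable names are illustrative, not taken from the paper) and compares the coefficient on the endogenous regressor from predictor substitution, residual inclusion, and naive OLS.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
n = 5_000
z = rng.normal(size=n)                 # instrument
xu = rng.normal(size=n)                # unobservable confounder
xe = 0.3 + 0.8 * z + xu                # endogenous regressor
y = 1.0 + 2.0 * xe + 1.5 * xu + rng.normal(size=n)   # linear outcome, true slope 2.0

ones = np.ones(n)
W = np.column_stack([ones, z])
xe_hat = W @ ols(W, xe)                # first-stage prediction
resid = xe - xe_hat                    # first-stage residual

b_2sps = ols(np.column_stack([ones, xe_hat]), y)[1]      # same as 2SLS here
b_2sri = ols(np.column_stack([ones, xe, resid]), y)[1]   # residual inclusion
b_ols = ols(np.column_stack([ones, xe]), y)[1]           # ignores endogeneity

print(f"2SPS = {b_2sps:.4f}, 2SRI = {b_2sri:.4f}, naive OLS = {b_ols:.4f}")
```

The two IV estimates agree up to floating-point error because the first-stage residual is orthogonal to the intercept and the fitted values, so adding it does not change the slope; the naive OLS slope is biased upward in this assumed design.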
Section snippets
The model
We employ the following nonlinear modeling framework. The main, and minimal, assumption of the model is that the conditional mean of the outcome (y) is of the form:
E[y | xe, xo, xu] = M(xe βe + xo βo + xu βu)   (1)
where M(·) is a known nonlinear function, and we distinguish among three types of regressor: xe denotes a 1 × S vector of endogenous regressors; xo is a 1 × K vector of observable exogenous regressors (observable confounders); and xu is a 1 × S vector of unobservable
Formal treatment of the consistency properties of the estimators
As discussed earlier, in the linear model 2SLS = 2SPS = 2SRI. Therefore, all three methods are consistent. These identities do not, however, hold in the generic nonlinear case, so the consistency of each method must be individually examined. To prove the consistency of 2SRI, we cast it as a special case of the generic two-stage optimization estimator (OE) (see Newey and McFadden, 1994; White, 1994, Chapter 6; or Wooldridge, 2002, Chapter 12). For simplicity of exposition, let us assume that xe and xu
Simulation analysis
As a follow-up to the discussion in the previous section, we explore potential biases from 2SPS relative to 2SRI estimation using simulated data in a few interesting nonlinear models involving endogenous regressors. Each of these examples is inspired by a published study in the health economics literature.
Mullahy's birthweight model revisited
To demonstrate the potential differences that might arise in actual practice between the 2SPS and 2SRI estimates, we re-estimated Mullahy's (1997) model of the effect of prenatal cigarette smoking on birthweight using data supplied by the author. Mullahy (1997) suspects that maternal smoking during pregnancy may be correlated with the unobservable determinants of birthweight, so he specifies a nonlinear conditional mean regression model, which can be viewed as a special case of (1). In
Discussion
We have examined two estimation methods that are commonly used in health economic applications involving nonlinear models with endogenous regressors—two-stage predictor substitution (2SPS) and two-stage residual inclusion (2SRI). The discussion begins with a detailed description of the estimators in an intuitively appealing nonlinear regression framework that explicitly accounts for endogeneity (i.e. the presence of unobservable confounders). Within that framework we show that the 2SRI
Acknowledgements
This research was supported by the National Institute on Drug Abuse (R01 DA013968-02) and the Substance Abuse Policy Research Program of the Robert Wood Johnson Foundation (53902). The author is grateful for the helpful comments of Libby Dismuke and David Bradford, and for the excellent research assistance provided by F. Michael Kunz. We also thank the editor and two anonymous reviewers for their many suggestions that served to improve the presentation.
References (65)
- et al. Estimating the quality of care in hospitals using instrumental variables. Journal of Health Economics (1999)
- et al. Do religious nonprofit and for-profit organizations respond differently to financial incentives? The hospice industry. Journal of Health Economics (2007)
- et al. Insurance and the utilization of medical services. Social Science in Medicine (2004)
- et al. Controlling for systematic selection in retrospective analyses: an application to fluoxetine and sertraline prescribing in the United Kingdom. Value in Health (1999)
- Efficient estimation of limited dependent variable models with endogenous explanatory variables. Journal of Econometrics (1987)
- et al. Large sample estimation and hypothesis testing
- et al. Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics (1988)
- et al. Moral hazard and adverse selection in Australian private hospitals: 1989–1990. Journal of Health Economics (2003)
- Estimating count data models with endogenous switching: sample selection and endogenous treatment effects. Journal of Econometrics (1998)
- Estimating endogenous treatment effects in retrospective data analysis. Value in Health (1999)