Computational Statistics and Data Analysis

It is often the case that an outcome of interest is observed for a restricted, non-randomly selected sample of the population. In such a situation, standard statistical analysis yields biased results. This issue can be addressed using sample selection models, which are based on the estimation of two regressions: a binary selection equation, determining whether a particular statistical unit will be available in the outcome equation, and the outcome equation of interest. Classic sample selection models assume a priori that continuous regressors have a pre-specified linear or non-linear relationship to the outcome, which can lead to erroneous conclusions. In the case of a continuous response, methods in which covariate effects are modeled flexibly have been previously proposed, the most recent being based on a Bayesian Markov chain Monte Carlo approach. A frequentist counterpart, which has the advantage of being computationally fast, is introduced. The proposed algorithm is based on the penalized likelihood estimation framework. The construction of confidence intervals is also discussed. The empirical properties of the existing and proposed methods are studied through a simulation study. The approaches are finally illustrated by analyzing data from the RAND Health Insurance Experiment on annual health expenditures.


Introduction
Sample selection models are used when the observations available for statistical analysis are not from a random sample of the population. Instead, individuals may have selected themselves into (or out of) the sample based on a combination of observed and unobserved characteristics. The use of statistical models ignoring such non-random selection can have severe detrimental effects on parameter estimation.
As a motivating example, consider the RAND Health Insurance Experiment (RHIE), a study conducted in the United States between 1974 and 1982 (Newhouse, 1999). The aim was to quantify the relationship between various demographic and socio-economic characteristics (see Table 6 in Section 4) and annual health expenditures in the population as a whole. Non-random selection arises if the sample consisting of individuals who used health care services differs in important characteristics from the sample of individuals who did not use them. When the relationship between the decision to use the services and health expenditure is through observables, selection bias can be avoided by accounting for these variables. However, if some individuals are part of the selected subsample because of some observables as well as unobservables, then, regardless of whether such variables are correlated in the population, they will be correlated in the selected sample (e.g., Dubin and Rivers, 1990). Hence, neglecting this potential correlation can lead to inconsistent estimates of the covariate effects in the equation for annual expenditure.
Statistical methods correcting for the bias induced by non-random sample selection involve the estimation of two regression models: the selection equation (e.g., decision to use health services), and the outcome equation (e.g., amount of health care expenditure). The latter is used to examine the substantive question of interest, whereas the former is used to detect selection bias and obtain consistent estimates of the covariate effects in the outcome equation. Since their introduction by Heckman (1979), sample selection models have been used in various fields (e.g., Bärnighausen et al., 2011; Cuddeback et al., 2004; Montmarquette et al., 2001; Sigelman and Zeng, 1999; Winship and Mare, 1992). Most of the case studies consider parametric sample selection models where continuous predictors have a pre-specified linear or non-linear relationship to the response variable. The need for techniques that flexibly model regressor effects, without making a priori assumptions, arises from the observation that all parameter estimates are inconsistent when the relationship between covariates and outcome is mismodeled (e.g., Chib et al., 2009; Marra and Radice, 2011). This may prevent the researcher from recognizing, for instance, strong covariate effects or revealing interesting relationships. Going back to the health expenditure example, covariates such as age and education are likely to have a non-linear relationship to both the decision to use health services and the amount spent on them. Imposing a priori a linear relationship (or a non-linear one obtained by simply using quadratic polynomials, for example) could mean failing to capture possibly important complex relationships.
In a parametric context, sample selection models are typically estimated using the two-step framework first introduced by Heckman (1979): using the parameter estimates of the selection equation, a component (called the inverse Mills ratio) is calculated and then included in the outcome equation to correct for non-random sample selection. Such an approach was proposed to deal with violations of the assumption of normality. However, it has been found to be sensitive to correlation among covariates in the outcome and selection equations, which can be very problematic in applications (Puhani, 2000). This problem can be alleviated by imposing an exclusion restriction, which requires at least one extra covariate to be a valid predictor in the selection equation but not in the outcome equation. A number of estimation methods which do not impose parametric forms on the error distribution have been introduced. These are termed 'semiparametric' since only part of the model of interest (the linear predictor) is parametrically pre-specified (e.g., Ahn and Powell, 1993; Lee, 1994; Martins, 2001; Newey et al., 1990; Powell, 1994; Vella, 1998). In this direction, recent developments include Marchenko and Genton (2012) and van Hasselt (2011). The sample selection literature has also been focusing on models with non-normal responses (e.g., Boyes et al., 1989; Terza, 1998; Smith, 2003; Greene, 2012). There are other variants of the sample selection model; these include Li (2011), who considered the case in which there is more than one selection mechanism, and Omori and Miyawaki (2010), who extended selection models to allow threshold values to depend on individuals' characteristics. These models have also been compared to principal stratification in the context of causal inference with nonignorable missingness (Mealli and Pacini, 2008).
We are interested in modeling covariate effects flexibly when the response is Gaussian. Das et al. (2003) considered the estimation of non-linear effects by extending the Heckman (1979) two-step estimation procedure. Recently, Chib et al. (2009) and Wiesenfarth and Kneib (2010) have introduced two more general estimation methods. Specifically, the approach of the former authors is based on Markov chain Monte Carlo simulation techniques and uses a simultaneous equation system that incorporates Bayesian versions of penalized smoothing splines. The latter further extended this approach by introducing a Bayesian algorithm based on low rank penalized B-splines for non-linear effects, varying-coefficient terms and Markov random-field priors for spatial effects. Using a model specification that is very similar to that of Wiesenfarth and Kneib (2010), we introduce a frequentist counterpart which has the advantage of being computationally fast. Our proposal should especially appeal to practitioners already familiar with traditional frequentist techniques. The proposed algorithm is based on the penalized maximum likelihood (ML) estimation framework, and is implemented in the R package SemiParSampleSel (Marra and Radice, 2012). As in a Bayesian framework, the proposal supports the choice of any class of smoothers, but without requiring extra computational effort; this advantage is not shared by a Bayesian implementation. The construction of confidence intervals is also discussed. The performance of the proposed and available methods is examined through a simulation study. Finally, the methods are illustrated by analyzing data from the RAND Health Insurance Experiment on annual health expenditures.

Model structure
The model consists of a system of two equations. Using the latent variable representation, the selection equation is
$$y^*_{1i} = \mathbf{u}^{\rm T}_{1i}\boldsymbol\theta_1 + \sum_{k_1=1}^{K_1} s_{1k_1}(z_{1k_1 i}) + \varepsilon_{1i}, \quad i = 1, \ldots, n, \qquad (1)$$
where n is the sample size, and y*_1i is a latent continuous variable which determines its observable counterpart y_1i through the rule y_1i = 1(y*_1i > 0). The outcome equation determining the response variable of interest is
$$y_{2i} = \mathbf{u}^{\rm T}_{2i}\boldsymbol\theta_2 + \sum_{k_2=1}^{K_2} s_{2k_2}(z_{2k_2 i}) + \varepsilon_{2i}, \quad i = 1, \ldots, n_s. \qquad (2)$$
In (1), u_1i is the ith row vector of U_1 = (u_11, ..., u_1n)^T, the n × P_1 model matrix containing P_1 parametric model components (such as the intercept, dummy and categorical variables), with corresponding parameter vector θ_1, and the s_1k_1 are unknown smooth functions of the K_1 continuous covariates z_1k_1 i. In line with Wiesenfarth and Kneib (2010), our implementation also supports varying coefficient models, obtained by multiplying one or more smooth terms by some predictor(s) (Hastie and Tibshirani, 1993), and smooth functions of two or more (e.g., spatial) covariates, as described in Wood (2006, pp. 154-167). Similarly, in (2), u_2i is the ith row vector of the model matrix U_2, with coefficient vector θ_2, and the s_2k_2 are unknown smooth terms of the K_2 continuous regressors z_2k_2 i; n_s denotes the size of the selected sample. For identification purposes, the smooth functions are subject to the centering constraint Σ_i s(z_ki) = 0 (Wood, 2006). As in Chib et al. (2009) and Wiesenfarth and Kneib (2010), we make the assumption that unobserved confounders have a linear impact on the responses. That is, the errors (ε_1i, ε_2i) are assumed to follow the bivariate normal distribution
$$\begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho\sigma_2 \\ \rho\sigma_2 & \sigma_2^2 \end{pmatrix} \right), \qquad (3)$$
where ρ is the correlation coefficient, σ_2 the standard deviation of ε_2i and σ_1, the standard deviation of ε_1i, is set to 1 because the parameters in the selection equation can only be identified up to a scale coefficient (e.g., Greene, 2012, p. 686). The assumption of normality may be, perhaps, too restrictive for applied work; however, it is typically made to obtain more tractable expressions.
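To make the data generating mechanism concrete, the following R sketch simulates from (1)-(3) under illustrative choices of smooth functions and parameter values (these are not the ones used in the paper); only the structure — latent selection, observability rule and correlated bivariate normal errors — reflects the model above.

library(mvtnorm)

set.seed(1)
n      <- 1500
rho    <- 0.5
sigma2 <- 1

z1 <- runif(n); z2 <- runif(n)

# bivariate normal errors: Var(eps1) = 1 (identification), Var(eps2) = sigma2^2
V   <- matrix(c(1, rho * sigma2, rho * sigma2, sigma2^2), 2, 2)
eps <- rmvnorm(n, sigma = V)

ystar1 <- -0.5 + sin(2 * pi * z1) + 0.8 * z2 + eps[, 1]  # selection Eq. (1)
y1     <- as.numeric(ystar1 > 0)                         # rule 1(y* > 0)
y2     <- 1.2 + cos(pi * z1) + eps[, 2]                  # outcome Eq. (2)
y2[y1 == 0] <- NA                  # y2 observed only for the selected sample
dat    <- data.frame(y1, y2, z1, z2)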
The smooth functions are represented using the regression spline approach. To fix ideas, in the one-dimensional case a generic s_k(z_ki) is approximated by a linear combination of known spline basis functions, b_kj(z_ki), and regression parameters, β_kj,
$$s_k(z_{ki}) \approx \sum_{j=1}^{J_k} \beta_{kj}\, b_{kj}(z_{ki}) = \mathbf{B}_k(z_{ki})^{\rm T}\boldsymbol\beta_k,$$
where J_k is the number of spline bases (hence regression coefficients) used to represent s_k, B_k(z_ki)^T is the ith vector of dimension J_k containing the basis functions evaluated at observation z_ki, i.e.
B_k(z_ki) = (b_k1(z_ki), ..., b_kJ_k(z_ki))^T, and β_k the corresponding parameter vector. Calculating B_k(z_ki) for each i yields J_k curves (encompassing different degrees of complexity) which, multiplied by some real valued parameter vector β_k and then summed, will give a (linear or non-linear) estimate for s_k(z_k). The number of basis functions, J_k, determines the flexibility allowed for s_k(z_k); J_k = 10 will lead to a ''wigglier'' curve estimate than when such a parameter is set to 5, for instance. Because, in practice, it is not easy to determine the optimal number of basis functions for the s_k(z_ki), a reasonably large number for the J_k is typically chosen (to allow enough flexibility in the model) and the corresponding β_k are then penalized in order to suppress that part of the non-linearity which is not supported by the data. As will be shown in the next section, this can be achieved using a penalized estimation approach, which is mainstream in the regression spline literature (see Ruppert et al., 2003 and Wood, 2006 for more details). Basis functions should be chosen to have convenient mathematical properties and good numerical stability. Many choices are possible and supported in our implementation, including B-splines, cubic regression splines and low rank thin plate regression splines (e.g., Ruppert et al., 2003). The case of smooths of more than one variable follows a similar construction. Based on the result above, Eqs. (1) and (2) can be written as
$$y^*_{1i} = \mathbf{x}^{\rm T}_{1i}\boldsymbol\delta_1 + \varepsilon_{1i} \qquad (4)$$
and
$$y_{2i} = \mathbf{x}^{\rm T}_{2i}\boldsymbol\delta_2 + \varepsilon_{2i}, \qquad (5)$$
where x^T_1i = {u^T_1i, B_1(z_11i)^T, ..., B_K_1(z_1K_1 i)^T}, δ_1 = (θ^T_1, β^T_11, ..., β^T_1K_1)^T, and x_2i and δ_2 are defined analogously. In principle, the parameters of the sample selection model are identified even if the two equations contain the same set of covariates (Wiesenfarth and Kneib, 2010). In practice, however, the absence of equation-specific regressors may lead to a likelihood function which does not vary significantly over a wide region around the mode (e.g., Marra and Radice, 2011). Moreover, in applications, both the functional form and the model errors are likely to be misspecified to some degree. This suggests that empirical identification is better achieved if an exclusion restriction (ER) on the covariates in the two equations holds (e.g., Chib et al., 2009; Vella, 1998). That is, the selection equation should contain at least one regressor not included in the outcome equation. See the simulation results in Section 3 for more discussion of this issue.
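Returning to the basis-plus-penalty construction described above, the following R sketch shows how a basis matrix B_k and its penalty matrix can be obtained in practice with the package mgcv (used here purely for illustration; the data are arbitrary):

library(mgcv)

z  <- runif(200)
sm <- smoothCon(s(z, bs = "ps", k = 10, m = c(2, 2)),
                data = data.frame(z = z))[[1]]

B <- sm$X       # 200 x J_k matrix: b_kj(z_i) evaluated at the data
S <- sm$S[[1]]  # J_k x J_k penalty matrix giving the quadratic form beta' S beta

# a (linear or non-linear) curve estimate is B %*% beta for some coefficient
# vector beta, as in the expansion above
beta  <- rnorm(ncol(B))
s_hat <- B %*% beta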

A penalized maximum likelihood estimation approach
Recall that the error terms (ε_1i, ε_2i) are assumed to follow a bivariate normal distribution, and define the linear predictors η_1i = x^T_1i δ_1 and η_2i = x^T_2i δ_2. The observations can be divided into two groups according to the type of data observed, and each group of observations has a different form for the likelihood. When y_2i is observed, the likelihood contribution is the probability of the joint event y_2i and y*_1i > 0, i.e.
$$\frac{1}{\sigma_2}\,\phi\!\left(\frac{y_{2i}-\eta_{2i}}{\sigma_2}\right)\Phi\!\left(\frac{\eta_{1i}+(\rho/\sigma_2)(y_{2i}-\eta_{2i})}{\sqrt{1-\rho^2}}\right).$$
When y_2i is not observed, the likelihood contribution just corresponds to the marginal probability that y*_1i ≤ 0, i.e.
$$\Phi(-\eta_{1i}) = 1 - \Phi(\eta_{1i}).$$
Therefore, the log-likelihood function for the complete sample of observations is
$$\ell(\boldsymbol\delta) = \sum_{i:\,y_{1i}=1}\left\{\log\phi\!\left(\frac{y_{2i}-\eta_{2i}}{\sigma_2}\right) - \log\sigma_2 + \log\Phi\!\left(\frac{\eta_{1i}+(\rho/\sigma_2)(y_{2i}-\eta_{2i})}{\sqrt{1-\rho^2}}\right)\right\} + \sum_{i:\,y_{1i}=0}\log\Phi(-\eta_{1i}),$$
where δ denotes the overall parameter vector. As explained in the previous section, because of the flexible linear predictor structure considered in this article, unpenalized ML estimation is likely to result in smooth term estimates which are too rough to produce practically useful results. This issue can be dealt with by augmenting the objective function with a penalty term, such as
$$\frac{1}{2}\sum_{v=1}^{2}\sum_{k_v=1}^{K_v}\lambda_{vk_v}\int \{s''_{vk_v}(z_{vk_v})\}^2\,\mathrm{d}z_{vk_v},$$
measuring the (second-order, in this case) roughness of the smooth terms in the model. The λ_vk_v are smoothing parameters controlling the trade-off between fit and smoothness. Since regression splines are linear in their model parameters, such a penalty can be expressed as a quadratic form in the λ_vk_v S_vk_v, where the S_vk_v are positive semi-definite known square matrices. The penalized log-likelihood is therefore given as
$$\ell_p(\boldsymbol\delta) = \ell(\boldsymbol\delta) - \frac{1}{2}\boldsymbol\delta^{\rm T}\mathbf{S}^*_\lambda\boldsymbol\delta. \qquad (6)$$
Because ρ is bounded in [−1, 1] and σ_2 can only take positive real values, we work with the transformed parameters ρ* = tanh^{-1}(ρ) and σ*_2 = log(σ_2) during estimation. Given values for the λ_vk_v, we seek to maximize (6). In practice, this can be achieved by Newton-Raphson's method, iterating
$$\boldsymbol\delta^{[a+1]} = \boldsymbol\delta^{[a]} + \left(-\mathbf{H}^{[a]} + \mathbf{S}^*_\lambda\right)^{-1}\left(\mathbf{g}^{[a]} - \mathbf{S}^*_\lambda\boldsymbol\delta^{[a]}\right) \qquad (7)$$
until convergence, where a is an iteration index, g^[a] and H^[a] are the score vector and Hessian of ℓ(δ) evaluated at δ^[a], and S*_λ is an overall block-diagonal penalty matrix, i.e. S*_λ = diag(0, λ_11 S_11, ..., λ_1K_1 S_1K_1, 0, λ_21 S_21, ..., λ_2K_2 S_2K_2, 0, 0), with zero blocks corresponding to the unpenalized parametric components and to σ*_2 and ρ*.
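The log-likelihood and penalty above translate directly into code. The following R function is a minimal sketch of ℓ_p(δ), assuming design matrices X1 and X2 for Eqs. (4) and (5), a logical selection indicator sel, and an overall penalty matrix Splus; all of these names are illustrative.

# Sketch: the penalized sample selection log-likelihood (6) written in R.
# sigma2 and rho enter through the log / atanh transformations mentioned
# above; Splus is the block-diagonal penalty matrix S*_lambda.
sampsel_penloglik <- function(delta, y2, X1, X2, sel, Splus) {
  p1 <- ncol(X1); p2 <- ncol(X2)
  delta1 <- delta[1:p1]
  delta2 <- delta[(p1 + 1):(p1 + p2)]
  sigma2 <- exp(delta[p1 + p2 + 1])   # sigma2* = log(sigma2)
  rho    <- tanh(delta[p1 + p2 + 2])  # rho*    = atanh(rho)

  eta1 <- drop(X1 %*% delta1)
  eta2 <- drop(X2[sel, , drop = FALSE] %*% delta2)
  r    <- (y2[sel] - eta2) / sigma2

  ll <- sum(dnorm(r, log = TRUE) - log(sigma2) +
            pnorm((eta1[sel] + rho * r) / sqrt(1 - rho^2), log.p = TRUE)) +
        sum(pnorm(-eta1[!sel], log.p = TRUE))

  ll - 0.5 * drop(t(delta) %*% Splus %*% delta)  # penalized version, Eq. (6)
}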
The main issues with the maximization problem (6) are that ℓ(δ) is not globally concave (e.g., Toomet and Henningsen, 2008), and that the Hessian may become non-positive definite on some occasions. Preliminary work confirmed these concerns, as well as that classic optimization schemes, implemented using the R functions nlm() and optim(), do not perform satisfactorily on this problem. To tackle these issues, (7) is implemented using a trust region algorithm with eigen-decomposition of H at each iteration (e.g., Nocedal and Wright, 1999, Section 4.2), and initial values are supplied using an adaptation of the Heckman (1979) procedure, which is detailed in Appendix B. This approach proved to be fast and reliable in most cases, with occasional convergence failure for small values of n and n_s.
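For illustration, the trust region idea can be tried with Geyer's trust package, here fed with numerical derivatives from numDeriv. This is a sketch only: the actual implementation uses the analytical score and Hessian of Appendix A and the adapted Heckman start of Appendix B.

library(trust)
library(numDeriv)

# trust() minimizes, so we negate the penalized log-likelihood defined above
objfun <- function(delta, y2, X1, X2, sel, Splus) {
  f <- function(d) -sampsel_penloglik(d, y2, X1, X2, sel, Splus)
  list(value    = f(delta),
       gradient = grad(f, delta),
       hessian  = hessian(f, delta))
}

# fit <- trust(objfun, parinit = start, rinit = 1, rmax = 10,
#              y2 = y2, X1 = X1, X2 = X2, sel = sel, Splus = Splus)
# fit$argument would then hold the penalized ML estimate of delta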
Joint estimation of δ and λ (containing the λ_vk_v) via maximization of (6) would result in overfitting, since the highest value of ℓ_p(δ) would be obtained when λ = 0. This is why in (7) the λ_vk_v are fixed at some values. The next section illustrates how λ can be estimated.

Smoothness selection
Smoothing parameter selection is important for practical modeling. In principle, it can be achieved by direct grid search optimization of, for instance, the Akaike information criterion (AIC; Akaike, 1973). However, if the model has more than two or three smooth terms, this typically becomes computationally burdensome, hence making the model building process difficult in most applied contexts. There are a number of techniques for automatic multiple smoothing parameter estimation for univariate regression spline models. Without claim of exhaustiveness, we briefly describe some of them. Gu (1992) introduced the performance-oriented iteration method, which applies generalized cross validation (GCV) or the unbiased risk estimator (UBRE; Craven and Wahba, 1979) to each working linear model of the penalized iteratively re-weighted least squares (P-IRLS) scheme used to fit the model. Wood (2004) extended this approach by providing an optimally stable computational procedure. Smoothing parameter selection can also be achieved by exploiting the mixed model representation of penalized regression spline models. Here, smoothing parameters become variance components and, as such, can be estimated by either ML or restricted maximum likelihood (REML) in the Gaussian case, and by penalized quasi-likelihood in the generalized case (e.g., Breslow and Clayton, 1993; Ruppert et al., 2003). Wahba (1985) showed that, asymptotically, prediction error criteria are better in a mean square error sense, even though Härdle et al. (1988) pointed out that these criteria converge very slowly to the optimal smoothing parameters. Recent work by Reiss and Ogden (2009) shows that at finite sample sizes both GCV and AIC are more prone to undersmoothing and more likely to develop multiple minima than REML. However, as pointed out by Ruppert et al. (2003, p. 177), automatic smoothing parameter selectors might be somewhat erratic; they provide an empirical example where REML leads to severe oversmoothing. We adapt Gu's approach to the current context; in our experience, it proved to be very efficient and stable in most cases.
Given values for the λ_vk_v, and noting that Newton-Raphson's iterative Eq. (7) can be written in P-IRLS form, δ^[a+1] is the solution of
$$\boldsymbol\delta^{[a+1]} = (\mathbf{X}^{\rm T}\mathbf{W}\mathbf{X} + \mathbf{S}^*_\lambda)^{-1}\mathbf{X}^{\rm T}\mathbf{W}\mathbf{z}, \qquad (8)$$
where z = Xδ^[a] + W^{-1}d is the vector of pseudodata, d is the vector whose ith block d_i contains the derivatives of the ith log-likelihood contribution with respect to η_1i, η_2i, σ*_2 and ρ*, W is the corresponding non-diagonal iterative weight matrix, and √W is any matrix square root such that (√W)^T √W = W. Assuming, without loss of generality, that the spline basis dimensions for the smooth terms in the model are all equal to J, X_i is a 4 × m block of the stacked model matrix X, where m is the number of columns of X. The iteration superscript [a] has been suppressed from d_i, z_i and W_i, and is omitted from the quantities shown in the next paragraph, to avoid clutter.

Smoothing parameter vector λ should be selected so that the estimated smooth terms are as close as possible to the true functions. In the current context, this is achieved using the approximate UBRE. Specifically, λ is the solution to the problem
$$\min_{\boldsymbol\lambda}\; V^u_w(\boldsymbol\lambda) = \frac{1}{n^*}\left\|\sqrt{\mathbf{W}}(\mathbf{z} - \mathbf{X}\hat{\boldsymbol\delta})\right\|^2 - 1 + \frac{2}{n^*}\,\mathrm{tr}(\mathbf{A}_\lambda), \qquad (9)$$
where the working linear model quantities are constructed for a given estimate of δ, obtained from Newton-Raphson's equation (7) or (8), n* = 4n, A_λ = X(X^T W X + S*_λ)^{-1} X^T W is the hat matrix, and tr(A_λ) is the estimated degrees of freedom of the penalized model. For each working linear model of the P-IRLS iteration, V^u_w(λ) is minimized by employing the approach of Wood (2004), which is based on Newton-Raphson's method and can evaluate the approximate UBRE and its derivatives in a way that is both computationally efficient and stable. Note that because W is a non-diagonal matrix of dimension n* × n*, computation could be prohibitive, even for small sample sizes. To this end, W^{-1}d, √Wz and √WX are calculated exploiting the sparse block structure of W; hence, the working linear model in (9) can be formed in O(nm) operations, where m is the number of columns of X.
The issue with evaluating the approximate UBRE is that the W_i are not guaranteed to be positive-definite, mainly because of σ*_2 and ρ* (e.g., Marra and Radice, 2011; Yee, 2010). This is problematic in that √W and W^{-1} are needed in (9). As a solution, the working linear model is constructed so that its key quantities depend on all model parameters but σ*_2 and ρ*, since these are not penalized. In this way, it is possible to construct the working linear model quantities needed in (9), as well as to substantially reduce the computational load and storage demand of the algorithm, since in this case each per-observation block is of dimension 2 (n* = 2n). The structure of the algorithm used for estimating parameter vector δ is given in Appendix C.
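Readers wishing to experiment with UBRE-based smoothness selection for a single working linear model can use mgcv's magic(), which implements the Wood (2004) method. The toy example below is an ordinary penalized regression, not the four-block sample selection working model described above:

library(mgcv)

set.seed(2)
z <- runif(300)
y <- sin(2 * pi * z) + rnorm(300, sd = 0.3)

# centered P-spline basis (constraint absorbed so an intercept can be added)
sm <- smoothCon(s(z, bs = "ps", k = 20, m = c(2, 2)),
                data = data.frame(z = z), absorb.cons = TRUE)[[1]]
X <- cbind(1, sm$X)

# UBRE (gcv = FALSE) requires a known scale; 0.3^2 is the true error
# variance of this toy example. 'off' gives the penalty's column offset in X.
fit <- magic(y, X, sp = 1, S = list(sm$S[[1]]), off = 2,
             gcv = FALSE, scale = 0.3^2)
fit$sp     # selected smoothing parameter
beta_hat <- fit$b   # coefficients of the penalized fit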

Inference
Inferential theory for penalized estimators is not standard. This is because the presence of smoothing penalties undermines the usefulness of classic frequentist results for practical modeling. Solutions to this problem have been introduced in the literature (see, e.g., Gu, 2002, and Wood, 2006 for an overview). Here, we show how to construct pointwise confidence intervals for the terms of a regression spline sample selection model by adapting to the current context the well known Bayesian 'confidence' intervals originally proposed by Wahba (1983) and Silverman (1985). An appealing characteristic of these intervals is that they have close to nominal 'across-the-function' frequentist coverage probabilities for the components of a generalized additive model (Marra and Wood, 2012). To see this point, consider a generic s_k(z_ki).
Intervals can be constructed by seeking some constants C_ki and A such that
$$\mathrm{ACP} = E\left[\frac{1}{n}\sum_{i=1}^{n} I\left\{|\hat{s}_k(z_{ki}) - s_k(z_{ki})| \le q_{\alpha/2}\, C_{ki} A\right\}\right] = 1 - \alpha, \qquad (10)$$
where ACP denotes the average coverage probability, I is an indicator function, α is a constant between 0 and 1, and q_α/2 is the α/2 critical point from a standard normal distribution.
Defining a bias component B_k and a variance component V_k for the difference ŝ_k(z_kI) − s_k(z_kI), and I to be a random variable uniformly distributed on {1, 2, ..., n}, we have that ACP = Pr(|B_k + V_k| ≤ q_α/2 C_kI A). It is then necessary to find the distribution of B_k + V_k and values for C_ki and A so that requirement (10) is met. As shown in Marra and Wood (2012), in the context of non-Gaussian response models involving several smooth components, such a requirement is approximately met when confidence intervals for the ŝ_k(z_ki) are constructed using
$$\boldsymbol\delta \mid \mathbf{y} \sim N(\hat{\boldsymbol\delta}, \mathbf{V}_\delta), \qquad (11)$$
where, in the current context, y refers to the response vectors, δ̂ is an estimate of δ and V_δ = (−H + S*_λ)^{-1}. The structure of this variance-covariance matrix is such that it includes both a bias and a variance component in a frequentist sense (Marra and Wood, 2012). Given result (11), confidence intervals for linear and non-linear functions of the model parameters can be easily obtained. For any parametric model component, using (11) is equivalent to using classic likelihood results because such terms are not penalized. Note that there is no contradiction in fitting the sample selection model via penalized ML and then constructing intervals using a Bayesian result; such an approach has been adopted many times in the literature (e.g., Gu, 2002; Marra and Radice, 2011; Wood, 2006). Moreover, the quantities needed to construct the intervals are obtained as a byproduct of the estimation process; hence no extra computation is really required for inferential purposes.
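In code, result (11) yields pointwise intervals from the diagonal of B V_δ B^T, where B holds the basis columns of the smooth of interest. A minimal sketch, assuming the penalized information matrix Hp = −H + S*_λ and the relevant column indices idx are available from the fit:

# Sketch: Bayesian 'confidence' intervals (11) for one smooth term.
smooth_ci <- function(X, Hp, delta_hat, idx, level = 0.95) {
  Vd    <- solve(Hp)                          # V_delta = (-H + S*_lambda)^-1
  B     <- X[, idx, drop = FALSE]             # basis columns of the smooth
  s_hat <- drop(B %*% delta_hat[idx])         # estimated smooth
  se    <- sqrt(rowSums((B %*% Vd[idx, idx]) * B))  # sqrt diag(B Vd B')
  q     <- qnorm(1 - (1 - level) / 2)
  cbind(lower = s_hat - q * se, fit = s_hat, upper = s_hat + q * se)
}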
Frequentist approaches treat λ as known. It is, therefore, reasonable to expect the neglect of the variability due to smoothing parameter estimation to lead to undercoverage of the intervals. This problem should become more relevant as the number of smoothing parameters increases, which is especially the case for sample selection models as compared to single equation models. In this respect, a Bayesian approach would be advantageous, as smoothing parameter uncertainty can be naturally taken into account. However, as shown by Marra and Wood (2012), provided that smoothing parameters are selected so that the estimation bias is not too large a proportion of the sampling variability, the empirical performance of the intervals should have little or no sensitivity to the neglect of smoothing parameter uncertainty. This suggests that a fully Bayesian approach like that of Wiesenfarth and Kneib (2010) may lead to overcoverage. The simulation study in the next section will also shed light on this issue.

Simulation study
In this section, we conduct a Monte Carlo simulation study to compare the proposed method with classic univariate regression and the approach of Wiesenfarth and Kneib (2010). Computations were performed in the R environment (R Development Core Team, 2012) using the package SemiParSampleSel (Marra and Radice, 2012), which implements the ideas discussed in the previous sections, and the program bayesSampleSelection (available at http://www.unigoettingen.de/en/96061.html) written by Wiesenfarth. The approach by Chib et al. (2009) was not used in the comparison because of the lack of R code. However, this should not be problematic as their method is closely related to that of Wiesenfarth and Kneib (2010).
The sampling experiments were based on a two-equation model in which y_1i and y_2i were determined as described in Section 2.1. The test functions used were s_11(z_1i) = −0.7{...} and s_21(z_1i) = 0.6{exp(z_1i) + sin(2.9 z_1i)} (see Fig. 1). (θ_12, θ_21, θ_22) and σ_2 were set to (2.5, −0.68, −1.5) and 1, respectively. To generate binary values for y_1i such that approximately 25%, 50% and 75% of the total number of observations were selected to fit the outcome equation, θ_11 was set to −0.65, 0.58 and 1.66, respectively. Regressors u_i, z_1i and z_2i were generated as three uniform covariates on (0, 1) with correlation approximately equal to 0.5. This was achieved using rmvnorm() in the package mvtnorm, generating standardized multivariate random draws with correlation 0.5 and then applying pnorm() (e.g., Marra and Radice, 2011). Regressor u_i was then dichotomized using round(). Standardized bivariate normal errors with correlations ρ = (±0.1, ±0.5, ±0.9) were considered, and sample sizes were set to 500, 1500 and 3000. For each combination of parameter settings, the number of simulated datasets was 250. Models were fitted with and without exclusion restriction (ER and non-ER, respectively); specifically, in the latter case z_2i was not included in the selection equation.
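The covariate generation step just described can be reproduced as follows (a direct transcription of the rmvnorm()/pnorm()/round() recipe):

library(mvtnorm)

set.seed(3)
n <- 1500
Sigma <- matrix(0.5, 3, 3); diag(Sigma) <- 1

Zn <- rmvnorm(n, sigma = Sigma)   # correlated standard normal draws
Zu <- pnorm(Zn)                   # transform to uniforms on (0, 1),
                                  # preserving correlation of about 0.5
u  <- round(Zu[, 1])              # dichotomized regressor
z1 <- Zu[, 2]
z2 <- Zu[, 3]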
For the approach of Wiesenfarth and Kneib (2010), we used the following settings. The number of iterations for the burn-in period, the number of samples used for estimation, and the degree of thinning were 2000, 22 000 and 20, respectively, and smooth terms were represented using P-spline bases with 20 inner knots and penalty matrices containing second order differences. To make a fair comparison, the smooth components of the proposed method were represented using P-splines with the same settings. Models were also fitted neglecting the sample selection issue: using the selected sample only, Eq. (5) was simply fitted via penalized least squares, again with the same number of P-spline bases and penalty order. As this naive approach cannot correct sample selection bias, badly biased parameter estimates are clearly expected in the simulation results. However, reporting such results should be useful to highlight the negative effects that the neglect of non-random sample selection has on parameter estimates. Following a reviewer's suggestion, we also employed a standard Heckman approach where non-linear effects were modeled using second-order polynomial terms.

Results
In this section, we only show a subset of results; these are representative of all empirical findings. Since the selection equation is not affected by sample selection bias, we focus on the estimation results for the outcome equation only. Table 1 reports the percentage relative bias and the root mean squared error (RMSE) for θ̂_22, ρ̂ and σ̂, when assuming that ER holds and approximately 50% of the total number of observations are available to fit the outcome equation. The approaches employed are naive, standard Heckman with second-order polynomial terms, and penalized Bayesian and ML estimation (naive, HeckP, W&K and M&R, respectively). Table 2 shows the RMSE and 95% average coverage probability (ACP) for the four approaches when estimating s_21(z_1) under the same settings as described above. Tables 3 and 4 report the bias and RMSE for θ̂_22, ρ̂ and σ̂, and the RMSE and ACP for ŝ_21(z_1), respectively, for the non-ER case when approximately 75% of the total number of observations are selected for the outcome equation. This scenario should be reasonably close to the empirical illustration presented in Section 4, where there is no ER and 77% of the observations are selected. The results for the non-ER case when approximately 50% of observations are selected are reported in Appendix D (Table 8). Table 5 reports the results obtained under the ER scenario with 50% of selected observations when, in the data generating process, the simple non-linear function s_21(z_1) is swapped with function s_11(z_1); in this case, θ_11 was set to −3.05. As in Wiesenfarth and Kneib (2010), based on the estimates for 200 fixed covariate values, RMSE(ŝ) was calculated as
$$\mathrm{RMSE}(\hat{s}) = \sqrt{\frac{1}{200}\sum_{t=1}^{200}\left\{\hat{s}(z_t) - s(z_t)\right\}^2}.$$
In terms of mean computing times, M&R showed a lower computational cost than W&K. Specifically, mean times for M&R were between 0.02 and 0.17 min, depending on n and ρ. Compared to M&R, W&K took approximately between 25 and 100 times longer to fit a sample selection model.
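For reference, the RMSE criterion just defined takes a few lines of R; the evaluation grid of 200 points below is an assumption about the fixed covariate values:

rmse_smooth <- function(s_hat, s_true, grid = seq(0, 1, length.out = 200)) {
  sqrt(mean((s_hat(grid) - s_true(grid))^2))
}

s21 <- function(z) 0.6 * (exp(z) + sin(2.9 * z))   # second test function
# rmse_smooth(s_hat = fitted_curve_fun, s_true = s21), where fitted_curve_fun
# is a function returning the estimated (centered) curve on the grid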
The main results can be summarized as follows.
• Table 1 shows that the neglect of non-random sample selection leads to seriously biased parameter estimates.When ρ is small, HeckP outperforms all other methods in terms of bias.However, as ρ increases (i.e., the sample selection issue becomes more pronounced), M&R and W&K outperform HeckP, with M&R being the best in terms of bias.For high ρ, M&R performs the best in terms of bias and precision.As n increases, the W&K, M&R and HeckP estimates show convergence to their true values.Similar conclusions are reached for s 21 (z 1 ) (see Table 2).The 95% ACPs for M&R are more accurate than those for W&K and HeckP.This suggests that a fully Bayesian approach, which accounts for smoothing parameter uncertainty, yields slightly conservative pointwise intervals.This finding supports the argument by Marra and Wood (2012) (see Section 2.3) and is in agreement with the results of Wiesenfarth and Kneib (2010).We did not report the ACPs for the biased naive fits because these were clearly below the nominal level.
• In the non-ER case, when 50% of observations are selected (Table 8 in Appendix D), all methods exhibit higher bias as compared to the ER case. In addition, similar trends as those in Tables 1 and 2 are detected, with W&K and M&R being the best and naive being the worst as ρ increases. When about three-fourths of the total number of observations are available for the outcome equation, the same patterns can again be observed, but with lower bias and RMSE (see Tables 3 and 4). This finding complements the argument in the final paragraph of Section 2.1 by suggesting that the presence of ER is necessary to obtain good estimation results, unless the percentage of selected observations is high.
• Following a reviewer's suggestion, we also compared HeckP and M&R under the ER scenario with 50% of selected observations when, in the data generating process, s_21(z_1) is swapped with s_11(z_1) (see Table 5). Overall, the results show that HeckP does not model regressor effects adequately. Specifically, for any value of ρ and n, the HeckP RMSE of ŝ_11(z_1) is consistently higher than that of M&R. In addition, the residual confounding induced by the mismodeled non-linear effects seems to have negative consequences for the estimation of the parametric effects, where HeckP underperforms in terms of accuracy and precision.
• We also carried out additional simulation experiments where the model errors are generated according to a bivariate Student-t distribution with 3 degrees of freedom; see Tables 9 and 10 in Appendix D. As expected, results are worse than those presented in this section. However, the use of ER helps to obtain better estimates, although still not as good as those produced when assumption (3) is met. Even worse results (available upon request) are found when using asymmetric or bimodal bivariate model error distributions, a case in which the presence of ER cannot really help. The reason for this result is that the likelihood of the model is misspecified (i.e., we assume normality but the model errors are generated from a Student-t distribution), and ML approaches are known to be sensitive to such issues.
In summary, methods which cannot control for non-random sample selection (such as naive) produce severely biased estimates. The two penalized (Bayesian and ML) regression spline approaches to sample selection modeling are effective and generally outperform standard Heckman with polynomial terms. Moreover, although W&K and M&R perform similarly, the former is more time-consuming than the latter. Finally, ER is generally required to obtain good estimation results. Of course, in the presence of severe model misspecification, ER cannot prevent considerably biased estimates.

Empirical illustration
The method presented in this paper, as well as naive, standard Heckman with polynomial terms and W&K, are illustrated using data from the RAND Health Insurance Experiment (RHIE), a comprehensive study of health care cost, utilization and outcome conducted in the United States between 1974 and 1982 (Newhouse, 1999). As explained in the introductory section, the aim was to quantify the relationship between various covariates and annual health expenditures in the population as a whole.
In this context, non-random sample selection arises because the sample consisting of individuals who used health care services differs in important characteristics from the sample of individuals who did not use them. Because some characteristics cannot be observed, traditional regression modeling is likely to deliver biased estimates. We, therefore, need to correct parameter estimates for sample selection bias. We use the same subsample as in Cameron and Trivedi (2005, p. 553), and model annual health expenditures. The sample size and the number of selected observations are 5574 and 4281, respectively. The variables are defined in Table 6. Additional information can be found in Cameron and Trivedi (2005, Table 20.4) and Newhouse (1999).
Following Cameron and Trivedi (2005), the outcome and selection equations include the same set of regressors. In M&R and W&K, the two equations include logc, idp, fmde, physlm, disea, hlthg, hlthf, hlthp, female, child, fchild and black as parametric components, and smooth functions of pi, inc, fam, educdec and xage, represented using P-spline bases with 20 inner knots and penalty matrices based on second order differences. For naive, the same model specification is adopted, but clearly a selection equation is not present. As for standard Heckman, we model the effects of pi, inc, fam, educdec and xage using second-order polynomials. For W&K, the number of burn-in iterations, the number of samples used for estimation, and the degree of thinning were the same as those employed in the simulation study.
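For orientation, the M&R specification just described corresponds to the following pair of model formulas. This is a sketch only: the smooth terms mirror the settings stated above, but the exact calling syntax of SemiParSampleSel() (argument layout and smooth-specification defaults) is an assumption to be checked against the package documentation, and 'rhie' is a hypothetical data frame name.

library(SemiParSampleSel)

# Hypothetical sketch of the RHIE model specification; syntax is an assumption.
eq.sel <- binexp   ~ logc + idp + fmde + physlm + disea + hlthg + hlthf +
                     hlthp + female + child + fchild + black +
                     s(pi, bs = "ps") + s(inc, bs = "ps") + s(fam, bs = "ps") +
                     s(educdec, bs = "ps") + s(xage, bs = "ps")
eq.out <- lnmeddol ~ logc + idp + fmde + physlm + disea + hlthg + hlthf +
                     hlthp + female + child + fchild + black +
                     s(pi, bs = "ps") + s(inc, bs = "ps") + s(fam, bs = "ps") +
                     s(educdec, bs = "ps") + s(xage, bs = "ps")

# fit <- SemiParSampleSel(eq.sel, eq.out, data = rhie)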
The use of smooth functions for xage, educdec and inc is suggested by the fact that these covariates embody productivity and life-cycle effects that are likely to influence health expenditures non-linearly.Dismuke and Egede (2011) and Sullivan et al. (2007) consider parametric specifications where non-linear effects are modeled by categorizing these variables into groups based on intervals.However, categorizing a continuous variable has several disadvantages since, for example, it introduces problems of defining cut-points and assumes a priori that the relationship between response and covariate is flat within intervals (e.g., Marra and Radice, 2010).As for fam and pi, we do not have a priori knowledge of their effects and imposing linear or quadratic relationships may prevent us from revealing interesting non-linear relationships.
Smooth functions of other covariates, such as idp and disea, are not considered as their number of unique covariate values is too small. Table 7 and Fig. 2 report the parametric and smooth function estimates for the outcome equation (which is the one of interest) obtained by applying the four approaches to the RAND RHIE dataset. Computational times for M&R and W&K were 1.03 and 18.26 min, respectively.
The parametric effects obtained using HeckP, M&R and W&K are very similar (except for the effect of child) and differ from those of naive. Specifically, the M&R and W&K results suggest that socioeconomic factors (black, female, child, and fchild) as well as health status variables (physlm, disea, hlthg, hlthf, hlthp) have a stronger effect on annual health expenses as compared to the naive results. The health insurance variables (logc, idp) appear not to determine annual medical expenses when using naive.

The estimated smooths for the socioeconomic variables (inc, fam, educdec and xage) obtained with M&R and W&K are reasonably close. This is not true for the health insurance variable pi, where all four estimated functions differ. Notice how the estimated curves produced by HeckP are consistently U- or inverted U-shaped; this is an artifact of the polynomial specification. The estimates of ρ, which are important to ascertain the presence of selection bias, are high, positive, and statistically significant. This indicates that the unobservable factors which lead individuals to use health services also lead them to spend more on medical expenses. The estimates of σ_2 obtained with HeckP on the one hand, and with M&R and W&K on the other, are significantly different, with the latter two showing larger uncertainty. This result is consistent with that found in the simulated non-ER scenario with a percentage of selected observations equal to 75%.

Overall, the M&R and W&K estimates are coherent with the predictions of economic theory. For example, the results for age, education and income are consistent with the interpretation that health expenditure increases as people become older, have more years of schooling, and are wealthier. Also, individual health expenditure decreases as family size increases.

The reliability of the results presented in this section depends on whether the assumption of normality is met. As noted by Cameron and Trivedi (2005, p. 555), the underlying normality is suspect for these data because of the presence of large outliers. Testing this assumption is especially important when ER is not present, as is the case here. Without ER and under violation of the assumption of normality, selection bias correction fails (e.g., Vella, 1998). It would therefore be ideal to test for normality. Cameron and Trivedi (2005) checked this assumption by applying standard tests of heteroskedasticity, skewness and kurtosis to the outcome variable lnmeddol. However, in a regression context, normality should be assessed more rigorously. For example, in the current case, a possibility would be to employ a score test of bivariate normality in which the density of the errors under the alternative hypothesis is based on a type AA bivariate Gram-Charlier series with 9 additional parameters (e.g., Chiburis, 2010; Lee, 1984). However, it is not entirely clear how this test can be extended to the penalized likelihood framework considered in this paper. Another possibility would be to exploit the fact that a penalized regression spline is approximately equivalent to a pure regression spline with degrees of freedom close to that of the penalized fit (e.g., Wood, 2006, pp. 210-212). This topic is beyond the scope of this paper and will be addressed in future research.

Conclusions
We introduced an algorithm to estimate a regression spline sample selection model for Gaussian data.The proposal is based on the penalized likelihood estimation framework.The construction of confidence intervals has also been illustrated, and the problem of identification has been discussed.The method has been tested and compared to a Bayesian counterpart and the classic Heckman sample selection model.Finally, the proposed approach and its competitors have been illustrated on data from the RAND Health Insurance Experiment on annual health expenditures.The R package SemiParSampleSel (Marra and Radice, 2012) implements the ideas discussed in this article.
The results of our simulation study highlighted the detrimental effects that the neglect of non-random sample selection has on parameter estimation.They also suggested that the two Bayesian and ML regression spline approaches considered in this article are effective and generally outperform standard Heckman with polynomials.The Bayesian and ML methods were found to perform similarly, with the former being more computationally expensive than the latter.We also found that ER is generally required to obtain good estimation results.
Because ML estimators are sensitive to model error misspecification, methods allowing for different bivariate distributions of the errors can be developed. For example, Marchenko and Genton (2012) introduced a sample selection model where the errors are assumed to follow a bivariate Student-t distribution. However, in their implementation the structure of the linear predictor is parametrically pre-specified. The proposed approach could be extended by adopting either a copula (e.g., Nelsen, 2006) or a nonparametric distribution function estimation framework. Future research will be conducted in these directions.

Appendix A. Score vector and Hessian

The elements of the score vector g and of the Hessian H used in (7) are obtained by differentiating the log-likelihood ℓ(δ) of Section 2.2 with respect to the components of δ.

Appendix B. Starting value procedure
Sensible starting values can be provided by adapting Heckman's (1979) approach to the regression spline context. Regression function (5) can be written as
$$E(y_{2i} \mid y^*_{1i} > 0) = \mathbf{x}^{\rm T}_{2i}\boldsymbol\delta_2 + E(\varepsilon_{2i} \mid \varepsilon_{1i} > -\eta_{1i}). \qquad (12)$$
It then follows that
$$E(\varepsilon_{2i} \mid \varepsilon_{1i} > -\eta_{1i}) = \theta_\vartheta \vartheta_i, \qquad (13)$$
where θ_ϑ = σ_2 ρ, ϑ_i = φ(η_1i)/Φ(η_1i) (the inverse Mills ratio) and η_1i = u^T_1i θ_1 + B^T_1i β_1. Therefore, Eq. (12) can be written as
$$y_{2i} = \mathbf{x}^{\rm T}_{2i}\boldsymbol\delta_2 + \theta_\vartheta \vartheta_i + \tilde\varepsilon_{2i}, \qquad (14)$$
where ε̃_2i is a new disturbance term which, by construction, is uncorrelated with u_2i, B_2i and ϑ_i. The coefficient estimates in (5) that would be obtained using a non-randomly selected subsample are biased if ρ ≠ 0. This can be seen as an ordinary specification error, with the conditional mean (13) deleted as a regressor from the model. Including ϑ_i as an explanatory variable, as in Eq. (14), would in principle rectify this situation. ϑ_i is unknown; however, it is possible to obtain a consistent estimate of it using the estimated coefficients of selection Eq. (4).
The two-step procedure to fit model (14) can be summarized as follows.
Step 1. Fit a probit model for Eq. (4) and obtain estimates η̂_1i and ϑ̂_i, for all i.
Step 2. Using the selected sample only, fit model (14), where ϑ_i is replaced with ϑ̂_i, for all i.
The correlation parameter ρ can be estimated by ρ̂ = θ̂_ϑ/σ̂_2, where
$$\hat\sigma_2^2 = \frac{1}{n_s}\sum_{i=1}^{n_s}\hat{\tilde\varepsilon}_{2i}^2 + \hat\theta_\vartheta^2\,\frac{1}{n_s}\sum_{i=1}^{n_s}\hat\gamma_i,$$
ε̃̂_2i is the residual resulting from the estimation of (14) and γ̂_i = ϑ̂_i(ϑ̂_i + η̂_1i) (e.g., Toomet and Henningsen, 2008). Note that since ρ̂ can fall outside [−1, 1], this quantity is truncated to stay within this range. Moreover, although the parameters of the models in the two steps can be estimated using an unpenalized procedure, this is not advisable in practice (see Section 2). Therefore, the models in the two-step procedure are estimated by maximization of a penalized log-likelihood function and by minimization of a penalized least squares criterion, respectively. Standard statistical software is available to achieve this (Ruppert et al., 2003; Wood, 2006).
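The whole starting value procedure can be sketched in a few lines using mgcv for the penalized fits of the two steps; the data frame dat (with columns y1, y2, z1, z2) is assumed to come from a simulation such as the one after Eq. (3), and the formulas are illustrative.

library(mgcv)

## step 1: penalized probit for the selection equation
fit1 <- gam(y1 ~ s(z1, bs = "ps") + s(z2, bs = "ps"),
            family = binomial(link = "probit"), data = dat)
eta1 <- predict(fit1, type = "link")
imr  <- dnorm(eta1) / pnorm(eta1)          # inverse Mills ratio

## step 2: penalized least squares on the selected sample, with the IMR added
sel  <- dat$y1 == 1
dat2 <- cbind(dat[sel, ], imr = imr[sel])
fit2 <- gam(y2 ~ s(z1, bs = "ps") + imr, data = dat2)

## starting values for sigma_2 and rho (rho truncated to [-1, 1])
theta_imr <- coef(fit2)["imr"]
gammai    <- imr[sel] * (imr[sel] + eta1[sel])
sigma2    <- sqrt(mean(residuals(fit2)^2) + theta_imr^2 * mean(gammai))
rho       <- max(min(theta_imr / sigma2, 1), -1)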
In principle, because of the non-linearity of the inverse Mills ratio and the use of flexible covariate effects, the parameters of the two-step procedure are identified even if the two equations contain the same set of covariates. However, since it is typically the case that ϑ̂_i can be approximated well by a linear function of the covariates in the model, there will be substantial collinearity between ϑ̂_i and the regressors in the outcome equation, which can affect parameter estimation. This will especially be the case when the range of values of η_1i is not very large. The presence of ER can alleviate this problem (see, e.g., Leung and Yu, 2000 for other remedies). We do not elaborate on this further as the presented two-step approach just serves as a starting value procedure.

Fig. 1. The test functions used in the simulation studies.

Table 1
Percentage biases and RMSEs for θ̂_22, ρ̂ and σ̂ obtained from the ER experiments, when employing the naive, standard Heckman (with second-order polynomial terms), penalized Bayesian and penalized ML estimation approaches (naive, HeckP, W&K and M&R). The number of simulated datasets and the approximate percentage of selected observations are 250 and 50%, respectively. True values of θ_22 and σ are −1.5 and 1. ρ and n denote the correlation between the errors of the selection and outcome equations, and the sample size. See Section 3 for further details.

Table 2
RMSEs and 95% average coverage probabilities for ŝ_21(z_1) obtained from the ER experiments, when employing naive, HeckP, W&K and M&R. The number of simulated datasets and the approximate percentage of selected observations are 250 and 50%, respectively. See the caption of Table 1 for further details.

Table 3
Percentage biases and RMSEs for θ̂_22, ρ̂ and σ̂ obtained from the non-ER experiments, when employing the naive, standard Heckman (with second-order polynomial terms), penalized Bayesian and penalized ML estimation approaches (naive, HeckP, W&K and M&R). The number of simulated datasets and the approximate percentage of selected observations are 250 and 75%, respectively. See the caption of Table 1 for further details.

Table 4
RMSEs and 95% average coverage probabilities for ŝ_21(z_1) obtained from the non-ER experiments, when employing naive, HeckP, W&K and M&R. The number of simulated datasets and the approximate percentage of selected observations are 250 and 75%, respectively. See the caption of Table 1 for further details.

Table 5
Percentage biases and RMSEs for θ̂_22, ρ̂, σ̂ and ŝ_11(z_1) obtained from the ER experiments, when employing the standard Heckman (with second-order polynomial terms) and penalized ML estimation approaches (HeckP and M&R). The number of simulated datasets and the approximate percentage of selected observations are 250 and 50%, respectively. See Section 3 for further details.

Table 6
Description of the outcome and selection variables, and of the regressors.

Variable   Definition
lnmeddol   log of the medical expenses of the individual (outcome variable)
binexp     binary variable indicating whether the medical expenses are positive (selection variable)
hlthg      binary variable for good self-rated health (the baseline is excellent self-rated health)
hlthf      binary variable for fair self-rated health
hlthp      binary variable for poor self-rated health
inc        family income
fam        family size
educdec    education of household head in years
xage       age of the individual in years
female     binary variable for female individuals
child      binary variable for individuals younger than 18 years
fchild     binary variable for female individuals younger than 18 years
black      binary variable for black household heads

Table 7
Parametric estimates of the annual medical expense equation obtained by applying the naive, standard Heckman (with second-order polynomial terms), penalized Bayesian and penalized ML estimation approaches (naive, HeckP, W&K and M&R, respectively) on the RAND RHIE dataset described in Section 4. Within parentheses are 95% confidence and credible intervals. For M&R, intervals have been calculated using the result of Section 2.3.

Fig. 2. Smooth function estimates obtained by applying naive (gray lines), HeckP (gray dot-dashed lines), W&K (gray dotted lines) and M&R (black lines) on the RAND RHIE dataset described in Section 4. The black dashed lines represent 95% pointwise confidence intervals calculated from the M&R estimates. The 'rug plot' at the bottom of each graph shows the covariate values. To avoid clutter, credible intervals for W&K have not been reported. Due to the identifiability constraints, the estimated curves are centered around zero.