A penalized likelihood estimation approach to semiparametric sample selection binary response modeling

Sample selection models are employed when an outcome of interest is observed for a restricted non-randomly selected sample of the population. We consider the case in which the response is binary and continuous covariates have a nonlinear relationship to the outcome. We introduce two statistical methods for the estimation of two binary regression models involving semiparametric predictors in the presence of non-random sample selection. This is achieved using a multiple-stage procedure, and a newly developed simultaneous equation estimation scheme. Both approaches are based on the penalized likelihood estimation framework. The problems of identification and inference are also discussed. The empirical properties of the proposed approaches are studied through a simulation study. The methods are then illustrated using data from the American National Election Study where the aim is to quantify public support for school integration. If non-random sample selection is neglected then the predicted probability of giving, for instance, a supportive response may be biased, an issue that can be tackled using the proposed tools.


Introduction
Sample selection techniques are employed when observations are not from a random sample of the population. For instance, in public surveys it may be the case that some individuals choose not to answer some specific questions because they feel that their opinion might paint them in an unfavorable light. This leaves us with a self-selected sample. So if the interest is in quantifying the relationship between various demographic and socio-economic characteristics and an outcome variable in the population as a whole, then using the responding subsample is likely to produce biased estimates [16,9].
To fix ideas, let us consider a study of public opinion polls on school integration that uses data from an American survey. The main question was if respondents support government intervention to ensure that black and white children go to the same school. In particular, individuals were first asked if they had an opinion on the integration question (0 = no, 1 = yes) and then what that opinion was (0 = no integration, 1 = yes integration). Information on individual demographic and socio-economic characteristics was also recorded. If the respondents who opposed government involvement in school integration chose not to answer the question, because they felt their opinion might be perceived as socially unacceptable, then the sample of individuals who provided an opinion may have differed in systematic ways from the sample of non-respondents. To clarify this (often misunderstood) concept, let us characterize each individual by some observed and unobserved features or confounders. If the responding and nonresponding subsamples have similar characteristics, then the issue of non-random sample selection does not arise since the average (observed and unobserved) features of the responding sample are similar to those of the population. If the decision to answer is no longer random, because of differing characteristics between the responding and nonresponding individuals, then biased analyses are expected. When the relationship between the decision to respond and outcome is only through observables, it is possible to correct for non-random sample selection by controlling for these variables in the outcome equation. However, in the presence of unobservables influencing the decision to answer and the outcome, controlling only for observables is clearly insufficient. That is, if some individuals are part of the responding subsample because of their unobserved features, then regardless of whether observables and unobservables are correlated in the overall population they will be in the selected sample [9]. This means that ignoring the potential correlation between the unobserved factors influencing the decision to answer and the outcome can lead to inconsistent estimates of the covariate impacts in the outcome equation. For other examples of non-random sample selection, see [1,7,30].
Statistical methods correcting for non-random selection have been developed. Many of these concern models where the response variable is Gaussian [5,16,11,22,25,40]. There are also a number of works that go beyond Gaussian responses; these include models for skewed, count and ordinal data [35,4,36,29]. We consider the case in which the response is binary. The procedures currently available to fit a sample selection binary response model are those presented in [8,3,11]. These involve the (separate or simultaneous) estimation of two binary regression models for the selection and outcome equations. The outcome equation is used to examine the substantive question of interest, whereas the selection equation is used to detect non-random selection and hence obtain consistent estimates of the covariate effects in the outcome equation. One potential drawback to the application of these techniques is the lack of flexibility in handling the possible presence of nonlinear covariate-response relationships. That is, because the functional shape between predictors and outcome is rarely known a priori, imposing a parametric structure may prevent the researcher from recognizing a strong covariate effect or, more generally, revealing interesting relationships [34,42].
The contribution of this article is twofold, one methodological and the other practical. First, we extend the procedures discussed in [8,3,11] to incorporate semiparametric covariate effects. In particular, we present a multiple-stage estimation approach, and a penalized likelihood estimation framework for a simultaneous system of two binary equations. Both approaches allow for flexible functional dependence of the binary responses on continuous covariates. Second, we implement the methods discussed in this article in the R package SemiParBIVProbit [26]; this can be particularly attractive to practitioners who wish to fit such models. No other computational alternatives which consider a semiparametric sample selection binary response model are available in the literature. It may be argued that the model setup adopted here is fairly similar to that of [40] except that, in the current context, the outcome is binary. This suggests that the Bayesian estimation scheme introduced by [40] can be extended for fitting the model considered in this paper. We elected to follow a frequentist approach because it can especially appeal to researchers and practitioners already familiar with traditional frequentist techniques and has the advantage of being computationally fast. The empirical properties of the methods are studied through a simulation study, and the methods then illustrated using the abovementioned case study on public opinion polls on school integration.

Methods
In this section, we describe the model structure of a semiparametric binary response sample selection model, present two strategies for parameter estimation and discuss the problems of identification and inference.

The model
The model consists of a first selection equation and a second outcome equation determining the response. The selection equation, expressed using the latent variable representation, is given as where n denotes the sample size, and y * 1i is a latent continuous variable which is related to its observable counterpart y 1i through the rule 1(y * 1i > 0). The outcome equation is given as , and y 2i is missing when y 1i = 0. In (2.1), x + 1i = 1, x + 12i , . . . , x + 1P1i represents the i th row vector of x + 1 , the n×P 1 model matrix for any parametric model components (i.e. intercept, binary and categorical predictors), with corresponding parameter vector θ 1 . The f 1k1 (z 1k1i ) are unknown smooth functions of the K 1 continuous covariates z 1k1i . These components are represented using the regression spline approach (see next section). Each smooth term may be multiplied by some predictor, yielding a 'varying coefficients' model [15], and smooth functions of two covariates may also be considered [42, pp. 154-167]. Similarly in (2.2), x + 2i is the i th row vector of the n se × P 2 model matrix x + 2 , with coefficient vector θ 2 , the f 2k2 (z 2k2i ) are unknown smooth terms of the K 2 continuous regressors z 2k2i , and n se denotes the size of the selected sample. For identification purposes, smooth terms are subject to constraints such as i f vkv (z vkv i ) = 0, v = 1, 2, k v = 1, . . . , K v . As in [40], we make the assumption that unobserved confounders have a linear impact on the responses, i.e. (ε 1i , ε 2i ) ∼ N ([0, 0], [1, ρ, ρ, 1]), where ρ is the correlation coefficient and the error variances are normalized to unity since the parameters in the model can only be identified up to a scale coefficient.
It is important to stress that estimation of (2.2) alone when ρ = 0 will yield inconsistent parameter estimates. As explained in the previous section, intuitively, ignoring the correlation between the unobserved confounders influencing the decision to answer and the outcome will induce bias in the covariate impacts because of non-random sample selection on unobservables. This can be formally seen from the derivations in Section 2.2.
Going back to our example, y 1i and y 2i would correspond to the question of whether an individual had an opinion on the integration question and what that opinion was, respectively. Covariate vectors x + 1i and x + 2i would contain variables such as gender and region, and the f 1k1 (z 1k1i ) and f 2k2 (z 2k2i ) could be thought of as smooth nonlinear effects of covariates such as age and education in both the selection and outcome equations.

Smooth function representation
A popular and effective way of representing smooth functions of continuous covariates is the regression spline approach [10]. The basic idea is to approximate f k (z ki ), where subscript v has been dropped to avoid clutter, by a linear combination of known spline basis functions, b kj (z ki ), and regression parameters, where J k is the number of spline bases and hence regression coefficients used to represent f k , B k (z ki ) represents the i th row vector of dimension J k consisting of the basis functions evaluated at the observation z ki , i.e. B k (z ki ) = {b k1 (z ki ), b k2 (z ki ), . . . , b kJ (z ki )}, and β k is the corresponding parameter vector. Calculating B k (z ki ) for each i yields J k curves encompassing different degrees of complexity which multiplied by some real valued parameter vector β k and then summed give a curve estimate for f k (z k ). Basis functions are usually chosen to have convenient mathematical properties and good numerical stability. Possible choices are B-splines, cubic regression and thin plate regression splines (see [34] for a more detailed introduction). Based on the result above, equations (2.1) and (2.2) can written as and

Multiple-stage estimation approach
Non-random sample selection can be dealt with by using a device which [16] introduced for an analogous problem in binary-choice regression, and that [8,3] employed in the sample selection context for fully parametric probit models.
Here, we extend this device to incorporate semiparametric covariate effects. The population regression for (2.4) can be written as while the regression for the subsample of complete observations is where ϑ i = φ(η 1i )/Φ(η 1i ) (typically called inverse Mills ratio), and φ and Φ are the density and distribution functions of a standardized normal. Regression equation (2.5) can be therefore written as [16]. It is clear that the parameter estimates for the covariates in η 2i that are correlated with η 1i are inconsistent if ρ = 0. This can be thought of as arising from an ordinary specification error with the conditional mean (2.6) deleted as a covariate in the model. Including ϑ i as an explanatory variable, as in equation (2.7), would rectify this situation. In practice we do not know ϑ i , but it is possible to obtain a consistent estimate of it based on the estimated coefficients of selection equation (2.3). After dividing the components in the right hand side of (2.7) by τ i , because E( ε 2 2i |y * 1i > 0) = τ 2 i , and using the selected sample, we can obtain parameter estimates using where B * 2i includes the quantities corresponding to the spline bases for the smooth functions of the covariates z 2k2i rescaled by τ i , E(ε 2i |y * 1i > 0) = 0, E(ε 2 2i |y * 1i > 0) = 1, and ϑ i and τ i can be consistently estimated as described above.
The algorithm to fit model (2.8) can be summarised as follows: step 1 Fit a probit model for equation (2.3) to obtain consistent estimates of η 1i and hence ϑ i , for all i. step 2 Using the selected sample, obtain a consistent estimate of ρ by fitting a linear probability model for equation (2.7), where ϑ i is replaced with ϑ i for all i. step 3 Estimate τ i via 1 + ρ 2 ϑ i (− η 1i − ϑ i ), where ϑ i , η 1i and ρ are obtained in steps 1 and 2. step 4 Using the selected sample and after rescaling all components in the equation of interest by the τ i , fit a probit model for equation (2.8).
Remark 1. Equation (2.8) is fitted using probit regression even if the normality assumption ofε 2i , which is necessary for consistency of θ 2 and β 2 , is clearly not met. However, as shown by [8] in a fully parametric context, this approach can deliver estimates which are close, if not as good as, to those obtained using a consistent estimator such as maximum likelihood.
Remark 2. Standard errors of the parameter estimates in (2.8) are not realistic in that, for example, they do not account for that additional source of sampling variability due to the estimation of ϑ i and τ i . In addition, the method can produce an estimate for ρ which is not in the range [−1, 1].
Remark 3. The models used in the steps outlined above may be fitted using unpenalized parameter estimation procedures. However, because of the flexible model specification considered here, this is likely to result in smooth function estimates that are too wiggly to produce sensible results. This issue can be overcome by penalized estimation, where the objective function is augmented by a penalty term, such as k λ k f ′′ k (z k ) 2 dz k , measuring the (second-order, in this case) roughness of the smooth terms in the model. The λ k are smoothing parameters controlling the trade-off between fit and smoothness. Since regression splines are linear in their model parameters, such a penalty can be expressed as a quadratic form in the generic parameter vector β (containing the coefficients of all smooth terms in the model where the S k are positive semi-definite known square matrices. Depending on the model employed, parameters can be estimated by either minimization of a penalized least squares criterion or maximization of a penalized log-likelihood function. The λ k can be selected via a prediction error or likelihood criterion [34, Chapter 8].

Bivariate probit estimation approach
In the current sample selection context, the data identify the three possible events (y 1i = 1, y 2i = 1), (y 1i = 1, y 2i = 0) and (y 1i = 0), with probabilities where Φ 2 is the distribution function of a standardized bivariate normal with correlation ρ. The log-likelihood function is therefore . As pointed out in Remark 3 of the previous section, in a smoothing context it is necessary to penalize the regression spline coefficients to avoid exceedingly wiggly smooth function estimates. Hence, the model is fitted by maximization of the penalized log-likelihood Given values for the λ vkv , we seek to maximize (2.9). This is achieved by using a trust region algorithm [31,Section 4.2] which is based on where a is the iteration index and S * λ an overall block-diagonal penalty matrix made up of λ vkv S vkv and 0 components. The gradient vector g is defined by two subvectors g 1 = ∂ℓ(δ)/∂δ 1 and g 2 = ∂ℓ(δ)/∂δ 2 , and a scalar g 3 = ∂ℓ(δ)/∂ρ * , while the Fisher information matrix has a 3 × 3 matrix block structure with The use of a trust region algorithm proved to be faster and more reliable than the standard approaches adopted in the literature to estimate likelihood-based models, with occasional convergence failure for small values of n and n se . In (2.10), the smoothing parameters are fixed at some values. This is because joint estimation of δ and λ = (λ 1k1 , . . . , λ 1K1 , λ 2k2 , . . . , λ 2K2 ) via maximization of (2.9) would clearly lead to overfitting since the highest value for ℓ p (δ) would be obtained when λ = 0. Hence the need to estimate λ using an appropriate criterion.

Smoothing parameter selection
Smoothing parameter selection can be achieved by direct grid search optimization of a prediction error criterion, for example. However, if the model has more than one smooth term per equation, then this can become computationally burdensome, hence making the model building process difficult in most applied contexts. There are a number of techniques for automatic multiple smoothing parameter selection within the penalized likelihood framework. Without claim of exhaustiveness, these include the performance-oriented iteration method originally proposed by [13] and mixed model approach to penalized regression spline estimation [34]. The former applies the generalized cross validation or unbiased risk estimator [UBRE; 6] to each working linear model of the penalized iteratively re-weighted least squares (P-IRLS) scheme used to fit the model. The latter consists of viewing the λ vkv as variance components so that they can be estimated, e.g., by restricted maximum likelihood. Here, we adapt the approach by [13] to the current context. Given a parameter vector value for λ, iterative equation (2.10) can be written The superscript [a] has been suppressed from d i , z i , and W i , and is omitted from the quantities shown below, to avoid clutter.
Vector λ should be selected so that the estimated smooth functions are as close as possible to the true functions. In the current context, this is achieved using the approximate UBRE. Specifically, λ is the solution to the problem where the working linear model quantities are constructed for a given estimate of δ, n * = 3n, A λ = X(X T WX + S * λ ) −1 X T W is the hat matrix, and tr(A λ ) represents the estimated degrees of freedom of the penalized model. For each working linear model of a trust region iteration, V w u (λ) is minimized with respect to λ. The two steps, one for δ the other for λ, are iterated until convergence. This approach is implemented employing the approach by [41], which is based on Newton's method and can evaluate score (2.12) and their derivatives in a way that is both computationally efficient and stable. Generally speaking, this is achieved using √ WX = QR, obtained by pivoted QR decomposition, where Q and R are defined in the usual manner, and the singular value decomposition R L = UDV T , where L is any matrix square root of S * λ such that L T L = S * λ , the columns of U are those of an orthogonal matrix, V is an orthogonal matrix, and D is a diagonal matrix of singular values which are useful to detect numerical rank deficiency of the fitting problem. Based on this, evaluation of tr(A λ ) for new trial values of the smoothing parameters can be made relatively cheap, and the derivatives of V w u (λ) with respect to λ can be stably and efficiently evaluated. Note that minimization of the score is with respect to λ * = log(λ) since the smoothing parameters must be positive. See [41] for further details. Remark 1. In the context of simultaneous equation estimation methods, the use of the Fisher information matrix is recommended because the W i are positive-definite over a larger region of the parameter space as compared to those obtained by using the observed information. This is crucial given that √ W and W −1 (via z), obtained by eigen-decomposition, are needed in (2.12).

Remark 2.
Because W is a non-diagonal matrix of dimension n * × n * , computation can quickly become prohibitive, even for small sample sizes. To calculate W −1 d, √ Wz and √ WX so that the computational load and storage demand of the algorithm is kept as low as possible, the band structure of W is exploited. Hence, the working linear model in (2.12) is formed in O(n * (m + 2)) rather than O(n 2 * (m + 2)) operations, where m is the number of columns of X. Remark 3. As opposed to the multiple-stage approach discussed in Section 2.2, simultaneous estimation of all model parameters via the bivariate probit scheme introduced here does not rely on approximations and does not require the use of quantities estimated in preliminary steps. Therefore, the procedure can yield consistent estimates for δ. Moreover, correction of standard errors is not in principle required since no inverse Mills ratio is used and all parameters in δ are estimated jointly. This convenience comes at expense of computational cost and stability. However, these can be dealt with by using the approach described in last two sections, and supplying the multiple-stage estimates as starting values in the bivariate probit estimation scheme.

Identification
Under correct model specification, the parameters of the approach described in Section 2.2 are formally identified even if x + 1i , B 1i = x + 2i , B 2i . This is because of the nonlinearity of the inverse Mills ratio; see, e.g., [32]. However, in applications, this typically results in substantial collinearity between ϑ i and the other covariates in the outcome equation, especially when the variation in η 1i is such that the nonlinearity of the inverse Mills ratio does not play a major role. This collinearity can lead to large standard errors and instability in estimation. The parameters of the method introduced in Section 2.3 are also formally identified, but with the advantage of not having the limitations deriving from the use of the inverse Mills ratio. However, the likelihood functions of sample selection models may be affected by local maxima especially in the case of highly correlated error terms [20].
In practice, empirical identification is achieved if the exclusion restriction (ER) on the covariates in the two equations holds [37]. That is, the regressors in the selection equation should contain at least one or more regressors not included in the outcome equation. Such predictors can be regarded as instrumental variables, which, in this context, induce variation in the selection equation, do not directly affect the outcome, and are independent of (ε 1i , ε 2i ) given the covariates [23]. In the data analysis reported in Section 4, ER was achieved by including in the selection equation the binary variable indicating whether the respondent was persuaded to participate in the survey.

Inference
The methods described in Section 2 rely on penalized estimation. Within this framework, inferential theory is not standard because of the presence of smoothing penalties which undermines the usefulness of classic frequentist results for practical modeling; see, e.g., [42]. Solutions to this problem have been proposed. In this section, we show how to construct pointwise confidence intervals for the components of a semiparametric sample selection model adapting some of the results available in the literature.
The well known Bayesian 'confidence' intervals originally proposed by [39] in the univariate spline model context are typically used to represent the uncertainty of smooth functions [14,34,42]. An interesting feature of these intervals is that they have close to nominal 'across-the-function' frequentist coverage probabilities [27]. To better understand this point, let us consider a generic smooth component f (z i ). Intervals can be constructed seeking some constants C i and A, such that where ACP denotes average coverage probability, I is an indicator function, α is a constant between 0 and 1, and q α/2 is the α/2 critical point from a standard normal distribution.
At this point, it is necessary to find the distribution of B + V and values for the C i and A so that requirement (2.13) is met. As shown in [27], in the context of non-Gaussian response models involving several smooth components, such a requirement is approximately met when confidence intervals for the smooth components are constructed using δ|y∽N ( δ, V δ ), (2.14) where, for the approach described in Section 2.3, y refers to the response vectors, δ is the estimate of δ and V δ = (I + S * λ ) −1 is the inverse of the penalized Fisher information matrix obtained at convergence of the algorithm used to fit the model. Note that the same distributional result can be used for the models described in Section 2.2. Given (2.14), confidence intervals for linear and nonlinear functions of the model parameters can be easily obtained. For any parametric model components, using (2.14) is equivalent to using classic likelihood results since such model terms are not penalized. It is important to stress that there is no contradiction in fitting the sample selection model via penalized log-likelihood estimation and then constructing confidence intervals using a Bayesian result, and such an approach has been employed many times in the literature; see, e.g., [14,23,42].
Result (2.14) should produce intervals with good coverage probabilities for the model components when using the method described in Section 2.3. However, this is not likely to be true for the approach detailed in Section 2.2, for the reasons given in Remark 2. As a solution, posterior simulation can be employed [42]. Specifically, we propose to adjust the intervals as follows: • Let the parameter vector and covariance Bayesian matrix estimated in step 1 be δ 1 and V δ1 . Draw N s random vectors from N ( δ 1 , V δ1 ) and then calculate the corresponding N s values η * 1i and ϑ * i , for all i. • Fit N s step 2 models to obtain δ ols 2,1 , . . . , δ ols 2,Ns and V ols δ2,1 , . . . , V ols δ2,N s . For each parameter vector and covariance matrix combination, draw N s random vectors from the corresponding Gaussian distribution and then calculate the N 2 s values ρ * .
• Calculate the N 2 s values τ * i , for all i, using ϑ * i , η * 1i and ρ * . • Fit N 2 s step 4 models, using each of the ϑ * i and τ * i combination, to obtain δ 2,1 , . . . , δ 2,N 2 s and V δ2,1 , . . . , V δ 2,N 2 s . For each parameter vector and covariance matrix combination, draw N d random vectors from the corresponding Gaussian distribution so to obtain N 2 s × N d random draws from which approximate intervals for the component functions of model (2.8) can be constructed.
This procedure can account for the extra source of variability introduced via the quantities calculated in steps 1-3 of the multiple-stage estimation approach. As in [24], simulation experience suggested that small values for N s and N d , say 20 and 100, will be tolerable in practice.

Simulation study
To gain insights into the effectiveness of the estimation approaches detailed in the previous sections, a Monte Carlo simulation study was conducted. All computations were performed in the R environment [33] using the package SemiParBIVProbit which implements the ideas discussed in this article [26].
The smooth components were represented using penalized thin plate regression splines with basis dimensions equal to 10 and penalties based on secondorder derivatives [42]. In cross-sectional studies, 10 bases typically suffice to represent reasonably well smooth functions, although sensitivity analysis using fewer or more spline bases is advisable in applied work. Models were also fitted neglecting non-random sample selection, i.e. simply fitting equation (2.4) on the selected sample (henceforth, this will be referred to as naive approach); since this can not account for sample selection, biased parameter estimates are expected. We decided to report the naive results because they represent a benchmark for evaluating more realistic models as well as highlight the substantial detrimental effects that the neglect of non-random sample selection may have on the parameter estimates.

Results
In this section, we only show a subset of results; these are representative of all empirical findings. Since the parameters of the selection equation are not in principle affected by bias, we focus on the estimation results for the outcome equation. Figures 2, 3 and 4 present the boxplots of the estimates for θ 22 , ρ, and the empirical root mean squared errors (RMSE) of f 21 (z 1 ) when employing the naive, multiple-stage and bivariate probit estimation approaches, ER holds and approximately 50% of the total number of observations are available to fit the outcome equation. Figure 5 shows the estimated smooth functions for f 21 (z 1 ) averaged over the simulation runs. As in [40], based on the estimates for 200 fixed covariate values, RMSE( f 21 ) was calculated as The results can be summarized as follows: • Figure 2 shows that at all sample sizes and ρ = (0.5, 0.9) the sample selection estimators outperform the naive approach in terms of bias, and that bivariate probit is much more accurate than multiple-stage. However, for n = (500, 1000) the introduced estimators are less precise than naive, especially when the sample selection issue is negligible. • Figure 3 suggests overall that the multiple-stage estimates for ρ are systematically biased as compared to those of bivariate probit. When ρ = 0.9 the bias in the multiple-stage estimates worsens as n increases. This is    most likely due to the violation of the normality assumption (see Remark 1 of Section 2.2). • Figure 4 indicates that for ρ = 0.1 the performance of the naive estimator is superior at all sample sizes, and is comparable to that of the sample selection approaches for ρ = 0.5 and n = 500. In all the other cases, multiple-stage and bivariate probit outperform naive, with bivariate probit being the best for ρ = 0.9. The average fits, in Figure 5, show that the sample selection estimators yield better curve estimates except for ρ = 0.1 where naive recovers the underlying function fairly well.
In summary, in the presence of non-random sample selection and non-linear covariate effects, the sample selection estimators are less biased as compared to the naive estimator with the bivariate probit approach being the best. When the sample selection issue is negligible and the sample size small, the introduced estimators are more variable than naive. Models were also fitted using the data generating process described in the previous section but with no ER, i.e. without including z 2i in the selection equation. The results from this scenario (not reported here and available upon request) showed that the estimates of all methods are biased. This confirmed that empirical identification is achieved when the ER is present (see Section 2.4).
Average coverage probabilities of the 95% confidence intervals for θ 22 and f 21 (z 1 ), constructed as described in Section 2.5, were also calculated. N s and N d were set to 20 and 100. For θ 22 , the coverage rates of the multiple-stage approach were below the nominal level, with values going from 0.92 to 0.79 as ρ increases. For the bivariate probit method, despite its good accuracy in estimating θ 22 , rates were low with values ranging from 0.73 to 0.88; here, nonparametric bootstrap percentile intervals based on 199 replications appeared to be effective, with rates in the interval (0.91, 0.97). As for f 21 (z 1 ), nominal coverages were satisfactory for both methods with values in the range (0.92, 0.97). For the multiple-stage approach, confidence intervals calculated employing the correction procedure described in Section 2.5 offered marginal improvements.
Coverage rates for θ 22 were not satisfactory. For the multiple-stage estimator, this was most likely due to the bias in the estimates highlighted in Figure 2. For bivariate probit, the issue seems to lie in the information matrix which underestimates the variability of the parametric components. Limited simulation evidence suggests that using the observed (rather than Fisher) information matrix in (2.10) can yield improved coverage rates for parametric terms. However, as pointed out in Remark 1 of Section 2.3.1, the use of the Fisher information is crucial to be able to carry out the smoothing parameter selection step. Here, further research is needed to exploit the properties of both information matrices and avoid the use of computationally intensive bootstrap procedures.

Application
In this section, we illustrate the proposed methods using data on public opinion polls on school integration. As argued in [38], in certain situations, opinion polls may poorly reflect collective public sentiment because some individuals choose not to respond to some specific questions as they feel that their opinion may be perceived as socially unacceptable. When some individuals are not willing to show their views, polls measuring collective opinion on sensitive topics typically provide misleading estimates of preferences in the population as a whole. A number of studies have recognized the presence of this phenomenon which can be problematic for policy information; see, e.g., [12]. A survey of public opinion polls on school integration conducted in the USA is one such study [2].

American national election study
We use data from the American National Election Survey (ANES) conducted in 1992 1 . This study is part of a time-series collection of national surveys fielded continuously since 1952. The election studies were designed to present data on Americans' social backgrounds, enduring political predispositions, social and political values, perceptions and evaluations of groups and candidates, opinions on questions of public policy, and participation in political life. The 1992 ANES study entailed both a pre-election interview and a post-election reinterview [28].
As mentioned in the introductory section, the main question was whether respondents support government intervention to ensure that black and white children go to the same school. About 700 individuals were first asked if they had an opinion on the integration question (0 = no, 1 = yes) and then what that opinion was (0 = no integration, 1 = yes integration). This gave respondents an opportunity to opt out of the question answering process at an earlier stage. 64.57% of the individuals chose to answer the integration question. Among these, the proportion of 'yes' answers was 46.43%. The dataset also included information on individual demographic and socio-economic characteristics. The variables considered were age (in years), educ (number of years of education), sex (0 = female, 1 = male), race (0 = black, 1 = white), reg (0 = North-Central, 1 = North-East, 2 = South, 3 = West), child (number of children in the household), discpol (0 = never discuss politics, 1 = discuss politics), moralcons (1 = support for moral conservatism, 2 = no support for moral conservatism, 3 = neither), and perslett which was a binary variable indicating whether the interviewer attempted to convert a respondent who initially refused to participate in the survey.
We analyzed the ANES dataset using the naive, multiple-stage and bivariate probit estimation approaches with the same model fitting settings as those described in the simulation study section. The outcome equation included the variables sex, race, reg, child, moralcons and discpol as parametric components, and smooth functions of age and educ. The selection equation included the same variables plus perslett. The inclusion of perslett in the selection equation was used as ER on the ground that it may be regarded as a good predictor of the propensity to answer and is independent of the outcome. child was included as a parametric component because it did not have enough unique covariate values to justify the use of a smooth function. Table 1 and Figure 6 report the parametric and smooth function estimates for the outcome equation and, for completeness, also those for the selection equation, when applying the three approaches on the ANES dataset.

Results and interpretation
In the selection equation, the parameter of perslett, obtained using the sample selection estimators, is statistically significant at the 5% level, hence supporting its use as ER. Although there are some differences in the significance of the parametric terms of the sample selection models, the magnitude and sign of the coefficients are similar. For example, the negative parameter estimate for race.white is consistent with the interpretation that the propensity Table 1 Parametric estimates obtained applying the naive, multiple-stage and bivariate probit estimation approaches on the ANES dataset described in Section 4.1. Within parentheses are 95% confidence intervals calculated as described in Section 2.5 with Ns = 20 and N d = 100 for multiple-stage, and nonparametric bootstrap based on 199 replications for bivariate probit. Note that results from the naive approach concern only the outcome equation  Bayesian 'confidence' intervals calculated from the bivariate probit estimates. The 'rug plot', at the bottom of each graph, shows the covariate values. Note that to avoid clutter corrected confidence intervals for the multiple-stage approach have not been reported, results from the naive approach concern only the outcome equation, and due to the identifiability constraints the estimated curves are centered around zero.
that a white individual answers an integration question is lower than that of a non-white individual. Also, the fact that the interviewer attempted to convert a respondent who initially refused to participate in the survey has a positive effect on the probability of answering the integration question. As for the smooth components, the effect of age is linear for both sample selection models. The function estimates of educ are linear and nonlinear for bivariate probit and multiple-stage, respectively, and the bivariate probit intervals do not contain the multiple-stage curve for a part of the covariate value range. These findings support the presence of linear parametric effects for all terms in the selection equation of bivariate probit, and of linear and nonlinear covariate effects for multiple-stage. The reason for the difference in some of the estimates between the two sample selection estimators can perhaps be ascribed to sampling variability; at small sample sizes, the two methods are likely to be affected differently by some bias.
In the outcome equation, the probability that a white respondent supports integration is significantly lower as compared to that of a non-white; this conclusion is common to all approaches. The effects of age and educ show different degrees of nonlinearity across the three methods. The pointwise confidence intervals of bivariate probit contain the zero line, suggesting that neither age nor educ have (non-linear or linear) effects. These results suggest that information in the data is too weak to clearly support the need for inclusion of smooth terms.
The estimates of ρ, which are important to ascertain the presence of bias induced by non-random sample selection, are high and, for bivariate probit, statistically significant. This means that the process by which individuals decide to answer the integration question is connected to the process by which they decide what the answer is. This could not have been detected using the naive approach. The positive sign ofρ indicates that the unobserved factors which lead individuals to take part in the survey also lead them to take a more supportive stance on the integration issue.
A comparison among the parametric estimates of the outcome equation obtained from the three approaches indicates that, while none of them change sign, the magnitude of some parameters is altered by the correction for non-random sample selection. Specifically, the movement of the estimates of (Intercept) is of interest. The naive estimate is more than double that of multiple-stage and four times that of bivariate probit. This means that once selection effects are accounted for, respondents are more likely to oppose school integration than the naive estimate suggests. This result is consistent with that of [2] who also found that correcting for non-random sample selection decreases the probability of supporting government efforts to integrate schools.
To gauge the aggregate effects of the sample selection issue in the questionanswering process, we estimated the mean predicted probabilities of giving a supportive response under the three methods. 95% confidence intervals were conveniently obtained via posterior simulation using the results in Section 2.5. The results were 0.46 (0.43, 0.50), 0.41 (0.35, 0.46) and 0.31 (0.28, 0.35) for the naive, multiple-stage and bivariate probit approaches. So, predicted support for school integration is significantly lower when sample selection is accounted for, hence indicating that expressed opinion on the school integration question is a poor barometer of underlying support for integrationist policy. Based on our simulation evidence and the remarks given in Sections 2.2 and 2.3.1, the estimation results obtained using the bivariate probit approach may perhaps be regarded as the most accurate. The findings of this section may offer further empirical insights into how the school integration issue could be handled. For instance, support for school integration may be increased by investing in campaigns for social sensibilization.
The goodness of the results presented in this section relies especially on whether the assumption of normality is met. For the simultaneous equation estimation approach, a possibility would be to employ a score test of bivariate normality whose density of the errors under the alternative hypothesis is based on a type AA bivariate Gram Charlier series with 9 additional parameters [19]. However, it is not clear whether this test can be extended to the penalized framework proposed in this article.

Discussion
We introduced two statistical methods for the (separate or simultaneous) estimation of two binary regression models involving semiparametric predictors in the presence of non-random sample selection. The problems of identification and inference have also been discussed. The approaches have been illustrated using data on public opinion polls on school integration; predicted support for school integration calculated using the proposed tools is lower as compared to what a naive estimate would suggest.
The results of our simulation study showed that the sample selection estimators are less biased as compared to the naive estimator and that the bivariate probit approach can produce consistent parameter estimates. However, when the sample selection issue is negligible and the sample size small, the introduced approaches are more variable than naive; here, stability of sample selection estimators may be tenuous.
Because maximum likelihood estimation schemes are typically sensitive to model error misspecification, extensions of our proposals allowing for different joint distributions of the model errors seem feasible adopting copula functions. This approach has already been adopted in the context of non-random sample selection [18,21,35,43, and references therein]. An alternative solution could be based on a nonparametric distribution function framework; see, e.g., [17]. To accommodate more complex data structures arising, for instance, in longitudinal studies, future research will also focus on extending the methods presented in this article to allow for random effects in the linear predictors. Finally, it would be interesting to determine whether a generalized least squares approach can be exploited to improve the efficiency of the sample selection estimators.