Efficient parameter estimation in regression with missing responses

We discuss efficient estimation in regression models that are de- fined by a finite-dimensional parametric constraint. This includes a variety of regression models, in particular the basic nonlinear regression model and quasi-likelihood regression. We are interested in the case where responses are missing at random. This is a popular research topic and various methods have been proposed in the literature. However, many of them are compli- cated and are not shown to be efficient. The method presented here is, in contrast, very simple – we use an estimating equation that does not impute missing responses – and we also prove that it is efficient if an appropriate weight matrix is selected. Finally, we show that this weight matrix can be


Introduction
In this article we consider a general class of regression models that can be specified as a finite-dimensional parametric constraint, E{a ϑ (X, Y )|X} = 0, a ϑ = (a ϑ1 , . . . , a ϑk ) ⊤ , (1.1) with parameter ϑ belonging to the interior of some compact parameter space Θ ⊂ R p . This means in particular that the parameter ϑ is defined as a solution of a system of equations. Since there can be more than one solution of (1.1), or no solution at all, we will assume in the following that a solution ϑ exists and that it is unique. The variables X and Y are multi-dimensional, and we allow that Y is not always observed. In this setting it is possible to derive efficient estimators of ϑ as solutions of an appropriately chosen set of estimating equations, which is what we pursue in this article. The general model (1.1) covers the regression model given by with E(ε|X) = 0, which we call a "nonlinear regression model"; see below for more explanations. But model (1.1) also covers more complicated models, such as the quasi-likelihood model which is specified by the two-dimensional conditional constraint , (1.2) and the quantile regression model, which is defined by a ϑ (X, Y ) = p − 1{Y − r ϑ (X) < 0}.
In these examples Y is a one-dimensional response variable and X a vector of covariates. Let us first take a closer look at the simple but important case of a nonlinear regression model, which includes the linear regression model as a special case with r ϑ (X) = ϑ ⊤ X. We should emphasize that we are considering models that are solely specified by a conditional constraint of the form (1.1). This means that for the nonlinear regression model we do not assume a parametric form for the distribution of the covariate vector X or the error variable ε = Y − r ϑ (X). We also do not assume that X and ε are independent -we only assume that the errors are conditionally centered given the covariates, E(ε|X) = 0. Since this and the parametric form of the regression function is all the information given, the nonlinear regression model can be described by the simple one-dimensional constraint E{Y − r ϑ (X)|X} = 0, (1.3) which is indeed a special case of (1.1). It is also worth noting that it is not necessary here to introduce an error variable ε. Efficient estimation of ϑ in the complete data case has been studied by various authors. We refer first of all to Chapter 4 of Tsiatis (2006 [19]), who studied the nonlinear model (1.3) in detail, including the derivation of the efficient score function, and the adaptive estimation of the weight in the estimating equation. Müller (2007 [8]) considers weighted least squares estimators in possibly misspecified regression models and derives as a special case an efficient estimator for ϑ in the regression model above. The characterization sketched in that paper is analogous to that obtained in Müller and Wefelmeyer (2002 [11]) for autoregressive models satisfying a parametric constraint. A (different) derivation of the asymptotic variance bound is sketched in Chamberlain (1987 [3]), with generalizations in Chamberlain (1992 [4]). Two review articles are Newey (1990 [12], 1993 [13]).
Estimating ϑ efficiently is quite complicated in the classical regression setting, which assumes that covariates and errors are independent. The independence assumption is a structural assumption about the model, and must be incorporated by constructing an efficient estimator. Efficient estimation of the parameter in the classical setting with a linear regression function has been studied by Bickel (1982 [1]), Koul and Susarla (1983 [7]), and Schick (1987 [17], 1993 [18]). Schick (1993 [18]) also considers general semiparametric regression models with independent covariates and errors. He uses a preliminary estimator of ϑ and an estimator of the efficient influence function to construct an efficient estimator for ϑ. A further approach, which requires weaker conditions, is in Forrester et al. (2003 [6]).
All the above articles study estimation of ϑ when no data are missing. We are interested in the case when responses are possibly missing, in particular when responses are missing at random (MAR). This means that we only observe Y in those cases where some indicator δ equals one, and the indicator δ is conditionally independent of Y given X. This assumption is useful when information in the form of covariate data is available to explain the missingness. In that case we can estimate the propensity score π(X) = P (δ = 1|X) and the missingness mechanism is called ignorable.
A considerable amount of work has been done on regression models with responses missing at random, but little has been done on efficient estimation. Robins et al. (1994 [15]), for example, assume a parametric model for π(X) (or that π(X) is known), and estimate the regression parameters efficiently by solving an inverse probability weighted estimating equation. Also in Robins et al. (1995 [16]) a parametric model for π(X) is assumed, which is conceptually quite different from a nonparametric model for π(X), which will be assumed in this paper. The authors allow the response and the covariates to be varying over time. On the other hand, they do not establish the efficiency of their estimator.
Efficient estimation of ϑ in model (1.3) above, with MAR responses and with independence of covariates and errors, is studied in Müller (2009 [9]). There the influence function of an efficient estimator for ϑ is derived and the construction of an efficient estimator is discussed. Perhaps surprisingly, this can be done in the same way as in the complete data case: by simply omitting the covariates associated with missing responses and by using only the data (X, Y ) that are complete. We show in this paper that the same applies for our regression model where the independence assumption is not imposed: ϑ can be estimated efficiently by using a weighted least squares estimator which uses only the data pairs (X i , Y i ) for which response values are at hand. More precisely, we will show that the solutionθ of the estimating equation with respect to θ is efficient. Hereṙ θ is the vector of partial derivatives with respect to the components of θ, θ is an arbitrary value in the parameter space Θ, and σ 2 (X) is the conditional error variance given the covariates, The conditional variance function depends on θ, σ 2 (X) = σ 2 θ (X), but since we do not model it parametrically we prefer to write it without the subscript θ. This will also be helpful to distinguish the conditional variance in the nonlinear regression model from the conditional variance in more complex models such as the quasi-likelihood model, where we also assume a parametric model v θ (X) for the variance function, σ 2 (X) = v θ (X). Note that the estimating equation above is called undetermined since σ 2 (X) is unknown. Estimation of σ 2 is addressed in Section 3.
To our knowledge, there is no published work where efficiency of the above estimator is proved or where an efficient estimator is provided for the nonlinear regression model (1.3) with MAR responses. We will therefore pay particular attention to this model. This is also motivated by the fact that model (1.3) is a fundamental model and therefore important. Although Tsiatis (2006 [19]) studied model (1.3) in great detail for the case when all data are completely observed, and although one can argue that the consistency of his estimation method should remain valid with MAR responses, it is not at all clear whether the efficiency of his method can be carried over to the MAR case. This needs careful investigation.
The efficient estimator for ϑ in model (1.3) can also be used as a point of reference for related approaches in more complex models with MAR responses. Wang and Sun (2007 [21]), for example, compare three estimators for the regression function in a partly linear model, which coincides with model (1.3) if we assume that the unknown smooth part of the regression function is zero and if r θ is linear, r θ (X) = θ ⊤ X. Another example is Wang et al. (2010 [22]), who consider a single index model with regression function g(θ ⊤ X) which would be our model (1.3) with a linear regression function if g were known to be the identity.
The conditional constraint (1.1) implies that the unconditional constraint E{a ϑ (X, Y )} is zero, which is the model considered by Zhou et al. (2008 [23]) and by Wang and Chen (2009 [20]). In both articles the proposed estimators are similar to our estimator in that they are solutions of an estimating equation -but more complex. In contrast to our approach, the 'missing' terms of the estimating equation are replaced by nonparametric estimators of the conditional expectation E{a ϑ (X, Y )|X} (which estimates zero if our model is in fact true). The estimation of this conditional expectation requires the careful selection of a smoothing parameter. These procedures are therefore more complicated than our method. A general efficiency statement is not established, but possible variance reductions are discussed. Our method, in contrast, is very simple since it exploits the conditional constraint -which suggests a weighted estimating equation. Since our model class is characterized by a conditional constraint we cover many basic regression models, including the nonlinear regression model. Above all we show that our method is efficient if we work with an optimal weight matrix. Estimating these optimal weights may require the use of smoothing techniques, but choosing the smoothness parameter is less important here since only consistency (without a specific rate) is needed (see Section 3).
The paper is organized as follows. In the next section we define our estimator of ϑ and show its asymptotic normality. Section 3 discusses a number of special cases of the general theory and provides a small simulation study. The efficiency of our method is established in Section 4. Finally, Section 5 contains some concluding remarks and a discussion of open questions.

Estimation
The motivation for our estimating equation comes from the nonlinear regression example. A simple estimator for this model (modified for the missing response setting) is the least squares estimator, which is the minimizer of n i=1 δ i {Y i − r θ (X i )} 2 with respect to θ. It is obtained by solving the weighted estimating equation with respect to θ, where the weight vectorṙ θ (·) ⊤ is the p × 1 vector of partial derivatives of r θ (·) with respect to θ. Since the nonlinear regression model has a simple structure -in particular there is no form for the variance assumed -it is intuitively clear that more weight should be put on data points (X i , Y i ) when the variance is small and less weight when the variance is large. It appears to make sense to improve the usual least squares estimator by choosing weights Both approaches incorporate the gradienṫ r θ and can therefore be regarded as weighted least squares estimators, i.e. as solutions of Our estimator for the parameter vector ϑ in the conditionally constrained model (1.1) is defined analogously as a solutionθ of where W θ is a p × k weight matrix. Sometimes the system of equations (2.2) does not have a solution. This is often the case for quantile regression or any other model leading to non-smooth criterion functions. In that case we replace (2.2) by the minimizer of with respect to θ, where · is the Euclidean norm. In the nonlinear regression model (1.3) W θ is just a vector, and in the quasi-likelihood model (1.2) W θ is a p × 2 matrix. The estimating equation is unbiased for any choice of W θ (X) since it is easy to verify that E{δW ϑ (X)a ϑ (X, Y )} = 0: using the MAR assumption on the responses, which postulates that the indicators and the responses are conditionally independent given the covariates, we obtain Note that we explicitly use that E{a ϑ (X, Y )|X} = 0, which is the only model structure that we assume. This suggests that the above approach could yield an efficient estimator. In particular, it becomes evident that the preconditions for obtaining an appealing (simple and possibly efficient) estimator are ideal if a constrained model of the form (1.1) can be assumed, and if the missingness of the responses can be explained by covariates. Whether a solutionθ of the above equation is efficient or not will depend on the choice of W θ . Our approach to find the optimal weight matrix was to derive the efficient influence function first (see Section 4 on efficiency), which is Here we only assume that the expectation is differentiable with respect to θ. In many models we can even assume that a θ is differentiable. If this is the case we will write briefly For reasons of clarity we set We have shown that L(ϑ) = 0 and therefore estimate ϑ by the solutionθ of the corresponding estimating equation It should be pointed out that the resulting estimatorθ only uses completely observed pairs (X i , Y i ) -in particular it discards information that is given in the form of (observed) covariates X i . It remains to be shown that the influence function ofθ is indeed of the required form, i.e. we have to derive the asymptotic expansion of the estimator. Since our estimator is the solution of an estimating equation, this is a standard result for M -estimators, and rests on a Taylor expansion. Here we provide the statement under fairly weak conditions, using Theorem 3.3 in Pakes and Pollard (1989 [14]). The conditions in this theorem include the case where the criterion function L n (θ) is not smooth. It is also interesting to note that, regardless of the dimension of the original set of defining equations (namely of E{a ϑ (X, Y )|X} = 0), the dimension of the final estimating function L n (θ) always equals p -the dimension of θ.
the matrix I is of full rank and, for almost every x, the matrix E{a ϑ (X, Y )a ϑ (X, Y ) ⊤ |X = x} is also of full rank. (iv) For all j = 1, . . . , p, δℓ θ,j (X, Y ) is locally uniformly L 2 -continuous with respect to θ in the sense that for all θ 1 ∈ Θ, for all α = o(1), and for some constants s j ∈ (0, 1], K j > 0. Then (a) the estimatorθ has the stochastic expansion and is asymptotically normally distributed with covariance matrix (b) the estimatorθ is efficient for estimating ϑ, provided the joint distribution of (X, Y ) satisfies the mild regularity conditions stated in Section 4.
Part (b) is important: it shows that efficiency can be obtained without using complicated procedures to replace the missing responses with estimators. Our method, which completely discards the missing observations, is easy to compute and is efficient if the weight matrix is suitably chosen. Remark 1. Condition (i) can be easily shown using standard results (see e.g. Theorem 3.1 or Corollary 3.2 in Pakes and Pollard, 1989 [14]), whereas condition (ii) is needed for identifiability reasons. The differentiability condition in (iii) is imposed on the function L(θ), which will in many cases be smooth even if the function ℓ θ is not smooth in θ. Finally, note that condition (iv) also allows for discontinuous functions ℓ θ such as sign and indicator functions. In the smooth case, (iv) can be replaced by the following more direct condition: (iv)' For all j = 1, . . . , p, the function (δ, x, y) → δℓ θ,j (x, y) is Hölder continuous with respect to θ in the sense that Remark 2. By part (b) of Theorem 2.1, an efficient estimatorθ of ϑ satisfies (2.6), i.e. it has influence function I −1 δℓ ϑ (X, Y ). The classical approach to constructing an efficient estimator is to start with an initial inefficient estimator of ϑ and to improve it by adding an estimator of the influence function, with appropriate estimators for I and ℓ (see, for example, Bickel et al., 1998 [2]). This construction does not, however, take advantage of the special feature of our model and is not recommended: our method only requires solving (2.4), or, more generally, (2.5). In particular we do not need to estimate I.
Proof of Theorem 2.1. We have to verify that the stochastic expansion (2.6) in part (a) holds true. The proof of (b) is in Section 4 where we show that I −1 δℓ ϑ (X, Y ) is the efficient influence function for estimating ϑ (see the characterization at the end of Section 4). In Section 4 we work with some additional notation for the (rather technical) derivation, to keep the presentation clear. For example we write Q x for the conditional expectation given X = x. It is easy to verify that ℓ θ (x, We prove (2.6) by showing that the conditions of Theorem 3.3 in Pakes and Pollard (1989 [14]) are satisfied. Here the criterion function is δℓ θ (X, Y ). It can quickly be verified that these conditions hold true, provided that: (1) Our matrix I and Pakes and Pollard's matrix −Γ are the same, where
Let us begin with the matrix I on the left-hand side of (2.7). For reasons of clarity we use some notation from Section 4 and setQ x (a ϑ ) = ∂/(∂θ)Q x (a θ )| θ=ϑ . This lets us avoid writing ∂/(∂θ)Q x (a ϑ ) for the gradient which could be confusing since the conditional constraint Q x (a ϑ ) is zero. We have Here we have used which follows from the MAR assumption. Handling the matrix on the righthand side of (2.7) is notationally cumbersome. We therefore consider just a single entry of the matrix. Write W θ,i for the i-th row of W θ . Again using the MAR assumption, and the fact that E{a ϑ (X, Y )|X} = 0, the (i, j)-th entry computes as follows: Comparing this with the above calculation for I it is now apparent that the entries of I and −Γ are the same. Hence, (2.7) is satisfied. It remains to prove condition (2) above. This follows from Theorem 3 in Chen et al. (2003 [5]) (discarding the nonparametric nuisance function h which is present in that theorem).

Estimation of the weight matrix
As pointed out in the introduction, the estimating equation will in general be undetermined (and therefore of no use for applications) since the weights depend on unknown features of the distribution, for example on the conditional variance σ 2 (X) in nonlinear regression. This is not a problem: the unknown quantities can usually be estimated consistently by some simple nonparametric approach. This will not change the asymptotic variance of the resulting estimator. In particular, it will still be efficient. The estimator of W θ (·) does not need to converge to W θ (·) at a certain specific rate: simple (uniform) consistency is sufficient.
To show that the asymptotic variance does not change, one can use the results from Chen et al. (2003 [5]). They give high-level conditions under which a parameter estimator defined by the solution of a set of equations depending on a nonparametric estimator is consistent and asymptotically normal. These results extend Pakes and Pollard's (1989 [14]) article to the case of semiparametric estimators, and cover as a special case our model when the unknown weight matrix is replaced by a nonparametric estimator. Consider Theorem 2 in Chen et al. (2003 [5]), which states the asymptotic normality ofθ. Most of the highlevel conditions under which this result is valid are straightforward to verify. Two points, however, need closer attention: (1) we need to calculate the asymptotic variance ofθ, in order to confirm that it is not affected by using an estimatorŴ θ for the weight matrix W θ ; (2) we need to show that the required conditions onŴ θ are satisfied.
Let us address (1) first. According to Theorem 2 in Chen et al. (2003 [5]) the formula for the asymptotic variance ofθ depends on the matrices Γ 1 and V 1 given in conditions (2.2) and (2.6) of that paper. The matrix Γ 1 is constant and therefore not affected by using estimated weights. The matrix V 1 must be inspected more carefully: it is the asymptotic variance of an expression which involves the Gâteaux derivative of M (θ, W θ ) := E{δW θ (X)a θ (X, Y )} in the directionŴ θ − W θ (withŴ θ (X) a consistent estimator of W θ (X)), evaluated at θ = ϑ. The Gâteaux derivative is defined by Note that the expected value is calculated in accordance with the definition of the vector M (θ, W θ ), namely with respect to (δ, X, Y ), i.e. the stochastic nature ofŴ θ is not taken into account. Writing the last expectation in the above display as an iterated expectation (conditional on X) yields Γ 2 (ϑ, W ϑ )(Ŵ ϑ − W ϑ ) = 0.
In other words, the contribution to the asymptotic variance which comes from using estimated weights is zero. The matrix V 1 is the same as in the case with known weights. For (2), note that the main requirement onŴ θ is condition (2.4) in Chen et al. (2003 [5]), which requires that sup θ∈Θ sup x |Ŵ θ (x) − W θ (x)| = o p (n −1/4 ). However, a closer look at the proof of Theorem 2 in that paper reveals that the rate o p (n −1/4 ) can be weakened to o p (1) if M (θ, W θ ) depends on W θ in a linear way (or, equivalently, if Γ 2 (θ, W θ )(Ŵ θ − W θ ) = M (θ,Ŵ θ ) − M (θ, W θ )), which is the case here. Hence, all we need is an estimatorŴ θ (x) that is uniformly consistent (in θ and x).
This sketches the main steps of the proof that the estimation of the weight matrix does not impair the efficiency property ofθ.

Conditional versus unconditional constraints
Our focus here is on inference for parameters defined via conditional equations. A related topic is inference for parameters defined via unconditional constraints of the form E{a ϑ (X, Y )} = 0, see e.g. Zhou et al. (2008 [23]) and Wang and Chen (2009 [20]) for references on this type of models. Let us explain the relationship between the two classes of models. The conditional model E{a ϑ (X, Y )|X} = 0 a.s. is equivalent to E{W (X)a ϑ (X, Y )} = 0 for all possible functions W (·). Indeed, the former equation clearly implies the latter one. On the other hand, the latter set of equations yields that E[E{a ϑ (X, Y )|X} 2 ] = 0 by choosing W (X) = E{a ϑ (X, Y )|X}. This implies that E{a ϑ (X, Y )|X} = 0 a.s.. This means that the conditional constraint is equivalent to an infinite collection of unconditional constraints, one of which (namely the one corresponding to W = W θ given in (2.3)) is efficient. So the approach with the conditional constraint makes it possible to select the weight matrix that leads to an efficient estimator. On the other hand, an unconditional constraint corresponds to one single equation, or equivalently one single weight matrix.

Illustration: Linear and nonlinear regression
The estimating equation for the nonlinear regression model (which includes linear regression as a special case) is given in the introduction (1.4). Let us check that it is indeed a special case of the general estimating equation (2.4) Here the vector a ϑ is one-dimensional, a θ (X, Y ) = Y − r θ (X), which yields that the matrix E{a θ (X, Y )a θ (X, Y ) ⊤ |X} is one-dimensional as well, E{a θ (X, Y ) 2 |X} = σ 2 (X), where σ 2 (X) is the conditional variance of Y given X. Assuming that r θ is differentiable in θ we also have that E{ȧ θ (X, Y )|X} = −ṙ θ (X). This yields as postulated. A simple consistent nonparametric estimator of σ 2 (x) iŝ (which can be regarded as a ratio of Nadaraya-Watson estimators), where Here d is the dimension of X andθ 0 is some consistent estimator for ϑ, e.g. the ordinary least squares estimator (OLS) which uses weights W θ (x) =ṙ θ (x) ⊤ . In the general case a consistent estimator of the optimal weight matrix W θ may similarly involve a preliminary consistent estimatorθ 0 of ϑ. Such an estimator can be obtained as a solution of equation (2.2), i.e. of n i=1 δ i W θ (X i )a θ (X i , Y i ) = 0, now with an arbitrary (feasible) p × k weight matrix W θ (which does not need to depend on θ) such that the system of equations has a unique solutionθ 0 (see the discussion in Section 2).
As an illustration of the method we performed a small simulation study using R and compared three different approaches: the efficient estimator, the OLS which solves (2.1), and a weighted least squares estimator that uses the propensity score, W θ (X) = W (X) = π(X) −1 = E(δ|X) −1 . The latter choice of weights is suitable for the larger model defined by the unconditional constraint E{a ϑ (X, Y )} = 0, since the corresponding estimating equation is unbiased in that model, For the simulations we chose an increasing propensity score π(x) = 1/(1 + e −x ). The covariate X is generated from a uniform distribution on (−1, 1), and the error variable is of the form ε = σ(X)Z, where Z is standard normal and independent of X. We studied a linear regression function, r ϑ (X) = ϑX, and a nonlinear regression function, r ϑ (X) = cos(ϑX). In both cases ϑ = 2. The conditional variance σ 2 (x) is linear or parabolic, and estimated byσ 2 (x) from equation (3.1), withθ 0 the OLS estimator. We studied five bandwidths b between 0.1 and 0.5, and an automatically selected bandwidth b = b cv using the cross-validation method for fitting a smooth curve into the completely observed Table 1 lists the simulated mean squared errors based on 5,000 repetitions for the case of a linear regression function. The results for the cosine function are given in Table 2. The propensity score π(X) increases from 0.27 to 0.73 on (−1, 1) so that around 50% of the responses are missing. Hence, if n = 50, we are essentially only working with about 25 data points and the R routine "nls" (nonlinear least squares), which we used for the cosine function, does not always converge (the simulations 'crashed'). For this reason Table 2 only includes the results for n = 100 and n = 200. In the linear case (Table 1) the estimator can be calculated with an explicit formula and the simulations ran without any problems, allowing us to include results for n = 50 as well. The first two results columns give the mean squared errors (MSE) for the OLS and the propensity score weighted estimator (PS). For simplicity, PS uses the true π(X). The third column shows the results for the efficient estimator that uses the true conditional variance σ 2 (x) with (a) σ 2 (x) = 0.6 − 0.5x in the upper panel, and (b) σ 2 (x) = (x − 0.4) 2 + 0.1 in the lower panel. The six columns on the right-hand side refer to the efficient estimator based on the kernel estimator (3.1) for σ 2 (x), for five different fixed bandwidths b, and for b = bcv obtained by cross-validation. The entries are simulated mean squared errors of various estimators of ϑ as in Table 1, now with a nonlinear regression function, r ϑ (X) = cos(ϑX).
We observe that the efficient estimator that uses the variance estimatorσ 2 (x) always performs better than both the OLS and the propensity score weighted estimator (PS), for all choices of b, and also for the automatic bandwidth b cv selected by cross-validation. Note that our estimatorσ 2 (x) uses a normal kernel k, without adjusting for boundary bias, which is probably one reason why the estimator that uses the true variance function is better when n is small, e.g. n = 50 in Table 1. A reasonable next step, with view towards small sample performance, would be to develop a better estimator of the conditional variance. If the variance function is constant, ordinary least squares and the efficient estimator are asymptotically equivalent and we recommend the OLS estimator since it is easier to use. The same applies if the variance function does not show much variation and if the sample size is small so that the estimated variance function is nearly constant.

Further examples
Quasi-likelihood model Now consider the quasi-likelihood model where we assume parametric models for both the regression function and the conditional variance function. Here it is also straightforward to calculate ℓ θ (x, y): assuming that both r θ and v θ are differentiable in θ we obtain similar to that in the previous example.
Multi-response model In the two examples above the response variable was assumed to be univariate. Our method also applies if the responses are multivariate, i.e. if we assume a multi-response model. Again it would be straightforward to specify the estimating equation.

Quantile regression
The situation is different if a θ (x, y) involves indicators and cannot be differentiated with respect to θ. An important class of applications are quantile regression models. Suppose that the conditional p-th quantile of Y given X is specified by a parametric model r ϑ (X). This can be expressed as a conditional constraint, namely as A simple calculation shows that E{a θ (X, Y ) 2 |X} = p 2 + (1 − 2p)F Y |X {r θ (X)} for any θ ∈ Θ, where F Y |X {r θ (X)} = P {Y − r θ (X) < 0|X}. Thus the weights of the estimating equation reduce to The conditional probability F Y |X {r θ (X)} must be estimated with a smooth estimator to ensure that the partial derivatives can be calculated. One option is to use a kernel smoother of the form where, as before, k is a kernel function and b and h are appropriate smoothing parameters, and where K is a smooth distribution function, e.g. the cumulative integral of a suitable kernel density function.

Efficiency
In order to derive the canonical gradient of ϑ (which characterizes the efficient influence function) one can build on results from Müller (2007 [8]) on estimating ϑ when all data are observed. We will also rely on results by Müller et al. (2006 [10]) on efficient estimation of expectations Eh(X, Y ) in regression models (not covering our model) with responses missing at random, that is, with observations (X, δY, δ) as here. The characterization of an efficient estimatorθ of ϑ is given at the end of this section. We begin with the characterization of the influence function of an arbitrary differentiable functional κ of the joint distribution of (X, Y ) which is derived in that article. The joint distribution P (dx, dy, dz) of the observations (X, δY, δ) can be written as Here M (dx) is the marginal distribution of X, Q(x, dy) is the conditional distribution of Y given X = x, and π(x) = P (δ = 1|X = x). Here M nu and Q nv are Hellinger differentiable perturbations of M and Q, The perturbed distributions M nu and Q nv must both be probability distributions, i.e. integrate to one, which explains the form of U and V . Write T for the tangent space relevant for estimating κ (i.e. for functionals of M and Q), where the orthogonality follows from the missing at random assumption. It contains the canonical gradient, which is defined as a gradient that is also an element of the tangent space, i.e. it is of the form γ * (X, δY, δ) = u * (X) + δv * (X, Y ) with the terms of the sum being projections onto the tangent space. As a gradient of κ, γ * must satisfy the above characterization which now becomes For a full specification of the tangent space see Müller et al. (2006 [10]). That larger tangent space has to be considered when the goal is to estimate functionals of the full joint distribution, i.e. of functionals that also involve the conditional distribution π(x) of the indicator variable δ given x.
After these general considerations we now also take the structure of our model (1.1) into account, which is defined by a parametric constraint, The perturbed distribution must satisfy a perturbed constraint, Q xnv (a ϑnt ) = 0 for some ϑ nt close to ϑ, say ϑ nt = ϑ + n −1/2 t with t in R p . Using Q x (a ϑ ) = 0 and Q x (v) = 0 we obtain This leads to a constraint Q x (va ϑ ) = −Q x (a ϑ )t on v in V , which can be written in the form Q x (va ϑ ) = −Q x (ȧ ϑ )t if a θ is differentiable in θ. For fixed t ∈ R p we write H t for the solution space of this equation, and H * for the union of all affine spaces H t , t ∈ R p . In order to determine v * we find it convenient to go further and decompose H * into the space H 0 of solutions of the homogeneous equation, H 0 = {v ∈ V : Q x (va ϑ ) = 0}, and into the solution space of the inhomogeneous equation given above. This space can be written as a linear span, analogously to Müller (2007 [8]). The idea is to solve the equation for the standard basis vectors t = e j , j = 1, . . . , p. Call the solutions ℓ j . Then the solution space of the inhomogeneous equation is the linear span [ℓ] of the solutions ℓ 1 , . . . , ℓ p , where ℓ = (ℓ 1 , . . . , ℓ p ) ⊤ has the form y). Simple calculations show that ℓ indeed satisfies Q x (a ϑ ℓ ⊤ ) = −Q x (a ϑ ) and that ℓ is orthogonal to H 0 , i.e. H * = H 0 ⊕ [ℓ]. The tangent space of the constrained model is now specified, From now on we focus on estimating ϑ and write it as a functional of P by setting κ(P ) = ϑ if Q x (a ϑ ) = 0. The left-hand side of characterization (4.1) of the canonical gradient now involves t ∈ R p and simplifies to n 1/2 (ϑ nt − ϑ) = t. The canonical gradient γ * (X, δY, δ) = u * (X)+δv * (X, Y ) is therefore determined by Setting v = 0 and t = 0 we see that u * must be zero. Further we know that v * ∈ H 0 ⊕ [ℓ] where [ℓ] comes from ϑ being unknown. We can therefore assume that v * is of the form Jℓ, where J is a p × p matrix to be determined and where ℓ functions as the score function. This yields Here we have used the MAR assumption and the conditional constraint Q x (va ϑ ) = −Q x (a ϑ )t on v in V . Inserting this in the above gives This equals t if J = I −1 with Our canonical gradient for estimating ϑ is determined: it is γ * (X, δY, δ) = δv * (X, Y ) = δI −1 ℓ(X, Y ).
Characterization of the efficient estimator By the characterization of efficient estimators, an estimatorθ is efficient for ϑ if it is asymptotically linear with influence function equal to the canonical gradient. The efficient influence function for estimating ϑ is I −1 δℓ(X, Y ) in our model (1.1). An estimatorθ is therefore efficient for ϑ if it satisfies

Concluding remarks and future research
We have derived asymptotically efficient estimators for the parameter vector ϑ for the large class of regression models that can be specified by a conditional constraint of the form E{a ϑ (X, Y )|X} = 0. We focus on the situation when responses are missing at random, but this also covers the case when no data are missing, namely when π(X) = P (δ = 1|X) = 1 and all indicators are equal to one. The proposed method is not only efficient, it is also simple: we estimate ϑ by solving a weighted estimating equation which only incorporates completely observed cases (X, Y ), and discard those cases that contain missing values. Although this requires estimating the weights, we only need consistency (without a rate). It is certainly remarkable that an efficient estimator may be based only on the observations for which both the regressors and responses are available. However, the final efficient estimator does not necessarily have to be of this type: a consistent estimator of the weight matrix can be obtained by discarding the data for which the response is missing, but other consistent estimators of this weight matrix are allowed as well. For instance, one could use imputation of the missing responses if one is in favor of the imputation principle, although we do not recommend doing so because the estimators can become quite involved, as explained in the introduction.
There are several open questions for future research. For example, our class of models does not include regression models where the regression function itself contains a nonparametric part, such as partially linear models which are defined by the conditional constraint E{Y −ϑ ⊤ X 1 +η(X 2 )|X 1 , X 2 } = 0. This constraint additionally involves the infinite-dimensional nuisance parameter η.
It would also be interesting to see whether the methodology developed in this paper can be extended to other missingness schemes. Clearly, the results apply to the MCAR (missing completely at random) mechanism, i.e. when π(·) ≡ π is constant. On the other hand, when the missingness is not at random (NMAR), the present methodology cannot be applied: the equality E{δa ϑ (X, Y )|X} = E(δ|X)E{a ϑ (X, Y )|X}, which relies on the MAR assumption, is crucial for the development of an efficient (optimally weighted) estimator since it guarantees unbiasedness of the estimating equation (2.2). Of interest is also the situation when both covariates and responses are missing, or when only covariates are missing with the missingness explained by the response variable.
So far we have only studied estimation of the parameter vector, but it would also be interesting to derive estimators for expectations Eh(X, Y ), with the mean response EY as an important special case. Although the mean response has been well studied, it is not yet clear how to estimate expectations in our model efficiently. To our knowledge, this has not even been considered in the nonlinear regression model which is specified by E{a ϑ (X, Y )|X} = E{Y − r ϑ (X)|X} = 0. In this model we expect that, similar to the model with independent covariates and errors (Müller, 2009 [9]), the estimator n −1 n i=1 rθ(X i ), now with our efficient estimator from equation (2.5) plugged in, will be efficient for EY . This is in agreement with the linear regression model, r ϑ (X) = ϑ ⊤ X.
Here n −1 n i=1 rθ(X i ) = n −1 n i=1θ ⊤ X i =θ ⊤X , which is a smooth function of two efficient estimators and therefore efficient. Since efficient estimators are asymptotically normally distributed with the asymptotic variance specified by the length of the canonical gradient, the construction of (approximative) normal confidence intervals for moments of the response variable, and for more general expectations, would be straightforward. In applications it is often necessary to work with more complex models. We would expect interesting and useful results in the field of generalized linear models, for certain change point models, for models with censored/truncated data (in addition to missing data), and for models used in case-control studies in the field of biostatistics. For each of these models one would need to specify the function a θ (X, Y ), from which the formula of the weight matrix and its estimator can be obtained.