Weighted least squares estimation with missing responses: An empirical likelihood approach

Abstract: A heteroscedastic linear regression model is considered where responses are allowed to be missing at random. An estimator is constructed that matches the performance of the weighted least squares estimator without knowledge of the conditional variance function. This is usually done by constructing an estimator of the variance function. Our estimator is instead a maximum empirical likelihood estimator based on an increasing number of estimated constraints, and it avoids estimating the variance function.


Introduction
We consider a heteroscedastic linear regression model in which the response variable Y is linked to a (one-dimensional) covariate X by the formula

Y = θ^⊤ m(X) + ε,

where θ is an unknown vector in R^d, m is a known measurable function from R to R^d, the error variable ε is conditionally centered, i.e., E[ε|X] = 0, and its conditional variance σ²(X) = E[ε²|X] is bounded and bounded away from zero. We assume that the matrix

M = E[m(X) m(X)^⊤]

is well defined and positive definite. This identifies θ as M^{-1} E[m(X)Y] and implies that E[‖m(X)‖²] is finite. The model contains as special cases,

1. regression through the origin: m(X) = X;
2. simple linear regression: m(X) = (1, X)^⊤.

We allow the responses to be missing at random. Thus the observations are independent copies (X_1, δ_1 Y_1, δ_1), . . . , (X_n, δ_n Y_n, δ_n) of (X, δY, δ), where the indicator δ equals 1 if Y is observed and 0 otherwise, and where P(δ = 1 | X, Y) = P(δ = 1 | X) = π(X).

For a positive weight function w, the weighted least squares estimator minimizes the weighted sum of squares

Σ_{j=1}^n δ_j w(X_j) (Y_j − ϑ^⊤ m(X_j))²,

and n^{1/2} times its deviation from θ is asymptotically normal with mean vector 0 and dispersion matrix

D_w = W_w^{-1} E[δ w²(X) σ²(X) m(X) m(X)^⊤] W_w^{-1}  with  W_w = E[δ w(X) m(X) m(X)^⊤].

Set H = E[δ m(X) m(X)^⊤ / σ²(X)], ξ = W_w^{-1} δ w(X) σ(X) m(X) and ζ = H^{-1} δ m(X)/σ(X). Then the difference

D_w − H^{-1} = E[(ξ − ζ)(ξ − ζ)^⊤]

is nonnegative definite and equals the zero matrix for ξ = ζ. This shows that the asymptotic dispersion matrix D_w is minimal for the choice w = 1/σ². Thus the (asymptotically) best estimator θ* in the class of weighted least squares estimators minimizes the weighted sum of squares

Q(ϑ) = Σ_{j=1}^n δ_j (Y_j − ϑ^⊤ m(X_j))² / σ²(X_j)

and satisfies the stochastic expansion

θ* = θ + (1/n) Σ_{j=1}^n H^{-1} δ_j ε_j m(X_j) / σ²(X_j) + o_P(n^{-1/2}).

This implies that n^{1/2}(θ* − θ) is asymptotically normal with mean vector 0 and dispersion matrix H^{-1}. Since σ² is unknown, the best weighted least squares estimator θ* is not available. For this reason we call θ* the oracle weighted least squares estimator.

Since σ² is unknown, a natural approach is to minimize, instead of Q(ϑ), the weighted sum of squares

Q̂(ϑ) = Σ_{j=1}^n δ_j (Y_j − ϑ^⊤ m(X_j))² / σ̂²(X_j),

in which an estimator σ̂² replaces the unknown σ². What is the behavior of such an estimator? Carroll (1982 [1]) was the first to consider this problem. He treated the case of simple linear regression with the responses fully observed, i.e., the case with d = 2, m(X) = (1, X)^⊤, and δ = 1. He used a regression kernel estimator based on the squared residuals from a least squares fit and showed that the resulting estimator also satisfies the above stochastic expansion and hence is asymptotically equivalent to the oracle weighted least squares estimator θ* in the case δ = 1. He proved this result under the assumptions that the covariate X has a density which has compact support and is positive and twice continuously differentiable on its support, that σ² is continuously differentiable on this support, and that ε has a finite sixth moment. Similar results were obtained by Müller and Stadtmüller (1987 [7]), Robinson (1987 [15]) and Schick (1987 [16]) for different estimators of σ² and under weaker conditions, all for the case δ = 1. By the transfer principle for asymptotically linear estimators of Koul, Müller and Schick (2012 [5]), these results immediately carry over to the present case with responses missing at random and yield that a minimizer θ̂ of ϑ ↦ Q̂(ϑ) satisfies the expansion

θ̂ = θ + (1/n) Σ_{j=1}^n H^{-1} δ_j ε_j m(X_j) / σ²(X_j) + o_P(n^{-1/2}).   (1.1)

Müller and Van Keilegom (2012 [8]) have demonstrated this expansion in the present setting by a direct argument using a kernel estimator of σ². Their results show that an estimator θ̂ that satisfies (1.1) is even semiparametrically efficient in the sense of being a least dispersed regular estimator. This was already known in the case without missing responses, see Chamberlain (1987 [2]).

The goal of this paper is to show that one can construct an estimator that is asymptotically equivalent to the oracle weighted least squares estimator without constructing an estimator of the variance function σ². Indeed, such an estimator can be constructed as a maximum empirical likelihood estimator based on the empirical likelihood introduced by Peng and Schick (2012a [12]), modified to allow for missing responses.
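To fix ideas before defining the empirical likelihood, here is a minimal R sketch (ours, not from the paper; it borrows design (a) from the simulation study below, and the function name sigma2 is our own) contrasting the complete-case ordinary least squares estimator with the oracle weighted least squares estimator in the special case m(X) = X:

    # Complete-case OLS versus the oracle WLSE that weights by 1/sigma^2(X).
    set.seed(1)
    n <- 200
    x <- runif(n, -1, 1)                     # covariate
    sigma2 <- function(x) 0.6 - 0.5 * x      # true conditional variance
    y <- 2 * x + sqrt(sigma2(x)) * rnorm(n)  # Y = theta m(X) + eps, theta = 2, m(X) = X
    delta <- rbinom(n, 1, plogis(x))         # missing at random: pi(x) = 1/(1 + exp(-x))
    cc <- delta == 1                         # complete cases

    olse   <- sum(x[cc] * y[cc]) / sum(x[cc]^2)
    oracle <- sum(x[cc] * y[cc] / sigma2(x[cc])) / sum(x[cc]^2 / sigma2(x[cc]))
    c(OLSE = olse, oracle.WLSE = oracle)

Averaging squared errors over many such replications reproduces the efficiency ordering just described; the point of this paper is to match the oracle without knowing σ².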
Let ϕ_0, ϕ_1, . . . denote the trigonometric basis of L_2(U), with U the uniform distribution on [0, 1], defined by ϕ_0(x) = 1 and

ϕ_i(x) = 2^{1/2} cos(iπx), x ∈ [0, 1], i = 1, 2, . . . .

The modified version of the empirical likelihood of Peng and Schick (2012a [12]) is

R_n(ϑ) = sup{ Π_{j=1}^n nπ_j : π_1 ≥ 0, . . . , π_n ≥ 0, Σ_{j=1}^n π_j = 1, Σ_{j=1}^n π_j δ_j (Y_j − ϑ^⊤ m(X_j)) v_n(Ĝ(X_j)) = 0 },

where v_n = (ϕ_0, ϕ_1, . . . , ϕ_{r_n})^⊤ is the vector consisting of the first 1 + r_n basis functions, r_n ≥ d is an integer tending to infinity slowly with the sample size n, and Ĝ is the empirical distribution function based on the covariates with observed responses only, i.e.,

Ĝ(x) = Σ_{j=1}^n δ_j 1[X_j ≤ x] / Σ_{j=1}^n δ_j.

Note that Ĝ is an estimator of the conditional distribution function G_1 of X given δ = 1. The constraints in the above empirical likelihood try to capture the model assumption E[ε|X] = 0, which is equivalent to the linear constraints E[δεa(X)] = 0, a ∈ L_2(G_1). The latter is equivalent to the countably many constraints E[δεa_i(X)] = 0, i = 0, 1, 2, . . . , with a_0, a_1, . . . an orthonormal basis of L_2(G_1). If G_1 is continuous, such a basis is given by {a_i = ϕ_i ∘ G_1 : i = 0, 1, 2, . . . }. The above empirical likelihood uses the empirical versions of the first 1 + r_n linear constraints, with the unknown G_1 replaced by the estimator Ĝ. By our assumption on π, the continuity of G_1 is equivalent to that of the distribution function G of X.
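In code, the estimated constraints are straightforward to assemble. The following R sketch (our own helper names, with m(X) = X for simplicity) builds the matrix whose rows are the vectors (Y_j − ϑ m(X_j)) v_n(Ĝ(X_j)) over the complete cases:

    # Trigonometric basis v_n(u) = (1, sqrt(2) cos(pi u), ..., sqrt(2) cos(r pi u)).
    trig_basis <- function(u, r) {
      outer(u, 0:r, function(u, i) ifelse(i == 0, 1, sqrt(2) * cos(i * pi * u)))
    }

    # One row per complete case; 1 + r columns, one per estimated constraint.
    constraint_matrix <- function(x, y, delta, theta, r, basis = trig_basis) {
      cc   <- delta == 1
      Ghat <- ecdf(x[cc])             # complete-case empirical distribution function
      eps  <- y[cc] - theta * x[cc]   # residuals at theta (here m(X) = X)
      eps * basis(Ghat(x[cc]), r)     # (Y_j - theta X_j) v_n(Ghat(X_j))
    }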
To be precise, our estimator θ̂ is a guided maximum empirical likelihood estimator, defined as a maximizer of the restriction of R_n to the random ball of radius C(log n/n)^{1/2}, for some constant C, centered at the least squares estimator θ̂_L:

θ̂ = arg max { R_n(ϑ) : ‖ϑ − θ̂_L‖ ≤ C (log n/n)^{1/2} }.

Guided maximum empirical likelihood estimation was introduced and studied by Peng and Schick (2012b [13]). We shall prove the following result.
Theorem 1. Suppose the distribution function G of X is continuous, the error variable ε has a finite fourth moment, and r_n satisfies r_n⁵ log n = o(n). Then the guided maximum empirical likelihood estimator θ̂ satisfies the expansion (1.1) and hence is asymptotically efficient.
As demonstrated in Peng and Schick (2012b [13]), this result follows if one shows that the local log-empirical likelihood ratio satisfies the uniform expansion

sup_{‖t‖ ≤ 2C log^{1/2} n} | log R_n(θ + n^{-1/2} t) − β_n − t^⊤ Γ_n + (1/2) t^⊤ H t | = o_P(1),   (1.2)

where Γ_n = n^{-1/2} Σ_{j=1}^n δ_j ε_j m(X_j)/σ²(X_j) and β_n is a random term that does not depend on t. Indeed, up to β_n and the uniform o_P(1) term, the localized criterion is the quadratic t ↦ t^⊤ Γ_n − (1/2) t^⊤ H t, which is maximized at t = H^{-1} Γ_n, so that n^{1/2}(θ̂ − θ) = H^{-1} Γ_n + o_P(1). As Γ_n is asymptotically normal with mean vector zero and dispersion matrix H, we obtain the desired result (1.1) as in Peng and Schick (2012b [13]). Thus Theorem 1 is a consequence of the following result.
Theorem 2. Under the assumptions of Theorem 1, the uniform expansion (1.2) holds for every C.
The empirical likelihood was introduced by Owen (1988 [9], 1990 [10]) to construct nonparametric confidence regions and tests, see also Owen (2001 [11]). Its scope has recently been extended. Hjort, McKeague and Van Keilegom (2009 [4]) treat constraints with nuisance parameters and also investigate the case when the number of constraints goes to infinity, but without nuisance parameters. The latter case was also considered by Chen, Peng and Qin (2009 [3]). Peng and Schick (2012a [12]) allow for nuisance parameters and for the number of constraints to go to infinity. Maximum empirical likelihood estimation was studied by Qin and Lawless (1994 [14]) and extended by Peng and Schick (2012b [13]) to allow for irregular constraints and nuisance parameters.
It follows from Owen's work that R_n(ϑ) equals

Π_{j=1}^n (1 + ζ(ϑ)^⊤ δ_j W_j(ϑ))^{-1}, with W_j(ϑ) = (Y_j − ϑ^⊤ m(X_j)) v_n(Ĝ(X_j)),

if there is a random vector ζ(ϑ) such that

Σ_{j=1}^n δ_j W_j(ϑ) / (1 + ζ(ϑ)^⊤ δ_j W_j(ϑ)) = 0

holds, and equals zero otherwise. Moreover, such a random vector ζ(ϑ) exists precisely on the event where the convex hull of the random vectors δ_1 W_1(ϑ), . . . , δ_n W_n(ϑ) contains the origin as an interior point. For a subset S of {1, . . . , n}, let S_n denote the event that δ_j equals 1 for j in S and 0 for j not in S. On the event S_n, R_n(ϑ) equals

Π_{j∈S} (1 + ζ(ϑ)^⊤ W_j(ϑ))^{-1}

if the convex hull of W_j(ϑ), j ∈ S, contains the origin as an interior point, and equals zero otherwise. Thus, on S_n, the empirical likelihood R_n(ϑ) equals

sup{ Π_{j∈S} card(S) π_j : π_j ≥ 0, j ∈ S, Σ_{j∈S} π_j = 1, Σ_{j∈S} π_j W_j(ϑ) = 0 }.

On the event S_n, we also have Ĝ(x) = Σ_{j∈S} 1[X_j ≤ x]/card(S). This shows that the present empirical likelihood is the complete case version (as defined in Koul, Müller and Schick (2012 [5])) of the empirical likelihood in Peng and Schick (2012a [12]). Thus it can be calculated using the same program as in the case δ = 1, but using only the fully observed data, i.e., the pairs (X_j, Y_j) with δ_j = 1.
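In code, Owen's dual turns the evaluation of R_n(ϑ) on the complete cases into a low-dimensional concave maximization. A minimal R sketch (our own implementation choices, reusing the constraint_matrix helper above; a production version would add step halving and more careful convergence checks):

    # log R_n(theta) via Owen's dual on the complete cases: zeta solves
    # sum_j W_j / (1 + zeta' W_j) = 0, the stationarity condition of the
    # concave criterion sum_j log(1 + zeta' W_j).
    log_Rn <- function(x, y, delta, theta, r) {
      W    <- constraint_matrix(x, y, delta, theta, r)
      zeta <- rep(0, ncol(W))
      for (it in 1:50) {                        # Newton iterations
        a <- c(1 + W %*% zeta)
        if (any(a < 1 / nrow(W))) return(-Inf)  # origin not interior: R_n(theta) = 0
        grad <- colSums(W / a)                  # gradient of sum(log(a))
        step <- solve(crossprod(W / a), grad)   # Newton step for the concave criterion
        zeta <- zeta + step
        if (sum(step^2) < 1e-12) break
      }
      -sum(log(c(1 + W %*% zeta)))              # log R_n = -sum log(1 + zeta' W_j)
    }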
Remark 2. One can show that the conclusions of our theorems remain true if we replace Ĝ by the empirical distribution function based on all the covariates. However, simulations not reported here show that our estimator behaves better in moderate sample sizes with the present choice of Ĝ.
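For completeness, the guided maximization itself can be sketched in a few lines of R (our own function names; one-dimensional θ as in the simulation study below, with log_Rn the evaluator sketched above):

    # Maximize log R_n over the ball of radius C * sqrt(log(n)/n) centered at
    # the complete-case least squares estimate.
    guided_mele <- function(x, y, delta, r, C) {
      cc      <- delta == 1
      theta_L <- sum(x[cc] * y[cc]) / sum(x[cc]^2)   # least squares guide (m(X) = X)
      radius  <- C * sqrt(log(length(x)) / length(x))
      optimize(function(th) log_Rn(x, y, delta, th, r),
               interval = c(theta_L - radius, theta_L + radius),
               maximum = TRUE)$maximum
    }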
The proofs use only certain properties of the vector v_n of basis functions: bounds (C1), which are required to hold for all x and y in [0, 1] and all unit vectors u in R^{r_n+1}, and an approximation property (C2).

Thus any other choice of v_n with these properties can be used. One possible choice is v_n = (ψ_{r_n,0}, . . . , ψ_{r_n,r_n})^⊤, where

ψ_{r,i}(x) = r^{1/2} (1 − |rx − i|)^+, x ∈ [0, 1], i = 0, . . . , r.

For this choice, (C1) holds with c_0 = 1, c_1 = 2, c_2 = 1/6 and c_3 = 1. For example, to obtain the last inequality in (C1), we observe that ∫ψ²_{r,i} dU equals 1/3 for i = 0, r and 2/3 for i = 1, . . . , r − 1, and ∫ψ_{r,i}ψ_{r,j} dU equals 1/6 for |i − j| = 1 and 0 for |i − j| > 1. From this we conclude the identity

∫ (u^⊤ v_n)² dU = (1/6) Σ_{i=1}^{r_n} (u_{i−1} + u_i)² + (1/6)(u_0² + u_{r_n}²) + (1/3) Σ_{i=1}^{r_n−1} u_i²

for any vector u = (u_0, . . . , u_{r_n})^⊤ in R^{r_n+1}, and see that ∫(u^⊤v_n)² dU is bounded from above by |u|² and from below by |u|²/6. Note that the functions ψ_{r,0}, . . . , ψ_{r,r} form a basis for the set of linear splines with knots at the equidistant points 0, 1/r, . . . , 1. We obtain (C2) because the continuous functions are dense in L_2(U) and because, for every continuous function g on [0, 1], the linear spline interpolating g at these knots is of the form u^⊤v_n and approximates g uniformly within the modulus of continuity of g at scale 1/r.

As an illustration of our method we performed a small simulation study using R and compared several estimators: the OLSE, the oracle WLSE, and the GMELE for the choices r_n = 2, 3, 4, which we denote by GS(2), GS(3) and GS(4). We report the results for v_n = (ψ_{r_n,0}, . . . , ψ_{r_n,r_n})^⊤, as they are slightly better than those for the choice v_n = (ϕ_0, . . . , ϕ_{r_n})^⊤. Following Müller and Van Keilegom (2012 [8]), we took X uniformly distributed on (−1, 1), m(X) = X, θ = 2, π(X) = 1/(1 + exp(−X)), and ε = σ(X)Z, with Z standard normal and independent of X. In addition to the choices (a) σ²(X) = .6 − .5X and (b) σ²(X) = (X − .4)² + .1 used in Müller and Van Keilegom (2012 [8]), we also looked at (c) σ²(X) = exp(−2X). We took the constant C in the definition of the GMELE to be the product of (n/log n)^{1/2}/(N/log N)^{1/2}, with N = Σ_{j=1}^n δ_j, and an estimator of the asymptotic standard deviation of the OLSE.

Table 1 reports the simulated mean squared errors for the sample sizes n = 50, n = 100 and n = 200, based on 10000 simulations. We observe that our proposed estimator performs better than the OLSE in all cases considered. This is also true for the estimator with v_n = (ϕ_0, . . . , ϕ_{r_n})^⊤. The performance of our estimator comes closer to that of the oracle WLSE as the sample size increases. As expected, the performance is better for smaller r_n if the sample size is small. Our results for the first two cases are similar to the ones reported by Müller and Van Keilegom (2012 [8]). Case (c) shows that the improvements over the OLSE provided by the oracle WLSE and our estimator can be quite dramatic.
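The hat-function form of ψ_{r,i} given above can be checked numerically against the quoted integrals. A short R sketch (ours):

    # psi_{r,i}(x) = sqrt(r) * (1 - |r x - i|)_+ : scaled hat functions.
    spline_basis <- function(u, r) {
      outer(u, 0:r, function(u, i) sqrt(r) * pmax(0, 1 - abs(r * u - i)))
    }

    # Numerical Gram matrix over U = uniform[0, 1] for r = 4: diagonal entries
    # approach 1/3 (i = 0, r) and 2/3 (interior); off-diagonal 1/6 for |i - j| = 1.
    u <- seq(0, 1, length.out = 100000)
    round(crossprod(spline_basis(u, 4)) / length(u), 2)

Passing basis = spline_basis to the constraint_matrix helper sketched in the introduction switches the GMELE sketch from the trigonometric to the spline constraints.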

Proof of Theorem 2
Let G_1 denote the conditional distribution function of X given δ = 1. It has density π/E[δ] with respect to G and is hence continuous. Since Y is missing at random, the conditional distribution of Y given X and δ = 1 equals the conditional distribution of Y given X. Thus we have

E[ε | X, δ = 1] = E[ε | X] = 0 and E[ε² | X, δ = 1] = σ²(X).

We also set ε_j = Y_j − θ^⊤ m(X_j), h(x) = m(x)/σ²(x), and

Z_j = δ_j ε_j v_n(G_1(X_j)),  Ẑ_j(t) = δ_j (ε_j − n^{-1/2} t^⊤ m(X_j)) v_n(Ĝ(X_j)),

U_n = n^{-1/2} Σ_{j=1}^n Z_j,  Û_{n,t} = n^{-1/2} Σ_{j=1}^n Ẑ_j(t),

V_n = E[δ σ²(X) v_n(G_1(X)) v_n^⊤(G_1(X))],  V̂_n = (1/n) Σ_{j=1}^n Z_j Z_j^⊤,  V̂_{n,t} = (1/n) Σ_{j=1}^n Ẑ_j(t) Ẑ_j^⊤(t),

A_n = E[δ v_n(G_1(X)) m(X)^⊤].

The functions ϕ_0, ϕ_1, . . . form an orthonormal basis of L_2(U). By the continuity of G_1, the random variable G_1(X) has conditional distribution U given δ = 1. This shows that the matrix E[δ v_n(G_1(X)) v_n^⊤(G_1(X))] equals E[δ] I_{1+r_n}, where I_m denotes the m × m identity matrix. Since σ²(X) is bounded and bounded away from zero, there are constants 0 < k < K < ∞ such that k u^⊤ I_{1+r_n} u < u^⊤ V_n u < K u^⊤ I_{1+r_n} u for nonzero u in R^{1+r_n}. Thus we have

k ≤ λ_min(V_n) ≤ λ_max(V_n) ≤ K,   (2.1)

i.e., the eigenvalues of V_n belong to [k, K], and V_n is invertible. Let C_n = 2C log^{1/2} n. We begin by proving that the desired result follows from the following three statements:

sup_{‖t‖ ≤ C_n} ‖Û_{n,t} − U_n + A_n t‖ = o_P(1),   (2.2)

sup_{‖t‖ ≤ C_n} sup_{‖u‖ = 1} |u^⊤ V̂_{n,t} u − u^⊤ V_n u| = o_P(1),   (2.3)

sup_{‖t‖ ≤ C_n} |2 log R_n(θ + n^{-1/2} t) + Û_{n,t}^⊤ V̂_{n,t}^{-1} Û_{n,t}| = o_P(1).   (2.4)
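The identity E[δ v_n(G_1(X)) v_n^⊤(G_1(X))] = E[δ] I_{1+r_n} used above can be checked entrywise; we record the routine computation for convenience:

E[δ ϕ_i(G_1(X)) ϕ_k(G_1(X))] = E[δ] E[ϕ_i(G_1(X)) ϕ_k(G_1(X)) | δ = 1] = E[δ] ∫_0^1 ϕ_i(u) ϕ_k(u) du = E[δ] 1[i = k],

since G_1(X) has distribution U given δ = 1 and the functions ϕ_i are orthonormal in L_2(U).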
In view of the inequalities E[‖U_n‖²] = trace(V_n) ≤ K(1 + r_n) and ‖A_n t‖² ≤ t^⊤ M t ≤ ‖M‖ ‖t‖², t ∈ R^d, we have the rate

sup_{‖t‖ ≤ C_n} ‖U_n − A_n t‖² / (1 + ‖t‖)² = O_P(r_n).
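Indeed, spelling out the elementary step: since (1 + ‖t‖)² ≥ max(1, ‖t‖²),

‖U_n − A_n t‖² / (1 + ‖t‖)² ≤ 2 ‖U_n‖² + 2 ‖A_n t‖² / ‖t‖² ≤ 2 ‖U_n‖² + 2 ‖M‖,

and the right-hand side is O_P(r_n) by Markov's inequality and the trace bound above.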
This rate and the statements (2.2) and (2.3) imply

sup_{‖t‖ ≤ C_n} |Û_{n,t}^⊤ V̂_{n,t}^{-1} Û_{n,t} − (U_n − A_n t)^⊤ V_n^{-1} (U_n − A_n t)| = o_P(1).

Combined with (2.4), this yields

sup_{‖t‖ ≤ C_n} |2 log R_n(θ + n^{-1/2} t) + (U_n − A_n t)^⊤ V_n^{-1} (U_n − A_n t)| = o_P(1).   (2.10)

Let g be in L_2(G_1). Then g ∘ G_1^{-1} belongs to L_2(U), where G_1^{-1} is the left-inverse of G_1. Conditioning first on X and δ and then on δ, we find

E[‖δε A_n^⊤ V_n^{-1} v_n(G_1(X)) − δε g(X)‖²] ≤ B E[δ ‖A_n^⊤ V_n^{-1} v_n(G_1(X)) − g(X)‖²].

Here B is a bound on σ². Since the functions ϕ_0, ϕ_1, . . . form an orthonormal basis of L_2(U), the right-hand side tends to zero for the choice g = h. Thus we obtain E[‖δε A_n^⊤ V_n^{-1} v_n(G_1(X)) − δε h(X)‖²] → 0. From this we conclude A_n^⊤ V_n^{-1} U_n = Γ_n + o_p(1) and A_n^⊤ V_n^{-1} A_n → H, and obtain the desired result (1.2), with β_n = −(1/2) U_n^⊤ V_n^{-1} U_n, in view of (2.10). This completes the proof that the statements (2.2)–(2.4) imply the desired result.
Proof of (2.2). One verifies that Û_{n,t} − U_n + A_n t equals T_{n2} + T_{n3} t, where

T_{n2} = n^{-1/2} Σ_{j=1}^n δ_j ε_j [v_n(Ĝ(X_j)) − v_n(G_1(X_j))] and T_{n3} = A_n − (1/n) Σ_{j=1}^n δ_j v_n(Ĝ(X_j)) m(X_j)^⊤.

Thus (2.2) follows from the bound

sup_{‖t‖ ≤ C_n} ‖Û_{n,t} − U_n + A_n t‖ ≤ ‖T_{n2}‖ + C_n ‖T_{n3}‖

and the rates ‖T_{n2}‖ = o_P(1) and C_n ‖T_{n3}‖ = o_P(1).

Proof of (2.3). We verify

sup_{‖u‖ = 1} |u^⊤ V̂_n u − u^⊤ V_n u| = O_p(r_n n^{-1/2}),   (2.11)

sup_{‖t‖ ≤ C_n} sup_{‖u‖ = 1} |u^⊤ V̂_{n,t} u − u^⊤ V̂_n u| = O_p(r_n^{3/2} n^{-1/2} + C_n r_n^{1/2} n^{-1/2});   (2.12)

both rates are o_p(1) under the assumption r_n⁵ log n = o(n), and together they imply (2.3). We obtain (2.11) since its left-hand side is bounded by the euclidean norm ‖V̂_n − V_n‖ of V̂_n − V_n and since we have the bound

E[‖V̂_n − V_n‖²] ≤ (1/n) E[‖Z_1‖⁴] ≤ 4(1 + r_n)² E[ε⁴]/n.

In view of the identity

u^⊤ V̂_{n,t} u − u^⊤ V̂_n u = (1/n) Σ_{j=1}^n [(u^⊤ Ẑ_j(t))² − (u^⊤ Z_j)²],

we see that the left-hand side of (2.12) is bounded by 2(D_n L_n)^{1/2} + D_n with

L_n = sup_{‖u‖ = 1} (1/n) Σ_{j=1}^n (u^⊤ Z_j)² and D_n = sup_{‖t‖ ≤ C_n} (1/n) Σ_{j=1}^n ‖Ẑ_j(t) − Z_j‖².

We have L_n = O_p(1) by (2.1) and (2.11), and D_n = O_p(r_n³/n + C_n² r_n/n). Thus (2.12) holds.
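The bound by 2(D_n L_n)^{1/2} + D_n is the Cauchy–Schwarz step spelled out: writing a_j = u^⊤ Ẑ_j(t) and b_j = u^⊤ Z_j,

|(1/n) Σ_{j=1}^n (a_j² − b_j²)| = |(1/n) Σ_{j=1}^n (a_j − b_j)(a_j − b_j + 2b_j)| ≤ D_n + 2 (D_n L_n)^{1/2},

using a² − b² = (a − b)(a − b + 2b) and the Cauchy–Schwarz inequality applied to the cross term.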
Proof of (2.4). Since ε has a finite fourth moment and m(X) has a finite second moment, the bound ‖v_n‖²_∞ ≤ 2(1 + r_n) yields

T_n = sup_{‖t‖ ≤ C_n} (1/n) Σ_{j=1}^n ‖Ẑ_j(t)‖⁴ = O_p(r_n²).

Let λ_n denote the smallest eigenvalue of V̂_{n,t}, minimized over ‖t‖ ≤ C_n, let Ẑ*_n = max_{1≤j≤n} sup_{‖t‖ ≤ C_n} ‖Ẑ_j(t)‖, and let ξ_n = n^{-1/2} sup_{‖t‖ ≤ C_n} ‖Û_{n,t}‖. The above bounds show that P(λ_n − 5 Ẑ*_n ξ_n > k/4) → 1.
Thus the event {λ_n > 5 Ẑ*_n ξ_n} has probability tending to 1. On this event, the left-hand side of (2.4) can be bounded by a term of order

O_P(C_n r_n^{5/2} n^{-1/2} + C_n² r_n⁴ / n) = o_P(1).

This gives the desired result (2.4).