A note on residual-based empirical likelihood kernel density estimation

Abstract: In general the empirical likelihood method can improve the performance of estimators by including additional information about the underlying data distribution. Applied to kernel density estimation based on independent and identically distributed data, the method always improves the estimation in second order. In this paper we consider estimation of the error density in nonparametric regression by residual-based kernel estimation. We investigate whether the estimator is improved when additional information is included by the empirical likelihood method. We show that the convergence rate is not affected, but in comparison to the residual-based kernel estimator we observe a change in the asymptotic bias of the empirical likelihood estimator in first order and in the asymptotic variance in second order. These changes do not result in a general uniform improvement of the estimation, but in typical examples we demonstrate the good performance of the residual-based empirical likelihood estimator in asymptotic theory as well as in simulations.


Introduction
Let $\varepsilon_1, \ldots, \varepsilon_n$ denote a sample of independent random variables with cumulative distribution function $F$ and density $f$. In nonparametric inference the distribution function $F$ is typically estimated by the empirical distribution function $F_n$, and the density $f$ by a kernel estimator $f_n$ [see Rosenblatt [17], among many others]. Now assume that additional information about the distribution of interest is available in form of the equality
$$E[g(\varepsilon_1)] = \int g(y) f(y)\, dy = 0 \qquad (1)$$
for some known function $g$, e.g. $g(y) = (y, y^2 - \sigma^2)^\top$ for centeredness and a known variance $\sigma^2$. This information can be incorporated into the estimation by the empirical likelihood method introduced by Owen [15] [see also Hall and La Scala [8], DiCiccio et al. [5], Kitamura [10], Einmahl and McKeague [7], and Hjort et al. [9], among many others]. Qin and Lawless [16] showed that the empirical likelihood estimator for $F$, say $\tilde F_n$, always has a smaller asymptotic mean squared error than $F_n$. Whereas for the estimation of the distribution function the improvement by the empirical likelihood method is visible in the first order of the asymptotic expansion, Chen [2] showed that the empirical likelihood density estimator, say $\tilde f_n$, differs from the simple kernel estimator $f_n$ in the asymptotic mean squared error only in second order. However, in second order the variance of $\tilde f_n$ is always smaller than the variance of $f_n$, while the bias of the same asymptotic order is unchanged. Now assume that an independent and identically distributed sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of bivariate random variables has been observed, which is modelled by the homoscedastic nonparametric regression model
$$Y_i = m(X_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (2)$$
where we are interested in the distribution $F$ or density $f$ of the centered i.i.d. errors $\varepsilon_1, \ldots, \varepsilon_n$.
After estimating the conditional mean $m$ in such a regression model one might be interested in the distribution of the observations around that mean, which is characterised by the distribution of the errors. Moreover, estimation of the error distribution function or density is the first step in analyzing features of that unknown distribution and is needed for testing hypotheses about certain qualities, such as symmetry, or testing the fit of a parametric class for the error distribution. For example, known symmetry or normality of the error distribution can lead to better efficiency of several statistical procedures in nonparametric regression [see Dette et al. [4] and Neumeyer et al. [14] for further references]. Furthermore, estimated quantiles of the error distribution are needed to obtain prediction intervals for new observations [see Akritas and Van Keilegom [1]]. Because the errors are unobserved, they have to be estimated by residuals $\hat\varepsilon_i = Y_i - \hat m(X_i)$, $i = 1, \ldots, n$, where $\hat m$ denotes some nonparametric regression function estimator. Denote by $\hat F_n$ the empirical distribution function for $F$ and by $\hat f_n$ the kernel density estimator for $f$, both based on the residuals $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$. See for instance Akritas and Van Keilegom [1], Cheng [3], Efromovich [6], or Müller et al. [12] for further motivation of error distribution estimation as well as asymptotic results on $\hat F_n$ and $\hat f_n$. Note that for $\hat F_n$ in comparison to $F_n$ the asymptotic expansion changes due to the estimation of the regression function. The distribution estimator becomes biased and also the asymptotic variance changes in first order. The asymptotic variance of $\hat F_n$ can even be smaller than that of $F_n$ (however, $F_n$ and $f_n$ are not feasible in the model considered here). Now assume again that we have additional information as given in (1), which should be incorporated into the estimation by the empirical likelihood method.
For example, $g(y) = y$ yields the centeredness assumption, which is automatically valid due to the definition of the regression model, but so far is not incorporated explicitly into the estimation. Also, a variance $\sigma^2$ might be known from previous experiments, which leads to $g(y) = (y, y^2 - \sigma^2)^\top$. The analysis of empirical likelihood estimation in this regression context should be seen as a starting point for further empirical likelihood procedures, such as testing the hypothesis $E[g(\varepsilon_1)] = 0$. Although here we assume independence of the error $\varepsilon_1$ from the covariate $X_1$, it is a topic of current research to consider empirical likelihood estimation of the conditional error distribution, given the covariate, in more complicated models. It then makes sense to consider additional information in terms of $E[g(\varepsilon_1 \mid X_1) \mid X_1] = 0$, where $g(\varepsilon_1 \mid X_1)$ depends on the covariate. This will, for instance, enable testing for a parametric structure of the conditional variance function by setting $g(\varepsilon \mid x) = \varepsilon^2 - \sigma^2_\vartheta(x)$, $\vartheta \in \Theta$, but this is clearly beyond the scope of the paper at hand.
Let $\hat{\tilde F}_n$ denote the empirical likelihood estimator for $F$ based on the residuals $\hat\varepsilon_1, \ldots, \hat\varepsilon_n$ under the additional information (1). For this estimator Kiwitt et al. [11] showed that both asymptotic bias and variance differ in first order from those of the residual-based empirical distribution function $\hat F_n$. The incorporation of the additional information can lead to an improved estimator; in contrast to the i.i.d. case considered by Qin and Lawless [16], however, it does not in all cases.
In the paper at hand we consider the empirical likelihood density estimator $\hat{\tilde f}_n$ based on the residuals and investigate whether, in comparison to the residual-based kernel estimator $\hat f_n$, the asymptotic bias and variance change and whether an improvement of the estimation can be achieved. We show that, in contrast to Chen's [2] results, the asymptotic mean squared error already changes in first order, due to a change in the bias, whereas the asymptotic variance only changes in second order. Although the empirical likelihood method does not uniformly lead to smaller asymptotic bias and variance, we show that in typical examples the method results in a smaller asymptotic mean integrated squared error. In simulations we demonstrate the good performance of the estimator.
The paper is organized as follows. In section 2 we define the kernel density estimators as well as the empirical likelihood kernel density estimators for i.i.d. data and residuals, respectively. In section 3 we derive the asymptotic expansions for bias and variance of all four density estimators and compare the results. Section 4 discusses examples in theory as well as by means of a small simulation study. Technical assumptions are stated in an appendix.

Definition of the estimators
In the following we give a short motivation of kernel density estimation as a starting point for the motivation of the empirical likelihood density estimation below. Let $\varepsilon_1, \ldots, \varepsilon_n$ denote an absolutely continuous i.i.d. sample from the density $f$, which is to be estimated. A first crude estimate of a density gives equal weight $1/n$ to each of the $n$ observations $\varepsilon_i$, $i = 1, \ldots, n$. To obtain a smooth estimator the kernel approach distributes the weight $1/n$ over the interval $[\varepsilon_i - h, \varepsilon_i + h]$ for some bandwidth $h$, such that each $y$ obtains the weight $K((\varepsilon_i - y)/h)/h$ for some chosen density $K$ with support $[-1, 1]$. Those weights are then added for $i = 1, \ldots, n$ to obtain the estimator
$$f_n(y) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{\varepsilon_i - y}{h}\Big).$$
Assumptions on the kernel $K$ and bandwidth $h = h_n$ are postponed to the appendix for better readability. Note that throughout we assume that the density $f$ is twice continuously differentiable and that a kernel $K$ of order 2 is used. Under more restrictive smoothness assumptions the use of higher order kernels results in a bias reduction for the estimator $f_n$. However, we refrain from this because, on the one hand, it leads to negative values of the density estimator. On the other hand, analogous changes of the bias by the empirical likelihood method applied to the residual-based estimators as shown here appear in that case as well. Note also that we apply a bandwidth of optimal rate $h = h_n \sim n^{-1/5}$ for kernel estimation of a twice continuously differentiable density.
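As a concrete illustration, $f_n$ can be implemented in a few lines. The following is a minimal sketch (our own naming, not the authors' code), using the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2)$ on $[-1, 1]$, which is the kernel employed later in the simulation study:

```python
import numpy as np

def kde(eps, y, h):
    """Kernel density estimator f_n(y) = (1/(n h)) sum_i K((eps_i - y) / h),
    with the Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    u = (eps[:, None] - np.atleast_1d(y)[None, :]) / h
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return K.sum(axis=0) / (len(eps) * h)
```

Each column of `K` collects the weights the $n$ observations distribute to one evaluation point $y$; summing them reproduces the display above.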
We now assume that additional information about the underlying distribution is available. This auxiliary information is given in terms of equation (1), where $g = (g_1, \ldots, g_k)^\top : \mathbb{R} \to \mathbb{R}^k$ is a known function. Instead of giving equal weight to all observations, the empirical likelihood method [see Owen [15]] gives weight $p_i \in [0, 1]$ to $\varepsilon_i$ ($i = 1, \ldots, n$). In order to include the information (1) into the estimation, the likelihood $\prod_{i=1}^n p_i$ (the probability that the given sample is observed) is maximized under the constraints $\sum_{i=1}^n p_i = 1$ (to obtain a probability distribution) and $\sum_{i=1}^n p_i g(\varepsilon_i) = 0$, which is the empirical version of (1). Distributing such weights $p_i$ over $[\varepsilon_i - h, \varepsilon_i + h]$ by a kernel approach as before gives the empirical likelihood kernel density estimator
$$\tilde f_n(y) = \sum_{i=1}^n \tilde p_i \frac{1}{h} K\Big(\frac{\varepsilon_i - y}{h}\Big)$$
as considered by Chen [2]. Note that in our case the estimators $f_n$ and $\tilde f_n$ are not available because $\varepsilon_1, \ldots, \varepsilon_n$ are not observable. We consider the nonparametric homoscedastic regression model (2) with independent observations, where the covariates $X_1, \ldots, X_n$ are i.i.d. with density $f_X$, and independent of the i.i.d. errors $\varepsilon_1, \ldots, \varepsilon_n$ with density $f$. For reasons of clarity further technical model assumptions are listed in the appendix. In order to estimate the density $f$ of the unobserved errors we build nonparametric residuals $\hat\varepsilon_i = Y_i - \hat m(X_i)$, $i = 1, \ldots, n$, where $\hat m$ denotes the Nadaraya-Watson estimator [see Nadaraya [13], Watson [19]] for $m$, that is
$$\hat m(x) = \frac{1}{n \hat f_X(x)} \sum_{j=1}^n \frac{1}{b} k\Big(\frac{x - X_j}{b}\Big) Y_j,$$
where $\hat f_X(x) = \frac{1}{nb} \sum_{j=1}^n k((x - X_j)/b)$ is a kernel estimator for the covariate density. Assumptions on the kernel $k$ and bandwidth $b$ are listed in the appendix. Note that we assume a bandwidth $b = b_n \sim n^{-1/5}$ of optimal rate for estimation of a twice continuously differentiable regression function. To estimate the error density we consider $\hat f_n$, the kernel estimator based on the residuals, instead of the non-feasible kernel estimator $f_n$. Finally, we define the empirical likelihood estimator with the same motivation as for $\tilde f_n$, but based on residuals,
$$\hat{\tilde f}_n(y) = \sum_{i=1}^n \hat p_i \frac{1}{h} K\Big(\frac{\hat\varepsilon_i - y}{h}\Big).$$
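The construction of $\hat m$ and of the residuals can be sketched as follows (a sketch under the model assumptions above, again with the Epanechnikov kernel and our own hypothetical naming):

```python
import numpy as np

def nw_regression(x, X, Y, b):
    """Nadaraya-Watson estimator: a locally weighted average of the responses,
    m_hat(x) = sum_j k((x - X_j)/b) Y_j / sum_j k((x - X_j)/b),
    which equals (n b f_X_hat(x))^{-1} sum_j k((x - X_j)/b) Y_j."""
    u = (np.atleast_1d(x)[:, None] - X[None, :]) / b
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)  # Epanechnikov
    return (w * Y[None, :]).sum(axis=1) / np.maximum(w.sum(axis=1), 1e-12)
```

The nonparametric residuals are then `res = Y - nw_regression(X, X, Y, b)`, and $\hat f_n$ is obtained by applying the kernel density estimator of the previous paragraph to `res` instead of the unobservable errors.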
From Qin and Lawless [16] and Kiwitt et al. [11] it follows that the empirical likelihood weights are $\hat p_i = 1/(n + n \hat\eta_n^\top g(\hat\varepsilon_i))$, where $\hat\eta_n$ is defined as the solution of the equation
$$\frac{1}{n} \sum_{i=1}^n \frac{g(\hat\varepsilon_i)}{1 + \hat\eta_n^\top g(\hat\varepsilon_i)} = 0.$$

To compare the asymptotic performance of the estimators, in the next section we give results for the asymptotic bias and variance of the four estimators $f_n$, $\tilde f_n$, $\hat f_n$, and $\hat{\tilde f}_n$.
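Numerically, $\hat\eta_n$ can be obtained from this equation by a few Newton steps. The following is a minimal sketch (our own implementation, not the authors'; a production solver would add step-size control):

```python
import numpy as np

def el_weights(G, iters=50):
    """Empirical likelihood weights p_i = 1 / (n (1 + eta^T g_i)), where eta
    solves sum_i g_i / (1 + eta^T g_i) = 0. G is the (n, k) matrix whose rows
    are g(eps_i). Plain Newton iteration on eta, started at 0."""
    n, k = G.shape
    eta = np.zeros(k)
    for _ in range(iters):
        d = 1.0 + G @ eta                      # denominators 1 + eta^T g_i
        score = (G / d[:, None]).sum(axis=0)   # sum_i g_i / (1 + eta^T g_i)
        hess = -(G[:, :, None] * G[:, None, :]
                 / (d ** 2)[:, None, None]).sum(axis=0)
        eta = eta - np.linalg.solve(hess, score)
    return 1.0 / (n * (1.0 + G @ eta))
```

At the solution the constraint $\sum_i \hat p_i = 1$ holds automatically, since $\sum_i \hat p_i = 1 - n^{-1} \hat\eta_n^\top \sum_i g(\hat\varepsilon_i)/(1 + \hat\eta_n^\top g(\hat\varepsilon_i)) = 1$.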

The kernel density estimator based on i.i.d. data
From standard results in the kernel estimation literature [see Wand and Jones [18], for instance] we have by Taylor's expansion
$$E[f_n(y)] = f(y) + \frac{h^2}{2} f''(y) \int u^2 K(u)\, du + o(h^2),$$
where we assume that $\int u K(u)\, du = 0$ (see the appendix). For the variance one obtains similarly, with $\int u K^2(u)\, du = 0$,
$$\mathrm{Var}(f_n(y)) = \frac{1}{nh} f(y) \int K^2(u)\, du - \frac{1}{n} f(y)^2 + o(n^{-1}).$$
Thus the first order bias is of rate $h^2$, and the first and second order variance terms are of rates $(nh)^{-1} \sim h^4$ and $n^{-1}$, respectively.
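The kernel constants $\int u^2 K(u)\,du$ and $\int K^2(u)\,du$ appearing in these expansions equal $1/5$ and $3/5$ for the Epanechnikov kernel used in the simulation study; a quick numerical check:

```python
import numpy as np

# Moments of the Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]:
# mu2 = int u^2 K(u) du enters the h^2 bias term, R(K) = int K(u)^2 du the
# leading (nh)^{-1} variance term. For this kernel mu2 = 1/5 and R(K) = 3/5.
u, du = np.linspace(-1.0, 1.0, 200001, retstep=True)
K = 0.75 * (1.0 - u ** 2)
mu1 = (u * K).sum() * du        # int u K(u) du = 0 by symmetry
mu2 = (u ** 2 * K).sum() * du   # int u^2 K(u) du
RK = (K ** 2).sum() * du        # int K(u)^2 du
```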

The empirical likelihood kernel density estimator based on i.i.d. data
Chen [2] has shown that
$$E[\tilde f_n(y)] = f(y) + \frac{h^2}{2} f''(y) \int u^2 K(u)\, du + o(h^2)$$
and
$$\mathrm{Var}(\tilde f_n(y)) = \mathrm{Var}(f_n(y)) - \frac{1}{n} f(y)^2\, g(y)^\top \Sigma^{-1} g(y) + o(n^{-1}).$$
Hence, from the incorporation of the additional information by the empirical likelihood method there is no change of the bias, and no change of the variance in first order. However, the variance term of second order $n^{-1}$ is reduced in comparison to $f_n$, because by assumption $\Sigma = E[g(\varepsilon_1) g(\varepsilon_1)^\top]$ is positive definite (see the appendix), and hence $g(y)^\top \Sigma^{-1} g(y) > 0$ (for all $y$).
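For instance, with $g(y) = (y, y^2 - \sigma^2)^\top$ (the information used again in section 4.2) and errors whose third moments vanish, $\Sigma$ is diagonal and the reduction factor is a simple quadratic form. A small sketch (our naming; `m4` denotes $E[\varepsilon_1^4]$, which equals $3$ for standard normal errors):

```python
import numpy as np

def reduction_factor(y, sigma2=1.0, m4=3.0):
    """g(y)^T Sigma^{-1} g(y) for g(y) = (y, y^2 - sigma^2)^T with vanishing
    third moments, so Sigma = diag(sigma^2, E[(eps^2 - sigma^2)^2]),
    where E[(eps^2 - sigma^2)^2] = m4 - sigma2^2."""
    Sigma = np.diag([sigma2, m4 - sigma2 ** 2])
    g = np.array([y, y ** 2 - sigma2])
    return g @ np.linalg.solve(Sigma, g)
```

Here the factor is strictly positive for every $y$ (it could only vanish if $y = 0$ and $y^2 = \sigma^2$ held simultaneously), so the second order variance is always reduced.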

The residual-based kernel density estimator
In this subsection we sketch the proof of a stochastic expansion of $\hat f_n$. In the following, $R_{n,\ell}$, $\ell = 1, \ldots, 4$, denote remainder terms, each defined by the corresponding equality; explanations for the expansions and for the negligibility of the remainder terms are given below. For the residual-based kernel density estimator we obtain a stochastic expansion (3.4) of $\hat f_n(y)$ around $f_n(y)$ with a deterministic term $B_n$, where $B_n = b^2 B f'(y) + o(b^2)$ and $B$ collects the asymptotic bias of the Nadaraya-Watson estimator $\hat m$ averaged with respect to the covariate distribution. To obtain the expansion for $B_n$ we have used calculations typical for kernel estimation theory as well as integration by parts and Taylor's expansion. One can show that under the assumptions stated in the appendix the variance of the remainder term $R_{n,4}$ in (3.4) is of order $o(1/n)$ and that the covariance of $R_{n,4}$ with the other terms in the expansion (3.4) of $\hat f_n(y)$ is (by an application of the Cauchy-Schwarz inequality) also of order $o(1/n)$. To this end note that $R_{n,4} = R_{n,1} + (R_{n,2} - R_{n,1}) + (R_{n,3} - R_{n,2}) + (R_{n,4} - R_{n,3})$. A Taylor expansion of the kernel $K$ up to order five yields the negligibility of $R_{n,1}$, whereas the term $R_{n,2} - R_{n,1}$ is treated by inserting the definition of $\hat m$ into $\varepsilon_i - \hat\varepsilon_i = \hat m(X_i) - m(X_i)$ in (3.1) (see section 2) and by replacing the random denominator $\hat f_X$ by the true density $f_X$. Simple but cumbersome variance calculations [inserting the model definition $Y_j = m(X_j) + \varepsilon_j$] yield the negligibility of $R_{n,3} - R_{n,2}$ and of $R_{n,4} - R_{n,3}$. Altogether we obtain
$$E[\hat f_n(y)] = f(y) + \frac{h^2}{2} f''(y) \int u^2 K(u)\, du + b^2 B\, f'(y) + o(h^2) + o(b^2). \qquad (3.5)$$
In comparison to the asymptotic expectation and variance of $f_n(y)$ as stated in section 3.1 we see that the estimation of the regression function $m$ results in a change of the bias in (first) order $b^2$ and of the variance in (second) order $n^{-1}$.

The residual-based empirical likelihood kernel density estimator
Finally, we consider the residual-based empirical likelihood density estimator. By the definition of the weights,
$$\hat p_i = \frac{1}{n}\Big(1 - \hat\eta_n^\top g(\hat\varepsilon_i) + \frac{(\hat\eta_n^\top g(\hat\varepsilon_i))^2}{1 + \hat\eta_n^\top g(\hat\varepsilon_i)}\Big),$$
we obtain an expansion of $\hat{\tilde f}_n(y)$ with a remainder term; the rates of its first two factors are given in Lemma B.4 in Kiwitt et al. [11], and the rate of the last factor can be proved by applying Taylor's expansion similarly to the derivation in section 3.3. From Proposition 3.2 and Lemma B.2(ii) in Kiwitt et al. [11] we obtain a stochastic expansion of $\hat\eta_n$ in terms of $\Sigma = E[g(\varepsilon_1) g(\varepsilon_1)^\top]$ and the term $B$ defined in section 3.3 (see also the remark below the list of assumptions in the appendix). Using Taylor's expansion in a similar way as in section 3.3, one can derive an expansion (3.8) with some negligible remainder term $R_{n,6}$, and from (3.6) together with (3.7), (3.8) and (3.9) an expansion (3.10) with some remainder term $R_{n,7}$. One can show with lengthy but simple calculations that the variance of the remainder term is of order $o(1/n)$ and that the covariance of $R_{n,7}$ with all other terms in the expansion is (by an application of the Cauchy-Schwarz inequality) also of order $o(1/n)$. Hence we obtain from (3.10) and (3.5) that
$$E[\hat{\tilde f}_n(y)] = E[\hat f_n(y)] + b^2 B\, f(y)\, E[g'(\varepsilon_1)]^\top \Sigma^{-1} g(y) + o(b^2) \qquad (3.11)$$
and from (3.10) and (3.4) that
$$\mathrm{Var}(\hat{\tilde f}_n(y)) = \mathrm{Var}(\hat f_n(y)) + \frac{1}{n} V_g(y) + o(n^{-1}), \qquad (3.12)$$
where the deterministic function $V_g$ depends on $f$, $g$ and $\Sigma$ and vanishes at points $y$ with $g(y) - E[g'(\varepsilon_1)]\, y = 0$; it is computed explicitly for the examples in section 4. In comparison to the expectation and variance of $\hat f_n(y)$ as given in section 3.3, we see that, in contrast to Chen's [2] results, the empirical likelihood method for the residual-based estimators leads to a change in the bias in first order. This change, however, only arises in the bias of order $b^2$ that is due to the estimation of the residuals, whereas the $h^2$-bias remains as before. In the variance we observe a change only in second order. We will investigate in section 4 whether this change means that the estimation can be improved by the application of the empirical likelihood method.

The mean squared errors
To compare the asymptotic mean squared errors in second order we need additional assumptions to obtain the second order bias. For simplicity, in addition to the assumptions stated in the appendix we assume $f$, $m$ and $f_X$ to be thrice continuously differentiable. Because the kernels $K$ and $k$ are symmetric with compact supports, the third moments of the kernels vanish, and hence the bias remainder terms $o(h^2)$ and $o(b^2)$ are of order $o(h^3)$ and $o(b^3)$, respectively. The mean squared errors then follow by combining the squared bias and variance expansions from sections 3.1-3.4 (note that $h^5 \sim n^{-1}$ and $b^5 \sim n^{-1}$).

Examples and simulations
It was shown above that the residual-based kernel density estimator and the residual-based empirical likelihood kernel density estimator differ in first order in the asymptotic bias and in second order in the asymptotic variance. However, the application of the empirical likelihood weights does not uniformly lead to an improvement for all possible densities $f$ and functions $g$. Therefore the theoretical asymptotic results are considered for two typical examples of additional information. It is shown that the empirical likelihood method leads to a smaller asymptotic mean integrated squared error. Afterwards the behaviour of both estimators for smaller sample sizes is examined in a small simulation study, which demonstrates the superiority of the empirical likelihood estimator.

Additional information: Centered errors
The model assumption of centered errors is now explicitly included in the estimation. Note that although the Nadaraya-Watson estimator $\hat m$ by construction already uses the centredness of the errors, the explicit usage of this information by the empirical likelihood method can still improve the estimate of the error density (this was also observed by Kiwitt et al. [11] for the estimation of the error distribution). Choosing $g(\varepsilon) = \varepsilon$, we have $E[g'(\varepsilon)] = 1$ and $g(y) - E[g'(\varepsilon_1)^\top]\, y = y - 1 \cdot y = 0$, so the variance formula (3.12) reduces to $\mathrm{Var}(\hat{\tilde f}_n(y)) = \mathrm{Var}(\hat f_n(y)) + o(n^{-1})$. Hence no second order improvement of the variance is possible, but for the asymptotic bias it follows, taking $\Sigma = E[\varepsilon^2] = \sigma^2$ into account [see (3.11) under the assumptions of section 3.5], that
$$E[\hat{\tilde f}_n(y)] = E[\hat f_n(y)] + b^2 B\, \frac{y\, f(y)}{\sigma^2} + o(b^2),$$
so the $b^2$-bias of $\hat{\tilde f}_n$ is $b^2 B (f'(y) + y f(y)/\sigma^2) + o(b^2)$. Thus a first order improvement in the asymptotic bias is possible. In case of normally distributed errors it follows from $f'(y) = -y f(y)/\sigma^2$ that the $b^2$-bias due to the estimation of the regression function cancels completely. Figure 1 illustrates an improvement for most $y$; nevertheless, for some points the residual-based kernel density estimator has the smaller bias. Here only the dominating terms of first and second order are depicted, and we take $h = c_1 n^{-1/5}$, $b = c_2 n^{-1/5}$, where for simplicity $c_1^2 \int u^2 K(u)\, du = 1$ and $c_2^2 B = 1$. Integrating the differences in the bias over all $y$, however, gives an overall improvement in the asymptotic bias, and hence in the asymptotic mean integrated squared error (amise). To see this, note that from the formulae in section 3.5 we obtain, by symmetry of $f$ and because $f'(y) = -y f(y)/\sigma^2$, that the amise of $\hat{\tilde f}_n$ is smaller than that of $\hat f_n$. The greatest improvement is obtained in cases of a large bias $b^2 |B|$ due to regression estimation combined with a small variance $\sigma^2$.
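The cancellation for normal errors can be checked numerically: the factor $f'(y) + y f(y)/\sigma^2$ multiplying $b^2 B$ vanishes identically when $f$ is the $N(0, \sigma^2)$ density. A small sketch with a finite-difference derivative (function names are ours):

```python
import math

def normal_density(y, s2):
    """Density of N(0, s2) at y."""
    return math.exp(-y * y / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)

def b2_bias_factor(y, s2, dy=1e-6):
    """f'(y) + y f(y) / sigma^2: the factor multiplying b^2 B in the bias of
    the residual-based empirical likelihood estimator with g(eps) = eps.
    f' is approximated by a central difference; for a normal density the
    factor vanishes identically."""
    fprime = (normal_density(y + dy, s2) - normal_density(y - dy, s2)) / (2.0 * dy)
    return fprime + y * normal_density(y, s2) / s2
```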

Additional information: Centered errors and Var(ε) = σ 2
In some applications the variance $\sigma^2$ of the errors is known. This information and the centeredness of the errors can be included in the estimation by defining $g(\varepsilon)^\top = (\varepsilon, \varepsilon^2 - \sigma^2)$. For convenience we assume that the third moments of the error distribution vanish. With $E[g'(\varepsilon)] = (1, 0)^\top$ and $\Sigma = \mathrm{diag}(\sigma^2, E[(\varepsilon^2 - \sigma^2)^2])$, the same asymptotic change in the bias follows as in section 4.1, whereas $g(y) - E[g'(\varepsilon_1)]\, y = (0, y^2 - \sigma^2)^\top$ in the variance formula (3.12) yields
$$\mathrm{Var}(\hat{\tilde f}_n(y)) = \mathrm{Var}(\hat f_n(y)) - \frac{f(y)^2 (y^2 - \sigma^2)^2}{n\, E[(\varepsilon^2 - \sigma^2)^2]} + o(n^{-1}).$$
Because of the obviously negative additional term there is always, for all $y$, a second order improvement in the asymptotic variance of the residual-based empirical likelihood kernel density estimator. Figure 2 shows the improvements in case of standard normally distributed errors. In that case the second order reduction of the asymptotic mean integrated squared error is proportional to $1/(\sqrt{\pi}\, \sigma n)$.
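For standard normal errors ($\sigma = 1$) the integrated second order variance reduction $\int f(y)^2 (y^2 - \sigma^2)^2\, dy \,/\, E[(\varepsilon^2 - \sigma^2)^2]$ can be evaluated numerically; it equals $3/(16\sqrt{\pi})$ (our own computation, consistent with a reduction scaling like $1/(\sqrt{\pi}\,\sigma n)$):

```python
import numpy as np

# Integrated second order variance reduction for g(eps) = (eps, eps^2 - 1)
# with standard normal errors: int f(y)^2 (y^2 - 1)^2 dy / E[(eps^2 - 1)^2],
# where E[(eps^2 - 1)^2] = 2. Closed form of the ratio: 3 / (16 sqrt(pi)).
y, dy = np.linspace(-10.0, 10.0, 200001, retstep=True)
f = np.exp(-y ** 2 / 2.0) / np.sqrt(2.0 * np.pi)   # standard normal density
reduction = (f ** 2 * (y ** 2 - 1.0) ** 2).sum() * dy / 2.0
```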

Simulation study
In the simulation study the mean squared error of both estimators is analyzed for sample sizes $n = 50, 100, 250$. To this end the mean squared error of the residual-based empirical likelihood density estimator is approximated by
$$\widehat{\mathrm{mse}}(\hat{\tilde f}_n(y)) = \frac{1}{N} \sum_{i=1}^N \big(\hat{\tilde f}_{n,i}(y) - f(y)\big)^2,$$
where $N = 10000$ is the number of repetitions of the data generation and $\hat{\tilde f}_{n,i}(y)$ is the estimator applied to the $i$-th sample; $\widehat{\mathrm{mse}}(\hat f_n(y))$ is defined analogously. The regression function $m(x) = 5x^2$ is used and estimated by the Nadaraya-Watson estimator with bandwidth $b = n^{-1/5}$. For the kernel density estimator we choose the bandwidth $h = n^{-1/5}$, and for both estimators the Epanechnikov kernel. Figure 3 shows the simulations for standard normally distributed errors and the two-dimensional additional information $g(\varepsilon)^\top = (\varepsilon, \varepsilon^2 - \sigma^2)$. For all sample sizes mentioned above the residual-based kernel density estimator is improved for almost all $y$ by the empirical likelihood weights. The greatest improvements lie in the interval $[-0.2, 1]$, which is supported by the theoretical results [see panel (3) in Figure 2]. Figure 4 presents some further examples. All rows show the difference $\widehat{\mathrm{mse}}(\hat f_n(y)) - \widehat{\mathrm{mse}}(\hat{\tilde f}_n(y))$ for $n = 50, 100, 250$. The first row illustrates the case of standard normally distributed errors with $g(\varepsilon) = \varepsilon$. The other two consider the case of Student's $t$-distributed errors with 3 degrees of freedom and, respectively, $g(\varepsilon) = \varepsilon$ and $g(\varepsilon)^\top = (\varepsilon, \varepsilon^2 - \sigma^2)$.
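One replication of this Monte Carlo scheme can be sketched compactly. The code below follows the setting described above ($m(x) = 5x^2$, $b = h = n^{-1/5}$, Epanechnikov kernel) with the one-dimensional information $g(\varepsilon) = \varepsilon$; it is a sketch, not the authors' code, and assumes uniform covariates on $[0, 1]$ since the covariate design is not specified here:

```python
import numpy as np

rng = np.random.default_rng(1)

def epan(u):
    """Epanechnikov kernel."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def one_replication(n, y0):
    """One Monte Carlo replication: draw data from Y = 5 X^2 + eps, form
    Nadaraya-Watson residuals, and return the residual-based kernel density
    estimate and its empirical likelihood reweighting (g(eps) = eps) at y0."""
    b = h = n ** (-0.2)
    X = rng.uniform(0.0, 1.0, n)               # covariate design: an assumption
    eps = rng.standard_normal(n)
    Y = 5.0 * X ** 2 + eps
    w = epan((X[:, None] - X[None, :]) / b)    # Nadaraya-Watson weights
    m_hat = (w * Y[None, :]).sum(axis=1) / np.maximum(w.sum(axis=1), 1e-12)
    res = Y - m_hat                            # nonparametric residuals
    eta = 0.0                                  # Newton for the scalar dual
    for _ in range(50):
        d = 1.0 + eta * res
        eta += (res / d).sum() / (res ** 2 / d ** 2).sum()
    p = 1.0 / (n * (1.0 + eta * res))          # empirical likelihood weights
    Kh = epan((res - y0) / h) / h
    return Kh.mean(), (p * Kh).sum()           # f_n_hat(y0) and EL version
```

Averaging $(\text{estimate} - f(y_0))^2$ over $N$ replications then gives the approximated mean squared errors compared in Figures 3 and 4.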
The additional technical assumption 5 is needed to bound some remainder terms.
Assumption 6 is due to the kernel density estimation. The high degree of smoothness required of the kernel is needed to bound remainder terms in a Taylor expansion with the aim of obtaining second order rates.