A Significance Test for Covariates in Nonparametric Regression

We consider testing the significance of a subset of covariates in a nonparametric regression. These covariates can be continuous and/or discrete. We propose a new kernel-based test that smoothes only over the covariates appearing under the null hypothesis, so that the curse of dimensionality is mitigated. The test statistic is asymptotically pivotal, and the rate at which the test detects local alternatives depends only on the dimension of the covariates under the null hypothesis. We show the validity of wild bootstrap for the test. In small samples, our test is competitive compared to existing procedures.


1 Introduction
Testing the significance of covariates is common in applied regression analysis. Sound parametric inference hinges on the correct functional specification of the regression function, but the likelihood of misspecification in a parametric framework cannot be ignored, especially as applied researchers tend to choose functional forms on the basis of parsimony and tractability. Significance testing in a nonparametric framework therefore has obvious appeal, as it requires far less restrictive assumptions. Fan (1996), Fan and Li (1996), Racine (1997), Chen and Fan (1999), Lavergne and Vuong (2000), Ait-Sahalia et al. (2001), and Delgado and González Manteiga (2001) proposed tests of significance for continuous variables in nonparametric regression models. Delgado (1993), Dette and Neumeyer (2001), Lavergne (2001), Neumeyer and Dette (2003), and Racine et al. (2006) focused on the significance of discrete variables. Volgushev et al. (2013) considered significance testing in nonparametric quantile regression. For each test, one first needs to estimate the model without the covariates under test, that is, under the null hypothesis. The result is then used to check the significance of the extra covariates. Two competing approaches are then possible. In the "smoothing approach," one regresses the residuals onto the whole set of covariates nonparametrically, while in the "empirical process approach" one uses the empirical process of residuals marked by a function of all covariates.
In this work, we adopt a hybrid approach to develop a new significance test of a subset of covariates in a nonparametric regression. Our new test has three specific features. First, it does not require smoothing with respect to the covariates under test, as in the "empirical process approach." This mitigates the curse of dimensionality that appears with nonparametric smoothing, hence improving the power properties of the test. Our simulation results show that our test is indeed more powerful than competitors under a wide spectrum of alternatives. Second, the test statistic is asymptotically pivotal as in the "smoothing approach," while wild bootstrap can be used to obtain small-sample critical values of the test. This yields a test whose level is well controlled by bootstrapping, as shown in simulations. Third, our test applies equally whether the covariates under test are continuous or discrete, showing that there is no need for a specifically tailored procedure for each situation.
The paper is organized as follows. In Section 2, we present our testing procedure. In Section 3, we study its asymptotic properties under a sequence of local alternatives and we establish the validity of wild bootstrap. In Section 4, we compare the small sample behavior of our test to some existing procedures. Section 5 gathers our proofs.
2 Testing Framework and Procedure

Testing Principle
We want to assess the significance of X ∈ R q in the nonparametric regression of Y ∈ R on W ∈ R p and X. Formally, this corresponds to the null hypothesis H 0 : E [Y | W, X] = E [Y | W ] a.s., which, setting u = Y − E [Y | W ], is equivalent to E [u | W, X] = 0 a.s. The corresponding alternative hypothesis is H 1 : P (E [u | W, X] = 0) < 1. The following result is the cornerstone of our approach. It characterizes the null hypothesis H 0 using a suitable unconditional moment equation.
Proof. Let ⟨·, ·⟩ denote the standard inner product. Using the Fourier Inversion Theorem, a change of variables, and elementary properties of conditional expectation, I(h) can be written as an integral of |E [ E [u | W, X] ν (W ) e 2πi{ ⟨t,W⟩ + ⟨s,X⟩ } ]| 2 weighted by the Fourier transforms of the kernels. Since the Fourier transforms F [K] and F [ψ] are strictly positive, I(h) = 0 iff E [ E [u | W, X] ν (W ) e 2πi{ ⟨t,W⟩ + ⟨s,X⟩ } ] = 0 ∀t, s .
But this is equivalent to E [u | W, X] ν (W ) = 0 a.s., which by our assumption on ν(·) is equivalent to H 0 .

The Test
Lemma 1 holds whether the covariates W and X are continuous or discrete. For now, we assume W is continuously distributed, and we comment later on how to modify our procedure when some of its components are discrete. We do not, however, restrict X to be continuous. Since it is sufficient to test whether I(h) = 0 for any arbitrary h, we can choose h to obtain desirable properties. We therefore consider a sequence of h decreasing to zero as the sample size increases, which is one of the ingredients that allows us to obtain a tractable asymptotic distribution for the test statistic.
Assume we have at hand a random sample (W i , X i , Y i ), i = 1, . . . , n. In what follows, f (·) denotes the density of W and u i = Y i − E [Y i | W i ]. Since nonparametric estimation should be entertained to approximate the u i , we consider the usual kernel estimators based on a kernel L(·) and a bandwidth g.
Denote by n (m) the number of arrangements of m distinct elements among n, and by [1/n (m) ] Σ a the average over these arrangements. In order to avoid random denominators, we choose ν (W ) = f (W ), which fulfills the assumption of Lemma 1. Then we can estimate I (h) by the second-order U-statistic I n . We also consider the alternative statistic Ĩ n . It is clear that Ĩ n is obtained from I n by removing asymptotically negligible "diagonal" terms. Under the null hypothesis, both statistics have the same asymptotic normal distribution, but removing the diagonal terms reduces the bias of the statistic under H 0 .
Our statistics Ĩ n and I n are respectively similar to the ones of Fan and Li (1996) and Lavergne and Vuong (2000), with the fundamental difference that there is no smoothing relative to the covariates X, while these authors used a multidimensional smoothing over (W, X). For I n being either Ĩ n or I n , we will show that nh p/2 I n d −→ N (0, ω 2 ) under H 0 and nh p/2 I n p −→ ∞ under H 1 . By contrast, the statistics of Fan and Li (1996) and Lavergne and Vuong (2000) exhibit a nh (p+q)/2 rate of convergence. The alternative test of Delgado and González Manteiga (2001) uses the kernel residuals û i and the empirical process approach of Stute (1997). This avoids extra smoothing, but at the cost of a test statistic with a non-pivotal asymptotic law under H 0 . Hence, our proposal is a hybrid approach that combines the advantages of existing procedures, namely smoothing only over the variables W appearing under the null hypothesis while retaining an asymptotically normal statistic. Given a consistent estimator ω 2 n of ω 2 , as provided in the next section, we obtain an asymptotic α-level test of H 0 that rejects when nh p/2 I n /ω n > z 1−α , where z 1−α is the (1 − α)-th quantile of the standard normal distribution. In small samples, we will show the validity of a wild bootstrap scheme to obtain critical values.
The test applies whether X is continuous or has some discrete components. The procedure is also easily adapted to some discrete components of W . In that case, one would replace kernel smoothing by cells' indicators for the discrete components, so that for W composed of a continuous part W c of dimension p c and a discrete part W d , one would use a product of a kernel in W c and cells' indicators in W d . It would also be possible to smooth on the discrete components, as proposed by Racine and Li (2004). To obtain scale invariance, we recommend that the observations on the covariates be scaled, say by their sample standard deviations, as is customary in nonparametric estimation. It is equally important to scale the X i before they are used as arguments of ψ(·) to preserve such invariance.
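For mixed covariates W = (W c , W d ), the weighting just described can be sketched as a product of a kernel on the continuous part and cell indicators on the discrete part; this is an illustrative sketch under our own naming, not the paper's implementation:

```python
import numpy as np

def mixed_kernel_weights(Wc, Wd, g, L):
    """Pairwise weights for W = (Wc, Wd): kernel smoothing on the continuous
    part Wc, exact cell matching (indicators) on the discrete part Wd."""
    pc = Wc.shape[1]
    D = (Wc[:, None, :] - Wc[None, :, :]) / g
    smooth = L(D) / g**pc                              # kernel on continuous part
    match = np.all(Wd[:, None, :] == Wd[None, :, :], axis=-1)  # same cell?
    return smooth * match
```

Observations falling in different discrete cells receive weight zero, so smoothing only ever pools observations within a cell.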
The outcome of the test may depend on the choice of the kernels K(·) and L(·), though this influence is expected to be limited, as it is in nonparametric estimation. The choice of the function ψ(·) might be more important, but our simulations reveal that it is not. From our theoretical study, this function, as well as K(·), should possess an almost everywhere positive and integrable Fourier transform. This is true for (products of) the triangular, normal, Laplace, and logistic densities, see Johnson et al. (1995), and for a Student density, see Hurst (1995). Alternatively, one can choose ψ(x) as a univariate density applied to some transformation of x, such as its norm. This yields ψ(x) = l (‖x‖), where l(·) is any of the above univariate densities. This is the form we will consider in our simulations to study the influence of ψ(·).
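For concreteness, here is a sketch of two such choices of ψ(·): a normal and a triangular density applied to the norm, the latter rescaled to have unit second moment as in the simulation design. The helper names are ours:

```python
import numpy as np

def psi_normal(x):
    """psi(x) = standard normal density evaluated at ||x||."""
    r = np.linalg.norm(np.atleast_2d(x), axis=-1)
    return np.exp(-0.5 * r**2) / np.sqrt(2 * np.pi)

def psi_triangular(x):
    """psi(x) = triangular density at ||x||, scaled to unit second moment.

    The triangular density on [-c, c] has variance c^2 / 6, so c = sqrt(6)
    gives a second moment equal to one.
    """
    c = np.sqrt(6.0)
    r = np.linalg.norm(np.atleast_2d(x), axis=-1)
    return np.clip(1.0 - r / c, 0.0, None) / c
```

Both densities have nonnegative, almost everywhere positive Fourier transforms (Gaussian and squared sinc, respectively), which is the property Lemma 1 requires.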

3 Theoretical Properties
Here we give the asymptotic properties of our test statistics under H 0 and some local alternatives. To do so in a compact way, we consider the sequence of hypotheses H 1n . The null hypothesis corresponds to the case δ n ≡ 0, while considering a sequence δ n → 0 yields local Pitman-like alternatives.

Assumptions
We begin with some useful definitions.
Definition 1. (i) U p is the class of integrable uniformly continuous functions from R p to R; (ii) D p s is the class of ⌊s⌋-times differentiable functions from R p to R, with derivatives of order ⌊s⌋ that are uniformly Lipschitz continuous of order s − ⌊s⌋, where ⌊s⌋ denotes the integer such that ⌊s⌋ ≤ s < ⌊s⌋ + 1.
Note that a function belonging to U p is necessarily bounded.
Definition 2. K p m , m ≥ 2, is the class of even integrable functions K : R p → R with compact support satisfying ∫ K (t) dt = 1 and, if t = (t 1 , . . . , t p ), the usual vanishing-moment conditions of order m. This definition of higher-order kernels is standard in nonparametric estimation. The compact support assumption is made for simplicity and could be relaxed at the expense of technical conditions on the rate of decrease of the kernels at infinity, see e.g. Definition 1 in Fan and Li (1996). In particular, the Gaussian kernel could be allowed for. We are now ready to list our assumptions.
Assumption 1. (i) For any x ∈ R q in the support of X, the vector W admits a conditional density given X = x with respect to the Lebesgue measure on R p , denoted by π(· | x). Moreover, E [Y 8 ] < ∞. (ii) The observations (W i , X i , Y i ), i = 1, . . . , n, are independent and identically distributed as (W, X, Y ).
The existence of the conditional density given X = x for all x ∈ R q in the support of X implies that W admits a density with respect to the Lebesgue measure on R p . As noted above, our results easily generalize to some discrete components of W , but for the sake of simplicity we do not formally consider this in our theoretical analysis.
(iii) the function ψ (·) is bounded and has an almost everywhere positive and integrable Fourier transform; (iv) K (·) ∈ K p 2 and has an almost everywhere positive and integrable Fourier transform, while L (·) ∈ K p ⌊s⌋ and is of bounded variation; (v) let σ 2 (w, x) = E[u 2 | W = w, X = x]; then σ 2 (·, x) f 2 (·) π (· | x) belongs to U p for any x in the support of X, has an integrable Fourier transform, and is integrable and squared integrable for any x in the support of X.
Standard regularity conditions are assumed for the various functions. A higher-order kernel L(·) is used in conjunction with the differentiability conditions in (i) to ensure that the bias in nonparametric estimation is small enough.

Asymptotic Analysis
The following result characterizes the behavior of our statistics under the null hypothesis and a sequence of local alternatives.
Theorem 1. Let I n be any of the statistics I n or Ĩ n . Under Assumptions 1 and 2, and if as n → ∞ (i) g, h → 0, (ii) n 7/8 g p / ln n, nh p → ∞, (iii) nh p/2 g 2s → 0, and (iv) h and g satisfy a relation discussed below.
The rate of convergence of the test statistic depends only on the dimension of W , the covariates present under the null hypothesis, and not on the dimension of X, the covariates under test. Similarly, the rate of the local alternatives detected by the test depends only on the dimension of W . As shown in the simulations, this yields some gain in power compared to competing "smoothing" tests. Conditions (i) to (iv) together require that s > p/2 for I n = Ĩ n and s > p/4 for I n = I n , so removing the diagonal terms in I n allows one to weaken the restrictions on the bandwidths. Condition (ii) could be slightly weakened to ng p → ∞ at the price of handling higher-order U-statistics in the proofs; as stated, it allows for a shorter argument based on empirical processes, see Lemma 3 in the proofs section.
To estimate ω 2 , we can either mimic Lavergne and Vuong (2000) to obtain ω 2 n , or generalize the variance estimator of Fan and Li (1996) to obtain ω̃ 2 n . The first is consistent for ω 2 under both the null and the alternative hypothesis, while the second is faster to compute.
Corollary 1. Let I n be any of the statistics I n or Ĩ n and let ω n denote either of the variance estimators ω n or ω̃ n . Under the assumptions of Theorem 1, the test that rejects H 0 when nh p/2 I n /ω n > z 1−α is of asymptotic level α under H 0 and is consistent under the sequence of local alternatives H 1n provided δ 2 n nh p/2 → ∞.
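The resulting decision rule is simple to implement once I n and ω n are computed; a minimal sketch:

```python
from statistics import NormalDist

def significance_test(I_n, omega_n, n, h, p, alpha=0.05):
    """Asymptotic alpha-level test of Corollary 1:
    reject H0 when n * h^(p/2) * I_n / omega_n > z_(1-alpha)."""
    t_stat = n * h ** (p / 2) * I_n / omega_n
    z = NormalDist().inv_cdf(1 - alpha)   # standard normal quantile
    return t_stat, t_stat > z
```

The test is one-sided because, under the alternative, the studentized statistic diverges to +∞.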

Bootstrap Critical Values
It is known that asymptotic theory may be inaccurate for small and moderate samples when using smoothing methods. Hence, as in e.g. Härdle and Mammen (1993) or Delgado and González Manteiga (2001), we consider a wild bootstrap procedure to approximate the quantiles of our test statistic. Resamples are obtained from Y i ⋆ = r̂ i + η i û i , where the η i are drawn independently of the original sample. The η i could for instance follow the two-point law of Mammen (1993). When the scheme is repeated many times, the bootstrap critical value z ⋆ 1−α,n at level α is the empirical (1 − α)-th quantile of the bootstrapped test statistics. The asymptotic validity of this bootstrap procedure is guaranteed by the following result.
Theorem 2. Assume the conditions of Theorem 1 hold. Moreover, assume inf w∈S W f (w) > 0 and h/g 2 → 0. Then for I * n equal to any of I * n and Ĩ * n , P( nh p/2 I * n /ω * n ≤ t | Z ) − Φ (t) = o p (1) for any t ∈ R, where Φ (·) is the standard normal distribution function.
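The bootstrap scheme described above can be sketched as follows; the resampling equation Y i ⋆ = r̂ i + η i û i and the recompute_stat callback are our own assumptions standing in for the paper's exact algorithm:

```python
import numpy as np

def mammen_eta(n, rng):
    """Mammen's two-point multipliers: mean 0, variance 1, third moment 1."""
    s5 = np.sqrt(5.0)
    a, b = (1 - s5) / 2, (1 + s5) / 2
    p = (s5 + 1) / (2 * s5)               # P(eta = a)
    return np.where(rng.random(n) < p, a, b)

def wild_bootstrap_stats(r_hat, u_hat, recompute_stat, B, seed=0):
    """Wild-bootstrap distribution of the test statistic.

    recompute_stat(Y_star) must rebuild the statistic from the bootstrap
    responses Y_i* = r_hat_i + eta_i * u_hat_i (hypothetical callback)."""
    rng = np.random.default_rng(seed)
    n = len(u_hat)
    stats = np.empty(B)
    for b in range(B):
        eta = mammen_eta(n, rng)
        stats[b] = recompute_stat(r_hat + eta * u_hat)
    return stats

def bootstrap_critical_value(stats, alpha):
    """Empirical (1 - alpha) quantile of the bootstrapped statistics."""
    return np.quantile(stats, 1 - alpha)
```

Mammen's two-point law matches the first three moments of the residual distribution, which is why it is a common default for wild bootstrap.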

4 Monte Carlo Study
We investigated the small sample behavior of our test and studied its performance relative to alternative tests. We generated data from a regression model in which W follows a two-dimensional standard normal, X independently follows a q-variate standard normal, ε ∼ N (0, 4), and we set θ = (1, −1) ′ / √ 2. The null hypothesis corresponds to δ = 0, and we considered various forms for d(·) to investigate power.
We only considered the test based on Ĩ n , labelled LMP, as preliminary simulation results showed that it had similar or better performance than the test based on I n . We compared it to the test of Lavergne and Vuong (2000, hereafter LV) and the test of Delgado and González Manteiga (2001, hereafter DGM). The statistic for the latter test is the Cramér-von Mises statistic, and critical values are obtained by wild bootstrapping, as for our own statistic. To compute bootstrap critical values, we used 199 bootstrap replications and the two-point distribution of Mammen (1993). For all tests, each time a kernel appears, we used the Epanechnikov kernel applied to the norm of its argument u, that is 0.75 (1 − ‖u‖ 2 ) 1{‖u‖ < 1}. The bandwidth parameters are set to g = n −1/6 and h = c n −2.1/6 , and we let c vary to investigate the sensitivity of our results to the choice of smoothing parameter. To study the influence of ψ(·) on our test, we considered ψ(x) = l (‖x‖), where l(·) is a triangular or normal density, each with a second moment equal to one. To investigate power, we considered different forms of alternatives as specified by d(·).
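The design above can be sketched in code; since the regression equation itself is not reproduced in the text, the index function m(·) below is a hypothetical placeholder:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel applied to the norm of its argument."""
    r = np.linalg.norm(np.atleast_2d(u), axis=-1)
    return 0.75 * (1.0 - r**2) * (r < 1.0)

def simulate(n, q, delta, d, rng, m=lambda v: v):
    """One Monte Carlo sample: W 2-dim standard normal, X q-variate standard
    normal, eps ~ N(0, 4), theta = (1, -1)' / sqrt(2); m(.) is a placeholder
    for the unreported index function."""
    theta = np.array([1.0, -1.0]) / np.sqrt(2.0)
    W = rng.normal(size=(n, 2))
    X = rng.normal(size=(n, q))
    eps = rng.normal(scale=2.0, size=n)       # Var(eps) = 4
    Y = m(W @ theta) + delta * d(X) + eps
    return W, X, Y

def bandwidths(n, c=1.0):
    """g = n^(-1/6), h = c * n^(-2.1/6), as in the Monte Carlo design."""
    return n ** (-1 / 6), c * n ** (-2.1 / 6)
```

Setting delta = 0 gives draws under the null; choosing d(·) quadratic, linear, or a sine reproduces the families of alternatives studied below.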
We first focus on a quadratic alternative, where d(·) is quadratic in X. Figure 2 reports power curves of the different tests for the quadratic alternative, n = 100, and a nominal level of 10%, based on 2000 replications. We also report the power of a Fisher test based on a linear specification in the components of X. The power of our test, as well as that of the LV test, increases when the bandwidth factor c increases. This is in line with our theoretical findings, though we may expect this relationship to revert for very large bandwidths. Our test always dominates the LV test, as well as the Fisher test and the DGM test, for any choice of c and any dimension q.
The power of all tests decreases when the dimension q increases, but the most notable degradation is for the DGM test. In Figure 3, we let n vary for a fixed dimension q = 5.
The power of all tests improves, but our main qualitative findings are not affected. It is noteworthy that the power advantage of our test compared to the LV test becomes more pronounced as n increases. In Figure 4, we considered a linear alternative d (X) = X ′ β and a sine alternative d (X) = sin (2 X ′ β). Our main findings remain unchanged.
For a linear alternative, the Fisher test is most powerful, as expected. Compared to this benchmark, the loss of power when using our test is moderate for large enough bandwidth factors c. For a sine alternative, our test is more powerful than the Fisher test for c = 2 or 4.
We also considered the case of a discrete X. We generated data from a similar model, where W and ε are generated as before and X is Bernoulli with probability of success p = 0.6. We compared our test to two competitors. The test proposed by Lavergne (2001) is similar to our test, with the main difference that ψ(·) is an indicator function. Simulations reveal that our test outperforms its competitors in many situations, especially when the dimension of the covariates is large.

5 Proofs
Here we provide the proofs of the main results. Technical lemmas are relegated to the Appendix.
In the following, for any integrable function δ(·), u ∈ R q . Moreover, for any index set I not containing i, with cardinality |I|, define the leave-out density estimate consistently with f i , which corresponds to the case where I is the empty set.

Proof of Theorem 1
We first consider the case I n = Ĩ n . Next, we study the difference between Ĩ n and I n and hence deduce the result for I n = I n .
Case I n = Ĩ n . Consider the decomposition of Ĩ n into a leading term I 0n and remainder terms. In Proposition 1 we prove that, under H 0 , I 0n is asymptotically centered Gaussian with variance ω 2 , while in Proposition 2 we prove that, under H 1n , I 0n is asymptotically Gaussian with mean µ and variance ω 2 provided δ 2 n nh p/2 converges to some positive real number. In Propositions 3 and 4 we show that all the remaining terms in the decomposition of I n are asymptotically negligible.
Proof. Let us define the martingale array {S n,m , F n,m , 1 ≤ m ≤ n, n ≥ 1}, where S n,1 = 0, S n,m = Σ m i=1 G n,i , and F n,m is the σ-field generated by {W 1 , . . . , W n , X 1 , . . . , X n , Y 1 , . . . , Y m }. Thus nh p/2 I 0n = S n,n . The result follows from the Central Limit Theorem for martingale arrays, see Corollary 3.1 of Hall and Heyde (1980). The conditions required for that corollary, among which V 2 n p −→ ω 2 , are checked in Lemma 2 below. Its proof is provided in the Appendix.
Lemma 2. Under the conditions of Proposition 1, V 2 n p −→ ω 2 and the martingale difference array {G n,i , F n,i , 1 ≤ i ≤ n} satisfies the Lindeberg condition.
Proposition 2. Under the conditions of Theorem 1 and H 1n , if δ 2 n nh p/2 → C with 0 < C < ∞, then nh p/2 I 0n d −→ N (Cµ, ω 2 ).

and let us decompose
By Proposition 1, C 0n d −→ N (0, ω 2 ). As for C n , by repeated application of Fubini's Theorem, the Fourier inversion formula, the Dominated Convergence Theorem, and Parseval's identity, we obtain C n = Cµ n + O p (δ n n 1/2 h p/2 ) p −→ Cµ, and the desired result follows.
The proofs of the above propositions follow the ones in Lavergne and Vuong (2000).
For illustration, we provide in the Appendix the proofs of the first statements of each proposition.
Case I n = I n . We have a decomposition of the difference between I n and Ĩ n into terms V 1n , V 2n , and V 3n . Hence, to show that I n has the same asymptotic distribution as Ĩ n , it is sufficient to investigate the behavior of V 1n to V 3n . Using Y i = r i + u i , it is straightforward to identify the dominating terms in V 1n , V 2n , and V 3n . It then follows that nh p/2 ( Ĩ n − I n ) = O p (h p/2 g −p ), which is negligible if h/g 2 → 0.
The asymptotic irrelevance of the above diagonal terms thus requires a more restrictive relationship between the bandwidths h and g. For the sake of comparison, recall that Fan and Li (1996) impose h p+q g −2p → 0, while Lavergne and Vuong (2000) require only h p+q g −p → 0. Since we do not smooth over the covariates X, we are able to further relax the restriction between the two bandwidths.

Proof of Corollary 1
It suffices to prove ω 2 n − ω 2 = o p (1), with ω 2 n any of ω 2 n or ω̃ 2 n . First we consider the case ω 2 n = ω 2 n . A direct approach would consist in replacing û i f̂ i and û j f̂ j by their definitions, writing ω 2 n as a U-statistic of order 6, and studying its mean and variance. A shorter approach is based on empirical process tools; the price to pay is the stronger bandwidth condition of Lemma 3.
Lemma 3. Under Assumption 1, if r(·)f (·) ∈ U p , L(·) is a function of bounded variation, g → 0, and n 7/8 g p / ln n → ∞, then
The proof relies on the uniform convergence of empirical processes and is provided in the Appendix. Now proceed as follows: square Equation (3), replace û 2 i f̂ 2 i in the definition of ω 2 n , and use Lemma 3 to deduce that

Elementary calculations of mean and variance yield
and thus ω 2 n − ω 2 = o p (1). To deal with ω̃ 2 n , note that ω̃ 2 n − ω 2 n consists of "diagonal" terms plus a term which is O (n −1 ω̃ 2 n ). By tedious but rather straightforward calculations, one can check that these diagonal terms are each of the form n −1 g −p times a U-statistic which is bounded in probability. Hence ω̃ 2 n − ω 2 n = o p (1).

Proof of Theorem 2
Let Z denote the sample (Y i , W i , X i ), 1 ≤ i ≤ n. Since the limit distribution is continuous, it suffices to prove the result pointwise, by Polya's theorem. Hence we show that ∀t ∈ R, P( nh p/2 I * n /ω * n ≤ t | Z ) − Φ (t) = o p (1). First, we consider the case I * n = Ĩ * n , which we decompose into a leading term I * 0n and remainder terms. Now let D * n = Ĩ * n − I * 0n . It thus suffices to prove asymptotic normality of the leading term and negligibility of D * n . The first result is stated below.
Proposition 5. Under the conditions of Theorem 2, conditionally on the observed sample, the statistic nh p/2 I * 0n /ω n,F L converges in law to a standard normal distribution.
Proof. We proceed as in the proof of Proposition 1 and check the conditions for a CLT for martingale arrays, see Corollary 3.1 of Hall and Heyde (1980). Define the martingale array {S * n,m , F * n,m , 1 ≤ m ≤ n, n ≥ 1}, where F * n,m is the σ-field generated by Z, η 1 , . . . , η m , S * n,1 = 0, and S * n,m = Σ m i=1 G * n,i .
Next we show (4). First we need the following result, which is proved in the Appendix.
We next have to bound D * n = I * n,LV − I * 0n . For this, let us decompose û i − u i and replace all such differences appearing in the definition of D * n . First, let us look at I * 3 , which does not contain any bootstrap variable η. Next, we use the decomposition of Y * i and further replace terms like r̂ i − r i . Among the terms I * 3,1 to I * 3,6 , the term I * 3,1 can be easily handled with existing results: nh p/2 I * 3,1 = nh p/2 O p (g 2s ) + o p (1) by Proposition 7 of Lavergne and Vuong (2000). For the other five terms, we have to control the density estimates appearing in the denominators. For this purpose, let us introduce the notation ∆ f I . Then we obtain, for instance, I * 3,5 = I * 3,5,1 + I * 3,5,2 + I * 3,5,3 + I * 3,5,4 + I * 3,5,5 + I * 3,5,6 .
Next, if we consider for instance I * 3,5,1 , which contains only terms like f −1 i appearing from the decomposition (6), we obtain I * 3,5,1 = I * 3,5,1,1 + I * 3,5,1,2 + I * 3,5,1,3 + I * 3,5,1,4 , where the terms I * 3,5,1,2 to I * 3,5,1,4 are called "diagonal terms". Such terms require more restrictions on the bandwidths. Next, the terms containing factors like (∆ f I i ) −1 produced by the decomposition (6) can be treated as in Propositions 8 to 11 of Lavergne and Vuong (2000). Finally, given that I is finite and of fixed cardinality, the product Π k∈I L nik can be easily handled by taking absolute values. Now let us investigate the diagonal term I * 3,5,1,2 . To prove that the term I * 3,5,1,2 = o p (nh p/2 ), it suffices to prove E |I * 3,5,1,2 | = o(nh p/2 ), and this latter rate is implied by the condition h/g 2 = o(1). This additional condition on the bandwidths is not surprising, as the bootstrapped statistic introduces "diagonal" terms as in Fan and Li (1996), which indeed require the condition h/g 2 → 0.

I * 2,1 = I * 2,1,1 + · · · + I * 2,1,8 .
Handling one problem at a time, let us notice that I * 2,1,1 is a zero-mean U-statistic of order three with kernel H n . Using the Hoeffding decomposition of I * 2,1,1 into degenerate U-statistics, it is easy to check that the third and second order projections are small. The first order projection, given that ‖ψ‖ ∞ < ∞, is similar to the term ξ 1 bounded in the proof of Proposition 5 of Lavergne and Vuong (2000).
Finally, let us briefly consider the case I * n = Ĩ * n . As in the decomposition (2), we have, for j ∈ {1, 2, 3}, terms V * jn obtained by replacing the Y i s by the Y * i s in the V jn s. All these terms can be handled by arguments similar to the ones detailed above. The proof of Theorem 2 is now complete.

By elementary calculations, this implies that ε −2 is positive and has finite expectation. The desired result follows.
The following result, known as Bochner's Lemma (see Theorem 1.1.1 of Bochner (1955)), will be repeatedly used in the following. We recall it for the sake of completeness.
Lemma 5. For any function l (·) ∈ U p and any integrable kernel K (·), sup w∈R p | h −p ∫ K ((w − u)/h) l (u) du − l (w) ∫ K (t) dt | → 0 as h → 0.
In the following we provide the proofs of the rates for the remaining terms in the decomposition of I n , see Propositions 3 and 4. For this purpose, we use the following decomposition for U-statistics, which can be found in Lavergne and Vuong (2000): if U n = [1/n (m) ] Σ a H n (Z i 1 , . . . , Z i m ), then U n can be expanded over pairs of sets ∆ 1 and ∆ 2 of ordered positions of length c, where the i's positions in ∆ 1 coincide with the j's positions in ∆ 2 and are pairwise distinct otherwise. We then bound E [U 2 n ] using the quantities ξ c = Σ (c) I (∆ 1 , ∆ 2 ) and the fact that, by Cauchy's inequality, each summand is controlled through the common observations, where Z c denotes the common Z i 's.
Proof of Proposition 3. After bounding the ψ ij 's by ‖ψ‖ ∞ , the arguments are very similar to those used in Lavergne and Vuong (2000). We prove only the first statement.
(i) I 1,3 is a U-statistic with kernel H n (Z i , Z j , Z l ) = u i f i u l L njl K nij ψ ij . We need to bound the ξ c , c = 0, 1, 2, 3.
Let J(δ, G, L 2 ) denote the uniform entropy integral, where the supremum is taken over all finitely discrete probability distributions Q on the space of the observations, and ‖G‖ 2 denotes the norm of G in L 2 (Q). Let Z 1 , · · · , Z n be a sample of independent observations and consider the empirical process indexed by G. If the covering number N(ε, G, L 2 (Q)) is of polynomial order in 1/ε, there exists a constant c > 0 such that J(δ, G, L 2 ) ≤ cδ ln(1/δ) for 0 < δ < 1/2. Now if Eγ 2 < δ 2 EG 2 for every γ ∈ G and some 0 < δ < 1, and EG (4υ−2)/(υ−1) < ∞ for some υ > 1, then under mild additional measurability conditions, Theorem 3.1 of van der Vaart and Wellner (2011) implies the bound (8), where ‖G‖ 2 2 = EG 2 and the O p (1) term is independent of n. Note that the family G could change with n, as long as the envelope is the same for all n. We apply this result to the family of functions G = {Y L((W − w)/g) : w ∈ R p } for a sequence g that converges to zero and the envelope G(Y, W ) = |Y | sup w∈R p |L(w)|.
Proof of Lemma 4. By Lemma 3 and the fact that f (·) is bounded away from zero, deduce that sup i |r̂ i − r i | = o p (1). From this, and applying several times the arguments in the proof of Lemma 3, we obtain (n − 1) −1 Σ k≠i (r i − r k ) L nik = o p (1).
On the other hand, where we used again the arguments for ∆ 1i in the proof of Lemma 3 (here with η k u k and |η k | in the place of Y k ) to derive the last rate.