A nearest neighbor estimate of the residual variance

We study the problem of estimating the smallest achievable mean-squared error in regression function estimation. The problem is equivalent to estimating the second moment of the regression function of Y on X ∈ R^d. We introduce a nearest-neighbor-based estimate and obtain a normal limit law for the estimate when X has an absolutely continuous distribution, without any condition on the density. We also compute the asymptotic variance explicitly and derive a non-asymptotic bound on the variance that does not depend on the dimension d. The asymptotic variance does not depend on the smoothness of the density of X or of the regression function. A non-asymptotic exponential concentration inequality is also proved. We illustrate the use of the new estimate through testing whether a component of the vector X carries information for predicting Y.


Introduction
In this paper we study the problem of estimating the smallest achievable mean-squared error in regression function estimation in multivariate problems. We introduce and analyze a nearest-neighbor-based estimate of the second moment of the regression function. The second moment of the regression function is closely tied to the best achievable mean squared error. It is shown that the estimate is asymptotically normally distributed. Remarkably, the asymptotic variance depends only on conditional moments of the regression function and not on its smoothness. Moreover, the non-asymptotic variance is bounded by a constant that is independent of the dimension. We also establish a non-asymptotic exponential concentration inequality. We illustrate these results by studying variable selection. In particular, we construct and analyze a test for deciding whether a component of the observation vector has predictive power.
The formal setup is as follows. Let (X, Y) be a pair of random variables such that X = (X^(1), ..., X^(d)) takes values in R^d and Y is a real-valued random variable with E[Y^2] < ∞. We denote by µ the distribution of the observation vector X, that is, for all measurable sets A ⊂ R^d, µ(A) = P{X ∈ A}. Then the regression function

m(x) = E[Y | X = x]

is well defined for µ-almost all x. The center of our investigations is the functional

L* = E[(m(X) − Y)^2].

The importance of this functional stems from the fact that for each measurable function g : R^d → R one has

E[(g(X) − Y)^2] = L* + E[(m(X) − g(X))^2]

and, in particular,

L* = min_g E[(g(X) − Y)^2],

where the minimum is taken over all measurable functions g : R^d → R. In other words, L* is the minimal mean squared error of any "predictor" of Y based on observing X. L* is often referred to as the residual variance.
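The decomposition above follows from conditioning on X; a short derivation:

```latex
\begin{aligned}
\mathbb{E}\big[(g(X)-Y)^2\big]
  &= \mathbb{E}\big[\big((g(X)-m(X)) + (m(X)-Y)\big)^2\big] \\
  &= \mathbb{E}\big[(g(X)-m(X))^2\big]
     + 2\,\mathbb{E}\big[(g(X)-m(X))(m(X)-Y)\big]
     + \mathbb{E}\big[(m(X)-Y)^2\big] \\
  &= \mathbb{E}\big[(m(X)-g(X))^2\big] + L^*,
\end{aligned}
```

since the cross term vanishes: conditioning on X gives E[(g(X) − m(X))(m(X) − Y) | X] = (g(X) − m(X))(m(X) − E[Y | X]) = 0.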
In regression analysis the residual variance L* is of obvious interest as it provides a lower bound for the performance of any regression function estimator. In this paper we study the problem of estimating L* based on data consisting of independent, identically distributed (i.i.d.) copies of the pair (X, Y). It is convenient to assume that the number of samples is even and that the 2n samples are split into two halves, D_n = {(X_1, Y_1), ..., (X_n, Y_n)} and D'_n = {(X'_1, Y'_1), ..., (X'_n, Y'_n)}, such that the 2n + 1 pairs (X, Y), (X_1, Y_1), ..., (X_n, Y_n), (X'_1, Y'_1), ..., (X'_n, Y'_n) are independent and identically distributed.
An estimator L_n of L* is simply a function of the data D_n, D'_n. We are interested in "nonparametric" estimators of L* that work under minimal assumptions on the underlying distribution. In particular, a desirable feature of any estimate is that it is strongly universally consistent, that is, L_n → L* with probability one, for all possible distributions of (X, Y) with E[Y^2] < ∞. Such estimators may be constructed, for example, by building a strongly universally consistent regression function estimator m_n based on the data D_n (i.e., a function m_n such that E[(m_n(X) − Y)^2 | D_n] → L* with probability one for all distributions) and estimating its mean squared error by

(1/n) Σ_{i=1}^n (m_n(X'_i) − Y'_i)^2.

(For a detailed theory of universally consistent regression function estimation see [15].) However, the rate of convergence of such estimators is determined by the rate of convergence of the mean squared error of m_n, which can be quite slow even under regularity assumptions on the underlying distribution. Estimating the entire regression function m(x) is, intuitively, "harder" than estimating the value of L*. Indeed, nearest-neighbor-based estimators of L* have been constructed and analyzed by Devroye, Ferrario, Györfi, and Walk [6], Devroye, Schäfer, Györfi, and Walk [10], Evans and Jones [12], Liitiäinen, Corona, and Lendasse [17], [18], Liitiäinen, Verleysen, Corona, and Lendasse [19], and Ferrario and Walk [13]. These estimates have been shown to have a faster rate of convergence (under some natural assumptions) than estimates based on estimating the error of consistent regression function estimators. Moreover, the estimate in [6] is strongly universally consistent.
In this paper we introduce yet another universally consistent nearest-neighbor-based estimator of L*. The advantage of this estimator, apart from sharing the fast rates of convergence of previously defined estimators, is that its random fluctuations may be bounded by dimension- and distribution-independent quantities. In particular, we prove a central limit theorem and a distribution-free upper bound on the variance of the new estimator, which show that it is concentrated around its expected value in an interval of width O(1/√n), independently of the dimension. The established concentration property is crucial in a variable-selection procedure that we discuss as an application. In particular, we design a test for deciding whether exclusion of a certain component of X increases L* or not.
The paper is organized as follows. In Section 2 we introduce a novel estimate of L* and establish some of its properties, such as asymptotic normality and a non-asymptotic concentration inequality. The central limit theorem holds without any smoothness condition on the regression function, and the asymptotic variance depends only on the conditional moments of Y (Theorem 1). We prove a non-asymptotic bound on the variance that does not depend on the dimension of X (Theorem 2), and show an exponential concentration inequality for the centered estimate (Theorem 3). All these results are universal in the sense that we only assume that X has a density and Y is bounded.
In Section 3 we briefly describe how a method based on the results of Section 2 may be relevant for variable selection. Finally, the proofs are presented in Section 4.

A nearest-neighbor based estimate and its asymptotic normality
Denoting the second moment of the regression function by

S* = E[m(X)^2],

we have

L* = E[Y^2] − S*,

and therefore estimating L* is essentially equivalent to estimating S* (as the "easy" part E[Y^2] may be estimated by, e.g., (1/n) Σ_{i=1}^n Y_i^2, whose behavior is well understood).
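The identity L* = E[Y^2] − S* again follows by conditioning; a one-line derivation:

```latex
L^* = \mathbb{E}\big[(m(X)-Y)^2\big]
    = \mathbb{E}[Y^2] - 2\,\mathbb{E}[Y\,m(X)] + \mathbb{E}[m(X)^2]
    = \mathbb{E}[Y^2] - \mathbb{E}[m(X)^2]
    = \mathbb{E}[Y^2] - S^*,
```

using that E[Y m(X)] = E[E[Y | X] m(X)] = E[m(X)^2].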
Next we introduce a nearest-neighbor-based estimator of S*. Based on the data D_n, we start by constructing a first-nearest-neighbor (1-NN) regression function estimator as follows. Let X_{1,n}(x) be the first nearest neighbor of x among X_1, ..., X_n (with respect to the Euclidean distance in R^d) and let Y_{1,n}(x) be its label. (In order to rigorously define the nearest neighbor, we assume that ties are broken in favor of points with smaller index. Since we assume the distribution of X to be absolutely continuous, this issue is immaterial, as ties occur with probability zero.) The 1-NN estimator of the regression function m is defined as

m_n(x) = Y_{1,n}(x).

The proposed estimate of S* is

S_n = (1/n) Σ_{i=1}^n Y'_i m_n(X'_i).

By a straightforward adjustment of the arguments of Devroye, Ferrario, Györfi, and Walk [6], one may show that S_n is a strongly universally consistent estimate of S*, that is,

lim_{n→∞} S_n = S*

with probability one for any distribution of (X, Y) with E[Y^2] < ∞. Note that the consistent functional estimate S_n is based on a non-consistent regression function estimate m_n.
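A minimal computational sketch of the estimate S_n, using brute-force nearest-neighbor search; the regression function m(x) = x^(1), the noise level, and the sample sizes are illustrative assumptions, not part of the paper:

```python
import numpy as np

def nn_second_moment_estimate(X, Y, Xp, Yp):
    """1-NN estimate S_n of S* = E[m(X)^2].

    (X, Y) plays the role of D_n (used to build the 1-NN regression
    estimate m_n) and (Xp, Yp) the role of the independent half D'_n:
        S_n = (1/n) * sum_i Yp[i] * m_n(Xp[i]),
    where m_n(x) is the label of the nearest neighbor of x in X.
    """
    X, Xp = np.asarray(X, float), np.asarray(Xp, float)
    # pairwise squared Euclidean distances between D'_n and D_n points
    d2 = ((Xp[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)  # index of the first nearest neighbor
    return float(np.mean(np.asarray(Yp) * np.asarray(Y)[nn]))

# Illustration: m(x) = x^(1) with X uniform on [-1, 1]^2, so
# S* = E[(X^(1))^2] = 1/3.
rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.uniform(-1, 1, (n, d)); Y = X[:, 0] + rng.normal(0, 0.1, n)
Xp = rng.uniform(-1, 1, (n, d)); Yp = Xp[:, 0] + rng.normal(0, 0.1, n)
S_n = nn_second_moment_estimate(X, Y, Xp, Yp)  # close to 1/3 for large n
```

Note that, exactly as in the text, the averaging sample D'_n is independent of the sample D_n defining the 1-NN rule.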
Next we establish asymptotic normality of S_n under the condition that the response variable Y is bounded. In order to describe the asymptotic variance, we introduce a dimension-dependent constant α(d) as follows.
Let B_{x,r} denote the closed ball of radius r > 0 centered at x in R^d and let λ denote the Lebesgue measure on R^d. Let V be a random vector uniformly distributed in B_{0,1}. Define 1 = (1, 0, 0, ..., 0) ∈ R^d and let B = B_{1,1} ∪ B_{V,‖V‖}. Introduce the corresponding random variable and define α(d) by (2.1). Assume that µ has a density and that there exists a constant L such that |Y| ≤ L, and define the asymptotic variance σ² as in (2.2). The dependence of the asymptotic variance on the dimension d is weak, entering only through the constant α(d). Given X_1, ..., X_n, Devroye, Györfi, Lugosi, and Walk [8] considered the probability measures of the Voronoi cells. They proved that the asymptotic variance of n times the probability measure of a Voronoi cell is equal to α(d) − 1. Thus, this asymptotic variance is universal in the sense that it does not depend on the underlying density. A few values are α(1) = 1.5, α(2) ≈ 1.28, α(3) ≈ 1.18. In general, 1 ≤ α(d) ≤ 2 and α(d) → 1 exponentially fast as d → ∞. Thus, by (2.2) we have σ² ≤ 3L^4, and therefore Theorem 1 implies that lim sup_{n→∞} n Var(S_n) ≤ 3L^4.
The next theorem shows that, up to a constant factor, this bound holds non-asymptotically.
Theorem 2. Assume that µ has a density and that |Y| < L. Then for all n ≥ 1,

Var(S_n) ≤ c L^4 / n

for a universal constant c > 0. The next result is a non-asymptotic exponential inequality that extends Theorem 2. It implies that, for all t > 0,

P{ √n |S_n − E[S_n]| > t } ≤ C e^{−t/(cL^2)}

for a universal constant c > 0. It is an interesting open question whether the right-hand side can be improved to e^{−(t/(cL^2))^2}. This would give a non-asymptotic analog of the central limit theorem of Theorem 1.
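A quick Monte Carlo experiment consistent with the dimension-free variance bound; the model, the sample sizes, and the comparison against the crude asymptotic bound 3L^4 are illustrative assumptions:

```python
import numpy as np

def s_hat(X, Y, Xp, Yp):
    # 1-NN estimate of S* = E[m(X)^2] (brute-force nearest neighbors)
    d2 = ((Xp[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(Yp * Y[d2.argmin(axis=1)]))

rng = np.random.default_rng(0)
n, d, reps = 200, 3, 200
# nominal bound on |Y| for m(x) = x^(1) plus N(0, 0.1) noise
# (the Gaussian noise is unbounded, so this is approximate)
L = 1.0 + 3 * 0.1
vals = []
for _ in range(reps):
    X = rng.uniform(-1, 1, (n, d)); Y = X[:, 0] + rng.normal(0, 0.1, n)
    Xp = rng.uniform(-1, 1, (n, d)); Yp = Xp[:, 0] + rng.normal(0, 0.1, n)
    vals.append(s_hat(X, Y, Xp, Yp))
n_var = n * np.var(vals)  # empirical n * Var(S_n)
# n_var stays bounded (here well below 3 * L**4), independently of d
```

Repeating the experiment with larger d leaves the empirical value of n·Var(S_n) of the same order, in line with the dimension-independence claimed by Theorem 2.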

Theorem 3. Assume that µ has a density and that |Y| ≤ L.

Then for every n ≥ 1 and ε > 0, we have (2.3).

The proofs of Theorems 1, 2 and 3 are presented in Section 4.

Illustration: testing for dimension reduction
In standard nonparametric regression design, one considers a finite number of real-valued features X^(i), i ∈ I ⊂ {1, ..., d}, for predicting the value of a response variable Y. A first question one may try to answer is whether these features suffice to explain Y. In case they do, an estimation method can be applied on the basis of the features already under consideration. Otherwise, more or different features need to be considered. The quality of a subvector {X^(i), i ∈ I} of X is measured by the minimum mean squared error that can be achieved using these features as explanatory variables,

L*(I) = min_g E[(g(X^(i), i ∈ I) − Y)^2],

where the minimum is taken over all measurable functions g. L*(I) depends upon the unknown distribution of (Y, X^(i) : i ∈ I).
Thus, even before a regression function estimate is chosen, one may be interested in estimating L*. For possible dimensionality reduction, one needs, in general, to test the hypothesis

L*(I) = L*

for a particular (proper) subset I of {1, ..., d}. A natural way of approaching this testing problem is by estimating both L* and L*(I), and accepting the hypothesis if the two estimates are close to each other (De Brabanter, Ferrario and Györfi [5]).
Thus, dropping the component X^(d) from the observation vector X = (X^(1), ..., X^(d)) leads to the observation vector

X̃ = (X^(1), ..., X^(d−1)).

Writing S̃* for the second moment of the regression function of Y on X̃, the null hypothesis L̃* = L* is equivalent to

S̃* = S*.   (3.2)

We propose to approach this testing problem by considering the nearest-neighbor estimate defined in Section 2. Let S_n be the estimate of S* using the sample D_2n = {(X_1, Y_1), ..., (X_2n, Y_2n)}.
Assume that an independent sample of size 2n is available. We use this sample to construct an estimate S̃_n of S̃*: S̃_n is defined as the nearest-neighbor estimate computed from the sample obtained by replacing each observation vector with its first d − 1 components. The proposed test is based on the test statistic

T_n = S_n − S̃_n

and accepts the null hypothesis (3.2) if and only if

T_n ≤ a_n,

where the critical value a_n is defined through an increasing unbounded sequence ω_n such that a_n → 0. Under the alternative hypothesis, according to the consistency result of Devroye, Ferrario, Györfi, and Walk [6], for bounded Y,

T_n → S* − S̃* > 0 with probability one,   (3.3)

and this convergence is universal, that is, it holds without any conditions. Thus, since a_n → 0, if S* > S̃*, then, with probability one, the test does not make any mistake for sufficiently large n.
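A sketch of the resulting test in code; the data-generating model, the concrete threshold choice a_n = log n / √n, and the sample sizes are illustrative assumptions (the text only requires a_n → 0 with ω_n increasing and unbounded):

```python
import numpy as np

def s_hat(X, Y, Xp, Yp):
    # 1-NN estimate of the second moment S* = E[m(X)^2] (brute force)
    d2 = ((Xp[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(Yp * Y[d2.argmin(axis=1)]))

rng = np.random.default_rng(1)
n, d = 1000, 2

def draw(size):
    # Y depends only on X^(1); the last component carries no information
    X = rng.uniform(-1, 1, (size, d))
    return X, X[:, 0] + rng.normal(0, 0.1, size)

X, Y = draw(2 * n)       # sample of size 2n for S_n (full vector)
Xt, Yt = draw(2 * n)     # independent sample of size 2n for the reduced vector
S_full = s_hat(X[:n], Y[:n], X[n:], Y[n:])
S_reduced = s_hat(Xt[:n, :d-1], Yt[:n], Xt[n:, :d-1], Yt[n:])  # X^(d) dropped
T_n = S_full - S_reduced
a_n = np.log(n) / np.sqrt(n)    # assumed critical value with omega_n = log n
accept_null = bool(T_n <= a_n)  # here the null is true: X^(d) is uninformative
```

Since X^(2) is independent of Y in this simulated model, the null hypothesis holds, T_n fluctuates around zero, and the test accepts.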
Both S_n and S̃_n satisfy the central limit theorem of Theorem 1, with asymptotic variances σ², σ̃² < 3L^4. Since S_n and S̃_n are independent, we have

Var(T_n) = Var(S_n) + Var(S̃_n).

In order to understand the behavior of the test, one needs to study the difference of the biases of the two estimates under the null hypothesis (3.2). In this case we have

E[T_n] = (E[S_n] − S*) − (E[S̃_n] − S̃*).

If m and the density f are Lipschitz continuous and f is bounded away from 0, then, by Devroye, Ferrario, Györfi, and Walk [6],

E[S_n] − S* = O(n^{−2/d}).

Thus, under the null hypothesis (3.2),

E[T_n] = O(n^{−2/d})

for d ≥ 2. Note that for d ≤ 4 the bias is at most of the order of the random fluctuations of the test statistic. However, for d > 4 the bias may dominate. Such a dependence on the dimension is inevitable under fully nonparametric conditions like the ones assumed here.
Under the null hypothesis, (3.4) and (3.5) imply that the probability of error tends to zero. Thus, the test is consistent.
The condition that the density f is bounded away from zero may be avoided at the price of a worse rate of convergence; in particular, this is the case if m is C-Lipschitz and X is bounded. In this case the threshold should be chosen larger. One may prove that the test is not only consistent in the sense that P{T_n > a_n} → 0 under the null hypothesis, but also in the sense that

lim sup_{n→∞} 1_{{T_n > a_n}} = 0 with probability one.

For a discussion and references on the notion of strong consistency we refer the reader to Devroye and Lugosi [9], Biau and Györfi [2], and Gretton and Györfi [14].
The proof of strong consistency under the alternative hypothesis follows simply from (3.3). Under the null hypothesis it follows from Theorem 3. Indeed, Theorem 3 applies with an increasing unbounded sequence ω_n = o(n^{2/d}). Then, under the null hypothesis, the error probabilities are summable, and so the Borel-Cantelli lemma implies that the test makes an error only finitely many times, almost surely.
Remark. In applications, one would like to test not only whether a given component of X carries predictive information, but rather to test the same for each of the d variables.
In such cases, one faces a multiple testing problem with d dependent tests. In order to analyze such multiple testing procedures, say, by the Bonferroni approach, one needs uniform control over the fluctuations of the test statistics. Here the non-asymptotic concentration inequality of Theorem 3 is particularly useful.

Proofs
In the proofs below we use two lemmas on the measure of Voronoi cells. Let A_n(X_1) denote the Voronoi cell of X_1, that is, the set of points whose nearest neighbor among X_1, ..., X_n is X_1.

Lemma 1. If µ has a density, then for each integer k ≥ 1 there is a constant c_k such that

E[(nµ(A_n(X_1)))^k] ≤ c_k for all n.

Proof. Devroye, Györfi, Lugosi, and Walk [8] proved that there exists a positive constant c_k such that nµ(A_n(X_1)) converges in distribution to a random variable Z whose k-th moment is bounded by c_k. This lemma provides the corresponding non-asymptotic bound. We show that

E[µ(A_n(X_1))^k] = P{X_{n+1}, ..., X_{n+k} ∈ A_n(X_1)} ≤ P{X_{n+1}, ..., X_{n+k} are the nearest neighbors of X_1 among X_2, ..., X_{n+k}},

which implies the stated bound. Recall that B_{x,r} denotes the closed ball of radius r > 0 centered at x, and note that (4.1) follows from comparing the right-hand sides of the two equations above: on the one hand, while on the other hand. □

Lemma 2. (Devroye, Györfi, Lugosi, and Walk [8]) Assume that µ has a density. Then for µ-almost all x, where α(d) is defined in (2.1).

Proof of Theorem 2
We prove the variance bound of Theorem 2 first. The proof relies on the following version of the Efron-Stein inequality; see, for example, [4, Theorem 3.1].

Lemma 3. (Efron-Stein inequality)
Let Z = (Z_1, ..., Z_n) be a collection of independent random variables taking values in some measurable set A and denote by Z^(i) = (Z_1, ..., Z_{i−1}, Z_{i+1}, ..., Z_n) the collection with the i-th random variable dropped. Let f : A^n → R and g : A^{n−1} → R be measurable real-valued functions. Then

Var(f(Z)) ≤ Σ_{i=1}^n E[(f(Z) − g(Z^(i)))^2].

By the decomposition

S_n − E[S_n] = (S_n − E[S_n | D_n]) + (E[S_n | D_n] − E[S_n]),

we have that

Var(S_n) = E[Var(S_n | D_n)] + Var(E[S_n | D_n]).

Conditionally on D_n, S_n is an average of independent, identically distributed (i.i.d.) random variables bounded by L^2, and therefore

Var(S_n | D_n) ≤ L^4 / n.

Notice that we may write

E[S_n | D_n] = ∫ m(x) m_n(x) µ(dx) =: L_n.

Then

Var(S_n) ≤ L^4/n + Var(L_n).

Considering L_n as a function of the n i.i.d. pairs (X_i, Y_i)_{i=1}^n, we may use the Efron-Stein inequality to bound the variance of L_n. Define L_n^(j) as L_n when (X_j, Y_j) is omitted from the sample. By Lemma 3,

Var(L_n) ≤ Σ_{j=1}^n E[(L_n − L_n^(j))^2].

Let {A_n(X_2), ..., A_n(X_n)} be the Voronoi partition when X_1 is omitted from the sample. Then

Thus, Lemma 1 implies
and therefore leads to the desired bound.

Proof of Theorem 1

We prove Theorem 1 by showing that, for any u, v ∈ R, the corresponding distribution functions converge, where Φ denotes the standard normal distribution function. Györfi and Walk [16] proved the first of these limit relations. Thus, (4.2) holds if (4.3) and (4.4) hold.

Proof of (4.4).
We start with the decomposition Next we apply a Berry-Esseen type central limit theorem (see Theorem 14 in Petrov [20]). For a universal constant c > 0, we have Since we have We need to show that in probability and We use this to prove (4.7). Indeed, Thus, and so To complete the proof of (4.7), it suffices to show that the sum above converges to zero as n → ∞. To this end, note that Lemma 1 implies that and furthermore It remains to show that Fix any ε > 0 and choose a bounded continuous function M_2 such that Then, with The first term on the right-hand side converges to 0 by the dominated convergence theorem, since, by Lemma 6.1 in [15], To bound the second term, we introduce some notation. A set C ⊂ R^d is a cone of angle π/3 centered at 0 if there exists an x ∈ R^d with ‖x‖ = 1 such that Let γ_d be the minimal number of cones C_1, ..., C_{γ_d} of angle π/3 centered at 0 such that their union covers R^d. The second term on the right-hand side of (4.10) is bounded by by Lemma 6.3 in [15]. Thus, (4.9) is proved and hence so is (4.7). For the proof of (4.8), we have that Similarly, the derivation for (4.7) implies that and so (4.8) is proved, too. Thus, These relations imply (4.4).
Proof of (4.3). We prove (4.3) by a slight extension of the proof of Theorem 2. Define L_n^(j) as L_n when X_j is dropped. As in the proof of Theorem 2, Then and so where X_{2,n}(x) denotes the second nearest neighbor of x among X_1, ..., X_n. Therefore, by (2.2). Hence, As is well known, for a real-valued random variable Z, by Hölder's inequality, One has as n → ∞, where the latter can be shown as the limit relation (4.9). Furthermore, by (2.2) and Lemma 1. With the notation Condition (ii) of Theorem 4 follows from (2.2), Lemma 1, and Jensen's inequality: and We have that (4.18) follows from The expression on the right-hand side converges to zero. To show this, fix an arbitrary ε > 0 and choose a decomposition m = m* + m** such that m* is Lipschitz continuous with bounded support and E[m**(X)^2] < ε. Then it suffices to show the limit relation for m*. But this follows from the fact that diam(A_n(X_1)) → 0 in probability (Devroye, Györfi, Lugosi, and Walk [8, Section 5]). Lemma 2 implies the limiting behavior of the moments of nµ(A_n(X_1)), verifying (4.16).

Proof of Theorem 3
As we mentioned in the proof of (4.4), given D_n, S_n is an average of i.i.d. random variables bounded by L^2. Therefore, by Hoeffding's inequality, one has For the term V_n, we apply an extension of the Efron-Stein inequality to centered higher moments, which is a slight modification of Theorem 15.5 in Boucheron et al. [4]:

Lemma 4. Let Z = (Z_1, ..., Z_n) be a collection of independent random variables taking values in some measurable set A and denote by Z^(i) = (Z_1, ..., Z_{i−1}, Z_{i+1}, ..., Z_n) the collection with the i-th random variable dropped. Let f : A^n → R be a measurable real-valued function and let g_i : A^{n−1} → R be obtained from f by dropping the i-th argument, i = 1, ..., n. Then for any integer q ≥ 1, with a universal constant c < 5.1.