Convergence of functional k-nearest neighbor regression estimate with functional responses

Abstract: Let (X_1, Y_1), ..., (X_n, Y_n) be independent and identically distributed random elements taking values in F × H, where F is a semi-metric space and H is a separable Hilbert space. We investigate the rates of strong (almost sure) convergence of the k-nearest neighbor estimate. We give two convergence results, assuming a finite moment condition and an exponential tail condition on the noise respectively, with the latter requiring less stringent conditions on k for convergence.


Introduction
Let (F, d(·,·)) be a semi-metric space, (H, ||·||) a separable Hilbert space, and let (X, Y), (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be independent identically distributed F × H-valued random pairs. In regression analysis, one usually seeks an estimate of the regression function m(x) = E(Y | X = x) based on the n data pairs.
In the literature, two related classes of nonparametric estimates have been proposed. The first is the Nadaraya-Watson estimate, or kernel estimate [20,16], with the well-known drawback that it ignores the local denseness/sparseness of the data and uses a fixed bandwidth parameter over the entire predictor space. The k-nearest neighbor (k-NN) method addresses this problem by using an adaptive neighborhood whose size depends on the distances from the point of interest to its neighbors [5,4,13].
In the classical setting, the observation pairs reside in Euclidean spaces. In particular, F = R^d and H = R is the most common and most studied case in the statistical literature. With the increasing interest, in many fields of statistics, in observations that are curves, such as speech recordings, weather data, and commodity prices, functional regression analysis has risen to the center stage of statistical research as an extension of the classical setting. Two major approaches exist for functional data analysis. The parametric modeling approach was masterfully documented in the monograph [19], and the nonparametric approach was proposed in the pioneering work [9] and also popularized by the book [11]. Another nonparametric approach is based on the reproducing kernel Hilbert space framework [18,15].
For some applications, the dependent variable takes values in a more general space than a finite-dimensional Euclidean space. For example, one might predict annual precipitation using temperature measurements [19], or predict future hourly electricity consumption based on past history [1]. In this note we investigate the convergence rates of the functional k-NN estimate when the regression output takes values in a general separable Hilbert space H. Although it is conceptually straightforward to apply the k-NN method in this context, the demonstration of its asymptotic properties poses technical difficulties due to the functional responses.
This work can be regarded as an extension of [3], where the k-NN method in functional regression with scalar responses is studied. For functional responses, the theoretical investigation involves extra complications. In addition, we use a slightly more general setup (in terms of the weights v_ni defined in the next section) and also emphasize the role of the assumption on the errors.
During the final stage of preparation of this manuscript, the author learned that Dr. Frederic Ferraty and his collaborators have recently obtained corresponding results with functional responses, although in the context of Nadaraya-Watson kernel regression. On the one hand, they used a stronger assumption on the noise (similar to our Assumption 4 below), while we also obtain rates under a finite moment assumption (as in our Assumption 3). On the other hand, they studied inference using the bootstrap, while we do not investigate inference problems here.

Estimation and rates of convergence
Consider the simple additive noise model Y = m(X) + ε, where ε takes values in H, has mean zero (in the sense of the Bochner integral, see [14]), and is independent of the covariate X. Given n independent observations D_n = {(X_1, Y_1), ..., (X_n, Y_n)}, the k-NN estimate at any x ∈ F is defined by

m̂(x) = ∑_{i=1}^{n} v_ni Y_i,    (2.1)

where (v_n1, ..., v_nn) is a (possibly random) probability vector. Note that we consider estimation and convergence at a fixed x and thus we sometimes omit explicitly stating the fixed covariate; for example, a nearest neighbor always refers to a nearest neighbor of the fixed x. Two specific examples of v_ni follow.

Example 1.
Take v_ni = a_nj if X_i is the j-th nearest neighbor of x, where a_n1 ≥ a_n2 ≥ ··· ≥ a_nn is a deterministic probability vector, thus putting more weight in (2.1) on data closer to x. Setting a_nj = 1/k if j ≤ k and 0 otherwise gives back the simple k-NN estimate. We should note that even in this simplest case, v_ni depends not only on X_i, since all X_j, j ≤ n, together determine the identities of x's nearest neighbors, which leads to some complications in the theoretical analysis.
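To make the estimate concrete, the following is a minimal sketch (not taken from the paper) of the simple k-NN estimate with functional responses. Curves are represented as vectors of values on a common grid, the discretized L2 distance plays the role of the semi-metric d, and the names d_L2 and knn_estimate are illustrative choices.

```python
import numpy as np

def d_L2(x1, x2, dx):
    """Discretized L2 semi-metric between two curves sampled on a grid with spacing dx."""
    return np.sqrt(np.sum((x1 - x2) ** 2) * dx)

def knn_estimate(x0, X, Y, k, dx):
    """Simple k-NN estimate m_hat(x0) = (1/k) * sum of the Y_i whose X_i are the
    k curves nearest to x0; the responses Y_i may themselves be curves (vectors)."""
    dists = np.array([d_L2(x0, Xi, dx) for Xi in X])
    idx = np.argsort(dists)[:k]        # indices of the k nearest neighbors of x0
    return Y[idx].mean(axis=0)         # uniform weights v_ni = 1/k on the k neighbors

# Toy usage: n covariate curves X_i and functional responses Y_i = m(X_i) + noise.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
dx = grid[1] - grid[0]
n = 200
X = np.array([np.sin(2 * np.pi * rng.uniform(1.0, 3.0) * grid) for _ in range(n)])
Y = np.array([np.cumsum(Xi) * dx + 0.05 * rng.standard_normal(grid.size) for Xi in X])
print(knn_estimate(X[0], X, Y, k=15, dx=dx).shape)   # -> (50,), an estimated response curve
```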
Example 2.
Take v_ni = K(d(X_i, x)/H) / ∑_{j=1}^{n} K(d(X_j, x)/H), where K is a kernel function and H is the distance from x to its k-th nearest neighbor among X_1, ..., X_n. Mathematically, H = min{h > 0 : ∑_{i=1}^{n} 1{d(X_i, x) ≤ h} ≥ k}.
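As with Example 1, a small illustrative sketch (again not the author's code) of these kernel-based k-NN weights follows. The particular kernel below is just one possible choice that is compactly supported and bounded away from zero on [0, 1], the situation assumed in Corollary 1 later on.

```python
import numpy as np

def kernel_knn_weights(dists, k, K=lambda u: np.where(u <= 1.0, 1.0 - 0.5 * u ** 2, 0.0)):
    """Example 2 weights: v_ni = K(d(X_i, x)/H) / sum_j K(d(X_j, x)/H), where H is
    the distance from x to its k-th nearest neighbor and dists[i] = d(X_i, x)."""
    H = np.sort(dists)[k - 1]      # distance of the k-th nearest neighbor of x
    w = K(dists / H)               # zero for points outside the ball B(x, H)
    return w / w.sum()

# Usage: plug the weights into m_hat(x) = sum_i v_ni * Y_i from (2.1).
dists = np.abs(np.random.default_rng(1).standard_normal(100))
v = kernel_knn_weights(dists, k=10)
print(round(v.sum(), 6), int((v > 0).sum()))   # weights sum to 1; about k of them are nonzero
```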

Naturally, we need the following assumption on the regression function to obtain meaningful rates of convergence.

Assumption 1. m is bounded and Lipschitz continuous at x, that is, there exists a constant C > 0 such that ||m(x′) − m(x)|| ≤ C d(x′, x) for all x′ ∈ F.
The Lipschitz condition only needs to be satisfied locally on an open neighborhood of the fixed x.
In the following theoretical investigations, we directly take v_n1 ≥ v_n2 ≥ ··· ≥ v_nn. For our two examples above, this amounts to assuming that the n data pairs have already been ordered according to the distance of X_i from x, so that, for example, X_1 is the nearest neighbor of x (ties are broken by comparing indices in the original sequence). We assume such reordering has been performed throughout. We need the following conditions on v_ni.
where the asymptotic orders are in the sense of almost sure convergence. We also require that k/n → 0 and k/ log n → ∞.
Some moment conditions on the norm of the noise are also necessary.

Assumption 3. E||ε||^r < ∞ for some r > 2.

Alternatively, we will consider an exponential tail condition.

Assumption 4. There exist constants C, c > 0 and p > 0 such that P(||ε|| > t) ≤ C exp(−c t^p) for all t > 0.
Although Assumption 4 is much stronger than the finite moment condition in Assumption 3, it is satisfied by many Gaussian processes, whose norm typically exhibits sub-Gaussian tails (see for example the Appendix of [23]) and thus satisfies this assumption with p = 2.
Our convergence results below are stated in terms of the critical quantity φ(h) := P(B(x, h)), the probability that the covariate X falls in the ball B(x, h) of radius h centered at x, which is called the small ball probability. Its importance has been demonstrated in [10,11,8] for functional kernel regression. The quantity φ(h) is closely related to the ε-covering number of the semi-metric space F, which is defined as the smallest number of open balls of radius ε that cover F. A set with finite ε-covering number for all ε > 0 is called totally bounded. For our purpose, since we are interested in the regression function at a fixed x ∈ F, the global property of total boundedness is not necessary. However, if we assume a uniform small ball probability over F, that is, cψ(h) ≤ P(B(x, h)) ≤ Cψ(h) for some positive increasing function ψ independent of x, then total boundedness of F follows automatically. In fact, suppose D(ε) is the maximal number of points x_i ∈ F with d(x_i, x_j) ≥ ε for i ≠ j (the so-called ε-packing number). Since the balls of radius ε/2 around these points are disjoint and each has probability at least cψ(ε/2), we have 1 = P(F) ≥ D(ε) · cψ(ε/2), whence D(ε) is finite and F is totally bounded by the well-known relationship between the packing number and the covering number (see for example [24]).
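For readability, the packing bound used in this argument can be displayed as follows; this is merely a restatement of the inequality in the preceding paragraph, with D(ε) denoting the ε-packing number and ψ the uniform small ball function introduced above.

```latex
% Balls of radius epsilon/2 around an epsilon-packing are pairwise disjoint, hence
\[
  1 \;=\; P(F) \;\ge\; \sum_{i=1}^{D(\epsilon)} P\bigl(B(x_i, \epsilon/2)\bigr)
    \;\ge\; D(\epsilon)\, c\, \psi(\epsilon/2)
  \qquad\Longrightarrow\qquad
  D(\epsilon) \;\le\; \frac{1}{c\, \psi(\epsilon/2)} \;<\; \infty .
\]
```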
The main results for k-NN estimates satisfying the above assumptions are the following.

Theorem 1. If Assumptions 1, 2 and 3 hold and
Alternatively, assuming exponential tail decay (Assumption 4), Theorem 2 gives the corresponding convergence result. Comparing the two related results, we see that in Theorem 1, where the weaker Assumption 3 is used, we require an extra condition on the weight vector v_ni. As the discussion after the corollary below shows, this condition actually imposes some strong constraints on k in simple examples.
The theorems above are stated for a general weight vector v_ni, 1 ≤ i ≤ n. When specialized to some commonly used weight vectors, they give the following corollary.
Corollary 1. For the simple k-NN estimate (v_ni = 1/k for i ≤ k and 0 otherwise), the theorems above hold with b_n = 0 and v_2 = O(1/√k). The same applies to Example 2 presented previously (with a kernel compactly supported and bounded away from zero on [0, 1]).

Remark 1. In the above corollary we only aim for the simplest results; more complicated kernel functions can be dealt with using lengthier arguments and an additional assumption on the small ball probability. In these two simple examples we have v_ni ∼ 1/k for i ≤ k and 0 otherwise, and thus the extra condition of Theorem 1 reduces to a growth condition on k as a function of n. We see that this condition generally requires that k increase polynomially in n, with the requirement less stringent for larger r.
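If, as the corollary suggests, v_2 denotes the Euclidean norm (∑_i v_ni^2)^{1/2} of the weight vector, the computation behind v_2 = O(1/√k) for the simple k-NN weights is a one-line illustration:

```latex
% Euclidean norm of the uniform k-NN weight vector: v_ni = 1/k for i <= k, 0 otherwise.
\[
  v_2 \;=\; \Bigl(\sum_{i=1}^{n} v_{ni}^{2}\Bigr)^{1/2}
      \;=\; \Bigl(k \cdot \frac{1}{k^{2}}\Bigr)^{1/2}
      \;=\; \frac{1}{\sqrt{k}} .
\]
```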
Remark 2. In [11], the authors distinguish two types of processes: fractal type processes and exponential type processes. The former are characterized by φ(h) ∼ h^τ for some τ > 0, and the latter by φ(h) ∼ exp{−(1/h^{τ1}) log(1/h^{τ2})}, τ1 > 0, τ2 ≥ 0. The fractal type processes are similar to finite-dimensional problems in many respects, while for the infinite-dimensional case, such as when the covariate curves belong to some smoothness class, exponential type processes are more typical. For example, the simplest Gaussian process, Brownian motion, is of exponential type. The paper [22] provides other, more complicated Gaussian processes, all of which are of exponential type. From the rates obtained in the corollary, it is easy to see that for exponential type processes the convergence rates are logarithmic in the sample size, much slower than in the classical finite-dimensional case. Note that, as discussed above, under Assumption 3 we require that k increase polynomially in n, which seems to make the situation similar to the finite-dimensional case. However, this impression is misleading. For example, when φ(h) ∼ exp{−1/h^τ}, as in typical functional contexts, we have φ^{-1}(2k/n) ∼ {1/log(n/(2k))}^{1/τ}, so the convergence rate is logarithmic in n whether k increases polynomially or logarithmically in n.
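As an illustration of the last claim, the elementary inversion in the special case φ(h) = exp{−1/h^τ} reads as follows:

```latex
% Solving phi(h) = 2k/n for h when phi(h) = exp(-1/h^tau):
\[
  \exp\bigl(-h^{-\tau}\bigr) \;=\; \frac{2k}{n}
  \;\Longleftrightarrow\;
  h^{-\tau} \;=\; \log\frac{n}{2k}
  \;\Longleftrightarrow\;
  h \;=\; \Bigl(\log\frac{n}{2k}\Bigr)^{-1/\tau} ,
\]
% so phi^{-1}(2k/n) is of order (log n)^{-1/tau} both for k = n^a (0 < a < 1)
% and for k growing logarithmically in n.
```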

Proofs
In the proofs, different appearances of C denote possibly different positive constants, even within the same expression. We start off by showing a relatively simple result on the distance from x to its k−th nearest neighbor.

Lemma 1. Suppose k/n → 0 and k/log n → ∞. Then, almost surely, H ≤ φ^{-1}(2k/n) for all n sufficiently large.

Proof. First we note that φ is right-continuous and non-decreasing, and thus P(X ∈ B(x, φ^{-1}(2k/n))) ≥ 2k/n. Since the event {H ≥ φ^{-1}(2k/n)} implies that fewer than k of the covariates X_1, ..., X_n fall in the ball B(x, φ^{-1}(2k/n)), whose probability content is at least 2k/n, we obtain P(H ≥ φ^{-1}(2k/n)) ≤ exp(−Ck), where we applied Bernstein's inequality for Bernoulli random variables (see for example the Appendix in [17]). Then P(H ≥ φ^{-1}(2k/n) i.o.) = 0 follows from the Borel-Cantelli lemma, noting that k/log n → ∞.
Proof of Theorem 1. We use the following decomposition into the bias term and the variance term:

m̂(x) − m(x) = ∑_{i=1}^{n} v_ni (m(X_i) − m(x)) + ∑_{i=1}^{n} v_ni ε_i.    (3.1)

The bias term is easier to deal with; it is bounded using Assumption 1 and Lemma 1. Now we deal with the variance term. Let S_n = ∑_{i=1}^{n} v_ni ε_i; the following arguments are conditional on {X_1, ..., X_n} (in effect treating v_ni as nonrandom weights). Following the idea of Section 6.3 in [14], we write

||S_n|| − E||S_n|| = ∑_{i=1}^{n} d_i,    d_i = E(||S_n|| | G_i) − E(||S_n|| | G_{i−1}),

where G_i is the σ-algebra generated by ε_1, ..., ε_i (G_0 is the trivial σ-algebra). It is easy to see that {d_i} is a real-valued martingale difference sequence, which enables us to use relevant exponential type inequalities below. Citing Lemma 6.16 in [14], we know

|d_i| ≤ v_ni (||ε_i|| + E||ε_i||).    (3.2)

We bound the variance term in four steps.
Step 1: Since the ε_i are independent with mean zero, expanding the squared norm via the inner product of H gives E||S_n|| ≤ (E||S_n||^2)^{1/2} = (∑_i v_ni^2 E||ε_i||^2)^{1/2} ≤ C v_2.

Step 2: Since {d_i}, i ≤ n, is a martingale difference sequence, using Lemma 8.9 in [21] (Bernstein's inequality for martingales), we obtain the desired bound.
Step 3: Using Hölder's inequality and Markov's inequality, together with the bound (3.2), we bound the remaining terms; the resulting bound involves the factor v_ni L^{1−r}.

Step 4: Finally, we demonstrate the bound for the variance term in (3.1) by combining the bounds of the previous three steps.
Proof of Theorem 2. The general proof strategy is the same as for Theorem 1. In particular, the bias term is bounded in the same way. For the variance term, only Steps 3 and 4 need to be replaced by the following.
Step 3': We bound two probability terms, setting a = C(log n)^{1+1/p} v_2 and L = C(log n)^{1/p} v_n1 for C large enough.
Consider the first probability. A direct calculation using (3.2) and Assumption 4 shows that E(d′_i | G_{i−1}) ≤ a/n if we set a = C(log n)^{1+1/p} v_2 (note that a ≥ v_2 ≥ v_n1 ≥ 1/n) and L = C(log n)^{1/p} v_n1, and then P(∑_i E(d′_i | G_{i−1}) > a) = 0.

For the other probability term, again using (3.2) and Assumption 4, together with the simple inequality (1 − x)^n ≥ 1 − nx, we obtain the required bound.
Step 4': To demonstrate the bound for the variance term, we combine the bounds obtained in Step 2 and Step 3'. Finally, set a = C_1(log n)^{1+1/p} v_2 and L = C_2(log n)^{1/p} v_n1 (choose C_2 large enough to make the second term summable and then choose C_1 large enough to make the first term summable), apply the Borel-Cantelli lemma, and then use the result from Step 1 to get ||S_n|| = O((log n)^{1+1/p} v_2).
Proof of Corollary 1. For the simple k-NN method this is obvious. For the kernel k-NN, it is also obvious that b_n = 0 by the definition of H. Since v_ni = K(d(X_i, x)/H)/∑_j K(d(X_j, x)/H) ≤ C/∑_j K(d(X_j, x)/H), and K(d(X_j, x)/H) is bounded away from zero for j ≤ k and equal to 0 for j > k by the assumptions made on K, we have v_ni = O(1/k) for i ≤ k and 0 otherwise. It then follows that v_2 = O(1/√k).

Discussion
We assumed in the paper that H is a Hilbert space, while the covariate space is a much more general semi-metric space. That the response lies in a Hilbert space is needed for applying the results in [14], and thus it seems difficult to allow the response to take values in a semi-metric space. However, it is possible to assume that H is a Banach space. The proofs go through without change for a Banach space, except for Step 1 in the proof, where we used the inner product. In a general Banach space, it is not clear how to deal with E||S_n|| in Step 1. However, under the additional assumption that H is a Banach space of type p, by Proposition 9.11 in [14] or Definition 2.3 in [2] we have E||S_n|| = O((E||S_n||^p)^{1/p}) = O((∑_i v_ni^p E||ε_i||^p)^{1/p}) = O(v_p), and thus v_p will appear in the convergence rates.
Finally, we mention some possibilities for further study. For functional regression with scalar responses, uniform convergence was obtained in [12], asymptotic normality was shown in [8,6] for the independent and α-mixing cases respectively, and [7] studied inference using the bootstrap. We expect that these results can be extended to k-NN estimates with functional responses under stronger assumptions.