Convergence of Nonparametric Functional Regression Estimates with Functional Responses

We consider nonparametric functional regression when both predictors and responses are functions. More specifically, we let $(X_1,Y_1),...,(X_n,Y_n)$ be random elements in $\mathcal{F}\times\mathcal{H}$ where $\mathcal{F}$ is a semi-metric space and $\mathcal{H}$ is a separable Hilbert space. Based on a recently introduced notion of weak dependence for functional data, we showed the almost sure convergence rates of both the Nadaraya-Watson estimator and the nearest neighbor estimator, in a unified manner. Several factors, including functional nature of the responses, the assumptions on the functional variables using the Orlicz norm and the desired generality on weakly dependent data, make the theoretical investigations more challenging and interesting.


Introduction
The problem of regression with functional predictors has been receiving increasing interests nowadays, boosted by more and more datasets with observations that can be naturally perceived as curves. This trend starts with the popular monograph [24] that gives a detailed exposition of functional linear models. The existing literature contains numerous theoretical and empirical studies on functional linear models [5,8,18,21,29,4,15,6,9]. Nonparametric methods with functional predictors and scalar responses appear later [11,12,13,23,3], which by now has been widely accepted by the statistical community as a more flexible approach to functional regression with fewer structural assumptions imposed. As this area naturally develops and matures, the situation where the responses are also curves begins to receive more attention [1,7,17]. For example, one might predict annual precipitation using temperature measurements as in [25], or predict future hourly electricity consumption based on past history as in [2]. Although these two studies follows the parametric approach to functional regression, it is clear that nonparametric approach is a viable alternative [20].
On the other hand, the assumption of independence in most theoretical investigations carried out so far is often too restrictive in many applications. The necessity to respond properly to data dependence is clearly demonstrated by the example given in [10] where a functional observation denotes the monthly electricity consumption over a year and thus it is unrealistic to assume that electricity consumption in one year is independent that of the previous year. In previous studies regarding nonparametric functional regression, dependence is incorporated based on some mixing conditions [12]. Here we instead use the notion of L 4 − m−approximability advocated in [16,14] (with some appropriate minor extensions). The advantage compared to using mixing conditions is that the L 4 −m−approximability condition is easily verified in many examples as shown in [16].
In the more classical setting, the observation pairs reside in the Euclidean spaces. In this paper, we carry out a theoretical investigation of nonparametric functional regression with functional responses on dependent data. Two related classes of nonparametric estimates have been proposed, the k-nearest neighbor estimate (k-NN) and the Nadaraya-Watson kernel estimate. Because of their similarity in many aspects, we will try to unify the proofs for these two as much as possible. We will show almost sure convergence of these nonparametric estimators based on assumptions on Orlicz norms of the functional variables. Due to the functional nature of the responses and the assumption of weak dependence, the theoretical investigation poses serious challenges and some novel construction of martingale difference sequence will be introduced. Finally, we note that throughout the paper we use C to denote a generic constant that assumes different values at different places.
2 Almost sure convergence of nonparametric estimates 2.1 On the notion of Orlicz norm and weak dependence In this subsection we review the concept of Orlicz norm and collect some of its simple properties as a lemma here for easy reference later. Although all of the properties are simple and most are well-known, some others seem to be new (such as Lemma 1 (vi)(vii)) which we cannot find in the existing literature. We also review and extend the notion of L 4 − m−approximability of a data sequence using the more general Orlicz norm instead of L p norm. Following [28], let ψ be a convex, increasing function on [0, ∞) with ψ(0) = 0 and let X be a real-valued random variable. The Orlicz norm (or ψ-Orlicz norm to emphasize its dependence on ψ) is defined as which can be shown to be indeed a norm. For random elements X taking values in a normed space, the Orlicz norm of X (which is a real-valued random variable) is also denoted by X ψ for simplicity.
There are two commonly used ψ function: ψ(x) = x p and ψ(x) = exp{x p } − 1, p ≥ 1, and throughout the paper we use ψ p to denote the latter. With ψ(x) = x p , the Orlicz norm is simply the L p norm (E[X p ]) 1/p . With ψ(x) = ψ p (x) = exp{x p }− 1, the finiteness of Orlicz norm of X is closely related to the exponential decay of its tail probability, the exact statement of which is contained in the following Lemma together with other simple properties concerning Orlicz norm.
Lemma 1 Below we assume ψ is a valid function that defines an Orlicz norm, that is, ψ is convex, increasing on [0, ∞) with ψ(0) = 0. X is a random variable.
Proof. Results (i) and (ii) can be found in Section 2.2 of [28]. (iii) is obvious by the definition of Orlicz norm. To prove (iv), we note that Eψ(|X|/a X ψ ) ≤ aEψ(|X|/a X ψ ) ≤ Eψ(|X|/ X ψ ) ≤ 1, where we used that ψ(x/a) ≤ ψ(x)/a due to the convexity of ψ. For (v), since Eψ(|X|/a X ψ ) = Eφ(ψ(|X|/ X ψ )) ≤ φ(Eψ(|X|/ X ψ )) ≤ φ(1) = 1 (using Jensen's inequality), we get X ψ ≤ a X ψ by definition. For (vi), the result follows from Eψ( We already noted that L p norm is a special case of Orlicz norm when ψ(x) = x p . On the other hand, based on Lemma 1 (v), one can show that X p ≤ C X ψq for any p, q ≥ 1 and X ψq 1 ≤ C ′ X ψq 2 if q 1 ≤ q 2 , (where C, C ′ are universal constants that only depends on p, q, q 1 , q 2 ). In this sense the norm . ψq is stronger than L p , and the more so with larger q.
As explained in the introduction, for data collected sequentially over time, the assumption of independence is not realistic. In [16], the authors formalize the notion of dependence for functional data using L 4 − m−approximability. Instead of using the L 4 norm which is sufficient for the purpose of those studies, we instead use the Orlicz norm here.
Definition 1 Given a function ψ that defines an Orlicz norm, a sequence (taking values in a normed space) with finite Orlicz norm is said to be ψ−m−approximable if we have the representation where the α k are independent and identically distributed random elements of a measurable space and h is a measurable function. In addition, we assume that if In [16], several examples of L p − m−approximable sequence are given, minor modifications of these can produce more general ψ − m−approximable sequences. For example, a functional autoregressive process (Example 2.1 in [16]) is ψ − m−approximable as long as the innovation noise has finite ψ-Orlicz norm, by the same arguments. Although not explicitly stated there, a functional autoregressive process is ψ −m−approximable with exponential decay rate: γ m = O(exp{−Cm}) for some constant C.

Nonparametric estimates and convergence rate
Let (X 1 , Y 1 ), . . . , (X n , Y n ) be a stationary (in a strong sense) sequence of F × Hvalued random elements with E Y < ∞, where F is a semi-metric space with semi-metric d(., .) and H is a Hilbert space with norm . . The regression function is r(x) = E(Y |X = x) and we can write H are mean zero noises (in the sense of Bochner integral, see [19]). In this subsection, we always consider probabilities and expectations conditional on {X i }, in effect treating it as fixed. The asymptotic results stated are thus conditional on predictors even though we do not state this explicitly in the following. The implications of random predictors are treated in the next subsection after we present the general convergence results in this subsection.
The regression function can be estimated by local weighting of responseŝ ) is a probability vector of weights. Note that W ni (x) can be a function of all X k , k = 1, . . . , n, instead of X i only, as is the case for k-NN estimates (see the examples below). Since in this paper we only investigate pointwise convergence at a fixed point x, we will use the notation (W n1 , . . . , W nn ) in the following for simplicity. We rank (X i , Y i ), i = 1, . . . , n, based on increasing value of d(X i , x) (ties are broken by indices) and obtain a vector ( Our consideration of weak dependence leads to extra complications in the proofs. If the observations are independent, then obviously Y R i are also independent. How- is no longer stationary in general since the order of observations are broken. We will thus use representation (1) in most parts of our proofs, although representation (2) is easier to manipulate in the study of k-NN estimates for independent data. Example 1. Simple nearest neighbor estimate. Take v ni = 1/k for i ≤ k and v ni = 0 for i > k, so that the regression function estimate is just the average of responses corresponding to the k nearest neighbors of x. Even in this simplest case, although v ni is only a deterministic sequence, W ni still depends on all X j , 1 ≤ j ≤ n since all predictors jointly determine x's neighbors. More generally, we can take v ni to be a deterministic sequence with v n1 ≥ v n2 ≥ · · · ≥ v nn thus putting more weights on data closer to x.
Example 2. Nearest neighbor estimate based on kernel. Take In this subsection, since we condition on predictors {X i }, H is a known fixed value.
, which has exactly the same form as in the previous example. However, here H is a predetermined value usually called the bandwidth parameter, not derived from distance of x's kth nearest neighbor. Typically, one applies the same value of H for all values of x. Thus compared to nearest neighbor estimate, the Nadaraya-Watson estimate is not adaptive to the local sparseness of data. In this subsection when conditioning on predictors and for a given x, of course Nadaraya-Watson estimator is same as that in Example 2 since H is fixed in both cases. The differences will appear in the next subsection.
Naturally we need the following assumption on the regression function to obtain nontrivial rates of convergence.
Assumption 1: r is bounded and Lipschitz continuous.
In fact, since we only consider pointwise convergence, it suffices that r is Lipschitz continuous on an open neighborhood of x. We nevertheless use the above assumption for simplicity in statements.
Also, we denote by H the distance to x from its kth nearest neighbor, and we assume H → 0.
Although Assumption 2 as stated is more amenable for use for k-NN estimates, it can also be used for Nadaraya-Watson estimate, which will be clear in the next subsection. We also impose the following assumptions on the noise. Assumption 3: Given a convex increasing function ψ with ψ(0) = 0, and suppose for some constants C > 0, some concave increasing function φ with In the above assumption, the Orlicz norm is used for bounding the tail probability of noises (Lemma 1 (i)) as well as controlling the dependence. It is possible of course to use different ψ for these two different purposes, but using the same ψ seems to be most natural since it concern the same noises. The assumption x r ≤ φ(ψ(Cx)) deserves some explanation. By Lemma 1 (v), this implies that the r-th moment of the noise variable is finite, for some r ≥ 2 and it is in particular satisfied by ψ(x) = x p for p ≥ r and ψ(x) = ψ q (x) for q ≥ 1. When a stronger ψ-Orlicz norm is used, Assumption 3 imposes a stronger constraint, but the summability conditions in Theorem 1 below are easier to satisfy.
Our main result for functional nonparametric estimates with functional responses is the following.
Remark 1 Here we present a unified result for both nearest-neighbor estimate and the Nadaraya-Watson estimate. For nearest-neighbor estimate, k is a pre-specified constant and typically b n and c n2 are explicit functions of k and thus deterministic. On the other hand, H depends on k through (3) and thus depends on predictors. The situation for the Nadaraya-Watson estimate is exactly the opposite. H will be prespecified (typically as a function of sample size) and k is the number of predictors falling into the ball with radius H and thus depends on data. Similarly, v ni as order statistics of W ni depend on predictor values.
Remark 2 Because of the requirement ∞ n=1 exp{−Ca 2 /(aL + m 2 c 2 n2 + x)} < ∞, we see that the sequence a cannot converge faster than mc n2 and thus we will focus on cases where this rate is achievable up to some logarithmic terms in the following.
Remark 3 For independent data, γ 1 = 0 and the term (γ 1 v n1 ) 1/2 does not appear. More generally, this term can be ignored as long as v n1 = O(c 2 n2 ), by Remark 1 above. As an example, we obviously have v n1 = c 2 n2 = 1/k for the simplest k-NN estimate with v ni = 1/k, i ≤ k. In the next subsection, one will see that for the Nadaraya-Watson estimate in Example 3 above, we also have that v n1 and c 2 n2 are of the same order under mild assumptions.

Remark 4
In the convergence rate, b n and H α represent the bias while a comes from the variance of the estimator. As presented above, which aims for generality rather than clarity, it is hard to see what the convergence rate is in typical situations, and thus we discuss the rates in some special cases in the rest of this subsection.
Independent case When the data are independent, 1/ψ( x n /2/(γ 1 c n2 )) and 1/ψ(a/(2nv n1 γ m )) are zero (Informally, γ m = 0 when data are independent and we take ψ(∞) = ∞. More rigorously, it can be seen from the proofs that these two terms are zero), and we can take m = 1, x = 0. Taking L = c n2 and a = (log n)c n2 , the first sequence in (*) is then obviously summable. So as long as 1/ aψ 1−1/r (c n2 /(2Mv n1 )) is summable, we have convergence rate (log n)c n2 . For the simplest nearest neighbor estimate ) . For ψ(x) = x p or ψ(x) = exp{x p } − 1, this obviously is a restriction on k, in particular that k should diverge fast enough at a certain rate. We note that by existing results on k-NN estimate for independent data with scalar responses, the variance term is expected to be c n2 = 1/ √ k, which agrees with the rate here up to a logarithmic term. In summary, we have We note that for Nadaraya-Watson estimate in Example 3, discussions in the next subsection suggest that the convergence behavior is very much the same under reasonable assumptions.
Weakly dependent case Here the convergence rate is determined by the interplay of ψ and {γ m } in a more complicated way. For example, qualitatively, the summability of 1/ψ(a/(2nv n1 γ m )) is easier to be satisfied the smaller is γ m (weaker dependence). Moreover, the choice of x must take into account the trade off between the summability of exp{−Ca 2 /(aL + m 2 c 2 n2 + x)} and the summability of 1/ψ( x/2/(γ 1 c n2 )) (the former is an increasing function of x while the latter is a decreasing function of x). Similarly, the choice of m must take into account the trade off between summability of (m/a)/ψ 1−1/r (L/(2Mmv n1 )) and 1/ψ(a/(2nv n1 γ m )) (the former is an increasing function of m while the latter is typically a decreasing function of m). Ignoring the complication of choosing m, when ψ(x) = ψ p (x) = exp{x p } − 1, the following corollary gives one possible situation where it is possible to set a = mc n2 up to an extra logarithmic term.
Finally, we note that in the above corollary, if γ m = e −Cm for some C > 0, then we can take m ∼ log n so that all sequences in (*) are summable, and the rate of convergence is (log n) 3 c n2 .

On the properties of H and k with random covariates
In the previous subsection, we treat the predictor as fixed and the convergence rate depends on the sequence {X i }. Here we study the behavior of some of the quantities that appeared in the rates when X i is a random stationary sequence in typical situations. Results obtained in this subsection can be combined with Theorem 1 to obtain more explicit convergence rates. The necessity of studying H (for NN estimator) or k (for Nadaraya-Watson estimator) is seen from Remark 1 in the previous subsection.
When X i are random, we will make use of the important quantity ϕ(h) := P ({x ′ : x ′ ∈ B(x, h)}) which is called the small ball probability. Its importance has been demonstrated in [13] for functional kernel regression with scalar responses. In particular, the use of ϕ(h) in a functional setting replaces the common assumption on the existence of a density for X when X belongs to some Euclidean space. It is easy to see that in the classical setting with mild assumption on the density of X ∈ R d , we have ϕ(h) ∼ h d .
Nearest neighbor estimate. We only consider the simplest k-NN estimate as in Example 1 with v ni = 1/k, i = 1, . . . , k. Then in the convergence rates, b n = 0, c 2 n2 = i W 2 ni = 1/k and max i W ni = 1/k. Thus only the quantity H depends on {X i }. If the sequence {X i } contains independent elements, one can show H = O(ϕ(2k/n)) almost surely as in the following proposition.
Proposition 1 Suppose k/n → 0 and k/ log n → ∞. Let H be the distance from x to its k-th nearest neighbor as defined in (3) Proof. First we note that ϕ is right-continuous and non-decreasing and thus h = Denote c = ϕ −1 (2k/n), p = ϕ(c) and thus np ≥ 2k. We have where we applied the Bernstein's inequality for Bernoulli random variables (see for example Appendix B in [22]). Then P (H > φ −1 (2k/n), i.o.) → 0 can be shown using Borel-Cantelli lemma noting that k/ log n → ∞. . In [13], the authors distinguished two types of processes: the fractal type processes and the exponential type processes. The former is characterized by φ(h) ∼ h τ , for some τ > 0 and the latter characterized by φ(h) ∼ exp{−(1/h τ 1 ) log(1/h τ 2 )}, τ 1 > 0, τ 2 ≥ 0. The fractal type processes are similar to finite dimensional covariates in many aspects, while for infinite dimensional case such as when the covariate curves belong to some smoothness class, exponential type processes are more typical. For example, the Brownian motion is of exponential type. The paper [27] provides other more complicated Gaussian processes all of which are of exponential type. Combining Proposition 1 above with Corollary 1, we obtain the rates O([ϕ −1 (2k/n)] α +(log n)/ √ k) for independent data. When the optimal k is chosen, it is easy to see that for exponential type processes the convergence rates are logarithmic in the sample size, much slower than the classical finite-dimensional cases. Also note that this slow rate is largely determined by the term [ϕ −1 (2k/n)] α which converges to zero logarithmically whether k increases logarithmically or polynomially in n.

For weakly dependent sequence {X i }, in particular assuming
m=1 β m < ∞ (a minor extension to Definition 1 is needed here since X i ∈ F which is not a normed space, thus we need to use d(., .) instead of X 1 − X (m) 1 ), we can show the following proposition whose proof is deferred to the next section. Note that although we used the same notation as before, ψ here are different from that in Assumption 3 since here we are considering the predictor sequence instead of the noise sequence.
and we only consider the simple case where kernel function K satisfies cI [−1,1] ≤ K ≤ CI [−1,1] for some C > c > 0. Unlike k-NN estimate, here H is predetermined. In Assumption 2, we let k be the number of covariates inside the ball B(x, H) and thus if X i is not one of the k nearest neighbors of x, we have W ni = 0 and thus b n = n k+1 v ni = 0 in the convergence rate in Theorem 1. Since H is predetermined in Nadaraya-Watson estimates, the only quantity in the convergence rates that depends on as long as k ≥ 1, and c n2 ∼ 1/ √ k which can be easily shown, we only need to study the asymptotic behavior of k, the number of predictors inside the ball B(x, H).
With {X i } an independent sequence, we have The proofs for these two propositions are very similar to those for Propositions 1 and 2, and thus omitted.

Proofs
Based on two different representations of the nonparametric estimate in (1) and (2), we decompose r(x) − r(x) into the bias term and the variance term, The bias term is easier to deal with. In fact, by Assumptions 1 and 2. Now we deal with the variance term. Let η i = W ni ǫ i , S n = n i=1 η i and the following arguments are conditional on {X 1 , . . . , X n } (in effect treating W ni as nonrandom weights). Following the idea of Section 6.3 in [19] where G i is the σ−algebra generated by ǫ 1 , . . . , ǫ i (G 0 is the trivial σ−algebra). It is easy to see that {e i } is a real-valued martingale difference sequence which potentially enables us to use relevant exponential type inequalities. However, in general it seems at least not easy to obtain directly appropriate moment bounds for d i in order to apply, for example, Lemma 8.9 in [26] (Bernstein's inequality for martingale differences), and thus we instead work with the quantity where m is same as that in the statement of the theorem and, as discussed in Remarks following the theorem, need to be chosen appropriately (as a side note, m = 1 suffices for independent data in which case we actually have d i = e i ). If i+m−1 > n, the expression S n −η i −· · ·−η i+m−1 is taken to mean S n −η i −· · ·−η n . Obviously d i is still a martingale difference sequence. We denote Lemma 2 shows that Aided by these results, we can bound the variance term S n in three steps.
Step 1: Letψ(x) := ψ( √ x). By Assumption 3,ψ is convex and increasing and thus defines an Orlicz norm. Using (6), we have where in the last line above we use that ǫ (j−i+1) j is independent of G i−1 , and also use the inequality ǫ follows from the parallelogram identity. Furthermore, where we used Lemma 1 (vii) for the second inequality above and Lemma 1 (vi) for the third inequality above. Then, using Lemma 1 (i), we have for any , i ≤ n is a martingale difference sequence, using Lemma 8.9 in [26] (Bernstein's inequality for martingales) together with (9), we obtain the desired bound as follows: Step 2:

From (5), we have that
and thus using Lemma 1 (i) and (v), and Using Hölder's inequality, we have and thus, using Markov's inequality, we have Step 3: Finally, we demonstrate the bound for the variance term in (4).
) and then by the previous two steps and (7). The above expression is summable by assumption of the theorem, and an application of the Borel-Cantelli Lemma leads to S n − E S n = O(a). Combining this with (8), the variance term is thus S n = O(a + c n2 + (γ 1 v n1 ) 1/2 ). As noted in Remark 1 following the theorem, the term c n2 can be omitted since we always have c n2 = O(a). Lemma 2 Using the notation in the proof of Theorem 1, we have , the first equation is obvious. Denote . Using the interpretation of E(ξ i |G i−1 ) as the projection of ξ i , we have proving the second equation.
By exactly the same arguments P ( i f ′′ i > a/2) ≤ 1/ψ a/(2n( max 1≤j≤n W nj )γ m ) , and the Lemma is proved by combining the above two displayed equations.
Lemma 4 Let S n = n i=1 η i = n i=1 W ni ǫ i as in the proof of Theorem 1, we have E S n = O c n2 + √ γ 1 max i W ni .
We also have that where we used Lemma 1 (i) in (11) and used (10) in (12). The lemma follows from the Borel-Cantelli Lemma. .