Principal components analysis for sparsely observed correlated functional data using a kernel smoothing approach

In this paper, we consider the problem of estimating the covariance kernel and its eigenvalues and eigenfunctions from sparse, irregularly observed, noise-corrupted and (possibly) correlated functional data. We present a method based on pre-smoothing of the individual sample curves through an appropriate kernel. We show that the naive empirical covariance of the pre-smoothed sample curves gives a highly biased estimator of the covariance kernel along its diagonal. We address this problem by estimating the diagonal and off-diagonal parts of the covariance kernel separately. We then present a practical and efficient method for choosing the bandwidth of the kernel, based on an approximation to the leave-one-curve-out cross validation score. We prove that, under standard regularity conditions on the covariance kernel and assuming i.i.d. samples, the risk of our estimator under $L^2$ loss achieves the optimal nonparametric rate when the number of measurements per curve is bounded. We also show that even when the sample curves are correlated in such a way that the noiseless data have a separable covariance structure, the proposed method remains consistent, and we quantify the role of this correlation in the risk of the estimator.


Introduction
Noisy functional data arise frequently in various fields, for example longitudinal data analysis, chemometrics and econometrics (Ferraty and Vieu, 2006). Depending on how the measurements are taken, there are two different scenarios: (i) individual curves are measured on a dense, regular grid; (ii) the measurements are observed on a sparse, and typically irregular, set of points in an interval. The first situation usually arises when the data are recorded by some automated instrument, e.g. in chemometrics, where the curves represent the spectra of certain chemical substances. The second scenario is more typical of longitudinal studies, where the individual curves could represent the level of concentration of some substance, and the measurements on the subjects may be taken only at irregular time points.
In these settings, when the goal of analysis is data compression, model building or studying covariate effects, one may want to extract information about the functional principal components (i.e., the eigenvalues and eigenfunctions of the covariance kernel). The eigenfunctions provide a natural basis for representing the data, and hence are very useful in problems related to model building and prediction for functional data. For example, they have been used extensively in functional linear regression (Cardot, Ferraty and Sarda (1999), Hall and Horowitz (2007), Cai and Hall (2006)). Ramsay and Silverman (2005) and Ferraty and Vieu (2006) give extensive surveys of the applications of functional principal components. In the first scenario, i.e., data on a regular grid, as long as the individual curves are smooth, the measurement noise level is low, and the grid is dense enough, one can essentially treat the data as lying on a continuum and employ techniques similar to those used in classical multivariate analysis. However, the irregular nature of the data in the second scenario, and the associated measurement noise, require a different treatment. In this paper, we propose a kernel smoothing approach to estimate the covariance surface and its functional principal components based on sparse, irregularly observed, noise-corrupted functional data. The method is based on pre-smoothing of the individual curves, with a suitable modification along the diagonal, for estimating the covariance kernel. We prove the consistency and derive the rate of convergence of the proposed estimator. Also, under many practical circumstances the sample curves are correlated, for example in spatio-temporal data (Hlubinka and Prchal, 2007), online auction data (Peng and Müller, 2008), and time course gene expression data (Spellman et al., 1998). However, in the existing literature, most theoretical studies of principal components analysis assume i.i.d. sample curves.
The analysis presented in this paper shows that the asymptotic consistency of the principal components holds for the proposed method even under certain types of correlation structures (as discussed later).
Before going into the details of the proposed procedure, we first give an outline of the data model and an overview of different approaches to this problem. Suppose that we observe $n$ realizations of an $L^2$-stochastic process $\{X(t) : t \in [0, 1]\}$ at a sequence of points on the interval $[0, 1]$ (or, more generally, on an interval $[a, b]$), with additive measurement noise. That is, the observed data $\{Y_{ij} : 1 \leq j \leq m_i;\ 1 \leq i \leq n\}$ can be modeled as
$$ Y_{ij} = X_i(T_{ij}) + \sigma \varepsilon_{ij}, \qquad (1) $$
where $\{\varepsilon_{ij}\}$ are i.i.d. with mean 0 and variance 1. Since $X(t)$ is an $L^2$ stochastic process, by Mercer's Theorem (Ash, 1972) there exists a positive semi-definite kernel $C(\cdot, \cdot)$ such that $\mathrm{Cov}(X(s), X(t)) = C(s, t)$, and each $X_i(t)$ has the following a.s. representation in terms of the eigenfunctions of the kernel $C(\cdot, \cdot)$:
$$ X_i(t) = \mu(t) + \sum_{\nu \geq 1} \sqrt{\lambda_\nu}\, \xi_{i\nu}\, \psi_\nu(t), $$
where $\mu(\cdot) = E(X(\cdot))$ is the mean function; $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ are the eigenvalues of $C(\cdot, \cdot)$; $\psi_\nu(\cdot)$ are the corresponding orthonormal eigenfunctions; and the random variables $\{\xi_{i\nu} : \nu \geq 1\}$, for each $i$, are uncorrelated with zero mean and unit variance. Furthermore, we assume that for each pair $(i, j)$ with $1 \leq i \neq j \leq n$, the correlation is modelled by
$$ E(\xi_{i\nu}\, \xi_{j\nu'}) = \rho_{ij}\, \delta_{\nu\nu'} \quad \text{for } 1 \leq \nu, \nu' \leq M, $$
and $\rho_{ij}$ may be nonzero. This gives rise to a separable covariance structure for the noiseless data. That is, the processes $\{X_i(\cdot)\}_{i=1}^n$ satisfy $\mathrm{Cov}(X_i(s), X_j(t)) = \rho_{ij} C(s, t)$, with $\rho_{ii} \equiv 1$. This holds, for example, when the principal component scores $\{\xi_{i\nu}\}_{i=1}^n$ for different $\nu$ are i.i.d. stationary time series. Finally, in the observed data model (1), we assume that $T_i = \{T_{ij} : j = 1, \ldots, m_i\}$ are randomly sampled from a continuous distribution.
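As a concrete illustration of this data model, the following sketch simulates sparse, irregular, noisy observations from a hypothetical rank-2 process. The eigenvalues, eigenfunctions, noise level, and sparsity pattern below are illustrative choices, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-2 model on [0, 1]: lambda_1 = 1.0, lambda_2 = 0.5, with
# orthonormal eigenfunctions psi_1(t) = sqrt(2) sin(2 pi t), psi_2(t) = sqrt(2) cos(2 pi t).
lams = np.array([1.0, 0.5])

def psi(nu, t):
    return np.sqrt(2.0) * (np.sin(2 * np.pi * t) if nu == 0 else np.cos(2 * np.pi * t))

def simulate_sparse_curves(n, m_low=3, m_high=6, sigma=0.2):
    """Draw Y_ij = X_i(T_ij) + sigma * eps_ij with X_i = sum_nu sqrt(lam_nu) xi_inu psi_nu."""
    data = []
    for _ in range(n):
        m_i = int(rng.integers(m_low, m_high))    # sparse: 3-5 points per curve
        T = np.sort(rng.uniform(0.0, 1.0, m_i))   # irregular random design points
        xi = rng.standard_normal(2)               # zero-mean, unit-variance scores
        X = sum(np.sqrt(lams[nu]) * xi[nu] * psi(nu, T) for nu in range(2))
        data.append((T, X + sigma * rng.standard_normal(m_i)))
    return data

data = simulate_sparse_curves(200)
```

Correlated curves in the sense of the separable structure above could be obtained by drawing, for each $\nu$, the scores $(\xi_{1\nu}, \ldots, \xi_{n\nu})$ from a stationary time series instead of independently.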
As an example particularly suitable for modeling within the framework presented above, we consider the data on atmospheric radiation in Hlubinka and Prchal (2007). There, the measurements are taken from balloons, from the Earth's surface up to an altitude of 35 km. The data points corresponding to the $i$-th balloon are of the form $(a_i, z_i)$, where $a$ represents the altitude and $z$ represents the average number of pulses at altitude $a$, which is thought to be proportional to the radiation intensity. Thus, these vertical profiles of atmospheric radiation are treated as individual realizations of functional data: the $a_i$'s are the measurement points, the $z_i$'s are the measurements, and the subjects are indexed by time. Hence there is a natural dependence among the sample curves observed over different time points. Moreover, it is reasonable to assume that the dependence across time does not change with the vertical distance, except possibly through a long-term trend, i.e., the spatio-temporal covariance structure is separable.
Below we give a short overview of two existing approaches to the problem of estimating functional principal components from sparse data. Yao, Müller and Wang (2005) propose a local linear smoothing of the empirical covariances $\{\widehat{C}_i(T_{ij}, T_{ij'}) := (Y_{ij} - \widehat{\mu}(T_{ij}))(Y_{ij'} - \widehat{\mu}(T_{ij'})) : j \neq j'\}_{i=1}^n$, where $\widehat{\mu}$ is the estimate of the mean function $\mu(\cdot)$ obtained by local linear smoothing. They prove asymptotic consistency of this estimator and of the estimated eigenfunctions, assuming i.i.d. sample curves. Hall, Müller and Wang (2006) prove further that the problem of estimating the covariance kernel and that of estimating its eigenfunctions are intrinsically different, in that the former is a two-dimensional smoothing problem while the latter is a one-dimensional one, which results in different choices of optimal bandwidth. They also prove that the proposed local polynomial estimator achieves the optimal nonparametric convergence rate under the optimal choice of bandwidths, in the i.i.d. setting, when the number of measurements per curve is bounded. Instead of the local polynomial approach, where regularization is imposed on the estimates by varying the bandwidth of the kernel, one can impose regularization by restricting the eigenfunctions to a known basis of smooth functions. This approach has been used by various researchers, including Besse, Cardot and Ferraty (1997), Cardot (2000), James, Hastie and Sugar (2000) and Peng and Paul (2007). Peng and Paul (2007) propose to directly maximize the restricted log-likelihood under the working assumption of Gaussianity, so that the resulting estimator respects the geometry of the parameter space. This method is implemented through a Newton-Raphson algorithm on the Stiefel manifold of rectangular matrices with orthonormal columns, which is the parameter space for the matrix of basis coefficients of the eigenfunctions.
Furthermore, in Paul and Peng (2007) the authors prove that this restricted maximum likelihood (REML) estimator also achieves the optimal nonparametric rate when the number of measurements per sample curve is bounded and the sample curves are i.i.d.
We now give a brief description of the estimation procedure proposed in this paper. The method is partly motivated by the observation that the naive sample covariance based on the presmoothed individual sample curves is highly biased along the diagonal of the covariance kernel when $m_i$, the number of measurements per curve, is small. As can be seen clearly from (6) in Section 2.1, this bias does not vanish asymptotically unless $(\min_{1 \leq i \leq n} m_i) h_n \to \infty$ as $n \to \infty$, where $h_n$ is the bandwidth of the kernel smoother. Under the latter setting, Hall et al. (2006) discuss the possibility of using a local linear smoother for individual sample curves and then performing a PCA on the smoothed curves. Furthermore, when the design points $T_{ij}$ are regularly spaced and sufficiently dense, they show that conventional PCA for functional data (see the statements and conditions of Theorem 3 of that paper for details) yields root-$n$ consistent estimates of the eigenvalues and eigenfunctions, so that the problem is asymptotically equivalent to a parametric one. An interesting question is whether the naive kernel smoothing approach can be suitably modified to produce estimators with good asymptotic risk properties even when the $m_i$'s are relatively small. Our approach goes in this direction: it estimates the diagonal and the off-diagonal portions separately, and then merges them using a smooth weight kernel. The estimation of the off-diagonal portion is based on presmoothing the individual sample curves by a linearized kernel smoother. The estimation of the diagonal part involves linearized kernel smoothing of the empirical variances. The task of selecting an appropriate bandwidth, as well as the number of nonzero eigenvalues, is addressed by deriving a computationally efficient approximation to the leave-one-curve-out cross validation score.
This approximation procedure, as well as the asymptotic analysis of the estimators, is based on the perturbation theory of linear operators. We now summarize the main contributions of this paper. Our approach of merging two separate presmoothed linearized kernel estimates of the diagonal and the off-diagonal parts of the covariance kernel is new and is computationally very efficient. We prove that the proposed estimator achieves the optimal nonparametric rate when the observations are i.i.d. realizations of a finite dimensional smooth stochastic process and the number of measurements per curve is bounded. This result parallels the one obtained by Hall et al. (2006) for the local polynomial approach. Moreover, we obtain explicit expressions for the integrated mean squared error of the estimated eigenfunctions under a regime of separable covariance structure among the sample curves. The quantification of the role of correlation in the risk behavior (Theorem 4.2) appears to be new in the functional data analysis literature. We also derive a lower bound on the rate of convergence of the risk of the first eigenfunction (Theorem 4.3), which is sharper than an analogous (but more general) bound obtained in Hall et al. (2006). This lower bound, together with the matching upper bound on the rate of convergence in the i.i.d. case, shows that the proposed estimator attains the optimal rate even when $\max_{1 \leq i \leq n} m_i \to \infty$, at least under the restricted setting described in Theorem 4.3. Moreover, if the correlation between sample curves is "weak" in a suitable sense, then the optimal rates of convergence for the eigenfunctions in the correlated and i.i.d. cases are the same. Furthermore, we show that our estimation procedure allows for a computationally efficient approximation of the leave-one-curve-out cross validation score, which is used for selecting the bandwidth for estimating the eigenfunctions.
This approximation is based on a perturbation analysis that arises naturally from the form of our estimator. We also show that the widely used prediction error loss is not correctly scaled as a cross validation criterion in the current context, and we therefore propose the empirical Kullback-Leibler loss as the cross validation criterion.
The rest of the paper is organized as follows. In Section 2, we propose the estimation procedure and contrast it with the naive kernel smoothing approach. In Section 3, we propose an approximation to the leave-one-curve-out cross validation score based on the perturbation theory for linear operators. In Section 4, we state the main results about the consistency and rate of convergence of the estimators of the covariance kernel and its eigenfunctions. In Section 5, we give an outline of the proof of the main results (Theorems 4.1 and 4.2) and discuss their implications. In Section 6, we give an overview of various related issues and future research directions. The proof details are provided in the appendices.

Method
Throughout this section, we assume that the mean curve has been estimated separately, and has been subtracted from the data. Thus, without loss of generality we assume that µ = 0. Also, in the asymptotic analysis carried out in Section 4, we make the same assumption to simplify the exposition. The case of arbitrary µ with sufficient degree of smoothness can be easily handled.

Naive kernel smoothing approach
A popular method in nonparametric function estimation is to smooth the individual sample curves by a kernel averaging of the sample points. In principle, one can adopt a similar approach in the current context: first smooth the individual sample curves, then compute the covariance of the "pre-smoothed" sample curves, followed by an eigen-analysis of this "pre-smoothed" empirical covariance. In the following, we first describe such an approach briefly, and then show that even in the case of i.i.d. data, the resulting estimator has an intrinsic bias in estimating the diagonal of the covariance kernel, unless the number of measurements per curve is large.
Let $K(\cdot)$ be a summability kernel with an adequate degree of smoothness, satisfying a set of regularity conditions together referred to as condition B1 (see Section 4). We then define the presmoothed sample curves as
$$ \widetilde{X}_i(t) = \frac{1}{m_i} \sum_{j=1}^{m_i} K_{h_{n,i}}(t - T_{ij})\, Y_{ij}, \qquad (3) $$
where $K_h(\cdot) := h^{-1} K(\cdot/h)$ for $h > 0$, and $h_{n,i}$ is the bandwidth for the $i$-th curve. The empirical covariance based on the presmoothed curves is then simply
$$ \widetilde{C}(s, t) = \frac{1}{n} \sum_{i=1}^n \widetilde{X}_i(s)\, \widetilde{X}_i(t). $$
In the following, we derive an expression for the expectation of $\widetilde{C}(s, t)$ to quantify its bias in estimating $C(s, t)$, when $h_{n,i} = h_n$ for all $i$, under the assumption that $C(\cdot, \cdot)$ is twice continuously differentiable. Suppose for simplicity that the density of the design points $\{T_{ij}\}_{j=1}^{m_i}$, for each subject, is uniform on $[0, 1]$. Define $\bar{C}(t) = C(t, t)$ for $t \in [0, 1]$, and $K_2(\cdot) = \int K(\cdot - u) K(-u)\, du$. Also, we assume that the $m_i$'s are given. In the following proposition the bounds hold as $h_n \to 0$. The expansion shows that the bias is inflated near the diagonal (i.e., when $|s - t| < 2 B_K h_n$), which results in overestimation; in fact, the degree of overestimation increases by a factor of order $1/h_n$ as soon as $|s - t| < 2 B_K h_n$. This demonstrates clearly that the naive kernel smoothing approach is intrinsically biased and needs to be appropriately modified.
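A minimal numerical sketch of the naive approach follows: Nadaraya-Watson presmoothing of each sparse curve onto a common grid, followed by the empirical covariance of the smoothed curves. The kernel, bandwidth, and toy process are illustrative assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(1)

def epanechnikov(u):
    # Compactly supported second-order kernel with integral 1
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def presmooth(T, Y, grid, h):
    """Nadaraya-Watson smoothing of one sparse curve onto a common grid."""
    W = epanechnikov((grid[:, None] - T[None, :]) / h)
    den = W.sum(axis=1)
    den[den == 0.0] = np.nan                 # grid points with no data nearby
    return (W @ Y) / den

# Toy i.i.d. data: X_i(t) = a_i sin(2 pi t), observed at m = 4 points with noise
grid = np.linspace(0.0, 1.0, 50)
smoothed = []
for _ in range(300):
    T = np.sort(rng.uniform(0.0, 1.0, 4))
    Y = rng.standard_normal() * np.sin(2 * np.pi * T) + 0.3 * rng.standard_normal(4)
    smoothed.append(presmooth(T, Y, grid, h=0.25))

# Naive empirical covariance of the presmoothed curves (pairwise over non-missing
# values); its diagonal is inflated relative to C(t, t) when m*h is small.
C_naive = np.ma.cov(np.ma.masked_invalid(np.array(smoothed)), rowvar=False).filled(np.nan)
```

Plotting the diagonal of `C_naive` against the true variance function in this toy example would display exactly the near-diagonal overestimation discussed above.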
To understand the reason for this bias, notice that if a pair of points $(T_{ij}, T_{ij'})$, for some $1 \leq j \neq j' \leq m_i$, is randomly sampled from $[0, 1]^2$, then it has probability of order $O(h_n^2)$ of falling in a neighborhood of length and width $h_n$ around a given point $(s, t)$ away from the diagonal. In contrast, a randomly chosen point $T_{ij}$ has probability of order $O(h_n)$ of falling in a neighborhood of length $h_n$ of the point $(t, t)$ along the diagonal. Therefore, measurements are much denser along the diagonal, and this explains the difference in rates.

Modification to naive kernel smoothing
In this section, we propose a modification that deals with the bias of the naive kernel smoothing approach described in Section 2.1. We remedy the effect of the unequal sampling density along the diagonal of the covariance kernel (and the resulting bias) by estimating the diagonal and the off-diagonal parts separately. We then use a suitable (smooth) weight kernel to combine the two estimates.
Throughout the paper, we assume that the density of the time-points {T ij } is known and is denoted by g(·). In practice we can estimate g from the data separately. We further assume that there are constants 0 < c 0 ≤ c 1 < ∞ such that c 0 ≤ g(·) ≤ c 1 .
We also propose to use a linearized version of kernel smoothing to reduce the bias while controlling the variance. For this purpose, define $Q(s, t)$ to be a tensor-product kernel (that is, a kernel of the form $Q(s, t) = Q(s) Q(t)$ for some smooth function $Q$) with four properties, together referred to as condition B2; in particular, (iv) $Q$ is symmetric about 0.
Property (iii) can be rephrased as saying that integer translates of Q form a partition of unity. As an example, the B-spline basis functions (Chui, 1987) satisfy all four properties. Let Q h (·, ·) denote the kernel Q(h −1 ·, h −1 ·).
For estimation of the diagonal $\bar{C}(t) = C(t, t)$, let $\widehat{\bar{C}}(t) := \widetilde{C}^*(t) - \widehat{\sigma}^2$, where $\widehat{\sigma}^2$ is an estimator of $\sigma^2$ (discussed in Section 2.3), and $\widetilde{C}^*(t)$ is the estimate of $\bar{C}(t) + \sigma^2$ obtained by a linearized kernel smoothing of the terms $\{\frac{1}{m_i} Y_{ij}^2 : j = 1, \ldots, m_i;\ i = 1, \ldots, n\}$. This is because, for each pair $(i, j)$, the conditional expectation of $Y_{ij}^2$ (conditional on $T_i$ and $m_i$) is $C(T_{ij}, T_{ij}) + \sigma^2$. Define a grid on $[0, 1]$ with grid spacing $h_n$ and denote the grid points by $\{s_l : l = 1, \ldots, L_n\}$, where $L_n = c_L/h_n$ for an appropriately chosen $c_L \approx 1$. The resulting estimator $\widetilde{C}^*(t)$ is given in (7). Note that (7) is a linearized version of conventional kernel smoothing, which can be interpreted as a local linear smoothing of the empirical variances. A similar principle is applied to construct an estimator of the off-diagonal part (see (9) below). The linearization has two advantages: on one hand, it helps reduce the bias of the estimate; on the other hand, it facilitates efficient computation, both for estimation and for model selection. The main difference between this linearization approach and local linear smoothing is that we use $g(t)$ (or an estimate of $g(t)$) in the denominator, whereas in local linear smoothing the denominator is implicitly a local estimate of $g$, obtained by averaging the smoothing kernel in a neighborhood of $t$. Note that, as opposed to our estimator of $g$, which uses a different bandwidth from the one used for estimating the covariance, local linear smoothing essentially uses the same bandwidth for estimating both $g$ and $C$, and thus suffers from instability. More specifically, the local linear estimator of Yao et al. (2005) involves ratios whose denominator is essentially the number of time points falling in a small interval. Since the time points are assumed to be randomly distributed and sparse, in practice this can cause instability.
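The diagonal estimator can be sketched as a plain kernel smoother of the squared observations with a known design density. The sketch below omits the linearization step and assumes a uniform density $g \equiv 1$; both are simplifications of the estimator in (7):

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def smooth_squares(curves, grid, h, g=1.0):
    """Estimate C(t, t) + sigma^2 by kernel smoothing of the squared observations:
    (1/n) sum_i (1/m_i) sum_j h^{-1} K((t - T_ij)/h) Y_ij^2 / g(t)."""
    num = np.zeros_like(grid, dtype=float)
    for T, Y in curves:
        K = epanechnikov((grid[:, None] - T[None, :]) / h)
        num += (K @ (Y**2)) / len(T)
    return num / (len(curves) * h * g)

# Sanity check on a degenerate example: Y identically 1 on a dense design,
# so C(t, t) + sigma^2 = 1 and the estimate should be close to 1 in the interior.
T_dense = np.linspace(0.0, 1.0, 1001)
est = smooth_squares([(T_dense, np.ones_like(T_dense))], np.array([0.5]), h=0.1)
```

Subtracting an estimate of $\sigma^2$ from this output gives the diagonal estimate $\widehat{\bar{C}}(t)$ described in the text.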
Let $\widetilde{X}_i(t)$ be the $i$-th smoothed sample curve as defined in (3), and $\widetilde{X}'_i(t)$ its derivative. The estimate $\widehat{C}_{h_n}(s, t)$ of the off-diagonal part is then defined (with a slight abuse of notation) through the linearized kernel smoother in (9). Here $w(m_i) = \frac{m_i}{m_i - 1}$ is a weight function determined through an asymptotic bias analysis (Proposition 2.1). Note that, as long as $|s - t| \geq A h_n$ for some constant $A$ depending on $B_K$ and $C_Q$, the terms with $l = l'$ are absent from the inner sum in definition (9). Therefore, according to our analysis in the previous section, they do not contribute anything by way of bias. Now let $W(\cdot, \cdot)$ be a weight kernel on the domain $[0, 1]^2$. Define $W_{\widetilde{h}_n}(s, t) = W((s - t)/\widetilde{h}_n)$ and $\overline{W}_{\widetilde{h}_n}(s, t) = 1 - W_{\widetilde{h}_n}(s, t)$, where $\widetilde{h}_n = A h_n$ for the above $A > 0$. We then smooth the kernels $W_{\widetilde{h}_n}$ and $\overline{W}_{\widetilde{h}_n}$ by convolving them with a Gaussian kernel $G_{\tau_n}(\cdot)$ with a small bandwidth $\tau_n$ (in the sense that $\tau_n = o(h_n)$); with an abuse of notation, we denote the resulting smoothed kernels again by $W_{\widetilde{h}_n}$ and $\overline{W}_{\widetilde{h}_n}$. Finally, the proposed combined estimator $\widehat{C}_c(s, t)$ of $C(s, t)$ is defined in (11) by weighting the off-diagonal estimate $\widehat{C}_{h_n}(s, t)$ and the diagonal estimate $\widehat{\bar{C}}_{h_n}(\cdot) := \widetilde{C}^{*, h_n}(\cdot) - \widehat{\sigma}^2$ with these two kernels. The use of the maximum in the second term of (11) simply guarantees that the estimator of the diagonal is nonnegative and that the bias is $O(h_n^2)$. We now briefly discuss the computational aspects of the proposed estimator. A key step is the computation of the functions $S_i(\cdot)$ and $\widetilde{X}_i(\cdot)$, and their derivatives, at the grid points $s_l : l = 1, \ldots, L_n$. Each of these computations requires $O(m_i)$ floating point operations (for each $i = 1, \ldots, n$). From these, we obtain $\widehat{C}_{h_n}(s, t)$ and $\widetilde{C}^{*, h_n}(t)$ by using (9) and (7), respectively. Both expressions are in the form of discrete convolutions, and hence can be computed very rapidly using the Fast Fourier Transform. Thus, the estimation procedure is computationally very efficient, requiring $O(n \bar{m} L_n \log L_n)$ computations on the whole grid, where $\bar{m} = \max_i m_i$.
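The merging step can be sketched as a convex combination with a smooth transition across the band $|s - t| < A h_n$. In the sketch below, the tanh transition is an illustrative substitute for the paper's convolved weight kernel, and averaging the diagonal estimate at $s$ and $t$ is a hypothetical way to extend it to the band:

```python
import numpy as np

def combine(C_off, C_diag, grid, h_tilde, tau):
    """Merge an off-diagonal surface estimate C_off (L x L) with a diagonal
    estimate C_diag (length L) via a smooth weight that downweights the band
    |s - t| < h_tilde; tau controls the width of the transition."""
    S, T = np.meshgrid(grid, grid, indexing="ij")
    d = np.abs(S - T)
    # Smooth transition: ~0 on the diagonal, ~1 outside the band
    W = 0.5 * (1.0 + np.tanh((d - h_tilde) / tau))
    # Extend the diagonal estimate into the band by averaging at s and t
    band = 0.5 * (C_diag[:, None] + C_diag[None, :])
    return W * C_off + (1.0 - W) * np.maximum(band, 0.0)
```

The `np.maximum(..., 0.0)` mirrors the use of the maximum in (11), which keeps the diagonal part of the estimator nonnegative.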

Estimation of σ 2
Here we briefly outline a method for estimating the error variance σ 2 . The method is similar to the approach taken in Yao, Müller and Wang (2006), and hence we omit the details.
First, for a given bandwidth $h_n$, we estimate the function $C(s, t)$ for $|s - t| > A h_n$, for some $A$ depending on $B_K$ and $C_Q$, using (9). Then, as in Yao et al. (2006), we estimate the diagonal $\{C(t) : t \in [0, 1]\}$ by an oblique linear interpolation (12) of the off-diagonal estimates, determined by a probability distribution function $G$ whose support is bounded below by $A_1$. On the other hand, we estimate the curve $\{C(t) + \sigma^2 : t \in [0, 1]\}$ by $\widetilde{C}^{*, h_n}(t)$ defined in (7). We then estimate $\sigma^2$ by
$$ \widehat{\sigma}^2 = \frac{1}{T_1 - T_0} \int_{T_0}^{T_1} \big\{ \widetilde{C}^{*, h_n}(t) - \widehat{C}(t) \big\}\, dt, \qquad (13) $$
where $0 < T_0 < T_1 < 1$. It can be shown (Corollary 4.1 in Section 4) that the estimator $\widehat{\sigma}^2$ thus obtained is consistent for an appropriate choice of $h_n$.
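The averaging step in (13) admits a one-line sketch on gridded inputs: subtract the diagonal estimate from the estimate of $C(t, t) + \sigma^2$ and average over an interior interval $[T_0, T_1]$. The truncation at zero below is an added safeguard (a variance is nonnegative), not part of (13) itself:

```python
import numpy as np

def estimate_sigma2(C_star_diag, C_diag, grid, T0=0.25, T1=0.75):
    """sigma2_hat = average over [T0, T1] of {C*(t) - C(t)}, truncated at 0,
    where C*(t) estimates C(t, t) + sigma^2 and C(t) estimates C(t, t).
    Inputs are evaluated on a uniform grid, so the integral reduces to a mean."""
    mask = (grid >= T0) & (grid <= T1)
    return max(float(np.mean((C_star_diag - C_diag)[mask])), 0.0)
```

For a uniform grid the sample mean over the masked points is a Riemann approximation of the normalized integral in (13).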

Bandwidth selection
The choice of an optimal bandwidth for the kernel is a key step in any kernel-based estimation procedure. Yao et al. (2005) use a leave-one-curve-out cross validation score based on the prediction error for selecting the bandwidth of the smoother, and an AIC approach for selecting the number of non-zero eigenvalues. However, leave-one-curve-out cross validation is computationally very expensive. Also, as shown below, the prediction error loss is not an appropriate criterion for cross validation in the current context. Therefore, in this paper we address the issue of model selection through an approximation to the leave-one-curve-out cross validation score based on the empirical Kullback-Leibler loss. The approximation rests on the idea that the estimator obtained by dropping any single curve is a small perturbation of the estimator based on the whole data. In particular, we use the perturbation theory of linear operators to quantify this perturbation and produce a first order approximation to the CV score that is computationally efficient. This also enables us to select the bandwidth and the dimension of the process simultaneously. We first discuss the choice of the loss function, which is crucial for any cross-validation scheme. We want to point out that the prediction problem is intrinsically different from the estimation of the covariance kernel; the criterion based on the prediction error loss is not correctly scaled, as opposed to the one based on the empirical Kullback-Leibler loss. To make this point clear, we examine the two cross validation criteria in detail.
We assume that the covariance kernel can be represented using $K$ orthonormal eigenfunctions for some $K \geq 1$. Then the leave-one-curve-out cross validation score based on the prediction error loss is given by
$$ CV(K, h_n) = \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{m_i} \Big( Y_{ij} - \widehat{\mu}^{(-i)}(T_{ij}) - \sum_{\nu=1}^K \widehat{\xi}^{(-i)}_{i\nu}\, \widehat{\psi}^{(-i)}_\nu(T_{ij}) \Big)^2. $$
Here $\widehat{\mu}^{(-i)}(t)$ and $\widehat{\psi}^{(-i)}_\nu(t)$ are the estimates of $\mu(t)$ and $\psi_\nu(t)$ computed from the observations $\{Y_{i'}\}_{i' \neq i}$, and $\widehat{\xi}^{(-i)}_{i\nu}$ is the estimated principal component score for curve $i$ based on $\{Y_{i'}\}_{i' \neq i}$. Note that the estimated principal component scores can be obtained through the procedure described in Yao et al. (2005), even though this will not be necessary for the model selection procedure we shall adopt.
On the other hand, the CV score based on the empirical Kullback-Leibler loss is given by
$$ CV^*(K, h_n) = \frac{1}{n} \sum_{i=1}^n \ell_i\big( \widehat{\Psi}^{(-i)}, \widehat{\Lambda}^{(-i)} \big), \qquad (15) $$
where $\ell_i$ is (up to an additive constant) the negative log-likelihood of the $i$-th observation under the working assumption of Gaussianity, namely
$$ \ell_i(\Psi, \Lambda) = \frac{1}{2} \log \det(\Sigma_i) + \frac{1}{2}\, Y_i^T \Sigma_i^{-1} Y_i, \qquad \Sigma_i = \sum_{\nu=1}^K \lambda_\nu\, \psi_\nu(T_i)\, \psi_\nu(T_i)^T + \sigma^2 I_{m_i}. $$
To gain an understanding of what these CV scores are approximating, we assume that we have two independent samples, each with $n$ i.i.d. sample curves. Furthermore, to simplify the exposition, we assume that $\mu \equiv 0$. Suppose that the estimates $\widehat{\Psi} = \{\widehat{\psi}_\nu\}_{\nu=1}^K$, $\widehat{\Lambda} = \{\widehat{\lambda}_\nu\}_{\nu=1}^K$ are obtained from the first sample. Then a leave-one-curve-out CV score can be reasonably approximated by substituting these estimates in the corresponding empirical loss function based on the second sample, and with an abuse of notation we also denote this quantity by $CV$. If $\ell_i(\Psi, \Lambda)$ denotes the loss function corresponding to the $i$-th observation in the second sample, then the CV score is given by $\frac{1}{n} \sum_{i=1}^n \ell_i(\widehat{\Psi}, \widehat{\Lambda})$. For simplicity, we assume that there is a true model $(\Psi^*, \Lambda^*)$ within the class of models we are considering. A first order expansion of the difference between the CV scores under the true and estimated parameters for the empirical Kullback-Leibler loss shows that, with high probability,
$$ \frac{1}{n} \sum_{i=1}^n \big[ \ell_i(\widehat{\Psi}, \widehat{\Lambda}) - \ell_i(\Psi^*, \Lambda^*) \big] \approx \frac{1}{4n} \sum_{i=1}^n \big\| (\Sigma_i^*)^{-1/2} \big( \widehat{\Sigma}_i - \Sigma_i^* \big) (\Sigma_i^*)^{-1/2} \big\|_F^2, \qquad (16) $$
where $\|\cdot\|_F$ is the Frobenius norm, and $\Sigma_i^*$ and $\widehat{\Sigma}_i$ are the covariance matrices of the observation $Y_i = (Y_{i1}, \ldots, Y_{im_i})^T$ corresponding to the true parameter $(\Psi^*, \Lambda^*)$ and the estimates $(\widehat{\Psi}, \widehat{\Lambda})$, respectively. Since the higher order terms can essentially be ignored, (16) gives a quadratic approximation to the CV score. Notice that, in each term within the summation of this quadratic approximation, directions with high variability are down-weighted by the multiplicative factors $(\Sigma_i^*)^{-1/2}$. Therefore this CV score based on the empirical Kullback-Leibler loss is properly scaled. Moreover, note that approximation (16) does not really depend on Gaussianity, but only on the tails of the distributions involved.
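Under the working Gaussian assumption, the per-curve loss and the low-rank model covariance at the design points can be sketched as follows. The rank-$K$-plus-noise structure of $\Sigma_i$ follows the text; the particular eigenfunctions passed in are up to the caller:

```python
import numpy as np

def model_cov(T_i, psis, lams, sigma2):
    """Sigma_i = sum_nu lam_nu psi_nu(T_i) psi_nu(T_i)^T + sigma2 * I."""
    B = np.column_stack([p(T_i) for p in psis])   # m_i x K matrix of psi_nu(T_ij)
    return (B * lams) @ B.T + sigma2 * np.eye(len(T_i))

def neg_loglik(Y_i, Sigma_i):
    """Gaussian negative log-likelihood of one curve, up to an additive constant:
    l_i = 0.5 * log det(Sigma_i) + 0.5 * Y_i^T Sigma_i^{-1} Y_i."""
    sign, logdet = np.linalg.slogdet(Sigma_i)
    if sign <= 0:
        raise ValueError("Sigma_i must be positive definite")
    return 0.5 * logdet + 0.5 * float(Y_i @ np.linalg.solve(Sigma_i, Y_i))
```

Averaging `neg_loglik` over held-out curves, with `Sigma_i` built from leave-one-curve-out estimates, gives the empirical Kullback-Leibler CV score (15).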
On the other hand, it can be shown by simple algebra that, up to a multiplicative factor, the CV score based on the prediction error loss can be written in terms of $S_i := Y_i Y_i^T$, the empirical covariance matrix corresponding to the $i$-th observation vector. The corresponding difference of the CV scores between the estimated and true parameters, given in (17), involves the discrepancy $\widehat{\Sigma}_i - \Sigma_i^*$ without the inner standardization by $(\Sigma_i^*)^{-1/2}$ that makes the Kullback-Leibler expansion properly scaled. Therefore, from (17) it is clear that this CV score itself is not correctly scaled. Also, the expression in (17) is not necessarily nonnegative. This means that the prediction error loss does not enjoy the appealing property of the Kullback-Leibler loss that the minimum of the expected loss occurs at the true parameter. Hence the use of the prediction error loss is not recommended for the current problem.

First order approximation
Direct computation of the criterion $CV^*(K, h_n)$ (equation (15)) is laborious, since we would need to compute $\widehat{C}^{(-i)}_c(s, t)$ and perform its eigen-analysis for every $i = 1, \ldots, n$. Therefore, we propose to approximate $CV^*(K, h_n)$ by using first order approximations to the leave-one-curve-out quantities $\widehat{\mu}^{(-i)}$, $\widehat{\psi}^{(-i)}_\nu(\cdot)$ and $\widehat{\lambda}^{(-i)}_\nu$ around the estimates $\widehat{\mu}$, $\widehat{\psi}_\nu(\cdot)$ and $\widehat{\lambda}_\nu$, respectively. The approximations of the eigenfunctions and eigenvalues are based on a perturbation analysis. The key idea is that the leave-one-curve-out estimator $\widehat{C}^{(-i)}_c$ is a small perturbation $\Delta_i := \widehat{C}^{(-i)}_c - \widehat{C}_c$ away from $\widehat{C}_c$ (for convenience we omit $h_n$ in the notation). Since the latter quantity has a rather simple expression, involving essentially only the $i$-th observation, this step substantially reduces the computational burden of the cross-validation procedure.
Proposition 3.1. For the proposed estimator $\widehat{C}_c$ given by (11), we have first order expansions for the leave-one-curve-out eigenvalues and eigenfunctions in terms of the perturbation $\Delta_i$, where (a) $P_\nu = \psi_\nu \otimes \psi_\nu$, where, for $f, g \in L^2([0, 1])$, $f \otimes g$ denotes the integral operator with kernel $f(x) g(y)$, and $\delta$ denotes the Dirac delta function. Here $h_\mu$ is the bandwidth for estimating $\mu$ (chosen separately).
After obtaining the approximations for $\widehat{\psi}^{(-i)}_\nu$ and $\widehat{\lambda}^{(-i)}_\nu$ from Proposition 3.1, we plug them back into equation (15) for $CV^*(K, h_n)$ to obtain the final approximation of the CV score, denoted by $\widetilde{CV}^*(K, h_n)$, with the relevant quantities given by (23). An expression for $\widehat{\sigma}^2_{(-i)} - \frac{n}{n-1} \widehat{\sigma}^2$ is easily obtained by using (7), (12) and (13). Note that this step does not require any extra computation beyond that needed for computing $\widehat{\sigma}^2$.
Observe that the objective of minimizing the criterion $\widetilde{CV}^*(K, h_n)$ is to estimate the number of nonzero eigenvalues and to select an appropriate bandwidth for estimating the eigenfunctions. If instead the objective is to select an appropriate bandwidth for estimating the covariance kernel, we can do so by replacing the term $\sum_{\nu=1}^K \widehat{\lambda}_{i\nu} \widehat{\psi}_{i\nu} \widehat{\psi}_{i\nu}^T$ in the definition of $\widehat{\Sigma}_i$ with the leave-one-curve-out estimate of the covariance kernel, viz. $\widehat{C}^{(-i)}_{c, h_n}$ evaluated at the design points, and minimizing the corresponding CV criterion. This distinction is important, since the theoretical results (Theorems 4.1 and 4.2) show that the optimal rates for the bandwidth $h_n$ are different for estimating the covariance kernel and for estimating its eigenfunctions.

Representation of
In order to compute the approximate CV score $\widetilde{CV}^*(K, h_n)$ efficiently, we need to compute the quantities $(H_\nu \Delta_i \psi_\nu)(t)$ and $\mathrm{tr}(P_\nu \Delta_i)$ in an efficient manner. Thus we have the following further approximation, based on Lemma 7.1.
Here, for any two functions $f_1$ and $f_2$ defined on $[0, 1]$, the quantities above are defined through the corresponding integral operators. The computation of $\gamma_{k, h_n}(i)$ can be done easily using the fast Fourier transform. Also, $\gamma_{k, h_n}(i, t) \approx \gamma_{k, h_n}(i)$ for all $t \in [0, 1]$. However, the computation of $\gamma_{k, k', h_n}(i)$ involves a double integration, so some further approximation is needed to simplify the computation. A computationally efficient approximation to $\gamma_{k, k', h_n}(i)$ is described in Appendix B. The computation of $\beta_{1,h}(u)$ and $\beta_{2,h}(u)$ can be done in closed form whenever $Q_h(\cdot)$ has a "nice" functional form (e.g. a B-spline). From Propositions 3.1 and 3.2 it is clear that most of the required components have already been computed in constructing the estimator, and the convolutions can be performed rapidly using the FFT. Thus, the key advantage afforded by Proposition 3.2 is to replace the expensive computation of double integrals with a much cheaper computation of single integrals and convolutions. See Appendix F for the details of some of these steps.
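The discrete-convolution structure mentioned above is what makes FFT evaluation possible. A generic sketch (not tied to the paper's specific kernels) is:

```python
import numpy as np

def convolve_fft(weights, kern):
    """Full discrete convolution out[l] = sum_k kern[l - k] * weights[k],
    computed in O(L log L) via the FFT; agrees with direct convolution."""
    n_fft = len(weights) + len(kern) - 1
    return np.fft.irfft(np.fft.rfft(weights, n_fft) * np.fft.rfft(kern, n_fft), n_fft)
```

Zero-padding to length $L + P - 1$ before transforming makes the circular FFT convolution coincide with the linear convolution that the kernel smoother requires.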

Asymptotic properties
In this section, we present the theoretical properties of the proposed estimators through a large sample analysis. Our main interest is in the estimation accuracy of the covariance kernel and its eigenfunctions. The statements of the results and the associated regularity conditions are given below.
We first state the following assumptions on $g$, the density of the design points; $C$, the covariance kernel; and $\{\psi_k\}_{k=1}^M$, the eigenfunctions. A1 $g$ is twice continuously differentiable and its second derivative is Hölder continuous of order $\alpha$, for some $\alpha \in (0, 1)$. The same holds for the covariance kernel $C$.
We also assume that the kernels $K(\cdot)$ and $Q(\cdot)$ satisfy conditions B1 and B2, respectively. We need to make further assumptions about the covariance kernel $C$ and the correlations among the sample curves. Let $R$ denote the $n \times n$ matrix with $(i,j)$-th entry $\rho_{ij}$. Assume: C1 $\lambda_1 > \lambda_2 > \cdots > \lambda_M > 0$ and $\lambda_{M+1} = \cdots = 0$. That is, the nonzero eigenvalues are all distinct and the covariance kernel is of finite dimension.
$\to 0$ as $n \to \infty$, and $\|R\| \le \kappa_n$ for some $\kappa_n > 0$. Note that the first part of C3 quantifies the total contribution of the correlations among the sample curves to the variance of the estimated covariance kernel (see Theorem 4.1). The second part of C3 imposes a stability condition on the correlation matrix $R$; in other words, the sample curves are "weakly correlated," since $\|R\|$ is bounded by $\kappa_n$. Define $\underline{m} = \min_{1 \le i \le n} m_i$ and $\overline{m} = \max_{1 \le i \le n} m_i$. We further assume that C4 $\overline{m}/\underline{m}$ is bounded above as $n \to \infty$.
We now give the bias and variance of the proposed combined estimator.
One implication of Theorem 4.1 is that it yields the rate of convergence of the estimator $\widehat\sigma^2$ defined in (13), as illustrated in the following corollary.
Using Corollary 4.1 and Theorem 4.1, we obtain a bound on the variance of the proposed estimator of the covariance kernel when $\sigma^2$ is estimated by $\widehat\sigma^2$ defined in (13).
Next we state the result on the asymptotic behavior of the estimated eigenfunctions. Let the loss function for $\psi_\nu$ be the modified $L^2$-loss given by $L(\widehat\psi_\nu, \psi_\nu) = \|\widehat\psi_\nu - \mathrm{sign}(\langle \widehat\psi_\nu, \psi_\nu \rangle)\, \psi_\nu\|_2^2$, where $\|\cdot\|_2$ denotes the $L^2$ norm and $\langle \widehat\psi_\nu, \psi_\nu \rangle = \int_0^1 \widehat\psi_\nu(x) \psi_\nu(x)\, dx$. For the statement of Theorem 4.2, we only need to assume that the estimator $\widehat\sigma^2$ of $\sigma^2$ satisfies $E(\widehat\sigma^2 - \sigma^2)^2 = o(1)$.
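The modified $L^2$-loss simply removes the sign ambiguity inherent in eigenfunctions. A minimal numerical sketch (the grid, the test function, and the function name are illustrative):

```python
import numpy as np

def mod_l2_loss(psi_hat, psi, dx):
    """Modified L2 loss: L2 distance after aligning the arbitrary sign of
    the estimated eigenfunction with the target (hypothetical helper)."""
    s = np.sign(np.sum(psi_hat * psi) * dx)     # sign of <psi_hat, psi>
    return np.sum((psi_hat - s * psi) ** 2) * dx

x = np.linspace(0, 1, 1001)
dx = x[1] - x[0]
psi = np.sqrt(2.0) * np.sin(np.pi * x)          # unit L2 norm on [0, 1]
print(mod_l2_loss(-psi, psi, dx))               # a pure sign flip costs nothing: 0.0
```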
One important implication of Theorems 4.1 and 4.2 is that, if the correlation between sample curves is "weak" in a suitable sense, then the best rates of convergence in the correlated and i.i.d. cases are the same. Comparing with the i.i.d. case, we immediately see that, in order for this to hold under the conditions of Theorem 4.2, we need where $h_{n,*}$ is the optimal bandwidth choice (at the level of rates of convergence) for the i.i.d. case. Also, in order to ensure that the optimal rate for the estimate of the covariance kernel in the correlated case is the same as in the i.i.d. case, it is sufficient (by Corollary 4.2) that where $h_{n,*}$ is the optimal bandwidth choice for the covariance estimator (at the level of rates of convergence) for the i.i.d. case. Specifically, for estimating the covariance, $h_{n,*} = (nm^2)^{-1/6}$. Another important point is that the conditions of Theorem 4.2, specifically that $m h_n = o(1)$, $n h_n^2 \to \infty$, and $m \kappa_n h_n^{-1} n^{-1/2 + \epsilon'} = o(1)$, which imply that $m = o(n^{1/4})$, are not the most general. We conjecture that (36) holds under weaker conditions. Indeed, in the i.i.d. case, (36) holds (without the second term on the RHS) for a much wider range of possible values of $m$, as indicated by the following result, which gives a lower bound on the rate of convergence of the first eigenfunction when $m \to \infty$ under the i.i.d. setting. This bound is a refinement of an analogous result (Theorem 2) in Hall et al. (2006), even though the latter holds for all eigenfunctions. Notice that this lower bound, together with the upper bound elucidated in the paragraph following Theorem 4.2, implies that, at least for the first eigenfunction, the rate $O((nm)^{-4/5})$ is optimal when $m \to \infty$ at a fast enough rate and (37) holds. Theorem 4.3.
Let $\mathcal{C}$ denote the class of covariance kernels $\Sigma(\cdot,\cdot)$ on $[0,1]^2$ with rank $\ge 1$ and nonzero eigenvalues $\{\lambda_j\}_{j \ge 1}$ satisfying $C_0 \ge \lambda_1 > \lambda_2 \ge 0$ with $\lambda_1 - \lambda_2 \ge C_1$, and with first eigenfunction $\psi_1$ twice differentiable and satisfying $\|\psi_1''\|_\infty \le C_2$, for some constants $C_0, C_1, C_2 > 0$. Also, let $\mathcal{G}$ denote the class of continuous densities $g$ on $[0,1]$ such that $c_1 \le g \le c_2$ for some $0 < c_1 \le 1 \le c_2 < \infty$. Suppose that we observe data according to model (1), where the $X_i(\cdot)$ are i.i.d. Gaussian processes with mean $0$ and covariance kernel $\Sigma$. Also suppose that the number of . Then for sufficiently large $n$, for any estimator $\widehat\psi_1$ with $L^2$ norm one, the following holds: The proof of Theorem 4.3 is given in Appendix G.
Once we obtain an expression for this (as given in Section 5.1), we use a probabilistic bound on the operator norm of the difference between estimated and true covariance kernels, to complete the proof. Proofs of Theorems 4.1 and 4.2 require repeated computation of mixed moments of correlated Gaussian random variables. The details of all these computations are given in the appendices.

Asymptotic risk for estimating $\psi_\nu$
The key result in this section is the following proposition.
for an arbitrary but fixed $\epsilon > 0$.
Here we briefly describe the main idea of the proof. For convenience of exposition, throughout we replace $\max\{\widehat C^*(\frac{s+t}{2}), h_n^2\}$ in the definition (11) by $\widehat C^*(\frac{s+t}{2})$. Using appropriate exponential inequalities for $\widehat C^*(t)$, it can be shown that asymptotically this makes no difference as long as $\min_{t \in [0,1]} C(t,t) > c_3$ for some $c_3 > 0$. Also, for computational purposes, it is helpful to consider the unsmoothed version (10) of the kernel $W$, and to take $\widetilde h_n = A h_n$, where $A \ge 4(B_K + C_Q)$. The advantage of this is in being able to deal with the contributions from the diagonal and off-diagonal parts of the estimator separately. Since the definition of $H_\nu$ involves the Dirac-$\delta$ operator, we need to account carefully for the contribution of terms involving $\delta$. The estimation error in $\widehat\sigma^2$ also plays a role, and is taken into account separately. The main decompositions that facilitate the computations are given through (55)-(57) in Appendix D. The last bound reduces the task of bounding , with $\bar C_c(\cdot,\cdot)$ as described in Appendix D. Note also that, if $\sigma^2$ is assumed known, then the decomposition (57) is not required, and we can dispense with the multiplicative factor $(1+\epsilon)$ in the expression (36) for the risk in Theorem 4.2.

Norm bound on $\widehat C_c - E\,\widehat C_c$
To complete the proofs of the theorems, we need a probabilistic bound for $\|\widehat C_c - E\widehat C_c\|$, where $\|\cdot\|$ denotes the operator norm. We first find a bound on the sup norm of $\widehat C_c - E\widehat C_c$, which in turn bounds the operator norm, due to the inequality $\|\widehat C_c - E\widehat C_c\| \le \|\widehat C_c - E\widehat C_c\|_F \le \|\widehat C_c - E\widehat C_c\|_\infty$, where $\|\cdot\|_F$ denotes the Hilbert-Schmidt norm. Note that, by piecewise differentiability of the estimate $\widehat C_c$, in order to provide exponential bounds for the deviations of $\|\widehat C_c - E\widehat C_c\|_\infty$, it is enough to provide exponential bounds for the fluctuations of $|\widehat C_c(s,t) - E[\widehat C_c(s,t)]|$ at a finite (but polynomially growing with $n$) number of points $(s,t) \in [0,1]^2$. Thus, we fix an arbitrary $(s,t) \in [0,1]^2$ and derive an exponential inequality for the deviation of the estimate at this point. To simplify the computations, without loss of generality we assume that $g$ is the density of the Uniform$(0,1)$ distribution. Then we have the following proposition.
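The chain of norm bounds used here (operator norm bounded by the Hilbert-Schmidt norm, which the sup norm controls) has a direct finite-dimensional analogue, sketched below for a discretized kernel (the matrix and its scaling are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
# Discretized kernel on an n x n grid: the operator (spectral) norm is
# bounded by the Hilbert-Schmidt (Frobenius) norm, which in turn is
# bounded by n times the sup norm of the entries.
n = 200
A = rng.standard_normal((n, n)) / n    # illustrative kernel values (scaled)

op = np.linalg.norm(A, 2)              # operator norm
hs = np.linalg.norm(A, 'fro')          # Hilbert-Schmidt norm
sup = n * np.max(np.abs(A))            # n * sup-norm dominates the HS norm
print(op <= hs <= sup)
```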

Connection to parametric rate for "purely functional" data
It is instructive to compare the optimal rate for our procedure with that obtained by Hall et al. (2006). We can regard the first line on the right-hand side of (40) as the parametric component of the risk, and the second line as the nonparametric component. If we take $h = O(n^{-1/5})$, then for bounded $m$ we get the optimal nonparametric rate. For consistency of $\widehat\psi_\nu$ in the $L^2$ sense, we clearly need 1 . If $m$ increases with the sample size, then the rate improves as well, but there is no result about optimality.
When the observations are i.i.d., it can be checked, via a modification of the proof of Proposition 5.2, that if $m \to \infty$ and $h \to 0$ such that $(mh)^{-1} = o(1)$ and $h = o(n^{-1/4})$, we obtain the parametric rate for the $L^2$-risk of $\widehat\psi_\nu$ (as indicated in Hall et al., 2006). In other words, under that setting there is asymptotically no difference between the risk of estimating the eigenfunctions from data observed with noise at randomly distributed points and that from data measured on the continuum without noise. Indeed, such a scenario is possible if $m^{-1} = o(n^{-1/4 - \epsilon})$ for some $\epsilon > 0$. Then, by taking $h_n = o(n^{-1/4})$, and assuming that either $\sigma^2$ is known or an estimator $\widehat\sigma^2$ satisfying $|\widehat\sigma^2 - \sigma^2| = O_P(h_n^2)$ is available, we attain the conditions mentioned above. We conjecture that the same result holds even when the observations are "weakly" correlated.

Discussion
In this paper, we presented a procedure for estimating the covariance kernel and its eigenfunctions from sparsely observed, noise corrupted and correlated functional data. The estimator for the covariance kernel is based on merging two separate estimators: (i) the estimator of the off-diagonal part, based on computing linearized empirical covariances of the smoothed versions of the individual sample curves; (ii) the estimator of the diagonal part, based on linearized kernel smoothing of the empirical variances. The importance of this modification to the naive kernel smoothing approach, especially when the number of design points per curve is small, is demonstrated through an asymptotic bias analysis. The linearized version of the kernel smoothing helps in reducing bias while controlling the variance, and is computationally appealing. The asymptotic risk behavior of the proposed estimators is studied under the assumption that the sample curves have a "separable covariance" structure and are "weakly" correlated. An exact quantification of the asymptotic risk for the eigenfunctions is obtained under the Gaussian setting (Theorem 4.2). It is also shown that the $L^2$-risk for the eigenfunctions achieves the optimal rate, under an appropriate choice of the bandwidth, when the number of measurements per curve is bounded. Also, in the i.i.d. case, we obtain a lower bound on the rate of convergence for estimating the first eigenfunction that is sharper than bounds in the existing literature, which proves the rate-optimality of our estimator in a wider regime. Finally, we propose a computationally tractable model selection procedure based on minimizing an approximation to the leave-one-curve-out cross validation score that uses the empirical Kullback-Leibler loss. We also show that, in the context of estimating the covariance kernel or its eigenfunctions, it has clear advantages over the commonly used prediction error loss.
The proposed procedure for estimation and model selection is easily implementable and computationally more tractable as compared to some of the existing methods. Moreover, due to the linear structure of the pre-smoothing of individual curves, our estimator is stable. Furthermore, the linear structure of the proposed estimator also allows for a simple approximation to the cross validation score. Finally, even though the results are proved under Gaussianity of the noise process, it can be shown that at the level of rates of convergence, the upper bounds hold under sufficient moment conditions on the noise, and hence the estimator is expected to be robust to distributional assumptions.
There are a few aspects of the estimation procedure that need further exploration. In the asymptotic analysis, we assumed that $g$, the density function of the design points, is known. In practice it has to be estimated from the data. Additional computations are needed to show that the results derived here hold in that setting as well. It will also be useful to study its impact on the estimation procedure through simulation studies, and in real data applications where the assumption of exact randomness of the design points may be violated.
A natural generalization of the framework studied in this paper is the setting where the principal component scores jointly form a stationary vector autoregressive process. Under such a setting, we would like to extend the estimation and model selection procedures described here to exploit the special structure of such processes. This is likely to capture the statistical properties of some real-life phenomena and also to help in model building and prediction, for example in spatio-temporal models where the covariance is not separable.

Appendix
Appendix A

Perturbation of eigen-structure
The following lemma is a modified version of a similar result in Paul and Johnstone (2007). Several variants of this lemma appear in the literature (see, e.g., Kneip and Utikal (2001), Cai and Hall (2006)), and most of them implicitly use the approach taken in Kato (1980). In the following we use $\|A\|$ to denote the operator norm of an operator $A$, i.e., the largest singular value of $A$. Then the residual term $R_r$ can be bounded as where the second bound holds only if $\delta_r <$ where $P_{E_{r_1}}(A)$ is the orthogonal projection operator of $A$ corresponding to the eigenvalues $\lambda_{r_1}(A), \ldots, \lambda_{r_2}(A)$, and the residual $R_{r_1,r_2}$ satisfies $|R_{r_1,r_2}| \le (r_2 - r_1 + 1)\, \frac{6\|B\|^2}{\min_{1 \le j \ne r \le \infty} |\lambda_j(A) - \lambda_r(A)|}$.
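The flavor of this perturbation lemma can be checked numerically: for a symmetric operator $A$ with well-separated eigenvalues and a small symmetric perturbation $B$, the first-order expansion of an eigenvector is accurate up to a residual of order $\|B\|^2$ over the spectral gap. The sketch below is a generic illustration, not the lemma's exact statement:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
lam = np.arange(n, 0, -1.0)                    # well-separated eigenvalues
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(lam) @ U.T                     # symmetric operator A

eps = 1e-3
G = rng.standard_normal((n, n))
B = eps * (G + G.T) / 2                        # small symmetric perturbation

w, V = np.linalg.eigh(A)                       # eigenvalues in ascending order
wp, Vp = np.linalg.eigh(A + B)
r = n - 1                                      # index of the top eigenvector
v, vhat = V[:, r], Vp[:, r]
vhat = vhat * np.sign(vhat @ v)                # fix the sign ambiguity

# First-order perturbation: vhat ~ v + sum_{j != r} <v_j, B v>/(w_r - w_j) v_j
corr = sum((V[:, j] @ B @ v) / (w[r] - w[j]) * V[:, j]
           for j in range(n) if j != r)
err0 = np.linalg.norm(vhat - v)                # O(||B||)
err1 = np.linalg.norm(vhat - (v + corr))       # O(||B||^2): much smaller
print(err1 < err0)
```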

Large deviations of quadratic forms
The following lemmas are from Paul (2004). Suppose that $\Phi : \mathcal{X} \to \mathbb{R}^{n \times n}$ is a measurable function, and let $Z$ be a random variable taking values in $\mathcal{X}$.
Lemma 7.2. Suppose that $X$ and $Y$ are i.i.d. $N_n(0, I)$ and are independent of $Z$. Then for every $L > 0$ and $0 < \delta < 1$, for all $0 < t < \frac{\delta}{1-\delta} L$,
Lemma 7.3. Suppose that $X$ is distributed as $N_n(0, I)$ and is independent of $Z$. Also assume that $\Phi(z) = \Phi^T(z)$ for all $z \in \mathcal{X}$. Then for every $L > 0$ and $0 < \delta < 1$, for all $0 < t < \frac{2\delta}{1-\delta} L$,

Computation of conditional mixed moments
In order to calculate the bias and variance of the proposed estimator, we need to compute the conditional mixed moments of the observations. We shall use the following well-known result, a special case of the Wick formula. We shall use the formula to compute the above mixed moments together with the observation that $\mathrm{Cov}(X_{i_1 j_1}, X_{i_2 j_2} \mid T_{i_1}, T_{i_2}) = \rho_{i_1 i_2} C(T_{i_1 j_1}, T_{i_2 j_2})$. The details of this computation in the various generic cases are given in Appendix F.
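The Wick (Isserlis) formula for a zero-mean Gaussian quadruple states $E[X_1 X_2 X_3 X_4] = \sigma_{12}\sigma_{34} + \sigma_{13}\sigma_{24} + \sigma_{14}\sigma_{23}$. A quick Monte Carlo check (with an assumed equicorrelated covariance):

```python
import numpy as np

rng = np.random.default_rng(4)
# Equicorrelated zero-mean Gaussian quadruple (illustrative covariance)
rho = 0.3
S = (1 - rho) * np.eye(4) + rho * np.ones((4, 4))

# Wick (Isserlis) formula: E[X1 X2 X3 X4] = S12*S34 + S13*S24 + S14*S23
wick = S[0, 1] * S[2, 3] + S[0, 2] * S[1, 3] + S[0, 3] * S[1, 2]  # = 0.27

X = rng.multivariate_normal(np.zeros(4), S, size=1_000_000)
mc = float(np.mean(X[:, 0] * X[:, 1] * X[:, 2] * X[:, 3]))
print(wick, mc)   # the Monte Carlo estimate is close to the exact value
```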

Appendix B
In Appendix B and the following appendices, we shall often write $h$ and $\widetilde h$ to denote $h_n$ and $\widetilde h_n$, respectively, and we shall drop the subscript $h_n$ from the covariance estimates. For example, $\widehat C_c$ will be used to denote $\widehat C_{c,h_n}$.

Proof of Proposition 3.1
This is a straightforward application of Lemma 7.1, taking the estimated covariance kernel $\widehat C_c$ as the operator $A$ and $-\Delta_i = \widehat C_c^{(-i)} - \widehat C_c$ as the operator $B$. Note that in (20) the last term corresponds to the zero eigenvalues of $\widehat C_c$.

Proof of Proposition 3.2
We can express $\Delta_i$ where the kernel $K_{s,l}(\cdot) \equiv K_{s,l,h_n}(\cdot)$ is defined as for $s \in [0, 1]$ and $l = 1, \ldots, L_n$; and for any function $f$, It is easy to see from the expression for $R_i(u,v)$ and (23) that, for reasonable choices of $h_\mu$, the contribution of $R_i(u,v)$ can be ignored, since it is of a smaller asymptotic order (in fact it can be shown to be $o_P(n^{-1})$). Hence, we end up with the approximation . Thus, we can separate the first term on the RHS of (43) into two parts: one with multiplier $1$, and the other with multiplier $W_{\widetilde h_n}(u,v)$. Then, using (22) and the second representation of $\widehat H_\nu$ in (20), we obtain the expressions in the first four lines on the RHS of (24). Next, using the fact that $W^e_{\widetilde h_n}(u,v) \approx 1_{\{|u-v| \le \frac{A}{2} h_n\}}$, and using the approximations $\psi_\nu(v) \approx \psi_\nu(u)$ and $g(\frac{u+v}{2}) \approx g(u)$, we obtain the last three terms on the RHS of (24). Now, using (21), noting that $\mathrm{tr}(\widehat P_\nu \widehat C_c) = \widehat\lambda_\nu$, and following similar arguments, we obtain (25).
Approximation of $\gamma_{k,k',h_n}(i)$
First, to fix notation, suppose that $\widetilde h_n = A h_n$ for some constant $A > 0$. Then, by the definition of $W^e_{\widetilde h_n}$ and the symmetry of $Q$, the integral appearing in (28) can be expressed as (ignoring the boundaries) Noticing that , we can approximate the inner integral (with respect to $v$) by Substituting this in (45), we have the approximation by definition of $G_j(f_1, f_2)(\cdot)$, $j = 0, 1$. Since

then the integrand in the first step of (46) is zero. This implies that if $|s_l - s_{l'}| > (2C_Q + \frac{A}{2})h_n$, then the effective domain of integration is empty, meaning that the integral vanishes. If $Q(\cdot)$ is chosen to be a centered cubic B-spline (so that $C_Q = 2$), we can compute $G^Q_{j'}(\cdot)$ explicitly, without having to perform a numerical integration (Appendix F).

Appendix C
In the following, we often drop the subscript n from h n for simplicity and sometimes we even drop the subscript h from the notation.

Proof of Proposition 2.1
By elementary calculations, and supposing that $m_i \ge 2$ for each $1 \le i \le n$, we have where the last step is by Taylor series expansions. Now, noticing that $K$ is symmetric about $0$, with $\int K(x)\,dx = 1$ and $\int x K(x)\,dx = 0$, (5) and (6) follow from (47) after simplifications.
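The kernel moment conditions used in this step ($\int K = 1$, $\int xK = 0$ by symmetry) can be verified numerically for a concrete kernel; the Epanechnikov kernel below is an example choice, not one fixed by the paper:

```python
import numpy as np

# Epanechnikov kernel K(x) = 0.75*(1 - x^2) on [-1, 1]: symmetric,
# unit mass, zero first moment (an example kernel, not fixed by the paper)
def K(x):
    return np.where(np.abs(x) <= 1, 0.75 * (1 - x ** 2), 0.0)

x = np.linspace(-1, 1, 200001)
dx = x[1] - x[0]
mass = float(np.sum(K(x)) * dx)       # ~ 1: the kernel integrates to one
first = float(np.sum(x * K(x)) * dx)  # ~ 0: zero first moment by symmetry
print(round(mass, 6), round(first, 6))
```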

Asymptotic pointwise bias (31)
We first compute the expected value of the estimate described by (11). For notational simplicity, we write $X_i(s_l) + (s - s_l) X_i'(s_l)$ as $X_{i,l}(s)$. Observe that Let the support of the kernel $K(\cdot)$ be denoted by $[-B_K, B_K]$. Then, for each fixed $j = 1, \ldots, m_i$ and $i = 1, \ldots, n$, We assume that the conditions in Section 4 hold. Then, using the representation (49) and the calculations done in Appendix F, we get an expression for the asymptotic bias in estimating $C(\cdot,\cdot)$ as a function of the bandwidth $h \equiv h_n$. These results are summarized in the following lemmas, where $C_s$, $C_{ss}$ and $C_t$, $C_{tt}$ denote the first and second partial derivatives of $C(s,t)$ with respect to $s$ and $t$, respectively.
Then, for $|s - t| > 2A h_n$, Note that, because of property (iii) of the kernel $Q$ and the fact that $s_l = (l + a)h$ for $l = 1, \ldots, L_n$, for some constant $a \in [-3, 3]$, we have, for $s \in (c, 1-c)$ for some $c \in (0, 1)$, Therefore, we can choose $L_n$ and the sequence of points $\{s_l\}_{l=1}^{L_n}$ so that $L_n \approx h_n^{-1}$ and $Q_h(s) \equiv 1$ for all $s \in [0, 1]$. That is, from Lemma 7.5, we have $E \widehat C(s,t) = C(s,t) + O(h^2)$.
The proof of Lemma 7.6 follows along the lines of that of Lemma 7.5. Furthermore, if an estimator $\widehat\sigma^2$ is such that $E\widehat\sigma^2 = \sigma^2 + O(h^2)$, then it follows from Lemma 7.6 that the estimator $\widehat C(t)$ uniformly on $t \in [0, 1]$, since $Q_h(t) \equiv 1$ on $t \in [0, 1]$. Next, since $C(s,t) = C(t,s)$ and $C(\cdot,\cdot)$ is smooth, it follows that $C_s - C_t \equiv 0$. Consequently, using a Taylor series expansion, it follows that, for any $A > 0$, Combining (52) and (53) we get,
Appendix D

Proof of Proposition 5.1
We shall extensively use the following representation where

The first step is to express
Therefore, in order to separate the effect of estimating $\sigma^2$, we use the fact that, for any fixed $\epsilon > 0$, The equality follows since, using $H_\nu \psi_\nu = 0$, the definition of $W^e_{\widetilde h_n}$, and the Mean Value Theorem, we have (1), it is enough to show that $E\|H_\nu \widehat C_c \psi_\nu\|_2^2$ has the bound given by the RHS of (40), without the multiplicative factor $(1+\epsilon)$. With a slight abuse of notation, we write $\widehat C(s)$ to indicate $\widehat C^*(s) - \sigma^2$. Then, since Thus, in order to obtain $E\|H_\nu \widehat C_c \psi_\nu\|_2^2$, we need to evaluate the quantities $E[\widehat C(s_1,t_1)\widehat C(s_2,t_2)]$, $E[\widehat C(\frac{s_1+t_1}{2})\widehat C(\frac{s_2+t_2}{2})]$, and $E[\widehat C(s_1,t_1)\widehat C(\frac{s_2+t_2}{2})]$. Let where $K_{s,l}(\cdot)$ is as in (44). Then we can express the expectation of the first term on the RHS of (58) as The following proposition is the key to obtaining a simplified bound on (60). It is proved by a lengthy but fairly straightforward calculation. The details are given in Appendix F.
Now we deal with the last two terms on the RHS of (58). Let Then, For convenience, in the rest of this subsection we shall use $z_k$ to denote $(s_k + t_k)/2$, for $k = 1, 2$. Then the following proposition describes the contribution of the quantities of the type
Proposition 7.2. Suppose that $A > 4(B_K + C_Q)$. Then for (i) $|s_k - t_k| \le \frac{A h_n}{2}$, $k = 1, 2$, where $Z_7 := Z_7(z_1, z_2)$ is asymptotically equivalent to $Z(z_1, z_2)$. Next, if (ii) $|s_1 - t_1| > \frac{A h_n}{2}$ and where the $O(h_n^2)$ terms within brackets in the first term on the RHS depend on $(s_1, t_1)$ and $(s_2, t_2)$ respectively, and $Z_j := Z_j(s_1, t_1, z_2)$, $j = 8, 9$, satisfy

Asymptotic pointwise variance (32)
In this section, we prove (32), (33) and (34). Most of the derivations are similar to those for Proposition 5.1, so we give only a brief outline.
Since $E(\widehat\sigma^2 - \sigma^2)^2$ has the rate given by (33) (Corollary 4.1), we only need to provide bounds for $W^e_{\widetilde h_n}(s,t)\,\mathrm{Var}(\widehat C(s,t))$ and $W^e_{\widetilde h_n}(s,t)\,\mathrm{Var}(\widehat C^*(\frac{s+t}{2}))$. We state these in the following propositions. The proof of (34) is finished by combining Propositions 7.3 and 7.4 and Corollary 4.1.

Proof of Corollary 4.1
First observe that, On the other hand, since for any bounded $u \in [A_1, A_2]$, uniformly in $t \in [T_0, T_1]$, it follows from Lemmas 7.5 and 7.6 (Appendix C) that the last term on the RHS of (68) is $O(h_n^4)$.

Proof of Proposition 5.2
Without loss of generality, we assume $g$ to be the uniform density on $[0, 1]$. We need to consider two cases separately. Since $|K_{s,l}(T_{ij})| = O(h^{-1})$ and the summands are nonzero for finitely many $l$, there exists a constant $C_3 > 0$ such that Note further that $B_i(s,$ Furthermore, for each $k = 1, \ldots, M$, the $\{B_{1i,k}(s)\}_{i=1}^n$ are independent, and these random variables are independent of $\{\xi_{ik} : 1 \le k \le M\}_{i=1}^n$ and $\{\varepsilon_{ij} : 1 \le j \le m_i\}_{i=1}^n$. Then we can express $\widehat C(s,t) - E[\widehat C(s,t)]$ as The last two terms in the above expression vanish since $|s-t| > 4(B_K + C_Q)h$. Note that $\max_{1 \le i \le n} w(m_i)$ is bounded. By (70), the $|B_{1i,k}(s)B_{1i,k}(t)| \le C_4^2 h^{-2}$ are bounded for $k = 1, \ldots, M$, and for all $k, k'$, (see Appendix F). Thus, by Bernstein's inequality, and using the condition that $m^2 = o(nh^2/\log n)$, given $\eta > 0$ there exists $c_{1,\eta} > 0$ such that, for sufficiently large $n$ (so that the bound in (72) is $O((mh)^{-2})$), (73). Next, let $A$ be the set of indices $i$ such that $B_{1i,k}(s)B_{1i,k'}(t) \ne 0$ for some $k, k'$, and let $N_n = |A|$. Since for any $k, k'$, $P(B_{1i,k}(s)B_{1i,k'}(t) \ne 0) \le C_5 m^2 h^2$, it follows by another application of Bernstein's inequality that there exist a set $D_n$ (in the sigma field generated by $\{T_{ij}\}$) and a constant $c_{2,\eta} > 0$ such that Therefore we can restrict our attention to the set $D_n$, conditioning on $\mathbf{T}$. We can express $\xi_{A,k} = (\xi_{ik})_{i \in A}$ as $\xi_{A,k} := (R_{AA})^{1/2} \check\xi_{A,k}$, where the random vectors $\check\xi_{A,k}$ have the $N_{N_n}(0, I)$ distribution and are independent for different $k$'s. Then we can write (conditionally on $\mathbf{T}$) where $\Phi(\mathbf{T}) = (R_{AA})^{1/2}\, \mathrm{diag}(w(m_i) B_{1i,k}(s) B_{1i,k'}(t))_{i \in A}\, (R_{AA})^{1/2}$. Observe that by (70) and condition C3, we have $\|\Phi(\mathbf{T})\| \le C_4 \kappa_n h^{-2}$.
Therefore, by an application of Lemma 7.2, we have, for some $c_{3,\eta} > 0$, Very similar arguments can be used to obtain bounds of order $\frac{m \kappa_n \log n}{n h^2}$ (holding with probability at least $1 - O(n^{-\eta})$, for any given $\eta > 0$) for the second, fourth and fifth terms on the RHS of (71). Thus, by the conditions on $\kappa_n$ and $h_n$, we have, for some constant $c_{4,\eta} > 0$, (ignoring the maximum over $h_n^2$ in the definition). Then similar (but somewhat simpler) arguments, now involving Lemma 7.3, show that for some $c_{5,\eta} > 0$, Combining (74) and (75), we obtain the result.

Details of computation of G Q j (·)
We want to give an explicit functional form for $G^Q_j(y)$, for $j = 0, 1$ and any $y \in \mathbb{R}$. Let Then the centered version of the cubic B-spline $Q$ has the form Note that $G^Q_j(y)$ can then be computed by utilizing the fact that, for $j = 0, 1$, where the integrals on the right-hand side are defined to be zero if the corresponding upper limits are less than $-2$. The integrals on the RHS of the above equation can be computed from the representation of $Q(\cdot)$ as follows:
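For reference, one standard normalization of the centered cubic B-spline (support $[-2, 2]$, unit mass) can be evaluated directly; the exact constants in the paper's representation may differ:

```python
import numpy as np

def Q(x):
    """Centered cubic B-spline with support [-2, 2], in one standard
    normalization (the paper's exact constants may differ)."""
    ax = np.abs(np.asarray(x, dtype=float))
    return np.where(ax <= 1, 2 / 3 - ax ** 2 + ax ** 3 / 2,
                    np.where(ax <= 2, (2 - ax) ** 3 / 6, 0.0))

x = np.linspace(-2, 2, 400001)
dx = x[1] - x[0]
print(float(np.sum(Q(x)) * dx), float(Q(0.0)))  # unit mass; Q(0) = 2/3
```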

Details of the calculation of pointwise bias
Performing a Taylor series expansion around $(s, t)$ we get, and First we consider the off-diagonal terms, i.e., we compute $E\widehat C(s,t)$ for $|s - t| > 2Ah$.
$H_\nu(x, s_1, s_2, t_1, t_2)\, (C(s_1, s_2))^{r_1} (C(t_1, t_2))^{r_2}\, ds_1\, ds_2\, dt_1\, dt_2\, dx$ Implicitly using (130)-(132), we also have the bound From Proposition 7.1, the total contribution in (60) of the first terms on the RHS of (61) and (62) becomes Since $H_\nu C \psi_\nu \equiv 0$, it can be checked that the first integral in (86) is $O(h_n^2)$. On the other hand, from the definition of $W^e_{\widetilde h_n}(s,t)$ and the fact that $H_\nu C \psi_\nu \equiv 0$, it follows that the last integral term is $O(h_n)$.
Next, apply $H_\nu$ to the following functions: $W^e_{\widetilde h_n}(s_1,t_1) W^e_{\widetilde h_n}(s_2,t_2) D_2(s_1,s_2,t_1,t_2)$ and $2 W^e_{\widetilde h_n}(s_1,t_1) W^e_{\widetilde h_n}(s_2,t_2) D_3(s_1,s_2,t_1,t_2)$, where $D_2(s_1,s_2,t_1,t_2)$ and $D_3(s_1,s_2,t_1,t_2)$ are the terms given by the sum of the first three terms on the RHS of (64) (including the isolated $O(h_n^2)$ term), and the sum of the first three terms on the RHS of (65), respectively. Then, adding these terms to (86), we have, by (82)-(85) and (132) (for dealing with the isolated $O(h_n^2)$ term in (64)), and the comment following (86), that this sum equals Next, for notational convenience, we write the integral operator $H_\nu$ applied to $Z_j$ (where the $Z_j$ are as in Propositions 7.1-7.2) times $W^e_{\widetilde h_n}(s_1,t_1) W^e_{\widetilde h_n}(s_2,t_2) W^e_{\widetilde h_n}(s_1,t_2)$ as $H_\nu W_{s_1,t_1} W_{s_2,t_2} W_{s_1,t_2} Z_j$, etc.

Proof details for Proposition 5.1
Proof of Proposition 7.1: We need to deal with terms of the form For computational convenience, we also define, First, consider the case $i_1 = i_2 = i$, say. Then, using $\star$ to denote $E_{ii;\, j_1 j_1' j_2 j_2'}$ Next, consider the case $i_1 \ne i_2$. Then, with $\star$ denoting $E_{i_1 i_2;\, j_1 j_1' j_2 j_2'}(s_1, t_1, s_2, t_2;\, l_1, l_1', l_2, l_2')$, respectively, for $A$ satisfying $A \ge 4(B_K + C_Q)$ and $\widetilde h_n = A h_n$. This can be verified by using the definition of $E_{i_1 i_2;\, j_1 j_1' j_2 j_2'}(s_1, t_1, s_2, t_2;\, l_1, l_1', l_2, l_2')$, equations (113), (114), (116)-(119), and arguing as in the analysis of the term (48). Therefore, since $1_{\{|s_k - t_k| \le \frac{A}{2} h\}} W^e_{\widetilde h}(s_k, t_k) = 0$ for $k = 1, 2$, the sums corresponding to either $j_1 = j_1'$ or $j_2 = j_2'$ in (89) and (90) do not contribute anything to (60). Thus, when $i_1 \ne i_2$, the only sum that contributes to (60) corresponds to $j_1 \ne j_1'$, $j_2 \ne j_2'$. When $i_1 = i_2 = i$, the sums that contribute to (60) are the ones corresponding to We consider these cases one by one.
The following lemma gives an expression and the corresponding bound for the term Z 1 .
The following lemma gives expressions and the corresponding bounds for the term Z 2 , Z 3 and Z 4 .
The following lemma gives expressions and the corresponding bounds for the terms Z 5 and Z 6 .
We first consider E(V i 1 (z 1 )V i 2 (z 2 )). Then Lemmas 7.11 and 7.12, stated below, give expressions for the leading term and the term Z 7 (and corresponding bound), respectively, in (64).
Proof of Lemma 7.9 : Follows by arguments analogous to those for deriving (92).
From this, and similar arguments as before, we have, for $|s_1 - t_1| > \frac{Ah}{2}$ and $|s_2 - t_2| \le \frac{Ah}{2}$, with $A \ge 4(B_K + C_Q)$, The last equality follows from the fact that the terms $C(s_1, t_1) + O(h^2)$ appearing in lines four, nine and ten are the same.
Proof of (72)
since the term corresponding to $j_1 = j_1' = j_2 = j_2'$ vanishes. Now, by using the fact that the $T_{ij}$'s are i.i.d., we can simplify each sum on the RHS.
Now, using the facts that E(

Computation of conditional mixed moments
The computation of the moments is done by using the Wick formula (Lemma 7.4). We consider all the different generic cases below:
• Case: . Therefore, by (42),
• Case:
• Case:
• Case: . (117)
• Case: $i_1 = i_2$, $j_1 = j_1' \ne j_2 = j_2'$ (equivalent to $i_1 = i_2$, $j_1 = j_2 \ne j_1' = j_2'$; and $i_1 = i_2$, $j_1 = j_2' \ne j_1' = j_2$): In this case $E(Y_{i_1 j_1} Y_{i_1 j_1'} Y_{i_2 j_2} Y_{i_2 j_2'} \mid T_{i_1}, T_{i_2}) = (C(T_{i_1 j_1}, T_{i_1 j_1}) + \sigma^2)(C(T_{i_1 j_2}, T_{i_1 j_2}) + \sigma^2) + 2 (C(T_{i_1 j_1}, T_{i_1 j_2}))^2$. (118)
• Case: $i_1 = i_2$, $j_1 = j_1' = j_2 = j_2'$: In this case
Computation of unconditional mixed moments (off-diagonal part)
Here, we obtain simplified forms of certain expectations that are used in the proofs of Propositions 7.3 and 7.4. Observe that, based on our calculations in Appendix A, we only need to compute expectations of the form Notice that, when the pairs $(T_{i_1 j_1}, T_{i_1' j_1'})$ and $(T_{i_2 j_2}, T_{i_2' j_2'})$ are independent, the expectation in (120) factorizes as Each individual term is exactly of the same form encountered while calculating the bias of our estimate. The expectations appearing above are of the form For other terms we need to evaluate or approximate various other integrals. The general forms of these integrals are given below, for $1 \le l, l', m, m' \le L_n$ and $s, s', t, t' \in [0, 1]$.
Some error bounds involving the Dirac-$\delta$
Here, we provide some key estimates that are crucial to obtaining the overall risk bound. They all involve the operator $H_\nu$. Owing to the decomposition (55), we can reduce the computation of these bounds to integrals involving $\{\psi_k(\cdot)\}_{k=1}^{M}$ and $\delta(\cdot,\cdot)$. Throughout, we assume that $R(s_1, s_2, t_1, t_2)$ is a "nice" function satisfying certain (boundedness) conditions. Then,