Kernel estimates of nonparametric functional autoregression models and their bootstrap approximation

This paper considers a nonparametric functional autoregression model of order one. Existing contributions addressing the problem of functional time series prediction have focused on the linear model and literatures are rather lacking in the context of nonlinear functional time series. In our nonparametric setting, we define the functional version of kernel estimator for the autoregressive operator and develop its asymptotic theory under the assumption of a strong mixing condition on the sample. The results are general in the sense that high-order autoregression can be naturally written as a first-order AR model. In addition, a component-wise bootstrap procedure is proposed that can be used for estimating the distribution of the kernel estimation and its asymptotic validity is theoretically justified. The bootstrap procedure is implemented to construct prediction regions that achieve good coverage rate. A supporting simulation study is presented in the end to illustrate the theoretical advances in the paper. MSC 2010 subject classifications: 62G08, 62G09, 37M10, 60G25.


Introduction
Popularized by the pioneering works of Ramsay and Silverman (1997) [29], (2002) [30], Functional Data Analysis (FDA) has emerged as a promising field of statistical research in the past decade. When functional data objects being collected sequentially over time that exhibit forms of dependence, such data are known as functional time series. The typical situation in which functional time series arise is when long continuous records of temporal sequence are segmented into curves over natural consecutive time intervals. Examples include daily price curves of financial transactions, daily electricity consumptions and daily patterns of environmental data. The primary goal of functional time series analysis is to provide reliable guesses for the future realizations. In this paper, we focus our attention on a first-order nonparametric functional autoregression-FAR(1) model which is defined by the recursion: where the observations X n and the error terms E n are functions, and no linearity restrictions are imposed on the functional operator Ψ. Precise definitions and details of the model are stated in Section 2. Existing contributions have largely focused on the functional linear autoregression while the research addressing the nonlinear model is scarce. We approach this problem by merging the ideas in nonparametric time series and functional regression analysis, extending the theoretical study to the nonparametric model of functional autoregression. The research pertains to the FAR model can trace back to Bosq (2000) [4], in which the theory of linear processes in functional space was first developed. One of the major contribution of that book was the study of linear autoregressive processes in the Hilbert space. Under the assumption that the functional operator Ψ in (1.1) is linear, Bosq has derived a one-step ahead predictorΨ based on a functional form of the Yule-Walker equation, which has been regarded as the classical benchmark in FAR(1) prediction. Since then, there has been abundant literatures on the study of the linear functional processes. We refer the readers Antoniadis and Sapatinas (2003) [2], Antoniadis et al. (2006) [1], Bosq (2007) [5], Kargin and Onatski (2008) [18], Gabrys et al. (2010) [14] and Horváth and Kokoszka (2011) [17], among other contributions. Bosq's predictor in [4] has a rather complicated form which makes it unrealistic to implement in practice. Aue et al. (2015) [3] proposed an alternative method of predicting linear FAR(1) process utilizing functional principal component analysis (FPCA). The method appears to be much more widely applicable under the idea that the dimension reduction with FPCA should lead to a vector-valued time series of FPC scores that can be predicted by existing multivariate methodologies. Hörmann et al. (2015) [15] proposed a dynamic version of functional principal component analysis to address the problem of dimension reduction for functional time series. More recently,   [20] used a similar dimension reduction technique to model the FARMA process with an application to traffic data. See also   [19] in which an innovation algorithm was proposed to obtain the best linear predictor of a functional moving average (FMA) process.
On the other hand, kernel methods have been a powerful tool when dealing with nonparametric models. Numerous early references have investigated its implementations in nonparametric univariate autoregression. To mention some, asymptotic study of the kernel smoother was presented in Robinson (1983) [32], and Masry (1996) [22]. The bootstrap procedures for this model and its validity were provided in Franke et al. (2002) [12]. Pan and Politis (2016) [25] developed a coherent methodology for the construction of bootstrap prediction intervals, which can be successfully applied to the nonlinear univariate autoregression models. Those results can be naturally extended to multivariate time series, but that is not the case for functional time series due to the infinite dimensional nature of functional data.
Nonparametric statistical methods for functional data analysis were established in Ferraty and Vieu (2006) [11]. The nonparametric functional regression model, i.e. Y = r(X ) + ε, has been extensively studied since then. Ferraty et al. (2007) [10] concentrated on the situation where the response variable Y is scalar and X takes values in some functional space. Asymptotic properties concerning the kernel estimatorr(·) of the regressor r(·) have been investigated and the validity of its bootstrap approximation was proved in Ferraty et al. (2010) [8]. The results have been extended to the model with double functional setting (i.e. both Y and X are functionals); see Ferraty et al. (2012) [9]. Masry (2005) [23] and Delsol (2009) [6] investigated the same model taking into account dependent functional data.
Motivated by the prior works aforementioned, this paper investigates the kernel estimator for nonparametric functional autoregression. We show the consistency of the estimator under the assumption of a strong mixing condition on the sample. The proof of its consistency involves a functional central limit theorem for dependent sequence in a triangular array setting. In addition, we propose a bootstrapping procedure in this functional dependent framework that can be used for estimating the distribution of the projection of the kernel estimation. The bootstrap prediction regions are constructed as a measurement of accuracy for the functional prediction. A regression bootstrap scheme is implemented in the procedure which provides a simplification for the bootstrap method in the autoregression case. Franke et al. (2002) [12] first applied the regression-type bootstrap in univariate nonlinear autoregression for inference of the kernel estimator, and Neumann and Kreiss (1998) [24] considered to what extent regression-type bootstrap procedures can be successfully applied as long as nonparametric estimators and tests for conditional mean in nonparametric autoregressions are considered. It was mentioned in Kreiss and Lahiri (2012) [21] that the regression-type bootstrap is also valid for the Yule-Walker esti-mates for coefficients in a parametric AR(p) model, but it might not lead to asymptotically valid results for more general statistics.
The rest of the paper is organized as follows. In Section 2, the detailed mathematical background of the model is provided and the functional version kernel estimator of the autoregressive operator is defined. Some notations and necessary assumptions are stated in Section 3. Section 4 provides the asymptotic results of the proposed estimator. A componentwise bootstrap scheme is introduced and its validity is shown in Section 5. A simulated study is given in Section 6 while Section 7 presents the approach to construct the bootstrap prediction region. All proofs are gathered in Appendix (Section 8).

The FAR(1) model
Let {X n } be a stationary and α-mixing functional sequence in some separable Hilbert space H with the usual definition of α-mixing coefficients introduced by Rosenblatt (1956) [33]. H is endowed with inner product ·, · and corresponding norm || · || (i.e. ||g|| 2 = g, g ), and with orthonormal basis {e k : k = 1, · · · , ∞}. A semi-metric d(·, ·) is also defined on H to measure the proximity between two elements in H. The semi-metric structure d will be the key tool for controlling the good behavior of the estimators whereas some separable Hilbert structure is necessary for studying the operator Ψ component by component. See more details on the two-topology framework in Ferraty et al. (2012) [9].
We consider the following FAR(1) model Assume here that the model is homoscedastic, that is, σ E (X i ) ≡ σ E . The operator Ψ is not constrained to be linear; this is a Nonparametric Functional Autoregression model. Remark 2.1. Because of the generality of the notation, a higher-order autoregression, say FAR(2), in which X i+1 depends on X i and X i−1 , can still be written as FAR(1) by redefining the X and Ψ; e.g. in the FAR(2) case, one may let Y i = (X i , X i−1 ) with an obvious choice for the FAR(1) operator relating Y i+1 to Y i only.
Estimation of Ψ is given by the functional version of Nadaraya-Watson estimator of time seriesΨ where χ is a fixed element in H, K is a kernel function and h is a bandwidth sequence tending to zero as n tends to infinity. The way of choosing a semimetric d(·, ·) was discussed in Ferraty and Vieu (2006) [11]. For a fixed k ∈ Z + , applying ·, e k on both sides of the Eq. (2.1) yields Let X n,k , ε n,k be the jth component of the functional X n and E n respectively, i.e. X n,k = X n , e k , ε n,k = E n , e k . Also, define the functional ψ k from H to R such that ψ k (·) = Ψ(·), e k .
When k is fixed, we will drop the index k for the simplicity of notations, using {X n } and {ε n } to denote the sequences {X n,k } and {ε n,k }, respectively. Similarly, ψ can be used in place of ψ k . For a fixed k, we obtain Eq. (2.4) can be treated as an auxiliary functional autoregressive model with scalar response. The scalar innovations ε i 's are i.i.d. random variables satisfy Again, the operator ψ in (2.4) is not constrained to be linear. Accordingly, its kernel estimation is given bŷ , (2.5) and the two estimators,Ψ h andψ h , are connected in the following waŷ Consistency of bothψ h (χ) andΨ h (χ) will be addressed in Section 4. While it is more of the interest to study the estimatorΨ h (χ), the need for model (2.4) and the estimatorψ h (χ) will be seen in Section 5 where a componentwise bootstrap approximation is proposed.

Assumptions and notations
In the sequel, χ is a fixed element and X is a random element of the functional space H. For k = 1, 2, . . . , let ϕ χ,k be a real-valued function defined as Again, for the simplicity of notations, we drop the index k to use ϕ χ in place of ϕ χ,k when k is fixed. Let F be the distribution of the random variable d(X , χ), which is usually called the small ball probability function in functional data analysis. Also define for s ∈ [0, 1], Technical aspects of the functions ϕ χ , F χ and τ hχ have been discussed in (A1)-(A5) are the standard assumptions inherited from those introduced in the independent case in the setting of nonparametric functional regression. For our autoregressive model, additional assumptions (A6)-(A10) below are necessary: where p and a are defined in (A6) and (A8) respectively, and where p and a are defined in (A6) and (A8), respectively. These assumptions enable us to obtain asymptotic results of the estimator in our autoregressive model, which has a formal resemblance to the analogous regression model. In particular, (A8)-(A10) is a set of conditions on the mixing coefficients, and (A8) is the so-called arithmetically α-mixing condition which is typically satisfied in the case of an autoregressive model. From a theoretical point of view, a more general set of conditions on the mixing coefficients ((H1)-(H2) in Delsol (2009) [6]) is available, see the details therein.
The semi-metric d will act on the asymptotic behavior of the estimator through ϕ χ , F χ and τ hχ , and the following quantities:

Consistency of estimatorψ h (χ)
First, we have the following point-wise asymptotic results for the estimator ψ h (χ): ).  [10]. See details of the proof in Appendix. Combining 4.1 and 4.2, we immediately obtain the following corollary: The pointwise asymptotic normality is given in Theorem 4.3 below: Proof. See Appendix.
The bias term can be ignored under the following additional assumption:

Consistency of estimatorΨ h (χ)
Corollary 4.1, together with Eq. (2.6), implies the following componentwise consistency ofΨ h : However, (4.6) does not guarantee the consistency of the estimatorΨ h in an infinite-dimensional space. A more general consistency result is desired. We consider the following regularity conditions: (C1) For each k ≥ 1, ψ k is continuous in a neighborhood of χ with respect to semi-metric d, and F χ (0) = 0.
Remark 4.1. To prove Theorem 4.4, we need a functional central limit theorem for dependent sequence in a triangular array setting such as Theorem 2.3 in Politis and Romano (1994) [28]. See Appendix for details of the proof. The assumptions (i)-(iii) show a tradeoff between the moment assumptions and the mixing conditions. The conditions on mixing coefficients can be less stringent if higher moments are assumed. The parameter δ controls the moment while δ controls the mixing condition and they can be chosen under the constraint (i), for example, δ = 0.5 and δ = 5.

Bootstrap approximation
Ferraty et al. (2010,2012) [8,9] have employed both naive and wild bootstrap to approximate the asymptotic distribution of the kernel estimators for nonparametric functional regressions. Their first result showed the validity of bootstrap when the explanatory variable is functional and the response is real. To extend the bootstrap approach to the double functional setting, i.e. when both variables are functional, they introduced the notion of "componentwise bootstrap", in which the idea is to show that the bootstrap approximation has good theoretical behaviors when functionals are projected to a fixed basis element e k .
Here we take advantage of the auxiliary univariate model in (2.4), extending this componentwise bootstrap idea to the functional autoregression. First, we propose a bootstrap procedure to approximate the distribution of ψ h (χ) − ψ(χ) under the AR model (2.4), which consists of the following steps: .
From a theoretical point of view, the second smoothing parameter b has to be asymptotically larger than h (see condition (D5)), so over-smoothing is needed to make the bootstrap procedure work, as is the case in the functional regression. However, the two bandwidths have to be fixed in practice and a cross-validation procedure is used to determine the bandwidths in the simulation study in Section 6.

Remark 5.2.
It is also worth being noted that instead of generating a real bootstrap sequence, here we are actually generating a scatter plot. Every bootstrap point X * i+1 is generated by the prior true point X i -see step 3 above-and the desired estimation in the bootstrap world comes from fitting the pairs (X * i+1 , X i ). This is so called regression bootstrap scheme, introduced in Franke et al. (2002) [12] as one of the three bootstrap schemes proposed for the scalar-valued nonlinear autoregression. The main reason we use the regression bootstrap scheme here is in the fact that when conditioning on the sample {X 1 , . . . , X n }, it eliminates the random element in the denominator ofψ * hb (χ)-see Eq. (5.1)which makes the proof of Theorem 5.1 proceed (see details in appendix). The regression-type bootstrap is considered as an important simplification for the bootstrap method in autoregression. We refer the readers Neumann and Kreiss (1998) [24] for its applications in nonparametric autoregressions, and Kreiss and Lahiri (2012) [21] for its extensions in parametric time series models.
Theorem 5.1 shows the validity of the bootstrap procedure forψ h . Then, the bootstrap procedure forΨ h is proposed as follows: .
Proof. Choosing a basis with e 1 = η, this theorem is a direct consequence of Theorem 5.1.

Simulations
In this section, the theoretical results of the previous sections are illustrated through a simulation study. First we provide the details of the process of simulating a FAR(1) series. We use a linear functional series here since its stationarity can be guaranteed by existing theories. The performances of the kernel estimation and the behaviors of the bootstrap procedure will be shown through the experiments performed on the simulated data.
Data Generating Process. The simulated realization of a linear FAR(1) series has been discussed in Didericksen (2012) [7]. Curves in the series are assumed to be elements of the Hilbert space L 2 [0, 1]. The linear operator Ψ(X ) = 1 0 ψ(s, t)X (s)ds is acted on the functions X i 's, thereby the series are generated according to the model X n+1 (t) = 1 0 ψ(t, s)X n (s)ds + E n+1 (t). (6.1) We use the kernel ψ(s, t) = C · s1{s ≤ t}, such that (6.1) becomes Here, C is a normalizing constant to be chosen such that ||Ψ|| < 1, which ensures the existence of a stationary causal solution to FAR(1) model; see Bosq (2000) [4]. We pick C = 3, such that ||Ψ|| = 0.5. We use the Brownian bridge process as the error process E(t) (see Didericksen (2012) [7]), which is defined by where W (·) is the standard Wiener process where Z k 's are standard independent normal and Z 0 = 0. The interval [0, 1] is equally partitioned such that 0 = t 1 < t 2 < · · · < t 99 < t p = 1 with p = 100. We choose the initial curve X 1 = cos(t) for t ∈ [0, 1], and build the series X 1 , . . . , X 250 according to the following scheme for j = 1, . . . , 100: a cross-validation procedure. A projection-based semi-metric is considered: where v k,n , k = 1, . . . , J are orthonormal eigenfunctions associated with the largest J eigenvalues of the empirical covariance operator of the learning sample: To estimate f * k,χ , we use the bootstrap procedure as described in Section 5: , v k,n . The bandwidth h = h CV is selected through a cross-validation procedure and we set b = h, which was the same setting used in the context of functional regression in Ferraty et al. (2010) [8]. They have studied the influence of both bandwidths on the behavior of the bootstrap procedure by varying b and h around h CV . Their simulation showed the bootstrap works well for any combination of b and h, which leads to the conclusion that bootstrap results are not sensitive to the bandwidth choice. Figure 3 presents the comparisons between the estimated f * k,χ and the estimated f true k,χ for the first four components, at the curves χ = X 201 , . . . , X 205 . The density estimation is performed with Gaussian kernel and the bandwidth being chosen three times the rule-of-thumb bandwidth estimator (1.06σn −1/5 ) whereσ is the sample standard deviation.

Construction of bootstrap prediction regions
Politis (2013) [26] constructed bootstrap prediction intervals in regression based on a bootstrap approximation to the distribution of the error in prediction-also called a 'predictive root'. This method was later extended to autoregression by Pan and Politis (2016) [25] who constructed bootstrap prediction intervals in the setting of a (univariate) nonparametric autoregression. Given a valid resampling procedure as developed in Section 5, we are able to construct bootstrap prediction regions in the functional autoregression model in the spirit of Pan and Politis (2016) [25]. Two bootstrap ideas, forward and backward, were considered in Pan and Politis (2016) [25] for the construction of prediction intervals with conditional validity. It is unrealistic, however, to generate the bootstrap pseudo-data backward in the functional setting. Therefore, we restrict to forward bootstrap scheme in our functional model. (7.1)

Compute the fitted residuals:Ê
where b is a second smoothing parameter. 3. Center the residuals: (a) Let the empirical distribution of r i,b be denoted byF n , and draw bootstrap pseudo-residuals E * 1 , . . . , E * n i.i.d. fromF n . (b) Use regression bootstrap scheme to generate the pseudo-data: where || · || is a norm of the practitioner's choice; note that difference choices for the norm will lead to prediction regions of different shape.
Analogously to Pan and Politis (2016) [25], we also consider using predictive, as opposed to fitted, residuals for the bootstrap. We define the predictive residuals asÊ is calculated from the delete-X t data set, i.e. the available data for the scatterplot of X k vs. X k−1 over which the fitting takes place excludes the single point that corresponds to k = t. The forward bootstrap with predictive residuals is similar to Algorithm 7.1 except for Step 2.
Then compute the predictive residuals:Ê t,b in Algorithm 7.1; the remaining steps are the same. Remark 7.1. Recall that prediction intervals are asymptotically valid (in a conditional sense) when the probability of coverage of the future value X n+1 conditional on the observed data X 1 , . . ., X n gets close to the nominal one as n increases. The prediction regions constructed in Algorithm 7.1 and 7.2 would indeed be asymptotically valid if the predictive root X n+1 −X n+1 and the bootstrap predictive root X * n+1 −X * n+1 have the same distribution asymptotically. For the predictive root, we have where E n+1 is independent of the estimation error A. Similarly, the bootstrap predictive root can be written as Consequently, the prediction regions of Algorithm 7.1 and 7.2 would be asymptotically valid provided that the distribution of the true errors E 1 , . . . , E n is captured in the limit by the empirical distribution of the residuals (fitted or predictive). → 0. Note that the condition ||Ê 1,b || d → ||E 1 || would follow ifF n , the empirical distribution of the residuals, were shown to converge weakly to the distribution of the true errors as shown e.g. by Franke and Nyarige (2016) [13] in the case of a linear operator Ψ. However, the importance of Corollary 7.1 is limited: as discussed in Politis (2015) [27]-asymptotic validity of prediction regions does not tell the whole story since the variability of estimation-which is asymptotically negligible-does not enter the picture. A prediction region will have good finite sample coverage only if the method employed is able to capture the variability of estimation error; that is why Algorithms 7.1 and 7.2 employ the bootstrap for prediction intervals instead of just relying on the empirical quantiles ofF n . Indeed, by construction, our model-based bootstrap procedure is capable of approximating the distribution of the re-scaled estimation error nF χ (h)A by that of nF χ (h)A * due to Theorem 5.2. Hence, our bootstrap prediction regions should have good finite sample coverage which is something that can not be captured by the property of asymptotic validity.
As mentioned in the introduction, Aue et al. (2015) [3] proposed a method of predicting FAR(1) using functional principal component analysis (FPCA). In addition, they proposed an algorithm for computing the prediction regions to assess the prediction accuracy. That is the only existing method we could find on the construction of prediction regions for functional time series. In the next subsection, we are going to compare it with the algorithm we proposed.

Monte Carlo studies
In this subsection, we use the Monte Carlo simulations to evaluate the quality of the bootstrap prediction regions with both fitted and predictive residuals. Simulations are performed on the model from the previous section; see Eq. (6.2). 500 'true' datasets each with sample size n are generated and n varies from 100 to 400. For the ith true dataset, we use one of the bootstrap methods to create B = 1000 bootstrap sample paths (step 4 of the algorithms), and construct the prediction region.
To assess the corresponding coverage level (CV R) of the constructed region, we also generate 1000 one-step ahead predictions Y n+1,j =Ψ i (X n,i ) for j = 1, 2, ...1000 whereΨ i is the estimate from the ith 'true' dataset, X n,i is the ith dataset's last data and E j is randomly drawn from the error process (6.3). Then the empirical coverage level from the ith dataset is given by where 1(·) is the indicator function and β is the nominal coverage level. The coverage level for bootstrap methods is calculated by the average {CV R i } over the 500 'true' datasets, i.e.
Note that the last observation X n,i is different for each dataset; hence the coverage CV R represents an unconditional coverage probability. Prediction regions are constructed with nominal coverage levels of 95% and 90%. Five different norms in the functional space are selected to measure the proximity between functions. The first three are the common l p norms in the functional space: We also consider two pointwise measures:  where x and y are functions in L 2 [0, 1] and e 1 is the eigenfunction associated with the largest eigenvalue of the sample covariance operator. Table 1 presents the empirical coverage rate of the prediction region we construct. The results are promising as the empirical coverage rates pretty well match the nominal coverage rate even with quite small sample size. As expected, when using predictive residuals, the coverage rate is a bit higher compared to fitted residuals. For the comparison purpose, we apply the algorithm of Aue et al. (2015) [3] on the functional data we generate. Table 2 presents the coverage rate of their constructed prediction region. The value of p represents the number of components to be kept in the prediction algorithm; see Aue et al. (2015) [3] for more details. The experiment is also done on three different sample size, 100, 200 and 400. As is shown in the table, their prediction regions fail to achieve the ideal coverage rate when sample size is small-see the case n = 100-and results get better when sample size grows. This phenomenon is not surprising, and explainable. Recall that there are two constituents in the expression of predictive root X n+1 −X n+1 , the estimation error A and the true error E. The bootstrap predictive root in our algorithm is capable of capturing the distribution of both errors by A * and E * . However, the method of Aue et al. (2015) [3] does not attempt to capture the variability of estimation error and thus, as expected, it yields coverages less than nominal (undercoverage). Coverage rate improves under larger sample size since the estimation error diminishes as n grows. As a conclusion, we can say that the prediction regions constructed through our bootstrap procedure can achieve better finite sample coverage.

Appendix: Proofs
Throughout this section, given some random variable U , P U stands for the probability measure induced by U . Since the stationarity of the sequence {X n }, we assume X i s are identically distributed as X . Also, the pairs (X i+1 , X i ) have the same joint distribution. So are the functions of them. The kernel estimator ψ h will be decomposed as follows: Similarly, the estimatorΨ h can be decomposed as:

Proof of Theorem 4.1, 4.2 and 4.3
To prove these theorems, we need the following lemmas: Proof. For the first assertion, we have the third line coming from Eq. (14) in Ferraty et al. (2007) [10] and the fifth line follows from the uniform boundedness of the integrand. The second assertion is proved as follows: where the fifth line follows from the continuity of ψ with respect to the semimetric d.
Proof of Theorem 4.1. For the proof of Eq. (4.1), we write the following decomposition: with and The last two terms on the right-hand side of Eq.  [31]). For the first term on the right-hand side of (8.6), we calculate which follows from the stationarity of the sequence {X n }. Assume X i ∼ X for i = 1, · · · , ∞, then we can write the numerator in (8.9) as follows here P U stands for the probability measure induced by U , and the last line comes form the first order Taylor's expansion for ϕ around 0. For the denominator in (8.9), we have Consequently, which completes the proof.
Proof of Theorem 4.2. We use the following decomposition: ). Proof of Theorem 4.3. Since the sequence (X i , X i+1 ) i is stationary and strong mixing, this theorem is the same as that obtained in Delsol (2009) [6].

Proof of Theorem 4.4
Proof. Consider the expression Following similar arguments as in the proof of Theorem 4.1 in Ferraty et al.
(2012) [9], we have that the above expression has the same asymptotic distribution as where for 1 ≤ i ≤ n, By assumption (i), we can apply Hölder's inequality to obtain In the above expression, E||X i+1 − Ψ(χ)|| 2+δ 2+δ 2+δ is finite because of assumption (ii). For the last item, we note that is a constant depending on K(·), the right-hand side is bounded by CF χ (h) for a suitable C and all small enough h. Now we have, E||Z n,i || 2+δ ≤ C < ∞ for all n. Combining this with assumption (iii), it follows from Theorem 2.3(i) in Politis and Romano (1994) [28] that