Optimal Prediction for Additive Function-on-Function Regression

As with classic statistics, functional regression models are invaluable in the analysis of functional data. While there are now extensive tools with accompanying theory available for linear models, there is still a great deal of work to be done concerning nonlinear models for functional data. In this work we consider the Additive Function-on-Function Regression model, a type of nonlinear model that uses an additive relationship between the functional outcome and functional covariate. We present an estimation methodology built upon Reproducing Kernel Hilbert Spaces, and establish optimal rates of convergence for our estimates in terms of prediction error. We also discuss computational challenges that arise with such complex models, developing a representer theorem for our estimate as well as a more practical and computationally efficient approximation. Simulations and an application to Cumulative Intraday Returns around the 2008 financial crisis are also provided.


Introduction
Functional data analysis (FDA) concerns the statistical analysis of data where one of the variables of interest is a function. FDA has seen rapidly increasing interest over the last few decades and has successfully been applied to a variety of fields, including economics, finance, the geosciences, and the health sciences. One of the most fundamental tools in statistics is linear regression; as such, it has been a major area of research in FDA. While the literature is too vast to cover here, we refer readers to Ramsay and Silverman (2006); Ramsay et al. (2009); Horváth and Kokoszka (2012); Kokoszka and Reimherr (2017), which provide introductions to FDA, as well as Morris (2015), which provides a broad overview of methods for functional linear regression.
A major challenge of functional regression is handling functional predictors. At least conceptually, a functional predictor means having a large number (theoretically infinite) of predictors that are all highly correlated. To handle such a setting, certain regularity conditions are imposed to make the problem tractable. Most of these conditions are directly or indirectly related to the smoothness of the parameter being estimated. However, the convergence rates of the resulting estimators then depend heavily on these assumptions, and the rates are not parametric when the predictor is infinite dimensional.
One of the most well-studied models in FDA is the functional linear model. Commonly, one distinguishes between function-on-scalar, scalar-on-function, and function-on-function regression when discussing such models, with the first term denoting the type of response and the second the type of covariate. The convergence rates for function-on-scalar regression are usually much faster than for scalar-on-function or function-on-function regression. Methodological, theoretical, and computational issues related to functional linear models are now well understood. More recently, there has been growing interest in developing nonlinear regression models. While it is natural to examine nonlinear models after establishing the framework for linear ones, there is also a practical need for such models: functional data may contain complicated temporal dynamics that exhibit nonlinear patterns not well modeled under linearity; Fan et al. (2015) examine this issue in depth.
Nonlinear regression methods for FDA have received a fair amount of attention in the scalar-on-function setting, while function-on-function regression models, where the relationship between the response and covariates is believed to be nonlinear, have received considerably less attention. Concerning nonlinear scalar-on-function regression, James and Silverman (2005) introduced a functional single index model, where the outcome is related to a linear functional of the predictor through a nonlinear transformation. This work was later extended by Fan et al. (2015) to allow for a potentially high-dimensional number of functional predictors. Preda (2007) explored fitting a fully nonlinear model using reproducing kernel Hilbert spaces (RKHS). In contrast, Müller et al. (2013) simplified the form of the nonlinear relationship by introducing the functional additive model, which combines ideas from functional linear models and scalar additive models (Hastie and Tibshirani, 1990). Optimal convergence rates for the functional additive model were then established by Wang and Ruppert (2015), generalizing the work of Cai and Yuan (2012) in the linear case. An alternative to the functional additive model was given by Zhu et al. (2014), who first expressed the functional predictor using functional principal components analysis (FPCA) and then built an additive model between the outcome and the scores. Extensions to generalized linear models can be found in McLean et al. (2014); Du and Wang (2014).
Moving to function-on-function regression, Lian (2007) extended the work of Preda (2007) to functional outcomes, which was then also considered in Kadri et al. (2010). Most relevant to the present paper is the work of Scheipl et al. (2015), who extended the work of Müller et al. (2013) by introducing an additive model for function-on-function regression. They used a general trivariate tensor product basis approach for estimation, which allowed them to rely on gam from the mgcv package in R to carry out the computation, as implemented in the refund package. Ma and Zhu (2016), examining the same model, considered a binning estimation technique combined with FPCA. They were able to prove convergence of their estimators, but made no mention of optimality and required a large number of assumptions that are challenging to interpret. Another estimation technique was examined in Kim et al. (2018), similar to the trivariate tensor product approach of Scheipl et al. (2015), but with two of the bases explicitly assumed to be orthogonal B-splines and the third coming from an FPCA expansion. However, as with Scheipl et al. (2015), no theoretical justification was provided. Lastly, in very recent work, Sun et al. (2017) considered an RKHS framework for estimating a function-on-function linear model. Extending the work of Cai and Yuan (2012), they were able to establish the optimality of their procedure. Our work can be viewed as extending this line of research to nonlinear relationships via a function-on-function additive model.
The goal of this work is to develop a penalized regression framework based on reproducing kernel Hilbert spaces (RKHS) for fitting the additive function-on-function regression model, AFFR (Scheipl et al., 2015). A major contribution of this work is to establish optimal convergence rates for our estimators in terms of prediction error, and to show that this rate is the same as in the scalar outcome setting (Wang and Ruppert, 2015). We also discuss computational aspects of our approach, as the RKHS structure allows for fairly efficient computation compared to the trivariate tensor product bases that have been used previously. Background and the model are introduced in Section 2. Computation is discussed in Section 3, while theory is presented in Section 4. We conclude with a numeric study consisting of simulations and an application to financial data.

Model and Background
We assume that we observe i.i.d. pairs {(X_i(t), Y_i(t)) : i = 1, . . . , n, t ∈ [0, 1]}. The functions could be observed on other intervals, but any closed and bounded interval can be rescaled to [0, 1], so it is common in FDA to work on the unit interval. Both the outcome, Y_i(t), and the covariate, X_i(t), are assumed to be completely observed functions, a practice sometimes referred to as dense functional data analysis (Kokoszka and Reimherr, 2017); practically, this means that curve reconstruction contributes a comparatively small amount of uncertainty to the final parameter estimates. More rigorous definitions can be found in Cai and Yuan (2011); Li et al. (2010); Zhang et al. (2016). For sparsely observed curves, it is usually better to use more tailored approaches such as PACE (Yao et al., 2005), FACE (Xiao et al., 2017), or MISFIT (Petrovich et al., 2018).
The additive function-on-function regression model is defined as
$$Y_i(t) = \int_0^1 g(t, s, X_i(s))\, ds + \varepsilon_i(t).$$
We assume that the functions $X_i$, $\varepsilon_i$, and $Y_i$ are elements of $L^2[0,1]$, which is a real separable Hilbert space. The trivariate function $g(t,s,x)$ is assumed to be an element of an RKHS, $K$.
Recall that an RKHS is a Hilbert space that possesses the reproducing property. Namely, we assume that $K$ is a Hilbert space of functions from $[0,1]\times[0,1]\times\mathbb{R} \to \mathbb{R}$, and that there exists a kernel function $k(t,s,x,t',s',x') = k_{t,s,x}(t',s',x')$ that satisfies
$$f(t,s,x) = \langle k_{t,s,x}, f\rangle_K, \quad \text{for any } f \in K.$$
There is a one-to-one correspondence between $K$ and $k$; thus choosing the kernel function completely determines the resulting RKHS. The functions in $K$ inherit properties from $k$; in particular, one can choose $k$ so that the functions in $K$ possess some number of derivatives, or satisfy some boundary conditions. In addition, many Sobolev spaces, which are commonly used to enforce smoothness conditions, are also RKHSs. We refer an interested reader to Berlinet and Thomas-Agnan (2011) for further details.
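As a concrete, purely illustrative sketch, the snippet below builds one valid choice of $k$: a Gaussian product kernel on $[0,1]\times[0,1]\times\mathbb{R}$ with range parameter $\delta$, and checks the symmetry and positive semi-definiteness that a reproducing kernel must exhibit on any finite set of points. The specific kernel and parameter value are assumptions for illustration (the paper's own computations were carried out in R).

```python
import numpy as np

# Illustrative reproducing kernel on [0,1] x [0,1] x R: a Gaussian product
# kernel with range parameter delta (an assumption; any symmetric, positive
# definite, square-integrable kernel would do).
def k(p, q, delta=0.5):
    # p and q are triples (t, s, x)
    return np.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / delta ** 2)

# A valid reproducing kernel yields symmetric, positive semi-definite Gram
# matrices over any finite collection of points.
rng = np.random.default_rng(0)
pts = [tuple(rng.uniform(0, 1, 3)) for _ in range(20)]
G = np.array([[k(p, q) for q in pts] for p in pts])
assert np.allclose(G, G.T)                     # symmetry
assert np.linalg.eigvalsh(G).min() > -1e-8     # no negative eigenvalues
```

Choosing a different kernel (e.g., the exponential kernel used later in the simulations) changes the smoothness of the functions in $K$ but not this basic check.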
We propose to estimate $g$ by minimizing the following penalized objective:
$$RSS_\lambda(g) = \frac{1}{n}\sum_{i=1}^n \int_0^1 \Big(Y_i(t) - \int_0^1 g(t,s,X_i(s))\,ds\Big)^2 dt + \lambda \|g\|_K^2, \qquad (1)$$
i.e., $\hat g = \arg\inf_{g\in K} RSS_\lambda(g)$, where $\lambda > 0$. As we will see in the next section, an explicit solution to this minimization problem exists due to the reproducing property. However, we will also discuss using FPCA to help reduce the computational burden.

Computation
One of the benefits of using RKHS methods is that one can often obtain an exact solution to the corresponding minimization problem, such as the one in (1), due to the representer theorem (Kimeldorf and Wahba, 1971). This also turns out to be the case here; however, later on we will discuss a slightly modified version that still works well and is easier to compute. The expression we derive is quite a bit simpler than the analogs derived in Cai and Yuan (2012); Wang and Ruppert (2015); Sun et al. (2017). This is partly due to our use of functional principal components, which simplify the expression and also provide an avenue for reducing the computational complexity of the problem, and partly due to our use of the RKHS norm penalty when fitting the model (whereas others used a more general penalty term).
Using the reproducing property we have $\langle k_{t,s,X_i(s)}, g\rangle_K = g(t,s,X_i(s))$ for $i = 1, 2, \dots, n$.
We then have that
$$\int_0^1 g(t,s,X_i(s))\,ds = \int_0^1 \langle g, k_{t,s,X_i(s)}\rangle_K\,ds = \Big\langle g, \int_0^1 k_{t,s,X_i(s)}\,ds\Big\rangle_K, \qquad (2)$$
which is justified by the integrability constraints inherent in Assumption 1(iii), discussed in the next section. Let $\hat v_1, \hat v_2, \dots, \hat v_n$ denote the empirical functional principal components, EFPCs, of $Y_1, Y_2, \dots, Y_n$. Then, assuming the $Y_i$ are centered, it is a basic fact of PCA that $\mathrm{span}\{\hat v_1, \dots, \hat v_n\} = \mathrm{span}\{Y_1, \dots, Y_n\}$. Recall that it is also a basic fact from linear algebra that $\hat v_1, \dots, \hat v_n$ can be completed to form a full orthonormal basis (all of the additional functions will have an empirical eigenvalue of 0). We then apply Parseval's identity to obtain
$$RSS_\lambda(g) = \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^\infty \Big(Y_{ij} - \Big\langle g, \int_0^1\int_0^1 k_{t,s,X_i(s)}\hat v_j(t)\,dt\,ds\Big\rangle_K\Big)^2 + \lambda\|g\|_K^2, \qquad (3)$$
where $Y_{ij} = \int_0^1 Y_i(t)\hat v_j(t)\,dt$ (and $Y_{ij} = 0$ for $j > n$). Define the subspace (of $K$)
$$H_1 = \mathrm{span}\Big\{\int_0^1\int_0^1 k_{t,s,X_i(s)}\hat v_j(t)\,dt\,ds : i = 1,\dots,n,\ j = 1,\dots,n\Big\},$$
as well as its orthogonal complement $H_1^\perp$. The space $K$ can be decomposed into the direct sum $K = H_1 \oplus H_1^\perp$, which means that we can write any function $g \in K$ as $g = g_1 + g_1^\perp$, with $g_1 \in H_1$ and $g_1^\perp \in H_1^\perp$. Using this decomposition we have that, for $1 \le j \le n$,
$$\Big\langle g, \int_0^1\int_0^1 k_{t,s,X_i(s)}\hat v_j(t)\,dt\,ds\Big\rangle_K = \Big\langle g_1, \int_0^1\int_0^1 k_{t,s,X_i(s)}\hat v_j(t)\,dt\,ds\Big\rangle_K,$$
while $\|g\|_K^2 = \|g_1\|_K^2 + \|g_1^\perp\|_K^2$. It follows from (1) and (3) that $\hat g \in H_1$ and so has the form
$$\hat g = \sum_{i=1}^n\sum_{j=1}^n \alpha_{ij} \int_0^1\int_0^1 k_{t,s,X_i(s)}\hat v_j(t)\,dt\,ds.$$
Note that this same expression would hold if we replaced the $\{\hat v_j(t)\}$ with $\{Y_j(t)\}$ (since they span the same space); however, it would not hold for an arbitrary basis. We use the FPCs for computational reasons, as we discuss at the end of the section. To compute the estimate $\hat g$, we only need to compute the coefficients $\{\alpha_{ij}\}$. As usual, the coefficients $\alpha_{ij}$ can be computed via a type of ridge regression. Note that
$$\Big\langle \hat g, \int_0^1\int_0^1 k_{t,s,X_{i'}(s)}\hat v_{j'}(t)\,dt\,ds\Big\rangle_K = \sum_{i,j} A_{i'j'ij}\,\alpha_{ij}, \qquad A_{i'j'ij} = \Big\langle \int_0^1\!\!\int_0^1 k_{t,s,X_{i'}(s)}\hat v_{j'}(t)\,dt\,ds,\ \int_0^1\!\!\int_0^1 k_{t,s,X_i(s)}\hat v_j(t)\,dt\,ds\Big\rangle_K.$$
Turning to the norm in the penalty, we can use the same arguments to show that $\|\hat g\|_K^2 = \sum_{i,j}\sum_{i',j'}\alpha_{ij}\,A_{iji'j'}\,\alpha_{i'j'}$. Thus the minimization problem can be phrased as minimizing
$$\frac{1}{n}\sum_{i,j}\Big(Y_{ij} - \sum_{i',j'} A_{iji'j'}\alpha_{i'j'}\Big)^2 + \lambda\sum_{i,j}\sum_{i',j'}\alpha_{ij}\,A_{iji'j'}\,\alpha_{i'j'}.$$
We now vectorize the problem by stacking the columns of $Y_{ij}$ and $\alpha_{ij}$, denoted $Y_V$ and $\alpha_V$. We also turn the array $A_{iji'j'}$ into a matrix $A_V$ by collapsing the corresponding dimensions.
We can then phrase the minimization problem as
$$\hat\alpha_V = \arg\min_{\alpha_V}\ \frac{1}{n}\|Y_V - A_V\alpha_V\|^2 + \lambda\,\alpha_V^\top A_V\alpha_V.$$
Thus, the final estimate can be expressed as
$$\hat\alpha_V = (A_V + n\lambda I)^{-1} Y_V.$$
Note that we are estimating $n^2$ parameters and inverting an $n^2 \times n^2$ matrix. Thus, for computational convenience, it is often useful to truncate the EFPCs at some value $J < n$. However, even without truncating, this approach still has the potential to lead to fewer parameters than the basis methods of Scheipl et al. (2015), where the number of parameters to estimate is $m^3$, with $m$ being the number of basis functions used in their tensor product basis. In contrast, our approach yields $n^2$ parameters, and combined with an FPCA this can be reduced to $nJ$ with relatively little loss in practical predictive performance. There is also the possibility of using an eigen-expansion of $k$ to reduce the computational complexity even further (Parodi and Reimherr, 2017), though we do not pursue that here.
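To make the final solve concrete, here is a minimal Python sketch of the ridge-type system just described. The matrix A_V is a synthetic positive semi-definite stand-in for the true Gram matrix (whose entries are RKHS inner products of the representer functions), Y_V stands in for the projected responses, and the exact scaling of the penalty by the sample size is an assumption.

```python
import numpy as np

# Ridge-type solve for the stacked coefficients alpha_V.
# A_V: synthetic PSD stand-in for the Gram matrix of representer functions.
# Y_V: stand-in for the projected responses. (Both are hypothetical data.)
n, J = 30, 3                          # J = number of retained EFPCs (J < n)
rng = np.random.default_rng(1)
B = rng.normal(size=(n * J, n * J))
A_V = B @ B.T / (n * J)               # symmetric positive semi-definite
Y_V = rng.normal(size=n * J)
lam = 0.1

# Solve (A_V + n*lam*I) alpha = Y_V; the penalty makes the system well posed.
alpha_hat = np.linalg.solve(A_V + n * lam * np.eye(n * J), Y_V)
fitted = A_V @ alpha_hat              # fitted projections of the responses
```

Truncating the EFPCs at $J < n$ shrinks the linear system from $n^2$ to $nJ$ unknowns, which is where the computational savings discussed above come from.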

Alternative Domains
While our work is focused primarily on the "classic" function-on-function paradigm, we briefly mention in this section an easy way to modify the kernels to allow for more complex domains. In particular, one major concern brought up by a referee is when both X i (t) and Y i (t) are observed concurrently. In that case, the classic approach would actually use future values of the covariate to predict present values of the outcome. Interestingly, we need only make a very slight adjustment to the kernels to handle such a setting.
The goal here is to adjust the model such that
$$Y_i(t) = \int_0^t g(t,s,X_i(s))\,ds + \varepsilon_i(t),$$
or equivalently to require that $g(t,s,X_i(s)) = 0$ if $s > t$. More generally, we can allow the domain of $X$ used to predict $Y$ to change arbitrarily with $t$.
Denote this domain by $A_t \subseteq [0,1]$; the concurrent setting corresponds to $A_t = [0,t]$, which is what we use to highlight this approach in Section 6. We aim to fit the more general model
$$Y_i(t) = \int_{A_t} g(t,s,X_i(s))\,ds + \varepsilon_i(t).$$
Interestingly, this can be done through a simple modification of the kernel. In particular, we can define a new kernel as
$$\tilde k(t,s,x,t',s',x') = 1_{\{s\in A_t\}}\,1_{\{s'\in A_{t'}\}}\,k(t,s,x,t',s',x').$$
A direct verification shows that $\tilde k$ is a valid reproducing kernel as long as the original $k$ was. Our estimate then takes the same form as before, with $k$ replaced by $\tilde k$, which means that $\hat Y_{n+1}(t)$ can be computed using only $\{X_{n+1}(s) : s \in A_t\}$ and a very slight modification of our current approach. We illustrate this technique in Section 6.
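The kernel modification is a one-line change in code. The sketch below masks the same illustrative Gaussian kernel used earlier (an assumption, not the paper's implementation) with the indicators for $A_t = [0, t]$, so evaluations where the covariate time exceeds the response time vanish.

```python
import numpy as np

# Base kernel: the same illustrative Gaussian product kernel (an assumption).
def k(p, q, delta=0.5):
    return np.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / delta ** 2)

# Modified kernel for the concurrent/historical setting with A_t = [0, t]:
# multiply by 1{s in A_t} * 1{s' in A_t'} so future covariate times drop out.
def k_tilde(p, q):
    t, s, _ = p
    t2, s2, _ = q
    return float(s <= t) * float(s2 <= t2) * k(p, q)

# s = 0.8 > t = 0.3, so this evaluation vanishes ...
assert k_tilde((0.3, 0.8, 0.1), (0.5, 0.2, 0.0)) == 0.0
# ... while past covariate times contribute as usual.
assert k_tilde((0.9, 0.4, 0.1), (0.5, 0.2, 0.0)) > 0.0
```

Any other domain map $t \mapsto A_t$ only changes the two indicator conditions.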

Asymptotic Theory
In this section, we demonstrate that the excess risk, $\mathcal{E}_n$ (defined below), of our estimator converges to zero at the optimal rate. Optimal convergence of $\mathcal{E}_n$ for scalar-on-function linear regression was established by Cai and Yuan (2012), while optimal convergence for the continuously additive scalar-on-function regression model was established in Wang and Ruppert (2015). In both cases an RKHS estimation framework was used. Because our model involves a functional response, the form of the excess risk is different and requires substantial mathematical extensions of previous work. However, we will show that the convergence rate for our model is the same as the one found in Wang and Ruppert (2015).
We begin by defining the excess risk, $\mathcal{E}_n$. Let $X_{n+1}(t)$ be a new predictor, distributed as, but independent of, $(X_i(t))_{i=1}^n$. We let $E^*$ denote the expected value conditioned on the data $\{(Y_i, X_i) : 1 \le i \le n\}$. Then the excess risk is defined as
$$\mathcal{E}_n = E^*\int_0^1\Big(\int_0^1 \hat g(t,s,X_{n+1}(s))\,ds - \int_0^1 g(t,s,X_{n+1}(s))\,ds\Big)^2 dt.$$
Note that $\mathcal{E}_n$ is still a random variable, as it is a function of the data. Intuitively, this quantity can be thought of as prediction error: for a future observation, how far away is our prediction from the optimal one in which the true $g$ is known? For ease of exposition, we present all of the assumptions below, even the ones discussed previously.
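Although the excess risk is a theoretical quantity, it is straightforward to approximate by Monte Carlo when the truth is known, which is also how rates are checked in simulations. The sketch below uses a hypothetical $g$ and a deliberately biased copy of it as the "estimate", with simple grid averages for the integrals; all names here are illustrative, not from the paper.

```python
import numpy as np

# Monte Carlo sketch of the excess risk: average, over fresh draws of X, the
# integrated squared gap between the fitted and true additive predictors.
def g_true(t, s, x):
    return np.sin(2 * np.pi * t) * s * x      # hypothetical true surface

def g_hat(t, s, x):
    return 0.9 * g_true(t, s, x)              # hypothetical biased "estimate"

def additive_pred(g, X, tgrid, sgrid):
    # approximates int_0^1 g(t, s, X(s)) ds on a grid, for each t
    return np.array([np.mean([g(t, s, X(s)) for s in sgrid]) for t in tgrid])

rng = np.random.default_rng(2)
tgrid = np.linspace(0, 1, 25)
sgrid = np.linspace(0, 1, 25)
gaps = []
for _ in range(200):
    a, b = rng.normal(size=2)                 # random new predictor curve X
    X = lambda s, a=a, b=b: a * np.sin(np.pi * s) + b * np.cos(np.pi * s)
    d = additive_pred(g_hat, X, tgrid, sgrid) - additive_pred(g_true, X, tgrid, sgrid)
    gaps.append(np.mean(d ** 2))              # approximates the dt integral
excess_risk = float(np.mean(gaps))            # Monte Carlo estimate of E_n
```

In a rate check one would repeat this for fitted estimators at several sample sizes and compare the decay of `excess_risk` to the theoretical rate.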
Assumption 1. We make the following assumptions.
(i) The data follow the additive model
$$Y_i(t) = \int_0^1 g(t,s,X_i(s))\,ds + \varepsilon_i(t),$$
where $\{X_i\}$ and $\{\varepsilon_i\}$ are independent of each other and i.i.d. across $i = 1, \dots, n$.
(ii) Denote by $L_k$ the integral operator with $k$ as its kernel:
$$(L_k f)(t,s,x) = \int k(t,s,x;t',s',x')\,f(t',s',x')\,dt'\,ds'\,dx'.$$
The kernel, $k$, which also defines the RKHS, $K$, is assumed to be symmetric, positive definite, and square integrable.
(iii) Assume that there exists a constant $c > 0$ such that for any $f \in K$ and $t \in [0,1]$ we have
$$E\Big(\int_0^1 f(t,s,X(s))\,ds\Big)^4 \le c\,\Big[E\Big(\int_0^1 f(t,s,X(s))\,ds\Big)^2\Big]^2.$$
(iv) Let $L_k^{1/2}$ denote a square-root of $L_k$ (which exists due to Assumption 1(ii)), set $k^{1/2}_{t,s,x} = L_k^{-1/2} k_{t,s,x}$, and define
$$C = E\int_0^1\Big(\int_0^1 k^{1/2}_{t,s,X(s)}\,ds\Big)\otimes\Big(\int_0^1 k^{1/2}_{t,s',X(s')}\,ds'\Big)\,dt.$$
The eigenvalues $\rho_j$ of $C$ are assumed to satisfy $\rho_j \asymp j^{-2r}$ for some $r > 1/2$.
(v) The point-wise variance of the errors is bounded: $\sup_{t\in[0,1]} E[\varepsilon_i(t)^2] < \infty$.
(vi) The function $g$ lies in $\Omega$, which we assume is a closed, bounded ball in $K$.
We are now in a position to state our main result.
Theorem 1. If Assumption 1 holds and the penalty parameter, $\lambda$, is chosen such that $\lambda \asymp n^{-2r/(2r+1)}$, then we have that
$$\mathcal{E}_n = O_P\big(n^{-2r/(2r+1)}\big).$$
Before interpreting this result, let us discuss each of the assumptions individually. Assumption 1(i) explicitly defines the model we are considering. Assumption 1(ii) ensures that the kernel has a spectral decomposition via Mercer's theorem, which will be used extensively. Assumption 1(iii) is fairly typical in these sorts of asymptotics, assuming that the fourth moment is bounded by a constant times the square of the second. Assumption 1(iv) introduces a central quantity that is used extensively in the proofs. While not immediately obvious, this assumption essentially states how "smooth" or "regular" the function $g$ is, as $g$ must lie in $K$, whose kernel contributes to $C$. In such results it is common for $X$ to contribute to the asymptotic behavior, as the prediction error depends on the complexity of $X$. Note that $k^{1/2}_{t,s,x}$ is a well-defined quantity, and it is easy to show via the reproducing property that it is an element of $L^2([0,1]^2 \times \mathbb{R})$. The operator $C$ depends on the choice of the square-root $L_k^{1/2}$ (which is not unique); however, its eigenvalues do not. Assumption 1(v) simply assumes that the point-wise variance of the errors is bounded, while the last assumption requires that the true function lie in a ball in $K$, which is used to control the bias of the estimate.
The rate given in Theorem 1 is the same as was found in the scalar outcome case in Wang and Ruppert (2015); thus we know that this is the minimax rate of convergence. In our case, as well as in Wang and Ruppert (2015) and Cai and Yuan (2012), it is the interaction between the covariance of $X$ and the kernel $k$ that determines the optimal rate. The proof is quite extensive and is given in the appendix. The idea of the proof is to rephrase the estimate using operator notation instead of the representer theorem. The difference between the estimate and the truth is then split into a bias/variance decomposition. Bounding the bias turns out to be relatively straightforward.
Bounding the variance is done by decomposing it into five more manageable pieces and then bounding each of them separately. Our task is complicated by the fact that the errors and response are now functions, whereas in both Wang and Ruppert (2015) and Cai and Yuan (2012) they were scalars. This requires extending many of the lemmas to this new setting, as well as using some completely new arguments to obtain the necessary bounds.

Simulation Study
Here we investigate the prediction performance of AFFR. We compare it with a linear model estimated in one of two ways, denoted LM_R (linear model reduced) and LM_F (linear model full); both use FPCA to reduce the dimension of the predictors, but LM_R also reduces the dimension of the outcome, while LM_F does not. To implement our approach we relied heavily on the tensorA package (van den Boogaart, 2007) in R, which allowed us to carry out various tensor products very quickly.
We consider three different settings for g(t, s, x), denoted (a), (b), and (c): one linear and two nonlinear. In all settings, the predictors X_i(t) and errors ε_i(t) are taken to be i.i.d. Gaussian processes with mean 0 and a covariance function from the Matérn family with parameter ρ = 1/4. For the RKHS we considered both the Gaussian kernel and the exponential kernel, where δ is the range parameter; we examine the sensitivity of our approach to this parameter in Tables 2 and 3. All of the curves (X_i(t), Y_i(t), and ε_i(t)) were simulated on a grid of M = 50 equispaced points between 0 and 1. The data were approximated using K = 100 B-splines. We denote by J_X and J_Y the number of principal components of X and Y, respectively. These steps are carried out using the Data2fd and pca.fd functions in the R package fda. Our approach uses an FPCA on Y only, but the LM_R approach uses the FPCs for both X and Y. The common recommendations for choosing J_Y are either to use some cutoff for explained variability (commonly 85%) or to look for an elbow in the scree plot (J_X can also be chosen the same way or using a model-based criterion such as BIC) (Kokoszka and Reimherr, 2017). Using an 85% cutoff here results in 3 FPCs for our simulations, though we also include 6 and 9 to show that our approach is not very sensitive to this choice as long as a large proportion of variability is explained. However, one should note the trade-offs when choosing J_Y. In general, the major gain from choosing a smaller J_Y is faster computation, which is nontrivial for this problem. The major loss is that one "gives up" on some proportion of the variability in Y. For example, if the FPCs explain 95% of the variability, then one immediately gives up on predicting the remaining 5%. This is a different consideration than when choosing FPCs for predictors.
In general, users can tailor this choice to their data; if one expects very accurate predictions then a larger J Y can be helpful so that one does not lose prediction accuracy, while if it is known a-priori that the prediction accuracy will be low, then J Y can be safely made smaller.
To evaluate the different approaches, we used 1000 repetitions of every scenario. In each case we generated 150 curves to fit the different models and then generated another 150 curves to evaluate out-of-sample prediction error. The metric for determining prediction performance we denote as RPE, for relative prediction error. This metric denotes the improvement of the predictions over just using the mean, and can be thought of as a type of out-of-sample R². An RPE of 0 implies that the model shows no improvement over just using the mean, while an RPE of 1 means the predictions are perfect. More precisely, we first compute the mean squared prediction error as
$$MSPE = \frac{1}{150}\sum_{i=1}^{150}\int_0^1 \big(Y_i(t) - \hat Y_i(t)\big)^2\,dt,$$
where $\hat Y_i$ is a predicted value using one of the three discussed models or simply the mean. The RPE is then defined as
$$RPE = 1 - \frac{MSPE}{MSPE_{mean}},$$
where $MSPE_{mean}$ denotes the MSPE using a mean-only model. Note that even in the mean-only model, all parameters are estimated on the initial 150 curves and prediction is then evaluated on the second 150. Therefore, it is actually possible to obtain a numerically negative RPE if an approach predicts no better than just using the mean.
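The RPE computation itself is only a few lines; the sketch below works on curves stored as rows of arrays over a common grid, with Y, Y_hat, and the training mean all hypothetical stand-ins rather than the paper's simulated data.

```python
import numpy as np

# MSPE and RPE as described above, for curves on a common grid.
def mspe(Y, Y_pred):
    # Y, Y_pred: (n_curves, n_gridpoints); mean integrated squared error
    return np.mean((Y - Y_pred) ** 2)

def rpe(Y, Y_pred, Y_train_mean):
    mspe_mean = mspe(Y, np.tile(Y_train_mean, (Y.shape[0], 1)))
    return 1.0 - mspe(Y, Y_pred) / mspe_mean

# Hypothetical test curves and predictions:
rng = np.random.default_rng(3)
Y = rng.normal(size=(150, 50))
Y_hat = Y + 0.5 * rng.normal(size=Y.shape)   # a reasonably good predictor
Y_bar = Y.mean(axis=0)                       # stands in for the training mean
r = rpe(Y, Y_hat, Y_bar)   # near 1 = near-perfect; <= 0 = no gain over the mean
```

Predicting with the training mean itself gives RPE exactly 0, and a perfect predictor gives RPE exactly 1, matching the interpretation above.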
The RPEs of LM_R and LM_F for the three models (a), (b), and (c) are summarized in Table 1. For both models we took J_X = 3, which explained over 85% of the variability of the predictors, and for LM_R we took J_Y = 3 PCs for the outcome as well. The RPEs for our approach with δ ∈ {2⁻³, 2⁻², 2⁻¹, 1, 2} and J_Y = 3, 6, 9 are summarized in Tables 2 and 3, which correspond to the Gaussian and exponential kernels, respectively. An initial look at the tables confirms much of what one would expect. When the true model is linear, the two linear approaches work best, resulting in about twice the RPE of AFFR. However, for the two nonlinear models, the AFFR approach does substantially better. This increased performance is seen for any choice of J_Y and δ.
Furthermore, the prediction performance seems relatively robust to the choice of J_Y, δ, and even the kernel. In the case of J_Y this is not so surprising, as over 90% of the variability of the Y_i is explained by the first three FPCs. There is some sensitivity to the choice of δ, but it is relatively weak given how much we change δ in each row. In our application section we set δ using a type of median, but one could also refit the model with a few different δ and choose the one with the best prediction performance. Given how consistent the AFFR predictions are, trying a few δ appears to be satisfactory, and large grid searches can be avoided.
As a final illustration of the efficacy of AFFR, we provide several plots to help visualize the performance. In Figure 1 we plot out-of-sample predictions under each setting. For the nonlinear scenarios (rows 2 and 3), one can clearly see the RPE results reflected in the predictions, as AFFR is much closer to the optimal prediction. In Figure 2 we plot several realizations of ĝ(t, s, X_i(s)), again out of sample, along with the true values of g(t, s, X_i(s)). Plotting in this way allows us to visualize g using surfaces, whereas plotting g(t, s, x) directly would be challenging since the domain has three coordinates. As we can see, the estimates are quite close to the true values, capturing the nonlinear structure quite well.

Application to Cumulative Intraday Data
We conclude with an illustration of our approach applied to real data. Cumulative intraday returns (CIDRs) consist of daily stock prices that are normalized to start at zero at the beginning of each trading day. FDA methods have been useful in analyzing such data (Gabrys et al., 2010; Kokoszka and Reimherr, 2013; Horváth et al., 2014), given the density at which stock prices can be observed. Let P_i(t_j) denote the price of a stock on day i and time of day t_j. The CIDRs are then defined as
$$R_i(t_j) = 100\,[\ln P_i(t_j) - \ln P_i(t_1)], \qquad i = 1, \dots, n,\ j = 1, \dots, M.$$

Table 3: Relative prediction error, RPE, for AFFR using an exponential kernel and with different kernel parameter values, δ. In every case the penalty parameter, λ, is chosen using cross-validation.

We investigate the performance of the market indexes, the S&P 500 and DJ, in predicting GE and IBM for the three periods (before, during, and after the 2008 financial crisis). Understanding such relationships is imperative for developing financial portfolios, as many strategies consist of balancing buying/shorting certain stocks with buying/shorting market indices (Nicholas, 2000). We fit four different models. The first two are the discussed AFFR models: one based on using the full X_i(t) to predict Y_i(t) (AFFR), and one where only the current and past values are used (AFFR Pre), as described in Section 3.1. The other two are linear models: the first uses an FPCA on both the outcome and predictor (5 PCs for both) and then fits a multivariate linear model, while the second uses FPCA only on the predictor (5 PCs) (Kokoszka and Reimherr, 2017). To evaluate prediction performance for each period, we split the period into 3 equal folds and use a type of K-fold cross-validation: the model is fit on two folds, while prediction is evaluated on the third.
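As a quick illustration of the CIDR transformation (not the paper's code, which was in R), the following snippet converts a matrix of intraday prices into CIDRs exactly as in the display above; the price values are hypothetical.

```python
import numpy as np

# CIDRs from intraday prices P (n days x M intraday times):
# R_i(t_j) = 100 * (ln P_i(t_j) - ln P_i(t_1)).
def cidr(P):
    logP = np.log(P)
    return 100.0 * (logP - logP[:, [0]])   # subtract each day's opening log-price

# Two hypothetical trading days with three intraday prices each:
P = np.array([[100.0, 101.0, 99.5],
              [ 50.0,  50.5,  51.0]])
R = cidr(P)
assert np.allclose(R[:, 0], 0.0)           # every day starts at zero
```

Normalizing each day to start at zero removes the overnight price level, so curves from different days become comparable functional observations.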
We use the Gaussian kernel from Section 5, with the smoothing parameter selected via generalized cross-validation. Prediction performance is then averaged over the 3 folds. To provide a more readily interpretable metric for prediction performance, we use the same RPE metric given in Section 5, which denotes the relative performance of a model with respect to a mean-only model.
A value of 1 means perfect prediction, while a value of 0 indicates that the model is doing no better than just using the mean. The results are summarized in Table 4.
As we can see, all models perform better during and after the crisis. This suggests that the behavior of the market had not returned to its pre-crisis characteristics. Looking at Figure 3, we can clearly see that volatility increases during and after the financial crisis. This suggests that the overall "market" effect on the stocks is stronger during periods of high volatility. When comparing the four different models, the linear models perform nearly the same, which is to be expected since 5 PCs explain over 90% of the variability of the stocks. The AFFR model is not too far behind, but does noticeably worse in every setting. This suggests that the relationship between the discussed stocks and the indices is approximately linear; any nonlinear relationships are either very minor deviations from linearity or are not well captured by an additive structure.
AFFR using only current and past values of X_i(t) to predict Y_i(t) (AFFR Pre) does substantially worse before the crisis. Interestingly, during and after the crisis its performance is closer to that of AFFR, though there are still some relationships it does not capture well. This suggests that, unsurprisingly, knowing the future values of X_i(t) is very helpful for predicting current values of Y_i(t), though this is obviously impractical. During the financial crisis, many stocks are likely driven by large market-level effects. In this setting AFFR Pre does quite well, even beating AFFR slightly in some cases, suggesting that the simpler structure has actually helped with prediction.

Conclusions and Future Work
In this paper we have presented a new RKHS framework for estimating an additive function-on-function regression model, one that is better able to account for complex nonlinear dynamics than classic functional linear models. We showed that the estimator is minimax in the sense that it achieves an optimal rate of convergence in terms of prediction error. In addition, computing the estimate is computationally efficient, especially if dimension reduction is incorporated.
Nonlinear models for functional data have recently received a great deal of attention; however, a number of interesting questions remain open. One that is especially relevant to the work presented here concerns further statistical properties of the estimate, ĝ. In particular, convergence rates of ĝ, as well as its asymptotic distribution, would be especially useful for quantifying estimation uncertainty in practice. Using such tools, one could also construct confidence/prediction bands, which would be of great use in practice.
Another nontrivial extension would be to curves that are observed sparsely. Nonlinear models in FDA often require that the curves be fully observed or at least consistently estimated. However, for some data this is unrealistic, and there is a great deal of uncertainty related to imputing the curves.
Lastly, extensions to more complex settings would also be of interest, for example, more complex domains such as space or space-time; in these cases, the minimax rates usually depend on the dimension of the domain. Another important extension would be to binary or categorical functional outcomes (as opposed to quantitative ones), which would require incorporating tools from functional GLMs.

A Proof of Theorem 1

A.1 Excess Risk
We begin by expressing the excess risk in an alternative form. Recall that $k(t,s,x;t',s',x')$ is the kernel function used to define the RKHS, $K$. This kernel can be viewed as the kernel of an integral operator, $L_k$, which maps $L^2([0,1]^2 \times \mathbb{R}) \to K \subset L^2([0,1]^2 \times \mathbb{R})$. In particular,
$$(L_k f)(t,s,x) = \int k(t,s,x;t',s',x')\,f(t',s',x')\,dt'\,ds'\,dx'.$$
From here on, for simplicity, we will denote $L^2([0,1]^2 \times \mathbb{R})$ as simply $L^2$. By Assumption 1, $L_k$ is a positive definite, compact operator, which is also self-adjoint in the sense that $\langle f, L_k g\rangle = \langle L_k f, g\rangle$, for any $f$ and $g$ in $L^2$. We can therefore define a square-root of $L_k$, denoted $L_k^{1/2}$, that satisfies $L_k^{1/2} L_k^{1/2} = L_k$. Note that even if $L_k$ has a nontrivial null space, $L_k^{-1/2}$ can still be well defined on $K$, since $f \in K$ implies that $f$ is orthogonal to the null space of $L_k$. Recall that one can also move between the $K$ and $L^2$ inner products as follows:
$$\langle f, g\rangle_K = \langle L_k^{-1/2} f,\ L_k^{-1/2} g\rangle_{L^2}.$$
We refer the interested reader to Kennedy and Sadeghi (2013) for more details.
Let $\hat g$ denote our estimate of the true function, $g$. We then define
$$h = L_k^{-1/2} g \quad \text{and} \quad \hat h = L_k^{-1/2} \hat g,$$
and write $k^{1/2}$ for the kernel of $L_k^{1/2}$. Using the reproducing property, we have that
$$g(t,s,X(s)) = \big\langle k^{1/2}_{t,s,X(s)},\, h\big\rangle_{L^2} \quad \text{and} \quad \hat g(t,s,X(s)) = \big\langle k^{1/2}_{t,s,X(s)},\, \hat h\big\rangle_{L^2}.$$
Now define the random operator $T : L^2 \to L^2$ as
$$T = \int\!\!\int\!\!\int k^{1/2}_{t,s,X_{n+1}(s)} \otimes k^{1/2}_{t,s',X_{n+1}(s')} \, ds \, ds' \, dt,$$
where $\otimes$ denotes the tensor product, and the resulting object is interpreted as an operator: $(f \otimes g)(u) = \langle g, u\rangle_{L^2}\, f$. We also define a second operator, which integrates out $t$, $s$, and $s'$, and takes an expectation over $X_{n+1}$:
$$C = E[T] = E \int\!\!\int\!\!\int k^{1/2}_{t,s,X_{n+1}(s)} \otimes k^{1/2}_{t,s',X_{n+1}(s')} \, ds \, ds' \, dt. \tag{5}$$
Note that $C$ is a symmetric, positive definite, compact operator, and thus has a spectral decomposition
$$C = \sum_k \rho_k \, \phi_k \otimes \phi_k,$$
where $\rho_k \ge 0$ and $\phi_k \in L^2$ are, respectively, the eigenvalues and eigenfunctions of $C$. This decomposition will be used later on.
As we said before, denote by $E^*$ the expected value conditioned on the data $(X_1, Y_1), \dots, (X_n, Y_n)$.
The excess risk can be written as
$$E^*\!\left[\int \left(\int \hat g(t,s,X_{n+1}(s))\,ds - \int g(t,s,X_{n+1}(s))\,ds\right)^{\!2} dt\right] = \big\langle C(\hat h - h),\, \hat h - h\big\rangle_{L^2} = \|\hat h - h\|_C^2.$$
Thus, the excess risk can be expressed as a sort of weighted $L^2$ norm, where the operator $C$, which is composed of the kernel and the distribution of $X_{n+1}$, defines the weights.

A.2 Re-expressing the Estimator
In this section we define an alternative form for the estimator $\hat g$, which was given in Section 3.
In particular, instead of using the reproducing property, we write the estimator using operators. To do this, we take derivatives of $RSS_\lambda(g)$ with respect to $g$; since the arguments are functions, we mean the Fréchet, or strong, derivative. Note that $RSS_\lambda(g)$ is a convex, differentiable functional over $K$. However, so that we work with $L^2$ instead of $K$, we substitute $g = L_k^{1/2} h$ and use
$$RSS_\lambda(h) = \frac{1}{n}\sum_{i=1}^n \int \left(Y_i(t) - \int \big\langle k^{1/2}_{t,s,X_i(s)},\, h\big\rangle_{L^2} \, ds\right)^{\!2} dt + \lambda \|h\|_{L^2}^2.$$
Now $RSS_\lambda(h)$ is a convex, differentiable functional over $L^2$; thus, when taking the derivative, we use the topology of $L^2$, not $K$.
To take the derivative of $RSS_\lambda(h)$ we first focus on the penalty, which is easier. We have that
$$\frac{\partial}{\partial h}\, \lambda \|h\|_{L^2}^2 = 2\lambda h.$$
Turning to the first term in $RSS_\lambda(h)$, we first define the empirical quantity
$$C_n = \frac{1}{n}\sum_{i=1}^n \int\!\!\int\!\!\int k^{1/2}_{t,s,X_i(s)} \otimes k^{1/2}_{t,s',X_i(s')} \, ds \, ds' \, dt. \tag{7}$$
Now we can apply a chain rule to obtain
$$\frac{\partial}{\partial h}\, \frac{1}{n}\sum_{i=1}^n \int \left(Y_i(t) - \int \big\langle k^{1/2}_{t,s,X_i(s)},\, h\big\rangle_{L^2}\,ds\right)^{\!2} dt = 2 C_n h - \frac{2}{n}\sum_{i=1}^n \int\!\!\int Y_i(t)\, k^{1/2}_{t,s,X_i(s)} \, ds \, dt.$$
Setting the full derivative to zero yields the estimate
$$\hat h = (C_n + \lambda I)^{-1}\left(\frac{1}{n}\sum_{i=1}^n \int\!\!\int Y_i(t)\, k^{1/2}_{t,s,X_i(s)} \, ds \, dt\right), \tag{8}$$
where $I$ is the identity operator.
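The penalized least-squares calculation above has a familiar finite-dimensional analogue: for a design matrix $X$, minimizing $\frac{1}{n}\|Y - Xh\|^2 + \lambda\|h\|^2$ zeroes the gradient at $\hat h = (C_n + \lambda I)^{-1} X^\top Y / n$ with $C_n = X^\top X / n$. A minimal sketch (our illustration; all names and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 10, 0.5

X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

C_n = X.T @ X / n          # finite-dimensional analogue of the operator C_n
b_n = X.T @ Y / n          # right-hand side of the normal equations
h_hat = np.linalg.solve(C_n + lam * np.eye(p), b_n)

# The estimator zeroes the derivative of the penalized objective:
# d/dh [ (1/n)||Y - X h||^2 + lam ||h||^2 ] = 2 (C_n h - b_n) + 2 lam h.
grad = 2 * (C_n @ h_hat - b_n) + 2 * lam * h_hat
print(np.max(np.abs(grad)))   # numerically zero
```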

A.3 Proof of Theorem 1 - Controlling Bias
Using Assumption 1 we can express
$$Y_i(t) = \int \big\langle k^{1/2}_{t,s,X_i(s)},\, h\big\rangle_{L^2} \, ds + \epsilon_i(t),$$
and we therefore have that
$$\frac{1}{n}\sum_{i=1}^n \int\!\!\int Y_i(t)\, k^{1/2}_{t,s,X_i(s)} \, ds \, dt = C_n h + f_n, \qquad f_n = \frac{1}{n}\sum_{i=1}^n \int\!\!\int \epsilon_i(t)\, k^{1/2}_{t,s,X_i(s)} \, ds \, dt.$$
This implies that $\hat h$ from (8) can be expressed as
$$\hat h = (C_n + \lambda I)^{-1} C_n h + (C_n + \lambda I)^{-1} f_n.$$
We introduce an intermediate quantity, $h_\lambda$, which is given by
$$h_\lambda = (C + \lambda I)^{-1} C h,$$
where $C$ is defined in (5). The difference between $h_\lambda$ and $h$ represents the bias of the estimator $\hat h$.
Balancing this quantity with the variance, discussed in the next section, is the bias-variance trade-off, a common notion in nonparametric smoothing. The idea is that to achieve an optimal $\hat h$ we must balance the bias and variance so that neither is overly large.
Using the eigenfunctions of $C$ as a basis, we can write
$$h = \sum_k a_k \phi_k.$$
Since $C$ and $C + \lambda I$ have the same eigenfunctions, it follows that
$$(C + \lambda I)^{-1} \phi_k = \frac{1}{\rho_k + \lambda}\, \phi_k.$$
So $h_\lambda$ can be expressed as
$$h_\lambda = \sum_k \frac{\rho_k}{\rho_k + \lambda}\, a_k \phi_k,$$
and the difference $h_\lambda - h$ can be written as
$$h_\lambda - h = -\sum_k \frac{\lambda}{\rho_k + \lambda}\, a_k \phi_k. \tag{9}$$
The bias is therefore given by
$$\|h_\lambda - h\|_C^2 = \big\langle C(h_\lambda - h),\, h_\lambda - h\big\rangle_{L^2} = \sum_k \frac{\lambda^2 \rho_k}{(\rho_k + \lambda)^2}\, a_k^2.$$
It is easy to verify that the maximum of $F(x) = x/(\lambda + x)^2$ is achieved at $x = \lambda$, with maximum value $\frac{1}{4\lambda}$. We can therefore bound the bias as
$$\|h_\lambda - h\|_C^2 \le \lambda^2 \cdot \frac{1}{4\lambda} \sum_k a_k^2 = \frac{\lambda}{4}\, \|h\|_{L^2}^2.$$
In the statement of Theorem 1 we assume that $\lambda \asymp n^{-2r/(2r+1)}$, which implies that the bias is of order $n^{-\frac{2r}{2r+1}}$. We will show in the next section that the variance of our estimate achieves the same order.
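The stated maximum of $F(x) = x/(\lambda + x)^2$ and the resulting decay of the bias bound are easy to check numerically; the following sanity check is ours, with $r = 2$ chosen arbitrarily as an illustrative smoothness level:

```python
import numpy as np

def F(x, lam):
    return x / (lam + x) ** 2

x = np.linspace(1e-8, 10.0, 2_000_000)
for lam in (0.1, 0.01, 0.001):
    vals = F(x, lam)
    # maximum of x/(lam+x)^2 over x > 0 is 1/(4*lam), attained at x = lam
    assert abs(vals.max() - 1.0 / (4.0 * lam)) < 1e-4 / lam
    assert abs(x[vals.argmax()] - lam) < 1e-3

# With lam of order n^{-2r/(2r+1)}, the bias bound (lam/4)*||h||^2 is
# O(n^{-2r/(2r+1)}); r = 2 here is an arbitrary illustrative choice.
r = 2.0
rates = [(n, n ** (-2 * r / (2 * r + 1)) / 4) for n in (10**2, 10**4, 10**6)]
print(rates)  # the bound shrinks as n grows
```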

A.4 Proof of Theorem 1 - Controlling Variability
Controlling the variability of the estimate, $\|\hat h - h_\lambda\|_C$, follows arguments similar to those used to control the bias; however, there are many more terms, which must be analyzed separately. In particular, we decompose $\hat h - h_\lambda$ into five separate components,
$$\hat h - h_\lambda = T_1 + T_2 + T_3 + T_4 + T_5. \tag{10}$$
While a bit tedious, verifying (10) requires only linear algebra and repeated use of the definitions of $\hat h$ and $h_\lambda$; we thus omit the details here. We now develop bounds for each term, $\|T_i\|_C$, separately.
For the first four, it turns out to be convenient to bound $\|C^\nu T_i\|_{L^2}$ for $0 < \nu \le 1/2$, as these bounds will be needed for the final term, $T_5$. Notice that when $\nu = 1/2$ we have $\|C^{1/2} T_i\|_{L^2} = \|T_i\|_C$.
1. Using the eigenfunctions of $C$ to express $h_\lambda - h$ as in (9), we get that
$$\|C^\nu (h_\lambda - h)\|_{L^2}^2 = \sum_k \frac{\lambda^2 \rho_k^{2\nu}}{(\rho_k + \lambda)^2}\, a_k^2.$$
Again, it is a basic calculus exercise to show that
$$\sup_{x > 0} \frac{\lambda^2 x^{2\nu}}{(\lambda + x)^2} = c^2 \lambda^{2\nu}.$$
We thus have the bound
$$\|C^\nu T_1\|_{L^2} \le c\, \lambda^\nu\, \|h\|_{L^2},$$
where $c$ is a constant that depends only on $\nu$.
2. Using the same arguments as in the previous step, we obtain the analogous bound for $\|C^\nu T_2\|_{L^2}$.
3. Turning to $T_3$, we apply Lemma 1 with $0 < \nu \le 1/2$, where $r$ is defined as in Assumption 1. By the statement of Theorem 1, $n \lambda^{1 + \frac{1}{2r}}$ tends to a nonzero constant, and the remaining factor is bounded by 1 since $\lambda \to 0$. Thus we have that $\|C^\nu T_3\|_{L^2} = O_p(\lambda^\nu)$.
5. The last term is the most involved to bound, and is the reason the previous four bounds involved $C^\nu T_i$. We begin by re-expressing $T_5$, where $\nu > 0$ is chosen to satisfy $2r(1 - 2\nu) > 1$. Applying Lemma 3, we have now, in some sense, looped back and are dealing with the term $h - \hat h$. Using (10), the first four terms we already have bounds for, so we need only focus on the last, which again loops back to our original term. We now apply Lemma 2 and combine the result with (15); using (13) and applying steps 1-4, we obtain a bound on $\|T_5\|_C$ of the same order as the other terms. We can now combine steps 1-4, taking $\nu = 1/2$, with step 5 to conclude that
$$\|\hat h - h_\lambda\|_C^2 = O_p\!\left(n^{-\frac{2r}{2r+1}}\right).$$
Combined with the results of Section A.3, this concludes the proof.
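A bound of the form used in step 1 follows from the elementary fact that, after substituting $x = \lambda u$, $\sup_{x>0} \lambda^2 x^{2\nu}/(\lambda + x)^2 = c_\nu^2 \lambda^{2\nu}$, where $c_\nu^2 = \sup_{u>0} u^{2\nu}/(1+u)^2$ depends only on $\nu$. A quick numerical confirmation (ours, not part of the paper):

```python
import numpy as np

u = np.linspace(1e-8, 50.0, 1_000_000)

for nu in (0.25, 0.5):
    # c_nu^2 = sup_u u^{2 nu} / (1 + u)^2 depends only on nu
    c_nu_sq = (u ** (2 * nu) / (1 + u) ** 2).max()
    for lam in (0.1, 0.01, 0.001):
        x = lam * u
        sup_val = (lam ** 2 * x ** (2 * nu) / (lam + x) ** 2).max()
        # substituting x = lam * u shows sup_x equals c_nu^2 * lam^{2 nu}
        expected = c_nu_sq * lam ** (2 * nu)
        assert abs(sup_val - expected) < 1e-8 * expected
```

For $\nu = 1/2$ the supremum is attained at $u = 1$ with $c_\nu^2 = 1/4$, matching the bias calculation in Section A.3.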

A.5 Auxiliary Lemmas
Here we state four lemmas which are generalizations of ones used in Cai and Yuan (2012) and Wang and Ruppert (2015).

Proof of Lemma 1
Recall the quantity to be bounded in Lemma 1. Using Parseval's identity and taking expected values, then applying Jensen's inequality together with the assumed independence between $\epsilon$ and $X$, as well as the assumption that $E(\epsilon^2(t)) \le M$, where $M$ is a constant, we obtain
$$M \int\!\!\int\!\!\int E\, \big\langle k^{1/2}_{t,s,X(s)},\, \phi_k\big\rangle_{L^2} \big\langle k^{1/2}_{t,s^*,X(s^*)},\, \phi_k\big\rangle_{L^2} \, ds \, ds^* \, dt = M\, \langle C(\phi_k), \phi_k\rangle = M \rho_k.$$
Since $0 \le \nu \le \frac{1}{2}$ and both $\rho_k$ and $\lambda$ are positive, we can bound the resulting sum. We then apply Lemma 5 with $\nu_2 = 1/2$ to obtain a bound in which $c^*$ is a constant. An application of Markov's inequality completes the proof.
Proof of Lemma 2
Fix $f \in L^2$ such that $\|f\|_{L^2} = 1$. We can expand $f$ in the eigenfunctions of $C$ as $f = \sum_k f_k \phi_k$. By Parseval's identity, the Cauchy-Schwarz inequality, and the fact that $\|f\|_{L^2} = 1$, we can bound the operator norm of $C_n - C$ in terms of the quantities $E\, \langle \phi_j, (C_n - C)\phi_k\rangle_{L^2}^2$. Applying Jensen's inequality and using the definition of $C_n$ from (7), we have that
$$E\, \langle \phi_j, (C_n - C)\phi_k\rangle_{L^2}^2 \le E\, \langle \phi_j, C_n \phi_k\rangle_{L^2}^2.$$
Note the inequality follows from the fact that $C$ is the mean of $C_n$, so that the variance above is bounded by the second moment. Using the Cauchy-Schwarz inequality again, $E\, \langle \phi_j, (C_n - C)\phi_k\rangle_{L^2}^2$ is bounded by a term of order $\frac{1}{n}$.
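The key estimate in this proof, that the entries $\langle \phi_j, (C_n - C)\phi_k\rangle$ have second moment of order $1/n$, can be illustrated with a finite-dimensional Monte Carlo sketch in which the empirical covariance of Gaussian vectors plays the role of $C_n$ (our illustration; the seed and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p      # population covariance, playing the role of C

def max_entry_error(n):
    """max_{j,k} |<e_j, (C_n - C) e_k>| for the empirical covariance C_n."""
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    C_n = X.T @ X / n
    return np.abs(C_n - Sigma).max()

errs = {n: max_entry_error(n) for n in (100, 10_000)}
print(errs)  # the n = 10_000 error is much smaller, consistent with O(n^{-1/2})
```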
Proof of Lemma 3
Note the form of the quantity being bounded, and recall from the proof of Lemma 2 the bound on $E\, \langle \phi_j, (C_n - C)\phi_k\rangle_{L^2}^2$. By Parseval's identity, the Cauchy-Schwarz inequality, and the same steps as in the proof of Lemma 2, we obtain the analogous bound. Note that the condition $2r(1 - 2\nu) > 1$ implies $\nu < \frac{1}{2} - \frac{1}{4r} < \frac{1}{2}$. We therefore have
$$\left(\frac{\lambda + \rho_j}{\rho_j}\right)^{2\nu - 1} \le 1.$$

Proof of Lemma 4
Recall the quantity being bounded. By applying Lemma 5, we obtain a bound in which $\beta$ is a constant; an application of Markov's inequality then completes the proof. For the integral to be finite, it suffices that $2r(2\nu_2 + 2\nu) \ge 1 + \delta$ for some $\delta > 0$, as the integrand then goes to zero faster than $y^{-(1+\delta)}$. The argument for the lower bound is analogous.