Functional wavelet regression for linear function-on-function models

1 Appendix B: Additional proofs of theorems
1.1 Proof of Theorem 4.2
1.2 Proof of Theorem 4.3
2 Appendix D: Proofs of technical lemmas
3 Appendix C: Additional simulation
4 Appendix D: Additional figures
References


Introduction
With the development of technology and the increasing availability of complex data, functional data analysis (FDA), where the primary unit of observation is a curve or, more generally, a function, has received much attention in recent years. For a general view of FDA, we refer the reader to Ramsay and Silverman [39], Ferraty and Vieu [11], Bosq [4], Horváth and Kokoszka [16], Hsing and Eubank [17]. Functional regression is a useful tool in FDA. Based on the type of variables, regression in FDA can be classified into three broad groups: scalar-on-function (scalar responses and functional predictors), function-on-scalar (functional responses and scalar predictors), and function-on-function (functional responses and functional predictors). Many methods have been developed for scalar-on-function and function-on-scalar regression models, including but not limited to Ramsay and Dalzell [38], Cardot et al. [6], Brown et al. [5], Ratcliffe et al. [41], Ramsay and Silverman [39], Reiss and Ogden [43], Marx and Eilers [28], James [20], Müller and Stadtmüller [32], Goldsmith et al. [13] for linear or generalized linear scalar-on-function regression models, James and Silverman [21], Li and Marx [24], Yao and Müller [53], McLean et al. [29] for non-linear scalar-on-function regression models, and Hart and Wehrly [15], Faraway [10], Guo et al. [14], Lin et al. [25], Morris and Carroll [31], Reiss et al. [42] for function-on-scalar regression models.
In this paper, we consider the following function-on-function linear regression model (at the population level),

y(t) = μ(t) + Σ_{q=1}^{Q} ∫_{S_q} β_q(t, s_q) z_q(s_q) ds_q + ε(t), t ∈ T, (1.1)

where S_q and T are finite intervals, z_1(s_1), z_2(s_2), · · · , z_Q(s_Q) are Q predictive curves, y(t) is the response curve, and β_1(t, s_1), β_2(t, s_2), · · · , β_Q(t, s_Q) are Q nonrandom integral kernel functions. No restrictions are imposed on the covariance function of the noise ε(t), so various within-function correlation structures in ε(t) are allowed. Some work has been done for the linear function-on-function regression model with one predictor curve (in this case, we drop the subscript q). Ramsay and Dalzell [38] first used piecewise Fourier bases for β(t, s), and Ramsay and Silverman [39] discussed the method of double expansion of the coefficient surface in basis functions, β(t, s) = Σ_k Σ_l b_{kl} η_k(t) θ_l(s). Yao et al. [54] and Wu and Müller [52] used the eigenfunctions of the covariance functions of z(s) and y(t) for the basis expansion. Ivanescu et al. [18] considered the model (1.1) with Q > 1 by representing the function-on-function regression model as a penalized additive model and then fitting the additive model. Scheipl et al. [46] extended the method in Ivanescu et al. [18] and proposed additive regression models for correlated functional responses, allowing functional random effects. Wang [51] developed a linear mixed function-on-function regression model and estimated parameters by maximizing the log likelihood via the ECME algorithm.
With its ability to extract local features of curves at different levels of resolution and the sparsity of the resulting coefficient vector, the wavelet transformation has been used in functional regression models. Brown et al. [5], Malloy et al. [27], Zhao et al. [55], Luo and Qi [26] considered scalar-on-function models. They applied wavelet transformations to the predictive curves and transformed the scalar-on-function models into regression models with scalar responses and high-dimensional multivariate predictors. Brown et al. [5] and Malloy et al. [27] built Bayesian models and regularized the regression coefficients by placing a spike-and-slab prior, a mixture of a normal distribution and a point mass at zero, on the coefficients. Zhao et al. [55] applied the LASSO [49] to select features and fit the high-dimensional regression model. Luo and Qi [26] proposed a penalized generalized eigenvalue problem to fit the high-dimensional regression model. Meyer et al. [30] considered function-on-function regression models with multilevel functional data. They transformed both the predictive and response curves to wavelet space and conducted variable selection in the Bayesian framework by assuming a spike-and-slab prior on the regression coefficients and vague proper priors on the variance components.
In this paper, we first apply the wavelet transformation to the functional predictors and transform the original function-on-function model (1.1) to a linear model with functional response and high-dimensional multivariate predictor variables. With n observations, the transformed model has the form Y(t) = μ(t)1_n + Xβ(t) + ε(t), where 1_n is the n-dimensional vector with all elements equal to one, X is an n × p matrix of wavelet coefficients of the original predictive curves, Y(t) = (y_1(t), · · · , y_n(t))^T is the vector of n observed response functions, and β(t) = (β_1(t), · · · , β_p(t))^T is the vector of p coefficient functions. For the transformed model, we propose a signal compression approach. Based on the best finite-dimensional approximation to the signal function Xβ(t), we establish an expansion of β(t) which has the form Σ_{j=1}^{∞} α_j w_j(t) and enjoys good prediction properties, where α_j is a p-dimensional vector and w_j(t) is a function. For any k, the truncated expansion Σ_{j=1}^{k} α_j w_j(t) has nearly the smallest prediction error among all k-dimensional estimates of the form Σ_{j=1}^{k} b_j v_j(t) for arbitrary b_j ∈ R^p and v_j(t). To estimate this expansion, we propose a penalized generalized eigenvalue problem to estimate α_j, followed by a penalized least squares problem to estimate w_j(t). We provide oracle inequalities for our estimates. Simulation studies in various settings for both one and multiple predictive curves demonstrate that our approach has good predictive performance and is efficient in dimension reduction.
The rest of the paper is organized as follows. In Section 2, we discuss the wavelet transformation of model (1.1). Then, for the transformed model, we introduce the signal compression approach in Section 3. The theoretical properties are provided in Section 4. Simulation studies in various settings and application studies are provided in Sections 5 and 6, respectively. We summarize the paper in Section 7 and provide proofs for two theorems in the Appendix. All other proofs, additional simulations, and additional figures are available on the authors' webpage.

Wavelet transformation
For notational convenience and without loss of generality, we assume that T = [0, 1] and S_q = [0, 1] for q = 1, . . . , Q in the model (1.1). For the general theory of wavelets and its application in statistics, we refer the reader to the books Daubechies et al. [8] and Nason [33]. Let ψ(x) denote a wavelet function (also called the mother wavelet) and φ(x) denote a scaling function (also called the father wavelet). They are chosen to have compact supports (i.e., they are equal to zero outside a bounded interval); see Chapters 5 and 6 in Daubechies et al. [8] and Sections 2.3 and 2.4 in Nason [33]. Let

φ_k(s) = φ(s − k), ψ_{j,k}(s) = 2^{j/2} ψ(2^j s − k),

where j = 0, 1, 2, · · · and k = 0, ±1, ±2, · · · . The two indices j and k represent the dilation and translation, respectively. All these functions together form a complete orthonormal basis of L²(R). A larger j indicates that the basis function ψ_{j,k}(s) has a smaller support interval and a finer resolution. We expand the predictive curves in the model (1.1) using this wavelet basis,

z_q(s) = Σ_k x_{qk} φ_k(s) + Σ_{j=0}^{∞} Σ_k x_{qjk} ψ_{jk}(s),

where x_{qk} = ∫_0^1 z_q(s) φ_k(s) ds and x_{qjk} = ∫_0^1 z_q(s) ψ_{jk}(s) ds are the wavelet coefficients. With this wavelet expansion, we have

∫_0^1 β_q(t, s) z_q(s) ds = Σ_k x_{qk} β_{qk}(t) + Σ_{j=0}^{∞} Σ_k x_{qjk} β_{qjk}(t),

where β_{qk}(t) = ∫_0^1 β_q(t, s) φ_k(s) ds and β_{qjk}(t) = ∫_0^1 β_q(t, s) ψ_{jk}(s) ds. Since β_q(t, s) is smooth, both β_{qk}(t) and β_{qjk}(t) are smooth functions. Because φ(s) has a bounded support, when the absolute value of k is large enough, we have φ(s − k) = 0 for all 0 ≤ s ≤ 1 and hence β_{qk}(t) = 0. Similarly, given j ≥ 0, β_{qjk}(t) is a nonzero function only for a finite number of k's. We define the sets K = {k : β_{qk}(t) is a nonzero function} and K_j = {k : β_{qjk}(t) is a nonzero function} for each j ≥ 0. Moreover, in this paper, we assume that β_{qjk}(t) = 0 for all j with 2^j large enough, for the following two reasons. First, if the support of ψ is contained in a bounded interval [a, b], then ψ_{jk}(s) is supported on an interval of length (b − a)/2^j, so x_{qjk} only depends on the values of z_q(s) on an interval shorter than (b − a)/2^j. In practice, the curves are all discretely observed.
When 2^j is sufficiently large so that (b − a)/2^j is less than the distance between adjacent observation points, we cannot calculate x_{qjk} or obtain information about x_{qjk} β_{qjk}(t). In this case, we can view the term x_{qjk} β_{qjk}(t) as random noise. Second, since we have assumed that β_q(t, s) is smooth, its wavelet coefficients have the l1 sparsity property in that the sum of the absolute values of these coefficients is relatively small. Moreover, β_{qjk}(t) decreases fast as j increases. Thus, the truncated expansion

Σ_k x_{qk} β_{qk}(t) + Σ_{j=0}^{M} Σ_k x_{qjk} β_{qjk}(t)

will be a good approximation to ∫_0^1 β_q(t, s) z_q(s) ds when 2^M is large. Based on these two reasons, we assume β_{qjk}(t) = 0 for all j > M, so there are only finitely many nonzero terms in this expansion. To simplify notation, in the following we use a single index to replace the triple index (q, j, k) and write the expansion as Σ_{j=1}^{p} x_j β_j(t), where p is the total number of nonzero terms. Then the model (1.1) becomes

y(t) = μ(t) + Σ_{j=1}^{p} x_j β_j(t) + ε(t).

Since p is typically large, the transformed model is a linear model with functional response and high-dimensional multivariate predictors.
Suppose that we have n independent observations {y_i(t), z_{iq}(s), 1 ≤ q ≤ Q}, 1 ≤ i ≤ n, from the model (1.1). In this paper, we consider the case where, for each 1 ≤ q ≤ Q, the sample curves z_{iq}(s), 1 ≤ i ≤ n, are observed at a common dense set of points in [0, 1]. The set of observation points can be different for different q. The response curves y_i(t), 1 ≤ i ≤ n, are also densely observed at a common set of points in [0, 1]. If for any 1 ≤ q ≤ Q, the observation points for z_{iq}(s) are equally spaced and the number of observation points is N_q = 2^{M_q} for some positive integer M_q, then we apply the discrete wavelet transformation (DWT) (Nason [33]), which converts the 2^{M_q}-dimensional vector of discrete observations of z_{iq}(s) to a 2^{M_q}-dimensional wavelet coefficient vector. In the simulations and applications of this paper, we use the package "wavethresh" (Nason [34]) of the R software (R Core Team [37]) and choose the default Daubechies least-asymmetric wavelet basis functions with filter number ten (Section 2.5.1 in Nason [33]). If the observation points of z_{iq}(s) are not equally spaced or the number of observation points N_q is not a power of 2, we first approximate z_{iq}(s) by a basis expansion using the method in Chapter 5 of Ramsay and Silverman [39]. The details are provided in Section 6. Then we use the values of the approximating function at 2^{M_q} equally spaced points as new observations and apply the DWT. To choose M_q, we note that if M_q is too small, we may lose information in the original discrete observations of z_{iq}(s); if M_q is too large, extra noise will be introduced. To make a balance, we choose M_q to be the smallest integer such that 2^{M_q} ≥ N_q. For the i-th observation, we concatenate the wavelet coefficient vectors for z_{i1}(s), · · · , z_{iQ}(s) into a single p-dimensional vector x_i, and let X = (x_1, · · · , x_n)^T be the n × p design matrix in the transformed model

Y(t) = μ(t)1_n + Xβ(t) + ε(t). (2.2)

In this paper, as in Bickel et al. [3], we will assume that X is a nonrandom matrix and the columns of X have mean zero.
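To make this preprocessing step concrete, the following sketch performs a full discrete wavelet transform on each discretely observed curve and assembles the centered coefficient matrix X. It is written in Python with the Haar wavelet for a dependency-free illustration; the paper itself uses the Daubechies least-asymmetric family via the R package "wavethresh", and the curves below are synthetic stand-ins.

```python
import numpy as np

def haar_dwt(z):
    """Full discrete Haar wavelet transform of a length-2^M vector.
    Returns the coarsest scaling coefficient followed by all detail
    coefficients, so the output has the same length as the input."""
    z = np.asarray(z, dtype=float)
    details = []
    while z.size > 1:
        a = (z[0::2] + z[1::2]) / np.sqrt(2.0)   # approximation coefficients
        d = (z[0::2] - z[1::2]) / np.sqrt(2.0)   # detail coefficients
        details.append(d)
        z = a
    return np.concatenate([z] + details[::-1])

# n = 5 synthetic curves observed at N = 2^7 = 128 equally spaced points.
rng = np.random.default_rng(0)
s = np.linspace(0.0, 1.0, 128)
curves = np.sin(2 * np.pi * s)[None, :] + 0.1 * rng.standard_normal((5, 128))

X = np.vstack([haar_dwt(c) for c in curves])   # n x p wavelet coefficient matrix
X = X - X.mean(axis=0)                         # centered columns, as assumed for X
```

Because the transform is orthonormal, each row of X has the same Euclidean norm as the corresponding discretized curve, and smooth curves yield many near-zero detail coefficients, which is the sparsity the method exploits.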
Throughout this paper, we use ‖·‖_{L²} to denote the L²-norm in L²[0, 1], and ‖·‖_1 and ‖·‖_2 to denote the l1- and l2-norms for vectors, respectively.

Expansion based on signal compression
Prediction based on model (2.2) is closely related to finding an efficient approximation to the signal Xβ(t). Since Xβ(t) is an n-dimensional vector of functions of t, we define an L² norm for a vector of functions. Let d be any integer and M(t) = (m_1(t), · · · , m_d(t))^T be a vector of functions. We define ‖M(t)‖² = Σ_{i=1}^{d} ‖m_i(t)‖²_{L²}. We will find a sequence of functions, w_1(t), w_2(t), · · · , in L²[0, 1] and a sequence of n-dimensional vectors t_1, t_2, · · · , such that for any k ≥ 1,

‖Xβ(t) − Σ_{j=1}^{k} t_j w_j(t)‖ = min ‖Xβ(t) − Σ_{j=1}^{k} r_j v_j(t)‖, (3.1)

where the minimum is taken over all possible functions {v_1(t), · · · , v_k(t)} in L²[0, 1] and all possible n-dimensional vectors {r_1, · · · , r_k}. So Σ_{j=1}^{k} t_j w_j(t) has the smallest approximation error among all k-dimensional approximations. To find these two sequences, we consider the generalized singular value decomposition (SVD) of Xβ(t),

Xβ(t) = Σ_{k=1}^{K} σ_k γ_k u_k(t), (3.2)

where σ_1 ≥ σ_2 ≥ · · · ≥ σ_K > 0 is the collection of all positive singular values of Xβ(t) (K is infinity if all σ_k > 0), and γ_k ∈ R^n and u_k(t) ∈ L²[0, 1] are the left-singular vector and the right-singular function corresponding to σ_k, with ‖γ_k‖_2 = 1 and ‖u_k(t)‖_{L²} = 1, respectively. Then {γ_1, · · · , γ_K} are orthogonal to each other, and so are {u_1(t), · · · , u_K(t)}. We have

α_k = (√n/σ_k) ∫_0^1 β(t) u_k(t) dt, w_k(t) = σ_k u_k(t)/√n, (3.3)

where α_k is a p-dimensional vector and the integral is coordinate-wise. Let

t_k = Xα_k = √n γ_k, 1 ≤ k ≤ K. (3.4)

Then by (3.3) and (3.4), Σ_{j=1}^{k} t_j w_j(t) = Σ_{j=1}^{k} σ_j γ_j u_j(t) is the best k-dimensional approximation to Xβ(t) for any 1 ≤ k ≤ K. Let

β_k(t) = α_1 w_1(t) + α_2 w_2(t) + · · · + α_k w_k(t), 1 ≤ k ≤ K. (3.5)

Then Σ_{k=1}^{K} α_k w_k(t) is an expansion of β(t), and β_k(t), k < K, is the truncated expansion. We will show in Theorem 4.1 that β_k(t) has nearly the smallest prediction error among all k-dimensional expansions.
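On a discrete time grid, the optimality in (3.1) is the Eckart-Young property of the matrix SVD: truncating after k terms leaves a squared error equal to the sum of the discarded squared singular values. A minimal numerical check (Python; the rank-3 matrix F is an illustrative stand-in for a discretized signal Xβ(t)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 30, 60                          # n curves evaluated at L grid points
t = np.linspace(0.0, 1.0, L)

# A rank-3 "signal" matrix standing in for a discretized X beta(t).
F = (np.outer(rng.standard_normal(n), np.sin(np.pi * t))
     + 0.5 * np.outer(rng.standard_normal(n), np.cos(2 * np.pi * t))
     + 0.1 * np.outer(rng.standard_normal(n), t))

# SVD: F = sum_k sigma_k gamma_k u_k; the first k terms give the best
# k-dimensional approximation in the (discretized) norm defined above.
U, sing, Vt = np.linalg.svd(F, full_matrices=False)
k = 2
F_k = U[:, :k] * sing[:k] @ Vt[:k, :]

# Eckart-Young: the squared error equals the sum of discarded sigma_k^2.
err2 = np.sum((F - F_k) ** 2)
```
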
As a dimension reduction tool, principal component analysis (PCA) has been applied to the wavelet coefficients X in Johnstone and Lu [22], Røislien and Winje [45], Meyer et al. [30]. This dimension reduction procedure only involves the predictor variables and does not depend on the coefficient functions. For our decomposition, by (3.3), w_k(t) is proportional to the k-th right-singular function in the SVD of Xβ(t), and hence it is also an eigenfunction of the sample covariance function of Xβ(t). So the proposed decomposition can be viewed as a PCA of the signal function Xβ(t), which not only depends on X but is also adaptive to the coefficient functions. The w_k(t)'s capture the major variations in the signal function. As we make no restrictions on the covariance function of ε(t), in general {w_k(t) : 1 ≤ k ≤ K} are not the eigenfunctions of the covariance function of Y(t). To estimate the decomposition (3.5), below we propose a penalized generalized eigenvalue problem to estimate α_k, and then estimate w_k(t) by solving a penalized least squares problem.

Estimation of α k
Define

S = X^T X/n, Ξ = (1/n) ∫_0^1 (Xβ(t))(Xβ(t))^T dt, B = (1/n) X^T Ξ X,

where the integrals are coordinate-wise. Then S is the p × p sample covariance matrix of X, and the p × p matrix B and the n × n matrix Ξ are symmetric and nonnegative definite. Consider the sequential maximization problem

max_α α^T B α, subject to α^T S α = 1 and α^T S α_l = 0 for 1 ≤ l ≤ k − 1. (3.7)

The following theorem provides a way to estimate α_k.
Moreover, the singular values in (3.2) satisfy σ_k = √(n μ_k(Ξ)), 1 ≤ k ≤ K, and the maximum value of (3.7) is equal to α_k^T B α_k = μ_k(Ξ) = σ_k²/n. (c) For any 1 ≤ k ≤ K, the approximation error of the best k-dimensional approximation to Xβ(t) is

‖Xβ(t) − Σ_{j=1}^{k} t_j w_j(t)‖² = n Σ_{l=k+1}^{K} μ_l(Ξ). (3.8)

By Theorem 3.1(b), μ_k(Ξ) = σ_k²/n can be viewed as a measure of the magnitude of the signal in the k-th component of the SVD (3.2) of Xβ(t). Through the SVD, the signal is compressed into the first few terms of the decomposition as much as possible, so we call our method the wavelet-based signal compression approach (wSigComp). By (3.8), even if K is not small, as long as μ_k(Ξ) decreases fast enough, Xβ(t) can be well approximated by the first few components.
To estimate α_k from samples, we define

B̂ = (1/n²) ∫_0^1 X^T (Y(t) − ȳ(t)1_n)(Y(t) − ȳ(t)1_n)^T X dt (3.9)

as an estimate of B, where ȳ(t) is the sample mean of y_1(t), · · · , y_n(t). In high-dimensional settings, the solution to (3.7) with B replaced by B̂ may not be a consistent estimate of α_k as both the sample size n and the dimension p of x_i go to infinity. We recall that β(t) is the collection of all wavelet coefficients of the smooth coefficient functions β_q(t, s_q), 1 ≤ q ≤ Q, in the original function-on-function regression model (1.1). Therefore, β(t) has the sparsity property in the l1 sense, which, together with (3.3), implies that α_k is a sparse vector for any 1 ≤ k ≤ K. So we propose a penalized generalized eigenvalue problem with a sparsity penalty to estimate α_k. We obtain the estimate α̂_k of α_k by solving

α̂_k = argmax_α α^T B̂ α / (α^T S α + τ‖α‖²_λ), subject to α^T S α̂_l = 0 for all 1 ≤ l ≤ k − 1, (3.10)

where ‖α‖²_λ = (1 − λ)‖α‖²_2 + λ‖α‖²_1, and both τ ≥ 0 and 0 ≤ λ < 1 are tuning parameters. In the penalty τ‖α‖²_λ, the l2 term is used to overcome the singularity problem of S and the l1 term encourages the sparsity of α̂_k. This penalty τ‖α‖²_λ was introduced in Qi et al. [36] for sparse principal component analysis and used in Qi et al. [35] for sparse regression and sparse discriminant analysis.
Suppose that {y_i(t), 1 ≤ i ≤ n} are observed at L common observation points 0 ≤ t_1 < t_2 < · · · < t_L ≤ 1, and approximate the integrals over [0, 1] by weighted sums, ∫_0^1 f(t) dt ≈ Σ_{l=1}^{L} δ_l f(t_l). There are several choices of the weights δ_l corresponding to different interpolation formulas. For example, for equally spaced observation points, we can choose δ_l = 1/L; for unequally spaced observation points, we can choose δ_l = (t_{l+1} − t_{l−1})/2 based on the trapezoidal formula, with the obvious one-sided modifications at l = 1 and l = L. We use these approximations to calculate the integrals in the expression (3.9) of B̂.
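These quadrature weights are straightforward to implement; a short Python sketch (the grid and the integrand are illustrative):

```python
import numpy as np

def quad_weights(t):
    """Trapezoid-style weights delta_l so that
    integral_0^1 f(t) dt ~ sum_l delta_l * f(t_l)."""
    t = np.asarray(t, dtype=float)
    d = np.empty_like(t)
    d[1:-1] = (t[2:] - t[:-2]) / 2.0      # interior points
    d[0] = (t[1] - t[0]) / 2.0            # one-sided end corrections
    d[-1] = (t[-1] - t[-2]) / 2.0
    return d

t = np.linspace(0.0, 1.0, 101)
w = quad_weights(t)
approx = np.sum(w * t**2)                 # integral of t^2 over [0,1] is 1/3
```

The weights sum to the interval length, and on a uniform grid the rule reduces to the ordinary trapezoidal rule with O(h²) error.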
To solve the penalized optimization problem (3.10), we first note that, due to its scale-invariance, (3.10) is equivalent to

max_α α^T B̂ α, subject to α^T S α + τ‖α‖²_λ ≤ 1 and α^T S α̂_l = 0 (3.11)

for all 1 ≤ l ≤ k − 1. The solution to (3.11) differs from that of (3.10) only by a scale factor. An algorithm to solve a more general optimization problem than (3.11) was proposed in Qi et al. [35]. We apply that algorithm to solve (3.11) and then scale the solution to obtain the estimate α̂_k.
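For intuition about the optimization, note that when λ = 0 the penalty is a pure ridge term and each step of (3.11) reduces to a generalized symmetric eigenproblem, solvable by a Cholesky reduction. A Python sketch under that simplifying assumption (B is a random nonnegative definite stand-in for the matrix built from the data):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 20, 50
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
S = X.T @ X / n

# Toy nonnegative definite B (rank 3), standing in for the data-driven matrix.
A = rng.standard_normal((p, 3))
B = A @ A.T

tau = 0.1
G = S + tau * np.eye(p)          # ridge-regularized constraint matrix (lambda = 0)

# Reduce  max a^T B a  s.t.  a^T G a = 1  to a standard eigenproblem:
# with G = L L^T, maximize v^T (L^-1 B L^-T) v over unit vectors v, a = L^-T v.
L = np.linalg.cholesky(G)
Linv = np.linalg.inv(L)
M = Linv @ B @ Linv.T
vals, vecs = np.linalg.eigh(M)   # ascending eigenvalues
alpha1 = Linv.T @ vecs[:, -1]    # leading generalized eigenvector
```

With the l1 part (λ > 0) the problem is no longer an eigenproblem, which is why the iterative algorithm of Qi et al. [35] is used instead; this sketch only illustrates the ridge-limit structure.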

Estimation of μ(t) and w k (t)
With the estimates α̂_1, · · · , α̂_K, we rewrite the model (2.2) as

Y(t) = μ(t)1_n + Σ_{k=1}^{K} t_k w_k(t) + ε(t), (3.12)

where t_1, · · · , t_K can be viewed as new predictors and w_1(t), · · · , w_K(t) are the coefficient functions. The true value of t_k is not observed but can be estimated by t̂_k = Xα̂_k, 1 ≤ k ≤ K. We propose to estimate μ(t) and W(t) = (w_1(t), · · · , w_K(t))^T by regressing Y(t) on t̂_1, · · · , t̂_K. It follows from the definition of w_k(t) in (3.3) and the smoothness of β(t) that {w_1(t), · · · , w_K(t)} are smooth functions. To take care of this property, we use the penalized least squares method for function-on-scalar regression in Chapter 13 of Ramsay and Silverman [39]. Specifically, the estimates μ̂(t) and ŵ_1(t), · · · , ŵ_K(t) minimize

(1/n) ∫_0^1 ‖Y(t) − μ(t)1_n − Σ_{k=1}^{K} t̂_k w_k(t)‖²_2 dt + η [∫_0^1 (μ″(t))² dt + Σ_{k=1}^{K} ∫_0^1 (w_k″(t))² dt], (3.13)

where the first term is the mean squared residual, the second term is the smoothness penalty, and η is the smoothing tuning parameter.
In a general function-on-scalar regression model, all the coefficient functions are estimated simultaneously by solving (3.13), which leads to a heavy computational load. In our situation, we can estimate μ(t) and w_1(t), · · · , w_K(t) separately, which improves the computational efficiency. As t̂_k = Xα̂_k, due to the constraints in (3.10), t̂_1, · · · , t̂_K are orthogonal to each other and ‖t̂_k‖²_2/n = 1. Moreover, since the column means of X are zero, t̂_k also has mean zero for any 1 ≤ k ≤ K. This means that t̂_1, · · · , t̂_K are also orthogonal to 1_n. Hence, the penalized least squares problem (3.13) can be decomposed into K + 1 sub-problems. Specifically,

μ̂(t) = argmin_μ (1/n) ∫_0^1 ‖Y(t) − μ(t)1_n‖²_2 dt + η ∫_0^1 (μ″(t))² dt, (3.14)

and for any 1 ≤ k ≤ K,

ŵ_k(t) = argmin_{w_k} (1/n) ∫_0^1 ‖Y(t) − t̂_k w_k(t)‖²_2 dt + η ∫_0^1 (w_k″(t))² dt. (3.15)

For any k, ŵ_k(t) only depends on t̂_k = Xα̂_k. Because α̂_k only depends on α̂_1, · · · , α̂_{k−1} and does not depend on α̂_{k+1}, α̂_{k+2}, · · · , the estimates α̂_k, ŵ_k(t) and β̂_k(t) do not depend on the choice of the number of components. This property helps to improve computational efficiency in practice.
To solve (3.14) and (3.15), we approximate the solutions by basis expansions.
Since we only observe Y(t) at t_1, · · · , t_L, we approximate (3.14) by the following problem,

ĥ = argmin_h (1/n) Σ_{l=1}^{L} δ_l ‖Y(t_l) − 1_n Φ(t_l)^T h‖²_2 + η h^T Ω h, (3.16)

where Φ(t) is the vector of basis functions, Ω = ∫_0^1 Φ″(t) Φ″(t)^T dt, and h is the expansion coefficient vector of μ(t). The problem (3.16) is a ridge-type least squares problem, which has the closed-form solution

ĥ = (Σ_{l=1}^{L} δ_l Φ(t_l) Φ(t_l)^T + η Ω)^{−1} Σ_{l=1}^{L} δ_l Φ(t_l) ȳ(t_l), (3.17)

and then μ̂(t) = Φ(t)^T ĥ. The sub-problems (3.15) are approximated and solved in the same way.
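A sketch of this ridge-type computation for the intercept function (Python; a small Fourier basis replaces the B-splines used in the paper, and the roughness matrix Ω is approximated by second differences on the grid — both are illustrative simplifications):

```python
import numpy as np

rng = np.random.default_rng(3)
L, n = 60, 40
t = np.linspace(0.0, 1.0, L)
Y = np.sin(2 * np.pi * t)[None, :] + 0.2 * rng.standard_normal((n, L))

# Basis matrix Phi evaluated on the grid (Fourier here for brevity).
Phi = np.column_stack([np.ones(L),
                       np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
                       np.sin(4 * np.pi * t), np.cos(4 * np.pi * t)])

# Roughness penalty Omega ~ integral Phi'' Phi''^T dt via second differences.
D2 = np.diff(Phi, n=2, axis=0) * (L - 1) ** 2
Omega = D2.T @ D2 / (L - 1)

delta = 1.0 / L                    # equally spaced quadrature weight
ybar = Y.mean(axis=0)              # only the mean curve enters the intercept fit
eta = 1e-6
h = np.linalg.solve(delta * (Phi.T @ Phi) + eta * Omega,
                    delta * Phi.T @ ybar)
mu_hat = Phi @ h                   # estimated intercept function on the grid
```

Each ŵ_k(t) is obtained from the same linear system with ȳ(t_l) replaced by the projection t̂_k^T Y(t_l)/n, so the whole smoothing step is a handful of small linear solves.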

Choice of the number of components and tuning parameters
We first consider the choice of (τ, λ). Usually one chooses two tuning parameters jointly over a two-dimensional grid. But in our situation, the two tuning parameters are not equally important. Theorem 4.2 in Section 4 implies that λ is not essential for the convergence rates in our theoretical results: we can choose λ to be any number as long as it is bounded away from zero, and it does not affect the convergence rates. On the other hand, in the penalty τ‖α‖²_λ = τ(1 − λ)‖α‖²_2 + τλ‖α‖²_1, the sparsity is mainly determined by the l1-norm part τλ‖α‖²_1. Roughly speaking, the effect of (τ, λ) on the sparsity of our estimates is mainly through the product τλ, so a small τ with a large λ has a similar effect as a large τ with a small λ. Hence, to improve computational efficiency, we do not consider all possible pairs of (τ, λ) in a two-dimensional grid. Instead, we select the parameters from a set of paired values in which λ increases with τ. Specifically, in the following simulation studies and applications, we choose the paired value of (τ, λ) from 7 pairs: (0.01, 0.01), (0.05, 0.01), (0.05, 0.05), (0.1, 0.05), (0.1, 0.1), (0.5, 0.1), (0.5, 0.2). The smoothness tuning parameter η is chosen from a prespecified grid of candidate values. For the i-th pair of (τ, λ), we first determine the maximum number of components K_i we need to calculate, 1 ≤ i ≤ 7; the optimal number of components will then be chosen between 1 and these maximum numbers. As discussed after Theorem 3.1, we only compute the first few components with large values of μ_k(Ξ) and stop when this value becomes small enough. On the other hand, by Theorem 3.1(a), the number of components cannot exceed min(n, p). Based on these considerations, we define K_i = min{n, p, k_i*}, where k_i* is the smallest k such that μ_k(Ξ)/(μ_1(Ξ) + · · · + μ_k(Ξ)) < 2%. That is, we stop solving the sequential problems (3.10) when the number of components reaches either n or p, or when the ratio between μ_k(Ξ) and the cumulative sum μ_1(Ξ) + · · · + μ_k(Ξ) drops below 2%.
Once we have determined the K i for all 1 ≤ i ≤ 7, we use the cross-validation method to determine the tuning parameters and the optimal number of components simultaneously. We summarize the details of the procedure in the following algorithm.
Let ē_{i₀j₀k₀} = min_{i,j,k} ē_{ijk}, where ē_{ijk} denotes the average cross-validated prediction error for the i-th pair of (τ, λ), the j-th value of η, and k components. Then the i₀-th pair of values for (τ, λ) and the j₀-th value for η are chosen, and the optimal number of components is k₀.
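The selection step itself is a plain argmin over the table of cross-validated errors; a Python sketch with a random array standing in for ē_{ijk}:

```python
import numpy as np

def cv_select(errors):
    """Given errors[i, j, k] = average CV prediction error for the i-th
    (tau, lambda) pair, j-th eta value, and (k+1) components, return the
    index triple (i0, j0, k0) minimizing the error."""
    return np.unravel_index(np.argmin(errors), errors.shape)

rng = np.random.default_rng(4)
ebar = rng.random((7, 5, 6))      # 7 (tau, lambda) pairs, 5 etas, up to 6 components
i0, j0, k0 = cv_select(ebar)
```

In practice the k-axis would be ragged (component counts up to K_i per pair); padding the unused cells with infinity keeps the same one-line argmin.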

Oracle inequalities in high-dimensional settings
Now we provide oracle inequalities for the estimates of α_k, w_k(t) and β_k(t), 1 ≤ k ≤ K, in high-dimensional settings. These oracle inequalities hold for any n, p and β(t) that satisfy the conditions in this section.
We introduce some notation. For any d × d symmetric and nonnegative definite matrix M, where d is any positive integer, we define two norms: the operator norm ‖M‖ = sup_{v ∈ R^d, ‖v‖_2 = 1} ‖Mv‖_2 = λ_max(M) and the max norm ‖M‖_∞ = max_{1≤k,l≤d} |M_{kl}|, where M_{kl} is the (k, l)-th entry of M and λ_max(M) denotes the largest eigenvalue of M. We adopt the notation in Bickel et al. [3]. For any p-dimensional vector a = (a_1, · · · , a_p)^T, let J(a) = {j ∈ {1, · · · , p} : a_j ≠ 0} denote the collection of indices of nonzero coordinates of a and M(a) = |J(a)| denote the number of nonzero coordinates of a, where |J(a)| is the cardinality of J(a). M(a) measures the sparsity of a. Similarly, we define J(β(t)) = {j ∈ {1, · · · , p} : β_j(t) is a nonzero function} and M(β(t)) = |J(β(t))|. It then follows from the definitions of α_k in (3.3) and β_k(t) in (3.5) that for any 1 ≤ k ≤ K, J(α_k) ⊆ J(β(t)) and hence M(α_k) ≤ M(β(t)). Before we provide the main results, we first show that β_k(t) has nearly the smallest prediction error among all k-dimensional estimates when n is large. Let x_new be the vector of wavelet coefficients of a new observation of the predictive curves, with covariance matrix Σ, and let y_new(t) = μ(t) + x_new^T β(t) + ε_new(t) be the corresponding new response. Theorem 4.1 shows that the prediction error of β_k(t) exceeds the minimum over all estimates of the form Σ_{j=1}^{k} b_j v_j(t), with arbitrary b_j ∈ R^p and v_j(t), by at most a term of order s ln(p)/n · ‖β(t)‖²_{L²}, where s = M(β(t)). Since β(t) is the collection of the wavelet coefficient functions of the smooth surfaces β_q(t, s), it is sparse. The term s ln(p)/n is the convergence rate of the Lasso and the Dantzig selector provided in Bickel et al. [3]. Since β(t) is sparse, s ln(p)/n · ‖β(t)‖²_{L²} is small when n and p are large. Thus, the prediction error of β_k(t) is close to the smallest one among all k-dimensional estimates. We assume that ‖S − Σ‖_∞ ≤ C√(ln(p)/n) because it has been shown (Equation (A14) in Bickel and Levina [2]) that √(ln(p)/n) is the order of the max norm of the difference between the sample covariance matrix and the population covariance matrix of a p-dimensional multivariate normal distribution.
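In code, J(a) and M(a) are simply the set of nonzero indices and its cardinality; a quick Python illustration with a hypothetical sparse vector:

```python
import numpy as np

a = np.array([0.0, 1.5, 0.0, -2.0, 0.3])   # illustrative sparse vector
J = np.flatnonzero(a)                       # J(a): indices of nonzero coordinates
M = J.size                                  # M(a) = |J(a)|: the sparsity of a
```
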
We state three regularity conditions for the main theorem. In the setting of large p and small n, an identifiability problem arises for the model (2.2). That is, there exists β̃(t) ≠ β(t) such that Xβ̃(t) = Xβ(t). Bickel et al. [3] imposed restricted eigenvalue assumptions on X. We make the same assumption below in Condition 1.
Condition 1. For s = M(β(t)),

min_{J₀ ⊆ {1,···,p}, |J₀| ≤ s} min_{δ ≠ 0, ‖δ_{J₀^c}‖_1 ≤ c‖δ_{J₀}‖_1} ‖Xδ‖_2 / (√n ‖δ_{J₀}‖_2) ≥ κ,

where c > 1 and κ > 0 are two constants, and δ_{J₀} and δ_{J₀^c} are the subvectors of δ consisting of the coordinates of δ with indices belonging to J₀ and J₀^c, respectively.
Although this assumption does not make the model identifiable among all possible coefficients, it does make the model identifiable among all sparse coefficients. In fact, under this assumption, for any two sparse p-dimensional vectors α and α̃ with sparsity M(α) ≤ s and M(α̃) ≤ s, if Xα = Xα̃, then α = α̃ (see the second remark after Theorem 7.3 in Bickel et al. [3]). Therefore, the model (2.2) is identifiable among all vectors sparser than or as sparse as β(t). That is, if Xβ̃(t) = Xβ(t) and M(β̃(t)) ≤ M(β(t)), then β̃(t) = β(t). The next regularity condition is on the distribution of the random noise function ε_i(t). Note that we do not make restrictions on the within-function covariance of ε_i(t); we allow ε_i(t′) and ε_i(t″) to be correlated for any 0 ≤ t′, t″ ≤ 1.

Condition 2. ε_1(t), · · · , ε_n(t) are i.i.d. mean-zero Gaussian random elements of L²[0, 1].

We define the median M_ε and the variance σ² for ε_i(t) as follows. M_ε is the median of the real-valued random variable ‖ε_i(t)‖_{L²}, and the variance is defined as (Section 3.1 in Ledoux and Talagrand [23])

σ² = sup_{‖u‖_{L²}=1} E[(∫_0^1 ε_i(t) u(t) dt)²]. (4.2)

Here ∫_0^1 ε_i(t) u(t) dt is the length of the projection of ε_i(t) onto the direction of u(t) and has a normal distribution. Therefore, (4.2) means that σ² is the maximum of the variances of the projections of ε_i(t) along all possible directions in L²[0, 1]. In our theoretical development, we need to estimate the tail probabilities of the norms of L²[0, 1]-valued Gaussian variables, which can be controlled by M_ε and σ² (Section 3.1 in Ledoux and Talagrand [23]).

Condition 3.
All the diagonal elements of S = X^T X/n are equal to 1, and all the positive eigenvalues μ_1(Ξ), · · · , μ_K(Ξ) of Ξ are distinct. Let c₂ denote the minimal gap between adjacent positive eigenvalues of Ξ. Bickel et al. [3] assumed that the diagonal elements of S = X^T X/n are equal to 1 when they derived the oracle inequalities for the Lasso and the Dantzig selector. This assumption can be satisfied by scaling X. The constant c₂ measures how well the eigenvalues of Ξ are separated.
We first provide upper bounds on the l1 norms (that is, the l1 sparsity) of α̂_k, 1 ≤ k ≤ K, and the oracle inequalities for them in the following theorem. Although in practice we use the same tuning parameters (τ, λ) in (3.10) for all components for computational efficiency, in our theoretical results we allow different tuning parameters for different components. We use (τ^{(k)}, λ^{(k)}) to denote the tuning parameters for the k-th component, 1 ≤ k ≤ K. Let γ̂_k = Zα̂_k = t̂_k/√n, which are estimates of γ_k = Zα_k = t_k/√n for 1 ≤ k ≤ K. Define ϵ = √(ln(p)/n) and recall that s = M(β(t)).

Theorem 4.2. Assume that Conditions 1-3 hold, where κ is the constant in Condition 1. Let the tuning parameters (τ^{(k)}, λ^{(k)}), 1 ≤ k ≤ K, satisfy conditions in which A^{(k)} and δ₀ are positive constants such that c^{−1} + δ₀ < 1, and c is the constant in Condition 1. Then there exists a constant ϵ₀ such that for all ϵ ≥ ϵ₀, the bounds for the first component hold. For the higher-order components (1 < k ≤ K), we further assume an eigenvalue separation condition with two constants c₄ and c₅. Then there exist constants ϵ₀ and D_{k,1}, D_{k,2}, D_{k,4}, depending only on δ₀, c, and c₂ to c₅, such that the corresponding bounds hold.
Although we have two tuning parameters, τ^{(k)} and λ^{(k)}, for each k, by Theorem 4.2, λ^{(k)} is not essential for the convergence rates: it can be any number in a subinterval of (c^{−1}, 1] and does not affect the rates. Next, based on Theorem 4.2, we provide the oracle inequalities for Ŵ(t), β̂_k(t), and Xβ̂_k(t), 1 ≤ k ≤ K. Bickel et al. [3] provided the oracle inequality for the coefficient vector under the l1-norm. We extend the l1 norm of a usual vector to a vector of functions: for any M(t) = (m_1(t), · · · , m_p(t))^T, we define the L_{1,2}-norm ‖M‖_{1,2} = Σ_{j=1}^{p} ‖m_j‖_{L²}. The L_{1,2}-norm is stronger than the L²-norm, that is, ‖M‖_{1,2} ≥ ‖M‖_{L²}. We provide the oracle inequalities for β̂_k(t) and β̂(t) based on the L_{1,2}-norm; in the bounds, L_{k,3} and L_{k,4} are constants depending only on A^{(j)}, 1 ≤ j ≤ k, c, and c₂ to c₅. In particular, when K₀ = K, the bounds simplify. Therefore, for any K₀ ≤ K, Σ_{k=1}^{K₀} Xα̂_k ŵ_k(t) is an estimate of the best K₀-dimensional approximation to Xβ(t). The upper bounds of ‖β̂(t) − β(t)‖_{1,2} and ‖Xβ̂(t) − Xβ(t)‖_{L²} in Theorem 4.3 are the same as those for the Lasso and the Dantzig selector [3], except for the constants.

Simulation studies
We study the performance of our method in the case of a single predictive curve in Section 5.1 and in the case of multiple predictive curves in Section 5.2. In all simulation studies, the predictive functions are defined on 0 ≤ s ≤ 2 with 128 equally spaced discrete observation points, and the response functions are defined on 0 ≤ t ≤ 1 with 60 equally spaced observation points.

Simulation 1: One predictive curve
In this section, we consider the model with one predictive curve and compare our method (wSigComp) with functional linear regression via the Principal Analysis by Conditional Estimation (PACE) algorithm (PACE-reg) of Yao et al. [54], the penalized function-on-function regression (pffr) of Ivanescu et al. [18], and its extension (pffr.pc) of Scheipl et al. [46]. Our method is implemented in R and uses 40 cubic B-spline basis functions as the basis Φ(t) ((3.16) and (3.17) in Section 3.3) for both μ(t) and w_k(t). Both pffr and pffr.pc are implemented in the R package "refund" (Crainiceanu et al. [7]); we use the default settings except that we use 40 basis functions for both of them. PACE-reg was downloaded from http://www.stat.ucdavis.edu/PACE/; it is implemented in MATLAB (The MathWorks [48]) and we use the default settings. We also consider the method linmod for functional linear regression by Ramsay and Silverman [39], implemented in the R package "fda" (J. O. Ramsay and Hooker [19]). We use 40 cubic B-spline basis functions for both s and t, and choose both smoothness parameters from {10^{-10}, 10^{-8}, 10^{-6}, 10^{-4}, 10^{-2}, 1}. As numerical problems frequently occurred due to the singularity of some matrices, we do not report results for linmod in this simulation, but we consider linmod in Section 6.1 for a real data set.
We consider two settings similar to those in Ivanescu et al. [18]. In the supplementary materials on our webpage, we provide additional simulations for the model with one predictive curve generated from Gaussian processes. In each setting, z(s) and β(t, s) are defined for 0 ≤ s ≤ 2 and 0 ≤ t ≤ 1, where the 40 random coefficients indexed by m, 1 ≤ m ≤ 40, are independent standard normal random variables. In both settings, the intercept function is β₀(t) = 2e^{−(t−1)²}, and the noise ε(t) is generated from the Gaussian process with covariance function Σ_ε(t, t′) = σ² ρ^{(10|t−t′|)²}. We fix σ² = 0.1 and choose the within-function correlation ρ = 0 or 0.7. When ρ = 0, ε(t) is Gaussian white noise. When ρ is larger, the within-function correlation in ε(t) is stronger and the sample noise curves are smoother. We consider all 4 combinations of the two types of (z(s), β(t, s)) and the two values of ρ. For each combination, we repeat the following procedure 50 times. In each repeat, we generate 100 discretely observed random samples and compute the mean squared prediction error (MSPE) (5.1). We report the averages and standard deviations of the MSPEs of the 50 replicates in Table 1. In general, the wSigComp method has the smallest prediction error. A stronger within-function correlation in ε(t) tends to increase the prediction error of all methods. wSigComp chooses 2 components in all cases, and PACE-reg chooses about 7 components for z(s) and 1 component for y(t). pffr.pc chooses about 8 and 15 components on average in the two settings, respectively. The averages and standard deviations of the running time for one repeat over the 50 replicates are also provided in Table 1, which shows that wSigComp is very computationally efficient.
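The MSPE itself is a plain grid average of squared prediction errors over test curves and observation points; a Python sketch with synthetic stand-ins for the test curves and their predictions:

```python
import numpy as np

def mspe(Y_hat, Y_test):
    """Mean squared prediction error for curves observed on a common
    grid: average of (yhat_i(t_l) - y_i(t_l))^2 over curves i and
    grid points l."""
    return np.mean((Y_hat - Y_test) ** 2)

# Synthetic stand-ins: 100 test curves on 60 grid points, predictions
# off by N(0, 0.1^2) noise, so the MSPE should be near 0.01.
rng = np.random.default_rng(5)
Y_test = rng.standard_normal((100, 60))
Y_hat = Y_test + 0.1 * rng.standard_normal((100, 60))
err = mspe(Y_hat, Y_test)
```
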

Multiple predictive curves
We consider two sets of simulation studies with multiple predictive curves. In both simulations, we generate the predictive curves from Gaussian processes and study the effects of the correlation between the multiple predictive curves, and of their smoothness, on predictive performance. We consider three types of Gaussian processes with different smoothness levels. Their covariance functions are given in (5.2). The first covariance function in (5.2) is the squared exponential covariance function, and the corresponding Gaussian process has mean square derivatives of all orders (Chapter 4 in Rasmussen and Williams [40]). The second belongs to the Matérn class, and the corresponding Gaussian process has a second-order mean square derivative. The last is the γ-exponential covariance function with γ = 1.5, and the corresponding Gaussian process is mean square continuous but not mean square differentiable. We plot sample curves for each of the three Gaussian processes in Figure 1. In both simulations, the predictive curves z 1 (s), · · · , z Q (s) are generated from the Gaussian process with one of the covariance functions in (5.2). We model the correlation between the curves z 1 (s), · · · , z Q (s) in the following way. Let S be the Q × Q matrix with the (i, j)-th entry equal to ρ curve if i ≠ j and 1 if i = j, where 0 ≤ ρ curve ≤ 1 controls the correlation between the predictive curves. We decompose S = ΔΔ T , where Δ is a Q × Q matrix. Given one of the covariance functions in (5.2), we generate Q independent curves u 1 (s), · · · , u Q (s) from the corresponding Gaussian process. Let (z 1 (s), z 2 (s), · · · , z Q (s)) = (u 1 (s), u 2 (s), · · · , u Q (s))Δ T .
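The construction above can be sketched as follows. The three kernels mirror the smoothness ordering in (5.2), but the length scales and the Matérn 5/2 form are illustrative assumptions on our part, not the paper's exact constants; the mixing step S = ΔΔ T uses a Cholesky factor for Δ.

```python
import numpy as np

def cov_matrix(s, kind):
    """Three stand-in kernels with decreasing smoothness: squared exponential
    (infinitely differentiable), Matern 5/2 (twice mean-square differentiable),
    and gamma-exponential with gamma = 1.5 (continuous, not differentiable)."""
    d = np.abs(s[:, None] - s[None, :])
    if kind == "sq_exp":
        return np.exp(-0.5 * d**2)
    if kind == "matern52":
        r = np.sqrt(5.0) * d
        return (1.0 + r + r**2 / 3.0) * np.exp(-r)
    if kind == "gamma_exp":
        return np.exp(-d**1.5)
    raise ValueError(kind)

def correlated_curves(s, Q, rho_curve, kind, rng=None):
    """Generate Q curves z_1, ..., z_Q with between-curve correlation rho_curve:
    draw Q independent GP curves u_q (columns of U), then mix with Delta,
    where S = Delta Delta^T has 1 on the diagonal and rho_curve elsewhere.
    Requires rho_curve < 1 so that S is positive definite."""
    rng = np.random.default_rng(rng)
    C = cov_matrix(s, kind) + 1e-8 * np.eye(len(s))
    U = np.linalg.cholesky(C) @ rng.standard_normal((len(s), Q))
    S = np.full((Q, Q), rho_curve) + (1.0 - rho_curve) * np.eye(Q)
    Delta = np.linalg.cholesky(S)
    return U @ Delta.T   # columns are z_1(s), ..., z_Q(s)
```

Setting rho_curve = 0 makes Δ the identity, so the z q (s) are the independent u q (s) themselves.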
For the coefficient surface functions, in one case we specify explicit expressions for β q (t, s), 1 ≤ q ≤ Q, with Q fixed. In another case, we evaluate our method for different numbers of predictive curves Q and various β q (t, s) with different roughness levels.
The coefficient surfaces β q (t, s), 1 ≤ q ≤ 4, are plotted in Figure 3. (z 1 (s), z 2 (s), · · · , z 4 (s)) is generated by (5.3) with ρ curve = 0 and 0.7, respectively. The noise ε(t) is generated in the same way as in Section 5.1. We consider two noise levels, σ 2 = 0.1 and 0.25, and fix ρ = 0. The averages and standard deviations of the MSPEs over 50 repeats for our method are listed in Table 2. The averages and standard deviations of the number of components chosen by our method are given in Table 3. Although there are four β q (t, s), our method chooses 3 components in most of the repeats and 2 components in the others. As an example, we choose one repeat with ρ curve = 0.7, σ 2 = 0.25 and z q (s) generated using Σ 1 , in which three components were chosen, to plot the estimates β̂ j (t, s), 1 ≤ j ≤ 4, in Figure 4, and ŵ j (t), 1 ≤ j ≤ 3, in Figure 5. The shape of ŵ 1 (t) implies that the most important variation in the signal function occurs in the middle of the interval [0, 1]. ŵ 2 (t) reflects the variation in the contrast between the values of the signal functions in the middle and those at the two ends. The third component accounts for only a very small proportion of the variation in the signal function.

Simulation 3:
In the previous simulations, Q is fixed and the coefficient surfaces β q (t, s), 1 ≤ q ≤ Q, have explicit expressions. In this section, we consider different numbers of predictive curves Q and test our method on various β q (t, s) with different roughness levels. We note that any coefficient surface function β(t, s) can be expanded in the form of a sum of products ζ i (t)ξ i (s), where the number of terms in the expansion can be finite or infinite. We first randomly generate a finite number of pairs {(ζ i (t), ξ i (s)), 1 ≤ i ≤ k} and obtain a coefficient surface equal to the sum of ζ i (t)ξ i (s) over 1 ≤ i ≤ k. In this way, we can obtain various coefficient functions and control their roughness by controlling the roughness of {(ζ i (t), ξ i (s))}.
Specifically, we generate each β q (t, s) as in (5.4), where ζ jq (t) and ξ jq (s) (1 ≤ j ≤ 3, 1 ≤ q ≤ Q) are independently generated from the Gaussian process with the same covariance function. We consider two covariance functions: Σ 1 and Σ 3 in (5.2). We plot examples of the coefficient surface functions generated in (5.4) using Σ 1 and Σ 3 , respectively, in Figure 6. We consider Q = 1, 5, 10, and 30. (z 1 (s), z 2 (s), · · · , z Q (s)) is generated from (5.3) with ρ curve = 0 or 0.7. The noise ε(t) is generated in the same way as in Simulation 1, and we fix σ 2 = 0.1 and ρ = 0. We list the averages and standard deviations of the MSPEs and of the number of selected components in Tables 4 and 5, respectively. The MSPE increases as Q increases because the model becomes more complicated. In this simulation, strong correlation between the multiple functional predictors helps predictive accuracy, especially for large Q. Generally, the prediction errors for rougher β q (t, s) are larger, but when Q = 30 the difference in prediction error between the two types of coefficient surface functions is not significant. We also observe that the number of selected components does not always increase with Q. Instead, the average number reaches its maximum at Q = 5 or Q = 10 and then remains almost unchanged or decreases slightly as Q increases further. A possible explanation comes from the bias-variance trade-off: selecting more components reduces bias but increases variance in prediction. Given the number Q of predictive curves, there is no great difference in the running time across settings. When ρ curve = 0, and z q (s) and β q (t, s) are both generated by Σ 3 , the
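A rank-3 random surface of the kind described in (5.4) can be generated as follows. The squared-exponential kernel and its length scale are illustrative assumptions (the paper controls roughness by switching between Σ 1 and Σ 3).

```python
import numpy as np

def random_surface(t, s, k=3, length_scale=0.2, rng=None):
    """A random rank-k coefficient surface beta(t, s) = sum_j zeta_j(t) xi_j(s),
    with zeta_j and xi_j drawn independently from a squared-exponential GP.
    length_scale is an illustrative assumption controlling roughness."""
    rng = np.random.default_rng(rng)

    def gp_draws(x, n):
        d = (x[:, None] - x[None, :]) / length_scale
        L = np.linalg.cholesky(np.exp(-0.5 * d**2) + 1e-8 * np.eye(len(x)))
        return L @ rng.standard_normal((len(x), n))   # each column is one GP curve

    Z = gp_draws(t, k)        # columns: zeta_1(t), ..., zeta_k(t)
    X = gp_draws(s, k)        # columns: xi_1(s), ..., xi_k(s)
    return Z @ X.T            # (len(t), len(s)) surface, rank at most k
```

A smaller length_scale produces rougher ζ and ξ curves and hence a rougher surface, mimicking the move from Σ 1 to Σ 3.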

Diffusion tensor imaging data
Table 5: The averages and standard deviations (in parentheses) of the number of components selected by our method in 50 repeats in Simulation 3, by the covariance function generating β q (t, s) and z q (s), by ρ curve , and by Q = 1, 5, 10, 30.

In the human brain, white matter tracts consist of axons that connect nerve cells and transmit information via electrical nerve impulses. Axons are surrounded by a white fatty insulation called myelin, which increases the speed of transmission of nerve signals. Changes in water diffusion in the brain could potentially be associated with demyelination, a condition of the nervous system in which the myelin sheath of neurons is damaged. Diffusion tensor imaging (DTI) tractography [44] is a magnetic resonance imaging technique that studies white-matter tracts by measuring the diffusivity of water in the brain: in white-matter tracts, water diffuses anisotropically (preferentially in the direction of the tract), while elsewhere water diffuses isotropically (Brownian motion). One of the diffusion measures is fractional anisotropy (FA), which takes values between zero and one. A value of zero means that diffusion is isotropic. A value of one means that diffusion occurs only along the direction of the white-matter tracts and is fully restricted along all other directions. Tievsky et al. [50], Song et al. [47], and Ivanescu et al. [18] have used FA as a proxy variable for demyelination of the white matter tracts, under the assumption that larger FA values are associated with less demyelination and fewer lesions.
We use the DTI data in the R package "refund", which consist of FA tract profiles for the corpus callosum (CCA) and the right corticospinal tract (RCTS) for 142 individuals at one or multiple visits. The individuals are either multiple sclerosis (MS) cases or controls (MS is a demyelinating autoimmune-mediated disease). Each profile contains two curves: the FA values along the CCA tract and along the RCTS tract from one individual at one visit. We call them the CCA curve and the RCTS curve, respectively. Using a similar data set, Goldsmith et al. [12] predicted multiple sclerosis cases and controls based on functional predictors, and Ivanescu et al. [18] built function-on-function regression models to study the spatial associations between functional predictors and responses. We consider the model with the RCTS curve as the predictive curve and the CCA curve as the response curve. To evaluate the predictive performance of our method, PACE-reg, linmod, pffr and pffr.pc on this data set, we randomly choose 200 observations as the training data and use the remaining observations as the test data. We repeat the following procedure 50 times. In each repeat, for PACE-reg, linmod, pffr and pffr.pc, we choose the tuning parameters and fit the model on the training set in exactly the same way as in the simulation study, and calculate the MSPE on the test set. For our method, since the number of observation points of the predictive curve z(s) is not a power of two, we first make basis expansions for the sample curves z i (s), 1 ≤ i ≤ n, using 40 cubic B-spline basis functions with equally spaced knots and the smoothing method with a roughness penalty in Chapter 5 of Ramsay and Silverman [39]. We choose the tuning parameter for this roughness penalty from the set {10 −10 , 10 −8 , 10 −6 , 10 −4 , 10 −2 , 1} by minimizing the generalized cross-validation (GCV) statistic, which is calculated using the smooth.basis function in the R package "fda" on the training data set.
Then we use the selected tuning parameter and the function smooth.basis to calculate the basis expansions for all the sample curves z i (s), 1 ≤ i ≤ 376. By the rule introduced in Section 2, we evaluate these basis expansions at 2 6 = 64 equally spaced points and perform the DWT to obtain 64-dimensional wavelet coefficient vectors x i , 1 ≤ i ≤ 376, as our predictors, using the R package "wavethresh" and the default Daubechies least-asymmetric wavelet basis functions with filter number ten. The most frequently chosen tuning parameters in the 50 repeats of our method are (τ, λ) = (0.01, 0.01), η = 10 −6 and K = 5. We fit a model using these tuning parameters and all 376 observations in this data set to obtain the estimates α̂ k and ŵ k (t), 1 ≤ k ≤ 5, where α̂ k is a 64-dimensional vector. To obtain the estimate β̂(t, s) of the coefficient surface β(t, s), we apply the inverse DWT to transform each α̂ k back to the function space and obtain a function ψ̂ k (s), 1 ≤ k ≤ 5. Then we have β̂(t, s) = ψ̂ 1 (s)ŵ 1 (t) + · · · + ψ̂ 5 (s)ŵ 5 (t). We plot ψ̂ k (s), ŵ k (t), and β̂(t, s) in Figure 8. For 1 ≤ k ≤ 5, ŵ k (t) is the estimate of w k (t), the k-th eigenfunction of the covariance function of the signal function. These five components account for about 83%, 7%, 3%, 2%, and 2% of the variation in the signal function, respectively. ŵ 1 (t) is positive throughout the tract and has two peaks, at 0.17 and 0.91 (relative distance along the CCA tract), implying that large variations in the signal part of the CCA curves exist around these two locations. For any z i (s), the predicted curve ŷ pred,i (t) is equal to the mean response curve plus a linear combination of the five functions ŵ 1 (t), · · · , ŵ 5 (t); see Figure 9, where we plot 30 centered predictive RCTS curves with the corresponding centered predicted CCA curves. We draw the predicted CCA curves above the mean CCA curve in blue and those below the mean in red.
One can see that the values of z(s) for 0.25 ≤ s ≤ 0.45 have a strong effect on the predicted CCA curves.
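The DWT step in the pipeline above (smooth each curve, evaluate it at 2^J points, transform to a 2^J-dimensional wavelet coefficient vector) can be sketched as follows. For a dependency-free illustration we use the Haar wavelet rather than the paper's Daubechies least-asymmetric filter with filter number ten (which wavethresh provides in R); the orthogonality of the transform is the same.

```python
import numpy as np

def haar_dwt(x):
    """Full discrete wavelet transform of a length-2^J signal using the Haar
    wavelet (a stand-in for the Daubechies least-asymmetric filter used via
    R's wavethresh in the paper). Returns an n-dimensional coefficient vector."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out, approx = [], x
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        out.append((even - odd) / np.sqrt(2.0))   # detail coefficients at this level
        approx = (even + odd) / np.sqrt(2.0)      # coarser approximation
    out.append(approx)                            # final scaling coefficient
    return np.concatenate(out[::-1])              # coarse-to-fine ordering
```

Because each level is an orthonormal rotation of pairs of values, the transform preserves the Euclidean norm of the sampled curve, which is why regression on the wavelet coefficients is equivalent to regression on the discretized curve.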

Daily air quality data
The Air Quality data, available in the UCI Machine Learning Repository (Bache and Lichman [1]), were recorded by an array of five metal oxide chemical sensors embedded in an air quality chemical multisensor device located at road level in a significantly polluted area within an Italian city. This data set contains the hourly averages of the concentration values of five different atmospheric pollutants in each day. The five pollutants are nitrogen dioxide (NO 2 ), carbon monoxide (CO), non-methane hydrocarbons (NMHC), total nitrogen oxides (NO x ), and benzene (C 6 H 6 ). In addition, the temperature (in degrees Celsius) and the relative humidity (in percent) were also recorded hourly in each day. For the details of the experiment, we refer the reader to De Vito et al. [9]. We view the 24 hourly averaged concentration values of each pollutant in each day as a discretely observed curve. Together with the temperature and relative humidity, we have seven functional variables. After the removal of missing values, we obtained 355 sample curves for each of the seven functional variables. We plot all the sample curves in Figure 10. For convenience, we scale the 24 observation time points to the interval [0, 1]. To study the relationship between the five pollutants, we investigate to what extent we can predict the daily curve of one pollutant from the other pollutants together with the temperature and relative humidity. We take the curve of NO 2 as the response and the other six curves as predictors. Since the number of observation points for all predictive curves is 24, which is not a power of two, we make basis expansions for the sample curves in the same way as in the analysis of the DTI data in Section 6.1. After obtaining the basis expansions, by the rule in Section 2, we evaluate each basis expansion at 2 5 = 32 equally spaced points and perform the DWT to obtain a 32-dimensional wavelet coefficient vector for each predictor curve.
Then we combine the six wavelet coefficient vectors for the six predictor curves into a 192-dimensional vector as our predictor. We randomly choose 200 observations as the training data and take the other 155 observations as the test data. After obtaining the fitted model from the training data, we apply it to the test data and compute the MSPE. In addition, for each repeat, we also calculate the average functional R 2 , which has been used in Meyer et al. [30] and can be approximately calculated as

R 2 ave = 1 − [ Σ i Σ j (Y train,i (t j ) − Ŷ i (t j )) 2 ] / [ Σ i Σ j (Y train,i (t j ) − Ȳ train (t j )) 2 ],

where Y train,i (t) is the i-th response curve in the training set, Ŷ i (t) is the corresponding predicted or fitted curve, Ȳ train (t) is the mean response curve in the training set, and 0 = t 1 < t 2 < · · · < t 24 = 1 are the 24 equally spaced observation time points. We repeat this procedure 50 times. The average of the MSPEs over the 50 iterations is 0.0089, with a standard deviation of 0.0008. The average of R 2 ave over the 50 iterations is 88.0%, with a standard deviation of 0.7%. wSigComp selects 6 components in most repeats and 5 or 7 components in the others.
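The average functional R 2 can be sketched as follows. This implements our hedged reading of the statistic from Meyer et al. [30]: integrals over t are approximated by sums over the observation grid, and the per-curve ratios are averaged; the paper's exact discretization may differ.

```python
import numpy as np

def functional_r2_ave(Y, Yhat, ybar_train):
    """Average functional R^2 over curves (hedged reading of Meyer et al.):
    for each curve, one minus the ratio of summed squared residuals to summed
    squared deviations from the training mean curve, then averaged.
    Y, Yhat: (n_curves, n_timepoints) arrays; ybar_train: (n_timepoints,)."""
    resid = ((Y - Yhat) ** 2).sum(axis=1)          # integrated squared residuals
    total = ((Y - ybar_train) ** 2).sum(axis=1)    # integrated squared deviations
    return float(np.mean(1.0 - resid / total))
```

Perfect prediction gives R 2 ave = 1, while predicting every curve by the training mean curve gives R 2 ave = 0, matching the usual R 2 interpretation curve by curve.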
Finally, we fit the model using all 355 observations to obtain the estimates α̂ kq and ŵ k (t), where α̂ kq is a 32-dimensional vector corresponding to the q-th functional predictor and the k-th component, 1 ≤ k ≤ 6 and 1 ≤ q ≤ 6. We apply the inverse DWT to each α̂ kq to obtain ψ̂ kq (s), and then β̂ q (t, s) = ψ̂ 1q (s)ŵ 1 (t) + · · · + ψ̂ 6q (s)ŵ 6 (t). We plot the ψ̂ kq (s)'s and ŵ k (t)'s in Figure 11. These six components account for about 83%, 8.5%, 3%, 2%, 1.5%, and 1% of the variation in the signal function, respectively. The first component accounts for most of the variation and is the most important. ŵ 1 (t) has a peak around 0.33, corresponding to eight o'clock in the morning, which implies that the predicted daily pollution level of NO 2 has a large variation around that time. The R 2 ave for this fitted model is 94.7%.

Discussion
We consider linear function-on-function regression models with multiple predictive curves. We first apply the wavelet transformation to the predictive curves and transform the original model into a linear model with a functional response and high-dimensional multivariate predictors. Based on the best finite-dimensional approximation to the signal part of the response curve, we find an expansion of the vector of coefficient functions that enjoys a good predictive property: for any k, the truncated expansion has nearly the smallest prediction error among all k-dimensional estimates. To estimate this expansion, we propose a penalized generalized eigenvalue problem followed by a penalized least squares problem. We provide sparse oracle inequalities for our estimates in high-dimensional settings. The choices of tuning parameters and of the number of components are discussed. Simulation studies and applications to two real data sets demonstrate that our method has good predictive performance and is efficient in dimension reduction.

eigenvalue σ 2 k (the squared singular value). By (3.6), ∫ 0 1 (Xβ(t))(Xβ(t)) T dt = nΞ. Therefore, γ k , 1 ≤ k ≤ K, are the first K eigenvectors of Ξ, and we have σ 2 k = nμ k (Ξ). Similarly, the right-singular functions u k (t), 1 ≤ k ≤ K, are the first K eigenfunctions of the covariance function (Xβ(t)) T (Xβ(s)) = β(t) T X T Xβ(s) with eigenvalues σ 2 k = nμ k (Ξ). We prove part (a) by induction. When k = 1, we first show that the maximum value of (3.7) is less than or equal to μ 1 (Ξ). Let α ∈ R p be a vector satisfying the constraint in (3.7) with k = 1, that is, α T Sα = 1. Define v = Zα. Then we have ‖v‖ 2 2 = α T Z T Zα = α T Sα = 1. By the definition of B in (3.6), it follows that the maximum value of (3.7) is not greater than μ 1 (Ξ). On the other hand, by (3.4), α 1 satisfies the constraint and attains this value. Hence, when k = 1, α 1 is the solution to (3.7) and the maximum value is μ 1 (Ξ) = σ 2 1 /n.
Now assume that for any 1 ≤ j < k, α j is the solution to (3.7) with k = j and the maximum value is α T j Bα j = μ j (Ξ). Based on this induction hypothesis, we prove that α k is the solution to (3.7) and the maximum value is α T k Bα k = μ k (Ξ). For any α ∈ R p satisfying the constraints in (3.7), that is, α T Sα = 1 and α T l Sα = 0 for all 1 ≤ l ≤ k − 1, let v = Zα. Then we have ‖v‖ 2 2 = α T Z T Zα = α T Sα = 1 and α T l Sα = α T l Z T Zα = γ T l v = 0 for all 1 ≤ l ≤ k − 1. Therefore, v is orthogonal to the first k − 1 eigenvectors of Ξ, and it follows that the maximum value of (3.7) is not greater than μ k (Ξ). On the other hand, α k satisfies the constraints and attains this value. Hence, α k is the solution to (3.7) and the maximum value is α T k Bα k = μ k (Ξ). By induction, part (a) holds for any 1 ≤ k ≤ K.
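The unpenalized version of problem (3.7), maximizing α T Bα subject to α T Sα = 1 and S-orthogonality to earlier solutions, can be checked numerically by reducing the generalized eigenproblem Bα = μSα to an ordinary symmetric one via a Cholesky factor of S. The function below is our illustrative sketch, not the paper's penalized estimator.

```python
import numpy as np

def generalized_top_eigs(B, S, k):
    """Top-k solutions of the generalized eigenproblem B a = mu S a
    (the unpenalized analogue of (3.7)): with S = L L^T, the substitution
    a = L^{-T} v turns it into the symmetric eigenproblem for L^{-1} B L^{-T}.
    Returns eigenvalues (descending) and columns a_1, ..., a_k with a^T S a = 1."""
    L = np.linalg.cholesky(S)
    Linv = np.linalg.inv(L)
    M = Linv @ B @ Linv.T                   # symmetric reduced problem
    vals, vecs = np.linalg.eigh(M)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]        # take the k largest
    alphas = Linv.T @ vecs[:, idx]          # back-substitute: a = L^{-T} v
    return vals[idx], alphas
```

Successive columns are automatically S-orthogonal, mirroring the constraint α T l Sα = 0 used in the induction step above.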

B.2. Proof of Theorem 4.1
First, we bound the first term, and similarly the second. Next, (x new ) T β(t) is a stochastic process on [0, 1] and its Karhunen-Loève expansion is given by (x new ) T β(t) = Σ ∞ k=1 Z k φ k (t), where φ k (t) is the k-th eigenfunction of the covariance function of (x new ) T β(t). It is well known that the truncated Karhunen-Loève expansion has the minimum mean integrated squared error. That is, for any k ≥ 1, the k-term truncation minimizes the mean integrated squared error, where the minimum is taken over all possible random variables Z̃ j and all possible nonrandom functions v j (t), 1 ≤ j ≤ k. Therefore, we obtain the corresponding bound on the minimum over b j ∈ R p and v j (t) ∈ L 2 [0, 1], 1 ≤ j ≤ k. On the other hand, by (B.5), we obtain a bound of order ln p / n, where the inequality in the third line from the last follows from the Cauchy-Schwarz inequality, and the first inequality in the last line holds because each coordinate function of Σ k j=1 b 0 j φ j (t) is the projection of the corresponding coordinate function of β(t) onto the space spanned by {φ i , 1 ≤ i ≤ k}. Similarly, a bound of the same order in ln p / n holds for the remaining term.
By (B.6), (B.7) and (B.8), we obtain (B.11), where the third inequality follows from the fact that Xβ k (t) is the best k-
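The Karhunen-Loève truncation optimality invoked above has a simple discrete analogue: the truncated SVD gives the best rank-k approximation in Frobenius norm (Eckart-Young), just as the k-term KL truncation minimizes the mean integrated squared error. The following sketch illustrates that fact numerically; it is an illustration of the principle, not part of the proof.

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation of A in Frobenius norm via truncated SVD,
    the discrete analogue of the truncated Karhunen-Loeve expansion."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

Any other rank-k matrix, for example a random product X Y^T, gives an approximation error at least as large, which is the matrix version of the minimization over {Z̃ j , v j (t)} in the proof.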