Identifiability in penalized function-on-function regression models

Regression models with functional responses and covariates constitute a powerful and increasingly important model class. However, regression with functional data poses well known and challenging problems of non-identifiability. This non-identifiability can manifest itself in arbitrarily large errors for coefficient surface estimates despite accurate predictions of the responses, thus invalidating substantial interpretations of the fitted models. We offer an accessible rephrasing of these identifiability issues in realistic applications of penalized linear function-on-function-regression and delimit the set of circumstances under which they are likely to occur in practice. Specifically, non-identifiability that persists under smoothness assumptions on the coefficient surface can occur if the functional covariate's empirical covariance has a kernel which overlaps that of the roughness penalty of the spline estimator. Extensive simulation studies validate the theoretical insights, explore the extent of the problem and allow us to evaluate their practical consequences under varying assumptions about the data generating processes. A case study illustrates the practical significance of the problem. Based on theoretical considerations and our empirical evaluation, we provide immediately applicable diagnostics for lack of identifiability and give recommendations for avoiding estimation artifacts in practice.


Introduction
The last two decades have seen rapid progress in regression methodology for high-dimensional data, largely driven by applications to genomic data in the "small n, large p" paradigm. In regression models with functional predictors, similar problems arise from the fact that covariate information comes in the shape of high-dimensional, strongly auto-correlated vectors of function evaluations. Whenever the number of regression parameters to estimate exceeds the number of observations, estimates are not unique and the resulting model is not identifiable. To overcome this lack of identifiability, it then becomes necessary to use heuristics or prior knowledge to impose additional structural constraints like sparsity or smoothness. Results then inherently depend -at least to some degree -on the assumptions underlying the chosen regularization method. In the following, we present a detailed analysis of the way in which smoothness assumptions combined with properties of the data generating process affect estimation results for function-on-function regression. We differentiate between two kinds of non-identifiability: First, simple non-identifiability arising from low information content in the functional data. This can be cured by imposing structural assumptions like smoothness or sparsity on the estimators, i.e., by regularization of the estimators. Fig. 1 shows an example for 4 estimates under different structural assumptions all yielding identical fits in such a scenario. Second, persistent non-identifiability that remains despite regularization for certain combinations of models and data. Figures 4 and  8 show examples for the latter on synthetic and real data, respectively.
The problem of -especially persistent -non-identifiability is as yet under-appreciated in the functional data literature and analyzed here in depth for the first time. As software capable of fitting increasingly complex models with functional data becomes available (e.g. fda, Ramsay et al. (2014); fda.usc, Febrero-Bande and Oviedo de la Fuente (2012); refund, Huang et al. (2015); PACE, Yang et al. (2012); WFMM, Herrick (2013)), investigating the practical relevance of identifiability issues arising in these models is both timely and important in this rapidly developing field. The present work aims to perform such an investigation for the model class described in  and implemented in refund's pffr function, while results carry over to other penalized function-on-function regression approaches such as those implemented in the fda package.
A popular approach in regression for functional data restricts the functional coefficients to lie in the span of the first K < n estimated eigenfunctions of a functional covariate's covariance operator with the largest eigenvalues, (see Cardot et al., 1999Cardot et al., , 2003Yao et al., 2005;Reiss and Ogden, 2007;Yao and Müller, 2010;Wu et al., 2010, for example). This functional principal component regression (FPCR) approach solves the problem of overparameterization (i.e., non-identifiability of the functional effect) by a -usually drastic -dimension reduction. The main challenges in this approach then become 1) achieving good estimates of the covariance's eigenfunctions ("functional principal components" or FPCs), eigenvalues, and FPC scores from observed functional data and 2) choosing the regularization parameter K. In practice, the effect of the functional covariate is estimated by using the first K estimated FPC scores as synthetic covariates. However, the critical assumption that the true coefficient lies in the span of the first few empirical eigenfunctions of a suitable (cross-)covariance operator estimate is impossible to verify empirically. Due to the often wiggly and unsmooth nature of eigenfunctions of real data this assumption can also lead to estimates that are difficult to interpret or implausible to practitioners.
An alternative approach is to make assumptions on the functional coefficients informed by insights into the problem at hand, e.g., sparsity or smoothness of functional coefficients, and to estimate these functional coefficients subject to an appropriate penalty (e.g., LASSO or smoothness penalties). In this work, we will focus on smooth spline-based penalized regression models for functional responses with functional covariates as described in Ivanescu et al. (2015) and , which constitute a powerful and flexible model class able to deal with the wealth of functional data increasingly collected in many fields of science. Nevertheless, our considerations carry over to other approaches to estimate smooth coefficient functions, such as approaches using derivative-based penalties as advocated by Ramsay and Silverman (2005). This paper describes the data settings in which penalized models remain unidentifiable despite the penalty in Section 3 and develops and evaluates suitable diagnostics and modified penalties for such settings.
Identifiability issues in functional regression have previously been discussed in Cardot et al. (2003) in the context of functional regression models for scalar responses and also, briefly, in the context of models with both functional responses and functional predictors by He et al. (2000), Chiou et al. (2004) and Prchal and Sarda (2007). While results therein provide conditions for the theoretical existence and unicity of solutions based on functional analysis arguments, they do not yield empirically verifiable criteria to determine whether the conditions for unicity are violated for a given data set. They also always assume that the true coefficient surface lies in the space spanned by the eigenfunctions of a (cross-)covariance operator. As far as we are aware, case studies in the previous literature have implicitly assumed that this assumption and the necessary conditions based on it will be satisfied for observed data. This is problematic since 1) the theoretical conditions found in the previous literature are impossible to satisfy, or at least verify, on finite samples of functional data in finite resolution, and 2) our experience with applications of functional regression models as well as results from simulation studies indicate that persistent nonidentifiability leading to spurious coefficient estimates may occur quite regularly. This is obviously a concern for applied statisticians desiring interpretable regression models associating functional covariates and (functional) responses.
Instead of relying on the functional analysis arguments suitable for investigations of asymptotic properties of the theoretical model, we use simple linear algebra to derive a condition for unicity of coefficient surface estimates in realistic, finite sample data settings in which functional covariates are observed with finite resolution in Section 3. This allows us to give a necessary and sufficient condition for persistent non-identifiability in penalized function-on-function regression that is empirically verifiable and thus applicable in realistic problems. The criterion is based on the amount of overlap between the kernel of the penalty matrix and the kernel of the design matrix for the functional effect. Simulation studies indicate that, in practice, severe errors due to non-identifiability are strongly associated with our criterion; thus, this criterion is the first one that can be used in a wide variety of applications to assess identifiability or lack thereof. Our analyses also indicate that many widely used preprocessing techniques for functional data which replace observed curves with spline-based or FPC-based low-rank approximations (i.e., pre-smoothing) or the centering of individual observed curves will considerably increase the likelihood of identifiability issues in many settings.    We use the well-known dti data set -publicly available in R-package refund (Huang et al., 2015) -to illustrate the issues we discuss here. These data contain spatially indexed, i.e., functional, measurements of fractional anisotropy (FA), a proxy variable for neuronal health, along 3 cerebral white matter tracts (WMTs) of multiple sclerosis (MS) patients. To illustrate how strongly different structural assumptions about the regression coefficient surface can affect the results, we fit simple univariate functional linear models E(Y (t)) = β 0 (t) + X(s)β(s, t)ds regressing FA along the right cortico-spinal WMT (RCST-FA, Y (t)) on pre-smoothed FA along the corpus callosum WMT (CCA-FA, X(s)). Figure 1 shows estimated coefficient surfacesβ(s, t) for penalized spline based fits with second or first order difference penalties (two leftmost panels) or a ridge penalty (third panel from left), as well as the coefficient surface implied by a functional principal component (FPC) based fit (right). Even though the surfaces are quite different, they result in (practically) identical fitted values in this example for a setting with simple non-identifiability. The first and third panels from the left in the top row of Figure 7 show the data used in this example. Graphical examples of the second kind of non-identifiability that persists even under penalization are shown in Figures 4 and 8 for synthetic and real data, respectively.
Our paper is structured as follows: Section 2 defines the model and data structure under discussion. We present an accessible rephrasing of the fundamental issue (Section 3) and derive necessary and sufficient conditions for settings in which β(s, t) is or is not identifiable in Section 3.2. Section 4 follows up with an analysis of the scope of the problem based on simulated data, while Section 5 describes a real-world example of the issue. The main conclusions we draw from our analysis are that the complexity of observed functional covariates puts hard limits on the identifiability of coefficient surfaces in a number of ways, that these limits can be diagnosed based on the data at hand, and that pre-processing of functional covariates can often exacerbate identifiability issues.

Model and Data Structure
In the following, bold symbols denote vectors and matrices, and calligraphic letters denote function domains, function spaces, or sets of functions or vectors.
Define a simple function-on-function regression model as where Y i (t) and X i (s), i = 1, . . . , N , are functional responses and covariates on closed intervals T and S in R, respectively, and assume that they are realizations of zero-mean square integrable stochastic processes Y (t) ∈ L 2 [T ] and X(s) ∈ L 2 [S] with continuous covariance functions, respectively. To simplify notation and exposition but without loss of generality, we assume that E(Y i (t)) ≡ 0 and E(X i (s)) ≡ 0. Propositions 3.1 to 3.3 in Section 3 directly carry over to any model with an additive predictor that includes terms like S X i (s)β(s, t)ds. The further development in Section 3 leading up to Proposition 3.4 assumes that the estimate for β(s, t) minimizes a (penalized) quadratic loss function, which is equivalent to maximizing the likelihood of a model with i.i.d. Gaussian errors ε it , but does not strictly speaking depend on distributional assumptions about ε it and is also very similar to the system of equations solved in each iteration of the penalized iteratively re-weighted least squares algorithm (P-IWLS; Wood, 2000) used to fit additive models like (1) for non-Gaussian responses. Due to the assumptions on the functional covariates, they can be represented by a Karhunen-Loève expansion with orthonormal φ m (s), S φ m (s)φ m (s)ds = δ mm , and uncorrelated zero-mean FPC scores ξ im with variances ν 1 ≥ ν 2 ≥ · · · ≥ 0, m ∈ N. The ν m and φ m (s), m = 1, . . . , M, are the ordered eigenvalues and eigenfunctions of the covariance operator K X of X(s), respectively, with the covariance function given by Mercer's theorem as Since estimating β(s, t) is an inverse problem, some kind of regularization is required. Functional principal component based approaches like, for example, Yao et al. (2005) restrict β(s, t) to lie in the span of the first K estimated eigenfunctionsφ m (s), m = 1, . . . , K for all t. The number of eigenfunctions K that is used serves as the (discrete) regularization parameter. In contrast, we will discuss and analyze a penalized approach. The underlying assumption is that β(s, t) is a smooth function that can be well represented as a linear combination of suitable basis functions defined on S × T .
In practice, functional responses Y i (t) and functional covariates X i (s) are observed on grid points s i = (s i1 , . . . , s iS i ) and t i = (t i1 , . . . , t iT i ). For simplicity, we assume those to be identical vectors s, t with lengths S and T , respectively, for each observation i. In the following, expressions a(s) or a(t) with a bold argument denote the vector of evaluations of a(·) on the respective grid, e.g., a(s) = (a(s 1 ), . . . , a(s S )) .
Model (1) can then be approximated for observed data as with β(s, t) = [β(s j , t k )] j=1,...,S k=1,...,T and ε i = (ε it 1 , . . . , ε it T ) . We also define a weight vector w for numerical integration, e.g. w = (w j ) j=1,...,S for simple quadrature via Riemann sums, with w j the length of the sub-interval of S represented by s j . The symbol · denotes element-wise multiplication. The coefficient surface β(s, t) is represented using a tensor product spline basis with basis matrices B s and B t of K s and K t basis functions evaluated in s and t, respectively, and spline coefficient matrix Θ. The roughness penalty matrix for the surface is given by P ≡ P (λ s , λ t ) = λ s (P s ⊗ I Kt ) + λ t (I Ks ⊗ P t ) (Wood, 2006), where λ s , λ t are smoothing parameters to be estimated from the data and P s and P t are the fixed and known marginal penalty matrices for the s-and t-directions, respectively. Estimation and inference is described in more detail in Ivanescu et al. (2015) and . In the following, our considerations are not limited to simple models such as model (1), but carry over to more general modelsỸ The additive predictor η i (t) represents the sum of other terms in the model such as a global functional intercept β 0 (t), index-varying linear or smooth effects of scalar covariates x like x i β(t) or f (x i , t), scalar or functional random effects, etc.  contains methods and applied examples for this class of flexible additive functional regression models. Of course, these more general models may suffer from additional identifiability problems caused by collinearity or concurvity of the terms in the additive predictor which are outside the scope of this paper. While we focus our discussion on a spline-based approach, the representation in (5) also accommodates other choices of basis functions and penalties.

Identifiability
In this section, we discuss potential sources of non-identifiability in model (1). The first subsection restates some known results on these issues for the theoretical model (1) with truly functional observations X i (s) and Y i (t). Subsection 3.2 then discusses identifiability for the finite resolution vector data available in practice.

Identifiability in the theoretical model
It is well known (e.g. Prchal and Sarda (2007, c.f. p. 5), He et al. (2000, Th. 4.3. c)) that coefficient surfaces are identifiable only up to the addition of functions in the kernel of K X , i.e., if β(·, t) fulfills model (1), so does β(·, t) + β K (·, t) for any β K (·, t) with S k X (s, v)β K (v, t)dv = 0 for all s, t. Thus, we have identifiability only when the kernel is trivial.
An important secondary consequence is that predicted responses S X i (s)β(s, t)ds can be entirely unaffected by large changes in β(s, t). Thus, strategies for detection of identifiability problems cannot be based on predictive performance in cross-validation, bootstrapping and related methods.
Non-identifiability in Proposition 3.1 occurs when ke(K X ) is non-trivial, when the eigenfunctions in (2) with non-zero eigenvalues ν m do not span the L 2 [S]. While it is possible to assume a trivial kernel in theory (e.g. Prchal and Sarda, 2007, equation (4)), in practice, functional covariates are observed on a finite number of grid points S and the empirical covariance for N observations thus can have at most min(N, S) non-zero eigenvalues. As is exploited in functional principal component analysis (e.g. Ramsay and Silverman, 2005), functional observations are often simple enough to be represented accurately by a relatively small number of eigenfunctions, with eigenvalues of higher order small compared to noise or measurement error. It is also wide-spread practice to use pre-smoothed versions of observed functional covariates as inputs for models like (1) (e.g. James, 2002;Ramsay and Silverman, 2005), and these will have a non-empty kernel since they are represented as linear combinations of a limited number of basis functions. Basis function representations of X(s) are also used when sparsely or incompletely observed functional covariates have to be imputed on a grid of s-values to be used as inputs for model (1) (e.g. Goldsmith et al., 2011).
In the following section, we thus investigate identifiability problems for finite-sample finite-resolution functional data and the interplay between the rank of the observed covariance, the rank of the basis used to represent β(s, t) in s-direction and the penalty used in the penalized estimation approach for β(s, t) introduced in Section 2.

Identifiability in practice
Rank-deficient design matrix In practice, β(s, t) is represented as a linear combination of a finite number K s K t of basis functions, see (5). For the following, we will assume that the corresponding approximation error is negligible and that a suitably flexible basis has been chosen for β(s, t). Combining (4) and (5), we can write the model as In the linear regression model (7) for y = vec(Y ), the parameter vector θ = vec(Θ) is identifiable if and only if the design matrix D = B t ⊗ (XW B s ) is of full column-rank. The rank of D is equal to rank(D) = rank(B t ) rank(XW B s ). B t will typically be of full rank K t as long as K t ≤ T , as the K t spline functions form a basis and the columns of B t are thus linearly independent for non-pathological cases. For X, let X = ΞΦ be the empirical version of the Karhunen-Loève expansion (2), where X X = Φ ΛΦ with Φ an S × M orthonormal matrix of eigenvectors, M = rank(X), Λ = diag(λ 1 , . . . ,λ M ) a diagonal matrix of ordered positive eigenvalues and Ξ containing M columns of estimated scores with empirical variancesλ 1 , . . . ,λ M . Then, by construction, the matrix XW B s is at most of rank min(N, M, K s , S) = min(M, K s ), since M ≤ min(N, S).
We then have the following proposition: Proposition 3.2. Assume that B t is of full rank K t . Then, the design matrix D = B t ⊗ (XW B s ) in model (7) is rank-deficient if and only if Proposition 3.2 yields a direct criterion to check for rank-deficient design matrices. Case a) corresponds to a low-rank covariance for the X-process. In this case, the functional predictor does not carry enough information, as measured by the number of eigenfunctions φ m (s) with non-zero eigenvalues, compared to the number of parameters to estimate. Case b) means that even if M ≥ K s , non-identifiability can occur if the span of the basis used for β(s, t) in s-direction contains functions in ke(K X ), as measured by numerical integration using the integration weights w. More intuitively, this means that the basis for the s-direction of β(s, t) accommodates modes of variation orthogonal to those of the X(s)-process.
Note that identifiability is determined by the interplay between the complexity of the X(s) and the coefficient basis for β(s, t). Thus, more data will not necessarily resolve identifiability issues: A finer grid s will only eliminate identifiability problems present on a coarser grid if there is sufficient small-scale structure in X(s) that is also present in the basis used for β(s, t) in s-direction. Increasing the sample size N will likewise eliminate problems with identifiability only if the low-rank of the covariate's covariance is due to small sample size.

The effect of the penalty
In cases of non-identifiability, the best we can hope for is partial identifiability of the parameters in a parameter subspace, i.e., identifiability under additional assumptions on the parameters. In this vein, functional principal component regression (e.g. Yao et al., 2005) restricts β(s, t) to lie in the span of the first K eigenfunctions φ m (s), m = 1, . . . , K, for all t. Remaining problems then include the fact that theφ m (s) are estimated quantities in practice, with corresponding measurement error, and the choice of K, which can strongly affect the shape of the resulting function estimate (Crainiceanu et al., 2009). Also, this approach couples assumptions on the shape of β(s, t) to properties of the space spanned by the retained φ m (s). In particular, their smoothness determines that of β(s, t), and inclusion of higher-order eigenfunctions often leads to wiggly surface estimates that are hard to interpret and unstable under replication.
Here, we focus on a penalized approach assuming smoothness of β(s, t) and investigate the effects of the penalty on identifiability. While the use of a penalized approach is well-known to avoid identifiability problems due to high correlation between observations at neighboring grid points (e.g. Ramsay and Silverman, 2005, Ch. 15.2), the full interplay between penalty and identifiability is, we believe, not fully understood and underappreciated.
Consider again the design matrix the singular value decompositions of B t and D s := XW B s , respectively. Let indices + and 0 denote the corresponding sub-matrices obtained by removing columns and/or rows corresponding to zero and non-zero singular values, respectively. We assume in the following that B t is of full rank K t . Then, Thus, for any given θ with Dθ =: f , there exists a linear subspace If we assume our parameter function to come from a space of smooth functions, we can select the smoothest solution on a given hyperplane H f by minimizing θ P θ for a suitable penalty matrix P that penalizes roughness of the function parameterized by θ. Simple non-identifiability occurs if this minimum is unique, persistent non-identifiability occurs if it is not. We have the following proposition regarding uniqueness of the corresponding minimum.
Proposition 3.3. Let P = λ s (I Kt ⊗ P s ) + λ t (P t ⊗ I Ks ), with P s and P t positive semidefinite matrices. Assume that B t is of full rank K t , that rank(P t ) < K t and that λ s > 0, λ t ≥ 0. Then, for any f ∈ im(D) there is a unique minimum min θ∈H f {θ P θ} if and The assumption that P t is of less-than-full rank is natural in the context of derivativebased penalties and excludes cases like the ridge penalty P t = I Kt , which would have the same effect as a full-rank penalty P s = I Ks , even in cases of a kernel overlap between D s D s and P s . A potentially full-rank P t would change the 'if and only if' in Propositions 3.3 and 3.4 below to 'if'. Proposition 3.3 shows that in the case of a kernel overlap ke(D s D s ) ∩ ke(P s ) = {0}, the additional side condition θ P θ → min does not yield a unique smoothest point on the hyperplane and the model remains unidentifiable. On the other hand, if there is no kernel overlap, there is a unique smoothest point θ f ∈ H f and this unique point has the form of a projection of θ along the hyperplane. Note that for the ridge penalty P = λI KsKt , one obtains the projection onto the image im(D), which sets the part in the kernel of D D to zero. More generally, a smoothness penalty P generates a projection that may have a non-zero component in the kernel of D D if this yields a smaller overall penalty value. In this case of no kernel overlap, we thus have a weak form of identifiability, which guarantees that there is a unique smoothest representative on any hyperplane of parameters giving the same conditional distribution for Y .
This characterization, which only requires checking of design matrix and penalty in sdirection and not for the full model, carries over to the penalized maximum likelihood or least squares estimation problem Proposition 3.4. Assume that B t is of full rank K t , that rank(P t ) < K t and that λ s > 0, λ t ≥ 0. Then, there is a unique penalized least squares solution for (8) if and only if Proposition 3.4 gives a criterion for the uniqueness of the penalized least squares solution. We can show how the penalty achieves this uniqueness by writing (8) as a nested minimization problem. Here, the outer minimization finds the f = Dθ with optimal fit to the data, and the inner minimization minimizes the penalty term over H f to obtain the smoothest solution for a given level of residual variation.
where θ f = HU + v + for a given f is uniquely defined as in (11), with v + = Σ −1 As V + Σ + is a matrix of full column rank d, this minimization problem has a unique solution that, for given λ s , λ t , balances the fit to the data and smoothness.
To summarize, in the case of no kernel overlap, i.e., ke(D s D s )∩ke(P s ) = {0}, we obtain a weaker form of identifiability even when the design matrix D is not of full rank, which guarantees that there is a unique smoothest representative on any hyperplane of parameters giving the same conditional distribution for y. Then, there will also be a unique solution to the penalized estimation problem, which is the smoothest representative on the hyperplane of possible solutions with equally good fit.
In practice, λ s and λ t are estimated from the data. We here do not investigate the more complex case when λ s , λ t are not fixed. It should also be noted that (D D + λ s (I Kt ⊗ P s ) + λ t (P t ⊗ I Ks )) may still be close to singular even in cases of no kernel overlap if smoothing parameters are very small, with corresponding reduced stability in estimation.

Diagnostics, practical recommendations and countermeasures
In order to safeguard against misleading coefficient estimates in practical applications of functional regression, it is necessary 1) to develop empirical criteria for diagnosing problematic data settings in which the coefficient function is not identifiable based on the available data, or where only the penalty ensures unique estimates, 2) to avoid preprocessing protocols and tuning parameters that increase the likelihood of identifiability issues and 3) to develop improved algorithms for estimating function-on-function effects that are less prone to severely misleading estimates in problematic settings.
Diagnostics We are interested in identifying settings with simple non-identifiability in which only the penalty term guarantees the existence of a unique solution. Following Proposition 3.2, the most direct approach to do so is to compute the condition number of D s D s = (XW B s ) T XW B s and choose a suitable cut-off (10 6 , in the following) for numeric rank deficiency.
In addition, propositions 3.3 and 3.4 indicate that a measure of the degree of overlap between the spans of ke(D s D s ) and ke(P s ) can be used to detect persistent non-identifiability that remains despite the penalization. In our empirical evaluation of such measures, we found that a measure for the distance between the spans of two matrices introduced in Larsson and Villani (2001), when modified for our setting, showed the most promise as the resulting measure is free of tuning parameters and can be computed quickly from the data.
In particular, we modify the original definition of Larsson-Villani in order to accommodate two matrices of unequal column numbers. We then define the amount of overlap LV between the span of two matrices Here, V Z is a matrix containing the left singular vectors of the matrix Z and is thus an orthogonal matrix spanning the same column space as Z, Z ∈ {A, B}. It is easy to see that this measure is symmetric, LV (A, B) = LV (B, A). Similarly to Theorem 2 in Larsson and Villani (2001), one can also show that LV (A, B) ∈ [0, min(p A , p B )], with the overlap assuming its maximum of min( To measure the degree of overlap between the kernels of D s D s and of P s , we could use LV ((D s D s ) ⊥ , P s⊥ ), as the span of (D s D s ) ⊥ corresponds to the kernel of D s D s and the span of P s⊥ corresponds to the kernel of P s . In the following, we will however use as this choice obtained slightly better sensitivity and specificity for the detection of problematic settings in our simulations in Section 4. Moreover, one can easily show that such that the two formulations address the same question. For W = I S , this measure has the interpretation of the overlap between the empirical null-space of the observed X(s) process or ke(K X ) and P s⊥ , the space of functions not penalized by the penalty defined by P s , evaluated on the grid given by s. It can be determined quickly and accurately before the model is fit. Problematic cases are indicated by overlap measures ≥ 1, as this is indicative of an at least one-dimensional sub-space of functions contained in the kernel overlap.
Practical recommendations The theoretical results suggest several recommendations for pre-processing of functional covariates and choice of the penalty in practice.
1. Pre-smoothing of functional covariates is commonly done to remove measurement error and/or obtain functions on a common grid (e.g. James, 2002;Ramsay and Silverman, 2005;Goldsmith et al., 2011). If the resulting (effective) rank of the smoothed covariate process drops below K s , this will lead to models that are only identifiable through the penalty term. We thus recommend to use a sufficiently large number of FPCs and/or spline basis functions if such pre-processing is required. As a peculiar consequence of this point it may be preferable in some cases to accept a small amount of measurement error-induced attenuation inβ(s, t) based on noisy, unprocessed X(s) in order to avoid a potentially much larger non-identifiability-induced error inβ(s, t) based on low-rank, pre-processed X(s).
2. Curve-wise centering of functional covariates such that S l=1 X i (s l ) = 0 for all i is sometimes used e.g. in the context of spectroscopy data to remove the optical offset (c.f. Fuchs et al., 2015). Then, constant functions lie in the kernel of K X , ke(K X ). This is not recommended if a penalty is used that does not penalize constant functions (as most difference or derivative-based penalties do) to avoid non-identifiability.
3. Penalties with larger null-spaces increase the likelihood of a kernel overlap and resulting non-identifiability problems. For difference or derivative-based penalties, for example, a penalty penalizing deviations from constant functions (first order differences or derivatives) would thus be preferable in this sense to higher-order differences/derivatives. Constant coefficient functions, which then span the penalty nullspace, correspond to models with the mean over the functions as covariate -which are often used by practitioners -and thus also lend themselves to intuitive interpretations. In particular, for penalties where constant coefficient functions span the penalty nullspace, it is straightforward to see that only X-processes that are centered curve-wise will result in a kernel overlap (unless smoothing parameters are estimated to be very small). Unless curve-wise centering is performed as discussed in 2., such processes will typically occur rarely and using first-order difference/derivative penalties should thus guard against many if not most serious identifiability issues in practice.
Countermeasures The third point above suggests modifications of the penalty null-space to avoid non-identifiability caused by potential overlap between the penalty null-space P s⊥ and ke(K X ). Alternatively, the estimated coefficient surface can be constrained to be orthogonal to functions in the overlap of P s⊥ and ke(K X ). We describe three approaches using penalties with empty null-spaces and one constraint-based approach; a systematic comparison of their performance is given in Section 4.2.
1. The simplest approach is the use of a simple ridge penalty P s = I Ks . However, the resulting estimates will typically not have good smoothness properties as the ridge penalty is not a roughness penalty in the conventional sense and its bias towards small absolute values of β(s, t) may increase estimation error, cf. Figures 1, 4 and 5.
2. A second approach uses a modified marginal penalty matrix without null-space along the lines of the so-called "shrinkage approach" described in Marra and Wood (2011) and originally developed for the purpose of variable selection in generalized additive models. Marra and Wood (2011) replace the marginal penalty P s = Γ diag(ρ 1 , . . . , ρ Ks )Γ T , with eigenvectors contained in Γ, eigenvalues ρ 1 , . . . , ρ Ks and rankK s < K s , by a full rank marginal penaltỹ This substitutes the zero eigenvalues ρK s+1 , . . . , ρ Ks with ρK s for all k =K s + 1, . . . , K s using 0 < 1, and thereby adds a small amount of penalization to parameter vectors in the null space of the original penalty. In the following, we will refer to such a modified penalty as a full-rank penalty. By imposing a small degree of regularization on functions in P s⊥ one can in principle preserve the attractive smoothing properties of the original penalty while still avoiding large artifacts due to non-identifiability resulting from a kernel overlap. We use = 0.1 as suggested in Marra and Wood (2011), but results show that this choice is not always effective in removing artifacts if the estimated smoothing parameter is very small and overall penalization of the fit is weak.
3. In a similar effort to avoid spurious estimates in scalar-on-function regression, James and Silverman (2005, their eq. (16)) suggested using the empirical FPCs of X(s) scaled by their inverse eigenvalues as a penalty. This penalizes coefficient functions with large variability in directions in which X(s) varies very little or not at all (i.e., in ke(K X )). We adapted this approach to our function-on-function setting by replacing the conventional difference operator based B-spline penalty matrix with (ν −1/2 mφm (s)β(s, t 0 )) 2 ds suggested by James and Silverman (2005). The empiricalφ m (s) andν m have to be estimated from a singular value decomposition of X. It is unclear, however, how to compute the inverse eigenvalues if X is of low rank, i.e., if some of theν m are (numerically) zero, which is of course precisely the setting in which this penalty might yield more stable estimates. In our experiments, we tried replacement of the zero eigenvalues with the smallest non-zero eigenvalue and a variety of other replacement schemes, but results from this approach seem to be fairly sensitive to both the chosen replacement scheme for zero eigenvalues and to the estimated FPCs themselves.
4. Instead of the augmented or alternative penalization schemes for components of the coefficient surface that lie in the overlap ke(K X ) ∩ P s⊥ , we can also directly constrain these components to be zero. As these components are not estimable from the data, such constraints are a plausible default but need to be taken into account for the interpretation of the estimated surface. To implement them, we compute a basis of the overlap and constrain the coefficient surface evaluated on the observed grid to be orthogonal to vectors in the span of that basis. Let X ⊥ = (X ) ⊥ and the S × K s matrix B s containing the marginal basis functions over S evaluated on s. A basis spanning the overlap is then defined by the left singular vectors V Cs+ of the matrix X ⊥ (X ⊥ ) X ⊥ −1 (X ⊥ ) diag(w)B s P s⊥ that have positive singular values.
For intuition, consider that if w = 1, the expression above projects B s P s⊥ , i.e., a basis for P s⊥ , into the span of X ⊥ , i.e, a basis for ke(K X ). This yields a basis for the intersection of the spans of B s P s⊥ and X ⊥ . The expression above is cheap to compute in a numerically stable way using the QR decomposition of X ⊥ . We then estimate Θ under the constraints V Cs+ diag(w)B s Θ ! = 0. To give an example, the constrained coefficient surface estimate for a curve-wise centered functional covariate that does not contain any constant components will be centered around zero and not be offset in any direction if the penalty null-space includes constant functions. Note that, different from the augmented or alternative penalties described above, these constraints are empty if ke(K X ) ∩ P s⊥ is empty and penalties with constraints then reduce to the usual penalties without constraints. The constraint definition above is quite general and can be used for any basis with a quadratic penalty, it is not limited to the difference penalties we focus on. Specifically, it will also be applicable for derivative based penalties as well as some of the PDE-based penalties introduced by Ramsay et al. (2007) whenever identifiability issues arise. By default, the pffr function uses the diagnostic criterion (9) combined with a check for a rankdeficient design matrix to determine the presence of persistent non-identifiability. If persistent non-identifiability is found, a corresponding warning is issued and the constraints developed here are applied to the fit. Interpretation of the constrained coefficient surface estimates requires careful attention. For example, if constrained surface estimates are centered around zero because ke(K X ) ∩ P s⊥ contains constant functions, the signs of different regions ofβ(s, t) are not interpretable due to the estimated absolute level ofβ(s, t) being essentially arbitrary.

Simulation study
This section presents results on the practical consequences of the theoretical development in Section 3. Specifically, we investigate the performance of the tensor product splinebased approach given in Section 2 on artificial data of varying complexity and noise levels in terms of estimation accuracy and use the simulation results to validate the diagnostics for problematic settings we have developed. Subsequently, we present results for the modified full-rank penalties of Marra and Wood (2011), the FPC-based penalty of James and Silverman (2005) and the constrained estimates described in Section 3.3. All models were fitted with the pffr() function available in the refund package, which estimates the smoothing parameters using restricted maximum likelihood (REML).

Simulation setup
We simulate data from data generating process (1), with n = 50 subjects, T = S = [0, 1] and S = 100 grid-points for X i (s) = M m=1 ξ im φ m (s). The effect surface β(s, t) is estimated using tensor product cubic B-splines. We set K t = 10 and use a marginal first order difference penalty for the t-direction. Test runs showed that results are insensitive to the number of grid-points for the response; we used T = 50 grid-points for Y (t). Twenty replications were simulated for all sensible combinations of the following parameters (144000 replicates in total). For the fitting algorithm, we vary • the marginal roughness penalties for the spline coefficients: either second order difference penalties ("∆ 2 ") or first order difference penalties ("∆ 1 "). For the second order differences, coefficient vectors in the penalty's null-space ke(P (λ s , λ t )) parameterize surfaces that are constant or linear in both directions. For the first order differences, coefficient vectors in ke(P (λ s , λ t )) parameterize constant surfaces. Note that rank deficiencies can increase if associated smoothing parameters become sufficiently small. In that case, the penalty null-space effectively becomes the entire span of the associated basis functions. • and the number of basis functions over S: K s ∈ {5, 8, 12}.
For the data generating process, we vary the following parameters: • number of eigenfunctions for the X(s)-trajectories with non-zero eigenvalues: M ∈ {3, 5, 8, 12, 20}. This means we have settings with M ≤ K s and M > K s for most M . Note that the effective numerical rank of a simulated X can be (much) lower than M depending on the speed with which the eigenvalues decrease.
• signal-to-noise ratio: ). Some of these processes are constructed so that their covariance operator kernels ke(K X ) include functions in the penalty null-space P s⊥ .
-Poly: eigenfunctions are orthogonal polynomials of degree 0 to M − 1 with linear or exponentially decreasing eigenvalues (Poly,Lin and Poly,Exp, respectively). For Poly, ke(K X ) is disjunct from P s⊥ , since the first and second eigenfunctions are constant and linear polynomials. . , M +1} so that ke(K X ) overlaps the null-space of the second differences penalty but not the first differences penalty. All three processes are associated with linearly decreasing ν m . From top to bottom, these processes become increasingly more "antagonistic" in the sense that 1) the kernels of these eigenfunction systems move increasingly closer to the kernels of the penalties we consider and 2) more quickly decreasing eigenvalues result in lower effective rank of the observed X(s).
-The marginal B-spline bases B s and B t have either 4 or 8 basis functions for each direction. λ s = λ t are either 0.1 or 1. This generates coefficient surfaces of varying complexity and roughness. We do not fit models where the basis used to generate β(s, t) is larger than that used to estimate β(s, t) since that could introduce a distracting approximation error not relevant to the issues at hand.

Results
Identifiability The estimation accuracy forŶ (t) (not shown) is excellent across the board even for the very noisy settings, with 90% of relative integrated mean square errors below 0.01 and a median of 0.00051. Estimation accuracy forβ(s, t), however, varies wildly over a range of 18 magnitudes between 1.3 × 10 −8 and 2.2 × 10 10 . Further analysis shows that the simulation study design succeeds in creating the identifiability issues described by the results in Section 3.2. To quantify the severity of identifiability issues, we compute rank correlations between rIMSE β and rIMSE Y over the 20 replicates of each simulation setting. As expected, we observe low or even negative correlations mostly for settings in which D s is rank-deficient. This effect increases both for lower signal-to-noise ratios and for more complex true shapes of β(s, t). For intuition, consider that the "best" solution for (8) for any given error will be the smoothest surface (i.e., the one with the smallest penalty term) in the set of surfaces that can be generated by adding functions from ke(K X ) to any initial β(s, t) with the given error. This may be quite close or quite far from the true β(s, t), depending on the specific setting, with more noisy data and more complex true shapes more likely to result in fits that are quite far from the truth and still producing good model fit.
Estimation performance for β(s, t) Figure 2 shows the estimation errors for coefficient surfaces generated with SNR = 10. The right column shows results for numerically rank deficient D s (i.e., condition number κ(D s D s ) ≥ 10 6 ), the left column for designs with κ(D s D s ) < 10 6 . The top row shows results for first differences penalty, bottom row for second differences penalty. Boxplots are grouped by the amount of overlap between ke(K X ) and P s⊥ as computed by X ⊥ P s⊥ (see (9)), color-coded for the different processes the X(s)-trajectories are sampled from. Results for SNR = 2 and 10 3 were qualitatively very similar -errors obviously become larger for noisier data but the pattern shown in Figure 2 remains the same. Note that relative estimation errors below ≈ 0.01 correspond to estimates that are visually indistinguishable from the true surfaces, and that errors below ≈ 0.1 (thick black horizontal line) usually preserve most essential features of the true β(s, t) well. Results with rIMSE β > 1 bear little resemblance to the "true" function. Also recall that all of these fits, with rIMSE β values varying by more than 14 magnitudes, resulted in a comparatively small range of rIMSE Y between 10 −5 and 10 −3 . Closer inspection of results shows that the extremely large errors for Poly(1+), Poly(-1) and Poly(2+) are caused by the expected behavior: estimates are shifted by functions from the overlap of ke(K X ) and P s⊥ . The top row of Figure 4 shows an example of this behavior: the estimate for first order difference penalty is shifted by a large constant, while the estimate for the second order difference penalty is shifted by both a constant and a huge linear trend in s-direction. The fitted values of all models shown in Figure 4 are practically identical.  Figure 2: Boxplots for relative integrated mean square error rIMSE β for all 18000 results for SNR = 10 with conventional difference penalties. Columns show results for full rank D s versus numerically rank deficient D s . Rows show results for the first and second order difference penalties. Boxplots grouped by overlap between ke(K X ) and P s⊥ as computed by X ⊥ P s⊥ , color-coded for the different processes the X(s) are sampled from. Vertical axis on log 10 -scale, extreme errors > 10 5 for Poly(2+) are cut off.
These results mostly corroborate the results derived in Section 3 -we see that: • Serious errors rIMSE β > 0.1 are rare for both full-rank and rank-deficient D s if the generating process for X(s) is not antagonistic in the sense that ke(K X ) ∩ P s⊥ = {0}, i.e. for the Poly, Fourier, Wiener and BrownBridge processes.
• As long as ke(K X ) ∩ P s⊥ = {0} (approximated numerically by the criterion that X ⊥ P s⊥ < 0.95), regularization of the estimated coefficient surface allows us to achieve good estimates even if the unpenalized regression model per se would not be identifiable due to rank deficiency of D s .
• The larger P s⊥ (top to bottom), and the closer ke(K X ) is to P s⊥ (left to right in each group of boxplots), the larger the likelihood of estimates far from the truth and the larger the average rIMSE β .
Diagnostics As in Figure 2, we use X ⊥ P s⊥ (see (9)) as a measure of the overlap between ke(K X ) and P s⊥ . We consider a replicate to be "flagged" as problematic if both X ⊥ P s⊥ ≥ 0.95 and κ(D T s , D s ) > 10 6 . Figure 3 shows a mosaic plot of the contingency table of "flagged'" replicates and categorized rIMSE β . While the sensitivity for identifying replicates with rIMSE β > 1 is 0.92, the specificity is only 0.28. The sensitivity for identifying replicates with rIMSE β > 0.1 is 0.58, the specificity is 0.09. These fairly low specificities indicate that the penalized approach to function-on-function regression discussed here can outperform theoretical expectations and frequently finds good solutions even in very difficult settings. Total accuracy for identifying settings with rIMSE β > 0.1 is 0.88 and 0.94 for rIMSE β > 1. The positive predictive value (precision) of the criterion for rIMSE β > 1 is 0.72, while it is 0.91 for rIMSE β > 0.1. The negative predictive value of the criterion for rIMSE β > 1 is 0.99, while it is 0.87 for rIMSE β > 0.1. Also note that certainly not all errors in the (0.1, 1]-range are due to identifiability issues, so not all of the (rare) non-detections are failures of the criterion.
Performance of modified penalties We broadened the scope of our simulation study by additionally comparing the performance of the non-standard penalties and constraints introduced in Section 3.3: • a full-rank ridge penalty ("∆ 0 "), • the modified full-rank roughness penalties as suggested by Marra and Wood (2011); in our case we used both full-rank first order differences penalties ("∆ 1 ") and full-rank second order differences penalties ("∆ 2 "), • the FPC-based penalty of James and Silverman (2005) ("ke(K X ) (FAME)"). We replaceν m by max(ν m , 10 −10ν 1 ) in order to remove any (numerically) zero or negative eigenvalues, • and conventional first and second order differences penalties with additional constraints ("∆ 1 + C", "∆ 2 + C") that enforce orthogonality of the estimated coefficient surface to functions in ke(K X ) ∩ P s⊥ if the data are flagged as problematic by our diagnostic criterion (κ(D T s D s ) ≥ 10 6 and X ⊥ P s⊥ > .95). Note that ∆ 1 + C and ∆ 2 + C reduce to ∆ 1 and ∆ 2 , respectively, if ke(K X ) ∩ P s⊥ = ∅ and need not be compared to the other techniques in that case. Figure 5 shows the rIMSE β for SNR ε = 10 for the different X(s)-processes and penalties. Note that, in contrast to Figure 2, colors now represent the different penalties, not the different X(s)-processes. Boxplots for ∆ 1 and ∆ 2 contain the same results as those shown in Figure 2.
The full-rank difference penalties∆ 1 and∆ 2 (cyan and turquoise boxes) seem to drastically reduce the size and likelihood of severe estimation errors in the difficult settings, especially compared to ∆ 2 (dark blue). We occasionally incur slightly worse estimates for the easier settings, but these differences are small and hardly relevant here. However, we have found that neither∆ 1 nor∆ 2 with = .1 are effective in removing non-identifiability artifacts if the estimated smoothing parameter is very small, i.e., if the effect of the penalty on the fit is weak. This explains the large outliers observed for some replicates for the antagonistic X-processes (top row) for∆ 1 and∆ 2 . Figure 8 (bottom row, leftmost two panels) shows this for the dti data set, where the estimated smoothing parameters are small. Note that no such outliers occur for the ridge (∆ 0 ) and FAME penalties (green boxes). However, both penalties are not competitive for numerically rank deficient D s for the non-pathological X(s)-processes in the bottom two rows and perform worse than the (modified) difference penalties in all other settings, and so cannot be recommended for general use as well. Furthermore, in the simulated settings we use, the true β(s, t) are centered around 0, which reduces the negative effect of these penalties' bias towards small absolute values of β(s, t). Top row, left to right: first order difference penalty, second order difference penalty, ridge penalty, FAME penalty. Bottom row, left to right: full-rank first and second order difference penalties, first and second order difference penalties with constraints. Note the different z-axis scales in the two left panels of the top row. Subtitles give rIMSE β and rIMSE Y for each fit. Figure 6 shows results only for simulated data sets flagged as problematic by our diagnostic criterion for ∆ 1 or ∆ 2 . In such settings, it makes sense to use difference penalties combined with "orthogonal-to-null-space-overlap" constraints on the fitted surface, these results are denoted by ∆ 1 + C or ∆ 2 + C (violet and purple), respectively.

Poly(2+)
Poly ( Figure 4 shows illustrative exemplary fits for the modified penalties and penalties with constraints in a data setting with κ(D T s D s ) ≥ 10 6 . in the two rightmost panel of the top row and the panels in the bottom row. While rIMSE Y is very similar for all 8 penalties except FAME (top right), the shapes of the corresponding coefficient surface estimates differ from each other in the expected fashion: The ridge penalty's bias towards small |β(s, t)| and tendency to under-smooth the estimated surface is visible, as is the typical wiggliness of FAME fits which are also not roughness penalized in the conventional sense. Compared to the result for ∆ 1 , the fit for∆ 1 is much closer to the level of the true β(s, t) and adds a much smaller (spurious) constant to the fit. Similar remarks apply for the comparison between ∆ 2 and∆ 2 : The huge spurious linear trend in s is almost completely removed. The overlap criterion X ⊥ P s⊥ is ≈ 1 for ∆ 1 , ≈ 2 for ∆ 2 and 0 by construction for all others. Both constrained fits (∆ 1 + C, ∆ 2 + C, two bottom right panels) completely suppress the spurious constant (and trend) present in the unconstrained fits (two top right panels), and ∆ 2 + C yields a slightly smoother fit than ∆ 1 + C as expected.
Note that the true β(s, t) are centered around 0 in our simulation. As all modified penalties and penalties with additional constraints result in estimated surfaces around 0, results will be shifted up-or downwards compared to the true surfaces if this is not the case in a real application. Such a shift reflects the fact that the average level of the  Figure 6: rIMSE β for SNR ε = 10 only for "pathological" data-sets. X(s)-processes in panels, penalties coded by color.
surface cannot be inferred from the data in settings with corresponding persistent nonidentifiability. Thus, in cases where our diagnostics indicate persistent non-identifiability, estimated coefficient surface values can only be interpreted relative to one another, but no interpretation of their sign is possible due to the estimated absolute level being essentially arbitrary. A corresponding implication regarding linear trends applies to the case of second order difference penalties.

Summary of simulation results
We can draw the following conclusions based on the entirety of simulation results: • the potential for extreme estimation errors is large for X(s)-processes with low effective rank whose ke(K X ) is not disjunct from P s⊥ .
• there is no strong positive correlation between accuracy of fitted values (rIMSE Y ) and accuracy of the estimated coefficient surface (rIMSE β ) for rank deficient D s . Even extremely wrong estimates of β(s, t) can yield good model fits.
• calculating X ⊥ P s⊥ and κ(D T s D s ) yields a suitable criterion that can diagnose persistent non-identifiability reliably, albeit with a substantial rate of "false alarms" in which conventional difference penalties work well despite an antagonistic data situation.
• the full-rank roughness penalties often stabilize estimates in persistent non-identifiability settings without diminishing accuracy in other settings by a relevant amount, but not reliably so.
• for problematic data settings flagged by the criterion developed here, the combination of difference penalties with suitable constraints on the coefficient surface ("∆ + C"), ridge penalties ("∆ 0 ") and penalties based on the spectrum of the functional covariate ("FAME") all perform similarly and outperform both full-rank and conventional difference penalties.

Case Study
To illustrate the practical relevance of the theoretical development given in Section 3 and the performance of the countermeasures developed therein, we describe a deliberately constructed but realistic setting in which non-identifiability persists despite penalization. We use a subset of 100 patients from the dti-data described in the introduction and fit a realistic model in which investigators try to separate the effect of mean CCA-FA levels on  Top row from left: First order (∆ 1 ), second order difference penalty (∆ 2 ), ridge penalty (∆ 0 ), FPCkernel-based penalty (FAME). Bottom row from left: Full rank first order (∆ 1 (M-W)), second order difference penalty (∆ 2 (M-W)), first order with constraint on kernel overlap (∆ 1 + C), second order with constraint on kernel overlap (∆ 2 + C). All fits performed with pffr(). Note that z-axis limits for the 4 panels on the left differ from each other and from those of the 4 panels on the right.
RCST-FA from that of the shape of CCA-FA along the tract. Such a model can be defined as where Y i (t) is RCST-FA for patient i, β 0 (t) is a global functional intercept,x i = X i (s) are the mean CCA-FA levels with associated effect β 1 (t), and X c i (s) ≈ X i (s) −x i are the denoised, curve-wise centered CCA-FA curves. In this example, we use a reconstruction X c i (s) based on the six largest empirical FPCs (ca. 90 % of total variance explained) to denoise CCA-FA and interpolate missing values. Note that curve-specific mean centering causes all constant functions to be in the kernel of the covariance of X c i (t), i.e. X ⊥ P s⊥ ≥ .95 for first and second order difference penalties. All fits shown below use K s = K t = 12 marginal basis functions, so by construction we expect κ(D T s D s ) to be large since the rank of the functional covariate is at most M = 6. More than 6 marginal basis functions are required, however, so that the basis for β(s, t) is flexible enough for approximating the rather complex shape we observe here. Note that curve-wise centering is also a necessary pre-processing step for many spectroscopic data analyses where absorption spectra are used as functional covariates and different mean intensity levels of such spectra are often pure laboratory artifacts (Fuchs et al., 2015), and more generally in settings such as this one where practitioners try to separate effects of the mean level of a functional covariate from those of its shape.
From left to right, the top row of Figure 7 shows the responses and fitted values for this model, the uncentered functional covariates used for the models shown in Figure 1, and the centered functional covariates used for the models in this section. For uncentered covariates, this setting provides an example of simple non-identifiability, while centered covariates induce persistent non-identifiability for difference penalties. Figure 8 shows the estimated coefficient surfacesβ(s, t) for this model for various penalties. Despite the divergent shapes of the different coefficient surface estimates, all 8 fits lead to practically identical fitted values on the training data (100 patients) and practically identical predictions for the test set (155 patients), with integrated MSE ≈ 0.0031 and predictive integrated MSE ≈ 0.0046 for each model. Bothβ 0 (t) andβ 1 (t) are also practically identical for all 8 models, they are shown in the bottom row of Figure 7 along with the estimated contributions S X c i (s)β(s, t)ds of the functional covariate to the additive predictor. The model specification is appropriately flagged as causing persistent non-identifiability (i.e., X ⊥ P s⊥ = .98 [1.15] and κ(D T s D s ) → ∞) for first [second] order difference penalties ∆ 1 [∆ 2 ]. It is instructive to compare the resulting estimates for β(s, t) for different penalty specifications: the first difference penalty fit (top, left) includes a constant offset from 0, while the second difference penalty fit (top, second from left) includes different offsets from 0 for each t, in a linear decrease that is unpenalized by the marginal second order difference penalty in t direction, as well as a linear trend in s. The presence of this spurious linear trend is surprising since we did not explicitly remove linear components from the functional covariate and it is not present in the fits for the uncentered data (c.f. Figure 1). Further investigation revealed that the centered functional covariate's null-space contains a fairly strong linear component ( LV ((X c ) ⊥ , w ·s) = 0.77, where (X c ) ⊥ denotes the orthogonal complement of the observed X c (s)), which is overlap enough to produce spurious linear shifts in this example. Note that the linear component is much stronger in the non-centered covariates ( LV ((X) ⊥ , w · s) = 0.11) and its lack is caused by the curve-wise centering in this example. It is straightforward to show that this can occur if X i (s)sds ≈x i sds ∀i = 1, . . . , n, as is the case here. Both offset and trends are reduced, but not entirely removed, by instead using the fullrank difference penalties with = 0.1 (bottom row, two leftmost panels). Due to the wiggliness of the coefficient surface, the estimated smoothing parameters are quite small, so there is only little penalization going on. Consequently, the additional small penalty on P s⊥ is not strong enough to eliminate non-identifiability artifacts from the fit in this case. Estimated coefficient surfaces with full-rank difference penalties with = 1 (not shown), however, are very similar to the those in the four right-most panels of Figure 8. That results for the full-rank difference penalties depend so strongly on this tuning parameter is a considerable disadvantage.
The four right-most panels show results based on the ridge penalty (∆ 0 , second from right, top row), the penalty based on the FPCs of ke(K X ) (FAME, top right), and the difference penalties with additional "orthogonal-to-kernel-overlap" constraints (∆+C, bottom row). It is reassuring to see that all four of these penalties lead to very similar results in this setting despite their different motivations and mathematical properties. Admittedly, understanding the estimated coefficient surfaces in terms of the supposed data generating process remains difficult. First, because of their complex shape and secondly (and more pertinently to the main points of this paper), because the restriction to surfaces centered around 0 that is enforced for ∆ + C and implied by the ∆ 0 and ke(K X ) penalties precludes facile interpretation of, e.g., the sign of features of β(s, t). Since the "true" offset of the coefficient surface is not estimable from the data, negativity or positivity of certain peaks or troughs ofβ(s, t) is not directly interpretable and interpretation can thus only rely on relative heights across the surface. Higher dimensional overlaps between penalty and covariate null-spaces than the one encountered here will compound these expositional difficulties.
This example demonstrates how reasonable, but unfortunate combinations of data preprocessing and model specifications can lead to simply as well as persistently non-identifiable models despite penalization. The diagnostics and countermeasures developed in Section 3, however, seem to be suitable for detecting and remedying such problems in a real data setting.

Conclusion and Discussion
Coefficient surface estimates in spline-based function-on-function-regression (1) can suffer from persistent identifiability problems if the span of the marginal basis for the coefficient surface over a functional covariate's domain overlaps the kernel of its covariance operator. A rank deficient design matrix can occur in particular if the functional covariate's covariance operator is of effective rank smaller than the number of marginal basis functionseither because the number of eigenfunctions with non-zero eigenvalues is truly below the number of marginal basis functions, or if eigenvalues of the covariance operator decrease too rapidly compared to the noise level of the data.
In practice, spline based approaches are typically fitted with a regularization penalty corresponding to a smoothness assumption on the coefficient surface. We have shown that identifiability problems persist if, and only if, in addition to a rank deficiency of the design matrix, the kernel of the functional predictor's covariance ke(K X ) overlaps the function space P s⊥ spanned by parameter vectors in the null-space of the spline's roughness penalty. In the case of no overlap, there is a unique smoothest representative on any hyperplane of parameter vectors leading to the same additive predictor and the penalized estimation problem finds a unique smoothest solution. Similar results hold for the simpler case of penalized scalar-on-function regression models (Happ, 2013). They are also expected to hold for more general loss functions than the quadratic loss analysed here since generalized additive models are typically estimated by the penalized iteratively reweighted least squares (P-IWLS) method, where the system of equations solved in each step is identical to that of (8) except for the introduction of a vector of IWLS weights and the substitution of y by IWLS working responses that are element-wise linear transformations of y.
A lack of identifiability also implies a lack of correlation between accuracy of the coefficient estimates and goodness of fit for the responses. As this extends to prediction errors for out-of-sample data from the same process, it is usually not possible to detect identifiability issues for a given data set based on subsampling or cross-validation schemes. Instead, based on theoretical considerations and simulation results, we have identified two easily computable diagnostic criteria in order to detect non-identifiable model specifications before estimation. The criteria combine the condition number of a partial design matrix with a measure of the amount of overlap between the kernel of the functional predictor's covariance and the null-space of the penalty. Non-identifiability may in particular be an issue if both criteria are indicative of a problematic setting, or if the partial design matrix is numerically rank deficient and the penalty smoothing parameter is estimated to be close to zero. If a non-identifiable model specification is discovered, we recommend that practitioners choose modified full-rank roughness penalties to safeguard against spurious estimates or estimate coefficient surfaces under constraints that force these spurious components to zero. The pffr-function in the refund package uses a first order differences penalty by default, incorporates the diagnostic checks developed and evaluated in this work, and issues corresponding user warnings and enforces suitable constraints if a persistently non-identifiable model specification is detected.
Another practical consequence of our results is that pre-processing methods for functional covariates should be avoided if they reduce their effective rank (such as pre-smoothing with low-dimensional bases) or if they increase the amount of overlap between ke(K X ) and P s⊥ (such as curve-wise centering).
Jointly, these provisions seem to be sufficient to diagnose and safeguard against most serious artifacts of non-identifiability in practice. Our results indicate that in many cases, penalization allows the reasonable estimation of coefficient surfaces that are not identifiable in the theoretical model under an additional smoothness assumption, avoiding instead the common assumption that the estimated coefficient surface lies in the span of the first few eigenfunctions of the covariance operator of the covariate.
This work drives home the point that we cannot hope to reliably estimate arbitrarily complex effect shapes from functional covariates with low information content. In that sense, assuming smoothness of the coefficient surface and constraining its non-identifiable components to be zero is simply following a principle of parsimony. At the same time, substantial interpretation of coefficient surface estimates derived from rank-deficient designs is difficult and has the potential to be very misleading. Functional principal component regression approaches do not suffer from the potential identifiability issues discussed here, but they do so at the price of restricting the estimated coefficient surface to the span of the estimated functional principal components. These are significant challenges for the maturing field of functional regression methods, at least in applications where these methods are used not only for prediction, but also for inferring and understanding the underlying data generating processes. It is our hope that the theoretical development and practical examples presented here can serve as a starting point for critical reflection on this important issue.

A. Proofs
A.1. Proof of Proposition 3.2 Assume that B t is of full rank K t . Then, the design matrix D = B t ⊗ (XW B s ) in model (7) is rank-deficient if and only if a) M < K s or b) if M ≥ K s , but rank(ΦW B s ) < K s .

A.2. Proof of Proposition 3.3
Let P = λ s (I Kt ⊗P s )+λ t (P t ⊗I Ks ), with P s and P t positive semi-definite matrices. Assume that B t is of full rank K t , that rank(P t ) < K t and that λ s > 0, λ t ≥ 0. Then, for any f ∈ im(D) there is a unique minimum min θ∈H f {θ P θ} if and only if ke(D s D s )∩ke(P s ) = {0}.
Proof. We have min{θ P θ} s.t. θ ∈ H f = min{θ U U P U U θ} s.t. Dθ = f = min{(θ U + |θ U 0 ) U + P U + U + P U 0 U 0 P U + U 0 P U 0 Denote v + = U + θ and v 0 = U 0 θ, with (v + , v 0 ) = U θ a bijective re-parametrization of θ. Note that v + is fixed while v 0 is free to vary within the hyperplane. Setting the derivative with respect to v 0 equal to zero yields Now, if ke(D s D s ) ∩ ke(P s ) = {0}, choose u s = 0 with U s0 u s ∈ ke(D s D s ) ∩ ke(P s ) and u t = 0 with U t u t ∈ ke(P t ). As U 0 = U t ⊗ U s0 , we have (u t ⊗ u s )U 0 P U 0 (u t ⊗ u s ) = λ s (u t u t )u s U s0 P s U s0 u s + λ t u t U t P t U t u t (u s u s ) = 0.
Thus, U 0 P U 0 is not of full rank, no unique solution v 0 of (10) exists, so there is no unique minimum min θ∈H f {θ P θ}.
On the other hand, if ke(D s D s ) ∩ ke(P s ) = {0}, for any x with x U s0 P s U s0 x = 0 we have U s0 x ∈ ke(D s D s ) ∩ ke(P s ) = {0} and thus x = U s0 U s0 x = 0. As this means that U s0 P s U s0 is positive definite, we also have that U 0 P U 0 = λ s I Kt ⊗ (U s0 P s U s0 ) + λ t (U t P t U t ) ⊗ I (Ks−d/Kt) is positive definite and thus invertible. Therefore, there is a unique minimum v 0 = −(U 0 P U 0 ) −1 U 0 P U + v + and a unique smoothest point with H = (I KsKt − U 0 (U 0 P U 0 ) −1 U 0 P ) and θ f P θ f = min θ∈H f θ P θ.
A.3. Proof of Proposition 3.4 Assume that B t is of full rank K t , that rank(P t ) < K t and that λ s > 0, λ t ≥ 0. Then, there is a unique penalized least squares solution for (8) if and only if ke(D s D s ) ∩ ke(P s ) = {0}.
On the other hand, if ke(D s D s ) ∩ ke(P s ) = {0}, there is a u s = 0 with D s u s = 0 and P s u s = 0. Choose 0 = u t ∈ ke P t . Then, (u t ⊗ u s )(D D + λ s (I Kt ⊗ P s ) + λ t (P t ⊗ I Ks ))(u t ⊗ u s ) = u t B t B t u t · 0 + λ s u t u t · 0 + λ t · 0 · u s u s = 0.
Thus, (D D + λ s (I Kt ⊗ P s ) + λ t (P t ⊗ I Ks )) is singular and not invertible.