Bayesian Estimation of Principal Components for Functional Data

The area of principal components analysis (PCA) has seen relatively few contributions from the Bayesian school of inference. In this paper, we propose a Bayesian method for PCA in the case of functional data observed with error. We suggest modeling the covariance function through an approximate spectral decomposition, which leads to easily interpretable parameters. We study in depth the implied distributions arising from an inverse Wishart prior and prove a convergence theorem for the case of an exact finite-dimensional representation. We also discuss computational issues, as well as the care needed in choosing hyperparameters. A simulation study demonstrates competitive performance against a recent frequentist procedure, particularly in the estimation of the principal components. Finally, we apply the method to a real data set, where we also incorporate model selection on the dimension of the finite basis used for modeling.


Introduction
In the rapidly expanding area of functional data analysis, data compression has become an oft-employed strategy. Principal component analysis (PCA) has become a widespread tool in the area of functional data, where the high dimensionality of the data can quickly become unmanageable. Principal components can be used to reconstruct a process approximately, using relatively few random variables. At its heart, PCA is an exploratory tool used to gain insight into the structure of the data. It is also used in less scrupulous endeavors, such as preprocessing for a regression analysis. For a textbook length treatment of classical multivariate PCA, see Jolliffe (1986).
PCA for functional data has likewise become a very popular technique. Ramsay and Silverman (2005) certainly helped make functional PCA (FPCA) a standard first step when dealing with functional data. For another textbook account of FPCA, see Horváth and Kokoszka (2012).
Bayesian methods for multivariate PCA have been relatively absent from the literature. Tipping and Bishop (1999) showed how the traditional method of PCA can be viewed as the solution to a maximum likelihood procedure; this likelihood was then used for a Bayesian treatment in Bishop (1999). For functional data, Behseta et al. (2005) proposed a Bayesian method for FPCA, and van der Linde (2008) proposed an approximate Bayesian method using variational calculations. Frequentist PCA is commonly used as the first step in multi-step procedures; one reason for the lack of the subjective perspective in the PCA literature is certainly due to the fact that Bayesian procedures are, by their nature, not performed stepwise. The extension of Bayesian procedures to more complicated situations usually comes from hierarchically integrating the simpler model into a larger one.
In this paper, we investigate a potential model for the covariance structure of functional data observed with noise. The model jointly smooths the observations and estimates the principal components. As a Bayesian procedure, model selection on the chosen number of basis functions is conceptually straightforward, and we demonstrate this on a real data set.
Section 2 provides background material. Section 3 describes the motivation for the model used for the data, along with the priors used throughout. Section 3 also discusses some issues in choosing hyperparameters for the typical priors. Section 3.6 describes the method used for model comparison in terms of the number of basis functions used for approximation. In Section 4, we prove a convergence theorem for the case of an inverse Wishart prior. Sections 5 and 6 present a simulation study and applied data example, respectively.

Background
Let {X(t) : t ∈ [0, 1]} be a stochastic process whose sample paths are square integrable. By a change of location and scale, if necessary, any bounded interval can be reduced to [0, 1], which we abbreviate by I. Usually the index represents time, and for functional observations a bounded domain is most natural, because data can be observed only over a limited time span. Let μ(t) := E[X(t)] for all t ∈ I, and let the covariance function of the process be given by κ(s, t) = Cov(X(s), X(t)) for all s, t ∈ I. Note that κ is symmetric in its arguments and is a positive definite function, i.e., for any t_1, …, t_k ∈ I, the matrix ((κ(t_i, t_j))) is positive definite. We assume that κ is continuous on I × I and let M := sup{|κ(s, t)| : s, t ∈ I} < ∞. On the space L² = L²(I) of square integrable functions, we use the standard inner product ⟨f, g⟩ = ∫_I f(t)g(t) dt and the norm it induces.
The covariance function then defines an integral operator T_κ : L² → L², given by

(T_κ f)(s) = ∫_I κ(s, t) f(t) dt,  for all f ∈ L².

It is well known that T_κ is a Hilbert–Schmidt operator and, in particular, a compact linear operator, which necessarily has a countable spectrum {λ_1, λ_2, …} with 0 as the only accumulation point (Ash, 1965). Therefore, the important quantities for us are the eigenfunctions and eigenvalues of the covariance operator, i.e., functions {φ_1, φ_2, …} and nonnegative real numbers {λ_1, λ_2, …} (assumed to be in nonincreasing order) such that

T_κ φ_j = λ_j φ_j,  j = 1, 2, ….

Mercer's theorem (Ash, 1965, Theorem 2.4) states that the covariance function can be represented, for s, t ∈ I, as

κ(s, t) = ∑_{j=1}^∞ λ_j φ_j(s) φ_j(t),    (3)

where the convergence is uniform on I × I.
The eigenfunctions {φ_1, φ_2, …} are called the principal components for reasons we now discuss. The map f ↦ Var⟨f, X⟩ = Var(∫_I f(t)X(t) dt) attains its maximum on the unit sphere at φ_1. More generally, the set {φ_1, φ_2, …} satisfies

φ_{k+1} = argmax{ Var⟨f, X⟩ : ‖f‖ = 1, ⟨f, φ_j⟩ = 0 for j = 1, …, k }.

In this way, the eigenfunctions are the principal directions of variation for the process. The other reason for the term "principal components" is the Karhunen–Loève expansion theorem, which states that the process X, as a random element of L², can be represented as

X(t) = μ(t) + ∑_{j=1}^∞ √λ_j Z_j φ_j(t),    (5)

where {Z_1, Z_2, …} are uncorrelated random variables with unit variance and the convergence is in mean square, uniformly on I. In this way, the eigenfunctions can be seen as a way to decompose the process into orthogonal (uncorrelated) components. An important special case: if X is a Gaussian process, then (5) holds pointwise on I almost surely, each Z_j is also a Gaussian random variable, and hence the Z_j are all independent.
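On a discrete grid these objects can be computed directly. The following minimal sketch (our own illustration, not code from the paper) approximates the eigenpairs of T_κ by eigendecomposing the quadrature-weighted covariance matrix, and checks the result against the Brownian-motion kernel min(s, t), whose eigenvalues 1/((j − 1/2)π)² are known in closed form:

```python
import numpy as np

# Approximate the eigenpairs of the covariance operator T_kappa by
# discretizing kappa on an even grid; the grid spacing dt acts as the
# quadrature weight, so eigenvalues of dt * K approximate those of T_kappa.
def discretized_fpca(kappa, grid):
    dt = grid[1] - grid[0]                      # even spacing assumed
    K = kappa(grid[:, None], grid[None, :])     # T x T covariance matrix
    evals, evecs = np.linalg.eigh(dt * K)
    order = np.argsort(evals)[::-1]             # nonincreasing order
    # Rescale eigenvectors so the eigenfunctions have unit L2 norm on I.
    return evals[order], evecs[:, order] / np.sqrt(dt)

grid = np.linspace(0, 1, 200)
lam, phi = discretized_fpca(lambda s, t: np.minimum(s, t), grid)
print(lam[:3], 1.0 / ((np.arange(1, 4) - 0.5) * np.pi) ** 2)  # should be close
```

The same eigendecomposition step is what later turns an estimated covariance matrix into estimated principal components.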

Model, Prior Specification, and Posterior Computation
Let X_1, X_2, … be independent and identically distributed observations from a Gaussian process, GP(μ, κ), on I = [0, 1], where μ is the mean function and κ is the covariance function. We will assume, however, that our observations have been contaminated with additional noise, i.e., we observe noisy data on some grid of points {t_1, …, t_T}. We assume that all of the data are observed on the same grid, for simplicity of formulas. Let X_i = (X_i(t_1), …, X_i(t_T))′ be the ith underlying discretized function, and let Y_i = (Y_i(t_1), …, Y_i(t_T))′ denote the corresponding noisy observation, with Y_i(t_j) = X_i(t_j) + ε_{ij} for independent mean-zero Gaussian errors ε_{ij}. Our goal is to estimate κ, and the principal components it induces, from the data, and we will do so by placing priors on all the parameters.
To put a prior on μ, we shall use a Gaussian process, which reduces to the multivariate normal distribution on the discretized observations. In putting a prior for κ, we shall construct our prior based on an approximate spectral representation by truncating the series in (3), but allowing a prior on the number of terms to ensure full support. We shall induce a prior on the eigenvalues and eigenfunctions indirectly from that on the covariance matrix on the finite grid of time points, which will be chosen as the inverse Wishart distribution.
Consider a given basis {h_1, h_2, …} for L². Since any eigenfunction φ_i of T_κ can be expanded as

φ_i(t) = ∑_{j=1}^∞ α_{ij} h_j(t),    (6)

a particularly convenient method of putting a prior on {φ_1, φ_2, …} is to truncate (6) at some level J ∈ ℕ. Let h_J = (h_1, …, h_J)′. We will also truncate the expansion of κ in (3) at some level K ∈ ℕ, putting a prior on the resulting coefficients and also a prior on J. Since the finitely truncated series converges to (6) as J → ∞, this procedure ensures that the resulting objects {φ_1, φ_2, …} receive a fully supported prior, provided the coefficients do for each value of J. Furthermore, a prior on κ is induced by truncating (3) and imposing a prior on K. Let A_KJ = ((α_ij)) be the K × J matrix of coefficients, and let φ_K = (φ_1, …, φ_K)′ be given by φ_K(t) = A_KJ h_J(t). Let Λ_K = diag(λ_1, …, λ_K). Then the prior on κ can be induced by the relation

κ(s, t) = φ_K(s)′ Λ_K φ_K(t) = h_J(s)′ A_KJ′ Λ_K A_KJ h_J(t),

together with priors on K and J. However, as mentioned above, instead of directly putting priors on A_KJ and Λ_K, we proceed in the reverse order and induce priors on them through a convenient prior on Σ = A_KJ′ Λ_K A_KJ. Details of the specification are explained below.
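As a purely illustrative sketch (the coefficients and eigenvalues below are made up, not the paper's prior), the finite-rank relation Σ = A_KJ′ Λ_K A_KJ, with κ evaluated on a grid as HΣH′, can be realized numerically using shifted Legendre polynomials:

```python
import numpy as np
from numpy.polynomial import legendre

# Shifted Legendre polynomials, orthonormal on [0, 1], as the basis h_J.
def basis(t, J):
    t = np.asarray(t)
    H = np.empty((t.size, J))
    for j in range(J):
        c = np.zeros(j + 1)
        c[j] = 1.0
        H[:, j] = legendre.legval(2 * t - 1, c) * np.sqrt(2 * j + 1)
    return H

J, K = 6, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((K, J))                # K x J coefficient matrix A_KJ
A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm rows: unit-norm PCs
Lam = np.diag([1.0, 0.25])                     # Lambda_K, nonincreasing

grid = np.linspace(0, 1, 101)
H = basis(grid, J)                             # T x J basis evaluations
Sigma = A.T @ Lam @ A                          # J x J, rank K
Kmat = H @ Sigma @ H.T                         # discretized kappa(s, t)
print(np.sum(np.linalg.eigvalsh(Kmat) > 1e-8))  # numerical rank
```

The induced covariance matrix has rank K even though it lives in a J-dimensional basis, which is exactly the structure the low-rank prior targets.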

Model and Priors
In the following, the choice K = J leads to substantial simplification, although it may overfit the covariance function; it is reasonable to believe that more basis functions are needed than the number of principal components required to reconstruct the covariance function. We first describe the model for the case K = J, and then use it to motivate the low-rank model (K < J). Conditional on K = J, the model and the prior distribution can be described by the following hierarchical scheme for i = 1, …, n:

Y_i | β_i, σ² ∼ N_T(H_J β_i, σ² I_T),
β_i | θ, Σ ∼ N_J(θ, Σ),
θ | Σ ∼ N_J(θ_0, Σ),
Σ ∼ inverse-Wishart(ν, Ξ),
σ² ∼ inverse-gamma(a, b),

where H_J is a T × J matrix whose columns consist of the basis functions evaluated at all grid points. When convenient, we may drop the subscript from certain expressions.
The functions h(t)′β_i, i = 1, …, n, correspond to the underlying (unobserved) noise-free functional observations. The function h(t)′θ corresponds to the overall population mean for the functional observations, h(s)′Σh(t) is the covariance function of interest, and Ah(t) is the vector of functional principal components. When estimating the full model, that is, averaging across the posterior distribution of K, it is important to restrict attention to parameters whose dimension does not depend on K. For example, the covariance function evaluated at the observed grid points, HΣH′, has dimension T × T and has meaning across values of K, whereas β_i only has meaning for a given value of K.
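To make the roles concrete, here is a small simulation from this kind of hierarchy (a sketch under our own illustrative choices: a cosine basis and hand-picked θ, Σ, and σ², rather than the paper's Legendre basis and priors):

```python
import numpy as np

rng = np.random.default_rng(1)
T, J, n = 50, 4, 20
grid = np.linspace(0, 1, T)

# Orthonormal cosine basis on [0, 1]: 1, sqrt(2) cos(pi t), sqrt(2) cos(2 pi t), ...
H = np.column_stack([np.ones(T)] +
                    [np.sqrt(2) * np.cos(np.pi * j * grid) for j in range(1, J)])

theta = np.array([0.0, 1.0, -0.5, 0.2])   # population mean coefficients
Sigma = np.diag([1.0, 0.5, 0.1, 0.05])    # coefficient covariance
sigma2 = 0.3                              # sampling-noise variance

beta = rng.multivariate_normal(theta, Sigma, n)          # n x J coefficients
Y = beta @ H.T + rng.normal(0, np.sqrt(sigma2), (n, T))  # n x T noisy curves

# H Sigma H' is the T x T covariance of the noise-free curves -- the object
# whose meaning is stable across values of K.
cov_curves = H @ Sigma @ H.T
print(Y.shape, cov_curves.shape)
```

Each row of Y plays the role of one noisy discretized functional observation, and HΣH′ is the quantity that retains its interpretation under model averaging.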

Low Rank Model
Because the Wishart distribution gives probability 1 to nonsingular matrices, if we wish to allow the number of principal components, K, to be less than the number of basis functions used in the approximation, J, we must allow for singular Wishart matrices. The most straightforward approach is to use a singular center matrix, of rank K, in the prior specification. First, let Ξ = U L U′, where U is orthogonal and L is the diagonal matrix of ordered eigenvalues. Choose the rank-K truncation

Ξ_K = U_K L_K U_K′,

where U_K contains the first K columns of U and L_K the corresponding K largest eigenvalues. This gives the motivation for our full, low-rank model (10), in which the full-rank inverse Wishart prior on Σ is replaced by the singular version induced by the rank-K center matrix. This model implies that U_K′ Σ⁻¹ U_K ∼ Wishart(ν, Ξ_K⁺), which provides the singularity we desire in our modeling. Because our prior puts mass on all (J, K) pairs with 1 ≤ K ≤ J, we still have a marginal full-rank model, but our posterior will be mixed over rank-deficient models.
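The eigendecomposition-and-truncation step can be sketched as follows (Ξ here is an arbitrary symmetric positive definite matrix chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
J, K = 6, 2
B = rng.standard_normal((J, J))
Xi = B @ B.T + J * np.eye(J)         # a full-rank symmetric PD matrix Xi

# Xi = U L U' with eigenvalues sorted into nonincreasing order.
L, U = np.linalg.eigh(Xi)
order = np.argsort(L)[::-1]
L, U = L[order], U[:, order]

U_K = U[:, :K]                       # first K eigenvectors
Xi_K = U_K @ np.diag(L[:K]) @ U_K.T  # rank-K center matrix
print(np.sum(np.linalg.eigvalsh(Xi_K) > 1e-8))  # numerical rank
```

The truncated matrix Ξ_K is symmetric and has rank exactly K, which is what makes the induced prior concentrate on rank-deficient covariance matrices.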

Model Choice
There is a vast literature on Bayesian model choice, and many of the available procedures could serve our purposes. Reversible jump Markov chain Monte Carlo (MCMC) (Green, 1995) has become a widely used method, but it presents a major hurdle in our case: the need for a proposal distribution when the model includes the parameter Σ. The main challenge is respecting the positive definiteness of the matrix when proposing a jump to a higher dimension. This can be overcome by marginalizing out Σ, obtaining posterior samples of the β_i's, and, finally, generating samples of HΣH′, which all live in the same dimension even as K changes between steps.
The approach we take is to estimate the posterior distribution of K through approximations to the marginal likelihood for each model. We employ independent Gibbs samplers for each value of K, and compute estimates of their marginal likelihood using results from Chib (1995). The obvious disadvantage of this strategy is the need to run MCMC chains for each possible model. The likely faster convergence of the Gibbs sampler, however, can partially offset the cost along with the ability to implement the chains in parallel. In the application considered below, the relatively small number of sampled time points lends itself well to this method, and gives much more confidence in the convergence than other approaches not based on Gibbs sampling.

Choice of Hyperparameters
Especially in the case of the inverse Wishart prior, selection of the hyperparameters is an important issue. Because of the link between principal components and covariance estimation, prior elicitation can be done in either domain. However, some problems can arise with what appears to be a good default choice. Specifically, because of the required smoothness conditions on the covariance function, it is not possible to choose the identity matrix as the center matrix for the inverse Wishart prior.
A sensible choice of Ξ (whose size depends on J) is given below. Using the prior covariance function, construct the covariance matrix corresponding to the grid being used; call this Σ*. We then propose using

Ξ⁻¹ = (H′H)⁻¹ H′ Σ* H (H′H)⁻¹

as the choice of hyperparameter, which can be seen as a least-squares projection. This matrix can be shown to be invertible using two facts: H has full column rank, because it consists of evaluations of an orthogonal set of functions, and Σ* is invertible, because it is derived from a valid covariance function of a vector without linear restrictions. To complete the specification of the Wishart prior, for model (J, K), a reasonable default choice for the degrees of freedom is ν = 2K, which, in the Gibbs step described later, implies a mean for the inverse Wishart full conditional that resembles a typical covariance estimate with denominator n + K − 1. This choice is implemented in the empirical comparisons below and performs very competitively.
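The least-squares projection above is easy to compute. The sketch below uses the squared-exponential prior covariance from the simulations, with an orthonormalized polynomial basis as our illustrative stand-in for the paper's Legendre basis:

```python
import numpy as np

T, J = 50, 5
grid = np.linspace(0, 1, T)
# Orthonormalize a monomial basis via QR so that H has full column rank.
H, _ = np.linalg.qr(np.vander(grid, J, increasing=True))

# Prior covariance kappa(s, t) = exp{-3 (s - t)^2} evaluated on the grid.
S, Tm = np.meshgrid(grid, grid, indexing="ij")
Sigma_star = np.exp(-3.0 * (S - Tm) ** 2)

# Xi^{-1} = (H'H)^{-1} H' Sigma_star H (H'H)^{-1}, a least-squares projection.
HtH_inv = np.linalg.inv(H.T @ H)
Xi_inv = HtH_inv @ H.T @ Sigma_star @ H @ HtH_inv
print(np.sum(np.linalg.eigvalsh(Xi_inv) > 1e-8))  # full numerical rank: J
```

Even though the grid covariance Σ* is numerically close to singular, its projection onto a small smooth basis is comfortably invertible, which is why this choice works where the identity matrix does not.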
Finally, for the inverse gamma prior, the choice of (a, b) is very important, since it controls the amount of smoothing performed on the data; it encodes the prior beliefs about the amount of sampling noise present. In practice the results are sensitive to this choice, and several values should be tried when performing the analysis. An empirical choice is also available: a and b can be chosen as the values that maximize the marginal likelihood of these hyperparameters evaluated at the data.

Posterior Computation
Posterior computation is done independently for each pair (J, K). The primary advantage is the ability to implement a Gibbs sampler for each model. Since the models are computed separately, they can be run in parallel to offset the cost of running all models. Having independent samples for each model also allows the priors on J and K to be adjusted without rerunning the MCMC chains. Finally, it provides full information for each individual model, in case a single model is desired instead of a fully Bayesian, model-averaged posterior. For a fixed pair (J, K), the Gibbs sampler following from the low-rank model (10) cycles through the following full conditional distributions:

• For each i = 1, …, n, sample β_i from its multivariate normal full conditional, with precision σ⁻²H′H + Σ⁻¹ and mean (σ⁻²H′H + Σ⁻¹)⁻¹(σ⁻²H′Y_i + Σ⁻¹θ).
• Sample θ from its multivariate normal full conditional.
• Sample Σ⁻¹ from a Wishart distribution with ν + n + 1 degrees of freedom and center matrix (Ξ + ∑_i (β_i − θ)(β_i − θ)′ + (θ − θ_0)(θ − θ_0)′)⁻¹.
• Finally, sample σ⁻² from a gamma distribution with shape parameter a + n/2 and rate parameter b + ½ ∑_i ‖Y_i − Hβ_i‖².
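To give the flavor of one such step, here is the standard conjugate update for a single β_i given the other parameters (a generic Bayesian linear-model draw, sketched by us for illustration rather than taken from the paper's code):

```python
import numpy as np

def sample_beta_i(Y_i, H, theta, Sigma_inv, sigma2, rng):
    # Full conditional of beta_i: multivariate normal with
    #   precision  H'H / sigma^2 + Sigma^{-1}
    #   mean       V (H'Y_i / sigma^2 + Sigma^{-1} theta),  V = precision^{-1}
    prec = H.T @ H / sigma2 + Sigma_inv
    V = np.linalg.inv(prec)
    m = V @ (H.T @ Y_i / sigma2 + Sigma_inv @ theta)
    return rng.multivariate_normal(m, V)

rng = np.random.default_rng(3)
T, J = 30, 3
grid = np.linspace(0, 1, T)
H, _ = np.linalg.qr(np.vander(grid, J, increasing=True))
beta_true = np.array([1.0, -2.0, 0.5])
Y_i = H @ beta_true + rng.normal(0, 0.05, T)   # nearly noise-free curve

draws = np.array([sample_beta_i(Y_i, H, np.zeros(J), np.eye(J), 0.05 ** 2, rng)
                  for _ in range(200)])
print(draws.mean(axis=0))  # should sit close to beta_true
```

With small noise the posterior concentrates near the generating coefficients; with larger σ² the draws shrink toward the prior mean θ, which is the joint-smoothing behavior described above.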

Alternative Posterior Computation
Conditional on (J, K), the posterior distribution can be obtained using a Gibbs sampling scheme; wherever computation is performed in this paper, this is the approach that was implemented. Model averaging over (J, K) is obtained by running independent MCMC chains for each value. We do, however, wish to point out some equivalent models that have potential computational benefits. Marginalizing out β_i gives

Y_i | θ, Σ, σ² ∼ N_T(Hθ, σ²I_T + HΣH′)

for the top level. In this case we see that, for the covariance to be an identifiable parameter, we require K < T, so any prior that we use should satisfy this restriction with probability one.
We can also marginalize out θ and Σ to obtain an equivalent prior for the β_i's. Specifically, if we let B_K = [β_{1,K} | β_{2,K} | ⋯ | β_{n,K}] and Θ_0 = [θ_0 | ⋯ | θ_0], then the marginal prior on B_K is the so-called matrix t-distribution (Lad, 1996). Having thus lost conjugacy anyway, we can also integrate out σ² to represent the model in terms of B_K only; the resulting joint density of (Y_1, …, Y_n) is then available in closed form up to a normalizing constant. This formulation can be useful for implementing a reversible jump MCMC scheme, so that proposals are not needed for the covariance parameter, which would otherwise require positive definiteness constraints. It can also be used to obtain the posterior mode of β_i, i = 1, …, n, which can be much faster than full MCMC, especially when many different models are to be compared. In either case, the conditional posterior distributions, and expectations derived from them, can be used to infer the remaining parameters.

Asymptotic Results
We now study the posterior contraction rate ε_n, i.e., a sequence such that the posterior probability of the M_n ε_n-ball around the true parameter, given n observations, tends to 1 as n → ∞ for any M_n → ∞.
Let Π denote the prior measure on P, the parameter space regarded as a subset of the set of all probability measures, with a typical element P having density p, and containing the true measure P_0 with density p_0. For P, Q ∈ P, let

K(p, q) = ∫ p log(p/q),  V₊(p, q) = ∫ p (log₊(p/q))²,

where log₊ x = max(log x, 0). We will also make use of the Hellinger distance, defined as d_H(P, Q) = (∫(√p − √q)²)^{1/2}, and the Frobenius norm on matrices, which, for A = ((a_ij)), is defined by ‖A‖²_F = ∑_{i,j} a²_ij = tr(A′A). In order to obtain the posterior contraction rate ε_n, we apply Theorem 2.1 of Ghosal et al. (2000). Thus we need to verify that, for a constant C > 0 and a sequence {P_n} of subsets of the parameter space,

Π(P : K(p_0, p) ≤ ε²_n, V₊(p_0, p) ≤ ε²_n) ≥ e^{−Cnε²_n}  and  log N(ε_n/2, P_n, d) ≤ nε²_n,

where N(ε_n/2, P_n, d) is the covering number, i.e., the minimum number of d-balls of radius ε_n/2 needed to cover P_n, together with a bound showing that the prior mass of P \ P_n is exponentially small. We assume given values of J, K, and σ. We also make the simplifying assumption that the functional observations have been detrended, so that μ(·) ≡ 0, and the prior mean θ_0 is also taken to be 0. The theorem is stated in terms of the Frobenius norm on matrices. We now state the main theorem of the paper on the rate of convergence of our posterior.

Theorem 1. Let Y_i iid∼ N_T(0, σ²I + HUΣ_0U′H′), i = 1, 2, …, for a known σ² and fixed K = K_0 < T. Under the inverse Wishart prior described above, for any M_n → ∞,

Π(Σ : ‖Σ_0 − Σ‖_F ≥ M_n n^{−1/2} log n | Y_1, …, Y_n) → 0,

as well as

Π(Σ : ‖Σ_0⁻¹ − Σ⁻¹‖_F ≥ M_n n^{−1/2} log n | Y_1, …, Y_n) → 0.

Proofs
Lemma 1 (Kullback–Leibler divergence under information loss). Let X and Y be random variables with densities p and q, respectively, and let U be a random variable with uniform distribution on the unit interval, independent of X and Y. Let p̃ and q̃ be the densities of T(X, U) and T(Y, U), respectively, for a measurable function T. Then K(p̃, q̃) ≤ K(p, q) and V₊(p̃, q̃) ≤ V₊(p, q).
This lemma can be proved by considering the conditional distributions of X given T(X, U) and of Y given T(Y, U), and using the convexity of the maps (u, v) ↦ u log(u/v) and (u, v) ↦ u log²(u/v) on the set u > v > 0. A complete proof may be found in Ghosal and van der Vaart (2016, Appendix, Lemma B.12).
Next, we need a lemma relating the Hellinger distance to the Frobenius norm induced on our parameter of interest. This next lemma will be used in the entropy calculation.

Lemma 2. Let d be the metric induced by the Hellinger distance on the k-dimensional centered multivariate Gaussian family, and let λ_1, …, λ_k be the eigenvalues of Σ_1^{−1/2} Σ_2 Σ_1^{−1/2}. Then

d²(Σ_1, Σ_2) = 2(1 − ∏_{j=1}^k γ(λ_j)) ≤ 2 ∑_{j=1}^k (λ_j − 1)² = 2‖Σ_1^{−1/2} Σ_2 Σ_1^{−1/2} − I_k‖²_F,  where γ(λ) = λ^{1/4} (½(1 + λ))^{−1/2}.

Furthermore, there exist δ > 0 and a constant C > 0, depending on Σ_1, such that, if d(Σ_1, Σ_2) < δ, then ‖Σ_1 − Σ_2‖_F ≤ C d(Σ_1, Σ_2).

Proof. The matrix Σ_1^{−1/2} Σ_2 Σ_1^{−1/2} is symmetric and positive definite, so we may write Σ_1^{−1/2} Σ_2 Σ_1^{−1/2} = P diag(λ_1, …, λ_k) P′ for some orthogonal matrix P, with each λ_j > 0. The Hellinger affinity between N_k(0, Σ_1) and N_k(0, Σ_2) then factors as

det(Σ_1)^{1/4} det(Σ_2)^{1/4} / det(½(Σ_1 + Σ_2))^{1/2} = ∏_{j=1}^k γ(λ_j),

which gives the equality in the display. We claim that γ(λ) ≤ 1 for λ ≥ 0. This can be seen from the facts that γ(1) = 1 and (log γ)′(λ) = (1 − λ)/(4λ(1 + λ)), so that γ is increasing on (0, 1) and decreasing on (1, ∞). Since each factor lies in [0, 1], an induction on k gives 1 − ∏_{j≤k} γ(λ_j) ≤ ∑_{j≤k} (1 − γ(λ_j)), and the case k = 1 supplies the bound 1 − γ(λ) ≤ (λ − 1)². The stated inequality follows by combining these two bounds. The reverse inequality can be established following the arguments given in Banerjee and Ghosal (2015, Lemma A.1).
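The identity at the heart of this argument, that the Hellinger affinity of two centered Gaussians factors over the eigenvalues λ_j of Σ_1^{−1/2} Σ_2 Σ_1^{−1/2}, is easy to check numerically (an illustration we add here; the matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
k = 3
B1, B2 = rng.standard_normal((k, k)), rng.standard_normal((k, k))
S1 = B1 @ B1.T + np.eye(k)
S2 = B2 @ B2.T + np.eye(k)

# Hellinger affinity via determinants ...
aff_det = (np.linalg.det(S1) * np.linalg.det(S2)) ** 0.25 \
          / np.linalg.det(0.5 * (S1 + S2)) ** 0.5

# ... and via the eigenvalues lambda_j of S1^{-1/2} S2 S1^{-1/2},
# with gamma(lam) = lam^{1/4} ((1 + lam) / 2)^{-1/2}.
w, V = np.linalg.eigh(S1)
S1_inv_half = V @ np.diag(w ** -0.5) @ V.T
lam = np.linalg.eigvalsh(S1_inv_half @ S2 @ S1_inv_half)
aff_eig = np.prod(lam ** 0.25 * ((1 + lam) / 2) ** -0.5)

d2 = 2 * (1 - aff_det)         # squared Hellinger distance
print(abs(aff_det - aff_eig))  # the two factorizations agree
```

The two computations of the affinity coincide to machine precision, confirming the factorization used in the induction.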
Let d be the distance on the space of symmetric, positive definite matrices induced by the Hellinger metric on the multivariate Gaussian family. For Σ_1, Σ_2 ∈ P_n, the bound of Lemma 2 implies that

log N(ε_n/2, P_n, d) ≤ log N(n^{−1/8} ε_n/2, P_n, ‖·‖_F) ≲ k² log n ≤ nε²_n

for the chosen ε_n. The second inequality on the preceding line is due to the fact that, under the Frobenius norm, the space of positive definite k × k matrices can be viewed as a subset of ℝ^{k²} (see Pollard (1990, Section 4) for the entropy calculation in Euclidean space). Finally, the remaining condition, that Π(P \ P_n) is exponentially small for the chosen ε_n, is verified in two steps: the first inequality follows from the relationship between the Frobenius norm and the trace of a positive definite matrix, and the second follows from Markov's inequality, using the moment generating function of tr(Σ⁻¹) evaluated at 1 (Muirhead, 2009).
This establishes the consistency result in terms of the Hellinger distance; specifically, as n → ∞,

Π(Σ : d(Σ_0, Σ) ≥ M_n n^{−1/2} log n | Y_1, …, Y_n) → 0.

Now the reverse inequality of Lemma 2 yields the desired result in terms of the Frobenius norm. Finally, since Σ_0 is a fixed matrix, the first stated result follows from the relation ‖AB‖_F ≤ ‖A‖_F ‖B‖_F for two matrices A and B. To derive the second assertion, write Σ_0⁻¹ − Σ⁻¹ as Σ_0⁻¹(Σ − Σ_0)Σ⁻¹ and apply the norm inequality repeatedly.

Simulation Study
To assess the finite-sample performance of the proposed method, we present the results of a simulation study comparing the approach advocated herein to a recent frequentist approach, FACE (Xiao et al., 2013), implemented in the refund package in R.

Description of FACE
The frequentist method to which we compare ours is the "fast covariance estimation" (FACE) method of Xiao et al. (2013). It is a very commonly used frequentist method for the analysis of functional data and has been made popular by the refund package available in R. The FACE estimator, F̃, is simply a sandwich-smoothed sample covariance matrix, that is, F̃ = S F̂ S, where F̂ is the sample covariance matrix and S is a symmetric "smoothing" matrix constructed using penalized B-splines. This form allows fast computation of the estimator. Thus, the comparison to our Bayesian method can be seen as demonstrating the potential benefits of a more complex method when the available computation allows for a fully Bayesian analysis.
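The sandwich idea is simple to illustrate. The sketch below substitutes a generic second-difference ridge smoother for the penalized B-spline smoother actually used by FACE, so it conveys the structure F̃ = S F̂ S rather than reproducing refund's estimator:

```python
import numpy as np

rng = np.random.default_rng(5)
n, T = 20, 50
grid = np.linspace(0, 1, T)
kappa = np.exp(-3.0 * (grid[:, None] - grid[None, :]) ** 2)
X = np.sin(2 * np.pi * grid) + rng.multivariate_normal(np.zeros(T), kappa, n)
Y = X + rng.normal(0, 0.3, (n, T))             # noisy functional observations

F_hat = np.cov(Y, rowvar=False)                # raw sample covariance

D = np.diff(np.eye(T), n=2, axis=0)            # (T-2) x T second differences
S = np.linalg.inv(np.eye(T) + 10.0 * D.T @ D)  # symmetric ridge smoother
F_tilde = S @ F_hat @ S                        # sandwich-smoothed estimate
print(F_tilde.shape)
```

Because S is symmetric, the sandwich preserves the symmetry of the estimate while damping the high-frequency noise that inflates the raw sample covariance, particularly along its diagonal.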

Results
Each data set consists of 20 noisy observations on an evenly spaced grid of 50 time points in the interval [−1, 1]. The true underlying functional observations all have mean μ(t) = sin(2πt) and covariance either κ_1(s, t) = exp{−3(t − s)²} or κ_2(s, t) = min{s + 1, t + 1}, depending on the experimental condition. Independent sampling noise is then added in the form of independent N(0, 0.3) random variables (other values for the noise variance were considered, but the results remained qualitatively very similar). The two methods are compared on the following tasks: estimation of the mean function, estimation of the covariance function, estimation of the principal components, and reconstruction of a new set of underlying functional observations (generated according to the same model). Function estimation is evaluated using the supremum, L¹, and L² metrics, and principal component estimation is evaluated using the angle between the estimate and the true function (this was chosen instead of squared distance to take advantage of the Hilbert space structure). Observations are always reconstructed using the first 4 principal components.

Figure 1: Simulation study: boxplots of the simulation results with κ_1 used for both the truth and the prior. Results are shown in pairs, with the left box (red) representing the proposed Bayesian procedure and the right (black) the FACE procedure. The value above each pair is the proportion of data sets in which the Bayesian procedure performed better.
For each choice of κ_1 or κ_2 as the true covariance function, we ran two simulations corresponding to choosing κ_1 or κ_2 as the prior covariance, yielding four experimental conditions in total. Each scenario is repeated 100 times.
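The angle criterion can be sketched as follows (our own minimal implementation; the absolute value makes it invariant to the sign ambiguity of estimated eigenfunctions):

```python
import numpy as np

def angle(f, g, dt):
    # Angle between two functions in L2, approximated on an even grid;
    # |<f, g>| removes the arbitrary sign of an estimated eigenfunction.
    inner = np.abs(np.sum(f * g) * dt)
    norms = np.sqrt(np.sum(f ** 2) * dt * np.sum(g ** 2) * dt)
    return np.arccos(np.clip(inner / norms, -1.0, 1.0))

grid = np.linspace(0, 1, 200)
dt = grid[1] - grid[0]
phi_true = np.sqrt(2) * np.sin(np.pi * grid)
phi_est = -(phi_true + 0.1 * np.cos(3 * np.pi * grid))  # flipped and perturbed
print(angle(phi_true, phi_est, dt))  # small despite the sign flip
```

A sign-flipped copy of the truth scores an angle of zero, so the criterion penalizes only genuine differences in direction.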
The proposed Bayesian procedure is computed using a Gibbs sampler with 5000 burn-in iterations and 5000 further iterations to estimate posterior means. The concentration parameter of the Wishart distribution, ν, was chosen to be 2K; this empirically seemed to be a reasonable default choice. Legendre polynomials were used as the orthonormal basis, a Poisson prior with mean 7 was placed on the number of basis functions, and a Poisson prior with mean 1 was placed on the number of principal components. For the computation, only models with fewer than 20 basis functions were run, corresponding to a truncated prior; this had practically no effect on the results, since models outside that range had negligible posterior mass. See Figure 5 for examples of the posterior distribution of (J, K). The principal components were estimated from the eigendecomposition of the estimated posterior mean of the covariance matrix. The results can be seen in Figures 1-4.

Figure 2: Simulation study: boxplots of the simulation results with κ_1 used for the truth and κ_2 used for the prior. Results are shown in pairs, with the left box (red) representing the proposed Bayesian procedure and the right (black) the FACE procedure. The value above each pair is the proportion of data sets in which the Bayesian procedure performed better.

Figure 3: Simulation study: boxplots of the simulation results with κ_2 used for both the truth and the prior. Results are shown in pairs, with the left box (red) representing the proposed Bayesian procedure and the right (black) the FACE procedure. The value above each pair is the proportion of data sets in which the Bayesian procedure performed better.
As the results show, the Bayesian procedure performs consistently well across conditions in the estimation of the principal components themselves, as measured by the angle from the truth. In all of these cases the difference is significant in a frequentist sense, under the null hypothesis that each method is equally likely to win on a given data set. The practical importance of prior information can be seen in the improved reconstructions when the true covariance is used to construct the prior. In the functional data setting, the smoothness of the underlying true observations is usually well understood scientifically in an applied context, and it should be incorporated into the analysis. The overall picture is that the proposed Bayesian method can perform competitively with modern frequentist procedures when judged by repeated-sampling criteria.

Figure 4: Simulation study: boxplots of the simulation results with κ_2 used for the truth and κ_1 used for the prior. Results are shown in pairs, with the left box (red) representing the proposed Bayesian procedure and the right (black) the FACE procedure. The value above each pair is the proportion of data sets in which the Bayesian procedure performed better.

Canadian Weather Data
To illustrate our method on real data, we analyzed the well-known Canadian weather data, freely available in the fda package in R. The data were made popular by Ramsay and Silverman (2005), and our analysis is consistent with theirs. These data consist of 35 functional observations on a common grid of 365 time points, corresponding to the average daily temperatures of 35 Canadian cities. We employ Legendre polynomials as the basis, with an unknown number of basis functions; we use a modified Poisson distribution on K with mean 7, truncated above at 30. The parameter θ_0 is taken to be zero, and the prior covariance function being approximated is κ(s, t) = exp{−3(s − t)²}. For each model, 180,000 MCMC iterations were used for estimation after 20,000 burn-in iterations. Estimates are calculated only at the sampled time points; if other time points are of interest, they should ideally be treated as missing data and incorporated into the MCMC approximations.

Figure 5: Simulation study: examples of the posterior distribution over models for each of the four experimental conditions. Row 1 corresponds to κ_1 as the prior, and column 1 corresponds to κ_1 as the truth; the other row and column correspond to κ_2.
In the posterior, almost all of the mass lies on the model with J = 12, K = 12 (96.1%), with a small amount on J = 16, K = 15 (3.9%); a plot of the marginal likelihoods for each model is given in Figure 6. The posterior estimates shown are full-model estimates, although they are extremely close to those obtained by conditioning on the maximum a posteriori model. The observations, along with their smoothed estimates, are shown in Figure 7. The estimated covariance function is shown in Figure 8, along with the implied principal components in Figure 9. The first principal component represents the overall temperature of a city throughout the year; it differentiates between generally "mild" and "cold" cities. The second principal component quantifies the relative difference in temperature between the summer and winter months, differentiating cities with a flatter temperature function from those with extremely cold winters. The higher-order principal components represent more complicated phenomena.

Discussion
There are two points that warrant further discussion beyond what has already been presented: modifications to the model under special circumstances, and the computational difficulties of fitting the model.
Although our method allows the number of basis functions used for approximation and the number of principal components to differ, in the form presented here the number of basis functions is the same for both the mean and covariance models. One possible concern is that, when the mean function requires many more basis functions to be well approximated than the covariance function does, the chosen value of K will be forced too high. To deal with this, we can allow an extra overall mean term, η, in (10), with a N_T(0, Ψ) prior, to capture the excess roughness present in the mean. Allowing Ψ to be full rank would provide great flexibility, but may lead to over-fitting. A further modification is to express η in the same basis domain as the rest of the model, letting η = H_M ξ, M > K, with a prior on ξ. This is similar to the practice in generalized additive models of allowing a different number of basis functions for each component; the addition of this term causes an identifiability issue with the collection {β_{i,K}}, which can be handled in a fashion similar to Lang and Brezger (2004). A step-wise frequentist alternative would be to de-mean the data before processing, possibly including smoothing, and then apply our method with θ removed from the model, that is, with β_{i,K} iid∼ N_K(0, Σ) in (10).
The final issue is the computational difficulty of fitting the full model. The approach we have taken is to run independent MCMC chains for each pair of values (J, K), 1 ≤ K ≤ J ≤ J_max, up to some predefined value J_max. The software that we have implemented will be made available on one of the authors' websites (https://www.ajsuarez.com). It uses R's C interface in order to take advantage of OpenMP for parallelization of the MCMC chains; specifically, for each value of J, the models 1 ≤ K ≤ J are run in parallel batches. As an example of computation time, on a data set containing 50 functional observations at 100 time points, fitting all models up to J_max = 25 takes approximately 3 seconds per 1000 MCMC steps on a 6-core Intel Haswell CPU running at 4 GHz. Computation grows on the order of J²_max, and this dominates the cost relative to the number of time points. For an MCMC procedure that provides full information on each individual model, we believe this approach to be worth the computational time.