Functional principal components analysis via penalized rank one approximation

Two existing approaches to functional principal components analysis (FPCA) are due to Rice and Silverman (1991) and Silverman (1996), both based on maximizing variance but introducing penalization in different ways. In this article we propose an alternative approach to FPCA using penalized rank one approximation to the data matrix. Our contributions are four-fold: (1) by considering invariance under scale transformation of the measurements, the new formulation sheds light on how regularization should be performed for FPCA and suggests an efficient power algorithm for computation; (2) it naturally incorporates spline smoothing of discretized functional data; (3) the connection with smoothing splines also facilitates construction of cross-validation and generalized cross-validation criteria for smoothing parameter selection that allow efficient computation; (4) different smoothing parameters are permitted for different FPCs. The methodology is illustrated with a real data example and a simulation.


Introduction
Principal components analysis is a key technique for unsupervised functional data analysis (Ramsay and Silverman, 2005), where the goal is to decompose variation in a two-way data table. Two different approaches to smoothed functional principal components analysis (FPCA) were proposed by Rice and Silverman (1991) and Silverman (1996), both of which we briefly describe in what follows. Let X(·) denote a random function which we assume for now can be observed repeatedly and as a whole, without discretization; in addition, let α be a smoothing penalty parameter. To find the j-th principal component weight function γ_j(·), the Rice–Silverman approach maximizes
\[
\frac{\operatorname{var}\!\left(\int \gamma X\right) - \alpha \int (\gamma'')^2}{\int \gamma^2} \tag{1}
\]
subject to the constraint ∫γγ̂_k = 0 for k < j, where γ̂_k is the estimated k-th principal component function, while the Silverman approach maximizes
\[
\frac{\operatorname{var}\!\left(\int \gamma X\right)}{\int \gamma^2 + \alpha \int (\gamma'')^2} \tag{2}
\]
subject to ∫γγ̂_k + α∫γ″γ̂_k″ = 0 for k < j. Both approaches impose smoothness on principal components using roughness penalty ideas, but they differ in the way they incorporate the penalty. The variance var(∫γX) in (1) and (2) needs to be estimated from a random sample of realizations of X(·).
Following Hotelling (1933), maximizing variance of a standardized linear combination of variables is the standard textbook treatment of principal components analysis (PCA, e.g., Mardia et al., 1979). A different approach is by way of fitting low rank approximations to the data matrix (Pearson, 1901), and this approach is intimately connected to the singular value decomposition (Eckart and Young, 1936). See Jolliffe (2002) for further discussion and references. In this article we apply the roughness penalty idea to a low rank approach to FPCA. As we will see, there exist difficulties in introducing roughness penalties, and a guiding principle in navigating these difficulties will be certain invariance under scale transformations. Scale invariance considerations will not only allow us to select the proper form of penalty, but also to re-derive and thereby justify the regularized variance criterion of Silverman (1996). However, penalized low rank approximation combined with invariance principles amounts to more than a novel justification for an existing approach; it has several methodological advantages over existing regularized variance approaches.
First, our approach yields an efficient variant of the power algorithm (e.g., Appendix A of Jolliffe, 2002) for calculating eigenvectors. Second, spline smoothing of discretized data is naturally built into our method, which therefore gains the theoretical advantages of smoothing splines: our method applies directly to discretized data, and the estimated FPCs are solutions of an optimization problem defined on a function space. More importantly, the connection of our method to spline smoothing helps us develop computationally efficient cross-validation (CV) and generalized cross-validation (GCV) criteria for selecting smoothing parameters. This development fills a gap in the FPCA literature, as the following comparison shows: both Rice and Silverman (1991) and Silverman (1996) assume the whole curve is available through either interpolation or smoothing, while Ramsay and Silverman (2005, Chapters 8 and 9) represent functional data using a basis expansion prior to performing PCA. Finally, our method allows different principal components to have different smoothing parameters, a beneficial flexibility that is not shared by the method of Silverman (1996).
In this paper we focus on functional data that are sampled on a common grid across subjects, a typical setting of functional data analysis. Sometimes the functions may be irregularly or even sparsely sampled, as often occurs in biomedical longitudinal studies. FPCA methods for sparsely sampled functional data or longitudinal data have been developed by, among others, James (2000), Rice (2001), and Yao et al. (2005a,b).
The rest of the article is organized as follows. Section 2 presents the new method for extracting principal components for finite dimensional vector spaces; Section 5 extends the method to function space. Section 3 describes the power algorithm. Section 4 discusses the criteria for smoothing parameter selection and also provides the derivations. Section 6 gives some numerical results. Finally, the Appendix contains some technical details.

A new framework for functional principal components
Since functional data are usually observed discretely, this section presents our method in terms of discretized data. Unlike standard multivariate analysis, we will take into account the intrinsic functional structure of the data with general ridge penalties, which in the functional case will be smoothness penalties. Recovery of the whole curve of a principal component weight function involves spline interpolation and will be discussed in Section 5. In what follows we focus on the first principal component; subsequent principal components can be extracted sequentially by removing preceding components.

Minimizing reconstruction error
Consider a collection or sample of functional data, observed or recorded at a common set of discrete observation points t 1 , . . . , t m . Denote the underlying functions for the sample as x i (·), i = 1, . . . , n. The observed data are x ij = x i (t j ), i = 1, . . . , n, j = 1, . . . , m. Let the n × m data matrix be X = (x ij ). For simplicity and without loss of generality, suppose that X is column centered, that is, the sample mean of each column of X is zero.
Consider the problem of finding the best rank one approximation of X. Any rank one matrix of size n × m can be written as uv^T, where u = (u_1, …, u_n)^T and v = (v_1, …, v_m)^T. Let ‖X‖_F denote the Frobenius norm of X. The problem can be formally stated as minimizing, with respect to u and v, the reconstruction error
\[
\|X - u v^T\|_F^2. \tag{3}
\]
For a fixed v, the u that minimizes (3) is û = Xv/v^T v. Plugging this û into (3), we find that the minimizing v̂ of (3) maximizes v^T X^T X v/v^T v. Since X^T X is proportional to the sample variance–covariance matrix, v^T X^T X v/v^T v is proportional to the sample variance of the data projected onto the direction of v. Thus v̂ is, up to a scale factor, the first principal component loading (weight) vector for X according to standard multivariate analysis textbooks (e.g., Mardia et al., 1979). We also note that the minimizing û and v̂ of (3) are the first singular vector pair of X, again up to scale factors.
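As a quick numerical check of these well-known facts, the sketch below (NumPy, simulated data; variable names are ours, not the paper's) recovers the rank one minimizer of (3) from the singular value decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
X -= X.mean(axis=0)          # column-center, as assumed in the text

# Best rank-one approximation: first singular triple of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
u = s[0] * U[:, 0]           # scores (scale absorbed into u)
v = Vt[0]                    # first PC loading vector, unit norm

# For fixed v, the minimizing u is Xv / v'v, as stated above.
assert np.allclose(u, X @ v / (v @ v))

# The reconstruction error of u v' is minimal over all rank-one matrices
# (Eckart-Young): it equals the sum of the remaining squared singular values.
err = np.linalg.norm(X - np.outer(u, v), "fro") ** 2
assert np.isclose(err, np.sum(s[1:] ** 2))
```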
The appeal of minimizing (3) is that it resembles a prediction problem. If u were observable, then (3) would be the least squares criterion for a multivariate regression. The difference is that the predictor variable u needs to be estimated here. Based on the above observation, we next discuss how to modify (3) to take into account the functional nature of data by smoothing with a roughness penalty. The connection with a prediction problem will be critical once again for deriving cross-validation criteria in Section 4.

Penalization and transformation invariance
Following the basic philosophy of functional data analysis, we assume that there is an underlying smooth function γ(·) such that v_j = γ(t_j). If the domain of the values t is the real line, we can assume the t_j are sorted. Because of the smoothness of γ(·), a pair of adjacent values v_j and v_{j+1} are necessarily tied to each other by correlation. Using the roughness penalty idea, it is natural to consider minimizing
\[
\|X - u v^T\|_F^2 + \alpha\, v^T \Omega v, \tag{4}
\]
where Ω is a non-negative definite roughness penalty matrix and α > 0 is a penalty parameter. The penalty matrix Ω is chosen such that a larger value of v^T Ω v is associated with a greater penalty on differences between adjacent values; for equispaced t_j, an intuitive choice of Ω penalizes squared second differences of v. Construction of general penalty matrices that are suitable for both smoothing and interpolation will be discussed in Section 5 (see Theorem 1).
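For intuition, here is a minimal sketch of a second-difference penalty of the kind alluded to above, assuming an equispaced grid; the matrix D below is an illustrative choice, not a construction prescribed in this article:

```python
import numpy as np

m = 8  # number of equispaced observation points (illustrative)

# Second-difference operator D: (Dv)_k = v_k - 2 v_{k+1} + v_{k+2}
D = np.zeros((m - 2, m))
for k in range(m - 2):
    D[k, k:k + 3] = [1.0, -2.0, 1.0]

Omega = D.T @ D   # v' Omega v = sum of squared second differences of v

v_lin = np.arange(m, dtype=float)     # a linear (perfectly smooth) sequence
v_rough = (-1.0) ** np.arange(m)      # an oscillating (rough) sequence

assert np.isclose(v_lin @ Omega @ v_lin, 0.0)   # linear trends are unpenalized
assert v_rough @ Omega @ v_rough > 0            # oscillation is penalized
```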
An immediate problem with (4) is that it is not invariant under the pair of scale transformations u → cu and v → v/c: while the goodness-of-fit term ‖X − uv^T‖_F^2 does not change, the roughness penalty term v^T Ω v changes. A simple possible fix is to make the penalty term unitless and to minimize
\[
\|X - u v^T\|_F^2 + \alpha\, \frac{v^T \Omega v}{v^T v}. \tag{5}
\]
Fixing v, the minimizing u is û = Xv/v^T v, which can be plugged back into (5) to show that minimizing (5) is equivalent to maximizing
\[
\frac{v^T X^T X v - \alpha\, v^T \Omega v}{v^T v}. \tag{6}
\]
This maximizing criterion is essentially the same as the Rice–Silverman criterion expressed in (1). However, the criterion (5) has a defect: the optimization problem is not invariant under scale transformation of the measurements. Under the transformation X → cX and the corresponding transformation u → cu, the goodness-of-fit term becomes c²‖X − uv^T‖_F², while the penalty term remains unchanged. The reason is that the goodness-of-fit term is not unitless but the penalty term is. The criterion (4) has the same defect because of a mismatch of units of the two terms involved. This motivates us to consider minimizing
\[
\|X - u v^T\|_F^2 + \alpha\, (u^T u)(v^T \Omega v), \tag{7}
\]
where the two terms now have the same units. Optimization of (7) is invariant under scale transformation in the following sense: if û and v̂ form its minimizer for the data matrix X, then û* = cû and v̂* = v̂ form its minimizer for the rescaled data matrix X* = cX. In other words, the minimizing v of this criterion is the same before and after the scale transformation X → cX. Fixing v, the minimizing u is û = Xv/v^T(I_m + αΩ)v, which can be plugged back into (7) to show that minimizing (7) is equivalent to maximizing
\[
\frac{v^T X^T X v}{v^T (I_m + \alpha \Omega) v}. \tag{8}
\]
This maximizing criterion is essentially the same as the Silverman criterion expressed in (2). Use of the penalized reconstruction error combined with invariance considerations indicates that the Silverman proposal is preferable to the Rice–Silverman proposal for functional principal components analysis. Such insight would not be forthcoming by studying penalized variance alone.
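The scale invariance just described can be checked numerically. The sketch below maximizes the ratio criterion (8) by solving the equivalent generalized eigenproblem with SciPy, using an illustrative second-difference penalty; it is a check of the invariance property, not the computational algorithm proposed in this article:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 15))
X -= X.mean(axis=0)

m = X.shape[1]
D = np.diff(np.eye(m), n=2, axis=0)   # second-difference operator (assumption: equispaced grid)
Omega = D.T @ D
alpha = 1.0
B = np.eye(m) + alpha * Omega

def leading_v(Xmat):
    # maximize v' X'X v / v' (I + alpha*Omega) v  ->  top generalized eigenvector
    w, V = eigh(Xmat.T @ Xmat, B)
    v = V[:, -1]
    # fix norm and sign so solutions are comparable across calls
    v = v / np.linalg.norm(v)
    return v * np.sign(v[np.argmax(np.abs(v))])

v1 = leading_v(X)
v2 = leading_v(5.0 * X)               # rescaled measurements
assert np.allclose(v1, v2, atol=1e-6)  # the minimizing v is unchanged
```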
In addition, as we shall show below, a direct methodological advantage of using the penalized reconstruction error (7) is that it facilitates extension of CV-and GCV-type smoothing parameter selection criteria for spline smoothing to FPCA.
From now on we shall focus on the criterion (7) for extracting smoothed principal components. An extension of (7) for extracting the whole principal component weight function is given in Section 5. Since we extract the principal component functions sequentially, multiple smoothing parameters are allowed. This is different from Silverman (1996), where a single smoothing parameter is used for extracting all principal component functions. The benefit of the flexibility provided by using multiple smoothing parameters will be illustrated in Section 6 with simulated data.
Remark 1. Invariance under scale transformation of the measurements is motivated by the consideration that results should remain the same under a change of metric of the data. Such invariance considerations might be moot if the data matrix were standardized prior to analysis. However, there is no obvious way to standardize functional data. If each column of the data matrix is standardized by dividing by the corresponding sample standard deviation, for example, then the functional nature of the data will be lost. Standardizing the whole data matrix by the overall sample standard deviation is not a sound operation either, because the sample variance may vary from column to column.

The power algorithm
The criterion (7) can be minimized by alternately minimizing over u and v in an iterative algorithm. Separating the scale constants, the algorithm is as follows:

Step 1. Initialize v with ‖v‖ = 1.

Step 2. Repeat until convergence:
(a) u ← Xv / v^T(I_m + αΩ)v;
(b) v ← (I_m + αΩ)^{-1} X^T u / u^T u;
(c) v ← v/‖v‖.

The initial v can be chosen to be the first right singular vector of X.
Step 2(c) forces v to have norm 1, which is somewhat arbitrary and only meant to ensure identifiability between u and v; any other normalization with a different scale trade-off between u and v would work too. For example, an alternative restriction on v is v^T(I + αΩ)v = 1, but its dependence on the smoothing parameter α renders it less convenient. When started from the first right singular vector of X, this algorithm converges rather quickly, usually in a few iterations. The algorithm is a simple variant of the power algorithm for computing eigenvectors.
When there is no penalty (α = 0), the algorithm is essentially the power algorithm for conventional PCA (e.g., page 409 of Jolliffe, 2002). As discussed below in Section 4, this iterative algorithm naturally facilitates the derivation of an explicit cross-validation criterion for smoothing parameter selection.
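The iterative scheme above can be sketched as follows (NumPy); the convergence tolerance and the second-difference penalty used for illustration are our choices, not prescriptions from this article:

```python
import numpy as np

def penalized_rank_one(X, Omega, alpha, tol=1e-10, max_iter=500):
    """Alternating minimization of ||X - u v'||_F^2 + alpha (u'u)(v' Omega v)."""
    m = X.shape[1]
    M = np.eye(m) + alpha * Omega
    # Step 1: initialize v as the first right singular vector of X
    v = np.linalg.svd(X, full_matrices=False)[2][0]
    for _ in range(max_iter):
        u = X @ v / (v @ M @ v)                        # Step 2(a): update scores
        v_new = np.linalg.solve(M, X.T @ u) / (u @ u)  # Step 2(b): smooth X'u with S(alpha)
        v_new /= np.linalg.norm(v_new)                 # Step 2(c): normalize
        done = np.linalg.norm(v_new - v) < tol
        v = v_new
        if done:
            break
    return X @ v / (v @ M @ v), v                      # final scores u and loading v
```

With alpha = 0 the updates reduce to the classical power iteration for X^T X, so the returned v coincides (up to sign) with the first right singular vector of X.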

Cross-validation and generalized cross-validation
Our formulation of FPCA suggests a new method for selecting the smoothing parameter.
Step 2(b) of the iterative algorithm essentially smooths X^T u with S(α) = (I + αΩ)^{-1} to update v. This interpretation suggests that CV and GCV criteria for selecting the spline tuning parameter α (Green and Silverman, 1994) can be adapted to FPCA. The CV score is defined as
\[
\operatorname{CV}(\alpha) = \frac{1}{m} \sum_{j=1}^m \frac{\{x_j^T u - (u^T u)\,\hat{v}_j\}^2}{\{1 - S_{jj}(\alpha)\}^2}, \tag{9}
\]
and the GCV score is defined as
\[
\operatorname{GCV}(\alpha) = \frac{(1/m)\,\|X^T u - (u^T u)\,\hat{v}\|^2}{\{1 - \operatorname{tr} S(\alpha)/m\}^2}, \tag{10}
\]
where x_j is the j-th column of X, v̂ = S(α)X^T u/u^T u with j-th element v̂_j, and S_jj(α) is the (j, j)-th element of S(α). In the implementation of either method, we can simply nest CV or GCV selection of α inside the loop, i.e., in Step 2(b) of the algorithm. The above motivation of CV and GCV is by analogy only. To give a formal justification, we show below in Section 4.2 that the CV score given above can indeed be derived from the basic idea of cross-validation, deletion of one data item at a time. Similarly, the GCV score can also be derived from the basic idea of GCV (Craven and Wahba, 1979). The technical difference of our derivations compared to those for smoothing splines is that we delete one column of X at a time, rather than one matrix entry.
The connection of column deletion in FPCA with point deletion in spline smoothing is clearly seen by considering an extreme case of (7). If X has only one row, denoted by y^T, then u is a scalar. Requiring u to have norm 1 and fixing its sign for identifiability, it is necessary that u = 1. Then (7) becomes
\[
\|y - v\|^2 + \alpha\, v^T \Omega v,
\]
which is exactly the penalized least squares criterion for smoothing splines. This connection motivates the column deletion procedure that we now elaborate.

Derivation of CV and GCV criteria
Minimization of (7) with u fixed can be considered as a ridge regression. Conditional on u, with v as the vector of regression coefficients, this ridge regression has response vector ȳ, design matrix X̄, and conditional ridge penalty matrix Ω_{v|u} given by
\[
\bar{y} = \begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix}, \qquad
\bar{X} = I_m \otimes u, \qquad
\Omega_{v|u} = (u^T u)\, \Omega,
\]
where x_j is the j-th column of X; here ȳ is of size mn × 1 and X̄ is of size mn × m. Both the design matrix X̄ and the ridge penalty depend on u. It follows immediately that the penalized sum of squares (7) is equal to
\[
\|\bar{y} - \bar{X} v\|^2 + \alpha\, v^T \Omega_{v|u}\, v, \tag{11}
\]
which constitutes a penalized least squares problem for v. Since X̄^T X̄ = (u^T u) I_m, the hat matrix of the ridge regression is
\[
H = \bar{X}\,(\bar{X}^T \bar{X} + \alpha\, \Omega_{v|u})^{-1} \bar{X}^T = \frac{1}{u^T u}\, \bar{X}\, S\, \bar{X}^T,
\]
where S = S(α) = (I + αΩ)^{-1}. Consider now the cross-validation that deletes one column of X at a time. This corresponds to deleting a block of size n from ȳ at a time. Partition the hat matrix H into an m × m array of n × n blocks, where the j-th diagonal block corresponds to the j-th column of X. Let v̂^{(−j)} = (v̂_1^{(−j)}, …, v̂_m^{(−j)})^T be the v that minimizes (11) when the j-th block of ȳ and the corresponding rows of X̄ are removed. We have the following lemma about the leave-out-one-column prediction errors.
Lemma 1. The j-th leave-out-one-column cross-validated prediction error sum of squares is
\[
\|x_j - u\, \hat{v}_j^{(-j)}\|^2
= \|x_j\|^2 - \frac{(x_j^T u)^2}{u^T u}
+ \frac{\{x_j^T u - (u^T u)\,\hat{v}_j\}^2}{(u^T u)\,(1 - \gamma_j\, u^T u)^2}, \tag{12}
\]
where γ_j = S_jj/u^T u, and S_jj is the (j, j)-th element of the matrix S = S(α).
The proof of Lemma 1 is relegated to Appendix A. Note that v̂_j is the j-th element of v̂ = S X^T u/u^T u and x_j^T u is the j-th element of X^T u. Note also that γ_j u^T u = S_jj. Since we condition on u, the first two terms on the right hand side of (12) do not depend on α and are therefore irrelevant. Averaging the last term in (12) over j, we obtain
\[
\frac{1}{m} \sum_{j=1}^m \frac{\{x_j^T u - (u^T u)\,\hat{v}_j\}^2}{(u^T u)\,\{1 - S_{jj}(\alpha)\}^2}, \tag{13}
\]
which, ignoring the u^T u factor in the denominator, is exactly the cross-validation criterion given in (9). Replacing the S_jj(α) in (13) by their average value, (1/m) tr{S(α)}, we obtain
\[
\frac{1}{m} \sum_{j=1}^m \frac{\{x_j^T u - (u^T u)\,\hat{v}_j\}^2}{(u^T u)\,\{1 - \operatorname{tr} S(\alpha)/m\}^2},
\]
which, ignoring the u^T u factor, is the generalized cross-validation criterion given in (10).

Remark 2. Both Rice and Silverman (1991) and Silverman (1996) suggest cross-validation based on row deletion in X. Row deletion lacks simple computational shortcuts such as (9); it hence involves actual computation of a large number of leave-out-one-row estimates. As a result, CV based on row deletion is computationally expensive, as confirmed by our numerical studies in Section 6.

Remark 3. Since the principal components are extracted sequentially, the squared errors in the numerators of the CV and GCV criteria vary with the principal components. The CV score (9) and the GCV score (10) are defined for extracting the first principal component. When each subsequent principal component is extracted, the X matrix in the definition of the CV and GCV scores needs to be replaced by the residual matrix after the effects of the previous principal components are removed.
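The leave-out-one-column shortcut behind Lemma 1 can be verified numerically. The sketch below refits the ridge problem with one column of X actually deleted and checks the result against the closed-form residual inflation by 1/(1 − S_jj); the second-difference penalty is an illustrative stand-in for Ω:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, alpha = 20, 9, 0.8
X = rng.standard_normal((n, m))
X -= X.mean(axis=0)
u = rng.standard_normal(n)                   # scores held fixed, as in the derivation

D = np.diff(np.eye(m), n=2, axis=0)          # illustrative second-difference penalty
Omega = D.T @ D
I = np.eye(m)
S = np.linalg.inv(I + alpha * Omega)
zbar = X.T @ u / (u @ u)                     # the "responses" being smoothed
v_hat = S @ zbar

j = 3                                        # delete column j of X
# direct leave-out-one-column fit: drop the j-th block from the ridge problem
A = (I - np.outer(I[j], I[j])) + alpha * Omega
v_minus_j = np.linalg.solve(A, zbar - I[j] * zbar[j])

# shortcut implied by Lemma 1: no refitting needed
shortcut = (zbar[j] - v_hat[j]) / (1.0 - S[j, j])
assert np.isclose(zbar[j] - v_minus_j[j], shortcut)
```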

Extracting the whole curve of the principal component weight function
So far we have focused on the discretized problem, although the functional nature has been taken into account by regularization with a second-order roughness penalty. This section introduces an optimization criterion whose minimizer gives the entire principal component weight function. It also turns out that the extracted function is a natural cubic spline that interpolates the weight vector obtained by minimizing (7). Our development relies on some standard results from spline smoothing that can be found in Green and Silverman (1994). Replacing the discretized vector v with the complete function γ(·), we propose to find the estimate γ̂(·) of the first principal component weight function by minimizing, with respect to the u_i and γ(·), the penalized sum of squares
\[
\sum_{i=1}^n \sum_{j=1}^m \{x_{ij} - u_i\, \gamma(t_j)\}^2
+ \alpha \Big(\sum_{i=1}^n u_i^2\Big) \int \{\gamma''(t)\}^2\, dt, \tag{14}
\]
where α > 0 is a smoothing parameter. The estimated principal component function γ̂(·) is the optimizer of (14) over the class of all functions that satisfy ∫{γ″(t)}² dt < ∞. Similar to (7), the two terms in (14) have the same units, and the optimization problem is invariant under scale transformation of the measurements. However, unlike (7), the whole function is recovered by optimizing (14).
We now characterize the solution to the optimization problem (14) and show that it is closely related to the solution of the discretized problem (7) with an appropriately defined penalty matrix. To this end, we define two banded matrices, Q and R, as follows. Let h_j = t_{j+1} − t_j for j = 1, …, m − 1. Let Q be the m × (m − 2) matrix with entries q_{jk}, for j = 1, …, m and k = 2, …, m − 1, given by
\[
q_{k-1,k} = \frac{1}{h_{k-1}}, \qquad
q_{kk} = -\frac{1}{h_{k-1}} - \frac{1}{h_k}, \qquad
q_{k+1,k} = \frac{1}{h_k},
\]
for k = 2, …, m − 1, and q_{jk} = 0 for |j − k| ≥ 2. To simplify the presentation, the columns of Q are numbered in a non-standard way, starting at k = 2, so that the top left element of Q is q_{12}. The symmetric matrix R is (m − 2) × (m − 2) with elements r_{jk}, for j and k running from 2 to m − 1, given by
\[
r_{jj} = \frac{h_{j-1} + h_j}{3}, \qquad
r_{j,j+1} = r_{j+1,j} = \frac{h_j}{6},
\]
and r_{jk} = 0 for |j − k| ≥ 2. The matrix R is strictly diagonally dominant and thus strictly positive definite.

Theorem 1. The γ̂(·) optimizing (14) is a natural cubic spline with knots at the t_j. Let v̂_j = γ̂(t_j) and v̂ = (v̂_1, …, v̂_m)^T. Then v̂ is the optimizer of the discretized problem (7) with the penalty matrix Ω = QR^{-1}Q^T.

The proof of Theorem 1 can be found in Appendix B. According to Theorem 1, to obtain the entire curve of the principal component weight function, one first solves (7) with the penalty matrix Ω = QR^{-1}Q^T to obtain v̂_1, …, v̂_m, and then finds the natural cubic spline that interpolates the (t_j, v̂_j). Computation of the interpolating natural cubic spline γ(·) at any evaluation point t using its values at the knots t_j is a standard operation. Specifically, let v_j = γ(t_j) and s_j = γ″(t_j). By the definition of a natural cubic spline, the second derivative of γ at t_1 and t_m is zero, so that s_1 = s_m = 0. The interpolating natural cubic spline is completely determined by its values and second derivatives at the knots t_j according to the following formula: for t_j ≤ t ≤ t_{j+1}, j = 1, …, m − 1,
\[
\gamma(t) = \frac{(t - t_j)\, v_{j+1} + (t_{j+1} - t)\, v_j}{h_j}
- \frac{(t - t_j)(t_{j+1} - t)}{6}
\left\{ \Big(1 + \frac{t - t_j}{h_j}\Big) s_{j+1}
+ \Big(1 + \frac{t_{j+1} - t}{h_j}\Big) s_j \right\};
\]
the spline outside [t_1, t_m] is obtained by linear extrapolation. The vector s = (s_2, …, s_{m−1})^T of second derivatives used above can be obtained as s = R^{-1}Q^T v. See Chapter 2 of Green and Silverman (1994) for more details on efficient computation of interpolating splines.
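A direct transcription of the Q and R construction might look like this (indices shifted to Python's zero-based convention; the grid is our illustrative choice):

```python
import numpy as np

def spline_penalty(t):
    """Penalty matrix Omega = Q R^{-1} Q' for knots t_1 < ... < t_m, so that
    v' Omega v is the integrated squared second derivative of the natural
    cubic spline interpolating (t_j, v_j)."""
    m = len(t)
    h = np.diff(t)                       # h_j = t_{j+1} - t_j
    Q = np.zeros((m, m - 2))             # columns correspond to interior knots
    R = np.zeros((m - 2, m - 2))
    for k in range(1, m - 1):            # interior knot index (zero-based)
        Q[k - 1, k - 1] = 1.0 / h[k - 1]
        Q[k,     k - 1] = -1.0 / h[k - 1] - 1.0 / h[k]
        Q[k + 1, k - 1] = 1.0 / h[k]
        R[k - 1, k - 1] = (h[k - 1] + h[k]) / 3.0
        if k < m - 2:
            R[k - 1, k] = R[k, k - 1] = h[k] / 6.0
    return Q @ np.linalg.solve(R, Q.T)

t = np.linspace(0.0, 1.0, 11)
Omega = spline_penalty(t)
v_lin = 2.0 * t + 1.0                    # linear functions have zero roughness
assert np.isclose(v_lin @ Omega @ v_lin, 0.0, atol=1e-10)
```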

Call center arrival data example
We applied the proposed method to the call center arrival data analyzed in Shen and Huang (2008). The data record the number of calls that got connected to a call center during every quarter hour between 7:00AM and midnight for every weekday between January 1 and October 26 in the year 2003. In total, there are 42 whole weeks during this period, and each day consists of 68 quarter hours. Let N_ij denote the call volume during the j-th time interval on day i. We used the transformed data X_ij = √(N_ij + 1/4), which together form a 210 × 68 matrix. The square-root transformation is used to stabilize the variance and make the distribution close to normal. The same transformation has been used previously by Brown et al. (2005) and Shen and Huang (2008). The mean curve of the data (the column mean vector of X) is quite smooth and summarizes the average intraday arrival pattern (Figure 1). It is bimodal, with a main peak around 11:00AM followed by a second, lower peak around 2:00PM. We subtracted the mean curve from the data and then applied the proposed method, with leave-out-one-column CV used to select a separate smoothing parameter for each principal component (Figure 1). The application of the leave-out-one-column GCV leads to qualitatively very similar results.
For comparison, we also followed Silverman (1996) and applied CV with row deletion to select a single smoothing parameter. This alternative method selects α = 0.13. The estimated principal component functions are plotted in Figure 1. We observe that Silverman's method undersmooths the principal component functions, while our method of using multiple smoothing parameters performs suitable smoothing. The computing times also demonstrate that CV with row deletion is computationally much more expensive than our delete-column CV because of its lack of a computational shortcut. To generate the results in Figure 1, it took 55 and 2 seconds for the two approaches, respectively, using our R program running on a Debian Linux desktop with an Intel Pentium 4 CPU with a clock speed of 2.8 GHz. We used the computational tricks presented in Appendix C when implementing both approaches.

A simulated data example
To further understand the difference between the two methods used in Section 6.1, we performed the following simulation study. The data generating model is
\[
x_{ij} = u_{i1}\, v_1(t_j) + u_{i2}\, v_2(t_j) + \epsilon_{ij}, \tag{15}
\]
where u_{i1} ∼ N(0, σ_1²), u_{i2} ∼ N(0, σ_2²), and the ε_ij are i.i.d. N(0, σ²). The parameters are chosen as n = m = 101, σ_1 = 20, σ_2 = 10, σ = 4, and the 101 grid points t_j are equally spaced in [−1, 1]. The two underlying functional principal component functions v_1 and v_2 are scaled by normalizing constants s_1 and s_2 that ensure v_1 and v_2 have unit norm. We generated one hundred simulated data sets and estimated the first two smooth principal component functions for each data set. We calculated the mean squared error (MSE) over the 101 grid points for each estimated principal component function. Panels (a) and (b) of Figure 2 are the scatterplots of the MSEs for the two methods, with each point representing a simulated data set. We observe that using a single smoothing parameter yields larger MSEs for most simulated data sets, especially for the first functional principal component (FPC). The same message is also evident in Panels (c) and (d), which provide the histograms of the ratios of MSEs of the two methods. Table 1 reports some summary statistics of the MSE ratio and shows that, by using a single instead of multiple smoothing parameters, the mean of the MSE ratio increases by 64% for the first FPC and by 8% for the second FPC. The observed difference is confirmed as significant by the sign test, which gives p-values that are essentially 0 for both FPCs. We also observe that the difference between the two methods for estimating the second FPC is not as big as that for the first one. This is because the two methods select similar smoothing parameters for the second FPC. As far as computing times go, delete-row CV on average needs 116 seconds for one simulation while delete-column CV only takes 2 seconds, using the same computer reported in Section 6.1. To see the effect of estimating the mean function on extracting the principal component functions, a mean function was added to the data generating model (15). The mean was then removed by column centering of the data matrix prior to applying the FPCA algorithms.
Figure 3 and Table 2 present the results after removing the mean, and when compared with Figure 2 and Table 1, suggest that the effect of estimating the mean function is not very significant.

Appendix B: Proof of Theorem 1
Since the natural cubic spline interpolant is the unique minimizer of ∫(γ″)² over all functions that interpolate the data (t_j, v̂_j) (Theorem 2.3 of Green and Silverman, 1994), the minimizing function γ̂(·) of (14) is necessarily a natural cubic spline with knots at the points t_j. Therefore in (14) we can restrict attention to natural cubic splines with knots at the t_j. A natural cubic spline γ(t) that interpolates the (t_j, v_j) is uniquely defined, and by Theorem 2.1 of Green and Silverman (1994),
\[
\int \{\gamma''(t)\}^2\, dt = v^T Q R^{-1} Q^T v = v^T \Omega v.
\]
Let X = (x_{ij}) and u = (u_1, …, u_n)^T. It follows immediately that the penalized sum of squares (14) can be written as
\[
\sum_{i=1}^n \sum_{j=1}^m (x_{ij} - u_i v_j)^2 + \alpha\, (u^T u)(v^T \Omega v)
= \|X - u v^T\|_F^2 + \alpha\, (u^T u)(v^T \Omega v),
\]
which is exactly the discretized criterion (7) given in Section 2.

Appendix C: Implementation details
We provide in this appendix some implementation details of our method. We first describe a method of computing smoothed PCs using any existing SVD software, then discuss efficient computation of the CV/GCV criteria for multiple candidate values of α.
Note that the penalized reconstruction error criterion (7) can be expanded as follows:
\[
\|X - u v^T\|_F^2 + \alpha\, (u^T u)(v^T \Omega v)
= \operatorname{tr}(X^T X) - 2 u^T X v + (u^T u)\, v^T (I + \alpha \Omega) v. \tag{A.1}
\]
Denote S(α) = (I + αΩ)^{-1}, ṽ = S^{-1/2}(α) v, and X̃ = X S^{1/2}(α). Then the above expression is equivalent to
\[
\operatorname{tr}(X^T X) - 2 u^T \tilde{X} \tilde{v} + (u^T u)\, \tilde{v}^T \tilde{v}, \tag{A.2}
\]
which can be simplified as
\[
\|\tilde{X} - u \tilde{v}^T\|_F^2 + \operatorname{tr}(X^T X) - \operatorname{tr}(\tilde{X}^T \tilde{X}). \tag{A.3}
\]
For a given smoothing parameter α, the minimizing u and ṽ of (A.3) can be easily obtained as the first pair of singular vectors of X̃ = X S^{1/2}(α), as discussed in Section 2.1. Since S^{1/2}(α) can be interpreted as a half-smoothing operator, the transformed matrix X̃ is obtained by half-smoothing the rows of the original data matrix X. After ṽ is obtained as the first right singular vector of X̃, we half-smooth it to obtain the smoothed PC vector v = S^{1/2}(α) ṽ. Thus, by using existing SVD software, we can avoid directly programming the iterative power algorithm. This is convenient when a high-level programming language is used for coding. The eigen decomposition of the symmetric and non-negative definite penalty matrix Ω can be written as
\[
\Omega = \Gamma \Lambda \Gamma^T,
\]
where Γ is an orthogonal matrix containing the eigenvectors and Λ is a diagonal matrix of the eigenvalues. For any value of α, the eigen decompositions of the smoothing and half-smoothing matrices are S(α) = Γ(I + αΛ)^{-1}Γ^T and S^{1/2}(α) = Γ(I + αΛ)^{-1/2}Γ^T.
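The half-smoothing recipe can be sketched as follows (NumPy); the eigendecomposition route for S^{1/2}(α) follows the identities above, while the clipping of tiny negative eigenvalues and the final normalization of v are our implementation choices:

```python
import numpy as np

def smoothed_pc_via_svd(X, Omega, alpha):
    """First smoothed PC from the SVD of the half-smoothed matrix X_tilde = X S^{1/2}(alpha)."""
    lam, Gamma = np.linalg.eigh(Omega)             # Omega = Gamma Lambda Gamma'
    lam = np.clip(lam, 0.0, None)                  # guard against round-off negatives
    S_half = Gamma @ np.diag((1.0 + alpha * lam) ** -0.5) @ Gamma.T
    X_tilde = X @ S_half                           # half-smooth the rows of X
    v_tilde = np.linalg.svd(X_tilde, full_matrices=False)[2][0]
    v = S_half @ v_tilde                           # half-smooth back to get the PC vector
    return v / np.linalg.norm(v)
```

The returned v maximizes the ratio criterion v' X'X v / v'(I + αΩ)v, which the test below checks against random competitors.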
Now we discuss efficient evaluation of the GCV criterion (10). The trace that appears in the denominator of (10) equals
\[
\operatorname{tr}\{S(\alpha)\} = \sum_k \frac{1}{1 + \alpha \lambda_k},
\]
and the numerator in (10) is
\[
\frac{1}{m}\, \|X^T u - (u^T u)\,\hat{v}\|^2
= \frac{1}{m}\, \|\{I - S(\alpha)\} X^T u\|^2
= \frac{1}{m} \sum_k \Big( \frac{\alpha \lambda_k}{1 + \alpha \lambda_k} \Big)^2 w_k^2,
\]
where w = (XΓ)^T u. That is, the numerator of the GCV criterion equals (1/m times) the squared Euclidean norm of the shrunken w, with the k-th component shrunken by a factor of αλ_k/(1 + αλ_k). When computation of GCV is needed for a set of candidate values of α, considerable computational saving is achieved, since XΓ and the eigen decomposition of Ω only need to be calculated once.
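Reusing the eigendecomposition across candidate values of α might be coded as follows; `gcv_curve` is a hypothetical helper name, and the formulas follow the trace and shrinkage identities above:

```python
import numpy as np

def gcv_curve(X, u, Omega, alphas):
    """Evaluate the GCV score for many candidate alphas, reusing the
    eigendecomposition Omega = Gamma Lambda Gamma' (computed once)."""
    m = X.shape[1]
    lam, Gamma = np.linalg.eigh(Omega)
    w = Gamma.T @ (X.T @ u)                       # w = (X Gamma)' u, computed once
    scores = []
    for a in alphas:
        shrink = a * lam / (1.0 + a * lam)        # per-component shrinkage factors
        num = np.sum((shrink * w) ** 2) / m       # (1/m) ||(I - S) X'u||^2
        tr_S = np.sum(1.0 / (1.0 + a * lam))      # tr S(alpha)
        scores.append(num / (1.0 - tr_S / m) ** 2)
    return np.array(scores)
```

Each candidate α then costs only O(m) operations after the one-time O(m³) eigendecomposition.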