Penalized Orthogonal-Components Regression for Large p Small n Data

We propose a penalized orthogonal-components regression (POCRE) for large p small n data. Orthogonal components are sequentially constructed to maximize, upon standardization, their correlation to the response residuals. A new penalization framework, implemented via empirical Bayes thresholding, is presented to effectively identify sparse predictors of each component. POCRE is computationally efficient owing to its sequential construction of leading sparse principal components. In addition, such construction offers other properties such as grouping highly correlated predictors and allowing for collinear or nearly collinear predictors. With multivariate responses, POCRE can construct common components and thus build up latent-variable models for large p small n data.


Introduction
Available high-throughput biotechnologies make it possible to comprehensively analyze genomic, proteomic, or metabolomic profiles of biological samples, thus identifying molecular signatures to understand complex biological systems. Such profile analysis holds enormous promise for its use in early disease detection, assessment of prognosis, measurement of drug efficacy, and eventually, personalized medicine. However, it usually entails collection of a massive number of possible predictors (i.e., large p) from each of a small number of biological individuals (i.e., small n), and therefore identifying the underlying sparse predictors presents the task of "finding a few needles in a haystack". The structured and noisy predictors make the task even more difficult. Breiman (1996) showed that classical step-wise regression is unstable since modifying a single observation can change the fitted model significantly. On the other hand, ridge regression is stable but lacks the ability to select variables. Tibshirani (1996) employed an ℓ1-norm penalty and proposed the lasso method, which gained popularity due to its ability to select variables while exhibiting the stability of ridge regression. This method has a Bayesian interpretation with independent Laplace priors (Tibshirani (1996); Park and Casella (2008)). However, lasso lacks the grouping property, that is, it tends to select only one predictor from a group of highly correlated predictors; see Zou et al. (2005) for more details.
The grouping property plays an important role in analyzing p ≫ n data with clustered but noisy predictors. The predictors for molecular signatures are naturally grouped due to shared metabolomic pathways or biological processes, and are preferably included in or excluded from the model simultaneously. Moreover, highly correlated predictors can borrow strength from each other to counter the noise effect. Many lasso variants have therefore been proposed to take advantage of grouped predictors either implicitly or explicitly. For example, Zou et al. (2005) proposed the elastic net (EN), which adds an ℓ2-norm penalty; Tibshirani et al. (2005) proposed the fused lasso, which includes another ℓ1-norm penalty to encourage similarity between adjacent coefficients; and Yuan and Lin (2006) proposed the group lasso, which modifies the ℓ1-norm penalty for grouped coefficients.
Another strategy in analyzing p ≫ n data is to first reduce the dimension of predictors by constructing components, i.e., "eigen" predictors, and then fit regression models by applying step-wise approaches to these components. Such construction of components not only provides a potential solution to the "curse of dimensionality", but also groups predictors which are highly correlated or share certain common coherent patterns. Both unsupervised and supervised dimension reduction methods have been proposed. While many unsupervised methods build on principal component analysis (PCA; Hastie et al. (2000), Bair et al. (2006), Cook (2007)), partial least squares (PLS; Garthwaite (1994)) regression is a supervised approach that has been widely used in chemometrics and bioinformatics; see Kramer (1998), and Nguyen and Rocke (2002), among others.
In this paper we propose a penalized orthogonal-components regression (POCRE) via a new penalization framework which can effectively identify sparse predictors from a large number of candidates. Section 2 presents the general idea of orthogonal-components regression, and the penalized orthogonal-components regression is proposed in Section 3. The penalization is implemented in Section 4 using the empirical Bayes thresholding proposed by Johnstone and Silverman (2004). Such implementation allows adaptive identification of sparse predictors and leads to the computationally efficient POCRE algorithm, which is summarized in Section 5. Simulation studies and a real data analysis are presented in Sections 6 and 7, respectively. We conclude this paper with a discussion.

Orthogonal-Components Regression
To illustrate the ideas behind the orthogonal-components regression, we assume the model

Y = β^T X + ǫ, (1)

where Y is a k-dimensional column vector, X is a p-dimensional column vector independent of ǫ, E[X] = 0, and β is a p × k matrix. When var(X) is non-singular and the sample size n is reasonably larger than p, either the likelihood method or the method of moments can provide a satisfactory estimate of β.
Here we are interested in estimating β in the large p paradigm. First, var(X) may be singular or nearly singular due to collinear or highly correlated predictors in X. Second, when p is too large, it is usually infeasible to assume that the sample size n is larger than p. In either case, it is difficult, if not impossible, to estimate β using the classical methods.
To avoid possible problems with large p, we construct orthogonal components as linear combinations of all predictors in X, and then regress Y on these orthogonal components. Such orthogonal components can be constructed sequentially. Specifically, let X̃_1 = X and Ỹ_1 = Y. The first component ω_1^T X̃_1 is constructed with ω = ω_1 maximizing cov(Ỹ_1, ω^T X̃_1)^2 under the condition ‖ω‖ = 1. Since cov(Ỹ_1, ω^T X̃_1)^2 = cov(Y, ω^T X)^2, ω_1 is the leading eigenvector of cov(Y, X)^T cov(Y, X). Here the leading eigenvector refers to the one with the largest eigenvalue. When Y is univariate, i.e., k = 1, ω_1 ∝ cov(Y, X)^T.
After constructing the j-th component ω_j^T X̃_j, we remove it from X̃_j by setting X̃_{j+1} = X̃_j − θ_j ω_j^T X̃_j, where θ_j is chosen such that X̃_{j+1} is uncorrelated to ω_j^T X̃_j. We also remove ω_j^T X̃_j from Ỹ_j by setting Ỹ_{j+1} = Ỹ_j − ϑ_j ω_j^T X̃_j, with ϑ_j chosen such that Ỹ_{j+1} is uncorrelated to ω_j^T X̃_j. The next loading ω_{j+1} then maximizes cov(Ỹ_{j+1}, ω^T X̃_{j+1})^2 under the condition ‖ω‖ = 1. Note that ω_{j+1} is the leading eigenvector of cov(Y, X̃_{j+1})^T × cov(Y, X̃_{j+1}). When k = 1, ω_{j+1} equals the normalized cov(Y, X̃_{j+1})^T. This construction stops whenever Y is uncorrelated to X̃_j. Since each X̃_j is a linear combination of the original predictors X, we denote the j-th component as ϖ_j^T X. Upon completion of the construction, ϖ_1^T X, ϖ_2^T X, · · ·, are uncorrelated, i.e., they constitute a sequence of orthogonal components, which lead to the orthogonal-components regression model (2) of Y on these components.
Compared to the original regression (1), the orthogonal-components regression (2) can be fit by only calculating eigenvectors of matrices rather than their inverses, which makes it appealing for analyzing p ≫ n data. Furthermore, even if the predictors are highly correlated or collinear, the orthogonal-components regression is still able to provide a robust solution. The calculation is very fast because ϖ_1^T X, ϖ_2^T X, · · ·, can be easily constructed and are uncorrelated.
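To make the sequential construction concrete, the following is a minimal NumPy sketch for a univariate response, using sample quantities in place of the population covariances; the function name, stopping tolerance, and number of components are illustrative, not from the paper.

```python
import numpy as np

def orthogonal_components(X, y, n_comp=3):
    """Sequentially construct mutually orthogonal components (univariate y).

    Each loading is proportional to the sample cov(y_j, X_j), the direction
    maximizing the squared covariance with the current residual response;
    X and y are then deflated so successive components are uncorrelated.
    """
    Xj = X - X.mean(0)
    yj = y - y.mean()
    scores = []
    for _ in range(n_comp):
        w = Xj.T @ yj
        norm = np.linalg.norm(w)
        if norm < 1e-12:               # stop once y is uncorrelated with X_j
            break
        w /= norm
        t = Xj @ w                     # the new orthogonal component
        tt = t @ t
        Xj = Xj - np.outer(t, t @ Xj) / tt   # deflate X_j ...
        yj = yj - t * (t @ yj) / tt          # ... and the response residual
        scores.append(t)
    return np.column_stack(scores) if scores else np.zeros((X.shape[0], 0))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(30)
T = orthogonal_components(X, y, n_comp=3)   # columns are mutually orthogonal
```

Because the columns of T are uncorrelated, regressing y on them requires no matrix inversion beyond scalar divisions.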

Penalized Orthogonal-Components Regression
Implementing the orthogonal-components regression (2) amounts to finding the leading eigenvector of cov(Y, X̃_j)^T cov(Y, X̃_j) to construct the j-th component ϖ_j^T X. However, the involved covariances are not observed and need to be estimated from the observed data, say the i.i.d. sample (Y_{n×k}, X_{n×p}). Wold (1975) estimated the covariances with their empirical estimates and proposed the partial least squares, in which each subsequently constructed component is a linear combination of all available predictors. For p ≫ n data, especially when only a small number of predictors contribute to the response variables, partial least squares regression therefore inflates the prediction errors, besides the difficulty in interpreting the results. Here we instead pursue a penalized construction with sparse loadings. Let M be an estimate of cov(Y, X̃_j). A major step in implementing the orthogonal-components regression is to find the leading sparse eigenvector of M^T M. The following theorem by Zou et al. (2006) implies that finding the leading eigenvector can be cast as an optimization problem, which sheds light on constructing sparse eigenvectors.
Theorem 2. (Zou et al. (2006)) For any κ > 0, let

(α̂, γ̂) = argmin_{α,γ} ‖M − Mγα^T‖^2 + κ‖γ‖^2, subject to ‖α‖ = 1. (3)

Then γ̂ is proportional to the leading eigenvector of M^T M. To ensure a sparse principal component, we consider a general version of the criterion (3), i.e., with tuning parameter λ and penalty function p_λ(γ),

(α̂, γ̂) = argmin_{α,γ} ‖M − Mγα^T‖^2 + κ‖γ‖^2 + p_λ(γ), subject to ‖α‖ = 1. (4)

Here the penalty is introduced to benefit estimating covariances and thresholding γ such that most of the elements of γ are zero, i.e., γ is sparse. While Theorem 2 implies that the specific value of κ does not affect the solution to the optimization problem (3), a sparse γ can be derived from a problem without specifying κ in (4), namely

(α̂, γ̂) = argmin_{α,γ} ‖M − Mγα^T‖^2 + p_λ(γ), subject to ‖α‖ = 1. (5)
We will iteratively solve (5) for α̂ and γ̂. First, for a given γ, we have α̂(γ) = M^T Mγ/‖M^T Mγ‖. Second, for a given α, we have

γ̂(α) = argmin_γ { ‖M − Mγα^T‖^2 + p_λ(γ) }, (6)

which will be approximated using the empirical Bayes thresholding as discussed in the following section.
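With the thresholding turned off, the approximate update for γ reduces to Z = M^T Mα, and the alternation becomes a power iteration on M^T M, recovering the leading eigenvector of Theorem 2. A quick numerical check (the matrix sizes and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 12))   # stands in for an estimate of cov(Y, X_j)

# Alternate the two unpenalized updates: alpha ∝ M^T M gamma, gamma ∝ M^T M alpha.
gamma = rng.standard_normal(12)
gamma /= np.linalg.norm(gamma)
for _ in range(500):
    alpha = M.T @ (M @ gamma)
    alpha /= np.linalg.norm(alpha)
    gamma = M.T @ (M @ alpha)
    gamma /= np.linalg.norm(gamma)

# Compare with the leading right singular vector of M, i.e. the leading
# eigenvector of M^T M (up to sign).
lead = np.linalg.svd(M, full_matrices=False)[2][0]
```

Each full sweep multiplies γ by (M^T M)^2 and renormalizes, so convergence to the leading eigenvector follows from the standard power-method argument whenever the top eigenvalue is simple.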

Penalization via Empirical Bayes Thresholding
Denote Z = M^T Mα. Then solving for γ̂(α) in (6) is subject to minimizing ‖Z − γ‖^2 + p_λ(γ) with respect to γ. Suppose the i-th component of Z is z_i, and further assume z_i = μ_i + ǫ_i with ǫ_i ~ N(0, σ^2). Since p is large and most of {μ_i, 1 ≤ i ≤ p} are zero, the variance σ^2 can be estimated robustly by the median absolute deviation of {z_i, 1 ≤ i ≤ p}, i.e.,

σ̂ = median_i |z_i| / 0.6745. (7)

Note that this estimate partially accounts for under- or over-dispersion due to dependent data; see Efron (2004). When implementing the penalization of POCRE, we also introduce a tuning parameter λ to account for possible over-dispersion when standardizing z_i using λσ̂. Without loss of generality, hereafter we assume ǫ_i iid ~ N(0, 1). When p_λ(·) is specified by the negative logarithm of a prior density function, the optimal γ is indeed a Bayesian estimate of (μ_1, · · ·, μ_p)^T. In consideration of the sparsity of γ, we employ the empirical Bayes thresholding (EBT) proposed by Johnstone and Silverman (2004, 2005) for a better approximation to the leading sparse eigenvector of M^T M.
Specifically, we assume a mixture prior with a point mass at zero and a quasi-Cauchy distribution for each μ_i, i.e.,

μ_i ~iid (1 − w) δ_0(·) + w g(·),

where δ_0(·) is Dirac's delta function and g(·) is the quasi-Cauchy density. Since the marginal density of z_i is (1 − w)φ(z_i) + w(g ⋆ φ)(z_i), where φ is the standard normal density and ⋆ denotes convolution, an estimate of w, say ŵ, can be calculated by maximizing the marginal likelihood. Then μ_i can be estimated by the posterior median, i.e., μ̂_i = median(μ_i | z_i). As ŵ provides a data-driven estimate of the parameter sparsity, the resultant estimate is adaptive to the sparsity of the underlying parameter. Johnstone and Silverman (2004) also showed that the empirical Bayes estimator μ̂(z) is a thresholding estimator in the sense that (i) μ̂(z) is increasing in z ∈ R; (ii) |μ̂(z)| ≤ |z| for all z ∈ R; (iii) μ̂(−z) = −μ̂(z); and (iv) there exists τ > 0 such that μ̂(z) = 0 if and only if |z| ≤ τ. As noted above, although μ̂_i is constructed by assuming all components of Z are independent, the estimate σ̂ in (7) and the tuning parameter λ in the penalty function p_λ(·) account for possible dependence. In practice, ten-fold cross-validation can be employed to elicit the optimal value of λ, typically ranging from 0.6 to 1. As demonstrated by our simulation studies, it usually suffices to consider λ ∈ {0.8, 0.81, 0.82, · · ·, 1}.
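The quasi-Cauchy posterior median has no one-line closed form, so the sketch below illustrates only the surrounding machinery: the robust scale estimate σ̂ and a stand-in thresholding rule (soft thresholding at an illustrative universal threshold) that satisfies properties (i)–(iv). POCRE itself uses the empirical Bayes posterior median here, not this surrogate.

```python
import numpy as np

def mad_sigma(z):
    """Robust scale estimate: with most mu_i equal to zero, the median
    absolute deviation of z reflects the noise level alone
    (0.6745 is the 0.75 quantile of the standard normal)."""
    return np.median(np.abs(z - np.median(z))) / 0.6745

def soft_threshold(z, tau):
    """Stand-in thresholding rule: monotone, shrinks toward zero, odd,
    and exactly zero iff |z| <= tau -- properties (i)-(iv) in the text."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(1)
mu = np.zeros(500)
mu[:10] = 7.0                                    # a few strong signals
z = mu + rng.standard_normal(500)
sigma_hat = mad_sigma(z)
tau = sigma_hat * np.sqrt(2.0 * np.log(500.0))   # illustrative threshold choice
mu_hat = soft_threshold(z, tau)                  # sparse estimate of mu
```

Unlike this fixed-threshold surrogate, the empirical Bayes rule adapts its effective threshold τ through the estimated mixing weight ŵ, shrinking more aggressively when the data look sparser.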

The Algorithm
Without loss of generality, we further assume that both X and Y are centered. Therefore, an estimate of cov(Y, X) is M ∝ Y^T X. Suppose ω_1, · · ·, ω_{j−1} have been calculated, and X_j has been updated accordingly. An estimate of cov(Y, X̃_j) is proportional to Y^T X_j. We can therefore proceed to find ω_j as follows:

1. Initialize γ to be the leading eigenvector of X_j^T YY^T X_j;
2. α = X_j^T YY^T X_j γ, normalized such that ‖α‖ = 1;
3. Z = X_j^T YY^T X_j α;
4. Update γ by applying the empirical Bayes thresholding of Section 4 to Z standardized by λσ̂;
5. Repeat 2 - 4 until convergence, then ω_j = γ/‖γ‖;
6. η_j = X_j ω_j;
7. P_j = (η_j^T η_j)^{−1} η_j^T X_j, and X_{j+1} = X_j − η_j P_j.

Note that the first five steps calculate the first principal component of X_j^T YY^T X_j in a manner adaptive to the sparsity of the non-zero loadings. Among these steps, the first step may be easily implemented using the following power method (Stewart (1974)), which has been used for the nonlinear iterative partial least squares (NIPALS; Wold (1975)):

1.a. Initialize ψ to be the first column of Y;
1.b. γ = X_j^T ψ;
1.c. γ = γ/‖γ‖;
1.d. ϕ = Y^T X_j γ, normalized such that ‖ϕ‖ = 1;
1.e. ψ = Yϕ;
1.f. Repeat 1.b - 1.e until the convergence of γ.
When ω_j converges to the leading eigenvector of X_j^T YY^T X_j, η_j is an eigenvector of X_j X_j^T YY^T, which defines the j-th orthogonal component. Note that P_j in Step 7 helps calculate X_{j+1} due to the fact that η_j^T X_{j+1} = 0. Since X_{j+1} = X_j − η_j P_j = X_j(I − ω_j P_j), when writing X_{j+1} = Xζ_{j+1}, ζ_{j+1} can be calculated sequentially as follows:

ζ_1 = I_{p×p}; ζ_{j+1} = ζ_j(I − ω_j P_j), j = 1, 2, · · ·.
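A hedged end-to-end sketch of the loop follows. The empirical Bayes thresholding of Step 4 is replaced here by a simple soft threshold at λσ̂, so this shows the structure of the algorithm (initialization, alternating updates, deflation, and the ζ recursion) rather than the paper's exact estimator; all constants and names are illustrative.

```python
import numpy as np

def pocre(X, Y, n_comp=2, lam=3.0, tol=1e-8, max_iter=200):
    """Structural sketch of POCRE; soft thresholding stands in for EBT."""
    n, p = X.shape
    Xj = X - X.mean(0)                 # centered predictors, X_1
    Yc = Y - Y.mean(0)
    zeta = np.eye(p)                   # accumulates X_{j+1} = X_centered @ zeta
    omegas, scores = [], []
    for _ in range(n_comp):
        M = Yc.T @ Xj                  # estimate of cov(Y, X_j), up to a constant
        # Step 1: leading eigenvector of M^T M = leading right singular vector of M
        gamma = np.linalg.svd(M, full_matrices=False)[2][0]
        for _ in range(max_iter):
            alpha = M.T @ (M @ gamma)                  # step 2
            alpha /= np.linalg.norm(alpha)
            Z = M.T @ (M @ alpha)                      # step 3
            sig = np.median(np.abs(Z)) / 0.6745        # robust scale of Z
            g = np.sign(Z) * np.maximum(np.abs(Z) - lam * sig, 0.0)  # step 4
            if np.linalg.norm(g) == 0.0:               # everything thresholded away
                gamma = g
                break
            g /= np.linalg.norm(g)
            done = np.linalg.norm(g - gamma) < tol
            gamma = g
            if done:
                break
        if np.linalg.norm(gamma) == 0.0:
            break                                      # no further sparse component
        omega = gamma                                  # step 5
        eta = Xj @ omega                               # step 6: j-th component
        P = (eta @ Xj) / (eta @ eta)                   # step 7: deflation loadings
        Xj = Xj - np.outer(eta, P)
        zeta = zeta @ (np.eye(p) - np.outer(omega, P))
        omegas.append(omega)
        scores.append(eta)
    W = np.column_stack(omegas) if omegas else np.zeros((p, 0))
    T = np.column_stack(scores) if scores else np.zeros((n, 0))
    return W, T, zeta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 40))
Y = (X[:, :3] @ np.full(3, 3.0) + 0.5 * rng.standard_normal(100)).reshape(-1, 1)
W, T, zeta = pocre(X, Y, n_comp=2, lam=2.0)
```

By construction the component scores in T are mutually orthogonal, and every column of the deflated matrix X ζ is orthogonal to all extracted components.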

Simulation Studies
We consider five different cases of large p small n data to evaluate the performance of POCRE and compare it with other approaches, namely partial least squares (PLS), ridge regression, lasso, and elastic net (EN). The first two cases have highly and mildly correlated predictors, respectively; the third has clustered predictors; the fourth demonstrates a measurement-error model; and the fifth features a latent-variable model. In all cases, we fix p = 1000 and consider both n = 50 and n = 100.

Case 4 (Errors in Predictors)
Here we evaluate the algorithms on the basis of two criteria: the loss, defined as E[‖Y − Ŷ‖^2] − tr{var(Y|X)}, and the false discovery rate (FDR). In each case, we simulated 100 datasets, and calculated the loss and FDR on the basis of the estimated parameters. Ten-fold cross-validation is used to find the optimal tuning parameters for EN, lasso, POCRE, and ridge regression, and the optimal number of components for PLS.
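As a concrete reading of the FDR criterion, the helper below (the function name and the zero-selection convention are ours, not the paper's) computes the fraction of selected predictors that lie outside the true support:

```python
def false_discovery_rate(selected, true_support):
    """Fraction of selected predictors that are false discoveries.

    Returns 0.0 when nothing is selected (a common convention).
    """
    selected, true_support = set(selected), set(true_support)
    if not selected:
        return 0.0
    return len(selected - true_support) / len(selected)

# e.g. selecting predictors {0, 1, 2, 9} when only {0, 1, 2} are true
fdr = false_discovery_rate({0, 1, 2, 9}, {0, 1, 2})   # -> 0.25
```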
Since neither PLS nor ridge regression selects variables and both instead build up the model using all available predictors, FDR is not reported for either method. In all cases, both methods report very large losses compared to the other three methods due to inflated prediction errors by using all predictors. It is interesting to note that both PLS and ridge regression perform similarly in terms of losses, although PLS is able to build common components for multivariate responses.
In Case 1 with highly correlated predictors, both lasso and POCRE present much smaller losses than EN, as shown in Table 1. When the correlations between predictors are mild, as in Case 2, the losses of both EN and POCRE decrease dramatically while the loss of lasso increases when n = 100. For n = 50, all three methods have larger losses, with lasso increasing the most. In both cases, lasso presents the smallest losses. However, POCRE is able to build common components shared by multiple responses and thereby lower the losses, as shown in Case 5. Indeed, POCRE has a much smaller loss than the other methods for n = 100, and is comparable to lasso for n = 50.
In Case 3 with clustered predictors, POCRE performs extremely well compared to all other methods. In Case 4 with errors in predictors, POCRE also presents the smallest losses. Indeed, in Case 3, POCRE decreases the losses by 55.82% and 79.30% relative to the best of the other methods for n = 50 and n = 100, respectively; in Case 4, the decreases are 22.61% and 40.00% for n = 50 and n = 100, respectively. Therefore, POCRE prevails in handling clustered or noisy predictors because it builds up components by maximizing their correlations to the response variables. In all cases, POCRE performs the best in terms of FDR, as shown in Table 2. With n = 100, POCRE controls the FDR under 25% in all cases except Case 1, where the FDR is 57.45%, as POCRE tends to include predictors which are highly correlated with the true predictors. On the other hand, lasso presents FDRs as high as 84.21%, with the lowest at 57.45%. Not surprisingly, EN performs better than lasso in Case 3, with the lowest FDR at 41.18%, as it can account for group effects of predictors; however, it presents higher FDRs than lasso in all other cases. With n = 50, although POCRE still presents lower FDRs than the other two methods, all methods present high FDRs except POCRE in Case 3, where its FDR is 18.92%.

A Real Data Analysis
Lan et al. (2006) designed an experiment to identify the genetic basis of differences between two inbred mouse populations (B6 and BTBR). A total of 60 arrays were used to monitor the expression levels of 22,690 genes of 31 female and 29 male mice. Some physiological phenotypes, including the expression levels of stearoyl-CoA desaturase 1 (SCD1), glycerol-3-phosphate acyltransferase (GPAT), and phosphoenolpyruvate carboxykinase (PEPCK), were also measured by quantitative real-time RT-PCR. The gene expression data and the phenotypic data are available in GEO (http://www.ncbi.nlm.nih.gov/geo; accession number GSE3330).

We adjusted the phenotypic values to remove possible gender effects. For each phenotype, its correlation to each gene was calculated; an overall correlation coefficient (OCC) of the three phenotypes to a single gene is then defined as the minimum of the absolute values of the correlation coefficients between the gene and the three phenotypes. Here we investigated expression profiles of the top 5,000 genes (ranked on the basis of OCC) to predict the three physiological phenotypic values. We set up the test dataset with 5 randomly selected female and 5 randomly selected male mice, and included the rest in the training dataset. We built the model using the training dataset and then calculated the sum of squared prediction errors (SSPE) using the test data.
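Reading the OCC definition as the minimum absolute correlation across the phenotypes, a minimal sketch (function name ours) is:

```python
import numpy as np

def overall_correlation(gene, phenotypes):
    """OCC of one gene: the minimum absolute correlation between the gene's
    expression and each phenotype, so a gene ranks high only when it
    correlates with all phenotypes simultaneously."""
    return min(abs(np.corrcoef(gene, ph)[0, 1]) for ph in phenotypes)

# toy check: a gene perfectly (anti-)correlated with both phenotypes has OCC 1
gene = np.arange(10.0)
occ = overall_correlation(gene, [2.0 * gene + 1.0, -gene])
```

Ranking genes by this score and keeping the top 5,000 then reduces the predictor set before fitting POCRE.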
With each of EN, lasso, and POCRE, we separately built regression models for each of the three physiological phenotypes. The results are presented in Table 3. Overall, lasso tends to select a small number of predictors, and also reports the largest SSPE. On the other hand, POCRE reports the smallest SSPE for each phenotype; it selects fewer predictors than EN for both SCD1 and PEPCK, but more for GPAT. POCRE generates three components for SCD1 (see Figure 1), and one component for each of the other two phenotypes (results not shown).
We also fit a multivariate-response regression model for the three phenotypes using POCRE. Four common components are generated using a total of 277 genes. The resultant model reports a total SSPE of 22.85 (i.e., 2.79, 18.14, and 1.93 for SCD1, GPAT, and PEPCK, respectively). The univariate and multivariate models built by POCRE share only 21 genes for SCD1, 59 genes for GPAT, and 36 genes for PEPCK, although they report similar SSPE values.

Discussion
Effective dimension reduction is crucial for a successful analysis of p ≫ n data. Traditional unsupervised dimension reduction can exclude many features when constructing sparse predictors, but the false discovery rate (FDR) can be very high. On the other hand, available supervised dimension reduction, such as PLS, ignores the sparse nature of the underlying signatures. Furthermore, all these methods assume that the predictors are accurately measured, and do not incorporate functional relatedness of candidates. As a result, despite years of searching, only a handful of predictive biomarkers have advanced to general clinical practice. Clearly, more effective approaches are called for if the true potential of predictive molecular signatures is to be realized. POCRE builds up orthogonal components by aggregating contributions of predictors along the direction which maximizes their correlations to the response variables or residuals (when predictors are standardized). It sequentially constructs these orthogonal components by finding penalized leading principal components. The involved computation is efficient and feasible for large p small n data. For the training dataset in Section 7, with n = 50 and p = 22,690, POCRE, coded in MATLAB, took less than two minutes to fit the regression model with four components (the tuning parameter was set at λ = 0.75, running on a desktop computer with an Intel 3.0 GHz Core 2 Duo CPU).
POCRE implements the penalization via empirical Bayes thresholding. Since this thresholding is constructed with a sparsity-adaptive prior, POCRE automatically selects sparse variables in the large p small n paradigm. As shown in the simulation studies, it provides a clear and significant benefit to the general task of variable selection in this paradigm, even with clustered or noisy predictors. The real data analysis confirmed the utility of the new method in molecular profiling, indicating enormous promise for its use in transcriptional profiling (genomics), protein profiling (proteomics), methylation profiling (epigenomics), and metabolite profiling (metabolomics). The full potential of the new framework, however, lies in providing breakthrough solutions to implementing the Bayesian penalization for structured noisy features.