Sufficient principal component regression for pattern discovery in transcriptomic data

Abstract Motivation Methods for the global measurement of transcript abundance such as microarrays and RNA-Seq generate datasets in which the number of measured features far exceeds the number of observations. Extracting biologically meaningful and experimentally tractable insights from such data therefore requires high-dimensional prediction. Existing sparse linear approaches to this challenge have been stunningly successful, but some important issues remain. These methods can fail to select the correct features, predict poorly relative to non-sparse alternatives or ignore any unknown grouping structures for the features. Results We propose a method called SuffPCR that yields improved predictions in high-dimensional tasks including regression and classification, especially in the typical context of omics with correlated features. SuffPCR first estimates sparse principal components and then estimates a linear model on the recovered subspace. Because the estimated subspace is sparse in the features, the resulting predictions will depend on only a small subset of genes. SuffPCR works well on a variety of simulated and experimental transcriptomic data, performing nearly optimally when the model assumptions are satisfied. We also demonstrate near-optimal theoretical guarantees. Availability and implementation Code and raw data are freely available at https://github.com/dajmcdon/suffpcr. Package documentation may be viewed at https://dajmcdon.github.io/suffpcr. Contact daniel@stat.ubc.ca Supplementary information Supplementary data are available at Bioinformatics Advances online.


Introduction
Global transcriptome measurement with microarrays and RNA-Seq is a staple approach in many areas of biological research and has yielded numerous insights into gene regulation. Given data from such experiments, it is often desirable to identify a small number of transcripts whose expression levels are associated with a phenotype of interest (for instance, disease-free survival of cancer patients). Indeed, projects such as The Cancer Genome Atlas (TCGA) have aimed to generate massive volumes of such data to enable molecular characterization of various cancers. While these data are readily available, their high-dimensional nature (tens of thousands of transcript measurements from a single experiment) makes identification of a compact gene expression signature statistically and computationally challenging. While the identification of a minimal gene expression signature is valuable in evaluating disease prognosis, it is also helpful for guiding experimental exploration. In practical terms, a set of five genes highly associated with a certain disease phenotype can be characterized more rapidly, at lower cost, and in more depth than a set of 50 or 100 such genes using genetic techniques such as CRISPR knockout and cancer biological methods such as xenotransplantation of genetically-modified cells into mice. Therefore, this paper prioritizes selecting a small subset of transcript measurements which still provide accurate prediction of phenotypes.
With these goals in mind, supervised linear regression techniques such as ridge regression [22], the lasso [42], elastic net [50], or other penalized methods are often employed. More commonly, especially in genomics applications, the outcomes of interest tend to be the result of groups of genes, which perhaps together describe more complicated processes. Therefore, researchers often turn to unsupervised methods such as principal component analysis (PCA), principal component regression (PCR), and partial least squares (PLS) for both preprocessing and as predictive models [e.g., 7,17,26,43].
In genomics, one may collect expression measurements for thousands of genes from microarrays or RNA-Seq with the goal of predicting phenotypes or class outcomes. In these settings, the number of patients is much smaller than the number of gene measurements, and researchers are interested in (1) the accurate prediction of the phenotype, (2) the correct identification of a handful of predictive genes, and (3) computational tractability. Among these properties, the correct identification of a small number of predictive genes is of crucial importance in practice, since it can lead biologists to further investigate specific genes through CRISPR knockout or other techniques. It is this genetic pattern discovery for which our proposed methodology is intended: data with many more measurements than observations; the potential that some of the measurements may be grouped or correlated; the existence of either a continuous or discrete outcome we wish to predict; and the belief that these predictions only depend on some small collection of groups rather than the entire set of measurements.

Recent related work
PCA has two main drawbacks when used in high dimensions. The first is that PCA is non-sparse, so it uses information from all the available genes instead of selecting only those which are important, a key objective in omics applications. That is, the right singular vectors or "eigengenes" [1] depend on all the genes measured rather than a small collection. The second is that these sample principal components are not consistent estimators of the population parameters in high dimensions [25]. This means essentially that when the number of patients is smaller than the number of genes, even if the first eigengene could perfectly explain the data, PCA will not be able to recover it.
Modern approaches specifically for pattern discovery in the genomics context such as supervised gene shaving [19], tree harvesting [20], and supervised principal components (SPC) [4,5,38] seek to combine the presence of the phenotype with the structure estimation properties of eigendecompositions on the gene expression measurements using unsupervised techniques to obtain the best of both. PLS is common in genomics [8, 28, e.g.], though it remains uncommon in statistics and machine learning, and its theoretical properties are poorly understood. Other recent PCA-based approaches for genetics, though not directly applicable for prediction are SMSSVD [21] and ESPCA [35].

Contributions
In this paper, we leverage the strong theoretical properties associated with sparse PCA to improve predictive accuracy for regression and classification problems in genomics. We avoid the strong assumptions necessary for SPC, the current state-of-the-art, while obtaining the benefits associated with sparse subspace estimation. In the case that the phenotype is actually generated as a linear function of a handful of genes, our method, SuffPCR, performs nearly optimally: it does as well as if we had known which genes were relevant beforehand. Furthermore, we justify theoretically that our procedure can both predict accurately and recover the correct genes. Our contributions can be succinctly summarized as follows: 1. We present a methodology for discovering small sets of predictive genes using sparse PCA; 2. Our method improves the computational properties of existing sparse subspace estimation approaches to enable previously impossible inference when the number of genes is very large; 3. We demonstrate state-of-the-art performance of our method in synthetic examples and with standard cancer microarray measurements; 4. We provide near-optimal theoretical guarantees.
Our methodology can be used in a variety of genomic pattern discovery settings. One such example is a modified version of traditional differential expression analysis. If we have treatment and control measurements, the logistic version of our method is appropriate with the advantage that it examines the impact of one gene adjusted for the contributions of others. Additionally, with a continuous treatment, the detection power can be increased relative to using an artificial dichotomization.
In Section 2, we motivate the desire for sufficient PCR relative to previous approaches and present details of SuffPCR. Section 3 illustrates performance in simulated, semi-simulated, and real examples and discusses the biological implications of our methods for a selection of cancers. Section 4 gives theoretically justifies our methods, providing guarantees for prediction accuracy and correct gene selection. Section 5 concludes.
Notation. We use bold uppercase letters to denote matrices, lowercase Arabic letters to denote row vectors and scalars, and uppercase Arabic letters for for random variables. Let Y be a random, real-valued n-vector of independent variables Y i , and X be the rowwise concatenation of i.i.d. draws X i from a distribution on R p with covariance Σ. We denote the observed realization of the outcome variable Y as y ∈ R n . To be explicit in the genomics context, X is an n × p matrix where each row is a set of transcriptomic measurements from RNA-Seq or microarrays for a patient while y i is an observed phenotype of interest for the i th patient. Because X is a matrix, this symbol represents both a random matrix and its realization. In the following, the meaning should be clear from the context. We assume, without loss of generality, that E [X i ] = 0, and that the measurements X has been centred. The singular value decomposition (SVD) of a matrix A is A = U(A)Λ(A)V T (A). In the specific case of X, we suppress the dependence on X in the notation and write X = UΛV T . We write A d to indicate the first d columns of the matrix A and a j to denote the j th row. In the case of the identity matrix, we use a subscript to denote its dimension when necessary: I p . Let tr(A) denote the sum of the diagonal entries of A while A 2 F = ij a 2 ij is the squared Frobenius norm of A. A 2,0 denotes (2, 0)-norm of A, that is the number of rows in A that have non-zero 2 norm. A 1,1 is the sum of the row-wise 1 -norms. Finally, 1(a) is the indicator function for the expression a, taking value 1 if a is true or 0 if not.

Motivation and methodology
Supervised Principal Components [4,5,38] is widely used for solving high-dimensional prediction and feature selection problems. It targets dimension reduction and sparsity simultaneously by first screening genes (or individual mRNA probes) based on their marginal correlation with the phenotype (or likelihood ratio test in the case of non-Gaussian noise). Then, it performs PCA on this selected subset and regresses the phenotype on the resulting components (possibly with additional penalization). This procedure is computationally simple, but, zero population marginal correlation is neither necessary nor sufficient to guarantee that the associated population regression coefficient is zero. To make this statement mathematically precise, consider the linear where Y i is a real-valued scalar phenotype, X i is real-valued vector of genes, β * is the true (unknown) coefficient vector, and i is a mean-zero error. Defining as above Cov(X i , X i ) = Σ, and Cov(X i , Y i ) = Φ, then, for this procedure to correctly recover the true nonzero components of β * , it requires In words, we assume that the dot product of the j th row of the precision matrix with the marginal covariance between x and y is zero whenever the j th element of Φ is zero. While reasonable in some settings, this assumption frequently fails. For example, individual features may only be predictive of the response in the presence of other features. To illustrate why this assumption fails for genomics problems, we examine a motivating counterexample. Using mRNA measurements for acute myeloid leukemia (AML, Bullinger et al. 6), we estimate both Σ −1 and Φ and proceed as if these estimates are the true population quantities.
To estimate Φ, we use the empirical covariance and set all but the largest n = 116 values equal to zero, corresponding to an extremely sparse estimate. For Σ −1 , we use the Graphical Lasso [14] for all p = 6283 genes at different sparsity levels ranging from 100% sparse ( Σ −1 ij = 0 for all i = j) to 95% sparse. We then create a pseudotrue β * = Σ −1 Φ as in Equation (1). This is essentially the most favorable condition for SPC. To reiterate, in order to evaluate this assumption, we create β * based on estimates from real genetics data that are highly sparse. But, as we will see below, because the inverse covariance matrix is not "sparse in the right way", SPC will have a very high false negative rate and ignore important genes. Table 1 shows the sparsity of Σ −1 , the percent of non-zero regression coefficients, and the percent of non-zero regression coefficients which are incorrectly ignored under the assumption (the false negative rate). Even if the precision matrix is 99.9% sparse, the false negative rate is over 40%, meaning we find fewer than 60% of the true genes. If the sparsity of Σ −1 is allowed to decrease only slightly, the false negative rate increases to over 95%. Clearly, this screening procedure will ignore many important genes in even the most favorable conditions for SPC.
More recent work has attempted to avoid this assumption. Ding and McDonald [11] uses the initially selected set of features to approximate the information lost in the screening step via techniques from numerical linear algebra. An alternative discussed in Piironen and Vehtari [39] iterates the screening step with the prediction step, adding back features which correlate with the residual. Finally, Tay et al. [41] assumes that feature groupings are known and and estimates separate subspaces for different groups. All these methodologies are tailored to perform well when Φ and β * have particular compatible structures.
On the other hand, it is important to observe that a sufficient condition for β * j = 0 in Equation (1) is that the j th row of the left eigenvectors of Σ is 0. Based on this intuition, we develop sufficient PCR (abbreviated as SuffPCR) which leverages this insight: row sparse eigenvectors imply sparse coefficients, and hence depend on only a subset of genes. SuffPCR is tailored to the case that X lies approximately on a low-dimensional linear manifold which depends on a small subset of features. Because the linear manifold depends on only some of the features, β * does as well.

Prediction with principal components
PCA is a canonical unsupervised dimension reduction method when it is reasonable to imagine that X lies on (or near) a low-dimensional linear manifold. It finds the best d-dimensional approximation of the span of X such that the reconstruction error in 2 norm is minimized. This problem is equivalent to maximizing the variance explained by the projection: where S = 1 n X T X is the sample covariance matrix. Let X = UΛV T , then the solution of this optimization problem is V d , the first d right singular vectors, and the estimator of the first d principal components is U d Λ d or XV d equivalently. Given an estimate of the principal components, principal component regression (PCR) is simply ordinary least squares (OLS) regression of the phenotype on the derived components U d Λ d . One can convert the lower-dimensional estimator, say γ, back to the original space to reacquire an estimator of β * as β = V d γ d . Other generalized linear models can be used place of OLS to find γ.

Sparse principal component analysis
As discussed in Section 1.1, standard PCA works poorly in high dimensions. Much like the high-dimensional regression problem, estimating high-dimensional principal components is ill-posed without additional structure. To address this issue many authors have focused on different sparse PCA estimators for the case when V is sparse in some sense. Many of these methods achieve this goal by adding a penalty to Equation (2). Of particular utility for the case of PCR when β * is sparse is to choose a penalty that results in row-sparse V. This intuition is justified by the following result.
The proof is immediate. For any j, if v(Σ) j 2 = 0, then every element in v(Σ) j is 0, indicating the j th row of Σ −1 will be 0. Since β * j = (Σ −1 Φ) j where Cov(X i , y i ) = Φ, it also results in β * j = 0. This result stands in stark contrast to the assumption in Equation (1). This proposition gives a guarantee rather than requiring an assumption: if the rows of V d are sparse, then β * is sparse. The same intuition can easily be extended to the case Λ(Σ) jj ≥ 0 for all j given a gap between the d th and (d + 1) st eigenvalues. In this setting, the natural analogue of PCA is the solution to: Solutions V d of Equation (3) will give projection matrices onto the best d-dimensional linear manifold such that V d is row sparse. However, this problem is NP-hard. Many authors have developed different versions of sparse PCA. For example, d'Aspremont et al. [9] and Zou et al. [51] focus on the first principal component and add additional principal components iteratively to account for the variation left unexplained by the previous principal components. Vu and Lei [45] derives a rate-minimax lower bound, illustrating that no estimator can approach the population quantity faster than, essentially, q d/n where q is a deterministic function of Σ. Later work in Vu et al. [46] proposes a convex relaxation to Equation (3) which finds the first d principal components simultaneously and nearly achieves the lower bound: where F d := {VV T : 0 VV T I p and tr(VV T ) = d} is a convex body called Fantope, motivating the name Fantope Projection and Selection (FPS). The authors solve the optimization problem in Equation (4) with an alternating direction method of multipliers (ADMM) algorithm.
Elementwise soft-thresholding 6: , sort in descending order 10: Choose t by applying Algorithm 2 to l 11: Set rows in V d whose 2 norm is smaller than t as 0, and get V d 12: Solve γ = argmin γ y − X V d γ 2 For these reasons, FPS is known as the current state-of-the-art sparse PCA estimator with the best performance. However, despite its theoretical justification, FPS is less useful in practice for solving prediction tasks, especially in genomics applications with p n (rather than just p > n) for two reasons. First, the original ADMM algorithm has per-iteration computational complexity O(p 3 ), which is a burden especially when p is large. Secondly, because of the convex relaxation using Equation (4) rather than Equation (3), V d from FPS tends to be entry-wise sparse, but infrequently row-wise sparse unless the signal-to-noise ratio (SNR) is very large (q is a function of this ratio). We give explicit formulas for the SNR under this model in the Supplement, but heuristically, the SNR captures how well the data is described by a d-dimensional subspace through the relative magnitude of tr (Λ d ) compared to p. In genomics applications with low SNR, which is common, estimates β tend to have large numbers of non-zero coefficients with very small estimated values. Thus we design SuffPCR based on the insights from Proposition 1, utilizing the best sparse PCA estimator FPS, and further addressing both of these issues to achieve better empirical performance while maintaining theoretical justification.

Sufficient principal component regression
In this section, we introduce SuffPCR. The main idea of SuffPCR is to detect the relationship between the phenotype Y and gene expression measurements X by making use of the (near) low-dimensional manifold that supports X. In broad outline, SuffPCR first uses a tailored version of FPS to produce a row-sparse estimate V d , and then regresses Y on the derived components to produce sparse coefficient estimates.
SuffPCR for regression is stated in Algorithm 1 and summarized visually in Figure 1. For ease of exposition, we remind the reader that Y and X are standardized so that S = 1 n X T X is the correlation matrix. The first issue is the time complexity of the original FPS algorithm. Essentially, FPS uses the same three steps depicted in lines 4, 5, and 6 in Algorithm 1.
The only difference here between our implementation and that in FPS is in step 4. Each of these steps takes a matrix and produces another matrix, where the matrices have p 2 elements. The second and third steps are computationally simple (element-wise soft-thresholding and matrix addition). But the first, Proj F d (Q), is more challenging. The solution requires computing the eigendecomposition of Q, an O(p 3 ) operation, and then modifying the eigenvalues of Q through the solution of a piecewise linear equation in τ : Because of the cubic complexity in p, the authors suggest the number of features should not exceed one thousand. But typical transcriptomics data has many thousands of gene measurements, and preliminary selection of a subset is suboptimal, as illustrated above. Due to the form of the piecewise solution, most eigenvalues will be set to 0. Thus, while we will generally require more than d eigenpairs, most are unnecessary, certainly fewer than min{n, p}. Our implementation computes only a handful of eigenvectors corresponding to the largest eigenvalues, rather than all p. If we compute enough to ensure that some Λ 2 i,+ (Q) will be 0, then the rest are as well. Our implementation uses Augmented Implicitly Restarted Lanczos Bidiagonalization [AIRLB, 2] as implemented in the irlba package [3], though alternative techniques such as those in Gittens and Mahoney [15], Homrighausen and McDonald [23] may work as well. We provide a more detailed discussion in the Supplement.
For moderate problems (n, p ≈ 100), the truncated eigendecomposition with AIRLB rather than the full eigendecomposition leads to a three-fold speedup while the further incorporation of specialized initializations leads to an eight-fold improvement without any discernable loss of accuracy (results on a 2018 MacBook Pro with 2.7 GHz Quad-Core processor and 16GB of memory running maxOS 10.15). The results are similar when p = 5000, though the same experiment on a high-performance Intel Xeon E5-2680 v3 CPU with 12 cores, 256GB of memory, and optimized BLAS were somewhat less dramatic (improvements of three-fold and four-fold respectively). For large RNA-Seq datasets (p ≈ 20000), we observed a nearly ten-fold improvement in computation time.
The second issue is that the Fantope constraint in Equation (4) ensures only that tr(VV T ) = d but not that the number of rows with non-zero l 2 -norm is small. This feature of the convex relaxation results in many rows with small, but non-zero, row-norm resulting in dense estimates of β * . Thus, to make the final estimator V d sparse, we hard-threshold rows in V d whose 2 norm is small, as illustrated in line 9, 10, and 11 in Algorithm 1. From empirical experience, we have found that there is often a strong elbow-type behavior in the row-wise 2 norm of V d , similar to the Skree plot used to choose d in standard PCA. Therefore, we develop a simple procedure, Algorithm 2, to find the best threshold automatically. Essentially, it calculates the empirical derivative of the observation-weighted variances on each side of a potential threshold and maximizes their difference, resulting in signal and noise groups. We set the rows in V d corresponding to the noise to 0. SuffPCR is also amenable for solving other generalized linear models. For example, replacing line 12 in Algorithm 1 with logistic regression solves classification problems.
Algorithm 2 Find a t to hard-threshold l 1: Input: a p-vector l 2: for i ∈ 1, · · · , p do 3: We have omitted the other methods from the ROC curve for legibility, but their behavior is similar to lasso. TPR and FPR stand for True/False Positive Rate respectively. Note that (as one would expect from the simulation conditions), SPC has the worst performance in terms of the ROC curve while both SuffPCR and Elastic net have AUC of almost 1.

Empirical evaluations
In this section, we show how SuffPCR performs on synthetic data and on real public genomics datasets relative to state-of-the-art methods. Section 3.1 first presents a generative model for synthetic data and motivates the assumptions required for our theoretical results in Section 4. We include here one synthetic experiment under conditions favorable to SuffPCR relative to SPC. We also investigate conditions favorable to SPC, the influence of tuning parameter selection, and the effect of the signal to noise ratio but defer these to the Supplement. Section 3.2 uses the NSCLC data as the X matrix, but creates the response from a linear model. Section 3.3 reports the performance of SuffPCR on 5 public genomics datasets. The supplement includes similar results for binary survival-status outcomes. Across most settings in both synthetic and real data, SuffPCR outperforms all competitors in prediction mean-squared error and is able to select the true genes (those with β * = 0) more accurately. An R package implementing SuffPCR and raw data are freely available at https://github.com/dajmcdon/suffpcr. Package documentation may be viewed at https://dajmcdon.github.io/suffpcr. In terms of the ROC curve, SPC and AIMER has the best performance, though SuffPCR is not far behind. But note that SPC has much worse precision and recall.

Synthetic data experiments
We generate data from the multivatiate Gaussian linear model We impose an orthogonal factor model for the covariates x ) entries independent of u i , and σ x > 0. We assume V d is row sparse with only s rows containing non-zero entries. These non-zero rows are the "true" features to be discovered, and they correspond to β * = 0.
It is important to note that, under this model, the rows of X follow a multivariate Gaussian distribution independently, with mean 0 and full-rank covariance Σ = VLV T whenever σ 2 x > 0. Here the columns of V are orthonormal eigenvectors on R p and the eigenvalues are l 1 ≥ · · · ≥ l p ≥ 0. Straightforward calculation shows that the first d columns in V are the same as the right singular vectors V d in the signal component of . . , p. We generate y ∈ R n as a linear function of the latent factors U d with additive Gaussian noise: y = U d Θ + z, where Θ is the regression coefficient, and z i are i.i.d. N(0, σ 2 y ), i = 1, . . . , n, independent of X. Under this model the population marginal correlation between each feature in X and y is Φ = V d Λ d Θ, and the population ordinary least squares coefficient of regressing y on X is β Note that the number of non-zero β * is s, because V d has only s rows with non-zero entries.
In all cases, we use n = 100 observations and p = 1000 features, generating three equal-sized sets for training, validation, and testing. We use prediction accuracy on the validation set to select tuning parameters for all methods. For the case of SuffPCR, this means only λ, because we choose t with Algorithm 2 and set d = 3. We use the test set for evaluating out-of-sample performance. Each simulation is repeated 50 times. Results with n = 200 and p = 5000 were similar. Algorithm 3 makes this entire procedure more explicit.
We compare SuffPCR with a number of alternative methods. The Oracle estimator uses ordinary least squares (OLS) on the true features and serves as a natural baseline: it uses information unavailable to the analyst (the true genes) but represents the best method were that information available. We also present results for Lasso [42], Ridge [22], Elastic Net [50], SPC [5], AIMER [11], ISPCA [39], and PCR using FPS directly without feature screening (using Algorithm 1 without step 9, 10 and 11). For ISPCA, we use the dimreduce R package to estimate the principal components before performing regression. For all competitors, we choose any tuning parameters that do not have default values using the validation set. Examples are λ in Lasso, Ridge, Algorithm 3 Generate synthetic data 1: Input: n = 100, p = 1000, r = 5, d = 3, SNR x = SNR y = 5. d − 1, . . . , 1)) ∈ R d×d . 4: Generate i.i.d. N(0, 1) V ∈ R d×d and orthogonalize the columns. 5: Extend V ∈ R s×d by repeating each row r times (s = rd).
and Elastic Net or the initial thresholding step in SPC. We use the correct embedding dimension (d = 3) whenever this is meaningful. Additional experiments are given in the Supplement. There, we investigate conditions favourable to SPC, the choice of d, and the impact of different SNR choices.

Conditions favorable to SuffPCR
The first setting is designed to show the advantages of SuffPCR relative to alternative methods, especially SPC. We note that other methods that employ screening by the marginal correlation [11,39] will have similar deficiencies. Because SPC works well if Equation (1) holds, we design Σ to violate this condition and set the first 15 features to have non-zero β * but allow only the first 10 features to have non-zero correlation with the phenotype. This behaviour is achieved with line 8 of Algorithm 3. By solving this equation in one unknown component of Θ, we force Φ = 0 for the third group of 5 components. Thus, as described in Section 2, Equation (1) will not hold: some Φ j = 0 but β * j = 0. We set the true dimension of the subspace as d = 3, and we use the correct dimension for methods based on principal components. Figure 2 shows the performance of SuffPCR and state-of-the-art alternatives. In addition to reporting each method's prediction MSE on the test set, we also show the number of features selected, precision, recall, and the ROC curve. The ISPCA implementation does not select features. In this example, SuffPCR actually outperforms the oracle estimator, attaining smaller MSE while generally selecting the correct features. This seemingly implausible result is likely because the variance of estimating OLS on 15 features is large relative to that of estimating the low-dimensional manifold followed by 3 regression coefficients. SuffPCR has a clear advantage over all the alternative methods, especially SPC which is 3 orders of magnitude worse. SPC works so poorly because it ignores 5 features. ISPCA has slightly lower MSE than SPC. Ridge is the worst, due to fitting a dense model when a sparse model generated the data. SuffPCR reduces MSE significantly relative to simply using FPS due to more accurate feature selection. The right plot in Figure 2 further shows the ROC curve for SuffPCR, Lasso, Elastic Net, SPC and AIMER in which we can easily vary the tuning parameter and select various numbers of features. SuffPCR and AIMER have a perfect ROC curve, while the other 3 methods are unable to identify 5 features. We undertake a similar exercise under conditions favorable to SPC in the Supplement.

Semi-synthetic analysis with real genomics data
The simulations in Section 3.1 explore various scenarios for the data generation process and show the performance of SuffPCR relative to the alternatives, however, they do not use any real genomic data. In this section, rather than fully generating X, we create a semi-synthetic analysis wherein only the phenotypes are generated. We first performed PCA on the NSCLC data [27] and note that the first two empirical eigenvalues are relatively large, so we chose the number of PCs to be d = 2. We keep the top 20 rows in the empirical V which have the largest norm and set the rest to 0. We then recombine and add noise. The phenotype is constructed as in the previous simulations, and the SNR is calibrated as in Section 3.1.1. Figure 3 shows the results analogous to those in Figure 2. SuffPCR continues to perform well relative to alternatives, though here, FPS has similar MSE, albeit poor feature selection.

Analysis of real genomics data
We analyze 5 microarray datasets that are publicly available and widely used as benchmarks. Four of the datasets present messenger RNA (mRNA) abundance measurements from patients with breast cancer [34,44], diffuse large B-cell lymphoma (DLBCL) [40], and acute myeloid leukemia (AML) [6], and the fifth reports microRNA (miRNA) levels from non-small cell lung cancer (NSCLC) patients [27]. The features in X are gene expression measurements from microarrays. In the Supplement, we apply SuffPCR to predict COVID-19 viral load from RNA-Seq data.
The phenotypes Y are censored survival time in all cases, though some of the datasets also contain binary survival status indicators. Because the real valued phenotype is non-negative and right censored, we follow common practice and transform Y to log(Y + 1). Each observation is a unique patient. The first breast cancer dataset has 78 observations and 4751 genes, the second has 253 observations and 11331 genes, DLBCL has 240 observations and 7399 genes, AML has 116 observations and 6283 genes, and NSCLC has 123 observations and 939 genes.
We randomly split each dataset into 3 folds for training, validation, and testing with proportions 40%, 30% and 30% respectively. We set the number of components d = 3 and search over 5 log-linearly spaced λ values. Other choices for d and λ yield similar results. We train all methods on the training set, use the validation set to choose any necessary tuning parameters, and report performance of each method on the test set. We repeat the entire process (data splitting, validation, and testing) 10 times to reduce any bias induced by the random splits. In all cases, all methods were tuned to optimize validation-set MSE. Table 2 shows the average prediction MSE and the average number of selected features for SuffPCR and any alternative methods that perform feature selection. SuffPCR works better than all the alternative methods on 4 out of 5 datasets with a comparatively small number of features selected. The DLBCL data is difficult for both sparse and PC-based methods. As described in Section 2.3, FPS cannot be used for these data sets because of the number of genes. Non-sparse alternatives have much smaller MSE, suggesting that many genes may play a roll in mortality rather than only a subset. SPCA is designed to maximize the variance explained by the principal components subject to a penalty on the non-sparsity, and it does not seem to work well in regression tasks. DSPCA has relatively low prediction MSE, and it does in principle perform feature selection, though it generally produces a dense model. While Ridge, Random Forests and SVM predict well in general, they do not perform any feature selection, which is a key objective here, so show their MSE in the Supplement.
To assess the potential relevance of the genes selected by SuffPCR to the cancer type from which they were identified, we further explored the DLBCL data and extracted the selected genes. (We do the same with AML in the Supplement.) We first find the best λ via 5-fold cross-validation on all the data and then train SuffPCR with this λ. Our model selects 87 features corresponding to 32 unique genes and 2 expressed sequence tags (ESTs) for DLBCL. Seventeen of the identified genes encode ribosomal proteins, overexpression of  Table 2: Prediction MSE and number of selected features for regression of survival time on gene expression measurements which is associated with poor prognosis [13]. A further 9 genes encoding major histocompatibility complex class II (MHCII) proteins were detected, a notable finding in light of the fact that MHCII downregulation is a means by which some DLBCLs evade the immune system [10]. Discovering these large groups of similarly functioning genes illustrates the benefits of SuffPCR relative to alternatives. CORO1A encodes the actin-binding tumor suppressor p57/coronin-1a, the promoter of which is often hypermethylated, and therefore likely silenced in DLBCL [29]. FEZ1 expression has been used in a prognostic model [31]. RAG1, encoding a protein involved in generating antibody diversity, can induce specific genetic aberrations found in DLBCL [33]. RYK encodes a catalytically dead receptor tyrosine kinase involved in Wnt signaling and CXCL5 encodes a chemokine. To our knowledge, neither gene has been implicated in DLBCL and thus may be of interest for further exploration. EST Hs.22635 (GenBank accession AA262469) corresponds to a portion of ZBTB44, which encodes an uncharacterized transcriptional repressor, while EST Hs.343870 (GenBank accession AA804270) does not appear to be contained within an annotated gene. The Supplement lists the selected genes and associated references. A separate listing of the genes encoding ribosomal and MHCII proteins are given in the Supplement.

Theoretical guarantees
When the sparse factor model described in Section 3 is true, SuffPCR enjoys near-optimal convergence rates. We now make the necessary assumptions concrete and note that some can be weakened.
Assumptions A1-A4 are the same as those used in Section 3.1 to generate data from a linear factor model. Assumption A5 says that the number of true nonzero coefficients β * must be no more than s and that the size of the associated components must be large enough. Assumption A6 means that eventually, we must have at least as many observations n as a logarithmic function of p times the true number of components plus the square of the number of nonzero β * coefficients.
Theorem 1. Suppose assumptions A1 to A6 hold and let β be the estimate produced by SuffPCR with λ = cλ 1 log(p)/n and t < 2τ where t is the threshold used in Algorithm 1 and τ is given in A5. Then Theorem 2. Suppose assumptions A1 to A6 hold and let β be the estimate produced by SuffPCR with λ = cλ 1 log(p)/n and 2τ > t > τ where t is the threshold used in Algorithm 1 and τ is given in A5. Then where A B = A/B ∪ B/A is the symmetric difference operator and supp denotes the support set.
In both results above, c is a positive number (possibly different between the two) that is independent of n and p, but may depend on any of the other values given in A1-A6. Theorem 1 gives a convergence rate for the prediction error of SuffPCR comparable to that of Lasso though with explicit additional dependence on d. Under standard assumptions with fixed design, this dependence would not exist for Lasso. On the other hand, our results are for random design with d small, along with different constants absorbed by the big-O. Theorem 2 shows that our procedure can correctly recover the set of nonzero β * as long as the threshold t is chosen correctly. We note that this result is a direct consequence of Vu et al. [46,Theorem 3.2]. In practice, the condition 2τ > t > τ cannot be verified, although the "elbow" condition we employ in the empirical examples seems to work well. Finally, we emphasize that, as is standard in the literature, these results are for asymptotically optimal tuning parameters λ, t rather than empirically chosen values. The proof of Theorem 1 is given in the Supplement. These results suggest that SuffPCR is nearly optimal as p and n grow.

Discussion
High-dimensional prediction methods, including regression and classification, are widely used to gain biological insights from large datasets. Three main goals in this setting are accurate prediction, feature selection, and computational tractability. We propose a new method called SuffPCR which is capable of achieving these goals simultaneously. SuffPCR is a linear predictor on estimated sparse principal components. Because of the sparsity of the projected subspace, SuffPCR usually selects a small number of features. We conduct a series of synthetic, semi-synthetic and real data analyses to demonstrate the performance of SuffPCR and compare it with existing techniques. We also prove near-optimal convergence rates of SuffPCR under sparse assumptions. SuffPCR works better than alternative methods when the true model only involves a subset of features.

A Genetics data discussion
In this section we provide more details on the 5 standard genetics data sets discussed in the manuscript. We then apply our methodology to one/two additional data sets which are less well studied in the literature.

A.1 Detailed description
In the manuscript (and below in Section C.4), we analyze 5 microarray datasets that are publicly available and widely used as benchmarks. Four of the datasets present messenger RNA (mRNA) abundance measurements from patients with breast cancer [34,44], diffuse large B-cell lymphoma (DLBCL) [40], and acute myeloid leukemia (AML) [6], and the fifth reports microRNA (miRNA) levels from non-small cell lung cancer (NSCLC) patients [27].
These data sets have between about 80 and 250 patients with expression measurements on 900 to 11000 genes. Table 3 gives specific statistics about each dataset. Name n (patients) p (genes) Breast cancer [44] 78 4751 Breast cancer [34] 253 11331 DLBCL [40] 240 7399 AML [6] 116 6283 NSCLC [27] 123 939 Table 3: Summary of canonical datasets The AML data was originally collected and analyzed by [6]. They used complementary-DNA microarrays to measure gene expression from either peripheral-blood or bone marrow samples from 116 adults with AML. The gene expression measurements for the 6283 genes that were highly variable across patients are included. The observed outcomes are the (possibly right-censored) survival time in days as well as a binary indicator for whether or not the patient died.
The first set of breast cancer data from [44] is based on 78 sporadic lymph-node-negative patients. The authors derived cRNA from snap-frozen tumor samples and pooled across each of the sporadic carcinomas. The 4751 available genes were significantly regulated across the patients (showing at least a two-fold difference across at least 5 patients). The outcome is the follow-up time (in months) for metastases as well as a binary indicator for whether prognosis was "good" (no metastases within 5 years) or "poor".
The second set of breast cancer gene expression measurements is based on 253 patients who's cancer tissues were sequenced for the p53 mutation [34]. A binary variable for the presence of the p53 mutation is missing for two patients, which were removed from further analysis. The regression exercise focuses on the right-censored survival time (measured in years). The measured gene expressions come from Affymetrix high-density oligonucleotide arrays. A number of other clinical variables are also present in the data contained in the R package (https://github.com/dajmcdon/suffpcr).
The DLBCL data was collected by [6] and contains Biopsy samples from 240 patients with gene expression measurements from DNA microarrays. The microarrays were constructed from 12,196 clones of cDNA then used to quantify the expression of mRNA in the tumors. Genes with significant differential expression across patients were included. The response is survival time after chemotherapy in years as well as binary survival status (57% of patients died).

B Application to COVID-19 Data
In this section, we apply our methods to RNA-Seq data for COVID-19-positive patients. As a developing disease, biological analysis remains highly desirable, and so to facilitate this sort of undertaking, recent work has endeavoured to collect and distribute this data publicly [48].
Here, we examine the data collected by Lieberman et al. [30], and examine how well RNA-Seq measurements predict viral load. The data contains results for 430 PCR-confirmed COVID-positive patients and 54 negative controls. Lieberman et al. [30] examines both differential expression between positives and controls and the viral load of the positive patients, while we focus on only the viral load. Of the 430 confirmed positives, 413 have reported usable measurements. Viral load in this case is measured through a proxy: the cycle threshold (Ct) of the SARS-CoV-2 nucleocapsid gene region 1 (N1) target from a PCR test.
Larger Ct values indicate more cycles are required to detect the N1 target, and thus, likely indicate lower viral load. While Lieberman et al. [30] bin this continuous measurement into "high" and "low" groups to undertake differential analysis, our methodology directly models the continuous response. This allows for (1) increased ability to detect potentially predictive genes and (2) direct quantification of the effect of increased expression on viral load. The raw expression measurements are preprocessed before being used in SuffPCR. First, we remove any genes whose median expression level across the 413 patients with Ct measurements is 0. This reduces the data from ∼36,000 genes to 9435. We then transform the raw counts usingz → log(z + 1). Finally, we centre and scale by the mean and standard deviation of the similarly transformed measurements from the healthy controls. Mathematically, ifz gt is the vector of expression measurements for gene g in the treatment group andz gc is the vector of expression measurements for the same gene in the control group, we form where z gc = mean(log(z gc + 1)) and sd(z gc ) = sd(log(z gc + 1)).
We apply SuffPCR to the X matrix formed by the columnwise concatenation of x gt with the N1 Ct as the response vector. We examined embedding sizes of d ∈ {3, 5, 15} and λ on a log 10 -spaced grid between 0 and 1. We chose both parameters using the minimum of the 5-fold cross-validation error. The result is shown in Figure 4 (here d = 15). Our method selected 59 genes. Many of these with the largest magnitude (darkest colour) are similar to those described in [30]: the CXCL10 and 11 genes along with IDO1 and IFI27 are proinflammatory and/or interferon induced and may be related to the "cytokine storm" found in some patients [32]. Novel and potentially interesting PLEKHA4, which is strongly related to higher viral load, but more closely related to melanomas. PCS5K is more strongly expressed in patients with lower viral load. This gene encodes a proprotein convertase that may potentially help to clear the virus. ROCK1P1, also more strongly expressed in lower viral load patients, is not well understood.
As mentioned above, 27 patients were positive but did not have a measured N1 Ct from the PCR test. These were also missing age and gender information. Using the fitted SuffPCR model, we can predict their missing N1 cycle thresholds. These predictions are shown in Figure 5. The green area shows a density estimate for the 413 patients with observed measurements while the purple lines display predicted values for those positive patients whose N1 Ct values are labeled "Unknown". Because the data is missing, the accuracy of these predictions cannot be determined.

C Additional experiments and results for regression C.1 Conditions favorable to SPC
This example is designed to show the performance of SuffPCR under favorable conditions for SPC. Here, the only alteration to the data generation process is that we set the first 10 features to have non-zero β * , the same features which have non-zero marginal correlations with the phenotype. This is achieved using Algorithm 3 in the manuscript with a few alterations: (1) in line 5, we generate V d ∈ R (d−1)×3 (recall that we have chosen d = 3); and (2) we replace lines 7-9 by directly simulating Θ ∈ R d with i.i.d. standard normal entries. Figure 6 gives the analogous results for this example. Here, SuffPCR has comparable MSE to SPC, which is the best, slightly better than the oracle as was the case for SuffPCR above. However, SPC is less likely to select the correct features, while SuffPCR selects the correct features most of the time. Furthermore, SuffPCR tends to select fewer features, and it has the best precision and recall (Ridge will always have recall equal to 1).

C.2 Selecting tuning parameters
In SuffPCR, the tuning parameters are the penalty term λ, the dimension of the projected subspace d, and the threshold t. In all the simulations, we examine the same values of λ. When analyzing real datasets, our algorithm offers automatic calculation of potential values of λ given the sample covariance matrix. In practice, the more penalty parameters we explore, the better the results could be assuming the sample size is large enough that risk estimation is not too volatile.
To demonstrate the importance of selecting d, we simulate the data with d = 3 and set all the other parameters the same as in the first simulation. We estimate SuffPCR using d ∈ {1, 3, 5} and display the results in Figure 7. When d is smaller than the true value, it is hard for SuffPCR to capture all the information contained in X, thus SuffPCR has terrible prediction MSE, worse than any other methods. When d is larger than the true value, SuffPCR remains reasonable because its bias is small, but the variance will eventually increase as d increases, diminishing performance. Selecting t is not difficult in our simulations, so the results have been omitted, but it is less trivial in the real data settings.

C.3 Investigation of signal-to-noise ratio with synthetic data
In all or our synthetic examples, the data generating model has two sources of noise, one from constructing X corresponding to σ x , and the other from constructing Y , denoted σ y . This simulation aims to control the two noise sources separately to examine their effect on the performance of SuffPCR. Note that X is random unlike in standard prediction studies, so both sources of noise are important. We use SNR x and SNR y to denote the signal-to-noise ratio for X and Y respectively. These are given by Note that SNR y depends not only on β * but also on σ x and the linear manifold through V d . We alter the values of SNR x and SNR y to generate X and Y while everything else is as in the first simulation. Figure 8 shows the prediction MSE for all methods on four sets of simulated data for both high and moderate combinations of SNR x and SNR y . When the SNR decreases, the prediction MSE of all the methods increases as expected. In all configurations, SuffPCR performs similarly to the oracle, better than Lasso or Elastic Net, while SPC is more similar to Ridge regression (though better). Interestingly, changing SNR x and SNR y has a similar impact on nearly all the methods except Ridge, SPC, AIMER, and ISPCA, which are nearly unaffected and uniformly worse by an order of magnitude. Table 4 enumerates genes selected by SuffPCR on the DLBCL data, Table 5 lists the selected genes for AML, Table 6 lists the miRNA sequences selected by SuffPCR on the NSCLC data, while Table 7 lists all the features selected by SuffPCR on the Breast Cancer1 data. Table 8 gives the prediction MSE for Ridge, Random Forests and SVM for the 5 regression datasets we examine in Section 3.3 of the manuscript. These methods predict well in general, but they do not select features.

C.4 Predictive genes/miRNA for additional cancers
For AML we describe the related work summarized in Table 5 in more detail. Of the 4 discovered genes, 3 have been discussed in the literature. BPGM, encoding a multifunctional metabolic enzyme restricted to erythrocytes and placental cells, is upregulated in mouse models of AML [37]. PI4KB, encoding a lipid kinase, is amplified in breast cancer [47] and has been proposed to be a druggable AML-specific dependency [49]. LOC90379 encodes an uncharacterized protein that is highly immunogenic in ovarian cancer serum [16]. The fourth feature discovered, Hs.321434, is an EST (GenBank accession H96229) that corresponds to an intronic sequence of an uncharacterized long noncoding RNA (lncRNA), LOC101929579, and is thus of uncertain significance. Similar listings for the discovered genes for the NSCLC and Breast Cancer 1 data are given in the Supplement without the accompanying literature review.

D Extension to classification
SuffPCR is easily extended to solve classification tasks using logistic regression (or even other classification methods). We use logistic regression as an example to show how SuffPCR performs for classification tasks. Note that the algorithm is similar to Algorithm 1 except that step 12 is replaced by the objective of logistic regression.

D.1 Synthetic data for binary classification
We first use a simulation to demonstrate the generalization of SuffPCR to solve a binary classification problem. Using the same data generating model and parameters as in the favorable scenario in Section 3.1.1 of the manuscript, we generate Y using the logistic function. As before, we choose tuning parameters with the validation set, and report the overall prediction accuracy on the test set. Figure 9 shows the classification accuracy compared to the oracle (logistic regression on the true predictors), logistic lasso, logistic ridge, and logistic FPS (logistic regression on the estimated principal components from FPS). SuffPCR has very high classification accuracy on the test set relative to the alternative methods. We also simulate classification data analogously to the other scenarios discussed in the manuscript, and the results are very similar.

D.2 Analysis of real genomics data
We analyze 3 of the 5 real genomics datasets from Section 3.3 of the manuscript, which include a binary survival status. We use the same training test split process as in the manuscript and compare SuffPCR with logistic lasso and logistic ridge. hsa-miR-376a* Table 6: RNA sequences selected by SuffPCR for NSCLC data. Table 9 shows the average classification accuracy and average number of selected genes in the classification tasks. The results in the classification tasks are similar to those in regression. Again, the DLBCL dataset tends to be difficult for both sparse and PC-based methods.

E Approximate singular value decomposition in Algorithm 1
To approximate the top j eigenvalues of a symmetric matrix A, we require the k-step (partial) Lanczos bidiagonalization (j < k). For initial vector p 1 , this is given by where P (k),T P (k) = Z (k),T Z (k) = I ∈ R k×k , P (k),T r (k) = 0, P (k) e 1 = p 1 , and W (k) is bidiagonal. Approximate eigenvectors and eigenvalues for A can then be computed using the SVD for W, which is easy because of the bidiagonal structure. However, the accuracy is closely tied to the choice of the initial vector p 1 and can be measured by the norm of the residual vector r (k) . AIRLB [2] essentially augments the SVD of W with additional information to reexpress P (k) , Z (k) and W (k) . This process is repeated until the residual is deemed small enough.
If p 1 is in the span of the top j eigenvectors of A, then no iteration will be necessary, and the approximation is exact. So, within our ADMM, when the span of the eigenvectors for B − C + S/ρ is similar across iterations, previous iterates can be used as initializations for the restarted Lanczos procedure. Thus, as we loop through the ADMM steps, these initializations improve the speed subsequent decompositions.
This modification significantly improves the per-iteration efficiency of the algorithm while still converging in a few dozen iterations. Note that ADMM will still converge (though in perhaps more iterations) if some or   all of the steps are implemented approximately [12]. In particular, by approximating the projection of Q, Proj F d (Q), with Proj F d (Q), ADMM will converge provided ∞ k=1 Proj F d (Q (k) ) − Proj F d (Q (k) ) < ∞.
Nishihara et al. [36] suggests linear convergence for ADMM, and our experience is that this case remains linear despite the approximation.

F Proofs
To prove Theorem 1, we first need the following technical lemma.   Here, the first equality is a standard property of idempotent matrices and the third follows from [18,Thm. 20.5.6].
Proof of Theorem 1. Write Π = V d V T d and Π = V d V T d for the orthogonal projectors onto the column span of the population quantity V d and the estimate produced by Algorithm 2 respectively. Let and defineβ = V dγ . Note that β * ∈ col(V d ) and β ∈ col( V d ) as β * = V d L −1 d Λ d Θ =: V d r and β = V d γ. Then, where the last line follows by Lemma 1. Now, X β = Y ∈ col(X Π) and Xβ =Ỹ ∈ col(XΠ), and col(XΠ) ⊂ col(X) ∩ row(Π) = col(X) ∩ col(Π) (because Π is symmetric). Thus, we have that Y ∈ col(X Π) ⊂ col(X) ∩ col( Π) and Ỹ ∈ col(XΠ) ⊂ col(X) ∩ col(Π). By linearity, there exist orthogonal projectors Q and R such that Y = QY andỸ = RY . Therefore, where the first inequality is Hölder's inequality for the Frobenius norm. For the second, since Q and R are (orthogonal) projectors onto subspaces of col(Π) and col( Π), it must be that Q − R F ≤ Π − Π Combining these terms gives the result.