On Principal Components Regression, Random Projections, and Column Subsampling

Principal Components Regression (PCR) is a traditional tool for dimension reduction in linear regression that has been both criticized and defended. One concern about PCR is that obtaining the leading principal components tends to be computationally demanding for large data sets. While random projections do not possess the optimality properties of the leading principal subspace, they are computationally appealing and hence have become increasingly popular in recent years. In this paper, we present an analysis showing that for random projections satisfying a Johnson-Lindenstrauss embedding property, the prediction error in subsequent regression is close to that of PCR, at the expense of requiring a slightly large number of random projections than principal components. Column sub-sampling constitutes an even cheaper way of randomized dimension reduction outside the class of Johnson-Lindenstrauss transforms. We provide numerical results based on synthetic and real data as well as basic theory revealing differences and commonalities in terms of statistical performance.


Introduction
Principal Components Regression (PCR), first introduced in [17,25], is perhaps the most basic approach to dimension reduction in linear regression. In PCR, the design matrix X ∈ R n×d containing the original predictor variables is replaced by XR ∈ R n×r , r < d ∧ n, where R ∈ R r×d reduces X to its top r principal components. From a statistical point of view, PCR can be motivated as a way of dealing with multi-collinearity and reducing estimation variance at the expense of additional bias. From the point of view of computation, PCR potentially achieves a reduction from a large number of variables to a parsimonious model, which can be beneficial for both model fitting and prediction of future observations. The use of PCR is debated in the literature as it does not need to be case that principal components corresponding to small singular values do not significantly contribute in predicting the response variable [5,22]. Herein, we mostly avoid touching upon this issue. Instead, the purpose of this paper is to establish a connection between PCR and the use of randomized methods of dimension reduction in linear regression in which the matrix R above is sampled from a suitable distribution. The latter approach, typically referred to as the method of random projections, is motivated from large-scale datasets in which both the number of samples n and the dimension d are large; in this case, computation of principal components via the SVD can be demanding. Random projections only require a matrix multiplication which can be easily parallelized. Having its roots in the celebrated Johnson-Lindenstrauss-Lemma [21], the idea has meanwhile a long history in computer science [39], and has recently attracted considerable interest in statistics (see [12] and the references therein).
Contributions and Related Work. It is critical to understand the statistical properties of the dimension reduction provided by random projections. Regarding linear regression, there are many more papers (e.g., [2,16,32,33,40,41]) on the scenario in which X is reduced to RX, i.e., R is multiplied from left instead of from the right with X being reduced to XR. The latter scenario was first analyzed in [29] under the term "compressed least squares" (CLS) which will also be employed herein. For a fixed design setting, refinements appear in [23], and very recently in [37]. Together with a preliminary version [36] of the present paper, the paper [37] is the first to make a connection between PCR and CLS. However, as we show below, while improving over the main result in [23], the upper bounds on the prediction error of CLS in [37] still leave a considerable gap to PCR. In the present paper, we try to close this gap. In brief, our main result states that CLS can roughly match the performance of PCR with respect to prediction at the expense of a moderate increase (at most by a logarithmic factor) in the reduced dimension. This property is shown to hold for a certain class of matrices comprising those that are typically considered in the literature on randomized dimensionality reduction and compressed sensing. We leave it as an open problem whether a similar result can be established for column subsampling, i.e., the columns of R are chosen uniformly at random from the canonical basis vectors. Results are provided indicating that more stringent conditions are required in that case. Finally, we note that the very recent work [24] presents an analysis of CLS when the goal is to recover the vector of regression coefficients under sparsity. Such an assumption is not made herein, and accordingly settings, goals and results are not comparable.
Outline. In Section §2, we provide background and review the results of prior work in more detail. Our main result is contained in §3, while §4 discusses extensions and open questions. We conclude with a brief summary im §5. The appendix contains all proofs.
Notation. For a positive integer m, we write [m] := {1, . . . , m}. For a matrix M , we write P M for the projection operator onto the column space of M . Its Frobenius norm is denoted by M F = tr(M M ), where "tr" is the trace of a diagonal matrix. The Gaussian distribution with zero mean and variance s 2 is denoted by N (0, s 2 ), and Unif(S) denotes the uniform probability distribution on a set S. For a, b ∈ R, we write a ∧ b = min{a, b} and a ∨ b = max{a, b}. Positive constants are denoted by C, C 1 , c, c 1 etc. We make use of the usual Big-O notation in terms of O, o, Ω and Θ.

Background
We start by providing some context for our main result. After fixing the setup, we derive bounds on the prediction error of PCR and put them into relation to existing results on CLS. This will point to a significant gap that motivates our analysis of CLS in §3.

General setting
We consider fixed design linear regression for data (y i , x i ), with y i taking values in R and x i taking values in R d , i ∈ [n]. The predictors x i are considered as fixed, and with f i = E[y i ] and ξ i following a distribution with mean zero and variance σ 2 , i ∈ [n]. Moreover, the {ξ i } n i=1 are assumed to be uncorrelated. More concisely, we write y = f + ξ, where y = (y i ) n i=1 , etc. We denote by X ∈ R n×d the design matrix whose rows are given by x i , i ∈ [n]. The optimal linear predictor Xw * of y given X with respect to squared loss is defined by the optimization problem where the expectation is with respect to the noise ξ. Any minimizer w * of the above problem satisfies Xw * = P X f with P X defined as in the paragraph on notation above; if there are multiple such w * we choose the one with minimum 2 -norm. Accordingly, we define the excess risk of an estimator θ = θ(X, y) of w * by If the linear model holds exactly (i.e., P X f = f ), E( θ) equals the in-sample mean squared prediction error that measures how well the {x i θ} n i=1 predict the "denoised" observations {x i w * } n i=1 on average. An ordinary least squares (OLS) estimator w satisfies X w = P X y. Its excess risk is given by To keep matters simple, we assume that X has full rank d ∧ n unless otherwise stated.
In this paper, we are interested in a high-dimensional setup in which rank(X) is of the same order of magnitude as n. In this situation, OLS does not yield satisfactory statistical performance. Moreover, if both n and d are large, obtaining w or making predictions based on w becomes computationally costly.
In light of these issues, it makes sense to consider alternatives that aim at leveraging some sort of low-dimensional structure. Scenarios in which w * exhibits one of various forms of sparsity are dominating in the literature, see the monographs of [8,15] for an overview. In the present paper, we follow another direction in which the predictors {x i } n i=1 are linearly mapped into a lower-dimensional space, and linear least squares regression is then performed based on the subspace obtained in this way. Put differently, one considers a new design matrix X R = XR with R being a d-by-k matrix, k d. On the statistical side, one potentially achieves a substantial reduction of variance at the expense of an increased bias as made precise below. The excess risk of the approach is given by where w R is a least squares solution based on the reduced design matrix X R , i.e., w R satisfies X R w R = P X R y, and the expectation is with respect to ξ (in later sections, R will be random, and we will then also take the expectation with respect to R). Straightforward calculations show that E(R) can be decomposed into a bias and a variance term: We commonly have rank(X R ) = k. The choice of k determines the bias-variance trade-off. If k can be chosen much smaller than d ∧ n while at the same time the magnitude of the bias can be controlled, an improvement over the excess risk of OLS in (1) is obtained. The approach can as well be motivated from the computational side given that we only need to solve a least squares problem of dimension k instead of d ∧ n. In addition, having a smaller number of predictors yields savings in storage and when making predictions.

Excess risk of PCR
The traditional choice of constructing X R is in terms of the leading principal components of X. Subsequent use of this new set of predictors in regression is known as principal components regression (PCR). Let X = U ΣV be the singular value decomposition (SVD) of X, where U ∈ R n×d∧n , U U = I, is the matrix of left singular vectors, Σ ∈ R d∧n×d∧n is the diagonal matrix whose diagonal contains the decreasingly ordered sequence of singular values σ 1 ≥ . . . ≥ σ d∧n , and V ∈ R d×d∧n , V V = I, is the matrix of right singular vectors. For r ∈ {1, . . . , d ∧ n}, consider where U r and V r ∈ R d×r contain the top r left respectively right singular vectors, and Σ r contains the corresponding singular values. The remaining singular vectors respectively singular values are contained in U r+ , V r+ and Σ r+ . The top r principal components are extracted from X by setting R = V r : The corresponding projection is given by P X R = U r U r and the bias term in (3) results as Let us define α * ∈ R d∧n by Combining (2), (5) and (6), the excess risk of PCR can then be expressed as follows.
We see from (7) that the excess risk of PCR behaves favorably if (i) the tail of the squared singular values at truncation level r is small (i.e., X can be well approximated by a matrix of rank r) and (ii) if there are no large coefficients in α * outside its top r entries corresponding to the leading singular vectors. Condition (ii) constitutes the main source of criticism of PCR: if nature is malicious, then α * has most of its mass in α * r+ . Under generic random sampling, however, this is not a concern: if V is sampled uniformly at random from its respective Stiefel manifold, then α * / α * 2 is uniformly distributed on the unit sphere in R d∧n so that the entries of α * are roughly homogeneous in magnitude.
In the sequel, we derive a series of bounds on the excess risk of PCR depending on the decay of the squared singular values {σ 2 j } d∧n j=1 . For this purpose, we use the following simple upper bound on E(V r ): where T r (X) = U r Σ r V r equals the best rank r-approximation to X with respect to the Frobenius norm. For what follows, we assume that X is scaled such that 4 The quantity ∆ r 2 F /n in (8) can then be expressed as After these preparations, we study the excess risk of PCR in three basic scenarios.

Scenario (F): perfectly flat spectrum
A flat spectrum means that σ 2 . The choice of r minimizing the bound (8) is given by r * = d ∧ n if α * ∞ > σ/ √ d ∨ n, and r * = 0 otherwise. We note that when using the exact expression (7) for the excess risk, the optimal r * would result as the largest value of r such that α r > σ/ √ d ∨ n. Eventually, this does not make much of a difference if the entries of α * are of a comparable magnitude, and does not affect the conclusion that in general, we cannot hope for improvements over OLS when the spectrum is constant.

Scenario (P): polynomial decay
, for q ≥ 2 and a constant C determined by the relation d∧n j=1 σ 2 j = n · d. Comparing series and integrals, we obtain Minimizing the right hand side w.r.t. r, we obtain To get some insight into (11), fix q = 2 and consider the case of generic random sampling of V as discussed above so that α * / α * 2 follows a uniform distribution on the unit sphere in R d∧n . In this situation, α * ∞ / α * 2 scales as O( log(d)/d) as d gets large. Assuming further that α * 2 = O(1) yields r * = O({log(d)n} 1/2 ) and E(V r * ) = O( log(d)/n). We have hence identified a regime in which PCR achieves better statistical and computational performance than OLS if d = Ω(n). Clearly, the improvements get amplified as q increases.

Scenario (E): exponential decay
Suppose that σ 2 j = C 0 θ j for θ ∈ (0, 1). Then, τ (r) ≤ θ r 1−θ = C 1 exp(−cr), say. The optimal choice of r * and the corresponding bound on E(V r * ) result as Cases (P) and (E) show that PCR may improve significantly over OLS in terms of achievable dimension reduction and excess risk depending on the decay of the spectrum of X.

Existing bounds for CLS
We now consider the case of dimension reduction via a random matrix R. We refer to the columns of R as "random projections" as R maps the predictors {x i } n i=1 to a random linear subspace, typically of dimension k. Regarding the distribution of R, sampling its entries i.i.d. from a Gaussian distribution with expectation zero and variance 1/k constitutes the basic case in the literature on randomized dimensionality reduction [39]. The column space of R then follows the uniform distribution on the Grassmannian G(d, k). Random Gaussian matrices of this form are the canonical example of Johnson-Lindenstrauss transforms [21] (henceforth JLTs for short), cf. Definition 1 below. This class of matrices extends to i.i.d. sub-Gaussian matrices [1,31], the fast JLT of [3], and certain row-subsampled orthonormal matrices [4,26,38] as they are also used in compressed sensing [11].
Maillard & Munos [29] were the first to study the use of JLTs for randomized dimension reduction in least squares regression with random design under the name "compressed least squares" (CLS). They show a bound on a corresponding notion of excess risk of the order for k = Θ( √ n log n w * 2 E[ x 1 2 ] 1/2 /σ) random projections. For fixed design, Kaban [23] (specializing Theorem 1 therein to the case where R has i.i.d. N (0, 1/k) entries) shows that where E(R) is defined in (2) and the expectation in (14) is with respect to R.
Optimizing the bound (14) with respect to k, we obtain with c = √ c. Comparing the bound of Maillard and Munos (13) with (15), we essentially observe an agreement apart from a √ log n factor, noting that for random design with In [29,35], the excess risk bound in (15) is interpreted as being of the order O(1/ √ n). This would mean that the performance of CLS is comparable to that of PCR in scenario (P) with exponent q = 2 above, independent of the {σ 2 j } d∧n j=1 . However, such interpretation is not valid in general. In a fixed design setting, it is common to assume that the columns {X j } d j=1 of X are scaled such that X j 2 2 = n, j ∈ [d], whereas for standard random designs, e.g. X with i.i.d. rows from a zero-mean Gaussian distribution with unit variances, this scaling holds in expectation. In this case, tr(Γ) respectively E[ x 1 2 2 ] evaluate as d which makes the bounds (13) and (15) of rather limited use. For (15), we obtain This means that k * is of the same order (or may even exceed) d ∧ n, while the bound on the excess risk is generally inferior to that of OLS (1). It turns out that the outcome (15) is the consequence of a crude bound on the bias of CLS. From (14), we find that the correct variance term σ 2 k/n is present. On the other hand, a simple argument shows that the bias term tr(Γ) w * 2 2 /k is improvable: if X has rank r, 1 ≤ r ≤ d ∧ n, CLS with R as a matrix with N (0, 1)-entries yields P X = P X R and in turn zero bias in (2) with probability one as long as k ≥ r. By contrast, according to (14) basically k/d → ∞ is required for the bias to vanish.
In [23], the expected bias (where the expectation is w.r.t. R) is bounded as Evaluation of (16) then inevitably leads to the term tr(Γ) w * 2 2 /k as it appears in (14).
In a recent paper by Thanei et al. [37], the following bound is used instead of (16): The authors evaluate the above expectation for R with i.i.d. N (0, 1/k) entries, and then minimize with respect to v. This yields the bound (cf. Theorem 2 in [37]) In order to assess (18) for improvements over earlier bounds, one needs to gain more insights into the weights {ω j }. In Appendix A, we show that min 1≤j≤d∧n ω j ≥ 2/(2 + k), which implies that Compared to (14), this means that at best, the term tr(Γ) w * 2 2 gets replaced by (w * ) Γw * . However, even the lower bound in (20) does not yield satisfactory results if X is (approximately) of low rank as the bias term again scales as O(1/k) independent of the spectrum. As a consequence, no matter how small the rank of X is, according to (20) we still obtain an upper bound on the bias of O(1/ √ n) and k * = Ω( √ n) for the optimal number of random projections.

Improved Analysis (of CLS)
In this section we present and discuss the main result of the paper, a bound on the excess risk of CLS that is of a similar flavor of that of PCR in §2.2. In this manner, we establish a substantially stronger connection between CLS and PCR as could be made based on other analyses reviewed in §2.3.

Assumptions on the random projections
Our analysis basically requires R to be a Johnson-Lindenstrauss transform (JLT). The precise conditions are given as follows.
holds with probability at least 1 − δ.
The next definition is akin to the restricted isometry property in the theory of sparse estimation [6] with the difference that approximate norm preservation is required only for a single subspace (as opposed to the union of subspaces of sparse vectors). Definition 2. Let V ⊂ R d be an arbitrary subspace of dimension s. A random d-by-k matrix R is said to be an (s, ε, δ)-restricted isometry for ε, δ ∈ (0, 1) if holds with probability at least 1 − δ.
Remark. The conditions in the above two definitions are related in the following way: it is shown in [6] that if R is a 12 ε k , ε/2, δ -JLT, then R is also an (s, ε, δ)-restricted isometry. We here state two separate definitions for ease of reference.

Main result
We are now in position to state our main result.
Theorem 1. For r ∈ {1, . . . , d ∧ n}, let the following conditions be satisfied: Then with probability at least 1 − δ 1 − δ 2 , Conditional on the event (21), the excess risk of CLS (3) can be bounded as A meaningful interpretation of Theorem 1 and the bound (22) requires an understanding of how large the number of random projections need to be so that the conditions (C1) and (C2) are satisfied. The next statement addresses this key point.
According to Proposition 1, the bound on the excess risk (22) holds with high probability for k = Ω(r log n) sub-Gaussian random projections. The logarithmic factor can potentially be removed, in view of results in Halko et al. [14] for Gaussian random projections. Specifically, in this case Halko et al. show that the expectation of the left hand side of (21) with respect to R can be bounded as for k ≥ r + 2. In particular, for k = 2r + 1, the error in Frobenius norm for approximating the matrix X by P X R X is within a factor two of the r-truncated SVD (4).

Comparison of PCR and CLS
Statistical performance. Comparing the bounds on the excess risk (8) and (22) for PCR and CLS, respectively, we see an agreement in their structure up to constant factors (ignoring the potentially spurious log-factor in the required number of projections k) and the change of α * 2 ∞ to w * 2 2 . As a result, CLS profits from a rapid decay in the sequence of singular values of X as does PCR, enjoying similarly favorable excess risk bounds under scenarios (P) and (E) given at the end of §2.2; changes in those bounds are only in terms of constants and the replacement of α * 2 ∞ by w * 2 2 . The latter may amount to a factor of d ∧ n in the worst case 2 . However, in light of (11) and (12) this difference does not have much of an effect as long as the spectrum of X exhibits strong decay.
In spite of this, there is an extreme case in which the ratio of the excess risk of CLS and PCR can be arbitrarily large as can be seen from the exact bound (7): if α * happens to be perfectly aligned with the top r singular values so that α * r+ = 0, we have E(V r ) = 0. On the other hand, the column space of X R does not contain that of U r unless k = d ∧ n, hence in this rather specific case CLS falls short of PCR. On the other hand, we are not aware of scenarios in which CLS can substantially improve over PCR.
To be fair, it is worth pointing out that the bound (22) need not always be an improvement over those reviewed in §2.3, but it yields qualitatively a much better fit if the singular values decay rapidly.
Computational cost. PCR requires access to the top r left singular vectors of X which is typically done via Krylov subspace methods like Lanczos' algorithm [13] in O(ndr) flops on average. Assuming that k = O(r), this is comparable to CLS whose dominating operation is given by the matrix-matrix multiplication XR. However, as discussed in [14], reducing PCR to an O(ndr) operation is problematic as the actual computational complexity can vary significantly depending on subtle spectral properties of X: while the nominal complexities of the two approaches are comparable, CLS often requires less runtime. In addition, CLS provides several other advantages. The matrix-matrix multiplication is trivially parellelizable, and can be computed in a single pass over the data. The latter property becomes beneficial once X is too large to fit into the main memory since accessing external memory is slow. Lastly, Theorem 1 and Proposition 1 are likely extendable to structured JLTs [3,4,26,38] for which the cost of forming XR is considerably reduced.

Further topics
We here discuss several miscellaneous topics that naturally arise from the analysis of the preceding section.

Assessing the applicability of CLS when being on a computational budget
The analysis above indicates that CLS can achieve reasonable statistical performance while using a substantially reduced number of predictors provided that the singular values of X decay at a fast rate. Verifying such decay seems to require the SVD though, which would eliminate potential computational advantages of CLS over PCR. Direct evaluation of the quantity δ 2 (21) is computationally demanding as well: finding the left singular vectors U ∈ R n×k of X R can be done in O(nk 2 ) flops; however, forming P X R X = UU X requires O(ndk) flops which is of the same order of magnitude as computing X R , the 2 if n < d, we may assume without loss of generality that w * is contained in the orthogonal complement of the null space of X so that α * 2 2 = w * 2  dominating operation in CLS. Hence, we would like to circumvent this operation. We here suggest to recycle the method of random projections in order to get an accurate estimate of δ 2 R while achieving a reduction to O(nd) flops. The basic idea is to apply P X R to a small number L of random elements from the range of X rather than to all its columns.
Conditional on R, consider the following estimator of Then, for any c ∈ (0, 1) and any C > 1, as long as where the probability is w.r.t. {ω l } L l=1 and conditional on R.
For example, setting C = 3, c = 1/3, we would need L = 36 to estimate δ 2 R within a multiplicative factor of 3 with probability near 1. Note that computing P X R Xω l = U(U (Xω l )) for a single l only amounts to O(nd) flops. The constants in Proposition 2 may not necessarily be optimal. In the example of Figure 1, we use L = 10 random vectors to estimate δ 2 R (k) simultaneously for 1 ≤ k ≤ 300.

Column Subsampling vs. Dense Random Projections
We can think of CLS as a scheme that picks k random elements from the subspace spanned by the columns of X, and subsequently uses these random elements to predict the response y. The analysis of the previous section asserts that if k is in proper relation to r, then we will do roughly as good as when using the top r principal components as predictors. For this result to hold, Theorem 1 imposes certain restrictions on the matrix R, or equivalently, the way the random elements are generated. It is natural to ask whether one can be more flexible in this regard. The simplest approach one can possibly think of would be to select columns from X uniformly at random without replacement, i.e., where {e 1 , . . . , e d } are the canonical basis vectors of R d , and Unif(. . .) denotes the uniform probability distribution, so that XR = [X i 1 . . . X i k ] is a random column submatrix of X.
Note that a random matrix R generated according to (24) fails to be a (1, ε, δ)-JLT for any ε ∈ (0, 1), any δ ∈ (0, 1/2) and any k ≤ d 2 3 , hence the framework used for Theorem 1 does not yield useful results. In general, the excess risk of column subsampling can be considerably worse than when using a suitable JLT. Both approaches are equivalent in terms of excess risk if i) n ≥ d and the spectrum is flat (scenario (F) in §2.2), or ii) X is a random Gaussian matrix with i.i.d. N (0, 1) entries.
Proposition 3. Let S by a random d × k matrix generated according to (24) and let R be an d × k random matrix with i.i.d. N (0, 1/k) entries.
ii) If n < d, there exists an X with flat spectrum, i.e., where the expectations are w.r.t. the randomness of both X and R respectively X and S.
We note that property iii) crucially relies on Gaussianity of the entries of X. Proposition 3 will be complemented with numerical results presented in §5 below.

Averaging
An idea in the spirit of bagging [7] considered in [37] is to generate an ensemble of i.i.d. random projections {R b } B b=1 and then average the resulting predictions where w R b is the least squares estimator given response y and design matrix are the projections on the respective column spaces of {XR b } B b=1 . As B → ∞, the average of projectors can be expected to behave similarly to P k = E[P XR ]. The next proposition summarizes basic properties of averaging and the operator P k .
iii) Reduction in variance: iv) In the setting of ii), we additionally have where the last expectation is w.r.t. R and R drawn independently, and {θ (L, L )} k =1 denote the canonical angles between two k-dimensional subspaces L and L [13].
Parts ii) and iv) already appear in [37] for the n ≥ d case with slight differences in presentation. Parts ii) and iv) provide a rough quantification of the reduction in bias and variance by averaging in the limit B → ∞ and Gaussian random projections. Since the relationship between the singular values {σ j } d∧n j=1 and the coefficients {η j } d∧n j=1 is not well understood, a more precise connection is yet to be made. In light of Theorem 1, we know that when choosing k = Ω(r log n) large enough, the bias essentially depends only on the tail {σ 2 j } j>r . Accordingly, the {η j } r j=1 must be close to one, and accordingly the variance term in iv) is at least about σ 2 r/n. For a choice of k that makes the tail {σ 2 j } j>r (and hence the bias) negligible, averaging thus does not yield significant benefits. However, when using averaging, the optimal choice of k is guaranteed to be lower than when using a single random projection.
For fixed k, the maximum possible reduction in variance is seen to be a factor k/(d ∧ n), from σ 2 k/n to σ 2 k 2 /{(d ∧ n) · n}, which follows immediately from the representation of the variance in terms of the {η j } d∧n j=1 in iv) and the fact that the minimum 2 -norm over the simplex ∆(k) is attained at its barycenter. It is not hard to see that this maximum reduction is attained precisely when the spectrum of X is flat. This case remains of limited interest though as the bias is of the same order as without averaging.
The last identity in iv) provides a geometric interpretation of the variance after averaging in terms of the expected sum of squared cosines of the principal angles between to independently generated subspaces range(XR) and range(XR ). Again, we see that the bias is low, i.e., if range(XR) well approximates range(X), two randomly chosen subspaces will be essentially aligned so that the squared cosines evaluate all about one, which implies that averaging will not achieve any substantial reduction in variance.

Experiments
We present the results of experiments with synthetic and real data in order to illustrate and support the main results of the paper, as well as pointing to some open questions.

Synthetic data
We start by generating a random n-by-d matrix X 0 with n = 1000, d = 500, where the entries of X 0 are drawn i.i.d. from the N (0, 1) distribution. The SVD of X 0 is given by We then replace Σ 0 by a diagonal matrix Σ whose diagonal elements {σ j } d j=1 are chosen in a deterministic fashion according to one of the following regimes: polynomial : σ j ∝ j −q , q ∈ {.5, .75, 1, 1.5, 2, 4}, j ∈ [d], where the constant of proportionality is determined by the scaling d j=1 σ 2 j = n · d. We subsequently work with X = U 0 ΣV 0 , generating data from the model where w * is drawn uniformly from the unit sphere in R d , σ ∈ 2 p , p ∈ {−1, −0.5, . . . , 1}, and ξ has i.i.d. standard Gaussian entries. Given data (X, y), we then perform PCR with ten different choices of r, using an equispaced grid of values depending on the regime according to which X has been generated. For CLS, R is chosen as a standard d-by-k Gaussian matrix with k = αr, where the oversampling factor α ∈ {1, 1.2, 1.5, 2, 2.5, 3}. We conduct 100 independent replications for each regime. Our main interest is in the bias and the prediction error of PCR and CLS: where w Vr and w R denote the least squares estimator for data (XV r , y) and (X R , y), respectively. In order to compare CLS and column subsampling, we also generate R as a d × k column submatrix of the identity chosen uniformly at random, cf. (24).
A subset of the results involving two different regimes of decay is shown in Figure 2. In the regime of polynomial decay (q = 1), we observe that the bias of CLS is roughly proportional to that of PCR (or alternatively, we need to choose k as a suitable multiple of r to achieve the same bias). Accordingly, the dip in the prediction error curve occurs for k = 2r * with r * = 40 yielding the smallest prediction error for PCR. In both low and high noise settings, PCR and CLS improve significantly over OLS in terms of prediction error (≈0.02 and ≈0.04 vs. 0.125 and ≈ 0.1 and ≈ 0.15 vs. 2). In the regime of exponential decay, the bias of CLS is not quite proportional to that of PCR for small values of r, but this improves once r reaches 20. Overall, the results agree well with what is suggested by Theorem 1. Figure 3 shows that if X 0 in (25) is generated from a Gaussian distribution, Gaussian random projections and column subsampling perform the same on average as asserted by Proposition 3. Interestingly, the performance of column sampling degrades considerably when the entries of X 0 are drawn i.i.d. from the standard Cauchy distribution, whereas Gaussian random projections are not affected by this change.

Real data
We here consider two data sets from the UCI machine learning repository for illustration purposes. Twitter social media buzz. Our data analysis is inspired by that in [28]. This is a regression problem in which the goal is to predict the popularity of topics as quantified by its mean number of active discussions given 77 predictor variables such as number of authors contributing to the topic over time, average discussion lengths, number of interactions between authors etc. We here only work with the first 8000 observations. Several of the original predictor variables as well as the response variable are log-transformed prior to   analysis. Following [28], we add quadratic interactions which yields d = 3080 predictors in total. We consider 50 random partitions into a training set of size 6000 and a test set of size 2000 which is used to evaluate the prediction error. The design matrices of the training sets are centered and subsequently scaled to unit norm before performing least squares regression with a centered response and a reduced design matrix obtained from i) a truncated SVD with r ∈ {5, 10, . . . , 50, 60, . . . , 100, 120, . . . , 200}, ii) random Gaussian projections with k = rα, where the grid for the factor α is as for the synthetic data, iii) subsets of k columns sampled uniformly at random without replacement. The thus obtained regression coefficients are back-transformed to account for the preliminary centering/scaling step before using them to make predictions on the test set. Approaches i) to iii) are compared in terms of the mean squared prediction error on the test set. For each of the 50 partitions into training and test set, we obtain ten i.i.d. sets of random projections respectively subsampled columns for ii) respectively iii) and perform regression with each of the resulting reduced matrices, and compute the average as well as the maximum error over each of those ten runs. Blog Feedback. The task associated with the data set is the prediction of the number comments on blog posts [10] within a 24-hour time window after a certain base time. The training set consists of n = 52, 397 observations and originally 280 predictors including meta data about the blogs in which the posts were made, the number of comments received within specific time windows before the base time, number of comments on related posts, bag-ofwords data, and the weekday of the post. After eliminating redundant and non-informative predictors, we end up with 114 predictors, several of which are log-transformed. A subset of those are expanded in terms of quadratic and interaction terms which eventually yields d = 2, 589. The target variable is log-transformed as well. As distinguished from the first case study, this data set comes with a fixed test set of size 7, 624. Evaluation then proceeds as described above, with the only difference that r ∈ {10, 20, 50, 100, . . . , 500} in order to adjust for a slower rate of decay of the singular values.
The main results are summarized in Figure 4. The left panels show that the singular values exhibit different rates of decay for the first respectively second data set. For the Twitter data, the decay is not far from linear on a log-scale, whereas the decay is noticeably slower for the Blog Feedback data. As already seen for the synthetic data experiments, CLS requires only a moderate amount of oversampling to achieve the approximation error of PCR X − P Ur X 2 F ; we here use this quantity as a surrogate for the bias since w * is unknown, and in order to establish a connection to result (21) in Theorem 1. The corresponding quantity of column subsampling is essentially to that of CLS for the Blog Feedback data, but is markedly worse for the Twitter data set. Turning to the right panels, we observe that the test error of PCR dips for r = 70 (Twitter) respectively r = 150 (Blog Feedback). The errors of CLS and column subsampling are not far off, but again an increase of k relative to r is indicated to achieve optimal results. Comparing CLS and column sampling, we see that the latter overall performs better particularly for small k, but is also less stable in the sense that the gap between the average error and the maximum error (taken over ten different realizations of the random matrix R) is substantially larger than for CLS. In particular, for the Twitter data, the maximum error of CLS is uniformly smaller than that of column subsampling. Eventually, the relative performance will depend on properties of X (as illustrated by Figure 3), and likely on possible sparsity of w * and its interaction with properties of X.

Conclusion
Regarding the use of random projections in linear regression, the literature has mainly focused on the setting in which the random matrix R is applied from the left, i.e., X is reduced to RX. The converse setting with X being reduced to XR has been studied in several earlier papers as well, however, without establishing a tight link to PCR at the level of achievable excess risk as made herein. Towards the end of our paper, we raise the question how much randomness is needed in generating a random subspace so that such connection holds true. Gaussian random projections induce "maximum randomness", whereas in column sub-sampling we only consider random subspaces spanned by subsets of the canonical basis vectors. Given the dramatic computational advantages of the latter over "dense" random projections it is worth elaborating general conditions under which column sub-sampling can be shown to exhibit similar statistical performance as classical Johnson-Lindenstrauss transforms.

C Proof of Theorem 1
Before going into the proof, let us introduce a bit more of notation. Below, T s (M ) denotes the best rank-s approximation of a matrix M with respect to Frobenius norm which can be obtained from a truncated SVD, cf. (8). Moreover, we write M − for the Moore-Penrose pseudoinverse of a matrix M . The j-th column of M is denoted by M :,j . M 2 denotes the spectral norm.
Note that in view of (3) so that (22) immediately follows from (21). In the sequel, we hence prove (21), following the strategy of the proof of Theorem 14 in [34]. The proof can be partitioned into three basic steps.
Lemma C.1. We have Proof. Observe that according to the definition of P X R , we have Let B * ∈ R k×d denote a minimizer of the optimization problem on the l.h.s. of (28) such that P X R X = X R B * . Let the SVD of that matrix be given by Denote by M r ∈ R d×d the diagonal matrix whose first r diagonal entries are equal to one and zero else. Then T r (P X R X) = X R B * M r = X R B. Since B = B * M r is a feasible solution for the minimization problem (28), we conclude (27).
Lemma C.2. We have where Π is the orthogonal projection on the subspace spanned by the columns of Φ r = P X R T r (X), i.e. Π = P Φr = P P X R Tr(X) .
Proof. Consider the following optimization problem: Then any minimizer B * of the above problem satisfies X R B * = T r (P X R X) (see Proposition 1 and Lemma 14 in [9]). Noting that Π = X R M for some matrix M ∈ R k×d with rank(M ) ≤ r (as T r (X) has rank no more than r), M is feasible for the above optimization problem, and we conclude (29).
In the second step, we decompose X − ΠX 2 F into two parts: an "easy" part and one more delicate part that requires sophisticated analysis. Recalling (4), we have where the inequality follows from the fact that I − Π is an orthogonal projection.

It remains to bound
Let us write C * = X − R and C = (T r (X)R) − . Note that for any matrix M of appropriate dimension, we have Moreover, observe that according to the definition of Π in (30) ΠT r (X) = P P X R Tr(X) T r (X) = P X R T r (X).
Using (33) and (34), we obtain that and consider the least squares problems with minimizer λ * i , i = 1, . . . , n, and the corresponding sketched regression problems with sketching matrix R : min with minimizer λ i , i = 1, . . . , n. It is straightforward to show that For the sketched regression problems, an optimal set of coefficients is given by Identifying terms, we see that the right hand side in (35) can be written as Consider the residuals By analyzing the structure of (general) sketched regression problems, it can be shown that where V r is the same matrix as in (4). The analysis leading to property (39) will be given at the end of this proof. In the sequel, we use this property in combination with conditions (C1) and (C2) to deduce the final result. We will first derive a lower bound on the l.h.s. of (39) with the help of (C2), and then we derive an upper bound on the r.h.s. by means of (C1). Combining both, we obtain an upper bound on n i=1 β i 2 2 and in turn on the quantity T r (X) − ΠT r (X) 2 F that we eventually need to bound. Let V r ⊂ R d denote the column space of V r . Invoking (C2) with V = V r , the following event holds with probability at least 1 − δ 2 : where Ω = V r RR V r and λ min (·) denotes the smallest eigenvalue. Conditional on that event, we have that Next, observe that V :,j w i = 0, j = 1, . . . , r, i = 1, . . . , n, as follows immediately from the definition of the {w i } n i=1 in (38). We now apply (C1) with the following set of vectors: where w i = w i / w i 2 , i = 1, . . . , n. Note that |S| = 2rn. In the next step, we will establish that with the specified probability, the inner products between V :,j w i , are preserved up to an additive term of ε 1 w i 2 , i ∈ [n], j ∈ [r], where ε 1 = ε 1 / √ r according to (C1). Recall that for arbitrary x, y, it holds that x, y = 1 4 x + y 2 2 − x − y 2 2 . With R being a (2nr, ε 1 , δ 1 ) JLT, we therefore have with probability at least 1 − δ 1 It follows that R V :,j , R w i ≥ V :,j , w i −ε 1 and in turn also R V :,j , R w i ≥ V :,j , w i − ε 1 w i 2 by homogeneity.
Regarding the upper bound, we argue analogously: and thus R V :,j , R w i ≤ V :,j , w i + ε 1 and in turn R V :,j , R w i ≤ V :,j , w i + ε 1 w i 2 .
In order to finish the proof, it remains to establish (39) as is done below. Let λ * denote a minimizer of the original least squares problem and let λ denote the minimizer of the sketched least squares problem. Furthermore, we write U for the matrix of left singular vectors of A.
for certain vectors α and β. We now decompose the least squares error when using λ: Bringing the sketching matrix R into play, we have Furthermore, we have Combining the previous displays, we obtain that R U(α + β) = R Uα + P R U R w and thus R Uβ = P R U R w.
Multiplying both sides with U R, this implies Note that (42) has the form as claimed in (39) with V r playing the role of U: according to (36), this is as it should be since V r contains the left singular vectors of T r (X) . The proof is thus complete.

D Proof of Proposition 2
Let us recall that the statement is conditional on R, and for what follows only {ω l } L l=1 is considered as random. We first verify that Xω l − P X R Xω l 2 2 is an unbiased estimator of δ 2 R , l ∈ [L]. We have Concentration. We now establish concentration for the estimator δ 2 R by invoking results in [18,27]. Let ω ∈ R d·L be the vector one obtains when stacking ω 1 , . . . , ω L vertically. Let us also introduce Ψ = X (I − P X R )X and let Ψ = 1 L I L ⊗ Ψ, where ⊗ denotes the Kronecker product. Then δ 2 R can be re-written in the following way: In other words, δ 2 R can be expressed as a quadratic form in a Gaussian random vector of dimension dL and a positive definite matrix. We can thus use the following tail inequalities [18,27] P(ω Ψω > tr(Ψ) + 2 t tr(Ψ 2 ) + 2 Ψ 2 t) ≤ exp(−t), t > 0.
This can be re-written using the following relations: As a result, for any 0 < c < 1 and any C > 1, as long as

E Proof of Proposition 3
We start with a basic observation to be used several times. Let R be an (d ∧ n) × k random matrix with N (0, 1) entries. From the rotational invariance of the Gaussian distribution, we have that where D = denotes equality in distribution. Turning to property i), using that Σ = √ nI we nU R = according to (43). We then compute where the inverse exists with probability one. Accordingly, With α * u = α * / α * 2 and U R ⊥ as a matrix containing a set of orthonormal basis vectors of range( R) ⊥ as its columns, we have [20]. By rotational invariance, where E d−k ∈ R (d−k)×d contains the first d − k canonical basis vectors as its rows and u ∼ Unif(S d−1 ). Combining (44), (45) and (46) concludes the derivation of the bias. The expression for E[E(R)] given in the proposition is obtained by adding the variance term σ 2 k/n. Similarly, we evaluate E[E(S)] by computing its bias. Expanding P X S , we get that where we have used that V V = I and S S = I, where the latter property results from the fact that column sampling is done without replacement. It remains to evaluate E[SS ].
The entries of SS are given ( S i,: , S j,: ) 1≤i,j≤d , where S l,: denotes the l-th row of S, l ∈ [d].
Turning to property ii), the arguments for E(R) parallel those used for i), with the difference that Σ = √ dI n . The subsequent steps are as in i) and are thus omitted. The situation is different for E(S) because the expansion (47) is no longer valid since V V = I as n < d. Consider a matrix X with dimensions n < d whose matrix of right singular vectors V ∈ R d×n takes the form

F Proof of Proposition 4
For property i), observe that the map A → φ(A) := (I − A)Xw * 2 2 /n from R n×n to R + is convex, hence φ 1 . Taking expectations then yields the assertion. Likewise, regarding property iii), we have The claim then follows by noting that for any pair (b, b ), we have tr( We finally turn to properties ii) and iv). Consider the operator P k = E[P XR ]. We first show that range(P k ) = range(X). The inclusion range(P k ) ⊆ range(X) holds trivially. For the other direction, since P k is symmetric positive definite, it suffices to show that v P k v > 0 ∀v ∈ range(X). Suppose by contradiction that there exists v ∈ range(X) s.t.
v P k v = v E[P XR ]v = E R [ P XR v 2 2 ] = 0, which would imply that v is contained in the orthogonal complement of range(XR) with probability one, i.e., v ∈ null((XR) ) ⇔ R X v = 0 with probability one. This contradicts the fact that the entries of R are from a distribution that is absolutely continuous with respect to the Lebesgue measure. In particular, the fact that range(P k ) = range(X) implies that P k has exactly d ∧ n positive eigenvalues {η j } d∧n j=1 contained in the simplex ∆(k) = {z : d∧n j=1 z j = k, 0 ≤ z j ≤ 1}, noting that P k is an expectation over orthogonal projections onto k-dimensional subspaces. The last property of P k to be established in order to arrive at ii) and iv) is the fact that U P k U = diag(η 1 , . . . , η d∧n ), where U is the matrix of left singular vectors of X as its columns. We have where the last identity uses that V R D = R by rotational invariance (43). It remains to show that the matrix is diagonal. This has been shown in [30], noting that a matrix A is diagonal if and only if DAD = A for all diagonal matrices D with diagonal elements ±1; the claim then follows from the fact that DΣ R D = Σ R and that R DΣ 2 D R = R Σ 2 R for all such D. We note that the diagonal elements {η j } d∧n j=1 of the diagonal matrix (48) depend only on the singular values {σ j } d∧n j=1 but not on U or V as follows again from rotational invariance. Equipped with the property U P k U = diag(η 1 , . . . , η d∧n ), we compute E Xw * − P X R Xw * 2 2 /n = 1 n (w * ) X E[I − P XR ]Xw * = 1 n (w * ) X (I − P k )Xw * = 1 n d∧n j=1 σ 2 j {α * j } 2 (1 − η j ), after expanding X in its singular value composition and recalling that α * = V w * . In the same vein, we obtain that Xw * − P k Xw * 2 2 /n = 1 n Xw * 2 2 − 2(w * ) X P k Xw * + (w * ) X P 2 k Xw * = 1 From U P 2 k U = diag(η 2 1 , . . . , η 2 d∧n ), we immediately obtain the first identity in property iv). For the second identity, we let R be an i.i.d. copy of R and note that where the last identity is obtained directly from the definition of canonical angles between subspaces [13].