Optimal equivariant prediction for high-dimensional linear models with arbitrary predictor covariance

In a linear model, consider the class of estimators that are equivariant with respect to linear transformations of the predictor basis. Each of these estimators determines an equivariant linear prediction rule. Equivariant prediction rules may be appropriate in settings where sparsity assumptions (like those common in high-dimensional data analysis) are untenable or little is known about the relevance of the given predictor basis, insofar as it relates to the outcome. In this paper, we study the out-of-sample prediction error associated with equivariant estimators in high-dimensional linear models with Gaussian predictors and errors. We show that non-trivial equivariant prediction is impossible when the number of predictors d is greater than the number of observations n. For d/n → ρ ∈ [0, 1), we show that a James-Stein estimator (a scalar multiple of the ordinary least squares estimator) is asymptotically optimal for equivariant out-of-sample prediction, and derive a closed-form expression for its asymptotic predictive risk. Finally, we undertake a detailed comparative analysis involving the proposed James-Stein estimator and other well-known estimators for non-sparse settings, including the ordinary least squares estimator, ridge regression, and other James-Stein estimators for the linear model. Among other things, this comparative analysis sheds light on the role of the population-level predictor covariance matrix and reveals that other previously studied James-Stein estimators for the linear model are sub-optimal in terms of out-of-sample prediction error.

High-dimensional linear models, where d is large, have been extensively studied in recent research. In this challenging setting, additional conditions on β, such as sparsity, are often required in order to ensure consistent estimation or prediction. Two of the more widely studied types of sparsity are ℓ0- and ℓ1-sparsity: β is sparse if its ℓ0- or ℓ1-norm is "small." While sparsity conditions may be required for consistency in high-dimensional linear models, these conditions may be untenable in some instances. Moreover, it remains important to identify optimal methods for practical objectives like out-of-sample prediction, even in non-sparse (or "dense") high-dimensional settings.
In this paper, we study high-dimensional out-of-sample prediction problems and a class of estimators that are equivariant with respect to linear transformations of the predictors x i , under the assumption that the data are multivariate normal. We argue that equivariant estimators are appropriate in problems where there is little prior knowledge about the relevance of the given predictor basis vis-à-vis the outcome y i . In particular, equivariant estimators may be appropriate in settings where sparsity assumptions on β are not desirable or realistic, as sparsity is highly dependent on the predictor basis. Our analysis provides new insight into the capabilities and limitations of dense methods for high-dimensional linear models.
Most of the results in this paper fall into one of three categories: (i) impossibility results, (ii) optimality results, and (iii) comparative results. Our primary impossibility result (Theorem 1 (c) in Section 3) implies that if d > n, then the only equivariant estimator is β̂_null = 0; thus, non-trivial equivariant prediction is impossible when d > n. While it is widely understood that high-dimensional "dense problems" are very difficult, our impossibility results help to make this idea more precise. It is worth pointing out that these results are derived under the assumption that Cov(x_i) = Σ is an arbitrary, unknown positive definite matrix (this is a random-predictor analysis). If Σ is known or estimable, then results from (Dicker, 2013) imply that non-trivial equivariant prediction may be possible even when d > n; this is discussed in more detail in Section 5.2.
The optimality results in this paper primarily focus on a class of James-Stein estimators for β, which are scalar multiples of the ordinary least squares (OLS) estimator β̂_ols = (X^T X)^{-1} X^T y. Stein shrinkage and the James-Stein estimator (James and Stein, 1961; Stein, 1955) are of fundamental importance in modern statistics. Most research on Stein shrinkage has focused on the Gaussian sequence model and the normal means problem; however, variants of the James-Stein estimator for linear models have also been studied (Baranchik, 1973; Copas, 1983; Huber and Leeb, 2012; Stein, 1960). The James-Stein estimators proposed in this paper are, to our knowledge, new. In Theorems 2-3 of Section 4, we prove that the proposed James-Stein estimators are asymptotically optimal among equivariant estimators when d/n → ρ ∈ [0, 1) and d → ∞. Our analysis of the James-Stein estimator shares similarities with (Marchand, 1993) and (Beran, 1996), who considered Stein estimation and equivariance in the normal means problem. However, the present analysis reveals unique features of linear models; for instance, our results demonstrate that adjusting for the degrees of freedom lost to a high-dimensional predictor has a non-trivial effect on out-of-sample prediction error.
Finally, we undertake a comparative analysis involving the proposed James-Stein estimator and other well-known dense estimators, including the ordinary least squares (OLS) estimator β̂_ols = (X^T X)^{-1} X^T y, ridge regression (Hoerl and Kennard, 1970; Tikhonov, 1943), and other previously studied James-Stein estimators for β. In Theorem 4 of Section 5.1, we show that if 0 < inf d/n ≤ sup d/n < 1 and d, n are sufficiently large, then the proposed James-Stein estimator has uniformly smaller predictive risk than the OLS estimator; hence, under the specified conditions, the James-Stein estimator is minimax. Our discussion of ridge regression helps clarify the role of the predictor covariance matrix Cov(x_i) = Σ in dense out-of-sample prediction problems. In particular, we show that if Σ is known, then a certain equivariant ridge estimator has smaller predictive risk than the James-Stein estimator; furthermore, results from (Dicker, 2013) imply that this ridge estimator is asymptotically optimal among equivariant estimators that may depend on Σ. More standard ridge estimators (which do not require knowledge of Σ) are also discussed. After discussing ridge regression, we consider another previously studied James-Stein estimator for β (Baranchik, 1973) and show that, perhaps surprisingly, it is sub-optimal in terms of out-of-sample prediction error.
The rest of the paper is organized as follows. In Section 2, we introduce notation and definitions, and describe the statistical setting for what follows. Equivariance is discussed in Section 3. The James-Stein estimator is defined in Section 4; some of its optimality properties are also discussed in Section 4. Section 5 contains a comparative analysis of the proposed James-Stein estimator and other dense estimators for β. A concluding discussion is contained in Section 6, where we consider practical implications of the results contained in this paper and possible extensions. Proofs may be found in the Appendices.

Notation, definitions, and the statistical setting
Let PD(d) denote the collection of d × d positive definite matrices. In addition to assuming the linear model (1), we assume that x_i ∼ N(0, Σ) and ε_i ∼ N(0, σ²), i = 1, ..., n, are independent, where Σ ∈ PD(d) and σ² > 0. Linear models with similar distributional assumptions have been previously studied by Stein (1960), Baranchik (1973), Breiman and Freedman (1983), Brown (1990), and Leeb (2009), among others; Dicker (2013) considered the model (1)-(2) under the additional assumptions Σ = I and σ² = 1. The significance of the normality assumption (2), and the possibility of relaxing it, is further discussed in Section 6.1.
Each estimator β̂ for β determines a linear prediction rule, ŷ(x) = x^T β̂. The unconditional out-of-sample prediction error (predictive risk) of β̂ is given by E{(y_new − x_new^T β̂)²} (3), where (y_new, x_new^T) is independent of (y, X) and has the same distribution as (y_i, x_i^T). We emphasize that the expectation in (3) is taken over (y_new, x_new) and (y, X). Broadly speaking, the goal of the unconditional out-of-sample prediction problem considered in this paper is to minimize (3) over estimators β̂.
In order to introduce more convenient notation for studying out-of-sample prediction error, let w_i = (y_i, x_i^T)^T and let V = Cov(w_i) ∈ PD(d + 1); under (1)-(2), the entries of V are determined by Var(y_i) = β^T Σβ + σ², Cov(x_i, y_i) = Σβ, and Cov(x_i) = Σ (4). In this way, we establish a correspondence between the parameters Σ ∈ PD(d), β ∈ R^d, and σ² > 0 in the linear model (1), and positive definite matrices V ∈ PD(d + 1). After standardizing by σ², the predictive risk (3) is equivalent to R_V(β̂) = σ^{-2} E_V{(y_new − x_new^T β̂)²}, where the subscript V in the expectation indicates that the expectation is taken over w_1, ..., w_n ∼ N(0, V). In fact, R_V(β̂) is the primary object of study in the sequel and we will typically refer to R_V(β̂) itself as the predictive risk (or out-of-sample prediction error) of β̂. Note that the predictive risk R_V(β̂) is completely determined by the estimator β̂ and the positive definite matrix V ∈ PD(d + 1). We will often write E_Σ(·) in place of E_V(·) when the expectation only involves the random predictors X. Similarly, we write P_V(·) or P_Σ(·) when computing probabilities involving w_1, ..., w_n or X, respectively.
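To make the predictive risk concrete, the following Monte Carlo sketch estimates the standardized excess risk E{(β̂ − β)^T Σ(β̂ − β)}/σ² of the OLS estimator under the model (1)-(2). The dimensions, covariance matrix, and coefficient vector below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma2 = 200, 20, 1.0

# An arbitrary (non-identity) predictor covariance and coefficient vector.
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + np.eye(d)
beta = rng.standard_normal(d)

def excess_risk_ols(reps=2000):
    """Monte Carlo estimate of E{(beta_hat - beta)^T Sigma (beta_hat - beta)} / sigma^2,
    the standardized excess predictive risk of OLS."""
    L = np.linalg.cholesky(Sigma)
    total = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, d)) @ L.T          # rows x_i ~ N(0, Sigma)
        y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
        bhat = np.linalg.lstsq(X, y, rcond=None)[0]
        diff = bhat - beta
        total += diff @ Sigma @ diff / sigma2
    return total / reps

print(excess_risk_ols())   # approx d/(n - d - 1) = 20/179, regardless of Sigma
```

The estimate agrees with the inverse-Wishart identity E_Σ tr{Σ(X^T X)^{-1}} = d/(n − d − 1) used later in the paper.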

Equivariance
Consider the following definition.
Definition 1. An estimator β̂ = β̂(y, X) of β is linearly equivariant if β̂(y, XA) = A^{-1} β̂(y, X) for all invertible matrices A ∈ GL(d), and scale invariant if β̂(ty, X) = t β̂(y, X) for all scalars t > 0. If an estimator is both linearly equivariant and scale invariant, we say that it is LiSc.
As an initial example, notice that the OLS estimator is LiSc. Linearly equivariant estimators are compatible with linear transformations of the predictors x i . Intuitively, this type of compatibility implies that the data are treated "the same" (for the purposes of prediction), regardless of the given predictor basis. Hence, linearly equivariant estimators may be preferred in situations where there is little prior knowledge about the relevance of the given predictor basis, insofar as it relates to the outcome. By contrast, sparsity assumptions convey specific information about the designated predictor basis, and linear equivariance is less appropriate for sparse problems. Indeed, most sparse estimators, such as lasso (Tibshirani, 1996), are not linearly equivariant. Scale invariance is less specific to non-sparse problems; however, in our view, it is a reasonable property to require of estimators for β, including sparse estimators (see, for instance, the scaled lasso (Sun and Zhang, 2012)).
In this paper, we primarily focus on LiSc estimators. Our main objectives include (i) finding LiSc estimators with small predictive risk and (ii) understanding the magnitude of these estimators' predictive risk in high-dimensional linear models.
A nice feature of LiSc estimators is that their predictive risk is completely determined by the signal-to-noise ratio (in addition to d, n). In particular, in order to evaluate the predictive risk of an LiSc estimator, it suffices to consider the case where Σ = I; this greatly simplifies calculations involving LiSc estimators.
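This invariance is easy to check numerically for the LiSc estimator c·β̂_ols with a fixed scalar c. The sketch below (dimensions and covariances are arbitrary illustrative choices) compares the Monte Carlo risk under two quite different models that share the same signal-to-noise ratio:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, c = 120, 12, 0.5        # c * beta_ols, with c fixed, is an LiSc estimator

def excess_risk(Sigma, beta, sigma2, reps=2000):
    """Monte Carlo estimate of E{(c*beta_ols - beta)^T Sigma (c*beta_ols - beta)} / sigma^2."""
    L = np.linalg.cholesky(Sigma)
    out = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, d)) @ L.T
        y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
        diff = c * np.linalg.lstsq(X, y, rcond=None)[0] - beta
        out += diff @ Sigma @ diff / sigma2
    return out / reps

# Two very different models sharing the same signal-to-noise ratio eta^2 = 1.
beta1, Sigma1 = np.ones(d), np.eye(d)
sigma2_1 = float(beta1 @ Sigma1 @ beta1)
A = rng.standard_normal((d, d))
Sigma2 = A @ A.T + np.eye(d)
beta2 = rng.standard_normal(d)
sigma2_2 = float(beta2 @ Sigma2 @ beta2)

r1 = excess_risk(Sigma1, beta1, sigma2_1)
r2 = excess_risk(Sigma2, beta2, sigma2_2)
print(r1, r2)   # nearly equal: the risk depends on (d, n) and eta^2 only
```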
For η² ≥ 0, let Θ_d(η²) = {V ∈ PD(d + 1) : β^T Σβ/σ² = η²}, where the relationship between V and β, Σ, σ² is given by (4). Then Θ_d(η²) is the class of linear models with signal-to-noise ratio β^T Σβ/σ² = η². The following proposition is proved in Appendix A.
Proposition 1. (a) If β̂ is linearly equivariant, then R_V(β̂) = R_{V_0}(β̂), where V_0 ∈ Θ_d(η²) is the parameter determined via (4) by Σ = I, σ² = 1, and β = ηu, and u ∈ R^d is any fixed unit vector.
Proposition 1 indicates that the signal-to-noise ratio β T Σ β/σ 2 plays an important role in the analysis of LiSc estimators. The signal-to-noise ratio's significance is further highlighted in our subsequent analysis of LiSc estimators, where we first focus on "oracle" estimators, which are derived under the assumption that the signal-to-noise ratio is known, and then study "adaptive" estimators, which rely on an estimate of the signal-to-noise ratio.
LiSc estimators have a great deal of structure. By taking advantage of this structure, we are able to identify optimal LiSc estimators in settings where d < n, d = n, and d > n, separately, assuming that η 2 = β T Σ β/σ 2 is known.
Theorem 1 is proved in Appendix A. Theorem 1 (b)-(c) implies that when d ≥ n, the optimal LiSc estimator is β̂_null = 0. Theorem 1 (c) implies further that if d > n, then β̂_null is the only LiSc estimator. Thus, we have a fairly definitive characterization of LiSc estimators for out-of-sample prediction when d ≥ n. This characterization is quite negative, which may raise questions about the appropriateness of LiSc estimators for high-dimensional data analysis. We defer such questions to Section 6.2, which contains a broader discussion of LiSc estimators and high-dimensional data analysis.
If d < n, then the optimal LiSc estimator is nontrivial and is given by β̂_opt = h_opt(η²) β̂_ols in Theorem 1 (a). Theorem 1 (a) should be compared with Section 2.1 of (Marchand, 1993), where a best equivariant estimator for the normal means problem is derived. Observe that evaluating h_opt(η²) seems challenging and h_opt(η²) depends on the signal-to-noise ratio, which is typically unknown; thus, implementing β̂_opt in practice is generally infeasible. Furthermore, Theorem 1 (a) does not provide any information about the magnitude of r(η²), which is important for understanding the performance limits of LiSc estimators. All of these issues are addressed in the next section.

James-Stein estimators
In this section we study James-Stein shrinkage estimators for β and show that their predictive risk is asymptotically equivalent to the optimal LiSc risk r(η²) in high-dimensional linear models with d/n → ρ ∈ [0, 1) and d → ∞. In Section 4.1, we identify an oracle James-Stein estimator that utilizes a non-random shrinkage parameter, which depends on the signal-to-noise ratio. We show that the oracle James-Stein estimator is asymptotically equivalent to the optimal LiSc estimator β̂_opt, which was derived in Theorem 1 (a). We also obtain an explicit formula for the predictive risk of the oracle James-Stein estimator; combined with our optimality results for the oracle James-Stein estimator, this easily yields an exact asymptotic expression for the minimal LiSc risk r(η²). In Section 4.2 we propose an adaptive James-Stein estimator that depends on a data-driven shrinkage parameter; this estimator more closely resembles the original James-Stein estimator (James and Stein, 1961), which also relies on a data-driven shrinkage parameter. We show that if d/n → ρ ∈ (0, 1), then the adaptive James-Stein estimator is asymptotically equivalent to the oracle estimator and, hence, the optimal LiSc estimator.

The oracle estimator
For d < n − 1 and a shrinkage parameter t ≥ 0, define the James-Stein estimator β̂_js(t) = [t/{t + d/(n − d − 1)}] β̂_ols. Notice that for fixed t ≥ 0, β̂_js(t) is LiSc. Additionally, β̂_js(0) = β̂_null = 0 and β̂_js(∞) = lim_{t→∞} β̂_js(t) = β̂_ols. A closed-form expression for the predictive risk of β̂_js(t) follows easily from properties of the inverse-Wishart distribution; it is then straightforward to optimize over t ≥ 0 and find the James-Stein estimator with minimal predictive risk. Details are given in the following proposition, which is proved in Appendix A.
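To illustrate the shrinkage trade-off, the sketch below evaluates the closed-form standardized excess risk of a scalar multiple c·β̂_ols, namely c²d/(n − d − 1) + (1 − c)²η² with c = t/{t + d/(n − d − 1)}. This closed form is an assumption of the sketch, chosen to be consistent with the boundary cases β̂_js(0) = 0 and β̂_js(∞) = β̂_ols and with the Bayes representation noted in the practical-implications discussion; the dimensions are illustrative:

```python
import numpy as np

n, d, eta2 = 100, 30, 2.0
tau = d / (n - d - 1)        # standardized excess risk of OLS

def risk_js(t):
    """Excess risk of the shrunken OLS estimator c * beta_ols with c = t/(t + tau):
    c^2 * tau + (1 - c)^2 * eta2 (the cross term vanishes since OLS is unbiased)."""
    c = t / (t + tau)
    return c ** 2 * tau + (1 - c) ** 2 * eta2

ts = np.linspace(0.01, 20.0, 4000)
t_best = ts[np.argmin(risk_js(ts))]
print(t_best)                                       # approx eta2 = 2.0: oracle takes t = eta^2
print(risk_js(eta2), eta2 * tau / (eta2 + tau))     # minimal risk, two equal expressions
```

Minimizing over the grid recovers t = η², matching the oracle choice β̂_js(η²).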
Theorem 2. Suppose that η² ≥ 0 and that 0 < d/n ≤ ρ_+ < 1 for some fixed constant ρ_+ ∈ R. Theorem 2 is proved in Appendix A. Combined with Proposition 1, Theorem 2 implies that if n → ∞ and sup d/n < 1, then the predictive risk of β̂_js(η²) is close to r(η²); in other words, the oracle James-Stein estimator is asymptotically optimal among LiSc estimators. This is made more precise in the following corollary, which also gives explicit asymptotic formulas for r(η²). The corollary follows immediately from Proposition 2 and Theorem 2.
The asymptotic risk formula R_0(η², ρ) ∼ r(η²) in Corollary 1, which is valid when d/n → 0, appears frequently in minimax analyses involving the Gaussian sequence model (Nussbaum, 1999; Pinsker, 1980). On the other hand, if d/n → ρ ∈ (0, 1), then r(η²) ∼ R_{>0}(η², d/n) > R_0(η², d/n). This reflects the increased difficulty of prediction problems where d/n is substantially larger than 0 and may be attributed to a degrees-of-freedom correction that accounts for the number of predictors in the linear model; in particular, the effect of this correction is non-vanishing when d/n → ρ ∈ (0, 1).
Observe that β̂_js adapts to the unknown signal-to-noise ratio η². Furthermore, β̂_js is an LiSc estimator. The next result implies that if n is large and d/n is bounded below 1, then the predictive risk of the adaptive James-Stein estimator is almost as small as that of the oracle James-Stein estimator.
Theorem 3. Suppose that 0 < d/n < ρ_+ < 1, where ρ_+ ∈ R is a fixed constant. Theorem 3 is proved in Appendix A. It follows from Theorem 3 that if d/n → ρ ∈ [0, 1), then the predictive risk of the adaptive James-Stein estimator converges uniformly to that of the oracle James-Stein estimator. Note that Theorem 3 is less informative when the signal-to-noise ratio is very small. Indeed, if η² = O(n^{−1/2}) and d/n → ρ ∈ [0, 1), then sup_{V ∈ Θ_d(η²)} R_V{β̂_js(η²)} = O(n^{−1/2}) has the same magnitude as the upper bound in Theorem 3. A more refined analysis of the adaptive James-Stein estimator when the signal-to-noise ratio is small may be of interest, but is not pursued further here.
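The excerpt does not display the data-driven estimate of η² used by β̂_js. A natural plug-in, consistent with the approximations ‖y − Xβ̂_ols‖² ≈ n(1 − ρ)σ² and ‖y‖² ≈ n(β^T Σβ + σ²) used in Section 5.3, is η̂² = ‖y‖²/(nσ̂²) − 1 with σ̂² = ‖y − Xβ̂_ols‖²/(n − d). The sketch below uses this hypothetical plug-in (not necessarily the paper's exact definition) with illustrative dimensions, and checks that it concentrates around η²:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 400, 100
beta = rng.standard_normal(d)
sigma2 = float(beta @ beta) / 1.5    # Sigma = I, so eta^2 = beta^T beta / sigma^2 = 1.5

def eta2_hat(y, X):
    """Hypothetical plug-in SNR estimate: ||y||^2 / (n * s2) - 1, where s2 is the
    usual unbiased residual variance estimate."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = resid @ resid / (n - d)
    return (y @ y) / (n * s2) - 1.0

vals = []
for _ in range(200):
    X = rng.standard_normal((n, d))
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    vals.append(eta2_hat(y, X))
print(np.mean(vals))   # approx eta^2 = 1.5
```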
The following corollary is an immediate consequence of Theorems 2-3.

OLS estimator
The predictive risk of the OLS estimator follows immediately from Proposition 2, provided d < n − 1. On the other hand, as discussed in detail above, the oracle James-Stein estimator β̂_js(η²) is generally not implementable, because the signal-to-noise ratio β^T Σβ/σ² = η² is typically unknown. The adaptive James-Stein estimator β̂_js does not depend on the signal-to-noise ratio and in Section 4.2 we argued that it is asymptotically equivalent to the oracle James-Stein estimator. The next result is valid in finite samples, for d, n sufficiently large, and relates the risk of the adaptive James-Stein estimator to that of the OLS estimator.
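For a concrete comparison, take the standardized excess risk of OLS to be d/(n − d − 1) and the oracle James-Stein excess risk to be η²·{d/(n − d − 1)}/{η² + d/(n − d − 1)} (assumed closed forms, following the Proposition 2 discussion). The improvement of James-Stein over OLS is then largest when the signal-to-noise ratio is small:

```python
import numpy as np

n, d = 100, 30
tau = d / (n - d - 1)                  # excess predictive risk of OLS (constant in eta^2)

eta2 = np.linspace(0.0, 10.0, 101)
risk_ols = np.full_like(eta2, tau)
risk_js = eta2 * tau / (eta2 + tau)    # oracle James-Stein excess risk

print(np.all(risk_js <= risk_ols))     # True: JS is never worse than OLS
print(risk_js[0], risk_js[-1])         # 0 at eta^2 = 0; approaches tau as eta^2 grows
```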
Theorem 4 follows directly from Theorem 2, Theorem 3, and (12). Theorem 4 implies that β̂_js is minimax over the entire parameter space PD(d + 1), when d, n are sufficiently large. We refer to Theorem 4 as a "semi-finite sample" result, because it addresses finite sample properties of β̂_js, but we are unable to specify precisely how large d, n must be in order for (13) to hold. Theorem 4 may be contrasted with more classical finite sample results on James-Stein estimators for the normal means problem (James and Stein, 1961) and linear models (Baranchik, 1973), which imply that certain James-Stein estimators are minimax under explicit conditions on the dimension; for instance, Baranchik (1973) shows that a different James-Stein estimator for β (which is discussed in more detail in Section 5.3) is minimax whenever d > 2 and n − d > 1. Similar results may be available for the adaptive James-Stein estimator β̂_js; however, this paper is more focused on high-dimensional optimality properties of β̂_js, like those discussed in Section 4, and alternative techniques are likely required to obtain more detailed finite sample results.
Note that β̂_r(Λ) is defined for all d, n. While generalized ridge estimators have been studied for many classes of Λ, by far the most common is Λ = λI, where λ > 0 is a scalar shrinkage factor subject to further specification. Note, however, that for fixed λ > 0, β̂_r(λI) is not LiSc. Furthermore, the following result suggests that β̂_r(λI) has significant drawbacks (in a minimax sense) when its performance is evaluated over linear models with fixed signal-to-noise ratio.
Proposition 3 is proved in Appendix A. It implies that the ridge estimator β̂_r(λI)'s worst-case out-of-sample prediction error is at least as bad as that of the trivial estimator β̂_null = 0, over linear models with fixed signal-to-noise ratio.
As an alternative to β̂_r(λI), we consider a generalized ridge estimator that depends on the predictor covariance matrix Cov(x_i) = Σ. For V ∈ Θ_d(η²) given by (4), define the oracle ridge estimator β̂_r{d/(nη²)Σ} (14). To motivate this estimator, we note that in (Dicker, 2013), the author considered ridge regression in high-dimensional linear models where Cov(x_i) = I and σ² = 1; the oracle ridge estimator (14) corresponds to oracle estimators derived in (Dicker, 2013), after transforming the data via (y, X) → (y, XΣ^{-1/2}). It follows that β̂_r{d/(nη²)Σ} shares many optimality properties with the ridge estimators studied in (Dicker, 2013). Adaptive ridge estimators may be derived by replacing η² and Σ in (14) with estimates, η̂² and Σ̂; if η̂² is consistent for η² and Σ̂ is operator norm-consistent for Σ, then the associated adaptive ridge estimator is typically asymptotically equivalent to the oracle ridge estimator. The oracle ridge estimator (14) satisfies an equivariance property for estimators depending on Cov(x_i) = Σ that extends the LiSc property given in Definition 1. Let β̂ = β̂(y, X, Σ) be an estimator for β that may depend on Cov(x_i) = Σ; we say that β̂ is LiSc if β̂(ty, XA, A^T ΣA) = t A^{-1} β̂(y, X, Σ) (15) for all d × d invertible matrices A ∈ GL(d) and all t > 0. Thus, an LiSc estimator's dependence on Σ must respect linear transformations of the predictors. Clearly, if β̂ does not depend on Cov(x_i) = Σ, then (15) reduces to the LiSc property given in Definition 1. Furthermore, the oracle ridge estimator (14) is LiSc. We emphasize that the oracle ridge estimator is LiSc, even for d > n; by contrast, Theorem 1 (c) implies that if d > n, then the null estimator is the only LiSc estimator that does not depend on Σ. Some of the basic risk properties of β̂_r{d/(nη²)Σ} are given in the following proposition.
Proposition 4 follows immediately from Proposition 1 and Corollary 1 in (Dicker, 2013). Proposition 4 (a) gives a simplified expression for the predictive risk of the oracle ridge estimator; in particular, it implies that R_V[β̂_r{d/(nη²)Σ}] is completely determined by the signal-to-noise ratio β^T Σβ/σ² = η². Proposition 4 (b) gives a closed-form expression for the asymptotic predictive risk of the oracle ridge estimator that is valid when d/n → ρ ∈ (0, ∞).
It is evident from Proposition 4 that the oracle ridge estimator has smaller risk than β̂_null = 0, even when d > n; that is, if V ∈ Θ_d(η²), then R_V[β̂_r{d/(nη²)Σ}] ≤ R_V(β̂_null), with equality if and only if η² = 0. Since the oracle ridge estimator is LiSc, it follows that if Cov(x_i) = Σ is known, then non-trivial equivariant out-of-sample prediction may be possible when d > n; on the other hand, Theorem 1 (c) implies that this is impossible if Cov(x_i) = Σ is unknown.
To conclude this subsection, we give a simple result which implies that the oracle ridge estimator dominates the oracle James-Stein estimator in finite samples. Again, this is not surprising because the ridge estimator leverages knowledge of Cov(x i ) = Σ , while the James-Stein estimator does not.
Proposition 5 follows from Jensen's inequality and is proved in Appendix A.

Other James-Stein estimators
Other James-Stein type estimators for β have been previously studied in the literature (Baranchik, 1973; Brown, 1990; Copas, 1983; Oman, 1984; Stein, 1960; Takada, 1979). Much of the previous work on James-Stein estimators for β focuses on identifying situations where the specified estimators have uniformly smaller predictive risk than the OLS estimator in finite samples. To our knowledge, the asymptotic risk of other James-Stein estimators for β in high-dimensional linear models (with d/n → ρ ∈ (0, 1)) has received relatively little attention. In this section, we derive the asymptotic predictive risk of a James-Stein estimator for β studied by Baranchik (1973). For d < n and constant c > 0, this estimator is defined by β̂_bar(c) = {1 − c‖y − Xβ̂_ols‖²/‖Xβ̂_ols‖²} β̂_ols. Baranchik (1973) proved that if 0 < c < 2(d − 2)/(n − d + 2), d ≥ 3, and n ≥ d + 2, then R_V{β̂_bar(c)} < R_V(β̂_ols). Other previously studied James-Stein estimators share strong similarities with β̂_bar(c). For instance, Copas (1983) considers precisely β̂_bar(c) and provides arguments for using various specific values of c, while β̂_bar(c) serves as a motivating example for a more general class of James-Stein estimators proposed by Takada (1979). Many of these estimators can be analyzed using techniques similar to those found in this subsection, and throughout the paper. Below, we show that β̂_bar(c) is generally sub-optimal in terms of predictive risk, and that it is out-performed (asymptotically) by the adaptive James-Stein estimator β̂_js defined in Section 4.2. On the other hand, we also show that β̂_bar(c) is asymptotically optimal for another, closely related loss function. The main idea behind our asymptotic analysis is that if V ∈ Θ_d(η²), d/n ≈ ρ ∈ (0, 1), and n is large, then ‖y − Xβ̂_ols‖² ≈ n(1 − ρ)σ² and ‖y‖² ≈ n(β^T Σβ + σ²).
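The two approximations at the end of this paragraph are straightforward to verify numerically. The sketch below uses Σ = I and illustrative dimensions (ρ = 0.4):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma2 = 500, 200, 2.0              # rho = d/n = 0.4; Sigma = I
beta = rng.standard_normal(d) / np.sqrt(d)
signal = float(beta @ beta)               # beta^T Sigma beta (approx 1 here)

X = rng.standard_normal((n, d))
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

print(resid @ resid / n, (1 - d / n) * sigma2)   # both approx (1 - rho) * sigma^2
print(y @ y / n, signal + sigma2)                # both approx beta^T Sigma beta + sigma^2
```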
By Proposition 2, the risk of the James-Stein estimator β̂_js(t) is minimized when t = η². Since, in general, t_bar(c) ≠ η², (18) suggests that β̂_bar(c) is suboptimal among James-Stein estimators. Observe that while the equality t_bar(c) = η² may hold for some specific values of c, ρ, and η², in order for it to hold in general, the constant c from β̂_bar(c) must vary with ρ and η². Some of the ideas from the previous discussion are made more rigorous in the next proposition. A detailed proof is omitted; however, part (a) is a straightforward calculation, part (b) may be proved similarly to Theorem 3, and part (c) follows directly from part (b), Corollary 1, and Theorem 3.
Part (a) of Proposition 6 addresses suboptimality of β̂_js{t_bar(c)} and part (b) provides justification for (18). Proposition 6 (c) implies that the predictive risk of the adaptive James-Stein estimator β̂_js is asymptotically smaller than that of β̂_bar(c).
It follows from Proposition 6 that β̂_bar(c) is suboptimal in terms of predictive risk, even among the class of James-Stein estimators β̂_js(t). This naturally leads to the question: are there other circumstances under which Baranchik's estimator β̂_bar(c) is asymptotically optimal among James-Stein estimators? The answer is affirmative. Consider a prediction problem where the predictors x_new associated with future outcomes y_new are required to be drawn from {x_1, ..., x_n}. If we assume that P{x_new = x_i | X} = n^{-1}, i = 1, ..., n, then a reasonable measure of predictive risk is the in-sample risk R̃_V(·) and, if n is large, then β̂_bar{d/(n − d)} ≈ β̂_js(t*_bar). Ultimately, one can show that if d/n → ρ ∈ (0, 1), then β̂_bar{d/(n − d)} is asymptotically optimal among James-Stein estimators, with respect to the risk function R̃_V(·). In fact, Copas (1983) reaches a similar conclusion, namely that one should take c ≈ d/(n − d) in β̂_bar(c), by essentially studying the risk function R̃_V(·). However, Copas (1983) does not take an asymptotic approach, nor does Copas meaningfully distinguish between the risk functions R_V(·) and R̃_V(·). Indeed, Copas asserts that differences between the two risk functions are "unimportant if n is large" (p. 314 of (Copas, 1983)). This is true if n is large and d is small; however, the results in this section imply that these differences are significant when both n and d are large.

Distributional assumptions
The normality condition (2) is restrictive. The extent to which this condition is necessary for the results in this paper is somewhat unclear; working to relax (2) may be an interesting area for future research. In this section, we discuss some of the issues that may arise in pursuing such work.
If the data are non-Gaussian, then the exact risk formula for James-Stein estimators given in Proposition 2 will not hold in general (among other things, Proposition 2 relies on the fact that E_I{tr(X^T X)^{-1}} = d/(n − d − 1)). Furthermore, in the absence of normality, it may be more challenging to obtain exact decision-theoretic results for LiSc estimators, such as Theorem 1 (a)-(b), which rely on orthogonal invariance of the multivariate normal distribution (note, however, that Theorem 1 (c) continues to hold regardless of distributional assumptions: β̂_null = 0 is the only LiSc estimator when d > n).
The challenges discussed in the previous paragraph may be complemented by more encouraging observations. For instance, Stein-type estimators are known to have desirable finite sample properties in related problems with non-Gaussian data (see, for example, the review article (Brandwein and Strawderman, 1990) on estimating a location parameter in the presence of orthogonally invariant noise); these results may be relevant for generalizing the results in this paper to settings where the data are non-normal. Additionally, it seems reasonable to expect that many of the asymptotic results in this paper (or close variants) will continue to hold under weaker distributional assumptions -even in settings where the underlying distributions are not orthogonally invariant. Basic numerical experiments conducted by the author seem to support this hypothesis when the entries of X are binary random variables (detailed results not reported here). Existing theoretical work on high-dimensional data analysis with non-Gaussian design matrices may also be useful for establishing extensions in this direction, e.g. (Bunea et al., 2007a).

Practical implications
We have argued that LiSc estimators are a reasonable class of estimators for settings where little is known about how the given predictor basis relates to the outcome of interest. In these settings, if d < n and no reliable estimate of Cov(x_i) is available, then the results in this paper suggest that James-Stein estimators may be an effective option for out-of-sample prediction; if Cov(x_i) is known (or if a norm-consistent estimator is available), then results in Section 5.2 and (Dicker, 2013) imply that ridge regression may be more appropriate. On the other hand, if β is known to be sparse (i.e. if the outcome has a sparse representation in the given predictor basis) or some other significant prior information about β is available, then sparse methods, such as lasso, or Bayesian methods may be indicated (it is worth pointing out that β̂_js(t) is a Bayes estimator under the prior distribution β|X ∼ N{0, ν²(X^T X)^{-1}}, where ν² = t(n − d − 1)σ²/d).
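The Bayes-estimator remark can be verified directly: with the prior β | X ∼ N{0, ν²(X^T X)^{-1}} and a Gaussian likelihood, the posterior mean is {ν²/(ν² + σ²)} β̂_ols, and ν² = t(n − d − 1)σ²/d makes this factor equal to t/{t + d/(n − d − 1)}. A minimal numerical check (arbitrary y, X, and illustrative constants):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma2, t = 60, 10, 1.3, 2.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

G = X.T @ X
b_ols = np.linalg.solve(G, X.T @ y)

# Posterior mean under beta | X ~ N(0, nu^2 (X^T X)^{-1}) and y | beta ~ N(X beta, sigma^2 I):
# (G/sigma^2 + G/nu^2)^{-1} X^T y / sigma^2 = nu^2/(nu^2 + sigma^2) * b_ols.
nu2 = t * (n - d - 1) * sigma2 / d
post_mean = np.linalg.solve(G / sigma2 + G / nu2, X.T @ y / sigma2)

shrink = t / (t + d / (n - d - 1))
print(np.allclose(post_mean, shrink * b_ols))   # True
```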
If d ≥ n, then Theorem 1 (b)-(c) imply that β̂_null = 0 is the optimal LiSc estimator. Thus, one can argue that if d ≥ n, then requiring an estimator to be LiSc is "asking too much." On the other hand, given the high-level discussion of LiSc estimators in the previous paragraph and in Section 3, an alternative interpretation of Theorem 1 (b)-(c) is as follows: if d ≥ n, Cov(x_i) is unknown, and little is known about how the predictor basis relates to the outcome, then non-trivial prediction may be impossible and, consequently, more information is needed for effective out-of-sample prediction. This interpretation may guide one's view towards identifying and understanding additional information about the model and the data that may help to improve performance in settings where d ≥ n.
Various types of information about the model (1) may potentially be leveraged to develop better prediction methods. The discussion of ridge regression in Section 5.2 implies that if Cov(x_i) is known, then an equivariant version of ridge regression (14) may perform well (significantly better than β̂_null) when d ≥ n. However, to obtain more substantial improvements in out-of-sample prediction error, it appears that additional information about β (such as sparsity) must be utilized. Indeed, the predictive risk of ridge regression is roughly of order d/n; if β is sparse, then the risk of lasso may be of order log(d)/n (Bunea et al., 2007b). Slightly recasting these observations, we conclude that while ridge regression outperforms β̂_null = 0 when d/n → ρ ∈ (0, ∞), additional information about β must be utilized in order to obtain vanishing risk in these asymptotic settings and, a fortiori, in settings where d/n → ∞.

Acknowledgements
The author thanks Patrick O. Perry for valuable comments on an earlier version of this paper. The author is grateful to the Editor, Associate Editor, and Referees for many comments that helped to improve the paper.
Since the above equality must hold for any (d − n) × n matrix C and any D ∈ GL(d − n), it follows that β̂(y, X) = 0. Part (c) follows immediately.
Proof of Proposition 2. Note that (X^T X)^{-1} follows an inverse-Wishart distribution and E_I(X^T X)^{-1} = (n − d − 1)^{-1} I (see Chapter 3 of (Muirhead, 1982), for instance). It follows that if V ∈ Θ_d(η²), then (8) holds. The rest of the proposition follows by basic calculus.
We first bound E_V(L_1) and E_V(L_2).

L.H. Dicker
Thus, Lemma B5 implies the required bound. To bound E_V(L_3), we again apply Lemma B5. The theorem follows by combining these bounds with (22)-(23).
Proof of Theorem 3. Suppose that V ∈ Θ_d(η²) and consider the following decomposition of the absolute difference between the predictive risks of the oracle and adaptive James-Stein estimators. Similar to the proof of Theorem 2, we bound each term in the decomposition.
Repeatedly applying the Cauchy-Schwarz inequality and Lemmas B2 and B4 yields the required bounds. To bound |E_V(J_3)|, we use Stein's lemma (integration by parts). Using Lemmas B2-B4, one easily checks the remaining estimates. The theorem follows by combining these with (24)-(25).
The proposition follows.
Lemma B2. Suppose that κ ≥ 1 is a fixed constant and let η̂²_+ be as in (11). Suppose further that 0 < d/n ≤ ρ_+ < 1 for some fixed constant ρ_+ ∈ R. Proof. The result follows from Lemma B1.