Estimation of linear projections of non-sparse coefficients in high-dimensional regression

Abstract: In this work we study estimation of signals when the number of parameters is much larger than the number of observations. A large body of literature assumes for these kinds of problems a sparse structure where most of the parameters are zero or close to zero. When this assumption does not hold, one can focus on low-dimensional functions of the parameter vector. In this work we study one-dimensional linear projections. Specifically, in the context of high-dimensional linear regression, the parameter of interest is β and we study estimation of $a^T\beta$. We show that $a^T\hat\beta$, where $\hat\beta$ is the least squares estimator (using the pseudo-inverse when p > n), is minimax and admissible. Thus, for linear projections no regularization or shrinkage is needed. This estimator is easy to analyze and confidence intervals can be constructed. We study a high-dimensional dataset from brain imaging where it is shown that the signal is weak, non-sparse and significantly different from zero.


1. Introduction
This research emerged from analysis of a high-dimensional dataset obtained from brain imaging. The data belongs to a study of cortical thickness of adults who had a diagnosis of attention deficit/hyperactivity disorder (ADHD) as children (Proal et al., 2011). The dataset consists of cortical thickness measurements for about 80000 cortical voxels, obtained from magnetic resonance imaging (MRI) scans, as well as demographic and behavioral measurements, for each of 139 individuals. In this study, it had been noticed by Reiss et al. (2012) that z-scores corresponding to the voxelwise relationship between cortical thickness and ADHD diagnosis did not follow the theoretical standard normal distribution. Instead, the distribution of z-scores exhibited a substantial shift away from zero, indicating a possible widespread cortical thinning over the brain for individuals with ADHD. It is unclear, however, whether those results could have been caused by correlation between voxels rather than by a real relationship with clinical diagnosis (Azriel and Schwartzman, 2015). Figure 1 shows histograms of (a) the actual z-scores (standardized to be distributed N(0, 1) under the global null hypothesis) and (b,c) the z-scores in simulated data sets; see Section 2.2 below for exact definitions. In (b) the response variable is simulated independently (no signal) whereas in (c) the regression coefficients are not zero (simulated from a normal distribution). In all cases, there is a clear departure from the global null, but in (b) the departure is caused only by the correlation structure and in (c) it can be attributed to a true signal. In case (a) it is not clear if there is a signal or not. The fact that the correlation structure can cause shifted z-scores can also be seen in the brain maps in Figure 2; (a) corresponds to Figure 1(a), showing the observed widespread positive z-values, while (b) corresponds to Figure 1(b), showing the simulated widespread negative z-values. Thus, as originally pointed out by Efron (2007) (and later made more precise by Schwartzman (2008) and Azriel and Schwartzman (2015)), even when the theoretical null model is correct, the empirical distribution of the test statistics can be different from the theoretical null distribution simply because of the correlation structure. In this work, we aim at estimating such departures and detecting whether the departure is due to correlation, or to true signal.
In our motivating dataset, there are two groups: 59 adults in whom ADHD had been established in childhood, along with 80 controls. Previous works (Proal et al., 2011; Reiss et al., 2012) considered the resulting multiple hypotheses problem (voxel-wise) and used the FDR criterion to detect areas in the cortical region where the null is rejected. As discussed below, this approach has low power to detect signals that are weak and non-sparse. As an alternative perspective, we set up the problem here as a regression of the response on the cortical thickness measurements as predictors, and attempt to find linear projections of the regression coefficients that will indicate a spatial distribution of the signal. In Azriel and Schwartzman (2015) we studied the distribution of the z-scores under the global null hypothesis accounting for the correlation structure. This allows us to test the global null, as we show here (in Section 5.2), but not to estimate the signal, which is the interest of the current work.
Consider the linear regression model $Y = X\beta + \varepsilon$, where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ is i.i.d. with mean zero and variance $\sigma^2$ and X is a fixed n × p matrix. While in many cases, including the brain imaging dataset mentioned above, X is in fact random, in this work we adopt the conditionality principle (Birnbaum, 1962) and treat it as fixed since it is ancillary to the parameter of interest. Here we focus on the high-dimensional case p > n. In this context, the Lasso estimator suggested by Tibshirani (1996) has gained much popularity and many extensions have been suggested. That line of research relies on sparsity assumptions, where most of the parameters are assumed to be zero or close to zero. Those assumptions are violated in many datasets, including the one that motivated our study, as demonstrated by the histogram of Figure 1. Dicker (2014) and Janson et al. (2017) studied inference on the signal-to-noise ratio, and on $\sigma^2$, in the non-sparse case. For multiple comparisons, the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995) is guaranteed to choose the non-zero parameters with a controlled error rate, but in our dataset, no significant voxels (at the level α = 0.05) are found by BH. Indeed, the BH procedure is highly variable under strong dependence (Owen, 2005; Schwartzman and Lin, 2011) and has low power when there are many non-zero parameters that are all still small.
To answer the question of whether a non-sparse signal is present, we aim at estimating $\theta = a^T\beta$ (when it is identifiable) for a predetermined vector a. When a is sparse, θ corresponds to a small subset of β that is of interest, while for non-sparse a, θ is a global measure of the signal. We show that the estimator $\hat\theta = a^T(X^TX)^-X^TY$, where $(X^TX)^-$ denotes the Moore-Penrose pseudo-inverse, is unbiased, admissible and minimax. Moreover, its distribution is easily derived and therefore one can construct confidence intervals and perform hypothesis tests. We also study the asymptotic behavior of this estimator and show that it is consistent in certain cases. Even when it is not consistent, a confidence interval can be constructed; it just does not shrink to zero. Our estimator is a linear function of the z-scores, which are defined in Section 2. We first study estimation of linear combinations of the expectation vector of the z-scores in Section 3 and then return to the problem of estimating θ in Section 4. We analyze the motivating dataset in Section 5. Section 6 concludes with final remarks. The proofs of the theoretical results and additional results are given in the appendices.
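For illustration, a minimal sketch of the natural estimator in Python/NumPy (the function name, sizes and toy data below are ours, not the brain-imaging data):

```python
import numpy as np

def natural_estimator(X, Y, a):
    """Natural estimator of theta = a^T beta: a^T (X^T X)^- X^T Y,
    with (X^T X)^- the Moore-Penrose pseudo-inverse."""
    return a @ np.linalg.pinv(X.T @ X) @ (X.T @ Y)

# Toy example with p > n (hypothetical sizes and data)
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.normal(size=(n, p))
beta = rng.normal(size=p) / np.sqrt(p)   # non-sparse coefficient vector
Y = X @ beta + rng.normal(size=n)
a = np.ones(p) / p                       # global-average projection (its identifiable part is what gets estimated)
print(natural_estimator(X, Y, a))
```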

2. The use of z-scores
In high-dimensional regression problems, it is common to reduce the data to univariate z-scores, computed from univariate regressions of the outcome on each predictor. In our situation, univariate z-scores are useful not only as a data-analysis tool but also as a theoretical device for deriving the desired estimator.

2.1. Model
The brain imaging dataset we study consists of n = 139 subjects with information on p = 81924 vertices per subject. Let $Y_i$ denote the behavioral assessment of the i-th subject and let $X_i^{(j)}$ be the cortical thickness at the j-th vertex of the i-th subject. We consider the regression model (1), $Y = X\beta + \varepsilon$, where the (i, j)-th entry of X is $X_i^{(j)}$, β is a p-dimensional vector of unknown coefficients, and the ε's are i.i.d. with mean zero and variance $\sigma^2$. Here both the response and the covariates are assumed centered so that there is no need for an intercept in the model, and so the matrix X characterizes the substantive predictors. For simplicity, we ignore the weak dependence induced by the centering of the response.

2.2. z-scores
Consider the simple (univariate) linear regression estimates $\hat\beta^{(j)} = \sum_{i=1}^n X_i^{(j)}Y_i / s_{jj}$, where $s_{jk} = (X^TX)_{jk} = \sum_{i=1}^n X_i^{(j)}X_i^{(k)}$, and define the z-scores $\tilde Z_j = \sqrt{s_{jj}}\,\hat\beta^{(j)}/\sigma$. Then we have $E(\hat\beta^{(j)}) = \beta^{(j)} + \sum_{k\neq j} s_{jk}\beta^{(k)}/s_{jj}$ and $Var(\hat\beta^{(j)}) = \sigma^2/s_{jj}$. Letting D be the diagonal matrix with $D_{jj} = \sqrt{s_{jj}}$, we can rewrite in matrix form
$$\tilde Z = D\hat\beta/\sigma = D^{-1}X^TY/\sigma, \qquad (2)$$
with respective expectation and covariance matrix
$$\tilde\mu := E(\tilde Z) = D^{-1}X^TX\beta/\sigma, \qquad \tilde\Sigma := \mathrm{Cov}(\tilde Z) = D^{-1}X^TXD^{-1}. \qquad (3)$$
Let $A^-$ denote the pseudo-inverse of a matrix A. For any vector a such that $\theta = a^T\beta$ is identifiable, we show in Proposition 3 below that there exists a vector $\tilde a = \sigma D(X^TX)^-a$ such that $a^T\beta = \tilde a^T\tilde\mu$, mapping the estimation of linear functions of β to linear functions of $\tilde\mu$. Conversely, for any $\tilde a$, we can define $a = D\tilde\Sigma\tilde a/\sigma$ such that $\tilde a^T\tilde\mu = a^T\beta$, mapping linear functions of $\tilde\mu$ to linear functions of β.
Notice that the pairwise correlations between the z-scores are the pairwise correlations between the cortical thickness measurements at the different vertices. Because the regression is conditional on the vertexwise measurements, we take the pairwise correlations as fixed and known. Note too that the definition of $\tilde Z$ involves σ, which is unknown. Estimation of $\sigma^2$ is discussed in Section 5; for now it is treated as a known constant.
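A small sketch of how the z-scores and their correlation matrix can be computed (assuming σ is known, as in the text; the helper names are ours):

```python
import numpy as np

def z_scores(X, Y, sigma):
    """Univariate z-scores: Z_j = X_j^T Y / (sigma * sqrt(s_jj)) with s_jj = (X^T X)_jj,
    i.e. Z = D^{-1} X^T Y / sigma in matrix form."""
    d = np.sqrt(np.sum(X ** 2, axis=0))   # D_jj = sqrt(s_jj)
    return (X.T @ Y) / (sigma * d)

def z_correlation(X):
    """Correlation matrix of the z-scores, D^{-1} X^T X D^{-1},
    equal to the pairwise correlations between the (centered) columns of X."""
    d = np.sqrt(np.sum(X ** 2, axis=0))
    return (X.T @ X) / np.outer(d, d)
```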

2.3. The natural estimator
Writing $\theta = \tilde a^T\tilde\mu$, the natural estimator is $\hat\theta = \tilde a^T\tilde Z$. As simple as this estimator is, it turns out to be not a bad one: it is unbiased, and we show in Section 3 that, under Model (1) when the ε's are assumed normal, it is minimax and admissible.
If we write this estimator in terms of the representation $\theta = a^T\beta$, we obtain the elegant form $\hat\theta = a^T(X^TX)^-X^TY$ (Section 4.2). In order to investigate the properties of this estimator we first study the general problem of estimating $\theta = \tilde a^T\mu$ using Z, a vector with mean μ and finite variance. This is done in the next section. We return to the regression problem in Section 4.

3. Properties of the natural estimator
In this section we study the general model where Z is a vector with mean μ and covariance matrix Σ. Our interest is in estimation of $\theta = \tilde a^T\mu$ and we study the properties of the natural estimator $\tilde a^TZ$. The basic setting and notation are described in Section 3.1. Section 3.2 shows that the natural estimator is minimax and admissible.

3.1. Dimension reduction
Suppose that $Z := (Z_1, \ldots, Z_p)$ is a vector with mean μ and covariance matrix Σ.
Our interest is in estimating $\theta = \tilde a^T\mu$ and the natural estimator is $\hat\theta = \tilde a^TZ$. We suppress the dependence on p in the notation when there is no confusion. Let Γ be the matrix whose columns are the eigenvectors of Σ, denoted by $\gamma_1, \ldots, \gamma_p$, and let Λ be the diagonal matrix with the corresponding eigenvalues, denoted by $\lambda_1 \ge \cdots \ge \lambda_p$, on the diagonal. When the rank of Σ, denoted by r, is smaller than p (as in our motivating dataset), Σ has r positive eigenvalues and the remaining p − r eigenvalues are 0. In this case, we define Λ to be the r × r diagonal matrix with the r positive eigenvalues on the diagonal and Γ to be the p × r matrix with the r corresponding eigenvectors as columns. We have $\Sigma = \Gamma\Lambda\Gamma^T$.
It is convenient to work with the dimension-reducing transformation $W := \Lambda^{-1/2}\Gamma^TZ$, with $\eta := \Lambda^{-1/2}\Gamma^T\mu$, so that $E(W) = \eta$ and $\mathrm{Cov}(W) = I_r$. Hence, W is an uncorrelated, low-dimensional representation of Z. Similar to principal component analysis (PCA), W and η can be thought of as the expressions of Z and μ in the canonical coordinate system defined by the covariance matrix Σ. When Σ is of full rank, we can write in this coordinate system $\theta = b^T\eta$, where $b = \Lambda^{1/2}\Gamma^T\tilde a$. This can be generalized to the degenerate case r < p as stated in the following proposition, which also shows that there is a one-to-one correspondence between Z (of length p) and W (of length r). That is, even though Z belongs to $\mathbb{R}^p$, it lies in an r-dimensional subspace. This implies that θ can be estimated using either Z or W, and the two ways are equivalent.
Proposition 1.
(I) Let $\Gamma_\perp$ be the p × (p − r) matrix whose columns are the eigenvectors orthogonal to Γ. Then $\Gamma_\perp^TZ = \Gamma_\perp^T\mu$ is a deterministic known vector, and therefore we can assume without loss of generality that it equals 0.
(II) If $\Gamma_\perp^T\mu = 0$, then the same relation as in the full-rank case holds, namely $\theta = b^T\eta$ with $b = \Lambda^{1/2}\Gamma^T\tilde a$; moreover, the dimension-reduction transformation is one-to-one and we can write $Z = \Gamma\Lambda^{1/2}W$.
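In code, the dimension-reducing transformation and its inverse can be written as follows (a sketch; the function name and the tolerance used to decide which eigenvalues are positive are our choices):

```python
import numpy as np

def canonical_coordinates(Z, Sigma, tol=1e-10):
    """W = Lambda^{-1/2} Gamma^T Z, where Gamma/Lambda hold the eigenvectors and
    positive eigenvalues of Sigma.  Cov(W) = I_r, and on the signal subspace
    Z can be recovered as Gamma Lambda^{1/2} W."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    keep = eigvals > tol
    Lam, Gam = eigvals[keep], eigvecs[:, keep]
    W = (Gam.T @ Z) / np.sqrt(Lam)
    Z_back = Gam @ (np.sqrt(Lam) * W)   # equals Z when Gamma_perp^T Z = 0
    return W, Z_back
```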

3.2. Minimaxity and admissibility of the natural estimator
The natural estimator is $\hat\theta = \tilde a^TZ$; it is unbiased and its variance, $\tilde a^T\Sigma\tilde a$, does not depend on θ. Also, when Z is normal, as shown in the proof of the result below, $\hat\theta$ is a limit of Bayes rules. These properties imply that $\hat\theta$ is minimax and admissible. The result is stated below, and the formal proof is given in Appendix D.
In Appendix A an asymptotic analysis of the variance of the natural estimator is presented. Specifically, it is shown that under some regularity conditions, when the natural estimator is inconsistent, so is every other estimator. Yet, even if the variance of $\hat\theta$ does not go to zero, a confidence interval can still be constructed; it just does not shrink.

4. High-dimensional regression
In this section we study estimation of linear projections of β, which is meaningful only when those linear projections are identifiable. Section 4.1 provides a sufficient and necessary condition for identifiability of θ. The results of Section 3.2 are applied to the regression setting in Section 4.2. The natural estimator is compared to the ridge regression estimate in Section 4.3 and a simulation study is reported in Section 4.4. Section 4.5 introduces an asymptotic analysis of the estimator's variance.

4.1. Identifiability
Recall the regression setting described in Section 2. The full vector β is not identifiable: if $X\beta = X\beta'$ then β and β′ are indistinguishable. Here we are interested in estimation of $\theta = a^T\beta$. Proposition 2 below provides a necessary and sufficient condition for identifiability of θ. To state the result, we use the eigendecomposition notation of Section 3.1 for $\Sigma = X^TX/n = \Gamma\Lambda\Gamma^T$.
The proposition indicates that θ is identifiable iff $\Gamma_\perp^Ta = 0$, i.e., iff a belongs to the subspace spanned by the columns of Γ. Another way to understand Proposition 2 is that it specifies the part of θ that is estimable.
Writing the orthogonal decomposition $a = \Gamma\alpha + \Gamma_\perp\alpha_\perp$, we see that $\theta = \alpha^T\Gamma^T\beta + \alpha_\perp^T\Gamma_\perp^T\beta$: the first term involves the portion $\Gamma^T\beta$ of β, its projection onto the subspace spanned by the columns of Γ (the row space of X), and the second the portion orthogonal to it. The former is accessible through the observations (linear model) but the latter is not and therefore is not estimable. We will see this phenomenon in the data analysis (Section 5.4).
Proposition 2 implies that $\theta = a^T\beta$ is identifiable when $a \in \mathbb{R}^p$ belongs to a subspace of dimension r. Since r ≤ n, when p > n, for "most" a's $\theta = a^T\beta$ is not identifiable. In practice, there are two ways to proceed. First, for a given a one can consider instead of a the closest identifiable vector, i.e., $\arg\min_{\tilde a\,:\,\tilde a^T\beta \text{ identifiable}} \|a - \tilde a\|$, which by standard linear algebra is $\Gamma\Gamma^Ta$; see the sketch below. Second, one can consider all possible directions $\gamma_1, \ldots, \gamma_r$ (the columns of Γ) and adjust for multiplicity. These two ways are demonstrated in our motivating dataset, as described in Section 5.4.
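The projection onto the identifiable subspace is straightforward to compute from the thin SVD of X; a sketch (function name and tolerance are ours):

```python
import numpy as np

def closest_identifiable(a, X, tol=1e-10):
    """Return Gamma Gamma^T a, the closest identifiable vector to a in the L2 sense.
    The right singular vectors of X with positive singular values span the same
    subspace as the eigenvectors of X^T X / n with positive eigenvalues."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[s > tol].T          # p x r matrix whose columns span the identifiable subspace
    return V @ (V.T @ a)
```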

4.2. Estimation of θ
We are interested in estimating $\theta = a^T\beta$. Assuming that θ is identifiable, Proposition 2 implies that $a = \Gamma\alpha$ for some α. The natural estimator of θ is given by the following result. Recall the definitions of the z-score vector $\tilde Z$ and its expectation $\tilde\mu$ given by (2) and (3).
Interestingly, the form of the natural estimator is very similar to that of the usual linear regression estimate in low dimensions, except that here the pseudo-inverse is used because the matrix $X^TX$ is not invertible. The form of the variance is a generalization of the variance of $a^T\hat\beta$ when $\hat\beta$ is the least squares estimate. Regarding the quality of the natural estimator, Theorem 1 applies and we have the following corollary.
The natural estimator $\hat\theta = a^T(X^TX)^-X^TY$ is minimax and admissible among all estimators that depend on Y through $X^TY$.

4.3. Comparison to Ridge
To illustrate the admissibility and minimaxity of the natural estimator, consider the ridge regression estimate $\hat\beta^R_\lambda = (X^TX + \lambda I)^{-1}X^TY$ for a tuning parameter λ. The corresponding estimate of θ is $a^T\hat\beta^R_\lambda$; it coincides with the natural estimator when λ = 0. From the mean square error of $\Gamma^T\hat\beta^R_\lambda$ (Farebrother, 1976), the mean square error of $a^T\hat\beta^R_\lambda = \alpha^T\Gamma^T\hat\beta^R_\lambda$ is
$$\sum_{j=1}^{r}\frac{\alpha_j^2\, n\lambda_j\sigma^2}{(n\lambda_j+\lambda)^2} \;+\; \lambda^2\Big(\sum_{j=1}^{r}\frac{\alpha_j\,\gamma_j^T\beta}{n\lambda_j+\lambda}\Big)^2;$$
the first summand is the variance and the second is the squared bias. The latter is not bounded in β and therefore $a^T\hat\beta^R_0$, which is the natural estimator, is minimax. When λ is sufficiently small, the mean square error of $a^T\hat\beta^R_\lambda$ is smaller than that of $a^T\hat\beta^R_0$ (Farebrother, 1976). However, the optimal λ depends on unknown quantities, and hence the natural estimator $a^T\hat\beta^R_0$ is still admissible.

4.4. Comparison to Ridge and Lasso via simulations
To illustrate the theoretical results, we compared the natural estimator to the Lasso and Ridge estimates in a simulation study. For the Lasso and Ridge estimates, the tuning parameter λ was chosen using cross-validation (computed by the 'parcor' package in R). We chose n = 100 and p = 500. Each row of X was sampled from a multivariate normal distribution with mean 0 and exchangeable correlation structure with ρ = 0.7 (see Appendix B for exact definitions); X is fixed across all simulations. For each simulated data set we sampled $\varepsilon_1, \ldots, \varepsilon_n \sim$ i.i.d. N(0, 1). We repeated the above procedure 1000 times and considered model (1) under three scenarios:
• Full β: $\beta_j = 1/\sqrt{500}$ for j = 1, ..., 500;
• Half-full β: $\beta_j = \sqrt{2}/\sqrt{500}$ for j = 1, ..., 250 and $\beta_j = 0$ for j = 251, ..., 500;
• Sparse β: only a few non-zero entries.
Under all three scenarios ‖β‖ = 1. For all scenarios we considered estimation of $\theta = a^T\beta$ for $a = \Gamma\Gamma^T1$, which is the closest vector (in the $L_2$ sense) to 1 spanned by the eigenvectors corresponding to positive eigenvalues, yielding an identifiable θ.
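The sketch below reproduces one scenario (full β) with a reduced number of replications, using scikit-learn's cross-validated Lasso and Ridge in place of the 'parcor' R package mentioned in the text; all numerical choices other than n, p, ρ, and β are ours.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
n, p, rho, n_rep = 100, 500, 0.7, 100        # n_rep reduced from 1000 for speed

# Fixed design: rows i.i.d. N(0, C), C = (1 - rho) I + rho 11^T (exchangeable correlation)
C_sqrt = np.sqrt(1 - rho) * np.eye(p) + \
         (np.sqrt(1 - rho + rho * p) - np.sqrt(1 - rho)) / p * np.ones((p, p))
X = rng.normal(size=(n, p)) @ C_sqrt

# Identifiable target direction a = Gamma Gamma^T 1 (projection of 1 onto the row space of X)
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[s > 1e-10].T
a = V @ (V.T @ np.ones(p))

beta = np.ones(p) / np.sqrt(p)               # "full beta" scenario, ||beta|| = 1
theta = a @ beta

pinv_part = np.linalg.pinv(X.T @ X) @ X.T    # reused across replications
err_nat, err_ridge, err_lasso = [], [], []
for _ in range(n_rep):
    Y = X @ beta + rng.normal(size=n)
    err_nat.append((a @ (pinv_part @ Y) - theta) ** 2)
    err_ridge.append((a @ RidgeCV(alphas=np.logspace(-2, 3, 20)).fit(X, Y).coef_ - theta) ** 2)
    err_lasso.append((a @ LassoCV(cv=5, max_iter=5000).fit(X, Y).coef_ - theta) ** 2)

print("MSE natural:", np.mean(err_nat),
      " ridge:", np.mean(err_ridge),
      " lasso:", np.mean(err_lasso))
```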
The simulation average and standard deviation of the mean square error (MSE) are reported in Table 1. It is shown that the risk of the natural estimator is constant across all scenarios. For the full and half-full β scenarios, the risk of the natural estimator is smaller than both Lasso and Ridge. Under the sparse scenario the risk of the natural estimator and ridge is about the same. Lasso under-performs in all three scenarios but it has the advantage of identifying the non-zero entries of β in the sparse scenario. Overall, the results agree with our theoretical findings: the natural estimator estimates θ well in a minimax sense.

4.5. Consistency
We now study the asymptotic behavior of $\hat\theta$. For the asymptotic analysis, we assume the following regularity conditions (recall that here $\Sigma = X^TX/n$):
The diagonal elements of Σ are bounded from above. (4)
The positive eigenvalues of Σ are bounded from below. (5)
$(\gamma_j^T\beta)^2/(\gamma_{j'}^T\beta)^2$ is bounded for every $j, j' \in \{1, \ldots, r\}$. (6)
The signal-to-noise ratio is of the order of a constant. (7)
Condition (4) is natural and is also assumed in the asymptotic analysis in Appendix A. Condition (5) guarantees that Σ does not become degenerate on the subspace spanned by the columns of Γ (which is the relevant subspace; see Proposition 2). This means that the columns of X are bounded away from being linearly dependent. By (6), β is "equally spread out" over all the eigenvectors that correspond to the positive eigenvalues, i.e., β is not concentrated on only part of the subspace. This means that β is not sparse in the coordinate system defined by Γ (although it could be sparse in the original coordinate system). When (7) holds true, the signal-to-noise ratio is of the order of a constant. Conditions (4) and (5) can be easily verified since Σ is fixed. In our motivating dataset, the maximal diagonal element of Σ is 0.71 and the minimal positive eigenvalue is 2.80. In contrast, conditions (6) and (7) depend on unknown parameters and cannot be checked directly. While condition (7) is natural, condition (6) is less so. However, Proposition 4 below gives an asymptotic approximation to $Var(\hat\theta)$, which depends on these conditions. Thus, in practice, one can compute $Var(\hat\theta)$ exactly and compare it to the approximation. We do so in the data analysis in Section 5.4 and find the fit between the theoretical and actual variances to be good.
The following proposition states the order of magnitude of the variance of $\hat\theta$ for two cases of a: the global average and a single entry.
The variance of $\hat\theta$ in the global average case depends on the asymptotic behavior of p and of $\sum_{j=1}^r 1/\lambda_j$. For high-dimensional X (p > n), we have r = n. In this setting, suppose that there are K eigenvalues of order p (spike model) and all the remaining eigenvalues are of the same order. Because $\sum_{j>K}\lambda_j = O(p)$, that order must be p/n, and then $\sum_{j=1}^n 1/\lambda_j = O(n^2/p)$. As a consequence, the variance is O(1/n) and $\hat\theta$ is consistent. Also, when X is low-dimensional (p = o(n)), then r = p. Because all $\lambda_j$ are bounded from below (Condition (5)), $\sum_{j=1}^p 1/\lambda_j \le O(p)$ and therefore $\hat\theta$ is consistent. For the single entry case, consistency depends on the eigenvalue $\lambda_j$ and the rate of growth of p. For high-dimensional X (p > n), when $\lambda_j$ is "large", i.e., of order p, then $\hat\theta$ is consistent. When X is low-dimensional (p = o(n)), $\hat\theta$ is consistent (using Condition (5), that $\lambda_j$ is bounded from below).
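For concreteness, the order-of-magnitude calculation for the global-average case under the spike assumption (using the rate $\frac{p}{n^3}\sum_j 1/\lambda_j$ from Proposition 4) can be written out as
$$\sum_{j=1}^{n}\frac{1}{\lambda_j} \;=\; \sum_{j\le K}\frac{1}{\lambda_j} + \sum_{j>K}\frac{1}{\lambda_j} \;=\; O\!\Big(\frac{K}{p}\Big) + (n-K)\,O\!\Big(\frac{n}{p}\Big) \;=\; O\!\Big(\frac{n^2}{p}\Big), \qquad \mathrm{Var}(\hat\theta) \;\asymp\; \frac{p}{n^3}\sum_{j=1}^{n}\frac{1}{\lambda_j} \;=\; O\!\Big(\frac{1}{n}\Big).$$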
As before, regardless of the consistency of $\hat\theta$, a confidence interval for θ can still be constructed; it just may not shrink.

5. Analysis of the data
We now return to the brain imaging dataset, which was described in Section 2.

5.1. Estimation of σ²
In order to calculate the z-scores, $\sigma^2$ needs to be estimated. Estimation of $\sigma^2$ when β is non-sparse is the topic of two recent papers (Dicker, 2014; Janson et al., 2017). They work under the framework of random X and require rather restrictive assumptions on the distribution of X. Here we use the simple estimate $\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar Y)^2$, the sample variance of Y. The latter is an upper bound since $Var(Y) \ge Var(Y|X) = \sigma^2$. Therefore, the resulting confidence intervals are conservative. The upper bound is tight in the null case when β = 0.
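A one-line sketch of this conservative estimate (the function name is ours; ddof=1 gives the usual unbiased sample variance):

```python
import numpy as np

def sigma2_upper(Y):
    """Sample variance of Y; over-estimates sigma^2 unless beta = 0,
    since Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) >= sigma^2."""
    return np.var(Y, ddof=1)
```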

5.2. Testing for the global null in the brain imaging dataset
In Azriel and Schwartzman (2015) we studied the empirical cumulative distribution of a large number of correlated normals. We also analyzed the above dataset assuming the global null holds true, i.e., that β = 0.
Consider the z-scores (denoted by $\tilde Z$) and their correlation matrix (denoted by $\tilde\Sigma$), defined in (2) and (3). Roughly speaking, we showed that if K is the number of "large" eigenvalues of $\tilde\Sigma$, i.e., of order p, then there exists a vector $\xi = (\xi_1, \ldots, \xi_K)^T$, where $\xi_1, \ldots, \xi_K \sim$ i.i.d. N(0, 1), such that $\tilde Z\,|\,\xi$ is weakly correlated (i.e., the Frobenius norm of the conditional correlation matrix divided by p converges to zero). Let $m_i := E(\tilde Z_i|\xi)$, i = 1, ..., p; then, under the global null, $\sum_{i=1}^p m_i^2 = \sum_{k=1}^K\tilde\lambda_k\xi_k^2$, where $\tilde\lambda_1 \ge \cdots \ge \tilde\lambda_K$ are the K largest eigenvalues of $\tilde\Sigma$. In the above dataset we found that K = 2 works quite well. When K = 2 we have $\sum_{i=1}^p m_i^2 = 132018.9$. The two large eigenvalues are $\tilde\lambda_1 = 23995.7$ and $\tilde\lambda_2 = 6959.6$. Thus, for $T = \tilde\lambda_1\xi_1^2 + \tilde\lambda_2\xi_2^2$ we have $P(T > 132018.9) = 0.023$ (computed by simulation), i.e., the p-value for the above test is 0.023, indicating that the global null is rejected at the α = 0.05 level. We repeated the above computation with K = 3, ..., 10 and the resulting p-values were all smaller than 0.03. This result seems to imply that β ≠ 0. Since β is a large vector, it cannot be estimated. Instead, below we study its low-dimensional projections.
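The Monte Carlo p-value of this test can be computed as in the sketch below (function name ours; the eigenvalues and observed statistic are the values reported above for K = 2):

```python
import numpy as np

def global_null_pvalue(lambdas, observed, n_sim=100_000, seed=0):
    """P(T > observed) for T = sum_k lambda_k * xi_k^2 with xi_k i.i.d. N(0,1),
    approximated by simulation."""
    rng = np.random.default_rng(seed)
    xi = rng.normal(size=(n_sim, len(lambdas)))
    T = (xi ** 2) @ np.asarray(lambdas)
    return np.mean(T > observed)

print(global_null_pvalue([23995.7, 6959.6], 132018.9))   # about 0.023
```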

5.3. Estimation of $\tilde a^T\tilde\mu$
Consider now the mean of the z-scores, $\tilde\mu = E(\tilde Z)$, defined in (3). We first estimate $\theta = \frac{1}{p}\sum_{j=1}^p\tilde\mu_j$ (taking $\tilde a = (1/p, \ldots, 1/p)^T$). Here $\hat\theta = \tilde a^T\tilde Z = 1.173$ and $\tilde a^T\tilde\Sigma\tilde a = 0.264$. The latter can be efficiently computed as $\sum_{j=1}^r\alpha_j^2\tilde\lambda_j$, where the $\tilde\lambda_j$'s are the eigenvalues of $\tilde\Sigma$, computed as the squared singular values of $XD^{-1}$; in this dataset the rank is r = 136. Therefore a confidence interval for θ based on two standard deviations is (0.144, 2.201); notice that it does not include 0. To interpret the result in more natural units, by (19) (see Appendix C), an estimate of the average correlation is $\frac{1}{p}\sum_{j=1}^p\tilde Z_j/\sqrt{n} = \hat\theta/\sqrt{n} = 0.099$; a confidence interval based on two standard deviations is (0.012, 0.187). This indicates that the average correlation between the outcome and the cortical thickness over the brain is somewhat small, but still significantly different from zero.
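A sketch of the point estimate and two-standard-deviation interval for $\tilde a^T\tilde\mu$; the variance $\tilde a^T\tilde\Sigma\tilde a$ is computed through $XD^{-1}$ so the p × p correlation matrix is never formed explicitly (function name ours):

```python
import numpy as np

def projection_ci(a_tilde, Z, X, n_sd=2.0):
    """Estimate theta = a~^T mu~ by a~^T Z; Var = a~^T Sigma~ a~ = ||X D^{-1} a~||^2."""
    theta_hat = a_tilde @ Z
    d = np.sqrt(np.sum(X ** 2, axis=0))       # D_jj = sqrt(s_jj)
    sd = np.linalg.norm((X / d) @ a_tilde)    # sqrt(a~^T D^{-1} X^T X D^{-1} a~)
    return theta_hat, (theta_hat - n_sd * sd, theta_hat + n_sd * sd)
```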
While we get a significant result, the confidence interval for the average θ does not shrink with increasing p. In fact, the confidence interval stabilizes already for relatively small p. For example, taking a random sample of voxels of size p = 2000 gives a point estimate of 1.185 and a confidence interval of (0.154, 2.216), not far from the results for the full dataset with p = 81924. The reason for this, as suggested by the theory, is the correlation between the voxels. To assess this, the eigenvalues of $\tilde\Sigma$ are plotted in Figure 3(a). The first two eigenvalues are much larger than the rest, and it can be checked that in the variance decomposition $\sum_{j=1}^r\alpha_j^2\tilde\lambda_j$, the first term captures 99.2% of the variance. This heterogeneity among the eigenvalues is caused by strong correlation; if the voxels were independent, the eigenvalues would be much more homogeneous.
As a point of reference, if the entries of the matrix X were i.i.d. N(0, 1), then the range of the eigenvalues would be that of the Marchenko-Pastur distribution (Paul and Aue, 2014). For n = 139 and p = 81924, the Marchenko-Pastur range is $\big((1-\sqrt{p/n})^2, (1+\sqrt{p/n})^2\big) = (541.8, 638.9)$. In contrast, in our data the range of the positive eigenvalues is $(\tilde\lambda_{136}, \tilde\lambda_1) = (33.4, 23995.7)$; see also Figure 4(b). In addition, we compared the eigenvalues of $\tilde\Sigma$ to a spiked covariance model (Johnstone, 2001), where each row in X is sampled from independent mean-zero normals with variances $(\tilde\lambda_1, \ldots, \tilde\lambda_K, c, \ldots, c)$ for $c = (p - \sum_{j=1}^K\tilde\lambda_j)/(p - K)$ (so that the sum of the variances is p). That is, the first K eigenvalues are the same as in $\tilde\Sigma$ and the rest are equal. For a literature review on spiked models see Paul and Aue (2014). The parameter K is the number of large eigenvalues of the model. Figure 4 compares the eigenvalues $\tilde\lambda_1, \ldots, \tilde\lambda_{20}$ to the empirical eigenvalues of 200 simulations of the spiked model with K = 2 and K = 10, and also to 200 simulations with i.i.d. N(0, 1) entries. Although the first K eigenvalues fit the empirical ones more or less, the gap between $\tilde\lambda_K$ and $\tilde\lambda_{K+1}$ is much smaller than in the spiked model. Therefore, while the spiked model fits the data better than the Marchenko-Pastur distribution, there is still a non-negligible difference between the model's prediction and the observed eigenvalues of X. It appears that the eigenvalues of the real covariance matrix decay more slowly than the variances in the spiked model.
Linear combinations of $\tilde\mu$ other than the average are also of interest. Since $\tilde\mu$ by definition lies in the subspace spanned by the eigenvectors of $\tilde\Sigma$ with positive eigenvalues, consider the representation of $\tilde\mu$ with respect to the eigenvector space, i.e., we write $\tilde\mu = \sum_{j=1}^r\tilde\gamma_j\tilde d_j$, where $\tilde d_j = \tilde\gamma_j^T\tilde\mu/p$ are the coefficients of $\tilde\mu$ with respect to the eigenvectors $\tilde\gamma_1, \ldots, \tilde\gamma_{136}$ of $\tilde\Sigma$. Figure 3(b) plots the estimates of the $\tilde d_j$'s and the associated confidence intervals. We find that $\hat{\tilde d}_1 = 1.268$ and a confidence interval based on two standard deviations is (0.186, 2.351). Note that these values are similar to the estimate and confidence interval for $\theta = \frac{1}{p}\sum_{j=1}^p\tilde\mu_j$ calculated above. This is not a coincidence: the first eigenvector is very close to the averaging constant vector $\tilde a = 1/p$, with a correlation of 0.938 between the two. For illustration, Figure 5 shows a brain map of the first eigenvector, which is indeed almost constant over the brain. Back in Figure 3(b), the rest of the $\hat{\tilde d}_j$'s for j > 1 are much closer to 0. This suggests that most of the signal is contained in the subspace spanned by the first eigenvector $\tilde\gamma_1$. However, the variance of $\hat{\tilde d}_1$ is also higher than that of the other $\hat{\tilde d}_j$'s.
The variance of $\hat{\tilde d}_j$ is $\tilde\lambda_j/p$; indeed, the first eigenvalue is much larger than the other eigenvalues, as illustrated in Figure 3(a). If we consider $\hat{\tilde d}_j$ for large j, the signal seems to be weaker, but the variance is also smaller. As indicated in Proposition 5 in Appendix A, when $\tilde a$ is orthogonal to the "large" eigenvectors, the variance is small.
Not only is the average signal significantly different from zero, but one can also test the global null hypothesis $\tilde\mu = 0$ using the linear projections $\hat{\tilde d}_j$. Let $T_j := \hat{\tilde d}_j/\mathrm{sd}(\hat{\tilde d}_j)$; then $T_j \sim N(0, 1)$ under the null (assuming normality). Therefore, under the null, $\max_{j=1,\ldots,136}|T_j|$ is distributed as the maximum absolute value of 136 i.i.d. normals. In this case, $\max_{j=1,\ldots,136}|T_j| = 3.3$, which yields a p-value of 0.064 (computed by simulation).
To sum up, our findings indicate that most of the signal $\tilde\mu$ is in the direction of the first eigenvector. However, since in this direction the variance is also higher, it is difficult to determine the level of the signal in this direction, although the confidence interval does not cover zero. This is consistent with the theory given in Appendix A, where it is shown that if $\tilde a$ is not orthogonal to the eigenvectors corresponding to large eigenvalues, then the variance of $\hat\theta$ does not shrink to zero.

5.4. Estimation of $a^T\beta$
We now consider estimation of $\theta = a^T\beta$. As in Proposition 4, we study two cases: the global average and a single entry. Starting with the former, ideally we wish to estimate the global average $\frac{1}{p}\sum_{j=1}^p\beta_j = 1^T\beta/p$. Unfortunately, however, the vector 1 is not spanned by the columns of Γ and as a consequence the average is not identifiable. We can see this in terms of Proposition 2. Taking the expansion $1 = \Gamma\alpha + \Gamma_\perp\alpha_\perp$, we have $1^T\beta/p = (\alpha^T\Gamma^T\beta + \alpha_\perp^T\Gamma_\perp^T\beta)/p$. The orthogonal component $\Gamma_\perp\alpha_\perp$ is not zero, as required by Proposition 2 for identifiability. However, its norm is small relative to that of the vector 1.
As a result, we consider instead the identifiable portion, determined by the vector $a = \Gamma\Gamma^T1/\sqrt{n}$, and estimate $\theta = a^T\beta$. The vector a is the closest vector (in the $L_2$ sense) to $1/\sqrt{n}$ spanned by the columns of Γ, with a correlation of 0.991 with it, so again, the loss in estimation is small. The normalizing constant gives $\|a\|^2 \approx p/n$ and makes it consistent with the normalization of Proposition 4.

For such an a we obtain $\hat\theta = a^T(X^TX)^-X^TY = 1.685$. The variance of the estimator, $\hat\sigma^2\, a^T(X^TX)^-a$, equals 1.627. Therefore, a confidence interval based on two standard deviations is (−0.866, 4.236), which contains 0. In contrast, the confidence interval for the estimate of $\frac{1}{p}\sum_{j=1}^p\tilde\mu_j$ did not contain 0. In this dataset, estimating the average of β is harder than estimating the average of $\tilde\mu$. According to Proposition 4, $Var(\hat\theta)$ is of the same order as $\frac{p}{n^3}\sum_{j=1}^r\frac{1}{\lambda_j}$. In this dataset the latter expression is 0.47, which is not close to zero.

As in Section 5.3, we are also interested in estimating $d_j = \gamma_j^T\beta$ (which is identifiable). This is the single entry case of Proposition 4. The estimates and confidence intervals are presented in Figure 6. Here, unlike for $\hat{\tilde d}_j$ above, the variance of $\hat d_j$ is large for large j, since the eigenvalues appear in the denominator of the variance expression $p/(n\lambda_j)$. Also unlike above, the signal does not seem to be very strong on the first eigenvectors. For some j's, however, the estimator of $d_j$ is significantly different from zero; for example, when j = 132, $\hat d_j/\mathrm{sd}(\hat d_j) = 3.47$. Here the test for the maximum of $\hat d_j/\mathrm{sd}(\hat d_j)$ yields a p-value of 0.035. For illustration, Figure 7 shows a brain map of the eigenvector $\gamma_{132}$. Note that, in contrast with the first eigenvector $\gamma_1$ shown in Figure 5, the eigenvector $\gamma_{132}$ is much more concentrated spatially. While j = 132 gives the strongest effect, the vector β seems to have smaller components in other eigenvectors as well.
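For $\theta = a^T\beta$, both the estimate and the variance $\hat\sigma^2 a^T(X^TX)^-a$ can be obtained from the thin SVD of X, so that no p × p matrix is ever formed; a sketch (function name ours):

```python
import numpy as np

def beta_projection_ci(X, Y, a, sigma2, n_sd=2.0, tol=1e-10):
    """theta_hat = a^T (X^T X)^- X^T Y and Var(theta_hat) = sigma^2 a^T (X^T X)^- a."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > tol
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    w = (Vt @ a) / s                          # coordinates of a in the row space of X, scaled by 1/s
    theta_hat = w @ (U.T @ Y)                 # a^T V diag(1/s) U^T Y
    sd = np.sqrt(sigma2 * np.sum(w ** 2))     # sigma * sqrt(a^T (X^T X)^- a)
    return theta_hat, (theta_hat - n_sd * sd, theta_hat + n_sd * sd)
```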

5.5. Summary of the data analysis and significance
We have seen that the vector $\tilde\mu$ is mostly related to the first eigenvector $\gamma_1$, while the vector β is not. The relationship between these two vectors is expressed in (3). Writing that relation as $\tilde\mu = \tilde\Sigma D\beta/\sigma = nD^{-1}\Gamma\Lambda\Gamma^T\beta/\sigma$, we see that the vector β is mapped to $\tilde\mu$ through the eigenvalues of X. In particular, in this dataset, the first eigenvalue $\lambda_1$ is large, causing $\tilde\mu$ to be highly aligned with the first eigenvector $\gamma_1$. The fact that the average of $\tilde\mu$ (or its projection onto the first eigenvector) appears to be non-zero is indicative of the vector β being non-zero, with its effect amplified by the first eigenvalue. The vector β itself is hidden from the z-scores, yet some linear combinations of it can be estimated by the methods proposed here.
For the vector β we found that the estimate of $\gamma_j^T\beta$ is largest when j = 132, corresponding to a relatively spatially concentrated eigenvector, as presented in Figure 7. This finding might be used to identify areas in the cortical region that are correlated with the behavior assessed in the study.

6. Final remarks
The motivation for this study came from a brain imaging dataset relating cortical thickness at each voxel to a global behavioral measurement. We aim at estimating one-dimensional linear projections of the coefficient vector β without assuming sparsity. In fact, the resulting signal $\tilde\mu$ contained in the voxelwise z-scores does not seem to be sparse, and traditional high-dimensional methods that aim at identifying a small number of non-zero parameters are not useful. The challenge is to distinguish between real signal and a seemingly high signal due to a correlation effect. Our theoretical results imply that for certain projections and correlation structures these two situations cannot be distinguished consistently.
For a given high-dimensional regression model, the general proposed approach is to estimate $\gamma_j^T\beta$ for all identifiable j's. The results can be used for two types of inference: a global test to check whether β = 0 and identification of significant one-dimensional projections of β.
Interestingly, regularization does not play a role in the estimation of linear projections. In the classical setting $Z \sim N(\mu, I)$, the famous work of Stein (1956) showed that Z is inadmissible for μ. A better estimate can be obtained by using shrinkage, or equivalently a certain kind of regularization. However, we have shown that $\tilde a^TZ$ is admissible and minimax for $\tilde a^T\mu$. Thus, when the interest is in the entire vector μ, regularization is warranted, but for estimation of linear projections of μ no regularization is required.
In this paper we investigated linear functions of β and necessary conditions for a consistent estimator of $\theta = a^T\beta$ to exist. Other, non-linear, functions are also of interest. For example, in many applications it is desired to estimate the number of non-zero entries of β (Chen, 2018) or the squared norm of β (Dicker, 2014). We hope that a future study will suggest procedures to estimate such quantities and indicate when consistent estimates exist for non-linear functions of β.

Appendix A: Consistency results for the natural estimator of Section 3
To state our asymptotic results, we need some more notation. Recall that $\lambda_1 \ge \cdots \ge \lambda_r$ are the positive eigenvalues of Σ and $\gamma_1, \ldots, \gamma_r$ are the corresponding eigenvectors. Define $\underline\lambda_i = \liminf_p \lambda_i/p$ and $\bar\lambda_i = \limsup_p \lambda_i/p$; since the variances are assumed bounded, $\bar\lambda_i$ is finite for every i. Further, let K be the number of indices i with $\underline\lambda_i > 0$ and $\bar K$ the number of indices with $\bar\lambda_i > 0$. By definition we have $K \le \bar K$. Notice that K could be infinite, and if $K < \infty$ then $\underline\lambda_i > 0$ for $i \le K$ and $\underline\lambda_i = 0$ otherwise; the same holds for $\bar K$. The variance of $\hat\theta$ is $b^Tb = \tilde a^T\Sigma\tilde a$. The following proposition provides results on the limiting behavior of $\tilde a^T\Sigma\tilde a$ and of b.
Proposition 5. Assume that Z is a vector with mean μ and covariance matrix Σ and consider the natural estimator $\hat\theta = \tilde a^TZ$. Suppose that $\|\tilde a\|^2 = O(1/p)$ and that the variances of Z are bounded.
Part (I) shows that $\limsup_p \tilde a^T\Sigma\tilde a$ can be written as a sum of $\bar K$ summands, one for each of the eigenvalues of order p; therefore $\limsup_p \tilde a^T\Sigma\tilde a$ goes to zero iff $\bar\alpha_i = 0$ for $i \in \{1, \ldots, \bar K\}$. In other words, Proposition 5 implies that $\hat\theta$ is consistent iff $\tilde a$ is (asymptotically) orthogonal to the eigenvectors that correspond to the largest $\bar K$ eigenvalues. When K = ∞, $\tilde a^T\Sigma\tilde a$ is an infinite sum, but it can be bounded by a finite sum. The proof of Proposition 5 implies that if there are no large eigenvalues, i.e., if $\bar K = 0$, then consistency of $\hat\theta$ follows, as stated in the following corollary.
Proposition 5 also implies that when $\|\tilde a\|^2$ is of order larger than $O(1/p)$, then $\lim_p \tilde a^T\Sigma\tilde a = \infty$, and if $\|\tilde a\|^2 = o(1/p)$ then $\lim_p \tilde a^T\Sigma\tilde a = 0$. Therefore, in the former case $\hat\theta$ is inconsistent for all correlation structures, and in the latter case it is always consistent, as stated in the corollary below. When K = ∞, $\tilde a^T\Sigma\tilde a$ can be bounded by finite sums as in (8). These finite-sum approximations are used below in the proof of Theorem 2, as they reduce the p summands of $\tilde a^T\Sigma\tilde a$ to a bounded number of summands.

B.4. Empirical Bayes estimator under exchangeable correlation
Empirical Bayes (EB) estimators have been found useful in many high-dimensional situations (Efron, 2010) and are therefore potential candidates to improve upon the natural estimator $\hat\theta$. However, Theorem 1 implies that EB estimates cannot uniformly improve upon $\hat\theta$. Furthermore, it implies that there is no consistent way to identify the cases where EB estimates are better. In this section we compare the EB estimator to $\hat\theta$ under the exchangeable correlation structure (9).
In practice, one cannot verify condition (18), since there is no consistent estimate of $\eta_1$ and $\bar\eta$. In other words, there is no consistent way to know when $\hat\theta_{EB}$ is better. On one extreme, if $\eta_1 = \cdots = \eta_p$, then the left-hand side of (18) converges to 0 while the right-hand side converges to 1, so the risk of $\hat\theta_{EB}$ converges to 0. On the other extreme, if $(\eta_1 - \bar\eta)^2$ is large, i.e., when $\eta_1$ is distant from the other η's, then $\hat\theta$ is better.

B.5. Empirical covariance matrix under exchangeable correlation
Recall the regression setting of Section 2 and suppose that each row of the matrix X is sampled independently from a distribution with covariance matrix (9). The results of Fan and Wang (2017) indicate that the leading eigenvalue of the sample covariance is of order p, as for the original distribution, but is biased upward. In terms of our notation, the sample covariance satisfies $\bar K = 1$, but $\lambda_1$ is greater than the leading eigenvalue $\rho p + 1 - \rho$ of the true underlying distribution, while still being of order p. This is the case considered in the simulations of Section 4.4.