The relative effects of dimensionality and multiplicity of hypotheses on the F-test in linear regression

Recently, several authors have re-examined the power of the classical F-test in linear regression in a `large-p, large-n' framework (cf. Zhong and Chen (2011), Wang and Cui (2013)). They highlight the loss of power as the number of regressors p increases relative to sample size n. These papers essentially focus only on the overall test of the null hypothesis that all p slope coefficients are equal to zero. Here, we consider the general case of testing q linear hypotheses on the (p+1)-dimensional regression parameter vector that includes p slope coefficients and an intercept parameter. In the case of Gaussian design, we describe the dependence of the local asymptotic power function on both the relative number of parameters p and the number of hypotheses q being tested, showing that the negative effect of dimensionality is less severe if the number of hypotheses is small. Using the recent work of Srivastava and Vershynin (2013) on high dimensional sample covariance matrices we are also able to substantially generalize previous results for non-Gaussian regressors.


Introduction
Following a suggestion of Bai and Saranadasa [1] to investigate classical statistical procedures in high-dimensional settings, Wang and Cui [27] re-examine the usual F-test in the linear regression model under a "large p, large n" asymptotic framework. They derive the asymptotic power in a fairly general, non-Gaussian setting, highlighting the dependence of the local power function on the dimensionality of the problem, i.e., on the limit ρ = lim p/n ∈ (0, 1), where n is sample size and p is the number of regressors in the model. In particular, they find that the rejection probability of the F-test for H_0 : Rβ = r_0, where R = [0, I_q] and p/n → ρ, q/n → ρ, satisfies

P(F_n ≥ f^{(1−ν)}_{q,n−p−1}) − Φ(−ζ_{1−ν} + √n Δ_β ((1−ρ)/(2ρ))^{1/2}) → 0, as n → ∞.   (1.1)

Here F_n is the usual F-statistic, f^{(1−ν)}_{q,n−p−1} is the appropriate F-quantile, Φ is the cdf of the standard normal distribution, ζ_{1−ν} = Φ^{−1}(1−ν) and Δ_β = (Rβ − r_0)′(RΣ^{−1}R′)^{−1}(Rβ − r_0)/σ² is the scaled distance from the null hypothesis. From this approximation we see that the local asymptotic power of the F-test depends monotonically on the value of ρ and decreases to the nominal significance level ν as ρ increases to one. The result of Wang and Cui [27] is consistent with the derivations of the local asymptotic power in the case of Gaussian errors, as obtained by Zhong and Chen [29]. Both of these studies consider only the overall F-test for the null hypothesis that all, or almost all (cf. Condition (C3) in Wang and Cui [27]), of the p slope coefficients are equal to zero. Also, they do not consider hypotheses involving the intercept parameter.

Here, we extend this analysis and study the problem of testing q general linear hypotheses (including also hypotheses on the intercept term), without the restriction that (p − q)/n → 0. In this sense, we examine the effect of the dimension of the null hypothesis (i.e., the number of linear restrictions being tested) on the local asymptotic rejection probability of the F-test. We find that when testing the null hypothesis H_0 : R_0γ = r_0, for some q × (p+1) matrix R_0 of rank q ≤ p + 1, such that p/n → ρ_1 and q/n → ρ_2 ≤ ρ_1, the rejection probability of the F-test satisfies

P(F_n ≥ f^{(1−ν)}_{q,n−p−1}) − Φ(−ζ_{1−ν} + √n Δ_γ ((1−ρ_1)(1−ρ_1+ρ_2)/(2ρ_2))^{1/2}) → 0, as n → ∞.   (1.2)

Now the asymptotic rejection probability depends also on the mean µ ∈ R^p of the random design through Δ_γ = (R_0γ − r_0)′(R_0 S^{−1} R_0′)^{−1}(R_0γ − r_0)/σ², where γ = (α, β′)′ is the vector of regression coefficients including an intercept parameter α ∈ R and

S = [ 1   µ′
      µ   Σ + µµ′ ].
This limiting expression coincides with that in (1.1) if ρ_1 = ρ_2 and R_0 = [0, R]. But (1.2) refines the statement in (1.1) and shows the impact of both the relative number of regressors ρ_1 and the relative number of hypotheses ρ_2. These quantities affect the asymptotic rejection probability monotonically, which is consistent with small sample analyses in the Gaussian error case [cf. 12]. However, in contrast to the complicated nature of the cdf of the non-central F-distribution as a function of p, q and the non-centrality parameter, our asymptotic approximation to the rejection probability depends on the quantities ρ_1, ρ_2 and Δ_γ only through elementary operations and the Gaussian cdf, and it is valid for a large class of error distributions. In particular, we see that even if ρ_1 is close to 1, the F-test still has power if ρ_2 is sufficiently small. In a second step, and under slightly more restrictive assumptions on the data generating process, we also investigate the case where only a very small relative number of hypotheses q/n is tested, i.e., q is bounded as n → ∞ and ρ_2 = 0, and the result in (1.2) no longer holds.

Our work builds heavily on the ideas of Wang and Cui [27] (hereafter abbreviated as WC). The first part of the present work is concerned with reproducing their results under substantially more general assumptions. First of all, here we do not require independence between the random design and the error terms, but assume only the usual first and second order specification of conditional moments of the errors given the design. This extension requires a slight modification of the result of Bhansali, Giraitis and Kokoszka [5] on the asymptotic normality of certain quadratic forms as applied by WC (cf. Lemma 6.1). Furthermore, we do not assume that the n × p design matrix X, after standardization, consists of i.i.d. components, as is needed for the application of the famous Bai–Yin theorem [2] used by WC in order to control extreme eigenvalues of large sample covariance matrices. Instead, we apply a recent result of Srivastava and Vershynin [23] which essentially requires only certain moment restrictions on the i.i.d. rows of X. For our extensions, we also develop a novel result on the diagonal entries of a fairly general random projection matrix that might be of interest on its own (see Lemma 6.3). It has the statistical interpretation that in a moderately high-dimensional regression the leverage values h_i, i.e., the diagonal entries of the projection matrix U(U′U)^{−1}U′, where U = [ι, X] is the design matrix including an intercept column, typically behave like p/n. Finally, we point out that since we also consider tests on the intercept parameter, the distribution of the F-statistic, in general, also depends on the mean µ of the random design vectors x_1, . . . , x_n. This causes certain technical complications due to non-centrality issues which are often avoided in the literature on random design regression by restricting to the case µ = 0. Here, we present a detailed treatment of the general case.
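To illustrate the statement about the leverage values, the following minimal numerical check (ours, not part of the paper's analysis) computes the diagonal of the hat matrix for a Gaussian design and compares it to (p + 1)/n; the concrete dimensions are arbitrary.

```python
# Minimal check (ours): in a moderately high-dimensional regression the leverages,
# i.e. the diagonal entries of P_U = U (U'U)^{-1} U' with U = [1, X], concentrate
# around (p + 1)/n.  Dimensions below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 200
X = rng.standard_normal((n, p))
U = np.column_stack([np.ones(n), X])

Q, _ = np.linalg.qr(U)                 # P_U = Q Q', so diag(P_U) = row sums of Q**2
h = np.sum(Q * Q, axis=1)
print((p + 1) / n, h.mean(), h.min(), h.max())   # all three close to (p + 1)/n
```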
The paper is organized as follows. Section 2 introduces the setup and notation and presents our main results in Theorem 2.1 and Corollary 2.2, which provide a precise formulation of the statement in (1.2). In Section 3, we specifically consider the situation where q is fixed and also provide a unifying result that does not distinguish between large or small q. Next, in Section 4 we provide a detailed discussion of our technical assumptions and explain the main differences to those imposed by WC. The results of an extensive numerical study are reported in Section 5. Finally, Section 6 provides the basic steps in the proof of our main results. Some of the more technical arguments are deferred to the appendices.

Model formulation and main results
We consider a random array {(y_{i,n}, x_{i,n}) : 1 ≤ i ≤ n, n ≥ 1} where, for each n ∈ N, the pairs (y_{i,n}, x_{i,n})_{i=1}^n are i.i.d. observations of a real valued response variable y_{1,n} and p_n-dimensional random regressors x_{1,n} with p_n < n − 1, satisfying E[y_{1,n}|x_{1,n}] = α_n + β_n′x_{1,n} and Var[y_{1,n}|x_{1,n}] = σ²_n ∈ (0, ∞). Equivalently, writing ε_{i,n} = y_{i,n} − E[y_{i,n}|x_{i,n}], the observations can be represented as

y_{i,n} = α_n + β_n′x_{i,n} + ε_{i,n},  i = 1, . . . , n,   (2.1)

where the (ε_{i,n})_{i=1}^n are i.i.d., satisfying E[ε_{i,n}|x_{i,n}] = 0 and Var[ε_{i,n}|x_{i,n}] = σ²_n. Note that ε_{1,n} need not be independent of x_{1,n}. For identifiability, we also assume that Σ_n := Var[x_{1,n}] is positive definite and we define µ_n := E[x_{1,n}]. Furthermore, we adopt the matrix notation Y_n = (y_{1,n}, . . . , y_{n,n})′, X_n = [x_{1,n}, . . . , x_{n,n}]′, ε_n = (ε_{1,n}, . . . , ε_{n,n})′, γ_n = (α_n, β_n′)′ and U_n = [ι_n, X_n], where ι_n = (1, 1, . . . , 1)′ ∈ R^n. For notational convenience we will drop the subscript n whenever there is no risk of confusion, i.e., we write Y = Y_n, X = X_n, α = α_n, β = β_n, etc., keeping in mind that, unless noted otherwise, all quantities to follow depend on sample size n. With this, the model equations in (2.1) become Y = Uγ + ε. (2.2)

We want to test a general linear hypothesis on the coefficients γ, i.e.,

H_0 : R_0γ = r_0,   (2.3)

where R_0 is a q × (p + 1) matrix with rank R_0 = q ≤ p + 1 and r_0 ∈ R^q. Without restriction we may assume that R_0 has orthonormal rows (premultiply (2.3) by (R_0R_0′)^{−1/2}). We test H_0 by use of the F-statistic F_n defined as

F_n = (R_0γ̂_n − r_0)′[R_0(U′U)^{−1}R_0′]^{−1}(R_0γ̂_n − r_0) / (q σ̂²_n),   (2.4)

provided that all the appearing quantities are well defined, and F_n = 0, otherwise. The F-statistic is then compared to the 1 − ν quantile of an F-distribution with q and n − p − 1 degrees of freedom, which we denote by f^{(1−ν)}_{q,n−p−1}. Here, γ̂_n = (α̂_n, β̂_n′)′ is the OLS estimate in the unrestricted model and σ̂²_n = ‖Y − Uγ̂_n‖²/(n − p − 1) is the usual estimator of the error variance that appears in the denominator of the F-statistic.
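To fix ideas, here is a short sketch (ours; the simulated data, dimensions and variable names are purely illustrative) that computes the OLS estimator, the variance estimator σ̂²_n and the F-statistic (2.4) for a hypothesis on the first q slope coefficients.

```python
# Sketch (ours): the F-statistic (2.4) for H_0: R_0 gamma = r_0, here testing that
# the first q slope coefficients vanish.  Data and dimensions are illustrative only.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
n, p, q, nu = 100, 20, 5, 0.05
X = rng.standard_normal((n, p))
y = 1.0 + 0.5 * X[:, 0] + rng.standard_normal(n)       # some data-generating process

U = np.column_stack([np.ones(n), X])                    # U = [iota, X]
R0 = np.zeros((q, p + 1)); R0[:, 1:q + 1] = np.eye(q)   # orthonormal rows, intercept untouched
r0 = np.zeros(q)

UtU_inv = np.linalg.inv(U.T @ U)
gamma_hat = UtU_inv @ (U.T @ y)                         # OLS estimate of gamma = (alpha, beta')'
sigma2_hat = np.sum((y - U @ gamma_hat) ** 2) / (n - p - 1)

d = R0 @ gamma_hat - r0
F_n = d @ np.linalg.solve(R0 @ UtU_inv @ R0.T, d) / (q * sigma2_hat)
print(F_n, f.ppf(1 - nu, q, n - p - 1))                 # compare with f^{(1-nu)}_{q, n-p-1}
```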
In Section 6 we prove the following results, involving the scaled distance from the null hypothesis

Δ_γ = (R_0γ − r_0)′(R_0 S^{−1} R_0′)^{−1}(R_0γ − r_0)/σ²,  where S = E[U′U/n],

as well as the scaling and centering sequences

s_n = 2(1/q_n + 1/(n − p_n − 1))  and  b_n = (n/q_n)^{1/2}(s_n q_n)^{−1/2}(n − p_n − 1 + q_n)/n.

A list of further technical conditions is given below in Section 2.1.

Theorem 2.1. Suppose that lim sup_n p_n/n < 1, q_n → ∞ and Δ_γ = o(q_n/n), as n → ∞, and that one of the following three conditions is satisfied:

(i) The assumptions (A1).(a,b,c,d) and (A2) on the random design and on the error distribution are satisfied, and R_0 = I_{p+1} or R_0 = [0, I_p]. (In this case, either q_n = p_n or q_n = p_n + 1, and thus (p_n − q_n)/n → 0 holds.)

(ii) The assumptions (A1).(a,b,c,d,e) and (A2) on the random design and on the error distribution are satisfied, (p_n − q_n)/n → 0, and (R_0γ − r_0)′R_0 S R_0′(R_0γ − r_0)/σ² = O(√q_n/n) holds.¹

(iii) Assumption (A2) on the error distribution is satisfied and the design vectors x_{1,n}, . . . , x_{n,n} are i.i.d. Gaussian with mean µ_n ∈ R^{p_n} and positive definite covariance matrix Σ_n.²

Then s_n^{−1/2}(F_n − 1) − √n Δ_γ b_n converges weakly to the standard normal distribution.

By a simple argument involving Polya's theorem this translates into the following corollary on the rejection probability of the F-test.
Corollary 2.2. If lim sup_n p_n/n < 1 and q_n → ∞, as n → ∞, and the conclusion of Theorem 2.1 holds, then the rejection probability of the F-test satisfies

P(F_n > f^{(1−ν)}_{q_n,n−p_n−1}) − Φ(−ζ_{1−ν} + √n Δ_γ b_n) → 0, as n → ∞.

Here, ζ_{1−ν} = Φ^{−1}(1−ν) is the 1−ν quantile of the standard normal distribution and ν ∈ (0, 1) does not depend on n.

¹ Notice that this additional requirement implies and strengthens the assumption that Δ_γ = o(q_n/n). Simply observe that, by block matrix inversion, (R_0 S^{−1} R_0′)^{−1} = R_0 S R_0′ − R_0 S R_1′(R_1 S R_1′)^{−1} R_1 S R_0′, where R_1 is a (p_n + 1 − q_n) × (p_n + 1) matrix with orthonormal rows which are also orthogonal to the rows of R_0. Therefore, Δ_γ ≤ (R_0γ − r_0)′R_0 S R_0′(R_0γ − r_0)/σ² = O(√q_n/n) = o(q_n/n), since q_n → ∞.

Proof. It is easy to see, using Polya's theorem and Lemma C.8, that f̂_n := s_n^{−1/2}(f^{(1−ν)}_{q_n,n−p_n−1} − 1) satisfies f̂_n → ζ_{1−ν}. Now use the conclusion of Theorem 2.1, Polya's theorem and the Lipschitz continuity of Φ to obtain

P(F_n > f^{(1−ν)}_{q_n,n−p_n−1}) = P(s_n^{−1/2}(F_n − 1) − √n Δ_γ b_n > f̂_n − √n Δ_γ b_n) = 1 − Φ(f̂_n − √n Δ_γ b_n) + o(1) = Φ(−ζ_{1−ν} + √n Δ_γ b_n) + o(1).

If (p_n − q_n)/n → 0 and 0 < lim inf_n q_n/n ≤ lim sup_n p_n/n < 1, then Corollary 2.2 recovers the result of WC under weaker assumptions on the joint distribution of the design and the errors, and for a null hypothesis that possibly restricts also the intercept parameter α (cf. the assumptions of Theorem 2.1(i) and (ii)). In this case the factor b_n above asymptotically reduces to ((1−ρ)/(2ρ))^{1/2}, as in (1.1). It highlights the dependence of the power function on the relative number of regressors p_n/n. However, since (p_n − q_n)/n → 0, the individual roles of p_n and q_n cannot be discerned. This shortcoming is removed here, but it comes at the price of a stronger design condition (cf. Theorem 2.1(iii)). It is tempting, however, to conjecture that Assumptions (A1) and (A2) are actually sufficient also for the general case.

Corollary 2.2 nicely shows the effect of both the dimension of the parameter space and the dimension of the null hypothesis on the asymptotic power function. In particular, we see that even in a case where the relative number of regressors p_n/n is large, the classical F-test still has power, as long as we are interested in testing only a relatively small number of hypotheses. However, we should make a cautionary remark at this point. In Theorem 2.1 and Corollary 2.2 we have assumed that q_n → ∞. If the number of hypotheses q being tested is too small, then the asymptotic approximation presented above will not be very accurate, in the same way the χ²_q distribution is not very accurately approximated by the normal if q is small. Therefore, in the next section we specifically study the case where q_n is bounded.
In Theorem 2.1, we treat the special cases R_0 = I_{p+1} and R_0 = [0, I_p] separately, because in these cases it is considerably easier to deal with the non-centrality term in the decomposition of the F-statistic (see Section 6.4.1). In particular, in this case we do not need to impose further restrictions on the distance from the null Δ_γ, other than that it is of order o(q_n/n) as n → ∞, and we can also work with weaker design conditions (cf. Theorem 2.1(i)).
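The following small numerical sketch (ours) makes the power trade-off concrete in the classical Gaussian-error, fixed-design setting of Remark 2.4 below, where the conditional power is available exactly through the non-central F-distribution: for fixed n, p and a fixed non-centrality parameter, the power decays as the number of tested restrictions q grows. The specific numbers are arbitrary.

```python
# Exact power of the F-test under Gaussian errors and a fixed design (cf. Remark 2.4),
# as a function of the number of restrictions q, for a fixed non-centrality parameter.
# Illustrative numbers only.
from scipy.stats import f, ncf

n, p, nu, lam = 100, 80, 0.05, 10.0          # p/n = 0.8, i.e. many regressors
for q in (2, 5, 10, 20, 50, 80):
    dfd = n - p - 1
    crit = f.ppf(1 - nu, q, dfd)             # critical value f^{(1-nu)}_{q, n-p-1}
    print(q, ncf.sf(crit, q, dfd, lam))      # P(F_{q, n-p-1}(lam) > crit)
```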

Remark 2.3 (On the detection boundary of the F-test).
In the classical setting, where q and p are fixed while n goes to infinity, it is well known that the detection boundary of the F-test is n^{−1/2}. This means that a violation of the null hypothesis H_0 : R_0γ = r_0 that is of the order ‖R_0γ − r_0‖_2 ≍ n^{−1/2} leads to non-trivial asymptotic power, while a smaller order yields asymptotic power equal to the size of the test and a larger order yields asymptotic power equal to one. However, in general, when q = q_n and p = p_n are allowed to grow with sample size n, the detection boundary of the F-test is no longer n^{−1/2} but rather q_n^{1/4}/n^{1/2}. To see this, first we ignore the influence of the nuisance parameters and set µ = 0, Σ = I_p and σ² = 1, so that S = I_{p+1} and Δ_γ = ‖R_0γ − r_0‖_2². From Corollary 2.2, we see that in order to obtain non-trivial asymptotic power, i.e., asymptotic power in the open interval (ν, 1), the non-centrality term has to stay away from 0 and ∞. If we exclude the pathological case lim sup_n p_n/n = 1, then this requirement is met if the violation of the null hypothesis is of the order ‖R_0γ − r_0‖_2 ≍ q_n^{1/4}/n^{1/2}.

Remark 2.4 (Gaussian errors and fixed design). In the classical case where the error ε follows a spherical normal distribution which is independent of the design, the F-statistic (2.4) follows a non-central F-distribution with q_n and n − p_n − 1 degrees of freedom and non-centrality parameter ς²_n = n∇_n, conditional on X, where ∇_n := (R_0γ − r_0)′(R_0(U′U/n)^{−1}R_0′)^{−1}(R_0γ − r_0)/σ². Nevertheless, even in this traditional case, only basic monotonicity results are available for the power function P(F_n > f^{(1−ν)}_{q_n,n−p_n−1}|X) as a function of q_n, n − p_n − 1 and ς²_n [e.g., 12, 26]. In Section 6.4 we investigate ∇_n as n → ∞, such that lim sup_n p_n/n < 1. Our results provide approximations for the average (or unconditional) rejection probability, i.e., for E[P(F_n > f^{(1−ν)}_{q_n,n−p_n−1}|X)], which are given by the Gaussian cdf applied to an elementary function in p_n/n, q_n/n and Δ_γ = (R_0γ − r_0)′(R_0 S^{−1} R_0′)^{−1}(R_0γ − r_0)/σ², and which are therefore easy to interpret (cf. Corollary 2.2).

Remark 2.5 (On omitted variable misspecification). One major motivation for us to extend the result of WC to scenarios where there is some dependence between the design and the errors, and also among the components of the standardized design vectors themselves (see Section 4 for details), was to treat simple sub-models of high-dimensional linear models. These sub-models typically exhibit misspecification due to omitted regressor variables. Consider an i.i.d. sample (y_i, z_i)_{i=1}^n from a high-dimensional linear model in the regressors z_i with errors u_i, where the z_i are random d-vectors with d ≫ n that are independent of the u_i, which satisfy E[u_i] = 0 and E[u_i²] ∈ (0, ∞). Moreover, assume that the marginal distribution of the regressors z_i can be represented as z_i = Ω^{1/2}z̃_i, where Ω^{1/2} is symmetric and positive definite and z̃_i has a Lebesgue density f_{z̃} which is such that the components of z̃_i are independent with zero means, unit variances and bounded 8-th moments, so that in particular Cov[z_i] = Ω. Now suppose we want to use only a small number p of the d available regressors, with p < n, so that classical regression methods are feasible within this subset regression problem. These working regressors can be described as x_i = M′z_i, where M is a d × p matrix of full rank p < d. For instance, M could be a selection matrix, so that x_i consists of a certain choice of p components of z_i.
In such a situation, the sample (y_i, x_i)_{i=1}^n need not follow a linear homoskedastic model as in (2.1), because, in general, the conditional expectation E[y_i|x_i] is not linear in x_i and the conditional variance Var[y_i|x_i] is not constant if the pair (y_i, x_i) is not jointly Gaussian. However, one can always write y_i = α̃ + β̃′x_i + ξ_i, where (α̃, β̃′)′ corresponds to the best linear predictor of y_i given x_i. It may then be of interest to test H_0 : β̃ = 0, i.e., whether the selected regressors x_i = M′z_i have any value for linearly predicting the response variable y_i.
Clearly, in the present setting the errors ξ_i are not independent of the design vectors x_i and, in general, E[ξ_i|x_i] ≠ 0 and Var[ξ_i|x_i] is not constant. However, one can pass to a corrected sample (y_i^*, x_i)_{i=1}^n, in which ξ_i is replaced by an appropriately centered and rescaled error term, and this corrected sample clearly follows a linear homoskedastic model as in (2.1). Moreover, the corrected sample satisfies Assumption (A1) in view of Lemma A.2 applied with Γ = M′Ω^{1/2}, µ = 0 and m_n = d, and because the design matrix X = [x_1, . . . , x_n]′ has a Lebesgue density on R^{n×p}. Of course, in general, the actual sample (y_i, x_i)_{i=1}^n may be very different from the corrected sample (y_i^*, x_i)_{i=1}^n, but the results of Steinberger and Leeb [24] suggest that if d ≫ p, then E[ξ_i|x_i] ≈ 0 and Var[ξ_i|x_i] ≈ Var[ξ_i], at least for a large collection of selection matrices M. Therefore, the observed sample and the corrected sample should be very similar if d is large relative to p, and one might expect that the F-tests based on the two samples then also behave similarly. We suspect that the results of Steinberger and Leeb [24] can also be used to establish the validity of Assumption (A2) for the corrected sample, with e_{i,n} := Var[ξ_i]/Var[ξ_i|x_i] and ε_{i,n} := ξ_i − E[ξ_i|x_i]. Thus, we expect the F-test for H_0 : β̃ = 0 based on the sample (y_i, x_i)_{i=1}^n to be asymptotically valid and even to possess similar power as the F-test in a correctly specified model, for most choices of M, provided that d = d_n and p = p_n tend to infinity along with n, such that d_n ≫ p_n. The details of this line of reasoning will be further developed elsewhere.
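As a toy illustration of this sub-model scenario (ours; the dense coefficient vector, the random column selection and all numbers are hypothetical choices, not taken from [24] or from WC), one can simulate a response from a high-dimensional linear model and run the F-test of H_0 : β̃ = 0 in a working model that uses only p of the d regressors.

```python
# Toy version (ours) of the omitted-variable scenario of Remark 2.5: y follows a
# linear model in d >> n regressors, but the F-test is computed in a working model
# that uses only p selected regressors (M is a random selection matrix here).
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(2)
n, d, p = 100, 2000, 10
theta = rng.standard_normal(d) / np.sqrt(d)        # dense high-dimensional signal (our choice)
Z = rng.standard_normal((n, d))
y = Z @ theta + rng.standard_normal(n)

sel = rng.choice(d, size=p, replace=False)         # columns selected by M
X = Z[:, sel]                                      # working regressors x_i = M' z_i
U = np.column_stack([np.ones(n), X])

gamma_hat, *_ = np.linalg.lstsq(U, y, rcond=None)
rss1 = np.sum((y - U @ gamma_hat) ** 2)            # unrestricted residual sum of squares
rss0 = np.sum((y - y.mean()) ** 2)                 # restricted model: beta_tilde = 0
F_n = ((rss0 - rss1) / p) / (rss1 / (n - p - 1))
print(F_n, f.ppf(0.95, p, n - p - 1))              # reject H_0 if F_n exceeds the quantile
```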

Technical conditions
Throughout this paper, the reader will encounter several different norms. For vectors v ∈ R^k we write ‖v‖ for the usual Euclidean norm, whereas for matrices M ∈ R^{k×ℓ} we distinguish between the spectral norm ‖M‖_S = (λ_max(M′M))^{1/2} and the Frobenius norm ‖M‖_F = (trace(M′M))^{1/2}. We write P_M for the matrix of orthogonal projection onto the column span of M. If M satisfies rank M = ℓ ≤ k, then P_M = M(M′M)^{−1}M′. We also make use of the stochastic Landau notation. For a sequence of real random variables z_n, we say that z_n = O_P(1) if the sequence is bounded in probability, i.e., if sup_{n∈N} P(|z_n| > δ) → 0 as δ → ∞, and we say that z_n = o_P(1) if z_n → 0 in probability. For a non-stochastic real sequence a_n ≠ 0 we write z_n = O_P(a_n) if z_n/a_n = O_P(1) and z_n = o_P(a_n) if z_n/a_n = o_P(1).
The following is a list of technical assumptions needed in the proof of Theorem 2.1.
(A1) (a) The design vectors x_{i,n} are linearly generated as x_{i,n} = Γ_n z_{i,n} + µ_n, where Γ_n is a p_n × m_n matrix with m_n ≥ p_n, such that Γ_nΓ_n′ = Σ_n. The random m_n-vectors z_{1,n}, . . . , z_{n,n} are i.i.d. and satisfy E[z_{1,n}] = 0 and E[z_{1,n}z_{1,n}′] = I_{m_n}.
(b) The (n − 1) × (p_n + 1) matrix U_{n,−1} = [0, I_{n−1}]U_n has rank p_n + 1 with probability one. (In particular, P(det(U_n′U_n) = 0) = 0.)

(c) For every n ∈ N, the random m_n-vector z_{1,n} from Assumption (A1).(a) also has the following property. There exist universal positive constants c and C, not depending on n, such that for every orthogonal projection P in R^{m_n} and for every t > C rank P, we have P(‖P z_{1,n}‖² > t) ≤ C t^{−1−c}.

This set of assumptions is weaker than the analogous conditions imposed by WC to treat the case (p_n − q_n)/n → 0. A detailed discussion and comparison of the differences between our treatment and the one of WC is deferred to Section 4.

Testing only a small number of hypotheses
In Theorem 2.1 above, we needed the assumption that q_n → ∞ as n → ∞ in order to achieve asymptotic normality of the F-statistic. Since the asymptotic distribution of the F-statistic is well understood in the Gaussian error and fixed design case (cf. Remark 2.4 and Lemma C.8), we expect the asymptotic distribution of the F-statistic to be of χ²_q-type rather than Gaussian if q_n = q is fixed. The following result establishes this asymptotic χ² distribution for the F-statistic when the error distribution of the model (2.1) is arbitrary up to bounded fourth moments. For this result we need somewhat stronger assumptions than those of Theorem 2.1. The proof is deferred to Section 6.
Theorem 3.1. Consider the linear, homoskedastic model (2.1) and assume that the design vectors x_{1,n}, . . . , x_{n,n} are i.i.d. Gaussian with mean µ_n ∈ R^{p_n} and positive definite covariance matrix Σ_n, and that the design X_n = [x_{1,n}, . . . , x_{n,n}]′ is independent of the errors ε_n = (ε_{1,n}, . . . , ε_{n,n})′, which satisfy E[(ε_{1,n}/σ_n)^4] = O(1). Moreover, suppose that lim sup_n p_n/n < 1, Δ_γ = o(q_n/n) and that the null hypothesis does not involve a restriction on the intercept parameter (i.e., the first column of R_0 is equal to zero). If q_n = q ∈ N does not depend on n, then s_n^{−1/2}(F_n − 1) − √n Δ_γ b_n converges weakly to (2q)^{−1/2}(χ²_q − q), where s_n and b_n are as in Theorem 2.1.
Together with the asymptotic normality of the F-statistic in the case q_n → ∞ (cf. Theorem 2.1), we can establish a non-central F approximation for the F-statistic for any number of tested hypotheses q_n.

Corollary 3.2. Consider the linear, homoskedastic model (2.1) and assume that the design vectors x_{1,n}, . . . , x_{n,n} are i.i.d. Gaussian with mean µ_n ∈ R^{p_n} and positive definite covariance matrix Σ_n, and that the design X_n = [x_{1,n}, . . . , x_{n,n}]′ is independent of the errors ε_n = (ε_{1,n}, . . . , ε_{n,n})′, which satisfy E[(ε_{1,n}/σ_n)^4] = O(1). Moreover, suppose that lim sup_n p_n/n < 1, Δ_γ = o(q_n/n) and that the null hypothesis does not involve a restriction on the intercept parameter (i.e., the first column of R_0 is equal to zero). Then,

sup_{x∈R} |P(F_n ≤ x) − P(F_{q_n,n−p_n−1}(λ_n) ≤ x)| → 0, as n → ∞,

where F_{q_n,n−p_n−1}(λ_n) denotes a random variable following the non-central F-distribution with q_n and n − p_n − 1 degrees of freedom and non-centrality parameter λ_n = Δ_γ(n − p_n − 1 + q_n).
Proof. Suppose the claim does not hold. Then there exists a subsequence n′ such that the supremum converges to a positive number along n′. By compactness of the closed unit interval and the extended real line, we can extract a further subsequence n′′, such that p_{n′′}/n′′ → ρ_1 ∈ [0, 1), q_{n′′}/n′′ → ρ_2 ∈ [0, ρ_1], and q_{n′′} → q ∈ [1, ∞], as n′′ → ∞. If q = ∞, then we are in the setting of Theorem 2.1(iii) and we obtain asymptotic normality of s_n^{−1/2}(F_n − 1) − √n Δ_γ b_n. Since the limiting distribution function is continuous, we get uniform convergence of the corresponding distribution functions, in view of Polya's theorem. Since, by Lemma C.8, s_n^{−1/2}(F_{q_n,n−p_n−1}(λ_n) − 1) − √n Δ_γ b_n is also asymptotically standard normally distributed, we get a contradiction in that case. If q < ∞, then q_{n′′} = q for all large n′′, because q_n is integer valued. Thus Theorem 3.1 applies and shows that s_n^{−1/2}(F_n − 1) − √n Δ_γ b_n converges weakly to (2q)^{−1/2}(χ²_q − q). Since the limiting distribution function is again continuous, and since Lemma C.8 shows that s_n^{−1/2}(F_{q_n,n−p_n−1}(λ_n) − 1) − √n Δ_γ b_n has the same asymptotic distribution, we also get a contradiction in this case, upon using the same argument as before, involving Polya's theorem.

Corollary 3.2 provides a unified treatment of the asymptotic behavior of the F-statistic without distinguishing between the cases q_n = q and q_n → ∞. However, this does not immediately lead to a neat formula for the local asymptotic power function of the F-test because of the complicated nature of the cdf of the non-central F-distribution.

Remark 3.3 (Fixed vs. random design). It is instructive to compare the non-central F approximation of Corollary 3.2 to the distribution of the F-statistic in the Gaussian error and fixed design case (cf. Remark 2.4). In the former case the non-centrality parameter is given by nΔ_γ(n − p_n − 1 + q_n)/n, while in the latter case it is n∇_n. Now if the design X is random with i.i.d. rows and S = E[U′U/n], U = [ι, X], it turns out that Δ_γ := (R_0γ − r_0)′(R_0 S^{−1} R_0′)^{−1}(R_0γ − r_0)/σ² is not a good approximation for ∇_n := (R_0γ − r_0)′(R_0(U′U/n)^{−1}R_0′)^{−1}(R_0γ − r_0)/σ² if p_n/n is not close to q_n/n, even if n is very large. In fact, we need a correction factor of E[χ²_{n−p_n−1+q_n}/n] = (n − p_n − 1 + q_n)/n in order to account for the additional randomness coming from the design X (cf. Lemma C.5). This correction factor is close to one if q_n/n ≈ p_n/n, but if the number q_n of hypotheses to be tested is much smaller than the number of parameters p_n, the correction is quite significant. Thus, the limiting distribution of the F-statistic under random design with E[U′U/n] = S is, in general, not equal to the distribution of the F-statistic under Gaussian errors and fixed design X satisfying U′U/n = S. This distinction only occurs for q_n ≪ p_n. In particular, this issue disappears completely if p_n/n ≈ 0.
Remark 3.4 (On confidence sets for R_0γ). Corollary 3.2 can immediately be used to construct asymptotically valid confidence sets for R_0γ. Simply define the set of all r ∈ R^{q_n} for which the F-statistic for testing R_0γ = r does not exceed f^{(1−ν)}_{q_n,n−p_n−1}, and note that the probability that this set covers the true value R_0γ equals P(F_n ≤ f^{(1−ν)}_{q_n,n−p_n−1}) → 1 − ν, where F_n is the F-statistic under the null hypothesis r_0 = R_0γ.
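For the special case q = 1, the confidence set of Remark 3.4 is an interval and can be computed explicitly, as in the following sketch (ours; the data and the target coefficient are arbitrary illustrations).

```python
# Sketch (ours): inverting the F-test to get a confidence interval for a scalar
# linear combination R_0 gamma (q = 1), as in Remark 3.4.  Illustrative data only.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)
n, p, nu = 200, 50, 0.05
X = rng.standard_normal((n, p))
y = 0.5 * X[:, 0] + rng.standard_normal(n)

U = np.column_stack([np.ones(n), X])
UtU_inv = np.linalg.inv(U.T @ U)
gamma_hat = UtU_inv @ (U.T @ y)
sigma2_hat = np.sum((y - U @ gamma_hat) ** 2) / (n - p - 1)

R0 = np.zeros(p + 1); R0[1] = 1.0                  # target: the first slope coefficient
crit = f.ppf(1 - nu, 1, n - p - 1)
# {r : F_n(r) <= crit} is an interval centred at R_0 gamma_hat when q = 1.
half = np.sqrt(crit * sigma2_hat * (R0 @ UtU_inv @ R0))
center = R0 @ gamma_hat
print(center - half, center + half)
```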

Discussion of the technical assumptions
We pause for a moment to discuss the meaning of our Assumptions (A1) and (A2), as well as the other conditions used in Theorem 2.1, and we comment on the main differences to the conditions imposed in WC. First of all, Assumption (A1).(a) of linear generation of the design from possibly much higher dimensional random vectors also appears in WC, who take it as a modification from Bai and Saranadasa [1]. We point out that this is a straightforward generalization of the case m = p, where moment restrictions have to be imposed directly on the design vectors x_i (note that the components of x_1 may not be independent, even after standardization, whereas x_1 can still be linearly generated from a vector z_1 whose components are independent). Moreover, this assumption also allows for the interpretation that there is actually a much higher dimensional set of explanatory variables z_i available whose dimensionality m (possibly m ≫ n) has already been reduced to p < n. See also Remark 2.5.
(C1) x_i is linearly generated by an m-variate random vector z_i = (z_{i1}, . . . , z_{im})′, so that x_i = Γz_i + µ, where Γ is a p × m matrix for some m ≥ p such that ΓΓ′ = Σ, each z_{il} has finite 8-th moment, and the mixed moments of the components of z_i of total order up to 8 factorize into the corresponding products of marginal moments.

In fact, in addition to (C1), WC also need the 8-th moments of the z_{ik} to be uniformly bounded, so that none of them goes off to infinity as n (and m = m_n) increases. The factorization requirement on the 8-th order mixed moments in (C1) is a straightforward relaxation of an independence assumption. However, just like the independence assumption, it rules out many spherical distributions (cf. Lemma A.1(i)). Therefore, moment conditions like (A1).(d,e) are much more natural to accommodate both product and spherical distributions. In fact, Condition (C1), together with uniform boundedness of E[z_{ik}^8], is strictly stronger than our Assumptions (A1).(a,d,e) (cf. Lemma A.1(iii) and Lemma A.2(ii)).
Our Assumption (A1).(b) is important to guarantee that the F-statistic is equal to the expression on the right-hand-side of (2.4), at least with asymptotic probability one, which is used implicitly in WC. The reason that we require almost sure invertibility not only of U′U but also of the design matrix based on n − 1 observations is purely technical: it plays an important role in the proof of Lemma 6.3 (cf. the end of Subsection 6.2), which is based on leave-one-out ideas. This lemma replaces the strong assumption of WC that there exists a global constant c_1 > 0, such that the smallest eigenvalue of the sample covariance matrix satisfies λ_min(X̃′X̃/n) ≥ c_1, almost surely, where X̃ = (X − ιµ′)Σ^{−1/2} is the design matrix based on the standardized regressors (cf. page 147 in WC).⁴ Finally, Assumption (A1).(c) is taken directly from Srivastava and Vershynin [23] to control the extreme eigenvalues of large sample covariance matrices, and different sets of sufficient conditions for (A1).(c) can be found in that reference. In WC, control of extreme eigenvalues is accomplished by use of the celebrated Bai–Yin theorem of Bai and Yin [2] (cf. Lemma 2 in WC). However, this comes at the price of the implicit assumption that the standardized design vector Σ^{−1/2}(x_1 − µ) has independent components.⁵ Altogether, our design condition (A1) includes linear functions of both product distributions with uniformly bounded 8-th marginal moments and a large class of spherically symmetric distributions (cf. Lemma A.1 and Lemma A.2 in Appendix A).
The Assumption (A2) on the error distribution extends the fourth moment condition (C2) in WC, which simply states that E[(ε_{1,n}/σ_n)^4] = O(1), as n → ∞.⁶ If the errors are independent of the design and q_n → ∞ (as is the case in WC), then (C2) and (A2) are equivalent. However, Condition (A2) is suited to also allow for some amount of dependence between the errors and the design. This dependence is ruled out in WC, because they use results on the asymptotic normality of quadratic forms from Bhansali, Giraitis and Kokoszka [5] that apply only in the independence case (see Lemma 6.1 and the discussion at the beginning of Subsection 6.2). We note that an (8 + κ)-th moment condition like, e.g., sup_n E[(ε_1/σ_n)^{8+κ}] ≤ K, together with max_i e_i = O_P(1), is sufficient for (A2), provided that lim inf_n q_n/n > 0.
The additional requirement of Theorem 2.1, that lim sup_n p_n/n < 1 and q_n → ∞, simply describes the regime of the relative number of parameters and hypotheses we are interested in. The corresponding assumption (C3) in WC, and also parts (i) and (ii) of Theorem 2.1, additionally require that (p_n − q_n)/n → 0. This is a more serious restriction which is convenient in the present strategy of proof to show that the non-centrality term in the F-statistic under the local alternative degenerates to the correct value (cf. Section 6.4.2). This, however, means that asymptotically we are only dealing with hypotheses where almost all of the p parameters are restricted, since q_n/p_n → 1 in this regime. It is therefore important to extend the analysis of the rejection probability of the F-test also to the regime where q_n ≪ p_n in order to assess the different contributions of the overall dimensionality and the multiplicity of hypothesis testing to the asymptotic rejection probability. This is what we do in Theorem 2.1(iii) and in Section 3 in the Gaussian design setting. The requirement that q_n → ∞ as n → ∞ is essential in Theorem 2.1 in order to obtain a Gaussian limit. This assumption is dropped in Theorem 3.1 and Corollary 3.2.

⁴ Notice that this assumption rules out, for example, Gaussian design [see, e.g., 11, Theorem 2.1]. ⁵ This is particularly inconvenient if one is interested in the case where the random vectors z_1, . . . , z_n in (C1) (and (A1).(a)) have independent components. If both z_1 and Σ^{−1/2}(x_1 − µ) = (ΓΓ′)^{−1/2}Γz_1 have independent components and at least two rows of (ΓΓ′)^{−1/2}Γ have only non-zero entries (this can be relaxed even further), then, by the Darmois–Skitovich theorem [cf. 7, Theorem 5.3.1], z_1 must already be Gaussian. ⁶ In WC it is implicitly assumed that lim inf_n σ²_n > 0.
Finally, the assumption in Theorem 2.1 that Δ_γ = o(q_n/n) is rather natural in the case where lim inf_n q_n/n > 0, where it simply reduces to Δ_γ = o(1). In this case, it says that we study the asymptotic rejection probability only in a shrinking neighborhood of the null hypothesis, and we do not need to specify a rate at which Δ_γ approaches zero. Note, however, that part (ii) of Theorem 2.1 actually does require a specific rate of contraction, which is, again, only needed for technical reasons in establishing the asymptotic behavior of the non-centrality term (cf. Section 6.4). The corresponding Assumption (C4) in WC is rather dubious and seems to arise from a miscalculation when dealing with said non-centrality term. In fact, they also need the rate imposed by our Theorem 2.1(ii) and nothing more.⁷ In the case where lim inf_n q_n/n = 0, we need the rate Δ_γ = o(q_n/n) in order to ensure that the mixed term in the expansion of the F-statistic is asymptotically negligible (cf. the discussion following display (6.7)). Note that in the extreme case where q_n = q is fixed, the assumption Δ_γ = o(q_n/n) appears to be somewhat restrictive because it rules out nΔ_γ ≍ 1, as required for non-trivial local asymptotic power (cf. Remark 2.3). In the classical case where q_n = q (and p_n = p) is fixed, the classical approach based on the asymptotic normality of the OLS estimator γ̂_n allows for a much wider range of local alternatives than our present strategy. Of course, the asymptotic normality of the whole vector γ̂_n breaks down if the dimensions p_n and q_n are too large relative to sample size n (see, e.g., Portnoy [20, 21]), which is why we here use a different strategy involving the assumption Δ_γ = o(q_n/n). Judging from the simulations of Section 5 below, it seems as if some bound on Δ_γ that is proportional to q_n is essential for the validity of the normal approximation to the power function, because this approximation turns out to be accurate for a larger range of Δ_γ if q gets larger.

⁷ See the first display on page 146 in WC, where also the matrix X_2X_2′ needs to be standardized. Also, there is a scaling factor of √n missing in that argument, which is necessary to bring the non-centrality term to the same scale as the noise term.

Numerical results
In order to better understand the quality of the theoretical approximations to the power function of the F-test derived above, we conducted an extensive simulation study. We roughly follow the simulation setup of WC, but our focus is more on the role of the number of tested hypotheses q. The linear model we considered was

y_i = β′x_i + ε_i,  i = 1, . . . , n,
where the ε_i were i.i.d. with mean zero and variance one, independent of the design. We tried different error distributions but found little effect on the power function, so we report only the results for N(0, 1) errors and for t(5)/√(5/3) errors.⁸ The p-dimensional design vectors x_i were generated as i.i.d. realizations of a moving average process

x_{ij} = Σ_{t=1}^{T} α_t z_{i(j+t−1)},  j = 1, . . . , p,

where z_i = (z_{i1}, . . . , z_{i(p+T−1)})′ was either a (p + T − 1)-dimensional standard normal vector or generated with i.i.d. Γ(1, 1) − 1 entries⁹ and T = 10. The coefficients α_t were generated as i.i.d. uniform on (0, 1) only once and then held fixed across all the simulations to follow.
We tested only null hypotheses of the form H_0 : β_1 = · · · = β_q = 0 at level ν = 0.05, so that the distribution of the F-statistic depends neither on the mean of the design vectors nor on the intercept parameter, and hence it is no loss of generality that we have omitted both. The signal β was generated such that half of the tested coefficients were equal to one and all the other coefficients were equal to zero. The signal was then scaled appropriately to produce a range of alternatives at which the power function was evaluated numerically.
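A condensed version of this Monte Carlo design might look as follows (our own simplified sketch: Gaussian innovations and errors only, a single arbitrary scaling of the signal instead of a whole range of alternatives, and fewer replications than the 10,000 used for the figures).

```python
# Condensed sketch (ours) of the simulation design of this section: moving-average
# regressors, the F-test of H_0: beta_1 = ... = beta_q = 0 at level 5%, and power
# estimated by Monte Carlo.  Gaussian case only; constants are simplified.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(4)
n, p, q, T, nreps = 100, 60, 4, 10, 2000
a = rng.uniform(0, 1, size=T)                      # MA coefficients, drawn once and held fixed

beta = np.zeros(p); beta[: q // 2] = 1.0           # half of the tested coefficients are nonzero
beta *= 0.05                                       # arbitrary scaling towards a local alternative

crit = f.ppf(0.95, q, n - p - 1)
rej = 0
for _ in range(nreps):
    z = rng.standard_normal((n, p + T - 1))
    X = np.stack([z[:, j:j + T] @ a for j in range(p)], axis=1)   # x_ij = sum_t a_t z_{i,j+t-1}
    y = X @ beta + rng.standard_normal(n)
    U1 = np.column_stack([np.ones(n), X])                          # unrestricted: intercept + all p slopes
    U0 = np.column_stack([np.ones(n), X[:, q:]])                   # restricted: tested slopes set to zero
    Q1, _ = np.linalg.qr(U1); rss1 = np.sum((y - Q1 @ (Q1.T @ y)) ** 2)
    Q0, _ = np.linalg.qr(U0); rss0 = np.sum((y - Q0 @ (Q0.T @ y)) ** 2)
    F_n = ((rss0 - rss1) / q) / (rss1 / (n - p - 1))
    rej += F_n > crit
print("simulated power:", rej / nreps)
```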
Since WC have already extensively studied the effect of the dimensionality p on the power of the F-test, we here focus more on the effect of the number of tested restrictions q. For our first set of Monte Carlo experiments we fixed n = 100 and p = 60 and looked at the cases q = 4 and q = 50. Figure 1 shows the simulated power of the F-test (solid lines) as a function of the scaled distance from the null hypothesis Δ_γ, where for each value of Δ_γ, 10,000 Monte Carlo samples were generated. The left column shows the results for t(5)/√(5/3) distributed errors and a design that was generated from a moving average process with Γ(1, 1) − 1 innovations. The right column was generated with i.i.d. standard Gaussian errors and a Gaussian moving average design. As a first observation we note that the influence of non-Gaussianity on the simulated power (solid lines) is almost negligible at the present sample size of n = 100. Now Figure 1 should be inspected from top to bottom. In the first row we clearly see the gain of power as the number of hypotheses q decreases from q = 50 to q = 4. We also compare the simulated power to the Gaussian approximation from our asymptotic result of Theorem 2.1 (dotted lines). The picture is qualitatively the same as in WC, who considered q = p − 2 and who concluded "that there is a good conformity between the empirical power and the theoretical power of the [...] F-test [...]" [27, p. 142]. It seems hard to evaluate the quality of approximation directly from this picture in absolute terms. Moreover, the Gaussian approximation does not seem to become much more accurate when q increases, contrary to what was suggested by Theorem 2.1. In fact, however, Theorem 2.1 says that we should look at the Gaussian approximation only locally, for values of Δ_γ that are of a smaller order than q/n. Indeed, if we look at the power function in a smaller neighborhood around the null (cf. the second row of Figure 1), we see a much better agreement between the simulated true power and the Gaussian approximation. We also see that when q/n is larger, the approximation is accurate on a larger interval around the null, as predicted by the theory. However, a global Gaussian approximation to the power of the F-test seems to be too much to ask for. Finally, the last row of Figure 1 is identical to the first row, except that we have added the theoretical approximation based on the non-central F-distribution as in Corollary 3.2 (NCF). Compared to the Gaussian approximation, the non-central F approximation is remarkably accurate over the entire range of alternatives considered, not just locally, and for both large and small values of q.
To investigate also the quality of our approximations in small samples, we have repeated the simulations above with n = 30 and p = 20. We present only one instance of this second round of simulations in Figure 2 to discuss the main differences to the case where n = 100. We still find that the non-central F approximation is much better than the Gaussian approximation, but clearly also the quality of the former deteriorates considerably compared to the case n = 100. However, the local behavior of the power function is still picked up quite well even in the small sample scenario. Moreover, it is remarkable how similar the picture with t(5)-errors is to the picture with normal errors already for n = 30. This suggests that the dominant reason for the imperfect approximation by the non-central F -distribution is the randomness of the design rather than the non-Gaussianity of the errors (cf. Remark 3.3).
Finally, as in WC, we also investigate the distribution of the F-statistic under the null hypothesis H_0 : β_1 = · · · = β_q = 0. For different choices of n, p and q, we have generated 10,000 Monte Carlo samples of s_n^{−1/2}(F_n − 1) as before, but with vanishing true signal β = 0, and for t(5)/√(5/3) errors and regressors generated from Γ(1, 1) − 1. To investigate also the impact of a non-symmetric error distribution, we have repeated all the simulations with the distributions of the errors and the design interchanged.¹⁰ The plots in Figure 3 were generated by applying the R-function 'density' [25] to the samples of standardized F-statistics with default settings. As before, the dashed lines correspond to the (appropriately scaled and centered) asymptotic non-central F approximation of Corollary 3.2 and the black dotted line is the standard normal density. The overall picture is the same as for the power function. The non-central F approximation is remarkably accurate even for moderate sample sizes like n = 50. As predicted by the theory, for large n and large q the null distribution is well approximated by the normal, whereas for small q the normal approximation fails. We also note that, again, the approximation accuracy appears to be rather insensitive to changes of the error and design distributions.
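The following sketch (ours) is a stripped-down version of this null-distribution check: it simulates the raw F-statistic under H_0 with the non-Gaussian error and design distributions described above (without the moving-average structure, for brevity) and compares its quantiles to those of the F_{q,n−p−1} reference distribution.

```python
# Stripped-down null-distribution check (ours): the F-statistic under H_0 with
# t(5)/sqrt(5/3) errors and Gamma(1,1) - 1 regressors, compared with F_{q, n-p-1}.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(5)
n, p, q, nreps = 50, 30, 10, 5000
sims = np.empty(nreps)
for b in range(nreps):
    X = rng.gamma(1.0, 1.0, size=(n, p)) - 1.0
    y = rng.standard_t(5, size=n) / np.sqrt(5 / 3)            # beta = 0: the null holds
    U1 = np.column_stack([np.ones(n), X])
    U0 = np.column_stack([np.ones(n), X[:, q:]])
    Q1, _ = np.linalg.qr(U1); rss1 = np.sum((y - Q1 @ (Q1.T @ y)) ** 2)
    Q0, _ = np.linalg.qr(U0); rss0 = np.sum((y - Q0 @ (Q0.T @ y)) ** 2)
    sims[b] = ((rss0 - rss1) / q) / (rss1 / (n - p - 1))

probs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(sims, probs))                # simulated null quantiles
print(f.ppf(probs, q, n - p - 1))              # exact quantiles in the Gaussian case
```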

Proofs of main results
In this section we give a high-level description of the proofs of both Theorem 2.1 and Theorem 3.1. The more technical parts of the argument are collected in the appendices. The following outline section pertains to both proofs and uses only assumptions that are invoked by both theorems. Note that because of compactness and, in particular, lim sup n p n /n < 1, it is no restriction to assume that p n /n and q n /n are convergent sequences with limits ρ 1 ∈ [0, 1) and ρ 2 ∈ [0, ρ 1 ], respectively.

Outline
In addition to the general model assumptions of Section 2, the present outline section 6.1 only uses the conditions Δ_γ = o(q_n/n), P(det(U′U) = 0) = 0 and √n(σ̂²_n/σ²_n − 1) = O_P(1). All of these are satisfied under the assumptions of any part of Theorem 2.1 as well as under the assumptions of Theorem 3.1, in view of Lemma C.1.
The first part of Section 6.1 closely follows the classical approach for the decomposition of the F -statistic as described, e.g., in Rao and Toutenburg [22, Chapter 3.7]. These arguments are kept to a minimum but are included nonetheless to make the notation more intelligible.
Recall the F-statistic F_n as defined in (2.4). For the following preliminary consideration, we work only on the event C_n on which all matrices appearing in (2.4) are invertible, so that F_n is given by the expression in (6.1). Next, if q < p + 1, choose a (p + 1 − q) × (p + 1) matrix R_1 whose rows form an orthonormal basis for the orthogonal complement of the rows of R_0, and set U_0 = UR_0′ and U_1 = UR_1′. Using block matrix inversion, the numerator and the denominator of (6.1) can then be rewritten in terms of U_0 and U_1. Now, by writing W = (I_n − P_{U_1})U_0, on C_n, we can simplify the F-statistic to read

(σ̂²_n/σ²_n)F_n = (ε/σ_n)′P_W(ε/σ_n)/q + 2(ε/σ_n)′Wδ_γ/q + δ_γ′W′Wδ_γ/q.

The above representation remains correct also in the case where q = p + 1, provided that the matrix U_1 is removed wherever it appears, i.e., W = U_0 in this case. This also identifies the correct centering and scaling of F_n on the event C_n. Now, to get rid of the restriction to C_n, define G_n by (6.4) and note that this is well defined everywhere. It is now elementary to verify that we can study the asymptotic behavior of G_n instead of F_n: if s_n^{−1/2}(G_n − 1) − η²_n converges weakly to some limiting distribution L, for an appropriate centering sequence η²_n with η²_n = o(√n), then, on C_n, the same statement holds for F_n, cf. (6.5).

In what follows, we will establish that the first term on the right of the equal sign in (6.4), which we denote by Q_n := s_n^{−1/2}ε′M_nε/σ²_n, satisfies Q_n → L, weakly, for an appropriate limit distribution L. The last summand in (6.4) can be abbreviated to s_n^{−1/2}n∇_n/q (cf. Remark 2.4). It will play the role of a non-centrality term and it will be shown to be asymptotically non-random. Note that if we can also show that s_n^{−1/2}n∇_n/q = o_P(√q), then the mixed term in (6.4) is asymptotically negligible. Indeed, the conditional mean of the latter expression given X is equal to zero, and its conditional variance is equal to s_n^{−1}n∇_n/q² = (s_n^{−1/2}n∇_n/q)(s_n^{−1/2}/q), where s_n^{−1/2}/q = O(q^{−1/2}).

Suppose, for now, that we have already established both the weak convergence of the noise term Q_n in (6.6) and also the convergence in (6.7) of the non-centrality term to √n Δ_γ b_n, where b_n is as in Theorem 2.1. Then, because Δ_γ = o(q/n), we have η²_n = o(√n), as required for the argument in (6.5). It also follows that s_n^{−1/2}n∇_n/q = o_P(√q), so we have asymptotic negligibility of the mixed term in (6.4) by the argument in the previous paragraph. Altogether, we arrive at the desired expansion, which establishes the conclusion of Theorem 2.1 and Theorem 3.1, for an appropriate choice of L, provided that (6.6) and (6.7) hold. For Theorem 2.1, we will prove the weak convergence (6.6) with L = N(0, 1), under the general Assumptions (A1).(a,b,c,d) and (A2), in Section 6.2, and the convergence in (6.7) under each of the sets of assumptions of Theorem 2.1(i), 2.1(ii) and 2.1(iii), respectively (cf. Section 6.4).
For Theorem 3.1, we will prove the convergence (6.6) with L = (2q)^{−1/2}(χ²_q − q) in Section 6.3, and the convergence in (6.7) is established by the same argument as in the case of Theorem 2.1(iii) in Section 6.4, which requires neither q_n → ∞ nor Assumption (A2), so that it goes through also in the setting of Theorem 3.1.

Asymptotic normality of the noise term
This section establishes the weak convergence in (6.6) with L = N (0, 1). For this claim we only use the Assumptions (A1).(a,b,c,d), (A2), as well as p n /n → ρ 1 ∈ [0, 1), q n /n → ρ 2 ∈ [0, ρ 1 ] and q n → ∞. The following lemma is a variation of Theorem 2.1 in Bhansali, Giraitis and Kokoszka [5] on the asymptotic normality of quadratic forms for the case where the matrix and the enclosing vectors may exhibit a certain dependence between each other. Its proof is deferred to Appendix B.
Lemma 6.1. Let (Ω, F, P) be the common probability space on which all the random quantities below are defined. For every n ∈ N, let G n ⊆ F be a subsigma algebra, let A n = (a ij,n ) n i,j=1 be a real random symmetric n × n matrix that is G n measurable and such that A n (ω) = 0, ∀ω ∈ Ω. Let Z 1,n , . . . , Z n,n be real random variables that are conditionally independent, given G n , and such that for i ≤ n, almost surely, E[Z i,n |G n ] = 0, E[Z 2 i,n |G n ] = 1 and E[|Z i,n | 4 |G n ] < ∞. Moreover, assume that, as n → ∞, Then, for Z n = (Z 1,n , . . . , Z n,n ) , we have Remark 6.2. The proof of Lemma 6.1 essentially follows the rationale of Bhansali, Giraitis and Kokoszka [5] with the obvious modification that all the moments of Z i,n have to be replaced by conditional moments. Note that if the Z 1,n , . . . , Z n,n are the first n elements of a sequence of i.i.d. random variables and A n is non-random, as in Bhansali, Giraitis and Kokoszka [5], then the assumptions of Lemma 6.1 reduce to those imposed by Theorem 2.1(iii) in that reference, except for the additional requirement that E[Z 4 1 ] < ∞, as needed here.
By the method of Bhansali, Giraitis and Kokoszka [5] we can not get rid of this additional requirement because their truncation argument does not apply in the case of dependence between A n and Z n .
With Lemma 6.1 at hand, we can proof the asymptotic normality of at least on a set of probability one, and it remains to verify the convergence conditions of Lemma 6.1. For the first one, note that M n 2 S ≤ (1/q + 1/(n − p − 1)) 2 and hence, almost surely, under Assumption (A2). Therefore, the upper bound in the previous display converges to zero in probability. For the second condition, since the diagonal entries of a projection matrix are between 0 and 1, we see that (M 2 n ) jj ≤ 1/q 2 + 1/(n − p − 1) 2 , and thus which converges to zero in probability under Assumption (A2) and q n → ∞.
Establishing the validity of the last condition is slightly more involved. Since M n 2 F is of order 1/q, we have to show that where M n = (m ij ) n i,j=1 . By Assumption (A2), (max i e i ) 4 = O P (1). Now, take expectation and use Hölder's inequality with a, b > 1 such that 1/a + 1/b = 1 to obtain we distinguish between the cases ρ 2 > 0 and ρ 2 = 0. If ρ 2 > 0, we write the diagonal elements of M n = (P U − P U1 )/q − (I n − P U )/(n − p − 1) as and note that (P U ) jj − (p + 1)/n ∈ [−1, 1] and (P U1 ) jj − (p + 1 − q)/n ∈ [−1, 1], in order to get the bound Hence, if we can show that the (P U ) jj , for j = 1, . . . , n, and also the (P U1 ) jj , for j = 1, . . . , n, are identically distributed, then because n/q → 1/ρ 2 . Since a = (1 + κ)/κ is fixed, it then remains to show that |(P U ) 11 − (p + 1)/n| → 0 and |(P U1 ) 11 − (p + 1 − q)/n| → 0, in probability. The desired properties of the diagonal entries of P U and P U1 are now established by the following lemma, which applies under the Assumptions (A1). (a,b,d), and whose proof is deferred to Appendix B.  (1), as n → ∞, for every symmetric m n × m n matrix M n . Let R n be a nonrandom (p n + 1) × k n matrix such that rank R n = k n ≤ p n + 1 and define X n = [x 1,n , . . . , x n,n ] and W n = [ι, X n ]R n , where ι = (1, . . . , 1) ∈ R n . Furthermore, let h 1,n , . . . , h n,n denote the diagonal entries of the projection matrix P Wn . Then, the (h j,n ) n j=1 are exchangeable random variables and in probability. Altogether, we see that in the case ρ 2 > 0, (6.9) holds and the weak convergence follows, as required in (6.6).
To treat the case ρ 2 = 0, we recall from Section 6.1 that for j = 1, . . . , n, are exchangeable random variables, because x 1 , . . . , x n are i.i.d. and U U = n j=1 (1, x j ) (1, x j ) is a function in x 1 , . . . , x n that is invariant under permutations of its arguments. Therefore, By boundedness of the diagonal entries of a projection matrix, it remains to show that H n := n/q((P W ) 11 − q/n) → 0, in probability, as n → ∞. By exchangeability, and Assumption (A1 ] − q/n, which implies that for n large (such that q/n < ε), εP(|H n | > ε) ≤ εP(H n > ε)+εP(H n < −ε) ≤ E[H n 1 {Hn>ε} ]+0 ≤ q/n. This establishes the asymptotic normality required in (6.6) also in the case ρ 2 = 0. n /σ 2 n → q/2, in probability. Since we do not test the intercept parameter α, we can write R 0 = [0, T 0 ], for some q × p matrix T 0 , and If q = p, then T 0 = I p and R 1 = (1, 0, . . . , 0). With this notation, we get U = [ι, X], U 0 = U R 0 = XT 0 and U 1 = U R 1 = [ι, XT 1 ]. From this we see that the distribution of W = (I n − P U1 )U 0 = (I n − P [ι,(In−Pι)XT 1 ] )XT 0 does not depend on µ, and without loss of generality we may assume that µ = 0. Moreover, by standard properties of orthogonal projections, Now, abbreviate A = (I n − P XT 1 )XT 0 and B = P (In−P X )ι − P (In−P XT 1 )ι , and note that P A is uniformly distributed on the Grassmann manifold of n × n projection matrices of rank q, because for an n × n orthogonal matrix O we have OX Moreover, trace(B) = 0, almost surely, and thus E[ε Bε/σ 2 n |X] = 0 and by standard calculations using independence (cf. the proof of Lemma C.1) and the fact that P X = P XT = P [XT 0 ,XT 1 ] , In view of Lemma C.7 with µ = 0 and using the first two moments of the χ 2 distribution, we see that the expressions ι (I n − P XT )ι/n and ι (I n − P XT 1 )ι/n both converge to 1 − ρ 1 , in probability, and thus the whole expression on the last line of the previous display converges to zero in probability. Altogether, we see that Q n = (s n q 2 ) −1/2 ε P A ε σ 2 n − q/2 + o P (1).
Because P A is uniformly distributed on the Grassmann manifold, it can be stochastically represented as P A ∼ CC , where C is a random n × q matrix that is uniformly distributed on the Stiefel manifold of order n × q, i.e., V has orthonormal columns and its distribution is both left and right invariant under the action of the appropriate orthogonal group. Since C and ε are independent, the so called Diaconis-Freedman effect as described in Dümbgen and Conte-Zerial [9, Theorem 2.1] entails that the conditional distribution of C ε/σ n given C, converges weakly in probability to a q-dimensional standard normal distribution because ε/σ n 2 /n → 1 in probability and ε ε/(nσ 2 n ) → 0 in probability, whereε is an independent copy of ε. Weak convergence in probability implies convergence of the conditional characteristic functions of C ε/σ n given C, in probability, which, by boundedness, implies convergence of the unconditional characteristic functions. Consequently, we obtain the weak convergence This establishes (6.6) with L as claimed.

Asymptotic behavior of the non-centrality term
Finally, we have to establish the convergence in (6.7) in the three cases of Theorem 2.1(i), 2.1(ii) and 2.1(iii) as well as under the assumptions of Theorem 3.1. We begin by a representation of s −1/2 n n∇ n /q that pertains to all of these cases. In this preliminary consideration we only use Assumpitons (A1).(a) and (b) which are assumed to hold in each of the cases under investigation. Recall the conventions and definitions of Section 6.1, in particular, x i /n = X ι/n andΣ n = X X/n −μ nμ n = X (I n − P ι )X/n and partition the (p + 1) × (p + 1) orthogonal matrix T as where t 0 ∈ R q and T 0 ∈ R q×p . Since Notice that the last two cases are not mutually exclusive, but the case t 0 = 0 is a sub-case of the case q ≤ p. The representation of ∇ n in the case t 0 = 0 will come in handy. With the notation Ω = Ω 00 Ω 01 if t 0 = 0. (6.11)

The case of Theorem 2.1(ii)
For part (ii) we only need to consider the case where q ≤ p, as the case q = p + 1 has already been treated above (simply restrict to the subsequence n such that q n ≤ p n ). We establish the convergence in (6.7) for b n = n/q(s n q) −1/2 rather than b n as in the Theorem. It should be clear, however, that this is no restriction.
[Indeed, if (6.7) holds with b n = n/q(s n q) −1/2 andb n = b n (n − p − 1 + q)/n is as in the theorem, then √ n∆ γ b n − √ n∆ γbn = (b n −b n ) √ n∆ γ = ((p + 1)/n − q/n)b n √ nO( √ q/n), by the additional assumption of Theorem 2.1(ii). Since in the present case (p − q)/n → 0, the previous expression converges to zero as n → ∞.] Now, since q ≤ p, the quantity of interest reads The first term on the right-hand-side has already been studied in (6.12), and the same argument applies, except that now δ γ Ω 00 δ γ = ∆ γ , in general. But b n δ γ Ω 00 δ γ = O(1) n/qδ γ R 0 SR 0 δ γ = O(n −1/2 ) → 0 by the additional assumption of Theorem 2.1(ii). For the remaining term in (6.13), as in WC, we begin by approximating U 1 U 1 /n by Ω 11 . This can only be successful because here we are dealing with a sample covariance matrix of dimension p + 1 − q, based on n independent observations and we assume that (p + 1 − q)/n → 0. We abbreviatẽ U 0 = U 0 Ω −1/2 00 and consider the absolute difference Now, Lemma C.3(iii) with k n = p + 1 − q shows that Ũ 1Ũ1 /n − I p+1−q S → 0 in probability, since k n /n → 0, and it also establishes the boundedness in probability of Ũ 1Ũ 1 /n S Ũ 0Ũ0 /n S . The assumptions of this lemma are clearly satisfied under (A1).(a,c,d). Now the convergence in spectral norm implies the convergence of the extreme eigenvalues ofŨ 1Ũ1 /n to 1, and thus, also the extreme eigenvalues of the inverse converge to 1, which means that (Ũ 1Ũ1 /n) −1 − I p+1−q S → 0, in probability. Since b n √ nδ γ Ω 00 δ γ = O(1), by the additional assumption of Theorem 2.1(ii), the upper bound in (6.14) converges to zero in probability. Thus, we have shown that we can replace U 1 U 1 /n in (6.13) by Ω 11 , without changing the limit.
In order to show that the distribution of B n also concentrates around its mean, we make use of the Efron-Stein inequality. 11 We use the abbreviations x j ) and define the functions g : R p×n → R and g k : R p×(n−1) → R, for k = 1, . . . , n, by By the Efron-Stein inequality [18,Theorem 9], Now, for k ∈ {1, . . . , n}, g(x 1 , . . . , x n ) can be expressed as Using the fact that Q ij = Q ij = Q ji , the differences g − g k in (6.15), are equal to We need to bound the expectation of the squared expression. To this end, we calculate the expectation of L i Q ik L k L j Q jk L k for arbitrary indices i, j, k, as well as in the special case where i = k, j = k and i = j. Observe that Q ik is the inner product of Ω −1/2 11 x k ) and therefore, by Cauchy-Schwarz inequalities in both Euclidean and L p space, satisfies 4 11 ]. Moreover, parts (i) and (v) of Lemma C.3, whose assumptions are implied by the conditions (A1).(a,d,e), establish the facts E[L 4 1 ] = O(|δ γ Ω 00 δ γ | 2 ), E[L 8 1 ] = O(|δ γ Ω 00 δ γ | 4 ) and E[Q 4 11 ] = O(|p + 1 − q| 4 ), provided that at least one of the assumptions (a) or (b) of Lemma C.3(v) holds. But this follows from Lemma C.4, because if t 0 = 0, then from the representations of ∇ n and ∆ γ in (6.10) and (6.11), we see that the distribution of the quantity of interest does not depend on µ and we may restrict to µ = 0, whereas, if t 0 = 0, Lemma C.4 shows that T 1 has full rank. 12 With this, in general, we obtain If i = k, j = k and i = j, using the abbreviations v = R 0 δ γ and M = R 1 Ω −1 11 R 1 , we get the smaller bound where we have used Lemma C.3 and δ γ Ω 01 Ω −1 11 Ω 10 δ γ ≤ δ γ Ω 00 δ γ again. It is now easy to bound the expectations in (6.15). When squaring the expression in (6.16) we first note the leading factor n −3 . Next, we expand the square of the bracket term in (6.16) and take expectation. From the previous considerations we see that those summands in the resulting sum involving L k Q kk L k are of order O(q|p + 1 − q| 2 /n 2 ), and there are O(n) of them. Together with the leading factor n −3 and the summation in (6.15) we arrive at a total contribution of O(q|p+1−q| 2 /n 3 ) from all those summands involving L k Q kk L k . This expression has to be multiplied by b 2 n = O(n/q) to yield O(|p + 1 − q| 2 /n 2 ) = o(1). The remaining terms are of the form |E[L i Q ik L k L j Q jk L k ]| with i = k and j = k. Of those, there are a number of O(n) summands where i = j, but they are again of order O(q|p + 1 − q| 2 /n 2 ) and therefore, as in the case before, their total contribution to (6.15) is asymptotically negligible, even after multiplying by b 2 n . Finally, there is a number of O(n 2 ) remaining summands as above, but with i = k, j = k and i = j. Therefore, by the refined bound above, they are of order O(q/n 2 ), so that their total contribution to the variance bound in (6.15) is O(q/n 2 ). Together with the factor b 2 n we arrive at an additional term of order O(1/n) = o(1). Hence, we see that the variance of b n B n goes to zero as n → ∞ and the proof of Theorem 2.1(ii) is finished.

Acknowledgements
for every l = 1, . . . , r. (ii) If 2 ≤ p ≤ n − 2 and Z also satisfies P(‖Z‖ = 0) = 0, then the random vectors x_1, . . . , x_n satisfy Assumptions (A1).(a,b). Moreover, if also
Proof. We make use of the well-known fact that any spherical distribution can be represented as Z = b‖Z‖, where b and ‖Z‖ are independent and b is uniformly distributed on the unit m-sphere S^{m−1} [cf. 8].
For part (i), set ℓ = ℓ_1 + · · · + ℓ_m, let e_i ∈ R^m denote the i-th element of the standard basis in R^m, and note that, by this representation and the independence of b and ‖Z‖, E[∏_{j=1}^m (e_j′Z)^{ℓ_j}] = E[‖Z‖^ℓ] E[∏_{j=1}^m (e_j′b)^{ℓ_j}]. (Due to symmetry, we always have E[Z_1^l] = 0 = E[V_1^l] if l is odd and the former moment exists.) Of course, the same argument can be carried through for the spherical vector
V ∼ N(0, I_m), so that the analogous identity holds for V, provided that all the ℓ_j are even, so that E[∏_{j=1}^m V_j^{ℓ_j}] ≠ 0. Now choose the ℓ_j to be either equal to 2 or 0, such that ℓ is any even number from 2 to 2r. Therefore, since E[Z_1^2] = 1 = E[V_1^2] and by our factorization assumption, the left-hand side of (A.1) is equal to one, so that we have established the equality of the even moments of Z and V. To see that also the even moments of Z_1 and V_1 coincide, simply choose ℓ_1 = ℓ = 2l, for some l ∈ {1, . . . , r}, and ℓ_j = 0 if j ≠ 1.

Note that P(B) = 1 and v is Borel measurable. Since p + 2 ≤ n, we see that A ∩ B is a subset of the event where both ι_p = M_p v(M_p) and 1 = x̄_{p+2}′v(M_p), the probability of which is clearly bounded by the corresponding conditional probability. But the conditional probability in the previous display is equal to zero, almost surely, because v(M_p), x̄_{p+2}/‖x̄_{p+2}‖ and ‖x̄_{p+2}‖ are independent and x̄_{p+2}/‖x̄_{p+2}‖ is uniformly distributed on the unit sphere in R^p, and therefore its inner product with any fixed non-zero vector has a Lebesgue density on R.

One easily calculates E[(V′MV)^2] = (trace(M))^2 + 2 trace(M^2). Therefore, the required bound follows. Finally, for a projection matrix P ∈ R^{m×m}, V′PV follows a χ^2-distribution with rank(P) = ‖P‖_F^2 degrees of freedom, and thus its moments are readily bounded. To establish part (iii), we first verify the conditions of part (ii).

Proof. To establish part (i), we use the results of Whittle [28]. Theorem 2 in that reference shows that, for a unit vector v = (v_1, . . . , v_m)′ ∈ R^m, a moment bound holds with some numerical constant C > 0, and thus the first condition follows. Next, for a symmetric matrix M ∈ R^{m×m}, the same theorem yields an analogous bound and, for a projection matrix P ∈ R^{m×m}, a chain of inequalities in which the first inequality is the reverse triangle inequality for the L_4-norm. Now the previous chain of inequalities implies the assertion, since E[Z′PZ] = trace(P) = trace(P^2) = ‖P‖_F^2 = rank(P) is an integer, and where D > 0 is an appropriate constant, not depending on m. The validity of Assumption (A1).(c) follows from the arguments in Section 1.4 in Srivastava and Vershynin [23].
For part (ii), simply note that under the factorization assumption in (C1) all the moments occurring in Conditions (A1).(d,e) are identical to those calculated under independence of the components of Z. Therefore, the result follows from part (i).
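The arguments above repeatedly rely on elementary facts about Gaussian quadratic forms: E[V′MV] = trace(M), E[(V′MV)^2] = (trace(M))^2 + 2 trace(M^2), and V′PV ∼ χ^2 with rank(P) degrees of freedom for a projection P. A quick Monte Carlo check, with an arbitrarily chosen symmetric M and projection P (illustration only):

```python
# Monte Carlo check (illustration only) of the Gaussian quadratic-form identities used above.
import numpy as np

rng = np.random.default_rng(2)
m, reps = 6, 200_000

A = rng.standard_normal((m, m))
M = (A + A.T) / 2                       # an arbitrary symmetric matrix
Q, _ = np.linalg.qr(rng.standard_normal((m, 3)))
P = Q @ Q.T                             # an orthogonal projection of rank 3

V = rng.standard_normal((reps, m))
qM = np.einsum('ij,jk,ik->i', V, M, V)  # V'MV for each Gaussian draw
qP = np.einsum('ij,jk,ik->i', V, P, V)  # V'PV for each Gaussian draw

print(qM.mean(), np.trace(M))                                   # E[V'MV] = trace(M)
print((qM ** 2).mean(), np.trace(M) ** 2 + 2 * np.trace(M @ M)) # second-moment identity
print(qP.mean(), qP.var())                                      # chi^2_3: mean 3, variance 6
```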
Appendix B: Proofs of auxiliary results of Section 6.2

Proof of Lemma 6.1. For ease of notation we drop the subscript n that indexes the position of the matrix A_n in the array, i.e., we write A = A_n and denote by a_ij the ij-th entry of that matrix. Similarly, we write Z = (Z_1, . . . , Z_n)′ and adopt the convention that empty sums are equal to zero. We treat the convergence of T*_n and the weak convergence of T̄_n separately. The desired convergence of T*_n follows from a straightforward calculation and by assumption.
To see the weak convergence of T̄_n, for j = 1, . . . , n, define F_{n,0} = G_n and F_{n,j} = σ(G_n, Z_i : i ≤ j), by which we mean the smallest sigma algebra for which Z_1, . . . , Z_j are measurable and which also contains G_n. Note that for each n, j ∈ N, F_{n,j−1} ⊆ F_{n,j} ⊆ F, ‖A_n‖_F^2 = trace(A^2) is F_{n,0}-measurable and V_{n,j} is F_{n,j}-measurable. Moreover, expanding the squared sum shows that the absolute difference |T̃_n − 1| can be bounded accordingly. To establish the convergence in (B.2), it remains to show convergence to zero in probability of the terms in absolute values on the last line of the preceding display, multiplied by ‖A‖_F^{−2}. First, note that ‖A‖_F^{−2} ∑_{j=1}^n a_jj^2 converges to zero in probability by assumption and because E[Z_j^4 | G_n] ≥ 1. Now, write T_{n,1} = ∑_{i<j} (Z_i^2 − 1) a_ij^2 and T_{n,2} = √2 ∑_{j=1}^n ∑_{i<k≤j−1} Z_i Z_k a_ij a_kj. If we define the triangular truncation Ã of the symmetric matrix A by Ã = ∑_{s>t} e_s a_st e_t′, where e_s ∈ R^n is the s-th element of the standard basis in R^n, we see that T_{n,2} can be controlled in terms of Ã. Now, convergence to zero of ‖A‖_F^{−2}(|T_{n,1}| + |T_{n,2}|) in probability follows from the above considerations and Lemma 2.1 in Bhansali, Giraitis and Kokoszka [5], which yields the corresponding inequality with a global constant C > 0, not depending on n. Thus, by assumption, the bound on the far right-hand side of the preceding display converges to zero, in probability, which establishes the convergence in (B.2). Finally, for (B.3) we abbreviate m_n = max_j E[Z_j^4 | G_n] and use the upper bound V_{n,j}^2 1{|V_{n,j}| > δ} ≤ δ^{−2} V_{n,j}^4. Using this bound, together with m_n ≥ 1 and our assumption, the resulting upper bound converges to zero in probability.
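The decomposition above writes the off-diagonal part of the quadratic form Z′AZ as a sum of increments built from the strictly lower-triangular truncation Ã of A. The following small numpy sketch only illustrates this bookkeeping on an arbitrary symmetric matrix; the scaling 2 Z_j ∑_{i<j} a_ij Z_i below is illustrative and may differ from the exact normalization of V_{n,j} in the lemma.

```python
# Small numerical sketch (not from the paper) of the bookkeeping behind the martingale
# decomposition: the off-diagonal part of Z'AZ equals a sum of increments built from the
# strictly lower-triangular truncation of A.
import numpy as np

rng = np.random.default_rng(3)
n = 8
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                                   # a symmetric matrix, as in Lemma 6.1
Z = rng.standard_normal(n)

A_tri = np.tril(A, k=-1)                            # triangular truncation: entries a_st with s > t
offdiag = Z @ A @ Z - Z @ np.diag(np.diag(A)) @ Z   # sum over i != j of Z_i a_ij Z_j
increments = [2 * Z[j] * (A_tri[j, :j] @ Z[:j]) for j in range(n)]

print(offdiag, sum(increments))                     # the two numbers coincide
```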
Proof of Lemma 6.3. For convenience, we drop the subscript n that indexes the position in the array whenever there is no risk of confusion. Let w_i′ = (1, x_i′)R′ denote the i-th row of the matrix W and define w̃_i = Ω^{−1/2} w_i, where Ω is positive definite and U^{−1} is defined as in Assumption (A1).(b). This assumption also entails that W′W, W̃′W̃ and S_1 are invertible with probability one, where we denote the corresponding null set by N. For convenience, we redefine these quantities in an arbitrary invertible and measurable way on N. Moreover, we must also have p_n + 2 ≤ n under (A1).(b).
Since h_j = h_{j,n} = w_j′(W′W)^{−1}w_j on N^c, permuting h_1, . . . , h_n is equivalent to a permutation of w_1, . . . , w_n, which are i.i.d., and therefore their joint distribution is invariant under permutations. Hence, the h_j are exchangeable random variables. In particular, the h_j are identically distributed and therefore the fact that ∑_{j=1}^n h_j = trace(P_W) = k_n, on N^c, entails that E[h_1] = k_n/n. We also record the expression for Var[h_1] for later use. It only remains to show that this variance actually converges to zero.
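The identity behind E[h_1] = k_n/n is simply that the leverages h_j are the diagonal entries of the projection P_W and therefore sum to its trace k_n. A minimal numerical check with an arbitrary Gaussian stand-in design (illustration only):

```python
# Minimal check (illustration only) of the trace identity behind E[h_1] = k_n/n: the
# leverages h_j = w_j'(W'W)^{-1} w_j are the diagonal of the projection P_W and sum to k_n.
import numpy as np

rng = np.random.default_rng(4)
n, k_n = 50, 7
W = rng.standard_normal((n, k_n))            # arbitrary full-rank n x k_n design

H = W @ np.linalg.solve(W.T @ W, W.T)        # the projection P_W onto the column space of W
h = np.diag(H)                               # leverages h_1, ..., h_n

print(h.sum(), k_n)                          # trace(P_W) = k_n
print(h.mean(), k_n / n)                     # average leverage, the value of E[h_1]
```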
Finally, consider ∆_n := |h_1 − k_n/n| with arbitrary t_n = k_n/n ∈ [0, 1]. Suppose that c := lim sup E[∆_n] > 0. Then there exists a subsequence n′, such that E[∆_{n′}] → c, as n′ → ∞. By compactness, there exists a further subsequence n″, such that t_{n″} → t ∈ [0, 1], as n″ → ∞. But in this case, our previous arguments have shown that ∆_{n″} → 0 in probability, which also entails that E[∆_{n″}] → 0, by boundedness, contradicting the assertion that E[∆_{n″}] → c > 0.
Remark. Lemma C.2 is an asymptotic version of the well known fact that a random variable h that satisfies h ≥ t ∈ R and E[h] = t must be equal to t, almost surely.
Proof of Lemma C.2. Fix δ > 0, choose α = α(δ) > 0 such that |Ψ(α) − t| < δ/2 and apply the following standard bound.

Lemma C.3. For every n ∈ N, let x_{1,n}, . . . , x_{n,n} be i.i.d. random p_n-vectors satisfying x_{i,n} = µ_n + Γ_n z_{i,n} as in Assumption (A1).(a) with positive semidefinite covariance matrix Σ_n = Γ_n Γ_n′. Set X_n = [x_{1,n}, . . . , x_{n,n}]′ and Σ̃_n = X_n′(I_n − P_ι)X_n/n. Moreover, let R_n be a k_n × (p_n + 1) matrix such that R_n R_n′ = I_{k_n} (i.e., k_n ≤ p_n + 1) and set Ω_n = R_n S_n R_n′. If the corresponding moment is O(1), as n → ∞, for every symmetric matrix M ∈ R^{m_n × m_n}, then we have the analogous bound and (E[|z_{1,n}′ P z_{1,n}|^4])^{1/4} = O(‖P‖_F^2), as n → ∞, for every projection matrix P in R^{m_n}. Partition R_n = [t_1, T_1] with t_1 ∈ R^{k_n}. If for every n ∈ N either one of (a) µ_n = 0, or (b) rank T_1 = k_n holds, then the conclusion in (C.6) holds.

Proof. For ease of notation we will drop the subscript n whenever there is no risk of confusion. A simple calculation involving the elementary inequality |a + b|^ℓ ≤ 2^{ℓ−1}(|a|^ℓ + |b|^ℓ) and the notation u_n = u = (u_0, u_{−1}′)′, with u_{−1} ∈ R^{p_n}, yields an upper bound in which w = Γ′u_{−1}/‖Γ′u_{−1}‖, if ‖Γ′u_{−1}‖ > 0, and w = 0, else. In the sum u′(1, µ′)′(1, µ′)u + u_{−1}′Σ u_{−1} = u′S_n u both summands are non-negative and thus both summands are bounded by u′S_n u. Therefore, the upper bound in the previous display is itself bounded by a constant multiple of |u′S_n u|^{ℓ/2}. This was the claim of part (i).

For part (ii), first note that because the distribution of Σ̃_n does not depend on µ, we may assume that µ = 0, without loss of generality. By the same argument as above, but with µ = 0, u_0 = 0 and u_{−1} equal to either v_{n,1} or v_{n,2}, we see that E[|v_{n,s}′ x_1|^4] = O(|v_{n,s}′ Σ v_{n,s}|^2), s = 1, 2. Now the quantity of interest is O(v_{n,1}′Σ v_{n,1} · v_{n,2}′Σ v_{n,2}) + (2/n^2) E[|v_{n,1}′ x_1|^2 |v_{n,2}′ x_1|^2], which is of order O(v_{n,1}′Σ v_{n,1} · v_{n,2}′Σ v_{n,2}) because E[|v_{n,s}′ x_1|^2] = v_{n,s}′ Σ v_{n,s}.

For parts (iii), (iv) and (v) we make the following preliminary considerations. First, note that in all three of these statements Σ is assumed to be positive definite and thus Ω is regular. Abbreviate μ̃ := Ω^{−1/2} R(1, µ′)′ and write Ω^{−1/2} Σ_W Ω^{−1/2} in terms of an orthogonal matrix A whose first column is μ̃/‖μ̃‖ if ‖μ̃‖ > 0, and A = I_{k_n} if μ̃ = 0. Here, quantities of dimension k_n − 1 have to be removed in case k_n = 1. The matrix Ω^{−1/2} Σ_W Ω^{−1/2} is positive semidefinite, which means that 0 ≤ ‖μ̃‖ ≤ 1. For later use, we partition the matrix B := Ω^{−1/2} A as B = [b_1, B_1], where b_1 ∈ R^{k_n}, and note that B satisfies (C.2) and (C.3). This finishes the preliminary considerations.

Now, for the proof of part (iii), write the quantity of interest in partitioned form. For a partitioned matrix as above, the spectral norm is bounded by the spectral norms of its blocks (a numerical illustration of one such blockwise bound is given below). Therefore, it suffices to show that the norms of the respective blocks are O_P(1), if k_n = O(n), and converge to zero in probability, if k_n/n → 0. We begin with the terms involving μ̂ in (C.5). First, we use E[μ̂μ̂′] = Σ/n + µµ′ together with (C.2) and (C.3). Moreover, we need to bound the variance. To work out the combinatorics of the quadruple sum above, we distinguish the summands according to whether i = j and r = s. Since E[(b_1′R(1, x_1′)′)^2] = b_1′ R S R′ b_1 = b_1′ Ω b_1 = 1, by definition of B, we see that the quadruple sum in the second-to-last display is of order O(n^3). The remaining sum in the same display is of order O(n), since E[(b_1′R(1, x_1′)′)^4] = O(1), by part (i) and the assumption sup_{‖w‖=1} E[|w′z_1|^4] = O(1). Thus, we have shown that b_1′R(1, μ̂′)′(1, μ̂′)R′ b_1 − ‖μ̃‖^2 → 0, in probability.
The first factor in the upper bound was just shown to be O_P(1). For the second factor a corresponding bound holds as well, and we see that the spectral norm in (C.5) is O_P(1) if k_n = O(n), and converges to zero in probability if k_n/n → 0.
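The blockwise treatment of (C.4) and (C.5) uses the fact that the spectral norm of a symmetric partitioned matrix is controlled by the spectral norms of its blocks. The inequality checked below is one standard bound of this type, not necessarily the exact display from the proof:

```python
# Numerical illustration (not necessarily the paper's exact display) of bounding the spectral
# norm of a symmetric partitioned matrix by the spectral norms of its blocks:
#   || [[A, B], [B', C]] ||_S  <=  ||A||_S + 2 ||B||_S + ||C||_S.
import numpy as np

rng = np.random.default_rng(5)
spec = lambda M: np.linalg.norm(M, 2)

k1, k2 = 4, 3
A = rng.standard_normal((k1, k1)); A = (A + A.T) / 2
C = rng.standard_normal((k2, k2)); C = (C + C.T) / 2
B = rng.standard_normal((k1, k2))

M = np.block([[A, B], [B.T, C]])
print(spec(M), spec(A) + 2 * spec(B) + spec(C))     # left-hand side never exceeds the right
```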
For the spectral norm in (C.4), we may restrict to µ = 0. First, write R = [t_1, T_1] with t_1 ∈ R^{k_n} and use (C.2) to decompose the quantity of interest blockwise. For the off-diagonal block B_1′Σ̂_W b_1, note that it has mean zero in view of (C.2). Therefore, with e_1, . . . , e_{k_n−1} denoting the standard basis in R^{k_n−1}, it suffices to control its components e_j′B_1′Σ̂_W b_1. Now, Var[e_j′B_1′Σ̂_W b_1] = Var[e_j′B_1′T_1 Σ̃_n T_1′ b_1], and part (ii) applies with v_{n,1} = T_1′B_1 e_j/n^{1/4} and v_{n,2} = T_1′b_1/n^{1/4}, which satisfy v_{n,1}′Σ v_{n,1} = e_j′B_1′Σ_W B_1 e_j/√n = 1/√n and v_{n,2}′Σ v_{n,2} = b_1′Σ_W b_1/√n. Hence, the only remaining term is ‖B_1′Σ̂_W B_1 − I_{k_n−1}‖_S ≤ ‖(1/n) ∑_{i=1}^n B_1′T_1 x_i x_i′ T_1′ B_1 − I_{k_n−1}‖_S + ‖B_1′T_1 μ̂ μ̂′ T_1′ B_1‖_S. For the second term in the upper bound, one easily finds its expected value to be (k_n − 1)/n, as in the previous paragraph. For the spectral norm of the remaining covariance term we verify the strong regularity (SR) condition of Srivastava and Vershynin [23, Theorem 1.1] for the random (k_n − 1)-vectors x̃_i = B_1′T_1 x_i = B_1′T_1 Γ z_i. First, note that the x̃_i are independent and isotropic, since µ = 0 and E[x̃_i x̃_i′] = B_1′T_1 Σ T_1′ B_1 = B_1′Σ_W B_1 = I_{k_n−1}. Fix a projection matrix P in R^{k_n−1} and note that Γ′T_1′B_1 P B_1′T_1 Γ is a projection matrix in R^{m_n} of the same rank as P. Since the z_i satisfy Assumption (A1).(c) and ‖P x̃_1‖^2 = ‖P B_1′T_1 Γ z_1‖^2 = z_1′Γ′T_1′B_1 P B_1′T_1 Γ z_1 = ‖Γ′T_1′B_1 P B_1′T_1 Γ z_1‖^2, we see that the (SR) condition holds for x̃_1 with the same constants c, C as in (A1).(c). Therefore, Corollary 1.4 of Srivastava and Vershynin [23] shows that this spectral norm is O_P(1) if k_n = O(n), and converges to zero, in probability, if k_n/n → 0. This finishes part (iii).
For part (v) we begin with the expectation in (C.6) with ℓ = 4, which can be written out explicitly. For E[|z_1′M z_1|^4], we begin with case (a), µ = 0. Then we denote the matrix corresponding to the quadratic form in the vector (0, z_1′)′ on the right-hand side of this display by P. Clearly, P is a projection matrix, which we partition accordingly. Finally, in case (b), where rank T_1 = k_n, the matrix T_1 Σ T_1′ in the representation Ω = RSR′ = T_1 Σ T_1′ + R(1, µ′)′(1, µ′)R′ is regular, and thus we can invert Ω by the Sherman–Morrison formula. Therefore, we make use of the abbreviations P = Γ′T_1′(T_1 Σ T_1′)^{−1} T_1 Γ and v = Γ′T_1′(T_1 Σ T_1′)^{−1} R(1, µ′)′ to bound the fourth moment of the quadratic form. Since ‖P‖_F^8 = (trace(P))^4 = k_n^4, the upper bound is of order O(k_n^4), which finishes the proof of part (v).
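In case (b) the matrix Ω is inverted via the Sherman–Morrison formula, since Ω is a regular matrix plus a rank-one term. A minimal numerical check of the formula with arbitrary stand-ins for T_1ΣT_1′ and R(1, µ′)′ (illustration only):

```python
# Minimal numerical check (illustration only) of the Sherman-Morrison formula used to invert
# Omega = T_1 Sigma T_1' + R(1, mu')'(1, mu')R', i.e. a regular matrix plus a rank-one term:
#   (A + u u')^{-1} = A^{-1} - (A^{-1} u u' A^{-1}) / (1 + u' A^{-1} u).
# The matrices below are arbitrary stand-ins for T_1 Sigma T_1' and R(1, mu')'.
import numpy as np

rng = np.random.default_rng(6)
k = 5
G = rng.standard_normal((k, k))
A = G @ G.T + np.eye(k)                 # stand-in for the regular matrix T_1 Sigma T_1'
u = rng.standard_normal(k)              # stand-in for the rank-one factor R(1, mu')'

Ainv = np.linalg.inv(A)
sm = Ainv - np.outer(Ainv @ u, u @ Ainv) / (1 + u @ Ainv @ u)

print(np.allclose(sm, np.linalg.inv(A + np.outer(u, u))))   # True
```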