Testing for subsphericity when $n$ and $p$ are of different asymptotic order

We extend a classical test of subsphericity, based on the first two moments of the eigenvalues of the sample covariance matrix, to the high-dimensional regime where the signal eigenvalues of the covariance matrix diverge to infinity and either $p/n \rightarrow 0$ or $p/n \rightarrow \infty$. In the latter case we further require that the divergence of the eigenvalues is suitably fast in a specific sense. Our work can be seen to complement that of Schott (2006), who established equivalent results in the case $p/n \rightarrow \gamma \in (0, \infty)$. As our second main contribution, we use the test to derive a consistent estimator for the latent dimension of the model. Simulations and a real data example are used to demonstrate the results, also providing evidence that the test might be further extendable to a wider asymptotic regime.


Introduction
The objective of principal component analysis (PCA), and dimension reduction in general, is to extract a low-dimensional signal from noise-corrupted observed data. The most basic statistical model for the problem is as follows. Assume that $S_n$ is the sample covariance matrix of a random sample from a $p$-variate normal distribution whose covariance matrix has the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_d > \sigma^2 = \cdots = \sigma^2$, exhibiting a "spiked" structure. The data can thus be seen to be generated by contaminating a random sample residing in a $d$-dimensional subspace with independent normal noise having the covariance matrix $\sigma^2 I_p$. This signal subspace can be straightforwardly estimated with PCA as long as one knows its dimension $d$, which is, however, usually unknown in practice. Numerous procedures for determining the dimension have been proposed; see Jolliffe (2002) for a review and, e.g., Schott (2006); Nordhausen et al. (2016); Virta and Nordhausen (2019) for asymptotic tests and Beran and Srivastava (1985); Dray (2008); Luo and Li (2016) for bootstrap- and permutation-based techniques. The simplest of these methods is perhaps the test of subsphericity based on the test statistics
$$T_{n,j} = \frac{m_{2,p-j}(S_n)}{m_{1,p-j}(S_n)^2} - 1, \quad j = 0, \ldots, p - 1, \qquad (1)$$
where $m_{\ell,r}(A)$ denotes the $\ell$th sample moment of the last $r$ eigenvalues of the symmetric matrix $A$. Under the null hypothesis $H_{0k}: d = k$ that the signal dimension equals $k$, the statistic $T_{n,k}$ has (after suitable scaling) a chi-squared limiting null distribution as $n \rightarrow \infty$; see, e.g., Schott (2006). Hence, the dimension $d$ can in practice be determined by testing the sequence of null hypotheses $H_{00}, H_{01}, \ldots$ and taking the estimate of $d$ to be the smallest $k$ for which $H_{0k}$ is not rejected. By examining the power of the tests, Nordhausen et al. (2016) concluded that this procedure yields a consistent estimate of $d$ (with a suitable choice of test levels).
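As a numerical sketch of the statistic in (1) (numpy assumed; the function name is mine, not the paper's):

```python
import numpy as np

def subsphericity_statistic(S, j):
    """T_{n,j} = m_{2,p-j}(S) / m_{1,p-j}(S)^2 - 1, where m_{l,r} is the
    l-th sample moment of the r smallest eigenvalues of the symmetric S."""
    evals = np.sort(np.linalg.eigvalsh(S))   # ascending order
    tail = evals[: S.shape[0] - j]           # the p - j smallest eigenvalues
    m1 = np.mean(tail)
    m2 = np.mean(tail ** 2)
    return m2 / m1 ** 2 - 1
```

By construction the statistic vanishes exactly when the $p - j$ smallest eigenvalues are all equal, matching the subsphericity null.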
The previous test assumes a fixed dimension $p$ and, in the face of modern large and noisy data sets with great room for dimension reduction, it is desirable to extend the test to the high-dimensional regime where $p = p_n$ is a function of $n$ and we have $p_n \rightarrow \infty$ as $n \rightarrow \infty$. This is discussed in Section 2, where our first main contribution, extending the test based on (1) to the high-dimensional regime where either the sample size or the dimension asymptotically dominates the other, is also presented. Section 3 introduces our second main contribution, a power study of the test, using which we construct a consistent estimator for the true latent dimension. In Section 4 we demonstrate our results using simulations and a real data example and, in Section 5, we finally conclude with some discussion.

High-dimensional testing of subsphericity
The behaviour of most high-dimensional statistical procedures depends crucially on the interplay between $n$ and $p_n$, and the most common approach in the literature is to assume that their growth rates are proportional in the sense that $p_n/n \rightarrow \gamma \in (0, \infty)$ as $n \rightarrow \infty$; see, e.g., Yao et al. (2015). The limiting ratio $\gamma$ is also known as the concentration of the regime. In Schott (2006), the test of subsphericity discussed in Section 1 is extended to this asymptotic regime under the following two assumptions (note that in Assumption 2 the signal dimension $d$ is a constant not depending on $n$).

Assumption 1. The observations $x_1, \ldots, x_n$ are a random sample from $N_{p_n}(\mu_n, \Sigma_n)$ for some $\mu_n \in \mathbb{R}^{p_n}$ and some positive-definite $\Sigma_n \in \mathbb{R}^{p_n \times p_n}$.

Assumption 2. The eigenvalues of the matrix $\Sigma_n$ are $\lambda_{n1} \geq \cdots \geq \lambda_{nd} > \sigma^2 = \cdots = \sigma^2$ for some $\sigma^2 > 0$. Moreover, the eigenvalues $\lambda_{nk}$, $k = 1, \ldots, d$, satisfy $\lambda_{nk} \rightarrow \infty$.
In fact, Schott (2006) additionally required that the quantities $\lambda_{nk}/\mathrm{tr}(\Sigma_n)$ converge to positive constants summing to less than unity, but applying our Lemma 1 in the proof of their Theorem 4 reveals that this condition is unnecessary; see Appendix A for details. Hence, denoting by $S_n$ the sample covariance matrix of the observations, under Assumptions 1 and 2 and $p_n/n \rightarrow \gamma \in (0, \infty) \setminus \{1\}$ (see Appendix A for more details on the exclusion of the case $\gamma = 1$), Theorem 4 in Schott (2006) establishes that the test statistic $(n - d - 1)T_{n,d} - (p_n - d)$, where $d$ is the signal dimension, converges in distribution to $N(1, 4)$. As remarked by Schott (2006), this limiting result is consistent with its low-dimensional equivalent (1) in the sense that the two limiting null distributions agree as $p \rightarrow \infty$. A crucial condition allowing the above limiting result is the divergence of the spike eigenvalues $\lambda_{n1}, \ldots, \lambda_{nd}$ of the covariance matrix to infinity in Assumption 2. Indeed, the spikes are usually taken to be constant in the literature on high-dimensional PCA; see, e.g., Baik and Silverstein (2006); Johnstone and Paul (2018). However, requiring the spikes to diverge to infinity is rather natural and reflects the idea that only a few principal components are sufficient to recover a large proportion of the total variance even in high dimensions. See, for example, Yata et al. (2018), who use cross-data matrices to detect spiked principal components with divergent variances, and the references therein.
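The statistic $(n - d - 1)T_{n,d} - (p_n - d)$ and the two-sided p-value implied by its $N(1, 4)$ limit can be sketched as follows (numpy and scipy assumed; the function name and interface are mine):

```python
import numpy as np
from scipy.stats import norm

def schott_test(X, d):
    """Compute g = (n - d - 1) T_{n,d} - (p - d) from an n x p data matrix X,
    together with the two-sided p-value from the N(1, 4) null limit."""
    n, p = X.shape
    evals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))
    tail = evals[: p - d]                 # "noise" eigenvalues under H_0: dim = d
    T = np.mean(tail ** 2) / np.mean(tail) ** 2 - 1
    g = (n - d - 1) * T - (p - d)
    z = (g - 1) / 2                       # standardize N(1, 4) to N(0, 1)
    return g, 2 * norm.sf(abs(z))
```

Under $H_{0d}$ the returned statistic should fluctuate around 1 with standard deviation roughly 2 in the regimes covered by the theory.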
As our first contribution, we extend the result of Schott (2006) outside of the regime $p_n/n \rightarrow \gamma \in (0, \infty)$, to the extreme cases $\gamma \in \{0, \infty\}$. The latter have been less studied in the high-dimensional literature, but see, for example, Karoui (2003); Birke and Dette (2005); Yata and Aoshima (2009); Jung and Marron (2009), the last of which considers the extreme asymptotic scenario where the dimension diverges to infinity but the sample size remains fixed. In our treatment of the case $\gamma = \infty$, we further require the additional condition that $p_n/(n\sqrt{\lambda_{nd}}) \rightarrow 0$ as $n \rightarrow \infty$, i.e., the dimension must not diverge too fast compared to the sample size and the magnitude of the spike $\lambda_{nd}$ corresponding to the weakest signal. Assumptions of this form are rather common in high-dimensional PCA when the spikes are taken to diverge; see, e.g., Shen et al. (2016), who saw $n$, $\lambda_{nk}$ and $p_n$ as three competing forces affecting the consistency properties of PCA, with $n$ and $\lambda_{nk}$ contributing information about the signals and $p_n$ decreasing the relative share of information in the sample by introducing more noise to the model. The condition $p_n/(n\sqrt{\lambda_{nd}}) \rightarrow 0$ can thus be interpreted as requiring that even the weakest of the spike principal components has an asymptotically strong enough signal to be detected.
The extension of the test to the previous regimes is given below in Theorem 1. The main line of proof is based on extending the work of Birke and Dette (2005), who considered testing of sphericity in the cases $\gamma \in \{0, \infty\}$, to testing of subsphericity. In this sense, our work is to Birke and Dette (2005) what Schott (2006) is to Ledoit and Wolf (2002), who studied tests of sphericity in the case $\gamma \in (0, \infty)$ and on whose work Schott (2006) based their proof.

Power analysis and dimension estimation
A natural question is whether the test of subsphericity can be used to consistently estimate the latent dimension $d$ under the high-dimensional Gaussian model. In a low-dimensional setting, this is accomplished by chaining together tests of $H_{0k}: d = k$ for different values of $k$ in some specific order: in forward testing one sequentially tests $H_{00}, H_{01}, \ldots$ and takes as the estimate of $d$ the smallest $k$ for which $H_{0k}$ is not rejected. In backward testing, the order is $H_{0(p-1)}, H_{0(p-2)}, \ldots$ and the estimate is the largest $k$ for which $H_{0(k-1)}$ is rejected. The two strategies can also be combined into a "divide-and-conquer" approach where one starts from the middle of the search interval and subsequently halves it with each test; this process often terminates in fewer tests than forward or backward testing. However, in the high-dimensional setting, where our working assumption is that the number of latent signals is diminutive compared to the overall dimensionality (finite $d$ vs. $p_n \rightarrow \infty$), the most economical choice is likely forward testing. In the following we show that this strategy indeed leads, under suitable assumptions, to a consistent estimate of the dimension $d$ in various high-dimensional regimes. Even though the equivalent of Theorem 1 for $\gamma \in (0, \infty) \setminus \{1\}$ was established already in Schott (2006), the following results are novel also in that case. We use the notation $g_{n,k} := (n - k - 1)T_{n,k} - (p_n - k)$, $k = 0, \ldots, p_n - 1$, for the test statistic.
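The forward testing strategy described above can be sketched as follows (numpy and scipy assumed; the rejection rule uses the two-sided test from the $N(1, 4)$ null limit, and the function name is mine):

```python
import numpy as np
from scipy.stats import norm

def forward_estimate(X, alpha=0.05):
    """Forward testing: test H_00, H_01, ... using g_{n,k} and return the
    smallest k for which H_0k is not rejected (two-sided level-alpha test)."""
    n, p = X.shape
    evals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))
    z_crit = norm.ppf(1 - alpha / 2)
    for k in range(p - 1):
        tail = evals[: p - k]
        T = np.mean(tail ** 2) / np.mean(tail) ** 2 - 1
        g = (n - k - 1) * T - (p - k)
        if abs(g - 1) <= 2 * z_crit:     # H_0k not rejected: estimate is k
            return k
    return p - 1
```

With strong spikes the statistics $g_{n,0}, \ldots, g_{n,d-1}$ are enormous, so the loop typically stops at (or very near) the true $d$.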
Theorem 2. Under Assumptions 1 and 2, if, as $n \rightarrow \infty$, either i) $p_n/n \rightarrow \gamma \in [0, \infty) \setminus \{1\}$ and $p_n/\lambda_{nd}^2 \rightarrow 0$, or, ii) $p_n/n \rightarrow \infty$, $p_n/(n\sqrt{\lambda_{nd}}) \rightarrow 0$ and $p_n/(\sqrt{n}\,\lambda_{nd}) \rightarrow 0$, then we have, for each $k = 0, \ldots, d - 1$ and for all $M > 0$, that $P(g_{n,k} > Mn) \rightarrow 1$.

Theorem 2 shows that the test of $H_{0k}$ is consistent under the alternative hypothesis that the true dimension satisfies $d > k$ (the power of the test in the opposite case $d < k$ plays no role in forward testing and, hence, is not studied here). As a straightforward corollary we then obtain the consistency of forward testing.
Corollary 1. Under the assumptions of Theorem 2, let $c_n$ be any sequence of real numbers satisfying $c_n \rightarrow \infty$ and $c_n = O(n)$ as $n \rightarrow \infty$. Then the forward testing estimator $\hat{d} := \min\{k \geq 0 : g_{n,k} \leq c_n\}$ satisfies $P(\hat{d} = d) \rightarrow 1$ as $n \rightarrow \infty$.

Choosing a sequence $c_n$ for which the forward testing estimator $\hat{d}$ performs well in finite samples is a highly non-trivial task and, thus, we advocate using in practice the alternative estimator
$$\hat{d}_{\alpha} := \min\{k \geq 0 : |g_{n,k} - 1| \leq 2 z_{1-\alpha/2}\}, \qquad (2)$$
where $z_{1-\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution; see, e.g., Nordhausen et al. (2016) for a similar modification. The resulting procedure has asymptotically zero probability of underestimating the dimension (by Theorem 2) and carries a Type I error probability equal to $\alpha$ of overestimating the dimension (by Theorem 1). Finally, we briefly discuss the assumptions of Corollary 1 which, while stricter than those of Theorem 1, can nevertheless be seen to be very natural. That is, regardless of the regime, the assumptions ask that the weakest of the signals is strong enough not to be masked by the noise (similarly as in part ii) of Theorem 1). To gain a more concrete idea of the severity of the assumptions, let $p_n = cn^{\alpha}$ and $\lambda_{nd} = n^{\beta}$ for some $c \neq 1$ and $\alpha, \beta > 0$. Then the feasible values of $(\alpha, \beta)$ form a polygon in $\mathbb{R}^2$ that is illustrated in the range $0 < \alpha \leq 2$ as the grey area in Figure 1. The plot reveals the intuitive fact that the effect of the dimension on the minimal feasible growth rate of the signal is the stronger the faster the dimension increases (the slope of the boundary curve for $\alpha > 1.5$ is four times that for $\alpha \in (0, 1)$).
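Under the parametrization $p_n = cn^{\alpha}$, $\lambda_{nd} = n^{\beta}$ above, the rate conditions of Theorem 2 reduce to simple inequalities on $(\alpha, \beta)$; the following sketch (pure Python, with the boundary cases reflecting my reading of the stated conditions) encodes the feasibility region of Corollary 1:

```python
def corollary1_feasible(a, b):
    """Feasibility of (alpha, beta) under the conditions of Theorem 2:
      regime i):  p_n/n bounded and p_n/lambda_nd**2 -> 0, i.e. b > a/2,
      regime ii): p_n/n -> infinity, p_n/(n*sqrt(lambda_nd)) -> 0 and
                  p_n/(sqrt(n)*lambda_nd) -> 0, i.e. b > 2(a-1), b > a - 1/2."""
    if a <= 1:                    # regime i); c != 1 keeps gamma away from 1
        return b > a / 2
    return b > 2 * (a - 1) and b > a - 0.5
```

The binding boundary has slope $1/2$ for $\alpha < 1$ and slope $2$ for $\alpha > 1.5$, which is consistent with the factor-of-four remark in the text.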

Numerical examples
We first demonstrate the result of Theorem 1 using simulated data. We consider four different settings, each of which assumes a sample of size $n$ from $N_{p_n}(0, \Sigma_n)$ where $\Sigma_n = \mathrm{diag}(\lambda_{n1}, \ldots, \lambda_{nd}, 1, \ldots, 1)$. Note that this simplified form of the normal distribution (zero location, unit noise variance and diagonal covariance) entails no loss of generality, as our test statistic is location, scale and rotation invariant. The four settings correspond to the points S1-S4 in Figure 1. In each setting, we simulate replicates of the statistic $g_{n,d}$ and plot the obtained histogram superimposed with the density of the limiting distribution $N(1, 4)$.

Figure 1: Assuming $p_n = cn^{\alpha}$ and $\lambda_{nd} = n^{\beta}$ for some $c \neq 1$ and $\alpha, \beta > 0$, the grey area in the plot contains the values of $(\alpha, \beta)$ for which the assumptions of Corollary 1 hold. The points S1-S4 correspond to the four settings used in the simulation study in Section 4.
The results are shown in Figure 2, where we immediately make three observations: the convergence to the limiting distribution is (at least visually) rather fast in Settings 1-3, with the histograms exhibiting the Gaussian shape and being only slightly shifted to the left of their limiting density; Setting 1 does not appear to be significantly closer to Gaussianity than Setting 2, despite the increased amount of information in the former (in the form of more rapidly growing spike eigenvalues); and in Setting 4, where the condition $p_n/(n\sqrt{\lambda_{nd}}) \rightarrow 0$ required by Theorem 1 is violated, the histogram has the correct shape and scale but clearly underestimates the location. The difference between the true mean and the mean of the replicates in Setting 4 is approximately 1.35, and some testing (not shown here) reveals that, at least with the current parameter choices, the difference seems to stay roughly constant when $n$ is increased. Based on this, it seems possible that, even when $p_n/(n\sqrt{\lambda_{nd}}) \not\rightarrow 0$, the limiting distribution of $g_{n,d}$ could be made to equal $N(1, 4)$ with a suitable additive correction term $a_n$ which vanishes, $a_n \rightarrow 0$ as $n \rightarrow \infty$, when the conditions of Theorem 1 are satisfied.
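A Monte Carlo sketch in the spirit of this experiment (numpy assumed; the parameter choices below are hypothetical and not those of Settings 1-4):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_g(n, p, spikes, reps=500):
    """Simulate g_{n,d} = (n - d - 1) T_{n,d} - (p - d) under
    N_p(0, diag(spikes, 1, ..., 1))."""
    d = len(spikes)
    sd = np.sqrt(np.concatenate([spikes, np.ones(p - d)]))
    out = np.empty(reps)
    for r in range(reps):
        X = rng.standard_normal((n, p)) * sd          # rows are N_p(0, Sigma)
        evals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))
        tail = evals[: p - d]
        T = np.mean(tail ** 2) / np.mean(tail) ** 2 - 1
        out[r] = (n - d - 1) * T - (p - d)
    return out

g = simulate_g(n=200, p=20, spikes=np.array([200.0, 100.0]))
print(g.mean(), g.std())   # close to 1 and 2 if the N(1, 4) approximation is accurate
```

Plotting a histogram of `g` against the $N(1, 4)$ density reproduces the kind of comparison shown in Figure 2.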
Next, we demonstrate how forward testing, as defined in (2), can be used to estimate the signal dimension $d$ with a chain of hypothesis tests of the null hypotheses $H_{0k}: d = k$. That is, we sequentially test the null hypotheses $H_{00}, H_{01}, \ldots$ using, respectively, the test statistics $g_{n,0}, g_{n,1}, \ldots$ and take our estimate of the dimension to be the smallest $k$ for which $H_{0k}$ is not rejected. For each test, we use $\alpha = 0.05$, i.e., the two-sided 95% critical region of the limiting $N(1, 4)$ distribution. We consider the same four settings as in the first simulation, but include an additional, larger sample size for each. Of the four settings, only the first and the third satisfy the assumptions of Corollary 1; see Figure 1 for how the four settings are located with respect to the "feasibility region" of the assumptions.
For simplicity, we report in Table 1 the rejection rates (over 10000 replicates) of the null hypotheses corresponding to the true dimension and the neighbouring dimensions only (the columns corresponding to the true dimension are shaded grey). In Settings 1 and 3, where the assumptions of Corollary 1 are satisfied, the test attains the nominal level rather accurately at the true dimension and shows extremely good power at the smaller dimensions, as expected. Interestingly, the same conclusions are reached also in Setting 2, where the assumptions of Corollary 1 are not satisfied, implying that the assumptions, while sufficient, are not necessary for the consistency of the forward testing estimator. Finally, as expected, the procedure attains neither the nominal level nor sufficient power in Setting 4, where the conditions of Theorem 1 and Corollary 1 are not satisfied.
We conclude with a brief application of the procedure to the phoneme data set in the R package ElemStatLearn (Halvorsen, 2019). The data consist of a total of 4509 log-periodograms of length $p = 256$, each corresponding to a single utterance of one of several phonemes. For simplicity, we consider only the phoneme "sh" and, moreover, take only the first utterances of it by the first 64 speakers in the data set. This yields a data matrix with dimensions $n = 64$ and $p = 256$, meaning that the experiment can be embedded, for example, in either of the regimes $p_n = 4n$ and $p_n = n^{4/3}$. To gain some idea of the possible Gaussianity of the data, we ran separate univariate Shapiro-Wilk tests for each of the $p$ variables using the Bonferroni correction and the significance level 0.05. Based on the tests, 4 out of the 256 variables were deemed non-normal, implying that the assumption of Gaussianity might indeed be warranted in the current context. We then applied the forward testing estimator (2) with $\alpha = 0.05$ to the data and obtained the estimate $\hat{d} = 14$, implying that there is indeed great room for dimension reduction in the data set. As an alternative, "naive" approach we also considered forward testing based on a sequence of tests of the form (1) that assume $p$ to be fixed. It turned out that each of the tests was rejected (with $\alpha = 0.05$), giving the maximal estimate $\hat{d} = \min\{n, p\} = 64$. As the sample size is most likely too small for the fixed-dimension asymptotics to kick in (unlike for the high-dimensional asymptotics, which are in Table 1 seen to be good approximations already for sample sizes and dimensions comparable to the current situation), we conclude that ignoring the high-dimensional nature of the data led to a gross overestimation of the latent dimension.
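The normality screening step used above can be sketched as follows (scipy assumed; the phoneme data itself is not reproduced here, so the example runs on synthetic input):

```python
import numpy as np
from scipy.stats import shapiro

def screen_normality(X, alpha=0.05):
    """Bonferroni-corrected univariate Shapiro-Wilk screen: the number of the
    p columns of X flagged as non-normal at family-wise level alpha."""
    n, p = X.shape
    pvals = np.array([shapiro(X[:, j]).pvalue for j in range(p)])
    return int(np.sum(pvals < alpha / p))
```

For truly Gaussian data the expected number of flags is far below one, so a handful of flagged variables out of hundreds, as in the phoneme data, is compatible with approximate Gaussianity.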

Discussion
In this short note, we showed that a classical test of subsphericity is valid also in the less often studied high-dimensional Gaussian regimes where the concentration $\gamma$ is allowed to take the extreme values 0 and $\infty$, as long as the spikes themselves diverge to infinity. The case $\gamma = \infty$ further requires the condition $p_n/(n\sqrt{\lambda_{nd}}) \rightarrow 0$, limiting the growth rate of the dimension $p_n$ in terms of the signal strength $\lambda_{nd}$. Even though, by our simulation study, it seems plausible that the test could be extended beyond this condition, several key arguments in our proof of Theorem 1 hinge on it, meaning that any extensions will likely require a different technique of proof.
Additionally, we derived sufficient conditions for the consistent estimation of the latent dimension d with the forward testing procedure that chains together tests for the hypotheses H 00 , H 01 , . . .. While the conditions are rather natural, again requiring that p n does not grow too fast compared to λ nd , our simulation study gives indication that there is still room for improvement.
Finally, the main limiting factor of the presented results is the assumption of Gaussianity. This requirement could possibly be weakened by showing that the so-called universality phenomenon applies to our scenario; in high-dimensional statistics, a result derived under the Gaussian assumption is said to exhibit universality if it continues to hold when the normal distribution is replaced with some other distribution that is close to it in some suitable sense, see Johnstone and Paul (2018) for a review of such results. In the current situation concerning the limiting behavior of second-order quantities, it seems reasonable to conjecture that our main results continue to hold if the normal distribution is replaced with a distribution that shares its first four moments with the normal distribution. While the actual theoretical study of this claim goes beyond the scope of the current work (our proofs rely heavily on several pre-existing results for Wishart matrices), we nevertheless did quick experiments in Settings 1-4 described in Section 4, with the normal distribution replaced by the symmetric Laplace mixture $(1/2)L(-\mu, b) + (1/2)L(\mu, b)$ having the dispersion parameter $b = \sqrt{\sqrt{3/2} - 1}$ and the mean parameter $\mu = \sqrt{1 - 2b^2}$. The resulting distribution has moments identical to those of the standard normal up to the fourth one. The resulting rejection rates are shown in Table 2 and they indeed match very closely with those in Table 1, giving plausibility to the universality claim.

Appendix A. On Theorem 4 in Schott (2006)

For convenience, this section uses the notation of Schott (2006). We first show that the final part of Condition 2 in Schott (2006), assuming that $\lim_{k \rightarrow \infty} \lambda_{i,k}/\mathrm{tr}(\Sigma_k) = \rho_i \in (0, 1)$, $i = 1, \ldots, q$, and that $\sum_{i=1}^q \rho_i \in (0, 1)$, is actually not necessary for their Theorem 4. This condition is used both in equation (22) and in the equation right after (24) to guarantee that $m/\lambda_q = O(1)$.
This, in conjunction with the observation that $\mathrm{tr}(W_{12}W_{21}) = o_p(m)$, then gives the relation $\mathrm{tr}(W_{12}W_{21}) = o_p(\lambda_q)$ used in bounding the moments. However, the same relation follows directly from the divergence of the spike eigenvalues $\lambda_j$ by first observing that, by the proof of our Lemma 1, we have $\mathrm{tr}(W_{12}W_{21}) = qc + o_p(1)$, where $c \in (0, \infty)$ is the limit of $p/n$. Note also that, to obtain the final bound in the equation right after (24) without assuming anything about the relative growth rates of the spikes, we use the bounds $\|\Sigma_*^{-1}\| \leq \lambda_q^{-1}$ and $\lambda_q \mathrm{tr}(\Sigma_*^{-1}) \leq q$ (both valid simply by the ordering of the spike eigenvalues). Thus, the result of Theorem 4 can be obtained without the final part of Condition 2 in Schott (2006).
Additionally, we remark that, by what appears to be an oversight, the proof of Theorem 4 in Schott (2006) does not hold as such in the case where $p/n \rightarrow c = 1$. Namely, in equation (22) and in the equation right after (24), the upper bounds involve the term $\varphi_r^{-1}(S_{22\cdot1})$, which converges in probability to $(1 - c^{1/2})^{-2}$; this fails to be finite when $c = 1$. It seems to us that introducing some additional (non-trivial) assumptions on the spike eigenvalues could possibly recover the proof for $c = 1$ as, indeed, the simulations in Schott (2006) suggest that the result of Theorem 4 holds in that case also.
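Returning briefly to the universality experiment of the Discussion: the claimed four-moment match of the Laplace mixture can be verified directly (a short check; the Laplace central moments $2b^2$ and $24b^4$ are standard):

```python
import numpy as np

# Mixture (1/2) L(-mu, b) + (1/2) L(mu, b) with b^2 = sqrt(3/2) - 1 and
# mu = sqrt(1 - 2 b^2); compare its first four moments with those of N(0, 1).
b2 = np.sqrt(3 / 2) - 1            # b^2
mu2 = 1 - 2 * b2                   # mu^2

mean = 0.0                         # odd moments vanish by symmetry
third = 0.0
var = mu2 + 2 * b2                 # E X^2 = mu^2 + Laplace variance 2 b^2
fourth = mu2**2 + 12 * mu2 * b2 + 24 * b2**2   # E X^4 via binomial expansion

print(var, fourth)                 # 1.0 and 3.0 up to floating point error
```

The fourth moment reduces to $1 + 8b^2 + 4b^4$, which equals 3 exactly at $b^2 = \sqrt{3/2} - 1$; this is what pins down the dispersion parameter.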

Appendix B. Proofs
Before the proof of Theorem 1 we establish an auxiliary lemma.
Lemma 1. Let $W_n \sim W_{p_n}(I_{p_n}/n, n)$ be partitioned as
$$W_n = \begin{pmatrix} W_{n,11} & W_{n,12} \\ W_{n,21} & W_{n,22} \end{pmatrix},$$
where the block $W_{n,11}$ has size $d \times d$ and $W_p(\Sigma, \nu)$ denotes the $(p \times p)$-dimensional Wishart distribution with scale matrix $\Sigma$ and $\nu$ degrees of freedom. Then, as $n, p_n \rightarrow \infty$,

1. we have $(n/p_n)\,\mathrm{tr}(W_{n,12}W_{n,21}) \rightarrow_p d$, so that, in particular, $\mathrm{tr}(W_{n,12}W_{n,21}) = O_p(p_n/n)$;
2. if $p_n/n \rightarrow \infty$ and $p_n/(n\sqrt{\lambda_n}) \rightarrow 0$ for some sequence $\lambda_n \rightarrow \infty$, we have $\mathrm{tr}(W_{n,12}W_{n,21}) = o_p(\sqrt{\lambda_n})$.
Proof of Lemma 1. The matrix $W_n$ has the same distribution as the (biased) non-centered sample covariance matrix of a random sample $z_1, \ldots, z_n$ from the $p_n$-variate standard normal distribution. Hence, letting $Y_n := W_{n,12}W_{n,21}$ and denoting by $y_{n,jj}$ its $j$th diagonal element, we have, for arbitrary $j = 1, \ldots, d$, that $y_{n,jj} = \sum_{k=d+1}^{p_n} (W_n)_{jk}^2$. The expected value of $y_{n,jj}$ is $(p_n - d)/n$, whereas a computation of its second moment, using Isserlis' theorem, shows that
$$\mathrm{Var}(y_{n,jj}) = 2(p_n - d)\left(\frac{1}{n^2} + \frac{p_n - d + 2}{n^3}\right).$$
Hence, the moments of $t_{n,jj} := (n/p_n)\,y_{n,jj}$ are $E(t_{n,jj}) = 1 - d/p_n = 1 + o(1)$ and
$$\mathrm{Var}(t_{n,jj}) = 2\left(1 - \frac{d}{p_n}\right)\left(\frac{2}{n p_n} + \frac{p_n - d}{n p_n} + \frac{1}{p_n}\right) = o(1).$$
The first claim now follows and the second one is straightforwardly verified to be true in a like manner.
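The concentration of $t_{n,jj}$ used in the proof can be illustrated by a quick Monte Carlo sketch (numpy assumed; the parameter choices are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

def t_jj(n, p, d):
    """One draw of t_{n,jj} = (n/p) (W_12 W_21)_{jj} (for j = 1),
    where W = Z'Z / n ~ W_p(I_p/n, n)."""
    Z = rng.standard_normal((n, p))
    w = Z[:, 0] @ Z[:, d:] / n        # first row of the off-diagonal block W_12
    return (n / p) * np.sum(w**2)

# E(t_{n,jj}) = 1 - d/p_n, i.e. 0.95 with the choices below.
vals = np.array([t_jj(n=400, p=40, d=2) for _ in range(200)])
print(vals.mean())
```

The empirical mean sits near $1 - d/p_n$ and the spread shrinks as $n$ and $p_n$ grow, matching the variance bound in the proof.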
Proof of Theorem 1. Due to centering we may without loss of generality assume that $\mu_n = 0$ for all $n \in \mathbb{N}$. Moreover, as our main claim depends on $S_n$ only through its eigenvalues, we may, again without loss of generality, assume that $\Sigma_n = \mathrm{diag}(\lambda_{n1}, \ldots, \lambda_{nd}, \sigma^2, \ldots, \sigma^2)$. Finally, as the left-hand side of our main claim is invariant under scaling of the observations, we may assume that $\sigma^2 = 1$. Denoting $n_0 := n - 1$, we have that $S_n = \Sigma_n^{1/2} W_n \Sigma_n^{1/2}$, where $W_n \sim W_{p_n}\{n_0^{-1} I_{p_n}, n_0\}$ is the sample covariance matrix of a sample of size $n$ from the $p_n$-variate standard normal distribution. Denote then $\Lambda_n = \mathrm{diag}(\lambda_{n1}, \ldots, \lambda_{nd})$ and partition $S_n$ and $W_n$ as
$$S_n = \begin{pmatrix} S_{n,11} & S_{n,12} \\ S_{n,21} & S_{n,22} \end{pmatrix} = \begin{pmatrix} \Lambda_n^{1/2} W_{n,11} \Lambda_n^{1/2} & \Lambda_n^{1/2} W_{n,12} \\ W_{n,21} \Lambda_n^{1/2} & W_{n,22} \end{pmatrix},$$
where the matrices $S_{n,11}$ and $W_{n,11}$ are of size $d \times d$. Then $W_{n,22} \sim W_{r_n}\{n_0^{-1} I_{r_n}, n_0\}$, where $r_n := p_n - d$, and the Schur complement $S_{n,22\cdot1}$ satisfies
$$S_{n,22\cdot1} := S_{n,22} - S_{n,21} S_{n,11}^{-1} S_{n,12} = W_{n,22\cdot1} \sim W_{r_n}\{n_0^{-1} I_{r_n}, n_0 - d\},$$
where the distribution of $W_{n,22\cdot1}$ follows from Theorem 3.4.6 in Mardia et al. (1995). Consequently, $G_n := \{n_0/(n_0 - d)\} S_{n,22\cdot1} \sim W_{r_n}\{(n_0 - d)^{-1} I_{r_n}, n_0 - d\}$, implying that $m_{2,r_n}(S_{n,22\cdot1})/m_{1,r_n}(S_{n,22\cdot1})^2 = m_{2,r_n}(G_n)/m_{1,r_n}(G_n)^2$. Hence, by Theorem 3.7 in Birke and Dette (2005), the claimed limiting normality holds when $S_n$ is replaced by $G_n$ (and $n$ by $n_0 - d$), regardless of which of the two asymptotic regimes we are in; denote this convergence result by (B.1). Note also that, as $G_n$ is of size $r_n \times r_n$, the notation $m_{k,r_n}(G_n)$ simply refers to the $k$th sample moment of its eigenvalues. Consider next the regime where $p_n/n \rightarrow 0$ and assume that the moment differences satisfy
$$m_{k,r_n}(S_{n,22\cdot1}) - m_{k,r_n}(S_n) = o_p(1/n), \quad k = 1, 2. \qquad \text{(B.2)}$$
Then the difference (B.3) between the left-hand side of the main claim and the corresponding quantity computed from $S_{n,22\cdot1}$ is easily checked to be of the order $o_p(1)$ using (B.2) and the results, following from Section 2 in Birke and Dette (2005), that $m_{1,r_n}(S_{n,22\cdot1}) \rightarrow_p 1$ and $m_{2,r_n}(S_{n,22\cdot1}) \rightarrow_p 1$. Hence, the first claim of the theorem follows from (B.1). Similarly, in the regime where $p_n/n \rightarrow \infty$ and $p_n/(n\sqrt{\lambda_{nd}}) \rightarrow 0$, assume that
$$m_{1,r_n}(S_{n,22\cdot1}) - m_{1,r_n}(S_n) = o_p(1/p_n) \quad \text{and} \quad m_{2,r_n}(S_{n,22\cdot1}) - m_{2,r_n}(S_n) = o_p(1/n). \qquad \text{(B.4)}$$
Then the difference (B.3) can similarly be shown to be of the order $o_p(1)$, proving the second claim of the theorem.
Note that in this case we require faster convergence from the first moment since, by Section 2 of Birke and Dette (2005), we again have $m_{1,r_n}(S_{n,22\cdot1}) \rightarrow_p 1$ but the second moment behaves as $m_{2,r_n}(S_{n,22\cdot1}) - (n_0 - d)(r_n + 1)/n_0^2 \rightarrow_p 1$. Thus, we next establish (B.2) for $k = 1, 2$ and (B.4), starting from the former. As $p_n/n \rightarrow 0$, we may without loss of generality assume that $n > p_n$, implying that $S_n$ is almost surely positive definite. Now, we have for $S_{n,11\cdot2} := S_{n,11} - S_{n,12} S_{n,22}^{-1} S_{n,21}$ that
$$\varphi_d^{-1}(S_{n,11\cdot2}) = \varphi_1\{(S_{n,11} - S_{n,12} S_{n,22}^{-1} S_{n,21})^{-1}\} = \varphi_1(S_{n,11}^{-1} + S_{n,11}^{-1} S_{n,12} S_{n,22\cdot1}^{-1} S_{n,21} S_{n,11}^{-1}) \leq \varphi_1(S_{n,11}^{-1}) + \varphi_1(S_{n,11}^{-1} S_{n,12} S_{n,22\cdot1}^{-1} S_{n,21} S_{n,11}^{-1}) \leq \varphi_1^2(\Lambda_n^{-1/2})\{\varphi_1(W_{n,11}^{-1}) + \varphi_1^2(W_{n,11}^{-1})\,\varphi_1(W_{n,12} W_{n,22\cdot1}^{-1} W_{n,21})\}, \qquad \text{(B.5)}$$
where the second equality follows from the Woodbury matrix identity, the first inequality uses Weyl's inequality and the second inequality follows from the sub-multiplicativity of the spectral norm. Now, Assumption 2 guarantees that $\varphi_1^2(\Lambda_n^{-1/2}) = \lambda_{nd}^{-1} \rightarrow 0$ and, since $W_{n,11} \rightarrow_p I_d$, we further have, by the continuity of eigenvalues, that $\varphi_1(W_{n,11}^{-1}) \rightarrow_p 1$. Write then
$$\varphi_1(W_{n,12} W_{n,22\cdot1}^{-1} W_{n,21}) = \|W_{n,12} W_{n,22\cdot1}^{-1} W_{n,21}\|_2 \leq \|W_{n,12}\|_2\, \|W_{n,22\cdot1}^{-1}\|_2\, \|W_{n,21}\|_2 \leq \mathrm{tr}(W_{n,12} W_{n,21})\, \varphi_{r_n}^{-1}(W_{n,22\cdot1}),$$
where $\|\cdot\|_2$ denotes the spectral norm. Now, since $G_n = \{n_0/(n_0 - d)\} W_{n,22\cdot1} \sim W_{r_n}\{(n_0 - d)^{-1} I_{r_n}, n_0 - d\}$, we have, by the discussion after Theorem 1.1 in Rudelson and Vershynin (2009), that
$$P\left[\varphi_{r_n}(G_n) \leq \left(\frac{\sqrt{n_0 - d} - \sqrt{r_n} - t}{\sqrt{n_0 - d}}\right)^2\right] \leq e^{-t^2/2}$$
for all $t > 0$. Substituting $t = (1/2)\sqrt{n_0 - d} - \sqrt{r_n}$ (which is positive for large enough $n$) gives $P\{\varphi_{r_n}(G_n) \leq 1/4\} \rightarrow 0$. Hence,
$$\varphi_{r_n}^{-1}(W_{n,22\cdot1}) = \{n_0/(n_0 - d)\}\,\varphi_{r_n}^{-1}(G_n) = O_p(1), \qquad \text{(B.6)}$$
where the final step follows as $\varphi_{r_n}(G_n) - 1/4$ is positive with probability approaching one. Finally, by Lemma 1, we have that $\mathrm{tr}(W_{n,12} W_{n,21}) = O_p(p_n/n) = o_p(1)$ and, plugging all of these into (B.5), we obtain that $0 < \varphi_d^{-1}(S_{n,11\cdot2}) \leq o_p(1)$ (where the first inequality holds almost surely by the positive-definiteness of the Schur complement).
This, in conjunction with the fact that $\varphi_1(S_{n,22\cdot1}) \rightarrow_p 1$, implied by Theorem 2 in Karoui (2003), lets us conclude that $P\{\varphi_d(S_{n,11\cdot2}) > \varphi_1(S_{n,22\cdot1})\} \rightarrow 1$ as $n \rightarrow \infty$, and, in the sequel, we restrict our attention to this event, allowing us to apply Theorem 3 in Schott (2006), equation (17) of which yields
$$\mathrm{tr}(S_{n,11}^{-1} S_{n,12} S_{n,22\cdot1}^{-2} S_{n,21} S_{n,11}^{-1}) = o_p(1/p_n)\, \|W_{n,12} W_{n,22\cdot1}^{-2} W_{n,21}\|_2. \qquad \text{(B.7)}$$
Let the singular value decomposition of $W_{n,21}$ be $W_{n,21} = R_n D_n T_n'$. Then
$$\|W_{n,12} W_{n,22\cdot1}^{-2} W_{n,21}\|_2 \leq \mathrm{tr}(W_{n,12} W_{n,21})\, \|R_n' W_{n,22\cdot1}^{-2} R_n\|_2 \leq \mathrm{tr}(W_{n,12} W_{n,21})\, \varphi_{r_n}^{-2}(W_{n,22\cdot1}) = O_p(p_n/n)\, O_p(1),$$
where the first inequality uses the singular value decomposition and the sub-multiplicativity of the spectral norm, the second inequality follows from the Poincaré separation theorem and the final order assessment from Lemma 1 and (B.6). Plugging this into (B.7) then establishes (B.2) for $k = 1$. To show the same for $k = 2$, we apply equation (18) from Theorem 3 in Schott (2006) to obtain
$$0 \leq m_{2,r_n}(S_{n,22\cdot1}) - m_{2,r_n}(S_n) \leq \mathrm{tr}(S_{n,11}^{-1} S_{n,12} S_{n,22\cdot1}^{-2} S_{n,21} S_{n,11}^{-1}),$$
where arguing as in the case $k = 1$ shows that the right-hand side is bounded by an $o_p(1/n)$-quantity, concluding the proof of the case where $p_n/n \rightarrow 0$. For the second claim, we, without loss of generality, assume that $p_n > n$, implying that the rank of $S_{n,22\cdot1}$ is almost surely $n_0 - d$. Denote then any of its eigendecompositions by $S_{n,22\cdot1} = Q_n \Delta_n Q_n'$, where $Q_n$ is an $r_n \times (n_0 - d)$ matrix with orthonormal columns and $\Delta_n$ contains the almost surely positive $n_0 - d$ eigenvalues. Our aim is to use Corollary 3 of Schott (2006) and, for that, we first show that $P\{\varphi_d(\tilde{S}_{n,11\cdot2}) > \varphi_1(S_{n,22\cdot1})\} \rightarrow 1$ as $n \rightarrow \infty$, where $\tilde{S}_{n,11\cdot2} := S_{n,11} - S_{n,12} Q_n (Q_n' S_{n,22} Q_n)^{-1} Q_n' S_{n,21}$. Now, the inverse of $\tilde{S}_{n,11\cdot2}$ is $S_{n,11}^{-1} + S_{n,11}^{-1} S_{n,12} Q_n \Delta_n^{-1} Q_n' S_{n,21} S_{n,11}^{-1}$ and, proceeding as in (B.5), we see that $\varphi_d^{-1}(\tilde{S}_{n,11\cdot2})$ admits an analogous upper bound in which the final leading eigenvalue has, by the Poincaré separation theorem, the upper bound $\mathrm{tr}(W_{n,12} W_{n,21})\, \varphi_{n_0 - d}^{-1}(\Delta_n)$.
Now, as in the proof of the first claim, Rudelson and Vershynin (2009) can be used to show that $\varphi_{n_0 - d}^{-1}(\Delta_n) = O_p(n/p_n)$. Furthermore, Lemma 1 shows that $\mathrm{tr}(W_{n,12} W_{n,21}) = o_p(\sqrt{\lambda_{nd}})$ under our assumptions, finally yielding that $\varphi_d^{-1}(\tilde{S}_{n,11\cdot2}) = o_p(n/p_n)$. This, in conjunction with the result that $(p_n/n)\,\varphi_1^{-1}(S_{n,22\cdot1}) \rightarrow_p 1$, implied by Theorem 1 in Karoui (2003), now guarantees that $P\{\varphi_d(\tilde{S}_{n,11\cdot2}) > \varphi_1(S_{n,22\cdot1})\} \rightarrow 1$, as desired, allowing us to restrict our attention to the corresponding event and to use Corollary 3 in Schott (2006). Its first part gives us
$$\mathrm{tr}(S_{n,11}^{-1} S_{n,12} Q_n \Delta_n^{-2} Q_n' S_{n,21} S_{n,11}^{-1}) \leq \frac{p_n^3}{n^3 r_n \lambda_{nd}}\, \lambda_{nd}\,\mathrm{tr}(\Lambda_n^{-1})\,\{d + o_p(1)\}\, \|W_{n,12} Q_n \Delta_n^{-2} Q_n' W_{n,21}\|_2, \qquad \text{(B.8)}$$
where $\lambda_{nd}\,\mathrm{tr}(\Lambda_n^{-1}) \leq d$. Reasoning similarly as with the first claim of the theorem, we further have
$$\|W_{n,12} Q_n \Delta_n^{-2} Q_n' W_{n,21}\|_2 \leq \mathrm{tr}(W_{n,12} W_{n,21})\, \|R_n' Q_n \Delta_n^{-2} Q_n' R_n\|_2,$$
where $R_n$ again contains the left singular vectors of $W_{n,21}$. Plugging the obtained upper bound into (B.8) then finally gives the first claim of (B.4), and the second claim is obtained in exactly the same manner but using the second inequality of Corollary 3 in Schott (2006) instead of the first.
The proof for i) is now finished once we show that $O_p(1) + \Omega_\infty(1) = \Omega_\infty(1)$. Letting $a_n, b_n$ be arbitrary sequences of random variables with the orders $a_n = O_p(1)$ and $b_n = \Omega_\infty(1)$, fix $\varepsilon, M > 0$ and take $C, n_0 > 0$ such that $P(|a_n| \geq C) \leq \varepsilon/2$ for all $n > n_0$. Moreover, let $n_1$ be such that for all $n > n_1$ we have $P(b_n \leq M + C) \leq \varepsilon/2$. Then, for $n > \max\{n_0, n_1\}$, we have
$$
\begin{aligned}
P(a_n + b_n \leq M) &= P(a_n + b_n \leq M \mid a_n \geq -C)\,P(a_n \geq -C) + P(a_n + b_n \leq M \mid a_n < -C)\,P(a_n < -C) \\
&\leq P(-C + b_n \leq M \mid a_n \geq -C)\,P(a_n \geq -C) + \varepsilon/2 \\
&\leq P(b_n \leq M + C) + \varepsilon/2 \leq \varepsilon,
\end{aligned}
$$
proving the claim.
Moving our attention to the case $\gamma \in (0, \infty) \setminus \{1\}$ of part i) of the theorem, exactly the same proof as was used for $\gamma = 0$ works here as well after modifying a single step: to see that $m_2/m_1^2$ converges in probability to a constant, we use (21) from Schott (2006) in conjunction with Lemma 2.2 in Wang and Yao (2013) to obtain $m_1 \rightarrow_p 1$ and $m_2 \rightarrow_p 1 + \gamma$.
Finally, to obtain part ii) of the claim, we make the following modifications to the proof: to control $m_2/m_1^2$, equation (B.4) together with the formulas in Section 2 of Birke and Dette (2005) gives $m_1 = 1 + o_p(1)$ and $m_2 = O_p(p_n/n)$. Hence, we get a lower bound for $g_{n,k}/n$ of the form $O_p(1) + \Omega_\infty(1)$, where the equality follows from our assumption that $p_n/(\sqrt{n}\,\lambda_{nd}) \rightarrow 0$. The claim now follows using our earlier statement that $O_p(1) + \Omega_\infty(1) = \Omega_\infty(1)$.
Proof of Corollary 1. We have,