Tracy-Widom law for the extreme eigenvalues of sample correlation matrices

Let the sample correlation matrix be $W=YY^T$, where $Y=(y_{ij})_{p,n}$ with $y_{ij}=x_{ij}/\sqrt{\sum_{j=1}^nx_{ij}^2}$. We assume $\{x_{ij}: 1\leq i\leq p, 1\leq j\leq n\}$ to be a collection of independent, symmetrically distributed random variables with sub-exponential tails. Moreover, for each $i$, we assume $x_{ij}, 1\leq j\leq n$, to be identically distributed. We assume $0<p<n$ and $p/n\rightarrow y$ for some $y\in(0,1)$ as $p,n\rightarrow\infty$. In this paper, we establish the Tracy-Widom law ($TW_1$) for both the largest and the smallest eigenvalues of $W$. If the $x_{ij}$ are i.i.d. standard normal, we also derive the $TW_1$ law for both the largest and the smallest eigenvalues of the matrix $\mathcal{R}=RR^T$, where $R=(r_{ij})_{p,n}$ with $r_{ij}=(x_{ij}-\bar x_i)/\sqrt{\sum_{j=1}^n(x_{ij}-\bar x_i)^2}$ and $\bar x_i=n^{-1}\sum_{j=1}^nx_{ij}$.


Introduction
Suppose we have a p-dimensional distribution with mean µ and covariance matrix Σ. In the past three or four decades, in many research areas, including signal processing, network security, image processing, genetics, the stock market and other economic problems, interest has focused on the case where p is quite large or proportional to the sample size. Naturally, one may ask how to test the independence of the p components of the population. From the principal component analysis point of view, the test statistic for independence is usually the largest eigenvalue of the sample covariance matrix. Under the additional normality assumption, Johnstone [12] derived the asymptotic distribution of the largest eigenvalue of the sample covariance matrix to study the test H_0 : Σ = I, assuming µ = 0.
However, sample covariance matrices are not scale-invariant. So when µ = 0, Johnstone [12] proposes to perform principal component analysis (PCA) via the largest eigenvalue of the matrix $W = YY^T$, where
$$Y=(y_{ij})_{p,n}:=\Big(\frac{x_1}{\|x_1\|},\cdots,\frac{x_p}{\|x_p\|}\Big)^T.$$
Here $x_i = (x_{i1},\cdots,x_{in})^T$ contains the n observations of the i-th component of the population, $i = 1,\cdots,p$, and $\|\cdot\|$ denotes the Euclidean vector norm.
Performing PCA on W amounts to performing PCA on the sample correlations of the original data when µ = 0. So, for simplicity, we call W the sample correlation matrix in this paper. From now on, the eigenvalues of W will be denoted by 0 ≤ λ_1 ≤ λ_2 ≤ ··· ≤ λ_p.
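Since each row of Y is a unit vector, W has unit diagonal by construction, so tr W = p. The following sketch (illustrative only; Gaussian data is used as a convenient stand-in) builds W from raw data and checks these basic properties numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 200

# Raw data: row i holds the n observations of the i-th component.
X = rng.standard_normal((p, n))

# Row-normalize: y_ij = x_ij / ||x_i||, so each row of Y is a unit vector.
Y = X / np.linalg.norm(X, axis=1, keepdims=True)

# Sample correlation matrix W = Y Y^T (population mean known to be 0).
W = Y @ Y.T

# Each diagonal entry is y_i^T y_i = 1, hence tr(W) = p up to rounding.
print(np.trace(W))
```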
Then the empirical spectral distribution (ESD) of W is defined by
$$F_p(x)=\frac{1}{p}\sum_{i=1}^p\mathbf{1}_{\{\lambda_i\leq x\}}.$$
The asymptotic properties of F_p were studied in [11] and [2]. For the almost sure convergence of λ_1 and λ_p, see [11].
In this paper, we study the fluctuations of the extreme eigenvalues λ_1, λ_p of W for a general population, including the multivariate normal one. The basic assumption on the distribution of our population throughout the paper is

Condition C_1. We assume the x_ij are independent, symmetrically distributed random variables with variance 1, and for each i we assume x_{i1},···,x_{in} to be i.i.d. Furthermore, we require the distributions of the x_ij's to have sub-exponential tails, i.e., there exist positive constants C, C' such that for all 1 ≤ i ≤ p, 1 ≤ j ≤ n one has
$$\mathbb{P}(|x_{ij}|\geq t^C)\leq e^{-t}\quad\text{for all }t\geq C'.$$
We also assume p/n → y as p, n = n(p) → ∞, where 0 < y < 1.
Remark 1.1. We use C, C_0, C_1, C_2, C', O(1) to denote positive constants independent of p, which may differ from line to line, and C_α to denote positive constants depending on the parameter α. The notations ‖·‖_op and ‖·‖_F represent the operator norm and the Frobenius norm of a matrix, respectively, and ‖·‖ represents the Euclidean norm of a vector.
Remark 1.2. The sample correlation matrix W is invariant under scaling of the elements x_ij, so the assumption Var(x_ij) = 1 is in fact not necessary; we impose it only for convenience. Owing to the sub-exponential tails, we can always truncate the variables so that |x_ij| ≤ K for some K ≥ log^{O(1)} n.
Here TW_1 is the famous Tracy-Widom distribution of type 1, first obtained by Tracy and Widom in [19] for the Gaussian orthogonal ensemble. The distribution function F_1(t) of TW_1 admits the representation
$$F_1(t)=\exp\Big(-\frac{1}{2}\int_t^\infty\big[q(x)+(x-t)q^2(x)\big]\,dx\Big),$$
where q satisfies the Painlevé II equation
$$q''(x)=xq(x)+2q^3(x),\qquad q(x)\sim \mathrm{Ai}(x)\ \text{as }x\rightarrow\infty.$$
Here Ai(x) is the Airy function.
The main purpose of this paper is to generalize Theorem 1.1 to populations satisfying the basic condition C_1. Our main results are the following two theorems.

Theorem 1.2. Let W be a sample correlation matrix satisfying the basic condition C_1. Then as p → ∞,
$$\frac{n\lambda_p-(p^{1/2}+n^{1/2})^2}{(n^{1/2}+p^{1/2})(p^{-1/2}+n^{-1/2})^{1/3}}\xrightarrow{d}TW_1$$
and
$$-\frac{n\lambda_1-(n^{1/2}-p^{1/2})^2}{(n^{1/2}-p^{1/2})(p^{-1/2}-n^{-1/2})^{1/3}}\xrightarrow{d}TW_1.$$

Remark 1.3. For technical reasons, it is convenient to work with continuous random variables x_ij. As a result, events such as eigenvalue collision occur only with probability zero (see Lemma 3.5). Because none of our bounds depends on how continuous the x_ij are, one can recover the discrete case from the continuous one by a standard limiting argument using Weyl's inequality (see Lemma 2.2), in particular for the Bernoulli case.
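The normalization in Theorem 1.2 can be illustrated numerically. The sketch below is not part of the proof: it uses Gaussian data as a convenient stand-in for condition C_1 and simply checks that the normalized largest eigenvalue is of order one, roughly centered near the TW_1 mean (about −1.2).

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 40, 120, 200

# Theorem 1.2 normalization for the largest eigenvalue of W = Y Y^T:
# (n*lam_p - (sqrt(p)+sqrt(n))^2) / ((sqrt(n)+sqrt(p)) * (1/sqrt(p)+1/sqrt(n))^(1/3))
center = (np.sqrt(p) + np.sqrt(n)) ** 2
scale = (np.sqrt(n) + np.sqrt(p)) * (1 / np.sqrt(p) + 1 / np.sqrt(n)) ** (1 / 3)

stats = []
for _ in range(reps):
    X = rng.standard_normal((p, n))
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    lam_p = np.linalg.eigvalsh(Y @ Y.T).max()
    stats.append((n * lam_p - center) / scale)
stats = np.array(stats)

# The normalized statistic should be O(1), with mean near the TW1 mean (~ -1.2).
print(stats.mean(), stats.std())
```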
If the population is normal, then we can derive the Tracy-Widom law for both the largest and the smallest eigenvalues of the matrix $\mathcal{R}=RR^T$, where
$$R=(r_{ij})_{p,n}:=\Big(\frac{x_1-\bar x_1}{\|x_1-\bar x_1\|},\cdots,\frac{x_p-\bar x_p}{\|x_p-\bar x_p\|}\Big)^T.$$
Here $\bar x_i=n^{-1}\sum_{j=1}^nx_{ij}$, and $x_i-\bar x_i$ means that $\bar x_i$ is subtracted from each element $x_{ij}$ of $x_i$, $i=1,\cdots,p$. We denote the ordered eigenvalues of $\mathcal{R}$ by $0\leq\lambda_1(\mathcal{R})\leq\cdots\leq\lambda_p(\mathcal{R})$ below. In fact, $\mathcal{R}$ is the sample correlation matrix when the population mean is unknown.
Throughout the paper, we will use the following ad hoc definitions concerning frequent events, as provided in [16].
Definition 1 (Frequent events). [16] Let E be an event depending on n.
• E holds with high probability if P(E) ≥ 1 − O(n^{−c}) for some constant c > 0 (independent of n).
• E holds with overwhelming probability if, for every constant C > 0, P(E) ≥ 1 − O_C(n^{−C}).
The main strategy is to prove a so-called "Green function comparison theorem", a technique introduced by Erdős, Yau and Yin in [9] for generalized Wigner matrices. We establish such a comparison theorem for sample correlation matrices obeying assumption C_1 in Section 4; see Theorem 4.3. By the comparison theorem, we can compare the generally distributed case with the Bernoulli case to obtain Theorem 1.2, and, as an application, also Theorem 1.3.
Our article is organized as follows. In Section 2, we state some basic tools, which can also be found in the series of works [16], [17], [18] and [20]. We provide our main technical lemmas and theorems in Section 3. The most important one is the so-called delocalization property of the singular vectors, whose proof turns out to be the main obstacle in establishing the Green function comparison theorem in the sample correlation matrix case. In Section 4, we provide a Green function comparison theorem to prove the edge universality for sample correlation matrices satisfying assumption C_1. In Section 5, we give the proofs of our main results: Theorem 1.2 and Theorem 1.3.

Basic Tools
In this section, we state some basic tools from linear algebra and probability theory. Firstly, we denote the ordered singular values of Y by 0 ≤ σ_1 ≤ ··· ≤ σ_p, so that λ_i = σ_i². If we further denote the unit right singular vector of Y corresponding to σ_i by u_i and the left one by v_i, we have
$$Yu_i=\sigma_iv_i,\qquad Y^Tv_i=\sigma_iu_i.$$
Below we shall state some tools for eigenvalues, singular values and singular vectors without proof.
Lemma 2.1. (Cauchy's interlacing law). Let 1 ≤ p ≤ n.
(i) If A_n is an n × n Hermitian matrix and A_{n−1} is an (n−1) × (n−1) minor, then λ_i(A_n) ≤ λ_i(A_{n−1}) ≤ λ_{i+1}(A_n) for all 1 ≤ i < n.
(ii) If A_{n,p} is a p × n matrix and A_{n,p−1} is a (p−1) × n minor, then σ_i(A_{n,p}) ≤ σ_i(A_{n,p−1}) ≤ σ_{i+1}(A_{n,p}) for all 1 ≤ i < p.
(iii) If p < n, A_{n,p} is a p × n matrix, and A_{n−1,p} is a p × (n−1) minor, then σ_{i−1}(A_{n,p}) ≤ σ_i(A_{n−1,p}) ≤ σ_i(A_{n,p}) for all 1 ≤ i ≤ p, with the understanding that σ_0(A_{n,p}) = 0. (For p = n, one can consider the transpose and use (ii) instead.)

The following lemma concerns the components of a singular vector; it can be found in [18].

Lemma 2.3. Let
$$A_{p,n}=(A_{p,n-1}\ \ h)$$
be a p × n matrix with h ∈ ℂ^p, and let (u^T, x)^T be a right unit singular vector of A_{p,n} with singular value σ_i(A_{p,n}), where x ∈ ℂ and u ∈ ℂ^{n−1}. Suppose that none of the singular values of A_{p,n−1} is equal to σ_i(A_{p,n}). Then
$$|x|^2=\frac{1}{1+\sum_{j=1}^{\min(p,n-1)}\dfrac{\sigma_j(A_{p,n-1})^2\,|v_j(A_{p,n-1})^*h|^2}{\big(\sigma_j(A_{p,n-1})^2-\sigma_i(A_{p,n})^2\big)^2}},$$
where {v_1(A_{p,n−1}), ···, v_{min(p,n−1)}(A_{p,n−1})} ⊂ ℂ^p is an orthonormal system of left singular vectors corresponding to the non-trivial singular values of A_{p,n−1}, and v_j(A_{p,n−1})·h = v_j(A_{p,n−1})^*h with v_j(A_{p,n−1})^* being the conjugate transpose of v_j(A_{p,n−1}).

Similarly, if
$$A_{p,n}=\begin{pmatrix}A_{p-1,n}\\ l^*\end{pmatrix}$$
for some l ∈ ℂ^n, and (v^T, y)^T is a left unit singular vector of A_{p,n} with singular value σ_i(A_{p,n}), where y ∈ ℂ and v ∈ ℂ^{p−1}, and none of the singular values of A_{p−1,n} is equal to σ_i(A_{p,n}), then
$$|y|^2=\frac{1}{1+\sum_{j=1}^{\min(p-1,n)}\dfrac{\sigma_j(A_{p-1,n})^2\,|u_j(A_{p-1,n})^*l|^2}{\big(\sigma_j(A_{p-1,n})^2-\sigma_i(A_{p,n})^2\big)^2}},$$
where {u_1(A_{p−1,n}), ···, u_{min(p−1,n)}(A_{p−1,n})} ⊂ ℂ^n is an orthonormal system of right singular vectors corresponding to the non-trivial singular values of A_{p−1,n}.
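Both Lemma 2.1 (ii) and the component identity of Lemma 2.3 (in the form stated above) can be checked numerically. The sketch below is illustrative only; it uses a small random real matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 6, 10
A = rng.standard_normal((p, n))

# --- Cauchy interlacing, part (ii): delete one row of A ---
A_minor = A[:-1, :]
# np.linalg.svd returns singular values in decreasing order; sort increasingly
# to match the paper's convention sigma_1 <= ... <= sigma_p.
s_full = np.sort(np.linalg.svd(A, compute_uv=False))
s_minor = np.sort(np.linalg.svd(A_minor, compute_uv=False))
interlaced = all(
    s_full[i] <= s_minor[i] + 1e-12 and s_minor[i] <= s_full[i + 1] + 1e-12
    for i in range(p - 1)
)
print(interlaced)

# --- Lemma 2.3: last component of a right singular vector ---
B, h = A[:, :-1], A[:, -1]                    # A = (B  h)
U_B, s_B, _ = np.linalg.svd(B, full_matrices=False)   # columns of U_B: left singular vectors of B
_, s_A, Vt_A = np.linalg.svd(A, full_matrices=False)
i = 2
sigma = s_A[i]                                # a singular value of A
x = Vt_A[i, -1]                               # last component of its right singular vector
pred = 1.0 / (1.0 + np.sum((s_B**2) * (U_B.T @ h)**2 / (s_B**2 - sigma**2)**2))
print(abs(x**2 - pred))                       # close to 0
```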
Further, we need a frequently used tool in Random Matrix Theory: the Stieltjes transform of the ESD $F_p(x)$, which is defined by
$$s_p(z)=\int\frac{1}{x-z}\,dF_p(x)$$
for any $z=E+i\eta$ with $E\in\mathbb{R}$ and $\eta>0$. If we introduce the Green function $G(z)=(W-z)^{-1}$, we also have
$$s_p(z)=\frac{1}{p}\,\mathrm{tr}\,G(z)=\frac{1}{p}\sum_{j=1}^{p}G_{jj}(z).\qquad(2.3)$$
Here we denote by $G_{jk}$ the $(j,k)$ entry of $G(z)$. As is well known, the convergence of a tight sequence of probability measures is equivalent to the convergence of the corresponding Stieltjes transforms towards the transform of the limiting measure. So, corresponding to the convergence of $F_p(x)$ towards the famous Marchenko-Pastur law $F_{MP,y}(x)$, whose density function is given by
$$\rho_{MP,y}(x)=\frac{1}{2\pi xy}\sqrt{(b-x)(x-a)}\,\mathbf{1}_{[a,b]}(x),\qquad a=(1-\sqrt{y})^2,\quad b=(1+\sqrt{y})^2,$$
one has the convergence of $s_p(z)$ towards the Stieltjes transform $s(z)$ of $F_{MP,y}(x)$,
$$s(z)=\frac{1-y-z+\sqrt{(1-y-z)^2-4yz}}{2yz},$$
where the square root is defined as the analytic extension of the positive square root of the positive numbers. Moreover, $s(z)$ satisfies the equation
$$yzs(z)^2+(z+y-1)s(z)+1=0.\qquad(2.6)$$
If we denote the $k$-th row of $Y$ by $y_k^T$ and the remaining $(p-1)\times n$ matrix after deleting $y_k^T$ by $Y^{(k)}$, one has $W^{(k)}:=Y^{(k)}Y^{(k)T}$ and $G^{(k)}(z):=(W^{(k)}-z)^{-1}$. By Schur's complement,
$$G_{kk}(z)=\frac{1}{y_k^Ty_k-z-y_k^TY^{(k)T}G^{(k)}(z)Y^{(k)}y_k}=\frac{1}{1-z-y_k^TY^{(k)T}G^{(k)}(z)Y^{(k)}y_k}.\qquad(2.7)$$
By (2.3), we have the following lemma on the decomposition of $s_p(z)$:

Lemma 2.4.
$$s_p(z)=\frac{1}{p}\sum_{k=1}^{p}\frac{1}{1-z-y_k^TY^{(k)T}G^{(k)}(z)Y^{(k)}y_k}.$$

The last main tool we need comes from probability theory: a concentration inequality for projections of random vectors. The details of the proof can be found in [16].
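The Marchenko-Pastur equation for s(z) can be checked against simulated data: the empirical Stieltjes transform s_p(z) should be close to the root of yzs² + (z + y − 1)s + 1 = 0 with positive imaginary part. A minimal sketch (Gaussian data assumed; the point z is chosen away from the support [a, b]):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 400, 1000
y = p / n
z = 3.5 + 0.5j                       # a point away from the support [a, b]

X = rng.standard_normal((p, n))
Y = X / np.linalg.norm(X, axis=1, keepdims=True)
lam = np.linalg.eigvalsh(Y @ Y.T)

# Empirical Stieltjes transform s_p(z) = (1/p) sum_k 1/(lambda_k - z).
s_p = np.mean(1.0 / (lam - z))

# MP Stieltjes transform: root of  y z s^2 + (z + y - 1) s + 1 = 0  with Im s > 0.
roots = np.roots([y * z, z + y - 1, 1.0])
s_mp = roots[np.argmax(roots.imag)]

print(abs(s_p - s_mp))               # small for large p, n
```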
Lemma 2.5. Let X = (ξ_1,···,ξ_n)^T ∈ ℂ^n be a random vector whose entries are independent with mean zero and variance 1, and are bounded in magnitude by K almost surely for some K ≥ 10(E|ξ_i|⁴ + 1). Let H be a subspace of dimension d and π_H the orthogonal projection onto H. Then
$$\mathbb{P}\big(\big|\,\|\pi_H(X)\|-\sqrt d\,\big|\geq t\big)\leq C\exp(-ct^2/K^2)$$
for some absolute constants C, c > 0. In particular, one has
$$\|\pi_H(X)\|=\sqrt d+O(K\log n)$$
with overwhelming probability.
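A quick numerical illustration of Lemma 2.5: for a symmetric Bernoulli vector (so K = 1) and a random d-dimensional subspace, ‖π_H(X)‖ concentrates around √d. Illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 2000, 100

# Random vector with independent, mean-zero, variance-1, bounded entries
# (symmetric Bernoulli: |xi_i| = 1 <= K).
Xv = rng.choice([-1.0, 1.0], size=n)

# A random d-dimensional subspace H and the orthogonal projection onto it.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # columns: orthonormal basis of H
proj_norm = np.linalg.norm(Q.T @ Xv)               # = ||pi_H(X)||

# Lemma 2.5: ||pi_H(X)|| concentrates around sqrt(d).
print(proj_norm, np.sqrt(d))
```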

Main Technical Results
In this section, we provide our main technical results: the local MP law for sample correlation matrices and the delocalization property of the singular vectors. Both results will be proved under a much weaker assumption than C_1. We formulate them in the following two theorems. Let us first introduce more notation. For any interval I ⊂ ℝ, we use N_I to denote the number of eigenvalues of W falling in I, and |I| to denote the length of I.

Theorem 3.1. (Local MP law). Assume that p/n → y with 0 < y < 1, and that {x_ij : 1 ≤ i ≤ p, 1 ≤ j ≤ n} is a collection of independent (but not necessarily identically distributed) random variables with mean zero and variance 1. If |x_ij| ≤ K almost surely for some K = o(p^{1/C_0}δ² log^{−1} p) with some 0 < δ < 1/2 and some large constant C_0 for all i, j, then one has, with overwhelming probability, that the number of eigenvalues N_I in any interval I ⊂ [a/2, 2b] with |I| ≥ K² log⁷ p/p satisfies
$$\Big|N_I-p\int_I\rho_{MP,y}(x)\,dx\Big|\leq\delta p|I|.\qquad(3.1)$$

Remark 3.1. The study of the limiting spectral distribution on short scales was initiated by Erdős, Schlein and Yau in [6] for Wigner matrices. Results of this type are necessary for the proofs of the famous universality conjectures in Random Matrix Theory; see for example [8] and [16].
Remark 3.2. A stronger form of the local MP law has been established for more general matrix models in a very recent paper of Pillai and Yin; see Theorem 1.5 of [14]. In fact, from Theorem 1.5 of [14], one can obtain a more precise bound than that in (3.1) by replacing ρ_{MP,y}(x) with the nonasymptotic MP law ρ_W(x) defined in Section 4. Moreover, Pillai and Yin's strong local MP law also provides some crucial estimates on the individual elements of the Green function G, which will be used to establish our Green function comparison theorem in Section 4.
Theorem 3.2 (Delocalization of singular vectors). Under the assumptions of Theorem 3.1, with overwhelming probability, every unit left singular vector v_i and every unit right singular vector u_i of Y satisfies
$$\|v_i\|_\infty\leq n^{-1/2}K^{C_0/2}\log^{O(1)}n,\qquad \|u_i\|_\infty\leq n^{-1/2}K^{C_0/2}\log^{O(1)}n.$$

Remark 3.3. Note that a slightly weaker delocalization property for the left singular vectors v_i can also be found in Theorem 1.2 (iv) of Pillai and Yin [14].
If we set X := n^{−1/2}(x_{ij})_{p,n}, then S := XX^T is the sample covariance matrix corresponding to W. We further denote the ordered eigenvalues of S by 0 ≤ λ̃_1 ≤ ··· ≤ λ̃_p and introduce the diagonal matrix
$$D:=\mathrm{diag}\Big(\frac{\sqrt n}{\|x_1\|},\cdots,\frac{\sqrt n}{\|x_p\|}\Big).$$
By Theorem 5.9 of [1], λ̃_p = b + o(1) holds with overwhelming probability. In fact, it is easy to see that λ̃_1 = a + o(1) holds with overwhelming probability as well, by a similar discussion via the moment method. Observe that W = DSD and that ‖D − I‖_op = o(1) holds with overwhelming probability. By Lemma 2.2, we then have
$$\lambda_1=a+o(1),\qquad \lambda_p=b+o(1)$$
with overwhelming probability. So below we may always assume that all the eigenvalues of W lie in [a/2, 2b].

The proof of Theorem 3.1 is partially based on the lemmas of Section 2; it turns out to be quite similar to the case of sample covariance matrices and Wigner matrices, see [7], [8], [18] and [20]. However, the delocalization of the right singular vectors u_i of Y is an obstacle, owing to the lack of independence between the columns of Y.
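The factorization W = DSD used above is exact and easy to verify numerically; the sketch below (Gaussian data assumed, purely illustrative) also checks that ‖D − I‖_op is small.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n = 30, 90

Xraw = rng.standard_normal((p, n))
X = Xraw / np.sqrt(n)                    # X = n^{-1/2} (x_ij), so S = X X^T
S = X @ X.T

# D = diag(sqrt(n)/||x_1||, ..., sqrt(n)/||x_p||); close to I since ||x_i||^2 ~ n.
D = np.diag(np.sqrt(n) / np.linalg.norm(Xraw, axis=1))

Y = Xraw / np.linalg.norm(Xraw, axis=1, keepdims=True)
W = Y @ Y.T

# The factorization W = D S D used in the text holds exactly.
print(np.max(np.abs(W - D @ S @ D)))     # ~ 0
print(np.linalg.norm(D - np.eye(p), 2))  # small: ||D - I||_op = o(1)
```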
For the convenience of the reader, we provide a short proof of Theorem 3.1 at first. Our main task in this section is the proof of Theorem 3.2, more precisely, the right singular vector part of the theorem.
Proof of Theorem 3.1. We first provide the following crude upper bound on N_I.

Proof. Firstly we introduce some notation: λ^{(1)}_α, α = 1,···,p−1, are also the eigenvalues of the n × n matrix Y^{(1)T}Y^{(1)}, whose other eigenvalues are all zero. We further use ν_α to denote the eigenvector of Y^{(1)T}Y^{(1)} corresponding to the eigenvalue λ^{(1)}_α, and rewrite (2.7) in terms of these quantities. By Cauchy's interlacing law, we also have λ^{(1)}_α ∈ [a/2, 2b] with overwhelming probability. Then for any z = E + iη with E ∈ [a/2, 2b], we obtain a bound valid for any k ∈ {1,···,p}. Now we set I = [E − η/2, E + η/2]. Notice that there always exists some positive constant C_2 such that (3.6) holds. If we set C_3 = C_1C_2, it follows from (3.5) and (3.6) that (3.7) holds. The first term of (3.7) is clearly exponentially small by Hoeffding's inequality. For the second term, we use Lemma 2.5: we specialize X in Lemma 2.5 to x_1 and take the subspace H to be the one generated by the eigenvectors {ν_α : λ^{(1)}_α ∈ I}. Thus one has d = N_I ≥ Cpη ≫ CK² log² n. Then Lemma 2.5 yields the required concentration with overwhelming probability, which implies that the second term of (3.7) is also exponentially small when C is large enough. So we conclude the proof of Lemma 3.3.
Now we proceed to prove Theorem 3.1. The basic strategy is to compare s p (z) and s(z) with small imaginary part η. In fact, we have the following proposition.
Proposition 3.1. Suppose that one has the bound (3.8) below for all z = E + iη with ℑz ≥ η. Then the conclusion of Theorem 3.1 holds for any interval I as in Theorem 3.1.

Remark 3.4. Proposition 3.1 is an extension of Lemma 29 of [18] up to the edge; its proof can be found in [20]. In fact, the proof can be carried out in the same manner as that of Lemma 64 in [16] for Wigner matrices.
So, in view of Proposition 3.1, to prove Theorem 3.1 we only need to prove that the bound (3.8) holds with (uniformly) overwhelming probability for all z = E + iη such that E ∈ [a/2 − ǫ, 2b + ǫ] and
$$\frac{1}{10}\geq\eta\geq\frac{K^2\log^6 n}{n\delta^8}.$$
To prove (3.8) we need to derive a self-consistent equation for s_p(z), similar to equation (2.6) for s(z).
Firstly, by Lemma 2.4 we can rewrite s_p(z) accordingly. Then the proof of (3.8) proceeds in the same manner as its counterpart in the sample covariance matrix case (see the proof of formula (4.12) of [20]). We only state the differing parts below and leave the details to the reader. We remark here that the domain we consider differs slightly from that of [20]; however, going through the proof in [20], it is not difficult to see that the argument is the same for any domain [L_1, L_2] containing [a, b]. The only minor difference between our case and the sample covariance matrix case in [20] is the estimation of d_k. We will only deal with d_1 in the sequel; the others are analogous. By (3.3) and (3.4), we obtain the decomposition (3.9). The first term of (3.9) can be compared with s_p^{(1)}(z), the Stieltjes transform of the ESD of W^{(1)}, and the difference is then controlled by Cauchy's interlacing property.
Consequently one has (3.10). Now we provide the following lemma on the second term of (3.9).
uniformly in z with overwhelming probability.
Proof. We set R_j = ξ_j − 1. By (3.3) and the fact that (3.11) holds with overwhelming probability, we obtain the bound (3.12) for any T > 0. By using Lemma 2.5, we obtain (3.13), where a ∨ b = max(a, b). Inserting (3.11) and (3.13) into (3.12) yields the desired estimate; choosing T = log^{O(1)} n, the resulting error is always small enough. The remaining part of the proof is the same as in the sample covariance matrix case; one can refer to the proof of Proposition 4.6 of [20] for details.
Now we proceed to the proof of Theorem 3.1. By (3.9), (3.10) and Lemma 3.4 we can get the following equation By a standard comparison of (3.14) and (2.6) (see [20] for example), we have (3.8). Thus by Proposition 3.1 we conclude the proof of Theorem 3.1. Now we turn to the proof of Theorem 3.2. At first, we introduce the matrix We will need the following lemma on eigenvalue collision.
Lemma 3.5. If we assume the random variables x_ij's are continuous, then the following events hold with probability one.
i): W has simple eigenvalues, i.e., λ_1 < λ_2 < ··· < λ_p.
ii): W and W^{(p)} have no eigenvalue in common.
iii): W and W^{(n)} have no eigenvalue in common.
The proof of Lemma 3.5 will be postponed to Appendix A.
Proof of Theorem 3.2. The proof for the left singular vectors is nearly the same as in the sample covariance matrix case treated in [20], using Lemma 2.3, ii) of Lemma 3.5 and Theorem 3.1. Moreover, as mentioned in Remark 3.3, a slightly weaker delocalization property for the left singular vectors has already been provided in [14]. So we only present the proof for the right singular vectors below.
Below we denote the k-th column of Y by h_k, and the remaining p × (n−1) matrix after deleting h_k by Y^{(k)}. Note that Y^{(n)} is not independent of the last column h_n. However, in the sample covariance matrix case, the independence between a column and the corresponding submatrix is essential for applying concentration results such as Lemma 2.5. To overcome the inconvenience caused by this dependence, we use the modified matrix Y^{(n)} defined above, which is independent of the random vector (x_1n, x_2n,···, x_pn)^T. Now we define the perturbation matrices ∆_1 and ∆_2. The following lemma controls the operator norms of ∆_1 and ∆_2.
with overwhelming probability.
Proof. We only discuss the second term, ∆_2, since the first one is analogous. It is easy to see that the entries of ∆_2^T satisfy an elementwise bound, from which it follows that ‖∆_2‖_op is small with overwhelming probability. Together with the fact that ‖Y^{(n)}‖_op ≤ C holds with overwhelming probability, we conclude the proof of Lemma 3.6.
Now we proceed to the proof of Theorem 3.2. We write u_i = (w^T, x)^T, where x is the last component of u_i. Without loss of generality, we only prove the theorem for x. Notice that u_i is an eigenvector of Y^TY corresponding to the eigenvalue λ_i. Note that Y^{(n)T}Y^{(n)} shares the same nonzero eigenvalues with W^{(n)}, so by iii) of Lemma 3.5 we may always assume that λ_i is not an eigenvalue of Y^{(n)T}Y^{(n)}. If x = 0 then Theorem 3.2 is evidently true, so we consider x ≠ 0 below. Together with the fact that x² = 1 − ‖w‖², we obtain a representation for x. Now we use λ̂_j to denote the ordered nonzero eigenvalues of Y^{(n)T}Y^{(n)} and û_j the corresponding unit eigenvectors, and define the associated projection. Then by the spectral decomposition one has the representation (3.19). Therefore, to show |x| ≤ n^{−1/2}K^{C_0/2} log^{O(1)} n, we only need to prove (3.20). To prove (3.20), we separate the issue into the bulk case and the edge case. Before that, we provide the following lemma, which will be used in both cases.
Lemma 3.7. Denote by v̂_j the unit eigenvector of W^{(n)} corresponding to λ̂_j. Under the assumptions of Theorem 3.2, for any J ⊆ {1,···,p} with K² log²⁰ n ≤ |J| ≤ nK^{−3}, the corresponding projection bound holds with overwhelming probability.
We postpone the proof of Lemma 3.7 to Appendix B; in fact, it can be viewed as a modification of Lemma 2.5. Now we split the proof of Theorem 3.2 into two cases: the bulk case and the edge case.
• Bulk case: λ_i ∈ [a + ǫ, b − ǫ] for some ǫ > 0.
Note that the local MP law (Theorem 3.1) also applies to the matrix Y^{(n)T}Y^{(n)}. Thus, when λ_i lies in the bulk region of the MP law, we can find a set J ⊆ {1,···,p} with |J| ≥ K² log²⁰ n such that λ̂_j = λ_i + O(K² log²⁰ n/n) for any j ∈ J. It follows that the terms of (3.19) with j ∈ J dominate. By the singular value decomposition, we obtain the corresponding bound for any J ⊂ {1,···,p} such that K² log²⁰ n ≤ |J| ≤ nK^{−3}.
Combining these bounds with an elementary inequality valid for any real number sequence {S_1,···,S_m} implies (3.20) directly. So we conclude the proof for the bulk case.
Next, we turn to the edge case.
For the edge case we also begin with the representation (3.19), together with the bound provided by (3.18). As in the bulk case, we only need to establish (3.20), and we may assume |x| ≥ Cn^{−1/2}K^{C_0/2} log^{O(1)} n when proving it. Similarly to (3.25), by using Lemma 3.6 we obtain a bound that holds with overwhelming probability, so to prove (3.20) it suffices to show a modified estimate instead. By the Cauchy-Schwarz inequality, we only need to prove (3.32) with overwhelming probability for some 1 ≤ T_− < T_+ ≤ K² log^{O(1)} n.
Notice that under the assumption |x| ≥ Cn^{−1/2}K^{C_0/2} log^{O(1)} n, Lemma 3.6 provides the bounds (3.30) and (3.31). Moreover, it is not difficult to see that h_n^Th_n = y + o(1) with overwhelming probability. Thus by (3.29), we obtain a simplified bound with overwhelming probability. So, to prove (3.32), we only need to evaluate the sum (3.34). Here sgn(λ_i, I) = 1 (resp. −1) when λ_i lies to the left (resp. right) of I.
By Theorem 3.1, any interval I with |d_I| < log n contains at most K² log^{O(1)} n eigenvalues. So we can choose T_−, T_+ accordingly so that such intervals contain no λ̂_j with j < i − T_− or j > i + T_+. In the following we only consider I with |d_I| ≥ log n in the estimation of (3.34). Note that for λ̂_j ∈ I the corresponding terms are comparable. Using (3.30) and (3.31) again, one obtains (3.35), where we used Lemma 3.3 in the last inequality. Now we partition the real line into intervals I of length K² log^A n/n and sum (3.35) over all intervals I with |d_I| ≥ log n.

So we can evaluate (3.36) instead of (3.34). The evaluation of (3.36) is exactly the same as its counterpart in the sample covariance matrix case (see (4.5) in [20]) after inserting Lemma 3.7, so we omit the details here. In fact, we finally arrive at a principal value integral, where p.v. denotes the principal value. Using the formula for the Stieltjes transform s(z), one can evaluate this integral by residue calculus for λ_i ∈ [a, b]. Consequently, by the definition of a and b, the resulting quantity is under control when |λ_i − a| = o(1). Then it is easy to see that when 0 < y < 1, (3.32) holds with overwhelming probability in the case |λ_i − a| = o(1) or |λ_i − b| = o(1). Moreover, by continuity we can adjust the value of ǫ to obtain the conclusion in the general case. Thus we complete the proof of the delocalization of u_i.

Green function comparison theorem
In this section, we provide a Green function comparison theorem for sample correlation matrices satisfying C_1. The proof relies heavily on the recent results of Pillai and Yin [14] on sample covariance matrices and on the delocalization property of the right singular vectors proved in the last section. At first, we borrow some results from [14] directly, with only minor notational changes. In fact, by Theorem 1.5 of [14], it is not difficult to see that Theorems 1.2 and 1.3 of [14] also hold for sample correlation matrices under our basic condition C_1.
To state the results of [14], we need to introduce some notation. Define the parameter
$$\varphi:=(\log p)^{\log\log p}.$$
Moreover, we introduce the "nonasymptotic Marchenko-Pastur law" ρ_W(x), with the corresponding distribution function F_W(x) and Stieltjes transform s_W(z). For ζ ≥ 0, one also defines an appropriate spectral domain of admissible z. We say that an event Ω holds with ζ-high probability if there exists a constant C > 0 such that
$$\mathbb{P}(\Omega^c)\leq p^{\,C}\exp(-\varphi^{\zeta})\qquad(4.2)$$
for large enough p. Note that (4.2) implies that the event Ω holds with overwhelming probability if ζ > 0. We further denote by λ_± the right and left endpoints of the support of ρ_W.

Lemma 4.1. (Theorem 1.5, [14]) Under the condition C_1, for any ζ > 0 there exists a constant C_ζ such that the following events hold with ζ-high probability.
(i) The Stieltjes transform of the ESD of W satisfies

(ii) The individual matrix elements of the Green function satisfy
We also need the following lemma on s W (z).
Define the matrix W^v = Y^vY^{vT}, the Green function G^v(z) and the Stieltjes transform s_p^v(z) for a sequence of random variables {x^v_ij} satisfying C_1, and define the matrix W^w, the Green function G^w(z) and the Stieltjes transform s^w_p(z) analogously for another random sequence {x^w_ij} satisfying C_1 which is independent of {x^v_ij}. The aim of this section is to prove the following Green function comparison theorem.
Below we only state the results and proofs for the largest eigenvalue; the smallest one is analogous.

Theorem 4.3 (Green function comparison theorem). Suppose the assumptions of Lemma 4.1 hold with some constant C_1 > 0. Then there exists ǫ_0 > 0 depending only on C_1 such that for any ǫ < ǫ_0 and for any real numbers E, E_1 and E_2 in a p^{−2/3+ǫ}-neighborhood of λ_+, the comparison estimates (4.5) and (4.6) hold for some constant C and large enough p.
Proof of Theorem 4.3. The proof is similar to that of Theorem 6.3 of [14]. Moreover, the proof of (4.6) can be carried out in the same manner as that of (4.5), so we only present the proof of (4.5) below. The basic strategy is to estimate the successive differences of matrices which differ by one row. For 0 ≤ γ ≤ p, we denote by Y_γ the random matrix whose j-th row is the same as that of Y^w if j ≤ γ and that of Y^v otherwise; in particular Y_0 = Y^v and Y_p = Y^w. We set W_γ := Y_γY_γ^T and compare W_{γ−1} with W_γ using the following lemma. For simplicity, we denote the relevant Green functions accordingly.

Lemma 4.4. For any sample correlation matrix W with elements satisfying the basic assumption C_1, if |E − λ_+| ≤ p^{−2/3+ǫ} and p^{−2/3} ≫ η ≫ p^{−2/3−ǫ} for some ǫ > 0, then we have the claimed expansion, where the functional A(Y^{(i)}, m_1, m_2) depends only on the distribution of Y^{(i)} and the first two moments m_1, m_2 of x_ij.
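The row-by-row interpolation Y_γ can be made concrete as follows. The sketch fixes one consistent convention (first γ rows from Y^w, the rest from Y^v) and checks that consecutive matrices in the chain differ in exactly one row, which is what the telescoping argument exploits.

```python
import numpy as np

rng = np.random.default_rng(7)
p, n = 5, 12

# Two independent data matrices (the "v" and "w" ensembles).
Yv = rng.standard_normal((p, n))
Yw = rng.standard_normal((p, n))

def Y_gamma(g):
    """Interpolating matrix: first g rows from Yw, remaining rows from Yv."""
    return np.vstack([Yw[:g], Yv[g:]])

# Endpoints of the telescoping chain: Y_0 = Y^v and Y_p = Y^w.
print(np.array_equal(Y_gamma(0), Yv), np.array_equal(Y_gamma(p), Yw))

# Each step of the chain changes exactly one row.
diffs = [int((Y_gamma(g) != Y_gamma(g - 1)).any(axis=1).sum()) for g in range(1, p + 1)]
print(diffs)
```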
Note that W_0 = W^v and W_p = W^w; thus Lemma 4.4 implies that the difference between consecutive Green function expectations is negligible, and the proof of Theorem 4.3 can be completed by a telescoping argument. Therefore it suffices to prove Lemma 4.4 in the sequel. To do this, we need to provide some bounds on G^{(i)}. We only state the result for i = 1 in the following lemma; the others are analogous.

Lemma 4.5. The bounds (4.7) and (4.8) hold with overwhelming probability.
The proof of Lemma 4.5 will be postponed to the end of this section. Now we begin to prove Lemma 4.4 assuming Lemma 4.5.
Proof of Lemma 4.4. The proof proceeds in a manner similar to that of Lemma 6.5 in [14]. At first we rewrite (2.7) as (4.9), using the facts that W^{(1)}G^{(1)}(z) = I + zG^{(1)}(z) and y_1^Ty_1 = 1. Moreover, by Schur's complement, we obtain a companion identity. Inserting (4.9) and this identity, we define the quantity B. Thus by (4.9) we obtain an expression for G_11 in terms of B. By (ii) of Lemma 4.1 and (4.4) we can show that B is small with overwhelming probability, which justifies the expansion of 1/(1 + B). Now we set the corresponding notation; it follows from (4.11) and (4.12) that the expansion holds with overwhelming probability.
Similarly to the counterpart proof of Lemma 6.5 in [14], we only need to show with some functional A k only depending on the distribution of Y (1) , m 1 and m 2 .
Since the proof of (4.14) is similar to its counterpart in [14], we only present the proof for k = 3 below. We use E_1 to denote the expectation with respect to y_1 in the sequel. Using (4.13), we obtain an expansion that holds with overwhelming probability. If we write r_1 = ℜ(ηzs_W(z)) and r_2 = ℑ(ηzs_W(z)), then we can expand the resulting product over indices k_1,···,k_6. Notice that if some index k_i appears only once in the product, then, by the assumption that x_ij is symmetrically distributed, the E_1-expectation of the corresponding term vanishes. So we consider the case where each k_i appears at least twice. Firstly, we consider the sum (4.16), where the first summation runs over indices k_1, k_2, k_3 that are mutually distinct, and the second summation runs over the remaining index configurations. It is not difficult to see that the number of terms in the second summation is of order O(n²). By the exponential tail assumption and Hoeffding's inequality, we obtain (4.17). Furthermore, since x_11,···,x_1n are i.i.d., we have, for mutually distinct k_1, k_2, k_3, the identity (4.18). Therefore, by (4.16), (4.17), (4.18) and the fact that G^{(1)} depends only on Y^{(1)}, we obtain (4.19) with some functional Ã_3 depending only on the distribution of Y^{(1)}, m_1 and m_2.
Here the first summation Σ^{(3)} in (4.19) runs over the terms in which each k_i, i = 1,···,6, appears exactly twice; it is easy to see that there are O(n³) such terms in total. The second summation runs over the terms in which no k_i appears only once and at least one k_i appears three times; thus the total number of terms in the second summation is of order O(n²). Then, using Lemma 4.5 together with the preceding counting, we obtain (4.20). Inserting (4.20) into (4.15), we get (4.14) for k = 3. The cases k = 1 and k = 2 can be proved similarly by inserting Lemma 4.5. So we conclude the proof. Now we begin to prove Lemma 4.5.
Proof of Lemma 4.5. The proof of (4.7) is the same as its counterpart in [14] (see (6.36) of [14]), so we only present the proof of (4.8) below. For ease of presentation, we prove (4.8) for $G=(Y^TY-z)^{-1}$ instead of $G^{(1)}$. By the spectral decomposition, we have
$$G(z)=\sum_{k=1}^{p}\frac{u_ku_k^T}{\lambda_k-z}-\frac{P}{z},$$
where the projection $P=I-\sum_{k=1}^{p}u_ku_k^T$. Consequently, we can bound the individual entries of G(z). Note that |P_ij| ≤ 1 and |z| ≥ λ_+/2. By the delocalization property of u_k in Theorem 3.2, one can control the contribution of the spectral sum, and the remaining quantity is bounded with overwhelming probability, where we used (iii) of Lemma 4.1 in the last inequality. It remains to estimate the deterministic integral against the MP law, for which we use the explicit formula for the MP density.
When E ≥ λ + , we still have (4.21). Therefore, we have with overwhelming probability. Thus we complete the proof.
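The spectral decomposition of G = (Y^T Y − z)^{−1} used in the proof of Lemma 4.5 (rank-p part plus kernel projection) can be verified directly; a minimal numerical sketch:

```python
import numpy as np

rng = np.random.default_rng(8)
p, n = 4, 7
z = 0.8 + 0.3j

X = rng.standard_normal((p, n))
Y = X / np.linalg.norm(X, axis=1, keepdims=True)

# The n x n matrix Y^T Y has rank p; u_k are its eigenvectors with
# nonzero eigenvalues lambda_k (= eigenvalues of W = Y Y^T).
M = Y.T @ Y
lam, U = np.linalg.eigh(M)
lam_nz, U_nz = lam[-p:], U[:, -p:]          # the p nonzero eigenpairs

# Projection onto the kernel: P = I - sum_k u_k u_k^T.
P = np.eye(n) - U_nz @ U_nz.T

# Spectral decomposition of the Green function:
# (Y^T Y - z)^{-1} = sum_k u_k u_k^T / (lambda_k - z) - P / z.
G = np.linalg.inv(M - z * np.eye(n))
G_spec = (U_nz * (1.0 / (lam_nz - z))) @ U_nz.T - P / z
print(np.max(np.abs(G - G_spec)))           # ~ 0
```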
Theorem 4.3 is proved.

Proofs of main theorems
In this section, we provide the proofs of Theorem 1.2 and Theorem 1.3.
Proof of Theorem 1.2. The proof is based entirely on Theorem 1.5 of [14] and our Theorem 4.3. Let W^v and W^w be two independent sample correlation matrices satisfying C_1. We claim that there exist ε > 0 and δ > 0 such that for any real number s (which may depend on p) one has
$$\mathbb{P}^{v}\big(p^{2/3}(\lambda_p-\lambda_+)\leq s-p^{-\varepsilon}\big)-p^{-\delta}\leq\mathbb{P}^{w}\big(p^{2/3}(\lambda_p-\lambda_+)\leq s\big)\leq\mathbb{P}^{v}\big(p^{2/3}(\lambda_p-\lambda_+)\leq s+p^{-\varepsilon}\big)+p^{-\delta}\qquad(5.1)$$
for p ≥ p_0 sufficiently large, where p_0 is independent of s. The proof of (5.1) is independent of the matrix model and based entirely on Theorem 1.5 of [14] and our Theorem 4.3; we refer to the proof of Theorem 1.7 of [14] for details. Now, choosing W^v to be the Bernoulli case, Theorem 1.2 follows readily by combining (5.1) with Theorem 1.1.

Proof of Theorem 1.3. Set the n × n matrix
$$A=\begin{pmatrix}\frac{1}{\sqrt{2}}&-\frac{1}{\sqrt{2}}&0&\cdots&0\\ \frac{1}{\sqrt{6}}&\frac{1}{\sqrt{6}}&-\frac{2}{\sqrt{6}}&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \frac{1}{\sqrt{n(n-1)}}&\frac{1}{\sqrt{n(n-1)}}&\cdots&\frac{1}{\sqrt{n(n-1)}}&-\frac{n-1}{\sqrt{n(n-1)}}\\ \frac{1}{\sqrt{n}}&\frac{1}{\sqrt{n}}&\cdots&\frac{1}{\sqrt{n}}&\frac{1}{\sqrt{n}}\end{pmatrix}.$$
It is easy to see that A is an orthogonal matrix. Moreover, it is elementary that, for $x_i=(x_{i1},\cdots,x_{in})^T$ with i.i.d. N(0,1) entries,
$$Ax_i=(z_{i1},\cdots,z_{i,n-1},\sqrt{n}\,\bar x_i)^T,$$
where $z_{i1},\cdots,z_{i,n-1}$ is again a sequence of i.i.d. N(0,1) variables. Further, if we denote the vector $z_i=(z_{i1},\cdots,z_{i,n-1})^T$, we also have
$$\sum_{j=1}^{n}(x_{ij}-\bar x_i)^2=\|z_i\|^2.$$
Thus each centered and normalized row $(x_i-\bar x_i)/\|x_i-\bar x_i\|$ has the same distribution as $z_i/\|z_i\|$ up to the orthogonal transformation A. Consequently, in the Gaussian case, $\mathcal{R}$ has the same distribution as a W-type sample correlation matrix defined in (1.1) with parameters p and n − 1. Thus by Theorem 1.2, we have, as p → ∞,
$$\frac{(n-1)\lambda_p(\mathcal{R})-((n-1)^{1/2}+p^{1/2})^2}{((n-1)^{1/2}+p^{1/2})(p^{-1/2}+(n-1)^{-1/2})^{1/3}}\xrightarrow{d}TW_1\qquad(5.2)$$
and
$$-\frac{(n-1)\lambda_1(\mathcal{R})-((n-1)^{1/2}-p^{1/2})^2}{((n-1)^{1/2}-p^{1/2})(p^{-1/2}-(n-1)^{-1/2})^{1/3}}\xrightarrow{d}TW_1.\qquad(5.3)$$
Replacing n − 1 by n in (5.2) and (5.3), we can complete the proof of Theorem 1.3.
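The orthogonal matrix A above is a Helmert-type matrix. Its defining properties (orthogonality, last coordinate of Ax equal to √n x̄, and the first n−1 coordinates carrying the centered vector) can be checked numerically. Illustrative sketch; the row convention here is one standard choice:

```python
import numpy as np

n = 6

# Helmert-type orthogonal matrix: row k (1-based, k < n) has k entries
# 1/sqrt(k(k+1)) followed by -k/sqrt(k(k+1)); the last row is constant 1/sqrt(n).
A = np.zeros((n, n))
for k in range(1, n):
    A[k - 1, :k] = 1.0 / np.sqrt(k * (k + 1))
    A[k - 1, k] = -k / np.sqrt(k * (k + 1))
A[n - 1, :] = 1.0 / np.sqrt(n)

# A is orthogonal.
print(np.max(np.abs(A @ A.T - np.eye(n))))   # ~ 0

# For any vector x: the last coordinate of A x is sqrt(n) * xbar, and the
# first n-1 coordinates carry exactly the centered sum of squares.
rng = np.random.default_rng(9)
x = rng.standard_normal(n)
Ax = A @ x
print(abs(Ax[-1] - np.sqrt(n) * x.mean()))                       # ~ 0
print(abs(np.sum(Ax[:-1] ** 2) - np.sum((x - x.mean()) ** 2)))   # ~ 0
```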

Appendix A
In this appendix we prove Lemma 3.5.

Proof of Lemma 3.5. At first we prove i). Note that W = DSD. Since W and SD² share the same eigenvalues, it is equivalent to prove that the eigenvalues of SD² are simple. We further introduce a polynomial P_1(X) of {x_ij, 1 ≤ i ≤ p, 1 ≤ j ≤ n}. The set where P_1(X) vanishes has zero Lebesgue measure, so we can always assume P_1(X) ≠ 0. As a consequence, we can reduce our problem to proving that the matrix Q := SD²P_1(X) has no multiple eigenvalue. Now we denote the discriminant of the characteristic polynomial of Q by P_Q(X). Observe that all the entries of Q are polynomials of {x_ij}. Since the zero set of any non-null polynomial in real variables has zero Lebesgue measure, it suffices to prove that P_Q(X) is not a null polynomial. In other words, it suffices to find a family {x_ij, 1 ≤ i ≤ p, 1 ≤ j ≤ n} such that P_Q(X) ≠ 0.
Equivalently, it suffices to show that W has no multiple eigenvalue for one sample of the collection {x_ij, 1 ≤ i ≤ p, 1 ≤ j ≤ n} such that P_1(X) ≠ 0.
Now we choose the sample appropriately. Then it is not difficult to see that the resulting matrix is a Jacobi matrix with positive subdiagonal entries. Such a Jacobi matrix has simple eigenvalues; see, for example, Proposition 2.40 of [5].
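The fact invoked above — a Jacobi (symmetric tridiagonal) matrix with positive subdiagonal entries has simple eigenvalues — can be checked on a small example:

```python
import numpy as np

# A Jacobi (symmetric tridiagonal) matrix with positive subdiagonal entries
# has simple eigenvalues -- the fact invoked at the end of the proof of i).
n = 8
diag = np.zeros(n)                     # any real diagonal works
off = np.ones(n - 1)                   # positive subdiagonal
J = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)

evals = np.linalg.eigvalsh(J)
gaps = np.diff(evals)
print(gaps.min() > 0)                  # all eigenvalues distinct
```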
Next we turn to the proof of ii). We use X^{(p)} to denote the submatrix of X with the p-th row deleted, and D^{(p)} to denote the (p−1) × (p−1) upper-left corner of D. Setting S^{(p)} = X^{(p)}X^{(p)T}, one has W^{(p)} = D^{(p)}S^{(p)}D^{(p)}. Similarly to the proof of i), it suffices to prove instead that SD²P_1(X) and S^{(p)}(D^{(p)})²P_1(X) have no eigenvalue in common. It is easy to see that the resultant of the characteristic polynomials of SD²P_1(X) and S^{(p)}(D^{(p)})²P_1(X) is a polynomial in {x_ij, 1 ≤ i ≤ p, 1 ≤ j ≤ n}. Therefore, it suffices to show this resultant is a non-null polynomial; equivalently, we shall provide a sample of {x_ij, 1 ≤ i ≤ p, 1 ≤ j ≤ n} such that W and W^{(p)} have no eigenvalue in common.
Applying i) to W^{(p)}, we can denote the ordered eigenvalues of W^{(p)} by λ^{(p)}_1 < λ^{(p)}_2 < ··· < λ^{(p)}_{p−1}. By Cauchy's interlacing property, one has the interlacing relation (6.1). Moreover, we know that W^{(p)} shares its nonzero eigenvalues with its companion matrix, so we can instead provide an example such that W and W^{(p)} have no nonzero eigenvalue in common. Taking the trace on both sides of (6.2), we obtain (6.3). Now fix {x_ij, 1 ≤ i ≤ p−1, 1 ≤ j ≤ n} and let {x_pj, 1 ≤ j ≤ n} vary. When {x_pj, 1 ≤ j ≤ n} runs through ℝ^n, the ordered nonzero eigenvalues of W describe the set of families λ_1,···,λ_p of real numbers obeying (6.1) and (6.3); see the proof of Lemma 11.4 of [3], for example. Thus it is easy to find a family λ_1,···,λ_p avoiding all the λ^{(p)}_α.

Now we prove iii). We set X^{(n)} to be the submatrix of X with the n-th column deleted, and define S^{(n)} and D_{(n)} accordingly. It is obvious that S^{(n)}D²_{(n)} shares the same eigenvalues with W^{(n)}. Now we introduce a polynomial P_2(X) analogously to P_1(X). To prove that W and W^{(n)} have no eigenvalue in common, we only need to show that SD² and S^{(n)}D²_{(n)} have no eigenvalue in common. Moreover, if P_2(X) does not vanish, it is equivalent to prove that the matrices T := SD²P_2(X) and T_{(n)} := S^{(n)}D²_{(n)}P_2(X) have no eigenvalue in common. Note that the event P_2(X) = 0 has zero Lebesgue measure. Moreover, it is not difficult to see that the entries of T and T_{(n)} are all polynomials in the elements of X; thus the resultant R(X) of the characteristic polynomials of T and T_{(n)} is also a polynomial in the elements of X. Therefore, to show that R(X) is a non-null polynomial, it suffices to give one example of X such that W and W^{(n)} have no eigenvalue in common. For example, one can choose X so that all eigenvalues of W^{(n)} equal 1 while det(W − I) ≠ 0; then W^{(n)} and W have no eigenvalue in common, which implies that R(X) is not a null polynomial. So we conclude the proof.

Appendix B
In this appendix, we prove Lemma 3.7.
Since (x³_{1n},···,x³_{pn})^T is also a random vector with mean zero and finite-variance entries, Lemma 2.5 can be applied to the first part of the right-hand side of (7.3). Thus, setting the corresponding projection, it suffices to prove the following lemma instead.
We use the following concentration theorem, which is a consequence of Talagrand's inequality (see Theorem 69 of [16]). Note that F(h_n) is a convex function of the vector h_n. Since F(h_n) is the norm of a projection of the vector √n ĥ_n, it is 1-Lipschitz with respect to √n ĥ_n. By Hoeffding's inequality, we also have
$$\frac{\|\sqrt{n}\,\hat h_n'-\sqrt{n}\,\hat h_n\|}{\|h_n'-h_n\|}\leq 2$$
with overwhelming probability, so F(h_n) is a 2-Lipschitz function with overwhelming probability; we may thus always treat F(h_n) as a 2-Lipschitz function below. By Theorem 7.2, we have
$$\mathbb{P}\big(|F(h_n)-M(F(h_n))|\geq r\big)\leq 4\exp(-r^2/64K^2).$$
Similarly, we can define E_− as the event that F(h_n) ≤ √d − 2K and argue analogously. Both terms on the right-hand side can be bounded by 1/5 by the same argument as above. So we conclude the proof.