Edge universality of separable covariance matrices

In this paper, we prove the edge universality of largest eigenvalues for separable covariance matrices of the form $\mathcal Q :=A^{1/2}XBX^*A^{1/2}$. Here $X=(x_{ij})$ is an $n\times N$ random matrix with $x_{ij}=N^{-1/2}q_{ij}$, where $q_{ij}$ are $i.i.d.$ random variables with zero mean and unit variance, and $A$ and $B$ are respectively $n \times n$ and $N\times N$ deterministic non-negative definite symmetric (or Hermitian) matrices. We consider the high-dimensional case, i.e. ${n}/{N}\to d \in (0, \infty)$ as $N\to \infty$. Assuming $\mathbb E q_{ij}^3=0$ and some mild conditions on $A$ and $B$, we prove that the limiting distribution of the largest eigenvalue of $\mathcal Q$ coincide with that of the corresponding Gaussian ensemble (i.e. the $\mathcal Q$ with $X$ being an $i.i.d.$ Gaussian matrix) as long as we have $\lim_{s \rightarrow \infty}s^4 \mathbb{P}(\vert q_{ij} \vert \geq s)=0$, which is a sharp moment condition for edge universality. If we take $B=I$, then $\mathcal Q$ becomes the normal sample covariance matrix and the edge universality holds true without the vanishing third moment condition. So far, this is the strongest edge universality result for sample covariance matrices with correlated data (i.e. non-diagonal $A$) and heavy tails, which improves the previous results in \cite{BPZ1,LS} (assuming high moments and diagonal $A$), \cite{Anisotropic} (assuming high moments) and \cite{DY} (assuming diagonal $A$).


Introduction
Sample covariance matrices are fundamental objects in multivariate statistics. Given a centered random vector y P R n and its i.i.d. copies y i , i " 1,¨¨¨, N , the sample covariance matrix Q :" N´1 ř i y i yi is the simplest estimator for the covariance matrix A :" Eyy˚. In fact, if the dimension n of the data is fixed, then Q converges almost surely to Σ as N Ñ 8. However, in many modern applications, high dimensional data, i.e. data with n being comparable to or even larger than N , is commonly collected in various fields, such as statistics [13,32,33,34], economics [47] and population genetics [49], to name a few. In this setting, A cannot be estimated through Q directly due to the so-called curse of dimensionality. Yet, some properties of A can be inferred from the eigenvalue statistics of Q.
In this paper, we focus on the limiting distribution of the largest eigenvalues of high-dimensional sample covariance matrices, which is of great interest to the principal component analysis. The largest eigenvalue has been widely used in hypothesis testing problems on the structure of covariance matrices, see e.g. [7,17,33,48]. Of course the list is very far from being complete, and we refer the reader to [32,51,67] for a comprehensive review. Precisely, we will consider sample covariance matrices of the form where the data matrix X " px ij q is an nˆN random matrix with i.i.d. entries such that Ex 11 " 0 and E|x 11 | 2 " N´1, and A is an nˆn deterministic non-negative definite symmetric (or Hermitian) matrix. On dimensionality, we assume that n{N Ñ d P p0, 8q as N Ñ 8. It is well-known that the empirical spectral distribution (ESD) of Q converges to the (deformed) Marchenko-Pastur (MP) law [42], whose rightmost edge λ`gives the asymptotic location of the largest eigenvalue. Moreover, it was proved in a series of papers that under an N 2{3 scaling, the distribution of the largest eigenvalue λ 1 pQq around λ`converges to the famous Tracy-Widom distribution [58,59]. This result is commonly referred to as the edge universality, in the sense that it is independent of the detailed distribution of the entries of X. The limiting distribution of λ 1 was first obtained for Q with X consisting of i.i.d. centered Gaussian entries (i.e. XX˚is a Wishart matrix) and with trivial covariance (i.e. A " I) [33]. The edge universality in the A " I case was later proved for all random matrices X whose entries satisfy a sub-exponential decay [53]. When A is a non-scalar diagonal matrix, the Tracy-Widom distribution was first proved for the case with i.i.d. Gaussian X in [17] (non-singular A case) and [46] (singular A case). Later the edge universality with general diagonal A was proved in [6,39] for X with entries having arbitrarily high moments, and in [14] for X with entries satisfying the tail condition (1.1) below. The most general case with non-diagonal A is considered in [37], where the edge universality was proved under the arbitrarily high moments assumption.
Without loss of generality, we may assume that the row indices of the data matrix correspond to the spatial locations and the column indices correspond to the observation times. Then the data model A 1{2 X corresponds to observing independent samples at N different times, and hence is incompetent to model sampling data with time correlations. In fact, the spatio-temporal sampling data is commonly collected in environmental study [29,38,41,43] and wireless communications [60]. Motivated by this fact, we shall consider a separable data model Y " A 1{2 XB 1{2 , where A and B are respectively nˆn and NˆN deterministic non-negative definite symmetric (or Hermitian) matrices. Here A and B are not necessarily diagonal, which means that the entries are correlated both in space and in time. The name "separable" is because the joint covariance of Y , viewed as an pN nq-dimensional vector, is given by a separable form A b B. In particular, if the entries of X are Gaussian, then the joint distribution of Y is N N n p0, A b Bq. Note that the separable model describes a process where the time correlation does not depend on the spatial location and the spatial correlation does not depend on time, i.e. there is no space-time interaction.
The separable covariance matrix is defined as Q :" Y Y˚" A 1{2 XBX˚A 1{2 . It has been proved to be very useful for various applications. For example, in wireless communications, it was shown in [61] that an estimate of the capacity is directly given by various informations of the largest eigenvalue. The spectral properties of separable covariance matrices have been investigated in some recent works, see e.g. [11,18,52,62,69]. However, the edge universality is much less known compared with sample covariance matrices. It is known that the edge universality generally follows from an optimal local law for the resolvent G " pQ´zq´1 near the spectral edge, where z P C`:" tz P C : Im z ą 0u with Im z " N´1 [6,14,37,39]. Consider an nˆN matrix X consisting of independent centered entries with general variance profile E|x ij | 2 " σ ij {N , then an optimal local law was prove in [1,2] for the resolvent pXX˚´zq´1 under the arbitrarily high moments assumption. Note that this gives the local law for G in the case where both A and B are diagonal. However, if A and B are not diagonal, no such local law is proved so far, let alone the edge universality.
The goal of this paper is to fill this gap. More precisely, we shall prove that for general (non-diagonal) A and B satisfying some mild assumptions, the limiting distribution of the rescaled largest eigenvalue N 2 3 pλ 1 pQq´λ`q coincides with that of the corresponding Gaussian ensemble (i.e. Q G " A 1{2 X G BpX G q˚A 1{2 with X G being an i.i.d. Gaussian matrix) as long as the following conditions hold: For a precise statement, the reader can refer to Theorem 2.7. Note that the tail condition (1.1) is slightly weaker than the finite fourth moment condition for ? N x 11 , and in fact is sharp for the edge universality of the largest eigenvalue, see Remark 2.8 below. Historically, for sample covariance matrices, it was proved in [68] that λ 1 Ñ λ`almost surely in the null case with A " I if the fourth moment exists. Later the finite fourth moment condition is proved to be also necessary for the almost sure convergence of λ 1 [3]. On the other hand, it was proved in [54] that λ 1 Ñ λ`in probability under the condition (1.1). If A is diagonal, it was proved in [14] that the condition (1.1) is actually necessary and sufficient for the edge universality of sample covariance matrices to hold.
On the other hand, the condition (1.2) is more technical and should be considered to be removed in future works. We now discuss about it briefly. The main difficulty in studying Q " A 1{2 XBX˚A 1{2 and its resolvent is due to the fact that the entries of A 1{2 XB 1{2 are not independent. We assume that A and B have eigendecompositions A " U ΣU˚and B " V r ΣV˚. Then in the special case where X " X G is i.i.d. Gaussian, it is easy to see that which is reduced to a separable covariance matrix with diagonal Σ and r Σ. This case can be handled using the current method in [14]. To extend the result in the Gaussian case to the general X case, we use a selfconsistent comparison argument developed in [37]. For this argument to work, we need to assume that the third moments of the X entries coincide with that of the Gaussian random variable, i.e. the condition (1.2). (Actually it is common that for a comparison argument to work for random matrices, some kind of four moment matching is needed; see e.g. [55,56,57].) If one of the A and B is diagonal, then a notable argument in [37,Section 8] can remove this requirement by exploring more detailed structures of the resolvents of Q. However, their argument is quite specific and cannot be adapted to the general case with both A and B being non-diagonal. Nevertheless, this is still a welcome result, which shows that for sample covariance matrices, the condition (1.2) is not necessary and the edge universality holds as long as (1.1) holds. For a more detailed explanation on why and where the condition (1.2) is needed, we refer the reader to the discussion following Theorem 3.6.
Finally, we believe that the largest eigenvalue of the Gaussian separable covariance matrix Q G should converge to the Tracy-Widom distribution. However, to the best of our knowledge, so far there is no explicit proof for this fact. We will give a proof in another paper [16].
This paper is organized as follows. In Section 2, we first define the limiting spectral distribution of the separable covariance matrix and its rightmost edge λ`, which will depend only on the empirical spectral densities (ESD) of A and B. Then we will state the main theorem-Theorem 2.7-of this paper. In Section 3, we introduce the notations and collect some tools including the anisotropic local law (Theorem 3.6), rigidity of eigenvalues (Theorem 3.8) and a comparison theorem (Theorem 3.10). In Section 4, we prove Theorem 2.7 with these tools. Then Section 5 and Section 6 are devoted to proving Theorem 3.6, and Section 7 is devoted to proving Theorem 3.8 and Theorem 3.10.
Conventions. The fundamental large parameter is N and we always assume that n is comparable to N . All quantities that are not explicitly constant may depend on N , and we usually omit N from our notations. We use C to denote a generic large positive constant, whose value may change from one line to the next. Similarly, we use ε, τ , δ and c to denote generic small positive constants. If a constant depends on a quantity a, we use Cpaq or C a to indicate this dependence. We use τ ą 0 in various assumptions to denote a small positive constant. All constants appear in the statements or proof may depend on τ ; we neither indicate nor track this dependence.
For two quantities a N and b N depending on N , the notation a N " Opb N q means that |a N | ď C|b N | for some constant C ą 0, and a N " opb N q means that |a N | ď c N |b N | for some positive sequence c N Ó 0 as N Ñ 8. We also use the notations a N À b N if a N " Opb N q, and a N " b N if a N " Opb N q and b N " Opa N q. For a matrix A, we use }A} :" }A} l 2 Ñl 2 to denote the operator norm; for a vector v " pv i q n i"1 , }v} " }v} 2 stands for the Euclidean norm, while |v| " }v} 1 stands for the l 1 -norm. In this paper, we often write an identity matrix as I or 1 without causing any confusions. If two random variables X and Y have the same distribution, we write X d " Y .
Acknowledgements. I would like to thank Marc Potters and Xiucai Ding for bringing this problem to my attention and for helpful discussions. I also want to thank my advisor Jun Yin for the guidance and valuable suggestions.

Separable covariance matrices
We consider a class of separable covariance matrices of the form Q 1 :" A 1{2 XBX˚A 1{2 , where A and B are deterministic non-negative definite symmetric (or Hermitian) matrices. Note that A and B are not necessarily diagonal. We assume that X " px ij q is an nˆN random matrix with entries x ij " N´1 {2 q ij , 1 ď i ď n, 1 ď j ď N , where q ij are i.i.d. random variables satisfying For definiteness, in this paper we focus on the real case, i.e. the random variable q 11 is real. However, we remark that our proof can be applied to the complex case after minor modifications if we assume in addition that Re q 11 and Im q 11 are independent centered random variables with variance 1{2. We will also use the NˆN matrix Q 2 :" B 1{2 X˚AXB 1{2 . We assume that the aspect ratio d N :" n{N satisfies τ ď d N ď τ´1 for some constant 0 ă τ ă 1. Without loss of generality, by switching the roles of Q 1 and Q 2 if necessary, we can assume that τ ď d N ď 1 for all N.
For simplicity of notations, we will often abbreviate d N as d in this paper. We denote the eigenvalues of Q 1 and Q 2 in descending order by λ 1 pQ 1 q ě . . . ě λ n pQ 1 q and λ 1 pQ 2 q ě . . . ě λ N pQ 2 q. Since Q 1 and Q 2 share the same nonzero eigenvalues, we will for simplicity write λ j , 1 ď j ď N^n, to denote the j-th eigenvalue of both Q 1 and Q 2 without causing any confusion. We assume that A and B have eigendecompositions We denote the empirical spectral densities (ESD) of A and B by We assume that there exists a small constant 0 ă τ ă 1 such that for all N large enough, The first condition means that the operator norms of A and B are bounded by τ´1, and the second condition means that the spectrums of A and B do not concentrate at zero. We summarize our basic assumptions here for future reference.

Resolvents and limiting law
In this paper, we will study the eigenvalue statistics of Q 1 and Q 2 through their resolvents (or Green's functions). It is equivalent to study the matrices In this paper, we shall denote the upper half complex plane and the right half real line by C`:" tz P C : Im z ą 0u, R`:" r0, 8q.
Definition 2.2 (Resolvents). For z " E`iη P C`, we define the resolvents for r Q 1,2 as G 1 pX, zq :"´r Q 1 pXq´z¯´1 , G 2 pX, zq :"´r Q 2 pXq´z¯´1 . (2.7) We denote the ESD ρ pnq of r Q 1 and its Stieltjes transform as ρ " ρ pnq :" 1 n n ÿ i"1 δ λip r Q1q , mpzq " m pnq pzq :" We also introduce the following quantities: It was shown in [52] that if d N Ñ d P p0, 8q and π pnq A , π pN q B converge to certain probability distributions, then almost surely ρ pnq converges to a deterministic distributions ρ 8 . We now describe it through the Stieltjes transform For any finite N and z P C`, we define pm pN q 1c pzq, m pN q 2c pzqq P C 2 as the unique solution to the system of self-consistent equations Then we define m c pzq " m pnq c pzq :" It is easy to verify that m pnq c pzq P C`for z P C`. Letting η Ó 0, we can obtain a probability measure ρ Letting η Ó 0, we can recover the asymptotic eigenvalue density ρ 8 with It is also easy to see that ρ 8 is the weak limit of ρ pnq c . The above definitions of m pnq c , ρ pnq c , m 8 and ρ 8 make sense due to the following theorem. Throughout the rest of this paper, we often omit the super-indices pnq and pN q from our notations. Theorem 2.3 (Existence, uniqueness, and continuous density). For any z P C`, there exists a unique solution pm 1c , m 2c q P C 2 to the systems of equations in (2.9). The function m c in (2.10) is the Stieltjes transform of a probability measure µ c supported on R`. Moreover, µ c has a continuous derivative ρ c pxq on p0, 8q, which is defined by (2.12 We now make a small detour and discuss about another very enlightening way to understand the Stieltjes transforms m 1,2c and m c . Consider the vector solution v " pv 1 ,¨¨¨, v n q to the following self-consistent vector equation [1,2]: where 1{v denotes the entrywise reciprocal, and S is an nˆN matrix with entries In fact, if one regards X 1 :" 1, n and X 2 :" 1, N as measure spaces equipped with counting measures then S defines a linear operator S : l 8 pX 2 q Ñ l 8 pX 1 q such that Now we can regard (2.13) as a self-consistent equation of the function v : C`Ñ l 8 pX 1 q. Suppose v is a solution to (2.13) with Im vpzq ą 0, then it is easy to verify that The structure of the solution v was well-studied in [1,2]. In particular, one has the following preliminary result on the existence and uniqueness of the solution.
Theorem 2.4 (Proposition 2.1 of [1]). There is a unique function v : C`Ñ l 8 pX 1 q satisfying (2.13) and Im vpzq ą 0 for all z P C`. Moreover, for each k P X 1 , there is a unique probability measure µ k on R such that v k is the Stieltjes transform of µ k , i.e. v k pzq " The measures µ k , k P X 1 , all have the same support contained in r0, p Cs, where Now we go back to study the equations in (2.9). If we define the function then m 2c pzq can be characterized as the unique solution to the equation f pz, αq " 0 of α with Im α ą 0, and m 1c pzq is defined using the first equation in (2.9). Moreover, m 1,2c pzq are the Stieltjes transforms of densities ρ 1,2c : Then we have the following result.
We shall call a k the spectral edges. In particular, we will focus on the rightmost edge λ`:" a 1 . Now we make the following assumption: there exists a constant τ ą 0 such that

Main result
The main result of this paper is the following theorem.
Theorem 2.7. Let Q 1 :" A 1{2 XBX˚A 1{2 be an nˆn separable covariance matrix with A, B and X satisfying Assumption 2.1 and (2.18). Let λ 1 be the largest eigenvalue of Q 1 . If the conditions (1.1) and (1.2) hold, then we have for all s P R, where P G denotes the law for X " px ij q with real i.i.d. Gaussian entries N 1{2 x ij " q ij satisfying (2.1). P pλ 1 pQ 1 q ě aq ą 0 for any fixed a ą λ`, and the edge universality (2.21) cannot hold.
Remark 2.9. It is clear that (2.21) gives the edge universality of the largest eigenvalues of separable covariance matrices. However, to the best of our knowledge, so far there is no explicit formula for the limiting distribution of the largest eigenvalue of Q 1 when X is Gaussian. In an ongoing work [16], we shall prove that the largest eigenvalue of Q 1 actually converges weakly to the Tracy-Widom distribution. Here we state the precise result we expect to prove in [16], which may be of interest to some readers. Recall the proof of Lemma 2.6. We define γ 0 " γ 0 pA, Bq such that where we denote I 1 pA, Bq :" ż t λ`p1`tm 2c pλ`qq 2 π A pdtq, J 1 pA, Bq :" ż x λ 2 p1`xm 1c pλ`qq 2 π B pdxq, and for k " 2, 3, Using (2.5) and (2.18), it is easy to see that γ 0 " 1.
for all s 1 , s 2 , . . . , s k P R. Let H GOE be an NˆN random matrix belonging to the Gaussian orthogonal ensemble. The joint distribution of the k largest eigenvalues of H GOE , µ GOE 1 ě . . . ě µ GOE k , can be written in terms of the Airy kernel for any fixed k [28]. In [16], we actually show that lim N Ñ8 for all s 1 , s 2 , . . . , s k P R. Hence (2.23) gives a complete description of the finite-dimensional correlation functions of the largest eigenvalues of Q 1 . Remark 2.11. A key input for the proof of (2.21) is the anisotropic local law for the resolvents in (2.7). Our basic strategy is first to prove the anisotropic local law for G 1,2 when X is Gaussian, and then to obtain the anisotropic local law for the general X case through a comparison with the Gaussian case. Without (1.2), the comparison argument cannot give the anisotropic local law up to the optimal scale. However, in the case where A or B is diagonal, the condition (1.2) is not needed for the comparison argument in [37] to work. We refer the reader to the discussion following Theorem 3.6, which explains why and where the condition (1.2) is needed. We will try to remove the assumption (1.2) completely in future works.  (3) U and V are orthogonal matrices uniformly chosen from orthogonal groups Opnq and OpN q. Then we take n " 1000 and calculate the largest eigenvalues for 20000 independently chosen matrices. The histograms are plotted in Fig. 1. In case (a), the entries ? N x ij are drawn independently from a distribution with mean zero, variance 1 and satisfying (1.1); in case (b), the entries ? N x ij are i.i.d. Gaussian with mean zero and variance 1. We translate and rescale the numerical results properly, and one can observe that they fit the type-1 Tracy-Widom distribution very well.

Statistical applications
In this subsection, we briefly discuss some applications of our result to high-dimensional statistics.
If we take B " I, then Q 1 becomes the normal sample covariance matrix and Theorem 2.7 indicates that the edge universality of the largest eigenvalue of Q 1 holds true for correlated data (i.e. non-diagonal A) with heavy tails as in (1.1). So far, this is the strongest edge universality for sample covariance matrices compared with [6,39] (assuming high moments and diagonal A), [37] (assuming high moments) and [14] (assuming diagonal A). On the other hand, the separable data model Y " A 1{2 XB 1{2 for some nontrivial B is widely used in spatio-temporal data modeling, where A is the spatial covariance matrix and B is the temporal covariance matrix. If the entries of X are symmetrically distributed and the singular values of A, B are such that (2.18) holds, then Theorem 2.7 shows that the largest eigenvalue of Q 1 satisfies the edge universality as long as (1.1) holds. We now describe some possible applications of this result.
Consider the following standard signal plus noise model in classic signal processing [35]: where Γ is an nˆk deterministic matrix, s is a k-dimensional centered signal vector, A is an nˆn deterministic positive definite matrix, and x is an n-dimensional noise vector with i.i.d. mean zero and variance one entries. Moreover, the signal vector and the noise vector are assumed to be independent. In practice, suppose we observe N such samples, where the observations at different times are correlated such that the correlations are independent of the spatial locations. Denoting the temporal covariance matrix by B, we then have the spatio-temporal data matrix Y " ΓSB 1{2`A1{2 XB 1{2 , S :" ps 1 ,¨¨¨, s N q, X :" px 1 ,¨¨¨, x N q.
A fundamental task is to detect the signals via observed samples, and the very first step is to know whether there exists any such signal, i.e., H 0 : k " 0 vs. H 1 : k ě 1. (2.25) For the above hypothesis testing problem (2.25), the largest eigenvalue of the observed samples serves as a natural choice for the tests: our result shows that, for heavy-tailed correlated data satisfying (1.1), the largest singular value of Y satisfies the Tracy-Widom distribution asymptotically under H 0 . We can also consider to test whether the space-time data follows a specific separable covariance model with spatial and time covariance matrices r A and r B. Then we can use the largest singular value of r A´1 {2 Y r B´1 {2 as a test static. Another interesting test static for this hypothesis testing problem is the eigenvector empirical spectral distribution (VESD); see [63,65,66]. The convergence of VESD for separable covariance matrices has been proved in [66] using the anisotropic local law-Theorem 3.6 in this paper (which also serves as an important tool for the proof of Theorem 2.7).
Finally, we remark that one can also perform principal component analysis for separable covariance matrices, and study the phase transition phenomena caused by a few large isolated eigenvalues of A or B as in the case of spiked covariance matrices [4,5,9,50]. We expect that our edge universality result will serve as an important input for the study of the eigenvalues and eigenvectors for the principal components (the outliers) and the bulk components (the non-outliers). For example, in [15] we studied the convergence of the outlier eigenvalues and eigenvectors, and the limiting distribution of extremal bulk eigenvalues for the spiked separable covariance model based on our main result, Theorem 2.7, and the results given in Section 3.2 below.

Notations
We will use the following notion of stochastic domination, which was first introduced in [19] and subsequently used in many works on random matrix theory, such as [8,9,10,21,22,37]. It simplifies the presentation of the results and proofs by systematizing statements of the form "ξ is bounded by ζ with high probability up to a small power of N ". ξ "´ξ pN q puq : N P N, u P U pN q¯, ζ "´ζ pN q puq : N P N, u P U pN qb e two families of nonnegative random variables, where U pN q is a possibly N -dependent parameter set. We say ξ is stochastically dominated by ζ, uniformly in u, if for any fixed (small) ε ą 0 and (large) D ą 0, for large enough N ě N 0 pε, Dq, and we shall use the notation ξ ă ζ. Throughout this paper, the stochastic domination will always be uniform in all parameters that are not explicitly fixed (such as matrix indices, and z that takes values in some compact set). Note that N 0 pε, Dq may depend on quantities that are explicitly constant, such as τ in Assumption 2.1 and (2.18). If for some complex family ξ we have |ξ| ă ζ, then we will also write ξ ă ζ or ξ " O ă pζq.
(ii) We extend the definition of O ă p¨q to matrices in the weak operator sense as follows. Let A be a family of random matrices and ζ be a family of nonnegative random variables. Then A " O ă pζq means that |xv, Awy| ă ζ}v} 2 }w} 2 uniformly in any deterministic vectors v and w. Here and throughout the following, whenever we say "uniformly in any deterministic vectors", we mean that "uniformly in any deterministic vectors belonging to certain fixed set of cardinality N Op1q ".
(iii) We say an event Ξ holds with high probability if for any constant D ą 0, PpΞq ě 1´N´D for large enough N .
(iii) Suppose that Ψpuq ě N´C is deterministic and ξpuq satisfies Eξpuq 2 ď N C for all u. Then if ξpuq ă Ψpuq uniformly in u, we have Eξpuq ă Ψpuq uniformly in u.

Definition 3.3 (Bounded support condition).
We say a random matrix X " px ij q satisfies the bounded support condition with q, if max Here q " qpN q is a deterministic parameter and usually satisfies N´1 {2 ď q ď N´φ for some (small) constant φ ą 0. Whenever (3.1) holds, we say that X has support q.
Next we introduce a convenient self-adjoint linearization trick, which has been proved to be useful in studying the local laws of random matrices of the Gram type [1,2,37,64]. We define the following pnǸ qˆpn`N q self-adjoint block matrix, which is a linear function of X: Then we define its resolvent (Green's function) as G " GpX, zq :" By Schur complement formula, we can verify that (recall (2.7)) Thus a control of G yields directly a control of the resolvents G 1,2 . For simplicity of notations, we define the index sets I 1 :" t1, ..., nu, I 2 :" tn`1, ..., n`N u, I :" Then we label the indices of the matrices according to In the rest of this paper, we will consistently use the latin letters i, j P I 1 , greek letters µ, ν P I 2 , and a, b P I. Next we introduce the spectral decomposition of G. Let tξ k u n k"1 are the left-singular vectors, and tζ k u N k"1 are the right-singular vectors. Then using (3.4), we can get that for i, j P I 1 and µ, ν P I 2 ,

Main tools
For any constants c 0 , C 0 ą 0 and ω ď 1, we define a domain of the spectral parameter z as In particular, we shall denote We define the distance to the rightmost edge as Then we have the following lemma, which summarizes some basic properties of m 1,2c and ρ 1,2c .
Lemma 3.4. Suppose the assumptions (2.2), (2.5) and (2.18) hold. Then there exists sufficiently small constant r c ą 0 such that the following estimates hold: for any z P Spr c, C 0 ,´8q.
The estimates (3.10) and (3.11) also hold for ρ c and m c .
With (2.20), we see that if κ`η ď 2c 0 for some sufficiently small constant c 0 ą 0, then Then we consider the case with E ě λ``c 0 and η ď c 1 for some constant c 1 ą 0. In fact, for η " 0 and E ą λ`, m 2c pEq is real and it is easy to verify that m 1 2c pEq ě 0 using the Stieltjes transform formula m 2c pzq :" Using (3.13) again, we can get thatˇˇˇˇd Thus if c 1 is sufficiently small, we have for E ě λ``c 0 and η ď c 1 . Finally, it remains to consider the case with η ě c 1 . In this case, we have |m 2c pzq| " Im m 2c pzq " 1 by (3.11).
In sum, we have proved the second estimate in (3.12). The first estimate can be proved in a similar way.
Definition 3.5 (Classical locations of eigenvalues). The classical location γ j of the j-th eigenvalue of Q 1 is defined as (3.14) In particular, we have γ 1 " λ`.
In the rest of this section, we present some results that will be used in the proof of Theorem 2.7. Their proofs will be given in subsequent sections. For any matrix X satisfying Assumption 2.1 and the tail condition (1.1), we can construct a matrix X s that approximates X with probability 1´op1q, and satisfies Assumption 2.1, the bounded support condition (3.1) with q ď N´φ for some small constant φ ą 0, and see Section 4 for the details. We will need the local laws (Theorem 3.6), eigenvalues rigidity (Theorem 3.8), eigenvector delocalization (Lemma 3.9), and edge universality (Theorem 3.10) for separable covariance matrices with X s . We define the deterministic limit Π of the resolvent G in (3.3) as Define the control parameters Note that by (3.11) and (3.12), we have for z P Spr c, C 0 ,´8q. Now we are ready to state the local laws for GpX, zq. For the purpose of proving Theorem 2.7, we shall relax the condition (1.2) a little bit.
Theorem 3.6 (Local laws). Suppose Assumption 2.1 and (2.18) hold. Suppose X satisfies the bounded support condition (3.1) with q ď N´φ for some constant φ ą 0. Furthermore, suppose X satisfies (3.15) anďˇE where b N is an N -dependent deterministic parameter satisfying 1 ď b N ď N 1{2 . Fix C 0 ą 1 and let c 0 ą 0 be a sufficiently small constant. Given any ε, a ą 0, we define the domain Then for any constants ε ą 0 and a ą 0, the following estimates hold.
(1) Anisotropic local law: For any z P r Spc 0 , C 0 , a, εq and deterministic unit vectors u, v P C I , |xu, GpX, zqvy´xu, Πpzqvy| ă q`Ψpzq. (2) Averaged local law: For any z P r Spc 0 , C 0 , a, εq, we have where m is defined in (2.8). Moreover, outside of the spectrum we have the following stronger estimate The above estimates are uniform in the spectral parameter z and any set of deterministic vectors of cardinality The main difficulty for the proof of Theorem 3.6 is due to the fact that the entries of A 1{2 XB 1{2 are not independent anymore. However, notice that if X " X Gauss is i.i.d. Gaussian, we have In this case, the problem is reduced to proving the local laws for separable covariance matrices with diagonal spatial and temporal covariance matrices, which can be handled using the standard resolvent methods as in e.g. [8,53]. To go from the Gaussian case to the general X case, we adopt a continuous self-consistent comparison argument developed in [37]. In order for this argument to work, we need to assume (1.2). The main reason is that we need to match the third moment of x ij with that of the Gaussian random variables in the derivation of equation (6.26) below. Under the weaker condition (3.20), we cannot prove the local laws up to the optimal scale η " N´1, but only up to the scale η " maxt qb N N , N u near the edge. However, to prove the edge universality, we only need to have a good local law up to the scale η ď N´2 {3´ε , hence b N can take values up to b N ! N 1{3 . (Actually in the proof of Theorem 2.7 in Section 4, we will take b N " N´ε for some small constant ε ą 0; see (4.4) below for the estimate on b N that is obtained from (1.2).) Finally, if A or B is diagonal, one can prove the local laws up to the optimal scale for all b N " OpN 1{2 q by using an improved comparison argument in [37].
Following the above discussions, we divide the proof of Theorem 3.6 into two steps. In Section 5, we give the proof for separable covariance matrices of the form Σ 1{2 X r ΣX˚Σ 1{2 , which implies the local laws in the Gaussian X case. In Section 6, we apply the self-consistent comparison argument in [37] to extend the result to the general X case. Compared with [37], there are two differences in our setting: (1) the support of X in Theorem 3.6 is q " OpN´φq for some constant 0 ă φ ď 1{2, while [37] only dealt with X with small support q " OpN´1 {2 q; (2) one has B " I in [37], which simplifies the proof.
The second moment of the error xu, pG´Πqvy in fact satisfies a stronger bound.
With Theorem 3.6 as a key input, we can prove a stronger estimate on mpzq that is independent of q. This averaged local law implies the rigidity of eigenvalues for Q 1 . Note that for any fixed E, Ψ 2 pE`iηq`q{pN ηq is monotonically decreasing with respect to η, hence there is a unique η 1 pEq such that Then we define η l pEq :" max Eďxďλ`η1 pxq ("l" for lower bound) for E ď λ`, and η l pEq :" η l pλ`q for E ą λ`. Note that by (3.18), we always have η l pEq " Opb N {N q.
Theorem 3.8 (Rigidity of eigenvalues). Suppose the assumptions in Theorem 3.6 hold. Fix the constants c 0 and C 0 as given in Theorem 3.6. Then for any fixed ε, a ą 0, we have uniformly in z P r Spc 0 , C 0 , a, εq. Moreover, outside of the spectrum we have the following stronger estimate uniformly in z P r Spc 0 , C 0 , a, εq X tz " E`iη : E ě λ`, N η ? κ`η ě N ε u for any fixed ε ą 0. If A or B is diagonal, then (3.26) holds for z P Spc 0 , C 0 , εq and (3.27) holds for z P Spc 0 , C 0 , εq X tz " E`iη : The bounds (3.26) and (3.27) imply that for any constant 0 ă c 1 ă c 0 , the following estimates hold.
(1) For any E ě λ`´c 1 , we have where κ E is defined in (3.9), and The anisotropic local law (3.22) implies the following delocalization properties of eigenvectors.
Finally, we have the following edge universality result for separable covariance matrices with support q ď N´φ and satisfying the condition (3.15).
Theorem 3.10. Let X p1q and X p2q be two separable covariance matrices satisfying the assumptions in Theorem 3.6. Suppose b N ď N 1{3´c for some constant c ą 0. Then there exist constants ε, δ ą 0 such that for any s P R, where P p1q and P p2q denote the laws of X p1q and X p2q , respectively. Remark 3.11. As in [20,24,40], Theorem 3.10 can be can be generalized to finite correlation functions of the k largest eigenvalues for any fixed k: The proof of (3.34) is similar to that of (3.33) except that it uses a general form of the Green function comparison theorem; see e.g. [24,Theorem 6.4]. As a corollary, we can get the stronger edge universality result (2.23).
The proofs for Lemma 3.7, Theorem 3.8 and Theorem 3.10 follow essentially the same path as discussed below. First, for random matrix r X with small suppoort q " OpN´1 {2 q, we have the averaged local laws (3.26)-(3.27) and the following anisotropic local lawˇˇx u, Gp r X, zqvy´xu, Πpzqvyˇˇă Ψpzq.
With these estimates, one can prove that Lemma 3.7, Theorem 3.8 and Theorem 3.10 hold in the small support case using the methods in e.g. [20,24,53]. Then it suffices to use a comparison argument to show that the large support case is "sufficiently close" to the small support case. In fact, given any matrix X satisfying the assumptions in Theorem 3.6, we can construct a matrix r X having the same first four moments as X but with smaller support q " OpN´1 {2 q, which is the content of the next lemma.
Lemma 3.12 (Lemma 5.1 in [40]). Suppose X satisfies the assumptions in Theorem 3.6. Then there exists another matrix r X " pr x ij q, such that r X satisfies the bounded support condition (3.1) with q " N´1 {2 , and the first four moments of the X entries and r X entries match, i.e.
It is known that the Lindeberg replacement strategy combined with the four moment matching usually implies some universality results in random matrix theory, see e.g. [55,56,57]. This is actually also true in our case. We shall extend the Green function comparison method developed in [40] (which is essentially an iterative application of the Lindeberg strategy using the four moment matching), and prove that Lemma 3.7, Theorem 3.8 and Theorem 3.10 also hold for the large support case. The proofs are given in Section 7.

Proof of of Theorem 2.7
In this section, we prove Theorem 2.7 with the results in Section 3.2. Given the matrix X satisfying Assumption 2.1 and the tail condition (1.1), we introduce a cutoff on its matrix entries at the level N´ε. For any fixed ε ą 0, define By (1.1) and integration by parts, we get that for any fixed δ ą 0 and large enough N , Let ρpdxq be the law of q 11 . Then we define independent random variables q s ij , q l ij , c ij , 1 ď i ď n and 1 ď j ď N , in the following ways.
• q s ij has law ρ s , which is defined such that for any event E. Note that if q 11 has density ρpxq, then the density for q s 11 is • q l ij has law ρ l , such that for any event E.
Let X s , X l and X c be random matrices such that X s ij " N´1 {2 q s ij , X l ij " N´1 {2 q l ij and X c ij " c ij . It is easy to check that for independent X s , X l and X c , The purpose of this decomposition (in distribution) is to write X into a well-behaved random matrix X s with bounded support q " OpN´εq plus a perturbation matrix pX l´X s qX c . Here the matrix X c gives the locations of the nonzero entries of the perturbation matrix, and its rank is at most N 5ε with high probability; see (4.6) below. The matrix X l contains the "abnormal" large entries above the cutoff, but the tail condition (1.1) guarantees that the sizes of these entries are of order op1q in probability; see (4.9). Hence the perturbation pX l´X s qX c is of low rank and has small strengths. Then as in the famous BBP transition [4], we will show that the effect of this perturbation on the largest eigenvalue is negligible.
In the proof below, one will see that (recall (2.6)) with probability 1´op1q. Thus with probability 1´op1q, we havěˇˇλ Hence the deterministic part in (4.2) is negligible under the scaling N 2{3 . By (1.1), (1.2) and integration by parts, it is easy to check that Note that this is the only place where (1.2) is used in order to get the estimate on Epq s 11 q 3 . For the reason why this estimate is needed, we refer the reader to the discussion below Theorem 3.6. Thus X 1 :" pE|q s 11 | 2 q´1 {2 X s is a matrix that satisfies the assumptions for X in Theorem 3.6 with b N " OpN ε q and q " OpN´εq. Then by Theorem 3.10, there exist constants ε 1 , δ 1 ą 0 such that for any s P R, where P s denotes the law for X s and P G denotes the law for i.i.d. Gaussian matrix. Now we write the first two terms on the right-hand side of (4.2) as We define the matrix R c :" pR ij X c ij q. It remains to show that the effect of R c on λ 1 is negligible. Note that X c ij is independent of X s ij and R ij .
We first introduce a cutoff on matrix X c as r If we regard the matrix X c as a sequence X c of nN i.i.d. Bernoulli random variables, it is easy to obtain from the large deviation formula that for sufficiently large N . Suppose the number n 0 of the nonzero elements in X c is given with n 0 ď N 5ε . Then it is easy to check that Combining the estimates (4.6) and (4.7), we get that PpA q ě 1´OpN´1`1 0ε q. On the other hand, by condition (1.1), we have for any fixed constant ω ą 0. Hence if we introduce the matrix then we have PpE " R c q " 1´op1q (4.10) by (4.8) and (4.9). Thus we only need to study the largest eigenvalue of r Q 1 pX s`E q, where max i,j |E ij | ď ω and rankpEq ď N 5ε . In fact, it suffices to prove that The estimate (4.11), combined with (4.3), (4.5) and (4.10), concludes (2.21). Now we prove (4.11). Since r X c is independent of X s , the positions of the nonzero elements of r X c are independent of X s . Without loss of generality, we assume the positions of the n 0 nonzero entries of r X c are p1, 1q, p2, 2q,¨¨¨, pn 0 , n 0 q, which correspond to the following entries of E: e 11 , e 22 ,¨¨¨, e n0n0 , n 0 ď N 5ε . (4.12) For other choices of the positions of nonzero entries, the proof is exactly the same, but we make this assumption to simplify the notations. By the definition of E, we have |e ii | ď ω, 1 ď i ď n 0 . We define the matrices and H E :" H s`P , where where P D is a 2n 0ˆ2 n 0 diagonal matrix P D " diag pe 11 , . . . , e n0n0 ,´e 11 , . . . ,´e n0n0 q , and W is an pn`N qˆ2n 0 matrix such that Without loss of generality, we assume that e ii ‰ 0, 1 ď i ď n 0 (otherwise we only need to use a matrix W with smaller rank). With the identity and Lemma 6.1 of [36], we find that if µ R σp r Q 1 pX s qq, then µ is an eigenvalue of r Q 1 pX s`γ Eq if and only if det`O˚G s pµqO`pγP D q´1˘" 0, 0 ă γ ă 1, Define R γ :" O˚G s O`pγP D q´1 for 0 ă γ ă 1, and let µ :" λ s
5 Proof of Theorem 3.6: Gaussian X As discussed below Theorem 3.6, in this section we prove Theorem 3.6 for separable covariance matrices of the form Σ 1{2 X r ΣX˚Σ 1{2 , which will imply the local laws in the Gaussian X case. Thus in this section, we use the following resolvent: with X satisfying (3.1) with q " N´1 {2 . More precisely, we will prove the following result. for any z P Spc 0 , C 0 , εq, and Both of the above estimates are uniform in the spectral parameter z and the deterministic vectors u, v.
Under a different set of assumptions, the local law as in Proposition 5.1 has been proved in [1]. However, in order to satisfy their assumptions in our setting, we need to assume that the eigenvalues of A and B are both upper and lower bounded by some constants τ ď σ i , r σ i ď τ´1, which rules out the possibility of zero or very small (that is, op1q) eigenvalues of A and B. On the other hand, our assumptions in (2.5) and (2.18) are slightly more general, and allow for a large portion of small or zero eigenvalues of A and B. For reader's convenience, we shall give the proof of Proposition 5.1 in our setting. This proof is similar to the previous proof of the local laws, such as [8,14,37,64]. Thus instead of giving all the details, we only describe briefly the proof. In particular, we shall focus on the key self-consistent equation argument, which is (almost) the only part that departs significantly from the previous proof in e.g. [8]. In the proof, we always denote the spectral parameter by z " E`iη.

Basic tools
In this subsection, we collect some basic tools that will be used. For simplicity, we denote Y :" Σ 1{2 X r Σ 1{2 .
Definition 5.2 (Minors). For any pn`N qˆpn`N q matrix A and T Ď I, we define the minor A pTq :" pA ab : a, b P IzTq as the pn`N´|T|qˆpn`N´|T|q matrix obtained by removing all rows and columns indexed by T. Note that we keep the names of indices when defining A pTq , i.e. pA pTq q ab " A ab for a, b R T. Correspondingly, we define the resolvent minor as and the partial traces For convenience, we will adopt the convention that for any minor A pT q defined as above, A pT q ab " 0 if a P T or b P T. We will abbreviate ptauq " paq, pta, buq " pabq, and ř pTq a :" ř aRT . Lemma 5.3. (Resolvent identities).
(i) For i P I 1 and µ P I 2 , we have For i P I 1 and µ P I 2 , we have (5.7) (iii) For a P I and b, c P Iztau, (5.8) (iv) All of the above identities hold for G pTq instead of G for T Ă I, and in the case where A and B are not diagonal.
Proof. All these identities can be proved using Schur's complement formula. The reader can refer to, for example, [37,Lemma 4.4].
Lemma 5.4. Fix constants c 0 , C 0 ą 0. The following estimates hold uniformly for all z P Spc 0 , C 0 , aq for any a P R: }G} ď Cη´1, }B z G} ď Cη´2. (5.9) Furthermore, we have the following identities: All of the above estimates remain true for G pTq instead of G for any T Ď I, and in the case where A and B are not diagonal.
Lemma 5.5. Fix constants c 0 , C 0 ą 0. For any T Ď I and a P R, the following bounds hold uniformly in z P Spc 0 , C 0 , aq:ˇˇm where C ą 0 is a constant depending only on τ .
Proof. For µ P I 2 , we havěˇˇm where in the first step we used (5.8), and in the second and third steps we used (5.11). Similarly, using (5.8) and (5.12) we getˇˇˇm Similarly, we can prove the same bounds for m 1 . Then (5.14) can be proved by induction on the indices in T.
The following lemma gives large deviation bounds for bounded supported random variables.
Lemma 5.6 (Lemma 3.8 of [23]). Let px i q, py j q be independent families of centered and independent random variables, and pA i q, pB ij q be families of deterministic complex numbers. Suppose the entries x i , y j have variance at most N´1 and satisfy the bounded support condition (3.1) with q ď N´ε for some constant ε ą 0. Then we have the following bound:ˇˇÿ For the proof of Proposition 5.1, it is convenient to introduce the following random control parameters.

Entrywise local law
The main goal of this subsection is to prove the following entrywise local law. The anisotropic local law (5.2) then follows from the entrywise local law combined with a polynomialization method as we will explain in next subsection.
Proposition 5.8. Suppose the assumptions in Proposition 5.1 hold. Fix C 0 ą 0 and let c 0 ą 0 be a sufficiently small constant. Then for any fixed ε ą 0, the following estimate holds uniformly for z P Spc 0 , C 0 , εq: |G ab pX, zq´Π ab pzq| ă Ψpzq. (5.19) In analogy to [23, Section 3] and [37, Section 5], we introduce the Z variables where E a r¨s :" Er¨| H paq s, i.e. it is the partial expectation over the randomness of the a-th row and column of H. By (5.5), we have and The following lemma plays a key role in the proof of local laws.
Lemma 5.9. Suppose the assumptions in Proposition 5.1 hold. Let c 0 ą 0 be a sufficiently small constant and fix C 0 , ε ą 0. Define the z-dependent event Ξpzq :" tΛpzq ď plog N q´1u. Then there exists constant C ą 0 such that the following estimates hold uniformly for all a P I and z P Spc 0 , C 0 , εq: Proof. Applying Lemma 5.6 to Z i in (5.20), we get that on Ξ, where we used (2.5), (5.11) and the fact that max a,b |G ab | " Op1q on event Ξ. Now by (5.17), (5.18) and the bound (5.14), we have that Together with the fact that q " N´1 {2 À Ψ θ by (3.19), we get (5.22) for 1pΞq|Z i |. Similarly, we can prove the same estimate for 1pΞq|Z µ |, where in the proof we need to use (5.10) and (3.19). If η ě 1, we also have max a,b |G ab | " Op1q by (5.9). Then repeating the above proof, we obtain (5.23) for 1pη ě 1q|Z a |. Similarly, using (5.6) and Lemmas 5.4-5.6, we can prove that 1pΞq p|G ij |`|G µν |q`1pη ě 1q p|G ij |`|G µν |q ă Ψ θ .
A key component of the proof for Proposition 5.8 is an analysis of the self-consistent equation. Recall the equations in (2.9) and the function f pz, αq in (2.15).
Then we prove (5.29). Using the bound 1pη ě 1q max a,b |G ab | " Op1q, we trivially have |m 1 |`|m 2 |`θ " Op1q. Thus we have 1pη ě 1qΨ θ " OpN´1 {2 q. Then (5.14) and (5.23) together give that 1pη ě 1qp|ε i |`|ε µ |q ă N´1 {2 . (5.43) First we claim that in the case η ě 1, with high probability, for some constant c ą 0. By the spectral decomposition (3.5), we have Then applying it to (5.35), G´1 µµ is of order Op1q and has imaginary part ď´η`O ă`N´1 {2˘. This implies Im G µµ Á η with high probability, which gives the second estimate of (5.44) by (2.5). Moreover, with (2.5) we also get that Imp1`σ i m 2 q Á 1 for i ď τ n. Then with (5.34) and a similar argument as above, we obtain the first estimate of (5.44). Next, we claim that in the case η ě 1, with high probability, The following lemma gives the stability of the equation f pz, αq " 0. Roughly speaking, it states that if f pz, m 2 pzqq is small and m 2 pr zq´m 2c pr zq is small for Im r z ě Im z, then m 2 pzq´m 2c pzq is small. For an arbitrary z P Spc 0 , C 0 , εq, we define the discrete set Lpzq :" tzu Y tz 1 P Spc 0 , C 0 , εq : Re z 1 " Re z, Im z 1 P rIm z, 1s X pN´1 0 Nqu.
Lemma 5.11. Let c 0 ą 0 be a sufficiently small constant and fix C 0 , ε ą 0. The self-consistent equation f pz, αq " 0 is stable on Spc 0 , C 0 , εq in the following sense. Suppose the z-dependent function δ satisfies N´2 ď δpzq ď plog N q´1 for z P Spc 0 , C 0 , εq and that δ is Lipschitz continuous with Lipschitz constant ď N 2 . Suppose moreover that for each fixed E, the function η Þ Ñ δpE`iηq is non-increasing for η ą 0. Suppose that u 2 : Spc 0 , C 0 , εq Ñ C is the Stieltjes transform of a probability measure. Let z P Spc 0 , C 0 , εq and suppose that for all z 1 P Lpzq we haveˇˇf pz 1 , u 2 qˇˇď δpz 1 q. (5.46) Then we have for some constant C ą 0 independent of z and N , where κ is defined in (3.9).
Proof. This lemma can proved with the same method as in e.g. [ which gives the diagonal estimate. These bounds can be easily generalized to the case η ě c for any fixed c ą 0. Compared with (5.19), one can see that the bounds (5.49) and (5.50) are optimal for the η ě c case. Now it remains to deal with the small η case (in particular, the local case with η ! 1). We first prove the following weak bound. To get the strong entrywise local law as in (5.19), we need stronger bounds on rZs 1 and rZs 2 in (5.31) and (5.32). They follow from the following fluctuation averaging lemma.
Proof. We suppose that the event Ξ holds. The bound (5.52) can be proved in a similar way as [8,Lemma 4.9] and [22,Theorem 4.7]. Take rZs 1 as an example. The only complication of the proof is that the coefficients σ i {p1`σ i m 2 q 2 are random and depend on i. This can be dealt with by writing, for any i P I 1 , Then we write Now the method to bound the first term in the line (5.53) is only a slight modification of the one in [8] or [22]. For the proof of an even more complicated fluctuation averaging lemma, one can also refer to [64,Lemma 4.9]. Finally, we use that Ξ holds with high probability by Lemma 5.12 to conclude the proof. Now we give the proof of Proposition 5.8.
Proof of Proposition 5.8. By Lemma 5.12, the event Ξ holds with high probability. Then by Lemma 5.12 and Lemma 5.9, we can take uniformly in z P Spc 0 , C 0 , εq, which is a better bound than the one in (5.54). Taking the RHS of (5.57) as the new Φ o , we can obtain an even better bound for Λ o . Iterating the above arguments, we get the bound θ ă pN ηq´ř

Proof of Proposition 5.1
We now can finish the proof of Proposition 5.1 using Proposition 5.8. By (5.38) and (5.58), we have Using the same method as in Lemma 5.13, we can obtain thaťˇˇˇˇ1 Together with (2.10), (3.12) and (5.58), we get that where we used (3.19) in the second step. This proves (5.3). For z P S out pc 0 , C 0 , εq :" Spc 0 , C 0 , εq X tz " E`iη : E ě λ`, N η ? κ`η ě N ε u, we have where we used (3.11) in the second step. Thus by (5.59), to prove (5.4), it suffices to show that In fact, taking Φ o " Φ " Ψ in Lemma 5.13 and then using Lemma 5.11, we get that This finishes the proof of (5.60), and hence (5.4). Finally, with (5.19), one can repeat the polynomialization method in [8,Section 5] to get the anisotropic local law (5.2). The only difference is that one need to use the first bound in (2.5).

Proof of Theorem 3.6: self-consistent comparison
In this section, we finish the proof of Theorem 3.6 for a general X satisfying (3.15), (3.20) and the bounded support condition (3.1) with q ď N´φ for some constant φ ą 0. Proposition 5.1 implies that (3.22) holds for Gaussian X Gauss as discussed below Theorem 3.6. Thus the basic idea of this section is to prove that for X satisfying the assumptions in Theorem 3.6, @ u,`GpX, zq´GpX Gauss , zq˘v D ă q`Ψpzq uniformly for deterministic unit vectors u, v P C I and z P r Spc 0 , C 0 , a, εq. For simplicity of notations, we introduce the following notion of generalized entries. For v, w P C I and a P I, we shall denote G vw :" xv, Gwy, G va :" xv, Ge a y, G aw :" xe a , Gwy, where e a is the standard unit vector along a-th axis. Given vectors x P C I1 and y P C I2 , we always identify them with their natural embeddingsˆx 0˙a ndˆ0 y˙i n C I . The exact meanings will be clear from the context. Now similar to Lemma 5.4, we can prove the following estimates for G.
Lemma 6.1. For i P I 1 and µ P I 2 , we define u i " U˚e i P C I1 and v µ " V˚e µ P C I2 , i.e. u i is the i-th row vector of U and v µ is the µ-th row vector of V . Let x P C I1 and y P C I2 . Then we have All of the above estimates remain true for G pTq instead of G for any T Ď I.
Our proof basically follows the arguments in [37,Section 7] with some modifications. Thus we will not give all the details. We first focus on proving the anisotropic local law (3.22), and the proof of (3.23)-(3.24) will be given at the end of this section. By polarization, to prove (3.22) it suffices to prove that xv, pGpX, zq´Πpzqq vy ă q`Ψpzq (6.7) uniformly in z P r Spc 0 , C 0 , a, εq and any deterministic unit vector v P C I . In fact, we can obtain the more general bound (3.22) by applying (6.7) to the vectors u`v and u`iv, respectively.
The proof consists of a bootstrap argument from larger scales to smaller scales in multiplicative increments of N´δ, where δ Pˆ0, mintε, a, φu 2C a˙. (6.8) Here ε, a ą 0 are the constants in r Spc 0 , C 0 , a, εq, φ ą 0 is a constant such that q ď N´φ, C a ą 0 is an absolute constant that will be chosen large enough in the proof. For any η ě N´1`ε, we define η l :" ηN δl for l " 0, ..., L´1, η L :" 1. (6.9) where L " Lpηq :" max l P N| ηN δpl´1q ă 1 ( . Note that L ď δ´1. By (5.9), the function z Þ Ñ Gpzq´Πpzq is Lipschitz continuous in r Spc 0 , C 0 , a, εq with Lipschitz constant bounded by N 2 . Thus to prove (6.7) for all z P r Spc 0 , C 0 , a, εq, it suffices to show that (6.7) holds for all z in some discrete but sufficiently dense subset S Ă r Spc 0 , C 0 , a, eq. We will use the following discretized domain S. Definition 6.2. Let S be an N´1 0 -net of r Spc 0 , C 0 , a, εq such that |S| ď N 20 and E`iη P S ñ E`iη l P S for l " 1, ..., Lpηq.
The bootstrapping is formulated in terms of two scale-dependent properties (A m ) and (C m ) defined on the subsets S m :" z P S | Im z ě N´δ m ( . pA m q For all z P S m , all deterministic unit vectors x P C I1 and y P C I2 , and all X satisfying the assumptions in Theorem 3.6, we have ImˆG xx pzq z˙`I m G yy pzq ă Im m 2c pzq`N Caδ pq`Ψpzqq. (6.10) pC m q For all z P S m , all deterministic unit vector v P C I , and all X satisfying the assumptions in Theorem 3.6, we have |G vv pzq´Π vv pzq| ă N Caδ pq`Ψpzqq. (6.11) It is trivial to see that pA 0 q holds by (5.9) and (3.11). Moreover, it is easy to observe the following result. Proof. By (3.11), (3.12) and the definition of Π in (3.16), it is easy to get that ImˆΠ xx pzq z˙`I m Π yy pzq À Im m 2c pzq, which finishes the proof.
The key step is the following induction result.
Lemma 6.4. For any 1 ď m ď δ´1, property pA m´1 q implies property pC m q.
Combining Lemmas 6.3 and 6.4, we conclude that (6.11) holds for all w P S. Since δ can be chosen arbitrarily small under the condition (6.8), we conclude that (6.7) holds for all w P S, and (3.22) follows for all z P r Spc 0 , C 0 , a, εq. What remains now is the proof of Lemma 6.4. Denote F v pX, zq :" |G vv pX, zq´Π vv pzq| . (6.12) By Markov's inequality, it suffices to prove the following lemma.
Lemma 6.5. Fix p P N and m ď δ´1. Suppose that the assumptions of Theorem 3.6 and property pA m´1 q hold. Then we have for all z P S m and any deterministic unit vector v.
In the rest of this section, we focus on proving Lemma 6.5. First, in order to make use of the assumption pA m´1 q, which has spectral parameters in S m´1 , to get some estimates for G with spectral parameters in S m , we shall use the following rough bounds for G xy . Lemma 6.6. For any z " E`iη P S and unit vectors x, y P C I , we have |G xy pzq´Π xy pzq| ăN 2δ where x "ˆx 1 x 2˙a nd y "ˆy 1 y 2˙f or x 1 , y 1 P C I1 and x 2 , y 2 P C I2 , and η l is defined in (6.9).
Proof. The proof is the same as the one for [37,Lemma 7.12].
Recall that for a given family of random matrices M, we use M " O ă pζq to mean |xv, Mwy| ă ζ}v} 2 }w} 2 uniformly in any deterministic vectors v and w (see Definition 3.1 (ii)). Lemma 6.7. Suppose pA m´1 q holds, then Gpzq´Πpzq " O ă pN 2δ q, (6.14) and for all z P S m and any deterministic unit vectors x P C I1 and y P C I2 .
Proof. The proof is the same as the one for [37,Lemma 7.13].
Now we are ready to perform the self-consistent comparison. We divide the proof into three subsections. In Sections 6.1-6.2, we prove Lemma 6.5 under the condition for z P Spc 0 , C 0 , εq. Then in Section 6.3, we show how to relax (6.16) to (3.20) for z P r Spc 0 , C 0 , a, εq.
Let tX θ : θ P p0, 1qu be a collection of random matrices such that the following properties hold. For any fixed θ P p0, 1q, pX 0 , X θ , X 1 q is a triple of independent I 1ˆI2 random matrices, and the matrix X θ " pX θ iµ q has law ź iPI1 ź µPI2 ρ θ iµ pdX θ iµ q. (6.17) Note that we do not require X θ1 to be independent of X θ2 for θ 1 ‰ θ 2 P p0, 1q. For λ P R, i P I 1 and µ P I 2 , we define the matrix X θ,λ piµq through´X θ,λ piµq¯j ν :" if pj, νq " pi, µq .
We shall prove Lemma 6.5 through the interpolating matrices $X^\theta$ between $X^0$ and $X^1$. It holds for $X^0$ by Proposition 5.1.

Lemma 6.9. Lemma 6.5 holds if $X=X^0$.
Using (6.17) and the fundamental theorem of calculus, we get the following basic interpolation formula.

Lemma 6.10. For any differentiable function $F:\mathbb R^{\mathcal I_1\times\mathcal I_2}\to\mathbb C$, we have
$$\frac{\mathrm d}{\mathrm d\theta}\,\mathbb E F(X^\theta)=\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\Big[\mathbb E F\big(X^{\theta,X^1_{i\mu}}_{(i\mu)}\big)-\mathbb E F\big(X^{\theta,X^0_{i\mu}}_{(i\mu)}\big)\Big], \qquad (6.18)$$
provided all the expectations exist.
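Formula (6.18) can be checked directly if, as in [37], one takes the linearly interpolated laws $\rho^\theta_{i\mu}:=(1-\theta)\rho^0_{i\mu}+\theta\rho^1_{i\mu}$ in (6.17), where $\rho^u_{i\mu}$ denotes the law of $X^u_{i\mu}$; we spell this choice out here as the natural one consistent with (6.17). Differentiating the product measure entry by entry gives
$$\frac{\mathrm d}{\mathrm d\theta}\,\mathbb E F(X^\theta)=\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\int F\,\big(\rho^1_{i\mu}-\rho^0_{i\mu}\big)\big(\mathrm dX_{i\mu}\big)\prod_{(j,\nu)\ne(i,\mu)}\rho^\theta_{j\nu}\big(\mathrm dX_{j\nu}\big),$$
and integrating the $(i,\mu)$-th entry against $\rho^u_{i\mu}$ while the remaining entries stay distributed as $X^\theta$ is exactly $\mathbb E F\big(X^{\theta,X^u_{i\mu}}_{(i\mu)}\big)$, by the independence of $X^0$, $X^\theta$ and $X^1$. This yields (6.18).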
We shall apply Lemma 6.10 to $F(X)=F^p_{\mathbf v}(X,z)$ with $F_{\mathbf v}(X,z)$ defined in (6.12). The main work is devoted to proving the following self-consistent estimate for the right-hand side of (6.18).

Lemma 6.11. Fix $p\in2\mathbb N$ and $m\le\delta^{-1}$. Suppose (6.16) and $(\mathbf A_{m-1})$ hold. Then we have
$$\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\Big[\mathbb E F^p_{\mathbf v}\big(X^{\theta,X^1_{i\mu}}_{(i\mu)},z\big)-\mathbb E F^p_{\mathbf v}\big(X^{\theta,X^0_{i\mu}}_{(i\mu)},z\big)\Big]=O\Big(\big(N^{C_a\delta}(q+\Psi(z))\big)^{p}+\mathbb E F^p_{\mathbf v}(X^\theta,z)\Big) \qquad (6.19)$$
for all $\theta\in[0,1]$, $z\in\mathbf S_m$ and any deterministic unit vector $\mathbf v$.
Combining Lemmas 6.9-6.11 with a Grönwall argument, we can conclude Lemma 6.5, and hence (6.7) by Markov's inequality. In order to prove Lemma 6.11, we compare $X^{\theta,X^0_{i\mu}}_{(i\mu)}$ and $X^{\theta,X^1_{i\mu}}_{(i\mu)}$ via a common $X^{\theta,0}_{(i\mu)}$, i.e. we will prove that
$$\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\Big[\mathbb E F^p_{\mathbf v}\big(X^{\theta,X^u_{i\mu}}_{(i\mu)},z\big)-\mathbb E F^p_{\mathbf v}\big(X^{\theta,0}_{(i\mu)},z\big)\Big]=\mathcal A_\theta+O\Big(\big(N^{C_a\delta}(q+\Psi(z))\big)^{p}+\mathbb E F^p_{\mathbf v}(X^\theta,z)\Big) \qquad (6.20)$$
for all $u\in\{0,1\}$, $\theta\in[0,1]$, $z\in\mathbf S_m$, and any deterministic unit vector $\mathbf v$, where $\mathcal A_\theta$ is a quantity independent of $u$, so that it cancels when we take the difference of (6.20) between $u=1$ and $u=0$.
Underlying the proof of (6.20) is an expansion approach, which we describe below. During the proof, we always assume that $(\mathbf A_{m-1})$ holds, and the rest of the proof is performed at a fixed $z\in\mathbf S_m$. We define the $\mathcal I\times\mathcal I$ matrix $\Delta^\lambda_{(i\mu)}$ as
$$\Delta^\lambda_{(i\mu)}:=\lambda\begin{pmatrix}0 & \Sigma^{1/2}\mathbf u_i\mathbf v_\mu^*\widetilde\Sigma^{1/2}\\ \widetilde\Sigma^{1/2}\mathbf v_\mu\mathbf u_i^*\Sigma^{1/2} & 0\end{pmatrix}, \qquad (6.21)$$
where we recall the definitions of $\mathbf u_i$ and $\mathbf v_\mu$ in Lemma 6.1. Then we have, for any $\lambda,\lambda'\in\mathbb R$ and $K\in\mathbb N$, the resolvent expansion
$$G^{\theta,\lambda'}_{(i\mu)}=\sum_{k=0}^{K}G^{\theta,\lambda}_{(i\mu)}\Big[\big(\Delta^\lambda_{(i\mu)}-\Delta^{\lambda'}_{(i\mu)}\big)G^{\theta,\lambda}_{(i\mu)}\Big]^{k}+G^{\theta,\lambda'}_{(i\mu)}\Big[\big(\Delta^\lambda_{(i\mu)}-\Delta^{\lambda'}_{(i\mu)}\big)G^{\theta,\lambda}_{(i\mu)}\Big]^{K+1}. \qquad (6.22)$$
The following result provides a priori bounds for the entries of $G^{\theta,\lambda}_{(i\mu)}$.
Lemma 6.12. Suppose that $y$ is a random variable satisfying $|y|\prec q$. Then
$$G^{\theta,y}_{(i\mu)}(z)-\Pi(z)=O_\prec\big(N^{2\delta}\big) \qquad (6.23)$$
for all $z\in\mathbf S_m$.

Proof. The proof is the same as that of [37, Lemma 7.14].
In the following proof, for simplicity of notation, we introduce $f_{(i\mu)}(\lambda):=F^p_{\mathbf v}\big(X^{\theta,\lambda}_{(i\mu)},z\big)$, and we use $f^{(r)}_{(i\mu)}$ to denote the $r$-th derivative of $f_{(i\mu)}$. With Lemma 6.12 and (6.22), it is easy to prove the following result.

Lemma 6.13. Suppose that $y$ is a random variable satisfying $|y|\prec q$. Then for fixed $r\in\mathbb N$,
$$\big|f^{(r)}_{(i\mu)}(y)\big|\prec N^{2\delta(r+p)}. \qquad (6.24)$$

By this lemma, the Taylor expansion of $f_{(i\mu)}$ gives
$$f_{(i\mu)}(y)=\sum_{r=0}^{4p+4}\frac{y^r}{r!}\,f^{(r)}_{(i\mu)}(0)+O_\prec\big(q^{4p+5}N^{2\delta(5p+5)}\big), \qquad (6.25)$$
where the remainder is negligible provided $C_a$ is chosen large enough in (6.8). Therefore we have, for $u\in\{0,1\}$,
$$\mathbb E f_{(i\mu)}\big(X^u_{i\mu}\big)-\mathbb E f_{(i\mu)}(0)=\frac{1}{2N}\,\mathbb E f^{(2)}_{(i\mu)}(0)+\sum_{r=4}^{4p+4}\frac{\mathbb E\big(X^u_{i\mu}\big)^r}{r!}\,\mathbb E f^{(r)}_{(i\mu)}(0)+O_\prec\big(q^{4p+5}N^{2\delta(5p+5)}\big), \qquad (6.26)$$
where we used that $X^u_{i\mu}$ has vanishing first and third moments and its variance is $1/N$. (Note that this is the only place where we need the condition (6.16).) By (3.15) and the bounded support condition, we have
$$\big|\mathbb E\big(X^u_{i\mu}\big)^r\big|\prec N^{-2}q^{r-4},\qquad r\ge4. \qquad (6.27)$$
Thus to show (6.20), we only need to prove, for $r=4,5,\ldots,4p+4$,
$$\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}N^{-2}q^{r-4}\,\big|\mathbb E f^{(r)}_{(i\mu)}(0)\big|\prec\big(N^{C_a\delta}(q+\Psi)\big)^{p}+\mathbb E F^p_{\mathbf v}(X^\theta). \qquad (6.28)$$
In order to get a self-consistent estimate in terms of the matrix $X^\theta$ on the right-hand side of (6.28), we want to replace $X^{\theta,0}_{(i\mu)}$ in $f_{(i\mu)}(0)=F^p_{\mathbf v}\big(X^{\theta,0}_{(i\mu)}\big)$ with $X^\theta=X^{\theta,X^\theta_{i\mu}}_{(i\mu)}$.

Lemma 6.14. Suppose that
$$\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}N^{-2}q^{r-4}\,\big|\mathbb E f^{(r)}_{(i\mu)}\big(X^\theta_{i\mu}\big)\big|\prec\big(N^{C_a\delta}(q+\Psi)\big)^{p}+\mathbb E F^p_{\mathbf v}(X^\theta) \qquad (6.29)$$
holds for $r=4,\ldots,4p+4$. Then (6.28) holds for $r=4,\ldots,4p+4$.
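For the reader's convenience, we record why the elementary bound (6.27) holds, assuming that (3.15) provides the fourth moment bound $\mathbb E|X^u_{i\mu}|^4\prec N^{-2}$ (the natural reading of (3.15), since $x_{i\mu}=N^{-1/2}q_{i\mu}$ with $q_{i\mu}$ of bounded fourth moment): for $r\ge4$, the bounded support condition gives
$$\big|\mathbb E\big(X^u_{i\mu}\big)^r\big|\le\mathbb E\big[|X^u_{i\mu}|^{r-4}\,|X^u_{i\mu}|^4\big]\prec q^{r-4}\,\mathbb E|X^u_{i\mu}|^4\prec N^{-2}q^{r-4}.$$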

6.2 Conclusion of the proof with words
What remains now is to prove (6.29). For simplicity, we abbreviate $X\equiv X^\theta$. In order to exploit the detailed structure of the derivatives on the left-hand side of (6.29), we introduce the following algebraic objects.
Definition 6.15 (Words). Given $i\in\mathcal I_1$ and $\mu\in\mathcal I_2$, let $\mathcal W$ be the set of words of even length in the two letters $\{i,\mu\}$. We denote the length of a word $w\in\mathcal W$ by $2l(w)$ with $l(w)\in\mathbb N$. We use bold symbols to denote the letters of words; for instance, $w=\mathbf t_1\mathbf s_2\mathbf t_2\mathbf s_3\cdots\mathbf t_r\mathbf s_{r+1}$ denotes a word of length $2r$. Define $\mathcal W_r:=\{w\in\mathcal W : l(w)=r\}$ to be the set of words of length $2r$, where each word $w\in\mathcal W_r$ satisfies $\mathbf t_l\mathbf s_{l+1}\in\{i\mu,\mu i\}$ for all $1\le l\le r$.
Next we assign to each letter a value $[\cdot]$ through
$$[\mathbf i]:=\Sigma^{1/2}\mathbf u_i,\qquad [\boldsymbol\mu]:=\widetilde\Sigma^{1/2}\mathbf v_\mu,$$
where $\mathbf u_i$ and $\mathbf v_\mu$ are defined in Lemma 6.1 and are regarded as summation indices. Note that it is important to distinguish the abstract letter from its value, which is a summation index. Finally, to each word $w$ we assign a random variable $\mathcal A_{\mathbf v,i,\mu}(w)$ as follows. If $l(w)=0$ we define
$$\mathcal A_{\mathbf v,i,\mu}(w):=G_{\mathbf v\mathbf v}-\Pi_{\mathbf v\mathbf v}. \qquad (6.30)$$
If $l(w)\ge1$, say $w=\mathbf t_1\mathbf s_2\mathbf t_2\mathbf s_3\cdots\mathbf t_r\mathbf s_{r+1}$, we define
$$\mathcal A_{\mathbf v,i,\mu}(w):=G_{\mathbf v[\mathbf t_1]}\,G_{[\mathbf s_2][\mathbf t_2]}\cdots G_{[\mathbf s_r][\mathbf t_r]}\,G_{[\mathbf s_{r+1}]\mathbf v}. \qquad (6.31)$$
The words are constructed such that, by (6.21) and (6.22),
$$f^{(r)}_{(i\mu)}\big(X_{i\mu}\big)=\sum_{\substack{w_1,\ldots,w_p\in\mathcal W\\ l(w_1)+\cdots+l(w_p)=r}}C_{w_1,\ldots,w_p}\,\prod_{t=1}^{p/2}\Big(\mathcal A_{\mathbf v,i,\mu}(w_t)\,\overline{\mathcal A_{\mathbf v,i,\mu}\big(w_{t+p/2}\big)}\Big)$$
for some combinatorial coefficients $C_{w_1,\ldots,w_p}=O_p(1)$.
Then to prove (6.29), it suffices to show that, for $4\le r\le4p+4$ and all words $w_1,\ldots,w_p\in\mathcal W$ satisfying $l(w_1)+\cdots+l(w_p)=r$,
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}N^{-2}q^{r-4}\,\mathbb E\prod_{t=1}^{p/2}\Big(\mathcal A_{\mathbf v,i,\mu}(w_t)\,\overline{\mathcal A_{\mathbf v,i,\mu}\big(w_{t+p/2}\big)}\Big)\Bigg|\prec\big(N^{C_a\delta}(q+\Psi)\big)^{p}+\mathbb E F^p_{\mathbf v}(X). \qquad (6.32)$$
To avoid the unimportant notational complications associated with the complex conjugates, we will actually prove
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}N^{-2}q^{r-4}\,\mathbb E\prod_{t=1}^{p}\mathcal A_{\mathbf v,i,\mu}(w_t)\Bigg|\prec\big(N^{C_a\delta}(q+\Psi)\big)^{p}+\mathbb E F^p_{\mathbf v}(X); \qquad (6.33)$$
the proof of (6.32) is essentially the same but with slightly heavier notation. Treating empty words separately, we find that it suffices to prove, for $4\le r\le4p+4$, $1\le l\le p$, and words such that $l(w_0)=0$, $\sum_t l(w_t)=r$ and $l(w_t)\ge1$ for $t\ge1$,
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}N^{-2}q^{r-4}\,\mathbb E\bigg[\mathcal A^{p-l}_{\mathbf v,i,\mu}(w_0)\prod_{t=1}^{l}\mathcal A_{\mathbf v,i,\mu}(w_t)\bigg]\Bigg|\prec\big(N^{C_a\delta}(q+\Psi)\big)^{p}+\mathbb E F^p_{\mathbf v}(X). \qquad (6.34)$$
To estimate (6.34) we introduce the quantity
$$R_a:=|G_{\mathbf v\mathbf w_a}|+|G_{\mathbf w_a\mathbf v}| \qquad (6.35)$$
for $a\in\mathcal I$, where $\mathbf w_i:=\Sigma^{1/2}\mathbf u_i$ for $i\in\mathcal I_1$ and $\mathbf w_\mu:=\widetilde\Sigma^{1/2}\mathbf v_\mu$ for $\mu\in\mathcal I_2$.
Lemma 6.16. For $w\in\mathcal W$, we have the rough bound
$$|\mathcal A_{\mathbf v,i,\mu}(w)|\prec N^{2\delta(l(w)+1)}. \qquad (6.36)$$
Furthermore, for $l(w)\ge1$ we have
$$|\mathcal A_{\mathbf v,i,\mu}(w)|\prec N^{2\delta(l(w)-1)}\,R_{\mathbf t_1}R_{\mathbf s_{r+1}}, \qquad (6.37)$$
where $\mathbf t_1$ and $\mathbf s_{r+1}$ denote the first and last letters of $w$. For $l(w)=1$, we have the better bound
$$|\mathcal A_{\mathbf v,i,\mu}(w)|\prec R_iR_\mu. \qquad (6.38)$$

Proof. The estimates (6.36) and (6.37) follow immediately from the rough bound (6.14) and the definition (6.31). The estimate (6.38) follows from the constraint $\mathbf t_1\ne\mathbf s_2$ in the definition (6.31).
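As a concrete illustration of these bounds (an example of ours, using the conventions of Definition 6.15): for the length-one word $w=\mathbf i\boldsymbol\mu$ and the length-two word $w'=\mathbf i\boldsymbol\mu\,\mathbf i\boldsymbol\mu$,
$$\mathcal A_{\mathbf v,i,\mu}(\mathbf i\boldsymbol\mu)=G_{\mathbf v\mathbf w_i}\,G_{\mathbf w_\mu\mathbf v},\qquad \mathcal A_{\mathbf v,i,\mu}(\mathbf i\boldsymbol\mu\,\mathbf i\boldsymbol\mu)=G_{\mathbf v\mathbf w_i}\,G_{\mathbf w_\mu\mathbf w_i}\,G_{\mathbf w_\mu\mathbf v},$$
so (6.38) gives $|\mathcal A_{\mathbf v,i,\mu}(\mathbf i\boldsymbol\mu)|\le R_iR_\mu$ directly, while in the second case the middle factor is only controlled through (6.14), $|G_{\mathbf w_\mu\mathbf w_i}|\prec N^{2\delta}$, which is exactly (6.37) with $l(w)=2$.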
By the pigeonhole principle, if $r\le2l-2$, then there exist at least two words $w_t$ with $l(w_t)=1$: otherwise we would have $r\ge2(l-1)+1=2l-1$. Therefore by Lemma 6.16 we have
$$\bigg|\prod_{t=1}^{l}\mathcal A_{\mathbf v,i,\mu}(w_t)\bigg|\prec N^{2\delta(r+l)}\,R_i^2R_\mu^2. \qquad (6.39)$$
Let $\mathbf v=\binom{\mathbf v_1}{\mathbf v_2}$ for $\mathbf v_1\in\mathbb C^{\mathcal I_1}$ and $\mathbf v_2\in\mathbb C^{\mathcal I_2}$. Then using Lemma 6.1, we get
$$\frac1N\sum_{i\in\mathcal I_1}R_i^2\prec N^{2\delta}\,\frac{\operatorname{Im}m_{2c}(z)+q+\Psi(z)}{N\eta}\prec N^{2\delta}\big(q+\Psi(z)\big)^2, \qquad (6.40)$$
where in the second step we used the two bounds in Lemma 6.7 and $\eta=O(\operatorname{Im}m_{2c})$ by (3.11), and in the last step the definition of $\Psi$ in (3.18). Using the same method we can get
$$\frac1N\sum_{\mu\in\mathcal I_2}R_\mu^2\prec N^{2\delta}\big(q+\Psi(z)\big)^2. \qquad (6.41)$$
Plugging (6.40) and (6.41) into (6.39), we get that the left-hand side of (6.34) is bounded by
$$N^{2\delta(r+l)+4\delta}\,q^{r-4}\big(q+\Psi\big)^{4}\,\mathbb E F^{p-l}_{\mathbf v}(X)\le\big(N^{C_a\delta/2+12\delta}(q+\Psi)\big)^{r}\,\mathbb E F^{p-l}_{\mathbf v}(X), \qquad (6.42)$$
where we used that $l\le r$ and $r\ge4$ in the last step. If we choose $C_a\ge25$, then by (6.8) we have $N^{C_a\delta/2+12\delta}\ll\min\{N^{\phi/2},N^{\varepsilon/2}\}$, and hence $N^{C_a\delta/2+12\delta}(q+\Psi)\ll1$. Moreover, if $r\ge4$ and $r\ge2l-1$, then $r\ge l+2$, and a similar argument shows that the left-hand side of (6.34) is bounded by
$$\big(N^{C_a\delta/2+12\delta}(q+\Psi)\big)^{l+2}\,\mathbb E F^{p-l}_{\mathbf v}(X)$$
in this case. Now (6.34) follows from Hölder's inequality. This concludes the proof of (6.29), and hence of (6.20), and hence of Lemma 6.4. This proves (6.7), and hence (3.22), under the condition (6.16).

6.3 Non-vanishing third moment
In this subsection, we prove Lemma 6.5 under (3.20) for $z\in\widetilde S(c_0,C_0,a,\varepsilon)$. Following the arguments in Sections 6.1 and 6.2, we see that it suffices to prove the estimate (6.29) in the $r=3$ case. In other words, we need to prove the following lemma.

Lemma 6.17. Suppose that (3.20) and the assumptions of Theorem 3.6 hold. Then for all $\theta\in[0,1]$, $z\in\widetilde S(c_0,C_0,a,\varepsilon)$ and any deterministic unit vector $\mathbf v$,
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\mathbb E\big(X_{i\mu}\big)^3\,\mathbb E f^{(3)}_{(i\mu)}\big(X^\theta_{i\mu}\big)\Bigg|\prec\big(N^{C_a\delta}(q+\Psi)\big)^{p}+\mathbb E F^p_{\mathbf v}(X^\theta). \qquad (6.43)$$
Proof. The main new ingredient of the proof is a further iteration step at a fixed $z$. Suppose that
$$|G_{\mathbf v\mathbf v}(X,z)-\Pi_{\mathbf v\mathbf v}(z)|\prec\Phi \qquad (6.44)$$
for some deterministic parameter $\Phi\equiv\Phi_N$ and all deterministic unit vectors $\mathbf v$. By the a priori bound (6.14), we can take $\Phi\le N^{2\delta}$. Assuming (6.44), we shall prove a self-improving bound of the form
$$|G_{\mathbf v\mathbf v}(X,z)-\Pi_{\mathbf v\mathbf v}(z)|\prec N^{-a/2}\,\Phi+N^{C_a\delta}\big(q+\Psi(z)\big). \qquad (6.45)$$
Once (6.45) is proved, we can use it iteratively to get an increasingly accurate bound for $|G_{\mathbf v\mathbf v}(X,z)-\Pi_{\mathbf v\mathbf v}(z)|$: after each step, we obtain a better bound (6.44) with $\Phi$ reduced by a factor $N^{-a/2}$. Hence after $O(a^{-1})$ many iterations we obtain (6.43) (see the explicit iteration recorded after the case analysis below). As in Section 6.2, to prove (6.45) it suffices to show that
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\mathbb E\big(X_{i\mu}\big)^3\,\mathbb E f^{(3)}_{(i\mu)}\big(X^\theta_{i\mu}\big)\Bigg|\prec\big(N^{-a/2}\Phi+N^{C_a\delta}(q+\Psi)\big)^{p}+\mathbb E F^p_{\mathbf v}(X^\theta), \qquad (6.46)$$
which follows from the bound
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\mathbb E\big(X_{i\mu}\big)^3\,\mathbb E\bigg[\mathcal A^{p-l}_{\mathbf v,i,\mu}(w_0)\prod_{t=1}^{l}\mathcal A_{\mathbf v,i,\mu}(w_t)\bigg]\Bigg|\prec\big(N^{-a/2}\Phi+N^{C_a\delta}(q+\Psi)\big)^{l}\,\mathbb E F^{p-l}_{\mathbf v}(X) \qquad (6.47)$$
for words with $l(w_0)=0$ and $\sum_{t=1}^{l}l(w_t)=3$. We now list the three cases $l=1,2,3$ and discuss each case separately.
When l " 1, the single factor A v,i,µ pw 1 q is of the form Then we split it as where we abbreviate r G :" G´Π. For the second term, we have provided δ is small enough, where we used (6.40), (6.44) and the definition (3.21). The third and fourth terms of (6.48) can be dealt with in a similar way. For the first term, we consider the following two cases.
Case 1: $[\mathbf t_1]=\mathbf w_i$ and $[\mathbf s_4]=\mathbf w_\mu$. Then we have
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\mathbb E\big(X_{i\mu}\big)^3\,\mathbb E\Big[\mathcal A^{p-1}_{\mathbf v,i,\mu}(w_0)\,G_{\mathbf v\mathbf w_i}\Pi_{\mathbf w_\mu\mathbf w_\mu}\Pi_{\mathbf w_i\mathbf w_i}G_{\mathbf w_\mu\mathbf v}\Big]\Bigg|\prec N^{C_a\delta}(q+\Psi)\,\mathbb E F^{p-1}_{\mathbf v}(X),$$
where for the sum over $i$ we used
$$\Big|\sum_{i\in\mathcal I_1}\Pi_{\mathbf w_i\mathbf w_i}\,G_{\mathbf v\mathbf w_i}\Big|\prec N^{2\delta}\,N^{1/2}, \qquad (6.50)$$
and for the sum over $\mu$ we used (6.40). To get (6.50), we used the a priori bound (6.44) with $\Phi\le N^{2\delta}$, which gives that for any deterministic unit vectors $\mathbf v$ and $\mathbf w$ (recall Definition 3.1 (ii)), $|\langle\mathbf v,G\,\mathbf w\rangle|\prec N^{2\delta}$.
Applying this estimate with deterministic vectors $\mathbf v$ and $\mathbf w:=\sum_{i\in\mathcal I_1}\Pi_{\mathbf w_i\mathbf w_i}\mathbf w_i$, using $\|\mathbf w\|=O(N^{1/2})$, explains (6.50). If $[\mathbf t_1]=\mathbf w_\mu$ and $[\mathbf s_4]=\mathbf w_i$, the proof is similar.
Case 2: If $[\mathbf t_1]=[\mathbf s_4]$, then at least one of the factors $\Pi_{[\mathbf s_2][\mathbf t_2]}$ and $\Pi_{[\mathbf s_3][\mathbf t_3]}$ must be of the form $\Pi_{\mathbf w_i\mathbf w_\mu}=0$ or $\Pi_{\mathbf w_\mu\mathbf w_i}=0$, and hence the first term of (6.48) vanishes in this case. In sum, we obtain that
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\mathbb E\big(X_{i\mu}\big)^3\,\mathbb E\Big[\mathcal A^{p-1}_{\mathbf v,i,\mu}(w_0)\,\mathcal A_{\mathbf v,i,\mu}(w_1)\Big]\Bigg|\prec\big(N^{-a/2}\Phi+N^{C_a\delta}(q+\Psi)\big)\,\mathbb E F^{p-1}_{\mathbf v}(X),$$
provided that $C_a\ge8$. Together with (6.49), this proves (6.47) for $l=1$.
When l " 2, (6.52) or an expression obtained from one of these four by exchanging w i and w µ . The first expression in (6.51) can be estimated using (6.40) and (6.44):ˇˇÿ and ff , (6.54) where in the second step we applied the same argument to ř µ G wµv Π wµwµ as the one for (6.50). Combining (3.21), (6.53) and (6.54), we get that provided that δ is small enough. The second expression in (6.51) can be estimated similarly. The first expression of (6.52) can be estimated using (3.21), (6.40) and (6.44) as for small enough δ. The second expression in (6.52) is estimated similarly. This proves (6.47) for l " 2. When l " 3, ś 3 t"1 A v,i,µ pw t q is of the form pG vwi G wµv q 3 or an expression obtained by exchanging w i and w µ in some of the three factors. We use (6.40) and Now we conclude (6.47) for l " 3 using (3.21) and N´1 {2 " Opq`Ψq.
If $A$ or $B$ is diagonal, then we can still prove (3.22) for all $z\in S(c_0,C_0,\varepsilon)$ without using (6.16). This follows from an improved self-consistent comparison argument for sample covariance matrices (i.e. separable covariance matrices with $B=I$) in [37, Section 8]. The argument for separable covariance matrices with diagonal $A$ or $B$ is almost the same except for some notational differences, so we omit the details.

6.4 Weak averaged local law
In this subsection, we prove the weak averaged local laws (3.23) and (3.24). The proof is similar to that of (3.22) in the previous subsections, and we only explain the differences. Note that the bootstrapping argument is not necessary, since we already have a good a priori bound by (3.22). In analogy to (6.12), we define
$$F(X,z):=|m(X,z)-m_c(z)|=\bigg|\frac1n\sum_{j\in\mathcal I_1}\frac{G_{jj}(X,z)}{z}-m_c(z)\bigg|,$$
where we used (3.17). Moreover, by Proposition 5.1, we know that (3.23) and (3.24) hold for Gaussian $X$ (without the $q^2$ term). For now, we assume (6.16) and prove the following stronger estimates:
$$|m(z)-m_c(z)|\prec(N\eta)^{-1} \qquad (6.55)$$
for $z\in S(c_0,C_0,\varepsilon)$, and
$$|m(z)-m_c(z)|\prec\frac{1}{N(\kappa+\eta)}+\frac{1}{(N\eta)^2\sqrt{\kappa+\eta}} \qquad (6.56)$$
for $z\in S(c_0,C_0,\varepsilon)\cap\{z=E+i\eta : E\ge\lambda_+,\ N\eta\sqrt{\kappa+\eta}\ge N^{\varepsilon}\}$. At the end of this subsection, we will show how to relax (6.16) to (3.20) for $z\in\widetilde S(c_0,C_0,a,\varepsilon)$. Note that
$$\Psi^2(z)\lesssim\frac{1}{N\eta},\qquad\text{and}\qquad \Psi^2(z)\lesssim\frac{1}{N\sqrt{\kappa+\eta}}+\frac{1}{(N\eta)^2}\ \text{ for }E\ge\lambda_+.$$
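Both bounds on $\Psi^2$ are elementary consequences of the definition in (3.18) (in the standard form $\Psi^2\asymp\frac{\operatorname{Im}m_{2c}}{N\eta}+\frac{1}{(N\eta)^2}$, which we assume here) and the estimates on $\operatorname{Im}m_{2c}$ in (3.11):
$$\operatorname{Im}m_{2c}=O(1)\ \Rightarrow\ \Psi^2\lesssim\frac1{N\eta}+\frac1{(N\eta)^2}\lesssim\frac1{N\eta};\qquad \operatorname{Im}m_{2c}\asymp\frac{\eta}{\sqrt{\kappa+\eta}}\ \text{for }E\ge\lambda_+\ \Rightarrow\ \Psi^2\lesssim\frac1{N\sqrt{\kappa+\eta}}+\frac1{(N\eta)^2},$$
where we also used $N\eta\ge N^{\varepsilon}\ge1$ on the domain.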
Then we prove the averaged local law (3.23) for $z\in\widetilde S(c_0,C_0,a,\varepsilon)$, and (3.24) for $z\in\widetilde S(c_0,C_0,a,\varepsilon)\cap\{z=E+i\eta : E\ge\lambda_+,\ N\eta\sqrt{\kappa+\eta}\ge N^{\varepsilon}\}$, under (3.20). By (6.57), it suffices to prove
$$\mathbb E\,F^{2p}(X,z)\prec\big(N^{\delta}\big(q^2+\Psi^2(z)\big)\big)^{2p} \qquad (6.62)$$
for any small constant $\delta>0$. Analogous to the arguments in Section 6.3, it reduces to showing that
$$\Bigg|\sum_{i\in\mathcal I_1}\sum_{\mu\in\mathcal I_2}\mathbb E\big(X_{i\mu}\big)^3\,\mathbb E\prod_{t=1}^{l}\bigg(\frac1n\sum_{j\in\mathcal I_1}\mathcal A_{\mathbf e_j,i,\mu}(w_t)\bigg)\Bigg|\prec q^2+\Psi^2, \qquad (6.63)$$
where $l\in\{1,2,3\}$ is the number of words with nonzero length. Then we can discuss these three cases using a similar argument as in Section 6.3, with the only difference being that we can now use the anisotropic local law (3.22) instead of the a priori bounds (6.23) and (6.44).
In the l " 1 case, we first consider the expression A ej ,i,µ pw 1 q " G jwi G wµwµ G wiwi G wµj . We havěˇˇˇˇÿ where we used (3.22) and (6.40). Similarly, we also havěˇˇˇˇÿ where we also used Π wµj " 0 for any µ in the second step. Then with (3.21), we can see that the LHS of (6.63) is bounded by O ă pq 2`Ψ2 q in this case. For the case A ej ,i,µ pw 1 q " G jwi G wµwµ G wiwµ G wij , we can estimate thaťˇˇˇˇÿ Thus in this case the LHS of (6.63) is also bounded by O ă pq 2`Ψ2 q. The case A ej ,i,µ pw 1 q " G jwi G wµwi G wµwµ G wij can be handled similarly. Finally in the case A ej ,i,µ pw 1 q " G jwi G wµwi G wµwi G wµj , we can estimate thaťˇˇˇˇÿ Again in this case the LHS of (6.63) is bounded by O ă pq 2`Ψ2 q. All the other expressions are obtained from these four by exchanging w i and w µ .
In the l " 2 case, t"1´1 n ř jPI1 A ej ,i,µ pw t q¯is of the form (up to some constant coefficients) or an expression obtained from one of these terms by exchanging w i and w µ . These two expressions can be written as For the second term, using (3.4), (3.5) and recalling that Y " Σ 1{2 U˚XV r Σ 1{2 , we can get thaťˇˇˇˇ1 and then use (6.40).) Thus for the first term in (6.64), we havěˇˇˇˇ1 where in the last step we used the bound in (6.65). Now using (6.65), (6.67) and (3.21), we get Finally, in the l " 3 case, t"1´1 N ř jPI1 A ej ,i,µ pw t q¯is of the form N´3pGˆ2q 3 wiwµ , or an expression obtained by exchanging w i and w µ in some of the three factors. Using (6.66) and the bound in (6.65), we can estimate that Then the LHS of (6.63) is bounded by Combining the above three cases l " 1, 2, 3, we conclude (6.62), which finishes the proof of (3.23) and (3.24).
If $A$ or $B$ is diagonal, then by the remark at the end of Section 6.3, the anisotropic local law (3.22) holds for all $z\in S(c_0,C_0,\varepsilon)$ even in the case with $b_N=N^{1/2}$ in (3.20). Then with (3.22) and the self-consistent comparison argument in [37, Section 9], we can prove (3.23) and (3.24) for $z\in S(c_0,C_0,\varepsilon)$. Again most of the arguments are the same as the ones in [37, Section 9], hence we omit the details.
7 Proof of Lemma 3.7, Theorem 3.8 and Theorem 3.10

With Lemma 3.12, given $X$ satisfying the assumptions in Theorem 3.6, we can construct a matrix $\widetilde X$ that has support $q=N^{-1/2}$ and the same first four moments as $X$. By Theorem 3.6, the averaged local laws (3.26) and (3.27) hold for $G(\widetilde X,z)$. Thus it is easy to see that Theorem 3.8 is implied by the following lemma.
Lemma 7.1. Let $X,\widetilde X$ be two matrices as in Lemma 3.12, and $G=G(X,z)$, $\widetilde G=G(\widetilde X,z)$ be the corresponding resolvents. We denote $m(z)=m(X,z)$ and $\widetilde m(z)=m(\widetilde X,z)$. Fix any constant $\varepsilon>0$. For any $z\in\widetilde S(c_0,C_0,a,\varepsilon)$, if there exist deterministic quantities $J=J(N)$ and $K=K(N)$ such that
$$\widetilde G(z)-\Pi=O_\prec(J),\qquad |\widetilde m(z)-m_c(z)|\prec K,\qquad J+K\le1, \qquad (7.1)$$
then we have, for any fixed $p\in\mathbb N$,
$$\mathbb E\big|m(z)-m_c(z)\big|^{2p}\prec\mathbb E\big|\widetilde m(z)-m_c(z)\big|^{2p}+\big(\Psi^2(z)+J^2+K\big)^{2p}. \qquad (7.2)$$

Proof of Theorem 3.8. First, we have $\lambda_1\le C$ with high probability for some constant $C>0$ by e.g. [14, Lemma 3.12]. Now we pick $\varepsilon$ to be a sufficiently small constant such that $0<\varepsilon\le c/4$, and $C_0$ to be a sufficiently large constant such that $C_0\lambda_+>C$. Set $\eta=N^{-2/3}$ and choose $E=\lambda_++\kappa$ outside of the spectrum with some $\kappa\ge N^{-2/3+2\varepsilon}=N^{2\varepsilon}\eta$. Then using (3.11), (3.18) and $b_N\le N^{1/3-c}$, we can verify that the right-hand side of (3.27) is bounded by $N^{-\varepsilon}(N\eta)^{-1}$. Then using (3.27), we get that
$$|m(z)-m_c(z)|\prec\frac{N^{-\varepsilon}}{N\eta}.$$
On the other hand, if there is an eigenvalue $\lambda_j$ satisfying $|\lambda_j-E|\le\eta$ for some $1\le j\le n$, then
$$\operatorname{Im}m(z)=\frac1n\sum_{k}\frac{\eta}{(\lambda_k-E)^2+\eta^2}\ge\frac1n\cdot\frac{\eta}{(\lambda_j-E)^2+\eta^2}\ge\frac{1}{2n\eta}\gtrsim\frac{1}{N\eta}.$$
On the other hand, by (3.11) we have
$$\operatorname{Im}m_c(z)=O\Big(\frac{\eta}{\sqrt{\kappa+\eta}}\Big)=O\Big(\frac{N^{-\varepsilon}}{N\eta}\Big).$$
Combining the above three estimates, we conclude that with high probability there is no eigenvalue $\lambda_j$ with $\lambda_j-\lambda_+\ge N^{-2/3+2\varepsilon}$, which gives Theorem 3.8.
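The last bound is elementary arithmetic with the chosen scales; we spell it out. With $\eta=N^{-2/3}$ and $\kappa\ge N^{-2/3+2\varepsilon}$,
$$\frac{\eta}{\sqrt{\kappa+\eta}}\le\frac{N^{-2/3}}{\sqrt{N^{-2/3+2\varepsilon}}}=N^{-2/3+1/3-\varepsilon}=\frac{N^{-\varepsilon}}{N^{1/3}}=\frac{N^{-\varepsilon}}{N\eta}.$$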
In order to prove Lemma 3.7 and Lemma 7.1, we will extend the resolvent comparison method developed in [14,40]. The basic idea is still to use the Lindeberg replacement strategy for $G(X,z)$. The main difference is that in [14,40] the resolvent estimates were obtained only from the entrywise local law, while in our case we need to use the more general anisotropic local law (3.22). (We will use the anisotropic local law in (7.1) when proving Lemma 7.1; however, for simplicity of presentation, we will always refer to (3.22).) We remark that the following proof is similar to the one in [14, Section 6], and involves some tedious notational bookkeeping. We shall first give the notations and definitions adapted to our setting. The proof of the results is then a straightforward extension of the one in [14], using the correct notations and applying the stronger anisotropic local law (3.22). Hence we will only state several key lemmas needed for the argument, without presenting all the details of the proof.
Let X " px iµ q and r X " pr x iµ q be two matrices as in Lemma 3.12. Define a bijective ordering map Φ on the index set of X as Φ : tpi, µq : 1 ď i ď n, n`1 ď µ ď n`N u Ñ t1, . . . , γ max " nN u.
We denote the $l$-th power of $A$ under the $*_\gamma$-product by $A^{*_\gamma l}$, i.e.
$$A^{*_\gamma l}:=\underbrace{A*_\gamma A*_\gamma\cdots*_\gamma A}_{l\ \text{factors}}.$$
Following [14, Section 6], one then expands the resolvents along the replacement at $\gamma$ into finitely many terms indexed by $\mathbf k$, with coefficients $A_{\mathbf k}$ and $\widetilde A_{\mathbf k}$, where the $\widetilde A_{\mathbf k}$'s do not depend on $(u_t,v_t)$, $1\le t\le p$, and
$$|\widetilde A_{\mathbf k}|\le N^{-|\mathbf k|\phi/10}. \qquad (7.24)$$
Note that the terms $A_{\mathbf k}$, $\widetilde A_{\mathbf k}$, $\mathcal A_{\mathbf k}$ and $\widetilde{\mathcal A}_{\mathbf k}$ do depend on $\gamma$, and we have omitted this dependence in the above expressions.
For the details of the above derivation, we refer the reader to the arguments between (6.25) and (6.31) in [14]. The above estimate still holds if we replace some of the $G$ entries with $\widetilde G$ entries, since we only need the absolute bounds for the resolvent entries. Of course, using (7.28) instead of (7.27), we can obtain a similar estimate with the roles of $G$ and $\widetilde G$ exchanged.

Proof of Lemma 3.7. The proof of this lemma is similar to the one for [14, Lemma 3.17], where the main difference lies in the estimate (7.34) below. We apply (7.31) to $(G-\Pi)_{\mathbf u\mathbf v}\overline{(G-\Pi)_{\mathbf u\mathbf v}}$ with $p=2$ and $r=3$.
Recall that $\widetilde X$ has bounded support $q=N^{-1/2}$. Then by (3.22) and Lemma 3.2, we have
$$\mathbb E\big|(\widetilde G-\Pi)_{\mathbf u\mathbf v}\big|^2\prec\Psi^2(z). \qquad (7.32)$$
Moreover, by (3.19) the remainder term $O_\prec(N^{-r+2})=O_\prec(N^{-1})$ in (7.31) is negligible. Hence it remains to handle the second term on the right-hand side of (7.31), namely the sum of expectations of products of resolvent entries produced by the expansion, which we denote by (7.33).

For each product in (7.33), $\mathbf v$ appears exactly twice in the indices of $G$. These two $\mathbf v$'s appear as $G_{\mathbf v\mathbf w_a}G_{\mathbf w_b\mathbf v}$ in the product, where $\mathbf w_a,\mathbf w_b$ come from some $\gamma_k$ and $\gamma_l$ via $P$. Let $\mathbf v=\binom{\mathbf v_1}{\mathbf v_2}$ for $\mathbf v_1\in\mathbb C^{\mathcal I_1}$ and $\mathbf v_2\in\mathbb C^{\mathcal I_2}$.
By Lemma 6.1, after taking the averages $N^{-2}\sum_{\gamma_k}$ and $N^{-2}\sum_{\gamma_l}$, the term $G_{\mathbf v\mathbf w_a}G_{\mathbf w_b\mathbf v}$ contributes a factor
$$O_\prec\Bigg(\frac{\operatorname{Im}\big(z^{-1}G^0_{\mathbf v_1\mathbf v_1}\big)+\operatorname{Im}\big(G^0_{\mathbf v_2\mathbf v_2}\big)+\eta\big|G^0_{\mathbf v_1\mathbf v_1}\big|+\eta\big|G^0_{\mathbf v_2\mathbf v_2}\big|}{N\eta}\Bigg)=O_\prec\Big(\frac{\operatorname{Im}m_{2c}+\Psi(z)}{N\eta}\Big)=O_\prec\big(\Psi^2(z)\big), \qquad (7.34)$$
where we used (3.22). All the other $G$ factors in the product are controlled by $O_\prec(1)$ using (7.14).
Proof of Lemma 7.1. The proof of this lemma is similar to the one for [14, Lemma 5.2], where the main difference lies in (7.39) below. For simplicity, we shall prove that
$$\big|\mathbb E(m(z)-m_c(z))^p\big|\prec\big|\mathbb E(\widetilde m(z)-m_c(z))^p\big|+\big(\Psi^2(z)+J^2+K\big)^p. \qquad (7.35)$$
The proof of (7.2) is exactly the same but with slightly heavier notations (in the product of $p$ terms, half of them are the original factors and the other half are their complex conjugates). We define a function of coefficients
$$f(I,J)=n^{-p}\prod_{t=1}^{p}\delta_{i_tj_t},\qquad I=(i_1,i_2,\cdots,i_p)\in\mathcal I_1^p,\quad J=(j_1,j_2,\cdots,j_p)\in\mathcal I_1^p.$$
It is easy to check that
$$\sum_{I,J}f(I,J)\,\mathbb E\prod_{t=1}^{p}\big(G^\alpha-\Pi\big)_{i_tj_t}=\mathbb E\big(m^\alpha-m_c\big)^p,\qquad \alpha=0,\gamma_{\max}.$$
Now to conclude (7.35), it suffices to control the second term on the right-hand side of (7.37). We consider the terms $P_{\gamma_l,k_l}\cdots P_{\gamma_1,k_1}$ for $k_1,\ldots,k_l$ satisfying (7.30). For each product in (7.38) and any $1\le t\le p$, there are two $i_t$'s in the indices of $G$. These two $i_t$'s can only appear as (1) $(\widetilde G-\Pi)_{i_ti_t}$ in the product, or (2) $\widetilde G_{i_t\mathbf w_a}\widetilde G_{\mathbf w_bi_t}$, where $\mathbf w_a,\mathbf w_b$ come from some $\gamma_k$ and $\gamma_l$ via $P$. Then after taking the average $n^{-p}\sum_{i_1,\cdots,i_p}$, this term becomes either (1) $\widetilde m-m_c$, which is bounded by $K$ by (7.1), or (2) $n^{-1}\sum_{i_t}\widetilde G_{i_t\mathbf w_a}\widetilde G_{\mathbf w_bi_t}$, which is bounded as in (7.34), contributing the $\Psi^2+J^2$ term. The other $G$ entries in the product with no $i_t$ are simply bounded by $O_\prec(1)$ using (7.1). Then for any fixed $\gamma_1,\ldots,\gamma_l$ and $k_1,\ldots,k_l$, we have proved that the corresponding average is $O_\prec\big((\Psi^2(z)+J^2+K)^p\big)$. Together with (7.37), this concludes (7.35).
Finally, we give the proof of Theorem 3.10, which is similar to the one for [14, Theorem 3.16]. We only outline the proof by stating the key lemmas. The matrix $\widetilde X$ constructed in Lemma 3.12 satisfies the edge universality by the following lemma.

Lemma 7.6. Let $X^{(1)}$ and $X^{(2)}$ be the data matrices of two separable covariance matrices satisfying the assumptions in Theorem 3.6 and the bounded support condition (3.1) with $q=N^{-1/2}$. Suppose $b_N\le N^{1/3-c}$ for some constant $c>0$. Then there exist constants $\varepsilon,\delta>0$ such that for any $s\in\mathbb R$, we have
$$\mathbb P^{(1)}\big(N^{2/3}(\lambda_1-\lambda_+)\le s-N^{-\varepsilon}\big)-N^{-\delta}\le\mathbb P^{(2)}\big(N^{2/3}(\lambda_1-\lambda_+)\le s\big)\le\mathbb P^{(1)}\big(N^{2/3}(\lambda_1-\lambda_+)\le s+N^{-\varepsilon}\big)+N^{-\delta}, \qquad (7.41)$$
where $\mathbb P^{(1)}$ and $\mathbb P^{(2)}$ denote the laws of $X^{(1)}$ and $X^{(2)}$, respectively.
Proof. The proof of this lemma is similar to the ones in [20, Section 6], [24, Section 6], [53, Section 4] and [37, Section 10]. The main argument involves a routine application of the Green's function comparison method near the edge (as in Lemma 7.8), developed in [24, Section 6] and [53, Section 4]. The proofs there can be easily adapted to our case using the anisotropic local law (Theorem 3.6), the rigidity of eigenvalues (Theorem 3.8), and the resolvent identities in Lemma 5.3 and Lemma 6.1.

Now it is easy to see that Theorem 3.10 follows from the following comparison lemma.

Lemma 7.7. Let $X$ and $\widetilde X$ be two matrices as in Lemma 3.12. Suppose $b_N\le N^{1/3-c}$ for some constant $c>0$. Then there exist constants $\varepsilon,\delta>0$ such that, for any $s\in\mathbb R$, we have
$$\mathbb P_{\widetilde X}\big(N^{2/3}(\lambda_1-\lambda_+)\le s-N^{-\varepsilon}\big)-N^{-\delta}\le\mathbb P_{X}\big(N^{2/3}(\lambda_1-\lambda_+)\le s\big)\le\mathbb P_{\widetilde X}\big(N^{2/3}(\lambda_1-\lambda_+)\le s+N^{-\varepsilon}\big)+N^{-\delta}, \qquad (7.42)$$
where $\mathbb P_X$ and $\mathbb P_{\widetilde X}$ are the laws for $X$ and $\widetilde X$, respectively.
To prove Lemma 7.7, it suffices to prove the following Green's function comparison result, Lemma 7.8. Its proof is the same as that of [14, Lemma 5.5], so we omit the details.
Proof of Lemma 7.7. Although not explicitly stated, it was shown in [24] that if Theorem 3.8 and Lemma 7.8 hold, then the edge universality (7.42) holds. More precisely, in Section 6 of [24], the edge universality problem was reduced to proving Theorem 6.3 of [24], which corresponds to our Lemma 7.8. For this conversion to work, only the averaged local law and the rigidity of eigenvalues are used, and these correspond to the statements in our Theorem 3.8.