Asymptotic Freeness for Rectangular Random Matrices and Large Deviations for Sample Covariance Matrices With Sub-Gaussian Tails

We establish a large deviation principle for the empirical spectral measure of a sample covariance matrix with sub-Gaussian entries, which extends Bordenave and Caputo's result for Wigner matrices with the same type of entries [7]. To this end, we need to establish an asymptotic freeness result for the rectangular free convolution; more precisely, we give a bound in the subordination formula for information-plus-noise matrices.


Introduction
Throughout this paper, $\mathcal{P}(E)$ will denote the set of probability measures on a space $E$, $M_{n,p}(\mathbb{R})$ (resp. $M_{n,p}(\mathbb{C})$) the set of $n \times p$ real (resp. complex) matrices, $H_n(\mathbb{C})$ the set of $n \times n$ Hermitian matrices, $A^t$ (resp. $A^*$) the transpose (resp. conjugate transpose) of a matrix $A$, and $\operatorname{Tr}(A)$ its trace. Besides, for a random variable $X$, $\bar X$ denotes the centred variable $X - \mathbb{E}(X)$. Finally, for two real numbers $x, y$, we denote by $x \wedge y$ the minimum of $x$ and $y$.

Large deviation results in random matrix theory
Let us first recall some basic facts in random matrix theory (RMT). A key object in RMT is the empirical spectral measure of a matrix $A \in H_n(\mathbb{C})$, namely the probability measure on $\mathbb{R}$ defined by
$$\mu_A = \frac{1}{n} \sum_{j=1}^n \delta_{\lambda_j(A)},$$
where $\lambda_1(A), \dots, \lambda_n(A)$ denote the eigenvalues of $A$. It is well known (cf. [19]) that if $X$ is a Wigner matrix, i.e. $X \in H_n(\mathbb{C})$ and the families of centred independent and identically distributed (i.i.d.) random variables $(X_{j,j})_{1 \le j \le n}$, $(X_{j,k})_{1 \le j < k \le n}$ are independent, and if the variance $\operatorname{Var}(X_{1,2}) = \mathbb{E}|X_{1,2} - \mathbb{E}(X_{1,2})|^2$ equals 1, then almost surely, the spectral measure $\mu_{X/\sqrt n}$ converges weakly towards the semicircular distribution $\mu_{sc}$, i.e. for any bounded continuous $f : \mathbb{R} \to \mathbb{R}$,
$$\int f \, d\mu_{X/\sqrt n} \xrightarrow[n \to +\infty]{} \int f \, d\mu_{sc} \quad \text{a.s.}$$
The semicircular distribution $\mu_{sc}$ is the probability measure on $\mathbb{R}$ defined by
$$d\mu_{sc}(x) = \frac{1}{2\pi} \sqrt{4 - x^2}\, \mathbf{1}_{[-2,2]}(x)\, dx.$$
In the case of a sample covariance matrix, i.e. a matrix $XX^*$ with $X \in M_{n,p}(\mathbb{C})$ having centred i.i.d. entries, if $\operatorname{Var}(X_{1,1}) = 1$, then almost surely, the spectral measure $\mu_{XX^*/p}$ converges weakly towards the Marcenko-Pastur distribution $\mu_{MP,c}$ with ratio $c$ as $n, p \to +\infty$ with $\frac{n}{p} \to c \in (0, +\infty)$ (cf. [15]). This probability measure on $\mathbb{R}$ is defined by
$$d\mu_{MP,c}(x) = \left(1 - \frac{1}{c}\right)_+ \delta_0 + \frac{1}{2\pi c x} \sqrt{(b_c - x)(x - a_c)}\, \mathbf{1}_{[a_c, b_c]}(x)\, dx,$$
with $a_c = (1 - \sqrt c)^2$ and $b_c = (1 + \sqrt c)^2$. For these two models in which the empirical spectral measure converges, we can investigate the speed of convergence and more particularly large deviation principles.
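These two convergence results are easy to observe numerically. Below is a minimal sketch, under the simplifying assumption of standard Gaussian entries and with arbitrary matrix sizes, checking low moments of the two limiting laws: the semicircular law has second moment 1 and fourth moment 2, and the Marcenko-Pastur law has mean 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical spectral measure of a real symmetric Wigner matrix X/sqrt(n):
# its moments should approach those of the semicircular law (m2 = 1, m4 = 2).
n = 400
G = rng.standard_normal((n, n))
X = (G + G.T) / np.sqrt(2)          # off-diagonal entries have variance 1
eig_w = np.linalg.eigvalsh(X / np.sqrt(n))
m2_w, m4_w = np.mean(eig_w**2), np.mean(eig_w**4)

# Sample covariance matrix XX^t/p with n/p -> c = 0.5:
# the mean eigenvalue should approach 1 (first moment of Marcenko-Pastur).
p = 2 * n
Y = rng.standard_normal((n, p))
eig_mp = np.linalg.eigvalsh(Y @ Y.T / p)
m1_mp = np.mean(eig_mp)

print(m2_w, m4_w, m1_mp)
```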
We recall from [9] that a sequence of random variables $(Z_n)_{n \ge 1}$ with values in a topological space $(E, \mathcal{O})$ with Borel $\sigma$-field $\mathcal{B}$ satisfies the large deviation principle (LDP) with speed $v$ and rate function $I$ in the topology $\mathcal{O}$ if
• $I : E \to [0, +\infty]$ is a lower semi-continuous function, i.e. the level set $\{x \in E \mid I(x) \le t\}$ is closed for every $t \ge 0$,
• $v : \mathbb{N} \to (0, +\infty)$ admits a limit equal to $+\infty$,
• for every $B \in \mathcal{B}$,
$$-\inf_{\operatorname{Int}(B)} I \;\le\; \liminf_{n \to +\infty} \frac{1}{v(n)} \log \mathbb{P}(Z_n \in B) \;\le\; \limsup_{n \to +\infty} \frac{1}{v(n)} \log \mathbb{P}(Z_n \in B) \;\le\; -\inf_{\operatorname{Clo}(B)} I,$$
where $\operatorname{Int}(B)$ and $\operatorname{Clo}(B)$ denote resp. the interior and the closure of $B$.
We also recall that the rate function I is said to be good if the level set {x ∈ E | I(x) ≤ t} is compact for every t ≥ 0.
In [4], Ben Arous and Guionnet proved that if $X$ is in the GUE, i.e. $X$ is a Wigner matrix and $X_{1,1}$ (resp. $X_{1,2}$) has law $\mathcal{N}(0, 1)$ (resp. $\mathcal{N}_2(0, \frac{1}{2} I_2)$), then $\mu_{X/\sqrt n}$ satisfies a LDP in $\mathcal{P}(\mathbb{R})$ at speed $n^2$, with an explicit rate function given by a weighted logarithmic energy (see [4]). This result was extended to LUE matrices, i.e. sample covariance matrices $XX^*$ where $X$ has standard Gaussian entries, by Hiai and Petz (see [14]). Note that in fact, these two LDPs do not concern only Gaussian matrices but also more general unitarily invariant models. They strongly rely on the fact that for the considered models, the joint distribution of the eigenvalues has an explicit form, which is also the case in [12]. In [7], Bordenave and Caputo managed to obtain a LDP for Wigner matrices in another case, where the distribution of the $X_{j,k}$'s has sub-Gaussian tails. This is remarkable because here the joint distribution of the eigenvalues is unknown. Let us recall their result.

Definition 1.1. For $\alpha > 0$ and $a \in (0, +\infty]$, we denote by $\mathcal{S}_\alpha(a)$ the class of complex random variables $Z$ such that
$$\lim_{t \to +\infty} -t^{-\alpha} \log \mathbb{P}(|Z| \ge t) = a \quad (1)$$
and such that $|Z|$ and $Z/|Z|$ are independent for large values of $|Z|$, i.e. there exist $t_0 > 0$ and a probability measure $\vartheta_a$ on the unit circle $\mathbb{S}^1$ such that for all $t \ge t_0$ and all measurable sets $U \subset \mathbb{S}^1$, we have
$$\mathbb{P}\big(Z/|Z| \in U,\ |Z| \ge t\big) = \vartheta_a(U)\, \mathbb{P}(|Z| \ge t). \quad (2)$$
In particular, a real random variable $Z$ belongs to $\mathcal{S}_\alpha(a)$ if it satisfies (1) and there exist $t_0 > 0$ and a probability measure $\vartheta_a$ on $\{-1, 1\}$ such that for all $t \ge t_0$ and all $U \subset \{-1, 1\}$, we have
$$\mathbb{P}\big(\operatorname{sign}(Z) \in U,\ |Z| \ge t\big) = \vartheta_a(U)\, \mathbb{P}(|Z| \ge t).$$
Note that the first hypothesis implies that a random variable in $\mathcal{S}_\alpha(a)$ has finite moments of all orders.

Theorem 1.2 (see [7, Theorem 1.1]). Let $X$ be a Wigner matrix with $X_{1,2} \in \mathcal{S}_\alpha(a)$ and $X_{1,1} \in \mathcal{S}_\alpha(b)$ for some $\alpha \in (0, 2)$ and $a, b \in (0, +\infty]$. Then the spectral measure $\mu_{X/\sqrt n}$ satisfies the LDP with speed $n^{1+\alpha/2}$ and a good rate function expressed in terms of a functional $\Phi : \mathcal{P}(\mathbb{R}) \to [0, +\infty]$, itself a good rate function (see [7] for further details), and of the free convolution $\boxplus$ (see Section 1.2).
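As a concrete, hypothetical example of a real random variable in the class $\mathcal{S}_\alpha(a)$ of Definition 1.1, one may take $|Z|$ with the Weibull-type tail $\mathbb{P}(|Z| \ge t) = e^{-a t^\alpha}$ and an independent Rademacher sign, playing the role of the measure $\vartheta_a$. The following sketch checks the tail condition (1) empirically; the parameter values are arbitrary.

```python
import numpy as np

alpha, a = 1.5, 2.0
rng = np.random.default_rng(1)

def sample_S_alpha(size):
    # |Z| has tail P(|Z| >= t) = exp(-a t^alpha) (inverse-CDF sampling),
    # and sign(Z) is an independent uniform sign (theta_a on {-1, 1}).
    u = rng.random(size)
    radius = (-np.log(u) / a) ** (1 / alpha)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * radius

z = sample_S_alpha(200_000)

# Empirical check of condition (1): -t^{-alpha} log P(|Z| >= t) should be near a.
t = 2.0
tail = np.mean(np.abs(z) >= t)
rate = -np.log(tail) / t**alpha
print(rate)
```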
Let us make a few remarks about this result. Roughly speaking, after random matrix considerations, the proof of Theorem 1.2 consists in proving a LDP for some random graphs associated to the Wigner matrix $X$. Therefore, the rate function $\Phi$ is expressed as a supremum of functions of probability measures on graphs and it cannot be computed in general. However, in some particular cases, it is possible to compute $\Phi(\nu)$: for example, if $\nu$ is a symmetric distribution on $\mathbb{R}$ with $b < \infty$ and a suitable condition on its support, then $\Phi(\nu)$ is expressed in terms of the $\alpha$-th moment $m_\alpha(\nu)$ of $\nu$. Theorem 1.7 below will extend Theorem 1.2 to sample covariance matrices $XX^*$ with $X_{1,1} \in \mathcal{S}_\alpha(a)$ for some $\alpha \in (0, 2)$, $a \in (0, +\infty]$. Note that to simplify, we will assume that $X$ is a real random matrix.
Let us mention here that LDPs for the top eigenvalue of Wigner matrices have also been obtained in Ben Arous and Guionnet's setting, see [1, p. 81], and for the model introduced by Bordenave and Caputo in [2].

Deformed matrix models
After understanding the behaviour of the spectral measure of Wigner matrices or sample covariance matrices, the question of deformations of these models has been investigated. Several types of deformations have been studied, the main ones being matrices of the type $X + A$ with $A \in H_n(\mathbb{C})$ (additive deformation), $\Sigma^{1/2} X X^* \Sigma^{1/2}$ with $\Sigma \in H_n(\mathbb{C})$ positive definite (multiplicative deformation), or $(X + A)(X + A)^*$ with $A \in M_{n,p}(\mathbb{C})$ (information-plus-noise model).
A tool to study the spectral measure of a deformation is free probability, and more particularly free convolutions. Let us recall their definitions.

Theorem 1.3 (see [18]). Let $A, B$ be two independent $n \times n$ Hermitian random matrices such that
• either $A$ or $B$ is unitarily invariant, i.e. for $M = A$ or $B$ and any unitary $U \in M_n(\mathbb{C})$, $UMU^*$ has the same law as $M$,
• $\mu_A$ and $\mu_B$ converge weakly in probability to some distributions $\mu_1$ and $\mu_2$ on $\mathbb{R}$ as $n \to +\infty$.
Then, as $n \to +\infty$, the spectral measure $\mu_{A+B}$ converges weakly in probability to a deterministic distribution depending only on $\mu_1$ and $\mu_2$. This distribution is called the free (additive) convolution of $\mu_1$ and $\mu_2$, and is denoted by $\mu_1 \boxplus \mu_2$.

A similar result also exists for the singular values of the sum of two rectangular matrices; it is due to Benaych-Georges. The empirical singular value distribution of a matrix $A \in M_{n,p}(\mathbb{C})$ is the probability measure on $\mathbb{R}_+$ defined by
$$\nu_A = \frac{1}{n \wedge p} \sum_{j=1}^{n \wedge p} \delta_{\sigma_j(A)},$$
where $\sigma_1(A), \dots, \sigma_{n \wedge p}(A)$ denote the singular values of $A$, i.e. the square roots of the eigenvalues of the positive matrix $AA^*$ (resp. $A^*A$) if $n \le p$ (resp. $n \ge p$).

Theorem 1.4 (see [5, Theorem 3.13]). Let $A, B$ be two independent $n \times p$ random matrices such that
• either $A$ or $B$ is bi-unitarily invariant, i.e. for $M = A$ or $B$ and any unitary matrices $U \in M_n(\mathbb{C})$ and $V \in M_p(\mathbb{C})$, $UMV$ has the same law as $M$,
• $\nu_A$ and $\nu_B$ converge weakly in probability to some distributions $\mu_1$ and $\mu_2$ on $\mathbb{R}_+$ as $n, p \to +\infty$ with $\frac{n}{p} \to c \in (0, +\infty)$.
Then, as $n \to +\infty$, the singular value distribution $\nu_{A+B}$ converges weakly in probability to a deterministic distribution depending only on $\mu_1$, $\mu_2$ and $c$. This distribution is called the rectangular free convolution with ratio $c$ of $\mu_1$ and $\mu_2$, and is denoted by $\mu_1 \boxplus_c \mu_2$.

Free convolutions can be characterized in terms of another key object in RMT, the Stieltjes transform. For a probability measure $\mu$ on $\mathbb{R}$, we call the Stieltjes transform of $\mu$ the function $G_\mu$ defined by
$$G_\mu(z) = \int_{\mathbb{R}} \frac{d\mu(x)}{z - x}$$
for all $z \in \mathbb{C} \setminus \mathbb{R}$. Standard properties of the Stieltjes transform, such as $|G_\mu(z)| \le \frac{1}{|\operatorname{Im} z|}$, will be used implicitly in this paper. Note that the notion of Stieltjes transform is related to that of resolvent, since for a matrix $A \in H_n(\mathbb{C})$, we have $G_{\mu_A}(z) = \frac{1}{n} \operatorname{Tr}\big((zI_n - A)^{-1}\big)$. Useful properties of resolvents we will use in this paper are gathered in Appendix B.2.
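The identity $G_{\mu_A}(z) = \frac{1}{n}\operatorname{Tr}((zI_n - A)^{-1})$, and the known closed form of $G_{\mu_{sc}}$, namely $G_{\mu_{sc}}(z) = \frac{1}{2}\big(z - \sqrt{z-2}\,\sqrt{z+2}\big)$ with principal square roots (branch cut on $[-2, 2]$), can be checked numerically. A sketch with an arbitrary matrix size:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 800
G = rng.standard_normal((n, n))
A = (G + G.T) / np.sqrt(2 * n)   # normalized Wigner matrix X/sqrt(n)

z = 0.3 + 2.0j  # a point of the upper half-plane, away from the spectrum

# Stieltjes transform of the empirical spectral measure, via the eigenvalues...
lam = np.linalg.eigvalsh(A)
g_eig = np.mean(1.0 / (z - lam))

# ...and via the normalized trace of the resolvent (zI - A)^{-1}.
R = np.linalg.inv(z * np.eye(n) - A)
g_res = np.trace(R) / n

# Closed form for the semicircular law, with the branch cut on [-2, 2].
g_sc = (z - np.sqrt(z - 2) * np.sqrt(z + 2)) / 2

print(abs(g_eig - g_res), abs(g_eig - g_sc))
```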
The Stieltjes transform allows one to express subordination relations for free convolutions. To state these relations, we need some additional notation. For $\mu \in \mathcal{P}(\mathbb{R})$, we denote by $\mu^2$ the distribution of $X^2$ when $X$ has law $\mu$. Similarly, for $\mu \in \mathcal{P}(\mathbb{R}_+)$, we denote by $\sqrt\mu$ the symmetrization of the distribution $\nu$ of $\sqrt X$ when $X$ has law $\mu$, i.e. the symmetric distribution on $\mathbb{R}$ defined by $\sqrt\mu(B) = \frac{\nu(B) + \nu(-B)}{2}$ for all Borel sets $B$. We have the following subordination formulas; the first one is due to Biane (cf. [6]) and the second one is obtained from Dozier and Silverstein's work [10] and a paper by Benaych-Georges (cf. [5]).

Proposition 1.5.
• Let $\mu \in \mathcal{P}(\mathbb{R})$ and $\nu = \mu \boxplus \mu_{sc}$. We have, for all $z \in \mathbb{C} \setminus \mathbb{R}$,
$$G_\nu(z) = G_\mu\big(z - G_\nu(z)\big). \quad (3)$$
In Theorem 1.6 below, we are interested in the information-plus-noise model and we control the distance between the spectral measure and the corresponding rectangular free convolution, by bounding the difference between the two terms in (4) evaluated at the average Stieltjes transform.
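Biane's subordination relation (3) can be tested numerically by approximating $\mu \boxplus \mu_{sc}$ with the spectrum of a deterministic matrix plus a Wigner matrix. A minimal sketch, with the illustrative choice $\mu = \frac{1}{2}(\delta_{-1} + \delta_1)$:

```python
import numpy as np

# For nu = mu ⊞ mu_sc, formula (3) reads G_nu(z) = G_mu(z - G_nu(z)).
# We approximate nu by the spectrum of D + W/sqrt(n), W a Wigner matrix.
rng = np.random.default_rng(9)
n = 1000
d = np.concatenate([np.full(n // 2, -1.0), np.full(n // 2, 1.0)])
G = rng.standard_normal((n, n))
W = (G + G.T) / np.sqrt(2)
lam = np.linalg.eigvalsh(np.diag(d) + W / np.sqrt(n))

z = 0.5 + 2.0j
g_nu = np.mean(1.0 / (z - lam))        # Stieltjes transform of nu (empirical)
g_mu = np.mean(1.0 / (z - g_nu - d))   # G_mu evaluated at z - G_nu(z)
print(abs(g_nu - g_mu))
```

The two quantities agree up to finite-$n$ fluctuations of order $1/n$.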

Main results
Note that in the rest of the paper, we will only consider real matrices for simplicity, but our results should generalize to complex matrices by adapting the proofs. The only difficulty in the complex case is to adapt the general integration by parts formula (28), which is used several times in this paper and would lead to heavier computations.
Let us define, for $s, t > 0$, the distance $d_{s,t}$ on $\mathcal{P}(\mathbb{R})$ by
$$d_{s,t}(\mu, \nu) = \sup_{z \in V_{s,t}} |G_\mu(z) - G_\nu(z)|, \quad (5)$$
where
$$V_{s,t} = \left\{ z \in \mathbb{C} \;\middle|\; \operatorname{Im} z > s,\ \left|\frac{\operatorname{Re} z}{\operatorname{Im} z}\right| < t \right\}. \quad (6)$$
As the distance $d$ defined in [7], $d_{s,t}$ metrizes weak convergence. Let us mention that $d_{s,t}$ is dominated by both the Kolmogorov-Smirnov distance $d_{KS}$ and the $L^1$-Wasserstein distance $W_1$ on $\mathcal{P}(\mathbb{R})$. Some key inequalities for the distance between two empirical spectral measures are summarized in Appendix B.3. Our first main result is the following.
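Assuming, as in [7], that $d_{s,t}$ is the supremum of $|G_\mu - G_\nu|$ over $V_{s,t}$ (definition (5)), the domination by $W_1$ can be observed numerically; the grid, sample size, and laws below are arbitrary illustrations.

```python
import numpy as np

s, t = 2.0, 0.5
rng = np.random.default_rng(3)

def stieltjes(sample, z):
    # Stieltjes transform of the empirical measure of `sample` at the points z.
    return np.mean(1.0 / (z[:, None] - sample[None, :]), axis=1)

x = np.sort(rng.standard_normal(2000))          # sample of mu
y = np.sort(rng.standard_normal(2000) * 1.1)    # sample of nu

# A finite grid inside V_{s,t} = {Im z > s, |Re z / Im z| < t}.
im = np.linspace(s + 0.01, s + 5, 40)
re = np.linspace(-0.99 * t, 0.99 * t, 41)[None, :] * im[:, None]
z = (re + 1j * im[:, None]).ravel()

d_st = np.max(np.abs(stieltjes(x, z) - stieltjes(y, z)))
w1 = np.mean(np.abs(x - y))   # W_1 between the two empirical measures
print(d_st, w1)
```

Indeed $|G_\mu(z) - G_\nu(z)| \le W_1(\mu, \nu) / (\operatorname{Im} z)^2$, so with $s = 2$ the supremum is at most $W_1/4$.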
Theorem 1.6. We assume that $c_n = \frac{n}{p}$ is bounded below and above by two constants in $(0, +\infty)$. Let $c > 0$. There exist $s, t > 0$ and a constant $c_{s,t} > 0$ such that for any random matrix $Y \in M_{n,p}(\mathbb{R})$ with i.i.d. entries satisfying $\operatorname{Var}(Y_{1,1}) = 1$ and $\mathbb{E}(Y_{1,1}^4) < +\infty$, for any deterministic matrix $M \in M_{n,p}(\mathbb{R})$, and for all $n$ large enough, we have a bound of the form
$$d_{s,t}\!\left( \mathbb{E}\,\mu_{(\bar Y/\sqrt p + M)(\bar Y/\sqrt p + M)^t},\ \left( \sqrt{\mu_{MM^t}} \boxplus_{c_n} \sqrt{\mu_{MP,c_n}} \right)^{2} \right) \le c_{s,t} \left( \frac{1}{\sqrt n} + \frac{\operatorname{Tr}(MM^t)^{1/2}}{n} \right),$$
where $\bar Y$ is the matrix whose entries are given by $\bar Y_{j,k} = Y_{j,k} - \mathbb{E}(Y_{j,k})$.
This result allows us to understand the influence of the deformation in the information-plus-noise model. First, we can observe a decoupling between the classical term $\frac{1}{\sqrt n}$ and the Frobenius norm of the deformation divided by a better power of $n$, namely $\frac{\operatorname{Tr}(MM^t)^{1/2}}{n}$. It is important for us to get this precise estimate since in Section 3, we apply Theorem 1.6 to a matrix $M$ whose Frobenius norm is not bounded but of order $\sqrt n \log n$. Besides, it is interesting to compare Theorem 1.6 to the Wigner case (cf. [7, Theorem 2.6]). Bordenave and Caputo investigated additive deformations and obtained that in this model, the distance between the spectral measure and the corresponding free additive convolution is bounded by $\frac{1}{\sqrt n}$. This bound is uniform in the deformation $M$ and it depends on the initial matrix through its moments only. In the case of sample covariance matrices, it would have been surprising if we had obtained a better bound. Table 1 below compares Bordenave and Caputo's results with ours in the Gaussian and the general cases. In addition, let us mention that in [8], the authors were interested in the case of Wigner matrices whose entries have a symmetric distribution satisfying a Poincaré inequality, which leads to better bounds than [7].
Theorem 1.6 above will be used in the proof of our second main result.
Theorem 1.7. Let $X \in M_{n,p}(\mathbb{R})$ be a random matrix such that $c_n = \frac{n}{p} \to c \in (0, +\infty)$. We assume that $\operatorname{Var}(X_{1,1}) = 1$ and that there exist $\alpha \in (0, 2)$ and $a \in (0, +\infty]$ such that $X_{1,1} \in \mathcal{S}_\alpha(a)$. Then, the empirical spectral measure $\mu_{XX^t/p}$ satisfies the LDP with speed $n^{1+\alpha/2}$ in $\mathcal{P}(\mathbb{R}_+)$, governed by an explicit good rate function $J'$.

This result is very similar to Bordenave and Caputo's (see Theorem 1.2), the main difference being the explicit expression of the rate function in all cases. This is due to the fact that here, we can carry out the large deviation analysis explicitly, without using a LDP on graphs.

The rest of the paper is organized as follows. In Section 2, we prove the bound for rectangular free convolution stated in Theorem 1.6. In Section 3, we prove the large deviation principle in Theorem 1.7. In Appendix A, we state and prove concentration results used in Sections 2 and 3. Finally, in Appendix B, we summarize miscellaneous inequalities and identities used throughout the paper.

Asymptotic freeness
This section is devoted to the proof of Theorem 1.6. This theorem is in fact a consequence of the following, as we will see in Section 2.1.

Theorem 2.1 (Bound in subordination formula (4)). We assume that $c_n = \frac{n}{p}$ is bounded below and above by two constants in $(0, +\infty)$. Let $c > 0$. There exist $s, t > 0$ and a function $f$, bounded on the domain $V_{s,t}$ defined by (6), such that for any random matrix $Y \in M_{n,p}(\mathbb{R})$ with i.i.d. entries satisfying $\operatorname{Var}(Y_{1,1}) = 1$ and $\mathbb{E}(Y_{1,1}^4) < +\infty$, for any deterministic matrix $M \in M_{n,p}(\mathbb{R})$, for all $n$ large enough, and for all $z \in V_{s,t}$, the two sides of the subordination formula (4) differ by at most $f(z)\left(\frac{1}{\sqrt n} + \frac{\operatorname{Tr}(MM^t)^{1/2}}{n}\right)$.
The proof of Theorem 2.1 follows the same lines as Bordenave and Caputo's proof of the bound in subordination formula (3) for the free additive convolution (see [7, Theorem A.1]). It consists of two main steps: the Gaussian case, and the general case, which we deduce from the Gaussian case by interpolation. However, in the case of sample covariance matrices, the computations are heavier and some bounds must be sharper.
Let us mention that in the Gaussian case, the bound consists only of the last terms (see Proposition 2.3).
In the proof, we define the relevant averaged Stieltjes transforms and we denote by $S = (zI_n - XX^t)^{-1}$ the resolvent of $XX^t$. We consider $s > 2$ and $t > 0$; along the proof, $s$ may increase and $t$ may decrease. Moreover, $f$ will denote a bounded function on $V_{s,t}$, which may change from one line to another. Before starting the proofs, let us state a lemma we will use in the different steps. $B_{\mathbb{C}}(z, \delta)$ denotes here the ball with centre $z \in \mathbb{C}$ and radius $\delta > 0$ for the usual distance in $\mathbb{C}$.
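Two facts about this resolvent that the argument below uses repeatedly, the identity $SXX^t = zS - I_n$ and the bound $\|S\| \le \frac{1}{|\operatorname{Im} z|}$, can be checked numerically on small matrices of arbitrary size:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 90
X = rng.standard_normal((n, p)) / np.sqrt(p)
z = 1.0 + 2.5j
S = np.linalg.inv(z * np.eye(n) - X @ X.T)

# Resolvent identity S X X^t = zS - I_n.
err = np.max(np.abs(S @ X @ X.T - (z * S - np.eye(n))))

# The operator norm of the resolvent of a symmetric matrix is at most 1/|Im z|.
op_norm = np.linalg.norm(S, 2)
print(err, op_norm, 1 / abs(z.imag))
```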
The proof of this lemma consists of simple computations and is left to the reader. Let us mention however that it relies on inequality (8), applied here with $\sigma = \frac{s}{\gamma}$. We will use it again later.
Furthermore, note that by choosing a larger $s$ and a smaller $t$, $l_{s,t}$ and $l'_{s,t}$ can be made as close to 0 as desired.

Proof of Theorem 1.6
First, let us deduce Theorem 1.6 from Theorem 2.1.
Consequently, using Lemma 2.2, there exist $s, t > 0$ and $l_{s,t} \in (0, 1)$ such that the required estimate holds for all $z \in V_{s,t}$. From Theorem 2.1, in which we bound $f$ by a constant depending on $s$ and $t$, and from the definition (5) of $d_{s,t}$, we finally get the bound of Theorem 1.6.

The Gaussian case
In this subsection, we assume that $Y_{1,1}$ is a standard Gaussian. Moreover, we will simply write $g$ and $\widetilde g$ for $g(z)$ and $\widetilde g(z)$ (see Theorem 2.1 for their definitions). We will prove the following bound.

Proposition 2.3. There exist $s, t > 0$ and a function $f$, bounded on $V_{s,t}$, such that for any random matrix $Y \in M_{n,p}(\mathbb{R})$ with i.i.d. standard Gaussian entries, for any deterministic matrix $M \in M_{n,p}(\mathbb{R})$, for all $n$ large enough, and for all $z \in V_{s,t}$, the announced bound holds.

To prove Proposition 2.3, we will follow and improve some computations by Dumont et al., see [11, Appendix II]. In Lemma 2.4, we compare $g$ to $\frac{1}{n}\operatorname{Tr}(R)$ because, using the notations in Lemma 2.2, this quantity is close to the right-hand side of the subordination formula. That is interesting if we have in mind our goal, which is Proposition 2.3.
Note that, as for [17, Formula (122)], the proof of Lemma 2.4 mainly relies on the Gaussian integration by parts formula (27), so we do not give it here.
However, we can observe an important difference between Formula (122) in [17] and Lemma 2.4, namely the terms in ∆ ′ . In fact, the background here is not exactly the same as in [17]. Indeed, Vallet et al. consider complex Gaussian entries with independent real and imaginary parts having the same distribution in the matrix Y , whereas we consider real Gaussian entries. Consequently, some simplifications do not occur any longer and a new term appears. Behind this phenomenon is the quantity ζ = K 1,1 + 2iK 1,2 − K 2,2 , where K denotes the covariance matrix of the Gaussian vector (Re Y 1,1 , Im Y 1,1 ). This quantity is equal to 0 in [17] and to 1 here, that is why we have an additional term.
In the next lemma, we bound the different terms appearing in (9). For this, we will use the concentration bounds (68) and (70) for the terms in ∆ and standard inequalities on traces and resolvents (see Propositions B.1 and B.2) for the terms in ∆ ′ . Our computations will partially follow those in [17].
Lemma 2.5. There exist $s, t > 0$ and a function $f$, bounded on $V_{s,t}$, such that for all $Y$, $M$, $n$, and $z$ as in Proposition 2.3, $g(z)$ is close to $\frac{1}{n}\operatorname{Tr}(R)$ up to an error controlled by $f(z)$ and $\frac{\operatorname{Tr}(MM^t)^{1/2}}{n}$.

This lemma shows that $\frac{1}{n}\operatorname{Tr}(R)$ is a deterministic equivalent of the Stieltjes transform $g(z) = \mathbb{E}\,\frac{1}{n}\operatorname{Tr}(S)$ as soon as $\frac{\operatorname{Tr}(MM^t)^{1/2}}{n}$ tends to 0 as $n \to +\infty$, i.e. when the perturbation $M$ is not too large.
We can compare this result with the bound obtained in [17, Proposition 6]. Two main differences must be highlighted. First, as we mentioned above, the model is not exactly the same: we consider real Gaussian entries and not complex Gaussian entries with independent real and imaginary parts, which produces an additional term in $\Delta'$. However, the terms in $\Delta$ are present in both cases, so we can compare the bounds for these terms; here is the second difference. In [17], the authors assume that $M$ is uniformly bounded in $n$ and get the bound $\frac{f(z)}{n^2}$. Here, for the terms in $\Delta$, we will get a bound without any boundedness assumption on $M$. Moreover, if we use the bound (69) instead of (70) in the proof, and if we observe that $\operatorname{Tr}(MM^t)^{1/2} \le \sqrt n\, \|M\|$, then we get the bound $\frac{f(z)}{n^2}(1 + \|M\|)$, which is the same as in [17] when $M$ is uniformly bounded in $n$. Consequently, our bound has two advantages: it is slightly better than the bound in [17] and it applies without any assumption on $M$.
Proof. First of all, let us remark that $\frac{R}{1 - c_n g}$ is a resolvent evaluated at $\eta = z(1 - c_n g)^2 - (1 - c_n)(1 - c_n g)$, so its operator norm is at most $\frac{1}{|\operatorname{Im} \eta|}$; on the other hand, we have the inequalities $|1 - c_n g| \le 1 + \frac{c_n}{|\operatorname{Im} z|}$ and (8) (we apply the latter with $\sigma = \frac{s}{c_n}$). By Proposition B.1 (ii), it follows that $\|R\|$ is bounded on $V_{s,t}$. Note that more precise bounds can be obtained, see [17, Appendix E].
Next, let us recall the definition of $\Delta$ and observe that $\operatorname{Tr}(SXM^t) = \operatorname{Tr}(X^t S M)$.
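This trace identity combines the cyclicity of the trace with the symmetry $S^t = S$ of the resolvent ($zI_n - XX^t$ is complex symmetric, hence so is its inverse). A quick numerical check, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 70
X = rng.standard_normal((n, p))
M = rng.standard_normal((n, p))
z = 0.5 + 2.0j
S = np.linalg.inv(z * np.eye(n) - X @ X.T)   # complex symmetric: S^t = S

t1 = np.trace(S @ X @ M.T)
t2 = np.trace(X.T @ S @ M)
print(abs(t1 - t2))
```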
The first term in (9) that we bound is $\frac{c_n}{n^2} \operatorname{Tr}(\Delta) \operatorname{Tr}(\mathbb{E}(S)R)$. First, using the concentration bounds (68) and (70) and the Cauchy-Schwarz inequality, we get an estimate involving the quantities $u(z)$ and $v(z)$ defined in Proposition A.1. Next, using the identity $g = \mathbb{E}\,\frac{1}{n}\operatorname{Tr}(S)$ and (68), we bound $\frac{1}{n}\operatorname{Tr}(\mathbb{E}(S)R)$, where, for the last inequality, we use the definition of $u(z)$. The same arguments also allow us to bound the remaining factors, and combining inequalities (14) to (17) gives the desired estimate.

Computations are similar for the term $\frac{1}{n}\operatorname{Tr}(\Delta R)$, using the additional inequalities (13) and $\operatorname{Tr}(RR^*)^{1/2} \le \sqrt n\, \|R\|$ (see Proposition B.1 (iv)). We have thus bounded the terms in Lemma 2.4 in which $\Delta$ appears, thanks to the concentration bounds proved in Appendix A.

We will now consider the terms in which $\Delta'$ appears, in other words the terms not present in [17]. To this end, we will only use inequalities on traces and resolvents (see Propositions B.1 and B.2). Let us recall the definition of $\Delta'$. Using inequalities (i)-(iv) in Proposition B.1, the resolvent identity $SXX^t = zS - I_n$, and (20) again, we obtain (24); very similar calculations give (25). Finally, combining relation (9) with inequalities (18), (19), (24), and (25), we get the bound of Lemma 2.5.

The general case
We now only assume that $\operatorname{Var}(Y_{1,1}) = 1$ and that $\mathbb{E}(Y_{1,1}^4) < +\infty$. Let $\widetilde Y \in M_{n,p}(\mathbb{R})$ be an independent random matrix whose entries $\widetilde Y_{j,k}$ are i.i.d. standard Gaussians; we define $\widetilde X = \frac{\widetilde Y}{\sqrt p} + M$ and, for all $u \in [0, 1]$, an interpolation $X(u)$ between $X$ and $\widetilde X$. We have the following, which will allow us to reduce the general case to the Gaussian one.
Proposition 2.6. There exist $s, t > 0$ and a function $f$, bounded on $V_{s,t}$, such that for any random matrix $Y \in M_{n,p}(\mathbb{R})$ with i.i.d. entries satisfying $\operatorname{Var}(Y_{1,1}) = 1$, $\mathbb{E}(Y_{1,1}^4) < +\infty$, and $\mathbb{E}(Y_{1,1}) = 0$, for any deterministic matrix $M \in M_{n,p}(\mathbb{R})$, for all $n$ large enough, and for all $z \in V_{s,t}$, the announced bound holds.

Proof. The proof consists of four main steps. After expanding $\mathbb{E}\, G_{\mu_{XX^t}}(z) - \mathbb{E}\, G_{\mu_{\widetilde X \widetilde X^t}}(z)$, we use integration by parts formulas (see Lemma 2.7). Then, we respectively focus on bounds for the main terms and for the remainder terms in these integrations by parts.
Dividing by $h$ and letting $h \to 0$, we get an identity valid for all $u \in [0, 1]$. Thus we can rewrite the difference of Stieltjes transforms accordingly. Denoting the resulting terms by $(1), \dots, (6)$, where $S(u)^2_{j,k}$ must be read $(S(u)^2)_{j,k}$, we finally rewrite the difference as $\mathbb{E}[(1) + (2) + (3) + (4) + (5) + (6)]$.

Second step: integrations by parts.
Let us recall the formulas we will use below.

(i) (Gaussian integration by parts) Let $F \in C^1(\mathbb{R}, \mathbb{R})$ with bounded derivative and let $\xi$ have law $\mathcal{N}(0, \sigma^2)$. Then
$$\mathbb{E}[\xi F(\xi)] = \sigma^2\, \mathbb{E}[F'(\xi)]. \quad (27)$$
(ii) More generally, let $p$ be an integer, $F \in C^{p+1}(\mathbb{R}, \mathbb{R})$, and $\xi$ a real random variable. If $\mathbb{E}|\xi|^{p+2} < +\infty$ and the derivatives $F', \dots, F^{(p+1)}$ are bounded on $\mathbb{R}$, then
$$\mathbb{E}[\xi F(\xi)] = \sum_{j=1}^{p} \frac{\kappa_{j+1}}{j!}\, \mathbb{E}\big[F^{(j)}(\xi)\big] + \varepsilon, \quad (28)$$
where the $\kappa_{j+1}$'s are the cumulants of the distribution of $\xi$ and $\varepsilon$ is a remainder term controlled by $\mathbb{E}(|\xi|^{p+2})$ and $\sup_{x \in \mathbb{R}} |F^{(p+1)}(x)|$. We will apply the Gaussian (27) or the general (28) integration by parts formula for all $j, k, l$ in order to decompose $\mathbb{E}[(1) + (2) + (3) + (4) + (5) + (6)]$ as a sum of terms that we can bound.
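Formula (27) is straightforward to verify by Monte Carlo; here is a sketch with the illustrative choice $F = \sin$ (so $F' = \cos$) and an arbitrary value of $\sigma$:

```python
import numpy as np

# Monte Carlo check of Gaussian integration by parts (27):
# for xi ~ N(0, sigma^2) and smooth F, E[xi F(xi)] = sigma^2 E[F'(xi)].
rng = np.random.default_rng(6)
sigma = 1.3
xi = sigma * rng.standard_normal(2_000_000)

lhs = np.mean(xi * np.sin(xi))
rhs = sigma**2 * np.mean(np.cos(xi))
print(lhs, rhs)
```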
Note a first crucial point here. As we want to apply Theorem 1.6 to the matrices Y and C in Section 3 in order to obtain (50), it will not be sufficient to use the integration by parts formula up to order 2, that is why we will be interested in terms of order 3 in this formula.
From now on, $D_{a,b}$ denotes differentiation with respect to $Y_{a,b}$.
Let $u \in [0, 1]$, $j, k \in \{1, \dots, n\}$, and $l \in \{1, \dots, p\}$. We denote by $F_1$ and $G_1$ the functions defined by $F_1(Y_{j,l}) = Y_{k,l} S(u)^2_{j,k}$ and $G_1(\widetilde Y_{j,l}) = \widetilde Y_{k,l} S(u)^2_{j,k}$. From (27), we have an identity in which $\mathbb{E}_{j,l}$ denotes the associated conditional expectation. Similarly, from (28), we have an identity in which $\widetilde{\mathbb{E}}_{j,l}$ denotes the expectation conditionally to the other variables. Taking the expectation, we thus obtain the corresponding decomposition.

for all $j, k, l$, and considering $F_5(Y_{j,l}) = M_{k,l} S(u)^2_{j,k}$ for all $j, k, l$.
We have thus rewritten $\mathbb{E}[(1) + \dots + (6)]$ as a sum of terms that we can bound.

Third step: bounds for the main terms. Note that, in order to simplify the notations, from now on we will write $S$ and $X$ for $S(u)$ and $X(u)$.
Let us start with the term (1.2). Using (84) and (88), we can express it through the Hadamard product $\circ$ (see Appendix B.1), where $S^{\circ 2}$ denotes $S \circ S$. Note that it is crucial here to rewrite the terms precisely with the Hadamard product and then to bound the traces, rather than to bound the entries directly: this yields better powers of $n$ in the bound, which is crucial if we have in mind the large deviations in Section 3. Using Propositions B.1 and B.2 and the Cauchy-Schwarz inequality in $\mathbb{C}^{np}$, and denoting by $y$ a square root of $z$, we get a first estimate. Using also the bound (90), there exists a function $f$, bounded on $V_{s,t}$ and independent from $Y$, $M$, and $n$, controlling this term for all $z \in V_{s,t}$. But for a centred random variable, the third cumulant equals the third moment, so this inequality can be rewritten accordingly.

We adopt the same strategy for the term (1.3): using the previous bounds, the sum over $j, l$, and the same arguments as above, we obtain analogous estimates. Very similar computations allow us to handle the remaining main terms. If we remember that $Y_{1,1}$ and $\widetilde Y_{1,1}$ have mean zero and variance 1, we have $\mathbb{E}\big(\operatorname{Tr}(YY^t)^{1/2}\big) \le \sqrt{np}$ and $\mathbb{E}\big(\operatorname{Tr}(\widetilde Y \widetilde Y^t)^{1/2}\big) \le \sqrt{np}$ by Jensen's inequality. Finally, we can write the resulting bound for the main terms.

Fourth step: bounds for the remainder terms.
It only remains to bound the remainder terms that appeared in the integration by parts formulas; we recall their expression for all $j, k \in \{1, \dots, n\}$, $l \in \{1, \dots, p\}$. Using the expression of $F_1(Y_{j,l})$, the differentiation formulas (84), (88), (89), and inequalities (iv)-(vi) in Proposition B.2, there exists a function $f$, independent from $Y, M, n, j, k, l$ and bounded on $V_{s,t}$, controlling each remainder. So, using the Cauchy-Schwarz inequality in $\mathbb{R}^{np}$, we bound $\frac{1}{\sqrt p} \sum_{j,k,l} \mathbb{E}(\varepsilon_{1,j,k,l})$; the same bound holds for $\frac{1}{\sqrt p} \sum_{j,k,l} \mathbb{E}(\varepsilon_{2,j,k,l})$, and similar arguments handle the other remainder terms. Finally, combining relations (26) to (40), we get the bound of Proposition 2.6.

We can now conclude the proof of the general case and obtain Theorem 2.1. In fact, in Proposition 2.6, we assumed that $\mathbb{E}(Y_{1,1}) = 0$, so we only have to remove this assumption.
Proof. We recall that $\bar X = X - \mathbb{E}(X)$ by definition. We also define $g(z) = \mathbb{E}\, G_{\mu_{XX^t}}(z)$, $g^\bullet(z) = \mathbb{E}\, G_{\mu_{\bar X \bar X^t}}(z)$, and $\widetilde g(z) = \mathbb{E}\, G_{\mu_{\widetilde X \widetilde X^t}}(z)$. Using the notations in Lemma 2.2, the required estimates hold for $s$ large enough and $t$ small enough by Lemma 2.2. Since the matrix $X - \bar X = \mathbb{E}(X)$ has rank at most 1, using the relations (5), (7), and (92), we can compare $g$ and $g^\bullet$. Proposition 2.3 (the Gaussian case) applied to $\widetilde Y$ and Proposition 2.6 (the centred case) applied to $\bar Y$ finally give the bound of Theorem 2.1.
We define $\varepsilon(n) = \frac{1}{\log n}$ and we decompose the matrix $X$ as $X = A + B + C + D$, where $A, B, C, D$ are the matrices defined by truncating the entries of $X$ according to their magnitude. Besides, we denote by $B_{s,t}(\mu, \delta)$ the ball with centre $\mu \in \mathcal{P}(\mathbb{R})$ and radius $\delta > 0$ for the distance $d_{s,t}$.

Exponential equivalences
The goal of this subsection is to prove the following.
Proposition 3.1. There exist $s, t > 0$ such that the random distributions $\mu_{XX^t/p}$ and $\left(\sqrt{\mu_{CC^t}} \boxplus_c \sqrt{\mu_{MP,c}}\right)^2$ are $d_{s,t}$-exponentially equivalent at scale $n^{1+\alpha/2}$ as $n \to +\infty$, i.e. for all $\delta > 0$, we have
$$\lim_{n \to +\infty} \frac{1}{n^{1+\alpha/2}} \log \mathbb{P}\!\left( d_{s,t}\!\left( \mu_{XX^t/p},\ \left(\sqrt{\mu_{CC^t}} \boxplus_c \sqrt{\mu_{MP,c}}\right)^2 \right) > \delta \right) = -\infty.$$

The strategy to prove Proposition 3.1 is similar to the one in [7]. First, we explain why the contributions of $B$ and $D$ to the large deviations can be neglected (Lemmas 3.2 and 3.3), and then we show that the measures $\mu_{(A+C)(A+C)^t}$ and $\left(\sqrt{\mu_{CC^t}} \boxplus_c \sqrt{\mu_{MP,c}}\right)^2$ are exponentially equivalent thanks to a conditioning and a coupling argument in which several tools are needed, such as the concentration property (82) and the asymptotic freeness result stated in Theorem 1.6. From now on, we consider $s > 2$ and $t > 0$. First, the contribution of $D$ is negligible.
The proof is very similar to what is done in [7], the only difference being the use of (92) instead of (91). Therefore, it will not be repeated here.
The contribution of $B$ is also negligible.

Proof. From Lemma 3.2, the triangle inequality, Lemma 1.2.15 in [9], and the inequality $d_{s,t} \le W_1 \le W_2$, it is sufficient to prove an exponential estimate for all $\delta > 0$. From (94), which is the analogue of the Hoffman-Wielandt inequality (93) for covariance matrices, it is in turn sufficient to check a corresponding bound for all $\delta > 0$. Let $\delta > 0$. Using the decomposition (42), we obtain the required bound. On the other hand, since $\frac{n}{p} \to c$, the same arguments as in [7] lead to the analogous estimate. Finally, combining (43), (44), (45), and Lemma 1.2.15 in [9], we get the exponential equivalence of $\mu_{XX^t/p}$ and $\mu_{(A+C)(A+C)^t}$.
Before proving Proposition 3.1, we need some additional properties.
Lemma 3.4.
(iii) We denote by $P_n$ the distribution of $X_{1,1}$ conditionally to $\{|X_{1,1}| < (\log n)^{2/\alpha}\}$. Let $Z_n$ be a random variable with distribution $P_n$. There exists $\zeta > 0$ bounding its fourth moment for $n$ large enough. Furthermore, the variance of $Z_n$, denoted by $\sigma_n^2$, tends to $\operatorname{Var}(X_{1,1}) = 1$ as $n \to +\infty$, at a rate quantified by some $\eta > 0$.

Proof. The proofs of (i) and (ii) exactly follow the proof of Lemma 2.4 in [7]; therefore, we will only prove (iii). Let $Z_n$ be a random variable with distribution $P_n$ defined as above. Thanks to hypothesis (1), $X_{1,1}^2$ is integrable, so by the dominated convergence theorem, $\mathbb{E}\big(X_{1,1}^2 \mathbf{1}_{|X_{1,1}| < (\log n)^{2/\alpha}}\big)$ tends to $\mathbb{E}(X_{1,1}^2)$ as $n \to +\infty$. Besides, $\mathbb{P}(|X_{1,1}| < (\log n)^{2/\alpha})$ tends to 1, so $\mathbb{E}(Z_n^2)$ tends to $\mathbb{E}(X_{1,1}^2)$ as $n \to +\infty$. The same arguments show that $\mathbb{E}(Z_n^4)$ tends to $\mathbb{E}(X_{1,1}^4)$ as $n \to +\infty$, from which we deduce the existence of $\zeta$. Using similar arguments, we prove that $\sigma_n^2$ tends to $\operatorname{Var}(X_{1,1}) = 1$ as $n \to +\infty$.
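The effect described in (iii) can be illustrated numerically; the sketch below uses a standard Gaussian as a stand-in for the law of $X_{1,1}$ (an arbitrary choice made only for illustration) and checks that the conditional variance approaches 1 as the threshold $(\log n)^{2/\alpha}$ grows.

```python
import numpy as np

# Conditioning X_{1,1} on {|X_{1,1}| < (log n)^{2/alpha}} barely changes
# its variance, which tends to Var(X_{1,1}) = 1 as n grows.
alpha = 1.5
rng = np.random.default_rng(8)
x = rng.standard_normal(1_000_000)

vars_by_n = {}
for n in (10, 100, 10_000):
    thr = np.log(n) ** (2 / alpha)
    z = x[np.abs(x) < thr]        # sample from the conditional law P_n
    vars_by_n[n] = z.var()
print(vars_by_n)
```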
Going back to (46), we have the corresponding estimate for $n$ large enough. Because the moments of $X_{1,1}$ are finite, we can deduce that there exists a real number $\eta$ as announced. We can now prove Proposition 3.1.
Proof. The proof relies on a conditioning with respect to the entries of X which are not in A and on a coupling argument to remove the dependency between A and C.
We use here the same notations as [7]. We denote by $\mathcal{F}$ the $\sigma$-algebra $\mathcal{F} = \sigma\big(X_{j,k} \mathbf{1}_{|X_{j,k}| \ge (\log n)^{2/\alpha}}\big)$, by $\mathbb{P}_{\mathcal{F}}$ and $\mathbb{E}_{\mathcal{F}}$ the probability and the expectation conditionally to $\mathcal{F}$, and we denote by $E$ and $F$ the events introduced in (47). Besides, conditionally to $\mathcal{F}$, $\sqrt p A$ is a random matrix with independent entries bounded by $(\log n)^{2/\alpha}$. From the concentration result (82) applied to $Y = \sqrt p A$, $M = C$, $\kappa = (\log n)^{2/\alpha}$, from the inequality $d_{s,t} \le W_1$, and using that $\alpha < 2$, we get (48) for all $\delta > 0$ and $n$ large enough.

We will now use a coupling argument. We consider an independent random matrix $Y$ whose entries are i.i.d. with distribution $P_n$ defined in Lemma 3.4, and we denote by $A'$ the matrix defined from $Y$ accordingly. Consequently, $\sqrt p A'$ and $Y$ have the same distribution and are independent from $\mathcal{F}$; in particular, we will use later the resulting identity for all bounded continuous $f$. From the inequalities (94) and $d_{s,t} \le W_2$, together with the definition (5) of $d_{s,t}$ and conditional Jensen's inequality for the concave function $x \mapsto x^{1/4}$, we can bound the conditional expectation of the distance, because $\mathbf{1}_E$, $\mathbf{1}_F$, and $\operatorname{Tr}(CC^t)$ are $\mathcal{F}$-measurable. Since the events $\{(j, k) \in I\}$ are $\mathcal{F}$-measurable and $Y$ is independent from $\mathcal{F}$, Lemma 3.4 (iii) applies, and similarly for the symmetric term. So, for $n$ large enough, we obtain a bound of order
$$(6 \zeta c_n)^{1/4}\, \frac{(\log n)^{1/\alpha}}{n^{1/4 - \alpha/8}}$$
(we used here the fact that $\frac{4}{\alpha} > 2$). It follows that for all $\delta > 0$, the corresponding probability is super-exponentially small at scale $n^{1+\alpha/2}$.

In addition, we define $\sigma_n^2 = \operatorname{Var}(Y_{1,1})$ as in Lemma 3.4 (iii). Since $C$ is $\mathcal{F}$-measurable, $Y$ is independent from $\mathcal{F}$, and $\mathbb{E}(Y_{1,1}^4) \le 2\zeta < +\infty$ for $n$ large enough, we can apply Theorem 1.6 to $Y/\sigma_n$ and $C$, conditionally to $\mathcal{F}$. Therefore, for $n$ large enough, $s$ large enough, and $t$ small enough, we get the desired bound, using Jensen's inequality and the fact that for all $j \in \{1, \dots, n\}$ and $k \in \{1, \dots, p\}$, we have $|\bar Y_{j,k}| = |Y_{j,k} - \mathbb{E}_{\mathcal{F}}(Y_{j,k})| \le 2(\log n)^{2/\alpha}$.
Therefore, the corresponding estimate holds for all $\delta > 0$. To finish, from (94), conditional Jensen's inequality, and the same arguments as above, we deduce by Lemma 3.4 (iii) that for all $\delta > 0$, the remaining term is also negligible. To conclude, combining the estimates (47) to (51), Lemma 3.3, and Lemma 1.2.15 in [9], for $s$ large enough and $t$ small enough, we obtain, for all $\delta > 0$, the exponential equivalence stated in Proposition 3.1.

Large deviations for $\mu_{C'}$
In the previous subsection, we proved that $\mu_{XX^t/p}$ and $\left(\sqrt{\mu_{CC^t}} \boxplus_c \sqrt{\mu_{MP,c}}\right)^2$ are exponentially equivalent. Consequently, to obtain the large deviations of $\mu_{XX^t/p}$ (Theorem 1.7), it is sufficient to study the large deviations of $\mu_{CC^t}$ and to apply the contraction principle (see [9, Theorem 4.2.1]). For this, in this subsection, we will study the large deviations of $\mu_{C'}$ and prove the following, from which we will deduce the large deviations of $\mu_{CC^t}$ thanks to identity (54), and conclude in the next subsection.
Note that $\Phi'$ is a good rate function because it is well known that for all $m \ge 0$ and $p > 0$, the set of probability measures on $\mathbb{R}$ whose $p$-th absolute moment is bounded by $m$ is compact for the weak topology. Moreover, the domain of $\Phi'$ can be explained thanks to Lemma 3.6 (i).
Lemma 3.6. Let $M \in M_{n,p}(\mathbb{R})$.
(ii) We have the corresponding identity.
(iii) If $M$ is diagonal, in the sense that only the entries $M_{j,j}$, $1 \le j \le n \wedge p$, can be non-zero, then the analogous expression holds.

The proof of this lemma does not present any difficulty and is left to the reader. We also need a second lemma, which consists of two estimates for the distribution of $X_{1,1}$. These estimates come from the particular form of this distribution, see hypotheses (1) and (2).

Lemma 3.7.
(i) There exists a sequence $(\eta_n)_{n \in \mathbb{N}}$ converging to 0 such that the first estimate holds for all $x \ge \varepsilon(n)$.
(ii) We denote by $S_a$ the support of the distribution $\vartheta_a$ defined by (2). There exists a sequence $(a_n)_{n \in \mathbb{N}}$ converging to $a$ such that the second estimate holds for all $x \in \mathbb{R}$ satisfying $|x| \ge \varepsilon(n)$ and $\operatorname{sign}(x) \in S_a$, for all $\gamma > 0$, and for all $n$ large enough.

The computations leading to these inequalities are explained in [7, p. 26] and are left to the reader.
We will now prove Proposition 3.5. Let us mention that Schatten's inequality (95) will be crucial in the proof, since it will allow us to link the $\alpha$-th moment of the spectral measure $\mu_{C'}$ to the entries of $C'$.
Proof. Since the set of symmetric probability measures on R is closed for the weak topology, it is enough to prove the LDP on this set, see [9,Lemma 4.1.5].

We have obtained the upper bound of the LDP.
Exponential tightness. Let $A > 0$ and $m = \frac{2 A c^{1+\alpha/2}}{a(1+c)}$. We recall that the set $K_{\alpha,m}$ defined by (53) is compact. Moreover, using the computations above and the differentiation formula (84), we get the required bound.