CONCENTRATION INEQUALITIES FOR THE SPECTRAL MEASURE OF RANDOM MATRICES

We give new exponential inequalities for the spectral measure of random Wishart matrices. In particular, these results give useful bounds when the matrices have the form M = Y Y^T, where Y is a p × n random matrix with independent entries (weaker conditions are also proposed), and p and n are large.


Introduction
Recent literature has devoted much interest to Wishart random matrices, that is, matrices of the form M = n^{-1} Y Y^T, where Y ∈ ℝ^{p×n} is a rectangular p × n matrix with random centered entries, and both n and p ≤ n tend to infinity: typically p = p(n), and p(n)/n tends to some limit. M can be seen as the empirical covariance matrix of a random vector of dimension p sampled n times, each sample being a column of Y. It is common in applications for the number of variables to be of the same order of magnitude as the number of observation samples; the analysis of the asymptotic spectral distribution of such matrices therefore appears to be an important issue, in particular for implementing significance tests for eigenvalues (e.g. for principal component analysis with p variables and n observations). Our objective is to give new exponential inequalities for symmetric functions of the eigenvalues, that is, for variables of the form Z = g(λ_1, . . . , λ_p), where λ_1, . . . , λ_p are the eigenvalues of M; such inequalities lead naturally to concentration bounds, i.e. bounds on P(|Z − E[Z]| ≥ x) for large values of x.
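To fix ideas, here is a minimal numerical sketch (in Python) of the objects just described: the matrix M = n^{-1} Y Y^T, its eigenvalues, and a linear spectral statistic Z whose fluctuations around its mean are small when p and n are large. The dimensions, the number of repetitions and the smooth test function φ = log(1 + x) are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 200                      # illustrative dimensions, p <= n
n_rep = 2000                        # Monte Carlo repetitions

def spectral_statistic(Y, phi=np.log1p):
    """Z = sum_k phi(lambda_k), with lambda_k the eigenvalues of M = Y Y^T / n."""
    M = Y @ Y.T / n
    lam = np.linalg.eigvalsh(M)
    return phi(lam).sum()

# Empirical look at concentration: the standard deviation of Z is small
# compared with its mean for a smooth test function phi.
Z = np.array([spectral_statistic(rng.standard_normal((p, n))) for _ in range(n_rep)])
print("E[Z]  ~", Z.mean())
print("sd(Z) ~", Z.std())
```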
Our contribution improves on the existing literature in the following respects: (i) the function g is a general once or twice differentiable function (i.e. not necessarily of the form (2) below), (ii) the entries of Y may be dependent, (iii) they are not necessarily identically distributed, and (iv) the bound is instrumental when both p and n are large.
The column-wise dependence is usually taken into account in essentially two different ways: either one assumes that the columns of Y are i.i.d. but each column is a dependent vector, or one assumes that M can be written as M = n^{-1} Q X X^T Q^T for some mixture matrix Q ∈ ℝ^{p×p}, where X ∈ ℝ^{p×n}, p < n, has independent entries; this means that the j-th observation vector Y_{·j} has the specific form Q X_{·j}. The second way is thus more restrictive, but it is natural from a theoretical point of view, since one obtains essentially the same results as in the case Q = Id (cf. [1], Chapter 8). We shall consider both cases (Theorem 3 and Theorem 2); the first one is much easier, but leads to much larger bounds when p is large.
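As an illustration of the two dependence models just mentioned, the following sketch generates first a matrix Y with i.i.d. columns that are internally dependent, and then a matrix of the more restrictive form Y = QX with independent entries in X. The covariance structure and the dimensions are purely illustrative; note that with Gaussian entries the two constructions coincide, whereas for general independent entries the second is a genuine restriction.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 50, 200

# A fixed p x p covariance matrix (AR(1)-type, purely illustrative).
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Q = np.linalg.cholesky(Sigma)          # a possible "mixture" matrix Q

# Model 1: i.i.d. columns, each column a dependent vector with covariance Sigma.
Y1 = Q @ rng.standard_normal((p, n))

# Model 2: Y = Q X, where X has independent (here non-Gaussian) entries,
# so that M = Q X X^T Q^T / n.
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(p, n))   # centered, unit-variance entries
Y2 = Q @ X

M2 = Y2 @ Y2.T / n                      # empirical covariance under model 2
```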
We now give a brief review of existing results in this context. From now on in this section we assume for simplicity that Q = Id. In order to be more specific, and to be able to compare with existing results, we shall stick in this introduction to the case where g(λ_1, . . . , λ_p) = Σ_{k=1}^p ϕ(λ_k), although any V-statistic of (λ_1, . . . , λ_p) could be considered. According to distributional limit theorems (e.g. Theorem 9.10, p. 247 of [1]), the asymptotic variance of Z, as n tends to infinity and p/n converges to some limit y > 0, is a rather complicated function of y; this suggests that the best one could hope for in the independent case, when p and n tend to infinity, would be a bound of the form (3) for some function f. Guionnet and Zeitouni obtained in [3], for the independent case (Corollary 1.8), the bound (4),
for some reasonable constant g with order of magnitude g ∼ ξ² x ϕ(·), under the assumption that the function x → ϕ(x²) is convex. A similar result is also given without this convexity assumption (Corollary 1.8 (b)), but a logarithmic Sobolev inequality is then required for the densities of the X_{ij}'s.
We shall obtain here, in the independent case, the following particular consequence of Theorem 2 below (still with Q = Id), namely the bound (5); this leads to the tail bound (6). In the case where n tends to infinity and p = cn, (4) gives, when it applies, a better bound than (6), because it does not involve second derivatives of ϕ; when p ≪ n, Equation (6) is always sharper. The remarkable point is that this inequality is essentially a consequence of an extension, due to Boucheron, Lugosi and Massart, of the McDiarmid inequality (Theorem 1 below). Equation (5) will be applied in a forthcoming article with Jian-Feng Yao and Jia-Qi Chen, where ϕ is an estimate of the limiting spectral density of E[M]; it will be a key step in proving the validity of a cross-validation procedure.
In the case of independent columns only, Guntuboyina and Leeb recently obtained the following concentration inequality (Theorem 1 (ii) of [4]), namely the bound (7), where V is the total variation of the function ϕ on the relevant interval and it is assumed that ‖Y_{ij}‖_∞ ≤ 1 for all i, j. This bound is never close to (3), but it is essentially optimal in the context of independent columns when p and n have the same order of magnitude (Example 3 in [4]). We provide here an estimate which can be written as (8); this is a simple consequence of Theorem 3 below with a² = pξ² (since the entries are independent). In the independent case, this improves on Corollary 1.8 of [3] only if p² is significantly smaller than n; it also improves on (7) if p ≪ n, since in this case p²/n ≪ n. But it is hardly useful for applications where p and n have the same order of magnitude.

Results
The following theorem is a bounded difference inequality due to Boucheron, Lugosi and Massart [2] (their Theorem 2, with λ = 1 and θ → 0); it looks like the McDiarmid inequality, but the important difference here is that the infinity norm in (9) is outside the sum:

Theorem 1. Let Y = (Y_1, . . . , Y_n) be a zero-mean sequence of independent variables with values in some measured space E. Let f be a measurable function on E^n with real values. Let Y′ be an independent copy of Y, and set Z = f(Y) and, for 1 ≤ i ≤ n, Z^{(i)} = f(Y_1, . . . , Y_{i−1}, Y′_i, Y_{i+1}, . . . , Y_n). Then, with the notation x₊² = (max(x, 0))², the inequality (9) holds. We shall actually use a weaker consequence of (9).

Notations. In what follows, ‖x‖ stands for the Euclidean norm of the vector x, and ‖M‖ for the operator norm of the matrix M ∈ ℝ^{d×d}; ‖X‖_p is the usual L^p-norm of the real random variable X. e_i is the i-th vector of the canonical basis, and M_{·j} = M e_j denotes the j-th column of M; similarly, M_{i·} is the i-th row of M, a row vector (a 1 × d matrix). If u, v ∈ ℝ^d, they will also be understood as d × 1 matrices; in particular u^T v = 〈u, v〉, and u v^T is a d × d matrix.

Theorem 2. Let λ → g(λ) be a twice differentiable symmetric function on ℝ^p and define the random variable Z = g(λ_1, . . . , λ_p), where λ_1, . . . , λ_p are the eigenvalues of M. Then the exponential bound (10) holds; in particular, for any x > 0, the tail bound (14) holds.

The next theorem deals with matrices X with independent columns; since QX again has independent columns, we assume here that Q = Id.

Theorem 3. In this setting, Z satisfies an analogous exponential bound involving a quantity a² (with a² = pξ² when the entries are independent, cf. the introduction); in particular, for any x > 0, the corresponding tail bound holds.
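The quantity that the proofs below need to control can be illustrated numerically: replace one coordinate of the data (here, one column of X, as in the setting of Theorem 3) by an independent copy and sum the squared variations of f. The sketch below uses an arbitrary smooth test function and illustrative dimensions, and does not reproduce the exact constants of Theorem 1; it only computes the bounded-difference-type sum for the spectral statistic f(X) = Σ_k ϕ(λ_k), with λ_k the eigenvalues of X X^T / n.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 30, 100

def f(X, phi=np.log1p):
    # Spectral statistic f(X) = sum_k phi(lambda_k) for M = X X^T / n.
    return phi(np.linalg.eigvalsh(X @ X.T / n)).sum()

X = rng.standard_normal((p, n))
X_prime = rng.standard_normal((p, n))    # independent copy of X

fX = f(X)
sq_diffs = []
for j in range(n):
    Xj = X.copy()
    Xj[:, j] = X_prime[:, j]             # replace the j-th column only
    sq_diffs.append((fX - f(Xj)) ** 2)

# Sum over coordinates of the squared differences; its (essential) supremum
# is the type of quantity entering the bounded difference inequality.
print("sum_j (f(X) - f(X^(j)))^2 =", sum(sq_diffs))
```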
Proof of Theorem 2
We have to estimate, with the notations of Theorem 1, the quantity appearing on the right-hand side of (9). If the entry X_{ij} is replaced by an independent copy, the Taylor formula expresses the resulting variation of f in terms of the first and second derivatives of f with respect to X_{ij}. We shall estimate these quantities with the help of Lemma 4 below, specifically Equations (24) and (25), which give bounds for the derivatives of a function of the eigenvalues; the variable t is here X_{ij}. Using an elementary identity valid for any matrix P, Equation (24) can be rewritten in terms of the orthogonal projections P_{λ_k} on the eigenspaces of M corresponding to the eigenvalues λ_k, where d_k denotes the dimension of the eigenspace of λ_k. It happens that we can get a good bound for a partial Euclidean norm of the gradient of f: fix j and group the derivatives with respect to the entries of the j-th column (here A_{·j} stands for the j-th column of a matrix A). Notice that in the case of eigenvalues of higher multiplicity some of the ξ_k are repeated, but if ξ_j = ξ_k and λ_j = λ_k, one also has g_j = g_k (by symmetry of g); hence the bound follows by summing over a set K of indices such that Σ_{k∈K} P_{λ_k} = Id. For the estimation of the second derivatives we use Lemma 4 again: in Equation (25) the ν_k, the eigenvalues of M̈, are all zero except for one, equal to 2n^{−1}‖Q_{·i}‖² ≤ 2n^{−1}‖Q‖², since M̈ = 2n^{−1} Q_{·i} Q_{·i}^T has rank one. Collecting these estimates for every j yields Equation (10). Equation (14) is then easily deduced from (10): recall that a random variable X such that, for some a > 0 and every t > 0, E[e^{tX}] ≤ e^{at²}, satisfies P(X > x) ≤ e^{−x²/(4a)} for every x > 0. In particular, applying (10) to the function t·g instead of g leads to the required bound on the Laplace transform of Z − E[Z], and (14) follows.
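For completeness, here is the short Chernoff-type argument behind the fact just recalled; all steps are standard. For every t > 0, Markov's inequality applied to e^{tX} gives

P(X > x) ≤ e^{−tx} E[e^{tX}] ≤ exp(at² − tx),

and optimizing over t, i.e. taking t = x/(2a), yields P(X > x) ≤ exp(−x²/(4a)).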

Proof of Theorem 3
Define on ℝ^{p×n} the function f(X) = g(λ_1, . . . , λ_p), where λ_1, . . . , λ_p are the eigenvalues of M = n^{−1} X X^T. We have to estimate, with the notations of Theorem 1, the variation of f when one column of X is replaced by an independent copy. The matrix ∆ = X^j − X vanishes except on its j-th column, which is the vector δ with δ_i = (X^j)_{ij} − X_{ij}; the Taylor formula then expresses f(X^j) − f(X) in terms of the derivatives of t → f(X + t∆). We shall estimate these quantities with the help of Lemma 4 of the appendix, applied to M(t) = n^{−1}(X + t∆)(X + t∆)^T. We have Ṁ(t) = n^{−1}(∆(X + t∆)^T + (X + t∆)∆^T), which involves only δ and the j-th column X_{·j} of X. Setting ḟ = (d/dt) f(X + t∆), Equation (24) can be rewritten in terms of the spectral projections of M(t), and the result follows.

A Smoothness of the eigenvalues of a parameter-dependent symmetric matrix
The following lemma has been a key instrument in the proofs of Theorems 2 and 3.

Lemma 4. Let t → M(t) be a sufficiently smooth family of symmetric p × p matrices defined on an open interval I (the class C^{3p} suffices, by the result of [5] recalled in the proof). Then the eigenvalues of M(t) admit a twice differentiable parametrization t → (λ_1(t), . . . , λ_p(t)), and Equations (21)–(23) are satisfied for 1 ≤ k ≤ p and t ∈ I, where both sides are functions of the implicit variable t, P_λ is the orthogonal projection on the eigenspace E_λ of the eigenvalue λ, and d_k is the dimension of E_{λ_k}. Let g be a twice differentiable symmetric function defined on some cube Q = (a, b)^p containing the eigenvalues of the matrices M(t), t ∈ I; then the function φ(t) = g(λ_1(t), . . . , λ_p(t)) is twice differentiable on I, and for any t ∈ I the bounds (24) and (25) hold, where γ = sup_{Λ∈Q} ‖∇²g(Λ)‖ (matrix norm) and ν_i(t) is the i-th eigenvalue of M̈(t).
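The first-order formula of the lemma can be checked numerically in the simplest situation of a simple eigenvalue: if M(t) = M + tH with H symmetric and λ_k(0) is simple with spectral projector P_{λ_k}, then λ̇_k(0) = Tr(P_{λ_k} H). The following sketch compares this formula with a finite difference; the matrix sizes, the step and the index k are illustrative, and a generic random matrix is used so that all eigenvalues are simple.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
A = rng.standard_normal((p, p)); M = (A + A.T) / 2   # symmetric base matrix
B = rng.standard_normal((p, p)); H = (B + B.T) / 2   # symmetric perturbation direction

k, eps = 2, 1e-6                                     # tracked eigenvalue index, step size
lam, V = np.linalg.eigh(M)                           # eigenvalues (ascending) and eigenvectors
lam_eps = np.linalg.eigvalsh(M + eps * H)

P_k = np.outer(V[:, k], V[:, k])                     # projector on the k-th eigenspace
print("finite difference :", (lam_eps[k] - lam[k]) / eps)
print("Tr(P_k H)         :", np.trace(P_k @ H))      # the two values should agree closely
```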
Proof. The smoothness comes from the possibility of finding a twice differentiable parametrization of the eigenvalues t → (λ_1(t), . . . , λ_p(t)), which is simply a consequence of the theorem of [5]; let us recall the third statement of that theorem: consider a continuous curve of monic polynomials P(t) of degree p with coefficients a_1(t), . . . , a_p(t); if all the a_i are of class C^{3p}, then there is a twice differentiable parametrization x = (x_1, . . . , x_p) : I → ℝ^p of the roots of P.
It remains to prove Equations (21) to (25). We shall use below the notation Σ_λ to denote a sum over the distinct eigenvalues of M(t); in particular, for any polynomial f, f(M(t)) = Σ_λ f(λ) P_λ. Differentiating this identity in t produces two terms: the first one is computed directly, and the second may be computed via Equation (29), identifying the r.h.s. of (30) as the sum of the r.h.s. of (31) and (32). Choosing a polynomial f such that f′(λ) is zero for every eigenvalue λ of M(t), and f(λ) is zero for every eigenvalue λ with the exception of a specific one, λ_k, for which f(λ_k) = 1, we get f(M) = P_{λ_k} and an identity for Σ_{j:λ_j=λ_k} λ̇_j, so (22) is proved. Equation (23) is proved similarly, by considering a polynomial f such that f(λ) is zero for every eigenvalue λ of M(t), and f′(λ) is zero for every eigenvalue λ with the exception of a specific one, λ_k, for which f′(λ_k) = 1.

We now prove Equation (24). For any smooth symmetric function f(x, y) of two real variables, denoting by f_1 and f_2 the partial derivatives, we have f_1(x, y) = f_2(y, x) and f_{11}(x, y) = f_{22}(y, x), which implies that at any point such that x = y one has f_1 = f_2 and f_{11} = f_{22}. Hence, using (21), the symmetry of g, and (33), we obtain (24).

It remains to prove (25). Setting g_k = ∂g/∂λ_k and g_{kl} = ∂²g/∂λ_k∂λ_l, we write φ̈ as the sum of two terms. For the estimation of the first term, we use the identity M̈ = Σ_i ν_i w_i w_i^T, where (w_1, . . . , w_p) is an orthonormal basis of eigenvectors associated to (ν_1, . . . , ν_p). The second term is handled by symmetry. We need to notice the following: relation (33) implies that the symmetric function f satisfies a corresponding inequality, from which

(g_k − g_j)/(λ_k − λ_j) ≤ (1/2) ‖g_{jj} + g_{kk} − 2g_{jk}‖_∞ = (1/2) sup_{Λ∈Q} |〈e_j − e_k, ∇²g(Λ)(e_j − e_k)〉| ≤ γ.

Hence Tr((P_λ Ṁ P_µ)^T P_λ Ṁ P_µ) = Tr(Ṁ P_λ Ṁ P_µ), and

γ Σ_{λ≠µ} Tr(Ṁ P_λ Ṁ P_µ) = γ Tr(Ṁ²) − γ Σ_λ Tr(Ṁ P_λ Ṁ P_λ).