Sums of random Hermitian matrices and an inequality by Rudelson

We give a new, elementary proof of a key inequality used by Rudelson in the derivation of his well-known bound for random sums of rank-one operators. Our approach is based on Ahlswede and Winter's technique for proving operator Chernoff bounds. We also prove a concentration inequality for sums of random matrices of rank one with explicit constants.


Introduction
This note mainly deals with estimates for the operator norm of random sums
$$Z_n \equiv \sum_{i=1}^n \epsilon_i A_i \tag{1}$$
of deterministic Hermitian matrices $A_1, \dots, A_n$ multiplied by random coefficients. Recall that a Rademacher sequence is a sequence $\{\epsilon_i\}_{i=1}^n$ of i.i.d. random variables with $\epsilon_1$ uniform over $\{-1, +1\}$. A standard Gaussian sequence is a sequence of i.i.d. standard Gaussian random variables. Our main goal is to prove the following result.
Theorem 1 (proven in Section 3) Given positive integers $d, n \in \mathbb{N}$, let $A_1, \dots, A_n$ be deterministic $d \times d$ Hermitian matrices and $\{\epsilon_i\}_{i=1}^n$ be either a Rademacher sequence or a standard Gaussian sequence. Define $Z_n$ as in (1). Then for all $p \in [1, +\infty)$,
$$\mathbb{E}\big[\|Z_n\|^p\big]^{1/p} \le \Big(\sqrt{2\ln(2d)} + C_p\Big)\,\Big\|\sum_{i=1}^n A_i^2\Big\|^{1/2}, \quad \text{where } C_p \le c\sqrt{p} \text{ for a universal } c > 0.$$

For $d = 1$, this result corresponds to the classical Khintchine inequalities, which give sub-Gaussian bounds for the moments of $\sum_{i=1}^n \epsilon_i a_i$ ($a_1, \dots, a_n \in \mathbb{R}$). Theorem 1 is implicit in Section 3 of Rudelson's paper [11], albeit with non-explicit constants. The main theorem in that paper is the following inequality, which is a simple corollary of Theorem 1: if $Y_1, \dots, Y_n$ are i.i.d. random (column) vectors in $\mathbb{C}^d$ which are isotropic (i.e. $\mathbb{E}[Y_1 Y_1^*] = I$, the $d \times d$ identity matrix), then:
$$\mathbb{E}\,\Big\|\frac{1}{n}\sum_{i=1}^n Y_i Y_i^* - I\Big\| \le C\,\sqrt{\frac{\ln n}{n}}\;\mathbb{E}\big[|Y_1|^{\ln n}\big]^{1/\ln n} \tag{2}$$
for some universal $C > 0$, whenever the RHS of the above inequality is at most 1. This important result has been applied to several different problems, such as bringing a convex body to near-isotropic position [11]; the analysis of low-rank approximations of matrices [12,6] and graph sparsification [13]; estimating the singular values of matrices with independent rows [10]; analysing compressive sensing [3]; and related problems in Harmonic Analysis [16,15].
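To make Theorem 1 concrete, the sketch below estimates $\mathbb{E}\|Z_n\|$ by Monte Carlo on a randomly generated instance and compares it with the leading term $\sqrt{2\ln(2d)}\,\|\sum_i A_i^2\|^{1/2}$. The dimensions, the choice of $A_i$, and the sample size are our own illustrative assumptions; on this instance the leading term alone already dominates the empirical mean, even without the $C_p$ correction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 50

def random_hermitian(d):
    # Illustrative choice of deterministic A_i: fixed draws from a random
    # Hermitian ensemble (any Hermitian matrices would do).
    M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (M + M.conj().T) / 2

A = [random_hermitian(d) for _ in range(n)]

# sigma = || sum_i A_i^2 ||^{1/2}, the quantity on the right of Theorem 1.
sigma = np.linalg.norm(sum(Ai @ Ai for Ai in A), ord=2) ** 0.5

# Monte Carlo estimate of E||Z_n|| with Rademacher signs.
norms = []
for _ in range(200):
    eps = rng.choice([-1.0, 1.0], size=n)
    Zn = sum(e * Ai for e, Ai in zip(eps, A))
    norms.append(np.linalg.norm(Zn, ord=2))
mean_norm = float(np.mean(norms))

leading_term = np.sqrt(2 * np.log(2 * d)) * sigma
print(mean_norm <= leading_term)
```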
The key ingredient of the original proof of Theorem 1 is a non-commutative Khintchine inequality by Lust-Piquard and Pisier [9]. This states that there exists a universal $c > 0$ such that for all $Z_n$ as in the Theorem and all $p \ge 1$,
$$\mathbb{E}\big[\|Z_n\|_{S_p}^p\big]^{1/p} \le c\,\sqrt{p}\;\Big\|\Big(\sum_{i=1}^n A_i^2\Big)^{1/2}\Big\|_{S_p},$$
where $\|\cdot\|_{S_p}$ denotes the $p$-th Schatten norm: $\|A\|_{S_p}^p \equiv \mathrm{Tr}[(A^*A)^{p/2}]$. Unfortunately, the proof of the Lust-Piquard/Pisier inequality employs language and tools from non-commutative probability that are rather foreign to most potential users of (2). This note presents an elementary proof of Theorem 1 that bypasses the above inequality. Our argument is based on an improvement of the methodology created by Ahlswede and Winter [2] in order to prove their operator Chernoff bound, which also has many applications, e.g. [7] (the improvement is discussed in Section 3.1). This approach only requires elementary facts from Linear Algebra and Matrix Analysis. The most complicated result that we use is the Golden-Thompson inequality [5,14]:
$$\forall A, B \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}:\quad \mathrm{Tr}\big(e^{A+B}\big) \le \mathrm{Tr}\big(e^A e^B\big). \tag{3}$$
An elementary proof of this classical inequality is sketched in Section 5 below.
We have already noted that Rudelson's bound (2) follows simply from Theorem 1; see [11, Section 3] for details. Here we prove a concentration lemma corresponding to that result under the stronger assumption that $|Y_1|$ is a.s. bounded. While similar results have appeared in other papers [10,12,16], our proof is simpler and gives explicit (albeit quite large) constants.

Lemma 1 (proven in Section 4) A key feature of this Lemma is that the ambient dimension $d$ plays no direct role in the bound. In fact, the same result holds for $Y_i$ taking values in a separable Hilbert space (as in the last section of [10]).
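Since Lemma 1 concerns the deviation of the empirical covariance $\frac{1}{n}\sum_i Y_iY_i^*$ from the identity for bounded isotropic vectors, a minimal numerical illustration is easy to set up. The toy model below (vectors uniform on the sphere of radius $\sqrt{d}$, our own choice, which makes $Y_1$ isotropic with $|Y_1| = \sqrt{d}$ a.s.) simply exhibits the decay of the deviation as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def cov_error(d, n, trials=40):
    # Average of || (1/n) sum_i Y_i Y_i^T - I || over independent experiments,
    # with Y_i uniform on the sphere of radius sqrt(d) (isotropic, |Y_i| = sqrt(d)).
    errs = []
    for _ in range(trials):
        Y = rng.standard_normal((n, d))
        Y *= np.sqrt(d) / np.linalg.norm(Y, axis=1, keepdims=True)
        emp = Y.T @ Y / n
        errs.append(np.linalg.norm(emp - np.eye(d), ord=2))
    return float(np.mean(errs))

err_coarse, err_fine = cov_error(10, 200), cov_error(10, 3200)
print(err_coarse > err_fine)  # more samples => smaller deviation
```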
To conclude the introduction, we present an open problem: is it possible to improve upon Rudelson's bound under further assumptions? There is some evidence that the dependence on $\ln d$ in the Theorem, while necessary in general [12, Remark 3.4], can sometimes be removed. For instance, Adamczak et al. [1] have improved upon Rudelson's original application of Theorem 1 to convex bodies, obtaining exactly what one would expect in the absence of the $\ln(2d)$ term. Another setting where our bound is a $\Theta(\sqrt{\ln d})$ factor away from optimality is that of more classical random matrices (cf. the end of Section 3.1 below). It would be interesting if one could sharpen the proof of Theorem 1 in order to reobtain these results. [Related issues are raised by Vershynin [17].]

Preliminaries
We let $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ denote the set of $d \times d$ Hermitian matrices, which is a subset of the set $\mathbb{C}^{d\times d}$ of all $d \times d$ matrices with complex entries. The spectral theorem states that all $A \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}$ have $d$ real eigenvalues (possibly with repetitions) that correspond to an orthonormal set of eigenvectors. $\lambda_{\max}(A)$ is the largest eigenvalue of $A$. The spectrum of $A$, denoted by $\mathrm{spec}(A)$, is the multiset of all eigenvalues, where each eigenvalue appears a number of times equal to its multiplicity. We let $\|C\| \equiv \max_{|v| = 1} |Cv|$ denote the operator norm of $C \in \mathbb{C}^{d\times d}$ ($|\cdot|$ is the Euclidean norm). By the spectral theorem, the trace $\mathrm{Tr}(A) \equiv \sum_{i=1}^d A_{ii}$ is the sum of the eigenvalues of $A$.

Spectral mapping
Let $f : \mathbb{C} \to \mathbb{C}$ be an entire analytic function with a power-series representation $f(z) \equiv \sum_{n \ge 0} c_n z^n$ ($z \in \mathbb{C}$). If all $c_n$ are real, the expression:
$$f(A) \equiv \sum_{n \ge 0} c_n A^n \quad (A \in \mathbb{C}^{d\times d}_{\mathrm{Herm}})$$
corresponds to a map from $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ to itself. We will sometimes use the so-called spectral mapping property:
$$\mathrm{spec}\,f(A) = f(\mathrm{spec}(A)). \tag{4}$$
By this we mean that the eigenvalues of $f(A)$ are the numbers $f(\lambda)$ with $\lambda \in \mathrm{spec}(A)$. Moreover, the multiplicity of $\xi \in \mathrm{spec}\,f(A)$ is the sum of the multiplicities of all preimages of $\xi$ under $f$ that lie in $\mathrm{spec}(A)$.
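A quick numerical check of the spectral mapping property, with $f = \exp$ and a random Hermitian $A$ (both our own illustrative choices); $f(A)$ is computed via the eigendecomposition, which agrees with the power series for entire $f$ with real coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6

M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
A = (M + M.conj().T) / 2  # random Hermitian matrix

# f(A) computed via A = U diag(lam) U^*: f(A) = U diag(f(lam)) U^*.
lam, U = np.linalg.eigh(A)
fA = (U * np.exp(lam)) @ U.conj().T

# Spectral mapping: spec f(A) = f(spec A), multiplicities included.
spec_fA = np.sort(np.linalg.eigvalsh(fA))
print(np.allclose(spec_fA, np.sort(np.exp(lam))))
```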

The positive-semidefinite order
We will use the notation $A \succeq 0$ to say that $A$ is positive-semidefinite, i.e. $A \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}$ and the eigenvalues of $A$ are non-negative. This is equivalent to saying that $(v, Av) \ge 0$ for all $v \in \mathbb{C}^d$, where $(\cdot, \cdot)$ is the standard Euclidean inner product.
If $A, B \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}$, we write $A \succeq B$ or $B \preceq A$ to say that $A - B \succeq 0$. Notice that "$\succeq$" is a partial order and that:
$$A \succeq 0 \text{ and } B \succeq 0 \;\Rightarrow\; A + B \succeq 0. \tag{5}$$
Moreover, spectral mapping (4) implies that:
$$\forall A \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}:\quad A^2 \succeq 0. \tag{6}$$
We will also need the following simple fact.
Proposition 1 For all $A, B, C \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}$:
$$A \preceq B \text{ and } C \succeq 0 \;\Rightarrow\; \mathrm{Tr}(CA) \le \mathrm{Tr}(CB). \tag{7}$$
Proof: To prove this, assume the LHS and observe that the RHS is equivalent to $\mathrm{Tr}(C\Delta) \ge 0$ where $\Delta \equiv B - A$. By assumption, $\Delta \succeq 0$, hence it has a Hermitian square root $\Delta^{1/2}$. The cyclic property of the trace implies:
$$\mathrm{Tr}(C\Delta) = \mathrm{Tr}(C\Delta^{1/2}\Delta^{1/2}) = \mathrm{Tr}(\Delta^{1/2} C \Delta^{1/2}).$$
Since the trace is the sum of the eigenvalues, we will be done once we show that $\Delta^{1/2} C \Delta^{1/2} \succeq 0$. But, since $\Delta^{1/2}$ is Hermitian and $C \succeq 0$,
$$\forall v \in \mathbb{C}^d:\quad (v, \Delta^{1/2} C \Delta^{1/2} v) = (\Delta^{1/2} v, C\,\Delta^{1/2} v) \ge 0,$$
which shows that $\Delta^{1/2} C \Delta^{1/2} \succeq 0$, as desired. $\Box$
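Proposition 1 can be sanity-checked numerically by sampling triples with the required order relations (the random construction below is our own):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8

def rand_psd():
    # Random positive-semidefinite matrix G G^*.
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return G @ G.conj().T

violations = 0
for _ in range(100):
    A = rand_psd() - rand_psd()  # arbitrary Hermitian A
    B = A + rand_psd()           # B - A >= 0, i.e. A <= B in the psd order
    C = rand_psd()               # C >= 0
    # Proposition 1 predicts Tr(CA) <= Tr(CB); allow a small float tolerance.
    if np.trace(C @ A).real > np.trace(C @ B).real + 1e-6:
        violations += 1
print(violations)
```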

Probability with matrices
Assume $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space and $Z : \Omega \to \mathbb{C}^{d\times d}_{\mathrm{Herm}}$ is measurable with respect to $\mathcal{F}$ and the Borel $\sigma$-field on $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ (this is equivalent to requiring that all entries of $Z$ be complex-valued random variables). $\mathbb{C}^{d\times d}_{\mathrm{Herm}}$ is a metrically complete vector space and one can naturally define an expected value $\mathbb{E}[Z] \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}$. This turns out to be the matrix whose $(i,j)$-th entry is the expected value of the $(i,j)$-th entry of $Z$.
[Of course, $\mathbb{E}[Z]$ is only defined if all entries of $Z$ are integrable, but this will always be the case in this paper.] The definition of expectations implies that traces and expectations commute:
$$\mathbb{E}\,\mathrm{Tr}(Z) = \mathrm{Tr}\,\mathbb{E}[Z]. \tag{8}$$
Moreover, one can check that the usual product rule is satisfied: if $Z$ and $W$ are independent,
$$\mathbb{E}[ZW] = \mathbb{E}[Z]\,\mathbb{E}[W]. \tag{9}$$

Proof of Theorem 1
Notice that $\|Z_n\| = \max\{\lambda_{\max}(Z_n), \lambda_{\max}(-Z_n)\}$. However, $Z_n$ and $-Z_n$ have the same distribution. It follows that:
$$\forall t \ge 0:\quad \mathbb{P}(\|Z_n\| \ge t) \le 2\,\mathbb{P}(\lambda_{\max}(Z_n) \ge t).$$
The usual Bernstein trick implies that for all $t \ge 0$,
$$\mathbb{P}(\lambda_{\max}(Z_n) \ge t) \le \inf_{s \ge 0} e^{-st}\,\mathbb{E}\big[e^{s\lambda_{\max}(Z_n)}\big].$$
The function $x \mapsto e^{sx}$ is monotone non-decreasing and positive for all $s \ge 0$. It follows from the spectral mapping property (4) that for all $s \ge 0$, the largest eigenvalue of $e^{sZ_n}$ is $e^{s\lambda_{\max}(Z_n)}$ and all eigenvalues of $e^{sZ_n}$ are non-negative. Using the equality "trace = sum of eigenvalues" implies that for all $s \ge 0$,
$$\mathbb{E}\big[e^{s\lambda_{\max}(Z_n)}\big] = \mathbb{E}\big[\lambda_{\max}(e^{sZ_n})\big] \le \mathbb{E}\big[\mathrm{Tr}\,e^{sZ_n}\big].$$
As a result, we have the inequality:
$$\mathbb{P}(\|Z_n\| \ge t) \le 2 \inf_{s \ge 0} e^{-st}\,\mathbb{E}\big[\mathrm{Tr}\,e^{sZ_n}\big]. \tag{10}$$
Up to now, our proof has followed Ahlswede and Winter's argument. The next lemma, however, will require new ideas.
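The chain $e^{s\lambda_{\max}(Z_n)} = \lambda_{\max}(e^{sZ_n}) \le \mathrm{Tr}\,e^{sZ_n}$ used above can be verified directly on a small symmetric matrix (the matrix and the value of $s$ are our own toy choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d, s = 7, 0.3

M = rng.standard_normal((d, d))
Z = (M + M.T) / 2  # a symmetric stand-in for Z_n

lam, U = np.linalg.eigh(Z)
exp_sZ = (U * np.exp(s * lam)) @ U.T  # e^{sZ} via the spectral theorem

lam_max_exp = np.linalg.eigvalsh(exp_sZ).max()
lhs = np.exp(s * lam.max())  # e^{s * lambda_max(Z)}
trace = np.trace(exp_sZ)

# Equality by spectral mapping, then lambda_max <= trace since all
# eigenvalues of e^{sZ} are non-negative.
print(np.isclose(lhs, lam_max_exp), lhs <= trace + 1e-12)
```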
Lemma 2 For all $s \in \mathbb{R}$,
$$\mathbb{E}\big[\mathrm{Tr}(e^{sZ_n})\big] \le \mathrm{Tr}\,e^{\frac{s^2}{2}\sum_{i=1}^n A_i^2}.$$
This lemma is proven below. We will now show how it implies Rudelson's bound. Let
$$\sigma \equiv \Big\|\sum_{i=1}^n A_i^2\Big\|^{1/2}, \quad\text{so that}\quad 0 \preceq \sum_{i=1}^n A_i^2 \preceq \sigma^2 I.$$
[The second inequality follows from $\sum_{i=1}^n A_i^2 \succeq 0$, which holds because of (5) and (6).] We note that:
$$\mathrm{Tr}\,e^{\frac{s^2}{2}\sum_{i=1}^n A_i^2} \le d\,\lambda_{\max}\Big(e^{\frac{s^2}{2}\sum_{i=1}^n A_i^2}\Big) = d\,e^{\frac{s^2\sigma^2}{2}},$$
where the equality is yet another application of spectral mapping (4) and the fact that $x \mapsto e^{s^2x/2}$ is monotone increasing. We deduce from the Lemma and (10) that:
$$\mathbb{P}(\|Z_n\| \ge t) \le 2d \inf_{s \ge 0} e^{-st + \frac{s^2\sigma^2}{2}} = 2d\,e^{-\frac{t^2}{2\sigma^2}}.$$
This implies that for any $p \ge 1$,
$$\mathbb{E}\Big[\big(\|Z_n\| - \sqrt{2\ln(2d)}\,\sigma\big)_+^p\Big]^{1/p} \le C_p\,\sigma.$$
Since $0 \le \|Z_n\| \le \sqrt{2\ln(2d)}\,\sigma + \big(\|Z_n\| - \sqrt{2\ln(2d)}\,\sigma\big)_+$, this implies the $L^p$ estimate in the Theorem. The bound "$C_p \le c\sqrt{p}$" is standard and we omit its proof. $\Box$ To finish, we now prove Lemma 2.
Proof: For $0 \le j \le n$, define
$$D_j \equiv s\sum_{i=1}^{j} \epsilon_i A_i + \frac{s^2}{2}\sum_{i=j+1}^{n} A_i^2,$$
so that $D_n = sZ_n$ and $D_0 = \frac{s^2}{2}\sum_{i=1}^n A_i^2$. We will prove that for all $1 \le j \le n$:
$$\mathbb{E}\big[\mathrm{Tr}(e^{D_j})\big] \le \mathbb{E}\big[\mathrm{Tr}(e^{D_{j-1}})\big]. \tag{12}$$
Notice that this implies $\mathbb{E}\big[\mathrm{Tr}(e^{D_n})\big] \le \mathbb{E}\big[\mathrm{Tr}(e^{D_0})\big]$, which is precisely the Lemma. To prove (12), fix $1 \le j \le n$. Notice that $D_{j-1}$ is independent from $s\epsilon_j A_j - s^2A_j^2/2$ since the $\{\epsilon_i\}_{i=1}^n$ are independent. This implies that:
$$\mathbb{E}\,\mathrm{Tr}(e^{D_j}) = \mathbb{E}\,\mathrm{Tr}\Big(e^{D_{j-1} + (s\epsilon_j A_j - s^2A_j^2/2)}\Big) \le \mathbb{E}\,\mathrm{Tr}\Big(e^{D_{j-1}}\,e^{s\epsilon_j A_j - s^2A_j^2/2}\Big) \quad \text{(Golden-Thompson (3))}$$
$$\text{(use product rule, (9))} \quad = \mathrm{Tr}\Big(\mathbb{E}\big[e^{D_{j-1}}\big]\,\mathbb{E}\big[e^{s\epsilon_j A_j - s^2A_j^2/2}\big]\Big).$$
By the monotonicity of the trace (7) and the fact that $\exp(D_{j-1}) \succeq 0$ (which follows from (4)), we will be done once we show that:
$$\mathbb{E}\big[e^{s\epsilon_j A_j - s^2A_j^2/2}\big] \preceq I. \tag{13}$$
The key fact is that $s\epsilon_j A_j$ and $-s^2A_j^2/2$ always commute, hence the exponential of the sum is the product of the exponentials. Applying (9) and noting that $e^{-s^2A_j^2/2}$ is constant, we see that:
$$\mathbb{E}\big[e^{s\epsilon_j A_j - s^2A_j^2/2}\big] = \mathbb{E}\big[e^{s\epsilon_j A_j}\big]\,e^{-s^2A_j^2/2}.$$
In the Gaussian case, an explicit calculation shows that $\mathbb{E}[\exp(s\epsilon_j A_j)] = e^{s^2A_j^2/2}$, hence (13) holds. In the Rademacher case, we have:
$$\mathbb{E}\big[e^{s\epsilon_j A_j}\big]\,e^{-s^2A_j^2/2} = \cosh(sA_j)\,e^{-s^2A_j^2/2} = f(A_j),$$
where $f(z) = \cosh(sz)\,e^{-s^2z^2/2}$. It is a classical fact that $0 \le \cosh(x) \le e^{x^2/2}$ for all $x \in \mathbb{R}$ (just compare the Taylor expansions); this implies that $0 \le f(\lambda) \le 1$ for all eigenvalues $\lambda$ of $A_j$. Using spectral mapping (4), we see that:
$$\mathrm{spec}\,f(A_j) \subset [0, 1],$$
which implies that $f(A_j) \preceq I$. This proves (13) in this case and finishes the proof of (12) and of the Lemma. $\Box$
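For a small instance one can even verify Lemma 2 exactly: with $n$ small, the expectation over Rademacher signs is a finite average over all $2^n$ sign patterns (the dimensions and the random $A_i$ below are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, s = 5, 8, 0.7

def herm():
    M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (M + M.conj().T) / 2

def expm_herm(H):
    # Matrix exponential of a Hermitian matrix via eigendecomposition.
    lam, U = np.linalg.eigh(H)
    return (U * np.exp(lam)) @ U.conj().T

A = [herm() for _ in range(n)]

# RHS of Lemma 2: Tr exp(s^2 (sum_i A_i^2) / 2), a deterministic quantity.
rhs = np.trace(expm_herm(s**2 * sum(Ai @ Ai for Ai in A) / 2)).real

# LHS: exact expectation of Tr exp(s Z_n) over all 2^n Rademacher patterns.
total = 0.0
for mask in range(2**n):
    eps = [1.0 if (mask >> i) & 1 else -1.0 for i in range(n)]
    Zn = sum(e * Ai for e, Ai in zip(eps, A))
    total += np.trace(expm_herm(s * Zn)).real
lhs = total / 2**n

print(lhs <= rhs + 1e-9)
```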

Remarks on the original AW approach
A direct adaptation of the original argument of Ahlswede and Winter [2] would lead to an inequality of the form:
$$\mathbb{E}\,\mathrm{Tr}(e^{sZ_n}) \le \mathrm{Tr}\Big(\mathbb{E}\big[e^{s\epsilon_n A_n}\big]\,\mathbb{E}\big[e^{sZ_{n-1}}\big]\Big).$$

One sees that:
$$\mathbb{E}\big[e^{s\epsilon_n A_n}\big] \preceq e^{\frac{s^2 A_n^2}{2}} \preceq e^{\frac{s^2\|A_n\|^2}{2}}\,I.$$
However, only the first inequality seems to be useful, as there is no obvious relationship between $\mathrm{Tr}\big(e^{\frac{s^2A_n^2}{2}}\,\mathbb{E}[e^{sZ_{n-1}}]\big)$ and $\mathrm{Tr}\big(\mathbb{E}[e^{s\epsilon_{n-1}A_{n-1}}]\,\mathbb{E}[e^{sZ_{n-2} + \frac{s^2A_n^2}{2}}]\big)$, which is what we would need to proceed with induction. [Note that the Golden-Thompson step cannot be undone, and the corresponding inequality fails for three summands [14].] The best one can do with the second inequality is:
$$\mathbb{E}\,\mathrm{Tr}(e^{sZ_n}) \le e^{\frac{s^2\|A_n\|^2}{2}}\,\mathbb{E}\,\mathrm{Tr}(e^{sZ_{n-1}}) \le \cdots \le d\,e^{\frac{s^2}{2}\sum_{i=1}^n \|A_i\|^2}.$$
This would give a version of Theorem 1 with $\sum_{i=1}^n \|A_i\|^2$ replacing $\big\|\sum_{i=1}^n A_i^2\big\|$. This modified result is always worse than the actual Theorem, and can be dramatically so. For instance, consider the case of a Wigner matrix where:
$$Z_n = \sum_{i,j=1}^m \epsilon_{ij} A_{ij},$$
with the $\epsilon_{ij}$ i.i.d. standard Gaussian and each $A_{ij}$ has ones at positions $(i,j)$ and $(j,i)$ and zeros elsewhere (we take $d = m$ and $n = m^2$ in this case). Direct calculation reveals:
$$\Big\|\sum_{i,j=1}^m A_{ij}^2\Big\|^{1/2} = \Theta(\sqrt{m}), \quad\text{whereas}\quad \Big(\sum_{i,j=1}^m \|A_{ij}\|^2\Big)^{1/2} = \Theta(m).$$
We note in passing that neither approach is sharp in this case, as $\sum_{i,j}\epsilon_{ij}A_{ij}$ concentrates around $2\sqrt{m}$ [4].
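The two quantities compared in the Wigner example are easy to compute exactly in code (the value $m = 30$ is our own choice):

```python
import numpy as np

m = 30  # we take d = m and n = m^2

sum_sq = np.zeros((m, m))
sum_norms_sq = 0.0
for i in range(m):
    for j in range(m):
        A = np.zeros((m, m))
        A[i, j] = 1.0
        A[j, i] = 1.0  # A_ij has ones at (i,j) and (j,i)
        sum_sq += A @ A
        sum_norms_sq += np.linalg.norm(A, ord=2) ** 2

sigma_thm = np.linalg.norm(sum_sq, ord=2) ** 0.5  # || sum A_ij^2 ||^{1/2}
sigma_aw = sum_norms_sq ** 0.5                    # (sum ||A_ij||^2)^{1/2}
print(sigma_thm, sigma_aw)  # Theta(sqrt(m)) vs Theta(m)
```

Here $\sum_{i,j} A_{ij}^2$ turns out to be an exact multiple of the identity, so the gap between the two variance proxies is visible already at moderate $m$.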

Concentration for rank-one operators
In this section we prove Lemma 1.

Proof sketch for the Golden-Thompson inequality
As promised in the Introduction, we sketch an elementary proof of inequality (3). We will need the Trotter-Lie formula, a simple consequence of the Taylor formula for $e^X$:
$$\forall A, B \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}:\quad \lim_{n \to +\infty} \big(e^{A/n} e^{B/n}\big)^n = e^{A+B}.$$
The second ingredient is the inequality:
$$\forall X, Y \in \mathbb{C}^{d\times d}_{\mathrm{Herm}},\; \forall m \in \mathbb{N}:\quad X, Y \succeq 0 \;\Rightarrow\; \mathrm{Tr}\big((XY)^{2m}\big) \le \mathrm{Tr}\big((X^2Y^2)^m\big). \tag{16}$$
This is proven in [5] via an argument using the existence of positive-semidefinite square roots for positive-semidefinite matrices, and the Cauchy-Schwarz inequality for the standard inner product over $\mathbb{C}^{d\times d}$. Iterating (16) implies:
$$\forall X, Y \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}:\quad X, Y \succeq 0 \;\Rightarrow\; \mathrm{Tr}\big((XY)^{2^k}\big) \le \mathrm{Tr}\big(X^{2^k} Y^{2^k}\big).$$
Apply this to $X = e^{A/2^k}$ and $Y = e^{B/2^k}$ with $A, B \in \mathbb{C}^{d\times d}_{\mathrm{Herm}}$. Spectral mapping (4) implies $X, Y \succeq 0$ and we deduce:
$$\mathrm{Tr}\big((e^{A/2^k} e^{B/2^k})^{2^k}\big) \le \mathrm{Tr}(e^A e^B).$$
Letting $k \to +\infty$ and applying the Trotter-Lie formula (with $n = 2^k$) yields $\mathrm{Tr}(e^{A+B}) \le \mathrm{Tr}(e^A e^B)$, which is precisely (3).
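Both ingredients, and the Golden-Thompson inequality itself, can be checked numerically; in the sketch below (a random Hermitian pair of our own choosing, matrix exponentials via eigendecomposition) we verify $\mathrm{Tr}\,e^{A+B} \le \mathrm{Tr}(e^A e^B)$ and the Trotter-Lie limit with $n = 2^{13}$:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 6

def herm(scale=1.0):
    M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return scale * (M + M.conj().T) / 2

def expm_herm(H):
    # Matrix exponential of a Hermitian matrix via eigendecomposition.
    lam, U = np.linalg.eigh(H)
    return (U * np.exp(lam)) @ U.conj().T

A, B = herm(1 / 3), herm(1 / 3)  # modest norms keep the Trotter error small

# Golden-Thompson: Tr e^{A+B} <= Tr(e^A e^B).
gt_lhs = np.trace(expm_herm(A + B)).real
gt_rhs = np.trace(expm_herm(A) @ expm_herm(B)).real

# Trotter-Lie: (e^{A/n} e^{B/n})^n -> e^{A+B} as n -> infinity.
n = 2**13
prod = np.linalg.matrix_power(expm_herm(A / n) @ expm_herm(B / n), n)
rel_err = (np.linalg.norm(prod - expm_herm(A + B), ord=2)
           / np.linalg.norm(expm_herm(A + B), ord=2))

print(gt_lhs <= gt_rhs, rel_err)
```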