Moment inequalities for matrix-valued U-statistics of order 2

We present Rosenthal-type moment inequalities for matrix-valued U-statistics of order 2. As a corollary, we obtain new matrix concentration inequalities for U-statistics. One of our main technical tools, a version of the non-commutative Khintchine inequality for the spectral norm of the Rademacher chaos, could be of independent interest.


Notation and background material.
Given $A \in \mathbb{C}^{d_1 \times d_2}$, $A^* \in \mathbb{C}^{d_2 \times d_1}$ will denote the Hermitian adjoint of $A$. $\mathbb{H}^d \subset \mathbb{C}^{d \times d}$ stands for the set of all self-adjoint matrices. If $A = A^*$, we will write $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ for the largest and smallest eigenvalues of $A$.
Everywhere below, $\|\cdot\|$ stands for the spectral norm $\|A\| := \sqrt{\lambda_{\max}(A^* A)}$. If $d_1 = d_2 = d$, we denote by $\mathrm{tr}(A)$ the trace of $A$. The Schatten $p$-norm of a matrix $A$ is defined as $\|A\|_{S_p} = \left(\mathrm{tr}\,(A^* A)^{p/2}\right)^{1/p}$. When $p = 1$, the resulting norm is called the nuclear norm and will be denoted by $\|\cdot\|_*$. The Schatten 2-norm is also referred to as the Frobenius norm or the Hilbert-Schmidt norm and is denoted by $\|\cdot\|_F$; the associated inner product is $\langle A_1, A_2 \rangle = \mathrm{tr}(A_1^* A_2)$.
Given $z \in \mathbb{C}^d$, $\|z\|_2 = \sqrt{z^* z}$ stands for the usual Euclidean norm of $z$. Let $A, B \in \mathbb{H}^d$. We will write $A \succeq B$ (or $A \succ B$) iff $A - B$ is nonnegative (or positive) definite. For $a, b \in \mathbb{R}$, we set $a \vee b := \max(a, b)$ and $a \wedge b := \min(a, b)$. We use $C$ to denote absolute constants that can take different values in various places.
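These norm definitions are straightforward to verify numerically. The following NumPy sketch (illustrative only, not part of the formal development) computes the spectral, Frobenius, and nuclear norms from the singular values and checks them against NumPy's built-in norms:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

s = np.linalg.svd(A, compute_uv=False)  # singular values of A

spectral = s.max()                   # ||A|| = sqrt(lambda_max(A^* A))
frobenius = np.sqrt((s ** 2).sum())  # ||A||_F, the Schatten 2-norm
nuclear = s.sum()                    # ||A||_*, the Schatten 1-norm

# agreement with NumPy's built-in matrix norms
assert np.isclose(spectral, np.linalg.norm(A, 2))
assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))
```

The chain $\|A\| \le \|A\|_F \le \|A\|_*$ follows immediately from these formulas, since they are the $\ell_\infty$, $\ell_2$, and $\ell_1$ norms of the singular value vector.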
Finally, we introduce the so-called Hermitian dilation, a tool that often allows one to reduce problems involving general rectangular matrices to the case of Hermitian matrices: for $A \in \mathbb{C}^{d_1 \times d_2}$,
$$\mathcal{D}(A) := \begin{pmatrix} 0 & A \\ A^* & 0 \end{pmatrix} \in \mathbb{H}^{d_1 + d_2}, \qquad (1)$$
so that $\mathcal{D}(A)$ is self-adjoint and $\|\mathcal{D}(A)\| = \|A\|$.
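The dilation $\mathcal{D}(A) = \begin{pmatrix} 0 & A \\ A^* & 0\end{pmatrix}$ and its norm-preserving property can be checked directly in NumPy (an illustrative sketch, not part of the formal development):

```python
import numpy as np

def hermitian_dilation(A: np.ndarray) -> np.ndarray:
    """D(A) = [[0, A], [A^*, 0]]; self-adjoint, with ||D(A)|| = ||A||."""
    d1, d2 = A.shape
    return np.block([[np.zeros((d1, d1)), A],
                     [A.conj().T, np.zeros((d2, d2))]])

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
D = hermitian_dilation(A)

assert np.allclose(D, D.conj().T)                              # D(A) is Hermitian
assert np.isclose(np.linalg.norm(D, 2), np.linalg.norm(A, 2))  # spectral norms agree
```

This is the mechanism behind Remark 3.3 below: any bound for self-adjoint-matrix-valued objects transfers to rectangular ones at the cost of replacing $d$ by $d_1 + d_2$.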
The rest of the paper is organized as follows. Section 2.1 contains the necessary background on U-statistics. Section 3 contains our main results: bounds on the $\mathbb{H}^d$-valued Rademacher chaos and the moment inequalities for $\mathbb{H}^d$-valued U-statistics of order 2. Section 4 compares our bounds with relevant results in the literature and discusses further improvements. Section 5 treats bounds for vector-valued U-statistics. Finally, Section 6 contains the technical background and the proofs of the main results.
When $H_{i_1,\ldots,i_m} \equiv H$, we obtain the classical U-statistics. It is often easier to work with the decoupled version of $U_n$ defined as
$$U'_n := \sum_{(i_1,\ldots,i_m) \in I_n^m} H_{i_1,\ldots,i_m}\big(X^{(1)}_{i_1},\ldots,X^{(m)}_{i_m}\big),$$
where $\{X^{(k)}_i\}_{i=1}^n$, $k = 1,\ldots,m$, are independent copies of $X_1,\ldots,X_n$. Our ultimate goal is to obtain moment and deviation bounds for the random variable $\|U_n - \mathbb{E} U_n\|$.
Next, we recall several useful facts about U-statistics. The projection operator $\pi_{m,k}$ $(k \le m)$ is defined as
$$\pi_{m,k} H(x_{i_1},\ldots,x_{i_k}) := (\delta_{x_{i_1}} - P)\cdots(\delta_{x_{i_k}} - P)P^{m-k} H,$$
where $\delta_x$ denotes the Dirac measure at $x$ and, for a probability measure $Q$,
$$Q^m H := \int \cdots \int H(y_1,\ldots,y_m)\, dQ(y_1)\cdots dQ(y_m).$$
For instance, it is easy to check that $\pi_{m,k} H$ is degenerate of order $k - 1$. If a kernel $F$ is degenerate of order $m - 1$, then it is called completely degenerate. From now on, we will only consider generalized U-statistics of order $m = 2$ with completely degenerate (that is, degenerate of order 1) kernels. The case of non-degenerate U-statistics is easily reduced to the degenerate case via Hoeffding's decomposition; see (de la Pena and Gine, 1999, page 137) for details.
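Complete degeneracy for $m = 2$ means that the conditional expectation of the kernel in either argument vanishes. The following NumPy sketch (a toy example of ours, on a finite sample space with the uniform measure playing the role of $P$) builds $\pi_{2,2} H$ by double centering and verifies this property:

```python
import numpy as np

rng = np.random.default_rng(2)
K, d = 6, 3  # finite sample space {0, ..., K-1} with uniform P; H^d-valued kernel

# H[x1, x2] is a d x d symmetric matrix for each pair (x1, x2)
H = rng.standard_normal((K, K, d, d))
H = H + H.transpose(0, 1, 3, 2)

# pi_{2,2} H = (delta_{x1} - P)(delta_{x2} - P) H: center in each argument
Hc = (H
      - H.mean(axis=0, keepdims=True)
      - H.mean(axis=1, keepdims=True)
      + H.mean(axis=(0, 1), keepdims=True))

# complete degeneracy: the conditional expectation in either variable vanishes
assert np.allclose(Hc.mean(axis=0), 0)
assert np.allclose(Hc.mean(axis=1), 0)
```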

Main results.
Rosenthal-type moment inequalities for sums of independent matrices have appeared in a number of works, including (Chen, Gittens and Tropp, 2012; Mackey et al., 2014; Tropp, 2015). Specifically, the following inequality follows from Theorem A.1 of (Chen, Gittens and Tropp, 2012).

Lemma 3.1 (Matrix Rosenthal inequality). Suppose that $q \ge 1$ is an integer and fix $r \ge q \vee \log d$. Consider a finite sequence $\{Y_i\}$ of independent copies of $Y \in \mathbb{H}^d$. Then inequality (3) holds.

The bound above improves upon the moment inequality that follows from the matrix Bernstein inequality (Tropp, 2015): indeed, Lemma 6.8 implies, after some simple algebra, a corresponding moment bound for an absolute constant $C > 0$ and all $q \ge 1$. This bound is weaker than (3), as it requires almost sure boundedness of $\|Y_i - \mathbb{E} Y_i\|$ for all $i$. One of the main goals of this work is to obtain operator norm bounds similar to (3) for $\mathbb{H}^d$-valued U-statistics of order 2.
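Bounds of this type are governed by a matrix variance proxy of the form $\big\|\sum_i \mathbb{E}(Y_i - \mathbb{E}Y_i)^2\big\|^{1/2}$. The following Monte Carlo sketch (ours, with a Wishart-type choice of $Y_i = x_i x_i^T$ for which the variance proxy has a closed form) checks the elementary Jensen-type relation $\mathbb{E}\|S\|^2 \ge \|\mathbb{E} S^2\|$ for the centered sum $S$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, trials = 50, 5, 400

def sample_sum():
    # S = sum_i (Y_i - E Y_i) with Y_i = x_i x_i^T, x_i ~ N(0, I_d), E Y_i = I_d
    X = rng.standard_normal((n, d))
    return X.T @ X - n * np.eye(d)

# matrix variance proxy: sigma^2 = || sum_i E (Y_i - E Y_i)^2 ||;
# here E (Y - I)^2 = (d + 1) I, so sigma^2 = n (d + 1)
sigma = np.sqrt(n * (d + 1))

norms = np.array([np.linalg.norm(sample_sum(), 2) for _ in range(trials)])

# Jensen: E ||S||^2 >= || E S^2 || = sigma^2, so the sampled RMS norm
# should not fall far below sigma (the 0.9 factor absorbs Monte Carlo error)
assert np.sqrt((norms ** 2).mean()) >= 0.9 * sigma
```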
The moment bounds for scalar U-statistics are well known; see, for example, the work by Gine, Latala and Zinn (2000) and references therein. Moment inequalities for general Banach space-valued U-statistics were obtained by Adamczak (2006). Here, we aim to improve these bounds in the special case of $\mathbb{H}^d$-valued U-statistics of order 2. We discuss connections and compare our results with the bounds obtained by Adamczak in Section 4.

Matrix Rademacher chaos.
The starting point of our investigation is a moment bound for the matrix Rademacher chaos of order 2. This bound generalizes the spectral norm inequality for matrix Rademacher series; see (Tropp, 2015, 2016a; Vershynin, 2010). We recall Khintchine's inequality for matrix Rademacher series for ease of comparison: let $A_1,\ldots,A_n \in \mathbb{H}^d$ be a sequence of fixed matrices, and $\varepsilon_1,\ldots,\varepsilon_n$ a sequence of i.i.d. Rademacher random variables. Then the bound (4) holds. Furthermore, Jensen's inequality implies that this bound is tight (up to a logarithmic factor). Note that the expected norm of $\sum_i \varepsilon_i A_i$ is controlled by the single "matrix variance" parameter $\big\|\sum_i A_i^2\big\|^{1/2}$.

Lemma 3.3. Let $\{A_{i_1,i_2}\}_{i_1,i_2=1}^n \subset \mathbb{H}^d$ be a sequence of fixed matrices. Assume that $\{\varepsilon^{(k)}_i\}_{i=1}^n$, $k = 1, 2$, are sequences of i.i.d. Rademacher random variables, and define the decoupled chaos accordingly. Then, for any $q \ge 1$, a two-sided moment bound holds with $r := q \vee \log d$, where the matrix $G \in \mathbb{R}^{(nd) \times (nd)}$ is defined below.

Remark 3.1 (Constants in Lemma 3.3). Matrix Rademacher chaos of order 2 has been studied previously by Rauhut (2009), Pisier (1998) and Rauhut (2012), where Schatten-$p$ norm upper bounds were obtained.

Moment inequalities for degenerate U-statistics of order 2.
Let $H_{i_1,i_2} : S \times S \to \mathbb{H}^d$, $(i_1,i_2) \in I_n^2$, be a sequence of degenerate kernels; for example, $H_{i_1,i_2}(x_1,x_2) = \pi_{2,2} \widehat{H}_{i_1,i_2}(x_1,x_2)$ for some non-degenerate permutation-symmetric $\widehat{H}_{i_1,i_2}$. Everywhere below, $\mathbb{E}_j[\cdot]$, $j = 1, 2$, stands for the expectation with respect to $\{X^{(j)}_i\}_{i=1}^n$ only (that is, conditionally on all other random variables). The following theorem is our most general result; it can be used as a starting point to derive more refined bounds.
Theorem 3.1. Let $H_{i_1,i_2}$, $(i_1,i_2) \in I_n^2$, be permutation-symmetric degenerate kernels. Then a moment bound holds for all $q \ge 1$ and $r = \max(q, \log d)$, where $\widetilde G_i$ is the $i$-th column of the matrix $\widetilde G \in \mathbb{H}^{nd}$ defined in (7).

Proof. See Section 6.2.3.
The following lower bound (proven in Section 6.2.4) demonstrates that all the terms in the bound of Theorem 3.1 are necessary.
Lemma 3.4. Under the assumptions of Theorem 3.1, the corresponding lower bound holds, where $C > 0$ is an absolute constant.
Our next goal is to obtain more "user-friendly" versions of the upper bound, and we first focus on the term $E_1$, which might be difficult to deal with directly. It is easy to see that the $(i,j)$-th block of the matrix $\mathbb{E}_2 \widetilde G \widetilde G^T$ admits an explicit expression, which yields the simplification stated in Lemma 3.5.

Proof. See Section 6.2.5.
One of the key features of the bounds established above is the fact that they yield estimates for $\mathbb{E}\|U_n\|$: for example, Theorem 3.1 yields inequality (10) for some absolute constant $C$. On the other hand, a direct application of the non-commutative Khintchine inequality (4) followed by the Rosenthal inequality (Lemma 6.5) only gives (11), and it is easy to see that the right-hand side of (10) is never worse than the bound (11). To see that it can be strictly better, consider the framework of Example 2, where the improvement is easy to verify (following the same calculations as those given in Section 6.4).

Remark 3.3 (Extensions to rectangular matrices). All results in this section can be extended to the general case of $\mathbb{C}^{d_1 \times d_2}$-valued kernels by considering the Hermitian dilation (1) of $U_n$, namely $\mathcal{D}(U_n)$, and observing that $\|U_n\| = \|\mathcal{D}(U_n)\|$; we omit the general statements and only consider the special case of vector-valued U-statistics in Section 5.
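The decoupled matrix Rademacher chaos underlying these bounds is easy to simulate. The sketch below (ours, with random coefficients) estimates the norm of $X = \sum_{i_1 \ne i_2} \varepsilon^{(1)}_{i_1}\varepsilon^{(2)}_{i_2} A_{i_1,i_2}$ and checks the elementary lower bound $\mathbb{E}\|X\|^2 \ge \|\mathbb{E} X^2\| = \big\|\sum_{i_1 \ne i_2} A_{i_1,i_2}^2\big\|$ implied by Jensen's inequality:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, trials = 15, 8, 300

# fixed self-adjoint coefficients A_{i1,i2}, with zero diagonal blocks
A = rng.standard_normal((n, n, d, d))
A = A + A.transpose(0, 1, 3, 2)
A[np.arange(n), np.arange(n)] = 0.0

# E X^2 = sum_{i1 != i2} A_{i1,i2}^2 for the decoupled chaos X
V = np.einsum('ijab,ijbc->ac', A, A)
sigma = np.sqrt(np.linalg.norm(V, 2))

norms = []
for _ in range(trials):
    e1 = rng.choice([-1.0, 1.0], size=n)
    e2 = rng.choice([-1.0, 1.0], size=n)
    X = np.einsum('i,j,ijab->ab', e1, e2, A)
    norms.append(np.linalg.norm(X, 2))
norms = np.array(norms)

# Jensen: E ||X||^2 >= || E X^2 ||, so the sampled RMS norm stays near or above sigma
assert np.sqrt((norms ** 2).mean()) >= 0.9 * sigma
```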

Adamczak's moment inequality for U-statistics.
Adamczak (2006) developed moment inequalities for general Banach space-valued completely degenerate U-statistics of arbitrary order. More specifically, an application of Theorem 1 in (Adamczak, 2006) to our scenario, with $B = (\mathbb{H}^d, \|\cdot\|)$ and $m = 2$, yields the bounds (12) for all $q \ge 1$ and $t \ge 2$, where $C$ is an absolute constant and the quantities $A, B, \Gamma, D$ will be specified below (see Section 6.3 for the complete statement of Adamczak's result). Notice that inequality (12) contains the "sub-Gaussian" term corresponding to $\sqrt q$ that did not appear in the previously established bounds.
We should mention another important distinction between (12) and the results of Theorem 3.1 and its corollaries, such as inequality (9): while (12) describes the deviations of $\|U_n\|$ from its expectation, (9) states that $U_n$ is close to its expectation as a random matrix; a similar connection exists between the matrix Bernstein inequality (Tropp, 2012) and Talagrand's concentration inequality (e.g., Boucheron, Lugosi and Massart, 2013). In particular, (12) can be combined with the bound (10) for $\mathbb{E}\|U_n\|$ to obtain a moment inequality that is superior (in a certain range of $q$) to the results derived from Theorem 3.1.
Theorem 4.1. Inequalities (12) hold with the following choice of $A$, $B$, $\Gamma$ and $D$:

Proof. See Section 6.3.
It is possible to further simplify the bounds for $A$ (via Lemma 3.5) and $D$ to deduce an admissible choice of these quantities. The upper bound for $A$ can be modified even further, as in (8).

Bounds for vector-valued U-statistics.
In this section, we show that the results obtained above easily adapt to the case of Hilbert space-valued U-statistics, that is, when the kernels take values in $\mathbb{C}^d$. Note that the Hermitian dilation of $H_{i_1,i_2}$ then satisfies the corresponding identity. Moreover, it is easy to see that Theorem 3.1 and Lemma 3.5 imply a moment bound (13) for all $q \ge 1$ and $r = q \vee \log d$. If, moreover, $m, M$ are such that $\|H_{i_1,i_2}\|_2 \le M$ and $\sqrt{\mathbb{E}_2 \|H_{i_1,i_2}\|_2^2} \le m$ almost surely for all $i_1, i_2$, then Lemma 6.7 implies a deviation bound (14) for an absolute constant $C > 0$ and all $t \ge 1$. To provide comparison and illustrate the improvements achievable via Theorem 4.1, observe that (13) and Lemma 6.7 imply (after some simple algebra) a bound, valid for all $t \ge 1$, that is better than (14) for small values of $t$.
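The reduction of the vector-valued case to the Hermitian case rests on the dilation of a vector viewed as a $1 \times d$ matrix; a minimal NumPy check (ours, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
h = rng.standard_normal(d)  # a vector-valued kernel evaluation, h = H(x1, x2)

# Hermitian dilation of the 1 x d matrix h^T: D = [[0, h^T], [h, 0]]
D = np.zeros((d + 1, d + 1))
D[0, 1:] = h
D[1:, 0] = h

nrm = np.linalg.norm(h)

# the spectral norm of the dilation recovers the Euclidean norm of h,
# so vector-valued bounds follow from the Hermitian-matrix results
assert np.isclose(np.linalg.norm(D, 2), nrm)
# the nonzero spectrum of D is exactly {-||h||_2, +||h||_2}
assert np.allclose(np.linalg.eigvalsh(D)[[0, -1]], [-nrm, nrm])
```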

Tools from probability theory and linear algebra.
This section summarizes several facts that will be used in our proofs. The first inequality is a bound connecting the norm of a matrix to the norms of its blocks.
Proof. It follows from a result by Bourin and Lee (2012) that there exist unitary operators $U, V$ for which the block matrix admits the required decomposition; hence the result is a consequence of the triangle inequality.
The second result is the well-known decoupling inequality for U-statistics due to de la Pena and Montgomery-Smith (1995).
Lemma 6.2. Let $\{X_i\}_{i=1}^n$ be a sequence of independent random variables with values in a measurable space $(S, \mathcal S)$ equipped with the probability measure $P$, and let $\{X^{(k)}_i\}_{i=1}^n$, $k = 1, 2, \ldots, m$, be $m$ independent copies of this sequence. Let $B$ be a separable Banach space and, for each $(i_1,\ldots,i_m) \in I_n^m$, let $H_{i_1,\ldots,i_m} : S^m \to B$ be a measurable function. Moreover, let $\Phi : [0,\infty) \to [0,\infty)$ be a convex nondecreasing function such that $\mathbb{E}\,\Phi\big(\|H_{i_1,\ldots,i_m}(X_{i_1},\ldots,X_{i_m})\|\big) < \infty$ for all $(i_1,\ldots,i_m) \in I_n^m$. Then the decoupling inequality holds with $C_m := 2^m (m^m - 1)\cdot\big((m-1)^{m-1} - 1\big)\cdots 3$. Moreover, if $H_{i_1,\ldots,i_m}$ is $P$-canonical, then the constant $C_m$ can be taken to be $m^m$. Finally, there exists a constant $D_m > 0$ such that the corresponding tail comparison holds for all $t > 0$. Furthermore, if $H_{i_1,\ldots,i_m}$ is permutation-symmetric, then both of the above inequalities can be reversed (with different constants $C_m$ and $D_m$).
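A quick scalar simulation (our toy example with the completely degenerate kernel $h(x,y) = xy$ and standard normal data, $m = 2$) illustrates the content of the decoupling inequality: the coupled and decoupled U-statistics have comparable first absolute moments.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 30, 2000

def coupled(X):
    # sum_{i != j} X_i X_j = (sum X)^2 - sum X^2, a completely degenerate U-statistic
    return X.sum() ** 2 - (X ** 2).sum()

def decoupled(X, Y):
    # same kernel, but the two arguments use independent copies of the sample
    return X.sum() * Y.sum() - (X * Y).sum()

u = np.array([coupled(rng.standard_normal(n)) for _ in range(trials)])
v = np.array([decoupled(rng.standard_normal(n), rng.standard_normal(n))
              for _ in range(trials)])

ratio = np.abs(u).mean() / np.abs(v).mean()
# decoupling: the two expectations agree up to an absolute constant factor
assert 0.2 < ratio < 5.0
```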
The following results are variants of the non-commutative Khintchine inequalities for Rademacher sums and the Rademacher chaos with explicit constants; see Theorems 6.14 and 6.22 in (Rauhut, 2012) and Corollary 20 in (Tropp, 2008).

Lemma 6.3. Let $B_j \in \mathbb{C}^{r \times t}$, $j = 1,\ldots,n$, be matrices of the same dimension, and let $\{\varepsilon_j\}_{j\in\mathbb{N}}$ be a sequence of i.i.d. Rademacher random variables. Then the Khintchine inequality holds for any $p \ge 1$.

Lemma 6.4. Let $\{A_{i_1,i_2}\}_{i_1,i_2=1}^n$ be a sequence of Hermitian matrices of the same dimension, and let $\{\varepsilon^{(k)}_i\}_{i=1}^n$, $k = 1, 2$, be i.i.d. Rademacher random variables. Then, for any $p \ge 1$, the chaos version of the Khintchine inequality holds, where the matrix $\bar G \in \mathbb{R}^{(nd)\times(nd)}$ is defined analogously to $G$ above.

The following result by Chen, Gittens and Tropp (2012) is a variant of the matrix Rosenthal inequality for nonnegative definite matrices.
Lemma 6.5. Let $Y_1,\ldots,Y_n$ be a sequence of independent nonnegative definite $d \times d$ random matrices. Then the moment bound holds for all $q \ge 1$ and $r = \max(q, \log d)$.

The next inequality, due to de la Pena and Gine (1999), allows one to replace the sum of moments of nonnegative random variables with maxima.
Lemma 6.7. Let $X$ be a random variable satisfying $(\mathbb{E}|X|^p)^{1/p} \le a_0 p^2 + a_1 p^{3/2} + a_2 p + a_3 \sqrt p + a_4$ for all $p \ge 2$ and some positive real numbers $a_j$, $j = 0,\ldots,4$. Then, for any $u \ge 2$,
$$\Pr\Big(|X| \ge e\big(a_0 u^2 + a_1 u^{3/2} + a_2 u + a_3 \sqrt u + a_4\big)\Big) \le \exp(-u).$$
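The mechanism behind Lemma 6.7 is Markov's inequality applied at the moment level $p = u$. The following sketch (ours) checks the special case $a_0 = a_1 = a_2 = a_4 = 0$ numerically for a standard normal variable, whose moments satisfy $(\mathbb{E}|X|^p)^{1/p} \le \sqrt p$:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal(200_000)
a3 = 1.0  # coefficient of sqrt(p) in the moment bound (only a_3 is nonzero here)

# moment growth: (E|X|^p)^{1/p} <= a3 * sqrt(p) for a standard normal
for p in (2, 4, 6, 8):
    assert (np.abs(X) ** p).mean() ** (1 / p) <= a3 * np.sqrt(p)

# resulting tail, as in Lemma 6.7: P(|X| >= e * a3 * sqrt(u)) <= exp(-u)
for u in (2.0, 3.0, 4.0):
    thr = np.e * a3 * np.sqrt(u)
    assert (np.abs(X) >= thr).mean() <= np.exp(-u)
```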
Lemma 6.8. Let $X$ be a random variable such that $\Pr\big(|X| \ge a_1 u + a_2 \sqrt u\big) \le e^{-u}$ for all $u \ge 1$ and some $0 \le a_1, a_2 < \infty$. Then $(\mathbb{E}|X|^p)^{1/p} \le C\big(a_1 p + a_2 \sqrt p\big)$ for all $p \ge 1$, where $C > 0$ is an absolute constant.
Next, we will use Lemma 6.2 combined with a well-known argument to obtain a symmetrization inequality for degenerate U-statistics.

Lemma 6.9. Let $H_{i_1,i_2} : S \times S \to \mathbb{H}^d$ be degenerate kernels, $X_1,\ldots,X_n$ independent $S$-valued random variables, and assume that $\{X^{(k)}_i\}_{i=1}^n$, $k = 1, 2$, are independent copies of this sequence. Moreover, let $\{\varepsilon^{(k)}_i\}_{i=1}^n$, $k = 1, 2$, be i.i.d. Rademacher random variables. Then, for any $p \ge 1$, the symmetrized and decoupled moments are comparable.

Proof. The first inequality follows from the fact that $H_{i_1,i_2}$ is $P$-canonical, hence Lemma 6.2 applies with constant $C_2 = 4$. Next, for $i = 1, 2$, let $\mathbb{E}_i[\cdot]$ stand for the expectation with respect to the variables with upper index $i$ (that is, conditionally on the variables with upper index $k \ne i$). Using iterated expectations and applying the symmetrization inequality for Rademacher sums twice (see Lemma 6.3 in (Ledoux and Talagrand, 1991)), we deduce the claim.

Proofs of results in Section 3.

Proof of Lemma 3.3.
Recall the definition of the decoupled chaos from Lemma 3.3, and let $C_p := 2\big(\tfrac{2\sqrt 2}{e}\, p\big)^{2p}$. We will first establish the upper bound. Applying Lemma 6.4 (Khintchine's inequality) to the sequence of matrices $\{A_{i_1,i_2}\}_{i_1,i_2=1}^n$ with $A_{j,j} = 0$ for $j = 1,\ldots,n$ yields inequality (16), where $G$ is the $(nd) \times (nd)$ block matrix with zero diagonal blocks whose off-diagonal blocks are the $A_{i_1,i_2}$. Our goal is to obtain a version of inequality (16) for $p = \infty$. To this end, we need to find an upper bound for the trace terms appearing in (16). Since $G$ is an $nd \times nd$ matrix, a naive upper bound is of order $\log(nd)$; we will show that it can be improved to $\log d$. To do so, we need to distinguish between the cases when the maximum in (16) is attained by the first or by the second term. Define the block matrices $B_{i_1}$ in which $A_{i_1,i_2}$ sits in the $i_1$-th position, and let $\lambda_1,\ldots,\lambda_{nd}$ denote the associated eigenvalues. Then the two cases can be analyzed separately, and the following bound gives a key estimate.
The proof of the lemma is given in Section 6.2.2. We will apply this fact with $M_j = B_j$, $j = 1,\ldots,n$. Assuming that $\max_i \lambda_i \le \frac1d \sum_{j=1}^{nd} \lambda_j$, it is easy to see that the second term in the maximum in (16) dominates; here the last equality follows from the fact that $\|H^p\| = \|H\|^p$ for any positive semidefinite matrix $H$. On the other hand, when $\max_i \lambda_i > \frac1d \sum_{j=1}^{nd} \lambda_j$, it is easy to see that the first term dominates for all $p \ge 1$. Combining (19) and (20), we deduce the required estimate, where the second-to-last equality follows again from the fact that $\|H^p\| = \|H\|^p$ for positive semidefinite $H$. Thus, combining the bound above with (16) and (18), we obtain the claim.
Finally, set $p = \max(q, \log d)$ and note that $d^{1/(2p)} \le \sqrt e$, which yields the stated constant. This finishes the proof of the upper bound. Now we turn to the lower bound. Let $\mathbb{E}_1[\cdot]$ stand for the expectation with respect to $\{\varepsilon^{(1)}_i\}_{i=1}^n$. It is easy to check the corresponding identity, where the $B_i$ were defined in (40). Next, for any matrix $A \in \mathbb{R}^{d_1 \times d_2}$, the relevant identity holds, where the last equality follows since $\|AA^T\| = \|A^T A\|$; choosing $A$ appropriately completes the argument.
The equality of traces is obvious. Moreover, $\lambda_i \ge 0$, $\nu_j \ge 0$ for all $i, j$, and $\max_i \lambda_i \le \frac1d \sum_{j=1}^d \nu_j = \frac S d$ by assumption. Hence, it is enough to establish the maximum over $0 \le \lambda_i \le \frac S d$ stated in (21). The right-hand side of inequality (21) can be estimated via Jensen's inequality. Given positive integers $K \ge d$ and $K' \ge d'$, we will write $(K, d) > (K', d')$ if $K \ge K'$, $d \ge d'$, and at least one of the inequalities is strict. We will now prove by induction that, for all $(K, d)$ with $K \ge d$ and any $S > 0$, the claimed bound holds for the maximum over $\lambda_1,\ldots,\lambda_K \in \Lambda(K, d, S)$. The inequality is obvious for all pairs $(K, d)$ with $K = d$ or with $d = 1$. Fix $(K, d)$ with $K > d > 1$, and assume that the claim holds for all $(K', d')$ such that $K' \ge d'$ and $(K, d) > (K', d')$.
It is easy to check that the only critical point of the Lagrangian $F(\lambda_1,\ldots,\lambda_K,\tau)$ in the relative interior of the set $\Lambda(K, d, S)$ is $\bar\lambda_1 = \ldots = \bar\lambda_K = S/K$, where the function achieves its minimum; hence the maximum is attained on the boundary of the set $\Lambda(K, d, S)$. There are two possibilities:
1. Without loss of generality, $\lambda_1 = 0$. Then the situation is reduced to $(K', d') = (K - 1, d)$, whence we conclude that the claimed bound holds for the maximum over $\lambda_1,\ldots,\lambda_{K-1} \in \Lambda(K - 1, d, S)$.
2. Without loss of generality, $\lambda_1 = S/d$. Then the situation is reduced to $(K', d') = (K - 1, d - 1)$ with $S' = S(d-1)/d$, and the claim follows.

Proof of Theorem 3.1.
By Lemma 6.9, we obtain a symmetrized bound, where $U'_n$ was defined in (15). Applying Lemma 3.3 conditionally on $\{X^{(j)}_i\}_{i=1}^n$, $j = 1, 2$, we get an intermediate bound, where $\widetilde G$ was defined in (7). Let $\widetilde G_i$ be the $i$-th column of $\widetilde G$, and let $Q_i \in \mathbb{H}^{(n+1)d}$ be defined accordingly. Inequality (24) then applies. Let $\mathbb{E}_2[\cdot]$ stand for the expectation with respect to $\{X^{(2)}_i\}_{i=1}^n$. Then the Minkowski inequality followed by the symmetrization inequality yield the next estimate. Next, we obtain an upper bound for $\big(\mathbb{E}\big\|\sum_{i=1}^n \varepsilon_i Q_i^2\big\|^q\big)^{1/(2q)}$. To this end, we apply the Khintchine inequality (Lemma 6.3). Denote $C_r := 2\big(\tfrac{2\sqrt 2}{e}\, r\big)^{2r}$, and let $\mathbb{E}_\varepsilon[\cdot]$ be the expectation with respect to $\{\varepsilon_i\}_{i=1}^n$ only. Then, for $r > q$, we deduce the required bound, where we used the fact that $Q_i^4 \preceq \|Q_i^2\| Q_i^2$ for all $i$, and the fact that $A \preceq B$ implies $\mathrm{tr}\, g(A) \le \mathrm{tr}\, g(B)$ for any non-decreasing $g : \mathbb{R} \to \mathbb{R}$. Next, we focus on the remaining term and put the bounds together. Observe that, for $r$ such that $2r \ge q$, Hölder's inequality applies to $\mathbb{E}_\varepsilon$. Next, set $r = q \vee \log d$ and apply the Cauchy-Schwarz inequality to deduce (28). Substituting bound (28) into (26), it follows from (25) that the intermediate estimate holds, where the last equality follows from the definition of $\widetilde G_i$. To bring the bound to its final form, we apply Rosenthal's inequality (Lemma 6.5) to the last term in (29). Moreover, Jensen's inequality implies that this term can be combined with one of the terms in (29).

Proof of Lemma 3.4.
Let $\mathbb{E}_i[\cdot]$ stand for the expectation with respect to the variables with the upper index $i$ only.
Since the $H_{i_1,i_2}(\cdot,\cdot)$ are permutation-symmetric, we can apply the second part of Lemma 6.2 and the desymmetrization inequality to obtain the claim for some absolute constant $C_0 > 0$.
Applying the lower bound of Lemma 3.3 conditionally on $\{X^{(1)}_i\}_{i=1}^n$ and $\{X^{(2)}_i\}_{i=1}^n$ completes the proof.

Proof of Lemma 3.5.
Let $J \subseteq I \subseteq \{1, 2\}$. We will write $\mathbf i$ to denote the multi-index $(i_1, i_2) \in \{1,\ldots,n\}^2$. We will also let $\mathbf i_I$ be the restriction of $\mathbf i$ to its coordinates indexed by $I$ and, for a fixed value of $\mathbf i_{I^c}$, let $(H_{\mathbf i})_{\mathbf i_I}$ be the array $\big\{H_{\mathbf i},\ \mathbf i_I \in \{1,\ldots,n\}^{|I|}\big\}$, where $H_{\mathbf i} := H_{i_1,i_2}\big(X^{(1)}_{i_1}, X^{(2)}_{i_2}\big)$. Finally, we let $\mathbb{E}_I$ stand for the expectation with respect to the variables with upper indices contained in $I$ only, and $\big\|(H_{\mathbf i})_{\mathbf i_I}\big\|_{\mathcal H, \mathcal H} := \|H_{\mathbf i}\|$, where $\langle A_1, A_2 \rangle := \mathrm{tr}(A_1 A_2^*)$ for $A_1, A_2 \in \mathbb{H}^d$ and $\|\cdot\|_*$ denotes the nuclear norm. Theorem 1 in (Adamczak, 2006) states that the corresponding bound holds for all $q \ge 1$, where $C$ is an absolute constant. Obtaining upper bounds for each term in the sum above, we get the claim.