A note on the Hanson-Wright inequality for random vectors with dependencies

We prove that quadratic forms in isotropic random vectors $X$ in $\mathbb{R}^n$, possessing the convex concentration property with constant $K$, satisfy the Hanson-Wright inequality with constant $CK$, where $C$ is an absolute constant, thus eliminating the logarithmic (in the dimension) factors in a recent estimate by Vu and Wang. We also show that the concentration inequality for all Lipschitz functions implies a uniform version of the Hanson-Wright inequality for suprema of quadratic forms (in the spirit of the inequalities by Borell, Arcones-Gin\'e and Ledoux-Talagrand). Previous results of this type relied on stronger isoperimetric properties of $X$ and in some cases provided an upper bound on the deviations rather than a concentration inequality. In the last part of the paper we show that the uniform version of the Hanson-Wright inequality for Gaussian vectors can be used to recover a recent concentration inequality for empirical estimators of the covariance operator of $B$-valued Gaussian variables due to Koltchinskii and Lounici.


Introduction
The Hanson-Wright inequality asserts that if $X_1, \ldots, X_n$ are independent mean zero, variance one random variables with sub-Gaussian tail decay, i.e. such that for all $t > 0$, $\mathbb{P}(|X_i| \ge t) \le 2\exp(-t^2/K^2)$, and $A = [a_{ij}]_{i,j=1}^n$ is an $n \times n$ matrix, then the quadratic form $X^TAX = \sum_{i,j\le n} a_{ij}X_iX_j$ satisfies the inequality
$$\mathbb{P}\big(|X^TAX - \mathbb{E}X^TAX| \ge t\big) \le 2\exp\Big(-\frac{1}{C}\min\Big(\frac{t^2}{K^4\|A\|_{HS}^2}, \frac{t}{K^2\|A\|}\Big)\Big)$$
for all $t > 0$, where $C$ is a universal constant. Here and in what follows $\|A\|_{HS} = (\sum_{i,j\le n} a_{ij}^2)^{1/2}$ is the Hilbert-Schmidt norm of $A$, whereas $\|A\| = \sup_{|x|\le 1}|Ax|$ is the operator norm of $A$ ($|\cdot|$ denotes the standard Euclidean norm in $\mathbb{R}^n$). Actually Hanson and Wright [12] proved a somewhat weaker inequality in which $\|A\|$ was replaced by the operator norm of the matrix $\tilde{A} = [|a_{ij}|]_{i,j=1}^n$. The original argument worked only for symmetric random variables; the general mean zero case was proved by Wright in [32]. The above version with the operator norm of $A$ appeared in many works under different sets of assumptions. For Gaussian variables it follows from estimates for general Banach space valued polynomials by Borell [8] and Arcones-Giné [4]. Independent proofs were also provided by Ledoux-Talagrand [21] and Latała [16,17]. It is also well known that the general case can be reduced to the Gaussian one by comparison of moments or a decoupling and contraction approach [18,5,3,25]. As observed by Latała [16], in the Gaussian case the Hanson-Wright inequality can be reversed (up to universal constants). Latała also provided two-sided moment and tail inequalities for higher degree homogeneous forms in Gaussian variables [17] (see also [3]).
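As a quick numerical illustration (not part of the original argument), one can check by simulation that for independent standard Gaussian coefficients the quadratic form $X^TAX$ indeed concentrates around its mean $\mathbb{E}X^TAX = \operatorname{tr} A$; the test matrix, dimension and sample size below are arbitrary choices made for the sketch.

```python
import random

# Illustrative sanity check: for independent standard Gaussian X_i,
# the quadratic form X^T A X has mean tr(A) and concentrates around it.
random.seed(0)

n = 10
# An arbitrary fixed test matrix A = [a_ij] with entries in [-0.4, 0.4].
A = [[((i * 7 + j * 3) % 5 - 2) / 5.0 for j in range(n)] for i in range(n)]
trace_A = sum(A[i][i] for i in range(n))

def quadratic_form(x):
    """Compute x^T A x = sum_{i,j} a_ij x_i x_j."""
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

trials = 5000
samples = []
for _ in range(trials):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    samples.append(quadratic_form(x))

# The empirical mean of X^T A X should be close to tr(A); the fluctuation
# of the individual samples is governed by the Hanson-Wright bound.
empirical_mean = sum(samples) / trials
```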
The interest in Hanson-Wright type estimates has recently been revived in connection with the non-asymptotic theory of random matrices and related statistical problems [31,24]. Since in many applications one considers quadratic forms in random vectors with dependencies among coefficients, some recent work has been devoted to proving counterparts of the Hanson-Wright inequality in a dependent setting. In particular, in [14] a corresponding upper tail inequality is proved for positive definite matrices and sub-Gaussian random vectors $X$ (we recall that a random vector $X$ in $\mathbb{R}^n$ is sub-Gaussian with constant $K$ if for all $u \in S^{n-1}$ and all $t > 0$, $\mathbb{P}(|\langle X, u\rangle| \ge t) \le 2\exp(-t^2/K^2)$). It is easy to see that in this setting one cannot hope for a lower tail estimate, as a sub-Gaussian random vector can vanish with probability separated from zero. In [31], Vu and Wang consider vectors satisfying the convex concentration property (see Definition 2.2 below) and prove that if $X$ is a random vector in $\mathbb{R}^n$ in the isotropic position (i.e. with mean zero and covariance matrix equal to identity) which has the convex concentration property with constant $K$, then for all $t > 0$ it satisfies a Hanson-Wright type bound (1) with additional factors logarithmic in the dimension. (We remark that Vu and Wang considered complex random vectors with the complex conjugate-transpose operation instead of the transpose, but since we are interested here primarily in the real case, we do not state their result in this version. In fact it is not difficult to pass from the real version to the complex one.) One of the objectives of this paper is to remove the dependence on dimension in the above estimate (Theorem 2.3 below), as well as to prove corresponding uniform estimates for suprema of quadratic forms under some stronger assumptions on the random vector $X$ (Theorem 2.4). Such uniform versions (corresponding to Banach space valued quadratic forms) for Gaussian random vectors were considered e.g.
by Borell [8] and Arcones-Giné [4], whereas the Rademacher case was studied by Talagrand [30] and Bousquet-Boucheron-Lugosi-Massart [9]. In Theorem 2.4 we prove that a uniform estimate is a consequence of the concentration property for Lipschitz functions.
The estimates provided by uniform Hanson-Wright inequalities are expressed in terms of expectations of suprema of certain empirical processes. Since estimating such expectations is in general difficult, direct applications of such inequalities are limited. In our last result, Theorem 4.1 presented in Section 4, we provide one example in which it is possible to effectively bound the empirical process involved in the estimate, i.e. we recover a recent concentration result for empirical approximations of the covariance operator for Banach space valued Gaussian variables, obtained first by Koltchinskii and Lounici by other methods [15].
The organization of the paper is as follows. In the next section we present our main results together with some additional discussion. Next, in Section 3 we provide proofs. Finally, in Section 4 we present the aforementioned application of uniform estimates for quadratic forms.
Acknowledgements. The author would like to thank Vladimir Koltchinskii and Karim Lounici for interesting conversations during The Seventh International Conference on High Dimensional Probability. The results of this paper grew directly out of those conversations. Separate thanks go to the organizers of the conference.

Main results
To introduce the setting for our estimates let us first recall the standard definitions of concentration properties of random vectors.

Definition 2.1 (Concentration property). Let $X$ be a random vector in $\mathbb{R}^n$. We will say that $X$ has the concentration property with constant $K$ if for every 1-Lipschitz function $\varphi \colon \mathbb{R}^n \to \mathbb{R}$, we have $\mathbb{E}|\varphi(X)| < \infty$ and for every $t > 0$,
$$\mathbb{P}\big(|\varphi(X) - \mathbb{E}\varphi(X)| \ge t\big) \le 2\exp(-t^2/K^2).$$

The concentration property of random vectors has been extensively studied in the last forty years, starting with the celebrated results of Borell [7] and Sudakov-Tsirelson [27], who established it for Gaussian measures. Many efficient techniques for proving concentration have been discovered, including e.g. isoperimetric techniques, functional inequalities, transportation of measure, and semigroup tools. We refer to the monograph [20] by Ledoux for a thorough discussion of this topic.
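To get a feel for the dimension-free nature of the concentration property (an illustration added here, not taken from the paper), one can simulate a Gaussian vector, for which the property holds with an absolute constant: the 1-Lipschitz function $f(x) = \max_i x_i$ fluctuates on a scale of order one even though its mean grows like $\sqrt{2\log n}$.

```python
import random
import statistics

# Illustration: for a standard Gaussian vector in R^n, the 1-Lipschitz
# function f(x) = max_i x_i has fluctuations of order 1, uniformly in n,
# as predicted by Gaussian concentration for Lipschitz functions.
random.seed(1)

def sample_max(n):
    """One sample of max_i X_i for X a standard Gaussian vector in R^n."""
    return max(random.gauss(0.0, 1.0) for _ in range(n))

trials = 1000
for n in (10, 300):
    vals = [sample_max(n) for _ in range(trials)]
    spread = statistics.pstdev(vals)
    # The spread stays bounded by an absolute constant as n grows,
    # while the mean of the maximum itself increases with n.
    assert spread < 1.5
```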

Definition 2.2 (Convex concentration property). Let $X$ be a random vector in $\mathbb{R}^n$. We will say that $X$ has the convex concentration property with constant $K$ if for every 1-Lipschitz convex function $\varphi \colon \mathbb{R}^n \to \mathbb{R}$, we have $\mathbb{E}|\varphi(X)| < \infty$ and for every $t > 0$,
$$\mathbb{P}\big(|\varphi(X) - \mathbb{E}\varphi(X)| \ge t\big) \le 2\exp(-t^2/K^2).$$

Remarks.
1. The convex concentration property was first observed by Talagrand, who proved it for the uniform measure on the discrete cube [28] and for general product measures with bounded support [29] by means of his celebrated convex distance inequality. In the non-product case it has been obtained by Samson [26] for vectors satisfying some uniform mixing properties and recently by Paulin [23] under Dobrushin type criteria. From Talagrand's results it also follows that the convex concentration property is satisfied by vectors obtained via sampling without replacement [23,2]. Sub-Gaussian estimates for the upper tails of Lipschitz functions of product random vectors were also obtained by Ledoux [19] and later, in the unbounded case, by Adamczak [1] by means of log-Sobolev inequalities.
2. Note that the convex concentration property is preserved if we replace $X$ with $UX + b$, where $U$ is a deterministic orthogonal matrix and $b \in \mathbb{R}^n$.
Our first result is the following

Theorem 2.3. Let $X$ be a mean zero random vector in $\mathbb{R}^n$. If $X$ has the convex concentration property with constant $K$, then for any $n \times n$ matrix $A = [a_{ij}]_{i,j=1}^n$ and every $t > 0$,
$$\mathbb{P}\big(|X^TAX - \mathbb{E}X^TAX| \ge t\big) \le 2\exp\Big(-\frac{1}{C}\min\Big(\frac{t^2}{K^4\|A\|_{HS}^2}, \frac{t}{K^2\|A\|}\Big)\Big) \qquad (2)$$
for some universal constant $C$.
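For orientation, it may help to record the standard special case $A = \mathrm{Id}$ (a routine consequence, spelled out here for convenience; it is not stated in the source):

```latex
% Taking A = Id in Theorem 2.3, so that X^T A X = |X|^2,
% \|A\|_{HS}^2 = n and \|A\| = 1, the theorem yields
\mathbb{P}\bigl(\bigl||X|^2 - \mathbb{E}|X|^2\bigr| \ge t\bigr)
  \le 2\exp\Bigl(-\frac{1}{C}\min\Bigl(\frac{t^2}{K^4 n},\ \frac{t}{K^2}\Bigr)\Bigr).
% In particular, for isotropic X (so that E|X|^2 = n) this gives
% |X| = \sqrt{n} + O(K^2) with high probability.
```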

Remarks.
1. The above theorem improves the estimate (1) due to Vu-Wang by removing the dimension dependent factors (note that in the isotropic case $\mathbb{E}X^TAX = \operatorname{tr} A$ and $\|\operatorname{Cov}(X)\| = \|\mathrm{Id}\| = 1$).
2. The assumption that $X$ is centered is introduced just to simplify the statement of the theorem. Note that if $X$ has the convex concentration property with constant $K$, then so does $\tilde{X} = X - \mathbb{E}X$. Moreover, a quadratic form in $X$ can be decomposed into a sum of a quadratic form in $\tilde{X}$ and an affine function of $X$. Since linear functions are convex and Lipschitz, their deviations can be controlled by the convex concentration property. We leave the precise formulation of the corresponding inequality to the Reader.
3. As will become clear from the proof, similar theorems hold if instead of a sub-Gaussian concentration inequality for convex functions one assumes some other rate of decay for the tail probabilities. The whole argument remains valid; one just has to modify accordingly the right-hand side of (2). The convex concentration property with sub-exponential tail decay was studied e.g. in [6].
4. We remark that it is not true that if $X = (X_1, \ldots, X_n)$, where the $X_i$ are i.i.d. sub-Gaussian random variables, then $X$ has the convex concentration property with a constant independent of the dimension (as noted in [1], following [13]). Therefore, Theorem 2.3 does not imply the standard Hanson-Wright inequality.
Our second result concerns a uniform version of the Hanson-Wright inequality for suprema of quadratic forms and is contained in the following

Theorem 2.4. Let $X$ be a mean zero random vector in $\mathbb{R}^n$. Assume that $X$ has the concentration property with constant $K$. Let $\mathcal{A}$ be a bounded set of $n \times n$ matrices and consider the random variable
$$Z = \sup_{A\in\mathcal{A}}\big(X^TAX - \mathbb{E}X^TAX\big).$$
Then, for every $t > 0$,
$$\mathbb{P}\big(|Z - \mathbb{E}Z| \ge t\big) \le 2\exp\Big(-\frac{1}{C}\min\Big(\frac{t^2}{K^2X_{\mathcal{A}}^2}, \frac{t}{K^2\sup_{A\in\mathcal{A}}\|A\|}\Big)\Big), \qquad (3)$$
where
$$X_{\mathcal{A}} = \mathbb{E}\sup_{A\in\mathcal{A}}|(A + A^T)X|$$
and $C$ is a universal constant.

Remarks.

1. One can easily see that if $\mathcal{A} = \{A\}$ consists of a single matrix, then $X_{\mathcal{A}} = \mathbb{E}|(A+A^T)X| \le \|A+A^T\|_{HS}\|\operatorname{Cov} X\|^{1/2} \le 2\|A\|_{HS}\|\operatorname{Cov} X\|^{1/2}$. If in addition $X$ has the convex concentration property with constant $K$, then $\|\operatorname{Cov} X\| \le 2K^2$ (see the proof of Theorem 2.3 below). Thus the conclusion of the above theorem is stronger than that of Theorem 2.3. On the other hand the assumption is also stronger. We do not know if (3) is implied just by the convex concentration property. This is the case if instead of $\sup_{A\in\mathcal{A}} X^TAX$ one considers $\sup_{A\in\mathcal{A}} X^TAY$, where $Y$ is an independent copy of $X$ (see [1]).
2. As mentioned in the Introduction, inequalities similar to (3) have been proven by many authors under various sets of assumptions. In particular Borell [8] and Arcones-Giné [4] obtained inequalities for Banach space valued polynomials in Gaussian random variables. When specialised to quadratic forms, these inequalities give an upper bound on $\mathbb{P}(\sup_{A\in\mathcal{A}}|X^TAX| \ge M + t)$, where $M$ is a certain quantile of $\sup_{A\in\mathcal{A}}|X^TAX|$. The proofs are based on the Gaussian isoperimetric inequality. We do not see how to adapt their arguments to get concentration around the mean rather than deviation above a multiple of the mean. Talagrand [30] proved a concentration inequality for suprema of quadratic forms in Rademacher variables, which via the Central Limit Theorem implies the concentration inequality in the Gaussian case. The upper bound in Talagrand's inequality was later generalized to higher order forms by Boucheron, Bousquet, Lugosi and Massart [9].
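The bound $\|\operatorname{Cov} X\| \le 2K^2$ quoted above follows by integrating the tail of a linear functional; since the computation is short, we spell it out here (a standard argument, filled in for convenience):

```latex
% For a unit vector u, the map x -> <u, x> is 1-Lipschitz and convex,
% and E<u, X> = 0, so the convex concentration property gives
% P(|<u, X>| >= t) <= 2 exp(-t^2/K^2). Hence
\langle \operatorname{Cov}(X)u, u\rangle
  = \mathbb{E}\langle u, X\rangle^2
  = \int_0^\infty 2t\,\mathbb{P}\bigl(|\langle u, X\rangle| \ge t\bigr)\,dt
  \le \int_0^\infty 4t\,e^{-t^2/K^2}\,dt
  = 2K^2 .
% Taking the supremum over unit vectors u gives ||Cov X|| <= 2K^2.
```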

Proofs of the main results
In what follows the letter C will denote an absolute constant, the value of which may change between various occurrences (even in the same line).
In the proofs we will need the following standard lemmas.
Lemma 3.1. Assume that a random variable $Z$ satisfies
$$\mathbb{P}\big(|Z - \mathbb{E}Z| \ge t\big) \le 2\exp(-t^2/K^2)$$
for all $t > 0$. Consider $p \in (0,1)$ and let $q_Z^p = \inf\{t \in \mathbb{R} : \mathbb{P}(Z \le t) \ge p\}$ be the smallest $p$-th quantile of $Z$. Then $q_Z^p \ge \mathbb{E}Z - K\sqrt{\log(2/p)}$.

Proof. Assume that $q_Z^p < \mathbb{E}Z - K\sqrt{\log(2/p)}$. Then
$$\mathbb{P}(Z \le q_Z^p) \le \mathbb{P}\big(Z < \mathbb{E}Z - K\sqrt{\log(2/p)}\big) < 2\exp\big(-\log(2/p)\big) = p,$$
which contradicts the standard inequality $\mathbb{P}(Z \le q_Z^p) \ge p$.
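As a quick numerical check of Lemma 3.1 (an illustration added here, using only the standard library): for $Z \sim N(0,1)$ the two-sided tail bound holds with $K = \sqrt{2}$, and the exact $1/4$-quantile indeed dominates $\mathbb{E}Z - K\sqrt{\log 8}$.

```python
import math
from statistics import NormalDist

# For Z ~ N(0,1) we have P(|Z - EZ| >= t) <= 2 exp(-t^2/2),
# i.e. the hypothesis of Lemma 3.1 holds with K = sqrt(2).
K = math.sqrt(2.0)
p = 0.25

# Exact p-quantile of the standard normal distribution.
q_p = NormalDist().inv_cdf(p)

# Lemma 3.1 predicts q_p >= EZ - K * sqrt(log(2/p)) = -sqrt(2 log 8).
lower_bound = 0.0 - K * math.sqrt(math.log(2.0 / p))
assert q_p >= lower_bound
```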

Lemma 3.2. Assume that a random variable $Z$ satisfies
$$\mathbb{P}\big(|Z - \operatorname{Med} Z| \ge t\big) \le 2\exp(-t^2/a^2) + 2\exp(-t/b)$$
for all $t > 0$, where $\operatorname{Med} Z$ is a median of $Z$. Then for some absolute constant $C$ and all $t > 0$,
$$\mathbb{P}\big(|Z - \mathbb{E}Z| \ge t\big) \le C\exp\Big(-\frac{1}{C}\min\Big(\frac{t^2}{a^2}, \frac{t}{b}\Big)\Big). \qquad (4)$$

Proof. We have
$$|\mathbb{E}Z - \operatorname{Med} Z| \le \mathbb{E}|Z - \operatorname{Med} Z| = \int_0^\infty \mathbb{P}\big(|Z - \operatorname{Med} Z| \ge t\big)\,dt \le \int_0^\infty \big(2e^{-t^2/a^2} + 2e^{-t/b}\big)\,dt = \sqrt{\pi}a + 2b.$$
Thus for $t > 2\sqrt{\pi}a + 4b$, we have
$$\mathbb{P}\big(|Z - \mathbb{E}Z| \ge t\big) \le \mathbb{P}\big(|Z - \operatorname{Med} Z| \ge t/2\big) \le 2\exp(-t^2/4a^2) + 2\exp(-t/2b).$$
On the other hand, there exists an absolute constant $C$, such that for $t \le 2\sqrt{\pi}a + 4b$,
$$C\exp\Big(-\frac{1}{C}\min\Big(\frac{t^2}{a^2}, \frac{t}{b}\Big)\Big) \ge 1,$$
which implies that (4) is trivially satisfied. This ends the proof of the lemma.
Another simple fact we will need is

Lemma 3.3. Let $S$ and $Z$ be random variables and let $a, b, t > 0$ be such that for all $s > 0$,
$$\mathbb{P}\big(|S - \mathbb{E}S| \ge s\big) \le 2\exp(-s^2/a^2) \qquad (5)$$
and $\mathbb{P}(S \ne Z) \le 2\exp(-t/b)$. Then, for an absolute constant $C$ (one can take $C = 12$),
$$\mathbb{P}\big(|Z - \operatorname{Med} Z| \ge t\big) \le 2\exp(-t^2/Ca^2) + 2\exp(-t/Cb).$$

Proof. Assume first that $t > \max(3b, 2a\sqrt{\log 8})$. We then have $\mathbb{P}(S \ne Z) \le 1/4$ and so $\mathbb{P}(S \le \operatorname{Med} Z) \ge 1/4$, which means that $\operatorname{Med} Z \ge q_S^{1/4}$, where $q_S^p = \inf\{t : \mathbb{P}(S \le t) \ge p\}$. By Lemma 3.1, $\operatorname{Med} Z \ge q_S^{1/4} \ge \mathbb{E}S - a\sqrt{\log 8}$ and thus
$$\operatorname{Med} Z \ge \mathbb{E}S - t/2. \qquad (6)$$
Using (5) with $s = t/2$ and (6), we obtain
$$\mathbb{P}\big(Z \ge \operatorname{Med} Z + t\big) \le \mathbb{P}\big(S \ge \mathbb{E}S + t/2\big) + \mathbb{P}(S \ne Z) \le 2\exp(-t^2/4a^2) + 2\exp(-t/b).$$
Similarly, by replacing $S, Z$ with $-S, -Z$ and using the fact that $-\operatorname{Med} Z$ is a median of $-Z$, we obtain
$$\mathbb{P}\big(Z \le \operatorname{Med} Z - t\big) \le 2\exp(-t^2/4a^2) + 2\exp(-t/b),$$
and summing the two tails gives the asserted inequality in this range of $t$ by simple calculations. This ends the proof in the case $t > \max(3b, 2a\sqrt{\log 8})$. Note that for $t \le \max(3b, 2a\sqrt{\log 8})$, the right-hand side of the asserted inequality exceeds one, so it holds trivially.

Proof of Theorem 2.3. Since $X^TAX = X^T(\frac{1}{2}(A + A^T))X$, we can assume that $A$ is symmetric. Thus there exists an orthogonal matrix $U$, such that $D = UAU^T$ is a diagonal matrix, with diagonal entries $\lambda_1, \ldots, \lambda_n$. Let $Y = UX$ and note that $Y$ also has the convex concentration property with constant $K$. Moreover $X^TAX = Y^TDY$. Thus our goal is to prove that for $t > 0$,
$$\mathbb{P}\big(|Y^TDY - \mathbb{E}Y^TDY| \ge t\big) \le 2\exp\Big(-\frac{1}{C}\min\Big(\frac{t^2}{K^4\|A\|_{HS}^2}, \frac{t}{K^2\|A\|}\Big)\Big). \qquad (7)$$
Observe that $\|A\|_{HS}^2 = \sum_{i\le n}\lambda_i^2$ and $\|A\| = \max_{i\le n}|\lambda_i|$.
We can write $Y^TDY = \sum_{i\le n}\lambda_i^+Y_i^2 - \sum_{i\le n}\lambda_i^-Y_i^2$, where $\lambda_i^+ = \max(\lambda_i, 0)$ and $\lambda_i^- = \max(-\lambda_i, 0)$, and thus, by the triangle inequality, to demonstrate the theorem it is enough to prove that for every sequence $\mu_1, \ldots, \mu_n$ of nonnegative numbers, we have
$$\mathbb{P}\Big(\Big|\sum_{i\le n}\mu_i^2Y_i^2 - \mathbb{E}\sum_{i\le n}\mu_i^2Y_i^2\Big| \ge t\Big) \le 2\exp\Big(-\frac{1}{C}\min\Big(\frac{t^2}{K^4\sum_{i\le n}\mu_i^4}, \frac{t}{K^2\max_{i\le n}\mu_i^2}\Big)\Big).$$
Note that for any unit vector $u$, $\langle u, X\rangle$ is a 1-Lipschitz convex function of $X$. Since we also have $\mathbb{E}\langle u, X\rangle = 0$, by the convex concentration property, we get
$$\mathbb{P}\big(|\langle u, X\rangle| \ge s\big) \le 2\exp(-s^2/K^2) \quad\text{for all } s > 0, \quad\text{and thus}\quad \mathbb{E}\langle u, X\rangle^2 \le 2K^2.$$
This shows that $\|\operatorname{Cov} X\| \le 2K^2$. Moreover $Y_i = \langle u_i, X\rangle$, where $u_i$ is the $i$-th row of $U$. Since $u_i$ is a unit vector, we get in particular
$$\mathbb{E}Y_i^2 \le 2K^2 \quad\text{for } i \le n. \qquad (8)$$
Let $\varphi(y) = \sum_{i=1}^n \mu_i^2y_i^2$ and note that $\nabla\varphi(y) = (2\mu_1^2y_1, \ldots, 2\mu_n^2y_n)$. Define
$$B = \Big\{y \in \mathbb{R}^n : \Big(\sum_{i\le n}\mu_i^4y_i^2\Big)^{1/2} \le \mathbb{E}\Big(\sum_{i\le n}\mu_i^4Y_i^2\Big)^{1/2} + s\Big\}$$
for a parameter $s > 0$ to be chosen later, and set $M = 2\big(\mathbb{E}(\sum_{i\le n}\mu_i^4Y_i^2)^{1/2} + s\big)$, so that $|\nabla\varphi(y)| \le M$ for $y \in B$. By the convex concentration property of $Y$ and the fact that the function $y \mapsto (\sum_{i\le n}\mu_i^4y_i^2)^{1/2}$ is convex and $(\max_{i\le n}\mu_i^2)$-Lipschitz, we get
$$\mathbb{P}(Y \notin B) \le 2\exp\Big(-\frac{s^2}{K^2\max_{i\le n}\mu_i^4}\Big). \qquad (9)$$
Define now a new function $f \colon \mathbb{R}^n \to \mathbb{R}$ with the formula
$$f(y) = \sup_{z\in B}\big(\varphi(z) + \langle\nabla\varphi(z), y - z\rangle\big).$$
Note that $f$ is a convex function (as a supremum of affine functions); moreover, for $y, z \in \mathbb{R}^n$,
$$|f(y) - f(z)| \le \sup_{w\in B}|\nabla\varphi(w)|\,|y - z| \le M|y - z|.$$
Thus $f$ is convex and $M$-Lipschitz and so for all $s' > 0$,
$$\mathbb{P}\big(|f(Y) - \mathbb{E}f(Y)| \ge s'\big) \le 2\exp\Big(-\frac{s'^2}{K^2M^2}\Big). \qquad (10)$$
Moreover, by convexity of $\varphi$, we have $f(y) \le \varphi(y)$, and for $y \in B$, we have $f(y) = \varphi(y)$. Thanks to (9) and (10) we can now apply Lemma 3.3 with $S = f(Y)$ and $Z = \varphi(Y)$ (for an appropriate choice of the parameter $s$, depending on $t$), which, together with the bound
$$\mathbb{E}\Big(\sum_{i\le n}\mu_i^4Y_i^2\Big)^{1/2} \le \Big(\sum_{i\le n}\mu_i^4\,\mathbb{E}Y_i^2\Big)^{1/2} \le \sqrt{2}K\Big(\sum_{i\le n}\mu_i^4\Big)^{1/2},$$
where in the last inequality we used (8), yields a concentration inequality for $\varphi(Y)$ around its median with mixed sub-Gaussian and sub-exponential tails. Since the resulting inequality holds for arbitrary $t > 0$, Lemma 3.2 gives (7), which ends the proof.
Proof of Theorem 2.4. By the boundedness assumption on the set $\mathcal{A}$ and the integrability assumption on $X$ we can assume that the set $\mathcal{A}$ is finite. Let thus $\mathcal{A} = \{A^{(1)}, \ldots, A^{(m)}\}$, where $A^{(k)} = [a^{(k)}_{ij}]_{i,j\le n}$. Denote also $a^{(k)} = \mathbb{E}X^TA^{(k)}X$ and define the function $f \colon \mathbb{R}^n \to \mathbb{R}$ with the formula
$$f(x) = \max_{k\le m}\big(x^TA^{(k)}x - a^{(k)}\big). \qquad (11)$$
Note that $f$ is locally Lipschitz; moreover, as the set of roots of a non-zero multivariate polynomial is of Lebesgue measure zero, for every $x$ outside a set of Lebesgue measure zero there exists a unique $k \le m$, such that $f(x) = x^TA^{(k)}x - a^{(k)}$. For $k \le m$ let $B_k$ be the set of points $x \in \mathbb{R}^n$ such that $k$ is the unique maximizer in (11). Then $\mathbb{R}^n \setminus (\bigcup_{k\le m} B_k)$ has Lebesgue measure equal to zero; moreover, the sets $B_k$ are open.
For $x \in B_k$, $f$ is differentiable with $\nabla f(x) = (A^{(k)} + A^{(k)T})x$, and consequently
$$|\nabla f(x)| \le \max_{A\in\mathcal{A}}|(A + A^T)x|.$$
Let now $B = \{x \in \mathbb{R}^n : \max_{A\in\mathcal{A}}|(A + A^T)x| < X_{\mathcal{A}} + t\max_{A\in\mathcal{A}}\|A\|\}$, where $X_{\mathcal{A}} = \mathbb{E}\max_{A\in\mathcal{A}}|(A + A^T)X|$, and note that $B$ is an open convex set. Let $\lambda_k$ denote the Lebesgue measure on $\mathbb{R}^k$. By the Fubini theorem, the preceding discussion concerning the differentiability of $f$, the definition of the set $B$ and its convexity, for $\lambda_{2n}$ almost all pairs $(x,y) \in B \times B$ we have
$$\lambda_1\Big(\Big\{s \in [0,1] : sx + (1-s)y \notin \bigcup_{k\le m}B_k\Big\}\Big) = 0.$$
Since $s \mapsto f(sx + (1-s)y)$ is locally Lipschitz and thus absolutely continuous, we have for such $x, y$,
$$|f(x) - f(y)| \le \sup_{z\in B}|\nabla f(z)|\,|x - y| \le M|x - y|.$$
By continuity and density arguments, the above inequality clearly extends to all $x, y \in B$, allowing us to conclude that $f$ is $M$-Lipschitz on $B$ with $M = X_{\mathcal{A}} + t\max_{A\in\mathcal{A}}\|A\|$. Let now $g \colon \mathbb{R}^n \to \mathbb{R}$ be any $M$-Lipschitz function which coincides with $f$ on $B$ (it exists by McShane's lemma, see e.g. Lemma 7.3 in [22]). By the concentration property of $X$ we have for all $s > 0$, $\mathbb{P}(|g(X) - \mathbb{E}g(X)| \ge s) \le 2\exp(-s^2/K^2M^2)$ and
$$\mathbb{P}(X \notin B) \le 2\exp\Big(-\frac{(t\max_{A\in\mathcal{A}}\|A\|)^2}{K^2(2\max_{A\in\mathcal{A}}\|A\|)^2}\Big) = 2\exp(-t^2/4K^2),$$
where we used that the function $x \mapsto \max_{A\in\mathcal{A}}|(A + A^T)x|$ has Lipschitz constant bounded by $\max_{A\in\mathcal{A}}\|A + A^T\| \le 2\max_{A\in\mathcal{A}}\|A\|$. Thus, Lemma 3.3 applied with $S = g(X)$ and $Z = f(X)$ yields a concentration inequality for $f(X)$ around its median with mixed sub-Gaussian and sub-exponential tails. Since the resulting inequality holds for arbitrary $t > 0$, we can use Lemma 3.2 to complete the proof.

Application. Concentration inequalities for the empirical covariance operator
Let us conclude with an application of Theorem 2.4 in the Gaussian setting, by providing a new proof of the concentration inequality for empirical approximations of the covariance operator of a Banach space valued random variable, proved recently in [15] by other methods. Since this part serves mostly as an illustration of applicability of Theorem 2.4, we do not present the general setting and motivation for this type of results, referring the Reader to the original paper [15].
In the formulation of the following theorem we use $\|\cdot\|$ to denote both the norm of a vector in a Banach space and the operator norm.

Theorem 4.1. Let $G$ be a mean zero Gaussian random variable with values in a separable Banach space $E$, with covariance operator $\Sigma \colon E^* \to E$. Let $G_1, \ldots, G_n$ be i.i.d. copies of $G$ and define $\hat{\Sigma} \colon E^* \to E$ with the formula
$$\hat{\Sigma}u = \frac{1}{n}\sum_{i=1}^n \langle u, G_i\rangle G_i.$$
Then, for any $t \ge 1$, with probability at least $1 - e^{-t}$,
$$\big|\|\hat{\Sigma} - \Sigma\| - \mathbb{E}\|\hat{\Sigma} - \Sigma\|\big| \le C\|\Sigma\|\Big(\Big(1 + \sqrt{\frac{r(\Sigma)}{n}}\Big)\sqrt{\frac{t}{n}} + \frac{t}{n}\Big),$$
where
$$r(\Sigma) = \frac{(\mathbb{E}\|G\|)^2}{\|\Sigma\|}.$$

Proof. By the Karhunen-Loève theorem, there exists a sequence $x_j \in E$, such that almost surely
$$G = \sum_{j=1}^\infty g_jx_j,$$
where the $g_j$ are i.i.d. standard Gaussian variables. Let $\{g_{ij}\}_{1\le i\le n, j\in\mathbb{N}}$ be an array of i.i.d. standard Gaussian variables. We can assume that
$$G_i = \sum_{j=1}^\infty g_{ij}x_j, \quad i = 1, \ldots, n.$$
Therefore, denoting by $B^*$ the unit ball of $E^*$, we get
$$\|\hat{\Sigma} - \Sigma\| = \sup_{u,v\in B^*}\Big(\frac{1}{n}\sum_{k=1}^n\langle u, G_k\rangle\langle v, G_k\rangle - \mathbb{E}\langle u, G\rangle\langle v, G\rangle\Big),$$
which puts us in position to use Theorem 2.4
with $\mathcal{A} = \{A_{u,v} : u, v \in B^*\}$ and $X = (g_{ki})_{k\le n, i<\infty}$, where $A_{u,v}$ is the block diagonal matrix with $n$ identical blocks of the form $\frac{1}{n}(\langle x_i, u\rangle)_{i=1}^\infty \otimes (\langle x_j, v\rangle)_{j=1}^\infty$ (we skip the standard details of approximation by finite dimensional vectors). Let us estimate the parameters of Theorem 2.4. Using the fact that each $A \in \mathcal{A}$ is a block matrix with blocks of the form $\frac{1}{n}(\langle x_i, u\rangle)_{i=1}^\infty \otimes (\langle x_j, v\rangle)_{j=1}^\infty$, one easily gets that
$$\sup_{A\in\mathcal{A}}\|A\| \le \frac{1}{n}\sup_{u\in B^*}\sum_{i=1}^\infty\langle x_i, u\rangle^2 = \frac{\|\Sigma\|}{n}.$$
Passing to $X_{\mathcal{A}}$, we have
$$\mathbb{E}\sup_{A\in\mathcal{A}}|(A + A^T)X| \le \mathbb{E}\sup_{A\in\mathcal{A}}|AX| + \mathbb{E}\sup_{A\in\mathcal{A}}|A^TX|. \qquad (12)$$
Now,
$$\mathbb{E}\sup_{A\in\mathcal{A}}|A^TX| = \frac{1}{n}\mathbb{E}\sup_{u,v\in B^*}\Big(\sum_{i=1}^\infty\langle x_i, u\rangle^2\Big)^{1/2}\Big(\sum_{k=1}^n\Big(\sum_{i=1}^\infty\langle x_i, v\rangle g_{ki}\Big)^2\Big)^{1/2} \le \frac{\|\Sigma\|^{1/2}}{n}\mathbb{E}\sup_{v\in B^*}\Big(\sum_{k=1}^n\langle v, G_k\rangle^2\Big)^{1/2}. \qquad (13)$$
To bound the last expectation, we can use the Gordon-Chevet inequality [10,11], which asserts that for any Banach spaces $E$, $F$ and points $x_i \in E$, $y_k \in F$, the random operator $\Gamma = \sum_{k,i}g_{ki}\,x_i\otimes y_k \colon E^* \to F$ satisfies
$$\mathbb{E}\|\Gamma\| \le \sup_{u\in B_{E^*}}\Big(\sum_i\langle x_i, u\rangle^2\Big)^{1/2}\,\mathbb{E}\Big\|\sum_k g_ky_k\Big\| + \sup_{v\in B_{F^*}}\Big(\sum_k\langle y_k, v\rangle^2\Big)^{1/2}\,\mathbb{E}\Big\|\sum_i g_ix_i\Big\|,$$
where the $g_i$'s are i.i.d. standard Gaussian variables. Applying this inequality with $\Gamma = \sum_{k,i}g_{ki}\,x_i\otimes y_k \colon E^* \to \ell_2^n$, where $y_1, \ldots, y_n$ is the standard basis of $\ell_2^n$, we get
$$\mathbb{E}\sup_{v\in B^*}\Big(\sum_{k=1}^n\langle v, G_k\rangle^2\Big)^{1/2} \le \|\Sigma\|^{1/2}\sqrt{n} + \mathbb{E}\|G\|.$$
Going back to (13), we get
$$\mathbb{E}\sup_{A\in\mathcal{A}}|A^TX| \le \frac{\|\Sigma\|}{\sqrt{n}} + \frac{\|\Sigma\|^{1/2}\,\mathbb{E}\|G\|}{n}.$$
By symmetry, an analogous bound holds for the other expectation on the right-hand side of (12), hence
$$\mathbb{E}\sup_{A\in\mathcal{A}}|(A + A^T)X| \le 2\Big(\frac{\|\Sigma\|}{\sqrt{n}} + \frac{\|\Sigma\|^{1/2}\,\mathbb{E}\|G\|}{n}\Big).$$
Combining this with the estimate on $\sup_{A\in\mathcal{A}}\|A\|$ and Theorem 2.4, we obtain for $t \ge 1$ the asserted inequality, which ends the proof.
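To see how the estimates above combine (a sketch filled in here for the reader's convenience; the constants are not those of the original), one can track the two regimes of Theorem 2.4 and recognize the effective rank:

```latex
% With sup_A ||A|| <= ||Sigma||/n and
% E sup_A |(A+A^T)X| <= 2(||Sigma||/sqrt(n) + ||Sigma||^{1/2} E||G||/n),
% Theorem 2.4 gives, with probability at least 1 - e^{-t},
\bigl|\,\|\hat\Sigma - \Sigma\| - \mathbb{E}\|\hat\Sigma - \Sigma\|\,\bigr|
  \le C\Bigl(\Bigl(\frac{\|\Sigma\|}{\sqrt{n}}
      + \frac{\|\Sigma\|^{1/2}\,\mathbb{E}\|G\|}{n}\Bigr)\sqrt{t}
      + \frac{\|\Sigma\|}{n}\,t\Bigr)
  = C\|\Sigma\|\Bigl(\Bigl(1 + \sqrt{\tfrac{r(\Sigma)}{n}}\Bigr)
      \sqrt{\tfrac{t}{n}} + \frac{t}{n}\Bigr),
% where r(Sigma) = (E||G||)^2 / ||Sigma|| is the effective rank,
% recovering the Koltchinskii-Lounici concentration inequality.
```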