A note on concentration for polynomials in the Ising model

We present precise multilevel exponential concentration inequalities for polynomials in Ising models satisfying the Dobrushin condition. The estimates have the same form as two-sided tail estimates for polynomials in Gaussian variables due to Latała. In particular, for quadratic forms we obtain a Hanson-Wright type inequality. We also prove concentration results for convex functions and estimates for nonnegative definite quadratic forms, analogous to those for quadratic forms in i.i.d. Rademacher variables, for more general random vectors satisfying the approximate tensorization property for entropy.


1. Introduction
Since its introduction in [25], the Ising model has been a source of numerous mathematical questions. In addition to its physical importance, it is appealing to mathematicians, providing an easy to formulate yet challenging model of dependent random variables and serving as a testing ground for many probabilistic ideas. Recently, in the context of finite graphs, the Ising model has also attracted the attention of statisticians and theoretical computer scientists interested, e.g., in estimating the parameters of the model, learning the underlying graph structure, or testing some properties of the model in a computationally efficient way (see e.g., [15,36,37,16]). In particular, in the last two decades several authors have studied the Ising model from the point of view of concentration of measure phenomena, see e.g., [30,12,29,13]. While most effort has been devoted to concentration inequalities for functions satisfying appropriate Lipschitz type conditions, several recent papers concern variance bounds or stronger, exponential concentration inequalities for polynomials, to mention the work by Daskalakis, Dikkala and Kamath [15,14], Gheissari, Lubetzky and Peres [19], and Götze, Sambale and Sinulis [21]. Motivation for these developments ranged from statistical and algorithmic (efficient discrimination between samples drawn from an Ising model and i.i.d. samples) to purely probabilistic ones (searching for counterparts of inequalities known in the i.i.d. case).
In this note we complement the results proved in the aforementioned papers by obtaining exponential inequalities for polynomials of the same form as the bounds for polynomials in independent Gaussian (or, more generally, subgaussian) random variables, which were introduced originally by Latała [27] and subsequently studied, e.g., by Adamczak and Wolff [5]. Such inequalities are expressed in terms of appropriate injective tensor product norms of averaged derivatives of the polynomials in question and in the Gaussian case are known to be optimal up to constants depending only on the degree of the polynomial. Optimality is understood here in a strong sense: the inequalities can be reversed up to constants. Moreover, they are known to imply other, more classical inequalities for multilinear forms, such as Bonami-Nelson inequalities [10,33]. In particular, for polynomials of degree $d$ they provide multilevel type concentration, ranging from $\exp(-ct^2)$ for small values of $t$ to $\exp(-c't^{2/d})$ for larger values (as opposed to the inequalities from the aforementioned results for the Ising model, which do not yield precise multilevel concentration but rather give weaker bounds of the form $\exp(-c''t^{2/d})$ for all $t$). As a consequence, our estimates imply the previous ones and provide a more accurate description of the tail behavior.
Our approach is similar to the one by Götze, Sambale and Sinulis in that it builds on general Aida-Stroock type moment estimates which those authors obtained for the Ising model in [21], and uses them as a tool in an inductive argument. However, the details are different. While in [21] one works with moments of Euclidean norms of discrete iterated derivatives of multilinear forms, we adapt an argument from [5], linearizing the Euclidean norms with an auxiliary Gaussian sequence, which allows us to treat general functions (seen by the Fourier-Walsh theory as tetrahedral polynomials) and pass from discrete gradients to classical derivatives.
The argument we present may be seen as a method of reduction of concentration properties for polynomials from the Ising model to the i.i.d. Gaussian case. Since the random variables considered in the Ising model take only values ±1, one could expect a similar reduction to polynomials in i.i.d. Rademacher variables. We are able to obtain such estimates for positive definite quadratic forms, by passing through concentration properties for convex functions which are of independent interest.
The organization of the article is as follows. First, in Section 2 we present our main result (Theorem 2.2) and discuss its relation with known inequalities for the Ising model as well as with the estimates for the i.i.d. case. Next, in Section 3 we discuss the approximate tensorization of entropy (as studied recently by Marton [31] and Caputo, Menz, Tetali [11]) and Aida-Stroock type moment estimates obtained by Götze, Sambale and Sinulis [21]. Using these tools, in Section 4 we present the proof of the main result. The final Section 5 presents estimates for convex functions and Rademacher-type inequalities for quadratic forms.
2. Gaussian type inequality for polynomials
2.1. Basic definitions and notation. Let us begin by introducing the general form of the Ising model on a finite set.
Definition 2.1 (Ising model). Let $n$ be a positive integer and let $\mu$ be the measure on $\{-1,1\}^n$ having density with respect to the uniform distribution of the form
$$\mu(\sigma) = \frac{1}{Z}\exp\Big(\frac{1}{2}\sum_{i,j=1}^{n} J_{ij}\sigma_i\sigma_j + \sum_{i=1}^{n} h_i\sigma_i\Big) \tag{2.1}$$
for any $\sigma \in \{-1,1\}^n$, where $J = (J_{ij})_{i,j\le n}$ is a symmetric matrix with vanishing diagonal, $h = (h_i)_{i\le n} \in \mathbb{R}^n$ and $Z$ is a normalizing constant.
In physical terms the coupling matrix $J$ corresponds to interactions between particles and the vector $h$ describes an external field. The order of magnitude of the constants $J_{ij}$ reflects the temperature (the higher the temperature, the smaller the coefficients, which corresponds to weaker interactions). However, as our results will be expressed solely in terms of the coefficients $J_{ij}$ and $h_i$, we will not incorporate the temperature into the notation.
To obtain concentration inequalities, one needs some control over the coupling constants and the external field, which will allow for sufficient proximity to the i.i.d. case. The conditions we will impose on the model are classical and in the context of concentration of measure appeared already in [31,19,21].
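Main assumptions. We will assume that
$$|h|_\infty = \max_{i\le n}|h_i| \le \alpha \tag{2.2}$$
and the coupling constants satisfy Dobrushin's condition
$$\max_{i\le n}\sum_{j\le n}|J_{ij}| \le 1-\rho \tag{2.3}$$
for some $\rho > 0$.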
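As a small numerical illustration of Definition 2.1 and the main assumptions, the following sketch enumerates the measure by brute force for small $n$. It assumes the normalization $\mu(\sigma) \propto \exp(\frac12\sum_{i,j} J_{ij}\sigma_i\sigma_j + \sum_i h_i\sigma_i)$ and reads Dobrushin's condition as $\max_i \sum_j |J_{ij}| \le 1-\rho$; all function names are hypothetical and chosen only for this example.

```python
import itertools
import math

def dobrushin_rho(J):
    """Return 1 - max_i sum_j |J_ij|; the (assumed) condition holds if this is > 0."""
    return 1.0 - max(sum(abs(v) for v in row) for row in J)

def field_bound(h):
    """Return |h|_infty, the smallest admissible alpha."""
    return max(abs(v) for v in h)

def ising_probabilities(J, h):
    """Brute-force the Ising measure on {-1, 1}^n (only feasible for small n)."""
    n = len(h)
    weights = {}
    for sigma in itertools.product((-1, 1), repeat=n):
        # Assumed convention: the pair term carries the factor 1/2.
        pair = 0.5 * sum(J[i][j] * sigma[i] * sigma[j]
                         for i in range(n) for j in range(n))
        field = sum(h[i] * sigma[i] for i in range(n))
        weights[sigma] = math.exp(pair + field)
    Z = sum(weights.values())  # normalizing constant
    return {sigma: w / Z for sigma, w in weights.items()}

# Nearest-neighbour couplings on three sites, no external field.
J = [[0.0, 1 / 3, 0.0],
     [1 / 3, 0.0, 1 / 3],
     [0.0, 1 / 3, 0.0]]
h = [0.0, 0.0, 0.0]
mu = ising_probabilities(J, h)
rho = dobrushin_rho(J)
```

With $h = 0$ the computed measure is symmetric under $\sigma \mapsto -\sigma$, which is the fact behind the equality $EX_i = 0$ used in the examples of this section.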
In order to formulate concentration of measure estimates for polynomials of the Ising model, which correspond to inequalities obtained by Latała for polynomials in independent Gaussian random variables, we will need to introduce a family of injective tensor product norms on d-index matrices (d-tensors).
To provide transparent notation for multi-indices we will use the following convention. For a positive integer $n$ we will denote $[n] = \{1,\ldots,n\}$. The cardinality of a set $I$ will be denoted by $|I|$. For $\mathbf{i} = (i_1,\ldots,i_d) \in [n]^d$ and $I \subseteq [d]$ we write $\mathbf{i}_I = (i_k)_{k\in I}$. We will also denote $|\mathbf{i}| = \max_{j\le d} i_j$ and $|\mathbf{i}_I| = \max_{j\in I} i_j$. We will often deal with homogeneous polynomials, defined in terms of multi-indexed matrices (tensors). We will say that a $d$-indexed matrix $A = (a_{\mathbf{i}})_{\mathbf{i}\in[n]^d}$ is symmetric if for every permutation $\sigma$ of the set $[d]$ and every $\mathbf{i} = (i_1,\ldots,i_d) \in [n]^d$, we have $a_{\mathbf{i}} = a_{i_{\sigma(1)},\ldots,i_{\sigma(d)}}$. When $d$ is fixed, we will write simply $A = (a_{\mathbf{i}})_{|\mathbf{i}|\le n}$. We will say that a $d$-indexed matrix $A = (a_{\mathbf{i}})_{|\mathbf{i}|\le n}$ has vanishing generalized diagonals if $a_{\mathbf{i}} = 0$ for all $\mathbf{i} = (i_1,\ldots,i_d)$ such that there exist $k \neq l$ with $i_k = i_l$.
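Let now $P_d$ be the set of partitions of $[d]$ into nonempty, pairwise disjoint sets. For a partition $\mathcal{I} = \{I_1,\ldots,I_k\} \in P_d$ and a $d$-indexed matrix $A = (a_{\mathbf{i}})_{\mathbf{i}\in[n]^d}$, define
$$\|A\|_{\mathcal{I}} = \sup\Big\{\sum_{\mathbf{i}\in[n]^d} a_{\mathbf{i}}\prod_{l=1}^{k} x^{(l)}_{\mathbf{i}_{I_l}} \colon \sum_{\mathbf{i}_{I_l}}\big(x^{(l)}_{\mathbf{i}_{I_l}}\big)^2 \le 1,\ 1\le l\le k\Big\}.$$
Thus, e.g., for $d = 2$,
$$\|A\|_{\{1,2\}} = \sup\Big\{\sum_{i,j\le n} a_{ij}x_{ij} \colon \sum_{i,j\le n} x_{ij}^2 \le 1\Big\}, \qquad \|A\|_{\{1\}\{2\}} = \sup\Big\{\sum_{i,j\le n} a_{ij}x_iy_j \colon \sum_{i\le n} x_i^2 \le 1,\ \sum_{j\le n} y_j^2 \le 1\Big\}.$$
Note that for simplicity of notation we skip the outer brackets in the subscript and write, e.g., $\|\cdot\|_{\{1,2\}\{3\}}$ instead of $\|\cdot\|_{\{\{1,2\},\{3\}\}}$. In particular, for $d = 2$, $\|\cdot\|_{\{1,2\}}$ and $\|\cdot\|_{\{1\}\{2\}}$ coincide with the Hilbert-Schmidt and operator norm of a matrix respectively. Note that for every $d$ and $\mathcal{I} \in P_d$ we have $\|A\|_{\mathcal{I}} \le \|A\|_{\{1,\ldots,d\}} = \big(\sum_{\mathbf{i}} a_{\mathbf{i}}^2\big)^{1/2}$, so the norm $\|\cdot\|_{\{1,\ldots,d\}}$ can be considered a counterpart of the Hilbert-Schmidt norm for higher order tensors.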
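For $d = 2$ the two partition norms can be computed directly, which gives a quick way to see how far apart they can be. The sketch below is only an illustration: the function names are hypothetical, and the operator norm is approximated by power iteration on $A^TA$ from a generic starting vector (a numerical shortcut, not a method used in this note).

```python
import math

def hilbert_schmidt_norm(A):
    """||A||_{1,2}: square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def operator_norm(A, iterations=500):
    """Approximate ||A||_{1}{2} = sup{sum a_ij x_i y_j : |x|, |y| <= 1}."""
    n_rows, n_cols = len(A), len(A[0])
    y = [1.0 + 0.1 * j for j in range(n_cols)]  # generic starting vector
    for _ in range(iterations):
        x = [sum(A[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)]  # x = A y
        z = [sum(A[i][j] * x[i] for i in range(n_rows)) for j in range(n_cols)]  # z = A^T x
        norm_z = math.sqrt(sum(v * v for v in z))
        if norm_z == 0.0:
            return 0.0
        y = [v / norm_z for v in z]
    x = [sum(A[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(v * v for v in x))  # |A y| for the unit vector y

A = [[3.0, 0.0], [0.0, 4.0]]
hs = hilbert_schmidt_norm(A)
op = operator_norm(A)
```

For the $n \times n$ identity matrix the Hilbert-Schmidt norm equals $\sqrt{n}$ while the operator norm equals 1, which is the kind of gap that makes estimates involving the whole family of norms sharper than estimates in terms of the Hilbert-Schmidt norm alone.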
We will use the standard notation $\|X\|_p = (\mathbb{E}|X|^p)^{1/p}$ for a random variable $X$. Sometimes, when dealing with independent random variables $X$, $Y$, we will use the notation $\mathbb{E}_X$ for the expectation with respect to the variable $X$ (i.e., the conditional expectation given $Y$).
In what follows, we will write e.g., $c_a$, $C_a$ or $c_a(b)$ to denote constants depending only on the parameters $a$ or $a, b$ respectively. The values of such constants may change between occurrences.
By the Fourier-Walsh expansion (see e.g., [34]), every function $f\colon \{-1,1\}^n \to \mathbb{R}$ can be written in a unique way as a tetrahedral polynomial, i.e., a polynomial which is affine with respect to every variable (in particular the degree of the polynomial is at most $n$). Therefore, in what follows we will restrict our attention to this representation. In particular, when we speak about gradients $\nabla f$ or higher order derivatives $\nabla^k f$, we always think of the usual derivatives of the polynomial function on $\mathbb{R}^n$ given by the tetrahedral representation of $f$.
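The tetrahedral representation can be computed explicitly for small $n$. The following sketch uses the standard formula $c_S = 2^{-n}\sum_{x\in\{-1,1\}^n} f(x)\prod_{i\in S}x_i$ for the Fourier-Walsh coefficients and differentiates the resulting polynomial in the classical sense; the function names are hypothetical and the code is meant only as an illustration of the convention adopted above.

```python
import itertools

def walsh_coefficients(f, n):
    """Return {S: c_S} with f(x) = sum_S c_S * prod_{i in S} x_i on {-1, 1}^n."""
    cube = list(itertools.product((-1, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for subset in itertools.combinations(range(n), size):
            total = 0.0
            for x in cube:
                monomial = 1
                for i in subset:
                    monomial *= x[i]
                total += f(x) * monomial
            coeffs[subset] = total / len(cube)
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the tetrahedral polynomial at an arbitrary real point x."""
    value = 0.0
    for subset, c in coeffs.items():
        term = c
        for i in subset:
            term *= x[i]
        value += term
    return value

def partial_derivative(coeffs, i):
    """Coefficients of the classical derivative d/dx_i of the polynomial."""
    return {tuple(j for j in subset if j != i): c
            for subset, c in coeffs.items() if i in subset}

# f(x) = max(x_0, x_1) on {-1, 1}^2, indices are 0-based.
coeffs = walsh_coefficients(lambda x: max(x), 2)
d0 = partial_derivative(coeffs, 0)
```

In this example the expansion is $\frac12 + \frac12 x_0 + \frac12 x_1 - \frac12 x_0x_1$, and the derivative with respect to $x_0$ is the polynomial $\frac12 - \frac12 x_1$, which no longer depends on $x_0$ because the representation is affine in each variable.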

2.2. Main result. The main result of this section is
Theorem 2.2. Let µ be defined by (2.1) and assume that |h| ∞ ≤ α and the Dobrushin condition (2.3) holds. Let X be a random vector distributed according to µ. Then for d ≥ 1 there exist constants c d = c d (α, ρ), such that for any tetrahedral polynomial f : {−1, 1} n → R of degree d, and any t > 0, x i l for some symmetric d-indexed matrix with vanishing generalized diagonals, then for any I ∈ P k we can estimate By (the proof of) Lemma 3.1. in [19], for some new constant c d = c d (α, ρ). This inequality was proved by Götze, Sambale, Sinulis in [21] and earlier, up to some additional logarithmic in n factors in the exponent by Gheissari, Lubetzky and Peres in [19].
Remark 2.4. An inequality analogous to (2.5) for polynomials in Gaussian variables has been obtained in [5]. In this case the inequality can be reversed up to numerical constants depending only on d (in front of and inside the exponent). The proof relied on a reduction to the special case of tetrahedral multilinear forms in independent standard Gaussian variables obtained by Latała [27]. The proof of Theorem 2.2 presented below is a simple adaptation of this idea. The inequality (2.5) is also known to hold for polynomials in i.i.d. subgaussian random variables [5].
Example 2.5. In this example and the following ones we will let c denote a constant which may depend on the parameters α and ρ. Its value may change between occurrences. For f (x) = n i,j=1 a ij x i x j we obtain Note that if h = 0 then EX i = 0 (as the distribution of X is symmetric) and the right hand side simplifies to which gives a counterpart of the Hanson-Wright inequality known for quadratic forms in independent sub-Gaussian random variables [24], which turned out to be useful e.g., in random-matrix theory and statistics (see e.g., [42]). A version of this inequality for strongly mixing Ising models on a lattice was proved by Marton [30]. If one further estimates the operator norm by the Hilbert-Schmidt norm, one obtains a weaker tail bound of the form For quadratic form in independent Rademacher variables such an inequality (together with counterparts for higher order forms) were for the first time established by Bonami [10], Beckner [7] and Gross [23] in the context of hypercontractivity of semigroups. A counterpart of (2.8) for quadratic forms of the Ising model has been recently obtained in [21].
Example 2. 6. Consider now f (x) = 1≤i,j,k≤n a ijk x i x j x k , where A = (a ijk ) i,j,k≤n is symmetric with vanishing generalized diagonals. Assume also that h = 0. In this case Theorem 2.2 gives where we again used the equality EX i = 0.
One can wonder if it is possible to obtain estimates just in terms of the norm A {1,2,3} , e.g., of the form {1,2,3} ) as in the independent case or in the case of quadratic forms discussed above. Clearly, this is true if one can estimate the quantity i,j≤n (EX i X j ) 2 by a constant independent of n. However it turns out that if one assumes only the Dobrushin condition (2.3), then the coefficient i ( jk a ijk EX i X j ) 2 in general cannot be discarded. To see this consider the Ising model on the one-dimensional interval, e.g., with J i,i+1 = J i+1,i = 1/3 for i = 1, . . . , n − 1 and J ij = 0 otherwise. In this case (2.3) is clearly satisfied with ρ = 1/3 and the Hamiltonian is of the form − 1 Since under the uniform measure on the discrete cube, σ 1 and the products σ i σ i+1 , i = 1, . . . , n − 1 are independent Rademacher variables, one can see that under the measure µ given by (2.1), the products In particular E µ σ i σ i+1 = a > 0 is independent of n. Consider now a symmetric 3-indexed matrix A = (a ijk ) i,j,k≤n with vanishing generalized diagonals and coefficients a ijk defined for i < j < k with the formula One can easily see that A {1,2,3} is of order n (as A has O(n 2 ) nonzero coefficients). However, if X is distributed according to µ, then for f (X) = i,j,k≤n a ijk X i X j X k , Var f (X) = f (X) 2 is of order n 3/2 (as can be checked by using the equality EX i X j X k = 0, expanding the product Ef (X) 2 and performing some elementary combinatorics). This shows that an estimate of the form P(|f (X)| ≥ t) ≤ 2 exp(−c(t/ A {1,2,3} ) κ ) cannot hold with c, κ independent of n. Of course, by (2.6) we do have the inequality P(|f (X)| ≥ t) ≤ 2 exp(−ct 2/3 /n). However, (2.9) leads to an improvement of this inequality. As one can easily check A {1,2},{3} is of the same order as A {1,2,3} , i.e., of order n, A {1}{2}{3} is of order √ n and i ( jk a ijk EX j X k ) 2 is of order n 3 . 
Together with some elementary calculations, this gives with some c = c(α, ρ, d). Since Theorem 2.2 applies to general (not necessarily homogeneous) polynomials, it yields a refinement of the above inequality, of the form . Example 2.6 shows that in general one cannot eliminate passing to f A,d , i.e., the above inequality may not hold for the original polynomial f .
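The behaviour of the one-dimensional chain from Example 2.6 can be checked by brute force for small $n$. The sketch below assumes that, with $J_{i,i+1} = J_{i+1,i} = 1/3$, the density is proportional to $\exp\big(\frac13\sum_{k} \sigma_k\sigma_{k+1}\big)$ (that is, the pair term in (2.1) is counted with the factor $\frac12$); under this assumption the products $\sigma_k\sigma_{k+1}$ are independent and $E_\mu\sigma_i\sigma_{i+1} = \tanh(1/3)$ regardless of $n$. The function name and the 0-based indexing are choices made for this example only.

```python
import itertools
import math

def chain_correlation(n, coupling, i, j):
    """E[sigma_i * sigma_j] for the open chain of length n (0-based sites).

    Assumed weight: exp(coupling * sum_k sigma_k * sigma_{k+1}).
    """
    numerator = 0.0
    partition = 0.0
    for sigma in itertools.product((-1, 1), repeat=n):
        energy = sum(sigma[k] * sigma[k + 1] for k in range(n - 1))
        weight = math.exp(coupling * energy)
        partition += weight
        numerator += weight * sigma[i] * sigma[j]
    return numerator / partition

a_short = chain_correlation(5, 1 / 3, 0, 1)   # neighbours in a chain of length 5
a_long = chain_correlation(9, 1 / 3, 4, 5)    # neighbours in a chain of length 9
two_apart = chain_correlation(7, 1 / 3, 2, 4)
```

Both nearest-neighbour values agree with $\tanh(1/3)$, and the correlation at distance two equals $\tanh(1/3)^2$, so the pair correlations $EX_iX_j$ decay geometrically in $|i-j|$ but do not vanish.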

3. Approximate tensorization of entropy and moment estimates
In this section we will present basic tools (coming mostly from the work by Marton [31] and Götze, Sambale, Sinulis [21]) which we will need for the proof of Theorem 2.2 and also in Section 5 to obtain concentration for convex functions and Rademacher-type bounds for quadratic forms.
Let $\mathcal{X} = \prod_{i=1}^{n} \mathcal{X}_i$, where $\mathcal{X}_i$ are Polish spaces with their Borel $\sigma$-fields, and let $\mu$ be a probability distribution on $\mathcal{X}$.
For each I ⊆ [n], and x = ( Let µ I be the marginal of µ corresponding to the coordinates indexed by I and µ I (·|x I ) denote the regular conditional distribution of x I givenx I on the probability space (X , µ). Thus for any Borel set A ⊂ i∈I X i , we have Recall that for a probability measure µ and a nonnegative function f , the entropy of f relative to µ is defined as The following definition will play a crucial part in what follows.
Definition 3.1 (Approximate tensorization of entropy). We will say that µ has the approximate tensorization property with constant C (abbrev. AT (C)) if for every function f : X → [0, ∞), It is well known that product measures satisfy AT (1), see e.g., [28,Proposition 5.6]. Recently Marton [31] (see also [11,21]) proved the following sufficient condition for tensorization of entropy in discrete product spaces.
Let also A = (a ij ) i,j≤n satisfy a ii = 0 for all i and for i = j, for every x, y ∈ X which differ only at the j-th coordinate. Assume moreover that A ℓ n 2 →ℓ n 2 < 1. Then the measure µ hast the approximate tensorization property with constant C = 2 In particular in [21] the Authors verify that under our main assumptions the approximate tensorization property is satisfied by the Ising model. and any x, y ∈ X n differing only at the j-th coordinate where C α,ρ depends only on α and ρ. As a consequence the measure µ satisfies AT (C) with C depending only on ρ and α.
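To make the approximate tensorization property concrete, the following sketch evaluates both sides of the inequality on the discrete cube. It assumes the usual formulation, $\mathrm{Ent}_\mu(f) \le C\sum_{i\le n}\int \mathrm{Ent}_{\mu_i(\cdot|\bar x_i)}\big(f(\bar x_i,\cdot)\big)\,d\mu(x)$ with $\mathrm{Ent}_\mu(f) = \int f\log f\,d\mu - \int f\,d\mu\,\log\int f\,d\mu$, and it expects the measure as a dictionary listing all points of $\{-1,1\}^n$; the function names are hypothetical.

```python
import itertools
import math

def entropy(values, probs):
    """Ent(f) = E[f log f] - E[f] log E[f] for a nonnegative f on a finite space."""
    mean = sum(p * v for v, p in zip(values, probs))
    if mean == 0.0:
        return 0.0
    first = sum(p * v * math.log(v) for v, p in zip(values, probs) if v > 0 and p > 0)
    return first - mean * math.log(mean)

def tensorization_sides(mu, f):
    """Return (Ent_mu(f), sum_i E_mu Ent_{mu_i(.|x_bar_i)}(f)) on {-1, 1}^n.

    mu: dict mapping every point of {-1, 1}^n (as a tuple) to its probability.
    """
    points = list(mu)
    n = len(points[0])
    lhs = entropy([f(x) for x in points], [mu[x] for x in points])
    rhs = 0.0
    for i in range(n):
        for x in points:
            if x[i] != 1:
                continue  # visit each pair {x, x with i-th sign flipped} once
            y = x[:i] + (-1,) + x[i + 1:]
            mass = mu[x] + mu[y]  # probability of the remaining coordinates
            if mass == 0.0:
                continue
            conditional = [mu[x] / mass, mu[y] / mass]
            rhs += mass * entropy([f(x), f(y)], conditional)
    return lhs, rhs

cube = list(itertools.product((-1, 1), repeat=3))
uniform = {x: 1.0 / len(cube) for x in cube}
lhs, rhs = tensorization_sides(uniform, lambda x: math.exp(x[0] + 0.5 * x[1] * x[2]))
```

For the uniform (product) measure the left-hand side never exceeds the right-hand side, in line with $AT(1)$; for a dependent measure the ratio of the two sides for a given $f$ gives a lower bound on any admissible constant $C$.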
We will also need the definition of the discrete gradient on $\mathcal{X}$, induced by the measure $\mu$. To this end we will slightly abuse the notation and write $(\bar{x}_i, y_i)$ for the sequence $z$ such that $z_i = y_i$ and $\bar{z}_i = \bar{x}_i$.
Following [21] let us introduce and regard this vector as an element of R n endowed with the standard Euclidean norm | · |.
Definition 3.5. We will say that µ satisfies the logarithmic Sobolev inequality with constant C (abbrev. LSI(C)) if for every f : X → R, Remark 3.6. The notion of logarithmic Sobolev inequality introduced above can be interpreted as the usual logarithmic Sobolev inequality equivalent to hypercontractivity of the related Glauber dynamics/Gibbs sampler (see e.g., [11,31,21]), however we will not use this interpretation in the sequel.
Using the approximate tensorization property together with a log-Sobolev inequality for two point distributions [17] and a Herbst-type argument of Aida and Stroock [6] (see also [9,5,2]), Götze, Sambale and Sinulis [21] proved Theorem 3.7. Let µ be a measure on {−1, 1} n , defined by (2.1), and assume that |h| ∞ ≤ α and the Dobrushin condition (2.3) is satisfied with some ρ < 1. Then there exists a constant C = C(α, ρ) such that µ satisfies the LSI(C). As a consequence, if X is a random vector distributed according to µ, then for any p ≥ 2, Combining the above result with the well known fact that the logarithmic Sobolev inequality (3.2) implies the Poincaré inequality with constant C, i.e., we immediately obtain Corollary 3.8. Under the assumptions and notation of Theorem 3.7, for every f : This inequality will be the basis of our inductive argument in the proof of Theorem 2.2.

4. Proof of Theorem 2.2
Proof of Theorem 2.2. The proof will be an adaptation of arguments from [5] (see also [2]) to the discrete case. To carry it out it will be convenient to introduce an inner product ·, · on the space of k-tensors with the formula Let us also recall the notation . . , k. We will first prove by induction on d that for any positive integer d, and any function f : where K is a constant depending only on α, ρ and G 1 , . . . , G d are i.i.d. standard Gaussian vectors in R n independent of X. Here by E X we denote expectation with respect to the random vector X and (as explained in Section 2) the derivatives ∇ i f denote the derivatives of the tetrahedral polynomial coming from the Fourier-Walsh expansion of f . We remark that d does not necessarily coincide with the degree of f .
To this end we will proceed by induction. Consider thus $f(x) = A_0 + \sum_{k=1}^{D} \langle A_k, x^{\otimes k}\rangle$, where $A_0 \in \mathbb{R}$ and for $k = 1,\ldots,D$, $A_k = (a^k_{\mathbf{i}})_{|\mathbf{i}|\le n}$ are $k$-indexed symmetric matrices with zeros on generalized diagonals. Let $\tilde{X}_i$, $i = 1,\ldots,n$, be $\{-1,1\}$-valued random variables (possibly defined on some extension of the original probability space) such that the conditional distribution of $\tilde{X}_i$ given $X = x$ equals $\mu_i(\cdot|\bar{x}_i)$.
Recall that for i ≤ n, , hence by Corollary 3.8, using the notationX i = (X j ) j =i we have for any p ≥ 2, Using Jensen's inequality for the conditional expectation, we can further write Define now for k = 1, . . . , d and i = 1, . . . , n the . Using the fact that the generalized diagonals of the matrices A k vanish together with the symmetry of A k , we get Since |X i − X i | ≤ 2, by combining this equality with the previous estimate, we obtain Using the fact that if g is a standard Gaussian variable, then for p ≥ 1 we have √ pM −1 ≤ g p ≤ M √ p, where M is a universal constant, we can write the above inequality as for a standard n-dimensional Gaussian vector G, independent of X, { X i } i≤n and K = 2 √ CM . This establishes (4.1) for d = 1.
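The two-sided comparison between $\|g\|_p$ and $\sqrt{p}$ used in the last step can be verified numerically from the classical identity $\mathbb{E}|g|^p = 2^{p/2}\,\Gamma\big(\frac{p+1}{2}\big)/\sqrt{\pi}$ for a standard Gaussian variable $g$. The sketch below is only a sanity check; the function name is hypothetical.

```python
import math

def gaussian_lp_norm(p):
    """||g||_p = (E|g|^p)^(1/p) for a standard Gaussian g and p > 0."""
    # Work with logarithms so that large p does not overflow.
    log_moment = (0.5 * p * math.log(2.0)
                  + math.lgamma(0.5 * (p + 1.0))
                  - 0.5 * math.log(math.pi))
    return math.exp(log_moment / p)

# Ratio ||g||_p / sqrt(p) on a range of integer exponents.
ratios = [gaussian_lp_norm(p) / math.sqrt(p) for p in range(1, 201)]
```

On this range the ratio stays between 0.6 and 0.8, consistent with the existence of a universal constant $M$ such that $\sqrt{p}\,M^{-1} \le \|g\|_p \le M\sqrt{p}$ for $p \ge 1$.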
The induction step follows just by the case d = 1 and the triangle inequality in L p . Indeed, assuming that (4.1) holds for d, by the triangle inequality and linearity of expectation we get Applying now (conditionally on G 1 , . . . , G d ) (4.2) to the first term on the right hand side and using the Fubini theorem we obtain This ends the induction step and establishes (4.1).
If f is a polynomial of degree d, then ∇ d f (X) is deterministic (and thus equal to its expectation) so (4.1) can be written in a more concise way We will now use a result by Latała [27], which asserts the existence of constants C k , depending only on k, such that for any k-index matrix A, and p ≥ 2, where C depends on ρ, α, d.
By Chebyshev's inequality in L p this gives for p ≥ 0, (the additional factor e 2 on the right hand side allows to extend the estimate from p ≥ 2 to all p ≥ 0). The theorem follows now by a change of variables and adjustment of constants.
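The passage from moment bounds to tail bounds in the last step can be spelled out as follows. This is a generic sketch rather than a transcription of the argument above: $Y$ stands for a random variable with $\|Y\|_p > 0$ and $a_1,\ldots,a_d > 0$ are hypothetical placeholders for the norms appearing in the moment estimate.

```latex
% Assumption (placeholder form of the moment bound): for all p >= 2,
%   ||Y||_p <= sum_{k=1}^d a_k p^{k/2}.
\[
  \mathbb{P}\Big(|Y| \ge e\sum_{k=1}^{d} a_k\,p^{k/2}\Big)
  \le \mathbb{P}\big(|Y| \ge e\,\|Y\|_p\big)
  \le \frac{\mathbb{E}|Y|^p}{\big(e\,\|Y\|_p\big)^{p}}
  = e^{-p}, \qquad p \ge 2 .
\]
% Change of variables: given t > 0 choose
\[
  p = \min_{1\le k\le d}\Big(\frac{t}{d\,e\,a_k}\Big)^{2/k},
  \qquad\text{so that}\qquad
  e\sum_{k=1}^{d} a_k\,p^{k/2} \le e\cdot d\cdot\frac{t}{d\,e} = t .
\]
% If p >= 2 the first display applies; if p < 2 then e^{2} e^{-p} > 1.
% Hence, for every t > 0,
\[
  \mathbb{P}\big(|Y| \ge t\big)
  \le e^{2}\exp\Big(-\min_{1\le k\le d}\Big(\frac{t}{d\,e\,a_k}\Big)^{2/k}\Big).
\]
```

The factor $e^2$ is what makes the bound valid for small $t$ as well, and it can be replaced by 2 at the cost of decreasing the constant in the exponent.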

5. Convex concentration and improved estimates for positive definite quadratic forms
The estimates of Theorem 2.2 are of Gaussian nature, i.e., they have the same form as two-sided estimates valid for polynomials in independent Gaussian variables. Since the values of the random variables in the Ising model are ±1, it is natural to look for estimates resembling those known for polynomials in independent Rademacher variables. In this case the situation is however more complicated, as two-sided bounds are known only for polynomials of degree at most 3 (see [20,26,3]).
Below in Theorem 5.11 we present estimates similar in nature to those for Rademacher sequences for quadratic forms $\langle AX, X\rangle$, where $A$ is a nonnegative definite matrix and $X$ is a random vector with bounded coefficients, satisfying the approximate tensorization property. In some situations they improve on the bounds one can get for the Ising model from Theorem 2.2; however, in general they are not comparable to them, because they involve norms of the matrix $A$ and not just its off-diagonal part (note that in the case of the Ising model, the contribution from the diagonal is deterministic). It is natural to conjecture that (similarly as in the i.i.d. case) the assumption of nonnegative definiteness is an artefact of our proof and can actually be dropped; however, at present we are not able to obtain such more general bounds.
5.1. Convex concentration. As a tool for proving estimates for quadratic forms we will derive concentration inequalities for convex Lipschitz functions for measures on products of compact sets, satisfying the approximate tensorization property, which are of independent interest. In particular this will allow us to obtain concentration for linear combinations with vector coefficients (see Proposition 5.5), which generalize moment estimates obtained in the Rademacher case by Dilworth and Montgomery-Smith [18] (see also [26]).
Recall that a random vector $X$ in $\mathbb{R}^n$ has the convex concentration property with constant $K$ if for any $L$-Lipschitz convex function $f\colon \mathbb{R}^n \to \mathbb{R}$ and any $t > 0$,
$$\mathbb{P}\big(|f(X) - \mathrm{Med}\,f(X)| \ge t\big) \le 2\exp\Big(-\frac{t^2}{K^2L^2}\Big). \tag{5.1}$$
It is well known that the above property is, up to constants, equivalent to concentration around the mean, i.e.,
$$\mathbb{P}\big(|f(X) - \mathbb{E}f(X)| \ge t\big) \le 2\exp\Big(-\frac{t^2}{K'^2L^2}\Big), \tag{5.2}$$
in the sense that the inequalities (5.1) and (5.2) imply each other and the constants $K$ and $K'$ depend only on one another. We will say that a random vector $Z$ satisfies the dimension-free convex concentration property with constant $K$ if for any $N$, the random vector $X = (X_1,\ldots,X_N)$, where $X_i$ are i.i.d. copies of $Z$, satisfies (5.1).
We will now relate the approximate tensorization property of measures on [−1, 1] n to the convex concentration property, showing in particular that if the distribution of X is given by (2.1), where |h| ∞ ≤ α and J ij satisfy the Dobrushin condition (2.3), then X satisfies the dimension-free convex concentration property with a constant depending only on α and ρ (Proposition 5.4 below). Next we will prove that convex concentration property for measures on products of compact sets can be in fact improved by taking into account the uniform bounds on the components of X. Finally we will illustrate this phenomenon with applications to linear forms with vector coefficients and quadratic non-negatively definite forms (Proposition 5.5 and Theorem 5.11).
In order to pass from approximate tensorization of entropy to dimension-free convex concentration property, we will use weak transportation inequalities, introduced recently by Gozlan, Roberto, Samson and Tetali [22].
Let us denote by P 1 (R n ) the set of all probability measures on R n with finite first moment.
Definition 5.1. Let µ and ν be probability measures on R n . Assume that ν ∈ P 1 (R n ). For a convex, lower semicontinuous function θ : R n → [0, ∞], such that θ(0) = 0 define the weak transport cost between µ and ν as where the infimum is taken over all couplings π between µ and ν (i.e., measures on R n ×R n with marginals µ, ν) and for x ∈ R n , p x (·) is the conditional measure defined (µ almost surely) by π(dxdy) = p x (dy)µ(dx).
Using probabilistic notation one can write where the infimum is taken over all pairs of random vectors (X, Y ) with values in R n × R n , such that X is distributed according to µ and Y according to ν.
Recall also that if $\mu$, $\nu$ are two probability measures, then the relative entropy of $\nu$ with respect to $\mu$ is given by the formula $H(\nu|\mu) = \mathbb{E}_\nu \log\frac{d\nu}{d\mu}$ if $\nu$ is absolutely continuous with respect to $\mu$ and $H(\nu|\mu) = \infty$ otherwise.
Definition 5.2. Let µ ∈ P 1 (R n ) and θ : R n → [0, ∞] be a convex lower semicontinuous function with θ(0) = 0. We will say that µ satisfies the inequality T θ if for every probability measure ν ∈ P 1 (R n ), The following theorem established in [22] describes connections between dimension-free convex concentration, weak transportation inequalities and log-Sobolev inequalities for convex and concave functions. Theorem 5.3. Let X be a random vector in R n with distribution µ. The following conditions are equivalent.
(i) There exists K such that X has the dimension-free convex concentration property with constant K.
(ii) There exists $c$ such that $\mu$ satisfies the inequality $T_\theta$ with $\theta(x) = c|x|^2$.
(iii) There exist D, λ > 0 such that for every convex Lipschitz function and every concave function whose Hessian is bounded from below by (−λ)Id, Moreover for any two assertions above the constants in one of them may be taken to depend only on the constants in the other one.
Using the above result we can easily obtain the following proposition, which may be useful e.g., in statistical applications, when dealing with i.i.d. samples drawn from the measure µ (see e.g., [15] for a discussion of applications related to the Ising model).
Proposition 5.4. If X is a [−1, 1] n -valued random vector with law µ, which satisfies the approximate tensorization AT (C), then X satisfies the dimension-free convex concentration inequality with constant K, depending only on C.
Proof. The celebrated convex distance inequality by Talagrand (see e.g., [39,40]) asserts that any random variable with support in $[-1,1]$ satisfies the dimension-free convex concentration property with a universal constant. In particular, by Theorem 5.3 it satisfies the log-Sobolev inequality (5.5) with some universal constants $D$, $\lambda$. Consider any function $f\colon \mathbb{R}^n \to \mathbb{R}$ which is either convex or concave with $\nabla^2 f(x) \ge -\lambda\,\mathrm{Id}$ for all $x$. In particular, in the latter case for any $i \le n$, $\frac{\partial^2 f}{\partial x_i^2}(x) \ge -\lambda$. One can thus apply the one-dimensional version of (5.5) to $\mu_i(\cdot|\bar{x}_i)$ and the function $x_i \mapsto f(\bar{x}_i, x_i)$, which together with the condition $AT(C)$ gives (5.5) with constants $CD$ and $\lambda$. The proof is now concluded by another application of Theorem 5.3.
It is easy to see that if a measure µ supported on [−1, 1] n satisfies T θ with θ(x) = c|x| 2 then it actually satisfies a stronger inequality T γ with γ(x) = |x| 2 if |x| ∞ < 2 and γ(x) = ∞ otherwise. Indeed for the right-hand side to be finite ν must be also supported on [−1, 1] n , in which case by (5.3) T θ (µ|ν) = T γ (µ|ν) and T θ (ν|µ) = T γ (ν|µ). In fact, weak transportation inequalities with such strengthened cost functions can hold only for compactly supported measures (see [38]). The interest in such strengthening lies in the fact that by taking into account the boundedness of random variables, it implies concentration inequalities stronger than the subgaussian bound given by (5.2) (see e.g., [4] for concentration results corresponding to various cost functions θ). As shown in the next proposition, such inequalities can be also easily inferred just at the level of convex concentration. To formulate this result let us introduce a family of norms on R n given for p > 0 by the formula It is not difficult to see that where (x ↓ i ) i≤n is the nonincreasing rearrangement of the sequence (|x i |) i≤n . In fact one has Such norms are equivalent to interpolation norms between the spaces ℓ n 2 and ℓ n 1 and in a probabilistic context appeared for the first time in the paper [32], where it is shown that if ε 1 , . . . , ε n are independent Rademacher variables, then for x ∈ R n and p ≥ 2, where C is a universal constant. The meaning of the subscript {1} will become clear when we define counterparts of this norm for matrices. To keep uniform notation, we introduce it already here.
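The rearrangement expression mentioned above is easy to evaluate. The sketch below assumes the standard form $\sum_{i\le p} x^\downarrow_i + \sqrt{p}\,\big(\sum_{i>p}(x^\downarrow_i)^2\big)^{1/2}$, which is comparable up to universal constants to the norm $\|x\|_{\{1\},p}$ but is not claimed to be equal to it; the function name is hypothetical and non-integer $p$ is handled by rounding down.

```python
import math

def rearrangement_norm(x, p):
    """sum_{i <= p} x*_i + sqrt(p) * (sum_{i > p} (x*_i)^2)^(1/2).

    Here x* is the nonincreasing rearrangement of (|x_i|).
    """
    decreasing = sorted((abs(v) for v in x), reverse=True)
    k = min(len(decreasing), int(math.floor(p)))
    head = sum(decreasing[:k])                              # l_1 part: k largest entries
    tail = math.sqrt(sum(v * v for v in decreasing[k:]))   # l_2 part: the rest
    return head + math.sqrt(p) * tail

# Harmonic coefficients 1/i, truncated at 10^4 terms for the computation.
harmonic = [1.0 / i for i in range(1, 10001)]
growth = rearrangement_norm(harmonic, 16) - rearrangement_norm(harmonic, 4)
```

For $p \ge n$ the expression reduces to the $\ell_1$ norm, and for the harmonic coefficients it grows logarithmically in $p$: multiplying $p$ by 4 increases the value by an amount close to $\log 4$.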
We are now ready to state the strengthened concentration result. where M is the mean or the median of f (X) and C is a universal constant.
This improves on what can be obtained from (5.2), since, as one can see from (5.7), $\|x\|_{\{1\},p} \le \sqrt{p}\,|x|$.
Remark 5.6. By standard regularization arguments (see e.g., [35, p. 429]) one can drop the smoothness assumptions on $f$ by replacing $\sup_x \|\nabla f(x)\|_{\{1\},p}$ with the Lipschitz constant of $f$ with respect to the norm dual to $\|\cdot\|_{\{1\},p}$. One can also assume that $f$ is defined only on $[-1,1]^n$, since one can extend it to $\mathbb{R}^n$ with the formula $\tilde{f}(y) = \sup_{x\in(-1,1)^n}\big(f(x) + \langle\nabla f(x), y - x\rangle\big)$, without altering the Lipschitz constant (here $\nabla f(x)$ denotes some subgradient of $f$ at $x$).
Before proving the above proposition, let us illustrate it with examples, to show how it improves on the usual subgaussian convex concentration (5.2).
In view of Remark 5.6, Proposition 5.5 yields the following corollary.
Example 5.9. Let us now provide a simple one dimensional example. For illustration purposes it will be more convenient to state it in terms of infinite sequences of random variables. Let thus X 1 , X 2 , . . . be centered random variables with values in [−1, 1], such that for all n, the vector X = (X 1 , . . . , X n ) satisfies (5.1) with K = 1 (for simplicity). Define the random variable Z = ∞ i=1 1 i X i (note that thanks to (5.2) this sequence converges in L 2 ). Then by (5.2), we get P(|Z| ≥ t) ≤ 2 exp(−ct 2 ) for some c > 0. However, it is easy to see that (1/i) ∞ i=1 {1},p ≃ log p (up to multiplicative constants) for p ≥ 2 thus by Proposition 5.5 (after adjusting the constants) we obtain This Gumbel type tail decay is clearly much faster than Gaussian. Note that thanks to the onedimensional nature this example can be in fact easily recovered directly from (5.2) by combining it with obvious pointwise bounds on the variables X i . Since (5.10) can be equivalently restated as for any bounded set T of vectors and Z = sup x∈T | i x i X i |, one can easily create more complicated examples with various types of tail decay.
Proof of Proposition 5.5. The idea of the proof goes back to Talagrand and is by now classical. The main additional observation one needs to make is that exploring the boundedness of the support may lead to improved inequalities for general convex Lipschitz functions rather than just for linear functions.
We will start by proving the inequality in question with the median. Let thus M = Med f and consider first the convex set A = {x ∈ [−1, 1] n : f (x) ≤ M }, so that P(X ∈ A) ≥ 1/2. Define g(x) = dist(x, A), then Med g(X) = 0 and by convexity of A, g is a convex function. Note that if for z ∈ [−1, 1] n , f (z) > M + 3 sup x |∇f (x)| {1},p , then by convexity for any y ∈ A, (where we used (5.8)) and so |z − y| > √ p. Taking infimum over all y ∈ A, and recalling that Med g(X) = 0, we obtain where in the last inequality we used (5.1). As for the lower tail, we can clearly assume that f is not constant. In particular sup x ∇f (x) {1},p > 0.
by similar estimates as above one obtains that for $z \in [-1,1]^n$, if $\operatorname{dist}(z, A) < \sqrt{p}$, then $f(z) < M$. Thus, Since $A \subseteq B^c$ we can also assume that $B^c \neq \emptyset$. The function $g(x) = \operatorname{dist}(x, B)$ is 1-Lipschitz, concave on the complement of $B$ and can be extended to a function $\tilde g(x) := \inf_{z \in B^c} (g(z) + \langle \nabla g(z), x - z\rangle)$, which is 1-Lipschitz, concave on $\mathbb{R}^n$ and non-positive on $B$. Moreover $\tilde g = g$ on $B^c$. Thus $\operatorname{Med} \tilde g(X) \le 0$ and so As a consequence $\mathbb{P}(X \in A) \le 2\exp(-p/K^2)$, which together with (5.11) proves that To pass from the median to the mean, we notice that for $t \ge 1$, $\|x\|_{\{1\},tp} \le \sqrt{t}\,\|x\|_{\{1\},p}$, so applying the above estimate with $t^2 p$ instead of $p$, we get In particular by Jensen's inequality and integration by parts this yields If $p > K^2$, this gives (5.9) for $M = \mathbb{E} f(X)$ with $C = 6 + C'$, otherwise (5.9) is trivial, as the right hand side exceeds one.
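The scaling property $\|x\|_{\{1\},tp} \le \sqrt{t}\,\|x\|_{\{1\},p}$ used in the passage from the median to the mean follows from a one-line rescaling. The sketch below assumes the definition $\|x\|_{\{1\},p} = \sup\{\langle x, y\rangle : |y| \le \sqrt{p},\ |y|_\infty \le 1\}$. The second display assumes, hypothetically, that the median estimate has the form $\mathbb{P}(|f(X) - M| \ge 3 S_q) \le c_0 e^{-q/K^2}$ for all $q$; the symbols $S_q$ and $c_0$ are our notation.

```latex
% Rescaling: if |y| \le \sqrt{tp} and |y|_\infty \le 1 with t \ge 1, then y' = y/\sqrt{t}
% satisfies |y'| \le \sqrt{p} and |y'|_\infty \le 1/\sqrt{t} \le 1.
\[
  \|x\|_{\{1\},tp}
  = \sup_{|y| \le \sqrt{tp},\ |y|_\infty \le 1} \langle x, y \rangle
  = \sqrt{t} \sup_{|y| \le \sqrt{tp},\ |y|_\infty \le 1} \langle x, y/\sqrt{t} \rangle
  \le \sqrt{t}\, \|x\|_{\{1\},p} .
\]

% Consequence, with S_q = \sup_x \|\nabla f(x)\|_{\{1\},q} and q = t^2 p, so that S_{t^2 p} \le t S_p:
\[
  \mathbb{P}\big( |f(X) - M| \ge 3 t S_p \big)
  \le \mathbb{P}\big( |f(X) - M| \ge 3 S_{t^2 p} \big)
  \le c_0 \exp\big( -t^2 p / K^2 \big), \qquad t \ge 1 .
\]
```

A Gaussian-type tail in $t$ at scale $S_p$ is exactly what makes the subsequent integration by parts converge.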

Quadratic forms.
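We will now pass to quadratic forms. In order to formulate tail estimates in this case, we need to introduce two additional norms of a symmetric matrix $A = (a_{ij})_{i,j \le n}$. Following [26] we define $\|A\|_{\{1,2\},p}$ and $\|A\|_{\{1\}\{2\},p}$, the second of which is given by $\|A\|_{\{1\}\{2\},p} = \sup\{\sum_{i,j} a_{ij} x_i y_j : |x|, |y| \le \sqrt{p},\ |x|_\infty, |y|_\infty \le 1\}$.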
We note that by (5.7), which gives a simpler expression if one is interested in concentration up to dimension-free constants.
Remark 5.10. It is easy to see that which justifies the subscript $\{1,2\}$ used in the notation. It is also clear that In [26] Latała proved that there exists a universal constant $C$ such that if $X = (\varepsilon_1, \ldots, \varepsilon_n)$, where the $\varepsilon_i$'s are independent Rademacher variables, then for any symmetric matrix with vanishing diagonal and any $p \ge 2$, one has Similar bounds were also obtained for cubic forms in [3]. As a consequence, by Chebyshev's and Paley-Zygmund inequalities (see [26]), in this case for any $p > 0$,
The result we obtain for quadratic forms in dependent random variables is
Theorem 5.11. Let $X$ be a centered random vector with values in $[-1,1]^n$, satisfying the convex concentration property with constant $K$ and let $A = (a_{ij})_{i,j \le n}$ be a symmetric nonnegative definite matrix. Then there exists a constant $C_K$, depending only on $K$, such that for any $p \ge 0$, and
Remark 5.12. The mean zero assumption in the above theorem is introduced only to simplify its formulation. Clearly in the general case one can recenter the vector and handle the linear correction by Proposition 5.5.
Remark 5.13. For the Ising model the norms $\|A\|_{\{1,2\},p}$ and $\|A\|_{\{1\}\{2\},p}$ introduce an unnecessary contribution from the diagonal of $A$, which does not influence the value of $\langle AX, X\rangle - \mathbb{E}\langle AX, X\rangle$. However, for random variables not supported on $\{-1, 1\}$, in general this contribution has to be taken into account. It is not difficult to see that up to constants it corresponds to the $\|\cdot\|_{\{1\},p}$ norm of the vector consisting of the diagonal elements of $A$, which is consistent with the estimates of Proposition 5.5 as well as tail bounds for sums of independent bounded random variables.
As already mentioned at the beginning of the section, one expects that the assumption of nonnegative definiteness of the matrix $A$ is not needed in Theorem 5.11. In [1] it is shown that the convex concentration property (5.2) implies the Hanson-Wright inequality (2.7) (with $c$ depending on $K$) for arbitrary matrices by splitting the matrix into the sum of its positive and negative definite parts and treating each of them separately (using convexity). This strategy does not work here, since the $\|\cdot\|_{I,p}$ norms are not invariant under conjugation and the norms of the positive and negative parts can be of greater order than the corresponding norms of the original matrix. This can be seen e.g., with a matrix $A = (a_{ij})_{i,j \le n}$ such that $a_{1i} = a_{i1} = 1$ for $i \neq 1$ and all the other coefficients are zero. In this case for $1 \ll p \ll n$ we get $\|A\|_{\{1,2\},p} + \|A\|_{\{1\}\{2\},p} \simeq \sqrt{p}\sqrt{n}$, whereas if $A_\pm$ is the positive/negative part of $A$, then
For the Ising model one may hope that the assumption of nonnegative definiteness of the matrix $A$ in Theorem 5.11 could be removed by a repetition of the proof of Theorem 2.2 with auxiliary Rademacher variables instead of Gaussian ones, i.e., by proving that for every $f$ (seen as a tetrahedral polynomial) and $p \ge 2$, (actually if one is interested only in quadratic forms, it is enough to prove it for polynomials of degree 2). We do not know if such an inequality is satisfied under the assumptions of Theorem 2.2.
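For the matrix $A$ with $a_{1i} = a_{i1} = 1$ for $i \neq 1$ considered above, the positive and negative parts can be written down explicitly, which makes the loss visible. The computation below is our own illustration and is not reproduced from the paper. It uses the definition $\|A\|_{\{1\}\{2\},p} = \sup\{\sum_{i,j} a_{ij} x_i y_j : |x|, |y| \le \sqrt{p},\ |x|_\infty, |y|_\infty \le 1\}$, and the symbols $w$, $m$, $u_\pm$ are our notation.

```latex
% Notation (ours): e_1 = (1, 0, ..., 0), w = (0, 1, ..., 1), m = n - 1 = |w|^2.
% Then A = e_1 w^T + w e_1^T has exactly two nonzero eigenvalues, \pm\sqrt{m}:
\[
  u_\pm = \tfrac{1}{\sqrt{2}} \big( e_1 \pm w/\sqrt{m} \big), \qquad
  A u_\pm = \pm \sqrt{m}\, u_\pm , \qquad
  A = A_+ - A_-, \quad A_\pm = \sqrt{m}\, u_\pm u_\pm^T .
\]
% Expanding the rank-one matrices:
\[
  A_\pm = \frac{\sqrt{m}}{2}\, e_1 e_1^T \;\pm\; \frac{1}{2} \big( e_1 w^T + w e_1^T \big)
          \;+\; \frac{1}{2\sqrt{m}}\, w w^T .
\]
% Test vector for p \le m: x = y = \sqrt{p/m}\, w, so that |x| = \sqrt{p}, |x|_\infty \le 1, x_1 = 0.
% Only the dense block w w^T contributes, and \langle w, x \rangle = \sqrt{pm}:
\[
  \|A_\pm\|_{\{1\}\{2\},p} \;\ge\; \langle A_\pm x, x \rangle
  = \frac{\langle w, x \rangle^2}{2\sqrt{m}}
  = \frac{p \sqrt{n-1}}{2} .
\]
```

Both parts therefore carry a dense block $\frac{1}{2\sqrt{m}} w w^T$ that cancels in $A$ itself, and for $1 \ll p \ll n$ the lower bound $p\sqrt{n-1}/2$ is of larger order than $\sqrt{p}\sqrt{n}$.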
Example 5.14. Let us now present an example of a matrix $A$ for which Theorem 5.11 gives a substantially better tail estimate than the one given by the Hanson-Wright inequality. One possibility is to tensorize Example 5.9, i.e., to consider the matrix $A_n = (a_{ij})_{i,j=1}^n$ given by $a_{ij} = \frac{1}{ij}$ for large values of $n$. Noting that $\langle AX, X\rangle = \langle v, X\rangle^2$ for $v = (1, 1/2, \ldots, 1/n)$ one can argue that this example is still rather about linear combinations than quadratic forms. Let us therefore leave the details to the Reader and instead consider the matrix $A_n$ given by $a_{ij} = \frac{1}{(i+j)^2}$. It is easy to see that for any $n$, $A_n$ is positive definite (e.g., by noting that for a standard exponential variable $Y$ and $t \ge 0$ we have $\mathbb{E} e^{-tY} = \frac{1}{1+t}$ and using basic properties of the Laplace transform). Now both $\|A_n\|_{HS}$ and $\|A_n\|_{\ell_2^n \to \ell_2^n}$ are of order $\Omega(1)$ as $n \to \infty$, and so, if $X_n$ is a sequence of centered random vectors in $[-1,1]^n$ satisfying (5.2) with $K$ independent of $n$, then the Hanson-Wright type inequalities (2.7) give $\mathbb{P}(|\langle A_n X_n, X_n\rangle - \mathbb{E}\langle A_n X_n, X_n\rangle| \ge t) \le 2\exp(-ct)$ for some dimension-independent constant $c$. On the other hand, it is not difficult to check that we have $\|A_n\|_{\{1,2\},p} \le C$ and $\|A_n\|_{\{1\}\{2\},p} \le C\log p$ for some dimension-independent constant $C$. Thus Theorem 5.11 gives where $c'$ is another dimension-independent constant and we again obtain a strengthened Gumbel type behavior. The above example is primarily an illustration of the difference between the norms $\sqrt{p}\,\|A\|_{HS} + p\,\|A\|_{\ell_2^n \to \ell_2^n}$ used in the Hanson-Wright inequality and the norms $\|A\|_{\{1,2\},p} + \|A\|_{\{1\}\{2\},p}$ used in the estimates of Rademacher type given in Theorem 5.11, but in fact one can recover (5.14) by splitting the matrix $A$ appropriately into a sum of two matrices, applying the Hanson-Wright inequality to one of them and the trivial pointwise bound to the other one (similarly as in the one-dimensional case of Example 5.9, where one can apply the pointwise bounds together with the Khintchine inequality).
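Positive definiteness of $A_n = ((i+j)^{-2})_{i,j \le n}$ can also be seen from an integral representation. The sketch below is a variant of the Laplace transform argument hinted at in Example 5.14 and is not necessarily the route the authors have in mind. It uses only the elementary identity $\int_0^\infty t e^{-st}\,dt = s^{-2}$ for $s > 0$.

```latex
% Integral representation of the entries, with s = i + j:
\[
  \frac{1}{(i+j)^2} = \int_0^\infty t\, e^{-(i+j)t}\, dt .
\]
% Quadratic form for x \in \mathbb{R}^n:
\[
  \langle A_n x, x \rangle
  = \sum_{i,j=1}^n \frac{x_i x_j}{(i+j)^2}
  = \int_0^\infty t\, \Big( \sum_{i=1}^n x_i e^{-it} \Big)^2 dt \;\ge\; 0 ,
\]
% with equality only if \sum_i x_i e^{-it} = 0 for all t > 0, i.e. only if x = 0,
% since the functions t \mapsto e^{-it}, i = 1, ..., n, are linearly independent.

% Hilbert-Schmidt norm, uniformly in n (there are k - 1 pairs (i, j) with i + j = k):
\[
  \|A_n\|_{HS}^2 = \sum_{i,j=1}^n \frac{1}{(i+j)^4}
  \le \sum_{k=2}^\infty \frac{k-1}{k^4} < \infty,
  \qquad
  \|A_n\|_{HS} \ge a_{11} = \tfrac{1}{4} .
\]
```

In particular $\|A_n\|_{HS}$ stays bounded above and below as $n \to \infty$, in line with the order of magnitude stated in the example.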
This strategy is however limited, as in general there does not exist a constant $C$, independent of $p$ and $n$, such that for all $n \times n$ matrices $A$ and $p \ge 2$, To see this one can consider e.g., the matrix given in Remark 5.13 or matrices of the form $A = vv^T$, where $v$ has one coordinate equal to 1 and the remaining ones equal to $p/n$, and $p \to \infty$ with $n$ at an appropriate speed (we leave the details to the Reader). This shows that estimates of the form (5.12) do improve on the Hanson-Wright inequality.
Proof of Theorem 5.11. Denote $f(x) = \langle Ax, x\rangle$. By our assumptions this is a convex function and therefore, similarly as in the proof of Proposition 5.5, we can write for any $x, y \in [-1,1]^n$, and thus by the convex concentration assumption applied to the function $g$ (note that $g$ is convex and $\operatorname{Med} g(X) = 0$), we get $\mathbb{P}(f(X) - M \ge 3\|\nabla f(X)\|_{\{1\},p}) \le 2e^{-p/K^2}$ (observe that here we bound $f(X) - M$ by a random quantity).
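For the quadratic form $f(x) = \langle Ax, x\rangle$ the convexity inequality used here can be made completely explicit. The identity below is a standard computation, included as a sketch of the step rather than as the authors' exact display; $h$ is our notation.

```latex
% Expansion around x for symmetric A, with h = y - x:
\[
  f(y) = \langle A(x+h), x+h \rangle
       = f(x) + 2\langle Ax, h \rangle + \langle Ah, h \rangle .
\]
% Rearranging, and using \nabla f(x) = 2Ax together with \langle Ah, h \rangle \ge 0
% (A is nonnegative definite):
\[
  f(x) - f(y)
  = \langle \nabla f(x), x - y \rangle - \langle A(x-y), x-y \rangle
  \le \langle \nabla f(x), x - y \rangle .
\]
```

Nonnegative definiteness is used only to discard the term $\langle A(x-y), x-y\rangle$; for a general symmetric matrix this term has no sign and the one-sided bound fails.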
As $\nabla f(X) = 2AX$, we can apply Corollary 5.7 to the norm $\|\cdot\|_{\{1\},p}$ to obtain for some universal constant $C$. Combining the two last inequalities we obtain for some (new) constant $C$, To get a bound on the upper tail (above the median) it thus suffices to estimate $\mathbb{E}\|AX\|_{\{1\},p}$. In [26] Latała proved that in the case when $Y$ is a vector of independent Rademacher variables, $\mathbb{E}\|AY\|_{\{1\},p} \le C(\|A\|_{\{1,2\},p} + \|A\|_{\{1\}\{2\},p})$.
In [3] it is mentioned (see the remark before Lemma 8.4) that this inequality can be proved by a chaining argument, relying on concentration properties of vector-valued linear combinations of Rademacher variables. Since $X$ is a centered random vector satisfying analogous concentration properties as $Y$, one could follow this approach to recover the above inequality for $X$. However, the formulations and proofs in [26,3] are given for general independent variables with log-concave tails, and translating the arguments of [3], even if straightforward, is quite tedious. Therefore, we will instead use a recent deep result of Bednorz and Latała [8] concerning suprema of Rademacher processes, which will allow us to reduce estimates for $X$ directly to the case of random signs. Their Theorem 1.1 (in a finite-dimensional formulation suitable for our purposes) asserts that if $T \subseteq \mathbb{R}^n$ then there exists a decomposition $T = T_1 + T_2$ such that $\sup_{t \in T_1} \sum_i |t_i| \le C\,\mathbb{E}\sup_{t \in T} \sum_i t_i \varepsilon_i$ and $\mathbb{E}\sup_{t \in T_2} \sum_i t_i g_i \le C\,\mathbb{E}\sup_{t \in T} \sum_i t_i \varepsilon_i$, where $\varepsilon_i$, $g_i$ are sequences of i.i.d. Rademacher and Gaussian variables, respectively, and $C$ is a universal constant.
Since our $X$ satisfies the convex concentration property, it is in particular subgaussian with constant $K$ and by another deep result, Talagrand's Majorizing Measure Theorem (see [41]), we have for any set $T_2$. Therefore, expressing $\|AX\|_{\{1\},p}$ as a supremum of linear combinations of the $X_i$'s and using the above estimates together with the inequality $|X_i| \le 1$, we see that for some (new) universal constant $C$. Using the fact that $\|A\|_{\{1,2\},tp} \le C''\sqrt{t}\,\|A\|_{\{1,2\},p}$ and $\|A\|_{\{1\}\{2\},tp} \le t\,\|A\|_{\{1\}\{2\},p}$ for $t \ge 1$ and some universal constant $C''$, we can easily replace the right hand side by $4e^{-p}$ at the cost of changing $C(1 + K)$ to some constant $C_K$ (which can clearly be expressed explicitly in terms of $C$ and $K$).
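The reduction from $X$ to random signs described above can be summarized in one chain of inequalities. This is a sketch under stated assumptions, not a reproduction of the lost displays: we assume the definition $\|x\|_{\{1\},p} = \sup\{\langle x, y\rangle : |y| \le \sqrt{p},\ |y|_\infty \le 1\}$, we read the Bednorz and Latała decomposition as "every $t \in T$ can be written as $t = t^1 + t^2$ with $t^1 \in T_1$, $t^2 \in T_2$", and we assume that the subgaussian comparison has the form $\mathbb{E}\sup_{t \in T_2} \langle t, X\rangle \le CK\,\mathbb{E}\sup_{t \in T_2} \langle t, g\rangle$. The constant $C$ may change from line to line.

```latex
% Index set (A is symmetric, so \langle AX, y \rangle = \langle X, Ay \rangle):
\[
  T = \{ Ay : |y| \le \sqrt{p},\ |y|_\infty \le 1 \}, \qquad
  \|AX\|_{\{1\},p} = \sup_{t \in T} \langle t, X \rangle .
\]
% Split each t \in T as t = t^1 + t^2, bound the T_1 part pointwise using |X_i| \le 1,
% and compare the T_2 part with the Gaussian process:
\begin{align*}
  \mathbb{E} \|AX\|_{\{1\},p}
  &\le \mathbb{E} \sup_{t \in T_1} \langle t, X \rangle
     + \mathbb{E} \sup_{t \in T_2} \langle t, X \rangle \\
  &\le \sup_{t \in T_1} \sum_i |t_i|
     + C K\, \mathbb{E} \sup_{t \in T_2} \sum_i t_i g_i \\
  &\le C (1 + K)\, \mathbb{E} \sup_{t \in T} \sum_i t_i \varepsilon_i
   = C (1 + K)\, \mathbb{E} \|AY\|_{\{1\},p} ,
\end{align*}
% where Y = (\varepsilon_1, ..., \varepsilon_n) is a vector of independent Rademacher variables.
```

Combined with Latała's estimate $\mathbb{E}\|AY\|_{\{1\},p} \le C(\|A\|_{\{1,2\},p} + \|A\|_{\{1\}\{2\},p})$, this explains the factor $C(1+K)$ appearing in the text.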
The inequality (5.13) with the mean replaced by the median now follows from (5.16) and from the observation that where the last inequality follows easily by (5.2) and integration by parts. Now, (5.12) and (5.13) with the median instead of the mean yield which by another integration by parts gives $|\mathbb{E}\langle AX, X\rangle - M| \le C'_K \|A\|_{\{1,2\}}$. Since for $p \ge 2$, $\|A\|_{\{1,2\}} \le C\|A\|_{\{1,2\},p}$, this easily allows us to pass from concentration around the median to concentration around the mean (at the cost of increasing the value of the constant $C_K$).