Concentration inequalities for polynomials of contracting Ising models

We study the concentration of a degree-$d$ polynomial of the $N$ spins of a general Ising model, in the regime where single-site Glauber dynamics is contracting. For $d=1$, Gaussian concentration was shown by Marton (1996) and Samson (2000) as a special case of concentration for convex Lipschitz functions, and extended to a variety of related settings by e.g., Chazottes et al. (2007) and Kontorovich and Ramanan (2008). For $d=2$, exponential concentration was shown by Marton (2003) on lattices. We treat a general fixed degree $d$ with $O(1)$ coefficients, and show that the polynomial has variance $O(N^d)$ and, after rescaling it by $N^{-d/2}$, its tail probabilities decay as $\exp(- c\, r^{2/d})$ for deviations of $r \geq C \log N$.


Introduction
Concentration of measure for functions of random fields has been extensively studied (see, e.g., [8]). A prototypical example of a system whose underlying variables are weakly dependent is the high-temperature Ising model. The model, in its most general form without an external magnetic field, is a probability measure over configurations $\sigma \in \Omega_N := \{\pm 1\}^N$ (assigning spins to the sites $\{1,\ldots,N\}$), defined as follows: for a set of coupling interactions $\{J_{ij}\}_{1\le i,j\le N}$, the corresponding Ising distribution $\pi$ is given by
$$\pi(\sigma) = \frac{1}{Z} \exp\Big(\sum_{1\le i,j\le N} J_{ij}\,\sigma_i \sigma_j\Big),$$
in which $Z$ (the partition function) is a normalizer. For general $\{J_{ij}\}$ this includes ferromagnetic/anti-ferromagnetic models and spin-glass systems on arbitrary graphs.
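For concreteness, the Gibbs measure above and its single-site (Glauber) dynamics can be sketched numerically; the following minimal illustration (function names are ours, not from the paper) computes $\pi$ by brute-force enumeration for small $N$ and performs one heat-bath update of a uniformly chosen spin.

```python
import itertools
import math
import random

def ising_weight(sigma, J):
    """Unnormalized Gibbs weight exp(sum_{i,j} J_ij * sigma_i * sigma_j)."""
    N = len(sigma)
    return math.exp(sum(J[i][j] * sigma[i] * sigma[j]
                        for i in range(N) for j in range(N)))

def ising_probability(sigma, J):
    """pi(sigma) = weight / Z, with Z computed by brute force (small N only)."""
    N = len(sigma)
    Z = sum(ising_weight(s, J) for s in itertools.product([-1, 1], repeat=N))
    return ising_weight(tuple(sigma), J) / Z

def glauber_step(sigma, J, rng):
    """One step of single-site Glauber dynamics: pick a uniform site and
    resample its spin from the conditional distribution given the rest."""
    N = len(sigma)
    i = rng.randrange(N)
    # Terms of the energy involving site i (J need not be symmetric).
    field = sum((J[i][j] + J[j][i]) * sigma[j] for j in range(N) if j != i)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))  # P(sigma_i = +1 | rest)
    sigma = list(sigma)
    sigma[i] = 1 if rng.random() < p_plus else -1
    return tuple(sigma)
```

Note that with no external field, $\pi(\sigma) = \pi(-\sigma)$, the spin-flip symmetry invoked repeatedly below.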
The Gaussian concentration of functions $f : \Omega_N \to \mathbb{R}$ in the high-temperature regime has been studied both via analytical methods, adapting tools from the analysis of product spaces to the setting of weakly dependent random variables (see, e.g., [7,12]), and via probabilistic tools such as coupling (cf. [1]). In the presence of arbitrary couplings $\{J_{ij}\}$, our hypothesis for capturing the high-temperature behavior of the model will be based on contraction, as in the related works on concentration inequalities in [1,10,11,13], and closely related to the Dobrushin uniqueness condition in [7].
In this case, for linear functions $f(\sigma) = \sum_i a_i\sigma_i$, it is known, as a special case of results of Marton [11] on Gaussian concentration for Lipschitz functions (see also [13] as well as [1,6,7,10]), that there exists $c = c(a_1,\ldots,a_N,\alpha) > 0$ such that
$$\mathbb{P}_\pi\big(|f - \mathbb{E}_\pi[f]| \ge r\big) \le 2\exp(-c\,r^2) \quad\text{for every } r > 0.$$
For bilinear forms, where $f(\sigma) = \sum_{ij} a_{ij}\sigma_i\sigma_j$, Marton [12] showed exponential concentration on lattices, whereas Daskalakis et al. [3] showed that, for a general Ising model in a subset of this regime (contraction as above with $\alpha > \frac34$ vs. any $\alpha > 0$), $\mathrm{Var}_\pi(f) = O(N^2 \log^3 N)$. Our main result recovers the correct variance and, up to a polynomial pre-factor, the tail probabilities for a polynomial of any fixed degree $d$ (for matching lower bounds one can take, for instance, the $d$-th power of the magnetization, $f(\sigma) = (\sum_i \sigma_i)^d$).
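The $\Theta(N^d)$ variance of the $d$-th power of the magnetization can be verified exactly in the degenerate product case: with all couplings zero and $m = \sum_i \sigma_i$, one has $\mathbb{E}[m^2] = N$ and $\mathbb{E}[m^4] = 3N^2 - 2N$, so the degree-$2$ polynomial $m^2$ has variance $2N(N-1) = \Theta(N^2)$. A brute-force check for small $N$ (our own illustration):

```python
import itertools

def moments_of_magnetization(N):
    """Exact E[m^2], E[m^4] for m = sum_i sigma_i under the uniform product
    measure on {-1, +1}^N (the Ising model with all couplings zero)."""
    total = 2 ** N
    e2 = e4 = 0.0
    for s in itertools.product([-1, 1], repeat=N):
        m = sum(s)
        e2 += m ** 2 / total
        e4 += m ** 4 / total
    return e2, e4

N = 8
e2, e4 = moments_of_magnetization(N)
var_m2 = e4 - e2 ** 2  # variance of the degree-2 polynomial m^2
```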
Theorem 1. For every $\alpha > 0$ and $d \ge 1$ there exists $C(\alpha, d) > 0$ so that the following holds. Let $\pi$ be the distribution of the Ising model on $N$ spins with couplings $\{J_{ij}\}$ satisfying
$$\sum_{j : j\sim i} |J_{ij}| \le 1 - \alpha \quad\text{for every } 1 \le i \le N. \tag{1.1}$$
Then every degree-$d$ polynomial $f(\sigma) = \sum_{i_1,\ldots,i_d} a_{i_1\ldots i_d}\sigma_{i_1}\cdots\sigma_{i_d}$ whose coefficient tensor $A$ has $\|A\|_\infty \le K$ satisfies
$$\mathrm{Var}_\pi(f) \le C K^2 N^d, \tag{1.2}$$
and for every $r > 0$,
$$\mathbb{P}_\pi\big(|f - \mathbb{E}_\pi[f]| \ge K N^{d/2}\, r\big) \le C N^2 \exp\big(-r^{2/d}/C\big). \tag{1.3}$$

Remark 1.1. In [3], the authors used their variance bounds for bilinear forms of Ising models to study statistical independence testing for Ising models. Namely, they gave bounds (in terms of $N$ and $\varepsilon$) on the number of samples required to distinguish, with high probability, between a product measure and an Ising model whose (symmetrized Kullback–Leibler) distance to any product measure is at least $\varepsilon$. In Section 4, Theorems 4.1–4.2, we present a short application of Theorem 1, improving the upper bounds of [3] by considering fourth-order statistics of the Ising model.

Remark 1.2. In this paper, we always consider polynomials of Ising models with no external field. As the following example shows, in the presence of an external field such polynomials can be anti-concentrated. Let $\mu_i = \mathbb{E}[\sigma_i]$ for all $i$ and expand
$$f(\sigma) = \sum_{i,j} a_{ij}\sigma_i\sigma_j = \sum_{i,j} a_{ij}(\sigma_i-\mu_i)(\sigma_j-\mu_j) + \sum_i \Big(\sum_j a_{ij}\mu_j\Big)\sigma_i + \sum_j \Big(\sum_i a_{ij}\mu_i\Big)\sigma_j - \sum_{i,j} a_{ij}\mu_i\mu_j.$$
The first term on the right-hand side should have $O(N)$ fluctuations, while the second and third terms, of the form $\sum_i (\sum_j a_{ij}\mu_j)\sigma_i$, can have fluctuations of order $N^{3/2}$ (e.g., if $(\mu_j a_{ij})_j$ all have the same sign), implying that (1.2)–(1.3) cannot hold in general under an external field.

Concentration for quadratic functions
In this section, we prove the special and more straightforward case of concentration for quadratic functions of the Ising model. The proof of Theorem 1 in §3 requires some additional ingredients but is motivated by the proof of the following.
Theorem 2.1. For every $\alpha > 0$ there exists $C(\alpha) > 0$ so that the following holds. Let $\pi$ be the distribution of the Ising model on $N$ spins with interaction couplings $\{J_{ij}\}$ satisfying (1.1). Then every quadratic $f(\sigma) = \sum_{i,j} a_{ij}\sigma_i\sigma_j$ with $\|A\|_\infty \le K$ satisfies
$$\mathrm{Var}_\pi(f) \le C K^2 N^2, \tag{2.1}$$
and for every $r > 0$,
$$\mathbb{P}_\pi\big(|f - \mathbb{E}_\pi[f]| \ge K N r\big) \le C N^2 \exp(-r/C). \tag{2.2}$$
Proof of (2.1). Consider a linear function of the form $g(\sigma) = \sum_i b_i\sigma_i$, whose variance is bounded via the variational formula for the spectral gap, (2.3). Returning to the function $f$, assume w.l.o.g. that $a_{ii} = 0$ for all $i$ (as $\sigma_i^2 = 1$) and let $\gamma$ be as defined in (2.4), which, again applying (2.3), yields the variance bound (2.1).

We now proceed to proving the exponential tail bounds on $f$. Throughout the paper, we say a function $f$ is $b$-Lipschitz on a set $S$ if $|f(\sigma) - f(\sigma')| \le b\, d(\sigma,\sigma')$ for every $\sigma, \sigma' \in S$; if it is so on its whole domain, in our case $\Omega_N$, we simply say it is $b$-Lipschitz. For subsets of a graph, e.g., $\{\pm 1\}^N$ endowed with the graph distance, by the triangle inequality it suffices to consider only $\sigma, \sigma'$ that are neighbors.
Proof of (2.2). We begin by bounding the Lipschitz constant of $N^{-1/2}f$, in light of which we define $S_b$ as the set of configurations on which every discrete derivative of $N^{-1/2}f$ is at most $2b$. In order to upper bound $\mathbb{P}_\pi(S_b^c)$, we will use the following version of concentration inequalities for Lipschitz functions of contracting Markov chains [10].

Proposition 2.2 ([10, Corollary 4.4, Eq. (4.13)], cf. [11,13]). Let $\pi$ be the stationary distribution of a $\theta$-contracting Markov chain with state space $\Omega$, and suppose $g : \Omega \to \mathbb{R}$ is $b$-Lipschitz. Then for all $r > 0$,
$$\mathbb{P}_\pi\big(|g - \mathbb{E}_\pi[g]| \ge r\big) \le 2\exp\Big(-\frac{(1-\theta)\,r^2}{2b^2}\Big).$$

To see that this applies to the discrete derivatives of $f$, note that each of them is itself a Lipschitz (linear) function of $\sigma \in \Omega_N$. By a union bound over the $N$ derivatives and Proposition 2.2 with $\theta = 1 - \alpha/N$, there exists $\kappa(\alpha) > 0$ such that the bound (2.8) on $\mathbb{P}_\pi(S_b^c)$ holds. Next, consider the McShane–Whitney extension of $N^{-1/2}f$ from $S_b$, given by
$$N^{-1/2}\tilde f(\sigma) := \min_{\sigma' \in S_b}\big\{N^{-1/2} f(\sigma') + 2b\, d(\sigma,\sigma')\big\};$$
by definition, $N^{-1/2}\tilde f$ is $2b$-Lipschitz on all of $\Omega_N$. As a result, Proposition 2.2 yields a tail bound for $\tilde f$. In order to move to the desired quantity, we need to control the difference between the means of $f, \tilde f$, using the fact that $\tilde f(\sigma) = f(\sigma)$ for all $\sigma \in S_b$: the difference is controlled by $\mathbb{P}_\pi(S_b^c)$ together with a bound on $\max(\|f\|_\infty, \|\tilde f\|_\infty)$, where in the last line we used (2.8) to bound $\mathbb{P}_\pi(S_b^c)$. By (2.10) and the choice of $b$, the first term above is suitably small. Replacing the requirement of $b > 2\kappa\|A\|_\infty^2\log(\|A\|_\infty N)$ with a prefactor of $N^2$, and combining the above two estimates, we see that (2.2) holds for every $r > 0$.
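The McShane–Whitney extension used above admits a one-line implementation on the hypercube. The sketch below (our own illustration, with $f$, $S$, and $b$ chosen arbitrarily) verifies its two defining properties: it agrees with $f$ on $S$ whenever $f$ is $b$-Lipschitz there, and it is $b$-Lipschitz in graph (Hamming) distance on the whole cube.

```python
import itertools

def hamming(x, y):
    """Graph (Hamming) distance on the hypercube {-1, +1}^N."""
    return sum(a != b for a, b in zip(x, y))

def mcshane_extension(f, S, b):
    """McShane-Whitney extension of f from S to the whole hypercube:
    a minimum of b-Lipschitz functions, hence b-Lipschitz everywhere,
    agreeing with f on S when f is b-Lipschitz on S."""
    def f_ext(sigma):
        return min(f(s) + b * hamming(sigma, s) for s in S)
    return f_ext

N = 4
cube = list(itertools.product([-1, 1], repeat=N))

def f(s):
    return sum(s)  # magnetization: 2-Lipschitz in Hamming distance

S = [s for s in cube if sum(s) >= 0]  # an arbitrary subset for illustration
f_ext = mcshane_extension(f, S, 2)
```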

Concentration for general polynomials
In order to prove Theorem 1, we will need the following intermediate lemma, used to control the mean of the gradient of $f$.

Lemma 3.1. If $h(\sigma) = \sum_{i_1,\ldots,i_p} b_{i_1,\ldots,i_p}\sigma_{i_1}\cdots\sigma_{i_p}$ is a degree-$p$ polynomial in $(\sigma_1,\ldots,\sigma_N)$ for a degree-$p$ tensor $B$, then there exists $C(p,\alpha) > 0$ such that $|\mathbb{E}_\pi[h]| \le C(p,\alpha)\,\|B\|_\infty\, N^{p/2}$.

Proof. Begin by considering ferromagnetic models with non-negative couplings $\{J_{ij}\}$. It is well known that $\mathbb{E}_\pi[\sigma_{i_1}\cdots\sigma_{i_p}] \ge 0$ in the ferromagnetic Ising model with no external field (e.g., by viewing its FK representation, which enjoys monotonicity). Thus
$$|\mathbb{E}_\pi[h]| \le \|B\|_\infty \sum_{i_1,\ldots,i_p} \mathbb{E}_\pi[\sigma_{i_1}\cdots\sigma_{i_p}],$$
and taking $M_p = (\|B\|_\infty)^{1/p}$, we see that $|\mathbb{E}_\pi[h]| \le \mathbb{E}_\pi[\,|\sum_i M_p\sigma_i|^p\,]$. However, $\sum_i M_p\sigma_i$ is clearly an $M_p$-Lipschitz function and, by the spin-flip symmetry of the Ising system, has mean $0$, so by Proposition 2.2 there exists $\kappa(\alpha) > 0$ such that
$$\mathbb{P}_\pi\Big(\Big|\sum_i M_p\sigma_i\Big| \ge r\Big) \le 2\exp\Big(-\frac{\kappa\, r^2}{N M_p^2}\Big),$$
and therefore, by integrating, $\mathbb{E}_\pi[\,|\sum_i M_p\sigma_i|^p\,] \le C(p,\alpha) M_p^p N^{p/2} = C(p,\alpha)\|B\|_\infty N^{p/2}$. Now suppose the couplings $\{J_{ij}\}$ are not all non-negative; using the FK representation of Ising spin systems with general couplings (not necessarily ferromagnetic) — see, e.g., [5, §11.5], and in particular Proposition 259 and Eq. (11.44) — for every $i_1,\ldots,i_p$,
$$\big|\mathbb{E}_\pi[\sigma_{i_1}\cdots\sigma_{i_p}]\big| \le \mathbb{E}_{\tilde\pi}[\sigma_{i_1}\cdots\sigma_{i_p}], \tag{3.1}$$
where $\tilde\pi$ denotes the ferromagnetic Ising model with couplings $\{|J_{ij}|\}$. Then, proceeding as before, we see that the same estimate holds with $\pi$ replaced by $\tilde\pi$. Since $\tilde\pi$ is contracting, we can apply Proposition 2.2 as before to obtain the bound for the same constant $C(p,\alpha) > 0$.

Proof of (1.2). Fix $d$ and recall the variational formula for the spectral gap, (2.3). Following (2.5), consider $\gamma$ defined in (2.4) with $\|A\|_\infty \le K$, and w.l.o.g. (since $\sigma_i^2 = 1$, every polynomial can be rewritten as a sum of monomials) assume that $a_{i_1,\ldots,i_d} = 0$ if $i_k = i_j$ for some $j \ne k$. Then, for every $\ell$ and every $\sigma$, the discrete gradient $\nabla_\ell f$ is a polynomial of degree $d-1$, so that $g_\ell(\sigma) := (\nabla_\ell f)^2(\sigma)$ is a degree-$2(d-1)$ polynomial in $\sigma$ with coefficients bounded above by $4^{2(d-1)}(d-1)K^2$. By Lemma 3.1, there exists $C(\alpha,d) > 0$ such that $\mathbb{E}_\pi[g_\ell] \le C(\alpha,d) K^2 N^{d-1}$ for every $\ell$, so that using (2.3), (2.5), and the fact that $\mathrm{gap} \ge \alpha/N$, the bound (1.2) follows for some new $C(\alpha,d) > 0$.

Proof of (1.3). Observe that since we are on the hypercube $\Omega_N$, $\sigma_i^k = \sigma_i^{k \bmod 2}$, so that every polynomial function $f$ of degree $d$ can be rewritten as a sum of monomials of degree at most $d$.
The concentration of the lower-degree monomials can be absorbed into a constant multiple of the prefactor in (1.3) of Theorem 1. Moreover, by rescaling it suffices to prove the theorem for the case $K = 1$. Hence, we proceed to prove the following concentration inequality for monomials: consider a $(1-\frac\alpha N)$-contracting Ising model $\pi$; for every $d$, if $f$ is a monomial of degree $d$, i.e., $f(\sigma) = \sum_{i_1,\ldots,i_d} a_{i_1,\ldots,i_d}\sigma_{i_1}\cdots\sigma_{i_d}$ for a $d$-tensor $A$ with $\|A\|_\infty \le 1$ and $a_{i_1\ldots i_d} = 0$ if $i_j = i_k$ for some $j \ne k$, then there exists $C(\alpha,d) > 0$ such that (3.2) holds for every $r > 0$ and every $N$. Since we consider $d$ fixed, all constants throughout this section may depend on $d$.

We prove (3.2) by induction on $d$; the base case $d = 1$ is given by Proposition 2.2. Now assume that Eq. (3.2) holds for every $p \le d-1$, and let us show it holds for $d$. Fix $1 \le \ell \le N$ and let $\sigma^\ell$ be the configuration that differs from $\sigma$ only in coordinate $\ell$. For every $\sigma$, we can compute the gradient $N^{-d/2}(\nabla_\ell f)(\sigma)$, and accordingly define a set $S_b$ of configurations on which these gradients are suitably bounded. Now let $(X_t)$ be the single spin-flip Markov chain, which we assumed to be $(1-\frac\alpha N)$-contracting, with stationary distribution $\pi$, and, for each $\eta$, bound the quantity of interest by two terms, $\Phi_1$ and $\Phi_2$. In order to bound $\Phi_1$ we will need the following result of Luczak [10]:

Proposition 3.2 ([10, Eq. (4.14)]). Suppose $(Y_t)$ is a $\theta$-contracting Markov chain on $\Omega$ with stationary distribution $\pi$, and suppose further that $g : \Omega \to \mathbb{R}$ is a $b$-Lipschitz function. Then for every $Y_0 \in \Omega$,
$$\big|\mathbb{E}[g(Y_t)] - \mathbb{E}_\pi[g]\big| \le b\,\theta^t \operatorname{diam}(\Omega).$$

By Proposition 3.2 with the choice of $\theta = 1-\frac\alpha N$, there exists $\kappa(\alpha) > 0$ such that the corresponding bound holds for every $\eta \in S_b$ and every $t$. Second, the fact that $f$ and $\tilde f_\eta$ coincide on $S_{\eta,b}$ implies that the error term is controlled by the probability of exiting $S_{\eta,b}$, where the last equality crucially used that $(X_t)$ is a single-site dynamics (whence, starting from $\eta$, exiting $S_{\eta,b}$ and exiting $S_b$ are equivalent). By the definition of $\tilde f_\eta$, we have that $\|\tilde f_\eta\|_\infty \le \|f\|_\infty + N\operatorname{Lip}(f{\restriction}_{S_{\eta,b}})$. Finally, if we take $t \ge t_0$ for a suitable $t_0$, then for all such $t$ and every $\eta \in \Omega_N$ we have $\Psi_2 = 0$.
Because a Markov chain that is $\theta$-contracting with $\theta = 1-\frac\alpha N$ has $t_{\mathrm{mix}} \lesssim N\log N$ (e.g., [9]), by sub-multiplicativity of the total-variation distance to stationarity this holds for $t_0 \asymp N\log^2(N)$.
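The $t_{\mathrm{mix}} \lesssim N\log N$ claim follows from contraction by a standard calculation, sketched below (our own summary, with $W_1$ the Wasserstein distance with respect to Hamming distance, under which distinct configurations are at distance at least $1$, so total variation is dominated by $W_1$):

```latex
% Contraction at rate theta = 1 - alpha/N decays W_1 geometrically:
W_1\big(\delta_\eta P^t, \pi\big)
  \le \theta^t \operatorname{diam}(\Omega_N)
  = \Big(1 - \frac{\alpha}{N}\Big)^t N
  \le N e^{-\alpha t / N},
% which is at most 1/4 once t >= (N/alpha) log(4N), giving
% t_mix(1/4) = O(N log N); sub-multiplicativity then yields
% d_TV(t_0) <= N^{-c} for t_0 of order N log^2(N).
```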
Combining (3.5)–(3.8), we see that the desired bound holds for all $\eta \in S_b$ and $t \ge t_0$. If we now average both sides over $\eta \sim \pi$ and set $t = t_0$, we obtain the corresponding bound in expectation, where we used the stationarity of the Markov chain and a union bound over all times up to $t_0$, together with Markov's inequality and the identity $\mathbb{E}_\pi[\mathbb{P}_\eta(\tau_{S_b^c} \le t)] = \mathbb{P}_\pi(\tau_{S_b^c} \le t)$.

It remains to bound the probability $\mathbb{P}_\pi(S_b^c)$. For every $1 \le \ell \le N$ and $1 \le j \le d$, let $g_{\ell,j}$ denote the corresponding monomial of $\nabla_\ell f$; by the inductive hypothesis there exists $C'(\alpha,d) > 0$ such that, uniformly over $\ell, j$, the monomial $g_{\ell,j}$ satisfies (3.2). To upper bound $\mathbb{P}_\pi(S_b^c)$, by (3.4) it suffices to show that $|\mathbb{E}_\pi[g_{\ell,j}]|$ is at most $bN^{(d-1)/2}/2$ and then union bound over $\ell, j$. Since for each $\ell, j$ the function $g_{\ell,j}$ is a degree-$(d-1)$ polynomial of the form of $h(\sigma)$ in Lemma 3.1, there exists $C(\alpha,d) > 0$ such that $|\mathbb{E}_\pi[g_{\ell,j}]| \le C N^{(d-1)/2}$. Therefore, for all $b \ge 2C$, by a union bound over $1 \le \ell \le N$ and $1 \le j \le d$, we obtain (3.10). Plugging (3.10) into (3.9), by stationarity of $\pi$ and $t_0 \asymp dN\log^2(N)$, we obtain a tail bound which, with the appropriate choice of $b$, implies the desired (3.2) for some different $C(\alpha,d) > 0$ for all $r > 0$.
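The two ingredients of Lemma 3.1 — nonnegativity of correlations in a ferromagnet, and the comparison (3.1) against the model with couplings $\{|J_{ij}|\}$ — can be sanity-checked by exact enumeration for small $N$ (our own illustration; the couplings below are arbitrary):

```python
import itertools
import math

def moment(J, indices):
    """E_pi[prod of sigma_i over the given indices] for the Ising measure
    with couplings J, computed by brute-force enumeration (small N only)."""
    N = len(J)
    num = den = 0.0
    for s in itertools.product([-1, 1], repeat=N):
        w = math.exp(sum(J[i][j] * s[i] * s[j]
                         for i in range(N) for j in range(N)))
        p = 1
        for i in indices:
            p *= s[i]
        num += p * w
        den += w
    return num / den

# Mixed-sign couplings and their ferromagnetic (absolute-value) counterpart.
J = [[0.0, 0.25, -0.15, 0.0],
     [0.0, 0.0, 0.2, -0.1],
     [0.0, 0.0, 0.0, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
J_abs = [[abs(x) for x in row] for row in J]
```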

An application to testing Ising models
In [3], independence testing of Ising models was extensively studied. Namely, suppose one is given $k$ samples of $N$ bits each, either from a product measure $I$ or from an Ising measure $\nu$ satisfying (1.1) whose symmetrized Kullback–Leibler distance to $I$ is at least $\varepsilon$. The goal is to decide, with high probability and using a minimal number of samples, which distribution the samples came from. Our variance bound in Theorem 1 allows us to use a fourth-order statistic to improve on the results of [3] in the high-temperature regime of (1.1), including obtaining the sharp result in the case of ferromagnetic Ising models.
Consider an Ising model with couplings $J_{ij}$, and for every $i \sim j$ denote by $\lambda^\pi_{ij}$ the edge magnetization of the edge $(i,j)$, which in the absence of an external field equals $\mathbb{E}_\pi[\sigma_i\sigma_j]$. We will be concerned with Ising models satisfying (1.1), and therefore in their high-temperature Dobrushin regime.
The Ising model has the special property that for two Ising models $\pi$ and $\nu$ on $N$ vertices, with couplings $\{J^\pi_{ij}\}$ and $\{J^\nu_{ij}\}$ and edge magnetizations $\lambda^\pi_{ij}$ and $\lambda^\nu_{ij}$, the symmetrized Kullback–Leibler divergence $d_{\mathrm{SKL}}(\pi,\nu)$ is given by
$$d_{\mathrm{SKL}}(\pi,\nu) = \sum_{i,j} \big(J^\pi_{ij} - J^\nu_{ij}\big)\big(\lambda^\pi_{ij} - \lambda^\nu_{ij}\big).$$
Let $I$ be the product measure on $N$ independent, symmetric $\pm 1$ random variables; that is, $J^I_{ij} = \lambda^I_{ij} = 0$ for all $i,j$, so that $d_{\mathrm{SKL}}(\pi, I) = \sum_{i,j} J^\pi_{ij}\lambda^\pi_{ij}$. Finally, for an Ising model $\pi$, let $m$ denote the number of edges, i.e., the number of non-zero $J^\pi_{ij}$.
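The identity for $d_{\mathrm{SKL}}$ follows directly from the form of the Gibbs measure, since the partition functions cancel in the symmetrization:

```latex
% log(pi/nu)(sigma) = sum_{i,j} (J^pi_{ij} - J^nu_{ij}) sigma_i sigma_j - log(Z_pi/Z_nu),
% and the constant log(Z_pi/Z_nu) cancels between the two expectations:
d_{\mathrm{SKL}}(\pi,\nu)
  = \mathbb{E}_\pi\!\Big[\log\frac{\pi}{\nu}\Big] + \mathbb{E}_\nu\!\Big[\log\frac{\nu}{\pi}\Big]
  = \sum_{i,j}\big(J^\pi_{ij}-J^\nu_{ij}\big)
      \big(\mathbb{E}_\pi[\sigma_i\sigma_j]-\mathbb{E}_\nu[\sigma_i\sigma_j]\big)
  = \sum_{i,j}\big(J^\pi_{ij}-J^\nu_{ij}\big)\big(\lambda^\pi_{ij}-\lambda^\nu_{ij}\big).
```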
Theorem 4.1. There exists a polynomial-time algorithm that uses $O(N/\varepsilon)$ samples from a ferromagnetic Ising model $\pi$ on $N$ vertices satisfying (1.1), and distinguishes with probability better than $\frac34$ whether $\pi = I$ or $d_{\mathrm{SKL}}(\pi,I) \ge \varepsilon$. In the specific case where the edge set $\{(i,j) : J^\pi_{ij} \ne 0\}$ is known, this is improved to $O(\sqrt m/\varepsilon)$ samples.

The algorithms we use take $k$ i.i.d. samples $(\sigma^{(\ell)})_{\ell \le k}$ from $\pi$ and compute the test statistic $Z_k$, where in the case where we do know the edge set of the underlying graph a priori, we sum only over $i \sim j$. Let $\mathbb{P}$ be the measure given by $\bigotimes_{\ell=1}^k \pi$. Observe first that $\mathbb{E}[Z_k] = \sum_{i,j} (\lambda^\pi_{ij})^2$. For every fixed $k$, we can view $(\sigma^{(\ell)}_i)_{1\le i\le N,\,1\le \ell\le k}$ as an Ising model on $kN$ vertices that satisfies (1.1), since it corresponds to $k$ independent copies of an Ising model, each satisfying (1.1). Therefore, by Theorem 1, specifically (1.2), we have $\mathrm{Var}(Z_k) \le CN^2/k^2$.
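The display defining $Z_k$ is (4.1); without reproducing its exact normalization, a natural fourth-order U-statistic with mean $\sum_{i\neq j}(\lambda^\pi_{ij})^2$ can be sketched as follows (our own illustration, not necessarily the precise form of (4.1)):

```python
import itertools

def fourth_order_statistic(samples):
    """Average of <s, s'>^2 over ordered pairs of distinct samples, minus N:
    an unbiased estimator of sum_{i != j} E[sigma_i sigma_j]^2 for i.i.d.
    +-1-valued samples (the diagonal terms contribute exactly N)."""
    k = len(samples)
    N = len(samples[0])
    total = 0.0
    for l1, l2 in itertools.permutations(range(k), 2):
        dot = sum(a * b for a, b in zip(samples[l1], samples[l2]))
        total += dot ** 2
    return total / (k * (k - 1)) - N
```

For perfectly correlated spins ($\lambda_{ij} = 1$ for all pairs) the statistic equals $N(N-1)$, matching $\sum_{i\neq j}\lambda_{ij}^2$.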
In the specific case where the underlying graph of the Ising model is known a priori, we have the following.

Proof. Again view $(\sigma^{(\ell)}_i)_{i,\ell}$ as an Ising model on $kN$ vertices with measure $\pi^k = \bigotimes_{\ell=1}^k \pi$. Recall that since $\{J^\pi_{ij}\}$ satisfy (1.1) for $\alpha > 0$, the Ising model is $(1-\alpha/N)$-contracting. Since the spectral gap tensorizes and $\pi$ is $(1-\alpha/N)$-contracting, $\pi^k$ also has spectral gap satisfying $\mathrm{gap}^{-1} \le N/\alpha$. Using the variational form of the spectral gap as before, by (2.4)–(2.5) we can bound $\mathrm{Var}(Z_k)$ in terms of its discrete gradients, computing $(\nabla_{i,\ell} Z_k)^2(\sigma)$ for fixed $(i,\ell) = (i_\star,\ell_\star)$ and every $\sigma$, and expanding. Since the model is ferromagnetic, $\lambda^\pi_{ij} \ge \tanh(J^\pi_{ij}) \ge J^\pi_{ij}/2$. As a result, applying Chebyshev's inequality to $\mathbb{P}(Z_k \le \varepsilon/4)$, we see that the stated number of samples $k$ suffices to identify, in this case, that $d_{\mathrm{SKL}}(\pi,I) \ge \varepsilon$ with probability at least $\frac9{10}$. A union bound over the two cases $\pi = I$ and $\pi$ such that $d_{\mathrm{SKL}}(\pi,I) \ge \varepsilon$ concludes the proof.
Proof of Theorem 4.2. The algorithm again computes the test statistic $Z_k$ defined in (4.1), and now outputs that $\pi = I$ if $Z_k \le \varepsilon^2/(2N)$, and outputs that $d_{\mathrm{SKL}}(\pi,I) \ge \varepsilon$ otherwise.
First, consider the situation $\pi = I$; by reasoning similar to the proof of Theorem 4.1, after $k \ge CN^2/\varepsilon^2$ samples (when we know the underlying graph, $k \ge C'N\sqrt m/\varepsilon$), with probability at least $\frac9{10}$ the algorithm outputs that $\pi = I$. Now suppose that $\pi$ is such that $d_{\mathrm{SKL}}(\pi,I) \ge \varepsilon$; we wish to lower bound $\mathbb{E}[Z_k]$. By the Cauchy–Schwarz inequality,
$$\mathbb{E}[Z_k] = \sum_{i,j}(\lambda^\pi_{ij})^2 \ge \frac{\big(\sum_{i,j} J^\pi_{ij}\lambda^\pi_{ij}\big)^2}{\sum_{i,j}(J^\pi_{ij})^2} = \frac{d_{\mathrm{SKL}}(\pi,I)^2}{\sum_{i,j}(J^\pi_{ij})^2}.$$
When (1.1) holds, we know that for every $i$ and some $\alpha > 0$ we have $\sum_{j:j\sim i}|J^\pi_{ij}| \le 1-\alpha$; therefore $\sum_{i,j}(J^\pi_{ij})^2 \le N$, whence $\mathbb{E}[Z_k] \ge \varepsilon^2/N$. We can then use Chebyshev's inequality to bound
$$\mathbb{P}\big(Z_k \le \varepsilon^2/(2N)\big) \le \mathbb{P}\big(|Z_k - \mathbb{E}[Z_k]| \ge \varepsilon^2/(2N)\big) \le \frac{4N^2\,\mathrm{Var}(Z_k)}{\varepsilon^4}$$
via the aforementioned bounds on $\mathrm{Var}(Z_k)$. Plugging in those bounds implies that the required number of samples $k$ suffices to identify, in this case, that $d_{\mathrm{SKL}}(\pi,I) \ge \varepsilon$ with probability at least $\frac9{10}$, at which point a union bound concludes the proof.