Decomposition of mean-field Gibbs distributions into product measures

We show that under a low complexity condition on the gradient of a Hamiltonian, Gibbs distributions on the Boolean hypercube are approximate mixtures of product measures whose probability vectors are critical points of an associated mean-field functional. This extends previous work by the first author. As an application, we demonstrate how this framework helps characterize both Ising models satisfying a mean-field condition and the conditional distributions which arise in the emerging theory of nonlinear large deviations, both in the dense case and in the polynomially-sparse case.

• As an example of this decomposition, we demonstrate in Theorem 16 that the conditional distribution Pr [Y = y | f (Y) ≥ tn] arising in large deviation theory can be approximated by a smoothed-cutoff distribution that can be decomposed into product measures, each satisfying an equation which arises from the Lagrange multiplier problem associated with the rate function.
In the sequel work [5], we apply Theorem 9 to exponential random graphs, improving a previously known characterization.

Background and notation
We denote the Boolean hypercube by C_n = {−1, 1}^n; its continuous counterpart is the solid hypercube [−1, 1]^n. The uniform measure on C_n is denoted by µ. The space of all product measures on C_n is denoted PM_n. For a vector x ∈ R^n, we denote its one-norm by ‖x‖_1 = Σ_{i=1}^n |x_i|.

The Ising model
An Ising model on n sites can be described as follows: Let x ∈ C_n represent n interacting sites that can each be in one of two states. Let A ∈ R^{n×n} be a real symmetric matrix with 0 on the diagonal representing the intensity of interaction between the sites, so that the interaction between site i and site j is A_ij. Let µ ∈ R^n be a vector representing magnetic field strengths, so that site i feels a magnetic field µ_i. The Hamiltonian for the system is then defined as f(x) = ⟨x, Ax⟩ + ⟨µ, x⟩.
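As a quick numerical illustration (not part of the formal development), the Hamiltonian f(x) = ⟨x, Ax⟩ + ⟨µ, x⟩ can be evaluated by brute force on a small system; the matrix A and field µ below are toy values chosen only for the example:

```python
import itertools

def ising_hamiltonian(x, A, mu):
    """Evaluate f(x) = <x, Ax> + <mu, x> for a spin configuration x in {-1,1}^n."""
    n = len(x)
    quad = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    field = sum(mu[i] * x[i] for i in range(n))
    return quad + field

# Toy ferromagnetic model on 3 sites: symmetric interactions, zero diagonal.
A = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
mu = [0.1, 0.0, -0.1]

# Enumerate the energy of every configuration on the hypercube.
energies = {x: ising_hamiltonian(x, A, mu) for x in itertools.product([-1, 1], repeat=3)}
```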
If Tr(A²) = o(n), we say that the model satisfies the mean-field assumption [1]. We also assume that both µ_max = max_{i∈[n]} |µ_i| and max_{i∈[n]} Σ_{j∈[n]} |A_ij| are O(1), which amounts to the force acting on a single site being bounded.

Nonlinear large deviations
Let f : C_n → R be a Hamiltonian. For 0 ≤ p ≤ 1, define µ_p to be the measure on C_n under which the entries are i.i.d. Bernoulli random variables with success probability p. Let t ∈ R be a real number. The two central questions in the field of large deviation theory are: (1) What is the probability Pr [f(Y) ≥ tn] for Y ∼ µ_p? (2) What is the conditional distribution Pr [Y = y | f(Y) ≥ tn]? One line of approach to answering these questions is to approximate Pr [f(Y) ≥ tn] and Pr [Y = y | f(Y) ≥ tn] by using Gibbs distributions. For example, observe that the conditional distribution Pr [Y = y | f(Y) ≥ tn] may be obtained from a Gibbs distribution with a "cutoff Hamiltonian" f̃, defined by f̃(y) = 0 if f(y) ≥ tn and f̃(y) = −∞ otherwise. All y with f(y) ≥ tn are thus weighted according to µ_p, and all y with f(y) < tn have probability 0. Unfortunately, f̃ is not smooth enough to be applicable to the existing large deviation frameworks. However, it is possible to obtain approximations of X_{f̃_n} by using Hamiltonians which approximate f̃. Such a "smooth-cutoff" Hamiltonian should give a large mass to "good" vectors y such that f(y) ≥ tn and a small mass to "bad" vectors y such that f(y) < tn. Both [4] and [2] follow this approach in order to tackle item (1).
With this we define both the discrete gradient ∇f(x) = (∂_1 f(x), …, ∂_n f(x)), where ∂_i f(x) = (f(x^{i→1}) − f(x^{i→−1}))/2, and the Lipschitz constant of f, Lip(f) = max_{x∈C_n} max_{i∈[n]} |∂_i f(x)|.

Every Boolean function f : C_n → R has a unique Fourier decomposition into monomials [7]: f(x) = Σ_{S⊆[n]} f̂(S) Π_{i∈S} x_i. This defines an extension of f from the discrete hypercube C_n to the continuous hypercube [−1, 1]^n, by evaluating the polynomial Σ_{S⊆[n]} f̂(S) Π_{i∈S} x_i at x ∈ [−1, 1]^n. It can be shown that this is the same extension as the harmonic extension defined in [4, Section 3.1.1]. By Fact 14 in [4], the extension of ∂_i f agrees with the i-th partial derivative (in the real-differentiable sense) of the extension of f. Throughout this text, we will always assume that f, and therefore ∇f as well, are extended to [−1, 1]^n.

Definition 2 (Gaussian width, gradient complexity). The Gaussian width of a set K ⊆ R^n is defined as GW(K) = E sup_{x∈K} ⟨x, Γ⟩, where Γ is a standard Gaussian random vector in R^n. For a function f : C_n → R, the gradient complexity of f is defined as D(f) = GW({∇f(x) : x ∈ C_n}). For a measure ν on C_n, by slight abuse of notation, we define its complexity as D(ν) = D(f_ν), where f_ν is the Hamiltonian of ν.
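For small n the Fourier coefficients and the multilinear (harmonic) extension can be computed by brute-force enumeration directly from the definitions above; the test function below is illustrative:

```python
import itertools
import math

def fourier_coefficients(f, n):
    """f-hat(S) = E_x[f(x) * prod_{i in S} x_i], with x uniform on {-1,1}^n."""
    cube = list(itertools.product([-1, 1], repeat=n))
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(n), k) for k in range(n + 1))
    return {S: sum(f(x) * math.prod(x[i] for i in S) for x in cube) / len(cube)
            for S in subsets}

def harmonic_extension(coeffs, z):
    """Evaluate the multilinear extension sum_S f-hat(S) prod_{i in S} z_i at z in [-1,1]^n."""
    return sum(c * math.prod(z[i] for i in S) for S, c in coeffs.items())

# Illustrative Boolean function on 3 coordinates.
f = lambda x: x[0] * x[1] + 0.5 * x[2]
coeffs = fourier_coefficients(f, 3)
```

On the discrete hypercube the extension agrees with f itself, as required of a harmonic extension.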

Mixture models
Definition 3 (ρ-mixtures). For z ∈ [−1, 1]^n, denote by X(z) the unique random vector in C_n whose coordinates are independent and whose expectation is EX(z) = z. Let ρ be a measure on [−1, 1]^n. We define the random vector X(ρ) by first sampling z from ρ and then setting X(ρ) = X(z).

Definition 4 (Approximate mixture decomposition). Let δ > 0 and let ρ be a measure on [−1, 1]^n. A random variable X is called a (ρ, δ)-mixture if there exists a coupling between X(ρ) and X such that E‖X − X(ρ)‖_1 ≤ δn.

A result of [4] roughly states that low complexity Gibbs distributions are (ρ, δ)-mixtures for δ = o(1), where ρ is such that most of the entropy comes from the individual X(z) rather than from the mixture.
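The two-stage sampling in Definition 3 can be simulated directly: first draw z from the mixing measure ρ, then draw independent ±1 coordinates with means z_i. The sketch below uses a finite, uniformly-weighted list of probability vectors as a stand-in for ρ:

```python
import random

def sample_X(z, rng):
    """One draw of X(z): independent ±1 coordinates with E X_i = z_i."""
    return [1 if rng.random() < (1 + zi) / 2 else -1 for zi in z]

def sample_mixture(rho, rng):
    """One draw of X(rho): z is drawn uniformly from the finite list rho, then X(z)."""
    z = rng.choice(rho)
    return sample_X(z, rng)

rng = random.Random(0)
z = [0.5, -0.2, 0.0]
samples = [sample_X(z, rng) for _ in range(20000)]
emp_mean = [sum(s[i] for s in samples) / len(samples) for i in range(3)]
```

The empirical mean of repeated draws of X(z) recovers z, confirming EX(z) = z.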
Definition 5 (Wasserstein distance). For two distributions ν_1 and ν_2 on C_n, the Wasserstein mass-transportation distance, denoted W_1, is defined as W_1(ν_1, ν_2) = inf E‖X − Y‖_1, where the infimum is taken over all joint distributions (X, Y) whose marginals have the laws ν_1 and ν_2 respectively.
Definition 6 (Tilt of a distribution). For a vector θ ∈ R^n, the tilt τ_θν of the distribution ν is the distribution defined by dτ_θν(z) = e^{⟨θ,z⟩} dν(z) / ∫_{C_n} e^{⟨θ,z⟩} dν.
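On a small hypercube the tilt can be computed exactly by enumeration. As a sanity check, tilting the uniform measure by θ yields the product measure with mean tanh(θ_i) in each coordinate; the helper below is illustrative:

```python
import itertools
import math

def tilt_mean(nu_weights, cube, theta):
    """Coordinate-wise mean of the tilted measure (tau_theta nu)(z) ∝ e^{<theta,z>} nu(z)."""
    w = [nu_weights[k] * math.exp(sum(t * zi for t, zi in zip(theta, z)))
         for k, z in enumerate(cube)]
    Z = sum(w)  # normalizing constant of the tilt
    n = len(theta)
    return [sum(w[k] * cube[k][i] for k in range(len(cube))) / Z for i in range(n)]

n = 3
cube = list(itertools.product([-1, 1], repeat=n))
uniform = [1 / len(cube)] * len(cube)
theta = [0.7, -0.3, 0.0]
m = tilt_mean(uniform, cube, theta)
```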
With the notions of ρ-mixture and tilt at hand, we define what it means for a random variable to break up into small tilts:

Definition 7 (Tilt decomposition). Let δ, ε > 0 and let ρ be a measure on [−1, 1]^n. A random variable X with distribution ν is called a (ρ, δ, ε)-tilt-mixture if there exists a probability measure m on R^n, supported on B(0, ε√n) ∩ [−1/4, 1/4]^n, such that:
1. For every test function ϕ : C_n → R, ∫_{C_n} ϕ dν = ∫ (∫_{C_n} ϕ dτ_θν) dm(θ).
2. For all but a δ-portion of the measure m, the tilt τ_θν is δn-close to a product measure in Wasserstein distance: m({θ : ∃ξ ∈ PM_n with W_1(τ_θν, ξ) ≤ δn}) ≥ 1 − δ.
3. The measure ρ is the push-forward of the measure m under the map θ ↦ E_{X∼τ_θν}[X].
Observation 8. Every (ρ, δ, ε)-tilt-mixture is also a (ρ, 4δ)-mixture.

Proof. Define Θ = {θ ∈ R^n : ∃ξ ∈ PM_n s.t. W_1(τ_θν, ξ) ≤ δn}, and denote the distributions of X and of X(ρ) by ν and σ respectively. Using item 1 in the definition of a tilt-mixture, we may decompose the transport cost between ν and σ according to the measure m. By item 2 in the definition of a tilt-mixture, there exists a coupling between X and X(ρ) under which each term on the right hand side is bounded by δn. This gives a bound of 4δn on the expectation E‖X − X(ρ)‖_1.

A tilt-mixture decomposition provides more information than a general ρ-mixture: it tells us something about the structure of the elements of the mixture, with the parameter ε in Definition 7 confining the support of the tilts to a ball of radius ε√n. Some of our results will rely on the existence of tilt decompositions with small ε.

Results
Our main technical contribution is a characterization of the measure ρ described above: With high probability with respect to ρ, the vector z in equation (2) is nearly a critical point of a certain functional associated with f .
Theorem 9 (Main Structural Theorem). Let n > 0, let f : C_n → R be a function, and denote D = D(f), L_1 = Lip(f) and L_2 = Lip(∇f). Denote by X_f the set of vectors X ∈ [−1, 1]^n which approximately satisfy X = tanh(∇f(X)), in the sense that ‖X − tanh(∇f(X))‖_1 is suitably small, where ∇f(X) is calculated by harmonically extending ∇f to [−1, 1]^n, and with the tanh applied entrywise to the entries of ∇f(X). Then X_{f_n} is a (ρ, 3D^{1/4}/n^{1/4}, 3D^{1/4}/n^{1/4})-tilt-mixture with ρ(X_f) ≥ 1 − 3D^{1/4}/n^{1/4}. In particular, if D = o(n), then X_{f_n} is a (ρ, o(1))-mixture with ρ(X_f) = 1 − o(1). In other words, almost all the mass of the mixture resides on random vectors X which almost satisfy the fixed point equation X = tanh(∇f(X)).

Remark 10. One can check that the solutions of the fixed point equation are exactly the critical points of the functional X ↦ f(X) + H(X), where H(X) = −Σ_{i=1}^n [((1+X_i)/2) log((1+X_i)/2) + ((1−X_i)/2) log((1−X_i)/2)] is the entropy of X. This is a variant of the functional that arises in the variational problem in [3].
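The fixed point equation X = tanh(∇f(X)) can be explored numerically by naive iteration, which converges whenever the map is a contraction (e.g. for weak interactions). The Ising-type gradient below is an illustrative example, not a case treated by any particular constant in the theorem:

```python
import math

def mean_field_iteration(grad, x0, steps=200):
    """Iterate x <- tanh(grad(x)); a limit point satisfies x = tanh(grad(x))."""
    x = list(x0)
    for _ in range(steps):
        x = [math.tanh(g) for g in grad(x)]
    return x

# Hypothetical weak Ising gradient: grad f(x) = A x + mu with small couplings.
A = [[0.0, 0.3], [0.3, 0.0]]
mu = [0.2, -0.1]
grad = lambda x: [sum(A[i][j] * x[j] for j in range(2)) + mu[i] for i in range(2)]

x = mean_field_iteration(grad, [0.0, 0.0])
residual = max(abs(x[i] - math.tanh(grad(x)[i])) for i in range(2))
```

Since the row sums of |A| are below 1, the iteration is a contraction and the residual of the fixed point equation vanishes.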
Remark 11. The following is an example application of Theorem 9 to Ising models, to be compared with the main result of [1].
Corollary 12 (Ising models). Let f be an Ising model Hamiltonian as described in Section 2.1.1, with interaction matrix A ∈ R^{n×n} and a magnetic moment vector µ ∈ R^n. Denote by X_f the corresponding set of approximate solutions of X = tanh(AX + µ). In particular, if L_1 = O(1) and Tr(A²) = o(n) (the "mean-field assumption"), then X_{f_n} is a (ρ, o(1))-mixture with ρ(X_f) = 1 − o(1).
The simplest example of an Ising model is the Curie-Weiss ferromagnet, for which we can use our framework as a toy example and rederive well-known properties about its distribution.
Corollary 13. Let β > 0 and let f : C_n → R be the Curie–Weiss Hamiltonian with inverse temperature β, given by the interaction matrix A = (β/n)·1, where 1 is the n × n matrix whose off-diagonal entries are 1 and whose diagonal is 0 (see Section 6.1). For a more detailed application of Theorem 9 to the case of exponential random graphs, see [5].
The following theorem finds sufficient conditions under which composing f with a real-valued function h produces a Hamiltonian with a ρ-mixture approximation:

Theorem 14 (Composition). Let h : R → R be a twice differentiable function with bounded first and second derivatives, and let f : C_n → R be a function with parameters D, L_1, and L_2 as described in Theorem 9. Denote by D̃, L̃_1, L̃_2 and L̃_3 the corresponding parameters of the composition h ∘ f. Then X_{(h∘f)_n} is an approximate ρ-mixture, where almost all the mass of ρ resides on vectors X which almost satisfy X = tanh(h′(f(X)) ∇f(X)), where ∇f(X) is calculated by harmonically extending ∇f to [−1, 1]^n, and with the tanh applied entrywise to the entries of ∇f(X).
Remark 15. Theorem 14 bounds the norm ‖X − tanh(h′(f(X)) ∇f(X))‖_1 rather than ‖X − tanh(∇(h ∘ f)(X))‖_1 (which is the analogue of the quantity arising in the main Theorem 9). This is a matter of practicality: for many known Hamiltonians f it is easy to compute ∇f and its extension to [−1, 1]^n, but it is not straightforward to compute ∇(h ∘ f)(X) and its extension for arbitrary h. In these cases, calculating h′(f(X)) ∇f(X) is a much simpler task. Further, as will be shown in Lemma 25, the two quantities h′(f(X)) ∇f(X) and ∇(h ∘ f)(X) are close to each other.
As an example application of Theorem 14, we show that the conditional distribution Pr [Y = y | f(Y) ≥ tn] described in item (2) in Section 2.1.2 can be approximated by a "smoothed-out" distribution, which gives equal mass to vectors y satisfying f(y) ≥ tn and no mass to vectors y satisfying f(y) < (t − δ)n. This "smoothed-out" distribution is obtained from a "smoothed-cutoff" approximation to the f̃ described in Section 2.1.2. Our framework can be applied to this "smoothed-cutoff" function, yielding an equation corresponding to the Lagrange multiplier problem associated with the rate function.
Theorem 16 (Large deviations). Let t > 0. Let f : C_n → R be a Hamiltonian with parameters D, L_1 and L_2 as described in Theorem 9, and assume that there exists z ∈ C_n such that f(z) ≥ tn. Let δ > 0. There exists a monotone function h : R → [0, 1] with h(x) = 1 for x ≥ tn and h(x) = 0 for x ≤ (t − δ)n, and a corresponding smoothed-cutoff density ϕ built from h as in Section 6.2, such that the following holds. Denote by σ the measure defined by dσ = ϕ dµ / ∫_{C_n} ϕ dµ, and let X_ϕ be a random variable whose law is σ. Denote by X_g the set of vectors X ∈ [−1, 1]^n which approximately satisfy X = tanh(λ∇f(X)) for some scalar λ ≥ 0. Then X_ϕ is a (ρ, 80D̃^{1/4}/n^{1/4} + 8·2^{−n})-mixture, with almost all the mass of ρ residing on X_g.

Note that the expression X − tanh(λ∇f(X)) in the definition of the set X_g is closely related to the rate function: consider the variational problem of optimizing the entropy H(Y) over random vectors Y in C_n whose entries are independent, subject to the constraint f(EY) ≥ tn. By monotonicity, the optimum is attained on the boundary of the constraint. Denoting EY = y and using the method of Lagrange multipliers, we obtain the equations ∇_y H(Y) = λ∇f(y) and f(y) = tn. Applying the fact that ∇_y H(Y) = tanh⁻¹(y) to the first equation gives exactly the equation X − tanh(λ∇f(X)) = 0.
Example 17 (Large deviations for triangle counts). Let N > 0 be an integer representing the number of vertices of a graph, and let n = (N choose 2) be the number of possible edges. We treat each vector v ∈ C_n as a simple graph, with v_e = 1 if and only if the edge e appears in the graph. This in turn gives an adjacency matrix (x_ij)_{i,j=1}^N with x_ij = 1 if and only if v_{ij} = 1. In this setting, let f be a triangle-counting function, f(x) = β Σ_{i<j<k} x_ij x_jk x_ki for some real β. It is shown in [4] that D(f) is O(n^{3/4}) and in [5] that L_1 and L_2 are bounded by 200|β|. Thus we can apply Theorem 16 to f, concluding that for a fixed t > 0 there exist some δ = o(1) and a smoothed cutoff function h with h(x) = 1 for x > tn and h(x) = 0 for x < (t − δ)n, such that the random graph G whose density is proportional to h(f(G)) decomposes into a mixture of product measures supported on approximate solutions of the fixed point equation X = tanh(λβX²). Here X ∈ [−1, 1]^n is treated as an N × N symmetric matrix with zeros on the diagonal, and we understand the expression X² as the usual matrix multiplication, with the diagonal set to zero as well. We conjecture that all of the points of the set X_g are close to the solutions obtained by Lubetzky and Zhao in [6].
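The triangle-counting Hamiltonian in the ±1 edge encoding can be evaluated directly for small N; the representation below (edge variables keyed by unordered vertex pairs) is one possible implementation choice, not the paper's:

```python
import itertools

def triangle_hamiltonian(x, N, beta):
    """f(x) = beta * sum over triples i<j<k of x_ij x_jk x_ik, edges encoded as ±1."""
    def e(i, j):
        return x[frozenset((i, j))]
    return beta * sum(e(i, j) * e(j, k) * e(i, k)
                      for i, j, k in itertools.combinations(range(N), 3))

N = 4
# All edges present (+1): the complete graph K4, which has 4 triangles.
all_plus = {frozenset(p): 1 for p in itertools.combinations(range(N), 2)}
# Flipping one edge to -1 flips the sign of the 2 triangles containing it.
one_flip = dict(all_plus)
one_flip[frozenset((0, 1))] = -1

val_plus = triangle_hamiltonian(all_plus, N, beta=1.0)
val_flip = triangle_hamiltonian(one_flip, N, beta=1.0)
```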
Our results extend to triangle counts on sparse graphs as well. In this case, the expected value of f is of order np³, which decays to 0 as p → 0. We should therefore take t to be proportional to p³ and δ to be o(p³). Since the bound on the vectors in X_g in Theorem 16 is polynomial in δ, we can consider large deviations for graphs whose edge probabilities are proportional to p ∼ n^{−c} for some constant c (for example, if we wish ε to be of order p, we can take p ∼ n^{−1/160}).
The rest of this paper is organized as follows. Theorem 9 is proved in Section 4, and Theorem 14 is proved in Section 5. Corollaries 12 and 13 are proved in Section 6.1, and Theorem 16 is proved in Section 6.2.

Notation and review
We use the notation from [4], and rely on the proofs therein. Here is a brief review of the required terms and bounds.
For a probability measure ν on C_n, we define f_ν = log(dν/dµ), so that the Gibbs distribution with Hamiltonian f_ν is exactly ν. For every distribution ν on the hypercube (exponential or otherwise), we define H(ν) = E_{X∼ν}[∇f_ν(X) ∇f_ν(X)^T] − E_{X∼ν}[∇f_ν(X)] E_{X∼ν}[∇f_ν(X)]^T, which should be thought of as the covariance matrix of the random variable ∇f_ν(X) with X ∼ ν. We will use the following three results from [4].
Proposition 18 (Proposition 17 in [4]). Let ν̄ be a probability distribution on C_n. Then there exists a product measure ξ = ξ(ν̄) such that W_1(ν̄, ξ) is bounded in terms of n and Tr(H(ν̄)). Moreover, one may take ξ to be the unique product measure whose center of mass lies at ∫_{C_n} tanh(∇f_ν̄(y)) dν̄(y), where the tanh is applied entrywise.
Proposition 19 (Proposition 18 together with Lemma 16 in [4]). Let ν be a probability measure on C_n, define f = log(dν/dµ) and D = D(f_ν), and let ε ∈ (0, 1/(4 log(4n/D))). Then there exists a measure m on B(0, ε√n) ∩ [−1/4, 1/4]^n such that ν admits the decomposition ∫_{C_n} ϕ dν = ∫ (∫_{C_n} ϕ dτ_θν) dm(θ) for every test function ϕ : C_n → R, and such that Tr(H(τ_θν)) is suitably small for most of the measure m.

Lemma 20 (Lemma 24 in [4]). Let θ ∈ R^n and let ν, ν̄ be probability measures on C_n. Then the Wasserstein distance between the tilts τ_θν and τ_θν̄ is controlled in terms of the distance between ν and ν̄.

We can now describe the general plan of our proof. Fix ε > 0, and let m be the measure obtained from Proposition 19. Denote by Θ ⊆ R^n the set of θ for which the trace bound of Proposition 19 holds. For every θ ∈ R^n, denote by ξ_θ the unique product measure with the same marginals as τ_θν, and by A(θ) the vector A(θ) = E_{X∼τ_θν}[X]. Denote by ρ the push-forward of the measure m under the map θ ↦ A(θ), and define X = {A(θ) : θ ∈ Θ}. In order to prove Theorem 9, all we have to do is show that for each θ ∈ Θ, the corresponding A(θ) is close in the one-norm to tanh(∇f(A(θ))); this will establish equation (7). In other words, we need the following proposition:

Proposition 21. Let θ ∈ Θ and let Y ∼ ξ_θ. Then for every ε > 0, EY is close in the one-norm to tanh(∇f(EY)).

Relying on the above, we can prove Theorem 9.
Proof of Theorem 9. Define the measure ρ and the set X as above, and set ε = (D/n)^{1/4}. Proposition 21 implies that X ⊆ X_f, and together with Proposition 19, by the choice of ε, this shows that X_{f_n} is a tilt-mixture with parameters of order D^{1/4}/n^{1/4}, satisfying equation (7). The rest of this section is devoted to proving Proposition 21.

Combining equations (18), (16) and (19) together with the triangle inequality finally gives the bound of Proposition 22.

Lemma 23. Let Z be an almost-surely bounded random variable, |Z| ≤ L with L ≥ 1. Then the difference between tanh(EZ) and E tanh(Z) is bounded in terms of L and E|Z − EZ|. The proof is postponed to the appendix.
Claim 24. Let ξ be a product measure on C_n, let Y ∼ ξ, and let f : C_n → R be a function on the hypercube. Then E[f(Y)] = f(EY) (20) and E[∇f(Y)] = ∇f(EY) (21).

Proof. The extension of f to [−1, 1]^n is defined by the Fourier decomposition f(x) = Σ_{S⊆[n]} f̂(S) Π_{i∈S} x_i. Thus, since ξ is a product measure, E[f(Y)] = Σ_{S⊆[n]} f̂(S) Π_{i∈S} E[Y_i] = f(EY). Equation (21) is then obtained by applying equation (20) to every component of ∇f.
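Claim 24 can be checked numerically by brute force: for a multilinear f and a product measure with mean z, the expectation E f(Y) coincides with the extension evaluated at z. The test function below is illustrative:

```python
import itertools
import math

def product_expectation(f, z):
    """E f(Y) for Y with independent ±1 coordinates and E Y_i = z_i (brute force)."""
    total = 0.0
    for y in itertools.product([-1, 1], repeat=len(z)):
        # Probability of configuration y under the product measure with mean z.
        w = math.prod((1 + zi) / 2 if yi == 1 else (1 - zi) / 2
                      for yi, zi in zip(y, z))
        total += w * f(y)
    return total

f = lambda y: y[0] * y[1] + 2.0 * y[2]   # multilinear in each coordinate
F = lambda z: z[0] * z[1] + 2.0 * z[2]   # its harmonic (multilinear) extension
z = [0.3, -0.5, 0.8]
```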
Proof of Proposition 21. By the triangle inequality, Proposition 22 gives a bound on the second term on the right hand side. For the first term, note that by equation (4), for every index j ∈ [n], |∇f(Y)_j| ≤ L_1.
We can therefore invoke Lemma 23 on every index, giving that For this last term, we again use the triangle inequality and equation (16), giving

Proof of composition theorem
We will use two lemmas concerning the relation between f and h ∘ f. The first is a discrete chain rule which will be central to our calculations:

Lemma 25 (Chain rule for discrete gradient). Let f : C_n → R with Lip(f) = L and let h : R → R with |h″(x)| < B for all x. Then ∇(h ∘ f) is close to h′(f(·)) ∇f(·), with an error controlled by B and L, both on the discrete hypercube and on [−1, 1]^n.

The second lemma concerns the parameters of the function h ∘ f:

Lemma 26 (Composition parameters). Let h : R → R be a twice differentiable function whose first and second derivatives are uniformly bounded on R. Let f : C_n → R be a function with parameters D, L_1, L_2 as described in Theorem 9. Then the corresponding parameters of h ∘ f are bounded accordingly.

The proofs of both lemmas are postponed to the appendix.
Proof of Theorem 14. Denote by X_{h∘f} the set of approximate solutions of the fixed point equation of Theorem 9 for the Hamiltonian h ∘ f. Note that by equation (26), every X ∈ X_{h∘f} almost satisfies X = tanh(h′(f(X)) ∇f(X)) as well.

Remark 27. The bound for compositions h ∘ f with domain [−1, 1]^n, given in (26), is worse by a factor of √n than that of compositions with domain C_n, given in (24). This disparity is in fact tight: choosing a function h with bounded second derivative satisfying h′(0) = 0, together with an f for which f(x) = 0 at some point x, gives h′(f(x)) ∇f(x) = 0, while a calculation shows that ∇(h ∘ f)(x) can be of order √n larger than the discrete-domain bound.

Example applications

The Ising model
Proof of Corollary 12. A short calculation shows that ∇f(x) = Ax + µ. The corollary will follow immediately from Theorem 9 once we have obtained the parameters D, L_1 and L_2 for f. The calculations for D(f) and Lip(f) are also found in [4, Section 1.3], but we repeat them here for completeness. Denote µ_max = max_{i∈[n]} |µ_i|. We then have:
1. The Gaussian width is bounded by D(f) = GW({Ax + µ : x ∈ C_n}) ≤ √(n Tr(A²)).

2. The Lipschitz constant is bounded by Lip(f) ≤ µ_max + max_{i∈[n]} Σ_{j∈[n]} |A_ij|.
3. Regarding the Lipschitz constant of the gradient, note that ‖∇f(x) − ∇f(y)‖_1 = ‖A(x − y)‖_1. Suppose that x and y differ only in the i-th coordinate. Then A(x − y) is just ±2 times the i-th column of A. By the triangle inequality, we then have L_2 ≤ 2 max_{i∈[n]} Σ_{j∈[n]} |A_ij|.
Proof of Corollary 13. The interactions described in Corollary 13 can be represented by the interaction matrix A = (β/n)·1, where 1 is the n × n matrix whose off-diagonal entries are 1 and whose diagonal is 0, and β is interpreted as the inverse temperature. A simple calculation shows that D ≤ β√n and L_1, L_2 ≤ 1 + β. Denoting X = {X ∈ [−1, 1]^n : ‖X − tanh((β/n)·1·X)‖_1 ≤ 5000(1 + β)² n^{7/8}}, by Corollary 12 we have that X_{f_n} is a (ρ, 3n^{−1/8}, 3n^{−1/8})-tilt-mixture with ρ(X) ≥ 1 − 3n^{−1/8}. Denote by J = 1 + Id the n × n matrix whose every entry is 1. Then every X ∈ X also satisfies ‖X − tanh((β/n)·J·X)‖_1 ≤ 5000(1 + β)² n^{7/8} + β, since the entries of (β/n)JX and (β/n)1X differ by at most β/n. Thus X ⊆ X_f and the first part of Corollary 13 is proved. The fixed point equation X = tanh((β/n) J X) is easier to work with, since all of its exact solutions are constant vectors: indeed, every entry X_i of a solution satisfies X_i = tanh((β/n) Σ_{j=1}^n X_j); every solution X is then of the form X = (x, x, …, x), and the exact fixed point vector equation reduces to the scalar equation x = tanh(βx). The value x_0 = 0 is always a solution, corresponding to the case where the typical configuration is completely disordered.
For β ≤ 1, this is also the only solution. In this case, for every X ∈ X_f, the fixed point bound forces ‖X‖_1 to be small: rearranging, we get that every X ∈ X_f is close to 0. This reflects the fact that for high temperatures, the system is always disordered. For β > 1, there are two additional solutions, x_1 = −x_2. These satisfy |x_1|, |x_2| → 1 as β → ∞, and correspond to the symmetry-broken phase where all spins tend to point in the same direction. Showing that every X ∈ X_f is close to either (x_1, x_1, …, x_1) or (x_2, x_2, …, x_2) can then be done by a standard counting argument, which we choose to omit.
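The scalar Curie–Weiss equation x = tanh(βx) can be solved numerically by bisection; the routine below returns the largest nonnegative solution, illustrating the phase transition at β = 1:

```python
import math

def cw_magnetization(beta, tol=1e-12):
    """Largest solution of x = tanh(beta*x) in [0, 1], by bisection on
    g(x) = tanh(beta*x) - x, which is positive just above 0 iff beta > 1."""
    if beta <= 1:
        return 0.0  # x = 0 is the only solution in the high-temperature phase
    lo, hi = 1e-9, 1.0  # g(lo) > 0 for beta > 1, while g(1) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if math.tanh(beta * mid) - mid > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For β > 1 the returned value is the spontaneous magnetization, which approaches 1 as β → ∞.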

Large deviations
In order to prove Theorem 16, we follow the approach described in Section 2.1.2, and approximate the function f̃ in equation (1) by a well-behaved Hamiltonian g.
Let t ∈ R and δ > 0. Let h : R → R and ψ : R → R be the smoothed step functions used to construct the Hamiltonian g, and denote by ν the Gibbs measure of g, i.e. the law of X_{g_n}. The function g is an approximation of f̃, in the sense that almost all of the mass of ν is supported on vectors on which f attains a large value.
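One concrete (hypothetical) choice of a monotone cutoff with the support properties required of h is a cubic smoothstep between (t − δ)n and tn; the paper's actual h is constructed differently, but any such choice agrees at the endpoints:

```python
def smoothed_cutoff(x, t, delta, n):
    """A monotone surrogate for the hard cutoff: equals 1 for x >= t*n and
    0 for x <= (t - delta)*n, with a cubic smoothstep interpolation between.
    Illustrative choice only, not the construction used in the paper."""
    lo, hi = (t - delta) * n, t * n
    if x >= hi:
        return 1.0
    if x <= lo:
        return 0.0
    s = (x - lo) / (hi - lo)
    return s * s * (3 - 2 * s)
```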
Proposition 28. Denote B = {y ∈ C_n : f(y) < (t − δ′)n}. If there exists a z ∈ C_n such that f(z) ≥ tn, then ν(B) ≤ 2^{−n}.

Proof. Let y ∈ B. By definition of g, g(y) ≤ −log(4)·n. Let z ∈ C_n be such that f(z) ≥ tn. Then under ν the probability of obtaining z is proportional to e^{g(z)} = e^0 = 1. On the other hand, for every y ∈ B, the probability of obtaining y is proportional to a value smaller than e^{−log 4·n} = 4^{−n} = 2^{−2n}. Since there are no more than 2^n possible vectors in C_n, we thus obtain ν(B) ≤ 2^n · 2^{−2n} = 2^{−n}.

Proposition 28 allows us to approximate ν by a distribution that gives no mass at all to vectors y ∈ C_n with f(y) < (t − δ′)n. Define the function ϕ : C_n → R as the restriction of e^{g} to the good vectors, so that ϕ(y) agrees with e^{g(y)} for all y such that f(y) ≥ (t − δ′)n and vanishes otherwise. Denote by σ the measure defined by dσ = ϕ dµ / ∫_{C_n} ϕ dµ, and by X_ϕ a random variable whose law is σ.

Proposition 29. Assume that there exists a z ∈ C_n such that f(z) ≥ tn. Then there exists a coupling between X_{g_n} and X_ϕ such that E‖X_{g_n} − X_ϕ‖_1 ≤ 2n · 2^{−n}. We postpone the proof to the appendix.
Proof of Theorem 16. Applying Theorem 14 to g, there exist a ρ-mixture and a coupling between X(ρ) and X_{g_n} such that E‖X(ρ) − X_{g_n}‖_1 ≤ 80n^{3/4} D̃^{1/4}. Therefore, by Proposition 29 there exists a coupling between X(ρ) and X_ϕ such that E‖X(ρ) − X_ϕ‖_1 ≤ 80n^{3/4} D̃^{1/4} + 8n · 2^{−n}. This shows that X_ϕ is a (ρ, 80D̃^{1/4}/n^{1/4} + 8·2^{−n})-mixture. To obtain equation (10), denote Y_g = {X ∈ X_g : f(X) < (t − 3δ′)n}, and let X ∈ Y_g. Denote by ξ_X the product measure on C_n such that if Y_X ∼ ξ_X then EY_X = X. Denote by A_X the event that f(Y_X) is large.

Equation (32) and Proposition 28 together imply that Pr [A_X] is not too small. Under A_X, the value of f(Y_X) exceeds f(X) by order δ′n. Since E‖X(ρ) − X_{g_n}‖_1 is small, this inequality sets a constraint on the measure of Y_g. Letting Z be a random variable with law ρ, coupled with X(ρ) so that X(ρ) | Z ∼ Y_Z, one obtains a bound showing that ρ(Y_g) is small. Together with equation (31), this gives equation (10).

Remark 30. A particular type of Hamiltonian that has been of considerable interest in the field of large deviations is that of subgraph-counting functions. It was recently shown in [5] that for these types of Hamiltonians, ∇f(X) is close to a stochastic block matrix. Since h′((f(X)/n − t)/δ) is a scalar, this implies that every X ∈ X_g is also close to a stochastic block matrix.
Analogues of Propositions 28 and 29 can then be proved following the same line.

Acknowledgments
The first author is grateful to Sourav Chatterjee for inspiring him to work on this topic and for an enlightening discussion. We thank Amir Dembo and Sumit Mukherjee for insightful discussions, and Yufei Zhao for his motivating comments on sparse bounds. Finally we thank the anonymous referee for comments improving the presentation of this work.
Suppose that E|Y − a| is fixed. Then the left hand side of (35) is maximized by the Y that gives tanh(E tanh⁻¹ Y) an extremal value, conditioned on b := E|Y − a| being constant. Since tanh is monotone, this is equivalent to finding the extremal value of the integral ∫ tanh⁻¹(y) dµ(y). The constraint (37) is of the form ∫ f(x) dµ = b, where f(x) = |x − a|. By Theorems 2.1 and 3.2 and Proposition 3.1 in [8], the extremal distributions which solve a system of n constraints of the form ∫ f_i(x) dµ = c_i are linear combinations of no more than n + 1 singletons, i.e. delta distributions. We can therefore write the extremal µ as µ = p δ_x + (1 − p) δ_y for two real numbers −α ≤ x, y ≤ α and p ∈ [0, 1]. Now, using the triangle inequality, it is in fact enough to show inequality (39), and since E|Y − EY| ≤ 2E|Y − a| for every a, it actually suffices to prove it with a = EY. Plugging the decomposition (38) into (39), we need to prove inequality (40) for every such x and y. Assume without loss of generality that x > 0 and x > |y|. We will now show that the inequality holds for 0 < p ≤ 1/2; we omit the similar proof for 1/2 ≤ p < 1. For these values of p, it suffices to show (41). For every fixed value of y, we treat the expression on the left hand side as a function of p for p ∈ (0, 1). This expression may attain its supremum either as p → 0⁺, at p = 1/2, or at values of p for which the derivative of the left hand side with respect to p vanishes. We now consider each of these three cases.

Taking the derivative
Setting the derivative to 0, one obtains relation (43). If e^{−(a−b)} ≤ 1/2, then 2(a − b)/(tanh⁻¹(α)(1 − e^{−(a−b)})) ≤ 8. Otherwise, if e^{−(a−b)} ≥ 1/2, then a − b < 3/4. By Taylor's theorem, the term 1 − e^{−(a−b)} in the denominator can be bounded from below by (a − b)/2, again bounding the expression by 8. Now suppose that (1 − y)/(1 − x) < 2. Since log z ≤ z − 1 for all z, we may then bound the left hand side of (43) accordingly.

The case p = 0. Using L'Hôpital's rule, one computes the value of the left hand side of (40) attained as p → 0⁺. For y ≥ 0, this is the same expression obtained by setting p = 0 in (41). The case y < 0 is handled similarly as above.
The case p = 1/2. In this case we must show that (tanh((1/2) tanh⁻¹(x) + (1/2) tanh⁻¹(y)) − ((1/2)x + (1/2)y)) / (tanh⁻¹(α)(x − y)) ≤ 9/2. This bound can be shown by differentiating with respect to y to find the maximum of the left hand side.
Proposition 32. Let f : C_n → R, let ξ be a product measure over C_n, and let Y ∼ ξ. Then the variance of f can be bounded by Var[f(Y)] ≤ n·Lip²(f).
By Jensen's inequality, the proposition follows.

Proof of the chain rule (Lemma 25). For y ∈ C_n in the discrete hypercube, denote by S_i(y) the vector which is equal to y everywhere except for the i-th entry, so that (S_i(y))_j = y_j for j ≠ i and (S_i(y))_i = −y_i.
Putting this into equation (45), we get the desired identity; equations (24) and (25) then follow immediately. For equation (26), let x ∈ [−1, 1]^n and let ξ be the product measure on C_n such that for Y ∼ ξ, EY = x. Applying equation (21) to ∇f and ∇(h ∘ f), we obtain a decomposition into two terms. By equation (24), the second term on the right hand side is bounded by BL²n. As for the first term, by Proposition 32 it is bounded by BL²n^{3/2}.