HWI inequalities in discrete spaces via couplings

HWI inequalities are interpolation inequalities relating entropy, Fisher information and optimal transport distances. We adapt to the discrete setting an argument of Y. Wu for proving the Gaussian HWI inequality via couplings, establishing new interpolation inequalities for the discrete hypercube and the discrete torus. In particular, we obtain an improvement of the modified logarithmic Sobolev inequality of Bobkov and Tetali for the discrete hypercube.


Introduction
The HWI inequality, originally proved by Otto and Villani [24], is an interpolation inequality relating entropy, Fisher information and L^2 transport distances. It plays a role in the synthetic theory of Ricci curvature bounds on metric spaces, and has found applications to concentration of measure, statistical physics and geometry. Alternative proofs have been given in [31, 17, 20], and an improved dimensional version was derived in [9]. The most common point of view on this inequality is to view it as a consequence of the convexity of the entropy functional along certain families of interpolations, which can be interpreted as geodesics for a formal Riemannian structure on the space of probability measures. The main goal of this work is to explain how the proof of [31] (see also [7, 27]), which does not rely on this convexity viewpoint, can be adapted to the discrete setting. This leads to new discrete interpolation inequalities, different from those obtained by adapting the ideas of [24] (as was done in [13]).
In the Euclidean setting, the HWI inequality takes the following form: given a reference probability measure dµ = e^{−V} dx on R^d such that Hess V ≥ K Id for some K ∈ R, we have for all other probability measures ν on R^d

H(ν|µ) ≤ W_2(ν, µ) √(I(ν|µ)) − (K/2) W_2(ν, µ)²,   (1)

where W_2 stands for the L^2 Wasserstein (or Monge-Kantorovitch) distance, H for the relative entropy functional and I for the relative Fisher information; formal definitions of each will be given later. When K > 0, the HWI inequality (1) implies a logarithmic Sobolev inequality, one of the main functional inequalities used for establishing concentration of measure estimates, as well as bounds on the trend to equilibrium for stochastic dynamics. Beyond their relationship with other functional inequalities, HWI inequalities have found some direct applications in statistical physics [14, 19]. They hold in the more general setting of weighted manifolds satisfying a Ricci curvature bound. A dimensional reinforcement of the Gaussian HWI inequality for even, strongly log-concave arguments was derived in [4, Theorem 5.4].

In the discrete setting, several families of HWI inequalities have been proposed [13, 18, 21], as consequences of various proposed definitions of Ricci curvature bounds adapted to discrete spaces. For each of them, the approach consists in defining a family of interpolating curves in the space of probability measures, and proving that the entropy is uniformly (semi-)convex along these curves. These curves are interpreted as geodesic curves, and the distance used in place of the Wasserstein distance is the associated geodesic distance, while the Fisher information is the squared norm of the gradient of the entropy with respect to that metric structure. Such convexity properties are usually known as Ricci curvature bounds, in analogy with the Lott-Sturm-Villani synthetic notion of Ricci curvature bounds for Riemannian manifolds. All of these approaches have been shown to work for the simple graphs we shall consider here, and some of them have been shown to work for more sophisticated examples [15, 12, 21].
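For K > 0, the passage from the HWI inequality to a logarithmic Sobolev inequality is a one-line optimization over the value of W_2; we record it here for completeness (a standard computation, written in our notation):

```latex
% Optimizing the right-hand side of the HWI inequality over w = W_2(\nu,\mu):
\[
  H(\nu\,|\,\mu) \;\le\; \max_{w \ge 0}\left( w\sqrt{I(\nu\,|\,\mu)} - \frac{K}{2}\,w^2 \right)
  \;=\; \frac{I(\nu\,|\,\mu)}{2K},
\]
% the maximum being attained at w = \sqrt{I(\nu\,|\,\mu)}/K. This is the
% logarithmic Sobolev inequality with constant 1/(2K).
```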
Our starting point for this work is an alternative proof of the HWI inequality for the Gaussian space, due to Yihong Wu [31], which we shall describe in some detail in Section 2. What is interesting is that, while it uses some ingredients related to curvature bounds in the continuous setting (namely, a decay rate for the Fisher information along a stochastic dynamic), strictly speaking it does not require such a bound, and instead relies on some very explicit coupling arguments to bound the entropy along the dynamic by the Wasserstein distance between the initial data and the equilibrium measure. In particular, there is no need to introduce a family of geodesic interpolations. We shall mimic this proof in the discrete setting for several examples, and obtain new HWI-type inequalities that are different from those obtained using discrete curvature arguments. The distances involved will be simple variations on the L^1 and L^2 Wasserstein distances, rather than the more sophisticated variational distances appearing for example in [13, 21]. In the case of the hypercube, the inequality we obtain improves on the modified logarithmic Sobolev inequality of Bobkov and Tetali [8, 26].
Remark 1.1. While this work was being finalized, Altschuler and Chewi published a preprint [1] which also leverages couplings and convexity of entropy to prove reverse transport inequalities. Their viewpoint offers some additional flexibility by iterating short-time estimates, combined with regularity assumptions on the transition rates. Unlike the discrete settings emphasized here, their work focuses on diffusion processes in the continuous setting, leveraging tools from stochastic calculus (e.g., Girsanov's theorem).

Yihong Wu's proof of the Gaussian HWI inequality
The goal of this section is to present Y. Wu's proof of the HWI inequality for the Gaussian measure, and extract the main arguments we will need to replicate in the discrete setting. This proof was included in [7] and [27]. We also note that a related approach was used in [25], and extended in [28] to establish an HWI inequality for the Wiener measure, using an infinite-dimensional Harnack inequality instead of a coupling argument.

The Gaussian HWI inequality
Let γ denote the standard Gaussian measure on R^d, and let ν be another probability measure with density dν = ρ dγ. The relative entropy, Fisher information, and W_2 distance are defined respectively by

H(ν|γ) = ∫ ρ log ρ dγ,   I(ν|γ) = ∫ |∇ρ|²/ρ dγ,   W_2(ν, γ)² = inf E[|X − Z|²],

where the infimum in the definition of W_2 is over all couplings of X, Z with laws X ∼ ν and Z ∼ γ. If ν is not absolutely continuous with respect to γ we set H(ν|γ) = +∞, and we similarly set I(ν|γ) = +∞ if ν is not absolutely continuous with respect to γ or if ρ is not weakly differentiable. It will be convenient to adopt the usual abuse of notation and write H(X|Z) in place of H(ν|γ) when X ∼ ν and Z ∼ γ, and similarly write I(X|Z) ≡ I(ν|γ) and W_2(X, Z) ≡ W_2(ν, γ). In this notation, the Gaussian HWI inequality is

H(X|Z) ≤ W_2(X, Z) √(I(X|Z)) − W_2(X, Z)²/2.   (2)

To start the proof, we introduce the Ornstein-Uhlenbeck dynamic

dX_t = −X_t dt + √2 dB_t

for a standard Brownian motion (B_t)_{t≥0}, which is a reversible diffusion process whose invariant distribution is the standard Gaussian measure. For initial data X_0 = X ∼ ν (equality in law), we also have the identity in law, or Mehler formula,

X_t = e^{−t} X + √(1 − e^{−2t}) Z in law,   (3)

where Z ∼ γ is independent of X. A direct computation shows that if X and Y are two Gaussians, with respective means x and y and same variance σ², then H(X|Y) = |x − y|²/(2σ²). Relative entropy is jointly convex in its arguments, so for Z, Z′ i.i.d. ∼ γ, the above implies

H(X_t|Z) ≤ (e^{−2t}/(2(1 − e^{−2t}))) E[|X − Z′|²]   (4)

for every coupling of X and Z′. Taking the infimum over all such couplings, we get the following reverse transport-entropy inequality along the Ornstein-Uhlenbeck flow:

H(X_t|Z) ≤ (e^{−2t}/(2(1 − e^{−2t}))) W_2(X, Z)².   (5)

This estimate also holds for non-Gaussian reference measures with curvature bounded from below [6], with a different proof, but this is not our goal here. The classical entropy production formula (i.e., the de Bruijn identity) states

(d/dt) H(X_t|Z) = −I(X_t|Z),

and since the Fisher information decays exponentially fast along the Ornstein-Uhlenbeck flow [5, 6], that is I(X_t|Z) ≤ e^{−2t} I(X|Z), we conclude

H(X|Z) ≤ (e^{−2t}/(2(1 − e^{−2t}))) W_2(X, Z)² + ((1 − e^{−2t})/2) I(X|Z).

Optimizing over t ≥ 0 produces the Gaussian HWI inequality (2).
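To make the last optimization step concrete, one can check numerically that minimizing the final bound over t recovers the right-hand side of the Gaussian HWI inequality, W_2 √I − W_2²/2, whenever W_2 ≤ √I. The following is an illustrative sketch, not part of the proof:

```python
import math

def gaussian_hwi_bound(W, I, num=200000):
    """Numerically minimize t -> e^{-2t} W^2/(2(1-e^{-2t})) + (1-e^{-2t}) I/2."""
    best = float("inf")
    for k in range(1, num):
        t = 5.0 * k / num  # grid on (0, 5]
        u = 1.0 - math.exp(-2.0 * t)
        val = (1.0 - u) * W * W / (2.0 * u) + u * I / 2.0
        best = min(best, val)
    return best

W, I = 0.7, 2.0  # any values with W <= sqrt(I)
closed_form = W * math.sqrt(I) - W * W / 2.0
numeric = gaussian_hwi_bound(W, I)
print(abs(numeric - closed_form))  # small: the grid minimum matches W*sqrt(I) - W^2/2
```

The minimum is attained at 1 − e^{−2t} = W/√I, which lies in (0, 1] exactly when W ≤ √I.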

Roadmap for a scheme to prove HWI inequalities
The above argument relies on the fact that our reference measure γ in the HWI inequality is the invariant measure of a certain Markov process, and that the Fisher information arises as the derivative of the entropy along the process. Adopting this viewpoint, we can extract the two main ingredients of the proof we just described:
1. A decay rate, or at least a bound, for the Fisher information along the flow generated by the Markov chain.
2. A coupling of two trajectories of the Markov chain with different starting points, and such that the relative entropy of one marginal of the joint distribution at time t with respect to the other can be estimated.
As we will illustrate in subsequent sections, when these two ingredients are available for some Markov process, they can be combined to obtain an HWI inequality for the invariant measure.
The first element is one of the typical outcomes of a bound on entropic Ricci curvature, and so is available for any Markov chain that satisfies such a bound. But Fisher information decay is a strictly weaker property, and is sometimes known in cases where no lower bound on the Ricci curvature is known, or known only with worse constants. At a practical level, checking it requires checking convexity of the entropy along the dynamic flow, while Ricci curvature bounds require checking convexity along a much larger class of curves. See for example [11] and [15] for how the two notions differ on concrete examples. This part of the argument only involves Fisher information and entropy; it has no effect on what distance arises in the final inequality. In the examples below, we always use exponential decay rates for the Fisher information, but in principle other rates could be used.
The second element is at the core of the reverse transport-entropy inequality (5), and is established here using the Mehler formula (3). Using couplings is a now-standard method for studying the long-time behavior of Markov processes, see for example [10]. It is in this step that the Wasserstein distance appears.
In the discrete situations we shall consider next, we demonstrate how these two elements can be used to mimic the proof of the Gaussian HWI inequality. This approach leads to different HWI inequalities than those obtained when only using curvature arguments. The downside of the method presented here is that an exact representation of the transition probabilities (e.g., as given by the Mehler formula) is often unavailable, and thus the argument seems less widely applicable than curvature arguments. It may be that Harnack-type inequalities could be used to cover examples where we do not have explicit couplings, yet still lead to different HWI inequalities than those obtained via curvature arguments.

The hypercube
Consider the simple random walk (X_t)_{t≥0} on the hypercube {0,1}^N, where with rate N we choose a coordinate uniformly at random and flip it with probability 1/2. The invariant measure is the product Bernoulli measure with parameter 1/2, which we shall denote by µ. Given initial data X_0 ≡ X ∼ ν, we shall denote by ν_t the distribution after running this dynamic for time t (i.e., X_t ∼ ν_t).
If we differentiate the entropy along this dynamic, we have

(d/dt) H(ν_t|µ) = −I(ν_t|µ),   I(ν_t|µ) := (1/4) Σ_{x∈{0,1}^N} Σ_{i=1}^N µ(x) (ρ_t(x^i) − ρ_t(x)) (log ρ_t(x^i) − log ρ_t(x)),

where ρ_t is the density of ν_t with respect to µ, and x^i is the configuration of {0,1}^N obtained from x by flipping the i-th coordinate. The quantity I is the discrete (modified) Fisher information, also known as the entropy production. Note that under this scaling, the Fisher information I is additive on the dimension.
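The entropy-production identity can be sanity-checked numerically on a small hypercube. In the sketch below we use the convention that each coordinate is flipped at rate 1/2 (rate-N selection of a coordinate, then a flip with probability 1/2), together with a 1/4 prefactor in I; these prefactors are our reading of the scaling and should be treated as illustrative:

```python
import math, itertools

# Numerical check (illustrative) of d/dt H(nu_t | mu) = -I(nu_t | mu)
# on {0,1}^N with N = 2, coordinate flip rate 1/2.
N = 2
states = list(itertools.product((0, 1), repeat=N))
mu = {x: 0.5 ** N for x in states}

def flip(x, i):
    return x[:i] + (1 - x[i],) + x[i + 1:]

def step(nu, dt):
    # one explicit Euler step of the forward equation
    out = dict(nu)
    for x in states:
        for i in range(N):
            out[x] += dt * 0.5 * (nu[flip(x, i)] - nu[x])
    return out

def entropy(nu):
    return sum(nu[x] * math.log(nu[x] / mu[x]) for x in states)

def fisher(nu):
    rho = {x: nu[x] / mu[x] for x in states}
    return 0.25 * sum(
        mu[x] * (rho[flip(x, i)] - rho[x]) * math.log(rho[flip(x, i)] / rho[x])
        for x in states for i in range(N)
    )

nu = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}
eps = 1e-6
dH = (entropy(step(nu, eps)) - entropy(step(nu, -eps))) / (2 * eps)  # central difference
print(dH, -fisher(nu))  # the two numbers agree up to O(eps^2)
```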
The main result of this section is the following HWI inequality for the discrete hypercube. To state it, we define the W_1 distance

W_1(ν, µ) = inf E[d(X, Y)],

where d is the Hamming (graph) distance on the hypercube, and the infimum is over all couplings of X ∼ ν and Y ∼ µ.

Theorem 3.1. Let µ be the uniform probability measure on {0,1}^N. For any other probability measure ν on the same space satisfying I(ν|µ) ≥ 4W_1(ν, µ), we have

H(ν|µ) ≤ 2√(W_1(ν, µ) I(ν|µ)) − 2W_1(ν, µ).

Note that in the admissible regime I ≥ 4W_1, this improves on the usual modified log-Sobolev inequality H ≤ (1/2) I of [8], via Young's inequality. The transport-information inequality for the discrete hypercube states that W_1² ≤ N I (see for example [16]), which is not enough to ensure I ≥ 4W_1 is always true, but does imply it when I/N is not too large.

Remark 3.1. In this discrete setting, there are several notions of discrete Fisher information, and the associated functional inequalities are not equivalent in general, see for example [8].
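The comparison with the modified log-Sobolev inequality rests on the elementary Young-type inequality 2√(ab) ≤ 2a + b/2, so that 2√(ab) − 2a ≤ b/2 for all a, b ≥ 0, with equality when b = 4a. A quick numerical check (illustrative, not from the paper):

```python
import math, random

# Elementary inequality behind the comparison with H <= I/2:
# 2*sqrt(a*b) - 2a <= b/2 for all a, b >= 0, with equality when b = 4a.
# Here a plays the role of W_1(nu, mu) and b the role of I(nu | mu).
random.seed(0)
for _ in range(10000):
    a = random.uniform(0.0, 10.0)
    b = random.uniform(0.0, 10.0)
    assert 2.0 * math.sqrt(a * b) - 2.0 * a <= b / 2.0 + 1e-12
# equality case b = 4a:
a = 1.7
assert abs((2.0 * math.sqrt(a * 4.0 * a) - 2.0 * a) - (4.0 * a) / 2.0) < 1e-12
print("ok")
```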
Proof. It is known that I is exponentially decaying: I(ν_t|µ) ≤ e^{−2t} I(ν|µ), which gives the first of the two ingredients discussed in Section 2.2. This can be derived for example as a consequence of the Ricci curvature bounds for the hypercube obtained in [13] (but can also be computed directly). Hence

H(ν|µ) = H(ν_t|µ) + ∫_0^t I(ν_s|µ) ds ≤ H(ν_t|µ) + ((1 − e^{−2t})/2) I(ν|µ).

The usual proof of the modified logarithmic Sobolev inequality by the Bakry-Émery method consists at this point of letting t go to infinity, to obtain H ≤ (1/2) I. Our HWI inequality shall be obtained by taking a better choice of t, after bounding the entropy H(ν_t|µ).
We now bound H(ν_t|µ) using a coupling argument, giving the second ingredient in the discussion of Section 2.2. To do this, we couple two trajectories of the random walk starting from different deterministic initial conditions X_0 = x and Y_0 = y by having their matching coordinates change at the same times, with the same outcome. If we then look at the random variables X_t and Y_t obtained by running this coupling, the coordinates that matched initially still match at later times, while coordinates that did not match at the beginning are the same with probability 1 − e^{−t}. Moreover, their laws are products over the coordinates. Hence H(X_t|Y_t) ≤ d(x, y) ϕ(t), where ϕ(t) is the entropy of a Bernoulli random variable with parameter p = (1 − e^{−t})/2 with respect to another one with parameter q = (1 + e^{−t})/2. This quantity can be computed, and is

ϕ(t) = e^{−t} log((1 + e^{−t})/(1 − e^{−t})).

Now, for any coupling of X = X_0 ∼ ν and Y = Y_0 ∼ µ, by joint convexity of the relative entropy we have

H(ν_t|µ) ≤ E[d(X, Y)] ϕ(t),

and hence, minimizing over couplings,

H(ν_t|µ) ≤ W_1(ν, µ) ϕ(t).

Combining the above estimates, we get

H(ν|µ) ≤ ((1 − e^{−2t})/2) I(ν|µ) + W_1(ν, µ) ϕ(t).

We can then use the bound ϕ(t) ≤ 2/(1 − e^{−2t}) − 2 and take t such that 1 − e^{−2t} = 2√(W_1/I), which is possible when I ≥ 4W_1, to complete the proof.
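The closed form of the Bernoulli relative entropy and the bound used at the end of the proof can be verified numerically. In the sketch below we write s for the exponential factor (s = e^{−t} or s = e^{−2t}, depending on the normalization of the dynamic; both choices lead to a bound of the shape 2/(1 − e^{−2t}) − 2), and treat the constants as illustrative:

```python
import math

def kl_bernoulli(p, q):
    """Relative entropy of Bernoulli(p) with respect to Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# For s in (0, 1): the relative entropy of Bernoulli((1-s)/2) with respect to
# Bernoulli((1+s)/2) has closed form s*log((1+s)/(1-s)), and satisfies the
# elementary chain s*log((1+s)/(1-s)) <= 2s^2/(1-s^2) <= 2s/(1-s).
for s in (0.1, 0.3, 0.5, 0.7, 0.9, 0.99):
    val = kl_bernoulli((1 - s) / 2, (1 + s) / 2)
    closed = s * math.log((1 + s) / (1 - s))
    assert abs(val - closed) < 1e-10
    assert closed <= 2 * s * s / (1 - s * s) + 1e-12
    assert 2 * s * s / (1 - s * s) <= 2 * s / (1 - s) + 1e-12
print("phi checks pass")
```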
Remark 3.2. Note the difference with the Gaussian case, where it is a squared distance that plays a role. The appearance of W_1 makes sense, though, since it is W_1, and not W_1², that is additive on product measures.

The discrete torus
In this section, we prove a new HWI inequality for the discrete torus.We show that it implies the (known) HWI inequality on the circle as a limiting case.

HWI inequality on discrete torus
We now consider the situation where the reference measure µ is the uniform measure on the discrete torus Z/(NZ), viewed as the invariant measure of the simple random walk. The relative entropy of another probability measure ν is then

H(ν|µ) = Σ_{x∈Z/(NZ)} ν(x) log(ν(x)/µ(x)),

and the Fisher information is

I(ν|µ) = (1/2) Σ_{x∈Z/(NZ)} µ(x) (ρ(x+1) − ρ(x)) (log ρ(x+1) − log ρ(x)),

where ρ = dν/dµ, which is indeed the dissipation of entropy along the flow of the simple random walk.
The HWI inequality for the uniform measure on the discrete torus of length N will involve a transport cost defined as an infimum, once again over all possible couplings of X ∼ ν and Y ∼ µ, of a function of the graph distance d(X, Y). We shall obtain the following HWI inequality:

Remark 4.1. The HWI inequality for the continuous torus obtained when adopting the viewpoint of [24] is H ≤ W_2 √I. One can recover this inequality from the discrete one above by rescaling. It is also possible to prove the continuous HWI inequality directly with the method we use here, by coupling two Brownian motions on the torus.
Proof. Let X_t and Y_t be two simple random walks on the discrete torus starting from positions at distance d. We realize them from simple random walks X̃_t and Ỹ_t on the integers, also starting at distance d, by setting X_t = X̃_t mod N (resp. Y_t = Ỹ_t mod N). Starting with the data processing inequality for relative entropy, we have

H(X_t|Y_t) ≤ H(X̃_t|Ỹ_t) = Σ_{n∈Z} e^{−t} I_n(t) log(I_n(t)/I_{n−d}(t)),

where I_n(t) is the so-called modified Bessel function of the first kind [3], which is related to the transition probabilities of the simple random walks via P(X̃_t = n|X̃_0 = 0) = e^{−t} I_n(t). From Lemma 4.2 below, we obtain a bound on this quantity. For random walks with general initial data X_0 = X and Y_0 = Y, the same coupling argument as before yields a bound in terms of the transport cost, and hence, since the Fisher information is non-increasing along the flow [13], an estimate on H(ν|µ). Optimizing in t gives the result.
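The transition probabilities e^{−t} I_n(t) can be sanity-checked numerically from the power-series representation of the modified Bessel functions (a standard representation; the sketch below is illustrative):

```python
import math

def bessel_i(n, t, terms=60):
    """Modified Bessel function I_n(t) for integer n, from its power series."""
    n = abs(n)
    return sum((t / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

t = 1.5
# p_n(t) = e^{-t} I_n(t) are the transition probabilities of the continuous-time
# simple random walk on Z (jumps +1 and -1, each at rate 1/2), started at 0.
probs = [math.exp(-t) * bessel_i(n, t) for n in range(-30, 31)]
print(sum(probs))  # close to 1

# Master equation d/dt p_n = (p_{n-1} + p_{n+1})/2 - p_n, equivalent to the
# classical identity I_n'(t) = (I_{n-1}(t) + I_{n+1}(t))/2.
eps = 1e-6
for n in range(5):
    deriv = (bessel_i(n, t + eps) - bessel_i(n, t - eps)) / (2 * eps)
    assert abs(deriv - 0.5 * (bessel_i(n - 1, t) + bessel_i(n + 1, t))) < 1e-8
```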
We now prove the technical estimate on the transition probabilities of the simple random walk used above.

Lemma 4.2. Let M be a symmetric, unimodal integer-valued random variable. Then

For a symmetric probability measure P, unimodality means that m ↦ P(m) is non-increasing in |m|.
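The symmetry and sign properties of the function h(n) = −log(I_n(t)/I_{n−d}(t)) used in the proof below can be illustrated numerically (the Bessel series is standard; the check itself is ours and purely illustrative):

```python
import math

def bessel_i(n, t, terms=60):
    """Modified Bessel function I_n(t) for integer n, from its power series."""
    n = abs(n)
    return sum((t / 2.0) ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
               for k in range(terms))

def h(n, d, t):
    # h(n) = -log(I_n(t)/I_{n-d}(t)), as in the proof of Lemma 4.2
    return -math.log(bessel_i(n, t) / bessel_i(n - d, t))

t, d = 2.0, 4  # d even, as in the proof
for n in range(-6, 11):
    assert abs(h(d - n, d, t) + h(n, d, t)) < 1e-9  # h is odd about the point d/2
for n in range(d // 2, 11):
    assert h(n, d, t) >= -1e-12  # h >= 0 for n >= d/2, by unimodality of I_n(t) in |n|
print("h symmetry and sign checks pass")
```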
Proof. This lemma is a consequence of the following estimate on modified Bessel functions: for integers n, d satisfying n ≥ d/2 ≥ 0, we have for all t > 0 a pointwise bound (6) on log I_n(t) in terms of log I_{n−d}(t). Let us take this bound as given for now, and show how it implies Lemma 4.2.
For simplicity, we assume that d is even, so that d/2 is an integer. Let h(n) := −log(I_n(t)/I_{n−d}(t)). Since it is the difference of two functions of n that are even about the points d and 0 respectively, it is odd about the point d/2. Moreover, as a consequence of (6), h(n) ≥ 0 for n ≥ d/2, and by antisymmetry h(n) ≤ 0 for n ≤ d/2.
We then conclude by combining these observations with the unimodality of the law of M. The case where d is odd follows the same chain of arguments.
f(t, x) ≥ 0; ∂_t f(t, 0) ≤ 0 and therefore, for any t ≥ 0 and x ≥ 0. Now, if d/2 ≤ n < d, then n ≥ d − n > 0 and