Some universal estimates for reversible Markov chains

We obtain universal estimates on the convergence to equilibrium and the times of coupling for continuous time irreducible reversible finite-state Markov chains, both in the total variation and in the L^2 norms. The estimates in total variation norm are obtained using a novel identity relating the convergence to equilibrium of a reversible Markov chain to the increase in the entropy of its one-dimensional distributions. In addition, we propose a universal way of defining the ultrametric partition structure on the state space of such Markov chains. Finally, for chains reversible with respect to the uniform measure, we show how the global convergence to equilibrium can be controlled using the entropy accumulated by the chain.


Introduction
Recently, the convergence to equilibrium of slowly mixing Markov chains appearing in statistical physics has attracted much attention. In this framework, continuous time irreducible reversible Markov chains are defined by choosing the transition rate from a state (usually, a spin configuration) a to a state (spin configuration) b to be proportional to e^{−β(E(b)−E(a))_+}, where E is an energy functional, (x)_+ = max(x, 0) denotes the positive part, and β stands for the inverse temperature, which in this context is chosen to be large: β ≫ 1. In this low temperature regime, a recurring feature is that the energy landscape given by E divides the state space into sets of metastable (or stable) states, which are separated by potential wells. The convergence to equilibrium of the corresponding Markov chain, started in a metastable state, is then governed by the time it takes to overcome the respective potential wells in order to reach the part of the state space with the lowest energy.
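To make the rate prescription above concrete, the following minimal sketch builds such rates on a small cycle and checks the detailed balance condition with respect to the Gibbs measure proportional to e^{−βE}. The cycle geometry, the energy values, the choice of proportionality constant 1 and all function names are assumptions made purely for illustration.

```python
import math

def metropolis_rates(energy, beta):
    """Rates q[a][b] = exp(-beta * max(E(b) - E(a), 0)) between neighbors.

    The neighborhood structure (a cycle) is an assumption of this sketch.
    """
    n = len(energy)
    q = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in ((a - 1) % n, (a + 1) % n):
            q[a][b] = math.exp(-beta * max(energy[b] - energy[a], 0.0))
    return q

def gibbs_measure(energy, beta):
    """Probability measure proportional to exp(-beta * E)."""
    weights = [math.exp(-beta * e) for e in energy]
    z = sum(weights)
    return [w / z for w in weights]

energy = [0.0, 1.5, 0.3, 2.0, 0.1, 1.0]   # hypothetical energy landscape
beta = 4.0
q = metropolis_rates(energy, beta)
nu = gibbs_measure(energy, beta)

# Detailed balance: nu(a) q(a, b) = nu(b) q(b, a) for all pairs (a, b).
max_violation = max(abs(nu[a] * q[a][b] - nu[b] * q[b][a])
                    for a in range(len(energy)) for b in range(len(energy)))
```

Both sides of the detailed balance relation equal e^{−β max(E(a), E(b))} divided by the normalizing constant, which is why the chain is reversible with respect to the Gibbs measure.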
The potential theoretic approach to metastability developed in the articles [7], [8] and [9] (see also the excellent summaries [5] and [6]) has been used to obtain precise information on metastable transitions for reversible Markov chains associated with several models of statistical physics. These include certain disordered mean field models (see [8]) and, more specifically, the Curie-Weiss model with a random field taking finitely many values (see [8], [3] and [4]). Other examples of slowly mixing reversible Markov chains, in which the metastable behavior has been analyzed in detail, include the Glauber dynamics for the two-dimensional Ising model on a torus and its generalizations (see [18] and [19]), the three-dimensional Ising model on a torus (see [2]) and the classical Curie-Weiss model (see [13] and [17]). Moreover, the first exit problem from a domain for reversible chains with exponentially small transition probabilities was studied in the article [20]. In a different but related line of research, initiated by the article [16], the metastable transitions are studied for diffusions with a small diffusion parameter, which are confined in a potential having several local minima (see [10] and [11] for a recent account of this problem).
Date: December 21, 2013. This research was partially supported by NSF grant DMS-08-06211.
Here, we take a different viewpoint. Instead of analyzing a specific Markov chain in detail, we try to understand some universal aspects of the ultrametric structure, that is, the presence of multiple time scales in a general irreducible reversible finite-state Markov chain. We obtain universal estimates on the convergence to equilibrium and the times of coupling in this abstract framework. We prove such results both in the total variation norm and in the L^2 norm. In the case of the total variation norm, we utilize a novel entropy identity relating the convergence to equilibrium of the chain to the increase of the entropy of its one-dimensional distributions.
In addition, we propose a universal way of defining the ultrametric partition structure, that is, a sequence of partitions of the state space corresponding to the different time scales on which convergence to equilibrium occurs. Finally, in the case that the chain is reversible with respect to the uniform measure, we show how the entropy of the one-dimensional distributions can be utilized to control the global convergence to equilibrium of the chain.
To give examples of the type of results we obtain, we introduce a set of notations. Let X be a continuous time irreducible reversible Markov chain on a set I = {1, 2, ..., n} of n elements. Moreover, for a ∈ I and t ≥ 0, let P^a_t be the one-dimensional distribution of the chain at time t when started in a. Finally, write ‖·‖_TV for the total variation norm, ν for the invariant distribution of X, and let H(ν) = −∑_{a∈I} ν(a) log ν(a) be the entropy of the invariant distribution ν.
Before stating our first result rigorously, we would like to provide the reader with some intuition by giving an example. Fix natural numbers M, N ≥ 3 and consider the graph given by an arrangement of M cycles of length N (N-cycles) in a cycle of size M. Now, let X be the continuous time Markov chain on this graph, which has a transition rate ρ_1 > 0 for neighboring vertices belonging to the same N-cycle and a transition rate ρ_2 > 0 for neighboring vertices belonging to different N-cycles. Since the generating matrix of this Markov chain is symmetric, the chain is reversible with respect to the uniform distribution on the set of vertices of the graph. Next, suppose that ρ_2 is much smaller than ρ_1. Then, it is intuitively clear that, for every vertex a of the described graph, the quantity ‖P^a_{3t} − P^a_t‖_TV can only be large on the two disjoint time intervals during which the Markov chain mixes on the N-cycle containing a and on the M-cycle formed by the M different N-cycles, respectively. Under the scale-invariant measure, which has the density 1/t on the time axis [0, ∞), the union of these two time intervals has a total measure of order log N + log M = log(MN). Thus, it is logarithmic in the size of the state space of X. The purpose of Theorems 1 and 3 below is to show that the latter property is universal for continuous time irreducible reversible Markov chains, in the sense that the order of magnitude in this example is an upper bound on the size of the corresponding quantity for a general reversible Markov chain.
Theorem 1. Let X be a continuous time irreducible reversible Markov chain on the set I = {1, 2, ..., n} and let ν be its invariant distribution. Then, the following is true.
(a) For every δ > 0, there is a constant C(δ) > 0 depending only on δ (and not on n or the particular Markov chain) such that
In particular, for every δ, ε > 0, there exists a constant C_ε(δ) > 0 depending only on δ and ε (and not on n or the particular Markov chain) such that
(b) For every δ > 0, there is a constant C̃(δ) > 0 depending only on δ (and not on n or the particular Markov chain) such that
In particular, for every δ, ε > 0, there exists a constant C̃_ε(δ) > 0 depending only on δ and ε (and not on n or the particular Markov chain) such that
We remark at this point that universal estimates as in Theorem 1 can only be obtained under the scale-invariant measure (1/t) dt on the time axis [0, ∞), which has the property
(1.5) ∫_{η t_1}^{η t_2} (1/t) dt = ∫_{t_1}^{t_2} (1/t) dt
for all η > 0 and 0 < t_1 < t_2 < ∞. This can be easily seen by slowing down or speeding up the chain by a constant factor.
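The role of the scale-invariant measure can be illustrated numerically. The sketch below is a simplified stand-in for the two-level example (a single n-cycle with nearest-neighbor rate ρ, whose transition probabilities have an explicit Fourier form) and is not the construction of the paper. It approximates the (1/t) dt-measure of the set of times at which ‖P^a_{3t} − P^a_t‖_TV exceeds a threshold, once for a slow and once for a fast chain. The normalization of the total variation norm as half the l^1 distance, the threshold and all names are assumptions of the sketch.

```python
import math

def cycle_kernel(n, rho, t):
    """Row P_t(0, .) of the rate-rho nearest-neighbor walk on an n-cycle.

    Fourier form: P_t(0, b) = (1/n) * sum_k exp(-lam_k t) cos(2 pi k b / n),
    with lam_k = 2 rho (1 - cos(2 pi k / n)).
    """
    row = []
    for b in range(n):
        s = 0.0
        for k in range(n):
            lam = 2.0 * rho * (1.0 - math.cos(2.0 * math.pi * k / n))
            s += math.exp(-lam * t) * math.cos(2.0 * math.pi * k * b / n)
        row.append(s / n)
    return row

def tv_gap(n, rho, t):
    """Half the l1 distance between P_{3t}(0, .) and P_t(0, .)."""
    p, q = cycle_kernel(n, rho, t), cycle_kernel(n, rho, 3.0 * t)
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def log_measure_of_window(n, rho, threshold, u_min=-15.0, u_max=15.0, step=0.02):
    """Riemann sum for the dt/t-measure of {t : tv_gap > threshold}, t = e^u."""
    steps = int(round((u_max - u_min) / step))
    count = 0
    for i in range(steps):
        if tv_gap(n, rho, math.exp(u_min + i * step)) > threshold:
            count += 1
    return count * step

slow = log_measure_of_window(8, 1.0, 0.1)    # chain with rate 1
fast = log_measure_of_window(8, 50.0, 0.1)   # same chain, sped up by a factor 50
```

Speeding up the chain only translates the window on the logarithmic time axis, so the two values agree up to discretization error, whereas the Lebesgue measure of the window would shrink by the factor 50.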
To give an example of a result in the framework of L^2 convergence, a set of auxiliary notations is needed. For simplicity, we assume for the moment that X is irreducible and reversible with respect to the uniform distribution on I. In this case, writing L for the generating matrix of X, we can conclude that the matrix −L is symmetric and admits an orthonormal basis of eigenvectors v_1, v_2, ..., v_n corresponding to eigenvalues 0 = λ_1 < λ_2 ≤ λ_3 ≤ ... ≤ λ_n. Fixing a pair of initial states (a, b) and letting e_a (resp. e_b) be the vector whose only non-zero component is the a-th one (resp. the b-th one) and equals 1, we have the decomposition Finally, we define the set and a family of its neighborhoods 1 2 , and write ‖·‖_2 for the L^2 norm with respect to the counting measure on I.
Theorem 2. In the setting just described the following is true. For every δ ∈ (0, 1/2), there is a constant K(δ) > 0 such that for all pairs of initial states a, b. The constant K(δ) depends only on δ, but not on a, b, n or the particular Markov chain X.
The rest of the paper is organized as follows. In section 2.1, we prove a stronger version of Theorem 1 in the case that the invariant distribution ν is the uniform distribution on I. In order to do this, we show a novel entropy identity (see Lemma 4) allowing us to relate the increase in the entropy of the one-dimensional distributions of the Markov chain to the convergence of the chain to its equilibrium. In section 2.2, we prove Theorem 1 by suitably adapting the entropy identity of Lemma 4 to the general setting. In section 3.1, we give a global (or averaged) version of Theorem 2 and present the proof of Theorem 2. Subsequently, we explain in section 3.2 how Theorem 2 extends to general continuous time irreducible reversible finite-state Markov chains. Then, in section 4, we present a universal way of defining the ultrametric partition structure on the state space of a continuous time irreducible reversible finite-state Markov chain. Finally, in section 5, we show, in the case that the chain is reversible with respect to the uniform distribution, how the entropy of the one-dimensional distributions of the chain can be used to control the global convergence of the chain to its equilibrium.

Estimates in total variation norm
In this section we give a control on the convergence to equilibrium and the times of coupling with respect to the total variation norm by analyzing the change in the entropy of the Markov chain over time.
2.1. Markov chains reversible with respect to the uniform distribution. The following theorem is a stronger version of Theorem 1 for the special case of Markov chains, which are reversible with respect to the uniform distribution.
Theorem 3. Consider the setting of Theorem 1 and assume, in addition, that the Markov chain X is reversible with respect to the uniform distribution on I = {1, 2, ..., n}. Then:
(a) There is a constant C(δ) > 0 depending only on δ (and not on n or the particular Markov chain) such that for all initial states a of the Markov chain:
(b) There is a constant C̃(δ) > 0 depending only on δ (and not on n or the particular Markov chain) such that for all pairs (a, b) of initial states of the Markov chain:
The proof relies on the following entropy identity.
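Lemma 4. Let X(t), t ≥ 0 be a Markov chain as in Theorem 3 started in an initial state a ∈ I. Then, for all t ≥ 0:
H(P^a_{2t}) − H(P^a_t) = H(P^a_{t,2t} | P^a_{3t,2t}).
Hereby, H(·|·) stands for the relative entropy and P^a_{u,s} stands for the law of the random vector (X(u), X(s)). In particular, by Pinsker's inequality applied to the first marginals of the two laws, the entropy increment H(P^a_{2t}) − H(P^a_t) is bounded below by a constant multiple of ‖P^a_{3t} − P^a_t‖_TV^2 for all t ≥ 0 and all initial states a ∈ I.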
Proof of Lemma 4. We start the proof with the following elementary computation, which only relies on the Markov property of X: .
We now exploit the symmetry of the transition matrices of the Markov chain X (which is due to the reversibility of the uniform distribution and the detailed balance condition) to deduce In addition, the Markov property of X yields for all (i, j) ∈ I^2. Putting the latter three observations together, we end up with the lemma.
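The mechanism of this proof can be checked numerically on a small symmetric chain. For symmetric transition matrices, the law of (X(t), X(2t)) at (j, i) is P^a_t(j) P_t(j, i), the law of (X(3t), X(2t)) at (j, i) is P^a_{2t}(i) P_t(j, i), and the relative entropy of the first with respect to the second collapses to H(P^a_{2t}) − H(P^a_t). The sketch below verifies this for the nearest-neighbor walk on a cycle, together with the Pinsker-type lower bound in terms of the first marginals. The choice of chain, the normalization of the total variation norm as half the l^1 distance and all names are assumptions of the sketch.

```python
import math

def cycle_matrix(n, rho, t):
    """P_t for the rate-rho nearest-neighbor walk on an n-cycle (Fourier form)."""
    lam = [2.0 * rho * (1.0 - math.cos(2.0 * math.pi * k / n)) for k in range(n)]
    by_distance = [
        sum(math.exp(-lam[k] * t) * math.cos(2.0 * math.pi * k * d / n)
            for k in range(n)) / n
        for d in range(n)
    ]
    return [[by_distance[(b - a) % n] for b in range(n)] for a in range(n)]

def entropy(p):
    """Shannon entropy -sum p log p, skipping non-positive entries."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def check_identity(n, rho, t, a=0):
    P = cycle_matrix(n, rho, t)
    p_t = P[a]
    p_2t = cycle_matrix(n, rho, 2.0 * t)[a]
    p_3t = cycle_matrix(n, rho, 3.0 * t)[a]
    rel_entropy = 0.0
    for j in range(n):          # j: value of the first coordinate
        for i in range(n):      # i: value of X(2t)
            forward = p_t[j] * P[j][i]     # P^a(X(t) = j, X(2t) = i)
            backward = p_2t[i] * P[i][j]   # P^a(X(3t) = j, X(2t) = i)
            rel_entropy += forward * math.log(forward / backward)
    entropy_gain = entropy(p_2t) - entropy(p_t)
    tv = 0.5 * sum(abs(x - y) for x, y in zip(p_3t, p_t))
    return entropy_gain, rel_entropy, tv

gain, rel, tv = check_identity(6, 1.0, 0.4)
```

With the half-l^1 normalization, Pinsker's inequality and the fact that passing to marginals does not increase the total variation distance give gain ≥ 2 tv^2.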
In the proof of Theorem 3 we will need the following simple calculus lemma.
for some non-negative real constants p ≤ q. Then, for every r > 0 and ε > 0, one has the inequality
Proof of Lemma 5. It suffices to observe the elementary inequality and, hence, yields the lemma.
We are now ready for the proof of Theorem 3.
Proof of Theorem 3. First, we note that part (b) of the theorem is a consequence of part (a) due to the inequalities . We turn now to the proof of part (a). Due to the inequality (2.4), it suffices to prove that for every δ > 0 there is a constant C(δ) > 0 depending only on δ (and not on a, n or the Markov chain X) such that Introducing the function g : R → [0, ∞), g(u) = H(P^a_{e^u}), we can rewrite the latter inequality as Noting that lim_{u→∞} g(u) = lim_{t→∞} H(P^a_t) = log n (since the uniform distribution is the unique stationary distribution of X), we see that the desired inequality holds with C(δ) = log 2 δ as a consequence of Lemma 5.
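The change of variables t = e^u used in this proof turns the scale-invariant measure into Lebesgue measure, and the entropy increments then telescope. For the first power of the increment this gives the exact value ∫_0^∞ (H(P^a_{2t}) − H(P^a_t)) (1/t) dt = log 2 · log n, since g(u) = H(P^a_{e^u}) tends to 0 as u → −∞ and to log n as u → ∞. The sketch below checks this numerically for the nearest-neighbor walk on a cycle. The chain, the integration window, the step size and the function names are assumptions chosen for illustration.

```python
import math

def entropy_at(n, rho, t):
    """Entropy of P_t(0, .) for the rate-rho walk on an n-cycle (Fourier form)."""
    h = 0.0
    for b in range(n):
        p = sum(
            math.exp(-2.0 * rho * (1.0 - math.cos(2.0 * math.pi * k / n)) * t)
            * math.cos(2.0 * math.pi * k * b / n)
            for k in range(n)
        ) / n
        if p > 0.0:   # skip entries that vanish or are rounding noise below zero
            h -= p * math.log(p)
    return h

def integrated_entropy_gain(n, rho, u_min=-25.0, u_max=12.0, step=0.01):
    """Riemann sum for the integral of H(P_{2t}) - H(P_t) against dt/t, t = e^u."""
    steps = int(round((u_max - u_min) / step))
    total = 0.0
    for i in range(steps):
        t = math.exp(u_min + i * step)
        total += (entropy_at(n, rho, 2.0 * t) - entropy_at(n, rho, t)) * step
    return total

n = 6
value = integrated_entropy_gain(n, 1.0)
target = math.log(2.0) * math.log(n)
```

The value does not depend on the rate ρ, in line with the scale invariance of (1/t) dt.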

2.2. General reversible Markov chains. In this subsection we consider a general continuous time irreducible reversible Markov chain X on I and prove Theorem 1. To start with, we recall the detailed balance condition:
(2.15) ν(a) P^a_t(b) = ν(b) P^b_t(a),
which holds for all times t ≥ 0 and all pairs of states (a, b) ∈ I^2. We now give the proof of Theorem 1.
Proof of Theorem 1. The first assertion in part (b) of the theorem is a direct consequence of the inequality (2.12) (which clearly remains true in the more general setting of the present theorem) and the first assertion in part (a) of the theorem. Moreover, the second assertions in both parts of the theorem follow from the first assertions in the corresponding parts of the theorem and Markov's inequality. For these reasons, we only need to prove the first assertion in part (a) of the theorem.
To this end, we fix an initial state a ∈ I and note that the same computation as in the proof of Lemma 4 above yields: .
As before, we have by the Markov property (2.16) P^a_t(j) P^j_t(i) = P^a(X(t) = j, X(2t) = i). Moreover, the detailed balance condition (2.15) gives Plugging this in, we get (2.18) where P^a_{t,2t} and P^a_{3t,2t} denote the laws of the random vectors (X(t), X(2t)) and (X(3t), X(2t)), conditioned on X(0) = a. In addition, writing log(ν(i)/ν(j)) = log ν(i) − log ν(j) and summing, we obtain Finally, integrating both sides of the latter equation with respect to ν and using the fact that ν is the invariant distribution of the Markov chain X, we end up with the averaged entropy identity In particular, this implies the inequality On the other hand, the first inequality in part (a) of the theorem is equivalent to This in turn would follow from However, due to the estimate (2.20), the left-hand side in the latter inequality is bounded above by This finishes the proof.

Estimates in L^2 norm
Throughout the first subsection of this section, we assume for the simplicity of notation that the continuous time Markov chain X is irreducible and reversible with respect to the uniform distribution on the set I = {1, 2, . . . , n}. We first give a global version of Theorem 2 in Theorem 6 and then prove Theorem 2 at the end of the first subsection. Subsequently, in the second subsection, we give the analogues of these results for a general continuous time irreducible reversible Markov chain on I.

3.1. Markov chains reversible with respect to the uniform distribution. In the following theorem we show that, for most of the time on the scale-invariant clock, the square of the L^2 distance between the one-dimensional distributions of the Markov chain started in a and the one-dimensional distributions of the Markov chain started in b, averaged over all pairs (a, b) ∈ I^2, stays close to the lattice
A_L = {0, 2/n, 4/n, ..., 2(n−1)/n}.
This statement can be viewed as a global (or averaged) version of Theorem 2. To make this statement precise, we write A^δ_L for the (2δ/n)-neighborhood of A_L in [0, 2(n−1)/n], where δ is a number in (0, 1/2), and can formulate the following result.
Theorem 6. In the setting of Theorem 2, for all 0 < δ < 1/2, there exists a constant K(δ) > 0 such that the estimate holds. Hereby, the constant K(δ) depends only on δ, and not on n or the particular Markov chain X.
Proof. To start with, we recall the notation L for the generating matrix of the Markov chain X, so that, in particular, the transition matrix P_t corresponding to a time t ≥ 0 is given by e^{tL}. Since X is irreducible and reversible with respect to the uniform distribution, the matrix −L is symmetric and non-negative definite and has the eigenvalues 0 = λ_1 < λ_2 ≤ λ_3 ≤ ... ≤ λ_n. In particular, each of the matrices P_t, t ≥ 0 is symmetric, positive definite and has the eigenvalues e^{−λ_1 t}, e^{−λ_2 t}, ..., e^{−λ_n t}. Writing ‖·‖_2 for the L^2 norm with respect to the counting measure on I and ⟨·,·⟩_2 for the corresponding scalar product, we can make the following computation: which is valid for all t ≥ 0.
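For a chain with symmetric transition matrices, the averaged squared L^2 distance has a closed spectral form: expanding the squares and using tr(P_t^2) = tr(P_{2t}) together with the fact that the columns of P_t sum to 1 gives (1/n^2) ∑_{a,b} ‖P^a_t − P^b_t‖_2^2 = (2/n) ∑_{l=2}^n e^{−2λ_l t}, a quantity in [0, 2(n−1)/n]. The sketch below compares both sides for the nearest-neighbor walk on a cycle, where the eigenvalues are explicit. The chain and all names are assumptions made for illustration, and this is a direct check rather than a transcription of the computation in the proof.

```python
import math

def cycle_matrix(n, rho, t):
    """P_t for the rate-rho nearest-neighbor walk on an n-cycle (Fourier form)."""
    lam = [2.0 * rho * (1.0 - math.cos(2.0 * math.pi * k / n)) for k in range(n)]
    by_distance = [
        sum(math.exp(-lam[k] * t) * math.cos(2.0 * math.pi * k * d / n)
            for k in range(n)) / n
        for d in range(n)
    ]
    return [[by_distance[(b - a) % n] for b in range(n)] for a in range(n)]

def averaged_sq_distance(P):
    """Average over all pairs (a, b) of the squared l2 distance of rows a and b."""
    n = len(P)
    total = 0.0
    for a in range(n):
        for b in range(n):
            total += sum((P[a][i] - P[b][i]) ** 2 for i in range(n))
    return total / (n * n)

def spectral_formula(n, rho, t):
    """(2/n) * sum over the non-zero eigenvalues lam of exp(-2 lam t)."""
    lam = [2.0 * rho * (1.0 - math.cos(2.0 * math.pi * k / n)) for k in range(1, n)]
    return (2.0 / n) * sum(math.exp(-2.0 * l * t) for l in lam)

n, rho, t = 7, 1.0, 0.35
direct = averaged_sq_distance(cycle_matrix(n, rho, t))
spectral = spectral_formula(n, rho, t)
```

In this form, closeness of the averaged quantity to a multiple of 2/n corresponds to the sum ∑_{l=2}^n e^{−2λ_l t} being close to an integer.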
Next, we set The continuity of the function f implies f(t_k) = k − δ. In particular, it follows that (3.6) f(2t_k) ≤ max Moreover, since the maximum of the convex function (x_1, ..., x_{n−1}) ↦ x_1^2 + x_2^2 + ... + x_{n−1}^2 is taken over a convex set, it must be attained at a boundary point of that set. In other words, at the optimizing point it must hold x_l ∈ {0, 1} for at least one 1 ≤ l ≤ n − 1. Eliminating the corresponding variable, we obtain a maximization problem of the same type and conclude that at least one other coordinate x_l has to belong to the set {0, 1}. Proceeding with the same argument, we conclude that for each optimizing point (x_1, x_2, ..., x_{n−1}), there must be (k − 1) coordinates which are equal to 1, (n − k − 1) coordinates which are equal to 0, and one coordinate which is equal to 1 − δ. Thus, we have: Now, either f(2t_k) ≤ (k − 1) + δ, or we can proceed with the same argument to conclude Proceeding further with the same argument, we end up with where ⌈·⌉ denotes the closest integer from above.
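The maximization step can be illustrated by a small experiment. Assuming the feasible set consists of the points of [0, 1]^{n−1} whose coordinates sum to k − δ (this matches the described optimizer, whose coordinates sum to k − δ), the value of the sum of squares at the optimizing point is (k − 1) + (1 − δ)^2. The sketch below samples random feasible points and records the largest sum of squares found, which should not exceed this value. The sampler, the parameter values and the names are assumptions of the sketch.

```python
import random

def vertex_value(k, delta):
    """Sum of squares at the point with k-1 ones, one entry 1-delta, zeros else."""
    return (k - 1) + (1.0 - delta) ** 2

def random_feasible_point(m, target, rng):
    """Rejection sampler for points of [0, 1]^m whose coordinates sum to target."""
    while True:
        x = [rng.random() for _ in range(m)]
        s = sum(x)
        if s == 0.0:
            continue
        x = [target * v / s for v in x]   # rescale onto the hyperplane
        if max(x) <= 1.0:                 # keep only points inside the cube
            return x

rng = random.Random(0)          # fixed seed for reproducibility
m, k, delta = 6, 3, 0.25        # m plays the role of n - 1
target = k - delta
bound = vertex_value(k, delta)
best_sampled = max(
    sum(v * v for v in random_feasible_point(m, target, rng))
    for _ in range(2000)
)
```

Random sampling cannot prove the bound, but it makes the convexity argument tangible: mass spread over several coordinates always yields a smaller sum of squares than mass concentrated in as few coordinates as the constraints allow.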
This together with the proof of Theorem 6 shows that the order n of the upper bound in Theorem 6 is optimal. As will become clear from the proofs below, the same is true for the upper bound of Theorem 2, and the counterparts of these results for general continuous time irreducible reversible Markov chains treated in section 3.2.
We proceed with the proof of Theorem 2.
Proof of Theorem 2. To start with, we introduce the following notations: .
From now on, we fix a k = 2, 3, ..., n and will show that for a suitable constant K(δ) > 0, which depends only on δ (but not on a, b, k or n). To this end, we note that the identity and the inequality 0 < λ_2 ≤ λ_3 ≤ ... ≤ λ_n imply the estimate (3.18) Moreover, if we have μ_l^2 > 0 for all l = 2, 3, ..., n, then the function (x_2, x_3, ..., x_n) ↦ ∑_{l=2}^n μ_l^2 x_l^2 is strictly convex and must attain its maximum at a vertex point of the convex polyhedron (x_2, x_3, ..., x_n) : If we have μ_l^2 = 0 for some l ∈ {2, 3, ..., n}, then we can eliminate the corresponding coordinate in the maximization problem and make the same conclusion for the reduced maximization problem. For this reason, we may assume without loss of generality that μ_l^2 > 0 for all l = 2, 3, ..., n. Moreover, since the hyperplane appearing in the definition of this polyhedron is (n − 2)-dimensional, the vertices of the polyhedron above are given by points 1 ≥ x_2 ≥ x_3 ≥ ... ≥ x_n ≥ 0, for which (n − 2) of the inequalities Thus, each optimizing point of the maximization problem above can be described as follows: There is a partition of {2, 3, ..., n} into three sets I_1, I_2, I_3 of the form {2, 3, ..., l_1}, {l_1 + 1, l_1 + 2, ..., l_2}, {l_2 + 1, l_2 + 2, ..., n}, respectively, such that for all l ∈ I_1 it holds x_l = 1, for all l ∈ I_2 we have x_l = ζ for a suitable ζ ∈ [0, 1], and for all l ∈ I_3 it holds x_l = 0. Moreover, the identity shows that the value of ζ is given by and that I_1 ⊂ {2, 3, ..., k − 1}. To proceed, we introduce the set This allows us to make the following computation: Next, we note that the latter fraction is of the form (A·B)/(A+B) = 1/(1/A + 1/B), whereby: Proceeding with the same argument, we conclude that for all natural numbers R ≥ 1−2δ In particular, we conclude that where ⌈·⌉ denotes the closest integer from above. This shows the claim (3.16) with and finishes the proof.

3.2. General reversible Markov chains. We proceed with the analogues of Theorems 2 and 6 for a general continuous time irreducible reversible Markov chain X. To state the results, we introduce the following set of notations. We write ν for the invariant measure of X as before, and let D be the diagonal matrix whose diagonal entries are given by ν(i), i ∈ I. Then, by the detailed balance condition (2.15), the matrix D^{1/2} P_t D^{−1/2} is symmetric for all t ≥ 0. Moreover, since the matrices D^{1/2} P_t D^{−1/2}, t ≥ 0 commute, they have a joint orthonormal basis of eigenvectors v_1, v_2, ..., v_n corresponding to sets of eigenvalues respectively (see chapter 3 of the book [1] for more details). In addition, for any fixed pair (a, b) of initial states, we let With these notations, the analogues of Theorems 2 and 6 read as follows.
Proof. In order to prove (3.28), we use the fact that ν is the invariant distribution of the Markov chain X to deduce the identities which hold for all t ≥ 0. Moreover, the latter sum is given by the sum of squares of the entries of the matrix D^{1/2} P_t D^{−1/2} and is, hence, equal to 1 + ∑_{l=2}^n e^{−2λ_l t}. Thus, From this point on, one can proceed as in the proof of Theorem 6 to show (3.28). Now, we turn to the proof of (3.29). To this end, we note that the detailed balance condition (2.15) implies D P_t = P_t^T D, t ≥ 0, where the superscript T stands for the transpose of a matrix. This allows us to make the computation for all t ≥ 0. Next, we observe that the vectors D^{1/2} v_1, D^{1/2} v_2, ..., D^{1/2} v_n form an orthonormal basis with respect to the scalar product ⟨·,·⟩_{L^2(ν^{−1})}, since the vectors v_1, v_2, ..., v_n form an orthonormal basis with respect to the standard Euclidean scalar product. Hence, From this point on, one only needs to follow the arguments in the proof of Theorem 2 to end up with (3.29).
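The symmetrization D^{1/2} P_t D^{−1/2} and the sum-of-squares identity used in this proof can be checked by hand on the simplest non-uniform example. The sketch below uses a two-state chain with rates α (from state 0 to state 1) and β (from state 1 to state 0), for which P_t is explicit, ν = (β, α)/(α + β) and the only non-zero eigenvalue of −L is λ_2 = α + β. The chain, the parameter values and the variable names are assumptions made for illustration.

```python
import math

def two_state_kernel(alpha, beta, t):
    """P_t for the two-state chain with rates 0 -> 1 at alpha and 1 -> 0 at beta."""
    total = alpha + beta
    decay = math.exp(-total * t)
    p01 = alpha / total * (1.0 - decay)
    p10 = beta / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

alpha, beta, t = 0.3, 2.0, 0.7
nu = [beta / (alpha + beta), alpha / (alpha + beta)]   # invariant distribution
P = two_state_kernel(alpha, beta, t)

# S = D^{1/2} P_t D^{-1/2} with D = diag(nu).
S = [[math.sqrt(nu[i]) * P[i][j] / math.sqrt(nu[j]) for j in range(2)]
     for i in range(2)]

# Sum of squares of the entries of S versus 1 + exp(-2 * lambda_2 * t).
frobenius_sq = sum(S[i][j] ** 2 for i in range(2) for j in range(2))
spectral = 1.0 + math.exp(-2.0 * (alpha + beta) * t)
```

Here S is symmetric even though P_t is not, and the sum of squares of its entries equals tr(S^2) = tr(P_{2t}), which is the quantity 1 + ∑_{l=2}^n e^{−2λ_l t} for n = 2.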
Remark 2. It is worth noting that the estimates of Theorems 2, 6 and 7 hold for and ‖P^a_t − P^b_t‖^2_{L^2(ν^{−1})}, respectively. The same proofs apply, with the only difference being that one needs to expand the vectors (e_a − ν) and D^{−1/2}(e_a − ν) in terms of an orthonormal basis of eigenvectors of the matrices P_t, t ≥ 0 and D^{1/2} P_t D^{−1/2}, t ≥ 0, respectively.

A universal approach to the ultrametric structure
In this section we provide a universal way of defining the ultrametric partition structure on the state space I = {1, 2, ..., n} of a continuous time irreducible Markov chain X, which is reversible with respect to its invariant distribution ν. Typical examples of such chains are encountered in statistical physics, where often the transition rate for a pair (a, b) of neighboring states is proportional to e^{−β(E(b)−E(a))_+} with E being an energy functional (see the references given in the introduction, as well as the references therein). For large values of β, the energy landscape naturally provides a partition of the state space into states of different types, which are separated by potential wells (see Figure 1 for a schematic diagram).
Here, we will give a universal way of defining the partition structure without making use of the explicit knowledge of the transition rates. Thereby, each of the partitions will correspond to a time scale on which convergence to equilibrium occurs for the Markov chain in consideration. For this purpose, we let ‖·‖ be any norm on the space of finite measures on the set I, which is normalized in such a way that ‖π_1 − π_2‖ ≤ 1 for any two probability measures π_1, π_2 on I. Moreover, we assume that the function t ↦ ‖π_1 P_t − π_2 P_t‖ is strictly decreasing on [0, ∞) and tends to zero in the limit t → ∞ for all probability measures π_1 ≠ π_2 on I (hereby, the products π_1 P_t, π_2 P_t should be understood in the sense of multiplication of a probability measure by a stochastic kernel). Examples of such norms are the appropriately normalized total variation and L^2 norms discussed above. Now, we fix an 0 < ε < 1 and will recursively define equivalence relations ∼_1, ∼_2, ... on I, which will induce the desired sequence of nested partitions. To define ∼_1, we set Then, we let a ∼_1 b iff either a = b, or
with respect to the uniform distribution. To this end, for each 0 < κ < 1 and t ≥ 0, we introduce the set where, with a slight abuse of notation, we wrote P_t for the law of the random variable X(t). For each t ≥ 0, the set EQ(t) ⊂ I should be viewed as the part of the state space on which the probability measure P_t is close to the equilibrium distribution of the Markov chain X. We are interested in lower bounds on the size |EQ(t)| of such sets.
Proof. We fix numbers κ and α as in the statement of the theorem and suppose that the inequality (5.3) does not hold. We will show that this implies that the entropy bound (5.2) cannot hold. To start with, we introduce the notation p_a := P_t(a), a ∈ I, and make the decomposition For a given value of ρ := ∑_{a∈EQ(t)} p_a ∈ [0, 1], the maximum of the function −∑_{b∉EQ(t)} p_b log p_b is attained on the interior boundary of the set Indeed, this is a consequence of the fact that the function is concave and attains its maximum over the convex set {∑_{b∉EQ(t)} p_b = 1 − ρ} at the point ((1−ρ)/(n−|EQ(t)|), (1−ρ)/(n−|EQ(t)|), ..., (1−ρ)/(n−|EQ(t)|)), which is not an element of the set in (5.5). The latter statement follows from the inequalities < (1 + κ)/n with the respective second inequalities in the latter two displays being consequences of |EQ(t)| < n/2.
We also observe that the function F must be non-positive throughout [0, α], since the entropy of a probability measure on a set of n elements cannot exceed the value log n. Moreover, since the function F̃ is concave, the function F is also concave. Furthermore, a straightforward computation shows that, depending on the values of κ and α, either the derivative of the function F has no zeros on the interval [0, α], in which case F attains its maximum at one of the boundary points, or the only zero of the derivative of F on the interval [0, α] is given by α_1^* (defined in the statement of the theorem), in which case F attains its maximum at α_1^*. This finishes the proof.

Acknowledgement
The author would like to thank David J. Aldous for his comments throughout the preparation of this work. He is also grateful to Anton Bovier for his remarks on an early version of this manuscript.