Mixing and hitting times for finite Markov chains

Let 0 < α < 1/2. We show that the mixing time of a continuous-time reversible Markov chain on a finite state space is about as large as the largest expected hitting time of a subset of stationary measure at least α of the state space. Suitably modified results hold in discrete time and/or without the reversibility assumption. The key technical tool is a construction of a random set A such that the hitting time of A is both light-tailed and a stationary time for the chain. We note that essentially the same results were obtained independently by Peres and Sousi [arXiv:1108.0133].


Introduction
The present paper is a contribution to the general quantitative theory of finite-state Markov chains that was started in [2] and further developed in [4]. The gist of those papers is that the so-called mixing time of a Markov chain is fundamentally related, in a precise quantitative sense, to hitting times and other quantities of interest. Our main achievement is to add a new equivalent quantity to this list by showing that mixing times nearly coincide with maximum hitting times of large sets in the state space.
Remark 1 (Important remark) The results in this paper were proven (but not made public) around May 2010. In July 2011 we learned that extremely similar results for discrete-time chains have been proven independently by Peres and Sousi [9]. We then decided to submit our results, in the hope that our ideas might also be found useful and interesting. We will discuss their results at several points in our paper.
Here we just mention that the main difference between the papers is the construction of the stopping time in Lemma 1 (see Section 1.1).
We need to introduce some notation before we clarify what we mean; [3] and [5] are our main references for the concepts involved. In this paper E will always denote the finite state space of a continuous-time Markov chain with generator Q and transition rates q(x, y) (x, y ∈ E, x ≠ y). Most of the time Q and E will be implicit in our notation. The trajectories of the chain are denoted by {X_t}_{t≥0}, and the laws of {X_t}_{t≥0} started from x ∈ E or from a probability distribution µ over E are denoted by P_x or P_µ (respectively). For t ≥ 0, we write:
p_t(x, y) ≡ P_x(X_t = y) (x, y ∈ E)
for the transition probability from x to y at time t. In what follows we will always assume that Q is irreducible, which implies that it has a unique stationary distribution π and:
∀(x, y) ∈ E², lim_{t→+∞} p_t(x, y) = π(y).
We can measure the rate of this convergence once we introduce a metric over probability distributions. We choose the total variation metric:
d_TV(µ, ν) ≡ max_{S⊂E} |µ(S) − ν(S)|,
and define the mixing time of Q as:
T^Q_mix(δ) ≡ inf{t ≥ 0 : ∀x ∈ E, d_TV(p_t(x, ·), π(·)) ≤ δ}.
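For concreteness, these definitions can be checked numerically on small chains. The sketch below is our own illustration (all function names are ours, not from the paper): it computes p_t = exp(tQ) by uniformization and locates T^Q_mix(δ) by bisection, using the fact that the worst-case total variation distance is non-increasing in t.

```python
import numpy as np

def heat_kernel(Q, t, terms=200):
    """p_t = exp(tQ) via uniformization (no scipy needed; adequate for moderate q*t)."""
    n = len(Q)
    q = max(-Q[i, i] for i in range(n))      # uniformization rate
    P = np.eye(n) + Q / q                    # discrete-time skeleton chain
    out = np.zeros((n, n))
    coeff, Pk = np.exp(-q * t), np.eye(n)
    for k in range(terms):                   # Poisson(q*t) mixture of P^k
        out += coeff * Pk
        coeff *= q * t / (k + 1)
        Pk = Pk @ P
    return out

def d_tv(mu, nu):
    """Total variation distance max_S |mu(S) - nu(S)|, i.e. half the L1 distance."""
    return 0.5 * np.abs(mu - nu).sum()

def mixing_time(Q, pi, delta=0.25, tol=1e-6):
    """T_mix(delta): bisection is valid because max_x d_TV(p_t(x,.), pi)
    is non-increasing in t for continuous-time chains."""
    worst = lambda t: max(d_tv(row, pi) for row in heat_kernel(Q, t))
    lo, hi = 0.0, 1.0
    while worst(hi) > delta:
        hi *= 2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if worst(mid) > delta else (lo, mid)
    return hi

# Two-state chain flipping at rate 1: the worst-case distance is e^{-2t}/2,
# so T_mix(1/4) = log(2)/2.
Q = np.array([[-1.0, 1.0], [1.0, -1.0]])
pi = np.array([0.5, 0.5])
t_mix = mixing_time(Q, pi)
```

The two-state example has a closed-form answer, which makes it a convenient sanity check for the truncation and the bisection tolerance.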
Finally, given ∅ ≠ A ⊂ E, we define the hitting time of A as:
H_A ≡ inf{t ≥ 0 : X_t ∈ A}.
Results for reversible chains. Recall that Q is reversible if π(x)q(x, y) = π(y)q(y, x) for all distinct x, y ∈ E. In this setting, Aldous proved:
Theorem 1 (Aldous, [2]) There exist universal (i.e. chain-independent) constants c_−, c_+ > 0 such that, for any irreducible, reversible, finite-state-space Markov chain in continuous time with generator Q:
c_− T^Q_hit ≤ T^Q_mix(1/4) ≤ c_+ T^Q_hit, where T^Q_hit ≡ max{π(A) E_x[H_A] : x ∈ E, ∅ ≠ A ⊂ E}.
Notice that T^Q_hit = 1 if Q consists of iid jumps at rate 1 between states in E, so T^Q_hit can be viewed as a measure of how "non-iid" the chain is. Informally, the mixing time is another measure of "non-iid-ness", and the Theorem shows that these two measures are quantitatively related in a very strong sense. We emphasize that Theorem 1 is part of a much larger family of universal inequalities for reversible Markov chains; see [2] for details. In this paper we prove a stronger form of Theorem 1. Given α > 0, let:
T^Q_hit(α) ≡ max{E_x[H_A] : x ∈ E, A ⊂ E, π(A) ≥ α}.
Unlike T^Q_hit, only "large enough" sets are considered in this definition. We prove in Section 4 that:
Theorem 2 For any 0 < α < 1/2 there exist constants C_+(α), C_−(α) > 0 depending only on α such that, for any irreducible, reversible continuous-time Markov chain as above:
C_−(α) T^Q_hit(α) ≤ T^Q_mix(1/4) ≤ C_+(α) T^Q_hit(α).
Although similar to Theorem 1, the intuitive content of Theorem 2 seems different: instead of measures of non-iid-ness, we have a statement that says that mixing times are about as large as the expected time necessary to hit any large set, which is quite reasonable. Theorem 2 should also be easier to use in applications. The condition α < 1/2 is discussed in Section 1.1.
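Both sides of Theorem 2 are computable for small chains. The sketch below (our illustration; the function names are ours) solves the standard first-step linear system for E_x[H_A] and brute-forces T^Q_hit(α) over all sufficiently large sets.

```python
import numpy as np
from itertools import combinations

def expected_hitting_times(Q, A):
    """h(x) = E_x[H_A]: h = 0 on A, and Q h = -1 off A (first-step analysis)."""
    n = len(Q)
    B = [x for x in range(n) if x not in A]
    h = np.zeros(n)
    if B:
        h[B] = np.linalg.solve(Q[np.ix_(B, B)], -np.ones(len(B)))
    return h

def t_hit_alpha(Q, pi, alpha):
    """T_hit(alpha) = max E_x[H_A] over x and sets A with pi(A) >= alpha.
    Brute force over all subsets, so only feasible for small state spaces."""
    n = len(Q)
    return max(expected_hitting_times(Q, set(A)).max()
               for r in range(1, n + 1)
               for A in combinations(range(n), r)
               if pi[list(A)].sum() >= alpha)

# Two-state chain flipping at rate 1: each singleton has mass 1/2 >= 0.4
# and is reached in expected time 1, so T_hit(0.4) = 1.
Q = np.array([[-1.0, 1.0], [1.0, -1.0]])
pi = np.array([0.5, 0.5])
val = t_hit_alpha(Q, pi, 0.4)
```

The linear-system characterization of E_x[H_A] is the standard one; the enumeration over subsets is purely for illustration and scales as 2^|E|.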
Remark 2 Theorem 2 also holds in discrete time if p_1(x, x) ≥ 1/2 for all x ∈ E (use [5, Theorem 20.3]). Peres and Sousi [9] have shown that p_1(x, x) ≥ β for any fixed β > 0 also suffices. Some lower bound on p_1(x, x) is necessary; otherwise there are counterexamples, such as large complete bipartite graphs with an edge added to one of the parts.
Results for non-reversible chains. Theorem 2 and the main results of [2] only apply to reversible chains; counterexamples can be found in that paper. Aldous, Lovász and Winkler [4] developed a quantitative theory in the general case using a different notion of mixing time. Let M_1([0, t]) be the set of all probability measures over [0, t] and define:
d_r(t) ≡ inf_{µ ∈ M_1([0,t])} max_{x∈E} d_TV(∫_0^{+∞} p_s(x, ·) µ(ds), π(·)), T^Q_rmix(δ) ≡ inf{t ≥ 0 : d_r(t) ≤ δ}.
In discrete time, one replaces M_1([0, t]) with the set M_1({0, . . . , t}) of all probability measures over {0, . . . , t}. Aldous, Lovász and Winkler [4] proved an analogue of Theorem 1 for arbitrary Markov chains in discrete time, where T^Q_rmix replaces T^Q_mix (their method can also be applied in continuous time). We prove an analogue of Theorem 2 in this setting:
Theorem 3 For any α ∈ (0, 1/2) there exist C′_−(α), C′_+(α) > 0 such that for any irreducible finite-state Markov chain Q in continuous time:
C′_−(α) T^Q_hit(α) ≤ T^Q_rmix(1/4) ≤ C′_+(α) T^Q_hit(α).
Remark 3 Our proof can be easily adapted to discrete time. Peres and Sousi [9] have proved a variant of Theorem 3 where T^Q_rmix(1/4) is replaced by another notion of time-averaged mixing, with µ a geometric distribution with success probability 1/t.

Discussion of the results
Outside of potential applications to bounding mixing, Theorems 2 and 3 seem conceptually interesting. They show that mixing times are natural in that they are strongly related to hitting times, a quantity of intrinsic interest. For instance, we have the following immediate corollary of Theorem 3.
Corollary 1 There exists some universal C > 0 such that, for any irreducible Markov chain in discrete or continuous time,
T^Q_hit ≤ C T^Q_hit(1/3).
We omit the proof, which follows from T^Q_hit ≤ c T^Q_rmix(1/4) ≤ c′ T^Q_hit(1/3) (with c, c′ > 0 universal). This result says that one may control the hitting times of small sets via those of large sets. Other applications of (slight variants of) our theorems are considered in [9].
The limitation α < 1/2 is not clearly necessary for the Theorems to hold. However, Peres [8] noted that one cannot allow α > 1/2. In that case one may contradict the two theorems by connecting two complete graphs K_n by a single edge. In this case T^Q_hit(α) = O(n) whenever α > 1/2, since any set A with π(A) ≥ α occupies a constant proportion of the mass of each clique. However, mixing requires crossing the connecting edge, so T^Q_mix(1/4) ≥ T^Q_rmix(1/4) = Ω(n²). The interesting question is then what happens at α = 1/2: in 2009 Peres conjectured that T^Q_hit(1/2) is also "equivalent up to universal constant factors" to T^Q_mix(1/4) (for lazy and reversible Q) and T^Q_rmix(1/4) (in general) [1]. We prove this result in an upcoming paper with Griffiths, Kang and Patel.
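Peres's counterexample can be made concrete numerically. The sketch below is our own illustration (clique size n = 6, mass threshold 0.55, and the rate-1 uniform-neighbor walk are our choices): it joins two copies of K_6 by one edge and compares the worst expected hitting time of sets of stationary mass at least 0.55 with the worst expected time to reach the opposite clique, which forces a crossing of the bridge.

```python
import numpy as np
from itertools import combinations

def generator(adj):
    """Continuous-time walk that jumps at rate 1 to a uniform neighbor."""
    n = len(adj)
    Q = np.zeros((n, n))
    for x in range(n):
        for y in adj[x]:
            Q[x, y] = 1.0 / len(adj[x])
        Q[x, x] = -1.0
    return Q

def max_hitting(Q, A):
    """max_x E_x[H_A], solving Q h = -1 off A with h = 0 on A."""
    B = [x for x in range(len(Q)) if x not in A]
    if not B:
        return 0.0
    return np.linalg.solve(Q[np.ix_(B, B)], -np.ones(len(B))).max()

n, N = 6, 12                                  # two copies of K_6
adj = [set() for _ in range(N)]
for clique in (range(n), range(n, N)):
    for x in clique:
        adj[x] |= set(clique) - {x}
adj[0].add(n)
adj[n].add(0)                                 # the single connecting edge
Q = generator(adj)
deg = np.array([len(a) for a in adj], float)
pi = deg / deg.sum()                          # stationary distribution of this walk

# Worst expected hitting time over ALL sets of mass >= 0.55 (brute force):
t_hit = max(max_hitting(Q, set(A))
            for r in range(1, N + 1)
            for A in combinations(range(N), r)
            if pi[list(A)].sum() >= 0.55)
# Worst expected time to reach the opposite clique (must cross the bridge):
t_cross = max_hitting(Q, set(range(n, N)))    # much larger than t_hit
```

Already at n = 6 the crossing time dwarfs the large-set hitting time, and the gap widens like n versus n² as the cliques grow.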

Steps of the proof
The main step in the proof is Lemma 1, proven in Section 2. We construct there a randomized stopping time T, which depends on the initial distribution, such that X_T has the stationary distribution. This stopping rule is the hitting time of a randomly chosen subset A ⊂ E, where the possible values of A form a chain A_1 ⊃ A_2 ⊃ · · · ⊃ A_n. We will see that this property implies that we can control the tail of H_A via T^Q_hit(α). We note that this stopping time was outlined in [7, Theorem 5.4] and [6, Theorem 4.9], but it is not explicit anywhere. Moreover, results in [7] imply that T is minimal in some sense (cf. Remark 5). Peres and Sousi [9] prove similar results via another minimal stopping rule, the so-called filling rule that was also employed in [2, 4]. We believe that our construction provides an interesting alternative point of view.
After the construction of T, the paper continues with the proof of Theorem 3 in Section 3. The argument employs Lemma 1 together with a simple coupling devised in the survey [6]. The proof of Theorem 2 in Section 4 follows an involved computation in [2], which we reproduce in order to get the sharp form we need. An Appendix presents a simple lower bound for T^Q_rmix(α/2) in terms of T^Q_hit(α).

Acknowledgements
We thank Yuval Peres for the counterexample in Section 1.1 [8] and both him and Perla Sousi for presenting [9] to us.

A special stationary stopping time
We use the notation of Section 1. Recall that a randomized stopping time for this chain is a [0, +∞)-valued random variable T such that, for all t ≥ 0, the event {T ≤ t} is measurable relative to the σ-field generated by {X_s}_{s≤t} and an independent random variable U.
Lemma 1 Suppose µ_0 is a probability measure over E. Then there exists a randomized stopping time T with P_{µ0}(X_T = ·) = π(·) and such that, for every ε ∈ (0, 1) and t > 0:
P_{µ0}(T > t) ≤ ε + T^Q_hit(ε)/t.
Remark 4 The same result works (with a slightly different proof) if π is replaced by another target distribution µ_1 over E and µ_1 substitutes π in the definition of T^Q_hit(ε).
Remark 5 Although we do not use this, one can show that E_{µ0}[T] is minimal among all randomized stopping times S with P_{µ0}(X_S = ·) = π(·). This is because our T has a halting state [6, Theorem 4.5].

Remark 6
We note from the definitions that T^Q_hit(ε) ≤ T^Q_hit/ε. We may plug this into Lemma 1 and optimize over ε (taking ε = (T^Q_hit/t)^{1/2}) to deduce:
P_{µ0}(T > t) ≤ 2 (T^Q_hit/t)^{1/2} (t > 0).
Aldous [2] proves a similar bound for a different stopping time, which he uses to prove Theorem 1. The same proof would go through with our own T. Another proof of Theorem 1 is presented in [9].
Proof: [of Lemma 1] Let n ≡ |E| denote the cardinality of E. The idea of the proof is to find a chain of non-empty subsets E = A_1 ⊃ A_2 ⊃ · · · ⊃ A_n and numbers p_1, . . . , p_n ≥ 0 with Σ_i p_i = 1. We then define a random set A that equals A_i with probability p_i and set T = H_A. We will then show that if {X_t}_{t≥0} is a realization of the chain with law P_{µ0}, independent of A, then Law(X_T) = π. The tail behavior of T = H_A will follow automatically from the construction.
Notation. For any set ∅ ≠ S ⊂ E, let ρ_S(·) ≡ P_{µ0}(X_{H_S} = ·) denote the harmonic measure of S for the chain started from µ_0. The irreducibility of the chain implies that H_S < +∞ P_{µ0}-a.s., and therefore ρ_S is a probability measure over E with support contained in S.
Inductive construction of (A_i, p_i): Set A_1 = E and choose a_1 ∈ A_1 so that ρ_{A_1}(a_1)/π(a_1) is the maximum of ρ_{A_1}(a)/π(a) over all a ∈ A_1. Since the π-weighted average of such ratios satisfies:
Σ_{a∈A_1} π(a) [ρ_{A_1}(a)/π(a)] = Σ_{a∈A_1} ρ_{A_1}(a) = 1,
the maximal value must satisfy ρ_{A_1}(a_1)/π(a_1) ≥ 1. We then choose p_1 = π(a_1)/ρ_{A_1}(a_1) and note that p_1 ∈ [0, 1], p_1 ρ_{A_1}(a_1) = π(a_1) and p_1 ρ_{A_1}(a) ≤ π(a) for all other a ∈ E\{a_1}. Assume inductively that we have chosen distinct elements a_1, . . . , a_k ∈ E and numbers 0 ≤ p_1, . . . , p_k ≤ 1 such that if A_i = E\{a_j : 1 ≤ j < i} (1 ≤ i ≤ k), we have the following properties:
1. Σ_{j=1}^{i} p_j ρ_{A_j}(a_i) = π(a_i) for each 1 ≤ i ≤ k;
2. Σ_{j=1}^{k} p_j ρ_{A_j}(a) ≤ π(a) for each a ∈ E\{a_1, . . . , a_k}.
Assume also that k < n, so that A_{k+1} = E\{a_1, . . . , a_k} is non-empty. We will prove that one may choose (p_{k+1}, a_{k+1}) so as to preserve these properties for one further step. The following claim is the key:
Claim: the set P_{k+1} ≡ {(p, a) : a ∈ A_{k+1}, p ∈ [0, 1], Σ_{j=1}^{k} p_j ρ_{A_j}(a) + p ρ_{A_{k+1}}(a) = π(a)} is non-empty.
Given the claim, we choose a pair (p_{k+1}, a_{k+1}) ∈ P_{k+1} with minimum value of the first coordinate. Let us show that condition 2. above remains valid for a ∈ E\{a_1, . . . , a_{k+1}}. Any a violating 2. would have to satisfy:
Σ_{j=1}^{k+1} p_j ρ_{A_j}(a) > π(a),
and this would imply (by continuity in p) that there is some 0 ≤ p < p_{k+1} with:
Σ_{j=1}^{k} p_j ρ_{A_j}(a) + p ρ_{A_{k+1}}(a) = π(a),
which would contradict the minimality of p_{k+1}.
To prove that condition 1. also remains valid, we simply observe that it certainly holds for a k+1 and that it also holds for a i , i < k + 1, because a i ∈ A k+1 and therefore ρ A k+1 (a i ) = 0 . Hence such a choice of p k+1 , a k+1 preserves the induction hypothesis for one more step.
We now prove the Claim. Notice that:
Σ_{a∈A_{k+1}} ρ_{A_{k+1}}(a) = 1 ≥ π(A_{k+1}) = Σ_{a∈A_{k+1}} π(a).
Since both sides are sums over the same finite set, there must exist some a ∈ A_{k+1} with ρ_{A_{k+1}}(a) ≥ π(a) > 0, whence:
π(a)/ρ_{A_{k+1}}(a) ≤ 1.

Moreover, the inductive assumption 2. implies that π(a) − Σ_{j=1}^{k} p_j ρ_{A_j}(a) ≥ 0, so that:
p ≡ [π(a) − Σ_{j=1}^{k} p_j ρ_{A_j}(a)] / ρ_{A_{k+1}}(a) ∈ [0, 1] and (p, a) ∈ P_{k+1},
which proves the claim.
Analysis of the construction. Carrying the induction to its end at k = n implies that there exist p_1, . . . , p_n ∈ [0, 1] and an ordering a_1, . . . , a_n of the elements of E such that, if A_i ≡ E\{a_j : 1 ≤ j < i}, then:
∀1 ≤ i ≤ n: π(a_i) = Σ_{j=1}^{i} p_j ρ_{A_j}(a_i) = Σ_{j=1}^{n} p_j ρ_{A_j}(a_i) (1)
(the last identity in the RHS follows from a_i ∉ A_j for j > i, so that ρ_{A_j}(a_i) = 0 for those j).
These are the only facts about the construction we will use in the remainder of the analysis. We now prove some consequences of these facts. First notice that:
Σ_{i=1}^{n} p_i = Σ_{i=1}^{n} Σ_{j=1}^{n} p_j ρ_{A_j}(a_i) = Σ_{i=1}^{n} π(a_i) = 1
(the first equality holds because each ρ_{A_j} is a probability measure on E = {a_1, . . . , a_n}, the second by (1)), which implies that the p_i form a probability distribution over {1, . . . , n}. Moreover, the same line of reasoning implies that for all k ∈ {1, . . . , n}:
Σ_{i=k}^{n} p_i ≤ π(A_k), (2)
where A_{n+1} = ∅ by definition.
We now define our randomized stopping time as T = H_A, where the choice of A is independent of the realization of the chain and P(A = A_i) = p_i, 1 ≤ i ≤ n. Notice that A ≠ ∅ always, hence T < +∞ almost surely. Moreover, P_{µ0}(X_T = ·) = Σ_{i=1}^{n} p_i ρ_{A_i}(·) = π(·) by (1), as desired. Finally, given ε ∈ (0, 1) and t > 0, let k be the largest index with π(A_k) ≥ ε; Markov's inequality gives P_{µ0}(H_{A_i} > t) ≤ T^Q_hit(ε)/t for i ≤ k, while Σ_{i>k} p_i < ε by (2), so that P_{µ0}(T > t) ≤ T^Q_hit(ε)/t + ε. ✷
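For illustration, the whole construction can be prototyped numerically (the code and its function names are ours, not part of the paper): harmonic measures ρ_S are obtained by solving linear systems, and the greedy choice of (p_{k+1}, a_{k+1}) is the minimal-p rule from the proof. The final check confirms that Σ_i p_i ρ_{A_i} = π and Σ_i p_i = 1, i.e. that T = H_A is a stationary time.

```python
import numpy as np

def harmonic_measure(Q, mu0, S):
    """rho_S(a) = P_mu0(X_{H_S} = a), the entry law of the chain into S."""
    n = len(Q)
    B = [x for x in range(n) if x not in S]
    rho = np.zeros(n)
    for a in S:
        u = np.zeros(n)
        u[a] = 1.0                            # u(x) = P_x(X_{H_S} = a)
        if B:
            u[B] = np.linalg.solve(Q[np.ix_(B, B)], -Q[B, a])
        rho[a] = mu0 @ u
    return rho

def stationary_time_sets(Q, mu0, pi):
    """Greedy construction of nested sets A_1 > A_2 > ... > A_n and weights
    p_i with sum_i p_i rho_{A_i} = pi, following the minimal-p rule."""
    n = len(Q)
    A = set(range(n))
    used = np.zeros(n)                        # running sum_j p_j rho_{A_j}
    sets, ps = [], []
    while A:
        rho = harmonic_measure(Q, mu0, A)
        # minimal p >= 0 making the constraint tight at some a in A
        p, a_star = min(((pi[a] - used[a]) / rho[a], a)
                        for a in A if rho[a] > 1e-14)
        sets.append(sorted(A))
        ps.append(p)
        used += p * rho
        A.remove(a_star)                      # A_{k+1} = A_k \ {a_k}
    return sets, np.array(ps), used

# Birth-death chain on {0, 1, 2} with stationary law (1/4, 1/2, 1/4):
Q = np.array([[-1.0, 1.0, 0.0],
              [0.5, -1.0, 0.5],
              [0.0, 1.0, -1.0]])
pi = np.array([0.25, 0.5, 0.25])
mu0 = np.array([1.0, 0.0, 0.0])               # start deterministically at 0
sets, ps, law = stationary_time_sets(Q, mu0, pi)
# law coincides with pi and the p_i sum to 1.
```

Because the chain starts inside A_1 = E, the first harmonic measure is just µ_0, so the first step of the greedy rule makes the constraint tight at the starting state.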

Mixing of non-reversible chains
In this section we prove Theorem 3.
Proof: [of Theorem 3] The lower bound in the statement follows easily from the ideas in [4]. We give a proof in the Appendix for completeness. For the upper bound, we proceed as follows. Define:
d̄_r(t) ≡ inf_{µ ∈ M_1([0,t])} max_{x,z∈E} d_TV(∫_0^{+∞} p_s(x, ·) µ(ds), ∫_0^{+∞} p_s(z, ·) µ(ds)). (3)
Claim: for all t ≥ 0 and all integers k ≥ 1, d_r(kt) ≤ d̄_r(t)^k.
Proof: [of the Claim] A standard compactness argument shows that there exists a measure µ which achieves the infimum in the definition of d̄_r(t). Let M be the discrete-time Markov chain whose transition probabilities are given by:
m(x, y) ≡ ∫_0^{+∞} p_s(x, y) µ(ds) (x, y ∈ E).
Define:
d̄_M(k) ≡ max_{x,z∈E} d_TV(m_k(x, ·), m_k(z, ·)),
where m_k is the transition probability for k steps of M. Notice that d̄_M(1) = d̄_r(t) by the choice of µ. Moreover, d_r(kt) ≤ d̄_M(k), because k steps of M correspond to replacing µ in (3) by its k-fold convolution µ^{∗k} (which is supported on [0, kt]), and the distance to π is at most the maximal distance between two starting states. Lemma 4.12 in [5] implies that d̄_M(k) ≤ d̄_M(1)^k, which proves the Claim. ✷
We will spend most of the rest of the proof proving that for all irreducible Markov chains Q:
d̄_r(c(α) T^Q_hit(α)) ≤ 1 − δ(α), (4)
where c(α), δ(α) > 0 depend only on α ∈ (0, 1/2). Applying the Claim with t = c(α) T^Q_hit(α) and k = k(α) such that (1 − δ(α))^{k(α)} ≤ 1/4, we may then deduce that:
T^Q_rmix(1/4) ≤ k(α) c(α) T^Q_hit(α),
which is the desired result.
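The appeal to [5, Lemma 4.12] is the only submultiplicativity input in the argument, and it is easy to sanity-check numerically for a discrete chain. The sketch below (a toy example of ours) computes d̄_M(k) for a 3-state stochastic matrix and verifies d̄_M(k) ≤ d̄_M(1)^k.

```python
import numpy as np

def dbar(P, k):
    """dbar(k) = max_{x,z} d_TV(P^k(x,.), P^k(z,.)) for a stochastic matrix P."""
    Pk = np.linalg.matrix_power(P, k)
    n = len(P)
    return max(0.5 * np.abs(Pk[x] - Pk[z]).sum()
               for x in range(n) for z in range(n))

# Any stochastic matrix works; this is a toy 3-state example.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
# Submultiplicativity of dbar ([5, Lemma 4.12]) gives dbar(k) <= dbar(1)^k.
checks = [dbar(P, k) <= dbar(P, 1) ** k + 1e-12 for k in range(1, 6)]
```

Here dbar(1) = 0.4, so the worst-case pairwise distance decays at least geometrically in the number of steps.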
Given x, z ∈ E, we let {X_t}_{t≥0} and {Z_t}_{t≥0} denote trajectories of Q started from x and z (respectively). Let T_x, T_z be obtained from Lemma 1 for µ_0 = δ_x and δ_z (resp.). Clearly, Law(X_{T_x}) = Law(Z_{T_z}) = π.
Sample U uniformly from [0, t], independently of the two chains. The Markov property and the stationarity of π imply: Law(X_{T_x+U}) = Law(Z_{T_z+U}) = π.
Now fix some t > 0 and define:
U_x ≡ (T_x + U) mod t, U_z ≡ (T_z + U) mod t.
Notice that U_x is uniform over [0, t], independently of {X_t}_{t≥0}, and similarly for U_z. Hence:
Law(X_{U_x}) = ∫_0^{+∞} p_s(x, ·) µ(ds), Law(Z_{U_z}) = ∫_0^{+∞} p_s(z, ·) µ(ds),
where µ is uniform over [0, t]. Therefore:
d_TV(Law(X_{U_x}), Law(Z_{U_z})) ≤ d_TV(Law(X_{U_x}), Law(X_{T_x+U})) + d_TV(Law(Z_{U_z}), Law(Z_{T_z+U})) (5)
by the triangle inequality and the previous remarks (the two middle laws both equal π). We now show that:
d_TV(Law(X_{U_x}), Law(X_{T_x+U})) ≤ α + 2 (T^Q_hit(α)/t)^{1/2}. (6)
This is of course trivial if t < T^Q_hit(α), so we assume the opposite is true. The coupling characterization of total variation distance implies that, for any λ ∈ (0, 1):
d_TV(Law(X_{U_x}), Law(X_{T_x+U})) ≤ P(T_x + U > t) ≤ λ + P(T_x > λt) ≤ λ + α + T^Q_hit(α)/(λt),
where the last step uses Lemma 1 with ε = α. Choosing λ = (T^Q_hit(α)/t)^{1/2} gives (6). We plug this and the corresponding statement for Z_{T_z+U} into (5) to deduce:
d_TV(Law(X_{U_x}), Law(Z_{U_z})) ≤ 2α + 4 (T^Q_hit(α)/t)^{1/2}.
Now recall that α < 1/2 and take:
t = 64 T^Q_hit(α)/(1 − 2α)².
For this value of t, we have:
d_TV(Law(X_{U_x}), Law(Z_{U_z})) ≤ 2α + (1 − 2α)/2 = 1 − (1 − 2α)/2.
Since x, z are arbitrary, we deduce (4) with c(α) = 64/(1 − 2α)² and δ(α) = (1 − 2α)/2. ✷

Mixing of reversible chains
We now prove Theorem 2.
Basic definitions for the proof. Let U > L > 0 (we will choose their values later). Fix a pair x, z ∈ E and let {X_t}_{t≥0} and {Z_t}_{t≥0} denote trajectories of Q started from x and z (respectively). Also let T_x, T_z be the randomized stopping times given by Lemma 1 for the X and Z processes, and define η_x, η_z to be the probability distributions of (X_{T_x}, T_x) and (Z_{T_z}, T_z) over E × [0, +∞). Finally, we let f_x(a) ≡ P_x(X_{T_x} = a, T_x ≤ L) and f_z(a) ≡ P_z(Z_{T_z} = a, T_z ≤ L) (a ∈ E).
Estimating total variation distance. Recall: Notice that: and similarly for p_t(z, a). Therefore, where the last line uses the Cauchy–Schwarz inequality. We may further bound: and plugging this into (8) gives the inequality: Averaging. Our next step is to average the LHS and RHS of (9) over t ∈ [L, U]. Since d_TV(p_t(x, ·), p_t(z, ·)) is non-increasing in t [5], the distance at time t = U is at most this average. We use concavity to move the averaging inside the square root and deduce: The term inside the square root. Define E_L ≡ E × [0, L]. By the strong Markov property: By reversibility, we may rewrite the integrand in the RHS as: which implies that: Integrating over t (with the change of variables t′ = 2t − s − s′), we find that: where the last inequality follows from the fact that: which holds for all s, s′ in the range considered. With this the bracketed term becomes independent of s, which may be integrated out. Since: we obtain: as well as a similar bound for z. On the other hand, starting from the formula: averaging over t ∈ [L, U] and using [2L − s − s′, 2L + 2U − s − s′] ⊃ [2L, 2U], we may obtain: Combining these bounds we obtain: To bound the sum in the RHS, we notice again that f_x(·), f_z(·) ≤ π(·), and also that for all u ∈ E, Σ_{u′} p_w(u, u′) = 1. Hence: and similarly for z, so that: We deduce that the term inside the square root in (10) is bounded by: Wrapping up. We now plug the previous inequality into (10) to deduce: If the quantity inside the square root is < 1, we get another upper bound: Now by Lemma 1 we obtain: Thus the condition for (13) is satisfied, and we have the bound: d_TV(p_U(x, ·), p_U(z, ·)) ≤ (1 + 2α)/2.
Since x, z ∈ E are arbitrary, we deduce:
max_{x,z∈E} d_TV(p_U(x, ·), p_U(z, ·)) ≤ (1 + 2α)/2,
which has the form requested in (7). ✷

Appendix: the lower bound
In this section we prove the lower bound part of the main theorems. As above, Q is an irreducible continuous-time Markov chain with state space E and stationary distribution π. The trajectories of the chain are denoted by {X_t}_{t≥0}.
Proposition 1 For any α ∈ (0, 1), T^Q_hit(α) ≤ c(α) T^Q_rmix(α/2), where c(α) > 0 depends only on α.
Proof: Fix A ⊂ E with π(A) ≥ α and x ∈ E. By the definition of T^Q_rmix(α/2) and a simple compactness argument, there exists a distribution supported on [0, T^Q_rmix(α/2)] such that, if U has this distribution and is independent of the chain, then d_TV(Law(X_U), π) ≤ α/2 for every starting state x ∈ E.

As a result:
P_x(X_U ∈ A) ≥ π(A) − α/2 ≥ α/2. (14)
Since U is supported in [0, T^Q_rmix(α/2)], {X_U ∈ A} ⊂ {H_A ≤ T^Q_rmix(α/2)}, and we deduce:
P_x(H_A ≤ T^Q_rmix(α/2)) ≥ α/2.
Let us use this to show that E_x[H_A] ≤ (2/α) T^Q_rmix(α/2) for all x and A as above. Let k ∈ N\{0} and denote by Λ_k the law of X_{(k−1) T^Q_rmix(α/2)} conditioned on {H_A ≥ (k − 1) T^Q_rmix(α/2)}. By (14) applied to the initial distribution Λ_k,
P_{Λ_k}(H_A ≤ T^Q_rmix(α/2)) ≥ α/2,
whereas by the Markov property:
P_x(H_A ≥ k T^Q_rmix(α/2)) ≤ P_x(H_A ≥ (k − 1) T^Q_rmix(α/2)) (1 − α/2).
We deduce:
E_x[H_A] ≤ Σ_{k≥1} T^Q_rmix(α/2) P_x(H_A ≥ (k − 1) T^Q_rmix(α/2)) ≤ T^Q_rmix(α/2) Σ_{k≥0} (1 − α/2)^k = (2/α) T^Q_rmix(α/2).
Since x ∈ E and A ⊂ E with π(A) ≥ α were arbitrary, this finishes the proof. ✷