QUANTITATIVE CONVERGENCE RATES OF MARKOV CHAINS: A SIMPLE ACCOUNT

We state and prove a simple quantitative bound on the total variation distance after k iterations between two Markov chains with different initial distributions but identical transition probabilities. The result is a simplified and improved version of the result in Rosenthal (1995), which also takes into account the ε-improvement of Roberts and Tweedie (1999), and which follows as a special case of the more complicated time-inhomogeneous results of Douc et al. (2002). However, the proof we present is very short and simple, and we feel that it is worthwhile to boil the proof down to its essence. This paper is purely expository; no new results are presented.


Introduction
Let $P$ be the transition kernel for a Markov chain defined on a state space $\mathcal{X}$. Suppose we run two different copies of the chain, $\{X_n\}$ and $\{X'_n\}$, started (independently or otherwise) from two different initial distributions $\mathcal{L}(X_0)$ and $\mathcal{L}(X'_0)$. We are interested in quantitative upper bounds on the total variation distance between the two chains after $k$ steps of the chain, which is defined by
$$\|\mathcal{L}(X_k) - \mathcal{L}(X'_k)\|_{TV} \equiv \sup_{A \subseteq \mathcal{X}} |P(X_k \in A) - P(X'_k \in A)|.$$
Such quantitative bounds on convergence rates of Markov chains have been studied in various forms by Meyn and Tweedie (1994), Rosenthal (1995), Roberts and Tweedie (1999), Jones and Hobert (2001), Douc et al. (2002), and others. These investigations have been motivated largely by interest in Markov chain Monte Carlo (MCMC) algorithms including the Gibbs sampler and the Metropolis-Hastings algorithm (see e.g. Gilks et al., 1996), where convergence bounds provide useful information about how long the algorithms must be run to achieve a prescribed level of accuracy.
In this paper, we present one such quantitative bound result. This result is a simplified and improved version of the result in Rosenthal (1995), which also takes into account the ε-improvement (i.e., replacing $\alpha B_0$ by $B$ in the conclusion) of Roberts and Tweedie (1999). This result follows directly as a special case of the more complicated time-inhomogeneous results of Douc et al. (2002). However, the proof we present is very short and simple, and we feel that it is worthwhile to boil the proof down to its essence. This paper is purely expository; no new results are presented.

Assumptions and Statement of Result
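Our result requires a minorisation condition of the form
$$P(x, \cdot) \ge \epsilon\,\nu(\cdot), \qquad x \in C, \tag{1}$$
(i.e. $P(x, A) \ge \epsilon\,\nu(A)$ for all $x \in C$ and all measurable $A \subseteq \mathcal{X}$), for some probability measure $\nu(\cdot)$ on $\mathcal{X}$, some subset $C \subseteq \mathcal{X}$, and some $\epsilon > 0$. It also requires a drift condition of the form
$$\bar{P} h(x, y) \le h(x, y) / \alpha, \qquad (x, y) \notin C \times C, \tag{2}$$
for some function $h : \mathcal{X} \times \mathcal{X} \to [1, \infty)$ and some $\alpha > 1$, where
$$\bar{P} h(x, y) \equiv \int_{\mathcal{X}} \int_{\mathcal{X}} h(z, w)\, P(x, dz)\, P(y, dw).$$
Finally, we let
$$B = \max\Big[1,\; \alpha (1 - \epsilon) \sup_{C \times C} \bar{R} h\Big], \tag{3}$$
where for $(x, y) \in C \times C$,
$$\bar{R} h(x, y) = \int_{\mathcal{X}} \int_{\mathcal{X}} (1 - \epsilon)^{-2}\, h(z, w)\, \big(P(x, dz) - \epsilon\,\nu(dz)\big) \big(P(y, dw) - \epsilon\,\nu(dw)\big).$$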

It is easily seen that
In terms of these assumptions, we state our result as follows.
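Theorem 1. Consider a Markov chain on a state space $\mathcal{X}$, having transition kernel $P$. Suppose there is $C \subseteq \mathcal{X}$, $h : \mathcal{X} \times \mathcal{X} \to [1, \infty)$, a probability distribution $\nu(\cdot)$ on $\mathcal{X}$, $\alpha > 1$, and $\epsilon > 0$, such that (1) and (2) hold. Define $B$ by (3). Then for any joint initial distribution $\mathcal{L}(X_0, X'_0)$, and any integers $1 \le j \le k$, if $\{X_n\}$ and $\{X'_n\}$ are two copies of the Markov chain started in the joint initial distribution $\mathcal{L}(X_0, X'_0)$, then
$$\|\mathcal{L}(X_k) - \mathcal{L}(X'_k)\|_{TV} \le (1 - \epsilon)^j + \alpha^{-k} B^{j-1} E[h(X_0, X'_0)].$$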
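To give a feel for how the two terms of the Theorem 1 bound trade off, the following is a minimal Python sketch that evaluates the right-hand side for given constants and searches for the best $j$. The function names `tv_bound` and `best_bound`, and the numerical values in the example, are hypothetical and chosen only for illustration; the constants $\epsilon$, $\alpha$, $B$ and $E[h(X_0, X'_0)]$ are assumed to have been obtained separately by verifying (1), (2) and (3).

```python
import math

def tv_bound(k, j, eps, alpha, B, Eh0):
    """Right-hand side of the Theorem 1 bound:
    (1 - eps)^j + alpha^(-k) * B^(j-1) * E[h(X_0, X'_0)]."""
    if not 1 <= j <= k:
        raise ValueError("need integers 1 <= j <= k")
    if not (0 < eps <= 1 and alpha > 1 and B >= 1 and Eh0 >= 1):
        raise ValueError("need 0 < eps <= 1, alpha > 1, B >= 1, E[h] >= 1")
    # Work on the log scale so that large powers of B do not overflow.
    log_second = -k * math.log(alpha) + (j - 1) * math.log(B) + math.log(Eh0)
    second = math.exp(log_second) if log_second < 700 else math.inf
    return (1 - eps) ** j + second

def best_bound(k, eps, alpha, B, Eh0):
    """Minimise the bound over j = 1..k, capped at 1 since total variation <= 1."""
    j = min(range(1, k + 1), key=lambda i: tv_bound(k, i, eps, alpha, B, Eh0))
    return j, min(1.0, tv_bound(k, j, eps, alpha, B, Eh0))

# Hypothetical constants, for illustration only.
j_star, bound = best_bound(k=200, eps=0.1, alpha=1.5, B=4.0, Eh0=10.0)
print(f"best j = {j_star}, bound = {bound:.3e}")
```

Increasing $j$ shrinks the first term but inflates the second whenever $B > 1$, so the minimising $j$ typically lies strictly between $1$ and $k$.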

Proof of Result
The proof uses a coupling approach. We begin by constructing $\{X_n\}$ and $\{X'_n\}$ simultaneously using a "splitting technique" (Athreya and Ney, 1978; Nummelin, 1984; Meyn and Tweedie, 1993) as follows. Let $X_0$ and $X'_0$ be drawn jointly from their given initial distribution. We shall let $d_n$ be the "bell variable" indicating whether or not the chains have coupled by time $n$. Begin with $d_0 = 0$. For $n = 0, 1, 2, \ldots$, proceed as follows. If $d_n = 1$, then choose $X_{n+1} \sim P(X_n, \cdot)$, and set $X'_{n+1} = X_{n+1}$ and $d_{n+1} = 1$. If $d_n = 0$ and $(X_n, X'_n) \in C \times C$, then flip (independently) a coin with probability of heads $\epsilon$. If the coin comes up heads, then choose a point $x \in \mathcal{X}$ from the distribution $\nu(\cdot)$, and set $X_{n+1} = X'_{n+1} = x$, and set $d_{n+1} = 1$. If the coin comes up tails, then choose $X_{n+1}$ and $X'_{n+1}$ independently according to the residual kernels $(1 - \epsilon)^{-1} (P(X_n, \cdot) - \epsilon\,\nu(\cdot))$ and $(1 - \epsilon)^{-1} (P(X'_n, \cdot) - \epsilon\,\nu(\cdot))$, respectively, and set $d_{n+1} = 0$. Finally, if $d_n = 0$ and $(X_n, X'_n) \notin C \times C$, then draw $X_{n+1} \sim P(X_n, \cdot)$ and $X'_{n+1} \sim P(X'_n, \cdot)$, independently, and set $d_{n+1} = 0$.
It is then easily checked that $X_n$ and $X'_n$ are each marginally updated according to the transition kernel $P$. Also, $X_n = X'_n$ whenever $d_n = 1$. Hence, by the coupling inequality (e.g. Pitman, 1976; Lindvall, 1992), we have
$$\|\mathcal{L}(X_k) - \mathcal{L}(X'_k)\|_{TV} \le P[X_k \ne X'_k] \le P[d_k = 0]. \tag{4}$$
Now, let
$$N_k = \#\{m : 0 \le m \le k,\ (X_m, X'_m) \in C \times C\},$$
and let $\tau_1, \tau_2, \ldots$ be the times of the successive visits of $\{(X_n, X'_n)\}$ to $C \times C$. Then for any integer $j$ with $1 \le j \le k$,
$$P[d_k = 0] = P[d_k = 0,\ N_{k-1} \ge j] + P[d_k = 0,\ N_{k-1} < j]. \tag{5}$$
Now, the event $\{d_k = 0,\ N_{k-1} \ge j\}$ is contained in the event that the first $j$ coin flips all came up tails. Hence,
$$P[d_k = 0,\ N_{k-1} \ge j] \le (1 - \epsilon)^j,$$
which bounds the first term in (5).
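The coupling construction above can be sketched on a small finite state space. The following Python code is only an illustration under stated assumptions: the 3-state transition matrix `P`, the set `C`, and the helper names `draw`, `residual`, `coupled_step` and `run_coupling` are all hypothetical, and the minorisation constants are obtained by taking column-wise minima of the rows of `P` indexed by `C`, which is one convenient choice on a finite space rather than anything prescribed by the text.

```python
import random

def draw(probs, rng):
    """Sample an index from a finite (possibly unnormalised) weight vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def residual(row, nu, eps):
    """Residual kernel (P(x, .) - eps * nu(.)) / (1 - eps); requires eps < 1.
    max(0, .) guards against tiny negative values from rounding."""
    return [max(0.0, p - eps * v) / (1.0 - eps) for p, v in zip(row, nu)]

def coupled_step(x, y, d, P, C, nu, eps, rng):
    """One step of the joint construction; returns (x_next, y_next, d_next)."""
    if d == 1:                       # already coupled: move together
        z = draw(P[x], rng)
        return z, z, 1
    if x in C and y in C:            # both in C: flip an eps-coin
        if rng.random() < eps:       # heads: both jump to a common draw from nu
            z = draw(nu, rng)
            return z, z, 1
        # tails: independent draws from the residual kernels
        return (draw(residual(P[x], nu, eps), rng),
                draw(residual(P[y], nu, eps), rng), 0)
    # outside C x C: independent draws from P
    return draw(P[x], rng), draw(P[y], rng), 0

def run_coupling(x0, y0, k, P, C, nu, eps, seed):
    """Run the coupled pair for k steps from (x0, y0) with d_0 = 0."""
    rng = random.Random(seed)
    x, y, d = x0, y0, 0
    for _ in range(k):
        x, y, d = coupled_step(x, y, d, P, C, nu, eps, rng)
    return x, y, d

# Hypothetical 3-state example.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]
C = {0, 1}
mins = [min(P[x][z] for x in C) for z in range(3)]  # column minima over C
eps = sum(mins)                                     # minorisation constant
nu = [m / eps for m in mins]                        # minorising measure

runs = [run_coupling(0, 2, 30, P, C, nu, eps, seed) for seed in range(500)]
not_coupled = sum(1 for (_, _, d) in runs if d == 0) / len(runs)
print("estimated P[d_30 = 0]:", not_coupled)
```

The printed fraction is a Monte Carlo estimate of $P[d_k = 0]$, which by the coupling inequality is an upper bound on the total variation distance at time $k$ for this toy chain.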
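To bound the second term in (5), let
$$M_k = \alpha^k B^{-N_{k-1}}\, h(X_k, X'_k)\, \mathbf{1}(d_k = 0), \qquad k = 0, 1, 2, \ldots$$
(where $N_{-1} = 0$). We claim that
$$E[M_{k+1} \mid X_0, \ldots, X_k, X'_0, \ldots, X'_k, d_0, \ldots, d_k] \le M_k,$$
i.e. that $\{M_k\}$ is a supermartingale. Indeed, from the Markov property, it suffices to condition on $(X_k, X'_k, d_k)$. If $(X_k, X'_k) \notin C \times C$, then $N_k = N_{k-1}$ and $d_{k+1} = d_k$, so
$$E[M_{k+1} \mid X_k, X'_k, d_k] = \alpha^{k+1} B^{-N_{k-1}}\, \bar{P} h(X_k, X'_k)\, \mathbf{1}(d_k = 0) = \alpha\, M_k\, \frac{\bar{P} h(X_k, X'_k)}{h(X_k, X'_k)} \le M_k,$$
by (2). Similarly, if $(X_k, X'_k) \in C \times C$, then $N_k = N_{k-1} + 1$, so assuming $d_k = 0$ (since if $d_k = 1$ then $d_{k+1} = 1$ so the result is trivial), we have
$$E[M_{k+1} \mid X_k, X'_k, d_k] = \alpha^{k+1} B^{-N_{k-1}-1} (1 - \epsilon)\, \bar{R} h(X_k, X'_k) = M_k\, \frac{\alpha (1 - \epsilon)\, \bar{R} h(X_k, X'_k)}{B\, h(X_k, X'_k)} \le M_k,$$
by (3) and the fact that $h \ge 1$. Hence, $\{M_k\}$ is a supermartingale.
Since $B \ge 1$ and $h \ge 1$, Markov's inequality and the supermartingale property then give
$$P[d_k = 0,\ N_{k-1} < j] = P[d_k = 0,\ N_{k-1} \le j - 1] \le P[M_k \ge \alpha^k B^{-(j-1)}] \le \alpha^{-k} B^{j-1} E[M_k] \le \alpha^{-k} B^{j-1} E[M_0] = \alpha^{-k} B^{j-1} E[h(X_0, X'_0)].$$
Theorem 1 now follows from combining these two bounds with (5) and (4).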

Extensions and Applications
If $P$ has a stationary distribution $\pi(\cdot)$, then in Theorem 1 we can choose $\mathcal{L}(X'_0) = \pi(\cdot)$, so that $\mathcal{L}(X'_k) = \pi(\cdot)$ for all $k$. Theorem 1 then implies that
$$\|\mathcal{L}(X_k) - \pi(\cdot)\|_{TV} \le (1 - \epsilon)^j + \alpha^{-k} B^{j-1} E[h(X_0, X'_0)],$$
where the expectation is now taken with respect to $X'_0 \sim \pi(\cdot)$. Furthermore, we can allow $j$ to grow with $k$, for example by setting $j = \lfloor rk \rfloor$ where $0 < r < 1$, to make $(1 - \epsilon)^j \to 0$ as $k \to \infty$.
The minorisation condition (1) can be relaxed to a pseudo-minorisation condition, where the measure $\nu = \nu_{x,x'}$ may depend upon the pair $(x, x') \in C \times C$ (Roberts and Rosenthal, 2000). More generally, the set $C \times C$ can be replaced by a non-rectangular ε-coupling set $\bar{C} \subseteq \mathcal{X} \times \mathcal{X}$ (Bickel and Ritov, 2002; Douc et al., 2002). Also, $\bar{P}$ and $\bar{R}$ need not update the two components independently as they do above; it is required only that they have the correct marginal distributions (Douc et al., 2002).
The joint drift condition (2) can be derived from univariate drift conditions of the form $PV \le \lambda V + b$ or $PV \le \lambda V + b\,\mathbf{1}_C$ in various ways (see e.g. Rosenthal, 2001, Proposition 9); such univariate drift conditions may be easier to identify in specific examples.
Extensions of Theorem 1 have been developed for stochastically monotone chains (Lund et al., 1996; Roberts and Tweedie, 2000), for time-inhomogeneous chains (Douc et al., 2002; Bickel and Ritov, 2002), for nearly-periodic chains (Rosenthal, 2001), and in the context of shift-coupling (Aldous and Thorisson, 1993; Roberts and Rosenthal, 1997; Roberts and Tweedie, 1999).
Versions of Theorem 1 have been applied to a number of simple Markov chain examples in Meyn and Tweedie (1994), Rosenthal (1995), and Roberts and Tweedie (1999). They have also been applied to more substantial examples of the Gibbs sampler, including a hierarchical Poisson model (Rosenthal, 1995), a version of the variance components model (Rosenthal, 1996), and some other MCMC examples (Jones and Hobert, 2001). Furthermore, with the aid of auxiliary simulation to only approximately verify (1) and (2), approximate versions of Theorem 1 have been applied successfully to more complicated Gibbs sampler examples (Cowles and Rosenthal, 1998; Cowles, 2001).
In spite of these successes in particular applications, it remains true that verifying (1) and (2) for complicated Markov chains is usually a difficult task. Nevertheless, it is of clear theoretical, and sometimes practical, importance to be able to identify convergence bounds solely in terms of drift and minorisation conditions, as in Theorem 1.
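As a worked illustration of letting $j$ grow with $k$, as mentioned among the extensions above, the following sketch substitutes $j = \lfloor rk \rfloor$ into the stationary form of the bound. It assumes $k \ge 1/r$ (so that $j \ge 1$) and $E[h(X_0, X'_0)] < \infty$, and uses only $B \ge 1$; the resulting threshold on $r$ is derived here for illustration and is not a statement taken from the cited papers.

```latex
\begin{align*}
\|\mathcal{L}(X_k) - \pi(\cdot)\|_{TV}
  &\le (1-\epsilon)^{\lfloor rk \rfloor}
     + \alpha^{-k} B^{\lfloor rk \rfloor - 1} E[h(X_0, X'_0)] \\
  &\le (1-\epsilon)^{\lfloor rk \rfloor}
     + B^{-1} \left( \alpha^{-1} B^{r} \right)^{k} E[h(X_0, X'_0)].
\end{align*}
% The second line uses \lfloor rk \rfloor \le rk and B \ge 1.
% Both terms decay geometrically in k when \alpha^{-1} B^{r} < 1, that is, when
% B = 1 (any r in (0,1)), or B > 1 and r < \log\alpha / \log B.
```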