General state space Markov chains and MCMC algorithms

This paper surveys various results about Markov chains on general (non-countable) state spaces. It begins with an introduction to Markov chain Monte Carlo (MCMC) algorithms, which provide the motivation and context for the theory which follows. Then, sufficient conditions for geometric and uniform ergodicity are presented, along with quantitative bounds on the rate of convergence to stationarity. Many of these results are proved using direct coupling constructions based on minorisation and drift conditions. Necessary and sufficient conditions for Central Limit Theorems (CLTs) are also presented, in some cases proved via the Poisson Equation or direct regeneration constructions. Finally, optimal scaling and weak convergence results for Metropolis-Hastings algorithms are discussed. None of the results presented is new, though many of the proofs are. We also describe some Open Problems.


Introduction
Markov chain Monte Carlo (MCMC) algorithms - such as the Metropolis-Hastings algorithm ([53], [37]) and the Gibbs sampler (e.g. Geman and Geman [32]; Gelfand and Smith [30]) - have become extremely popular in statistics, as a way of approximately sampling from complicated probability distributions in high dimensions (see for example the reviews [93], [89], [33], [71]). Most dramatically, the existence of MCMC algorithms has transformed Bayesian inference, by allowing practitioners to sample from posterior distributions of complicated statistical models.
In addition to their importance to applications in statistics and other subjects, these algorithms also raise numerous questions related to probability theory and the mathematics of Markov chains. In particular, MCMC algorithms involve Markov chains {X n } having a (complicated) stationary distribution π(·), for which it is important to understand as precisely as possible the nature and speed of the convergence of the law of X n to π(·) as n increases.
This paper attempts to explain and summarise MCMC algorithms and the probability theory questions that they generate. After introducing the algorithms (Section 2), we discuss various important theoretical questions related to them. In Section 3 we present various convergence rate results commonly used in MCMC. Most of these are proved in Section 4, using direct coupling arguments and thereby avoiding many of the analytic technicalities of previous proofs. We consider MCMC central limit theorems in Section 5, and optimal scaling and weak convergence results in Section 6. Numerous references to the MCMC literature are given throughout. We also describe some Open Problems.

The problem
The problem addressed by MCMC algorithms is the following. We're given a density function π_u on some state space X, which is possibly unnormalised but at least satisfies 0 < ∫_X π_u < ∞. (Typically X is an open subset of R^d, and the densities are taken with respect to Lebesgue measure, though other settings - including discrete state spaces - are also possible.) This density gives rise to a probability measure π(·) on X, by

π(A) = ∫_A π_u(x) dx / ∫_X π_u(x) dx,   A ⊆ X. (1)

We want to (say) estimate expectations of functions f : X → R with respect to π(·), i.e. we want to estimate

π(f) ≡ E_π[f(X)] = ∫_X f(x) π(dx). (2)

If X is high-dimensional, and π_u is a complicated function, then direct integration (either analytic or numerical) of the integrals in (2) is infeasible. The classical Monte Carlo solution to this problem is to simulate i.i.d. random variables Z_1, Z_2, …, Z_N ∼ π(·), and then estimate π(f) by

π̂(f) = (1/N) Σ_{i=1}^N f(Z_i). (3)

This gives an unbiased estimate, having standard deviation of order O(1/√N). Furthermore, if π(f²) < ∞, then by the classical Central Limit Theorem, the error π̂(f) − π(f) will have a limiting normal distribution, which is also useful. The problem, however, is that if π_u is complicated, then it is very difficult to directly simulate i.i.d. random variables from π(·).
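As a concrete illustration of the classical Monte Carlo estimate (3), here is a minimal sketch; the Exponential(1) target and the function f(x) = x² are our own illustrative choices:

```python
import numpy as np

# Classical Monte Carlo: estimate pi(f) = E[f(Z)] from i.i.d. draws
# Z_1, ..., Z_N ~ pi(.).  Here pi(.) is Exponential(1) and f(x) = x^2,
# so the true value is pi(f) = E[Z^2] = 2.
rng = np.random.default_rng(0)

def mc_estimate(f, sampler, N):
    """Unbiased estimate of pi(f), standard error of order O(1/sqrt(N))."""
    z = sampler(N)
    return float(np.mean(f(z)))

est = mc_estimate(lambda x: x ** 2,
                  lambda n: rng.exponential(1.0, size=n),
                  200_000)
# est lies within a few standard errors of the true value 2
```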
The Markov chain Monte Carlo (MCMC) solution is to instead construct a Markov chain on X which is easily run on a computer, and which has π(·) as a stationary distribution. That is, we want to define easily-simulated Markov chain transition probabilities P(x, dy) for x, y ∈ X, such that

∫_{x∈X} π(dx) P(x, dy) = π(dy).
Then hopefully (see Subsection 3.2), if we run the Markov chain for a long time (started from anywhere), then for large n the distribution of X_n will be approximately stationary: L(X_n) ≈ π(·). We can then (say) set Z_1 = X_n, and then restart and rerun the Markov chain to obtain Z_2, Z_3, etc., and then do estimates as in (3). It may seem at first to be even more difficult to find such a Markov chain than to estimate π(f) directly. However, we shall see in the next section that constructing (and running) such Markov chains is often surprisingly straightforward.
Remark. In the practical use of MCMC, rather than start a fresh Markov chain for each new sample, often an entire tail of the Markov chain run {X n } is used to create an estimate such as (N − B) −1 N i=B+1 f (X i ), where the burn-in value B is hopefully chosen large enough that L(X B ) ≈ π(·). In that case the different f (X i ) are not independent, but the estimate can be computed more efficiently. Since many of the mathematical issues which arise are similar in either implementation, we largely ignore this modification herein.
Remark. MCMC is, of course, not the only way to sample or estimate from complicated probability distributions. Other possible sampling algorithms include "rejection sampling" and "importance sampling", not reviewed here; but these alternative algorithms only work well in certain particular cases and are not as widely applicable as MCMC algorithms.

Motivation: Bayesian Statistics Computations
While MCMC algorithms are used in many fields (statistical physics, computer science), their most widespread application is in Bayesian statistical inference.
Let L(y|θ) be the likelihood function (i.e., density of data y given unknown parameters θ) of a statistical model, for θ ∈ X . (Usually X ⊆ R d .) Let the "prior" density of θ be p(θ). Then the "posterior" distribution of θ given y is the density which is proportional to π u (θ) ≡ L(y | θ) p(θ).
(Of course, the normalisation constant is simply the density for the data y, though that constant may be impossible to compute.) The "posterior mean" of any functional f is then given by:

π(f) = ∫_X f(θ) π_u(θ) dθ / ∫_X π_u(θ) dθ.

For this reason, Bayesians are anxious (even desperate!) to estimate such π(f). Good estimates allow Bayesian inference to be used to estimate a wide variety of parameters, probabilities, means, etc. MCMC has proven to be extremely helpful for such Bayesian estimates, and MCMC is now extremely widely used in the Bayesian statistical community.
Recall that a Markov chain with transition probabilities P(x, dy) is reversible with respect to a probability distribution π(·) if π(dx) P(x, dy) = π(dy) P(y, dx) for all x, y ∈ X. A very important property of reversibility is the following. Proposition 1. If a Markov chain is reversible with respect to π(·), then π(·) is stationary for the chain.
We see from this proposition that, when constructing an MCMC algorithm, it suffices to create a Markov chain which is easily run, and which is reversible with respect to π(·). The simplest way to do so is to use the Metropolis-Hastings algorithm, as we now discuss.
The Metropolis-Hastings algorithm proceeds as follows. First choose some X_0. Then, given X_n, generate a proposal Y_{n+1} from some proposal distribution Q(X_n, ·), having density q(X_n, ·). Also flip an independent coin, whose probability of heads equals α(X_n, Y_{n+1}), where

α(x, y) = min{1, [π_u(y) q(y, x)] / [π_u(x) q(x, y)]}.
(To avoid ambiguity, we set α(x, y) = 1 whenever π_u(x) q(x, y) = 0.) Then, if the coin is heads, "accept" the proposal by setting X_{n+1} = Y_{n+1}; if the coin is tails, "reject" the proposal by setting X_{n+1} = X_n. Replace n by n + 1, and repeat.
The reason for the unusual formula for α(x, y) is the following: Proposition 2. The Metropolis-Hastings algorithm (as described above) produces a Markov chain {X n } which is reversible with respect to π(·).
To run the Metropolis-Hastings algorithm on a computer, we just need to be able to run the proposal chain Q(x, ·) (which is easy, for appropriate choices of Q), and then do the accept/reject step (which is easy, provided we can easily compute the densities at individual points). Thus, running the algorithm is quite feasible. Furthermore we need to compute only ratios of densities [e.g. π u (y) / π u (x)], so we don't require the normalising constants c = X π u (x)dx.
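The steps above can be sketched in a few lines of code; the following minimal random-walk implementation uses our own function names and a standard normal target as illustrative choices (the symmetric proposal makes the q-ratio cancel):

```python
import numpy as np

# Minimal random-walk Metropolis sketch.  Only the unnormalised density
# pi_u is needed: the acceptance probability involves only the ratio
# pi_u(y)/pi_u(x), and the symmetric proposal cancels the q factors.
rng = np.random.default_rng(1)

def metropolis(pi_u, x0, n_steps, scale=1.0):
    x = x0
    chain = [x]
    for _ in range(n_steps):
        y = x + scale * rng.normal()          # proposal Y ~ N(x, scale^2)
        if rng.uniform() < min(1.0, pi_u(y) / pi_u(x)):
            x = y                             # heads: accept the proposal
        chain.append(x)                       # tails: X_{n+1} = X_n
    return np.array(chain)

# Target: standard normal, unnormalised
chain = metropolis(lambda x: np.exp(-x ** 2 / 2), 0.0, 100_000)
# The empirical mean and variance of the chain approach 0 and 1.
```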
However, this algorithm in turn suggests further questions. Most obviously, how should we choose the proposal distributions Q(x, ·)? In addition, once Q(x, ·) is chosen, then will we really have L(X n ) ≈ π(·) for large enough n? How large is large enough? We will return to these questions below.
• Langevin algorithm. Here the proposal is generated by

Y_{n+1} ∼ N(X_n + (δ/2) ∇ log π_u(X_n), δ I_d),

for some (small) δ > 0. (This is motivated by a discrete approximation to a Langevin diffusion process.) More about optimal choices of proposal distributions will be discussed in a later section, as will the second question about time to stationarity (i.e. how large does n need to be).
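A sketch of the Langevin proposal step (the helper names grad_log_pi_u and delta are our own; note that a full Langevin algorithm must still apply the Metropolis-Hastings accept/reject step with the asymmetric proposal density q):

```python
import numpy as np

# Langevin proposal: a normal step whose mean is nudged uphill by the
# gradient of log pi_u, discretising a Langevin diffusion.
rng = np.random.default_rng(2)

def langevin_proposal(x, grad_log_pi_u, delta):
    """Draw Y ~ N(x + (delta/2) * grad log pi_u(x), delta)."""
    mean = x + 0.5 * delta * grad_log_pi_u(x)
    return mean + np.sqrt(delta) * rng.normal()

# For pi_u(x) = exp(-x^2/2) we have grad log pi_u(x) = -x, so the
# proposal mean x - (delta/2) x shrinks slightly toward the mode at 0.
y = langevin_proposal(2.0, lambda x: -x, delta=0.01)
```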

Combining Chains
If P 1 and P 2 are two different chains, each having stationary distribution π(·), then the new chain P 1 P 2 also has stationary distribution π(·).
Thus, it is perfectly acceptable, and quite common (see e.g. Tierney [93] and [69]), to make new MCMC algorithms out of old ones, by specifying that the new algorithm applies first the chain P 1 , then the chain P 2 , then the chain P 1 again, etc. (And, more generally, it is possible to combine many different chains in this manner.) Note that, even if each of P 1 and P 2 are reversible, the combined chain P 1 P 2 will in general not be reversible. It is for this reason that it is important, when studying MCMC, to allow for non-reversible chains as well.

The Gibbs Sampler
The Gibbs sampler is also known as the "heat bath" algorithm, or as "Glauber dynamics". Suppose again that π_u(·) is a d-dimensional density, with X an open subset of R^d, and write x = (x_1, …, x_d).
The i th component Gibbs sampler is defined such that P i leaves all components besides i unchanged, and replaces the i th component by a draw from the full conditional distribution of π(·) conditional on all the other components.
More formally, write x^{(−i)} = (x_1, …, x_{i−1}, x_{i+1}, …, x_d) for the vector of all coordinates other than the i-th, and let π(dx_i | x^{(−i)}) denote the conditional distribution of the i-th coordinate under π(·), given the remaining coordinates. Then P_i(x, ·) leaves x^{(−i)} unchanged, and replaces x_i by a draw from π(dx_i | x^{(−i)}). It follows immediately (from direct computation, or from the definition of conditional density) that P_i is reversible with respect to π(·). (In fact, P_i may be regarded as a special case of a Metropolis-Hastings algorithm, with α(x, y) ≡ 1.) Hence, P_i has π(·) as a stationary distribution.
We then construct the full Gibbs sampler out of the various P_i, by combining them (as in the previous subsection) in one of two ways: • The deterministic-scan Gibbs sampler is P = P_1 P_2 ⋯ P_d. That is, it performs the d different Gibbs sampler components, in sequential order.
• The random-scan Gibbs sampler is P = (1/d) Σ_{i=1}^d P_i. That is, it does one of the d different Gibbs sampler components, chosen uniformly at random.
Either version produces an MCMC algorithm having π(·) as its stationary distribution. The output of a Gibbs sampler is thus a "zig-zag pattern", where the components get updated one at a time. (Also, the random-scan Gibbs sampler is reversible, while the deterministic-scan Gibbs sampler usually is not.)
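To make the two scan orders concrete, here is a sketch for a toy bivariate normal target with correlation ρ (our own illustrative choice), whose full conditionals are N(ρ x_other, 1 − ρ²):

```python
import numpy as np

# Deterministic-scan vs random-scan Gibbs for a bivariate normal target
# with correlation rho; each full conditional is N(rho * x_other, 1 - rho^2).
rng = np.random.default_rng(3)
rho = 0.5

def update_component(x, i):
    """P_i: redraw coordinate i from its full conditional."""
    x[i] = rho * x[1 - i] + np.sqrt(1 - rho ** 2) * rng.normal()

def deterministic_scan(x):        # P = P_1 P_2 ... P_d
    for i in range(len(x)):
        update_component(x, i)

def random_scan(x):               # P = (1/d) sum_i P_i
    update_component(x, int(rng.integers(len(x))))

x = np.zeros(2)
samples = []
for _ in range(50_000):
    deterministic_scan(x)         # one "zig-zag" sweep
    samples.append(x.copy())
samples = np.array(samples)
# The empirical correlation of the samples approaches rho = 0.5.
```

The random_scan variant would be used in the same loop, applying one randomly chosen component update per step instead of a full sweep.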

Detailed Bayesian Example: Variance Components Model
We close this section by presenting a typical example of a target density π u that arises in Bayesian statistics, in an effort to illustrate the problems and issues which arise.
The model involves a fixed constant µ_0 and positive constants a_1, b_1, a_2, b_2, and σ_0². It involves three hyperparameters, σ_θ², σ_e², and µ, each having priors based upon these constants as follows: σ_θ² ∼ IG(a_1, b_1); σ_e² ∼ IG(a_2, b_2); and µ ∼ N(µ_0, σ_0²). It involves K further parameters θ_1, θ_2, …, θ_K, conditionally independent given the above hyperparameters, with θ_i ∼ N(µ, σ_θ²). In terms of these parameters, the data {Y_ij} (1 ≤ i ≤ K, 1 ≤ j ≤ J) satisfy Y_ij ∼ N(θ_i, σ_e²), conditionally independently given the parameters. (The usual graphical representation of this hierarchical model is omitted here.)

The Bayesian paradigm then involves conditioning on the values of the data {Y_ij}, and considering the joint distribution of all K + 3 parameters given this data. That is, we are interested in the distribution π(·) defined on the state space X = (0, ∞)² × R^{K+1}. We would like to sample from this distribution π(·). We compute that this distribution's unnormalised density is given by

π_u(σ_θ², σ_e², µ, θ_1, …, θ_K) = e^{−b_1/σ_θ²} σ_θ^{−2(a_1+1)} · e^{−b_2/σ_e²} σ_e^{−2(a_2+1)} · e^{−(µ−µ_0)²/(2σ_0²)} × Π_{i=1}^K [σ_θ^{−1} e^{−(θ_i−µ)²/(2σ_θ²)}] × Π_{i=1}^K Π_{j=1}^J [σ_e^{−1} e^{−(Y_ij−θ_i)²/(2σ_e²)}].

This is a very typical target density for MCMC in statistics, in that it is high-dimensional (K + 3), its formula is messy and irregular, it is positive throughout X, and it is larger in the "center" of X and smaller in the "tails" of X. We now consider constructing MCMC algorithms to sample from the target density π_u. We begin with the Gibbs sampler. To run a Gibbs sampler, we require the full conditional distributions, which are computed without difficulty (since they are all one-dimensional): σ_θ² and σ_e² have inverse gamma full conditionals, while µ and each θ_i have normal full conditionals.
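For illustration, one sweep of this Gibbs sampler can be sketched as follows. The conjugate full-conditional constants below are our own reconstruction of the standard updates for this model (inverse gamma for the variances, normal for µ and the θ_i), and the data are synthetic:

```python
import numpy as np

# One sweep of the Gibbs sampler for the variance components model.
rng = np.random.default_rng(4)

mu0, sig0_sq = 0.0, 1.0                 # prior constants
a1, b1, a2, b2 = 2.0, 1.0, 2.0, 1.0
K, J = 5, 10
Y = rng.normal(size=(K, J))             # synthetic data Y_ij

def inverse_gamma(a, b):
    return 1.0 / rng.gamma(a, 1.0 / b)  # X ~ Gamma(a, rate=b) => 1/X ~ IG(a, b)

def gibbs_sweep(sig_th_sq, sig_e_sq, mu, theta):
    # sigma_theta^2 | rest and sigma_e^2 | rest: conjugate IG updates
    sig_th_sq = inverse_gamma(a1 + K / 2, b1 + 0.5 * np.sum((theta - mu) ** 2))
    sig_e_sq = inverse_gamma(a2 + K * J / 2, b2 + 0.5 * np.sum((Y - theta[:, None]) ** 2))
    # mu | rest: normal prior combined with the K normal theta_i
    prec = 1.0 / sig0_sq + K / sig_th_sq
    mean = (mu0 / sig0_sq + theta.sum() / sig_th_sq) / prec
    mu = mean + rng.normal() / np.sqrt(prec)
    # theta_i | rest: combine the prior N(mu, sig_th_sq) with J data points
    prec_i = 1.0 / sig_th_sq + J / sig_e_sq
    mean_i = (mu / sig_th_sq + Y.sum(axis=1) / sig_e_sq) / prec_i
    theta = mean_i + rng.normal(size=K) / np.sqrt(prec_i)
    return sig_th_sq, sig_e_sq, mu, theta

state = gibbs_sweep(1.0, 1.0, 0.0, np.zeros(K))
```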

Bounds on Markov Chain Convergence Times
Once we know how to construct (and run) lots of different MCMC algorithms, other questions arise. Most obviously, do they converge to the distribution π(·)? And, how quickly does this convergence take place?
To proceed, write P^n(x, A) for the n-step transition law of the Markov chain:

P^n(x, A) = P[X_n ∈ A | X_0 = x].

The main MCMC convergence questions are: is P^n(x, A) "close" to π(A) for large enough n? And, how large is large enough?

Total Variation Distance
We shall measure the distance to stationarity in terms of total variation distance, defined as follows:

Definition. The total variation distance between two probability measures ν_1(·) and ν_2(·) is:

||ν_1(·) − ν_2(·)|| = sup_{A⊆X} |ν_1(A) − ν_2(A)|.

We can then ask: is lim_{n→∞} ||P^n(x, ·) − π(·)|| = 0? And, given ǫ > 0, how large must n be so that ||P^n(x, ·) − π(·)|| < ǫ? We consider such questions herein.
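On a finite state space the supremum in this definition is attained by the set where ν_1 exceeds ν_2, giving the familiar half-L1 formula; a quick sketch:

```python
import numpy as np

# Total variation distance sup_A |nu1(A) - nu2(A)| for finite state
# spaces equals half the L1 distance between the probability vectors
# (the supremum is attained at the set A = {x : nu1(x) > nu2(x)}).
def tv_distance(nu1, nu2):
    return 0.5 * float(np.sum(np.abs(np.asarray(nu1) - np.asarray(nu2))))

d = tv_distance([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
# d == 0.3, attained at the set {1} where nu1 exceeds nu2
```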
We first pause to note some simple properties of total variation distance.

Asymptotic Convergence
Even if a Markov chain has stationary distribution π(·), it may still fail to converge to stationarity:

Example 1. Let X = {1, 2, 3}, with π{1} = π{2} = π{3} = 1/3, and with transition probabilities P(1, {1}) = P(1, {2}) = P(2, {1}) = P(2, {2}) = 1/2 and P(3, {3}) = 1. Then π(·) is stationary. However, if X_0 = 1, then X_n ∈ {1, 2} for all n, so P(X_n = 3) = 0 for all n, so P(X_n = 3) does not converge to π{3}, and the distribution of X_n does not converge to π(·). (In fact, here the stationary distribution is not unique, and the distribution of X_n converges to a different stationary distribution defined by π{1} = π{2} = 1/2.)

The above example is "reducible", in that the chain can never get from state 1 to state 3, in any number of steps. Now, the classical notion of "irreducibility" is that the chain has positive probability of eventually reaching any state from any other state, but if X is uncountable then that condition is impossible to satisfy. Instead, we demand the weaker condition of φ-irreducibility: Definition. A chain is φ-irreducible if there exists a non-zero σ-finite measure φ on X such that for all A ⊆ X with φ(A) > 0, and for all x ∈ X, there exists a positive integer n = n(x, A) such that P^n(x, A) > 0.
For example, if φ(A) = δ_{x*}(A), then this requires that x* has positive probability of eventually being reached from any state x. Thus, if a chain has any one state which is reachable from anywhere (which on a finite state space is equivalent to being indecomposable), then it is φ-irreducible. However, if X is uncountable then often P(x, {y}) = 0 for all x and y. In that case, φ(·) might instead be e.g. Lebesgue measure on R^d, so that φ({x}) = 0 for all singleton sets, but such that all subsets A of positive Lebesgue measure are eventually reachable with positive probability from any x ∈ X.
Running Example. Here we introduce a running example, to which we shall return several times. Suppose that π(·) is a probability measure having unnormalised density function π_u with respect to d-dimensional Lebesgue measure. Consider the Metropolis-Hastings algorithm for π_u with proposal density q(x, ·) with respect to d-dimensional Lebesgue measure. Then if q(·, ·) is positive and continuous on R^d × R^d, and π_u is finite everywhere, then the algorithm is π-irreducible. Indeed, let A ⊆ X with π(A) > 0. Then there exists R > 0 such that π(A_R) > 0, where A_R = A ∩ B_R(0), and B_R(0) represents the ball of radius R centred at 0. Then by continuity, for any x ∈ R^d, inf_{y∈A_R} min{q(x, y), q(y, x)} ≥ ǫ for some ǫ > 0, and thus we have (assuming π_u(x) > 0, otherwise P(x, A) > 0 follows immediately) that

P(x, A) ≥ ∫_{A_R} q(x, y) α(x, y) dy ≥ ǫ [Leb(A_R ∩ {y : π_u(y) ≥ π_u(x)}) + (1/π_u(x)) ∫_{A_R ∩ {y : π_u(y) < π_u(x)}} π_u(y) dy].

Since π(·) is absolutely continuous with respect to Lebesgue measure, and since Leb(A_R) > 0, it follows that the terms in this final sum cannot both be 0: if the first term vanishes, then π_u(y) < π_u(x) for a.e. y ∈ A_R, so the second term equals (K/π_u(x)) π(A_R) > 0, where K = ∫_X π_u(x) dx > 0 is the normalising constant. Hence we must have P(x, A) > 0, and the chain is π-irreducible.
Even φ-irreducible chains might not converge in distribution, due to periodicity problems, as in the following simple example.

Example 2. Let X = {1, 2}, with P(1, {2}) = P(2, {1}) = 1, and π{1} = π{2} = 1/2. Then π(·) is stationary and the chain is φ-irreducible, but if X_0 = 1 then X_n = 1 precisely when n is even, so P^n(1, ·) oscillates between δ_1(·) and δ_2(·), and does not converge to π(·).
To avoid this problem, we require aperiodicity, and we adopt the following definition (which suffices for the φ-irreducible chains with stationary distributions that we shall study; for more general relationships see e.g. Meyn and Tweedie [54], Theorem 5.4.4):

Definition. A Markov chain with stationary distribution π(·) is aperiodic if there do not exist d ≥ 2 and disjoint subsets X_1, X_2, …, X_d ⊆ X with P(x, X_{i+1}) = 1 for all x ∈ X_i (1 ≤ i ≤ d − 1), and P(x, X_1) = 1 for all x ∈ X_d, such that π(X_1) > 0 (and hence π(X_i) > 0 for all i). (Otherwise, the chain is periodic, with period d, and periodic decomposition X_1, …, X_d.)

Running Example, Continued. Here we return to the Running Example introduced above, and demonstrate that no additional assumptions are necessary to ensure aperiodicity. To see this, suppose that X_1 and X_2 are disjoint subsets of X, both of positive π measure, with P(x, X_2) = 1 for all x ∈ X_1. But take any x ∈ X_1; then, since X_1 must have positive Lebesgue measure,

P(x, X_1) ≥ ∫_{X_1} q(x, y) min{1, [π_u(y) q(y, x)] / [π_u(x) q(x, y)]} dy > 0,

contradicting P(x, X_2) = 1. Therefore aperiodicity must hold. (It is possible to demonstrate similar results for other MCMC algorithms, such as the Gibbs sampler; see e.g. Tierney [93]. Indeed, it is rather rare for MCMC algorithms to be periodic.)

Now we can state the main asymptotic convergence theorem, whose proof is described in Section 4. (This theorem assumes that the state space's σ-algebra is countably generated, but this is a very weak assumption which is true for e.g. any countable state space, or any subset of R^d with the usual Borel σ-algebra, since that σ-algebra is generated by the balls with rational centers and rational radii.)

Theorem 4. If a Markov chain on a state space with countably generated σ-algebra is φ-irreducible and aperiodic, and has a stationary distribution π(·), then for π-a.e. x ∈ X,

lim_{n→∞} ||P^n(x, ·) − π(·)|| = 0.
In particular, lim n→∞ P n (x, A) = π(A) for all measurable A ⊆ X .
Fact 5. In fact, under the conditions of Theorem 4, if h : X → R with π(|h|) < ∞, then a "strong law of large numbers" also holds (see e.g. Meyn and Tweedie [54], Theorem 17.0.1), as follows:

lim_{n→∞} (1/n) Σ_{i=1}^n h(X_i) = π(h), with probability 1. (6)

Theorem 4 requires that the chain be φ-irreducible and aperiodic, and have stationary distribution π(·). Now, MCMC algorithms are created precisely so that π(·) is stationary, so this requirement is not a problem. Furthermore, it is usually straightforward to verify that the chain is φ-irreducible, where e.g. φ is Lebesgue measure on an appropriate region. Also, aperiodicity almost always holds, e.g. for virtually any Metropolis algorithm or Gibbs sampler. Hence, Theorem 4 is widely applicable to MCMC algorithms.
It is worth asking why the convergence in Theorem 4 is just from π-a.e. x ∈ X. The problem is that the chain may have unpredictable behaviour on a "null set" of π-measure 0, and fail to converge there. Here is a simple example due to C. Geyer (personal communication):

Example 3. Let X = {1, 2, 3, …}. Let P(1, {1}) = 1 and, for x ≥ 2, P(x, {1}) = 1/x² and P(x, {x + 1}) = 1 − 1/x². Then the chain has stationary distribution π(·) = δ_1(·), and it is π-irreducible and aperiodic. On the other hand, if X_0 = x ≥ 2, then P[X_n = x + n for all n] = Π_{j=x}^∞ (1 − 1/j²) > 0, so the chain has positive probability of drifting off to infinity without ever reaching state 1, whence ||P^n(x, ·) − π(·)|| does not converge to 0. Here Theorem 4 holds for x = 1, which is indeed π-a.e. x ∈ X, but it does not hold for x ≥ 2.
Remark. The transient behaviour of the chain on the null set in Example 3 is not accidental. If instead the chain converged on the null set to some other stationary distribution, but still had positive probability of escaping the null set (as it must to be φ-irreducible), then with probability 1 the chain would eventually exit the null set, and would thus converge to π(·) from the null set after all.
It is reasonable to ask under what circumstances the conclusions of Theorem 4 will hold for all x ∈ X , not just π-a.e. Obviously, this will hold if the transition kernels P (x, ·) are all absolutely continuous with respect to π(·) (i.e., P (x, dy) = p(x, y) π(dy) for some function p : X × X → [0, ∞)), or for any Metropolis algorithm whose proposal distributions Q(x, ·) are absolutely continuous with respect to π(·). It is also easy to see that this will hold for our Running Example described above. More generally, it suffices that the chain be Harris recurrent, meaning that for all B ⊆ X with π(B) > 0, and all x ∈ X , the chain will eventually reach B from x with probability 1, i.e. P[∃ n : X n ∈ B | X 0 = x] = 1. This condition is stronger than π-irreducibility (as evidenced by Example 3); for further discussions of this see e.g. Orey [61], Tierney [93], Chan and Geyer [15], and [75].
Finally, we note that periodic chains occasionally arise in MCMC (see e.g. Neal [58]), and much of the theory can be applied to this case. For example, we have the following. Corollary 6. If a Markov chain is φ-irreducible, with period d ≥ 2, and has a stationary distribution π(·), then for π-a.e. x ∈ X,

lim_{n→∞} ||(1/d) Σ_{j=0}^{d−1} P^{n+j}(x, ·) − π(·)|| = 0, (7)

and also the strong law of large numbers (6) continues to hold without change.
Proof. Let the chain have periodic decomposition X_1, …, X_d ⊆ X, and let P′ be the d-step chain P^d restricted to the state space X_1. Then P′ is φ-irreducible and aperiodic on X_1, with stationary distribution π′(·) which satisfies π(·) = (1/d) Σ_{j=0}^{d−1} (π′ P^j)(·). Now, from Proposition 3(c), it suffices to prove the Corollary when n = md with m → ∞, and for simplicity we assume without loss of generality that x ∈ X_1. From Proposition 3(d), we have

(1/d) Σ_{j=0}^{d−1} P^{md+j}(x, ·) − π(·) = (1/d) Σ_{j=0}^{d−1} [P^{md}(x, ·) − π′(·)] P^j.

Then, by the triangle inequality,

||(1/d) Σ_{j=0}^{d−1} P^{md+j}(x, ·) − π(·)|| ≤ (1/d) Σ_{j=0}^{d−1} ||[P^{md}(x, ·) − π′(·)] P^j|| ≤ ||P^{md}(x, ·) − π′(·)||.

But applying Theorem 4 to P′, we obtain that lim_{m→∞} ||P^{md}(x, ·) − π′(·)|| = 0 for π′-a.e. x ∈ X_1, thus giving the first result.
To establish (6), let P̄ be the transition kernel of the Markov chain (X_{md}, X_{md+1}, …, X_{md+d−1}), m = 0, 1, 2, …, on X_1 × X_2 × ⋯ × X_d, and let h̄(x_1, …, x_d) = (1/d) [h(x_1) + ⋯ + h(x_d)]. Then, just like P′, we see that P̄ is φ-irreducible and aperiodic, with stationary distribution π̄(·) satisfying π̄(h̄) = π(h). Applying Fact 5 to P̄ and h̄ establishes that (6) holds without change.
Remark. By similar methods, it follows that the strong law of large numbers (6) also remains true in the periodic case, i.e. that

lim_{n→∞} (1/n) Σ_{i=1}^n h(X_i) = π(h), with probability 1,

whenever h : X → R with π(|h|) < ∞, provided the Markov chain is φ-irreducible with countably generated σ-algebra, without any assumption of aperiodicity. In particular, both (7) and (6) hold (without further assumptions regarding periodicity) for any irreducible (or indecomposable) Markov chain on a finite state space.
A related question for periodic chains, not considered here, is to consider quantitative bounds on the difference of average distributions, through the use of shift-coupling; see Aldous and Thorisson [3], and [68].

Uniform Ergodicity
Theorem 4 implies asymptotic convergence to stationarity, but does not say anything about the rate of this convergence. One "qualitative" convergence rate property is uniform ergodicity:

Definition. A Markov chain having stationary distribution π(·) is uniformly ergodic if

sup_{x∈X} ||P^n(x, ·) − π(·)|| ≤ M ρ^n,   n = 1, 2, 3, …,

for some ρ < 1 and M < ∞.
One equivalence of uniform ergodicity is:

Proposition 7. A Markov chain with stationary distribution π(·) is uniformly ergodic if and only if sup_{x∈X} ||P^n(x, ·) − π(·)|| < 1/2 for some n ∈ N.

Proof. If the chain is uniformly ergodic, then sup_{x∈X} ||P^n(x, ·) − π(·)|| ≤ M ρ^n → 0 as n → ∞, so sup_{x∈X} ||P^n(x, ·) − π(·)|| < 1/2 for all sufficiently large n. Conversely, if sup_{x∈X} ||P^n(x, ·) − π(·)|| ≤ β/2 < 1/2 for some n ∈ N and β < 1, then it follows from the contraction properties of total variation distance (in the notation of Proposition 3) that sup_{x∈X} ||P^{nm}(x, ·) − π(·)|| ≤ β^m for all m ∈ N, so the chain is uniformly ergodic with M = β^{−1} and ρ = β^{1/n}.
To develop further conditions which ensure uniform ergodicity, we require a definition.
Definition. A subset C ⊆ X is small (or, (n_0, ǫ, ν)-small) if there exist a positive integer n_0, ǫ > 0, and a probability measure ν(·) on X such that the following minorisation condition holds:

P^{n_0}(x, ·) ≥ ǫ ν(·) for all x ∈ C, (8)

i.e. P^{n_0}(x, A) ≥ ǫ ν(A) for all x ∈ C and all measurable A ⊆ X.
Remark. Some authors (e.g. Meyn and Tweedie [54]) also require that C have positive stationary measure, but for simplicity we don't explicitly require that here. In any case, π(C) > 0 follows under the additional assumption of the drift condition (10) considered in the next section.
Remark. As observed in [72], small-set conditions of the form P(x, ·) ≥ ǫ ν(·) for all x ∈ C can be replaced by pseudo-small conditions of the form P(x, ·) ≥ ǫ ν_{xy}(·) and P(y, ·) ≥ ǫ ν_{xy}(·) for all x, y ∈ C, without affecting any bounds which use pairwise coupling (which includes all of the bounds considered here before Section 5). Thus, all of the results stated in this section remain true without change if "small set" is replaced by "pseudo-small set" in the hypotheses. For ease of exposition, we do not emphasise this point herein.
The main result guaranteeing uniform ergodicity, which goes back to Doeblin [22] and Doob [23] and in some sense even to Markov [50], is the following.
Theorem 8. Consider a Markov chain with invariant probability distribution π(·). Suppose the minorisation condition (8) is satisfied for some n_0 ∈ N and ǫ > 0 and probability measure ν(·), in the special case C = X (i.e., the entire state space is small). Then the chain is uniformly ergodic, and in fact ||P^n(x, ·) − π(·)|| ≤ (1 − ǫ)^{⌊n/n_0⌋} for all x ∈ X, where ⌊r⌋ is the greatest integer not exceeding r.
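Theorem 8's bound is easily inverted numerically: given ǫ and n_0, one can solve for the smallest n guaranteeing a prescribed total variation error (a simple illustration; the helper name is our own):

```python
import math

# Theorem 8 gives ||P^n(x,.) - pi(.)|| <= (1 - eps)^floor(n / n0) when
# the whole state space is (n0, eps, nu)-small.  Inverting the bound
# yields a guaranteed n* for a prescribed tolerance.
def iterations_needed(eps, n0=1, tol=0.01):
    k = math.ceil(math.log(tol) / math.log(1.0 - eps))  # (1-eps)^k <= tol
    return k * n0

n_star = iterations_needed(eps=0.1, n0=1, tol=0.01)
# (1 - 0.1)^44 ≈ 0.0097 <= 0.01, so n_star = 44
```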
Running Example, Continued. Recall our Running Example, introduced above. Since we have imposed strong continuity conditions on q, it is natural to conjecture that compact sets are small. However, this is not true without extra regularity conditions. For instance, consider dimension d = 1, suppose π_u(x) = 1_{0<|x|<1} |x|^{−1/2}, and let q(x, y) ∝ exp{−(x − y)²/2}; then it is easy to check that no neighbourhood of 0 is small. However, in the general setup of our Running Example, all compact sets on which π_u is bounded are small. To see this, suppose C is a compact set on which π_u is bounded by k < ∞, let D be any compact set of positive Lebesgue and π measure, and set ǫ = inf_{x∈C, y∈D} min{q(x, y), q(y, x)}, which is positive by continuity and compactness. We then have, for x ∈ C,

P(x, dy) ≥ q(x, y) α(x, y) dy ≥ ǫ min{1, π_u(y)/k} 1_D(y) dy,

which is a (multiple of a) positive measure independent of x. Hence, C is small. (This example also shows that if π_u is continuous, the state space X is compact, and q is continuous and positive, then X is small, and so the chain must be uniformly ergodic.)

If a Markov chain is not uniformly ergodic (as few MCMC algorithms on unbounded state spaces are), then Theorem 8 cannot be applied. However, it is still of great importance, given a Markov chain kernel P and an initial state x, to be able to find n* so that, say, ||P^{n*}(x, ·) − π(·)|| ≤ 0.01. This issue is discussed further below.

Geometric ergodicity
A weaker condition than uniform ergodicity is geometric ergodicity, as follows (for background and history, see e.g. Nummelin [60], and Meyn and Tweedie [54]):

Definition. A Markov chain with stationary distribution π(·) is geometrically ergodic if

||P^n(x, ·) − π(·)|| ≤ M(x) ρ^n,   n = 1, 2, 3, …,

for some ρ < 1, where M(x) < ∞ for π-a.e. x ∈ X. The difference between geometric ergodicity and uniform ergodicity is that now the constant M may depend on the initial state x.
Of course, if the state space X is finite, then all irreducible and aperiodic Markov chains are geometrically (in fact, uniformly) ergodic. However, for infinite X this is not the case. For example, it is shown by Mengersen and Tweedie [52] (see also [76]) that a symmetric random-walk Metropolis algorithm is geometrically ergodic essentially if and only if π(·) has finite exponential moments. (For chains which are not geometrically ergodic, it is possible also to study polynomial ergodicity, not considered here; see Fort and Moulines [29], and Jarner and Roberts [42].) Hence, we now discuss conditions which ensure geometric ergodicity.
Definition. Given Markov chain transition probabilities P on a state space X, and a measurable function f : X → R, define the function P f : X → R by

(P f)(x) = ∫_X f(y) P(x, dy).

Definition. A Markov chain satisfies a drift condition (or, univariate geometric drift condition) if there are constants 0 < λ < 1 and b < ∞, and a function V : X → [1, ∞], such that P V ≤ λ V + b 1_C, i.e. such that

P V(x) ≤ λ V(x) + b 1_C(x),   x ∈ X. (10)

The main result guaranteeing geometric ergodicity is the following.

Theorem 9. Consider a φ-irreducible, aperiodic Markov chain with stationary distribution π(·). Suppose the minorisation condition (8) is satisfied for some C ⊆ X, ǫ > 0, and probability measure ν(·), and that the drift condition (10) is satisfied for some constants 0 < λ < 1 and b < ∞, and a function V : X → [1, ∞] with V(x) < ∞ for at least one x ∈ X. Then the chain is geometrically ergodic.
Theorem 9 is usually proved by complicated analytic arguments (see e.g. [60], [54], [7]). In Section 4, we describe a proof of Theorem 9 which uses direct coupling constructions instead. Note also that Theorem 9 provides no quantitative bounds on M(x) or ρ, though this is remedied in Theorem 12 below.

Fact 10. It is known (see e.g. Theorem 16.0.1 of Meyn and Tweedie [54], and Proposition 1 of [69]) that the minorisation condition (8) and drift condition (10) of Theorem 9 are equivalent (assuming φ-irreducibility and aperiodicity) to the apparently stronger property of "V-uniform ergodicity", i.e. that there is C < ∞ and ρ < 1 such that

sup_{|f|≤V} |P^n f(x) − π(f)| ≤ C V(x) ρ^n,

where π(f) = ∫_{x∈X} f(x) π(dx). That is, we can take the sup over |f| ≤ V instead of just over 0 < f < 1 (compare Proposition 3 parts (a) and (b)), and we can let M(x) = C V(x) in the geometric ergodicity bound. Furthermore, we always have π(V) < ∞. (The term "V-uniform ergodicity", as used in [54], perhaps also implies that V(x) < ∞ for all x ∈ X, rather than just for π-a.e. x ∈ X, though we do not consider that distinction further here.)

Open Problem # 1. Can direct coupling methods, similar to those used below to prove Theorem 9, also be used to provide an alternative proof of Fact 10?
Example 4. Here we consider a simple example of geometric ergodicity of Metropolis algorithms on R (see Mengersen and Tweedie [52], and [76]). Suppose that X = R+ = [0, ∞) and π_u(x) = e^{−x}. We will use a symmetric (about x) proposal density q(x, y) = q̃(|y − x|), whose increment distribution has support contained in [−a, a] for some a < ∞.

In this simple situation, a natural drift function to take is V(x) = e^{cx} for some c > 0. For x ≥ a, a proposed move from x to y = x + u is accepted with probability min{1, e^{−u}}, so we compute:

P V(x) = ∫_{−a}^{a} q̃(|u|) [min{1, e^{−u}} V(x + u) + (1 − min{1, e^{−u}}) V(x)] du.

By the symmetry of q, pairing the increments u and −u, this can be written as

P V(x) = V(x) ∫_0^a q̃(u) [e^{−u} e^{cu} + (1 − e^{−u}) + e^{−cu}] du,

where u = y − x. For c < 1, the bracketed expression equals 2 − (1 − e^{(c−1)u})(1 − e^{−cu}) < 2 for u > 0, so the integral is equal to 2(1 − ǫ) ∫_0^a q̃(u) du = 1 − ǫ for some positive constant ǫ. Thus in this case we have shown that for all x > a,

P V(x) ≤ (1 − ǫ) V(x).

Furthermore, it is easy to show that P V(x) is bounded on [0, a] and that [0, a] is in fact a small set. Thus, we have demonstrated that the drift condition (10) holds. Hence, the algorithm is geometrically ergodic by Theorem 9. (It turns out that for such Metropolis algorithms, a certain condition, which essentially requires an exponential bound on the tail probabilities of π(·), is in fact necessary for geometric ergodicity; see [76].)

Implications of geometric ergodicity for central limit theorems are discussed in Section 5. In general, it is believed by practitioners of MCMC that geometric ergodicity is a useful property. But does geometric ergodicity really matter? Consider the following examples.
Example 5. ( [71]) Consider an independence sampler, with π(·) an Exponential(1) distribution, and Q(x, ·) an Exponential(λ) distribution. Then if 0 < λ ≤ 1, the sampler is geometrically ergodic, has central limit theorems (see Section 5), and generally behaves fairly well even for very small λ. On the other hand, for λ > 1 the sampler fails to be geometrically ergodic, and indeed for λ ≥ 2 it fails to have central limit theorems, and generally behaves quite poorly. For example, the simulations in [71] indicate that with λ = 5, when started in stationarity and averaged over the first million iterations, the sampler will usually return an average value of about 0.8 instead of 1, and then occasionally return a very large value instead, leading to very unstable behaviour. Thus, this is an example where the property of geometric ergodicity does indeed correspond to stable, useful convergence behaviour.
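The dichotomy in Example 5 can be seen directly from the importance weight π(x)/q(x), which for an independence sampler must be bounded to obtain geometric (indeed uniform) ergodicity; a quick numerical sketch (the grid and names are our own):

```python
import numpy as np

# Independence sampler of Example 5: target Exponential(1), proposal
# Exponential(lam).  The importance weight is
#   w(x) = pi(x)/q(x) = exp(-x) / (lam * exp(-lam * x))
#        = exp((lam - 1) * x) / lam,
# which is bounded over x >= 0 exactly when lam <= 1.
def weight_sup(lam, xs):
    return float(np.max(np.exp((lam - 1.0) * xs) / lam))

xs = np.linspace(0.0, 50.0, 1001)
good = weight_sup(0.5, xs)   # decreasing in x, so sup = w(0) = 2
bad = weight_sup(5.0, xs)    # grows like e^{4x}: effectively unbounded
```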
However, geometric ergodicity does not always guarantee a useful Markov chain algorithm, as the following two examples show. Example 6. ("Witch's Hat", e.g. Matthews [51]) Let X = [0, 1], let δ = 10 −100 (say), let 0 < a < 1 − δ, and let π u (x) = δ + 1 [a,a+δ] (x). Then π([a, a + δ]) ≈ 1/2. Now, consider running a typical Metropolis algorithm on π u . Unless X 0 ∈ [a, a + δ], or the sampler gets "lucky" and achieves X n ∈ [a, a + δ] for some moderate n, then the algorithm will likely miss the tiny interval [a, a + δ] entirely, over any feasible time period. The algorithm will thus "appear" (to the naked eye or to any statistical test) to converge to the Uniform(X ) distribution, even though Uniform(X ) is very different from π(·). Nevertheless, this algorithm is still geometrically ergodic (in fact uniformly ergodic). So in this example, geometric ergodicity does not guarantee a well-behaved sampler.
Example 7. Let X = R, and let π_u(x) = 1/(1 + x²) be the (unnormalised) density of the Cauchy distribution. Then a random-walk Metropolis algorithm for π_u (with, say, X_0 = 0 and Q(x, ·) = Uniform[x − 1, x + 1]) is ergodic, but is not geometrically ergodic. And, indeed, this sampler has very slow, poor convergence properties. On the other hand, let π′_u(x) = π_u(x) 1_{|x| ≤ 10^{100}}, i.e. let π′_u correspond to π_u truncated at ± one googol. Then the same random-walk Metropolis algorithm for π′_u is geometrically ergodic, in fact uniformly ergodic. However, the two algorithms are indistinguishable when run for any remotely feasible number of iterations. Thus, this is an example where geometric ergodicity does not in any way indicate improved performance of the algorithm.
In addition to the above two examples, there are also numerous examples of important Markov chains on finite state spaces (such as the single-site Gibbs sampler for the Ising model at low temperature on a large but finite grid) which are irreducible and aperiodic, and hence uniformly (and thus also geometrically) ergodic, but which converge to stationarity extremely slowly.
The above examples illustrate a limitation of qualitative convergence properties such as geometric ergodicity. It is thus desirable where possible to instead obtain quantitative bounds on Markov chain convergence. We consider this issue next.
We here present a result from [85], which follows as a special case of [24]; it is based on the approach of [80] while also taking into account a small improvement from [77].
Our result requires a bivariate drift condition of the form

P̄ h(x, y) ≤ h(x, y) / α, (x, y) ∉ C × C, (11)

for some function h : X × X → [1, ∞) and some α > 1, where

P̄ h(x, y) ≡ ∫∫ h(z, w) P(x, dz) P(y, dw).

(Thus, P̄ represents running two independent copies of the chain.) Of course, (11) is closely related to (10); for example, we have the following (see also [80]): if the univariate drift condition (10) holds and we set h(x, y) = (1/2)[V(x) + V(y)], then for (x, y) ∉ C × C, either x ∉ C or y ∉ C (or both), so h(x, y) ≥ (1 + d)/2 where d = inf_{C^c} V, and P V(x) + P V(y) ≤ λ V(x) + λ V(y) + b. Then (11) follows, for an appropriate α > 1, provided d > b/(1 − λ) − 1; we refer to this below as Proposition 11. Finally, we let

B_{n_0} = max[1, α^{n_0} (1 − ε) sup_{C×C} R̄ h], (12)

where, for (x, y) ∈ C × C, R̄ h(x, y) is the conditional expectation of h(X_{n_0}, X'_{n_0}) given that the two copies start at (x, y) and fail to couple, i.e. with each coordinate updated from the residual kernel (P^{n_0}(x, ·) − ε ν(·)) / (1 − ε). In terms of these assumptions, we state our result as follows.
Theorem 12. Consider a Markov chain on a state space X , having transition kernel P. Suppose there are C ⊆ X , h : X × X → [1, ∞), a probability distribution ν(·) on X , α > 1, n_0 ∈ N, and ε > 0, such that (8) and (11) hold. Define B_{n_0} by (12). Then if {X_n} and {X'_n} are two copies of the Markov chain started in any joint initial distribution L(X_0, X'_0), then for any integers 1 ≤ j ≤ k,

||L(X_k) − L(X'_k)|| ≤ (1 − ε)^j + α^{−k} (B_{n_0})^{j−1} E[h(X_0, X'_0)]. (13)

In particular, by choosing j = ⌊rk⌋ for sufficiently small r > 0, we obtain an explicit, quantitative convergence bound which goes to 0 exponentially quickly as k → ∞.
Theorem 12 is proved in Section 4. Versions of this theorem have been applied to various realistic MCMC algorithms, including versions of the variance components model described earlier, resulting in bounds like ||P^n(x, ·) − π(·)|| < 0.01 for n = 140 or n = 3415; see e.g. [82], and Jones and Hobert [45]. Thus, while it is admittedly hard work to apply Theorem 12 to realistic MCMC algorithms, it is indeed possible, and can often establish rigorously that perfectly feasible numbers of iterations are sufficient to ensure convergence.
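To get a feel for how such bounds behave numerically, one can evaluate a Theorem-12-style bound directly. The sketch below uses the schematic form (1 − ε)^j + α^{−k} B^{j−1} E[h(X_0, X'_0)], minimised over j (this is our reading of the bound as derived in the proof in Section 4); all parameter values are invented for illustration, not drawn from any real chain:

```python
def coupling_bound(k, eps, alpha, B, Eh):
    """Schematic Theorem-12-style bound, minimised over j in {1, ..., k}:
       (1 - eps)**j  +  alpha**(-k) * B**(j - 1) * Eh."""
    return min((1 - eps) ** j + alpha ** (-k) * B ** (j - 1) * Eh
               for j in range(1, k + 1))

# Illustrative (invented) parameters: eps = 0.2, alpha = 1.1, B = 2, Eh = 5.
first_k = next(k for k in range(1, 5000)
               if coupling_bound(k, 0.2, 1.1, 2.0, 5.0) < 0.01)
```

Larger ε or α shrinks `first_k`, matching the intuition that a stronger minorisation or a faster drift gives faster convergence.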
Remark. For complicated Markov chains, it might be difficult to apply Theorem 12 successfully. In such cases, MCMC practitioners instead use "convergence diagnostics", i.e. do statistical analysis of the realised output X 1 , X 2 , . . ., to see if the distributions of X n appear to be "stable" for large enough n. Many such diagnostics involve running the Markov chain repeatedly from different initial states, and checking if the chains all converge to approximately the same distribution (see e.g. Gelman and Rubin [31], and Cowles and Carlin [18]). This technique often works well in practice. However, it provides no rigorous guarantees and can sometimes be fooled into prematurely claiming convergence (see e.g. [51]), as is likely to happen for the examples at the end of Section 3. Furthermore, convergence diagnostics can also introduce bias into the resulting estimates (see [19]). Overall, despite the extensive theory surveyed herein, the "convergence time problem" remains largely unresolved for practical application of MCMC. (This is also the motivation for "perfect MCMC" algorithms, originally developed by Propp and Wilson [63] and not discussed here; for further discussion see e.g. Kendall and Møller [46], Thönnes [92], and Fill et al. [27].)

Convergence Proofs using Coupling Constructions
In this section, we prove some of the theorems stated earlier. There are of course many methods available for bounding convergence of Markov chains, appropriate to various settings (see e.g. [1], [21], [88], [2], [90], and Subsection 5.4 herein), including the setting of large but finite state spaces that often arises in computer science (see e.g. Sinclair [88] and Randall [64]) but is not our emphasis here. In this section, we focus on the method of coupling, which seems particularly well-suited to analysing MCMC algorithms on general (uncountable) state spaces. It is also particularly well-suited to incorporating small sets (though small sets can also be combined with regeneration theory; see e.g. [8], [4], [57], [38]). Some of the proofs below are new, and avoid many of the long analytic arguments of some previous proofs (e.g. Nummelin [60], and Meyn and Tweedie [54]).

The Coupling Inequality
The basic idea of coupling is the following. Suppose we have two random variables X and Y , defined jointly on some space X . If we write L(X) and L(Y) for their respective probability distributions, then we can write

||L(X) − L(Y)|| = sup_A |P[X ∈ A] − P[Y ∈ A]| ≤ P[X ≠ Y].

That is, the variation distance between the laws of two random variables is bounded by the probability that they are unequal. For background, see e.g. Pitman [62], Lindvall [48], and Thorisson [91].
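The coupling inequality is tight for a maximal coupling, which sets X = Y with probability min(p_i, q_i) at each point. A toy verification on a three-point space (the distributions p and q are arbitrary illustrative choices):

```python
# Two probability distributions on a three-point space (illustrative numbers).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# Total variation distance: (1/2) * sum_i |p_i - q_i|.
tv = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# A maximal coupling puts mass min(p_i, q_i) on the diagonal {X = Y = i},
# so P[X != Y] = 1 - sum_i min(p_i, q_i), which achieves equality in the
# coupling inequality.
prob_unequal = 1.0 - sum(min(pi, qi) for pi, qi in zip(p, q))
```

Here both quantities equal 0.3, illustrating that the bound is attained by a suitable joint construction.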

Small Sets and Coupling
Suppose now that C is a small set. We shall use the following coupling construction, which is essentially the "splitting technique" of Nummelin [59] and Athreya and Ney [8]; see also Nummelin [60], and Meyn and Tweedie [54]. The idea is to run two copies {X n } and {X ′ n } of the Markov chain, each of which marginally follows the updating rules P (x, ·), but whose joint construction (using C) gives them as high a probability as possible of becoming equal to each other.

THE COUPLING CONSTRUCTION:
Start with X_0 = x and X'_0 ∼ π(·), and n = 0, and repeat the following loop forever.
Beginning of Loop. Given X_n and X'_n:

1. If X_n = X'_n, choose X_{n+1} = X'_{n+1} ∼ P(X_n, ·), and replace n by n + 1.

2. Else, if (X_n, X'_n) ∈ C × C, then:
(a) with probability ε, choose X_{n+n_0} = X'_{n+n_0} ∼ ν(·);
(b) else, with probability 1 − ε, conditionally independently choose X_{n+n_0} ∼ (P^{n_0}(X_n, ·) − ε ν(·)) / (1 − ε) and X'_{n+n_0} ∼ (P^{n_0}(X'_n, ·) − ε ν(·)) / (1 − ε).
If n_0 > 1, go back and construct X_{n+1}, . . . , X_{n+n_0−1} from their correct conditional distributions given X_n and X_{n+n_0}, and similarly (and conditionally independently) construct X'_{n+1}, . . . , X'_{n+n_0−1} given X'_n and X'_{n+n_0}. In any case, replace n by n + n_0.

3. Else, conditionally independently choose X_{n+1} ∼ P(X_n, ·) and X'_{n+1} ∼ P(X'_n, ·), and replace n by n + 1.

Then return to Beginning of Loop.
Under this construction, it is easily checked that X_n and X'_n are each marginally updated according to the correct transition kernel P. It follows that P[X_n ∈ A] = P^n(x, A) and P[X'_n ∈ A] = π(A) for all n and all A ⊆ X . Moreover, the two chains are run independently until they both enter C, at which time the minorisation splitting construction (step 2) is utilised. Without such a construction, on uncountable state spaces we would not be able to ensure successful coupling of the two processes.
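The construction can be illustrated on a toy kernel for which the whole space is small. In the sketch below (all kernel choices are our own illustrative inventions), P(x, ·) = ε ν(·) + (1 − ε) R(x, ·) with ν = N(0, 1) and R(x, ·) = N(x/2, 3/4), so that step 2 applies at every iteration with n_0 = 1, and the two copies couple at a joint regeneration:

```python
import random

def coupling_time(eps=0.25, seed=0):
    """One run of the coupling construction for the toy kernel
    P(x, .) = eps * nu(.) + (1 - eps) * R(x, .), with nu = N(0,1) and
    R(x, .) = N(x/2, 3/4) (an AR(1) step that preserves N(0,1)).
    The whole space is (1, eps, nu)-small, so with probability eps the two
    copies draw the same nu-value (step 2(a)) and remain equal forever."""
    rng = random.Random(seed)
    x, y = 10.0, -10.0            # two arbitrary starting points
    t = 0
    while x != y:
        t += 1
        if rng.random() < eps:    # joint regeneration from nu: chains couple
            x = y = rng.gauss(0.0, 1.0)
        else:                     # independent residual moves (step 2(b))
            x = x / 2.0 + rng.gauss(0.0, 0.75 ** 0.5)
            y = y / 2.0 + rng.gauss(0.0, 0.75 ** 0.5)
    return t

times = [coupling_time(seed=s) for s in range(500)]
mean_time = sum(times) / len(times)
```

The coupling time here is geometric with success probability ε, so the sample average should be near 1/ε = 4.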
The coupling inequality then says that ||P^n(x, ·) − π(·)|| ≤ P[X_n ≠ X'_n]. The question is, can we use this to obtain useful bounds on ||P^n(x, ·) − π(·)||? In fact, we shall now provide proofs (nearly self-contained) of all of the theorems stated earlier, in terms of this coupling construction. This allows for intuitive understanding of the theorems, while also avoiding various analytic technicalities of the previous proofs of some of these theorems.

Proof of Theorem 12
We follow the general outline of [85]. We again begin by assuming that n 0 = 1 in the minorisation condition for the small set C (and thus write B n0 as B), and indicate at the end what changes are required if n 0 > 1.
Let N_k = #{m : 0 ≤ m ≤ k, (X_m, X'_m) ∈ C × C}, and let τ_1, τ_2, . . . be the times of the successive visits of {(X_n, X'_n)} to C × C. Then for any integer j with 1 ≤ j ≤ k,

P[X_k ≠ X'_k] ≤ P[X_k ≠ X'_k, N_{k−1} ≥ j] + P[X_k ≠ X'_k, N_{k−1} < j]. (14)

Now, the event {X_k ≠ X'_k, N_{k−1} ≥ j} is contained in the event that the first j coin flips (i.e., the first j coupling attempts in step 2 of the construction) all came up tails. Hence, P[X_k ≠ X'_k, N_{k−1} ≥ j] ≤ (1 − ε)^j, which bounds the first term in (14).
To bound the second term in (14), let

M_k = α^k B^{−N_{k−1}} h(X_k, X'_k) 1(X_k ≠ X'_k), k = 0, 1, 2, . . . .

Lemma 13. The sequence {M_k} is a supermartingale.

Proof. If (X_k, X'_k) ∉ C × C, then N_k = N_{k−1}, and we have E[M_{k+1} | X_0, . . . , X_k, X'_0, . . . , X'_k] ≤ M_k by (9). Similarly, if (X_k, X'_k) ∈ C × C, then N_k = N_{k−1} + 1, so assuming X_k ≠ X'_k (since if X_k = X'_k, then the result is trivial), we again have E[M_{k+1} | X_0, . . . , X_k, X'_0, . . . , X'_k] ≤ M_k, by (10). Hence, {M_k} is a supermartingale.
To proceed, we note that since B ≥ 1 and h ≥ 1, on the event {X_k ≠ X'_k, N_{k−1} < j} we have M_k ≥ α^k B^{−(j−1)}. Hence, by Markov's inequality and the supermartingale property of {M_k},

P[X_k ≠ X'_k, N_{k−1} < j] ≤ α^{−k} B^{j−1} E[M_0] = α^{−k} B^{j−1} E[h(X_0, X'_0)].

Theorem 12 now follows (in the case n_0 = 1), by combining these two bounds with (14) and (13).
Finally, we consider the changes required if n_0 > 1. In this case, the main change is that we do not wish to count visits to C × C during which the joint chain could not try to couple, i.e. visits which correspond to the "filling in" times for going back and constructing X_{n+1}, . . . , X_{n+n_0} [and similarly for X'] in step 2 of the coupling construction. Thus, we instead let N_k count the number of visits to C × C, and {τ_i} the actual visit times, avoiding all such "filling in" times. Also, we replace N_{k−1} by N_{k−n_0} in (14) and in the definition of M_k. Finally, it is not {M_k} that is a supermartingale, but rather {M_{t(k)}}, where t(k) is the latest time ≤ k which does not correspond to a "filling in" time. (Thus, t(k) will take the value k, unless the joint chain visited C × C at some time between k − n_0 and k − 1.) With these changes, the proof goes through just as before.

Proof of Theorem 9
Here we give a direct coupling proof of Theorem 9, thereby somewhat avoiding the technicalities of e.g. Meyn and Tweedie [54] (though admittedly with a slightly weaker conclusion; see Fact 10). Our approach shall be to make use of Theorem 12. To begin, set h(x, y) = (1/2)[V(x) + V(y)]. Our proof will use the following technical result.

Lemma 14.
We may assume without loss of generality that

sup_{x∈C} V(x) < ∞. (15)

Specifically, given a small set C and drift function V satisfying (8) and (10), we can find a small set C_0 ⊆ C such that (8) and (10) still hold (with the same n_0, ε, and b, but with λ replaced by some λ_0 < 1), and such that (15) also holds.
Proof. Let λ and b be as in (10). Choose δ with 0 < δ < 1 − λ, let λ_0 = 1 − δ, let K = b / (λ_0 − λ), and set C_0 = {x ∈ C : V(x) < K}. Then clearly (8) continues to hold on C_0, since C_0 ⊆ C, and (15) holds since V < K on C_0. It remains to verify that (10) holds with C replaced by C_0, and λ replaced by λ_0. Now, (10) clearly holds for x ∈ C_0 and for x ∉ C, by inspection. Finally, for x ∈ C \ C_0, we have V(x) ≥ K, and so using the original drift condition (10), we have

P V(x) ≤ λ V(x) + b ≤ λ V(x) + (λ_0 − λ) V(x) = λ_0 V(x),

showing that (10) still holds, with C replaced by C_0 and λ replaced by λ_0.
As an aside, we note that in Lemma 14, it may not be possible to satisfy (15) by instead modifying V and leaving C unchanged: Proposition 15. There exists a geometrically ergodic Markov chain, with small set C and drift function V satisfying (8) and (10), such that there does not exist a drift function V 0 : X → [0, ∞] with the property that upon replacing V by V 0 , (8) and (10) continue to hold, and (15) also holds.
On the other hand, suppose we had some drift function V_0 satisfying (10) for which (15) held. Then V_0 would be bounded for all 0 < x ≤ 1, which would in turn imply that V_0 were bounded everywhere on X . But then Fact 10 would imply that the chain is uniformly ergodic, which it clearly is not. This gives a contradiction.
Thus, for the remainder of this proof, we can (and do) assume that (15) holds. This, together with (10), implies that

sup_{(x,y)∈C×C} R̄ h(x, y) < ∞, (16)

which in turn ensures that the quantity B_{n_0} of (12) is finite. To continue, let d = inf_{C^c} V. Then we see from Proposition 11 that the bivariate drift condition (11) will hold, provided that d > b/(1 − λ) − 1. In that case, Theorem 9 follows immediately (in fact, in a quantitative version) by combining Proposition 11 with Theorem 12.
However, if d ≤ b/(1 − λ) − 1, then this argument does not go through. This is not merely a technicality; the condition d > b/(1 − λ) − 1 ensures that the chain is aperiodic, and without this condition we must somehow use the assumption of aperiodicity more directly in the proof.
Our plan shall be to enlarge C so that the new value of d satisfies d > b/(1 − λ) − 1, and to use aperiodicity to show that C remains a small set (i.e., that (8) still holds, though perhaps for an uncontrollably larger n_0 and smaller ε > 0). Theorem 9 will then follow from Proposition 11 and Theorem 12 as above. (Note that we will have no direct control over the new values of n_0 and ε, which is why this approach does not provide a quantitative convergence rate bound.) To proceed, choose any d′ > b/(1 − λ) − 1, let S = {x ∈ X : V(x) ≤ d′}, and set C′ = C ∪ S, so that inf_{(C′)^c} V ≥ d′ > b/(1 − λ) − 1. Furthermore, since V is bounded on S by construction, we see that (15) will still hold with C replaced by C′. It then follows from (16) and (10) that we will still have B_{n_0} < ∞ even upon replacing C by C′. Thus, Theorem 9 will follow from Proposition 11 and Theorem 12 if we can prove:

Lemma 16. The set C′ = C ∪ S is a small set.

To prove Lemma 16, we use the notion of "petite set", following [54].

Definition.
A subset C ⊆ X is petite (or (n_0, ε, ν)-petite) if there exist a positive integer n_0, ε > 0, and a probability measure ν(·) on X such that

∑_{i=1}^{n_0} P^i(x, ·) ≥ ε ν(·), x ∈ C.

Intuitively, the definition of a petite set is like that of a small set, except that it allows the different states in C to cover the minorisation measure ε ν(·) at different times i. Obviously, any small set is petite. The converse is false in general, since the petite-set condition does not itself rule out periodic behaviour of the chain (for example, perhaps some of the states x ∈ C cover ε ν(·) only at odd times, and others only at even times). However, for an aperiodic, φ-irreducible Markov chain, we have the following result, whose proof is presented in the Appendix.

Lemma 17. For an aperiodic, φ-irreducible Markov chain, every petite set is small.

To make use of Lemma 17, we use the following.

Lemma 18. The set S ∪ C is petite.
Proof. To begin, choose N large enough that r ≡ 1 − λ^N sup_{x∈S} V(x) > 0; this is possible since V is bounded on S. Let τ_C = inf{n ≥ 1 : X_n ∈ C} be the first return time to C. Let Z_n = λ^{−n} V(X_n), and let W_n = Z_{min(n, τ_C)}. Then the drift condition (10) implies that {W_n} is a supermartingale. Indeed, if τ_C ≤ n, then W_{n+1} = W_n, while if τ_C > n, then X_n ∉ C, so using (10), E[W_{n+1} | W_0, . . . , W_n] ≤ λ^{−(n+1)} λ V(X_n) = W_n. Hence, for x ∈ S, using Markov's inequality, the fact that V ≥ 1, and the supermartingale property,

P_x[τ_C > N] ≤ λ^N E_x[W_N] ≤ λ^N V(x) ≤ λ^N sup_S V = 1 − r,

so that P_x[τ_C ≤ N] ≥ r. On the other hand, recall that C is (n_0, ε, ν(·))-small, so that P^{n_0}(x, ·) ≥ ε ν(·) for x ∈ C. It follows that for x ∈ S, ∑_{i=n_0+1}^{N+n_0} P^i(x, ·) ≥ r ε ν(·). Hence, for x ∈ S ∪ C, ∑_{i=n_0}^{N+n_0} P^i(x, ·) ≥ r ε ν(·). This shows that S ∪ C is petite.

Combining Lemmas 18 and 17, we see that C′ must be small, proving Lemma 16, and hence proving Theorem 9.

Proof of Theorem 4
Theorem 4 does not assume the existence of any small set C, so it is not clear how to make use of our coupling construction in this case. However, help is at hand in the form of a remarkable result about the existence of small sets, due to Jain and Jameson [41] (see also Orey [61]). We shall not prove it here; for modern proofs, see e.g. [60], p. 16, or [54], Theorem 5.2.2. The key idea (see e.g. Meyn and Tweedie [54], Theorem 5.2.1) is to extract the part of P^{n_0}(x, ·) which is absolutely continuous with respect to the measure φ, and then to find a C with φ(C) > 0 such that this density part is at least δ > 0 throughout C.

Theorem 19. (Jain and Jameson [41]) Every φ-irreducible Markov chain, on a state space with countably generated σ-algebra, contains a small set C ⊆ X with φ(C) > 0. (In fact, each B ⊆ X with φ(B) > 0 in turn contains a small set C ⊆ B with φ(C) > 0.) Furthermore, the minorisation measure ν(·) may be taken to satisfy ν(C) > 0.
In terms of our coupling construction, if we can show that the pair (X n , X ′ n ) will hit C × C infinitely often, then they will have infinitely many opportunities to couple, with probability ≥ ǫ > 0 of coupling each time. Hence, they will eventually couple with probability 1, thus proving Theorem 4.
We prove this following the outline of [84]. We begin with a lemma about return probabilities:

Lemma 20. Consider a Markov chain on a state space X , having stationary distribution π(·). Suppose that for some A ⊆ X , we have P_x(τ_A < ∞) > 0 for all x ∈ X . Then for π-almost-every x ∈ X , P_x(τ_A < ∞) = 1.
Proof. Suppose to the contrary that the conclusion does not hold, i.e. that

π{x ∈ X : P_x(τ_A < ∞) < 1} > 0. (18)

Then we make the following claims (proved below):

Claim 1. Condition (18) implies that there are constants ℓ, ℓ_0 ∈ N, δ > 0, and B ⊆ X with π(B) > 0, such that, writing L = ℓ ℓ_0, every x ∈ B satisfies P_x[X_{kL} ∉ B for all k ≥ 1, and the chain never enters A] ≥ δ.

Claim 2. Let B, ℓ, ℓ_0, and δ be as in Claim 1. Let L = ℓ ℓ_0, and let S = sup{k ≥ 1 : X_{kL} ∈ B}, using the convention that S = −∞ if the set {k ≥ 1 : X_{kL} ∈ B} is empty. Then for all integers 1 ≤ r ≤ j, P[S = r, X_i ∉ A for all i] ≥ π(B) δ.

Assuming the claims, we complete the proof as follows. The events {S = r, X_i ∉ A for all i} are disjoint for different r, so by stationarity, for any j ∈ N,

π(A^c) ≥ P[X_i ∉ A for all i] ≥ ∑_{r=1}^{j} P[S = r, X_i ∉ A for all i] ≥ j π(B) δ.
For j > 1/(π(B) δ), this gives π(A^c) > 1, which is impossible. This gives a contradiction, and hence completes the proof of Lemma 20, subject to the proofs of Claims 1 and 2 below.
Proof of Claim 2. We compute, using stationarity and then Claim 1, that P[S = r, X_i ∉ A for all i] ≥ ∫_B P_x[X_{kL} ∉ B for all k ≥ 1, and the chain never enters A] π(dx) ≥ π(B) δ.

To proceed, we let C be a small set as in Theorem 19, and consider again the coupling construction {(X_n, X'_n)}. Let G ⊆ X × X be the set of pairs (x, y) for which P_{(x,y)}[∃ n ≥ 1 : X_n = X'_n] = 1. From the coupling construction, we see that if (X_0, X'_0) ≡ (x, X'_0) ∈ G, then lim_{n→∞} P[X_n = X'_n] = 1, so that lim_{n→∞} ||P^n(x, ·) − π(·)|| = 0, proving Theorem 4. Hence, it suffices to show that for π-a.e. x ∈ X , we have P[(x, X'_0) ∈ G] = 1. To that end, let G_x = {y ∈ X : (x, y) ∈ G} for x ∈ X , and let Ḡ = {x ∈ X : π(G_x) = 1}. Then Theorem 4 follows from:

Lemma 21. π(Ḡ) = 1.
Proof. We first prove that (π × π)(G) = 1. Indeed, since ν(C) > 0 by Theorem 19, it follows from Lemma 35 that, from any (x, y) ∈ X × X , the joint chain has positive probability of eventually hitting C × C. It then follows, by applying Lemma 20 to the joint chain, that the joint chain will return to C × C with probability 1 from (π × π)-a.e. (x, y) ∉ C × C. Once the joint chain reaches C × C, then conditional on not coupling, the joint chain updates from the residual kernels, whose law is absolutely continuous with respect to π × π, and hence (again by Lemma 20) it will return again to C × C with probability 1. Hence, the joint chain will repeatedly return to C × C with probability 1, until such time as X_n = X'_n. And by the coupling construction, each time the joint chain is in C × C, it has probability ≥ ε of then forcing X_n = X'_n. Hence, eventually we will have X_n = X'_n, thus proving that (π × π)(G) = 1.

Now, if we had π(Ḡ) < 1, then since (π × π)(G) = ∫_X π(G_x) π(dx), and π(G_x) < 1 for all x ∉ Ḡ, we would have (π × π)(G) < 1, contradicting the fact that (π × π)(G) = 1.

A Negative Result
One might expect that CLTs always hold when π(h²) is finite, but this is false. For example, it is shown in [66] that Metropolis-Hastings algorithms whose acceptance probabilities are too low may get so "stuck" that τ = ∞, and they will then fail to have a √n-CLT. More specifically, the following is proved.

Theorem 22. Consider a reversible Markov chain, beginning in its stationary distribution π(·), and let r(x) be the probability that the chain remains at x for one step. If the rejection probabilities approach 1 sufficiently quickly relative to h, in the sense of condition (20), then a √n-CLT does not hold for h.

Proof. We compute directly from (19) that σ² = ∞, by (20). Hence, a √n-CLT cannot exist.
In particular, Theorem 22 is used in [66] to prove that for the independence sampler with target Exp(1) and i.i.d. proposals Exp(λ), the identity function has no √ n-CLT for any λ ≥ 2. The question then arises of what conditions on the Markov chain transitions, and on the functional h, guarantee a √ n-CLT for h.

Conditions Guaranteeing CLTs
Here we present various positive results about the existence of CLTs. Some, though not all, of these results are then proved in the following two sections. For i.i.d. samples, classical theory guarantees a CLT provided the second moments are finite (e.g. [13], Theorem 27.1; [83], p. 110). For uniformly ergodic chains, an identical result exists; it is shown in Corollary 4.2(ii) of Cogburn [17] (cf. Theorem 5 of Tierney [93]) that: Theorem 23. If a Markov chain with stationary distribution π(·) is uniformly ergodic, then a √ n-CLT holds for h whenever π(h 2 ) < ∞.
If a chain is just geometrically ergodic, but not uniformly ergodic, then a similar result holds under the slightly stronger assumption of a finite (2 + δ)-th moment. That is, it is shown in Theorem 18.5.3 of Ibragimov and Linnik [40] (see also Theorem 2 of Chan and Geyer [15], and Theorem 2 of Hobert et al. [38]) that:

Theorem 24. If a Markov chain with stationary distribution π(·) is geometrically ergodic, then a √n-CLT holds for h whenever π(|h|^{2+δ}) < ∞ for some δ > 0.

It follows, for example, that the independence sampler example mentioned above (which fails to have a √n-CLT, but which has finite moments of all orders) is not geometrically ergodic.
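As an illustration of Theorem 24 (the chain below is also reversible, so Theorem 25 applies too), consider the Gaussian AR(1) chain, which is geometrically ergodic with stationary distribution N(0, 1); for h(x) = x and ρ = 0.5, the asymptotic variance in the CLT is σ² = Σ_{k∈Z} ρ^{|k|} = 3. A batch-means sketch (chain length and batch size are arbitrary choices):

```python
import math
import random

def ar1_chain(n, rho=0.5, seed=4):
    """Gaussian AR(1): X_{k+1} = rho*X_k + sqrt(1-rho^2)*Z_k, Z_k ~ N(0,1).
    Reversible and geometrically ergodic, with stationary law N(0,1)."""
    rng = random.Random(seed)
    x, out = rng.gauss(0.0, 1.0), []     # start in stationarity
    s = math.sqrt(1.0 - rho * rho)
    for _ in range(n):
        x = rho * x + s * rng.gauss(0.0, 1.0)
        out.append(x)
    return out

xs = ar1_chain(400_000)
# Batch-means estimate of the CLT variance sigma^2 for h(x) = x:
# true value is sum_{k in Z} rho^|k| = (1 + rho)/(1 - rho) = 3 for rho = 0.5.
nb, bs = 400, 1000
bmeans = [sum(xs[i * bs:(i + 1) * bs]) / bs for i in range(nb)]
overall = sum(xs) / len(xs)
sigma2_hat = bs * sum((m - overall) ** 2 for m in bmeans) / (nb - 1)
```

The batch-means estimate should land near 3, i.e. the naive i.i.d. variance 1 inflated by the integrated autocorrelation time τ = 3.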
It is shown in Corollary 3 of [69] that Theorem 24 can be strengthened if the chain is reversible:

Theorem 25. If the Markov chain is geometrically ergodic and reversible, then a √n-CLT holds for h whenever π(h²) < ∞.
Comparing Theorems 25 and 24 leads to the following yes-or-no question (see [6]): if a Markov chain is geometrically ergodic, but not necessarily reversible, and π(h²) < ∞, then does a √n-CLT necessarily exist for h? In the first draft of this paper, we posed that question as an Open Problem. However, it was recently solved by Häggström [36], who produced a counter-example to prove the following:

Theorem 26. (Häggström [36]) There exists a (non-reversible) geometrically ergodic Markov chain, on a (countable) state space X , and a function h : X → R, such that π(h²) < ∞, but such that h does not satisfy a √n-CLT (nor a CLT with any other scaling).
If P is reversible, then it was proved by Kipnis and Varadhan [47] that finiteness of σ² is all that is required:

Theorem 27. For a φ-irreducible and aperiodic Markov chain which is reversible, a √n-CLT holds for h whenever σ² < ∞, where σ² is given by (19).
In a different direction, we have the following:

Theorem 28. Suppose a Markov chain is geometrically ergodic, satisfying (10) for some V : X → [1, ∞] which is finite π-a.e. Let h : X → R with h² ≤ K V for some K < ∞. Then a √n-CLT holds for h.
Before proving some of these results, we consider two extensions which are straightforward mathematically, but which may be of practical importance.
Proof. The hypotheses of the various CLT results all imply that the chain is φ-irreducible and aperiodic, with stationary distribution π(·). Hence, by Theorem 4, there is convergence to π(·) from π-a.e. x ∈ X . For such x, let ε > 0, and find m ∈ N such that ||P^m(x, ·) − π(·)|| ≤ ε. It then follows from Proposition 3(g) that we can jointly construct copies {X_n} and {X'_n} of the Markov chain, with X_0 = x and X'_0 ∼ π(·), such that the two copies agree from time m onwards, except on an event of probability at most ε. But this means that the normalised partial sums of the two copies have the same limiting behaviour, except on an event of probability at most ε. Since ε > 0 is arbitrary, and since the CLT holds for the stationary copy, the result follows.

Proposition 30. The CLT Theorems 23 and 24 remain true if the chain is periodic with period d ≥ 2, provided that the d-step chain P′ = P^d, restricted to a cyclic class X_1 (as in the proof of Corollary 6), has all the other properties required of P in the original result (i.e. φ-irreducibility, and uniform or geometric ergodicity), and that the function h still satisfies the same moment condition.
Proof. As in the proof of Corollary 6, let P′ = P^d be the d-step chain, defined on a cyclic class X_1. Then the original chain inherits the required irreducibility and ergodicity properties from P′ (formally, since P′ is de-initialising for P; see [73]). Then, Theorem 23 or 24 establishes a CLT for P′ and h. However, this is easily seen to be equivalent to the corresponding CLT for the original P and h, thus giving the result.
Remark. In particular, combining Theorem 23 with Proposition 30, we see that a √n-CLT holds, for any function h, for any irreducible (or indecomposable) Markov chain on a finite state space, without any assumption of aperiodicity.
(See also the Remark following Corollary 6 above.)

Remark. We note that for periodic chains as in Proposition 30, the formula (19) for the asymptotic variance σ² continues to hold without change. The relation σ² = τ Var_π(h) also continues to hold, except that the formula for the integrated autocorrelation time τ now requires that the sum be taken over ranges whose lengths are multiples of d: the flexibly-ordered infinite sum τ = ∑_{k∈Z} Corr(X_0, X_k) must be replaced by the more precisely limited sum τ = lim_{m,ℓ→∞} ∑_{k=−ℓd}^{md} Corr(X_0, X_k) (otherwise the sum will not converge, since now the individual terms do not go to 0).

CLT Proofs using the Poisson Equation
Here we provide proofs of some of the results stated in the previous subsection.
We begin by stating a version of the martingale central limit theorem, which was proved independently by Billingsley [12] and Ibragimov [39]; see e.g. p. 375 of Durrett [25].
Proof of Theorem 28. By Fact 10, there are C < ∞ and ρ < 1 with |P^n f(x) − π(f)| ≤ C V(x) ρ^n for x ∈ X and |f| ≤ V, and furthermore π(V) < ∞. Let g_k = P^k[h − π(h)] as in the proof of Corollary 33. Then, using the Cauchy–Schwarz inequality together with the assumption h² ≤ K V, one verifies that the series ∑_k g_k converges, and that the resulting solution of the Poisson equation has the second moments under π(·) required by Corollary 33. Hence, the result again follows from Corollary 33.
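The Poisson equation itself is easy to illustrate on a small finite chain, where the solution ĥ = Σ_{k≥0} P^k(h − π(h)) can be computed by truncating the geometrically convergent series, and then checked to satisfy ĥ − P ĥ = h − π(h). The chain, stationary distribution, and function h below are arbitrary illustrative choices:

```python
# A 3-state transition matrix P (rows sum to 1) with stationary pi; all
# numbers are illustrative.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
pi = [0.25, 0.50, 0.25]
h = [1.0, 2.0, 4.0]
pi_h = sum(pi[i] * h[i] for i in range(3))            # pi(h) = 2.25
hbar = [h[i] - pi_h for i in range(3)]                # centred h

def apply_P(v):
    """(Pv)(i) = sum_j P[i][j] v[j]."""
    return [sum(P[i][j] * v[j] for j in range(3)) for i in range(3)]

# hhat = sum_{k >= 0} P^k hbar, truncated: the terms decay geometrically,
# since the non-unit eigenvalues of P are 0.5 and 0.
hhat, term = [0.0, 0.0, 0.0], hbar[:]
for _ in range(200):
    hhat = [hhat[i] + term[i] for i in range(3)]
    term = apply_P(term)

Phhat = apply_P(hhat)
residual = max(abs(hhat[i] - Phhat[i] - hbar[i]) for i in range(3))  # ~ 0
```

The residual of the Poisson equation is zero up to truncation error, confirming that the series construction really does solve ĥ − P ĥ = h − π(h).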

Proof of Theorem 24 using Regenerations
Here we use regeneration theory to give a reasonably direct proof of Theorem 24, following the outline of Hobert et al. [38], thereby avoiding the technicalities of the original proof of Ibragimov and Linnik [40]. We begin by noting from Fact 10 that since the chain is geometrically ergodic, there is a small set C and a drift function V satisfying (8) and (10).
In terms of this, we consider a regeneration construction for the chain (cf. [8], [4], [57], [38]). This is very similar to the coupling construction presented in Section 4, except now just for a single chain {X_n}: we omit option 1 of the coupling construction, and merely update the single chain. More formally, given X_n, we proceed as follows. If X_n ∉ C, then we simply choose X_{n+1} ∼ P(X_n, ·). Otherwise, if X_n ∈ C, then with probability ε we choose X_{n+n_0} ∼ ν(·), while with probability 1 − ε we choose X_{n+n_0} ∼ R(X_n, ·). [If n_0 > 1, we then fill in the missing values X_{n+1}, . . . , X_{n+n_0−1} as usual.] We let T_1, T_2, . . . be the regeneration times, i.e. the times such that X_{T_i} ∼ ν(·) as above. Thus, regenerations occur with probability ε precisely n_0 iterations after each time the chain enters C (not counting those entries of C which are within n_0 of a previous regeneration attempt).
The benefit of regeneration times is that they break up sums like ∑_{i=0}^{n} [h(X_i) − π(h)] into sums over tours, each of the form ∑_{i=T_r}^{T_{r+1}−1} [h(X_i) − π(h)]. Furthermore, since each subsequent tour begins from the same fixed distribution ν(·), the different tours, after the first one, are independent and identically distributed (i.i.d.).
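For a toy kernel of the form P(x, ·) = ε ν(·) + (1 − ε) R(x, ·) the regeneration construction is explicit: the whole space is small with n_0 = 1, each step regenerates from ν with probability ε, and the tours between regenerations are i.i.d. A sketch (the specific kernel is our own illustrative invention):

```python
import random

def regeneration_tours(n_tours, eps=0.2, seed=5):
    """Single-chain regeneration for P(x,.) = eps*nu(.) + (1-eps)*R(x,.),
    with nu = N(0,1) and R(x,.) = N(x/2, 3/4); the stationary law is N(0,1).
    Returns the per-tour sums of h(x) = x and the tour lengths."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)                  # start at a regeneration: X_0 ~ nu
    sums, lens = [], []
    for _ in range(n_tours):
        s, length = 0.0, 0
        while True:
            s += x
            length += 1
            if rng.random() < eps:           # regeneration: tour ends
                x = rng.gauss(0.0, 1.0)
                break
            x = x / 2.0 + rng.gauss(0.0, 0.75 ** 0.5)
        sums.append(s)
        lens.append(length)
    return sums, lens

sums, lens = regeneration_tours(4000)
estimate = sum(sums) / sum(lens)    # ratio estimator of pi(h) = E[X] = 0
```

Since the (tour sum, tour length) pairs are i.i.d., classical i.i.d. theory applies to them directly, which is exactly the leverage used in the proof below.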
To continue, we note that geometric ergodicity implies (as in the proof of Lemma 18) exponential tails on the return times to C. It then follows (cf. Theorem 2.5 of [94]) that there is β > 1 with

E[β^{T_{r+1} − T_r}] < ∞. (22)

(This also follows from Theorem 15.0.1 of [54], together with a simple argument using probability generating functions.) Now, it seems intuitively clear that E(n) is O_p(1) as n → ∞, so when multiplied by n^{−1/2}, it will not contribute to the limit. Formally, this follows from (22), which implies by standard renewal theory that E(n) has a limiting distribution as n → ∞, which in turn implies that E(n) is O_p(1) as n → ∞. Thus, the term E(n) can be neglected without affecting the result.
Also, using (22), Markov's inequality gives the required bound on the remaining term. Hence, combining (23) and (24), we obtain the conclusion of Theorem 24.

It appears at first glance that Theorem 23 could be proved by similar regeneration arguments. However, we have been unable to do so.
Open Problem # 2. Can Theorem 23 be proved by direct regeneration arguments, similar to the above proof of Theorem 24?

Optimal Scaling and Weak Convergence
Finally, we briefly discuss another application of probability theory to MCMC, namely the optimal scaling problem. Our presentation here is quite brief; for further details see the review article [74].
Let π u : R d → [0, ∞) be a continuous d-dimensional density (d large). Consider running a Metropolis-Hastings algorithm for π u . The optimal scaling problem concerns the question of how we should choose the proposal distribution for this algorithm.
For concreteness, consider either the random-walk Metropolis (RWM) algorithm, with proposal distribution Q(x, ·) = N(x, σ² I_d), or the Langevin algorithm, with proposal distribution Q(x, ·) = N(x + (σ²/2) ∇ log π_u(x), σ² I_d). In either case, the question becomes: how should we choose σ²?
If σ² is chosen too small, then by continuity the resulting Markov chain will accept nearly all of its proposed moves. However, each proposed value will usually be extremely close to the chain's previous state, so the chain will move extremely slowly, leading to a very high acceptance rate but very poor performance. On the other hand, if σ² is chosen too large, then the proposed values will usually be very far from the current state. Unless the chain gets very "lucky", those proposed values will usually be rejected, so the chain will tend to get "stuck" at the same state for long periods of time. This leads to a very low acceptance rate, and again a very poorly performing algorithm. We conclude that proposal scalings satisfy a Goldilocks Principle: the choice of the proposal scaling σ² should be "just right", neither too small nor too large.
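The Goldilocks trade-off is easy to see empirically. The sketch below runs RWM on a standard normal target in d = 10 dimensions for a too-small, a roughly tuned (≈ 2.38/√d, since I = 1 here), and a too-large proposal scaling, recording the acceptance rate and the mean squared jump distance as a crude efficiency measure; all run lengths are arbitrary choices:

```python
import math
import random

def rwm_stats(sigma, d=10, n=20_000, seed=6):
    """RWM for a standard normal target on R^d with N(0, sigma^2 I) increments.
    Returns (acceptance rate, mean squared jump distance)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]         # start in stationarity
    accepts, esjd = 0, 0.0
    for _ in range(n):
        y = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        log_ratio = 0.5 * (sum(xi * xi for xi in x) - sum(yi * yi for yi in y))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            esjd += sum((yi - xi) ** 2 for xi, yi in zip(x, y))
            x, accepts = y, accepts + 1
    return accepts / n, esjd / n

small = rwm_stats(0.02)                   # accepts almost everything
tuned = rwm_stats(2.38 / math.sqrt(10))   # near the asymptotic guideline
large = rwm_stats(10.0)                   # rejects almost everything
```

The tuned intermediate scaling dominates both extremes in mean squared jump distance, illustrating the "just right" principle.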
To prove theorems about this, assume for now that

π_u(x) = ∏_{i=1}^{d} f(x_i), (25)

i.e. that the density π_u factors into i.i.d. components, each with (smooth) one-dimensional density f. (This assumption is obviously very restrictive, and is uninteresting in practice, since then each coordinate can be simulated separately. However, it does allow us to develop some interesting theory, which may approximately apply in other cases as well.) Also, assume that the chain begins in stationarity, i.e. that X_0 ∼ π(·).

The Random Walk Metropolis (RWM) Case
To state the result, define I = E[((log f)′(Z))²], where Z ∼ f(z) dz. Then it turns out, essentially, that under the assumption (25), as d → ∞ it is optimal to choose σ² ≐ (2.38)²/(I d), leading to an asymptotic acceptance rate ≐ 0.234.
More precisely, set the proposal variance to be σ²_d = ℓ²/d, where ℓ > 0 is to be chosen later. Let {X_n} be the random-walk Metropolis algorithm for π(·) on R^d with proposal variance σ²_d. Also, let {N(t)}_{t≥0} be a Poisson process with rate d which is independent of {X_n}. Finally, let Z^d_t be the first coordinate of X_{N(t)}. Thus, {Z^d_t}_{t≥0} follows the first component of {X_n}, with time speeded up by a factor of d.
Then it is proved in [67] (see also [74]), using the theory from Ethier and Kurtz [26], that as d → ∞, the process {Z^d_t}_{t≥0} converges weakly to a diffusion process {Z_t}_{t≥0} which satisfies the following stochastic differential equation:

dZ_t = h(ℓ)^{1/2} dB_t + (1/2) h(ℓ) ∇ log π_u(Z_t) dt.

Here h(ℓ) = 2 ℓ² Φ(−√I ℓ / 2) corresponds to the speed of the limiting diffusion, where Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−s²/2} ds is the cdf of a standard normal distribution.
We then compute numerically that the choice ℓ = ℓ̂ ≐ 2.38/√I maximises the above speed function h(ℓ), and thus must be the choice leading to optimally fast mixing (at least, as d → ∞). Furthermore, it is also proved in [67] that the asymptotic (i.e., expected value with respect to the stationary distribution) acceptance rate of the algorithm is given by the formula A(ℓ) = 2 Φ(−√I ℓ / 2), and we compute that A(ℓ̂) ≐ 0.234, thus giving the optimal asymptotic acceptance rate.
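The constants 2.38 and 0.234 can be recovered numerically from the speed function: maximise h(ℓ) = 2ℓ²Φ(−√I ℓ/2) over ℓ (taking I = 1 below, purely for illustration) and evaluate the acceptance-rate formula A(ℓ) = 2Φ(−√I ℓ/2) at the maximiser:

```python
import math

def Phi(x):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

I = 1.0   # illustrative value of the roughness constant

def speed(l):
    """Limiting-diffusion speed h(l) = 2 l^2 Phi(-sqrt(I) l / 2)."""
    return 2.0 * l * l * Phi(-math.sqrt(I) * l / 2.0)

# Maximise over a fine grid; the optimiser should be near 2.38 / sqrt(I).
grid = [i / 1000.0 for i in range(1, 8001)]
l_star = max(grid, key=speed)
accept_rate = 2.0 * Phi(-math.sqrt(I) * l_star / 2.0)   # A(l_star)
```

The grid maximiser lands at ℓ̂ ≈ 2.38, and the corresponding acceptance rate is ≈ 0.234, matching the constants quoted above.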
The Langevin Algorithm Case

For the Langevin algorithm, an analogous analysis applies, now involving a constant J depending on higher derivatives of log f: under (25), as d → ∞ it is optimal to choose σ² ≐ (0.825)²/(J^{1/2} d^{1/3}), leading to an asymptotic acceptance rate ≐ 0.574. More precisely, set σ²_d = ℓ²/d^{1/3}, let {X_n} be the Langevin algorithm for π(·) on R^d with proposal variance σ²_d, let {N(t)}_{t≥0} be a Poisson process with rate d^{1/3} which is independent of {X_n}, and let Z^d_t be the first coordinate of X_{N(t)}, so that {Z^d_t}_{t≥0} follows the first component of {X_n}, with time speeded up by a factor of d^{1/3}. Then it is proved in [70] (see also [74]) that as d → ∞, the process {Z^d_t}_{t≥0} converges weakly to a diffusion process {Z_t}_{t≥0} which satisfies the following stochastic differential equation:

dZ_t = g(ℓ)^{1/2} dB_t + (1/2) g(ℓ) ∇ log π_u(Z_t) dt.
Here g(ℓ) = 2 ℓ² Φ(−J ℓ³) represents the speed of the limiting diffusion. We then compute numerically that the choice ℓ = ℓ̂ = 0.825/√J maximises g(ℓ), and thus must be the choice leading to optimally fast mixing (at least, as d → ∞). Furthermore, it is proved in [70] that the asymptotic acceptance rate satisfies A(ℓ̂) = 2 Φ(−J ℓ̂³) ≐ 0.574, thus giving the optimal asymptotic acceptance rate for the Langevin case.

Discussion of Optimal Scaling
The above results show that for either the RWM or the Langevin algorithm, under the assumption (25), we can determine the optimal proposal scaling just in terms of universally optimal asymptotic acceptance rates (0.234 for RWM, 0.574 for Langevin). Such results are straightforward to apply in practice, since it is trivial for a computer to monitor the acceptance rate of the algorithm, and the user can modify σ² appropriately to achieve a suitable acceptance rate. Thus, these optimal scaling rates are often used in applied contexts (see e.g. Møller et al. [56]). (It may even be possible for the computer to adaptively modify σ² to achieve the appropriate acceptance rate; see [5] and references therein. However, it is important to recognise that adaptive strategies can violate the stationarity of π, so they have to be carefully implemented; see for example [35].)

The above results also describe the computational complexity of these algorithms. Specifically, they say that as d → ∞, the efficiency of RWM algorithms scales like d^{−1}, so their computational complexity is O(d). Similarly, the efficiency of Langevin algorithms scales like d^{−1/3}, so their computational complexity is O(d^{1/3}), which is of much lower order (i.e. better).
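Monitoring the acceptance rate is indeed trivial in practice. The following minimal sketch (all choices here — the standard normal product target, so that I = 1, the dimension d = 30, and the run length — are illustrative assumptions, not part of the theory) runs RWM with the asymptotically optimal scaling σ² = (2.38)²/d and records the empirical acceptance rate:

```python
import math
import random

random.seed(42)

d = 30                      # dimension (illustrative)
ell = 2.38                  # optimal ell for I = 1
sigma = ell / math.sqrt(d)  # proposal std dev, so sigma^2 = ell^2 / d

def log_pi(x):
    # Log-density (up to an additive constant) of the product
    # standard normal target, pi_u(x) = prod_i phi(x_i).
    return -0.5 * sum(xi * xi for xi in x)

x = [random.gauss(0, 1) for _ in range(d)]  # start in stationarity
accepts, n_iter = 0, 20000
for _ in range(n_iter):
    y = [xi + random.gauss(0, sigma) for xi in x]  # RWM proposal
    if math.log(random.random()) < log_pi(y) - log_pi(x):
        x, accepts = y, accepts + 1

acc_rate = accepts / n_iter
print(acc_rate)  # typically close to 0.234 for moderately large d
```

In a real application one would monitor acc_rate in exactly this way and adjust σ² (e.g. during burn-in) until the observed rate is roughly 0.234.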
We note that for reasonable efficiency, we do not need the acceptance rate to be exactly 0.234 (or 0.574), just fairly close. Also, the dimension does not have to be very large before the asymptotics approximately kick in; often 0.234 is approximately optimal in dimensions as low as 5 or 10. For further discussion of these issues, see the review article [74].

Now, the above results are only proved under the strong assumption (25), and it is natural to ask what happens if this assumption is not satisfied. In that case, there are various extensions of the optimal-scaling results to cases of inhomogeneously-scaled components of the form π_u(x) = ∏_{i=1}^{d} C_i f(C_i x_i) (see [74]), to the discrete hypercube [65], and to finite-range homogeneous Markov random fields [14]; in particular, the optimal acceptance rate remains 0.234 (under appropriate assumptions) in all of these cases. On the other hand, surprising behaviour can result if we do not start in stationarity, i.e. if the assumption X_0 ∼ π(·) is violated and the chain instead begins far out in the tails of π(·); see [16]. The true level of generality of these optimal scaling results is currently unknown, though investigations are ongoing [10]. In general this is an open problem:

Open Problem # 3. Determine the extent to which the above optimal scaling results continue to apply, even when assumption (25) is violated.

APPENDIX: Proof of Lemma 17
Lemma 17 above states (Meyn and Tweedie [54], Theorem 5.5.7) that for an aperiodic, φ-irreducible Markov chain, all petite sets are small sets.
We now proceed to prove that gcd(T ) = 1. Indeed, suppose to the contrary that gcd(T ) = d > 1. We will derive a contradiction.
Then ∪_{i=1}^{d} X_i = X by assumption. Now, let S = ∪_{i≠j} (X_i ∩ X_j), let S̄ = S ∪ {x ∈ X : ∃ m ∈ N s.t. P^m(x, S) > 0}, and let X′_i = X_i \ S̄. Then X′_1, X′_2, . . . , X′_d are disjoint by construction (since we have removed S̄).
Also, if x ∈ X′_i, then P(x, S̄) = 0, so that P(x, ∪_{j=1}^{d} X′_j) = 1 by construction. In fact we must have P(x, X′_{i+1}) = 1 in the case i < d (with P(x, X′_1) = 1 for i = d), for if not then x would be in two different X′_j at once, contradicting their disjointness.
It then follows (by sub-additivity of measures) that ν(S) = 0. Therefore, π(S̄) = 0, so that π(X′_i) = π(X_i) > 0 for each i. We conclude from all of this that X′_1, . . . , X′_d are subsets of positive π-measure, with respect to which the Markov chain is periodic (of period d), contradicting the assumption of aperiodicity.