Fast Mixing of Metropolis-Hastings with Unimodal Targets

A well-known folklore result in the MCMC community is that the Metropolis-Hastings algorithm mixes quickly for any unimodal target, as long as the tails are not too heavy. Although we've heard this fact stated many times in conversation, we are not aware of any quantitative statement of this result in the literature, and we are not aware of any quick derivation from well-known results. The present paper patches this small gap in the literature, providing a generic bound based on the popular "drift-and-minorization" framework of Rosenthal (1995). Our main contribution is to study two sublevel sets of the Lyapunov function and use path arguments in order to obtain a sharper general bound than what can typically be obtained from multistep minorization arguments.


Introduction
The Metropolis algorithm [16] and its generalization, the Metropolis-Hastings algorithm [6], have been exceptionally successful in the numerical approximation of analytically intractable integrals. Because these algorithms are both important and difficult to analyze, there is an enormous literature on the properties of Metropolis-Hastings chains in the statistics, computer science, mathematics and physics communities (see e.g. the popular textbooks [17,14]). Despite the size of this literature, obtaining reasonable quantitative bounds on the convergence rates of specific Markov chains used in statistics can be quite difficult, even when there are good heuristic reasons that convergence should be quick [11,1]. Recently, the authors needed to use an "obvious" folklore result that does not seem to be in the literature: reasonable Metropolis-Hastings chains targeting unimodal distributions will mix quickly. The main purpose of the paper is to provide a general and quantitatively useful version of this folklore result (see Theorem 3.1).
We were originally motivated by the need to prove a sharp lower bound on the spectral gap of a Metropolis-Hastings algorithm for logistic regression in a "rare-success" asymptotic regime (see Johndrow et al. [10]). When the standard deviation of the proposal kernel was similar to the standard deviation of the target distribution, it was straightforward to obtain a quantitatively strong version of the "minorization" condition required by the "drift-and-minorization" approach of Rosenthal [19]. However, this argument becomes much more delicate when the proposal variance does not closely match the target variance. Although we were motivated by a specific problem, similar problems appear more generally when the proposal kernel of an MCMC algorithm is not perfectly tuned to the target. This sort of (initial) bad tuning can be difficult to avoid in contexts such as Johndrow et al. [10] where the posterior distribution is very far from Gaussian.
To address this technical problem, we combine pathwise arguments (as studied in [21]) with coupling arguments to obtain reasonable estimates of mixing times inside of compact sublevel sets of the Lyapunov function. We then apply the "drift and minorization" approach of [19] to obtain mixing bounds on the full state space. This argument is presented here for generic random-walk type Metropolis-Hastings, and thus should be broadly useful.

Related Work
Popular approaches for establishing bounds on convergence rates for Markov chains include the Lyapunov-small set techniques of [12,17,19], and geometric inequalities such as Poincaré, Cheeger, and log-Sobolev inequalities [2,3,4,13,20,21]. Under suitable conditions on the tails of the target and the proposal kernel, drift and minorization arguments show that the Metropolis-Hastings algorithm will converge to the target at an exponential rate [15,7,8]. The paper [9] studies essentially the same question addressed in the present paper using Cheeger inequalities, but restricts its attention to log-concave target distributions.

Notation and Standing Assumptions
Consider a Markov kernel $P$ with a unique invariant measure $\mu$: $\mu P = \mu$. The spectrum of $P$ is the set $S(P) = \{\lambda \in \mathbb{C} \setminus \{0\} : (\lambda I - P)^{-1} \text{ is not a bounded linear operator on } L_2(\mu)\}$ and the spectral gap
when the eigenvalue 1 has multiplicity 1, and $\lambda_*(P) = 0$ else. Define the relaxation time $\tau_{\mathrm{rel}} \equiv \alpha^{-1}$, and the mixing time $\tau$ of $P$ on a set $\Theta$, $\tau \equiv \min\{t : \sup_{x \in \Theta} \|\delta_x P^t - \mu\|_{TV} < 1/4\}$, which need not be finite.
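For a finite-state chain, the relaxation time and mixing time can be computed directly. The sketch below is illustrative only. It uses a hypothetical three-state lazy birth-death chain that does not appear in the paper, takes the spectral gap to be one minus the second-largest eigenvalue modulus (a common convention, assumed here), and evaluates the mixing time over the whole state space. The function names `spectral_gap` and `mixing_time` are our own.

```python
import numpy as np

def spectral_gap(P):
    """One minus the second-largest eigenvalue modulus (assumed convention)."""
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - moduli[1]

def mixing_time(P, mu, tol=0.25, t_max=10_000):
    """Smallest t with max_x ||delta_x P^t - mu||_TV < tol, or None if not reached."""
    Pt = np.eye(P.shape[0])
    for t in range(1, t_max + 1):
        Pt = Pt @ P  # row x of Pt is the law of X_t started from x
        worst_tv = 0.5 * np.abs(Pt - mu).sum(axis=1).max()
        if worst_tv < tol:
            return t
    return None

# Hypothetical example chain: lazy birth-death walk on {0, 1, 2}.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
mu = np.array([0.25, 0.50, 0.25])  # invariant measure: mu @ P == mu

gap = spectral_gap(P)
tau_rel = 1.0 / gap
tau_mix = mixing_time(P, mu)
```

For this toy chain the eigenvalues are 1, 1/2 and 0, so the relaxation time is 2, and the total variation threshold 1/4 is first crossed at the second step.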
The following is a strong notion of unimodality on a set:
Throughout the remainder of the paper, we fix a scale $\epsilon > 0$ for typical step sizes of $P$, in a way that is made concrete in the context of the following assumptions on $P$:
Assumption 2.3. The Markov kernel of interest $P$ is a Metropolis-Hastings kernel with target distribution $\mu$ and proposal kernel $Q$ that satisfies
1. $Q(x, \cdot)$ has density $q(x, \cdot)$ and $\mu$ has density $p(\cdot)$ with respect to Lebesgue measure.
2. $q$ is isotropic, i.e. it is of the form $q(x, y) = q(\|x - y\|)$ for some density $q$ that is unimodal.
4. There exist constants $\gamma \in (0, 1)$ and $0 \le K < \infty$ and a Lyapunov function
The first three assumptions hold for most Metropolis-Hastings proposal kernels used in practice, and we expect them to be easy to verify. The last condition is stronger, and it can be difficult to verify that the condition holds with reasonably small constants $\gamma$, $K$. However, Lyapunov functions do exist under fairly mild conditions that have been well-studied (see e.g. [15], [7]).
For chains of this form, define the Metropolis-Hastings acceptance probability by $\alpha(x, y) \equiv 1 \wedge \frac{p(y)\,q(y, x)}{p(x)\,q(x, y)}$.
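A minimal sketch of a random-walk Metropolis-Hastings chain using this acceptance probability is given below. It assumes an isotropic Gaussian proposal with scale `eps`, which is one example of an isotropic unimodal proposal and not necessarily the one the authors have in mind. Because such a proposal is symmetric, the ratio q(y, x)/q(x, y) equals 1 and drops out. The function name `rw_metropolis` and its parameters are hypothetical.

```python
import numpy as np

def rw_metropolis(log_p, x0, eps, n_steps, rng):
    """Random-walk Metropolis with Gaussian increments of scale eps (assumed proposal).

    log_p maps a 1-d array to the log target density, up to an additive constant.
    Returns the visited states and the empirical acceptance rate.
    """
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_steps, x.size))
    accepted = 0
    for t in range(n_steps):
        y = x + eps * rng.standard_normal(x.size)  # symmetric proposal
        # log of alpha(x, y) = 1 ^ p(y)/p(x); the q-ratio cancels by symmetry
        log_alpha = min(0.0, log_p(y) - log_p(x))
        if np.log(rng.uniform()) < log_alpha:
            x = y
            accepted += 1
        samples[t] = x  # on rejection the chain stays at x
    return samples, accepted / n_steps
```

For instance, `rw_metropolis(lambda x: -0.5 * float(x @ x), 0.0, 1.0, 20000, np.random.default_rng(0))` targets a standard normal in one dimension. Working with log densities avoids underflow in the tails.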

Main Result
For any $\Theta \subset \mathbb{R}^d$ with $\mu(\Theta) > 0$, denote by $\mu_\Theta$ and $p_\Theta$ the usual restrictions of $\mu$, $p$ to $\Theta$:
Similarly, denote by $P_\Theta$ the usual restriction of $P$ to $\Theta$.
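One common reading of these restrictions is written out below. It is stated as an assumption for illustration, not as the paper's exact definitions. Here $\alpha_\Theta$ denotes the Metropolis-Hastings acceptance probability computed with $p_\Theta$ in place of $p$, so that proposals landing outside $\Theta$ are always rejected.

```latex
\[
\mu_\Theta(A) = \frac{\mu(A \cap \Theta)}{\mu(\Theta)}, \qquad
p_\Theta(x) = \frac{p(x)\,\mathbf{1}\{x \in \Theta\}}{\mu(\Theta)},
\]
\[
P_\Theta(x, A) = \int_{A} \alpha_\Theta(x, y)\, q(x, y)\, dy
  + \mathbf{1}\{x \in A\} \Bigl( 1 - \int_{\Theta} \alpha_\Theta(x, y)\, q(x, y)\, dy \Bigr),
\qquad x \in \Theta,\ A \subset \Theta .
\]
```

Under this reading $P_\Theta$ is itself a Metropolis-Hastings kernel with target $\mu_\Theta$, and so $\mu_\Theta$ is invariant for it.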
i.e. it is "almost constant" on a ball of radius $2\epsilon$ around the mode.
Then there exists a constant If the set Θ is "small but not too small", i.e.
we also have
Although our final result is restricted to $\mathbb{R}$, several of the lemmas used in proving the result hold with almost no changes in $\mathbb{R}^d$ and could be used to prove similar results for higher-dimensional target distributions. Thus, we prove most of the results in $\mathbb{R}^d$ and specialize to the case of $\mathbb{R}$ for the final lemma.

Proofs
We break the proof up into three lemmas, each of which might be individually useful for proving similar results. The first two lemmas are proved on $\mathbb{R}^d$; the final lemma is proved only on $\mathbb{R}$.

Mixing: From Very Small Sets to Small Sets
The first Lemma shows that a Lyapunov condition combined with a bound on the mixing time $\tau$ of $P_\Theta$ for a "small" sublevel set $\Theta$ of $V$ allows us to bound the spectral gap by the inverse of the mixing time. The key idea is that a Markov chain started from a point $x$ inside of a "very small" sublevel set is unlikely to escape from the slightly larger "small" set within its first $\tau$ steps. This allows us to minorize $P^\tau$ by $\mu_\Theta$ inside of the "very small" set $\{x : V(x) < 4K(1 - \gamma)^{-1}\}$ and then apply the usual Harris theorem to obtain a bound on the spectral gap. In essence we use the Lyapunov condition and the mixing time bound to obtain a minorization condition on the time scale of the mixing time (that is, for $P^\tau$) rather than directly showing a multistep minorization condition.
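The escape estimate behind this idea can be sketched as follows. The sketch assumes that the drift condition takes the common form $(PV)(x) \le \gamma V(x) + K$ with $V \ge 0$, and it writes $R$ for a hypothetical level defining the larger set $\{V \le R\}$. Neither the exact form of the condition nor the constants below are taken from the paper.

```latex
\begin{align*}
\mathbb{E}_x[V(X_t)]
  &\le \gamma^t V(x) + K \sum_{s=0}^{t-1} \gamma^s
   \le \gamma^t V(x) + \frac{K}{1-\gamma}, \\
\mathbb{P}_x[V(X_t) > R]
  &\le \frac{1}{R} \Bigl( \gamma^t V(x) + \frac{K}{1-\gamma} \Bigr)
  && \text{(Markov's inequality)}, \\
\mathbb{P}_x\Bigl[ \max_{1 \le t \le \tau} V(X_t) > R \Bigr]
  &\le \frac{\tau}{R} \Bigl( V(x) + \frac{K}{1-\gamma} \Bigr)
   < \frac{5 K \tau}{(1-\gamma) R}
  && \text{if } V(x) < \frac{4K}{1-\gamma}.
\end{align*}
```

The first line iterates the drift condition, and the last line is a union bound over the first $\tau$ steps. Under these assumptions, taking $R$ to be a sufficiently large multiple of $K\tau/(1-\gamma)$ makes escape from $\{V \le R\}$ within $\tau$ steps unlikely.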
imsart-generic ver. 2014/10/16 file: unimodal_mixing_arXiv.tex date: June 20, 2018
Lemma 4.1. Suppose that $V$ is a Lyapunov function of $P$ satisfying (2), $\Theta \subset \mathbb{R}^d$ satisfies (5), and the mixing time of $P_\Theta$ is $\tau < \infty$. Then there exists $C = C(\gamma, K)$ independent of $\tau$ and $d$ so that the relaxation time of $P$ is at most .
Let $\{X_t\}_{t \ge 0}$ be a Markov chain with transition kernel $P$ and initial state $X_0 = x$. Denote by $\kappa = \inf\{t : X_t \notin \Theta\}$. By Inequality (2) and Markov's inequality,
Applying this bound, the maximal coupling inequality, and the triangle inequality, we obtain the minorization bound:
So then $P^\tau$ satisfies $\inf_{x \in S(R_1)}$
Applying this minorization bound and the Lyapunov bound (2) along with Theorem 1 of Hairer and Mattingly [5] to the Markov operator $P^\tau$ implies that we can bound the geometric convergence rate $\bar{\alpha}$ of $P^\tau$ via $\bar{\alpha} = \inf_{\alpha_0 \in (0, 5/8)}$
where the last line follows because the second term in the maximum is decreasing in $\tau$. This implies that the geometric convergence rate of $P$ is at most $\bar{\alpha}^{1/\tau} < 1$, and so the $L_2(\mu)$ spectral gap of $P$ is at least
by an application of Theorem 2 of Roberts and Rosenthal [18]. Inspection of inequality (7) completes the proof.

Mixing for Unimodal Distributions on Compact Sets
We now show the first of two Lemmas necessary to prove the mixing time bound inside of $\Theta$. This lemma shows that when started from an initial condition very close to the mode, $P_\Theta$ will mix rapidly. Our approach is to compare $P_\Theta$ to a chain with transition kernel $\tilde{P}_\Theta(x, \cdot) = \mu_\Theta(\cdot)$ for all $x \in \Theta$; this chain simply takes iid samples from its stationary measure. We can write the transition densities of $P_\Theta$ and $\tilde{P}_\Theta$ as:
where $\alpha_\Theta(x, y) = 1 \wedge \frac{p_\Theta(y)\,q(y, x)}{p_\Theta(x)\,q(x, y)}$.
Then $\Gamma = \{\gamma_{x,y} : (x, y) \in \Theta,\ \alpha_\Theta(u, v)\,q(u, v) > 0\}$ is the collection of all such paths of finite length. We say that a pair $(u, v) \in \Theta \times \Theta$ is an $i$th edge of the path $\gamma_{x,y}$ iff $u$ and $v$ are the endpoints of the $i$th step of $\gamma_{x,y}$. Let $E_i$ be the collection of the $i$th edges of all paths $\gamma \in \Gamma$, and put $E = \bigcup_{i \in \mathbb{N}} E_i$. As shown in Section 2 of [21], the set of paths $\Gamma$ satisfies the regularity conditions of Theorem 3.2 of that paper and, for any $(u, v) \in E$, the associated Jacobian satisfies $J_{x,y}(u, v) = b_{x,y}^d$ (see Yuen [21, page 5] for details). Define $\xi(u, v) = \alpha_\Theta(u, v)\,q(u, v)\,p_\Theta(u)$ and for any $\gamma_{x,y} \in \Gamma$ put $\|\gamma_{x,y}\|_0 = \sum_{(u,v) \in \gamma_{x,y}} \xi(u, v)^0 = b_{x,y}$.
Notice we can also view the comparison kernel $\tilde{P}_\Theta$ as a Metropolis-Hastings kernel with acceptance probability $\alpha(x, y) = 1$ and proposal $q(x, y) = p_\Theta(y)$. To use Theorem 3.2 of [21], we must bound the geometric constant
Bounding below by the uniform proposal on $B_\epsilon(x)$ we have
using (1) and the volume of a unit ball in
which is everywhere positive for any $(u, v) \in \gamma_{x,y}$ by the definition of $\Gamma$. Defining $b = \max_{x,y} b_{x,y} \le 2L\epsilon^{-1} + 1$, we have
where in the second line we used the fact that there are at most $\ell$ starting points for paths of length $\ell$ that contain the edge $(u, v)$ and
for any points $u, v$ on a linear path from $x$ to $y$, and therefore
Thus, combining with (8), we obtain
with $b = 2\epsilon^{-1}L + 1$ the maximum length of a path consisting of steps of size $\epsilon$ connecting two points inside a ball of radius
where we used that $\epsilon \le L$ so that $2\epsilon^{-1}L + 1 \le 3\epsilon^{-1}L$.
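The path-length count used above can be checked numerically. The sketch below assumes that the linear path from x to y is obtained by cutting the segment into equal steps of length at most ε. This is one simple construction and may differ in detail from the paths of [21]. The function name `linear_path` and the example values of `L` and `eps` are hypothetical.

```python
import math
import numpy as np

def linear_path(x, y, eps):
    """Points on the segment from x to y, with consecutive points at most eps apart."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    y = np.atleast_1d(np.asarray(y, dtype=float))
    dist = float(np.linalg.norm(y - x))
    n_edges = max(1, math.ceil(dist / eps))  # number of steps along the segment
    ts = np.linspace(0.0, 1.0, n_edges + 1)
    return x + ts[:, None] * (y - x)         # shape (n_edges + 1, d)

# Hypothetical example: two points in the ball of radius L = 1 in R^2.
L, eps = 1.0, 0.3
path = linear_path([-1.0, 0.0], [1.0, 0.0], eps)
n_edges = len(path) - 1
step_lengths = np.linalg.norm(np.diff(path, axis=0), axis=1)
```

Two points in a ball of radius L are at distance at most 2L, so the number of edges is at most 2L/ε + 1, which is at most 3L/ε whenever ε ≤ L.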
It follows that the spectral gap
By Inequality (3), observe that we can write
for some "remainder" measure $r_x$. Applying Proposition 1.1 of [21] and the bound (9) on the spectral gap of $P_\Theta$, this implies there exists some absolute constant $0 < C < \infty$ such that
We prove the last lemma:
Lemma 4.3. Suppose $P_\Theta$ is a Metropolis-Hastings kernel on $\Theta \subset \mathbb{R}$ satisfying Assumption 2.3. Then there exists some constant
Proof. Define the function $F$ by
$\sim \mathrm{Uniform}(0, 1)$. We fix $x \in (m + \epsilon, m + L] \cap \Theta$ and consider a Markov chain $(X_t, Y_t, Z_t)$ on $X \times X \times X$, with initial state $(x, x, x)$ and dynamics defined jointly by
We focus initially on the properties of $(X_t, Y_t)$. Define
and
where the second bound comes from Inequality (1). Combining these two bounds, we have shown
Now we just need a bound on $\tau^{y}_{\mathrm{hit}}(x)$, which we obtain by comparing $Y_t$ and $Z_t$. By the Berry-Esseen theorem and the sub-exponential tail bound in Inequality (1), there exists a
Using again the fact that the tails of $Q$ are sub-exponential (in the sense of Inequality (1)), along with the fact that $Y_t \le Z_t$ for all $t < \min\{s : Z_s < -L\}$, there exists
This completes the proof of the lemma.