A large deviation principle for the empirical measures of Metropolis-Hastings chains

To sample from a given target distribution, Markov chain Monte Carlo (MCMC) sampling relies on constructing an ergodic Markov chain with the target distribution as its invariant measure. For any MCMC method, an important question is how to evaluate its efficiency. One approach is to consider the associated empirical measure and how fast it converges to the stationary distribution of the underlying Markov process. Recently, this question has been considered from the perspective of large deviation theory, for different types of MCMC methods, including, e.g., non-reversible Metropolis-Hastings on a finite state space, non-reversible Langevin samplers, the zig-zag sampler, and parallel tempering. This approach, based on large deviations, has proven successful in analysing existing methods and designing new, efficient ones. However, for the Metropolis-Hastings algorithm on more general state spaces, the workhorse of MCMC sampling, the same techniques have not been available for analysing performance, as the underlying Markov chain dynamics violate the conditions used to prove existing large deviation results for empirical measures of a Markov chain. This also extends to methods built on the same idea as Metropolis-Hastings, such as the Metropolis-Adjusted Langevin Method or ABC-MCMC. In this paper, we take the first steps towards such a large-deviations based analysis of Metropolis-Hastings-like methods, by proving a large deviation principle for the empirical measures of Metropolis-Hastings chains. In addition, we characterize the rate function and its properties in terms of the acceptance and rejection parts of the Metropolis-Hastings dynamics.


Introduction
Sampling from a given probability distribution is an essential problem in a range of areas, for example biology, physics, epidemiology, ecology, and statistics. The most common approach is Markov chain Monte Carlo (MCMC), which allows the user to sample from a target probability distribution π by generating an ergodic Markov chain {X i } i≥0 with π as stationary distribution. These sampling techniques are particularly helpful when it is not possible to use methods that simulate directly from π, for example for computing posterior distributions in a Bayesian setting, or more generally when π is only known up to a normalizing constant. Because of this, MCMC methods are now widely used across scientific disciplines, and are integral tools in areas such as computational chemistry and physics, statistics and machine learning [RC04, AG07, AdFDJ03].
Because of their prevalence in a range of fields, the performance of MCMC algorithms has become an important topic within applied probability and computational statistics. In principle, even the standard Metropolis-Hastings algorithm [MRR + 53, Has70] can be used to sample from essentially any target distribution π. However, when the underlying problem, and thus the distribution π, becomes more and more complex, convergence speed or the cost per iteration becomes an issue. Analysing and improving the convergence speed of a given class of algorithms, as well as comparing the performance of different types of algorithms, is therefore not only interesting from a theoretical perspective; it is also of central importance for applications, where fast and accurate methods are needed for increasingly complex problems.
When analysing the performance of MCMC methods, the rate of convergence of time averages is a central quantity for comparing different methods, and for choosing hyperparameters. The fundamental idea underlying MCMC is that for an observable f ∈ L 1 (π), for an ergodic Markov chain {X i } i∈N with invariant distribution π, the n-step average (1/n) ∑_{i=0}^{n−1} f(X_i) can be used to approximate the expectation E π [f (X)]. This average can be viewed as the integral of f with respect to the empirical measure of the Markov process. The rate of convergence of the empirical measure is therefore directly linked to the performance of a given MCMC method.
Because of the role the empirical measure plays in MCMC, and for Monte Carlo methods in general, in the past decade there has been an increasing interest in using the theory of large deviations for empirical measures to study the performance of MCMC methods [DLPD12, PDD + 11, RBS15a, RBS15b, RBS16, DDN18, BNS21, Bie16]. However, surprisingly, existing large deviation results do not cover the empirical measure arising from the Metropolis-Hastings algorithm [MRR + 53, Has70] on a general state space. Thus, in order to use a large deviation approach to analyse this foundational algorithm, or more advanced MCMC methods built on the same ideas as Metropolis-Hastings, such as the Metropolis-Adjusted Langevin Method (MALA) [Bes94, RT96a, RR98] and methods based on Approximate Bayesian Computation (ABC) (see [MMPT03, Bea19] for an overview and further references), the relevant large deviation results must first be established. This is the main contribution of this paper: we prove the large deviation principle for the empirical measures associated with Markov chains arising from the Metropolis-Hastings algorithm. This sets the stage for future work proving similar results for Markov chains with dynamics that resemble those of Metropolis-Hastings, and for analysing the corresponding MCMC methods.
The theory of large deviations has become a cornerstone in modern probability theory, with a wide range of applications. In the context of Monte Carlo methods, it has been known for a long time that for rare-event simulation, sample-path large deviations results are integral to analysing and designing efficient algorithms; see [Buc04, AG07, BD19] and the references therein. In the MCMC setting, the theory remains much less explored for analysing performance and designing new, efficient methods. Instead, standard tools for convergence analysis of sampling methods based on ergodic Markov processes include the spectral gap of the associated dynamics, mixing times of the process, asymptotic variance, and functional inequalities (Poincaré, log-Sobolev) [BR08, Ros03, DHN00, FHPS10, FdSHS93, HHMS05]. However, these tools mainly provide information about convergence of the associated n-step transition operator or the law of the process, neither of which is directly linked to the convergence of the empirical measure. Empirical measure large deviations are instead concerned precisely with the convergence of the empirical measure. This is in turn linked to the transient behaviour of the underlying Markov chain, which is of central importance for the performance of MCMC methods.
To the best of our knowledge, the first works on using large deviation theory to study the convergence of the empirical measures arising from MCMC sampling are [DLPD12, PDD + 11]. Therein, the authors analyse the performance of parallel tempering, one of the most frequently applied MCMC methods in computational chemistry and physics, from the perspective of large deviations, leading to the construction of a new type of method known as infinite swapping. In the subsequent work [DDN18], empirical measure large deviations and associated stochastic control problems are used to analyse the convergence properties of parallel tempering and infinite swapping. In [DW22] the authors study methods like parallel tempering and infinite swapping in the low-temperature regime, and use empirical measure large deviations to solve the long-standing open problem of optimal temperature selection. Similarly, in [Bie16, RBS15a, RBS15b, RBS16] a large deviation approach is used to analyse certain irreversible samplers. In [BNS21], large deviations for the empirical measures of certain piecewise deterministic Markov processes, including the zig-zag sampler, are obtained, and the associated rate function is used to address a key question concerning the optimal choice of the so-called switching rate of the zig-zag process. The results therein also highlight the differences between considering convergence of empirical averages and studying the convergence to equilibrium with, e.g., the spectral gap; see also [Ros03, VM20].
In this paper we focus on the Metropolis-Hastings algorithm [MRR + 53] (described in Section 2.3), the most classical MCMC method and the main building block for many more advanced methods [RC04, AG07, AdFDJ03, Tie98]. Because of its importance in the area of Monte Carlo sampling, the method is well-studied, and classical results on convergence properties and performance include [MT96, RT96b, GGR97, RR97, RR01, CRR05]; see also [MT12, DMPS18] and the references therein for the general theory of Markov chains. However, despite significant efforts over a long time, there are still gaps in our understanding of the theoretical properties of this fundamental class of algorithms. As an example, in a recent tour de force [ALPW22a, ALPW22b] the authors develop a functional analytical framework, aimed at analysing Markov chains arising in sampling algorithms, and obtain the first explicit convergence bounds for the Metropolis algorithm. In [Bie16] a non-reversible version of Metropolis-Hastings is introduced and studied. One of the methods used for analysing performance is large deviations for the associated empirical measure. Because the setting is a finite state space S, the classical results [DV75, DV75b, DV76], due to Donsker and Varadhan, give the large deviation principle. To the best of our knowledge, this is the only work that studies large deviations for Markov chains arising from algorithms of Metropolis-Hastings type.
In [Bie16] the focus is on the effects of non-reversibility, and there is thus no attempt to extend the large deviation results to the setting where the state space S is instead an (uncountable) subset of R d . This is the setting typically encountered in applications.
The pioneering work by Donsker and Varadhan [DV75, DV75b, DV76] is often the starting point for empirical measure large deviations for Markov processes, and their results have been extended in numerous directions; see [DZ94, FK06, BD19] and the references therein. However, it is pointed out in [DL15] (see also Section 2.2) that even for fairly simple continuous-time pure-jump processes, the results by Donsker and Varadhan, or more general versions of them such as in, e.g., [KM05], do not hold. This is because all such large deviation results rely on the transition probability function of the Markov process having a density with respect to some reference measure. In [DL15] the authors show how this condition can be replaced by a more general transitivity condition (Condition 2.1) to ensure that a large class of processes is covered. However, for Metropolis-Hastings chains, neither of these conditions holds, due to the rejection part of the dynamics. The purpose of this paper is to show that, despite this violation of the standard transitivity conditions, the empirical measures of the Metropolis-Hastings chain do satisfy a large deviation principle. The proof is based on the weak convergence approach [DE97, BD19], which is described in some more detail in Sections 2.2 and 4. With the large deviation results established, our future work is aimed at (i) analysing the performance of, and comparing, various Metropolis-Hastings algorithms using the rate function, and comparing the conclusions to, e.g., the recent results [ALPW22a]; (ii) investigating whether optimal scaling results, similar to the celebrated results in [GGR97, RR01], can be obtained from a large deviation perspective; (iii) extending the results to cover more advanced MCMC algorithms, such as MALA and ABC-MCMC. These topics are all significant undertakings in their own right and we leave them to be investigated separately in future work.
The remainder of the paper is organized as follows. In Section 2 we provide the preliminaries needed for the paper: notation and definitions, a brief overview of large deviations for empirical measures, and a description of the Metropolis-Hastings algorithm. Next, in Section 3 we present the assumptions used for the Metropolis-Hastings chain. The main result is stated in Theorem 4.1 in Section 4. In this section we also show some properties of the associated rate function. The proof of Theorem 4.1 is divided into two parts: in Sections 5 and 6 we prove the Laplace upper and lower bounds, respectively, which combined prove Theorem 4.1.

Preliminaries
2.1. Notation and definitions. Throughout the paper we work with some probability space (Ω, F , P). We use a.s. and w.p. 1 as shorthand for almost sure, or almost surely, and with probability 1, respectively.
For a Polish space S, with a translation invariant metric d S , B(S) is the Borel σ-algebra on S, and C(S) and C b (S) denote the spaces of functions f : S → R that are continuous, and bounded and continuous, respectively. For any r ∈ R + and x ∈ S, B r (x) is the open ball of radius r centered at x: B r (x) = {y ∈ S : d S (x, y) < r}.
When S ⊆ R d , for some d ≥ 1, we take λ to denote Lebesgue measure on R d . We abuse notation a bit in that λ is generically taken to represent Lebesgue measure, regardless of the underlying dimension d. For integration with respect to λ we use the standard notation dx for λ(dx).
For a measure η on S and a measurable function f on S, we denote the integral of f with respect to η by η(f ) = ∫_S f(x) η(dx). When f is the indicator of a set A, we write η(A) = ∫_A η(dx).
The space of probability measures on S is denoted by P(S). Given γ ∈ P(S 2 ), denote by [γ] 1 and [γ] 2 the first and second marginals of γ, respectively. For µ ∈ P(S), define

(2.1) A(µ) = {γ ∈ P(S 2 ) : [γ] 1 = [γ] 2 = µ}.

We consider the topology of weak convergence on P(S), and use ν n ⇒ ν as shorthand notation for {ν n } ⊂ P(S) converging weakly to ν ∈ P(S). Unless otherwise stated, we equip P(S) with the Lévy-Prohorov metric, denoted d LP : for ν, µ ∈ P(S),

d LP (ν, µ) = inf{ε > 0 : ν(A) ≤ µ(A^ε) + ε and µ(A) ≤ ν(A^ε) + ε for all A ∈ B(S)},

where A^ε = {x ∈ S : d S (x, A) < ε}. This metric is compatible with the topology of weak convergence (see, e.g., [BD19], Theorem A.1), and turns P(S) into a Polish space. For any signed measure η on S, the total variation norm of η, ‖η‖ TV , is defined as

‖η‖ TV = sup |η(f)|,

where the supremum is taken over all measurable functions f bounded by 1. For ν, µ ∈ P(S), the total variation norm provides an upper bound on d LP :

d LP (ν, µ) ≤ ‖ν − µ‖ TV .

For a measurable space (Y, A), let q(y, dx) be a collection of probability measures on S parameterized by y ∈ Y . Then q is called a stochastic kernel on S given Y if, for every A ∈ B(S), the map y → q(y, A) ∈ [0, 1] is measurable.
For a Markov chain {X i } i∈N taking values in S, for a given x 0 ∈ S, we denote by P x0 the distribution of {X i } i∈N starting at x 0 . The associated expectation operator is denoted by E x0 . The transition probability function, or transition kernel, of a Markov chain is a stochastic kernel q, such that the distribution of X i given X i−1 is given by q(X i−1 , •). We say that a transition probability function q(x, dy) on S satisfies the Feller property if, for any sequence {x n } n∈N ⊂ S such that x n → x ∈ S, we have q(x n , •) ⇒ q(x, •). Given a measure µ ∈ P(S) and a transition kernel q(x, dy), we say that µ is invariant for q, or for the corresponding Markov chain, if for all A ∈ B(S),

µ(A) = ∫_S q(x, A) µ(dx).

For ν ∈ P(S), R(• ‖ ν) : P(S) → [0, ∞] is the relative entropy (with respect to ν), defined by

R(µ ‖ ν) = ∫_S log (dµ/dν) dµ if µ ≪ ν, and R(µ ‖ ν) = +∞ otherwise.
We recall the following properties of relative entropy (see Lemmas 1.4.1 and 1.4.3 in [DE97]): R(• ‖ •) is jointly convex and jointly lower semi-continuous with respect to the weak topology on P(S) 2 , and R(µ ‖ ν) = 0 if and only if µ = ν. Another useful property follows from the chain rule for relative entropy (see Theorem 2.6 and Corollary 2.7 in [BD19]): given two transition kernels p, q, for any µ ∈ P(S),

(2.2) R(µ ⊗ q ‖ µ ⊗ p) = ∫_S R(q(x, •) ‖ p(x, •)) µ(dx),

where µ ⊗ q denotes the measure µ(dx)q(x, dy) on S 2 . Lastly, for a set A, A° and Ā denote the interior and closure of the set, respectively, and x → I{x ∈ A} is the indicator function of the set A. When the set is a singleton, A = {y}, we write I{x = y}. We also use δ y to denote the point mass at y.
2.2. Large deviations for empirical measures of a Markov chain. Consider a Markov chain X = {X i } i≥0 with state space S and transition probability function p. The empirical measure L n associated with the chain X is defined as

(2.3) L_n = (1/n) ∑_{i=0}^{n−1} δ_{X_i}.

For each n, this is a random element of P(S). We can also view {L n } n≥0 as a stochastic process in P(S).
In the context of MCMC methods, empirical measures are essential objects as they are used for forming approximations of any observable: for a given observable f ∈ C b (S), we have

L_n(f) = (1/n) ∑_{i=0}^{n−1} f(X_i).

If the Markov chain X has an invariant distribution π ∈ P(S) and is ergodic, then L n (f ) → π(f ) a.s. as n → ∞. Thus, there is a direct link between the convergence properties of the empirical measure L n and the performance of Monte Carlo methods based on time averages for approximating observables. Classical approaches to studying the performance of MCMC methods often rely on mixing properties or the asymptotic variance, which are not directly linked to the empirical measure L n of the underlying Markov chain. The theory of large deviations, on the other hand, is concerned precisely with deviations of L n from π as the number of steps n grows. It therefore serves as a useful complement to the more traditional methods for analysing the performance of a given MCMC method, as well as for designing new algorithms.
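The link between L_n(f) and π(f) can be made concrete with a small simulation. The following sketch (not from the paper; the AR(1) chain, observable, and parameters are illustrative choices) estimates π(f) by the n-step average for a Gaussian autoregressive chain whose stationary distribution is N(0, 1):

```python
import math
import random

# Illustrative ergodic chain: X_{i+1} = rho*X_i + sqrt(1-rho^2)*Z_i with Z_i ~ N(0,1).
# Its stationary distribution is pi = N(0,1), so for f(x) = x^2 we have pi(f) = 1.
random.seed(0)
rho, n = 0.5, 200_000
scale = math.sqrt(1.0 - rho ** 2)

x, running_sum = 0.0, 0.0
for _ in range(n):
    x = rho * x + scale * random.gauss(0.0, 1.0)
    running_sum += x ** 2        # observable f(x) = x^2

ln_f = running_sum / n           # L_n(f) = (1/n) * sum_i f(X_i)
print(ln_f)                      # close to pi(f) = 1
```

Increasing n tightens the estimate; the large deviation theory discussed next quantifies how quickly deviations of L_n from π become unlikely.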
At the heart of the theory of large deviations is the large deviation principle (LDP): the sequence {L n } is said to satisfy an LDP with speed n and rate function I : P(S) → [0, ∞], if I is lower semi-continuous, has compact sub-level sets, and for any measurable A ⊂ P(S),

−inf_{ν ∈ A°} I(ν) ≤ liminf_{n→∞} (1/n) log P(L_n ∈ A) ≤ limsup_{n→∞} (1/n) log P(L_n ∈ A) ≤ −inf_{ν ∈ Ā} I(ν).

The gist of these inequalities is that, if {L n } satisfies an LDP with speed n and rate function I, then for any ν ∈ P(S) and n large,

P(L_n ≈ ν) ≈ exp(−n I(ν)).

The definition of an LDP makes this statement rigorous in the limit n → ∞.
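The exponential decay in n can be checked numerically in the simplest i.i.d. special case (a Sanov/Cramér-type setting, outside the Markov-chain scope of the paper, but exhibiting the same speed-n scaling). For Bernoulli(p) variables, the event that the sample mean exceeds a has rate R(Ber(a) ‖ Ber(p)), and the exact binomial tail confirms the slope; all numerical choices below are illustrative:

```python
import math

# Sanov-type illustration: -(1/n) log P(sample mean >= a) -> R(Ber(a) || Ber(p)).
p, a = 0.5, 0.7

def rel_entropy(a, p):
    # relative entropy between Bernoulli(a) and Bernoulli(p)
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def log_tail(n, p, a):
    # exact log P(Binomial(n, p) >= ceil(n*a)), summed in log space for stability
    k0 = math.ceil(n * a - 1e-9)   # small guard against float rounding of n*a
    terms = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(p) + (n - k) * math.log(1 - p) for k in range(k0, n + 1)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

for n in (100, 1000, 10000):
    print(n, -log_tail(n, p, a) / n)   # approaches rel_entropy(a, p) ≈ 0.0823
```

The slow O(log n / n) correction visible for small n comes from the polynomial prefactor in front of exp(−n I).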
For any metric space, an equivalent formulation of the LDP is the Laplace principle (see, e.g., Theorems 1.5 and 1.8 in [BD19]). In the setting of the empirical measures {L n }, we have that this sequence satisfies a Laplace principle, with speed n and rate function I (same as in the LDP), if for any F ∈ C b (P(S)),

lim_{n→∞} −(1/n) log E[exp(−n F(L_n))] = inf_{µ ∈ P(S)} {F(µ) + I(µ)}.

The starting point for large deviations of empirical measures of Markov processes is the pioneering work of Donsker and Varadhan [DV75, DV76]. A central assumption in those works is that the transition probability function p has a density with respect to some reference measure. This is a reasonable transitivity assumption for processes that involve something that, in some sense, resembles a diffusive term. However, in [DL15] the authors show that it is a rather restrictive condition and, as an example, construct a simple continuous-time pure-jump process for which it does not hold. The following alternative condition on p was used in [DL15] to establish an LDP for the empirical measures of a Markov process.
Condition 2.1 (Condition 6.3 in [BD19]). The transition kernel p of the Markov chain X is such that there exist positive integers l_0 and n_0 such that for all x and ζ in S,

(2.5) (1/l_0) ∑_{i=1}^{l_0} p^{(i)}(x, dy) ≪ (1/n_0) ∑_{j=1}^{n_0} p^{(j)}(ζ, dy),

where p^{(k)} denotes the k-step transition probability. This condition is general enough to cover a large class of Markov processes, both in discrete and continuous time; see, e.g., [BD19, DL15] and the references therein. However, it does not cover the case when X comes from a Metropolis-Hastings scheme, as we show with a simple counterexample in Section 4. Condition 2.1, or variations of it, is a key ingredient in existing work on large deviations for Markov chains. Because it is not satisfied for Metropolis-Hastings, in order to use large deviations to analyse the performance of such algorithms, and with an outlook towards more advanced MCMC methods that build on the Metropolis-Hastings algorithm, e.g., MALA and ABC-MCMC, we must first establish the relevant LDP. This is the main contribution of this paper.

Metropolis-Hastings algorithm.
We now give a brief description of the Metropolis-Hastings (MH) algorithm for constructing a Markov chain X = {X i } i≥0 with the target measure π as invariant distribution. For simplicity we restrict ourselves to the setting where S ⊆ R d and π is equivalent to Lebesgue measure.
More abstract settings are possible as well; see for example [Tie98]. However, this would require different assumptions and modifications of the proof of the large deviation principle in Sections 5 and 6.
The main ingredient of the MH algorithm is the proposal distribution J(•|x) ∈ P(S), defined for all x ∈ S. If the chain after n steps is in some state X n = x n , a proposal Y n+1 for the next state X n+1 is generated from J(•|x n ). This is followed by an acceptance-rejection step, defined in terms of the Hastings ratio

̟(x, y) = min{ 1, (π(y)J(x|y)) / (π(x)J(y|x)) },

where if π(x)J(y|x) = 0, we set ̟(x, y) = 1. The proposed move from x n to Y n+1 is accepted with probability ̟(x n , Y n+1 ), and rejected with probability 1 − ̟(x n , Y n+1 ). In the latter case, we set X n+1 = x n . The pseudocode for the update step in the MH algorithm is presented in Algorithm 2.1.
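Since Algorithm 2.1 describes a single update, a minimal Python sketch of that step may be helpful. The Gaussian random-walk proposal and the N(0, 1) target below are illustrative assumptions, not choices made in the paper, and ̟ follows the convention ̟(x, y) = 1 when π(x)J(y|x) = 0:

```python
import math
import random

def mh_update(x, log_pi, propose, proposal_density):
    """One Metropolis-Hastings update: draw Y ~ J(.|x), accept with
    probability given by the Hastings ratio, otherwise keep X_{n+1} = x."""
    y = propose(x)
    num = math.exp(log_pi(y)) * proposal_density(x, y)  # pi(y) J(x|y)
    den = math.exp(log_pi(x)) * proposal_density(y, x)  # pi(x) J(y|x)
    varpi = 1.0 if den == 0.0 else min(1.0, num / den)  # convention: varpi = 1 if den = 0
    return y if random.random() < varpi else x

# Illustrative ingredients: N(0,1) target, Gaussian random walk with step 0.8.
# The proposal density is unnormalized; its constant cancels in the ratio.
log_pi = lambda x: -0.5 * x * x
propose = lambda x: x + 0.8 * random.gauss(0.0, 1.0)
proposal_density = lambda y, x: math.exp(-0.5 * ((y - x) / 0.8) ** 2)

random.seed(1)
x, samples = 0.0, []
for _ in range(50_000):
    x = mh_update(x, log_pi, propose, proposal_density)
    samples.append(x)
print(sum(samples) / len(samples))  # near the target mean 0
```

Because the target is only needed up to a normalizing constant, `log_pi` can drop it, which is exactly what makes MH useful for Bayesian posteriors.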

Based on the Hastings ratio, define the kernels

(2.6) a(x, y) = ̟(x, y)J(y|x),

and

(2.7) r(x) = 1 − ∫_S a(x, y) dy.

The kernel a corresponds to the acceptance part of the MH algorithm, i.e., it corresponds to transitions to proposed states that are accepted in the MH algorithm. Similarly, r corresponds to the rejection part: it represents the probability of rejecting a proposed state, and thus remaining at the current state of the chain. With these definitions, the dynamics of the MH algorithm corresponds to generating a Markov chain {X i } i≥0 , the MH chain, with transition kernel

(2.8) K(x, dy) = a(x, y) dy + r(x) δ_x(dy).

For a more in-depth look at the MH algorithm and its various properties, see for example [RC04] and the references therein. A key observation is that due to the form of the Hastings ratio, and the corresponding kernel K, under reasonable assumptions on the proposal distribution J, the MH chain {X i } i≥0 generated according to the above has π as its unique invariant measure.
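To make the decomposition of K into an acceptance density and a rejection mass concrete, the sketch below evaluates a(x, y) and the rejection probability r(x) = 1 − ∫ a(x, y) dy by quadrature. The N(0, 1) target and Gaussian random-walk proposal are assumptions for this example only:

```python
import math

# Illustrative target pi = N(0,1) and proposal J(y|x) = N(y; x, 1).
pi_d = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
J = lambda y, x: math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2.0 * math.pi)

def a(x, y):
    # acceptance density a(x, y) = varpi(x, y) J(y|x),
    # with the convention varpi = 1 when pi(x) J(y|x) = 0
    den = pi_d(x) * J(y, x)
    varpi = 1.0 if den == 0.0 else min(1.0, pi_d(y) * J(x, y) / den)
    return varpi * J(y, x)

def r(x, lo=-12.0, hi=12.0, m=4000):
    # rejection probability r(x) = 1 - integral of a(x, y) dy (trapezoid rule)
    h = (hi - lo) / m
    total = 0.5 * (a(x, lo) + a(x, hi)) + sum(a(x, lo + k * h) for k in range(1, m))
    return 1.0 - h * total

for x in (0.0, 1.0, 2.0):
    print(x, r(x))  # r(x) strictly between 0 and 1: rejection has positive probability
```

For this symmetric proposal one can check analytically that r(0) = 1 − 1/√2 ≈ 0.293, which the quadrature reproduces; the positive r(x) is exactly the point mass at x in K(x, dy).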

Assumptions
In this section, we state the assumptions we make on the MH chain defined in Section 2.3. Rather than aiming to make them as general as possible, we have aimed for assumptions, primarily on the proposal distribution J, that are tangible from the perspective of MCMC methods. One alternative, commonly used when studying this type of Markov chain, is to assume the existence of some Lyapunov function [MT96, RT96a, KM03, KM05]. Although this ensures the convergence of the empirical measures, additional assumptions are still needed for the large deviation results; see, e.g., the Donsker-Varadhan-like assumption on the transition kernel in [KM05].
As mentioned in Section 2.3, we make the assumption that S ⊆ R d , for some d ≥ 1. We make a slight abuse of notation, in that we let π(•), J(•|x), and a(x, •) denote both the corresponding measures and probability density functions. In order to establish the LDP, we make the following additional assumptions. Because π and λ are equivalent measures, the support of π is all of S. However, it is not necessarily the case that π(x) > 0 for all x ∈ S, as there may exist a (nonempty) set E ⊂ S such that λ(E) = 0 and π(x) = 0 for x ∈ E. Therefore, define the set S + as

(3.2) S + = {y ∈ S : π(y) > 0}.
Observe that S + is an open subset of S, since the density function π is continuous. Assumptions (A.1) and (A.2) are used to show that the MH transition kernel K, and thus the MH chain {X i } i≥0 , has certain properties. Assumption (A.3) replaces a compactness assumption on S in proving the LDP; in the case of a compact state space S, this assumption is not needed.
Remark 3.1. We start by showing that the combination of (A.1) and (A.2) ensures continuity and boundedness of the components a (acceptance part) and r (rejection part) of the MH transition kernel K. To see this, note first that it is sufficient to define K only for states x ∈ S + . If x ∈ S + , the quantities defining K are well defined, and therefore so is the MH transition kernel K(x, •). Moreover, if the initial point X 0 of the chain belongs to S + , the MH algorithm only allows moves to states y that preserve the property π(y) > 0.
Since π(x) is continuous for all x ∈ S and J(x|y) is continuous and bounded for all (x, y) ∈ S 2 , we have a(x, y) ∈ C b (S + × S).
From the continuity of a(x, y) on S + × S, we obtain that r(x) = 1 − a(x, S) is also continuous for all x ∈ S + . This continuity extends to all of S. First, if x ∉ S + , so that π(x) = 0, then ̟(x, y) = 1 and a(x, y) = J(y|x) for λ-almost all y ∈ S, since π(y) > 0 for λ-almost all y ∈ S; hence r(x) = 1 − ∫_S J(y|x) dy = 0. Now take x ∉ S + and a sequence {x n } ⊂ S that converges to x. From the continuity of the target density function π, π(x n ) → π(x) = 0. Moreover, for a fixed y such that π(y) > 0, we have a(x n , y) → J(y|x) as n → ∞. Since π and λ are equivalent, π(y) > 0 for λ-almost all y ∈ S, so lim_{n→∞} a(x n , y) = J(y|x) for λ-almost all y ∈ S. Recalling that J is bounded, by dominated convergence we have

lim_{n→∞} r(x_n) = 1 − lim_{n→∞} ∫_S a(x_n, y) dy = 1 − ∫_S J(y|x) dy = 0 = r(x).

This shows that r is continuous on S.
Remark 3.2. Next, we show that (A.1)-(A.2) ensure that K has the target measure π as its unique invariant distribution, and that the MH chain {X i } i≥0 is ergodic.
Let x ∈ S + , with S + as defined in (3.2). Since λ ≪ π by (A.1), π(y) > 0 for λ-almost every y ∈ S. Moreover, by Assumption (A.2), J(x|y) > 0 for all (x, y) ∈ S 2 . It follows that a(x, y) > 0 for λ-a.e. y ∈ S. This in turn implies that λ ≪ a(x, •), and λ and a(x, •) are equivalent measures for all x ∈ S + . By transitivity, a(x, •) and a(y, •) are equivalent for all x, y ∈ S + . We now show that from this it follows that the MH transition kernel K is indecomposable, i.e., there are no disjoint Borel sets A 1 , A 2 ∈ B(S) such that K(x, A 1 ) = 1 for all x ∈ A 1 and K(x, A 2 ) = 1 for all x ∈ A 2 . We argue by contradiction. Assume that two such sets exist. Then, for x ∈ A 1 ,

(3.5) a(x, A 1 ) + r(x) = K(x, A 1 ) = 1.

Since λ ≪ a(x, •), we have a(x, S) > 0, and thus r(x) = 1 − a(x, S) < 1 for all x ∈ S + .
Combined with (3.5), this shows a(x, A 1 ) > 0. It then follows from a(x, •) and a(y, •) being equivalent measures that a(y, A 1 ) > 0 for y ∈ A 2 , which contradicts the assumption. Hence, K(x, dy) is indecomposable. By Theorem 7.16 in [Bre92], π is the unique invariant distribution for the MH transition kernel K(x, dy) and the Markov chain associated with π and K(x, dy) is ergodic.
and the corresponding H b is ∫_S e^{⟨α,y⟩} a(x, dy) + r(x) e^{⟨α,x⟩}.
Note that if the space S is compact, then Assumption (A.3) is automatically satisfied (for example, take U (x) ≡ 0).

Large deviations for empirical measures of Metropolis-Hastings chains
We are now ready to state our main result, an LDP for the sequence {L n } of empirical measures of the MH chain {X i } i≥0 with invariant distribution π (see Section 2.3 for the definition).
Theorem 4.1. Let {X i } i≥0 be the Metropolis-Hastings chain from Section 2.3 and K(x, dy) the associated transition kernel. Let {L n } n≥0 ⊂ P(S) be the corresponding sequence of empirical measures, defined in (2.3). Under Assumptions (A.1)-(A.3), with A(µ) as in (2.1), {L n } n≥0 satisfies an LDP with speed n and rate function

(4.1) I(ν) = inf_{γ ∈ A(ν)} R(γ ‖ ν ⊗ K).

As mentioned in Section 2.1, we consider P(S) as a metric space (equipped with, e.g., the Lévy-Prohorov metric). Therefore, the LDP is equivalent to the Laplace principle, and we will use the latter to prove Theorem 4.1. More specifically, the proof is split up into proving the Laplace principle upper bound,

(4.2) limsup_{n→∞} −(1/n) log E_{x_n}[exp(−n F(L_n))] ≤ inf_{µ ∈ P(S)} {F(µ) + I(µ)},

and the Laplace principle lower bound,

(4.3) liminf_{n→∞} −(1/n) log E_{x_n}[exp(−n F(L_n))] ≥ inf_{µ ∈ P(S)} {F(µ) + I(µ)},

for every F ∈ C b (P(S)), every sequence {x n } ⊂ S and x ∈ S. The respective proofs are given in Sections 5 and 6. The starting point for both bounds is the following representation formula (Proposition 6.1 in [BD19]): for every bounded, measurable F,

(4.4) −(1/n) log E_x[exp(−n F(L_n))] = inf E_x[ F(L̄_n) + (1/n) ∑_{i=0}^{n−1} R(µ̄_i(•) ‖ K(X̄_i, •)) ],

where the infimum is taken over admissible control sequences {µ̄_i}, and L̄_n is the controlled empirical measure, L̄_n = (1/n) ∑_{i=0}^{n−1} δ_{X̄_i}. For the upper bound, under Assumptions (A.1)-(A.3), the proof of Proposition 6.13 in [BD19], with the additional arguments in Section 6.10 therein to account for a non-compact state space, can be applied in our setting as well. The only thing that needs to be verified is the Feller property of the MH transition kernel K (see Lemma 5.2).
The work in proving Theorem 4.1 lies in proving the lower bound (4.3). Existing results rely on some variation of Condition 2.1. However, such a condition is not applicable in our setting, as the following simple example shows. Take x ∈ S such that r(x) > 0 (i.e., when at x, there is a positive probability of rejecting a proposed move and staying at x) and consider the Borel set A = {x} ∈ B(S). If x ≠ ζ, then p^{(j)}(ζ, A) = p^{(j)}(ζ, {x}) = 0 for all j ≥ 0. However, p^{(i)}(x, {x}) > 0 for all i ≥ 0, since r(x) > 0. This shows that (2.5) does not hold for all x, ζ ∈ S, and Condition 2.1 does not hold for the MH kernel K, nor for kernels of similar type, such as those arising in ABC-MCMC or MALA.
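The rejection atom behind this counterexample is easy to see in simulation: whenever a proposal is rejected, the chain repeats its current state exactly, so singletons carry positive probability under p(x, •) while being null under p(ζ, •) for ζ ≠ x. The target and proposal below are illustrative choices:

```python
import math
import random

# Random-walk MH for an illustrative N(0,1) target: count exact repeats (rejections).
random.seed(7)
log_pi = lambda x: -0.5 * x * x

x, repeats, n = 2.0, 0, 100_000
for _ in range(n):
    y = x + random.gauss(0.0, 1.0)      # symmetric proposal, J(y|x) = J(x|y)
    accept_prob = math.exp(min(0.0, log_pi(y) - log_pi(x)))
    if random.random() < accept_prob:
        x = y                           # accept the proposed state
    else:
        repeats += 1                    # reject: stay exactly at x

frac = repeats / n
print(frac)  # positive rejection frequency, i.e., atoms in the empirical measure
```

The positive fraction of exact repeats is precisely the mass r(x) δ_x(dy) in the MH kernel, which is what breaks the absolute-continuity requirement in Condition 2.1.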
In Section 6 we show how the Laplace principle lower bound can be established for the MH chain without relying on a transitivity assumption like Condition 2.1. The main point is that, due to the specific structure of the MH kernel, under Assumptions (A.1)-(A.2) the chain retains the properties that are important for proving the LDP (and typically guaranteed by something like Condition 2.1 combined with other assumptions).
The main difficulty in the proof arises from the fact that, contrary to the setting in [BD19], for ν ∈ P(S), I(ν) < ∞ does not imply that ν ≪ π. In [BD19] this implication is used in defining near-optimal controls in the representation (4.4), which in turn can be used to prove the lower bound.
Before proceeding with the proofs of the upper and lower bounds, in the following section we give some different characterizations and properties of the rate function I in (4.1).

4.1. Characterization and properties of the rate function. We first express the rate function (4.1) in a more convenient form. By Lemma 6.8(a) in [BD19], the probability measures in the set A(ν) are of the form γ(dx × dy) = ν(dx) q(x, dy), for a transition kernel q(x, dy) such that ν is invariant for q. Therefore, using (2.2), the chain rule for relative entropy, we can rewrite (4.1) as

(4.5) I(ν) = inf_{q ∈ Q} ∫_S R(q(x, •) ‖ K(x, •)) ν(dx),

where Q denotes the set of all transition kernels q(x, dy) on S such that ν is an invariant distribution for q. Lemma 6.8(b) in [BD19] guarantees the existence of a minimizing q in the definition of I(ν), under the assumption I(ν) < ∞. That is, there exists a transition kernel q with stationary distribution ν such that

(4.6) I(ν) = ∫_S R(q(x, •) ‖ K(x, •)) ν(dx).

The representation (4.6) of the rate function allows us to characterize the minimizers q, based on the form (2.8) of the MH transition kernel K, as the following result shows.

Lemma 4.2. If I(ν) < ∞, then the transition kernel q(x, dy) in (4.6) satisfies q(x, •) ≪ K(x, •) ν-a.s. In particular, it is of the form

q(x, dy) = α(x, y) dy + ρ(x) δ_x(dy),

where α(x, •) ≪ a(x, •) ≪ λ, and ρ(x) = 0 whenever r(x) = 0.

Proof. Since I(ν) < ∞, we have R(q(x, •) ‖ K(x, •)) < ∞ for ν-almost all x ∈ S. By the definition of relative entropy, this means that q(x, •) ≪ K(x, •) ν-a.s. Recall that K(x, dy) = a(x, y)dy + r(x)δ_x(dy), i.e., K(x, •) is a mixture of a transition kernel a(x, •) ≪ λ and a point mass at x. Therefore, for the transition kernel q(x, •) to satisfy q(x, •) ≪ K(x, •) ν-a.s., it must be of the form q(x, dy) = α(x, y)dy + ρ(x)δ_x(dy), where α(x, •) ≪ a(x, •) ≪ λ, and ρ(x) = 0 if r(x) = 0. In particular, ρ(x) must be a measurable function in order to make x → q(x, A) a measurable function for every A ∈ B(S), and therefore q a stochastic kernel.
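As a sanity check of the representation (4.5), the following finite-state sketch (the three-state target and uniform proposal are illustrative; the paper works with S ⊆ R^d) builds the MH kernel K, verifies that π is invariant, and confirms that choosing ν = π and q = K makes the integrand R(q(x, •) ‖ K(x, •)) vanish, consistent with I(π) = 0:

```python
import math

# Illustrative three-state target and uniform symmetric proposal.
S = [0, 1, 2]
target = [0.5, 0.3, 0.2]
J = [[1.0 / 3.0] * 3 for _ in S]

# Metropolis-Hastings kernel: accepted moves off the diagonal, rejection mass at x.
K = [[0.0] * 3 for _ in S]
for x in S:
    for y in S:
        if y != x:
            K[x][y] = J[x][y] * min(1.0, target[y] * J[y][x] / (target[x] * J[x][y]))
    K[x][x] = 1.0 - sum(K[x][y] for y in S if y != x)   # rejection probability r(x)

# Invariance: sum_x target[x] K[x][y] should equal target[y].
inv = [sum(target[x] * K[x][y] for x in S) for y in S]

def rel_entropy_rows(q, k):
    # R(q(x, .) || k(x, .)) for each state x
    return [sum(q[x][y] * math.log(q[x][y] / k[x][y]) for y in S if q[x][y] > 0)
            for x in S]

# With nu = target and q = K, every term in (4.5) is zero, so I(target) = 0.
integral = sum(target[x] * rx for x, rx in zip(S, rel_entropy_rows(K, K)))
print(inv, integral)
```

Any other ν-invariant kernel q would give a nonnegative integral, which is why the infimum in (4.5) is attained at q = K when ν = π.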
With the characterization of q from Lemma 4.2, we can write the rate function (4.6) in a more explicit way: for ν-almost all x ∈ S, the function f_x(y) in (4.10) is the Radon-Nikodym derivative of q(x, •) with respect to K(x, •).
We end this section with an alternative characterization of the rate function, that highlights the fact that measures ν ∈ P(S) for which I(ν) < ∞ need not be absolutely continuous with respect to π.
The following lemma shows that I(ν) splits into two parts, one corresponding to the absolutely continuous part ν_λ and one corresponding to the singular part ν_s in the decomposition (4.11).
Lemma 4.4. Let ν ∈ P(S) with I(ν) < ∞ and consider its decomposition as in (4.11). Let q(x, dy) be a transition kernel on S with invariant distribution ν that satisfies (4.6). Define Q_λ and Q_s as the sets of transition kernels that ν_λ and ν_s are invariant for, respectively. The following holds: (a) q ∈ Q_λ ∩ Q_s, i.e., both ν_λ and ν_s are invariant for q; (b) the rate function satisfies (4.12).

Proof. (a) By Lemma 4.2, we can write q(x, dy) = α(x, y)dy + ρ(x)δ_x(dy), where α(x, •) ≪ λ. By invariance of ν for q, for all A ∈ B(S),

ν(A) = ∫_S q(x, A) ν(dx) = ∫_S α(x, A) ν(dx) + ∫_A ρ(x) ν(dx).

If we consider A = S_s, for which λ(S_s) = 0, then α(x, S_s) = 0 for ν-almost all x ∈ S (because α(x, •) ≪ λ), and thus ν(S_s) = ∫_{S_s} ρ(x) ν(dx). On the other hand, ν(S_s) = ∫_{S_s} ν(dx). This implies that ρ(x) = 1 for ν-almost all x ∈ S_s, and therefore q(x, dy) = δ_x(dy) for ν-almost all x ∈ S_s.
With the form of q on S s established, for A ∈ B(S), we have where the last equality is due to q(x, •) = δ x (•) a.s. being the only ν s -invariant transition kernel (Lemma 6.2). This proves that ν s is invariant for q, which means that q ∈ Q s . We now show that ν λ is also invariant for q. The decomposition (4.11), combined with the invariance of ν for q and the fact that q(x, dy) = δ x (dy) ν s -a.s., gives, for A ∈ B(S), It follows that ν λ (A) = ∫ q(x, A)ν λ (dx). Since A ∈ B(S) was arbitrary, ν λ is invariant for q, i.e., q ∈ Q λ .
To prove (b), by convexity of I (see Lemma 6.10(a) in [BD19]), (4.13) On the other hand, by the decomposition (4.11), From part (a), q is an element of both Q λ and Q s . Therefore, The right-hand side of the previous display is precisely and the right-hand side of this inequality is I(ν s ). The two inequalities together with (4.14) imply Combined with the opposite inequality (4.13), this proves the desired equality (4.12).

Laplace principle upper bound
In this section we prove the Laplace principle upper bound (4.2).
Proposition 5.1. Let {L n } n≥0 be the empirical measures defined in (2.3) and {x n } n≥0 any sequence in S. Take F ∈ C b (P(S)) and define I : P(S) → [0, ∞] as in (4.1). Assume (A.1), (A.2) and (A.3). Then,

As mentioned in Section 4, under (A.1)-(A.3) the arguments from [BD19] can be used. We include the main steps here for completeness and the reader's convenience; we emphasise that once the Feller property of K(x, dy) has been established, this part of the proof proceeds precisely as in [BD19].

Lemma 5.2. Under Assumptions (A.1)-(A.2), the Metropolis-Hastings transition kernel K(x, •) satisfies the Feller property.
Proof. Recall the form (2.8) for K, with a(x, y) in (3.4) corresponding to the probability density of the acceptance part and r corresponding to the rejection part. The assumptions ensure that both a and r are continuous (see Remark 3.1) and bounded as functions of x. Consider now a function f ∈ C b (S) and a sequence {x n } n∈N ⊂ S such that x n → x ∈ S. By dominated convergence, we have An application of the Portmanteau theorem then completes the proof.
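The Feller property asserts that x ↦ ∫ f(y) K(x, dy) is continuous for every f ∈ C b (S). The sketch below checks this numerically for the same illustrative Gaussian setting as before (an assumption, not the paper's general kernel), computing K f(x) = ∫ f(y) a(x, y) dy + r(x) f(x) by trapezoidal quadrature and comparing nearby inputs.

```python
import math

def pi(x):
    """Illustrative target density (unnormalized standard normal)."""
    return math.exp(-x * x / 2)

def J(y, x, sigma=1.0):
    """Assumed Gaussian proposal density J(y|x)."""
    return math.exp(-(y - x) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def a(x, y):
    """Acceptance part a(x, y) = J(y|x) min(1, pi(y)/pi(x))."""
    return J(y, x) * min(1.0, pi(y) / pi(x))

def Kf(x, f, lo=-8.0, hi=8.0, m=2000):
    """K f(x) = ∫ f(y) a(x, y) dy + r(x) f(x), via trapezoidal quadrature."""
    h = (hi - lo) / m
    acc = 0.0
    mass = 0.0
    for i in range(m + 1):
        y = lo + i * h
        w = 0.5 if i in (0, m) else 1.0
        acc += w * f(y) * a(x, y) * h
        mass += w * a(x, y) * h
    r = 1.0 - mass            # rejection mass r(x)
    return acc + r * f(x)

vals = [Kf(0.5 + 1e-3 * k, math.tanh) for k in range(3)]
# continuity in x: nearby inputs give nearby outputs
print(abs(vals[0] - vals[2]) < 1e-2)
```

Because both a(x, •) and r(x) vary continuously in x, the map x ↦ K f(x) shows no jump even though K(x, •) has an atom at the moving point x.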
Proof of Proposition 5.1. In (4.4), take a control sequence {μ n i } such that where Ln is the controlled empirical measure associated with {μ n i }. Let By Assumption (A.3), {(L n , λ n )} is tight; see Section 10 in [BD19]. Thus, there is a subsequence, also denoted by n, such that {(L n , λ n )} converges along that subsequence to some limit ( L, λ), and it is enough to prove the upper bound (4.2) for this subsequence. In fact, taking n → ∞, we obtain the bound (4.2) in terms of inf ν∈P(S) [F (ν) + I(ν)].

Laplace principle lower bound
We now proceed to prove the Laplace principle lower bound (4.3).
Proposition 6.1. Let {L n } n≥0 be the empirical measures defined in (2.3) and define I : P(S) → [0, ∞] as in (4.1). Assume (A.1)-(A.2). Then, for x ∈ S,

As described in Section 4, in proving Theorem 4.1 the lower bound is where the lack of Condition 2.1 for the MH kernel K plays a role. To see why the lack of this transitivity condition becomes an issue, note that one of its consequences is that if ν ∈ P(S) is such that I(ν) < ∞, then ν ≪ π. This property plays an important role in the proof of the LDP for empirical measures of a Markov chain in [BD19]; it is implicitly used to define a sequence of near-optimal controls in the representation (4.4). Here, because of the rejection part of the MH kernel, which is the reason Condition 2.1 does not hold, the implication is not true in general. As a counterexample, consider an x 0 ∈ S such that r(x 0 ) > 0 and take ν = δ x0 ∈ P(S). Then ν is not absolutely continuous with respect to λ, and thus not with respect to π. Consider the transition kernel q(x, •) = δ x . Then ν is invariant for q and, from (4.5), From (4.10), the Radon-Nikodym derivative of δ x0 (•) with respect to K(x, •), for x = x 0 , is given by f x0 (y) = (1/r(x 0 )) I{y = x 0 }. It follows that the rate function is finite, since We circumvent the problem of not having Condition 2.1 by showing that if ν ∈ P(S) is such that I(ν) < ∞, then there exists another probability measure ν * ∈ P(S) that is arbitrarily close to ν and satisfies ν * ≪ π and I(ν * ) ≤ I(ν) + ε.
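The counterexample can be made quantitative: with q(x 0 , •) = δ x0 , the relative entropy R(δ x0 ∥ K(x 0 , •)) equals log f x0 (x 0 ) = −log r(x 0 ), which is finite precisely when r(x 0 ) > 0, and it bounds I(δ x0 ). The sketch below estimates r(x 0 ) by Monte Carlo and evaluates −log r(x 0 ); the Gaussian target and random-walk proposal are illustrative assumptions.

```python
import math
import random

def pi(x):
    """Illustrative target density (unnormalized standard normal)."""
    return math.exp(-x * x / 2)

def rejection_mass(x0, sigma=1.0, n=200_000):
    """Monte Carlo estimate of r(x0) for a Gaussian random-walk MH chain."""
    rej = 0
    for _ in range(n):
        y = random.gauss(x0, sigma)
        if random.random() >= min(1.0, pi(y) / pi(x0)):
            rej += 1
    return rej / n

random.seed(1)
r = rejection_mass(0.0)
# the cost of parking the chain at x0 forever: R(delta_{x0} || K(x0, .)) = -log r(x0)
rate = -math.log(r)
print(r > 0 and math.isfinite(rate))
```

Since r(x 0 ) > 0, the rate value is finite even though δ x0 is singular with respect to π, illustrating why the implication I(ν) < ∞ ⇒ ν ≪ π fails for MH chains.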
To prove the existence of such a measure, recall that the decomposition (4.11) allows us to separate ν into two parts: one part, ν λ , with a density with respect to λ (and thus with respect to π), and one, ν s , that is singular with respect to λ. The idea is to approximate the latter with measures that are absolutely continuous with respect to λ. This allows us to construct near-optimal controls in the representation formula, which are in turn used to prove Proposition 6.1.
The following is a brief outline of the argument. In Lemma 6.2, we characterize the transition kernels q that achieve the infimum in (4.5) for ν s ∈ P(S) such that ν s ⊥ λ and I(ν s ) < ∞. Next, in Lemma 6.3, we construct absolutely continuous approximations of such measures.

Lemma 6.3. Take ν s ∈ P(S) such that ν s ⊥ λ and I(ν s ) < ∞. Let {Y i } ∞ i=1 be independent and identically distributed according to ν s . For n ∈ N, define (6.3) Let V n = λ(B ̺ n (0)), the (Lebesgue) volume of a ball of radius ̺ n , and define the sequence of random measures {ν n s } n∈N ⊂ P(S) by This sequence satisfies the following properties: (a) ν n s ≪ λ for all n ∈ N; (b) ν n s ⇒ ν s a.s.; (c) there is an n 0 ∈ N such that, for all n > n 0 , I(ν n s ) < ∞ a.s.; (d) lim n→∞ I(ν n s ) ≤ I(ν s ) a.s.

Before we embark on the proof, some comments on the construction. First, because we consider ν s such that I(ν s ) < ∞, ν s can only put mass on points in S + : if ν s ({x}) > 0 for some x such that π(x) = 0, then r(x) = 0 (see Remark 3.1). By Lemma 4.2, the corresponding transition kernel is of the form q(x, •) = α(x, •), where α(x, •) ≪ a(x, •). This is not compatible with ν s being singular with respect to λ; see also Lemma 6.2. Thus, the Y i s used in the construction are in S + ν s -a.s.
Next, we verify that for any fixed n ∈ N, the radius ̺ n of the B ̺ n (Y i )-balls is well-defined, i.e., ̺ n > 0 ν s -a.s. Note that if ν s = δ x for some x ∈ S + , then ̺ n becomes Because I(ν s ) < ∞, we have for Since the support of ν s is contained in S + (an open subset of S; see Assumption (A.1)), and a(Y i , Y i ) and r(Y i ) are both strictly positive ν s -a.s., the continuity of r(•) and a(•, •) on S and S + × S, respectively, ensures that

Proof of Lemma 6.3. Part (a) follows directly from the definition (6.4) of ν n s . In particular, since λ and π are equivalent measures (Assumption (A.1)), it also holds that ν n s ≪ π. To prove (b), that the sequence {ν n s } converges weakly to ν s a.s., we show that for any bounded and Lipschitz continuous function f it holds that ∫ S f dν n s → ∫ S f dν s a.s. An application of the Portmanteau theorem then gives the claim.
To this end, let f ∈ C b (S) be Lipschitz continuous and denote its Lipschitz constant by L f < ∞. For n ∈ N, we have (6.5) The Lipschitz continuity of f implies that, for all x ∈ B ̺ n (Y i ) and all i ∈ {1, . . ., n}, By integrating over B ̺ n (Y i ) and dividing by V n , it follows that This implies the following bounds on the integral (6.5): By the strong law of large numbers, a.s., and it follows that The squeeze theorem now yields the desired result.

We now move to part (c). To show that I(ν n s ) is finite for large enough n ∈ N, we first note that, by construction, V n → 0 as n → ∞. Therefore, there is an n 0 ∈ N such that V n < 1 for all n > n 0 . Henceforth, we only consider such n.
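The smoothing construction in Lemma 6.3 can be illustrated in one dimension: replace each sample Y i from a singular (here, atomic) measure ν s by a uniform distribution on a small ball around it, and check that integrals of a Lipschitz function converge. The two-atom measure, the radius, and the sample sizes below are illustrative assumptions.

```python
import random

def build_smoothed_measure(atoms, weights, n, rho):
    """Realize nu_s^n = (1/n) sum_i Uniform(B_rho(Y_i)) for one draw of
    Y_1, ..., Y_n i.i.d. from the atomic (singular) measure nu_s."""
    ys = random.choices(atoms, weights=weights, k=n)
    def sample():
        i = random.randrange(n)                      # pick one of the n balls
        return random.uniform(ys[i] - rho, ys[i] + rho)
    return sample

def integral(f, sampler, m=20_000):
    """Monte Carlo estimate of the integral of f against nu_s^n."""
    return sum(f(sampler()) for _ in range(m)) / m

random.seed(2)
atoms, weights = [0.0, 3.0], [0.5, 0.5]              # nu_s: two atoms, singular w.r.t. Lebesgue
sampler = build_smoothed_measure(atoms, weights, n=5000, rho=1e-3)
approx = integral(abs, sampler)                       # exact value: ∫ |x| d(nu_s) = 1.5
print(abs(approx - 1.5) < 0.2)
```

Each ν n s is absolutely continuous with respect to Lebesgue measure (property (a)), while for small ̺ n and large n the integrals of Lipschitz functions match those under ν s , the mechanism behind the weak convergence in property (b).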
Recall the characterization (4.5) of the rate function, where the infimum is taken over all transition kernels q(x, dy) for which ν n s is invariant. We will now construct such a transition kernel q n (x, dy) for which it also holds that ∫ S R(q n (x, •) ∥ K(x, •)) ν n s (dx) < ∞. This in turn implies that I(ν n s ) < ∞. The collection of transition kernels {q n } will also be used to show part (d).
We begin by defining N n (x) as the number of balls B ̺ n (Y i ), i = 1, . . ., n, that x ∈ S belongs to, Next, we define q n by for x such that N n (x) ≥ 1, and otherwise q n (x, dy) = δ x (dy). Then, for all x ∈ S, and, for N n (x) = 0, it holds immediately that q n (x, S) = 1. Moreover, due to the choice n > n 0 , q n (x, A) ∈ [0, 1] for every A ∈ B(S).
To show that q n (x, •) is also invariant for ν n s , consider a set A ∈ B(S). We have From this it follows that where in the second equality we use that, due to the definition of ̺ n , there are no overlaps between the B ̺ n (Y i )-balls. If instead N n (x) = 0, then q n (x, A) = δ x (A), and we have Combined with the computation for N n (x) ≥ 1, this proves the invariance. From (4.5), I(ν n s ) is defined in terms of the infimum over the set of ν n s -invariant kernels. Therefore, For the first integral in the last display, since ν n s has no mass on {x ∈ S : N n (x) = 0}, the integral is zero. For the second integral, we have Recalling that we only consider n > n 0 , so that V n < 1, we obtain the upper bound From the definition of ̺ n (see (6.3)), it holds that ̺ n ≤ a(Y i , Y i ) and ̺ n ≤ ∆ 1 n (Y i ) for all i = 1, . . ., n. Moreover, the definition of ∆ 1 n implies that, for a fixed i = 1, . . ., n and (x, y) for some constant C d that depends on the dimension d of the space S ⊆ R d . Similarly, for a fixed i = 1, . . ., n and Using the inequalities (6.7) and (6.8) in (6.6) gives the upper bound whenever n > n 0 . Since V n → 0 by construction, we conclude that where the second-to-last equality follows from the strong law of large numbers, and the last equality is motivated by Lemma 6.2.
Proof. First, if ν ≪ λ there is nothing to prove. Therefore, suppose this does not hold, so that the decomposition (4.11) is non-trivial.
Sample {Y i } ∞ i=1 i.i.d. from ν s and define the sequence of random probability measures {ν n s } n∈N as in the construction in Lemma 6.3. Motivated by the decomposition (4.11) for ν, we define a new sequence of random probability measures {ν n } n∈N by ν n = pν λ + (1 − p)ν n s , where p ∈ [0, 1] is the same as in (4.11), again suppressing in the notation that p depends on ν. By part (a) of Lemma 6.3, ν n s ≪ λ for all n. It follows that ν n ≪ λ. Moreover, from part (b) of the same lemma, ν n converges weakly to ν ν s -a.s. Therefore, for any ω ∈ Ω outside of a ν s -null set, there is an Consider now I(ν n ). By convexity, for which the right-hand side is finite w.p. 1 whenever n ≥ n 0 . Combined with part (d) of Lemma 6.3, this yields that, ν s -a.s., Similarly to before, this implies that for any ω ∈ Ω outside of a ν s -null set, there is a As a consequence, for any ω ∈ Ω outside of a null set, we can define Then, for n ≥ N (ω), ν n (ω) ≪ λ, d LP (ν n (ω), ν) < δ/2, and I(ν n (ω)) < I(ν) + ε. Since this holds outside a ν s -null set, it has positive probability also under ν. This proves the existence of a measure ν † with the claimed properties.
We emphasise that the randomness of the sequence {ν n } is entirely due to the sequence of random variables {Y i } ∞ i=1 . Thus, the set of outcomes of {Y i } that lead to a measure ν n with the desired properties has strictly positive probability. This guarantees the existence of a measure ν † with the claimed properties.

The following result is a version of Lemma 6.17 in [BD19].

Lemma 6.5. Let ν ∈ P(S) satisfy I(ν) < ∞. Under (A.1)-(A.2), for given ε > 0 and δ > 0, there exists ν * ∈ P(S) with the following properties: (a) d LP (ν * , ν) < δ; (b) ν * ≪ π and π ≪ ν * ; (c) there exists a transition probability function q * (x, dy) on S such that ν * is an invariant measure of q * (x, dy), the associated Markov chain is ergodic, and (6.9) I(ν * ) ≤ I(ν) + ε.
Proof. To prove (a), by Lemma 6.4 there exists a measure ν † that satisfies (6.10) We observe that for any A, B ∈ B(S), and, from the definition of γ * , It follows that for all B ∈ B(S). Thus, π-a.s. for x ∈ S, K(x, •) ≪ q * (x, •). To show absolute continuity in the reverse direction, note that q * (x, •) ≪ K(x, •) ν * -a.s. As ν * and π are equivalent measures, we obtain that q * (x, •) and K(x, •) are equivalent π-a.s. This means that there exists a Borel set C ∈ B(S) such that π(C) = 0 = ν * (C), and q * (x, •) and K(x, •) are equivalent for all x in the complement of C. If we redefine q * (x, •) = K(x, •) for all x ∈ C, we obtain the equivalence between q * (x, •) and K(x, •) for all x ∈ S. Moreover, since ν * (C) = 0, the newly defined q * (x, •) still has ν * as invariant measure. To show that q * (x, •) is ergodic, recall that in Remark 3.2 we proved that there are no disjoint Borel sets A 1 , A 2 ∈ B(S) such that K(x, A 1 ) = 1 ∀x ∈ A 1 and K(y, A 2 ) = 1 ∀y ∈ A 2 .
Because q * (x, •) and K(x, •) are equivalent for all x ∈ S, it follows that q * (x, •) also satisfies the property that there do not exist disjoint A 1 , A 2 ∈ B(S) for which q * (x, A 1 ) = 1 ∀x ∈ A 1 and q * (y, A 2 ) = 1 ∀y ∈ A 2 , meaning that q * (x, •) is indecomposable. Therefore, by Theorem 7.16 in [Bre92], ν * is the unique invariant distribution for q * (x, dy) and the Markov chain associated with ν * and q * (x, dy) is ergodic.
We are now ready to prove Proposition 6.1, the Laplace principle lower bound. The following proof is largely based on the proof of Proposition 6.15 in [BD19], with minor changes due to the lack of Condition 2.1. The main work has been done in Lemmas 6.2-6.5, and most of the proof from [BD19] goes through with minor modifications, relying on those results rather than on Condition 2.1. We include the full argument for completeness and the reader's convenience.
Proof of Proposition 6.1. To prove the Laplace lower bound (6.1), it is sufficient to consider only bounded Lipschitz continuous functions F (see Corollary 1.10 in [BD19]), since we have endowed P(S) with the Lévy-Prohorov metric d LP . Recall that X = {X i } i≥0 denotes the Metropolis-Hastings chain, as described in Section 2.3, and {L n } n the associated sequence of empirical measures. We now construct a nearly optimal sequence of controls in the variational representation (4.4). Let ε > 0 be given and choose ν ∈ P(S) such that (6.13) In Lemma 6.5 it is shown that, for any such pair δ, ε, there exists a probability measure ν * ∈ P(S) and a transition probability q * (x, dy) such that ν * is an invariant measure for q * (x, dy), the Markov chain with initial distribution ν * and transition probability q * (x, dy) is ergodic, and (6.14) Moreover, part (a) of the lemma ensures d LP (ν * , ν) < δ, which then implies (6.15) Thus, ν * is such that The transition probability function q * associated with ν * is now used to define the controls, (6.16) μ n i (dy) = q * ( Xn i−1 , dy), i = 1, . . ., n. With the inequalities (6.14)-(6.15) established, and the choice (6.16) for the controls, we can proceed with the same arguments as in the proof of Proposition 6.15 in [BD19].
With the choice (6.16), the running costs for the controlled chain Xn are The μ n i s only give the conditional distributions of Xn i for i = 1, . . ., n. For the distribution of the initial point Xn 0 , we consider two choices: δ x and ν * . Let P x and P * denote the corresponding probability measures, and E x and E * the associated expectations. Define D n and D n x as the expected difference between the empirical average of the relative entropy between q * and K, and its mean, under P * and P x , respectively, From the definition of the controls (6.16), and since ν * is an invariant measure of q * (x, dy), all terms of the controlled process { Xn i } n i=0 are distributed according to ν * . By the non-negativity of relative entropy R(• ∥ •) and (6.14), we obtain From this convergence in probability, for every subsequence of {n} there is a further subsequence, which we also denote by {n}, such that the convergence holds w.p. 1. That is, there is a Borel set Φ 1 with ν * (Φ 1 ) = 1 such that, along such (sub)subsequences and for all x ∈ Φ 1 , (6.17) By the pointwise ergodic theorem [Bre92, Sect. 6.5], P * {A(g)} = 1.
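The argument above uses that the controlled chain driven by the ergodic kernel q * has ergodic averages converging to the corresponding moments of its invariant measure ν * . A minimal numerical sketch, with an assumed AR(1) control kernel standing in for q * (its invariant measure is Gaussian with known moments):

```python
import random

def q_star(x, a=0.5, s=1.0):
    """Assumed ergodic control kernel: AR(1) step x -> a*x + N(0, s^2).
    Its invariant measure is N(0, s^2 / (1 - a^2))."""
    return a * x + random.gauss(0.0, s)

random.seed(3)
n = 200_000
x, total, total_sq = 5.0, 0.0, 0.0          # start far from stationarity on purpose
for _ in range(n):
    x = q_star(x)
    total += x
    total_sq += x * x
mean, var = total / n, total_sq / n
# ergodic averages approach the invariant moments: mean 0, variance 1/(1 - 0.25) = 4/3
print(abs(mean) < 0.05 and abs(var - 4/3) < 0.1)
```

This is the mechanism the pointwise ergodic theorem supplies in the proof: along the controlled chain, empirical averages of test functions converge to their ν * -integrals regardless of the initial point.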
To establish the convergence of Ln , we define Φ 2 = ∩ g∈Ξ Φ 2 (g). Since Ξ is countable, Φ 2 satisfies ν * (Φ 2 ) = 1. Then, for all initial points Xn 0 = x with x ∈ Φ 2 , We now combine the arguments for the running costs and the controlled empirical measures to show the Laplace principle lower bound on a set of ν * -measure 1. Define the set Φ = Φ 1 ∩ Φ 2 ⊂ S. Since ν * (Φ 1 ) = ν * (Φ 2 ) = 1, we have ν * (Φ) = 1. For all x ∈ Φ, both (6.17) where the inequality on the third line comes from (6.14) and (6.15), while the inequality on the last line follows from (6.13). By taking the limit ε → 0 we obtain the bound (6.1) for all x ∈ Φ. We conclude the proof by extending this result from Φ to the whole space S. Whereas [BD19] relies on the transitivity condition (2.5) for this extension, we instead rely on the properties of the MH kernel; this requires only minor changes in the argument.
Define Ln as the empirical measure of X 1 , . . ., X n , Since L n and Ln only differ in the first and last summands, Let L F < ∞ denote the Lipschitz constant of F with respect to the Lévy-Prohorov metric. For all ω ∈ Ω, Take any x ∈ S and n ∈ N. Since the X n i s evolve according to K, using the previous inequality we have where the equality on the third line is due to the Markov property. With this lower bound established, we can from here again follow the proof of Proposition 6.15 in [BD19]. Let ε > 0 be fixed. We have that (6.1) holds for all y ∈ Φ, whence for each such y there exists an N (y, ε) ∈ N such that (6.21) [F (µ) + I(µ)] + ε for all n ≥ N (y, ε). Without loss of generality, take N (y, ε) to be the smallest integer with this property. Then the function S → N that maps y to N (y, ε) is measurable, the sets Φ (i) = {y ∈ Φ : N (y, ε) = i} ⊂ S are disjoint Borel sets, and Φ = ∪ ∞ i=1 Φ (i) . Because K(x, Φ) > 0 for all x ∈ S (see (6.19)), for all x ∈ S there exists an i 0 ∈ N such that K(x, Φ (i0) ) > 0. Combined with the bounds in (6.20) and (6.21), this implies that for all n ≥ i 0 , This concludes the proof of the Laplace principle lower bound.

Remark 3.3. For the case S = R d , Section 8.2 in [DE97] describes a class of models for which a Lyapunov function U satisfying (A.3) exists. Here we present their example adapted to the MH kernel K. For specific choices of J and/or π, this assumption can be made more explicit (or verified). Let b : R d → R d be measurable. Denote by ⟨•, •⟩ the scalar product in R d and, for α ∈ R d , define

H b (x, α) = log ( ∫ R d e ⟨α, y−x−b(x)⟩ a(x, dy) + r(x) e −⟨α, b(x)⟩ ).

Consider the following assumptions: (a) b is bounded on all compact sets in R d ; (b) there exists r > 0 such that sup x∈R d H b (x, α) < ∞ for all α ∈ R d that satisfy ∥α∥ ≤ r; (c) there exists a Lipschitz continuous function U : R d → [0, ∞) for which lim ∥x∥→∞ [U (x + b(x)) − U (x)] = −∞. If (a), (b) and (c) hold, then U is a Lyapunov function as required by Assumption (A.3). A natural choice for b is b
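Condition (c) above can be checked directly for simple choices. The sketch below takes an assumed inward drift b(x) = −x/2 (natural for a centred target) and the Lipschitz candidate U(x) = |x| in one dimension, and verifies that the increment U(x + b(x)) − U(x) decreases without bound as |x| grows.

```python
def U(x):
    """Lipschitz candidate Lyapunov function (illustrative choice)."""
    return abs(x)

def b(x):
    """Assumed drift toward the origin; here x + b(x) = x/2."""
    return -x / 2

# condition (c): U(x + b(x)) - U(x) = -|x|/2 tends to -infinity as |x| grows
drifts = [U(x + b(x)) - U(x) for x in (10.0, 100.0, 1000.0)]
print(drifts)  # [-5.0, -50.0, -500.0]
```

For this pair the increment equals −|x|/2 exactly, so the divergence required in (c) holds; conditions (a) and (b) would still need to be checked for the proposal at hand.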
(A.1) The target probability measure π is equivalent to λ on S (i.e., π ≪ λ and λ ≪ π). The probability density π(x) is a continuous function.

(A.2) The proposal distribution J(•|x) is absolutely continuous with respect to the target measure π (i.e., J(•|x) ≪ π) for all x ∈ S. The probability density J(y|x) is a continuous and bounded function of x and y, and it satisfies (3.1) J(y|x) > 0 ∀(x, y) ∈ S 2 .

(A.3) There exists a Lyapunov function U : S → [0, ∞] such that the following properties hold: