Perturbation Bounds for Monte Carlo within Metropolis via Restricted Approximations

The Monte Carlo within Metropolis (MCwM) algorithm, interpreted as a perturbed Metropolis-Hastings (MH) algorithm, provides a simple approach for approximate sampling when the target distribution is doubly-intractable or contains latent variables. We assume that the associated unperturbed Markov chain is geometrically ergodic and show explicit estimates of the difference between the n-th step distributions of the perturbed MCwM and the unperturbed MH chains. These bounds are based on novel perturbation results for Markov chains which are of interest beyond the MCwM setting. To apply the bounds, we need to control the difference between the transition probabilities of the two Markov chains, at least in the center of the state-space. Moreover, we need to verify stability of the perturbed chain, either through a Lyapunov condition, or by restricting it to the center of the state space.


Introduction
The Metropolis-Hastings (MH) algorithm is a classical method for sampling approximately from a distribution of interest relying only on point-wise evaluations of an unnormalized density. However, when even this unnormalized density depends on unknown integrals and cannot easily be evaluated, then this approach is not feasible. A possible solution is to replace the required density evaluations in the MH acceptance ratio with suitable approximations. This idea is implemented in Monte Carlo within Metropolis (MCwM) algorithms which substitute the unnormalized density evaluations by Monte Carlo estimates for the intractable integrals.
Yet in general, replacing the exact MH acceptance ratio by an approximation leads to inexact algorithms in the sense that the resulting Markov chain does not have the distribution of interest as stationary one and might not converge to a stationary distribution at all. Nonetheless, these approximate or noisy methods, see e.g. [AFEB16,JM17b,JMMD15], have recently gained increased attention due to their applicability in certain intractable sampling problems. In this work we attempt to answer the following questions about the MCwM algorithm: • Can one quantify the quality of MCwM algorithms?
• When might the MCwM algorithm fail and what can one do in such situations?
Regarding the first question, by using bounds on the difference of the nth step distributions of MH and MCwM algorithm we give a positive answer which improves upon earlier results in the literature, see below. For the second question, we suggest a modification for stabilizing the MCwM approach by restricting the Markov chain to a suitably chosen set that contains the "essential part", or the "center", of the state space. We provide examples where this restricted version of MCwM converges towards the distribution of interest while the unrestricted version does not. Note also that in practical implementations of Markov chain Monte Carlo on a computer, simulated chains are effectively restricted to compact state spaces due to memory limitations. Our results on restricted approximations can also be read in this spirit.
Our overall approach is based on perturbation theory for Markov chains. Let (X n ) n∈N 0 be a Markov chain with transition kernel P and ( X n ) n∈N 0 be a Markov chain with transition kernel P on a common Polish space (G, B(G)). We think of P and P as "close" to each other in a suitable sense and consider P as a perturbation of P . In order to quantify the difference of the distributions of X n and X n , denoted by p n and p n respectively, we work with where · tv denotes the total variation distance. The Markov chain (X n ) n∈N 0 can be interpreted as unavailable, unperturbed, ideal, while ( X n ) n∈N 0 is a perturbation that is available for simulation. We focus on the case where the ideal Markov chain is geometrically ergodic, more precisely V -uniformly ergodic, and satisfies a Lyapunov condition of the form for some function V : G → [1, ∞) and numbers δ ∈ [0, 1), L ∈ [1, ∞).
To obtain estimates of (1) we need two assumptions which can be informally explained as follows: 1. Closeness of P and P : The difference of P and P is measured by controlling either a uniformly weighted total variation distance or a uniformly weighted V -norm of P (x, ·) − P (x, ·), see ε tv,W , ε V,W in Lemma 4 and Theorem 8. In Theorem 10, uniformity refers at least to the essential region of the state space.

Stability of P :
A stability condition on P is satisfied either in the form of a Lyapunov condition as in Theorem 8 or by restriction to the center of the state space determined by V as in Theorem 10.
Explicit bounds on (1) are provided in Section 3 in Proposition 7, Theorem 8 and Theorem 10. More precisely, in Theorem 7 a Lyapunov condition on P is imposed whereas in Theorem 8 a restricted approximation P is considered. We now briefly comment on related literature to such perturbation estimates for V -uniformly ergodic Markov chains.
In contrast to the V -uniform ergodicity assumption we impose on the ideal Markov chain, the results in [AFEB16,JMMD15,Mit05] only cover perturbations of uniformly ergodic Markov chains. Nonetheless, perturbation theoretical questions for geometrically ergodic Markov chains have been studied before, see e.g. [BRR01, FHL13, MLR16, NR17, RRS98, RS18, SS00] and the references therein. A crucial aspect where those papers differ from each other is how one measures the closeness of the transitions of the unperturbed and perturbed Markov chains to have applicable estimates, see the discussion about that in [SS00,FHL13,RS18]. Our Theorem 7 and Theorem 8 refine and extend the results of [RS18,Theorem 3.2]. In particular, in Theorem 8 we take a restriction to the center of the state space into account. Let us also mention here that [PS14,RS18] contain related results under Wasserstein ergodicity assumptions. More recently, [JM17a] studies approximate chains using notions of maximal couplings, [NR17] extends the uniformly ergodic setting from [JMMD15] to using L 2 norms instead of total variation, and [JM17b] explores bounds on the approximation error of time averages.
In Section 4 we apply our perturbation bounds in the context of approximate sampling via MCwM. The goal is to (approximately) sample from a target distribution π on G, which is determined by an unnormalized density function π u : G → [0, ∞) w.r.t a reference measure µ, that is, Classically the method of choice is to construct a Markov chain (X n ) n∈N 0 based on the MH algorithm for approximate sampling of π. This algorithm crucially relies on knowing (at least) the ratio π u (y)/π u (x) for arbitrary (x, y) ∈ G 2 , e.g., because π u (x) and π u (y) can readily be computed. However, in some scenarios, only approximations of π u (x) and π u (y) are available. Replacing the true unnormalized density π u in the MH algorithm by an approximation yields a perturbed, "inexact" Markov chain ( X n ) n∈N 0 . If the approximation is based on a Monte Carlo method, the perturbed chain is called MCwM chain. Two particular settings where approximations of π u may rely on Monte Carlo estimates are doubly-intractable distributions and latent variables. Examples of the former occur in Markov or Gibbs random fields, where the function values of the unnormalized density π u (x) itself are only known up to a factor Z(x). This means that where only values of ρ(x) can easily be computed while the computational problem lies in evaluating where Y denotes an auxiliary variable space, ρ : G × Y → [0, ∞) and r x is a σ-finite measure. We study the MCwM algorithm, which in every transition uses an iid sequence of random variables (Y i ) (and Z(y) by Z N (y), respectively).
Let us emphasize that the doubly-intractable case can also be approached algorithmically from various other perspectives. For instance, instead of estimating the normalizing constant Z(x) in (2), one could estimate unbiasedly (Z(x)) −1 whenever exact simulation from the Markov or Gibbs random field is possible. In this case, π u (x) turns into a Monte Carlo estimate which can formally be analyzed with exactly the same techniques as the latent variable scenario described below. Yet another algorithmic possibility is explored in the noisy exchange algorithm of [AFEB16], where ratios of the form Z(x)/Z(y) are approximated by a single Monte Carlo estimate. Their algorithm is motivated by the exchange algorithm [MGM06] which, perhaps surprisingly, can avoid the need for evaluating the ratio Z(x)/Z(y) and targets the distribution π exactly, see [EJREH17] for a brief review of these and related methods. However, in some cases the exchange algorithm performs poorly, see [AFEB16]. Then approximate sampling methods for distributions of the form (2) might prove useful as long as the introduced bias is not too large. As a final remark in this direction, the recent work [ADYC18] considers a correction of the noisy exchange algorithm which produces a Markov chain with stationary distribution π.
The second setting we study arises from latent variables. Here, π u (x) cannot be evaluated since it takes the form where r x is a distribution on a measurable space Y of latent variables y, and ρ : G × Y → [0, ∞) is a non-negative density function. In general, no explicit computable expression of the above integral is at hand and the MCwM idea is to substitute π u (x) in the MH algorithm by a Monte Carlo estimate based on iid sequences of random variables (Y ∼ r y . The resulting MCwM algorithm has been studied before in [AR09,MLR16]. This MCwM approach should not be confused with the pseudomarginal method, see [AR09]. The pseudo-marginal method constructs a Markov chain on the extended space G × Y that targets a distribution with π as its marginal on G. In both intractability settings, the corresponding MCwM Markov chains depend on the parameter N ∈ N which denotes the number of samples used within the Monte Carlo estimates. As a consequence, any bound on (1) is N -dependent, which allows us to control the dissimilarity to the ideal MH based Markov chain. In Corollary 16 and the application of Corollary 17 to the examples considered in Section 4 we provide informative rates of convergence as N → ∞. Note that with those estimates we relax the requirement of uniform bounds on the approximation error introduced by the estimator for π u , which is essentially imposed in [MLR16,AFEB16]. In contrast to this requirement, we use (if available) the Lyapunov function as a counterweight for a second as well as inverse second moment and can therefore handle situations where uniform bounds on the approximation error are not available. If we do not have access to a Lyapunov function for the MCwM transition kernel we suggest to restrict it to a subset of the state space, i.e., use restricted approximations. This subset is determined by V and usually corresponds to a ball with some radius R that increases as the approximation quality improves, that is, R(N ) → ∞ as N → ∞.
Our analysis of the MCwM algorithm is guided by some stylized facts we observe in simple illustrations, in particular, we consider a log-normal example discussed in Section 4.1. In this example, we encounter a situation where the mean squared error of the Monte Carlo approximation grows exponentially in the tail of the target distribution. We observe empirically that (unrestricted) MCwM works well whenever the growth behavior is dominated by the decay of the (Gaussian) target density in the tail. The application of Corollary 17 to the log-normal example shows that the restricted approximation converges towards the true target density in the number of samples N at least like (log N ) −1 independent of any growth of the error. However, the convergence is better, at least like log N N , if the growth is dominated by the decay of the target density.
Note that our perturbation bounds of Corollary 16 and Corollary 17 hold generally in the doubly-intractable as well as in the latent variable case. Corollary 17 handles the situation of not uniformly bounded approximation error and Corollary 16 the situation of uniformly bounded error. In Section 4.2 we consider the latent variable case and provide an illustrative normalnormal example where the restricted approximation proves to be useful.

Preliminaries
Let G be a Polish space, where B(G) denotes its Borel σ-algebra. Assume that P is a transition kernel with stationary distribution π on G. For a signed measure q on G and a measurable function f : G → R we define For a distribution µ on G we use the notation µ(f ) := G f (x) dµ(x). For a measurable function V : G → [1, ∞) and two probability measures µ, ν on G define For the constant function V = 1 this is the total variation distance, i.e., Theorem 1. For a φ-irreducible and aperiodic transition kernel P with stationary distribution π defined on G the following statements are equivalent: • The transition kernel P is geometrically ergodic, that is, there exists a number α ∈ [0, 1) and a measurable function C : G → [1, ∞) such that for π-a.e. x ∈ G we have P n (x, ·) − π tv ≤ C(x)α n , n ∈ N. (4) • There is a π-a.e. finite measurable function V : G → [1, ∞] with finite moments with respect to π and there are constants α ∈ [0, 1) and C ∈ [1, ∞) such that In particular, the function V can be chosen such that a Lyapunov condition of the form for some δ ∈ [0, 1) and L ∈ (0, ∞), is satisfied.
Remark 2. We call a transition kernel V -uniformly ergodic if it satisfies (5) and note that this condition can be be rewritten as 3 Quantitative perturbation bounds Assume that (X n ) n∈N 0 is a Markov chain with transition kernel P and initial distribution p 0 on G. We define p n := p 0 P n , i.e., p n is the distribution of X n . The distribution p n is approximated by using another Markov chain ( X n ) n∈N 0 with transition kernel P and initial distribution p 0 . We define p n := p 0 P n , i.e., p n is the distribution of X n . The idea throughout the paper is to interpret (X n ) n∈N 0 as some ideal, unperturbed chain and ( X n ) n∈N 0 as an approximating, perturbed Markov chain.
In the spirit of the doubly-intractable distribution and latent variable case considered in Section 4 we think of the unperturbed Markov chain as "nice", where convergence properties are readily available. Unfortunately since we cannot simulate the "nice" chain we try to approximate it with a perturbed Markov chain, which is, because of the perturbation, difficult to analyze directly. With this in mind, we make the following standing assumption on the unperturbed Markov chain.
We start with an auxiliary estimate of p n − p n tv which is interesting on its own and proved in Appendix A.1.

Lemma 4. Let Assumption 3 be satisfied and for a measurable function
Then, for any r ∈ (0, 1], Remark 5. If the initial distribution of the perturbed and unperturbed Markov chain differ, then, this appears also in the estimate. Yet, for n sufficiently large the influence of the difference of the initial distributions vanishes exponentially quickly.
Remark 6. The quantities ε tv,W and ε V,W measure the difference between P and P . Note that we can interpret them as operator norms It is also easily seen that ε tv,W ≤ max{2, ε V,W } which implies that ε V,W is a stronger criterion for measuring the perturbation. In (8) an additional parameter r appears which can be used to tune the estimate. Namely, if one is not able to bound ε V,W sufficiently well but has a good estimate of ε tv,W one can optimize over r. On the other hand, if there is a satisfying estimate of ε V,W one can just set r = 1.
In the previous lemma we proved an upper bound of p n − p n tv which still contains an unknown quantity given by which measures, in a sense, stability of the perturbed chain through a weighted sum of expectations of the Lyapunov function W under p i . To control this term, we impose additional assumptions on the perturbed chain. In the following, we consider two assumptions of this type, a Lyapunov condition and a bounded support assumption.

Lyapunov condition
We start with a simple version of our main estimate which illustrates already some key aspects of the approach via the Lyapunov condition. Here the intuition is as follows: By Theorem 1 we know that the function V of Assumption 3 can be chosen such that a Lyapunov condition for P is satisfied.
Since we think of P as being close to P , it might be possible to show also a Lyapunov condition with V of P . If this is the case, the following proposition is applicable.
Now we state a more general theorem. In particular, in this estimate the dependence on the initial distribution can be weakened. In the perturbation bound of the previous estimate, the initial distribution is only forgotten if p 0 (V ) < L/(1 − δ). Yet, intuitively, for long-term stability results p 0 (V ) should not matter at all. This intuition is confirmed by the theorem.
Remark 9. We consider an illustrating example where Theorem 8 leads to a considerably sharper bound than Proposition 7. This improvement is due to the combination of two novel properties of the bound of Theorem 8: 1. In the Lyapunov condition (13) the function W can be chosen differently from V .
2. The fact that β n,r (δ, α) is bounded from above by max{δ, α r } n−1 . Thus β n,r (δ, α) converges almost exponentially fast to zero in n. This implies that for n sufficiently large the dependence of p 0 (W ) vanishes. Nevertheless, the leading factor n can capture situations in which the perturbation error is increasing in n for small n.
Illustrating example. Let G = {0, 1} and assume p 0 = p 0 = (0, 1). Here state "1" can be interpreted as the "tail" while state "0" is the "center". Define P = 1 0 1 0 and P = 1 0 Thus, the unperturbed Markov chain (X n ) n∈N 0 moves from "1" to "0" right away, while the perturbed one ( X n ) n∈N 0 takes longer. Both transition matrices have the same stationary distribution π = (1, 0). Obviously, p 0 − p 0 tv = 0 and for n ∈ N holds The unperturbed Markov chain is uniformly ergodic, such that we can choose V = 1 and (5) is satisfied with C = 1 and α = 0. In particular, in this setting ε tv and ε V from Proposition 7 coincide, we have ε tv = 1. Thus, the estimate of Proposition 7 gives p n − p n tv ≤ ε tv = 1.
This bound is optimal in the sense that it is best possible for n = 1. But for increasing n it is getting worse. Notice also that a different choice of V cannot really remedy this situation: The chains differ most strongly at n = 1 and the bound of Proposition 7 is constant over time. Now choose the function W (x) = 1 + v · 1 {x=1} for some v ≥ 0. The transition matrix P satisfies the Lyapunov condition i.e., δ = L = 1 2 . Moreover, we have p 0 (W ) = 1 + v and ε V,W = ε tv,W = 1/(1 + v). Thus, in the bound from Theorem 8 we can set r = 1 and γ = 1 such that Since v can be chosen arbitrarily large, it follows that p n − p n tv ≤ 1 2 n−1 , which is best possible for all n ∈ N.
The previous example can be seen as a toy model of a situation where the transition probabilities of a perturbed and unperturbed Markov chain are very similar in the "center" of the state space but differ considerably in the "tail". When the chains start both in the same point in the "tail", considerable differences between distributions can build up along the initial transient and then vanish again. Earlier perturbation bounds as for example in [Mit05,PS14,RS18] take only an initial error and a remaining error into account. Thus, those are worse for situations where this transient error captured by β n,r dominates. A very similar term also appears in the very recent error bounds due to [JM17b]. Here we also see that a function W different from V is advantageous.

Restricted approximation
In the previous section, we have seen that a Lyapunov condition of the perturbation helps to control the long-term stability of approximating a V -uniformly ergodic Markov chain. In this section we assume that the perturbed chain is restricted to a "large" subset of the state space. In this setting a sufficiently good approximation of the unperturbed Markov chain on this subset leads to a perturbation estimate.
For the unperturbed Markov chain we assume that transition kernel P is V -uniformly ergodic. Then, for R ≥ 1 define the "large subset" of the state space as If V is chosen as a monotonic transformation of a norm on G, B R is simply a ball around 0. The restriction of P to the set B R , given as P R , is defined as In other words, whenever P would make a transition from x ∈ B R to G \ B R , P R remains in x. Otherwise, P R is the same as P . We obtain the following perturbation bound for approximations whose stability is guaranteed through a restriction to the set B R .
Theorem 10. Under the V -uniform ergodicity of Assumption 3 let δ ∈ [0, 1) and L ∈ [1, ∞) be chosen in such a way that For the perturbed transition kernel P assume that it is restricted to B R , i.e., Then, with p 0 = p 0 and The proof of the result is stated in Appendix A.1. Notice that while the perturbed chain is restricted to the set B R , we do not place a similar restriction on the unperturbed chain. The estimate (15) compares the restricted, perturbed chain to the unrestricted, unperturbed one.
Remark 11. In the special case where P (x, , with x 0 ∈ B R satisfies this condition. The resulting perturbed Markov chain is simply a restriction of the unperturbed Markov chain to B R and Theorem 10 provides a quantitative bound on the difference of the distributions.

Monte Carlo within Metropolis
In Bayesian statistics it is of interest to sample with respect to a distribution π on (G, B(G)). We assume that π admits a possibly unnormalized density π u : G → [0, ∞) with respect to a reference measure µ, for example the counting, Lebesgue or some Gaussian measure. The Metropolis-Hastings (MH) algorithm is often the method of choice to draw approximate samples according to π: Algorithm 1. For a proposal transition kernel Q a transition from x to y of the MH algorithm works as follows.
The transition kernel of the MH algorithm with proposal Q, stationary distribution π and acceptance probability a(x, z) is given by For the MH algorithm in the computation of r(x, z) one uses π u (z)/π u (x), which might be known from having access to function evaluations of the unnormalized density π u . However, when it is expensive or even impossible to compute function values of π u , then it may not be feasible to sample from π using the MH algorithm. Here are two typical examples of such scenarios: • Doubly-intractable distribution: For models such as Markov or Gibbs random fields, the unnormalized density π u (x) itself is typically only known up to a factor Z(x), that is, where functions values of ρ can be computed, but function values of Z cannot. For instance, Z might be given in the form where Y denotes an auxiliary variable space, ρ : G × Y → [0, ∞) and r x is a σ-finite measure.
• Latent variables: Here π u (x) cannot be evaluated, since it takes the form with a distribution r x on a measurable space Y of latent variables y and a non-negative function ρ : In the next sections, we study in both of these settings the perturbation error of an approximating MH algorithm. A fair assumption in both scenarios is that the infeasible, unperturbed MH algorithm is V -uniformly ergodic: Then we have and, for any β ∈ (0, 1), The proposition provides a tool for controlling the distance between the transition kernels of two MH type algorithms with identical proposal and different acceptance probabilities. The specific functional form for the dependence of the upper bound in (20) on x and y is motivated by the applications below. The set B indicates the "essential" part of G where the difference of the acceptance probabilities matter. The parameter β is used to shift weight between the two components ξ and η of the approximation error. For the proof of the proposition, we refer to Appendix A.2.

Doubly-intractable distributions
In the case where π u takes the form (18), we can approximate Z(x) by a Monte Carlo estimate under the assumption that we have access to an iid sequence of random variables (Y is distributed according to r x . Then, the idea is to substitute the unknown quantity Z(x) by the approximation Z N (x) within the acceptance ratio. Define a function W N : Z(x) , then the acceptance probability given W N (x), W N (z) modifies to where the random variables W N (x), W N (z) are assumed to be independent from each other. This leads to a Monte Carlo within Metropolis (MCwM) algorithm: Algorithm 2. For a given proposal transition kernel Q, a transition from x to y of the MCwM algorithm works as follows.
2. Based on independent samples compute W N (x), W N (z) and then calculate a(x, z, W N ).
3. If u < a(x, z, W N ), then accept the proposal, and return y := z, otherwise reject the proposal and return y := x.
Given the current state x ∈ G and a proposed state z ∈ G the overall acceptance probability is which leads to the corresponding transition kernel of the form M a N , see (17).
The quality of the MCwM algorithm depends on the error of the approximation of Z(x). The root mean squared error of this approximation can be quantified by the use of W N , that is, where s(x) := (E |W 1 (x) − 1| 2 ) 1/2 is determined by the second moment of W 1 (x). In addition, due to the appearance of the estimator W N (z) in the denominator of a, we need some control of its distribution near zero. To this end, we define, for z ∈ G and p > 0, the inverse moment function i p,N (z) := EW N (z) −p 1 p .
With this notation we obtain the following estimate, which is proved in Appendix A.2.
Lemma 14. Assume that there exists k ∈ N such that i 2,k (x) and s(x) are finite for all x ∈ G. Then, for all x, z ∈ G and N ≥ k we have Remark 15. One can replace the boundedness of the second inverse moment i 2,k (x) for any x ∈ G by boundedness of a lower moment i p,m (x) for p ∈ (0, 2) with suitably adjusted m ∈ N, see Lemma 23 in the appendix.

Inheritance of the Lyapunov condition
If the second and inverse second moment are uniformly bounded, s ∞ < ∞ as well as i 2,N ∞ < ∞, one can show that the Lyapunov condition of the MH transition kernel is inherited by the the MCwM algorithm. In the following corollary, we prove this inheritance and state the resulting error bound for MCwM.
Corollary 16. For a distribution m 0 on G let m n := m 0 M n a and m n,N := m 0 M n a N be the respective distributions of the MH and MCwM algorithms after n steps. Let Assumption 12 be satisfied and for some k ∈ N let Further, define δ N := δ + D/ √ N and β n := n max{δ N , α} n−1 . Then, for any
Observe that the estimate is bounded in n ∈ N so that the difference of the distributions converges uniformly in n to zero for N → ∞. The constant δ N decreases for increasing N , so that larger values of N improve the bound.
Log-normal example I. Let G = R and the target measure π be the standard normal distribution. We choose a Gaussian proposal kernel Q(x, ·) = N (x, γ 2 ) for some γ 2 > 0, where N (µ, σ 2 ) denotes the normal distribution with mean µ and variance σ 2 . It is well known, see [JH00, Theorem 4.1, Theorem 4.3 and Theorem 4.6], that the MH transition kernel satisfies Assumption 12 for some numbers α, C, δ and L with V (x) = exp(x 2 /4).
Let g(y; µ, σ 2 ) be the density of the log-normal distribution with parameters µ and σ, i.e., g is the density of exp(µ + σS) for a random variable S ∼ N (0, 1). Then, by the fact that ∞ 0 y g(y; −σ(x) 2 /2, σ(x) 2 )dy = 1 for all functions σ : G → (0, ∞), we can write the (unnormalized) standard normal density as Hence π u takes the form (18) with Y = [0, ∞), ρ(x) = exp(−x 2 /2), ρ(x, y) = y and r x being a log-normal distribution with parameters −σ(x) 2 /2 and σ(x) 2 . Independent draws from this log-normal distribution are used in the MCwM algorithm to approximate the integral. We have E[W 1 (x) p ] = exp(p(p − 1)σ(x) 2 /2) for all x, p ∈ R and, accordingly, By Lemma 23 we conclude that Hence, s ∞ as well as i 2,k ∞ are bounded if for some constant c > 0 we have σ(x) 2 ≤ c for all x ∈ G. In that case Corollary 16 is applicable and provides estimates for the difference between the distributions of the MH and MCwM algorithms after n-steps. However, one might ask what happens if the function σ(x) 2 is not uniformly bounded, taking, for example, the form σ(x) 2 = |x| q for some q > 0. In Figure 1 we illustrate the difference of the distribution of the target measure to a kernel density estimator based on a MCwM algorithm sample for σ(x) 2 = |x| 1.8 . Even though s(x) and i p,1 (x) grows super-exponentially in |x|, the MCwM still works reasonably well in this case. However, in Figure 2 we consider the case where σ(x) 2 = |x| 2.2 and the behavior changes dramatically. Here the MCwM algorithm does not seem to work at all. This motivates a modification of the MCwM algorithm in terms of restricting the state space to the "essential part" determined by the Lyapunov condition.

Restricted MCwM approximation
With the notation and definition from the previous section we consider the case where the functions i 2,k (x) and s(x) are not uniformly bounded. Under Assumption 12 there are two simultaneously used tools which help to control the difference of a transition of MH and MCwM: 1. The Lyapunov condition leads to a weight function and eventually to a weighted norm, see Proposition 13.
2. By restricting the MCwM to the "essential part" of the state space we prevent that the approximating Markov chain deteriorates. Namely, for some R ≥ 1 we restrict the MCwM to B R , see Section 3.2.
For x, z ∈ G the acceptance probability given W N (x), W N (z) is now modified from a(x, z, W N ) to 1 B R (z) · a(x, z, W N ) which leads to the restricted MCwM algorithm: Algorithm 3. For given R ≥ 1 and a proposal transition kernel Q a transition from x to y of the restricted MCwM algorithm works as follows.
2. Based on independent samples compute W N (x), W N (z) and then calculate a(x, z, W N ).
3. If u < 1 B R (z)· a(x, z, W N ), then accept the proposal, and return y := z, otherwise reject the proposal and return y := x.
Given the current state x ∈ G and a proposed state z ∈ G the overall acceptance probability is which leads to the corresponding transition kernel of the form M a (R) N , see (17). By using Theorem 10 and Proposition 13 we obtain the following estimate.
Corollary 17. Let Assumption 12 be satisfied, i.e., M a is V -uniformly ergodic and the function V as well as the constants α, C, δ and L are determined. For β ∈ (0, 1) and R ≥ 1 let Let m 0 be a distribution on B R and κ := max{m 0 (V ), L/(1 − δ)}. Then, for and R ≥ exp(1) we have where m Proof. We apply Theorem 10 with P (x, ·) = M a (x, ·) and for some x 0 ∈ B R . Note that P (x, B R ) = 1 for any x ∈ G. Further P and M a (R) N coincide on B R , thus we also have P n = M n a (R) N on B R for n ∈ N.
Observe also that the restriction of P to B R , denoted by P R , satisfies Moreover, we have by Lemma 14 that With Proposition 13 and Then, by N ≥ 4(RD R /(1 − δ)) 2 we obtain R · ∆(R) ≤ 1 − δ 2 such that all conditions of Theorem 10 are verified and the stated estimate follows.
Remark 18. The estimate depends crucially on the sample size N as well as on the parameter R. If the influence of R in D R is explicitly known, then one can choose R depending on N in such away that the conditions of the corollary are satisfied and one eventually obtains an upper bound on the total variation distance of the difference between the distributions depending only on N and not on R anymore.
Log-normal example II. We continue with the log-normal example. In this setting we have Figure 1 and Figure 2 let us consider the cases σ(x) 2 = |x| 1.8 and σ(x) 2 = |x| 2.2 . In Figure 3 we compare the normal target density with a kernel density estimator based on the restricted MCwM on B R = [−10, 10] and observe essentially the same reasonable behavior as in Figure 1. In Figure 4 we consider the same scenario and observe that the restriction indeed stabilizes. In contrast to Figure 2, convergence to the true target distribution is visible but, in line with the theory, slower than for σ(x) 2 = |x| 1.8 .  Now we apply Corollary 17 in both cases: 1. Case σ(x) 2 = |x| 1.8 . For k = 100 and β = 1/2 one can easily see that i 2,100 · 1 B R ∞,V 1/2 and s · 1 B R ∞,V 1/2 is bounded by 6000, independent of R. Hence there is a constant D ≥ 1 so that D R ≤ D. With this knowledge we choose R = (1−δ) √ 2D √ N such that for N ≥ max 100, 2 exp(2)D 2
for any initial distribution m 0 on B R and all q ∈ N. Here the last inequality follows by the fact that exp(x) ≥ x q+1 (q+1)! for any x ≥ 0 and q ∈ N. To summarize, by suitably choosing N and R (possibly depending on N ) sufficiently large the difference between the distributions of the restricted MCwM and the MH algorithms after n-steps can be made arbitrarily small.

Latent variables
In this section we consider π u of the form (19). Here, as for doubly intractable distributions, the idea is to substitute π u (x) in the acceptance probability of the MH algorithm by a Monte Carlo estimate where we assume that we have access to an iid sequence of random vari- has distribution r x . Define a function W N : G → R by W N (x) := ρ N (x)/π u (x) and note that E[W N (x)] = 1. Then, the acceptance probability given W N (x), W N (z) modifies to where W N (x), W N (z) are assumed to be independent random variables. Note that all the objects which depend on a, such as a N , M a N , a , are in this section defined just as in Section 4.1. The only difference is that the ratio within the minimum in a is reversed compared to (21). Thus, this leads to a MCwM algorithm as stated in Algorithm 2, where the transition kernel is given by M a N .
Also as in Section 4.1 we define s(x) := E |W 1 (x) − 1| 2 1/2 and i p,N (x) := (EW N (x) −p ) 1/p for all x ∈ G and p > 0. With those quantities we obtain the following estimate of the difference of the acceptance probabilities of M a and M a N proved in Appendix A.2.
Lemma 19. Assume that there exists k ∈ N such that i 2,k (x) and s(x) are finite for all x ∈ G. Then, for all x, z ∈ G and N ≥ k we have If s ∞ and i 2,k ∞ are finite for some k ∈ N, then the same statement as formulated in Corollary 16 holds. The proof works exactly as stated there. Examples which satisfy this condition are for instance presented in [MLR18]. However, there are cases where the functions s and i 2,k are unbounded. In this setting, as in Section 4.1.2, we consider the restricted MCwM algorithm with transition kernel M a (R) N . Here again the same statement and proof as formulated in Corollary 17 hold. We next provide an application of this corollary in the latent variable setting.
Normal-normal model. Let G = R and the function ϕ µ,σ 2 be the density of N (µ, σ 2 ). For some z ∈ R and (precision) parameters γ Z , γ Y > 0 define . By the convolution of two normals the target distribution π satisfies Note that, for real-valued random variables Y, Z the probability measure π is the posterior distribution given an observation Z = z within the model with the improper Lebesgue prior imposed on x.
Pretending that we do not know π u (x) we compute By using a random variable ξ ∼ N (0, 1) we have for Here ∝ means equal up to a constant independent of x. As a consequence, s ∞ = ∞ and therefore Corollary 16 (which is also true in the latent variable setting) cannot be applied. Nevertheless, we can obtain bounds for the restricted MCwM in this example using the statement of Corollary 17 by controlling s and i 2,k using a Lyapunov function V . The following result, proved in Appendix A.2, verifies the necessary moment conditions under some additional restrictions on the model parameters.
Proposition 20. Assume that γ Y > √ 2γ Z , the unnormalized density π u is given as in (25) and let the proposal transition kernel Q be a Gaussian random walk, that is, Q(x, ·) = N (x, σ 2 ) for some σ > 0. Then, there is a Lyapunov function V : G → [1, ∞) for M a , such that M a is V -uniformly ergodic, i.e., Assumption 12 is satisfied, and there are β ∈ (0, 1) as well as k ∈ N such that The previous proposition implies that there is a constant D < ∞, such that D R from Corollary 17 is bounded by D independent of R. Hence there are numbers C 1 , C 2 > 0 such that with R = C 1 √ N and for N sufficiently large we have a measurable, π-a.e. finite function, then, define the ergodicity coefficient τ V (P ) as The next lemma provides a relation between the ergodicity coefficient and V -uniform ergodicity.
A proof of this fact is implicitly contained in [MZZ13] and can also be found in [RS18, Lemma 3.2]. Both references crucially use an observation of Hairer and Mattingly [HM11].
To summarize, if the transition kernel P is geometrically ergodic, then, by Theorem 1 there exist a function V : G → [1, ∞), α ∈ [0, 1) and C ∈ (0, ∞) such that, by Lemma 21, τ V (P n ) ≤ Cα n . The next proposition states two further useful properties (submultiplicativity and contractivity) of the ergodicity coefficient. For a proof of the corresponding inequalities see for example [MZZ13, Proposition 2.1].
Proposition 22. Assume P, Q are transition kernels and µ, ν are probability measures on G. Then Now we prove Lemma 4.
Proof of Lemma 4. As in the proof of [Mit05, Theorem 3.1] we use which can be shown by induction over n ∈ N 0 . Then With Proposition 22 and Lemma 21 we estimate the first term of the previous inequality by For the terms which appear in the sum of (27) we can use two types of estimates. Note that τ 1 (P ) ≤ 1 (here the subscript indicates that V = 1) which leads by Proposition 22 to On the other hand Thus, for any r ∈ (0, 1] we obtain which gives by (27) the final estimate.
Next we prove Theorem 10.
Proof Theorem 10. Locally for x ∈ B R we have P R V (x) ≤ P V (x) ≤ δV (x)+ L, and, eventually, We write B c R for G \ B R and obtain for x ∈ B c R that Denote δ := δ + R · ∆(R) ≤ 1/2 + δ/2 < 1. For i ≥ 2 we obtain by (28), (29) and (1 − δ i ) ≤ 2(1 − δ i−1 ) that Furthermore, p 0 (V ) ≤ κ and p 1 (V ) ≤ 2κ. Now it is easily seen that The second term in the maximum is bounded by 2/R. For x ∈ B R we have so that the first term in the maximum satisfies .
We obtain ε V,V ≤ 2(L + 1) by the use of the fact that sup x∈G max δ + L, 1 ≤ L + 1.
Then, by Lemma 4 for r ∈ (0, 1], . By minimizing over r we obtain for R ≥ exp(1) that where we used that sup x∈G ≤ T β by Jensen's inequality. Moreover, for any x ∈ B we obtain which implies the assertion in that case. In the case where z = y, we have similarly for any x ∈ B that and M b (x, ·) − M c (x, ·) V ≤ B V (y)b(x, y)ξ(y)(η(x) + η(y))Q(x, dy) b(x, y)ξ(y)(η(x) + η(y))Q(x, dy) which finishes the proof.
Before we come to further proofs of Section 4 we provide some properties of inverse moments of averages of non-negative real-valued iid random variables (S i ) i∈N . In this setting, the pth inverse moment, for p > 0, is defined by Lemma 23. Assume that j p,r < ∞ for some r ∈ N and p > 0. Then i) j p,s ≤ j p,r for s ∈ N with s ≥ r; ii) j q,r ≤ j p,r for 0 < q < p; iii) j k·p,k·r ≤ j p,r for any k ∈ N.
Proof. Properties i) and ii) follow as in [MLR16, Lemma 3.5]. For proving iii) we have to show that To this end, observe first that we can write where the "batch-means" V 1 , . . . , V k are non-negative, real-valued iid random variables which have the same distribution as 1 which is a moment of the harmonic mean of Z 1 , . . . , Z k . Using the inequality between geometric and harmonic means as well as the independence we find that The previous lemma shows that when inverse moments of some positive order are finite, then so are inverse moments of all higher and lower orders if the sample size is adjusted accordingly.
Proof of Lemma 14. It is easily seen that a(x, z)E min 1, for any x, z ∈ G. By virtue of Jensen's inequality and E[W N (z)] = 1 we have E[W N (z) −1 ] ≥ 1 as well as where we also used the independence of W N (x) and W N (z) in the last inequality. (The previous arguments are similar to those in [MLR16, Lemma 3.3 and the proof of Lemma 3.2].) Note that i 2,N (x) ≤ i 2,k (x) for N ≥ k by Lemma 23. Hence, one can conclude that a(x, z) ≥ a N (x, z) Proof of Lemma 19. As in the previous proof or from [MLR16,Lemma 3.3 and the proof of Lemma 3.2] an immediate consequence is a(x, z)E min 1, Note that i 2,N ≤ i 2,k for N ≥ k, see Lemma 23. The rest of the lemma follows as in the previous proof, only the ratio W N (x)/W N (z) is reversed.
Proof of Proposition 20. For random-walk-based Metropolis chains (in particular for Q as assumed in the statement) by [JH00, Theorem 4.1 and the first sentence after the proof of the theorem, as well as, Theorem 4.3, Theorem 4.6] we have that M a is V t -uniformly ergodic with for any t ∈ (0, 1). Hence, Assumption 12 is satisfied and we need to find t ∈ (0, 1) as well as β ∈ (0, 1) such that i 2,k ∞,V 1−β t < ∞ and s ∞,V β t < ∞ for some k ∈ N. For showing s ∞,V β t < ∞ we use (26) to see that for some C < ∞. Hence and choosing β ∈ (0, 1) such that leads to s ∞,V β t < ∞. In order to show i 2,k ∞,V 1−β t < ∞, we first use Lemma 23 iii) and obtain for any x ∈ G and any k ∈ N Then, for k > 2γ Z /γ Y by (26) we have Therefore, there is a constant C < ∞ such that We have i 2,k ∞, . The latter condition holds whenever provided that t(1 − β) > γ Z /γ Y . This implies, by (30), that t should be chosen such that Choosing t such that it satisfies (31) is feasible whenever the right-hand side of (31) is smaller than 1. This is the case if γ Y > √ 2γ Z .