1 Introduction

The Metropolis-Hastings (MH) algorithm is widely used to approximately compute expectations with respect to complicated high-dimensional posterior distributions (e.g. Gilks et al. (1996); Geyer (2011)). The algorithm requires that it be possible to evaluate point-wise the density of the distribution of interest (throughout this article, all densities are with respect to Lebesgue measure) up to an arbitrary constant of proportionality.

In many problems the target posterior density is computationally expensive to evaluate. When a computationally-cheap approximation, or surrogate, is available, the delayed-acceptance Metropolis-Hastings (DAMH) algorithm, also known as the two-stage algorithm and a special case of the surrogate-transition method (Liu (2001); Christen and Fox (2005); Higdon et al. (2011)), leverages the surrogate to produce a new Markov chain that still targets the original distribution of interest. A first ‘screening’ stage substitutes the surrogate density for the true density in the standard formula for the MH acceptance probability; proposals that fail at this stage are discarded. Only proposals that pass the first stage are considered in the second ‘correction’ stage, where it is necessary to evaluate the true posterior density at the proposed value.

Delayed acceptance (DA) algorithms have been applied in a variety of settings with the approximate density obtained in a variety of different ways, for example: a coarsening of a numerical grid in Bayesian inverse problems Christen and Fox (2005); Moulton et al. (2008); Cui et al. (2011), subsampling from big-data Payne and Mallick (2014); Banterle et al. (2019); Quiroz et al. (2018), a tractable approximation to a stochastic process Smith (2011); Golightly et al. (2015), or a direct, nearest-neighbour approximation to the truth using previous values Sherlock et al. (2017).

For a Markov kernel, P, with a stationary distribution of \(\pi\), and an associated chain \(\{X_t\}_{t=1}^\infty\), the asymptotic variance of any functional, h, is defined to be

$$\begin{aligned} {\textsf {var}}(h,P):=\lim _{n\rightarrow \infty } n\text {Var}\left[ \frac{1}{n}\sum _{i=1}^n h(X_i)\right] , \end{aligned}$$

where \(X_1\sim \pi\). A lower asymptotic variance is thus associated, in practice, with a greater accuracy in estimating \(\mathbb {E}_{{\pi }}\left[ {h(X)}\right]\) using a realisation of length \(n\gg 1\) from the distribution of the chain. In terms of the asymptotic variance of any functional of the chain, the DAMH kernel cannot be more efficient than the parent MH kernel; however, the computational cost per iteration is typically reduced considerably. The almost-negligible computational cost of the screening stage also, typically, facilitates proposals that have a larger chance of being rejected than the MH proposal, but where the pay-off on acceptance is so much larger that the expected overall movement per unit of time increases. When efficiency is measured in terms of effective samples per second, gains of over an order of magnitude have been reported (e.g. Golightly et al. (2015)).
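In practice \({\textsf {var}}(h,P)\) is estimated from the chain itself; purely for orientation, the following minimal batch-means sketch (ours, not from the article; the function and argument names are ours) produces such an estimate together with the implied effective sample size for a stored chain.

```python
import numpy as np

def batch_means(h_values, n_batches=50):
    """Crude batch-means estimate of var(h, P) and the implied effective sample size.

    h_values: 1-D array containing h(X_1), ..., h(X_n) from a single long chain.
    """
    h = np.asarray(h_values, dtype=float)
    batch_len = h.size // n_batches
    trimmed = h[: n_batches * batch_len]
    means = trimmed.reshape(n_batches, batch_len).mean(axis=1)
    # Each batch mean has variance roughly var(h, P) / batch_len, so:
    asym_var = batch_len * means.var(ddof=1)
    ess = trimmed.size * trimmed.var(ddof=1) / asym_var  # n * Var_pi[h] / var(h, P)
    return asym_var, ess
```

Such estimates are only meaningful when a usual CLT holds, which is one reason the variance bounding property discussed next matters in practice.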

A Markov kernel P with a stationary distribution of \(\pi\) is termed variance bounding if \({\textsf {var}}(h,P)<\infty\) for all \(h\in L^2(\pi )\), the Hilbert space of functions that are square-integrable with respect to \(\pi\). Equivalently there exists \(K<\infty\) such that \({\textsf {var}}(h,P)\le K \text {Var}_{\pi }[h]\) for all such h. This property was named and studied in Roberts and Rosenthal (2008), where it was shown to be equivalent to the existence of a ‘usual’ central limit theorem (CLT); that is, a CLT where the limiting variance is the asymptotic variance.

Intuitively, the variance-bounding property embodies desirable behaviour for a chain started at equilibrium. In practice, the chain is not started at equilibrium, but asymptotically the bias that results from this is negligible compared with the variance. An alternative natural requirement is that the chain converge to equilibrium geometrically quickly (rather than, say, polynomially quickly). A Markov chain kernel, P, with stationary distribution \(\pi\) is geometrically ergodic (e.g. Roberts and Rosenthal (1997, 2004); Meyn and Tweedie (1993) Chapter 15) if there exist \(\rho <1\) and \(M:\mathcal {X}\rightarrow [0,\infty )\) that is finite \(\pi\)-almost everywhere, such that

$$\begin{aligned} |P^n(x,\mathcal {A})-\pi (\mathcal {A})|\le M(x)\rho ^n \end{aligned}$$

for all \(\mathcal {A}\in \mathcal {F}, ~x\in \mathcal {X}\) and \(n\in \mathbb {N}\), where \(P^n\) denotes the n-step transition kernel.

Although the motivations behind the definitions of variance bounding and geometric ergodicity, mixing at equilibrium and convergence to equilibrium, are quite different, for a large class of algorithms, including those studied in this article, these two properties are very closely linked as we will describe in Sect. 2.2. Indeed, for delayed-acceptance algorithms, under weak conditions the two properties are equivalent (see Proposition 1).

Theoretical properties of the efficiency of delayed-acceptance algorithms have been studied in Banterle et al. (2019), Sherlock et al. (2021) and Franks and Vihola (2020). The first contribution from Banterle et al. (2019) is an example delayed-acceptance algorithm which fails to inherit geometric ergodicity from its parent Metropolis-Hastings algorithm (see Example 1 in Sect. 2.3 of this article); a simple sufficient condition for inheritance of geometric ergodicity, uniformly good behaviour of the ratios \(\mathsf {r}_1\) and \(\mathsf {r}_2\) that we define in Eq. 3, is also supplied. Finally, an idealised setting where the cheap approximation is perfectly accurate is explored to obtain tuning guidelines for \(\lambda\) in the delayed-acceptance random walk Metropolis algorithm. Sherlock et al. (2021) examines this tuning issue further, proving a limiting diffusion for the first component of the delayed-acceptance Markov chain, and providing robust tuning guidelines that account for the error in the cheap approximation; the article then extends these guidelines to the pseudo-marginal version of the algorithm. Finally, Franks and Vihola (2020) compares the asymptotic variance of a general pseudo-marginal delayed-acceptance algorithm with the variance of an algorithm that applies importance-sampling to the output of an MCMC algorithm targeting the cheap approximation directly.

Using our Proposition 1, the lack of inheritance of geometric ergodicity in the example in Banterle et al. (2019) is equivalent to a lack of inheritance of the variance bounding property: even though the asymptotic variance using the parent MH kernel is finite for all \(h \in L^2(\pi )\), there exist \(h\in L^2(\pi )\) for which the asymptotic variance using the DA kernel is infinite. For such h, estimated quantities such as the effective sample size (e.g. Hoff (2009)) are invalid, and the consequent standard CLT-based intuitions about the sizes of typical errors in estimates of \(\mathbb {E}_{\pi }[h]\) from the chain do not hold.

We investigate the conditions under which a DAMH kernel inherits variance bounding from its MH parent and, as a by-product, discover conditions under which two different proposals produce MH kernels that are equivalent in terms of whether or not they are variance bounding. Section 2 provides the background and two motivating examples, while Sect. 3 provides some key definitions, a general inheritance result applicable to all propose-accept-reject kernels, and sufficient conditions for variance-bounding equivalence between two Metropolis-Hastings proposals. Section 4 contains our results for standard DA algorithms with further illustrative examples, and includes parent MH algorithms where the proposal depends upon the form of the density, so that the proposal for a computationally cheap DA kernel would naturally depend on the surrogate. Numerical experiments are performed in Sect. 5 and the article concludes with a discussion. All proofs are deferred to Appendix 1.

2 Background, Notation and Motivation

Throughout this article all Markov chains are assumed to be on a statespace \((\mathcal {X},\mathcal {F})\), with \(\mathcal {X}\subseteq \mathbb {R}^d\) Lebesgue measurable, and \(\mathcal {F}\) the \(\sigma\)-algebra of all Lebesgue-measurable sets in \(\mathcal {X}\). The target and surrogate distributions are denoted by \(\pi\) and \(\hat{\pi }\), respectively, and they are assumed to have densities of \(\pi (x)\) and \(\hat{\pi }(x)\) with respect to Lebesgue measure.

2.1 Metropolis-Hastings and Delayed-Acceptance Kernels

The Metropolis-Hastings kernel has a proposal density q(x, y) and an acceptance probability \(\alpha (x,y)=1\wedge \mathsf {r}(x,y)\) where

$$\begin{aligned} \mathsf {r}(x,y) := \frac{\pi (y)q(y,x)}{\pi (x)q(x,y)}. \end{aligned}$$
(1)

With \(\overline{\alpha }(x):=\int \alpha (x,y) q(x,y) \text {d}y\), the Metropolis-Hastings (MH) kernel is then

$$\begin{aligned} \mathsf {P}(x,\text {d}y):=q(x,y)\text {d}y~ \alpha (x,y)+[1-\overline{\alpha }(x)]\delta _{x}(\text {d}y). \end{aligned}$$
(2)

An iteration of the corresponding MH algorithm proceeds from a current value, x, to the next value, y, as follows. A value \(x'\) is sampled from the distribution with a density of \(q(x,x')\). With a probability of \(\alpha (x,x')\), \(y\leftarrow x'\), else \(y\leftarrow x\).
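To fix ideas, a minimal sketch of one such iteration is given below (ours, purely illustrative; the function names `log_pi`, `propose` and `log_q` are hypothetical stand-ins for the log unnormalised target, a sampler from q(x, .) and the log proposal density).

```python
import numpy as np

def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings iteration; compare Eqs. 1 and 2."""
    y = propose(x, rng)
    # log r(x, y) = log pi(y) + log q(y, x) - log pi(x) - log q(x, y)
    log_r = (log_pi(y) + log_q(y, x)) - (log_pi(x) + log_q(x, y))
    if np.log(rng.uniform()) < min(0.0, log_r):  # accept with probability 1 ^ r(x, y)
        return y
    return x
```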

Now, suppose that we have an approximation, \(\hat{\pi }(x)\), to \(\pi (x)\). The standard delayed-acceptance kernel uses the same proposal, q(x, y), but has an acceptance probability of \(\tilde{\alpha }(x,y)=[1\wedge \mathsf {r}_1(x,y)][1\wedge \mathsf {r}_2(x,y)]\), where

$$\begin{aligned} \mathsf {r}_1(x,y):=\frac{\hat{\pi }(y)q(y,x)}{\hat{\pi }(x)q(x,y)} ~~~\text {and}~~~ \mathsf {r}_2(x,y):=\frac{\pi (y)/\hat{\pi }(y)}{\pi (x)/\hat{\pi }(x)}. \end{aligned}$$
(3)

With \(\overline{\widetilde{\alpha }}(x):=\int \tilde{\alpha }(x,y) q(x,y) \text {d}y\), the delayed-acceptance (DA) kernel is

$$\begin{aligned} \tilde{\mathsf {P}}(x,\text {d}y):=q(x,y)\text {d}y ~\tilde{\alpha }(x,y)+[1-\overline{\widetilde{\alpha }}(x)]\delta _{x}(\text {d}y). \end{aligned}$$
(4)

An iteration of the corresponding DA algorithm proceeds from a current value, x, to the next value, y, as follows.

Stage One: A value \(x'\) is sampled from the distribution with a density of \(q(x,x')\). With a probability of \(1\wedge \mathsf {r}_1(x,x')\) the algorithm proceeds to Stage Two, else \(y\leftarrow x\).

Stage Two: With a probability of \(1\wedge \mathsf {r}_2(x,x')\), \(y\leftarrow x'\), else \(y\leftarrow x\).
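The two-stage structure above is perhaps clearest in pseudocode; the sketch below (ours, illustrative only, with `log_pi_hat` a hypothetical name for the log surrogate density) evaluates the expensive `log_pi` only when Stage One succeeds.

```python
import numpy as np

def da_step(x, log_pi, log_pi_hat, propose, log_q, rng):
    """One delayed-acceptance iteration using the ratios r_1 and r_2 of Eq. 3."""
    y = propose(x, rng)
    # Stage One: screen using the cheap surrogate only.
    log_r1 = (log_pi_hat(y) + log_q(y, x)) - (log_pi_hat(x) + log_q(x, y))
    if np.log(rng.uniform()) >= min(0.0, log_r1):
        return x  # rejected cheaply; pi is never evaluated
    # Stage Two: correct using the expensive target.
    log_r2 = (log_pi(y) - log_pi_hat(y)) - (log_pi(x) - log_pi_hat(x))
    if np.log(rng.uniform()) < min(0.0, log_r2):
        return y
    return x
```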

Now, \(\tilde{\alpha }(x,y)\le \alpha (x,y)\), and so \({\textsf {var}}(h,\tilde{\mathsf {P}})\ge {\textsf {var}}(h,\mathsf {P})\) for each \(h\in L^2(\pi )\) Peskun (1973); Tierney (1998). At first glance this might suggest that the DA algorithm is never worthwhile; however for any proposal that is rejected at Stage One there is no need to complete the expensive calculation of \(\pi (x')\) that is required at every iteration of the MH algorithm and in Stage Two of the DA algorithm. As mentioned in the Introduction, for a fixed computational time, the decreased average computational cost per iteration, and alterations of any tuning parameters to take advantage of this, can lead to a DA algorithm where the variance of an estimator can be over an order of magnitude smaller than that of the MH algorithm.
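An informal accounting (ours, not from the article) makes the saving explicit: if evaluating \(\hat{\pi }\) costs \(c_\mathrm{cheap}\), evaluating \(\pi\) costs \(c_\mathrm{exp}\gg c_\mathrm{cheap}\), and a proportion \(\overline{\alpha }_1\) of proposals pass Stage One, then

$$\begin{aligned} \text {cost per DA iteration}~\approx ~ c_\mathrm{cheap}+\overline{\alpha }_1\, c_\mathrm{exp} ~~~\text {compared with}~~~ \text {cost per MH iteration}~\approx ~ c_\mathrm{exp}, \end{aligned}$$

so when \(\overline{\alpha }_1\) is small the DA chain performs many more iterations per unit time, which can outweigh its smaller expected movement per iteration.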

Since \({\textsf {var}}(h,\tilde{\mathsf {P}})\ge {\textsf {var}}(h,\mathsf {P})\), if \(\tilde{\mathsf {P}}\) is variance bounding then so is \(\mathsf {P}\); however it is feasible that \(\mathsf {P}\) may be variance bounding while \(\tilde{\mathsf {P}}\) is not.

2.2 Key Terminology, Equivalences and Implications

The MH and DA kernels are both reversible with respect to the target. A kernel P is reversible with respect to a distribution \(\pi\) iff for all \(\mathcal {A}\in \mathcal {F}\) and \(\mathcal {B}\in \mathcal {F}\), \(\int _{\mathcal {A}}\pi (\text {d}x){P}(x,\mathcal {B}) = \int _{\mathcal {B}}\pi (\text {d}x) {P}(x,\mathcal {A}).\) This article utilises a number of existing results for reversible Markov chains on the relationship between variance bounding, conductance, spectral gaps and geometric ergodicity. Here we define conductance and spectral gaps and summarise the relationships between the four properties.

Define the Hilbert space \(L_0^2(\pi )=\{f:\mathcal {X}\rightarrow \mathbb {R}; \mathbb {E}_{{\pi }}\left[ {f(X)}\right] =0,~\mathbb {E}_{{\pi }}\left[ {f^2(X)}\right] <\infty \}\) with the inner product \(\langle f, g \rangle =\int _{\mathcal {X}}f(x)g(x)\pi (\text {d}x)\), and consider P as an operator acting on \(L_0^2(\pi )\) according to \((Pf)(x)=\int _{\mathcal {X}} P(x,\text {d}y) f(y)\). If P is reversible then it is a self-adjoint operator on \(L_0^2(\pi )\) and by the spectral theorem for bounded self-adjoint operators, for each \(f\in L_0^2(\pi )\), \(\langle f,P^nf\rangle =\int _{-1}^1 \lambda ^nH_{f}(\text {d}\lambda )\) for some positive measure \(H_f\) on \([-1,1]\). Let

$$\begin{aligned} r_1:=\inf _{f\in L_0^2(\pi )}\frac{\langle f,Pf\rangle }{\langle f,f\rangle }\ge -1 ~~~\text {and}~~~ r_2:=\sup _{f\in L_0^2(\pi )}\frac{\langle f,Pf\rangle }{\langle f,f\rangle }\le 1, \end{aligned}$$
(5)

or, equivalently (e.g. Yosida (1980), p320, Theorem 2), \([r_1,r_2]\) is the smallest closed interval containing the support of \(H_f\) for all \(f\in L_0^2(\pi )\). The spectral gap of P is \(1-\max (|r_1|,|r_2|)\) (e.g. Geyer (1992); Roberts and Rosenthal (1997)), the right spectral gap is \(1-r_2\) and the left spectral gap is \(1+r_1\). P is said to have a spectral gap (or a left or right spectral gap) if its spectral gap (or left or right spectral gap) is non-zero.

For any set \(\mathcal {A}\in \mathcal {F}\) with \(\pi (\mathcal {A})>0\) consider the probability of leaving \(\mathcal {A}\) at the next iteration given that the stationary chain is currently in \(\mathcal {A}\):

$$\begin{aligned} \kappa (\mathcal {A}): = \frac{1}{\pi (\mathcal {A})} \int _{\mathcal {A}}P(x,\mathcal {A}^c)\pi (\text {d}x). \end{aligned}$$

The conductance, \(\kappa\), for a Markov kernel P with invariant measure \(\pi\) is then (e.g. Lawler and Sokal (1988); see also Jerrum and Sinclair (1988))

$$\begin{aligned} \kappa :=\inf _{\mathcal {A}:0<\pi (\mathcal {A})\le 1/2} \kappa (\mathcal {A}). \end{aligned}$$
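These quantities are easiest to see on a toy finite state space; the sketch below (ours, purely illustrative) computes the right spectral gap and the conductance of a small reversible chain by brute force, so that the relationships listed next can be checked numerically. Here `P` is an \(n\times n\) transition matrix reversible with respect to the probability vector `pi` (both NumPy arrays).

```python
import itertools
import numpy as np

def right_gap_and_conductance(P, pi):
    """Right spectral gap 1 - r_2 and conductance kappa of a reversible finite chain."""
    n = len(pi)
    # The pi-symmetrised matrix D^{1/2} P D^{-1/2} has the same spectrum as P on L^2(pi).
    d = np.sqrt(pi)
    sym = (d[:, None] * P) / d[None, :]
    eigs = np.sort(np.linalg.eigvalsh(sym))   # real eigenvalues in [-1, 1], ascending
    right_gap = 1.0 - eigs[-2]                # drop the eigenvalue 1
    # Conductance: brute-force search over all sets A with 0 < pi(A) <= 1/2.
    kappa = np.inf
    for size in range(1, n):
        for A in itertools.combinations(range(n), size):
            A = list(A)
            pi_A = pi[A].sum()
            if 0 < pi_A <= 0.5:
                Ac = [i for i in range(n) if i not in A]
                flow = (pi[A][:, None] * P[np.ix_(A, Ac)]).sum()
                kappa = min(kappa, flow / pi_A)
    return right_gap, kappa
```

For such chains the right spectral gap and the conductance are simultaneously zero or non-zero, consistent with the equivalences summarised below.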

For any reversible Markov chain we have the following relationships:

P is variance bounding

\(\Leftrightarrow\) P has a right spectral gap (Theorem 14 of Roberts and Rosenthal (2008))

\(\Leftrightarrow\) P has a positive conductance (Theorem 2.1 of Lawler and Sokal (1988))

\(\Leftarrow\) P has a spectral gap

\(\Leftrightarrow\) P is geometrically ergodic (Theorem 2.1 of Roberts and Rosenthal (1997)).

These relationships will be used repeatedly in the sequel without further reference.

2.3 Example Algorithms

To exemplify our theoretical results we will consider four specific, frequently-used MH algorithms.

  1. The Metropolis-Hastings independence sampler (MHIS): \(q(x,y)=q(y)\).

  2. The random walk Metropolis (RWM): \(q(x,y)=q(x-y)=q(y-x)\); e.g. \(q(x,y)=\mathsf {N}(y;x,\lambda ^2 I)\).

  3. The Metropolis-adjusted Langevin algorithm (MALA): \(q(x,y)=\mathsf {N}(y;x+\frac{1}{2}\lambda ^2\nabla \log \pi (x),\lambda ^2 I).\)

  4. The truncated MALA:

    $$\begin{aligned} q(x,y)=\mathsf {N}\left( y;x+\frac{1}{2}\lambda ^2R(x),\lambda ^2I\right) ,~\text {where}~R(x)=\frac{D\nabla \log \pi (x)}{D\vee \left| \left| {\nabla \log \pi (x)}\right| \right| }, \end{aligned}$$
    (6)

    for some \(D>0\).

In proposals of types 2, 3 and 4, \(\lambda\) is often referred to as the scale parameter of the proposal. The MHIS and RWM have been used since the early days of MCMC (e.g. Tierney (1994)); conditions under which they are geometrically ergodic (and, hence, variance bounding) have been well studied; see, for example, Liu (1996) and Mengersen and Tweedie (1996) for the MHIS and Mengersen and Tweedie (1996), Roberts and Tweedie (1996b) and Jarner and Hansen (2000) for the RWM. Essentially, for the MHIS the proposal, q, must not have lighter tails than the target, and for the RWM the target must have sufficiently smooth and exponentially decreasing tails. The MALA was introduced in Besag (1994) and was analysed in Roberts and Tweedie (1996a), in which the truncated MALA was also introduced. The MALA can be much more efficient than the RWM in moderate to high dimensions. As with the RWM, for geometric ergodicity the MALA requires exponentially decreasing tails, but if the tails decrease too quickly, \(\left| \left| {\nabla \log \pi }\right| \right|\) grows too quickly and the MALA can fail to be geometrically ergodic. The truncated MALA circumvents this problem.

In Banterle et al. (2019) it is shown that the geometric ergodicity of an RWM algorithm need not be inherited by the resulting DA algorithm.

Example 1

(Banterle et al. 2019) Let \(\mathcal {X}=\mathbb {R}\) with \(\pi (x)\propto e^{-x^2/2}\) and \(q(x,y)\propto e^{-(y-x)^2/(2\lambda ^2)}\). If \(\hat{\pi }(x)\propto e^{-x^2/(2\sigma ^2)}\), with \(\sigma ^2<1\) then \(\tilde{\mathsf {P}}\) is not geometrically ergodic.
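A short heuristic calculation (ours, not part of the original example) indicates where the problem lies: here

$$\begin{aligned} \log \mathsf {r}_2(x,y)=\frac{1}{2}\left( \frac{1}{\sigma ^2}-1\right) (y^2-x^2), \end{aligned}$$

so when \(\sigma ^2<1\) and |x| is large, a proposal towards the mode (\(|y|<|x|\)) passes Stage One easily (and would be accepted by the parent RWM with probability one), yet has \(\log \mathsf {r}_2(x,y)\) large and negative, so Stage Two rejects it with overwhelming probability. The DA chain therefore escapes the tails of \(\pi\) only after a time that grows exponentially in |x|, which is incompatible with geometric ergodicity.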

The following conditional equivalence (proved in Sect. A.2) is used throughout the sequel. If the parent kernel is geometrically ergodic then the DA kernel must have a left spectral gap, and with this constraint geometric ergodicity and variance bounding are equivalent.

Proposition 1

Let \(\mathsf {P}\) be a MH kernel targeting \(\pi\) as specified in Eq. 2. Let \(\tilde{\mathsf {P}}\) be the DA kernel derived from this through the approximation \(\hat{\pi }\) as in Eq. 4. If \(\mathsf {P}\) is geometrically ergodic then

$$\begin{aligned} \tilde{\mathsf {P}}\text { is geometrically ergodic }\iff ~\tilde{\mathsf {P}}\text { is variance bounding.} \end{aligned}$$

The original random walk Metropolis algorithm on \(\pi (x)\) is geometrically ergodic Mengersen and Tweedie (1996), and hence variance bounding, so the DA kernel in Example 1 has not inherited its parent’s desirable properties. As a direct corollary of our Theorem 3 (see Sect. 4.2) we find that \(\sigma ^2\ge 1\) is exactly the right condition in this case:

Example 2

Let \(\mathcal {X}=\mathbb {R}\) with \(\pi (x)\propto e^{-x^2/2}\) and \(q(x,y)\propto e^{-(y-x)^2/(2\lambda ^2)}\). If \(\hat{\pi }(x)\propto e^{-x^2/(2\sigma ^2)}\), with \(\sigma ^2\ge 1\) then \(\tilde{\mathsf {P}}\) is variance bounding and geometrically ergodic.

Examples 1 and 2 suggest an intuition that problems may arise when \(\hat{\pi }(x)\) has lighter tails than \(\pi (x)\). As we shall see, this is a part of the story; however, in general, heavier tails are not sufficient to guarantee inheritance of the variance bounding property, and for a class of algorithms where heavy tails are sufficient, lighter tails can also be sufficient provided they are not too much lighter, in a sense we make precise.

3 Variance Bounding: Inheritance and Equivalence

Throughout this section we use the following generic formulation for two Markov kernels.

Definition 1

Let \(P_A(x,\text {d}y)\) and \(P_B(x,\text {d}y)\) be propose-accept-reject Markov kernels both targeting a distribution \(\pi\), and using, respectively, proposal densities of \(q_A(x,y)\) and \(q_B(x,y)\) and acceptance probabilities of \(\alpha _A(x,y)\) and \(\alpha _B(x,y)\).

Theorem 1, below, follows from Lemma 1, which is proved in Sect. A.1.1. It generalises Corollary 12 of Roberts and Rosenthal (2008) to allow for different acceptance probabilities and, more importantly, removes the need for a fixed, uniform minorisation condition. The minorisation need only hold in a region \(\mathcal {D}(x)\) such that under \(P_A\) there is “unlikely” to be an accepted proposal in \(\mathcal {D}(x)^{\complement }\).

Lemma 1

Let \(P_A\), \(P_B\), \(q_A\), \(q_B\), \(\alpha _A\) and \(\alpha _B\) be as in Definition 1, and let the conductances of \(P_A\) and \(P_B\) be \(\kappa _A\) and \(\kappa _B\) respectively. If \(\kappa _A>0\) and there is an \(\epsilon <\kappa _A\) and a \(\delta >0\) such that for \(\pi\)-almost all \(x\in \mathcal {X}\), there is a region \(\mathcal {D}(x)\in \mathcal {F}\) such that

$$\begin{aligned} \int _{\mathcal {D}(x)^{\complement }}q_A(x,y)\alpha _A(x,y)\text {d}y\le \epsilon , \end{aligned}$$
(7)

and

$$\begin{aligned} y\in \mathcal {D}(x)\Rightarrow q_B(x,y)\alpha _B(x,y)~\ge ~ \delta ~ q_A(x,y)\alpha _A(x,y), \end{aligned}$$
(8)

then \(\kappa _B \ge (1-\epsilon /\kappa _A) \delta \kappa _A\).

If \(P_A\) is variance bounding, \(\kappa _A>0\); choose an \(\epsilon \in (0,\kappa _A)\) and for each x a corresponding \(\mathcal {D}(x)\) so as to satisfy Eqs. 7 and 8 to obtain:

Theorem 1

Let \(P_A\), \(P_B\), \(q_A\), \(q_B\), \(\alpha _A\) and \(\alpha _B\) be as in Definition 1. If \(P_A\) is variance bounding and for any \(\epsilon >0\) there is a \(\delta >0\) such that for \(\pi\)-almost all \(x\in \mathcal {X}\) there is a region \(\mathcal {D}(x)\in \mathcal {F}\) such that Eqs. 7 and 8 hold, then \(P_B\) is also variance bounding.

The relationship between conductance and right spectral gap has recently been used in other contexts to bound the behaviour of one Markov kernel in terms of that of another Lee and Latuszyński (2014); Rudolph and Sprungk (2016). Lemma 1 itself shows that condition Eq. 7 need only hold for a single \(\epsilon <\kappa _{A}\); however, since in practice \(\kappa _{A}\) is unlikely to be known, the conditions of Theorem 1 are more practically useful.

In Sect. 4 we apply Theorem 1 to provide sufficient conditions for a delayed-acceptance kernel to inherit variance bounding from its Metropolis-Hastings parent. However, if a DA kernel is variance bounding then so is its parent MH kernel. Thus, the sufficient conditions in Sect. 4 imply an equivalence between the two kernels with respect to the variance bounding property. In this section, after two key definitions, we return briefly to this equivalence and provide sufficient conditions for equivalence (over potential targets) between Metropolis-Hastings kernels arising from two different proposal densities.

The most natural special case of Eq. 7 in practice is where the kernel is uniformly local, which we define as follows:

Definition 2

(Uniformly Local) A proposal is uniformly local if, given any \(\epsilon >0\),

$$\begin{aligned} \exists ~r<\infty ~ \text {such that}~\text {for all}~x\in \mathcal {X},~~~\int _{B(x,r)^c}q(x,y)\text {d}y < \epsilon . \end{aligned}$$
(9)

A propose-accept-reject kernel is defined to be uniformly local when its proposal is uniformly local.

Here and throughout this article, \(B(x,r):=\{y\in \mathcal {X}:\left| \left| {y-x}\right| \right| <r\}\) is the open ball of radius r centred on x. In our examples, \(\left| \left| {x}\right| \right|\) indicates the Euclidean norm, although the results are equally valid for other norms such as the Mahalanobis norm.
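As a check of Definition 2 in the simplest case (a calculation of ours, not from the article): for a Gaussian random-walk proposal \(q(x,y)=\mathsf {N}(y;x,\lambda ^2 I)\),

$$\begin{aligned} \int _{B(x,r)^c}q(x,y)\,\text {d}y=\mathbb {P}\left( \lambda \left| \left| {Z}\right| \right| \ge r\right) ,~~~Z\sim \mathsf {N}(0,I_d), \end{aligned}$$

which does not depend on x and tends to zero as \(r\rightarrow \infty\), so Eq. 9 is satisfied by taking r large enough that \(\mathbb {P}(\lambda \left| \left| {Z}\right| \right| \ge r)<\epsilon\).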

Control of the ratio q(yx)/q(xy) will also be important and so we define the following.

Definition 3

For any proposal density q(x, y),

$$\begin{aligned} \Delta (x,y;q):=\log q(y,x) - \log q(x,y). \end{aligned}$$

Clearly, the RWM is a uniformly local kernel; moreover \(\Delta (x,y;q_\mathrm{RWM})=0\). In contrast, on any target with unbounded support, the MHIS cannot be uniformly local; as we shall see, the behaviour of \(\Delta\) is then irrelevant. For the MALA and the truncated MALA we have:

Proposition 2

  (A)

    Let q(x, y) be the proposal for the truncated MALA in Eq. 6 or for the MALA on a target where \(\text {ess sup}_x \left| \left| {\nabla \log \pi (x)}\right| \right| = D< \infty\). Then

    (i)

      For all x, \(\mathbb {P}_{{q}}\left( {\left| \left| {Y-x}\right| \right| >r}\right) \rightarrow 0\) uniformly in x as \(r\rightarrow \infty\), so q is uniformly local, as defined in Eq. 9.

    (ii)

      \(|\Delta (x,y;q)|\le h(\left| \left| {y-x}\right| \right| ):=D\left| \left| {y-x}\right| \right| +\lambda ^2D^2/8\).

  (B)

    The proposal, q(x, y), for the MALA on a target where \(\text {ess sup}_x \left| \left| {\nabla \log \pi (x)}\right| \right| = \infty\) is not uniformly local.
The applicability of Lemma 1 and Theorem 1 ranges beyond delayed-acceptance kernels. Here we supply sufficient conditions for an equivalence between Metropolis–Hastings proposals.

Theorem 2

Let \(P_A\), \(P_B\), \(q_A\), \(q_B\), \(\alpha _A\) and \(\alpha _B\) be as in Definition 1 except that \(q_A(x,y)\) and \(q_B(x,y)\) are uniformly local proposal kernels, with \(\log q_A(x,y) - \log q_B(x,y)\) a continuous function from \(\mathbb {R}^{2d}\) to \(\mathbb {R}\). If, for \(\pi\)-almost all x and for some function \(h:[0,\infty )\rightarrow [0,\infty )\) with \(h(r)<\infty\) for all \(r<\infty\),

$$\begin{aligned} |\Delta (x,y;q_B)-\Delta (x,y;q_A)|\le h(\left| \left| {y-x}\right| \right| ), \end{aligned}$$
(10)

then \(P_A\) is variance bounding if and only if \(P_B\) is variance bounding.

Thus, for example, any two random-walk Metropolis algorithms with Gaussian jumps are equivalent, in that if, on a particular target, one is variance bounding then so is the other. When restricted to targets with a continuous gradient this equivalence extends to truncated MALA algorithms. The continuity requirement on \(\log q_A-\log q_B\) rules out, for example, an equivalence between a Gaussian random walk and a random walk where the proposal has bounded support; indeed, the latter may not even be ergodic if the target has gaps in its support.

4 Application to Delayed-Acceptance Kernels

4.1 Key Definitions and Properties

For uniformly local kernels we will describe two general sets of sufficient conditions for Eq. 8 to hold. The first is based upon the fact that the acceptance probability for \(\tilde{\mathsf {P}}\) can be written as

$$\begin{aligned} \tilde{\alpha }(x,y)&= [1\wedge \mathsf {r}_1(x,y)]~\left[ 1\wedge \frac{\mathsf {r}(x,y)}{\mathsf {r}_1(x,y)}\right] \ge [1\wedge \mathsf {r}_1(x,y)]~\left[ 1\wedge \frac{1}{\mathsf {r}_1(x,y)}\right] ~\left[ 1\wedge \mathsf {r}(x,y)\right] ,\\ \text {or}~~~\tilde{\alpha }(x,y)&= \left[ 1\wedge \frac{\mathsf {r}(x,y)}{\mathsf {r}_2(x,y)}\right] ~[1\wedge \mathsf {r}_2(x,y)] \ge \left[ 1\wedge {\mathsf {r}(x,y)}\right] ~\left[ 1\wedge \frac{1}{\mathsf {r}_2(x,y)}\right] ~[1\wedge \mathsf {r}_2(x,y)], \end{aligned}$$

where \(\mathsf {r}_1\) and \(\mathsf {r}_2\) are as defined in Eq. 3. So, if \(|\log \mathsf {r}_1(x,y)|\le m\) or \(|\log \mathsf {r}_2(x,y)|\le m\) then \(\tilde{\alpha }(x,y)\ge e^{-m}\alpha (x,y)\). The quantity \(|\log \mathsf {r}_2(x,y)|=|[\log \hat{\pi }(y)-\log \pi (y)]-[\log \hat{\pi }(x)-\log \pi (x)]|\) measures the discrepancy between the error in the approximation at the proposed value and the error in the approximation at the current value. We name this intuitive quantity the log-error discrepancy. The quantity \(\log \mathsf {r}_1\) is less natural since it relates \(\hat{\pi }(x)\), \(\hat{\pi }(y)\) and q(x, y).
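The bound \(\tilde{\alpha }(x,y)\ge e^{-m}\alpha (x,y)\) follows in one line (a spelled-out step, for completeness): if \(|\log \mathsf {r}_1(x,y)|\le m\) then

$$\begin{aligned} \left[ 1\wedge \mathsf {r}_1(x,y)\right] \left[ 1\wedge \frac{1}{\mathsf {r}_1(x,y)}\right] =1\wedge \mathsf {r}_1(x,y)\wedge \frac{1}{\mathsf {r}_1(x,y)}\ge e^{-m}, \end{aligned}$$

so the first displayed inequality gives \(\tilde{\alpha }(x,y)\ge e^{-m}\left[ 1\wedge \mathsf {r}(x,y)\right] =e^{-m}\alpha (x,y)\); the same argument applied to \(\mathsf {r}_2\) and the second display covers the case \(|\log \mathsf {r}_2(x,y)|\le m\).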

The second set of conditions is based upon the fact that if either \(\mathsf {r}_1(x,y)\le 1\) and \(\mathsf {r}_2(x,y)\le 1\) or if \(\mathsf {r}_1(x,y)\ge 1\) and \(\mathsf {r}_2(x,y)\ge 1\) then \(\tilde{\alpha }(x,y)=\alpha (x,y)\), whatever the log-error discrepancy.

These considerations lead to the natural definitions of a ‘potential problem’ set, \(\mathcal {M}_m(x)\), and a ‘no problem’ set \(\mathcal {C}(x)\), as follows:

$$\begin{aligned} \mathcal {M}_m(x):= \{y\in \mathcal {X}: \min \left( |\log \mathsf {r}_1(x,y)|,|\log \mathsf {r}_2(x,y)|\right) >m\},\end{aligned}$$
(11)
$$\begin{aligned} \mathcal {C}(x):= \{y\in \mathcal {X}:\text {sign}[\log \mathsf {r}_1(x,y)]~\text {sign}[\log \mathsf {r}_2(x,y)]\ge 0\}. \end{aligned}$$
(12)

Theorem 1 then leads directly to the following.

Corollary 1

Let \(\mathsf {P}\) be the Metropolis-Hastings kernel given in Eq. 2 and let \(\tilde{\mathsf {P}}\) be the corresponding delayed-acceptance kernel given in Eq. 4. Suppose that for all \(\epsilon >0\) there is an \(m<\infty\) such that for \(\pi\)-almost all x there exists a set \(\mathcal {D}(x)\subseteq \mathcal {X}\) such that

$$\begin{aligned} \mathcal {M}_m(x)\cap \mathcal {D}(x)\subseteq \mathcal {C}(x), \end{aligned}$$
(13)

and

$$\begin{aligned} \int _{\mathcal {D}(x)^c}q(x,y)\text {d}y \le \epsilon . \end{aligned}$$
(14)

Subject to these conditions, if \(\mathsf {P}\) is variance bounding then so is \(\tilde{\mathsf {P}}\).

When \(\hat{\pi }\) has heavier tails than \(\pi\), then for large x the set \(\mathcal {C}(x)\) can play an important role in the inheritance of the variance bounding property. In dimension \(d>1\), there are numerous possible definitions of ‘heavier tails’. The following is precisely what is required for our purposes:

Definition 4

(heavy tails) An approximate density \(\hat{\pi }\) is said to have heavy tails with respect to a density \(\pi\) if

$$\begin{aligned} \exists ~ r_*>0~\text {such that if}~\left| \left| {x}\right| \right|>r_*~\text {and}~\left| \left| {y}\right| \right| >r_*~\text {then}~\hat{\pi }(x)\le \hat{\pi }(y) \Rightarrow \frac{\hat{\pi }(x)}{\pi (x)}\ge \frac{\hat{\pi }(y)}{\pi (y)}. \end{aligned}$$
(15)

Intuitively, the left hand side is true when x is ‘further from the centre’ (according to \(\hat{\pi }\)) than y, and the implication is that the further out a point, the larger \(\hat{\pi }\) is compared with \(\pi\).

For uniformly local kernels we show that it is sufficient either (Corollary 3) that the log-error discrepancy satisfies a growth condition that is uniform in \(\left| \left| {y-x}\right| \right|\), or (Theorem 3) that the tails of the approximation are heavier than those of the target and that \(|\Delta (x,y;q)|\) satisfies a growth condition that is uniform in \(\left| \left| {y-x}\right| \right|\).

For all kernels, boundedness of the error \(\hat{\pi }(x)/\pi (x)\) away from 0 and \(\infty\) will ensure the required inheritance (Corollary 2). This is a very strong condition, but we exhibit MHIS and MALA algorithms where the weaker conditions that are sufficient for a uniformly local kernel are satisfied, yet the DA kernel is not variance bounding even though the MH kernel is.

4.2 DA Kernels with the Same Proposal Distribution as the Parent

Suppose that for all \(x\in \mathcal {X}\), \(\gamma _{lo}\le \hat{\pi }(x)/\pi (x)\le \gamma _{hi}\); then \(|\log \mathsf {r}_2(x,y)|\le \log (\gamma _{hi}/\gamma _{lo})\), so applying Corollary 1 with \(\mathcal {D}(x)=\mathcal {X}\) and \(m=\log \gamma _{hi}-\log \gamma _{lo}\) leads to:

Corollary 2

Let \(\mathsf {P}\) and \(\tilde{\mathsf {P}}\) be as described in Corollary 1. If there exist \(\gamma _{lo}>0\) and \(\gamma _{hi}<\infty\) such that \(\gamma _{lo}\le \hat{\pi }(x)/\pi (x)\le \gamma _{hi}\) for all \(x\in \mathcal {X}\), and if \(\mathsf {P}\) is variance bounding then so is \(\tilde{\mathsf {P}}\).

A more direct proof of Corollary 2 is possible using Dirichlet forms. However, Corollary 1 comes into its own when the error discrepancy is unbounded.

We first provide a cautionary example which shows that once the errors are unbounded the delayed-acceptance kernel need not inherit the variance bounding property from the Metropolis-Hastings kernel even if the growth of the log error discrepancy is uniformly bounded or if \(\hat{\pi }\) has heavier tails than \(\pi\).

Example 3

Let \(\mathcal {X}=\mathbb {R}\), let \(\mathsf {P}\) be an MHIS with \(q(x,y)=q(y)=\pi (y)=e^{-y}\mathbbm {1}(y>0)\), and let \(\tilde{\mathsf {P}}\) be the corresponding delayed-acceptance kernel Eq. 4, with \(\hat{\pi }(y)=ke^{-ky}\mathbbm {1}(y>0)\) with \(k>0\) and \(k \ne 1\). \(\mathsf {P}\) is geometrically ergodic, but \(\tilde{\mathsf {P}}\) is neither geometrically ergodic nor variance bounding.
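A short calculation (ours, not part of the example's statement) makes the failure transparent: because \(q(y)=\pi (y)\), \(\mathsf {r}(x,y)=1\) and the parent MHIS accepts every proposal, whereas

$$\begin{aligned} \mathsf {r}_1(x,y)=\frac{\hat{\pi }(y)\,\pi (x)}{\hat{\pi }(x)\,\pi (y)}=e^{-(k-1)(y-x)} ~~~\text {and}~~~ \mathsf {r}_2(x,y)=e^{(k-1)(y-x)}, \end{aligned}$$

so that \(\tilde{\alpha }(x,y)=e^{-|k-1|\,|y-x|}\). When x is large the proposal \(y\sim \mathsf {Exp}(1)\) is typically O(1), so \(\tilde{\alpha }(x,y)\approx e^{-|k-1|x}\) and the DA chain escapes the tail exponentially slowly, even though the parent chain samples \(\pi\) independently.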

The problem with the algorithm in Example 3 is that for some x values the proposal, y, is very likely to be a long way from x and yet \(y\notin \mathcal {C}(x)\). Our definition of a uniformly local proposal, Eq. 9, provides uniform control on the probability that \(\left| \left| {y-x}\right| \right|\) is large. Since this is only strictly necessary for \(y\notin \mathcal {C}(x)\), Eq. 9 is stronger than necessary, but it is much easier to check.

Our first sufficient condition for uniformly local kernels insists on uniformly bounded growth in the log-error discrepancy except when \(\tilde{\alpha }(x,y)=\alpha (x,y)\). For \(\pi\)-almost all x and for some function \(h:[0,\infty )\rightarrow [0,\infty )\) with \(h(r)<\infty\) for all \(r<\infty\),

$$\begin{aligned} \{y\in \mathcal {X}:|\log \mathsf {r}_2(x,y)|>h(\left| \left| {y-x}\right| \right| )\}\subseteq \mathcal {C}(x). \end{aligned}$$
(16)

If a proposal is uniformly local, given \(\epsilon >0\) find \(r=r(\epsilon )\) according to Eq. 9. Then Eq. 16 implies that for \(y\in B(x,r)\), \(\mathcal {M}_{h(r)}(x)\subseteq \mathcal {C}(x)\). Applying Corollary 1 with \(\mathcal {D}(x)=B(x,r)\) leads to the following.

Corollary 3

Let \(\mathsf {P}\) and \(\tilde{\mathsf {P}}\) be as described in Corollary 1. In addition let q(xy) be a uniformly local proposal as in Eq. 9, and let the error discrepancy satisfy Eq. 16. If \(\mathsf {P}\) is variance bounding then so is \(\tilde{\mathsf {P}}\).

Because most of the mass from the proposal, y, is not too far away from the current value, x, the discrepancy between the error at x and the error at y remains manageable provided the discrepancy grows in a manner that is controlled uniformly across the statespace. Since the random walk Metropolis on an exponential target density is geometrically ergodic Mengersen and Tweedie (1996), we may apply Corollary 3 with \(h(r)=|k-1|r\), and then Proposition 1, to obtain the following contrast to Example 3, which shows that the variance bounding property can be inherited even when the approximation has lighter tails than the target.

Example 4

Let \(\mathcal {X}=\mathbb {R}\) and let \(\mathsf {P}\) be a RWM algorithm on \(\pi (x)=e^{-x}\mathbbm {1}(x>0)\) using \(q(x,y)\propto e^{-(y-x)^2/(2\lambda ^2)}\). For any \(k>0\), let \(\tilde{\mathsf {P}}\) be the corresponding delayed-acceptance RWM algorithm using a surrogate of \(\hat{\pi }(x)=ke^{-kx}\mathbbm {1}(x>0)\). \(\tilde{\mathsf {P}}\) is variance bounding and geometrically ergodic.

As yet, the set \(\mathcal {C}(x)\) has not played a part in any of our examples. It is precisely this set that allows a delayed-acceptance random walk Metropolis kernel to inherit the variance bounding property from its parent even when the error discrepancy is not controlled uniformly, provided \(\hat{\pi }\) has tails that are heavier than those of \(\pi\). For general MH algorithms an additional control on the behaviour of q is enough to guarantee inheritance of the variance bounding property.

Theorem 3

Let \(\mathsf {P}\) be the Metropolis-Hastings kernel given in Eq. 2 and let \(\tilde{\mathsf {P}}\) be the corresponding delayed-acceptance kernel given in Eq. 4. Further, let q(x, y) be a uniformly local proposal in the sense of Eq. 9, let \(\pi\) and \(\hat{\pi }\) be continuous, and let \(\hat{\pi }\) have heavier tails than \(\pi\) in the sense of Eq. 15. Suppose that, in addition, for any \(\mathcal {D}(x)\) required by Eqs. 13 and 14 there exists a function \(h:[0,\infty )\rightarrow [0,\infty )\) with \(h(r)<\infty\) for all \(r<\infty\), such that for \(\pi\)-almost all x

$$\begin{aligned} \{y\in \mathcal {D}(x):~|\Delta (x,y;q)|> h(\left| \left| {y-x}\right| \right| )\}\subseteq \mathcal {C}(x). \end{aligned}$$
(17)

Subject to these conditions, if \(\mathsf {P}\) is variance bounding then so is \(\tilde{\mathsf {P}}\).

We now consider the delayed-acceptance versions of the random walk Metropolis, the truncated MALA, and the MALA. Before doing this we provide the details of a property that was anticipated in Roberts and Tweedie (1996a).

Proposition 3

Let \(\mathsf {P}_\mathrm{RWM}\) be a random walk Metropolis kernel using \(q(x,y)\propto e^{-\frac{1}{2\lambda ^2}\left| \left| {y-x}\right| \right| ^2}\) and targeting a density \(\pi (x)\). Let \(\mathsf {P}\) be a Metropolis-Hastings kernel on \(\pi\) of the form \(q(x,y)\propto e^{-\frac{1}{2\lambda ^2}\left| \left| {y-x-v(x)}\right| \right| ^2}\), where \(\pi \text {-ess sup}_x \left| \left| {v(x)}\right| \right| < \infty\). \(\mathsf {P}_\mathrm{RWM}\) is variance bounding if and only if \(\mathsf {P}\) is variance bounding.

Proposition 3 clearly applies to a truncated MALA kernel on \(\pi (x)\) using q as in Eq. 6. It, together with each of our subsequent results for the truncated MALA, also applies to a MALA kernel on a target where \(\pi \text {-ess sup}_x \left| \left| {\nabla \log \pi (x)}\right| \right| = D<\infty\); in practice, however, the useful set of such kernels is limited to targets with exponentially decaying tails, since MALA is not geometrically ergodic on targets with heavier tails Roberts and Tweedie (1996a).

Given Proposition 2 and its prelude, a direct application of Theorem 3 then leads to the following.

Example 5

Let \(\mathsf {P}_\mathrm{RWM}\) and \(\mathsf {P}_\mathrm{TMALA}\) be, respectively, a random walk Metropolis kernel and a truncated MALA kernel on the differentiable density, \(\pi (x)\). Let \(\tilde{\mathsf {P}}_\mathrm{RWM}\) and \(\tilde{\mathsf {P}}_\mathrm{TMALA}\) be the corresponding delayed-acceptance kernels, created as in Eq. 4 through the continuous density, \(\hat{\pi }(x)\). Suppose also that \(\hat{\pi }\) has heavier tails than \(\pi\) in the sense of Eq. 15. Subject to these conditions, if \(\mathsf {P}_\mathrm{RWM}\) is variance bounding then so is \(\tilde{\mathsf {P}}_\mathrm{RWM}\), and if \(\mathsf {P}_\mathrm{TMALA}\) is variance bounding then so is \(\tilde{\mathsf {P}}_\mathrm{TMALA}\).

The MALA is geometrically ergodic when applied to one-dimensional targets of the form \(\pi (x)\propto e^{-|{x}|^\beta }\) for \(\beta \in [1,2)\) Roberts and Tweedie (1996a); when \(\beta =2\) geometric ergodicity occurs provided \(\lambda\) is sufficiently small, and for \(\beta >2\) the MALA is not geometrically ergodic. Even when \(\beta >1\), however, Theorem 3 does not apply because the proposal is not uniformly local.

Example 6

Let \(\mathcal {X}=\mathbb {R}\) and let \(\mathsf {P}\) be a MALA algorithm on \(\pi (x)\propto e^{-x^\beta }\mathbbm {1}(x>0)\) with \(1\le \beta <2\). Let \(\hat{\pi }(x)\propto e^{-x^\gamma }\mathbbm {1}(x>0)\) and let \(\tilde{\mathsf {P}}\) be the corresponding delayed-acceptance MALA kernel Eq. 4 (i.e. using a proposal of \(Y=x+\frac{1}{2}\lambda ^2\nabla \log \pi (x)+\lambda Z\), where \(Z\sim N(0,1)\)). \(\tilde{\mathsf {P}}\) is neither geometrically ergodic nor variance bounding, except when \(\gamma =\beta\).

The contrast between the truncated MALA and the MALA in Examples 5 and 6 highlights the importance of a uniformly local proposal. In practice, however, if \(\pi (x)\) is computationally expensive to evaluate then, typically, \(\nabla \log \pi (x)\) will also be expensive to evaluate and it might seem more reasonable to base the proposal for delayed-acceptance MALA and delayed-acceptance truncated MALA on \(\nabla \log \hat{\pi }(x)\).

4.3 Kernels Where the Proposal is Based Upon \(\hat{\pi }\)

On some occasions, the proposal q(x, y) is a function of the posterior, \(\pi (x)\), and on such occasions it may be expedient for the delayed-acceptance algorithm to use a proposal \(\hat{q}(x,y)\), which is based upon \(\hat{\pi }(x)\). The acceptance probability is \(\tilde{\alpha }_b(x,y)=[1\wedge \mathsf {r}_{1b}(x,y)][1\wedge \mathsf {r}_2(x,y)]\), where

$$\begin{aligned} \mathsf {r}_{1b}(x,y):=\frac{\hat{\pi }(y)\hat{q}(y,x)}{\hat{\pi }(x)\hat{q}(x,y)}. \end{aligned}$$

With \(\overline{\widetilde{\alpha }}_{b}(x):=\mathbb {E}_{{\hat{q}}}\left[ {\tilde{\alpha }_{b}(x,Y)}\right]\), the corresponding delayed acceptance kernel is

$$\begin{aligned} \tilde{\mathsf {P}}_{b}(x,\text {d}y):=\hat{q}(x,y)\text {d}y ~\tilde{\alpha }_{b}(x,y)+[1-\overline{\widetilde{\alpha }}_{b}(x)]\delta _{x}(\text {d}y). \end{aligned}$$
(18)

Let \(\mathsf {r}_\mathrm{hyp}(x,y):=\pi (y)\hat{q}(y,x)/[\pi (x)\hat{q}(x,y)]\), \(\alpha _\mathrm{hyp}(x,y)=1\wedge \mathsf {r}_\mathrm{hyp}(x,y)\), and, with \(\overline{\alpha }_\mathrm{hyp}(x)=\mathbb {E}_{{\hat{q}}}\left[ {\alpha _\mathrm{hyp}(x,Y)}\right]\), consider the hypothetical Metropolis-Hastings kernel:

$$\begin{aligned} \mathsf {P}_\mathrm{hyp}(x,\text {d}y):=\hat{q}(x,y)\text {d}y~ \alpha _\mathrm{hyp}(x,y)+[1-\overline{\alpha }_\mathrm{hyp}(x)]\delta _{x}(\text {d}y). \end{aligned}$$
(19)

Now, \(\tilde{\alpha }_b(x,y)\le \alpha _\mathrm{hyp}(x,y)\), so if \(\mathsf {P}_\mathrm{hyp}\) is not variance bounding then \(\tilde{\mathsf {P}}_b\) is not variance bounding either. There is an exact correspondence between \(\mathsf {P}\) from the previous section and \(\mathsf {P}_\mathrm{hyp}\), and it is natural to consider inheritance of geometric ergodicity from \(\mathsf {P}_\mathrm{hyp}\) exactly as, in the previous section, we considered inheritance from \(\mathsf {P}\). The theoretical results are analogous and will not be restated; moreover, the theoretical properties of kernels of the form \(\mathsf {P}_\mathrm{hyp}\) are less well investigated. Instead we illustrate inheritance of variance bounding (or its lack) through two examples.
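As an illustration of the construction in Eq. 18 (our sketch, not the article's; the function names are hypothetical), one iteration of a delayed-acceptance MALA whose proposal is driven by \(\nabla \log \hat{\pi }\) rather than \(\nabla \log \pi\) could look as follows.

```python
import numpy as np

def da_mala_b_step(x, log_pi, log_pi_hat, grad_log_pi_hat, lam, rng):
    """One iteration of Eq. 18 with q_hat a MALA proposal built from pi_hat; x is a NumPy array."""
    def log_q_hat(u, v):
        mean = u + 0.5 * lam**2 * grad_log_pi_hat(u)
        return -np.sum((v - mean) ** 2) / (2.0 * lam**2)

    y = x + 0.5 * lam**2 * grad_log_pi_hat(x) + lam * rng.standard_normal(x.shape)
    # Stage One: the ratio r_1b uses only the surrogate pi_hat and the proposal q_hat.
    log_r1b = (log_pi_hat(y) + log_q_hat(y, x)) - (log_pi_hat(x) + log_q_hat(x, y))
    if np.log(rng.uniform()) >= min(0.0, log_r1b):
        return x
    # Stage Two: correct with the true target; r_2 is unchanged from Eq. 3.
    log_r2 = (log_pi(y) - log_pi_hat(y)) - (log_pi(x) - log_pi_hat(x))
    if np.log(rng.uniform()) < min(0.0, log_r2):
        return y
    return x
```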

Example 7

Let \(\mathsf {P}_\mathrm{TMALA}\) be a truncated MALA kernel on the differentiable density, \(\pi (x)\). Let \(\tilde{\mathsf {P}}_\mathrm{TMALAb}\) be the corresponding delayed-acceptance kernel, created as in Eq. 18 through the differentiable density \(\hat{\pi }(x)\). \(\tilde{\mathsf {P}}_\mathrm{TMALAb}\) inherits the variance bounding property from \(\mathsf {P}_\mathrm{TMALA}\) if either of the following conditions holds: (i) there is uniformly bounded growth in the log-error discrepancy, in the sense of Eq. 16, or (ii) \(\hat{\pi }\) has heavier tails than \(\pi\) in the sense of Eq. 15.

Our penultimate example suggests that a delayed-acceptance MALA based upon an approximation that has heavier (though not too much heavier) tails is a reasonable choice.

Example 8

Let \(\mathcal {X}=\mathbb {R}\) and let \(\mathsf {P}\) be a MALA algorithm on \(\pi (x)\propto e^{-x^\beta }\mathbbm {1}(x>0)\) with \(1\le \beta <2\). Let \(\hat{\pi }(x)\propto e^{-x^\gamma }\mathbbm {1}(x>0)\) and let \(\tilde{\mathsf {P}}\) be the corresponding delayed-acceptance MALA kernel created as in Eq. 18 through the differentiable density \(\hat{\pi }(x)\). \(\tilde{\mathsf {P}}\) is variance bounding \(\iff\) \(\tilde{\mathsf {P}}\) is geometrically ergodic \(\iff 1\le \gamma \le \beta\).

We summarise the consequences of Examples 5 to 8 for \(\pi (x)\propto e^{-x^\beta }\mathbbm {1}(x>0)\) and \(\hat{\pi }(x)\propto e^{-x^\gamma }\mathbbm {1}(x>0)\) in Table 1, filling in the two blanks with Example 9 below. The table displays the results in terms of variance bounding, which is equivalent to geometric ergodicity in all these cases by Proposition 1.

Example 9

Let \(\mathcal {X}=\mathbb {R}\) and let \(\mathsf {P}\) be a RWM or truncated MALA algorithm on \(\pi (x)\propto e^{-x^\beta }\mathbbm {1}(x>0)\) with \(1\le \beta <2\). Let \(\hat{\pi }(x)\propto e^{-x^\gamma }\mathbbm {1}(x>0)\) and let \(\tilde{\mathsf {P}}\) be the corresponding delayed-acceptance RWM or truncated MALA kernel created either from Eq. 4 or Eq. 18 through the differentiable density \(\hat{\pi }(x)\). If \(1<\beta <\gamma\), \(\tilde{\mathsf {P}}\) is neither geometrically ergodic nor variance bounding.

Table 1 Whether or not the DA algorithm for \(\pi (x)\propto e^{-x^\beta }\mathbbm {1}(x>0)\) using \(\hat{\pi }(x)\propto e^{-x^\gamma }\mathbbm {1}(x>0)\) is variance bounding as a function of \(\gamma\) and \(\beta\) and the specific DA algorithm. The final two columns indicate that \(\hat{\pi }\) rather than \(\pi\) is used to create the proposal

5 Numerical Demonstrations

The theoretical results from Sect. 4 were made more concrete through Examples 1 to 9. In this section we investigate the numerical performance of delayed acceptance algorithms in examples similar to those used in earlier sections. The specific targets in the earlier Examples were chosen to demonstrate particular points as simply as possible; here we deliberately investigate a broader class of targets, the exponential family class (e.g. Roberts and Tweedie (1996a); Livingstone et al. (2019)):

$$\begin{aligned} \pi (x)\propto \exp \left( -||x||^\beta \right) ~~~\text {and}~~~ \hat{\pi }(x)\propto \exp \left( -||x||^\gamma /\kappa ^\gamma \right) . \end{aligned}$$
(20)

The parameters \(\beta\) and \(\gamma\) in Eq. 20 govern the lightness of the tails in the target and the approximation to it, respectively, and allow us to vary these separately.

A lack of variance bounding can be seen in terms of the chain struggling to leave a certain region, which typically has a low probability under \(\pi\). In practice, this lack of variance bounding (or a lack of geometric ergodicity) can manifest in two ways.

  1. When a sensible starting value is not known, a starting value with poor properties may be chosen unwittingly and the algorithm may struggle to move from this initial point or region of the space.

  2. Even when started from a reasonable value, over the course of a sufficiently long run the algorithm will visit this “danger region” and then struggle to leave.

For the target Eq. 20, the “danger region” corresponds to the tails of \(\pi\).

Our experiments deliberately start the algorithm in the tails of \(\pi\) and then measure the number of iterations to reach the centre of the distribution. To make “reaching the centre” concrete, we find the number of iterations until ||x|| is less than its median value under \(\pi\). To decide where in the tails we start, we set ||x|| to its \(1-p_0\) quantile under \(\pi\), for \(p_0\in \{10^{-1},10^{-2},\dots ,10^{-6}\}\) in Scenarios (i) and (ii), and \(p_0\in \{10^{-4},10^{-8},\dots ,10^{-24}\}\) in Scenarios (iii) and (iv); we start the algorithm from a uniformly random point on the surface of that hypersphere. In practical MCMC, many runs are of \(\mathcal {O}(10^6)\) iterations, so it is not unreasonable that issues which are detected for \(p_0\ge 10^{-8}\) might occur in practice even when the algorithm is started from a sensible value. We work in dimension \(d=5\) and repeat each experiment 20 times, except for scenario (iii) where we repeat 10 times to avoid excessive clutter.
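A minimal sketch of this experimental set-up (ours; the tuning of \(\lambda\) is simplified and \(\kappa =1\) is assumed in Eq. 20) is given below. It uses the fact that for \(\pi (x)\propto \exp (-||x||^\beta )\) in dimension d, \(||X||^\beta \sim \text {Gamma}(d/\beta ,1)\), so quantiles of ||X|| are available in closed form.

```python
import numpy as np
from scipy.stats import gamma

def radius_quantile(p, d, beta):
    """p-quantile of ||X|| under pi(x) proportional to exp(-||x||^beta) in dimension d."""
    return gamma.ppf(p, a=d / beta) ** (1.0 / beta)

def iterations_to_centre(beta, gamma_pow, p0, lam=0.5, d=5, max_iter=10**6, seed=0):
    """Start ||x|| at the (1 - p0) quantile of ||X|| under pi, run a DA random-walk
    Metropolis with surrogate pi_hat(x) proportional to exp(-||x||^gamma_pow), and
    return the number of iterations until ||x|| falls below its median under pi."""
    rng = np.random.default_rng(seed)
    log_pi = lambda x: -np.linalg.norm(x) ** beta
    log_pi_hat = lambda x: -np.linalg.norm(x) ** gamma_pow
    direction = rng.standard_normal(d)
    x = radius_quantile(1.0 - p0, d, beta) * direction / np.linalg.norm(direction)
    median = radius_quantile(0.5, d, beta)
    for it in range(1, max_iter + 1):
        y = x + lam * rng.standard_normal(d)
        # Stage One (surrogate only); the symmetric proposal cancels in r_1.
        if np.log(rng.uniform()) < min(0.0, log_pi_hat(y) - log_pi_hat(x)):
            # Stage Two (true target).
            log_r2 = (log_pi(y) - log_pi_hat(y)) - (log_pi(x) - log_pi_hat(x))
            if np.log(rng.uniform()) < min(0.0, log_r2):
                x = y
        if np.linalg.norm(x) < median:
            return it
    return max_iter
```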

We consider four specific scenarios, and so as to bound the amount of computing time, in each scenario we set a maximum number of iterations for which the algorithm should be run. In all scenarios the time until convergence increases with the starting quantile, whether or not the algorithm is variance bounding, for the most part simply because the algorithm is starting further from the main mass of the target. However, when the disparity between algorithms grows towards an order of magnitude, this suggests danger.

For the DARWM and DAMALA, the scaling parameter, \(\lambda\), was chosen so that for the RWM or MALA itself, the acceptance rate was a little larger than the theoretical optimum values of approximately \(23\%\) and \(57\%\) respectively. DATMALA used the same scaling as DAMALA and a truncation value such that when TMALA explored the true posterior, fewer than \(4\%\) of the gradients were truncated.

Scenario i (\(\beta =\gamma =2\), \(\kappa =1/2\) and \(\kappa =2\)). The results appear in the top-left of Figure 1 and demonstrate the undesirable behaviour when the target and the approximation are both Gaussian but the approximation has lighter tails than the target (see Example 1), and the reasonable behaviour when the approximation’s tails are less tight than the target’s (Example 2).

Scenario ii (\(\beta =\gamma =1\), \(\kappa =1/2\) and \(\kappa =2\)). The results, in the top-right of Figure 1, demonstrate that, in alignment with Example 3, the worst behaviour by some margin is exhibited by the only non-variance bounding algorithm: the independence sampler where \(\hat{\pi }\) uses a smaller scaling than \(\pi\) has. In particular, aligning with Example 4, the DARWM that uses the same \(\hat{\pi }\) as the poor independence sampler performs only marginally worse than the DARWM which uses the notionally ‘safer’ \(\hat{\pi }\).

Scenario iii (\(\beta =1.5\), \(\kappa =1\), \(\gamma =1.2\) and \(\gamma =1.8\)). This corresponds to Examples 5, 6 and 9 and is consistent with the DARWM and DATMALA, but not DAMALA, being variance bounding when \(1<\gamma <\beta\), and none being variance bounding when \(\gamma >\beta\).

Scenario iv (\(\beta =1.5\), \(\kappa =1\), \(\gamma =1.2\) and \(\gamma =1.8\), proposal uses \(\nabla \log \hat{\pi }\)). The results suggest that, in line with Examples 7 and 8, DAMALA and DATMALA are both variance bounding when \(\gamma <\beta\) and, following Example 9, that neither is variance bounding when \(\gamma >\beta\).

In scenarios (i), (iii) and (iv) the target itself has lighter-than-exponential tails, so even though the x-axis is linear in \(\log p_0\) it is sublinear in the magnitude of the initial value, \(||x_0||\). Hence, issues with the algorithms might be expected to appear more slowly as \(-\log p_0\) increases than they do in scenario (ii). Whilst exceptionally poor behaviour is therefore unlikely to be seen during a typical run that has been started from the main posterior mass, it could easily occur as a result of a poor starting value.

Fig. 1

Plots of \(\log _{10}\) convergence time against (jittered) \(-\log _{10} p_0\) for scenarios (i) top left, (ii) top right, (iii) bottom left and (iv) bottom right. In plots (i)-(iii), a dark blue \(+\) corresponds to DARWM where the target has a ‘good’ parameter value (scaling in (i) and (ii), and power in (iii)) and a red \(\times\) to DARWM with a ‘poor’ parameter value. In plot (ii) a light blue \(\circ\) corresponds to DAIS (DA independence sampler) with \(\kappa =2\) and a magenta \(\triangle\) to DAIS with \(\kappa =1/2\). In plots (iii) and (iv), the light blue \(\circ\) corresponds to truncated DATMALA with \(\gamma <\beta\), and magenta \(\triangle\) to DATMALA with \(\gamma >\beta\), whilst a green \(\bullet\) corresponds to DAMALA with \(\gamma <\beta\), and black \(\blacktriangle\) to DAMALA with \(\gamma >\beta\)

6 Discussion

Delayed acceptance Metropolis-Hastings algorithms are popular when the posterior is computationally intensive to evaluate yet a cheap approximation is available. Approximations can arise through many mechanisms, including the coarsening of a numerical-integration grid, subsampling from big data, Gaussian process approximation and nearest-neighbour averaging. To date, with the exception of Franks and Vihola (2020) and a note in Banterle et al. (2019), little consideration has been given to the properties of the resulting algorithm and, in particular, to whether the delayed-acceptance algorithm might inherit good properties, such as variance bounding, from its parent Metropolis-Hastings algorithm. From the MCMC output, one might reasonably hope to be able to estimate any quantity with a finite variance under \(\pi\) and be confident that the Monte Carlo error would reduce in inverse proportion to the square-root of the run length; however, if the algorithm is not variance bounding then this may not be the case.

We have investigated the inheritance of the variance bounding property and provided sufficient conditions for it to occur. A general rule of thumb for algorithms with uniformly local (see Definition 2) proposals, such as the random walk Metropolis and the truncated MALA, is that the approximation should have heavier tails (see Definition 4) than the target; however, this is not always necessary (see Example 4). The MALA algorithm does not enjoy the same good properties as the truncated MALA and, in particular, does not necessarily inherit variance bounding even when the approximation does have heavier tails than the target (see Example 6).

A note of caution is also in order: variance bounding (and/or geometric ergodicity) are helpful properties as, in particular, they guarantee the existence of a usual central limit theorem for ergodic averages. However, whilst non-zero, the conductance of a kernel could be exceedingly small (or the geometric rate of convergence exceptionally close to one) so that the algorithm might not be useful in practice. Thus, whilst we recommend following the advice in this article when choosing the approximation so as to reduce the chance of false confidence in the resulting Monte Carlo estimates, one should also continue to check other diagnostics, such as trace plots, and to vary any tuning parameters to optimise performance.