Unbiased Markov chain Monte Carlo for intractable target distributions

Performing numerical integration when the integrand itself cannot be evaluated point-wise is a challenging task that arises in statistical analysis, notably in Bayesian inference for models with intractable likelihood functions. Markov chain Monte Carlo (MCMC) algorithms have been proposed for this setting, such as the pseudo-marginal method for latent variable models and the exchange algorithm for a class of undirected graphical models. As with any MCMC algorithm, the resulting estimators are justified asymptotically in the limit of the number of iterations, but exhibit a bias for any fixed number of iterations due to the Markov chains starting outside of stationarity. This "burn-in" bias is known to complicate the use of parallel processors for MCMC computations. We show how to use coupling techniques to generate unbiased estimators in finite time, building on recent advances for generic MCMC algorithms. We establish the theoretical validity of some of these procedures by extending existing results to cover the case of polynomially ergodic Markov chains. The efficiency of the proposed estimators is compared with that of standard MCMC estimators, with theoretical arguments and numerical experiments including state space models and Ising models.


Context
For various statistical models the likelihood function cannot be computed point-wise, which prevents the use of standard Markov chain Monte Carlo (MCMC) algorithms such as Metropolis-Hastings (MH) for Bayesian inference.
For example, the likelihood of latent variable models typically involves an intractable integral over the latent space.
Classically, one can address this problem by designing MCMC algorithms on the joint space of parameters and latent variables. However, these samplers can mix poorly when latent variables and parameters are strongly correlated under the joint posterior distribution. Furthermore these schemes cannot be implemented if we can only simulate the latent variables and not evaluate their probability density function [Andrieu et al., 2010, Section 2.3]. Similarly, in the context of undirected graphical models, the likelihood function might involve an intractable integral over the observation space; see Møller et al. [2006] with examples from spatial statistics.
Pseudo-marginal methods have been proposed for these situations [Lin et al., 2000, Beaumont, 2003, Andrieu and Roberts, 2009], whereby unbiased Monte Carlo estimators of the likelihood are used within an MH acceptance mechanism, while still producing chains that are ergodic with respect to the exact posterior distribution of interest, denoted by π. Pseudo-marginal algorithms and their extensions [Tran et al., 2016] are particularly adapted to latent variable models, such as random effects models and state space models, where the likelihood can be estimated without bias using importance sampling or particle filters [Beaumont, 2003, Andrieu and Roberts, 2009, Andrieu et al., 2010]. Related schemes include the exchange algorithm [Murray et al., 2006, Andrieu et al., 2018], which applies to scenarios where the likelihood involves an intractable, parameter-dependent normalizing constant. Exchange algorithms rely on the simulation of synthetic observations to cancel out intractable terms in the MH acceptance ratio. As with any MCMC algorithm, the computation of each iteration requires the completion of the previous ones, which hinders the potential for parallel computation. Running independent chains in parallel is always possible, and averaging over independent chains leads to a linear decrease of the resulting variance. However, the inherent bias that comes from starting the chains outside of stationarity, also called the "burn-in bias", remains [Rosenthal, 2000].
This burn-in bias has motivated various methodological developments in the MCMC literature; among these, some rely on coupling techniques, such as the circularly-coupled Markov chains of Neal [2017], regeneration techniques described in Mykland et al. [1995], and "coupling from the past" as proposed in Propp and Wilson [1996].
Coupling methods have also been proposed for diagnosing convergence in Johnson [1996, 1998]. Recently, a method has been proposed to completely remove the bias of Markov chain ergodic averages [Glynn and Rhee, 2014]. An extension of this approach using coupling ideas was proposed by Jacob et al. [2017b] and applied to a variety of MCMC algorithms. This methodology involves the construction of a pair of Markov chains, which are simulated until an event occurs. At this point, a certain function of the chains is returned, with the guarantee that its expectation is exactly the integral of interest. The output is thus an unbiased estimator of that integral. Averaging over i.i.d. copies of such estimators, we obtain estimators that are consistent in the limit of the number of copies, which can be generated independently in parallel. Relevant limit theorems have been established in Glynn and Heidelberger [1990], Glynn and Whitt [1992], enabling the construction of valid confidence intervals. The methodology has already been demonstrated for various MCMC algorithms [Jacob et al., 2017b, Heng and Jacob, 2017, Jacob et al., 2017a], which were instances of geometrically ergodic Markov chain samplers under typical conditions. However, in the case of intractable likelihoods and pseudo-marginal samplers, the associated Markov chains are typically sub-geometrically ergodic.
We show here that unbiased estimators of π(h), with finite variance and finite computational cost, can also be derived from polynomially ergodic Markov chains such as those generated by pseudo-marginal methods. We provide results on the associated efficiency in comparison with standard MCMC estimators. We apply the methodology to particle MCMC algorithms for inference in generic state space models, with an application to a time series of neuron activation counts. We also consider a variant of the pseudo-marginal approach known as the block pseudo-marginal approach [Tran et al., 2016], as well as the exchange algorithm [Murray et al., 2006]. Accompanying code used for the simulations and to generate the figures is provided at https://github.com/lolmid/unbiased_intractable_targets.

Unbiased estimators from coupled Markov chains
Let π be a probability measure on a topological space Z equipped with the Borel σ-algebra B(Z). In this section we recall how two coupled chains that marginally converge to π can be used to produce unbiased estimators of expectations π(h) := ∫ h(z) π(dz) for any π-integrable test function h : Z → R. Following Glynn and Rhee [2014], Jacob et al. [2017b], we consider the following coupling of two Markov chains (Z_n)_{n≥0} and (Z̃_n)_{n≥0}. First, Z_0 and Z̃_0 are drawn independently from an initial distribution π_0. Then, Z_1 is drawn from a Markov kernel P given Z_0, which is denoted Z_1 | Z_0 ∼ P(Z_0, ·). Subsequently, at step n ≥ 1, a pair (Z_{n+1}, Z̃_n) is drawn from a Markov kernel P̃ given (Z_n, Z̃_{n−1}), which is denoted (Z_{n+1}, Z̃_n) | (Z_n, Z̃_{n−1}) ∼ P̃((Z_n, Z̃_{n−1}), ·). The kernel P̃ is such that, marginally, Z_{n+1} | Z_n ∼ P(Z_n, ·) and Z̃_n | Z̃_{n−1} ∼ P(Z̃_{n−1}, ·). This implies that, marginally, Z_n and Z̃_n have the same distribution for all n ≥ 0. Furthermore, the kernel P̃ is constructed so that there exists a random variable τ, termed the meeting time, such that Z_n = Z̃_{n−1} almost surely (a.s.) for all n ≥ τ. Then, for any integer k, the following telescoping sum argument informally suggests an unbiased estimator of π(h). We start from π(h) = lim_{n→∞} E[h(Z_n)] and write, expanding the limit as a telescoping sum,

π(h) = E[h(Z_k)] + Σ_{n=k+1}^{∞} E[h(Z_n) − h(Z_{n−1})] = E[ h(Z_k) + Σ_{n=k+1}^{τ−1} (h(Z_n) − h(Z̃_{n−1})) ],

swapping the expectations and the limit, replacing h(Z_{n−1}) by h(Z̃_{n−1}), which has the same distribution, and truncating the sum at the meeting time since the terms corresponding to n ≥ τ are null.
The sum Σ_{n=k+1}^{τ−1} is treated as zero if τ − 1 < k + 1. The suggested estimator is thus defined as

H_l(Z, Z̃) := h(Z_l) + Σ_{n=l+1}^{τ−1} (h(Z_n) − h(Z̃_{n−1})),    (1)

with Z and Z̃ denoting the chains (Z_n)_{n≥0} and (Z̃_n)_{n≥0} respectively. As in Jacob et al. [2017b], we average H_l(Z, Z̃) over a range of values of l, l ∈ {k, k + 1, ..., m} for an integer m ≥ k, resulting in the estimator

H_{k:m}(Z, Z̃) := (m − k + 1)^{−1} Σ_{l=k}^{m} h(Z_l) + Σ_{n=k+1}^{τ−1} min(1, (n − k)/(m − k + 1)) (h(Z_n) − h(Z̃_{n−1})).    (2)

Intuitively, H_{k:m} can be understood as a standard Markov chain average after m steps with a burn-in period of k steps (which would in general be biased for π(h)), plus a second term that can be shown to remove the burn-in bias. That "bias correction" term is a weighted sum of differences of the chains between step k and the meeting time τ = inf{n ≥ 1 : Z_n = Z̃_{n−1}}. In the following, we write H_{k:m} := H_{k:m}(Z, Z̃) for brevity. The construction of H_{k:m} is summarized in Algorithm 1, where the initial distribution of the chains is denoted by π_0, and the Markov kernels by P and P̃ as above. Standard MCMC estimators require the specification of π_0 and P, while the proposed method additionally requires the specification of the coupled kernel P̃. We will propose coupled kernels for the setting of intractable likelihoods, and study the estimator H_{k:m} under conditions which cover pseudo-marginal methods.
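To make the construction concrete, the following Python sketch (our own illustration, not the authors' implementation) computes the estimator H_{k:m} of Equation (2) on a hypothetical two-state Markov chain whose coupled kernel feeds a common uniform variable to both chains, so that they meet in finite time; h is the identity and the stationary expectation equals 1/3 for the chosen transition probabilities.

```python
import numpy as np

P_UP = (0.1, 0.8)  # transition probabilities to state 1 from states 0 and 1

def coupled_step(z, zt, u):
    # common-uniform coupling of the two-state kernel P(z -> 1) = P_UP[z];
    # marginally each chain follows P, and chains in the same state stay equal
    return int(u < P_UP[z]), int(u < P_UP[zt])

def unbiased_estimator(h, k, m, rng):
    # one realization of H_{k:m} together with the meeting time tau
    z = [int(rng.random() < 0.5)]          # Z_0 ~ pi_0 (uniform on {0, 1})
    zt = [int(rng.random() < 0.5)]         # tilde Z_0 ~ pi_0, independent
    z.append(coupled_step(z[0], z[0], rng.random())[0])   # Z_1 ~ P(Z_0, .)
    tau, n = (1 if z[1] == zt[0] else None), 1
    while tau is None or n < max(m, tau):
        znew, ztnew = coupled_step(z[n], zt[n - 1], rng.random())
        z.append(znew); zt.append(ztnew); n += 1
        if tau is None and z[n] == zt[n - 1]:
            tau = n
    mcmc = sum(h(z[l]) for l in range(k, m + 1)) / (m - k + 1)
    bc = sum(min(1.0, (j - k) / (m - k + 1)) * (h(z[j]) - h(zt[j - 1]))
             for j in range(k + 1, tau))
    return mcmc + bc, tau

rng = np.random.default_rng(1)
draws = [unbiased_estimator(lambda x: x, 2, 10, rng) for _ in range(20000)]
estimate = float(np.mean([d[0] for d in draws]))
```

Averaging the independent copies recovers π(h) = 1/3 up to Monte Carlo error, with no burn-in bias for any fixed (k, m).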
Algorithm 1 Unbiased MCMC estimator H_{k:m}, for any choice of k and m with 0 ≤ k ≤ m.

1. Draw Z_0 and Z̃_0 independently from π_0, and draw Z_1 | Z_0 ∼ P(Z_0, ·). Set n = 1.
2. While n < max(m, τ): draw (Z_{n+1}, Z̃_n) | (Z_n, Z̃_{n−1}) ∼ P̃((Z_n, Z̃_{n−1}), ·), and set n ← n + 1.
3. Return H_{k:m} as described in Equation (2).
To see how coupled kernels can be constructed, we first recall the construction for simple MH kernels. Assume that π(dz) = π(z) dz where dz is the Lebesgue measure. The standard MH algorithm relies on a proposal distribution q(dz′|z) = q(z′|z) dz′, for instance chosen as a Gaussian distribution centered at z. At iteration n − 1, a proposal Z′ ∼ q(·|Z_{n−1}) is accepted as the new state Z_n with probability α_MH(Z_{n−1}, Z′) := min(1, π(Z′)q(Z_{n−1}|Z′)/{π(Z_{n−1})q(Z′|Z_{n−1})}), termed the MH acceptance probability. If Z′ is rejected, then Z_n is assigned the value of Z_{n−1}. This defines the kernel P. To construct P̃, following Jacob et al. [2017b] we can consider a maximal coupling of the proposal distributions. This is described in Algorithm 2 for completeness; see also Johnson [1998]. Here U[a, b] refers to the uniform distribution on the interval [a, b]. The algorithm relies on draws from a maximal coupling (or γ-coupling) of the two proposal distributions q(·|Z_n) and q(·|Z̃_{n−1}) at step n ≥ 1. Draws (Z′, Z̃′) from maximal couplings are such that the probability of the event {Z′ = Z̃′} is maximal over all couplings of Z′ ∼ q(·|Z_n) and Z̃′ ∼ q(·|Z̃_{n−1}). Sampling from maximal couplings can be done with rejection sampling techniques as described in Jacob et al. [2017b], in Section 4.5 of Chapter 1 of Thorisson [2000] and in Johnson [1998]. On the event {Z′ = Z̃′}, the two chains are given identical proposals, which are then accepted or not based on α_MH(Z_n, Z′) and α_MH(Z̃_{n−1}, Z̃′) using a common uniform random number. If both proposals are identical and both are accepted, the chains meet: Z_{n+1} = Z̃_n. One can then check that the chains remain identical from that iteration onwards.
Algorithm 2 Sampling from the coupled MH kernel P̃ given (Z_n, Z̃_{n−1}).

1. Sample (Z′, Z̃′) from a maximal coupling of q(·|Z_n) and q(·|Z̃_{n−1}).
2. Sample U ∼ U[0, 1].
3. If U ≤ α_MH(Z_n, Z′), set Z_{n+1} = Z′; otherwise set Z_{n+1} = Z_n.
4. If U ≤ α_MH(Z̃_{n−1}, Z̃′), set Z̃_n = Z̃′; otherwise set Z̃_n = Z̃_{n−1}.

Given R independent copies H^{(1)}_{k:m}, ..., H^{(R)}_{k:m} of the estimator, define the average H̄^R_{k:m} := R^{−1} Σ_{r=1}^{R} H^{(r)}_{k:m}. Then H̄^R_{k:m} is a consistent estimator of π(h) as R → ∞, for any fixed (k, m), and a central limit theorem holds provided that V[H_{k:m}] < ∞; sufficient conditions are given in Section 1.3. Since τ is a random variable, the cost of generating H_{k:m} is random.
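The maximal-coupling step can be implemented with the rejection sampler mentioned above; the following is a minimal Python sketch (our own illustration, with arbitrary parameter values) for two univariate Gaussian proposals. The probability that the two outputs coincide equals the overlap ∫ min(q₁, q₂), which for N(0, 1) and N(1, 1) is 2Φ(−1/2) ≈ 0.617.

```python
import math, random

def npdf(x, mu, sigma):
    # univariate Gaussian density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def maximal_coupling(mu1, mu2, sigma, rng):
    # rejection construction (Thorisson, 2000): returns (X, Y) with
    # X ~ N(mu1, sigma^2), Y ~ N(mu2, sigma^2) and P(X = Y) maximal
    x = rng.gauss(mu1, sigma)
    if rng.random() * npdf(x, mu1, sigma) <= npdf(x, mu2, sigma):
        return x, x                      # coupled: both outputs equal
    while True:                          # otherwise sample Y from the residual
        y = rng.gauss(mu2, sigma)
        if rng.random() * npdf(y, mu2, sigma) > npdf(y, mu1, sigma):
            return x, y

rng = random.Random(0)
pairs = [maximal_coupling(0.0, 1.0, 1.0, rng) for _ in range(20000)]
meet_freq = sum(x == y for x, y in pairs) / len(pairs)   # ~ 2 * Phi(-1/2)
```

The same construction applies to the multivariate Gaussian proposals used later, replacing `npdf` by the corresponding joint density.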
Neglecting the cost of drawing from π_0, the cost amounts to that of one draw from the kernel P, τ draws from the kernel P̃, and then m − τ draws from P if τ < m. Overall this leads to a cost of T_m := 2τ + max(1, m − τ + 1) units, where each unit is the cost of drawing from P, and assuming that one sample from P̃ costs two units. Theoretical considerations on variance and cost will be useful to guide the choice of the parameters k and m, as discussed in Section 1.5.

Theoretical validity under polynomial tails
We provide here sufficient conditions under which the estimator H_{k:m} is unbiased, has finite expected cost and finite variance. Below, Assumptions 1 and 3 are identical to Assumptions 2.1 and 2.3 in Jacob et al. [2017b], whereas Assumption 2 is a polynomial tail assumption on the meeting time, weaker than the geometric tail assumption, namely P(τ > n) ≤ Kρ^n for all n ≥ 1, for some constants K < ∞ and ρ ∈ (0, 1), used in Jacob et al. [2017b].
Relaxing this assumption is useful in our context, as the pseudo-marginal algorithm is polynomially ergodic under realistic assumptions and, as demonstrated in Section 1.4, this allows us to satisfy the polynomial tail assumption.

Assumption 1. Each of the two chains marginally starts from a distribution π_0, evolves according to a transition kernel P, and is such that E[h(Z_n)] → π(h) as n → ∞ for a real-valued function h. Furthermore, there exist constants η > 0 and D < ∞ such that E[|h(Z_n)|^{2+η}] ≤ D for all n ≥ 0.

Assumption 2. There exist constants K < ∞ and κ > 2(2η^{−1} + 1), with η as in Assumption 1, such that P(τ > n) ≤ K n^{−κ} for all n ≥ 1.

Assumption 3. The chains stay together after meeting, i.e. Z_n = Z̃_{n−1} for all n ≥ τ.

Since it is assumed that κ > 2(2η^{−1} + 1), the tail bound implies that E[τ^p] < ∞ for all p < κ, in particular for p ≤ 2(2η^{−1} + 1). Thus E[τ] < ∞ and the computational cost associated with H_{k:m} has a finite expectation; τ also has a finite second moment.
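The moment claims follow from the polynomial tail bound of Assumption 2 by a standard summation argument; we sketch it here (our own derivation, for integer-valued τ and p ≥ 1):

```latex
\mathbb{E}[\tau^p]
  = \sum_{n \ge 0} \big((n+1)^p - n^p\big)\,\mathbb{P}(\tau > n)
  \le 1 + \sum_{n \ge 1} p\,(n+1)^{p-1}\,K n^{-\kappa}
  < \infty
  \quad \text{whenever } (p-1) - \kappa < -1,\ \text{i.e. } p < \kappa .
```

The first equality uses the layer-cake representation for the nondecreasing function x → x^p, and the inequality uses the mean value theorem together with P(τ > n) ≤ K n^{−κ}.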
The following result states that, under the above assumptions, H_{k:m} not only has a finite expected cost but also has a finite variance, and that its expectation is indeed π(h). The proof is provided in Appendix A.1.

Theorem 1. Under Assumptions 1, 2 and 3, for any integers 0 ≤ k ≤ m, the estimator H_{k:m} has expectation π(h), a finite expected computing time, and a finite variance.

Conditions for polynomial tails
We now proceed to establishing conditions that imply Assumption 2. To state the main result, we put assumptions on the probability of meeting at each iteration. We write D for the diagonal of the joint space Z × Z, that is, D := {(z, z̃) ∈ Z × Z : z = z̃}, and introduce the measure π_D(dz, dz̃) := π(dz) δ_z(dz̃). The meeting time τ is then the hitting time of the diagonal, τ = inf{n ≥ 1 : (Z_n, Z̃_{n−1}) ∈ D}. The first assumption is on the ability of the pair of chains to hit the diagonal when it enters a certain subset of Z × Z.
Assumption 4. The kernel P̃ is π_D-irreducible: for any set A ⊂ D such that π_D(A) > 0 and all (z, z̃) ∈ Z × Z, there exists some n ≥ 0 such that P̃^n((z, z̃), A) > 0. The kernel P̃ is also aperiodic. Finally, there exist ε ∈ (0, 1) and a set C ⊂ Z such that

inf_{(z, z̃) ∈ C × C} P̃((z, z̃), D) ≥ ε.    (3)

Next we assume that the marginal kernel P admits a polynomial drift condition and a small set C; we will later take that small set to be the same set C as in Assumption 4. Intuitively, the polynomial drift condition ensures regular entries of the pair of chains into the set C × C, from which the diagonal can be hit in one step under Assumption 4.
Assumption 5. There exist ε_0 > 0, a probability measure ν on Z and a set C ⊂ Z such that, for all z ∈ C and all measurable A ⊂ Z,

P(z, A) ≥ ε_0 ν(A).    (4)

In addition, there exist a measurable function V : Z → [1, ∞), constants b_V, d > 0 and α ∈ (0, 1) such that, defining φ(x) := d x^α for all x ∈ [1, ∞), for any z ∈ Z,

P V(z) ≤ V(z) − φ(V(z)) + b_V 1_C(z).

The following result states that Assumptions 4 and 5 guarantee that the tail probabilities of the meeting time are polynomially bounded. The proof is provided in Appendix A.2.
Theorem 2. Suppose that Assumptions 4 and 5 hold for the same set C ⊂ Z, and that π_0 admits a density with respect to π and is supported on a compact set. Then there exists a constant K > 0 such that, for all n ≥ 1,

P(τ > n) ≤ K n^{−κ},

where κ = 1/(1 − α), with α defined as in Assumption 5.
We note the direct relation between the exponent α in the polynomial drift condition and the exponent κ in the bound on the survival probabilities P(τ ≥ n). In turn this relates to the existence of finite moments for τ , as discussed after Assumption 2. In particular, if we can take large values of η in Assumption 1, then we require in Assumption 2 that κ is just above 2, which is implied by α > 1/2 according to Theorem 2. However, if we consider η = 1 in Assumption 1, for instance, then we require in Assumption 2 that κ is just above 6, which is implied by α > 5/6 according to Theorem 2. The condition α > 5/6 will appear again in the next section.

Efficiency under polynomial tails
In removing the bias from MCMC estimators, we expect that H_{k:m} will have an increased variance compared to an MCMC estimator with equivalent cost. In this section we study the overall efficiency of H_{k:m} in comparison to standard MCMC estimators. This mirrors Proposition 3.3 in Jacob et al. [2017b] in the case of geometrically ergodic chains.
We can define the inefficiency of the estimator H_{k:m} as the product of its variance and of its expected computational cost, a standard notion in the study of estimators with random computing costs since seminal works such as Glynn and Heidelberger [1990] and Glynn and Whitt [1992]. The inefficiency can be understood as the asymptotic variance of the proposed estimator as the computing budget goes to infinity. The following provides a precise comparison between this inefficiency and that of the standard "serial" algorithm. Since the cost T_m is measured in units equal to the cost of sampling from P, the cost of obtaining a serial MCMC estimator based on m iterations is equal to m such units. The mean squared error associated with an MCMC estimator based on (Z_n)_{n≥0} is denoted by

MSE_{k:m} := E[(MCMC_{k:m} − π(h))²],   where MCMC_{k:m} := (m − k + 1)^{−1} Σ_{n=k}^{m} h(Z_n).

We first express the estimator H_{k:m}, for m ≥ k ≥ 0, as MCMC_{k:m} + BC_{k:m}, where the bias correction term is

BC_{k:m} := Σ_{n=k+1}^{τ−1} min(1, (n − k)/(m − k + 1)) (h(Z_n) − h(Z̃_{n−1})).

Then the Cauchy-Schwarz inequality provides a relationship between the variance of H_{k:m}, the MCMC mean squared error, and the second moment of the bias correction term:

V[H_{k:m}] ≤ MSE_{k:m} + 2 (MSE_{k:m} E[BC²_{k:m}])^{1/2} + E[BC²_{k:m}].

This relationship motivates the study of the second moment of BC_{k:m}. The following result shows that if the Markov chains mix well enough, in the sense that the exponent α in the polynomial drift condition of Assumption 5 is close enough to one, then we can obtain a bound on E[BC²_{k:m}] which is explicit in k and m. The proof can be found in Appendix A.3.

Proposition 1. Suppose that the marginal chain evolving according to P is ψ-irreducible and that the assumptions of Theorem 2 hold for 5/6 < α ≤ 1 and some measurable function V : Z → [1, ∞) such that S_V := {z : V(z) < ∞} ≠ ∅. In addition, assume that there exists γ ∈ (1 − α, 1) such that π(V^{4γ}) < ∞. Then for any measurable function h : Z → R such that sup_{z∈Z} V(z)^{−(α+γ−1)} |h(z)| < ∞, and any integers m ≥ k ≥ 0, we have a bound on E[BC²_{k:m}], explicit in k and m and involving a constant B < ∞ and the exponent κ := 1/(1 − α), that decays polynomially as k and m − k increase.

The fact that a restriction on the exponent α has to be imposed to control the second moment of BC_{k:m} is to be expected: we have already seen in the previous section that such a restriction is also necessary to apply Theorem 2 to verify Assumption 2 with an adequate exponent κ, which in turn leads to a finite variance for H_{k:m} through Theorem 1. The specific condition 5/6 < α ≤ 1 could perhaps be relaxed with a more refined technical analysis, so we interpret the condition qualitatively: the chains are allowed to satisfy only a polynomial drift condition, but it needs to be "close" enough to a geometric drift condition.
It follows from the above bounds that, under the assumptions of Proposition 1, the variance of H_{k:m} is bounded by the mean squared error of an MCMC estimator plus additive terms that vanish polynomially as k, m − k and m increase. To compare the efficiency of H_{k:m} to that of MCMC estimators, we add simplifying assumptions as in Jacob et al. [2017b]. As k increases and for m ≥ k, we expect the MCMC mean squared error to be inversely proportional to the number of averaged terms; we therefore make the simplifying assumption that MSE_{k:m} ≈ V_{k,m}/(m − k + 1) for k large enough, where V_{k,m} plays the role of an asymptotic variance. As the condition 5/6 < α is equivalent to κ > 6, E[BC²_{k:m}] is negligible compared to the other terms appearing on the right-hand side of the variance bound, so we obtain an approximate inequality bounding E[T_m] V[H_{k:m}] by m V_{k,m}/(m − k + 1) plus a cross term involving (E[BC²_{k:m}])^{1/2}, where the cost of H_{k:m} is approximated by the cost of m calls to P. For the left-hand side to be comparable to V_{k,m}, we can select m as a large multiple of k such that m/(m − k + 1) is close to one. The cross term is then negligible as k increases.
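The inefficiency comparison can also be carried out empirically once independent copies of (H_{k:m}, T_m) are available; the following small Python helper (our own illustration, with synthetic stand-ins in place of actual estimator output) computes the variance-times-expected-cost product of Glynn and Whitt [1992]:

```python
import numpy as np

def inefficiency(estimates, costs):
    # product of estimator variance and expected cost: the asymptotic
    # variance per unit of computing budget for i.i.d. estimators with
    # random costs (Glynn and Whitt, 1992)
    estimates, costs = np.asarray(estimates), np.asarray(costs)
    return estimates.var(ddof=1) * costs.mean()

# synthetic stand-ins: estimator values with variance 4, geometric costs with mean 4,
# so the inefficiency is approximately 16
rng = np.random.default_rng(0)
ineff = inefficiency(rng.normal(0.0, 2.0, 10000), rng.geometric(0.25, 10000))
```

In practice, `estimates` and `costs` would be the R realizations of H_{k:m} and T_m produced by Algorithm 1, and the result would be compared to m times the serial MCMC asymptotic variance.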

Pseudo-marginal Metropolis-Hastings
The pseudo-marginal approach [Lin et al., 2000, Beaumont, 2003, Andrieu and Roberts, 2009] generates Markov chains that target a distribution of interest while using only non-negative unbiased estimators of target density evaluations. For concreteness we focus on target distributions that are posterior distributions in a standard Bayesian framework. The likelihood function associated to data y ∈ Y is denoted by θ → p(y|θ), and a prior density θ → p(θ) w.r.t. the Lebesgue measure is assigned to an unknown parameter θ ∈ Θ ⊆ R^D. We assume that we can compute a non-negative unbiased estimator of p(y|θ) for all θ, denoted by p̂(y|θ, U), where U ∈ U ⊂ R^M are random variables with U ∼ m_θ(du); for any θ ∈ Θ, m_θ denotes a Borel probability measure on U, assumed to admit a density u → m_θ(u) with respect to the Lebesgue measure. The random variables U represent all the variables required in the construction of the unbiased estimator of p(y|θ). The pseudo-marginal algorithm targets a distribution on the joint space with density

π(θ, u) ∝ p(θ) p̂(y|θ, u) m_θ(u),

whose marginal in θ is the exact posterior, by unbiasedness of the likelihood estimator. Sampling from π(dθ, du) is achieved with a Metropolis-Hastings scheme with proposal q(dθ′|θ) m_{θ′}(du′). This results in an MH acceptance probability that simplifies to

min(1, [p(θ′) p̂(y|θ′, u′) q(θ|θ′)] / [p(θ) p̂(y|θ, u) q(θ′|θ)]),

which does not involve any evaluation of u → m_θ(u). Thus the algorithm proceeds exactly as a standard MH algorithm with proposal density q(θ′|θ), with the difference that likelihood evaluations p(y|θ) are replaced by estimates p̂(y|θ, u). The performance of the pseudo-marginal algorithm depends on the likelihood estimator: lower variance estimators typically yield ergodic averages with lower asymptotic variance, but producing lower variance estimators tends to cost more, which leads to a trade-off that has been analyzed in detail in the literature.
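The key implementation detail is that the current likelihood estimate is stored alongside the current state and is refreshed only when a proposal is accepted. A minimal Python sketch (our own toy construction, not the authors' code) for a conjugate Gaussian model, where the exact posterior is N(y/2, 1/2) and the density estimate is perturbed by mean-one log-Normal noise:

```python
import math, random

def log_post(theta, y):
    # exact log posterior density, up to a constant, for the toy model
    # theta ~ N(0, 1), y | theta ~ N(theta, 1); the posterior is N(y/2, 1/2)
    return -0.5 * theta ** 2 - 0.5 * (y - theta) ** 2

def pm_mh(y, sigma, n_iter, rng, step=1.0):
    # pseudo-marginal MH: each density evaluation is replaced by an unbiased
    # estimate (mean-one log-Normal noise on the natural scale); the stored
    # log-estimate is reused until a proposal is accepted
    theta = 0.0
    log_est = log_post(theta, y) + rng.gauss(-0.5 * sigma ** 2, sigma)
    chain = []
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)
        log_est_prop = log_post(prop, y) + rng.gauss(-0.5 * sigma ** 2, sigma)
        if math.log(rng.random()) < log_est_prop - log_est:
            theta, log_est = prop, log_est_prop    # accept: refresh the estimate
        chain.append(theta)
    return chain

chain = pm_mh(2.0, 1.0, 50000, random.Random(42))
post_mean = sum(chain) / len(chain)                       # targets y/2 = 1.0
post_var = sum((c - post_mean) ** 2 for c in chain) / len(chain)   # targets 1/2
```

Despite the noisy acceptance ratio, the chain is ergodic with respect to the exact posterior, which is the defining property of the pseudo-marginal construction.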
In the following we will generically denote by g_θ the distribution of the estimator p̂(y|θ, U) when U ∼ m_θ(·), and for notational simplicity we may write p̂(y|θ) instead of p̂(y|θ, U). The above description defines a Markov kernel P; we next proceed to defining a coupled kernel P̃, to be used for unbiased estimation as in Algorithm 1.

Coupled pseudo-marginal Metropolis-Hastings
To define a kernel P̃ that is marginally identical to P but jointly allows the chains to meet, we proceed as follows, mimicking the coupled MH kernel of Algorithm 2. First, the proposed parameters are sampled from a maximal coupling of the two proposal distributions. If the two proposed parameters θ′ and θ̃′ are identical, we sample a single likelihood estimator p̂(y|θ′) ∼ g_{θ′}(·) and use it in the acceptance step of both chains. Otherwise, we sample two estimators, p̂(y|θ′) ∼ g_{θ′}(·) and p̂(y|θ̃′) ∼ g_{θ̃′}(·). Denoting the two states of the chains at step n ≥ 1 by (θ_n, p̂(y|θ_n)) and (θ̃_{n−1}, p̂(y|θ̃_{n−1})), Algorithm 3 describes how to obtain (θ_{n+1}, p̂(y|θ_{n+1})) and (θ̃_n, p̂(y|θ̃_n)), thereby defining a kernel P̃.
In step 2 of Algorithm 3 the two likelihood estimators p̂(y|θ′) and p̂(y|θ̃′) can be generated independently, as we do below for simplicity. They can also be sampled together in a way that induces positive correlations, for instance using common random numbers and related methods [Jacob et al., 2017a]. We leave the exploration of possible gains from correlating the likelihood estimators in that step as a future avenue of research. An appealing aspect of Algorithm 3, particularly when using independent estimators in step 2, is that existing implementations of likelihood estimators can be readily used. In Section 3.2 we exploit this by demonstrating the use of controlled sequential Monte Carlo within the proposed framework. Likewise, one could explore the use of other advanced particle filters such as sequential quasi-Monte Carlo [Gerber and Chopin, 2015]. To summarize, given an existing implementation of a pseudo-marginal kernel, Algorithm 3 involves only small modifications, plus the implementation of a maximal coupling, which is itself relatively simple following e.g. Jacob et al. [2017b].

Theoretical guarantees
We provide sufficient conditions to ensure that the coupled pseudo-marginal algorithm returns unbiased estimators of finite variance with finite expected computation time. Introducing the parameterization w = p̂(y|θ, u)/p(y|θ) and writing w ∼ ḡ_θ(·) when u ∼ m_θ(·), we can rewrite the pseudo-marginal kernel as a Markov kernel in (θ, w). In this parameterization, the acceptance probability reads

α_PM((θ, w), (θ′, w′)) = min(1, [π(θ′) w′ q(θ|θ′)] / [π(θ) w q(θ′|θ)]),

and ρ_PM(θ, w) denotes the corresponding rejection probability.
Assumption 6. The target posterior density θ → π(θ) is strictly positive everywhere and continuously differentiable. Its tails are super-exponentially decaying and have regular contours, that is,

lim_{|θ|→∞} ⟨θ/|θ|, ∇ log π(θ)⟩ = −∞   and   lim sup_{|θ|→∞} ⟨θ/|θ|, ∇π(θ)/|∇π(θ)|⟩ < 0,

where |θ| denotes the Euclidean norm of θ. Moreover, the proposal distribution satisfies q(θ, A) = ∫_A q(θ′ − θ) dθ′ for a bounded, symmetric density q that is bounded away from zero on all compact sets.
Assumption 7. There exist constants a > 0 and b > 1 such that a moment condition on the weight distributions ḡ_θ, expressed in terms of a and b, holds uniformly in θ, where the essential supremum is taken with respect to the Lebesgue measure. Additionally, the family of distributions defined by the densities ḡ_θ is continuous with respect to θ in the topology of weak convergence.

Proposition 2. Under Assumptions 6 and 7, the polynomial drift condition of Assumption 5 holds for the pseudo-marginal kernel, with a drift function V and an exponent α determined by the constants a and b of Assumption 7. Additionally, the minorization conditions (3) and (4) hold for a suitable small set C.

If the assumptions of Proposition 2 are satisfied for a, b such that b − min(1, a) > 6 then, by application of Theorem 2, the coupling times exhibit the required tail bounds of Assumption 2 with α > 5/6, provided π_0 admits a density with respect to π and is supported on a compact set.

Tails of meeting times in a toy experiment
We provide numerical experiments on the tails of the meeting time τ in a toy example, to illustrate the transition from geometric to polynomial tails. The target π is a bivariate Normal distribution N(µ, I) with µ = (1, 2) ∈ R², and the initial distribution π_0 is Normal N(0, I). To emulate the pseudo-marginal setting, we pretend that we cannot evaluate θ → π(θ); instead we have access to π̂(θ, U) for all θ, where π̂(θ, U) = π(θ) × U and U is a log-Normal variable: U = exp(Ũ) with Ũ ∼ N(−σ²/2, σ²), so that E[U] = 1 and σ calibrates the amount of noise of the unbiased estimator π̂(θ, U) of π(θ). In the case σ = 0, U = 1 almost surely and we recover the standard MCMC setting.
We consider a pseudo-marginal Metropolis-Hastings algorithm with proposal distribution q(dθ′|θ), and a coupled version following Algorithm 3.
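A compact end-to-end sketch of this toy experiment is given below (our own illustration; the step size, number of replicates and the restriction to σ ∈ {0, 0.5} are arbitrary choices). It combines a maximal coupling of the Gaussian random walk proposals with a common uniform in the acceptance step, and records the meeting time of the lag-one coupled pseudo-marginal chains:

```python
import numpy as np

MU = np.array([1.0, 2.0])   # target N(MU, I); pi_0 = N(0, I); proposal sd 0.5

def log_pi_hat(theta, sigma, rng):
    # log of the unbiased estimate pi(theta) * U, with U log-Normal, mean one
    return -0.5 * float(np.sum((theta - MU) ** 2)) + rng.normal(-0.5 * sigma ** 2, sigma)

def coupled_proposals(m1, m2, rng, s=0.5):
    # maximal coupling of N(m1, s^2 I) and N(m2, s^2 I) by rejection
    logq = lambda x, m: -0.5 * float(np.sum((x - m) ** 2)) / s ** 2
    x = m1 + s * rng.standard_normal(2)
    if np.log(rng.random()) + logq(x, m1) <= logq(x, m2):
        return x, x
    while True:
        y = m2 + s * rng.standard_normal(2)
        if np.log(rng.random()) + logq(y, m2) > logq(y, m1):
            return x, y

def meeting_time(sigma, rng, max_iter=10000):
    z, zt = rng.standard_normal(2), rng.standard_normal(2)
    lz, lzt = log_pi_hat(z, sigma, rng), log_pi_hat(zt, sigma, rng)
    prop = z + 0.5 * rng.standard_normal(2)      # advance first chain one step
    lp = log_pi_hat(prop, sigma, rng)
    if np.log(rng.random()) < lp - lz:
        z, lz = prop, lp
    for n in range(1, max_iter):
        p1, p2 = coupled_proposals(z, zt, rng)
        same = bool(np.all(p1 == p2))
        lp1 = log_pi_hat(p1, sigma, rng)
        lp2 = lp1 if same else log_pi_hat(p2, sigma, rng)  # shared estimate if equal
        logu = np.log(rng.random())                        # common uniform
        accept1, accept2 = logu < lp1 - lz, logu < lp2 - lzt
        if accept1:
            z, lz = p1, lp1
        if accept2:
            zt, lzt = p2, lp2
        if same and accept1 and accept2:
            return n                                       # chains have met
    return max_iter

rng = np.random.default_rng(7)
taus = {s: [meeting_time(s, rng) for _ in range(200)] for s in (0.0, 0.5)}
```

Empirical survival probabilities P(τ > n) can then be computed from `taus` and plotted on log and log-log scales, as in Figure 1; larger σ yields heavier-tailed meeting times.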
We draw R = 10,000 independent realizations of the meeting time for each σ in {0, 0.5, 1, 1.5, 2}. We then approximate the survival probabilities P(τ > n) by their empirical counterparts, for n between 1 and the 99% quantile of the meeting times for each σ. The resulting estimates of P(τ > n) are plotted against n in Figure 1a, where the y-axis is in log-scale.
First note that in the case σ = 0, log P(τ > n) seems to be linearly decreasing in n, which would correspond to P(τ > n) ≈ Kρ^n for some constants K < ∞ and ρ ∈ (0, 1). This is indeed the expected behavior in the case of geometrically ergodic Markov chains.
As σ increases, P(τ > n) decreases less rapidly as a function of n. To verify whether P(τ > n) might be bounded by Kn^{−κ}, as our theoretical considerations suggest, we plot P(τ > n) against n with both axes in log-scale in Figure 1b.

Figure 1: Survival probabilities P(τ > n) of the meeting time along n, approximated with 10,000 copies of the meeting times in the pseudo-marginal toy example of Section 2.4. Left: y-axis in log-scale and x-axis in natural scale. Right: log-scale for both axes, restricted to n > 30 in order to focus on the tails. Each line corresponds to a different value of σ, which calibrates the amount of noise in the estimators of target density evaluations.

Experiments in state space models
State space models are a popular class of time series models. These latent variable models are defined by an unobserved Markov process (X_t)_{t≥0} and an observation process (Y_t)_{t≥1}, where the observations are conditionally independent given (X_t)_{t≥0}, with

X_0 ∼ µ_θ(·), X_t | {X_{t−1} = x} ∼ f_θ(·|x), Y_t | {X_t = x} ∼ g_θ(·|x),

where θ parameterizes the distributions µ_θ, f_θ and g_θ (termed the 'initial', 'transition' and 'observation' distribution respectively). Given a realization of the observations Y_{1:T} = y_{1:T}, we are interested in performing Bayesian inference on the parameter θ, to which we assign a prior density p(θ). The posterior density of interest is thus π(θ) ∝ p(θ) p(y_{1:T}|θ), where the likelihood p(y_{1:T}|θ) is usually intractable. It is possible to obtain a non-negative unbiased estimator p̂(y_{1:T}|θ, u) of p(y_{1:T}|θ) using a particle filter, where u represents all the random variables simulated during the run of the filter. The resulting pseudo-marginal algorithm is known as the particle marginal MH algorithm (PMMH) [Andrieu et al., 2010]. This algorithm can also be easily modified to perform unbiased smoothing for state inference, providing an alternative to existing methods [Jacob et al., 2017a]. Guidelines on the selection of the number of particles in this context are provided in Middleton et al. [2018].
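A bootstrap particle filter producing the unbiased likelihood estimate takes only a few lines for the linear Gaussian model used in the next subsection; the following sketch (our own, with hypothetical parameter values) follows the standard propagate/weight/resample recursion:

```python
import numpy as np

def bootstrap_pf_loglik(y, a, sigma_x, N, rng):
    # bootstrap particle filter estimate of log p(y_{1:T} | a, sigma_x) for
    # X_0 ~ N(0,1), X_t | X_{t-1}=x ~ N(a x, sigma_x^2), Y_t | X_t=x ~ N(x, 1)
    x = rng.standard_normal(N)                           # particles for X_0
    loglik = 0.0
    for yt in y:
        x = a * x + sigma_x * rng.standard_normal(N)     # propagate
        logw = -0.5 * (yt - x) ** 2 - 0.5 * np.log(2 * np.pi)
        c = logw.max()
        w = np.exp(logw - c)
        loglik += c + np.log(w.mean())       # unbiased factor p_hat(y_t | y_{1:t-1})
        x = rng.choice(x, size=N, p=w / w.sum())         # multinomial resampling
    return loglik

# sanity check against the exact value: for T = 1, a = 0.5, sigma_x = 1,
# p(y_1) is the N(0, a^2 + sigma_x^2 + 1) = N(0, 2.25) density at y_1
rng = np.random.default_rng(3)
estimates = [bootstrap_pf_loglik([1.5], 0.5, 1.0, 512, rng) for _ in range(200)]
avg_lik = float(np.mean(np.exp(estimates)))   # averages to p(y_1) by unbiasedness
```

The estimator is unbiased on the natural (not log) scale, which is exactly the property required by the pseudo-marginal construction.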

Linear Gaussian state space model
The following experiments explore the proposed unbiased estimators in a linear Gaussian state space model, where the likelihood can be evaluated exactly. This allows a comparison between the pseudo-marginal kernels, which use bootstrap particle filters [Gordon et al., 1993] with N particles to estimate the likelihood, and the ideal kernels that use exact likelihood evaluations obtained with Kalman filters. We assume X_0 ∼ N(0, 1), X_t | {X_{t−1} = x} ∼ N(ax, σ²_X) and Y_t | {X_t = x} ∼ N(x, 1), where a and σ_X are assigned the prior distributions a ∼ U[0, 1] and σ_X ∼ Γ(2, 2).
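The ideal kernel's exact likelihood evaluations come from the Kalman filter; for this scalar model the recursion reduces to a few lines (our own sketch, matching the model stated above):

```python
import math

def kalman_loglik(y, a, sigma_x, obs_var=1.0):
    # exact log-likelihood log p(y_{1:T}) for X_0 ~ N(0, 1),
    # X_t | X_{t-1}=x ~ N(a x, sigma_x^2), Y_t | X_t=x ~ N(x, obs_var)
    m, v = 0.0, 1.0                                  # filtering moments for X_0
    loglik = 0.0
    for yt in y:
        mp, vp = a * m, a * a * v + sigma_x ** 2     # predictive moments for X_t
        s = vp + obs_var                             # innovation variance
        loglik -= 0.5 * ((yt - mp) ** 2 / s + math.log(2 * math.pi * s))
        gain = vp / s                                # Kalman gain
        m, v = mp + gain * (yt - mp), (1.0 - gain) * vp
    return loglik

# single-observation check: Y_1 ~ N(0, a^2 + sigma_x^2 + 1) = N(0, 2.25)
exact_ll = kalman_loglik([1.5], 0.5, 1.0)
```

Comparing `kalman_loglik` with the particle filter estimates quantifies the extra noise introduced by the pseudo-marginal kernels.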

Effect of the number of particles
A dataset of T = 100 observations was generated from the model with parameters a = 0.5 and σ_X = 1. We study how the meeting times and the efficiency vary as a function of N, the number of particles. We set the initial distribution to U[0, 1] over a and U[0, 5] over σ_X, and the covariance of the Gaussian random walk proposal to 0.2² I. In the following we consider a grid of values for the number of particles: N ∈ {50, 75, ..., 225}.
We estimate large quantiles of the distribution of the meeting times over 20,000 repetitions of coupled PMMH, with the results shown in Figure 2. Figure 2b shows both an improvement in inefficiency with N and that this is lower bounded by the inefficiency of the estimators obtained using coupled Metropolis-Hastings.
We also examine the inefficiency weighted by the cost of obtaining each estimator, i.e. N IF[H_{k:m}], and compare this to the inefficiency of the serial algorithm, N V_as, with the notation of Section 1. Here, V_as was estimated using the spectrum0.ar function in R's CODA package [Plummer et al., 2006], averaging over 10 runs of 500,000 iterations and discarding the first 10% as burn-in. Figure 3 shows the results of this procedure, with ±2 sample standard errors for the inefficiency estimates in both cases. Figure 3a demonstrates that, despite the lower cost of obtaining unbiased estimators for lower values of N, the initial decline in inefficiency is still significant. Figure 3b shows the same results with a focus on the optimum inefficiency in both cases. Here, the optimum is attained at N = 100 with value N V_as = 632 for the serial algorithm, and at N = 150 with N IF[H_{k:m}] = 769 for coupled PMMH. Therefore, after optimising the inefficiency in both cases, and despite the embarrassingly parallel nature of computing the unbiased estimators using coupled PMMH, the increase in inefficiency is estimated to be only 22% relative to an optimised serial algorithm.

Effect of the time horizon
We investigate the distribution of the meeting times as a function of T, with N scaling linearly with T. Such a scaling is motivated by the guarantee that the variance of the log-likelihood estimates obtained at each iteration is then asymptotically constant [Bérard et al., 2014]. For the same model as before, we consider a grid of T ∈ {100, ..., 1,000}, using a single realisation of the data. Throughout the following, we fix the proposal covariance to (2²/T) I, coinciding with the proposal covariance of Section 3.1.1 for T = 100. We consider two cases. First, we examine how the distribution of the meeting times changes for a fixed initial distribution (the distribution used previously, U[0, 1] over a and U[0, 5] over σ_X); we refer to this as Scaling 1. Second, for Scaling 2, we examine how the distribution of the meeting times changes if we also scale the initial distribution, by setting π_0 = N(µ*, (50/T) I), truncated to ensure it is dominated by the prior, where µ* denotes the true parameter values.
In both cases we compare the distribution of meeting times for N = T with that of the exact algorithm (i.e. the coupled kernel P̄ as in Algorithm 2), with likelihood evaluations performed using the Kalman filter. Figures 4a and 4b show estimates of the 80th and 99th percentiles over 1,000 repetitions for Scaling 1 and Scaling 2 respectively. Firstly, it can be seen that in all cases the meeting times for coupled PMMH are higher than those for coupled MH. Indeed, while the 80th percentiles are comparable between the two, the 99th percentile is larger for coupled PMMH, reflecting a heavier tail. Finally, of the two scalings, Scaling 2 appears to stabilise for larger values of T whereas Scaling 1 exhibits an increase with T.

Neuroscience experiment
We apply the proposed methodology to a neuroscience experiment described in Temereanca et al. [2008]. The same data and model were used to illustrate the benefits of controlled Sequential Monte Carlo (cSMC) in Heng et al. [2017].

Model, data and target distribution
The data are analysed aggregated over experiments; see Zhang et al. [2018] for an alternative analysis that avoids such aggregation. Letting Bin(·; n, p) denote the binomial distribution for n trials with success probability p, the model for neuron activation is given by X_0 ∼ N(0, 1) and, for t ≥ 1, where s(x) := (1 + exp(−x))⁻¹. We focus on the task of estimating (a, σ²_X) from the data using the proposed method. Following  we specify a uniform prior on [0, 1] for a and an inverse-Gamma prior on σ²_X with parameters (1, 0.1), where the probability density function of an inverse-Gamma with parameters (a, b) is x ↦ Γ(a)⁻¹ b^a x^{−a−1} exp(−b/x). The PMMH kernels employed below use a Gaussian random walk proposal.
The likelihood is estimated with cSMC with N = 128 particles and 3 iterations, where the exact specification is taken from the appendix of . Such cSMC runs take approximately one second on a 2015 desktop computer with a simple R implementation. Figure 5 presents the time series of observations (5a) and the estimated log-posterior density (5b), obtained on a 500 × 500 grid of parameter values with one cSMC likelihood estimate per parameter value. In Figure 5b, the upper right corner presents small black circles, generated by the contour plot function, which indicate high variance of the likelihood estimators for these parameters; thus we expect PMMH chains to struggle in that part of the space. On the other hand, the maximum likelihood estimate (MLE) is indicated by a black dot in the bottom right corner. The variance of the log-likelihood estimators is of the order of 0.2 around the MLE, so that PMMH chains are expected to perform well there, as was observed in , where the chains were initialized close to the MLE.

Standard deviation of the proposal
Here, we initialize the chains from a uniform distribution on [0, 1] 2 , and we investigate two choices of standard deviation for the random walk proposals: the one used in , that is 0.002 for a and 0.01 for σ 2 X , and another choice equal to 0.01 for a and 0.05 for σ 2 X , i.e. five times larger. For each choice, we can run pairs of chains until they meet and record the meeting time; we can do so on P processors in parallel (e.g. hundreds), and for a certain duration (e.g. a few hours). Thus the number of meeting times produced by each processor is a random variable. Following Glynn and Heidelberger [1990], if no meeting time was produced by a processor within the time budget, the computation continues until one meeting time is produced, otherwise on-going calculations are interrupted when the budget is reached. This allows unbiased estimation of functions of the meeting time on each processor via Corollary 7 of Glynn and Heidelberger [1990], and then we can average across processors. In particular we use this strategy to produce all histograms in the present section, as in Figure 6.
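The budgeted collection of meeting times described above can be sketched as follows. This is a minimal illustration with a toy stand-in for the coupled run; the function names and the geometric meeting-time model are ours, not from the paper:

```python
import random

def collect_meeting_times(run_coupled_pair, budget, rng):
    """Collect meeting times on one processor under a time budget, following
    Glynn and Heidelberger [1990]: if the budget expires before any meeting
    time has been produced, the current run is completed anyway; otherwise
    work past the deadline is discarded."""
    times, elapsed = [], 0.0
    while True:
        tau, cost = run_coupled_pair(rng)   # meeting time and its compute cost
        elapsed += cost
        if elapsed <= budget or not times:
            times.append(tau)
        if elapsed > budget:
            return times

# Toy stand-in for one coupled PMMH run: geometric meeting time with a
# cost proportional to it (both entirely hypothetical).
def toy_pair(rng):
    tau = 1
    while rng.random() > 0.05:
        tau += 1
    return tau, 0.01 * tau

rng = random.Random(1)
per_processor = [collect_meeting_times(toy_pair, budget=1.0, rng=rng)
                 for _ in range(100)]
```

Averaging functions of these meeting times across processors then yields the unbiased estimates used for the histograms.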
We observe that the meeting times are significantly larger when using the smaller standard deviation (Figure 6a). The acceptance rate is of 6% for the chain using the smaller standard deviation, compared to 39% for the other chain, shown in Figure 7a. Therefore the use of a larger proposal standard deviation seems to have a very noticeable effect here on the ability of the Markov chain to escape a region of high variance of the likelihood estimator.

Comparison with PMMH using bootstrap particle filters
We use the larger choice of standard deviation (0.01 on a and 0.05 on σ²_X) hereafter, and compare meeting times obtained with cSMC with those obtained with bootstrap particle filters (BPF), with N = 4,096 particles. This number is chosen so that the compute times are comparable. Over 23 hours of compute time, the number of meeting times obtained per processor varied between 4 and 35, and a total of 7,776 meeting times were obtained from 400 processors. The meeting times are plotted against the duration it took to produce them in Figure 8a. The compute time associated with meeting times is not only proportional to the meeting times themselves, but also varies across processors. This is partly due to hardware heterogeneity across processors, and to concurrent tasks being executed on the cluster during our experiments. The histogram in Figure 8b shows that meeting times are larger, and heavier tailed, than when using cSMC (see Figure 6b). The maximum observed value is 9,371. From these plots, we see that to produce unbiased estimators H k:m using BPF with a similar variance as when using cSMC, we would have to choose much larger values of k and m, and thus the cost per estimator would likely be much higher.
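For reference, a generic bootstrap particle filter log-likelihood estimator of the kind used inside PMMH can be sketched as follows. This is a minimal illustration on a hypothetical linear-Gaussian model, not the cSMC specification used in the experiments; all names are ours:

```python
import math, random

def bpf_loglik(y, N, sample_x0, sample_step, log_obs, rng):
    """Bootstrap particle filter estimate of the log-likelihood of a state
    space model; exponentiating it gives a non-negative unbiased estimate
    of the likelihood, as required by PMMH."""
    xs = [sample_x0(rng) for _ in range(N)]
    ll = 0.0
    for y_t in y:
        logw = [log_obs(y_t, x) for x in xs]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        ll += m + math.log(sum(w) / N)        # log of the average weight
        tot = sum(w)                          # multinomial resampling,
        xs = [sample_step(xs[_pick(w, tot, rng)], rng)  # then propagation
              for _ in range(N)]
    return ll

def _pick(w, tot, rng):
    u, c = rng.random() * tot, 0.0
    for i, wi in enumerate(w):
        c += wi
        if u <= c:
            return i
    return len(w) - 1

# Hypothetical linear-Gaussian AR(1) model, so the sketch is self-contained:
# X_t = 0.9 X_{t-1} + N(0, 1), Y_t = X_t + N(0, 1).
rng = random.Random(1)
sample_x0 = lambda rng: rng.gauss(0, 1)
sample_step = lambda x, rng: 0.9 * x + rng.gauss(0, 1)
log_obs = lambda y_t, x: -0.5 * (y_t - x) ** 2 - 0.5 * math.log(2 * math.pi)

x, y = sample_x0(rng), []
for _ in range(50):
    y.append(x + rng.gauss(0, 1))
    x = sample_step(x, rng)
ll = bpf_loglik(y, 256, sample_x0, sample_step, log_obs, rng)
```

The variance of `ll` as a function of the number of particles N is what drives the comparison between BPF and cSMC above.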

Efficiency compared to the serial algorithm
Using cSMC and the larger choice of standard deviation for the proposal, we produce unbiased estimators H k:m with k = 1,000 and m = 10,000. We ran 100 processors with a time budget of 23 hours; each processor produced between 2 and 9 estimators, for a total of 578 estimators. The generation of estimators on each processor is represented chronologically in Figure 9a. The variation among durations is due to the randomness of meeting times and also to external factors such as concurrent tasks being executed on the cluster. We produce histograms of the posterior marginals in Figure 9b, with the result of a long run of PMMH with cSMC (250,000 iterations) overlaid in red lines.
We compute the loss of efficiency incurred by debiasing the PMMH chain with the proposed estimators. We consider the test function h : x ↦ x_1 + x_2 + x_1² + x_2². Along the PMMH chain of length n_mcmc = 250,000, after a burn-in of n_burnin = 10,000 steps, and using the spectrum0 function of the CODA package, we find the asymptotic variance associated with h to be V as = 7.53 · 10⁻³. If we measure computing cost in terms of MCMC iterations, we obtain an inefficiency of n_mcmc × V as /(n_mcmc − n_burnin) ≈ 7.84 · 10⁻³. With the unbiased estimators H k:m, if the cost of an estimator is 2τ + max(1, m + 1 − τ), then the average cost per processor is 59,860. The empirical variance of the unbiased estimators obtained per processor is equal to 1.4 · 10⁻⁷, thus we obtain an inefficiency of 59,860 × 1.4 · 10⁻⁷ ≈ 8.4 · 10⁻³. This inefficiency is slightly above 7.84 · 10⁻³ but the difference is relatively small.

Figure 9: Left: start and end time for the calculation of unbiased estimators on 100 parallel processors, for a budget of 23 hours (dashed line), with cSMC, N = 128 particles, I = 3 iterations and k = 1,000, m = 10,000. Right: estimated histograms of both parameters a and σ²_X; the red lines correspond to estimates obtained with 250,000 iterations of PMMH, discarding the first 10,000 as burn-in; this is for the neuroscience experiment of Section 3.2.
Next, we parameterize cost in terms of time (in seconds) instead of the number of MCMC steps. This accounts for the fact that running jobs on a cluster involves heterogeneous hardware and concurrent tasks. The serial PMMH algorithm was run on a desktop computer for 169,952 seconds, and thus the inefficiency can be measured as 169,952 × V as /(n_mcmc − n_burnin) ≈ 5.3 · 10⁻³. Note that each iteration took less than a second on average, because parameter values proposed outside the support of the prior were rejected before running a particle filter; on the other hand, the cost of a cSMC run is above one second on average. For the proposed estimators, the budget was set to 23 hours and we obtained a variance across processors of 1.4 · 10⁻⁷; thus we can compute the inefficiency as 1.16 · 10⁻², which is approximately twice the inefficiency of the serial algorithm: a significant loss of efficiency, but one that can be justified in a parallel computing environment.
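The two inefficiency computations used above (cost per retained MCMC sample times asymptotic variance, versus average cost times variance of the unbiased estimators) can be reproduced from the quoted numbers:

```python
# Reproducing the two inefficiency computations from the quoted numbers;
# "inefficiency" is cost multiplied by variance, so lower is better.

n_mcmc, n_burnin = 250_000, 10_000
V_as = 7.53e-3                     # asymptotic variance of the serial chain
serial_ineff = n_mcmc * V_as / (n_mcmc - n_burnin)

avg_cost = 59_860                  # average of 2*tau + max(1, m + 1 - tau)
var_per_processor = 1.4e-7         # variance of per-processor averages
unbiased_ineff = avg_cost * var_per_processor

print(round(serial_ineff, 5), round(unbiased_ineff, 5))
```

The same arithmetic with cost measured in seconds (169,952 s serial, 82,800 s budget) gives the clock-time comparison.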

Methodological extensions
The following provides two further examples of coupled MCMC algorithms to perform inference when the likelihood function is intractable. The associated estimators are not covered by our theoretical results.

Block pseudo-marginal method
Block pseudo-marginal methods have demonstrated significant computational savings over standard pseudo-marginal methods for Bayesian inference in random effects models [Tran et al., 2016]. Such methods proceed by introducing strong positive correlation between the current likelihood estimate p̂(y|θ) and the likelihood estimate p̂(y|θ′) at the proposed parameter, by modifying only a subset of the auxiliary variables used to obtain the likelihood estimate at each iteration. We demonstrate here the computational benefits of such a scheme for obtaining unbiased estimators of posterior expectations.
We focus here on random effects models: for t = 1, ..., T, a latent variable X_t ∼ f_θ(·) and an observation Y_t | X_t ∼ g_θ(·|X_t). In this context, the likelihood of the data y = (y_1, ..., y_T) is of the form p(y|θ) = ∏_{t=1}^T p(y_t|θ) where p(y_t|θ) = ∫ f_θ(dx) g_θ(y_t|x), and the likelihood estimate is given by p̂(y|θ, U) = ∏_{t=1}^T p̂(y_t|θ, U_t), where {p̂(y_t|θ, U_t)}_{t=1,...,T} are T independent non-negative unbiased estimates of {p(y_t|θ)}_{t=1,...,T} with U_t ∼ m_t(·). In the following, we provide a minor modification of the blocking strategy proposed in Tran et al. [2016]: instead of jointly proposing a new parameter and a single block of auxiliary random variables, a parameter update is performed first, followed by sequentially iterating through the auxiliary random variables used to construct the likelihood estimate of each observation. For each t, new values are proposed according to U′_t ∼ m_t(·) and accepted with probability min{1, p̂(y_t|θ, U′_t)/p̂(y_t|θ, U_t)}. As remarked in Tran et al. [2016], such blocking strategies are generally not applicable to particle filter inference in state space models, where the likelihood estimate for observation t typically depends on all auxiliary random variables generated up to and including time t. We provide pseudo-code for the proposed blocking strategy in Algorithm 4, denoting by U_{t,n} the set of auxiliary variables U_t at iteration n.

Algorithm 4 Sampling from the block pseudo-marginal kernel given θ_{n−1}, (U_{t,n−1})_{t≥1}
1. Sample θ′ ∼ q(·|θ_{n−1}) and compute p̂(y_t | θ′, U_{t,n−1}) for t = 1, ..., T.
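A minimal sketch of the block pseudo-marginal update described above, on a toy random effects model with a flat prior and a symmetric random walk; the toy model, the number of samples per estimate, and all names are ours:

```python
import math, random

def normpdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def lik_hat(y_t, theta, u):
    # Unbiased estimate of p(y_t|theta) = \int N(x; theta, 1) N(y_t; x, 1) dx,
    # using importance samples x_j = theta + u_j with u_j ~ N(0, 1).
    return sum(normpdf(y_t, theta + u_j, 1.0) for u_j in u) / len(u)

def block_pm_step(theta, U, y, rng, step=0.5, N=5):
    # 1. Parameter update, reusing the current auxiliary variables U.
    theta_prop = theta + step * rng.gauss(0, 1)
    num = math.prod(lik_hat(y[t], theta_prop, U[t]) for t in range(len(y)))
    den = math.prod(lik_hat(y[t], theta, U[t]) for t in range(len(y)))
    if rng.random() < min(1.0, num / den):   # flat prior, symmetric proposal
        theta = theta_prop
    # 2. Refresh the auxiliary variables one block (one observation) at a time.
    for t in range(len(y)):
        u_prop = [rng.gauss(0, 1) for _ in range(N)]
        ratio = lik_hat(y[t], theta, u_prop) / lik_hat(y[t], theta, U[t])
        if rng.random() < min(1.0, ratio):
            U[t] = u_prop
    return theta, U

rng = random.Random(0)
y = [0.3, -1.1, 0.8]
theta = 0.0
U = [[rng.gauss(0, 1) for _ in range(5)] for _ in y]
for _ in range(200):
    theta, U = block_pm_step(theta, U, y, rng)
```

Only one block of auxiliary variables changes per refresh move, which is what induces the strong positive correlation between successive likelihood estimates.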

Bayesian multivariate probit regression
The following demonstrates the proposed algorithm for a latent variable model applied to real data and explores the possible gains when compared to the unbiased estimators obtained using the coupled pseudo-marginal algorithm.
The data consist of polling data collected between February 2014 and June 2017 as part of the British Election Study [Fieldhouse et al., 2018]. We use a multivariate probit model, which for i ∈ {1, ..., T} and j ∈ {1, 2, 3} can be expressed as X_ij = β⊤ζ_ij + ε_ij and Y_ij = 1[X_ij > 0] for an observed binary response Y_ij and latent state X_ij, where i indexes the participant, j indexes the wave of questions, β is a vector of regression coefficients (including an intercept) and ζ_ij is a vector of independent variables.
We use a random sample of T = 2,000 participants over three waves (one a year) in the run-up to the United Kingdom's European Union membership referendum of 23rd June 2016, regressing the binary outcome asking participants how they would vote in an EU referendum against how they perceive the general economic situation in the UK to have changed over the previous 12 months (graded 1-5, with 1 = 'Got a lot worse', 5 = 'Got a lot better').

Sampling from the coupled block pseudo-marginal kernel given (θ_n, (U_{t,n})_{t≥1}, θ̃_{n−1}, (Ũ_{t,n−1})_{t≥1})
1. Sample (θ′, θ̃′) from the maximal coupling of q(·|θ_n) and q(·|θ̃_{n−1}).
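The maximal coupling of the two random walk proposals used in step 1 can be sampled with the standard rejection construction; a minimal sketch for two Gaussian proposals with common variance (function name ours):

```python
import math, random

def maximal_coupling_normals(mu1, mu2, sd, rng):
    """Sample (X, Y) with X ~ N(mu1, sd^2) and Y ~ N(mu2, sd^2),
    maximising P(X = Y), via the usual rejection construction."""
    p = lambda x, m: math.exp(-0.5 * ((x - m) / sd) ** 2)  # unnormalised density
    x = mu1 + sd * rng.gauss(0, 1)
    if rng.random() * p(x, mu1) <= p(x, mu2):
        return x, x                    # coupled: the two proposals meet
    while True:                        # otherwise sample Y from the residual
        y = mu2 + sd * rng.gauss(0, 1)
        if rng.random() * p(y, mu2) > p(y, mu1):
            return x, y

rng = random.Random(42)
draws = [maximal_coupling_normals(0.0, 0.5, 1.0, rng) for _ in range(2000)]
meet_freq = sum(x == y for x, y in draws) / len(draws)
# P(X = Y) equals the overlap of the two densities, here about 0.80.
```

When both chains are at the same parameter value, the proposals coincide with probability one, which is what allows the coupled chains to meet exactly.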
A detailed description of the data is provided in the appendix A.5.
We allow for correlations between waves by modelling the perturbations (ε_i1, ε_i2, ε_i3) ∼ N(0, Σ_ρ) with a generic correlation matrix Σ_ρ. In total, we have five unknown parameters θ = (β_1, β_2, ρ_{2,1}, ρ_{3,1}, ρ_{3,2}), with β_1 denoting a regressor coefficient, β_2 a constant offset and ρ_{s,t} element (s, t) of Σ_ρ. We place independent priors on each parameter, with β_1, β_2 i.i.d.

Inference For each observation y_i := (y_i1, y_i2, y_i3), we obtain unbiased estimates of the likelihood of θ using the sequential importance sampling algorithm of Geweke, Hajivassiliou and Keane; see, e.g., Train [2009, Section 5.6.3] and references therein. We set the initial distribution π_0 = N(μ̂, 0.01² I), supported only on areas of positive mass under the prior (we employ a simple rejection sampling algorithm to sample Σ_ρ initially), and use a normal random walk proposal with covariance set to 2.38²/5 · Σ̂, where μ̂ and Σ̂ are empirical estimates of the posterior mean and covariance from a preliminary run of 10,000 iterations of block pseudo-marginal with N = 40, discarding the first 10% as burn-in.
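A minimal sketch of the Geweke-Hajivassiliou-Keane (GHK) estimator for a single multivariate probit observation, under hypothetical parameter values; the function names, toy numbers and numerical guard are ours:

```python
import math, random
from statistics import NormalDist

nd = NormalDist()

def ghk_estimate(y, mean, L, R, rng):
    """GHK importance-sampling estimate of P(Y = y) for one multivariate
    probit observation: X = mean + L @ eta with eta ~ N(0, I) i.i.d.,
    Y_j = 1[X_j > 0], and L the lower Cholesky factor of Sigma."""
    total = 0.0
    for _ in range(R):
        w, eta = 1.0, []
        for j in range(len(y)):
            m = mean[j] + sum(L[j][k] * eta[k] for k in range(j))
            t = -m / L[j][j]              # truncation threshold for eta_j
            u = max(rng.random(), 1e-12)  # crude guard against inv_cdf(0)
            if y[j] == 1:                 # need X_j > 0, i.e. eta_j > t
                prob = 1.0 - nd.cdf(t)
                eta.append(nd.inv_cdf(nd.cdf(t) + u * prob))
            else:                         # need X_j <= 0, i.e. eta_j <= t
                prob = nd.cdf(t)
                eta.append(nd.inv_cdf(u * prob))
            w *= prob                     # probability of the truncation region
        total += w
    return total / R

rng = random.Random(3)
mean = [0.2, -0.1, 0.4]                  # hypothetical values of beta' zeta_ij
L = [[1.0, 0.0, 0.0],                    # Cholesky factor of an equicorrelated
     [0.5, 0.8660, 0.0],                 # correlation matrix with rho = 0.5
     [0.5, 0.2887, 0.8165]]
est = ghk_estimate([1, 0, 1], mean, L, R=500, rng=rng)
```

The product of such per-observation estimates, one per participant, gives the non-negative unbiased likelihood estimate used by the pseudo-marginal kernels.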
We compare coupled block pseudo-marginal with coupled pseudo-marginal. For the latter, we examine values of N close to the optimal value for the serial algorithm, estimated by ensuring that the variance of the log-likelihood estimates is between 1 and 2, as per the guidance in . In this case we consider N ∈ {600, 700, 800}, giving a corresponding variance of the log-likelihood estimates of {1.67, 1.40, 1.25} (estimated using 10,000 likelihood estimates at μ̂). For the block pseudo-marginal algorithm we consider N ∈ {5, 10, 20}.
Meeting times Both algorithms were run continuously until coupling for half an hour each on a 48-CPU Intel Xeon 2.4GHz E5-4657L server, with the number of estimators produced varying between 2,000 and 4,000 for block pseudo-marginal over the values of N considered, and between 160 and 180 for pseudo-marginal. The meeting times are plotted in Figure 10a, where it can be seen that the meeting times are comparable despite the lower cost of the block pseudo-marginal algorithm. Accordingly, we also plot the distribution of meeting times accounting for the cost of running each algorithm, i.e. N τ for the pseudo-marginal algorithm and 2N τ for the block pseudo-marginal algorithm. The additional factor of 2 for the latter can be seen as an upper bound on the additional computational cost of the block pseudo-marginal algorithm, assuming twice the density evaluations per complete iteration and less than twice the number of pseudo-random numbers generated. Figure 10b shows the results of this additional cost-weighting, where it can be seen that cost-weighted meeting times are between 1 and 2 orders of magnitude larger for the pseudo-marginal than for the block pseudo-marginal algorithm.

Variance of estimators
We estimate the increase in inefficiency of the coupled over the serial algorithm for N = 10, k = 500 and m = 5,000; the choice of k is guided by the meeting times in Figure 10a. Running coupled block pseudo-marginal 200 times, we estimate the variance using the test function h : x ↦ Σ_i (x_i + x_i²) to be 1.05 · 10⁻⁵. Estimating the expected cost 2τ + max(1, m + 1 − τ) to be 5,121 implies an inefficiency of 5.36 · 10⁻². In comparison, we estimate the inefficiency of the serial algorithm using spectrum0.ar as before, on runs of length 125,000 (discarding 10% as burn-in and averaging over 20 estimators), to be n_mcmc × V as /(n_mcmc − n_burnin) = 4.82 · 10⁻², suggesting an increase in inefficiency of only 11% for the unbiased estimators.
Finally, we compare the inefficiency of unbiased estimators generated with coupled block pseudo-marginal kernels with those produced using standard coupled pseudo-marginal kernels with N_PM = 700 particles. For coupled pseudo-marginal, the variance of the unbiased estimator was estimated to be 1.53 · 10⁻⁵ and the expected cost was estimated to be 5,147, implying an inefficiency of 7.86 · 10⁻². As a result, we estimate the improvement in inefficiency for the coupled block pseudo-marginal, namely (N_PM/(2N)) × (7.86 · 10⁻²/5.36 · 10⁻²), to be approximately 51-fold. Estimation of the asymptotic variance of the serial pseudo-marginal algorithm was computationally infeasible for this many particles, with a single iteration taking on average six seconds on the aforementioned server; hence the choice of N was motivated by the guidance in  instead.
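The quoted improvement factor follows directly from the reported quantities:

```python
# Cost-weighted improvement of coupled block pseudo-marginal over coupled
# pseudo-marginal: (N_PM / (2 N)) * (ineff_PM / ineff_block).
N_PM, N = 700, 10
factor = (N_PM / (2 * N)) * (7.86e-2 / 5.36e-2)
print(round(factor, 1))  # prints 51.3
```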

Exchange algorithm
Problems where the likelihood function is known only up to a constant of proportionality occur frequently in Bayesian statistics; see, e.g., Park and Haran [2018] for a recent account of current methodology and applications.
In this case, the posterior distribution is given by π(θ) ∝ p(y|θ)p(θ) with p(y|θ) = f(y|θ)/Z(θ), where f(y|θ) can be evaluated pointwise but its parameter-dependent normalizing constant Z(θ) is intractable. This scenario is common for undirected graphical models and spatial point processes [Møller et al., 2006, Murray et al., 2006]. The exchange method detailed in Algorithm 6 is an MCMC scheme proposed by Murray et al. [2006] to sample from such distributions under the assumption that, although Z(θ) cannot be evaluated, it is possible to simulate artificial observations exactly from p(·|θ). This is indeed possible for a large class of spatial point processes, as well as the Ising and Potts models, using perfect simulation procedures.
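A minimal sketch of one exchange update: the normalising constants cancel in the acceptance ratio because an artificial data set is simulated exactly at the proposed parameter. The toy exponential-family model below has a tractable Z(θ) and is used only so that the sketch runs end to end; all names are ours:

```python
import math, random

def exchange_step(theta, y, f_log, sample_data, log_prior, rng, step=0.5):
    """One step of the exchange algorithm (Murray et al., 2006). The
    intractable normalising constants Z(theta) and Z(theta') cancel in the
    ratio because y' is simulated exactly from p(.|theta')."""
    theta_prop = theta + step * rng.gauss(0, 1)   # symmetric random walk
    y_prop = sample_data(theta_prop, rng)         # perfect simulation step
    log_ratio = (f_log(y, theta_prop) + f_log(y_prop, theta)
                 + log_prior(theta_prop)
                 - f_log(y, theta) - f_log(y_prop, theta_prop)
                 - log_prior(theta))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return theta_prop
    return theta

# Toy model with independent binary sites, f(y|theta) = exp(theta * sum(y));
# Z(theta) is tractable here, which is what makes exact simulation trivial.
n = 20
f_log = lambda y, th: th * sum(y)
sample_data = lambda th, rng: [int(rng.random() < 1.0 / (1.0 + math.exp(-th)))
                               for _ in range(n)]
log_prior = lambda th: -0.5 * th * th            # N(0, 1) prior on theta

rng = random.Random(7)
y = sample_data(1.0, rng)                         # data generated at theta = 1
theta, chain = 0.0, []
for _ in range(2000):
    theta = exchange_step(theta, y, f_log, sample_data, log_prior, rng)
    chain.append(theta)
```

For the Ising model below, the `sample_data` step is replaced by coupling from the past, which produces an exact draw from p(·|θ′).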

Coupled exchange algorithm
An algorithm coupling two exchange algorithms to construct unbiased estimators H k:m is provided in Algorithm 7. Denoting the two states of the chains at step n ≥ 1 by θ_n and θ̃_{n−1}, Algorithm 7 describes how to obtain θ_{n+1} and θ̃_n; thus it describes a kernel P̄.

High temperature Ising model
We examine the proposed algorithm for inference in a planar-lattice Ising model without an external field. The model comprises observations y_i ∈ {−1, +1} on an L × L square lattice such that p(y|θ) ∝ exp(β Σ_{i∼j} y_i y_j), where i ∼ j denotes the neighbours j of node i and θ = β denotes the inverse temperature. We restrict interest to high temperature models, specifying a prior distribution β ∼ U[0, β_c], with β_c = (1/2) log(1 + √2) denoting the critical inverse temperature of the Ising model on the infinite lattice [Ullrich, 2013, Onsager, 1944]. Here, perfect simulation can be performed using the coupling-from-the-past techniques with simple heat bath dynamics developed by Propp and Wilson [1996]. We generate observations for L = 80 and set the proposal covariance to 10⁻⁴ I, initialising the chains from the prior. We obtain estimates of the distribution of meeting times using 1,000 repetitions of coupled exchange, with the results shown in Figure 11a. Based on these, we obtain unbiased estimates of the expectation of β under the posterior distribution using k = 100 and m = 10k over 1,000 repetitions. The clock time to obtain a single estimator (on the same machine) varies significantly due to the variable computational cost of coupling from the past, which depends on θ. We plot a histogram of the clock times to obtain each H k:m in Figure 11b.
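The heat bath dynamics underlying the perfect sampler can be sketched as follows; monotone coupling from the past would run this same sweep from the all-(+1) and all-(−1) configurations with shared uniforms until they coalesce. What is shown here is a plain equilibrium run with small illustrative values of L and β, not a full CFTP implementation:

```python
import math, random

def heat_bath_sweep(spins, beta, L, rng):
    """One systematic-scan heat bath sweep for the L x L Ising model with
    free boundaries, p(y|beta) propto exp(beta * sum_{i~j} y_i y_j)."""
    for i in range(L):
        for j in range(L):
            s = 0                                 # sum of neighbouring spins
            if i > 0:     s += spins[i - 1][j]
            if i < L - 1: s += spins[i + 1][j]
            if j > 0:     s += spins[i][j - 1]
            if j < L - 1: s += spins[i][j + 1]
            p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * s))  # P(y_ij = +1 | rest)
            spins[i][j] = 1 if rng.random() < p_up else -1
    return spins

rng = random.Random(0)
L, beta = 16, 0.2                  # beta well below the critical value 0.4407
spins = [[1] * L for _ in range(L)]
for _ in range(200):
    heat_bath_sweep(spins, beta, L, rng)
magnetisation = sum(map(sum, spins)) / L ** 2
# At high temperature the magnetisation fluctuates around zero.
```

The monotonicity of this update in the partial order on configurations is what makes the Propp and Wilson construction practical: only the two extreme starting states need to be tracked.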
Given the heterogeneity of times to produce a single unbiased estimator, we compare the serial inefficiency with the inefficiency of coupled exchange based on the clock time to obtain a given variance, with the test function h : x ↦ x. We estimate the asymptotic variance with n_mcmc = 200,000 iterations of the original algorithm (discarding the first 10% as burn-in, and using spectrum0.ar as before) to be V as ≈ 4.22 · 10⁻⁴, the algorithm taking 41,095 seconds in total. As a result, we estimate the serial inefficiency in terms of clock time to be 41,095 × V as /(n_mcmc − n_burnin) ≈ 9.6 · 10⁻⁵. Comparatively, the mean time to return a single estimator H k:m was estimated to be 546 seconds, and the variance of a single H k:m to be 4.78 · 10⁻⁷, giving an estimated inefficiency of 2.6 · 10⁻⁴ and implying an increase in inefficiency by a factor of approximately 3.

Conclusion
Markov chain Monte Carlo algorithms designed for scenarios where the target density function is intractable can be coupled and utilized in the framework of Glynn and Rhee [2014], Jacob et al. [2017b]. The validity of the resulting unbiased estimators can be related to polynomial drift conditions on the underlying Markov kernels.
These estimators open new ways of using parallel computing hardware to perform numerical integration in such scenarios.
In the context of state space models, the present contribution can be combined with the unbiased estimators obtained with couplings of conditional particle filters in Jacob et al. [2017a] (see Lee et al. [2018] for methodological and theoretical developments). This would enable unbiased smoothing under parameter uncertainty, instead of fixing the parameters as in Jacob et al. [2017a].
of Science, Research Computing Group at Harvard University. Pierre E. Jacob acknowledges support from the National Science Foundation through grant DMS-1712872.

A.2 Proof of Theorem 2
We have the following bivariate drift inequality. A similar statement is provided in Andrieu et al. [2015, Lemma 11].

Lemma 1. Let P̄ be a coupling of the Markov kernel P with itself, and let V be as in Assumption 5. Then the function for all (z, z̃) ∈ Z × Z, where b̄ :

Proof. For (z, z̃) ∉ C̄ we have Since (z, z̃) ∉ C̄, at least one of z, z̃ is not in C, and φ ∘ V ≥ 0, so where we used (5) in Assumption 5. By two applications of the mean value theorem, for any t ≥ s ≥ 1 there exist r ∈ [t, t + s − 1] and r* ∈ [1, s] such that By concavity, since t ≥ s implies that r ≥ r*, it follows that φ′(r) ≤ φ′(r*) and thus or equivalently Therefore, with t = max{V(z), V(z̃)} and s = min{V(z), V(z̃)} we get For (z, z̃) ∈ C̄ we get, by Assumption 5, Combining (20), (21), (19) and the fact that φ ≥ 0, we have for any (z, z̃)

Proposition 3. Suppose that Assumptions 4 and 5 hold with the same set C. Assume that the chains are initialized from some measure π_0 such that π_0(V) < ∞. Then there exists a positive constant K < ∞ such that

To prove Proposition 3 we will make use of Douc et al. [2004, Proposition 2.1], which we provide below for the reader's convenience, noting that the exact statement is taken from Andrieu et al. [2015, Proposition 4]. We borrow the following definitions from . For any non-decreasing concave function ψ : [1, ∞) → (0, ∞),

Proposition 4 (Proposition 2.1 from Douc et al. [2004]). Assume that P is a Markov kernel such that for some function V ≥ 1 we have where ψ : [1, ∞) → (0, ∞) is a non-decreasing concave function. Let r_ψ and H_ψ be defined as in (23). Then we have

Equipped with the above results we proceed to the proof of Proposition 3.
From Assumption 5 it follows that C̄ is a small set, hence petite, and thus the assumptions of Douc et al. [2004, Proposition 2.5] are satisfied. Following the proof of Douc et al. [2004, Proposition 2.5], specifically the steps leading up to Douc et al. [2004, Equation (2.6)], for any accessible set B, that is any B such that π_D(B) > 0, we have E_{z,z̃}[ Σ_{k=0}^{τ_B − 1} r(k) ] ≤ Ṽ_0(z, z̃) + c_1 c_2.
In the above, E_{z,z̃} denotes expectation with respect to the probability measure under which the joint chain (Z_n, Z̃_{n−1}) is initialized at (z, z̃) and evolves according to the transition kernel P̄, τ_B := inf{n ≥ 1 : Z̄_n ∈ B}, and c_1, c_2 are positive constants depending on the set B and on the various constants in the drift condition, but not on (z, z̃).
Since we have assumed that the chain is π_D-irreducible, the diagonal D is an accessible set, and thus for B = D, τ_D = τ is the meeting time. From the definition of φ(y) we have that where we recall that c denotes a generic constant whose value may change from line to line. Thus for any N From the definition of Ṽ_0 and the assumption π_0(V) < ∞ it follows that E_{π_0⊗π_0}[τ_D^{1/(1−α)}] < ∞. An application of Markov's inequality completes the proof.

A.3 Proof of Proposition 1
To fix notation, for any measurable functions W : Z → [1, ∞), g : Z → R, and any finite signed measure µ on X, we write Our starting point is Assumption 5, which we restate here: for some function V : Z → [1, ∞), some α ∈ (0, 1), constants b_V, d > 0 and a small set C. As before, we assume that (Z_n, Z̃_{n−1}) evolves according to P̄, and that marginally the components Z_n and Z̃_n evolve according to P. Note that we write E for expectation with the chains started from π_0, and E_π for expectation with the chains initialized at π.
Since by assumption |h|_{V^{α+γ−1}} < ∞, we have for π-almost all z. Therefore the function is well-defined and satisfies |g|_{V^γ} < ∞ and π(g²) < ∞, where the second property follows from π(V^{4γ}) < ∞. In particular it follows that g − Pg = h − π(h), and therefore g is the solution to the Poisson equation associated with P and h. We continue with the calculation in the proof of Jacob et al. [2017b, Proposition 3.3]. Let where (b_t)_{t≥0} is an arbitrary bounded sequence. Writing Z̄_t := (Z_t, Z̃_{t−1}), ḡ(x, y) := g(x) − g(y), and P̄ for the transition kernel of Z̄_t, we then have where we used the fact that, by construction of P̄, we have P̄ḡ(z, z̃) = Pg(z) − Pg(z̃).