Mixing of Metropolis-Adjusted Markov Chains via Couplings: The High Acceptance Regime

We present a coupling framework to upper bound the total variation mixing time of various Metropolis-adjusted, gradient-based Markov kernels in the 'high acceptance regime'. The approach uses a localization argument to boost local mixing of the underlying unadjusted kernel to mixing of the adjusted kernel when the acceptance rate is suitably high. As an application, mixing time guarantees are developed for a non-reversible, adjusted Markov chain based on the kinetic Langevin diffusion, where little is currently understood.


Introduction
A nearly universal ingredient of gradient-based Markov chain Monte Carlo (MCMC) kernels is the time discretization of measure-preserving SDEs or PDMPs such as the kinetic Langevin diffusion and Andersen dynamics [72,28,60,37,8,42,26,52,6]. These kernels are gradient-based in the sense that they incorporate and rely on evaluations of the gradient of the log-density of the target distribution. In practice, the asymptotic bias due to time discretization is either incurred (leading to unadjusted kernels) or eliminated by a Metropolis-Hastings filter (leading to adjusted kernels). In either case, a question that is both fundamental mathematically and crucial to applications is [29,63,27,57,82,68,30]: Starting from a distribution ν, how many steps n ∈ N are sufficient for the n-step distribution of the Markov chain to be an ε-accurate approximation of the stationary distribution in total variation? The smallest such number of steps is the so-called ε-mixing time of the Markov chain from the initial distribution ν.
Recently, there has been significant progress in quantifying the mixing time of unadjusted, gradient-based kernels including the unadjusted Langevin algorithm [35,23,34,43], unadjusted HMC [13,10,67], and various unadjusted chains based on the kinetic Langevin diffusion [20,24,66,67]; see [31] for a unified and comprehensive treatment of unadjusted MCMC methods. These works give explicit upper bounds on the mixing time and complexity, which reveal that the time step size required to adequately resolve the asymptotic bias depends substantially on the accuracy ε. This potentially costly dependence motivates Metropolis adjustment, which eliminates the asymptotic bias by employing a Metropolis-Hastings filter. Intuitively speaking, it ensures that the proportion of steps the adjusted chain spends in a given region equals the measure of that region with respect to the stationary distribution [62,49,29,81,4,27,2]. As a consequence, though, the adjusted chain involves a complex interplay between the transition step of the unadjusted kernel and the stationary distribution; to quote Billera & Diaconis [2001], "for many people ... the Metropolis-Hastings algorithm seems like a magic trick. It is hard to see where it comes from or why it works." Needless to say, the mixing time analysis of adjusted kernels is mathematically more delicate than that of unadjusted kernels.
Intrinsically capturing the interplay described above, the notion of conductance has played a significant role in quantifying the mixing time of adjusted kernels. Classical conductance arguments are commonly used to identify bottlenecks, which yield mixing time lower bounds [54,53,22]. For adjusted kernels that are in addition reversible, conductance arguments can be adapted to obtain mixing time upper bounds; see, e.g., [22,84,19] for MALA and HMC. While these works make mild assumptions on the stationary distribution (e.g. isoperimetric inequalities) and often yield sharp mixing time upper bounds, a warm start assumption is inevitable. In particular, these mixing time upper bounds typically depend logarithmically on the L∞-norm of the relative density of the initial to the stationary distribution; see [18,53] for progress towards double-logarithmic dependence and [1] for sampling from a warm start. Apart from conductance arguments, current mathematical tools are limited in their ability to obtain quantitative mixing time upper bounds for non-reversible adjusted kernels, even from warm starting distributions.
As a step towards filling the gap in capability outlined above, in this work we introduce a new coupling framework to obtain mixing time guarantees for Metropolis-adjusted, gradient-based Markov chains. Let ε > 0 be the desired total variation (TV) accuracy. The underlying idea is to fix an epoch E > 0 of steps such that two copies of the unadjusted chain given by the kernel π_u starting from different initial conditions x and x̃ meet with probability at least 1 − (3e)^{−1} after E steps, i.e.,

∥δ_x π_u^E − δ_{x̃} π_u^E∥_TV ≤ (3e)^{−1} ,  (1)

where we used the coupling characterization of the TV distance ∥⋅∥_TV. A standard way to ensure (1) is to use a contractive coupling for E − 1 steps, followed by a one-shot coupling [71,59,42,66,10]. The time step size is then tuned such that the probability of a rejection occurring in this epoch is at most 2(3e)^{−1}, and crucially, this tuning is at most logarithmic in 1/ε. Hence, after one epoch, the adjusted kernel π satisfies

∥δ_x π^E − δ_{x̃} π^E∥_TV ≤ (3e)^{−1} + 2(3e)^{−1} = e^{−1} .  (2)

Therefore, after ⌈log(1/ε)⌉ epochs, it immediately follows that there exists a coupling of the adjusted kernel which meets with probability at least 1 − ε.
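The bookkeeping behind the epoch iteration can be checked in a few lines (a sketch; the constants (3e)^{−1} and 2(3e)^{−1} are those fixed above, and the function name is ours):

```python
import math

def epochs_needed(eps):
    """Number of epochs k = ceil(log(1/eps)) in the iterated coupling bound."""
    return math.ceil(math.log(1.0 / eps))

for eps in (1e-2, 1e-4, 1e-8):
    # Per epoch the coupled adjusted chains fail to meet with probability at
    # most 1/(3e) (unadjusted copies fail to meet) + 2/(3e) (a rejection
    # occurs), i.e. 1/e in total, so k epochs leave a mismatch probability
    # of at most e^{-k}.
    k = epochs_needed(eps)
    assert math.exp(-k) <= eps
```

This is exactly why the number of epochs, and hence the tuning of the step size, scales only logarithmically in 1/ε.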
Iterating the epochs is an important step in this new coupling approach; without this iteration, as Monmarché noted in [66], the aforementioned proof fails to capture a logarithmic scaling of the mixing time with respect to 1/ε. Stated precisely in Theorem 1, the main result of this paper provides a broadly applicable coupling framework to obtain mixing time upper bounds for Metropolis-adjusted, gradient-based Markov chains without imposing restrictive assumptions on either the stationary or the starting distribution. In essence, the theorem uses a localization argument to boost local mixing of the unadjusted kernel to mixing of the adjusted kernel when the Metropolis filter intervenes over each epoch with sufficiently low probability, i.e., in the high acceptance regime: a notion that is made precise in §2.1. The low acceptance regime, allowing for more frequent rejection, falls beyond the scope of this work. As a nontrivial application of Theorem 1, in §3 we develop mixing time upper bounds for a non-reversible, adjusted Markov chain based on the kinetic Langevin diffusion.

Complementary Literature
Here we briefly highlight some complementary literature on related but different probabilistic techniques for mixing time analysis. In recent years, there has been progress in developing couplings for a variety of Metropolis-adjusted, gradient-based chains whose stationary distributions display high-dimensionality and/or non-logconcavity. In particular, dimension-free upper bounds in Wasserstein distance have been developed for a variant of MALA suitable for perturbations of Gaussian measures in high dimensions [38]. Moreover, a coupling of adjusted HMC that is contractive in non-logconcave settings was introduced in [12]; this coupling offers flexibility for extensions/applications [50,11,13]. For MALA and related Markov chains, coupling-based contractivity results are also available in distances that interpolate between L¹-Wasserstein and TV [42]. Moreover, a variety of couplings tailored to Metropolis-Hastings kernels, including maximal couplings, have recently been proposed for MCMC convergence analysis in high dimensions [50,51,83,70]. In addition, there is a considerable and growing body of work devoted to the Harris ergodic theorem, which is a very powerful tool for verifying geometric ergodicity of Markov chains [63,64,75,74,46,36]; for a simple and elegant proof see [47]. Over the years there have been many successful applications of this tool including [61,73,60,80,7,45,15,32,58,36], to cite just a few. There have also been significant advances in refining the Harris ergodic theorem to obtain more explicit quantitative bounds under more easily verifiable conditions [48,41,25,33,85].

Main Result
Let (Ω, A, P) be a probability space and let S be a Polish state space with metric d and Borel σ-algebra B. Denote by P(S) the set of probability distributions on (S, B). Let µ ∈ P(S). A standard way to construct a gradient-based, ergodic Markov chain with stationary distribution µ is to first construct a µ-preserving, ergodic Markov chain with transition kernel π_exact from the exact flow of a µ-preserving SDE or PDMP. Both for theoretical purposes and for implementability in applications, it can be desirable to replace the exact flow in π_exact by an approximate flow based on time-discretization, which yields an unadjusted Markov transition kernel π_u. However, this unadjusted kernel has the significant drawback that µπ_u ≠ µ. Resolving the resulting asymptotic bias in applications can be infeasible. Metropolis adjustment provides a tool for correcting the stationary distribution and produces an adjusted transition kernel π satisfying µπ = µ. More precisely, we consider transition steps X ∼ π(x, ⋅) that for ω ∈ Ω are of the general form

X = 1_{A(x)}(ω) Φ(ω, x) + 1_{A(x)^c}(ω) Ψ(ω, x) ,  (3)

where Φ, Ψ ∶ Ω × S → S are product measurable and such that Φ(⋅, x) ∼ π_u(x, ⋅) and Ψ(⋅, x) ∼ π_r(x, ⋅) for all x ∈ S, where π_r, like π_u, is a probability kernel on (S, B). Hereafter, we omit ω from the notation, writing Φ(⋅, x) = Φ(x) and Ψ(⋅, x) = Ψ(x). The indicator function of the event A(x) = {U ≤ α(x, Φ(x))} ⊆ Ω with an independent U ∼ Unif(0, 1) indicates that the proposal Φ(x) is accepted. Otherwise the proposal is rejected, in which case the chain is allowed to move according to Ψ.
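To make the general form concrete, here is a minimal sketch of such a transition step (the names `phi`, `psi`, `alpha` are ours, not the paper's), instantiated as random-walk Metropolis for a standard Gaussian target, for which the rejection kernel Ψ is the identity:

```python
import math
import random

def adjusted_step(x, phi, psi, alpha):
    """One transition of an adjusted kernel: propose y = Phi(x), accept on
    the event A(x) = {U <= alpha(x, y)}, otherwise move according to the
    rejection kernel Psi."""
    y = phi(x)
    if random.random() <= alpha(x, y):
        return y          # proposal accepted
    return psi(x)         # proposal rejected

# Example: random-walk Metropolis targeting mu = N(0, 1).
random.seed(0)
phi = lambda x: x + random.gauss(0.0, 1.0)                # proposal Phi
psi = lambda x: x                                         # stay put on rejection
alpha = lambda x, y: min(1.0, math.exp((x * x - y * y) / 2))

xs, x = [], 0.0
for _ in range(50_000):
    x = adjusted_step(x, phi, psi, alpha)
    xs.append(x)
mean = sum(xs) / len(xs)
var = sum(s * s for s in xs) / len(xs)
```

The empirical mean and variance of the chain then approximate those of the target, illustrating that the filter corrects the stationary distribution of the proposal mechanism.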

The High Acceptance Regime
We now introduce the high acceptance regime, in which acceptance occurs sufficiently often that the adjusted kernel inherits mixing properties of the exact kernel via the unadjusted kernel. The TV-mixing time of π_exact started in the distribution η ∈ P(S) to a specified accuracy δ > 0 is defined by

τ_mix^exact(δ, η) = inf{ n ≥ 0 ∶ ∥η π_exact^n − µ∥_TV ≤ δ } .  (4)

The high acceptance regime is characterized by the acceptance rate being suitably controlled over a time scale set by the mixing time of the exact kernel, which, in turn, will yield a mixing time upper bound for the adjusted kernel by comparison.
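The TV-mixing time just defined can be made concrete on a toy finite-state chain (a simplification of ours; the paper works on a general Polish space):

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * float(np.abs(p - q).sum())

def mixing_time(P, mu, eta, delta):
    """Smallest n with TV(eta P^n, mu) <= delta, mirroring the definition
    of the mixing time for a transition kernel P started in eta."""
    dist, n = eta.copy(), 0
    while tv(dist, mu) > delta:
        dist, n = dist @ P, n + 1
    return n

P = np.array([[0.9, 0.1], [0.2, 0.8]])   # two-state transition kernel
mu = np.array([2 / 3, 1 / 3])            # stationary distribution: mu P = mu
eta = np.array([0.0, 1.0])               # cold start in the second state
```

Here `eta` plays the role of a cold start: it is a point mass, not a warm-start density with respect to `mu`.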
Definition 1. On a collection C ⊆ P(S) such that {ηπ ∶ η ∈ C} ⊆ C, π is in the high acceptance regime if (5) holds. A key feature of Definition 1 is that the restrictiveness of the condition (5) strongly depends on the choice of C: the larger the collection C, the more restrictive (5) becomes. In one extreme, C = {µ}, the adjusted kernel is always in the high acceptance regime since the left hand side of (5) trivially vanishes. This work is concerned with the other extreme: cold start distributions corresponding to C including distributions which may not even be absolutely continuous with respect to µ. This feature of the definition is what motivates formulating the high acceptance regime in terms of π_exact.

Mixing in the High Acceptance Regime
Assumption A.S, stated below, is geared towards the high acceptance regime defined in Definition 1 with C including cold start distributions. Under Assumption A.S, Theorem 1 gives mixing time upper bounds for the adjusted kernel. To better understand Assumption A.S, a brief description is provided.
The possibility of cold start distributions motivates using pointwise acceptance probability bounds for the adjusted chain.However, since such bounds often degenerate at infinity, A.S (iv) is introduced to localize the adjusted chain to a bounded domain D ⊆ S with sufficiently high probability.By association, the underlying unadjusted chain is similarly localized to D.
In this domain, and intuitively speaking, A.S (i) and (ii) require that the underlying unadjusted kernel admits a locally successful coupling.More precisely, A.S (i) assumes there exists a coupling for π u that is locally contractive in D; and A.S (ii) assumes there exists a local one-shot coupling for π u in D.
Although stated in a slightly different way, the main idea underlying A.S (iii) is (5). Indeed, the epoch E of transition steps appearing in (iii) is defined in such a way that, by (i) and (ii), there exists a coupling of two copies of the unadjusted chain starting at two different initial conditions within D that induces meeting with probability at least 1 − (3e)^{−1}; therefore, this epoch E is analogous to sup_{η∈C} τ_mix^exact((3e)^{−1}, η) in (5). Denote by ∆ the diagonal in the product space S × S. The couplings appearing in Assumption A.S are all assumed to be faithful. Recall that a coupling Π is faithful if Π((x, x), ∆) = 1 for all x ∈ S. Couplings of the adjusted kernel inherit this property from couplings of the unadjusted kernel if a synchronous coupling of the underlying uniform random variables in the Metropolis filter is used.
Similarly to (4), define the TV-mixing time of the adjusted kernel with initial distribution η ∈ P(S) and accuracy δ > 0 to be

τ_mix(δ, η) = inf{ n ≥ 0 ∶ ∥ηπ^n − µ∥_TV ≤ δ } .  (6)

We are now prepared to state Assumption A.S and then immediately afterwards the main result of the paper, followed by its proof.
Assumption A.S. Let ε > 0 be the accuracy, ν ∈ P(S) be the initial distribution, and D ⊆ S be a domain such that diam_d(D) ≤ R for some R > 0.
Regarding the unadjusted transition kernel, we require:

(i) There exists ρ > 0 and for all x, x̃ ∈ D a coupling Π_Contr^u((x, x̃), ⋅) of π_u(x, ⋅) and π_u(x̃, ⋅) such that the contractivity

E[ d(X^u, X̃^u) ] ≤ (1 − ρ) d(x, x̃)

holds for (X^u, X̃^u) ∼ Π_Contr^u((x, x̃), ⋅).

(ii) There exists C_Reg > 0 and for all x, x̃ ∈ D a coupling Π_Reg^u((x, x̃), ⋅) of π_u(x, ⋅) and π_u(x̃, ⋅) satisfying the regularization

Π_Reg^u((x, x̃), ∆^c) ≤ C_Reg d(x, x̃) .

Regarding the adjusted transition kernel, we require:

(iii) Set the length of an epoch of transition steps at E = m + 1 with m = ⌈ρ^{−1} log(3e C_Reg R)⌉, and presume

sup_{x∈D} E[ 1 − α(x, Φ(x)) ] ≤ (3e(m + 1))^{−1} .

(iv) To reduce to the local properties fixed hitherto, we require control of the exit probability from D over the total number of transition steps consisting of sufficiently many epochs to conclude mixing to ε accuracy. Therefore let T = inf{k ≥ 0 ∶ X_k ∉ D} and presume

P( T ≤ ⌈log(2/ε)⌉ (m + 1) ) ≤ ε/4

both for X_0 ∼ ν and X_0 ∼ µ.
Theorem 1. Suppose Assumption A.S holds for ε > 0 and ν ∈ P(S). Then

τ_mix(ε, ν) ≤ H = ⌈log(2/ε)⌉ E .

Proof. On the same probability space, consider two copies of the adjusted chain X_n ∼ νπ^n and X̃_n ∼ µπ^n = µ, one of which is in stationarity. Denote by T, T̃ the first exit times from D of X_n and X̃_n respectively, and set T∗ = min(T, T̃). It is notationally convenient to introduce the epoch m + 1 = E of transition steps and the total number of epochs k = ⌈log(2/ε)⌉ that will be needed to attain ε accuracy. Thus, the total number of transitions to reach the desired accuracy will be k(m + 1) = H.
To see that k(m + 1) transition steps of the adjusted chain do indeed suffice, below we will use Assumption A.S (i)–(iii) to prove that over each epoch the following bound holds for all x, x̃ ∈ D:

P_{(x,x̃)}( X_{m+1} ≠ X̃_{m+1} ) ≤ e^{−1} ,  (7)

where P_{(x,x̃)} is the distribution conditioned on X_0 = x and X̃_0 = x̃. Iterating (7) k times will then yield the desired TV-convergence to ε-accuracy. Indeed, by the coupling characterization of the TV-distance, note that the TV-distance to stationarity after k(m + 1) transition steps satisfies

∥νπ^{k(m+1)} − µ∥_TV ≤ P( X_{k(m+1)} ≠ X̃_{k(m+1)}, min(T, T̃) > k(m + 1) ) + P( min(T, T̃) ≤ k(m + 1) ) .  (8)

The second term in (8) describes the probability that at least one copy exits D within k(m + 1) transition steps, and by Assumption A.S (iv) satisfies

P( min(T, T̃) ≤ k(m + 1) ) ≤ P( T ≤ k(m + 1) ) + P( T̃ ≤ k(m + 1) ) ≤ ε/2 .

On the other hand, in the first term in (8) neither chain exits D. Denote by F_n the σ-algebra generated by both copies up to transition step n. Now, by (7) and the Markov property, it holds that

P( X_{k(m+1)} ≠ X̃_{k(m+1)}, min(T, T̃) > k(m + 1) )
≤ e^{−1} P( X_{(k−1)(m+1)} ≠ X̃_{(k−1)(m+1)}, min(T, T̃) > (k − 1)(m + 1) ) ≤ ⋯ ≤ e^{−k} ≤ ε/2 ,

where we used {X_{k(m+1)} ≠ X̃_{k(m+1)}} ⊆ {X_{(k−1)(m+1)} ≠ X̃_{(k−1)(m+1)}} in the first equation, which holds by faithfulness, and the choice of k in the last. Since the TV-distance to stationarity ∥νπ^{k(m+1)} − µ∥_TV is non-increasing, this shows that k(m + 1) transition steps of the adjusted chain suffice for ε accuracy.
We are left to show (7) by using Assumption A.S (i)–(iii). Let x, x̃ ∈ D. Denote the accept events in the (n+1)-th transition, i.e. from X_n to X_{n+1} and X̃_n to X̃_{n+1}, by A_{n+1} and Ã_{n+1} respectively. Let X_n^u and X̃_n^u be the corresponding copies of the underlying unadjusted chain and note that X_n = X_n^u on ⋂_{l=0}^{n−1} A_{l+1}. Considering just one epoch consisting of m + 1 transition steps, (iii) allows us to restrict to the case that the Metropolis filter does not intervene over the epoch, so that the probability that there exists a coupling of the adjusted chains which induces meeting is determined by the corresponding probability for the underlying unadjusted chains. More precisely,

P_{(x,x̃)}( X_{m+1} ≠ X̃_{m+1} ) ≤ P_{(x,x̃)}( X_{m+1}^u ≠ X̃_{m+1}^u ) + P_{(x,x̃)}( ⋃_{l=0}^{m} (A_{l+1}^c ∪ Ã_{l+1}^c) ) ,

with the second term bounded by 2(m + 1) sup_{x∈D} E[1 − α(x, Φ(x))] ≤ 2(3e)^{−1}. For the first term, we employ m steps of the contractive coupling in (i), which brings the two copies of the unadjusted chain sufficiently close together for one step of the regularizing coupling in (ii) to induce exact meeting. This yields

P_{(x,x̃)}( X_{m+1}^u ≠ X̃_{m+1}^u ) ≤ C_Reg E[ d(X_m^u, X̃_m^u) ] ≤ C_Reg (1 − ρ)^m d(x, x̃) ≤ C_Reg (1 − ρ)^m R ≤ (3e)^{−1} ,

where in the last two steps diam(D) ≤ R and the definition of m were used respectively.
Remark 1 (Scope of Coupling Framework). A remarkable feature of the coupling framework presented in this section is that it uses localization to boost local mixing of the unadjusted kernel to mixing of the adjusted kernel. This feature is enabled by Assumption A.S (iv), which localizes the entire coupling argument to the domain D. In particular, the assumption that the unadjusted kernel admits a locally contractive coupling and a local one-shot coupling (i.e., Assumption A.S (i) and (ii)) does not impose global restrictions, such as regularity or convexity, on the stationary distribution. Therefore, this new coupling framework is broadly applicable including, in particular, to stationary distributions whose log-density is not globally gradient- or Hessian-Lipschitz, or not globally concave.

Application to a non-reversible, adjusted Markov chain
Although there are numerous non-asymptotic convergence results for kinetic Langevin diffusions [20,21,24,39,17] and their unadjusted discretizations [20,21,24,79,66], quantitative mixing time guarantees for adjusted discretizations are comparatively scarce. In view of this underdevelopment, and as an application of Theorem 1, mixing time guarantees for a non-reversible, adjusted Markov chain based on a discretization of the kinetic Langevin diffusion are given in Theorem 2 of this section.

Metropolis-adjusted Kinetic Langevin Algorithm (MAKLA)
Consider an absolutely continuous probability distribution on R^d of the form µ_target(dx) ∝ e^{−U(x)} dx. Here we analyze the mixing of an MCMC method aimed at µ_target based on the kinetic Langevin diffusion

dX_t = V_t dt , dV_t = −∇U(X_t) dt − γ V_t dt + √(2γ) dB_t ,  (9)

where B_t is a standard d-dimensional Brownian motion and γ > 0 is the friction. Let I_d be the d × d identity matrix. A key property of (9) is that it leaves invariant the probability measure on phase space z = (x, v) ∈ R^{2d} with energy H(x, v) = U(x) + |v|²/2, namely

µ(dz) ∝ e^{−H(z)} dz .  (10)

A variety of discretizations of (9) can be Metropolis-adjusted [78,56,9,5] and fit the framework (3). Here we focus on a symmetric Strang splitting [14,5,3], where the splitting components are given by

1. the Ornstein-Uhlenbeck (OU) flow dX_t = 0 , dV_t = −γ V_t dt + √(2γ) dB_t ,  (11)
2. the purely potential flow dX_t = 0 , dV_t = −∇U(X_t) dt ,  (12)
3. the purely kinetic flow dX_t = V_t dt , dV_t = 0 .  (13)

The corresponding discretized flows over a time step s > 0 are

1. the OU-substep O_s(x, v; ξ) = (x, e^{−γs} v + √(1 − e^{−2γs}) ξ) with ξ ∼ N(0, I_d),
2. the B-substep B_s(x, v) = (x, v − s ∇U(x)) for the kick due to the potential part,
3. the A-substep A_s(x, v) = (x + s v, v) for the drift due to the kinetic part.

Combining these flow maps in the following palindromic fashion yields the unadjusted kinetic Langevin algorithm (UKLA) with transition step given by

(X, V) = ( O_{h/2}(⋅ ; ξ_2) ∘ A_{h/2} ∘ B_h ∘ A_{h/2} ∘ O_{h/2}(⋅ ; ξ_1) )(x, v) ,  (14)

where ξ_1, ξ_2 are i.i.d. random variables with distribution N(0, I_d). This discretization is commonly referred to as "OABAO" where each letter refers to either (11), (12) or (13). For the sequel, it is convenient to introduce

θ_h = A_{h/2} ∘ B_h ∘ A_{h/2} .  (15)

By construction, θ_h is both volume-preserving and reversible [44,8]. The transition kernel of UKLA is given by π_u = ΞΘΞ, where Ξ denotes the transition kernel of the OU-substep O_{h/2} and Θ(z, ⋅) = δ_{θ_h(z)}(⋅). Due to asymptotic bias, UKLA does not leave µ invariant, i.e., µπ_u ≠ µ. This failure is not surprising: although the OU steps leave µ invariant and θ_h is volume-preserving, the time discretization induces an energy error under θ_h, i.e., (H ∘ θ_h − H) ≢ 0, which is the root cause of the asymptotic bias.
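For concreteness, here is a minimal sketch of one UKLA transition in Python, under the assumption (ours) that each OU substep acts over time h/2 and that θ_h is the half-drift / full-kick / half-drift composition; the sanity check at the end targets a standard Gaussian:

```python
import numpy as np

def ukla_step(x, v, grad_U, h, gamma, rng):
    """One 'OABAO' transition of UKLA: OU half-step, theta_h, OU half-step."""
    eta = np.exp(-gamma * h / 2)
    sigma = np.sqrt(1.0 - eta**2)
    v = eta * v + sigma * rng.standard_normal(v.shape)   # O-substep over h/2
    x = x + (h / 2) * v                                  # A-substep (half drift)
    v = v - h * grad_U(x)                                # B-substep (full kick)
    x = x + (h / 2) * v                                  # A-substep (half drift)
    v = eta * v + sigma * rng.standard_normal(v.shape)   # O-substep over h/2
    return x, v

# Sanity check on U(x) = |x|^2/2: the chain equilibrates near the standard
# Gaussian, up to the small step size bias discussed in the text.
rng = np.random.default_rng(0)
x, v = np.zeros(2), np.zeros(2)
samples = []
for n in range(60_000):
    x, v = ukla_step(x, v, lambda y: y, h=0.1, gamma=2.0, rng=rng)
    if n >= 10_000:
        samples.append(x.copy())
S = np.array(samples)
```

The empirical variance of the position samples is close to, but not exactly, that of the target: this residual discrepancy is precisely the asymptotic bias that Metropolis adjustment removes.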
The OABAO scheme can be readily Metropolis-adjusted by simply adjusting θ_h, which is possible since θ_h is both volume-preserving and reversible [8, Prop. 5.1]; see also [81, Theorem 2]. The resulting algorithm is called the Metropolis-adjusted kinetic Langevin algorithm (MAKLA) with transition step

(X, V) = ( O_{h/2}(⋅ ; ξ_2) ∘ θ̃_h ∘ O_{h/2}(⋅ ; ξ_1) )(x, v) ,  (16)

where U ∼ Unif(0, 1) is independent of the other random variables and the state of the chain, and the Metropolis-adjusted integrator θ̃_h is defined through the mapping

θ̃_h(z) = 1_{{U ≤ α(z, θ_h(z))}} θ_h(z) + 1_{{U > α(z, θ_h(z))}} F(z) ,

where α((x, v), (x′, v′)) = min(1, e^{H(x,v) − H(x′,v′)}) is the acceptance probability, F(x, v) = (x, −v) is the velocity flip involution, and H is the energy in (10). It is easily verified that π leaves µ invariant, i.e., µπ = µ, and therefore, the x-marginal of the corresponding Markov chain can be used to sample from µ_target.
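A corresponding sketch of one MAKLA transition (function and variable names are ours; the acceptance rule min(1, e^{−∆H}) and the velocity flip on rejection follow the construction above):

```python
import numpy as np

def makla_step(x, v, U, grad_U, h, gamma, rng):
    """One MAKLA transition: OU half-step, Metropolis-adjusted theta_h with
    a velocity flip on rejection, OU half-step. Since theta_h is volume-
    preserving and reversible, this filter restores mu-invariance."""
    eta = np.exp(-gamma * h / 2)
    sigma = np.sqrt(1.0 - eta**2)
    H = lambda x, v: U(x) + 0.5 * float(v @ v)           # energy on phase space
    v = eta * v + sigma * rng.standard_normal(v.shape)   # O-substep
    xp = x + (h / 2) * v                                 # proposal: theta_h
    vp = v - h * grad_U(xp)
    xp = xp + (h / 2) * vp
    dH = H(xp, vp) - H(x, v)                             # energy error
    if np.log(rng.uniform()) <= -dH:                     # Metropolis filter
        x, v = xp, vp
    else:
        v = -v                                           # velocity flip F(x,v)=(x,-v)
    v = eta * v + sigma * rng.standard_normal(v.shape)   # O-substep
    return x, v

# Sanity check on U(x) = |x|^2/2: with the filter, mu is preserved exactly,
# so the x-marginal variance matches the unit target variance up to MC error.
rng = np.random.default_rng(1)
U = lambda y: 0.5 * float(y @ y)
grad_U = lambda y: y
x, v = np.zeros(2), np.zeros(2)
samples = []
for n in range(60_000):
    x, v = makla_step(x, v, U, grad_U, h=0.1, gamma=2.0, rng=rng)
    if n >= 10_000:
        samples.append(x.copy())
S = np.array(samples)
```

Unlike the unadjusted sketch, no step size bias remains here; only Monte Carlo error separates the empirical variance from 1.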

Assumptions & Additional Notation
For simplicity, we focus on strongly log-concave target distributions with gradient Lipschitz log-densities having bounded third derivatives. More precisely, we fix the following assumptions:

Assumption A.1. Suppose U is K-strongly convex, i.e., there exists K > 0 such that

⟨∇U(x) − ∇U(y), x − y⟩ ≥ K |x − y|² for all x, y ∈ R^d .

Assumption A.2. Suppose U has a global minimum at 0, U(0) = 0, and U is L-gradient Lipschitz continuous, i.e., there exists L > 0 such that

|∇U(x) − ∇U(y)| ≤ L |x − y| for all x, y ∈ R^d .

Below it is sometimes convenient to write the results and conditions in terms of the condition number of the target distribution defined in the usual way by κ = L/K.

Assumption A.3. Define the third derivative of U via the trilinear product ∇³U(x)[⋅, ⋅, ⋅] and suppose there exists L_H ≥ 0 such that for all x, y ∈ R^d, it holds that

|∇³U(x)[y, y]| ≤ L_H |y|² .

Define the sets of model parameters and user-specified hyperparameters to be M = {d, K, L, L_H} and H = {ε, ν, γ, h}, respectively. Since we mainly care about the non-logarithmic dependencies of the mixing time on the underlying model parameters, and for the sake of legibility of expressions, we often suppress logarithmic dependencies on parameters in M by using the following notation: for two quantities x, y ∈ R, we write x = Õ(y) if there exists C > 0 depending at most logarithmically on any parameter in M such that x ≤ Cy. The symbol O is defined similarly except that it expresses all logarithmic dependencies.

Assumption A.4. Regarding the user-tuned hyperparameters, let 0 < ε ≤ 1/2 and suppose ν ∈ P(R^{2d}) is such that log ν(e^{H/8}) depends at most polynomially on the model parameters, i.e., there exist constants n_1, n_2, n_3, n_4 ∈ Z such that

log ν(e^{H/8}) = O( d^{n_1} K^{n_2} L^{n_3} L_H^{n_4} ) .  (18)

Further, let γ, h > 0 satisfy γ ≥ 6 L^{1/2} as well as log(1/h) = Õ(1).
Note that (18) and the last part of A.4 pose no relevant restriction, because exponential dependencies of the quantities of interest on model parameters are unrealistic.
Remark 2 (Possibilities to Relax the Assumptions). There are several ways in which the assumptions made above can be relaxed while sustaining the mixing guarantees of Theorem 2. First, the global strong convexity assumption in A.1 can be relaxed to asymptotic strong convexity by employing a more sophisticated coupling in A.S (i), as developed in [12,21,13]. However, the resulting contraction rates will depend on the underlying parameters in a more intricate way. Second, as emphasized in Remark 1, both the global gradient and Hessian Lipschitz continuity in A.2 and A.3 as well as the global convexity in A.1 can be replaced with local versions. In particular, convexity in a suitable shell suffices.

Mixing Guarantees for MAKLA
We are now in a position to state upper bounds on the mixing time of MAKLA, as defined in (6), with µ given by (10).
Theorem 2. Suppose Assumptions A.1–A.4 hold. Then there exists h̄ > 0 such that for all h ≤ h̄, the mixing time bound (19) holds. For a fixed step size h ≤ h̄, Theorem 2 guarantees that, starting in ν, τ_mix(ε, ν) transition steps of MAKLA suffice to ensure ε-accuracy in TV. The assumptions on the initial distribution are minimal. In particular, cold start distributions are covered, i.e., ν = δ_z for some z ∈ R^{2d}.
Remark 3 (Mixing Guarantee). Note that, for a suitable choice of γ and provided log ν(e^{H/8}) = Õ(κd), Theorem 2 yields a simplified mixing time bound for h = h̄. This choice of γ minimizes the mixing time upper bound while still satisfying A.4. Moreover, the assumption on ν is mild; e.g., it is satisfied by all cold starts in z ∈ R^{2d} such that H(z)/8 = log δ_z(e^{H/8}) = O(κd). To put this in perspective, note that the Gaussian measure ν = N(0, A^{−1}) ⊗ N(0, I_d) with energy H(z) = (1/2)|v|² + (1/2)|A^{1/2}x|² amounts to log ν(e^{H/8}) = d log(8/7).

Remark 4 (Dimension Dependence). Remarkably, the dimension scaling obtained in (19) is optimal in the high acceptance regime, cf. Definition 1, from a cold start distribution, as illustrated by the following example. Denote by e_1 the unit vector in the first component and consider the collection C = {δ_{(0, d^{1/2} e_1)} π^n ∶ n ≥ 0} corresponding to a cold start in (0, d^{1/2} e_1) ∈ R^{2d}. According to (5), π being in the high acceptance regime on C requires control of the rejection probability over order h^{−1} transition steps, where we used that the mixing time τ_mix^exact, cf. (4), of the transition kernel of the kinetic Langevin diffusion over time h is of order h^{−1}. Expanding (57) identifies the energy error to leading order. For U as in (21), the reject probability from the cold start in (0, d^{1/2} e_1) can hence be computed to leading order, and since the OU step to leading order in h equals the identity, (22) implies the claimed optimality of the dimension dependence.

Remark 5 (Condition Number Dependence). In Lemma 3, UKLA is shown to converge to its stationary distribution with rate ρ ∝ Kγ^{−1}h, which under A.4 is at best κ^{−1} for h^{−1} of order L^{1/2}. This rate differs from the optimal rate obtained for the kinetic Langevin diffusion under a warm start [17]. Passing to MAKLA via Theorem 1 further increases the scaling in the condition number. In (19), the additional κ^{1/2} in front of the maximum is expected for A.S (iii) to hold with ρ and the energy error bounds of Lemma 8. However, due to the linear appearance of γhH in (23) (cf. Remark 6), A.S (iii) requires the extra K^{−3/4} and K^{−1/2} inside the maximum. At present, the optimal condition number dependence for either UKLA or MAKLA from a cold start distribution is not known.
Proof of Theorem 2. To invoke Theorem 1, it suffices to verify Assumption A.S for MAKLA, which, as described below, relies on ingredients developed in §4.
Let S = R^{2d} with the metric induced by the twisted norm ∥⋅∥_tw defined in (24). Define the domain D = {E(z) ≤ R_U} for some R_U ≥ 2 to be determined momentarily, where E is the energy-like function defined in (56). In the last step we used A.4 to factor out Lγ²/K² ≥ 36κ² ≥ 1. Thus, regarding the unadjusted transition kernel,

This completes the verification of Assumption A.S (i) and (ii).
It remains to verify Assumption A.S (iii) and (iv), which concern the adjusted transition kernel. To this end, the epoch of transition steps E and the total number of transition steps H play a pivotal role. Assumption A.4 implies log(γC_Reg) = Õ(1). Since (iii) depends on R_U, which needs to be chosen sufficiently large for the exit probability bound in (iv) to hold, we first verify (iv).
• To verify Assumption A.S (iv), we invoke Lemma 9 as follows. By Lemma 8, C_∆H = 4L and k = 2 in Lemma 9. Moreover, (64) holds due to A.4. Define h_1 > 0 to saturate the bound 400LHh² ≤ 1 and let h ≤ h_1. We now select R_U to counter-saturate the bound in Lemma 9, where we inserted µ(e^{H/8}) ≤ (2κ)^{d/2}, which holds by A.1 and A.2. By Lemma 9, this choice of R_U ensures Assumption A.S (iv) to hold starting from both ν and µ. Note that the right hand side of (23) depends only logarithmically on R_U.

• Finally, we verify Assumption A.S (iii). Leveraging: (a) the higher order bound of Lemma 8; (b) bounds that each hold for all z ∈ R^{2d}; and (c) the definition of D yields a bound on the rejection probability over an epoch. Inserting E and R_U, and using that γhd = O(K^{−1}γ²d), which holds by A.4 and K ≤ L, shows that there exists h_2 > 0 such that this bound is at most (3e)^{−1} for all h ≤ h_2. This completes the verification of Assumption A.S (iii) and (iv). To finish, set h̄ = min(h_1, h_2), which by A.4 satisfies the required step size bound. Since Assumption A.S holds, Theorem 2 now follows by invoking Theorem 1.

Key Ingredients of the Proof of Theorem 2
The ingredients needed in the proof of Theorem 2 are developed in this section.
By using an elementary inequality, the twisted norm compares to the untwisted Euclidean norm on R^{2d}.

Lemma 3. Suppose that Assumptions A.1, A.2, and A.4 hold. Then, for all z = (x, v), z̃ = (x̃, ṽ) ∈ R^{2d} and a_1, a_2 ∈ R^d, it holds for the OABAO updates Z, Z̃ computed from z, z̃ with shared Gaussian vectors a_1, a_2 that

∥Z − Z̃∥_tw ≤ (1 − c h) ∥z − z̃∥_tw with c = (1/34) e^{−1/2} min(Kγ^{−1}, γ) .

Under similar assumptions, variants of Lemma 3 have been proven elsewhere in the literature; see, e.g., [66,77,55]. For the convenience of the reader, however, a complete proof is given below. As emphasized in previous works, a key ingredient in the proof of Lemma 3 is the co-coercivity property of ∇U, which in terms of the potential force F = −∇U can be written as

⟨F(x) − F(x̃), x − x̃⟩ ≤ −L^{−1} |F(x) − F(x̃)|² .  (29)

This holds if U is continuously differentiable, convex, and L-gradient Lipschitz [69, Theorem 2.1.10]. Additionally, an elementary inequality is used.

Proof. It is notationally convenient to introduce the difference ∆ considered below. By definition of the OABAO scheme in (14), the difference of the two updates can be computed explicitly. Inserting this difference into ∆ yields an expression whose cross-terms involving ⟨ζ, ω⟩ vanish by definition of α in (24). Applying Young's inequality with parameters δ_1, δ_2, δ_3 > 0, and using the co-coercivity property in (29), yields a decomposition of ∆ into terms I–IV; see (32). Below, in (34)–(36), we show that II–IV are non-positive, and hence ∆ ≤ I. In particular, choosing δ_1 = h yields (34). Choosing δ_2 = 2h and δ_3 = 2γ^{−1}, and using the definition of α and β in (24), yields (35) and (36). Combining the above, and by definition of β in (24), it follows that ∥Z − Z̃∥_tw ≤ (1 − c h) ∥z − z̃∥_tw with c = (1/34) e^{−1/2} min(Kγ^{−1}, γ), as required.
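The contraction asserted by Lemma 3 can be observed numerically with a synchronous coupling of two UKLA copies (shared Gaussians), here for the Gaussian potential U(x) = |x|²/2; measuring distances in the Euclidean rather than the twisted norm is a simplification on our part:

```python
import numpy as np

def coupled_ukla_step(z, zt, grad_U, h, gamma, rng):
    """Advance two copies of UKLA with shared Gaussian innovations
    (a synchronous coupling), returning the updated pair."""
    eta = np.exp(-gamma * h / 2)
    sigma = np.sqrt(1.0 - eta**2)
    xi1 = rng.standard_normal(z[0].shape)
    xi2 = rng.standard_normal(z[0].shape)
    out = []
    for (x, v) in (z, zt):
        v = eta * v + sigma * xi1        # shared noise: O-substep
        x = x + (h / 2) * v              # A-substep
        v = v - h * grad_U(x)            # B-substep
        x = x + (h / 2) * v              # A-substep
        v = eta * v + sigma * xi2        # shared noise: O-substep
        out.append((x, v))
    return out[0], out[1]

rng = np.random.default_rng(1)
grad_U = lambda x: x                     # U(x) = |x|^2/2, so K = L = 1
z = (rng.standard_normal(3), rng.standard_normal(3))
zt = (rng.standard_normal(3), rng.standard_normal(3))
d0 = np.linalg.norm(np.concatenate(z) - np.concatenate(zt))
for _ in range(300):
    z, zt = coupled_ukla_step(z, zt, grad_U, h=0.1, gamma=2.0, rng=rng)
d1 = np.linalg.norm(np.concatenate(z) - np.concatenate(zt))
```

Because the noise is shared, the pairwise difference evolves deterministically and shrinks geometrically, which is the mechanism exploited in A.S (i).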

Verifying A.S (ii): One-shot Coupling for UKLA
By using a one-shot coupling Π_Reg^u, cf. [71,59,42,66,10], the next lemma proves that the transition kernel of UKLA has a regularizing effect. Throughout this section, we fix z = (x, v), z̃ = (x̃, ṽ) ∈ R^{2d}. Lemma 4. Suppose Assumptions A.2, A.3, and A.4 hold. Let ξ_1, ξ_2 ∼ N(0, I_d) be independent. Then, for all z, z̃ ∈ R^{2d}, the regularization bound (40) holds. A closely related one-shot coupling result has recently been developed for an "OBABO" discretization of kinetic Langevin dynamics; see Proposition 3 and Proposition 22 of [66], and for extensions see [65,16]. Although the upper bound in (40) degenerates as h ↘ 0, this degeneration manifests only logarithmically in the mixing time results for MAKLA; see Assumption A.S (iii).
As already indicated, the following lemmas are used in the proof of Lemma 4.
Lemma 5. Let ξ ∼ N(0, Σ), where Σ is an n × n matrix, and suppose that
Proof of Lemma 5. The proof of this result is an extension of the proof of Lemma 15 in [10] to the case where the covariance matrix of the reference Gaussian measure is Σ. Since this is a small modification, the proof is omitted.
Lemma 6. Suppose Assumptions A.2, A.3, and A.4 hold. Then, for any z = (x, v), z̃ = (x̃, ṽ) ∈ R^{2d} and a_1, a_2 ∈ R^d, it holds that
Lemma 7. Suppose Assumptions A.2, A.3, and A.4 hold. Then, for any z = (x, v), z̃ = (x̃, ṽ) ∈ R^{2d},
For the proofs of Lemmas 6 and 7, it is notationally convenient to introduce shorthand notation. The following elementary inequality is used in the proofs: by A.4 and (30), and by inserting (47) from the calculation below into (45) together with the elementary inequality 1 − e^{−x} ≤ x, valid for x > −1, we obtain the required estimate.
Proof of Lemma 6. By definition of the one-shot map in (38), the first component of the update can be computed explicitly. Similarly, from (38), the same holds for the second component. Combining (47) and (48) yields the claim, as required.
Proof of Lemma 7. By definition of the one-shot map in (38), the trace tr(DΦ(a_1, a_2) − I_{2d}) can be computed blockwise, and combining (49) and (50) yields the expression estimated next. From (38), note the explicit form of ∂ã_1/∂a_1. On the one hand, by A.2, and on the other hand, by A.3, the corresponding factors can be bounded. Combining (53) and (54) yields that the spectral radius of ∂ã_1/∂a_1 − I_d does not exceed 1/2. Therefore, we can invoke Theorem 1.1 of [76] to control the trace of the matrix logarithm, where in the second to last step we used the preceding bounds. Inserting (55) into (51) gives the required result.

Verifying A.S (iii): Energy Error Estimates
The following lemma provides upper bounds for the energy error in terms of the energy-like function E ∶ R^{2d} → R defined by

E(x, v) = (1/2) |L^{−1/2} ∇U(x)|² + (1/2) |v|² .  (56)

As the isotropic Gaussian case U(x) = (L/2)|x|² suggests, the scaling in (56) is natural, since in that case: if (X, V) ∼ µ then L^{−1/2}∇U(X) = L^{1/2}X and V are both standard normally distributed.
Lemma 8. Suppose that Assumption A.2 holds and let Lh² ≤ 1. Then, the energy error ∆H = H ∘ θ_h − H with θ_h as in (15) satisfies the lower order bound stated below. If additionally Assumption A.3 holds, then the higher order bound holds as well.
Proof. For t ∈ [0, h], introduce the linear interpolation of the update. The energy error can then be written as an integral over this interpolation; see (57). Expanding the integrand, let t ≥ 0 and a ∈ R^d. To further simplify the last display, we expand ∇U around x. In particular, ∇U(x*) = ∇U(x) + I*_1 (59), with remainder terms I*_1 and I_1(t). Using the higher and lower order expansions will give the higher and lower order bound, respectively, due to more or less cancellation. For the lower order bound, inserting (59) and (61) into (58), and then inserting the resulting expression and the bounds on I*_1 and I_1(t) back into (57), shows the claimed bound, where, besides Lh² ≤ 1, we used a bound implied by (59) and that a similar bound holds for |v − (h/2)∇U(x)|. Repeating the calculation with the higher order expansions (60) and (62) gives the higher order bound, as required.
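For the one-dimensional Gaussian potential the energy error of θ_h can be evaluated in closed form, and its third-order decay in h checked directly (a sketch; here L_H = 0, and the half-drift / full-kick / half-drift form of θ_h is an assumption of ours):

```python
def energy_error(x, v, h):
    """Energy error Delta H = H(theta_h(x, v)) - H(x, v) for the quadratic
    potential U(x) = x^2/2 (so grad U(x) = x)."""
    H = lambda x, v: 0.5 * x * x + 0.5 * v * v
    x1 = x + (h / 2) * v     # A: half drift
    v1 = v - h * x1          # B: full kick
    x1 = x1 + (h / 2) * v1   # A: half drift
    return H(x1, v1) - H(x, v)

# Halving h shrinks |Delta H| by roughly 2^3 = 8, i.e. Delta H = O(h^3)
# pointwise for this smooth target.
r = energy_error(1.0, 1.0, 0.1) / energy_error(1.0, 1.0, 0.05)
```

This pointwise third-order behavior is what makes the per-epoch rejection probability controllable in the high acceptance regime.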

Verifying A.S (iv): Exit Probability Estimates for MAKLA
We now turn to the exit probability bound required in Assumption A.S (iv).
For some suitably large R_U and E as in (56), we show that the exit probability from D = {z ∈ R^{2d} ∶ E(z) ≤ R_U} is small over the total number of steps H required to attain the desired TV convergence. More precisely, let (Z_k)_{k≥0} be a copy of MAKLA started in an initial distribution ν and define the first exit time of the chain from D to be T = inf{k ≥ 0 ∶ Z_k ∉ D}. The following lemma is general in the sense that it only assumes an energy error bound satisfied by, amongst other discretizations, θ_h as in (15).
Lemma 9. Suppose Assumptions A.1 and A.2 hold and that the energy error satisfies (63) for some C_∆H > 0, k ≥ 2 and all z ∈ R^{2d}. Let h, γ > 0 be such that (64) holds. Then, for H, R_U > 0, the exit probability bound (66) holds. Although the energy error bound (63) is assumed to hold globally for simplicity, it can be relaxed to hold in a neighborhood of D; cf. Remark 1.
Remark 6 (Effect of Velocity Flip). The MAKLA transition step involves a velocity flip involution in the event of a rejection; cf. (16). This makes it tricky to construct a Foster-Lyapunov function exploiting the contractivity of the unadjusted kernel by incorporating a cross-term x ⋅ v. The function used below does not involve such a cross-term, and as a consequence, the time horizon hH enters linearly into (66). In contrast, similar bounds for the MALA transition step only require the radius to depend logarithmically on the time horizon [38, §6]. However, due to the wide availability of energy error bounds such as (63), the Foster-Lyapunov function presented here is a robust alternative.

Verifying A.S (i): Contractive Coupling for UKLA
Here we use a coupling Π_Contr^u of two copies of UKLA starting from different initial distributions to demonstrate L¹-Wasserstein contractivity with respect to a twisted metric induced by the twisted norm on R^{2d}

∥(x, v)∥²_tw ∶= α|x|² + β⟨x, v⟩ + |v|² ,  (24)

where α =