Optimal friction matrix for underdamped Langevin sampling

A systematic procedure for optimising the friction coefficient in underdamped Langevin dynamics, viewed as a sampling tool, is given by taking the gradient of the associated asymptotic variance with respect to the friction. We give an expression for this gradient in terms of the solution to an appropriate Poisson equation and show that it can be approximated by short simulations of the associated first variation (tangent) process under concavity assumptions on the log density. Our algorithm is applied to the estimation of posterior means in Bayesian inference problems; reduced variance is demonstrated, compared to the original underdamped and overdamped Langevin dynamics, in both full-gradient and stochastic-gradient cases.


Introduction
Let π be a probability measure on R^n with smooth, positive, bounded density, also denoted π, with respect to the Lebesgue measure on R^n, and let f ∈ L²(π) be an observable. In a range of applications including molecular dynamics [12,52,54] and machine learning [60,82,83], a quantity of interest is the expectation of f with respect to π, π(f) := ∫ f dπ, which is analytically intractable and is most commonly approximated numerically by Markov chain Monte Carlo (MCMC) methods, whereby π is sampled by simulating an ergodic Markov chain (X_k)_{1≤k≤N} with π as its unique invariant measure and π(f) is approximated by the empirical average (1/N) Σ_{k=1}^N f(X_k). MCMC methods enjoy central limit theorems for many of the Markov chains employed, the most well-known (class of) such methods being the Metropolis-Hastings algorithm [41,56]. Recent efforts have been devoted to MCMC methods suited to settings where n ≫ 1 and where point evaluations of π or its gradients are computationally expensive; these methods include slice sampling [25,61], Hamiltonian Monte Carlo [8,24,62], piecewise-deterministic Markov processes [10,13,81] and methods based on discretisations of continuous-time stochastic dynamics [30,54,55], together with divide-and-conquer and subsampling approaches [4]. In this paper we consider the underdamped Langevin dynamics.

Denoting by S^n_{++} the set of real symmetric n × n positive definite matrices, the underdamped Langevin dynamics with mass M ∈ S^n_{++} and friction matrix Γ ∈ S^n_{++} is given by the R^{2n}-valued solution (q_t, p_t) to

dq_t = M^{-1} p_t dt,
dp_t = −∇U(q_t) dt − Γ M^{-1} p_t dt + √2 √Γ dW_t,        (1.1)

where √Γ ∈ R^{n×n} is any matrix satisfying √Γ (√Γ)^⊤ = Γ, U : R^n → R is the associated smooth potential, or negative log density, such that π ∝ e^{−U}, and W_t denotes a standard Wiener process on R^n. The probability distribution of the underdamped Langevin dynamics converges under general assumptions to the invariant probability measure given by

π(dq, dp) = Z^{-1} e^{−U(q) − p^⊤M^{-1}p/2} dq dp        (1.2)

for a normalising constant Z, and there have been numerous recent works [19,20,28,32,42,50,58,73] on its discretisations in terms of the quality of convergence to π over time measured by (e.g.) the Wasserstein distance. In this paper, the goal is to optimise Γ ∈ S^n_{++} directly with respect to the asymptotic variance in the convergence of π_T(f) := (1/T) ∫_0^T f(q_t) dt to π(f) for any particular f (or a finite set of observables) as T → ∞.
We mention that parameter tuning in MCMC methods is a widely considered topic [2,84] (and references within). Specifically for underdamped Langevin dynamics, tuning the momentum part of π with respect to reducing metastability or computational effort was considered in [70,78,80]. The choice of friction (as a scalar) has been a subject of consideration as early as [44], then in [1,14,47,76] within the context of molecular dynamics and also in [20,27]. Most of these works make use of different measures of efficiency. The present work constitutes the first systematic gradient procedure for choosing the friction matrix in an optimal manner, with respect to an appropriate cost criterion.

Outline of approach
We proceed with a formal description of our approach; precise statements can be found in the main Theorems 3.4 and 3.5. It is known, using results from [72] and [9], that under suitable assumptions on U and f a central limit theorem

√T (π_T(f) − π(f)) → N(0, σ²)  in distribution as T → ∞        (1.3)

holds and that σ², the asymptotic variance, has the form

σ² = 2 ∫ φ (f − π(f)) dπ,        (1.4)

where φ is a solution to the Poisson equation

−Lφ = f − π(f)        (1.5)

and L denotes the infinitesimal generator associated to (1.1). Two key observations are then made. Firstly, for any direction δΓ ∈ R^{n×n} in the friction matrix, the derivative of σ² with respect to the entries of Γ in the direction δΓ, denoted dσ².δΓ, is given by the formula

dσ².δΓ = −2 ∫ (∇_p φ̄)^⊤ δΓ ∇_p φ dπ,        (1.6)

where φ̄ is given by

φ̄(q, p) = φ(q, −p).        (1.7)

A direction δΓ that guarantees a decrease in σ² is then

ΔΓ := ∫ ∇_p φ̄ ⊗ ∇_p φ dπ,        (1.8)

where ⊗ denotes the outer product. Similarly, taking δΓ to be the diagonal elements of (1.8), or δΓ = I_n ∫ ∇_p φ̄ • ∇_p φ dπ, gives in both cases a negative change in asymptotic variance, respectively for diagonal Γ and Γ of the form cI_n. The second observation is that since the solution φ to the Poisson equation (1.5) is known to be given by

φ(q, p) = ∫_0^∞ E[f(q_t) − π(f)] dt,        (1.9)

where (q_t, p_t) solves (1.1) with initial condition (q_0, p_0) = (q, p), given convexity conditions on the potential U and under suitable assumptions, we have

∇_p φ(q, p) = ∫_0^∞ E[(D_p q_t)^⊤ ∇f(q_t)] dt,        (1.10)

where D_p q_t denotes the R^{n×n}-matrix of partial derivatives of q_t with respect to the initial condition p in momentum. Not only does D_p q_t satisfy the dynamics that result from taking partial derivatives in (1.1), which are amenable to algorithmic simulation, but the process also decays to zero exponentially quickly, so that the infinite time integral (1.10) can be accurately approximated by a truncation using short simulations of D_p q_t for adaptive estimation of the direction (1.8) in Γ. This leads to an adaptive algorithm involving the selection of Γ in an appropriate constrained set, whose performance we illustrate with numerical examples.
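As a direct check that (1.8) decreases the asymptotic variance, substitute it into (1.6); writing (ΔΓ)_{ij} = ∫ ∂_{p_i}φ̄ ∂_{p_j}φ dπ,

```latex
d\sigma^2.\Delta\Gamma
  = -2 \int (\nabla_p \bar{\phi})^{\top} \, \Delta\Gamma \, \nabla_p \phi \, d\pi
  = -2 \sum_{i,j} (\Delta\Gamma)_{ij} \int \partial_{p_i}\bar{\phi}\,\partial_{p_j}\phi \, d\pi
  = -2 \sum_{i,j} (\Delta\Gamma)_{ij}^2 \;\le\; 0 .
```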
Examples where improved Γ can be found analytically are presented in Section 4. Numerical illustrations making use of (1.6) and (1.10) are presented in Section 5.2. In particular, the algorithm is applied to the problem of finding the posterior mean in a Bayesian logistic regression inference problem for two datasets with hundreds of dimensions, where the best friction matrices found in both cases are close to zero (for example, Γ = 0.1I_n performs well compared to Γ = I_n, demonstrating reduced variance of almost an order of magnitude in Tables 5.2 and 5.3).
Using the asymptotic variance for a particular observable (or a set of them) and using measures for the quality of convergence to π, or minimising an autocorrelation time as considered in [1,14,20,44,47,76], can be conflicting goals. To elaborate, in [44] the autocorrelation time was used as the point of comparison for the optimal friction in the Gaussian target measure case. For n = 1, ω, γ ∈ R, U(q) = ½ω²q², M = 1, Γ = γ > 0, the autocorrelations for (1.1) satisfy

∂_t [E(q_t q_0); E(p_t q_0)] = [[0, 1], [−ω², −γ]] [E(q_t q_0); E(p_t q_0)].        (1.11)

By considering the eigenvalues, the conclusion in [45] is that the optimal γ for minimising the magnitude of E(q_t q_0) is given by the critical damping γ = 2ω. A similar conclusion can be made when considering the spectral gap [66].
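The critical damping claim can be checked numerically (a small sketch, not from the paper; it computes the slowest decay rate of the matrix in (1.11) over a grid of γ):

```python
# Spectral check of (1.11): the decay rate of E(q_t q_0) is governed by the
# eigenvalues of [[0, 1], [-omega^2, -gamma]]; the rate is maximised at the
# critical damping gamma = 2 * omega.
import numpy as np

omega = 1.0
gammas = np.linspace(0.1, 5.0, 200)
rates = [-np.linalg.eigvals(np.array([[0.0, 1.0],
                                      [-omega**2, -g]])).real.max()
         for g in gammas]
print("best gamma:", gammas[int(np.argmax(rates))])  # ~ 2 * omega
```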
On the other hand, if f(q) = q in our setting, formally the quantity ∫ ∫_0^∞ E(q_t q_0) dt dπ(q_0) is the asymptotic variance, due to (1.4) and (1.9). Despite the similarity, Corollary 4.8 asserts that γ = 0 is optimal. A more detailed discussion of Corollary 4.8 is given in Section 4.2. This difference emphasises that, at the cost of generic convergence to π, the tuning of Γ here is directed at variance reduction for a particular observable, in this case f(q) = q. However, multiple asymptotic variances can be used for the objective function to minimise, so that Γ can be optimised with respect to several observables of interest simultaneously. Remark 5.1 describes the implementation for a linear combination of asymptotic variances at no extra cost in terms of evaluations of π or its gradients. The rest of the paper is organised as follows. In Section 2, we provide a mathematical setting in which the underdamped Langevin dynamics with a friction matrix, in particular (1.1), has a well-defined solution and satisfies the central limit theorem for suitable observables, together with notation used throughout the paper. In Section 3, prerequisite results are given and the main formulae are derived. In Section 4, cases where improved Γ can be found explicitly are presented. In Section 5, the numerical methodology and examples are given. In Section 6, deferred proofs are given. In Section 7, we conclude and discuss future work.

Setting
Let (Ω, F, P) be a complete probability space, (F_t)_{t∈R} a normal filtration (that is, satisfying the usual conditions), (W_t)_{t≥0} a standard Wiener process on R^n with respect to (F_t)_{t∈R}, and π the probability measure given by (1.2).
Assumption 1. U ∈ C^∞(R^n) satisfies U ≥ 0 and its second derivatives are bounded, |D²U| ≤ K_U, for some K_U > 0.

The existence and uniqueness of a strong solution to (1.1) is established in Theorem A.1. Due to the smoothness of U and Γ, the coefficients in (1.1) are locally Lipschitz, and well-posedness of equation (1.1) is given by [68], to which we also refer for the sense of solution. In addition, the joint measurability assumption (A.2) of [9] is satisfied.

Preliminaries and notation
The set of smooth compactly supported functions is denoted C^∞_c. The infinitesimal generator L (defined in (A.6)) associated to (1.1) is given formally by its differential operator form, denoted L, when acting on smooth functions:

L = M^{-1}p · ∇_q − (∇U + ΓM^{-1}p) · ∇_p + Γ : ∇²_p,        (2.3)

so that π (see (1.2)) is an invariant probability measure for (1.1) for a normalisation constant Z. Let L²₀(π) := {g ∈ L²(π) : ∫ g dπ = 0}, and similarly for the position marginal, also denoted π. The notation D²U will be used for the Hessian matrix of U. As in the introduction, I_n ∈ R^{n×n} denotes the identity matrix. For matrices A, |A| denotes the operator norm associated with the Euclidean norm on R^n. e_i is used to denote the i-th Euclidean basis vector. For A, B ∈ R^{n×n}, A : B := Σ_{i,j} A_{ij} B_{ij} and A_S = ½(A + A^⊤). ⟨·, ·⟩_π denotes the inner product in L²(π), and similarly for the marginal.

Semigroup bound, Poisson equation and central limit theorem
In this section, a central limit theorem for the solution to (1.1) is established; the resulting asymptotic variance will be used as the cost function with respect to which Γ is optimised. Specifically, it will be shown that under a weighted L^∞ bound on the observable f ∈ L²(π), the estimator π_T for the unique solution (q_t, p_t) to (1.1) converges to π(f) as T → ∞ such that (1.3) holds with (1.4).
It is well known that the asymptotic variance can be expressed in terms of the solution to the Poisson equation (1.5) using the Kipnis-Varadhan framework; see for example Chapter 2 in [48], Section 3.1.3 in [54], [15] and references therein. In order to show that the expression (1.9) is indeed a solution to the Poisson equation (1.5), exponential decay of the semigroup (A.5) is used. In Theorem 2.1 below, we establish convergence in law to the invariant measure for the Langevin dynamics (1.1). For this, let the Lyapunov function K_l : R^{2n} → R be given by (2.5), for all l ∈ N and constants a, b, c > 0.
Assumption 2. There exist constants β₁, β₂ > 0 and α ∈ R such that the growth conditions (2.6) and (2.7) on U hold for all q ∈ R^n and some generic constant C > 0.
Theorem 2.1. Under Assumptions 1 and 2, π is the unique invariant probability measure for the SDE (1.1) and, for all l ∈ N, there exist constants κ_l, C_l > 0 depending on l and constants a, b, c > 0 independent of l such that the solution z_t = (q_t, p_t) to (1.1) with initial condition z satisfies

|E^z[ϕ(z_t)] − π(ϕ)| ≤ C_l K_l(z) e^{−κ_l t}        (2.8)

for Lebesgue almost all initial z ∈ R^{2n}, K_l ≥ 1 given by (2.5) and all Lebesgue measurable ϕ satisfying |ϕ| ≤ K_l. Moreover, for any l ∈ N, K_l satisfies the Lyapunov-type bound (2.10) for some constants a_l, b_l > 0.
The proof is from [72], in which the setting is more general than (1.1): the friction matrix may depend on q and the drift is not necessarily conservative, i.e. the forcing term need not be the gradient of a scalar function and the fluctuation-dissipation theorem (see equation (6.2) in [66]) need not hold; of course, it applies in particular to our setting.
Remark 2.1. Inequality (2.8) holds for all initial z ∈ R^{2n}, as opposed to almost all z, given any bounded measurable ϕ. This is a consequence of combining (2.8) with the strong Feller property given by Theorem 4.2 in [23].
Proof. The measure π is invariant due to (2.4). For the rest of the statements, see Theorem 3 in [72].
The following corollary holds by taking ϕ to be indicator functions and using Remark 2.1.
Corollary 2.2. Under Assumptions 1 and 2, for all initial z ∈ R^{2n}, the transition probability p^z_t of (1.1), given by p^z_t(A) = P(z_t ∈ A) for Borel sets A ⊆ R^{2n}, satisfies

∥p^z_t − π∥_TV ≤ C_l K_l(z) e^{−κ_l t},

where ∥·∥_TV denotes the total variation norm.
The solution to the Poisson equation is given next following the direction of [15].
Theorem 2.3. Under Assumptions 1 and 2, if f ∈ L²₀(π) satisfies fK_l ∈ L^∞ for some l ∈ N, then there exists a unique solution φ ∈ L²₀(π) to the Poisson equation (1.5). Moreover, the solution is given by

φ = ∫_0^∞ P_t f dt.        (2.12)

Proof. For T > 0, let g_T := ∫_0^T P_t f dt and g_∞ := ∫_0^∞ P_t f dt. Note that g_T ∈ L²(π) for T ∈ R_+ ∪ {∞}, and g_T → g_∞ in L²(π) as T → ∞ by Theorem 2.1, specifically by (2.8) with ϕ = f and using (2.10) for 2l in place of l. Applying L, it holds that

L g_T = ∫_0^T ∂_t P_t f dt = P_T f − f,

where the exchange in the order of integration is justified by Fubini and (2.8), and the last equality follows by the strong continuity of (P_t)_{t≥0} given by Proposition A.2 in Section 6. Inequalities (2.8) and (2.10) (with 2l in place of l) also give P_T f → 0 in L²(π) as T → ∞, so that, since L is a closed operator, equations (1.5) and (2.12) hold. In addition, ∫φ dπ = 0 follows from the invariance of π, Theorem 2.1 and Fubini's theorem.
We proceed to state the central limit theorem for the solution to (1.1): under Assumptions 1 and 2 and the condition of Theorem 2.3 on f, Theorem 2.4 states that (1.3) holds with asymptotic variance

σ² = 2⟨φ, f − π(f)⟩_π,        (2.15)

where φ is the solution (2.12) to the Poisson equation.

Directional derivative of σ 2
In this section, we give a number of natural preliminary results that pave the way for the main result in Theorem 3.4, in which a formula for the derivative (1.6) of σ² with respect to Γ is provided. The proofs of Proposition 3.1, Lemma 3.2 and Theorem 3.4 are deferred to Section 6.

Preliminary results and the main formula
In order to establish the formula for the directional derivative, heavy use is made of the differential operator form (2.3) of the generator. Proposition 3.1 establishes that φ solves the Poisson equation also as a partial differential equation; its proof makes use of the Feynman-Kac representation formula for the solution to the Kolmogorov (backward) equation.

Proposition 3.1. Under Assumptions 1 and 2, if f ∈ L²₀(π) satisfies fK_l ∈ L^∞ for some l ∈ N, the solution φ given by (2.12) solves −Lφ = f in the distributional sense for L given by (2.3), hence classically if in addition f ∈ C^∞.
In order for the integral in a formula like (1.6) to be finite, control on the derivatives in p is required. This will also be used in the proof of Theorem 3.4, and it is given by the following lemma.

Lemma 3.2. Under Assumptions 1 and 2, if f ∈ L²₀(π) satisfies fK_l ∈ L^∞ for some l ∈ N, the weak derivative in p of the solution φ to −Lφ = f satisfies ∇_p φ ∈ L²(π).

The following preliminary result concerns the solution under a momentum reversal, φ̄. It turns out that φ̄ is the solution to the Poisson equation associated to the formal L²(π)-adjoint L* of L, which appears in the proof of Theorem 3.4.
Lemma 3.3. Let Assumptions 1 and 2 hold, and let f ∈ C^∞ and φ̄ ∈ L²₀(π) ∩ C^∞ be given by (1.7), where φ is a classical solution to −Lφ = f. Then φ̄ is a classical solution to the equation

−L*φ̄ = f,        (3.1)

where L* is the formal L²(π)-adjoint of L, given by

L* = −M^{-1}p · ∇_q + (∇U − ΓM^{-1}p) · ∇_p + Γ : ∇²_p.

Proof. A straightforward calculation shows that L*φ̄ is the momentum reversal of Lφ; since f depends only on q, this gives L*φ̄ = −f, i.e. (3.1).
If f is not smooth, equation (3.1) still holds in the distributional sense, since for g ∈ C^∞_c, keeping the notation (1.7) for the momentum reversal of arbitrary functions,

⟨φ̄, Lg⟩_π = ⟨φ, (Lg)̄⟩_π = ⟨φ, L*ḡ⟩_π = −⟨f, ḡ⟩_π = −⟨f, g⟩_π.

The main formula of this section, for the directional derivative of the asymptotic variance, is given next. The directional derivative of E : S^n_{++} → R at Γ in the direction δΓ is defined as lim_{ε→0} ε^{-1}(E(Γ + εδΓ) − E(Γ)) whenever the limit exists.
Theorem 3.4. Under Assumptions 1 and 2, if f ∈ L²₀(π) satisfies fK_l ∈ L^∞ for some l ∈ N and there exists ε̄ > 0 such that Γ, Γ + εδΓ ∈ S^n_{++} for 0 < ε ≤ ε̄, then the directional derivative of the asymptotic variance σ² at Γ in the direction δΓ has the form

dσ².δΓ = −2 ∫ (∇_p φ̄)^⊤ δΓ ∇_p φ dπ,        (3.2)

where φ is the solution (2.12) to the Poisson equation for the dynamics (1.1) at Γ and φ̄ is given by (1.7).
As mentioned in the introduction, by (3.2) the direction (1.8) guarantees a decrease in the asymptotic variance; similarly, the scalar change in Γ given by (1.8) with the outer product replaced by a dot product guarantees a decrease in σ².
A formula using a tangent process

In this section we show that ∇_p φ appearing in (3.2) has a more useful form. The first variation process of (1.1) is used here to calculate (1.10); this will be the main methodology used in the numerical sections. This alternative formula, given in Theorem 3.5, provides a way to avoid a finite difference Monte Carlo estimate of the derivative of an expectation. For simplicity, we set M = I_n here. The first variation process associated to (1.1), denoted by (D_p q_t, D_p p_t) ∈ R^{n×2n} for t ≥ 0, is defined as the matrix-valued solution to

d(D_p q_t) = D_p p_t dt,    d(D_p p_t) = −(D²U(q_t) D_p q_t + Γ D_p p_t) dt,        (3.3)

with the initial condition D_p q_0 = 0, D_p p_0 = I_n. By Theorem 39 of Chapter V in [69], the partial derivatives of (q_t, p_t) with respect to the initial values in p indeed satisfy (3.3), and (D_p q_t, D_p p_t) is continuous with respect to those initial values. Note that there exists a unique solution to (3.3) by Theorem 38 in the same chapter of [69].
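As an illustration of the decay that Theorem 3.5 below formalises, the coupled system (1.1) and (3.3) can be simulated directly. The following is a minimal sketch, assuming a quadratic potential U(q) = ½q^⊤Qq with Q positive definite (so that D²U ≡ Q) and a plain Euler-Maruyama discretisation rather than the splitting schemes of Section 5:

```python
# Couple (1.1) (with M = I_n) to its first variation (3.3) and observe the
# exponential decay of ||D_p q_t||; quadratic potential assumed for brevity.
import numpy as np

rng = np.random.default_rng(0)
n, dt, n_steps = 2, 1e-3, 20000
Q = np.array([[2.0, 0.3], [0.3, 1.0]])      # Hessian of U (constant here)
Gamma = np.eye(n)                            # friction matrix
sqrt2Gamma = np.linalg.cholesky(2.0 * Gamma)

q, p = rng.standard_normal(n), rng.standard_normal(n)
Dq, Dp = np.zeros((n, n)), np.eye(n)         # D_p q_0 = 0, D_p p_0 = I_n

for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal(n)
    q, p = q + dt * p, p - dt * (Q @ q + Gamma @ p) + sqrt2Gamma @ dW
    Dq, Dp = Dq + dt * Dp, Dp - dt * (Q @ Dq + Gamma @ Dp)  # no noise term

print("||D_p q_T|| =", np.linalg.norm(Dq))   # close to zero for large T
```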
We omit the notational dependence of (q t , p t ) on its initial condition (q 0 , p 0 ) = (q, p) = z whenever convenient in the following.
Theorem 3.5. Let Assumptions 1 and 2 hold. If, in addition,

• there exist Q ∈ S^n_{++} and D : R^n → R^{n×n} with D²U = Q + D, where D is small enough everywhere; in particular, a bound on |D| in terms of λ_m, λ_M and σ_min(Q) is assumed, where λ_m, λ_M > 0 are respectively the smallest and largest eigenvalues of Γ and σ_min(Q) denotes the smallest eigenvalue of Q;
• ∇f satisfies a weighted boundedness assumption, as used in the proof below;

then the weak derivative ∇_p φ has the form

∇_p φ(q, p) = ∫_0^∞ E[(D_p q_t)^⊤ ∇f(q_t)] dt,        (3.5)

where q_t solves (1.1) with initial condition (q_0, p_0) = (q, p) and D_p q_t solves (3.3), the latter satisfying

|D_p q_t| ≤ C e^{−C′t}

for some constants C, C′ > 0 independent of (q_0, p_0) and ω ∈ Ω.
The assumptions on U are made in order to ensure that the process (D_p q_t, D_p p_t) converges to zero exponentially quickly, so that the integral in (3.5) is finite. Specifically, U is assumed to be close to a particular quadratic function q ↦ q^⊤Qq; cf. [11].
Remark 3.1. Exponential decay of the first variation process is not required; only some uniform (in the initial condition (q_0, p_0)) integrability in time of D_p q_t is needed, together with a boundedness assumption on ∇f. On the other hand, Proposition 1 in [20] and Proposition 4 in [57] explore more detailed conditions under which contractivity holds and does not hold.
Proof. Let b > 0 be the constant from the assumptions above. We have the bound (3.7) for some generic constant C > 0 independent of the initial values (q_0, p_0) and ω ∈ Ω. Consequently, using the (weighted) boundedness assumption on |∇f|, for each index i the bound (3.8) holds for a generic C > 0 independent of (q_0, p_0) and ω ∈ Ω. Due to (3.8) together with Fubini's theorem, the identity (3.5) holds for finite T > 0 when tested against a test function. Using Theorem 2.1, (3.8) again and dominated convergence to take T → ∞ on both sides concludes the proof.
At any t, equation (3.5) can be used in a practical way to estimate the gradient direction (1.8). Specifically, the following estimator can be used. Given (q, p) ∈ R^{2n}, which we think of as the state of a solution (q_t, p_t) to (1.1), hence approximately distributed according to π, let δΓ be given by (3.9) for any large enough T > 0, where the trajectories (q_t^{(q,p)}, p_t^{(q,p)}) solve (1.1) with independent realisations of W_t and D_p has been used to denote the derivative with respect to the initial momentum p as above. The next result is that this estimator has finite variance. Note that for (q, p) distributed away from stationarity, the estimator cannot be expected to be an unbiased estimator of (1.8).
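In code, the estimator can be sketched as follows (hypothetical helper names: `step` advances one discretised step of (1.1) together with (3.3), and `grad_f` evaluates ∇f; the sign of the update uses ∇_pφ̄(q, p) = −∇_pφ(q, −p), so conventions should be checked against (5.9)):

```python
# Sketch of the estimator (3.9): K independent pairs of trajectories from
# (q, p) and (q, -p), each carrying its first variation process; zeta
# accumulates int grad_f(q_t)^T D_p q_t dt, approximating (1.10).
import numpy as np

def estimate_delta_gamma(q, p, grad_f, step, T, dt, K, rng):
    n = len(q)
    total = np.zeros((n, n))
    for _ in range(K):
        zetas = []
        for p0 in (p, -p):                    # pair, with momentum reversal
            qs, ps = q.copy(), p0.copy()
            Dq, Dp = np.zeros((n, n)), np.eye(n)
            zeta = np.zeros(n)
            for _ in range(int(T / dt)):
                zeta += dt * (Dq.T @ grad_f(qs))   # integrand of (1.10)
                qs, ps, Dq, Dp = step(qs, ps, Dq, Dp, rng)
            zetas.append(zeta)
        # minus sign: the reversed trajectory estimates -grad_p phi-bar
        total += -np.outer(zetas[1], zetas[0])
    return total / K
```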
Theorem 3.6. Let the assumptions of Theorem 3.5 hold. For Lebesgue almost all (q, p) ∈ R^{2n}, each entry of δΓ defined in (3.9) has finite variance.
Proof. It suffices to show that (3.9) has finite second moment, for which, by independence, it suffices to show that each element of the vector of time integrals ∫_0^T ∇f(q_t^{(q,p)})^⊤ D_p q_t^{(q,p)} dt has finite second moment. For each index i, this follows using (3.7), the (weighted) boundedness assumption on |∇f|, Theorem 2.1 and Fubini's theorem, concluding the proof.

Quadratic cases
Throughout this section, the target measure π is assumed to be Gaussian; in the mean-zero case this is given by π ∝ exp(−½ q^⊤Σ^{-1}q) for Σ ∈ S^n_{++}, in other words, the potential is quadratic, U(q) = ½ q^⊤Σ^{-1}q. For polynomial observables, we look for solutions to the Poisson equation by using a polynomial ansatz and comparing coefficients, in order to obtain an explicit expression for the asymptotic variance. The results provide benchmarks to test the performance of the algorithms that arise from using the gradient in Theorem 3.4, as well as intuition for how Γ can be improved in concrete cases. We will consider the following cases.
1. Quadratic f = ½ q^⊤U₀q, given commutativity between U₀ and Σ (Proposition 4.5); also f = ½ U₀q² + lq in one dimension (Proposition 4.6);
2. Odd polynomial f, where the asymptotic variance will be shown to decrease to zero as Γ → 0 (Proposition 4.7, Corollary 4.8 and Proposition 4.9);
3. Quartic f in one dimension, in which the situation is similar to quadratic f (Proposition 4.10).

We proceed by stating in more detail the general situation of this section. Let Σ ∈ S^n_{++}, U₀ ∈ S^n_{++} and l ∈ R^n. The Gaussian invariant measure π and the observable f are given by

π ∝ exp(−½ q^⊤Σ^{-1}q),    f(q) = ½ q^⊤U₀q + l^⊤q,        (4.1)

and the value π(f) becomes

π(f) = ½ Tr(U₀Σ).        (4.2)

The infinitesimal generator L becomes in this case

L = M^{-1}p · ∇_q − (Σ^{-1}q + ΓM^{-1}p) · ∇_p + Γ : ∇²_p.        (4.3)

Consider the natural candidate solution φ to the Poisson equation (1.5) given by the quadratic ansatz

φ(q, p) = ½ q^⊤Gq + p^⊤Eq + ½ p^⊤Hp + g^⊤q + h^⊤p        (4.4)

for some constant matrices G, E, H ∈ R^{n×n} and vectors g, h ∈ R^n; antisymmetric matrices such as A₁ ∈ R^{n×n} will appear in the expressions for these coefficients.

Quadratic observable
Similar calculations in this situation have appeared previously in Proposition 1 in [26], where explicit expressions analogous to G, E, H and for σ² are given. For our purposes of finding an optimal Γ, instead of taking these explicit expressions, we keep the unknown antisymmetric matrices (such as A₁) as they appear, and eventually use commutativity between Σ and U₀ to show that the antisymmetric matrices are zero. We continue from (4.5), (4.6) and (4.7) with finding explicit expressions for the coefficients G, E, H of φ; by Lemma 4.2 these are determined up to some antisymmetric matrices A₁, A₂.
The asymptotic variance from Theorem 2.4 can be given by a formula in terms of Σ, U 0 and the coefficients of φ.
Before substituting the expressions from Lemma 4.2 into the formula, we give the formula itself, which is adapted from the proof of Proposition 1 in [26].
Lemma 4.3. If the solution φ to the Poisson equation (1.5), for f given by (4.1), π(f) given by (4.2) and L given by (4.3), is of the form (4.4), then the asymptotic variance σ² given by (2.15) has an explicit expression in terms of Σ, U₀ and the coefficients of φ, obtained by writing out each of φ and f − π(f) and integrating against π.

From the expressions (4.8) and (4.10) for G_S and H_S respectively, it is not straightforward to check at this point that there exist antisymmetric A₁ and A₂ such that the right hand sides are indeed symmetric, which is necessary for the ansatz (4.4) for φ to be a valid solution. On the other hand, if Σ, U₀, Γ, M all commute, then the right hand sides of (4.8) and (4.10) are symmetric for A₁ = A₂ = 0 and the coefficients G and H become explicit, which allows taking derivatives of σ² with respect to the entries of Γ. Moreover, the explicit coefficients allow optimisation of M, which is given by the following proposition.
where the limit on the left hand side is in the sense made precise in the statement. Setting (4.18), by Lemma 4.2, φ given by (4.4) is the solution to the Poisson equation (1.5), and inserting G, g into (4.17) gives the claimed expression. The result follows since A : B > 0 for A, B ∈ S^n_{++}.
Proposition 4.4 solves the optimisation problem in M in the stated setting, but it highlights the corresponding discrete-time problem, since one cannot take M^{-1} → ∞ in practice. We focus on the optimisation of Γ and fix M = I_n in the following.
Proposition 4.5. Let Σ, U₀, l, M be such that f is as in (4.1), π(f) as in (4.2), L of the form (4.3) and φ the solution to the Poisson equation (1.5).
The following holds.
The minimum of σ² over Γ ∈ S_Σ, where S_Σ is the set of symmetric positive definite matrices commuting with Σ, is attained by Γ = Σ^{-1/2}.

Proof. Let Σ = P^⊤Σ_d P be the eigendecomposition of Σ for orthogonal P. Since all symmetric matrices in the set commuting with Σ share eigenvectors with Σ, it suffices to find a unique extremal point of the asymptotic variance with respect to the eigenvalues of Γ, call them (λ_i)_{1≤i≤n}, λ_i ≥ 0. Setting again (4.18), φ given by (4.4) is the solution to the Poisson equation (1.5), and the asymptotic variance σ² given by (2.15) reduces, after diagonalising with P, to a sum of functions of the form a_i λ_i^{-1} + b_i λ_i, a_i, b_i > 0, and the result follows.
In the scalar case, we can remove the restriction on l.

Proposition 4.6. If n = 1, U₀ ≠ 0, l ≠ 0, f : R → R is given by (4.1), π(f) is given by (4.2), L is of the form (4.3) and φ is the solution to the Poisson equation (1.5), then the asymptotic variance has an explicit form in γ and the minimum is attained by the friction given in (4.22).

Proof. By Lemma 4.2, the solution (4.4) to the Poisson equation (1.5) is explicit, and by Lemma 4.3 so is the asymptotic variance, which attains the stated minimum at (4.22) as claimed.

Odd polynomial observable
Another special case within (4.1) where the solution φ can be readily identified is U₀ = 0, that is, linear observables. More generally, (almost) zero variance can be attained in the following special case.
Proposition 4.7. Under Assumptions 1 and 2, for a general target measure π ∝ e^{−U} on R^n, if the observable f is of the form f(q) = α · ∇U for some constant α ∈ R^n, then

inf_{Γ ∈ S^n_{++}} σ² = 0.        (4.23)

The asymptotic variance is given by an explicit integral expression, in which dq_{−j} denotes dq_1 . . . dq_{j−1} dq_{j+1} . . . dq_n.
Corollary 4.8. Given a Gaussian target measure with density π ∝ e^{−U} on R^n, an observable f : R^n → R as in (4.1) with U₀ = 0 and l ∈ R^n (so that π(f) = 0), L of the form (2.3) and φ the solution to the Poisson equation (1.5), equation (4.23) holds.
There is some intuition for the situation in Corollary 4.8. First note that the Langevin dynamics with Γ = 0 reduce to deterministic Hamiltonian dynamics, and that this is the limiting case for the Γ attaining arbitrarily small asymptotic variance in the proof of Proposition 4.7. The result indicates that this is optimal in the linear observable, Gaussian measure case (i.e. (4.1) with U₀ = 0), and this aligns with the fact that the value (4.2) to be approximated is exactly the value of f at q = p = 0, so that the Hamiltonian dynamics starting at q = p = 0, staying there for all time, approximate the integral (4.2) with perfect accuracy. A similar idea holds when the starting point is not q = p = 0, where (4.2) is approximated exactly after any integer number of orbits in (q, p) space. As a one-dimensional check: for U(q) = ½ω²q², f(q) = q, M = 1 and Γ = γ, the ansatz φ(q, p) = aq + bp in −Lφ = f yields b = ω^{-2} and a = γω^{-2}, so that σ² = 2∫φf dπ = 2γ/ω⁴ → 0 as γ → 0, consistent with Corollary 4.8. Continuing with this idea, it seems reasonable that the same statement holds more generally for any odd observable. At least, the following holds in one dimension.
Proposition 4.9. If n = 1, k ∈ N and f : R → R is an odd finite-order polynomial observable given by (4.24), with π(f) = 0, L of the form (4.3) and φ the solution to the Poisson equation (1.5), then the asymptotic variance satisfies (4.23).
The proof of Proposition 4.9 is deferred to Section 6.

Quartic observable
The situation in the quartic observable case, at least in one dimension, is similar to the quadratic observable case.
Proposition 4.10. If n = 1 and f : R → R is the quartic observable given by

f(q) = q⁴,        (4.25)

L is of the form (4.3), M = 1 and φ is the solution to the Poisson equation (1.5), then there exists σ_quar > 0 bounding the asymptotic variance from below over γ > 0; in particular, the asymptotic variance diverges both as γ → 0 and as γ → ∞.

The proof of Proposition 4.10 can be found in Section 6.

Computation of the change in Γ
Throughout this section, the case M = I_n is considered. As mentioned, the formula (3.2) gives a natural gradient descent direction (1.8) in which to move Γ in order to optimise σ² from Theorem 2.4. In Theorem 3.4 and in the form (1.8), the expression for the gradient is already amenable to a Green-Kubo approach, in the sense that the form (2.12) for φ can be substituted in to obtain a trajectory-based formula, where finite differences approximate ∇_p and independent realisations of (q_t, p_t) are used for the expectations. However, this is too inaccurate in implementation to be useful. Instead, the more directly calculable form stated in the introduction in (1.10), involving the derivative of (q_t, p_t) with respect to the initial condition as in Section 3.2, is used. We focus the discussion on a Monte Carlo method to approximate ∇_pφ and gradient directions in Γ (e.g. (1.8)).

Methodology
Here we describe an on-the-fly procedure to repeatedly calculate the change (1.8) in Γ by simulating the first variation process in parallel to underdamped Langevin processes. The discretisation schemes used to simulate (1.1) and (3.3) are given in Section 5.1.1. Two gradient procedures, namely gradient descent and the heavy-ball method, for evolving Γ given a gradient are detailed in Section 5.1.2. Then iterates from Section 5.1.1 are used to approximate each change in Γ in Section 5.1.3 (see also Appendix C). The key idea linking the above is that if equation (3.5) holds, then (1.8) can be written as the expression (5.1), where (q_t, p_t) and (q̃_t, p̃_t) denote the solutions to (1.1) with initial values (q, p) and (q, −p) respectively, (D_p q_t, D_p p_t) and (D_p q̃_t, D_p p̃_t) denote the solutions to (5.3), with q̃_t replacing q_t for the latter, and the outer integral in (5.1) is with respect to (q, p).

Splitting
A BAOAB splitting scheme [50,51] will be used to integrate the Langevin dynamics (1.1), given explicitly by (5.2) for i ∈ N, Δt > 0, where ξ_i are independent n-dimensional standard normal random variables and Γ_i ∈ S^n_{++} is a sequence of friction matrices to be updated throughout the duration of the algorithm. We mention again recent developments, e.g. [19,20,32,58,73,75], on discretisations of the underdamped Langevin dynamics; the majority of the numerical error involved in updating Γ is expected to come from the small number of particles used in approximating the integrals in the expression (1.8) for ΔΓ, so that no further deliberation is made here about the choice of discretisation. The first variation process (5.3), together with its initial condition D_p q_0 = 0, D_p p_0 = I_n, reads

dD_p q_s = D_p p_s ds,        (5.3a)
dD_p p_s = −(D²U(q_s) D_p q_s + Γ D_p p_s) ds.        (5.3b)

In order to simulate (5.3), an analogous splitting scheme (5.4) is used. The k-th column of the first term including the Hessian of U (and similarly for the last) can be approximated by the finite difference

(D²U(q_i) Dq_i)_k ≈ (∇U(q_i + h (Dq_i)_k) − ∇U(q_i)) / h        (5.5)

for small h > 0, where (Dq_i)_k denotes the k-th column of Dq_i, so that (5.3) can still be approximated in the absence of Hessian evaluations. The approximation (5.5) will be used only when explicitly stated in the sequel.
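A sketch of one step of this scheme, together with the analogous update for the first variation and the Hessian-free approximation (5.5), is as follows (assumptions: M = I_n, unit temperature, and an exact Ornstein-Uhlenbeck flow for the O-part; the precise update (5.2) in the paper may differ in detail):

```python
# One BAOAB step for (1.1) and the matching deterministic splitting for the
# first variation process; hess_U may be the exact Hessian action or the
# finite-difference surrogate below.
import numpy as np
from scipy.linalg import expm, sqrtm

def baoab_step(q, p, Dq, Dp, grad_U, hess_U, Gamma, dt, rng):
    n = len(q)
    E = expm(-dt * Gamma)                      # exact O-flow over dt
    S = np.real(sqrtm(np.eye(n) - E @ E.T))    # corresponding noise amplitude
    p = p - 0.5 * dt * grad_U(q)               # B
    Dp = Dp - 0.5 * dt * hess_U(q) @ Dq
    q = q + 0.5 * dt * p                       # A
    Dq = Dq + 0.5 * dt * Dp
    p = E @ p + S @ rng.standard_normal(n)     # O (noise-free for (Dq, Dp))
    Dp = E @ Dp
    q = q + 0.5 * dt * p                       # A
    Dq = Dq + 0.5 * dt * Dp
    p = p - 0.5 * dt * grad_U(q)               # B
    Dp = Dp - 0.5 * dt * hess_U(q) @ Dq
    return q, p, Dq, Dp

def hessian_action_fd(grad_U, q, Dq, h=1e-5):
    # Hessian-free approximation (5.5): forward differences of grad_U along
    # the columns of Dq approximate D^2U(q) @ Dq without Hessian evaluations.
    g0 = grad_U(q)
    return np.column_stack([(grad_U(q + h * Dq[:, k]) - g0) / h
                            for k in range(Dq.shape[1])])
```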

Gradient procedure in Γ
Suppose we have available a series of proposal updates (b_0, ..., b_{L−1}) ∈ R^{n×n×L} for Γ. Given stepsizes α_i = α ∈ R and an annealing factor r ∈ R, the constrained stochastic gradient descent (5.6) can be considered (for the iterations i where proposal updates are produced), where L ∈ N and Π^μ_pd is the projection to a positive definite matrix with some minimum eigenvalue μ > 0, given by

Π^μ_pd(M) = V max(Λ, μI_n) V^⊤

for symmetric M ∈ R^{n×n} with eigenvalue decomposition M = VΛV^⊤, the maximum taken entrywise on the diagonal. Alternatively, a heavy-ball method [67,35] (with projection) can be used; the method is considered in the stochastic gradient context in [18], and is given here as (5.8). The heavy-ball method offers a smoother trajectory of Γ over the course of the algorithm. Under appropriate assumptions on b_j, in particular if b_j is an unbiased estimate of a gradient ∇σ²(Γ_{i_k}) with variance σ²_b > 0, the system (5.8) has the interpretation of an Euler discretisation of a constrained Langevin dynamics, in which case r is the inverse temperature. By increasing r, the analogous invariant distribution 'sharpens' around the maximum of its density and in this way reduces the effect of noise at equilibrium; on the other hand, decreasing r reduces the decay in the momentum.
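The two updates can be sketched in code as follows (the exact constants in (5.6) and (5.8) were partly lost to formatting, so this is a standard reading: eigenvalue clipping for Π^μ_pd and a damped momentum step for the heavy-ball method):

```python
# Projection onto symmetric matrices with eigenvalues at least mu, and one
# projected heavy-ball update of Gamma given a proposal direction b.
import numpy as np

def project_pd(M, mu):
    M = 0.5 * (M + M.T)                         # symmetrise first
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.maximum(w, mu)) @ V.T

def heavy_ball_step(Gamma, momentum, b, alpha, r, mu):
    momentum = (1.0 - r * alpha) * momentum + alpha * b   # damped momentum
    Gamma = project_pd(Gamma + momentum, mu)              # constrained move
    return Gamma, momentum
```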

A thinning approach for ∆Γ
The most straightforward way of approximating the integral in (5.1) is to use independent realisations of (5.2), as described in Appendix C; here we instead draw a thinned sample [64] from a single trajectory, in order to run only a single parallel set of realisations of (5.2) and (5.4) at a time. More specifically, we consider a single realisation of (5.2) and regularly spaced points from its trajectory (possibly after a burn-in) as sample points from π.
Starting at each of these sample points and ending at the subsequent one, the process is replicated, albeit started with a momentum reversal, and simulated in parallel. In addition, for each of the two processes, a corresponding first variation process (5.4) is calculated in parallel. A precise description follows.
Let K = 1 for simplicity. The Γ direction (5.1) is approximated by the estimator (5.9), where L ∈ N and ((q_i^{(k)}, p_i^{(k)}))_{i∈N}, ((q̃_i^{(k)}, p̃_i^{(k)}))_{i∈N} denote solutions to (5.2), the former for all i when k = 1, and otherwise started with initial condition (0, 0) and independent noise as i and k vary, along with corresponding (Dq_i^{(k)}, Dp_i^{(k)}), (Dq̃_i^{(k)}, Dp̃_i^{(k)}) satisfying (5.4) except at i = B + Tl − 1, l ∈ N (regardless of k). The k ≠ 1 processes are 'reset' at i = B + Tl to the corresponding values of the k = 1 chain if the first variation processes have converged to zero, in the sense of the criterion (5.11) applied to Dq_{Tl+B}, p̃_{Tl+B}, Dq̃_{Tl+B}, and L* ∈ N is such that the number of elements in {l ∈ N : 1 ≤ l ≤ L + L*} satisfying (5.11) is L. The approach is summarised in Algorithm 2. Of course, the above for generic K ∈ N constitutes improving approximations to ΔΓ. Note that as Γ changes through the prescribed procedure, the asymptotic variance associated to the given observable f is expected to improve; on the contrary, the estimator (5.9) for the continuous-time expression (5.1) may well worsen, since the integrand (of the outermost integral) in (5.1) is not f. Increasing L is expected to solve any resulting issues; extremely small L have been successful in the experiments here.
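A condensed sketch of this procedure, composing the pieces above (it omits the reset criterion (5.11) and the bookkeeping of Algorithm 2; `baoab_step`, `estimate_delta_gamma` and `project_pd` are the sketches from the preceding subsections):

```python
# Simplified on-the-fly loop: every segment, treat the current state (q, p)
# as a thinned sample from pi, estimate a proposal direction from the pair
# of tangent-carrying trajectories, and update Gamma by projected steps.
import numpy as np

def adaptive_gamma(q, p, Gamma, grad_U, hess_U, grad_f,
                   n_segments, T_seg, dt, alpha, mu, rng):
    n = len(q)
    for _ in range(n_segments):
        Dq, Dp = np.zeros((n, n)), np.eye(n)
        for _ in range(int(T_seg / dt)):      # advance the main chain
            q, p, Dq, Dp = baoab_step(q, p, Dq, Dp, grad_U, hess_U,
                                      Gamma, dt, rng)
        step = lambda qs, ps, dq, dp, r: baoab_step(
            qs, ps, dq, dp, grad_U, hess_U, Gamma, dt, r)
        b = estimate_delta_gamma(q, p, grad_f, step, T_seg, dt, 1, rng)
        Gamma = project_pd(Gamma + alpha * b, mu)   # constrained update
    return Gamma
```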
Remark 5.1. If it is of interest to approximate expectations of P ∈ N observables with respect to π, the quantity Σ_{i=1}^P σ²_i, for example, can be used as an objective function, where σ²_i is the asymptotic variance for the i-th observable. In the implementation in Algorithm 2, instead of only the vectors ζ, ζ̃, this amounts to calculating at each iteration the vectors ζ^{(i)}, ζ̃^{(i)} corresponding to the i-th observable and taking the sum of the resulting update matrices to update Γ. This requires no extra evaluations of ∇U over the single-observable case.
Remark 5.2 (Tangent processes along random directions). We mention the situation where simulating the full first variation process (D_p q_t, D_p p_t) in R^{n×2n} is prohibitively expensive. A directional tangent process can be used instead of (D_p q_t, D_p p_t). Consider, for a unit vector v ∈ R^n (that is, |v| = 1), randomly chosen at the beginning of each estimation of ΔΓ, the pair of vectors (D_p q_t v, D_p p_t v) ∈ R^{n×2}. Multiplying (5.3) and (5.4) on the right by v, one obtains the dynamics (5.12) for these vectors, where the first term involving the Hessian of U in (5.12) can be approximated by the finite difference (5.13), and similarly for the last such term. In continuous time, the resulting direction in Γ is (∫ (∇_pφ̄ · v)(∇_pφ · v) dπ) v ⊗ v and, from (3.2), the rate of change in asymptotic variance in this direction is −2(∫ (∇_pφ̄ · v)(∇_pφ · v) dπ)². However, the resulting gradient procedure in Γ turns out to be painstakingly slow to converge in high dimensions in comparison to simulating a full first variation process; as illustration, one can consider the situation where the randomly chosen vector v is taken from the restricted set of standard Euclidean basis vectors, in which case only one diagonal value of Γ is changed at a time. For directional derivatives of this type, we also mention [79,36].
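A sketch of the directional update (5.12) with the finite-difference approximation (5.13) (assumptions: M = I_n, Euler discretisation, hypothetical names):

```python
# Propagate the directional tangent vectors (D_p q_t v, D_p p_t v) without
# forming the full n x n first variation; the Hessian action along dqv is
# replaced by a forward difference of grad_U as in (5.13).
import numpy as np

def directional_tangent_step(q, dqv, dpv, grad_U, Gamma, dt, h=1e-5):
    hess_action = (grad_U(q + h * dqv) - grad_U(q)) / h   # ~ D^2U(q) @ dqv
    return dqv + dt * dpv, dpv - dt * (hess_action + Gamma @ dpv)
```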

Concrete examples
In Sections 5.2.1, 5.2.2 and 5.2.3, the Monte Carlo approach is applied to concrete problems. Section 5.2.1 contains the simplest one-dimensional Gaussian case, where the optimal Γ is known, and it is shown that the algorithm approximates it quickly. A different Gaussian problem extracted from a diffusion bridge context is explored in Section 5.2.2, where the algorithm is shown to approximate a Γ matrix that exhibits an even better empirical asymptotic variance than the one given by Proposition 4.5. Finally, the algorithm is applied to finding the optimal Γ for estimating the posterior mean in a Bayesian inference problem in Section 5.2.3, where the situation is shown to be similar to Corollary 4.8, in the sense that the optimal Γ is close to 0; after, and separately from, such a finding, the empirical asymptotic variance for a small Γ is compared to that for Γ = I_n, with dramatic improvement in both the full gradient and minibatch gradient cases.

One dimensional quadratic case
Here the algorithm given in Section 5.1.3 is used in the simplest one-dimensional case, U(q) = ½V₀q² with V₀ > 0, to find the optimal constant friction. Since commutativity issues disappear in the one-dimensional case, the optimal constant friction is known analytically and is given by Proposition 4.5 to be Γ = √V₀, with an explicitly known asymptotic variance. Moreover, the relationship between the asymptotic variance and Γ is explicitly given by equations (4.8) and (4.17), which reduce in this case to a scalar formula. The case V₀ = 5 is illustrated in the corresponding figure. For the linear observable f(q) = q, Corollary 4.8 gives that the 'optimal' (but unreachable in the algorithm due to the constraints) friction is 0; the right plot in Figure 5.3 shows that the procedure arrives at a similar conclusion, in the sense that Γ hits and stays at μ = 0.2.

Diffusion bridge sampling
The algorithm in Section 5.1.3 is applied in the context of diffusion bridge sampling [38,40] (see also for example [7,21,39]), where the SDE

dX_t = −∇V(X_t) dt + √(2β^{-1}) dW_t        (5.15)

for a suitable V : R^d → R, β > 0 and W_t a standard Wiener process on R^d, is conditioned on the events

X_0 = x_0,    X_T = x_+        (5.16)

for some fixed T > 0, x_0, x_+ ∈ R^d, and the problem setting is to sample from the path space of solutions to (5.15) conditioned on (5.16). For the derivation of the following formulations, we refer to Section 5 in [38] and Section 6.1 in [6]; here we extract a simplified potential U to apply our algorithm to, after a brief description. Using the measure given by Brownian motion conditioned on (5.16) as the reference measure μ₀ on the path space of continuous functions C([0, 1], R), the measure μ associated to (5.15) conditioned on (5.16) satisfies a Radon-Nikodym relation dμ/dμ₀ of Girsanov type, so that discretising μ on a grid in [0, 1] with grid size δ > 0 gives the approximating measure π(q_1, ..., q_n) ∝ e^{−U(q_1,...,q_n)}, where U is the corresponding discretised path-space potential. From here the Langevin system (1.1) can be used to sample from π, and the algorithm given in Section 5.1.3 is applied; a fixed observable is chosen for this purpose and the algorithm produces a final friction matrix Γ_final. This Γ is fixed and used for a standard sampling procedure for the same potential and observable. The asymptotic variance is approximated by grouping the epochs after B = 100 burn-in iterations into N_B = 999 blocks of T = 300 epochs, and this is compared to the estimate from the same procedure using different fixed values of Γ in Table 5.1. Note that Γ = Σ^{-1/2} is the optimal Γ in the restricted class of matrices commuting with Σ given by Proposition 4.5, where the asymptotic variance is known explicitly in terms of Σ.

    Γ              σ_approx
    I_n            6.9834
    Σ^{-1/2}       6.5096
    Γ_final        6.1667

Table 5.1: Empirical asymptotic variances with N_B = 999, T = 300, B = 100, N = 299700.
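The block-averaging estimate used for Table 5.1 (and again in Section 5.2.3) can be sketched as follows (the standard estimator of Section 2.3.1.3 in [53]; the scaling convention is an assumption):

```python
# Block averaging: after B burn-in iterates, split the f-values into N_B
# blocks of length T and rescale the empirical variance of the block means.
import numpy as np

def block_average_variance(fs, B, NB, T):
    x = np.asarray(fs)[B:B + NB * T].reshape(NB, T)
    block_means = x.mean(axis=1)
    return T * block_means.var(ddof=1)
```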

Bayesian inference
We adopt the binary regression problem as in [29], on a dataset² with datapoints encoding information about images on a webpage, each labelled 'ad' or 'non-ad'. The labels {Y_i}_{1≤i≤p}, taking values in {0, 1}, of the p = 2359 datapoints (counting only those without missing values) given in the dataset are modelled as conditionally independent Bernoulli random variables with probabilities {ρ(β^⊤X_i)}_{1≤i≤p}, where ρ is the logistic function given by ρ(z) = e^{cz}/(1 + e^{cz}) for all z ∈ R, c ∈ R is given by (5.18), and {X_i}_{1≤i≤p} and β, both taking values in R^n, are respectively the vectors of known features from each datapoint and the regression parameters to be determined. The parameters β are given the prior distribution N(0, Σ), for a fixed covariance Σ, and the density of the posterior distribution of β is given up to proportionality by

π(β) ∝ exp( Σ_{i=1}^p [ Y_i log ρ(β^⊤X_i) + (1 − Y_i) log(1 − ρ(β^⊤X_i)) ] − ½ β^⊤Σ^{-1}β ),

so that the log-density gradient, in our notation −∇U, is given by

−∇U(β) = c Σ_{i=1}^p X_i (Y_i − ρ(β^⊤X_i)) − Σ^{-1}β.

The observable vector f_i(q) = q_i, 1 ≤ i ≤ n, corresponding to the posterior mean is used. The coordinate transform β̃ = Σ^{-1/2}β is made before applying the symmetric preconditioner Σ^{1/2} on the Hamiltonian part of the dynamics, so that the dynamics simulated are as in (1.1) with M = I_n and

−∇Ũ(β̃) = c Σ^{1/2} Σ_{i=1}^p X_i (Y_i − ρ(β^⊤X_i)) − β̃,    β = Σ^{1/2}β̃.        (5.17)

We use this observable vector, and the sum of the corresponding asymptotic variances, as the objective to optimise with respect to Γ, but show in Figures 5.3 and 5.4 the estimated asymptotic variances for both sets f_i(β̃), f_i(β) of observables, where the estimation is calculated using the vector on the left of the outer product in (5.9), in accordance with σ² = 2∫(∇_pφ)^⊤Γ∇_pφ dπ, which follows from the formula (2.15) after integrating by parts, with truncation. The approximation (5.5) for the term(s) including the Hessian in (5.4) has been used, to test the method despite the explicit availability of the Hessian.
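A sketch of this gradient in code (assuming the standard Bernoulli likelihood with the N(0, Σ) prior as above; X is the p × n feature matrix, Y the 0/1 labels and c the constant from (5.18)):

```python
# Log-posterior gradient -grad U(beta) for the logistic regression model:
# c * sum_i X_i (Y_i - rho(beta^T X_i)) - Sigma^{-1} beta.
import numpy as np

def neg_grad_U(beta, X, Y, Sigma_inv, c):
    rho = 1.0 / (1.0 + np.exp(-c * (X @ beta)))   # logistic probabilities
    return c * X.T @ (Y - rho) - Sigma_inv @ beta
```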
During the execution of Algorithm 2, the constant c has been set to the value (5.18); the resulting trajectory of Γ tends towards zero, as expected for a linear observable and a potential close to quadratic (see Proposition 4.9). We note that in the gradient descent procedure for Γ, using the minibatch gradient does not change the behaviour shown in Figures 5.3 and 5.4. In addition, although the trajectory of Γ seems to go directly to zero, we expect the optimal Γ to be close to, but away from, zero, since the potential is close to but not exactly quadratic. Next, Γ is fixed at various values and used on the same problem for the first dataset, using both the full gradient (5.17) and a minibatch³ version, where the sum in (5.17) is replaced by p/10 times a sum over a subset S of {1, ..., p} with 10 elements randomly drawn without replacement, such that S changes once for each i in (5.2). In the minibatch gradient case, c is set to a fraction of (5.18), specifically c(p/10)^{-1}. In Tables 5.2 and 5.3, variances for the posterior mean estimates are shown (similar variance reduction results persist when using the probability of success for features taken from a single datapoint in the dataset). In detail, for each row of Tables 5.2 and 5.3, N = 29700 epochs of (5.2) are simulated with the same parameters as above. The asymptotic variance for each observable entry is approximated using block averaging (Section 2.3.1.3 in [53]) by grouping the epochs after B = 100 burn-in iterations into N_B = 99 blocks of T = 300 epochs and into N_B = 3 blocks of T = 9900 epochs (respectively for each column of Tables 5.2 and 5.3); the values 0.8667 and 0.1571 approach and correspond to values in the middle plot of Figure 5.3 after multiplying by n = 642. The variances are compared to those using a gradient oracle: unadjusted (overdamped) Langevin dynamics [29], and the same with an irreversible perturbation [26] given by an antisymmetric matrix J, where the stepsizes are the same as for the underdamped implementations. In addition, the Euclidean distance from intermediate estimates of the posterior mean to a total, combined estimate is shown for each method in Figure 5.5, where π(f) is taken to be the mean (over the methods listed in Tables 5.2 and 5.3) of the final posterior mean estimates. A weighted mean, with unit weights except one half on the Γ = 0.2I_n and Γ = 0.1I_n methods, also gave similar results, though this is not shown explicitly.
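The minibatch variant used for Tables 5.2 and 5.3 can be sketched as follows (m = 10 in the experiments; the rescaling of c by (p/10)^{-1} described above is left to the caller here):

```python
# Minibatch gradient: replace the full sum by (p / m) times a sum over a
# random subset of size m, redrawn without replacement at every step.
import numpy as np

def neg_grad_U_minibatch(beta, X, Y, Sigma_inv, c, m, rng):
    idx = rng.choice(len(Y), size=m, replace=False)
    rho = 1.0 / (1.0 + np.exp(-c * (X[idx] @ beta)))
    return (len(Y) / m) * c * X[idx].T @ (Y[idx] - rho) - Sigma_inv @ beta
```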
At both block sizes, T = 300 and T = 9900, these figures demonstrate improvement of an order of magnitude in observed variances for Γ close to that resulting from the gradient procedure, compared to Γ = I_n. The improvement is also seen when compared to overdamped Langevin dynamics with and without irreversible perturbation.

Proofs
Proof (of Proposition 3.1). Take an approximating sequence (f_k)_{k∈N} ⊂ C^∞_c and consider, for the differential operator, the expression (6.1). For any ε > 0, T can be chosen so that the first two terms on the right are each bounded by ε/4, due to (2.13), the strong continuity of P_t and (2.14); subsequently, k can be chosen so that the third and fourth terms are each bounded by ε/4. For the remaining term, using Assumption 1 for Theorem 5.13 in the first chapter of [49] and Hörmander's theorem [43], the backward Kolmogorov equation holds classically on (0, T) × R^{2n}. By Fubini and equation (6.2), and since (6.3) holds and therefore P_t(f_k) is bounded on (1/T, T) × B_R for any k ∈ N, T, R > 0, with B_R denoting the Euclidean ball in R^{2n} of radius R, Fubini's theorem can be applied again to obtain that the last term in (6.1) is equal to zero.
Before giving the proof of the main formula for the directional derivative of the asymptotic variance, a truncation function is introduced and the membership of ∇_pφ in L²(π) is shown. The truncation function is constructed to satisfy a property (6.5) related to the generator (2.3); it will be used to integrate by parts robustly when establishing both ∇_pφ ∈ L²(π) and the main formula.
Firstly, let ϕ : R → R and ϕ_k : R → R be the standard mollifiers, and let ν_k : R → R be given by (6.4).

Lemma 6.1. Under Assumption 1 and for k > 0, let η_k : R^{2n} → R be the smooth functions given in terms of ν_k for all ζ ∈ R^{2n}; then the following properties hold: 1. η_k is compactly supported; 2. η_k converges to 1 pointwise as k → ∞; 3. for some constant C > 0 independent of k, the bound (6.5) holds.

Proof. The first two properties hold directly by the definition of η_k. For the third property, it can be seen that ν_k and its derivative ν_k′ are estimated by terms at most of order k^{-1}, by bounding the defining expressions for all x ∈ R. Therefore there exists a constant C > 0 such that the required estimate holds. Moreover, the remaining expression is bounded from above and below by (2.2), and the second term in (6.6) is bounded above by a direct calculation.
Proof (of Lemma 3.2). Consider the functions (f_{k,R})_{k,R∈N} given by (6.7), for ϕ^1_k given in (6.4) and B_R the ball of radius R in R^{2n} centred at 0. By Theorem 2.3 and Proposition 3.1 together with Hörmander's theorem, there exists a smooth solution φ_{k,R} to the corresponding Poisson equation for each k, R ∈ N. For r ∈ N, η_r from Lemma 6.1 and the smallest eigenvalue λ_m of Γ, the bound (6.9) holds. The first term on the right of (6.9) can be bounded for some generic constant C > 0, where the last step follows by (6.5). The second term on the right of (6.9) is given by (6.11), where the last term integrates by parts to give an expression that, plugged back into (6.11) and using again (6.5), then into (6.9) together with (6.10), yields (6.12). Since L and its parts are linear, the same arguments as above give, for a generic constant C > 0 independent of k and R, the bound (6.13), and since ϕ^1_k * K_l → K_l uniformly on compact subsets, for any fixed R ∈ N there exists K̄_R ∈ N such that (6.14) holds for k ≥ K̄_R. Choose now the sequences R_i = i, k_i = i + max_{j≤i} K̄_j for all i ∈ N. By Theorems 2.1 and 2.3, the corresponding bound holds for i, j ∈ N, where C > 0 is again independent of k and R. Using the definition (6.7), the terms are bounded uniformly in k and R, which, together with (6.14), implies that ∥φ_{k_i,R_i} − φ_{k_j,R_j}∥_{L²(π)} is bounded uniformly in i, j, so that inserting into (6.12) gives (6.15). Together with (6.8), (∇_pφ_{k_i,R_i})_i is then a Cauchy sequence, with limit denoted g ∈ L²(π), so that testing against any h ∈ C^∞_c identifies g as the weak derivative ∇_pφ ∈ L²(π).

Some additional preliminaries are presented here for the proof of Theorem 3.4. For small ε ∈ R and a direction δΓ ∈ R^{n×n} in the space of friction matrices such that Γ + εδΓ ∈ S^n_{++}, let L_ε be the infinitesimal generator of (1.1) with the perturbed friction matrix Γ + εδΓ in place of Γ, given formally by the corresponding differential operator; the notation S_ε will be used for the perturbation on L. The formal L²(π)-adjoint of L_ε is denoted L*_ε.

Proof (of Theorem 3.4). For ε ≤ ε̄, by Theorem 2.3 there exists a solution φ + δφ_ε ∈ L²₀(π) to the Poisson equation with the perturbed generator L_ε. By Theorem 2.4, the directional derivative of σ²(Γ) in the direction δΓ is given by the limit (6.16). Let (f_{k,R})_{k,R∈N} be given by (6.7). Since inequality (6.13) holds by definition and ϕ^1_k * g → g uniformly on compact subsets for any continuous g, there exists for each R ∈ N a constant K̃_R ∈ N such that k ≥ K̃_R implies (6.14), for C independent of k and R, and also (6.18). The sequences R_i = i, k_i = i + max_{j≤i} K̃_j for i ∈ N then give a sequence (f_i)_{i∈N} with the required bounds, by (2.10) or by definition. Moreover, π(f_i) → 0 by (6.14) and dominated convergence. Therefore the solutions φ_i, φ_{i,ε} ∈ L²₀(π) to the Poisson equations given by Theorem 2.3 satisfy (6.19) by (2.12) and Theorem 2.1. Since f_i ∈ C^∞, Hörmander's theorem together with Proposition 3.1 gives that φ_i, φ_{i,ε} ∈ C^∞, so that φ_i, φ_{i,ε} solve −Lφ_i = −L_εφ_{i,ε} = f_i − π(f_i) classically. Furthermore, in the same way as in the proof of Lemma 3.2 used to obtain (6.15), it holds that ∇_pφ_i and ∇_pφ_{i,ε} converge, where ∇_pφ_i, ∇_pφ_{i,ε} ∈ L²(π) by Lemma 3.2 itself. The term under the limit in (6.16) is now approximated by a term involving f_i and the truncation functions η_k from Lemma 6.1. Working with the approximating integral and using Lemma 3.3, together with the obvious extension of the notation from (1.7), gives (6.22). From here, for any ε′ > 0, the unwanted term under the limit can be controlled by approximating again with a truncation and f_i, i ∈ N, where λ_{m,ε} is the smallest eigenvalue of Γ + εδΓ. The first term on the right hand side is negligible as k → ∞ because of Lemmata 6.1 and 3.2. The remaining term is given by (6.23), where the last term, after integrating by parts, is negligible as k → ∞, again due to Lemma 6.1 and (2.2). On the other hand, the first term on
the right hand side of (6.23) can be treated similarly, where again the term involving ∇_pη_k is negligible as k → ∞, so that putting together (6.22), (6.23), (6.24) and (6.25), then taking k → ∞ and i → ∞ with (6.19), gives the required control, where λ̄_M = inf_{0<ε≤ε̄} λ_{M,ε}, with λ_{M,ε} the largest eigenvalue of Γ + εδΓ, and δΓ = ε^{-1}((Γ + εδΓ) − Γ) has been used. Therefore the estimate holds for small enough ε, and substituting into (6.21) concludes the proof.
For the proof of Proposition 4.9, some notation is introduced. For k ∈ N, let the tridiagonal matrix M_k ∈ R^{(k+1)×(k+1)} be given by its elements in (6.26).

Lemma 6.2. The m-by-m square submatrix taken from the bottom right corner of M_k has a determinant of order γ as γ → 0 if m is odd, and a determinant that is bounded away from zero as γ → 0 if m is even.
Lemma 6.2 is straightforwardly proved by repeatedly taking Laplace expansions. An explicit proof is not given here.
Proof (of Proposition 4.9). Only the standard Gaussian and M = 1 are considered; the arguments for the general centred Gaussian case are the same. First consider the observable

f(q) = q^k        (6.27)

for some odd k ∈ N. Take the polynomial ansatz

φ(q, p) = Σ_{i,j=0}^k a_{i,j} q^i p^j        (6.28)

for a_{i,j} ∈ R and Γ = γ > 0. It will be shown that arbitrarily small asymptotic variance is achieved in the γ → 0 limit. Note that only pairs (i, j) with odd i and even j make nonzero contributions to the asymptotic variance.
Here and below,

a_{i,j} = 0 for all i, j < 0 and all i, j > k.        (6.29)

Comparing coefficients in (1.5) gives the recurrence (6.30), valid for all (i, j) ≠ (k, 0). It holds by strong induction (in j′) that

a_{i′+j′, k+1−j′} = 0 for all i′, j′ ≥ 0,        (6.31)

because of the following: the base case j′ = 0 follows by (6.29), and the induction step follows by taking (i, j) = (i′ + j′ − 1, k + 2 − j′) for i′ ≥ 0 in (6.30) and again using (6.29) where necessary. Comparing coefficients in the Poisson equation (1.5) for (i, j) = (k, 0) and using (6.29), (6.31) yields

a_{k−1,1} = 1.        (6.32)

Combining (6.32) with setting (i, j) = (j′ − 1, k + 1 − j′) for j′ = 1, ..., k in (6.30), the entries a_{j′, k−j′} satisfy the linear system

M_k (a_{k,0}, a_{k−1,1}, ..., a_{0,k})^⊤ = (1, 0, ..., 0)^⊤,        (6.33)

where M_k ∈ R^{(k+1)×(k+1)} is the tridiagonal matrix given in (6.26). In order to find the order in γ, as γ → 0, of the elements of (a_{k,0}, ..., a_{0,k})^⊤ appearing in (6.33), it suffices to find the order of the entries in the leftmost column of M_k^{-1}. For this, let C_i ∈ R be the i-th minor appearing in the top row of the cofactor matrix of M_k. On the corresponding submatrix, repeatedly taking the Laplace expansion along the leftmost column, until only the determinant of a (k+1−i)-by-(k+1−i) square matrix from the bottom right corner of M_k remains to be calculated, and then using Lemma 6.2 for this (k+1−i)-by-(k+1−i) matrix, gives that C_i is of order γ as γ → 0 for odd i. Furthermore, the determinant of M_k is bounded away from zero as γ → 0 by Lemma 6.2. Therefore the elements of (a_{k,0}, ..., a_{0,k})^⊤ in the left hand side of (6.33) with an odd index, that is a_{k−j,j} for even j, have order γ, and at most order 1 otherwise, as γ → 0. These elements with odd indices are exactly those from the vector (a_{k,0}, ..., a_{0,k})^⊤ that make a contribution to the asymptotic variance. The 'next' set of contributions comes from the vector (a_{k−2,0}, a_{k−3,1}, ..., a_{0,k−2})^⊤. Using again (6.29) and (6.30), this vector satisfies an analogous linear system with matrix M_{k−2} and right hand side v_{k−2}, for some vector v_{k−2} (from the last term on the left hand side of (6.30)) of order γ as γ → 0, and since the determinant of M_{k−2} is of order 1 (by Lemma 6.2), the contributions here to the asymptotic variance are again of order γ. Continuing for (a_{k−2j,0}, a_{k−2j−1,1}, ..., a_{0,k−2j})^⊤, j ∈ N, it follows that all contributions are of order γ as γ → 0. The resulting coefficients indeed make up a solution φ to the Poisson equation, because the matrices M_k are invertible and because the coefficients a_{i,j} for even i + j are equal to zero, by repeating the above procedure for the coefficients associated to M_{k−1}, M_{k−3} and so on. For the general case of (4.24), since L is a linear differential operator and the contributions to the value of ∫φ(f − π(f))dπ come from exactly the same (odd i, even j) a_{i,j} coefficients from the corresponding solution φ for each summand in (4.24), the proof concludes.
Proof (of Proposition 4.10). Take the polynomial ansatz

φ(q, p) = Σ_{i,j=0}^4 a_{i,j} q^i p^j        (6.34)

for a_{i,j} ∈ R, where the a_{i,j} not appearing in the sum are taken to be zero in the following. Again, only the standard Gaussian is considered; the arguments follow similarly otherwise. Comparing coefficients in (1.5) and using the same strong induction argument as in the proof of Proposition 4.9 leads to (6.30) for all (i, j) ≠ (4, 0), (0, 0), and to equation (6.31). Taking (i, j) = (j′ − 1, 5 − j′) for 1 ≤ j′ ≤ 4 in (6.30) and comparing the q⁴ coefficients in the Poisson equation, it holds that

M_4 (a_{4,0}, a_{3,1}, a_{2,2}, a_{1,3}, a_{0,4})^⊤ = (1, 0, ..., 0)^⊤,        (6.35)

and taking (i, j) = (j′ − 1, 3 − j′) for j′ ≥ 1 in (6.30) yields (6.36). Equations (6.35) and (6.36) can be solved explicitly, and the asymptotic variance is a weighted sum of the resulting coefficients. Those in (6.34) that make contributions are a_{4,0}, a_{2,2}, a_{2,0}, which gives an asymptotic variance that goes to infinity as γ → 0 or γ → ∞. Comparing constant terms in the Poisson equation yields a further condition, which turns out to be satisfied by the solution for a_{0,2}, so that (6.34) is indeed a solution; note that the coefficients associated to M_3 and M_1 are zero by a similar procedure as above.

Relation to previous methodologies
The infinite time integral (1.9) has been used for the calculation of transport coefficients in molecular dynamics [52,65], and the derivative of the expectation appearing in (1.9) with respect to initial conditions is a problem considered when calculating the 'Greeks' in mathematical finance [33]. On the topic of the latter, and in contrast to [33], there is previous work dealing with cases of degenerate noise in the system, but the formulae derived there were obtained under different motivations and do not seem to improve upon (1.10) in our situation; some of these references are given in Remark 5.2. Taking Γ → ∞ together with a time rescaling, the dynamics (1.1) become the overdamped Langevin equation [66]. An analogous result holds [46] when Γ = Γ(q) is position dependent, where a preconditioner for the corresponding overdamped dynamics appears in terms of Γ^{-1}; see Section 7.3 for a consideration of our method in the position-dependent friction case. On the other hand, the Hessian of U makes a good preconditioner for the overdamped dynamics because of the Brascamp-Lieb inequality; see Remark 1 in [1].
On the application of underdamped Langevin dynamics with (variance-reduced) stochastic gradients alongside the related Hamiltonian Monte Carlo method, [85] presents a comparison with convergence rates for the latter. In [17], convergence guarantees are provided for variance-reduced gradients in the overdamped case and for control-variate stochastic gradients in the underdamped case, along with numerical comparisons in low-dimensional, tall-dataset regimes. Furthermore, the underdamped dynamics with a single, randomly selected component gradient update in place of the full gradient is considered in [22]. Variance reduction by modifying the observable instead of changing the dynamics has been considered, for example, in [3,5,77]. The methods there are incompatible with the framework in the present work, because the improved observable is unknown before the simulation of the Markov chain. Although useful, their applicability is limited in large-n cases due to storage requirements [3], not to mention either the escalating computational cost for improvements in the observable or the requirement of a priori knowledge [77].

The nonconvex case
In the case where U is nonconvex, the Monte Carlo procedure in Section 5.1.3 may continue to be used as presented; however, the first variation process can easily stray from the exponential decay guaranteed in Theorem 3.5. Transitions from one metastable state to another cause the tangent process to increase in magnitude. For the one-dimensional double-well potential $U(q) = q^4/4 - q^2 + q/2$ and linear observable $f(q) = q$, these transitions occur frequently enough during the gradient procedure in Γ that Dq blows up in simulation. Even in cases where the metastabilities are strong, so that transitions occur less frequently, simulations show that Γ dives to zero in periods where no transitions are occurring (as in the case of Corollary 4.8), but increases dramatically in value once a transition does occur; the trajectory in Γ thus decays over time with occasional upward jumps, and does not converge. On the other hand, the Galerkin method presented in Appendix B tends to give good convergence for Γ in such cases.
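This behaviour can be observed directly with a short simulation. The sketch below uses an Euler–Maruyama discretisation of (1.1) with $M = 1$ and scalar friction, together with the first variation $(Dq, Dp)$ with respect to the initial position; the potential is the tilted double well as reconstructed above, and the step size and horizon are illustrative choices, not values from the experiments in Section 5.

    import numpy as np

    rng = np.random.default_rng(0)

    # Tilted double well (as above): U(q) = q^4/4 - q^2 + q/2
    dU  = lambda q: q**3 - 2.0 * q + 0.5   # U'
    d2U = lambda q: 3.0 * q**2 - 2.0       # U''

    gamma, h, nsteps = 1.0, 1e-3, 500_000
    q, p = -1.5, 0.0                       # start in the left well
    Dq, Dp = 1.0, 0.0                      # first variation w.r.t. q_0

    max_Dq = 0.0
    for _ in range(nsteps):
        xi = rng.standard_normal()
        q_new = q + h * p
        p_new = p - h * (dU(q) + gamma * p) + np.sqrt(2.0 * gamma * h) * xi
        # tangent process: linearised drift, no Brownian term for constant gamma
        Dq_new = Dq + h * Dp
        Dp_new = Dp - h * (d2U(q) * Dq + gamma * Dp)
        q, p, Dq, Dp = q_new, p_new, Dq_new, Dp_new
        max_Dq = max(max_Dq, abs(Dq))

    print("max |Dq| along the trajectory:", max_Dq)
    # |Dq| is expected to spike during transitions between the wells, where
    # U''(q) < 0 along the path; with frequent transitions it can blow up.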

Position-dependent friction
It is possible to adapt the formula (3.2), giving the gradient direction in Γ, to the case of position-dependent Γ, given a Feynman–Kac representation formula and the corresponding existence result; this will be the aim of future work. The gradient direction is the same as (1.8), with the change that the integral is replaced by the corresponding marginal integral in p. Ideas using such a formula need to take into account that the first variation process retains a non-vanishing stochastic integral with respect to Brownian motion, so that the truncation in calculating the corresponding infinite time integral in Section 5.1.3 is not as well justified; rather, it does not happen in the execution of Algorithm 2, because (5.11) is not satisfied.
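To see why the stochastic integral persists, consider scalar position-dependent friction $\gamma(q)$ in one dimension with $M = 1$; differentiating the momentum equation with respect to the initial condition gives, schematically (a sketch assuming enough smoothness, not a statement of the future result mentioned above):

\[ \mathrm{d}(Dp_t) = \bigl( -U''(q_t)\,Dq_t - \gamma'(q_t)\,p_t\,Dq_t - \gamma(q_t)\,Dp_t \bigr)\,\mathrm{d}t + \frac{\gamma'(q_t)}{\sqrt{2\gamma(q_t)}}\,Dq_t\,\mathrm{d}W_t, \]

so the Brownian term vanishes only when $\gamma' \equiv 0$, i.e. for constant friction.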

Metropolisation
Throughout Section 5, the implementation has not involved accept-reject steps. Metropolisation of discretisations of the underdamped Langevin dynamics was given in [45]; see also Section 2.2.3.2 in [53] and [58,74]. The systematic discretisation error is removed with the inclusion of this step, but the momentum is reversed upon rejection (to avoid high rejection rates [74]), which raises the question of whether friction matrices arising from Algorithm 1 improve the Metropolised situation, where the dynamics no longer imitate the continuous-time ones. For example, the intuition for the Gaussian target measure, linear observable case discussed in Section 4.2 no longer applies.
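For illustration, one standard Metropolised variant (in the spirit of [45]; the step below is a generic sketch of generalised HMC, not the exact scheme of that reference) combines an exact Ornstein–Uhlenbeck momentum refreshment with a leapfrog proposal, and flips the momentum on rejection:

    import numpy as np

    def ghmc_step(q, p, U, grad_U, gamma, h, rng):
        """One Metropolised underdamped step: OU momentum refreshment with
        friction gamma, a leapfrog proposal, and an accept-reject correction
        that reverses the momentum on rejection.  q, p are numpy arrays;
        M = I is assumed."""
        # O-step: exact OU refreshment, leaves the Gaussian momentum law invariant
        c = np.exp(-gamma * h)
        p = c * p + np.sqrt(1.0 - c**2) * rng.standard_normal(p.shape)
        # leapfrog proposal for the Hamiltonian part
        p_half = p - 0.5 * h * grad_U(q)
        q_prop = q + h * p_half
        p_prop = p_half - 0.5 * h * grad_U(q_prop)
        # Metropolis correction for the leapfrog energy error
        dH = U(q_prop) + 0.5 * p_prop @ p_prop - U(q) - 0.5 * p @ p
        if np.log(rng.uniform()) < -dH:
            return q_prop, p_prop      # accept
        return q, -p                   # reject: keep position, flip momentum

    # usage sketch: standard Gaussian target in two dimensions
    rng = np.random.default_rng(1)
    q, p = np.zeros(2), np.zeros(2)
    for _ in range(1000):
        q, p = ghmc_step(q, p, lambda x: 0.5 * x @ x, lambda x: x, 1.0, 0.1, rng)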

Conclusion
We have presented the central limit theorem for the underdamped Langevin dynamics and provided a formula for the directional derivative of the corresponding asymptotic variance with respect to the friction matrix Γ. A number of methods for approximating the gradient direction in Γ have been discussed, together with numerical results giving improved observed variances. Some cases where an improved friction matrix can be found explicitly have been given, to guide expectations about the optimal Γ. In particular, in cases where the observable is linear and the potential is close to quadratic, which is the case when finding the posterior mean in Bayesian inference with Gaussian priors, the optimal friction is expected to be close to zero (due to Corollary 4.8). This is consistent with the numerical conclusions from the proposed Algorithm 2. Moreover, it has been shown that the improvement in variance is retained when using minibatch stochastic gradients in a case of Bayesian inference. We mention that the gradient procedure using (1.6) and (1.10) can be used to guide Γ in arbitrarily high dimension by extrapolation; that is, given a high-dimensional problem of interest, the gradient procedure can be used on similar problems of intermediate dimension in order to obtain a friction matrix that can be extrapolated to the original problem. In particular, for the Bayesian inference problem as formulated in Section 5.2.3, the algorithm recommends the choice of a small friction scalar, which can be expected to apply for datasets in an arbitrary number of dimensions. Future directions not mentioned above include well-posedness of the optimisation in Γ, extension to higher-order Langevin samplers as in [16,59] and gradient formulae in the discrete-time case analogous to Theorem 3.4.

Appendix B: Galerkin method

Let $(H_l)_{l \in \mathbb{N}^n}$ denote the Hermite polynomials in p, orthonormal with respect to the standard Gaussian density proportional to $(2\pi)^{-\frac{n}{2}} e^{-|p|^2/2}$. A property of the Hermite polynomials that is repeatedly used here is $\partial_{p_i} H_l = \sqrt{l_i}\, H_{l - e_i}$. For the application of Hermite polynomials in solving the Poisson equation associated to Langevin dynamics (in the case of scalar friction), we refer to [71]. See also Chapter 5 in [37] for Hermite polynomials in the multidimensional setting. In the case of a non-quadratic potential U, the same polynomials are used here after a Gram–Schmidt procedure in $L^2(\pi)$, the results of which are denoted $(\hat H_l)_{l \in \mathbb{N}^n}$, so that $\hat H_l = \sum_{|k|_\infty \leq K} \alpha^l_k H_k$, where $|k|_\infty = \max(k_1, \ldots, k_n)$, $K \in \mathbb{N}$, for some constants $\alpha^l_k \in \mathbb{R}$ calculated numerically. Their products with $H_l$ are considered on $L^2(\pi)$. Similarly, Fourier approximations can be used in the case of an n-torus (in q). The observable $f \in L^2_0(\pi)$ is approximated by the projection $\Pi^q_K f$ onto the span of the $(\hat H_k)_{|k|_\infty \leq K}$. Since the generator has the form $L = \nabla_p^* \cdot \nabla_q - \nabla_q^* \cdot \nabla_p - (\nabla_p^*)^\top \Gamma \nabla_p$, where $\nabla_q^* = -\nabla_q + \nabla U$ and $\nabla_p^* = -\nabla_p + p$ are the respective formal $L^2(\pi)$-adjoints of $\nabla_q$ and $\nabla_p$, the negative of the generator in the Poisson equation, applied on functions of the form (B.1), is represented by the $(K+1)^{2n}$-by-$(K+1)^{2n}$ matrix given by
\[ L_{k,l,\bar k,\bar l} = \langle \hat H_k H_l, -L(\hat H_{\bar k} H_{\bar l}) \rangle_\pi = -\langle \hat H_k \nabla_p H_l, (\nabla_q \hat H_{\bar k}) H_{\bar l} \rangle_\pi + \langle (\nabla_q \hat H_k) H_l, \hat H_{\bar k} \nabla_p H_{\bar l} \rangle_\pi + \langle \hat H_k \nabla_p H_l, \Gamma\, \hat H_{\bar k} \nabla_p H_{\bar l} \rangle_\pi, \]
where δ denotes the Kronecker delta here (arising when the derivative property of the Hermite polynomials is applied), the dependences of $\hat H_k$, $\hat H_{\bar k}$ on q and of $H_l$, $H_{\bar l}$ on p have been suppressed, $\langle v, w \rangle$ denotes $\sum_i \langle v_i, w_i \rangle$ for $v = (v_1, \ldots, v_n)$, $w = (w_1, \ldots, w_n)$, and $\langle \cdot, \cdot \rangle$ denotes the inner product on $L^2(\pi)$. Note further that, by the property $\partial_{p_i} H_l = \sqrt{l_i}\, H_{l - e_i}$, the terms above reduce to inner products between the original Hermite polynomials $(H_l)_l$; since the $\alpha^l_k$ are themselves derived from such inner products in $L^2(\pi)$, these are the only values to be computed numerically, other than those for the projection $\Pi^q_K f$ of the observable.
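As an illustration of the resulting linear algebra, the following minimal sketch assembles this matrix in the simplest setting, under assumptions of our own choosing: n = 1, standard Gaussian target $U(q) = q^2/2$ (so that $\hat H = H$ and no Gram–Schmidt step is needed), unit mass, scalar friction $\gamma$ and observable $f(q) = q = H_1(q)$; the cutoff K is illustrative. It solves the Poisson equation in the truncated basis and evaluates the asymptotic variance $2\langle \phi, f \rangle_\pi$.

    import numpy as np

    K = 15          # highest Hermite degree per variable
    gamma = 0.5     # friction

    # basis psi_{k,l}(q,p) = H_k(q) H_l(p), orthonormal Hermite polynomials
    idx = {(k, l): i for i, (k, l) in enumerate(
        (k, l) for k in range(K + 1) for l in range(K + 1))}
    N = len(idx)
    A = np.zeros((N, N))   # matrix of -L in this basis

    for (k, l), i in idx.items():
        # gamma (d/dp)^* (d/dp): diagonal with eigenvalue gamma * l
        A[i, i] += gamma * l
        # -(d/dp)^* (d/dq): psi_{k,l} -> -sqrt(k(l+1)) psi_{k-1,l+1}
        if k >= 1 and l + 1 <= K:
            A[idx[(k - 1, l + 1)], i] += -np.sqrt(k * (l + 1))
        # +(d/dq)^* (d/dp): psi_{k,l} -> +sqrt((k+1)l) psi_{k+1,l-1}
        if l >= 1 and k + 1 <= K:
            A[idx[(k + 1, l - 1)], i] += np.sqrt((k + 1) * l)

    # observable f(q) = q = H_1(q), already centred; remove the constant mode
    f = np.zeros(N)
    f[idx[(1, 0)]] = 1.0
    keep = [i for key, i in idx.items() if key != (0, 0)]
    phi = np.zeros(N)
    phi[keep] = np.linalg.solve(A[np.ix_(keep, keep)], f[keep])

    print("sigma^2 =", 2 * phi[idx[(1, 0)]])

For this observable the exact solution is $\phi = \gamma q + p$, so the printed value should match $2\gamma$ up to round-off; the asymptotic variance shrinks with the friction, consistent with Corollary 4.8 and the small optimal friction found for linear observables.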

Figure 1.1: The values $\min_i(|\mathrm{Re}(\lambda_i)|)$, where $\lambda_i$ are the eigenvalues of the matrix appearing in (1.11); this is also the spectral gap. Critical values of γ are given by 2ω.
(1.6) and (1.10) are precisely stated. Exact results concerning improvements in Γ, including the quadratic U, quadratic f and linear f cases, are given in Section 4. Numerical methods for approximating (1.8), together with an algorithm resulting from (1.6) and (1.10), are outlined and detailed in Algorithms 1 and 2 respectively in Section 5, alongside examples of U and f where improvements in variance are demonstrated.

Lemma 4.1. Given f in (4.1), π(f) in (4.2) and L of the form (4.3), φ given by (4.4) is a solution to the Poisson equation (1.5) if and only if

Algorithm 1: Continuous-time outline of the Γ update using (1.6) and (1.10). Result: Γ ∈ S^n_{++}.
(1.8) based on Theorem 3.4, but a spectral method to solve (1.5) and compute the change in Γ is given in Appendix B, which is computationally feasible in low dimensions. Algorithm 1 summarises the resulting continuous-time procedure, where all expectations within (1.8) are approximated by single realisations; further justifications, alternative methods, refinements and a concrete implementation (Algorithm 2), along with examples, follow.
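A hypothetical skeleton of this outer loop is sketched below; the gradient formula (1.8) is not reproduced here, so grad_asym_var is a placeholder for its single-realisation estimator (for example via trajectory and tangent-process simulation as in Section 5.1.3), and project_spd is likewise an illustrative name, not part of the paper's notation.

    import numpy as np

    def project_spd(G, eps=1e-6):
        # illustrative projection onto S^n_{++}: symmetrise, clip eigenvalues
        G = (G + G.T) / 2
        w, V = np.linalg.eigh(G)
        return V @ np.diag(np.maximum(w, eps)) @ V.T

    def optimise_friction(Gamma0, grad_asym_var, n_iters=100, step=0.1):
        """Sketch of the Gamma update summarised in Algorithm 1: each
        iteration takes a single-realisation estimate of the gradient
        direction (1.8) and performs a projected gradient step in S^n_{++}."""
        Gamma = Gamma0
        for _ in range(n_iters):
            G = grad_asym_var(Gamma)   # one realisation of the (1.8) estimate
            Gamma = project_spd(Gamma - step * G)
        return Gamma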

Figure 5.3: Left: diagonal values of Γ over iterations of (5.8), with α_i = 0.1, G = 1, r = 1 and µ = 0.2. Note that the mean of the absolute values of all entries of Γ at the end of the iterations is 0.0039. Middle: sum over i of estimated asymptotic variances for f_i(β̂); right: the same for f_i(β).

Figure 5.4: The same as in the caption of Figure 5.3, except that r = 0.5 and a different dataset (https://archive.ics.uci.edu/ml/datasets/Musk+(Version+1)) is used, where n = 167 and p = 476. The mean of the absolute values of all entries of Γ at the end of the iterations is 0.0210.

Table 5.3: The same as in Table 5.2, except for minibatch gradients.

(6.20) By Lemmas 6.1 and 3.2, the terms involving gradients of η_k converge to zero as k → ∞, so that taking k → ∞, then i → ∞,