Variance reduction for irreversible Langevin samplers and diffusion on graphs

In recent papers it has been demonstrated that sampling a Gibbs distribution from an appropriate time-irreversible Langevin process is, from several points of view, advantageous when compared to sampling from a time-reversible one. Adding an appropriate irreversible drift to the overdamped Langevin equation results in a larger large deviations rate function for the empirical measure of the process, a smaller variance for the long time average of observables of the process, as well as a larger spectral gap. In this work, we concentrate on irreversible Langevin samplers with a drift of increasing intensity. The asymptotic variance is monotonically decreasing with respect to the growth of the drift, and we characterize its limiting behavior. For a Gibbs measure whose potential has one or more critical points, adding a large irreversible drift results in a decomposition of the process into a fast and a slow component, with fast motion along the level sets of the potential and slow motion in the orthogonal direction. This result helps explain the variance reduction, which at the process level is due to the induced fast motion along the level sets of the potential. The limit of the asymptotic variance as the magnitude of the irreversible perturbation grows is the asymptotic variance associated with the limiting slow motion. The latter is a diffusion process on a graph.


Introduction
It is often the case that one is given a high-dimensional distribution π(dx) on a state space E, known only up to normalizing constants, and an observable f, and the goal is to compute an integral of the form f̄ = ∫_E f(x) π(dx). Typically, such integrals cannot be computed in closed form, so one has to resort to approximations. What is typically done is to construct a Markov process X_t which has π as its invariant distribution. Then, under the assumption of positive recurrence, the ergodic theorem guarantees that for any f ∈ L¹(π),

lim_{t→∞} (1/t) ∫_0^t f(X_s) ds = ∫_E f(x) π(dx),  a.s.
Then, the estimator (1/t) ∫_0^t f(X_s) ds is used to approximate the integral of interest. However, the degree of accuracy of such an approximation depends both on the choice of the Markov process X_t and on the criterion used for comparison.
Let us assume that π(dx) is of Gibbs type on the state space E, in particular that

π(dx) = Z⁻¹ e^{−U(x)/β} dx.

A Markov process that has π as its invariant distribution is the time-reversible Langevin diffusion

dX_t = −∇U(X_t) dt + √(2β) dW_t.

However, as has been argued in the literature [11,13], it is advantageous to use appropriate irreversible diffusions of the form

dX_t = [−∇U(X_t) + C(X_t)] dt + √(2β) dW_t,

and if the vector field C satisfies div(C e^{−U/β}) = 0, or equivalently div C = β⁻¹ C·∇U, then the measure π is still invariant.
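To make the contrast concrete, both dynamics can be discretized with a simple Euler-Maruyama scheme. The sketch below is our own illustration (not code from the paper): it uses a quadratic potential, the choice C = S∇U discussed in Section 2, and parameter values chosen only for demonstration.

```python
import numpy as np

def grad_U(x):
    # Gradient of the illustrative quadratic potential U(x) = |x|^2 / 2.
    return x

S = np.array([[0.0, 1.0], [-1.0, 0.0]])  # an antisymmetric matrix

def euler_maruyama(x0, beta, delta, dt, n_steps, rng):
    """Simulate dX = [-grad U(X) + delta * C(X)] dt + sqrt(2 beta) dW
    with C(x) = S grad U(x); delta = 0 recovers the reversible sampler."""
    x = np.array(x0, dtype=float)
    path = np.empty((n_steps + 1, len(x)))
    path[0] = x
    for k in range(n_steps):
        drift = -grad_U(x) + delta * (S @ grad_U(x))
        x = x + drift * dt + np.sqrt(2.0 * beta * dt) * rng.standard_normal(len(x))
        path[k + 1] = x
    return path

rng = np.random.default_rng(0)
path = euler_maruyama([1.0, 0.0], beta=0.5, delta=10.0, dt=1e-3, n_steps=20_000, rng=rng)
# Ergodic-average estimator of f_bar for f(x) = |x|^2; here pi is a Gaussian
# with E|x|^2 = 1, so the estimate should be close to 1 for long runs.
f_bar_hat = np.mean(np.sum(path**2, axis=1))
```

Both the reversible (delta = 0) and irreversible (delta ≠ 0) chains target the same Gibbs measure; only the quality of the ergodic average differs.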
The main result of [11] is that, under certain conditions, the absolute value of the second-largest eigenvalue of the Markov semigroup in L²(π) decreases when C ≠ 0, which naturally implies faster convergence to equilibrium. In [5] the Donsker-Varadhan large deviations rate function [4] was proposed as a natural tool to compare the convergence to equilibrium of ergodic averages, and it was used to analyze parallel tempering algorithms. In [13], this criterion is used as a guide to design and analyze non-reversible Markov processes and to compare them with reversible ones, and it is proven that the large deviations rate function monotonically increases under the addition of an irreversible drift. Moreover, upon connecting the large deviations rate function with the asymptotic variance of the estimator, it is proven in [13] that adding an irreversible drift C also decreases the asymptotic variance of the estimator, in the sense that it is smaller than the asymptotic variance when C = 0.
The goal of the current work is to study the situation where the drift has the form (1/ε)C(x) and to consider what happens as ε → 0. A similar question, but from a different perspective, has been studied in [3,6]. There, the authors studied the behavior of the spectral gap for diffusions on compact manifolds with U = 0 and a one-parameter family of perturbations (1/ε)C for some divergence-free vector field C. In those papers the behavior of the spectral gap is related to the ergodic properties of the flow generated by C (for example, if the flow is weak-mixing then the second-largest eigenvalue tends to 0 as ε → 0). In the present paper we want to understand the effect that increasing 1/ε has on the asymptotic variance and on the paths of X_t.
We find that the asymptotic variance of the estimator is monotonically decreasing in δ = 1/ε. Using the averaging principle, we characterize the limiting asymptotic variance as ε ↓ 0. Focusing on the case where the potential U(x) has one or more critical points, the irreversible perturbation with small ε induces a fast motion on the constant-potential surfaces and a slow motion in the orthogonal direction. Using the theory of diffusions on graphs and the related averaging principle, as developed in a series of works [2,8,7,9], we identify the limiting motion of the slow component. The fast motion on constant-potential surfaces decreases the variance, as the phase space is explored faster, and the limit of the asymptotic variance as ε → 0 is the asymptotic variance of a one-dimensional estimation problem on a graph, which is where, in the limit, the slow component of the process lives. Upon completion of this work, we became aware of the recent paper [12], where an alternative expression for the limiting asymptotic variance is provided. The methods of [12] are analytical, and the characterization of the limiting variance is expressed as a projection onto the kernel of a certain operator. Our approach provides complementary information and is dynamical, since we use an averaging principle which allows us to make direct connections with the limiting behavior of the underlying process itself, relating the limiting variance to an estimation problem on a one-dimensional graph.
The rest of the paper is organized as follows. In Section 2 we formulate the problem precisely and present our main results. The averaging problem treating the limit of the slow component of the process is discussed in Section 3. Due to the special structure of the model, one can perform explicit computations and thus derive precise results. The formula for the limit of the asymptotic variance as the perturbation grows is in Section 4. Numerical simulations illustrating the theoretical findings are in Section 5.

Statement of the problem and main results
The papers [3,6,11,13] motivate looking at the case of a one-parameter family of irreversible drifts (1/ε)C(x). For this purpose, we consider a one-parameter family of vector fields (1/ε)C, where ε ∈ R. As mentioned in the introduction, the vector field C must satisfy div(C e^{−U/β}) = 0, or equivalently div C = β⁻¹ C·∇U, so that π remains invariant.
A convenient choice, which we assume henceforth and which is not the most general but can be made independently of β, is to pick C such that

div C = 0  and  C·∇U = 0.
A standard choice of C(x) is C(x) = S∇U(x), where S is any antisymmetric matrix. A more elaborate discussion of other possible choices of C(x) can be found in [13]. The meaning of these conditions is straightforward: the flow generated by C must preserve Lebesgue measure, since it is divergence-free, and since U is a constant of the motion, the micro-canonical measures on the surfaces {U = z} are preserved too. For convenience, from now on we set β = 1/2. We will be assuming here that the diffusion process X_t is on a d-dimensional compact smooth manifold E without boundary and that U and C are sufficiently smooth. The ergodicity of X_t implies that the empirical measure π_t = (1/t) ∫_0^t δ_{X_s} ds converges to π almost surely as t → ∞. Under our assumptions we have a large deviation principle (uniformly in the initial condition) for the family of measures π_t, which we write symbolically as

P(π_t ≈ µ) ≍ exp(−t I_{1/ε}(µ)),

where ≍ denotes logarithmic equivalence. (Since C is fixed we have suppressed the dependence of I on C.) The rate function I_{1/ε}(µ) quantifies the exponential rate at which the (random) measure π_t converges to π as t → ∞. It is proven in [13] (see Theorem 2.1 below) that the rate function I_{1/ε}(µ) is quadratic in 1/ε.
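The two conditions on C are easy to check numerically for the choice C = S∇U. The sketch below is our own illustration with a hypothetical quartic potential: it verifies C·∇U = 0 exactly and div C = 0 via a central finite difference, at randomly sampled points.

```python
import numpy as np

def grad_U(x):
    # U(x, y) = x^4/4 + y^2/2, a non-quadratic test potential (illustration only).
    return np.array([x[0]**3, x[1]])

S = np.array([[0.0, 1.0], [-1.0, 0.0]])  # antisymmetric

def C(x):
    # Irreversible drift C = S grad U; tangent to the level sets of U.
    return S @ grad_U(x)

def divergence(F, x, h=1e-5):
    # Central finite-difference approximation of div F at the point x.
    d = len(x)
    div = 0.0
    for i in range(d):
        e = np.zeros(d); e[i] = h
        div += (F(x + e)[i] - F(x - e)[i]) / (2 * h)
    return div

rng = np.random.default_rng(1)
for _ in range(100):
    x = rng.standard_normal(2)
    assert abs(C(x) @ grad_U(x)) < 1e-10   # C is orthogonal to grad U
    assert abs(divergence(C, x)) < 1e-6    # C is divergence-free
```

Since S is antisymmetric, C·∇U = (∇U)ᵀS∇U = 0 identically, and for this C the divergence vanishes as well, so the check passes at every point.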
The information in I_{1/ε}(µ) can be used to study the rate of convergence of observables. If f ∈ C(E; R) then we have the large deviation principle

P( (1/t) ∫_0^t f(X_s) ds ≈ ℓ ) ≍ exp(−t Ĩ_{f,1/ε}(ℓ)),

where, by the contraction principle,

Ĩ_{f,1/ε}(ℓ) = inf{ I_{1/ε}(µ) : ∫ f dµ = ℓ }.

An alternative formula for Ĩ_{f,1/ε}(ℓ) is in terms of the Legendre transform of the maximal eigenvalue of an associated operator. For f ∈ L²(π) with f̄ = ∫ f dπ, the asymptotic variance is given by (see, e.g., Proposition IV.1.3 in [1])

σ²_{f,1/ε} = lim_{t→∞} t E( (1/t) ∫_0^t f(X_s) ds − f̄ )².

We have the following theorem from [13].
Theorem 2.1. (i) Let µ(dx) = p(x)dx be a measure with positive density p ∈ C^{2+α}(E). Then we have I_{1/ε}(µ) = I_0(µ) + ε⁻² K(µ), where the functional K(µ) is non-negative, and strictly positive if and only if div(p(x)C(x)) ≠ 0. It takes an explicit form in terms of ξ, where ξ is the unique solution (up to a constant) of the equation div[p(C + ∇ξ)] = 0. (ii) We have Ĩ_{f,1/ε}(ℓ) ≥ Ĩ_{f,0}(ℓ), and generically the inequality is strict for f ∈ C^{α}(E). For the asymptotic variance we have σ²_{f,1/ε} ≤ σ²_{f,0}, excluding the trivial case f ≡ f̄. More generally, the map |ε| ↦ σ²_{f,1/ε} is a monotone increasing function, and thus its limit as ε → 0 exists.
For notational convenience, we shall write from now on σ²_f(ε) = σ²_{f,1/ε}, and, without loss of generality, we may and will assume that f̄ = 0. We want to characterize the behavior of the asymptotic variance as ε ↓ 0, namely to find lim_{ε↓0} σ²_f(ε), and to connect it with the limiting behavior of the trajectories X_t as ε ↓ 0.
Notice that we can write σ²_f(ε) = 2 ∫_E f(x) Φ_ε(x) π(dx), where Φ_ε(x) is the unique solution (up to constants) of the Poisson equation (2.3). It is easy to see that as ε → 0, the solution to (2.3) becomes constant on the stream lines of C(x). In particular, if we multiply (2.3) by ε and then take ε → 0, we formally obtain that C(x)·∇Φ(x) = 0. Making this rigorous is the purpose of Sections 3 and 4. Note that this heuristic argument makes it immediately clear that σ²_f(0) = 0 when Ker(C·∇) = {0}. Hence, we next investigate what happens in the non-trivial case, i.e., when Ker(C·∇) ≠ {0}. Our goal is to relate the behavior of the variance as ε → 0 with that of the process X_t. It turns out that as ε → 0, the behavior of the process X_t can be decomposed into fast motion on the constant-potential surfaces and slow motion in the orthogonal direction. Using the averaging principle, see for example [2,8,7,9], we can identify the limiting behavior of the slow component and then compute lim_{ε↓0} σ²_f(ε). To be more precise, the slow component of the motion can be characterized in the limit ε → 0 as a one-dimensional Markov process on a graph, and σ²_f(0) turns out to be associated with the asymptotic variance of an estimation problem on the graph itself.
The precise behavior of the process X_t as ε → 0 has been investigated in [7,9]. We will state the precise assumptions and results in Section 3; however, let us briefly review the result here. Following [9], we consider a finite graph Γ which represents the structure of the level sets of the potential function U on E. To construct the graph we identify the points that belong to the same connected component of each level set of U. We assume that U has finitely many nondegenerate critical points and that each connected level set component of U contains at most one critical point. In this way, each of the domains bounded by the separatrices gets mapped into an edge of the graph, while the separatrices get mapped to the vertices of Γ. In particular, exterior vertices correspond to minima of U, whereas interior vertices correspond to saddle points of U. Edges of Γ are indexed by I_1, ..., I_m. Each point on Γ is indexed by a pair y = (z, i), where z is the value of U on the level set corresponding to y and i is the number of the edge containing y. Clearly the pair y = (z, i) forms a global coordinate on Γ. For any given point x ∈ E, let Q(x) = (U(x), i(x)) be the corresponding projection onto the graph.
Consider the process X_t on E given by the solution to (3.3). As we describe in Section 3, the additional noise that appears in (3.3) is added for regularization reasons, and the limit does not depend on it. Moreover, the additional regularizing noise is not needed if C(x) generates an ergodic dynamical system with a unique invariant measure within each connected component of the level sets of U. It is proven in Theorem 2.2 of [9] and in [7] that the process Q(X_t) converges to a certain Markov process on Γ with continuous trajectories, which is exponentially mixing. The precise result will be stated in Section 3.
We remark here that the classical Freidlin-Wentzell theory, see [9], assumes that lim |x|→∞ U (x) = ∞. However, the results are clearly true in the compact case as well.
Based on Theorem 2.2, we can then establish the limiting behavior of the asymptotic variance as ε ↓ 0 and then make connections to estimation problems on the graph. In particular, we have the following result, which is discussed and proven in Section 4.
It is straightforward to see that this is the asymptotic variance of an ergodic average on the graph; the corresponding formula (2.4) is proven in Section 4.

The averaging problem
In this section we discuss the behavior of the process X_t as ε ↓ 0. Such problems have been studied in [2,8,7,9], and we recall here the results which are relevant to us. It turns out that as ε → 0, the behavior of the process X_t can be decomposed into fast motion on the constant-potential surfaces and slow motion in the orthogonal direction.
Let d(z) = {x ∈ E : U(x) = z} be the level set corresponding to U = z, and let d_i(z) denote the connected components of d(z). Then, we let Γ be the graph which is homeomorphic to the set of connected components d_i(z) of the level sets d(z). Exterior vertices correspond to minima of U, whereas interior vertices correspond to saddle points of U. Edges of Γ are indexed by I_1, ..., I_m. Each point on Γ is indexed by a pair y = (z, i), where z is the value of U on the level set corresponding to y and i is the number of the edge containing y. Clearly the pair y = (z, i) forms a global coordinate on Γ. For any given point x ∈ E, let Q : E → Γ, Q(x) = (U(x), i(x)), be the corresponding projection onto the graph. For an edge I_k and a vertex O_j we write I_k ∼ O_j if O_j lies at the boundary of the edge I_k. We endow Γ with the natural topology. Then it is known that Γ forms a graph with interior vertices of order two or three; see Section 8 of [7].
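For a concrete instance of the coordinate y = (z, i), consider the double-well potential used in the numerical examples of Section 5. The sketch below is our own minimal implementation of the projection Q, with an arbitrary edge-labeling convention of ours (1 = left well, 2 = right well, 3 = edge above the saddle).

```python
import numpy as np

Z_SADDLE = 0.25  # value of U at the saddle (0, 0) for the potential below

def U(x, y):
    # Double-well potential U(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2.
    return 0.25 * (x**2 - 1.0)**2 + 0.5 * y**2

def Q(x, y):
    """Project a point onto the graph Gamma as (U-value, edge index).
    Below the saddle level the level set has two components, one per well;
    above it, a single component."""
    z = U(x, y)
    if z > Z_SADDLE:
        return (z, 3)
    return (z, 1) if x < 0 else (z, 2)

# The two minima project to the ends of edges 1 and 2, while any point with
# U > 1/4 lies on edge 3; the saddle itself is the interior vertex.
```

Here the pair (z, i) is exactly the global coordinate on Γ described above: z records the height and i records which connected component of the level set the point belongs to.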
Let us next consider X_t with an additional artificial noise component in the fast dynamics, as in (3.3), where W and W° are independent standard Wiener processes and σ(x) is the artificial diffusion coefficient appearing there. Clearly, if κ = 0 then we recover the process X_t that we have been considering until now. We make the following assumption.

Condition 3.1.
(i) In dimension d = 2, we take κ ≥ 0. In dimension d > 2, we either assume that the dynamical systemẋ t = C(x t ) has a unique invariant measure on each connected component d i (z), in which case κ ≥ 0, or otherwise we assume that κ > 0.
(ii) For any x ∈ E such that ξ · ∇U (x) = 0 we have that (iii) U has a finite number of critical points x 1 , · · · , x m and at these points the Hessian matrix is non-degenerate. (iv) There is at most one critical point for each connected level set component of U .
(v) If x_k is a critical point of U, then there exists c > 0 such that |C(x)| ≤ c|x − x_k|. (vi) Let λ_{i,k} be the eigenvalues of the Hessian of U(x) at the critical points x_k, where k = 1, ..., m and i = 1, ..., d. Then we assume that κ < (K max_{i,k} λ_{i,k})⁻¹. (vii) The matrix σ(x)σ^T(x) is non-negative definite and symmetric with smooth entries.
Obviously, Conditions 3.1(ii) and (vi)-(vii) are relevant only in the case d > 2 and when C(x) does not generate an ergodic dynamical system (since otherwise we can just take κ = 0). In the case κ > 0, this procedure of incorporating an appropriate artificial noise in the system allows us to single out the correct averaging that should be done in the system, yielding the expected result. We remark that the end result does not depend on the additional regularizing noise, since σ(x) does not appear in the limiting dynamics.
It is clear that the motion can be decomposed into a fast motion and a slow motion. The fast motion corresponds to the infinitesimal generator L̄ collecting the O(1/ε) terms of the generator of (3.3). Let us write X̄_t for the diffusion process that has infinitesimal generator L̄. Condition 3.1 guarantees that, with probability one, if the initial point of X̄ is in a connected component d_i(z), then X̄_t ∈ d_i(z) for all t ≥ 0. Indeed, by Itô's formula we have

U(X̄_t) = U(X̄_0) + ∫_0^t L̄U(X̄_s) ds + ∫_0^t ∇U(X̄_s)σ(X̄_s) dW_s.

Since C(x)·∇U(x) = 0 and σ(x)σ^T(x)∇U(x) = 0, we obtain that with probability one ∫_0^t L̄U(X̄_s) ds = 0. The quadratic variation of the stochastic integral is also zero, due to σ(x)σ^T(x)∇U(x) = 0, which implies that with probability one ∫_0^t ∇U(X̄_s)σ(X̄_s) dW_s = 0. Thus, we indeed get that X̄_t ∈ d_i(z) for all t ≥ 0, given that the initial point belongs to the particular connected component d_i(z).
Let us now turn our attention to the issue of invariant measures. It is relatively easy to see that, independently of the form of the matrix σ(x)σ^T(x), the fact that div(C) = 0 implies that Lebesgue measure is invariant for the diffusion process corresponding to the operator L̄ in R^d. Then, the proof of Lemma 2.3 of [7] and the fact that X̄_t ∈ d_i(z) for all t ≥ 0 if X̄_0 ∈ d_i(z) imply that if (z, i) ∈ Γ is not a vertex, there exists a unique invariant measure µ_{z,i} concentrated on the connected component d_i(z) of d(z); we denote by T_i(z) the period of the trajectory concentrated on d_i(z). For any sufficiently smooth function f(x), define its average over the relevant connected component of the level set of U(x) as

f̂_i(z) = ∫_{d_i(z)} f(x) µ_{z,i}(dx).

Let us consider the process Q(X_t) = (U(X_t), i(X_t)) and study its limiting behavior. Let us write L_0 for the infinitesimal generator of the process X_t given by (1.2). Averaging the coefficients of L_0 against µ_{z,i}, we consider the one-dimensional process Y_t which within the branch I_i is governed by the infinitesimal generator L_i^Y; within each edge of Γ, Q(X_t) converges as ε ↓ 0 to a process with infinitesimal generator L_i^Y. In order to uniquely define the limiting process, we need to specify the behavior at the vertices of the graph, which amounts to imposing restrictions (gluing conditions) on the domain of definition of the generator, say L^Y, of the Markov process, where, if γ_jk are the separatrix curves that meet at O_j, the gluing coefficients are given by integrals of ∆U(x) over the corresponding domains.
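The level-set averages f̂_i(z) can be approximated by time-averaging f along the fast flow ẋ = C(x), since µ_{z,i} is its invariant measure on d_i(z). The sketch below is our own illustration for the rotationally symmetric case U = (x² + y²)/2, where C = S∇U generates a rigid rotation and the exact level-set average is available in closed form; it integrates the flow with a classical RK4 step over roughly one period.

```python
import numpy as np

def C(x):
    # Fast drift C = S grad U for U = (x^2 + y^2)/2: a rigid rotation.
    return np.array([x[1], -x[0]])

def orbit_average(f, x0, dt=1e-3, n_steps=6284):
    """Time-average f along the flow x' = C(x) (about one full period here),
    approximating the average of f against the invariant measure mu_{z,i}."""
    x = np.array(x0, dtype=float)
    total = 0.0
    for _ in range(n_steps):
        # classical RK4 step for the ODE x' = C(x)
        k1 = C(x); k2 = C(x + 0.5*dt*k1); k3 = C(x + 0.5*dt*k2); k4 = C(x + dt*k3)
        x = x + (dt/6.0)*(k1 + 2*k2 + 2*k3 + k4)
        total += f(x)
    return total / n_steps

f = lambda x: x[0]**2                        # observable f(x, y) = x^2
avg = orbit_average(f, [np.sqrt(2.0), 0.0])  # level set U = 1: circle of radius sqrt(2)
# Exact level-set average: (1/2pi) * int r^2 cos^2(theta) dtheta = r^2/2 = 1.
```

The same time-averaging idea works for any C satisfying the ergodicity assumption on each connected component, which is exactly the role of Condition 3.1.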
Here one chooses + or − depending on whether the value of U increases or decreases, respectively, along the edge I_k as we approach O_j, and D_k denotes the derivative in the direction of the edge I_k. Moreover, within each edge I_i the process Y_t is a diffusion process with infinitesimal generator L_i^Y. Consider now the process Y_t that has the aforementioned L^Y as its infinitesimal generator, with domain of definition D as specified in Definition 3.2. Such a process is a continuous strong Markov process; see, e.g., Chapter 8 of [9].
Then, for any T > 0, Q(X_t) converges weakly in C([0, T]; Γ) to the process Y_t as ε ↓ 0. In particular, we have the following theorem.
It is important to note that the limiting process Y_t does not depend on σ(x). Next, we show that in our case of interest, for every i = 1, ..., m, the operator L_i^Y which governs the motion of the limiting process within the branch I_i of the graph takes a more explicit form. Let G_i(z) = int(d_i(z)). Notice that by the Gauss theorem we have

∫_{G_i(z)} ∆U(x) dx = ∮_{d_i(z)} |∇U(x)| dl,

and similar boundary-integral representations hold for the averaged coefficients of L_i^Y. Thus, if we define M_i(z) = ∫_{G_i(z)} ∆U(x) dx > 0, we can rewrite L_i^Y in explicit form: within each edge of the graph the effective process is a one-dimensional diffusion whose effective drift and effective diffusion coefficients are expressed in terms of M_i(z) and T_i(z).
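The Gauss-theorem identity behind M_i(z) is easy to verify in the rotationally symmetric case U = (x² + y²)/2, where both the area integral of ∆U and the boundary integral of |∇U| equal 4πz. The sketch below is our own check (not from the paper), comparing a midpoint-grid approximation of the area integral with the exact boundary integral.

```python
import numpy as np

# For U = (x^2 + y^2)/2: Laplacian(U) = 2 and G(z) is the disk of radius
# sqrt(2z), so M(z) = 2 * pi * (2z) = 4*pi*z, while the boundary integral is
# |grad U| * perimeter = sqrt(2z) * 2*pi*sqrt(2z) = 4*pi*z as well.

def M_volume(z, n=1000):
    # Midpoint-grid approximation of the integral of Laplacian(U) over G(z).
    r = np.sqrt(2.0 * z)
    xs = np.linspace(-r, r, n, endpoint=False) + r / n  # cell midpoints
    X, Y = np.meshgrid(xs, xs)
    inside = X**2 + Y**2 <= r**2
    cell = (2.0 * r / n)**2
    return np.sum(inside) * cell * 2.0       # Laplacian(U) = 2 on the disk

def M_boundary(z):
    r = np.sqrt(2.0 * z)
    return r * 2.0 * np.pi * r               # |grad U| = r, perimeter = 2*pi*r

z = 0.8
vol, bnd = M_volume(z), M_boundary(z)        # both approx 4*pi*0.8
```

This kind of direct quadrature is also a practical way to tabulate M_i(z) and T_i(z) numerically for less symmetric potentials.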

Limiting behavior of the asymptotic variance
In this section we identify σ²_f(0) = lim_{ε↓0} σ²_f(ε) using the averaging results of Section 3. In particular, we prove the representation of Theorem 2.3.
Recall that without loss of generality we assume that f̄ = 0, and recall the representation of σ²_f(ε), where the process X_t is the unique strong solution of (3.3). Standard PDE arguments show that for any point (z, i) ∈ I_i that is not a vertex, the PDE (4.1) has a unique solution u (up to constants) of class C^{2+α'}, with α' ∈ (0, α). Let us consider β > 0 small, and for an edge I_i of the graph set I_i^β = {(z, i) ∈ I_i : dist((z, i), ∂I_i) > β} and define τ_i = min{t > 0 : Q(X_t) ∉ I_i^β}. Then, applying Itô's formula with the solution u of (4.1) to the stochastic process X_t, one immediately gets, for any T < ∞, any initial point x that does not belong to any of the separatrices of U, and every I_i, a bound whose right-hand side, after taking expected values, goes to zero as ε ↓ 0, since continuity of the integrands implies that the Riemann integrals are bounded. Hence, the averaging result (4.2) follows immediately.
At the same time, by the results of [7,9], the limiting process Y_t spends time of Lebesgue measure zero at the interior and exterior vertices. For ζ > 0 and a vertex O_j of the graph, let us define the corresponding neighborhoods D_j(±ζ). If O_j is an exterior vertex of Γ, then for every η > 0 there exists ζ > 0 such that, for ε sufficiently small and for all x ∈ D_j(±ζ), we have E_x τ_j(±ζ) < η (Lemma 3.6 in [7]), where τ_j(±ζ) is the first exit time of X_t from D_j(±ζ). The behavior at an interior vertex is similar: Lemma 3.7 of [7] implies that if O_j is an interior vertex, then for every η > 0 and all sufficiently small ζ > 0, E_x τ_j(±ζ) < ηζ for ε sufficiently small and for all x ∈ D_j(±ζ). Recall now that Q(x) = (U(x), i(x)). Then, (4.2) and the fact that the process Y_t spends time of Lebesgue measure zero at all vertices imply the corresponding convergence for any t < ∞. Since the invariant measure of X_t is the Gibbs measure π, the invariant measure of Y_t^ε = Q(X_t), and of Y_t on the graph, is the projection of the Gibbs measure π on Γ. Denoting this invariant measure by µ, we have that for any Borel set γ ⊂ Γ, µ(γ) = π(Q⁻¹(γ)). Thus, by the weak convergence of Theorem 3.3, we have the analogous convergence for any t < ∞. The strong Markov processes X_t and Y_t, on E and Γ respectively, are uniformly mixing. This implies that, by selecting t > 0 large enough, we can make the remainder integrals arbitrarily small. Therefore, we indeed obtain the claimed limit, concluding the proof of Theorem 2.3.

Numerical Simulations
In this section, we explore numerically the behavior of the process under growing perturbations of the drift. In the examples below we fix β = 0.1.
The first example that we study is a simple two-dimensional example where the potential U has a single critical point. In particular, we define U(x, y) = (1/2)x² + (1/2)y². Let C(x, y) = J∇U(x, y), where J is the standard 2×2 antisymmetric matrix, i.e., J_12 = 1 and J_21 = −1. Since it is perhaps more convenient to think in terms of 1/ε rather than ε, we set δ = 1/ε. In Figure 1, we see that the more irreversibility one adds (in the sense of increasing the parameter δ in the perturbation δC(x, y)), the faster the process explores the phase space. Furthermore, the numerical simulations confirm what the theory predicts: namely, we observe fast motion along the level sets of the potential and slow motion in the orthogonal direction.
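The qualitative picture in Figure 1 can be reproduced in a few lines. The sketch below is our own illustration (parameter values are ours): it measures the total angle the trajectory sweeps around the origin, which with the irreversible drift switched on grows roughly like δ·t, reflecting the fast motion along the level sets.

```python
import numpy as np

def simulate(delta, n_steps=20_000, dt=1e-3, beta=0.1, seed=7):
    # Euler-Maruyama for dX = [-grad U + delta * C] dt + sqrt(2 beta) dW with
    # U = (x^2 + y^2)/2 and C = J grad U = (y, -x), as in the first example.
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 0.0])
    angle = 0.0
    for _ in range(n_steps):
        drift = -x + delta * np.array([x[1], -x[0]])
        x_new = x + drift * dt + np.sqrt(2 * beta * dt) * rng.standard_normal(2)
        # accumulate the signed angle swept around the origin in this step
        angle += np.arctan2(x[0]*x_new[1] - x[1]*x_new[0],
                            x[0]*x_new[0] + x[1]*x_new[1])
        x = x_new
    return angle

# With delta = 20 over time horizon 20, the deterministic part of the rotation
# alone sweeps about 400 radians, far more than the reversible chain winds.
```

The absolute swept angle for a large δ dwarfs that of the reversible (δ = 0) chain, which is exactly the fast motion along level sets seen in Figure 1.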
The second example that we consider is the double-well potential U(x, y) = (1/4)(x² − 1)² + (1/2)y². Notice that this potential has two local minima, at (−1, 0) and (1, 0), and a saddle point at (0, 0). In Figures 2 and 3, we plot the x-component of the (x, y) trajectory versus time. We see that the more irreversibility one adds (in the sense of increasing the δ parameter in the perturbation δC(x, y)), the faster the process moves along the level sets of the potential. As Figures 2 and 3 show, in the present case this means faster switches between the two metastable states.
Lastly, to get a sense of the magnitude of the variance reduction, we take the observable f(x, y) = x² + y² (still U(x, y) = (1/4)(x² − 1)² + (1/2)y²) and we use the standard batch means method, see [1], to estimate the asymptotic variance of t⁻¹ ∫_0^t f(X_s) ds; see Figure 4. We present in Table 1 variance estimates for different values of δ = 1/ε and time horizons t. These numerical values are part of the values that were used to plot Figure 4. It is clear that the variance reduction for this particular example is of at least two orders of magnitude. Moreover, by Theorem 2.3, it is clear that for small ε > 0 the variance estimate can also be considered as an approximation to the asymptotic variance of the corresponding estimation problem on the graph; see (2.5).
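For completeness, the batch means estimator used above can be sketched in a few lines (our own implementation of the standard method; all names are ours). Splitting the discretely observed time series into batches, the sample variance of the batch averages, scaled by the batch length, estimates the asymptotic variance:

```python
import numpy as np

def batch_means_avar(fs, n_batches=50, dt=1.0):
    """Batch-means estimate of the asymptotic variance of t^{-1} int_0^t f(X_s) ds
    from a discretely observed trajectory fs[k] ~ f(X_{k*dt})."""
    m = len(fs) // n_batches                 # observations per batch
    fs = np.asarray(fs[: m * n_batches], dtype=float)
    batch_avgs = fs.reshape(n_batches, m).mean(axis=1)
    # Var(batch average) ~ sigma^2 / (batch length in time units)
    return (m * dt) * np.var(batch_avgs, ddof=1)

# Sanity check on i.i.d. standard normal data, where the asymptotic variance
# equals Var(f) = 1 (no autocorrelation):
rng = np.random.default_rng(3)
est = batch_means_avar(rng.standard_normal(200_000), n_batches=100)
```

Applied to the trajectories of f(X_s) for increasing δ, this estimator produces the decreasing variance values reported in Table 1.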