Normalizing constants of log-concave densities

We derive explicit bounds for the computation of normalizing constants $Z$ for log-concave densities $\pi = \exp(-U)/Z$ with respect to the Lebesgue measure on $\mathbb{R}^d$. Our approach relies on a Gaussian annealing combined with recent and precise bounds on the Unadjusted Langevin Algorithm (High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm, A. Durmus and E. Moulines). Polynomial bounds in the dimension $d$ are obtained with an exponent that depends on the assumptions made on $U$. The algorithm also provides a theoretically grounded choice of the annealing sequence of variances. A numerical experiment supports our findings. Results of independent interest on the mean squared error of the empirical average of locally Lipschitz functions are established.


Introduction
Let $U : \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable convex function such that $Z = \int_{\mathbb{R}^d} e^{-U(x)}\,\mathrm{d}x < +\infty$. $Z$ is the normalizing constant of the probability density $\pi$ associated with the potential $U$, defined for $x \in \mathbb{R}^d$ by $\pi(x) = Z^{-1} e^{-U(x)}$. We discuss in this paper a method to estimate $Z$ with polynomial complexity in the dimension $d$.
Computing the normalizing constant is a challenge with applications in Bayesian inference and statistical physics in particular. In statistical physics, $Z$ is better known under the name of partition function or free energy [3], [30]. Free energy differences make it possible to quantify the relative likelihood of different states (microscopic configurations) and are linked to thermodynamic work and heat exchanges. In Bayesian inference, models can be compared through the Bayes factor, which is the ratio of two normalizing constants (see e.g. [43, chapter 7]). This problem has consequently attracted a wealth of contributions; see for example [9, chapter 5], [31], [20], [2], [16], [29], [49] and, for a more specific molecular simulation flavor, [30]. Compared to the large number of proposed methods to estimate $Z$, few theoretical guarantees have been obtained on the output of these algorithms; see below for further references and comments. Our algorithm relies on a sequence of Gaussian densities with increasing variances, combined with the precise bounds of [15].
The paper is organized as follows. The outline of the algorithm is first described, followed by the assumptions made on $U$. Our main results are then stated and compared to previous works on the subject. The theoretical analysis of the algorithm is detailed in Section 2. In Section 3, a numerical experiment is provided to support our theoretical claims. In Section 4, a result of independent interest on the mean squared error of the empirical average of locally Lipschitz functions is established. Finally, the proofs are gathered in Section 5.

Notations and conventions
Denote by $\mathcal{B}(\mathbb{R}^d)$ the Borel $\sigma$-field of $\mathbb{R}^d$. For $\mu$ a probability measure on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ and $f$ a $\mu$-integrable function, denote by $\mu(f)$ the integral of $f$ w.r.t. $\mu$. We say that $\zeta$ is a transference plan of $\mu$ and $\nu$ if it is a probability measure on $(\mathbb{R}^d \times \mathbb{R}^d, \mathcal{B}(\mathbb{R}^d \times \mathbb{R}^d))$ such that for all measurable sets $A$ of $\mathbb{R}^d$, $\zeta(A \times \mathbb{R}^d) = \mu(A)$ and $\zeta(\mathbb{R}^d \times A) = \nu(A)$. We denote by $\Pi(\mu, \nu)$ the set of transference plans of $\mu$ and $\nu$. Furthermore, we say that a couple of $\mathbb{R}^d$-random variables $(X, Y)$ is a coupling of $\mu$ and $\nu$ if there exists $\zeta \in \Pi(\mu, \nu)$ such that $(X, Y)$ is distributed according to $\zeta$. For two probability measures $\mu$ and $\nu$, we define the Wasserstein distance of order $p \geq 1$ as
$$W_p(\mu, \nu) = \left( \inf_{\zeta \in \Pi(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, \mathrm{d}\zeta(x, y) \right)^{1/p}. \tag{1}$$
By [46, Theorem 4.1], for all probability measures $\mu, \nu$ on $\mathbb{R}^d$, there exists a transference plan $\zeta \in \Pi(\mu, \nu)$ at which the infimum in (1) is attained; $\zeta$ is then called an optimal transference plan associated with $W_p$. A function $f : \mathbb{R}^d \to \mathbb{R}$ is Lipschitz if there exists $C \geq 0$ such that for all $x, y \in \mathbb{R}^d$, $|f(x) - f(y)| \leq C \|x - y\|$; we then denote $\|f\|_{\mathrm{Lip}} = \sup_{x \neq y} |f(x) - f(y)|/\|x - y\|$. For $k \in \mathbb{N}$, $C^k(\mathbb{R}^d)$ denotes the set of $k$-times continuously differentiable functions $\mathbb{R}^d \to \mathbb{R}$, with the convention that $C^0(\mathbb{R}^d)$ is the set of continuous functions. Let $\mathbb{N}^* = \mathbb{N} \setminus \{0\}$, $n, m \in \mathbb{N}^*$ and $F : \mathbb{R}^n \to \mathbb{R}^m$ be a twice continuously differentiable function. Denote by $\nabla F$ and $\nabla^2 F$ the Jacobian and the Hessian of $F$ respectively. For $m = 1$, the Laplacian is defined by $\Delta F = \operatorname{Tr} \nabla^2 F$, where $\operatorname{Tr}$ is the trace operator. In the sequel, we take the convention that for $n, p \in \mathbb{N}$ with $n < p$, $\sum_{k=p}^{n} = 0$ and $\prod_{k=p}^{n} = 1$. By convention, $\inf \emptyset = +\infty$, $\sup \emptyset = -\infty$ and, for $j > i$ in $\mathbb{Z}$, $\{j, \ldots, i\} = \emptyset$. For a finite set $E$, $|E|$ denotes its cardinality. For $a, b \in \mathbb{R}$, $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$. Let $\psi, \phi : \mathbb{R}_+ \to \mathbb{R}_+$. We write $\psi = \tilde{O}(\phi)$ if there exist $t_0 > 0$ and $C, c > 0$ such that $\psi(t) \leq C \phi(t) |\log t|^c$ for all $t \in (0, t_0]$. Denote by $B(x, r) = \{ y \in \mathbb{R}^d : \|y - x\| \leq r \}$.
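As a concrete instance of (1): between two Gaussian measures, $W_2$ admits a closed form (a standard fact, used here only as a sanity check and not a result of this paper). A minimal sketch:

```python
import numpy as np

def _psd_sqrt(C):
    """Symmetric square root of a positive semi-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussian(m1, C1, m2, C2):
    """W_2 distance between N(m1, C1) and N(m2, C2), using the Gaussian closed form:
    W_2^2 = ||m1 - m2||^2 + Tr(C1 + C2 - 2 (C1^{1/2} C2 C1^{1/2})^{1/2})."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    C1, C2 = np.asarray(C1, float), np.asarray(C2, float)
    s1 = _psd_sqrt(C1)
    cross = _psd_sqrt(s1 @ C2 @ s1)
    w2sq = np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2.0 * cross)
    return float(np.sqrt(max(w2sq, 0.0)))
```

For instance, in dimension one, $W_2(\mathcal{N}(0,1), \mathcal{N}(3,4)) = \sqrt{3^2 + (1-2)^2} = \sqrt{10}$.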

Presentation of the algorithm
Since $Z < +\infty$ and $U$ is convex, by [7, Lemma 2.2.1] there exist constants $\rho_1 > 0$ and $\rho_2 \in \mathbb{R}$ such that $U(x) \geq \rho_1 \|x\| - \rho_2$ for all $x \in \mathbb{R}^d$. Therefore, by continuity, $U$ has a minimizer $x^\star$. Without loss of generality, it is assumed in the sequel that $x^\star = 0$ and $U(0) = 0$. Let $M \in \mathbb{N}^*$ and let $0 < \sigma_0^2 < \cdots < \sigma_{M-1}^2 < \sigma_M^2 = +\infty$ be an increasing sequence of variances, with the convention $1/\infty = 0$. We define a sequence of probability densities $\{\pi_i\}_{i=0}^M$ for $i \in \{0, \ldots, M\}$ and $x \in \mathbb{R}^d$ by setting
$$U_i(x) = U(x) + \frac{\|x\|^2}{2\sigma_i^2}, \tag{2}$$
$$\pi_i(x) = Z_i^{-1} e^{-U_i(x)}, \qquad Z_i = \int_{\mathbb{R}^d} e^{-U_i(y)} \, \mathrm{d}y. \tag{3}$$
The dependence of $Z_i$ on $\sigma_i^2$ is implicit. By definition, note that $U_M = U$, $Z_M = Z$ and $\pi_M = \pi$. As in the multistage sampling method [22, Section 3.3], we use the decomposition
$$Z = Z_0 \prod_{i=0}^{M-1} \frac{Z_{i+1}}{Z_i}. \tag{4}$$
$Z_0$ is estimated by choosing $\sigma_0^2$ small enough so that $\pi_0$ is sufficiently close to a Gaussian distribution of mean $0$ and covariance matrix $\sigma_0^2 \operatorname{Id}$. For $i \in \{0, \ldots, M-1\}$, the ratio $Z_{i+1}/Z_i$ may be expressed as
$$\frac{Z_{i+1}}{Z_i} = \pi_i(g_i), \tag{5}$$
where $g_i : \mathbb{R}^d \to \mathbb{R}_+$ is defined for all $x \in \mathbb{R}^d$ by
$$g_i(x) = e^{a_i \|x\|^2}, \qquad a_i = \frac{1}{2}\left(\frac{1}{\sigma_i^2} - \frac{1}{\sigma_{i+1}^2}\right). \tag{6}$$
The quantity $\pi_i(g_i)$ is estimated by the Unadjusted Langevin Algorithm (ULA) targeting $\pi_i$. Introduced in [18] and [38] (see also [44]), the ULA algorithm can be described as follows. For $i \in \{0, \ldots, M-1\}$, the (overdamped) Langevin stochastic differential equation (SDE) is given by
$$\mathrm{d}Y_t = -\nabla U_i(Y_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, \tag{7}$$
where $(B_t)_{t \geq 0}$ is a standard $d$-dimensional Brownian motion. The sampling method is based on the Euler discretization of this Langevin diffusion, which defines a discrete-time Markov chain, for $i \in \{0, \ldots, M-1\}$ and $k \in \mathbb{N}$,
$$X_{i,k+1} = X_{i,k} - \gamma_i \nabla U_i(X_{i,k}) + \sqrt{2\gamma_i}\, Z_{i,k+1}, \tag{8}$$
where $(Z_{i,k})_{k \geq 1}$ are i.i.d. sequences of standard Gaussian random variables and $\gamma_i > 0$ is the stepsize. For $i \in \{0, \ldots, M-1\}$, consider the following estimator of $Z_{i+1}/Z_i$,
$$\hat{\pi}_i(g_i) = \frac{1}{n_i} \sum_{k=N_i+1}^{N_i+n_i} g_i(X_{i,k}), \tag{9}$$
where $n_i \geq 1$ is the sample size and $N_i \geq 0$ the burn-in period. We introduce the following assumptions on $U$.
H1. $U : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable and gradient Lipschitz: there exists $L \geq 0$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla U(x) - \nabla U(y)\| \leq L \|x - y\|$.

H2(m). $U : \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable and satisfies, for all $x, y \in \mathbb{R}^d$, $\langle \nabla U(x) - \nabla U(y), x - y \rangle \geq m \|x - y\|^2$.

H3. The function $U$ is three times continuously differentiable and there exists $\tilde{L} \geq 0$ such that for all $x, y \in \mathbb{R}^d$, $\|\nabla^2 U(x) - \nabla^2 U(y)\| \leq \tilde{L} \|x - y\|$.

The strongly convex case (H2(m) with $m > 0$) is considered in Section 2.1 and the convex case (H2(m) with $m = 0$) is dealt with in Section 2.2. Assuming H1 and H2(m) for $m \geq 0$, for $i \in \{0, \ldots, M\}$, $U_i$ defined in (2) is $L_i$-gradient Lipschitz and $m_i$-strongly convex if $m_i > 0$ (and convex if $m_i = 0$), where
$$L_i = L + \frac{1}{\sigma_i^2}, \qquad m_i = m + \frac{1}{\sigma_i^2}. \tag{13}$$
Define also the useful quantity
$$\kappa_i = \frac{2 m_i L_i}{m_i + L_i}. \tag{14}$$
H3 makes it possible to obtain tighter bounds on the mean squared error of $\hat{\pi}_i(g_i)$ defined in (9). Denote by $S = \left( M, \{\sigma_i^2\}_{i=0}^{M-1}, \{\gamma_i\}_{i=0}^{M-1}, \{N_i\}_{i=0}^{M-1}, \{n_i\}_{i=0}^{M-1} \right)$ (16) the set of simulation parameters, and by $\hat{Z}$ the following estimator of $Z$,
$$\hat{Z} = (2\pi\sigma_0^2)^{d/2} (1 + \sigma_0^2 m)^{-d/2} \prod_{i=0}^{M-1} \hat{\pi}_i(g_i), \tag{17}$$
where $\hat{\pi}_i(g_i)$ is defined in (9). The dependence of $\hat{Z}$ on $S$ is implicit. Note that $\hat{Z}$ is a biased estimator of $Z$ because $Z_0$ is approximated by $(2\pi\sigma_0^2)^{d/2}(1 + \sigma_0^2 m)^{-d/2}$. We define the cost of the algorithm as the total number of iterations performed by the $M$ Markov chains (8),
$$\operatorname{Cost}(S) = \sum_{i=0}^{M-1} (N_i + n_i). \tag{18}$$
Observe that each step of the Markov chain takes time linear in $d$. We state below a simplified version of our results; explicit bounds are given in Theorems 5, 6, 12 and 13.

Theorem 1. Assume H1 and H2(m) for $m \geq 0$. Let $\mu, \varepsilon \in (0, 1)$. There exists an explicit choice of the simulation parameters $S$ such that the estimator $\hat{Z}$ defined in (17) satisfies, with probability at least $1 - \mu$,
$$|\hat{Z}/Z - 1| \leq \varepsilon. \tag{19}$$
Moreover, the cost of the algorithm (18) is upper-bounded by a polynomial in $d$, $\varepsilon^{-1}$ and $\mu^{-1}$, whose degree in $d$ depends on which of the assumptions hold (H1 and H2(m) for $m \geq 0$; H1 and H2(m) for $m > 0$; additionally H3). By the median trick (see e.g. [27, Lemma 6.1] or [36]), the dependence of the cost on $\mu$ can be reduced to a logarithmic factor; see Corollaries 7 and 14.
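The median trick mentioned above is generic: run $k$ independent copies of an estimator that is $\varepsilon$-accurate with probability at least $3/4$ and report the median; by Hoeffding's inequality, the median fails only if at least half of the copies fail, which has probability at most $e^{-k/8}$, so $k = O(\log(1/\mu))$ suffices for confidence $1 - \mu$. A minimal sketch (the `estimate` callable is a hypothetical stand-in for any such randomized estimator):

```python
import numpy as np

def median_trick(estimate, k, rng):
    """Run k independent copies of a randomized estimator and return their median.
    If each copy is eps-accurate with probability >= 3/4, the median is eps-accurate
    with probability >= 1 - exp(-k/8) (Hoeffding bound on the number of failures)."""
    return float(np.median([estimate(rng) for _ in range(k)]))
```

Even for a heavy-tailed base estimator, the median of a few hundred copies concentrates sharply around the true value.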
It is interesting to compare these complexity bounds with previously reported results. In [33] and [5] (see also [12]), the authors propose to use sequential Monte Carlo (SMC) samplers to estimate the normalizing constant $Z$ of a probability distribution $\pi$. In [5], $\pi$ is supported on a compact set $K$ included in $\mathbb{R}^d$ and has a product form, $\pi(x) \propto \prod_{j=1}^d g(x^j)$ for $x = (x^1, \ldots, x^d) \in K$. [5, Theorem 3.2] states that there exists an estimator $\hat{Z}$ of $Z$ such that $\lim_{d \to +\infty} \mathbb{E}[|\hat{Z}/Z - 1|^2] = C/N$, where $N$ is the number of particles and $C$ depends on $g$ and on the parameters of the SMC (choice of the Markov kernel and of the annealing schedule). With our definition (18), the computational cost of the SMC algorithm is $O(Nd)$ (there are $d$ phases and $N$ particles per phase). Obtaining an estimator $\hat{Z}$ satisfying (19) thus implies a cost of order $d\mu^{-1}O(\varepsilon^{-2})$. However, the product form of the density $\pi$ is restrictive, the result is only asymptotic in $d$, and the state space is assumed to be compact. [13] combines SMC with a multilevel approach and [26] establishes results on a multilevel particle filter.
Our complexity results can also be related to the computation of the volume of a convex body $K$ (a compact convex set with non-empty interior) in $\mathbb{R}^d$. This problem has attracted a lot of attention in computer science, starting with the breakthrough of [17] up to the most recent results of [10]. Define for $x \in \mathbb{R}^d$, $\pi(x) = \mathbf{1}_K(x)/\operatorname{Vol}(K)$; under assumptions of the type $B(0,1) \subset K \subset B(0,R)$, randomized algorithms with complexity polynomial in $d$ have been obtained for this target. Nonequilibrium methods have recently been developed and studied to compute free energy differences or Bayes factors, see [24] and [30, Chapter 4]. They are based on an inhomogeneous diffusion evolving (for example) from $t = 0$ to $t = 1$, such that $\pi_0$ and $\pi_1$ are the stationary distributions at $t = 0$ and $t = 1$ respectively. Recently, [1] provided an asymptotic and non-asymptotic analysis of the bias and variance of the estimators associated with this methodology. The main aim of the present paper is to obtain polynomial complexity, and inspection of the results of [1] suggests a cost of order at most $d^{15}$ to compute an estimator $\hat{Z}$ satisfying (19); this cost may however be an artifact of the proof strategy.
Multistage sampling type algorithms are widely used and known under different names: multistage sampling [45], (extended) bridge sampling [22], annealed importance sampling (AIS) [34], thermodynamic integration [37], power posteriors [4]. For the stability and accuracy of the method, the choice of the parameters (in our case $\{\sigma_i^2\}_{i=0}^{M-1}$) is crucial and known to be difficult. Indeed, this issue has been pointed out in several articles, under the names of tuning tempered transitions [4], temperature placement [21], and choice of the annealing sequence.

Theoretical analysis of the algorithm
In this Section, we analyse the algorithm outlined in Section 1. The strongly convex and convex cases are considered in Sections 2.1 and 2.2, respectively. The choice of the simulation parameters $S$ depends explicitly on the (strong) convexity of $U$. Throughout this Section, we assume that $L > m$; note that if $L = m$, $\pi$ is a Gaussian density and $Z$ is known. For $M \in \mathbb{N}$ and $i \in \{0, \ldots, M-1\}$, we first provide an upper bound on the mean squared error $\mathrm{MSE}_i = \mathbb{E}[(\hat{\pi}_i(g_i) - \pi_i(g_i))^2]$ of $\hat{\pi}_i(g_i)$, where $\pi_i(g_i)$ and $\hat{\pi}_i(g_i)$ are given by (5) and (9) respectively. $\mathrm{MSE}_i$ can be decomposed as the sum of the squared bias and the variance,
$$\mathrm{MSE}_i = \left(\mathbb{E}[\hat{\pi}_i(g_i)] - \pi_i(g_i)\right)^2 + \operatorname{Var}\left(\hat{\pi}_i(g_i)\right). \tag{21}$$
Propositions 2 and 3 give upper bounds on the squared bias and Proposition 4 on the variance. The results are based on the non-asymptotic bounds in Wasserstein distance for a strongly convex potential obtained in [15] (see also [11], [14]). We introduce the conditions (22) on the stepsize $\gamma_i$ used in the Euler discretization and on the variance $\sigma_{i+1}^2$, where by convention $1/0 = +\infty$. Note that the condition on $\sigma_{i+1}^2$ is equivalent to $a_i \in \left(0,\, m_i/\{4(d+4)\} \wedge (2\sigma_i^2)^{-1}\right]$, where $a_i$ is defined in (6) and $m_i$ in (13). Assuming that $\gamma_i$ and $\sigma_{i+1}^2$ satisfy (22), we define positive quantities, in terms of $m_i$, $L_i$ and $\kappa_i$ defined in (13) and (14) respectively, which appear in the bounds below.

Proposition 2. Assume H1 and H2(m) for some $m \geq 0$. For $N_i \in \mathbb{N}$, $n_i \in \mathbb{N}^*$ and $\gamma_i$, $\sigma_{i+1}^2$ satisfying (22), the squared bias of $\hat{\pi}_i(g_i)$ satisfies an explicit bound with leading term of order $\gamma_i$.

Proof. The proof is postponed to Section 5.1.
The squared bias can thus be controlled by adjusting the parameters $\gamma_i$, $n_i$ and $N_i$. If $U$ satisfies H3, the bound on the squared bias can be improved.

Proposition 3. Assume H1, H2(m) for some $m \geq 0$, and H3. For $N_i \in \mathbb{N}$, $n_i \in \mathbb{N}^*$ and $\gamma_i$, $\sigma_{i+1}^2$ satisfying (22), the squared bias of $\hat{\pi}_i(g_i)$ satisfies an improved explicit bound.

Proof. The proof is postponed to Section 5.1.
Note that the leading term is now of order $\gamma_i^2$ instead of $\gamma_i$. We now consider the variance term in (21); Proposition 4 below bounds the variance of $\hat{\pi}_i(g_i)$ under the same conditions.
Proof. The proof is postponed to Section 5.1.
Proof. The proof is postponed to Section 5.3.3.
Theorem 6. Assume H1, H2(m) for $m > 0$ and H3, and let $\mu, \varepsilon \in (0, 1)$. There exists an explicit choice of the simulation parameters $S$ (16) such that the estimator $\hat{Z}$ defined in (17) satisfies $|\hat{Z}/Z - 1| \leq \varepsilon$ with probability at least $1 - \mu$, and the cost (18) of the algorithm is upper-bounded by an explicit quantity involving the constant $C$ defined in (29).
Proof. The proof is postponed to Section 5.3.3.
The proofs of Theorems 5 and 6 and of Corollary 7 rely on several lemmas which are stated below; these lemmas explain how the simulation parameters $S$ must be chosen. The details of the proofs are gathered in Section 5.3. We first set $\sigma_0^2$; this choice is justified by the following result.

Lemma 8. Under H1 and H2(m) for $m \geq 0$, the relative error made when approximating $Z_0$ by $(2\pi\sigma_0^2)^{d/2}(1 + \sigma_0^2 m)^{-d/2}$ is explicitly bounded.

Proof. The proof is postponed to Section 5.3.1.
Given a choice of $S$, define the event $A_{S,\varepsilon}$ (33) on which the estimators are accurate. On $A_{S,\varepsilon}$, using Lemma 8, (4) and (17), we obtain a bound on $|\hat{Z}/Z - 1|$. It remains to choose $S$ so as to approximately minimize the cost defined in (18) under the constraint $\mathbb{P}(A_{S,\varepsilon}) \geq 1 - \mu$. We define the positive increasing sequence of variances by $\sigma_{i+1}^2 = \varsigma_s(\sigma_i^2)$ (34), where $\varsigma_s$ is an explicit increasing map on a bounded interval and $\varsigma_s(t) = +\infty$ otherwise. The subscript $s$ in $\varsigma_s$ stresses that this choice is valid for the strongly convex case; it will be different in the convex case. With this choice of $(\sigma_i^2)_{i \geq 0}$, the number of phases $M$ is defined by the stopping rule (36). Lemma 9 then gives conditions on the remaining parameters ensuring $\mathbb{P}(A_{S,\varepsilon}) \geq 1 - \mu$.

Proof. The proof is postponed to Section 5.2.

To show the existence of $\gamma_i$, $n_i$, $N_i$ satisfying the conditions of Lemma 9, we apply Propositions 2 to 4 for each $i \in \{0, \ldots, M-1\}$. We then obtain the following lemmas (Lemmas 10 and 11), which provide explicit choices of $\gamma_i$, $n_i$ and $N_i$ under which the conditions i)-ii) of Lemma 9 are satisfied.
Proof. The proof is postponed to Section 5.3.2.
We have a similar result under the additional assumption H3.

Convex potential U
We now consider the convex case. The annealing schedule for the variances differs from the strongly convex case and is defined in (54). In particular, the stopping criterion for the annealing process is distinct from the case where $U$ is strongly convex and relies on a truncation argument; more precisely, it relies on a concentration theorem for log-concave densities [39, Theorem 3.1], which holds for $\alpha \in (0, 1)$. Given a choice of $M$ and of $J$ defined by (48), on the event $A_{S,\varepsilon}$ defined in (33) with $g_{M-1}$ replaced by $\bar{g}_{M-1}$, and using (32) (with $m = 0$) and (50), we obtain a bound on $|\hat{Z}/Z - 1|$, where $\hat{Z}$ is defined in (17) and $\rho_1$, $\rho_2$ are defined in (15).

Theorem 13. Assume H1, H2(m) for $m \geq 0$ and H3. Let $\varepsilon, \mu \in (0, 1)$.

The results of Section 4 give an upper bound on the mean squared error of the estimators in this setting. The annealing sequence is defined by $\sigma_{i+1}^2 = \varsigma_c(\sigma_i^2)$ (54), where $\varsigma_c$ is explicit on a bounded interval and $\varsigma_c(t) = +\infty$ otherwise; $M$ is defined in this Section by the corresponding stopping rule. By (55), for $t \in (\sigma_0^2, D^2)$, $\varsigma_c(t) \geq \{(4d + 16)/(4d + 15)\}\, t$, which implies $M < +\infty$. The following lemmas are the counterparts of Lemmas 10 and 11; they specify the choice of the simulation parameters so that the conditions i)-ii) of Lemma 9 are satisfied, with $g_{M-1}$ replaced by $\bar{g}_{M-1}$. We have a similar result under the additional assumption H3.

Numerical experiments
For the following numerical experiments, the code and data are available at https://github.com/nbrosse/normalizingconstant. We first test our algorithm by computing the logarithm of the normalizing constant of a multivariate Gaussian distribution in dimension $d \in \{10, 25, 50\}$, with mean $0$ and inverse covariance matrix $\operatorname{diag}(2, 1^{\otimes(d-1)})$. We set $\varepsilon = \mu = 0.1$. The number of phases $M$ of the algorithm and the variances $\{\sigma_i^2\}_{i=0}^{M-1}$ are chosen according to the formulas (34) and (36). For each phase of the algorithm, the step size $\gamma_i$ is set equal to $10^{-2}(m_i + L_i)^{-1}$, the burn-in period $N_i$ to $10^4$ and the number of samples $n_i$ to $10^5$, where $m_i$, $L_i$ are defined in (13). We carry out 10 independent runs of the algorithm and report the boxplots in Figure 1. The true values of the logarithm of the normalizing constants are known and displayed by the red points in Figure 1.
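The experiment above can be reproduced in miniature. The sketch below is a toy version with an ad hoc annealing schedule and sample sizes (not the tuned choices (34) and (36)); it assumes the standard annealed quantities $U_i = U + \|\cdot\|^2/(2\sigma_i^2)$, $g_i(x) = \exp(a_i\|x\|^2)$ with $a_i = (\sigma_i^{-2} - \sigma_{i+1}^{-2})/2$ (and $\sigma_M^2 = \infty$ for the last phase), together with the Gaussian approximation $(2\pi\sigma_0^2)^{d/2}(1+\sigma_0^2 m)^{-d/2}$ of $Z_0$:

```python
import numpy as np

def ula_chain(grad_U, x0, gamma, n_steps, rng):
    """Euler discretization of the Langevin diffusion dY = -grad_U(Y) dt + sqrt(2) dB."""
    x = np.array(x0, dtype=float)
    out = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        out[k] = x
    return out

def estimate_log_Z(grad_U, d, sigmas2, m=0.0, gamma=1e-2, n_burn=1000, n_samp=5000, seed=0):
    """Toy multistage estimate of log Z = log int exp(-U(x)) dx.
    sigmas2 is an increasing annealing schedule sigma_0^2 < ... < sigma_{M-1}^2;
    the last phase corresponds to sigma_M^2 = infinity (convention 1/inf = 0).
    Z_0 is approximated by (2 pi sigma_0^2)^{d/2} (1 + sigma_0^2 m)^{-d/2}."""
    rng = np.random.default_rng(seed)
    log_Z = 0.5 * d * (np.log(2.0 * np.pi * sigmas2[0]) - np.log1p(sigmas2[0] * m))
    M = len(sigmas2)
    for i in range(M):
        s2 = sigmas2[i]
        inv_next = 1.0 / sigmas2[i + 1] if i + 1 < M else 0.0
        a_i = 0.5 * (1.0 / s2 - inv_next)
        grad_Ui = lambda x, s2=s2: grad_U(x) + x / s2   # U_i = U + ||x||^2 / (2 s2)
        xs = ula_chain(grad_Ui, np.zeros(d), gamma, n_burn + n_samp, rng)[n_burn:]
        log_Z += np.log(np.mean(np.exp(a_i * np.sum(xs ** 2, axis=1))))  # log(Z_{i+1}/Z_i)
    return log_Z
```

For a standard Gaussian potential $U(x) = \|x\|^2/2$ in dimension 2 (so $m = 1$ and $\log Z = \log 2\pi \approx 1.84$), `estimate_log_Z(lambda x: x, 2, [0.25, 0.5, 1.0, 2.0, 4.0, 8.0], m=1.0)` recovers $\log Z$ up to Monte Carlo and discretization error.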
We then illustrate our methodology on the computation of Bayesian model evidence; see [20]. The posterior distribution conditional on a model $\mathrm{M}_i$ can be considered for $i \in \{0, \ldots, l\}$, and the evidence $p(y|\mathrm{M}_i)$ of the model $\mathrm{M}_i$ is defined as the normalizing constant of this posterior distribution (65). The Bayes factor $\mathrm{BF}_{ij}$ between two models $\mathrm{M}_i$ and $\mathrm{M}_j$ is then defined as the ratio of the evidences [43, Section 7.2.2], $\mathrm{BF}_{ij} = p(y|\mathrm{M}_i)/p(y|\mathrm{M}_j)$. In the following experiments, we estimate the log evidence $\log(p(y|\mathrm{M}_i))$. For ease of notation, the dependence of the parameter $\theta$ and of the dimension $d$ of the state space on the model $\mathrm{M}$ is left implicit in the sequel.
In the examples we consider, the potential $U^{(\mathrm{M})}$ associated with model $\mathrm{M}$ satisfies H1, H2, H3 and has a unique minimizer $\theta^\star \in \mathbb{R}^d$. The algorithm described in Section 2 can thus be applied to $U^{(\mathrm{M})}$. For each example, two different models are considered and $U^{(\mathrm{M})}$ is written $U^{(k)}$ for $k = 1, 2$.
The numerical experiments are carried out on a Gaussian linear regression and on a logistic regression, following the experimental setup of [20, Section 4], which is now considered a classical benchmark. The linear regression is conducted on $p = 42$ specimens of radiata pine [47]. The response variable $y \in \mathbb{R}^p$ is the maximum compression strength parallel to the grain. The explanatory variables are the density $x \in \mathbb{R}^p$ and the density adjusted for resin content $z \in \mathbb{R}^p$; $x$ and $z$ are centered. The covariates of the first model $\mathrm{M}_1$, $X^{(1)} \in \mathbb{R}^{p \times 2}$, are composed of an intercept and $x$, while the covariates of the second model $\mathrm{M}_2$, $X^{(2)} \in \mathbb{R}^{p \times 2}$, are composed of an intercept and $z$. For $k = 1, 2$, the likelihood is Gaussian, with precision parameter $\lambda = 10^{-5}$. For the two models, the parameter $\theta$ follows the same Gaussian prior of mean $(3000, 185)$ and a fixed inverse covariance matrix; the simulation parameters are set per phase in terms of $m_i$, $L_i$, $\kappa_i$ defined in (13) and (14). The experiments are repeated 10 times and the boxplots for each model $\mathrm{M}$ are plotted in Figure 2. Note that for this Gaussian model, the log evidence is known and is displayed by the red points in Figure 2.
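The red reference values for this example rest on the fact that the evidence of a Gaussian linear model is available in closed form. The sketch below assumes the usual conjugate formulation $y \mid \theta \sim \mathcal{N}(X\theta, \lambda^{-1} I_p)$, $\theta \sim \mathcal{N}(\mu_0, \Sigma_0)$ — an assumption, since the likelihood is only partially specified above — under which $y$ is marginally Gaussian:

```python
import numpy as np

def log_evidence_gaussian(y, X, mu0, Sigma0, lam):
    """log p(y) for the conjugate model y | theta ~ N(X theta, I_p / lam),
    theta ~ N(mu0, Sigma0); marginally y ~ N(X mu0, I_p / lam + X Sigma0 X')."""
    p = y.size
    S = np.eye(p) / lam + X @ Sigma0 @ X.T
    r = y - X @ mu0
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + r @ np.linalg.solve(S, r))
```

In the scalar case with $X = 1$, $\mu_0 = 0$, $\Sigma_0 = 1$, $\lambda = 1$, one has $y \sim \mathcal{N}(0, 2)$, which is a convenient hand-checkable sanity test.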
With the same parameters for the algorithm, we perform 10 independent runs at each phase to measure the variability of each estimator $\hat{\pi}_i(g_i)$ defined in (9). The result is plotted in Figure 3 for the model $\mathrm{M}_1$. The variability of the last estimator $\hat{\pi}_{M-1}(g_{M-1})$ is much higher, which underlines the specific role of the last phase of the algorithm.
The logistic regression is performed on the Pima Indians dataset. In this case, $y \in \{0, 1\}^p$ is a vector of diabetes indicators for $p = 532$ Pima Indian women, and the potential predictors for diabetes are: number of pregnancies $\mathrm{NP} \in \mathbb{R}^p$, plasma glucose concentration $\mathrm{PGC} \in \mathbb{R}^p$, diastolic blood pressure $\mathrm{BP} \in \mathbb{R}^p$, triceps skin fold thickness $\mathrm{TST} \in \mathbb{R}^p$, body mass index $\mathrm{BMI} \in \mathbb{R}^p$, diabetes pedigree function $\mathrm{DP} \in \mathbb{R}^p$ and age $\mathrm{AGE} \in \mathbb{R}^p$. These covariates are centered and standardized. The covariates of the first model $\mathrm{M}_1$ are $X^{(1)} = (\text{intercept}, \mathrm{NP}, \mathrm{PGC}, \mathrm{BMI}, \mathrm{DP}) \in \mathbb{R}^{p \times 5}$ and the covariates of the second model $\mathrm{M}_2$ are $X^{(2)} = (\text{intercept}, \mathrm{NP}, \mathrm{PGC}, \mathrm{BMI}, \mathrm{DP}, \mathrm{AGE}) \in \mathbb{R}^{p \times 6}$, where intercept is the intercept of the regressions. For $k = 1, 2$, the likelihood is the logistic likelihood based on the rows of $X^{(k)}$. For the two models, the prior on $\theta$ is Gaussian, of mean $0$ and inverse covariance matrix $\tau \operatorname{Id}$ with $\tau = 0.01$. For $k = 1, 2$, $U^{(k)}$ is $\tau$-strongly convex and $L^{(k)}$-gradient Lipschitz, where $L^{(k)} = \lambda_{\max}([X^{(k)}]^T X^{(k)})/4 + \tau$ and $\lambda_{\max}([X^{(k)}]^T X^{(k)})$ is the maximal eigenvalue of $[X^{(k)}]^T X^{(k)}$. We set $\varepsilon = \mu = 0.1$. The algorithm to estimate $\log(p(y|\mathrm{M}))$ described in Section 2.1 is applied with the following modifications. The number of phases is decreased: the recurrence for the variances $\{\sigma_i^2\}_{i=0}^{M-1}$ is redefined by $\sigma_{i+1}^2 = \varsigma_s^5(\sigma_i^2)$ (the five-fold composition of $\varsigma_s$) as long as the stopping condition (36) is not fulfilled. For $i \in \{1, \ldots, 30\}$, the burn-in period $N_i$ is set to $10^4$, the number of samples $n_i$ to $10^6$ and the step size $\gamma_i$ to $10^{-2}(m_i + L_i)^{-1}$, where $m_i$, $L_i$ are defined in (13); for $i > 30$, the number of samples $n_i$ is set to $10^5$ and the step size $\gamma_i$ to $10^{-1}(m_i + L_i)^{-1}$. We compare our results with different methods reviewed in [20] and implemented in [48]: the Laplace method (L), Laplace at the Maximum a Posteriori (L-MAP), Chib's method (C), Annealed Importance Sampling (AIS) and Power Posteriors (PP).
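The potential and the constants used above can be sketched as follows; the exact form of $U^{(k)}$ is assumed here to be the standard regularized logistic negative log-posterior (up to an additive constant):

```python
import numpy as np

def logistic_potential(theta, X, y, tau):
    """U(theta) = sum_i [log(1 + exp(x_i' theta)) - y_i x_i' theta] + (tau/2) ||theta||^2,
    a standard form of the negative log-posterior, assumed here."""
    z = X @ theta
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * tau * (theta @ theta)

def logistic_grad(theta, X, y, tau):
    """Gradient of logistic_potential: X' (sigmoid(X theta) - y) + tau * theta."""
    z = X @ theta
    return X.T @ (1.0 / (1.0 + np.exp(-z)) - y) + tau * theta

def gradient_lipschitz_constant(X, tau):
    """L = lambda_max(X'X)/4 + tau: the logistic Hessian is X' D X with D <= I/4."""
    return np.linalg.eigvalsh(X.T @ X)[-1] / 4.0 + tau
```

The potential is $\tau$-strongly convex because the logistic term is convex and the ridge term contributes $\tau \operatorname{Id}$ to the Hessian.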
The experiments are repeated 10 times and the boxplots for each model M and each method are plotted in Figure 4.
With the same parameters for the algorithm, we perform 10 independent runs at each phase to measure the variability of each estimator $\hat{\pi}_i(g_i)$ defined in (9), and display the results in Figure 5 for the model $\mathrm{M}_1$.
The final example we address is a Bayesian analysis of a finite mixture of Gaussian distributions, see [33, Section 4.2]; we aim at estimating the log evidence of the posterior distribution. Note that this model does not fit our assumptions because the potential $U$ is neither continuously differentiable on $\mathbb{R}^d$ nor convex. Nevertheless, we heuristically apply our algorithm to a closely related model, specified by its likelihood $p(y|\theta)$ for a vector of observations $y = (y_1, \ldots, y_p) \in \mathbb{R}^p$. The prior distributions are set following the recommendations of [33, Section 4.2.1] and [42]. For $j \in \{1, \ldots, 4\}$, $\theta_j$ is drawn from a Gaussian distribution of mean $\xi = 1.35$ and inverse variance $\varsigma = 7.6 \times 10^{-3}$; $\lambda$ is set equal to $0.03$. The observations $y \in \mathbb{R}^{100}$ are 100 simulated data points from an equally weighted mixture of four Gaussian densities with means $(-3, 0, 3, 6)$ and standard deviations $0.55$, taken from [25]. Define $\ell : \mathbb{R}^4 \to \mathbb{R}$ for $\theta = (\theta_1, \ldots, \theta_4) \in \mathbb{R}^4$ by $\ell(\theta) = -\log(p(y|\theta)p(\theta))$. The optim function of R [41] gives a local minimum at $\theta^\star \approx (1.76562^{\otimes 4})$. Define then the potential $U : \mathbb{R}^4 \to \mathbb{R}$ for $\theta \in \mathbb{R}^4$ by $U(\theta) = \ell(\theta + \theta^\star) - \ell(\theta^\star)$. Set $\varepsilon = \mu = 0.1$, $m = \varsigma$ and $L = 1$. As for the logistic regression, to decrease the running time of the algorithm, the recurrence for the variances $\{\sigma_i^2\}_{i=0}^{M-1}$ is defined by $\sigma_{i+1}^2 = \varsigma_s^5(\sigma_i^2)$ as long as the stopping condition (36) is not fulfilled. For each phase, the step size $\gamma_i$ is set equal to $10^{-1}(\kappa_i \sigma_i^2 m_i)/(d L_i^2)$, the burn-in period $N_i$ to $10^4$ and the number of samples $n_i$ to $10^5$, where $m_i$, $L_i$, $\kappa_i$ are defined in (13) and (14). For comparison purposes, we run the same algorithm using the Metropolis Adjusted Langevin Algorithm (MALA) instead of ULA to estimate $\hat{\pi}_i(g_i)$ at each phase, with step size $\gamma_i$ equal to $(\kappa_i \sigma_i^2 m_i)/(d L_i^2)$. The results are displayed in Figure 6; the red point indicates the mean of our algorithm using MALA.

Mean squared error for locally Lipschitz functions
In this Section, we extend the results of [15, Section 3] to locally Lipschitz functions. This Section is of independent interest; only Propositions 17 and 20 are used in Section 5. Let $U : \mathbb{R}^d \to \mathbb{R}$ be a continuously differentiable function. Consider the target distribution $\pi$ with density $x \mapsto e^{-U(x)}/\int_{\mathbb{R}^d} e^{-U(y)}\,\mathrm{d}y$ w.r.t. the Lebesgue measure. We deal with the problem of estimating $\int_{\mathbb{R}^d} f(x)\,\mathrm{d}\pi(x)$ for locally Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$ by the ULA algorithm, defined for $k \in \mathbb{N}$ by
$$X_{k+1} = X_k - \gamma_{k+1} \nabla U(X_k) + \sqrt{2\gamma_{k+1}}\, Z_{k+1}, \tag{66}$$
where $(Z_k)_{k \geq 1}$ is an i.i.d. sequence of $d$-dimensional Gaussian vectors with zero mean and identity covariance, and $(\gamma_k)_{k \geq 1}$ is a sequence of positive step sizes, which can either be held constant or be chosen to decrease to $0$. For $n, p \in \mathbb{N}$, denote by $Q_\gamma^{p,n}$ the composition of the Markov kernels of steps $p$ to $n$ of (66), and consider the Markov kernel $R_\gamma$ given for all $A \in \mathcal{B}(\mathbb{R}^d)$ and $x \in \mathbb{R}^d$ by the law of one step of (66) started at $x$, with the convention that for $n, p \geq 0$ with $n < p$, $Q_\gamma^{p,n}$ and $Q_\gamma^{0,0}$ are the identity operator.
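To fix ideas: with a constant step size, the estimator studied in this Section is the empirical average of $f$ along a single ULA trajectory after a burn-in. A minimal sketch:

```python
import numpy as np

def ula_average(grad_U, f, d, gamma, n_burn, n_samples, seed=0):
    """Estimate pi(f) by the empirical average of f along one ULA trajectory,
    discarding the first n_burn iterates (constant step size for simplicity)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    acc = 0.0
    for k in range(n_burn + n_samples):
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(d)
        if k >= n_burn:
            acc += f(x)
    return acc / n_samples
```

For $U(x) = \|x\|^2/2$ and $f(x) = \|x\|^2$, which is locally but not globally Lipschitz, $\pi(f) = d$; the estimator carries the $O(\gamma)$ discretization bias quantified by the bounds of this Section.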
For any initial distribution $\mu_0$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, $\mathbb{P}_{\mu_0}$ and $\mathbb{E}_{\mu_0}$ denote respectively the probability and the expectation associated with the sequence of Markov kernels (68) and the initial distribution $\mu_0$ on the canonical space $((\mathbb{R}^d)^{\mathbb{N}}, \mathcal{B}(\mathbb{R}^d)^{\otimes \mathbb{N}})$, and $(X_k)_{k \in \mathbb{N}}$ denotes the canonical process. Let $f : \mathbb{R}^d \to \mathbb{R}$ and consider the following assumption.

L1. 1. The function $f$ is locally Lipschitz: there exists a function $L_f$ controlling the Lipschitz constant of $f$ on bounded sets. 2. A growth condition on $L_f$ holds, ensuring that the moments appearing below are finite.

Under L1, we study the approximation of $\int_{\mathbb{R}^d} f(y)\,\pi(\mathrm{d}y)$ by the weighted average estimator $\hat{\pi}_n^N(f)$, where $N \geq 0$ is the length of the burn-in period and $n \geq 1$ is the number of samples. The Mean Squared Error (MSE) of $\hat{\pi}_n^N(f)$ is defined by $\mathrm{MSE}_f(x, N, n) = \mathbb{E}_x[(\hat{\pi}_n^N(f) - \pi(f))^2]$ and can be decomposed as
$$\mathrm{MSE}_f(x, N, n) = \left( \mathbb{E}_x[\hat{\pi}_n^N(f)] - \pi(f) \right)^2 + \operatorname{Var}_x\left( \hat{\pi}_n^N(f) \right). \tag{73}$$
The analysis of $\mathrm{MSE}_f(x, N, n)$ is similar to [15, Section 3]. First, the squared bias in (73) is bounded, in terms of quantities $u_n^{(1)}(\gamma)$ and $u_n^{(2)}(\gamma)$ defined in (78)-(79), where $\kappa$ is given by (14).

Proposition 17. Assume H1 and H2(m) for $m > 0$. Let $f : \mathbb{R}^d \to \mathbb{R}$ satisfy L1. Let $(\gamma_k)_{k \geq 1}$ be a nonincreasing sequence with $\gamma_1 \leq 1/(m + L)$. Let $x^\star$ be the unique minimizer of $U$ and let $(X_n)_{n \geq 0}$ be given by (66) and started at $x \in \mathbb{R}^d$. Then for all $N \geq 0$, $n \geq 1$, the squared bias is bounded explicitly in terms of $u_n^{(1)}(\gamma)$ given in (78) and $w_n(\gamma)$, which is equal to $u_n^{(2)}(\gamma)$ defined by (79) (or to a refined quantity under additional assumptions).

Proof. For all $k \in \{N + 1, \ldots, N + n\}$, let $\xi_k$ be the optimal transference plan between $\delta_x Q_\gamma^k$ and $\pi$ for $W_2$. By the Jensen and Cauchy-Schwarz inequalities and L1, the squared bias is bounded by a weighted sum of the $W_2^2(\delta_x Q_\gamma^k, \pi)$. The proof then follows from [15, Theorems 5 and 8].
To deal with the variance term in (73), notice that for all $x \in \mathbb{R}^d$, $R_\gamma(x, \cdot)$ defined in (68) is a Gaussian distribution with mean $x - \gamma \nabla U(x)$ and covariance matrix $2\gamma\, \mathrm{I}_d$, so that the Gaussian Poincaré inequality (82) applies to it. First consider the decomposition of $\hat{\pi}_n^N(f) - \mathbb{E}_x[\hat{\pi}_n^N(f)]$ as a sum of martingale increments, where $(\mathcal{G}_n)_{n \geq 0}$ is the natural filtration associated with the Markov chain $(X_n)_{n \geq 0}$. This implies that the variance may be decomposed as the sum (83), with the convention $\Phi_{n,N+n+1}^N = 0$. Note that for $k \in \{N, \ldots, N + n - 1\}$, by the Markov property, $\Phi_{n,k}^N$ can be expressed through the kernels $Q_\gamma$, and $\Psi_n^N(X_N) = \mathbb{E}_x^{\mathcal{G}_N}[\hat{\pi}_n^N(f)]$. With these notations, (83) may be equivalently expressed as a sum of expectations of conditional variances. Now, for $k = N + n, \ldots, N + 1$, we apply the Gaussian Poincaré inequality (82) to the functions $\Phi_{n,k}^N$; this requires proving that $\Phi_{n,k}^N$ is locally Lipschitz (see Lemma 18). For the variance of $\Psi_n^N(X_N)$, similar arguments apply using Lemma 19.
Lemma 18. Assume H1, H2(m) for $m > 0$ and let $f : \mathbb{R}^d \to \mathbb{R}$ satisfy L1. Let $(\gamma_k)_{k \geq 1}$ be a nonincreasing sequence with $\gamma_1 \leq 2/(m + L)$. Then for all $\ell \geq n \geq 0$, $Q_\gamma^{n,\ell} f$ is locally Lipschitz and differentiable for almost all $x \in \mathbb{R}^d$, with an explicit bound on its gradient.

Proof. Let $\xi_{x,y}$ be the optimal transference plan between $\delta_x Q_\gamma^{n,\ell}$ and $\delta_y Q_\gamma^{n,\ell}$ for $W_2$. By Rademacher's Theorem [19, Theorem 3.2], $\nabla Q_\gamma^{n,\ell} f(x)$ exists for almost all $x \in \mathbb{R}^d$. For such $x$, the claimed bound follows from the Cauchy-Schwarz inequality and [15, Theorem 4]. It then suffices to control the error terms in $y$. Let $\varepsilon, \eta, R > 0$ and $y \in \mathbb{R}^d$. Since $a \vee b - a = (b - a)_+$, the error splits into three terms $E_1(y)$, $E_2(y)$, $E_3(y)$. Hölder's inequality, for $p, q > 1$ with $1/p + 1/q = 1$, bounds $E_1(y)$: under L1-2, the first factor is dominated by a constant for $q$ small enough, and the second tends to $0$ for $R$ large enough, uniformly for $y$ in a compact neighborhood of $x$ by [15, Theorem 3]. We can then choose $R$ such that $E_1(y) \leq \varepsilon/3$. We consider now $E_2(y)$: $L_f^2$ is a continuous function, hence uniformly continuous on compact sets, and we can choose $\eta$ such that $E_2(y) \leq \varepsilon/3$. We finally consider $E_3(y)$. By Markov's inequality and $\lim_{y \to x} W_2^2(\delta_x Q_\gamma^{n,\ell}, \delta_y Q_\gamma^{n,\ell}) = 0$, there exists a compact neighborhood $\mathcal{V}(x)$ of $x$ such that $y \in \mathcal{V}(x)$ implies $E_3(y) \leq \varepsilon/3$.

Lemma 19. Assume H1 and H2(m) for $m > 0$. Let $(\gamma_k)_{k \geq 1}$ be a nonincreasing sequence with $\gamma_1 \leq 2/(m + L)$.

Proofs of propositions 2, 3, 4
We assume in this Section that H1 and H2(m) hold for some $m \geq 0$. The proofs rely on the results of Section 4, Propositions 17 and 20, which establish bounds on the mean squared error for locally Lipschitz functions. For $i \in \{0, \ldots, M-1\}$, $\sigma_i^2 > 0$ and $\gamma_i > 0$, consider the Markov chain $(X_{i,n})_{n \geq 0}$ of (8) and its associated Markov kernel $R_i$, defined for all $A \in \mathcal{B}(\mathbb{R}^d)$ and $x \in \mathbb{R}^d$ accordingly, where $L_i$, $m_i$ are defined in (13) and $\kappa_i$ in (14). We then check L1 for $g_i$, where $g_i : \mathbb{R}^d \to \mathbb{R}$ is defined in (6). Note that $g_i$ is continuously differentiable with $\nabla g_i(x) = 2 a_i x\, e^{a_i \|x\|^2}$, which implies that L1-1 holds for $g_i$. The following Lemmas 21 and 22 enable us to check L1-2 for $g_i$.
Proof. In the proof, the subscript $i$ is omitted for ease of notation. Recall that the generator $\mathcal{A}$ of the Langevin diffusion (7) associated with $U$ is defined, for $f$ twice continuously differentiable, by
$$\mathcal{A} f = -\langle \nabla U, \nabla f \rangle + \Delta f.$$
In particular, for $f(x) = \|x\|^2 e^{2a\|x\|^2}$ and $x \in \mathbb{R}^d$, $\mathcal{A} f(x)$ can be computed explicitly. Using (97) and $\nabla U(0) = 0$, together with $a \in [0, m/(4(d + 4))]$, so that $2a(4a - m) \leq -(8/5)am$, an elementary study of $t \mapsto e^{2at^2}\left(2d + 4a(4a - m)t^2\right)$ on $\mathbb{R}_+$ yields the desired bound, and the result follows.

Proof of Proposition 4. The proof follows from Lemma 21 and Proposition 20.

Proof of Lemma 8
Because $U$ satisfies H1 and H2(m) for $m \geq 0$, with $U(0) = 0$ and $\nabla U(0) = 0$, we have for all $x \in \mathbb{R}^d$,
$$\frac{m}{2}\|x\|^2 \leq U(x) \leq \frac{L}{2}\|x\|^2,$$
which implies by integration that
$$\left( \frac{2\pi\sigma_0^2}{1 + L\sigma_0^2} \right)^{d/2} \leq Z_0 \leq \left( \frac{2\pi\sigma_0^2}{1 + m\sigma_0^2} \right)^{d/2},$$
where $Z_0 = \int_{\mathbb{R}^d} e^{-U_0(x)}\,\mathrm{d}x$ and $U_0$ is defined in (2). The proof then follows from the expression of $\sigma_0^2$ and this bound.