Convergence of Simulated Annealing Using Kinetic Langevin Dynamics

We study the simulated annealing algorithm based on the kinetic Langevin dynamics, in order to find the global minimum of a non-convex potential function. For both the continuous time formulation and a discrete time analogue, we obtain convergence rate results under technical conditions on the potential function, together with appropriate choices of the cooling schedule and the time-discretization parameters.


Introduction
Simulated annealing has long been an important method for finding the global minimum of a given function U : R^d → R, especially when U is non-convex. Classical studies of simulated annealing have mainly focused on an algorithm based on the overdamped Langevin dynamic

dX_t = −∇U(X_t) dt + √(2ε_t) dB_t,    (1.1)

where (B_t)_{t≥0} is a standard d-dimensional Brownian motion and (ε_t)_{t≥0} is a temperature parameter that tends to 0 as t → ∞. Notice that, with constant temperature ε_t ≡ ε and under mild conditions on U, the process X in (1.1) is the standard overdamped Langevin dynamic and has the invariant measure µ*_{o,ε}(dx) ∝ exp(−U(x)/ε) dx. With small ε > 0, samples from µ*_{o,ε} concentrate approximately around the global minimum of U, which is the intuition behind the simulated annealing algorithm.
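As a concrete illustration of the overdamped annealing (1.1), here is a minimal Euler–Maruyama sketch on a one-dimensional tilted double-well. The potential, the constant E, the cap t_0 and the step size are our illustrative choices, not taken from the paper.

```python
import numpy as np

def grad_U(x):
    # Illustrative tilted double-well U(x) = (x^2 - 1)^2 + 0.3 x, global minimum near x = -1.
    return 4.0 * x * (x**2 - 1.0) + 0.3

def anneal_overdamped(n_steps=200_000, dt=1e-3, E=1.0, t0=10.0, seed=0):
    rng = np.random.default_rng(seed)
    x, t = 2.0, 0.0                      # start in the basin of the non-global minimum
    for _ in range(n_steps):
        t += dt
        eps_t = E / np.log(max(t, t0))   # logarithmic cooling, capped for small t
        x += -grad_U(x) * dt + np.sqrt(2.0 * eps_t * dt) * rng.standard_normal()
    return x

x_final = anneal_overdamped()
```

With the slow logarithmic cooling, the trajectory typically has time to hop the barrier and settle near one of the wells; with a faster (e.g. exponential) cooling it tends to freeze in the starting basin.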
Since the introduction of the simulated annealing algorithm by Kirkpatrick, Gelatt and Vecchi [16], many works have been devoted to the convergence analysis of (1.1); see, e.g., Geman and Hwang [11], Chiang, Hwang and Sheu [5], Royer [28], Holley, Kusuoka and Stroock [14], Miclo [21], Zitt [32], Fournier and Tardif [9], Tang and Zhou [30], etc. It has been shown that the cooling schedule ε_t should be at least of the order E/log t as t → ∞, for some constant E > 0, in order to ensure the convergence of X_t to the global minimum of U as t → ∞. Intuitively, this cooling schedule gives the diffusion process enough time to escape from the local minima and, at the same time, to explore the whole space in order to find the global minimum of U; finally, the annealing process "freezes" at the global minimum of U as ε_t → 0. We would like to mention in particular the recent paper by Tang and Zhou [30], in which the authors derived a convergence rate result for (1.1); there, a fine estimate of the log-Sobolev constant from Menz and Schlichting [20] for the invariant measure µ*_{o,ε} at low temperature (small ε > 0) is crucially used. Moreover, they also analyzed a corresponding discrete time scheme of (1.1) and obtained a convergence rate result. Notice that in practice it is the discrete time scheme that is implemented to find the optimizer of U.
Motivated by the above works, we study in this paper the simulated annealing based on the kinetic (underdamped) Langevin dynamic, that is, the process (X, Y) = (X_t, Y_t)_{t≥0} defined by

dX_t = Y_t dt,  dY_t = −(∇U(X_t) + θY_t) dt + √(2θε_t) dB_t,    (1.2)

where θ > 0 is a fixed constant and (ε_t)_{t≥0} is a cooling schedule satisfying ε_t → 0 as t → ∞. Moreover, we also study a discrete time scheme of (1.2). More precisely, consider a sequence (∆t_k)_{k≥0} of time steps and define the discrete time grid 0 = T_0 < T_1 < · · · by T_{k+1} := T_k + ∆t_k. The discrete time scheme will be defined on the grid (T_k)_{k≥0}. For ease of presentation and for the convergence analysis later, we write this scheme as a continuous time process (X, Y) = (X_t, Y_t)_{t≥0} by using the time-freezing function η(t) := Σ_{k=0}^∞ T_k 1_{[T_k, T_{k+1})}(t). The discrete time scheme process (X, Y) is then defined by (1.2) with the coefficients frozen at time η(t). Notice that the above scheme is a second-order scheme, rather than the Euler scheme of (1.2). This scheme can be explicitly rewritten on the time grid (T_k)_{k≥0} and is hence implementable (see (2.3) and (2.4) for details). This second-order scheme has also been introduced and studied for the standard underdamped Langevin dynamic, i.e., for (1.2) with constant temperature ε_t ≡ ε_0; see, e.g., Cheng, Chatterji, Bartlett and Jordan [4], Zou, Xu and Gu [33], Gao, Gürbüzbalaban and Zhu [10] and Ma, Chatterji, Cheng, Flammarion, Bartlett and Jordan [18].
For the kinetic simulated annealing process (X, Y ) in (1.2), a convergence result without convergence rate has already been established by Journel and Monmarché [15]. In the present paper, we aim at obtaining some convergence rate results for both the simulated annealing process (X, Y ) in (1.2) and the discrete time scheme (X, Y ) in (1.3). To the best of our knowledge, we are the first to study the convergence of the kinetic simulated annealing algorithm in the discrete time framework. Let us also mention that, by cooling the parameter θ instead of ε in (1.2), Monmarché [22] studied an alternative kinetic simulated annealing process and derived a convergence rate for it (see Remark 2.10 in the following for a detailed comparison of his work and ours).
The remainder of the paper is organized as follows. We first introduce some notations. In Section 2, we state the assumptions and our main results, and provide the main idea of the proofs, together with some discussions on the related literature. The proof of the convergence rate of (1.2) is given in Section 3, and the convergence rate of (1.3) is given in Section 4.
Notations. (i) Denote by C^∞(R^d), or simply C^∞, the collection of all smooth (i.e., infinitely differentiable) functions f : R^d → R. For f ∈ C^∞, let ∇f, ∇²f and ∆f denote the gradient, Hessian and Laplacian of f, respectively. For a smooth vector field v : R^d → R^d, ∇ · v denotes the divergence of v. For vectors a, b ∈ R^d, ⟨a, b⟩ is their inner product and |a| = √⟨a, a⟩ is the Euclidean norm of a. For two matrices M, N ∈ M_{d×d}(R), their Frobenius inner product is defined as ⟨M, N⟩_F := Σ_{i,j} M_{ij} N_{ij}. For functions f and g defined on R_+, the symbol f = O(g) means that f/g is bounded when some problem parameter tends to 0 or ∞.

Main Results and Literature
We first state our main convergence rate results and then discuss the main ideas of the proofs, as well as some related literature.

Main Results
We first provide some conditions on the (potential) function U : R^d → R. Without loss of generality, we assume that min_{x∈R^d} U(x) = 0 throughout the paper.
Assumption 2.1. (i) The potential U is smooth and all its derivatives have at most polynomial growth. The gradient ∇U is L-Lipschitz for a constant L > 0. Moreover, U is (r, m)-dissipative for some positive constants r > 0 and m > 0.
(ii) The function U has a finite number of local minimizers and ∇²U is non-degenerate at the local minimizers. Moreover, U has at least one non-global minimizer.
Remark 2.2. (i) In the literature on the standard kinetic (underdamped) Langevin dynamic (i.e., (1.2) with constant temperature ε_t ≡ ε_0), the dissipativity condition on U is a standard Lyapunov condition ensuring the ergodicity of the process; see, e.g., Eberle, Guillin and Zimmer [8] and Mattingly, Stuart and Higham [19]. The Lipschitz condition on ∇U is also usually imposed to obtain a quantitative exponential convergence rate of the law of the standard kinetic Langevin dynamic towards its invariant measure; see, e.g., [8, 10, 18]. In particular, this condition ensures that the process (X, Y) in (1.2) is well defined.
(ii) The Lipschitz condition on ∇U, together with the dissipativity condition, implies that there exists a positive constant K such that the quadratic growth estimate (2.1) holds. For a proof, see, e.g., Raginsky, Rakhlin and Telgarsky [27, Lemma 2].
Let m_U and M_U denote, respectively, the set of local minima and the set of global minima of U. We then define a constant E* > 0, the so-called critical depth of U.

Assumption 2.3. (i) The initial distribution of (X_0, Y_0), denoted by µ_0 = L(X_0, Y_0), has a C^∞ density function p_0. Moreover, the initial Fisher information is finite.
(ii) The function t → ε_t is positive, non-increasing and differentiable. Moreover, for some time t_0 and a constant E > E*, one has ε_t = E/log t for all t > t_0.
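The display defining E* was lost in extraction; for the reader's orientation, a standard definition of the critical depth consistent with [14, 20, 22] (our reconstruction, possibly differing from the authors' exact display) is:

```latex
E^* := \max_{x \in m_U \setminus M_U}
\Big( \inf_{\substack{\gamma \in C([0,1];\mathbb{R}^d) \\ \gamma(0)=x,\ \gamma(1)\in M_U}}
\ \max_{s \in [0,1]} U(\gamma(s)) \;-\; U(x) \Big),
```

i.e., the largest energy barrier that a non-global local minimum must climb in order to reach the set of global minima along a continuous path.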
Remark 2.4. To obtain the convergence of the simulated annealing using the overdamped Langevin dynamic in (1.1), it is standard to consider the cooling schedule ε_t = O(1/log t) as t → ∞; see, e.g., [11, 5, 28, 14, 21, 30]. For the kinetic simulated annealing process (1.2), this cooling schedule is also assumed in Journel and Monmarché [15] to obtain a convergence result without convergence rate. Moreover, it is proved in [15] that the convergence may fail for a faster cooling schedule. In Monmarché [22], for an alternative kinetic simulated annealing process with a cooling schedule on the parameter θ, a similar cooling schedule is also assumed. Moreover, our technical conditions on U in Assumption 2.1 are largely motivated by those in [22].
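To make the schedule in Assumption 2.3(ii) concrete, here is a small sketch (the values of E and t_0 are illustrative; the theory requires E > E*):

```python
import numpy as np

def cooling(t, E=2.0, t0=10.0):
    """eps_t = E / log(t) for t > t0, frozen at its value at t0 before that."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= t0, E / np.log(t0), E / np.log(np.maximum(t, t0)))
```

The schedule is positive, non-increasing, and decays to 0 only logarithmically, which is what allows the process to keep escaping local minima at every finite time.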
Let us now provide a first convergence rate result for (X, Y) defined by (1.2). By a time change argument, we can easily reduce (1.2) to the case θ = 1; for this reason, we will always assume θ = 1 in the rest of the paper. Also, recall that the condition min_{x∈R^d} U(x) = 0 is assumed throughout the paper.
Theorem 2.5. Let Assumptions 2.1 and 2.3 hold. Then, for any constants δ > 0 and α > 0, there exists some constant C > 0 such that

Remark 2.6. (i) In [15], the authors used localization arguments to obtain a convergence result for (1.2) under conditions on the potential function U weaker than ours. They, however, did not derive a convergence rate.
(ii) The convergence rate in Theorem 2.5 is the same as that in Miclo [21] and Tang and Zhou [30] for simulated annealing using the overdamped Langevin dynamic, and also as that in Chak, Kantas and Pavliotis [3] for simulated annealing using the generalized Langevin process. While higher order Langevin dynamics are often used in MCMC as accelerated versions compared to the overdamped Langevin dynamic (see for instance [18, 10, 23]), this is not the case for the simulated annealing problem. Indeed, it has been observed that the convergence behavior of the annealed process is mainly determined by the potential function U rather than by the process used; see for instance the discussion in [22, Remarks after Theorem 1].
We now present the convergence rate result for the discrete time scheme (X, Y ) as defined in (1.3).
Theorem 2.7. Let Assumptions 2.1 and 2.3 hold. Suppose in addition that ∇²_x U is L′-Lipschitz, i.e.,

‖∇²_x U(x) − ∇²_x U(y)‖ ≤ L′|x − y| for all x, y ∈ R^d,

for some constant L′ > 0, and that the time step parameters (∆t_k)_{k≥0} satisfy

lim_{k→∞} T_k = ∞ and lim sup_{k→∞} ∆t_k T_k < ∞.    (2.2)

Then, for all constants δ > 0 and α > 0, there exists a constant C > 0 such that

Remark 2.8. (i) The additional Lipschitz condition on ∇²_x U is essentially used to control the discretization error of the discrete time scheme (1.3).
(ii) We need ∆t_k → 0 as k → ∞ to control the (cumulative) discretization error of the scheme (1.3). At the same time, ∆t_k should not decrease too fast, so that T_k → ∞ as k → ∞ and thus (X_{T_k})_{k≥0} can reach the global minima of U. This explains condition (2.2).
(iv) In Tang and Zhou [30], for the discrete simulated annealing based on the overdamped Langevin dynamic (1.1), the authors required the step sizes (∆t_k)_{k≥0} to satisfy a condition that is a little stronger than our condition (2.2). Of course, (2.2) is only a sufficient condition to ensure the convergence rate result.
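As a sanity check of condition (2.2), here is one concrete family of step sizes satisfying it (our illustrative choice, not taken from the paper): ∆t_k = 1/(k + 2), so that T_k grows like log k while ∆t_k T_k ~ log(k)/k stays bounded.

```python
import numpy as np

K = 200_000
dt = 1.0 / (np.arange(K) + 2.0)                   # step sizes dt_k = 1/(k+2), k = 0..K-1
T = np.concatenate(([0.0], np.cumsum(dt[:-1])))   # grid points T_k = dt_0 + ... + dt_{k-1}
prod = dt * T                                     # the quantity dt_k * T_k in condition (2.2)
```

Here T_k → ∞ (harmonic growth) while ∆t_k T_k → 0, so in particular lim sup_k ∆t_k T_k < ∞.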
Notice that (1.3) is a linear SDE on each time interval [T_k, T_{k+1}], so we can solve it explicitly given the value (X_{T_k}, Y_{T_k}); here B_t is the Brownian motion in (1.3). Therefore, we can implement (X, Y) on the discrete time grid (T_k)_{k≥0} in an exact way: the one-step update involves a Gaussian vector with mean zero and covariance matrix Σ_k (see (2.3)–(2.5)).
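Since the displays (2.3)–(2.5) are not reproduced above, the following Python sketch shows how such an exact one-step update can be implemented for the frozen-gradient scheme with θ = 1, i.e., for dX = Y dt, dY = −(∇U(X_{T_k}) + Y) dt + √(2ε_k) dB on [T_k, T_{k+1}]. The mean and covariance formulas below are our own derivation for this linear SDE (standard for underdamped Langevin integrators, cf. [4]), not a verbatim copy of (2.3)–(2.5); the potential and cooling parameters are illustrative.

```python
import numpy as np

def step_cov(h, eps):
    """Per-coordinate covariance of the Gaussian one-step noise for the linear SDE
    dX = Y dt, dY = -(g + Y) dt + sqrt(2*eps) dB over a step of length h (g frozen)."""
    e1, e2 = np.exp(-h), np.exp(-2.0 * h)
    var_x = 2.0 * eps * (h - 2.0 * (1.0 - e1) + 0.5 * (1.0 - e2))
    cov_xy = eps * (1.0 - e1) ** 2
    var_y = eps * (1.0 - e2)
    return np.array([[var_x, cov_xy], [cov_xy, var_y]])

def kinetic_step(x, y, h, eps, grad_U, rng):
    """One exact step of the frozen-gradient (second-order) scheme, theta = 1."""
    g = grad_U(x)                       # gradient frozen at the left endpoint T_k
    e1 = np.exp(-h)
    mean_x = x + (1.0 - e1) * y - (h - (1.0 - e1)) * g
    mean_y = e1 * y - (1.0 - e1) * g
    # Sample the correlated (X, Y) noise, independently in each coordinate.
    chol = np.linalg.cholesky(step_cov(h, eps) + 1e-15 * np.eye(2))
    noise = chol @ rng.standard_normal((2, x.size))
    return mean_x + noise[0], mean_y + noise[1]

# Illustrative tilted double-well potential; global minimum near x = -1.
grad_U = lambda x: 4.0 * x * (x**2 - 1.0) + 0.3
rng = np.random.default_rng(0)
x, y = np.full(500, 2.0), np.zeros(500)     # 500 independent annealing runs
for k in range(5000):
    t_k = 1.0 + 0.01 * k                    # uniform grid, for illustration only
    eps_k = 2.0 / np.log(10.0 + t_k)        # logarithmic cooling (illustrative E, t0)
    x, y = kinetic_step(x, y, 0.01, eps_k, grad_U, rng)
```

For small h the covariance entries scale as Var(X-noise) ≈ (2/3)εh³, Cov ≈ εh², Var(Y-noise) ≈ 2εh, which is the signature of the second-order (rather than Euler) discretization.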

Main Idea of Proofs and Related Literature
Recall that for two probability measures µ, ν ∈ P(R^d) (or in P(R^d × R^d)) such that µ ≪ ν, the relative entropy KL(µ|ν) and the Fisher information I(µ|ν) are defined by

KL(µ|ν) := ∫ log(dµ/dν) dµ and I(µ|ν) := ∫ |∇ log(dµ/dν)|² dµ.

For the simulated annealing process (1.1) using the overdamped Langevin dynamic, let us denote µ_{o,t} := L(X_t) and let µ*_{o,ε}(dx) ∝ exp(−U(x)/ε)dx be the invariant measure of the standard overdamped Langevin dynamic with constant temperature ε. To deduce their convergence rate result, Tang and Zhou [30] analyze the evolution of KL(µ_{o,t}|µ*_{o,ε_t}). A key step consists in obtaining (2.6) for a good constant ρ(ε_t) (depending on ε_t). The equality in (2.6) follows from the so-called de Bruijn identity, which is the entropy dissipation equation for the standard overdamped Langevin dynamic; see for instance Chapter 5.2 of the monograph by Bakry, Gentil and Ledoux [2], or Otto and Villani [24] for a more general setting. The inequality in (2.6) follows from the log-Sobolev inequality (LSI). This step is also the classical way to deduce the exponential ergodicity of the standard overdamped Langevin dynamic; see for instance Arnold, Markowich, Toscani and Unterreiter [1] and Pavliotis [25].
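Schematically, the two steps in (2.6) take the standard form (our rendering, with the convention KL ≤ I/(2ρ) for an LSI with constant ρ; see [2]):

```latex
\frac{d}{dt}\,\mathrm{KL}\big(\mu_{o,t}\,\big|\,\mu^*_{o,\varepsilon}\big)
= -\,\varepsilon\, I\big(\mu_{o,t}\,\big|\,\mu^*_{o,\varepsilon}\big)
\;\le\; -\,2\,\varepsilon\,\rho(\varepsilon)\,\mathrm{KL}\big(\mu_{o,t}\,\big|\,\mu^*_{o,\varepsilon}\big),
\qquad \varepsilon \text{ frozen},
```

so that, at fixed temperature, the relative entropy decays exponentially at rate 2ερ(ε).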
For the kinetic simulated annealing process (1.2) (with θ = 1), let us denote µ_t := L(X_t, Y_t) and let µ*_ε be the invariant measure of the standard kinetic Langevin dynamic with constant temperature ε (i.e., (1.2) with ε_t ≡ ε). It is, however, no longer helpful to use the relative entropy KL(µ_t|µ*_{ε_t}) for the convergence analysis. Indeed, because the Brownian motion only appears in the y-direction in (1.2), the time derivative of the relative entropy only produces a part of the Fisher information, namely its y-component. Thus, we cannot use the LSI to proceed as in [30] for the overdamped simulated annealing.
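The obstruction can be made explicit: at frozen temperature ε, a standard computation for the kinetic Fokker–Planck equation (our rendering; see e.g. [31]) gives only

```latex
\frac{d}{dt}\,\mathrm{KL}\big(\mu_t\,\big|\,\mu^*_{\varepsilon}\big)
= -\,\varepsilon \int_{\mathbb{R}^{2d}}
\Big|\nabla_y \log \frac{p_t}{p^*_{\varepsilon}}\Big|^{2}\, p_t \,\mathrm{d}z ,
```

which controls the y-derivatives of log(p_t/p*_ε) only, whereas the Fisher information appearing in the LSI involves the full gradient ∇_z = (∇_x, ∇_y).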
This is also the main reason why the relative entropy alone cannot be used to deduce the exponential ergodicity of the standard kinetic Langevin dynamic; a successful alternative is the so-called hypocoercivity method, initiated in Desvillettes and Villani [6] and Hérau [12, 13] (see also Villani [31] for a detailed presentation). As in Monmarché [22], we adopt the hypocoercivity method to study our kinetic simulated annealing processes (1.2) and (1.3). More precisely, for two probability measures µ, ν ∈ P(R^d × R^d), we consider the distorted relative entropy H_γ(µ|ν) defined by (2.7), where γ > 0 is a constant. To deduce the convergence rate result for (1.2), we choose γ to be a function of ε_t and then analyze the evolution of H_{γ(ε_t)}(µ_t|µ*_{ε_t}), which splits into the two terms (2.9) and (2.10). The term (2.9) comes from the (instantaneous) evolution of H_{γ(ε_t)}(µ_t|µ*_{ε_t}) along the kinetic diffusion (1.2) at fixed temperature ε_t, while the term (2.10) arises from the variation of the (instantaneous) invariant measure µ*_{ε_t} and of γ(ε_t) in H_{γ(ε_t)}(µ_t|µ*_{ε_t}). For the term (2.9), with a carefully chosen map ε_t → γ(ε_t) and by adapting a computation strategy from Ma, Chatterji, Cheng, Flammarion, Bartlett and Jordan [18] for the standard kinetic Langevin dynamic, we obtain (2.11), where c(ε_t) = 1/γ(ε_t) and ρ_{ε_t} > 0 is the constant in the log-Sobolev inequality (LSI) satisfied by µ*_{ε_t}.

Remark 2.9. As in [22] and [30], it is crucial to use a fine estimate of the constant ρ_{ε_t} in the LSI for µ*_{ε_t} (under Assumption 2.1), in particular for small ε_t. We refer to Menz and Schlichting [20], especially Section 2.3.3 therein, for this issue.
The treatment of the term (2.10) is classical (see, e.g., Holley, Kusuoka and Stroock [14], Miclo [21] and Tang and Zhou [30] in the case of overdamped simulated annealing, or Monmarché [22] in the case of an alternative kinetic simulated annealing). First, we apply the Lyapunov function technique as in Talay [29] to obtain a uniform bound on the moments of (X_t, Y_t) in (1.2) for all t ≥ 0. Using this uniform moment bound, together with the explicit expression of µ*_{ε_t} for all t ≥ 0, we can directly compute (2.12) for some sub-exponential function ω(ε), where ε̇_t denotes the derivative of t → ε_t. The term ω(ε_t)|ε̇_t| on the right-hand side (r.h.s.) of (2.12) is positive. In order to make H_{γ(ε_t)}(µ_t|µ*_{ε_t}) decrease in time and to get an explicit decay rate, we need to ensure that the term c(ε_t)ρ_{ε_t} in (2.11) dominates the term ω(ε_t)|ε̇_t| in (2.12). This requires a careful choice of ε → γ(ε), as well as conditions on the cooling schedule t → ε_t as in Assumption 2.3.
Remark 2.10. (i) The above sketch of proof, as well as the use of the hypocoercivity method with the distorted entropy (2.7), is similar to that in Monmarché [22] for a different kinetic annealing process. More precisely, [22] studies the process (2.13) with ε_s → 0 as s → ∞. As our cooling schedule acts on the diffusion parameter, we need to design another function ε_t → γ(ε_t) to compute the distorted entropy H_{γ(ε_t)}(µ_t|µ*_{ε_t}). More importantly, [22] reformulated the problem into a Bakry–Émery framework to establish a contraction inequality similar to (2.11), while we apply a different computation strategy adapted from [18]. In particular, as shown in [18], this computation strategy can be adapted to the discrete time setting for the convergence analysis of numerical schemes, a subject that is not addressed in [22].
(ii) From a numerical point of view, our kinetic annealing process (1.2) seems more convenient for discrete time simulation than that in (2.13). Indeed, as ε_s → 0, the Lipschitz constants of the coefficient functions of the SDE (2.13) blow up, so the corresponding time discretization error in the numerical analysis can be much harder to control.
For the convergence analysis of the discrete time process (X, Y) in (1.3), the main idea is quite similar. Denote µ_t := L(X_t, Y_t). For ease of presentation, let us abbreviate µ_k := µ_{T_k}, ε_k := ε_{T_k} and µ*_k := µ*_{ε_k}. Then, with the same distorted entropy function defined by (2.8), we need to compute, for each k ≥ 0, the one-step difference of the distorted entropy, which splits into the two terms (2.14) and (2.15). For the term (2.14), we can adapt the computation in [18] for the standard kinetic Langevin process. Concretely, on each interval [T_k, T_{k+1}], the interpolated marginal law µ_t satisfies a Fokker–Planck equation with a (partially) frozen coefficient function. We can then apply the same computation strategy as in the continuous time setting to obtain an (asymptotic) contraction estimate.
Similarly, the term (2.15) can be handled by an interpolation argument, together with a uniform moment estimate on (X, Y). Finally, with a good choice of the time step parameters (∆t_k)_{k≥0}, we can combine the estimates of (2.14) and (2.15) to obtain a convergence rate result for the process (X, Y) on the discrete time grid (T_k)_{k≥0}.
In particular, the above analysis of H_{γ(ε_k)}(µ_k|µ*_k) extends the hypocoercivity method for the discrete time Langevin dynamic in [18] to the setting with time-dependent coefficients. We also notice that most existing papers on the numerical aspects of kinetic-type equations via the hypocoercivity method focus on the discretization of the corresponding Fokker–Planck equations, as in Dujardin, Hérau and Lafitte [7] and Porretta and Zuazua [26].

Convergence of the continuous-time simulated annealing
In this section, we present the convergence analysis of the continuous-time kinetic annealing process (1.2) in order to prove Theorem 2.5. As discussed in Section 2.2, our main strategy consists in studying the evolution of the distorted relative entropy H_{γ(ε_t)}(µ_t|µ*_{ε_t}) with γ(ε_t) := 4(1 + L²)/ε_t, where L > 0 is the Lipschitz constant of ∇U in Assumption 2.1. For this purpose, we study separately the two terms on the r.h.s. of (2.8).
For simplicity of presentation, we assume that θ = 1. Denote z = (x, y) as a single variable and Z = (X, Y) as the process satisfying (1.2). For each t ≥ 0, denote by µ_t = L(Z_t) the marginal distribution of Z. Similarly to [22, Proposition 4], under the smoothness and growth conditions on U in Assumption 2.1, µ_t has a strictly positive smooth density function on R^{2d}, denoted by p_t. Recall also that, for each ε > 0, the invariant probability measure of (1.2) with constant temperature ε_t ≡ ε is denoted by µ*_ε; with some normalizing constant C_ε > 0,

µ*_ε(dz) = C_ε exp(−(U(x) + |y|²/2)/ε) dz.

Preliminary analysis of continuous kinetic annealing process
Let us first recall a fine result on the log-Sobolev inequality from Menz and Schlichting [20]. Next, we give a moment estimate for the solution (X_t, Y_t)_{t≥0} of (1.2).
Proof. Inspired by [22, Lemma 7], we consider a Lyapunov function R(x, y) built with the constant r > 0 given in Assumption 2.1.

For any smooth function f, denote by L_ε f the action on f of the generator of the diffusion process (1.2) at temperature ε > 0. Then, by a direct computation and by Assumption 2.1, a drift estimate holds for any constant c_1 > 0. Choosing c_1 = r/2, and by (2.1) as well as the fact that β < 1/2, it follows that the estimate holds with some positive constants c_2, c_3 independent of ε. Using again β < 1/2 and the fact that U is of quadratic growth (see (2.1)), we obtain a comparison between R and U + |y|² for some constant c_4 > 0. Therefore, for some positive constants c_5 and c_6 independent of ε > 0, we have L_ε R ≤ −c_5 R + c_6. Notice that the Lyapunov function R(x, y) does not depend on ε > 0, so we can apply Grönwall's lemma to conclude that (E[R(X_t, Y_t)])_{t≥0} is uniformly bounded.
To conclude, we notice that (see also [22, Lemma 7]) there is some constant C > 0 controlling U(x) + |y|² in terms of 1 + R(x, y); the desired moment bound then follows.
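To illustrate the drift computation in the proof above, assume (as our reconstruction, since the displays were lost) the Lyapunov function R(x, y) = U(x) + |y|²/2 + β x·y with β ∈ (0, 1/2), and the dynamics (1.2) with θ = 1, whose generator at temperature ε is L_ε = y·∇_x − (∇U(x) + y)·∇_y + ε∆_y. Then:

```latex
L_\varepsilon R(x,y)
= y\cdot\big(\nabla U(x) + \beta y\big)
- \big(\nabla U(x) + y\big)\cdot\big(y + \beta x\big) + \varepsilon d
= -(1-\beta)\,|y|^2 - \beta\, x\cdot\nabla U(x) - \beta\, x\cdot y + \varepsilon d .
```

The dissipativity of U bounds −x·∇U(x) from above, and Young's inequality absorbs the cross term βx·y, which is how one arrives at a bound of the form L_ε R ≤ −c₅ R + c₆ with constants independent of ε ∈ (0, ε₀].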

Evolution of distorted entropy with fixed temperature
We now compute ∂_{µ,t} H_{γ(ε_t)}(µ_t|µ*_{ε_t}) in (2.9). Although the computation is mainly adapted from Ma, Chatterji, Cheng, Flammarion, Bartlett and Jordan [18], we nevertheless provide most of the details for completeness. Recall that p_t (resp. p*_{ε_t}) denotes the density function of µ_t (resp. µ*_{ε_t}). Notice that the marginal density p_t satisfies the kinetic Fokker–Planck equation (3.1), which we can rewrite in the equivalent form (3.2). To simplify the computation afterwards, we follow the idea in [18] and further add the divergence-free term (ε_t ∇_y log p_t, −ε_t ∇_x log p_t) into the equation. By a direct computation, it follows that (3.3) holds, where v_t is the vector flow that transports µ_t towards µ*_{ε_t}. Let h_t := p_t/p*_{ε_t} and let S denote the distortion matrix. Then the distorted relative entropy H_{γ(ε_t)}(µ_t|µ*_{ε_t}) can be written as (3.4), whose first-order variational derivative at µ_t we denote accordingly. By Proposition 2.22 in [17], and then using (3.3) and (3.4), a simple calculation leads to a decomposition into the three terms (3.5), (3.6) and (3.7). The term (3.5) equals (3.8), and (3.6) can be simplified as (3.9); the term (3.7) we write as the sum of three further terms. By [18, Lemma 9], and combining the above with (3.8) and (3.9), we derive an expression involving the matrix M_t; getting rid of the term involving the second-order derivatives, we obtain an upper bound. Note that the r.h.s. of this bound resembles the Fisher information I(µ_t|µ*_{ε_t}). In order to use the log-Sobolev inequality satisfied by µ*_{ε_t} (Proposition 3.1) and connect it to the distorted relative entropy (3.4), we will prove in Lemma 3.3 that, for sufficiently large t ≥ 0, the matrix inequality (3.10) holds, where c(ε_t) := 1/γ(ε_t) = ε_t/(4(1 + L²)) and ρ_{ε_t} is given in Proposition 3.1. The contraction estimate then follows.

Proof. The proof is very similar to that of [22, Lemma 8], so we only present the main idea here.
By the choice of γ(ε_t) and c(ε_t), we have (ε_t, c(ε_t), ρ_{ε_t}, γ(ε_t)) → (0, 0, 0, ∞) as t → ∞. Thus, it is sufficient to prove the matrix inequality for sufficiently large t ≥ 0. We first compute the characteristic equation of M_t − (1/2)I_{2d}. By diagonalizing ∇²_x U, with (∇²_x U)_i denoting the i-th eigenvalue of ∇²_x U, we obtain d quadratic equations in λ, from which we deduce the lower bound M_t ⪰ 2ρ_{ε_t} I_{2d} only for sufficiently large t. However, this is sufficient to obtain the convergence of our kinetic annealing process (1.2), as long as we can ensure that the distorted relative entropy H_{γ(ε_t)}(µ_t|µ*_{ε_t}) is finite on any finite time interval (see Lemma 3.8 below). We also point out that the choice of γ(ε_t) and c(ε_t) ensuring that M_t satisfies (3.10) is not unique.
To conclude this subsection, we summarize the above computation in the following proposition, in which ρ_{ε_t} is given in Proposition 3.1.
Proof. We follow the proof of [22,Lemma 15]. In the following, the constant C > 0 may change from line to line.

Proof of Theorem 2.5
For any t ≥ 0, let Z̃_t = (X̃_t, Ỹ_t) be a random variable (on the same probability space) following the distribution µ*_{ε_t}. Then, for any δ > 0, we can bound P(U(X_t) > δ) by using the Csiszár–Kullback–Pinsker inequality, the definition of the distorted relative entropy (2.7), and the fact that γ(ε_t) = 4(1 + L²)/ε_t ≥ 1 for sufficiently large t ≥ 0. We then conclude the proof by Lemma 3.7 and Proposition 3.9.
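The chain of estimates alluded to here is, schematically (our rendering; the Pinsker constant depends on the total-variation convention, here ‖µ − ν‖_TV ≤ √(2 KL(µ|ν))):

```latex
\mathbb{P}\big(U(X_t) > \delta\big)
\;\le\; \mathbb{P}\big(U(\widetilde X_t) > \delta\big)
+ \big\|\mu_t - \mu^*_{\varepsilon_t}\big\|_{TV}
\;\le\; \mathbb{P}\big(U(\widetilde X_t) > \delta\big)
+ \sqrt{2\,\mathrm{KL}\big(\mu_t\,\big|\,\mu^*_{\varepsilon_t}\big)}
\;\le\; \mathbb{P}\big(U(\widetilde X_t) > \delta\big)
+ \sqrt{2\,H_{\gamma(\varepsilon_t)}\big(\mu_t\,\big|\,\mu^*_{\varepsilon_t}\big)} ,
```

where the last step uses that the distorted entropy dominates the relative entropy (here with γ(ε_t) ≥ 1 for large t).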
For the estimation of P(U(X̃_t) > δ), we have the following classical result. In particular, let ε_t satisfy Assumption 2.3; then, for sufficiently large t > 0, the corresponding tail estimate holds. To provide an estimate of H_{γ(ε_t)}(µ_t|µ*_{ε_t}), we first prove that it is uniformly bounded on any finite time horizon [0, t].
Proof. Starting from the definition of the distorted entropy, and then by the log-Sobolev inequality, it suffices to prove that I(µ_t|µ*_{ε_t}) is uniformly bounded on any finite horizon.
Proposition 3.9. Let Assumptions 2.1 and 2.3 hold and choose γ(ε_t) = 4(1 + L²)/ε_t. Then, for any α > 0, there exists C > 0 such that, for sufficiently large t, the stated entropy decay holds.

Proof. First, by Lemma 3.8, H_{γ(ε_t)}(µ_t|µ*_{ε_t}) is finite for any t ≥ 0. Next, by (2.8), Proposition 3.5 and Lemma 3.6, a differential inequality holds for sufficiently large t ≥ 0. Further, by Proposition 3.1, we have ρ_ε = χ(ε) exp(−E*/ε) for some sub-exponential function χ(ε). Then, by Assumption 2.3 and Proposition 3.5, for sufficiently large t ≥ 0 and for all α > 0, a corresponding bound holds. According to Proposition 3.2, E[U(X_t) + |Y_t|²] is uniformly bounded, so the bound simplifies further. Thus, there exist some positive constants c̃_1 and c̃_2 such that, for sufficiently large t, the differential inequality for H_{γ(ε_t)}(µ_t|µ*_{ε_t}) holds. Because E > E*, among the terms involving H_{γ(ε_t)}(µ_t|µ*_{ε_t}) on the r.h.s. of this inequality, the first term dominates the second when α > 0 is small. As a result, for some positive constants c_1 and c_2, we obtain a reduced differential inequality. Taking α < (1/2)(1 − E*/E), for sufficiently large t_0 and all t ≥ t_0, a Grönwall-type argument yields the claimed decay. The proof is then complete.

Convergence of the discrete-time simulated annealing
Following the sketch of proof in Section 2.2, we provide in this section the convergence analysis of the discrete time kinetic simulated annealing process (X, Y ) in (1.3) and prove Theorem 2.7.
Without loss of generality, we assume that the Lipschitz constant of ∇U satisfies L ≥ 1. We also stay in the setting θ = 1 for the proof. Denote z = (x, y) and Z := (X, Y). By the explicit solution of (X_t, Y_t) in (2.3), we can see that the marginal distribution µ_t := L(Z_t) has a strictly positive and smooth density function p_t(z). Also recall that µ*_ε(dz) = p*_ε(z)dz with p*_ε(x, y) ∝ exp(−(U(x) + |y|²/2)/ε). For ease of notation, we abbreviate µ_k := µ_{T_k}, ε_k := ε_{T_k} and µ*_k := µ*_{ε_k}.

Preliminary analysis of discrete kinetic annealing process
We first provide a uniform boundedness result for the moments of (X, Y). To this end, we introduce two constants, where r > 0 is the constant given in Assumption 2.1. One can check that β̃ ≤ 1/2, β̃ < r and β̃² r/3 < 1.
Proof. Similarly to the proof of Proposition 3.2, we introduce the Lyapunov function R̃(x, y) = U(x) + |y|²/2 + β̃ x·y for all z = (x, y) ∈ R^d × R^d. Then, to prove (4.1), it is equivalent to prove that (E[R̃(X_k, Y_k)])_{k≥0} is uniformly bounded.
Denote θ_k := 1 − e^{−∆t_k} ≤ ∆t_k, so that the discrete time scheme (2.4) can be rewritten accordingly, where (D_x(k), D_y(k)) is a Gaussian vector in R^d × R^d with mean zero and covariance matrix Σ_k given by (2.5). Then, by the Lipschitz property of ∇_x U, we derive a one-step estimate, and we compute the cross terms directly. Therefore, by the definition of R̃, and using the facts that β̃ ≤ 1/2 and θ_k ≤ ∆t_k ≤ 1/L ≤ 1, together with the expressions of Σ_{ij}(k) in (2.5), we obtain a one-step recursion. Thus, by the (r, m)-dissipativity of U and the L-Lipschitz continuity of ∇_x U, there exists some constant b such that the estimate holds for an arbitrary constant c. Now choose c = r/2. Noticing that β̃ < r by its definition, and using (2.1), we derive a contraction inequality; rearranging, we obtain the desired uniform bound for some constant C > 0.

By Proposition 4.1, we can easily obtain that (E[|Y_t|²])_{t≥0} is also uniformly bounded. Moreover, we have the following fine estimate on the increments of X.

Proof. For t ∈ [T_k, T_{k+1}], notice that X_t = X_k + ∫_{T_k}^t Y_s ds. It then follows from the Cauchy–Schwarz inequality that the claimed bound holds for some constant C independent of k ≥ 0.

One-step entropy evolution with fixed temperature
In this subsection, we aim at estimating the difference term (2.14), which is the one-step evolution of H_{γ(ε)}(µ_t|µ*_ε) at fixed ε. We first provide the Fokker–Planck equation satisfied by µ_t on each interval [T_k, T_{k+1}]. Recall that p_t denotes the density function of µ_t. For t ∈ [T_k, T_{k+1}] and z_k ∈ R^d × R^d, denote by p(z | z_k) the conditional density of Z_t given Z_{T_k} = z_k.

Proof. Conditionally on the σ-field F_{T_k} := σ(Z_t : 0 ≤ t ≤ T_k), Z follows a linear SDE with deterministic parameters on [T_k, T_{k+1}] (see (1.3)). Thus, the conditional density function z → p(z | z_k) satisfies the corresponding Fokker–Planck equation. Notice that v_{t,k} is independent of z_k. We can then complete the proof by integrating both sides of this equality with respect to z_k under the measure µ_k(z_k).
For t ∈ [T_k, T_{k+1}], we consider the distorted relative entropy H_{γ(ε_k)}(µ_t|µ*_k). Then, for t ∈ [T_k, T_{k+1}], by Lemma 4.4, we obtain a decomposition into the terms (4.2) and (4.3). Here, the first-order variational derivative of H_{γ(ε_k)}(µ_t|µ*_k) is expressed using (∇_z)* = (∇*_x, ∇*_y) := (−∇_x · + (1/ε_k)∇_x U ·, −∇_y · + (y/ε_k) ·), the adjoint operator of ∇_z with respect to µ*_k, in the sense that the integration-by-parts identity holds for any f, g ∈ L²(µ*_k). We then follow the same computation steps as in (3.5), (3.6) and (3.7) to compute the term (4.2); the term (4.3) can be computed directly. Combining this with (4.4) and Lemma 4.5, we derive the following estimate.

Lemma 4.6. It holds that

Proof. Using the Young inequality a · b ≤ (ε_k/4)|a|² + (1/ε_k)|b|² and the identity p_{t|T_k}(z|z_k) p_k(z_k) = p_t(z) p_{T_k|t}(z_k|z), where p_{T_k|t}(z_k|z) := p(Z_{T_k} = z_k | Z_t = z) denotes the backward conditional density, and by Jensen's inequality, we can establish the first bound; the ∇_y log h̃_t term can be treated similarly. This completes the proof.
Lemma 4.7. Assume that ∆t_k ≤ 1/L for every k and that ∇²_x U is L′-Lipschitz. Then the stated bound holds.

Proof. Without loss of generality, we assume L ≥ 1. Using the same proof as that of [18, Lemma 10], we can show a preliminary estimate, where ψ(t) := e^{t−T_k} − 1 − (t − T_k). On the one hand, the first term is controlled thanks to the L′-Lipschitz continuity of ∇²_x U; on the other hand, the second term can be bounded directly. As a result, comparing the resulting inequality with (4.8), to complete the proof of the lemma, we only need to prove

E_{µ_k}[⟨J_t(X_k), S J_t(X_k)⟩_F] ≤ 144 d^{3/2} L² ∆t_k².    (4.10)

By the Cauchy–Schwarz inequality and a matrix norm inequality, we obtain (4.11). Because ‖∇²_x U‖_2 ≤ L, which follows from the L-Lipschitz continuity of ∇_x U, and because ∆t_k ≤ 1/L, we have ψ(t) ≤ ∆t_k². Consequently, ‖J_t(X_k)‖_2 ≤ √2 max(2L²∆t_k², 6L∆t_k) = 6√2 L∆t_k. This, together with (4.11), immediately leads to (4.10).
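The elementary bound used at the end of the proof, namely that the function e^s − 1 − s is at most s² for step fractions s = t − T_k ∈ (0, ∆t_k] ⊂ (0, 1], can be checked numerically:

```python
import numpy as np

s = np.linspace(1e-6, 1.0, 100_000)
psi = np.expm1(s) - s        # psi(s) = e^s - 1 - s
ratio = psi / s**2           # stays below 1 on (0, 1]; limit 1/2 as s -> 0, value e - 2 at s = 1
```

The ratio increases from 1/2 (its s → 0 limit, from the Taylor expansion) to e − 2 ≈ 0.718 at s = 1, so ψ(t) ≤ (t − T_k)² ≤ ∆t_k² on each step.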
Recall that ρ_k is the constant in the LSI satisfied by µ*_k in Proposition 3.1, and γ(ε_k) = 4(1 + L²)/ε_k. Denote c(ε_k) := 1/γ(ε_k) = ε_k/(4(1 + L²)) and define the matrix M̃_k accordingly. As the discrete time analogue of Lemma 3.3, we have the following inequality for M̃_k; the proof is omitted, being similar to that of Lemma 3.3.
We finally present the main result in this subsection.
Proposition 4.9. Let Assumptions 2.1 and 2.3 hold. Suppose in addition that ∇²_x U is L′-Lipschitz and that ∆t_k ≤ min(η*/2, 1/L) for sufficiently large k ≥ 0. Then there exist some constants C_1 > 0 and C_2 > 0 such that, for sufficiently large k ≥ 0 and for all α > 0, the stated estimate holds. By differentiating e^{Ct} T_k^{−(E*/E+α)} H_{γ(ε_k)}(µ_t|µ*_k) in t and then integrating from T_k to T_{k+1}, we obtain the desired inequality, where C_1 > 0 and C_2 > 0 are constants independent of k. This completes the proof.
The term P(U(X_k) > δ) can be handled by Lemma 3.7 to obtain the desired convergence rate. For the convergence of 2H_{γ(ε_k)}(µ_k|µ*_k), we can apply Proposition 4.9 and Lemma 4.11 to show that there are some positive constants c_1, c̃_1, c_2 and an index k_0 ≥ 0 such that the recursive estimate holds. Finally, by Lemma 4.12, which is presented and proved below, we know that H_{γ(ε_{k_0})}(µ_{k_0}|µ*_{k_0}) is finite. Then there exists a positive constant C > 0 such that the claimed bound holds. This completes the proof.
We finally provide a discrete time analogue of Lemma 3.8, which is needed in the proof of Theorem 2.7.

Lemma 4.12. For every k ≥ 0, the distorted entropy H_{γ(ε_k)}(µ_k|µ*_k) is finite.
Proof. As in the proof of Lemma 3.8, it suffices to prove that the Fisher information I(µ_k|µ*_k) is finite for every k ≥ 0. Again, we use C to denote a generic positive constant whose value may change from line to line.
For t ∈ [T_k, T_{k+1}], with h̃_t = p_t/p*_k, recall the expression of I(µ_t|µ*_k). Using a computation similar to that for H_{γ(ε_k)}(µ_t|µ*_k) in Section 4.2, we obtain

d/dt I(µ_t|µ*_k) = ∫ ⟨∇_z (δI(µ_t|µ*_k)/δµ_t), v_{t,k}⟩ p_t(z) dz.

Following the same steps as in (3.5), (3.6) and (3.7), we further expand the right-hand side as in (4.16). Further, by arguments similar to those in Lemma 4.5, and with A_t(z) as given by (4.6), we obtain the corresponding bound. Following the same arguments as in Lemmas 4.6 and 4.7, we can further obtain inequalities which imply that

I(µ_{k+1}|µ*_k) ≤ e^{(1+L)∆t_k} I(µ_k|µ*_k) + C∆t_k³.    (4.18)

Finally, following computation steps similar to those in Lemma 4.11, we obtain a comparison valid for all ε > 0. This, together with (4.18), yields a recursive control of the Fisher information, whose right-hand side is clearly finite for every k ≥ 0.