From large deviations to Wasserstein gradient flows in multiple dimensions

We study the large deviation rate functional for the empirical distribution of independent Brownian particles with drift. In one dimension, it has been shown by Adams, Dirr, Peletier and Zimmer that this functional is asymptotically equivalent (in the sense of $\Gamma$-convergence) to the Jordan-Kinderlehrer-Otto functional arising in the Wasserstein gradient flow structure of the Fokker-Planck equation. In higher dimensions, part of this statement (the lower bound) has recently been proved by Duong, Laschos and Renger, but the upper bound remained open, since the proof in [DLR13] relies on regularity properties of optimal transport maps that are restricted to one dimension. In this note we present a new proof of the upper bound, thereby generalising the result of [ADPZ11] to arbitrary dimensions.


Introduction
In the recent paper [ADPZ11], Adams, Dirr, Peletier and Zimmer unveiled a fundamental connection between two seemingly unrelated aspects of diffusion equations. They connected the large deviation rate functional for the empirical measure of a system of independently diffusing particles to the entropy gradient flow structure of diffusion equations in the Wasserstein space of probability measures. Let us informally describe these two concepts and their connection here, before giving rigorous statements in Section 2.
Large deviations for independently diffusing particles. We consider $n$ indistinguishable particles evolving according to the stochastic differential equations
$$dX_i(t) = -\nabla\Psi(X_i(t))\,dt + \sqrt{2}\,dW_i(t), \qquad i = 1,\dots,n, \tag{1}$$
where $(W_1(t),\dots,W_n(t))_{t\ge0}$ is a collection of independent standard $\mathbb{R}^d$-valued Brownian motions. We assume that $\Psi:\mathbb{R}^d\to\mathbb{R}$ is twice continuously differentiable and that its Hessian is uniformly bounded from below. Let $\rho^{(n)}_t := n^{-1}\sum_{i=1}^n \delta_{X_i(t)}$ denote the empirical measure of $(X_i(t))_{i=1}^n$. If the initial values $X_i(0)$ are chosen deterministically such that $\rho^{(n)}_0$ converges weakly to some fixed measure $\rho_0\in\mathcal{P}(\mathbb{R}^d)$, then, for each $t\ge0$, it is a classical result that the empirical measure $\rho^{(n)}_t$ converges almost surely to the unique solution of the Fokker-Planck equation
$$\partial_t\rho_t = \Delta\rho_t + \operatorname{div}(\rho_t\nabla\Psi) \tag{2}$$
with initial condition $\rho_0$, see, e.g., [DG87,FK06] for much stronger results. Under suitable growth conditions on $\Psi$, a Sanov-type theorem implies that the random measures $(\rho^{(n)}_t)_n$ satisfy a large deviation principle with rate functional $I_t(\cdot|\rho_0)$ given by
$$I_t(\rho|\rho_0) = \inf_{\gamma\in\Gamma(\rho_0,\rho)} H(\gamma|\rho_{0,t}). \tag{3}$$
Here, $\rho_{0,t}\in\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ denotes the joint law of a solution $(X_0,X_t)$ to (1) with random initial condition $X_0\sim\rho_0$ (independent of the Brownian motion), $H(\cdot|\rho_{0,t})$ denotes the relative entropy with respect to $\rho_{0,t}$, and $\Gamma(\rho_0,\rho)$ is the set of probability measures $\gamma\in\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ with marginals $\rho_0$ and $\rho$. For background on large deviation theory we refer the reader to [DZ98,FK06].
In this paper we are interested in the short-time behaviour of the rate functional I t (·|ρ 0 ) and its relation to the Wasserstein gradient structure of the Fokker-Planck equation.
The Wasserstein gradient structure of the Fokker-Planck equation. A seminal result by Jordan-Kinderlehrer-Otto [JKO98] asserts that the Fokker-Planck equation (2) can be regarded as the gradient flow equation of the relative entropy in the Wasserstein space of probability measures $(\mathcal{P}_2(\mathbb{R}^d), W_2)$. This result can be rigorously interpreted in different ways, e.g., using the theory of gradient flows in metric spaces, or using an infinite-dimensional Riemannian structure on the space of probability measures; see [AGS08] for details. Here we present the original interpretation from [JKO98] in terms of the convergence of a discrete "minimizing movement" scheme, which can be seen as an analogue of the implicit Euler scheme for the gradient flow equation. For $\rho_0\in\mathcal{P}_2(\mathbb{R}^d)$ and $t>0$, define $J_t(\cdot|\rho_0)$ and $S_t[\rho_0]$ by
$$J_t(\rho|\rho_0) := \frac{1}{2t}W_2^2(\rho_0,\rho) + F(\rho) - F(\rho_0), \qquad S_t[\rho_0] := \operatorname*{arg\,min}_{\rho\in\mathcal{P}_2(\mathbb{R}^d)} J_t(\rho|\rho_0), \tag{4}$$
where $F$ denotes the relative entropy functional defined in Section 2. Since this minimisation problem has a unique minimiser, $S_t[\rho_0]$ is well defined. The JKO-functional $J_t$ can be used to construct an iterative discretisation scheme: it was shown in [JKO98] that the limit $\rho_t := \lim_{n\to\infty}(S_{t/n})^n[\rho_0]$ exists for each $t>0$ and satisfies the Fokker-Planck equation (2).
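To make the analogy with the implicit Euler scheme concrete, the following sketch records the finite-dimensional counterpart of one minimising movement step. It is an illustration only: the smooth function $\Phi$ on $\mathbb{R}^d$ and the points $x_k$ are hypothetical stand-ins for the entropy and the measures, and are not taken from [JKO98].

```latex
% Finite-dimensional analogue (assumption: \Phi : R^d -> R is smooth and
% bounded from below; x_k plays the role of \rho_0, and t > 0 is the step size).
\begin{align*}
  x_{k+1} &\in \operatorname*{arg\,min}_{x \in \mathbb{R}^d}
     \Big\{ \frac{1}{2t}\,|x - x_k|^2 + \Phi(x) \Big\},
  \\
  % First-order optimality condition at the minimiser:
  0 &= \frac{1}{t}\,(x_{k+1} - x_k) + \nabla \Phi(x_{k+1}),
  \\
  % which is exactly one implicit Euler step for \dot x = -\nabla \Phi(x):
  \frac{x_{k+1} - x_k}{t} &= -\nabla \Phi(x_{k+1}).
\end{align*}
```

Replacing the Euclidean distance by $W_2$ and $\Phi$ by the relative entropy gives, at a formal level, the variational step used in the scheme of [JKO98].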
Relating $I_t$ and $J_t$. The main result of [ADPZ11] unveils a relation between the large deviation principle and the Wasserstein gradient flow structure. Roughly speaking, it asserts that the functionals $I_t$ and $\frac12 J_t$ are asymptotically equivalent as $t\to0$. More precisely, it was shown that
$$I_t(\cdot|\rho_0) - \frac{1}{4t}W_2^2(\rho_0,\cdot) \longrightarrow \frac12 F(\cdot) - \frac12 F(\rho_0) \qquad\text{as } t\to0 \tag{5}$$
in the sense of $\Gamma$-convergence. This result provides an appealing microscopic explanation for the emergence of the Wasserstein gradient flow structure at the macroscopic level.
The proof of this theorem in [ADPZ11] required two strong technical assumptions. Firstly, the result was limited to one space dimension. Secondly, the proof required highly restrictive regularity assumptions on the involved measures.
In a subsequent paper [DLR13], Duong, Laschos and Renger were able to remove the strong regularity assumptions. Their approach is based on a different representation of the rate functional $I_t$ due to Dawson and Gärtner [DG87] (see also [FK06]), which we shall describe in Section 2. The proof of the lower bound in the $\Gamma$-convergence result in [DLR13] is valid in arbitrary dimensions. However, the remaining part of the argument (the construction of a recovery sequence) is restricted to one dimension, since it relies on regularity estimates for optimal transport maps that are known to fail in higher dimensions.
In this note we shall provide a different argument for the construction of a recovery sequence that works in arbitrary dimensions. Combined with the result from [DLR13], this completes the proof of (5) in arbitrary dimensions. We refer to Theorem 2.2 below for a precise statement.
Structure of the paper. In Section 2 we give a detailed statement of the main convergence result. In Section 3 we collect well-known results about Wasserstein gradient flows that will be used in the proof. Section 4 contains the proof of the convergence result. For completeness, we also include the proof of the lower bound taken from [DLR13]. In the appendix we provide a short proof of the equivalence of different formulations of the Benamou-Brenier formula.

Statement of the main result
In this section we shall rigorously introduce the three objects appearing in the main result of this paper: the Wasserstein metric W 2 , the relative entropy functional F , and the large deviation rate functional I τ .
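The Wasserstein metric. Let $\mathcal{P}_2(\mathbb{R}^d) := \{\rho\in\mathcal{P}(\mathbb{R}^d) : \int |x|^2\,\rho(dx) < \infty\}$ denote the set of probability measures with finite second moment. The $L^2$-Wasserstein distance between $\rho_0,\rho_1\in\mathcal{P}_2(\mathbb{R}^d)$ is defined by
$$W_2(\rho_0,\rho_1)^2 := \inf_{\pi\in\Gamma(\rho_0,\rho_1)} \int_{\mathbb{R}^d\times\mathbb{R}^d} |x-y|^2\,\pi(dx,dy),$$
where the infimum is taken over all couplings $\pi$ of $\rho_0$ and $\rho_1$, i.e., $\Gamma(\rho_0,\rho_1)$ denotes the collection of all $\pi\in\mathcal{P}(\mathbb{R}^d\times\mathbb{R}^d)$ with marginals $\rho_0$ and $\rho_1$.
The relative entropy. Throughout this paper we assume that $\Psi:\mathbb{R}^d\to\mathbb{R}$ is twice continuously differentiable and $\lambda$-convex for some $\lambda\in\mathbb{R}$, i.e., $\operatorname{Hess}\Psi(x)\ge\lambda\,\mathrm{Id}$ for all $x\in\mathbb{R}^d$. The relative entropy functional $F:\mathcal{P}_2(\mathbb{R}^d)\to(-\infty,+\infty]$ is defined by
$$F(\rho) := \begin{cases} \displaystyle\int_{\mathbb{R}^d} f(x)\log f(x)\,dx + \int_{\mathbb{R}^d}\Psi(x)f(x)\,dx, & \text{if } \rho(dx) = f(x)\,dx, \\ +\infty, & \text{otherwise.} \end{cases}$$
This functional is well-defined, since the assumption on the second moment implies that the negative parts of $f\log f$ and $\Psi f$ are integrable with respect to the Lebesgue measure. If $\rho$ is absolutely continuous with respect to the Lebesgue measure, then $F$ can be written as a relative entropy with respect to the equilibrium measure $\nu(dx) = e^{-\Psi(x)}\,dx$. Namely,
$$F(\rho) = H(\rho|\nu) = \int_{\mathbb{R}^d} \frac{d\rho}{d\nu}\log\frac{d\rho}{d\nu}\,d\nu.$$
We also introduce the relative Fisher information $G:\mathcal{P}_2(\mathbb{R}^d)\to[0,+\infty]$, which for sufficiently regular densities $\rho(dx) = f(x)\,dx$ is given by
$$G(\rho) := \int_{\mathbb{R}^d}\Big|\frac{\nabla f}{f} + \nabla\Psi\Big|^2 f\,dx,$$
and which is set to $+\infty$ otherwise.
The large deviation rate functional. The definition of the rate functional $I_\tau$ involves a weighted Sobolev norm of negative order. Let $\mathcal{D} = C_c^\infty(\mathbb{R}^d)$ be the space of test functions and let $\mathcal{D}'$ be the dual space of distributions. Given $\rho\in\mathcal{P}(\mathbb{R}^d)$, we define the weighted $H^{-1}(\rho)$-norm of $s\in\mathcal{D}'$ by the duality formula
$$\|s\|_{-1,\rho} := \sup_{f\in\mathcal{D}} \frac{\langle s,f\rangle}{\|\nabla f\|_{L^2(\rho;\mathbb{R}^d)}},$$
where the supremum runs over all smooth test functions $f\in\mathcal{D}$ for which the denominator does not vanish. Using the identity $b^2/a^2 = \sup_{t\in\mathbb{R}}\,(2tb - t^2a^2)$, one obtains the equivalent formula
$$\|s\|^2_{-1,\rho} = \sup_{f\in\mathcal{D}}\Big\{2\langle s,f\rangle - \int_{\mathbb{R}^d}|\nabla f|^2\,d\rho\Big\}.$$
For fixed $\rho_0\in\mathcal{P}_2(\mathbb{R}^d)$ and $\tau>0$, the functional $I_\tau(\cdot|\rho_0):\mathcal{P}_2(\mathbb{R}^d)\to[0,+\infty]$ is defined by
$$I_\tau(\rho_1|\rho_0) := \inf_{(\rho_t)_t\in AC_2(\rho_0,\rho_1)} \frac{1}{4\tau}\int_0^1 \big\|\partial_t\rho_t - \tau\big(\Delta\rho_t + \operatorname{div}(\rho_t\nabla\Psi)\big)\big\|^2_{-1,\rho_t}\,dt, \tag{6}$$
where $AC_2(\rho_0,\rho_1)$ denotes the set of 2-absolutely continuous curves $(\rho_t)_{t\in[0,1]}$ in $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ with boundary conditions $\rho|_{t=0} = \rho_0$ and $\rho|_{t=1} = \rho_1$. We refer to Section 3 for the definition of 2-absolute continuity. Intuitively, $I_\tau(\rho_1|\rho_0)$ is the value of an optimal control problem, in which one interpolates between $\rho_0$ and $\rho_1$ in such a way that deviations from the Fokker-Planck equation are minimised.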
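The passage from the duality formula for the weighted norm to its quadratic form rests on an elementary one-variable maximisation. The computation below is a sketch, assuming the norm is the supremum of the quotient $\langle s,f\rangle/\|\nabla f\|_{L^2(\rho)}$ over test functions with non-vanishing denominator; the shorthand $a$, $b$ is introduced only for this illustration.

```latex
% Shorthand (illustration only): a = \|\nabla f\|_{L^2(\rho)} > 0 and b = \langle s, f \rangle.
\begin{align*}
  \frac{d}{dt}\big( 2tb - t^2 a^2 \big) = 2b - 2 t a^2 = 0
  \quad &\Longleftrightarrow \quad t^* = \frac{b}{a^2},
  \\
  \sup_{t \in \mathbb{R}} \big( 2tb - t^2 a^2 \big)
  = 2\,\frac{b^2}{a^2} - \frac{b^2}{a^2} &= \frac{b^2}{a^2}.
\end{align*}
% Since tf is again a test function, taking the supremum over f
% (test functions with \nabla f = 0 \rho-a.e. are disregarded here) gives
\begin{equation*}
  \|s\|_{-1,\rho}^2
  = \sup_{f \in \mathcal{D}} \frac{\langle s, f\rangle^2}{\|\nabla f\|_{L^2(\rho)}^2}
  = \sup_{f \in \mathcal{D}} \Big\{ 2\langle s, f\rangle
      - \int_{\mathbb{R}^d} |\nabla f|^2 \, d\rho \Big\}.
\end{equation*}
```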
Remark 2.1. Under two different sets of growth conditions on the potential $\Psi$, coined 'subquadratic' and 'superquadratic', the term inside the infimum of (6) is the large deviation rate functional for trajectories $[0,\tau]\to\mathcal{P}(\mathbb{R}^d)$ of the empirical measure of independent particles, see [DG87]. Using the contraction principle, it was proved in [DLR13, Cor. 4.10] that the large deviation rate functional for the empirical measure at the end time $\tau$ is obtained by taking the infimum over (1-)absolutely continuous curves in $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ with the right boundary conditions. In the subquadratic case, it follows from the proof of [DLR13, Prop. 4.6] that if $\rho_0\in\mathcal{P}_2(\mathbb{R}^d)$ and $F(\rho_0)<\infty$, any weakly continuous curve with $\int_0^\tau \|\partial_t\rho_t - \Delta\rho_t - \operatorname{div}(\rho_t\nabla\Psi)\|^2_{-1,\rho_t}\,dt < \infty$ is also 2-Wasserstein absolutely continuous.
In the superquadratic case, the same result was proved in [FN12, Lem. 2.1]. Therefore, under both sets of conditions on Ψ, we can take the infimum over 2-absolutely continuous curves in (P 2 (R d ), W 2 ), hence the large deviation rate functional (3) coincides with (6).
In the rest of this paper we will not be concerned with the exact conditions under which these expressions coincide, but rather take (6) as the object of study. For more details, see [DLR13,Section 4].
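The two time scales appearing in this remark, trajectories on $[0,\tau]$ versus curves parametrised on $[0,1]$, are related by a simple change of variables. The sketch below assumes that the trajectory functional has the form $\frac14\int_0^\tau\|\partial_t\rho_t - L\rho_t\|^2_{-1,\rho_t}\,dt$ with $L\rho = \Delta\rho + \operatorname{div}(\rho\nabla\Psi)$; the prefactor $\frac14$ and the symbol $\tilde\rho$ are assumptions made for this illustration and should be checked against [DG87].

```latex
% Assumed set-up: \tilde\rho_s := \rho_{\tau s} for s in [0,1], so that
% \partial_s \tilde\rho_s = \tau \, (\partial_t \rho_t)|_{t = \tau s}.
\begin{align*}
  \frac14 \int_0^\tau \big\| \partial_t \rho_t - L\rho_t \big\|_{-1,\rho_t}^2 \, dt
  &= \frac14 \int_0^1 \Big\| \frac{1}{\tau}\,\partial_s \tilde\rho_s - L\tilde\rho_s
       \Big\|_{-1,\tilde\rho_s}^2 \, \tau \, ds
  \\
  &= \frac{1}{4\tau} \int_0^1 \big\| \partial_s \tilde\rho_s - \tau L\tilde\rho_s
       \big\|_{-1,\tilde\rho_s}^2 \, ds .
\end{align*}
```

Under these assumptions, the factor $\frac{1}{4\tau}$ and the factor $\tau$ in front of the Fokker-Planck operator both come from the time rescaling alone.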
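Now we are ready to state the main theorem of this paper:
$$I_\tau(\cdot|\rho_0) - \frac{1}{4\tau}W_2^2(\rho_0,\cdot) \longrightarrow \frac12 F(\cdot) - \frac12 F(\rho_0) \qquad\text{as } \tau\to0 \tag{7}$$
in the sense of $\Gamma$-convergence. More precisely:
(i) For any $\rho_1\in\mathcal{P}_2(\mathbb{R}^d)$ and any sequence $\{\rho_1^\tau\}_\tau\subseteq\mathcal{P}_2(\mathbb{R}^d)$ converging to $\rho_1$ in the 2-Wasserstein metric, we have
$$\liminf_{\tau\to0}\Big(I_\tau(\rho_1^\tau|\rho_0) - \frac{1}{4\tau}W_2^2(\rho_0,\rho_1^\tau)\Big) \ge \frac12 F(\rho_1) - \frac12 F(\rho_0). \tag{8}$$
(ii) For any $\rho_1\in\mathcal{P}_2(\mathbb{R}^d)$ there exists a sequence $\{\rho_1^\tau\}_\tau\subseteq\mathcal{P}_2(\mathbb{R}^d)$ converging to $\rho_1$ in the 2-Wasserstein metric such that
$$\limsup_{\tau\to0}\Big(I_\tau(\rho_1^\tau|\rho_0) - \frac{1}{4\tau}W_2^2(\rho_0,\rho_1^\tau)\Big) \le \frac12 F(\rho_1) - \frac12 F(\rho_0). \tag{9}$$
As discussed in the introduction, this theorem was first proved in dimension 1 in [ADPZ11] under more restrictive conditions on the measures $\rho_0$ and $\rho_1$. Part (i) has been extended to arbitrary dimensions in [DLR13]. The novel contribution of our paper is a proof of (ii) in arbitrary dimensions.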
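A formal expansion of the square makes the statement plausible. The sketch below assumes that the rate functional is the infimum over curves of $\frac{1}{4\tau}\int_0^1\|\partial_t\rho_t-\tau L\rho_t\|^2_{-1,\rho_t}\,dt$ with $L\rho=\Delta\rho+\operatorname{div}(\rho\nabla\Psi)$, that $\|L\rho\|^2_{-1,\rho}=G(\rho)$, and that the chain rule $\frac{d}{dt}F(\rho_t)=-\langle\partial_t\rho_t,L\rho_t\rangle_{-1,\rho_t}$ holds along the curve. All three are assumptions of this illustration, and the inner product notation is introduced only here.

```latex
% Formal computation along a sufficiently regular curve (\rho_t) from \rho_0 to \rho_1.
\begin{align*}
  \frac{1}{4\tau}\int_0^1 \| \partial_t\rho_t - \tau L\rho_t \|_{-1,\rho_t}^2 \, dt
  &= \frac{1}{4\tau}\int_0^1 \|\partial_t\rho_t\|_{-1,\rho_t}^2 \, dt
   - \frac12 \int_0^1 \langle \partial_t\rho_t, L\rho_t \rangle_{-1,\rho_t} \, dt
   + \frac{\tau}{4}\int_0^1 \|L\rho_t\|_{-1,\rho_t}^2 \, dt
  \\
  % chain rule for F in the middle term, \|L\rho\|^2 = G(\rho) in the last term:
  &= \frac{1}{4\tau}\int_0^1 \|\partial_t\rho_t\|_{-1,\rho_t}^2 \, dt
   + \frac12 \big( F(\rho_1) - F(\rho_0) \big)
   + \frac{\tau}{4}\int_0^1 G(\rho_t) \, dt .
\end{align*}
```

By the dynamic characterisation of $W_2$, the first term is at least $\frac{1}{4\tau}W_2^2(\rho_0,\rho_1)$, and the last term is non-negative. The delicate point for the upper bound is that the Fisher information need not be finite along a geodesic, so the last term cannot simply be evaluated on the geodesic itself.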

Ingredients of the proof
The Benamou-Brenier formula. It will be convenient to work with the dynamic characterisation of the Wasserstein distance due to Benamou-Brenier [BB00], which asserts that, for $\rho_0,\rho_1\in\mathcal{P}_2(\mathbb{R}^d)$,
$$W_2^2(\rho_0,\rho_1) = \inf\Big\{\int_0^1 \|\partial_t\rho_t\|^2_{-1,\rho_t}\,dt : (\rho_t)_t\in AC_2(\rho_0,\rho_1)\Big\}. \tag{10}$$
For $p\ge1$, recall that a curve $(\rho_t)_{t\in[0,1]}$ is said to be $p$-absolutely continuous with respect to $W_2$ if there exists a scalar function $m\in L^p(0,1)$ satisfying $W_2(\rho_s,\rho_t)\le\int_s^t m(r)\,dr$ for all $0\le s<t\le1$. We use the notation $(\rho_t)_t\in AC_p(\rho_0,\rho_1)$. If $p=1$, we simply say that $(\rho_t)_{t\in[0,1]}$ is absolutely continuous. In this case, the metric derivative $|\dot\rho_t| := \lim_{h\to0} W_2(\rho_{t+h},\rho_t)/|h|$ exists for a.e. $t\in(0,1)$, see, e.g., [AGS08, Theorem 1.1.2] for more details. It can be shown that (10) implies the identity
$$|\dot\rho_t| = \|\partial_t\rho_t\|_{-1,\rho_t} \qquad\text{for a.e. } t\in(0,1). \tag{11}$$
We refer to Appendix A for an equivalent formulation of the Benamou-Brenier formula which is commonly used in the literature on optimal transport and to [AGS08, Theorem 8.3.1] for a proof of (10), (11) in this formulation.

Relative entropy, Fisher information, and heat flow. A seminal result by McCann [McC97] asserts that the $\lambda$-convexity of $\Psi$ implies displacement $\lambda$-convexity of $F$, see also [Vil03, Theorem 5.15]. This means that for any constant speed $W_2$-geodesic $(\rho_t)_{t\in[0,1]}\subseteq\mathcal{P}_2(\mathbb{R}^d)$ and any $t\in[0,1]$, we have
$$F(\rho_t) \le (1-t)F(\rho_0) + tF(\rho_1) - \frac{\lambda}{2}\,t(1-t)\,W_2^2(\rho_0,\rho_1). \tag{12}$$
In particular, $F$ is finite along geodesics as soon as it is finite at the endpoints. The fact that the relative Fisher information does not enjoy this property is the source of several complications in [DLR13]. We recall further that $F$ is lower semicontinuous with respect to $W_2$-convergence, see [AGS08, Remark 9.4.2 and Lemma 9.4.3].
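The finiteness of $F$ along geodesics can be read off directly from displacement $\lambda$-convexity. The following sketch assumes the convexity inequality in its usual form, with the correction term $\frac{\lambda}{2}t(1-t)W_2^2(\rho_0,\rho_1)$; this explicit form, and the notation $\lambda^-$, are assumptions of the illustration.

```latex
% Assumed inequality: for a constant speed geodesic (\rho_t) and t in [0,1],
%   F(\rho_t) <= (1-t) F(\rho_0) + t F(\rho_1) - (\lambda/2) t (1-t) W_2^2(\rho_0,\rho_1).
% Using t(1-t) <= 1/4 and writing \lambda^- := \max\{-\lambda, 0\}:
\begin{equation*}
  F(\rho_t)
  \le \max\{F(\rho_0), F(\rho_1)\} + \frac{\lambda^-}{8}\, W_2^2(\rho_0,\rho_1)
  \qquad \text{for all } t \in [0,1].
\end{equation*}
```

The right-hand side is finite whenever $F(\rho_0)$ and $F(\rho_1)$ are, and it does not depend on $t$, which gives a uniform upper bound along the geodesic.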
The semigroup associated to the Fokker-Planck equation (2) will be denoted by (P t ) t≥0 . More precisely, for ρ ∈ P 2 (R d ) we set P t ρ := ρ t , where (ρ t ) t is the unique distributional solution to the Fokker-Planck equation (2) with ρ 0 = ρ. This solution can be obtained using, e.g., the metric theory of gradient flows for (generalised) λ-convex functionals, see [AGS08, Thm. 11.2.8].
In the following result we collect some well-known results on the behaviour of the semigroup (P t ) t≥0 .
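(2) For all $\rho,\sigma\in\mathcal{P}_2(\mathbb{R}^d)$ and all $t\ge0$ we have the contraction estimate:
$$W_2(P_t\rho,P_t\sigma) \le e^{-\lambda t}\,W_2(\rho,\sigma). \tag{13}$$
Moreover, for any curve $(\rho_s)_s$ that is absolutely continuous with respect to $W_2$ we have
$$\|\partial_s P_t\rho_s\|_{-1,P_t\rho_s} \le e^{-\lambda t}\,\|\partial_s\rho_s\|_{-1,\rho_s} \qquad\text{for a.e. } s. \tag{14}$$
(3) For all $\rho\in\mathcal{P}_2(\mathbb{R}^d)$ and $t>0$ we have as well as the bounds Finally, for any $W_2$-geodesic $(\rho_s)_{s\in[0,1]}$ with $F(\rho_0), F(\rho_1)<\infty$, we have as $t\searrow0$:
$$\sup_{s\in[0,1]}\big(F(\rho_s) - F(P_t\rho_s)\big) \to 0. \tag{17}$$
Proof. For part (1) and the properties (13), (15) and (16), see [AGS08, Theorems 11.2.1 and 11.2.8]. The estimate (14) follows immediately from (13) and (11). It remains to prove the statement (17), which is less standard. Note first that by (12) we have that $s\mapsto F(\rho_s)$ is continuous and bounded. Our aim is to show that for every $\varepsilon>0$ there exists $\delta>0$ such that $F(\rho_s) - F(P_t\rho_s) < \varepsilon$ whenever $t<\delta$ and $s\in[0,1]$. Assume the contrary, i.e., that there exist $\varepsilon>0$ and sequences $t_k\to0$ and $(s_k)\subset[0,1]$ such that for all $k$,
$$F(\rho_{s_k}) - F(P_{t_k}\rho_{s_k}) \ge \varepsilon. \tag{18}$$
By compactness we can assume that $s_k\to s_0$ as $k\to\infty$ for some $s_0\in[0,1]$. We claim that $P_{t_k}\rho_{s_k}\to\rho_{s_0}$ in $W_2$-distance as $k\to\infty$. Indeed, again by (13) the triangle inequality yields
$$W_2(P_{t_k}\rho_{s_k},\rho_{s_0}) \le W_2(P_{t_k}\rho_{s_k},P_{t_k}\rho_{s_0}) + W_2(P_{t_k}\rho_{s_0},\rho_{s_0}) \le e^{-\lambda t_k}\,W_2(\rho_{s_k},\rho_{s_0}) + W_2(P_{t_k}\rho_{s_0},\rho_{s_0}),$$
and the claim follows from the continuity of $P_t$ at $t=0$ and the continuity of the curve $(\rho_s)$. Passing to the limit $k\to\infty$ in (18), using the continuity of $s\mapsto F(\rho_s)$ and the lower semicontinuity of $F$ with respect to $W_2$, we obtain the following contradiction:
$$F(\rho_{s_0}) \le \liminf_{k\to\infty} F(P_{t_k}\rho_{s_k}) \le \lim_{k\to\infty} F(\rho_{s_k}) - \varepsilon = F(\rho_{s_0}) - \varepsilon,$$
which completes the proof.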
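The step from the contraction estimate to the bound on the speed of the evolved curve can be spelled out as follows. The sketch assumes the contraction estimate in the form $W_2(P_t\rho,P_t\sigma)\le e^{-\lambda t}W_2(\rho,\sigma)$ and the identification of the metric derivative with the weighted $H^{-1}$-norm of the time derivative; both are assumptions about the precise form of the formulas referred to as (13) and (11).

```latex
% Writing |\dot\sigma_s| for the metric derivative of a curve (\sigma_s),
% we get for a.e. s in (0,1), whenever both metric derivatives exist:
\begin{equation*}
  \Big| \tfrac{d}{ds} P_t\rho_s \Big|
  = \lim_{h \to 0} \frac{W_2(P_t\rho_{s+h}, P_t\rho_s)}{|h|}
  \le e^{-\lambda t} \lim_{h \to 0} \frac{W_2(\rho_{s+h}, \rho_s)}{|h|}
  = e^{-\lambda t}\, |\dot\rho_s| ,
\end{equation*}
% and hence, by the identification of the metric derivative with the norm,
\begin{equation*}
  \| \partial_s P_t\rho_s \|_{-1, P_t\rho_s}
  \le e^{-\lambda t}\, \| \partial_s \rho_s \|_{-1, \rho_s} .
\end{equation*}
```

The same contraction estimate shows that $s\mapsto P_t\rho_s$ is again absolutely continuous, so that its metric derivative exists for a.e. $s$.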
We conclude this section by stating some useful identities for the derivative of the entropy. In fact, for any absolutely continuous curve (ρ t ) t∈[0,1] with F (ρ t ) ∈ R for all t and where the second equality follows from (26).

Proof of the main result
4.1. Upper bound. In this section we prove existence of the recovery sequence, i.e., statement (ii) of Theorem 2.2. For this purpose we define the set $Q := \{\rho\in\mathcal{P}_2(\mathbb{R}^d) : G(\rho)<\infty\}$. Note that $F(\rho)<\infty$ for all $\rho\in Q$ in view of Remark 2.3. Below we will prove the following two claims:
Claim 4.1. For all $\rho_0,\rho_1\in Q$ we have as $\tau\to0$,
$$I_\tau(\rho_1|\rho_0) - \frac{1}{4\tau}W_2^2(\rho_0,\rho_1) \longrightarrow \frac12 F(\rho_1) - \frac12 F(\rho_0). \tag{21}$$
Claim 4.2. For every $\rho\in\mathcal{P}_2(\mathbb{R}^d)$ there exists a sequence $(\rho^n)_n\subseteq Q$ such that $W_2^2(\rho^n,\rho)\to0$ and $F(\rho^n)\to F(\rho)$.
The existence of the recovery sequence then follows from a straightforward diagonal argument, see [DLR13, Proposition 6.2] for details.
Proof of Claim 4.1: We only need to prove the limsup inequality for the left-hand side of (21), since the liminf inequality will be proved in Section 4.2 below. If $\rho_0=\rho_1$ the claim is immediate, so we take distinct measures $\rho_0,\rho_1\in Q$, and take a geodesic $(\rho_t)_{t\in[0,1]}$ connecting $\rho_0$ and $\rho_1$. We will approximate this curve by running the semigroup for a small time $\varepsilon=\varepsilon(\tau)>0$, which will be determined below. A careful choice of $\varepsilon$ as a function of $\tau$ is crucial for our argument. We thus consider the curve $(\rho^\varepsilon_t)_{t\in[0,1]}$ defined by For the sake of brevity, we shall write $L\rho = \Delta\rho + \operatorname{div}(\rho\nabla\Psi)$. Using the definition of $I_\tau(\rho_1|\rho_0)$ and the second identity (20), we obtain We shall estimate these three terms separately. Let $c_\lambda, k_\lambda>0$ be sufficiently large so that $\frac{e^{-2\lambda\varepsilon}}{1-2\varepsilon}\le 1+k_\lambda\varepsilon$ and $\int_0^\varepsilon e^{-2\lambda t}\,dt\le c_\lambda\varepsilon$ for all $\varepsilon\in(0,\frac14)$. Using the semigroup estimates (16) and (14) and the Benamou-Brenier formula (10), the first term can be bounded by . For the third term we use (16) to obtain We claim that $h(\varepsilon)$ is finite for each $\varepsilon>0$. Indeed, using (16) and (20) we obtain The right-hand side is uniformly bounded in $t$ thanks to the $\lambda$-convexity of $F$ and the uniform convergence (17). Consequently, $h(\varepsilon)<\infty$.
To treat the second term, we can thus use (19) to obtain Combining these three bounds, we infer that We claim that ε = ε(τ ) can be chosen as a function of τ such that This yields the limsup inequality in (21). The corresponding liminf inequality will follow from (8).
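For orientation, here is one way the construction can be organised. This is a hedged sketch, not a transcription of the argument: the explicit three-piece curve, the decay bound $G(P_t\rho)\le e^{-2\lambda t}G(\rho)$, the identity $\|L\rho\|^2_{-1,\rho}=G(\rho)$, the definition $h(\varepsilon):=\sup_{s}G(P_\varepsilon\rho_s)$ and the modulus $\omega$ are all assumptions, chosen to be consistent with the constants $k_\lambda$ and $c_\lambda$ above.

```latex
% Assumed curve: run the flow from \rho_0, traverse the regularised geodesic,
% then run the flow backwards in the parameter into \rho_1.
\begin{equation*}
  \rho^\varepsilon_t :=
  \begin{cases}
    P_t \rho_0, & t \in [0, \varepsilon], \\
    P_\varepsilon \rho_{(t-\varepsilon)/(1-2\varepsilon)}, & t \in (\varepsilon, 1-\varepsilon), \\
    P_{1-t} \rho_1, & t \in [1-\varepsilon, 1].
  \end{cases}
\end{equation*}
% Kinetic term: end pieces via the assumed decay of G, middle piece via the
% contraction of the speed and the constant speed W_2(\rho_0,\rho_1) of the geodesic.
\begin{align*}
  \int_0^1 \|\partial_t \rho^\varepsilon_t\|_{-1,\rho^\varepsilon_t}^2 \, dt
  &\le \frac{e^{-2\lambda\varepsilon}}{1-2\varepsilon}\, W_2^2(\rho_0,\rho_1)
     + \int_0^\varepsilon e^{-2\lambda t}\,dt \; \big( G(\rho_0) + G(\rho_1) \big)
  \\
  &\le (1 + k_\lambda \varepsilon)\, W_2^2(\rho_0,\rho_1)
     + c_\lambda \varepsilon \, \big( G(\rho_0) + G(\rho_1) \big).
\end{align*}
% Fisher information term, with h(\varepsilon) := \sup_s G(P_\varepsilon \rho_s):
\begin{equation*}
  \int_0^1 G(\rho^\varepsilon_t)\,dt
  \le c_\lambda \varepsilon \, \big( G(\rho_0) + G(\rho_1) \big) + h(\varepsilon).
\end{equation*}
% Combining with the cross term \frac12 (F(\rho_1) - F(\rho_0)):
\begin{equation*}
  I_\tau(\rho_1|\rho_0) - \frac{W_2^2(\rho_0,\rho_1)}{4\tau}
  \le \frac12 \big( F(\rho_1) - F(\rho_0) \big)
   + \frac{\varepsilon}{4\tau} \Big( k_\lambda W_2^2(\rho_0,\rho_1)
       + c_\lambda \big( G(\rho_0) + G(\rho_1) \big) \Big)
   + \frac{\tau}{4} \Big( c_\lambda \varepsilon \big( G(\rho_0) + G(\rho_1) \big)
       + h(\varepsilon) \Big).
\end{equation*}
% One admissible choice, assuming h(\varepsilon) <= \omega(\varepsilon)/\varepsilon with
% \omega nondecreasing, 0 < \omega(\varepsilon) <= 1 and \omega(\varepsilon) -> 0:
\begin{equation*}
  \varepsilon(\tau) := \tau \sqrt{\omega(\tau)}
  \quad \Longrightarrow \quad
  \frac{\varepsilon(\tau)}{\tau} = \sqrt{\omega(\tau)} \to 0,
  \qquad
  \tau\, h(\varepsilon(\tau)) \le \frac{\omega(\varepsilon(\tau))}{\sqrt{\omega(\tau)}}
  \le \sqrt{\omega(\tau)} \to 0 .
\end{equation*}
```

With such a choice both error terms vanish as $\tau\to0$, which is the type of statement needed for the limsup inequality. The last estimate uses $\varepsilon(\tau)\le\tau$ and the monotonicity of $\omega$.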
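In this formula, the norm $|\!|\!|\cdot|\!|\!|_{-1,\rho}$ is defined by
$$|\!|\!|s|\!|\!|_{-1,\rho} := \inf\Big\{\|v\|_{L^2(\rho;\mathbb{R}^d)} : v\in L^2(\rho;\mathbb{R}^d),\ s + \operatorname{div}(\rho v) = 0\Big\} \tag{25}$$
for $\rho\in\mathcal{P}(\mathbb{R}^d)$ and $s\in\mathcal{D}'$. It can be shown that the infimum in this definition is uniquely attained, and its minimiser can be characterised as follows: a solution $v\in L^2(\rho;\mathbb{R}^d)$ to the "continuity equation" $s+\operatorname{div}(\rho v)=0$ is optimal in (25) if and only if it belongs to the space of generalised gradient vector fields defined by
$$H_\rho := \overline{\{\nabla\varphi : \varphi\in\mathcal{D}\}}^{\,L^2(\rho;\mathbb{R}^d)}.$$
We refer to [AGS08, Section 8.4] for the proof of these facts. Note in particular that
$$|\!|\!|\operatorname{div}(\rho\nabla\psi)|\!|\!|_{-1,\rho} = \|\nabla\psi\|_{L^2(\rho;\mathbb{R}^d)} \tag{26}$$
whenever $\nabla\psi\in L^2(\rho;\mathbb{R}^d)$.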
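On the other hand, if $\|s\|_{-1,\rho}<\infty$, it follows from $\langle s,f\rangle \le \|s\|_{-1,\rho}\cdot\|\nabla f\|_{L^2(\rho;\mathbb{R}^d)}$ that the mapping $\nabla f\mapsto\langle s,f\rangle$ extends to a bounded linear functional $T:(H_\rho,\|\cdot\|_{L^2(\rho;\mathbb{R}^d)})\to\mathbb{R}$ of norm $\|s\|_{-1,\rho}$. Hence, the Riesz representation theorem implies that $\langle s,f\rangle = \int_{\mathbb{R}^d} v\cdot\nabla f\,d\rho$ for some $v\in H_\rho$ with $\|v\|_{L^2(\rho;\mathbb{R}^d)} = \|s\|_{-1,\rho}$. It follows that $|\!|\!|s|\!|\!|_{-1,\rho}\le\|v\|_{L^2(\rho;\mathbb{R}^d)}$. In view of the first part of the proof, the latter inequality is in fact an equality.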
As a consequence of this lemma we infer that the Benamou-Brenier formulas in (10) and (24) are equivalent.
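As a sanity check of the dynamic formula, with the norm defined through the continuity equation, consider pure translations. The example is an illustration added here, with the vector $a$ and the curve $\rho_t$ as hypothetical choices; it assumes that the constant vector field $a$ can be approximated in $L^2(\rho_t;\mathbb{R}^d)$ by gradients of test functions, so that it is an optimal velocity field.

```latex
% Illustration: fix \rho \in \mathcal{P}_2(\mathbb{R}^d) and a \in \mathbb{R}^d, and let
% \rho_t be the image of \rho under the translation x \mapsto x + t a, t in [0,1].
\begin{align*}
  % continuity equation with constant velocity a, which is a gradient field:
  \partial_t \rho_t + \operatorname{div}(\rho_t\, a) &= 0,
  \qquad a = \nabla_x (a \cdot x),
  \\
  % action of the curve:
  \int_0^1 |\!|\!| \partial_t \rho_t |\!|\!|_{-1,\rho_t}^2 \, dt
  &= \int_0^1 \int_{\mathbb{R}^d} |a|^2 \, d\rho_t \, dt = |a|^2 ,
  \\
  % static side: Jensen gives \int |x-y|^2 d\pi >= |\int (x-y) d\pi|^2 = |a|^2 for
  % every coupling \pi, and the coupling (x, x+a) attains this value:
  W_2^2(\rho_0, \rho_1) &= |a|^2 .
\end{align*}
```

In this example the translation curve is a geodesic and both sides of the dynamic formula coincide, as expected.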