The enhanced Sanov theorem and propagation of chaos

We establish a Sanov-type large deviation principle for an ensemble of interacting Brownian rough paths. As an application, a large deviation principle for the ($k$-layer, enhanced) empirical measure of weakly interacting diffusions is obtained. This in turn implies a propagation of chaos result in rough path spaces and allows for a robust subsequent analysis of the particle system and its McKean-Vlasov type limit, as shown in two corollaries.


Large deviation and rough paths
The present paper is concerned with the intersection of large deviations, rough paths and (weakly) interacting diffusions. We note (i) that large deviations were one of the first application areas of rough path theory: indeed, following Ledoux et al. [15], a large deviation principle for Brownian motion and Lévy's area, scaled by $\varepsilon$ and $\varepsilon^2$ respectively, in rough path topology, immediately yields the Freidlin-Wentzell theory of large deviations for diffusions with small noise; it suffices to combine continuity of the Itô map in the rough path sense with the contraction principle of large deviation theory [13]. (ii) The interplay of rough paths with interacting stochastic differential equations was pioneered in [4]. This work, as well as the more recent [1], required in particular the development of a McKean-Vlasov theory in the context of random rough differential equations (which is not at all the aim of this paper). Finally, (iii) large deviations for interacting diffusions is a huge field; a small selection of relevant references is given by [5-7, 17, 18].
In a sense, we combine here aspects of all the aforementioned references. In particular, when compared to the many classical works (iii), an advantage of our approach is robustness: as soon as we have an LDP on a suitably enhanced space ("enhanced Sanov"), on which most stochastic operations of interest are continuous (the raison d'être of rough paths), basic facts of large deviation theory, such as the contraction principle or Varadhan's lemma, become directly applicable. By contrast, stronger versions of the contraction principle or of Varadhan's lemma need suitable approximate continuity properties which must be checked case by case.
We briefly describe our main results. Let $\{B^i : i \in \mathbb{N}\}$ be a family of independent $d$-dimensional standard Brownian motions on a fixed filtered probability space $(\Omega, \mathcal{A}, (\mathcal{F}_t)_t, \mathbb{P})$. (Later on, we shall allow for non-trivial $\mathrm{Law}(B^i_0) \equiv \lambda$.) On a finite time horizon, say $[0,T]$, we may regard them as $C([0,T];\mathbb{R}^d)$-valued i.i.d. random variables. By a classical law of large numbers (LLN) argument (e.g. [10, Thm 11.4.1]) the empirical measure $L^B_n$, a random measure on path space, converges to the $d$-dimensional Wiener measure $P^{\{d\}}$. More precisely, with probability one,

$$L^B_n(\omega) := \frac{1}{n}\sum_{i=1}^n \delta_{B^i(\omega)} \longrightarrow P^{\{d\}} \quad \text{as } n \to \infty,$$

where we regard $L^B_n$ and $P^{\{d\}}$ as random variables with values in the (Polish) space $\mathcal{P}(C([0,T];\mathbb{R}^d))$, equipped with the $C_b$-weak topology. Sanov's theorem quantifies the speed of this convergence: for a measure $Q$ on $C([0,T];\mathbb{R}^d)$,

$$\mathbb{P}\big(L^B_n \approx Q\big) \approx \exp\big(-n\,H(Q \mid P^{\{d\}})\big)$$

in the form of a large deviation principle [8,9], where $H$ is the relative entropy. Now, for each $1 \le i, j \le n$, we introduce the $2d$-dimensional double-layer process $B^{\{2\};ij} \equiv B^{ij}$ as

$$B^{ij} := (B^i, B^j), \tag{1.1}$$

and then define the enhanced double-layer process $\mathbb{B}^{\{2\};ij} \equiv \mathbb{B}^{ij}$, with values in the space of $2d \times 2d$ matrices, as

$$\mathbb{B}^{ij}_{s,t} := \int_s^t B^{ij}_{s,r} \otimes \circ\, dB^{ij}_r,$$

where $\circ$ denotes Stratonovich integration. Clearly, for any $i \neq j$, we have $\mathrm{Law}(B^{ij}) = P^{\{2d\}}$, where $P^{\{e\}}$ denotes $e$-dimensional Wiener measure. We are interested in (the $G^2(\mathbb{R}^{2d})$-valued process) $\mathbf{B}^{ij} := (B^{ij}, \mathbb{B}^{ij})$ and in the associated enhanced double-layer empirical measure $L^{\mathbf{B};\{2\}}_n(\omega) := n^{-2} \sum_{1 \le i,j \le n} \delta_{\mathbf{B}^{ij}(\omega)}$.

In order to extend the enhanced "double-layer" ($k = 2$) empirical measure to any $k \ge 3$, define the ($kd$-dimensional) $k$-layer process $B^{\{k\};i_1,\dots,i_k} \equiv (B^{i_1}, \dots, B^{i_k})$, its rough path lift $\mathbf{B}^{\{k\};i_1,\dots,i_k}$, and then the enhanced "$k$-layer" empirical measure given by

$$L^{\mathbf{B};\{k\}}_n(\omega) := n^{-k} \sum_{1 \le i_1,\dots,i_k \le n} \delta_{\mathbf{B}^{\{k\};i_1,\dots,i_k}(\omega)}.$$

One may expect, as suggested by our notation, that, for any integer $k$,

$$L^{\mathbf{B};\{k\}}_n(\omega) \longrightarrow P^{\{kd\}} \quad \text{as } n \to \infty. \tag{1.4}$$

This is indeed the case, but it is not a consequence of the LLN, for even when $k = 2$ the $\{\mathbf{B}^{ij} : i, j = 1, \dots, n\}$ are not independent. In fact, we shall study the speed of convergence around this limit: one of our main results (Theorem 1.1) is a large deviation principle for the law of $L^{\mathbf{B};\{k\}}_n$. The sequence $\{\mathrm{Law}(L^{\mathbf{B};\{k\}}_n) : n \in \mathbb{N}\}$ satisfies a large deviation principle on $\mathcal{P}(C^{0,\alpha}_g([0,T];\mathbb{R}^{kd}))$ endowed with the $C_b$-weak topology, with scale $n$ and good rate function $I : \mathcal{P}(C^{0,\alpha}_g([0,T];\mathbb{R}^{kd})) \to \mathbb{R} \cup \{\infty\}$ that is given by (1.5).
This LDP is also valid in a stronger ("modified Wasserstein") topology.
Here $\pi_1$ is the canonical projection onto the first $d$ components and $F^{\{k\}}$ is the map defined in (1.6). The interest in a modified Wasserstein topology (on probability measures on the space of rough paths, see Section 4 for details) is the continuity of certain maps defined through rough integrals (here $k = 2$, but then trivially for $k \ge 2$ by projection) for sufficiently nice $\bar b$. Indeed, combining Girsanov's theorem and Varadhan's lemma will then imply an LDP for the empirical measures, as $n \to \infty$, for the particle system given by

$$dX^{i,n}_t = \frac{1}{n} \sum_{j=1}^n b\big(X^{i,n}_t, X^{j,n}_t\big)\,dt + dB^i_t, \qquad i = 1, \dots, n.$$

(Given a function $b : \mathbb{R}^{2d} \to \mathbb{R}^d$, we use the notation $(x_1, x_2) \in \mathbb{R}^d \times \mathbb{R}^d = \mathbb{R}^{2d}$ and we denote by $\bar b : \mathbb{R}^{2d} \to \mathbb{R}^{2d}$ the function such that $\bar b(x_1, x_2)_1 = b(x_1, x_2)$ and $\bar b(x_1, x_2)_2 = 0$.) In fact, our approach not only allows us to recover the (known, see e.g. [2,7]) rate function for the large deviations of such a particle system, of the form $J_b(Q) = H(Q \mid \Phi(Q))$, cf. Section 7 (where $\Phi = \Phi_b$ is introduced, such that fixed points of $\Phi$ are solutions to the martingale problem of the corresponding McKean-Vlasov equation with mean-field drift $b$), but it also gives the LDP on the level of $k$-layer enhanced empirical measures. We shall see in two applications, namely Corollary 1.5 and Corollary 1.6 below, exactly how useful this can be.

Theorem 1.2. Let $X^n = (X^{1,n}, \dots, X^{n,n})$ be the solution to the above system, where the initial law satisfies a suitable exponential integrability condition (Condition (3.15)). Let $L^{\mathbf{X};\{k\}}_n$ be the corresponding enhanced $k$-layer empirical measure, $k \ge 2$. Fix $\alpha \in (1/3, 1/2)$. Then the sequence of laws $\{\mathrm{Law}(L^{\mathbf{X};\{k\}}_n) : n \in \mathbb{N}\}$ satisfies a large deviation principle on (a modified Wasserstein) space of probability measures on $C^{0,\alpha}_g([0,T];\mathbb{R}^{kd})$ with scale $n$ and good rate function $J_b$ given by (1.8).
A first consequence of this large deviation principle, together with the fact that the rate function has a unique zero, is a "law of large numbers" which already contains a remarkably strong form of propagation of chaos (POC), namely Theorem 1.4 below. This result can also be recovered via classical Itô calculus (the reader can verify this as an exercise); nevertheless, it illustrates well the extra information carried by the LDP above. Moreover, its Corollary 1.5 is another example of the combination of mean-field and rough path arguments. For context, we first give the classical form of POC. Let us also note that there is much new interest in POC, with recent applications ranging from calibration methods in quantitative finance to the analysis of lithium-ion batteries. Theorem 1.3 (Classical POC, e.g. [17]). In classical terminology [17], the law of $(X^{1,n}, \dots, X^{k,n})$ is $Q$-chaotic, where $Q = \mathrm{Law}(X)$ is a probability measure on the (Polish) space $E = C([0,T], \mathbb{R}^d)$. We now state the enhanced POC on the space of rough paths, that is, paths with values in $G^2(\mathbb{R}^N)$ rather than $\mathbb{R}^N$. We insist that this is not just a form of the classical POC (a.k.a. $Q$-chaos) in which $E = C([0,T], \mathbb{R}^d)$ is replaced by some other (Polish) space which happens to be a space of rough paths. To wit, the limiting measure in our Theorem 1.4 below is not of product measure form, since it effectively tracks all areas between the particle trajectories (in the mean-field limit). This requires it to be a measure on the space of geometric rough paths $C^{0,\alpha}_g([0,T], \mathbb{R}^{kd})$, which indeed offers enough room to capture the iterated integrals of each particle trajectory against every other (the anti-symmetric part of which corresponds to the aforementioned areas). In contrast, a space of $k$ rough paths over $\mathbb{R}^d$ contains strictly less information as it contains, particle trajectories on $\mathbb{R}^d$ aside, only the iterated integrals of each particle trajectory against itself (and hence only the areas of each single $d$-dimensional particle trajectory).
This extra information contained in $C^{0,\alpha}_g([0,T], \mathbb{R}^{kd})$ indeed makes a difference when one is interested in a subsequent analysis of this particle system, as we shall see in the corollary below. But first we state our enhanced POC. Recall that for an $e$-dimensional semimartingale $Z$, its Stratonovich (level 2) lift is given by $\mathbf{Z} = (Z, \mathbb{Z})$ with $\mathbb{Z}_{s,t} = \int_s^t Z_{s,r} \otimes \circ\, dZ_r$. Theorem 1.4 asserts the convergence (1.10), understood as $C_b$-weak convergence of probability measures on $C^{0,\alpha}_g([0,T], \mathbb{R}^{kd})$ equipped with the $\alpha$-Hölder geometric rough path topology.
We now illustrate the power of this new form of propagation of chaos. Recall that the solution flow to an SDE depends continuously on the driving noise in rough path topology (e.g. [13]). We then have immediately the following result, a direct proof of which would require substantial work. Corollary 1.5. Fix some $k \in \mathbb{N}$ and consider, for $n \ge k$, the solution flow $Y^n \equiv Y$ to ) more generally). Then, (in the sense of flows, cf. [13], and $1/2^-$-Hölder on compacts in time) where the weak limit flow is given by We now give a second application of Theorem 1.2 which, to our understanding, cannot be covered by classical LDP results. This is a large deviation principle associated with SDEs driven by $k$-layer paths $(X^{i_1,n}, \dots, X^{i_k,n})$: we take, for $i_1, \dots, i_k$ in $\{1, \dots, n\}$, the SDE . . . $k$, are given $C^3_b$ vector fields. For this SDE we can consider the empirical measure $L^{Y;\{k\}}_n := n^{-k} \sum_{1 \le i_1,\dots,i_k \le n} \delta_{Y^{i_1,\dots,i_k}}$. This empirical measure can be seen as a symmetrization of the system (1.11), as it tracks the positions of $Y^{i_1,\dots,i_k}$, discarding the particular choice of indices $i_1, \dots, i_k$. Now, as in the previous application, rough paths provide continuity of the solution map of this SDE with respect to the driving noise in rough path topology. Therefore the contraction principle implies the following: Corollary 1.6. For any fixed $1/3 < \beta < 1/2$, the sequence $\{\mathrm{Law}(L^{Y;\{k\}}_n) : n \in \mathbb{N}\}$ satisfies a large deviation principle on $\mathcal{P}(C^{0,\beta}([0,T];\mathbb{R}^m))$, endowed with the $C_b$-weak topology.
The paper is organized as follows. In Section 1, after a brief introduction, we explain our main results. Section 2 is devoted to settling notation and recalling some facts on rough paths. In Section 3, we prove the enhanced Sanov theorem (Theorem 1.1) in the 1-Wasserstein metric, leaving the extension to the modified Wasserstein topology to Section 4. For notational simplicity we focus in Sections 3 and 4 on the two-layer case ($k = 2$). We explain in Section 5 how to extend this to general $k$ (and so conclude with a full proof of Theorem 1.1). In Section 6, we introduce the $n$-particle system, more precisely a system of $n$ weakly interacting diffusions, and prove a large deviation principle for the empirical measure, that is Theorem 1.2. Finally, in Sections 7 and 8, we prove respectively the (enhanced $k$-layer) propagation of chaos property and the LDP for the system (1.14).

Relation to the work of Cass-Lyons [4]
We comment in some detail on the relation of our work to Cass-Lyons. In [4], the authors first and foremost establish a theory of mean-field RDEs (more precisely, [4, Theorem 4.9], rough differential equations with mean-field interaction in the drift term) for suitable classes of random rough paths $\mathbf{B}(\omega)$. When it comes to propagation of chaos (see [4, Section 5]) they are able to consider interacting particle dynamics of the form with i.i.d. initial data and driving noise, $(X^{i,n}_0, \mathbf{B}^i)$, and show (Theorem 5.2) that In the scale of $k$-layer enhanced empirical measures, $L^X_n \equiv L^{X;\{k\}}_n|_{k=1}$. Furthermore, it is conjectured (see [4, page 25]) that their approach will be useful to establish a Sanov-type theorem à la Dawson-Gärtner for (1.15). Although related, our work is not a proof of this conjecture. That said, such a result will not imply our results. To be more specific, in our work no mean-field RDE theory is required, and in fact we have taken the noise to be additive Brownian noise, that is $dB^i(\omega)$ versus $\sigma(X^i_t(\omega))\,d\mathbf{B}^i(\omega)$. (We note that including non-interacting diffusion coefficients, $dB^i \rightsquigarrow \sigma(X^i_t)\,dB^i$, would have been possible, as long as the Girsanov argument we use, cf. the proof of Theorem 6.1, remains feasible, which amounts to an ellipticity assumption on $\sigma$.) In the cases where our setting overlaps with [4], we indeed quantify the above with a large deviation principle, but then we also obtain (Theorem 1.2) a Sanov-type theorem à la Dawson-Gärtner for the general $k$-layer enhanced empirical measure $L^{\mathbf{X};\{k\}}_n$. This is in fact out of reach of [4], as can be seen by noting that $L^{\mathbf{X};\{k\}}_n$ necessarily involves information of $(X^{1,n}, \dots, X^{n,n})$, and hence (take e.g. $b \equiv 0$) of $(B^1, \dots, B^n)$, as a joint rough path, rather than as a collection of $n$ rough paths. But no such information is assumed in [4], making $L^{\mathbf{X};\{k\}}_n$, $k \ge 2$, effectively an ill-defined object. In contrast, for us, by working directly with Brownian motion, we always have the Stratonovich lift at our disposal, so this is not an issue. For the same reason, our robust propagation of chaos (Theorem 1.4, and then e.g. Corollary 1.5) and the large deviation principle in Corollary 1.6 cannot possibly be obtained in the framework of Cass-Lyons. We finally note that forthcoming work of Bailleul-Catellier deals with a Sanov-type theorem à la Dawson-Gärtner for (1.15), again in the spirit of Cass-Lyons.

Basic notation and results on rough paths
We introduce the space of rough paths and the space where our empirical measures live. Most of this section is taken from [12] or [13]. Before going into the theory, let us recall the basics of $\alpha$-Hölder continuous functions. Given a Polish space $(E, d)$ with a compatible Lie group structure (it will be $\mathbb{R}^e$ or $G^2(\mathbb{R}^e)$) and given $\alpha$ in $(0,1)$, we define the space $C^\alpha([0,T];E)$ of the $\alpha$-Hölder continuous paths from $[0,T]$ to $E$. This is a complete metric space, endowed with the distance $d_\alpha$. This space is not separable in general. However, the subspace $C^{0,\alpha}([0,T];E)$ given by the closure, with respect to $d_\alpha$, of the smooth ($C^\infty$) paths is separable, hence Polish. Furthermore, for any $\beta > \alpha$, $C^\beta([0,T];E) \subset C^{0,\alpha}([0,T];E)$ and the inclusion is compact.

When dealing with rough paths, we will always assume $\alpha$ in $(1/3, 1/2]$. An $\alpha$-Hölder rough path on $\mathbb{R}^e$ is a triple $\mathbf{X} = (X_0, X, \mathbb{X})$, with $X_0$ a point in $\mathbb{R}^e$, $X = (X_{s,t})_{s<t}$ a two-index $\mathbb{R}^e$-valued map and $\mathbb{X} = (\mathbb{X}_{s,t})_{s<t}$ a two-index $\mathbb{R}^{e \times e}$-valued map (we always suppose $0 \le s, t \le T$ when not specified), satisfying the following conditions (here $v \otimes w$ denotes the tensor product $v w^T$):
1. algebraic conditions (Chen's relation): for any $s < u < t$,
$$X_{s,t} = X_{s,u} + X_{u,t} \quad \text{and} \quad \mathbb{X}_{s,t} = \mathbb{X}_{s,u} + \mathbb{X}_{u,t} + X_{s,u} \otimes X_{u,t}; \tag{2.2}$$
2. analytic conditions:
$$\|X\|_\alpha := \sup_{s<t} \frac{|X_{s,t}|}{|t-s|^\alpha} < \infty, \qquad \|\mathbb{X}\|_{2\alpha} := \sup_{s<t} \frac{|\mathbb{X}_{s,t}|}{|t-s|^{2\alpha}} < \infty.$$
Here $X_0$ represents the initial condition; it is not included in the standard definition (Definition 2.1 in [12], Chapter 2), but we need to keep track of it because we will work with paths starting from a generic probability measure (and not just from a single point). However, with some abuse of notation, we will usually write $\mathbf{X} = (X, \mathbb{X})$, without $X_0$, when this is not relevant for our purposes, as for example when the initial point is fixed (this was the case for the main result, Theorem 1.1). The space of $\alpha$-Hölder rough paths on $\mathbb{R}^e$ is denoted by $\mathscr{C}^\alpha([0,T];\mathbb{R}^e)$.
It is not a vector space (since the sum of two rough paths does not respect Chen's relation), but it is a complete metric space, endowed with the distance $\tilde\rho_\alpha$. For convenience, we also introduce a "norm" on rough paths, written $\|\mathbf{X}\|_\alpha$; this is actually not a norm, but it has a good homogeneity property. A drawback of the space $\mathscr{C}^\alpha([0,T];\mathbb{R}^e)$ is that it is not separable. That is why we also introduce the space $C^{0,\alpha}_g([0,T];\mathbb{R}^e)$ of geometric rough paths. This is the subspace of $\mathscr{C}^\alpha([0,T];\mathbb{R}^e)$ obtained as the closure, with respect to the $\tilde\rho_\alpha$ distance, of the space of smooth $\mathbb{R}^e$-valued paths and their iterated integrals (see [12], Section 2.2). The space $C^{0,\alpha}_g([0,T];\mathbb{R}^e)$, endowed with the distance $\tilde\rho_\alpha$, is a Polish space. This will be the space of interest for us.
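The role of "paths and their iterated integrals" can be made concrete for a piecewise-linear path, where the second level is available in closed form. The following is a minimal sketch, not taken from the paper: the helper names `lift`, `chen_defect` and `levy_area` are hypothetical, and the path is assumed to be given as a list of points in $\mathbb{R}^e$. It computes the increments $(X_{s,t}, \mathbb{X}_{s,t})$ and checks Chen's relation numerically.

```python
def lift(path, s, t):
    """Increments (X_{s,t}, XX_{s,t}) of the piecewise-linear path through
    the points path[s], ..., path[t] (grid indices s < t)."""
    e = len(path[0])
    x = [0.0] * e                          # running first-level increment X_{s,r}
    xx = [[0.0] * e for _ in range(e)]     # running second-level increment
    for k in range(s, t):
        dx = [path[k + 1][a] - path[k][a] for a in range(e)]
        for a in range(e):
            for b in range(e):
                # On one linear piece: int (x + theta*dx) (x) dx d(theta)
                # over theta in [0, 1] equals x (x) dx + dx (x) dx / 2.
                xx[a][b] += x[a] * dx[b] + 0.5 * dx[a] * dx[b]
        x = [x[a] + dx[a] for a in range(e)]
    return x, xx


def chen_defect(path, s, u, t):
    """Largest entry of XX_{s,t} - (XX_{s,u} + XX_{u,t} + X_{s,u} (x) X_{u,t})."""
    x_su, xx_su = lift(path, s, u)
    x_ut, xx_ut = lift(path, u, t)
    _, xx_st = lift(path, s, t)
    e = len(x_su)
    return max(
        abs(xx_st[a][b] - (xx_su[a][b] + xx_ut[a][b] + x_su[a] * x_ut[b]))
        for a in range(e) for b in range(e)
    )


def levy_area(path):
    """Anti-symmetric part of the second level for a planar path."""
    _, xx = lift(path, 0, len(path) - 1)
    return 0.5 * (xx[0][1] - xx[1][0])


if __name__ == "__main__":
    # Unit square traversed counter-clockwise.
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
    print(chen_defect(square, 0, 2, 4))
    print(levy_area(square))
```

For a closed planar loop the anti-symmetric part of the second level is the signed enclosed area, which is the "area" information referred to in the introduction.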
The space of geometric rough paths also has the following geometrical interpretation (taken for example from [12, Section 2.3]): it can be identified with the space $C^{0,\alpha}([0,T]; G^2(\mathbb{R}^e))$, the closure of smooth paths, with respect to the $\alpha$-Hölder topology, over the (free step-2 nilpotent) Lie group $G^2(\mathbb{R}^e)$. In particular, we can consider the $\alpha$-Hölder distance $d_\alpha$ associated with the (Carnot-Carathéodory) distance in $G^2(\mathbb{R}^e)$, as explained at the beginning of this section, and we have, for a constant $C > 0$,

$$\frac{1}{C}\big(|X_0| + \|\mathbf{X}\|_\alpha\big) \le d_\alpha(\mathbf{X}, 0) \le C\big(|X_0| + \|\mathbf{X}\|_\alpha\big)$$

for every geometric rough path $\mathbf{X}$. We call this distance the homogeneous distance. Unless otherwise stated, we will always use the homogeneous distance for geometric rough paths. Notice however that, for the purpose of this paper, only the asymptotic behaviour of $d_\alpha(\mathbf{X}, 0)$, as $|X_0| + \|\mathbf{X}\|_\alpha \to \infty$, is of interest for us (see Sections A and 4.1 on the link between this behaviour and the Wasserstein topology), therefore one can use $|X_0| + \|\mathbf{X}\|_\alpha$ instead of $d_\alpha(\mathbf{X}, 0)$. A consequence of this geometrical interpretation is that, for any $\alpha < \beta$, we have the continuous embedding for spaces of rough paths, where the first embedding is compact. A basic result in Lyons' rough paths theory is that, given a function $f$ regular enough, the integral $\int_0^t f(Y)\,d\mathbf{Y}$ is well defined and continuous with respect to $\mathbf{Y}$ in the rough paths topology. We have (e.g. Theorem 4.4 in [12], Chapter 4): Theorem 2.1. Let $f$ be regular enough and let $\mathbf{X}$ be a geometric $\alpha$-Hölder rough path on $\mathbb{R}^e$. Given a partition $\Delta$ of the interval $[0,T]$, define the approximated integral on $\Delta$ as (2.7) Then, the limit exists for every sequence $(\Delta_n : n \in \mathbb{N})$ with infinitesimal size $|\Delta_n| = \sup_{[s,t] \in \Delta_n} (t - s)$ and is independent of the sequence itself. Furthermore, the map is continuous and it holds, for some constant $C_f$ depending on $f$, (2.10) Recall that Theorem 1.1, through the definition of $F^{\{k\}}$ given in (1.6), involves a (measurable) "rough path lifting map" $S$. Here is the precise definition.
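Consider the piecewise linear approximations $\{X^k : k \in \mathbb{N}\}$ of $X$ based on dyadic partitions, with lifts $S^k(X)$, and define $S(X)$ as in (2.13): the limit of $S^k(X)$ where it exists, and zero elsewhere. By construction, $S(X)$ is in $\mathscr{C}^\alpha$ and actually in $C^{0,\alpha}_g$ (since $X^k$ is a Lipschitz path and so $S^k$ is in $C^{0,\alpha}_g$) and $X \mapsto S(X)$ is a well-defined measurable (but in general discontinuous!) map on path space.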
We now recall the basic relations between rough and stochastic integration, see [12, Proposition 3.5, 3.6 and Corollary 5.2]. We allow $B$ to start from a generic initial probability measure $\lambda$ with finite second moment. (ii) there exists a null-set $N$ with respect to the $e$-dimensional Wiener measure $P = P^{\{e\}}$ (and hence to every $Q$ absolutely continuous with respect to $P^{\{e\}}$), such that, away from this null-set, $S^k$ is a Cauchy sequence in the rough path metric and $S(B) = \mathbf{B} = (B, \mathbb{B})$ $P$-a.s.

Proposition 2.3. Let B be as before and let f be a function in
and the rough integral coincide $P$-a.s.

The enhanced Sanov theorem
The main objective in this section is to prove an LDP for the enhanced empirical measures $L^{\mathbf{B}}_n = L^{\mathbf{B};\{k\}}_n$ in the 1-Wasserstein topology, in the double-layer case ($k = 2$ will be fixed, and often omitted, throughout this section). For this purpose, consider a sequence of independent $d$-dimensional Brownian motions $\{B^i : i \in \mathbb{N}\}$, each starting with initial distribution $\lambda$, defined on some filtered probability space $(\Omega, \mathcal{A}, (\mathcal{F}_t)_t, \mathbb{P})$. In the sequel, for fixed $\alpha \in (1/3, 1/2)$, we use the convention to denote a generic measure on $C^{0,\alpha}([0,T];\mathbb{R}^d)$ by $Q$, and we write $P^Y$ to denote the law on this space of a process $Y$. The empirical measure $L^{\mathbf{B}}_n$ is defined as

$$L^{\mathbf{B}}_n := \frac{1}{n^2} \sum_{i,j=1}^n \delta_{\mathbf{B}^{ij}}.$$

We use the 1-Wasserstein metric as the topology on the space of probability measures (with finite first moment) on the spaces $C^\alpha$ and $C^{0,\alpha}_g$. In this topology, all the maps of the form $\mu \mapsto \int \varphi\, d\mu$, for $\varphi$ continuous with at most linear growth, are continuous; by contrast, in the $C_b$-weak topology we could only allow for continuous bounded $\varphi$. The reason why we consider the 1-Wasserstein metric is mainly that it is more convenient in the proof: first, it gives an easy-to-handle distance between probability measures; second, it makes the map induced by $X \mapsto \mathbb{X}^{(m)}$ (where $\mathbb{X}^{(m)}$ will be a suitable approximation of the stochastic integral $\int_0^t X_r \otimes \circ\, dX_r$) continuous for $m$ fixed ($X \mapsto \mathbb{X}^{(m)}$ has linear growth with respect to $d_\alpha$, so the $C_b$-weak topology would not fit into this scheme).
This section is organized as follows. We start by proving Sanov theorem in the 1-Wasserstein metric. Then, as an intermediate result, we prove an LDP for the double-layer empirical measures, which is a consequence of Sanov theorem (in 1-Wasserstein metric) and the contraction principle. Finally, we show an LDP for the enhanced empirical measures, whose proof uses the idea for the double-layer empirical measures but exploits the extended contraction principle, together with approximation lemmata coming from rough paths theory.
Before going to the results, for completeness we recall the definition of an LDP. Classical notions and results on large deviations can be found for example in [8] and in [9]. Definition 3.1. Let $E$ be a regular Hausdorff topological space (endowed with its Borel $\sigma$-algebra) and let $I : E \to [0, +\infty]$ be a nonnegative lower semi-continuous function (i.e. such that $\{I \le a\}$ is closed for every finite $a$). We say that a sequence $\{\mu_n : n \in \mathbb{N}\}$ of probability measures on $E$ satisfies a large deviation principle (LDP) with scale $n$ and rate function $I$ if the following facts hold: for every closed set $C \subseteq E$,

$$\limsup_{n \to \infty} \frac{1}{n} \log \mu_n(C) \le -\inf_{x \in C} I(x),$$

and for every open set $O \subseteq E$,

$$\liminf_{n \to \infty} \frac{1}{n} \log \mu_n(O) \ge -\inf_{x \in O} I(x).$$

We say that the rate function $I$ is good if, for every $a \ge 0$, the set $\{I \le a\}$ is compact.

Sanov theorem in 1-Wasserstein metric
We quickly review Sanov theorem in 1-Wasserstein metric on a general Polish space. A necessary and sufficient condition for Sanov theorem in p-Wasserstein metric was in fact given in [20], but as the argument is short we include it in a form convenient to us.
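Given a Polish space $(E, d_E)$, we denote by $\mathcal{P}_1(E)$ the space of probability measures on $E$ with finite first moment, i.e. the probability measures $\mu$ with $\int_E d_E(x_0, x)\,\mu(dx) < \infty$ for some (equivalently, any) $x_0$ in $E$. The 1-Wasserstein distance on $\mathcal{P}_1(E)$ is

$$d_W(\mu, \nu) := \inf_{m \in \Gamma(\mu,\nu)} \int_{E \times E} d_E(x, y)\, m(dx, dy),$$

where $\Gamma(\mu, \nu)$ is the set of all probability measures on $E \times E$ with first and second marginal equal to $\mu$ and $\nu$ respectively. Whenever $E$ is some (Polish) space of $\alpha$-Hölder continuous (rough) paths, cf. the beginning of Section 2, we shall write $d_{W,\alpha}$ for the corresponding 1-Wasserstein distance. Some basic facts on the 1-Wasserstein metric will be specified later in the Appendix. We also recall that the relative entropy between two probability measures $\mu$ and $\nu$ on $E$ is defined as

$$H(\mu \mid \nu) := \begin{cases} \int_E \log\dfrac{d\mu}{d\nu}\, d\mu & \text{if } \mu \ll \nu, \\ +\infty & \text{otherwise.} \end{cases}$$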
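Both quantities just recalled can be computed by hand in a toy setting, which may help fix ideas. The following is a minimal sketch, not part of the paper: it assumes $E = \mathbb{R}$ with the usual distance and measures with finitely many atoms, and the function names are hypothetical. For two empirical measures with the same number of atoms on the real line, the optimal coupling is the monotone one, so the 1-Wasserstein distance reduces to matching sorted samples.

```python
import math


def wasserstein1_empirical(xs, ys):
    """1-Wasserstein distance between (1/n) sum delta_{x_i} and
    (1/n) sum delta_{y_i} on the real line (equal sample sizes)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    # On the line the monotone (sorted) coupling is optimal.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def relative_entropy(mu, nu):
    """H(mu | nu) for probability vectors on a common finite set;
    +infinity when mu is not absolutely continuous w.r.t. nu."""
    total = 0.0
    for p, q in zip(mu, nu):
        if p == 0.0:
            continue            # convention 0 * log 0 = 0
        if q == 0.0:
            return math.inf     # mu charges a nu-null point
        total += p * math.log(p / q)
    return total


if __name__ == "__main__":
    print(wasserstein1_empirical([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
    print(relative_entropy([1.0, 0.0], [0.5, 0.5]))
```

The sorted-sample formula is specific to the one-dimensional case; on path space no such shortcut is available, which is one reason the paper works with abstract properties of $d_W$ only.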

Theorem 3.2 (Sanov theorem in Wasserstein metric). Let E be a Polish space and let
Assume that µ satisfies the following condition: there exists a function G : E → [0, +∞], with compact sublevel sets (in particular lower semi-continuous), with more than linear growth (i.e., for some Then the sequence of laws of the empirical measures satisfies a large deviation principle on P 1 (E), endowed with the 1-Wasserstein metric, with rate n and good rate function H(·|µ).
This result differs from the classical Sanov theorem by the fact that it involves the 1-Wasserstein metric, while the classical Sanov theorem involves the $C_b$-weak topology. In this respect the statement above is stronger, but it does need the additional condition on the measure $\mu$. Remark 3.3. In the case $E = C^{0,\alpha}([0,T];\mathbb{R}^d)$, $\alpha < 1/2$, the assumption above is satisfied by $\{B^i : i \in \mathbb{N}\}$ (independent Brownian motions starting from $\lambda$), if $\lambda$ verifies Condition (3.15). Indeed one can take where $\beta$ is in $(\alpha, 1/2)$ and $c$, $\varepsilon$ are the same as in Condition (3.15). This $G$ has compact sublevel sets and more than linear growth; Condition (3.5) is verified since ($B^1(x = 0)$ is the Brownian motion starting at 0) by Condition (3.15) and exponential integrability of $c\,\|B^1\|^{1+\varepsilon}_{C^\beta}$ (a consequence for example of Corollary 13.15 in [13]).

Proof of Theorem 3.2.
The assertion is a consequence of the classical Sanov theorem (in the weak convergence topology, see for example [9, Theorem 3.2.17]) and the inverse contraction principle, see [8, Theorem 4.2.4], provided we prove exponential tightness, in the 1-Wasserstein metric, of the laws of the empirical measures $L^X_n$. We need to prove that, for any $M > 0$, there exists a compact set $K_M \subseteq \mathcal{P}_1(E)$ such that

$$\limsup_{n \to \infty} \frac{1}{n} \log \mathbb{P}\big(L^X_n \notin K_M\big) \le -M.$$

We take $K_M$ as in Lemma A.3. Markov's inequality and the i.i.d. hypothesis on the $X^i$ give, for any $C_M$, an exponential bound on $\mathbb{P}(L^X_n \notin K_M)$ in terms of $A := E[\exp(G(X^1))]$, which is finite by assumption. Hence, by taking $C_M = M + \log A + 1$, we obtain (3.9), which completes the proof.
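The Markov-inequality step can be spelled out as follows. This is only a sketch under an assumption that the text above does not confirm, namely that the compact set of Lemma A.3 has the form $K_M = \{\mu : \int G\, d\mu \le C_M\}$; the symbols $G$, $A$ and $C_M$ are those of the proof.

```latex
% Assumption (hypothetical form of K_M): K_M = \{ \mu : \int G \, d\mu \le C_M \}.
\begin{align*}
\mathbb{P}\big(L^X_n \notin K_M\big)
  &= \mathbb{P}\Big(\frac{1}{n}\sum_{i=1}^n G(X^i) > C_M\Big) \\
  &\le e^{-n C_M}\,
     \mathbb{E}\Big[\exp\Big(\sum_{i=1}^n G(X^i)\Big)\Big]
     && \text{(Markov's inequality)} \\
  &= e^{-n C_M} A^n = e^{-n (C_M - \log A)}
     && \text{(i.i.d., } A = \mathbb{E}[e^{G(X^1)}]\text{)} .
\end{align*}
% With C_M = M + \log A + 1 the right-hand side equals e^{-n(M+1)}, hence
% \limsup_n \frac{1}{n} \log \mathbb{P}(L^X_n \notin K_M) \le -(M+1) \le -M .
```

Under this assumption the choice $C_M = M + \log A + 1$ is exactly what makes the exponent independent of $A$.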

The LDP for the double-layer empirical measure
As a warm-up example, we investigate what happens with the double layer empirical measure where P 1 (C 0,α ([0, T ]; R 2d )) denotes the space of probability measures on C 0,α ([0, T ]; R 2d ) endowed with the 1-Wasserstein metric. In the following, we identify (they are equivalent as metric spaces) and we call π 1 the canonical projection in C 0,α ([0, T ]; R d ) 2 on the first d components.
is continuous. This continuity result is provided in Lemma A.4 in the Appendix.

The LDP for the enhanced empirical measure
We are ready to prove the large deviation result for the sequence of enhanced empirical measures. We denote by $\mathbf{B}^{ij}$ the rough path corresponding to $B^{ij}$. We define it as $\mathbf{B}^{ij} = S(B^{ij})$ (this ensures we can apply the extended contraction principle on the whole space), but, as far as the law is concerned, it is equivalent to define $\mathbf{B}^{ij}$ via the Stratonovich integral (see the section on rough paths). The enhanced empirical measure $L^{\mathbf{B}}_n$ is defined as

$$L^{\mathbf{B}}_n := \frac{1}{n^2} \sum_{i,j=1}^n \delta_{\mathbf{B}^{ij}}.$$

Recall the definition of $S$ given in (2.13) and of $F$ in (3.14). Recall also the definition of the projection $\pi_1$ as $\pi_1(\mathbf{X}) = X^1$, the first $d$ components of the path. Theorem 3.6. Let $\{B^i : i \in \mathbb{N}\}$ be a family of independent $d$-dimensional Brownian motions with initial measure $\lambda$, and assume that there exist $c, \varepsilon > 0$ such that

$$\int_{\mathbb{R}^d} \exp\big(c\,|x|^{1+\varepsilon}\big)\,\lambda(dx) < \infty. \tag{3.15}$$

The family $\{\mathrm{Law}(L^{\mathbf{B}}_n) : n \in \mathbb{N}\}$ satisfies an LDP on $\mathcal{P}_1(C^{0,\alpha}_g([0,T];\mathbb{R}^{2d}))$ endowed with the 1-Wasserstein metric, with scale $n$ and good rate function $I$ given by (3.16). The basic fact that invites us to use the extended contraction principle is the following lemma. Proof. The image measure of $L^B_n$ under $F$ is given by $F(L^B_n) = n^{-2} \sum_{i,j=1}^n \delta_{S(B^{ij})}$. By Proposition 2.2, the Stratonovich rough path $\mathbf{B}^{ij}$ coincides a.s. with $S(B^{ij})$, hence the image measure of $L^B_n$ under $F$ coincides a.s. with $L^{\mathbf{B}}_n$.
In order to apply the extended contraction principle, we introduce a continuous approximation $F_m$ to the map $F$, defined as follows. Given a continuous trajectory $Y$, we define its piecewise linear approximation $Y^{(m)}$ with step $1/m$ and the corresponding lift $S^{(m)}(Y)$. Note that this $S^{(m)}$ is defined like $S^k$, but replacing the dyadic approximation with the approximation at step $1/m$. We denote by $L^{\mathbf{B}^{(m)}}_n$ the enhanced empirical measure associated with $\mathbf{B}^{(m)}$. Notice that, for each $m$, $S^{(m)}$ is continuous with at most linear growth (this is due to the use of the homogeneous distance $d_\alpha$) and the map $Q \mapsto Q \otimes Q$ is continuous with respect to the 1-Wasserstein metrics on $\mathcal{P}_1(C^{0,\alpha}([0,T], \mathbb{R}^d))$ and $\mathcal{P}_1(C^{0,\alpha}([0,T], \mathbb{R}^{2d}))$ (Lemma A.4 in the Appendix). So $F_m$ is continuous in the 1-Wasserstein metric (by Corollary A.2 in the Appendix).
In the following lemmata, we show that the approximation given by $F_m$ is indeed exponentially good, in the sense of the extended contraction principle (as in [9, Lemma 2.1.4]). The main tool is the following lemma, which follows from [13] (see Corollary 13.21 and Exercise 13.22; a proof is given in the Appendix) and gives an exponential bound for the approximation.
As a first step, we establish the exponential tightness of the approximation $L^{\mathbf{B}^{(m)}}_n$. Lemma 3.9. For any $\delta > 0$, it holds Proof. Consider the coupling measure of $L^{\mathbf{B}^{(m)}}_n$ and $L^{\mathbf{B}}_n$. Then, in view of (3.3), we obtain that where we used the fact that the map $(\mathbf{X}, \mathbf{X}') \mapsto d_\alpha(\mathbf{X}, \mathbf{X}')$ is Lipschitz continuous. By means of Hoeffding's decomposition [14], the right-hand side of (3.20) can be rewritten as where $S_n$ denotes the set of all permutations of $\{1, \dots, n\}$ and Hence, an application of the Markov inequality and Jensen's inequality gives, for any $C > 0$ and any $n$ and $m$, Here, we see the advantage of Hoeffding's decomposition: by using the mutual independence of $\{H_{\sigma(2i-1), \sigma(2i)} : i = 1, \dots, \lfloor n/2 \rfloor\}$ we finally get that On the other hand, by choosing By combining this estimate with (3.21), the assertion follows.
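The combinatorial identity behind Hoeffding's decomposition can be checked by brute force for small $n$. The sketch below is an illustration only, with hypothetical names and a generic kernel `h` standing in for the quantities $H_{ij}$ of the proof. It verifies that the average of `h` over ordered pairs of distinct indices equals the average, over all permutations $\sigma$, of block averages over the disjoint pairs $(\sigma(2i-1), \sigma(2i))$. Within each block average the pairs share no index, which is what yields independence when the underlying samples are independent.

```python
from itertools import permutations
from math import factorial


def u_statistic(h, n):
    """Average of h(i, j) over all ordered pairs of distinct indices."""
    total = sum(h(i, j) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def hoeffding_average(h, n):
    """Average over all permutations sigma of the block mean
    (1/m) * sum_i h(sigma(2i-1), sigma(2i)), with m = floor(n/2).
    Indices are 0-based here, so the blocks are (sigma[2i], sigma[2i+1])."""
    m = n // 2
    total = 0.0
    for sigma in permutations(range(n)):
        total += sum(h(sigma[2 * i], sigma[2 * i + 1]) for i in range(m)) / m
    return total / factorial(n)


if __name__ == "__main__":
    # Arbitrary non-symmetric kernel, chosen only for illustration.
    kernel = lambda i, j: (i + 1) * (j + 2) ** 2
    print(u_statistic(kernel, 5), hoeffding_average(kernel, 5))
```

The brute-force sum over $n!$ permutations is only feasible for very small $n$; in the proof the identity is used analytically, not computed.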
The idea is the following: for any $Q$ with bounded entropy, $\frac{dQ}{dP^{\{d\}}} \otimes \frac{dQ}{dP^{\{d\}}}$ has a uniform $L \log L$ bound with respect to the Wiener measure $P^{\{d\}}$. Hence, the lemma is proven if the norm of $d_\alpha(S^{(m)}(X), S(X))$ in the dual space of $L \log L$, again with respect to $P^{\{d\}}$, converges to 0. This convergence follows from an exponential control of $d_\alpha(S^{(m)}(X), S(X))$ under $P^{\{d\}}$, which is a consequence of Lemma 3.8.
To make this argument work, we use the theory of Orlicz spaces.
Further, on a given measure space $(\Lambda, \Sigma, \mu)$, introduce for any measurable $g, h : \Lambda \to [0, \infty)$ Then, the classical Orlicz-Birnbaum estimate, see [16, Section 3.3], implies that for any measurable, nonnegative functions $g$ and $h$, it holds In particular, by using the explicit form of the Orlicz pair $(\Phi, \Psi)$, the following estimate holds for any measurable, nonnegative functions $g$, $h$ and $k > 0$ where we used that $\int_\Lambda h \log h\, d\mu = 2 H(Q \mid P^{\{d\}}) \le 2a$. Finally, by choosing $k = c\, m^{\eta/2}$, a further application of Lemma 3.8 yields which completes the proof.
It is easy to see that this rate function coincides with the I defined in Theorem 3.6.
We close the section with the convergence (in probability) of the enhanced empirical measures, which follows from the LDP (as is well known in large deviations theory). Proof. The result is a consequence of the LDP for the laws of $L^{\mathbf{B}}_n$ and of the fact that the good rate function has a unique zero at $P^{\{2d\}}$.

The modified Wasserstein space
As already mentioned in the Introduction, in view of our application (Theorem 6.2), we will have to deal with maps of the form and we would like these maps to be continuous (to apply standard tools of large deviations theory). On one side, we know that a map $\mu \mapsto \int G\, d\mu$ is continuous in the 1-Wasserstein metric if $G$ is continuous with at most linear growth. But on the other side, by Theorem 2.1, the rough path integral has a growth of order at most $1/\alpha$, in particular a more than linear growth (with respect to the homogeneous rough paths norm). This creates a problem. Following [3], we introduce a new function $N$ of $\mathbf{X}$ with good concentration properties (w.r.t. Brownian rough paths) such that the rough integral has at most linear growth with respect to $N$. We then devise a strengthened topology, on a restriction of the space $\mathcal{P}_1(C^{0,\alpha}_g([0,T];\mathbb{R}^e))$, which allows us to use as test functions also functions with linear growth with respect to such $N$.
In this new topology we prove the large deviation principle for the enhanced empirical measures, as a consequence of the LDP in the 1-Wasserstein metric, via the inverse contraction principle. This amounts to verifying exponential tightness in the new topology, which can be proved using again Hoeffding's decomposition and also Gaussian estimates for Brownian rough paths.
Remark 4.1. One may ask why we do not simply take $N(\mathbf{X}) = \|\mathbf{X}\|_\alpha^{1/\alpha}$, or allow for the $p$-Wasserstein distance, for $p = 1/\alpha$. The reason is that, with this choice of $N$, we are not able to prove a Sanov-type theorem for the enhanced empirical measure. Actually, in [20], it is proved that a large deviation result in the $p$-Wasserstein distance does not hold for any $p > 2$ (and actually also for $p = 2$), as a consequence of the lack of exponential integrability of $\|\mathbf{X}\|_\alpha^p$.

A modified Wasserstein topology
For the definition of $N$, consider the following sequence of stopping times: given $\mathbf{X}$ in $C^{0,\alpha}_g([0,T];\mathbb{R}^e)$, we define Here $\|\mathbf{X}\|_{(1/\alpha)\text{-var},[s,t]}$ is the $(1/\alpha)$-variation of $\mathbf{X}$, as a group-valued path, in the interval $[s,t]$; see [13] for the precise definition. What we need here is that the norm $\|\mathbf{X}\|_{(1/\alpha)\text{-var},[s,t]}$ is a continuous function of $\mathbf{X}$, in the space $C^{0,\alpha}_g([0,T];\mathbb{R}^e)$, for fixed $s, t$, and it is independent of the initial datum $X_0$. Notice that it is also a continuous function of $s, t$, for fixed $\mathbf{X}$, and it is monotone in $s$ and $t$, in the sense that We omit $\alpha$ when not necessary. The following lower semi-continuity property of $N$ will be proved in the Appendix.

Lemma 4.2. The function $N$ is lower semi-continuous on $C^{0,\alpha}_g([0,T];\mathbb{R}^e)$.
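To make the greedy construction behind $N$ concrete, here is a minimal Python sketch for a scalar path sampled on a grid. It is an illustration under simplifying assumptions, not the construction used in the proofs: only the first level of the path is considered, the $p$-variation (with $p = 1/\alpha$) is taken over partitions of the sample points, the threshold is assumed to be 1, and all function names are hypothetical.

```python
from typing import List, Sequence


def p_variation_pow(x: Sequence[float], p: float) -> float:
    """p-variation, raised to the power p, of a scalar path sampled at x[0..m-1].

    Dynamic programme over partitions made of sample points:
    best[j] = max over i < j of best[i] + |x[j] - x[i]|**p.
    """
    m = len(x)
    if m == 0:
        return 0.0
    best = [0.0] * m
    for j in range(1, m):
        best[j] = max(best[i] + abs(x[j] - x[i]) ** p for i in range(j))
    return best[-1]


def greedy_stopping_indices(x: Sequence[float], p: float,
                            threshold: float = 1.0) -> List[int]:
    """Indices tau_0 = 0 < tau_1 < ... of the greedy stopping times.

    tau_{i+1} is the first index t > tau_i at which the p-variation over
    [tau_i, t] reaches `threshold`, or the last index if it never does.
    """
    last = len(x) - 1
    taus = [0]
    while taus[-1] < last:
        start = taus[-1]
        nxt = last  # plays the role of "... wedge T"
        for t in range(start + 1, last + 1):
            if p_variation_pow(x[start:t + 1], p) >= threshold:
                nxt = t
                break
        taus.append(nxt)
    return taus


def greedy_count(x: Sequence[float], p: float, threshold: float = 1.0) -> int:
    """Discrete analogue of N: the largest i with tau_i strictly before the end."""
    last = len(x) - 1
    taus = greedy_stopping_indices(x, p, threshold)
    return max((i for i, tau in enumerate(taus) if tau < last), default=0)
```

For the monotone path 0, 0.5, 1, ..., 3.5 the threshold is reached at the values 1, 2 and 3, so `greedy_count` returns 3 both for `p = 1` and for `p = 2`. The cubic cost of this naive implementation is acceptable for a sketch but not for long paths.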
The next lemma gives the desired at most linear growth of the rough integral in terms of $N$; see [12] for a proof.
Now we introduce a modified topology, on a restriction of $\mathcal{P}_1(C^{0,\alpha}_g([0,T];\mathbb{R}^e))$, in order to deal with functionals of the form $\mu \mapsto \int G\,d\mu$ for some continuous $G$ with $G(\mathbf{X}) \le C(1 + N(\mathbf{X}))$.

we have
We say that a subset $C$ of $\mathcal{P}_{(\|\cdot\|+N)^{1+\varepsilon}}(C^{0,\alpha}_g([0,T];\mathbb{R}^e))$ is closed in the $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology if it is closed under convergence of sequences.
Furthermore, since N is lower semi-continuous, by Corollary B.2 the functional is sequentially lower semi-continuous with respect to the 1-Wasserstein topology: if {µ n : n ∈ N} converges to µ in the 1-Wasserstein or in the ( · + N ) 1+ε -Wasserstein topology, then Proposition 4.6. Assume that {µ n : n ∈ N} converges to µ in the ( · + N ) 1+ε -Wasserstein topology. Let G be a continuous function on C 0,α g ([0, T ]; R e ) such that G(X) ≤ C(1 + |X 0 | + X α + N α (X)), as for example the rough integral. Then, (4.8) Proof. For any m positive integer, let G m be the continuous bounded function defined from G with truncation at level m, that is G m = G1 |G|≤m + m1 G>m − m1 G<−m . We have for every m, n, Notice that the condition G(X) ≤ C(1 + |X 0 | + X + N (X)) implies (for m with m/C > 4) that So it holds for any n where D := sup n C 0,α g (|X 0 | + X + N (X)) 1+ε µ n (dX) is bounded by assumption. The same estimates holds also for µ in place of µ n , by the lower semi-continuity property (4.7). Hence, for any ρ > 0, we can find m ρ such that Fix such m ρ . Since G mρ is continuous bounded, there exists n ρ < ∞ such that, for every n ≥ n ρ , So we conclude that, for every n ≥ n ρ , The proof is complete.
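The truncation step in the proof above rests on a generic mechanism: clipping $G$ at level $m$ costs at most $m^{-\varepsilon}$ times a $(1+\varepsilon)$-moment of $G$, because $|G - G_m| \le |G|\,1_{|G|>m} \le |G|^{1+\varepsilon}/m^{\varepsilon}$. The following Python sketch checks this pointwise inequality on samples. It is only an illustration: the sample values standing in for $G(\mathbf{X})$ are hypothetical, the function names are not from the paper, and the constants in the actual proof may differ.

```python
import random
from typing import List


def truncate(value: float, m: float) -> float:
    """G_m = G 1_{|G|<=m} + m 1_{G>m} - m 1_{G<-m}, i.e. clipping to [-m, m]."""
    return max(-m, min(m, value))


def truncation_error(samples: List[float], m: float) -> float:
    """Empirical version of the integral of |G - G_m|."""
    return sum(abs(g - truncate(g, m)) for g in samples) / len(samples)


def moment_bound(samples: List[float], m: float, eps: float) -> float:
    """Empirical version of m^{-eps} times the integral of |G|^{1+eps}."""
    moment = sum(abs(g) ** (1.0 + eps) for g in samples) / len(samples)
    return m ** (-eps) * moment


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical values of G along sampled paths.
    values = [3.0 * rng.gauss(0.0, 1.0) for _ in range(2000)]
    for level in (1.0, 2.0, 4.0, 8.0):
        print(level, truncation_error(values, level), moment_bound(values, level, 0.5))
```

Since the bound depends on the measure only through its $(1+\varepsilon)$-moment, it is uniform over any family of measures with bounded $(1+\varepsilon)$-moments, which is what makes the three-term splitting in the proof work.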
We conclude this subsection with a lemma on compact sets on this space. This will be useful in view of exponential tightness on P ( · +N ) 1+ε (C 0,α g ([0, T ]; R e )). Recall that, given a topology τ (i.e. the set of all open sets), its restriction τ A to a set A is given by {B ∩ A : B ∈ τ }. 1. For any R > 0, the ( · + N ) 1+ε -Wasserstein topology restricted on the set coincides with the 1-Wasserstein topology restricted there.
Proof. For the first part, every closed set in $\bar B(R)$ with respect to the (restricted) 1-Wasserstein topology is also closed with respect to the (restricted) $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology, the latter being stronger.
Conversely, let $C$ be a closed subset of $\bar B(R)$ in the restricted $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology; notice that $C$ is also closed in the (unrestricted) $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology, since $\bar B(R)$ is closed in this topology (by the lower semi-continuity property (4.7)). Let $(\mu_n)_n$ be a sequence in $C$ converging to $\mu$ in the 1-Wasserstein metric. Since $C$ is contained in $\bar B(R)$, the uniform bound holds, hence $\mu_n$ converges to $\mu$ also in the $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology. Furthermore $\mu$ is also in $\bar B(R)$, by the lower semi-continuity property (4.7). Since $C$ is closed in the $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology, $\mu$ must be in $C$, and so $C$ is closed also in the 1-Wasserstein topology. The first statement is proved.
The second part follows from the first one (as a general fact in topology); we give a proof for completeness. Let $(A_i)_{i\in I}$ be a family of open sets, in the $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology, whose union contains $H$, and take $R > 0$ such that $H$ is contained in $\bar B(R)$. Consider $\tilde A_i := A_i \cap \bar B(R)$, which are open sets in the $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology restricted to $\bar B(R)$. By the first statement, they are open also in the restricted 1-Wasserstein topology on $\bar B(R)$. That is, there exist $B_i$ (subsets of $\mathcal{P}_1(C^{0,\alpha}_g)$), open sets in the 1-Wasserstein topology, such that $\tilde A_i = B_i \cap \bar B(R)$. Actually, since $\bar B(R)$ is closed in every topology under consideration, one can choose $B_i = A_i \cup (\mathcal{P}_1(C^{0,\alpha}_g) \setminus \bar B(R))$. In particular $(B_i)_{i\in I}$ is a family of open sets, in the 1-Wasserstein metric, covering $H$. By the compactness of $H$ in the 1-Wasserstein metric, we can extract a finite subset $\{i_1, \dots, i_m\}$ of $I$ such that $\bigcup_{1\le k\le m} B_{i_k}$ contains $H$. Since $\tilde A_i = B_i \cap \bar B(R)$ and $H$ is in $\bar B(R)$, also $\bigcup_{1\le k\le m} A_{i_k}$ contains $H$. The proof is complete.

The LDP in the modified Wasserstein space
In this section we prove the LDP for the enhanced empirical measure, in the stronger $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology, again for the double-layer case ($k = 2$ will be fixed and often omitted in the notation). Recall the definition of $F$ and $S$ in (3.14), (2.13). Recall that the Brownian motions $B^i$ (and their corresponding rough paths) start from a measure $\lambda$ satisfying (3.15). Here we assume $\varepsilon$ to be the one appearing in condition (3.15). Note that we need large deviation tools on regular Hausdorff spaces (as in [8]) and not just on metric spaces. (4.12) Recall that the strategy is to use the inverse contraction principle starting from the previous Theorem 3.6 and that, for this, we need the exponential tightness of the family $(\mathrm{Law}(L^{\mathbf{B}}_n))_n$ in the $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology.
The main tool is the following lemma, which follows for example from [12, Theorems 11.9 and 11.13], see also [3, Theorem 6.3] adapted to our case in the Appendix, and gives an exponential bound for $N(\mathbf{B})$.
Lemma 4.9. Let $\mathbf{B}$ be the Stratonovich enhanced Brownian motion on $\mathbb{R}^e$, with an initial measure satisfying condition (3.15). Then, for any $\alpha < 1/2$, $\beta < 1/2$, the random variables $\|\mathbf{B}\|_\beta$ and $N_\alpha(\mathbf{B})$ have Gaussian tails, in particular, for some $c > 0$, The same result holds for $\mathbf{B}^{11}$ on $\mathbb{R}^{2d}$ (where $B^1$ is a Brownian motion on $\mathbb{R}^d$ starting from $\lambda$ and $\lambda$ satisfies (3.15)).
Proof. For any M > 0, we have to find a set K M , compact in the ( · + N ) 1+ε -Wasserstein topology, such that lim sup (4.14) Our candidate for K M is The compactness of K M follows from Lemma 4.7, since K M satisfies the hypotheses of that result. Indeed 1) K M is "bounded" in P ( · +N ) 1+ε (in the sense of (4.10)), since X α ≤ C X β for some C > 0; 2) it is also compact in the 1-Wasserstein metric, as a consequence of Lemma A.3 (since G has more than linear growth, is lower semi-continuous and has pre-compact sublevel sets for the compact inclusion of C β g in C 0,α g ). Now we verify (4.14). We use a strategy similar to that for Lemma 3.9. By Markov inequality, we have for any C > 0, Exploiting Hoeffding decomposition as in the proof of Lemma 3.9, we get where now By using that H (1, 2) ≤ G(B 12 ) + G(B 11 ) and applying Lemma 4.9 to B 12 and to B 11 (with initial measure λ ⊗ λ), we get that E exp(cH (1, 2)) =: D is finite for some constant c > 0. Hence, choosing C = c[n/2], we get From this (4.14) follows (up to choosing K 3M/c instead of K M ).
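The Hoeffding decomposition invoked above can be checked by brute force for small $n$. The sketch below assumes the classical form of the identity from [14]: the average of $H(i,j)$ over ordered pairs $i \ne j$ equals the average, over all permutations $\sigma$, of $[n/2]^{-1}\sum_{l=1}^{[n/2]} H(\sigma(2l-1), \sigma(2l))$. The function names and the test kernel are hypothetical and serve only as an illustration.

```python
from itertools import permutations
from math import factorial
from typing import Callable


def u_statistic(h: Callable[[int, int], float], n: int) -> float:
    """Average of h(i, j) over all ordered pairs of distinct indices."""
    total = sum(h(i, j) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def hoeffding_average(h: Callable[[int, int], float], n: int) -> float:
    """Average over permutations of the block means over disjoint pairs."""
    k = n // 2
    total = 0.0
    for sigma in permutations(range(n)):
        # Disjoint pairs (sigma[0], sigma[1]), (sigma[2], sigma[3]), ...
        total += sum(h(sigma[2 * l], sigma[2 * l + 1]) for l in range(k)) / k
    return total / factorial(n)


if __name__ == "__main__":
    table = [[(3 * i + 7 * j * j) % 11 for j in range(5)] for i in range(5)]
    kernel = lambda i, j: float(table[i][j])
    print(u_statistic(kernel, 5), hoeffding_average(kernel, 5))
```

The point of the identity is that, for each fixed permutation, the $[n/2]$ pairs are disjoint, so the corresponding terms are independent when the underlying particles are; Jensen's inequality then reduces the exponential moment of the full average to that of a single term with exponent divided by $[n/2]$.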
Proof of Theorem 4.8. The result follows by applying the inverse contraction principle (in the version of [8, Theorem 4.2.4]), from the space $\mathcal{P}_1$ to the space $\mathcal{P}_{(\|\cdot\|+N)^{1+\varepsilon}}$ (with the identity map), given the LDP on the former space (Theorem 3.6) and the exponential tightness on the latter space.

Extension to k-layer enhanced empirical measures
So far we have considered the double layer enhanced empirical measure. We now deal with the extension of the LDP to the k-layer enhanced empirical measure, namely Here is the extension of Theorem 3.6, that is Theorem 1.1 in the Introduction, for the 1-Wasserstein metric, extended to a general initial measure.
The proof of this LDP goes like the proof in the double layer case (k = 2). We recall the main steps and the main changes.
First, Lemma 3.7 is extended to the $k$-layer case; the map $Q \mapsto Q^{\otimes k}$ is continuous by Lemma B.4 in the Appendix. Therefore the strategy is still to apply the extended contraction principle. For this we define the approximant $S^{(m);\{k\}}$ as in the double-layer case but on $\mathbb{R}^{kd}$; the approximation $\mathbf{B}^{(m);\{k\}}$ of the enhancement of the $kd$-dimensional Brownian motion and the maps $F^{(m);\{kd\}}$ are defined correspondingly. More generally, the space $\mathbb{R}^{2d}$ must be replaced in all the arguments by $\mathbb{R}^{kd}$. Then we need to extend Lemmata 3.9 and 3.10 to the $k$-layer case.
For a general k, H (m);{k} (i 1 , . . . i k ) can be written as where the sum in the second addend is over all (j 1 , . . . , j k ), with at least one repetition of indices, such that, for any l ∈ {1, . . . k}, there exists l ≤ l with i l = j l , and the coefficient a j 1 ,...,j k is the inverse of a positive integer depending on the repetition of indices in (j 1 , . . . , j k ). The only relevant fact is that the number of terms is independent of m and n (for fixed k) and the coefficients a j 1 ,...,j k are bounded by 1. Now, we use Hoeffding's decomposition [14], for the general k layer case: the right-hand side of (5.4) can be rewritten as Proof. The proof of the lemma goes on as the proof of Lemma 3.10. The only changes are: Q ⊗ Q must be replaced by Q ⊗k , and similarly for the density with respect to P ⊗k , µ must be taken as (P {d} ) ⊗k , h as ( dQ dP {d} ) ⊗k and the estimate on h log h dµ becomes h log h dµ = kH(Q|P {d} ) ≤ ka.
Proof of Theorem 5.1. As in the proof of the double-layer case, Lemmata 5.2 and 5.3 allow us to apply the extended contraction principle, which gives the desired result.
We also have the convergence (in probability) of the enhanced k layer empirical measures, which follows again from the LDP. ) n satisfies an LDP on P ( · +N ) 1+ε (C 0,α g ([0, T ]; R kd )) (endowed with the ( · + N ) 1+ε -Wasserstein topology) with scale n and good rate function Proof. The proof is analogous to the proof of Theorem 4.8; we recall only the main points and changes. In all the arguments, R 2d must be replaced by R kd . First, Lemma 4.9 is extended to B {k};i 1 ,...,i k for any multi-index (i 1 , . . . , i k ), with a similar proof. Then, Lemma 4.10 is extended to the k layer case and gives the exponential tightness, in the modified Wasserstein topology, of Law(L B;{k} n ) n ; the proof is similar to the proof of Lemma 4.10, using the Hoeffding decomposition for the general k layer case (as in the proof of Lemma 5.2) and Lemma 4.9 applied to B {k};1,...,k and to B {k};i 1 ,...,i k with repetition of indices. The exponential tightness allows us to conclude the LDP in the modified Wasserstein topology, by the inverse contraction principle.

Large deviations for weakly interacting diffusions
We consider an interacting particle system of the following type: Here X i,n , i = 1, . . . , n are the unknown positions of the particles, each of them in R d , b : R d ×R d → R d is a given vector field, which we assume regular as much as needed (precisely Brownian motions, on a fixed filtered probability space (Ω, A, (F t ) t , P), and λ is a given probability measure on R d satisfying the exponential integrability condition (3.15) for some c > 0, ε > 0. We will omit the superscript n when not necessary. It is well known that the above system admits existence and strong uniqueness (i.e. uniqueness for fixed X 0 and B).
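For readers who want to experiment with such a system numerically, here is a minimal Euler-Maruyama sketch in Python. It rests on assumptions that are not taken from the text: the system is taken in the usual mean-field form $dX^{i,n}_t = \frac1n\sum_{j=1}^n b(X^{i,n}_t, X^{j,n}_t)\,dt + dB^i_t$ (the sum including $j = i$), the dimension is $d = 1$, and the drift and initial law in the demo are arbitrary choices. All names are hypothetical.

```python
import math
import random
from typing import Callable, List


def simulate_particles(n: int, steps: int, horizon: float,
                       b: Callable[[float, float], float],
                       initial: Callable[[random.Random], float],
                       rng: random.Random) -> List[List[float]]:
    """Euler-Maruyama scheme for n weakly interacting scalar particles.

    Returns a list of steps + 1 snapshots, each a list of n positions.
    """
    dt = horizon / steps
    x = [initial(rng) for _ in range(n)]
    snapshots = [list(x)]
    for _ in range(steps):
        # Mean-field drift: average of b(x_i, x_j) over all particles j.
        drift = [sum(b(x[i], x[j]) for j in range(n)) / n for i in range(n)]
        x = [x[i] + drift[i] * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
             for i in range(n)]
        snapshots.append(list(x))
    return snapshots


def particle_paths(snapshots: List[List[float]]) -> List[List[float]]:
    """Transpose snapshots into one discretised trajectory per particle.

    These trajectories are the atoms of the (1-layer) empirical measure.
    """
    return [list(path) for path in zip(*snapshots)]


if __name__ == "__main__":
    generator = random.Random(1)
    snaps = simulate_particles(
        n=5, steps=20, horizon=1.0,
        b=lambda x, y: math.tanh(y - x),       # bounded, smooth interaction
        initial=lambda r: r.gauss(0.0, 1.0),   # Gaussian initial law
        rng=generator,
    )
    print(len(particle_paths(snaps)), len(snaps))
```

The enhanced ($k$-layer) empirical measure studied in this section would additionally require the iterated integrals of tuples of particle trajectories; the sketch stops at the first level.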
The object of interest is an empirical measure associated to this system. For this, let X n = (X 1,n , . . . , X n,n ) be the solution to the SDE (6.1). However, we will not, as is classical [17], study but instead, as n → ∞, the k-layer, enhanced empirical measure L X;{k} n , defined in complete analogy to the Brownian motion setting. To wit, with k = 2 for notational simplicity only, where X ij,n = (X ij,n , X ij,n ) is the rough path on R 2d associated with X ij,n = (X i,n , X j,n ), defined by X ij,n = S {2d} (X ij,n ). Clearly, L X;{2} n (ω) is a (random) measure on C 0,α g ([0, T ]; R 2d ) and we define, on the space of such measures, 8 2 is a measure on the space C 0,α g ([0, T ]; R 2d ) of 2-layer rough paths. As previously, P {d} is d-dimensional Wiener measure with λ initial distribution. N was introduced in (4.3). 8 Given b : R 2d → R d , using notation (x 1 , x 2 ) ∈ R d × R d = R 2d , we denote byb : R 2d → R 2d the function such that b(x 1 , x 2 ) 1 = b(x 1 , x 2 ) andb(x 1 , x 2 ) 2 = 0. 9 Π2 is the projection (X 12 , . . . ; X 12 , . . .) → (X 12 ; X 12 ).
, let X n = (X 1,n , . . . , X n,n ) be the solution to the system (6.1), with initial law λ satisfying Condition (3.15) for fixed ε > 0, and let L X,{k} n be the corresponding enhanced k-layer empirical measure, k ≥ 2. Fix α ∈ (1/3, 1/2). Then, the sequence of laws {Law(L X n ) : n ∈ N} satisfies a large deviation principle on P ( · +N ) 1+ε (C 0,α g ([0, T ]; R kd )) with scale n and good rate function J b given by Proof. FIRST STEP: Enhanced Girsanov theorem. Let X = X n = (X 1,n , . . . , X n,n ) be the solution to the SDE (6.1) with Stratonovich lift X = S {nd} (X). We prove that the law of X on C 0,α g ([0, T ]; R nd ) is absolutely continuous with respect to the law of the enhanced nd-dimensional Brownian motion B, with density given by exp(ρ n (B)), where ρ n is deterministically defined by dt. that is enhanced Girsanov theorem. SECOND STEP: Density for the law of the enhanced empirical measures. First consider the doublelayer case k = 2. We prove that on the space P ( · +N ) 1+ε (C 0,α g ([0, T ]; R 2d )) the law of the enhanced empirical measure L X n is absolutely continuous with respect to the law of L B n , with density given by exp(nK b ) exp(K b ) for a bounded function K b specified below. The main point is that where This follows from formula (6.6), the structural reason being the mean field interaction. Now by Lemma B.4 in the Appendix (applied with k = 2) the enhanced empirical measure associated with a rough path in R nd is a continuous, in particular measurable function G n of the rough path, that is L X n = G n (X), L B n = G n (B). So it is enough to apply formula (6.7) to ϕ = φ • G n , where φ is any measurable bounded function on P ( · +N ) 1+ε (C 0,α g ([0, T ]; R 2d )), and to use (6.8). In the case k > 2, on the space P ( · +N ) 1+ε (C 0,α g ([0, T ]; R kd )) the law of the enhanced empirical measure L We can conclude as in the double-layer case (applying Lemma B.4 to the general k layer case). ))] is the usual renormalization constant. 
Indeed, the second step invites us to apply Varadhan's lemma (Theorem B.3, which is an easy and well-known consequence of Varadhan's lemma in [8, Theorem 4.3.1]). We need to verify the hypotheses, namely, for k = 2, that K b is a continuous function on P ( · +N ) 1+ε (C 0,α g ([0, T ]; R 2d )) and that it holds, for some γ > 1, The hypotheses for general k follow from those for k = 2 (so we will fix and omit k = 2 in the argument below). On the continuity of K b , it is easy to see that the deterministic integrals in formula (6.4) (i.e. the second and third addend) are continuous bounded functions of µ (they are actually continuous bounded functions of Q = µ • π −1 1 in the C b -weak topology on P(C 0,α ([0, T ]; R d ))), so we concentrate on the term with the rough integral. By Theorem 2.1, the rough integral is continuous on C 0,α g with at most linear growth with respect to N (by (4.4)). So, by Proposition 4.6, the term is continuous on P ( · +N ) 1+ε (C 0,α g ([0, T ]; R 2d )). Hence K b is continuous. Now we prove (6.9) with γ = 2. We use the fact that is a martingale, as one can verify easily (and classically). Hence we have which implies (6.9). Hence, we can apply Varadhan's lemma and get the LDP for {Z Therefore, for any Borel set A in P ( · +N ) 1+ε (C 0,α g ([0, T ]; R kd )), we have ))] = 1 (the exponential being a density) and K b is bounded from above and from below, | log Z n | is bounded uniformly in n. Hence, We insist that it is crucial in the above proof to work with k = 2 (or more) layers, for otherwise the argument, based on continuity of K, fails. Theorem 6.1 implies, of course, an immediate LDP for the (1-layer, non-enhanced) empirical measure L X n as defined in (6.2): it suffices to apply the contraction principle to the map with resulting (good) rate function The (only) purpose of the following corollary is to re-express this rate function in more familiar terms of stochastic analysis.
To this end, we define, for any measure Q on C 0,α ([0, T ]; R d ) which makes the coordinate process, and then also the doubled coordinate process X = (X 1 , X 2 ) under Q ⊗ Q, a Wiener process plus a square integrable (in time and Ω) drift (this happens when H(Q|P {d} ) < ∞, see the proof of Corollary 6.2), Note the last two summands (integrals against dt) are finite under our assumptions on b.
Corollary 6.2. Under the assumptions of Theorem 6.1, the sequence of laws {Law(L X n ) : n ∈ N} satisfies an LDP on P 1 (C 0,α ([0, T ]; R d )) with scale n and good rate function J b given by (6.12) with the understanding that the right-hand side above is +∞ whenever H(Q | P {d} ) = +∞.
Proof. Consider a measure Q with H(Q, P {d} ) = ∞. We need to show that inf{J b (µ) : µ • π −1 1 = Q} = ∞, that is, J b (µ) = ∞ whenever µ projects to Q. By looking at the definition of J b , there is nothing We now consider a measure Q with H(Q, P {d} ) < ∞. We have to show that In fact, from the very definition of . Thus, it only remains to see that Since H(Q, P {d} ) < ∞, by a classical result (see for example [11, Section II Remark 1.3]), there exists an adapted process g such that W t = X t − X 0 − t 0 g r dr is a Wiener process under Q and, denoting by ν the marginal of Q at time 0, it holds In particular we can define t sb (X r ) • dX r , which appears in the definition of K b (so K b (X) makes sense), as a Stratonovich integral under Q ⊗ Q or equivalently under P ⊗ P , and by Proposition 2.3 this integral coincides P ⊗ P -a.s. (and so Q ⊗ Q-a.s.) with the rough integral T 0b (X t ) dX t in the definition of K b . Therefore, i.e. the first addend in the definitions of K b (Q) and K b (µ) coincide. The other addends also coincide, as easily verified (they are classical integrals). Therefore The above discussion has another useful consequence. whenever µ = F {k} (Q) and infinite otherwise.

Application 1: Robust propagation of chaos
It is an elementary fact of large deviations theory, that a LDP at scale n with good rate function, which has a single zero, implies a (weak and in fact -thanks to Borel-Cantelli -strong) law of large numbers. We now give different representations of the rate functions obtained in the last section, which will allow to "see" the single zero. This requires us to consider the following mean field (McKean-Vlasov) The law PX of the solutionX can be seen as fixed point of the map Φ defined in this way: for any probability measure Q on C 0,α ([0, T ], R d ), calling Q t the marginal of Q at time t, Φ(Q) is the law of the solution to the SDE (for given X 0 and Q, this SDE has a pathwise-unique solution). Proof. (i) Indeed, by Girsanov theorem, Φ(Q) is absolutely continuous with respect to P {d} , with density satisfying (where we have used stochastic Fubini theorem for exchanging stochastic integration and integration in Q in the first term). Notice that, for Q absolutely continuous with respect to P d , (ii) The statement follows from Part (i) and Lemma 6.3.
Remark 7.4. As already noted in the Introduction, this enhanced propagation of chaos is also a consequence of classical propagation of chaos and classical Itô calculus, applying Itô's formula to $\int_s^t X^{i,n}_{s,r} \circ dX^{i,n}_r$. We leave the computations as an exercise.

Application 2: An LDP for SDEs driven by k-layer noises
We start recalling the notation. We fix k in N and, for a multi-index I in {1, . . . n} k , we use the notation I j for the j-th component of I. We denote by X I,n = X {k};I,n the vector (X I 1 ,n , . . . X I k ,n ). We take f j : R d → R m , j = 1, . . . k, given C 3 b vector fields. We consider the following family of SDEs on R m driven by X i,n , parameterized by multi-indices I in {1, . . . n} k : where y 0 is a point in R m independent of I and n (however more general choices of initial data should be possible). We call it is a random variable with values in P(C 0,β ([0, T ]; R m ), for any 1/3 < β < 1/2. By rough paths theory, precisely Theorems 8.4 and 8.5 in [12], there exists a (unique) continuous function ϕ : C 0,α g ([0, T ]; R kd ) → C 0,β ([0, T ]; R m ) such that, for every I and every n, Y I,n = ϕ(S kd (X I,n )) (actually ϕ is locally Lipschitz continuous). This brings to the following LDP, as recalled in the introduction:

A Basic facts on 1-Wasserstein metric
Let $(F, d_F)$ be a Polish space. We denote by $\mathcal{P}_1(F)$ the space of probability measures on $F$ with finite first moment. It is a Polish space when endowed with the 1-Wasserstein distance $d_W$, namely
$$d_W(\mu,\nu) = \inf_{\gamma\in\Gamma(\mu,\nu)} \int_{F\times F} d_F(x,y)\,\gamma(dx,dy),$$
where $\Gamma(\mu,\nu)$ is the set of all probability measures on $F \times F$ whose first and second marginals are $\mu$ and $\nu$, respectively (such measures are sometimes called transportation plans). When $F = C^{0,\alpha}([0,T];E)$ (for some Polish space $E$), we use the notation $d_{W,\alpha}$ for the 1-Wasserstein distance associated with the $\alpha$-Hölder distance on $C^{0,\alpha}([0,T];E)$. We recall the following characterization of convergence in the 1-Wasserstein metric, stated in [19, Definition 6.8]. Here and in the following, we say that a map $\varphi : F \to F'$ ($F$, $F'$ being Polish spaces) has at most linear growth if there exist $x_0 \in F$, $y_0 \in F'$ and $C \ge 0$ such that, for every $x$ in $F$,
$$d_{F'}(\varphi(x), y_0) \le C\,(1 + d_F(x, x_0)). \tag{A.2}$$
It is easy to see that this property is equivalent to the following fact: for any $x_0$ in $F$, $y_0$ in $F'$, there exists $C \ge 0$ such that, for every $x$ in $F$, (A.2) holds.
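As a concrete instance of the definition, consider two empirical measures on the real line with the same number of equally weighted atoms. In that special case the infimum over transportation plans is attained by matching sorted atoms, a classical one-dimensional fact. The Python sketch below (function names hypothetical) computes the distance this way and compares it with a brute-force minimum over permutation couplings only.

```python
from itertools import permutations
from typing import Sequence


def wasserstein1_empirical(xs: Sequence[float], ys: Sequence[float]) -> float:
    """1-Wasserstein distance between uniform empirical measures on R.

    Assumes both samples have the same (positive) number of atoms, so that
    the monotone matching of sorted atoms is an optimal transportation plan.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def brute_force_w1(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Minimum transport cost over permutation couplings (small samples only)."""
    n = len(xs)
    return min(
        sum(abs(xs[i] - ys[sigma[i]]) for i in range(n)) / n
        for sigma in permutations(range(n))
    )


if __name__ == "__main__":
    first = [0.2, -1.0, 2.5, 0.7]
    second = [1.1, 0.3, -0.4, 3.0]
    print(wasserstein1_empirical(first, second), brute_force_w1(first, second))
```

On path spaces such as $C^{0,\alpha}([0,T];E)$ no sorting trick is available, and the distance between empirical measures has to be computed as a genuine assignment problem with the Hölder distance as cost.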
Lemma A.1. The following facts are equivalent: • µ n → µ in 1-Wasserstein distance; Proof. Using the equivalence above, it is enough to verify that, for any sequence {µ n : n ∈ N} converging to µ in P 1 (F ), for any continuous function ϕ : F → R with at most linear growth, . Now, since h is continuous with at most linear growth, also ϕ • h is continuous with at most linear growth, hence the convergence above holds.
The following Lemma provides a wide class of compact sets in the 1-Wasserstein metric.
, with compact sublevel sets and with more than linear growth. Define the set Then K M is compact (in the 1-Wasserstein metric).
Proof. We prove sequential compactness (which is equivalent to compactness for metric spaces). Let {ν n : n ∈ N} be a sequence of measures in K M ; we will prove that (ν n ) is tight and that there exists x 0 ∈ F such that, for every η > 0, there exists R > 0 verifying These two conditions imply the existence of a subsequence {ν n k : k ∈ N} converging to some measure ν in P 1 (F ) in the 1-Wasserstein metric; it is easy to prove that ν is still in K M (since the functional ν → F G dν is lower semi-continuous by Corollary B.2), so that K M is compact. For tightness, we use the compact sublevel sets property of G: for every δ > 0, the set {G ≤ δ} is compact and, by Markov's inequality, we have, for any n, This proves tightness. For (A.5), we use the more than linear growth property of G: for some x 0 ∈ F , for any η > 0, there exists R > 0 such that d(x, x 0 )/G(x) < η whenever d(x, x 0 ) > R. Hence, for any n, This proves (A.5) (up to choosing a different R). The lemma is proved.  ((x, y), (x , y )) 2 = d(x, x ) 2 + d(y, y ) 2 ; similarly, for any k ≥ 2, (F k , d {k} ) is a Polish space as well, where d {k} ((x 1 , . . . , x k ), (x 1 , . . . , x k )) 2 = d( Lemma A.4. Let F be a Polish space, k ≥ 2 integer. Then the map is continuous (where P 1 (F, d), P 1 (F k , d {k} ) are endowed with the 1-Wasserstein distance induced by d and d {k} , respectively).
In the case $k = 2^h$ for some positive integer $h$, it is enough to note that $\mu \mapsto \mu^{\otimes 2^h}$ is the $h$-fold iteration of the map $\mu \mapsto \mu^{\otimes 2}$. For general $k$, the measure $\mu^{\otimes k}$ is obtained by projecting the measure $\mu^{\otimes 2^h}$ onto the first $k$ components, for some $h$ with $k \le 2^h$; since this projection is 1-Lipschitz, continuity of $\mu \mapsto \mu^{\otimes k}$ follows.

B Technical results and proofs
We start with a known result on lower semi-continuous functions, that we use at least twice in the paper.
Proof. If $f$ is identically $+\infty$, then it is enough to take $f_k \equiv k$ as Lipschitz approximants. Hence, we consider $f$ taking at least one finite value. We define $\{f_k : k \in \mathbb{N}\}$ as the lower envelope of $f$, namely
$$f_k(x) = \inf_{y \in E}\,\bigl(f(y) + k\,d(x,y)\bigr). \tag{B.1}$$
Since $f$ is bounded from below and not identically $+\infty$, $f_k$ is a real-valued function. The sequence $f_k$ is increasing and, for every $k$, $x$, we have $f_k(x) \le f(x)$ (by choosing $y = x$ in (B.1)). Moreover, for each $k$, $f_k$ is Lipschitz continuous: for every $y$, $|(f(y) + k\,d(x,y)) - (f(y) + k\,d(x',y))| \le k\,d(x,x')$ and therefore $|f_k(x) - f_k(x')| \le k\,d(x,x')$. We are left to prove the pointwise convergence of $f_k$ to $f$. We start by proving convergence at the points $x$ with $f(x)$ finite. Fix $\varepsilon > 0$ and, for every $k$, take a point $x_k$ such that $f(x_k) + k\,d(x,x_k) < f_k(x) + \varepsilon$. The sequence $\{x_k : k \in \mathbb{N}\}$ converges to $x$: indeed $k\,d(x,x_k) \le f_k(x) + \varepsilon + (\inf f)^- \le f(x) + \varepsilon + (\inf f)^-$ for every $k$. Therefore, by lower semi-continuity,
$$f(x) \le \liminf_{k\to\infty} f(x_k) \le \lim_{k\to\infty} f_k(x) + \varepsilon.$$
By the arbitrariness of $\varepsilon$, we conclude $f(x) = \lim_{k\to\infty} f_k(x)$.
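The lower envelope $f_k(x) = \inf_y (f(y) + k\,d(x,y))$ used in this proof is easy to visualise on a grid. The Python sketch below uses, as a hypothetical example, the lower semi-continuous step function $f = 1_{(0,\infty)}$ on $[-1,1]$ with the infimum restricted to grid points; it is an illustration of the three properties used in the proof (domination by $f$, monotonicity in $k$, $k$-Lipschitz continuity), not part of the argument.

```python
from typing import List, Sequence


def lipschitz_envelope(f_values: Sequence[float], grid: Sequence[float],
                       k: float) -> List[float]:
    """Discrete lower envelope f_k(x) = min over grid points y of f(y) + k|x - y|."""
    return [
        min(fy + k * abs(x - y) for fy, y in zip(f_values, grid))
        for x in grid
    ]


def step_example(points: int = 100):
    """Grid on [-1, 1] and the lower semi-continuous step f = 1 on (0, 1]."""
    grid = [i / points for i in range(-points, points + 1)]
    f_values = [0.0 if x <= 0 else 1.0 for x in grid]
    return grid, f_values


if __name__ == "__main__":
    grid, f_values = step_example()
    for k in (1.0, 5.0, 200.0):
        envelope = lipschitz_envelope(f_values, grid, k)
        gap = max(fv - ev for fv, ev in zip(f_values, envelope))
        print(k, gap)
```

On this example $f_k(x) = \min(1, k\,x^+)$, so the envelopes increase to $f$ pointwise while the convergence is not uniform near the jump, in line with the fact that only pointwise convergence is claimed.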
Corollary B.2. Let $(E,d)$ be a metric space and let $f : E \to (-\infty,\infty]$ be lower semi-continuous and bounded from below. Then, for every sequence $\{\mu_n : n \in \mathbb{N}\}$ in $\mathcal{P}(E)$ converging $C_b(E)$-weakly to $\mu$ in $\mathcal{P}(E)$, it holds
$$\int_E f\,d\mu \le \liminf_{n\to\infty} \int_E f\,d\mu_n.$$
Proof. The previous lemma gives that $f = \sup_{k\ge1} f_k$, where $\{f_k : k \in \mathbb{N}\}$ is an increasing sequence of continuous functions. We can assume, possibly replacing $f_k$ with $f_k \wedge k$, that $f_k$ is bounded for every $k$. By the monotone convergence theorem, we have for every $\nu$ in $\mathcal{P}(E)$
$$\int_E f(x)\,\nu(dx) = \sup_{k\ge1} \int_E f_k(x)\,\nu(dx).$$
So the function $\nu \mapsto \int_E f(x)\,\nu(dx)$ is the supremum of a family of continuous functions in the $C_b(E)$-weak topology; therefore, by a standard argument, it is sequentially lower semi-continuous in that topology.
Here is the version of Varadhan's lemma we need. In particular, if Z n = 1 for each n (i.e. if e nϕ µ n is a probability measure), then J = I − ϕ. Furthermore, if {µ n : n ∈ N} is exponentially tight, then so is {ν n : n ∈ N} and the rate function J is good.  for some constant C > 0. The proof is complete.
We prove now the lower-semi-continuity of N α .
Proof of Lemma 4.2. Notice first that, for any i, so that lower semi-continuity of N follows from upper semi-continuity of τ i , for any i, which we now aim to prove. We must show that, for any i in N, for any t > 0, is a closed set. We use induction on i. For i = 1, since τ 0 = 0, closedness follows from continuity of the (1/α) − var norm (with respect to X). For the passage from i to i + 1, take {X m : m ∈ N} sequence in A i+1 (t) converging to some X in C 0,α g ([0, T ]; R e ), we must prove that X belongs to A i+1 (t). By upper semi-continuity of τ i (inductive hypothesis), we have that τ i (X) ≥ lim sup m→∞ τ i (X m ), so, for any δ > 0, the interval [τ i (X) + δ, t] is contained in [τ i (X m ), t] for m large enough. So, for any δ > 0, by continuity and monotonicity properties of the (1/α) − var norm, we have X (1/α)−var,[τ i (X)+δ,t] ≤ lim sup m→∞ X (1/α)−var,[τ i (X m ),t] ≤ 1. (B.12) By arbitrariness of δ > 0 and again by continuity of the norm, we get that X (1/α)−var,[τ i (X),t] ≤ 1, that is X belongs to A i+1 (t). The proof is complete. Now we prove Lemma 4.5.
Proof of Lemma 4.5. The Hausdorff property follows from the fact that the $(\|\cdot\|+N)^{1+\varepsilon}$-Wasserstein topology is stronger than the 1-Wasserstein topology (which is Hausdorff).
Proof of Lemma 3.8. Corollary 13.22 in [13] clearly applies also to Brownian rough paths starting from any initial measure (since $\mathbf{B}$ and $\mathbf{B}^{(m)}$ start from the same point) and gives the existence of a constant $C > 0$ such that, for every $q \ge 1$ and every $m$,
$$E\,d_\alpha(\mathbf{B}^{(m)}, \mathbf{B})^q \le \bigl(C\,q^{1/2}\,m^{-\eta/2}\bigr)^q.$$
From this we get the following estimate on the exponential of the distance above: for any ρ > 0, Using the elementary estimate q q ≤ e q−1 q! (which can be easily proved by induction on q), we have E exp(ρd α (B (m) , B)) ≤ 1 + The proof is complete. [Notice that some estimates were not optimal: in fact the result holds also for d α (B (m) , B) 2 replacing d α (B (m) , B).] Proof of Lemma 4.9. Notice that (for ε < 1, using independence of the initial datum and the increments of Brownian motion) Now, E exp 2c( B β + N α (B)) 1+ε is finite (actually for every c > 0), as proved in Theorems 11.9 and 11.13 in [12]; E e 2c|B 0 | = R e e 2cxλ (dx) is finite because of the exponential integrability condition 3.15 (replacing c with 2c). The same proof applies to B 11 (and to B {k};i 1 ,...,i k for any multiindex (i 1 , . . . , i k ) also with repetition of indices).
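For completeness, the elementary estimate $q^q \le e^{q-1}\,q!$ quoted in the proof of Lemma 3.8 can be obtained as follows. This is the standard induction, written out here as a sketch; it uses only the inequality $(1 + 1/q)^q \le e$.

```latex
% Base case q = 1: 1^1 = 1 = e^{0} \cdot 1!.
% Inductive step: assume q^q \le e^{q-1} q! for some integer q \ge 1. Then
\begin{align*}
(q+1)^{q+1}
  &= (q+1)\, q^{q} \Bigl(1 + \frac{1}{q}\Bigr)^{q} \\
  &\le (q+1)\, e^{q-1}\, q! \cdot e \\
  &= e^{q}\, (q+1)! ,
\end{align*}
% which is the claim for q + 1. In particular, taking square roots,
% q^{q/2} \le \bigl(e^{q-1}\, q!\bigr)^{1/2} for every integer q \ge 1.
```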