Improved rates of convergence for the multivariate Central Limit Theorem in Wasserstein distance

We provide new bounds for the rate of convergence of the multivariate Central Limit Theorem in Wasserstein distances of order $p \geq 2$. In particular, we obtain what we conjecture to be the asymptotically optimal rate whenever the distribution of the summands admits a non-zero absolutely continuous component and has a non-zero third moment.


Introduction and main result
Let $X_1, \dots, X_n$ be i.i.d. random variables drawn from a measure $\mu$ on $\mathbb{R}^d$ such that $E[X_1] = 0$ and $E[X_1 X_1^T] = I_d$. By the Central Limit Theorem, the measure $\mu_n$ of $S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$ converges to the $d$-dimensional standard normal distribution $\gamma$. In this work, we wish to quantify this convergence for the family of Wasserstein distances of order $p \geq 2$, defined between any two measures $\nu$ and $\nu'$ on $\mathbb{R}^d$ by
$$W_p(\nu, \nu') = \inf_{\pi} \left( \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, d\pi(x, y) \right)^{1/p},$$
where the infimum runs over all couplings $\pi$ with marginals $\nu$ and $\nu'$ and $\|\cdot\|$ is the traditional Euclidean norm.
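As a purely illustrative numerical sketch (not part of the arguments below), the convergence of $\mu_n$ to $\gamma$ in $W_2$ can be observed by Monte Carlo in dimension one, where the Wasserstein distance between two equal-size empirical samples reduces to matching sorted samples; the summand law (a centered exponential) and the sample sizes are our own arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def w2_empirical_1d(x, y):
    # In dimension 1, W_2 between two equal-size empirical measures is
    # attained by the monotone coupling, i.e. by matching sorted samples.
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

def s_n(n, m):
    # m Monte Carlo samples of S_n = n^{-1/2} * sum_i X_i with
    # X_i = Exp(1) - 1 (centered, unit variance, non-zero third moment).
    x = rng.exponential(1.0, size=(m, n)) - 1.0
    return x.sum(axis=1) / np.sqrt(n)

m = 100_000
z = rng.standard_normal(m)
dists = {n: w2_empirical_1d(s_n(n, m), z) for n in (4, 16, 64)}
for n, d in dists.items():
    print(n, round(d, 4))
```

For a summand law with non-zero third moment, one expects the estimated distances to decay at the rate $n^{-1/2}$, up to Monte Carlo error.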
In recent years, multiple works have provided non-asymptotic bounds for $W_p(\mu_n, \gamma)$. For instance, as long as $E[\|X_1\|^4] < \infty$, Theorem 1 [2] gives a bound (1), in which $C > 0$ and $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm. Similar results were also obtained for other $W_p$ distances [2, 5]. However, this bound is not sharp with respect to the dimension: if $X_1$ has i.i.d. components, (1) scales with $d^{3/4}$ while an optimal bound would scale with $\sqrt{d}$. Sharper bounds have been obtained under additional assumptions on the measure $\mu$. For instance, if $\mu$ satisfies a Poincaré inequality with constant $K \geq 1$, Theorem 4.1 [3] gives a bound (2), and similar results have been obtained for all $W_p$ distances with $p \geq 1$ in Theorem 1.2 [6] under the additional assumption that $\mu$ is log-concave. In combination with (2), note that if $\mu$ is log-concave then it admits a Poincaré constant $K \leq C\sqrt{\log d}$ for some $C > 0$ [8], and if the Kannan-Lovász-Simonovits isoperimetric conjecture is true then $K \leq C$. Finally, for uniformly log-concave measures, the optimal dependency on $\sqrt{d}$ is obtained in Theorem 3.4 [7] without any further assumptions.
Some insight on the conditions required to obtain this optimal dependency on the dimension in a more general case can be gained from Proposition 1.2 [13], which states that, if $X_1$ takes values in the lattice $h\mathbb{Z}^d$ with $h > 0$, then $\sqrt{n}\, W_2(\mu_n, \gamma)$ is asymptotically bounded from below by a quantity proportional to $h\sqrt{d}$. In particular, if $h$ is of order $\sqrt{d}$ then $\liminf_{n \to \infty} \sqrt{n}\, W_2(\mu_n, \gamma) \geq C d$. Therefore, if one wants $W_2(\mu_n, \gamma)$ to scale with the square root of the dimension, one must require $h$ to be independent of $d$, or $X_1$ not to be lattice-distributed. Such a result is not surprising in the light of known asymptotic results obtained in the univariate setting. Indeed, according to Theorem 1.2 [12], if $X_1$ takes values in $\{a + kh \mid k \in \mathbb{Z}\}$ for some $a \in \mathbb{R}$, $h > 0$ and has a finite moment of order $p + 2$ with $p \in \, ]1, 2]$, then the limit of $\sqrt{n}\, W_p(\mu_n, \gamma)$ admits an explicit expression (3), where $Z \sim \gamma$, $U$ is a uniform random variable on $[-1/2, 1/2]$ independent of $Z$ and $\|\cdot\|_p = E[\|\cdot\|^p]^{1/p}$. On the other hand, as long as $X_1$ is not lattice-distributed, one has the corresponding asymptotic identity (4). Furthermore, faster rates of convergence have been obtained for all $p \geq 1$ whenever the first moments of $\mu$ and $\gamma$ are equal and $\mu$ satisfies Cramér's condition [1].
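The lattice obstruction can likewise be observed numerically in dimension one. The following hedged sketch (Rademacher summands, hence lattice spacing $h = 2$; all parameters are our own choices) compares $\sqrt{n}\, W_2(\mu_n, \gamma)$ for a lattice and a continuous summand law; the lattice sequence stays bounded away from zero while the continuous one decays:

```python
import numpy as np

rng = np.random.default_rng(1)

def w2_empirical_1d(x, y):
    # 1-d W_2 between two equal-size empirical measures via sorted matching.
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

m = 100_000
z = rng.standard_normal(m)
results = {}
for n in (16, 64):
    # Lattice case: X_i = +/-1, so S_n lives on a lattice of spacing 2/sqrt(n).
    s_lat = rng.choice([-1.0, 1.0], size=(m, n)).sum(axis=1) / np.sqrt(n)
    # Continuous case: centered uniform with unit variance.
    s_cts = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(m, n)).sum(axis=1) / np.sqrt(n)
    results[n] = (np.sqrt(n) * w2_empirical_1d(s_lat, z),
                  np.sqrt(n) * w2_empirical_1d(s_cts, z))
for n, (lat, cts) in results.items():
    print(n, round(lat, 3), round(cts, 3))
```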
One can thus expect the rate of convergence for the Central Limit Theorem in Wasserstein distance in a high-dimensional setting to be determined not only by the moments of $X_1$ but also by whether the measure is lattice-distributed. In other words, along with the large-scale behaviour of $\mu$, described by its moments, we expect a tight bound on $W_p(\mu_n, \gamma)$ to include a term corresponding to the small-scale behaviour of $\mu$. In this work, we provide a first instance of such a result in the multidimensional setting. In particular, we obtain the following asymptotic bound.
Corollary 1. Let $p \geq 2$ and $X_1, \dots, X_n$ be i.i.d. centered random variables drawn from a measure $\mu$ on $\mathbb{R}^d$ with identity covariance matrix and finite moment of order $p + 2$. Suppose there exists $h > 0$ such that the matrix appearing in Theorem 1 is positive-definite. Then the announced bound holds, where $C > 0$ is a generic constant, $C_{d,p,\mu}$ is a constant depending on $d$, $p$ and $\mu$, and $Z$ is drawn from the $d$-dimensional standard normal distribution $\gamma$. Furthermore, if $\mu$ has a non-zero absolutely continuous component with respect to the Lebesgue measure, then the small-scale term in the bound is asymptotically negligible.

Whenever $\mu$ admits a non-zero continuous component and has a non-zero third moment, we conjecture our result to be asymptotically optimal, as it is a natural multidimensional generalization of (4). In particular, if $X_1$ has i.i.d. components, we recover the correct dependency on $\sqrt{d}$ by Lemma 3. Our bound is also asymptotically sharper than known existing bounds: using Lemma 3 and Lemma 6, we recover (1) in the asymptotic setting. In particular, this means that if $\|X_1\| \leq M$ almost surely, then a corresponding bound holds for any $p \geq 1$.
Remark that this bound scales at least with $d$, as $M$ must be of order at least $\sqrt{d}$. On the other hand, if $\mu$ admits a Stein kernel $\tau$ as defined in [9], combining Lemmas 3 and 7 gives a corresponding bound. Hence, following the work of [3], if $\mu$ admits a Poincaré constant $K \geq 1$, we can generalize (2) to all $p \geq 1$. Let us note that, asymptotically, this bound depends only on $\sqrt{p - 1}$, thus improving on the bound obtained in Theorem 1.2 [6], which scales with $p^2$, while lifting the requirement that $\mu$ be log-concave.
For lattice-distributed measures, our bound comes close to matching a multidimensional equivalent of (3) but still requires improvements. Moreover, obtaining the optimal rate of convergence for discrete but non-lattice-distributed random variables remains an open issue. In any case, let us note that the remainder term is likely sub-optimal.

Corollary 1 is derived from a non-asymptotic bound obtained in Theorem 1, which also deals with non-identically distributed random variables. Our result is derived through refinements of a variant of Stein's method used in [2], which might be of interest in other contexts.

Notations
Let $d$ be a positive integer. For any $k \in \mathbb{N}$, let $(\mathbb{R}^d)^{\otimes k}$ be the set of elements of the form $(x_j)_{j \in \{1,\dots,d\}^k} \in \mathbb{R}^{d^k}$. For $x \in \mathbb{R}^d$ and $k \in \mathbb{N}$, we denote by $x^{\otimes k}$ the element of $(\mathbb{R}^d)^{\otimes k}$ such that $\forall j \in \{1, \dots, d\}^k$, $(x^{\otimes k})_j = \prod_{i=1}^k x_{j_i}$.
For any $x, y \in (\mathbb{R}^d)^{\otimes k}$, we denote by $\langle x, y \rangle$ the Hilbert-Schmidt scalar product between $x$ and $y$, defined by
$$\langle x, y \rangle = \sum_{j \in \{1,\dots,d\}^k} x_j y_j,$$
and, by extension, we write $\|x\| = \sqrt{\langle x, x \rangle}$. Furthermore, for any $x \in (\mathbb{R}^d)^{\otimes (k+1)}$ and $y \in (\mathbb{R}^d)^{\otimes k}$, let $xy$ be the vector defined by $\forall i \in \{1, \dots, d\}$, $(xy)_i = \sum_{j \in \{1,\dots,d\}^k} x_{i,j} y_j$.
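As a hedged illustration of these conventions (the function names and the array layout are ours, not the paper's), the tensor power, the Hilbert-Schmidt scalar product and the contraction can be realized with NumPy arrays:

```python
import numpy as np

def tensor_power(x, k):
    # x^{otimes k}: the array in (R^d)^{otimes k} with entries prod_i x_{j_i}.
    out = np.array(1.0)
    for _ in range(k):
        out = np.multiply.outer(out, x)
    return out

def hs_dot(a, b):
    # Hilbert-Schmidt scalar product <a, b>: sum over all indices of a_j b_j.
    return float(np.sum(a * b))

def contract(x, y):
    # (xy)_i = sum_j x_{i,j} y_j for x in (R^d)^{otimes(k+1)}, y in (R^d)^{otimes k}.
    return np.tensordot(x, y, axes=y.ndim)

x = np.array([1.0, 2.0, 3.0])
t2 = tensor_power(x, 2)                  # the matrix x x^T
print(hs_dot(t2, t2))                    # <x^{o2}, x^{o2}> = ||x||^4 = 196
print(contract(tensor_power(x, 3), t2))  # = ||x||^4 x = [196, 392, 588]
```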
For any $k \in \mathbb{N}$, any function $\phi$ with partial derivatives of order $k$ and any $x \in \mathbb{R}^d$, we denote by $\nabla^k \phi(x) \in (\mathbb{R}^d)^{\otimes k}$ the $k$-th gradient of $\phi$ at $x$:
$$\forall j \in \{1, \dots, d\}^k, \quad (\nabla^k \phi(x))_j = \partial_{j_1} \cdots \partial_{j_k} \phi(x).$$
For any $k \in \mathbb{N}$, let $H_k$ be the $d$-dimensional Hermite polynomial, defined by
$$H_k(x) = (-1)^k \frac{\nabla^k f_\gamma(x)}{f_\gamma(x)},$$
where $f_\gamma$ denotes the density of $\gamma$. Finally, for any random variable $X$ on $\mathbb{R}^d$, we denote by $\|X\|_p$ the $L^p$-norm of $X$, that is $\|X\|_p = E[\|X\|^p]^{1/p}$.
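In dimension one, this definition reduces to the probabilists' Hermite polynomials, which can be checked numerically (a sanity sketch using NumPy's `hermite_e` module; the helper name is ours):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def he(k, x):
    # Probabilists' Hermite polynomial He_k(x), the one-dimensional case of
    # H_k(x) = (-1)^k (d/dx)^k f_gamma(x) / f_gamma(x).
    c = np.zeros(k + 1)
    c[k] = 1.0
    return hermeval(x, c)

x = 1.5
print(he(2, x), x**2 - 1)       # He_2(x) = x^2 - 1
print(he(3, x), x**3 - 3 * x)   # He_3(x) = x^3 - 3x
```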

Main Result
Let $n > 0$ and $W_1, \dots, W_n$ be independent centered random variables on $\mathbb{R}^d$ such that $W = \sum_{i=1}^n W_i$ has identity covariance matrix and $\max_{i \in \{1,\dots,n\}} \|E[W_i^{\otimes 2}]\| < 1$. We denote by $\nu$ the measure of $W$. For any $i \in \{1, \dots, n\}$, let $D_i = W_i' - W_i$, where $W_i'$ is an independent copy of $W_i$. We then define a set of features describing the large-scale behaviour of the variables $(W_i)_{1 \leq i \leq n}$ and, whenever the relevant matrix is positive-definite, an associated small-scale feature. Theorem 1 states that, if this matrix is positive-definite, then for any $q, r > p$ such that $\frac{1}{q} + \frac{1}{r} = \frac{1}{p}$, a non-asymptotic bound on $W_p(\nu, \gamma)$ holds.

In order to prove Corollary 1 from this result, we take $W_i = \frac{X_i}{\sqrt{n}}$ for all $i \in \{1, \dots, n\}$ and $\beta = \frac{h}{\sqrt{n}}$, and we assume $n$ is sufficiently large for $\beta$ to be small enough. In the following, we denote by $C$ a positive constant depending on properties of $\mu$ but independent of $n$. The first terms of the bound can be controlled directly and, since $X_1$ has a finite moment of order $p + 2$, we can use Theorem 6 from [2] to control the remainder, which concludes the proof whenever $\mu$ does not admit an absolutely continuous component with respect to the Lebesgue measure. If it does, let us denote by $\mu_c$ this continuous component. For any $h > 0$, there must exist a ball $B$ with radius $h$ and non-zero mass for $\mu_c$. Remark that the associated matrix must be positive-definite: otherwise, the dimension of the support of $\mu_c$ on this ball would be lower than $d$, which is impossible since $\mu_c$ is absolutely continuous with respect to the Lebesgue measure. Thus, $q$ must be finite. Therefore, for $n$ sufficiently large, we can choose $h_n$ such that the corresponding small-scale term vanishes in the limit, which yields the desired result.
Remark 1. Note that we restricted ourselves to the existence of a moment of order $p + 2$ for the summands in order to simplify computations. One could instead assume only the existence of a moment of order $p + l$ with $l < 2$ and obtain the rate $o\left(n^{-1/2 + 1/p - l/2p} \log(n)^{-l/2p}\right)$ in the i.i.d. case, which would slightly improve on Theorem 6 [2], in which the rate $O\left(n^{-1/2 + 1/p - l/2p}\right)$ was obtained. Our approach would also be able to deal with varying moment assumptions, in which each variable $W_i$ admits a finite moment of order $p + l_i$, for non-identically distributed summands.

Diffusion interpolation approach
Let $p > 0$ and $W$ be a random variable drawn from a measure $\nu$ on $\mathbb{R}^d$. In the following, we assume $\nu$ admits a density $h$ with respect to the Gaussian measure which is bounded and has bounded gradient. These additional assumptions can later be lifted to obtain Theorem 1 using approximation arguments similar to those developed in Section 8 [2]. Let $t > 0$ and let us consider the random variable
$$F_t := e^{-t} W + \sqrt{1 - e^{-2t}} Z,$$
where $Z$ is a random variable drawn from the $d$-dimensional standard Gaussian measure $\gamma$ and independent of $W$. We denote by $\nu_t$ the measure of $F_t$. Due to our assumptions on $h$, $\nu_t$ admits a smooth density $h_t$ with respect to $\gamma$. We can thus consider the score function $\rho_t$ of $F_t$. Then, by Equation (3.8) [9], $W_p(\nu, \gamma)$ is controlled by an integral in $t$ of $\|\rho_t\|_p$. We are thus left with bounding $\|\rho_t\|_p$ for all $t \geq 0$.
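As a quick numerical sanity check (an illustrative sketch, not part of the argument), the interpolation $F_t$ preserves unit variance for any centered unit-variance $W$, moving from the law of $W$ at $t = 0$ to the Gaussian as $t \to \infty$:

```python
import numpy as np

rng = np.random.default_rng(2)

m = 100_000
# Any centered law with unit variance for W; the uniform is an arbitrary choice.
w = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=m)
z = rng.standard_normal(m)

variances = {}
for t in (0.1, 1.0, 3.0):
    # F_t = e^{-t} W + sqrt(1 - e^{-2t}) Z with W and Z independent.
    f_t = np.exp(-t) * w + np.sqrt(1.0 - np.exp(-2.0 * t)) * z
    variances[t] = f_t.var()
    print(t, round(variances[t], 3))
```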
One can first remark that this score function admits an explicit representation in terms of the conditional expectation $E[Z \mid F_t]$ (see e.g. Lemma IV.1 [10]), where $\Delta(t) := e^{2t} - 1$. A first, somewhat trivial, bound on $\|\rho_t\|_p$ can then be obtained by applying Jensen's and the triangle inequalities. Note that this bound can still be nearly optimal for small values of $t$: indeed, whenever $W$ takes values in $h\mathbb{Z}^d$, it is close to sharp for small enough values of $t \ll h$. However, for continuous measures $\nu$ or for higher values of $t$, it is usually possible to obtain better bounds on $\|\rho_t\|_p$. For instance, (1) is obtained by combining (7) with another bound on $\|\rho_t\|_p$ which holds for large values of $t$. A similar approach was used in [5] to provide quantitative results for normal approximation in various frameworks such as Wiener chaos or homogeneous sums. In this work, we refine this approach by using three different bounds: (7) for small values of $t$, a bound for medium values of $t$ highlighting the small-scale behaviour of the measure $\nu$, and a last bound for larger values of $t$ which depends on the large-scale structure of $\nu$ through its moments.

Small times
Let $p \geq 2$ and let $W = \sum_{i=1}^n W_i$, where the $(W_i)_{1 \leq i \leq n}$ are centered and independent random variables on $\mathbb{R}^d$ with finite moment of order $p$. If $E[W^{\otimes 2}] = I_d$, there exists $C > 0$ such that the bound (8) holds. Indeed, since the $(W_i)_{1 \leq i \leq n}$ are independent and centered, we can use Rosenthal's inequality (see Lemma 2) to obtain a first bound; on the other hand, Lemma 3 provides a second one. Injecting these bounds into (7) then yields (8).

Medium times
When looking at (7), we can see that, for small values of $t$, the main contributor to this bound is $\|E[Z \mid F_t]\|_p / \sqrt{\Delta(t)}$. In the previous Section, we upper bounded this quantity somewhat crudely using Jensen's inequality. In this Section, we establish a sharper bound on $\|\rho_t\|_p$ by proving a variant of Proposition 6.1 [5] leveraging the small-scale features of $W$. We start by covering the more general exchangeable pairs framework, a standard framework for applying Stein's method, before tackling the specific Central Limit Theorem case.

Exchangeable pairs framework
Proposition 1. Let $p \geq 2$ and $(W, W')$ be a pair of random variables on $\mathbb{R}^d$ such that $(W, W')$ and $(W', W)$ follow the same law. For any $t \geq 0$, let $\eta_p(t) = \Delta(t)/(p-1)$ and define the quantities appearing in the statement accordingly; then the announced bound on $\|\rho_t\|_p$ holds. The proof of this result mostly follows the proof of Proposition 6.1 [5].
Proof. Let $0 < s < t$. A small modification of Lemma 6.5 [4] (see also the proof of Lemma 1) gives a first identity; using (6) along with the triangle inequality then yields a bound. Then, since $Z$ and $W$ are independent, Jensen's inequality controls the conditional term and Lemma 3 controls the Gaussian term. Since $\Gamma_s$ is positive-definite, the higher-order terms are bounded for any $k \geq 3$, and we obtain that there exists $C > 0$ such that the stated bound holds. Finally, one can remark that, by definition of $\Gamma_s$, $E[\Gamma_s] = I_d$, concluding the proof.

Sum of independent variables
Proposition 2. Let $W = \sum_{i=1}^n W_i$, where the $(W_i)_{1 \leq i \leq n}$ are independent random variables on $\mathbb{R}^d$ with finite moment of order $p \geq 2$. For any $i \in \{1, \dots, n\}$ and $\beta > 0$, let $D_{i,\beta} = (W_i' - W_i) \mathbb{1}_{\|D_i\| \leq \beta}$, where $W_i'$ is an independent copy of $W_i$. Suppose there exists $\beta > 0$ such that the matrix $\Lambda_\beta$ is positive-definite. Then, for any $t$ such that $\Delta(t) \geq (p-1)\beta^2$, there exists $C > 0$ such that the announced bound holds, with the quantities indexed by $q \geq 0$ defined accordingly.

In the following, we denote by $C$ a generic positive constant. Let $s$ be such that $\Delta(s) = (p-1)\beta^2$ and let $t > s$. Let $W' = W + (W_I' - W_I)$, where $I$ is a uniform random variable on $\{1, \dots, n\}$. Since $(W, W')$ and $(W', W)$ follow the same law, we can apply Proposition 1 with $\Lambda_s = n\Lambda_\beta$. First, following the proof of (8), we obtain a first bound. Then, by definition of $D_s$ and since $I$ is independent of $W$, Jensen's inequality controls the small-scale term. Let $i \in \{1, \dots, n\}$. Since $W_i'$ and $W_i$ are independent, we can apply Rosenthal's inequality (see Lemma 2); similarly, since $\|D_{i,\beta}\| \leq \beta \leq \sqrt{\Delta(t)/(p-1)}$, we can use the Cauchy-Schwarz inequality to bound the remaining terms, which concludes the proof.

Large times
Finally, we are left with bounding $\|\rho_t\|_p$ for "large" values of $t$ using the large-scale features of the $(W_i)_{1 \leq i \leq n}$. In practice, we improve on Proposition 6.1 [5]. However, while this result was derived in the general exchangeable pairs framework, our improvements require dealing with sums of independent random variables and are thus restricted to the Central Limit Theorem case.
Proposition 3. Suppose $W = \sum_{i=1}^n W_i$, where the $(W_i)_{1 \leq i \leq n}$ are centered independent random variables on $\mathbb{R}^d$ with finite moment of order $p + 2$ such that $E[W^{\otimes 2}] = I_d$. There exists $C > 0$ such that, for any $p < q \leq p + 2$ and $r$ verifying $\frac{1}{q} + \frac{1}{r} = \frac{1}{p}$, and any $t$ such that $\Delta(t) > (p - 1) \max_{i \in \{1,\dots,n\}} \|E[W_i^{\otimes 2}]\|$, the announced bound holds, with the quantities $\xi_i(t)$ defined accordingly.

For any $i \in \{1, \dots, n\}$ and any $t > 0$, let $D_{i,t} = D_i \mathbb{1}_{\|D_i\|^2 \leq \eta_p(t) \xi_i(t)}$. Let us first rewrite $\rho_t$ with the help of the following result.

Lemma 1. For any $i \in \{1, \dots, n\}$, the relevant conditional quantity admits the stated representation.

Proof. Let $i \in \{1, \dots, n\}$ and let $\phi$ be a smooth test function. Since the relevant map $\Phi$ is real analytic (see e.g. Lemma 1 [2] or Lemma 6.4 [5]), we can expand it. Thus, by performing multiple integrations by parts with respect to the Gaussian measure (see e.g. Equation (16) [2]), we obtain the desired identity. Finally, since $W_i$ and $W_i'$ are independent and identically distributed, the symmetrized form follows, concluding the proof.
We are now ready to start the proof of Proposition 3. Using Lemma 1, we can rewrite $\rho_t$. Thus, combining the triangle inequality, Jensen's inequality and Lemma 3, we obtain a decomposition of $\|\rho_t\|_p$ into several terms. First, by Lemmas 4 and 5, the leading terms are controlled for any $q > p$ and $r$ such that $\frac{1}{q} + \frac{1}{r} = \frac{1}{p}$. Let $\bar{D}_{i,t} = D_i \mathbb{1}_{\|D_i\|^2 \geq \eta_p(t) \xi_i(t)}$. In order to deal with $A(t)$, let us first remark that, since $E[W_i] = 0$ and since $W_i'$ and $W_i$ are independent, the corresponding terms can be recentered.
Then, viewing $A(t)$ as the $p$-norm of an infinite-dimensional vector, we can apply Rosenthal's inequality (see Lemma 2). Let us conclude the proof by bounding the resulting quantities.

Bounding the first term
Let $i \in \{1, \dots, n\}$. First, let us note that $W_i$ and $W_i'$ are independent, which allows controlling the first factor. The remaining factors are bounded similarly and, for any $k \geq 3$, the higher-order terms are controlled by definition of $\xi_i(t)$, yielding the desired bound on the first term.

Bounding the last two terms
Let $i \in \{1, \dots, n\}$ and $q \in [2, p]$. First, by Jensen's inequality and by definition of $\bar{D}_{i,t}$, we obtain a first control. Let $W_i''$ and $\bar{D}_{i,t}'$ be conditionally independent copies of $W_i'$ and $\bar{D}_{i,t}$ with respect to $W_i$. Using the Cauchy-Schwarz inequality and, for any $k \geq 3$, the definition of $\xi_i(t)$, we combine these bounds and conclude by Jensen's inequality.

Technical lemmas
Lemma 2 (Rosenthal's inequality, Theorem 5.2 [11]). There exists $C > 0$ such that, for any $p \geq 2$ and any independent random variables $(U_i)_{1 \leq i \leq n}$ with finite moment of order $p$ taking values in a Hilbert space $H$, $\|\sum_{i=1}^n U_i\|_p$ is controlled in terms of $(\sum_{i=1}^n \|U_i\|_2^2)^{1/2}$ and $(\sum_{i=1}^n \|U_i\|_p^p)^{1/p}$, where, for any random variable $X$ taking values in $H$ and any $q > 0$, $\|X\|_q = E[\|X\|_H^q]^{1/q}$.

Proof. The inequality of Theorem 5.2 [11] gives a first bound. Now, for $q \in [2, p]$, combining the triangle and Jensen's inequalities yields the stated form, concluding the proof.
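As an illustrative Monte Carlo check of a Rosenthal-type inequality in the scalar case (the two-term right-hand side below is the classical form with illustrative constants, not necessarily the exact statement of Theorem 5.2 [11]; the summand law is our own choice):

```python
import numpy as np

rng = np.random.default_rng(3)

# Compare || sum_i U_i ||_p with a classical two-term Rosenthal-type
# right-hand side, for centered Uniform[-1, 1] summands.
p, n, m = 4, 50, 100_000
u = rng.uniform(-1.0, 1.0, size=(m, n))
lhs = np.mean(np.abs(u.sum(axis=1)) ** p) ** (1 / p)
var_term = np.sqrt(n / 3.0)      # (sum_i E[U_i^2])^{1/2}, with E[U^2] = 1/3
mom_term = (n / 5.0) ** (1 / p)  # (sum_i E[|U_i|^4])^{1/4}, with E[U^4] = 1/5
rhs = np.sqrt(p) * var_term + p * mom_term
print(round(lhs, 3), round(rhs, 3))
```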
Lemma 3 (Lemma 3 [2]). Let $Z$ be a $d$-dimensional standard normal random variable. For any $p \geq 2$, $k \in \mathbb{N}$ and $M \in (\mathbb{R}^d)^{\otimes (k+1)}$, the stated moment bound holds.

Lemma 4. Let $X$, $Y$ and $Z$ be three random variables on $\mathbb{R}^d$ such that $Z$ is drawn from the Gaussian measure $\gamma$ and is independent of $(X, Y)$. Let $q > p \geq 2$ and suppose that $X$ and $Y$ have finite moments of order $q$. Then, for any $k \geq 0$ and any $i \in \{1, \dots, d\}^k$, the stated bound holds, where $C > 0$ is a generic constant, $H_i = (H_k)_i$ and $r$ is such that $\frac{1}{r} + \frac{1}{q} = \frac{1}{p}$.

Proof. Let $\epsilon = Y - X$. Let us denote the density of $Z$ by $f_\gamma$ and the measure of $(Y, \epsilon)$ by $\mu$. For any $t \in [0, 1]$, define the relevant interpolated quantity, where $\nabla_i \cdot = (\nabla^k \cdot)_i$. By definition of the conditional expectation, and since
$$H_{i+1}(x) = (-1)^{k+1} \frac{\nabla \nabla_i f_\gamma(x)}{f_\gamma(x)},$$
letting $G_t = Y + Z + t\epsilon$ and applying the triangle inequality along with the Cauchy-Schwarz, Hölder's and Jensen's inequalities, followed by Lemma 3, yields the result, concluding the proof.
Lemma 5. Let $Y$ and $Z$ be two independent standard normal random variables on $\mathbb{R}^d$. Then, for any $k \geq 1$ and any $\alpha > 0$, the stated identity holds.

Proof. Let $\phi$ be a smooth function with compact support. Performing multiple integrations by parts with respect to the Gaussian measure (see e.g. Equation (16) [2]) yields the result, concluding the proof.