Behavior of the empirical Wasserstein distance in R^d under moment conditions

We establish some deviation inequalities, moment bounds and almost sure results for the Wasserstein distance of order p ∈ [1, ∞) between the empirical measure of independent and identically distributed R^d-valued random variables and the common distribution of the variables. We only assume the existence of a (strong or weak) moment of order rp for some r > 1, and we discuss the optimality of the bounds.


Introduction and notations
We begin with some notation that will be used throughout the paper. Let X_1, ..., X_n be n independent and identically distributed (i.i.d.) random variables with values in R^d, with common distribution µ. Let µ_n be the empirical distribution of the X_i's, that is

µ_n = (1/n) Σ_{i=1}^n δ_{X_i} .

Let X denote a random variable with distribution µ. For any x ∈ R^d, let |x| = max{|x_1|, ..., |x_d|}. Define then the tail of the distribution µ by

H(t) = P(|X| > t) , t ≥ 0.

As usual, for any q ≥ 1, the weak moment of order q of the random variable X is defined by

‖X‖_{q,w}^q := sup_{t > 0} t^q H(t) ,

and the strong moment of order q ≥ 1 is defined by

‖X‖_q^q := E(|X|^q) = q ∫_0^∞ t^{q−1} H(t) dt .

For p ≥ 1, the Wasserstein distance of order p between two probability measures ν_1, ν_2 on (R^d, B(R^d)) is defined by

W_p(ν_1, ν_2) = ( inf_{π ∈ Π(ν_1, ν_2)} ∫ |x − y|_2^p π(dx, dy) )^{1/p} ,

where |·|_2 is the Euclidean norm on R^d and Π(ν_1, ν_2) is the set of probability measures on the product space (R^d × R^d, B(R^d) ⊗ B(R^d)) with marginals ν_1 and ν_2.
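In dimension d = 1, the definition above becomes fully explicit for two empirical measures with the same number of atoms: the monotone (order-statistics) coupling is optimal, so W_p^p is the average of the p-th powers of the gaps between order statistics. The following snippet is a numerical illustration only; the function name and the data are ours, not part of the paper.

```python
import numpy as np

def wasserstein_p_empirical(x, y, p=1.0):
    """W_p^p between two empirical measures with the same number of atoms.
    In d = 1, the optimal coupling matches order statistics."""
    xs, ys = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    if xs.shape != ys.shape:
        raise ValueError("samples must have the same size")
    return np.mean(np.abs(xs - ys) ** p)

# two four-point empirical measures
x = [0.0, 2.0, 5.0, 7.0]
y = [1.0, 3.0, 4.0, 8.0]
print(wasserstein_p_empirical(x, y, p=1))  # mean of |0-1|, |2-3|, |5-4|, |7-8| = 1.0
```

For unequal sample sizes or d > 1 an optimal-transport solver is needed; the sorted matching above is optimal only in the equal-size one-dimensional case.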
In this paper, we prove deviation inequalities, moment inequalities and almost sure results for the quantity W_p(µ_n, µ), when X has a weak or strong moment of order rp for some r > 1. As in [13], the upper bounds differ according to whether p > d min{(r − 1)/r, 1/2} (small dimension case) or p < d min{(r − 1)/r, 1/2} (large dimension case). Most of the proofs are based on Lemma 6 in [13] (see inequality (6.4) in Section 6), which may be seen as an extension of Èbralidze's inequality [12] to the case d > 1. Hence we shall use the same approach as in [8], where we combined Èbralidze's inequality with truncation arguments to get moment bounds for W_p(µ_n, µ) when d = 1.
There are many ways to see that the upper bounds obtained in the present paper are optimal in some sense, by considering the special cases d = 1, p = 1, p = 2, or by following the general discussion in [13], and we shall comment on this question throughout the paper. However, the optimality for large d is only a form of minimax optimality: one can see that the rates are exact for compactly supported measures that are not singular with respect to the Lebesgue measure on R^d (by using, for instance, Theorem 2 in [9]).
In fact, since the rates depend on the dimension d, it is easy to see that they cannot be optimal for all measures: for instance, the rates will be faster than announced if the measure µ is supported on a linear subspace of R^d of dimension strictly less than d. This is of course not the end of the story, and the problem can be formulated in the general context of metric spaces (X, δ). For instance, for compactly supported measures, Boissard and Le Gouic [7] proved that the rates of convergence depend on the behavior of the metric entropy of the support of µ (with an extension to non-compact supports in their Corollary 1.3). In the same context, Bach and Weed [2] obtained sharper results by generalizing some ideas going back to Dudley ([11], case p = 1). They introduced the notion of Wasserstein dimension d*_p(µ) of the measure µ, and proved that n^{p/s} E(W_p^p(µ_n, µ)) converges to 0 for any s > d*_p(µ) (with sharp lower bounds in most cases).
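The first sentence of this discussion can be illustrated numerically. For samples supported on a one-dimensional subspace of R^2, the Euclidean transport cost coincides with the cost of the one-dimensional problem in the coefficients, so the effective dimension is 1 rather than the ambient d = 2. This sketch is ours (the direction u and the coefficient samples are arbitrary choices):

```python
import numpy as np

# Samples supported on the line spanned by u in R^2: the Euclidean distance
# between x = t u and y = s u is |t - s| (u is a unit vector), so the 2-D
# transport problem collapses to the 1-D one, solved by monotone matching.
u = np.array([3.0, 4.0]) / 5.0          # unit vector in R^2
t = np.array([0.0, 1.0, 3.0])           # coefficients of the first sample
s = np.array([1.0, 2.0, 5.0])           # coefficients of the second sample

# 1-D Wasserstein cost of order p = 1 between the coefficient measures
w1_line = np.mean(np.abs(np.sort(t) - np.sort(s)))

# the same matching, evaluated with the Euclidean norm in R^2
cost_2d = np.mean(np.linalg.norm(np.outer(t, u) - np.outer(s, u), axis=1))

print(w1_line, cost_2d)  # both approximately (1 + 1 + 2)/3
```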
Note that our setting and that of Bach and Weed are clearly distinct: we consider measures on R^d having only a finite moment of order rp for some r > 1, while they consider measures on compact metric spaces. However, the Wasserstein dimension is well defined for any probability measure (thanks to Prohorov's theorem), and some arguments in [2] are shared with [9] and [13]. A reasonable question is then: in the case of a singular measure on R^d, are the results of the present paper still valid if we replace the dimension d by any d′ ∈ (d*_p(µ), d]?

The paper is organized as follows: in Section 2 we state some deviation inequalities for W_p(µ_n, µ) under weak moment assumptions. In Section 3 we bound the probability of large and moderate deviations. In Section 4 we present some almost sure results, and in Section 5 we give some upper bounds for the moments of order r of W_p^p(µ_n, µ) (von Bahr-Esseen and Rosenthal type bounds) under strong moment assumptions. The proofs are given in Section 6.
Throughout the paper, we shall use the notation f(n, µ, x) ≪ g(n, µ, x), meaning that there exists a positive constant C, not depending on n, µ and x, such that f(n, µ, x) ≤ C g(n, µ, x) for all positive integers n and all positive reals x.

Deviation inequalities under weak moments conditions
In this section, we give some upper bounds on the quantity P(W_p^p(µ_n, µ) > x) when the random variables X_i have a weak moment of order rp for some r > 1. We first consider the case where r ∈ (1, 2).

Theorem 2.1. If ‖X‖_{rp,w} < ∞ for some r ∈ (1, 2), then for any x > 0, where log_+(x) = max{0, log x}.
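To make the weak-moment condition concrete: for a Pareto-type tail H(t) = min(1, t^{−β}), the functional sup_{t>0} t^q H(t) stays bounded exactly when q ≤ β. The following quick grid check is ours and purely illustrative (the grid and the tail function are arbitrary choices):

```python
import numpy as np

def weak_moment_q(H, q, t_grid):
    """Grid approximation of ||X||_{q,w}^q = sup_t t^q H(t)."""
    t = np.asarray(t_grid, float)
    return np.max(t ** q * H(t))

# Pareto-type tail: H(t) = min(1, t^{-3}), i.e. weak moments of every order q <= 3
H = lambda t: np.minimum(1.0, t ** -3.0)
t = np.linspace(0.01, 100.0, 100000)

print(weak_moment_q(H, 2.0, t))   # bounded: the analytic supremum is 1, attained at t = 1
print(weak_moment_q(H, 3.5, t))   # grows with the grid endpoint: no weak moment of order 3.5
```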
Remark 2.1. As will be clear from the proof, the upper bounds of Theorem 2.1 still hold if the quantity . According to the discussion after Theorem 1 in [13], if p ≠ d(r − 1)/r, one can always find some measure µ for which the rates of Theorem 2.1 are reached (see example (e) in [13] for p > d(r − 1)/r and example (c) in [13] for p < d(r − 1)/r).
We now consider the case where r > 2. We follow the approach of Fournier and Guillin [13], but we use a different upper bound for the quantity controlled in their Lemma 13 (see the proof of Theorem 2.2 for more details).

Theorem 2.2. If ‖X‖_{rp,w} < ∞ for some r ∈ (2, ∞), then, for any x > 0 and any q > r, for some positive constants C, c depending only on p and d, and a positive constant A depending only on p, d and r.
Remark 2.2. Let us compare our inequality with that of Theorem 2 in Fournier and Guillin [13] (under the moment condition (3) in [13]). We first note that the inequality in [13] is stated under a strong moment of order rp for r > 2, but their proof also works under a weak moment of order rp. Hence, under the assumptions of our Theorem 2.2, Fournier and Guillin obtained the following bound (we assume here that ‖X‖_{rp,w} = 1 for the sake of simplicity): for any ε > 0 (the constant implicitly involved in the inequality depending on ε). In particular, one cannot infer from (2.1) the decay at rate x^{−r} that follows from our Theorem 2.2.

Large and moderate deviations
We consider here the probability of moderate deviations, that is, for α ≤ 1 in a certain range. As usual, the case α = 1 corresponds to the probability of large deviations.
As for partial sums, we shall establish two types of results, under weak moment conditions or under strong moment conditions. If the random variables have a weak moment of order rp for some r > 1, the results of Subsection 3.1 are immediate corollaries of the theorems of the preceding section. By contrast, the Baum-Katz type results of Subsection 3.2 cannot be derived from the results of Section 2 and will be proved in Subsection 6.4.

Weak moments
As a consequence of Theorem 2.1, we obtain the following corollary.
Remark 3.2. Let us now comment on the case p = 2, d = 1. In that case, del Barrio et al. [5] proved that, if the distribution function F of X is twice differentiable and if F′ ∘ F^{−1} is regularly varying in the neighborhood of 0 and 1, then there exists a sequence of positive numbers v_n tending to ∞ as n → ∞ such that v_n W_2^2(µ_n, µ) converges in distribution to a non-degenerate limit. For instance, it follows from their Theorem 4.7 that, if X is a positive random variable, F is twice differentiable and F(t) = 1 − t^{−β} for any t > t_0 and some β > 2, then n^{(β−2)/β} W_2^2(µ_n, µ) converges in distribution to a non-degenerate limit. In that case, X has a weak moment of order β, and, for β ∈ (2, 4), the first inequality of Corollary 3.1 applied with r = β/2 and α = 1/r gives lim sup Hence, in the case where β ∈ (2, 4), our result is consistent with that of [5], and holds without assuming any regularity on F.
As a consequence of Theorem 2.2, we obtain the following corollary.

Baum-Katz type results
We first consider the case where the variables have a strong moment of order rp for r ∈ (1, 2).
Theorem 3.3. If ‖X‖_{rp} < ∞ for some r ∈ (1, 2), then for any x > 0,

Remark 3.3. Our proof does not allow us to deal with the case where p = d(r − 1)/r. As an interesting consequence of Theorem 3.3, we shall obtain almost sure convergence rates for the sequence W_p^p(µ_n, µ) (see Corollary 4.1 in the next section).
We now consider the case where the variables have a strong moment of order rp for r > 2.

Almost sure results
Using well-known arguments, we derive from Theorem 3.3 the following almost sure rates of convergence for the sequence W_p^p(µ_n, µ) (taking α = 1/r in the case where p > d(r − 1)/r, and applying the third item in the case where p < d(r − 1)/r).
Remark 4.1. Let us comment on these almost sure results in the case where p = 1 and d < r/(r − 1).
Starting from the dual definition of W_1(µ_n, µ), we get that Now, by the classical Marcinkiewicz-Zygmund theorem (see [15]) for i.i.d. random variables, we know that if and only if ‖X‖_r < ∞. It follows that, for p = 1, the rates of Corollary 4.1 are optimal in the case where d < r/(r − 1).
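The mechanism behind this comparison can be checked numerically: the dual (Kantorovich-Rubinstein) definition of W_1 bounds |∫ f dµ_n − ∫ f dµ| for every 1-Lipschitz f, and the choice f(x) = x produces the centered partial sum to which the Marcinkiewicz-Zygmund theorem applies. A toy check, with data and variable names that are ours:

```python
import numpy as np

# Two empirical measures on R with equal sample sizes (hypothetical data)
x = np.array([0.0, 1.0, 4.0, 9.0])
y = np.array([1.0, 2.0, 3.0, 10.0])

# d = 1: W_1 between equal-size empirical measures is the mean gap of order statistics
w1 = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Kantorovich-Rubinstein duality: for any 1-Lipschitz f,
# |integral of f d(mu_n) - integral of f d(nu_n)| <= W_1; take f(x) = x.
mean_gap = abs(x.mean() - y.mean())

print(mean_gap, w1)  # the gap between the means never exceeds W_1
```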
We now give some almost sure rates of convergence in the case where

Remark 4.2. In the case p = 1, d = 1, it follows from the central limit theorem for W_1(µ_n, µ) (see [4]) and from Theorem 10.12 in [14] that the sequence

Moment inequalities
In this section, we give some upper bounds for the moments ‖W_p^p(µ_n, µ)‖_r^r when the variables have a strong moment of order rp.
As will be clear from the proofs, maximal versions of these inequalities hold, meaning that the quantity (W_p^p(µ_n, µ))^r can be replaced by in all the statements of this section.

Moments of order 1 and 2
Theorem 5.1.
where the constant implicitly involved does not depend on M .
Remark 5.1. In particular, if and p ≠ d(r − 1)/r, we easily infer from Theorem 5.1 that which can also be deduced from Theorem 2.1. If p = d(r − 1)/r, we get Finally, if ∫_0^∞ t^{p−1} H(t) dt < ∞, the rates in the cases p > d/2 and p = d/2 can be slightly improved (taking q = 2 and M = ∞ in Theorem 5.1); this can be directly deduced from Theorem 5.2 below. Note that all these bounds are consistent with those given in Theorem 1 of [13], and slightly more precise in terms of the moment conditions. Hence, the discussion on the optimality of the rates in [13] is also valid for our Theorem 5.1 (see Remark 5.3 below).

Remark 5.2. For p < d/2 and ‖X‖_q < ∞ for some q > dp/(d − p), it follows from Theorem 2(ii) in [9] that lim inf_{n→∞} n^{p/d} ‖W_p^p(µ_n, µ)‖_1 > 0 if µ has a non-degenerate absolutely continuous part with respect to the Lebesgue measure, and that lim sup_{n→∞} n^{p/d} ‖W_p^p(µ_n, µ)‖_1 = 0 if µ is singular. Still for p < d/2, we refer to the paper [2], which shows that, for compactly supported singular measures, the rates of convergence of ‖W_p^p(µ_n, µ)‖_1 can be much faster than n^{−p/d}.

Remark 5.3. According to the discussion after Theorem 1 in [13], if p ≠ d/2, one can always find some measure µ for which the rate of Theorem 5.2 is reached (see examples (a) and (b) in [13] for p > d/2 and example (c) in [13] for p < d/2). Note also that, in the case d = 1, p = 1, del Barrio et al. [4] proved that , which is consistent with the first inequality of Theorem 5.2. For d = 1, p > 1, we refer to the paper by Bobkov and Ledoux [6] for some conditions on µ ensuring faster rates of convergence. Finally, when p = 1, d = 2 and µ is the uniform measure over [0, 1]^2, Ajtai et al. [1] proved that E(W_1(µ_n, µ)) is exactly of order (log n/n)^{1/2}, while we get a rate of order log n/√n, which is therefore suboptimal in that particular case.
Remark 5.4. For d = 1, the first inequality of Theorem 5.3 has been proved in [8]. Our proof does not allow us to deal with the case where p = d(r − 1)/r. However, in that case, it is easy to see that (same proof as for the second inequality of Theorem 5.2). For p = 1 and d < r/(r − 1), using the dual definition of W_1(µ_n, µ), we get the upper bound where Λ_1 is the set of functions f such that |f(x) − f(y)| ≤ |x − y|. Note that (5.1) may be seen as a uniform version of the von Bahr-Esseen inequality (see [3]) over the class Λ_1.

Rosenthal type inequalities
We consider now the case where r > 2.
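For the reader's convenience, we recall the classical Rosenthal inequality underlying this section, in its standard form (the precise constants used in the proofs are those of [16]): for independent centered random variables ξ_1, ..., ξ_n and q ≥ 2,

```latex
\mathbb{E}\Bigl|\sum_{i=1}^n \xi_i\Bigr|^q
  \le C(q)\Biggl(\sum_{i=1}^n \mathbb{E}|\xi_i|^q
  + \Bigl(\sum_{i=1}^n \mathbb{E}\,\xi_i^2\Bigr)^{q/2}\Biggr).
```

The two terms on the right-hand side are exactly the two regimes that appear in the bounds of Theorem 5.4 below.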
Theorem 5.4. If ‖X‖_{rp} < ∞ for some r > 2, then where, for the second inequality, γ can be taken as γ = ε(2p − d)/(d(r − 2 + ε)) for any ε > 0 (and the constants implicitly involved in the inequality depend on ε).

Remark 5.5. For d = 1, the first inequality of Theorem 5.4 has been proved in [8]. As a consequence of the first two inequalities of Theorem 5.4, we obtain that, if p > d/2, As a consequence of the third inequality of Theorem 5.4, we obtain that, if p = d/2,

Proofs
The starting point of the proofs is Lemmas 5 and 6 in [13], which we recall below. For ℓ ≥ 0, let P_ℓ be the natural partition of (−1, 1]^d into 2^{dℓ} translates of (−2^{−ℓ}, 2^{−ℓ}]^d. Let also For a set F ⊂ R^d and a > 0, we use the standard notation aF = {ax : x ∈ F}. For a probability measure ν on R^d and m ≥ 0, let R_{B_m} ν be the probability measure on (−1, 1]^d defined as the image of ν|_{B_m}/ν(B_m) by the map x → x/2^m. For two probability measures µ and ν on R^d, by Lemma 5 in [13], there exists a positive constant κ_{p,d} depending only on p and d such that where In addition, by Lemma 6 in [13], From the considerations above, there exists a constant C depending only on p and d such that where This inequality may be seen as an extension to the case d > 1 of Èbralidze's inequality [12], which we used in [8] to obtain moment bounds for W_p^p(µ_n, µ) when d = 1.
As in [8], we shall use truncation arguments. For a positive real M, let With these notations, it follows from (6.4) that For the proofs, we shall follow the order of the theorems, except for Theorem 5.3, whose proof comes naturally after those of Theorems 2.1 and 2.2.
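As a concrete illustration of the multiscale quantity appearing in Lemma 5 of [13], the following sketch (ours; d = 1, measures supported on (−1, 1], truncated at a finite depth, with a half-open-bin convention that differs harmlessly from (−2^{−ℓ}, 2^{−ℓ}] for generic atoms) computes Σ_ℓ 2^{−pℓ} Σ_{F ∈ P_ℓ} |µ(F) − ν(F)| for two empirical measures:

```python
import numpy as np

def dyadic_functional(x, y, p=1.0, depth=12):
    """Truncated version of sum_l 2^{-pl} sum_{F in P_l} |mu(F) - nu(F)|
    for two empirical measures on (-1, 1] (d = 1); P_l has 2^l cells."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    total = 0.0
    for l in range(depth + 1):
        edges = np.linspace(-1.0, 1.0, 2 ** l + 1)  # 2^l cells of length 2^{1-l}
        cx, _ = np.histogram(x, bins=edges)
        cy, _ = np.histogram(y, bins=edges)
        total += 2.0 ** (-p * l) * np.sum(np.abs(cx / x.size - cy / y.size))
    return total

print(dyadic_functional([-0.6], [0.6]))            # close to 2: the atoms disagree at every level l >= 1
print(dyadic_functional([0.3, -0.7], [0.3, -0.7]))  # 0.0 for identical samples
```

Up to the constant κ_{p,d} of Lemma 5, this functional dominates W_p^p for measures on (−1, 1]^d, which is the mechanism exploited throughout the proofs.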

Proof of Theorem 2.2
Let r > 2. Note first that, by homogeneity, the general inequality may be deduced from the case where ‖X‖_{rp,w} = 1 by considering the variables X_i/‖X‖_{rp,w}. Hence, from now on, we shall assume that ‖X‖_{rp,w} = 1.
According to the beginning of the proof of Theorem 2 in [13], we get that for some positive constant C = C_{p,d}, where the random variable V_n^p is such that Consequently, it remains to bound the quantity For a positive real M, let With these notations, + P(B*_{p,M}(µ_n, µ) > x/(4C)). (6.20) Let y = x/(4C). By Markov's inequality at orders q > 2 and 1, y^q, (6.21) Applying Rosenthal's inequality, we get Choosing q > r, it follows that n^{(q−1)/q}. (6.23) The last inequality holds since we assumed that sup_{t>0} t^{rp} H(t) = 1. On the other hand, Gathering (6.20)-(6.24), we get that, for any q > r, (6.25) Hence, choosing M = n^{1/p} x^{1/p}, we infer from (6.18), (6.19) and (6.25) that, for any q > r, which is the desired inequality when sup_{t>0} t^{rp} H(t) = ‖X‖_{rp,w}^{rp} = 1.

Proof of Theorem 3.4
Let r > 2. As in the proof of Theorem 2.2, we assume without loss of generality that ‖X‖_{rp,w} = 1; hence, we can use directly some of the upper bounds given in the proof of Theorem 2.2. From (6.18), we see that Now, for any x > 0, By (6.19), it follows that, for any x > 0, the last inequality being true because k → a(k, x/k) is increasing. Now, by definition of a(n, x), we infer that, for any Hence, it remains to prove that Arguing as in the proof of Theorem 2.2, and using a maximal version of Rosenthal's inequality (see for instance [16]), we get that, for any q > r and M > 0, Clearly, since α ∈ (1/2, 1], taking q large enough, we get that Let M = n^{α/p}. Arguing as in (6.30) with β > 1/2, we get Hence, the sum over n of the last term in (6.48) will be finite provided Taking β close enough to 1/2 so that α(r − 2) + α(1 − 2β)/p > 0 and interchanging the sum and the integral, we get that Arguing as in (6.30) with β < (q − 1)/q, we get Hence, the sum over n of the second term in (6.48) will be finite provided Taking β close enough to (q − 1)/q so that α(r − q) + α(q − 1 − βq)/p < 0 and interchanging the sum and the integral, we get that

6.6 Proof of Theorem 4.2

Recall that where F_m = 2^m F ∩ B_m. Let α ∈ (0, 1) and Taking into account that, for any positive measure ν, simple algebra leads to the following inequality: for Similarly, for . (6.54) Starting from (6.2) and (6.54), it follows that max Using the Rosenthal inequality (with the constants given in (4.2) of Theorem 4.1 in [16]), as in the proof of Theorem 2.2, we get that there exist positive universal constants c_1, c_2 and c_3 such that, for any q > 2, any M > 0 and λ > 0, and any ε > 0, Let κ > 1 and choose q = q_k = γ log log n_k with γ such that With this choice of q_k, it follows that On the other hand, by Hölder's inequality, setting

Choose now
with a such that 4 c_2 a^p γ ε V^p = 1/κ.
Let K_1 be such that q_{K_1} ≥ 4. It follows that With similar arguments, one can prove that, for any ε > 0, By the direct part of the Borel-Cantelli lemma, it follows that, almost surely, lim sup Note that Proceeding as in the proof of Theorem 2 in [13] (case p > d/2), and noting that we derive that for C large enough. Starting from (6.65) and (6.57), it follows that lim sup_{k→∞} Σ_{m≥0}
We start again from (6.54). All the terms can be handled similarly to the previous case, except for the terms studied in (6.65) and (6.62). For instance, proceeding again as in the proof of Theorem 2 in [13] (case p < d/2), we get The term studied in (6.62) can be handled similarly, and the result follows by taking into account that (µ(B_m))^{1−p/d} ≤ (µ(B_m))^{1/2} and inequality (6.58).
6.8 Proof of Theorem 5.2

From (6.4), we have that From (6.9) and (6.10) with M = ∞, we get the upper bound Then we conclude as in Subsection 6.1 by considering the three cases p > d/2, p = d/2 and p < d/2.

• If p > d/2, there exists a positive constant C depending only on (p, d) such that

lim sup_{n→∞} (n/log log n)^{1/2} W_p^p(µ_n, µ) ≤ C ∫_0^∞ t^{p−1} H(t) dt a.s.

• If p ∈ [1, d/2), there exists a positive constant C depending only on (p, d) such that

lim sup_{n→∞} (n/log log n)^{p/d}