Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications

The notion of entropy-regularized optimal transport, also known as Sinkhorn divergence, has recently gained popularity in machine learning and statistics, as it makes feasible the use of smoothed optimal transportation distances for data analysis. The Sinkhorn divergence allows the fast computation of an entropically regularized Wasserstein distance between two probability distributions supported on a finite metric space of (possibly) high dimension. For data sampled from one or two unknown probability distributions, we derive the distributional limits of the empirical Sinkhorn divergence and of its centered version (the Sinkhorn loss). We also propose a bootstrap procedure which yields new test statistics for measuring the discrepancies between multivariate probability distributions. Our work is inspired by the results of Sommerfeld and Munk (2016) on the asymptotic distribution of the empirical Wasserstein distance on a finite space for unregularized transportation costs. Incidentally, we also analyze the asymptotic distribution of entropy-regularized Wasserstein distances when the regularization parameter tends to zero. Simulated and real datasets are used to illustrate our approach.


Motivations
In this paper, we study the convergence (to their population counterparts) of empirical probability measures supported on a finite metric space with respect to entropy-regularized transportation costs. Transport distances are widely employed for comparing probability measures, since they capture in an intuitive manner the geometry of distributions (see e.g. [36] for a general presentation of the subject). In particular, the Wasserstein distance is well adapted to discrete probability measures (supported on a finite set), as its computation reduces to solving a linear program. Moreover, since data in the form of histograms may be represented as discrete measures, the Wasserstein distance has been shown to be a relevant statistical tool in various fields such as clustering of discrete distributions [39], nonparametric Bayesian modelling [23], fingerprint comparison [32], unsupervised learning [1] and principal component analysis [3,30,5].
However, the computational cost of evaluating a transport distance is generally of order O(N³ log N) for discrete probability distributions with a support of size N. To overcome this cost, Cuturi [7] proposed to add an entropic regularization term to the linear program corresponding to a standard optimal transport problem, leading to the notion of Sinkhorn divergence between probability distributions. Initially, the purpose of transport plan regularization was to efficiently compute a divergence close to the Wasserstein distance between two probability measures, through an iterative scaling algorithm in which each iteration costs O(N²). This proposal has recently gained popularity in machine learning and statistics, as it makes feasible the use of smoothed optimal transportation distances for data analysis. It has found various applications such as generative models [19] and, more generally, high-dimensional data analysis in multi-label learning [16], dictionary learning [28], image processing [9,25], text mining via bag-of-words comparison [18], and averaging of neuroimaging data [20].
The goal of this paper is to analyze the potential benefits of the Sinkhorn divergence and of its centered version [14,19] for statistical inference from empirical probability measures. We derive novel results on the asymptotic distribution of such divergences for data sampled from (unknown) distributions supported on a finite metric space. The main application is to obtain new test statistics (for one- or two-sample problems) for the comparison of multivariate probability distributions.

Previous work and main contributions
The derivation of distributional limits of an empirical measure towards its population counterpart in the p-Wasserstein distance W_p(µ, ν) is well understood for probability measures µ and ν supported on R [15,10,11]. These results have then been extended to specific parametric distributions supported on R^d belonging to an elliptic class, see [27] and references therein. Recently, a central limit theorem has been established in [12] for the empirical transportation cost, for data sampled from absolutely continuous measures on R^d, for any d ≥ 1. The case of discrete measures supported on a finite metric space has also been considered in [32], with the proof of the convergence (in the spirit of the central limit theorem) of empirical Wasserstein distances toward the optimal value of a linear program. Additionally, Klatt et al. [21] analyzed, in parallel with our results, the distributional limit of regularized optimal transport divergences between empirical distributions. In particular, the work in [21] extends the study of distributional limits of the regularized empirical transportation cost to general penalty functions (beyond entropy regularization). The authors of [26] also studied the link between nonparametric tests and the Wasserstein distance, with an emphasis on distributions with support in R.
However, apart from the one-dimensional case (d = 1) and the work of [21], these results lead to test statistics whose numerical implementation becomes prohibitive for empirical measures supported on R^d with d ≥ 2. The computational cost required to evaluate a transport distance is indeed only easily tractable in R. It is therefore of interest to propose test statistics based on fast Sinkhorn divergences [7]. In this context, this paper focuses on the study of inference from discrete distributions in terms of entropically regularized transport costs, the link with inference through unregularized transport, and the construction of test statistics that are well suited to investigate the equality of two distributions. The results are inspired by the work in [32] on the asymptotic distribution of the empirical Wasserstein distance on a finite space using unregularized transportation costs.
Our main contributions may be summarized as follows. First, for data sampled from one or two unknown measures µ and ν supported on a finite space, we derive central limit theorems for the Sinkhorn divergence between their empirical counterparts. These results allow us to build new test statistics for measuring the discrepancies between multivariate probability distributions. Notice however that the Sinkhorn divergence, denoted W^p_{p,ε}(µ, ν) (where ε > 0 is a regularization parameter), is not a distance since W^p_{p,ε}(µ, µ) ≠ 0. This is a serious drawback for testing the hypothesis of equality between distributions. Thus, as introduced in [14,19], we further consider a centered version of the Sinkhorn divergence, referred to as the Sinkhorn loss and denoted W̄^p_{p,ε}(µ, ν), which satisfies W̄^p_{p,ε}(µ, µ) = 0. This study thus constitutes an important novel contribution with respect to the work of [21]. We present new results on the asymptotic distributions of the Sinkhorn loss between empirical measures. Interestingly, under the hypothesis that µ = ν, such statistics do not converge to a Gaussian random variable but to a mixture of chi-squared distributed random variables. To illustrate the applicability of the method to the analysis of real data, we propose a bootstrap procedure to estimate unknown quantities of interest on the distribution of these statistics, such as their non-asymptotic variance and quantiles. Simulated and real datasets are used to illustrate our approach. Finally, one may stress that an advantage of using test statistics based on the regularized Wasserstein distance (rather than other losses or divergences) is to allow further statistical inference from the resulting optimal transport plan, as demonstrated in [21] for the analysis of protein interaction networks.

Overview of the paper
In Section 2, we briefly recall the optimal transport problem between probability measures, and we introduce the notions of Sinkhorn divergence and Sinkhorn loss. Then, we derive the asymptotic distributions of the empirical Sinkhorn divergence and of the empirical Sinkhorn loss. We also give the behavior of such statistics when the regularization parameter ε tends to zero at a rate depending on the number of available observations. A bootstrap procedure is discussed in Section 3. Numerical experiments are reported in Section 4 for synthetic data and in Section 5 for real data.

Optimal transport, Sinkhorn divergence and Sinkhorn loss
Let (X, d) be a complete metric space with d : X × X → R_+. We denote by P_p(X) the set of Borel probability measures µ supported on X with finite moment of order p ≥ 1, in the sense that ∫_X d^p(x, y) dµ(x) is finite for some (and thus for all) y ∈ X. The p-Wasserstein distance between two measures µ and ν in P_p(X) is defined by

W_p(µ, ν) = ( inf_{π ∈ Π(µ,ν)} ∫_{X×X} d^p(x, y) dπ(x, y) )^{1/p},   (1)

where the infimum is taken over the set Π(µ, ν) of probability measures π on the product space X × X with respective marginals µ and ν.
In this work, we consider the specific case where X = {x_1, ..., x_N} is a finite metric space of size N. In this setting, a measure µ ∈ P_p(X) is discrete, and we write µ = Σ_{i=1}^N a_i δ_{x_i}, where (a_1, ..., a_N) is a vector of positive weights belonging to the simplex Σ_N := {a = (a_i)_{i=1,...,N} ∈ R^N_+ such that Σ_{i=1}^N a_i = 1} and δ_{x_i} is a Dirac measure at location x_i. Therefore, computing the p-Wasserstein distance between discrete probability measures supported on X amounts to solving a linear program whose solution is constrained to belong to the convex set Π(µ, ν). However, the cost of this convex minimization becomes prohibitive for moderate to large values of N. Regularizing a complex problem with an entropy term is a classical approach in optimization to reduce its complexity [38]. This is the approach followed in [7] by adding an entropy regularization on the transport matrix, which yields the strictly convex (primal) problem (2) presented below.
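To make the linear-programming formulation concrete, here is a minimal sketch (the function name `wasserstein_lp` and the use of `scipy.optimize.linprog` are our own illustration choices, not part of the paper) that computes the exact transport cost between two discrete measures on a common finite support:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(a, b, C):
    """Exact optimal transport cost: min <T, C> over T >= 0 with marginals a and b."""
    N, M = C.shape
    A_eq = np.zeros((N + M, N * M))
    for i in range(N):
        A_eq[i, i * M:(i + 1) * M] = 1.0   # row sums of T must equal a
    for j in range(M):
        A_eq[N + j, j::M] = 1.0            # column sums of T must equal b
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]), bounds=(0, None))
    return res.fun
```

The number of variables grows as N², which is why this exact formulation quickly becomes impractical for large supports.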
As the space X is fixed, a probability measure supported on X is entirely characterized by a vector of weights in the simplex. By a slight abuse of notation, we thus identify a measure µ ∈ P_p(X) with its vector of weights a = (a_1, ..., a_N) ∈ Σ_N (and we sometimes write a = µ).

Definition 2.1 (Sinkhorn divergence). Let ε > 0 be a regularization parameter. The Sinkhorn divergence [7] between two probability measures µ = Σ_{i=1}^N a_i δ_{x_i} and ν = Σ_{i=1}^N b_i δ_{x_i} in P_p(X) is defined by

W^p_{p,ε}(µ, ν) = min_{T ∈ U(a,b)} ⟨T, C⟩ + ε Σ_{i,j} t_{ij} log( t_{ij} / (a_i b_j) ),   (2)

where ⟨·, ·⟩ denotes the usual inner product between matrices, a ⊗ b denotes the tensor product (x_i, x_j) → a_i b_j (so that the entropic term is the Kullback–Leibler divergence between T and a ⊗ b),

U(a, b) = { T ∈ R_+^{N×N} : T 1_N = a, T^T 1_N = b }

is the set of transport matrices with marginals a and b (with 1_N denoting the vector of R^N with all entries equal to one), and C is the pairwise cost matrix associated to the metric space (X, d), whose (i, j)-th entry is c_{ij} = d^p(x_i, x_j).

Remark 1. This entire section is also valid for general symmetric positive cost matrices C.

The dual version of problem (2) is introduced in the following definition.

Definition 2.2 (Dual problem). Following [8], the dual version of the minimization problem (2) is given by

max_{u, v ∈ R^N} ⟨u, a⟩ + ⟨v, b⟩ − ε Σ_{i,j} e^{(u_i + v_j − c_{ij})/ε − 1} a_i b_j.   (3)

We denote by S_ε(a, b) the set of optimal solutions of the maximization problem (3).
It is now well known that there exists an explicit relationship between the optimal solutions of the primal (2) and dual (3) problems. These solutions can be computed through an iterative method called Sinkhorn's algorithm [8], which is described below and which makes this relationship explicit.
Proposition 2.1 (Sinkhorn's algorithm). Let K = exp(−C/ε − 1_{N×N}) be the elementwise exponential of the matrix −C/ε − 1_{N×N}, where 1_{N×N} denotes the matrix with all entries equal to 1. Then, there exists a pair of vectors (ũ, ṽ) ∈ R^N_+ × R^N_+ such that the optimal solutions T*_ε and (u*_ε, v*_ε) of problems (2) and (3) are respectively given by

T*_ε = (ũ ṽ^T) ∘ K,   u*_ε = ε log(ũ / a),   v*_ε = ε log(ṽ / b),

where ∘ is the pointwise multiplication (and the logarithm and division are understood elementwise). Moreover, such a pair (ũ, ṽ) is unique up to scalar multiplication (or, equivalently, (u*_ε, v*_ε) is unique up to translation), and it can be recovered as a fixed point of the Sinkhorn map

S(ũ, ṽ) = ( a / (K ṽ), b / (K^T ũ) ),

where K^T is the transpose of K and / stands for the component-wise division.
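As a minimal numerical sketch of the fixed-point iteration above (our own illustration: the function name `sinkhorn` and the fixed iteration count are assumptions, and we take K = exp(−C/ε), the constant factor e^{−1} being absorbed into the scaling vectors):

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000):
    """Sinkhorn scaling iterations for entropy-regularized OT on a finite space.

    Returns the regularized transport plan T and its transport cost <T, C>.
    """
    K = np.exp(-C / eps)               # Gibbs kernel; constant rescalings are absorbed in (u, v)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)              # enforce the column marginals
        u = a / (K @ v)                # enforce the row marginals
    T = u[:, None] * K * v[None, :]    # T = diag(u) K diag(v)
    return T, float(np.sum(T * C))
```

Each iteration costs two matrix-vector products, i.e. O(N²), which is the computational advantage over the exact linear program.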
Remark 2. When the cost matrix C is defined by c_{ij} = ‖x_i − x_j‖² and the grid points x_i are uniformly spread, the matrix-vector products involving exp(−C/ε) within the Sinkhorn algorithm can be efficiently performed via separable one-dimensional convolutions [31], without storing C.
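On a regular 2-D grid with squared Euclidean cost, the Gibbs kernel factorizes across dimensions, so exp(−C/ε) can be applied to an image without ever forming the N × N matrix C. A small sketch of this idea (the function name is ours):

```python
import numpy as np

def apply_gibbs_kernel_separable(X, xs, ys, eps):
    """Apply exp(-C/eps) to an image X living on the grid {(xs[i], ys[j])}.

    With c((x1,y1),(x2,y2)) = (x1-x2)^2 + (y1-y2)^2, the kernel factorizes, and its
    action on vec(X) is K1 @ X @ K2.T: two small 1-D operations instead of an N x N product.
    """
    K1 = np.exp(-((xs[:, None] - xs[None, :]) ** 2) / eps)
    K2 = np.exp(-((ys[:, None] - ys[None, :]) ** 2) / eps)
    return K1 @ X @ K2.T
```

For an l × l grid (N = l²), each application costs O(l³) = O(N^{3/2}) instead of O(N²).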
As discussed in the introduction, an important issue regarding the use of the Sinkhorn divergence for testing the equality of two distributions is that it leads to a biased statistic, in the sense that W^p_{p,ε}(a, b) is not equal to zero under the null hypothesis a = b. A possible alternative avoiding this issue is the so-called Sinkhorn loss [14,19], defined below.

Definition 2.3 (Sinkhorn loss). Let ε > 0 be a regularization parameter. The Sinkhorn loss between two probability measures µ = Σ_{i=1}^N a_i δ_{x_i} and ν = Σ_{i=1}^N b_i δ_{x_i} in P_p(X) is defined by

W̄^p_{p,ε}(µ, ν) = W^p_{p,ε}(µ, ν) − (1/2) ( W^p_{p,ε}(µ, µ) + W^p_{p,ε}(ν, ν) ).

The Sinkhorn loss is not a distance between probability distributions, but it satisfies various properties of interest for the purpose of this paper, which are summarized below.
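The centering in the definition above is straightforward to implement on top of any routine returning a Sinkhorn divergence. A hedged sketch (helper names `reg_ot` and `sinkhorn_loss` are ours; `reg_ot` returns the transport cost ⟨T, C⟩ of the regularized plan):

```python
import numpy as np

def reg_ot(a, b, C, eps, n_iter=500):
    # Plain Sinkhorn iterations, then the transport cost <T, C> of the regularized plan.
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]
    return float(np.sum(T * C))

def sinkhorn_loss(a, b, C, eps):
    # Centered divergence: exactly zero when a == b, by symmetry of the two corrections.
    return reg_ot(a, b, C, eps) - 0.5 * (reg_ot(a, a, C, eps) + reg_ot(b, b, C, eps))
```

The correction terms reg_ot(a, a, ...) and reg_ot(b, b, ...) remove the bias of the raw divergence at a = b, which is exactly why the loss (and not the divergence) is suitable as a test statistic under the null.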
Proposition 2.2. The Sinkhorn loss satisfies three key properties (see Theorem 1 in [14]): it is non-negative, convex in each of its arguments, and W̄^p_{p,ε}(a, b) = 0 if and only if a = b. From Proposition 2.2, a = b is equivalent to W̄^p_{p,ε}(a, b) = 0; therefore the function (a, b) → W̄^p_{p,ε}(a, b) reaches its global minimum at a = b, implying that the gradient of the Sinkhorn loss is zero when a = b, as summarized in the following corollary.

Statistical notations
We denote by →_L the convergence in distribution of a random variable; likewise, G ~_L H stands for the equality in distribution of the random variables G and H. Let a, b ∈ Σ_N, and let â_n and b̂_m be the empirical measures respectively generated by iid samples X_1, ..., X_n drawn from a and Y_1, ..., Y_m drawn from b, that is

â_n = (1/n) Σ_{i=1}^n δ_{X_i}   and   b̂_m = (1/m) Σ_{j=1}^m δ_{Y_j}.   (6)

We also define the multinomial covariance matrix

Σ(a) = diag(a) − a a^T,

and the independent Gaussian random vectors G ~ N(0, Σ(a)) and H ~ N(0, Σ(b)). Finally, recall that the optimal dual variables (u_ε^{a,b}, v_ε^{a,b}) ∈ S_ε(a, b), for a and b in the simplex, are unique up to a scalar addition: for any t ∈ R, (u_ε^{a,b} + t 1_N, v_ε^{a,b} − t 1_N) also belongs to S_ε(a, b).
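For concreteness, the multinomial covariance matrix Σ(a) = diag(a) − a aᵀ, which drives all the Gaussian limits below, can be formed in one line (the function name is our own):

```python
import numpy as np

def multinomial_cov(a):
    """Asymptotic covariance of sqrt(n) * (a_hat_n - a): Sigma(a) = diag(a) - a a^T."""
    a = np.asarray(a, dtype=float)
    return np.diag(a) - np.outer(a, a)
```

Its rows sum to zero, reflecting the simplex constraint: the limiting Gaussian G is supported on the hyperplane {x : Σ_i x_i = 0}, which is why dual potentials defined only up to an additive constant still give a well-defined limit ⟨G, u⟩.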

Notations for differentiation
For a sufficiently smooth function f : (x, y) ∈ R^N × R^N → R, we denote by ∇f and ∇²f the gradient and the Hessian of the function f. In particular, the gradient of f at the point (x, y) is written ∇f(x, y) (this notation also holds for the Hessian). Moreover, the first-order partial derivative with respect to the first variable x (resp. y) is denoted ∂_1 f (resp. ∂_2 f). Likewise, the second-order partial derivatives are denoted ∂²_{11} f, ∂²_{12} f, ∂²_{21} f and ∂²_{22} f.

Differentiability of W p p,ε
As stated at the beginning of the section, the differentiability of W^p_{p,ε} (in the usual Fréchet sense) is needed in order to apply the delta-method. This is proved in the following proposition, whose statement involves S_ε(a, b), the set of optimal solutions of (3).
Proof. From Proposition 2 in [14], W^p_{p,ε} is Gâteaux differentiable and its derivative is given by the optimal dual variables. In order to prove its differentiability (or Fréchet differentiability, since R^N is a finite-dimensional space) at the point (a, b), we only need to prove that the operator ∇W^p_{p,ε} is continuous at (a, b) (see e.g. Prop. 3.2.3 in [40]). Suppose that (a_n, b_n) tends to (a, b) as n tends to infinity. This convergence also holds in the weak* topology for the associated probability measures. Then, we can apply the Cauchy–Schwarz inequality and use Proposition 13 in [14] on the convergence of the pair of optimal dual variables, which concludes the proof.

Convergence in distribution
The following theorem is our main result on distributional limits of empirical Sinkhorn divergences.
Theorem 2.5. Let (u_ε^{a,b}, v_ε^{a,b}) ∈ S_ε(a, b) be an optimal solution of the dual problem (3), and let â_n, b̂_m be the empirical measures defined in (6). Then, the following central limit theorems hold for empirical Sinkhorn divergences.
1. One sample. As n → +∞, one has that

√n ( W^p_{p,ε}(â_n, b) − W^p_{p,ε}(a, b) ) →_L ⟨G, u_ε^{a,b}⟩.

2. Two samples. For ρ_{n,m} = √(nm/(n+m)) and m/(n+m) → γ ∈ (0, 1) as min(n, m) → +∞, one has that

ρ_{n,m} ( W^p_{p,ε}(â_n, b̂_m) − W^p_{p,ε}(a, b) ) →_L √γ ⟨G, u_ε^{a,b}⟩ + √(1−γ) ⟨H, v_ε^{a,b}⟩.

Proof. Following the proof of Theorem 1 in [32], we have that √n(â_n − a) →_L G (see e.g. Theorem 14.6 in [37]), since n â_n is a sample of a multinomial probability measure with probability a. For the two samples case, we use that ρ_{n,m}(â_n − a, b̂_m − b) →_L (√γ G, √(1−γ) H), where ρ_{n,m} and γ are the quantities defined in the statement of Theorem 2.5. From Proposition 2.3, we can directly apply the delta-method: as n → +∞ we obtain the one sample limit above, while, for n and m tending to infinity such that n ∧ m → ∞ and m/(n+m) → γ ∈ (0, 1), we obtain the two samples limit. This completes the proof of Theorem 2.5.
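In practice, the variance of the one-sample Gaussian limit ⟨G, u_ε^{a,b}⟩ is simply (u_ε^{a,b})ᵀ Σ(a) u_ε^{a,b}, and it is unchanged by adding a constant to the potential since Σ(a) 1_N = 0. A hedged plug-in sketch (our own illustration; it assumes the dual potential can be recovered as ε log ũ up to an additive constant, which depends on the normalization convention used in the primal problem):

```python
import numpy as np

def clt_variance_one_sample(a, b, C, eps, n_iter=2000):
    """Plug-in variance of the one-sample CLT limit <G, u> (up to potential convention)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):            # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    u_pot = eps * np.log(u)            # dual potential, defined up to an additive constant
    Sigma = np.diag(a) - np.outer(a, a)  # multinomial covariance Sigma(a)
    return float(u_pot @ Sigma @ u_pot)
```

Replacing (a, b) by their empirical counterparts gives a consistent estimate of the asymptotic variance, which is one way to build approximate confidence intervals without the bootstrap.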

Convergence in probability
Distributional limits of empirical Sinkhorn divergences may also be characterized by a convergence in probability, thanks to the following result which directly follows from the delta-method (see e.g. Theorem 3.9.4 in [35]).

Convergence in distribution
The following theorems are our main results on distributional limits of the empirical Sinkhorn loss, for which we now distinguish the cases a ≠ b (alternative hypothesis) and a = b (null hypothesis).
2. Two samples. For ρ_{n,m} = √(nm/(n+m)) and m/(n+m) → γ ∈ (0, 1) as min(n, m) → +∞, one has that

ρ_{n,m} ( W̄^p_{p,ε}(â_n, b̂_m) − W̄^p_{p,ε}(a, b) ) →_L √γ ⟨G, u_ε^{a,b} − u_ε^{a,a}⟩ + √(1−γ) ⟨H, v_ε^{a,b} − v_ε^{b,b}⟩.

Proof. The only difference with the proof of Theorem 2.5 is the computation of the gradient of W̄^p_{p,ε}, which, by definition of the Sinkhorn loss, is given by ∇W̄^p_{p,ε}(a, b) = (u_ε^{a,b} − u_ε^{a,a}, v_ε^{a,b} − v_ε^{b,b}). The proof of Theorem 2.7 then follows from the same arguments as those used in the proof of Theorem 2.5.
Under the null hypothesis a = b, the derivation of the distributional limit of either W̄^p_{p,ε}(â_n, a) or W̄^p_{p,ε}(â_n, b̂_m) requires further attention. Indeed, thanks to Proposition 2.2, the function (a, b) → W̄^p_{p,ε}(a, b) reaches its global minimum at a = b, and therefore the gradient of the Sinkhorn loss satisfies ∇W̄^p_{p,ε}(a, a) = 0. Hence, to obtain the distributional limit of the empirical Sinkhorn loss, it is necessary to apply a second-order delta-method, yielding an asymptotic distribution which is not Gaussian anymore.

Theorem 2.8. Let a = b be a probability distribution in Σ_N, and denote by â_n an empirical measure obtained by independent sampling from a. Then, as n tends to infinity, the following asymptotic result holds:

n W̄^p_{p,ε}(â_n, a) →_L (1/2) Σ_{i=1}^N λ_i χ²_i(1),   (13)

where λ_1, ..., λ_N are the non-negative eigenvalues of the matrix Σ(a)^{1/2} ∂²_{11} W̄^p_{p,ε}(a, a) Σ(a)^{1/2}, and χ²_1(1), ..., χ²_N(1) are independent random variables with chi-squared distribution with one degree of freedom.
Proof. From Corollary 2.4, we have that ∇W̄^p_{p,ε}(a, a) = 0. In order to apply a second-order delta-method, the Hessian matrix ∇²W̄^p_{p,ε}(a, b) of the Sinkhorn loss needs to be non-singular in a neighborhood of a = b. Note that the Sinkhorn loss is at least C³ (three times continuously differentiable) on the interior of its domain, as proved in Theorem 2 of [22]. Moreover, since the function a → W^p_{p,ε}(a, b) is ε-strongly convex (Theorem 3.4, [2]) and a → −(1/2) W^p_{p,ε}(a, a) is (strictly) convex (Proposition 4, [14]), the Hessian matrix of a → W̄^p_{p,ε}(a, b) is non-singular. We can thus apply Theorem 17 in [33], which states that, by the second-order delta-method, the distributional limit of n W̄^p_{p,ε}(â_n, a) is given by (1/2) G^T ∂²_{11} W̄^p_{p,ε}(a, a) G, which can be rewritten as (1/2) Σ_{i=1}^N λ_i χ²_i(1), where λ_1, ..., λ_N are the eigenvalues of the matrix Σ(a)^{1/2} ∂²_{11} W̄^p_{p,ε}(a, a) Σ(a)^{1/2}. This establishes the distributional limit stated in (13).
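Given (estimates of) the eigenvalues λ_i, quantiles of the limiting mixture Σ_i λ_i χ²_i(1) are easy to approximate by Monte Carlo, which is how a test at a prescribed level could in principle be calibrated if the Hessian were available (the caller can include the factor 1/2 in the weights). A sketch, with function name and Monte Carlo sizes being our own choices:

```python
import numpy as np

def chi2_mixture_quantile(lams, q=0.95, n_mc=200000, seed=0):
    """Monte Carlo q-quantile of sum_i lams[i] * chi2_1 (independent chi-squared, 1 df)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_mc, len(lams)))   # chi2_1 variables are squared standard normals
    samples = (z ** 2) @ np.asarray(lams, dtype=float)
    return float(np.quantile(samples, q))
```

Since the Hessian of the Sinkhorn loss is delicate to compute in practice (see Section 4), the paper instead calibrates such tests with the bootstrap of Section 3; this sketch only illustrates the shape of the null limit.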
In the two samples case, the Hessian matrix is not guaranteed to be non-singular, in which case the asymptotic distribution is degenerate. Nevertheless, we have the following theorem.

Theorem 2.9. Let a = b be a probability distribution in Σ_N, and denote by â_n, ã_m two empirical measures obtained by independent sampling from a. Let us write the Hessian matrix in block form

∇²W̄^p_{p,ε}(a, a) = [ A  B ; B^T  C ],   with A = ∂²_{11} W̄^p_{p,ε}(a, a), B = ∂²_{12} W̄^p_{p,ε}(a, a), C = ∂²_{22} W̄^p_{p,ε}(a, a).

Then,

(nm/(n+m)) W̄^p_{p,ε}(â_n, ã_m) →_L (1/2) Σ_{i=1}^{2N} λ̃_i χ²_i(1),   (14)

where λ̃_1, ..., λ̃_{2N} are the eigenvalues of the matrix of size 2N × 2N given by D^{1/2} ∇²W̄^p_{p,ε}(a, a) D^{1/2}, with D = diag(γ Σ(a), (1−γ) Σ(a)), and χ²_1(1), ..., χ²_{2N}(1) are independent random variables with chi-squared distribution with one degree of freedom.
Proof. As in the proof of Theorem 2.8, both A and C are non-singular by ε-strong convexity. The determinant det(∇²W̄^p_{p,ε}(a, b)) = det(A) det(S) is therefore non-zero in a neighborhood of a = b if and only if the Schur complement S = C − B^T A^{−1} B is invertible in a neighborhood of a = b. Applying Theorem 17 in [33] as in the one sample case, we obtain the distributional limit (14). This completes the proof of Theorem 2.9.

Remark 4. A sufficient condition ensuring the non-singularity of the Schur complement S comes from the ε-strong convexity of A and C, which implies that for any x ∈ R^N,

x^T S x ≥ ε‖x‖² − x^T B^T A^{−1} B x ≥ ε‖x‖² − ‖Bx‖²/ε.

A sufficient condition for the non-singularity of S is therefore ε > sup_{x ≠ 0} ‖Bx‖/‖x‖ at the points a = b. Remark that, since the global minimum is attained at the critical points a = b, the Hessian of W̄^p_{p,ε} is symmetric positive semi-definite at these points. Therefore its Schur complement S = C − B^T A^{−1} B is also positive semi-definite (see e.g. Section A.5.5 in [4]).

Convergence in probability
Limits in probability for the empirical Sinkhorn loss can again be established from a corollary of the delta-method, as done in Theorem 2.6.
Theorem 2.10. Using the same notations as introduced in the statement of Theorem 2.7, the following asymptotic results hold for all a, b ∈ Σ_N.

Link with unregularized optimal transport
A natural question that arises is the behavior of the distributional limits when we let ε tend to 0 at an appropriate rate depending on the sample size. Under such conditions, we recover the distributional limit given by Theorem 1 in Sommerfeld and Munk [32] in the setting of unregularized optimal transport.

Theorem 2.11. Suppose that X ⊂ R^q, and consider the cost matrix C such that c_{ij} = ‖x_i − x_j‖^p, where ‖·‖ stands for the Euclidean norm. We recall that S_0(a, b) ⊂ R^N × R^N is the set of optimal solutions of the dual problem (3) for ε = 0.
1. One sample. Suppose that (ε_n)_{n≥1} is a sequence of positive reals tending to zero and satisfying condition (15). Then, √n(W^p_{p,ε_n}(â_n, b) − W^p_{p,ε_n}(a, b)) has the same distributional limit as the unregularized empirical transport cost in [32]. 2. Two samples. Suppose that (ε_{n,m}) is a sequence of positive reals tending to zero as min(n, m) → +∞ at the analogous rate, for ρ_{n,m} = √(nm/(n+m)) and m/(n+m) → γ ∈ (0, 1). Then, the corresponding two samples limit holds.

Proof. We only prove the one sample case, as both proofs work similarly. For that purpose, we decompose √n(W^p_{p,ε_n}(â_n, b) − W^p_{p,ε_n}(a, b)) into the unregularized empirical term and two approximation terms. From Theorem 1 in [32], we have the distributional limit of the unregularized term. Since X is a finite set, the cost c is an L-Lipschitz function separately in x ∈ X and y ∈ X with respect to the Euclidean distance. Therefore, it satisfies the assumptions of Theorem 1 in [17], which gives a bound on the error between the Sinkhorn divergence and the unregularized transport cost for a given pair of distributions. It follows that, for any a, b ∈ Σ_N (possibly random), this error is controlled in terms of ε, the dimension q of the support space, and diam(X) = sup_{x,y∈X} ‖x − y‖, the diameter of X, which is always finite in the discrete case. Then, as soon as the sequence (ε_n)_{n≥1} satisfies (15), we obtain that the approximation terms vanish uniformly. Combining (19) with (20) and (22), and using Slutsky's theorem, completes the proof of Theorem 2.11.

Use of the bootstrap for statistical inference
The results obtained in Section 2 on the distributions of the empirical Sinkhorn divergence and of the empirical Sinkhorn loss are only asymptotic. It is thus of interest to estimate their non-asymptotic distributions using a bootstrap procedure. The bootstrap consists in drawing new samples from an empirical distribution P̂_n that has been obtained from an unknown distribution P. Conditionally on P̂_n, it thus provides new observations (considered as approximately sampled from P) that can be used to approximate the distribution of a test statistic through Monte Carlo experiments. We refer to [13] for a general introduction to the bootstrap.
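The resampling step itself is elementary on a finite space: drawing n points from â_n amounts to drawing a multinomial vector of counts. A sketch (the function name is ours):

```python
import numpy as np

def bootstrap_weights(a_hat, n, n_boot, seed=0):
    """Draw n_boot bootstrap empirical measures a*_n by resampling n points from a_hat."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n, a_hat, size=n_boot)  # one multinomial draw per bootstrap replicate
    return counts / n
```

Each row is a probability vector on the same support; feeding the rows to a Sinkhorn routine and taking empirical quantiles of the resulting statistics is the Monte Carlo approximation used throughout Sections 4 and 5.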
We can apply the delta-method to prove the consistency of the bootstrap in our setting, using the bounded Lipschitz metric defined below.

Definition 3.1. The bounded Lipschitz (BL) metric between two probability measures µ, ν supported on Ω is defined by

d_BL(µ, ν) = sup_{h ∈ BL_1(Ω)} | ∫_Ω h dµ − ∫_Ω h dν |,

where BL_1(Ω) is the set of real functions h : Ω → R such that ‖h‖_∞ + ‖h‖_Lip ≤ 1.
Our main result on the consistency of bootstrap samples can then be stated. Notice that similar results for the Sinkhorn divergence are obtained straightforwardly using the same arguments.

One sample case:
Conditionally on X_1, ..., X_n, the bootstrap statistic √n(W̄^p_{p,ε}(â*_n, b) − W̄^p_{p,ε}(â_n, b)) converges in distribution, for the BL metric, to the one sample limit of Theorem 2.7.

Two samples case:
Conditionally on the two samples, the bootstrap statistic ρ_{n,m}(W̄^p_{p,ε}(â*_n, b̂*_m) − W̄^p_{p,ε}(â_n, b̂_m)) converges in distribution, for the BL metric, to the two samples limit of Theorem 2.7.

Proof. We only prove the one sample case, since the convergence for the two samples case can be shown with similar arguments. We know that √n(â_n − a) tends in distribution to G ~ N(0, Σ(a)). Moreover, √n(â*_n − â_n) converges (conditionally on X_1, ..., X_n) in distribution to G by Theorem 3.6.1 in [35]. Then, applying Theorem 3.9.11 in [35] on the consistency of the delta-method combined with the bootstrap allows us to obtain the statement of the present Theorem 3.2 in the case a ≠ b.
As explained in [6], the standard bootstrap fails under first-order degeneracy, that is, under the null hypothesis a = b. However, the authors propose a corrected version of the bootstrap (called the Babu correction) in their Theorem 3.2, both for the one sample and the two samples cases, in which the recentered bootstrap statistic converges in distribution (conditionally on X_1, ..., X_n, Y_1, ..., Y_m) for the BL metric. Note that most of the requirements needed to apply Theorem 3.2 in [6] are easily verified, since the distributions are defined on a subset of R^N and the function (a, b) → W̄^p_{p,ε}(a, b) is twice differentiable on Σ_N × Σ_N. However, the assumption on the second derivative (Assumption 3.3 in [6]) requires a finer study that is left for future work. Hence, we stress that the Babu-corrected bootstrap that we use in our numerical experiments is missing theoretical guarantees. Nevertheless, the results reported from our experiments on simulated and real data illustrate its correctness.

Numerical experiments with synthetic data
We propose to illustrate Theorems 2.7, 2.8, 2.9 and 3.2 with simulated data consisting of random measures supported on an l × l square lattice (of regularly spaced points) (x_i)_{i=1,...,N} in R², with N = l², for l ranging from 5 to 20. We use the squared Euclidean distance as the cost C, which therefore scales with the size of the grid. The range of interesting values for ε is thus closely linked to the size of the grid, as can be seen in the expression K = exp(−C/ε − 1_{N×N}). Hence, ε = 100 for a 5 × 5 grid corresponds to more regularization than ε = 100 for a 20 × 20 grid. We ran our experiments in Matlab, using the accelerated version [34] of the Sinkhorn transport algorithm [7]. Furthermore, we used the numerical logarithmic stabilization described in [29], which allows handling relatively small values of ε. Indeed, in small regularization regimes, the Sinkhorn algorithm quickly becomes unstable, all the more so for large grids with a small number of observations.

Convergence in distribution
We first illustrate the convergence in distribution of the empirical Sinkhorn loss (as stated in Theorem 2.7) under the alternative hypothesis a ≠ b, with either one sample or two samples. We consider the case where a is the uniform distribution on a square grid and b is a distribution with a linear trend depending on a slope parameter θ ≥ 0 that is fixed to 0.5, see Figure 1.

Alternative a ≠ b - Two samples
We consider the same setting as before, except that data are now sampled from both distributions a and b. Hence, we run M = 10³ experiments to obtain a kernel density estimate of the distribution of the two samples statistic, which is compared to the density of the limiting Gaussian variable for different values of n and m. The results are reported in Figure 5. The convergence does not seem as good as in the one sample case, which is likely due to the randomness coming from both â_n and b̂_m. We also report in Figure 6 results on the consistency of the bootstrap procedure under the hypothesis H_1 with two samples. From the distributions a and b, we generate two random measures â_n and b̂_m. The value of the corresponding realization of the statistic is represented by the red vertical lines in Figure 6. Then, we generate from â_n and b̂_m two sequences of M = 10³ bootstrap samples of random measures, denoted by â*_n and b̂*_m. We use again a kernel density estimate (with a data-driven bandwidth) to display the corresponding bootstrap distribution (in green) in Figure 6. The green vertical lines in Figure 6 represent a confidence interval of level 95%. We can draw the same conclusion as in the one sample case. All these experiments thus clearly illustrate Theorem 2.7.

As in the previous cases, we consider a to be the uniform distribution on a square grid. We recall that the distributional limit in the right-hand side of (13) is a mixture (1/2) Σ_{i=1}^N λ_i χ²_i(1) of random variables with chi-squared distribution with one degree of freedom. It appears to be difficult to compute the density of this distributional limit, or to draw samples from it, since computing the Hessian matrix ∂²_{11} W̄^p_{p,ε}(a, a) is a delicate task. We thus leave this problem open for future work, and only rely on the non-asymptotic distribution of n W̄^p_{p,ε}(â_n, a). This justifies the use of the bootstrap procedure described in Section 3.
We display the bootstrap statistic in Figure 7. The shape of the non-asymptotic density of n W̄^p_{p,ε}(â_n, a) (red curves in Figure 7) looks chi-squared distributed; in particular, it only takes positive values. The bootstrap distribution (in green) also recovers the most significant mass location of the red density. We then consider a = b to be the uniform distribution on a square grid, and we sample two measures from a, denoted â_n and b̂_m. We compute the non-asymptotic distribution of (nm/(n+m)) W̄^p_{p,ε}(â_n, b̂_m) which, from Theorem 2.9, must converge to the chi-squared mixture (14). The results are displayed in red in Figure 8, together with the bootstrap distribution (in green). We obtain results similar to the one sample case.

Estimation of test power using the bootstrap
One sample - distribution with linear trend and varying slope parameter. The consistency and usefulness of the bootstrap procedure is illustrated by studying the statistical power (that is, P(Reject H_0 | H_1 is true)) of statistical tests (at level 5%) based on the empirical Sinkhorn loss. For this purpose, we choose a to be uniform and b to be a distribution with a linear trend whose slope parameter θ ranges from 0 to 0.1 on a 5 × 5 grid. This experiment is repeated 100 times, in order to estimate the power (at level u) of a test based on n W̄^p_{p,ε}(a, b̂_m) by comparing the resulting sequence of p-values to the value u. The results are reported in Figure 9 (left). It can be seen that the resulting testing procedures discriminate well for the three values of the regularization parameter ε that we considered. As soon as the slope θ increases, b sufficiently differs from a, and the probability of rejecting H_0 thus increases. We have also reported results obtained with the Sinkhorn loss corresponding to optimal transport regularized by the relative entropy (Figure 9 (right)). Indeed, we remark that, in the case of the relative entropy, the power of the test seems to highly depend on the value of ε for a fixed value of the slope parameter θ.

Notice that both ā_20 and w̄_17 are discrete empirical measures admitting a zero mass at many locations x_i.
We use the two samples testing procedure described previously, and a bootstrap approach to estimate the distribution of the test statistic. Notice that n and m respectively correspond to the number of observations for the empirical Autumn distribution ā_20 and the empirical Winter distribution w̄_17, which is the total number of pixels times the number of images. Therefore, n = 20 × 768 × 576 = 8847360 and m = 17 × 768 × 576 = 7520256. We report the results of the testing procedure for ε = 10, 100 by displaying in Figure 11 a density estimate based on M = 100 observations of the bootstrap statistic, where â*_n and ŵ*_m are respectively bootstrap samples of ā_20 and w̄_17, and (u^{ā_20, w̄_17}, v^{ā_20, w̄_17}) are the optimal dual variables associated to (ā_20, w̄_17) in problem (3).
For ε = 10, 100, the value of ρ²_{n,m} W̄^p_{p,ε}(ā_20, w̄_17) is outside the support of this density, and the null hypothesis that the color distributions of images taken during Autumn and Winter are the same is thus rejected; the test statistic lies far in the tail of the bootstrap distribution. We also ran the exact same experiments for a smaller grid (of size 8³ = 512) and a higher number of observations (M = 1000). The results are displayed in Figure 12. The distributions ρ²_{n,m}(W̄^p_{p,ε}(â*_n, ŵ*_m) − W̄^p_{p,ε}(ā_20, w̄_17) − (⟨u^{ā_20, w̄_17}, â*_n − ā_20⟩ + ⟨v^{ā_20, w̄_17}, ŵ*_m − w̄_17⟩)) are much more centered around 0 (we gain a factor 10). However, we reach the same conclusion as before, with a test statistic equal to 9.39 × 10⁶ for ε = 10 and 8.50 × 10⁶ for ε = 100.

Testing the hypothesis of equal distribution when splitting the Autumn dataset
We now propose to investigate the equality of distributions within the same dataset of Autumn histograms. To this end, we arbitrarily split the Autumn dataset into two subsets of 10 images and compute their mean distributions $\bar a_{1\to 10}$ and $\bar a_{11\to 20}$. The results are displayed in Figure 13. We obtain results similar to the two-seasons case, and the null hypothesis that both Autumn distributions follow the same law is thus rejected. On the other hand, the test statistics are smaller in this case, as the color histograms seem to be closer. Indeed, the quantity $\rho^2_{n,m}\, W^p_{p,\varepsilon}(\bar a_{1\to 10}, \bar a_{11\to 20})$ is equal to 11.02 × 10⁶ for ε = 10 and 5.50 × 10⁶ for ε = 100. Similarly to the Winter vs Autumn case, we also ran the same Autumn vs Autumn experiments for a grid of size 8³ = 512 and M = 1000 observations. The results are displayed in Figure 14, with test statistics equal to 14.07 × 10⁵ for ε = 10 and 3.41 × 10⁵ for ε = 100.
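The mean distributions used in this split are simple averages of per-image histograms. The sketch below uses synthetic stand-in counts (the real data are quantized RGB histograms with 8³ = 512 bins from 768 × 576 images); the Gamma/Poisson generator is purely an assumption to make the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(2)

# stand-in for 20 Autumn images: each "image" yields counts over 8^3 = 512 color bins
# (hypothetical data; in the paper the bins come from quantizing pixel colors)
n_bins = 512
counts = rng.poisson(lam=rng.gamma(2.0, 1.0, size=n_bins), size=(20, n_bins)).astype(float)
hists = counts / counts.sum(axis=1, keepdims=True)   # one probability vector per image

a_1_10  = hists[:10].mean(axis=0)    # mean distribution of images 1..10
a_11_20 = hists[10:].mean(axis=0)    # mean distribution of images 11..20

print(a_1_10.sum(), a_11_20.sum())   # both are probability vectors
```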

Future works
As remarked in [32], there exists a vast literature on two-sample testing for univariate data. In a multivariate setting, however, there are arguably no standard methods to test the equality of two distributions. We thus intend to further investigate the benefits of the empirical Sinkhorn loss for proposing novel testing procedures able to compare multivariate distributions in real data analysis. A first perspective is to apply the methodology developed in this paper to more than two samples, using the notion of smoothed Wasserstein barycenters.

Theorem 2.7.
Let $a \neq b$ be two probability distributions in $\Sigma_N$. Let us denote by $\hat a_n$, $\hat b_m$ their empirical counterparts, and by $(u^{a,b}_\varepsilon, v^{a,b}_\varepsilon) \in S_\varepsilon(a, b)$ the dual variables that are optimal solutions of the dual problem (3). Then, the following asymptotic results hold.

1. One sample. As $n \to +\infty$, one has that $\sqrt{n}\,\bigl(\overline{W}^p_{p,\varepsilon}(\hat a_n, b) - \overline{W}^p_{p,\varepsilon}(a, b)\bigr)$ converges in distribution to $\langle G,\; u^{a,b}_\varepsilon - \tfrac{1}{2}(u^{a,a}_\varepsilon + v^{a,a}_\varepsilon)\rangle$.

By definition of the Sinkhorn loss, one has that $W^p_{p,\varepsilon}(a, b) - \overline{W}^p_{p,\varepsilon}(a, b) = \tfrac{1}{2}\bigl(W^p_{p,\varepsilon}(a, a) + W^p_{p,\varepsilon}(b, b)\bigr)$. Therefore, using the upper bound (21), we get $\sqrt{n}\,\bigl(W^p_{p,\varepsilon_n}(\hat a_n, b) - W^p_p(\hat a_n, b) + W^p_p(a, b) - W^p_{p,\varepsilon_n}(a, b)\bigr)$.
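The Sinkhorn-loss centering used above can be checked numerically: with $\overline{W}^p_{p,\varepsilon}(a, b) = W^p_{p,\varepsilon}(a, b) - \tfrac12(W^p_{p,\varepsilon}(a, a) + W^p_{p,\varepsilon}(b, b))$, the self-terms cancel so that the loss of a measure against itself is exactly zero. The sketch below assumes the ⟨P, C⟩ form of the regularized cost on a toy 1-D grid with an illustrative ε.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps, n_iter=500):
    # regularized transport cost <P, C> via Sinkhorn iterations
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return float(((u[:, None] * K * v[None, :]) * C).sum())

def sinkhorn_loss(a, b, C, eps):
    # centered version: the two self-terms remove the entropic bias, so loss(a, a) = 0
    return sinkhorn_cost(a, b, C, eps) - 0.5 * (sinkhorn_cost(a, a, C, eps)
                                                + sinkhorn_cost(b, b, C, eps))

x = np.arange(6.0)
C = (x[:, None] - x[None, :]) ** 2
a = np.array([0.3, 0.2, 0.1, 0.1, 0.2, 0.1])
b = np.array([0.1, 0.1, 0.2, 0.3, 0.2, 0.1])

print("loss(a, a) =", sinkhorn_loss(a, a, C, 1.0))   # 0 by construction
print("loss(a, b) =", sinkhorn_loss(a, b, C, 1.0))
```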

Figure 1: Example of a distribution b with linear trend (with slope parameter θ = 0.5) on a 20 × 20 grid.

We generate M = 10³ empirical distributions $\hat a_n$ (such that $n \hat a_n$ follows a multinomial distribution with parameter a) for different values of n and of the grid size. In this way, we obtain M realizations of $\sqrt{n}\,(W^p_{p,\varepsilon}(\hat a_n, b) - W^p_{p,\varepsilon}(a, b))$, and we use a kernel density estimate (with a data-driven bandwidth) to compare the distribution of these realizations to the density of the Gaussian random variable $\langle G,\; u^{a,b}_\varepsilon - \tfrac{1}{2}(u^{a,a}_\varepsilon + v^{a,a}_\varepsilon)\rangle$. The results are reported in Figure 2 (grid 5 × 5) and Figure 3 (grid 20 × 20). It can be seen that the convergence of the empirical Sinkhorn loss to its asymptotic distribution (as n → ∞) is relatively fast. Let us now shed some light on the bootstrap procedure. The results of the bootstrap experiments are reported in Figure 4. From the uniform distribution a, we generate only one random distribution $\hat a_n$. The value of the realization $\sqrt{n}\,(W^p_{p,\varepsilon}(\hat a_n, b) - W^p_{p,\varepsilon}(a, b))$ is represented by the red vertical lines in Figure 4. Besides, we generate from $\hat a_n$ a sequence of M = 10³ bootstrap samples of random measures denoted by $\hat a^*_n$ (such that $n \hat a^*_n$ follows a multinomial distribution with parameter $\hat a_n$). We again use a kernel density estimate (with a data-driven bandwidth) to compare the distribution of $\sqrt{n}\,(W^p_{p,\varepsilon}(\hat a^*_n, b) - W^p_{p,\varepsilon}(\hat a_n, b))$ to the distribution of $\sqrt{n}\,(W^p_{p,\varepsilon}(\hat a_n, b) - W^p_{p,\varepsilon}(a, b))$ displayed in Figure 2 and Figure 3. The green vertical lines in Figure 4 represent a confidence interval of level 95%. The observation represented by the red vertical line is mostly located within this confidence interval, and the density estimated by bootstrap decently captures the shape of the non-asymptotic distribution of Sinkhorn losses.
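The sampling scheme of this experiment (multinomial draws of $\hat a_n$, then realizations of $\sqrt{n}\,(W^p_{p,\varepsilon}(\hat a_n, b) - W^p_{p,\varepsilon}(a, b))$) can be sketched as follows. The grid size, trend slope, n, M, and ε are illustrative assumptions, and the Sinkhorn routine uses the ⟨P, C⟩ form of the regularized cost; a kernel density estimate of `samples` would then be compared to the Gaussian limit.

```python
import numpy as np

rng = np.random.default_rng(3)

def sinkhorn_cost(a, b, C, eps, n_iter=200):
    # regularized transport cost <P, C> via Sinkhorn iterations
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return float(((u[:, None] * K * v[None, :]) * C).sum())

# 5x5 grid, uniform a, linear-trend b (slope assumed for illustration)
xs, ys = np.meshgrid(np.arange(5.0), np.arange(5.0), indexing="ij")
pts = np.stack([xs.ravel(), ys.ravel()], axis=1)
C = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
a = np.full(25, 1 / 25)
b = 1 + 0.5 * pts[:, 0]; b /= b.sum()

eps, n, M = 1.0, 10_000, 100
W_ab = sinkhorn_cost(a, b, C, eps)
samples = []
for _ in range(M):
    a_hat = rng.multinomial(n, a) / n     # n * a_hat ~ Multinomial(n, a)
    samples.append(np.sqrt(n) * (sinkhorn_cost(a_hat, b, C, eps) - W_ab))
samples = np.array(samples)
print("empirical mean/std of the realizations:", samples.mean(), samples.std())
```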
for which $n = m = 10 \times 768 \times 576 = 4423680$. The procedure is then similar to the Autumn versus Winter case in Subsection 5.1: we sample M = 100 bootstrap distributions $\hat a^*_n$ and $\hat b^*_m$ from $\bar a_{1\to 10}$ and $\bar a_{11\to 20}$, respectively.

Remark 6.
For comparison purposes, we ran a χ² test of homogeneity for testing the hypothesis of equal distributions of colors. The test statistic obtained in the Autumn vs Winter case is equal to $\chi^2_{AW} = 6.96 \times 10^4$, and in the Autumn splitting case to $\chi^2_{AA} = 4.06 \times 10^4$. Even if $\chi^2_{AA}$ is indeed smaller than $\chi^2_{AW}$, the contrast between these two is weaker than with the Sinkhorn loss test.
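The χ² homogeneity statistic used for this comparison can be computed directly from the two count vectors. The sketch below implements the Pearson statistic on synthetic multinomial counts (the data, sample sizes, and bin counts are assumptions; the real counts come from the color histograms).

```python
import numpy as np

def chi2_homogeneity(c1, c2):
    # Pearson chi-square statistic for H0: the two count vectors share one distribution.
    obs = np.stack([c1, c2]).astype(float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row * col / obs.sum()        # expected counts under homogeneity
    mask = exp > 0                     # skip empty columns
    return float(((obs - exp)[mask] ** 2 / exp[mask]).sum())

rng = np.random.default_rng(4)
p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.40, 0.30, 0.20, 0.10])
c1 = rng.multinomial(5000, p)          # sample from p
c2 = rng.multinomial(5000, q)          # sample from q
c3 = rng.multinomial(5000, p)          # two samples from the same law
c4 = rng.multinomial(5000, p)

print("different laws: chi2 =", chi2_homogeneity(c1, c2))
print("same law:       chi2 =", chi2_homogeneity(c3, c4))
```

Under H0 the statistic is approximately χ² with (number of bins − 1) degrees of freedom, so the first value should be much larger than the second.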
$\langle v, \hat b^*_m - \hat b_m\rangle$, we can reformulate the Babu bootstrap as follows. 1. One sample case. For $(u^{\hat a_n, a}_\varepsilon, \ldots$