Information Geometry of Wasserstein Statistics on Shapes and Affine Deformations

Information geometry and Wasserstein geometry are two main structures introduced in a manifold of probability distributions, and they capture different characteristics of it. We study characteristics of Wasserstein geometry in the framework of Li and Zhao (2023) for the affine deformation statistical model, which is a multi-dimensional generalization of the location-scale model. We compare the merits and demerits of estimators based on information geometry and Wasserstein geometry. The shape of a probability distribution and its affine deformation are separated in the Wasserstein geometry, showing robustness against waveform perturbation at the price of a loss in Fisher efficiency. We show that the Wasserstein estimator is the moment estimator in the case of the elliptically symmetric affine deformation model. It coincides with the information-geometrical estimator (maximum-likelihood estimator) when the waveform is Gaussian. The role of Wasserstein efficiency is elucidated in terms of robustness against waveform change.

The affine deformation statistical model p(x, θ) is defined as

p(x, θ) = |Λ| f(Λ(x − µ)),   (1)

where f(z) is a standard shape distribution satisfying

f(z) > 0, ∫ f(z) dz = 1,   (2)

∫ z f(z) dz = 0,   (3)

∫ z z^⊤ f(z) dz = I,   (4)

and I is the identity matrix. We also refer to the standard shape f as a "waveform" in the following. The deformation parameter consists of θ = (µ, Λ) ∈ Θ such that µ is a vector specifying translation of the location and Λ is a non-singular matrix representing scale changes and rotations of x. Given a standard f, we have a statistical model parameterized by θ: M_f = {p(x, θ)}. Geometrically, it forms a finite-dimensional statistical manifold, where θ plays the role of a coordinate system. Note that it is not necessarily identifiable in general.
The identifiability depends on f. The deformation model is a generalization of the location-scale model. Note that this model is often called the location-scatter model in several fields such as statistics and signal processing [50,41]. Let T_θ denote the affine deformation from x to z given by

z = T_θ x = Λ(x − µ).
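As a quick numerical illustration of the model (not part of the original derivation), one can sample from it by drawing z from a standard waveform and inverting z = T_θ x; the parameter values below are hypothetical, the waveform is Gaussian, and Λ is taken symmetric so that Cov[x] = Λ^{-2}:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.5],
                [0.5, 1.0]])   # symmetric positive definite (hypothetical values)

z = rng.standard_normal((200_000, 2))   # standard Gaussian waveform f
x = z @ np.linalg.inv(Lam).T + mu       # invert z = T_theta x = Lam (x - mu)

emp_mean = x.mean(axis=0)
emp_cov = np.cov(x, rowvar=False)
expected_cov = np.linalg.inv(Lam @ Lam)  # = Lam^{-2} since Lam is symmetric

print(emp_mean, mu)            # empirical mean close to mu
print(emp_cov, expected_cov)   # empirical covariance close to Lam^{-2}
```

The empirical mean and covariance of x recover µ and Λ^{-2} up to sampling noise, which anticipates the moment-estimator results of Section 5.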
Let F = {p(x)} be the space of all smooth positive probability density functions that have finite second moments. Let F_S = {f(z)} be its subspace consisting of all the standard distributions f(z) satisfying (3) and (4). Then, any q(x) ∈ F is written in the form

q(x) = |Λ| f(Λ(x − µ))

for some f ∈ F_S and θ = (µ, Λ) ∈ Θ. Note that θ is not necessarily unique due to possible symmetries in f. Hence, F = F_S × Θ/ ∼, where ∼ is the equivalence relation of equality in distribution. See Figure 1.
The geometry of a manifold of probability distributions has so far been studied by information geometry and Wasserstein geometry. The two geometries capture different aspects of a manifold of probability distributions. We use a divergence measure to explain this. Let D_F[p(x), q(x)] and D_W[p(x), q(x)] be two divergence measures between distributions p(x) and q(x), where subscripts F and W represent Fisher-based information geometry and Wasserstein geometry, respectively. Information geometry uses an invariant divergence D_F, typically the Kullback-Leibler divergence. The Wasserstein divergence D_W is defined by the cost of transporting masses distributed in the form p(x) to another q(x). Roughly speaking, D_F measures the vertical differences of p(x) and q(x), for example, represented by their log-ratio log(p(x)/q(x)), whereas D_W measures the horizontal differences of p(x) and q(x), which correspond to the transportation cost from p(x) to q(x). See Figure 2.
Information geometry is constructed based on the invariance principle of Chentsov [17] such that D_F[p(x), q(x)] is invariant under invertible transformations of the coordinates x of the sample space X. This implies that the divergence does not depend on the coordinate system of X. We then have a unique Riemannian metric, which is the Fisher-Rao metric, and also a dual pair of affine connections [5]. This is useful not only for analyzing the performance of statistical inference but also for image analysis, machine learning, statistical physics, and many other fields (see [1]). Wasserstein geometry has an old origin: it was proposed by G. Monge in 1781 as a problem of transporting mass distributed in the form p(x) to another q(x) such that the total transportation cost is minimized. It depends on the transportation cost c(x, y) between two locations x, y ∈ X. The cost is usually a function of the Euclidean distance between x and y. We use the square of the distance as a cost function, which gives the L2-Wasserstein geometry. This Wasserstein geometry directly depends on the Euclidean distance of X = R^d. Therefore, it is useful for problems that intrinsically depend on the metric structure of X, such as the transportation problem, non-equilibrium statistical physics, pattern analysis, machine learning, and many others. However, it is in general difficult to calculate the Wasserstein distance [45]. See the recent papers [27,45] for computational algorithms for the Wasserstein distance.
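In one dimension the optimal L2 plan simply matches quantiles, so the Wasserstein distance between two equal-size samples reduces to sorting. A minimal sketch (the sample sizes and the two Gaussians are arbitrary illustrative choices):

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared L2-Wasserstein divergence between two equal-size 1-D samples:
    sort both (quantile coupling) and average the squared displacements."""
    x, y = np.sort(x), np.sort(y)
    return np.mean((x - y) ** 2)

rng = np.random.default_rng(1)
# For Gaussians N(m1, s1^2), N(m2, s2^2) the closed form is
# (m1 - m2)^2 + (s1 - s2)^2, here 3^2 + (1 - 2)^2 = 10.
x = rng.normal(0.0, 1.0, 100_000)
y = rng.normal(3.0, 2.0, 100_000)
print(w2_squared_1d(x, y))   # close to the theoretical value 10
```

The "horizontal" nature of D_W is visible here: it is computed from displacements x − y on the sample space, not from density ratios.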
Chen and Li [18] studied a general regular statistical model specified by a finite number of parameters and pulled back a Riemannian metric from the Otto metric in the space consisting of all smooth and positive density functions on R^n. Using this metric (W-metric), they studied natural gradient estimation procedures, obtaining the result that the W-natural gradient performs better than the natural gradient procedure derived from the Fisher-Rao metric. This is a remarkable observation showing the usefulness of the W-metric in statistical inference. We follow this idea of pulling back the Otto Riemannian metric to a regular statistical model, using their regularity conditions, which guarantee the existence and uniqueness of the pulled-back Riemannian metric.
Li and Zhao [34] gave a unified framework for the two geometries. The present article is based on their framework and focuses on the affine deformation model, for which the standard waveform f and the deformation parameter θ are separated. Li and Zhao [34] further introduced the Wasserstein score function in parallel to the Fisher score function, thereby defining two estimators θ̂_F and θ̂_W. The former, θ̂_F, is the maximum likelihood estimator that maximizes the log-likelihood. This is the one that minimizes an invariant divergence from the empirical distribution p̂(x) to a parametric model, where the empirical distribution p̂(x) = (1/n) Σ_j δ(x − x_j) is based on n independent observations x_1, …, x_n and δ(x) is the delta function. The latter, the Wasserstein estimator θ̂_W, is defined as the zero point of the Wasserstein score. It is asymptotically equivalent to the minimizer of the W-divergence between the empirical distribution and the model. Li and Zhao [34] further defined the F-efficiency and W-efficiency of a consistent estimator θ̂ given a statistical model M = {p(x, θ)}, proving Cramér-Rao type inequalities.

The present paper is organized as follows. In Section 2, we introduce two divergences between distributions, one based on the invariance principle and the other based on the transportation cost. The divergences give two Riemannian structures in the space F of probability distributions p(x) over X = R^d. A regular statistical model M = {p(x, θ)} parameterized by θ is a finite-dimensional submanifold embedded in F.
In Section 3, we define the F- and W-score functions following [34]. The Riemannian structure of the tangent space of probability distributions is pulled back to the model submanifold, giving both the Riemannian metrics and the score functions. We define the F- and W-estimators θ̂_F and θ̂_W by using the F- and W-score functions, respectively. Section 4 defines the affine deformation statistical model. Section 5 studies the elliptically symmetric affine deformation model M_f, where f is a spherically symmetric standard form. For this model, we show that the W-score functions are quadratic functions of x. Hence, it is proved that θ̂_W is a moment estimator. For the Gaussian shape, the F-estimator and W-estimator coincide. We also show that M_f and F_S are orthogonal in the W-geometry, implying the separation of the waveform and the deformation. In Section 6, we elucidate the role of W-efficiency from the point of view of robustness to a change in the waveform f due to observation noise. Section 7 briefly summarizes the paper and mentions future work.

Riemannian structures in the space of probability densities
We consider the space F = {p(x)} of all smooth positive probability density functions on R^d that have finite second moments. Later, we may relax the conditions of positivity and smoothness when we discuss a parametric model, in particular the deformation model. We define a divergence function D[p(x), q(x)], which represents the degree of difference between p(x) and q(x). The square of the L2 distance between p(x) and q(x) plays this role, but a divergence does not necessarily need to be symmetric with respect to p(x) and q(x). A divergence function satisfies the following conditions:

1. D[p(x), q(x)] ≥ 0, and the equality holds if and only if p(x) = q(x).

2. Let δp(x) be an infinitesimally small deviation of p(x). Then, D[p(x), p(x) + δp(x)] is approximated by a positive quadratic functional of δp(x).
A divergence is said to be invariant if

D[p(x), q(x)] = D[p̄(y), q̄(y)]

holds for every smooth invertible transformation k of the coordinates from x ∈ R^d to y = k(x), where p̄(y) = |∂x/∂y| p(x) is the density induced by the change of variables.
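The invariance can be checked numerically; the following sketch verifies that the Kullback-Leibler divergence between two Gaussian densities is unchanged under the smooth invertible map y = x + x³ (the densities and the map are arbitrary illustrative choices):

```python
import numpy as np

def trap(f, t):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t)))

def gauss(t, m, s):
    return np.exp(-(t - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-8.0, 8.0, 200_001)
p, q = gauss(x, 0.0, 1.0), gauss(x, 0.7, 1.3)

# KL divergence in the x coordinates.
kl_x = trap(p * np.log(p / q), x)

# Transform by y = k(x) = x + x^3 (k' > 0, so invertible);
# induced densities: pbar(y) = p(x) |dx/dy| = p(x) / k'(x).
k, dk = x + x ** 3, 1.0 + 3.0 * x ** 2
pbar, qbar = p / dk, q / dk
kl_y = trap(pbar * np.log(pbar / qbar), k)

print(kl_x, kl_y)   # the two values agree (invariance)
```

The agreement holds because the density ratio p̄/q̄ at y = k(x) equals p/q at x, so the "vertical" comparison is coordinate-free.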
A typical invariant divergence is the α-divergence (α ≠ ±1), defined by

D_α[p(x), q(x)] = 4/(1 − α²) (1 − ∫ p(x)^{(1−α)/2} q(x)^{(1+α)/2} dx).

A characterization of the α-divergence is given in [1]. The α-divergence gives the information-geometric structure to F.

Another divergence is the Wasserstein divergence. Let us transport masses piled in the form p(x) to another q(x). To this end, we need to move some mass at x to another position y. Let π(x, y) be a coupling of p(x) and q(y):

∫ π(x, y) dy = p(x),   (5)

∫ π(x, y) dx = q(y).   (6)
Note that π does not necessarily have a density. For convenience, we use the notation π(x, y) in this paper. Let c(x, y) be the cost of transporting a unit of mass from x to y. Then, the Wasserstein divergence D_W[p(x), q(x)] is the minimum transporting cost from p(x) to q(x). By using a stochastic plan π(x, y), the Wasserstein divergence between p(x) and q(x) is given by

D_W[p(x), q(x)] = inf_π ∫∫ c(x, y) π(x, y) dx dy,

where the infimum is taken over all stochastic plans π satisfying (5) and (6). When the cost is the square of the Euclidean distance, c(x, y) = ‖x − y‖², we call D_W the L2-Wasserstein divergence. We focus on this divergence in the following. Note that the L2-Wasserstein divergence is the square of the L2-Wasserstein distance. From Brenier's theorem [14,15], the optimal transport is actually induced by a transport map. In other words, for each point x, π(x, ·) is supported at a single point.
The dynamic formulation of the optimal transport problem, proposed by [16] and developed further by [10,38], is useful. Let ρ(x, t) be a family of probability distributions parameterized by t. It represents the time course of transporting p(x) to q(x), satisfying ρ(x, 0) = p(x), ρ(x, 1) = q(x).
We introduce a potential Φ(x, t) such that its gradient ∇_x Φ(x, t) represents the velocity of mass flow at x and t in the dynamic plan. Then, Φ(x, t) satisfies the following continuity equation:

∂_t ρ(x, t) + ∇_x · (ρ(x, t) ∇_x Φ(x, t)) = 0.

The Wasserstein divergence is written in the dynamic formulation as

D_W[p, q] = inf ∫_0^1 ∫ ‖∇_x Φ(x, t)‖² ρ(x, t) dx dt,

where the infimum is taken over all pairs (ρ, Φ) satisfying the continuity equation and the boundary conditions.

We introduce a Riemannian structure to F by the Taylor expansion of D[p, p + δp]. The Riemannian metric g gives the squared magnitude ds² of an infinitesimal deviation δp(x) in the tangent space of F through the leading quadratic term of the expansion. In the case of the invariant divergence, D[p, p + δp] ≈ (1/2) ∫ (δp(x))²/p(x) dx, so that

ds²_F = ∫ (δp(x))²/p(x) dx.

In the case of the L2-Wasserstein divergence, consider the change of density from ρ(x, 0) = p(x) at t = 0 to ρ(x, dt) = p(x) + δp(x) at t = dt for an infinitesimal dt. By using the potential Φ(x) of this infinitesimal transport,

δp(x) = (∆_p Φ)(x),   (9)

where ∆_p is the operator defined by

∆_p Φ = −∇_x · (p(x) ∇_x Φ(x)).

Then, the L2-Wasserstein divergence is

D_W[p, p + δp] = ∫ ‖∇_x Φ(x)‖² p(x) dx = ∫ Φ(x) (∆_p Φ)(x) dx,

where we used integration by parts for integrands decaying sufficiently fast. See [18] for precise regularity conditions. Thus,

ds²_W = ∫ ‖∇_x Φ(x)‖² p(x) dx.

Note that Φ(x) is unique up to an additive constant under regularity conditions (see Theorem 13.8 of [53] and Section 8.1 of [52]). This is Otto's Riemannian metric [42].

We have focused on the space F of smooth and positive densities under L2-Wasserstein geometry. It is indeed an infinite-dimensional Riemannian manifold [36]. Since the space F_S is a codimension d + d(d + 1)/2 subspace of F that specifies the values of linear functionals (first and second moments), it is also an infinite-dimensional Riemannian manifold. However, if we consider the larger space of general probability distributions, then it is not even a Banach manifold under L2-Wasserstein geometry, because it includes singular (atomic) distributions for which the tangent space is more restricted than that of F. Note that information geometry and Wasserstein geometry induce different topologies on the space of probability distributions. However, this does not bother us when we study only a finite-dimensional regular statistical model included in F.
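The continuity-equation picture can be checked numerically in one dimension for the simplest flow, a pure translation ρ(x, t) = p(x − t), whose potential is Φ(x) = x (unit velocity everywhere); the Gaussian density and grid below are illustrative choices:

```python
import numpy as np

# For rho(x, t) = p(x - t) with potential Phi(x) = x, the continuity
# equation requires d(rho)/dt = -d/dx (rho * dPhi/dx) = -p'(x).
x = np.linspace(-10, 10, 4001)
p = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

dt = 1e-4
p_shifted = np.exp(-(x - dt) ** 2 / 2) / np.sqrt(2 * np.pi)
drho_dt = (p_shifted - p) / dt
minus_div = -np.gradient(p, x)        # -(p * 1)' = -p'

print(np.max(np.abs(drho_dt - minus_div)))   # small residual

# Otto metric of this move: ds^2 per unit time = ∫ |grad Phi|^2 p dx = ∫ p dx = 1,
# i.e. the W-metric assigns a location shift its Euclidean length.
speed2 = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(x))
print(speed2)                                 # close to 1
```

That the metric of a pure translation is Euclidean anticipates the result of Section 5 that the W-score of a location parameter is linear in x.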
We follow the regularity conditions given in Chen and Li [18] throughout the paper.

Score functions and estimators
We consider a regular statistical model M = {p(x, θ)} ⊂ F parameterized by an m-dimensional vector θ. The tangent space T_θM of M at θ is spanned by the m functions

∂_i p(x, θ) = ∂p(x, θ)/∂θ^i,   i = 1, …, m,

so that a tangent vector δp(x) is given by

δp(x) = δθ^i ∂_i p(x, θ).   (10)

Hereafter, the summation convention is used; that is, all indices appearing twice, once as an upper and once as a lower index, e.g. i in (10), are summed over.
Let us define the score functions S_i(x, θ) by using the basis functions ∂_i p(x, θ) of the tangent space of M, for i = 1, …, m. In the case of the invariant Fisher geometry, from (8), the Fisher score function S^F_i(x, θ) is defined by

S^F_i(x, θ) = ∂_i log p(x, θ),

which is the derivative of the log-likelihood. In Wasserstein geometry, the Wasserstein score (W-score) function S^W_i(x, θ) [34] is defined as the solution of

∂_i p(x, θ) = ∆_p S^W_i(x, θ),   E_θ[S^W_i(x, θ)] = 0,

where the latter condition is imposed to eliminate the indefiniteness due to the integration constant. By using (9), provided it holds, we see that S^W_i(x, θ) satisfies the Poisson equation:

∇_x · (p(x, θ) ∇_x S^W_i(x, θ)) = −∂_i p(x, θ).   (11)

For infinitesimal δ, the map x ↦ x + δ∇_x S^W_i(x, θ) is the optimal transport map from p(x, θ) to p(x, θ + δe_i), with transportation cost δ² g^W_ii(θ) + o(δ²) as δ → 0, where e_i is the i-th standard unit vector. See Proposition 8.4.6 of [6] for a more rigorous statement. In both the Fisher and Wasserstein cases, the score function satisfies

E_θ[S_i(x, θ)] = 0.   (13)

We think that the existence and uniqueness of S^W_i can be established directly along the lines of Brenier's theorem, as suggested by one of the reviewers. Note that [18] discussed some results on restricting the Wasserstein metric to a regular parametric statistical manifold.
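The Poisson equation characterizing the W-score can be verified numerically in the simplest case, the 1-D location model p(x, µ) = f(x − µ), where the candidate score S(x) = x − µ has ∇S = 1; the skewed waveform below is a deliberately non-Gaussian illustrative choice:

```python
import numpy as np

# Check d/dx ( p(x, mu) * dS/dx ) = - dp/dmu  for S(x) = x - mu,
# which reduces to p'(x) = p'(x); the identity is linear in p, so an
# unnormalized waveform suffices for the check.
x = np.linspace(-10, 10, 4001)
mu = 0.5
z = x - mu
f = np.exp(-z ** 2 / 2) * (1 + 0.9 * np.tanh(z))   # skewed, non-Gaussian

lhs = np.gradient(f * 1.0, x)                      # d/dx (p * S'), with S' = 1

dmu = 1e-5
f_dmu = np.exp(-(x - mu - dmu) ** 2 / 2) * (1 + 0.9 * np.tanh(x - mu - dmu))
rhs = -(f_dmu - f) / dmu                           # -dp/dmu

print(np.max(np.abs(lhs - rhs)))                   # small residual
```

The normalization E_θ[S] = 0 then fixes the additive constant, giving S^W_µ(x) = x − µ, in line with the quadratic-score results of Section 5.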
The Riemannian metric tensor g_ij(θ) is pulled back from g in F to M. Note that Chen and Li [18] provided a detailed account of the restriction of the Wasserstein metric to a parametric statistical manifold, with regularity conditions and concrete examples. In the Fisherian case,

g^F_ij(θ) = E_θ[S^F_i(x, θ) S^F_j(x, θ)].

In the Wasserstein case,

g^W_ij(θ) = E_θ[∇_x S^W_i(x, θ) · ∇_x S^W_j(x, θ)],   (14)

where identity (9) is used.
The score functions S_i(x, θ) give a set of estimating functions from (13), which are used to obtain an estimator θ̂. Suppose that we have n independent observations x_1, …, x_n from p(x, θ). Let p̂_emp(x) be the empirical distribution given by

p̂_emp(x) = (1/n) Σ_{k=1}^n δ(x − x_k).

Then, replacing the expectation E in (13) by the expectation with respect to the empirical distribution, we have the estimating equations

(1/n) Σ_{k=1}^n S_i(x_k, θ) = 0,   i = 1, …, m.   (15)

The solution θ̂ gives a consistent estimator for large n under regularity conditions on the score function [51]. Roughly speaking, θ̂ is the projection of p̂_emp(x) to the model M with respect to the metric g (see Figure 3). A consistent estimator is Fisher efficient when the projection is orthogonal with respect to the Fisher-Rao metric [5].
For the invariant Fisherian case, the estimator θ̂_F defined by (15) is the maximum likelihood estimator:

θ̂_F = argmax_θ (1/n) Σ_{k=1}^n log p(x_k, θ).

The Cramér-Rao theorem gives a matrix inequality

Cov[θ̂] ⪰ (1/n) (g^F)^{-1}(θ)

for any unbiased estimator θ̂, where Cov[·] is the covariance matrix and ⪰ denotes the matrix order defined by positive semidefiniteness. The maximum likelihood estimator θ̂_F attains this bound asymptotically. Hence, it minimizes the error covariance matrix, and the minimized error covariance is given asymptotically by the inverse of the Fisher metric tensor g^F divided by n. Such a property is called the Fisher efficiency.

In the following, we study the characteristics of the estimator θ̂_W defined by (15) with the Wasserstein score:

(1/n) Σ_{k=1}^n S^W_i(x_k, θ̂_W) = 0.

We call θ̂_W the Wasserstein estimator (W-estimator) following [34]. In the case of the one-dimensional location-scale model, the Wasserstein estimator is asymptotically equivalent to the estimator obtained by minimizing the Wasserstein divergence (transportation cost) from the empirical distribution p̂_emp(x) to the model M:

θ̂_Wp = argmin_θ D_W[p̂_emp(x), p(x, θ)].   (16)

See the end of Section 5. The properties of θ̂_Wp were studied in detail by [4] in the case of the one-dimensional location-scale model. Note that θ̂_W ≠ θ̂_Wp in general, contrary to the Fisher case.
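The contrast between the two estimators is already visible in the one-dimensional location model with a Laplace waveform: the maximum likelihood (F-) estimator of the location is the sample median, while, as shown in Section 5, the W-estimator for a symmetric waveform is the moment estimator, i.e. the sample mean. A small simulation with an illustrative true location:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.laplace(loc=3.0, scale=1.0, size=100_001)   # Laplace waveform, mu = 3

mu_F = np.median(x)   # maximum likelihood estimator for the Laplace location
mu_W = np.mean(x)     # moment / Wasserstein estimator

print(mu_F, mu_W)     # both near 3.0
```

Here θ̂_F exploits the waveform (and has smaller asymptotic variance), whereas θ̂_W does not depend on it at all, which is exactly the efficiency-versus-robustness trade-off discussed in Section 6.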

Affine deformation model
Now, we focus on the affine deformation model. Let f(z) ∈ F_S be a standard probability density function satisfying (2), (3), and (4). To define M_f, we use the affine deformation of x,

z = Λ(x − µ),

where µ is a vector representing the shift of location and Λ is a non-singular matrix. Hence, θ = (µ, Λ) is m-dimensional, where m ≤ d + d² due to the possible symmetries in f. The model is

M_f = {p(x, θ) = |Λ| f(Λ(x − µ)) : θ = (µ, Λ) ∈ Θ}.

This is a generalization of the location-scale model, which is simply obtained by putting Λ = (1/σ)I, with σ being the scale factor. It should be noted that Λ is decomposed as Λ = UDO, where U and O are orthogonal matrices and D is a positive diagonal matrix (singular value decomposition). In the following, we denote the log probability of the standard shape f by l(z) = log f(z).
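The decomposition Λ = UDO is the ordinary singular value decomposition; in NumPy it is computed as follows (the test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
Lam = rng.normal(size=(3, 3))          # a generic (non-singular) matrix

U, d, O = np.linalg.svd(Lam)           # Lam = U @ diag(d) @ O

assert np.allclose(U @ np.diag(d) @ O, Lam)   # reconstruction
assert np.allclose(U @ U.T, np.eye(3))        # U orthogonal
assert np.allclose(O @ O.T, np.eye(3))        # O orthogonal
print(d)                                      # positive diagonal entries of D
```

The diagonal entries of D are the axis-wise scale factors, while U and O carry the rotational (and possibly reflective) parts of the deformation.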
For each standard shape function f ∈ F_S, an affine deformation model M_f parameterized by θ = (µ, Λ) ∈ Θ is attached. Thus, F is decomposed as

F = ⋃_{f ∈ F_S} M_f.

See [49] for the cone structure of F. When f is Gaussian, its structure is studied in detail in [48].
When p(x) belongs to M_f, the waveform of p(x) is said to be equivalent to that of f. The space M_f consists of the distributions of all equivalent waveforms. All ellipsoidal shapes are equivalent to a spherical shape. A family of special parallelepiped shapes is equivalent to a cubic form (see Figure 4). Therefore, our model is useful for separating the effect of the shape from location and affine deformation.
We may consider subclasses of the deformation model. One simple example is the location model, in which Λ is fixed to the identity matrix I. A stronger theorem is known for such a simple model [24]. In our context, it can be expressed as follows.
Proposition 1. Wasserstein geometry gives an orthogonal decomposition of the shape and the location.

Elliptically symmetric deformation model

Here, we focus on the cases where f(z) is spherically symmetric and thus written as f(z) = g(‖z‖) for some function g. Thus,

p(x, θ) = |Λ| g(‖Λ(x − µ)‖),   (18)

which is elliptically symmetric. We restrict the parameter Λ to be symmetric and positive definite in this section. Namely, the dimension of the parameter space is m = d + d(d + 1)/2.

First, we consider the F-estimator θ̂_F (maximum likelihood estimator). The log-likelihood is given by

log p(x, θ) = log |Λ| + log g(‖Λ(x − µ)‖).
When there are n independent observations x_1, …, x_n, the summation is taken over them, so that we have the likelihood equations obtained by setting the derivatives of the summed log-likelihood with respect to µ and Λ to zero. The solution θ̂_F strongly depends on the shape g. Contrary to this, the W-estimator θ̂_W does not depend on the shape g, as shown below. We write the i-th standard unit vector as e_i ∈ R^d, so that (e_i)_j = δ_ij.

Lemma 1. For a symmetric
Proof. Straightforward calculation.
Lemma 2. Let A, B ∈ R^{d×d} be symmetric matrices. If A is positive definite, then the Sylvester equation AX + XA = B has a unique solution X, which satisfies X^⊤ = X and tr(X) = tr(A^{-1}B)/2.
Proof. From the positive definiteness of A, the spectra of A and −A are disjoint. Thus, from Theorem VII.2.1 of [12], the Sylvester equation AX + XA = B has a unique solution.
Let X be the solution of the Sylvester equation. From A^⊤ = A, we have AX^⊤ + X^⊤A = (AX + XA)^⊤ = B^⊤ = B, which means that X^⊤ is also a solution of the Sylvester equation. Since the solution is unique, this implies X^⊤ = X. Also, from the positive definiteness of A and AX + XA = B, we have X + A^{-1}XA = A^{-1}B. Taking the trace and using tr(A^{-1}XA) = tr(X), we obtain tr(X) = tr(A^{-1}B)/2.

Theorem 1. For the elliptically symmetric deformation model (18), the Wasserstein score functions are quadratic in x. Specifically, the Wasserstein score function for µ_i is

S^W_{µ_i}(x) = x_i − µ_i,

and the Wasserstein score function for Λ_ij is

S^W_{Λ_ij}(x) = (1/2) x^⊤ A x + b^⊤ x + c,

where A is the unique solution of the Sylvester equation Λ²A + AΛ² = −Λ e_i e_j^⊤ − e_i e_j^⊤ Λ, b = −Aµ, and c is determined by the condition E_θ[S^W_{Λ_ij}] = 0.
Proof. We show that the above S^W's satisfy the Poisson equation (11) directly.
First, we consider the mean parameter µ_i. From (18),

∂p(x, θ)/∂µ_i = −∂p(x, θ)/∂x_i.

Also, from Lemma 1,

∇_x · (p(x, θ) ∇_x (x_i − µ_i)) = ∂p(x, θ)/∂x_i.

Therefore,

∇_x · (p(x, θ) ∇_x (x_i − µ_i)) = −∂p(x, θ)/∂µ_i,

and E_θ[x_i − µ_i] = 0. Thus, the Wasserstein score function for the mean parameter µ_i is

S^W_{µ_i}(x) = x_i − µ_i.

Next, we consider the deformation parameter Λ_ij. Since, with r = ‖Λ(x − µ)‖,

∂r/∂Λ_ij = (x − µ)^⊤ L (x − µ)/(2r),

we have

∂p(x, θ)/∂Λ_ij = (Λ^{-1})_ij p(x, θ) + |Λ| g′(r) (x − µ)^⊤ L (x − µ)/(2r),

where L = Λ e_i e_j^⊤ + e_i e_j^⊤ Λ. Let

S(x) = (1/2) x^⊤ A x + b^⊤ x + c,

where A is the unique solution of the Sylvester equation Λ²A + AΛ² = −L and b = −Aµ.
From Lemma 1 and Lemma 2,

∇_x · (p(x, θ) ∇_x S(x)) = ∇_x p(x, θ) · A(x − µ) + p(x, θ) tr(A),   tr(A) = −(Λ^{-1})_ij.

Also,

∇_x p(x, θ) · A(x − µ) = |Λ| g′(r) (x − µ)^⊤ Λ² A (x − µ)/r = −|Λ| g′(r) (x − µ)^⊤ L (x − µ)/(2r),

where we used ∇_x p = |Λ| g′(r) Λ²(x − µ)/r and Λ²A + AΛ² = −L. Therefore,

∇_x · (p(x, θ) ∇_x S(x)) = −∂p(x, θ)/∂Λ_ij,

which means that S(x), with c chosen so that E_θ[S] = 0, is the Wasserstein score function for Λ_ij.

Now, we consider the Wasserstein estimator θ̂_W defined as the zero of the Wasserstein score function [34].
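Lemma 2 can be checked numerically with SciPy's Sylvester solver (the matrices below are arbitrary test data; `solve_sylvester(A, A, B)` solves AX + XA = B):

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(4)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)            # symmetric positive definite
B = rng.normal(size=(4, 4))
B = B + B.T                            # symmetric

X = solve_sylvester(A, A, B)           # unique solution of A X + X A = B

assert np.allclose(A @ X + X @ A, B)   # solves the equation
assert np.allclose(X, X.T)             # symmetric, as Lemma 2 claims
assert np.isclose(np.trace(X), np.trace(np.linalg.inv(A) @ B) / 2)
print("Lemma 2 verified on a random instance")
```

The trace identity tr(X) = tr(A^{-1}B)/2 is what delivers the log-determinant term of the score in the proof above.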
Corollary 1. Suppose that we have n independent observations x_1, …, x_n from the elliptically symmetric deformation model (18), where n ≥ d. Then, the Wasserstein estimator θ̂_W = (µ̂_W, Λ̂_W) is the second-order moment estimator given by

µ̂_W = (1/n) Σ_{k=1}^n x_k,   Λ̂_W = ( (1/n) Σ_{k=1}^n (x_k − µ̂_W)(x_k − µ̂_W)^⊤ )^{-1/2},

irrespective of the waveform f(z) = g(‖z‖).
Proof. From Theorem 1, the Wasserstein estimator is the solution of

(1/n) Σ_{k=1}^n ((x_k)_i − µ_i) = 0   (19)

for i = 1, …, d, and

(1/n) Σ_{k=1}^n ( (1/2) x_k^⊤ A x_k + b^⊤ x_k + c ) = 0   (20)

for i, j = 1, …, d, where A is the unique solution of the Sylvester equation Λ²A + AΛ² = −Λe_ie_j^⊤ − e_ie_j^⊤Λ and b = −Aµ. (Note that the Wasserstein score function is a function from R^d to R and does not depend on n.) From (19), the Wasserstein estimator of µ is the empirical mean µ̂_W = (1/n) Σ_k x_k. Also, since (20) implies that the second-order empirical moments match the second-order population moments, and Cov[x] = Λ^{-2} from (17), the Wasserstein estimator of Λ is the inverse square root of the empirical covariance matrix.

Theorem 2. When f is Gaussian, the F-estimator θ̂_F and the W-estimator θ̂_W are identical.
Proof. It is well known that the F-estimator (maximum likelihood estimator) for the Gaussian model is given by the second-order moment estimator.
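Corollary 1 and Theorem 2 can be illustrated numerically: for a Gaussian waveform, the moment (W-) estimator recovers (µ, Λ), and it is by the same formulas the maximum likelihood estimator. A sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
Lam = np.array([[1.5, 0.3],
                [0.3, 1.0]])               # symmetric positive definite

z = rng.standard_normal((1_000_000, 2))    # Gaussian waveform
x = z @ np.linalg.inv(Lam) + mu            # Lam symmetric: inv(Lam).T == inv(Lam)

mu_W = x.mean(axis=0)                      # moment estimator of mu
S = np.cov(x, rowvar=False)                # empirical covariance, estimates Lam^{-2}
w, V = np.linalg.eigh(S)
Lam_W = V @ np.diag(w ** -0.5) @ V.T       # S^{-1/2} via eigendecomposition

print(np.allclose(mu_W, mu, atol=0.02))
print(np.allclose(Lam_W, Lam, atol=0.02))
```

The matrix inverse square root is computed through the eigendecomposition of the (symmetric positive definite) empirical covariance, which is well defined once n ≥ d.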
Note that [23] showed that the L2-Wasserstein divergence between two members of the elliptically symmetric deformation model (18) does not depend on the waveform and is given, using the Bures-Wasserstein divergence between positive definite matrices [13], by

D_W[p(x, θ_1), p(x, θ_2)] = ‖µ_1 − µ_2‖² + B(Λ_1^{-2}, Λ_2^{-2}),

B(Σ_1, Σ_2) = tr Σ_1 + tr Σ_2 − 2 tr (Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}.

It is an interesting future problem to derive the Wasserstein score function and the Wasserstein estimator for general affine deformation models.
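A direct implementation of the Bures-Wasserstein divergence B(Σ_1, Σ_2) = tr Σ_1 + tr Σ_2 − 2 tr(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} (the matrices below are test data; in one dimension it reduces to (s_1 − s_2)² for standard deviations s_1, s_2):

```python
import numpy as np

def bures_wasserstein_sq(S1, S2):
    """Squared Bures-Wasserstein distance between SPD matrices S1, S2."""
    w, V = np.linalg.eigh(S1)
    S1h = V @ np.diag(np.sqrt(w)) @ V.T            # S1^{1/2}
    M = S1h @ S2 @ S1h                             # symmetric PSD
    wm, Vm = np.linalg.eigh(M)
    cross = Vm @ np.diag(np.sqrt(np.clip(wm, 0, None))) @ Vm.T  # M^{1/2}
    return np.trace(S1) + np.trace(S2) - 2 * np.trace(cross)

S = np.array([[2.0, 0.4], [0.4, 1.0]])
print(bures_wasserstein_sq(S, S))                          # 0: distance to itself

a, b = np.array([[4.0]]), np.array([[9.0]])
print(bures_wasserstein_sq(a, b))                          # (2 - 3)^2 = 1
```

Since the formula involves only µ and Λ^{-2}, the waveform independence of the divergence on (18) is explicit.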
Regarding the geometric structure of the elliptically symmetric deformation model (18), we obtain the following. See Figure 1.
Theorem 3. When f is spherically symmetric, the model M_f is orthogonal to F_S at the origin µ = 0, Λ = I of M_f with respect to the Wasserstein metric.
Proof. Let δp(x) be a tangent vector of F_S at the origin. Since all p(x) in F_S satisfy the standard conditions (2), (3), and (4), the integrals of δp(x) against 1, x, and xx^⊤ vanish, so δp(x) is L2-orthogonal to any quadratic function of x. Noting that the Wasserstein inner product of δp and a tangent vector of M_f reduces, by integration by parts, to the L2 pairing of δp with the corresponding W-score, and that the W-score functions of M_f are quadratic functions of x by Theorem 1, it follows that δp(x) is orthogonal to the basis functions of the tangent space of M_f with respect to the Wasserstein metric.
Here, we discuss the relation between the current Wasserstein estimator θ̂_W and the estimator θ̂_Wp in (16), defined as the projection of the empirical distribution with respect to the Wasserstein distance. For the one-dimensional location-scale model, [4] studied the estimator θ̂_Wp in (16) by using the order statistics x_(i). This estimator minimizes the Wasserstein distance between the empirical distribution and the model. Here, we show that this estimator θ̂_Wp = (µ̂_Wp, σ̂_Wp) is asymptotically equivalent to the Wasserstein estimator θ̂_W, which coincides with the second-order moment estimator by Corollary 1. We assume µ = 0 without loss of generality.

The estimator (16) of the location is the empirical mean (1/n) Σ_i x_i, which coincides with the moment estimator. The estimator (16) of the scale is

σ̂_Wp = Σ_i z_i x_(i) / Σ_i z_i²,

where z_i is the i-th equipartition point of f(z), defined through the cumulative distribution function F of f(z) as the representative point of the i-th of n intervals of equal probability. From µ = 0, we have x_(i) ≈ σ z_i asymptotically [21]. Hence,

Σ_i z_i x_(i) ≈ σ Σ_i z_i²,

which leads to σ̂_Wp ≈ σ. Since σ̂_Wp ≈ σ asymptotically and (1/n) Σ_i z_i² → ∫ z² f(z) dz = 1, the estimator σ̂_Wp agrees with the second-order moment estimator σ̂_W up to higher-order terms. This shows that θ̂_Wp = (µ̂_Wp, σ̂_Wp) asymptotically coincides with the second-order moment estimator θ̂_W = (µ̂_W, σ̂_W).
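The asymptotic agreement can be simulated; the discretization F(z_i) = (i − 1/2)/n of the equipartition points and the least-squares form of σ̂_Wp below are our reading of the order-statistics construction in [4], not formulas taken from this paper:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 20_000
x = np.sort(rng.normal(0.0, 2.0, n))   # order statistics, true sigma = 2

# Equipartition points of the standard waveform f = N(0, 1), using the
# midpoint discretization F(z_i) = (i - 1/2)/n (an assumption).
z = norm.ppf((np.arange(1, n + 1) - 0.5) / n)

# Scale minimizing sum_i (x_(i) - sigma z_i)^2, with mu = 0:
sigma_Wp = np.sum(x * z) / np.sum(z * z)

sigma_W = np.std(x)                    # second-order moment estimator

print(sigma_Wp, sigma_W)               # both close to 2.0
```

With n this large, the two scale estimates agree to within sampling noise, as the asymptotic argument above predicts.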

Wasserstein covariance and robustness
Following [34], we define the Wasserstein covariance (W-covariance) matrix Var^W_θ[θ̂] of an estimator θ̂ as the positive semidefinite matrix given by

(Var^W_θ[θ̂])_ab = E_θ[∇_x θ̂_a(x) · ∇_x θ̂_b(x)],   (21)

where ∇_x θ̂ is assumed to be square integrable. Li and Zhao [34] showed the Wasserstein-Cramér-Rao inequality

Var^W_θ[θ̂] ⪰ (∂_θ E_θ[θ̂]) G_W(θ)^{-1} (∂_θ E_θ[θ̂])^⊤,   (22)

where G_W(θ) = (g^W_ij(θ)) is the Wasserstein information matrix (14). A consistent estimator θ̂ is said to be Wasserstein efficient (W-efficient) if its Wasserstein covariance asymptotically satisfies (22) with equality. We give a proof of the Wasserstein-Cramér-Rao inequality based on the Cauchy-Schwarz inequality in the Appendix.

We show that the Wasserstein covariance of an estimator can be viewed as a measure of robustness against additive noise. Suppose that X ∼ p(x, θ) and we estimate θ from the noisy observation X̃ = X + Z, where Z is independent of X, E[Z] = 0, and Var[Z] = σ²I with σ² sufficiently small. The probability density of X̃ is given by p̃(x, θ) = p(x, θ) * q(x), where * denotes convolution and q is the probability density of the noise Z. Namely, the noise changes the waveform from p to p̃. Generally, an estimator degrades when noise is added. Here, we quantify the robustness of an estimator against noise by how much its variance increases due to noise. Namely, we focus on Var_θ[θ̂(X + Z)] − Var_θ[θ̂(X)], which, to leading order, does not depend on the exact distribution of Z but only on its first two moments. If this quantity is small, the estimator is not much affected by noise contamination, which can be viewed as robustness. This quantity is closely related to the Wasserstein covariance as follows.
Theorem 4. The Wasserstein covariance satisfies

(Var_θ[θ̂(X + Z)] − Var_θ[θ̂(X)])_{ab} = σ² ( (Var^W_θ[θ̂])_{ab} + (1/2) Cov_θ[θ̂_a, ∆θ̂_b] + (1/2) Cov_θ[∆θ̂_a, θ̂_b] ) + o(σ²),

where ∆ is the Laplacian. In particular, if ∆θ̂_a(X) is constant for every a (e.g., θ̂ is quadratic in x), then

Var_θ[θ̂(X + Z)] − Var_θ[θ̂(X)] = σ² Var^W_θ[θ̂] + o(σ²).

Proof. By Taylor expansion, for sufficiently small z,

θ̂_a(x + z) = θ̂_a(x) + z · ∇_x θ̂_a(x) + (1/2) z^⊤ (∇²_x θ̂_a(x)) z + o(‖z‖²).

Taking the covariance with respect to X and Z and using E[Z] = 0, Var[Z] = σ²I yields the claim; the covariance terms vanish when θ̂ is quadratic in x.
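Theorem 4 can be checked by Monte Carlo for a quadratic estimator, where the covariance term vanishes; the standard-Gaussian choice of X and the noise level below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000_000
X = rng.standard_normal(n)

# Quadratic statistic theta_hat(x) = x^2, so grad = 2x, Laplacian = 2
# (constant), and Theorem 4 predicts
#   Var[theta_hat(X + Z)] - Var[theta_hat(X)] ≈ sigma^2 * E[(2X)^2].
sigma = 0.05
Z = rng.normal(0.0, sigma, n)

increase = np.var((X + Z) ** 2) - np.var(X ** 2)
predicted = sigma ** 2 * np.mean((2 * X) ** 2)   # sigma^2 * W-covariance term

print(increase, predicted)   # both near 4 * sigma^2 = 0.01
```

The W-covariance E[‖∇θ̂‖²] thus directly measures how much observation noise inflates the variance of the statistic.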
For the elliptically symmetric deformation model (18), from Theorem 4 and Corollary 1, the Wasserstein covariance quantifies the robustness of the Wasserstein estimator, and the Wasserstein-Cramér-Rao inequality gives its limit. It is an interesting future problem to investigate when the Wasserstein estimator attains Wasserstein efficiency. Note that the Fisher efficiency (in finite samples), which is defined by the usual Cramér-Rao inequality, is attained by the maximum likelihood estimator if and only if the estimand is the expectation parameter of an exponential family.

Conclusion
Statistical inference based on the likelihood principle has enjoyed great success, and information geometry has played an important role in it. However, the Wasserstein divergence gives another viewpoint, which is based on the geometric structure of the sample space X. There are many applications of Wasserstein geometry, not only to the transportation problem but also to computer vision, image analysis, and recently deep learning in AI.
We studied Wasserstein statistics using the framework of [34], proving that the Wasserstein covariance quantifies robustness against the convolutional waveform deformation due to observation noise. We further studied the W-statistics of the affine deformation model. We showed the F-efficiency and W-efficiency of the estimators θ̂_F and θ̂_W. We elucidated how the waveform f contributes to the efficiencies. The Gaussian distribution gives the only waveform for which the F-estimator and W-estimator coincide.
Estimation of a covariance matrix, especially in high dimensions under special structure (e.g., low-rankness, sparsity), has been an important problem in statistics. It is an interesting future problem to investigate whether Wasserstein geometry is helpful in covariance estimation.
Beyond the elliptically symmetric deformation model, it is in general difficult to derive the Wasserstein score, which corresponds to the infinitesimal optimal transport. It is an interesting future problem to explore other statistical models for which the Wasserstein score is obtained in closed form. It would also be useful to develop approximation techniques for the Wasserstein score.
The present paper is only a first step toward constructing general Wasserstein statistics. In future work, we need to treat more general statistical models. We also need to extend our approach to statistical theories of hypothesis testing, pattern classification, clustering, and many other statistical problems based on Wasserstein geometry.

A Proof of the Wasserstein-Cramér-Rao inequality
For random vectors U and V, E[‖U + tV‖²] ≥ 0 for every t. Thus, by considering the discriminant of this quadratic function of t, we obtain the Cauchy-Schwarz inequality

(E[U^⊤V])² ≤ E[‖U‖²] E[‖V‖²].   (24)

Take U = a_i ∇_x θ̂_i(x) and V = c_j ∇_x S^W_j(x, θ) for arbitrary vectors a and c. From the definition of the Wasserstein covariance (21),

E[‖U‖²] = a_i a_j E[(∇_x θ̂_i)^⊤(∇_x θ̂_j)] = a^⊤ Var^W_θ[θ̂] a.   (25)

Integrating by parts and using the Poisson equation (11),

E[U^⊤V] = a_i c_j E[(∇_x θ̂_i)^⊤ ∇_x S^W_j] = a_i c_j ∂_j E_θ[θ̂_i] = a^⊤ K c,   (26)

where K = ∂_θ E_θ[θ̂]. From the definition of the Wasserstein information matrix (14),

E[‖V‖²] = c_i c_j g^W_ij(θ) = c^⊤ G_W(θ) c.   (27)

Substituting (25), (26) and (27) into (24), we obtain

(a^⊤ K c)² ≤ (a^⊤ Var^W_θ[θ̂] a)(c^⊤ G_W(θ) c).

By putting c = G_W(θ)^{-1} K^⊤ a, it leads to

a^⊤ K G_W(θ)^{-1} K^⊤ a ≤ a^⊤ Var^W_θ[θ̂] a

for every a, which is equal to the Wasserstein-Cramér-Rao inequality (22).
Here F = F_S × Θ/ ∼, where ∼ is the equivalence relation of equality in distribution. For a general f, M_f has a cone structure parameterized by (µ, D, U, O), where Λ = UDO and D is a diagonal matrix with diagonal elements d_i > 0. Thus, D can be identified with a vector in the open positive quadrant R^d_+ of R^d, which has the cone structure. Since µ ∈ R^d and U, O ∈ O(d), we have the decomposition

Θ ≅ R^d × R^d_+ × O(d) × O(d).