Barycenters and a law of large numbers in Gromov hyperbolic spaces

We investigate barycenters of probability measures on Gromov hyperbolic spaces, toward development of convex optimization in this class of metric spaces. We establish a contraction property (the Wasserstein distance between probability measures provides an upper bound of the distance between their barycenters), a deterministic approximation of barycenters of uniform distributions on finite points, and a kind of law of large numbers. These generalize the corresponding results on CAT(0)-spaces, up to additional terms depending on the hyperbolicity constant.


Introduction
This article is a continuation of [27], in which we studied discrete-time gradient flows for geodesically convex functions on (geodesic, proper) Gromov hyperbolic spaces. The theory of gradient flows for convex functions is of fundamental importance in analysis, geometry and optimization theory, and has been well investigated in some classes of "Riemannian" metric spaces including CAT(0)-spaces (nonpositively curved metric spaces in the sense of triangle comparison; we refer to [4]). For "non-Riemannian" metric spaces such as normed spaces and Finsler manifolds, however, much less is known and, in fact, there is a large gap between properties of gradient flows in Riemannian and Finsler manifolds (see [27,30] for further discussions).
Intending to develop optimization theory in possibly non-Riemannian spaces, in [27] we studied discrete-time gradient flows for geodesically convex functions on Gromov hyperbolic spaces and showed some contraction estimates. Gromov hyperbolic spaces are metric spaces that are negatively curved on large scales, and it is known that some non-Riemannian Finsler manifolds can be Gromov hyperbolic (see Example 2.2). The class of geodesically convex functions seems, however, restrictive when one has in mind the local flexibility of the Gromov hyperbolicity condition. It is therefore desirable to generalize the theory of gradient flows to a wider class of "convex functions" (we refer to [27, §3.4] for related discussions). As an initial step toward such a generalization, in this article we study the (squared) distance function on a Gromov hyperbolic space, which should be included in the class of generalized convex functions (in view of Lemmas 3.2, 4.2).
Given a probability measure $\mu$ on a metric space $(X, d)$ with finite second moment, its barycenter is defined as a minimizer of the function $x \mapsto \int_X d^2(x, z)\,\mu(dz)$. If the (squared) distance function is sufficiently convex, then barycenters enjoy a number of fine properties. Specifically, in a CAT(0)-space $(X, d)$, for which $d^2$ is strictly convex by definition, every $\mu$ admits a unique barycenter $\beta_\mu \in X$ and we have the contraction property $d(\beta_\mu, \beta_\nu) \le W_1(\mu, \nu)$ in terms of the $L^1$-Wasserstein distance $W_1$ (see [32]). Moreover, a kind of law of large numbers providing the almost sure convergence to barycenters by recursive applications of the proximal (resolvent) operator was established in [32] (we refer to [28,38] for some generalizations).
A metric space $(X, d)$ is said to be Gromov hyperbolic (attributed to [16]) if it is $\delta$-hyperbolic for some $\delta \ge 0$ in the sense that
$$(x|z)_p \ge \min\{(x|y)_p, (y|z)_p\} - \delta$$
holds for all $p, x, y, z \in X$, where
$$(y|z)_x := \frac{1}{2}\{d(x, y) + d(x, z) - d(y, z)\}$$
is the Gromov product. This is a large-scale notion of negative curvature and hence, on the one hand, it is natural to expect some variants of the aforementioned results in CAT(0)-spaces. On the other hand, since the Gromov hyperbolicity provides no local control (up to the hyperbolicity constant $\delta$), one cannot expect estimates as sharp as in the case of CAT(0)-spaces. Accordingly, our results will have additional terms (compared with the case of CAT(0)-spaces) that tend to 0 as $\delta \to 0$.
For a probability measure $\mu$ on a (geodesic) $\delta$-hyperbolic space $(X, d)$, a barycenter is not unique in general, but the barycenters live in a bounded set (whose diameter tends to 0 as $\delta \to 0$; see Proposition 3.3). For this reason, we introduce the set
$$B(\mu, \varepsilon) := \Big\{ x \in X \,\Big|\, \int_X d^2(x, z)\,\mu(dz) \le \inf_{y \in X} \int_X d^2(y, z)\,\mu(dz) + \varepsilon \Big\}$$
for $\varepsilon \ge 0$, and call it a barycentric set (here we consider only probability measures of finite second moment for simplicity). Then, we show a contraction property that bounds $d(x, y)$ from above in terms of the Wasserstein distance between $\mu$ and $\nu$, up to additional terms, for $x \in B(\mu, \varepsilon)$ and $y \in B(\nu, \varepsilon)$; see Theorem 4.5 for the precise statement.
How to find (or approximate) a barycenter of a given probability measure $\mu$ is a fundamental problem. In this respect, we show the following law of large numbers (see Theorem 6.1 for the precise statement): Given a sequence $(Z_i)_{i \ge 1}$ of independent, identically distributed random variables with distribution $\mu$ and an arbitrary initial point $S_0 \in X$, we consider a sequence $(S_k)_{k \ge 0}$ recursively chosen as $S_{k+1} = \gamma((2\tau + 1)^{-1})$ for a minimal geodesic $\gamma : [0, 1] \to X$ from $Z_{k+1}$ to $S_k$. Then, for any $\varepsilon > 0$, we obtain an estimate relating $S_{k_0}$ to $p$ for some $k_0 \le C/(\varepsilon \sqrt{\delta})$ by taking $\tau$ proportional to $\sqrt{\delta}$, where $p \in B(\mu, 0)$ is a barycenter of $\mu$. We remark that the construction of $S_{k+1}$ from $S_k$ is written by the proximal operator as $S_{k+1} \in J^{f_k}_\tau(S_k)$ for the squared distance function $f_k := d^2(Z_{k+1}, \cdot)$, and that such an operation is meaningful only for large $\tau$ compared with $\delta$ due to the local flexibility of $\delta$-hyperbolic spaces. By a similar analysis, we prove in Theorem 5.2 a deterministic approximation of barycenters of uniform distributions on finite points, as a generalization of [23] in CAT(0)-spaces (we refer to [2,3,17,28] for some related results).
We briefly mention related results on hyperbolic groups (discrete groups whose Cayley graphs are Gromov hyperbolic). Laws of large numbers can be formulated in terms of the behavior of the distance function $d(g_1 g_2 \cdots g_k(x_0), x_0)$ for a sequence $(g_i)_{i \ge 1}$ of independent, identically distributed random variables taking values in the group of isometric transformations acting on a metric space $(X, d)$ and a base point $x_0 \in X$. Indeed, Kingman's subadditive ergodic theorem [21,22] ensures that the limit $\lim_{k \to \infty} d(g_1 g_2 \cdots g_k(x_0), x_0)/k$ exists almost surely and is constant almost everywhere. We refer to [19] and the references therein for more details as well as refined results including the case of hyperbolic groups. In this context, central limit theorems in hyperbolic groups were also established in [5,7].
Our law of large numbers (Theorem 6.1) associated with a probability measure $\mu$ on a Gromov hyperbolic space $X$ is concerned with a more general setting and provides a direct approximation of barycenters. In this generality, it is difficult even to formulate central limit theorems.
This article is organized as follows. In Section 2, we review the basics of Gromov hyperbolic spaces and some facts on barycenters in CAT(0)-spaces. In Section 3, we introduce and analyze barycentric sets for probability measures on Gromov hyperbolic spaces. Then we discuss the Wasserstein contraction property, a deterministic approximation, and a law of large numbers in Sections 4, 5, and 6, respectively.

Preliminaries
We review the basics of Gromov hyperbolic spaces, as well as some facts on barycenters in CAT(0)-spaces related to our results. For $a, b \in \mathbb{R}$, we set $a \wedge b := \min\{a, b\}$ and $a \vee b := \max\{a, b\}$.
Let $(X, d)$ be a metric space. For three points $x, y, z \in X$, we define the Gromov product $(y|z)_x$ by
$$(y|z)_x := \frac{1}{2}\{d(x, y) + d(x, z) - d(y, z)\}.$$
Observe from the triangle inequality that $0 \le (y|z)_x \le d(x, y) \wedge d(x, z)$. In the Euclidean plane $\mathbb{R}^2$, $(y|z)_x$ is the distance from $x$ to the points at which the inscribed circle of the triangle $\triangle xyz$ touches the sides $xy$ and $xz$. If $x, y, z$ are the endpoints of a tripod, then $(y|z)_x$ coincides with the distance from $x$ to the branching point.
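As a small numerical sketch of the two geometric interpretations above, the following Python snippet evaluates the Gromov product using the standard formula $(y|z)_x = \{d(x,y) + d(x,z) - d(y,z)\}/2$. The helper names `gromov_product`, `euclid` and `tripod`, and the encoding of a tripod point as a pair (leg index, distance from the branching point), are assumptions made only for this illustration.

```python
import math

def gromov_product(d, x, y, z):
    """Gromov product (y|z)_x = (d(x,y) + d(x,z) - d(y,z)) / 2 for a metric d."""
    return 0.5 * (d(x, y) + d(x, z) - d(y, z))

def euclid(a, b):
    """Euclidean distance in the plane."""
    return math.dist(a, b)

def tripod(a, b):
    """Distance on a tripod; a point is (leg index, distance from the branching point)."""
    (leg_a, r_a), (leg_b, r_b) = a, b
    return abs(r_a - r_b) if leg_a == leg_b else r_a + r_b

# Right triangle with legs 3 and 4 and the right angle at x:
# the incircle touches the two sides through x at distance 1 from x.
x, y, z = (0.0, 0.0), (3.0, 0.0), (0.0, 4.0)
gp_plane = gromov_product(euclid, x, y, z)

# Endpoints of a tripod with legs of lengths 2, 3 and 5:
# the Gromov product based at x is the length of the leg containing x.
x_t, y_t, z_t = (0, 2.0), (1, 3.0), (2, 5.0)
gp_tripod = gromov_product(tripod, x_t, y_t, z_t)
```

In both cases the value stays between 0 and the smaller of the two distances from the base point, in line with the triangle inequality.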
Definition 2.1 (Gromov hyperbolic spaces). A metric space $(X, d)$ is said to be $\delta$-hyperbolic for $\delta \ge 0$ if
$$(x|z)_p \ge (x|y)_p \wedge (y|z)_p - \delta \qquad (2.1)$$
holds for all $p, x, y, z \in X$. We say that $(X, d)$ is Gromov hyperbolic if it is $\delta$-hyperbolic for some $\delta \ge 0$.
Since the Gromov product does not exceed the diameter $\operatorname{diam}(X) := \sup_{x, y \in X} d(x, y)$, if $\operatorname{diam}(X) \le \delta$, then $(X, d)$ is $\delta$-hyperbolic. This also means that the local structure of $X$ (up to size $\delta$) is not influential in the $\delta$-hyperbolicity. Another fact worth mentioning is that, if (2.1) holds for some $p \in X$ and all $x, y, z \in X$, then $(X, d)$ is $2\delta$-hyperbolic (see [16, Corollary 1.1.B]).
The Gromov hyperbolicity can be regarded as a large-scale notion of negative (sectional) curvature. We recall some fundamental examples (see also [16, §1]).
(b) An important difference between the class of CAT(−1)-spaces and that of Gromov hyperbolic spaces is that the latter admits non-Riemannian Finsler manifolds. For instance, Hilbert geometry on a bounded convex domain in the Euclidean space is Gromov hyperbolic under mild convexity and smoothness conditions (see [20], [26, §6.5]).
(c) The definition (2.1) makes sense for discrete spaces. In fact, the Gromov hyperbolicity has found rich applications in group theory (a discrete group whose Cayley graph satisfies the Gromov hyperbolicity is called a hyperbolic group; we refer to [9,16], [10, Part III]). In the sequel, however, we do not consider discrete spaces.
(d) Assume that a metric space $(X, d_X)$ admits a map $\varphi : T \to X$ from a tree $(T, d_T)$ such that $d_X(\varphi(a), \varphi(b)) = d_T(a, b)$ for all $a, b \in T$ and the $\delta$-neighborhood $B(\varphi(T), \delta)$ of $\varphi(T)$ covers $X$. Then, since $(T, d_T)$ is 0-hyperbolic, we can easily see that $(X, d_X)$ is $6\delta$-hyperbolic.
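For finite metric spaces, the four-point condition in the definition of $\delta$-hyperbolicity, namely $(x|z)_p \ge \min\{(x|y)_p, (y|z)_p\} - \delta$, can be checked by brute force. The following Python sketch computes the smallest admissible $\delta$ for a distance matrix; the function names and the two test metrics (a path graph, which is a tree, and a cycle graph) are assumptions chosen for illustration and do not appear in the article.

```python
from itertools import product

def gromov_product(D, p, x, y):
    """Gromov product (x|y)_p computed from a distance matrix D."""
    return 0.5 * (D[p][x] + D[p][y] - D[x][y])

def hyperbolicity_constant(D):
    """Smallest delta with (x|z)_p >= min((x|y)_p, (y|z)_p) - delta for all p, x, y, z."""
    n = len(D)
    delta = 0.0
    for p, x, y, z in product(range(n), repeat=4):
        lower = min(gromov_product(D, p, x, y), gromov_product(D, p, y, z))
        delta = max(delta, lower - gromov_product(D, p, x, z))
    return delta

def path_metric(n):
    """Graph distance on a path with n vertices (a tree)."""
    return [[abs(i - j) for j in range(n)] for i in range(n)]

def cycle_metric(n):
    """Graph distance on a cycle with n vertices."""
    return [[min(abs(i - j), n - abs(i - j)) for j in range(n)] for i in range(n)]

delta_path = hyperbolicity_constant(path_metric(6))
delta_cycle = hyperbolicity_constant(cycle_metric(4))
```

The path returns 0, as expected for a tree, while the cycle returns a positive value bounded by its diameter, in line with the observation that any space of diameter at most $\delta$ is $\delta$-hyperbolic.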

CAT(0)-spaces
A geodesic space $(X, d)$ is called a CAT(0)-space if, for any $x, y, z \in X$ and any minimal geodesic $\gamma : [0, 1] \to X$ from $x$ to $y$,
$$d^2(z, \gamma(t)) \le (1 - t)\, d^2(z, x) + t\, d^2(z, y) - (1 - t) t\, d^2(x, y) \qquad (2.2)$$
holds for all $t \in (0, 1)$. A complete, simply connected Riemannian manifold is a CAT(0)-space if and only if its sectional curvature is nonpositive everywhere. Moreover, there are a number of rich classes of non-smooth CAT(0)-spaces such as Euclidean buildings, trees, phylogenetic tree spaces, and the orthoscheme complexes of modular lattices (see [4,6,12,18]).
The CAT(0)-inequality (2.2) can be regarded as the uniform strict convexity of the squared distance function $d^2(z, \cdot)$, and such a convexity is known to be quite useful to study barycenters. In fact, as we mentioned in the introduction, every probability measure $\mu$ on $X$ with finite second (or first) moment admits a unique barycenter $\beta_\mu \in X$, and the contraction property $d(\beta_\mu, \beta_\nu) \le W_1(\mu, \nu)$ holds. Moreover, a law of large numbers was established in [32], followed by many variants and generalizations [3,17,23,28,38] (see also [15,24] for related works on a different notion of barycenter).
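To make the CAT(0)-inequality concrete, the following Python sketch tests it on a tripod (a simple tree, hence a CAT(0)-space) in the standard form $d^2(z, \gamma(t)) \le (1-t)\,d^2(z, x) + t\,d^2(z, y) - (1-t)t\,d^2(x, y)$. The encoding of points as (leg index, distance from the branching point), the helper names, and the random sampling are assumptions for illustration only.

```python
import random

def dist(a, b):
    """Distance on a tripod; a point is (leg index, distance from the branching point)."""
    (leg_a, r_a), (leg_b, r_b) = a, b
    return abs(r_a - r_b) if leg_a == leg_b else r_a + r_b

def geodesic(x, y, t):
    """Point gamma(t) on the unique geodesic from x to y, t in [0, 1]."""
    (leg_x, r_x), (leg_y, r_y) = x, y
    if leg_x == leg_y:
        return (leg_x, r_x + t * (r_y - r_x))
    s = t * (r_x + r_y)  # distance travelled from x
    return (leg_x, r_x - s) if s <= r_x else (leg_y, s - r_x)

def cat0_defect(x, y, z, t):
    """Left-hand side minus right-hand side of the CAT(0)-inequality."""
    w = geodesic(x, y, t)
    rhs = (1 - t) * dist(z, x) ** 2 + t * dist(z, y) ** 2 - (1 - t) * t * dist(x, y) ** 2
    return dist(z, w) ** 2 - rhs

rng = random.Random(0)

def random_point():
    return (rng.randrange(3), rng.uniform(0.0, 10.0))

# The defect should never be positive (up to rounding) on a tree.
worst = max(
    cat0_defect(random_point(), random_point(), random_point(), rng.random())
    for _ in range(10000)
)
```

On a tree the defect is nonpositive for every sample, up to floating-point rounding. In a general $\delta$-hyperbolic space one should expect it to be controlled only up to an extra term depending on $\delta$.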

Barycentric sets
Henceforth, throughout the article, let $(X, d)$ be a geodesic $\delta$-hyperbolic space. We first recall the Wasserstein distance between probability measures (we refer to [35] for further reading).
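Denote by $\mathcal{P}(X)$ the set of Borel probability measures on $X$ and, for $p \in [1, \infty)$, by $\mathcal{P}_p(X) \subset \mathcal{P}(X)$ the subset of those $\mu$ with finite $p$-th moment, namely $\int_X d^p(x_0, x)\,\mu(dx) < \infty$ for some (and hence any) $x_0 \in X$. For $\mu, \nu \in \mathcal{P}_p(X)$, the $L^p$-Wasserstein distance (or the Kantorovich distance) between $\mu$ and $\nu$ is defined by
$$W_p(\mu, \nu) := \inf_\pi \Big( \int_{X \times X} d^p(x, y)\,\pi(dx\,dy) \Big)^{1/p},$$
where $\pi$ runs over all couplings of $(\mu, \nu)$ (namely probability measures on $X \times X$ with marginals $\mu$ and $\nu$). A coupling $\pi$ attaining the above infimum is called an $L^p$-optimal coupling of $(\mu, \nu)$. Observe that
$$W_p(\mu, \delta_x) = \Big( \int_X d^p(x, z)\,\mu(dz) \Big)^{1/p}$$
for all $x \in X$ and $\mu \in \mathcal{P}_p(X)$, where $\delta_x$ denotes the Dirac mass at $x$. Note that $\mathcal{P}_2(X) \subset \mathcal{P}_1(X)$ by the Hölder (or Cauchy-Schwarz) inequality.
Following [32], we will consider barycenters of probability measures not only in $\mathcal{P}_2(X)$ but also in $\mathcal{P}_1(X)$. Fix an arbitrary point $x_0 \in X$. For $\mu \in \mathcal{P}_1(X)$, we define
$$V_{x_0}(\mu) := \inf_{x \in X} \int_X \{d^2(x, z) - d^2(x_0, z)\}\,\mu(dz). \qquad (3.1)$$
We remark that the integral above is well-defined since $|d^2(x, z) - d^2(x_0, z)| \le d(x, x_0)\{d(x, z) + d(x_0, z)\}$ by the triangle inequality. In a complete CAT(0)-space, every $\mu \in \mathcal{P}_1(X)$ admits a unique point $x \in X$ attaining the infimum in (3.1). Such a point $x$ is independent of the choice of $x_0$ and called the barycenter of (or the center of mass for) $\mu$ (see [32, Proposition 4.3], and [36] for the case of CAT(1)-spaces of small radii). More precisely, what we consider is an $L^2$-barycenter involving the squared distance. We refer to [1,37] for related works on $L^p$-barycenters.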
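In a $\delta$-hyperbolic space, however, we do not have a unique barycenter. For this reason, we introduce the following set:
$$B(\mu, \varepsilon) := \Big\{ x \in X \,\Big|\, \int_X \{d^2(x, z) - d^2(x_0, z)\}\,\mu(dz) \le V_{x_0}(\mu) + \varepsilon \Big\} \qquad (3.2)$$
for $\mu \in \mathcal{P}_1(X)$ and $\varepsilon \ge 0$. We shall call $B(\mu, \varepsilon)$ a barycentric set. We remark that $B(\mu, \varepsilon)$ is independent of the choice of $x_0$. Note also that $B(\mu, 0)$ may be empty (unless $X$ is proper), while $B(\mu, \varepsilon) \neq \emptyset$ for any $\varepsilon > 0$.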
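To get a feeling for barycentric sets, consider the simplest CAT(0) example, the real line, where the barycenter of a uniform distribution on finitely many points is their mean. There the set of points whose cost $\int d^2(x, z)\,\mu(dz)$ exceeds the infimum by at most $\varepsilon$ is the interval of radius $\sqrt{\varepsilon}$ around the mean. The Python sketch below checks this; the function names and the sample data are assumptions for illustration.

```python
import math

def cost(points, x):
    """Integral of d^2(x, z) against the uniform distribution on `points` (real line)."""
    return sum((x - z) ** 2 for z in points) / len(points)

def barycentric_interval(points, eps):
    """Barycentric set on the real line: cost(x) <= min cost + eps iff |x - mean| <= sqrt(eps)."""
    mean = sum(points) / len(points)
    radius = math.sqrt(eps)
    return mean - radius, mean + radius

points = [0.0, 1.0, 5.0]
eps = 0.25
mean = sum(points) / len(points)
lo, hi = barycentric_interval(points, eps)

# At the endpoints the cost exceeds the minimum by exactly eps.
excess_lo = cost(points, lo) - cost(points, mean)
excess_hi = cost(points, hi) - cost(points, mean)
```

In this flat example the diameter of the barycentric set is $2\sqrt{\varepsilon}$ and shrinks to a single point as $\varepsilon \to 0$. In a $\delta$-hyperbolic space one cannot expect this, since the local structure up to scale $\delta$ is uncontrolled.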
Our goal in this section is to estimate the diameter of $B(\mu, \varepsilon)$ in terms of $\varepsilon$ and $\delta$. To this end, we first generalize the CAT(0)-inequality (2.2) to $\delta$-hyperbolic spaces, with an inevitable additional term depending on $\delta$.
Lemma 3.1 (CAT(0) + δ). For any $x, y, z \in X$ and any midpoint $w$ between $x$ and $y$, we have
Proof. Since $w$ is a midpoint of $x$ and $y$ (namely $d(x, w) = d(w, y) = d(x, y)/2$), we have $(x|y)_w = 0$. Then the $\delta$-hyperbolicity (2.1) implies Hence, Combining these, we obtain as desired.
We also present the corresponding inequality for general intermediate points between $x$ and $y$.
Lemma 3.2 (General intermediate points). For any $x, y, z \in X$ and any minimal geodesic $\gamma : [0, 1] \to X$ from $x$ to $y$, we have for all $t \in (0, 1)$.
Proof. Put $w = \gamma(t)$, then we have $(x|y)_w = 0$ again. Thus, it follows from the $\delta$-hyperbolicity that and hence Now, we claim that holds. In fact, the inequality can be rearranged as which holds true by the triangle inequality. We can similarly show thereby we obtain (3.5). Therefore, we deduce that
Note that, in the case of $\delta = 0$, (3.4) boils down to the CAT(0)-inequality (2.2). For $\delta = 0$, moreover, one can infer (3.4) from (3.3) by the standard subdivision argument (see, for example, (ii) ⇒ (iii) of [4, Theorem 1.3.3]). For $\delta > 0$, however, iterating subdivisions makes the additional term (depending on $\delta$) diverge, which is why we gave a direct argument to prove Lemma 3.2. We also remark that (3.4) is not meaningful for $t$ close to 0 or 1, since then the triangle inequality could give a better estimate.
We are ready to estimate the diameter of barycentric sets $B(\mu, \varepsilon)$ defined in (3.2). Recall that $x_0 \in X$ is an arbitrary point fixed at the beginning of this section.
Proposition 3.3 (Diameter of $B(\mu, \varepsilon)$). For any $\mu \in \mathcal{P}_1(X)$, $x, y \in X$ and any midpoint $w$ between $x$ and $y$, we have In particular, for any $x \in B(\mu, \varepsilon_1)$ and $y \in B(\mu, \varepsilon_2)$ with $\varepsilon_1, \varepsilon_2 \ge 0$, we have
Proof. The first assertion is shown by integrating (3.3) in $z$ with respect to $\mu$. Then, when $x \in B(\mu, \varepsilon_1)$ and $y \in B(\mu, \varepsilon_2)$, we find Therefore, we obtain This completes the proof.
The second assertion (3.6) (with $\varepsilon_2 = 0$) can be regarded as a generalization of the variance inequality (see [32, Proposition 4.4]; the reverse inequality under lower curvature bounds can be found in [25,33]). Note that, when we are interested in the case of
Remark 3.4 (When $\mu \in \mathcal{P}_2(X)$). In the case of $\mu \in \mathcal{P}_2(X)$, instead of $V_{x_0}(\mu)$ as in (3.1), we can directly consider $\inf_{x \in X} \int_X d^2(x, z)\,\mu(dz)$, which is called the variance of $\mu$. One can simply write down the first assertion of Proposition 3.3 as

Wasserstein contraction property
We next consider a contraction property in terms of the Wasserstein distance, which in the case of (complete) CAT(0)-spaces means that $d(\beta_\mu, \beta_\nu) \le W_1(\mu, \nu)$ holds for $\mu, \nu \in \mathcal{P}_1(X)$, where $\beta_\mu$ and $\beta_\nu$ are the (unique) barycenters of $\mu$ and $\nu$, respectively (see [32, Theorem 6.3]). In the current setting of $\delta$-hyperbolic spaces, we shall estimate the distance between points in the barycentric sets.
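The CAT(0) contraction property, that the distance between barycenters is at most the $L^1$-Wasserstein distance $W_1$, can be observed numerically on the real line, where barycenters are means. The Python sketch below relies on the standard fact that, for two uniform distributions on equally many real points, $W_1$ is attained by matching the sorted samples. The function names and the random data are assumptions for illustration.

```python
import random

def wasserstein_1(xs, ys):
    """W_1 between uniform distributions on equally many real points (sorted matching)."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def barycenter(xs):
    """Barycenter of the uniform distribution on real points, i.e. the mean."""
    return sum(xs) / len(xs)

rng = random.Random(1)
pairs = []
for _ in range(1000):
    xs = [rng.uniform(-5.0, 5.0) for _ in range(8)]
    ys = [rng.uniform(-5.0, 5.0) for _ in range(8)]
    pairs.append((abs(barycenter(xs) - barycenter(ys)), wasserstein_1(xs, ys)))

# Samples for which the barycenter distance exceeds W_1 (beyond rounding error).
violations = [(gap, w1) for gap, w1 in pairs if gap > w1 + 1e-12]
```

No violations occur in this flat model. In the $\delta$-hyperbolic setting the analogous estimate is only expected to hold for points of barycentric sets and with additional terms.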
Remark 4.1 (Busemann NPC versus CAT(0)). In his celebrated paper [11], Busemann showed that a complete, simply connected Riemannian manifold has the Busemann NPC if and only if its sectional curvature is nonpositive everywhere. Nonetheless, in general, the Busemann NPC is a weaker condition than the CAT(0)-property. On the one hand, it is easily seen that CAT(0)-spaces have the Busemann NPC. On the other hand, every strictly convex Banach space has the Busemann NPC, whereas it is a CAT(0)-space if and only if it is a Hilbert space.
Proof. Let $\xi : [0, 1] \to X$ and $\zeta : [0, 1] \to X$ be minimal geodesics from $x$ to $q$ and from $p$ to $q$, respectively. Denote by $\triangle xpq$ the triangle formed by (the image of) $\gamma$, $\xi$ and $\zeta$. We can construct a map $T : \triangle xpq \to Y$ to a tripod $(Y, d_Y)$ with three edges of lengths $(p|q)_x$, $(x|q)_p$ and $(x|p)_q$ from the branching point such that the restrictions $T|_\gamma$, $T|_\xi$ and $T|_\zeta$ are isometric (see Figure 1, where $T(a) = T(b) = T(c)$ is the branching point $O$ of the tripod). We set $\bar{x} := T(x)$, $\bar{p} := T(p)$ and $\bar{q} := T(q)$. Then $T$ is 1-Lipschitz (non-expanding) and
We remark that, similarly to (3.4) in Lemma 3.2, the inequality (4.1) does not give a meaningful estimate for $t$ close to 0 or 1. One can use a 1-Lipschitz map to a tripod also for showing a variant of the CAT(0)-inequality, whereas then the additional term seems to be necessarily dependent on the size of a triangle (as in (3.4)), since we take the square of the distance.
Now, let d be the The following subset will play a role: If $(X, d)$ is a CAT(0)-space, then $A$ is a (geodesically) convex set thanks to the Busemann NPC.
We will make use of the nearest point projection to $A$ to prove the Wasserstein contraction property. We remark that (X × X × R, d) is not a Gromov hyperbolic space (it is a CAT(0)-space if so is (X, d)), so the contraction property of projection maps as in [16, Lemma 7.3.D] does not directly apply.
The next lemma is an essential step for our contraction result.
let (x, ỹ, r) ∈ A be a point attaining d((x, y, r), A) given as in the proof of Lemma 4.3. Then, for any (p, q, d(p, q)) ∈ A, we have where we set Observe that, if $\delta = 0$, then the assumption (4.2) is void and the assertion shows that d((x, ỹ, r), (p, q, d(p, q))) is strictly smaller than d((x, y, r), (p, q, d(p, q))).

Deterministic approximations of barycenters
We next discuss an approximation of barycenters of uniform distributions on finite points by the gradient flow method. Given a function $f : X \to \mathbb{R}$, we define the proximal operator (also called the resolvent operator) by
$$J^f_\tau(x) := \operatorname*{arg\,min}_{y \in X} \Big\{ f(y) + \frac{d^2(x, y)}{2\tau} \Big\}$$
for $x \in X$ and $\tau > 0$ (that is, $y \in J^f_\tau(x)$ if $y$ attains the above minimum). Roughly speaking, an element in $J^f_\tau(x)$ can be regarded as an approximation of a point on the gradient curve of $f$ at time $\tau$ from $x$. Thus the iteration of the proximal operator can be regarded as a discrete-time gradient flow for $f$, which is expected to lead us to a minimizer of $f$. We refer to [27] for some contraction properties of discrete-time gradient flows for geodesically convex functions on Gromov hyperbolic spaces.
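For the squared distance function $f = d^2(z, \cdot)$, a proximal step can be made explicit in the model case of the real line. The Python sketch below assumes the usual form of the proximal operator, minimizing $f(y) + d^2(x, y)/(2\tau)$ over $y$. Under that assumption the minimizer is the point $\gamma((2\tau + 1)^{-1})$ on the geodesic $\gamma$ from $z$ to $x$, and it satisfies $d(x, y) = 2\tau\, d(z, y)$. The function names and the grid search used as a cross-check are illustrative assumptions.

```python
def prox_sq_dist(x, z, tau):
    """Proximal step for f = d^2(z, .) on the real line.

    Minimizes (y - z)^2 + (y - x)^2 / (2 tau); the minimizer is gamma(1 / (2 tau + 1))
    for the geodesic gamma(s) = z + s (x - z) from z to x.
    """
    s = 1.0 / (2.0 * tau + 1.0)
    return z + s * (x - z)

def objective(y, x, z, tau):
    """The function minimized by the proximal operator."""
    return (y - z) ** 2 + (y - x) ** 2 / (2.0 * tau)

x, z, tau = 4.0, 0.0, 0.5
y = prox_sq_dist(x, z, tau)

# Cross-check the closed form against a brute-force search on a grid of step 0.001.
grid = [i / 1000.0 for i in range(-2000, 6001)]
y_grid = min(grid, key=lambda v: objective(v, x, z, tau))
```

With $\tau = 1/2$ the step lands on the midpoint between $z$ and $x$. Smaller $\tau$ keeps $y$ closer to $x$, which matches the interpretation of $\tau$ as a time step of the gradient flow.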
If $(z|w)_y \le \delta$, then we find Combining this with the triangle inequality $|d(w, y) - d(x, y)| \le d(w, x)$ and $d(x, y) = 2\tau\, d(z, y)$, we obtain In the other case of $(x|w)_y \le \delta$, together with the triangle inequality, we have This completes the proof.
In the CAT(0)-case (see [3,28]), we have without the additional term $\Theta\tau\delta$. Note that $\Theta\tau\delta$ in (5.2) tends to 0 not only as $\delta \to 0$ but also as $\tau \to 0$. This is the natural behavior since $y$ tends to $x$ as $\tau \to 0$.
By using Lemma 5.1, we establish the following deterministic approximation of barycenters by the iterative application of the proximal operator in the spirit of [17,23] (the so-called no dice theorem). We also refer to [3, Theorem 3.4] and [28, Theorem 5.5] for generalizations to the sum of convex functions.
Theorem 5.2 (Deterministic approximation). Fix a finite sequence $(z_i)_{i=1}^n$ in $X$, put $f_i := d^2(z_i, \cdot)$, and let $p \in X$ be a minimizer of the function $f := \sum_{i=1}^n f_i$. Given $\tau > 0$ and an arbitrary initial point $y_0 \in X$, we recursively choose $y_{kn+i} \in J^{f_i}_\tau(y_{kn+i-1})$ for $k \ge 0$, $1 \le i \le n$, and assume that $p$, $(z_i)$ and $(y_{kn+i})$ are all included in a bounded set $\Omega \subset X$. Then, for any $\varepsilon > 0$, there exists some $k_0 < d^2(p, y_0)/(2\tau\varepsilon)$ such that where we set $D_\Omega := \operatorname{diam}(\Omega)$ and Moreover, we have
We remark that, in a CAT(0)-space, we can employ as $\Omega$ a ball including $(z_i)_{i=1}^n$ and $y_0$. This is because balls are convex by the CAT(0)-inequality (or the Busemann NPC). In a $\delta$-hyperbolic space, however, balls are not necessarily convex and it is unclear to the author how to control (the sum of) the additional terms in (3.4) (or (4.1)) during the recursive scheme $y_{kn+i} \in J^{f_i}_\tau(y_{kn+i-1})$.
Proof. It follows from Lemma 5.1 that Summing up for $1 \le i \le n$, we have (5.5) Concerning the third term in the right-hand side, we infer from the triangle inequality that Then, by the choice of $y_{kn+l} \in J^{f_l}_\tau(y_{kn+l-1})$ with $f_l = d^2(z_l, \cdot)$, we find Hence, we obtain Plugging this into (5.5) yields It immediately follows from (5.7) that necessarily holds for some $k_0 < d^2(p, y_0)/(2\tau\varepsilon)$. Indeed, otherwise we have with $k$ the minimum integer not smaller than $d^2(p, y_0)/(2\tau\varepsilon)$, a contradiction. Finally, the second assertion (5.4) is a consequence of Proposition 3.3. Putting $\mu = n^{-1} \sum_{i=1}^n \delta_{z_i}$, we deduce from (5.3) that Hence, (3.6) (with $\varepsilon_1 = 0$) yields
Thanks to (5.3), up to $d^2(p, y_0)/(2\tau\varepsilon)$ iterations, $f(y_{kn})$ nearly achieves $\min_X f$ and $y_{kn}$ passes close to $p$ as in (5.4). We remark that the sublinear rate $\varepsilon < d^2(p, y_0)/(2\tau k_0)$ (following from $k_0 < d^2(p, y_0)/(2\tau\varepsilon)$) could be compared with [28, Proposition 5.7].
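The recursive scheme $y_{kn+i} \in J^{f_i}_\tau(y_{kn+i-1})$ with $f_i = d^2(z_i, \cdot)$ can be tried out on the real line, where the minimizer of $f = \sum_i f_i$ is the mean of the points $z_i$. The Python sketch below is only a flat model case with $\delta = 0$, not the general statement. It assumes the closed-form proximal step $y \mapsto z + (y - z)/(2\tau + 1)$, and the function names and sample data are hypothetical.

```python
def prox_sq_dist(y, z, tau):
    """Proximal step for f = d^2(z, .) on the real line, applied at y."""
    return z + (y - z) / (2.0 * tau + 1.0)

def cyclic_proximal(points, y0, tau, cycles):
    """Apply the proximal steps for f_1, ..., f_n in order, `cycles` times."""
    y = y0
    for _ in range(cycles):
        for z in points:
            y = prox_sq_dist(y, z, tau)
    return y

points = [0.0, 1.0, 2.0, 9.0]
p = sum(points) / len(points)  # minimizer of f = sum_i d^2(z_i, .) on the line

# With a fixed step size the iterates settle near, but not exactly at, the minimizer.
y_coarse = cyclic_proximal(points, y0=-10.0, tau=0.01, cycles=500)
y_fine = cyclic_proximal(points, y0=-10.0, tau=0.001, cycles=5000)
err_coarse = abs(y_coarse - p)
err_fine = abs(y_fine - p)
```

In this example the residual offset shrinks as $\tau$ decreases, at the price of more cycles. This is the same trade-off between the step size $\tau$ and the number of iterations that appears in the bound $k_0 < d^2(p, y_0)/(2\tau\varepsilon)$.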

A law of large numbers
We next consider a law of large numbers in our setting. Our formulation follows Sturm's [32, Theorem 4.7] for CAT(0)-spaces. We refer to [28, Theorem 6.7] and [38, Theorem 3] for some generalizations to other (upper and lower) curvature bounds.
Theorem 6.1 (Law of large numbers). Let $(Z_i)_{i \ge 1}$ be a sequence of independent, identically distributed random variables on a probability space taking values in $X$ with distribution $\mu \in \mathcal{P}(X)$, and take $p \in B(\mu, 0)$. Given $\tau > 0$ and an arbitrary initial point $S_0 \in X$, we define a sequence $(S_k)_{k \ge 0}$ in $X$ recursively by Assume that $p$, $\operatorname{supp} \mu$ and $(S_k)_{k \ge 0}$ are all included in a bounded set $\Omega \subset X$. Then, for any $\varepsilon > 0$, we have for some $k_0 < d^2(p, S_0)/(\tau\varepsilon)$, where $D_\Omega := \operatorname{diam}(\Omega)$ and $\Theta_\Omega$ is defined as in Theorem 5.2.
Proof. We can apply a calculation similar to the proof of Theorem 5.2. It follows from Lemma 5.1 that where we used in the latter inequality (recall also (5.6)). Taking the expectations in $Z_{k+1}$ conditioned on $\mathcal{F}_k := \{Z_1, \ldots, Z_k\}$ and applying the variance inequality (3.6) (with $x = p$, $y = S_k$, $\varepsilon_1 = 0$), we obtain Taking the expectations once again, we arrive at
An advantage of the above recursive choice of $(S_k)_{k \ge 0}$ is that $S_{k+1}$ is concretely given as a point on a geodesic between $S_k$ and $Z_{k+1}$ without any knowledge about the construction of barycenters of probability measures (though minimal geodesics are not unique in Gromov hyperbolic spaces).
Employing "empirical means" instead of $(S_k)_{k \ge 0}$, one can also show the following version of the law of large numbers (see [32, Proposition 6.6]).
Proposition 6.2 (Empirical law of large numbers). Let $(X, d)$ be complete and separable, $(Z_i)_{i \ge 1}$ be a sequence of independent, identically distributed random variables on a probability space taking values in $X$ with distribution $\mu \in \mathcal{P}(X)$ of bounded support, and take $p \in B(\mu, 0)$. Then,
We finally discuss two possible directions of further research.
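The recursive construction of $(S_k)_{k \ge 0}$ is easy to simulate in the flat model case of the real line. The Python sketch below assumes the update $S_{k+1} = \gamma((2\tau + 1)^{-1})$ for the geodesic $\gamma$ from $Z_{k+1}$ to $S_k$, as described in the introduction, with $\mu$ the uniform distribution on $[-1, 1]$, whose barycenter is $p = 0$. The function names, the choice of $\mu$, the parameters and the seed are assumptions for illustration and are not taken from the article.

```python
import random

def lln_step(s, z, tau):
    """S_{k+1} = gamma(1 / (2 tau + 1)) for the geodesic gamma from z to s on the real line."""
    return z + (s - z) / (2.0 * tau + 1.0)

def run(tau, steps, s0, rng):
    """One trajectory S_0, ..., S_steps driven by i.i.d. samples Z_k; returns the last point."""
    s = s0
    for _ in range(steps):
        z = rng.uniform(-1.0, 1.0)  # mu = uniform distribution on [-1, 1], barycenter p = 0
        s = lln_step(s, z, tau)
    return s

rng = random.Random(0)
tau, steps, runs = 0.005, 2000, 200

# Monte Carlo estimate of the expected squared distance from S_steps to the barycenter p = 0.
mean_sq = sum(run(tau, steps, 5.0, rng) ** 2 for _ in range(runs)) / runs
```

With a fixed $\tau$ the iterates keep fluctuating around the barycenter, so the estimate stays small but positive. Decreasing $\tau$ reduces this residual while requiring more steps to forget the initial point $S_0$.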
Remark 6.3 (Further problems). (a) We have studied in [27] discrete-time gradient flows for geodesically convex functions (namely, functions that are convex along geodesics). However, despite the negative curvature nature, distance functions on Gromov hyperbolic spaces are not geodesically convex due to the local flexibility. Therefore, it is an intriguing problem to introduce an appropriate notion of "roughly convex functions" on Gromov hyperbolic spaces, including the distance function $d(z, \cdot)$ or its square.
(b) Compared with the contraction estimates in [27] directly akin to trees, the results in the present article are generalizations from CAT(0)-spaces to $\delta$-hyperbolic spaces. Therefore, on the one hand, it may be possible to improve our estimates (for example, the order of $\delta$) via an analysis closer to the case of trees. Specifically, it is desirable to reduce the dependence on D 2 , D Ω in Theorems 4.5, 5.2, 6.1. On the other hand, there seems to be room for further generalizations to metric spaces satisfying the CAT(0)-inequality with small perturbations in some way (cf. Lemma 3.2).

Figure 1: A 1-Lipschitz map $T : \triangle xpq \to Y$ from a triangle to a tripod $(Y, d_Y)$ with three edges of lengths $(p|q)_x$, $(x|q)_p$ and $(x|p)_q$ from the branching point $O = T(a) = T(b) = T(c)$, such that the restrictions $T|_\gamma$, $T|_\xi$ and $T|_\zeta$ are isometric.