Convergence Rates for the Generalized Fréchet Mean via the Quadruple Inequality

For sets $\mathcal Q$ and $\mathcal Y$, the generalized Fréchet mean $m \in \mathcal Q$ of a random variable $Y$, which has values in $\mathcal Y$, is any minimizer of $q\mapsto \mathbb E[\mathfrak c(q,Y)]$, where $\mathfrak c \colon \mathcal Q \times \mathcal Y \to \mathbb R$ is a cost function. There are few restrictions on $\mathcal Q$ and $\mathcal Y$. In particular, $\mathcal Q$ can be a non-Euclidean metric space. We provide convergence rates for the empirical generalized Fréchet mean. Conditions for rates in probability and rates in expectation are given. In contrast to previous results on Fréchet means, we do not require a finite diameter of $\mathcal Q$ or $\mathcal Y$. Instead, we assume an inequality, which we call the quadruple inequality. It generalizes an otherwise common Lipschitz condition on the cost function. This quadruple inequality is known to hold in Hadamard spaces. We show that it also holds in a suitable way for certain powers of a Hadamard metric.


Introduction
Let $\mathcal Q$, $\mathcal Y$ be sets, $Y$ a $\mathcal Y$-valued random variable, and $c \colon \mathcal Y \times \mathcal Q \to \mathbb R$ a cost function. Every element $m$ of the set $\arg\min_{q\in\mathcal Q} \mathbb E[c(Y, q)]$ is called a generalized Fréchet mean or $c$-Fréchet mean. Given independent copies $Y_1, \dots, Y_n$ of $Y$, natural estimators of the generalized Fréchet mean are elements $m_n$ of the set $\arg\min_{q\in\mathcal Q} \frac1n \sum_{i=1}^n c(Y_i, q)$. Our goal is to find suitable conditions for establishing convergence rates for such plug-in estimators.
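To make the plug-in estimator concrete, here is a minimal sketch that minimizes the empirical objective over a finite candidate set; the grid search, the function names, and the Gaussian toy data are our illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def empirical_frechet_mean(ys, candidates, cost):
    """Plug-in estimator: return the candidate q minimizing
    F_n(q) = (1/n) * sum_i cost(Y_i, q)."""
    costs = [np.mean([cost(y, q) for y in ys]) for q in candidates]
    return candidates[int(np.argmin(costs))]

# Classical Fréchet mean setting: Q = Y = R and c = d^2, where the
# minimizer of q -> E[(Y - q)^2] is the expectation of Y.
ys = np.random.default_rng(0).normal(loc=1.0, size=500)
grid = np.linspace(-2.0, 4.0, 1201)
print(empirical_frechet_mean(ys, grid, lambda y, q: (y - q) ** 2))
```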
The described setting generalizes the usual setting for Fréchet means, where $\mathcal Q = \mathcal Y$ is a metric space with metric $d \colon \mathcal Q \times \mathcal Q \to [0, \infty)$ and $c = d^2$, which has been introduced in [Fré48].
The Fréchet mean has been investigated in many specific settings, often under a different name, e.g., center of mass or barycenter. In the context of Riemannian manifolds, it has been studied -among others -by [BP03]. An asymptotic normality result for generalized Fréchet means on finite dimensional manifolds is shown in [EH19]. For complete metric spaces of nonpositive curvature, called Hadamard spaces, [Stu03] shows how some classical results of probability theory in Euclidean spaces (e.g., strong law of large numbers, Jensen's inequality) can be transferred to the Fréchet mean setting. An algorithm for calculating Fréchet means in Hadamard spaces is described in [Bač14a].
One important application of statistics in Hadamard spaces is the space of phylogenetic trees. A phylogenetic tree represents the genetic relatedness of biological species. The geometry of the space of phylogenetic trees $\mathcal T_m$ with $m$ leaves is studied in [BHV01]. In particular, it is shown that $\mathcal T_m$ is a Hadamard space. There has been a lot of recent interest in statistics on $\mathcal T_m$. E.g., [BLO18] show a central limit theorem for the Fréchet mean in $\mathcal T_m$ and [Nye11] applies principal component analysis in that space.
For general metric spaces [Zie77] shows consistency of the Fréchet mean estimator. This is extended to generalized Fréchet means in [Huc11].
The Fréchet mean estimator is an M-estimator. Thus, we can build upon many classical and deep results from the M-estimation literature, see, e.g., [VW96; Gee00; Tal14]. Using such M-estimation techniques, rates of convergence in probability for Fréchet means in general bounded metric spaces are obtained in [PM19]; in fact the authors consider a more complex regression setting. In [DM19] results on the analysis of variance in metric spaces are shown.
Results on convergence rates in expectation, i.e., bounds on $\mathbb E[d(m, m_n)^2]$, seem to be rare in the literature on the Fréchet mean. Common are convergence rates in probability or exponential concentration. The latter also implies rates in expectation, but under rather strong assumptions. One publication that establishes rates in expectation more directly, for general cost functions in Euclidean spaces, is [BFW17].
The recent article [AGP19] provides nonasymptotic concentration rates in general bounded metric spaces. Its relation to our results will be discussed in the next subsection.

Our Contribution
Our contribution consists of three parts: (a) We introduce a condition, which we call quadruple inequality, that is used to establish convergence rates in probability and expectation for spaces with infinite diameter, see Theorem 1, Theorem 2, and Theorem 4.
(b) We formulate our results in the setting of the generalized Fréchet mean with a cost-function c that is not restricted to being the square of a metric.
(c) We prove a quadruple inequality for exponentiated metrics of Hadamard spaces, Theorem 3. We apply it to obtain rates of convergence for estimators of the Fréchet mean of an exponentiated metric.
[PM19] and [AGP19] show rates of convergence for metric spaces which have a finite diameter (or at least the support of the distribution of observations must be bounded). The proofs in both papers rely on empirical process theory. In particular, they make use of symmetrization and the generic chaining to bound the supremum of an empirical process. But where [AGP19] use that bound to be able to apply Talagrand's inequality [Bou02], [PM19] employ a peeling device (also called slicing; see, e.g., [Gee00]) to obtain rates. As a consequence, [AGP19] achieve stronger results (nonasymptotic exponential concentration instead of $O_P$-statements), but they rely more heavily on the boundedness of the metric. As our goal is to obtain results for spaces with infinite diameter, our proof technique is closer to [PM19], i.e., we also apply a peeling device.
A law of large numbers, such that the estimator of the Fréchet mean converges in probability to the true value, implies that the estimator eventually is in a subset with finite diameter. Thus, for asymptotic rates in probability as in [PM19], it is not very restrictive to assume a finite diameter. Our motivation to directly deal with infinite diameter comes from our interest in nonasymptotic results and in rates in expectation (asymptotic or nonasymptotic).
Like [PM19] and [AGP19], we use the generic chaining. Therefore, we have entropy bounds as conditions of our theorems. These entropy bounds can be stated by requiring a bound on the covering numbers $N(Q', d, r)$, where $(\mathcal Q, d)$ is a metric space, $Q' \subseteq \mathcal Q$, and $r > 0$. To be more precise, in a metric space $(\mathcal Q, d)$, we require $\log N(B_\delta(m), d, r) \le C \left(\frac{\delta}{r}\right)^D$ for some constants $C, D > 0$ and all $0 < r < \delta$, which is the same assumption as in [AGP19]. We note that this requirement could be weakened by using the optimal bound on Rademacher (or Bernoulli) processes [BL14] at the cost of a more complicated and less comprehensible condition.
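To see how such a covering bound enters, consider the standard entropy-integral estimate from generic chaining (a sketch under the bound above; the constants are illustrative and not those of our theorems): for $D < 2$,
\[
\gamma_2\big(B_\delta(m), d\big) \le C' \int_0^{\delta} \sqrt{\log N\big(B_\delta(m), d, r\big)}\, \mathrm{d}r \le C' \sqrt{C} \int_0^{\delta} \Big(\frac{\delta}{r}\Big)^{\frac{D}{2}} \mathrm{d}r = \frac{C' \sqrt{C}}{1 - D/2}\, \delta ,
\]
so the chaining functional of a $\delta$-ball scales linearly in $\delta$.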
In the classical Fréchet mean case, where (Q, d) is a metric space and the cost function is c = d 2 , the empirical process that has to be bounded consists of functions of the form y → d(y, q) 2 for q ∈ Q. To apply some classical empirical process results, one requires a Lipschitz condition on these functions. In [PM19] and [AGP19] this Lipschitz condition is fulfilled by for all y, q, p ∈ Q. Thus, a finite diameter is required. We show, that one can instead require that d(y, q) 2 − d(y, p) 2 − d(z, q) 2 + d(z, p) 2 ≤ 2d(y, z)d(q, p) holds for all y, z, q, p ∈ Q and then bound the supremum of the empirical process even if diam(Q) = ∞. Equation (2) is a special instance of what we call quadruple inequality. Roughly speaking, the transition from Lipschitz to quadruple condition removes certain squared terms and the right hand side by adding and subtracting further squared terms on the left hand side. This is related to the idea of defining the Fréchet mean as minimizer of q → E[d(Y, q) 2 − d(Y, o) 2 ] for an arbitrary fixed point o ∈ Q instead of q → E[d(Y, q) 2 ]. Then, for existence of the Fréchet mean, only a first moment condition on Y is required instead of a second moment condition, see [Stu03,Acknowledgement to Lutz Mattner].
The inequality (2) does not hold in every metric space. But it characterizes Hadamard spaces among geodesic metric spaces, see [BN08]. In Hadamard spaces, (2) is known as Reshetnyak's quadruple inequality [Stu03] or quadrilateral inequality [BN08] and can be interpreted as generalization of the Cauchy-Schwartz inequality to metric spaces [BN08]. Note that our results are not restricted to geodesic metric spaces.
In (subsets of) Hadamard spaces $(\mathcal Q, d)$, we can not only utilize the quadruple inequality (2) with the squared metric $d^2$. We show that for $d^a$ with $a \in [1, 2]$, we also obtain a version of the quadruple inequality, namely $d(y, q)^a - d(y, p)^a - d(z, q)^a + d(z, p)^a \le 4a2^{-a}\, d(y, z)^{a-1}\, d(q, p)$ for all $y, z, q, p \in \mathcal Q$, see Theorem 3. We show that the constant $4a2^{-a}$ is optimal. Similar to equation (1), one can easily show, using the mean value theorem, a corresponding Lipschitz-type bound for the cost $d^a$ with $a > 0$ and $q, p, y \in \mathcal Q$, where $(\mathcal Q, d)$ is an arbitrary metric space. The proof of equation (3) is much more complicated, see Appendix G. We state our convergence rate results in a general way, where observations live in a space $\mathcal Y$ and a cost function $c \colon \mathcal Y \times \mathcal Q \to \mathbb R$ is minimized over $\mathcal Q$. The quadruple inequality then reads $c(y, q) - c(y, p) - c(z, q) + c(z, p) \le a(y, z)\, b(q, p)$ for all $y, z \in \mathcal Y$ and $q, p \in \mathcal Q$ and an arbitrary function $a \colon \mathcal Y \times \mathcal Y \to [0, \infty)$ and a pseudometric $b \colon \mathcal Q \times \mathcal Q \to [0, \infty)$. This general formulation includes, among others, arbitrary bounded metric spaces, Hadamard spaces (including Euclidean and non-Euclidean spaces) with an exponentiated metric $d^a$, $a \in [1, 2]$, and regression settings with $\mathcal Q \ne \mathcal Y$, where observations $(x, y) \in \mathcal Y$ are described by regression functions $(x \mapsto q(x)) \in \mathcal Q$.
Furthermore, some trivial statements in Appendix B show that the quadruple inequality is stable under many operations such as taking subsets, limits, or product spaces.
We prove, via a peeling device, nonasymptotic rates of convergence in probability, Theorem 1. We do not achieve exponential concentration as [AGP19], but our results can be applied in cases where the cost function is not bounded by a finite constant, i.e., in metric spaces with infinite diameter. Furthermore, we show two ways of obtaining rates in expectation: one, nonasymptotic, under the assumption of a stronger version of the quadruple inequality, Theorem 2; the other, asymptotic, with a stricter entropy condition, Theorem 4.
Aside from the application in Hadamard spaces (including the use of the power inequality, Theorem 3), we illustrate our results in different toy examples: Euclidean spaces and infinite dimensional Hilbert spaces. In (convex subsets of) Hilbert spaces the Fréchet mean is equal to the expectation. Thus, these examples are interesting as a benchmark, because we can compare results from our general Fréchet mean approach to exact results. In two additional examples, we apply our results to nonconvex subsets of Hilbert spaces and to Hadamard spaces.

Outline
We start by presenting the convergence rates results of Theorem 1 (rates in probability) and Theorem 2 (rates in expectation) in the abstract setting in section 2. The different versions of the quadruple inequality are discussed in section 3, including the power inequality, Theorem 3. This discussion concludes with the statement of Theorem 4 (alternative route to rates in expectation). In section 4, we apply the abstract results in different settings: Euclidean spaces, infinite dimensional Hilbert spaces, nonconvex sets, and Hadamard spaces.

Abstract Results
In this section, we present rates of convergence for the Fréchet mean in a very general setting, see section 2.1. Theorem 1 on rates in probability is stated in section 2.2 and Theorem 2 on rates in expectation in section 2.3. The proofs can be found in Appendix A. Some remarks on further extensions are given in section 2.4.

Setting
Here we define an Abstract Setting in which we will state our most general results. This setting of the generalized Fréchet mean is similar to what is used in [Huc11; EH19].
Let $\mathcal Q$ be a set, which is called the descriptor space. Let $(\mathcal Y, \Sigma_{\mathcal Y})$ be a measurable space, which is called the data space. Let $Y$ be a $\mathcal Y$-valued random variable. Let $c \colon \mathcal Y \times \mathcal Q \to \mathbb R$ be a function such that $y \mapsto c(y, q)$ is measurable for every $q \in \mathcal Q$. We call $c$ the cost function.
Define the objective function $F(q) := \mathbb E[c(Y, q)]$. Let $Y_1, \dots, Y_n$ be independent copies of $Y$ and define $F_n(q) := \frac1n \sum_{i=1}^n c(Y_i, q)$. We call $F_n$ the empirical objective function. Let $l \colon \mathcal Q \times \mathcal Q \to [0, \infty)$ be a function such that $l(m, q)$ measures the loss of choosing $q$ given that the true value is $m$.
We want to bound $l(m, m_n)$ for $m \in \arg\min_{q \in \mathcal Q} F(q)$ and $m_n \in \arg\min_{q \in \mathcal Q} F_n(q)$.

Rate of Convergence in Probability
For our result on convergence rates in probability, we make some assumptions, which are listed in the following. We denote the "closed" ball with center $o \in \mathcal Q$ of radius $r > 0$ in the set $\mathcal Q$ with respect to an arbitrary distance function $d$ by $B_r(o, d) := \{q \in \mathcal Q \colon d(o, q) \le r\}$.

Assumptions.

Existence:
We have $\mathbb E[|c(Y, q)|] < \infty$ for all $q \in \mathcal Q$. There are a measurable $m_n \in \arg\min_{q\in\mathcal Q} F_n(q)$ and an $m \in \arg\min_{q\in\mathcal Q} F(q)$.

Growth:
There are constants $\gamma > 0$ and $c_g > 0$ such that $F(q) - F(m) \ge c_g\, l(m, q)^{\gamma}$ for all $q \in \mathcal Q$.
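For orientation, consider the classical case where $\mathcal Q = \mathcal Y$ is (a convex subset of) a Hilbert space, $c = d^2$, and $l = d$; this worked example is our addition. With $m = \mathbb E[Y]$, the bias-variance decomposition gives
\[
F(q) - F(m) = \mathbb E\big[\|Y - q\|^2\big] - \mathbb E\big[\|Y - m\|^2\big] = \|q - m\|^2 ,
\]
so Growth holds with $\gamma = 2$ and $c_g = 1$.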

Weak Quadruple:
There are a measurable function $a \colon \mathcal Y \times \mathcal Y \to [0, \infty)$ and a pseudo-metric $b \colon \mathcal Q \times \mathcal Q \to [0, \infty)$ such that, for all $p, q \in \mathcal Q$, $y, z \in \mathcal Y$, we have $c_{yq} - c_{yp} - c_{zq} + c_{zp} \le a(y, z)\, b(q, p)$, where we use the notation $c_{yq} := c(y, q)$. We call $a$ the data distance and $b$ the descriptor metric.

Moment:
Let $\zeta \ge 1$ and define $M(\zeta) := \mathbb E[a(Y, Y')^{\zeta}]^{1/\zeta}$, where $Y'$ is an independent copy of $Y$. We have $M(\zeta) < \infty$.

Entropy:
There are $\alpha, \beta > 0$ with $\frac{\alpha}{\beta} < \gamma$ such that $\log N(B_\delta(m, b), b, r) \le c_e\, \delta^{\alpha} r^{-\beta}$ for a constant $c_e > 0$ and all $\delta, r > 0$. Here $N(A, b, r)$ is the covering number of $A \subseteq \mathcal Q$ with respect to $b$-balls $B_r(\cdot, b)$ of radius $r$, i.e., the minimal number of such balls needed to cover $A$. Entropy is essentially the same condition as in [AGP19], but written down for the setting of the generalized Fréchet mean instead of the classical Fréchet mean in metric spaces. We briefly discuss the other assumptions before stating the theorem for rates of convergence in probability.
The measurability assumptions can be weakened by using the outer expectation, see [VW96].
In [BDG07], the Growth condition is called margin condition. It is called low noise assumption in [AGP19]. If Growth holds for every distribution of $Y$ and we are in the traditional setting of the (not generalized) Fréchet mean, it implies that the metric space $\mathcal Q$ has nonpositive curvature: Assume that $(\mathcal Q, d)$ is a complete geodesic space [Stu03, Definition 1.1], i.e., every pair of points $y_1, y_2$ has a mid-point $m$, i.e., $y_1m = y_2m = \frac12\, y_1y_2$, where we use the notation $qp := d(q, p)$. Set $\mathcal Y = \mathcal Q$, $c = d^2$, and $l = d$. If $\mathbb P(Y = y_1) = \mathbb P(Y = y_2) = \frac12$ with $y_1, y_2 \in \mathcal Q$, the Fréchet mean $m \in \mathcal Q$ of $Y$ is the mid-point between $y_1$ and $y_2$. If we assume that the growth condition holds for every distribution of $Y$, in particular, for every uniform 2-point distribution, with $c_g = 1$ and $\gamma = 2$, then $\frac12\, y_1q^2 + \frac12\, y_2q^2 - \frac12\, y_1m^2 - \frac12\, y_2m^2 \ge mq^2$. As $m$ is the mid-point between $y_1$ and $y_2$, we obtain $mq^2 \le \frac12\, y_1q^2 + \frac12\, y_2q^2 - \frac14\, y_1y_2^2$. This inequality implies that the space $(\mathcal Q, d)$ has nonpositive curvature [Stu03, Definition 2.1]. Such spaces are called Hadamard spaces. Aside from the Growth condition they also fulfill the quadruple inequality, which we discuss in section 3.2.3. The Weak Quadruple-condition will be discussed in detail in section 3. Among other things, we will show that it holds in a nice way, with $a = 2d$ and $b = d$, in all Hadamard spaces, which include the Euclidean spaces.
The following theorem states rates of convergence for the estimator m n to the true value m measured with respect to the loss function l.
Theorem 1 (Convergence rate in probability). In the Abstract Setting of section 2.1, assume that the following conditions hold: Existence, Growth, Weak Quadruple, Moment, Entropy. Then, for all $t > 0$, the probability that $l(m, m_n)$ exceeds a multiple of the rate $\eta_{\beta,n}$ by a factor $t$ decays polynomially in $t$, where $c > 0$ depends on $\alpha, \beta, \gamma, c_e, c_g, \zeta$.
The proof can be found in Appendix A. Without loss of generality, one can choose $\gamma = 1$ by using the loss $l' = l^{\gamma}$. This is consistent with the result: If Growth and Entropy are fulfilled with $l, \alpha, \beta, \gamma$, then they are also fulfilled with $l' = l^{\gamma}$, $\alpha' = \frac{\alpha}{\gamma}$, $\beta' = \beta$, $\gamma' = 1$, which gives the same result. We keep this redundancy in the parameters of the theorem for convenience.
A more common way of stating rates of convergence in probability is the $O_P$-notation, as in the following corollary. Note that the $O_P$-result is asymptotic and, thus, weaker than the nonasymptotic Theorem 1.

Corollary 1.
In the Abstract Setting of section 2.1, assume that the following conditions hold: Existence, Weak Quadruple, Growth, Moment with $\zeta = 1$, Entropy. Then the corresponding $O_P$-statement holds with $\eta_{\beta,n}$ as in Theorem 1.
It is possible to weaken the assumptions in Corollary 1. In particular, we can restrict the Growth and Entropy conditions to hold only in a neighborhood of $m$ if we already know that $l(m_n, m) \in o_P(1)$. In Theorem 1, the probability of large losses decays polynomially. If the exponent $\zeta(\gamma - \frac{\alpha}{\beta})$ is strictly greater than 1, we can integrate the tail probabilities to obtain a bound on the expectation of the loss.
Corollary 2. Let $\kappa \ge 1$. In the Abstract Setting of section 2.1, assume that the following conditions hold: Existence, Weak Quadruple, Growth, Moment with $\zeta > \kappa(\gamma - \frac{\alpha}{\beta})^{-1}$, and Entropy. The proof can be found in Appendix A. Corollary 2 may require unnecessarily high moments, as $\xi$ needs to be strictly larger than 1. In the next section, we present a more direct approach to rates in expectation that requires weaker moment conditions, at least in some settings.

Rate of Convergence in Expectation
For obtaining rates in expectation directly, we need slightly modified, stronger assumptions.

Strong Quadruple:
Define $\tilde{\mathcal Q} := \mathcal Q \setminus B_0(m, l) = \{q \in \mathcal Q \colon l(m, q) > 0\}$. There are functions $b_m \colon \tilde{\mathcal Q} \times \tilde{\mathcal Q} \to [0, \infty)$ (possibly depending on $m$) and $a \colon \mathcal Y \times \mathcal Y \to [0, \infty)$ with $a$ measurable and $\xi \in (0, \gamma)$ such that a strengthened form of the quadruple inequality holds for all $p, q \in \tilde{\mathcal Q}$, $y, z \in \mathcal Y$. Assume that $b_m$ is a pseudo-metric on $\tilde{\mathcal Q}$. We call $a$ the data distance and $b_m$ the strong quadruple metric at $m$.
Strong Entropy: For later use in the application to Hilbert spaces, section 4.2, we state the entropy condition of Theorem 2 in a more general way than in Theorem 1. To this end, we need to introduce different measures of entropy.
(i) Given a set $Q$, an admissible sequence is an increasing sequence $(\mathcal A_k)_{k \in \mathbb N_0}$ of partitions of $Q$ such that $\mathcal A_0 = \{Q\}$ and $\operatorname{card}(\mathcal A_k) \le 2^{2^k}$ for $k \ge 1$.
By an increasing sequence of partitions we mean that every set of A k+1 is contained in a set of A k . We denote by A k (q) the unique element of A k which contains q ∈ Q.
(ii) Let $(Q, b)$ be a pseudo-metric space. Define $\gamma_2(Q, b) := \inf \sup_{q \in Q} \sum_{k=0}^{\infty} 2^{k/2} \operatorname{diam}(A_k(q), b)$, where the infimum is taken over all admissible sequences in $Q$ and $\operatorname{diam}(A, b) := \sup_{q, p \in A} b(q, p)$ for $A \subseteq Q$.
(iii) Let $(Q, b)$ be a pseudo-metric space and $n \in \mathbb N$. Define the entropy measure $\operatorname{entr}_n(Q, b)$. Items (i) and (ii) are basic definitions from [Tal14]. Item (iii) is just a convenient notation.
Theorem 2 (Convergence rate in expectation). In the Abstract Setting of section 2.1, assume that the following conditions hold: Existence, Growth, Strong Quadruple, Strong Moment. Then we have a bound on the expected loss, where $c > 0$ depends only on $\kappa, \gamma, \xi, c_g$. If additionally Strong Entropy holds, then the bound becomes explicit for $\beta > 1$, and $C > 0$ depends only on $\kappa, \beta, \gamma, \xi, c_g$.
The proof can be found in Appendix A.
As in Theorem 1, the statement contains some redundancy. E.g., by using the loss $l' = l^{\xi}$ we can set $\xi = 1$ without loss of generality. Then the growth exponent and the resulting rate of convergence scale accordingly.

Further Extensions
In general, $M := \arg\min_{q \in \mathcal Q} \mathbb E[c(Y, q)]$ is some subset of $\mathcal Q$. One can also extend the main theorems of this paper to deal with the whole set of Fréchet means and Fréchet mean estimators. To do that, the Growth condition has to be stated as growth of the minimal distance to $M$. Furthermore, some of the statements and assumptions made in the theorems and proofs have to be modified so that they hold uniformly over all $m \in M$. Additionally, one has to think about the right notion of convergence for sets. We found such results hard to read without their significantly increasing insight into the problem, which is why we chose to stick with unique Fréchet means and only remark that an extension to Fréchet mean sets is possible.
One can also consider $\varepsilon$-$\arg\min$-sets, i.e., the sets of elements which minimize a function up to an $\varepsilon > 0$. If one chooses $m_n \in \varepsilon_n$-$\arg\min_{q\in\mathcal Q} F_n(q)$ with $\varepsilon_n \to 0$ fast enough, the convergence rate is the same as for the exact minimizer.
There are a couple of trivial stability results for quadruple inequalities, see Appendix B.
In section 3.1 we compare the quadruple inequality with a more common Lipschitz property. The simplest advantageous applications of the quadruple inequality are in inner product spaces and quasi-inner product spaces, as is discussed in section 3.2. In section 3.3 we state the power inequality, Theorem 3. It allows us to establish quadruple inequalities for exponentiated metrics. We conclude with Theorem 4 in section 3.4, which yields rates of convergence in expectation under the assumption of only a weak quadruple inequality instead of a strong one as in Theorem 2.

Bounded Spaces and Smooth Cost Function
Let $(\mathcal Q, d)$ be a metric space and use the notation $qp = d(q, p)$. For obtaining convergence rates in probability for the Fréchet mean estimator, [PM19] use the Lipschitz bound $c_{yq} - c_{yp} \le 2 \operatorname{diam}(\mathcal Q)\, qp$ for all $q, p, y \in \mathcal Q$. In the proof of Theorem 1, we have replaced this bound by the weak quadruple inequality, i.e., $c_{yq} - c_{yp} - c_{zq} + c_{zp} \le a(y, z)\, b(q, p)$. This generalizes the results by [PM19], as for bounded metric spaces $(\mathcal Q, d)$ and cost function $c = d^2$, the weak quadruple inequality holds with $a(y, z) = 4 \operatorname{diam}(\mathcal Q)$ and $b = d$: we have $c_{yq} - c_{yp} - c_{zq} + c_{zp} = (yq - yp)(yq + yp) - (zq - zp)(zq + zp) \le 4 \operatorname{diam}(\mathcal Q)\, qp$ by the triangle inequality. More generally, if we can show Lipschitz continuity in the second argument of the cost function, i.e., $c_{yq} - c_{yp} \le a(y) b(q, p)$, then the quadruple inequality holds with data distance $a(y) + a(z)$ and descriptor metric $b$. But this might lead to an unnecessarily large bound. We will see in section 3.2.3 that at least for certain metric spaces, we can find a bound via the quadruple inequality that does not involve the diameter of the space and, thus, allows for meaningful results in unbounded spaces.

Inner Product Space
Let $(\mathcal Q, d)$ be a metric space such that $d$ comes from an inner product $\langle \cdot , \cdot \rangle$ on $\mathcal Q$, i.e., $\mathcal Q$ is a subset of an inner product space and $d(y, q)^2 = \langle y - q , y - q \rangle$. Use $\mathcal Y = \mathcal Q$ and the squared metric as cost function, $c = d^2$. Then $c_{yq} - c_{yp} - c_{zq} + c_{zp} = 2\langle z - y , q - p \rangle \le 2\|y - z\|\, \|q - p\|$. Here the Cauchy-Schwarz inequality gives rise to an instance of the weak quadruple inequality. The very general framework that we impose also allows for a more flexible bound: If $\mathcal Q \subseteq H$ is the subset of an infinite dimensional, separable Hilbert space $H$, we can use a weighted Cauchy-Schwarz inequality: Let $s = (s_k)_{k\in\mathbb N} \subseteq (0, \infty)$. Then $\langle z - y , q - p \rangle \le \|z - y\|_{s^{-1}}\, \|q - p\|_{s}$, where $\|x\|_s^2 = \sum_{k=1}^{\infty} s_k^2 x_k^2$ with generalized Fourier coefficients $(x_k)_{k \in \mathbb N}$ with respect to a fixed orthonormal basis of $H$ and $s^{-1} := (s_k^{-1})_{k \in \mathbb N}$.
For the strong quadruple inequality, we set $\xi = 1$, $l(q, p) = \|q - p\|$ and obtain $c_{yq} - c_{yp} - c_{zq} + c_{zp} = 2\langle z - y , q - p \rangle$. Thus, the strong quadruple inequality holds with $a(y, z) = 2\|y - z\|$ and $b_m(q, p) = \left\| \frac{q - m}{\|q - m\|} - \frac{p - m}{\|p - m\|} \right\|$. The analogous result holds for the weighted Cauchy-Schwarz inequality.

Bregman Divergence
Let $\mathcal Q \subseteq \mathbb R^r$ be a closed convex set. Let $\psi \colon \mathcal Q \to \mathbb R$ be a continuously differentiable and strictly convex function. The Bregman divergence $D_\psi \colon \mathcal Q \times \mathcal Q \to [0, \infty)$ associated with $\psi$ for points $y, q \in \mathcal Q$ is defined as $D_\psi(y, q) = \psi(y) - \psi(q) - \langle \nabla\psi(q) , y - q \rangle$. It is the difference between the value of $\psi$ at point $y$ and the value of the first-order Taylor expansion of $\psi$ around point $q$ evaluated at point $y$. It is well known that the minimizer of $q \mapsto \mathbb E[D_\psi(Y, q)]$ is the expectation $\mathbb E[Y]$. The Bregman divergence $c = D_\psi$ fulfills the weak quadruple inequality: $c_{yq} - c_{yp} - c_{zq} + c_{zp} = \langle \nabla\psi(p) - \nabla\psi(q) , y - z \rangle \le \|y - z\|\, \|\nabla\psi(q) - \nabla\psi(p)\|$, i.e., with data distance $a(y, z) = \|y - z\|$ and descriptor metric $b(q, p) = \|\nabla\psi(q) - \nabla\psi(p)\|$. Similarly, we obtain a version of the strong quadruple inequality with $\xi = 1$, $l(q, p) = \|q - p\|$.
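As a quick sanity check (our example, not from the original text): for $\psi(x) = \|x\|^2$ we have $\nabla\psi(q) = 2q$ and
\[
D_\psi(y, q) = \|y\|^2 - \|q\|^2 - \langle 2q , y - q \rangle = \|y - q\|^2 ,
\]
so the Bregman setting recovers the squared Euclidean cost of section 3.2.1, with descriptor metric $\|\nabla\psi(q) - \nabla\psi(p)\| = 2\|q - p\|$ matching the factor 2 of the Cauchy-Schwarz bound there.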

Hadamard Spaces and Quasi-Inner Product
Let $(\mathcal Q, d)$ be a metric space. Use the notation $qp := d(q, p)$. We use the squared metric as the cost function $c(y, q) = d(y, q)^2 = yq^2$. One particularly nice version of the weak quadruple inequality with this cost function is $yq^2 - yp^2 - zq^2 + zp^2 \le 2\, yz\, qp$. Let us call this inequality the nice quadruple inequality. As seen before, it holds for subsets of inner product spaces. It also plays an important role for geodesic metric spaces. In this section, we paraphrase some results of [BN08]. In particular, we state that the nice quadruple inequality characterizes CAT(0)-spaces.
A metric space $(\mathcal Q, d)$ is said to fulfill the NPC-inequality if and only if for all $y_1, y_2 \in \mathcal Q$ there exists a point $m \in \mathcal Q$ such that for all $q \in \mathcal Q$, we have $mq^2 \le \frac12\, y_1q^2 + \frac12\, y_2q^2 - \frac14\, y_1y_2^2$. Then $m$ is the midpoint of $y_1$ and $y_2$.
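For intuition (our addition): in a Hilbert space the midpoint $m = \frac{y_1 + y_2}{2}$ attains the NPC-inequality with equality, since the parallelogram identity gives
\[
\Big\| \frac{y_1 + y_2}{2} - q \Big\|^2 = \frac12 \|y_1 - q\|^2 + \frac12 \|y_2 - q\|^2 - \frac14 \|y_1 - y_2\|^2 .
\]
Nonpositive curvature makes the left hand side only smaller.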
A characterization of CAT(0)-spaces can be found in [Stu03, Section 2]: A metric space is CAT(0) if and only if it fulfills the NPC-inequality.
Another characterization of CAT(0)-spaces by the nice quadruple inequality is given in [BN08, Corollary 3]: A geodesic space is CAT(0) if and only if it fulfills the nice quadruple inequality.
In [BN08], the authors define the quadrilateral cosine for $q, p, y, z \in \mathcal Q$ as $\operatorname{cosq}(\vec{yz}, \vec{qp}) := \frac{yp^2 + zq^2 - yq^2 - zp^2}{2\, yz\, qp}$. Obviously, the statement $\operatorname{cosq}(\vec{yz}, \vec{qp}) \le 1$ for all $q, p, y, z \in \mathcal Q$ is equivalent to the nice quadruple inequality. To further motivate this notation and compare it with inner product spaces, they introduce a quasilinearization of the metric space and a quasi-inner product $\langle \vec{yz} , \vec{qp} \rangle := \frac12 (yp^2 + zq^2 - yq^2 - zp^2)$. Thus, the nice quadruple inequality can be viewed as the Cauchy-Schwarz inequality of the quasi-inner product.

Power Inequality
If the metric space $(\mathcal Q, d)$ fulfills the nice quadruple inequality, i.e., $yq^2 - yp^2 - zq^2 + zp^2 \le 2\, yz\, qp$, where $yq = d(y, q)$, then $(\mathcal Q, d^a)$, $a \in [\frac12, 1]$, also fulfills a weak quadruple inequality with a suitably adapted bound. The implications of this result for the estimators of the corresponding Fréchet means are discussed in section 4.4.2.
According to [DD16], the metric $d^a$ is called power transform metric or snowflake transform metric.
Theorem 3 (Power Inequality). Let $(\mathcal Q, d)$ be a metric space. Use the short notation $qp := d(q, p)$. Let $q, p, y, z \in \mathcal Q$, $a \in [\frac12, 1]$. Assume $yq^2 - yp^2 - zq^2 + zp^2 \le 2\, yz\, qp$ (4). Then $yq^{2a} - yp^{2a} - zq^{2a} + zp^{2a} \le 8a2^{-2a}\, yz^{2a-1}\, qp$ (5). In particular, if the metric space $(\mathcal Q, d)$ fulfills the nice quadruple inequality and $a \in [\frac12, 1]$, then the weak quadruple inequality holds for the cost function $c = d^{2a}$ with data distance $8a2^{-2a}\, d(y, z)^{2a-1}$ and descriptor metric $b = d$. Following the intermediate step Lemma 14 (Appendix G) in the proof of Theorem 3, one can easily show a similar result if the constant on the right hand side of equation (4) is larger than 2. Only the constant $8a2^{-2a}$ on the right hand side of equation (5) changes.
The theorem applies to subsets of Hadamard spaces. But note that $\mathcal Q$ is not required to be geodesic; it can consist of only the points $q, p, y, z$. As a statement purely about metric spaces, it might be of interest outside the realm of statistics.
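The inequality is easy to probe numerically. The following sketch (our illustration; numpy and the helper name are assumptions) samples random quadruples in a Euclidean space, which fulfills the nice quadruple inequality, and checks that the ratio of the two sides of (5) stays below one.

```python
import numpy as np

rng = np.random.default_rng(0)

def worst_power_quadruple_ratio(a, trials=100_000, dim=3):
    """Largest observed ratio LHS/RHS of
    yq^(2a) - yp^(2a) - zq^(2a) + zp^(2a) <= 8a 2^(-2a) yz^(2a-1) qp
    over random quadruples in R^dim; should stay <= 1."""
    worst = -np.inf
    for _ in range(trials):
        y, z, q, p = rng.normal(size=(4, dim))
        yq, yp = np.linalg.norm(y - q), np.linalg.norm(y - p)
        zq, zp = np.linalg.norm(z - q), np.linalg.norm(z - p)
        yz, qp = np.linalg.norm(y - z), np.linalg.norm(q - p)
        lhs = yq**(2*a) - yp**(2*a) - zq**(2*a) + zp**(2*a)
        rhs = 8 * a * 2**(-2*a) * yz**(2*a - 1) * qp
        worst = max(worst, lhs / rhs)
    return worst

for a in (0.5, 0.75, 1.0):
    print(a, worst_power_quadruple_ratio(a))
```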
In Corollary 5 (section 4.4.2) it is used to show rates of convergence for the Fréchet mean estimator of the power transform metric $d^a$. There the asymmetry of the exponents of the factors on the right hand side of (5) is essential for proving the result under weak assumptions.
Unfortunately, the only proof of this statement that the author was able to derive (see Appendix G) is very long and does not give much insight into the problem, as it mostly consists of distinguishing many cases and then using simple calculus. The author is convinced that a more appealing proof is possible.
The maximum of the constant $8a2^{-2a}$ over $a \in [\frac12, 1]$ is attained at $a = \frac{1}{2\ln(2)}$ and equals $\frac{4}{e \ln(2)} \le 2.123$. Thus, the constant factor in the bound is very close to 2, but 2 is not sufficient.
In Appendix E, we show that $8a2^{-2a}$ is the optimal constant, and that we cannot extend Theorem 3 to $a > 1$ or $a < \frac12$. It is not known to the author whether the nice quadruple inequality in $(\mathcal Q, d)$ does or does not imply the nice quadruple inequality in $(\mathcal Q, d^a)$ for $a \in (\frac12, 1)$, i.e., $yq^{2a} - yp^{2a} - zq^{2a} + zp^{2a} \le 2\, yz^a\, qp^a$.

Weak Implies Strong
The weak quadruple inequality is well justified as a condition: Aside from allowing us to establish rates in probability (Theorem 1), it can be interpreted as a form of Cauchy-Schwarz inequality (section 3.2.3), it is fulfilled in a large class of metric spaces (bounded metric spaces, Hadamard spaces, Appendix B), and the power inequality (Theorem 3) yields even more applications with a nice interpretation in statistics (section 4.4.2). The case for the strong quadruple inequality, which we use in Theorem 2 to establish rates in expectation, seems much weaker. Although it can be established in Hilbert spaces, see section 3.2.1, it is not directly clear whether we can have a suitable version for Hadamard spaces or a power inequality.
The next section examines the strong quadruple inequality in Hadamard spaces and concludes with a negative result. Thereafter, we discuss an approach to infer convergence rates in expectation when only assuming the weak quadruple inequality, by showing that a weak quadruple inequality implies certain strong quadruple inequalities. This approach is executed to obtain Theorem 4 for convergence rates in expectation, where the result holds only asymptotically, in contrast to Theorem 2.

Projection Metric
In Euclidean spaces, we can take $b_m(q, p) = \left\| \frac{q - m}{\|q - m\|} - \frac{p - m}{\|p - m\|} \right\|$ as the strong quadruple metric. This pseudo-metric can be written down depending only on the metric (not the norm or vector space operations) as $d^{\mathrm{proj}}_m(q, p) := \sqrt{\frac{qp^2 - (mq - mp)^2}{mq \cdot mp}}$. The metric $d^{\mathrm{proj}}_m(q, p)$ can be defined in any metric space. Unfortunately, it does not yield a strong quadruple inequality in non-Euclidean Hadamard spaces in the same way as in Euclidean spaces. See Appendix D for details.
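To support the purely metric expression above (our computation): in an inner product space, $2\langle q - m , p - m \rangle = mq^2 + mp^2 - qp^2$, hence
\[
\Big\| \frac{q - m}{mq} - \frac{p - m}{mp} \Big\|^2 = 2 - \frac{mq^2 + mp^2 - qp^2}{mq \cdot mp} = \frac{qp^2 - (mq - mp)^2}{mq \cdot mp} ,
\]
so $d^{\mathrm{proj}}_m$ coincides with $b_m$ there.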

Power Metric
To establish rates of convergence in expectation for the $c$-Fréchet mean, given that a weak quadruple inequality holds, we first show that some version of the strong quadruple inequality is implied by the weak one, Lemma 1. Unfortunately, we obtain a strong quadruple distance $b_m$ such that the measure of entropy $\operatorname{entr}(\mathcal Q, b_m)$ might be infinite. To solve this problem, we define an increasing sequence of sets $\mathcal Q_n$ such that $\mathcal Q_n \subseteq \mathcal Q_{n+1}$ and $\bigcup_{n\in\mathbb N} \mathcal Q_n = \mathcal Q$ with distances $b_{m,n}$ such that the strong quadruple inequality is fulfilled on $\mathcal Q_n$ with strong quadruple distance $b_{m,n}$, and $\operatorname{entr}(\mathcal Q_n, b_{m,n})$ is finite and can be suitably controlled in $n$. This allows us to prove an asymptotic result for the rate of convergence in expectation, Theorem 4.
See Appendix C for a proof. We would like to have $\xi$ large, i.e., close to 1, to obtain the same rate of convergence in expectation as in probability. We achieve that by defining sequences $\xi_n \nearrow 1$ and $\mathcal Q_n \nearrow \mathcal Q$, and control the entropy of $\mathcal Q_n$ with respect to $b^{1 - \xi_n}$.
To state the result, we have to modify the Entropy and the Existence condition. Recall the definition of the objective function $F(q) = \mathbb E[c_{Yq}]$ and the empirical objective function $F_n(q) = \frac1n \sum_{i=1}^n c_{Y_i q}$.

Assumptions.
Existence': There are a measurable $m_n^{\mathcal Q_n} \in \arg\min_{q \in \mathcal Q_n} F_n(q)$ and an $m \in \arg\min_{q \in \mathcal Q} F(q)$.

Small Entropy:
There are $\beta, c_e > 0$ such that an entropy bound holds for all $\delta > 0$ large enough. Note that the Small Entropy condition is much stronger than Entropy, which we assumed in Theorem 1. In Euclidean subspaces $\mathcal Q \subseteq \mathbb R^b$, the covering numbers of balls grow only polynomially in $\frac{\delta}{r}$. Thus, Small Entropy is fulfilled in Euclidean spaces.
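For reference, the standard volumetric bound behind the Euclidean claim (our addition): for $0 < r \le \delta$,
\[
N\big(B_\delta(o), \|\cdot\|, r\big) \le \Big(\frac{3\delta}{r}\Big)^b , \qquad\text{hence}\qquad \log N\big(B_\delta(o), \|\cdot\|, r\big) \le b \log\Big(\frac{3\delta}{r}\Big) ,
\]
which grows only logarithmically in $\frac{\delta}{r}$.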
Theorem 4 (Convergence rate in expectation). In the Abstract Setting of section 2.1 with loss $l = b$, where $b$ is a pseudo-metric, and rate parameter $\xi = 1$, assume that the following conditions hold: Existence', Growth with $\gamma > 1$, Weak Quadruple, and Small Entropy. See Appendix A for the proof.

Application of the Abstract Results
We apply the abstract results of Theorems 1 to 4 in this section. We first consider two toy examples, Euclidean spaces, section 4.1, and infinite dimensional Hilbert spaces, section 4.2, to better understand the results and compare them to optimal bounds. Then we discuss two more involved settings: the Fréchet mean for non-convex subsets of Euclidean spaces, section 4.3, and for Hadamard spaces, section 4.4.

Euclidean Spaces
Let $\mathcal Q = \mathcal Y = \mathbb R^b$ with cost $c(y, q) = \|y - q\|^2$ and loss $l(q, p) = \|q - p\|$. The Fréchet mean equals the expectation, $m = \mathbb E[Y]$, and $F(q) - F(m) = \|q - m\|^2$. Thus, the Growth condition is fulfilled with $\gamma = 2$. The space has the strong quadruple inequality at every point with data distance $a(y, z) = 2\|y - z\|$ and strong quadruple metric $b_m$ as in section 3.2.1, and it fulfills Strong Entropy. Theorem 2 then yields a bound on $\mathbb E[\|m_n - m\|^2]$ that carries an additional factor $b$; the constants $C, C' > 0$ are universal. Compare this with the result that one obtains by direct calculations, i.e., $\mathbb E[\|m_n - m\|^2] = n^{-1}\, \mathbb E[\|Y - \mathbb E[Y]\|^2]$. We pay an extra dimension factor $b$ when using the Fréchet mean approach instead of direct calculations. This comes from the use of the Cauchy-Schwarz inequality, which powers the strong quadruple inequality in Euclidean spaces.
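The direct calculation is elementary (our addition): in this setting $m_n$ is the sample mean, so independence gives
\[
\mathbb E\big[\|m_n - m\|^2\big] = \mathbb E\Big[\Big\|\frac1n \sum_{i=1}^n (Y_i - \mathbb E[Y])\Big\|^2\Big] = \frac1n\, \mathbb E\big[\|Y - \mathbb E[Y]\|^2\big] = \frac{\operatorname{tr}(\operatorname{Cov}(Y))}{n} ,
\]
with no dependence on the dimension $b$ beyond the trace of the covariance.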

Hilbert Spaces
Let $H$ be an infinite dimensional Hilbert space and $\mathcal Q = \mathcal Y = H$ with $c = d^2$. As in the Euclidean case, the Fréchet mean $m$ equals the expectation $\mathbb E[Y]$, the Growth condition holds with $\gamma = 2$, and the strong quadruple inequality is fulfilled with $a(y, z) = 2\|y - z\|$ and pseudometric $b_m(p, q) = \left\| \frac{q - m}{\|q - m\|} - \frac{p - m}{\|p - m\|} \right\|$. Unfortunately, Strong Entropy is not fulfilled on $H$ if $\dim(H) = \infty$. By introducing a weight sequence, we can make $b_m$ smaller by making $a$ larger: Assume that the Hilbert space $H$ is separable and thus admits a countable orthonormal basis. Let $s = (s_k)_{k\in\mathbb N} \subseteq (0, \infty)$. In section 3.2.1, we derived that the strong quadruple condition holds with the weighted data distance $a(y, z) = 2\|y - z\|_{s^{-1}}$ and a correspondingly weighted descriptor metric. As a condition on the variance term, we need a finite weighted second moment. Similar to the Euclidean case, Theorem 2 implies a bound on $\mathbb E[\|m_n - m\|^2]$ involving $\|s\|_{\ell_2}^2 = \sum_{k=1}^{\infty} s_k^2$. Direct calculations yield a better result. As in the Euclidean case, we pay a factor related to the dimension for using the more generally applicable Fréchet mean approach instead of using the inner product for direct calculations.

Non-Convex Subsets
Let $\mathcal Q \subseteq H$ be a subset of the Hilbert space $H = \mathcal Y$ and set $\mu := \mathbb E[Y]$. To get the same rate as in section 4.2, we mainly need to be concerned with the Growth condition, as the quadruple condition holds in all subsets. For $q \in H$, simple calculations show $F(q) - F(m) = \|q - \mu\|^2 - \|m - \mu\|^2$. We want to find a lower bound of this term in the form of $c_g \|q - m\|^{\gamma}$ for constants $\gamma, c_g > 0$. The Growth condition holds with $\gamma = 2$ and $c_g \in (0, 1)$ if and only if $\mathcal Q$ stays clear of a certain ball determined by $\mu$ and $m$, as made precise below. To fulfill the Growth condition, we need $\mathcal Q \cap B_{r_1}(p_1) = \emptyset$ for a ball with larger radius $r_1 > r_0$ and adjusted center $p_1$. Increasing the radius further, $r_2 > r_1$, only improves the constant $c_g$ of the Growth condition, but not the exponent $\gamma$.
The condition reads $\|\mu - q\|^2 - \|\mu - m\|^2 \ge c_g \|q - m\|^2$ for all $q \in \mathcal Q$, i.e., it holds if and only if $\mathcal Q \cap B_r(p) = \emptyset$, where $r = \frac{1}{1 - c_g} \|\mu - m\|$ and $p = \mu + \frac{c_g}{1 - c_g} (\mu - m)$. Note that $\|p - m\| = r$. This is illustrated in Figure 1. We have answered the question of what $\mathcal Q$ may look like, given the location of $\mu$ and $m$. Possibly more interesting is the question of where, given $\mathcal Q$, the mean $\mu$ may be located so that $m$ can be estimated with the same rate as for convex sets. We will answer this question only informally via a description similar to a medial axis transform [CCM97]: For simplicity assume $\mathcal Q = \mathbb R^2 \setminus A$, where $A$ is a nonempty, open, and simply connected set with border $\partial A$ that is parameterized by the continuous function $\gamma \colon [0, 1] \to \partial A$. Roll a ball along the border on the inside of $A$. Make the ball as large as possible at any point so that it is fully contained in $A$ and touches the border at point $\gamma(t)$. Denote the center of the ball as $c \colon [0, 1] \to A$ and the radius as $r \colon [0, 1] \to [0, \infty)$. Take $\epsilon \in (0, 1)$ and trace the point $p_\epsilon \colon [0, 1] \to A$ on the radius connecting the center of the ball $c(t)$ and the border $\gamma(t)$ such that it divides the radius into two pieces of length $p_\epsilon(t)c(t) = \epsilon r(t)$ and $p_\epsilon(t)\gamma(t) = (1 - \epsilon) r(t)$. If $\mu$ lies on the outside of the set prescribed by $p_\epsilon \colon [0, 1] \to A$, it can be estimated with the same rate as for convex sets. This is illustrated in Figure 2. The set of all centers $C := \{c(t) \mid t \in [0, 1]\}$, also called the medial axis or cut locus, is critical: The closer $\mu$ is to $C$, the larger the guaranteed error bound for the estimator. In particular, we cannot guarantee consistency of the estimator if $\mu \in C$. A very similar phenomenon is described in [BP03, section 3]. The authors consider a Riemannian manifold $\mathcal Q$ that is embedded in a Euclidean space $\mathcal Y$. The extrinsic mean of a distribution on $\mathcal Q$ is the projection of the mean $\mu$ in $\mathcal Y$ to $\mathcal Q$. The points $C$ are called focal points. It is shown [BP03, Theorem 3.3] that in many cases the intrinsic mean, i.e., the Fréchet mean in $\mathcal Q$ with respect to the Riemannian metric on $\mathcal Q$, is equal to the extrinsic mean, i.e., the Fréchet mean in $\mathcal Q$ with respect to the Euclidean metric on $\mathcal Y$.
The conditions described above are connected to the term reach of a set [Fed59]. The reach of $\mathcal Q \subseteq \mathbb R^b$ is the largest $\epsilon > 0$ (possibly $\infty$) such that $\inf_{q\in\mathcal Q} d(x, q) < \epsilon$ implies that $x \in \mathbb R^b$ has a unique projection to $\mathcal Q$, i.e., a unique point $x_{\mathcal Q}$ with $d(x, x_{\mathcal Q}) = \inf_{q\in\mathcal Q} d(x, q)$. If the distance of the mean $\mu$ to $\mathcal Q$ is less than the reach of $\mathcal Q$, then the Growth condition holds with $\gamma = 2$. Thus, the rate of convergence is upper bounded by $c n^{-\frac12}$ for some $c > 0$. Note that convex sets have infinite reach and exhibit this upper bound for any distribution with finite second moment.
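A concrete instance (our example): for the unit circle $\mathcal Q = \{q \in \mathbb R^2 \colon \|q\| = 1\}$, the reach equals 1 and the medial axis of the removed disc is its center $\{0\}$. Every $\mu \ne 0$ has the unique projection $m = \mu / \|\mu\|$, while for $\mu = 0$ every point of the circle minimizes the expected cost, so the Fréchet mean is not unique and consistency cannot be guaranteed.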
By considering the growth condition $\|\mu - q\|^2 - \|\mu - m\|^2 \ge c_g \|q - m\|^{\gamma}$, one can also find examples of subspaces where the growth exponent for specific distributions is different from 2.

Hadamard Spaces
Let $(\mathcal Q, d)$ be a Hadamard space. A definition of Hadamard spaces is given in section 3.2.3. Use the notation $yq = d(y, q)$. For our purposes the most notable property of Hadamard spaces is that they fulfill the nice quadruple inequality, i.e., $yq^2 - yp^2 - zq^2 + zp^2 \le 2\, yz\, qp$. In the following subsections, we will see how this translates to convergence rates for the Fréchet mean estimator and use the power inequality to obtain results for a generalized Fréchet mean with cost function $d^{2a}$ for $a \in [\frac12, 1]$. For an introduction to Hadamard spaces see [Bač14a].

Figure 2: Let $A \subseteq \mathbb R^2$ be the set enclosed by the heart (solid black lines). Let $\mathcal Y = \mathbb R^2$ and $\mathcal Q = \mathbb R^2 \setminus A$. We consider a distribution on $\mathbb R^2$ with mean $\mu \in \mathcal Y$ and Fréchet mean $m \in \mathcal Q$ with respect to the Euclidean metric and the descriptor space $\mathcal Q$. The green, blue, and red lines show $p_\epsilon(t)$ for $\epsilon = 0.6, 0.3, 0$.
A survey of recent developments can be found in [Bac18]. In [BN08] the authors characterize Hadamard spaces by the nice quadruple inequality and discuss a quasilinearization of these spaces by observing that the left hand side of the nice quadruple inequality behaves like an inner product to some extent. [Stu03] shows how some important theorems of probability theory in Euclidean spaces, like the law of large numbers and Jensen's inequality, translate to non-Euclidean Hadamard spaces. In [Stu02] martingale theory on Hadamard spaces is discussed.
Turning to more applied topics, [Bač14b] shows algorithms for calculating the Fréchet mean in Hadamard spaces with cost function $d^{2a}$ for $a = \frac12$ and $a = 1$. An important application of Hadamard spaces in bioinformatics are phylogenetic trees [BHV01]. See also [Bac18, section 6.3] for a quick overview. Another application of Hadamard spaces is taking means in the manifold of positive definite matrices, e.g., in diffusion tensor imaging. But note that, as the underlying space is a differentiable manifold, one can use gradient-based approaches, see [PFA06].
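As a flavor of such algorithms, the following sketch implements the inductive mean in the spirit of Sturm's law of large numbers [Stu03]: the running estimate is moved along the geodesic toward each new sample. The helper names and the user-supplied geodesic are our assumptions; this is an illustration, not the algorithm of [Bač14b].

```python
import numpy as np

def inductive_frechet_mean(samples, geodesic):
    """Sturm-style inductive mean: move the running estimate along
    the geodesic toward the k-th sample with step 1/k.
    geodesic(p, q, t) must return the point at parameter t in [0, 1]
    on the geodesic from p to q for the space at hand."""
    s = samples[0]
    for k, y in enumerate(samples[1:], start=2):
        s = geodesic(s, y, 1.0 / k)
    return s

# In a Hilbert space the geodesic is the line segment, and the
# inductive mean reduces to the running sample mean.
euclidean_geodesic = lambda p, q, t: (1 - t) * p + t * q
ys = list(np.random.default_rng(1).normal(size=(1000, 2)))
print(inductive_frechet_mean(ys, euclidean_geodesic))
print(np.mean(ys, axis=0))
```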
Further examples of Hadamard spaces include Hilbert spaces, the Poincaré disc, complete metric trees, complete simply-connected Riemannian manifolds of nonpositive sectional curvature. See also [Stu03, section 3].

Fréchet Mean
Let $(\mathcal Q, d)$ be a Hadamard space. We use $\mathcal Q$ as data space as well as descriptor space, i.e., $\mathcal Q = \mathcal Y$. The cost function is $c = d^2$, the loss $l = d$. As described in section 3.2.3, the weak quadruple inequality holds with $a = 2d$ and $b = d$, i.e., $(\mathcal Q, d)$ fulfills the nice quadruple inequality. Let $Y$ be a random variable with values in $\mathcal Q$. Let $Y_1, \dots, Y_n$ be iid copies of $Y$.
If $\mathbb E[d(Y, q)^2] < \infty$ for one $q \in \mathcal Q$, then it is also finite for every $q \in \mathcal Q$, and the Fréchet mean $m \in \arg\min_{q\in\mathcal Q} \mathbb E[d(Y, q)^2]$ exists and is unique, see [Stu03, Proposition 4.3]. The same holds true for the estimator $m_n \in \arg\min_{q\in\mathcal Q} \sum_{i=1}^n d(Y_i, q)^2$. Thus, Existence is fulfilled.
Here, we chose a second moment condition, because we will need it for estimation anyway. But note that choosing the cost function as $c(y, q) = d(y, q)^2 - d(y, o)^2$ for a fixed, arbitrary point $o \in \mathcal Q$ allows us to require only a finite first moment for Existence, and the resulting Fréchet mean coincides with the $d^2$-Fréchet mean if the second moment is finite. This is described in more detail and utilized in [Stu03].
Furthermore, the Growth condition holds in Hadamard spaces with $\gamma = 2$ and $c_g = 1$, see [Stu03, Proposition 4.4]. Thus, we obtain the following corollary of Theorem 1: for all $s > 0$, we have a tail bound with a constant $c > 0$ depending only on $\beta$ and $c_e$. As described in section 3.2.3, it may be difficult to find a version of the strong quadruple inequality such that the same rate can be derived for convergence in expectation. Thus, instead of trying to apply Theorem 2, we utilize (i) Corollary 2 and (ii) Theorem 4, respectively.
for a constant c > 0 depending only on β.

Power Fréchet Mean
We go beyond Hadamard spaces by utilizing the power inequality, Theorem 3. Let $(\mathcal Q, d)$ be a Hadamard space, $a \in [\frac12, 1]$, and use the cost function $c = d^{2a}$. Using the tight power bound of Lemma 17 (Appendix G), Theorem 1 with $\zeta = 2$ implies the following corollary.
Corollary 5 (Rates in probability for power mean). Let $o \in \mathcal Q$ be an arbitrary fixed point. Assume there are measurable $m_n \in \arg\min_{q\in\mathcal Q} F_n(q)$ and $m \in \arg\min_{q\in\mathcal Q} F(q)$. Then, for all $s > 0$, we have a tail bound where $c > 0$ depends only on $\beta, \gamma, c_e$. Note that the moment condition becomes weaker as $a$ gets smaller and vanishes for $a = \frac12$, where, in the Euclidean case, the Fréchet mean is the median. Existence of $m_n$ and $m$ is a purely technical condition, as one will usually only be able to minimize the objective functions up to an $\epsilon > 0$ and the set of $\epsilon$-minimizers is always nonempty.
The Growth condition is more interesting. It seems possible to choose $\gamma = 2$ for all $a \in [\frac12, 1]$ in many circumstances, at least under some conditions on the distribution of $Y$. But precise statements of this sort are unknown to the author. If $\gamma$ really can be chosen independently of $a$, then the rate is the same for all $a \in [\frac12, 1]$. In the Euclidean case, this is manifested in the fact that we can estimate the median ($a = \frac12$), the mean ($a = 1$), and all statistics "in between" ($a \in (\frac12, 1)$) with the same rate (under some conditions), but with less restrictive moment assumptions for smaller powers $a$.
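The interpolation between median and mean is easy to see numerically. The sketch below (our illustration; numpy and the grid search are assumptions) estimates the power Fréchet mean on the real line by minimizing the empirical cost $q \mapsto \frac1n \sum_i |Y_i - q|^{2a}$ for several values of $a$; for $a = \frac12$ the estimate approaches the sample median, for $a = 1$ the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
ys = rng.exponential(size=1000)     # skewed data, so median != mean
grid = np.linspace(0.0, 3.0, 601)   # candidate descriptors q

# Empirical power Fréchet mean: minimize q -> mean_i |Y_i - q|^(2a).
for a in (0.5, 0.75, 1.0):
    costs = (np.abs(ys[:, None] - grid[None, :]) ** (2 * a)).mean(axis=0)
    print(f"a = {a}: estimate = {grid[np.argmin(costs)]:.3f}")

print(f"median = {np.median(ys):.3f}, mean = {ys.mean():.3f}")
```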
Similarly to the corollary above, we can apply Corollary 2 or Theorem 4 to obtain rates in expectation.

Further Research
The growth condition, especially for power Fréchet means, see section 4.4.2, needs to be studied further to get a better understanding of what properties a distribution must have, so that all power means can be estimated with the same rate.
In [Bač14a] the author describes algorithms for calculating means and medians in Hadamard spaces, i.e., power Fréchet means as in section 4.4.2 with $a \in \{\frac12, 1\}$. As we have shown results also for $a \in (\frac12, 1)$, it would be interesting to see whether one can generalize the algorithms for $a = \frac12$ and $a = 1$ to $a \in [\frac12, 1]$. We plan to use the results of this paper in a regression setting similar to [PM19]. We will show convergence rates for an orthogonal series-type regression estimator for the conditional Fréchet mean $m(x)$.

A.1 Proof of Theorem 1

Results similar to the following lemma are well known in the M-estimation literature. The proof relies on the peeling device, see [Gee00]. Lemma 2 (Weak argmin transform). Assume Growth. Let $\zeta \ge 1$. Assume that there are constants $\xi \in (0, \gamma)$, $h_n \ge 0$ such that $\mathbb E[\Delta_n(\delta)^{\zeta}] \le h_n \delta^{\xi \zeta}$ for all $\delta > 0$. Then the stated bound on the loss holds, where $c > 0$ depends only on $c_g, \gamma, \xi, \zeta$.

Proof of Corollary 2. Theorem 1 yields
In general, for $a > 1$, $b > 0$, we have the elementary tail-integration bound used here. The proof is concluded by applying this statement and noting that $\xi > 1$.

A.2 Proof of Theorem 2
To state the next Lemma, which will be used to prove Theorem 2, we introduce an intermediate condition, which we call Closeness.

Closeness:
There are $\xi \in (0, \gamma)$ and a random variable $H_n \ge 0$ such that the closeness bound holds for all $q \in \mathcal Q$ almost surely. Proof. We use Growth and the fact that $m_n$ minimizes $F_n$ to obtain a chain of inequalities, where we applied the Closeness condition in the last step. This implies the claimed inequality. Define $X(q) := \sum_{i=1}^n Z_i(q)$. Lemma 5. Let $\zeta \ge 1$. Assume Strong Moment and Strong Quadruple. Then a moment bound holds, where $c > 0$ is a constant depending only on $\zeta$. Additionally, assume Strong Entropy. Then a refined bound holds, where $C > 0$ is a constant depending only on $\zeta, \beta, c_e$, with $\eta_{\beta,n}$ as before. Proof. We have $X(q) = \sum_{i=1}^n Z_i(q)$. The Strong Moment condition together with the Strong Quadruple condition imply that $Z_i$ is integrable. Let $(Z'_1, \dots, Z'_n)$ be an independent copy of $(Z_1, \dots, Z_n)$, where $(Y'_1, \dots, Y'_n)$ is an independent copy of $(Y_1, \dots, Y_n)$. By Strong Quadruple, we have a Lipschitz property. Furthermore, $M(\zeta) < \infty$ due to the assumption Strong Moment. Thus, Theorem 6 (Appendix F) applies. Strong Entropy together with Lemma 12 (Appendix F) yields the second bound for a constant $C > 0$ depending only on $\beta, \zeta, c_e$.
Proof of Theorem 2. Using $H_n := \sup_{q\in\mathcal Q} |X(q)|$ in Lemma 4 fulfills the Closeness condition by definition of $X$. Next, apply Lemma 5 with $\zeta := \frac{\kappa}{\gamma - \xi}$ to conclude the proof.

A.3 Proof of Theorem 4
Lemma 6. The condition Small Entropy implies a bound on the entropy integral. Proof. Obviously, a monotonicity property holds for any set $Q \subseteq \mathcal Q$. For $Q := B_R(o, b)$, we obtain, using the Small Entropy condition, a bound via the entropy integral. To calculate the integral, we substitute $s := \frac{r}{R}$. For general $a \in (0, 1)$, $b > 0$, the integral can be evaluated in terms of $\Gamma(\cdot)$, the Gamma function. Thus, the claim follows for a constant $c_\beta > 0$ depending only on $\beta$. Lemma 7. Set $\xi_n := 1 - \log(n)^{-1}$. Then the corresponding bound holds, where $c_\gamma > 0$ is a constant depending only on $\gamma$.
Theorem 2 implies the desired bound for $n$ large enough. Note that $C > 0$ can be chosen independently of $n$ (even though $\xi_n$ depends on $n$).
In Strong Moment we require $\kappa \ge \gamma - 1$, because then $x \mapsto x^{\frac{\kappa}{\gamma - 1}}$ is convex, which is needed for the symmetrization argument in the proof of Theorem 2. But if $\kappa = \gamma - 1$, then $\frac{\kappa}{\gamma - \xi_n} < 1$, and Theorem 2 cannot be applied directly. For this technical reason, we assumed $\kappa > \gamma - 1$, so that $\kappa \ge \gamma - \xi_n$ for $n$ large enough.
By Small Entropy and Lemma 6 there is $c_\beta > 0$ such that for $n \in \mathbb N$ large enough, the entropy bound holds. Using $R_n^{1 - \xi_n} = n^{\frac{1}{\log(n)}} = \exp(1)$ together with Lemma 7, we obtain the rate. Finally, there is an $n_0 \in \mathbb N$ such that for all $n \ge n_0$, we have $m \in \mathcal Q_n$, which implies $m = m^{\mathcal Q_n}$. Thus, the claim follows.

Appendix B Stability of Quadruple Inequalities
We present some trivial stability results for quadruple inequalities. The notation we use here is introduced in the beginning of section 3.

Subsets:
If $(\mathcal Q, \mathcal Y, c, a, b)$ fulfills the weak quadruple inequality, then so does $(\mathcal Q', \mathcal Y', c, a, b)$ for all subsets $\mathcal Q' \subseteq \mathcal Q$ and $\mathcal Y' \subseteq \mathcal Y$, with all functions restricted accordingly.

Images:
Assume $(\mathcal Q, \mathcal Y, c, a, b)$ fulfills the weak quadruple inequality and let $f \colon \tilde{\mathcal Y} \to \mathcal Y$ and $g \colon \tilde{\mathcal Q} \to \mathcal Q$ be maps. Then $(\tilde{\mathcal Q}, \tilde{\mathcal Y}, \tilde c, \tilde a, \tilde b)$ fulfills the weak quadruple inequality, where $\tilde c(y, q) = c(f(y), g(q))$, $\tilde a(y, z) = a(f(y), f(z))$, and $\tilde b(q, p) = b(g(q), g(p))$.

Limits:
Let $(\mathcal Q, \mathcal Y, c_i, a_i, b_i)$ fulfill the weak quadruple inequality for $i \in \mathbb N$ and assume that for all $q, p \in \mathcal Q$ and $y, z \in \mathcal Y$ the point-wise limits $c$, $a$, $b$ of $c_i$, $a_i$, $b_i$ exist. Then $(\mathcal Q, \mathcal Y, c, a, b)$ also fulfills the weak quadruple inequality.
Similar results hold for the strong quadruple inequality. For the following results it may not be so easy to obtain an analog for the strong quadruple inequality.

Product Spaces:
Let $(\mathcal Q_i, \mathcal Y_i, c_i, a_i, b_i)$, $i = 1, \dots, k$, fulfill the weak quadruple inequality. Then the product spaces with cost $c = \sum_i c_i$, data distance $a = (\sum_i a_i^2)^{1/2}$, and descriptor metric $b = (\sum_i b_i^2)^{1/2}$ fulfill the weak quadruple inequality, using the Cauchy-Schwarz inequality.
Proof. We have $\sum_i \big(c_i(y_i, q_i) - c_i(y_i, p_i) - c_i(z_i, q_i) + c_i(z_i, p_i)\big) \le \sum_i a_i b_i \le \big(\sum_i a_i^2\big)^{1/2} \big(\sum_i b_i^2\big)^{1/2}$ by Hölder's inequality.

Minima:
Let $(\mathcal Q, \mathcal Y, c, a, b)$ fulfill the weak quadruple inequality. Let $\tilde{\mathcal Y} \subseteq 2^{\mathcal Y}$. Define the cost function $C \colon \tilde{\mathcal Y} \times \mathcal Q \to \mathbb R$ by $C(\mathbf y, q) = \inf_{y \in \mathbf y} c(y, q)$ and $A(\mathbf y, \mathbf z) = \sup_{y \in \mathbf y, z \in \mathbf z} a(y, z)$, assuming the infima and suprema are finite. Then $(\mathcal Q, \tilde{\mathcal Y}, C, A, b)$ fulfills the weak quadruple inequality.
Proof. Let $\mathbf y, \mathbf z \in \tilde{\mathcal Y}$ and $q, p \in \mathcal Q$. Assume there are $y_q, y_p \in \mathbf y$, $z_q, z_p \in \mathbf z$ such that $C(\mathbf y, q) = c_{y_q q}$, $C(\mathbf y, p) = c_{y_p p}$, $C(\mathbf z, q) = c_{z_q q}$, and $C(\mathbf z, p) = c_{z_p p}$. Then $C(\mathbf y, q) - C(\mathbf y, p) - C(\mathbf z, q) + C(\mathbf z, p) \le c_{y_p q} - c_{y_p p} - c_{z_q q} + c_{z_q p} \le a(y_p, z_q)\, b(q, p) \le A(\mathbf y, \mathbf z)\, b(q, p)$.
If the infima are not attained, one can follow the same proof with minimizing sequences.
In many interesting problems the setting is opposite to what was described before, i.e., $C \colon \mathcal Y \times \tilde{\mathcal Q} \to \mathbb R$, $(y, \mathbf q) \mapsto \inf_{q \in \mathbf q} c(y, q)$, where $\tilde{\mathcal Q} \subseteq 2^{\mathcal Q}$: the elements of the descriptor space are subsets and the elements of the data space are points. Examples are $k$-means, where $\tilde{\mathcal Q}$ consists of $k$-tuples of points in $\mathcal Q$, or fitting hyperplanes. A quadruple inequality with $\sup_{q \in \mathbf q, p \in \mathbf p} b(q, p)$ as the descriptor distance can be established. Unfortunately, this is usually not useful, as the entropy condition cannot be fulfilled with distances of this type. The framework described in this article can still be applied using inequalities as for bounded spaces, see section 3.1. But we cannot directly use the advantage of quadruple inequalities over Lipschitz continuity.

Appendix C Proof of Lemma 1
We first state and prove two lemmas on some simple arithmetic expressions and then use those for the proof of Lemma 1.

Proof. For $t \ge s$, we use the bound on $A$ and on $st$ by using the bound on $B$ and $A - B$. Together, we obtain the claim. We finish the proof by pointing out the symmetry between $(A, a, s)$ and $(B, b, t)$.
Multiplying by $(c - a)^{\beta}$ and using $c - a \le b$, we get $c - a \le b^{\beta} 2^{\beta} (c^{1-\beta} - a^{1-\beta})$. Thus, the claim follows.
Proof of Lemma 1. Applying Lemma 8 to the left hand side of equation (6) yields the required bound.

Appendix D Projection Metric Counter Example
We take a tripod $(\mathcal Q, d)$ as a simple example of a non-Euclidean Hadamard space, see Figure 3. Let $r > \varepsilon > 0$ and define $y, z, q, p, o$ on the tripod as in Figure 3. We take $c = d^2$, $\xi = 1$. If the strong quadruple inequality held, we would obtain a contradiction. Thus, $d^{\mathrm{proj}}_m$ is not a suitable candidate for the strong quadruple distance in general Hadamard spaces.

Appendix E Optimality of Power Inequality
We show that $8\alpha 2^{-2\alpha}$ is the optimal constant, and that we cannot extend Theorem 3 to $\alpha > 1$ or $\alpha < \frac12$. Let $\epsilon \in (0, 1)$ and $(\mathcal Q, d)$ be a metric space with $q, p, y, z \in \mathcal Q$ such that for each case below the distances have the values written down in Table 1. One can easily show that in all three cases the necessary triangle inequalities and the nice quadruple inequality hold.
(b) For $\alpha > 1$, we have, using l'Hôpital's rule, a diverging ratio of the two sides. Thus, there is no power inequality in the form of Theorem 3 for $\alpha > 1$.

Appendix F Chaining
Recall the measures of entropy $\gamma_2$ and $\operatorname{entr}_n$ defined in Definition 1. We add another useful entry to this list.
We write down the Bernoulli bound for powers of the Bernoulli process. [BL14] show that the bound can be reversed (up to a universal constant). Thus, this step can be regarded as optimal.
Proof. Let $T_1, T_2 \subseteq \mathbb R^n$ such that $T \subseteq T_1 + T_2$. As $(a + b)^{\kappa} \le 2^{\kappa-1} (a^{\kappa} + b^{\kappa})$ for all $a, b \ge 0$, we can split the supremum into two parts. The first term is bounded using the 1-norm, $\mathbb E[\sup_{t \in T_1} X_t^{\kappa}] \le \sup_{t \in T_1} \|t\|_1^{\kappa}$. For the second we use Talagrand's generic chaining bound for the supremum of the subgaussian process, $\mathbb E[\sup_{t \in T_2} X_t^{\kappa}] \le c'_{\kappa}\, \gamma_2(T_2)^{\kappa}$, see [Tal14]. We obtain the claim. Lemma 10 (Lipschitz connection). Let $(\mathcal Q, b)$ be a pseudo-metric space. Assume there are functions $f_i \colon \mathcal Q \to \mathbb R$ and constants $a_i \ge 0$, $i = 1, \dots, n$, with $|f_i(q) - f_i(p)| \le a_i\, b(q, p)$ for all $q, p \in \mathcal Q$. Define $T := \{(f_i(q))_{i=1,\dots,n} \colon q \in \mathcal Q\}$ and set $a = (a_1, \dots, a_n)$. Then the stated bound holds, where $C > 0$ is a universal constant.
Proof. For $\epsilon > 0$, choose $Q_2$ to be an $\epsilon$-covering of $\mathcal Q$ with respect to $b$, i.e., for all $q \in \mathcal Q$ there is a $p_q \in Q_2$ such that $b(q, p_q) \le \epsilon$. For $q \in \mathcal Q$ denote $t_q := (f_i(q))_{i=1,\dots,n} \in \mathbb R^n$. Define $T_2 := \{t_p \colon p \in Q_2\}$ and $T_1 := \{t_q - t_{p_q} \colon q \in \mathcal Q\}$. Then $T \subseteq T_1 + T_2$. The Lipschitz-condition implies $\|t_q - t_p\|_2 \le \|a\|_2\, b(q, p)$ for all $q, p \in \mathcal Q$. Thus, by the properties of $\gamma_2$, see [Tal14], we obtain the stated bounds for universal constants $c, c' > 0$. Applying the two inequalities to the definition of $b(T)$ concludes the proof.
The symmetrization lemma is well known. The statement here is an intermediate step from the proof of [VW96, 2.3.6 Lemma].
Theorem 6 (Empirical process bound). Let $(\mathcal Q, b)$ be a separable pseudo-metric space. Let $Z_1, \dots, Z_n$ be centered, independent, and integrable stochastic processes indexed by $\mathcal Q$ with a $q_0 \in \mathcal Q$ such that $Z_i(q_0) = 0$ for $i = 1, \dots, n$. Let $(Z'_1, \dots, Z'_n)$ be an independent copy of $(Z_1, \dots, Z_n)$. Assume the following Lipschitz-property: There is a random vector $A$ with values in $\mathbb R^n$ such that $|Z_i(q) - Z_i(p)| \le A_i\, b(q, p)$ for $i = 1, \dots, n$ and all $q, p \in \mathcal Q$. Let $\kappa \ge 1$. Then the stated moment bound holds, where $C > 0$ is a universal constant.
Proof. Use Lemma 11. Then apply Theorem 5 and Lemma 10 conditionally on $Z_1, \dots, Z_n$. In particular, one obtains a bound where $c$ depends only on $c_e$ and $\beta$, with $\eta_{\beta,n}$ as before. The proof consists of calculating the entropy integral with the given bound on the covering numbers and, for $\beta \ge 1$, choosing the minimizing starting point $\epsilon > 0$ of the integral.
The advantage of using Lemma 13 to prove Theorem 3 is that we do not need to consider a system of additional conditions for describing that the real values in the inequality are distances, which have to fulfill the triangle inequality. The disadvantage is that we lose the possibility of a geometric interpretation of the proof.
Proof. Three points from an arbitrary metric space can be embedded in the Euclidean plane so that the distances are preserved. Thus, the cosine formula of Euclidean geometry can be applied to the three points $y, p, q \in \mathcal Q$: We have $yq^2 = c^2 - 2scb + b^2$, where $s := \cos(\angle ypq)$ with the angle $\angle ypq$ in the Euclidean plane, $c := yp$, and $b := qp$. Similarly, $zq^2 = a^2 - 2rab + b^2$, where $r := \cos(\angle zpq)$ and $a := zp$. Hence, Lemma 13 yields $yq^{2\alpha} - yp^{2\alpha} - zq^{2\alpha} + zp^{2\alpha} = a^{2\alpha} - c^{2\alpha} - (a^2 - 2rab + b^2)^{\alpha} + (c^2 - 2scb + b^2)^{\alpha} \le 8\alpha 2^{-2\alpha}\, b \max(ra - sc, |a - c|)^{2\alpha - 1}$. The assumption of Theorem 3 states $yq^2 - yp^2 - zq^2 + zp^2 \le 2\, yz\, qp$, which here reads $2b(ra - sc) \le 2\, yz\, b$. Therefore, $ra - sc \le yz$ (or $b = 0$, but then $q = p$ and Theorem 3 becomes trivial). Furthermore, the triangle inequality implies $|a - c| = |zp - yp| \le yz$. Thus, we obtain $\max(ra - sc, |a - c|) \le yz$.
Finally, (8) and (9) together yield the claim of Theorem 3. The remaining part of this section is dedicated to proving Lemma 13. The proof of Lemma 13 can be described as brute force. We will distinguish many different cases, i.e., certain bounds on $a, b, c, r, s$, e.g., $a \le c$ and $a > c$. In each case, we try to simplify the inequality step by step until we can solve it easily. Mostly, the simplification consists of taking some derivative and showing that the derivative is always negative (or always positive). Then we only need to show the inequality at one extremal point. This process may have to be iterated. It is often not clear immediately which derivative to take in order to simplify the inequality. Even after finishing the proof there seems to be no deeper reason for distinguishing the cases that are considered. Thus, unfortunately, the proof does not create a deeper understanding of the result.

G.2 First Proof Steps and Outline of the Remaining Proof
We want to show Lemma 13 to prove Theorem 3. We refer to the left hand side of the inequality, $a^{2\alpha} - c^{2\alpha} - (a^2 - 2rab + b^2)^{\alpha} + (c^2 - 2scb + b^2)^{\alpha}$, as LHS. By RHS we, of course, mean the right hand side, $8\alpha 2^{-2\alpha}\, b \max(ra - sc, |a - c|)^{2\alpha - 1}$.
For $\max(ra - sc, |a - c|) = 0$ we have $a = c$ and $r \le s$. Thus, LHS $\le 0$. If $\max(ra - sc, |a - c|) > 0$, LHS and RHS are continuous in all parameters. Thus, it is enough to show the inequality on a dense set. In particular, we can and will ignore certain special cases in the following which might introduce technical problems, e.g., "$0^0$".
We have to distinguish the cases $|a - c| = \max(ra - sc, |a - c|)$ and $ra - sc = \max(ra - sc, |a - c|)$. We further distinguish $a \ge c$ and $c \ge a$.

G.2.2 The Case |a − c| ≥ ra − sc
In the case $|a - c| \ge ra - sc$, the RHS does not depend on $s$ or $r$. Thus, we maximize the LHS with respect to $r$ and $s$ and only need to show the inequality for this maximized term. Case 1: $a \ge c$. Case 1.1: $a^2 \le c^2 + 2ab - 2cb$. Then $f'(r) \ge 0$, and we need to show the inequality at the maximal $r$. Case 1.2: $a^2 \ge c^2 + 2ab - 2cb$. Then $f'(r) \le 0$. The relevant values are $r = r_{\min} = 1 - 2\frac{c}{a}$, with $s = s_{\min}(r) = -1$, and we need to show the inequality there. Case 2: $a \le c$. For fixed $r \in [-1, 1]$, set $s = s_{\min}(r) = (r + 1)\frac{a}{c} - 1$. Case 2.1: $a^2 \le c^2 - 2ab + 2cb$. Then $f'(r) \ge 0$. The critical value is $r = r_{\max} = 1$, with $s = s_{\min}(1) = 2\frac{a}{c} - 1$, and we need to show the inequality there. Case 2.2: $a^2 \ge c^2 - 2ab + 2cb$. This cannot happen for $a \le c$.

G.2.3 Outline
Remark 2 (What we need to show). We collect the inequalities that remain to be shown. The proofs consist of distinguishing many different cases and applying simple analysis methods in each case. Nonetheless, finding the proofs is often quite hard, as the inequalities are usually very tight and the right steps necessary for the proof are hard to guess.
As intermediate steps we can, in some cases, use two lemmas: the Tight Power Bound, see section G.3, and the Merging Lemma, see section G.4. The remaining cases that cannot be solved via the Tight Power Bound and the Merging Lemma will be discussed in sections G.6 and G.7.

G.3 Tight Power Bound
The following lemma gives one very useful inequality in three different forms. It gives a hint as to why the power $(\,\cdot\,)^{2\alpha - 1}$ comes up in the RHS of Lemma 13.
Note that this result is slightly stronger than the application of the mean value theorem to the function $x \mapsto x^a$, which yields $x^a - y^a \le a(x - y)z^{a-1}$ for all $x \ge y \ge 0$ and $a > 0$, for some $z \in [y, x]$.
If we can show $f(z) \le 2a$, then the claim follows. We compute the derivative of $f$ and evaluate the limits, where L'H indicates the use of L'Hospital's rule. Furthermore, $f(z_0) \ge f(1) = 2^a$, which implies the lower bound. This finishes the proof for (i). The other parts follow immediately.

G.4 Merging Lemma
In many cases (i.e., with additional assumption on a, b, c, r or s), we prove the inequality of Lemma 13 by applying first a merging lemma to the LHS to reduce the four summands to two summands of a specific form. Then we apply the Tight Power Bound. The Merging Lemma is discussed in this section.

G.4.3 a − sc-Merging Lemma
Lemma 20 covers the case $\frac12 b \ge sc$. The following lemma covers $\frac12 b \le sc$ under the additional restriction $sc \le a - b$. We bound the relevant difference by a function $f$; the next lemma shows $f(sc) \le 0$, and thus $f(\delta) \le 0$ for all $\delta \ge sc$.
Proof. Define the auxiliary function $g$. A computation of its derivative shows $g(x) \le 0$ for all valid $x$.

G.5 Application of Tight Power Bound and Merging Lemma
Whenever a Merging Lemma holds, we apply it as a first step and then use the Tight Power Bound, Lemma 17, to obtain the desired inequality. In particular, we have finished the proof of Lemma 13 in the following cases: • $ra \ge sc$ and $s, r \in \{-1, 1\}$: Lemma 18, • $2ra \ge b$ and $s \in \{-1, 1\}$; or $b \ge 2sc$ and $r \in \{-1, 1\}$; or $2ra \ge b \ge 2sc$: Lemma 20. Furthermore, by concavity of $x \mapsto x^{\alpha}$, we obtain further reductions. For $a \ge c$, the remaining case is solved by the following lemma.
and $a \ge c$. Then the claim holds. Proof. Because $a \ge c$, the term $c^2 - 2scb + b^2$ can be compared with $c^2$ and $(a - b)^2$. Thus, applying either Lemma 23 (if $c^2 - 2scb + b^2$ is larger than either $c^2$ or $(a - b)^2$) or Lemma 24 yields the claim.

G.6.2 The Case a ≤ c
For the case $c \ge a$, we only need $ra - sc \ge c - a$ (for $r = 1$). We distinguish $\frac12 b \le a - sc$ and $\frac12 b \ge a - sc$.
Proof. The conditions imply the stated bounds. In particular, we define auxiliary functions $f$, $g$, and $h$. The next lemma shows $h(a, b) \le 0$ for $a \ge b$. Thus, $g(c) \le 0$, and hence $f(x) \le 0$.
We consider $f(a, b, a - w, w)$ and $g(a, b, w)$. The conditions $0 \le a - w \le \frac{b}{2} \le w$ and $a \le b$ imply $w \le a \le b$. We have $h(a, b, w) \le 0$ for all $a \in [w, b]$. Thus, $\partial_b g(a, b, w) \le 0$. The conditions for $g$ are $0 \le a - w \le \frac{b}{2} \le w \le a \le b$. As $a \le b$ and $\frac{b}{2} \le w$, we have $a \le 2w$ and thus $a \ge 2a - 2w$.
Applying the exponential function and reordering the factors yields the desired inequality.