Stochastic approximation versus sample average approximation for Wasserstein barycenters

In the machine learning and optimization community, there are two main approaches to the convex risk minimization problem, namely Stochastic Approximation (SA) and Sample Average Approximation (SAA). In terms of oracle complexity (the required number of stochastic gradient evaluations), both approaches are considered equivalent on average (up to a logarithmic factor). The total complexity depends on the specific problem; however, starting from the work [A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim. 19 (2009), pp. 1574–1609], it was generally accepted that the SA is better than the SAA. We show that for the Wasserstein barycenter problem this superiority can be inverted. We provide a detailed comparison by stating the complexity bounds for the SA and SAA implementations calculating barycenters defined with respect to optimal transport distances and entropy-regularized optimal transport distances. As a byproduct, we also construct confidence intervals for the barycenter defined with respect to entropy-regularized optimal transport distances in the $\ell_2$-norm. The preliminary results are derived for a general convex optimization problem given in the form of an expectation, so that they may have other applications besides the Wasserstein barycenter problem.


Introduction
In this paper, we consider the problem of finding a barycenter of discrete random probability measures. We refer to optimal transport (OT) metrics, which provide a successful framework to compare objects that can be modelled as probability measures (images, videos, texts, etc.). Transport-based distances have gained popularity in various fields such as statistics [13,27], unsupervised learning [4], signal and image analysis [76], computer vision [66], text classification [51], economics and finance [62] and medical imaging [38,77]. Moreover, a lot of statistical results are known about optimal transport distances [47,72,78].
The success of optimal transport led to an increasing interest in Wasserstein barycenters (WBs). Wasserstein barycenters are used in Bayesian computations [74], texture mixing [61], clustering (k-means for probability measures) [21], shape interpolation and color transferring [71], statistical estimation of template models [15] and neuroimaging [38]. For discrete random probability measures from the probability simplex $\Delta_n$ ($n$ is the size of the support) with distribution $P$, a Wasserstein barycenter is introduced through the notion of a Fréchet mean [30]
$$\min_{p \in \Delta_n} \mathbb{E}_{q\sim P}\, W(p, q). \qquad (1)$$
If a solution of (1) exists and is unique, then it is referred to as the population barycenter of distribution $P$. Here $W(p,q)$ is the optimal transport distance between measures $p$ and $q$:
$$W(p, q) = \min_{\pi \in U(p,q)} \langle C, \pi\rangle, \qquad (2)$$
where $C \in \mathbb{R}^{n\times n}_+$ is a symmetric transportation cost matrix and $U(p,q) \triangleq \{\pi \in \mathbb{R}^{n\times n}_+ : \pi \mathbf{1} = p,\ \pi^\top \mathbf{1} = q\}$ is the transport polytope.^1

In [18], the entropic regularization of the optimal transport problem (2) was proposed to improve its statistical properties [9,47] and to reduce the computational complexity from $O(n^3)$ ($n$ is the size of the support of the measures) to $n^2 \min\{\tilde O(1/\varepsilon), \tilde O(\sqrt n)\}$:^2
$$W_\gamma(p, q) = \min_{\pi \in U(p,q)} \left\{ \langle C, \pi\rangle - \gamma E(\pi) \right\}. \qquad (3)$$
Here $\gamma > 0$ and $E(\pi) \triangleq -\langle \pi, \log \pi\rangle$ is the entropy. Since $E(\pi)$ is 1-strongly concave on $\Delta_{n^2}$ in the $\ell_1$-norm, the objective in (3) is $\gamma$-strongly convex with respect to $\pi$ in the $\ell_1$-norm on $\Delta_{n^2}$, and hence problem (3) has a unique optimal solution. Moreover, $W_\gamma(p,q)$ is $\gamma$-strongly convex with respect to $p$ in the $\ell_2$-norm on $\Delta_n$ [8, Theorem 3.4]. Another particular advantage of the entropy-regularized optimal transport (3) is a closed-form representation for its dual function [1,19], defined by the Fenchel–Legendre transform of $W_\gamma(p,q)$ as a function of $p$:
$$W^*_{\gamma,q}(u) = \max_{p\in\Delta_n}\left\{ \langle u, p\rangle - W_\gamma(p,q) \right\} = \gamma\left(E(q) + \langle q, \log(K\beta)\rangle\right),$$
where $\beta = \exp(u/\gamma)$, $K = \exp(-C/\gamma)$, and the functions $\log$ and $\exp$ are applied element-wise. Hence, the gradient of the dual function $W^*_{\gamma,q}(u)$ is also represented in closed form [19]
$$\nabla W^*_{\gamma,q}(u) = \beta \odot \left( K\cdot(q/(K\beta)) \right) \in \Delta_n,$$
where the symbols $\odot$ and $/$ stand for the element-wise product and element-wise division respectively.

Background on the SA and the SAA. Let us consider a general stochastic convex minimization problem
$$\min_{x\in X} F(x) \triangleq \mathbb{E} f(x, \xi), \qquad (4)$$
where the function $f$ is convex in $x$ ($x \in X$, $X$ is a convex set) and $\mathbb{E} f(x,\xi)$ is the expectation of $f$ with respect to $\xi \in \Xi$. Such problems arise in many applications of data science [67,69] (e.g. risk minimization) and mathematical statistics [73] (e.g. maximum likelihood estimation).
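The closed-form dual gradient above is what makes entropic OT computationally convenient: the potentials can be obtained by Sinkhorn scaling iterations. The following NumPy sketch is our illustration (not the paper's Algorithm 1; the grid, measures and parameters are ours): it computes the scaling vectors for a small 1-D problem and recovers the transport plan, whose marginals should match $p$ and $q$.

```python
import numpy as np

def sinkhorn_potential(p, q, C, gamma, n_iter=1000):
    """Approximate the dual potential u of entropy-regularized OT W_gamma(p, q).

    Scaling form: pi = diag(a) K diag(b) with K = exp(-C/gamma).
    The centered vector u = gamma*log(a) approximates the dual potential,
    i.e. a subgradient of W_gamma(p, q) with respect to p.
    """
    K = np.exp(-C / gamma)
    a = np.ones_like(p)
    for _ in range(n_iter):
        b = q / (K.T @ a)          # enforce the second marginal
        a = p / (K @ b)            # enforce the first marginal
    u = gamma * np.log(a)
    return u - u.mean(), a, b, K   # center so that <u, 1> = 0

# toy example on a 1-D grid with quadratic cost
n = 8
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :]) ** 2                 # symmetric cost matrix
p = np.full(n, 1.0 / n)
q = np.exp(-(x - 0.7) ** 2 / 0.02); q /= q.sum()
u, a, b, K = sinkhorn_potential(p, q, C, gamma=0.2)
pi = a[:, None] * K * b[None, :]                   # transport plan
print(np.abs(pi.sum(1) - p).max(), np.abs(pi.sum(0) - q).max())
```

The marginal residuals shrink linearly with the number of Sinkhorn iterations; the paper's Algorithm 1 additionally controls the accuracy $\delta$ of the potential explicitly.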
There are two competing approaches based on Monte Carlo sampling techniques to solve (4): the Stochastic Approximation (SA) [64] and the Sample Average Approximation (SAA). The SAA approach replaces the objective in problem (4) with its sample average approximation
$$\min_{x\in X} \hat F(x) \triangleq \frac{1}{m}\sum_{i=1}^m f(x, \xi_i), \qquad (5)$$
where $\xi_1, \xi_2, \dots, \xi_m$ are realizations of a random variable $\xi$. The number of realizations $m$ is adjusted to a desired precision. The total working time of both approaches to solve problem (4) with average precision $\varepsilon$ in the optimality gap in terms of the objective function (i.e. to find $\hat x_N$ such that $\mathbb{E}F(\hat x_N) - \min_{x\in X} F(x) \le \varepsilon$) depends on the specific problem. However, it is generally accepted [54] that the SA approach is better than the SAA approach. Stochastic gradient (mirror) descent, an implementation of the SA approach [44], gives the following estimate for the number of iterations (which equals the sample size of $\xi_1, \xi_2, \dots, \xi_m$):
$$N = O\!\left(\frac{M^2 R^2}{\varepsilon^2}\right).$$
Here we consider the minimal assumptions (non-smoothness) on the objective $f(x,\xi)$: $M$ is its Lipschitz constant in $x$ and $R$ is the distance from the starting point to the solution. The application of the SAA approach, in contrast, requires the following sample size [70]:
$$m = O\!\left(\frac{n M^2 R^2}{\varepsilon^2}\right), \qquad (6)$$
that is, $n$ times more ($n$ is the problem dimension) than the sample size in the SA approach. This estimate was obtained under the assumption that problem (5) is solved exactly. This is one of the main drawbacks of the SAA approach. However, if the objective $f(x,\xi)$ is $\lambda$-strongly convex in $x$, the sample sizes are equal up to logarithmic terms:
$$N = \tilde O\!\left(\frac{M^2}{\lambda\varepsilon}\right), \qquad m = \tilde O\!\left(\frac{M^2}{\lambda\varepsilon}\right). \qquad (7)$$
Moreover, in this case, for the SAA approach it suffices to solve problem (5) only with accuracy [68]
$$\varepsilon' = O\!\left(\frac{\lambda\varepsilon^2}{M^2}\right). \qquad (8)$$
Therefore, to eliminate the linear dependence on $n$ in the SAA approach for a non-strongly convex objective, a regularization with $\lambda = \frac{\varepsilon}{2R^2}$ should be used [68]. Let us suppose that $f(x,\xi)$ in (4) is convex but not strongly convex in $x$ (possibly $\lambda$-strongly convex, but with very small $\lambda \ll \frac{\varepsilon}{R^2}$).
Here $R = \|x^1 - x_*\|_2$ is the Euclidean distance between the starting point $x^1$ and the solution $x_*$ of (4) that minimizes this distance (if the solution is not unique). Then problem (4) can be replaced by its regularized counterpart
$$\min_{x\in X} F(x) + \frac{\lambda}{2}\|x - x^1\|_2^2. \qquad (9)$$
The empirical counterpart of (9) is
$$\min_{x\in X} \frac{1}{m}\sum_{i=1}^m f(x, \xi_i) + \frac{\lambda}{2}\|x - x^1\|_2^2, \qquad (10)$$
where the sample size $m$ is now given by the strongly convex estimate, which, for this choice of $\lambda$, coincides with the SA sample size up to logarithmic factors. Thus, in the case of a non-strongly convex objective, regularization equates the sample sizes of the two approaches.
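To make the SA/SAA dichotomy concrete, here is a toy sketch (our illustration, not from the paper) with $f(x,\xi) = |x - \xi|$, whose population minimizer is the median of $\xi$: the SA route runs one-pass stochastic subgradient descent over the samples, while the SAA route minimizes the sample average exactly, which here is simply the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20000
xi = rng.normal(1.0, 1.0, size=N)   # minimizer of E|x - xi| is the median = 1.0

# SA: one pass of stochastic subgradient descent, stepsize ~ 1/sqrt(k)
x, x_avg = 0.0, 0.0
for k, s in enumerate(xi, start=1):
    g = np.sign(x - s)              # subgradient of |x - s| at x
    x -= (1.0 / np.sqrt(k)) * g
    x_avg += (x - x_avg) / k        # running average of the iterates

# SAA: the sample average (1/N) sum_i |x - xi_i| is minimized by the sample median
x_saa = np.median(xi)

print(x_avg, x_saa)
```

Both estimates approach the population minimizer; the point of the discussion above is how the required sample size scales with the dimension and the strong convexity parameter, which this scalar toy does not exhibit.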

Contribution and related work
The SA and SAA approaches. This paper is inspired by the work [54], where it is stated that the SA approach outperforms the SAA approach for a certain class of convex stochastic problems. Our aim is to show that for the Wasserstein barycenter problem this superiority can be inverted. We provide a detailed comparison by stating the complexity bounds for implementations of the SA and SAA approaches for the Wasserstein barycenter problem. As a byproduct, we also construct a confidence interval for the barycenter defined w.r.t. entropy-regularized OT. Sample size. We also estimate the sample size of measures needed to calculate an approximation of the Fréchet mean of a probability distribution with a given precision.
Consistency and rates of convergence. The consistency of the empirical barycenter as an estimator of the true Wasserstein barycenter (defined by the notion of the Fréchet mean) as the number of measures tends to infinity was studied in many papers, e.g. [12,52,57,63], under some conditions on the process generating the measures. Moreover, the authors of [15] provide the rate of this convergence, but under a restrictive assumption on the process (it must come from an admissible family of deformations, i.e. be a gradient of a convex function). Without any assumptions on the generating process, the rate of convergence was obtained in [11], however, only for measures with one-dimensional support. For some specific types of metrics and measures, the rates of convergence were also provided in the works [17,37,49]. Our results are obtained under the condition of discreteness of the measures. We can always achieve this condition through additional preprocessing (discretization of measures).
Penalization of the barycenter problem. For a general convex (but not strongly convex) optimization problem, the empirical minimization may fail for the offline (SAA) approach despite the guaranteed success of an online (SA) approach if no regularization is introduced [68]. The limitations of the SAA approach in the non-strongly convex case are also discussed in [39,70]. Our contribution includes introducing a new regularization for the population Wasserstein barycenter problem that improves the complexity bounds compared to the standard penalty (squared-norm penalty) [68]. This regularization relies on the Bregman divergence and the prox-function introduced in [6].

Preliminaries
Notation. Let $\Delta_n = \{a \in \mathbb{R}^n_+ \mid \sum_{l=1}^n a_l = 1\}$ be the probability simplex. We refer to the $j$th component of vector $x_i$ as $[x_i]_j$. The notation $[n]$ means $\{1, 2, \dots, n\}$. For two vectors $x, y$ of the same size, $x \odot y$ and $x / y$ stand for the element-wise product and element-wise division respectively. When functions, such as log or exp, are used on vectors, they are always applied element-wise. For some norm $\|\cdot\|$ on a space $X$, we define the dual norm $\|\cdot\|_*$ on the dual space $X^*$ in the usual way: $\|s\|_* = \max_{x\in X}\{\langle x, s\rangle : \|x\| \le 1\}$. We denote by $I_n$ the identity matrix and by $0_{n\times n}$ the zero matrix. For a positive semidefinite matrix $A$, we denote its smallest positive eigenvalue by $\lambda^+_{\min}(A)$. We use the notation $O(\cdot)$ to indicate complexity hiding constants; to hide also logarithmic factors, we use $\tilde O(\cdot)$.

Definition 1.3:
The Fenchel–Legendre conjugate of a function $f : X \to \mathbb{R}$ is
$$f^*(s) = \max_{x\in X}\left\{\langle s, x\rangle - f(x)\right\}.$$

Paper organization
The structure of the paper is as follows. In Section 2, we give a background on the SA and SAA approaches for strongly convex problems and derive preliminary results; Section 3 extends them to non-strongly convex problems. Section 4 presents the comparison of the SA and SAA approaches for the problem of the Wasserstein barycenter defined w.r.t. regularized optimal transport distances. Finally, Section 5 gives the comparison of the SA and SAA approaches for the problem of the Wasserstein barycenter defined w.r.t. (unregularized) optimal transport distances.

Strongly convex optimization problem
We start with preliminary results stated for a general stochastic strongly convex optimization problem of the form
$$\min_{x\in X} F(x) \triangleq \mathbb{E} f(x, \xi), \qquad (11)$$
where $f(x,\xi)$ is $\gamma$-strongly convex with respect to $x$. Let us define $x_* = \arg\min_{x\in X} F(x)$.

The SA approach: stochastic gradient descent
The classical SA algorithm for problem (11) is the stochastic gradient descent (SGD) method. We consider SGD with an inexact oracle $g_\delta(x,\xi)$ such that
$$\|g_\delta(x,\xi) - \nabla f(x,\xi)\|_2 \le \delta. \qquad (12)$$
Then the iterative formula of SGD can be written as
$$x^{k+1} = \Pi_X\!\left(x^k - \eta_k\, g_\delta(x^k, \xi^k)\right), \qquad k = 1, 2, \dots, N. \qquad (13)$$
Here $x^1 \in X$ is a starting point, $\Pi_X$ is the projection onto $X$, and $\eta_k$ is a stepsize. For $f(x,\xi)$ $\gamma$-strongly convex in $x$, the stepsize can be taken as $\eta_k = \frac{1}{\gamma k}$ to obtain the optimal rate $O\!\left(\frac{1}{\gamma N}\right)$. A good indicator of the success of an algorithm is the regret
$$\mathcal{R}_N = \sum_{k=1}^N \left( f(x^k, \xi^k) - f(x_*, \xi^k) \right).$$
It measures the difference between the decisions made and the optimal decision over all the rounds. The work [46] gives a bound on the excess risk of the output of an online algorithm in terms of the average regret.
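The update (13) with $\eta_k = \frac{1}{\gamma k}$ can be sketched as follows (our toy illustration, not the paper's setting): a strongly convex quadratic objective $f(x,\xi) = \frac12\|x-\xi\|_2^2$ over a box, with a small bounded perturbation playing the role of the oracle inexactness $\delta$ in (12). The projection onto a box is a simple clip.

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma, delta, N = 5, 1.0, 1e-3, 5000
mu = np.array([0.3, 1.4, -0.2, 0.8, 0.5])        # E[xi]; minimizer is clip(mu, 0, 1)

x = np.zeros(d)
x_bar = np.zeros(d)
for k in range(1, N + 1):
    xi = mu + rng.normal(0, 0.1, size=d)
    g = x - xi                                   # grad of f(x, xi) = 0.5*||x - xi||^2
    g += delta * rng.uniform(-1, 1, size=d)      # bounded oracle inexactness, as in (12)
    x = np.clip(x - g / (gamma * k), 0.0, 1.0)   # projected step (13), eta_k = 1/(gamma*k)
    x_bar += (x - x_bar) / k                     # averaged output hat{x}_N

print(np.round(x_bar, 2))                        # close to clip(mu, 0, 1)
```

The averaged iterate $\hat x_N$ approaches the constrained minimizer; the bias induced by $\delta$ is of order $\delta D$, matching the role $\delta$ plays in the excess-risk bound below.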
For the update rule (13) with $\eta_k = \frac{1}{\gamma k}$, this theorem can be specified as follows. Let $\hat x_N = \frac{1}{N}\sum_{k=1}^N x^k$ be the average of the outputs generated by iterative formula (13) with $\eta_k = \frac{1}{\gamma k}$. Then, with probability at least $1 - \alpha$, the excess risk $F(\hat x_N) - F(x_*)$ is bounded in terms of $M$, $\gamma$, $N$, $\alpha$, and the inexactness term $\delta D$, where $D = \max_{x', x'' \in X}\|x' - x''\|_2$ and $\delta$ is defined by (12).

Proof:
The proof mainly relies on Theorem 2.1 and on estimating the regret for iterative formula (13). Adding and subtracting the term $\langle g_\delta(x^k, \xi^k), x_* - x^k\rangle$, and using the Cauchy–Schwarz inequality together with (12), we bound each regret term. From the update rule (13) for $x^{k+1}$ we obtain a recursion for $\|x^k - x_*\|_2^2$. Summing it from $k = 1$ to $N$ with $\eta_k = \frac{1}{\gamma k}$, and using the Lipschitz continuity of $f(x,\xi)$ w.r.t. $x$ (which implies $\|\nabla f(x,\xi)\|_2 \le M$ for all $x \in X$, $\xi \in \Xi$) together with the inequality $2ab \le a^2 + b^2$ for all $a, b$, we bound the regret in (15)–(16); the last bound follows from the sum of the harmonic series. Then for (16) we can use Theorem 2.1. Firstly, we simplify it by rearranging the terms using $\sqrt{ab} \le \frac{a+b}{2}$. Then we substitute (16) into this inequality, make the change $\alpha = 4\beta\log N$, and obtain the statement with probability at least $1 - \alpha$.

Preliminaries on the SAA approach
The SAA approach replaces the objective in (11) with its sample average
$$\hat F(x) \triangleq \frac{1}{m}\sum_{i=1}^m f(x, \xi_i). \qquad (17)$$
Let us define the empirical minimizer of (17): $\hat x = \arg\min_{x\in X}\hat F(x)$. The next theorem gives a bound on the excess risk for problem (17) in the SAA approach.
Then, with probability at least 1 − α the following holds: The proof of this theorem mainly relies on the following theorem.
where m is the sample size.

Proof:
For any $x \in X$, the following holds: From Theorem 2.4, with probability at least $1 - \alpha$, the following holds: Then from this and (19) we have, with probability at least $1 - \alpha$: From the Lipschitz continuity of $f(x,\xi)$ it follows that for any $x \in X$, $\xi \in \Xi$ the following holds: Taking the expectation of this inequality w.r.t. $\xi$, we get: Then we use Jensen's inequality ($g(\mathbb{E}(Y)) \le \mathbb{E}g(Y)$ for the expectation, a convex function $g$ and a random variable $Y$). Since the absolute value is a convex function, we get: Thus we have: From the strong convexity of $f(x,\xi)$ in $x$, it follows that the average of the $f(x,\xi_i)$'s, that is $\hat F(x)$, is also $\gamma$-strongly convex in $x$. Thus we get, for any $x \in X$, $\xi \in \Xi$: By using (21) and (22) and taking $x = \hat x_\varepsilon$ in (20), we get the first statement of the theorem. Then from the strong convexity we have (24). Equating (23) to $\varepsilon$, we get the expressions for the sample size $m$ and the auxiliary precision $\varepsilon'$. Substituting both of these expressions into (24), we finish the proof.

Non-strongly convex optimization problem
Now we consider a non-strongly convex optimization problem of the form
$$\min_{x\in X} F(x) \triangleq \mathbb{E} f(x, \xi), \qquad (25)$$
where $f(x,\xi)$ is Lipschitz continuous in $x$. Let us define $x_* = \arg\min_{x\in X} F(x)$.

The SA approach: stochastic mirror descent
We consider stochastic mirror descent (MD) with an inexact oracle [34,44,54].^3 For a prox-function $d(x)$ and the corresponding Bregman divergence $B_d(x, x^1)$, the proximal mirror descent step is
$$x^{k+1} = \arg\min_{x\in X}\left\{\eta\,\langle g_\delta(x^k, \xi^k), x\rangle + B_d(x, x^k)\right\}. \qquad (26)$$
We consider the simplex setup: prox-function $d(x) = \langle x, \log x\rangle$. Here and below, functions such as log or exp are always applied element-wise. The corresponding Bregman divergence is given by the Kullback–Leibler divergence
$$B_d(x, y) = \mathrm{KL}(x, y) = \sum_{i=1}^n [x]_i \log\frac{[x]_i}{[y]_i}.$$
Then the starting point is taken as $x^1 = \arg\min_{x\in\Delta_n} d(x)$ (the uniform distribution), and $\hat x_N = \frac{1}{N}\sum_{k=1}^N x^k$ is the average of the outputs generated by iterative formula (26).

Proof: For MD with prox-function $d(x) = \langle x, \log x\rangle$, the following holds for any $x \in \Delta_n$ [44, Equation 5.13]: Then, by adding and subtracting the terms $\langle \nabla F(x^k), x - x^k\rangle$ and $\langle \nabla f(x^k, \xi^k), x - x^k\rangle$ in this inequality and using the Cauchy–Schwarz inequality, we get the following: Then, using the convexity of $F$, we have: We use this in (27) and sum over $k$, where we used $\mathrm{KL}(x_*, x^1) \le R^2$ and $\max_{k=1,\dots,N}\|x^k - x_*\|_1 \le D$. Then, using the convexity of $F$ and the definition of the output $\hat x_N$ in (28), we have: Next we use the Azuma–Hoeffding inequality [45] and get, for all $\beta \ge 0$: Thus, using (30) in (29), we have that with probability at least $1 - \alpha$: Then, expressing $\beta$ through $\alpha$ and substituting $\eta = \frac{R}{M_\infty}\sqrt{\frac{2}{N}}$ into (31) (such $\eta$ minimizes the r.h.s. of (31)), we get: Using $R^2 = \log n$ and $D = 2$ in this inequality, we obtain: We raise this to the second power, use that $2\sqrt{ab} \le a + b$ for all $a, b \ge 0$, and then extract the square root. We obtain the following:
$$\left(3\sqrt{\log n} + 4\sqrt{\log(1/\alpha)}\right)^2 = 9\log n + 16\log(1/\alpha) + 24\sqrt{\log n\,\log(1/\alpha)} \le 18\log n + 32\log(1/\alpha).$$
Using this for (32), we get the statement of the theorem
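On the simplex with the entropic prox-function, the mirror step (26) reduces to a multiplicative-weights update $x^{k+1} \propto x^k \odot \exp(-\eta\, g)$. A minimal sketch (our toy stochastic linear objective, not a result from the paper; the constant $M_\infty$ bounds $\|g\|_\infty$ by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 10, 20000
c = rng.uniform(0.0, 1.0, size=n)            # E[grad]; min of <c, x> over the simplex

x = np.full(n, 1.0 / n)                      # uniform start = argmin of the entropy prox
x_bar = np.zeros(n)
M_inf = 1.5                                  # bound on ||g||_inf
eta = np.sqrt(2 * np.log(n) / N) / M_inf     # stepsize ~ (R/M_inf) sqrt(2/N), R^2 = log n
for k in range(1, N + 1):
    g = c + rng.uniform(-0.5, 0.5, size=n)   # stochastic subgradient, ||g||_inf <= 1.5
    x = x * np.exp(-eta * g)                 # entropic mirror step (26)
    x /= x.sum()
    x_bar += (x - x_bar) / k                 # averaged output hat{x}_N

gap = c @ x_bar - c.min()                    # suboptimality of the averaged iterate
print(gap)
```

The observed gap is of order $M_\infty\sqrt{2\log n / N}$, matching the dimension-free (up to $\log n$) rate that motivates the simplex setup in this section.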

Penalization in the SAA approach
In this section, we study the SAA approach for the non-strongly convex problem (25). We regularize this problem with a 1-strongly convex penalty function $r(x, x^1)$:
$$\min_{x\in X} F_\lambda(x) \triangleq \mathbb{E} f(x, \xi) + \lambda\, r(x, x^1), \qquad (33)$$
and we prove that the sample sizes in the SA and SAA approaches will be equal up to logarithmic terms. The empirical counterpart of problem (33) is
$$\min_{x\in X} \hat F_\lambda(x) \triangleq \frac{1}{m}\sum_{i=1}^m f(x, \xi_i) + \lambda\, r(x, x^1). \qquad (34)$$
Let us define $\hat x_\lambda = \arg\min_{x\in X}\hat F_\lambda(x)$. The next lemma proves the statement from [68] on the boundedness of the population sub-optimality in terms of the square root of the empirical sub-optimality.

Lemma 3.2: Let f(x, ξ) be convex and M-Lipschitz continuous w.r.t. the $\ell_2$-norm. Then for any
$x \in X$, with probability at least $1 - \alpha$, the following holds: where $x^*_\lambda = \arg\min_{x\in X} F_\lambda(x)$, $M_\lambda \triangleq M + \lambda R^2$ and $R^2 = r(x_*, x^1)$. The penalized objective $f(x,\xi) + \lambda r(x, x^1)$ is also Lipschitz continuous with constant $M_\lambda \triangleq M + \lambda R^2$. From Jensen's inequality for the expectation, with the absolute value as a convex function, we get that $F_\lambda(x)$ is also $M_\lambda$-Lipschitz continuous. From the $\lambda$-strong convexity of the penalized $f(x,\xi)$, we obtain that $\hat F_\lambda(x)$ is also $\lambda$-strongly convex. From this and (35) it follows: For any $x \in X$ and $x^*_\lambda = \arg\min_{x\in X} F_\lambda(x)$ we consider: From [68, Theorem 6], we have with probability at least $1 - \alpha$: Using this and (36) in (37), we obtain with probability at least $1 - \alpha$ the statement of the lemma. The next theorem shows how to eliminate the linear dependence on $n$ in the sample size of the regularized SAA approach for a non-strongly convex objective (see estimate (6)), and estimates the auxiliary precision for the regularized SAA problem (8).

Theorem 3.3: Let f(x, ξ) be convex and M-Lipschitz continuous w.r.t. x, and let $\hat x_\varepsilon$ be such that
To satisfy $F(\hat x_\varepsilon) - F(x_*) \le \varepsilon$, where $R^2 = r(x_*, x^1)$, the precision $\varepsilon'$ is defined as stated in the theorem, where we used the definition of $\hat x_\varepsilon$ from the statement of this theorem. Then we subtract $F(x_*)$ on both sides of (38). We then use an inequality that holds for any $x \in X$; from this, (39), and the definition of $F_\lambda(\hat x_\varepsilon)$ in (33), we get (40). Assuming $M \ge \lambda R^2$ and choosing $\lambda = \varepsilon/(2R^2)$ in (40), we get the following: Equating the first and the second terms on the r.h.s. of (41) to $\varepsilon/4$, we obtain the remaining statements of the theorem, including $F(\hat x_\varepsilon) - F(x_*) \le \varepsilon$.

Fréchet mean with respect to entropy-regularized optimal transport
In this section, we consider the problem of finding the population barycenter of discrete measures. We define the population barycenter of distribution $P$ with respect to entropy-regularized optimal transport distances:
$$\min_{p\in\Delta_n} W_\gamma(p) \triangleq \mathbb{E}_{q\sim P}\, W_\gamma(p, q). \qquad (42)$$

Remark 1 (Connection with the ρ-Wasserstein distance):
When $C_{ij} = d(x_i, x_j)^\rho$ for $\rho \ge 1$, where $d(x_i, x_j)$ is a distance on support points $x_i, x_j$ of space $X$, then $W(p,q)^{1/\rho}$ is known as the ρ-Wasserstein distance.
Nevertheless, all the results of this paper are based only on the assumption that the matrix $C \in \mathbb{R}^{n\times n}_+$ is symmetric and non-negative. Following [18], we introduce the entropy-regularized optimal transport problem
$$W_\gamma(p, q) = \min_{\pi\in U(p,q)}\left\{\langle C, \pi\rangle - \gamma E(\pi)\right\}, \qquad (44)$$
where $\gamma > 0$ and $E(\pi) \triangleq -\langle \pi, \log\pi\rangle$ is the entropy. Since $E(\pi)$ is 1-strongly concave on $\Delta_{n^2}$ in the $\ell_1$-norm, the objective in (44) is $\gamma$-strongly convex with respect to $\pi$ in the $\ell_1$-norm on $\Delta_{n^2}$, and hence problem (44) has a unique optimal solution. Moreover, $W_\gamma(p,q)$ is $\gamma$-strongly convex with respect to $p$ in the $\ell_2$-norm on $\Delta_n$ [8, Theorem 3.4].
One particular advantage of the entropy-regularized optimal transport is a closed-form representation for its dual function [1,19], defined by the Fenchel–Legendre transform of $W_\gamma(p,q)$ as a function of $p$:
$$W^*_{\gamma,q}(u) = \max_{p\in\Delta_n}\left\{\langle u, p\rangle - W_\gamma(p,q)\right\} = \gamma\left(E(q) + \langle q, \log(K\beta)\rangle\right), \qquad (45)$$
where $\beta = \exp(u/\gamma)$, $K = \exp(-C/\gamma)$ and $[q]_j$ is the $j$th component of vector $q$. Functions such as log or exp are always applied element-wise to vectors. Hence, the gradient of the dual function $W^*_{\gamma,q}(u)$ is also represented in closed form [19]
$$\nabla W^*_{\gamma,q}(u) = \beta \odot \left(K\cdot(q/(K\beta))\right) \in \Delta_n, \qquad (46)$$
where the symbols $\odot$ and $/$ stand for the element-wise product and element-wise division respectively. The dual representation of $W_\gamma(p,q)$ is
$$W_\gamma(p, q) = \max_{u,\nu\in\mathbb{R}^n}\left\{\langle u, p\rangle + \langle \nu, q\rangle - \gamma\sum_{i,j=1}^n \exp\!\left(\frac{-C_{ij} + u_i + \nu_j}{\gamma} - 1\right)\right\}. \qquad (47)$$
Any solution $(u^*, \nu^*)$ of (47) gives a subgradient of $W_\gamma(p,q)$ [58, Proposition 4.6]:
$$\nabla_p W_\gamma(p, q) = u^*. \qquad (48)$$
We consider $u^*$ and $\nu^*$ such that $\langle u^*, \mathbf 1\rangle = 0$ and $\langle \nu^*, \mathbf 1\rangle = 0$ ($u^*$ and $\nu^*$ are determined up to an additive constant). The next theorem [8] describes the Lipschitz continuity of $W_\gamma(p,q)$ in $p$ on the probability simplex $\Delta_n$ restricted to
$$\Delta^\rho_n = \left\{p \in \Delta_n : \min_{i\in[n]} p_i \ge \rho\right\},$$
where $0 < \rho < 1$ is an arbitrarily small constant.

Theorem 4.1 ([8, Theorem 3.4, Lemma 3.5]):
• For any $q \in \Delta_n$, $p \in \Delta_n$, $W_\gamma(p,q)$ is $\gamma$-strongly convex w.r.t. $p$ in the $\ell_2$-norm. • For any $q \in \Delta_n$, $p \in \Delta^\rho_n$ and $0 < \rho < 1$, $W_\gamma(p,q)$ is Lipschitz continuous w.r.t. $p$. In what follows, we use the Lipschitz continuity of $W_\gamma(p,q)$ and $W(p,q)$ for measures from $\Delta_n$, keeping in mind that adding some noise and normalizing the measures makes them belong to $\Delta^\rho_n$. We also notice that if the measures are from the interior of $\Delta_n$, then their barycenter is also from the interior of $\Delta_n$.

The SA approach: stochastic gradient descent
For problem (42), as a particular case of problem (4), the stochastic gradient descent method can be used. From Equation (48), it follows that an approximation of the gradient of $W_\gamma(p,q)$ with respect to $p$ can be calculated by the Sinkhorn algorithm [3,26,58] through computing the dual variable $u$ with $\delta$-precision. Here the notation $\nabla^\delta_p W_\gamma(p,q)$ means an inexact stochastic subgradient of $W_\gamma(p,q)$ with respect to $p$. Algorithm 3 combines stochastic gradient descent given by iterative formula (13) with $\eta_k = \frac{1}{\gamma k}$, the Sinkhorn algorithm (Algorithm 1), and Algorithm 2 performing the projection onto the simplex $\Delta_n$.
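The scheme behind Algorithm 3 can be sketched as follows. This is our simplification, not the paper's implementation: plain Sinkhorn iterations supply the inexact potential, the standard sort-based Euclidean projection onto the simplex plays the role of Algorithm 2, and a small mixing with the uniform measure keeps the iterate in the interior of the simplex (in the spirit of $\Delta^\rho_n$); the grid, the measures and all parameters are ours.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based, O(n log n))."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def entropic_ot_grad(p, q, K, gamma, n_iter=200):
    """Inexact subgradient of W_gamma(p, q) w.r.t. p: the centered dual potential."""
    a = np.ones_like(p)
    for _ in range(n_iter):
        b = q / (K.T @ a)
        a = p / (K @ b)
    u = gamma * np.log(a)
    return u - u.mean()                          # center so that <u, 1> = 0

rng = np.random.default_rng(3)
n, gamma, N = 16, 0.2, 600
x = np.linspace(0, 1, n)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / gamma)

p = np.full(n, 1.0 / n)
p_bar = np.zeros(n)
for k in range(1, N + 1):
    center = rng.uniform(0.3, 0.7)               # a random measure q ~ P
    q = np.exp(-((x - center) ** 2) / 0.02); q /= q.sum()
    g = entropic_ot_grad(p, q, K, gamma)
    p = project_simplex(p - g / (gamma * k))     # SGD step (13), eta_k = 1/(gamma*k)
    p = 0.999 * p + 0.001 / n                    # stay in the interior (cf. Delta_rho)
    p_bar += (p - p_bar) / k                     # averaged output

print(p_bar.sum(), (x * p_bar).sum())
```

Since the sampled measures are symmetric around the grid center, the averaged iterate concentrates near a symmetric barycenter; its mean support point stays close to 0.5.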

Proof:
We estimate the co-domain (image) of $W_\gamma(p,q)$: $W_\gamma(p,q): \Delta_n\times\Delta_n \to [-2\gamma\log n, \|C\|_\infty]$. Then we apply Theorem 2.2 with $B = \|C\|_\infty$ and $D = \max_{p',p''\in\Delta_n}\|p' - p''\|_2 = \sqrt 2$, and we get: Equating each term on the r.h.s. of this equality to $\varepsilon/2$ and using $M = O(\sqrt n\|C\|_\infty)$, we get the expressions for $N$ and $\delta$. The statement $\|\hat p_N - p^*_\gamma\|_2 \le \sqrt{2\varepsilon/\gamma}$ follows directly from the strong convexity of $W_\gamma(p,q)$ and $W_\gamma(p)$.
The proof of the algorithm complexity follows from the complexity of the Sinkhorn algorithm. To state the complexity of the Sinkhorn algorithm, we first define $\tilde\delta$ as the accuracy in function value of the inexact solution $u$ of the maximization problem in (47). Using this, we formulate the number of iterations of the Sinkhorn algorithm [16,29,50,75]. The number of iterations for the accelerated Sinkhorn algorithm can be improved [40]. Here the accuracy is measured in the function value of the expression $\langle u, p\rangle + \langle \nu, q\rangle - \gamma\sum_{i,j=1}^n\exp((-C_{ij} + u_i + \nu_j)/\gamma - 1)$ under the maximum in (47). From the strong convexity of this objective on the subspace orthogonal to the eigenvector $\mathbf 1_n$ corresponding to its zero eigenvalue, it follows that (52) holds, where $\varkappa \triangleq \lambda^+_{\min}(\nabla^2 W^*_{\gamma,q}(u^*))$. From [8, Proposition A.2], for the eigenvalues of $\nabla^2 W^*_{\gamma,q}(u^*)$ it holds that $0 = \lambda_n(\nabla^2 W^*_{\gamma,q}(u^*)) < \lambda_k(\nabla^2 W^*_{\gamma,q}(u^*))$ for all $k = 1, \dots, n-1$. Inequality (52) holds due to $\nabla^\delta_p W_\gamma(p,q) := u$ in Algorithm 3 and $\nabla_p W_\gamma(p,q) = u^*$ in (48). Multiplying both of the estimates (50) and (51) by the cost of one Sinkhorn iteration, we obtain the total complexity. Next, we study the practical convergence of projected stochastic gradient descent (Algorithm 3). Using the fact that the true Wasserstein barycenter of one-dimensional Gaussian measures has a closed-form expression for its mean and variance [22], we study the convergence to the true barycenter of generated truncated Gaussian measures. Figure 1 illustrates the convergence in the 2-Wasserstein distance within 40 s.

The SAA approach
The empirical counterpart of problem (42) is the (empirical) Wasserstein barycenter problem
$$\min_{p\in\Delta_n} \frac{1}{m}\sum_{i=1}^m W_\gamma(p, q_i), \qquad (53)$$
where $q_1, q_2, \dots, q_m$ are realizations of a random variable with distribution $P$.
For instance, $\hat p_\varepsilon$ can be calculated by the IBP algorithm [7] or the accelerated IBP algorithm [40]. The next theorem specializes Theorem 2.3 to the Wasserstein barycenter problem (53).

The total complexity of the accelerated IBP algorithm computing $\hat p_\varepsilon$ is
Proof: From Theorem 2.3 we get the first statement of the theorem. From [40] we have the complexity of the accelerated IBP. Substituting the expression for $m$ and the expression for $\varepsilon'$ from Theorem 2.3 into this equation, we get the final statement of the theorem and finish the proof.
Next, we study the practical convergence of the Iterative Bregman Projections algorithm on truncated Gaussian measures. Figure 2 illustrates the convergence of the barycenter calculated by the IBP algorithm to the true barycenter of Gaussian measures in the 2-Wasserstein distance within 10 s. For the convergence to the true barycenter w.r.t. the 2-Wasserstein distance in the SAA approach, we refer to [15]; however, considering the convergence in the $\ell_2$-norm (Theorem 4.3) allows one to obtain a better convergence rate in comparison with the bounds for the 2-Wasserstein distance.
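For reference, the IBP iterations of [7] in scaling form can be sketched as follows (our minimal NumPy sketch, not the paper's implementation; the grid, weights and parameters are ours). Each transport plan is $\pi_i = \mathrm{diag}(u_i) K\, \mathrm{diag}(v_i)$, the $v_i$ enforce the marginals $q_i$, the barycenter is a weighted geometric mean, and the $u_i$ enforce the common barycenter marginal.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, gamma = 16, 5, 0.2
x = np.linspace(0, 1, n)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / gamma)

# m random measures q_1..q_m (columns of Q)
Q = np.exp(-((x[:, None] - rng.uniform(0.3, 0.7, m)[None, :]) ** 2) / 0.02)
Q /= Q.sum(axis=0)
w = np.full(m, 1.0 / m)                                 # uniform weights

U = np.ones((n, m))
for _ in range(500):
    V = Q / (K.T @ U)                                   # match the marginals q_i
    p = np.exp((w * np.log(U * (K @ V))).sum(axis=1))   # geometric mean -> barycenter
    U = p[:, None] / (K @ V)                            # match the barycenter marginal

print(p.sum(), (x * p).sum())
```

At convergence, $p$ is a probability vector whose mean support point is close to the average of the input means, consistent with the behaviour of entropic barycenters.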

Comparison of the SA and the SAA for the WB problem
Now we compare the complexity bounds for the SA and the SAA implementations solving problem (42). We skip the high-probability details since we can fix $\alpha$ (say, $\alpha = 0.05$) in all the bounds. Moreover, based on [68], we assume that in fact all the bounds of this paper have a logarithmic dependence on $\alpha$, which is hidden in $\tilde O(\cdot)$ [28,48]. Table 1 presents the total complexity of the numerical algorithms implementing the SA and the SAA approaches.

Table 1. Total complexity of the SA and the SAA implementations for $\min_{p\in\Delta_n}\mathbb{E}_q W_\gamma(p,q)$ (rows: Projected SGD for the SA approach; the accelerated IBP for the SAA approach).

When $\gamma$ is not too large, the complexity in the first row of the table (Projected SGD) depends on $\varkappa \triangleq \lambda^+_{\min}(\nabla^2 W^*_{\gamma,q}(u^*))$ and is typically bigger than the SAA complexity, namely when $\varkappa \lesssim \gamma/n$. Hereby, the SAA approach may outperform the SA approach provided that the regularization parameter $\gamma$ is not too large.
From the practical point of view, the SAA implementation converges much faster than the SA implementation. Executing the SAA algorithm in a distributed manner only enhances this superiority, since in the case when the objective is not Lipschitz smooth, a distributed implementation of the SA approach is not possible. This is the case for the Wasserstein barycenter problem: indeed, the objective is Lipschitz continuous but not Lipschitz smooth.

Fréchet mean with respect to optimal transport
Now we are interested in finding a Fréchet mean with respect to optimal transport:
$$\min_{p\in\Delta_n} W(p) \triangleq \mathbb{E}_q\, W(p, q). \qquad (55)$$

The SA approach with regularization: stochastic gradient descent
The next theorem explains how a solution of the strongly convex problem (42) approximates a solution of problem (55), where $p^*$ is a solution of (55).

Proof:
The proof of this theorem follows from Theorem 4.2 and the following bound [33,50,58]:
$$W_\gamma(p) \le W(p) \le W_\gamma(p) + 2\gamma\log n \quad \text{for all } p \in \Delta_n,$$
where $p^* = \arg\min_{p\in\Delta_n} W(p)$. The choice $\gamma = \frac{\varepsilon}{4\log n}$ ensures that solving problem (42) with $\varepsilon/2$ precision yields a solution of problem (55) with $\varepsilon$ precision. When $\gamma$ is not too large, Algorithm 3 uses the accelerated Sinkhorn algorithm (instead of the Sinkhorn algorithm). Thus, using $\gamma = \frac{\varepsilon}{4\log n}$ and assuming that $\varepsilon$ is small, we get the complexity according to the statement of the theorem.

The SA approach: stochastic mirror descent
Now we propose an approach to solve problem (55) without any additional regularization. The approach is based on mirror descent given by the iterative formula (26). We use the simplex setup, which provides a closed-form solution for (26). Algorithm 4 presents the application of mirror descent to problem (55), where the gradient of $W(p^k, q^k)$ can be calculated exactly through the dual representation of optimal transport [58] by any LP solver:
$$W(p, q) = \max_{u,\nu\in\mathbb{R}^n}\left\{\langle u, p\rangle + \langle\nu, q\rangle : u_i + \nu_j \le C_{ij},\ i, j \in [n]\right\}. \qquad (56)$$
Then
$$\nabla_p W(p, q) = u^*,$$
where $u^*$ is a solution of (56) such that $\langle u^*, \mathbf 1\rangle = 0$. The next theorem estimates the complexity of Algorithm 4, whose main step is: calculate $\nabla_{p^k} W(p^k, q^k)$ by solving the dual LP with any LP solver. The total complexity of Algorithm 4 is stated in the theorem. Proof: From Theorem 3.1 and the fact that $\nabla_{p^k} W(p^k, q^k)$ can be calculated exactly by any LP solver, we take $\delta = 0$ in (57) and get the first statement of the theorem. The second statement of the theorem directly follows from this and the condition of the theorem. To get the complexity bound, we notice that the complexity of calculating $\nabla_{p^k} W(p^k, q^k)$ is $\tilde O(n^3)$ [2,20,23,32]; multiplying this by $N = O(\|C\|_\infty^2 R^2/\varepsilon^2)$ with $R^2 = \mathrm{KL}(p_*, p^1) \le \log n$, we get the last statement of the theorem.
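The exact subgradient $u^*$ in (56) can be read off the equality-constraint duals of the OT linear program. A sketch using SciPy's HiGHS interface (our illustration: the paper does not prescribe a particular LP solver, and the `eqlin.marginals` attribute, the sensitivities of the objective to the right-hand side, is SciPy-specific):

```python
import numpy as np
from scipy.optimize import linprog

def ot_subgradient(p, q, C):
    """Exact subgradient of W(p, q) w.r.t. p via the dual of the OT linear program."""
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0         # row sums:    pi 1   = p
        A_eq[n + i, i::n] = 1.0                  # column sums: pi^T 1 = q
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    u = res.eqlin.marginals[:n]                  # dual potentials for the p-constraints
    return res.fun, u - u.mean()                 # center: <u, 1> = 0

n = 6
x = np.linspace(0, 1, n)
C = np.abs(x[:, None] - x[None, :])              # 1-Wasserstein cost on a 1-D grid
p = np.full(n, 1.0 / n)
q = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
W, u = ot_subgradient(p, q, C)
print(W, u)
```

For 1-D supports with cost $|x_i - x_j|$, the optimal value can be cross-checked against the $\ell_1$ distance between the CDFs scaled by the grid spacing, which is how the test below validates the sketch.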
Next we compare the SA approaches with and without regularization of optimal transport in problem (55). Entropic regularization of optimal transport leads to strong convexity of the regularized optimal transport in the $\ell_2$-norm; hence, the Euclidean setup should be used. The regularization parameter $\gamma = \frac{\varepsilon}{4\log n}$ ensures an $\varepsilon$-approximation of the unregularized solution. In this case, we use stochastic gradient descent with Euclidean projection onto the simplex, since it converges faster for a strongly convex objective. For the non-regularized problem we can exploit the simplex prox structure: indeed, we can apply stochastic mirror descent with the simplex setup (the Kullback–Leibler divergence as the Bregman divergence) with Lipschitz constant $M_\infty = O(\|C\|_\infty)$, which is $\sqrt n$ times better than the Lipschitz constant in the Euclidean norm, $M = O(\sqrt n\|C\|_\infty)$. We studied the convergence of stochastic mirror descent (Algorithm 4) and stochastic gradient descent (Algorithm 3) in the 2-Wasserstein distance within $10^4$ iterations (processing of $10^4$ probability measures). Figure 3 confirms the better convergence of stochastic mirror descent over projected stochastic gradient descent, as stated by their theoretical complexity (Theorems 5.1 and 5.2).

The SAA approach
Similarly to the SA approach, we provide the proper choice of the regularization parameter $\gamma$ in the SAA approach so that the solution of the strongly convex problem (42) approximates a solution of the convex problem (55).

Proof:
The proof follows from Theorem 4.3 and the proof of Theorem 5.1 with γ = ε/(4 log n).

Penalization of the WB problem
For the population Wasserstein barycenter problem, we construct a penalty function that is 1-strongly convex in the $\ell_1$-norm, based on a Bregman divergence. We consider the following prox-function [6]:
$$d(p) = \frac{1}{2(a-1)}\|p\|_a^2, \qquad a = 1 + \frac{1}{\log n},$$
which is 1-strongly convex in the $\ell_1$-norm. Then the Bregman divergence $B_d(p, p^1)$ generated by $d$ is 1-strongly convex w.r.t. $p$ in the $\ell_1$-norm and $\tilde O(1)$-Lipschitz continuous in the $\ell_1$-norm on $\Delta_n$. One of the advantages of this penalization compared to the negative-entropy penalization proposed in [5,10] is that we get an upper bound on the Lipschitz constant, while the strong convexity in the $\ell_1$-norm on $\Delta_n$ remains the same. Moreover, this penalization contributes to a better wall-clock time complexity than quadratic penalization [10], since the Lipschitz constant of $W(p,q)$ with respect to the $\ell_1$-norm is $\sqrt n$ times better than with respect to the $\ell_2$-norm, while the constants are equal up to a logarithmic factor. The regularized SAA problem is as follows:
$$\min_{p\in\Delta_n}\frac{1}{m}\sum_{i=1}^m W(p, q_i) + \lambda\, B_d(p, p^1).$$
The next theorem is a particular case of Theorem 3.3 for the population WB problem (55) with $r(p, p^1) = B_d(p, p^1)$. To satisfy $W(\hat p_\varepsilon) - W(p^*) \le \varepsilon$ with probability at least $1 - \alpha$, we need to take $\lambda = \varepsilon/(2R_d^2)$ and the sample size $m$ as stated in the theorem. The precision $\varepsilon'$ is defined as in Theorem 3.3. The total complexity of Mirror Prox computing $\hat p_\varepsilon$ is stated in the theorem.

Proof: The proof is based on a saddle-point reformulation of the WB problem, which we now explain. First, we rewrite the OT as [42]
$$W(p, q) = \min_{x\ge 0:\ Ax = b}\langle d, x\rangle,$$
where $b = (p^\top, q^\top)^\top$, $d$ is the vectorized cost matrix $C$, $x$ is the vectorized transport plan, and $A \in \{0,1\}^{2n\times n^2}$ is an incidence matrix. Then we reformulate the WB problem as a saddle-point problem [25]: $\min_{p\in\Delta_n,\, x\in\mathcal X}\max_y f(x, p, y)$. The gradient operator for $f(x, p, y)$ is defined by (62), where
$$[\nabla d(p)]_i = \frac{1}{a-1}\|p\|_a^{2-a}[p]_i^{a-1}.$$
To get the complexity of MP, we use the same reasoning as in [25] with (62).
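Mirror Prox itself is easiest to see on a plain bilinear saddle-point problem on two simplices with entropic prox-functions (our toy matrix game, far simpler than the WB saddle-point reformulation above; the payoff matrix and stepsize rule are ours): each iteration takes an extrapolation step and then an update step with the operator evaluated at the extrapolated point.

```python
import numpy as np

rng = np.random.default_rng(5)
nx, ny, N = 8, 6, 2000
A = rng.uniform(-1, 1, (ny, nx))                 # payoff matrix: min_x max_y <y, A x>
eta = 0.5 / np.abs(A).max()                      # stepsize ~ 1/L for the bilinear operator

def prox(z, g):
    """Entropic prox (mirror) step on the simplex: z <- z * exp(-eta*g), renormalized."""
    z = z * np.exp(-eta * g)
    return z / z.sum()

x = np.full(nx, 1.0 / nx); y = np.full(ny, 1.0 / ny)
x_bar = np.zeros(nx); y_bar = np.zeros(ny)
for k in range(1, N + 1):
    u = prox(x, A.T @ y); v = prox(y, -(A @ x))  # extrapolation step
    x = prox(x, A.T @ v); y = prox(y, -(A @ u))  # update with operator at (u, v)
    x_bar += (u - x_bar) / k; y_bar += (v - y_bar) / k

gap = (A @ x_bar).max() - (A.T @ y_bar).min()    # duality gap of the averaged iterates
print(gap)
```

The duality gap of the averaged iterates decays at the $O(1/N)$ rate characteristic of Mirror Prox, which is the property exploited for the penalized SAA problem here.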

Comparison of the SA and the SAA for the WB problem
Now we compare the complexity bounds for the SA and SAA implementations solving problem (55). We also skip the high-probability details since, in fact, all the bounds of this paper have a logarithmic dependence on $\alpha$ [28,48]. Table 2 presents the total complexity of the numerical algorithms. For the SA algorithms, which are Stochastic MD and Projected SGD, we can conclude the following: the non-regularized approach (Stochastic MD) uses the simplex prox structure and gets better complexity bounds; indeed, the Lipschitz constant in the $\ell_1$-norm setup is $M_\infty = O(\|C\|_\infty)$, whereas the Lipschitz constant in the Euclidean norm is $M = O(\sqrt n\|C\|_\infty)$. The practical comparison of Stochastic MD (Algorithm 4) and Projected SGD (Algorithm 3) can be found in Figure 3.
For the SAA approaches (the accelerated IBP and Mirror Prox with the specific penalization), we conclude the following: the entropy-regularized approach (accelerated IBP) has a better dependence on $\varepsilon$ than the penalized approach (Mirror Prox with the specific penalization), but a worse dependence on $n$.
Using the Dual Extrapolation method for the WB problem [25] instead of Mirror Prox, we can omit the $\sqrt n$ factor in the last row of Table 2. One of the main advantages of the SAA approach is the possibility to perform it in a decentralized manner, in contrast to the SA approach, which cannot be executed in a decentralized manner, or even in a distributed or parallel fashion, for a non-smooth objective [36]. This is the case for the Wasserstein barycenter problem: indeed, the objective is Lipschitz continuous but not Lipschitz smooth.

Notes
1. When $C_{ij} = d(x_i, x_j)^\rho$ in (2) for $\rho \ge 1$, where $d(x_i, x_j)$ is a distance on support points $x_i, x_j$, then $W(p,q)^{1/\rho}$ is known as the ρ-Wasserstein distance. Nevertheless, all the results of this paper are based only on the assumption that the matrix $C\in\mathbb{R}^{n\times n}_+$ is symmetric and non-negative. Thus, the optimal transport problem defined in (2) is more general than the Wasserstein distances. 2. The estimate $n^2\min\{\tilde O(1/\varepsilon), \tilde O(\sqrt n)\}$ is the best known theoretical estimate for solving the OT problem [14,42,53,59]. The best known practical estimates are $\sqrt n$ times worse (see [40] and references therein).
3. By using the dual averaging scheme [55], we can rewrite Algorithm 4 in an online regime [41,56] without including $N$ in the stepsize policy. Note that mirror descent and the dual averaging scheme are very close to each other [43].

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The