Efficient Discretization of Optimal Transport

Obtaining solutions to optimal transport (OT) problems is typically intractable when the marginal spaces are continuous. Recent research has focused on approximating continuous solutions with discretization methods based on i.i.d. sampling, which converge as the sample size increases. However, obtaining OT solutions with large sample sizes requires intensive computational effort, which can be prohibitive in practice. In this paper, we propose an algorithm for calculating discretizations of the marginal distributions with a given number of weighted points by minimizing the (entropy-regularized) Wasserstein distance, and we provide bounds on the performance. The results suggest that our plans are comparable to those obtained with much larger numbers of i.i.d. samples and are more efficient than existing alternatives. Moreover, we propose a local, parallelizable version of such discretizations for applications, which we demonstrate by approximating adorable images.


Introduction
Optimal transport (OT) is the problem of finding a coupling of probability distributions that minimizes cost [1], and it is applied across a wide range of fields [2,3]. Although many methods exist for obtaining optimal transference plans for distributions on discrete spaces, computing such plans is generally not possible for continuous spaces [4]. Given the prevalence of continuous spaces in machine learning, this is a significant limitation for both theoretical and practical applications.
One strategy for approximating continuous OT plans is based on discrete approximation via sample points. Recent research has provided guarantees on the fidelity of discrete, sample-location-based approximations for continuous OT as the sample size N → ∞ [5]. Specifically, by sampling large numbers of points S_i from each marginal, one may compute a discrete optimal transference plan on S_1 × S_2, with the cost matrix derived from the pointwise evaluation of the cost function on S_1 × S_2.
Even in the discrete case, obtaining minimal cost plans is computationally challenging. For example, Sinkhorn scaling, which computes an entropy-regularized approximation for OT plans, has a complexity that scales with |S_1 × S_2| [6]. Although many comparable methods exist [7], all of them have a complexity that scales with the product of sample sizes, and they require the construction of a cost matrix that also scales with |S_1 × S_2|.
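To make this scaling concrete, the following is a minimal dense Sinkhorn sketch (our illustration, not the paper's implementation): both the kernel construction and every iteration touch all entries of the |S_1| × |S_2| matrix, which is exactly the cost discussed above.

```python
import numpy as np

def sinkhorn(mu, nu, C, zeta, n_iter=1000, tol=1e-9):
    """Entropy-regularized OT between discrete marginals mu (size N) and nu
    (size M) with cost matrix C (N x M) and regularizer zeta.  Both the
    kernel K and every iteration touch all N*M entries: the |S_1 x S_2|
    scaling discussed in the text."""
    K = np.exp(-C / zeta)                 # Gibbs kernel, N x M
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)                # scale to match column marginals
        u_new = mu / (K @ v)              # scale to match row marginals
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    v = nu / (K.T @ u)                    # final column scaling
    return u[:, None] * K * v[None, :]    # transference plan
```

The returned plan has marginals mu (rows) and nu (columns) up to the stopping tolerance.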
We have developed methods for optimizing both sampling locations and weights for small-N approximations of OT plans (see Figure 1). In Section 2, we formulate the fixed-size approximation problem and reduce it to discretization problems on the marginals with theoretical guarantees. In Section 3, the gradient of the entropy-regularized Wasserstein distance between a continuous distribution and its discretization is derived. In Section 4, we present a stochastic gradient descent algorithm based on the optimization of the locations and weights of the points, with empirical demonstrations. Section 5 introduces a parallelizable algorithm via decompositions of the marginal spaces, which reduces the computational complexity by exploiting intrinsic geometry. In Section 6, we analyze the time and space complexity of the algorithms.

To properly compute W^k_k(γ_λ(µ, ν), γ_λ(µ_m, ν_n)), a metric d_{X×Y} on X × Y is needed. We expect d_{X×Y} restricted to X-slices or Y-slices to be compatible with d_X or d_Y, respectively; furthermore, we may assume that there exists a constant A > 0 such that

max{d_X^k(x_1, x_2), d_Y^k(y_1, y_2)} ≤ d_{X×Y}^k((x_1, y_1), (x_2, y_2)) ≤ A (d_X^k(x_1, x_2) + d_Y^k(y_1, y_2)).  (2)

For instance, (2) holds when d_{X×Y} is the p-product metric for 1 ≤ p ≤ ∞.

Remark 1.
The regularizing parameters (λ and ζ above) introduce smoothness, together with an error term, into the OT problem. For an accurate approximation, we need λ and ζ to be as small as possible. However, when the parameters become too small, the matrices to be normalized in the Sinkhorn algorithm overflow or underflow the numerical data types (32-bit or 64-bit floating-point numbers). Thus, the threshold value for the regularizing constant is proportional to the k-th power of the diameter of the supported region. In this work, we control the value (mainly of ζ) carefully; it ranges from 10^−4 to 0.01 in the different examples when the diameter is 1.
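As an illustration of the numerical issue (our sketch, not the paper's code): for diameter-1 data and ζ = 10^−4, the kernel entry e^{−1/ζ} = e^{−10000} underflows to exactly 0.0 in float64. A standard remedy, when a larger ζ is not acceptable, is to run Sinkhorn in the log domain on the dual potentials:

```python
import numpy as np

def logsumexp_rows(M):
    """Row-wise log(sum(exp(M))), computed stably by shifting the max."""
    m = M.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(M - m).sum(axis=1, keepdims=True))).ravel()

def sinkhorn_log(mu, nu, C, zeta, n_iter=500):
    """Sinkhorn in the log domain: iterate on dual potentials f, g rather than
    normalizing exp(-C/zeta) directly, so a small zeta no longer underflows."""
    f, g = np.zeros_like(mu), np.zeros_like(nu)
    for _ in range(n_iter):
        f += zeta * (np.log(mu) - logsumexp_rows((f[:, None] + g[None, :] - C) / zeta))
        g += zeta * (np.log(nu) - logsumexp_rows(((f[:, None] + g[None, :] - C) / zeta).T))
    return np.exp((f[:, None] + g[None, :] - C) / zeta)  # transference plan

# note: np.exp(-1.0 / 1e-4) == 0.0 in float64, so the naive kernel vanishes
```

The plan itself stays representable because f + g − C is small wherever the plan carries mass.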

Gradient of the Objective Function
Let ν = ∑_{i=1}^m w_i δ_{y_i} be a discrete probability measure in the position of "µ_m" in the last section. For a fixed (continuous) µ, the objective now is to obtain a discrete target ν* = argmin_ν W^k_{k,ζ}(µ, ν). In order to apply stochastic gradient descent (SGD) to both the positions {y_i}_{i=1}^m and their weights {w_i}_{i=1}^m to reach ν*, we now derive the gradient of W^k_{k,ζ}(µ, ν) with respect to ν, following the discrete discussions of [13,14]. The SGD on X is either derived through an exponential map or by treating X as (part of) a Euclidean space.
Let g(x, y) := d_X^k(x, y), and denote the joint distribution minimizing W^k_{k,ζ} by π, with the differential form at (x, y_i) written dπ_i(x), as used to define W^k_{k,ζ} in Section 2. Introducing Lagrange multipliers α ∈ L^∞(X) and β ∈ R^m, we have W^k_{k,ζ}(µ, ν) = max_{α,β} L(µ, ν; α, β), where, up to an additive constant,

L(µ, ν; α, β) = ∫_X α(x) dµ(x) + ∑_{i=1}^m β_i w_i − ζ ∑_{i=1}^m ∫_X w_i E_i(x) dµ(x),  with E_i(x) := e^{(α(x)+β_i−g(x,y_i))/ζ}

(cf. [5]). Let α*, β* be the argmax. Since L(µ, ν; α, β) = L(µ, ν; α + t, β − t) for any t ∈ R, the representative with β_m = 0, which is equivalent to β (as well as β*), is denoted by β (similarly β*) below, in order to obtain uniqueness and make the differentiation possible. With the transference plan dπ_i(x) = w_i E*_i(x) dµ(x) and the derivatives of α*, β*, and g(x, y_i) calculated, the gradient of W^k_{k,ζ} can be assembled by direct differentiation. Assume that g is Lipschitz and differentiable almost everywhere (for k ≥ 1 and d_X the Euclidean distance in R^d, differentiability fails only when k = 1 and y_i = x), so that ∇_y g(x, y) can be calculated. The derivatives of α* and β* can then be obtained from the Implicit Function Theorem for Banach spaces (see [15]). The maximality of L at (α*, β*) gives N := ∇_{α,β} L|_{(α*,β*)} = 0 ∈ (L^∞(X) × R^{m−1})^∨; that is, the Fréchet derivative vanishes. Differentiating again (in the sense of Fréchet) with respect to (α, β) and to y_i, w_i, respectively, we obtain ∇_{(α,β)} N as a bilinear functional on L^∞(X) × R^{m−1} (note that, in Equation (6), the index i of dπ_i cannot be m). The bilinear functional ∇_{(α,β)} N is invertible, and we denote its inverse by M, viewed as a bilinear form on (L^∞(X) × R^{m−1})^∨. The last ingredient for the Implicit Function Theorem is ∇_{w_i,y_i} N; then ∇_{w_i,y_i}(α*, β*) = M(∇_{w_i,y_i} N), and the gradient ∇_{w_i,y_i} W^k_{k,ζ} follows.
Moreover, given better differentiability of g(x, y), we can differentiate Equations (4)-(8) again to obtain the Hessian matrix of W^k_{k,ζ} in the w_i's and y_i's (which may enable Newton's method, or a mixture of Newton's method and minibatch SGD, to accelerate convergence). More details about the claims, calculations, and proofs are provided in Appendix B.
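As a concrete (and deliberately simplified) illustration, the leading envelope-theorem term of the position gradient, ∇_{y_i} ≈ ∫ ∇_y g(x, y_i) dπ_i(x), can be sketched as follows for g the squared Euclidean distance. The correction terms through α*, β* that this section derives via the Implicit Function Theorem are omitted here, and all function names are ours:

```python
import numpy as np

def entropic_plan(mu_w, x, nu_w, y, zeta, n_iter=2000):
    """Sinkhorn plan between weighted point clouds (mu_w at x) and (nu_w at y)
    for the cost g(x, y) = ||x - y||^2 (i.e., k = 2 with Euclidean d_X)."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / zeta)
    u = np.ones_like(mu_w)
    for _ in range(n_iter):
        v = nu_w / (K.T @ u)
        u = mu_w / (K @ v)
    return u[:, None] * K * v[None, :]

def grad_positions(mu_w, x, nu_w, y, zeta):
    """Leading (envelope-theorem) term of the position gradient:
    grad_{y_i} ~ sum_x pi(x, y_i) * grad_y g(x, y_i) = sum_x pi * 2 (y_i - x)."""
    P = entropic_plan(mu_w, x, nu_w, y, zeta)
    return 2.0 * (P.sum(axis=0)[:, None] * y - P.T @ x)
```

For a single atom placed at the barycenter of the source mass, this gradient vanishes, as one would expect.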

The Discretization Algorithm
Here, we provide a description of an algorithm for the efficient discretization of optimal transport (EDOT) from a distribution µ to µ_m, where the integer m is the given cardinality of the support. In general, µ need not be explicitly accessible, and, even if it is, computing the exact transference plan is not feasible. Therefore, in this construction, we assume that µ is given in terms of a random sampler, and we apply minibatch stochastic gradient descent (SGD), drawing a set of N independent samples from µ at each step to approximate µ.
At each step, we need: (1) π_{X,ζ}, the EOT transference plan between µ and µ_m; (2) the cost g = d_X^k on X; and (3) its gradient in the second variable, ∇_y g. We calculate the gradients with µ replaced by the empirical measure µ_N as an estimate, whose effectiveness (convergence as N → ∞) is proved in [5].
We call this discretization algorithm the Simple EDOT algorithm. The pseudocode is stated in the Appendix C.
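A compact sketch of one such SGD step is given below (ours, not the paper's Algorithm A1; its exact gradients include the IFT correction terms, whereas here the positions follow only the envelope-theorem term and the weights follow the centered dual potential, both simplifying assumptions of this sketch):

```python
import numpy as np

def edot_step(sampler, y, w, zeta, lr, N=100, iters=200):
    """One minibatch SGD step of a simplified EDOT sketch.

    Positions follow the envelope-theorem term of the gradient, and weights
    follow the centered dual potential; both are simplifications of the
    paper's exact IFT-based gradients."""
    x = sampler(N)                                # fresh i.i.d. minibatch from mu
    mu_w = np.full(N, 1.0 / N)
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # cost g = d^2
    K = np.exp(-C / zeta)
    u = np.ones(N)
    for _ in range(iters):                        # Sinkhorn scaling
        v = w / (K.T @ u)
        u = mu_w / (K @ v)
    P = u[:, None] * K * v[None, :]               # EOT plan between mu_N, mu_m
    # position update: grad_{y_i} ~ sum_x P[x, i] * 2 (y_i - x)
    y = y - lr * 2.0 * (P.sum(0)[:, None] * y - P.T @ x)
    # weight update: first-order signal is the dual potential beta = zeta*log v,
    # centered so that the step stays (approximately) on the simplex
    beta = zeta * np.log(v)
    w = w - lr * (beta - w @ beta)
    w = np.clip(w, 1e-6, None)
    return y, w / w.sum()
```

Iterating this step while decaying the learning rate gives the flavor of the simple EDOT loop.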
Proposition 2 (Convergence of the Simple EDOT). The simple EDOT generates a sequence (µ^{(i)}_m) in the compact set X^m × ∆. If the set of limit points of (µ^{(i)}_m) does not intersect X^m × ∂∆, then (µ^{(i)}_m) converges to a stationary point in X^m × Int(∆), where Int(·) denotes the interior.
In simulations, we fixed k = 2 to reduce the computational complexity and fixed the regularizer ζ = 0.01 for X of diameter 1, scaling it proportionally to diam(X)^k (see the next section). Such a choice of ζ is not only small enough to reduce the error between the EOT estimate W_{k,ζ} and the true W_k, but also ensures that e^{−g(x,y)/ζ} and its byproducts in the Sinkhorn iterations remain distinguishable from 0 in double precision.
Let N = 100 for all plots in this section. Figure 2a-c plots the discretizations (µ_m) for Examples (1)-(3) with m = 5, 5, and 7, respectively. Figure 2f illustrates the convergence rate of W^k_{k,ζ}(µ, µ_m) versus the SGD steps for Example (2), with µ_m obtained by a 5-point EDOT. Figure 2d,e plot the entropy-regularized Wasserstein W^k_{k,ζ}(µ, µ_m) versus m, comparing EDOT and naive sampling for Examples (1) and (2). Here, the µ_m's are: (a) from the EDOT with 3 ≤ m ≤ 7 in Example 1 and 3 ≤ m ≤ 8 in Example 2, shown by ×'s in the figures; and (b) from naive sampling, simulated by a Monte Carlo experiment of 20,000 trials for each size from 3 to 200. Figure 2d,e demonstrate the effectiveness of the EDOT: as indicated by the orange horizontal dashed line, even the 5-point EDOT discretization in these two examples outperformed 95% of the naive samplings of size 40, as well as 75% of the naive samplings of size over 100 (the orange dash-dot lines).

Methods of Improvement
I. Adaptive EDOT: The computational cost of the simple EDOT increases with the dimensionality and diameter of the underlying space. Discretization with a large m is needed to capture higher-dimensional distributions, which results in an increase in the parameters for calculating the gradient of W^k_{k,ζ}: md for the positions y_i and m − 1 for the weights w_i. Such an increase not only raises the complexity of each step but also requires more steps for the SGD to converge. Furthermore, the calculation has a higher complexity (O(mN) for each normalization in Sinkhorn).
We propose to reduce the computational complexity using a "divide and conquer" approach. The Wasserstein distance takes the k-th power of the distance function, d_X^k, as its cost function. The locality of the distance d_X makes the solution to the OT/EOT problem local, meaning that probability mass is more likely to be transported to a close destination than to a remote one. Thus, we can "divide and conquer": cut the space X into small cells and solve the discretization problem independently on each.
To develop a "divide and conquer" algorithm, we need: (1) an adaptive dividing procedure that is able to partition X = X 1 · · · X I , which balances the accuracy and computational intensity among the cells; (2) to determine the discretization size m i and choose a proper regularizer ζ i for each cell X i . The pseudocodes for all variations are shown in the Appendix C Algorithms A2 and A3.
Choosing the size m_i: An appropriate choice of m_i balances the contributions to the Wasserstein distance among the subproblems, as follows. Let X_i be a manifold of dimension d, let diam(X_i) be its diameter, and let p_i = µ(X_i) be the probability of X_i. The entropy-regularized Wasserstein distance of a size-m_i discretization of X_i can then be estimated in terms of p_i, diam(X_i), and m_i [16,17]. To balance each point's contribution to the Wasserstein distance among the divided subproblems, we set the m_i accordingly.

Occupied volume (Variation 1): A cell could be too vast (e.g., large in size with few points in a corner), resulting in a larger m_i than needed. To fix this, we may replace diam(X_i) above with Vol(X_i)^{1/d}, where Vol(X_i) is the occupied volume calculated by counting the number of nonempty cells at a certain resolution (levels in the previous binary division). The algorithm (Variation 1) builds a binary tree to resolve and obtain the occupied volume of each cell, followed by a tree traversal to assign the m_i.
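One plausible way to realize such an allocation (our simplification, not the paper's exact balancing rule) is to score each cell by p_i · diam(X_i)^k as a proxy for its Wasserstein contribution and hand out the budget proportionally:

```python
import numpy as np

def assign_cell_sizes(p, diam, m_total, k=2):
    """Split a total budget of m_total atoms across cells, proportionally to
    an estimated per-cell Wasserstein contribution ~ p_i * diam_i**k.
    The proportional rule is our simplification of the paper's balancing."""
    score = np.asarray(p) * np.asarray(diam) ** k
    raw = m_total * score / score.sum()
    m = np.floor(raw).astype(int)
    # hand out the leftover atoms to the cells with the largest remainders
    for i in np.argsort(raw - m)[::-1][: m_total - m.sum()]:
        m[i] += 1
    return m
```

Replacing diam with the occupied volume Vol(X_i)^{1/d} (Variation 1) only changes the score.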
Adjusting the regularizer ζ: In computing W^k_{k,ζ}, Sinkhorn scaling is applied to e^{−g(x,y)/ζ}. Therefore, ζ should scale with d_X^k to ensure that the transference plan is not affected by a rescaling of d_X. Precisely, we may choose ζ_i = diam(X_i)^k ζ_0 for some constant ζ_0.
The division: In theory, any refinement procedure that proceeds iteratively and eventually drives the diameter of each cell to 0 can be used for the division. In our simulations, we used an adaptive kd-tree-style cell refinement in a Euclidean space R^d. Let X be embedded into R^d within an axis-aligned rectangular region. We choose an axis x_l in R^d and evenly split the region along a hyperplane orthogonal to x_l (e.g., cut the square [0, 1]^2 along the line x = 0.5), thereby constructing X_1 and X_2. Given the sample set S, we split it into two sample sets S_1 and S_2 according to which subregion each sample lies in. The corresponding m_i and ζ_i can then be calculated as discussed above. Thus, two cells and their corresponding subproblems are constructed. If some m_i is still too large, the cell is cut along another axis into two further cells. The full list of cells and subproblems is constructed recursively. In addition, another cutting method (Variation 2), which chooses the sparsest point as the cutting point via a sliding window, is sometimes useful in practice.
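A minimal sketch of this recursive midpoint division follows (ours; the paper's Algorithm A2 additionally tracks p_i and ζ_i per cell, with ζ_i = diam(X_i)^k ζ_0 as above):

```python
import numpy as np

def divide(samples, m, m_star, bounds, axis=0):
    """Recursively split an axis-aligned cell at the midpoint of `axis` until
    each cell's point budget is at most m_star (kd-tree-style refinement).
    Budgets are split in proportion to the sample mass in each half."""
    if m <= m_star:
        return [(samples, m, bounds)]
    lo, hi = bounds[axis]
    mid = 0.5 * (lo + hi)
    left = samples[samples[:, axis] <= mid]
    right = samples[samples[:, axis] > mid]
    p_left = len(left) / max(len(samples), 1)
    m_left = int(round(m * p_left))            # budget split by sample mass
    b_left = [list(b) for b in bounds]; b_left[axis] = [lo, mid]
    b_right = [list(b) for b in bounds]; b_right[axis] = [mid, hi]
    nxt = (axis + 1) % samples.shape[1]        # cycle through the axes
    cells = []
    if m_left > 0:
        cells += divide(left, m_left, m_star, b_left, nxt)
    if m - m_left > 0:
        cells += divide(right, m - m_left, m_star, b_right, nxt)
    return cells
```

Each returned cell then becomes an independent EDOT subproblem.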
After obtaining the set of subproblems, we apply the EDOT in each cell and combine the resulting solutions µ^{(i)}_{m_i} into a discretization of µ. Figure 3b shows the optimal discretization for the example in Figure 2c with m = 30, obtained by applying the EDOT with adaptive cell refinement and ζ = 0.01 × diam^2.
II. On embedded CW complexes: Although the samples on a space X are usually represented as vectors in R^d, inducing an embedding X → R^d, the space X often has its own structure as a CW complex (or simply a manifold) with a more intrinsic metric. Thus, if the CW complex structure is known, even piecewise, we may apply the refinement on X with respect to its own metric, whereas direct discretization as a subset of R^d may result in low representational efficiency.
We now illustrate the adaptive EDOT by an example on a mixture normal distribution of a sphere mapped through stereographic projection. More examples of a truncated normal mixture over a Swiss roll and the discretization of a 2D optimal transference plan are detailed in the Appendix D.5.
On the sphere: The underlying space X_sphere is the unit sphere in R^3, and µ_sphere is the pushforward of a normal mixture distribution on R^2 under stereographic projection. The sample set S_sphere ∼ µ_sphere over X_sphere is shown in Figure 4 (left); here we consider the (3D) Euclidean metric on X_sphere induced by the embedding. To consider the intrinsic metric instead, a CW complex was constructed with a point on the equator as the 0-cell, the rest of the equator as a 1-cell, and the upper and lower hemispheres as two open 2-cells. We mapped the upper and lower hemispheres onto a unit disk through stereographic projection with respect to the south and north poles, respectively. Then, we took the metric from spherical geometry and rewrote the distance function and its gradient in the natural coordinates on the unit disk. Figure 4b shows the refinement of the EDOT on the samples (in red) and the corresponding discretizations as colored points. More figures can be found in the Appendices.

Analysis of the Algorithms
In this section, we derive the complexity of the simple EDOT and the adaptive EDOT. In particular, we show the following.

Proposition 3. The simple EDOT has time complexity O((N + m)^2 mdL + NmL log(1/ε)) and space complexity O((N + m)^2), where N is the minibatch size (used to construct µ_N at each step to approximate µ), d is the dimension of X, L is the maximal number of SGD iterations, and ε is the error bound in the Sinkhorn calculation for the entropy-regularized optimal transference plan between µ_N and µ_m.
Proposition 3 quantitatively shows that, when the adaptive EDOT is applied, the total complexities (in time and space) are reduced, because the magnitudes of both N and m are much smaller in each cell.
The procedure of dividing the sample set S into subsets in the adaptive EDOT is similar to Quicksort; thus, the space and time complexities are similar. The similarity comes from the binary divide-and-conquer structure, as well as from the fact that each split compares every sample against a pivot.

Remark 2.
The complexity is the same as that of Quicksort. The set of N_0 sample points in the algorithm is treated as the "true" distribution in the adaptive EDOT, since, in the later EDOT steps for each cell, no further samples are taken, as it is hard for a sampler to produce a sample within a given cell.
Postprocessing of the adaptive EDOT has O(m) complexity in both time and space.

Remark 3.
For the two algorithm variations in Section 5, the occupied-volume estimation works in the same way as the original preprocessing step and has the same time complexity by itself (since division can only happen after the occupied volumes of all cells are known). With the tree built, however, the original preprocessing becomes a tree traversal with (additional) time complexity O(N_0) and (additional) space complexity O(N_0) for storing the occupied volumes. For details on choosing cut points with a sliding window, see Appendix C.5.
Comparison with naive sampling: Given a size-m discretization of X and a size-n discretization of Y, the EOT solution (Sinkhorn algorithm) has time complexity O(mn log(1/ε)). In the EDOT, two discretization problems must be solved before applying Sinkhorn, while naive sampling requires nothing but sampling.
According to Proposition 3, solving a single continuous EOT problem using a size-m simple EDOT may incur a higher time complexity than naive sampling with an even larger sample size N (larger than m). However, unlike the EDOT, which only requires access to distance functions d_X and d_Y on X and Y, respectively, naive sampling needs a known cost function c : X × Y → R. In real applications, the cost may come from real-world experiments (or from extra computations) performed for each pair (x, y) in the discretization; the size of the discretized distribution is then critical for cost control, whereas d_X and d_Y usually come along with the spaces X and Y and are easy to compute. The EDOT offers a further advantage when the marginal distributions µ_X and ν_Y are fixed across different cost functions: the discretizations can be reused, so the cost of discretization is paid once, while the improvement it brings accumulates with each reuse.
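This reuse pattern can be sketched as follows (illustrative names; the hard-coded weighted supports here are stand-ins for EDOT outputs): compute the discretizations once, then re-run only the Sinkhorn step as the cost function changes.

```python
import numpy as np

def plan_for_cost(w, v, C, zeta, iters=500):
    """Sinkhorn plan between two fixed discretizations with weights w, v,
    for an arbitrary precomputed cost matrix C."""
    K = np.exp(-C / zeta)
    u = np.ones_like(w)
    for _ in range(iters):
        s = v / (K.T @ u)
        u = w / (K @ s)
    return u[:, None] * K * s[None, :]

# the discretizations (supports and weights) are computed once; only the
# cost matrix changes per application
x = np.array([[0.0], [1.0]]); w = np.array([0.5, 0.5])
y = np.array([[0.2], [0.8]]); v = np.array([0.4, 0.6])
for cost in (lambda a, b: np.abs(a - b), lambda a, b: (a - b) ** 2):
    C = cost(x[:, None, 0], y[None, :, 0])
    P = plan_for_cost(w, v, C, zeta=0.05)
```

Only the cheap Sinkhorn step is repeated; the expensive discretization step is amortized.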

Related Work and Discussion
Our original problem was the optimal transport problem between general distributions as samplers (instead of integration oracles). We translated that into a discretization problem and an OT problem between discretizations.
I. Comparison with other discretization methods: Several other methods in the literature generate discrete distributions from arbitrary distributions; they are obtained via semi-continuous optimal transport, where the calculation of a weighted Voronoi diagram is needed. Calculating weighted Voronoi diagrams usually requires (1) that the cost function be a squared Euclidean distance and (2) the application of Delaunay triangulation, which is expensive in more than two dimensions. Furthermore, semi-continuous discretization may only optimize one of the two aspects, the positions or the weights of the atoms; this process is mainly based on [18] (optimized positions) and [19] (optimized weights).
We mainly compared against the prior work of [18], which focuses on the barycenter of a set of distributions under the Wasserstein metric and produces a discrete distribution called a Lagrangian discretization, of the form (1/m) ∑_{i=1}^m δ_{x_i} [2]. Other works, such as [20,21], find barycenters but do not create a discretization. Refs. [19,22] studied the discrete estimation of the 2-Wasserstein distance, locating discrete points through the clustering algorithm k-means++ and a weighted Voronoi diagram refinement, respectively; they then assigned weights, yielding non-Lagrangian discretizations. Ref. [19] (compared in Figure 5) roughly follows a "divide-and-conquer" approach in selecting positions, but the positions are not tuned directly according to the Wasserstein distance. Ref. [22] converges as the number of discrete points increases; however, it lacks a criterion (such as the Wasserstein distance in the EDOT) showing that its choice is not merely one among all possible converging algorithms but a special one.
By projecting the gradient in the SGD onto the tangent space of the submanifold, or, equivalently, by fixing the learning rate on the weights to zero, the EDOT can estimate a Lagrangian discretization (denoted EDOT-Equal). A comparison among the methods is carried out on a map of the Canary Islands, shown in Figure 6. This example shows that our method can match the Lagrangian discretizations in the literature, while, in general, this type of EDOT can work better.
Moreover, the EDOT can be used to solve barycenter problems. Note that, to apply the adaptive EDOT to barycenter problems, compatible divisions of the target distributions are needed (i.e., a cell A from one target distribution transports entirely onto a discrete subset D, and D transports onto a cell B from another target distribution, etc.).

Figure 5 compares with [19] on a portrait of Riemann, with a discretization of size 625. Left: green dots show the positions and weights of the EDOT discretization (same as right); the cells in the background are a discretization of the same size from the original [19]. Right: a size-10,000 discretization from [19]; we directly applied the EDOT to this picture, treating it as the continuous distribution, with ζ = 0.01 × diam^2.

We also tested these algorithms on discretizing grayscale/color pictures. Figures 7 and 8 compare discretizations of a kitty image with the number of points varying from 10 to 4000 between EDOT, EDOT-Equal, and [18], together with estimates of their Wasserstein distances to the original image. Figure 8 plots the relation between log(W_2) and log m (all with divide and conquer); the advantage of W_EDOT over W_Equal grows with the size of the discretization.
Furthermore, the EDOT may be applied on RGB channels of an image independently, which then combine plots of discretizations in the corresponding color. The results are shown in Figure 1 at the beginning of this paper.
Lagrangian discretization may have a disadvantage in representing repetitive patterns with incompatible discretization points.
In Figure 9, we can see that discretizing 16 objects with 24 points caused local weight incompatibility for the Lagrangian discretization, placing points between objects and increasing the Wasserstein distance. With the EDOT, the weights of points lying outside the blue objects were much smaller, and the patterned structure was better represented. In practice, patterns often occur as part of the data (e.g., pictures of nature), and it is easy to end up with an incompatible number of points in a Lagrangian discretization, since the equal-weight requirement is rigid; consequently, patterns cannot be properly captured.

II. General k and general distance d_X: Our algorithms (simple EDOT, adaptive EDOT, and EDOT-Equal) work for a general choice of the parameter k > 1 and a C^2 distance d_X on X. For example, in Figure 4b, the distance used on each disk was spherical (arc length along the great circle passing through two points), which cannot be isometrically reparametrized onto a plane with the Euclidean metric because of the difference in curvature.
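For instance, the spherical cost and its intrinsic gradient can be written directly on unit vectors (a sketch under our conventions; the paper instead works in disk coordinates via stereographic projection):

```python
import numpy as np

def sphere_cost_and_grad(x, y, k=2.0):
    """Great-circle cost g = d(x, y)**k on the unit sphere and its intrinsic
    (tangential) gradient in y; a sketch of using a non-Euclidean d_X."""
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    d = np.arccos(c)                       # arc length between unit vectors
    # d/dy arccos(x.y) = -x / sqrt(1 - c^2), then remove the radial component
    grad_d = -x / np.sqrt(max(1.0 - c * c, 1e-12))
    grad_d = grad_d - np.dot(grad_d, y) * y
    return d ** k, k * d ** (k - 1) * grad_d
```

A gradient descent step on y with this gradient moves y along the sphere toward x, as the locality discussion in Section 5 requires.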
III. Other possible impacts: As OT problems arise widely in other areas, our algorithm can be applied accordingly, e.g., to the location and size of supermarkets or electrical substations in an area, or even of air conditioners in the rooms of supercomputers. Our divide-and-conquer methods are suitable for such real-world applications.
IV. OT for discrete distributions: Many algorithms have been developed to solve OT problems between two discrete distributions [3]. Linear programming algorithms were developed first, but their application is restricted by high computational complexity. Other methods usually solve problems under particular conditions; e.g., [23] assumes a cost of the form c(x, y) = h(x − y) for some h and applies the "back-and-forth" method, hopping between two forms of the Kantorovich dual problem (on the two marginals, respectively) to obtain a gradient of the total cost over the dual functions. In our work, we chose to apply the EOT developed by [8] to obtain an estimated OT solution of the discrete problem.

Conclusions
We developed methods for efficiently approximating OT couplings with fixed-size m × n approximations. We provided bounds relating the discrete approximation to the original continuous problem, implemented two algorithms, demonstrated their efficacy compared to naive sampling, and analyzed their computational complexity. Our approach provides a new way to efficiently compute OT plans.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: OT, optimal transport; EOT, entropy-regularized optimal transport; EDOT, efficient discretization of optimal transport; SGD, stochastic gradient descent; SK, Sinkhorn.

Appendix B. Gradient of W^k_{k,ζ}
Appendix B.1. The Gradient

Following the definitions and notations in Sections 2 and 3 of the paper, we calculate the gradient of W^k_{k,ζ}(µ, µ_m) with respect to the parameters of µ_m := ∑_{i=1}^m w_i δ_{y_i} in detail.
Since γ is discrete on the second component X and supported on {y_i}_{i=1}^m, we may denote dγ(x, y_i) by dπ_i(x). The Fenchel dual of Problem (A2) is then given by (A5). Let α* and β* be the argmax of the Fenchel dual (A5); the primal is solved by dγ*(x, y_i) = e^{(α*(x)+β*_i−g(x,y_i))/ζ} w_i dµ(x). To make the solution unique, we restrict the freedom of the solution (observe that L(µ, µ_m; α, β) = L(µ, µ_m; α + t, β − t) for any t ∈ R). We use the condition β_m = 0 to narrow the choices down to one, and we again denote the resulting representatives by β and β*.
Next, we calculate the derivatives of α* and β* by finding their defining equation and then using the Implicit Function Theorem. The optimal dual variables α* and β* are obtained by solving the stationarity equation ∇_{α,β} L = 0, where the derivatives are taken in the sense of Fréchet. The Fenchel dual function L(µ, µ_m; ·, ·) is defined on L^∞(X) × R^{m−1}, where E_i(x) = e^{(α(x)+β_i−g(x,y_i))/ζ} is defined as in the paper, ∇_α L(µ, µ_m; α, β) ∈ (L^∞(X))^∨ (as a linear functional), and ∂L(µ, µ_m; α, β)/∂β_i ∈ R. Next, we need to show that L is differentiable in the sense of Fréchet; the last equality follows from a Taylor expansion of the exponential function. Here, ||h||_∞ = ess sup_{x∈X} |h(x)| denotes the essential supremum of |h(x)| over x ∈ X with respect to the measure µ.
This shows that the expression for ∇_α L(α) in Equation (A8) gives the correct Fréchet derivative. Note that α ∈ L^∞(X) is critical in Equation (A12). Let N := ∇_{α,β} L, taking values in (L^∞(X))^∨ × R^{m−1}. Then N = 0 defines α* and β*, which makes it possible to differentiate them with respect to µ_m using the Implicit Function Theorem for Banach spaces. From now on, N is evaluated at α = α*, β = β*, i.e., at the marginal conditions. Thus, we need to calculate ∇_{α,β} N and ∇_{w_i,y_i} N, and to prove that ∇_{α,β} N is invertible (and give its inverse).
From the expression for N, we may differentiate (similarly to the calculations (A11)-(A13)). The operator ∇_{α,β} N, as the Hessian form of L, can be written over the basis {δ(x), e_i}_{x∈X, i<m}. By the inverse of ∇_{α,β} N, we mean the element of Hom_b((L^∞(X))^∨ × (R^{m−1})^∨, L^∞(X) × R^{m−1}) that composes with ∇_{α,β} N (on the left and on the right) to the identities. By the natural identification of the double dual V^∨∨ ≅ V and the tensor-hom adjunction, we can write the inverse of ∇_{α,β} N as a bilinear form again.
According to the block-inverse formula, we set F = D − B^T A^{−1} B, whose invertibility determines the invertibility of the block matrix (A, B; B^T, D).
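For reference, the standard Schur-complement block-inverse formula being used reads:

```latex
\begin{pmatrix} A & B \\ B^{T} & D \end{pmatrix}^{-1}
=
\begin{pmatrix}
A^{-1} + A^{-1} B F^{-1} B^{T} A^{-1} & -A^{-1} B F^{-1} \\
-F^{-1} B^{T} A^{-1} & F^{-1}
\end{pmatrix},
\qquad F = D - B^{T} A^{-1} B .
```

Here F is the Schur complement of A, so inverting the full operator reduces to inverting A and the finite-dimensional block F.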

Consider the Schur complement F. The matrix F is symmetric, of full rank m − 1 (there is no i = m or j = m in F_{ij}), and strictly diagonally dominant; therefore, it is invertible. To see the strict diagonal dominance, apply the marginal conditions ∑_{j=1}^m w_j E_j(x) = 1 and ∫_X w_i E_i(x) dµ(x) = w_i:

∑_{j=1}^m ∫_X w_i w_j E_i(x) E_j(x) dµ(x) = ∫_X w_i E_i(x) ∑_{j=1}^m w_j E_j(x) dµ(x) = ∫_X w_i E_i(x) dµ(x) = w_i.
With all ingredients in formula (A19) known, we can calculate the inverse of ∇_{α,β} N. Following the Implicit Function Theorem, we need ∇_{w_i,y_i} N; each partial derivative is an element of (L^∞(X))^∨ × R^{m−1}.
Note that if we apply the constraint ∑_{i=1}^m w_i = 1 to the w_i's, we may set w_m = 1 − ∑_{i=1}^{m−1} w_i and recalculate the above derivatives. Finally, by the Implicit Function Theorem, ∇_{w_i,y_i}(α*, β*) = M(∇_{w_i,y_i} N), which fits into (A6) and (A7).

Appendix B.2. Second Derivatives
In this part, we calculate the second derivatives of W^k_{k,ζ}(µ, µ_m) with respect to the ingredients of µ_m, i.e., the w_i's and y_i's, with a view toward applying Newton's method to the EDOT (which we have not implemented yet).
Using the previous results, we can further calculate the second derivatives of W^k_{k,ζ} with respect to the w_i's and y_i's by differentiating (A6) and (A7). Once the second derivatives of g(x, y) in the y_i's are available, we need the second derivatives of α* and β* to assemble the second derivatives above. These follow from the formula ∇_{w_i,y_j}(α*, β*) = M(∇_{w_i,y_j} N) and the identity ∇A^{−1} = −A^{−1}(∇A)A^{−1} (the product rule applied to A A^{−1} = id). The last piece we need is ∇_{w_k,y_l} ∇_{w_i,y_j} N|_{α*,β*}, where, in the last expression, k denotes the k-th component of the second part of N (the part about β).

Appendix C. Algorithms
Appendix C.1. Algorithm: Simple EDOT

The following states Algorithm A1, the simple EDOT.

Appendix C.2. Proof of Proposition 2
Remark A1. Convergence to a stationary point in expectation means that the liminf of the expected norm of the gradient over all the sequences considered approaches 0.
Proof. Discrete distributions of size m can be parameterized by X^m × ∆, in terms of ∑_{i=1}^m p_i δ_{y_i} with (p_1, p_2, . . . , p_m) ∈ ∆ and y_i ∈ X.
To make the SGD work, we assume that X is a path-connected subset of R d of dimension d.
(1) The objective W^k_{k,ζ}(µ, µ_m) is C^2 in the parameters of µ_m, with the second derivative calculated in Appendix B.2.
(2) As a consequence of (1), ∇W^k_{k,ζ}(µ, µ_m) is Lipschitz in the parameters of µ_m. (3) The noise has bounded variance: equivalently, we just need to check that Var(||∇W^k_{k,ζ}(µ_N, µ_m)||^2) is finite, where µ_N is the (stochastic) empirical distribution of N samples, and µ_m is a fixed discretization in X^m × ∆ (not necessarily the "optimal" one). Thus, the proposition holds under assumption (*). Now suppose that assumption (*) does not hold. Then, for any sequence ε_1, ε_2, · · · → 0, there always exist infinitely many limit points of (µ^{(i)}_m) lying outside ∆_k for any k > 0. Therefore, we can construct a subsequence of µ^{(i)}_m converging to a point p ∈ X^m × ∂∆, so p is also a limit point. This contradicts the assumption that the set of limit points of (µ^{(i)}_m) does not intersect X^m × ∂∆. The proof is complete.

Appendix C.3. Proof of Proposition 3
Proof. First, consider one iteration of the minibatch SGD. Let N be the sample (minibatch) size of µ_N approximating µ, let m be the size of the target discretization µ_m (the output), let d be the dimension of X, and let ε be the error bound in the Sinkhorn calculation for the entropy-regularized optimal transference plan between µ_N and µ_m. The Sinkhorn algorithm on the positive matrix e^{−g(x,y)/ζ} (of size N × m) converges linearly, taking O(log(1/ε)) steps to fall within a region of radius ε, thus contributing O(Nm log(1/ε)). The inverse M of ∇_{α,β} N (in Equation (6)) is taken block-wise, contributing O(N^2 m). From M to the gradient of the dual variables, the tensor contractions have complexity O((N + m)^2 md). Finally, to get the gradient, the complexity is dominated by the second term of ∇_{y_i} W^k_{k,ζ} (see Equation (5)), which is a contraction of an N × m matrix (i.e., g dπ(x)) with tensors of sizes Nmd and m^2 d (the two gradients of the dual variables α and β) along N and m, respectively. Thus, the final step contributes O((N + m)md).
The time complexity of the increment steps in the SGD turns out to be O(md). Therefore, for L steps of the minibatch SGD, the total time complexity is O((N + m)²mdL + NmL log(1/ε)).
For the space complexity of the simple EDOT, the Sinkhorn algorithm (which can be carried out in place with O(Nm) space) is the only iterative computation within a single SGD step, and between two SGD steps, only the resulting distribution is passed on to the next step. Therefore, the space complexity is O((N + m)²), coming from the matrix M; all other objects are of size at most O(m(N + m)).
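To illustrate the Sinkhorn step that dominates these costs, the following is a minimal dense-matrix sketch, assuming a squared-distance cost g and toy marginals; the paper's exact implementation, batching, and stopping criterion are not reproduced here:

```python
import numpy as np

def sinkhorn(a, b, C, zeta, n_iter=200):
    """Sinkhorn iterations on the Gibbs kernel K = exp(-C / zeta).

    a: weights of the N-point source marginal (sums to 1)
    b: weights of the m-point target marginal (sums to 1)
    C: (N, m) cost matrix, e.g. pairwise g(x, y)
    zeta: entropy-regularization strength
    Returns the entropy-regularized transport plan of shape (N, m).
    """
    K = np.exp(-C / zeta)
    u = np.ones_like(a)
    for _ in range(n_iter):          # linear convergence: O(log(1/eps)) steps
        v = b / (K.T @ u)            # column scaling
        u = a / (K @ v)              # row scaling
    return u[:, None] * K * v[None, :]

# Toy example: 4 source points, 3 target points on the line.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.5, 1.5, 2.5])
C = (x[:, None] - y[None, :]) ** 2   # cost g(x, y) = |x - y|^2
a = np.full(4, 0.25)
b = np.full(3, 1 / 3)
P = sinkhorn(a, b, C, zeta=0.5)
# The marginals of P match a and b up to the Sinkhorn tolerance.
```

Each iteration costs O(Nm), matching the per-step cost used in the complexity count above.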

Appendix C.4. Adaptive Refinement via DFS: Midpoints
The pseudocode for the division algorithm of the adaptive EDOT using KD-tree refinement, cutting at the midpoints, is shown in Algorithm A2. Round_{0.5↑} denotes the rounding method with 0.5 rounded up, and Round_{0.5↓} that with 0.5 rounded down; thus, the discretization points are correctly partitioned between the divided cells.

Appendix C.5. Adaptive Refinement: Variation 1
However, when the mass of µ is not distributed evenly in a cell to be cut, especially if it concentrates in a few small regions, the estimate of W_k by the diameter of a cell becomes far greater than the actual value, thereby assigning far more discretization points to that cell (Figure A1). Thus, we develop the division algorithm of Variation 1 to improve performance in this situation by estimating the Wasserstein distance using the occupied volume of a set of sample points (usually the same sample points used in Algorithm A2): Given a resolution R > 0 (as an upper bound on a cell's volume, one can take Vol(X)/N_0, with N_0 the number of sample points), we keep cutting the region X until each cell either has volume smaller than R or contains no sample points; we then call the total volume of the nonempty cells the occupied volume V_Occ(X). A similar definition applies to each cell X_i. Having the occupied volume of each cell, we may proceed to assign a number of discretization points to each cell.
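The occupied-volume computation just described can be sketched as follows, bisecting the longest side at the midpoint; the resolution R, the sample counts, and the box shapes below are illustrative, not the paper's settings:

```python
import numpy as np

def occupied_volume(points, lo, hi, R):
    """Occupied volume of a point cloud inside the box [lo, hi].

    Recursively bisect the box (longest axis, at the midpoint) until a
    cell is either empty or has volume below the resolution R; the
    occupied volume is the total volume of the nonempty leaf cells.
    """
    if len(points) == 0:
        return 0.0                           # empty cells contribute nothing
    vol = float(np.prod(hi - lo))
    if vol <= R:
        return vol                           # nonempty cell below resolution
    axis = int(np.argmax(hi - lo))           # cut the longest side
    mid = 0.5 * (lo[axis] + hi[axis])
    hi_l = hi.copy(); hi_l[axis] = mid       # left child box
    lo_r = lo.copy(); lo_r[axis] = mid       # right child box
    mask = points[:, axis] <= mid
    return (occupied_volume(points[mask], lo, hi_l, R)
            + occupied_volume(points[~mask], lo_r, hi, R))

# Mass concentrated in one corner of the unit square: V_Occ is close to
# the 1/16 area actually covered, far below the full volume 1.
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 0.25, size=(500, 2))
v = occupied_volume(pts, np.zeros(2), np.ones(2), R=1 / 256)
```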
The only change in division algorithm Variation 1 here is the Wasserstein-estimation step (lines 12 and 13 in Algorithm A2), where the estimated Wasserstein distance W_k of cell i is computed from the occupied volume V_Occ(X_i) instead of the cell's diameter. The assignment of discretization sizes depends on this estimate of the Wasserstein distance; however, in Variation 1 this estimate requires the occupied volume, which is calculated from leaf to root, meaning that the binary tree for the occupied volume has to be built before starting Algorithm A2 with the modified estimate of W_k. Fortunately, since the cutting points in this step must coincide with those of the occupied-volume-calculation step, and the sample points belonging to each cell are needed in both, we may save the partition of sample points produced while building the binary tree for the occupied volume and reuse it when assigning discretization sizes. Therefore, the discretization-size-assignment step works as a traversal of the binary tree built for the occupied-volume calculation (on a subtree with the same root, defined by the stopping conditions in depth along each path). The time complexity of the occupied-volume calculation is again O(N_0 log N_0), as the algorithm works in the same way as Quicksort, and the time complexity of the rest (assigning discretization sizes) is O(m), being a traversal of a tree with at most m leaves (m discretization points in total).
The space complexity is still O(N_0 + m), since after the occupied-volume calculation, the rest only adds constant-size decorations to the subtree with at most m leaves mentioned above.
Appendix C.6. Adaptive Refinement: Variation 2
The "cutting in the middle" method is easy to implement and guarantees that the volume decreases while going deeper in the tree (so the depth needed to get below the resolution is bounded). However, it is also too rigid to fit the natural gaps of the original distribution, which may critically affect the optimal locations of the discretization points.
Our Variation 2 redefines the cutting points from the midpoints along the corresponding axis to the sparsest points. The sparsity is calculated by the moving-window method along an axis/component of the d coordinates; applying the moving-window method means we may have to sort the data points every time (since at each node, the sorting axis/component may differ). Since we still want to control the depth of the tree, a correction must be added to prevent the cutting point from lying too close to the boundaries (usually the term −C/((x − a)^k (x − b)^k), with a and b the boundaries and C > 0 a constant). One consequence is that each cell's volume (not the occupied volume) now has to be calculated from its rectangular boundaries instead of being determined by its depth alone, as before.
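The sparsest-point selection with boundary correction can be sketched as follows; the window size, candidate grid, C, and k below are hypothetical choices for illustration, not the paper's settings:

```python
import numpy as np

def sparsest_cut(xs, a, b, window=0.1, C=1e-4, k=1):
    """Pick a cutting point near the sparsest location in (a, b).

    Sparsity at a candidate x is minus the number of samples inside the
    moving window [x - window/2, x + window/2]; the correction term
    -C / ((x - a)**k * (b - x)**k) penalizes cuts near the boundaries,
    keeping the tree depth under control.
    """
    xs = np.sort(xs)
    candidates = np.linspace(a, b, 201)[1:-1]      # interior grid only
    counts = (np.searchsorted(xs, candidates + window / 2)
              - np.searchsorted(xs, candidates - window / 2))
    score = (-counts.astype(float)
             - C / ((candidates - a) ** k * (b - candidates) ** k))
    return candidates[np.argmax(score)]

# Two well-separated clusters: the chosen cut falls in the gap.
rng = np.random.default_rng(1)
xs = np.concatenate([rng.normal(0.2, 0.03, 300), rng.normal(0.8, 0.03, 300)])
cut = sparsest_cut(xs, 0.0, 1.0)
```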
Thus, the influence on the time complexity is the following: 1. The tree-building step changes to O(N_0² (log N_0)²) in the average case and O(N_0⁴) in the worst case (if Quicksort is applied in each node's moving-window method); and 2. An O(N_0) cost is introduced for calculating the volume at each node of the binary tree. Furthermore, it introduces at most O(N_0) additional space complexity, since each cell's volume has to be stored instead of being computed directly from the depth.
Variation 2 can be applied together with Variation 1, since they are aiming at different parts of the algorithm. An example is shown in Figure A6.

Appendix D. Empirical Parts
Appendix D.1. Estimating W^k_{k,ζ}: Richardson Extrapolation and Others
In the analysis, we may need W^k_{k,ζ}(µ, µ_m) to compare how discretization methods behave. However, when µ is not discrete, we are generally unable to obtain an analytical expression for the Wasserstein distance.
In certain cases, including all examples in this paper, the Wasserstein distance can be estimated from finite samples (with a large size). According to [26], for µ ∈ P(X) in our setup (a probability measure on a compact Polish space with its Borel σ-algebra) and with g = d_X^k ∈ C(X²) a continuous cost function, the Online Sinkhorn method can be used to estimate W^k_{k,ζ}. Online Sinkhorn needs a large number of samples of µ (in batches) to be accurate.
In our paper, as the spaces X are compact subsets of R^n and µ has a continuous probability density function, we may instead use Richardson extrapolation to estimate the Wasserstein distance between µ and µ_m, which may require fewer samples and less computation (running the Sinkhorn algorithm twice with different grid sizes).
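The extrapolation step itself is standard: given estimates on grids of N and rN points and the exponent h of the leading error term (h = k/d in our setting, as derived in the next paragraph), the leading error cancels. A minimal sketch, checked on a synthetic function whose error is exactly of order N^(−h):

```python
def richardson(W_N, W_rN, r, h):
    """Richardson extrapolation: cancel the leading N**(-h) error term.

    W_N, W_rN: estimates computed on grids of N and r*N points
    r: grid-refinement ratio
    h: known exponent of the leading error term
    Returns the extrapolated estimate of the limit value.
    """
    return (r**h * W_rN - W_N) / (r**h - 1)

# Synthetic check: f(N) = L + c * N**(-h) is recovered exactly.
L, c, h, r, N = 2.0, 5.0, 2.0, 4, 100
W_N = L + c * N ** (-h)
W_rN = L + c * (r * N) ** (-h)
est = richardson(W_N, W_rN, r, h)   # equals L up to floating-point error
```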
Our examples are on intervals or rectangles, on which two grids Λ_1 of N points and Λ_2 of rN points (N and rN both integers) can be constructed naturally. With µ determined by a smooth probability density function ρ, let µ^(N) be the normalization of ∑_{i=1}^N ρ(Λ_i) δ_{Λ_i} (this may not be a probability distribution, so we use its normalization). From the continuity of ρ and the boundedness of the dual variables α and β, we can conclude that W^k_{k,ζ}(µ^(N), µ_m) converges to W^k_{k,ζ}(µ, µ_m) as N → ∞. Regarding W^k_{k,ζ}(µ^(N), µ_m) as a function of 1/N, to apply Richardson extrapolation we need the exponent h of the lowest-order term in 1/N in its expansion. Since W_{k,ζ}(µ^(N), µ) ∝ N^{−1/d}, we may conclude that h = k/d, where d is the dimension of X. Figure A2 shows an empirical example in a d = 1 and k = 2 situation.

Appendix D.2. Example: Sphere
The CW complex structure of the unit sphere S² is constructed as follows: let (1, 0, 0), a point on the equator, be the only dimension-0 cell, and let the equator be the dimension-1 cell (a line segment [0, 1] attached to the dimension-0 cell by identifying both endpoints with the point (1, 0, 0)). The dimension-2 structure is the union of two unit discs, identified with the southern/northern hemispheres of S² by stereographic projection with respect to the north/south pole.

Spherical Geometry
The spherical geometry is the Riemannian manifold structure induced by the embedding of S² as the unit sphere in R³.
The geodesic between two points is the shorter arc of the great circle determined by the two points. In their R³ coordinates, d_{S²}(x, y) = arccos(⟨x, y⟩). Composed with the stereographic projections, the distance in terms of the CW complex coordinates can be calculated (and differentiated).
The gradient with respect to y (or its CW coordinate) can be calculated via the above formulas. In practice, the only problem occurs when x = ±y, where the function arccos is singular at ±1. By the rotational symmetry of the sphere about the axis through x, the derivatives of the distance along all directions are the same. Therefore, we may choose the radial direction in the CW coordinate (the unit disc), along which the differentiation is elementary to calculate.
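A minimal sketch of the geodesic distance in R³ coordinates; clipping the inner product guards against floating-point values slightly outside [−1, 1], while the gradient and the x = ±y singularity handling described above are not reproduced here:

```python
import numpy as np

def sphere_dist(x, y):
    """Geodesic distance on the unit sphere: d(x, y) = arccos(<x, y>).

    x, y: unit vectors in R^3. The clip keeps the inner product inside
    [-1, 1] despite floating-point error; the derivative of arccos is
    still singular at x = +/- y, which the text handles by symmetry.
    """
    return float(np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)))

north = np.array([0.0, 0.0, 1.0])
equator = np.array([1.0, 0.0, 0.0])
d = sphere_dist(north, equator)   # a quarter great circle: pi / 2
```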

Appendix D.3. A Note on the Implementation of SGD with Momentum
There is a slight difference between our implementation of the SGD and the algorithm presented in the paper. In the implementation, we use two different learning rates for the positions (the y_i's) and the weights (the w_i's), as movement in the positions is usually observed to be much slower than movement in the weights. Empirically, we set the learning rate on the positions to exactly three times that on the weights at each SGD iteration. With this change, convergence is faster, but we have neither theory nor empirical evidence showing that a fixed ratio of three is the best choice.
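The update with separate learning rates can be sketched as a single momentum-SGD step; the gradient inputs, base learning rate, and simplex renormalization below are illustrative, not the paper's exact implementation:

```python
import numpy as np

def sgd_step(y, w, grad_y, grad_w, vel_y, vel_w, lr_w=0.01, momentum=0.9):
    """One momentum-SGD step with separate learning rates.

    Per the implementation note above, the position learning rate is
    fixed at three times the weight learning rate. grad_y / grad_w
    stand in for the stochastic gradients of W^k_{k,zeta} with respect
    to positions and weights, which are computed elsewhere.
    """
    lr_y = 3.0 * lr_w
    vel_y = momentum * vel_y - lr_y * grad_y
    vel_w = momentum * vel_w - lr_w * grad_w
    y = y + vel_y
    w = np.clip(w + vel_w, 1e-12, None)   # stay inside the open simplex
    w = w / w.sum()                        # renormalize the weights
    return y, w, vel_y, vel_w

# One illustrative step on a 2-point discretization.
y1, w1, vy, vw = sgd_step(np.array([0.0, 1.0]), np.array([0.5, 0.5]),
                          grad_y=np.array([1.0, -1.0]),
                          grad_w=np.array([0.1, -0.1]),
                          vel_y=np.zeros(2), vel_w=np.zeros(2))
# Positions move three times as far per unit gradient as weights do.
```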
Implementing and testing Newton's method (hybridized with SGD) and other improved SGD methods could be good problems to work on.

Figure A3. The sphere example with 3D discretization (same as in the paper) from two view directions. Colors of the dots represent the weights of the atoms in the distribution.

Appendix D.5. Example: Swiss Roll
In this case, the underlying space X_swiss is the Swiss roll, a 2D rectangular strip embedded in R³ via (θ, z) ↦ (θ, θ, z) in cylindrical coordinates (r, θ, z), i.e., with radius r = θ. µ_swiss is a truncated normal mixture on the (θ, z)-plane. Samples S_swiss ∼ µ_swiss over X_swiss are shown in Figure A5 (left), embedded in 3D, and in Figure A6a, mapped isometrically into R².
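For concreteness, the embedding can be sketched as follows; the sampling ranges for θ and z are illustrative stand-ins for µ_swiss, not the paper's parameters:

```python
import numpy as np

def swiss_roll(theta, z):
    """Embed strip coordinates (theta, z) into R^3.

    Cylindrical coordinates (r, theta, z) with radius r = theta, i.e.
    (theta, z) -> (theta * cos(theta), theta * sin(theta), z).
    """
    return np.stack([theta * np.cos(theta),
                     theta * np.sin(theta),
                     z], axis=-1)

# Uniform strip coordinates as a stand-in for samples from mu_swiss.
rng = np.random.default_rng(2)
theta = rng.uniform(1.5 * np.pi, 4.5 * np.pi, 1000)
z = rng.uniform(0.0, 10.0, 1000)
pts = swiss_roll(theta, z)   # 1000 points on the roll in R^3
```

Distances measured on `pts` in R³ give the extrinsic (3D) metric; distances measured on the (θ, z) strip give the intrinsic one.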
Following the Euclidean metric in R³, Figure A5 (right) plots the EDOT solution µ_m obtained through adaptive cell refinement (Algorithm A2) with m = 50. The resulting cell structure is shown as colored boxes. The corresponding partition of S_swiss is shown in Figure A6a, with the samples contained in a cell marked by the same color. According to Figure A5 (right), the points of µ_m were mainly located on the strip, with only one point off it, in the sparsest cell (the yellow cell at the bottom of the figure). On the other hand, consider the metric on X_swiss induced by the isometry from the Swiss roll, as a manifold, to a strip in R². A more intrinsic discretization of µ_swiss can be obtained by applying the EDOT through a refinement on the coordinate space, i.e., the (2D) strip. The partition of S_swiss is shown in Figure A6b, and the resulting discretization µ_50 is shown in Figure A6c. Notice that all 50 points were located in the (locally) high-density region of the Swiss roll. We observe from Figure A6a,b that the 3D partition pulled disconnected and intrinsically remote regions together, while the 2D partition maintained the intrinsic structure.

Appendix D.6. Example: Images
For the colored figures (Starry Night and Girl with a Pearl Earring) in Figure 1, we processed the three channels independently, then plotted the colored dots, and finally combined them as the corresponding channels of a color image. In the reconstruction of Starry Night, we made the colored dots the same size, with color values modified according to the weights. For Girl with a Pearl Earring, we used pure colors ((255,0,0) as red, etc.) and varied the sizes of the dots (with area proportional to the weights).

Appendix D.7. Example: Simple Barycenter Problems
The EDOT in its simple form (no divide and conquer) can solve barycenter problems. The idea is simple: the gradient of a sum of functions is the sum of the gradients. Thus, to find the discrete barycenter of size m for several distributions µ_i, we take the objective to be the sum of the Wasserstein distances (each raised to the power k), whose gradient with respect to the discretization is the sum of the gradients between the discretization and each target distribution. This method only works for the simple EDOT, since there is no locality in barycenter problems: after a division, the weights of each target distribution in each cell of the partition may differ, so there is inter-cell transport in the optimal transport plan, which the current algorithm cannot handle.
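The sum-of-gradients idea can be sketched on a toy surrogate, with each Wasserstein term replaced by a simple quadratic f_i(y) = ½‖y − c_i‖², so that the minimizer is the Euclidean mean; this illustrates only the summed-gradient structure, not the EDOT objective itself:

```python
import numpy as np

def descend_sum(y, grads, lr=0.1, steps=200):
    """Gradient descent on a sum of objectives.

    `grads` is a list of gradient functions, one per target
    distribution; the gradient of the summed objective is simply the
    sum of the individual gradients, as in the barycenter objective
    above. Here each gradient y - c_i comes from the toy quadratic
    0.5 * ||y - c_i||^2 standing in for W^k_{k,zeta}(., mu_i).
    """
    for _ in range(steps):
        y = y - lr * sum(g(y) for g in grads)
    return y

centers = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, 3.0])]
grads = [(lambda y, c=c: y - c) for c in centers]
y = descend_sum(np.array([5.0, 5.0]), grads)
# Converges to the mean of the centers, (1.0, 1.0).
```

In EDOT, the same loop runs with the stochastic Wasserstein gradients in place of the quadratic ones, updating positions and weights jointly.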
We can see in Figure A7 that the simple EDOT-Equal (no divide and conquer) achieved results similar to the non-regularized discretization in [18], whereas the EDOT produced a better approximation of the barycenter by taking advantage of freely varying weights.