Probabilistic Fréchet Means for Time Varying Persistence Diagrams

In order to use persistence diagrams as a true statistical tool, it would be very useful to have a good notion of mean and variance for a set of diagrams. In 2011, Mileyko and his collaborators made the first study of the properties of the Fréchet mean in $(\mathcal{D}_p, W_p)$, the space of persistence diagrams equipped with the p-th Wasserstein metric. In particular, they showed that the Fréchet mean of a finite set of diagrams always exists, but is not necessarily unique. The means of a continuously-varying set of diagrams do not themselves (necessarily) vary continuously, which presents obvious problems when trying to extend the Fréchet mean definition to the realm of vineyards. We fix this problem by altering the original definition of Fréchet mean so that it now becomes a probability measure on the set of persistence diagrams; in a nutshell, the mean of a set of diagrams will be a weighted sum of atomic measures, where each atom is itself a persistence diagram determined using a perturbation of the input diagrams. This definition gives, for each $N$, a map $(\mathcal{D}_p)^N \to \mathbb{P}(\mathcal{D}_p)$. We show that this map is Hölder continuous on finite diagrams and thus can be used to build a useful statistic on time-varying persistence diagrams, better known as vineyards.

A key tool in TDA is the persistence diagram [7,14]. Given a set of points χ in some possibly high-dimensional metric space, the persistence diagram d(χ) is a computable summary of the data which provides a compact two-dimensional representation of the multi-scale topological information carried by the point cloud; see Figure 1 for an example of such a diagram and Section 2 for a more rigorous description. If the point cloud varies continuously over time (or some other parameter) then the persistence diagrams vary continuously over time [9]; the diagrams stacked on top of each other then form what is called a vineyard [10].
A key part of data analysis is to model variation in data. To address this there has been a recent effort to study the mean and variance of a set of persistence diagrams [3,5,20,26], as well as a very recent paper that gives nice convergence rates for persistence diagrams of larger and larger point clouds sampled from a compactly-supported measure [8]. There are a variety of reasons to want to characterize statistical properties of diagrams. For example, given a massive point cloud χ, there is a computational and statistical advantage to subsampling the data to produce smaller point clouds χ_1, . . . , χ_n, and computing the mean and variance of the set of persistence diagrams obtained from the n subsampled data sets. In statistical terminology, this example consists of computing a bootstrap estimate [?] of the persistence diagram of the data. This procedure requires a good definition for the mean (and variance) of a set of persistence diagrams.
The papers [20,26] make careful study of the geometric and analytic properties of the space (D_p, W_p) of persistence diagrams equipped with the Wasserstein metric, which enables a definition of mean and variance in the former paper, and an algorithm for their computation in the latter. There is, however, an unfortunate problem with the definition in [20]: namely, the mean of a set of diagrams can itself be a set of diagrams, rather than a single unique diagram. This results in means that do not vary continuously as the input set of diagrams varies.
In this paper, we provide an alternative definition for the mean of a set of diagrams. The key idea is that our mean is not itself a diagram, but is rather a probabilistic mixture of diagrams, thus an element of P(D_p), the space of probability distributions over persistence diagrams. Uniqueness of this new mean will be obvious from the definition we propose. More crucially, we prove (Theorem 13) continuity of the map (D_p)^k → P(D_p) taking a set of k persistence diagrams to its mean; in fact, we show that this map is Hölder with exponent 1/2. Finally, we give examples of this mean computed on diagrams drawn from samples of various point clouds, and introduce a useful way to visualize them.
Outline Section 2 contains definitions for persistence diagrams and vineyards, as well as a discussion of the space (D p , W p ). The contributions of [20] and [26] are reviewed more fully in Section 3, and the non-uniqueness issue is also discussed in that section. We give our new definition in Section 4, and prove its desirable theoretical properties in Section 5, although some technical details are confined to the Appendix. Examples, implementation details, and a discussion of visualization, all come in Section 6, and the paper concludes with some discussion in Section 7.

Diagrams and Vineyards
Here we give the basic definitions for persistence diagrams and vineyards, and then move on to a description of the metric space (D p , W p ). For more details on persistence, see [15]. We assume the reader is familiar with homology; [23] is a good reference. For the expert, we note that all homology groups in this paper need to be computed with field coefficients.

Persistent Homology
To define persistent homology, we start with a nested sequence of topological spaces,

∅ = X_0 ⊆ X_1 ⊆ X_2 ⊆ · · · ⊆ X_n = X.  (1)

Often this sequence arises from the sublevel sets of a continuous function f : X → R, where X_i = f^{-1}((−∞, a_i]) with a_0 ≤ a_1 ≤ · · · ≤ a_n. For many applications, this function is the distance function from a point cloud χ, such as in the example of Figure 1a. In this case, a sublevel set can be visualized as a union of balls around the points of χ.
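For intuition, the 0-dimensional persistence of this filtration can be computed directly: every connected component is born at scale 0, and two components merge when the balls around their points touch, i.e., at half the relevant pairwise distance. The `zeroth_diagram` helper below is our own single-linkage sketch (not code from the paper), assuming points are given as coordinate tuples.

```python
import math
from itertools import combinations

def zeroth_diagram(points):
    """0-dimensional persistence diagram of the distance-function
    filtration of a point cloud, via single linkage (union-find)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # find the representative of i's component, with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # balls of radius r around two points touch when r = distance / 2
    edges = sorted(
        (math.dist(points[i], points[j]) / 2.0, i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for r, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(r)   # one component dies when two merge
    # n - 1 finite classes plus one essential class that never dies
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]
```

For three collinear points at 0, 0.1, and 1 on the x-axis, the two finite classes die at scales 0.05 and 0.45, and one class persists forever.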
The sequence of inclusion maps from Equation (1) induces maps on homology in any dimension r:

H_r(X_0) → H_r(X_1) → · · · → H_r(X_n).  (2)

In order to understand the changing space, we look at where homology classes appear and disappear in this sequence. Let ϕ_i^j : H_r(X_i) → H_r(X_j) be the composition of the necessary maps from Equation (2). The homology class γ ∈ H_r(X_i) is said to be born at X_i if it is not in the image of ϕ_{i−1}^i. This same class is said to die at X_j if ϕ_i^j(γ) lies in the image of ϕ_{i−1}^j but ϕ_i^{j−1}(γ) does not lie in the image of ϕ_{i−1}^{j−1}.

Figure 1: A point cloud, shown at left in (a), is sampled from an annulus. In order to summarize the topological data, we look at the sublevel sets of the distance function from the set of points, then construct the persistence diagram, shown at right in (b). The points near the diagonal are considered noise, while the single point far from the diagonal gives information about the hole in the annulus.
In the case that the spaces arose from the sublevel sets of a function f as defined above, we define the persistence of a class γ which is born at X_i = f^{-1}((−∞, a_i]) and dies at X_j = f^{-1}((−∞, a_j]) to be pers(γ) = a_j − a_i.
Notice that this equivalence can also be seen from working with persistence modules [7], an abstraction of the definition presented here where persistence is defined at the algebraic level. In fact, given any set of maps between vector spaces, we can analogously define the birth and death of classes in the vector spaces. In order to visualize the changing homology, we draw a persistence diagram d_r for each dimension r. A persistence diagram is a set of points with multiplicity in the upper half plane {(b, d) ∈ R² | d ≥ b}, along with countably many copies of the points on the diagonal ∆ = {(x, x) ∈ R²}. For each class γ which is born at X_i and dies at X_j, we draw a point at (a_i, a_j). A point in the persistence diagram which is close to the diagonal represents a class which was born and died very quickly. A point which is far from the diagonal had a longer life. Depending on the context, this may mean the class is more important, or more telling of the inherent topology of the space. See Figure 1b for an example.

The Space (D p , W p )
In order to define a framework for statistics, we will ignore the connection to topological spaces or maps between vector spaces and instead focus on the space of persistence diagrams abstractly. Definition 1. An abstract persistence diagram is a countable multiset of points along with the diagonal, ∆ = {(x, x) ∈ R 2 | x ∈ R}, with points in ∆ having countably infinite multiplicity.
The distance between these abstract diagrams is the p-th Wasserstein distance.

Definition 2. The p-th Wasserstein distance between two persistence diagrams X and Y is given by

W_p[σ](X, Y) = [ inf_{ϕ : X → Y} Σ_{x∈X} σ(x, ϕ(x))^p ]^{1/p},

where σ is a metric on the plane, and ϕ ranges over bijections between X and Y.
We often use σ = L_q. Notice that for p = ∞, W_∞[σ](X, Y) = inf_ϕ sup_{x∈X} σ(x, ϕ(x)) is often referred to as the bottleneck distance. For the remainder of this paper, we will be using W_p[L_2], which we refer to as W_p for brevity. The persistence pers(u) of a point u = (x, y) in a diagram is defined to be y − x, and the p-th total persistence of a diagram d is defined to be the sum of the p-th powers of the persistences of the off-diagonal points in d.
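As a quick sanity check of these last two definitions, persistence and total persistence are one-line computations; `pers` and `total_persistence` below are our own hypothetical helper names, not notation from the paper.

```python
def pers(u):
    """Persistence of a diagram point u = (birth, death)."""
    b, d = u
    return d - b

def total_persistence(diagram, p=2):
    """p-th total persistence: the sum of the p-th powers of the
    persistences of the off-diagonal points of the diagram."""
    return sum(pers(u) ** p for u in diagram if pers(u) > 0)
```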
The authors in [20] show that (D p , W p ) is a Polish (complete and separable) space, and they also describe all of the compact sets in this space. In [26], it is shown that D p is a geodesic space, and thus that every pair of diagrams has a minimal geodesic between them. This geodesic can be defined using a bijection between the diagrams which minimizes Wasserstein distance.

Vineyards
The first definitions of vineyards [10,21] were used in the well-behaved case of a homotopy between two functions. In this case, each off-diagonal point of the diagram D(X(t)) varies continuously in time, tracing a path called a vine. Vines can start and end at off-diagonal points at times 0 or 1, or have starting or ending points on the diagonal for any t; see Figure 2.
As we do with persistence diagrams, let us consider the space of abstract vineyards to be the space of paths in persistence diagram space.

Fréchet Means of Diagrams
This section reviews previous definitions of the mean of a set of diagrams [20] and an algorithm to compute the mean [26]. We will define the mean of a diagram as the Fréchet mean, give the algorithm for the computation of this mean, and finally present the non-uniqueness problem.

Fréchet Means
The Fréchet mean generalizes the mean of a set of points or a distribution in Euclidean space to any metric space. It can be thought of as a generalization of the arithmetic mean in that it minimizes the sum of squared distances to points in the distribution. Given a probability space (D_p, B(D_p), P), we can define the Fréchet mean as follows.
Definition 5. Given a probability space (D_p, B(D_p), P), the quantity

Var(P) = inf_{Y ∈ D_p} F_P(Y),  where  F_P(Y) = ∫_{D_p} W_p(X, Y)² dP(X),

is the Fréchet variance of P, and the set E(P) of diagrams at which this infimum is attained is the Fréchet expectation, also called the Fréchet mean. The function F_P is called the Fréchet function.
The mean in this case need not be a single diagram, but may be a set of diagrams. In fact, there is no guarantee that E(P) is even non-empty. However, it was proved in [20] that the Fréchet mean for (D_p, W_p[L_∞]) is non-empty for certain types of well-behaved probability measures on D_p.

Theorem 6. Let P be a probability measure on (D_p, B(D_p)) with a finite second moment. If P has compact support, then E(P) ≠ ∅.
A similar result holds when the tail probabilities of the distribution P decay fast enough, see [20] for details.

Algorithm
The focus of [20] was to develop the probability theory required for statistical procedures on persistence diagrams, including defining a mean. In [26] an algorithm to compute an estimate of the Fréchet mean of a set of diagrams was given. It is also shown there that the Fréchet function is semiconcave for distributions with bounded support, so we can make use of a gradient descent algorithm to find local minima of the Fréchet function of Definition 5. In order to present the algorithm for computing the Fréchet mean, we must first describe the algorithm for computation of Wasserstein distance. The representation of a diagram in this algorithm is a list of its off-diagonal points, X = [x_1, · · · , x_k]. In order to compute the Wasserstein distance between two diagrams, we will reduce the problem to computing a minimum cost coupling of a complete, weighted bipartite graph.

Figure 3: The bipartite graph in (b) is used to determine the bijection for the diagrams in (a). Dashed edges in (b) correspond to ∆–∆ pairings, which contribute nothing to the total distance.
Let X = [x_1, · · · , x_k] and Y = [y_1, · · · , y_m] be diagrams. In order to compute W_2[L_2](X, Y), we construct a bipartite graph with vertex set U ∪ V. U has a vertex for each x_i, as well as m vertices representing the abstract diagonal ∆. Similarly, V has a vertex for each y_j, as well as k vertices representing ∆. Take all edges from U to V so that this is a complete bipartite graph. The edge between points x_i and y_j is given weight ‖x_i − y_j‖^p. Each edge (x_i, ∆) and (∆, y_j) has weight ‖x_i − ∆‖^p and ‖y_j − ∆‖^p respectively, where ‖a − ∆‖ = min_{z∈∆} ‖a − z‖. Finally, edges between two vertices representing ∆ are given weight 0. The minimum cost coupling algorithm typically used is the Hungarian algorithm of Munkres [22].
A minimum cost coupling in the bipartite graph immediately gives a bijection ϕ : U → V and the Wasserstein distance is given by the square root of the sum of the squares of the weights of the edges. Notice that since there could be multiple couplings for a bipartite graph which minimize the cost, there could be multiple couplings which minimize the Wasserstein distance. To compute the mean diagram, we will actually be more interested in the bijection returned in this algorithm than in the distance itself. Figure 3 displays an example of a pair of diagrams and their corresponding bipartite graph.
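A minimal sketch of this reduction, assuming SciPy's Hungarian-style solver `scipy.optimize.linear_sum_assignment` as the minimum cost coupling routine: we build the augmented (k+m) × (m+k) cost matrix with the diagonal copies described above and read the distance off the optimal assignment. The function name `wasserstein` is our own.

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein(X, Y, p=2):
    """W_p[L_2] between diagrams given as lists of (birth, death) points,
    via the augmented bipartite construction described above."""
    X = np.asarray(X, dtype=float).reshape(-1, 2)
    Y = np.asarray(Y, dtype=float).reshape(-1, 2)
    k, m = len(X), len(Y)
    n = k + m
    C = np.zeros((n, n))
    # point-to-point edges
    C[:k, :m] = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2) ** p
    # point-to-diagonal edges: the L_2 distance from (b, d) to Delta is (d - b)/sqrt(2)
    C[:k, m:] = (((X[:, 1] - X[:, 0]) / math.sqrt(2)) ** p)[:, None]
    C[k:, :m] = (((Y[:, 1] - Y[:, 0]) / math.sqrt(2)) ** p)[None, :]
    # Delta-Delta edges already have weight 0
    rows, cols = linear_sum_assignment(C)
    return float(C[rows, cols].sum() ** (1.0 / p))
```

For instance, the distance between the one-point diagrams {(0, 2)} and {(0, 2.1)} is 0.1, since matching the two points is cheaper than sending both to the diagonal.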
Definition 7. Given a set of diagrams X 1 , · · · , X N , a selection is a choice of one point from each diagram, where that point could be ∆. The trivial selection for a particular off-diagonal point x ∈ X i is the selection s x which chooses x for X i and ∆ for every other diagram.
A coupling is a set of selections so that every off-diagonal point of every diagram is part of exactly one selection.
Given a set of diagrams, a coupling is the analog of a bijection for a pair of diagrams. A coupling for N diagrams which has k selections can be stored as a k × N matrix G, where entry G[j, X_i] = x means that the j-th selection contains the point x ∈ X_i. Note that we consider couplings to be equivalent up to reordering of the selections. See Figure 4 for an example; in that case, the coupling shown is given by the matrix in Equation (3), where ∆ represents the diagonal.

Figure 4: An example of a coupling for three overlaid persistence diagrams is given in (a). In this example, the coupling has four selections, and the corresponding coupling matrix is given in Equation (3). The dark circles in diagram (b) give the mean diagram associated to this particular coupling.

The mean of a selection is the point which minimizes the sum of the squared distances to the elements of the selection. Consider the mean of a selection s consisting of N points: {p_1, · · · , p_k} off-diagonal, and the rest copies of the diagonal ∆. It is simplest to compute this mean in rotated coordinates with axes along and perpendicular to the diagonal, u = (x + y)/√2 and v = (y − x)/√2, so that v records the L_2 distance to the diagonal. A quick computation gives the mean point as the one with rotated coordinates

u* = (1/k) Σ_{i=1}^k u_i,  v* = (1/N) Σ_{i=1}^k v_i.

In these coordinates, it is also easy to see what happens for the mean of a trivial selection: if the single off-diagonal point is at x = (a, b) and there are a total of N diagrams, the mean point sits above the same location on the diagonal, placed at distance (1/N)‖x − ∆‖ from the diagonal.

The mean of a coupling, mean(G), is a diagram in D_p with a point at the mean of each selection. When it is unclear as to the set of diagrams from which this mean arose, we will denote it as mean_X(G). Note that the mean of a selection yields a point, while the mean of a coupling yields a diagram.

Now we are ready to give the algorithm for the Fréchet mean of a set of diagrams. Given a finite set of diagrams {X_1, · · · , X_N}, start with a candidate for the mean, Y, and compute the bijection realizing W_2[L_2](Y, X_i) for each i. We denote this as WassersteinPairing(Y, X_i). From this, we have a coupling G where G[y_j, X_i] gives the point in X_i which was paired to point y_j ∈ Y. Set Y = mean(G). This new diagram is now the candidate for the mean and the process is repeated. The algorithm terminates when the Wasserstein pairing does not change. [26] uses the structure of (D_2, W_2[L_2]) to prove that this algorithm terminates at a local minimum of the Fréchet function. See Algorithm 1 for the pseudocode.
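The barycenter of a selection can be sketched directly in rotated coordinates: along the diagonal only the off-diagonal points pull on the mean, while perpendicular to it the diagonal copies also pull toward distance zero. `selection_mean` is our own illustrative name for this computation.

```python
import numpy as np

def selection_mean(points, N):
    """Barycenter of a selection: k off-diagonal points plus N - k copies
    of the diagonal, minimizing the sum of squared L_2 distances.

    Uses rotated coordinates u = (x + y)/sqrt(2) (position along the
    diagonal) and v = (y - x)/sqrt(2) (distance to the diagonal)."""
    P = np.asarray(points, dtype=float)
    k = len(P)
    u = (P[:, 0] + P[:, 1]) / np.sqrt(2)
    v = (P[:, 1] - P[:, 0]) / np.sqrt(2)
    u_star = u.mean()        # only the k real points pull along the diagonal
    v_star = v.sum() / N     # the N - k diagonal copies pull toward v = 0
    return ((u_star - v_star) / np.sqrt(2), (u_star + v_star) / np.sqrt(2))
```

For a trivial selection with single point (0, 2) among N = 2 diagrams, the mean point is (0.5, 1.5): the same location along the diagonal, at half the original distance from it.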
Algorithm 1 Algorithm for computing the Fréchet mean of a finite set of diagrams
Input: Persistence diagrams X_1, · · · , X_N
Output: Y, a persistence diagram giving a local min of the Fréchet function
  Y ← one of the input diagrams X_i
  repeat
    G ← the coupling given by WassersteinPairing(Y, X_i) for each i
    for each point y_j ∈ Y do
      move y_j to the barycenter of its selection
  until the pairings no longer change
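A toy sketch of the iteration, assuming SciPy's `linear_sum_assignment` plays the role of WassersteinPairing. This simplification drops the bookkeeping for input points paired to ∆ (which in the full algorithm of [26] can spawn new points of Y), so it illustrates the alternation between pairings and barycenters rather than faithfully implementing the paper's algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

S2 = np.sqrt(2.0)

def _selection_mean(points, N):
    # barycenter of a selection in rotated (u, v) coordinates
    P = np.asarray(points, dtype=float)
    u = (P[:, 0] + P[:, 1]) / S2
    v = (P[:, 1] - P[:, 0]) / S2
    us, vs = u.mean(), v.sum() / N
    return ((us - vs) / S2, (us + vs) / S2)

def _pairing(Y, X):
    # optimal W_2[L_2] pairing: maps index j of Y to a point of X,
    # or to None, meaning y_j is paired with the diagonal
    k, m = len(Y), len(X)
    C = np.zeros((k + m, m + k))
    C[:k, :m] = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2) ** 2
    C[:k, m:] = (((Y[:, 1] - Y[:, 0]) / S2) ** 2)[:, None]
    C[k:, :m] = (((X[:, 1] - X[:, 0]) / S2) ** 2)[None, :]
    rows, cols = linear_sum_assignment(C)
    return {r: (tuple(X[c]) if c < m else None)
            for r, c in zip(rows, cols) if r < k}

def frechet_mean(diagrams, iters=100):
    """Alternate optimal pairings and selection barycenters (Algorithm 1 sketch)."""
    diagrams = [np.asarray(d, dtype=float) for d in diagrams]
    N = len(diagrams)
    Y = diagrams[0].copy()   # initial candidate: one of the inputs
    for _ in range(iters):
        matches = [_pairing(Y, X) for X in diagrams]
        newY = []
        for j in range(len(Y)):
            sel = [m[j] for m in matches if m[j] is not None]
            if sel:          # selections containing only Delta vanish
                newY.append(_selection_mean(sel, N))
        newY = np.asarray(newY, dtype=float).reshape(-1, 2)
        if newY.shape == Y.shape and np.allclose(newY, Y):
            break            # pairings and barycenters have stabilized
        Y = newY
    return Y
```

On the one-point diagrams {(0, 2)} and {(0, 4)}, the iteration converges in two steps to the midpoint diagram {(0, 3)}.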

Issues with extensions to Vineyards
Consider the example of Figure 5. Here we have two persistence diagrams overlaid: d_□ has square points 1 and 2, and d_• has circle points a and b. Since the four points lie exactly on a square, the pairing giving the Wasserstein distance could be either {(a, 1), (b, 2)} or {(a, 2), (b, 1)}. Thus there are two diagrams which give a minimum of the Fréchet function: the diagram with u and v, or the diagram with x and y.
If two vineyards pass through this configuration, the mean of the vineyards constructed by finding the mean at each time will not be continuous. Consider for example two vineyards of two points each which start in the elongated configuration of Figure 6a and move along the dotted lines to the configuration in Figure 6b. At the bend of the dotted line, the points are at the corners of a square, so as in the example of Figure 5, there are two possible choices for the mean. One is close to the means from the previous times, and one is close to the means from the following times.
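The jump can be checked numerically. The coordinates below are our own stand-ins for the square configuration of Figure 5: at the exact square the two pairings tie, and an arbitrarily small horizontal nudge of one circle point flips which pairing is optimal.

```python
def matching_costs(a, b, p1, p2):
    """Squared costs of the two possible pairings between the circle
    points {a, b} and the square points {p1, p2}."""
    sq = lambda u, v: (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    return sq(a, p1) + sq(b, p2), sq(a, p2) + sq(b, p1)

# four points at the corners of a square in the upper half plane
b, p1, p2 = (2, 6), (2, 4), (0, 6)
tie = matching_costs((0, 4), b, p1, p2)        # both pairings cost 8
right = matching_costs((0.1, 4), b, p1, p2)    # nudge a right: first pairing wins
left = matching_costs((-0.1, 4), b, p1, p2)    # nudge a left: second pairing wins
```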

The Mean as a Distribution
To overcome the lack of continuity of the mean vineyard and the non-uniqueness illustrated in Figure 5, we will define a mean that is a mixture of diagrams, that is, a distribution over persistence diagrams. In order to prove continuity, the diagrams will be limited to S_{M,K}, the set of diagrams in D_2 with at most K off-diagonal points, all lying in a bounded region determined by the constant M. Consider the space P(S_{M,K}) of probability measures with finite second moment on S_{M,K} ⊂ D_2. This is of course a metric space with the standard probability Wasserstein distance as defined below.
Definition 8. The p-th Wasserstein distance between two probability distributions ν and η on a metric space (X, d_X) is

W_p[d_X](ν, η) = [ inf_{γ ∈ Γ(ν,η)} ∫_{X×X} d_X(x, y)^p dγ(x, y) ]^{1/p},

where Γ(ν, η) is the space of distributions on X × X with marginals ν and η respectively. When d_X is obvious from context, we will instead write W_p(ν, η).
We use W_2[W_2[L_2]] as the distance function on P(S_{M,K}), where the outside W_2 is the probability Wasserstein distance of Definition 8 and the inside W_2 is the deterministic Wasserstein distance of Definition 2. Note that the map Y ↦ δ_Y, where δ_Y is the delta measure concentrated on the diagram Y, gives an isometric embedding of S_{M,K} into P(S_{M,K}).

Intuition
We first give an intuitive description of the main ideas behind our new definition. The basic idea we use to achieve continuity is to treat both diagrams and the couplings between diagrams as probabilistic objects. Together, these two ideas imply that the mean of a set of diagrams is not a diagram but a distribution over diagrams. A useful perspective on Wasserstein distances is to view them in terms of couplings; the general definition of a coupling follows [27].
Definition 9. Let (X, ν) and (Y, η) be two probability spaces. Coupling ν and η means constructing two random variables X and Y on some probability space (Ω, P) in such a way that law(X) = ν and law(Y) = η.

A natural interpretation of the above definition is that a coupling consists of constructing a joint probability space with marginals ν and η. Deterministic couplings follow from the above definition as the case where the law γ of (X, Y) is concentrated on the graph of a measurable function G : X → Y.
There are two problems with deterministic couplings: (1) given two diagrams, there may be multiple optimal couplings (see Figure 5); (2) given two diagrams, slight perturbations of the point configurations can result in radically different optimal couplings (consider slight perturbations of Figure 5). These two problems led to the failure shown in Figure 6, where a jump between optimal couplings led to a discontinuity in the mean diagram. The first problem is addressed by considering probabilistic couplings, so that γ is a probability distribution. The second problem is addressed by considering each diagram as a distribution of point configurations.
The distribution over couplings we construct is a discrete probability distribution over deterministic couplings. The idea behind probabilistic couplings is that, given two diagrams, we consider all possible (optimal) couplings between the diagrams and assign a probability to each coupling. For now, consider each coupling to be equiprobable. For example, the mean of d_□ and d_• in Figure 5 will be the distribution placing probability 1/2 on each of the two mean diagrams, since there are two deterministic couplings that are equally probable. This addresses the non-uniqueness issue, but the second problem of instability of the coupling remains. To see this, consider a slight perturbation of the point configuration in Figure 5: this will result in a mean of one single diagram or the other, since the perturbation makes exactly one of the two configurations the optimal coupling. To address this problem, we consider diagrams as probabilistic objects: given a diagram with off-diagonal points {p_1, ..., p_ℓ}, we consider the collection as a probability density function on point configurations in R²,

f(x) = (1/ℓ) Σ_{i=1}^ℓ k(x, p_i; h),

where k(x, p_i; h) is a density function centered at p_i and localized to a region of scale h in R². In this case, slightly perturbing the points in Figure 5 will result in a mean distribution of the form p δ_{mean(G_1)} + (1 − p) δ_{mean(G_2)} with p ≈ 1 − p ≈ 0.5, since the probability of each of the two optimal matchings of the localized probability distributions around the points in Figure 5 is about equal, whether the points are slightly perturbed or not.
In general, if X = {X_1, · · · , X_N} is a set of diagrams from S_{M,K}, we define its mean to be the following distribution on S_{M,NK}:

μ_X = Σ_G P(H = G) δ_{mean_X(G)}.

Here the sum is taken over all possible couplings G on the set of diagrams, and mean_X(G) ∈ S_{M,NK} is the mean diagram for the specific coupling G. Note that the mean mean_X(G) of a set of N diagrams, each with at most K points, can itself have at most NK points, so that μ_X, as defined, is indeed an element of P(S_{M,NK}). The weights P(H = G) are derived from a coupling-valued random variable H, which can be thought of either as a probabilistic coupling in which each point of an input diagram is replaced with a localized distribution centered on that point, or as the probability that a stochastic perturbation of the input diagrams would lead to G being the optimal matching. We now explain this in more detail.
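The weights P(H = G) can be estimated by Monte Carlo: repeatedly perturb the input points, find the optimal pairing, and tally how often each coupling wins. The sketch below does this for a square configuration like that of Figure 5 (with our own stand-in coordinates and a uniform perturbation kernel); by symmetry the two couplings each receive weight about 1/2.

```python
import random

def perturb(pt, alpha, rng):
    # uniform draw from the disc of radius alpha around pt (rejection sampling)
    while True:
        dx = rng.uniform(-alpha, alpha)
        dy = rng.uniform(-alpha, alpha)
        if dx * dx + dy * dy <= alpha * alpha:
            return (pt[0] + dx, pt[1] + dy)

def coupling_weights(a, b, p1, p2, alpha=0.3, trials=2000, seed=0):
    """Monte Carlo estimate of P(H = G) for the two pairings of a
    two-points-per-diagram configuration."""
    rng = random.Random(seed)
    sq = lambda u, v: (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    wins = [0, 0]
    for _ in range(trials):
        A, B = perturb(a, alpha, rng), perturb(b, alpha, rng)
        P1, P2 = perturb(p1, alpha, rng), perturb(p2, alpha, rng)
        costA = sq(A, P1) + sq(B, P2)   # pairing {(a,1),(b,2)}
        costB = sq(A, P2) + sq(B, P1)   # pairing {(a,2),(b,1)}
        wins[0 if costA < costB else 1] += 1
    return wins[0] / trials, wins[1] / trials
```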

The Definition of H
We are given a set X = {X_1, · · · , X_N} of diagrams from S_{M,K}. Sometimes, where appropriate, we will also use X = ∪_i X_i to represent the set of off-diagonal points in all the X_i. We now define H, a coupling-valued random variable.
First, we fix an α > 0 and a probability distribution η on R² which is absolutely continuous with respect to Lebesgue measure λ. This means that η has a Radon-Nikodym derivative: there exists a measurable non-negative function f on R² such that η(A) = ∫_A f dλ for every measurable set A. We further require that f is bounded, radially symmetric, and supported on the ball of radius α, B_α(0). Examples of distributions satisfying these requirements are the uniform distribution on B_α(0) and the truncated normal distribution N(0, σ², α), which is just the portion of the normal distribution N(0, σ²) contained inside B_α(0), normalized to have total mass 1.
For each x ∈ R 2 , let η x be the translation of η centered around x.
We will often be considering the measure of a ball of a radius smaller than α centered around x, so we use the following notation η(r) = η x (B r (x)).
By construction we know that η(r) ≤ πr² f(0), where f is the probability density function corresponding to η. We now define, for each input diagram X_i, a persistence-diagram-valued random variable as follows. For each off-diagonal point x ∈ X_i which is more than distance α from the diagonal, draw a point from the distribution η_x and add it to the drawn diagram. For each off-diagonal point x ∈ X_i which is less than distance α from the diagonal, draw a point x' from η_x and add it to the drawn diagram only if it is contained in the ball of radius β = ‖x − ∆‖ centered at x. In this way, the probability that a point close to the diagonal contributes a point to the drawn diagram decreases as its distance to the diagonal decreases. See Figure 7.
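The drawing procedure can be sketched as follows, assuming η_x is the uniform distribution on B_α(x) (`draw_diagram` is our own illustrative helper, not the paper's code). Under this kernel, a point at distance β = α/2 from the diagonal keeps its draw with probability (β/α)² = 1/4.

```python
import math
import random

def draw_diagram(diagram, alpha, rng):
    """One draw of the perturbed diagram: each off-diagonal point x gets a
    candidate sampled uniformly from the ball B(x, alpha); if x lies within
    beta < alpha of the diagonal, the candidate is kept only when it also
    lands in B(x, beta)."""
    drawn = []
    for (bx, dy) in diagram:
        beta = (dy - bx) / math.sqrt(2)   # L_2 distance from x to the diagonal
        while True:                        # rejection-sample the ball B(0, alpha)
            ex = rng.uniform(-alpha, alpha)
            ey = rng.uniform(-alpha, alpha)
            if ex * ex + ey * ey <= alpha * alpha:
                break
        r = min(alpha, beta)               # keep-radius shrinks near the diagonal
        if ex * ex + ey * ey <= r * r:
            drawn.append((bx + ex, dy + ey))
    return drawn
```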
For each particular draw of the random diagrams, we can compute a mean diagram via Algorithm 1. In particular, we need the coupling used by the algorithm, and can discard the computed mean diagram itself. Since each point in a draw is associated to a point of some X_i, we consider the coupling used to compute the mean of the drawn diagrams to be a coupling on the corresponding points of the X_i. However, since some points of the X_i did not produce corresponding points in the draw, we extend this to a full coupling by adding the trivial selection for each such point. That is, if a point x ∈ X_i did not lead to a drawn point, we add the selection which chooses x for X_i and ∆ for every other diagram. This completes the definition of the random variable H: as described above, each draw of the random diagrams leads to the draw H = G, where G is the coupling on X described above.

Figure 7: The method for drawing points. For a point x ∈ X_i which is more than distance α from the diagonal, as at right, a point is drawn from the distribution η_x, which is centered at x and contained in the ball of radius α around x; this point is then added to the drawn diagram. For a point x ∈ X_i which is less than distance α from the diagonal, as at left, a point is still drawn from the distribution η_x, but it is only added if it is inside the ball of radius β = ‖x − ∆‖ centered at x.

Example
Here is an example to make the discussion above a little more clear. Consider the three overlaid diagrams in Figure 8a. Points are drawn in the ball of radius α centered at each point. Since a, c, h, g, y, and z are near the diagonal there is a chance that no point is drawn for them. In this particular draw, given in Figure 8b, no point is drawn for a or c.
When computing the mean of the diagrams in Figure 8b, we record the coupling used by the algorithm. To find the corresponding coupling for the original diagrams, we replace each drawn point with its corresponding original point, and add in the trivial selection for each point that did not produce a draw.

Continuity
In this section, we prove our main theorem: that the mean distribution varies continuously when faced with a continuously varying set of input diagrams. We start by developing some more detailed theory about the measures η x , as well as fixing some notation which will be needed in the proof.

Product Measures
For each point x ∈ R², recall that we have the probability distribution η_x, with support contained in the ball B_α(x). We use this to define a new measure η̃_x by restricting η_x to the ball of radius ‖x − ∆‖ about x:

η̃_x(A) = η_x(A ∩ B(x, ‖x − ∆‖)).

Note that η̃_x = η_x iff x is more than α away from the diagonal. As before, we start with a set X = {X_i} of input diagrams, and we let X = ∪_i X_i be the set of all off-diagonal points in all the X_i. Recall that, for each draw of the random diagrams, some of the points in each X_i produced a drawn point and some did not. To formalize this, we define the subset-valued random variable Ω ⊂ X to be the set of points in X which do indeed lead to a drawn point; note that each draw of Ω will always contain the points of X which are more than α away from the diagonal, but it may contain more points as well.
For each subset ω ⊂ X, consider the space (R 2 ) |ω| , where we have associated a copy of R 2 to each point in ω. For each coupling G on X, we define U ω,G ⊂ (R 2 ) |ω| to be the set of points which lead to this coupling. Notice that this set is completely independent of the locations of the points from the original diagrams; it relies only on the combinatorial structure of G.
Writing η̃_x for the restriction η_x|_{B(x,‖x−∆‖)}, let η^ω = Π_{x∈ω} η̃_x be the product measure on (R²)^{|ω|}. Then the probability that G is the coupling used, given that ω is the set of drawn points, is

P(H = G | Ω = ω) = η^ω(U_{ω,G}) / η^ω((R²)^{|ω|}).

For example, say X consists of two diagrams, X_1 and X_2, where X_1 has a single off-diagonal point x_1, and X_2 has a single off-diagonal point x_2. Then there are exactly two possible couplings: G_1, consisting of the two trivial selections pairing x_1 and x_2 each with ∆, and G_2, consisting of the single selection pairing x_1 with x_2. If ω = {x_1, x_2}, then the space we are interested in is (R²)². So U_{ω,G_1} is the set of pairs of points which would rather be paired with the diagonal than with each other,

U_{ω,G_1} = {(a, b) ∈ (R²)² : ‖a − ∆‖² + ‖b − ∆‖² < ‖a − b‖²}.

Likewise, U_{ω,G_2} is the set of pairs of points where being paired to each other is better,

U_{ω,G_2} = {(a, b) ∈ (R²)² : ‖a − b‖² ≤ ‖a − ∆‖² + ‖b − ∆‖²}.

For the purposes of determining the probability of G_1, we must take the locations of x_1 and x_2 into account: we are only interested in the (a, b) ∈ U_{ω,G_1} for which a is in the support of η̃_{x_1} and b is in the support of η̃_{x_2}. So, the probability that G_1 is chosen, given that both points are picked, is (η̃_{x_1} × η̃_{x_2})(U_{ω,G_1}), suitably normalized, and likewise the probability that G_2 is chosen is the corresponding measure of U_{ω,G_2}. This construction is particularly useful since we can consider the same coupling for two sets of diagrams which are close: since U_{ω,G} is independent of the actual diagrams, the only thing varying is the measure.
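For two single-point diagrams whose points lie more than α from the diagonal (so that the restricted and unrestricted measures agree), the probability of the trivial coupling G_1 can be estimated by sampling the product measure and testing membership in U_{ω,G_1} directly. `prob_trivial` is our own illustrative helper, assuming a uniform η.

```python
import random

def sample_ball(center, alpha, rng):
    # uniform sample from the ball B(center, alpha), by rejection
    while True:
        ex = rng.uniform(-alpha, alpha)
        ey = rng.uniform(-alpha, alpha)
        if ex * ex + ey * ey <= alpha * alpha:
            return (center[0] + ex, center[1] + ey)

def prob_trivial(x1, x2, alpha, trials=4000, seed=1):
    """Estimate eta^omega(U_{omega,G_1}): the probability that the two
    perturbed points prefer the diagonal to each other."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = sample_ball(x1, alpha, rng)
        b = sample_ball(x2, alpha, rng)
        # d(a, Delta)^2 + d(b, Delta)^2 under L_2: ((d - b)/sqrt(2))^2 each
        to_diag = ((a[1] - a[0]) ** 2 + (b[1] - b[0]) ** 2) / 2.0
        to_each = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        if to_diag < to_each:
            hits += 1
    return hits / trials
```

Coincident points far from the diagonal always prefer each other (probability 0 for the trivial coupling), while distant points near the diagonal always prefer the diagonal (probability 1).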

The Proof
We will prove Hölder continuity for the map that takes a set of persistence diagrams to its mean distribution. First we give a lemma which bounds the distance between the measures associated to nearby points in the plane. The proof, which is purely analytical, can be found in the Appendix.
Lemma 11. Let f_x and f_y be the Radon-Nikodym derivatives of η_x|_{B(x,‖x−∆‖)} and η_y|_{B(y,‖y−∆‖)} respectively. Then

‖f_x − f_y‖_{L¹(R²)} ≤ C f(0) ‖x − y‖

for a constant C depending only on α, where f is the Radon-Nikodym derivative of η.
Next we prove a result which relates the mean of a set of diagrams to the mean of a new set which is obtained by simply moving one of the points in one of the diagrams to a new location.
Proposition 12. Let Z = {Z_1, Z_2, . . . , Z_N} ∈ (S_{M,K})^N with mean µ ∈ P(S_{M,NK}). Let Ẑ_1 be the same diagram as Z_1 except that one of the off-diagonal points, x, has been moved to the new location y (while still maintaining Ẑ_1 ∈ S_{M,K}). Let µ̂ be the mean of Ẑ = {Ẑ_1, Z_2, Z_3, . . . , Z_N}. Then

W_2(µ, µ̂) ≤ C ‖x − y‖^{1/2},

where the constant C depends on N, K, M, and f(0). Here M is the maximal distance between diagrams in S_{M,NK} and f is the Radon-Nikodym derivative of η.
Proof. Let G be the set of couplings on the set of diagrams Z = {Z_1, Z_2, . . . , Z_N} and Ĝ the set of couplings on the set of diagrams Ẑ = {Ẑ_1, Z_2, . . . , Z_N}. The bijection ϕ : Z_1 → Ẑ_1, which sends x to y and fixes every other point, induces a bijection between G and Ĝ which we will also denote by ϕ.
In order to understand the bound we will put on W 2 (µ,μ), it is best to use the earth mover analogy for Wasserstein distance. Each distribution is thought of as dirt piles, and the cost of moving a dirt pile is the amount of dirt times the distance it must be moved.
In this analogy, since µ = Σ_G P(H = G) δ_{mean(G)}, the diagrams mean(G) give the locations of the dirt piles and P(H = G) gives the quantity of dirt at each location. We then pair the couplings in G with couplings in Ĝ via ϕ. This pairs up the piles in such a way that the amounts of dirt in matched piles are approximately the same. We can then argue that we move most of the dirt a short distance, and the amount of leftover dirt is small enough that it can be moved anywhere without too much penalty.
Note that min{P(H_Z = G), P(H_Ẑ = ϕ(G))} for G ∈ G is the maximum amount of dirt that can be moved from mean_Z(G) to mean_Ẑ(ϕ(G)). This fact, combined with the argument in the paragraph above, gives the inequality

W_2(µ, µ̂)² ≤ Σ_{G∈G} min{P(H_Z = G), P(H_Ẑ = ϕ(G))} W_2(mean_Z(G), mean_Ẑ(ϕ(G)))² + M² Σ_{G∈G} |P(H_Z = G) − P(H_Ẑ = ϕ(G))|,  (9)

where M is the maximum distance between any two diagrams in S_{M,NK}.
First, we will bound $W_2(\mathrm{mean}_Z(G), \mathrm{mean}_{\hat{Z}}(\varphi(G)))$. For selections $m$ in $Z$ which do not contain $x$, the corresponding selection $\varphi(m)$ in $\hat{Z}$ is the same set of points, and hence $\mathrm{mean}_Z(m) = \mathrm{mean}_{\hat{Z}}(\varphi(m))$.
Let $\Omega$ and $\hat{\Omega}$ be random variables giving the subsets of $Z$ and $\hat{Z}$, respectively, which do indeed lead to a drawn point.
Thus, considering both (10) and (11), Combining inequalities (9) and (12) provides the final bound With these last two steps in hand, we can finally prove our main theorem.
be the map which sends a set of diagrams to its mean as defined above. Then $\Phi$ is Hölder continuous with exponent $1/2$. That is, there exists a constant $C$ such that $W_2(\Phi(X), \Phi(Y)) \leq C\, W_2(X, Y)^{1/2}$. The main idea of the proof is, given two close sets of diagrams $(X_1, \cdots, X_N)$ and $(Y_1, \cdots, Y_N)$, to create a bijection from most of the couplings of the $X_i$ to most of the couplings of the $Y_i$ in such a way that the probabilities of the corresponding couplings are similar and the associated mean diagrams are close. This allows for a transportation plan which moves most of the mass a short way; we can then argue that although the rest of the mass could be moved a long distance, its weight is negligible. Equations 14 and 8 each encode this information for a different transportation plan.
Proof. Let $X = (X_1, \cdots, X_N)$ and $Y = (Y_1, \cdots, Y_N)$ denote sets of diagrams in $(\mathcal{S}_{M,K})^N$ with corresponding means $\mu_X$ and $\mu_Y$. We wish to find a constant $C$ such that For the moment, assume that $W_2(X, Y) \leq 1$.
For each $i$, let $\varphi_i$ be an optimal bijection given by $W_2(X_i, Y_i)$. Let $\tilde{X}_i$ be the diagram consisting of the points $x \in X_i$ such that $\varphi_i(x) \neq \Delta$. Likewise, let $\tilde{Y}_i$ be the diagram consisting of the points $y \in Y_i$ such that $\varphi_i^{-1}(y) \neq \Delta$. Let $\mathcal{G}_X$ be the set of all couplings for the diagrams $X_1, \cdots, X_N$, and let $\mathcal{G}_{\tilde{X}}$ be the set of all couplings for the diagrams $\tilde{X}_1, \cdots, \tilde{X}_N$. $\mathcal{G}_Y$ and $\mathcal{G}_{\tilde{Y}}$ are defined similarly. Note that since all diagrams are in $\mathcal{S}_{M,K}$, they have finitely many off-diagonal points, and therefore there are finitely many couplings.
There is an injection $i_X : \mathcal{G}_{\tilde{X}} \to \mathcal{G}_X$ which takes a coupling $G \in \mathcal{G}_{\tilde{X}}$ and maps it to the coupling in $\mathcal{G}_X$ which has all the same selections along with the trivial selection for each unused point $x \in X \setminus \tilde{X}$. We will bound $W_2(\mu_X, \mu_Y)$ using the triangle inequality: $W_2(\mu_X, \mu_Y) \leq W_2(\mu_X, \mu_{\tilde{X}}) + W_2(\mu_{\tilde{X}}, \mu_{\tilde{Y}}) + W_2(\mu_{\tilde{Y}}, \mu_Y)$. This will be done in two parts; first we will bound $W_2(\mu_X, \mu_{\tilde{X}})$ and $W_2(\mu_Y, \mu_{\tilde{Y}})$, then we will bound $W_2(\mu_{\tilde{X}}, \mu_{\tilde{Y}})$.

Part 1:
In order to understand the bound we will put on $W_2(\mu_X, \mu_{\tilde{X}})$, it is again best to use the earth mover analogy for the Wasserstein distance: each distribution is thought of as a collection of dirt piles, and the cost of moving a dirt pile is the amount of dirt times the distance it must be moved.
In this analogy, since $\mu_X = \sum_{G \in \mathcal{G}_X} \mathbb{P}(H = G)\,\delta_{\mathrm{mean}(G)}$, each $\mathrm{mean}(G)$ is the location of a dirt pile and $\mathbb{P}(H = G)$ is the quantity of dirt at that location. We will pair up most of the couplings in $\mathcal{G}_X$ with couplings in $\mathcal{G}_{\tilde{X}}$ in such a way that the paired dirt piles are approximately the same size. We can then argue that we move most of the dirt a short distance, and that the amount of leftover dirt is small enough to move it anywhere without too much penalty.
The pairing comes from the map $i_X : \mathcal{G}_{\tilde{X}} \to \mathcal{G}_X$. Note that $\min\{\mathbb{P}(H_X = i_X(G)), \mathbb{P}(H_{\tilde{X}} = G)\}$ for $G \in \mathcal{G}_{\tilde{X}}$ is the maximum amount of dirt that can be moved from $\mathrm{mean}_X(i_X(G))$ to $\mathrm{mean}_{\tilde{X}}(G)$. The above argument gives the inequality where $M$ is the maximum distance between any two diagrams in $\mathcal{S}_{M,NK}$. In this inequality, the first term corresponds to moving as much dirt as possible from $\mathrm{mean}_X(i_X(G))$ to $\mathrm{mean}_{\tilde{X}}(G)$, the second term accounts for the leftover dirt from this motion, and the last term is the amount of dirt from couplings which have no pair.
In order to bound $W_2(\mathrm{mean}_{\tilde{X}}(G), \mathrm{mean}_X(i_X(G)))^2$, observe that every off-diagonal point that appears in $\mathrm{mean}_{\tilde{X}}(G)$ also appears in $\mathrm{mean}_X(i_X(G))$, and that the additional points in $\mathrm{mean}_X(i_X(G))$ correspond to the trivial selections $m_x$ for all $x \in X \setminus \tilde{X}$. Each of these additional points is at distance $\|x - \Delta\|/N$ from the diagonal. Thus, using the bijection sending each of these additional points to the diagonal, $W_2(\mathrm{mean}_{\tilde{X}}(G), \mathrm{mean}_X(i_X(G)))^2 \leq \sum_{x \in X \setminus \tilde{X}} \|x - \Delta\|^2 / N^2$. Therefore, Let $\Omega$ and $\tilde{\Omega}$ be random variables defined on subsets of $X$ and $\tilde{X}$, respectively, which give the set of points which do indeed lead to a drawn point. Note that $\mathbb{P}(H_X = i_X(G)) = \mathbb{P}(H_X = i_X(G) \text{ and } \Omega \subset \tilde{X}) + \mathbb{P}(H_X = i_X(G) \text{ and } \Omega \not\subset \tilde{X})$ and $\mathbb{P}(H_{\tilde{X}} = G) = \mathbb{P}(H_{\tilde{X}} = G \text{ and } \tilde{\Omega} \subset \tilde{X})$. Hence, Note that for any subset $\omega \subset \tilde{X}$ and coupling $G \in \mathcal{G}_{\tilde{X}}$, $\mathbb{P}(H_X = i_X(G) \text{ and } \Omega = \omega) = \mathbb{P}(H_X = i_X(G) \mid \Omega = \omega) \cdot \mathbb{P}(\tilde{\Omega} = \omega) \cdot \prod_{x \in X \setminus \tilde{X}} \mathbb{P}(x \notin \Omega)$, and likewise $\mathbb{P}(H_{\tilde{X}} = G \text{ and } \tilde{\Omega} = \omega) = \mathbb{P}(H_{\tilde{X}} = G \mid \tilde{\Omega} = \omega) \cdot \mathbb{P}(\tilde{\Omega} = \omega)$. Comparing these two equations gives $\mathbb{P}(H_X = i_X(G) \text{ and } \Omega = \omega) = \mathbb{P}(H_{\tilde{X}} = G \text{ and } \tilde{\Omega} = \omega) \cdot \prod_{x \in X \setminus \tilde{X}} \mathbb{P}(x \notin \Omega)$.
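The $\|x - \Delta\|/N$ claim above can be checked concretely. The sketch below is our own illustration, not the paper's code: `selection_mean` minimizes the sum of squared distances to the $k$ selected points plus squared distance to the diagonal for the remaining $N - k$ diagrams, working in rotated coordinates ($u$ along the diagonal, $v$ perpendicular to it), where the minimizer has a closed form. For a trivial selection $m_x$, the resulting point sits at distance $\|x - \Delta\|/N$ from the diagonal, as stated in the text.

```python
import math

def selection_mean(points, N):
    """Mean point of a selection: k off-diagonal (birth, death) points drawn
    from k of the N diagrams, with the other N - k diagrams contributing
    the diagonal. Minimizes sum of squared distances."""
    k = len(points)
    us = [(b + d) / math.sqrt(2) for b, d in points]  # component along diagonal
    vs = [(d - b) / math.sqrt(2) for b, d in points]  # component off diagonal
    u = sum(us) / k   # along-diagonal part: average of the selected points
    v = sum(vs) / N   # off-diagonal part: shrunk toward the diagonal by k/N
    # convert back to (birth, death) coordinates
    return ((u - v) / math.sqrt(2), (u + v) / math.sqrt(2))

def dist_to_diagonal(p):
    b, d = p
    return (d - b) / math.sqrt(2)
```

With $k = 1$ this recovers the factor $1/N$: minimizing $(\|x - \Delta\| - t)^2 + (N-1)t^2$ over the distance $t$ to the diagonal gives $t = \|x - \Delta\|/N$.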

This equation with the triangle inequality gives
where the final inequality follows from the union bound. From our method of drawing points, $\mathbb{P}(\ldots)$. Thus we can further continue equation (17) as follows.
This bounds the first piece of Equation 16, and thus all that remains to be bounded from Equation 14 is the leftover from Equation 16 and the leftover from Equation 14. Note that for any $G \in \mathcal{G}_X \setminus \mathrm{Im}(i_X)$, it is impossible to form the coupling $G$ without at least one point outside of $\tilde{X}$, so $\mathbb{P}(H_X = G) = \mathbb{P}(H_X = G \text{ and } \Omega \not\subset \tilde{X})$. Therefore, Note that the argument for $Y$ is similar, so we also have Together these imply that

Part 2: We now wish to bound $W_2(\mu_{\tilde{X}}, \mu_{\tilde{Y}})$. To do this we will use the triangle inequality alongside Proposition 12.
Consider an arbitrary ordering of the points in $\tilde{X}$ and order the points in $\tilde{Y}$ accordingly via $\varphi$. For each $j$, we create a set of diagrams $X^j$ by taking the first $j$ points to be from $\tilde{Y}$ and the remaining points from $\tilde{X}$. Note that $X^0 = \tilde{X}$ and $X^m = \tilde{Y}$, where $m$ is the total number of off-diagonal points contained in the diagrams in $\tilde{X}$, and therefore the triangle inequality gives: If $x$ is the point in $X^j$ which is moved to $\varphi(x)$ in $X^{j+1}$, then Proposition 12 shows that Using the stipulation that $W_2(X, Y) \leq 1$, we know that $\|x - \varphi(x)\| \leq 1$ for every $x \in \tilde{X}$, and hence This implies that
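The elided triangle-inequality display presumably takes the following telescoping form (a reconstruction from the surrounding text; each summand is then controlled by Proposition 12, which bounds the change in the mean when a single point is moved):

```latex
W_2(\mu_{\tilde{X}}, \mu_{\tilde{Y}}) \;\leq\; \sum_{j=0}^{m-1} W_2\big(\mu_{X^j}, \mu_{X^{j+1}}\big)
```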
Since there are at most $NK$ points in $\tilde{X}$, we know We finally can say, assuming $W_2(X, Y) \leq 1$, that Set

Examples
We now give some examples of the probabilistic Fréchet mean of a set of diagrams, introducing a useful way to visualize them along the way. Recall that the mean distribution of a set of diagrams is a weighted sum of delta-measures, each one concentrated on the mean of one of the possible matchings among the diagrams, with the weights given by the probability that a perturbation of the diagrams would produce that matching.

Figure 9: Change in mean distribution as a set of two diagrams moves through the problematic configuration of Figure 5. On the left we see the mean of two diagrams which form a rectangle that is slightly longer in the death-axis direction. On the right is the result for a rectangle that is quite a bit longer in the birth-axis direction.
In Figure 9, we show a resolution to the discontinuity issue raised in Figure 5, although the figure needs some explanation. The flat colored dots on the left side of this figure represent a pair of diagrams which form a rectangle that is slightly longer in the death-axis direction. To approximate the probability of each possible matching between the pair, we perturbed the diagrams 100 times with $\alpha = 0.3$ and $\eta_0$ equal to the uniform distribution, and simply counted the number of times each matching occurred. The results are shown on the left side of the figure, where the height of a colored stack represents the weight of the diagram which contains the point at the bottom of the stack; note that the green stacks are slightly taller than the purple ones. On the other hand, the right side of the same figure shows the mean distribution for a pair of diagrams which forms a rectangle that is quite a bit longer in the birth-axis direction.
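The counting scheme described above can be sketched as follows. This is a simplified illustration, not the authors' code: it perturbs each point uniformly in a disk of radius $\alpha$ and, for a pair of two-point diagrams, compares only the two off-diagonal matchings, ignoring matchings that use the diagonal.

```python
import math
import random

def cost(pairing, p1, p2):
    # squared-distance cost of matching points of diagram 1 to diagram 2
    return sum((p1[i][0] - p2[j][0])**2 + (p1[i][1] - p2[j][1])**2
               for i, j in pairing)

def estimate_matching_probs(d1, d2, alpha=0.3, trials=1000, seed=0):
    """Monte Carlo estimate of how often each matching between two
    two-point diagrams is optimal after a uniform disk perturbation."""
    rng = random.Random(seed)
    # the two purely off-diagonal matchings: identity and swap
    pairings = [((0, 0), (1, 1)), ((0, 1), (1, 0))]
    counts = [0, 0]
    for _ in range(trials):
        def jiggle(p):
            # uniform draw from the disk of radius alpha around p
            r = alpha * math.sqrt(rng.random())
            t = 2 * math.pi * rng.random()
            return (p[0] + r * math.cos(t), p[1] + r * math.sin(t))
        q1 = [jiggle(p) for p in d1]
        q2 = [jiggle(p) for p in d2]
        costs = [cost(m, q1, q2) for m in pairings]
        counts[costs.index(min(costs))] += 1
    return [c / trials for c in counts]
```

For a rectangle far longer in one direction than the perturbation radius, one matching is optimal in essentially every trial; as the rectangle approaches the square configuration, the two frequencies approach one another, which is how the mean distribution varies continuously where the ordinary Fréchet mean jumps.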
For a more complicated example, we drew thirty different point clouds from a pair of linked annuli of different radii; one such point cloud is shown on the left of Figure 10. Then we computed the one-dimensional persistence diagram for each point cloud, using the recently-developed M12 software package [12]. As one might expect, each diagram contained a point for the big annulus, a point for the small annulus, and a good bit of noise along the diagonal. However, the birth times of the non-noisy points varied quite widely. The set of thirty diagrams, overlaid in one picture, is shown on the right of the same figure.
Finally, we computed the mean distribution of these thirty diagrams, using the same approximation scheme as above. On the left of Figure 11, we see an overlay of the set of all diagrams which receive positive weight in the mean distribution, while the right side of the same figure displays the mean distribution using the same colored-stack scheme as in the example above. Notice that the two very large stacks are actually at height one, which indicates that every single diagram in the mean contains the two non-noisy dots from the left side of the figure.

Conclusions and Future Work
In this paper, we have defined a new mean which, unlike its predecessor, is continuous for continuously varying diagrams. This mean is, in fact, a distribution on diagram space, and is one feature of the distribution of diagrams from which it arose. We hope that this new definition will provide a useful statistical tool for topological data analysis. We also believe that this is an important step in the overall project of establishing persistent homology as an important shape statistic. Several questions remain, however, and there are obviously many directions for future research. We list some of them here.

Figure 10: Thirty diagrams were created from thirty point clouds drawn from a double annulus. One such point cloud is shown on the left. All thirty diagrams are overlaid on the right.
The most pressing need, of course, is to study how far we can take this new definition into the realm of traditional statistics. In particular, can we prove laws of large numbers, central limit theorems, and the like? Will this mean actually provide a useful tool towards the bootstrapping idea discussed in the introduction? Can we use this new mean, and the associated variance function, to provide more insight into the convergence rate theorems of [8]?
On a more technical level, can we improve our continuity theorem to remove the reliance on the subspaces $\mathcal{S}_{M,K}$? At the moment, we cannot find counterexamples to a more general statement, but neither can we prove the theorem without making finiteness assumptions.
Note, too, that we have only addressed means and variances in this paper. Another interesting statistical summary of data is the median; this will be addressed in an upcoming paper [25].
Perhaps the most important project is to understand under what conditions persistence diagrams provide sufficient statistics for an object, a data cloud, etc. The work in this paper will be a critical part of this effort.

where $A(x, [a, b])$ denotes the annulus centered at $x$ with inner radius $a$ and outer radius $b$. From these last two equations, we compute: For each $i$, if $\|x - y\| < r_i$, then a ball of diameter $2r_i - \|x - y\|$ fits in the intersection of $B(x, r_i)$ and $B(y, r_i)$. Thus, recalling that the absolute value of the difference between the indicator functions of two sets is just the indicator function of their symmetric difference, we compute: The bound in (23) also holds if $\|x - y\| \geq r_i$, as then the integral is $2\pi r_i^2$. For each $r$ with $\alpha \geq \|x - \Delta\| \geq r \geq \|y - \Delta\|$, $\int |1_{A(x, [\|y - \Delta\|, r])}| \, d\lambda = \pi r^2 - \pi \|y - \Delta\|^2$. Similarly, if $\|x - \Delta\| > \alpha \geq r \geq \|y - \Delta\|$, then $\int |1_{A(x, [\|y - \Delta\|, r])}| \, d\lambda \leq 2\alpha\pi \|x - y\|$ also.
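The elided computation can be reconstructed from the stated geometric fact: since a ball of diameter $2r_i - \|x - y\|$ fits inside $B(x, r_i) \cap B(y, r_i)$, the symmetric difference satisfies (presumably this is the content of the elided bound (23)):

```latex
\int \big|1_{B(x, r_i)} - 1_{B(y, r_i)}\big| \, d\lambda
= \lambda\big(B(x, r_i) \,\triangle\, B(y, r_i)\big)
\leq 2\pi r_i^2 - 2\pi\Big(r_i - \tfrac{\|x - y\|}{2}\Big)^2
= 2\pi \|x - y\| \Big(r_i - \tfrac{\|x - y\|}{4}\Big)
\leq 2\pi r_i \|x - y\|.
```

When $\|x - y\| \geq r_i$, the left-hand side is at most $2\pi r_i^2 \leq 2\pi r_i \|x - y\|$, so the same bound holds in both cases.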
Together these imply that Observing that $\sum_i a_i$ is the value of $f$ at $0$ completes the proof.