Topological Regularization via Persistence-Sensitive Optimization

Optimization, a key tool in machine learning and statistics, relies on regularization to reduce overfitting. Traditional regularization methods control a norm of the solution to ensure its smoothness. Recently, topological methods have emerged as a way to provide a more precise and expressive control over the solution, relying on persistent homology to quantify and reduce its roughness. All such existing techniques back-propagate gradients through the persistence diagram, which is a summary of the topological features of a function. Their downside is that they provide information only at the critical points of the function. We propose a method that instead builds on persistence-sensitive simplification and translates the required changes to the persistence diagram into changes on large subsets of the domain, including both critical and regular points. This approach enables a faster and more precise topological regularization, the benefits of which we illustrate with experimental evidence.


Introduction
Regularization is key to many practical optimization techniques. It allows the user to add a prior about the expected solution (e.g., that it needs to be smooth or sparse) and optimize it together with the main objective function. Classical regularization techniques [1], such as ℓ_1- and ℓ_2-norm regularization, have been studied in statistics and signal processing since at least the 1970s. These techniques are especially important in machine learning, where problems are often ill-posed and regularization helps prevent overfitting. Accordingly, various regularization techniques are not only used in machine learning research [2,3], but are also incorporated into standard optimization software and routinely used in applications.
Recently, several authors have begun to explore the use of topological methods to regularize the objective function. All of them use persistent homology to measure either the shape of the data set or the topological complexity of the learned function. For instance, Chen et al. [4] use persistence to describe the complexity of the decision boundary in a classifier and add terms to the loss to keep this boundary topologically simple. Brüel-Gabrielsson et al. [5] use persistence as a descriptor of the topology of the data and introduce a family of losses to control the shape of the data once it passes through a neural network.
All the methods that incorporate persistence into the loss function [4,5,6] rely on the same observation. Persistent homology describes data via a diagram, a collection of points {(b_i, d_i)} in the plane, that encodes the topological features of the data: components of the decision boundary, "wrinkles" in the learned function, cycles in the point set once it passes through the neural network. Each point represents the birth b_i and death d_i of a topological feature. Each coordinate depends on the value of the function on a set of points. In the simplest case, (b_i, d_i) = (f(x), f(y)) for some x, y in the input, where f is the learned function. In more sophisticated cases, each point in the persistence diagram is generated by a handful of input points (e.g., four [5]). Accordingly, if a loss L prescribes moving a point in the persistence diagram via a gradient (∂L/∂b_i, ∂L/∂d_i), one can back-propagate it to update the model parameters.
Although persistent homology describes a family of topological features of different dimensions (connected components, loops, voids), most practical examples have focused on 0-dimensional features (connected components generated by the extrema of the input function). In this case, a natural loss is one that penalizes and tries to remove low-persistence features, which are interpreted as noise, e.g., L = Σ_{d_i − b_i < ε} (d_i − b_i)^2. Persistence-sensitive simplification [7,8,9] offers a direct solution to this problem. It prescribes how to modify a given input function f to find a function g that is ε-close to f, but without the noisy features. Given such a g, which by construction minimizes the diagram loss L above, one can use ‖f − g‖_2 as a term in the loss. In the context of learning, this approach offers a major advantage: instead of supplying gradients only on the critical points of f, we also get gradients on the regular points of f whose values must be changed to topologically simplify the function; see Figure 1.

Our contributions are:
• a method to control the topological complexity of a function, represented by a neural network, by incorporating persistence-sensitive simplification into the training;
• a comparison of the training results after backpropagating gradients through the diagram vs. using persistence-sensitive optimization;
• experiments with data that illustrate the utility of controlling the topology of the learned function.
We note that topological methods have found a much broader use in machine learning than regularization. An important line of work involves developing techniques to incorporate topological features detected in data into machine learning algorithms [10,11,12]. Although there is some overlap in methods between the two research directions (notably, propagating loss through the persistence diagram), our work is focused on regularization.

Background
We recall the relevant background in topological data analysis [13], focusing specifically on 0-dimensional persistent homology, which we introduce using an auxiliary computational construction, merge trees.
Merge trees. Let f : X → R be a function on a topological space X. A merge tree tracks the evolution of connected components in the sub-level sets f^{-1}(−∞, a] of the function as we vary the threshold a. Formally, we identify two points x, y of X if f(x) = f(y) = a and x and y belong to the same connected component of the sub-level set f^{-1}(−∞, a]. The quotient of X by this equivalence relation is called a merge tree of f. Throughout the paper we use graphs to approximate continuous spaces, so we briefly dissect the above definition for functions on graphs. Let f : G → R be a function on a graph G = (V, E), defined on the vertices and linearly interpolated on the edges. For simplicity, we assume that all the values of f on the vertices are distinct and index the vertices in increasing order of their values, f(v_1) < f(v_2) < ... < f(v_n). Vertices v_i and v_j, with i < j, are connected in the merge tree if they belong to the same connected component C of the sub-level set f^{-1}(−∞, f(v_j)] and there does not exist k such that i < k < j and v_k ∈ C. A merge tree T is not necessarily a tree (it is a forest, with a tree for every connected component of G), but the distinction is minor for this paper.
T is naturally decomposed into branches; see Figure 1. Each branch starts at a minimum and ends where the corresponding component merges into an older one; recording the pair of function values (b_i, d_i) at these two points, for every branch, gives the 0-dimensional persistence diagram Dgm(f), and the difference d_i − b_i is called the persistence of the point. Points closer to the diagonal represent shorter branches, and we interpret them as noise.
Although we have defined everything in terms of the sub-level sets, the definition for the super-level sets, f^{-1}[a, ∞), is symmetric, with maxima replacing the minima. We use both constructions throughout the paper.
If graph G has n vertices and m edges, then a merge tree on G can be computed in O(n log n + m α(m)) time, where α is the inverse Ackermann function. It follows that a 0-dimensional persistence diagram can be computed in the same time.
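A minimal sketch of this computation is shown below. It pairs minima with the merge values via a union-find sweep over the edges (the "elder rule"); the function and variable names are illustrative, and this is not the implementation used in the paper.

```python
# Sketch: 0-dimensional persistence of a vertex-valued function on a graph.
# `values` holds f(v) for vertices 0..n-1; `edges` is a list of (u, v) pairs.

def persistence_pairs(values, edges):
    n = len(values)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]          # path halving
            x = parent[x]
        return x

    # An edge enters the sub-level set once both endpoints have appeared,
    # i.e., at the larger of its two endpoint values.
    pairs = []                                     # finite (birth, death) pairs
    for u, v in sorted(edges, key=lambda e: max(values[e[0]], values[e[1]])):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                               # no merge, no new pair
        if values[ru] > values[rv]:
            ru, rv = rv, ru                        # ru: older root (lower minimum)
        pairs.append((values[rv], max(values[u], values[v])))   # younger branch dies
        parent[rv] = ru                            # roots always carry component minima
    births_of_essential = [values[v] for v in range(n) if find(v) == v]
    return pairs, births_of_essential
```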
To visualize the topological changes in the model during optimization, we stack persistence diagrams next to each other. The resulting vineyard of a family of functions f_i is a multiset of points (i, d − b), one for each point (b, d) in the persistence diagram of f_i. In other words, over each i (for example, a training epoch) we plot all persistences of the corresponding diagram.
Simplification. An important property of persistence is stability: a small perturbation of the function f causes a small perturbation of the persistence diagram Dgm(f). The formal statement is the celebrated Stability Theorem: d_B(Dgm(f), Dgm(g)) ≤ ‖f − g‖_∞, where f and g are two real-valued functions on the same domain and d_B denotes the bottleneck distance. This theorem is one of the justifications for treating points close to the diagonal as topological noise. This view suggests getting rid of the topological noise. Given ε > 0, an ε-simplification of f is a function g with ‖f − g‖_∞ ≤ ε whose persistence diagram consists of the points of Dgm(f) with persistence greater than ε. In other words, g is ε-close to f, but its persistence diagram has only those points whose persistence exceeds ε. In the case of 0-dimensional persistence, an ε-simplification always exists and can be computed in the same time as a merge tree [7,8,9].

Method
We start with the standard supervised learning problem.
Given training data x_i with labels y_i, we want to learn a model f_θ, with parameters θ, that approximates y_i given x_i. Although this framework applies more generally, throughout the paper we focus on the case where f_θ is a neural network.
Suppose we are solving a regression problem. In this case, the input labels are scalars, y_i ∈ R, and our network maps from some (typically) Euclidean space into the reals, f_θ : R^d → R. The learning process is usually a form of gradient descent on the network parameters with respect to a user-chosen loss, for example, the mean-squared error (MSE), (1/n) Σ_i (f_θ(x_i) − y_i)^2. Ideally, we would like to topologically simplify the model f_θ either on its entire domain, or at least on the "data manifold," the subset of the domain that contains all possible data. Unfortunately, there are no algorithms to solve this problem (topological methods require a combinatorial representation of the domain), so we resort to a standard approximation.
We take the domain of the network f_θ to be the k-nearest neighbors graph on the training set X: each training sample is a vertex, and two vertices are connected if and only if one of them is among the k nearest neighbors of the other. The k-NN graph G approximates the data manifold. We can increase the quality of this approximation by sampling additional points in the neighborhood of our input. In the experiments in Section 6, we draw n additional points from a normal distribution centered on each training data point x ∈ X, which results in a graph with (n + 1) · |X| vertices. (Although we do not know the true labels at the extra points, we do not need them for the topological simplification.) Both because computing a k-NN graph is expensive for high-dimensional data and because it helps to control noise, in some experiments we build the k-NN graph on a lower-dimensional projection of X obtained with PCA.
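The sketch below illustrates this domain construction with scikit-learn. The helper name `build_knn_graph` and its parameters (k, n_extra, sigma, pca_dims) are illustrative placeholders, not the paper's exact settings.

```python
# Sketch: approximate the data manifold by a symmetric k-NN graph on the training
# set, augmented with Gaussian samples and optionally projected with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(X, k=8, n_extra=0, sigma=0.01, pca_dims=None, seed=0):
    rng = np.random.default_rng(seed)
    points = [X]
    for _ in range(n_extra):                       # n_extra Gaussian samples per input point
        points.append(X + rng.normal(scale=sigma, size=X.shape))   # sigma as std. dev. here
    P = np.vstack(points)                          # (n_extra + 1) * |X| vertices

    Q = PCA(n_components=pca_dims).fit_transform(P) if pca_dims else P
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(Q).kneighbors(Q)

    edges = set()
    for i, neighbors in enumerate(idx[:, 1:]):     # column 0 is the point itself
        for j in neighbors:
            a, b = (i, int(j)) if i < j else (int(j), i)
            edges.add((a, b))                      # undirected: keep edge if either lists the other
    return P, sorted(edges)
```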
We use merge trees to compute an ε-simplification g of our model f_θ. For every vertex v, we find its first ancestor u that lies on a branch with persistence at least ε and set g(v) = f(u).
The effect of this operation on the merge tree is that all the branches with persistence less than ε are removed; see Figure 1.
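A minimal sketch of this simplification step follows. It assumes the merge tree is available as parent pointers with a branch decomposition; the names `parent`, `branch`, and `pers` are hypothetical, and the root branch is treated as infinitely persistent so the climb always terminates.

```python
# Sketch: persistence-sensitive simplification read off a merge tree.
# `parent[v]`: merge-tree parent of vertex v; `branch[v]`: branch containing v;
# `pers[b]`: persistence of branch b (infinite for the root branch).

def simplify(values, parent, branch, pers, eps):
    g = dict(values)                               # start from f, flatten noisy vertices
    for v in values:
        u = v
        while pers[branch[u]] < eps:               # climb to the first persistent branch
            u = parent[u]
        if u != v:
            g[v] = values[u]                       # g(v) = f(u), as described above
    return g
```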
Applying simplification. Given an ε-simplification g of f_θ, we could add a term λ · ‖f_θ − g‖_2 to the loss and use a single optimizer. Instead, we opted for a different approach: alternating between the standard training phase and the topological phase, with a separate optimizer for each. A key advantage of this separation is that it keeps two histories of the gradients, one for each phase, so that the topological loss does not influence the momentum in the standard training.
An important decision is when to switch to the topological phase. We use a heuristic that depends on the validation loss. In each epoch, we first iterate over all batches and perform standard training using the first optimizer. Then, if the validation loss increases, compared to the previous epoch, by more than some threshold (a hyperparameter), we compute the ε-simplification g and take 5 to 10 steps with the second optimizer to minimize ‖f_θ − g‖_2. We use the norms of the gradients of the ordinary training loss and of the topological loss to set a learning rate for the latter that ensures that we update the model parameters θ by comparable amounts in both phases.
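A PyTorch-style sketch of this two-phase loop is given below. The names `model`, `train_loader`, `val_loader`, `evaluate`, `graph_vertices`, `graph_edges`, `epsilon_simplification`, `t`, `topo_steps`, and `num_epochs` are placeholders, and the code is an outline of the procedure described above rather than the authors' implementation.

```python
import torch

opt_main = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_topo = torch.optim.Adam(model.parameters(), lr=1e-3)   # separate gradient history

prev_val = float("inf")
for epoch in range(num_epochs):
    for xb, yb in train_loader:                     # standard phase
        opt_main.zero_grad()
        torch.nn.functional.mse_loss(model(xb).squeeze(-1), yb).backward()
        opt_main.step()

    val_loss = evaluate(model, val_loader)
    if val_loss - prev_val > t:                     # validation loss jumped: simplify
        with torch.no_grad():
            f_vals = model(graph_vertices).squeeze(-1)
        g_vals = epsilon_simplification(f_vals, graph_edges, eps=val_loss)
        # (one may rescale opt_topo's learning rate by the ratio of the gradient
        #  norms of the two losses, so both phases move theta by comparable amounts)
        for _ in range(topo_steps):                 # 5 to 10 steps in the paper
            opt_topo.zero_grad()
            topo_loss = torch.norm(model(graph_vertices).squeeze(-1) - g_vals, p=2)
            topo_loss.backward()
            opt_topo.step()
    prev_val = val_loss
```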

Choice of ε. A key decision in implementing our method is how to choose ε, which determines which points to keep and which to remove in the persistence diagram. Earlier works [4,5] prescribe a fixed number of points to keep in a certain region of the persistence diagram. For instance, some of the losses in [5] penalize all but the j most persistent points. We can optimize such a loss by setting ε = (p_j + p_{j+1})/2, where p_i denotes the persistences of the points, sorted in descending order.
Another alternative, used in topological data analysis to automatically distinguish between persistent and noisy points, is the largest-gap heuristic. To apply it, we find the index j such that the difference p_j − p_{j+1} is maximized.
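The two heuristics amount to a few lines of code; the sketch below uses illustrative function names.

```python
# Sketch: two ways to pick eps from the persistences p_1 >= p_2 >= ...
import numpy as np

def eps_keep_top_j(persistences, j):
    p = np.sort(np.asarray(persistences))[::-1]    # descending order
    return (p[j - 1] + p[j]) / 2                   # keep the j most persistent points

def eps_largest_gap(persistences):
    p = np.sort(np.asarray(persistences))[::-1]
    j = int(np.argmax(p[:-1] - p[1:]))             # index of the largest drop p_j - p_{j+1}
    return (p[j] + p[j + 1]) / 2
```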
Finally, the heuristic that we found most effective, and use for all experiments in Section 6, is to use the validation loss as ε. The validation loss tells us how far we are from a function that gives perfect answers on the validation set. Using it as ε, we find the topologically simplest function g that is within the same distance from our model f_θ.
Classification. For regression, the network itself serves as a real-valued function amenable to topological analysis. Classification requires a little more work. We assume that the data has m classes and the network has m output channels, f_θ : R^d → R^m, with the predicted class chosen as p = argmax_i f_θ(x)[i]. We define the confidence function φ : R^d → R to measure how much higher the value in the predicted channel is compared to the second highest candidate: φ(x) = f_θ(x)[p] − max_{i≠p} f_θ(x)[i]. When φ(x) is close to 0, the network is not confident whether to classify x as the top class p or the second-best guess. The zero set φ^{-1}(0) is the decision boundary, by definition. Outliers of one class scattered among the points of another introduce spurious extrema in the confidence function. By driving optimization towards the simplified version of φ, we can reduce overfitting.
Because generically φ(x) is never zero on an input point x ∈ X, we need an extra step to capture the topology of the decision boundary. If two vertices u and v, connected by an edge in the k-NN graph, are assigned two different classes by the network, then the decision boundary passes somewhere between them. In this case, we remove the edge (u, v) from the graph. This pruning results in multiple connected components, at least one per class. We compute the merge tree (a forest in this case) of the confidence function on the pruned graph with respect to the super-level sets, i.e., tracking the persistence of the maxima. Because the confidence function is never negative, we restrict the infinite branches in the merge tree to die at 0. This obviates special treatment of separate connected components in the graph: if one of them produces a low-persistence merge tree, we simplify it by setting the values of all of its vertices to 0.
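A small sketch of the confidence function and the decision-boundary pruning follows; `logits` is an assumed array of network outputs with shape [num_vertices, num_classes], and the helper names are illustrative.

```python
import numpy as np

def confidence(logits):
    top2 = np.sort(logits, axis=1)[:, -2:]         # two largest values per vertex
    return top2[:, 1] - top2[:, 0]                 # top-1 minus top-2: phi(x) >= 0

def prune_decision_boundary(edges, logits):
    pred = np.argmax(logits, axis=1)
    # drop every edge whose endpoints are assigned different classes
    return [(u, v) for (u, v) in edges if pred[u] == pred[v]]
```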

Comparison with Diagram Simplification
Earlier work on applying topological regularization to neural networks [4,5] relied on backpropagation through persistence diagrams. For piecewise-linear functions on a graph, each point in the 0-dimensional persistence diagram corresponds to a pair of vertices, (b_i, d_i) = (f(x), f(y)). If one adds a regularization term of the form Σ (d_i − b_i)^2, where the sum is taken over all points (b_i, d_i) with persistence less than ε, then one can back-propagate the gradient to the function values and then to the model parameters, i.e., the weights of the network. We call this loss the diagram loss, and the loss proposed in the previous section the PSO loss.
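The sketch below illustrates how such a diagram loss can be differentiated: the pairing is recomputed without gradients, and the penalty is then applied to the function values at the paired vertices, which is differentiable. `persistence_pairs_indices` is a placeholder returning (birth_vertex, death_vertex) index pairs; this is an illustration of the idea, not the code of [4,5].

```python
import torch

def diagram_loss(f_vals, edges, eps):
    pairs = persistence_pairs_indices(f_vals.detach().cpu().numpy(), edges)
    loss = f_vals.new_zeros(())
    for b_idx, d_idx in pairs:
        pers = f_vals[d_idx] - f_vals[b_idx]        # differentiable w.r.t. f_vals
        if pers.item() < eps:                       # penalize only low-persistence pairs
            loss = loss + pers ** 2
    return loss
```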
The first disadvantage of the diagram loss is that only critical points generate pairs in the persistence diagram. Accordingly, most input points are not used and receive no information during the backpropagation. To illustrate this, we take f : R^2 → R to be the sum of 4 Gaussians and evaluate f on a uniform grid over the unit square [0, 1] × [0, 1] with 10,000 vertices. Figure 2a illustrates the plot of f. We pick ε so that the two lower-persistence points in the diagram of f (corresponding to the two Gaussians with lower peaks) are simplified, and take 50 steps of gradient descent using the PSO loss and the diagram loss directly on the values of f at each vertex. The simplified functions appear in Figures 2c and 2e, respectively.
Figures 2b and 2d show the vineyards of the two optimization processes. In both vineyards, we show the original persistence values in black, the desired values in red, and the values at each step of the optimization in green. With the PSO loss, this is an unconstrained convex problem, so the optimizer quickly eliminates the two designated low-persistence points.

Figure 3 shows the effect of the two losses on a neural network. We train a fully connected network with 5 layers for 100 epochs and then perform 30 steps of topological optimization. The key difference from the previous example is that we do not have direct control over the function values, but only over the weights of the network. The diagram loss provides information only for the critical points of the function, and the optimizer ends up minimizing this loss by pushing the whole function towards a constant: in the vineyard on the right-hand side, all points, not just the points below ε, are moving to 0. Since the PSO loss penalizes changes to the high-persistence parts of the function, its optimization does not suffer from the same problem, as the vineyard on the left-hand side shows.
It is not clear how to fix this overzealousness of the diagram loss. The main difficulty is that the critical vertices and their pairing change after each gradient descent step. A naive fix would be to add a term, with weight λ, that pushes the high-persistence points towards ∞. We have tried this approach, but it did not perform well. Depending on the weight λ, either the additional term had no influence at all, and the function was squashed to a constant; or it dominated, and the function exploded numerically.
A more principled solution would be to compute a matching between the persistence diagram after each step of the topological optimization and the target simplified diagram. The matching would translate into a loss that simplifies the diagram while trying to preserve the high-persistence points. However, this approach has many drawbacks. The computation of the matching, even using fast algorithms [14], is prohibitively expensive and would make the procedure impractical. The method itself, by construction, would only preserve the structure of the persistence diagram, not its values at individual vertices. Finally, changing the diagram loss function at each step of the gradient descent may have unexpected effects on the momentum.

Illustrative Example
To illustrate how topological regularization using the PSO loss can reduce overfitting, we consider a simple three-class dataset, shown in Figure 4a. It consists of points sampled from three Gaussians, 1,000 points from each, that represent three distinct classes. We randomly shuffle 20% of the labels to introduce class noise. We train a fully-connected feedforward neural network with 5 hidden layers of 100 nodes each for 500 epochs.
Figure 4b illustrates the training and validation losses, and Figure 4c shows the persistence vineyard of the confidence function for epochs 350 to 500. At the beginning of this range, the network has already overfit the labels. The growing validation loss confirms the overfitting, which is also evident in the vineyard, where the second and third highest-persistence points, which represent the true classes in the data, are becoming indistinguishable from the noisy points.
Starting with epoch 450, we apply ten steps of topological simplification after every training epoch. Because we expect each of the three classes to be a single cluster, we set ε to keep the three highest points in the persistence diagram. This defines a PSO loss that encourages removing maxima of the confidence function that do not correspond to the three predominant class clusters.
As Figure 4b illustrates, after turning on simplification at epoch 450, the validation loss decreases by over 20%. Figure 4c demonstrates the abundance of high-persistence features prior to epoch 450. Most of these correspond to mountains in the confidence function around noisy, mislabeled points. Turning on simplification at epoch 450 reduces the persistence of these peaks, which drives the network to match the class labels of the dominant class around the outliers.
This toy example demonstrates how PSO simplification identifies regions of overfitting due to class noise and reduces the confidence function near the mislabeled points, lowering the validation loss and increasing the accuracy of the model after overfitting has occurred.

Experiments
We study the performance of persistence-sensitive optimization on six regression problems and seven classification problems from the UCI repository [15]. To represent a variety of problem settings, the selected datasets vary in the number of features, sample size, and number of classes. We standardize the features by subtracting the mean and dividing by the standard deviation. For both regression and classification, we use a dense neural network with five hidden layers and 100 hidden nodes per layer. We use the Adam optimizer and a learning rate of 0.001 across all experiments, including regular training and training with topological simplification.
We compare the performance of networks trained (1) without regularization, (2) with ℓ_2 regularization, and (3) with topological regularization. For all experiments, training with and without regularization was run for the same number of total epochs. For the ℓ_2 regularization, the squared norm of the weights of the network is added to the loss, scaled by a factor λ, which we choose by sweeping over a logarithmically spaced grid on [10^{-5}, 10^{1}]. We report the best performance across all λ for each dataset. For each dataset, we run all the models at least five times with different preset random seeds and average over all the trials.
As described in Section 3, we set a number of hyperparameters during the topological simplification:
• topological simplification is applied when the validation loss increases by more than t;
• k determines the number of neighbors in the k-NN graph used to approximate the domain of the function;
• n is the number of additional points we sample, for each input point, before building the k-NN graph;
• the additional points are drawn from a Gaussian with variance σ, ranging from 0.001 to 0.
We evaluate the quality of the prediction using the root-mean-square deviation (RMSD), √((1/n) Σ_i (ŷ_i − y_i)^2).
Table 1 presents the results of our regression experiments. Overall, topological simplification reduces RMSD across all the datasets by an average of 6.9%. Sampling each point multiple times with a small amount of perturbation improves performance. By applying simplification when the validation loss increases by more than the threshold t, we reduce overfitting and the resulting error. We also see that, across the λ values swept for ℓ_2 regularization, the performance is always worse than with topological simplification. We note that our method is fast enough to be used on very large datasets (we give two examples with 40,000+ points, which is by no means the limit); previous approaches to topological regularization (using a form of diagram loss) [4] were limited to much smaller datasets (hundreds to a thousand points).
Classification. We also evaluate our method on seven classification datasets. Each one has from two to 26 classes. Similar to the regression datasets, each has hundreds (Wisconsin cancer, Vertebral, SPECT) to thousands (Wine, Semeion, Wireless) to tens of thousands (Letter recognition) of data points. We use the same 56%-19%-25% training-validation-test split. When topological simplification is applied, we set ε to the cross-entropy loss and simplify the confidence function φ, described in Section 3. We evaluate the quality of our predictions by computing the cross-entropy (X-E) loss and accuracy.

Conclusion
We presented a topological regularization method that uses persistent homology, merge trees, and persistence-sensitive simplification to minimize the number of noisy extrema in a machine learning model. Unlike previous such methods, our approach is faster, requiring the topological descriptor to be computed only once per simplification phase, as well as more robust and predictable in its effects on the model. The key distinction of the method is its ability to prescribe gradients on the entire domain, approximated as a k-NN graph, rather than only on the critical points. We illustrated the benefits of its use in experiments with a number of well-known datasets.
Our work has a larger implication for the use of topological methods in machine learning. The realization that one can back-propagate gradients through a persistence diagram has generated considerable interest in the community, with a number of recent works [4,5,6,10,11] exploring this idea. Our results suggest that it may be better not to treat persistence as a black box. Rather, it is a rich language that allows one to precisely express topological constraints and priors to add to a problem. The actual enforcement of these constraints can be accomplished via different methods, back-propagation through the persistence diagram being but one of them.
Building on prior work in computational topology, we describe only how to simplify extrema, i.e., 0-dimensional persistence diagrams. A key research direction is how to adapt these ideas to higher-dimensional persistent homology. It is undoubtedly useful to incorporate higher-dimensional topological constraints, such as loops or voids in the data, into optimization. Doing so efficiently may require imposing constraints not only on the points in the persistence diagrams, but also on the entire representative cycles implied by those points.

Figure 1: (a) Function on a graph, with gradients on critical points prescribed by the diagram loss. (b) Persistence diagram of this function. Points closer to the diagonal correspond to smaller fluctuations in the function, and we interpret them as topological noise. ε indicates the level of desired simplification that generates the gradients in (a) and (d). (c) Merge tree of the function, with branches highlighted in different colors. The branches translate into points of the matching color in the persistence diagram. (d) Gradients prescribed by the persistence-sensitive optimization (PSO loss). The gradients are present both on critical and regular points.

Figure 2: Optimization of the values. (a) Original function. (b) Vineyard of simplification with the PSO loss. (c) Function simplified with the PSO loss. (d) Vineyard of simplification with the diagram loss. (e) Function simplified with the diagram loss.

Figure 3: Effect of the two losses on a neural network: vineyards of the topological optimization with the PSO loss (left) and the diagram loss (right).

Figure 4: (a) Input data: 1,000 points sampled from each of three Gaussians, representing three distinct classes, with 20% of the labels randomly shuffled. (b) Training and validation loss during the training of a neural network, restricted to the later epochs, where the network overfits the data. Simplification is applied after every epoch following epoch 450, marked with a dashed line. (c) Vineyard of the confidence function during training; the start of the simplification is marked with a dashed line. The three persistent points, representing the three classes in the data, become prominent after the simplification.

Figure 5: (a) Training and validation loss curves for an experiment on the wine regression dataset. Performance is best at epoch 44, and simplification is applied only once, after epoch 43. (b) Vineyard over all epochs.

Table 1: RMSD results on regression datasets comparing no regularization, ℓ_2 regularization of the weights, and topological simplification, averaged over multiple trials. The best model for each dataset is in bold. As topological simplification always results in a performance improvement, the percentage of improvement (decrease in RMSD) from None to PSO is also shown (∆). The last four columns show the hyperparameters used for the best model during the experiments. We always set ε to the validation loss.

Table 2: Cross-entropy loss and accuracy results on classification datasets comparing no regularization, ℓ_2 regularization of the weights, and topological simplification, averaged over multiple trials. The best model is in bold. As topological simplification always performs at least as well as training without regularization, the percentage of improvement (decrease in the case of X-E loss and increase in the case of accuracy) from None to PSO is also shown (∆). The last four columns show the hyperparameters for the best model with the lowest X-E loss.