SHAPER: Can You Hear the Shape of a Jet?

The identification of interesting substructures within jets is an important tool for searching for new physics and probing the Standard Model at colliders. Many of these substructure tools have previously been shown to take the form of optimal transport problems, in particular the Energy Mover's Distance (EMD). In this work, we show that the EMD is in fact the natural structure for comparing collider events, which accounts for its recent success in understanding event and jet substructure. We then present a Shape Hunting Algorithm using Parameterized Energy Reconstruction (SHAPER), which is a general framework for defining and computing shape-based observables. SHAPER generalizes N-jettiness from point clusters to any extended, parametrizable shape. This is accomplished by efficiently minimizing the EMD between events and parameterized manifolds of energy flows representing idealized shapes, implemented using the dual-potential Sinkhorn approximation of the Wasserstein metric. We show how the geometric language of observables as manifolds can be used to define novel observables with built-in infrared-and-collinear safety. We demonstrate the efficacy of the SHAPER framework by performing empirical jet substructure studies using several examples of new shape-based observables.

1 Introduction Collisions at the Large Hadron Collider (LHC) produce events with hundreds of particles in the final state, which must be carefully analyzed to extract information about the underlying physics. In order to make sense of these high-dimensional data, increasingly sophisticated observables are required that are well understood at both the theoretical and experimental levels. Event shape [1][2][3][4] and jet shape [5,6] observables have played an important role in refining our understanding of the structure of high energy collisions, by relating hadronic final states to perturbatively accessible partonic degrees of freedom. Many shape observables, such as event thrust [1,7,8] and jet angularities [9,10], have been computed to next-to-next-tonext-to leading log (N 3 LL) accuracy [11] and next-to-next-to leading log accuracy N 2 LL [12] in e + e − collisions, respectively. Shape observables have been extensively measured and used to search for new physics signatures [13][14][15][16][17][18][19][20][21][22][23][24]. It was shown in Ref. [25] that many of these event shapes and jet shapes can be cast as optimal transport problems, using the Energy Mover's Distance (EMD). The EMD was introduced in Ref. [26] in order to provide a quantitative measure of the "distance" between two collider events, E and E . The EMD is based off the "earth mover's distance" from computer vision [27][28][29][30][31], which itself is a special case of the Wasserstein metric [32,33]. The EMD has since seen many uses in collider physics applications, such as in building a metrized latent space of events [26,[34][35][36] and in event/jet tagging and classification [37][38][39]. The EMD has also been used to define a novel shape observable, the event isotropy [40][41][42], which probes how "uniform" an event E looks by comparing it to the idealized isotropic event U.
In this paper, we seek to explain the effectiveness of the Wasserstein metric, by showing that it is the unique metric on collider events that both is continuous and respects the detector geometry faithfully. As shown in Ref. [26], continuity encodes the collider physics concept of infrared-and-collinear (IRC) safety. Geometric faithfulness is, to our knowledge, a new concept for the collider community, which allows statements to be made about spatial distributions of energy within events. After advocating for Wasserstein geometry, we then generalize the notion of event shapes and jet shapes, motivated by the EMD. Our framework -called Shaper -not only allows observables to be defined that probe any geometric substructure of events and jets in an IRC-safe way, analogous to the event isotropy probing the uniform structure of events, but also allows those observables to be numerically estimated.
In particular, we: 1. Motivate the Wasserstein Metric: In Sec. 2, we show that the Wasserstein metric is the natural structure for building shape-based observables for collider physics, justifying its success in Refs. [25,26,[34][35][36][37][38][39][40][41][42] and beyond. By adopting a measure-theoretic language for energy flows, we show that the EMD is not an ad-hoc structure to impose on the space of collider events, but rather the only structure that faithfully respects the detector geometry and continuity on the space of events. Further details of this argument are provided in Apps. A and B.
2. Use Optimal Transport to Define Shapes: In Sec. 3, we build off the work in Ref. [25], where it was shown that several well-known shape observables can be described in the form: where M is a parameterized manifold of energy flows that define the shape, R sets a length scale for the shape, and β is a distance weighting exponent. Importantly, both the observable O M and the optimal shape parameters θ M can be separately extracted from the EMD. We extend this construction to define many new shape observables, by greatly expanding the class of manifolds M considered, which can be constructed as explicit geometric shapes. We develop a prescription for defining new custom shape observables by parameterizing probability distributions, and even prescriptions for composing old shape observables together to form new ones.
3. Introduce SHAPER: In Sec. 4, we introduce Shaper, or Shape Hunting Algorithm using Parameterized Energy Reconstruction. This is a computational framework for defining shapes and evaluating Eqs. (1.1) and (1.2) on data. Shaper leverages the Sinkhorn approximation of the Wasserstein metric [43][44][45][46], which enables fast numerical calculation and even gradient estimation with respect to entire events, enabling easy and efficient optimization. See NEEMo [47] for an alternative gradient-based Wasserstein estimator.

Evaluate Empirical Examples:
To demonstrate the potential of Shaper, in Sec. 5 we define and evaluate several observables on a top and QCD jet benchmark dataset [48,49]. These shape observables can be used to extract dynamic jet radii and jet energies, and even non-radially-symmetric structures, such as jet eccentricity. In particular, we show empirical examples of our new observables for jet substructure analysis and automated pileup removal.
Generalized shape observables defined using Shaper can be used to probe interesting collider signatures. For example, Shaper can be used to build specialized jet algorithms with dynamic radii 1 and even dynamic pileup mitigation. This can be viewed as a generalization of k-means type clustering algorithms, such as XCone [54]: rather than finding k points that best approximate an event, shape observables can be used to find the k geometric structures that best describe the event. This means that it is possible to design specialized jet algorithms that select for e.g. elliptical or non-isotropic jets, or that even probe the soft and collinear structure of jets separately. This can prove especially useful, for example, in boosted top or heavy vector boson decays that produce "fat jets" with multi-pronged substructure, which may not be described well by circular or isotropic patterns. We comment on further phenomenological studies in Sec. 6.

The Unreasonable Effectiveness of Wasserstein in Collider Physics
In this section, we aim to answer the question, "If I had never heard of event or jet shapes before, how could I have come up with them myself?" Our discussion builds off the work of Refs. [25,26], wherein the EMD was introduced as a new language for event and jet shape observables. Here, we show that the EMD is the natural language for event and jet shapes. We do this by showing that the EMD is the unique metric between a geometric shape E and an event E that encodes IRC safety (through its topological features) and faithfully respects the geometry of the detector. This section is largely self-contained, and readers primarily interested in the construction of new shape observables can skip to Sec. 3.
We begin with a review of event shapes and jet shapes, noting that they all share a general form -they can all be written as a minimization of a universal loss function between the event and a parameterized set of "idealized" events, which can be interpreted as geometric shapes. We show that if the universal loss function is both IRC-safe and reduces to the ground metric on the detector geometry (that is, it faithfully lifts the ground metric, without distorting extended objects), then the universal loss function must indeed be the Wasserstein metric. More details of this construction can be found in Apps. A and B.

Event Shapes, Jet Shapes, and Geometric Shapes
We begin with a review of event shapes and jet shapes. Event shapes are observables that probe the geometric distribution of energy in events. Many different event shapes, such as thrust [1,7,8], spherocity [55], broadening [56], and N -jettiness [54,57], have been defined and extensively studied over the years in the context of e + e − collisions [3], with analogues studied in the context of pp collisions [58,59]. We may also include jet algorithms, such as XCone [54] and sequential recombination algorithms (k T [60,61], Cambridge-Aachen [62,63], and anti-k T [64]), in this list. Similarly, jet shapes probe the geometric distribution of energy within individual jets rather than the global event. Examples of commonly studied jet shapes include the integrated jet shape 2 [65,66], angularities [9,10], and N -subjettiness [68]. While there are significant theoretical complications when considering the difference between event and jet shapes, such as the introduction of non-global logarithms [69,70], for our purposes, we will treat event shapes and jet shapes interchangeably.
Following the definition in Ref. [67], an event shape (jet shape) is an IRC-safe weighted sum over the four-momenta of the particles in an event (jet). These observables probe the geometric distribution of energy in an event (jet), and typically depend on the detector ground metric, d(x, y), which defines distances between points on the detector. Expressions for common event and jet shapes can be found in Tables 1 and 2, respectively. We note that all the above listed event shapes can be written in the generic form: where F and φ θ are generic functions, and the observable may involve a minimization (or maximization) over auxiliary parameters θ living in some constrained manifold M. The choice of F , φ θ , and M define the event shape. Not every shape requires an optimizationfor instance, while recoil-free jet angularities require an optimization over possible jet axes, it is also possible to define angularities with respect to a fixed axis [25]. In this case, the minimization may be written over the trivial manifold isomorphic to M = {0}. It is also common to divide by the total energy scale E tot , or some other hard scale, as this reduces the sensitivity of the event (jet) shape on experimental jet energy scale uncertainties [59]. Unless otherwise stated, we will normalize our events such that E tot = 1 without loss of generality. We propose to write Eq. (2.1) in a universal form, such that the event shape is instead specified solely by the choice of M: where L is a universal loss function. All geometrical information about the event shape is then contained in the construction of M. To emphasize this, we will adopt the notation O M for these observables, to remind us that the observable is defined through the choice of M. The task is now to determine what universal L reproduces all event and jet shape observables -we will argue in Sec. 2.4 that L must be the Wasserstein metric. To begin, we may rewrite Eq. (2.2) in a more suggestive form. We note that for all of the event and jet shapes in Tables 1 and 2, there is always some optimal E * , not necessarily unique, such that O M (E * ) = 0. For example, the E * for thrust is a perfectly back-to-back event, the E * for N -subjettiness is an event with exactly N particles, and so on. Thus, it is convenient to 2 A note about nomenclature: The original "jet shape" refers to the observable Ψ(r/R), the radial jet energy fraction [65,66]. However, the word has been hijacked by Ref. [67] to refer to observables consisting of weighted sums of particle momenta. We later justify the name "shapes" by showing how this corresponds to fitting actual geometric shapes to event data.
XCone [54] Which N -particles?n i (E) = argminn [60][61][62][63][64] Clustering History?  Table 2: Common jet shapes studied in collider physics. Note that most of these observables take the general form of Eq. (2.1). The notation d iJ refers to the distance from particle i to the jet axis. Here, we do not necessarily normalize energies. rewrite Eq. (2.2), such that the minimization is over a space of events E θ , and that L = 0 is achieved when E = E θ , where θ parameterizes the space of all E * 's: Eq. (2.3) provides a nice geometric intuition for event and jet shapes. We can interpret O M (E) as the answer to "How close, in event space, is my event to looking like an optimal E * ?". Importantly, the E * 's do not have to be physically realized events -they can be any radiation pattern measured on the detector wall, even continuous ones. For example, we can take E * to be events with a radiation pattern that look like the interior of a hexagon -then the event shape O M (E) is a measure of how far E is, in "event space", from an idealized hexagonal event. By taking our idealized events E * to have radiation patterns resembling literal geometric shapes, living in the parameterized manifold M, the observable O M (E) can be used as a measure of how much E "looks like" the shape of interest.

Measure-ing the Energy Flow
In order to make progress in determining the universal loss function in Eq. (2.3), we must first understand the IRC-safe information available for us to use within the events E and E θ . This information is represented by the energy flow of the event. We first briefly review energy flows, before proposing a new definition of the energy flow as a measure theoretic quantity, which enables a useful language for discussing "idealized" events such as those discussed in Sec. 2.1. The energy flow E of an event is the distribution of energy within the event. At a very high level, in a collider experiment, one has a detector with geometry X infinitely far away from the collision site -for instance, in pp collisions such as those at the LHC, one uses a cylindrical detector X = [y min , y max ] × S 1 , where y ∈ [y min , y max ] is the rapidity and φ ∈ S 1 is the azimuthal angle. After a collision, particles hit the detector at a site with coordinate x i ∈ X , where the energy E i is recorded by a calorimeter. The energy flow E for an event with M particles of energies E i measured at locations x i is given by: The energy flow quantifies the total amount of energy measured at position x, which can be thought of as an idealized calorimeter cell. Assuming that the particles are massless, this is the complete accessible 3 kinematic information about the event, which therefore allows us to consider an event and its energy flow interchangeably. In the context of hadron colliders, the transverse momentum p T is often used in place of the energy E. In this paper, however, we focus on energies to save on notational complexity, as the story is relatively unchanged when switching to p T .
The energy flow operator is well-understood theoretically and in some cases, can even be computed analytically [71][72][73][74][75][76][77][78][79][80][81][82][83]. In terms of field-theoretic quantities, the energy flow is given as: where n i is the unit 3-vector corresponding to the detector coordinate x. We assume for this work that the spectrum of energies is non-negative -that is, for all x, E(x) ≥ 0. We now propose a natural generalization of the energy flow that captures its salient properties and is key to enabling our geometric analysis: Definition 1. The energy flow E of an event in a detector geometry X is a (positive) measure over subsets X ⊆ X , such that E(X ) = E tot , the total energy of the event. ϕ η Detector Geometry X Detector Region ℰ(X) Figure 1: An illustration of the energy flow E(X), which is the total amount of radiation captured inside a subset X (in purple) of the total detector geometry X . The red dots represent particles that hit the detector wall, with their size proportional to their energy.
In this new language, the energy flow E(X) is the total energy measured in any region X ⊆ X of the detector, rather than just probing a localized point x ∈ X . The region X can be an extended set and does not need to be connected. Fig. 1 illustrates an example of this on a cylindrical collider. This generalized notion of energy flow E(X) reduces to the usual energy flow E(x), which we now refer to as the energy flow density, and can be written as: (2.6) A particle measured at x i will contribute energy to E(X) only if x i ∈ X, which can be seen by carrying out the integration over the δ-functions in Eq. (2.4). 4 Unlike Eq. (2.4), however, we do not restrict energy flows to just a finite sum of localized δ-functions -they can be continuous, extended deposits of energy! In Sec. 3, we will see energy flows with continuous energy distributions are key to defining generalized shape observables. We will refer to energy flows whose densities can be represented by a finite sum of weighted δ-functions (as in, for example, Eq. (2.4)) as atomic measures or atomic flows. We will often write atomic measures as E ∼ i E i δ x i for notational simplicity. Under Def. 1, energy flows inherit a very rich and natural mathematical structure. The most important operation for our purposes is the integral of a function φ : X → R against an energy flow E, which we denote E, φ , defined as: This operation can be thought of as the energy-weighted expectation value of the random variable φ under the distribution E. A brief review of this, and other salient measure-theoretic concepts and definitions we call upon in this paper, is presented in App. A.

Geometrizing IRC Safety
Infrared and colinear (IRC) safety is an incredibly powerful constraint on the form of observables -it ensures not only that an observable is well-defined in perturbation theory, but also that the observable is robust to detector effects. Using the language developed in Sec. 2.2, IRC safety becomes a topological statement on the space of energy flows, which we may use to place constraints on the potential form of the universal loss function L of Eq. (2.3). An observable O is IRC safe if it satisfies: 5 • Infrared safety: For any event atomic E, adding or removing an -soft emission to E leaves O unchanged as → 0.
• Collinear safety: For any atomic event E, splitting any particle into two particles at the same location with the same total energy leaves O unchanged. Moreover, translating either particle by an -small displacement leaves O unchanged as → 0.
Essentially, IRC safety means that observables should not change significantly if we change E by slightly adjusting particle energies and positions. As with energy flows, we propose a generalization of IRC safety that captures all its salient features: An observable O is IRC safe if it is continuous with respect to the weak* topology on energy flows.
A function f on energy flows is continuous to the weak* topology if, for any sequence of energy flows E n that converges to E, the function F (E n ) converges to F (E) (see App. A for more details). Note that this is actually a slightly weaker constraint than the one considered in Ref. [25], which defines IRC safety through the metric topology induced by the EMDthe definition here does not require a metric on the space of events, or even a metric on the detector space, only a notion of continuity. In fact, there is a large class of metrics one can place on the space of events to metrize the weak* topology, not just the EMD.
An interesting consequence of this definition is that if an observable O is IRC safe, then O(E) for any energy flow E can be arbitrarily well-approximated by atomic energy flows. This implies, for example, that a continuous circle can be arbitrarily well approximated by a finite number of points arranged in a ring -this is makes possible to not only encode continuous distributions numerically, but also to make broad statements about the behavior of IRC safe observables by considering their action only on simple atomic energy flows.
In order to be IRC safe, our universal loss function L(E, E ) must be continuous in both of its arguments. This is very restrictive, and immediately implies that L cannot have any terms that are discontinuous in either energy or distance, e.g. terms like E −1 or d(x, y) −1 , or any term of the form E, φ for noncontinuous φ. Recalling the discussion in Sec. 2.1 that L quantifies how close in the space of energy flows E and E are, it is convenient (though not strictly necessary) to use L to metrize the weak* topology -that is, if an observable O is continuous with respect to the metric topology induced by L, then it is also continuous with respect to the weak* topology, and therefore is IRC safe. This allows the same universal loss function L to be used both to define shape observables and to define IRC safety. 6 This is convenient, since it captures the very intuitive notion that if two events geometrically look similar (that is, L is small), then IRC-safe observables evaluated on them should also be the same.

The Importance of Being Faithful
In order to encode geometric information about energy distributions, the universal loss function L of Eq. (2.3) must explicitly depend on the detector ground metric, d(x, y). While there are many metrics on the space of measures that encode geometric information while also being IRC-safe (as defined in Sec. 2.3), a natural choice is the family of Wasserstein metrics, which we denote L(E, E ) = EMD (β,R) (E, E ) (for Earth-Mover's or Energy-Movers Distance, which we will use synonymously). We show in this section that unlike other potential candidates for L, the Wasserstein metric will never warp distances between shapes -that is, the Wasserstein metric lifts the ground metric of the detector faithfully. A constructive proof of this can be found in App. B.
The EMD between two measures E and E is given by: π(X , Y ) ≤ E (Y ), π(X, X ) ≤ E(X), π(X , X ) = min(E tot , E tot ), (2.9) where d(x, y) is the ground metric between points x and y on X , and M(X × X ) is the space of all positive measures on X × X . The parameter R > 0 sets a distance scale for the EMD, and sets the relative scale for the two terms in Eq. (2.9). The parameter β ≥ 1 sets the distance norm. 7 Note that our definition of the EMD differs from Refs. [25,26] by a factor of β, which we do to match the conventions of the geomloss [46] package. The additional energy difference term, |∆E tot |, contributes whenever the two energy flows do not have the same total energy. The Wasserstein metric is special in that it faithfully lifts the ground metric, d. To lift the ground metric means that the Wasserstein metric reduces to d(x, y) β when evaluated on 6 This is not required however -there are many different choices of L that may be used to give the same definition of IRC safety, e.g. maximum mean discrepancies, which does not necessarily have to be the same function L whose minimum defines shapes as in Eq. (2.3) 7 In order to satisfy the triangle inequality and be a true metric, the first term in the EMD must be raised to the 1/β power, and either 2R should exceed the largest value of d(x, y) [25,26] or ∆Etot must be guaranteed to be zero. In this paper, we will not need the triangle inequality, so we will not do this.
two point measures E ∼ δ x and E ∼ δ y -that is, the Wasserstein metric preserves distances between points. Moreover, to do so faithfully means that the Wasserstein metric preserves distances for entire extended shapes: if E is any measure, and E is the same as E whose density is translated by a vector t (that is, E has corresponding density E(x − t)), then the metric between them is simply d(0, t) β . To see this explicitly, we can compare to two other potential candidates for our universal loss function L: the class of Maximum Mean Discrepancies (MMDs) [84] and Chamfer distances [85], which can respectively be written as: 8 (2.11) These candidate loss functions are IRC safe, translationally invariant, and even lift the ground metric, but importantly, they do not do so faithfully. To see this, we choose the following example energy flows: where a and t are arbitrary vectors (where E consists of 2 points separated by a vector a, and E is E translated by a vector t). A direct computation using Eqs. (2.9), (2.10), and (2.11) (with d(x, y) = |x − y|) yields: 14) While Eqs. (2.14) and (2.15) do indeed reduce to ∼ |t| β when a → 0 (that is, when E reduces to a single point), in general the MMD and Chamfer distance effectively "distort" the shape. 9 When β = 1, for instance, E appears slightly closer to E, as measured using either MMD or the Chamfer distance, than its total displacement |t| -energy in the interior of an extended distribution gets effectively "screened" by the energy in the rest of the distribution! Not only does this distortion ruin our ability to think of our observables as measuring the geometric distribution of energy in the detector, the screening effect also induces a practical issue, as it causes vanishing gradients when trying to optimize over L [46] (in this case, by minimizing t, for example). The Wasserstein metric does not suffer these problems, making it the natural choice for our universal loss function.

Hearing Shapes
Having constructed the Wasserstein metric and EMD in Sec. 2 for event and jet shapes, we next generalize Eqs. (1.1) and (1.2), which were originally introduced in Ref. [25] as a common form for many well-known observables. We treat Eqs. (1.1) and (1.2) as definitions for shape observables O M and shape parameters θ M , which together are the natural generalization of event and jet shapes. Moreover, we show how the manifold of energy flows M can be chosen to construct new observables that probe specific geometric structures. We provide a prescription for building M, which defines the shape observable, as well as prescriptions for composing shape observables together, allowing new shape observables to be defined from simpler ones in a geometrically intuitive way. 10 This section proceeds as follows. First, we define generalized shape observables and shape parameters, and discuss their properties. Next, we discuss our prescription for shape composition, which can be used to define shape observables and parameters that probe complex geometric structure. Finally, we use our prescription to construct a large (but importantly, inexhaustive) suite of novel shape observables for jet substructure analysis to serve as an example of what can be done with this framework. Several examples of emperical studies using these new shape observables, evalulated using the Shaper framework defined in Sec. 4, can be found in Sec. 5.

Shape Observables and Shape Parameters
We define a shape observable O M as follows: Definition 3. A shape observable O M , with associated shape parameters θ M , on an energy flow E is any function of the form: where M is a manifold of positive measures on the detector space X , and EMD (β,R) is the β-Wasserstein distance with length scale R.
Importantly, we return both the minimum EMD value, O(E), and the parameters of the shape that produced the minimum EMD, θ(E). For a manifold M of generic parameterized shapes, we refer the former as the "shapiness" of E, and the latter as the associated "shape parameters" of E. For example, if M is the manifold of N -(sub)jet events, then O(E) is the N -(sub)jettiness of E, and θ(E) are the (sub)jet parameters of E, highlighting that Eq. (3.1) is really a generalization of the N -jettiness. Intuitively, O(E) answers the question, "How much like a shape does my event look like?", while θ(E) answers the question, "Which shape does my event look most like?".
For a fixed choice of β and scale R defining the EMD, shape observables are completely specified by the choice of the manifold of energy flows M. This choice specifies the class of shapes being considered. Practically speaking, this manifold can be defined by choosing a set of coordinates θ on the manifold, that parameterize a set of constrained energy flows E θ . Since, as established in Sec. 2.2, energy flows are positive measures on the detector space, M can be built as a (weighted) parameterized probability distribution p θ , which is realized by a finite sampling procedure for weighted points on X . 11 In Ref. [25], the manifold corresponding to several event and jet shapes were listed. However, the power of our framework is that any manifold of parameterized energy flows defines a valid shape observable, and moreover, this observable directly probes intuitive geometric information. By defining, for example, M to be the manifold of energy flows resembling uniform rings of energy, uniform disks of energy, or even uniform ellipses of energy, the corresponding observables directly quantify the diskiness, circliness, and ellipsiness of events. To our knowledge, the event isotropy [40] is the first observable of this form with no known alternative formulation, as it quantifies the "uniforminess" of events. Examples of analyses using these custom observables can be found in Sec. 5.
Below, we list some useful properties of all shape observables: This captures the monotonic nature of observables such as N -jettiness, which satisfy shape E θ can be used to approximate additive Lipschitz observables on E, up to a known bounded error.

Upper Bounds
: If X is bounded by a maximum distance scale R max , and both E and all energy flows on M satisfy E tot = 1, then O(E) is bounded above by Rmax 2 additional (constrained) parameters to the shape, z 1 and z 2 , that ensure that all energy flows are still normalized to 1, so that O is balanced. Here, M can be realized by generating points according to the parameterized sampling procedures for M 1 and M 2 , and multiplying the energy weights of the generated particles by z 1 and z 2 , respectively. The values of z 1 and z 2 control the relative contribution of each base shape to the composition. In Case 2, the total energy remains unconstrained, and the relative importance of each base shape to the contribution is controlled by the ratio of the E 1tot and E 2tot parameters. Note that monotonicity implies that, for both Case 1 and Case 2, It is easy to extend these definitions to a composition of N observables, In the totally unbalanced case, M comprises energy flows of the form When the O i are all copies of the same shape observable O, we define the notation N × O, and name the resulting composite shape observable the "Nshapiness".
Given this language, we can understand several existing shape observables as composites of basic "shape" building blocks. For instance, we can define an observable, the (sub)pointiness τ 1 (E), corresponding to the manifold of weighted (normalized) δ-function measures on X . Geometrically, this observable to fitting a single weighted (normalized) point to an event E, the former with a floating energy weight. We can then define the N - with N -subjettiness corresponding to balanced composition, and N -jettiness corresponding to totally unbalanced composition.
Shape composition provide a novel avenue for understanding event and jet substructure. For example, the observable τ N ⊕ I, where I is the event isotropy, can be thought of as a "pileup-corrected" N -(sub)jettiness, where a uniform background is subtracted off. One can use this observable to probe the percentage of energy deposited in hard jets versus a uniform background due to pileup [86] and underlying event effects [87,88]. In particular, the shape parameter θ(E) = z 1 (E) is an estimate of the percentage of energy in E due to hard jets. In Sec. 5.7, we consider perform empirical studies of these types of shape observables.

Examples of Novel Shapes
In this subsection, we list (some) potentially phenomenologically interesting shape observables, all of which are defined for the first time in this work, that can be constructed using the prescription outlined above. For all these observables, we consider the detector geometry to be a rectangular patch of a cylinder, X = [y min , y max ] × [φ min , φ max ], though one could consider the full cylinder as well. These observables are summarized in Table 3, and we show empirical examples of these observables in action in Sec. 5. We consider balanced observables with β = 1, leaving unbalanced observables for future work. Sec.

Shape
Specification Illustration Table 3: Custom observables, defined using the Shaper prescription, designed to probe jet substructure at increasing levels of complexity. For each observable, the manifold parameterization is given, either explicitly, or as a composition of previously defined objects. Here, τ 1 is the one-pointiness (1-subjettiness), and I is the event isotropy. More details on these types of observables, plus explicit construction of sampling functions, can be found in Sec. 3.3.

N -Ringiness
We first consider a simple shape observable: ringiness, which probes how ring-like an event is. We begin by defining the manifold of all ring-like energy flows M R , which consist of energy flows corresponding to energy flow densities of the form: where the parameters x 0 and R 0 correspond to the center and radius of a ring, respectively.
To build a sampler, we use the so-called reparameterization trick [89]. Drawing N samples from the unit uniform distribution φ ∼ U (0, 1), the distribution of points: where each particle has weight 1 N , is a realization of E x 0 ,R 0 (x). The corresponding observable, O R (E), is the ringiness of the event E. While most QCD jets are not expected to be ring-like, this observable can identify clumps of radiation scattered around a central point, as may be the case in a 3-pronged top quark decay. Additionally, observables that probe the boundary of a jet with an empty interior may prove useful in studies of the dead-cone effect [90][91][92] where collinear radiation is relatively suppressed.
Having defined ringiness, we can next define N -ringiness, which probes how much an event looks like N rings, each of arbitrary center, radius, and weight. This observable is defined as N × O R . Following the prescription outlined in Sec. 3.2, we can build N weighted rings, by separately sampling Eq. (3.4) for each ring's center and radius, and multiplying the weights by z i ∈ ∆ N −1 .
For numerical methods, one must set initial values for these parameters in an IRC-safe way. While in principle, the choice of initialization should make no difference, in practice, the presence of numerical effects and local minima make the choice of initialization important. In our initialization scheme for N -rings, we perform k T clustering to find N subjets. The location of the subjets is then taken to be the ring center, and the subjet energy is taken to be the ring energy. We choose to initialize the radius of each ring to zero, so that the N -ringiness is guaranteed to only deviate from N -subjettiness only if it will make the event more ringlike, though it is also possible to initialize the radius to e.g. the distance of the (N + 1)-th jet in the clustering history.

N -Diskiness
Next, we define diskiness O D , which measures how much like a disk an event is. Similar to ringiness, we parameterize the manifold of energy flow densities: where x 0 and R 0 are the center and radius of the disk.
To build a sampler, as with ringiness, we draw N samples from the unit uniform distribution φ ∼ U (0, 1), and also N points r ∼ U (0, 1). Then, the distribution of points: where each particle has weight 1 N , is a realization of a uniform disk. The random variable r controls the radius of the point being sampled, and the square root is from a Jacobian factor to make the disk uniform.
Given the diskiness, we can easily compose the N -diskiness, N × O D . The N -diskiness is analogous to the N -jettiness/XCone jet algorithm, in that it returns the locations of conical clusters of particles. However, unlike XCone, the radius R 0 is a learned, rather than fixed parameter, allowing for dynamic jet radii. 16 To initialize the N -diskiness, the exact same procedure is used as described in Sec. 3.3.1 for N -ringiness. Note that there are many ways to modify the N -diskiness to produce similar observables -for example, one can replace the uniform disks with Gaussians to probe different radiation patterns.

N -Ellipsiness
Jets need not necessarily be circular! Indeed, many jet algorithms, such as the widely-used k T [64] and Cambridge-Aachen [62,63] algorithms, do not return circular jets. Motivated by this, we define a generalization of diskiness, the ellipsiness O E of a jet. The manifold of ellipses is given by energy flow densities of the form: x 0 is the center of the ellipse, a and b are the semi-major and semi-minor axes, 17 and ϕ is the tilt of the x-axis. Here, we have restored vector notation x to indicate that the x-and y-axes are treated differently. There are many equivalent alternate parameterizations of the ellipse, including in terms of its focal length c = max(a, b) 2 − min(a, b) 2 and eccentricity e = 1 − min(a,b) max(a,b) . Note that for a = b, the ellipse reduces to a disk, and the ϕ parameter becomes redundant.
The sampling procedure for disks can be recycled for ellipses, with some small modifications. Given N sampled points φ, r ∼ U (0, 1), the distribution: where U ϕ is the 2×2 rotation matrix corresponding to the angle ϕ and each particle has weight 1 N , is a realization of a uniform ellipse. We can then easily compose the N -ellipsiness, which 16 In place of fixing the radius R, as in the XCone algorithm, one instead assumes that the jet energies are uniform across the disk, so the number of assumptions is conserved. 17 non-respectively; a corresponds to the x-axis and b to the y-axis, and we make no distinction here which is the major versus minor axis.
can serve as a jet algorithm that finds non-circular jets. In particular, this shape observable allows for the eccentriciy e of the clustered jets to be extracted, allowing one to quantify how far from circular each jet is. As with the N -ringiness and N -diskiness, the centers of the ellipses are chosen using the k T clustering algorithm. Both a and b are initialized to be zero, so that deviations from either N -subjettiness or N -diskiness occur if it makes the event more elliptical.

... Plus Pointiness
Energy is not uniformly distributed within a jet! Indeed, to leading order in perturbative QCD, much of a jet's radiation will be soft and/or collinear with respect to the emitting parton. We can probe this by composing together shapes that explicitly target soft and collinear radiation separately. To this end, we construct a set of new observables, the (shape+point)iness, for shape ∈ {O R , O D , O E }. This is defined using the shape composition prescription described in Sec. 3.2, as: where τ 1 is the 1-pointiness (equivalently, the 1-subjettiness), and O i is any of Importantly, we fix the location of the δ-function in τ 1 to be x 0 , though one may consider letting the location of the δ-function float to define a recoil-free variant. 18 We can then extend this definition to compose the N -(shape+point)iness. When used as a jet algorithm, the N -(shape+point)iness provides a more physical picture of perturbative QCD than do the previously defined shapes. The base shapes, particularly disks and ellipses, capture wide-angle soft radiation, while the δ-functions capture both hard and soft collinear radiation at the center of the (sub)jet. Moreover, within each shape-point pair, the floating parameters z 1 and z 2 tell us the fraction of radiation in the wide-angle and collinear sectors, which in principle can be calculated in and compared to perturbative QCD.
When initializing observables of this type, the initialization occurs as described in previous sections using the k T algorithm with all radii set to zero. However, we choose to split the k T energy equally between the shape and the δ-function. Note that this is an IRC-safe choice, since at zero radius, the shape is indistinguishable from the δ-function.

... Plus Pileup
In hadron-hadron collisions, there are many sources of contamination in jets, including underlying event contributions from proton remnants [87,88], and pileup due to simultaneous hadron collisions [86]. We will collectively refer to these sources of contamination as pileup for simplicity. Pileup contamination biases and smears the "true" value of observables reconstructed from final state particles, driving the need for mitigation techniques. 18 In the elliptical case, one may consider attaching δ-functions to one or both of the focii instead of the center. We leave the study of variants of these observables to future work.
Pileup is approximately uniformly distributed in the rapidity-azimuth plane. This is exactly the shape probed by the event isotropy, I. Thus, in order to protect shape observables against pileup contamination, we can compose them with the event isotropy, which will soak up radiation uniform in the plane. This defines the shapines+pileup observable: where O i is any shape observable, including those previously defined. As a departure from Ref. [40], we realize the uniform event by randomly sampling in the plane, rather than defining a grid. We also primarily focus on the β = 1 event isotropy. Unlike mitigation techniques such as area subtraction [93][94][95] or jet grooming [96,97], where an implicit assumption is made about the pileup energy density (either explicitly as an input ρ, or implicitly through a soft scale z cut ), the shape observable O I i makes no explicit energy scale assumptions. 19 The uniform energy weight, z 2 , is optimized over, and so the observable "learns" its own pileup scale, which can then be extracted. We choose to initialize the pileup scale z 2 to zero, though one could choose any value of z 2 if they had a prior on the amount of pileup in events.

... And More!
This has not been an exhaustive list -one can use any manifold M of energy flows one can think of, with the only two limits being imagination and the ability to write down a sampling procedure. Other examples of shapes include polygons, hardcoded jet topologies (for example, two-pronged jets restricted to between ∆R = R 1 and R 2 apart), Gaussian clusters, graph-based shapes, and so on. These observables can also be combined into more complex ones using shape composition. All of these can be constructed within the Shaper framework (more details in Sec. 4), and we encourage the community to use this prescription to develop their own observables.

The Shaper Framework
Calculating the Wasserstein metric in Eq. (2.9) is notoriously difficult; if both events have n particles, then the runtime needed by a brute force, generic Wasserstein solver can be as high as O(n 3 log n) [98]. Generic solvers also make it difficult to extract the gradients of the metric with respect to one of the events, ∇ E EMD(E, E ), which are necessary for performing gradient descent over the space of events in M. Fortunately, by using the (de-biased) Sinkhorn divergence, which uses an -regularization to approximate the Wasserstein metric, the total costs can be lowered all the way down to O(n 2 log n) [46,[99][100][101][102][103].
In this section, we introduce the Shape Hunting Algorithm for Parameterized Energy Reconstruction -or Shaper -to define and calculate shape observables. Shaper is a Pytorch-enabled [104] and parallelized computational framework for defining and composing shape observables and their corresponding energy flow manifolds, built using the geomloss [46] package. We start by outlining the Shaper algorithm. Then, we provide details on the Sinkhorn divergence, before ending this section with implementation details. For the rest of this paper, we restrict ourselves to balanced observables, i.e. E θ,tot = E tot = 1, leaving the unbalanced case for future work.

The Shaper Algorithm
We now describe how to perform the minimization (arg)min E θ ∈M [EMD(E, E θ )] using Shaper. 20 The Shaper algorithm for estimating shape observables on an event is as follows: 1. Define: Following the prescription of Sec. 3.1, define a manifold M and coordinates θ parametrizing the manifold. Define the ground metric d(x, y), the exponent β, and the radius R. This fully defines the observable O. Build a sampling function p θ that uses the parameters θ to transform some base distribution into a realization of the energy flows E θ ∈ M. Finally, choose an approximation parameter 1 and an annealing parameter ∆ ∈ (0, 1).
where α is a learning rate hyper-parameter. The first term is the dependence of the EMD on particle energies due to θ, and the second is the dependence due on particle positions due to θ, both of which are implicit through the sampling function p θ . This step can be replaced with any other gradient descent optimizer.

Constrain:
If the manifold M is nontrivial, impose any necessary constraints on the coordinates θ, such as wrapping angles between −π and π, enforcing positivity, or a simplex projection.
6. Converge: Repeat Steps 3-5 until convergence. Return the final value of the EMD and the final θ parameters.
The Shaper framework contains modules to aid or automate each of these steps, which we describe further in Sec. 4.4.

The Dual Formulation of Wasserstein
Observe that the EMD in Eq. (2.9) falls into a generic class of problems called linear programs. A linear program involves minimizing a function L(x) = c, x over vectors x, where c is some cost function linear in x. Furthermore, x satisfies some linear constraint of the form b = Ax, and we additionally require x ≥ 0. In our case, x is the (flattened) transfer matrix π, c is the (flattened) distance matrix d β , b = (E, E ) are the energy flows, and A is a matrix enforcing the simplex constraints on π.
The theory of linear programs is well-studied [105]. In particular, for every primal linear program, there exists a dual linear program, where the constraints and variables to be optimized switch roles, similar to the method of Lagrange multipliers. In the dual problem, one instead maximizes the function L(y) = b, y , subject to A T y ≤ c. 21 For the Wasserstein metric, the dual formulation looks like: where f and g are known as the dual potentials or Kantorovich potentials. This formulation of the EMD is known as the Kantorovich-Rubinstein metric [106].
In this form, the EMD has several nice properties. First, the arguments E and E are explicit, rather than implicit in the form of constraints. This makes taking the gradient of the EMD with respect to either energy flow much easier. This property is incredibly useful for performing optimizations over energy flows, since it enables easy differentiation. Second, the optimization over an M N -dimensional object, π ij , is replaced by an optimization over the (M + N )-dimensional object, f i and g j , making the simplex constraint structure more apparent. It can be shown that the optimal choice of f and g actually saturates the bound in Eq. (4.2) [107]. Recalling that the ground metric satisfies d(x, x) = 0 for all x, we can see that the optimal f, g pair satisfies f (x) = −g(x). This allows us to rewrite the constraint as: That is, f is β-Hölder continuous. Note that for β = 1, this reduces to Lipschitz continuity on f .

Reviewing the Sinkhorn Divergence
The source of the difficulty in evaluating Eq. (4.2) is the highly nonconvex optimization. To alleviate this, we introduce a regulator [44,45] to the dual Wasserstein metric: where is a regulation parameter. The quantity OT (β,R) (E, E ) is known as the Sinkhorn divergence [43] between measures E and E . It reduces to the EMD as → 0. 22 Notably, for any > 0, the optimization over f and g is fully convex [107], making the minimum significantly easier to evaluate. Note that there are no constraints on the functions f and g anymore. Instead, the maximum will only be achieved when f (x) + g(y) is within order β of 1 βR β d(x, y) β , a softer version of the original simplex constraint. We can view the parameter as "blurring" the distance metric d(x, y), where is a distance scale measured in units of R. 23 As an unconstrained, convex minimization problem, we can estimate the Sinkhorn divergence using simple gradient descent by taking derivatives of Eq. (4.4) with respect to f and g. Given two atomic measures, E and E , with M and N particles respectively, and an approximation parameter , we estimate the Kantorovich potentials f (x i ) and g(y j ) that give us the Sinkhorn divergence using the following algorithm: 1. Initialize: Initialize f (x i ) = 0 and g(y j ) = 0.
2. Gradient Update: Update f and g simultaneously as follows: . This algorithm is known to converge in finite time [43,108,109]. The runtime of each iteration scales as approximately O((M +N ) 2 ), and the algorithm converges in approximately 22 In the → ∞ limit, we recover instead the maximum mean discrepancy (MMD) [84], another potential metric on collider events, which we have shown in Sec. 2.4 is not faithful. 23 Even though the R parameter is unimportant for calculating the exact Wasserstein metric for balanced observables, beyond defining a unit scale, its importance re-emerges when defining the blurring scale .
1 β iterations. This can be further improved to only log 1 β / log 1 ∆ iterations through the use of simulated annealing [110,111], with a parameter ∆ ∈ (0, 1). Beginning with a larger effective blurring radius = 2R, after every iteration of the Sinkhorn algorithm, we decrease ← − ∆ , until finally reaching = . Intuitively, we start with a large blurring scale R, and slowly "zoom in" to a distance scale of to refine the estimate of the Sinkhorn divergence.
However, the Sinkhorn divergence is biased, meaning that it is not generically the case that OT (E , E ) = 0, which is an important property of the Wasserstein metric. We use therefore the de-biased Sinkhorn divergence, defined in Ref. [46]: The de-biased Sinkhorn divergence satisfies S (E, E) = 0 by construction. This can be easily realized algorithmically by simply substituting new de-biased Kantorovich potentials f → F and g → G, where Here, the notationf (x) refers to the first Kantorovich potential corresponding to OT (E, E), andg(y) refers to the second Kantorovich potential corresponding to OT (E , E ). For the rest of this paper, we refer to the de-biased Sinkhorn divergence as simply the Sinkhorn divergence wherever there is no chance for confusion. In addition to returning the Sinkhorn divergence, this algorithm also returns the Kantorovich potentials, which allow us access to approximate gradients of the EMD, which can be used for shape parameter optimization. The gradients of the EMD with respect to the input measures can be read off of Eq. (4.4): (4.10)

Implementation Details
Before turning to our case study, we discuss the specific details of the Shaper implementation.
Shaper uses the geomloss package as a backend for computing Sinkhorn divergences, as described in Sec. 4.3. By default, we use a relatively conservative annealing value of ∆ = 0.9, and choose β = 10 −3 for our estimates. 24 Currently, the Shaper algorithm is implemented only for balanced observables, with either β = 1 or 2. Once the EMD and Kantorovich potentials are estimated using the geomloss backend, the gradient updates in Step 4 are then handled using automatic differentiation and backpropagation in pytorch. One may select from a suite of common machine learning optimizers to perform the gradient update -by default, we use the Adam optimizer [112], with a learning rate of 0.01. Steps 3-5 can be easily parallelized over batches of hundreds of events at once. This is accomplished by treating the parameters θ for each E n to be completely independent. The Sinkhorn divergence can be computed on many events at once, and since the parameters are independent, we take advantage of highly parallelized pytorch operations to perform independent derivatives ∇ θn of the combined batch loss, n EMD(E n , E θn ), all at once.
When using Shaper, one must specify a maximum number of epochs, so that the program eventually halts even if convergence is not achieved. We set this number by default to be 500 epochs, though we observe that convergence happens far earlier than this. We define convergence through an early stopping procedure: if an event's EMD has not decreased in at least N max epochs, stop early, and return the minimum EMD ever achieved during the training, and the parameters that achieved that minimum. When training on a large batch of events, we stop early when a certain fixed percentage (we choose 95% by default) have hit this condition -this is because there tends to be a handful of outlier events that take exceptionally long to converge. We choose N max = 25 epochs by default. Both this, and the batch stopping percentage, are adjustable user parameters.
To facilitate Steps 1 and 2 of the Shaper algorithm, where the user input occurs, many common manifolds, such as sets of N points, hypercubes, simplices, and so on, are already prebuilt. These pre-built "building-block" manifolds are listed in Table 4. 25 From these, more complex manifolds can be easily constructed, such as the rings, disks, and ellipses described in Sec. 3.3. The ⊕ and N × operators make it very easy to define new composite observables from old ones and quickly build sophisticated shapes. Parameters can also be easily frozen for more customization. Furthermore, it is straightforward to define more building-block manifold as needed within the framework. Each of the observables described in Sec. 3.3 is also pre-built into Shaper. When defining custom manifolds that use the N -points as a building block, the custom shape will automatically use the same IRC-safe k T -clustering initialization scheme as the observables in Sec. 3.3, though it is possible to modify the initialization scheme as needed.
Each manifold contains instructions on how to enforce parameter constraints (as needed for Step 5). For most manifolds, this usually involves clipping the values to be within a desired range, though for the simplex ∆ N , which occurs in almost every shape for which energy weights must be normalized, the enforcement is nontrivial. We use a simplex projection algorithm, inspired by the K-Deep-Simplex framework [113] and its nonlinear extension [114],

Manifold Description
Constraints Hypercube, Λ N Set of N numbers, λ i , between 0 and 1 Clipped to 0 ≤ λ i ≤ 1 Circle / Torus, S N Set of N angles, φ i Wrapped to −π ≤ φ i < π N -Simplex, ∆ N Set of N + 1 numbers, z i ≥ 0, summed to 1 Simplex projection Table 4: The building-block manifolds implemented in Shaper, from which parameters can be defined and more complex manifolds can be built. For each manifold, a description is given, and constraint instructions are provided to project points onto the manifold. Note that we do not distinguish here between "positive" reals versus "non-negative" reals, since for continuous parameterizations there is no difference between including zero and getting arbitrarily close to zero.
which solves the linear program [115]: which finds the simplex z i ∈ ∆ N closest to a set of unnormalized points y i .

Empirical Studies with Jets
We now use the Shaper framework and custom shape observable in example collider physics analyses. We begin by benchmarking the Shaper algorithm, testing the performance of the Sinkhorn divergence and the optimization procedure by comparing the jet isotropy and Nsubjettiness calculated using Shaper to other methods. Next, we calculate the ring, disk, and ellipse-based observables defined in Sec. 3.3 for a dataset of top and QCD jets, showing visualizations of each shape and analyzing the learned EMD's and parameters. Finally, we explore the potential to use shape observables for automatic pileup removal.

Dataset
For our empirical studies, we use the top tagging benchmark of Refs. [48,49], which is a dataset consisting of a top quark jet signal and a mixed light-quark/gluon jet background. These samples are generated in Pythia 8.2.15 [116] at 14 TeV, and then passed through background samples are generated using tt and QCD dijet events respectively. For signal top jets, a top parton, plus its decay products, are required to be within ∆R = 0.8 of the jet axis. All events are translated such that the jet axis is at (0, 0) on the rapidity-azimuth plane.
In this dataset, multiple parton interactions and pileup have not been included. To mock up the effects of pileup contamination in data, we add in pileup "by hand". To each event, we add in N particles randomly distributed in an R × R = 0.8 × 0.8 square centered at the origin on the rapidity-azimuth plane, where N is Poisson-distributed with a mean of 75. Each particle is given an energy weight randomly sampled from a normal distribution with mean E PU N and standard deviation E fluct N , where we take E PU , which represents the total amount of pileup radiation, to be uniformly distributed between 50 and 250 GeV, and E fluct , which represents per-particle fluctuations, to be 25 GeV. 26 Many refinements of this simplistic mockup could be considered, but this suffices to show qualitative features of Shaper and the shape observables defined in Sec. 3.3. For the purposes of calculating shape observables, all jets are normalized such that E tot = 1, though we save the original total energy of each jet for the purpose of restoring units.
An example top jet is shown in Fig. 2, before and after pileup is added, to illustrate this procedure. This contamination procedure is performed for the benchmarking studies in Secs. 5.2 and 5.3 and the pileup studies in Sec. 5.7. For the jet substructure studies in Secs. 5.5 and 5.6, we do not add any pileup, and instead we require that the jets have an invariant mass m J ∈ [145, 205] GeV to more closely match the analysis conditions of Ref. [68].

Benchmarking Sinkhorn: Jet Isotropy
We first use Shaper to compute the jet isotropy for the purposes of benchmarking the Sinkhorn divergence for runtime and accuracy. Jet isotropy is an ideal benchmark since the minimization is trivial; for balanced isotropy, the parameterized manifold consists only of a single event. Therefore, no gradient descent is necessary, and this is purely a test of the Sinkhorn approximation. This can be viewed as a proxy for the per-epoch runtime and accuracy of the Shaper algorithm.
We compute the n × n jet isotropy, which is the isotropy given by computing the EMD to the uniform event: U n×n ∼ particles in an R × R square, arranged in a uniform n × n grid, (5.1) as defined in Ref. [40]. We do this using Shaper with many different values of , and compare to the same calculation done using the Python Optimal Transport (POT) [119] implementation of the EMD, which was used in Refs. [40,41]. In Fig. 3, we show the results of a runtime vs. accuracy study. We compute the jet isotropy of 1000 top jets for several different values of n, , and for β = 1 and 2. In Fig. 4, we show the learned 16 × 16 jet isotropies for β = 10 −3 . For this experiment, we use a fixed (conservative) annealing parameter of ∆ = 0.9. Shaper allows for events to be computed in parallelized batches; we run the entire computation in a single batch on a NVIDIA A100, and report the total runtime of the entire batch. 27 We see from Fig. 3 that the accuracy of the Sinkhorn divergence, as an estimator for the Wasserstein metric, begins to saturate at β = 10 −3 , and that there is no substantial gain from choosing a smaller . Picking β = 10 −3 ensures percent level accuracy for β = 1, and few-percent level accuracy for β = 2, which we can see visually in Fig. 4. Furthermore, for larger values of n × n, the accuracy is mostly independent of n. We also observe from Figs. 3 and 4 that Sinkhorn tends to slightly underestimate the Wasserstein metric, which can be understood from the strictly negative β -regulator in Eq. (4.4). Note that for n = 16, it takes under 1 second to process 1000 events -this implies it is possible to process millions of events on the order of an hour, with further speedups possible by choosing a more aggressive value for the annealing parameter ∆.

Benchmarking Optimization: N -Subjettiness
We perform a second benchmark investigation using N -subjettiness. Unlike the jet isotropy, N -subjettiness requires a nontrivial minimization. This allows us to use it to estimate the fidelity of the Shaper algorithm's optimization step.  for different values of , n, and β. The fidelity is defined as the ratio of the Sinkhorn divergence to the "true" Wasserstein metric, as computed using the POT library, across a batch of 1000 events. The runtime is the total time to evaluate the Sinkhorn divergences for the entire 1000 event batch, as computed using a NVIDIA A100. An annealing parameter of ∆ = 0.9 is used globally.
It is well known that the ratio τ 32 ≡ τ 3 /τ 2 is a good discriminant between top and QCD jets, as top jets tend to have 3 prongs more often than QCD jets, and thus have lower expected values of τ 32 [68]. We compute this ratio for several different values of to see if any discrimination power is lost (or gained) in the -approximation. Within the Shaper framework, the N -subjettiness is given by the following manifold of parameterized events: As a baseline, we compute N -subjettiness using FastJet 3.4.0 with FJcontrib 1.050. The results of this study are shown in Fig. 5a as ROC curves, computed using an NVIDIA A100 GPU. We see that the Sinkhorn approximations have roughly the same discriminatory power as the baseline for ∼ 10 −3 . In Fig. 6, we show the distributions of τ 32 for both datasets computed with FastJet and Shaper with = 10 −3 , and see good agreement between the two methods. As with the isotropy study in Sec. 5.2, we observe that Shaper tends to slightly underestimate the observable. In order to gauge the impact of float precision on our estimates, we repeat ROC curve calculation using only a CPU, which is shown in Fig. 5b. For values of 10 −3 , we see that on the GPU architecture, the performance actually begins to degrade due to the lower machine precision due to the accumulation of floating-point errors during the optimization, while it saturates on the CPU. Therefore, it is recommended to use β ∼ 10 −3 , as this is the most stable compromise between fidelity and machine precision.

Hearing Gradients
Shaper can be used to not only estimate the shapiness of events, but also take derivatives of the shapiness with respect to the event. As discussed in Sec. 4.3, this is completely automatic, since the gradients with respect to the energy flow are given manifestly by the Kantorovich potentials, allowing us to see precisely how our EMD calculations depend on the energies and positions of particles in an event.
Reading off of Eq. (4.10), we obtain an expression for the gradient of the EMD with respect to the energy E i of particle i: If the gradient at particle i is negative, then increasing the energy of that particle will decrease the EMD, making the event more M-like. By adding "ghost" particles to E, one can probe the energy dependence of the EMD from any point in E. Similarly, we can take the gradient the EMD with respect to the position x i of particle i:   Figure 7: For an example top jet, (a) gradients of the 3-subjettiness with respect to the particle weights, ∇ E i EMD, and (b) gradients off the 3-subjettiness with respect to the particle positions, ∇ x i EMD, are shown. In (a), increasing the particle energy in red regions will increase the 3-subjettiness (look less like 3 subjets), and increasing particle energy in blue regions will decrease the 3-subjettiness (look more like 3 subjets). In (b), moving particles along or against the arrows will increase or decrease the 3-subjettiness, respectively, with the arrow's shading indicating the relative magnitude of the change.
This gradient (times −1) tells us where to move the particle i to decrease the EMD. Both the energy and position gradients answer the question, "If I want to decrease the EMD (make my energy look more like my shape), what should I do to a particle at site x i ?" Moreover, because of the reparameterization invariance of the energy flow density, the energy and position gradients are both are valid ways to change the EMD: one can either change the energy of particles at the location x i , move the particles at x i somewhere else, or some combination of both.
In Figs. 7 and 8, the gradients from Eqs. (5.3) and (5.4) are plotted for an example top jet, for the 3-subjettiness and 16 × 16,β = 1 jet isotropy, respectively. Using Fig. 7, we can see what parts of the event contribute to the 3-subjettiness -the three large clusters contribute negatively to the EMD (make the event look more like 3 subjets), while the rest of the event contribute positively to the EMD (makes the event deviate from 3 subjets). Similarly, we see from Fig. 8 that the overdensity of energy at the center of the event makes it less isotropic. In both figures, the vector quiver plot tells us which way particles should "flow" (against) to change the shape.
There are many potential applications of calculating gradients of a shape with respect to an energy flow. In experimental contexts, for example, one can use the fact that the gra- dients are practically instantaneous to compute to do easy Gaussian error propagation due to detector-induced uncertainties in particle energies and positions [120,121]. This avoids having to do expensive re-sampling and recalculation of the event shape. Phenomenologically, these gradients can be used to probe the sensitivity of observables to certain radiation patterns. For example, the sensitivity of an observable to pileup can be measured by taking derivatives with respect to the pileup scale [95], which can be numerically realized in Shaper, a potential avenue for future work.

Hearing Jets Ring (and Disk, and Ellipse)
We next use Shaper to realize the custom shape observables defined in Sec. 3.3, starting with some visualizations of these shape observables on an example event.
The three base observables we consider are the ringiness, diskiness, and ellipsiness, as defined in Sec. 3.3. We calculate the shape observables N ×O (the N -ringiness, the N -diskiness, and N -ellipsiness) and N × O τ (the N -(ring+point)iness, the N -(disk+point)iness, and N -(ellipse+point)iness), as defined in Sec. 3.3 with β = 1. We also consider, for comparison, the N -subjettiness, τ N , as defined in Eq. (5.2). We use Shaper to evaluate all of these shape observables on a single of top jet event from our dataset, restricted to m J ∈ [145, 205] GeV, though without any pileup contamination. Each extended shape is sampled with 100 points, with = 10 −3 and ∆ = 0.9.
Geometric visualizations of each of the 21 event shapes, as evaluated on an example top jet, can be found in Figs. 9, 10, 11, and 12. From these visualizations, we note some interesting qualitative features of these shape observables. First, the point variants of each shape correspond more closely to clusters of energy. For example, while the N = 3 uniform rings  ( Fig. 10), disks (Fig. 11), or ellipses ( Fig. 12) do not necessarily capture the regions of highest energy, the point variants of each shape align very well with the N -subjettinesses of Fig. 9. Correspondingly, the EMD's of the shapes are significantly reduced for the point variantsthis suggests that this event in particular is not well modeled by uniform radiation profiles, but rather looks more like localized spikes with radiation clouds around them. Moreover, we can see that circular radiation clouds do not model the event as well as elliptical ones -this is reflected in the N -ellipsiness in Fig. 12, which learns extremely eccentric line-like structures in an attempt to best model the event, which results in lower EMD's than the corresponding N -diskinesses in Fig. 11. We also note that the N -subjettiness and the point shape variants all qualitatively find the same jet centers, suggesting that these shapes can be treated as perturbations to N -subjettiness.

Shapiness and Shape Parameters
Continuing the discussion in Sec. 5.5, we now use Shaper to compute distributions of these shape observables on a large sample of top and QCD jets, restricted to m J ∈ [145, 205] GeV, though without any pileup contamination. In particular, we show the utility of both the shapiness O(E) and the shape parameters θ(E) in describing the geometry of jets.
For each histogram in this section, we calculate an AUC score, showing the efficacy of a cut on that observable as a top/QCD discriminant. Note that these AUCs are for cuts on a single feature -the discrimination power can in principle be improved by transforming these features and combining many features per jet.
As a representative sample of the "shapiness" observable, we plot the N -ellipsiness and N -(ellipse+point)iness of our top and QCD jet samples in Fig. 13. The EMD distributions for the ring and disk variants are qualitatively similar to the ellipse, and thus our discussion of these distributions carry over to them. We notice that the N -(ellipse+point)iness is not much   lower than the corresponding N -ellipsiness, indicating that (at least in the absence of pileup), subjets can indeed be approximated as roughly uniform. In the test event visualization in Fig. 12, we can see qualitatively that the found ellipses have roughly the same center, comparing the N -ellipsiness and its corresponding point variant. We also note that the EMD decreases with N , as expected.
In Fig. 14, we show the ratio of the N = 3 shapiness to the N = 2 shapiness for each class of shape observable. For N -subjettiness, which we show for comparison, this is the classic observable τ 32 , which is known to be a good top vs. QCD jet discriminant [68]. First, we observe that the uniform ring, disk, and ellipse observable, along with the point variants, each have an AUC of approximately 0.75, which is still considerably less than τ 32 's AUC of 0.825. One should expect that, using just the EMD alone, a more complexly parameterized shape should have a lower AUC than a simpler shape. In the extreme case, where the parameterization is flexible enough to reproduce any event in the dataset, the EMD will always   be zero and have no discriminatory power. However, this is not the end of the story -shape observables also include their learned parameters, and this information also contains multivariate discriminatory power. For the hypothetical infinitely flexible shape, the parameters contain the full event information even though the EMD is zero, and thus the combination of the shapiness and shape parameters together contain more information (and thus more discriminatory power) than just a simpler shape. We leave a full multivariate analysis of shape parameters for jet classification for potential future work.
We now move on to analyze the learned shape parameters themselves. In Fig. 15, the learned radius parameters of the N = 1 shapes are shown. 28 Considering the 1-(disk+point)iness and 1-(ellipse+point)iness, which as discussed earlier, we can think of as a "jet algorithm" looking for soft clusters of radius R with collinear radiation inside, we see that both learn 28 For ellipses, we define an effective radius, taken to be √ ab. This is the radius a circle with the same area as the ellipse.   an average radius of approximately R = 0.4, which is approximately half the radius of the original AK8 jet. One can think of the uniformity condition as imposing an radius effective cutoff. If one assumes that the energy density falls to zero with distance from the jet center, then there is some critical radius for which any larger uniform shape is an increasingly worse approximation, and in principle this critical radius can be computed in perturbative QCD. For the point shape variants of the disk and ellipse, we note that there is only a small difference between the top and QCD jets -that is to say, the soft component of a top jet is about as "wide" as the soft component of a QCD jet, as seen by disks and ellipses. However, as top jets tend to have their energy distributed across 3 prongs, rather than 1 prong as in QCD jets, we should expect clusters of radiation away from the central one for top jets. Since rings are thin, we should expect rings to be able to better capture these localized prongs than area-filling disks and ellipses, which can be visualized in Fig. 10. This is especially apparent in Fig. 15d for the 1-(ring+point)iness shape, where top jets are significantly larger than QCD jets. Interestingly, we note a spike at R = 0 in all of these distributions. This means that Shaper has determined that the best shape for these events is in fact point-like, and reduces to the N -subjettiness. This spike is reduced for the ellipse shapes, which implies that there are events better modeled by either an eccentric ellipse or a point rather than a uniform disk. As a last example of the information that shape observables and shape parameters can extract, we look at the learned eccentricity of ellipses. In Fig. 16, we show the minimum eccentricity across the N ellipses for the 1-, 2-, and 3-ellipsiness and (ellipse+point)iness for our top and QCD jet datasets, where the eccentricity is given by 1 − min(a, b)/ max(a, b). We can immediately see that a value of min(e) = 0 is rare -our jets are much better described by ellipses than they are described by disks. With increasing N , the distributions of min(e) tend to shift leftwards for both the top and QCD jet samples. This indicates that the N -'th ellipse, which probes more substructure than the (N − 1)-'th ellipse, is often less eccentric. The eccentricity distributions for top jets are slightly different from QCD jets. This is a nontrivial result; top jets and QCD jets have very different prong structures (as demonstrated in Fig. 14), so the angular distribution of radiation in the rapidity-azimuth plane is less circular for top jets than QCD jets. Thus, jet eccentricity could be used as a discriminant.
The above plots show just a few examples of the information that can be extracted using generalized shape observables. It is important to emphasize there is much more information beyond what we have shown here available for analysis, such as the learned weights of the points versus uniform shapes, the radii of shapes beyond the leading jet, and so on. This is to say nothing of the multivariate information contained in the correlations between the EMDs and shape parameters, both within the same shape and between different shapes. We leave a full quantitative study of the information contained within shapes for potential future work.

Pileup-Mitigating Shapes
Finally, we show a use case for shape observable composition by applying our custom shape observables to the task of pileup mitigation. We consider the top jet mass spectrum as an example, which is sharply peaked near m 2 top ≈ (175 GeV) 2 . In our modified dataset, jets are contaminated with uniform pileup radiation, which significantly biases and smears the top mass peak. There exist many techniques to "groom" away extra contamination, such as area subtraction and soft drop [93][94][95][96][97], but these techniques often have external hyperparamters characterizing the contamination density (such as z cut in soft drop).
In order to mitigate this bias even when the contamination density is unknown, we propose to build shapes that factorize the event into a uniform component, U, and a structure component, E. As described in Sec. 3.3.5, we accomplish this by building observables of the form: where O i is any shape observable and I is the jet isotropy. 29 We can then use the energy flow associated to O i to calculate a "pileup-corrected" mass, discarding the energy flow associated to I. The weight z 2 associated with I is then an estimate of the fraction of event energy due to pileup -the contamination density is an extracted observable, not an assumed hyperparameter! For this study, we consider O I i , where O i is either the 3-subjettiness or the 3-(disk+point)iness, motivated by the 3-prong nature of top jets. We also consider the uncorrected observables, O i , without the uniform background for comparison. For each observable, the "shapejet mass" m J is calculated as the sum as the (massless) four vectors of particles comprising the shape. This can be numerically approximated by sampling the disk with 200 particles. To calibrate the shape-jet masses, we calculate the shape-jet mass on top jets without the addition of pileup for each observable. In Fig. 17a, we plot the shape-jet mass corresponding to each observable as evaluated on top jets. Each observable slightly undershoots the top mass peak -the exact discrepancies are given in Table 5. These values are used to shift the means of each shape-jet mass curve to the right for the remainder of this study, such that the means give the truth mean top mass when evaluated on uncontaminated jets.
In Fig. 17b, we show the result of an empirical study using our pileup-mitigating shapes on our pileup-contaminated top jet sample. We also plot, for comparison, the mass corresponding to the uncorrected observables O i , with no uniform background. These distributions are all calibrated using the values in Table 5. We see clearly that the two pileup-mitigating shapes indeed remove a significant amount of bias due to pileup, and bring m jet much closer to m top , suggesting that this is a viable strategy for pileup removal. Note, interestingly, that the distribution for the uncorrected 3-(disk+point)iness jet mass is similar to the uncorrected jet mass -this is because to best approximate the entire event, disks will grow large in an  Figure 17: The shape-jet mass for several different choices of shape observable, evaluated on (a) top jets without pileup, used to derive the calibration, and (b) top jets with pileup, calibrated using Table 5. Two different shape observables are used -the 3-jettiness in blue and 3-(disk+point)iness in red. For each shape observable, the ordinary version is plotted with dark dashed lines while the pileup-mitigating variant is plotted with bright solid lines.

Observable Bias [GeV]
3-(Disk+Point)iness+Pileup −18.7 3-Subjettiness −12.5 Table 5: The difference between the mean shape-jet mass and the mean of the truth top jet distribution in Fig. 17a, for each of the four observables under consideration. These are used to calibrate the calculated mass distributions for each shape.

3-Subjettiness+Pileup
attempt to capture all of the pileup, since the disks themselves are also uniform radiation patterns.
In Fig. 18a, we show the distribution of extracted pileup energy fraction values, corresponding to the z 2 shape parameter. We see that there is qualitative agreement between the extracted pileup energy fractions and the true one, though the shapes tend to overestimate the pileup density slightly. This is consistent with Fig. 17b, where the shapes slightly under- On the x-axis, the change in jet mass (∆m 2 PU ) 1/2 before and after contamination is plotted. On the y-axis, the value of (∆m 2 PU ) 1/2 as extracted from the corrected top mass is plotted for a small test set. As a proxy for the resolution of the estimate, we takeσ to be the standard deviation of the residuals. estimate the top mass without the calibration. In Fig. 18b, we use z 2 to compute the value of the extracted mass bias ∆m 2 PU = m 2 jet −m 2 top (z 2 ), compared to the "true" mass bias obtained by comparing the jet mass before and after contamination. To approximate the resolution of our estimator, we takeσ 2 = Var m 2 PU − m 2 PU as the average Gaussian uncertainty. We can see that both pileup-mitigating observables estimate the pileup mass bias correctly on average, and that the 3-(disk+point)iness+pileup has a slightly better resolution than the 3-subjetiness+pileup, which is to be expected as the former has more freedom to better capture features of jets.

Conclusions and Outlook
In this work, we generalized the notion of event and jet shapes into shape observables, which are a wide class of observables that can probe the geometric structure of collider events. We introduced a natural measure-theoretic language for describing events as energy flows, which encodes properties such as IRC safety as inherent topological information. Using this construction, we showed that the Wasserstein metric arises naturally from the requirement of geometric faithfulness. This post-hoc justifies its past use to define observables, IRC safety, and geometry on the space of events [25,26,39,40].
We showcased how to define arbitrary shape observables, which can be specified solely by parameterizing the manifold of shapes one wants to fit to. Importantly, we can extract both the "shapiness" value (how much the event looks like the shape) and the "shape parameters" (which shape is the best fit). This is a very intuitive picture -if one wants to ask how "ellipsy" a jet is, for example, all one needs to do is parameterize the space of ellipses to define the ellipsiness observable. The Shaper framework makes it easy to build new shape observables and evaluate them on events. It leverages the Sinkhorn approximation of the Wasserstein metric to enable fast, parallelizable, and differentiable -approximations of shape observables.
With the Shaper framework in place, the natural question arises: Which shapes should we study? We introduced a few examples of shapes motivated by jet substructure, including rings, disks, and ellipses. Using the Shaper prescription, it is easy to modify these shapes to probe additional jet structures, by including δ-functions to capture collinear radiation or uniform backgrounds to capture pileup. Our empirical studies showed how these custom observables might serve as an effective and simple pileup mitigation strategy in the study of top jets, where the amount of pileup does not need to be assumed beforehand. Many further analyses are possible: for example, in heavy quark decays, collinear bremsstrahlung is suppressed (the so-called "dead-cone effect" [90][91][92]), leading to ring-like or annulus-like, rather than point-like, jets. Moreover, the ability to take derivatives with respect to entire events could allow one to compute observable sensitivities, such as pileup sensitivity [95]. The list of observables we have provided is certainly non-exhaustive, and we hope that the Shaper framework can be used by the community to explore a wide range of brand new observables.
While Shaper can be used to analyze any shape observable, and thus has broad generality, it is not necessarily the fastest way to compute or approximate any specific observable. For example, much faster, exact algorithms for computing N -jettiness and related observables are available in the FastJet [118] package, which make use of the fact that the double optimization in Eq. (3.1) can be simplified. Most of the new observables we have defined appear to be irreducible, though, making their analysis and theoretical treatment complicated. In principle, this class of observables is perturbatively calculable in QCD, at least numerically. We anticipate that these generalized shapes might provide a useful groundwork for further theoretical studies of QCD distributions.

Code and Data
The code for the general-use Shaper framework can be found at https://github.com/ rikab/SHAPER, along with installation instructions. This directory also contains the analysis code for the top/QCD plots in Sec. 5. In this directory is also a tutorialized example notebook, showing how to use the Shaper framework.
Our dataset is a modified version of the top/QCD jet benchmark dataset described in Refs. [48,49]. The original version of this dataset can be found at https://desycloud. desy.de/index.php/s/llbX3zpLhazgPJ6.

Acknowledgments
Special thanks goes to Samuel Alipour-fard for helping to come up with the acronym Shaper, and for useful discussions about pileup. We thank Cari Cesarotti

A Energy Flows as Measures
In this appendix, we review aspects of measure theory and topology related to energy flows, as introduced in Sec. 2.2. To begin, we define a measure: Definition 4. Given a set X and a σ-algebra on X denoted σ(X ), a (positive) measure E over σ(X ) is a function E : σ(X ) → R satisfying: • Non-Negativity: For all X ∈ σ(X ), we have E(X) ≥ 0.
• Null Set: For the empty set, we have E(∅) = 0.
• Additivity: For a (countable) collection of disjoint sets X i ∈ σ(X ), we have: If, additionally, we have E(X ) = 1, we say E is a probability measure.
Associating energy flows with measures is very natural: an energy flow answers the question "How much energy did I detect in the calorimeters located within the subregion X of my detector X ?", and this answer satisfies all of the above properties. We refer to E(X) as the energy flow, whereas we call the object as the energy flow density or measure density. This language (while different from Ref. [25]) is natural, since integrating over Eq. (A.2) yields the energy flow: Under this definition, the energy flow is independent of the choice of coordinates x used on X , unlike the energy flow density. In particular, the form of energy flow is completely invariant under exactly 0-energy emissions or exactly collinear splittings, and additionally is also invariant under particle relabellings. Throughout this paper, we restrict ourselves to measures that can be written as the integral of a well-behaved associated density, which is the case for all physically realized energy flows. We refer to measures whose density is a finite sum of weighted δ-functions as atomic measures.
Measures can be used to formalize what we mean by integration. We begin by defining E, φ as the integral of an integrable function φ : X → R against an energy flow E. 30 Assuming coordinates x on X , we define this quantity as the ordinary integral over the associated energy flow density: 31 In the special case where E is a probability measure, this quantity is the expectation value E X∼E [φ(X)] of the random variable φ(X) sampled over E. The notation E, φ highlights the inner product structure of a measure E "acting" on the space of functions on X . Notably, this inner product is bilinear -though when considering positive measures, one must be careful to not produce energy flows where the energy is anywhere negative, as these are nonphysical. Next, we discuss the important topological features of measures. First, we define weak* convergence (also sometimes referred to as convergence in law ) of a sequence of measures: Definition 5. A sequence of measures E n converges with respect to the weak* topology to a measure E, which we denote E n → E, if for any continuous test function φ on X , the sequence of real numbers E n , φ converges to the real number E, φ .
That is, we say a sequence of measures converges if every single expectation value converges. Associated to this definition of convergence is the weak* topology, which is the (weakest) topology for which the map ·, φ is considered continuous for all continuous φ. Note that this definition does not make use of any metric on measures; it simply inherits the metric and topological structure of the real numbers. Building off of this, we can now define what it means for any function on energy flows to be continuous: 30 The integral of φ over E is often denoted dE φ. When E is the Lebesgue measure, uniform over X , this becomes the ordinary integral. 31 We will be satisfied here defining integration in terms of the ordinary Lebesgue integration on real numbers, which is always possible if an associated density exists.

Definition 6.
A function F on the space of positive measures is continuous with respect to the weak* topology if, for every convergent sequence of measures E n → E, the sequence F (E n ) converges to F (E).
Continuity is a very powerful tool for dealing with energy flows. First, it immediately implies that continuous measures can be arbitrarily well approximated by atomic measures with an increasing number of samples. Moreover, if a function F on measures is continuous with respect to the weak* topology, this further implies it is only necessary to specify how F acts on atomic measures, since the action on all other measures is fixed by weak* continuity.
Intuitively, weak continuity captures the idea that a continuous distribution is well approximated by a discrete one with "enough samples" -as the number of sample increases, we approach the continuous one. Note, however, we have not yet defined what it means for two energy flows to be "close" to each other, so we cannot yet describe what happens to functions under small energy flow perturbations. This requires a metric on the space of energy flows that respects the weak* topology, which as argued in Sec. 2 and further justified in App. B, can be taken to be the Wasserstein metric.
We finally introduce one last piece of notation. For any two energy flows E 1 and E 2 , the joint energy flow E = E 1 ⊗ E 2 is the energy flow satisfying for any sets X, Y ⊆ X : The energy flows E 1 and E 2 are referred to as the marginals of E. They can be written as E 1 (X) = E(X, X ) and E 2 (Y ) = E(X , Y ), with densities E 1 (x) = X dy E(x, y) and E 2 (y) = X dx E(x, y) respectively.

B Constructing the Wasserstein Metric
In Sec.

B.1 Shaping Up the Loss
To start, we list the properties we would like the universal loss function L to satisfy. First, we would like L to be a proper metric on the space of energy flows, which implies the following usual properties of metrics: 1. Finiteness: We require that L is finite, even when evaluated on atomic measures. This immediately rules out the KL divergence [124] and its variants, including log-likelihoods, used in Ref. [51] to define "fuzzy jet" observables. These functions are only finite on continuous measures with support almost everywhere.
2. Positivity and Closure: We require that L(E, E ) ≥ 0. Moreover, we require that L(E, E ) = 0 if and only if E = E . This captures the notion of the "optimal shape" as discussed in Sec. 2.1.

Symmetry:
We require that L(E, E ) = L(E , E). This is to say, if the energy flow E looks like E , then the energy flow E looks like E as well.
We will not specifically require the triangle inequality, as we never use it throughout this work. Of course, the Wasserstein metric does indeed satisfy the triangle inequality when the appropriate powers of 1/β are included, making it a proper metric. Not only is L a metric, but it must be continuous to the weak* topology to be IRC safe, as established in Sec. 2.3: 4. Weak Contintuity/IRC Safety: We require that L is weakly continuous in both of its arguments. This not only allows for continuous energy flows to be arbitrarily well approximated by atomic energy flows, but heavily constrains the energy-and positiondependence of L.
Finally, as discussed in Sec. 2.4, we demand that L faithfully lifts the ground metric. This allows the space of events to inherit the geometry of the ground metric space with no distortions or warping: 5. Faithfully Lifts the Ground Metric: We require that if E and E are atomic measures with a single normalized particle at positions x and y respectively, then L(E, E ) is proportional to d(x, y) β , where β is some fixed positive power.
Moreover, we require that L faithfully lifts the ground metric. For any measure E, we can "translate" the measure by t to E t , defined by the density E(x − t). We then require that L(E, E t ) is proportional to d(0, t) β .
If the ground metric is translationally invariant, which is the case for Euclidean metrics, property 5 implies that L must also be translationally invariant -if both E and E are translated by a vector t, the metric distance between them is unchanged.

B.2 Why Wassterstein?
Having defined the properties we would like our universal loss function to satisfy, we can now finally construct L(E, E ).
To begin, weak continuity allows us to drastically simplify the problem. Continuous energy flows may be approximated arbitrarily well by atomic energy flows, and thus it suffices to build our function only on atomic energy flows. This allows us to argue that L can only depends on single powers of the distance d(x, y) β . In particular, lifting the ground metric (though not necessarily faithfully, yet) forbids terms of the form d(x, y) β d(x , y ) β or other higher-order distance correlations, since in the single-particle case this would result in losses of the form d β 1 + d β 2 + ... with several differing exponents. This implies the following form for L when evaluated on uniform atomic flows E ∼ i δ x i and E ∼ j δ y j : where Π, K, and F are universal functions, π, k, and f are coefficients that implicitly depend on the two measures, and g is a function from the domain of E to the domain of E (and vice-versa forg). Note that this form is symmetric under swapping E and E . Without loss of generality, k ii can be chosen to be a symmetric traceless matrix. Still considering the single particle case, lifting the ground metric also implies that the functions Π and F must be linear functions (though the single particle case provides no information about K), as to reproduce the desired d(x, y) β behavior. Moreover, to still maintain only terms proportional to d β even in the multiparticle case, K must also be linear. This linearity means that we can combine the arguments of each of the Π, K, and F functions.
Recall that energy flows are additive objects, and the energy flows defined by the densities E = (E 1 + E 2 )δ x and E = E 1 δ x + E 2 δ x must be identical. This implies that each term of the terms in Eq. (B.1) must be individually linear in the energy weights of E and E , respectively. This implies the following constraints, up to an overall proportionality factor: Now, consider the two particle case, as in Sec. 2.4, and define the energy flows for vectors a and t: When t → 0, the two energy flows are the same, and when a → 0, the energy flows reduce to the single particle case. Evaluating Eq. (B.1), we obtain: L(E, E ) = Π π 1x1y d(0, t) β + π 1x2y d(0, a + t) β + π 1x1y d(a, t) β + π 2x2y d(a, a + t) β + K 2k 1x2x d(0, a) β + 2k 1y2y d(t, a + t) β + 1 2 F d(0, g(0)) β + d(a, g(a)) β + d(g(t), t) β + d(g(a + t), a + t) β . (B.9) In the limit t → 0, this expression must evaluate to zero by closure, for any choice of a. This is possible if the off-diagonal components of π and k either cancel or are individually zero. Note that the function g must select a particle y j given x i (and vice-versa forg). Importantly, however, it must do this in a label-independent way, since the labels i and j are completely arbitrary. The only way to do so is to select y j based on its distance from x iin principle, we can choose g to select the furthest particle, or the closest particle, or even the particle closest to the average distance amongst all particles. However, the only choice consistent with closure, as can be seen in Eq. (B.9), is to choose the closest particle y j to the given x j . This allows us to write the third line of Eq. Eq. (B.10) is exactly the form of a Chamfer distance from Eq. (2.11)! The Chamfer distance is the average closest distance from points in E to E , plus the average closest distance from points in E to E. Following the explicit counter-example given in Eq. (2.15), the Chamfer distance does not lift the ground metric faithfully, and therefore, F must be zero. Returning to Eq. (B.9), we now want to show that K = 0. Since k is traceless by assumption, the constraints Eqs. (B.4) and (B.5) imply that k 1x2x = E 1 E 2 and k 1y2y = E 1 E 2 . Assuming that K = 0, closure plus Eqs. (B.2) and (B.3) allows us to solve for π ij . From this, we deduce that π ij reduces to E i E j as t → 0 (with proportionality constant Π = −2K). Furthermore, π ij has to be constant in t, since otherwise L would depend on higher powers of the metric and therefore fail to lift the ground metric. Given this, Eq. (B.1) reduces to: However, this implies that L is exactly the maximum mean discrepancy (MMD) of Eq. (2.10)! The MMD can be thought of as the average potential energy of a system of springs with potential V (r) ∼ r β connecting particles between E and E , minus the self-energy of springs connecting particles within E and within E . The explicit counter-example given in Eq. (2.14) shows that this cannot be faithful, except in the special case of β = 2, which is not sufficient for our purposes. 32 It follows that K must also be zero, and we have significantly constrained the form of our metric.
It now remains only to find the form of π. To do this, we now consider an energy flow E, consisting of N particles each with energy 1 N . By closure, we again must have L(E, E) = 0, which occurs when π ij is 1 N 2 times the N × N identity matrix. Next, consider E , an exact copy of E, except the particles have been re-indexed, sending the original particle x i to x j=σ(i) . As measures are invariant under reordering, L must still be zero, though π will be (proportional to) some N × N permutation matrix. We can write this as: This will only be zero if π ij is the correct permutation matrix that undoes the index shuffling. If one instead guesses that π is a different permutation matrix, the point y σ(i) will not lie on top of the point x i , leading to L > 0. Thus, even if we did not know how exactly the particle labels on E were shuffled, all we would have to do to find the correct π is to search through all possible permutation matrices, and take the one that gives the minimum answer: where S N is the set of all N × N permutation matrices. Moreover, the index-shuffling trick allows us to see that Eq. (B.13) faithfully lifts the ground metric. If E is translated by a vector t, then π should still be the identity matrix before shuffling so that our result is proportional to d(0, t) β . If we guess the wrong permutation matrix, while it may be possible for some distances to close (that is, d(x i , y σ(i) ) to be less than d(0, t)), this will be made up for by other particles being further apart than d(0, t), and so L will be larger. For values of 32 Generically, physical systems with potentials V (r) ∼ r β experience screening, most notably electrostatic screening when β = −1. However, screening does not occur for ideal springs, i.e. β = 2. Screening and (un)faithfulness are related -as E and E move relative to eachother, the potential energy between them will not necessarily scale with r β in extended systems due to screening.
The problem in Eq. (B.13) is referred to as the combinatorial Monge problem, which is the precursor to the Wasserstein metric. We can think of Eq. (B.13) as finding the optimal map π to "transport" the points x to y, minimizing the total distance β needed to travel. These arguments are enough to fully fix the form of L -even if E and E are different energy flows, as long as they both have N particles with uniform weights, the minimization over permutation matrices is valid.
To finish our construction, we use the additivity of measures and the weak* topology to generalize Eq. (B.13) to any energy flow. Since measures are additive, any atomic energy flow with rational energy weights can be thought of particles of a base denomination weight stacked on top of each other -for example, the two-particle energy flow E ∼ 0.381δ x 1 + 0.619δ x 2 is exactly equal to the energy flow given by 381 particles at site x 1 and 619 particles at site x 2 , each with energy 0.001. One can then solve the combinatorial Monge problem using this uniform energy flow, and "collapse" the corresponding redundant subspace in π to produce the optimal transport map. In the case that energy weights are irrational, weak continuity ensures that these may be constructed as limits of rationally-weighted energy flows, so this is not an issue. Similarly, any continuous energy flow can be written this way.
Thus, the loss Eq. (B.13) is our universal loss function! Accounting for arbitrary weights and particle numbers, as discussed above, this becomes the well-known Wasserstein metric. The β-Wasserstein metric, or "Energy/Earth Mover's Distance" (EMD), is the metric on the space of measures that is positive, closed, metrizes the weak convergence, and faithfully lifts the ground metric. Repeating Eq. (2.9) for convenience, the EMD between two measures E and E is given by: EMD (β,R) (E, E ) = min π∈M(X ×X ) 1 βR β π, d(x, y) β + |∆E tot |, π(X , Y ) ≤ E (Y ), π(X, X ) ≤ E(X), π(X , X ) = min(E tot , E tot ). (B.15) The parameter R > 0 sets a distance scale for the EMD. The additional energy difference term, |∆E tot | = |E tot − E tot |, contributes whenever the two energy flows do not have the same total energy. Our argument in this section does not fix this term uniquely, and in fact there are many different approaches to unbalanced metrics [39,125,126]. In the common case that both E and E are atomic measures with M and N particles with energies E i and E j respectively, the EMD takes the form: π ij = min(E tot , E tot ). (B.16)