The Use of Minimal Spanning Trees in Particle Physics

Minimal spanning trees (MSTs) have been used in cosmology and astronomy to distinguish distributions of points in a multi-dimensional space. They are essentially unknown in particle physics, however. We briefly define MSTs and illustrate their properties through a series of examples. We show how they might be applied to study a typical event sample from a collider experiment and conclude that MSTs may prove useful in distinguishing different classes of events.


Introduction
The quest for physics beyond the standard model (SM) takes on two forms: 1) direct observation of non-SM events in some kinematic region, and 2) indirect observation of non-SM interactions through a minute deviation of an observable from its SM value. In the former case, a deviation from the expected distribution of a kinematic variable may be localized, e.g. a narrow peak in an invariant mass distribution, or it may appear as a broad feature that can only be detected on a statistical basis, such as an enhancement of the transverse momentum distribution of vector bosons due to an anomalous trilinear coupling. Very often the researcher makes a histogram for a kinematic quantity and applies a statistical test for compatibility with the distribution expected from the SM alone. In this approach one hopes to know in advance which kinematic quantities will reveal new physics if it is present, and this requires insight and a fortunate choice of new physics model. Different kinematic quantities are examined one at a time; the researcher is effectively marginalizing over the other quantities in each case. Sometimes multiple kinematic quantities are combined in a so-called multivariate analysis, and a single indicator variable is constructed that allows the researcher to exploit several kinematic quantities simultaneously; this technique requires a model for the non-SM physics signal in order to optimize or train the indicator variable. Multivariate techniques are nearly always tied to more or less specific models of new physics.
We propose a method that stands between the extremes of histogramming individual quantities and constructing a highly-directed indicator variable. Our method makes use of minimal spanning trees (MST) constructed with events as they populate phase space. 1 The distribution of events in phase space is determined by a production process followed by a decay process, which in turn are determined by the density of states and by matrix elements computed from an underlying fundamental theory. For example, values for the invariant mass (M ) and transverse momentum (q T ) of muon pairs produced in the Drell-Yan process at a hadron collider are accurately generated based on SM calculations. The LHC experiments have carefully measured differential cross sections dσ/dM , dσ/dq T , and dσ/dY for the inclusive production of same-flavor [1], opposite-charge lepton pairs in pp collisions, and these measurements confirm precise SM predictions. For example, a narrow peak for M ≈ 91 GeV reflects the Z boson resonance, and the lepton pairs are produced more centrally as their invariant mass increases. These measurements do not attempt to probe correlations among kinematic variables, except perhaps to measure the differential cross section for one quantity in "slices" of another. Correlations or other structures that may be present in phase space may be overlooked and are certainly not the goal of these measurements. Our method is based on a construct that can preserve and reflect correlations or other structures in phase space. This method might complement traditional differential cross section measurements and ultimately provide a new tool for identifying deviations from the SM that indicate new physics processes. This paper is organized as follows. First we define MSTs and describe their construction. Second we briefly recall results from applications of MSTs to cosmological structure. Third we analyze a set of values for one stochastic quantity in preparation for multi-dimensional cases -these we call one-dimensional "trees." Fourth we present a number of artificial examples that demonstrate in a dramatic fashion how MSTs can be sensitive to the way stochastic variables are distributed in a multidimensional space. Fifth we illustrate the efficacy of MSTs for simulated dilepton events that might be observed at an LHC experiment. Finally we summarize our exploratory results and conclude.

Overview of minimal spanning trees
The Euclidean minimal spanning tree (MST) is a graph connecting m points in n dimensions [2]. Each point, called a "vertex," is connected to at least one other via a line segment, called an "edge," whose length is the Euclidean distance between the two vertices it joins. The edges in the MST are chosen such that the sum of their lengths is minimized and that all m vertices are connected; that is, the tree is both minimal and spanning. By construction, such a graph will have no closed loops or "circuits." For any given set of vertices, there is a unique MST so long as no two pairs of vertices are separated by exactly equal distances. Figure 1 shows a very simple MST with 20 vertices in two dimensions.

Construction
A number of algorithms to construct MSTs are available, the best known of which are Prim's algorithm [3] and Kruskal's algorithm [4]. We implemented Kruskal's algorithm in C++ code written expressly for this study. With this implementation, a tree with 6000 nodes takes a single computer about thirty seconds to build.

Statistical quantities
The structure of an MST can be analyzed on a statistical basis. The lengths of the edges in the tree, l, is perhaps the most intuitive MST quantity. The shape of the distribution of l reflects the density and uniformity of the tree's vertices. The more sparsely the vertices are distributed in phase space, the greater the mean of the distribution of l. Furthermore, a tree with more variation in the density of its vertices will have a wider distribution of l.
To compare the relative shapes of two trees, the length of each edge divided by the average of its tree's edges is a more useful quantity. This quantity is called "normalized length"l, sol = l/ l . The distribution of the logarithm of this quantity, ln(l), is effective in distinguishing trees with different structures.
Another intuitive quantity is the number of edges attached to each vertex, called its "degree" or d. A distribution of d with a sharp peak at 2 signifies a tree with a filamentary structure, or longer "branches." A "branch" is a chain of edges with a loose end, all connected via vertices of degree 2. The length of a branch is b and one can compare ln(b) among the trees.
The quantities l,l, d, and b pertain to a single tree, and distributions of these quantities allow a comparison of two or more trees. In addition, we have devised quantities that relate to how two MSTs overlap in phase space (see Section 5).

Data considerations
The MST relies upon the spatial arrangement of events in phase space, which must be taken into account when choosing which features to include. The first of these considerations is scaling. The MST uses Euclidean distance, so the structure of the tree is sensitive to the units and range of each feature. If a single tree makes use of kinematic quantities which take varying units or differ greatly in their distribution, all features can be rescaled to prevent any single quantity from inappropriately guiding the analysis. In this paper we have constructed examples that do not require rescaling; however, it might be necessary in a more general application.
Another consideration is the continuity of our feature space. As a distance-based method, the MST is not suited to categorical or discrete features such as an integer number of jets. The inclusion of discretized features may have unexpected consequences, such as creating artificial filaments and branches in one of the dimensions.

MST analysis of cosmological structure
One of the earliest applications of MSTs in a scientific analysis was reported by Barrow, Bhavsar, and Sonoda in 1985 [5]. At the time, the filamentary nature of the distribution of galaxies was becoming apparent and Barrow et al. wanted to find a tool that was sensitive to that filamentary structure. Traditional statistical measures were not effective in identifying the filamentary structure and were insufficient for distinguishing competing models for galaxy formation.
Barrow et al. took the available data on galaxy locations ("Zwicky") and constructed an MST. For comparison, they constructed a random grid of vertices ("Poisson"). Their analysis is based on the fraction of edges of a given length in the MST after scaling by the mean length. The results show that the Zwicky and the Poisson MSTs have rather different structures. The number of nodes is modest (slightly more than 1000) and the trees can be distinguished by inspection. More importantly, the distributions of the normalized edge lengthsl are quite different, proving that the distribution of galaxies through space has structure beyond that of a random distribution in space -the galaxies are more clustered.
Other analyses in astrophysics have employed MSTs. For example, Campana et al. employ them to identify sources of high-energy γ rays [6]. Allison et al. used MSTs to trace the segregation of mass (the fact that massive stars are not distributed in the galaxy the same way as other stars) [7]. Finally, MacFarlane et al. applied MSTs to chemical tagging of stellar debris [8]. These three applications are rather different, and do not have obvious analogs in particle physics. Nonetheless, they show interesting advantages over more standard methods from which particle physics might profit.

One-dimensional trees
We begin with three simple examples for a one-dimensional "tree." While a one-dimensional tree has very little structure, these examples nonetheless illustrate some statistical features that are important for our discussion. Since the trees are almost trivial, it is easy to see the connection between the distribution of "vertices" and the distribution of edge lengths.
Suppose we have a set of N values {x i } drawn from a given probability density function (pdf). They can be visualized as a set of vertices spread out along a line -a one-dimensional "tree." If we put the values in ascending order, then the MST is simply the set of edges from one vertex to the next, starting from one end and ending at the other. The length of the "tree" is just the sum of the distances from one vertex to the next, which is the same as the length spanned by the tree: D = |x last − x first |. There is only one branch, and that branch contains all vertices connected one to the next; all vertices have degree two except the first and the last.
The only quantity of interest is the edge length, l i = |x i+1 −x i | (i = 1, 2, . . . , N −1), or quantities derived from it. In this section we focus only on the distribution of l i .
We consider three illustrative cases, all of them defined for the interval 0 ≤ x ≤ 12. The first case is just the uniform distribution, with pdf f u (x) = 1/12. The second is sharply peaked toward the origin: f e (x) = e −x . The third function is multi-modal with a region where vertices are suppressed: f s (x) = C sin 2 (πx/8) where C is a normalization constant. Example trees of these three distributions are shown in figure 2.
We generated 10 5 vertices (x values) from each distribution, sorted them, and calculated the 10 5 − 1 edge lengths. Distributions of these edge lengths are shown in figure 3. As one would expect, the exponential and sin 2 distributions lead to long tails in the edge length distributions, in contrast to the uniform distribution which terminates far more quickly. The three distributions differ greatly for smaller edge lengths, l < 0.006.

Artificial examples
The examples from the previous section show clear discrimination power among the three pdfs, but for one dimension the analysis is rather simple and does not lead to deep insights. In this section we examine four more complicated examples based on trees in more than one dimension. While the examples are still quite artificial, they anticipate the features of the collider data we discuss later.

Dense vs. sparse
For a first example, we placed 800 evenly-spaced vertices on two grids. The first was sparsely populated with dimensions 20 × 40, and the second was densely populated with dimensions 3 × 40. Small (σ = 0.2) random numbers were drawn from a Gaussian distribution to perturb the position of each vertex. These perturbations created nonuniform edge lengths for each set of vertices, which is the necessary condition to build a unique MST. Both trees are shown in figure 4. These samples are easily distinguished by eye; our goal is to determine whether the MST provides a statistical basis for distinguishing them.
The statistical quantities of a given tree may be analyzed as shown in figure 5. It is illustrative to first look at the lengths of the edges in each tree, or l. The trees in this example have visibly different densities, so one expects a clear difference in the means of the distributions of l. Both trees are uniformly populated, however, so one does not necessarily expect a difference in their shape. To determine whether there exists such a difference, we look at ln(l). The shapes are very similar, but still distinguishable; the difference in shape comes from the difference in the aspect ratios of the trees. For d, the symmetry of the peaks shows a relatively uniform distribution of vertices in 2-space, as expected for these samples. The distributions of d typically do not provide much information, at least in two dimensions. The similarity in the distributions of ln(b) shows that the two trees are similar in shape but not in density.
In the context of this multidimensional example, we introduce the quantities that compare two trees directly. One method of comparing two trees is finding the shortest distance between them. For each vertex on one of the trees, we locate the nearest vertex on the other tree. We will refer to the distance between them as the vertex's "connection length," or c, which gives an idea of how far apart the trees are in our feature space. In this example the dense tree is within the boundaries of the sparse tree, so small c values are expected when measuring the dense tree against the sparse tree. For the converse, one expects a wider but relatively uniform distribution of c. It should be clear that the distribution of connection lengths for the dense tree with respect to the sparse tree is not the same as the distribution of connection lengths for the sparse tree with respect to the dense tree. In figure 6, the peaks on the left show the area of overlap. For the sparse tree, the bins on the right indicate the presence of events we would not expect to find in the dense tree.
If we compare the distance between trees in a way that accounts for their densities, we can pinpoint potential anomalies in a data sample. Such anomalies could include, for example, a dense cluster of events in an area we expect to be sparsely populated. One method of finding such anomalies is to compare the connection lengths found above to the lengths of the nearest edges of the tree in which we are interested. Consider an arbitrary vertex in the dense tree. We divide its c by the average of the lengths of the k edges nearest to this vertex, creating what we will call the "connection ratio" or r. This k is a tunable parameter, which we have chosen to be five. The histograms for r in figure 6 look similar to those for c because of the uniformity in the distribution of vertices, but the longer tail for the sparse tree shows the significance of its outlying vertices.

Spacing in x
For a second example, we experimented with nonuniform spacing. The first, "uniform" grid is identical to the sparse grid from the previous example. The second, "quadratic" grid has the same dimensions and the same number of vertices, but they are distributed quadratically in the x direction. As is evident in figure 7, this creates long branches near the left and right boundaries of the quadratic MST and short branches near its horizontal center, which affect each measured quantity.
The distributions of the statistical quantities for these trees are shown in figure 8. The l distributions have similar means, but the distribution for the quadratic tree is wider, indicating its variety of branch lengths. This nonuniformity is also apparent in the wider, more skewed distribution of ln(l) for the quadratic tree. The filamentary structure is further exemplified in the distributions of d, where the quadratic tree distribution peaks more sharply around 2. The wider distribution of ln(b) for the quadratic tree signifies its shorter and longer branches. Another effect of the sparse vertices on the sides of the quadratic tree is its narrower distribution of c, shown in figure 9. The results for r are similar because the trees in this example have similar average spacing as compared to figure 6.

Disc vs. strip
Our third example compares two different tree geometries: one set of 4000 vertices uniformly distributed over a disc and one set of 4000 vertices uniformly distributed over a long, narrow rectangle which we will refer to as a "strip." Both are shown in figure 10. The strip geometry limits the length of branches that can be created in that tree as compared to the disc tree.
The difference in the means of l in figure 11 shows that the vertices of the strip tree are denser. The broader, slightly shifted distribution for ln(l) of the disc tree is also due to its sparser distribution of vertices. We see a wider distribution of d for the strip tree because it has shorter branches. These short branches are also visible in the ln(b) distribution. The most noticeable difference in these trees aside from their shape is their differing locations in phase space. The peaks on the left in the c distributions in figure 12 represent the area of overlap. The bins on the right for the strip tree show a concentration of events that do not fit the disc distribution. Rescaling the connection lengths to obtain r gives us an even longer tail for the strip tree, magnifying the effect of its variant geometry.

Hidden variable
Our fourth example illustrates the multidimensional capabilities of the MST. This example is important because it illustrates the potential of an MST to profit from structures or distributions in a multidimensional feature space that may be unanticipated by the researcher. While this example plainly is contrived, it nonetheless illustrates how a multidimensional MST could provide non-obvious discriminating power.
We uniformly distributed and perturbed two sets of 4000 vertices on the same disc in twodimensional space, shown in figure 13. Figures 14 and 15 show that the distributions of statistical quantities for the two discs are the same apart from statistical fluctuations. In particular, the histograms of c in figure 15 are not only nearly identical, but also very narrow because the trees are so similar.          We then added a third variable and built the trees in all three dimensions. For the first disc, we filled the third (z) dimension with vertices from a uniform distribution. For the second, we used an exponential distribution. The familiar xy projections of these trees are shown in figure 16; they are essentially the same as figure 13. Their xz projections are shown in figure 17. The introduction of the new variable significantly alters the structure of each tree, making them qualitatively different even though the xy view in figure 16 is essentially the same. The difference becomes apparent, however, when the distributions of each of our statistical quantities are examined.
The distributions of l in figure 18 differ not only in mean but also in shape. The long tails on the exponential tree distribution of ln(l) show the abundance of edges that are longer and shorter than the average. The distributions of d are very similar to those in figure 14, but they may indicate slightly more of a filamentary structure in the exponential tree. In spite of its more filamentary structure, the exponential tree in fact has shorter branches, as is evident in figure 18. The c and r distributions in figure 19 show events present in the uniform tree that one would not expect to find in the exponential tree.
This fourth artificial example illustrates the promise of MST analyses: structure that may be obscure or unknown to the observer can be uncovered on a statistical basis. While the pictures in figure 16 show no differences, figure 18 and especially figure 19 show major differences.

Collider data
In an attempt to provide a quasi-realistic example, we simulated basic LHC processes that lead to final states with two energetic, opposite-charge muons using the Pythia event generator [9]. These processes are: direct Drell-Yan production, di-boson (WW, WZ, and ZZ) production, top quark pair production, and Z → τ + τ − followed by τ → µνν. Such events are of general interest both for precision measurements and also for searches for new physics.
We defined two samples of MC events to compare. The first sample (DY-only) contains only Drell-Yan production of lepton pairs. The other sample (all processes) contains all the most important processes: Drell-Yan, top quark production, di-boson production, and Z → τ + τ − . These processes contributed to the second sample in accordance with their cross sections and the relevant branching ratios. In order to approximate an event sample of interest, we required that the missing transverse momentum be greater than 50 GeV. 2 Both DY-only and all processes samples contained nearly 6000 events each. For the purposes of this study, we used the events to define vertices in the (M , q T ) plane, where M is the invariant mass and q T is the transverse momentum of the lepton pair.    figure 20. This is, essentially, the view of the data taken when differential cross sections dσ/dM and dσ/dq T are measured. Since both the DY-only and the all processes histograms have the same number of entries, we can only draw conclusions about differences in shape. Figure 20 shows that those differences are small but not insignificant. In figure 22 we see that the distributions of l and ln(l) are not as sharply peaked for the all processes tree, reflecting its more sparsely distributed events. The distributions of d are indistinguishable for the two trees. The all processes tree has slightly longer branches on average, however, which are located in the region populated by non-DY events. These non-DY events are even more apparent in figure 23, where the long tails of the c and r distributions for the all processes tree are visible.

Suppression of concentrated DY region
The previous example shows some of the differences we expect, but we can magnify them using our physical insight to focus on the interesting areas of the (M , q T ) plane. Using the same samples as before, we now apply weights such that events in the rectangle 50 < M < 90, 0 < q T < 100 GeV have weight zero and all other events have weight one. Effectively, the number of vertices is reduced from 6000 to 3739 for DY-only and 3819 for all processes. Each edge is given a weight equal to the product of its vertices' weights. This does not affect the structure of the trees, but reduces the number of events included in each histogram. In particular, it censors a portion of the tree that physicists may consider uninteresting.
For the l and ln(l) distributions in figure 24, the size of the DY-only histogram has been normalized to the size of the all processes histogram so that we can focus on differences in shape. For ln(l),  the all processes histogram has grown a shoulder for ln(l) ≈ 0.5. The DY-only histogram for the d distribution has also been normalized to the size of the corresponding all processes histogram. For ln(b), however, the DY-only histogram was normalized using the factor from the ln(l) distribution because it is normal for trees of comparable size to have a different number of branches, though they must have the same number of edges. The all processes ln(b) distribution has a gentler peak than it did before the weights were applied. Interestingly, after the weighting, the distributions of c and r in figure 25 are nearly identical in shape to those in figure 23.

Suppression of DY peak
We can try to further increase the differences between the two samples by focusing on the high-mass region only. To do so, we give weight zero to all events with M < 100 GeV. This weighting scheme removes most of the short edges from both trees, so both l distributions are much wider as seen in figure 27. The removal of so many DY events broadens the DY-only ln(l) distribution, and the all processes distribution moves upward. We see that the weighted DY-only tree has shorter edges on average, as expected for a tree with more concentrated vertices. The difference in the d distributions is modest, but they suggest a more filamentary structure in the DY-only tree. The slightly broader distribution of ln(b) for the all processes tree shows its longer branches. Finally, there is a dramatic difference in the shapes of the histograms of c in figure 27, indicating the presence of many non-DY events in the all processes sample. The r distribution also shows these events, although not as dramatically.

Incorporating MST information in collider data analysis
The MST information explored here can be used to improve the separation of events from different processes. As an example, we consider the following task: given the kinematic region with two muons with M > 100 GeV as in the previous subsection, can we estimate the fraction of events coming from tt production?
A standard approach to this problem might be to take the (q T , M ) plane and perform a binned maximum likelihood fit. As shown in figure 28, the distribution of tt and Drell-Yan plus di-boson events is rather different. A set of six bins was defined to take advantage of the different distribution of events in this kinematic plane, and a negative log-likelihood function was defined: where N is the number of bins, B j is the pdf for the Drell-Yan plus di-boson events, and S j is the corresponding pdf for the tt events. The one free parameter, α, corresponds to the fraction of selected events that come from the tt process. The two pdfs satisfy normalization conditions N j=1 B j = N j=1 S j = 1. Consequently, the function Q(α) profits only from the difference in shape for the background (Drell-Yan plus di-boson) and signal (tt).
The difference in the distribution of background and signal events visible in figure 28 leads to a weak constraint on α. In order to demonstrate the utility of MST quantities, we select the logarithm of the normalized edge length, ln(l), that has shown some discriminating power in the toy examples of the previous section. This single quantity is sensitive to the fraction of events that come from the signal process (tt), as shown in figure 29. The maximum differentiation afforded by    this quantity comes from the shape of the distribution in figure 29, but to be conservative, only the mean of the distribution is used. The mean µ l of the distribution of the logarithm of the normalized edge length varies linearly with the tt fraction. We parametrized this linear relation of α with µ l using simulations in which we varied the tt fraction directly. For each value of α, a new tree was built, the distribution of ln(l) was formed, and µ l was computed.
In order to incorporate the new MST-related information, an additional χ 2 -like term was added to the negative log-likelihood function, expanding the expression in eq. (1): where µ obs is the default value for µ l and µ l (α) is the linear function of µ l in terms of α. The variance σ 2 l is determined by the distributions as in figure 29. The inclusion of this single MST information improves the constraint on the signal fraction α. Figure 30 shows the curves for Q(α) for the binned likelihood alone, as in eq. (1), and also for the binned likelihood augmented by the MST information, eq. (2). On the basis of these curves, the standard deviation for α decreases from σ α ≈ 0.5 to σ α ≈ 0.3. This is a substantial gain, and illustrates the potential of MST quantities to enhance the statistical power of data collected by particle physics experiments.

Conclusions
We have presented an early exploration of the use of minimal spanning trees (MST) in particle physics. We briefly defined what an MST is and recalled how they were used in cosmological studies. We devised a number of simple examples that demonstrate how MST quantities can differentiate structures in a phase or feature space.
We hope that MSTs can play a useful role in particle physics. We devised a simple illustration for events with a pair of oppositely-charged muons and significant missing transverse energy; the distributions of tree-based quantities show that samples with different compositions can be distinguished. To the extent that distributions of tree-based quantities are statistically independent of distributions of kinematic and other quantities, discriminating power in, for example, a multivariate analysis will improve: the tree-based quantities can be considered as new additions to the feature space utilized in multi-variate analyses. While a distribution of a tree-based quantity for the entire tree will not be very discriminating (because the information about which kinematic region is interesting has been discarded), distributions for targeted regions of phase or feature space can be interesting, as our study of di-muon plus missing transverse energy events shows. This paper is not at all exhaustive of what can be studied with MSTs. For example, we did not use any pruning (deletion of short branches) which is common in other applications. Also, our collider example utilized only two kinematic quantities but many more are available. Finally, we have not tried to use tree-based quantities in machine learning or multi-variate analyses of the type we presented in ref. [10]. Clearly, opportunities for further development of MSTs in particle physics remain.