Clustering with minimum spanning trees: How good can it be?

Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can be very competitive. Next, we review, study, extend, and generalise a few existing, state-of-the-art MST-based partitioning schemes. This leads to some new noteworthy approaches. Overall, the Genie and the information-theoretic methods often outperform the non-MST algorithms such as K-means, Gaussian mixtures, spectral clustering, Birch, density-based, and classical hierarchical agglomerative procedures. Nevertheless, we identify that there is still some room for improvement, and thus the development of novel algorithms is encouraged.

Given a dataset X = {x_1, ..., x_n} with n points in R^d, the space X_k of all its possible k-partitions is very large. Namely, the number of possible divisions of X into k ≥ 2 nonempty, mutually disjoint clusters is equal to the Stirling number of the second kind, which is O(k^n). Thus, in practice, clustering algorithms tend to construct a simpler representation of the search space to make their job easier. For instance, in the well-known k-means algorithm (Lloyd, 1957, 1982), k (continuous) cluster centroids are sought, and a point's belongingness to a subset is determined by means of its proximity thereto. In hierarchical agglomerative algorithms, we start with n singletons and then keep merging pairs of clusters (based on different criteria, e.g., average or complete linkage; see (Müllner, 2011)) until we obtain k of them. In divisive schemes, on the other hand, we start with one cluster consisting of all the points and then iteratively try to split it into smaller and smaller chunks.
From this perspective, different spanning trees of a given dataset offer a very attractive representation. In particular, the minimum spanning tree (MST; the shortest dendrite) with respect to the Euclidean metric minimises the sum of the distances along its edges.
More formally, given an undirected weighted graph G = (V, E, W) representing our dataset, with V = {1, ..., n}, E = {{u, v} : u < v}, and W({u, v}) = ‖x_u − x_v‖, the minimum spanning tree T = MST(G) = (V, E′, W′), E′ ⊆ E, W′ = W|_{E′}, is a connected tree spanning V with E′ minimising ∑_{{u,v}∈E′} W({u, v}). Any spanning tree representing a dataset with n points has n − 1 edges. If we remove k − 1 of them, we obtain k connected components which can be interpreted as clusters; compare Figure 1. This reduces the search space considerably; while it is still large, some heuristics (e.g., greedy approaches) allow for further simplifications.

Fig. 1. Removing three edges from a spanning tree gives four connected components, which we can treat as separate clusters.
Applications of MST-based algorithms are plentiful (e.g., the analysis of gene expression data (Y. Xu, Olman, & Xu, 2002), pattern recognition in images (Yin & Liu, 2009), etc.). Overall, in our case, they allow for detecting well-separated clusters of arbitrary shapes (e.g., spirals, connected line segments, blobs; see Figure 2). The clusters do not necessarily have to be convex, as is the case in the k-means algorithm (via its connection to Voronoi diagrams).
This paper aims to review, unify, and extend a large number of existing approaches to clustering based on MSTs (ones that yield a specific number of clusters, k) and to determine which of them works best on an extensive battery of benchmark data. Furthermore, we quantify how well the particular MST-based methods perform in general: are they comparable with state-of-the-art clustering procedures?
This paper is set out as follows. Section 2 reviews existing MST-based methods and introduces some noteworthy generalisations thereof, in particular: divisive and agglomerative schemes optimising different cluster validity measures (with or without additional constraints). In Section 3, we answer the question of whether MSTs can provide us with a meaningful representation of the studied benchmark datasets for the purpose of data clustering. Then, we pinpoint the best-performing algorithms and compare them with non-MST-based approaches. Section 4 concludes the paper and suggests some topics for further research.

Divisive algorithms
Perhaps the most widely known MST-based method is the classic single linkage scheme (also known as the Wrocław taxonomy, the dendrite method, or nearest neighbour clustering). It was proposed by the Polish mathematicians Florek, Łukasiewicz, Perkal, Steinhaus, and Zubrzycki (1951).
That the single linkage clustering can be computed using the following divisive scheme over MSTs was noted in (Gower & Ross, 1969).
Algorithm 1 (Single Linkage - Divisively). To obtain the single linkage k-partition of a given dataset X represented by a complete graph G whose weights correspond to the pairwise distances between all point pairs, proceed as follows:
1. Let T = MST(G) = (V, E′, W′);
2. Let {{1, ..., n}} be an initial 1-partition consisting of a single cluster representing all the points;
3. For i = 1, ..., k − 1 do:
   (a) Split the cluster containing the u-th and the v-th point (so that they do not belong to the same connected component anymore), where {u, v} ∈ E′ is the edge of the MST with the i-th greatest weight;
4. Return the current k-partition as a result.
In other words, we remove the k − 1 edges of the greatest lengths from E′ and study the resulting connected components.
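For illustration, below is a minimal Python sketch of Algorithm 1, assuming a small dataset held in a numpy array X with strictly positive pairwise distances (no duplicate points) and relying on SciPy's MST and connected-components routines; the function and variable names are ours.

```python
# A minimal sketch of Algorithm 1: single linkage by cutting the k-1 heaviest MST edges.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def single_linkage_via_mst(X, k):
    D = squareform(pdist(X))                    # pairwise Euclidean distances
    mst = minimum_spanning_tree(csr_matrix(D))  # sparse matrix with n-1 edges
    mst = mst.tocoo()
    # sort the MST edges by weight and drop the k-1 heaviest ones
    keep = np.argsort(mst.data)[: len(mst.data) - (k - 1)]
    forest = csr_matrix(
        (mst.data[keep], (mst.row[keep], mst.col[keep])), shape=mst.shape
    )
    # the connected components of the pruned forest are the clusters
    _, labels = connected_components(forest, directed=False)
    return labels
```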
Another divisive algorithm over MSTs was studied by Caliński and Harabasz (1974). They minimised the total within-cluster sum of squares (the same objective as in the k-means algorithm); they provided it as an alternative to the agglomerative (but non-MST) Ward (1963) algorithm and to the one by Edwards and Cavalli-Sforza (1965), who employed an exhaustive divisive procedure.
More generally, let F : X_l → R be some objective function that we would like to maximise over the set of possible partitionings of any cardinality l (not just k, which we treat as fixed). We will refer to it as a cluster validity measure.
Moreover, let C(V, E′′) = (X_1, ..., X_l) ∈ X_l be the partition corresponding to the connected components (with no loss of generality, assuming that there are l of them) of a subgraph (V, E′′) of (V, E).
Algorithm 2 (Maximising F over an MST - Divisively). A general divisive scheme over an MST is a greedy optimisation algorithm: starting with E′′ = E′ (a single cluster spanned by all the MST edges), in each of the k − 1 iterations we remove from E′′ the edge whose deletion maximises F computed on the partition C(V, E′′) determined by the resulting connected components. Overall, a divisive scheme is slightly more time-intense than the agglomerative approach which we mention below (a partition refinement data structure can be used). However, it is still significantly more feasible than in the case where the dataset is represented by a more complicated graph (nearest neighbours, complete, etc.).
Thus, in the case of the single linkage scheme, the objective simply maximises the sum of weights of the omitted MST edges, and in the setting of the Caliński and Harabasz (1974) paper, we maximise (note the minus):

F(X_1, ..., X_l) = −∑_{i=1}^{l} ∑_{x_j ∈ X_i} ‖x_j − µ_i‖²,

where µ_i is the centroid (componentwise arithmetic mean) of the i-th cluster.
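A naive, unoptimised sketch of Algorithm 2 with a pluggable objective F, here instantiated with minus the within-cluster sum of squares, is given below; it does not use the partition refinement structure mentioned above, and the helper names are ours.

```python
# A naive sketch of Algorithm 2: greedily remove the MST edge whose deletion maximises F.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def neg_wcss(X, labels):
    """Minus the total within-cluster sum of squares (Caliński & Harabasz, 1974)."""
    total = 0.0
    for c in np.unique(labels):
        chunk = X[labels == c]
        total += ((chunk - chunk.mean(axis=0)) ** 2).sum()
    return -total

def _components(n, edge_list):
    """Cluster labels given an (m, 2) array of undirected edges over n vertices."""
    graph = csr_matrix(
        (np.ones(len(edge_list)), (edge_list[:, 0], edge_list[:, 1])), shape=(n, n)
    )
    return connected_components(graph, directed=False)[1]

def divisive_maximise(X, mst_edges, k, F=neg_wcss):
    """mst_edges: an (n-1, 2) integer array with the MST edges."""
    n = X.shape[0]
    present = list(range(len(mst_edges)))      # indices of still-present MST edges
    for _ in range(k - 1):
        best_score, best_e = -np.inf, None
        for e in present:                      # try removing each remaining edge
            trial = [f for f in present if f != e]
            score = F(X, _components(n, mst_edges[trial]))
            if score > best_score:
                best_score, best_e = score, e
        present.remove(best_e)
    return _components(n, mst_edges[present])
```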
Naturally, other objective functions can be studied. For instance, Müller, Nowozin, and Lampert (2012) considered an information-theoretic criterion based on entropy, which takes into account the cluster sizes and the average within-cluster MST edge weights; here, L_i denotes the sum of the weights of the edges in the subtree of the MST representing the i-th cluster, and n_i denotes its size. Interestingly, this estimator can be derived from the Rényi entropy estimated on various graph representations of data, including MSTs; see, e.g., (Eggels & Crommelin, 2019; Hero III & Michel, 1998; Pál, Póczos, & Szepesvári, 2010). This leads to an algorithm called ITM.

As a byproduct, we will be able to assess the meaningfulness of the cluster validity measures themselves, just like in (Gagolewski et al., 2021), where we did so in the space of all possible clusterings (leading to the conclusion that many measures are actually invalid).
On a side note, as the size of the space of all possible k-partitions of an MST is O(n^{k−1}), for small k it is technically possible to find the true maximum of F (note that for k = 2 the divisive strategy gives exactly the global maximum). We leave this topic for further research.
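For small k, such an exhaustive search could be sketched as follows (reusing the _components helper from the previous sketch); this is only an illustration, not the procedure used in our experiments.

```python
# A sketch of the exhaustive search: enumerate all ways of removing k-1 MST edges
# and keep the partition maximising F. Feasible only for small k.
from itertools import combinations
import numpy as np

def exact_maximum(X, mst_edges, k, F):
    n = X.shape[0]
    all_edges = np.arange(len(mst_edges))
    best_score, best_labels = -np.inf, None
    for removed in combinations(all_edges, k - 1):
        kept = np.setdiff1d(all_edges, removed)
        labels = _components(n, mst_edges[kept])   # helper from the previous sketch
        score = F(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score
```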

Agglomerative algorithms
Single linkage was rediscovered by Sneath (1957), who introduced it as a general agglomerative scheme. Its resemblance to Kruskal's MST algorithm (and hence the fact that an MST is sufficient to compute it) was noted, amongst others, in (Gower & Ross, 1969). Thus, we can also formulate it as follows.
Algorithm 3 (Single Linkage - Agglomeratively). To obtain the single linkage k-clustering:
1. Let T = MST(G) = (V, E′, W′);
2. Let {{1}, {2}, ..., {n}} be an initial n-partition consisting of n singletons;
3. For i = 1, ..., n − k do:
   (a) Merge the two clusters containing the u-th and the v-th point, where {u, v} ∈ E′ is the edge of the MST with the i-th smallest weight;
4. Return the current k-partition as a result.
For a given MST with edges sorted increasingly by weight, the disjoint sets (union-find) data structure can be used to implement the above so that the total run time is only O(n − k).
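A possible sketch of such an implementation is given below; it assumes the MST edges are supplied as an (n−1, 2) array already sorted by increasing weight, and the function name is ours.

```python
# A sketch of Algorithm 3 using a simple disjoint-sets (union-find) structure.
import numpy as np

def single_linkage_agglomerative(n, mst_edges, k):
    """mst_edges: (n-1, 2) integer array, sorted by increasing edge weight."""
    parent = np.arange(n)

    def find(x):                       # find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in mst_edges[: n - k]:    # consume the n-k lightest MST edges
        parent[find(u)] = find(v)

    # relabel the resulting k components as 0, ..., k-1
    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels
```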
Given a cluster validity measure F , the above agglomerative approach can be generalised as below.
Algorithm 4 (Maximising F over an MST - Agglomeratively). A general agglomerative scheme over an MST is a greedy optimisation algorithm: starting with E′′ = ∅ (i.e., n singleton clusters), in each of the n − k iterations we add to E′′ the MST edge that maximises F computed on the partition C(V, E′′) determined by the resulting connected components (i.e., we merge the corresponding pair of clusters). In the single linkage case, F is simply the sum of the weights of the MST edges left unconsumed (or, equivalently, minus the weight of the edge being added).
Unfortunately, many cluster validity measures are not only inherently slow to compute, but they might also not be well-defined for singleton clusters (which constitute the starting point of the agglomerative algorithm). Due to the already large number of procedures in our study, we will only consider the agglomerative maximisation of the aforementioned information criterion, leading to the algorithm which we denote as IcA in Table 1. Its implementation is available in (Gagolewski, 2021).

Variations on the agglomerative scheme
Genie (Gagolewski, Bartoszuk, & Cena, 2016) is an example of a variation on the agglomerative single linkage theme, where we greedily optimise the total edge lengths, but under the constraint that if the Gini index of the cluster sizes grows above a given threshold g, only the smallest clusters can take part in the merging.
Algorithm 5 (Genie). Given a threshold g ∈ (0, 1], to obtain the Genie k-partition:
1. Let T = MST(G) = (V, E′, W′);
2. Let E′′ = ∅ (which corresponds to an initial n-partition consisting of n singletons);
3. For i = 1, ..., n − k do:
   (a) If the Gini index of the current cluster sizes does not exceed g, pick {u, v} ∈ E′ \ E′′ as the edge with the smallest weight (equivalently, such that the sum of weights of the edges in E′ \ (E′′ ∪ {{u, v}}) is the largest);
   (b) Otherwise, pick {u, v} ∈ E′ \ E′′ as the edge with the smallest weight provided that the size of the connected component containing u (or v) is the smallest of them all;
   (c) Add {u, v} to E′′ (i.e., merge the two clusters containing the u-th and the v-th point);
4. Return C(V, E′′) as a result.

Here, we will rely on the implementation of Genie included in the genieclust package for Python (Gagolewski, 2021). Given a precomputed MST, the procedure runs in O(n√n) time. The algorithm depends on the threshold parameter g. In this study, we will only compare the results obtained for g ∈ {0.1, 0.3, 0.5, 0.7} (for a comprehensive sensitivity analysis of Genie's parameters, see (Gagolewski, Cena, & Bartoszuk, 2016)).
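For reference, a typical call to this implementation looks as sketched below (the parameter names follow the package's documented interface; X denotes the data matrix and k the requested number of clusters).

```python
# Usage sketch of the Genie implementation from the genieclust package (Gagolewski, 2021).
import genieclust

# X: an (n, d) data matrix; k: the requested number of clusters
genie = genieclust.Genie(n_clusters=k, gini_threshold=0.3)  # g = 0.3 is the default threshold
labels = genie.fit_predict(X)
```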
In (Gagolewski, Bartoszuk, & Cena, 2016), the use of g = 0.3 is recommended. Cena (2018) noted that Genie gives very good results, but sometimes other thresholds might work better than the default one. She thus proposed an agglomerative scheme optimising the information criterion which does not start from a set of n singletons, but rather from the intersection of the clusters obtained by multiple runs of Genie.
We have implemented an extended version of this algorithm in the genieclust package (Gagolewski, 2021). Namely, what we denote by Genie+Ic (k + l) in Table 1 is a variation of Algorithm 4 that starts at the intersection of the sets E_g, where E_g is the final E′′ from the run of Algorithm 5 seeking k + l clusters using a given threshold g (i.e., at an intersection of possibly more fine-grained clusterings returned by the Genie algorithm with different parameters). We shall only consider l ∈ {0, 5, 10}, as we observed that other choices of g and l led to similar results.

Other methods
Other MST-based methods that we consider in this study include:
• HEMST (Grygorash et al., 2006), which deletes edges from the MST so as to achieve the best possible reduction in the standard deviation of the edge weights;
• CTCEHC (Ma et al., 2021), which constructs a preliminary partition based on the vertex degrees and then merges clusters based on the geodesic distances between the cluster centroids.
There are a few other MST-based methods in the literature, but they usually do not result in a given-in-advance number of clusters, k (which we require for benchmarking purposes, as described in the next section). For instance, Zahn (1971) constructs an MST and deletes the "inconsistent" edges (those whose weights are larger than the average weight of the nearby edges by more than cσ), but the number of removed edges cannot be easily controlled; a simplified sketch of this rule is given below.
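In this sketch, "nearby" is approximated by the edges directly incident to an edge's endpoints, and the function name as well as this simplification are ours.

```python
# A simplified sketch of Zahn's rule: flag an MST edge as "inconsistent" if its weight
# exceeds the mean weight of the edges incident to its endpoints by more than c standard
# deviations. Note that the edge itself is included in the neighbourhood statistics,
# which is acceptable for a rough sketch.
import numpy as np

def zahn_inconsistent_edges(n, mst_edges, mst_weights, c=2.0):
    incident = [[] for _ in range(n)]              # weights of edges touching each vertex
    for (u, v), w in zip(mst_edges, mst_weights):
        incident[u].append(w)
        incident[v].append(w)
    flags = []
    for (u, v), w in zip(mst_edges, mst_weights):
        nearby = np.array(incident[u] + incident[v])
        flags.append(w > nearby.mean() + c * nearby.std())
    return np.array(flags)                         # True marks an edge to be deleted
```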
Each dataset comes with one or more reference label vectors created by experts.Each of them defines a specific number of clusters, k.
We run each algorithm in a purely unsupervised manner: it is only given the data matrix X ∈ R^{n×d} and k on input, not the true labels.
To enable a fair comparison (ceteris paribus), no kind of data preprocessing (e.g., standardisation of variables, removal of noise points, etc.) is applied.However, let us note that the spectral method and Gaussian mixtures can be thought of as algorithms that have some built-in feature engineering capabilities.In other cases, the methods are asked to rely only on the "raw" Euclidean distance.
As a measure of clustering quality, we consider the adjusted asymmetric accuracy (AAA; (Gagolewski, 2022a)) given by:

AAA(C) = ( max_{σ: permutation of {1,...,k}} (1/k) ∑_{i=1}^{k} c_{i,σ(i)} / (c_{i,1} + ... + c_{i,k}) − 1/k ) / (1 − 1/k),

where the confusion matrix C is such that c_{i,j} denotes the number of points in the i-th reference cluster that a given algorithm assigned to the j-th cluster.
AAA can be interpreted as the overall percentage of correctly classified points in each cluster (one minus the average classification error) under the optimal matching of cluster labels between the two partitions (just like PSI (Rezaei & Fränti, 2016), which is additionally symmetric and hence less interpretable). It is corrected for chance and for cluster size imbalance. The total number of unique reference label vectors was 89. Let us note that some label vectors might define the same number of clusters k. Thus, only 83 unique partitions needed to be generated, and in the case of tied ks, the maximal AAA was considered. This is in line with the recommendation from (Gagolewski, 2022b), where it was noted that there can be many equally valid partitions and an algorithm should be rewarded for finding any of them (note that, unlike in (Gagolewski et al., 2021), we consider the maximum over datasets and ks, not just datasets); see also (Dasgupta & Ng, 2009; Luxburg, 2012) for further discussion.
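A sketch of how AAA can be computed for a square k × k confusion matrix, following the formula above, is given below; it uses SciPy's linear_sum_assignment to find the optimal permutation and is meant as an illustration, not as the reference implementation.

```python
# A sketch of the AAA computation for a square confusion matrix C.
import numpy as np
from scipy.optimize import linear_sum_assignment

def adjusted_asymmetric_accuracy(C):
    C = np.asarray(C, dtype=float)
    k = C.shape[0]
    R = C / C.sum(axis=1, keepdims=True)      # row-normalised (recall) matrix
    rows, cols = linear_sum_assignment(-R)    # permutation maximising the matched recalls
    avg_recall = R[rows, cols].sum() / k
    return (avg_recall - 1.0 / k) / (1.0 - 1.0 / k)
```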
Also, following the aforementioned guidelines, if a reference partitioning marks some points as noise, the actual way they are allocated to particular clusters by the clustering methods studied is irrelevant (they are omitted when computing the confusion matrix).

Some benchmark cases are difficult for all the methods
Overall, 68/83 ≈ 82% of the cases can be considered "easy" for at least one of the methods (maximal AAA ≥ 0.95). In other words, for each of them there exists an approach that reproduces the reference partition relatively well.
On the other hand, 6 benchmark cases turned out very "difficult" for all of the methods studied (AAA < 0.80).We marked them with two and three exclamation marks in Tables 2 and 3.
The said sextet includes most of the datasets that we sourced from the UCI repository, which are all high-dimensional and for which it is hard to verify whether the reference clusters are meaningful. Originally, these datasets were suggested for benchmarking classification, not clustering problems.
This might mean that there is something wrong with these reference label vectors themselves (and not with the algorithms tested; e.g., the clusters are overlapping), or that some further data preprocessing must be applied in order to reveal the cluster structure (this is, e.g., the case for the WUT/twosplashes dataset, which normally requires the features to be standardised beforehand; here we got a maximal AAA of 0.86).
Therefore, we exclude these 6 datasets from further analysis, as it does not make sense to compare an algorithm against what is potentially noise.
The topmost box-and-whisker plot in Figure 3 ("Max All" on the left-hand side) depicts the distribution of the highest observed cluster validity scores across all the remaining 77 benchmark cases.

Are MST-based methods any good?
Recall that the number of possible partitions of an MST of a dataset with n points into k subtrees is equal to the number of ways of choosing the k − 1 (out of n − 1) edges to remove, i.e., O(n^{k−1}). For all datasets with k = 2, 3, 4, as well as those with n ≤ 2500 for k = 5, we were able to identify the true maximum of AAA easily using the brute-force approach (considering all the possible partitions of the MST). The remaining cases were too time-consuming to examine exhaustively. Therefore, we applied a tabu-like steepest ascent search strategy with at least 10 random restarts to find a lower bound for the maximum (similarly as in (Gagolewski et al., 2021)).
Studying the "Max MST" box-and-whisker plot on the right-hand side of Figure 3, which denotes these theoretically achievable maxima of AAA (a hypothetical "oracle" MST-based algorithm), we note that only for 4/77 ≈ 5% of the datasets is the minimum spanning tree (with respect to the Euclidean distance between unpreprocessed points) not a good representation of the feature space. Namely, their accuracy scores relative to "Max All" are significantly smaller than 0.95. We marked them with asterisks in Tables 2 and 3 (WUT/olympic, WUT/z1, UCI/wine, and WUT/graph).
On the other hand, 6 cases turned out to be difficult for the non-MST methods (relative "Max Obs. Non-MST" AAA less than 0.95). These include Graves/parabolic, SIPU/pathbased with k = 3 and k = 4, SIPU/Compound for k = 6, WUT/cross, and Other/chameleon_t8_8k. Still, they can be successfully tackled with MSTs.

Which MST-based algorithm then?
The above observation does not mean that we are in possession of an algorithm that gets the most out of the information conveyed by the minimum spanning trees, nor that a single strategy is always best.
We should thus inspect which strategies and/or objective functions are more useful than others.
Figure 4 depicts the adjusted accuracies relative to "Max MST" for each method, i.e., how well each algorithm compares to the best possible solution.
As far as other "standalone" algorithms are concerned, HEMST and Single linkage exhibit inferior performance, and CTCEHC is comparable with the divisive Caliński-Harabasz criterion optimiser.
Quite strikingly, some well-established internal cluster validity measures promote clusterings that agree very poorly with the reference labels (Davies-Bouldin, SilhouetteW, some generalised Dunn indices). This is in line with our observation in (Gagolewski et al., 2021), where we performed a similar study over the space of all possible partitionings. This puts their actual meaningfulness into question: are they really good indicators of clustering quality?

How do MST-based methods compare against other clustering approaches?
Figure 5 compares the MST and non-MST approaches in terms of absolute AAAs.
As far as the current (large) benchmark battery is concerned, the MST-based methods outperform the popular "parametric" approaches (Gaussian mixtures, k-means) and other algorithms (Birch, Ward, average, and complete linkage, as well as spectral clustering with the best-identified parameters) implemented in the scikit-learn package (Pedregosa et al., 2011) for Python.
We also notice that choosing the wrong objective function to optimise over an MST can lead to very poor results. This is particularly the case if the Davies-Bouldin and SilhouetteW indices are considered.
Apart from a few "difficult" label vectors, the minimum spanning tree-based methods have been shown to be potentially very competitive clustering approaches. Furthermore, they can be improved by appropriate feature engineering (scaling of data columns, noise point and outlier removal, modifying the distance matrix, etc.; see, e.g., (Campello et al., 2015; Yin & Liu, 2009)).
They are also quite simple and fast to compute: once the minimum spanning tree is determined (which takes up to O(n²) time, but approximate methods exist as well; e.g., (Naidan et al., 2019)), we can potentially get a whole hierarchy of clusters of any cardinality. For instance, our top performer, the Genie algorithm as implemented in (Gagolewski, 2021), needs O(n√n) time to generate all possible partitions given a prebuilt MST. Unlike, e.g., the well-known k-means algorithm, which is fast only for small fixed ks, this property makes them suitable for solving extreme clustering tasks (compare (Kobren, Monath, Krishnamurthy, & McCallum, 2017)).
Just like in our previous contribution (Gagolewski et al., 2021) (where we tried to find an optimal clustering over the whole space of all possible partitions), we note that many internal cluster validity indices actually promote clusterings that agree poorly with the reference ones.This puts their validity/meaningfulness into question.
Overall, no single best MST-based method probably exists, but there is still some room for improvement, and thus the development of new algorithms is encouraged.In particular, the new divisive and agglomerative approaches we have proposed in this paper perform well on certain dataset types.Therefore, it might be promising to explore the many possible combinations of parameters/objective functions we have left out due to the obvious space constraints in this paper.
It would also be interesting to inspect the stability of the results when different random subsets of benchmark data are selected or study the problem of overlapping clusters (e.g., (Campagner, Ciucci, & Denoeux, 2023)).Also, the application of the MST-based algorithms could be examined in the problem of community detection in graphs (e.g., (Gerald, Zaatiti, Hajri, et al., 2023)).
Finally, let us recall that we have only focused on methods that are guaranteed to return a fixed-in-advance number of clusters, k. In the future, it would be interesting to allow for the relaxation of this constraint.

Table 1. Clustering methods studied (* denotes an algorithm not based on MSTs).

Table 2. Benchmark datasets studied (part I).

Fig. 3. The distribution of the adjusted asymmetric accuracies across the 77 benchmark cases (absolute AAA on the left-hand side and AAA relative to "Max All" on the right-hand side). "Max Obs." gives the maximal observed AAA based on the outputs of all the 140 methods; its counterparts for the MST and non-MST algorithms only are denoted by "Max Obs. MST" and "Max Obs. Non-MST". "Max MST" gives the theoretically achievable maxima of the accuracy scores for the MST-based methods. Moreover, "Max All" is the maximum of "Max MST" and "Max Obs.". Apart from a few "hard" datasets, the MST-based methods are potentially very competitive, despite their simplicity. They can be improved further by appropriate feature engineering.