Fast and explainable clustering based on sorting

We introduce a fast and explainable clustering method called CLASSIX. It consists of two phases, namely a greedy aggregation phase of the sorted data into groups of nearby data points, followed by the merging of groups into clusters. The algorithm is controlled by two scalar parameters, namely a distance parameter for the aggregation and another parameter controlling the minimal cluster size. Extensive experiments are conducted to give a comprehensive evaluation of the clustering performance on synthetic and real-world datasets, with various cluster shapes and low to high feature dimensionality. Our experiments demonstrate that CLASSIX competes with state-of-the-art clustering algorithms. The algorithm has linear space complexity and achieves near linear time complexity on a wide range of problems. Its inherent simplicity allows for the generation of intuitive explanations of the computed clusters.


Introduction
Clustering is a widely-used unsupervised learning technique to find patterns and structure in data. Clustering algorithms group the data points into distinct clusters such that points within a cluster share similar characteristics on the basis of their distance, density or other spatial properties, while points in two distinct clusters are less similar. Applications of clustering methods are wide-ranging, including areas like finance (Bensmail and DeGennaro, 2004), traffic (Erman, Arlitt and Mahanti, 2006), civil engineering (de Oliveira, Garrett and Soibelman, 2011), and bioinformatics (Song and Zhong, 2020). Clustering algorithms can be categorized in different ways, but for our purposes distance-based and density-based clustering methods are the most relevant. Distance-based clustering methods, such as k-means (Lloyd, 1982), consider the pair-wise distance between points in order to decide whether they should be contained in the same cluster. Density-based clustering algorithms, such as DBSCAN (Ester, Kriegel, Sander and Xu, 1996), take a more global view by assuming that data occurs in continuous areas of high density surrounded by low-density regions. A key advantage of many density-based clustering algorithms is that they can handle arbitrarily shaped clusters without specifying the number of clusters in advance. On the other hand, they typically require more parameter tuning.
We propose a novel clustering method called CLASSIX which shares features with both distance-based and density-based methods. The method comprises two phases: aggregation and merging. In the aggregation phase, data points are sorted along their first principal component (Hotelling, 1933) and then grouped using a greedy aggregation technique. The sorting is key to traversing the data with almost linear complexity, provided that the number of pairwise distance computations is small. While the initial sorting has an average-case complexity of $\mathcal{O}(n \log n)$, it is only performed on scalar values irrespective of the dimensionality of the data points. As a consequence, the cost of this initial sorting is almost negligible compared to computations on the full-dimensional data. As we will demonstrate experimentally, an almost linear complexity is indeed observed in practice. The aggregation phase is followed by the merging of overlapping groups into clusters using either a distance-based or a density-based criterion. The density-based merging criterion usually results in slightly better clusters than the distance-based criterion, but the latter has a significant speed advantage. CLASSIX depends on two parameters and its tuning is relatively straightforward. In brief, there is a distance parameter radius that serves as a tolerance for the grouping in the aggregation phase, while a minPts parameter specifies the smallest acceptable cluster size. This is similar to the parameters used in DBSCAN but, owing to the initial sorting of the data points, CLASSIX does not perform spatial range queries for each data point.
Our comparisons demonstrate that CLASSIX achieves excellent performance against popular methods like k-means++, mean shift, DBSCAN, HDBSCAN*, and Quickshift++ with respect to speed and several metrics of cluster quality. In addition, CLASSIX's inherent simplicity allows for the easy explainability of the clustering results. The Python code available at https://github.com/nla-group/classix implements CLASSIX with all these features, including the ability to query textual explanations for the clustering.
The rest of this paper is structured as follows. Section 2 reviews some of the existing popular clustering algorithms which share features with CLASSIX. Section 3 introduces the CLASSIX algorithm, starting with the greedy aggregation approach and a discussion of two different merging techniques. We also show how to treat outliers and out-of-sample data. Section 4 is devoted to theoretical considerations of the computational complexity of CLASSIX and how its performance depends on parameters such as the data dimension. Section 5 contains comprehensive performance tests of CLASSIX and comparisons with other popular clustering algorithms. Concluding remarks and a discussion of possible future work are contained in section 6. Finally, the Appendix gives an overview of using the Python implementation and a demonstration of the explainability feature.

Related work
CLASSIX shares features with at least two classes of clustering methods, namely distance-based and density-based methods. Here we briefly review some of the most important algorithms in these classes, highlighting the common features shared with CLASSIX and also their differences.
Two of the most popular distance-based clustering algorithms are k-means (Lloyd, 1982) and k-medoids (Kaufman and Rousseeuw, 1990, Chp. 2). These algorithms partition the data into $k$ clusters by iteratively relocating the cluster centers and reassigning data points to their nearest center. As a consequence, these distance-based methods tend to form clusters of spherical shapes and it is generally hard to cluster data with more complicated cluster shapes (Jain, 2010). In terms of computational complexity, it is rather hard to predict how slow or fast k-means performs (Arthur and Vassilvitskii, 2006), but variants like k-means++ (Arthur and Vassilvitskii, 2007), mini-batch k-means (Sculley, 2010), or the contribution in (Elkan, 2003) can significantly improve on the original algorithm in practice. Both k-means and k-medoids require the user to specify $k$, the expected number of clusters, and this is often not easy in practice.
Density-based clustering methods do not require a-priori knowledge of the number of clusters. Among these methods are the popular mean shift (Cheng, 1995; Comaniciu and Meer, 2002) and DBSCAN (Ester et al., 1996) algorithms. These methods rely on the property that the density of points at the "core" of a cluster is higher than in its surroundings. Mean shift performs clustering by iteratively shifting data points to nearby peaks (modes) of the empirical density function, using kernel density estimation. Data associated with the same mode are assigned to a cluster. Quick shift (Vedaldi and Soatto, 2008) provides a faster alternative to mean shift clustering using Euclidean medoids as shifts. The recently introduced Quickshift++ further improves clustering performance by first estimating the data modes and providing guarantees that data are assigned to their appropriate cluster mode (Jiang, Jang and Kpotufe, 2018).
The DBSCAN method essentially relies on two parameters, namely the neighborhood radius and a minimum number of data points in every neighborhood. The original implementation uses an R* tree data structure to perform range queries, and the performance of DBSCAN relies crucially on the efficient implementation of such a tree structure. Other trees can be used as well, such as the k-d tree or ball tree, but as can be seen in (Kriegel, Schubert and Zimek, 2017) the implementation of such structures is nontrivial. As pointed out by Gan and Tao (2015), DBSCAN has a worst-case time complexity of $\mathcal{O}(n^2)$ and a time complexity of $\Omega(n^{4/3})$ if the number of features is greater than three. This problem is often referred to as the curse of dimensionality (Indyk and Motwani, 1998; Har-Peled, Indyk and Motwani, 2012).
Many variants of DBSCAN have been proposed to mitigate the aforementioned issues. OPTICS computes an augmented cluster ordering over a range of parameters and thus provides a versatile basis for both automatic and interactive cluster analysis (Ankerst, Breunig, Kriegel and Sander, 1999; Hahsler, Piekenbrock and Doran, 2019). Gunawan (2013) proposes a DBSCAN variant for two-dimensional data which has a theoretical time complexity of $\mathcal{O}(n \log n)$. To fix the limitation left by (Gunawan, 2013), Gan and Tao (2015) introduce the concept of approximate DBSCAN with an expected linear time complexity $\mathcal{O}(n)$ for a small sacrifice in accuracy. However, Schubert, Sander, Ester, Kriegel and Xu (2017) claim that the original DBSCAN can be competitive with $\rho$-approximate DBSCAN for reasonable parameter choices and effective index structures. There are also parallel implementations like PDSDBSCAN (Patwary, Palsetia, Agrawal, Liao, Manne and Choudhary, 2012). APSCAN (Chen, Liu, Qiu and Lai, 2011) is a parameter-free clustering method that uses affinity propagation (Frey and Dueck, 2007) to infer the local density of a dataset. In such a way, APSCAN can cluster datasets with varying density and preserve the associated nonlinear data structure. HDBSCAN (Campello, Moulavi and Sander, 2013) and HDBSCAN* (Campello, Moulavi, Zimek and Sander, 2015), for simplicity just referred to as "HDBSCAN" here, extend DBSCAN to hierarchical clustering algorithms and then extract a flat clustering based on the stability of clusters. Again, this allows for variable-density clusters and, possibly, a more robust parameter selection. DBSCAN++ (Jang and Jiang, 2019) is a modification of DBSCAN based on the observation that density estimates need to be computed for merely a subset of $m \ll n$ data points, resulting in a speedup of DBSCAN while achieving similar clustering results. To select the $m$ data points, uniform and greedy $k$-center-based sampling approaches are proposed.
CLASSIX shares features with all of the aforementioned methods. CLASSIX begins with the greedy aggregation of the data into groups, each of which is identified with a so-called starting point. Similar to the neighborhood centers in DBSCAN, the starting points can be interpreted as a reduced-density estimator of the data. However, in CLASSIX the groups are also re-used to cluster the data, not just as estimators. CLASSIX's two-stage approach also shares similarities with hierarchical clustering methods like HDBSCAN. In the merging phase of CLASSIX, a distance-based criterion can be used, e.g., based on the Euclidean distance. However, as opposed to the distance-based methods discussed above, the merging phase of CLASSIX is non-iterative and does not require a specification of the number of clusters. Instead, the number of clusters is determined by the algorithm based on a radius parameter and the minimum number of data points in a cluster, minPts. While we use a similar notation for these parameters as DBSCAN, their meaning in CLASSIX is slightly different. The groups formed in CLASSIX are not necessarily spherical like the scanned neighborhoods in DBSCAN. Instead, they are formed from a starting point by traversing the data in the sorted order and only including points that have not been aggregated before. The minPts criterion is applied on the cluster level in CLASSIX, while in DBSCAN it is used to distinguish between noise, core, and boundary points.

The CLASSIX algorithm
In this section we introduce the CLASSIX algorithm, starting with a discussion of the initial preparation of the data (section 3.1), followed by a discussion of the aggregation (section 3.2) and merging (section 3.3) phases. The short sections 3.4 and 3.5 discuss the treatment of outliers and out-of-sample data, respectively.

Data preparation
Let us assume that we are given $n$ raw data points $\tilde{x}_1, \dots, \tilde{x}_n \in \mathbb{R}^d$ which we would like to cluster. Throughout this work all vectors are column vectors and we assume that $d \ll n$. The dimensionality $d$ is also referred to as the number of features. In the following, we will denote the processed data points as $x_1, \dots, x_n$.
As a first step, we center all data points by subtracting the empirical mean value of each feature:

$x_i := \tilde{x}_i - \frac{1}{n} \sum_{j=1}^{n} \tilde{x}_j.$

This operation has a complexity of $\mathcal{O}(dn)$, i.e., linear in $n$, and it can be performed in-place, meaning that the data points $\tilde{x}_i$ can be overwritten with the mean-centered values $x_i$ if memory consumption is an issue. The second step consists of computing the first principal component $v_1 \in \mathbb{R}^d$, i.e., the vector along which the data $\{x_i\}$ exhibits largest empirical variance. This vector can be computed by a thin singular value decomposition of the tall-skinny data matrix

$X = [x_1, \dots, x_n]^T = U \Sigma V^T, \quad (1)$

where $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{d \times d}$ have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_d)$ with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d \geq 0$. The principal components are given as the columns of $V = [v_1, \dots, v_d]$ and we require only the first column $v_1$. The score of a point $x_i$ along $v_1$ is

$\alpha_i := x_i^T v_1 = e_i^T X v_1 = \sigma_1 e_i^T u_1,$

where $e_i$ denotes the $i$-th canonical unit vector in $\mathbb{R}^n$. In other words, the scores $\alpha_i$ of all points can be read off from the first column of $U = [u_1, \dots, u_d]$ times $\sigma_1$. The computation of the scores using a thin SVD requires $\mathcal{O}(d^2 n)$ operations and is therefore linear in $n$ (Golub and Loan, 2013, chapter 8.6).
The third (and crucial) step is to order all data points $x_i$ by their $\alpha_i$ scores; that is, we relabel the points such that

$\alpha_1 \leq \alpha_2 \leq \cdots \leq \alpha_n.$

Note that $v_1$ is not uniquely determined even when $\sigma_1 > \sigma_2$ as $-v_1$ is also a principal component, but it does not matter if the order of the $x_i$ is reversed from largest to smallest. The sorting generally requires $\mathcal{O}(n \log n)$ operations on average, and at most $\mathcal{O}(n)$ memory, provided that an efficient algorithm such as Quicksort (Hoare, 1962), IntroSort (Musser, 1997) or TimSort (Auger, Jugé, Nicaud and Pivoteau, 2018) is used. It is important to highlight again that the sorting is performed on scalar values $\alpha_i$, irrespective of the data dimensionality $d$. This will result in a significant advantage in performance over tree-based query structures such as R* or ball tree, which work with the $d$-dimensional data points and are nontrivial to implement efficiently; see, e.g., the discussion in Kriegel et al. (2017).
Finally, we estimate the median extent of the data as the smallest number mext > 0 such that there are $\lceil n/2 \rceil$ data points with a score $\alpha_i$ in the interval $[-\texttt{mext}, \texttt{mext}]$. The use of this parameter is merely to make any distance computations independent of the data scale, ensuring that approximately half of the data points $x_i$ have a score $\alpha_i$ that lies within a radius mext of the origin. As the scalars $\alpha_i$ are already sorted, mext can be computed in $\mathcal{O}(n)$ operations. An alternative and better estimate is $\texttt{mext} := \mathrm{median}(\{\|x_i\|\})$, but this is slightly more expensive to compute as it involves $n$ Euclidean norm computations for a cost of $\mathcal{O}(dn)$ operations, and another sorting of scalar norm values costing $\mathcal{O}(n \log n)$.
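The preparation steps are straightforward to express in NumPy. The following sketch (our own illustration; the function name and return format are not part of the CLASSIX package) performs the mean-centering, computes the first principal coordinates via a thin SVD, sorts the data, and estimates mext:

```python
import numpy as np

def prepare(X_raw):
    """Center the data, compute first-PC scores, sort by them, estimate mext.
    X_raw is an (n, d) array of raw data points."""
    X = X_raw - X_raw.mean(axis=0)          # mean-center each feature, O(dn)
    # thin SVD of the tall-skinny data matrix: X = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = S[0] * U[:, 0]                 # alpha_i = x_i^T v_1 = sigma_1 * u_{i,1}
    order = np.argsort(scores)              # O(n log n) sort of scalar values only
    X, scores = X[order], scores[order]
    # smallest mext such that ceil(n/2) scores lie in [-mext, mext]
    n = len(X)
    mext = np.sort(np.abs(scores))[int(np.ceil(n / 2)) - 1]
    return X, scores, mext
```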
We note that throughout this work we will use the Euclidean norm $\|\cdot\|$ for any distance computations. If a weighted Euclidean norm needs to be used (giving different weight to different features), this weighting can simply be applied to the original data points $\{\tilde{x}_i\}$ before performing the follow-up steps as described above.

Aggregation phase
The aim of CLASSIX's aggregation phase is to group nearby data points together in a computationally inexpensive manner, where "nearby" is simply defined in terms of the Euclidean distance $\mathrm{dist}(x_i, x_j) = \|x_i - x_j\|$. We consider two data points $x_i$ and $x_j$ as nearby if

$\mathrm{dist}(x_i, x_j) \leq R := \texttt{radius} \cdot \texttt{mext}.$

Here, radius is a user-chosen tolerance and mext is the median extent estimated from the data, as explained in the previous section.
The greedy aggregation now proceeds as follows: starting with the first data point $x_1$ (i.e., the one with the smallest $\alpha_i$ value), we assign the data points $x_j$ for $j = 2, 3, \dots$ to the same group $G_1$ if $\mathrm{dist}(x_1, x_j) \leq R$. If done in a naive way, this would require computing all $n-1$ distances between the $x_j$ and $x_1$. However, we can utilize the Cauchy-Schwarz inequality to avoid doing this: as we have sorted the $x_i$ such that $\alpha_1 \leq \alpha_2 \leq \cdots \leq \alpha_n$, we have

$\mathrm{dist}(x_i, x_j) = \|x_i - x_j\| \geq |v_1^T (x_i - x_j)| = \alpha_j - \alpha_i \quad \text{for } i \leq j, \quad (2)$

and hence we know that if $\alpha_j - \alpha_i > R$ for some $j > i$, then $\mathrm{dist}(x_i, x_k) > R$ for all $k \geq j$.

Hence, we can terminate adding data points to the first group $G_1$ as soon as we encounter a data point $x_j$ such that $\alpha_j - \alpha_1 > R$. The point $x_1$ is also referred to as the starting point of group $G_1$.

Once the first group $G_1$ is completed, the assignment process continues with the first data point $x_i$ that is not part of $G_1$. This point $x_i$ will be the starting point of the next group $G_2$. (We emphasize that the starting points do not need to be searched for; they are always the first data point that was skipped when forming the previous group.) As before, we add unassigned data points $x_j$, $j > i$, to the group $G_2$ until we encounter a point such that $\alpha_j - \alpha_i > R$. And this process continues until all points are assigned to a group.
A pseudocode of the greedy aggregation algorithm is given in Algorithm 1. Once completed, all $n$ data points have been partitioned into groups $G_1, \dots, G_\ell$ with their corresponding starting points $s^{(1)}, \dots, s^{(\ell)}$.

Merging phase
The aim of this phase is to merge the groups $G_1, \dots, G_\ell$ into clusters $C_1, \dots, C_k$, $k \leq \ell$. See also Figure 1 for an illustration. We discuss two approaches for doing this, one based solely on the distance of the starting points $s^{(1)}, \dots, s^{(\ell)}$, and the other based on the density of the points in the groups and a volume of intersection.
Algorithm 1: Greedy aggregation of data points into groups
1. Sort the data points by their scores such that $\alpha_1 \leq \alpha_2 \leq \cdots \leq \alpha_n$. Label all of them as "unassigned".
2. If there are no unassigned points left, terminate. Otherwise, let $x_i$ be the first unassigned point (referred to as a "starting point") and assign it to a new group.
3. FOR $j = i+1, i+2, \dots, n$:
4.   Compute $\mathrm{dist}(x_i, x_j)$.
5.   IF $\mathrm{dist}(x_i, x_j) \leq R$ and $x_j$ is unassigned, assign $x_j$ to the same group as $x_i$.
6.   ELSE IF $\alpha_j - \alpha_i > R$, go to Step 2.
7. Go to Step 2.
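For concreteness, the following Python sketch implements Algorithm 1 directly (variable names are ours; the actual CLASSIX implementation is vectorized and more efficient):

```python
import numpy as np

def aggregate(X, scores, R):
    """Greedy aggregation (Algorithm 1). X is (n, d) and already sorted by scores.
    Returns per-point group labels and the indices of the starting points."""
    n = len(X)
    labels = np.full(n, -1)            # -1 marks "unassigned"
    starting_points = []
    for i in range(n):
        if labels[i] >= 0:
            continue                   # x_i is already assigned to a group
        group = len(starting_points)   # x_i becomes the next starting point
        starting_points.append(i)
        labels[i] = group
        for j in range(i + 1, n):
            if scores[j] - scores[i] > R:
                break                  # early termination (Step 6): no later point can qualify
            if labels[j] < 0 and np.linalg.norm(X[i] - X[j]) <= R:
                labels[j] = group      # Steps 4-5: distance check and assignment
    return labels, starting_points
```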

Distance-based merging
We first highlight that the points $x_i$ assigned to each group $G_j$ are still sorted by their first principal coordinates $\alpha_i$ and, as a consequence, so are the starting points; i.e., $\alpha^{(1)} \leq \cdots \leq \alpha^{(\ell)}$, where $\alpha^{(j)}$ denotes the score of the starting point $s^{(j)}$.
In the distance-based merging approach we will merge two groups $G_i$ and $G_j$ if the distance of their starting points satisfies

$\mathrm{dist}(s^{(i)}, s^{(j)}) \leq \texttt{scale} \cdot R, \quad (3)$

where scale is a tuning parameter in the interval $[1, 2]$, which can be set to scale = 1.5 in most cases. Note that two distinct starting points $s^{(i)}$ and $s^{(j)}$, $i < j$, will never have a distance that is less than $R$, as otherwise they would have ended up in the same group during aggregation. Also, if two starting points have a distance greater than $2R$, their surrounding $R$-balls have no overlap. Let us denote the $R$-ball surrounding a point $s$ as

$B(s, R) := \{x \in \mathbb{R}^d : \|x - s\| \leq R\}.$

The above requirement that scale ∈ [1, 2] ensures that there is a nonempty intersection of the balls $B(s^{(i)}, R)$ and $B(s^{(j)}, R)$.
To determine which pairs of starting points satisfy the merging criterion (3), we can again use the termination trick based on the corresponding principal coordinates. Given a starting point $s^{(i)}$, we inspect the starting points that succeed it, namely $s^{(i+1)}, s^{(i+2)}, \dots$ in that order. If any of these starting points is a distance of not more than $\texttt{scale} \cdot R$ away from $s^{(i)}$, it should be in the same cluster as $s^{(i)}$. As soon as we encounter a starting point $s^{(i+j)}$ with the property $\alpha^{(i+j)} - \alpha^{(i)} > \texttt{scale} \cdot R$, we can terminate the search.
Once all starting points $s^{(i)}$ have been traversed, we effectively have a list of pairs of starting points that satisfy the merging criterion (3). These pairs can be thought of as the edges of a directed graph. We then simply determine clusters of starting points by computing the connected components of this graph (or its undirected version) using, e.g., depth-first search (Hopcroft and Tarjan, 1973) or a disjoint-set data structure (Galler and Fisher, 1964). The complexity of this procedure is $\mathcal{O}(\max(\ell, E))$, where $E$ is the number of edges in the graph. All data points $x_i$ which are not starting points are naturally assigned to the same cluster as the starting point of the group they belong to.
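A sketch of the distance-based merging in Python, using a small disjoint-set structure for the connected components (labels and starting_points are the outputs of the aggregation sketch above):

```python
import numpy as np

def merge_distance(X, scores, labels, starting_points, R, scale=1.5):
    """Merge groups whose starting points are within scale*R of each other
    and return a cluster label for every data point."""
    m = len(starting_points)
    parent = list(range(m))
    def find(a):                              # disjoint-set find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(m):
        si = starting_points[i]
        for j in range(i + 1, m):
            sj = starting_points[j]
            if scores[sj] - scores[si] > scale * R:
                break                         # early termination along the sorted scores
            if np.linalg.norm(X[si] - X[sj]) <= scale * R:
                parent[find(j)] = find(i)     # union: groups i and j join one cluster
    roots = [find(g) for g in range(m)]
    cluster_id = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return np.array([cluster_id[roots[g]] for g in labels])
```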

Density-based merging
In density-based merging we will join two starting points $s^{(i)}$ and $s^{(j)}$ (and the points in their associated groups $G_i$ and $G_j$, respectively) into one cluster if the density of the points that lie in the intersection $B(s^{(i)}, R) \cap B(s^{(j)}, R)$ is at least as big as the overall density of the two groups. Using $|\cdot|$ to denote the cardinality of a set and $\mathrm{vol}(\cdot)$ to denote the volume of a set, this can be formalized as follows: $s^{(i)}$ and $s^{(j)}$ belong to the same cluster if

$\dfrac{\big|(G_i \cup G_j) \cap B(s^{(i)}, R) \cap B(s^{(j)}, R)\big|}{\mathrm{vol}\big(B(s^{(i)}, R) \cap B(s^{(j)}, R)\big)} \geq \dfrac{|G_i \cup G_j|}{\mathrm{vol}\big(B(s^{(i)}, R) \cup B(s^{(j)}, R)\big)}. \quad (4)$

The volume of the intersection of two $R$-balls in $\mathbb{R}^d$ whose centers are a distance $c \leq 2R$ apart is given by

$\mathrm{vol}\big(B(s^{(i)}, R) \cap B(s^{(j)}, R)\big) = \mathrm{vol}\big(B(s^{(i)}, R)\big) \cdot I_{1 - c^2/(4R^2)}\!\left(\frac{d+1}{2}, \frac{1}{2}\right),$

with the incomplete beta function defined as

$I_x(a, b) := \dfrac{\int_0^x t^{a-1} (1-t)^{b-1} \, dt}{\int_0^1 t^{a-1} (1-t)^{b-1} \, dt};$

see Li (2011). The volume of an $R$-ball is given as $\mathrm{vol}\big(B(s, R)\big) = \frac{\pi^{d/2}}{\Gamma(d/2+1)} R^d$, from which the volume of the union of two $R$-balls is easily obtained. Two $R$-balls can only have a nonempty overlap if their centers are not more than $2R$ apart. Hence we can again limit the number of pairs $G_i$ and $G_j$, $i < j$, for which to check the merging criterion (4) by using the principal coordinate-based criterion: if $\alpha^{(j)} - \alpha^{(i)} > 2R$, the groups $G_i$ and $G_j$ do not need to be inspected for intersection.
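These volume formulas are easy to evaluate numerically, e.g. with SciPy's regularized incomplete beta function (this snippet is our own illustration, not part of the CLASSIX package):

```python
import numpy as np
from scipy.special import betainc, gamma

def ball_volume(d, R):
    """Volume of a d-dimensional ball of radius R."""
    return np.pi ** (d / 2) / gamma(d / 2 + 1) * R ** d

def intersection_volume(d, R, c):
    """Volume of the intersection of two d-balls of radius R whose centers
    are a distance c apart (zero if c >= 2R); see Li (2011)."""
    if c >= 2 * R:
        return 0.0
    return ball_volume(d, R) * betainc((d + 1) / 2, 0.5, 1 - c**2 / (4 * R**2))

# the relative intersection volume shrinks quickly with the dimension d
for d in (2, 5, 10, 20):
    print(d, intersection_volume(d, 1.0, 1.0) / ball_volume(d, 1.0))
```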
As in the distance-based case, the set of all pairs of starting points that satisfy (4) can be thought of as the edges of a graph, and we determine clusters of starting points by computing the connected components of this graph. All data points $x_i$ which are not starting points are naturally assigned to the same cluster as the starting point of the group they belong to.

Remark:
As is well known, the volume of an $R$-ball with fixed radius tends to zero as the dimension $d$ increases. Similarly, the volume of any intersection of two balls tends to zero, and even the relative volume of the intersection of two balls (i.e., the ratio of the intersection volume and the volume of a ball) tends to zero unless both balls have the same centers. As a consequence, density-based clustering is only suitable for low feature dimensions.

Treatment of outliers
After all groups have been merged into $k$ clusters $C_1, \dots, C_k$, we optionally apply a removal process for outliers. In CLASSIX, outliers are simply defined as the points in clusters that have a small number of elements, namely clusters such that $|C_i| < \texttt{minPts}$. There are two options for processing outliers.
• Reassign outliers to larger clusters: All points in a cluster $C_i$ with less than minPts points will be assigned to larger nearby clusters. This is done on the group level using the starting points. Every group $G_j$ that was merged into $C_i$ has an associated starting point $s^{(j)}$. We simply find a starting point $s^{(j')}$ nearest to $s^{(j)}$ that belongs to a cluster $C_{i'}$ with at least minPts points, and reassign $G_j$ to cluster $C_{i'}$. This procedure will succeed in eliminating all outliers if there is at least one cluster with minPts or more points. Otherwise, all clusters will remain outliers. (A sketch of this reassignment follows below.)
• Return outliers separately: All points in clusters with less than minPts points will be labelled as outliers and returned as such to the user.
We illustrate these two options applied to an example with synthetic data in Figure 2.
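The reassignment option can be sketched as follows (array names follow the earlier sketches; cluster_of_group is a NumPy array mapping each group to its cluster):

```python
import numpy as np

def reassign_outliers(X, starting_points, labels, cluster_of_group, minPts):
    """Move every group that belongs to a small cluster (< minPts points) to the
    cluster of the nearest starting point lying in a sufficiently large cluster."""
    cluster_of_point = cluster_of_group[labels]
    sizes = np.bincount(cluster_of_point)          # number of points per cluster
    big = sizes[cluster_of_group] >= minPts        # per group: is its cluster large enough?
    if not big.any():
        return cluster_of_group                    # nothing to reassign to
    S = X[starting_points]                         # coordinates of the starting points
    out = cluster_of_group.copy()
    for g in np.where(~big)[0]:
        d = np.linalg.norm(S[big] - S[g], axis=1)  # distances to "large" starting points
        out[g] = cluster_of_group[np.where(big)[0][np.argmin(d)]]
    return out
```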

Out-of-sample data
The reassignment strategy for outliers described in section 3.4 can also be used to classify out-of-sample data based on a previously trained model. That is, CLASSIX will allocate a new data point to its closest group (measured by the distance to the starting points) and then assign it to the associated cluster. This is illustrated in Figure 3. In this example, CLASSIX is first run on a train set (90% of the overall data points) to obtain the clusters, and then tested on the remaining 10% of the data (left vs right plots). Both the train and test data are clustered with 100% accuracy when compared to the provided ground truth (top vs bottom plots).
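In code, classifying out-of-sample points is a nearest-starting-point query (a plain NumPy sketch; the CLASSIX package exposes this through its own prediction interface):

```python
import numpy as np

def predict(X_new, S, cluster_of_group):
    """Assign each new point to the cluster of its nearest starting point.
    S is the array of starting points of the trained model."""
    d2 = ((X_new[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return cluster_of_group[np.argmin(d2, axis=1)]
```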

Efficiency
The efficiency of the aggregation phase crucially depends on the number of pairwise distance calculations that are performed in Step 4 of Algorithm 1. In the best case, each point $x_j$ is involved in exactly one such computation, resulting in $\mathcal{O}(n)$ distance computations overall. In the worst case, the distances between each point $x_i$ and all points $x_{i+1}, \dots, x_n$ that succeed it are computed. This amounts to $n(n-1)/2$ distance computations, an unacceptably high number in most practical situations. The hope is that through the sorting of the data points $x_i$ by their principal coordinates $\alpha_i$, the early termination criterion in Step 6 of Algorithm 1 is triggered often enough so that, on average, the number of distance computations a data point is involved in remains very small. An unassigned data point $x_j$ is involved in more than one distance computation only if there is a starting point $x_i$, $i < j$, such that $\alpha_j - \alpha_i \leq R$ but $\mathrm{dist}(x_i, x_j) > R$. So, it is natural to ask the following question: how likely is it that two points have nearby principal coordinates while their full distance is large?

We first note that, using the singular value decomposition (1) of the data matrix $X$, we can derive an upper bound on $\mathrm{dist}(x_i, x_j)$ that complements the lower bound (2). Using that $x_i^T = e_i^T X = e_i^T U \Sigma V^T$ and denoting the elements of $U$ by $u_{i,j}$, we have

$\alpha_j - \alpha_i \leq \mathrm{dist}(x_i, x_j) \leq \sqrt{(\alpha_j - \alpha_i)^2 + \sigma_2^2 \sum_{k=2}^{d} (u_{i,k} - u_{j,k})^2}, \quad (5)$

and the gap in these inequalities is determined by $\sigma_2$, the second singular value of $X$. Indeed, if $\sigma_2 = 0$, then all data points $x_i$ lie on a straight line passing through the origin and their distances correspond exactly to the difference in their principal coordinates. This is a best-case scenario for Algorithm 1, as points $x_j$ will just be added in their sorted order to a group until the distance from the starting point $x_i$ exceeds $R$, in which case that $x_j$ becomes the next starting point and a new group is formed. As no point needs to be revisited, the number of pairwise distance computations is the smallest possible. If, on the other hand, $\sigma_2$ is relatively large compared to $\sigma_1$, the gap in the inequalities becomes large as well and the difference $\alpha_j - \alpha_i$ may not be a good proxy for the point distance $\mathrm{dist}(x_i, x_j)$.

Parameter dependency
It is clear that there is no natural order of data points in two or higher dimensions, and it is expected that some data points will be involved in more than one distance computation. In order to get a qualitative understanding of how the number of distance computations depends on the various parameters (dimension $d$, singular values of the data matrix, group radius $R$, etc.), we consider the following model. Let $\{x_i\}_{i=1}^n$ be a large sample of points whose $d$ components are normally distributed with zero mean and standard deviations $[1, \sigma, \dots, \sigma]$, $\sigma < 1$, respectively. These points describe an elongated "Gaussian blob" in $\mathbb{R}^d$, with the elongation controlled by $\sigma$. In the large data limit ($n \to \infty$) the singular values of the data matrix $X = [x_1, \dots, x_n]^T$ approach $\sqrt{n}, \sigma\sqrt{n}, \dots, \sigma\sqrt{n}$ and the principal components approach the canonical unit vectors $e_1, e_2, \dots, e_d$. As a consequence, the principal coordinates $\alpha_i = e_1^T x_i$ follow a standard normal distribution, and hence for any $c \in \mathbb{R}$ the probability that $|\alpha_i - c| \leq R$ is given as

$p_1 = \frac{1}{\sqrt{2\pi}} \int_{c-R}^{c+R} e^{-t^2/2} \, dt.$

On the other hand, the probability that $\|x_i - [c, 0, \dots, 0]^T\| \leq R$ is given by

$p_2 = \frac{1}{\sqrt{2\pi}} \int_{c-R}^{c+R} e^{-t^2/2} \, F\!\left(\frac{R^2 - (t-c)^2}{\sigma^2}, \, d-1\right) dt,$

where $F(\cdot, d-1)$ denotes the $\chi^2$ cumulative distribution function with $d-1$ degrees of freedom. In this model we can think of the point $[c, 0, \dots, 0]^T$ as a starting point, and our aim is to identify all data points $x_i$ within a radius $R$ of this starting point. Since $\|x_i - [c, 0, \dots, 0]^T\| \leq R$ implies $|\alpha_i - c| \leq R$, we have $p_2 \leq p_1$. Hence, the quotient $p_2/p_1$ can be interpreted as the conditional probability of a point $x_i$ satisfying $\|x_i - [c, 0, \dots, 0]^T\| \leq R$ given that $|\alpha_i - c| \leq R$. Ideally, we would like this quotient to be close to 1. As all the points $x_i$ are sorted by their first principal coordinates $\alpha_i$, all points within a radius of $R$ from $[c, 0, \dots, 0]^T$ (belonging to the same group) are very likely to have consecutive indices if $p_2/p_1 \approx 1$. On the other hand, if $p_2/p_1$ is very small, then the sorting has very little effect of indexing nearby points close to each other.
It is now easy to study the dependence of the quotient $q = p_2/p_1$ on the various parameters. First note that $p_1$ depends on neither $\sigma$ nor $d$, and hence the only effect these two parameters have on $q$ is via the factor $F\big((R^2 - (t-c)^2)/\sigma^2, \, d-1\big)$ in the integrand of $p_2$. This term corresponds to the probability that the sum of squares of $d-1$ independent Gaussian random variables with mean zero and standard deviation $\sigma$ is less than or equal to $R^2 - (t-c)^2$. Hence, $p_2$ and therefore $q$ are monotonically decreasing as $\sigma$ or $d$ are increasing. This is consistent with intuition: as $\sigma$ increases, the elongated point cloud $\{x_i\}$ becomes more spherical and hence it gets more difficult to find a direction in which to enumerate the points naturally. And this problem gets more pronounced in higher dimensions $d$.
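The quotient $q = p_2/p_1$ is straightforward to evaluate numerically, e.g. as follows (our own illustration using SciPy, not part of the CLASSIX package):

```python
from scipy.integrate import quad
from scipy.stats import chi2, norm

def quotient(c, R, sigma, d):
    """q = p2/p1 for a starting point [c, 0, ..., 0] in the Gaussian model."""
    p1 = norm.cdf(c + R) - norm.cdf(c - R)
    integrand = lambda t: norm.pdf(t) * chi2.cdf((R**2 - (t - c)**2) / sigma**2, d - 1)
    p2, _ = quad(integrand, c - R, c + R)
    return p2 / p1

# q decreases as sigma or d grow, and approaches 1 for large R
print(quotient(c=0.6, R=0.5, sigma=0.3, d=2))
print(quotient(c=0.6, R=3.0, sigma=0.3, d=2))
```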
We now show that $q$ converges to 1 as $R$ increases. First note that for an arbitrarily small $\epsilon > 0$ there exists a radius $R_1$ such that

$\frac{1}{\sqrt{2\pi}} \int_{c-R_1}^{c+R_1} e^{-t^2/2} \, dt > 1 - \epsilon.$

Similarly, the factor $F\big((R^2 - (t-c)^2)/\sigma^2, d-1\big)$ in the integrand of $p_2$ can be made larger than $1 - \epsilon$ for all $t \in [c - R_1, c + R_1]$ by choosing $R$ large enough. To see this, note that the cumulative distribution function $F$ monotonically increases from 0 to 1 as its first argument increases from 0 to $\infty$. Hence there exists a value $y_\epsilon$ for which $F(y, d-1) > 1 - \epsilon$ for all $y \geq y_\epsilon$. Now we just need to find $R_2$ such that

$\frac{R_2^2 - (t-c)^2}{\sigma^2} \geq y_\epsilon \quad \text{for all } t \in [c - R_1, c + R_1].$

The left-hand side is a quadratic function with roots at $t = c - R_2$ and $t = c + R_2$ and symmetric with respect to the maximum at $t = c$. Hence choosing $R_2$ such that $R_2^2 = \sigma^2 y_\epsilon + R_1^2$, or any value $R_2$ larger than that, will be sufficient. Now, setting $R = \max\{R_1, R_2\}$, we have

$p_2 \geq \frac{1 - \epsilon}{\sqrt{2\pi}} \int_{c-R_1}^{c+R_1} e^{-t^2/2} \, dt > (1 - \epsilon)^2.$

Hence, both $p_1$ and $p_2$ come arbitrarily close to 1 as $R$ increases, and so does their quotient $p_2/p_1$. We illustrate our findings in Figure 4. The model just analyzed only gives qualitative insights as it does not take into account that our aggregation algorithm (i) uses starting points that are not necessarily at the center of a group and (ii) removes points once they have been assigned to a group. Nevertheless, we can read off from Figure 4 that aggregation generally gets harder in higher dimensions and when $R$ gets smaller.

Experiments
In this section we conduct extensive experiments on various datasets to evaluate the performance of CLASSIX and compare it to other clustering algorithms. The quality of a clustering will be evaluated using metrics like the adjusted Rand index (ARI) and the adjusted mutual information (AMI); see (Hubert and Arabie, 1985) and (Vinh, Epps and Bailey, 2010), respectively. For the fairest possible comparison of timings we select widely-used clustering algorithms for which reliable and tuned implementations are publicly available. In particular, we use Cython-compiled implementations of k-means++, DBSCAN, HDBSCAN, and Quickshift++. For the HDBSCAN and Quickshift++ implementations we refer to (McInnes and Healy, 2017) and (Jiang et al., 2018), respectively, while the other methods are implemented in the scikit-learn library (Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay, 2011). All reported runtimes are averaged timings of ten consecutive runs. For the reproducibility of our results, all datasets are chosen from publicly available sources or generated with open-source software. All computations are done on a Dell PowerEdge R740 server with two Intel Xeon Silver 4114 2.2G processors, 1,536 GB RAM, and 1.8 TB disk space. All algorithms are forced to run in one thread.

Gaussian blobs
In this first test we study the sensitivity of several clustering algorithms to the data size $n$ and the feature dimensionality $d$. To this end we generate $n$ data points in $d$ dimensions forming ten isotropic Gaussian blobs of unit standard deviation using the sklearn.datasets.make_blobs function in scikit-learn.
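The data generation and a single CLASSIX run can be sketched as follows (the radius and minPts values match the next subsection; the ARI call is from scikit-learn):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from classix import CLASSIX

# ten isotropic Gaussian blobs of unit standard deviation
X, y = make_blobs(n_samples=10000, n_features=10, centers=10,
                  cluster_std=1.0, random_state=0)

clx = CLASSIX(radius=0.3, minPts=5, group_merging='distance', verbose=0)
clx.fit(X)
print(adjusted_rand_score(y, clx.labels_))   # close to 1 for this easy data
```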

Dependence on data size
We compare k-means++, DBSCAN (using the index query structures "ball tree" and "k-d tree"), HDBSCAN, Quickshift++, and CLASSIX. The data are $n$ points forming ten isotropic Gaussian blobs of unit standard deviation with $n$ varying between 5,000 and 50,000 while the feature dimension $d = 10$ is fixed. The methods' hyperparameters are chosen as follows. For CLASSIX, we set radius = 0.3 and minPts = 5. For DBSCAN, we use $\epsilon = 3$ and minPts = 1. Quickshift++ is run with $k = 20$ and $\beta = 0.7$. Finally, k-means++ is run with the ground-truth value of $k = 10$ clusters. The results are presented in the top two plots of Figure 5, with the ARI shown on the left and the runtime shown on the right. All the curves are plotted with a 95% confidence band estimated from the ten consecutive runs of each method.
We find that all compared methods obtain almost perfect clustering results in this test as judged from an ARI score close to one; see Figure 5 (top left). In terms of timings, both k-means++ and CLASSIX achieve very good performance, with the overall computation time increasing approximately linearly as $n$ increases; see Figure 5 (top right). The other methods exhibit a nonlinear, possibly quadratic, increase in runtime.
The slow, seemingly linear, increase in runtime of CLASSIX is owed to the small number of distance computations between data points due to the early termination condition discussed in section 3.2. To confirm this, we show in Figure 5 (bottom) the average number of distance calculations per data point (avg_dist_pp) as $n$ increases, both with and without the early termination condition. We find that early termination leads to a significant reduction in the number of distance calculations. In fact, for this data avg_dist_pp appears to stay bounded with early termination as $n$ increases.

Dependence on data dimension
We now use a fixed number of $n = 10{,}000$ samples of ten Gaussian blobs and vary the feature dimension from $d = 10$ to $100$. The methods' hyperparameters are the same as in the previous section, except that for DBSCAN we now use $\epsilon = 10$. This change of $\epsilon$ is in DBSCAN's favour, to delay a very sharp decline in ARI for increasing dimensions observed with $\epsilon = 3$.
The results are shown in Figure 6. In the plot on the top left we observe an ARI degradation of DBSCAN and Quickshift++ as $d$ increases. The hyperparameters of these two methods appear to be rather sensitive to the feature dimension. The other methods, including HDBSCAN and CLASSIX, perform robustly across all dimensions without requiring hyperparameter adjustment. In terms of timings, we again find that k-means++ and CLASSIX exhibit the best performance. For this data, the runtimes of all methods except HDBSCAN appear to scale linearly in the feature dimension $d$.
In the bottom of Figure 6 we again show the average number of distance calculations per data point (avg_dist_pp) as the feature dimension $d$ increases. We find that, both with and without early termination, avg_dist_pp decreases as $d$ increases. This is to be expected as a Gaussian blob with unit standard deviation becomes more "focused" in higher dimensions. Clustering a more focused point cloud with a fixed aggregation radius $R$ is equivalent to clustering a constant point cloud with increasing radius $R$. In line with the analysis of the Gaussian model in section 4.2, we indeed expect the number of distance calculations to decrease in higher dimensions.

Sensitivity to parameters
We now focus our attention on the sensitivity of CLASSIX with respect to its radius parameter. Figure 7 shows the ARI and AMI scores of clustering with radius varying between 0.1 and 1.0. The data forms ten Gaussian blobs in $d = 10$ dimensions and the number of data points $n$ is varied from 1000 to 9000 in steps of 1000 (for each plot from left to right and top to bottom). We show the ARI and AMI scores for both distance-based and density-based CLASSIX clustering. We observe that the method is not very sensitive to the radius parameter. As long as this parameter is chosen within the interval 0.2 to 0.6, good to very good clustering results are obtained.

UCI Machine Learning Repository
We now compare CLASSIX with DBSCAN, HDBSCAN, and Quickshift++ on a number of commonly-used real-world datasets listed in Table 1. All of these datasets except "Phoneme" and "Wine" can be found in the UCI Machine Learning Repository (Dua and Graff, 2017).
We preprocess the data by removing rows with missing values and z-normalising the features (i.e., shifting each feature to zero mean and scaling it to unit variance). We then perform a hyperparameter search by measuring the ARI and AMI scores while varying the main parameters of each method as shown in Figure 8 and its caption. For each method and dataset we use the respective best hyperparameters and report the achieved ARI and AMI scores in Table 2. The average scores across all datasets shown at the bottom of Table 2 indicate that CLASSIX achieves a good performance compared to the other methods, in particular when density-based merging is used. The only exception is the "Phoneme" dataset, where CLASSIX with distance-based merging significantly outperforms density-based merging. This might be explained by the rather high feature dimension of this dataset; see also the remark at the end of section 3.3.2.
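The preprocessing and radius sweep can be sketched as follows (dataset loading is omitted and the grid of radii is our own choice for illustration):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
from classix import CLASSIX

def best_radius(X, y, radii, minPts=5):
    """z-normalise the features, then sweep the radius parameter of CLASSIX."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance
    best = None
    for r in radii:
        clx = CLASSIX(radius=r, minPts=minPts, group_merging='density', verbose=0)
        clx.fit(X)
        ari = adjusted_rand_score(y, clx.labels_)
        ami = adjusted_mutual_info_score(y, clx.labels_)
        if best is None or ari > best[0]:
            best = (ari, ami, r)
    return best   # (best ARI, its AMI, best radius)

# e.g.: best_radius(X, y, np.arange(0.1, 1.01, 0.05))
```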

Scikit-learn benchmark
The scikit-learn library contains a benchmark with six datasets and a number of clustering methods such as BIRCH (Zhang, Ramakrishnan and Livny, 1996), spectral clustering (Shi and Malik, 2000; Yu and Shi, 2003), and Gaussian mixture models (Murphy, 2012, Chap. 11). The library provides preselected parameters for all of these methods and we have left them unchanged. We added HDBSCAN, Quickshift++, and CLASSIX with distance-based and density-based merging to the benchmark and tuned these methods by grid search to find the best hyperparameters in terms of achieved ARI and AMI scores.
The clusterings computed by each method are visualized in Figure 9. On all five datasets, CLASSIX produces very good clusters. Among the other methods, Quickshift++ stands out with comparable performance. The figure also shows the required computation time in the bottom right of each plot, and for CLASSIX the average number of distance computations per data point in the bottom left. Most methods can be considered fast, with just a few milliseconds of computation time required, but CLASSIX consistently performs fastest here. This is explained by the small number of distance computations that are required, on average never exceeding 5.47 comparisons per data point in these five tests. The clustering performance measured in terms of ARI and AMI scores is shown in Table 3. Both Quickshift++ and CLASSIX stand out as the best performing methods here.

Shape clusters test
This benchmark involves eight synthetic datasets with varying shapes that are challenging for clustering algorithms. The properties of each dataset are detailed in Table 4. Before clustering this data, we z-normalise each of the features. We compare k-means++, mean shift, DBSCAN, HDBSCAN, Quickshift++, CLASSIX (distance-based), and CLASSIX (density-based). For k-means++ we specify the ground-truth number of clusters $k$, while the hyperparameters of the other methods are again optimized by grid search. The clustering results are visualized in Figure 10, together with the required computation time (bottom right of each plot) and, for CLASSIX, the average number of distance computations per data point (bottom left of the CLASSIX plots). We find that CLASSIX with density-based merging produces slightly better clusters than distance-based merging, but at the expense of a higher computation time.
The measured AMI and ARI scores are listed in Table 5. Here we see a more mixed picture, with algorithms performing rather differently depending on the dataset. Clearly, in view of Figure 10, ARI and AMI scores do not always provide a good measure of what would be considered a "good clustering" in human perception. Still, on average, CLASSIX with density-based merging appears to perform best across all datasets, closely followed by CLASSIX (distance-based), Quickshift++, DBSCAN, and HDBSCAN (in that order).

Facial image classification
Most often, image classification is done by supervised learning techniques such as deep convolutional neural networks (Krizhevsky, Sutskever and Hinton, 2012; Simonyan, Vedaldi and Zisserman, 2014) instead of unsupervised learning techniques. Nevertheless, there can be benefit in using unsupervised clustering algorithms in situations where labels are not provided.
We apply CLASSIX to classify images in the Olivetti face dataset. The dataset contains 400 images of 40 distinct faces with variations in lighting, facial expressions (e.g., smiling or not), and other facial details (e.g., wearing glasses or not). Following the example in (Rodriguez and Laio, 2014), we select ten variants of ten faces from the dataset. The images are resized to 100 × 100 pixels and represented as $d = 100^2$-dimensional feature vectors containing the gray-scale values. We run distance-based CLASSIX with radius = 0.6 and minPts = 7 to cluster these faces into groups of similar images, hopefully corresponding to the same person. The results are shown in Figure 11 with different colours corresponding to different clusters. Images shown in gray-scale are outliers, i.e., not assigned to any cluster. We see from Figure 11 that CLASSIX classifies 77 of the 100 images into 7 clusters. Six of these clusters contain ten images corresponding to the correct classes, i.e., these 60 images (60% of all) are perfectly classified. The seventh cluster mixes 8 images of one face with 9 images of another similar-looking face. We may distinguish two types of misclassification as follows: type (A) where images of a unique face are assigned to different clusters, and type (B) where images showing different faces are assigned to the same cluster. Using these types, CLASSIX produces no type (A) misclassifications on this data, but two faces have been type (B) misclassified. For comparison, the method by Rodriguez and Laio (2014) classifies 41 out of the 100 images into 9 clusters. Of these, 33 images (33% of all) are perfectly classified into 7 clusters. The remaining 8 images correspond to the same face and have been type (B) misclassified into two clusters. There are no type (A) misclassifications.

Image segmentation
Image segmentation is a technique to partition an image into regions sharing similar characteristics, usually for the purpose of identifying objects. Here we demonstrate an application of CLASSIX to cluster images on a pixel-by-pixel level. To this end we select five images from the COCO 2017 dataset (Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick, 2014), as shown in Figure 12. Each RGB image of size 100-by-100 pixels is converted into a data array of shape $[100^2, 3]$, and we then cluster these $n = 100^2$ data points of dimension $d = 3$ using DBSCAN, HDBSCAN, Quickshift++, and CLASSIX with distance-based merging. After the clusters are formed, each pixel feature is replaced by the mean of all points in the cluster it is associated with, resulting in the segmented image.
It is very challenging to choose the parameters of all these methods so that the number of resulting clusters is fairly comparable. Here we have used radius = 0.15 and scale = 1.05 for CLASSIX and then manually tuned minPts and the hyperparameters of the other clustering algorithms to produce a comparable, preferably slightly higher, number of clusters. As can be seen in Figure 12, CLASSIX is by far the fastest method in this test, producing reconstructions of high visual quality compared to the other methods despite using the smallest number of clusters. The most challenging image (Zebra) required about 0.33 s to segment, with an average number of distance calculations per data point of 28.84.
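A sketch of the per-pixel pipeline (image loading and resizing omitted; the scale value of 1.05 is set through the package's merging-scale option, whose exact name may differ between versions):

```python
import numpy as np
from classix import CLASSIX

def segment(img, radius=0.15, minPts=5):
    """img is an RGB array of shape (100, 100, 3); returns the segmented image
    in which every pixel is replaced by the mean colour of its cluster."""
    X = img.reshape(-1, 3).astype(float)     # n = 100^2 data points, d = 3
    clx = CLASSIX(radius=radius, minPts=minPts, group_merging='distance', verbose=0)
    clx.fit(X)
    out = np.empty_like(X)
    for c in np.unique(clx.labels_):
        mask = clx.labels_ == c
        out[mask] = X[mask].mean(axis=0)     # replace pixels by their cluster mean
    return out.reshape(img.shape)
```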

Summary and future work
We have introduced a fast clustering algorithm based on the sorting of the data points by their first principal coordinate. CLASSIX's key component is the fast aggregation of nearby data points into groups, exploiting the sorted order to reduce the number of pairwise distance computations. The simplicity of the aggregation and merging phases allows for clustering results to be explainable, a property that we demonstrate in the Appendix.
We have tested the clustering efficiency in a number of experiments, ranging from low to high-dimensional data. Apart from the analysis of the simple Gaussian model in section 4.2, we currently do not have a good theoretical understanding of why sorting helps even for high-dimensional "real-world" datasets such as those contained in the UCI Machine Learning Repository. While it is known that "big data" matrices often have low numerical rank (Udell and Townsend, 2019), we usually find that the second singular value of the data matrix is still significant in comparison to the first. Hence, bounds like (5) cannot explain why data points can be aggregated with such a small number of pairwise distance computations as we have seen in the experiments. Further work will be needed to understand this inherent "sortability" of big data.
As CLASSIX is a direct, non-iterative clustering method, it is easy to explain its outputs. In its aggregation phase, data points are grouped together when their distance from a starting point does not exceed radius. The subsequent merging phase is also fully deterministic and can be explained by distances between starting points (in distance-based merging) or by comparing densities of the groups (in density-based merging). The whole clustering procedure can hence be tracked and summarised in a finite number of steps.
Below we provide a demo of CLASSIX and its explainability feature.In order to run the included Python code, CLASSIX needs to be installed via:

pip install ClassixClustering
We consider a large dataset relating to sensor readings in a vacuum distillation unit (Oster, Güttel, Shapiro, Chen and Jobson, 2021), which has been included in the CLASSIX module for convenience. It contains $n = 2{,}028{,}780$ data points and is difficult or even impossible to cluster for many density-based algorithms. Here we apply CLASSIX with radius = 1, minPts = 0, and distance-based group merging (the latter two are default values and could be omitted from the call). CLASSIX completes the task in about 1.2 seconds, returning 4 well separated clusters. We can get a more detailed overview of the computed groups and clusters by calling the explain method, as sketched below.
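The following calls sketch this run (loadData is the data loader bundled with the package; minPts = 0 and distance-based merging are the defaults and could be omitted), ending with the explain call whose output is shown next:

```python
from classix import CLASSIX, loadData

data = loadData('vdu_signals')      # n = 2,028,780 points with 2 features
clx = CLASSIX(radius=1, minPts=0, group_merging='distance')
clx.fit(data)                       # completes in about 1.2 seconds

clx.explain(plot=True)
```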

Output:
A clustering of 2028780 data points with 2 features has been performed. The radius parameter was set to 1.00 and MinPts was set to 0. As the provided data has been scaled by a factor of 1/2.46, data points within a radius of R=1.00*2.46=2.46 were aggregated into groups. In total 3920623 comparisons were required (1.93 comparisons per data point). This resulted in 36 groups, each uniquely associated with a starting point. These 36 groups were subsequently merged into 4 clusters. A list of all starting points is shown below.
The above plot of the data is produced using the first two principal coordinates. The indices shown in the red rectangles refer to the starting points. If we want to understand the cluster assignment of a specific data point, we can do so by passing the index of that point to the explain method (here, clx.explain(17070)):

The data point 17070 is in group 11, which has been merged into cluster #3.
Similarly, CLASSIX can provide an explanation for why two data points ended up in the same cluster, or why they are in different clusters. In the below example, the shortest path of groups connecting two data points is computed using Dijkstra's algorithm (Dijkstra, 1959); the call clx.explain(10000, 16908) returns:

The data point 10000 is in group 0 and the data point 16908 is in group 9, both of which were merged into cluster #1. These two groups are connected via groups 0 <-> 2 <-> 3 <-> 4 <-> 5 <-> 8 <-> 9.

Figure 1 :
Figure 1: Illustration of a point cloud aggregated into $\ell = 6$ groups (dashed circles) and merged into $k = 3$ clusters (labelled by colour). The data points are ordered along the direction of the first principal component indicated by the arrow, and the six starting points of the groups (which are also data points) follow that same order.

Figure 2 :
Figure 2: Demonstration of the outlier treatment. We apply CLASSIX to synthetic data with $n = 2500$ points using the parameters radius = 0.1 and minPts = 5. Left: Without outlier reassignment, 27 clusters are identified, four of which are genuine ground-truth clusters (shown in colour) and each of the other 23 clusters has less than minPts points (shown as the black dots). Right: The reassignment of the 23 outlier clusters results in four clusters of good quality.

Figure 3 :
Figure 3: Illustration of CLASSIX clustering with out-of-sample data. Top: Ground truth labels of the train and test data (left and right, respectively). Bottom: CLASSIX clustering of the train and test data (left and right, respectively). The train vs test split is 90% vs 10%.

Figure 4 :
Figure 4: Illustration of the model analyzed in section 4.2. Top: A Gaussian blob of $n = 10^4$ points (red dots) in $d = 2$ dimensions, with the horizontal components having a standard deviation of 1 and the vertical components having a standard deviation of $\sigma = 0.3$. The radius for the aggregation is chosen as $R = 0.5$. The blue dots are all points whose first component (first principal coordinate) is within a distance of $R$ from $c = 0.6$, while the green points are within a distance of $R$ from the group center $[0.6, 0]^T$ (shown as the black dot). Bottom left: The quotient $p_2/p_1$ corresponding to the ratio of green and blue points as a function of $\sigma$ and $R$. A ratio closer to 1 means that sorting-based aggregation is more efficient, with fewer points being revisited. Bottom right: $p_2/p_1$ as a function of $d$ and $R$.

Figure 5 :
Figure 5: Performance evaluation of various clustering methods on the Gaussian blobs data. Top: We vary the number of data points from $n = 5{,}000$ to $50{,}000$ and measure the adjusted Rand index (left) and timing (right) for each method. The feature dimension is fixed at $d = 10$ in all cases. Bottom: Average number of distance calculations per data point in the CLASSIX aggregation phase with and without early termination.

Figure 6 :
Figure 6: Performance evaluation of various clustering methods on the Gaussian blobs data. Here we keep the number of data points fixed at $n = 10{,}000$ and vary the dimension from $d = 10$ to $100$. Top: The plots show the adjusted Rand index (left) and timing (right) for each method. Bottom: Average number of distance calculations per data point in the CLASSIX aggregation phase with and without early termination.

Figure 7 :
Figure 7: Sensitivity of CLASSIX on the Gaussian blobs data as the radius parameter is varied from 0.1 to 1. We report the ARI and AMI values for distance-based and density-based merging. CLASSIX produces good clusterings over a wide parameter range from 0.2 to 0.6 (approximately).

Figure 8 :
Figure 8: Clustering quality on the UCI Machine Learning Repository datasets as the hyperparameters change. The rows correspond to the eight datasets in Table 1 while the columns correspond to Mean shift, DBSCAN, HDBSCAN, Quickshift++, CLASSIX with distance-based merging, and CLASSIX with density-based merging, respectively. The searched parameters are bandwidth (Mean shift), $\epsilon$ (DBSCAN), minimum cluster size (HDBSCAN), $k$ (Quickshift++), and radius (CLASSIX).

Figure 9 :
Figure 9: Visualization of clustering results on the scikit-learn benchmark. Each plot shows the required computation time on the bottom right, and for CLASSIX the average number of distance computations per data point on the bottom left.

Figure 10 :
Figure 10: Visualisation of clustering results on the Shape benchmark. Each plot shows the required computation time on the bottom right, and for CLASSIX the average number of distance computations per data point on the bottom left.

Figure 11 :
Figure 11: CLASSIX clustering performance on the Olivetti face dataset. Different colours indicate different clusters. The gray-scale images are outliers that have not been assigned to any cluster. The computation time is about 0.33s.

Figure 12 :
Figure 12: Image segmentation via clustering. The left-most column shows the original image. For each method and image, the required clustering time is shown on the bottom right. For CLASSIX, the average number of distance calculations per data point is shown on the bottom left.

Table 1
Datasets in the UCI Machine Learning Repository

Table 2
Performance on datasets in the UCI Machine Learning Repository. The blue-shaded rows are ARI scores and the white rows are AMI scores. For each row, the best (highest) scores are printed in bold.

Table 3
ARI (blue-shaded rows) and AMI (white rows) scores on the six datasets in the scikit-learn clustering benchmark. The last two rows "AVG" correspond to the averages across all datasets. For each row, the best (highest) scores are printed in bold.

Table 4
Datasets in the Shape benchmark used for our tests.

Table 5
Performance of various clustering algorithms on the Shape benchmark. The blue-shaded rows correspond to ARI scores while the white rows are AMI scores. For each row, the best (highest) scores are printed in bold.