Fast and general density peaks clustering

Density peaks is a popular clustering algorithm, used for many different applications, especially for nonspherical data. Although powerful, its use is limited by its quadratic time complexity, which makes it slow for large datasets. In this work, we propose a fast density peaks algorithm that solves the time complexity problem. The proposed algorithm uses a fast and generic construction of an approximate k-nearest neighbor graph both for the density and for the delta calculation. This approach maintains the generality of density peaks, which allows using it for all types of data, as long as a distance function is provided. For a dataset of size 100,000, our approach achieves a 91:1 speedup factor. The algorithm scales up to datasets of 1 million points, which could not be processed by the original algorithm at all. With the proposed method, time complexity is no longer a limiting factor of density peaks clustering.

© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ ).


Introduction
Clustering algorithms aim at grouping points so that the points in the same group are more similar to each other than to points in other groups. Clustering can serve as an efficient exploratory data analysis tool in fields such as physics [1] and bioinformatics [2], or as a preprocessing step for other algorithms, e.g. in road detection [3] and motion segmentation [4].
Traditional clustering methods like k-means are limited to data with spherical clusters. Since clusters in real-life data do not always follow spherical shapes, new methods have been introduced to cluster data with arbitrarily shaped clusters. These include density-based clustering [2,5,6], graph-based methods [1,7,8], exemplar-based clustering [9][10][11], and support vector clustering [12,13].
In this paper, we focus on the Density peaks ( DensP ) [6] clustering algorithm, which detects clusters based on the observation that cluster centers are usually located in dense areas and are surrounded by points with lower density. The algorithm first calculates the densities of all points, and then the distances to their nearest point with higher density ( delta ). The cluster centers are selected so that they have a high value of both delta and density. After that, the remaining points are allocated ( joined ) to the clusters by merging with the nearest higher density point. The algorithm has been widely used for many applications, including autonomous vehicle navigation [3], moving object detection [4], electricity customer segmentation [14], document summarization [15] and overlapping community detection [16]. Despite its popularity, its use has been limited by the O( N 2 ) time complexity. This slowness originates from two different bottlenecks: the need to calculate density, and the need to find the nearest neighbors with higher density. These make the algorithm slow for very large datasets.
Some attempts have been made to improve the speed of density peaks. Wang et al. [14] use sampling with adaptive k-means to reduce the number of data points. Xu et al. [17] also limit the size of the data by using grid-based filtering to remove points in sparse areas. They reported speed-up factors from 2:1 to 10:1 for data of sizes N = 5000-10,000. However, both of these methods work only with numerical data, which reduces the generality of the original density peaks algorithm.
In addition to speed-up, there have also been attempts to improve the quality of density peaks. This has two major directions: using different density function [18][19][20][21] , and using different strategies to allocate the remaining points to the clusters [20][21][22] .
In the original density peaks algorithm, the densities are calculated by using a cut-off kernel , where neighborhood is defined by a given cutoff distance ( dc ). This defines a dc -radius hyper ball in the D -dimensional space. The algorithm then counts how many data points are within this ball.
Several authors have suggested alternatives to the cut-off kernel. Mehmood et al. [18] use a kernel variant based on heat-diffusion. Du [19] , Xie et al. [20] and Yaohui [21] all use a combination of exponential kernel and k nearest neighbors.
Xu et al. [22] proposed a novel joining strategy based on support vectors. Xie et al. [20] allocates the points using k nearest neighbors. The points are processed by a breadth first search starting from the cluster centers. Yaohui [21] proposed to join the points to the clusters if they are density reachable .
Most of the proposed approaches apply only to numerical data, which limits the usefulness of density peaks. However, the original density peaks algorithm does not have such limitation. Instead, it can operate with any given distance matrix regardless of the type of data. The only requirement is that some kind of distance function can be formulated. Du et al. [23] have used density peaks for a mix of categorical and numerical data, by introducing a new distance measure. Liu et al. [24] and Wang [15] use density peaks for text data by vectorizing the input data.
In this work, we focus on solving the slowness problem of density peaks. We propose a new fast density peaks algorithm called FastDP . It first creates a k-nearest neighbor (kNN) graph. The density and delta values are estimated using this graph. The standard joining strategy of density peaks is then applied to obtain the final clustering. The proposed algorithm is significantly faster than the original density peaks while keeping its generality. Contrary to the existing attempts to speed up density peaks [14,22], our approach is not limited to numerical data but applies, as such, to any type of data for which a distance function can be formulated. To demonstrate the algorithm's capabilities for non-vectorial data, we apply it to the clustering of strings.
Our main contributions are the following:
- We present the RP-Div algorithm to create the kNN graph fast.
- We utilize the graph to calculate the delta values fast.
- We show that the algorithm also applies to text data.
In terms of the other aspects of the algorithm, we use the original point allocation (joining) strategy. For density estimation we use the k-nearest neighbors as proposed in [19] .

Density peaks clustering
We first recall the original density peaks algorithm and its main variations. We use the following definitions:
d(x, y): distance between data points x and y
kNN(x): the k nearest neighbors of x
Slope point: point that is not a local peak

Density
There are two widely used approaches to estimate density: distance-based and kNN-based . The distance-based approach takes a distance threshold as a parameter and counts how many points lie within this distance. The original density peaks algorithm uses this approach [6].

The second approach finds the k nearest neighbors ( k -NN), and then calculates the average distance to these neighbors [19]. According to our experiments, there is no significant difference between the two approaches: both require O( N 2 ) calculations and provide virtually the same clustering result when a suitable parameter value is used. However, a good value for k is easier to determine than a suitable distance threshold [25]. We therefore use the kNN-based approach both for the original variant of density peaks and for the proposed fast variant.
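As a sketch, the kNN-based density above can be computed as the inverse of the average distance to the k nearest neighbors. The brute-force distance computation, the function name, and the exact inverse-mean formula are our illustrative choices, not necessarily the paper's exact definition:

```python
import numpy as np

def knn_density(X, k):
    """kNN-based density estimate: inverse of the mean distance to the
    k nearest neighbors. Brute-force O(N^2) version for illustration."""
    N = len(X)
    # pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    density = np.empty(N)
    for i in range(N):
        nn = np.argsort(D[i])[1:k + 1]       # skip the point itself
        density[i] = 1.0 / (D[i, nn].mean() + 1e-12)
    return density
```

Points deep inside a dense cluster get a small mean kNN distance and hence a high density value, while isolated points get a low one.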

Density peaks
The density peaks clustering algorithm [6] is based on the idea that cluster centers usually have a high density and are surrounded by points with lower density; consequently, they have a large distance to any point with higher density. Density peaks uses two features to determine the clusters: the density ρ, and delta δ, which is the distance to the nearest point having a higher density. We denote this point as the big brother .
The original algorithm selects K points as the cluster centers based on ρ and δ, because cluster centers are expected to have a high value for both. However, it was not defined how exactly the selection should be made, so different strategies have been considered by others. We denote the two most common as the Delta strategy [26], which selects the points with the highest delta values, and the Gamma strategy [26,27], which uses the product of the two features ( γ = ρδ). Here we apply the gamma strategy.
One also needs to decide how many points should be selected. The original paper did not give any solution and left it as a user-given parameter. Wang et al. [27] proposed to detect a knee point in the gamma values by finding the intersection of the two lines that most closely fit the curve. In general, the problem is how to threshold the selected feature (either δ or γ ). This is an open problem in both centroid-based and density-based clustering. Fig. 1 demonstrates the differences between the two selection strategies (delta and gamma). In this paper, we assume K is given by the user.
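The gamma strategy can be sketched in a few lines; the function and argument names are ours, with K assumed to be user-given as in the paper:

```python
import numpy as np

def select_centers(density, delta, K):
    """Gamma strategy: pick the K points with the largest
    gamma = rho * delta as the cluster centers."""
    gamma = density * delta
    return np.argsort(gamma)[::-1][:K]    # indices, best first
```

The delta strategy would be obtained simply by sorting on `delta` alone instead of the product.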

General distance
Density peaks has been mostly applied for numerical data in some vector space. However, it is possible to generalize the method to other non-numeric distance functions as well. Here we consider text data as a case study.
Two studies exist where string data has been used. Liu et al. [24] and Wang [15] first vectorize the strings using the TF-IDF model. Term frequency (TF) counts how many times a given word appears in the dataset; it is normalized by the inverse document frequency (IDF). This approach converts the strings into a vector space, and then uses cosine distance to measure the distance between two strings.
The TF-IDF approach can be highly useful when clustering large text documents. However, short text strings contain only a few words, which would result in sparse vectors with very few non-zero elements. Our solution is to apply a string similarity (or distance) measure directly, without vectorization. This is possible because our method does not require the distance function to be a metric, nor does it rely on any other vector space properties.
The choice of the string similarity (or distance) measure depends on the application. We consider here two choices: Levenshtein and Dice. Levenshtein edit distance [28] is the most well known string distance measure, and we apply it in the context of short text strings. For tweets, we use the Dice coefficient [29]. For a more comprehensive comparison of the available measures, we refer to [30].

Fig. 1. Different cluster selection strategies based on the density-vs-delta plot for the S4 dataset. Cluster centroids typically have both high density and a high distance to the nearest higher-density point (delta). Therefore, thresholding based on the combination of delta and density (gamma) is expected to work better than using the delta values alone.
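As an illustration of the Dice coefficient on character bigrams (the measure used here for tweets), the following sketch uses set-based bigrams; the paper's exact variant (e.g. multiset bigrams or string padding) may differ:

```python
def bigrams(s):
    """Set of character bigrams of a string (no padding)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_distance(s1, s2):
    """Dice-coefficient distance on character bigrams:
    1 - 2|A ∩ B| / (|A| + |B|). Identical strings give 0."""
    a, b = bigrams(s1), bigrams(s2)
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))
```

Unlike Levenshtein distance, which costs O(|s1|·|s2|), this runs in time linear in the string lengths, which matters for long tweets.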

Fast density peaks algorithm
The proposed fast density peaks method is presented in Algorithm 1 and demonstrated in Fig. 2. Source code can be found on the web. 1 It consists of five steps:
1. Create the kNN graph (line 1).
2. Calculate the density values (lines 2-3).
3. Find big brothers (line 6).
4. Identify cluster centers (lines 9-10).
5. Join the remaining points (lines 12-13).
First, we calculate the k nearest neighbor graph. The graph is then used to calculate all the information needed by the density peaks algorithm: (1) the density values, (2) the distance to the nearest point with higher density ( delta ), and (3) a pointer to this nearest neighbor ( big brother ). The product of the density and delta values ( gamma ) is used to determine the first K cluster centers (density peaks). The big brother pointers are then used to join the remaining points to the same cluster as their big brother.
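The joining step follows the original density peaks strategy: each remaining point inherits the cluster label of its big brother. A minimal sketch, assuming an array of big brother pointers and a processing order sorted by decreasing density (names and representation are ours):

```python
def join_points(big_brother, centers, order):
    """Assign each non-center point to the cluster of its big brother.
    `order` lists point indices in decreasing density, so every big
    brother is labeled before the points that follow it.
    Centers (and points with no big brother) keep their own label."""
    label = {c: i for i, c in enumerate(centers)}
    for p in order:
        if p not in label:
            label[p] = label[big_brother[p]]
    return [label[i] for i in range(len(order))]
```

Because a big brother always has higher density than its follower, processing in density order guarantees every lookup succeeds in a single pass.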
The algorithm has two computational bottlenecks:
• Constructing the kNN graph
• Finding the big brother points
Both of these take O( N 2 ) time if a straightforward solution is used. We next study how to make these two steps faster without sacrificing the quality of the clustering.

1 https://github.com/uef-machine-learning/fastdp

Creating kNN graph by RP-div algorithm
To make density peaks faster, we first generate an approximate k-nearest neighbor (kNN) graph using an iterative algorithm called RP-Div ( Algorithm 3 ). Its preliminary version was presented in [31]. The algorithm contains two loops. In the first loop (lines 1-4), we create a new candidate graph, which is used to improve the graph obtained in the previous iterations. The new candidate graph is generated by the Random Pair Division algorithm ( Algorithm 4 ).
The algorithm constructs the graph by hierarchical subdivision (demonstrated in Figs. 3 and 4 ). The division works by first selecting two random points, a and b ( Algorithm 4 , lines 5-6). The dataset is then split into two subsets, A and B, so that subset A contains all points that are closer to point a , and subset B all points that are closer to point b . The division continues recursively (lines 12-13) until the subset size falls below a predefined threshold W , e.g. W = 20. Subsets smaller than the threshold are solved by the O( N 2 ) brute force algorithm (line 2), which calculates the distances between all possible pairs and updates the list of the k nearest points found (variable kNN).
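The recursive random-pair subdivision can be sketched as follows; function and variable names are ours, and a full implementation would additionally run the brute-force kNN update on each collected subset:

```python
import random

def rp_div(points, dist, W, subsets, rng=random):
    """Random Pair Division sketch: recursively split the point set by
    proximity to a random pair (a, b) until subsets are smaller than W.
    Small subsets are collected for brute-force kNN solving."""
    if len(points) <= W:
        subsets.append(points)        # solved later by brute force
        return
    a, b = rng.sample(points, 2)
    A = [p for p in points if dist(p, a) <= dist(p, b)]
    B = [p for p in points if dist(p, a) > dist(p, b)]
    if not A or not B:                # degenerate split (e.g. duplicates)
        subsets.append(points)
        return
    rp_div(A, dist, W, subsets, rng)
    rp_div(B, dist, W, subsets, rng)
```

Note that the split needs only the distance function, never coordinates, which is what preserves the generality of the method for non-vectorial data.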
Algorithm 2: findBigBrothers(kNNg, X, density).

Fig. 3. The dataset is split based on which of the two points is nearer. After the first split, the size of subset A is smaller than the threshold W = 20, and it is solved by brute force. Subset B is further divided into two more subsets, both smaller than W, which are then also solved by brute force.

One iteration of the algorithm produces a crude approximation of the kNN graph. The graph is improved by repeating the hierarchical subdivision several times ( Fig. 4 ) using different random pairs for splitting. As a result, several different kNN graphs are created. Every new graph is used to improve the previous solution by adopting those k-nearest-neighbor pointers that are shorter than in the previous graph.
The first loop (line 1) in Algorithm 3 continues until the proportion of changed edges drops below 10% (line 4). In the second loop (line 5), we apply the same iterative process, but this time we use the NNDES algorithm [32] to examine every point's neighbors-of-neighbors as kNN candidates. NNDES works better when the graph already has moderate quality and is therefore used only in the later iterations. We continue until the improvement drops below stop = 1% (line 9).
The time bottleneck of the algorithm is the brute force step, which requires O( W 2 ) time per subset. Assuming all subsets are exactly of size W , there will be N/W subsets. The total time complexity of a single iteration is then O( N/W · W 2 ) = O( NW ). Using W = 2.5·k , this leads to a linear-time O( rkN ) algorithm, where the number of repeats r is a small constant.

Solving the Big brother pointers
Once the kNN graph is constructed, it can be used to speed up both bottlenecks of density peaks. The densities can be calculated directly from the graph. Finding the big brother ( Algorithm 2 ) is a special case of nearest neighbor search; however, instead of considering only the distance, we must also select a point with higher density. Fortunately, the majority of points have their big brother located within their kNN neighborhood (line 8). We call these slope points , and all others are denoted as local peaks . For slope points, we can find the big brother fast, in O( k ) steps, simply by analyzing the kNN graph (lines 5-10).
Local peaks, on the other hand, are local density maxima, and their big brothers cannot be found in the kNN graph. There is no shortcut to find their big brothers, and an O( N ) full search must therefore be performed (lines 11-14). These are also the points among which the final centroids will be chosen. The time complexity of this step depends on how many local peaks there are. Assuming that the proportion of local peaks is p , the time complexity of this step is O( pN 2 + (1 − p ) kN ) = O( pN 2 ). The speed-up is therefore inversely proportional to p . Fig. 5 shows an example of the distribution of the local peak points with a sample dataset. In general, the higher the value of k , the fewer peak points there are, and the faster the search for the big brothers.
The proportion of local peaks ( p ) is bounded by the number of neighbors k in the graph. A neighbor of a peak cannot be another peak. If there are pN local peaks, there will be at least kpN slope points. Since all points are either local peaks or slope points, we have the inequality pN + kpN ≤ N, which gives p ≤ 1/(k + 1). Therefore, the time complexity O( pN 2 ) can also be written in terms of k as O( N 2 / k ). Combined with the O( rkN ) time complexity of the kNN graph construction (see Section 3.1 ), this leads to a total time complexity of O( N 2 / k + rkN ) for the whole algorithm.
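The two-path search described above (a fast O(k) scan of the kNN list for slope points, with an O(N) fallback for local peaks) can be sketched as follows. This is our reading of Algorithm 2, not the authors' code; note that the global density maximum ends with no big brother and an infinite delta:

```python
import numpy as np

def find_big_brothers(knn, density, dist):
    """For each point i, find its big brother: the nearest point j with
    density[j] > density[i]. Slope points are resolved from their kNN
    list in O(k); local peaks fall back to an O(N) full search."""
    N = len(density)
    big_brother = np.full(N, -1)
    delta = np.full(N, np.inf)
    for i in range(N):
        # fast path: scan the k neighbors only
        for j in knn[i]:
            if density[j] > density[i] and dist(i, j) < delta[i]:
                big_brother[i], delta[i] = j, dist(i, j)
        if big_brother[i] == -1:      # local peak: full search
            for j in range(N):
                if density[j] > density[i] and dist(i, j) < delta[i]:
                    big_brother[i], delta[i] = j, dist(i, j)
    return big_brother, delta
```

With p the proportion of local peaks, the loop above costs O(pN² + (1 − p)kN), matching the analysis in the text.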

Experiments
We use parameters stop = 1% and k = 30 for kNN graph generation in FastDP algorithm, unless otherwise noted.
The experiments were run on Red Hat Enterprise Linux Server release 7.5 with 96 processing cores of Intel(R) Xeon(R) CPU E7-4860 v2 @ 3.20GHz and 1.0TB memory. Processing times are reported as run on a single thread.

Datasets
We test the proposed algorithm with the following four data types:
• Clustering basic benchmark
• High dimensional random clusters
• Artificial shapes
• Text datasets
We use datasets from the clustering benchmark [33]. To our knowledge, four algorithms can cluster all of these datasets correctly: global k-means [34], genetic algorithm [35], random swap [36] and density peaks [6]. We test whether our fast density peaks ( FastDP ) can do the same. We report results for the S1-S4 sets [37] and for the Birch and Birch2 datasets [38].
To show the capabilities of the proposed method with large high-dimensional data, we generated three large High dimensional random clusters datasets: RC100k-h, RC100k-l and RC1M. One hundred centroids were generated from a uniform random distribution, each attribute in the range [10..11]. For each cluster, 1000 (RC100k) or 10,000 (RC1M) points were drawn from a Gaussian distribution. To study the effect of the cluster variance, we generated two variants of the RC100k dataset: RC100k-h with high variance ( σ 2 = 0.50) and RC100k-l with low variance ( σ 2 = 0.05). For the larger RC1M dataset, the lower variance of 0.05 was used.
Artificial shapes are also used, because algorithms minimizing the sum of squared errors cannot handle this type of dataset while density peaks often can. We use three datasets on which the original density peaks is known to succeed: Flame [2], Aggregation [39], and Spiral [40]. The dataset DS6_8 provided by the authors of [8] was also used.
Furthermore, we generated two new artificial shape datasets: Worms2 (2D) and Worms64 (64D). The former is shown in Fig. 6. The data contains 25 individual shapes, each starting from a random position and moving in a random direction. At each step, points are drawn from a Gaussian distribution to produce a cloud around the current position. The cloud has both a low-variance (data) and a high-variance (noise) component. Their variance increases gradually at each step. The direction of movement is continually altered to an orthogonal direction. In the 64D version, the orthogonal direction is selected randomly at each step.
Collision is detected to prevent completely overlapping clusters.

Three text datasets are also used. The Countries dataset has 48 clusters consisting of the names of all European countries. Each cluster contains 50 variations of the country name, generated by adding random modifications to the name. The English words dataset 2 contains 466,544 words with length varying from 1 to 45 characters (9.4 characters on average). The Twitter data consists of tweets collected from the Nordic Tweets channel [41]. For the Countries and words datasets, we use edit distance. For the Twitter data, we use the Dice coefficient of bigrams, which is faster to compute than edit distance, especially for long strings ( Table 1 ).

Clustering quality
We use the Centroid Index (CI) to measure the success at cluster level [42] , and Normalized Mutual Information (NMI) at point level [43] . For the current state-of-the-art in measuring clustering quality we refer to [44] .
Given a ground truth solution (G) and a clustering solution (C), the centroid index counts how many real clusters are missing a center or, alternatively, how many clusters have too many centers. The CI-value is the higher of these two numbers [42]. The value CI = 0 means that the clustering is correct.
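The CI computation can be sketched as follows, assuming vector-space centroids and Euclidean distance (the function names are ours):

```python
import numpy as np

def orphans(A, B):
    """Map every centroid in A to its nearest centroid in B and count
    how many centroids in B receive no mapping (orphans)."""
    mapped = set()
    for a in A:
        d = [np.linalg.norm(np.array(a) - np.array(b)) for b in B]
        mapped.add(int(np.argmin(d)))
    return len(B) - len(mapped)

def centroid_index(C, G):
    """Centroid Index [42]: the larger orphan count of the two
    mapping directions C -> G and G -> C."""
    return max(orphans(C, G), orphans(G, C))
```

For shape and text data, the nearest-centroid mapping would be replaced by a most-similar-cluster mapping using the Jaccard coefficient, as described in the text.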
The calculation of CI is done by performing a nearest neighbor mapping between the clusters in C and G based on the centroid distances [42]. After the mapping, we count how many clusters were not mapped. These non-mapped clusters ( orphans ) are indicators of missing clusters. The mapping is done in both directions (C → G and G → C), and the maximum number of orphans is the CI-value:

CI(C, G) = max{ orphans(C → G), orphans(G → C) }

In the case of shape and text data, a center is not a proper representative of a cluster. We therefore find the most similar cluster instead of the nearest centroid. For this, we use the Jaccard coefficient, which is the ratio of the number of points common to the two clusters to the total number of unique points in them [45]:

J(A, B) = |A ∩ B| / |A ∪ B|

Normalized mutual information measures the information that the clustering solution (C) shares with the ground truth (G). Since its scale has no upper bound, the result is normalized by the average of the self-entropies of C and G. The better the clustering, the closer the value of NMI is to 1. The exact scale varies across the datasets, and it lacks the intuitive interpretation of the CI-value. The English words and Twitter data do not have any ground truth, so for them we only provide qualitative samples to estimate the clustering quality.

2 https://github.com/dwyl/english-words

Fig. 7. The corresponding FastDP processing time depends linearly on the data size, which is N = 1000·K.

Table 2. Proportion of local peaks p . When the number of neighbors k in the kNN graph is increased, the proportion of local peak points decreases.

Efficiency of the delta calculation
We test the efficiency of finding the big brothers by studying the number of local peak points. We need to perform the O( N )-time full search only for the local peak points. For all other points, the big brother can be found faster, in O( k ) time, simply by taking the nearest higher-density point in the kNN neighborhood. Therefore, the fewer local peak points there are, the faster the algorithm.

Fig. 7 shows the proportion of local peaks for subsets of Birch2, where one cluster was removed at a time to produce subsets with the number of clusters varying from K = 1 to 100. The corresponding data sizes vary from N = 1000 to 100,000. We observe that the proportion of peaks increases to about 0.9% at K = 10 clusters. After that it remains almost constant no matter how many more clusters there are. The total processing times are also shown; they have a near-linear dependency on the size of the data.

Table 3. Summary of the processing times and clustering quality. The quality of the kNN graph is varied by running the RP-Div algorithm for different numbers of iterations. Quality is recorded as NMI. We highlight the first NMI value that is equal (within 0.01 NMI) to the result of OrigDP. The processing times and CI-values are reported for this iteration.

The effect of the parameter k in kNN is shown in Table 2. With all datasets, the number of local peaks is small already with k = 10, and reduces to about 0.26% when k is increased to 70.

Results (Speed v. quality)
We implemented two versions of DensP in C: the original version [6], denoted as OrigDP , and the proposed method, denoted as FastDP . Both variants use the same kNN-based method for density estimation. In terms of clustering quality, the only difference originates from the quality of the kNN graph: OrigDP uses an exact kNN graph, whereas FastDP uses the approximate graph from [31].
In the experiments, we vary the number of iterations in RP-Div (1-30) to obtain different clustering quality. The quality of the approximate kNN graph increases with the number of iterations, and the same is expected to happen to the clustering quality.
From the results in Table 3, we can see that FastDP achieves similar quality to OrigDP but much faster. This is especially true for the larger datasets. Overall, the proposed method is much faster than OrigDP. In the case of the Birch2 dataset ( N = 100,000), the processing time is reduced from 282 s to 1.96 s with no reduction in quality. For smaller datasets, there is a 14:1 improvement for the S-sets of size 5000, and no improvement for the smallest datasets ( < 1000 points).
The proposed algorithm was also successful with the largest datasets (RC1M, English words, Tweets), which the original density peaks algorithm could not process at all. The 466,544 strings of the words dataset were clustered into 500 clusters using Levenshtein distance in 2091 s. Constructing the kNN graph was the main bottleneck, taking 1779 s. This dataset does not have a ground truth clustering, but one can verify from Table 4 that the results contain meaningful clusters.
The Nordic tweets dataset of size 544,133 was also successfully clustered, with parameters k = 100, stop = 5%, and NNDES disabled. Two sample clusters are shown in Table 5, both of which are meaningful to a human observer. Specifically, these two clusters were observed to have lower variance, which can be partly explained by the fact that they were produced by bots rather than humans. The weather cluster is produced by one single bot, whereas the jobs cluster also contains human-written tweets.

Conclusions
We proposed the fast density peaks (FastDP) algorithm. Its main advantage is that it removes the quadratic time complexity limitation of density peaks and allows clustering of very large datasets. The speed-up is achieved without any visible effect on the clustering quality. Experiments showed an average speed-up of 91:1 on datasets of size ≥ 100k. Clustering a high dimensional numerical dataset of size 1M took only 12 min. The algorithm works for all types of data as long as a distance function is provided; for example, we clustered a Nordic Tweet dataset of size 500k in 31 min.

Declaration of Competing Interest
The authors declare that they have no conflict of interest.