Clustering by Detecting Density Peaks and Assigning Points by Similarity-First Search Based on Weighted K-Nearest Neighbors Graph

This paper presents an improved clustering algorithm for categorizing data with arbitrary shapes. Most conventional clustering approaches work only with round-shaped clusters. This task can be accomplished by the clustering method based on fast search and find of density peaks (DPC), but in some cases it is limited by its density peak definition and allocation strategy. To overcome these limitations, two improvements are proposed in this paper. To describe the clustering center more comprehensively, the definitions of local density and relative distance are fused with multiple distance measures, including K-nearest neighbors (KNN) and shared-nearest neighbors (SNN). A similarity-first search algorithm is designed to find the best-matching cluster center for each noncenter point in a weighted KNN graph. Extensive comparisons with several existing methods, namely, the traditional DPC algorithm, density-based spatial clustering of applications with noise (DBSCAN), affinity propagation (AP), FKNN-DPC, and K-means, have been carried out. Experiments on synthetic and real data show that the proposed clustering algorithm outperforms DPC, DBSCAN, AP, and K-means in terms of clustering accuracy (ACC), adjusted mutual information (AMI), and adjusted Rand index (ARI).


Introduction
The natural ecosystem has the characteristics of diversity, complexity, and intelligence, which provide infinite space for data-driven technology. As a new research focus, data-driven prediction methods have been widely used in energy, transportation, finance, and automobiles [1][2][3][4][5][6][7]. Clustering algorithms are an important branch of data-driven technology, providing important information for further data analysis by mining the internal associations of data [8,9].
Due to the different definitions of clustering, different clustering strategies have been reported. Among them, the K-means algorithm is a simple and effective clustering algorithm. It preselects K initial clustering centers and then iteratively assigns each data point to the nearest clustering center [10]. Since the initial clustering centers have a certain impact on the clustering results of K-means, the works [11,12] provided several methods for selecting the initial clustering centers and improving the accuracy of clustering. Since K-means and its variants are based on the idea that data points are assigned to the nearest clustering center, these methods cannot handle nonspherical clustering tasks well. Unlike the K-means algorithm, affinity propagation (AP) [8] is based on the similarity between data points and completes clustering by exchanging messages between them. Hence, the AP algorithm does not need the number of clusters in advance, and it has a time advantage for the clustering of large-scale datasets [13]. However, for complex datasets, the AP method may suffer performance degradation just as the K-means method does [14].
To address the aforementioned problems, density-based clustering methods have been proposed, which can find clusters of various shapes and sizes in noisy data, where high-density regions are treated as clusters separated by low-density regions [15][16][17][18][19]. In this line, density-based spatial clustering of applications with noise (DBSCAN) [15,16] was proposed as an effective density-based clustering method. It needs two parameters describing the density of points (ε and MinPts) to achieve clustering of arbitrary shapes, where ε is the neighborhood radius and MinPts is the minimum number of points contained within the radius ε [15]. However, choosing suitable thresholds is a challenging task for these methods [15,17]. Subsequently, Rodriguez and Laio [20] proposed a novel density-based clustering algorithm by fast search and find of density peaks (named DPC). The DPC algorithm uses the local density and the relative distance of each point to establish a decision graph, finds the cluster centers according to the decision graph, and then assigns each noncenter point to the cluster of its nearest higher-density neighbor. Although the DPC algorithm is simple and effective for detecting clusters of arbitrary shape, several issues limit its practical application. Firstly, DPC is sensitive to the cutoff distance d_c, implying that d_c must be set suitably to retain satisfactory performance, which is not a trivial task. Secondly, the clustering centers must be selected manually, which may not be feasible or convenient for some datasets. Moreover, the allocation error of a high-density point directly affects the allocation of the low-density points around it, and such errors propagate continuously through the subsequent allocation process.
To overcome these issues, several improved DPC algorithms have recently been studied. To avoid the influence of the cutoff distance d_c, the concept of K-nearest neighbors (KNN) has been introduced into the DPC algorithm, leading to two different density measures, DPC-KNN [19] and FKNN-DPC [9]. Although both algorithms are based on K-nearest neighbor information, they were developed independently. Moreover, to solve the problem of manual selection of clustering centers, Li et al. [21] proposed a density peak clustering method that determines the clustering centers automatically. In this algorithm, the potential clustering centers are determined by the γ ranking graph, and the true clustering centers are then filtered out using the cutoff distance d_c. To remedy the transmission of allocation errors, FKNN-DPC [9] and SNN [22] both adopt a two-step allocation strategy for noncentral points. In the first step, they use a breadth-first search to assign nonoutlier points. In the second step, FKNN-DPC uses fuzzy weighted K-nearest neighbor technology to allocate the remaining points, while SNN determines the cluster of each remaining point according to whether its number of shared neighbors reaches a threshold.
This paper proposes an improved clustering algorithm based on density peaks (named DPC-SFSKNN). It has the following new features: (1) the local density and the relative distance are redefined, fusing the distance attributes of the two neighbor relationships (KNN and SNN); this allows the method to detect low-density clustering centers. (2) A new allocation strategy is proposed: a similarity-first search algorithm based on a weighted KNN graph is designed to allocate noncenter points, which makes the allocation strategy fault tolerant.
In general, this paper is organized as follows: Section 2 briefly introduces the DPC algorithm and its developments and analyzes the DPC algorithm in detail. Section 3 introduces the DPC-SFSKNN algorithm in detail and gives a detailed analysis. Section 4 tests the proposed algorithm on several synthetic and real-world datasets and compares its performance with DPC, DBSCAN, AP, FKNN-DPC, and K-means in terms of several popular criteria for evaluating a clustering algorithm, namely, clustering accuracy (ACC), adjusted mutual information (AMI), and adjusted Rand index (ARI). Section 5 draws some conclusions.

Related Work
The density peak clustering algorithm (DPC) was proposed by Rodriguez and Laio in 2014. The core idea of the DPC algorithm lies in the characterization of the cluster center, which has the following two characteristics: the cluster center has a higher local density and is surrounded by neighbor points with lower local density; the cluster center is relatively far from other denser data points. These characteristics of the cluster center are captured by two quantities: the local density ρ_i of each point i and its relative distance δ_i, which represents the closest distance from the point to any point of higher density.

DPC Algorithm and Improvements
Suppose X is a dataset for clustering and d_ij represents the Euclidean distance between data points i and j. The calculation of the local density and the relative distance depends on the distance d_ij. The DPC algorithm introduces two methods for calculating the local density: the "cutoff" kernel method and the Gaussian kernel method. For a data point i, its local density ρ_i is defined in (1) with the "cutoff" kernel method and in (2) with the Gaussian kernel method, where d_c is defined as a cutoff distance, which represents the neighborhood radius of the data point. The most significant difference between the two methods is that ρ_i calculated with the "cutoff" kernel is a discrete value, while ρ_i calculated with the Gaussian kernel is a continuous value. Therefore, the probability of conflict (different data points having the same local density) is relatively smaller for the latter. Moreover, d_c is an adjustable parameter in (1) and (2), defined in (3); it is chosen so that the average number of neighbors of each point is between 1% and 2% of the total number of points [20]. N is the position of the last entry after sorting all the distances d_ij in ascending order, and it is also the total number of points; the constant 2 in formula (3) is the empirical parameter provided in reference [20], which can be adjusted for different datasets. The relative distance δ_i represents the minimum distance between point i and any other point of higher density, as expressed in (4), where d_ij is the distance between points i and j. When the local density ρ_i is not the maximum density, the relative distance δ_i is defined as the minimum distance between point i and any other point of higher density; when ρ_i is the maximum density, δ_i takes the maximum distance to all other points. After calculating the local density and relative distance of all data points, the DPC algorithm establishes a decision graph from the pairs (ρ_i, δ_i).
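For concreteness, the two density kernels and the relative distance described above can be sketched as follows. This is a minimal NumPy implementation written for this discussion; the function name and the exclusion of each point from its own density sum are our choices, not from the paper.

```python
import numpy as np

def dpc_density_and_delta(D, dc, kernel="gaussian"):
    """D: (n, n) pairwise distance matrix; dc: cutoff distance."""
    n = D.shape[0]
    if kernel == "cutoff":
        # rho_i = number of points closer than dc (a discrete value)
        rho = (D < dc).sum(axis=1) - 1.0  # subtract 1 to exclude the point itself
    else:
        # rho_i = sum_j exp(-(d_ij/dc)^2) (a continuous value)
        rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1.0
    # delta_i = min distance to any point of higher density;
    # the densest point instead takes its maximum distance to all others
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = D[i].max() if higher.size == 0 else D[i, higher].min()
    return rho, delta
```

On a small one-dimensional example, the densest point receives the largest δ, exactly as the decision-graph construction requires.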
A point with high values of both ρ_i and δ_i is called a peak, and the cluster centers are selected from the peaks. Then, the DPC algorithm directly assigns each remaining point to the same cluster as its nearest higher-density peak.
For the DPC algorithm, the selection of d_c has a great influence on the correctness of the clustering results. Both the DPC-KNN and FKNN-DPC schemes introduce the concept of K-nearest neighbors to eliminate the influence of d_c, providing two different local density calculations. The local densities proposed by DPC-KNN [19] and FKNN-DPC [9] are given in (5) and (6), respectively, where K is the total number of nearest neighbors and KNN(i) represents the set of K-nearest neighbors of point i. Through the idea of K-nearest neighbors, these two methods provide a unified density metric for datasets of any size and solve the nonuniformity of DPC's density metric across different datasets.
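Since equations (5) and (6) are not reproduced in this text, the following sketch shows one common formulation of each KNN-based density; the exact constants and normalization in the cited papers may differ.

```python
import numpy as np

def knn_densities(D, K):
    """Two KNN-based local densities (one common formulation of each;
    constants may differ from DPC-KNN [19] and FKNN-DPC [9])."""
    # distances to the K nearest neighbors, excluding the point itself
    knn_d = np.sort(D, axis=1)[:, 1:K + 1]
    # DPC-KNN style: density from the mean squared neighbor distance
    rho_dpc_knn = np.exp(-(knn_d ** 2).mean(axis=1))
    # FKNN-DPC style: sum of exponentially decayed neighbor distances
    rho_fknn_dpc = np.exp(-knn_d).sum(axis=1)
    return rho_dpc_knn, rho_fknn_dpc
```

Both measures depend only on the K nearest distances, which is what makes them insensitive to the global cutoff d_c.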
Based on K-nearest neighbors, SNN-DPC [22] proposes the concept of shared-nearest neighbors (SNN), which is used to define the local density ρ_i and the relative distance δ_i. The idea of SNN is that if two points have more common points among their K-nearest neighbors, their similarity is higher, and the expression is given by equation (7). Based on the SNN concept, the expression of SNN similarity is as follows: where d_ip is the distance between points i and p and d_jp is the distance between points j and p. The condition for calculating SNN similarity is that points i and j appear in each other's K-nearest neighbor sets; otherwise, the SNN similarity between the two points is 0.
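The mutual-neighbor condition and the shared-neighbor set can be sketched as follows. Because the SNN similarity equation is elided in this text, the weighting below (shared count squared over the summed distances to shared neighbors) is one published formulation and should be treated as an assumption here.

```python
import numpy as np

def snn_similarity(D, K):
    """SNN similarity matrix: nonzero only when i and j are mutual
    K-nearest neighbors (one common formulation; constants may vary)."""
    n = D.shape[0]
    knn = np.argsort(D, axis=1)[:, 1:K + 1]      # K nearest neighbors of each point
    knn_sets = [set(row) for row in knn]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # similarity is defined only for mutual K-nearest neighbors
            if j in knn_sets[i] and i in knn_sets[j]:
                shared = knn_sets[i] & knn_sets[j]
                if shared:
                    denom = sum(D[i, p] + D[j, p] for p in shared)
                    S[i, j] = S[j, i] = len(shared) ** 2 / denom
    return S
```

Points that never appear in each other's neighbor lists keep a similarity of exactly 0, which is the behavior the text describes.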
Next, the local density ρ_i of point i is expressed via SNN similarity. Suppose point i is any point in the dataset X; then S(i) = {x_1, x_2, . . . , x_k} represents the set of k points with the highest similarity to point i, and the expression of the local density follows. At the same time, the equation for the relative distance δ_i of point i is defined analogously. The SNN-DPC algorithm not only redefines the local density and relative distance but also changes the data point allocation strategy. The allocation strategy divides the data points into two categories: "unavoidable subordinate points" and "probable subordinate points." Each type of data point has its own allocation algorithm. Compared with the DPC algorithm, this allocation strategy is better for clustering clusters with different shapes.

DPC Algorithm Analysis.
The DPC algorithm is very simple and elegant. However, due to its simplicity, DPC has the following two potential problems to be further addressed in practice.

DPC Ignores Low-Density Points.

When the density difference between clusters is large, the performance of the DPC algorithm can be significantly degraded. To show this issue, we take the dataset Jain [23] as an example; the clustering results calculated using the "cutoff" kernel distance of DPC are shown in Figure 1. It can be seen that the cluster distribution in the upper left is relatively sparse, while the cluster distribution in the lower right is relatively dense. The red stars in the figure represent the selected cluster centers. Owing to the disparity in density, the clustering centers selected by DPC are all on the tightly distributed cluster below. Due to the incorrect selection of the clustering centers, the subsequent allocations are also incorrect.
Analyzing the local density and the relative distance separately, from Figures 2(a) and 2(b) it can be seen that the ρ value and the δ value of point A, the false cluster center, are much higher than those of the true cluster center C. The results of the Gaussian kernel distance calculation are the same, and the correct clustering center cannot be selected on the dataset Jain. Therefore, how to increase the ρ value and the δ value of a low-density center and make it stand out in the decision graph is a problem that needs to be considered.

The Allocation Strategy of DPC Has Low Fault Tolerance.
The fault tolerance of the allocation strategy of the DPC algorithm is not satisfactory, mainly because the allocation of a point is strongly influenced by points of higher density. Hence, if a high-density point is allocated incorrectly, the error directly affects the subsequent allocation of the lower-density points around it. Taking the Pathbased dataset [24] as an example, Figure 3 shows the clustering result calculated by the DPC algorithm using the "cutoff" kernel distance. It can be seen from the figure that the DPC algorithm can find suitable clustering centers, but the allocation results of most points are incorrect. The same is true of the results calculated using the Gaussian kernel distance.
The results of point assignment on the Pathbased dataset are similar to those of the "cutoff" kernel clustering. Therefore, the fault tolerance of the point allocation strategy should be further improved. Moreover, points are greatly affected by other points during allocation, which is also an issue to be further addressed.

Proposed Method
In this section, the DPC-SFSKNN algorithm is introduced in detail: the five main definitions of the algorithm are presented, the entire algorithm process is described, and the complexity of the DPC-SFSKNN algorithm is analyzed.

The Main Idea of DPC-SFSKNN.
The DPC algorithm relies on the distances between points to calculate the local density and the relative distance and is also very sensitive to the choice of the cutoff distance d_c. Hence, the DPC algorithm may not be able to correctly process some complex datasets. The probability that a point and its neighbors belong to the same cluster is high, and adding neighbor-related attributes to the clustering process can help to make a correct judgment. Therefore, we introduce the concept of shared-nearest neighbors (SNN) proposed in [22] when defining the local density and the relative distance. Its basic idea is that two points are considered more similar if they have more common neighbors, as stated above (see equation (7)).
Based on the above ideas, we define the average distance d snn (i, j) of the shared-nearest neighbor between point i and point j and the similarity between the two points.
Definition 1 (average distance of SNN). For any points i and j in the dataset X, the shared-nearest neighbor set of the two points is SNN(i, j), and the average distance of SNN, d_snn(i, j), is given in (11), where point p is any point in SNN(i, j) and S is the number of members in the set SNN(i, j). d_snn(i, j) describes the spatial relationship between points i and j more comprehensively by taking into account the distances from both points to their shared-nearest neighbors.
Definition 2 (similarity). For any points i and j in the dataset X, the similarity Sim(i, j) between points i and j is given in (12), where K is the number of nearest neighbors. K is selected from 4 to 40 until the optimal parameter appears. The lower bound is 4 because a smaller K may prevent the algorithm from terminating. For the upper bound, experiments show that a large K does not significantly affect the results of the algorithm. The similarity is defined according to the aforementioned basic idea, "if they have more common neighbors, the two points are considered to be more similar," and it is described by the ratio of the number of shared-nearest neighbors to the number of nearest neighbors.
Definition 3 (K-nearest neighbor average distance). For any point i in the dataset X, with K-nearest neighbor set KNN(i), the K-nearest neighbor average distance d_knn(i) is given in (13), where point p is any point in KNN(i) and K is the number of nearest neighbors of any point. The K-nearest neighbor average distance describes the surrounding environment of a point to some extent. Next, we use it to describe the local density.
Definition 4 (local density). For any point i in the dataset X, the local density is defined in (14), where point j is a point in the set KNN(i) and d_knn(i) and d_knn(j) are the K-nearest neighbor average distances of points i and j, respectively. In formula (14), the numerator (the number of shared-nearest neighbors S) represents the similarity between the two points, and the denominator (the sum of the average distances) describes the environment around them. When S is constant, a smaller sum of the average distances (d_knn(i) + d_knn(j)) yields a larger local density ρ_i of point i. Point j is one of the K-nearest neighbors of point i.
When the values of d_knn(i) and d_knn(j) are small, points i and j are closely surrounded by their neighbors. If d_knn(i) is larger (point i is far from its neighbors) or d_knn(j) is larger (the neighbors of point j are far from it), the local density of point i becomes smaller. Therefore, the local density of point i is large only when the average distances of both points are small. Moreover, when the sum of the average distances of the two points is constant, a larger number of shared-nearest neighbors yields a larger local density. A large number of shared neighbors indicates that the two points have a high similarity and a high probability of belonging to the same cluster. The more high-similarity points surround a point, the greater its local density and the greater its probability of becoming a cluster center. This is beneficial to low-density clustering centers: a large number of shared neighbors can compensate for the loss caused by their large distances to other points, so that their local density is not determined by distance alone. Next, we define the relative distance of the points.
Definition 5 (relative distance). For any point i in the dataset X, the relative distance is defined in (15), where point j is one of the K-nearest neighbors of point i, d_ij is the distance between points i and j, and d_knn(i) and d_knn(j) are the K-nearest neighbor average distances of points i and j. We use the sum of the three distances to represent the relative distance. Compared with the DPC algorithm, which uses only d_ij to represent the relative distance, our definition also incorporates the K-nearest neighbor average distances of the two points. The new definition can not only express the relative distance but is also friendlier to low-density cluster centers: for a fixed d_ij, the average neighbor distance of a low-density point is relatively large, so its relative distance also increases, which raises the probability of low-density points being selected. The DPC-SFSKNN clustering centers are selected in the same way as in the traditional DPC algorithm: the local density ρ and the relative distance δ form a decision graph, and the n points with the largest local density and relative distance are selected as the clustering centers.
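Since equations (11)-(15) are elided in this text, Definitions 2-5 can be sketched from the prose as follows. This follows our reading only: the similarity ratio, the summation over KNN(i) in the density, and taking δ over higher-density points (with the max-based fallback for the densest point) are assumptions, not the paper's verified formulas.

```python
import numpy as np

def dpc_sfsknn_quantities(D, K):
    """Sketch of Definitions 2-5 as read from the prose; the elided
    equations may differ in normalization and tie handling."""
    n = D.shape[0]
    knn = np.argsort(D, axis=1)[:, 1:K + 1]
    knn_sets = [set(row) for row in knn]
    # Definition 3: average distance to the K nearest neighbors
    d_knn = D[np.arange(n)[:, None], knn].mean(axis=1)

    # Definition 2: similarity = shared-nearest neighbors / K (assumed form)
    def sim(i, j):
        return len(knn_sets[i] & knn_sets[j]) / K

    # Definition 4: density fuses shared-neighbor counts (numerator)
    # with the two points' average neighbor distances (denominator)
    rho = np.array([
        sum(len(knn_sets[i] & knn_sets[j]) / (d_knn[i] + d_knn[j])
            for j in knn[i])
        for i in range(n)
    ])

    # Definition 5: relative distance adds the d_knn compensation terms;
    # taken over higher-density points, with a max fallback for the densest
    delta = np.zeros(n)
    for i in range(n):
        higher = [j for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta[i] = min(D[i, j] + d_knn[i] + d_knn[j] for j in higher)
        else:
            delta[i] = max(D[i, j] + d_knn[i] + d_knn[j]
                           for j in range(n) if j != i)
    return rho, delta, d_knn, sim
```

The d_knn terms act exactly as the compensation the text describes: sparse points carry larger average neighbor distances, which inflates their δ values in the decision graph.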
For DPC-SFSKNN, the sum of the distances from points of a low-density cluster to their K-nearest neighbors may be large; thus, they receive a greater compensation for their δ value. Figures 4(a) and 4(b) show the results of DPC-SFSKNN on the Jain dataset [23]. Compared with Figure 2(b), the δ values of points in the upper branch are generally larger than those of the lower branch. This is because the density of the upper branch is significantly smaller and the distances from the points to their respective K-nearest neighbors are larger; thus, they receive a greater compensation. Even though the density is at a disadvantage, the higher δ value still makes the center of the upper branch stand out in the decision graph. This shows that the DPC-SFSKNN algorithm can correctly select low-density clustering centers.

Processes.
The entire process of the algorithm is divided into two parts: the selection of clustering centers and the allocation of noncenter points. The main steps of DPC-SFSKNN and a detailed introduction of the proposed allocation strategy are given in Algorithm 1.
Line 9 of the DPC-SFSKNN algorithm establishes a weighted K-nearest neighbor graph, and Line 11 applies the similarity-first search allocation strategy. To assign the noncenter points in the dataset, we designed a similarity-first search algorithm based on the weighted K-nearest neighbor graph. The algorithm uses the breadth-first search idea to find the cluster center with the highest similarity for each noncenter point. The similarities between a noncenter point and its K-nearest neighbors are sorted, the neighbor with the highest similarity is selected as the next visited node, and it is pushed into the path queue. If the highest-similarity point is not unique, the point with the smallest SNN average distance is selected as the next visited node. The visited node in turn sorts the similarities of its K-nearest neighbors and selects the next node to visit. The search stops when the visited node is a cluster center. Algorithm 2 describes the entire search process. Finally, every data point except the cluster centers is traversed to complete the assignment.
The similarity-first search algorithm is an optimization of breadth-first search tailored to the allocation requirements of noncenter points. Similarity is an important concept for clustering algorithms: points in the same cluster are similar to each other, and two points with a higher similarity have more common neighbors. Based on these ideas, the definition of similarity is given in (12). During the search, if similarity were the only criterion, the highest-similarity point could easily be non-unique. Therefore, the algorithm uses the SNN average distance as a secondary criterion; a smaller d_snn means that the two points are closer in space. The clustering results of the DPC-SFSKNN algorithm on the Pathbased dataset are shown in Figure 5. Figure 3 clearly shows that although the traditional DPC algorithm can find cluster centers in each of the three clusters, there is a serious bias in the allocation of noncenter points. From Figure 5, we can see the effectiveness of the noncenter point allocation algorithm of DPC-SFSKNN. The allocation strategy uses similarity-first search to ensure that the similarity along the search path is the highest, searching gradually toward the cluster center while avoiding using low-similarity points as references. Besides, the similarity-first search allocation strategy based on the weighted K-nearest neighbor graph considers neighbor information: when the highest-similarity point is not unique, the point with the smallest average distance of the shared neighbors is selected as the next visited point.
Require: dataset X, parameter K
Ensure: clustering result C
(1) Data preprocessing: normalize the data
(2) Calculate the Euclidean distance between the points
(3) Calculate the K-nearest neighbors of each point i ∈ X
(4) Calculate the average distance of the K-nearest neighbors of each point, d_knn(i), according to (13)
(5) Calculate the local density ρ_i of each point i ∈ X according to (14)
(6) Calculate the relative distance δ_i of each point i ∈ X according to (15)
(7) Find the cluster centers by analyzing the decision graph composed of ρ and δ and collect them in the set CC
(8) Calculate the similarity between each point i and its K-nearest neighbors according to (12)
(9) Connect each point in the dataset X with its K-nearest neighbors, using the similarity as the connection weight, to construct a weighted K-nearest neighbor graph
(10) Calculate the average distance of SNN, d_snn(i, j), between each point i and its shared-nearest neighbors according to (11)
(11) Apply Algorithm 2 to allocate the remaining points

ALGORITHM 1: DPC-SFSKNN.
Require: w ∈ X, set of cluster centers CC, number of neighbors K, similarity matrix S_n×n = (sim(i, j))_n×n, and SNN average distance matrix DSNN_n×n = (d_snn(i, j))_n×n
Ensure: the cluster center that point w is assigned to
(1) Initialize the descending queue Q and the path queue P. The K-nearest neighbors of point w are sorted in the ascending order of similarity and pushed into Q. Push w into P.
(2) while the tail point of P ∉ CC do
(3)   if the highest-similarity point is unique then
(4)     Pop the point this at Q's tail
(5)   else
(6)     Select the point this with the smallest DSNN
(7)   end if
(8)   Empty the descending queue Q
(9)   The K-nearest neighbors of this are sorted in the ascending order of similarity and pushed into Q
(10)  Push this into P
(11) end while

ALGORITHM 2: Similarity-first search allocation strategy.
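Algorithm 2 can be sketched as the following walk over the weighted KNN graph. This is a sketch assuming dictionary inputs for the neighbor lists, similarity matrix, and SNN average distances; the visited-set cycle guard and the dead-end return are our own additions, not part of the paper's pseudocode.

```python
def similarity_first_search(w, centers, knn, sim, d_snn):
    """Walk from noncenter point w toward a cluster center, always
    stepping to the K-neighbor with the highest similarity
    (ties broken by the smallest SNN average distance)."""
    path = [w]
    visited = {w}          # cycle guard (our addition)
    current = w
    while current not in centers:
        candidates = [j for j in knn[current] if j not in visited]
        if not candidates:
            return path, None      # dead end: point left unassigned here
        # highest similarity first; smallest d_snn breaks ties
        nxt = min(candidates,
                  key=lambda j: (-sim[current][j], d_snn[current][j]))
        path.append(nxt)
        visited.add(nxt)
        current = nxt
    return path, current   # the center whose cluster w joins
```

The path queue P of Algorithm 2 corresponds to `path`, and the descending queue Q is replaced by the `min` over the current node's unvisited neighbors.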

In the following analysis, suppose the number of points in the dataset is n, the number of cluster centers is m, and the number of neighbors is k.

Time Complexity. The time complexity analysis of DPC-SFSKNN is as follows.
Normalization requires a processing complexity of approximately O(n); the complexities of calculating the Euclidean distance and the similarity between points are O(n^2); the complexity of computing the K-nearest neighbor average distance d_knn is O(n^2); similarly, the complexity of calculating the average distance d_snn between a point and its shared-nearest neighbors does not exceed O(n^2); and the complexity of calculating the local density ρ_i and the relative distance δ_i likewise does not exceed O(n^2). Overall, the time complexity of the DPC-SFSKNN algorithm is at most k times that of the traditional DPC algorithm. However, k is small compared with n, so this does not significantly affect the run time. In Section 4, it is demonstrated that the actual running time of DPC-SFSKNN does not exceed k times the running time of the traditional DPC algorithm.

Space Complexity. DPC-SFSKNN needs to calculate the distances and similarities between points, with complexity O(n^2). Other data structures (such as the ρ and δ arrays and the various average distance arrays) require O(n). For the allocation strategy, in the worst case, the complexity is O(n^2). The space complexity of DPC is O(n^2), mainly due to the stored distance matrix. The space complexity of our DPC-SFSKNN is the same as that of traditional DPC, namely O(n^2).
To eliminate the influence of missing values and of differences in the ranges of different dimensions, the datasets need to be preprocessed before the experiments. We replace each missing value with the mean of all valid values in the same dimension and normalize the data using the min-max normalization method shown in the following equation, where x_ij represents the original data located in the ith row and jth column, x'_ij represents the rescaled value of x_ij, and x_j represents the original data in the jth column. The min-max normalization method processes each dimension of the data and preserves the relationships among the original data values [36], therefore decreasing the influence of the difference in dimensions and increasing the efficiency of the calculation.
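The preprocessing just described (per-column mean imputation followed by min-max rescaling to [0, 1]) can be sketched as follows; the guard for constant columns is our own addition.

```python
import numpy as np

def preprocess(X):
    """Mean-impute missing values per column, then min-max rescale
    each column to [0, 1]."""
    X = np.array(X, dtype=float)
    # replace NaNs with the mean of the valid values in the same column
    col_mean = np.nanmean(X, axis=0)
    nan_pos = np.isnan(X)
    X[nan_pos] = np.take(col_mean, np.where(nan_pos)[1])
    # min-max normalization: x' = (x - min) / (max - min)
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)  # avoid division by zero
    return (X - mn) / rng
```

Because min-max scaling is a monotone affine map per column, it preserves the ordering relationships among the original values, as the text notes.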
To fairly reflect the clustering results of the algorithms, their parameters are tuned to ensure satisfactory clustering performance. For the DPC-SFSKNN algorithm, the parameter K needs to be specified in advance, and the initial clustering centers are manually selected from a decision graph composed of the local density ρ and the relative distance δ. It can be seen from the experimental results in Tables 3 and 4 that the value of the parameter K is around 6, and K is larger than 6 for datasets with dense sample distributions. In addition to manually selecting the initial clustering centers, the traditional DPC algorithm also needs to determine d_c. Based on the provided selection range, d_c is selected so that the average number of neighbors is between 1% and 2% of the total number of data points [20]. The two parameters that DBSCAN needs are ε and MinPts, as in [15]; the optimal parameters are determined using a circular search method. The AP algorithm only needs a preference value, and the larger the preference, the more center points are allowed to be selected [8]; since no general method for selecting this parameter is effective, multiple experiments are performed to select the optimal value. The only parameter of K-means is the number of clusters, for which the true number of clusters in the dataset is used. Similarly, FKNN-DPC needs the number of nearest neighbors K.

Analysis of the Experimental Results on Synthetic Datasets.
In this section, the performance of DPC-SFSKNN, DPC [20], DBSCAN [15], AP [8], FKNN-DPC [9], and K-means [10] is tested on the six synthetic datasets given in Table 1. These synthetic datasets differ in distribution and size, simulating different data situations to compare the performance of the six algorithms. Table 3 shows the AMI, ARI, ACC, and EC/AC of the clustering algorithms on the six synthetic datasets, where the best results are shown in bold and "-" means no value. Figures 6-9 show the clustering results of DPC-SFSKNN, DPC, DBSCAN, AP, FKNN-DPC, and K-means on the Pathbased, Flame, Aggregation, and Jain datasets, respectively. All algorithms achieve optimal clustering on the DIM512 and DIM1024 datasets, so the clustering of these two datasets is not shown. Since the cluster centers of DBSCAN are relatively random, only the positions of the clustering centers of the other algorithms are marked. Figure 6 shows the results on the Pathbased dataset. DPC-SFSKNN and FKNN-DPC can cluster the Pathbased dataset correctly. From Figures 6(b), 6(d), and 6(f), it can be seen that the clustering results of DPC, AP, and K-means are similar. The clustering centers selected by DPC, AP, DPC-SFSKNN, and FKNN-DPC are highly similar, but the clustering results of DPC and AP are not satisfactory. For the DPC algorithm, the low fault tolerance of its allocation strategy is the cause of this result: a high-density point allocation error is transferred to low-density points, and the error propagation seriously affects the clustering results. The AP and K-means algorithms are not good at dealing with irregular clusters: the two clusters in the middle are too attractive to the points on both sides of the semicircular cluster, which leads to clustering errors.
DBSCAN can completely detect the semicircular cluster, but the semicircular cluster and the left middle cluster are incorrectly merged into one category, while the right middle cluster is split into two clusters. The similarities between points and the manually prespecified parameters may severely affect the clustering. The DPC-SFSKNN and FKNN-DPC algorithms perform well on the Pathbased dataset; such improved algorithms, which take neighbor relationships into account, have a great advantage in handling complex distributions. Figure 7 shows the results on the Flame dataset. As shown in the figure, DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN correctly detect the two clusters, while AP and K-means cannot cluster completely correctly. Although AP correctly identifies the upper cluster and selects an appropriate cluster center, it divides the lower cluster into two clusters; both clusters found by K-means are wrong. The clustering results in Figure 8 show that the DPC-SFSKNN, DPC, FKNN-DPC, and DBSCAN algorithms detect the 7 clusters in the Aggregation dataset, but AP and K-means still cannot cluster correctly. DPC-SFSKNN, DPC, and FKNN-DPC identify both the clusters and their centers. Although no cluster centers are marked for DBSCAN, the number of clusters and the overall shape of each cluster are correct. The AP algorithm finds the correct number of clusters, but it chooses two centers for one cluster, thereby splitting that cluster into two; the clustering result of K-means is similar to that of AP.
The Jain dataset shown in Figure 9 consists of two semicircular clusters of different densities. As shown in the figure, the DPC-SFSKNN algorithm completely clusters the two clusters despite their different densities. However, DPC, AP, FKNN-DPC, and K-means incorrectly assign the left end of the lower cluster to the upper cluster, and the cluster centers chosen by DPC both lie on the lower cluster; by comparison, the distribution of the cluster centers chosen by AP is more reasonable. The DBSCAN algorithm accurately identifies the lower cluster, but it incorrectly separates the left end of the upper cluster into a new cluster, so the upper cluster is divided into two clusters.
According to the benchmark data in Table 3, DPC-SFSKNN is clearly effective among the six clustering algorithms, especially on the Jain dataset. Although DPC and FKNN-DPC perform better than DPC-SFSKNN on the Aggregation and Flame datasets, DPC-SFSKNN still finds the correct clustering centers of the Aggregation dataset and completes the clustering task correctly.

Analysis of Experimental Results on Real-World Datasets.
In this section, the performance of the six algorithms is again benchmarked using AMI, ARI, ACC, and EC/AC, and the clustering results are summarized in Table 4. Twelve real-world datasets are selected to test DPC-SFSKNN's ability to identify clusters of different kinds. The DBSCAN and AP algorithms cannot obtain effective clustering results on the Waveform and Waveform (noise) datasets; the symbol "-" represents no result.
As shown in Table 4, in terms of the benchmarks AMI, ARI, and ACC, DPC-SFSKNN outperforms the five other algorithms on the Wine, Segmentation, and Libras Movement datasets, while FKNN-DPC performs best on the Iris, Seeds, Parkinsons, and WDBC datasets. The overall performance of DPC-SFSKNN is slightly better than that of DPC on the 11 datasets other than Parkinsons; on Parkinsons, DPC-SFSKNN is slightly worse than DPC in AMI but better in ARI and ACC. Similarly, compared with FKNN-DPC, DPC-SFSKNN performs slightly better on eight of the datasets, whereas on Iris, Parkinsons, WDBC, and Seeds it is slightly worse in AMI, ARI, and ACC. DBSCAN obtains the best results on Ionosphere, and K-means is the best on Pima-Indians-diabetes as well as the best in AMI on the Waveform and Waveform (noise) datasets. In general, the clustering results of DPC-SFSKNN on the real-world datasets are satisfactory.
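For reference, the three external metrics used throughout the tables can be computed as follows. AMI and ARI are taken directly from scikit-learn; ACC is implemented here as the usual best one-to-one label matching via the Hungarian algorithm (the helper `clustering_acc` is an illustrative assumption, not the authors' code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_acc(y_true, y_pred):
    """ACC: find the best one-to-one mapping between predicted and true
    labels (Hungarian algorithm), then compute plain accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    # Contingency matrix: rows = predicted labels, columns = true labels.
    cont = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cont[p, t] += 1
    row, col = linear_sum_assignment(-cont)  # maximize matched counts
    return cont[row, col].sum() / len(y_true)

# Same partition with permuted labels: all three scores should be perfect.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
ami = adjusted_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
acc = clustering_acc(y_true, y_pred)
```

All three metrics are invariant to a permutation of the cluster labels, which is why the toy example above scores perfectly despite the relabeling.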

Experimental Analysis of Olivetti Face Dataset.
The Olivetti face dataset [28] is an image dataset widely used to evaluate machine learning algorithms. Its purpose here is to test the unsupervised clustering behavior of an algorithm, including determining the number of clusters in the database and the members of each cluster. The dataset contains 40 clusters, each consisting of 10 different images of the same subject. Because the number of clusters (40) is large while each cluster contains only 10 images, local density estimates become less reliable, which poses a great challenge for density-based clustering algorithms. To further test its clustering performance, DPC-SFSKNN is run on the Olivetti face database and compared with DPC, AP, DBSCAN, FKNN-DPC, and K-means. The clustering results achieved by DPC-SFSKNN and DPC on the Olivetti face database are shown in Figure 10, where white squares represent the cluster centers. The 32 clusters found by DPC-SFSKNN in Figure 10(a) and the 20 clusters found by DPC in Figure 10(b) are displayed in different colors; gray images are not assigned to any cluster. As Figure 10(a) shows, the 32 cluster centers found by DPC-SFSKNN cover 29 clusters, and as Figure 10(b) shows, the 20 cluster centers found by DPC are scattered over 19 clusters. Like DPC, DPC-SFSKNN may divide one cluster into two; because DPC-SFSKNN finds many more density peaks than DPC, it is more likely to identify one cluster as two different clusters. The same situation occurs with the FKNN-DPC algorithm, although the performance of FKNN-DPC in AMI, ARI, ACC, and EC/AC is better than that of DPC-SFSKNN. Based on the AMI, ARI, ACC, and EC/AC values in Table 5, the performance of DPC-SFSKNN is slightly superior to that of the other four algorithms except FKNN-DPC.

Running Time.
This section compares the time performance of DPC-SFSKNN with that of DPC, DBSCAN, AP, FKNN-DPC, and K-means on the real-world datasets. The time complexity of DPC-SFSKNN and DPC was analyzed in Section 3.3.1: the time complexity of DPC is O(n²) and that of DPC-SFSKNN is O(kn²), where n is the size of the dataset. However, the time consumed by DPC comes mainly from calculating the local density and the relative distance of each point, while the time consumed by DPC-SFSKNN comes mainly from the K-nearest-neighbor computation and the allocation strategy for noncenter points. Table 6 lists the running time (in seconds) of the six algorithms on the real-world datasets. Although the time complexity of DPC-SFSKNN is approximately k times that of DPC, their execution times on actual datasets do not differ by a factor of k.
In Table 6, it can be seen that on relatively small datasets the running time of DPC-SFSKNN is about twice that of DPC or more, and the difference comes mainly from DPC-SFSKNN's allocation strategy. While the computational load of the local densities grows quickly with the size of a dataset, the time consumed by the allocation strategy in DPC-SFSKNN varies with the distribution of the dataset. This leads to an irregular gap between the running times of DPC and DPC-SFSKNN.
FKNN-DPC has the same time and space complexity as DPC, but its running time is almost the same as that of DPC-SFSKNN: computing the K-nearest-neighbor relationships takes a large share of the running time. The time complexity of DBSCAN and AP is approximately O(n²), and the parameters of neither algorithm can be determined by simple methods. When the dataset is relatively large, it is difficult to find their optimal parameters, which may be why the two algorithms produce no results on the Waveform datasets. The approximate time complexity of K-means is O(n), and Table 6 confirms its efficiency. K-means loses almost no accuracy while being fast, which makes it a very popular clustering algorithm, but it handles irregularly shaped data poorly.
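To make the O(kn²)-versus-O(n²) discussion concrete, the brute-force construction of the weighted KNN graph that dominates DPC-SFSKNN's preprocessing can be sketched as follows (an illustrative sketch, not the authors' implementation); doubling n roughly quadruples the distance-computation cost:

```python
import time
import numpy as np

def knn_graph(points, k):
    """Brute-force weighted KNN graph: the O(n^2) pairwise distance matrix
    dominates the cost, consistent with the complexity analysis above."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)              # exclude self-neighbors
    idx = np.argsort(dists, axis=1)[:, :k]       # indices of the k nearest neighbors
    w = np.take_along_axis(dists, idx, axis=1)   # edge weights = distances
    return idx, w

rng = np.random.default_rng(1)
for n in (200, 400, 800):
    pts = rng.normal(size=(n, 2))
    t0 = time.perf_counter()
    idx, w = knn_graph(pts, k=6)
    print(f"n={n:4d}: {time.perf_counter() - t0:.4f} s")
```

In practice a KD-tree or ball-tree neighbor search reduces the graph construction below quadratic time for low-dimensional data, but the quadratic matrix above matches the complexity figures quoted in the text.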

Conclusions and Future Work
A new clustering algorithm based on the traditional DPC algorithm is proposed in this paper. It introduces a density peak search that takes the surrounding neighbor information into account and develops a new allocation strategy to detect the true distribution of the dataset. The proposed algorithm quickly searches for and finds the density peaks, i.e., the cluster centers, of a dataset of any size and recognizes clusters of arbitrary shape and dimensionality. The algorithm is called DPC-SFSKNN: it computes the local density and the relative distance from distance information between points and their neighbors to find the cluster centers, and it then assigns the remaining points by a similarity-first search over the weighted KNN graph, which finds the owner (cluster center) of each point. DPC-SFSKNN successfully addresses several issues in the clustering algorithm of Alex Rodriguez and Alessandro Laio [20], including its density metric and the potential issue hidden in its assignment strategy. The performance of DPC-SFSKNN was tested on several synthetic datasets, on real-world datasets from the UCI machine learning repository, and on the well-known Olivetti face database. The experimental results demonstrate that DPC-SFSKNN is powerful in finding cluster centers and in recognizing clusters regardless of their shape, the dimensionality of the space in which they are embedded, and the size of the dataset, and that it is robust to outliers. It performs much better than the original DPC algorithm. However, the proposed algorithm has some limitations: the parameter K must be adjusted manually for different datasets; the clustering centers still need to be selected manually from the decision graph (as in the DPC algorithm); and the allocation strategy improves the clustering accuracy at the expense of running time.
How to improve the degree of automation and the allocation efficiency of the algorithm requires further research.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.