Article

A Double-Density Clustering Method Based on “Nearest to First in” Strategy

Yaohui Liu, Dong Liu, Fang Yu and Zhengming Ma
1 School of Software and Communication Engineering, Xiangnan University, Chenzhou 423000, China
2 School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Symmetry 2020, 12(5), 747; https://doi.org/10.3390/sym12050747
Submission received: 18 January 2020 / Revised: 7 April 2020 / Accepted: 9 April 2020 / Published: 6 May 2020

Abstract

The existing density clustering algorithms have high error rates when processing data sets with mixed-density clusters. To overcome this shortcoming, a double-density clustering method based on the Nearest-to-First-in strategy, DDNFC, is proposed, which calculates two densities for each point by using its reverse k-nearest neighborhood and its local spatial position deviation, respectively. Points whose two densities are both greater than the respective average densities of all points are core objects. By searching the strongly connected subgraphs in the graph constructed on the core objects, the data set is clustered initially. Each non-core object is then assigned to its nearest cluster using a strategy dubbed 'Nearest-to-First-in': the distance from each unclassified point to its nearest cluster is calculated first; only the points with the minimum distance are placed into their nearest clusters; and this procedure is repeated until all unclassified points are clustered or the minimum distance becomes infinite. To test the proposed method, experiments on several artificial and real-world data sets are carried out. The results show that DDNFC is superior to state-of-the-art methods such as DBSCAN, DPC and RNN-DBSCAN.

1. Introduction

How to extract unknown information from data sets is a problem of wide concern. Clustering, as an unsupervised method for partitioning a data set naturally, is able to discover the potential and internal knowledge, laws and rules of data. In recent decades, researchers have proposed many clustering algorithms based on different theories and models, which are generally divided into several categories such as partitioning, hierarchical, density, grid and model-based methods [1].
K-Means [2], a well-known partitioning-based clustering method, is suited to data sets composed of spherical clusters, even when the clusters have different densities. One of its limitations is that it is unsuitable for data sets with arbitrarily shaped clusters. Agglomerative clustering [3] is a type of hierarchical clustering that uses a bottom-up approach; Ward's linkage, a method for calculating the distance between clusters, is its most commonly used linkage because of its good monotonicity and spatial expansion properties. The time complexity of hierarchical clustering methods is often high. Meanshift [4] is a general non-parametric clustering method that is suitable for data sets with arbitrarily shaped clusters.
Density-based clustering is known for its capability of finding arbitrarily shaped clusters in data sets, and it has therefore been widely used in data mining, pattern recognition, machine learning, information retrieval, image analysis and other fields in recent years [5,6,7,8,9,10,11,12,13,14,15,16].
Ester et al. proposed a density-based algorithm for discovering clusters in large spatial databases with noise (DBSCAN) [17] in 1996, and since then many density-based clustering algorithms have been proposed and widely used. DBSCAN needs two parameters, eps and minpts. The density of each point is estimated by counting the number of points in the hypersphere of radius eps around it, and each point is then recognized as a core or non-core object according to whether its density is greater than minpts or not. An obvious weakness of DBSCAN is that selecting two appropriate parameters for different data sets is very difficult. Rodriguez and Laio proposed a new clustering approach by fast search and find of density peaks (DPC) [18]. It is simpler and more effective than DBSCAN because only one parameter is needed and the clustering centers can be selected visually by projecting points onto the ρ–δ decision graph. However, DPC cannot correctly partition a data set with adjacent clusters of different densities. To improve the performance of DPC, many methods have tried to modify the density calculation, the mode of cluster-center selection, and so on. Vadapalli et al. introduced the reverse k-nearest neighbors (RNN) model into density-based clustering (RECORD) [19] and adopted the same way as DBSCAN to define the reachability of points, but required only one parameter k. Although RECORD often confuses points lying on the borders of clusters with noise, because of its ability to correctly detect the cores of clusters with different densities, several improved methods based on it have been proposed.
The density-based approaches mentioned above use only one way to measure point densities in a data set. Therefore, although they perform well on data sets with arbitrarily shaped but separately distributed clusters, they cannot separate low-density clusters from adjacent high-density clusters in mixed-density data sets. In this paper, a density clustering method based on double densities and the Nearest-to-First-in strategy, dubbed DDNFC, is proposed. Compared with the state-of-the-art density clustering algorithms above, this method has the following advantages: (1) it depicts the distribution of data more accurately by using two densities for each point, the reverse k-nearest neighbor count and the local offset of position; (2) according to the two densities and their respective thresholds, a data set is initially divided into several high-density core areas and a low-density boundary surrounding them, so that low-density clusters are not easily over-segmented; (3) by adopting the Nearest-to-First-in strategy, only those unclassified boundary points closest to the existing clusters are classified in each iteration, which reduces the sensitivity of the clustering to the input parameter and the impact of the storage order of data points on the classification results.
The remainder of this paper is organized as follows. In Section 2, we redefine the core object, which was first introduced in DBSCAN, and give the procedure for clustering the core objects initially; the complexity of the initial clustering procedure is also analyzed there. Section 3 describes the Nearest-to-First-in strategy proposed in this paper in detail, together with its complexity analysis. In Section 4, experimental results on synthetic and real data sets are presented and discussed, and the choice of an appropriate k is also discussed. Finally, conclusions are drawn in Section 5.

2. Core Objects and Initial Clustering

In this section, we define two densities to represent the distribution of the data, introduce some notions originally defined in DBSCAN and RECORD and improve them, and describe the initial clustering of the core objects.

2.1. Abbreviations

X: a data set with N d-dimensional points;
|·|: the counting function that returns the size of a set;
x, y, z: points in the data set X;
d(x, y): a distance function returning the distance between points x and y; the Euclidean distance is used in this paper;
k: the input parameter indicating the number of nearest neighbors of a point;
N_k(x): the set of the k nearest neighbors of point x;
R_k(x): the set of reverse k nearest neighbors of point x, defined as $R_k(x) := \{ y \mid y \in X \ \text{and} \ x \in N_k(y) \}$;
c: the number of clusters in data set X;
C_i, C_j: the i-th and j-th clusters, with $1 \le i, j \le c$, $i \ne j$ and $C_i \cap C_j = \emptyset$;
L: the label array, in which the value of each element indicates the cluster that the corresponding point belongs to.

2.2. Density of Each Point

In some cases, a single density cannot portray the real distribution of data. For example, as shown in Figure 1, Compound [20] is a typical data set composed of several clusters with different shapes and densities, and the clusters are adjacent to or surround each other. Dividing it into the correct clusters is a difficult task. In this paper, we use two densities to detect the true core areas of clusters. For each point in a data set, apart from using the reverse k-nearest neighbors model to compute one density, we introduce the offset of the local position of a point to calculate the other.
Definition 1.
Offset of local position. The offset of local position of point x is the distance between x and the geometric center $\bar{x}$ of all its k nearest neighbors, defined as follows:
$$d_{off}(x) = d(x, \bar{x}) \tag{1}$$
where $\bar{x} = \frac{1}{k} \sum_{y \in N_k(x)} y$.
As displayed in Figure 1, the original position of each point is drawn as a red ball, the semi-transparent green cross connected to it is the geometric center of its k nearest neighbors, and the length of the arrowed line connecting a red ball with its corresponding green cross is the offset of local position. Figure 1a is the entire view of the points of Compound and their offsets, and Figure 1b–d are partial views of Compound. From these figures we can see that the offset of a point is small when it lies in the center of its local region but large on the boundary, and the more sparsely the data are distributed locally, the larger the offset is. It is therefore reasonable to estimate the density of a point from its offset of local position. However, this measure also has defects: as shown in Figure 1b, the two boundary points circled by a blue ellipse have very small offset values because they are located at the local center of their neighbors. We therefore use two density computations to estimate the true distribution of a point in the data set, defined as follows: (1) the density computed from the offset of local position
$$\rho_{off}(x) = \exp\left(-d_{off}^2(x)\right) \tag{2}$$
(2) the density computed from the reverse k nearest neighbors model
$$\rho_R(x) = |R_k(x)| \tag{3}$$
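To make Definition 1 and the densities (2) and (3) concrete, the following is a minimal Python sketch, not the authors' code: the helper name two_densities and the use of scikit-learn's NearestNeighbors are our own assumptions, and ties in the nearest-neighbor search are resolved however the library resolves them.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_densities(X, k):
    """Return d_off, rho_off (2) and rho_R (3) for every row of the (N, d) array X."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # N_k(x), the point itself excluded
    centers = X[idx].mean(axis=1)                          # geometric center of N_k(x)
    d_off = np.linalg.norm(X - centers, axis=1)            # Definition 1
    rho_off = np.exp(-d_off ** 2)                          # density (2)
    rho_R = np.bincount(idx.ravel(), minlength=len(X))     # |R_k(x)|, density (3)
    return d_off, rho_off, rho_R
```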

2.3. Core Object

Using the two densities of all points, two corresponding thresholds are set, and each point can then be classified into one of two types: core or non-core.
Definition 2.
Core object. A point x in data set X is a core object if it satisfies the following condition:
$$\rho_{off}(x) > \frac{1}{N}\sum_{y \in X}\rho_{off}(y) \quad \text{and} \quad \rho_R(x) \ge k \tag{4}$$
Those points that do not satisfy the above condition are non-core objects.
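Under the same assumptions as the sketch above, Definition 2 reduces to a single boolean test per point; here rho_off and rho_R are the densities (2) and (3), the thresholds are the mean offset density and k itself, and we read the second condition as ρ_R(x) ≥ k.

```python
import numpy as np

def core_mask(rho_off, rho_R, k):
    # Definition 2: offset density above the average and at least k reverse neighbors
    rho_off, rho_R = np.asarray(rho_off), np.asarray(rho_R)
    return (rho_off > rho_off.mean()) & (rho_R >= k)
```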
After extracting the core objects from a data set, we adapt two relationships, density directly-reachable and density reachable, which were first defined in DBSCAN. Using these relations, we construct a directed graph and find its maximal connected components, which contain the cores of all the clusters.
Definition 3.
Density directly-reachable. A point x is density directly-reachable from another point y if the following two conditions are satisfied:
(1) x and y are both core objects;
(2) $x \in R_k(y)$.
Because the fact that x is one of the reverse k nearest neighbors of y does not guarantee that y is also a reverse k nearest neighbor of x, y is not necessarily density directly-reachable from x. That is to say, density direct-reachability is not symmetric.
Definition 4.
Density reachable. A point x is density reachable from another point y if there is a series of points $x_1, \dots, x_m$ with $x = x_1$ and $y = x_m$ satisfying the following conditions:
(1) $\forall i,\ 1 \le i \le m$, $x_i$ is a core object;
(2) $\forall i,\ 1 \le i < m$, $x_i$ is density directly-reachable from $x_{i+1}$.
Density reachability is not symmetric because density direct-reachability is not symmetric.

2.4. Initial Clustering

In Algorithm 1, instead of constructing the directed graph of core objects explicitly, we randomly start from an unclassified core object and iteratively visit all unclassified core objects that are density directly-reachable or density reachable from the points classified before.
Algorithm 1: Initial clustering.
Input: data set X, k, N_k, R_k.
Output: label array L.
Step 1. Initialize each element of L to 0, and set C_id to 0;
Step 2. for each x_i in X:
Step 3.     Compute d_off(x_i) by (1);
Step 4.     Calculate ρ_off(x_i) and ρ_R(x_i) by (2) and (3), respectively;
Step 5. Calculate the threshold ρ̄_off by (4);
Step 6. for each x_i in X:
Step 7.     if ρ_off(x_i) > ρ̄_off and ρ_R(x_i) > k:
Step 8.         L(x_i) = −1;
Step 9. Initialize an empty queue Q;
Step 10. for each x_i in X:
Step 11.     if L(x_i) != −1: continue;
Step 12.     Q.push(x_i), C_id += 1;
Step 13.     while Q is not empty:
Step 14.         x = Q.pop(), L(x) = C_id;
Step 15.         for each y in R_k(x):
Step 16.             if L(y) == −1: Q.push(y);
Step 17. return L.
In Algorithm 1, 2kN memory units are needed to store the elements of N_k and R_k, and L, ρ_off and ρ_R need N units each. The storage for the queue Q is dynamic but never exceeds N. Thus, the space complexity of the algorithm is O(kN). The loop from step 2 to step 4 repeats kN times to calculate the two densities of all points. Computing ρ̄_off in step 5 and finding all core objects in steps 6–8 each take N operations according to the corresponding definitions. Steps 10–16 form a three-layer nested loop. The innermost loop, steps 15–16, repeats between k and 2k times on average. The number of iterations of the second loop depends on how many core objects are density reachable from the seed point taken in step 11, but it is obviously no more than N. The outer loop calls the inner loops c times if the data set has c core regions, so the time complexity of the nested loop is O(ckN); the larger c is, the smaller the core regions are and the less time the second loop needs. Finding the k-nearest-neighbor set N_k and the reverse k-nearest-neighbor set R_k of each point before calling the algorithm is a procedure common to all related density-based approaches and not specific to our method, and its time complexity is O(N²). Therefore, the total time complexity of the initial clustering procedure is O(N²).
Figure 2 shows the result of the initial clustering on the data set Compound with k = 7; Algorithm 1 obtains six initial clusters. The blue ball in the bottom-right corner is the core object of the sparse cluster on the right.
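As an illustration only (our own sketch under the assumptions above, not the authors' implementation), the queue-based labelling of steps 9–17 can be written as follows; it takes the boolean core mask of Definition 2 and the reverse-kNN index lists R_k as inputs and returns the initial labels, with 0 for non-core points.

```python
from collections import deque
import numpy as np

def initial_clustering(core, R_k):
    """core: boolean core-object mask (Definition 2); R_k: list of reverse-kNN index lists."""
    labels = np.where(core, -1, 0)          # -1 marks core objects that are not yet clustered
    c_id = 0
    for i in range(len(labels)):
        if labels[i] != -1:                 # skip non-core and already clustered points
            continue
        c_id += 1
        queue = deque([i])
        while queue:                        # breadth-first expansion along R_k edges
            x = queue.popleft()
            labels[x] = c_id
            for y in R_k[x]:
                if labels[y] == -1:
                    queue.append(y)
    return labels                           # 0 = non-core, 1..c = initial clusters
```

For instance, initial_clustering(core_mask(rho_off, rho_R, 7), R_k) would reproduce this kind of grouping, up to the choice of k.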

3. Nearest-to-First-in Strategy

The initial clustering procedure groups all core objects, while a large number of non-core objects remain unclustered in the data set. Many density-based clustering algorithms, such as RECORD, IS-DBSCAN, ISB-DBSCAN and RNN-DBSCAN, adopt different approaches to classify each non-core object into one of the clusters or treat it as noise. However, these methods share a weakness: the clustering result suffers from the storage order of the points. The Nearest-to-First-in strategy introduced in this section solves this problem. We illustrate the basic idea, give the implementation steps of the algorithm, and analyze its time and space complexity.

3.1. Basic Idea of the Strategy

The Nearest-to-First-in strategy is based on two simple ideas: (1) the closer two points are, the higher the probability that they belong to the same category; (2) by giving priority to the higher-probability events, the cost of clustering is minimized. To illustrate the advantages of our strategy, we first introduce several definitions.
Definition 5.
Distance from an unclassified point to a cluster. The distance between an unclassified point x and a cluster $C_i$, dubbed $d(x, C_i)$, is defined as:
$$d(x, C_i) = \begin{cases} \min_{y \in R_k(x) \cap C_i} d(x, y), & \text{if } R_k(x) \cap C_i \neq \emptyset \\ \infty, & \text{otherwise} \end{cases} \tag{5}$$
To compute the distance between an unclassified point x and a certain cluster $C_i$, we first check whether the intersection of the reverse k-nearest-neighbor set $R_k(x)$ and $C_i$ is empty or not. If the intersection is not empty, in which case we say that x is adjacent to $C_i$, the distance between x and $C_i$ is the distance from x to the point in the intersection nearest to x; otherwise, the distance from x to $C_i$ is infinite.
Definition 6.
Nearest cluster. Given an unclassified point x, its nearest cluster $C_N(x)$ is:
$$C_N(x) = \operatorname*{arg\,min}_{C_i,\ 1 \le i \le c} d(x, C_i) \tag{6}$$
In particular, if x has the same distance to several clusters, and even if x is not adjacent to any cluster, $C_N(x)$ is set to the first cluster in the ordered list of distances from x.
Definition 7.
Smallest distance to clusters. The smallest distance to clusters, $d_{min}$, is the smallest among the distances from all unclassified points to their respective nearest clusters, that is:
$$d_{min} = \min_{x \in X,\ L(x) = 0} d\left(x, C_N(x)\right) \tag{7}$$
where $L(x)$ is the cluster label of x, whose value is 0 if x is unclassified. The value of $d_{min}$ is ∞ if the distances from all unclassified points to their nearest clusters are ∞.
Definition 8.
Nearest neighbors of clusters. Given the smallest distance $d_{min}$, the set of nearest neighbors of clusters, NN, is defined as follows:
$$NN = \left\{ x \mid x \in X,\ L(x) = 0,\ d\left(x, C_N(x)\right) = d_{min} \right\} \tag{8}$$
NN is composed of all the unclassified points closest to the existing clusters. NN is empty if $d_{min}$ is infinite.
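For concreteness, Definitions 5–8 can be sketched as below; the function names are illustrative, D is a precomputed pairwise distance matrix, R_k the reverse-kNN index lists, and labels the array L with 0 marking unclassified points.

```python
import numpy as np

def nearest_cluster_distance(x, D, R_k, labels):
    """Definitions 5-6: d(x, C_N(x)) and the label of the nearest cluster of point x,
    searching only the reverse k nearest neighbors of x that are already labelled."""
    best_d, best_c = np.inf, -1
    for y in R_k[x]:
        if labels[y] > 0 and D[x, y] < best_d:
            best_d, best_c = D[x, y], labels[y]
    return best_d, best_c

def smallest_distance_and_NN(D, R_k, labels):
    """Definitions 7-8: the global d_min and the set NN of points that attain it."""
    unl = np.where(labels == 0)[0]
    dists = np.array([nearest_cluster_distance(x, D, R_k, labels)[0] for x in unl])
    d_min = dists.min() if unl.size else np.inf
    NN = unl[dists == d_min] if np.isfinite(d_min) else np.array([], dtype=int)
    return d_min, NN
```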

3.2. Main Procedures of Nearest-to-First-in Strategy

The Nearest-to-First-in strategy is a greedy strategy. As shown in Algorithm 2, the clustering of the unclassified points is completed iteratively. In each iteration, the set of nearest neighbors of clusters, NN, is calculated using (5)–(8), and the points in NN are then allocated to their corresponding nearest clusters. This process is repeated until NN becomes empty; at that point, either all points have been classified or the remaining unclassified points are not adjacent to any existing cluster, in which case they are regarded as noise.
It should be noted that, according to Definition 6, the nearest cluster of an unclassified point x may be affected by the storage order of the reverse k nearest neighbors of x, but Definitions 7 and 8 minimize this impact.
Algorithm 2: NTFI.
Input: X, k, R_k, L.
Output: label array L.
Step 1. Put all unclassified points (L(x) = 0) into an array unL and create two arrays D and CN of the same size as unL; set each element of CN to −1 and each element of D to ∞;
Step 2. while unL is not empty:
Step 3.     d_min = ∞;
Step 4.     for each x in unL:
Step 5.         for each y in R_k(x):
Step 6.             if L(y) > 0 and dist(x, y) < D(x):
Step 7.                 D(x) = dist(x, y), CN(x) = L(y);
Step 8.         if D(x) < d_min: d_min = D(x);
Step 9.     if d_min = ∞: break;
Step 10.    for each x in unL:
Step 11.        if D(x) == d_min:
Step 12.            L(x) = CN(x);
Step 13.            Remove x from unL, D and CN;
Step 14. Return L.
In the process of classification, the strategy selects the points to be processed with global priority each time, but classifies them according to the principle of locality, so it obtains better clustering results on the unclassified points than other density-based clustering algorithms.
In the Nearest-to-First-in procedure, the main memory needed is as follows: 1. the distance matrix, with N² real-valued entries; 2. the reverse k-nearest-neighbor sets, of size kN; 3. a label array of N entries indicating the cluster label of each element; 4. the arrays of unclassified points, unL, D and CN, which need fewer than N units each. Therefore, the space complexity of this algorithm is O(N²).
The number of unclassified points is the key factor determining the time complexity of the strategy, and it is of course less than N. The most time-consuming part is the three-layer nested loop from step 2 to step 8, which calculates the nearest cluster CN and the corresponding distance D of all unclassified points according to (6) and (5). The size of R_k(x) of an unclassified point x is less than k because x is a non-core object, so the two inner nested loops from step 5 to step 7 iterate fewer than kN times. The other inner loop, between steps 10 and 13, classifies the unlabeled points nearest to the existing clusters, and its number of iterations also depends on the number of unclassified points. Step 13 removes all points newly labeled in an iteration, so the size of the next loop decreases, while step 9 prevents the algorithm from looping endlessly when some points cannot be classified. The total time complexity is thus O(kN²). In fact, after one round of iteration, it is not necessary to recalculate CN and D for all the remaining unclassified points; we only need to update the unclassified points in the k nearest neighborhoods of the points newly classified in that round, so the time required per round is about k². In the worst case, only one unclassified point is classified in each iteration and the loop runs N times, so the worst-case time complexity of this algorithm is O(k²N).
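Putting the pieces together, a compact and unoptimized Python sketch of the whole Nearest-to-First-in pass could look like the following; as before, the names and the dense distance matrix D are our assumptions rather than the paper's implementation, and the initial labels are those produced by Algorithm 1.

```python
import numpy as np

def ntfi(D, R_k, labels):
    """Algorithm 2: repeatedly attach the unclassified points currently closest to a cluster."""
    labels = labels.copy()
    while True:
        unl = np.where(labels == 0)[0]              # still unclassified points
        if unl.size == 0:
            break
        best_d = np.full(unl.size, np.inf)          # D(x): distance to the nearest cluster
        best_c = np.full(unl.size, -1)              # CN(x): label of the nearest cluster
        for i, x in enumerate(unl):
            for y in R_k[x]:                        # Definition 5: only labelled reverse kNNs count
                if labels[y] > 0 and D[x, y] < best_d[i]:
                    best_d[i], best_c[i] = D[x, y], labels[y]
        d_min = best_d.min()                        # Definition 7
        if not np.isfinite(d_min):                  # remaining points are treated as noise
            break
        hit = best_d == d_min                       # Definition 8: the set NN
        labels[unl[hit]] = best_c[hit]
    return labels
```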

4. Experiments and Results Analysis

In order to verify the effectiveness of DDNFC, we compare it with six other algorithms, DPC, RNN-DBSCAN [21], DBSCAN, KMeans [2], Ward-link and Meanshift [4], on artificial and real data sets. Using three cluster validity indexes [1,22,23], the F-measure with α = 1 (F1), adjusted mutual information (AMI) and the adjusted Rand index (ARI), we evaluate and discuss the performance of these seven methods. All algorithms are implemented in Python (version 3.7). DPC and RNN-DBSCAN are coded according to their original articles; the other five methods are the built-in algorithms or models provided by the scikit-learn library [24]. The parameters of the algorithms are set as follows: (1) Meanshift uses the default parameters of the scikit-learn model; (2) the input parameters of KMeans and Ward-link are set to the true number of clusters, and KMeans adopts KMeans++ to initialize the cluster centers; (3) to get the best result of DBSCAN, we use grid search to find the optimal values of its two parameters, eps and minpts, with eps varying from 0.1 to 0.5 in steps of 0.1 and minpts from 5 to 30 in steps of 5; (4) the cutoff distance required by DPC is calculated from an input percentage value, and the optimal value is searched from 1.0 to 5.0 percent in steps of 0.5; to simplify the implementation of DPC, we directly select the c points with the top values of gamma, defined as gamma = ρ·δ in [18], as the cluster centers, and the density of DPC is calculated with a Gaussian kernel; (5) the optimal value of the parameter k of our method and of RNN-DBSCAN is searched in the range [2, N).
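The scoring and the DBSCAN grid search described above can be reproduced roughly as follows; this is a hedged sketch under our own naming, and the F1 measure is omitted because it additionally requires matching each obtained cluster to a ground-truth class.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def score_clustering(y_true, y_pred):
    return {"AMI": adjusted_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred)}

def best_dbscan(X, y_true):
    """Grid search over the (eps, minpts) ranges stated in the text, keeping the best ARI."""
    best_pars, best_ari = None, -np.inf
    for eps in np.arange(0.1, 0.51, 0.1):
        for minpts in range(5, 31, 5):
            labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
            ari = adjusted_rand_score(y_true, labels)
            if ari > best_ari:
                best_pars, best_ari = (eps, minpts), ari
    return best_pars, best_ari
```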
In addition, spectral clustering [25] is a well-known and widely used algorithm. Since it uses k nearest neighbors to construct an adjacency matrix, just as our method does, we also ran it on the data sets Spiral, Swiss Roll, Compound, Pathbased and Aggregation.

4.1. Experiments on Artificial Data Sets and Results Analysis

Table 1 lists the basic information of the artificial data sets used in this paper. These data sets are all two-dimensional and composed of clusters with different densities, shapes and spatial arrangements [20,26].
Table 2 shows the parameter settings (dubbed par) and the test results of the above seven algorithms on the artificial data sets, including the number of clusters obtained versus the true number of clusters (dubbed c/C) and the values of F1, AMI and ARI.
Meanwhile, we display some of the results of the seven algorithms on these data sets. Figure 3 and Figure 4 show the results on Compound and Pathbased, and the other results are shown in Appendix A, Figure A1, Figure A2, Figure A3, Figure A4 and Figure A5. Different clusters are distinguished by marks with different shapes and colors, and small black dots represent unrecognized data, i.e., noise points.
Compound is composed of six classes with different densities. In the upper-left corner, two classes are adjacent to each other and both follow Gaussian distributions. On the right side, an irregularly shaped class is surrounded by a sparsely distributed class, and the two classes overlap spatially. In the bottom-left corner, a small disk-like class is encompassed by a ring-shaped class. Distinguishing all the classes in Compound correctly is a big challenge because they have different densities, various shapes and complex spatial relationships. As shown in Figure 3, our method classified all points correctly except one point on the border between the two classes in the upper-left corner. RNN-DBSCAN also found six clusters, but nearly 75 percent of the points of the sparse class on the right side were misclassified into the dense cluster. DBSCAN only detected five clusters, and a large number of points were treated as noise, forming the sixth cluster. DPC distinguished the two Gaussian classes and the disk class almost correctly, but partitioned the ring class into two parts and mingled the sparse class with the dense one. The results of Ward-link and KMeans++ show that the two methods are unable to cluster data sets with arbitrarily shaped clusters. Meanshift cannot partition two clusters when one contains another, so it only recognized four clusters in Compound.
Pathbased has three classes: one consists of 110 points forming an unclosed thin ring, and the other two contain 97 and 93 points, respectively, and are both enclosed by the ring. It is easy to misclassify the points that belong to the ring class but lie near the other classes into their adjacent classes. As shown in Figure 4, DDNFC and RNN-DBSCAN obtained nearly correct results; a few points in the area between the ring and the left class were misclassified, and one point was left unrecognized by RNN-DBSCAN. DBSCAN found two clusters but treated points of the ring class as noise. The other four algorithms divided the ring cluster into three parts: the top arc area was regarded as one class, and the other two parts were assigned to the other two clusters, respectively.
Jain is composed of two lunate clusters with different densities; the upper-left one is sparse and unevenly distributed, while the other is dense and long. DDNFC partitioned the two classes completely correctly. DBSCAN again regarded the lower-density class as noise. The other methods showed different degrees of misclassification.
Flame has two classes with similar densities but different shapes, and they are very close to each other. As the results in Figure A2 and Table 2 show, the density-based algorithms can find the two clusters relatively accurately, except for some classification errors in the adjacent parts of the two clusters. DDNFC obtained a completely correct result, while DBSCAN treated two outliers in the upper-left corner as noise. On this data set, the density-based algorithms are obviously superior to the other three clustering algorithms.
The characteristics of t8.8k and t7.10k are similar to those of Compound: the clusters have different shapes and some are embraced or semi-embraced by others, but the two data sets are seriously contaminated by noise. On these two data sets, DDNFC and RNN-DBSCAN obtained more accurate clustering results than the other methods.
The seven clusters of different sizes in Aggregation are independent of each other, except for two pairs that are slightly connected. The four density-based methods achieved better results than the others. DDNFC misclassified three points located in the adjacent area between the two clusters on the right. RNN-DBSCAN misclassified one point located in the adjacent area between the two connected clusters on the left. DBSCAN regarded one edge point of the upper-left cluster as noise. DPC performed best on this data set.
In addition to the above seven data sets, Table 2 also lists the test results of the seven algorithms on the other five data sets.
The data set t5.8k contains seven clusters, six of which form the word "GEORGE", while another bar-shaped cluster runs through them; the data set also contains noise. As shown in Table 2, the results of all the methods are poor because the bar-shaped cluster is hard to separate from the data set.
Clusters in the data sets Unbalance, R15, S1 and A3 are in general independent of each other, though some of them may be adjacent, and they have different densities and arbitrary shapes. On these data sets, DDNFC performed as well as the other algorithms.
The spectral clustering algorithm in the scikit-learn library [24] needs two parameters: the number of clusters and the number of nearest neighbors.
On Spiral, the k of DDNFC was set to 4 and the two parameters of spectral clustering were 3 and 4, respectively; the results shown in Figure 5 are both entirely correct.
Swissroll has 1500 points. The parameter settings of the two methods were 13 for DDNFC and (6, 13) for spectral clustering. From Figure 6 we can see that the widths of the clusters obtained by spectral clustering are more uniform than those of DDNFC.
On the remaining three data sets, Compound, Pathbased and Aggregation, the parameter settings of spectral clustering were (6, 5), (3, 6) and (7, 9), respectively. As shown in Figure A6, the algorithm cannot classify the clusters of these data sets correctly.
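For reference, the scikit-learn call corresponding to these settings is roughly the following sketch; the wrapper name run_spectral and the fixed random_state are our additions.

```python
from sklearn.cluster import SpectralClustering

def run_spectral(X, n_clusters, n_neighbors):
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity='nearest_neighbors',
                               n_neighbors=n_neighbors,
                               random_state=0)
    return model.fit_predict(X)   # e.g. run_spectral(X_compound, 6, 5)
```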

4.2. Experiments on Real-World Data Sets and Results Analysis

Table 3 lists 12 real-world data sets, which are widely used in clustering and classification methods testing and downloaded from UCI machine learning repository [27] for testing the algorithms in this paper.
Additionally, we preprocessed the data sets when needed:
  • Records with null values, uncertain values or duplicates were removed.
  • Most of the data sets have a class attribute; Table 3 only gives the number of non-class attributes.
  • We kept only the third to ninth features of Echocardiogram because some of its data has missing values.
As shown in Table 3, Ionosphere, SPECT-train and Sonar are sparse data sets because of their high ratio of dimensionality to number of instances. Ionosphere has 351 radar records; each record is composed of 17 pulse numbers described by 2 attributes each, corresponding to complex values, but in our tests we treated the 34 attributes as independent. SPECT-train is a subset of SPECT and has 22 binary attributes and 2 groups with 40 points each. Sonar has 208 records with 60 real-valued features. From Table 4, we can see that DDNFC outperformed the other six algorithms on all benchmarks on Ionosphere by a wide margin and also obtained the best results on two benchmarks on SPECT-train and Sonar. On these three sparse data sets, DPC was better than the other five methods, Meanshift could not distinguish the data, KMeans++ performed well on SPECT-train, and RNN-DBSCAN obtained the highest F1 on Sonar.
Page-blocks is a data set on the classification of the blocks of the page layout of documents. Each record has 4 real-valued and 6 integer-valued features, and the set has 5 classes with 4913, 329, 28, 88 and 115 records, respectively. Haberman contains 306 cases on the survival status of patients with breast cancer who had undergone surgery; the 3 attributes of each record are integers representing the patient's age, the year of the operation and the number of positive axillary nodes detected, and the data set is divided into 2 groups with 225 and 81 instances, respectively. Wilt-train consists of two groups, one with 4265 points and the other with only 74 points. The three data sets are therefore clearly unbalanced. The test results on them show that the density-based clustering methods were better than the others, and DDNFC and RNN-DBSCAN obtained the best results. The two methods performed similarly because both use the reverse k-nearest neighbors model to determine core objects, and on these unbalanced data sets the Nearest-to-First-in strategy of our method makes little difference compared with RNN-DBSCAN. DBSCAN performed badly.
The attributes of Breast-cancer-Wisconsin are of integer category type. The original data set has 458 benign and 241 malignant cases, but we deleted the cases with missing values, leaving 444 benign and 239 malignant cases. Each record of Chess has 36 text-category attributes; the value of each attribute is taken from one of the groups f, t; g, l; b, n, w; and n, t. To calculate the distance between two records, we replaced the text labels with integers such as 0, 1 and 2. The attributes of Breast-cancer-Wisconsin, Chess and Pendigits-train are thus categorical or integer. On Breast-cancer-Wisconsin, Ward-link, KMeans++ and DDNFC obtained much higher benchmark values than the other methods, and DDNFC performed best among the density-based methods. On the other two data sets, the performances of DDNFC, Ward-link, DPC and RNN-DBSCAN were close.
Table 4 shows the parameter settings (dubbed par) and the test results of the above seven algorithms on the real-world data sets, including the number of clusters obtained versus the true number of clusters (dubbed c/C) and the values of F1, AMI and ARI.
Contraceptive-Method-Choice is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey and has attributes of multiple types, including integer, categorical and binary. The attributes of Echocardiogram are also composed of different numeric types with different value scales. On Echocardiogram, all methods except DBSCAN obtained similar results; KMeans++ was the best overall, while DDNFC was the best among the density-based methods. On Contraceptive-Method-Choice, only four methods obtained the correct number of clusters, and DDNFC achieved the best F1.
Segmentation-test is a subset of Segmentation, which has 19 real-valued attributes and 7 groups of 300 points each. It is an ordinary data set. RNN-DBSCAN and DBSCAN did not find all 7 clusters, and DDNFC outperformed all the other methods.

5. Conclusions

In this paper, in order to deal with data sets with mixed-density clusters, we proposed a density clustering method, DDNFC, based on double densities and the Nearest-to-First-in strategy. DDNFC uses two density calculations to estimate the data distribution in a data set. By thresholding the data set with the two densities respectively, DDNFC can more accurately partition the data initially into high-density core areas and a low-density boundary area. In the classification of the low-density points, the sensitivity of the clustering to the input parameter and to the data storage order is reduced by applying the Nearest-to-First-in strategy. Comparisons of the proposed algorithm with other classical algorithms on artificial and real data sets show that DDNFC is better than the other six algorithms as a whole.

Author Contributions

Conceptualization, Y.L. and F.Y.; methodology, Y.L.; validation, F.Y. and Z.M.; writing–original draft preparation, Y.L.; writing–review and editing, D.L. and F.Y.; supervision, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by NSFC under Grant 61773022, the Hunan Provincial Education Department (No. 17A200, No. 18B504), the Natural Science Foundation of Hunan Province (No. 2018JJ2370), and the Scientific Research Fund of Chenzhou Municipal Science and Technology Bureau, Hunan (No. cz2019).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Figures of Experiment Results

Figure A1. Experiment results on Jain data set.
Figure A2. Experiment results on Flame data set.
Figure A3. Experiment results on Cluto-t8.8k data set.
Figure A4. Experiment results on Cluto-t7.10k data set.
Figure A5. Experiment results on Aggregation data set.
Figure A6. Experiment results of spectral clustering on three data sets.

References

  1. Aggarwal, C.C.; Reddy, C.K. Data Clustering Algorithms and Applications, 1st ed.; CRC Press: Boca Raton, FL, USA, 2013; pp. 65–210. [Google Scholar]
  2. Arthur, D.; Vassilvitskii, S. K-means++: The advantages of careful seeding. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
  3. Murtagh, F.; Legendre, P. Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion? J. Classif. 2014, 31, 274–295. [Google Scholar] [CrossRef] [Green Version]
  4. Wu, K.L.; Yang, M.S. Mean shift-based clustering. Pattern Recognit. 2007, 40, 3035–3052. [Google Scholar] [CrossRef]
  5. Jiang, Z.; Liu, X.; Sun, M. A Density Peak Clustering Algorithm Based on the K-Nearest Shannon Entropy and Tissue-Like P System. Math. Probl. Eng. 2019, 2019, 1–13. [Google Scholar] [CrossRef] [Green Version]
  6. Halim, Z.; Khattak, J.H. Density-based clustering of big probabilistic graphs. Evol. Syst. 2019, 10, 333–350. [Google Scholar] [CrossRef]
  7. Wu, C.; Lee, J.; Isokawa, T.; Yao, J.; Xia, Y. Efficient Clustering Method Based on Density Peaks with Symmetric Neighborhood Relationship. IEEE Access 2019, 7, 60684–60696. [Google Scholar] [CrossRef]
  8. Tan, H.; Gao, Y.; Ma, Z. Regularized constraint subspace based method for image set classification. Pattern Recognit. 2018, 76, 434–448. [Google Scholar] [CrossRef]
  9. Chen, Y.; Tang, S.; Zhou, L.; Wang, C.; Du, J.; Wang, T.; Pei, S. Decentralized Clustering by Finding Loose and Distributed Density Cores. Inf. Sci. 2018, 433, 510–526. [Google Scholar] [CrossRef]
  10. Wang, Z.; Yu, Z.; Philip Chen, C.L.; You, J.; Gu, T.; Wong, H.S.; Zhang, J. Clustering by Local Gravitation. IEEE Trans. Cybern. 2018, 48, 1383–1396. [Google Scholar] [CrossRef]
  11. Oktar, Y.; Turkan, M. A review of sparsity-based clustering methods. Signal Process. 2018, 148, 20–30. [Google Scholar] [CrossRef]
  12. Chen, J.; Zheng, H.; Lin, X.; Wu, Y.; Su, M. A novel image segmentation method based on fast density clustering algorithm. Eng. Appl. Artif. Intell. 2018, 73, 92–110. [Google Scholar] [CrossRef]
  13. Zhou, R.; Zhang, Y.; Feng, S.; Luktarhan, N. A novel hierarchical clustering algorithm based on density peaks for complex datasets. Complexity 2018, 2018, 1–8. [Google Scholar] [CrossRef]
  14. Zhang, X.; Liu, H.; Zhang, X. Novel density-based and hierarchical density-based clustering algorithms for uncertain data. Neural Netw. 2017, 93, 240–255. [Google Scholar] [CrossRef] [PubMed]
  15. Wu, B.; Wilamowski, B.M. A fast density and grid based clustering method for data with arbitrary shapes and noise. IEEE Trans. Ind. Inform. 2017, 13, 1620–1628. [Google Scholar] [CrossRef]
  16. Lv, Y.; Ma, T.; Tang, M.; Cao, J.; Tian, Y.; Al-Dhelaan, A.; Al-Rodhaan, M. An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing 2016, 171, 9–22. [Google Scholar] [CrossRef]
  17. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
  18. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  19. Vadapalli, S.; Valluri, S.R.; Karlapalem, K. A simple yet effective data clustering algorithm. In Proceedings of the IEEE International Conference on Data Mining, Hong Kong, China, 18–22 December 2006; pp. 1108–1112. [Google Scholar] [CrossRef]
  20. Fränti, P.; Sieranoja, S. K-means properties on six clustering benchmark datasets. Appl. Intell. 2018, 48, 4743–4759. [Google Scholar] [CrossRef]
  21. Bryant, A.; Cios, K. RNN-DBSCAN: A density-based clustering algorithm using reverse nearest neighbor density estimates. IEEE Trans. Knowl. Data Eng. 2018, 30, 1109–1121. [Google Scholar] [CrossRef]
  22. Romano, S.; Vinh, N.X.; Bailey, J.; Verspoor, K. Adjusting for chance clustering comparison measures. J. Mach. Learn. Res. 2016, 17, 1–32. [Google Scholar]
  23. Xie, J.; Xiong, Z.Y.; Dai, Q.Z.; Wang, X.X.; Zhang, Y.F. A new internal index based on density core for clustering validation. Inf. Sci. 2020, 506, 346–365. [Google Scholar] [CrossRef]
  24. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  25. Von Luxburg, U. A tutorial on spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
  26. Karypis, G.; Han, E.H.; Kumar, V. Chameleon: Hierarchical clustering using dynamic modeling. Computer 1999, 32, 68–75. [Google Scholar] [CrossRef] [Green Version]
  27. Dua, D.; Graff, C. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 1 October 2017).
Figure 1. Compound data set and the offsets of local position of its points.
Figure 2. Result of Algorithm 1: initial clustering.
Figure 3. Experiment results on Compound data set.
Figure 4. Experiment results on Pathbased data set.
Figure 5. Experiment results on Spiral data set.
Figure 6. Experiment results on Swissroll data set.
Table 1. Artificial data sets.

Data Set | Size | Dimension | Cluster Number
Compound | 399 | 2 | 6
Pathbased | 300 | 2 | 3
Flame | 240 | 2 | 2
Aggregation | 788 | 2 | 7
Jain | 373 | 2 | 3
R15 | 600 | 2 | 15
Unbalance | 6500 | 2 | 8
A3 | 7500 | 2 | 50
S1 | 5000 | 2 | 15
t5.8k | 8000 | 2 | 6
t7.10k | 10,000 | 2 | 9
t8.8k | 8000 | 2 | 8
Spiral | 312 | 2 | 3
Table 2. Experiment results on artificial data sets.

Algorithm | Par | c/C | F1 | AMI | ARI | Par | c/C | F1 | AMI | ARI

Compound / Pathbased
DDNFC | 9 | 6/6 | 1.00 | 0.99 | 1.00 | 9 | 3/3 | 0.98 | 0.92 | 0.94
DPC | 1.0 | 6/6 | 0.78 | 0.82 | 0.62 | 1.5 | 3/3 | 0.74 | 0.55 | 0.50
RNN-DBSCAN | 8 | 6/6 | 0.89 | 0.86 | 0.86 | 6 | 4/3 | 0.98 | 0.93 | 0.95
DBSCAN | 0.2/5 | 6/6 | 0.94 | 0.89 | 0.93 | 0.3/11 | 3/3 | 0.96 | 0.84 | 0.88
Ward-Link | 6 | 6/6 | 0.72 | 0.69 | 0.55 | 3 | 3/3 | 0.72 | 0.53 | 0.48
KMeans++ | 6 | 6/6 | 0.70 | 0.68 | 0.54 | 3 | 3/3 | 0.70 | 0.51 | 0.46
MeanShift | - | 4/6 | 0.83 | 0.74 | 0.78 | - | 3/3 | 0.70 | 0.51 | 0.46

Jain / Flame
DDNFC | 18 | 2/2 | 1.00 | 1.00 | 1.00 | 17 | 2/2 | 1.00 | 1.00 | 1.00
DPC | 1.0 | 2/2 | 0.93 | 0.61 | 0.71 | 4.5 | 2/2 | 1.00 | 0.96 | 0.98
RNN-DBSCAN | 15 | 2/2 | 0.99 | 0.92 | 0.97 | 8 | 2/2 | 0.99 | 0.93 | 0.97
DBSCAN | 0.25/14 | 2/2 | 1.00 | 1.00 | 1.00 | 0.45/11 | 3/2 | 0.99 | 0.91 | 0.97
Ward-Link | 2 | 2/2 | 0.87 | 0.47 | 0.51 | 2 | 2/2 | 0.72 | 0.33 | 0.19
KMeans++ | 2 | 2/2 | 0.80 | 0.34 | 0.32 | 2 | 2/2 | 0.84 | 0.39 | 0.45
MeanShift | - | 2/2 | 0.62 | 0.21 | 0.02 | - | 2/2 | 0.86 | 0.43 | 0.50

cluto-t8.8k / cluto-t7.10k
DDNFC | 24 | 9/9 | 0.94 | 0.92 | 0.95 | 35 | 10/10 | 0.90 | 0.87 | 0.89
DPC | 1.0 | 9/9 | 0.58 | 0.60 | 0.48 | 1.0 | 10/10 | 0.55 | 0.57 | 0.37
RNN-DBSCAN | 22 | 9/9 | 0.94 | 0.91 | 0.94 | 35 | 10/10 | 0.90 | 0.87 | 0.90
DBSCAN | 0.1/23 | 10/9 | 0.92 | 0.90 | 0.91 | 0.1/8 | 10/10 | 0.44 | 0.28 | 0.20
Ward-Link | 9 | 9/9 | 0.49 | 0.53 | 0.33 | 10 | 10/10 | 0.55 | 0.61 | 0.41
KMeans++ | 9 | 9/9 | 0.52 | 0.55 | 0.36 | 10 | 10/10 | 0.47 | 0.54 | 0.34
MeanShift | - | 7/9 | 0.51 | 0.54 | 0.36 | - | 6/10 | 0.54 | 0.58 | 0.44

Aggregation / cluto-t5.8k
DDNFC | 6 | 7/7 | 0.99 | 0.99 | 0.99 | 34 | 7/7 | 0.82 | 0.79 | 0.78
DPC | 2.0 | 7/7 | 1.00 | 1.00 | 1.00 | 2.0 | 7/7 | 0.79 | 0.79 | 0.74
RNN-DBSCAN | 5 | 7/7 | 1.00 | 0.99 | 0.99 | 39 | 7/7 | 0.82 | 0.80 | 0.79
DBSCAN | 0.3/23 | 7/7 | 0.94 | 0.91 | 0.91 | 0.15/11 | 7/7 | 0.28 | 0.05 | 0.01
Ward-Link | 7 | 7/7 | 0.90 | 0.88 | 0.81 | 7 | 7/7 | 0.78 | 0.79 | 0.74
KMeans++ | 7 | 7/7 | 0.85 | 0.83 | 0.76 | 7 | 7/7 | 0.78 | 0.79 | 0.74
MeanShift | - | 6/7 | 0.90 | 0.84 | 0.84 | - | 6/7 | 0.81 | 0.79 | 0.78

Unbalance / R15
DDNFC | 63 | 8/8 | 1.00 | 1.00 | 1.00 | 24 | 15/15 | 0.99 | 0.99 | 0.99
DPC | 1.0 | 8/8 | 1.00 | 1.00 | 1.00 | 1.0 | 15/15 | 1.00 | 0.99 | 0.99
RNN-DBSCAN | 25 | 8/8 | 1.00 | 1.00 | 1.00 | 12 | 15/15 | 1.00 | 0.99 | 0.99
DBSCAN | 0.45/11 | 8/8 | 0.79 | 0.70 | 0.61 | 0.1/17 | 15/15 | 0.76 | 0.62 | 0.27
Ward-Link | 8 | 8/8 | 1.00 | 1.00 | 1.00 | 15 | 15/15 | 0.99 | 0.99 | 0.98
KMeans++ | 8 | 8/8 | 1.00 | 1.00 | 1.00 | 15 | 15/15 | 1.00 | 0.99 | 0.99
MeanShift | - | 8/8 | 1.00 | 1.00 | 1.00 | - | 8/15 | 0.58 | 0.58 | 0.27

S1 / A3
DDNFC | 29 | 15/15 | 0.99 | 0.99 | 0.99 | 36 | 50/50 | 0.97 | 0.97 | 0.94
DPC | 2.5 | 15/15 | 0.99 | 0.99 | 0.99 | 1.0 | 50/50 | 0.95 | 0.95 | 0.91
RNN-DBSCAN | 20 | 16/15 | 0.99 | 0.99 | 0.99 | 39 | 49/50 | 0.97 | 0.97 | 0.95
DBSCAN | 0.1/8 | 15/15 | 0.94 | 0.93 | 0.89 | 0.1/29 | 37/50 | 0.78 | 0.86 | 0.67
Ward-Link | 15 | 15/15 | 0.99 | 0.98 | 0.98 | 50 | 50/50 | 0.97 | 0.97 | 0.94
KMeans++ | 15 | 15/15 | 0.99 | 0.99 | 0.99 | 50 | 50/50 | 0.96 | 0.97 | 0.94
MeanShift | - | 9/15 | 0.64 | 0.70 | 0.56 | - | - | - | - | -
Note: The bold numbers in the table indicate the best results obtained; the underlined numbers mean that the algorithm obtained an incorrect number of clusters.
Table 3. Real-world data sets.

Data Set | Size | Dimension | Cluster Number
Ionosphere | 351 | 34 | 2
Page-blocks | 5437 | 10 | 5
Haberman | 306 | 3 | 2
SPECT-train | 80 | 22 | 2
Segmentation | 2100 | 19 | 7
Chess | 3196 | 36 | 2
Breast-cancer-Wisconsin | 699 | 10 | 2
Pendigits-train | 7494 | 16 | 10
Sonar | 208 | 60 | 2
Echocardiogram | 106 | 7 | 2
Contraceptive-Method-Choice | 1473 | 9 | 3
Wilt-train | 4339 | 5 | 2
Table 4. Experiment results on real-world data sets.

Algorithm | Par | c/C | F1 | AMI | ARI | Par | c/C | F1 | AMI | ARI

Ionosphere / Page-blocks
DDNFC | 22 | 2/2 | 0.89 | 0.47 | 0.60 | 25 | 5/5 | 0.90 | 0.26 | 0.45
DPC | 4.5 | 2/2 | 0.79 | 0.22 | 0.32 | 1.5 | 5/5 | 0.86 | 0.04 | 0.10
RNN-DBSCAN | 6 | 2/2 | 0.69 | 0.01 | 0.01 | 19 | 5/5 | 0.90 | 0.26 | 0.48
DBSCAN | 0.45/8 | 2/2 | 0.65 | 0.09 | -0.05 | 0.1/5 | 7/5 | 0.86 | 0.04 | 0.07
Ward-Link | 2 | 2/2 | 0.72 | 0.13 | 0.19 | 3 | 5/5 | 0.80 | 0.05 | 0.02
K-Means++ | 2 | 2/2 | 0.72 | 0.13 | 0.18 | 5 | 5/5 | 0.82 | 0.05 | 0.02
MeanShift | - | 1/2 | - | - | - | - | 1/5 | - | - | -

Haberman / SPECT-train
DDNFC | 10 | 2/2 | 0.73 | 0.01 | 0.05 | 4 | 2/2 | 0.70 | 0.14 | 0.21
DPC | 1.5 | 2/2 | 0.61 | -0.00 | 0.01 | 1.0 | 2/2 | 0.66 | -0.01 | -0.00
RNN-DBSCAN | 10 | 2/2 | 0.73 | -0.00 | 0.01 | 19 | 2/2 | 0.64 | 0.10 | 0.04
DBSCAN | 0.1/5 | 1/2 | - | - | - | 0.1/5 | 2/2 | 0.62 | 0.08 | 0.06
Ward-Link | 2 | 2/2 | 0.61 | 0.00 | -0.02 | 2 | 2/2 | 0.61 | 0.03 | 0.04
K-Means++ | 2 | 2/2 | 0.55 | -0.00 | -0.00 | 2 | 2/2 | 0.70 | 0.17 | 0.17
MeanShift | - | 2/2 | 0.73 | -0.00 | 0.01 | - | 1/2 | - | - | -

Segmentation / Chess
DDNFC | 17 | 7/7 | 0.67 | 0.59 | 0.42 | 15 | 2/2 | 0.67 | 0.00 | 0.00
DPC | 1.0 | 7/7 | 0.57 | 0.52 | 0.35 | 2.0 | 2/2 | 0.61 | 0.00 | 0.01
RNN-DBSCAN | 13 | 6/7 | 0.60 | 0.56 | 0.38 | 9 | 2/2 | 0.67 | 0.00 | 0.00
DBSCAN | 0.1/5 | 1/7 | - | - | - | 0.1/5 | 1/2 | - | - | -
Ward-Link | 7 | 7/7 | 0.57 | 0.46 | 0.32 | 2 | 2/2 | 0.59 | 0.01 | 0.00
K-Means++ | 7 | 7/7 | 0.54 | 0.48 | 0.33 | 2 | 2/2 | 0.51 | 0.00 | 0.00
MeanShift | - | 7/7 | 0.38 | 0.20 | 0.10 | - | 1/2 | - | - | -

Breast-cancer-Wisconsin / Pendigits-train
DDNFC | 27 | 2/2 | 0.94 | 0.65 | 0.77 | 60 | 10/10 | 0.73 | 0.70 | 0.55
DPC | 3.0 | 2/2 | 0.59 | 0.20 | 0.02 | 1.0 | 10/10 | 0.73 | 0.73 | 0.61
RNN-DBSCAN | 39 | 2/2 | 0.69 | 0.00 | 0.00 | 32 | 10/10 | 0.68 | 0.63 | 0.41
DBSCAN | 0.1/26 | 2/2 | 0.68 | 0.03 | -0.03 | 0.1/5 | 1/10 | - | - | -
Ward-Link | 2 | 2/2 | 0.97 | 0.79 | 0.87 | 10 | 10/10 | 0.75 | 0.72 | 0.59
K-Means++ | 2 | 2/2 | 0.96 | 0.74 | 0.85 | 10 | 10/10 | 0.71 | 0.67 | 0.54
MeanShift | - | 4/2 | 0.95 | 0.68 | 0.84 | - | 3/10 | 0.34 | 0.23 | 0.17

Sonar / Echocardiogram
DDNFC | 11 | 2/2 | 0.66 | 0.02 | 0.01 | 7 | 2/2 | 0.71 | 0.03 | 0.05
DPC | 1.0 | 2/2 | 0.57 | -0.00 | -0.00 | 1.0 | 2/2 | 0.70 | 0.01 | 0.06
RNN-DBSCAN | 7 | 2/2 | 0.67 | -0.00 | -0.00 | 4 | 2/2 | 0.70 | -0.00 | 0.04
DBSCAN | 0.35/5 | 3/2 | 0.65 | 0.05 | -0.01 | 0.1/5 | 1/2 | - | - | -
Ward-Link | 2 | 2/2 | 0.58 | -0.00 | -0.00 | 2 | 2/2 | 0.72 | 0.10 | 0.16
K-Means++ | 2 | 2/2 | 0.54 | 0.00 | 0.00 | 2 | 2/2 | 0.73 | 0.11 | 0.19
MeanShift | - | - | - | - | - | - | 2/2 | 0.71 | 0.00 | 0.04

Contraceptive-Method-Choice / Wilt-train
DDNFC | 9 | 3/3 | 0.52 | 0.00 | -0.00 | 12 | 2/2 | 0.68 | 0.00 | -0.00
DPC | 1.0 | 3/3 | 0.43 | 0.02 | 0.02 | 1.0 | 2/2 | 0.68 | -0.00 | -0.00
RNN-DBSCAN | 19 | 2/3 | 0.52 | -0.00 | -0.00 | 29 | 2/2 | 0.68 | -0.00 | -0.00
DBSCAN | 0.1/5 | 1/3 | - | - | - | 0.1/5 | 1/2 | - | - | -
Ward-Link | 3 | 3/3 | 0.39 | 0.03 | 0.02 | 2 | 2/2 | 0.55 | 0.01 | -0.00
K-Means++ | 3 | 3/3 | 0.41 | 0.03 | 0.03 | 2 | 2/2 | 0.55 | 0.00 | 0.00
MeanShift | - | 2/3 | 0.46 | 0.01 | 0.01 | - | 3/2 | 0.68 | 0.01 | -0.01
Note: The bold numbers in the table indicate the best results obtained; the underlined numbers mean that the algorithm obtained an incorrect number of clusters. If an algorithm failed to classify a data set, the corresponding results are marked with '-'.
