In order to verify the effectiveness of DDNFC, we compare it with six other algorithms, DPC, RNN-DBSCAN [21], DBSCAN, KMeans [2], Ward-link and Meanshift [4], on artificial and real data sets. Using three cluster validity indexes [1,22,23], the F-measure (F1), adjusted mutual information (AMI) and adjusted Rand index (ARI), we evaluate and discuss the performance of these seven methods. All algorithms are implemented in Python (version 3.7). DPC and RNN-DBSCAN are coded according to their original articles; the other five methods are the built-in algorithms or models provided by the scikit-learn library [24]. The parameters of the algorithms are set as follows: (1) Meanshift uses the default parameters of the scikit-learn model; (2) the input parameters of KMeans and Ward-link are both set to the true number of clusters, and KMeans adopts KMeans++ to initialize the cluster centers; (3) to obtain the best result of DBSCAN, we use grid search to find the optimal values of its two parameters, eps and MinPts: eps varies from 0.1 to 0.5 with an interval of 0.1, while MinPts ranges from 5 to 30 in steps of 5; (4) the cutoff distance required by DPC is computed from an input percentage value, and the optimal percentage is searched from 1.0 to 5.0 in steps of 0.5; to simplify the implementation of DPC, we directly select as cluster centers the c points with the top values of γ, defined in [18] as the product of the local density ρ and the distance δ, and the density of DPC is calculated with a Gaussian kernel; (5) the optimal value of the parameter k of our method and of RNN-DBSCAN is searched within a given range.
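The grid search for DBSCAN described in (3) can be sketched as follows. This is an illustrative reimplementation with scikit-learn on synthetic data, not the paper's code; the setting with the highest AMI against the ground-truth labels is kept.

```python
# Sketch of the DBSCAN grid search: eps swept over 0.1-0.5 (step 0.1),
# min_samples (MinPts) over 5-30 (step 5); the best setting is chosen by AMI.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

# Synthetic stand-in for one of the benchmark data sets.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

best_ami, best_params = -1.0, None
for eps in np.arange(0.1, 0.51, 0.1):
    for min_samples in range(5, 31, 5):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        ami = adjusted_mutual_info_score(y_true, labels)
        if ami > best_ami:
            best_ami, best_params = ami, (eps, min_samples)

print(best_params, round(best_ami, 3))
```

The same loop works for any external index; AMI is used here because it is one of the three benchmarks reported in the tables.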
In addition, spectral clustering [25] is a well-known and widely used algorithm. Since it uses k-nearest neighbors to construct an adjacency matrix, just as our method does, we also ran it on five data sets: Spiral, Swiss Roll, Compound, Pathbased, and Aggregation.
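A minimal sketch of how spectral clustering with a k-nearest-neighbor affinity graph is invoked in scikit-learn; the data set and parameter values here are illustrative, not those used in the paper.

```python
# SpectralClustering with a k-nearest-neighbor affinity graph, the same
# neighborhood-based construction the comparison in the text refers to.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-circles as a toy non-convex data set.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2,            # number of clusters
                        affinity="nearest_neighbors",
                        n_neighbors=10,          # size of the kNN graph
                        random_state=0)
labels = sc.fit_predict(X)
print(sorted(set(labels)))
```

The two parameters passed here (number of clusters and number of nearest neighbors) are exactly the two inputs the scikit-learn implementation requires, as noted later in the text.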
4.1. Experiments on Artificial Data Sets and Results Analysis
Table 1 displays the basic information of the artificial data sets used in this paper. These data sets are all two-dimensional and composed of clusters with different densities, shapes and orientations [20,26].
Table 2 shows the parameter settings (dubbed par) and the test results of the above seven algorithms on the artificial data sets: the number of clusters each method obtains versus the true number of clusters (dubbed c/C), and the values of F1, AMI and ARI.
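The three indexes can be computed as sketched below (our own illustration, not the authors' code): AMI and ARI come directly from scikit-learn, while the clustering F-measure here first maps each cluster to its best-matching class with the Hungarian algorithm, one common convention.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score, f1_score)

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([1, 1, 1, 0, 0, 2, 2, 2, 2])  # cluster ids are arbitrary

ami = adjusted_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)

# Map each cluster id to the class it overlaps most (Hungarian matching),
# then score the relabeled prediction with the macro F-measure.
classes, clusters = np.unique(y_true), np.unique(y_pred)
cost = np.zeros((len(clusters), len(classes)))
for i, c in enumerate(clusters):
    for j, k in enumerate(classes):
        cost[i, j] = -np.sum((y_pred == c) & (y_true == k))
row, col = linear_sum_assignment(cost)
mapping = {clusters[i]: classes[j] for i, j in zip(row, col)}
y_mapped = np.array([mapping[c] for c in y_pred])
f1 = f1_score(y_true, y_mapped, average="macro")
print(round(f1, 3), round(ami, 3), round(ari, 3))
```

Both AMI and ARI are invariant to cluster relabeling, so only the F-measure needs the matching step.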
Meanwhile, we display some results of the above seven algorithms running on these data sets.
Figure 3 and Figure 4 show the results on Compound and Pathbased; the other results are shown in Appendix A, Figure A1, Figure A2, Figure A3, Figure A4 and Figure A5. Different clusters are distinguished by marks of different shapes and colors. Small black dots represent unrecognized data, i.e., noise points.
Compound is composed of six classes with different densities. In the upper-left corner, two classes are adjacent to each other and both follow a Gaussian distribution. On the right side, an irregularly shaped class is surrounded by a sparsely distributed class, and the two classes overlap spatially. In the bottom-left corner, a small disk-like class is encompassed by a ring-shaped class. Distinguishing all classes in Compound correctly is a big challenge because they have different densities, various shapes and complex spatial relationships. As shown in Figure 3, our method classified all points correctly except one point on the border between the two classes in the upper-left corner. RNN-DBSCAN also found six clusters, but nearly 75 percent of the points of the sparse class on the right side were misclassified into the dense cluster. DBSCAN detected only five clusters and relegated a large number of points to noise instead of recognizing the sixth cluster. DPC distinguished the two Gaussian classes and the disk class almost correctly, but partitioned the ring class into two parts and mingled the sparse class with the dense one. The results of Ward-link and KMeans++ showed that the two methods are unable to cluster data sets with arbitrarily shaped clusters. Meanshift cannot partition two clusters when one contains another, so it recognized only four clusters in Compound.
Pathbased has 3 classes: one contains 110 points forming an unclosed thin ring, and the other two contain 97 and 93 points, respectively, both enclosed by the ring. Points that belong to the ring class but lie near the other classes are easily misclassified into their adjacent class. As shown in Figure 4, DDNFC and RNN-DBSCAN got nearly correct results; only a few points in the region between the ring and the left class were misclassified, and one point was left unrecognized by RNN-DBSCAN. DBSCAN found two clusters but treated the points of the ring class as noise. The other four algorithms divided the ring cluster into three parts: the top arc was regarded as one class, and the other two parts were respectively merged into the other two clusters.
Jain is composed of two lunate clusters with different densities. The upper-left one is sparse and unevenly distributed; the other is dense and long. DDNFC partitioned the two classes completely correctly. DBSCAN again regarded the lower-density class as noise. The other methods showed varying degrees of misclassification.
Flame has two classes with similar densities but different shapes, and they are very close to each other. As the results in Figure A2 and Table 2 show, the density-based algorithms find the two clusters more accurately, apart from some classification errors in the adjacent parts between the two clusters. DDNFC got a completely correct result, while DBSCAN treated two outliers in the upper-left corner as noise. On this data set, the density-based algorithms are clearly superior to the other three clustering algorithms.
The characteristics of t8.8k and t7.10k are similar to Compound: the clusters have different shapes and some are enclosed or partially enclosed by others. However, the two data sets are heavily contaminated by noise. On both data sets, DDNFC and RNN-DBSCAN obtained more accurate clustering results than the other methods.
The seven clusters of different sizes in Aggregation are independent of each other, except for two pairs that are slightly connected. The four density-based methods achieved better results than the others. DDNFC misclassified three points located in the area between the two right-hand clusters; RNN-DBSCAN misclassified one point located between the two connected clusters on the left; DBSCAN regarded one edge point of the upper-left cluster as noise. DPC performed best on this data set.
In addition to the above 7 data sets, Table 2 also lists the test results of the 7 algorithms on the other 5 data sets.
The data set t5.8k contains 7 clusters, 6 of which form the word “GEORGE”, while a bar-shaped cluster runs through them. The data set also contains noise. As shown in Table 2, none of the methods performs well because the bar-shaped cluster is hard to separate from the rest of the data.
The clusters in the data sets Unbalance, R15, S1 and A3 are in general independent of each other, though some of them may be adjacent, and they still have different densities and arbitrary shapes. On these data sets, DDNFC performed as well as the other algorithms.
The spectral clustering algorithm in the scikit-learn library [24] needs two parameters: the number of clusters and the number of nearest neighbors.
On Spiral, the k of DDNFC was set to 4 and the two parameters of spectral clustering to 3 and 4, respectively; the results shown in Figure 5 are entirely correct.
Swiss Roll has 1500 points. The parameter settings of the two methods were 13 for DDNFC and (6, 13) for spectral clustering. From Figure 6 we can see that the widths of the clusters obtained by spectral clustering are more uniform than those of DDNFC.
On the remaining three data sets, Compound, Pathbased and Aggregation, the parameter settings of spectral clustering were (6, 5), (3, 6) and (7, 9), respectively. As shown in Figure A6, the algorithm cannot classify the clusters of these data sets correctly.
4.2. Experiments on Real-World Data Sets and Results Analysis
Table 3 lists 12 real-world data sets that are widely used for testing clustering and classification methods; they were downloaded from the UCI machine learning repository [27].
Additionally, we preprocessed the data sets where necessary. Data with null values, uncertain values, or duplicates were removed. Most of the data sets have a class attribute; Table 3 gives only the number of non-class attributes. For Echocardiogram, we retained only the third to ninth features because some of its data have missing values.
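The preprocessing steps above can be sketched with pandas; the toy table, column names and the "?" placeholder for uncertain values are illustrative assumptions, not the paper's actual files.

```python
# Sketch of the preprocessing: rows with null or uncertain values and
# exact duplicate records are dropped before clustering.
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, None, 2.0, 5.0],
    "b": ["x", "?", "y", "?", "z"],   # "?" stands in for an uncertain value
})
df = df.replace("?", pd.NA)           # treat "?" as missing
df = df.dropna()                      # remove rows with null/uncertain values
df = df.drop_duplicates()             # remove duplicate records
print(len(df))
```

On real UCI files the same three calls apply after loading; only the missing-value marker differs per data set.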
As shown in Table 3, Ionosphere, SPECT-train and Sonar are sparse data sets because of their high ratio of dimensionality to number of instances. Ionosphere has 351 radar returns; each is composed of 17 pulse numbers described by 2 attributes each, corresponding to complex values, but in our tests we treated the 34 attributes as independent. SPECT-train is a subset of SPECT and has 22 binary attributes and 2 groups with 40 points each. Sonar has 208 instances with 60 real-valued features. From Table 4, we can see that DDNFC outperformed the other six algorithms on all benchmarks, by a clear margin on Ionosphere, and also got the best results on two benchmarks on SPECT-train and Sonar. On these three sparse data sets, DPC was better than the remaining five methods, while Meanshift could not distinguish the data at all. KMeans++ performed well on SPECT-train, and RNN-DBSCAN got the highest F1 on Sonar.
Page-block is a data set about classifying the blocks of the page layout of a document. Its data have 4 real-valued and 6 integer-valued features and fall into 5 classes with 4913, 329, 28, 88 and 115 instances, respectively. Haberman contains 306 cases on the survival status of patients who had undergone surgery for breast cancer; the 3 integer attributes of each case represent the patient's age, the year of the operation and the number of positive axillary nodes detected. The data set is divided into 2 groups with 225 and 81 instances, respectively. Wilt-train consists of two groups, one with 4265 points and the other with only 74 points. Obviously, these three data sets are unbalanced. The test results on them show that the density-based clustering methods were better than the others; DDNFC and RNN-DBSCAN got the best results. The two methods performed similarly because both use the reverse k-nearest-neighbor model to determine core objects, and the nearest-on-first-in strategy of our method made little difference from RNN-DBSCAN on these unbalanced data sets. DBSCAN performed badly.
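The reverse k-nearest-neighbor counts that both DDNFC and RNN-DBSCAN use to determine core objects can be computed as follows. This is an illustrative sketch with scikit-learn on random data; the core-object rule in the comment follows RNN-DBSCAN's convention (a point is core when at least k points list it among their k nearest neighbors).

```python
# A point's reverse-kNN count is the number of other points that include it
# in their own k-nearest-neighbor lists.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # toy data set
k = 5

# Ask for k+1 neighbors because each point is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)
knn = idx[:, 1:]                # drop the self-neighbor column

# Count, for each point, how many kNN lists it appears in.
rknn_counts = np.bincount(knn.ravel(), minlength=len(X))
is_core = rknn_counts >= k      # RNN-DBSCAN-style core-object test
print(int(is_core.sum()), len(X))
```

Because every point contributes exactly k edges, the reverse counts always sum to n·k; the counts differ between dense regions (high) and sparse borders (low), which is what makes the criterion density-adaptive.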
The attributes of the data in Breast-cancer-Wisconsin are integer categories. The original data set has 458 benign and 241 malignant cases, but we deleted the cases with missing values, leaving 444 benign and 239 malignant cases. Each record of Chess has 36 text category attributes, whose values are drawn from one of four groups: {f, t}, {g, l}, {b, n, w} and {n, t}. To calculate the distance between two records, we replaced the text labels with integers such as 0, 1 and 2. The attributes of Breast-cancer-Wisconsin, Chess and Pendigits-train are thus categorical or integer. On Breast-cancer-Wisconsin, Ward-link, KMeans++ and DDNFC got much higher benchmark scores than the other methods, and DDNFC performed best among the density-based methods. On the other two data sets, the performances of DDNFC, Ward-link, DPC and RNN-DBSCAN were close.
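The label-to-integer replacement used for Chess can be sketched with scikit-learn's OrdinalEncoder; the feature values below are illustrative, not the actual Chess columns.

```python
# Each text category is mapped to an integer per column so that distances
# between records can be computed.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_text = np.array([["f", "n"],
                   ["t", "w"],
                   ["f", "w"]])

enc = OrdinalEncoder()              # assigns 0, 1, ... per sorted category
X_int = enc.fit_transform(X_text)
print(X_int.tolist())
```

Note that ordinal codes impose an artificial order on unordered symbols; for a distance-based method this is a simplification, which is presumably why the text calls it a replacement "for calculating the distance" rather than a principled encoding.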
Table 4 shows the parameter settings (dubbed par) and the test results of the above seven algorithms on the real-world data sets: the number of clusters each method obtains versus the true number of clusters (dubbed c/C), and the values of F1, AMI and ARI.
Contraceptive-Method-Choice is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey and has attributes of multiple types, including integer, categorical and binary. The attributes of Echocardiogram are also composed of different numeric types with different value scales. On Echocardiogram, all methods except DBSCAN got similar results; KMeans++ was the best overall, while DDNFC was the best among the density-based methods. On Contraceptive-Method-Choice, only four methods got the correct number of clusters, and DDNFC achieved the best F1.
Segmentation-test is a subset of Segmentation; it has 19 real-valued attributes and 7 groups with 300 points each, and is an ordinary data set. RNN-DBSCAN and DBSCAN did not recover all 7 clusters. DDNFC outperformed all the other methods.