Fuzzy C-Means Algorithm Automatically Determining Optimal Number of Clusters

: In clustering analysis, the key to deciding clustering quality is to determine the optimal number of clusters. At present, most clustering algorithms need to give the number of clusters in advance for clustering analysis of the samples. How to gain the correct optimal number of clusters has been an important topic of clustering validation study. By studying and analyzing the FCM algorithm in this study, an accurate and efficient algorithm used to confirm the optimal number of clusters is proposed for the defects of traditional FCM algorithm. For time and clustering accuracy problems of FCM algorithm and relevant algorithms automatically determining the optimal number of clusters, kernel function, AP algorithm and new evaluation indexes were applied to improve the confirmation of complexity and search the scope of traditional fuzzy C-means algorithm, and evaluation of clustering results. Besides, three groups of contrast experiments were designed with different datasets for verification. The results showed that the improved algorithm improves time efficiency and accuracy to certain degree.


Introduction
As the important technology in data mining field, clustering analysis is widely applied in statistics, decision support, machine learning, pattern recognition, picture processing, spatial database technology and e-commerce, etc. It is a very efficient data analysis method. Classical clustering algorithms mainly include partition-based clustering, hierarchical clustering algorithm, density-based method, grid-based method, model-based method and analysis method based on isolated point, etc. The quality of clustering algorithms greatly influences the final results of clustering process. Clustering process is an effective grouping of physical or abstract set of objects. The group generated in clustering results is called cluster. Cluster is the set of objects with certain same features in the database. The concrete manifestations include the following: any objects in the cluster have high similarity, while the objects which do not belong to the same cluster have relatively large dissimilarity. The value of similarity and dissimilarity can be calculated according to various attribute values of description objects. Usually, the distance between any objects is the measurement method which is mostly applied.
As an important method which is widely applied in data analysis, clustering is used to classify the samples as per the specific standards, with the purpose of maximizing intracategory similarity and minimizing intercategory similarity. In clustering analysis, the key to deciding clustering quality is to determine the optimal number of clusters. At present, most clustering algorithms need to give the number of clusters in advance for clustering analysis of the samples. How to gain the correct optimal number of clusters has been an important topic of clustering validation study. The existing method to determine the optimal number of clusters is mainly the fuzzy C-means (FCM) algorithm.

Improved fuzzy C-means algorithm based on kernel function 2.1.1 Fuzzy C-means algorithm analysis
Clustering analysis aims to classify objects according to their different features, degree of intimacy and similarity. The boundary of relations among things is usually not clear (i.e. fuzzy relation), so the application of fuzzy method for clustering analysis becomes inevitable. Fuzzy clustering analysis has been successfully applied in large-scale data analysis, data mining, picture analysis, pattern recognition, information fusion and so on. And, various fuzzy clustering algorithms appear. Among the numerous fuzzy clustering algorithms, fuzzy C-means (FCM) clustering algorithm [Bezdek (1981)] is most widely and successfully applied. It is a clustering analysis method based on objective function. Membership degree of each object to be classified for all centers of clustering can be gained through optimizing objective function so as to decide the category of classification objects and reach the purpose of automatic classification [Chen, Li and Wang (2006)]. FCM clustering algorithm: the set of objects to be classified is set as: wherein, each object has m characteristic indexes, and is set as: Now, the object set A is classified into c categories ( 2 c n ≤ ≤ ). The matrix which consists of vectors of c centers of clustering is set as: It is simplified as: To gain an optimal fuzzy classification, a best fuzzy classification can be chosen from the fuzzy classification space as per clustering norms. To calculate the appropriate fuzzy classification matrix U and center of clustering V, the objective function: is made to reach the minimum. Wherein, certain value can be taken for q (generally q=2).

|| ||
represents the distance between object Aj and the vector of i th category of centers of clustering. Usually, iterative operation is used to figure out the approximate solution of objective function given in Formula (5). The detailed steps are as follows: Step 1: Choose the number of categories c, 2 c n ≤ ≤ ; take a primary fuzzy classification matrix (0) U for gradual iteration, 0,1, 2, l = ⋅⋅⋅ Step 2: For ( ) l U , calculate the center of clustering: in which: Step 3: Amend fuzzy classification matrix ( ) Step 4: Compare ( ) l U and ( 1) Noise in dataset has a great influence on the whole clustering classification process. At present, many algorithms fail to process noise, thus leading to the influence on the dataset classification. Or, noise processing results are unsatisfactory. Noise data processing is too complex or there is no substantive influence of noise reduction. All these lead to some defects of FCM clustering algorithm in practical applications.

Objective function optimization based on kernel function
Kernel function is introduced to enhance optimizing ability of FCM clustering algorithm. It is supposed that the center of clustering k v θ in high-dimensional space can find primary image in the primary space. Then, the objective function changes to According to Mercer kernel definition, Meanwhile, Gaussian radial basis function ( ( , ) 1, K x x x X = ∀ ∈ ) is used as the kernel function for simplification. Then, the objective function of improved fuzzy C clustering algorithm can change to Lagrange multiplication approach is used to the center of clustering and iterative formula of membership matrix: The process of improving fuzzy C clustering algorithm is as follows: Step 1: Initialize. Give weighted index m and the number of clustering categories c(2≤c≤n); set the parameter values of the chosen kernel function; set threshold value of iteration stop ε ; initialize membership matrix (0) U , iteration counter b=0.
Step 4: If < is an appropriate norm), stop updating membership matrix U, otherwise, make 1 U U = + and turn to Step 2.

Improved FCM algorithm and analysis
The division method of standard FCM clustering algorithm is based on the following criterion: The sum E of distance between each data object p and corresponding cluster center is minimum. The computational formula of E is In conclusion, after the kernel function is added, the algorithm complexity of FCM clustering algorithm reduces.

Confirmation of search scope
Since clustering results of FCM clustering algorithm depend on the selection of initial center of clustering, different initial center of clustering will generate different clustering results. Thus, clustering results are unstable. How to determine the optimal clustering according to FCM algorithm is important. Usually, the basic thought of determining the optimal number of clusters is as follows: for the specific dataset, conform the search scope of number of clusters and operate clustering algorithm to gain the clustering results of different number of clusters; choose appropriate validity indexes to evaluate clustering results, and confirm the optimal number of clusters according to the evaluation result. Thus, the core of confirming the optimal number of clusters is to confirm reasonable search scope of number of clusters and evaluation indexes of clustering effectiveness.
To confirm the search scope of number of clusters min max [ , ] k k , min k and max k should be confirmed. min 1 k = refers to even distribution of samples, without obvious characteristic difference. The minimum number of clusters in clustering algorithms is usually 2, i.e. min 2 k = . There still no explicit theoretical direction about how to confirm max k . The empirical rule that most scholars use is: max k n ≤ , which is described in the Literature [Yu and Cheng (2002)]. The conclusion is based on the precondition of uncertainty function But the precondition is not the sufficient condition proved by Literature [Yang, Li, Hu et al. (2006)]. The conclusion is deduced based on the precondition that the sample space has fractal geometrical characteristics, and the conclusion have no universality. Besides, sample size and practical category number of all datasets in Literature (Frey and Dueck 2008) also have no such property. Sample size and practical category number of some datasets in Literature [Brusco (2008)] also have no such property. To sum up, max k n ≤ is only an empirical rule, and does not own universality. In this study, AP algorithm proposed by Frey et al. is applied to confirm max k . The algorithm is fast and effective. It has been well applied in multiple fields.

AP clustering algorithm
AP clustering algorithm [Kapp (2007); Xiao and Yu (2008)] is a kind of clustering algorithm based on affinity information propagation. Its purpose is to find out the set of optimal category representatives so that the sum of similarities of all samples to the nearest category is largest. AP algorithm deems all N samples in the dataset as the candidate category representatives and establishes attraction degree information with other samples for each sample. In other words, similarity between any two samples i To prevent oscillation in the iteration process, AP algorithm introduces the factor λ to prevent oscillation, and the value of λ is between 0 and 1. The update result of ( , ) r i k and ( , ) a i k is gained through the weighing of current iteration value and the last iteration result.

AP feasibility analysis
AP algorithm does not give the number of clusters. When the algorithm ends, the number of clusters is determined automatically. For the clustering structure of intra-category compactness and inter-category alienation, AP algorithm can get the accurate clustering result. But for the close clustering structure, the algorithm tends to generate much local clustering. Thus, the number of clusters is generally large and the accurate clustering results cannot be given [Wang, Li, Zhang et al. (2007)]. Because of its fast speed and effectiveness, AP algorithm rather than C-means clustering algorithm is used to complete initial category number screening of dataset. Since the category number searched by AP algorithm is greater than n , the maximum max k of category number is reduced from n (sample size) to the number of clusters AP k generated by AP algorithm.
Compared with n , the scheme involves clustering structure distribution of samples, which is scientific. The experiment also successfully verifies the feasibility of its scheme.

Confirmation of new evaluation index
At present, many validity evaluation indexes have been proposed to analyze clustering results for FCM algorithm, such as partition coefficient VPC ], partition entropy VPE ], VOS proposed by Kim et al. [Kim, Lee, Lee et al. (2004)] VXB index proposed by Xie et al. [Xie and Beni (1991)], VFS index proposed by Fukuyama et al. [Fukuyama and Sugeno (1989)], VK index pro-posed by Kwon [Kwon (1998)], VCWB index proposed by Rezaee et al. [Rezaee, Lelieveldt and Reiber (1998)], VB index proposed by Boudraa [Boudraa (1999)], VSV index proposed by Kim et al. [Kim, Park and Park (2001)], Wint (Weighted inter-intra) [Boudraa (1999)] and Silhouette [Silhouette (2004)] index. However, due to the defects of these indexes, it is hard for them to judge the clustering results. The clustering validity test effect is not ideal enough. Thus, geometric structure of datasets and clusters with different sizes are fully considered in this study. The specific value of intra-category compactness and inter-category separation degree is combined with clustering membership degree to define a new clustering validity index. Besides, the information of dataset and its geometric structure are fully considered. So, the optimal partition and the optimal number of clusters of fuzzy partition can be accurately confirmed by FCM algorithm. On this basis, a method to confirm the optimal number of clusters of samples is proposed to evaluate the clustering results of AP algorithm and determine the optimal number of clusters of samples.

Definition of compactness Index
Compactness index is used to measure intra-category compactness, and can be expressed with intra-category weighted squared error sum as follows:

Definition of separation index
Separation index is a method to measure separation degree of two fuzzy sets. Dispersion between two categories is defined as follows: the sample belongs to the minimum of membership degree of these two categories. Separation measurement uses the largest difference of all paired fuzzy clusters. So, similarity between two fuzzy sets i F and j F is defined as: Separation measurement of given fuzzy partition is Then, the boundary of separation index is 0

New evaluation index
Since compactness and separation have different scalar quantity, normalization result can be expressed as: These two formulas are effectively combined to get the new evaluation indexes of clustering effect: In the new validity indexes, compactness index ( , ) V c U reflects intra-category total variation, and it is used to express the concentration degree of intra-category samples. When its value is smaller, compactness of category is better. Separation index ( , ) S c U reflects intra-category total variation, and it is used to express the distance among fuzzy clusters. When separation is larger, the partition result is better. ( , ) V c U and ( , ) S c U are combined to reflect the partition features of dataset. When the indexes are smaller, intra-category is more compact, intra-category is more separate and the clustering result is better.

Results
To test validity and operation efficiency of the proposed algorithm, three groups of experiments were applied to carry out simulation test of artificial dataset and true dataset, and the algorithms were compared. There are three artificial datasets: Dataset1, Dataset2 and Dataset3. Dataset1 is composed of two two-dimensional Gaussian distribution data with the centers of (0, 0) and (20,20) respectively. Each category has 400 samples. Dataset2 is composed of four twodimensional Gaussian distribution data with the centers of (0, 0), (5, 7), (12,17) and ( Fig. 1.

Figure 1: Convergence comparison chart of FCM algorithm based on improved kernel function
In the Fig. 1, Line 1 represents the first clustering of FCM algorithm; Line 2 represents the second clustering of FCM algorithm; Line 3 represents the first clustering of improved FCM algorithm; Line 4 represents the second clustering of improved FCM algorithm. According to the figure, the iteration times of improved FCM algorithm is obviously lower than that of traditional FCM algorithm in the process where the target value tends to coincide. Thus, the validity of the algorithm is proved. Experiment 2: Validity experiment of AP algorithm determining upper limit of search. Artificial Dataset3 was chosen asexample. Simulation experiment was carried out for max ap k k = and max k n = , respectively.
To eliminate the influence of this algorithm, PC was used as the evaluation index for contrast experiment. The optimal number of clusters is shown in the following figure. It can be easily seen that, since the search scope confirmed by max k n = is smaller than the practical optimal number of clusters, the accurate optimal number of clusters cannot be gained. The accurate optimal number of clusters is obtained through AP algorithm. Experiment 3: Comparison experiment of several validity indexes Artificial Dataset1 and Dataset2 as well as UCI datasets Iris and Wine were selected to verify several representative evaluation indexes via comparisons. The results are shown in Tab. 2. Judging from the above results, the evaluation indexes proposed in this study can be all converged and get the accurate number of clusters in 4 groups data. They perform better than other indexes. The theoretical research and experimental results indicate that, compared with other indexes and methods, the indexes in this study have better performance and stability.

Discussion
The above three groups contrast experiments verify the timeliness of improved algorithm, validity of search scope and the accuracy of evaluation indexes respectively. The result shows that the proposed fuzzy clustering algorithm automatically determining the number of clusters is reliable.

Conclusion
Based on the analysis of FCM algorithm, an accurate and efficient algorithm used to confirm the optimal number of clusters is proposed in this study to solve the defects of traditional FCM algorithm. The algorithm is improved in the aspects of reducing algorithm complexity, confirming search scope and constructing clustering validity index. In addition, multiple groups of contrast experiments verify the improvement of algorithm with higher efficiency and accuracy. Despite some problems exiting in the algorithm, the future researches will be completed to improve the time efficiency, which is caused by mutual application of various algorithms in the process of automatically determining the number of clusters.