BY-NC-ND 3.0 license Open Access Published by De Gruyter October 8, 2014

Improved FCM Algorithm Based on K-Means and Granular Computing

  • Wei Jia Lu and Zhuang Zhi Yan

Abstract

Fuzzy clustering algorithms are widely used in research, industry, and everyday applications. However, conventional fuzzy algorithms suffer from high computational complexity. This article proposes an improved fuzzy C-means (FCM) algorithm based on K-means and the principle of granularity, aimed at solving two problems of conventional FCM methods: determining the optimal number of clusters and sensitivity to data initialization. The initialization stage of the K-medoid clusters, which differs from other approaches, is strongly representative and capable of detecting clusters of different sizes. Meanwhile, by combining granular computing with FCM, the optimal number of clusters is obtained through an accurate validity function. Finally, the detailed clustering process of the proposed algorithm is presented, and its performance is validated by simulation tests. The test results show that the proposed improved FCM algorithm outperforms existing FCM algorithms in computational complexity, running time, and clustering effectiveness.

1 Introduction

Large volumes of data must be processed in fields such as medicine, computing, and architecture. Although the development of database techniques has increasingly facilitated data processing, the sheer mass of data still makes querying, adding, removing, and modifying it a nightmare [8, 22]. Data mining is the process of extracting required relationships from a large-scale database that is incomplete, ambiguous, noisy, and random [7, 9, 17, 19, 20]. It is deemed a complex interdisciplinary technology involving artificial intelligence, information retrieval, database visualization, statistics, machine learning, database techniques, and so on. Data mining can produce great benefits owing to its advanced techniques and practical value.

Fuzzy C-means (FCM) [2, 4] is a very important clustering algorithm with a wide range of applications. Its main advantage is the ability to convert the clustering problem into a related mathematical problem. However, the FCM algorithm also has shortcomings [12–14], such as the need to define the weighting exponent and the number of clusters, and it easily falls into local optima. These problems are likely to affect the classification results and reduce the efficiency of the FCM algorithm. This article discusses the current state of these problems [11] and resolves them. It presents an approach that combines granular computing with FCM and optimizes the number of clusters by selecting the correct validity function. Meanwhile, the capability of detecting clusters of different sizes solves the problems of the optimal number of clusters and sensitivity to data initialization in conventional FCM methods.
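For concreteness, the baseline FCM iteration alternates between updating the membership matrix U and the cluster centers V until the memberships stabilize. The following NumPy sketch illustrates this baseline; the function name, variable layout, and Euclidean distance are our assumptions, not code from the original article.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """Baseline fuzzy C-means: alternate U- and V-updates until U stabilizes."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                       # each column of U sums to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)               # weighted centers
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # d_ij, shape (c, n)
        D = np.fmax(D, 1e-10)                # guard against zero distances
        W = D ** (-2.0 / (m - 1.0))          # u_ij is proportional to d_ij^(-2/(m-1))
        U_new = W / W.sum(axis=0)
        if np.abs(U_new - U).max() < eps:    # membership change below threshold
            return U_new, V
        U = U_new
    return U, V
```

Note that both the cluster number c and the initial memberships must be supplied up front, which is exactly the sensitivity the proposed algorithm targets.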

The rest of the article is organized as follows. Section 2 reviews current research on fuzzy clustering. Section 3 describes the ideas and steps of the improved FCM algorithm. Section 4 presents a theoretical analysis and an experimental simulation of the improved algorithm. Section 5 concludes the article.

2 Literature Review

Clustering algorithms can be roughly divided into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Ruspini [23] first proposed a fuzzy clustering method based on an objective function, but the first truly effective algorithm, FCM, was given by Dunn [21]. Pal and Bezdek [18] further expanded it and established a fuzzy clustering theory. The idea of the FCM algorithm is simple, and it is highly efficient with a wide range of applications, such as image processing [3], text mining [10], radar signal recognition, etc.

However, the defects of the FCM algorithm are obvious. Guo and Ma [6] proposed a fuzzy clustering algorithm based on simulated annealing and particle swarm optimization (PSO) to address the sensitivity of the FCM algorithm to its initial state and its tendency to fall into local optima. Their algorithm exploits the powerful global optimization ability of PSO and the ability of simulated annealing to jump out of local optima. Related algorithms [15, 16] also overcome the shortcomings of FCM clustering: because of the problem of easily falling into local optima, they introduce ideas from evolutionary computation into FCM to achieve global optimization.

3 Improved FCM Algorithm Based on K-Means and Principle of Granularity

This section mainly describes the proposed algorithm in three aspects: cluster initialization, solution of optimal cluster number, and clustering process.

3.1 Cluster Initialization

The hierarchical clustering algorithm overcomes the selection problem of the initial K-medoid clusters, which arises when the whole data set is treated as one class and arbitrary data objects are taken as cluster centers. A random sampling approach needs sufficient data to guarantee that each class contributes a comparable number of objects. Based on the above discussion, this article proposes a representative method that needs to extract only one sample set to make the K-medoid clusters belong to different classes. The detailed procedure is presented as follows (a code sketch follows the list):

  1. First, choose a sample set from the raw data set.

  2. Second, classify the data objects into k′ clusters by hierarchical clustering (where 2k < k′ < 5k).

  3. Third, remove the cluster centers whose number of members is equal to or less than min_C (the smallest allowed cluster size).

  4. Finally, group the remaining centers into k clusters by the hierarchical clustering algorithm (as shown in Figure 1C). Then take the k centers as the final initial K-medoid clusters.
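A sketch of this initialization under stated assumptions: average-linkage hierarchical clustering from SciPy, k′ = 3k (any value in (2k, 5k) satisfies Step 2), and hypothetical sample_frac and min_c parameters; the function and parameter names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def init_centers(X, k, sample_frac=0.5, min_c=1, seed=0):
    """Steps 1-4: sample, over-cluster to k', drop tiny clusters, regroup to k."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(len(X) * sample_frac), replace=False)
    sample = X[idx]                                    # Step 1: one sample set
    k_prime = 3 * k                                    # Step 2: 2k < k' < 5k
    labels = fcluster(linkage(sample, method='average'),
                      t=k_prime, criterion='maxclust')
    centers = np.array([sample[labels == i].mean(axis=0)  # Step 3: keep clusters
                        for i in range(1, k_prime + 1)    # with > min_c members
                        if (labels == i).sum() > min_c])
    # Step 4: regroup the surviving centers into k clusters (assumes more than
    # k centers survive Step 3) and return their means as the initial centers.
    merged = fcluster(linkage(centers, method='average'), t=k, criterion='maxclust')
    return np.array([centers[merged == i].mean(axis=0) for i in range(1, k + 1)])
```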

Figure 1 Hierarchical Clustering Algorithm for Clustering a Group of Objects.

Step A removes the “isolated points” to avoid their adverse influence on the clustering results. Steps B and D ensure that each K-medoid cluster belongs to a different class and is well represented.

3.2 Solution of Optimal Cluster Number Based on Principle of Granularity

The K-means algorithm adopted in the improved FCM algorithm proposed in this article can solve most of the cluster-center initialization problems. Unfortunately, it cannot settle the number of clusters: without a predefined cluster number, an accurate K-means partition cannot be obtained. Therefore, it is necessary to choose an accurate validity function to determine the optimal cluster number.

To distinguish partitioning results obtained at different granular sizes, each cluster keeps its own sample set. The data are partitioned by sequentially selecting different numbers of clusters (i.e., the pedigree method), and the quality of each partition is weighted by the information granularity and the coupling level.

Degree of coupling of information granules:

$$C_d(c) = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\, d_{ij}^{2}, \qquad i = 1, 2, \ldots, c,\; j = 1, 2, \ldots, n. \tag{1}$$

Degree of partition of information granules:

$$S_d(c) = \frac{\sum_{i,k=1;\, i \neq k}^{c} d_{ik}^{2}}{[c(c-1)]/2}, \qquad i, k = 1, 2, \ldots, c. \tag{2}$$

According to the degree of the partition of the information and coupling granules, the cluster validity function can be expressed as

$$GD(c) = \alpha\, C_d(c) + (1 - \alpha)\, \frac{1}{S_d(c)}, \tag{3}$$

where α and 1 − α are weighting factors that balance the coupling degree and the partition degree, respectively. The factor α for coupling is taken slightly larger than the factor 1 − α for partition; therefore, α = 0.6 and 1 − α = 0.4 are adopted in this study. In a domain with information granules of different sizes, clearer clustering results are obtained by assigning the smaller weight factor to whichever of the partition degree and the coupling degree has the larger varying range; the appropriate choice depends on the data set. A smaller GD(c) therefore indicates a better clustering result, and the value of c that minimizes GD(c) is the optimal number of clusters.
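Under the definitions of Equations (1)–(3), the index can be transcribed directly; the sketch below assumes Euclidean distances, a membership matrix U of shape (c, n) as in the earlier FCM sketch, and that d_ik in Equation (2) is the distance between cluster centers i and k.

```python
import numpy as np

def gd_validity(X, U, V, m=2.0, alpha=0.6):
    """GD(c) of Eq. (3): rewards compact granules (small Cd) and well-separated
    centers (large Sd); a smaller GD(c) means a better partition."""
    n, c = len(X), len(V)
    D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)   # d_ij, shape (c, n)
    Cd = ((U ** m) * D ** 2).sum() / n                          # Eq. (1): coupling
    pair = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2) ** 2
    Sd = pair.sum() / (c * (c - 1) / 2)                         # Eq. (2): partition
    return alpha * Cd + (1 - alpha) / Sd                        # Eq. (3)
```

Note that pair.sum() runs over all ordered pairs i ≠ k (the diagonal is zero), matching the double sum in Equation (2).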

3.3 Clustering Process of the Improved Algorithm

As mentioned above, the K-means algorithm is adopted to improve the existing FCM algorithm, which fails to automatically determine the optimal number of clusters and is sensitive to the input cluster parameters. The optimal cluster number must satisfy c ≤ c_max and is searched over (1, c_max] in this study. The detailed clustering process is presented as follows: (1) obtain the clustering result for c = c_max according to the improved K-means algorithm; (2) take these sets as the K-medoid clustering result of the proposed algorithm; (3) calculate the value of the granularity-based validity function for c = c_max; (4) according to the merging method of K-medoid clusters mentioned above, obtain c − 1 K-medoid clusters; (5) with the cluster number reduced by 1, calculate the clustering result and the validity function value for c = c_max − 1; (6) repeat the above steps until the cluster number reaches its lower bound. Finally, obtain the optimal clustering by comparing all the validity function values. The flowchart of the proposed method is illustrated in Figure 2.

Figure 2 Flowchart of the Improved Algorithm Based on K-Means and Principle of Granularity.

Based on the clustering process of the proposed algorithm, the detailed steps can be implemented as follows:

  1. Set c_max = √n as the maximum initial number of clusters, where n is the number of samples; set m = 2 as the fuzzy coefficient, ε as the threshold to stop the iteration, α as the weight factor, and b_max as the maximum number of iterations; zero the iteration counter, l = 0.

  2. According to the predefined initial number of clusters cmax, determine the ideal clustering results using the improved K-means algorithm.

  3. Update the fuzzy classification matrix U according to the FCM membership update formula.

  4. Update the K-medoid cluster centers V according to the FCM center update formula.

  5. If the threshold is reached, stop the iteration and output the fuzzy classification matrix U and the K-medoid cluster centers V; if not, set l = l + 1 and go back to Step 3.

  6. Save the obtained validity function GD(c).

  7. Calculate the distance between every pair of clusters and merge the two classes with the smallest distance, finally obtaining c − 1 K-medoid clusters.

  8. Take c = c − 1. If c > 1, reset l = 0 and return to Step 3. Otherwise, go to Step 9.

  9. The clustering result corresponding to the minimum GD(c) is the required best clustering result, at which point the clustering algorithm terminates.

Input: Data set ready for cluster analysis; number n of the sample points, maximum iteration number bmax.

Output: Optimal number of clusters, K-medoid cluster, fuzzy classification matrix.

Pseudocode:
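A minimal sketch of Steps 1–9, reusing init_centers and gd_validity from the sketches above; c_max = √n, midpoint merging of the two closest centers, and all identifiers are our assumptions rather than the authors' code.

```python
import numpy as np

def improved_fcm(X, m=2.0, eps=1e-3, b_max=100, alpha=0.6):
    """Steps 1-9: run FCM from c_max = sqrt(n) down to c = 2, merging the two
    closest centers after each round, and keep the partition minimizing GD(c)."""
    c = int(np.sqrt(len(X)))                           # Step 1: c_max
    V = init_centers(X, c)                             # Step 2: seeds from Section 3.1
    results = {}
    while c > 1:
        for _ in range(b_max):                         # Steps 3-5: FCM from centers V
            D = np.fmax(np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2), 1e-10)
            W = D ** (-2.0 / (m - 1.0))
            U = W / W.sum(axis=0)                      # membership update
            V_new = ((U ** m) @ X) / (U ** m).sum(axis=1, keepdims=True)
            done = np.abs(V_new - V).max() < eps
            V = V_new
            if done:
                break
        results[c] = (gd_validity(X, U, V, m, alpha), U, V)    # Step 6: save GD(c)
        pair = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
        np.fill_diagonal(pair, np.inf)
        i, k = np.unravel_index(pair.argmin(), pair.shape)     # Step 7: closest pair
        V = np.vstack([np.delete(V, [i, k], axis=0), (V[i] + V[k]) / 2.0])
        c -= 1                                         # Step 8: c = c - 1
    best_c = min(results, key=lambda cc: results[cc][0])       # Step 9: argmin GD(c)
    return best_c, results[best_c][1], results[best_c][2]
```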

4 Simulation Test

4.1 The Data Sets

To evaluate the performance of the proposed algorithm, simulation tests were carried out on two different data sets. The first one is the Iris Standard Data Set [5]. Fisher’s article is a classic in the field and is referenced frequently to this day. The data set contains three classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.

The second data set is an artificial data set prepared to test the generalization ability of the proposed method. It consists of 1000 samples, each a 50-dimensional data point (see Table 1) [1]. This data set is characterized by containing multiple clusters; the values of the irrelevant features are uniformly distributed over some range; and the major features of different clusters may overlap.

Table 1

Details of the Artificial Data Set.

Cluster | Relevant features | Size
1 | 1, 2, 6, 9, 14, 19, 21, 22, 24, 27, 27, 31, 33, 37, 38, 40, 44, 47, 48, 50 | 726
2 | 1, 2, 3, 5, 8, 11, 12, 16, 18, 20, 23, 24, 27, 29, 30, 31, 33, 34, 36, 37, 38, 39, 44, 45 | 274

4.2 Test Results and Analysis

In the simulation tests, the fuzzy coefficient, iteration threshold, maximum number of iterations, maximum classification number, and weighting factor were initialized as m = 2, ε = 0.001, b_max = 100, c_max = √150 ≈ 12, and α = 0.6, respectively. Each algorithm was run 100 times on the Iris data set and on the artificial data set. The classification accuracy statistics for the conventional FCM algorithm and the improved FCM algorithm are listed in Table 2.
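These settings map directly onto the improved_fcm sketch from Section 3.3; for instance, assuming scikit-learn's copy of the Iris data:

```python
from sklearn.datasets import load_iris

X = load_iris().data                                # 150 samples, 4 features
best_c, U, V = improved_fcm(X, m=2.0, eps=0.001, b_max=100, alpha=0.6)
print(best_c)                                       # expected: 3 (see Table 4)
```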

Table 2

Test Results.

Algorithm | FCM algorithm | Improved FCM algorithm
Number of accurate classifications in the Iris data set (100 runs) | 78 | 99
Accuracy rate in the Iris data set (%) | 78 | 99
Average computation time in the Iris data set (s) | 602 | 410
Number of accurate classifications in the artificial data set (100 runs) | 85 | 97
Accuracy rate in the artificial data set (%) | 85 | 97
Average computation time in the artificial data set (s) | 450 | 305

From the test results in Table 2, the proposed improved FCM algorithm has a higher accuracy rate than the conventional FCM algorithm: its accuracy exceeds 95% on both data sets, and its computation time is reduced. It is evident that the improved method performs better than the existing FCM algorithm.

With regard to computation time, the improved FCM algorithm is faster than the conventional FCM algorithm. This is because the conventional FCM algorithm selects the initial K-medoid clusters randomly, which is time consuming: the bigger the data set, the longer the computation time and the larger the distances that must be evaluated. The improved FCM algorithm, in contrast, obtains the initial K-medoid clusters from the clustering results of the improved K-means algorithm.

For the given data sets, the performance of the conventional and improved FCM algorithms is listed in Table 3.

Table 3

Comparison of the Conventional and Improved FCM Algorithms.

Algorithm | FCM algorithm | Improved FCM algorithm
Error times in the Iris data set | 16 | 8
Error ratio in the Iris data set (%) | 11.76 | 6.45
Quadratic sum of error in the Iris data set | 0.1652 | 0.0666
Error times in the artificial data set | 15 | 10
Error ratio in the artificial data set (%) | 12.46 | 7.83
Quadratic sum of error in the artificial data set | 0.1309 | 0.0794

It can be seen from Table 3 that the improved FCM algorithm enhances the accuracy rate on the Iris data set, with the improved K-means initialization playing a key role. The clustering processes of the conventional FCM algorithm and the improved algorithm are compared in Figure 3, where the latter converges faster than the former.

Figure 3 Comparison of the Conventional and Improved FCM Algorithms.

The varying range of the cluster number was taken as c ∈ [2, 12] in the simulation tests. The validity function values for different cluster numbers are listed in Table 4.

Table 4

Values of Validity Functions for Various Numbers of Clusters.

Number of clusters c | Value of validity function GD(c)
2 | 0.8012
3 | 0.3952
4 | 0.6013
5 | 0.6352
6 | 0.7011
7 | 0.8103
8 | 0.6985
9 | 0.8352
10 | 0.8001
11 | 0.9105
12 | 0.9452

From Table 4, the validity function reaches its minimum at c = 3, which matches the three classes of the Iris data set; the validity function is therefore reasonable.
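Step 9's minimum-GD selection applied to the transcribed Table 4 values indeed returns c = 3:

```python
gd = {2: 0.8012, 3: 0.3952, 4: 0.6013, 5: 0.6352, 6: 0.7011, 7: 0.8103,
      8: 0.6985, 9: 0.8352, 10: 0.8001, 11: 0.9105, 12: 0.9452}
print(min(gd, key=gd.get))  # -> 3, matching the three classes of the Iris data set
```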

On the whole, the improved FCM algorithm solves the two main problems that cannot be handled automatically by conventional FCM algorithms: determining the optimal cluster number and sensitivity to cluster initialization. In addition, the improved FCM algorithm achieves a better clustering effect, better initial K-medoid clusters, and a faster convergence speed.

5 Conclusions

Based on the conventional FCM algorithm, an improved FCM algorithm has been proposed through the combination of granular computing and FCM. Simulation tests have been carried out, and the results show that the improved FCM algorithm converges faster, even for large-scale databases, while maintaining accuracy. Further improvements remain to be made: future work should optimize the operation of the steps, reduce computation time and resource consumption, and address other issues arising in big-data environments.


Corresponding author: Wei Jia Lu, School of Communication and Information Engineering, Shanghai University, Shanghai 200072, China; and Informatics Department, Affiliated Hospital of Nantong University, Nantong 226001, Jiangsu Province, China, e-mail:

Bibliography

[1] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu and J. S. Park, Fast algorithms for projected clustering, SIGMOD Rec. 28 (1999), 61–72. doi:10.1145/304181.304188.

[2] K. M. Bataineh, M. Naji and M. Saqer, A comparison study between various fuzzy clustering algorithms, Jordan J. Mech. Indust. Eng. 5 (2011), 335–343.

[3] L. Chen and C. L. Philip Chen, A multiple-kernel fuzzy C-means algorithm for image segmentation, IEEE Trans. Syst. Man Cybern. B 41 (2011), 1263–1274. doi:10.1109/TSMCB.2011.2124455.

[4] M. Chiang and B. Mirkin, Intelligent choice of the number of clusters in k-means clustering: an experimental study with different cluster spreads, J. Classif. 27 (2010), 1–38. doi:10.1007/s00357-010-9049-5.

[5] R. A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936), 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x.

[6] X. H. Guo and X. P. Ma, Fault diagnosis feature subset selection using rough set, Comput. Eng. Appl. 43 (2007), 221–224.

[7] J. Han and M. Kamber, Data mining: concepts and techniques, China Machine Press, Beijing, 2001.

[8] Q. He, Advance in fuzzy clustering theory and application, Fuzzy Sets Syst. 12 (1998), 89–94.

[9] A. K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Lett. 31 (2010), 651–666. doi:10.1016/j.patrec.2009.09.011.

[10] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman and A. Y. Wu, An efficient k-means clustering algorithm: analysis and implementation, IEEE Trans. Pattern Anal. Machine Intell. 24 (2002), 881–893. doi:10.1109/TPAMI.2002.1017616.

[11] S. Krinidis and V. Chatzis, A robust fuzzy local information c-means clustering algorithm, IEEE Trans. Image Process. 19 (2010), 1328–1337. doi:10.1109/TIP.2010.2040763.

[12] H. Liu, D. Wu, J. Yih and S. Liu, Fuzzy possibility C-mean based on complete Mahalanobis distance and separable criterion, in: Eighth International Conference on Intelligent Systems Design and Applications, pp. 89–94, 2008. doi:10.1109/ISDA.2008.100.

[13] T. N. Long and H. P. Binh, Approach to image segmentation based on interval type-2 fuzzy subtractive clustering, in: Intelligent Information and Database Systems, Springer LNCS 7197, Kaohsiung, Taiwan, pp. 1–10, 2012. doi:10.1007/978-3-642-28490-8_1.

[14] E. G. Mansoori, FRBC: a fuzzy rule-based clustering algorithm, IEEE Trans. Fuzzy Syst. 19 (2011), 960–971. doi:10.1109/TFUZZ.2011.2158651.

[15] U. Maulik and I. Saha, Automatic fuzzy clustering using modified differential evolution for image classification, IEEE Trans. Geosci. Remote Sens. 48 (2010), 3503–3510. doi:10.1109/TGRS.2010.2047020.

[16] S. Niazmardi, S. Homayouni and A. Safari, An improved FCM algorithm based on the SVDD for unsupervised hyperspectral data classification, IEEE J. Select. Top. Appl. Earth Observ. Remote Sens. 6 (2013), 1399–1404. doi:10.1109/JSTARS.2013.2255118.

[17] R. M. Ramze, B. P. F. Lelieveldt and J. H. C. Reiber, A new cluster validity index for the fuzzy c-mean, Pattern Recognition Lett. 19 (1998), 237–246. doi:10.1016/S0167-8655(97)00168-2.

[18] N. R. Pal and J. C. Bezdek, On cluster validity for the fuzzy c-means model, IEEE Trans. Fuzzy Syst. 3 (1995), 370–379. doi:10.1109/91.413225.

[19] H. Q. Sun and Z. Xiong, Fuzzy cluster based on rough set and result evaluating, J. Fudan Univ. (Nat. Sci.) 43 (2011), 819–822.

[20] C. J. Xiao and M. Zhang, Research on fuzzy clustering based on subtractive clustering and fuzzy c-means, Comput. Eng. 31 (2005), 135–137.

[21] J. C. Dunn, Well-separated clusters and the optimal fuzzy partitions, J. Cybernet. 4 (1974), 95–104.

[22] S. L. Yang, Y. S. Li and X. X. Hu, Optimization study on k value of K-means algorithm, Syst. Eng. Theory Pract. 26 (2006), 97–101.

[23] E. H. Ruspini, A new approach to clustering, Inf. Control 15 (1969), 22–32. doi:10.1016/S0019-9958(69)90591-9.

Received: 2014-8-11
Published Online: 2014-10-8
Published in Print: 2015-6-1

©2015 by De Gruyter

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
