Automatic clustering using genetic algorithms

https://doi.org/10.1016/j.amc.2011.06.007

Abstract

When faced with the clustering problem, many clustering methods require the designer to provide the number of clusters as input. Unfortunately, the designer generally does not know this value beforehand. In this article, we develop a genetic algorithm based clustering method called automatic genetic clustering for unknown K (AGCUK). In the AGCUK algorithm, noising selection and division–absorption mutation are designed to keep a balance between selection pressure and population diversity. In addition, the Davies–Bouldin index is employed to measure the validity of clusters. Experimental results on artificial and real-life data sets illustrate the effectiveness of the AGCUK algorithm in automatically evolving the number of clusters and providing the clustering partition.

Introduction

Clustering is a fundamental problem that arises in a great variety of application fields such as pattern recognition, machine learning, and statistics. It is the formal study of algorithms and methods for grouping or classifying objects without category labels. The resulting partition should possess two properties: (1) homogeneity within the clusters, i.e. objects belonging to the same cluster should be as similar as possible, and (2) heterogeneity between the clusters, i.e. objects belonging to different clusters should be as different as possible. Many clustering techniques have been proposed [1], [2]. Among them, the K-means algorithm is an important one. It is an iterative hill-climbing algorithm, and the solution obtained depends on the initial clustering. Although the K-means algorithm has been applied successfully to many practical clustering problems, it may converge to a partition that is significantly inferior to the global optimum [3].
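Since K-means recurs in many of the hybrid methods discussed below, a minimal sketch of Lloyd's iteration (not the paper's own code) makes its initialization dependence concrete: the algorithm only hill-climbs from the starting centers, so a poor start can yield a poor local optimum.

```python
import numpy as np

def kmeans(X, k, centers, iters=100):
    """Plain Lloyd's iteration; the result depends on the initial centers."""
    for _ in range(iters):
        # Assign each object to its nearest center (homogeneity within clusters).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned objects.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # converged to a (possibly local) optimum
            break
        centers = new
    return labels, centers

# Two well-separated blobs; the partition found is whatever local optimum
# the randomly chosen starting centers lead to.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, 2, X[rng.choice(len(X), 2, replace=False)])
```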

Recently, researchers have solved the clustering problem with stochastic optimization methods such as genetic algorithms, tabu search, and simulated annealing. Liu et al. [4] integrated a tabu list into a genetic algorithm based clustering algorithm to prevent a few fitter individuals from occupying the population and to maintain population diversity; in addition, an aspiration criterion is adopted to keep selection pressure. Bandyopadhyay and Maulik [5] designed a genetic clustering approach that uses the K-means algorithm to provide domain knowledge and improve the search capability of genetic algorithms. Laszlo and Mukherjee [6] presented a genetic algorithm for evolving the cluster centers in the K-means algorithm; the set of cluster centers is represented using a hyper-quadtree constructed on the data. Liu et al. [7] combined the K-means algorithm and the tabu search approach to accelerate the convergence of tabu search based clustering. Ng and Wong [8] proposed a tabu search based fuzzy K-modes algorithm for clustering categorical objects. Bandyopadhyay et al. [9] integrated the K-means algorithm into a simulated annealing based clustering method to modify the cluster centroids; by redistributing objects among clusters probabilistically, the method obtains better results than the K-means algorithm. Güngör and Ünler [10] combined the K-harmonic means algorithm and simulated annealing, using the latter to generate non-local moves for the cluster centers and to select the best solution. Liu et al. [11] adopted the noising method, a metaheuristic technique reported by Charon and Hudry [12], to solve the clustering problem; although it has a lower computational cost than Bandyopadhyay et al.'s method [9], it is inferior to the latter in solution quality. By modeling the clustering problem as an optimization problem, Mahdavi et al. [13] proposed a harmony search based clustering algorithm for grouping web documents; they hybridized the K-means algorithm and the harmony search method in two ways, yielding two hybrid algorithms. Pacheco [14] adopted the scatter search approach for clustering under the minimum sum-of-squares criterion, integrating greedy randomized adaptive search procedure (GRASP) based constructions, the H-means+ algorithm, and tabu search. Jarhoui et al. [15] designed a clustering approach based on the combinatorial particle swarm optimization (CPSO) algorithm. In the CPSO method, each particle is represented as a string of length n (where n is the number of objects) and the ith element of the string denotes the group number assigned to object i. The CPSO algorithm obtains better results than a genetic algorithm based clustering method in some cases. Shelokar et al. [16] proposed an ant colony optimization method for grouping N objects into K clusters, employing distributed agents that mimic the way real ants find the shortest path from their nest to a food source and back. Fathian et al. [17] presented an application of honeybee mating optimization to clustering (HBMK-means); experimental simulations show the HBMK-means method to be better than other heuristic clustering algorithms such as the genetic algorithm, simulated annealing, tabu search, and ant colony optimization.
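The label-string representation used by CPSO-style methods is simple to illustrate. The sketch below (an assumption about the encoding, following the description above, not the authors' code) decodes a length-n group-number string into its clusters:

```python
from collections import defaultdict

def decode_label_string(s):
    """Decode the CPSO-style representation: element i of the length-n
    string is the group number assigned to object i."""
    clusters = defaultdict(list)
    for obj, grp in enumerate(s):
        clusters[grp].append(obj)
    return dict(clusters)

# 6 objects assigned to 3 groups:
print(decode_label_string([0, 1, 0, 2, 1, 2]))
# → {0: [0, 2], 1: [1, 4], 2: [3, 5]}
```

One weakness of this encoding is redundancy: relabeling the groups (e.g. swapping labels 0 and 1) yields a different string for the same partition.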

The aforementioned clustering techniques [3], [4], [5], [6], [7], [8], [9], [10], [11], [13], [14], [15], [16], [17] require the designer to provide the number of clusters as input. Unfortunately, in many real-life cases the number of clusters in a data set is not known a priori. Automatically finding a proper number of clusters and providing an appropriate partition under this condition is therefore a challenge. In this paper, our aim is to develop a genetic algorithm based clustering method called automatic genetic clustering for unknown K (AGCUK) to automatically find the number of clusters and provide a proper clustering partition. We design two new operators, noising selection and division–absorption mutation, to keep the balance between selection pressure and population diversity. The Davies–Bouldin index is employed as a measure of the validity of clusters. Experimental results on artificial and real-life data sets are given to illustrate the superiority of the AGCUK algorithm over four known genetic clustering methods.
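The Davies–Bouldin index rewards compact, well-separated clusters and, unlike the raw within-cluster error, does not improve monotonically as K grows, which is what makes it usable as a fitness measure when K itself is evolved. A minimal implementation of the standard index (lower is better; the exact fitness used by AGCUK is described in Section 3):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies–Bouldin index: average, over clusters, of the worst
    ratio (S_i + S_j) / d_ij, where S_i is the mean distance of cluster
    i's members to its centroid and d_ij the distance between centroids.
    Lower values indicate more compact, better-separated clusters."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # S_i: within-cluster scatter.
    S = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(ks, centroids)])
    db = 0.0
    for i in range(len(ks)):
        ratios = [(S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        db += max(ratios)
    return db / len(ks)
```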

The remaining part of this paper is organized as follows. The related work on the automatic clustering method based on genetic algorithms is reviewed in Section 2. In Section 3, we propose the AGCUK algorithm and give detailed descriptions. In Section 4, the choice of the original noise rate rmax and the terminal noise rate rmin is discussed, how to estimate selection pressure and population diversity is given, and performance comparison between AGCUK and four known genetic algorithm based clustering methods is conducted for experimental data sets. Finally, some conclusions are drawn in Section 5.

Section snippets

Related work

In this study, we focus on how to solve the automatic clustering problem using genetic algorithms. In this regard, some attempts have been made to use genetic algorithms for automatically clustering the data. Bandyopadhyay and Maulik [18] applied the variable string length genetic algorithm, with real encoding of the coordinates of the cluster centers in the chromosome, to the clustering problem. Experimental results on artificial and real-life data sets show that their algorithm is able to
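The variable string length idea is that a chromosome holds the real-valued coordinates of the cluster centers back to back, so its length implicitly encodes K and can change under genetic operators. A sketch of this decoding, under the assumption of a flat real vector (the cited paper's exact layout may differ):

```python
import numpy as np

def decode_chromosome(chrom, dim):
    """A chromosome is a flat real vector of concatenated center
    coordinates; its length (a multiple of dim) determines K."""
    chrom = np.asarray(chrom, dtype=float)
    assert chrom.size % dim == 0, "length must be a multiple of dim"
    return chrom.reshape(-1, dim)

# A 6-gene chromosome in 2-D encodes K = 3 centers; a longer or
# shorter chromosome encodes a different K.
centers = decode_chromosome([0.0, 0.0, 5.0, 5.0, 2.0, 8.0], dim=2)
```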

The AGCUK algorithm

In this section, we first briefly introduce two techniques (i.e., genetic algorithms and noising method), and then describe the AGCUK algorithm in detail.
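Before the detailed description, the core idea of the noising method can be sketched in a few lines. This is an illustrative simplification under our own assumptions (uniform noise, linear decay, minimization), not the paper's exact operators: fitness comparisons are perturbed by noise whose rate decays from r_max to r_min over the run, so early generations favor diversity and later generations restore selection pressure.

```python
import random

def noise_schedule(r_max, r_min, G):
    """Linear decay of the noise rate over G generations (G > 1)."""
    return [r_max - (r_max - r_min) * g / (G - 1) for g in range(G)]

def noising_select(parent_fit, child_fit, rate):
    """Noising-style selection step for a minimization problem: compare
    fitness values perturbed by uniform noise of amplitude `rate`.
    A large rate lets worse individuals survive (diversity); a small
    rate makes selection nearly greedy (pressure)."""
    noisy_parent = parent_fit + random.uniform(-rate, rate)
    noisy_child = child_fit + random.uniform(-rate, rate)
    return "child" if noisy_child <= noisy_parent else "parent"
```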

Experimental results

In this paper, computer simulations are conducted in MATLAB on an Intel Pentium D processor running at 3.4 GHz with 512 MB of RAM. For each experimental algorithm, the population size P is 20 and the number of generations G is 50. Each experiment comprises 20 independent trials.
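The reported setup translates into a simple experiment driver. The sketch below is purely illustrative: `run_trial` is a hypothetical placeholder returning a dummy Davies–Bouldin value, standing in for one full AGCUK run; only the trial loop and the summary statistics mirror the protocol above.

```python
import random
import statistics

P, G, TRIALS = 20, 50, 20  # population size, generations, independent trials

def run_trial(seed):
    """Placeholder for one AGCUK run with population P over G generations;
    returns a dummy best Davies-Bouldin value so the loop is runnable."""
    random.seed(seed)
    return random.uniform(0.3, 0.6)

results = [run_trial(s) for s in range(TRIALS)]
print(f"mean={statistics.mean(results):.3f}, best={min(results):.3f}")
```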

Conclusions

Clustering is aimed at discovering structures and patterns in a given data set. As a fundamental problem and technique for data analysis, clustering has become increasingly important. Many clustering methods ask the designer to provide the number of clusters as input. Unfortunately, the number of clusters is in general unknown a priori. In this paper, we propose a genetic algorithm based clustering method called automatic genetic clustering for unknown K (AGCUK). We design two operators:

Acknowledgements

The authors thank Dr. Chih-Chin Lai for his valuable suggestions on our works. This research was supported in part by the National Natural Science Foundation of China (NSFC) under grants 60903074 and 60828005, the National High Technology Research and Development Program of China (863 Program) under grant 2008AA01Z119, the National Basic Research Program of China (973 Program) under grant 2009CB326203, and the US National Science Foundation (NSF) under grant CCF-0905337.

References (38)

  • C.C. Lai et al., A hierarchical evolutionary algorithm for automatic medical image segmentation, Expert Syst. Appl. (2009)
  • D.X. Chang et al., A robust dynamic niching genetic algorithm with niche migration for automatic clustering problem, Pattern Recogn. (2010)
  • L.P.B. Scott et al., Using genetic algorithm to design protein sequence, Appl. Math. Comput. (2008)
  • I. Charon et al., Application of the noising method to the traveling salesman problem, Eur. J. Oper. Res. (2000)
  • W.H. Chen et al., A hybrid heuristic to solve a task allocation problem, Comput. Oper. Res. (2000)
  • I. Charon et al., Noising methods for a clique partitioning problem, Discrete Appl. Math. (2006)
  • C.K. Ting et al., On the harmonious mating strategy through tabu search, Inf. Sci. (2003)
  • M.K. Pakhira et al., Validity index for crisp and fuzzy clusters, Pattern Recogn. (2004)
  • A.K. Jain et al., Data clustering: a review, ACM Comput. Surv. (1999)