A novel hybrid K-means and artificial bee colony algorithm approach for data clustering

Article history: Received September 16, 2016; received in revised format October 22, 2016; accepted April 15, 2017; available online April 17, 2017.

Abstract
Clustering is a popular data mining technique for grouping a set of objects into clusters so that objects in one cluster are very similar and objects in different clusters are quite distinct. The K-means (KM) algorithm is an efficient data clustering method, as it is simple in nature and has linear time complexity. However, it may converge to local minima and depends on the initial cluster centers. The Artificial Bee Colony (ABC) algorithm is a stochastic optimization method inspired by the intelligent foraging behavior of honey bees. To combine the merits of both algorithms, this paper proposes a hybrid algorithm (MABCKM) based on a modified ABC and the KM algorithm. The solutions produced by the modified ABC are treated as initial solutions for the KM algorithm. The performance of the proposed algorithm is compared with the ABC and KM algorithms on various data sets from the UCI repository. The experimental results demonstrate the superiority of the MABCKM algorithm for data clustering applications. © 2018 Growing Science Ltd. All rights reserved.


Introduction
Clustering is an important unsupervised classification technique for discovering hidden patterns or information in a given data set. It is the process of partitioning a set of objects/data into disjoint groups, called clusters, such that objects similar to each other fall in one cluster (and differ from those in other clusters). Similarity may be expressed in terms of Euclidean distance: the smaller the distance, the greater the similarity between two objects or two clusters. Clustering is widely used in numerous applications. The K-means (KM) algorithm is the most widely used clustering algorithm due to its efficiency on large data sets and its fast convergence. It partitions an N-dimensional population so that each partition has small within-class variance. In K-means, each cluster has a center, called the mean, and the algorithm attempts to minimize its objective function (a squared-error function). However, KM has some limitations, such as dependence on the initialization of cluster centers, sensitivity to outliers, non-guaranteed optimal solutions, and formation of unbalanced clusters.
To overcome the initial cluster center dependency problem of KM, a hybrid algorithm (MABCKM) using the modified ABC and KM is proposed. The proposed algorithm uses the output of the modified ABC algorithm as the initial solutions of the KM algorithm. Experimental results on several data sets show that the proposed algorithm is better than the others in terms of efficiency and convergence speed. Section 2 discusses clustering, including the KM algorithm. Section 3 briefly describes the ABC algorithm. Section 4 first describes the modifications to the ABC algorithm, followed by the MABCKM algorithm for clustering problems. Section 5 presents the data sets and experimental results. Finally, Section 6 concludes the paper.

Clustering
Clustering is an important component of data mining, a process of extracting useful information by exploring and analyzing large amounts of data through automatic or semiautomatic means. Clustering methods identify groups or clusters of a data set in a step-by-step manner, such that objects within each cluster are similar to each other (homogeneity within clusters) yet different from those in other clusters (heterogeneity between clusters). Clustering may be performed in two ways: hierarchical clustering and partitional clustering. Both methods are hard (or crisp) in nature, as they assign each data point to one cluster only. Partitional clustering algorithms divide the set of data objects into clusters by iteratively relocating objects without imposing a hierarchy (Gan et al., 2007). The clusters are gradually improved to ensure a high quality of clustering. Center-based clustering algorithms are the most common partitional algorithms and have been used extensively in the literature.

Center-based clustering
Center-based clustering algorithms are very effective in handling large and high-dimensional databases. The most popular of these algorithms is the K-means algorithm, which is simple in nature. K-means is a form of hard partitional clustering, as each data point is assigned to exactly one cluster. In KM, the process of assigning data objects to the disjoint clusters repeats until there is no significant change in the objective function value or in cluster membership. For a set of data objects X = {x_1, …, x_N} partitioned into K disjoint subsets C_1, …, C_K with centers c_1, …, c_K, the objective function is given as:

J = Σ_{j=1}^{K} Σ_{x_i ∈ C_j} ‖x_i − c_j‖²   (1)

K-means is efficient for clustering large data sets, as its complexity is proportional to the size of the data set. It also tends to converge quickly, since it requires few function evaluations. However, KM does not guarantee optimal solutions, although it converges to good ones.
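As an illustration, the KM procedure that minimizes the objective in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal sketch with our own function names, not the authors' implementation:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal Lloyd's K-means: minimizes the within-cluster
    sum of squared distances (the objective in Eq. (1))."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking K distinct data points at random.
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    ofv = sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(K))
    return centers, labels, ofv
```

The random choice of initial centers is exactly the dependency the paper sets out to remove: different seeds can lead Lloyd's iteration into different local minima of Eq. (1).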

Artificial Bee Colony Algorithm
ABC is a population-based optimization algorithm that is iterative in nature. It consists of five phases: the initialization phase, employed bee phase, probabilistic selection phase, onlooker bee phase, and scout bee phase. Bees exploiting a food source they have already visited are employed bees, while bees looking for a food source are unemployed. Onlooker bees wait for information about food sources from the employed bees, and scout bees search for new food sources. Information is exchanged among bees through the waggle dance. There is one employed bee for every food source. The main steps of ABC are as follows:

Initialization phase
The locations of food sources are randomly initialized within the boundary ranges according to Eq. (2):

x_ij = x_j^min + rand(0, 1) · (x_j^max − x_j^min),  i = 1, …, SN, j = 1, …, D   (2)

where SN is the number of food sources, taken as half of the bee colony, D is the dimension of the problem, x_ij is the j-th parameter of the i-th employed bee, and x_j^max and x_j^min are the upper and lower bounds for x_ij.
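Eq. (2) amounts to a single vectorized draw; a minimal sketch (function and parameter names are ours):

```python
import numpy as np

def init_food_sources(SN, D, lb, ub, rng=None):
    """Eq. (2): x_ij = lb_j + rand(0,1) * (ub_j - lb_j)
    for i = 1..SN food sources and j = 1..D dimensions."""
    if rng is None:
        rng = np.random.default_rng()
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    # rng.random((SN, D)) draws uniform numbers in [0, 1) per parameter.
    return lb + rng.random((SN, D)) * (ub - lb)
```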

Employed bee phase
Each employed bee is assigned to a food source for further exploitation. The resulting food source is generated according to Eq. (3):

v_ij = x_ij + φ_ij · (x_ij − x_kj)   (3)

where k ≠ i is a randomly chosen neighbor of i, φ_ij is a random number in the range [−1, 1] that controls the production of neighbor solutions around x_ij, and v_ij is the new solution for x_ij. The fitness of the new food source is then calculated using Eq. (4):

fit_i = 1 / (1 + f_i) if f_i ≥ 0, and fit_i = 1 + |f_i| otherwise   (4)

where f_i is the objective function value associated with food source i and fit_i is its fitness value.
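In standard ABC, Eq. (3) perturbs a single randomly chosen dimension of the solution. A sketch of the neighbor move and the fitness mapping of Eq. (4), with our own function names:

```python
import numpy as np

def neighbor_solution(X, i, rng=None):
    """Eq. (3): v_ij = x_ij + phi * (x_ij - x_kj) for one random
    dimension j and a random neighbor k != i, phi ~ U[-1, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    SN, D = X.shape
    k = rng.choice([n for n in range(SN) if n != i])  # random neighbor
    j = rng.integers(D)                               # random dimension
    v = X[i].copy()
    v[j] = X[i, j] + rng.uniform(-1, 1) * (X[i, j] - X[k, j])
    return v

def fitness(f):
    """Eq. (4): fit = 1/(1+f) if f >= 0 else 1 + |f| (maps a
    minimization objective to a value to maximize)."""
    return 1.0 / (1.0 + f) if f >= 0 else 1.0 + abs(f)
```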

Probabilistic selection phase
For each food source, a probability value is calculated using Eq. (5), and an onlooker bee selects a food source according to this value:

p_i = fit_i / Σ_{n=1}^{SN} fit_n   (5)

where p_i is the selection probability of the i-th solution.
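Eq. (5) is a simple fitness-proportional normalization; a one-function sketch (names are ours):

```python
import numpy as np

def selection_probabilities(fit):
    """Eq. (5): p_i = fit_i / sum_n fit_n, the roulette wheel
    probabilities used by onlooker bees in standard ABC."""
    fit = np.asarray(fit, float)
    return fit / fit.sum()
```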

Onlooker bee phase
Each onlooker bee selects a food source to exploit according to the probability associated with it (the higher the fitness, the higher the probability). The chosen food sources are exploited for better solutions using Eq. (3).

Scout bee phase
If a food source does not produce better solutions within a predefined limit, it is abandoned and the corresponding bee becomes a scout bee. A new food source is then randomly generated in the search space using Eq. (2).

Proposed Modified ABCKM Algorithm
The ABC algorithm is simple, robust, good at exploration, and easy to implement, since it utilizes the adaptable features of the honeybee swarm. However, it has shortcomings, such as slow convergence, a tendency to get trapped in local optima when solving complex multimodal problems, and poor exploitation in its solution search equation. There is therefore scope to enhance the exploitation ability and convergence speed of the ABC algorithm through one or more techniques. A few modifications to different phases of the original ABC algorithm are proposed below in order to obtain better results.

Modifications in original ABC algorithm
The proposed modifications are:
1. Use of chaotic sequences in combination with opposition-based learning in the initialization phase, to generate better initial solutions.
2. Replacement of the roulette wheel selection mechanism with variable tournament selection, and replacement of the worst solution with a random better solution, in the onlooker bee phase.

Initialization phase
In the standard ABC, the random initialization of the population may affect the convergence characteristics and the quality of solutions. To improve convergence and population diversity, chaotic sequences have been used successfully in place of random sequences (Alatas, 2010). These sequences, in combination with the opposition-based learning method, generate better initial solutions (Rahnamayan et al., 2008). Based on these techniques, the algorithm for population initialization is given below (Gao & Liu, 2012):
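The referenced procedure is not reproduced in this text, but a typical chaotic + opposition-based initialization in the spirit of Gao & Liu (2012) can be sketched as follows. The logistic map and the keep-the-best-half selection are standard choices in that literature; the function name, chaotic seed, and burn-in length are our assumptions:

```python
import numpy as np

def chaotic_opposition_init(SN, D, lb, ub, objective, seed=0.7):
    """Sketch of chaotic + opposition-based initialization.
    A logistic map (ch <- 4*ch*(1-ch)) replaces uniform random draws;
    for every candidate x its opposite lb + ub - x is also generated,
    and the SN fittest of the 2*SN candidates are kept
    (smaller objective value = fitter)."""
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    ch = seed
    for _ in range(50):                      # burn-in of the chaotic map
        ch = 4.0 * ch * (1.0 - ch)
    pop = np.empty((SN, D))
    for i in range(SN):
        for j in range(D):
            ch = 4.0 * ch * (1.0 - ch)       # next chaotic number in (0, 1)
            pop[i, j] = lb[j] + ch * (ub[j] - lb[j])
    opp = lb + ub - pop                      # opposition-based candidates
    both = np.vstack([pop, opp])
    scores = np.array([objective(x) for x in both])
    return both[np.argsort(scores)[:SN]]     # keep the SN best
```

Because the opposite of a point within [lb, ub] also lies within [lb, ub], all kept candidates remain feasible.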

Onlooker bees phase
Two modifications are proposed in this phase to improve the quality of solutions. The first is to replace the roulette wheel selection mechanism with a varying tournament selection mechanism, where the tournament size is selected on the basis of the population size and the cycle number. The tournament selection scheme works by holding a tournament among TN individuals chosen from the population, where TN is the tournament size (Blickle & Thiele, 1995; Miller & Goldberg, 1995). A tournament size of TN = 2 is chosen in the early stages for better exploration, and a variable tournament size based on the current cycle number is chosen in the later stages for better exploitation, as given below. The second modification replaces the worst solution with a randomly generated better solution, in order to enhance solution quality as well as convergence speed.
If SN ≤ 10, the tournament size is taken as in Eq. (6); if 10 < SN ≤ 20, it is taken as in Eq. (7); for larger populations it is taken as in Eq. (8), where TN grows in steps of 1 over ten equal intervals of the run, i.e., TN increases by one each time the current cycle number passes a further MCN/10 cycles. Here SN is the number of employed or onlooker bees, TN is the tournament size, and MCN is the maximum cycle number. For a small population, the tournament size is incremented by 1 to find solutions. However, as the population grows, this increment would slow down the algorithm, and hence the tournament size is made dependent on the current cycle. The onlooker bees choose the high-fitness food sources within this tournament, thus speeding up the algorithm. Moreover, replacing the worst-fitness solution with a randomly generated solution provides scope for better-quality solutions.
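Whatever schedule sets TN, the tournament itself is straightforward: draw TN distinct candidates and let the one with the highest selection probability win. A sketch with our own function name:

```python
import numpy as np

def tournament_select(probs, TN, rng=None):
    """Pick TN distinct food sources at random and return the index
    of the one with the highest selection probability (the winner)."""
    if rng is None:
        rng = np.random.default_rng()
    contenders = rng.choice(len(probs), size=TN, replace=False)
    return contenders[np.argmax(np.asarray(probs)[contenders])]
```

With TN = 2 the selection pressure is mild (good for exploration); as TN grows, fitter sources win more often (good for exploitation), which is the rationale for the variable schedule above.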

4.2. Proposed MABCKM algorithm
We propose a hybrid algorithm, called MABCKM, based on the modified ABC and KM for clustering problems. The hybrid algorithm incorporates the merits of KM and the modified ABC to enhance the fitness value of each particle. Each particle is a real-valued vector of dimension K×D, where K is the number of clusters and D is the dimension of the data set. The fitness of each particle is evaluated using Eq. (1): the smaller the objective function value, the higher the fitness. The pseudocode for MABCKM is given below:

1. (Initialization phase)
Initialize the parameters, including the number of food sources SN, the limit, the maximum cycle number MCN, and the current cycle number C = 0; Initialize the food sources using the modified initialization phase given in Section 4.1.1; Evaluate the fitness of the food sources using Eq. (1); Send the employed bees to the current food sources;

2. While (C <= MCN) do
3. (Employed bee phase) for (each employed bee)
Find a new food source in the neighborhood of the old food source using Eq. (3); Evaluate the fitness of the new food source using Eq. (1); Apply greedy selection between the original food source and the new one; end for
4. (Probabilistic selection phase) Calculate the probability value for each food source using Eq. (5);

5. (Onlooker bee phase) t = 1; while (current onlooker bee t <= SN)
Calculate the tournament size based on the population using Eq. (6), (7), or (8); From the chosen tournament, select the food source with the maximum probability value; Generate a new solution for the selected food source using Eq. (3); Evaluate the fitness of the new food source using Eq. (1); Apply greedy selection between the original food source and the new one; t = t + 1; end while
Replace the worst-fitness food source with a randomly produced food source using Eq. (2); generate a new solution using Eq. (3), evaluate its fitness value using Eq. (1), and apply greedy selection between the original food source and the new one;

6. (Scout bee phase)
If (the food source has not improved within the limit) Send a scout bee to a new food source produced using Eq. (2); end if
7. Memorize the best solution obtained so far; C = C + 1;
end while
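The overall hybrid idea, ABC searching over flattened center vectors and KM refining ABC's best solution, can be sketched compactly. This is a simplified illustration using standard ABC components (plain random initialization, no tournament selection) rather than the paper's modified phases; all names and parameter defaults are our assumptions:

```python
import numpy as np

def cluster_ofv(flat, X, K):
    """Eq. (1): sum of squared distances of points to their nearest
    center; `flat` holds K centers flattened to length K*D."""
    centers = flat.reshape(K, -1)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.sum(d.min(axis=1) ** 2)

def mabckm(X, K, SN=10, cycles=50, limit=10, seed=0):
    """Simplified hybrid sketch: a plain ABC search over flattened
    center vectors, whose best solution then seeds one K-means run."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    lo, hi = np.tile(X.min(axis=0), K), np.tile(X.max(axis=0), K)
    pop = lo + rng.random((SN, K * D)) * (hi - lo)      # Eq. (2)
    f = np.array([cluster_ofv(s, X, K) for s in pop])
    trials = np.zeros(SN, dtype=int)
    for _ in range(cycles):
        for i in range(SN):
            # Employed-bee style move: perturb one random dimension.
            k = rng.choice([n for n in range(SN) if n != i])
            j = rng.integers(K * D)
            v = pop[i].copy()
            v[j] += rng.uniform(-1, 1) * (pop[i, j] - pop[k, j])  # Eq. (3)
            fv = cluster_ofv(v, X, K)
            if fv < f[i]:                                # greedy selection
                pop[i], f[i], trials[i] = v, fv, 0
            else:
                trials[i] += 1
            if trials[i] > limit:                        # scout: re-init
                pop[i] = lo + rng.random(K * D) * (hi - lo)
                f[i] = cluster_ofv(pop[i], X, K)
                trials[i] = 0
    centers = pop[f.argmin()].reshape(K, D)
    for _ in range(100):                 # K-means refinement from ABC's best
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels, cluster_ofv(centers.ravel(), X, K)
```

The division of labor mirrors the paper's design: the population-based search supplies a good, initialization-independent starting point, and KM then converges quickly from it.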

Experimental Results and Analysis
Six data sets are employed to test the proposed algorithm. The six data sets, taken from the UCI Machine Learning Repository, are the iris, glass, lung cancer, soybean (small), wine, and vowel data sets. The numbers of clusters, features, and data objects in each data set are given in Table 1. We evaluate and compare the performance of the KM, ABC, and MABCKM algorithms in terms of the objective function of the KM algorithm. The quality of the resulting clusterings is compared using the following six criteria:

• The objective function value (OFV) of the KM algorithm, i.e., Eq. (1). Clearly, the smaller the objective function value, the higher the quality of clustering.
• The F-measure, which uses the ideas of precision and recall from information retrieval (Dalli, 2003; Handl et al., 2003). For each class i and cluster j, precision and recall are defined as p(i, j) and r(i, j) in Eq. (9), and the corresponding value under the F-measure is given by Eq. (10), where we choose b = 1 to weight precision and recall equally. The overall F-measure for a data set of size N is given by Eq. (11). Obviously, the larger the F-measure, the higher the quality of clustering.
• The Silhouette index: the average distance of a data point x_i belonging to cluster C_k from all other data points in C_k is denoted a(x_i). For the other clusters C_l, with l = 1, …, K and l ≠ k, the smallest average distance of x_i to all data points in C_l is denoted b(x_i), as in Eq. (12). The silhouette value s(x_i) is then calculated as in Eq. (13). The overall Silhouette index S is defined as the average of s(x_i) over all data points, as given in Eq. (14). Clearly, a larger value of S indicates good clustering quality.
• The Adjusted Rand Index (ARI), which assumes the generalized hypergeometric distribution as the model of randomness (Hubert & Arabie, 1985). Let n_ij be the number of objects in both class i and cluster j, and let n_i. and n.j be the numbers of objects in class i and cluster j, respectively. The ARI is defined in Eq. (15). A high ARI value indicates a good clustering result.

• The CH index, defined in Eq. (16) as

CH = [tr(B) / (K − 1)] / [tr(W) / (N − K)]   (16)

where tr(B) = Σ_j n_j ‖c_j − c̄‖² and tr(W) = Σ_j Σ_{x_i ∈ C_j} ‖x_i − c_j‖² are the traces of the between-cluster and within-cluster scatter matrices, respectively. Here, n_j is the number of data points belonging to cluster C_j and c̄ is the total mean vector of the entire data set. A large CH value indicates a clustering result of good quality.
• The Davies–Bouldin index (DB), which rewards maximizing the between-cluster distance while minimizing the distance between the cluster centroid and the other data points (Davies & Bouldin, 1979). The index is defined as

DB = (1/K) Σ_{j=1}^{K} R_j,  R_j = max_{l ≠ j} (d_j + d_l) / d(c_j, c_l)

where R_j is the maximum comparison between cluster j and the other clusters in the partition, d_j is the average distance of the points in C_j to its centroid c_j, and d(c_j, c_l) is the distance between centroids. A smaller DB value indicates better clustering.

From Tables 2 to 7, it may be concluded that the proposed algorithm provides the best performance on the various evaluation measures for the iris, glass, lung cancer, and wine data sets, and nearly the best results for the soybean (small) and vowel data sets. The modified algorithm produces good clustering partitions on low-dimensional as well as high-dimensional data sets. The MABCKM algorithm also takes less runtime than ABC on all data sets. Figs. 1 to 6 show the change in objective function value for the different methods as the iteration count increases from 1 to 100. From Figs. 1 to 6, the following may be summarized:

• The KM method provides stable and better convergence compared to ABC. It converges to the global optimum or near the global optimum for all data sets except soybean (small).
• The ABC method shows slower and less stable performance compared to KM. It nevertheless performs well and converges to the global optimum or near the global optimum on the iris, wine, and vowel data sets. The results indicate that ABC is not able to generate sufficiently good results on high-dimensional data sets.
• The hybrid method shows fast, stable, and the best performance on all data sets. The algorithm converges to the global optimum every time and provides excellent results irrespective of the number of samples or the dimensionality of the data sets.
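The external and internal quality measures used above can be sketched in a few self-contained functions. The function names are ours; the formulas follow the standard definitions of the F-measure (Eqs. (9)-(11)), the Silhouette index (Eqs. (12)-(14)), and the ARI (Eq. (15)) as described in the text:

```python
import numpy as np
from math import comb

def f_measure(true_labels, pred_labels, b=1.0):
    """Eqs. (9)-(11): p(i,j) = n_ij/n_j, r(i,j) = n_ij/n_i,
    F(i,j) = (b^2+1)*p*r / (b^2*p + r), F = sum_i (n_i/N) * max_j F(i,j)."""
    t, p = np.asarray(true_labels), np.asarray(pred_labels)
    N, total = len(t), 0.0
    for ci in np.unique(t):
        n_i, best = np.sum(t == ci), 0.0
        for cj in np.unique(p):
            n_ij = np.sum((t == ci) & (p == cj))
            if n_ij:
                prec, rec = n_ij / np.sum(p == cj), n_ij / n_i
                best = max(best, (b * b + 1) * prec * rec / (b * b * prec + rec))
        total += (n_i / N) * best
    return total

def silhouette(X, labels):
    """Eqs. (12)-(14): s(x) = (b - a) / max(a, b), averaged over all
    points; a = mean intra-cluster distance, b = smallest mean
    distance to another cluster (clusters assumed non-singleton)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    n, vals = len(X), []
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        vals.append((b - a) / max(a, b))
    return float(np.mean(vals))

def adjusted_rand_index(true_labels, pred_labels):
    """Eq. (15), generalized hypergeometric model of randomness."""
    t, p = np.asarray(true_labels), np.asarray(pred_labels)
    sum_ij = sum(comb(int(np.sum((t == i) & (p == j))), 2)
                 for i in np.unique(t) for j in np.unique(p))
    sum_a = sum(comb(int(np.sum(t == i)), 2) for i in np.unique(t))
    sum_b = sum(comb(int(np.sum(p == j)), 2) for j in np.unique(p))
    expected = sum_a * sum_b / comb(len(t), 2)
    return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
```

A perfect clustering scores 1.0 under both the F-measure and the ARI regardless of how cluster labels are permuted, which is exactly why these external measures are preferred over raw label accuracy.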

Conclusion
This paper has presented a hybrid clustering algorithm (MABCKM) based on the modified ABC and KM algorithms. The proposed method exhibits the qualities of both algorithms. The modified ABC algorithm incorporates a modified initialization phase to generate better initial solutions. Moreover, it uses variable tournament selection in place of roulette wheel selection in the onlooker bee phase, so as to provide better exploration and exploitation of the solution space in addition to enhanced convergence speed. The performance of the algorithm is evaluated in terms of different parameters on six standard data sets from the UCI Machine Learning Repository and compared with the ABC and KM algorithms. The experimental results show that the proposed MABCKM algorithm is able to escape local optima and find better objective function values with much lower standard deviation than the other two algorithms. The proposed algorithm also outperforms the other methods in terms of the F-measure, Silhouette, ARI, CH, and DB indices, and achieves the best ranking among the three methods. The results show that the modified algorithm produces better clustering partitions, leading naturally to the conclusion that MABCKM is a viable and robust technique for data clustering. The proposed method still requires the number of clusters as prior knowledge; extending it to perform automatic clustering remains future work.

Fig. 1. Comparison of OFV on iris data set
Fig. 2. Comparison of OFV on glass data set
Figs. 3-6. Comparison of OFV on the remaining data sets (figures omitted)

Table 2
Results obtained by various algorithms on different data sets.Bold face indicates the best and italic face the second best result

Table 3
Centers obtained for the best OFV on iris data set

Table 4
Centers obtained for the best OFV on glass data set

Table 5
Centers obtained for the best OFV on wine data set

Table 6
Centers obtained for the best OFV on vowel data set