Automatic Data Clustering Based Mean Best Artificial Bee Colony Algorithm

Abstract: Fuzzy C-means (FCM) is a clustering method that falls under unsupervised machine learning. The main issues plaguing this clustering algorithm are the unknown number of clusters within a particular dataset and the initialization sensitivity of the cluster centres. The Artificial Bee Colony (ABC) is a swarm algorithm that iteratively improves the quality of its members' solutions with the help of particular kinds of randomness. However, ABC has some weaknesses, such as balancing exploration and exploitation. To improve the exploration process within the ABC algorithm, the mean artificial bee colony (MeanABC) is used; it modifies the search equation to depend on the mean of the previous best and global best solutions. Furthermore, to solve the main issues of FCM, an automatic clustering algorithm based on the mean artificial bee colony, called AC-MeanABC, is proposed. It uses the ability of MeanABC to balance exploration and exploitation, and its capacity to explore the positive and negative directions of the search space, to find the best number of clusters and the best centroid values. A few benchmark datasets and a set of natural images were used to evaluate the effectiveness of AC-MeanABC. The experimental findings are encouraging and indicate considerable improvements compared with other state-of-the-art approaches in the same domain.


Introduction
Data clustering is a statistical approach for managing large volumes of data. It is a multivariate analytical approach that identifies patterns and relationships existing among data. Through data clustering, a user can divide data into relatively homogeneous groups; by reorganizing these groups, the user may be able to work with the original data volume efficiently. Clustering accuracy is crucial because, according to [1][2][3], clustered data that does not accurately represent the original data will have adverse consequences. In other words, data clustering operates on a set of patterns or points, represented as vectors of attributes, measurements, or objects in a multidimensional feature space [4,5].
Clustering algorithms are of two types: partitional clustering, which generates a single partition of the data, and hierarchical clustering, which generates a nested series of partitions [4,5]. As highlighted by [5,6], partitional clustering is itself of two types, namely, hard (crisp) and fuzzy (soft). In crisp clustering, every data point belongs to exactly one cluster. In fuzzy clustering, by contrast, a data point may belong to more than one cluster at the same time, according to a degree of fuzzy membership. Of the two, fuzzy clustering appears more suitable for datasets whose regions or clusters have ill-defined boundaries.
Over the last few years, clustering methods have proved effective, particularly in categorization tasks requiring semi- or full automation [7,8]. In the standard approach, clustering groups similar data points according to similarity and distance: intra-cluster similarity increases as the distances within a cluster decrease, while inter-cluster similarity decreases as the distances between clusters increase.
Clustering has many useful features, making it well known in fields such as image segmentation, pattern recognition, machine learning, and data mining [3,9]. One such feature belongs to the partitional method of clustering [2]: it is free from issues such as static behaviour, where the data elements of one cluster cannot be moved to another cluster. This type of clustering also avoids the inability to separate overlapping clusters, which is common in hierarchical clustering. Among fuzzy clustering techniques, the FCM algorithm [10] is a useful approach that has been applied in many domains because it handles ambiguity robustly and retains more information than hard segmentation techniques [11,12].
However, the fuzzy clustering-based approach still has significant weaknesses; for instance, it cannot operate automatically without prior knowledge of the number of clusters and the centroid locations. Determining the number of clusters within a particular dataset is a significant challenge, compounded by the unavailability of experts, operators, or any prior knowledge. Accordingly, in the last few years many researchers have worked on clustering methods that find the appropriate number of clusters within a particular dataset without experts or operators.
A fuzzy clustering method based on a modified ABC, called MoABC-FCM, was proposed by Ouadfel et al. [27]. Hancer et al. [28] presented an image segmentation approach that uses the ABC algorithm to detect tumours in MRI images. In Alrosan et al. [29], the ABC algorithm and fuzzy c-means were combined into ABC-FCM, whose performance was evaluated on two types of MRI brain images, namely simulated brain data and actual MRI images. An automatic clustering method based on an improved binary ABC (IDisABC) was proposed by [30]. In Kuo et al. [3], an automatic customer segmentation method based on an improved artificial bee colony and fuzzy clustering was proposed. Further, Alrosan et al. [23] proposed a new image clustering method based on an improved ABC, called MeanABC-FCM. This method was applied to MRI images, and the outcomes show promising results in this field.
An automatic fuzzy clustering method named AFDE for the image segmentation problem was proposed by [31]. Relevantly, Alia et al. [32] proposed a novel automatic fuzzy clustering algorithm that hybridizes harmony search (HS) with FCM to produce an automatic segmentation method called DCHS. Further, an automatic hard clustering algorithm called DCPSO was proposed by [33][34][35], while a fuzzy genetic algorithm with variable string length (FVGA), which automatically finds a fitting number of clusters together with the matching fuzzy clustering outcome, was proposed by [36]. A fuzzy clustering algorithm combining a firefly optimizer with the FCM algorithm, named FFCM, was proposed in Alomoush et al. [37] [38]. In Agbaje et al. [26], an automatic data clustering method was proposed based on an improved FA algorithm combined with the Particle Swarm Optimization (PSO) algorithm.
This paper proposes an automatic fuzzy clustering approach based on the mean artificial bee colony, called AC-MeanABC. The proposed method uses the MeanABC algorithm to determine the appropriate number of clusters and the initial locations of the centroids. The method modifies the search equation to depend on the mean of the previous best and global best solutions, so as to reach the best balance between exploration and exploitation. The remainder of the paper is organized as follows. The type 2 fuzzy set is described in Section 2, and related background on the employed techniques is presented in Sections 3 and 4. The proposed method is presented in Section 5, Section 6 presents the experiments and results, and Section 7 concludes the paper.

Type 2 Fuzzy Set
FCM is an unsupervised method that partitions objects based on attribute similarity, tending to increase the similarity of entities within the same set and decrease the similarity of entities in different sets [2,39]. The objective function defined by FCM is reformulated using the type 2 fuzzy set as in Eq. (1) below.
Here, c denotes the number of cluster centres, the norm measures the distance from element x i to the jth centre's location, and h kj is the new type 2 membership value. Type 2 fuzzy clustering consists of five steps. The first step is the selection of the number of fuzzy clusters c, followed by the second step, the selection of the centre locations v 1 , v 2 , . . . , v c . The third step is the computation of the fuzzy membership values as in Eq. (2), where μ kj is the membership degree of pixel k to the jth cluster, while the fourth step recalculates the cluster centres as in Eq. (3). As the final step, steps three and four are repeated until the limit of iterations (t) is reached or until the stopping condition in Eq. (4) is met.
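Since Eqs. (1)-(4) are not reproduced in this excerpt, the sketch below illustrates one iteration of the five-step procedure using the standard type-1 FCM updates plus one commonly used type 2 membership adjustment, h = μ − (1 − μ)/2; that adjustment and the fuzzifier m = 2 are assumptions here, not taken from the paper:

```python
import numpy as np

def fcm_step(X, V, m=2.0):
    """One fuzzy-clustering iteration: membership update, then centre update.

    X: (N, d) data points; V: (c, d) current cluster centres.
    """
    # Distances from every point to every centre, shape (N, c).
    dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
    # Classic type-1 membership: mu_kj = 1 / sum_l (d_kj / d_kl)^(2/(m-1)).
    power = 2.0 / (m - 1.0)
    mu = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** power, axis=2)
    # Assumed type 2 adjustment of the membership, then renormalisation.
    h = np.clip(mu - (1.0 - mu) / 2.0, 0.0, None)
    h = h / h.sum(axis=1, keepdims=True)
    # Centre update: weighted mean of the data with weights h^m.
    w = h ** m
    V_new = (w.T @ X) / w.sum(axis=0)[:, None]
    return h, V_new
```

Repeating `fcm_step` until the centres stop moving (or an iteration limit t is hit) mirrors the five-step loop described above.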

The Artificial Bee Colony (ABC)
The standard ABC algorithm was proposed in Karaboga et al. [13], inspired by the foraging behaviour of a bee swarm. The bees in ABC are divided into three kinds: the employed bees focus on seeking food sources and transmitting their search information to the onlookers; the onlooker bees perform further searches to reach better solutions, depending on the quality of the discovered food sources; and the scout bees replace abandoned food sources with random new ones. A food source in the flock is a possible solution, and its amount of nectar determines its fitness value. The ABC algorithm randomly initializes the food sources as in Eq. (5).
where (i = 1, . . . , SN) and SN is the number of solutions (food sources), (j = 1, 2, . . . , D) and D is the number of parameters, while x j min and x j max are the lowest and highest bounds of dimension j, respectively. An updated food source is computed as in Eq. (6).
where k ∈ (1, 2, . . . , SN) and j ∈ (1, 2, . . . , D) are randomly chosen indices with k ≠ i, and φ ij is a random value from [−1, 1]; this term balances the exploration and exploitation of the candidate food sources. The fitness fit(x i ) of a food source (solution) is derived from the value of the objective function f(x i ), as computed by Eq. (7).
An onlooker selects a solution with a probability proportional to its quality; that is, the fitness value of a food source determines how likely the corresponding employed bee's solution is to be followed. This selection probability is given in Eq. (8).
In this regard, when a solution does not improve over several trials and the "limit" is reached, its employed bee is converted to a scout bee and abandons the solution. The scout bee then starts a new random search via Eq. (5), and the whole cycle repeats until the termination criterion is reached.
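The full cycle just described can be sketched as follows. This is a minimal, generic ABC for a toy minimization problem, not the paper's clustering variant; the fitness form fit = 1/(1 + f) is the commonly used reading of Eq. (7) and is an assumption here:

```python
import numpy as np

def abc_minimize(f, lb, ub, sn=20, limit=30, mcn=200, seed=1):
    """Minimal ABC sketch (Eqs. (5)-(8)): f is the objective to minimise,
    lb/ub are per-dimension bound arrays, sn food sources, mcn cycles."""
    rng = np.random.default_rng(seed)
    d = len(lb)
    X = lb + rng.random((sn, d)) * (ub - lb)   # Eq. (5): random initialisation
    fx = np.array([f(x) for x in X])
    trials = np.zeros(sn, dtype=int)

    def fitness(val):
        # Assumed form of Eq. (7): fit = 1/(1+f) for f >= 0, else 1+|f|.
        return 1.0 / (1.0 + val) if val >= 0 else 1.0 + abs(val)

    def greedy(i):
        # Eq. (6): v_ij = x_ij + phi_ij * (x_ij - x_kj), with k != i.
        k = rng.choice([s for s in range(sn) if s != i])
        j = rng.integers(d)
        v = X[i].copy()
        v[j] += rng.uniform(-1.0, 1.0) * (X[i, j] - X[k, j])
        v = np.clip(v, lb, ub)
        fv = f(v)
        if fv < fx[i]:                          # keep the better source
            X[i], fx[i], trials[i] = v, fv, 0
        else:
            trials[i] += 1

    for _ in range(mcn):
        for i in range(sn):                     # employed bee phase
            greedy(i)
        fit = np.array([fitness(v) for v in fx])
        p = fit / fit.sum()                     # Eq. (8): selection probability
        for _ in range(sn):                     # onlooker bee phase
            greedy(int(rng.choice(sn, p=p)))
        worn = int(np.argmax(trials))           # scout bee phase
        if trials[worn] > limit:
            X[worn] = lb + rng.random(d) * (ub - lb)
            fx[worn] = f(X[worn])
            trials[worn] = 0
    best = int(np.argmin(fx))
    return X[best], fx[best]
```

For example, minimizing the 2-D sphere function f(x) = Σ x² over [−5, 5]² drives the best objective value close to zero within a few hundred cycles.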

The Mean Artificial Bee Colony (MeanABC)
The ABC algorithm has some disadvantages in terms of unbalanced search behaviour, a limitation it shares with other optimization algorithms. An improved ABC based on the mean global best, called MeanABC [23], addresses these disadvantages. In Alrosan et al. [23], the proposed MeanABC algorithm introduces a modified search formula for the employed bee phase, based on knowledge about the best candidates, to produce a new neighbourhood food source as in Eq. (9).
where x i,j , the first term, is the present position; φ ij (x ij − x kj ), the second term, is the difference between the current position and a neighbouring position, with φ ij a random value from [−1, 1]. Both terms are identical to those of the traditional ABC algorithm. The third term is the modified part of the search equation in MeanABC. Its first part moves the bee's present position towards the mean of the positive direction of the global best position (gbest) and the positive direction of the bee's own best position (pbest); its second part moves the present position towards the mean of pbest, the positive direction of its own best position, and −gbest, the negative direction of the global best position.
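The exact form of Eq. (9) is not reproduced in this excerpt, so the sketch below is only one plausible reading of the description above: the third term pulls the bee toward the mean of pbest combined with either +gbest or −gbest, with ψ drawn from [0, C]. The function name and the random sign choice are assumptions for illustration:

```python
import numpy as np

def mean_abc_neighbour(x_i, x_k, pbest_i, gbest, C=1.5, rng=None):
    """Assumed MeanABC-style neighbour update (one reading of Eq. (9))."""
    rng = rng if rng is not None else np.random.default_rng()
    d = len(x_i)
    phi = rng.uniform(-1.0, 1.0, d)    # same role as in the standard ABC
    psi = rng.uniform(0.0, C, d)       # psi drawn from [0, C], C > 0 constant
    sign = rng.choice([-1.0, 1.0])     # explore the positive or negative
                                       # direction of gbest, as described
    mean_attr = (pbest_i + sign * gbest) / 2.0
    return x_i + phi * (x_i - x_k) + psi * (mean_attr - x_i)
```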
Here, ψ is a random value from [0, C], and C is a positive constant. The value of C plays a critical role in balancing the exploration and exploitation of candidate food sources [38]. Fig. 1 shows the MeanABC pseudo-code.

The Proposed AC-MeanABC Algorithm
AC-MeanABC is an unsupervised data clustering method that employs the ability of MeanABC to seek the best solution, comprising the number of clusters and the centroids' locations. The pseudo-code of the AC-MeanABC algorithm is given in Fig. 2. The steps of AC-MeanABC are described as follows.

Initialization
The AC-MeanABC algorithm begins by initializing Np food sources. Each food source is a vector that encodes a number of clusters and their centroids. The length of each food-source vector can change according to the randomly generated number of clusters for that vector, and the allowed range of cluster numbers depends on the dataset. Each vector encodes a number of clusters, denoted ClustNo, which lies between the maximum number of clusters (ClusMaxNum) and the minimum number of clusters (ClustMinNum). The food-source initialization sets the number of clusters and the centroid values in MeanABC as in Eq. (10).
The values of ClusMaxNum and ClustMinNum are chosen based on the images used for segmentation. The logical length of the vector changes when ClustNo changes, but every vector in Np must physically accommodate the highest number of clusters (ClusMaxNum). Thus, in a matrix representation over the d-dimensional space, each vector has length (ClusMaxNum, d), where d is the dimensionality of the centroid locations. If a vector's number of clusters is less than ClusMaxNum, the vector holds its centroids in random positions, while the remaining unused vector elements are marked with the "don't care" sign '#' [32,40]. To clarify the "don't care" elements, let d = 2 and ClusMaxNum = 6, so that the feature space is 2-dimensional and the maximum number of clusters is 6. Suppose that one of the Np vectors has only three candidate cluster centres, say Np = {(113, 9), (80, 4), (42, 5)}; then these three centres are placed in the Np vector and the remaining elements are labelled with '#', e.g., Np = {113 9 # # # # 80 4 # # 42 5}. To determine the best solutions, the fitness function J(c1, c2, . . . , cn) in Eq. (1) is used; the maximum value of J(c1, c2, . . . , cn) is considered the best solution.
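The variable-length encoding above can be sketched as follows; NaN stands in for the '#' sign, and the bounds lb/ub and the helper names are illustrative assumptions, not from the paper:

```python
import numpy as np

def init_food_source(clust_min, clust_max, d, lb, ub, rng=None):
    """Sketch of one Eq. (10)-style food source: a fixed (clust_max, d)
    matrix holding a random number of active centroids; unused slots are
    'don't care' (NaN stands in for '#')."""
    rng = rng if rng is not None else np.random.default_rng()
    vec = np.full((clust_max, d), np.nan)            # all slots start as '#'
    clust_no = int(rng.integers(clust_min, clust_max + 1))
    # Place clust_no random centroids into randomly chosen slot positions.
    slots = rng.choice(clust_max, size=clust_no, replace=False)
    vec[slots] = lb + rng.random((clust_no, d)) * (ub - lb)
    return vec

def active_centroids(vec):
    """Return only the encoded (non-'#') centroids of a food source."""
    return vec[~np.isnan(vec).any(axis=1)]
```

With d = 2 and clust_max = 6 this reproduces the worked example above: a 6 x 2 matrix with three real centres and three '#' rows.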

Employed Bees Phase
This phase of the AC-MeanABC clustering algorithm finds the best Np vector, with the optimal number of clusters and centroid locations. To update an Np vector, the new food-source position is calculated using Eq. (9), introduced in Section 4.
Here, i = [1, 2, . . . , Np], j = [1, 2, . . . , Cn], x k,j is the jth parameter of a randomly chosen kth individual, and i indexes one of the Np food sources with k ≠ i. If any parameter of v i exceeds its predetermined boundaries, it is adjusted to fit the appropriate range. The scale factor φ ij is a random number from [−1, 1], while ψ is a uniform random number from [0, C], where C is a nonnegative constant. Here, pbest represents the previous best value of each encoded vector, namely NP = (C 1 , C 2 , C 3 , C 4 , . . . , C n ), and gbest represents the previous best value over all encoded vectors i = [1, 2, . . . , Np]. After generating the new candidate encoded vector, the new candidate solution's fitness value, say fit j , is calculated as in Eq. (7). If fit i < fit j , the food-source vector is updated; otherwise, it is kept as it is.
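The boundary adjustment and greedy replacement described in this phase can be sketched as follows (function names are illustrative; the fitness callable stands in for Eq. (7)):

```python
import numpy as np

def clamp(v, lb, ub):
    """Pull any out-of-range parameter of a candidate back into its bounds."""
    return np.minimum(np.maximum(v, lb), ub)

def greedy_select(x_i, fit_i, v_i, fitness):
    """Greedy replacement: keep candidate v_i only if its fitness beats
    the current food source x_i (fit_i < fit_j in the text)."""
    fit_v = fitness(v_i)
    if fit_i < fit_v:
        return v_i, fit_v    # candidate wins: update the food source
    return x_i, fit_i        # otherwise keep the current source as it is
```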

Onlooker Bees Phase
In the onlooker bee stage, each onlooker selects a food source with a probability determined by the amount of nectar (fit) of the food sources advertised by the employed bees; the probability is determined as in Eq. (8). When the probability for fit i of an encoded vector is high, the food source is selected; otherwise, the vector is treated as a "don't care" '#' solution. After the employed bee phase, once a food source has been chosen, a neighbour source v i is calculated by Eq. (9) and its f(x) is evaluated. Then, a greedy selection is applied between v i and x i .
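The fitness-proportional selection in this phase can be sketched as a roulette wheel; the normalisation p_i = fit_i / Σ fit is the commonly used reading of Eq. (8) and is an assumption here:

```python
import numpy as np

def onlooker_choices(fits, n_onlookers, rng=None):
    """Roulette-wheel selection: each onlooker picks a food source with
    probability proportional to its fitness (one reading of Eq. (8))."""
    rng = rng if rng is not None else np.random.default_rng()
    p = np.asarray(fits, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=n_onlookers, p=p)
```

High-fitness sources attract most onlookers, which is what concentrates the search around the better encoded vectors.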

Scout Bees Phase
When a solution does not change over repeated trials and the "limit" is reached, its employed bee becomes a scout bee and abandons the solution. The scout bee then begins a new random search for solutions via Eq. (10). These steps are iterated until the termination criterion is reached.

Solutions Evaluation and Clustering
The appropriateness or goodness of each MeanABC solution is measured by its fitness value. Each data point is allocated to one or more clusters using the fuzzy membership categories; these fuzzy membership values are determined using the fuzzy membership equation. A cluster validity index is then used as an indicator of the appropriateness of the clustering. Such an index is typically used to establish the quality of different solutions obtained with different settings of a particular clustering algorithm (or of solutions given by different algorithms).
In this paper, the VI cluster validity index is used to assess the appropriateness of the clustering [35]. The VI index is presented in Eq. (11), where s is a constant and N(μ, σ) is a Gaussian distribution, as in Eq. (12), with mean μ and standard deviation σ; c n is the number of clusters. Eqs. (13) and (14) represent the intra-cluster and inter-cluster measures, where N p is the number of patterns in the dataset, Z p is a pattern in a cluster, and m k and c n are the kth cluster centroid and the number of clusters, respectively.
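Eqs. (11)-(14) are not reproduced in this excerpt; the sketch below assumes the common Turi-style form of the VI index used in the cited DCPSO work, VI = (c · N(μ, σ)(c) + s) · intra/inter, with the constants left as parameters. Smaller VI indicates compact, well-separated clusters:

```python
import numpy as np

def vi_index(X, labels, centroids, s=1.0, mu=2.0, sigma=1.0):
    """Assumed Turi-style VI validity index.

    X: (N, d) patterns; labels: (N,) cluster assignment of each pattern;
    centroids: (c, d) cluster centres. The constants s, mu, sigma are
    illustrative defaults, not taken from the paper.
    """
    c = len(centroids)
    # Intra: mean squared distance of every pattern to its own centroid.
    intra = np.mean(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
    # Inter: minimum squared distance between any two distinct centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    d2 = np.sum(diffs ** 2, axis=2)
    inter = np.min(d2[np.triu_indices(c, k=1)])
    # Gaussian weight N(mu, sigma) evaluated at the cluster count c.
    gauss = np.exp(-((c - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return (c * gauss + s) * intra / inter
```

A clustering with tight, well-separated groups yields a lower VI than one with badly placed centroids, which is how the index ranks candidate solutions.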

Experiments and Results
The AC-MeanABC fuzzy clustering algorithm is applied as a fully automatic data clustering method to find the number of clusters in datasets. The experiments and results are presented in three parts. First, the AC-MeanABC parameters are set and the most appropriate values are chosen. Second, AC-MeanABC is run on 11 benchmark clustering datasets selected from the UCI databases: Iris, Ecoli, Wisconsin, Wine, Dermatology, Glass, Aggregation, R15, D31, Libras movement and Wholesale customers. Some details about these datasets are given in Tab. 1.
In the third part, AC-MeanABC is run on five natural images obtained from the Berkeley dataset [30,41]: Lena, Jet, MorroBay, Mandril, and Pepper. All natural images are of size 512 × 512 except the Pepper image, which is 256 × 256.

AC-MeanABC Parameters Setting
To obtain the best outcomes from any clustering optimization algorithm, suitable parameter selection is critical, because the parameters are essential to the algorithm's performance and accuracy. The parameter values of AC-MeanABC (i.e., population size SN, limit, maximum cycle number (MCN), nonnegative constant parameter C, and termination criterion TC) therefore have to be determined. The values of these parameters are set in Tab. 2 based on related works [3,30]. Tab. 3 lists other parameters relevant to the dataset, such as the maximum number of clusters, Np, μ and σ [3,30].

Benchmark Clustering Datasets
In this experiment, AC-MeanABC was run on 11 benchmark clustering datasets chosen from the UCI database. The benchmark datasets were divided into two parts: part 1 included Glass, Aggregation, R15, D31, Libras movement and Wholesale customers, while part 2 included Iris, Ecoli, Wisconsin, Wine and Dermatology.

Part 1
In this part, the AC-MeanABC outcomes were compared with the standard ABC algorithm and other related works such as iABC, AKCBCO, AKC-MEPSO, DCPG, DCPSO and DCGA [3,42,43]. The parameter settings for the AC-MeanABC algorithm were selected as in Tabs. 2 and 3, following the same parameter settings as in [3]. Accordingly, Tabs. 4 and 5 present these benchmark clustering results: the VI index values and the optimal cluster numbers (# OC). Tab. 4 shows that AC-MeanABC outperforms the other related works in minimizing the VI index in most of the outcomes, while iABC outperforms AC-MeanABC on R15 and D31. Additionally, Tab. 5 indicates that AC-MeanABC exceeds the other state-of-the-art methods in finding the optimal number of clusters for Glass, Aggregation, R15, Libras movement and Wholesale customers, while iABC outperforms AC-MeanABC on D31. In Tab. 6, the Friedman test was carried out and its results were used to compare the performance of AC-MeanABC and the other related works; as shown, the proposed algorithm outperforms all comparable methods. The box-whisker plot for the clustering results of Tab. 4 is displayed in Fig. 3.

Part 2
In this part, the AC-MeanABC outcomes were compared with other related works: the discrete binary artificial bee colony (DisABC), GA-based clustering algorithms, the improved discrete binary artificial bee colony (IDisABC) [28] and dynamic clustering based on particle swarm optimization (DCPSO) [33]. The parameter settings for the AC-MeanABC algorithm were selected as in Tabs. 2 and 3, following the same parameter settings as in [28]. Tabs. 7 and 8 present these benchmark clustering results: the VI index values and the optimal cluster numbers (# OC). Tab. 7 shows that AC-MeanABC outperforms the other related works in minimizing the VI index in all outcomes. Additionally, Tab. 8 indicates that AC-MeanABC exceeds the other state-of-the-art methods in finding the optimal number of clusters for the Ecoli and Wine datasets. In the remaining three cases, all approaches reached the optimal number of clusters on the Iris dataset; AC-MeanABC, IDisABC and DisABC obtained the same outcomes on the Wisconsin dataset, with AC-MeanABC preferred on the standard deviation value; and on the Dermatology dataset, DCPSO and AC-MeanABC are very close in finding the best number of clusters, although the DCPSO standard deviation is much higher than that of AC-MeanABC. To show the significance of the improvement of AC-MeanABC, the Friedman test was performed, with the results shown in Tab. 9: the mean ranking value of AC-MeanABC is better than those of IDisABC, DisABC, DCPSO and GA, where the best value is the minimum value and is represented in bold. Fig. 4 shows the box-whisker plot for the results of the five clustering algorithms in Tab. 7.

Natural Images
In this experiment, five natural images were obtained from the Berkeley segmentation dataset [30,41]: Lena, Jet plane, MorroBay, Mandril, and Pepper. The AC-MeanABC outcomes were compared with other related works: the discrete binary artificial bee colony (DisABC), GA-based clustering algorithms, the improved discrete binary ABC called IDisABC [30] and the dynamic clustering based PSO called DCPSO [35]. The parameter settings for the AC-MeanABC algorithm were selected following the same parameter settings as in [30], as in Tabs. 2 and 3.
Tabs. 10 and 11 present these natural-image results: the VI index values and the corresponding cluster numbers from the ground-truth images (# AC). Tab. 10 shows that AC-MeanABC outperforms the other related works in minimizing the VI index in all cases. Tab. 11 shows the number of clusters determined automatically by AC-MeanABC and the other state-of-the-art methods: AC-MeanABC outperforms the other related works in finding the optimal number of clusters for the Lena, Jet plane, MorroBay, and Pepper images, while the IDisABC approach outperforms AC-MeanABC and the other approaches in finding the optimal number of clusters for the Mandrill image. In Tab. 12, the Friedman test was carried out to compare the performance of the proposed algorithm and the other related works. As shown, AC-MeanABC outperforms IDisABC, DisABC, DCPSO and GA; the best value, considered the minimum value over all clustering algorithms, is presented in bold. The box-whisker plot for the results of the five clustering algorithms in Tab. 10 is displayed in Fig. 10.

Conclusion
In this paper, an automatic fuzzy clustering method based on the MeanABC search method, called AC-MeanABC, was proposed to solve the challenges of determining the number of clusters (regions) and the cluster centroids. The AC-MeanABC clustering method uses the capability of the MeanABC algorithm to explore the search space in positive and negative directions to search for a near-optimal number of clusters and centroid values. The experiments and results were obtained using 11 benchmark datasets and 5 natural images, comparing AC-MeanABC with other clustering methods such as iABC, ABC, AKCBCO, AKC-MEPSO, DCPG, DCGA, IDisABC, DisABC, DCPSO, and GA. In conclusion, the clustering results of AC-MeanABC are better than those of the state-of-the-art techniques in determining the optimal number of clusters and the value of the validity index VI.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.