A clustering algorithm based on the combination of screening strategy and swarm

Aiming at the problem that the k-means clustering algorithm is affected by the choice of initial cluster centers, a k-means clustering optimization algorithm combining a screening strategy with the artificial bee colony (ABC) algorithm is proposed. The algorithm operates in an unsupervised way: it separates the data from outlier points through screening, uses the objective function of the ABC algorithm as the measurement function for the initial clustering, and improves the effectiveness and accuracy of k-means clustering by changing the initial cluster centers.


Introduction to relevant algorithms
2.1. K-means clustering algorithm
The k-means clustering algorithm is a partition-based clustering algorithm in which the clustering effect is measured by distance. Let the data set be $X = \{x_1, x_2, \ldots, x_n\}$, where each data object is $d$-dimensional, $x_j = (x_{j1}, x_{j2}, \ldots, x_{jd})$. The flow of the k-means clustering algorithm is as follows:
Step (1) Select $k$ initial cluster center points in the data set $X$.
Step (2) Calculate the similarity of the data in the whole data set according to formula (1), and assign objects with high similarity to the same cluster, so that the data objects in each cluster are as close as possible in the initial stage. This gap is usually expressed by the Euclidean distance:
$$d(x_j, c_i) = \sqrt{\sum_{m=1}^{d} (x_{jm} - c_{im})^2} \quad (1)$$
where $x_j$ is a data object and $c_i$ is the center of the $i$-th cluster.
Step (3) Update each cluster center according to formula (2):
$$c_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j \quad (2)$$
where $x_j$ is a data sample in cluster $C_i$ and $c_i$ is the mean of the data objects in cluster $C_i$.
Step (4) Calculate the value of the criterion function $E$ according to formula (3). If $E$ has converged, the algorithm terminates; otherwise, return to Step (2).
$$E = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - c_i \rVert^2 \quad (3)$$
Step (5) Output the final clustering results.
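The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names, the `tol` convergence threshold, the `seed` parameter and the handling of empty clusters are all assumptions introduced here.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two d-dimensional points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def criterion(data, centers, labels):
    """Criterion E: sum of squared distances from each point to its center."""
    return sum(euclidean(x, centers[c]) ** 2 for x, c in zip(data, labels))


def kmeans(data, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step (1): pick k initial centers at random from the data set.
    centers = [list(p) for p in rng.sample(data, k)]
    prev_e = float("inf")
    for _ in range(max_iter):
        # Step (2): assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: euclidean(x, centers[c]))
                  for x in data]
        # Step (3): move each center to the mean of its cluster.
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Step (4): stop once E no longer changes (tol is an assumed threshold).
        e = criterion(data, centers, labels)
        if abs(prev_e - e) < tol:
            break
        prev_e = e
    # Step (5): return the final clustering.
    return centers, labels, e
```

Because Step (1) is random, different seeds can lead to different final values of E, which is the sensitivity to initial centers that the rest of the paper addresses.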

Artificial bee colony algorithm
The principle of the artificial bee colony algorithm is relatively complex, but its implementation is relatively simple. The algorithm has three kinds of bees: employed (foraging) bees, onlooker (observing) bees and scout (detecting) bees. Employed bees share the location of their honey source through the waggle dance, recruiting onlooker bees to exploit it together. The onlooker bees, which make up half of the population, wait in the dance area and select among the known honey sources. Each individual in the swarm always keeps the best location it has found and conducts a neighborhood search based on this information. When a honey source reaches the maximum number of collections, the bee at that position becomes a scout bee. Scout bees search for new honey sources and then become employed bees again.
The flow of the ABC algorithm is as follows:
Step (1) Initialization: the number of randomly generated honey sources is SN; the number of bee colony individuals is NP, with SN = NP/2; the maximum number of cycles is MCN; the maximum number of collections is limit; the data dimension is D; the SN honey source solutions are $X_i$ ($i = 1, 2, \ldots, SN$); and the fitness of a solution is $fit_i$. The initial honey sources are generated by formula (4):
$$X_{ij} = X_{ij}^{\min} + rand(0, 1)\,(X_{ij}^{\max} - X_{ij}^{\min}) \quad (4)$$
where $rand(0, 1)$ is a random number in (0, 1), which controls the search range of the algorithm, and $X_{ij}^{\max}$ and $X_{ij}^{\min}$ are the maximum and minimum values of the $j$-th dimension of the data.
Step (2) Employed bee stage: the employed bees obtain new honey sources from formula (6), and the fitness $fit$ of each new source is calculated. After comparison, honey sources are selected according to the greedy criterion. Formula (6) takes the standard ABC neighborhood-search form:
$$V_{ij} = X_{ij} + \varphi_{ij}\,(X_{ij} - X_{kj}) \quad (6)$$
In formula (6), $V_{ij}$ is the new honey source, $k \neq i$ is the index of a randomly chosen honey source, and $\varphi_{ij}$ is a random number in $[-1, 1]$.
Step (3) Onlooker bee stage: after the honey source location information has been shared, the probability $P_i$ is calculated and each onlooker bee chooses a honey source according to the roulette principle. If the roulette selection succeeds, the onlooker bee follows the employed bee, collects a new honey source and updates the location of the source. If the fitness of the new honey source is better than before, the onlooker bee saves the new honey source solution; otherwise it keeps the original solution, and the collection count of that honey source is increased by 1.
Step (4) Scout bee stage: the collection count of each honey source is checked. If the count reaches limit and the colony has still failed to find a solution with higher fitness, the bee discards the honey source and becomes a scout bee, and the search is conducted again through formula (6) until a new honey source is found.
Step (5) Determine the current optimal solution. If it is the global optimum, the solution is output and the algorithm ends. Otherwise, the algorithm returns to Step (2) and loops again. In this process, the three kinds of bees constantly change identity, divide the labor and search cyclically.
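The ABC loop described above can be sketched as follows for a generic cost function to be minimized. This is an illustrative sketch of the standard ABC scheme, not the authors' code. Several details are assumptions made here because the text does not fix them: the fitness transform `1 / (1 + cost)`, a selection probability proportional to fitness, clamping candidates to the bounds, scouts restarting from a uniformly random position, and stopping after a fixed number of cycles `mcn`.

```python
import random


def fitness(f):
    """Assumed fitness transform for a cost f; higher fitness is better."""
    return 1.0 / (1.0 + f) if f >= 0 else 1.0 + abs(f)


def abc_minimize(cost, lower, upper, sn=10, limit=20, mcn=200, seed=0):
    rng = random.Random(seed)
    dim = len(lower)

    def random_source():
        # Uniform random position inside the bounds (initialization step).
        return [lower[j] + rng.random() * (upper[j] - lower[j])
                for j in range(dim)]

    sources = [random_source() for _ in range(sn)]
    costs = [cost(s) for s in sources]
    trials = [0] * sn  # collection count per honey source

    def try_neighbor(i):
        # Neighborhood search: perturb one dimension relative to another source.
        k = rng.choice([s for s in range(sn) if s != i])
        j = rng.randrange(dim)
        cand = list(sources[i])
        cand[j] += rng.uniform(-1.0, 1.0) * (sources[i][j] - sources[k][j])
        cand[j] = min(max(cand[j], lower[j]), upper[j])  # assumed clamping
        c = cost(cand)
        if c < costs[i]:  # greedy criterion
            sources[i], costs[i], trials[i] = cand, c, 0
        else:
            trials[i] += 1

    best_cost = min(costs)
    best = list(sources[costs.index(best_cost)])
    for _ in range(mcn):
        for i in range(sn):  # employed bee stage
            try_neighbor(i)
        weights = [fitness(c) for c in costs]  # onlooker stage: roulette wheel
        for i in rng.choices(range(sn), weights=weights, k=sn):
            try_neighbor(i)
        i_best = min(range(sn), key=lambda i: costs[i])
        if costs[i_best] < best_cost:  # remember the best source so far
            best, best_cost = list(sources[i_best]), costs[i_best]
        for i in range(sn):  # scout stage: abandon exhausted sources
            if trials[i] >= limit:
                sources[i] = random_source()
                costs[i], trials[i] = cost(sources[i]), 0
    return best, best_cost
```

For example, `abc_minimize(lambda x: sum(v * v for v in x), [-5.0, -5.0], [5.0, 5.0])` searches for the minimum of a two-dimensional sphere function inside the box $[-5, 5]^2$.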

Optimization of the initial centers of the k-means clustering algorithm
3.1. Initial cluster screening strategy
The k-means clustering algorithm achieves a good clustering effect when analyzing large-scale data sets, especially data sets with a spherical or near-spherical structure: it can quickly discover the structure of the data and cluster data with high similarity. However, when the data set has a non-convex structure, the k-means clustering algorithm often has poor recognition ability. At the same time, the k-means clustering algorithm is easily affected by noise points and edge points, which reduces the clustering effect and limits the kinds of data the algorithm is suitable for processing.
In the k-means clustering problem, the better the cluster centers are selected, the higher the similarity of the data within a class and the larger the difference between classes, and therefore the higher the clustering efficiency. In the algorithm, the Euclidean distance is generally used to represent the degree of similarity between the data.
In order to reduce the influence of abnormal points on the initial cluster centers and to make the data around the initial cluster centers more complete, the initial data set can be preprocessed. In the initial stage, the screening strategy is used to filter the data set to obtain more concentrated and complete clusters. To ensure that the data after screening are sufficiently concentrated, the density $Q$ needs to be calculated for each data point $x_i$; the smaller the distance between the data, the tighter the data elements are. According to the calculated density, a new data set $X$ is obtained by deleting from the data set $X^*$ the data whose density is less than the threshold $T$. In this way, the effect of outliers is reduced and a relatively tight initial data set is obtained.
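A density-based screening step of this kind might look like the sketch below. The exact definitions of the density $Q$ and the threshold $T$ are not recoverable from the text, so both are hypothetical choices made only for illustration: the density of a point is taken as the inverse of its mean distance to all other points, and the threshold is `ratio` times the mean density. The function name and the `ratio` parameter are likewise assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def screen_by_density(data, ratio=0.5):
    """Return the points whose density is at least ratio * mean density.

    The density definition and the threshold are illustrative assumptions,
    not the formulas of the paper.
    """
    n = len(data)
    if n < 2:
        return list(data)
    densities = []
    for i, x in enumerate(data):
        mean_dist = sum(euclidean(x, y)
                        for j, y in enumerate(data) if j != i) / (n - 1)
        # Assumed density: inverse mean distance (all-duplicate points get inf).
        densities.append(1.0 / mean_dist if mean_dist > 0 else math.inf)
    finite = [q for q in densities if math.isfinite(q)]
    if not finite:
        return list(data)
    threshold = ratio * sum(finite) / len(finite)
    return [x for x, q in zip(data, densities) if q >= threshold]
```

With a tight group of points and one distant outlier, the outlier has a much larger mean distance to the rest, so its density falls below the threshold and it is removed before the initial centers are chosen.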

Improved bee colony algorithm
The essence of the k-means algorithm is the minimization of the criterion function E. Random selection of the initial cluster centers easily causes the algorithm to fall into a local optimum. For a better clustering effect, the similarity of the members within a cluster and the difference between clusters are the key. Therefore, by combining k-means with the artificial bee colony algorithm, the clustering criterion function can be used as the objective function of the artificial bee colony. Searching for the best honey source then corresponds to selecting the best initial centers for the clusters. The correspondence is shown in the following (3).
The k-means clustering algorithm based on the artificial bee colony [11] is as follows:
Step (1) Initialize the parameters: set the colony size; the maximum number of iterations is MCN; the maximum number of collections is limit; the initial number of clusters is k.
Step (2) Initialize the colony and generate SN bees randomly. Each bee is a k*j-dimensional vector. The data are partitioned and $fit_Y$ is calculated according to the new fitness function.
Step (3) Carry out a neighborhood search according to formula (6). When an employed bee finds a honey source, a comparison is made: if the fitness of the new honey source is higher than that of the original one, keep the new honey source position; otherwise keep the original position. The number of iterations is increased by one.
Step (4) The onlooker bees select honey sources by the roulette principle and search the neighborhood according to formula (8); the new solution is taken as the clustering center.
Step (5) If a honey source has not changed after limit collections, a scout bee randomly generates a new position to replace it.
Step (6) Check the number of iterations: when the maximum is reached, go to Step (7); otherwise, return to Step (3).
Step (7) Output the initial clustering centers and the corresponding fitness value, and the algorithm ends.
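The link between a honey source and a set of cluster centers in the steps above can be made concrete with a short sketch. It assumes that a honey source is a flat vector holding the k centers one after another, that the cost of a source is the criterion E obtained by assigning every point to its nearest encoded center, and that fitness is `1 / (1 + E)`. The function names and the fitness transform are assumptions for illustration, not taken from the paper.

```python
def decode(source, k, d):
    """Split a flat honey-source vector of length k*d into k centers."""
    return [source[c * d:(c + 1) * d] for c in range(k)]


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def clustering_cost(source, data, k):
    """Criterion E for the centers encoded in `source`."""
    d = len(data[0])
    centers = decode(source, k, d)
    # Each point contributes its squared distance to the nearest center.
    return sum(min(squared_distance(x, c) for c in centers) for x in data)


def clustering_fitness(source, data, k):
    """Assumed fitness: a smaller E gives a fitness closer to 1."""
    return 1.0 / (1.0 + clustering_cost(source, data, k))
```

Any ABC loop that evaluates honey sources with `clustering_cost` then searches directly over candidate initial centers. The best source found can be decoded and handed to k-means as its starting centers.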

Experimental data
To test the effectiveness of the improved algorithm in this article, the Iris, Wine and Red Wine data sets from the machine learning database were used. Experimental environment: the host CPU is an i7 with a main frequency of 4.6 GHz, and the running memory is 8.0 GB. The original k-means algorithm, the artificial bee colony k-means algorithm and the algorithm in this paper were each run 20 times, and the mean value and standard deviation obtained by the three algorithms were calculated. Since the number of samples and the number of data attributes differ among the Iris, Wine and Red Wine data sets, the number of clusters is set separately for each data set.
From the above data analysis, it can be seen that the original k-means algorithm converges quickly and completes the clustering fast, but the test function value changes greatly: when the initial cluster centers are selected differently, the deviation of the results is large. Therefore, the original k-means algorithm is unstable, is greatly affected by the initial centers, and may converge to a local optimum. The experiment with the ABC k-means clustering algorithm shows that, although the combination with the ABC algorithm increases the time complexity and lengthens the convergence time, the fluctuation of the E value is reduced to some extent and the clustering effect is improved. The experiment also shows that, compared with the other two algorithms, the algorithm in this paper improves the convergence time, gives a more stable E value and improves the clustering effect. With the addition of the screening strategy, the data around the centers are more concentrated, and the clustering algorithm can be applied to a wider range of data types.
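The evaluation protocol of repeated runs can be sketched as a small helper. The name `summarize_runs` and the callable `run_once` (which is expected to run one algorithm with a given seed and return its final E value) are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize_runs(run_once, n_runs=20):
    """Call run_once(seed) for seeds 0..n_runs-1 and return (mean, std) of E."""
    values = [run_once(seed) for seed in range(n_runs)]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(values), statistics.stdev(values)
```

A smaller standard deviation over the 20 runs indicates that the algorithm is less sensitive to its initial cluster centers.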

Conclusion
Considering the advantages and disadvantages of the k-means clustering algorithm and the ABC algorithm, and aiming at the problem that the clustering effect is affected by the initial cluster centers, this paper proposes a k-means clustering algorithm that optimizes the initial cluster centers to improve the clustering effect. The method starts from the initial data set and obtains a data set more conducive to clustering by screening it. Combined with the advantages of the ABC algorithm, the objective function of the ABC algorithm is taken as the measurement function of the initial clustering. By changing the initial cluster centers, the effectiveness and accuracy of k-means clustering are improved.