Modification fitness function of particle swarm optimization to improve the cluster centroid

Clustering is a classification technique used to build models; a model is then applied to new input data to identify the class of that data. The goal of clustering is to find the best cluster centroids, and Particle Swarm Optimization (PSO) can be used to search for them. Many studies have used PSO for this purpose and obtained good results. However, searching for the best cluster centroids with PSO suffers from a problem in the fitness function, because the fitness is calculated from the distances between the cluster centroids and all input data. Such fitness functions may lead to poor cluster centroids. This paper therefore proposes a fitness function calculated from the classification results. Classification results can lead to good cluster centroids and to a good model for new input data. The proposed technique is tested on seven datasets from the UCI Machine Learning Repository and gives more satisfactory search results than other PSO-based techniques for data clustering problems.


Introduction
Clustering [1], [2] is both an optimization problem and a classification technique. It is applied in many areas, such as bioinformatics, machine learning, information retrieval, data analysis, pattern recognition and image processing [3], [4]. Clustering is used to build a model, and the model is applied to new input data to identify the class of that data. The goal of clustering is therefore to create a model that identifies the class of new objects. The clustering technique is composed as follows. An object (data input) comprises many attributes, and the attributes can be used to identify the class of the object. Objects with similar attributes are in the same class, whereas objects with different attributes are in different classes.

Particle Swarm Optimization (PSO) is inspired by the behavior of flying birds and their communication mechanism [5], [6]. PSO is a stochastic global optimization technique used for solving optimization problems. The goal of PSO is to find the global optimum of a fitness function or objective function. PSO has several advantages, such as rapid convergence, simplicity, and few parameters to adjust [7]. In comparison with several other population-based stochastic optimization methods, such as the genetic algorithm (GA) [8], [9] and evolutionary programming (EP), PSO performs better in solving various optimization problems, with fast and stable convergence rates [10-12].
International Conference on Engineering and Industrial Technology (ICEIT2020), IOP Conf. Series: Materials Science and Engineering 965 (2020) 012038, IOP Publishing, doi: 10.1088/1757-899X/965/1/012038

As mentioned above, this paper proposes a fitness function that is calculated from the classification result, that is, the F-measure. The proposed technique is called modification of the fitness function by F-measure for particle swarm optimization with clustering (FPSOC). A set of datasets from the UCI Machine Learning Repository [24] is used to compare PSOC [14] (using the source code from [20]), PSOCC [21], CPSO [22], PSKO [23] and the proposed technique. The results show that the solution quality of the proposed technique is better than that of the other compared algorithms on all tested datasets.

The rest of this paper is organized as follows. Section 2 explains the theoretical background of this paper: K-Means Clustering, the standard PSO and the F-Measure. Section 3 explains related works (PSOC, PSOCC, CPSO, and PSKO). Section 4 explains the proposed technique, modification of the fitness function by F-measure for particle swarm optimization with clustering (FPSOC). Section 5 describes the datasets and the experiment setup and presents the experimental results. Section 6 concludes the paper with a brief summary.

Particle Swarm Optimization
In standard PSO, each member of the population is called a "particle". Each particle has a position and a velocity. Each particle searches the search space according to its velocity, the best position found by the whole swarm (GBEST) and the particle's own best position (PBEST). PSO starts by randomizing the particle positions and velocities, and the position of each particle is evaluated with the objective function of the optimization problem. In each generation, each particle updates its velocity and position according to the expressions below:

$$V'_{id} = \omega V_{id} + c_1 \, \mathrm{rand}() \, (P_{id} - X_{id}) + c_2 \, \mathrm{rand}() \, (G_d - X_{id}) \qquad (1)$$

$$X'_{id} = X_{id} + V'_{id} \qquad (2)$$

where $X_{id}$ is the previous position of particle $i$ in dimension $d$, $X'_{id}$ is the current position of particle $i$ in dimension $d$, $V_{id}$ is the previous velocity of particle $i$ in dimension $d$, $V'_{id}$ is the current velocity of particle $i$ in dimension $d$, $P_{id}$ is PBEST of particle $i$ in dimension $d$, and $G_d$ is GBEST in dimension $d$. $\omega$ is the inertia weight, $c_1$ and $c_2$ are acceleration constants, and $\mathrm{rand}()$ is a random number generated uniformly in the range [0, 1]. The velocity limit is called $V_{max}$. If the calculated velocity of a particle exceeds $V_{max}$, it is replaced by $V_{max}$.
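The update step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `update_particle`, the symmetric clamp to `[-v_max, v_max]` and the default `v_max=1.0` are assumptions, while the defaults for the inertia weight and acceleration constants are taken from the Parameters Setting section.

```python
import random


def update_particle(x, v, pbest, gbest, w=0.75, c1=1.45, c2=1.45,
                    v_max=1.0, rng=random):
    """Return (new_x, new_v) for one particle after one velocity/position update.

    x, v, pbest and gbest are lists with one entry per dimension.
    v_max and the symmetric clamp are illustrative assumptions.
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        # Inertia term plus attraction towards PBEST and GBEST.
        vel = (w * v[d]
               + c1 * rng.random() * (pbest[d] - x[d])
               + c2 * rng.random() * (gbest[d] - x[d]))
        # Velocity limit: keep the velocity inside [-v_max, v_max].
        vel = max(-v_max, min(v_max, vel))
        new_v.append(vel)
        # The new position is the old position moved by the new velocity.
        new_x.append(x[d] + vel)
    return new_x, new_v
```

When a particle already sits on both PBEST and GBEST, only the inertia term remains, so the particle keeps drifting with a velocity scaled by the inertia weight.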

K-Means Clustering
In standard K-Means (KM), each data item is called an object. A group of objects is called a class, and each class is represented by its cluster centroid. Both classes and objects contain attributes. The attributes of an object and the attributes of the classes are used for classification by the Euclidean distance technique [14]. The main concept of KM can be summarized as follows: an object is assigned to the class whose attributes have the smallest Euclidean distance to the attributes of the object. The standard KM starts by randomizing the attributes of the classes; this step is applied only the first time. Then, the distance from each object to each class is calculated from the expression below:

$$d(z_p, m_j) = \sqrt{\sum_{k=1}^{N_d} (z_{pk} - m_{jk})^2} \qquad (3)$$

where $N_d$ is the number of attributes, $z_p$ is the object $p$, $z_{pk}$ is the attribute $k$ of the object $p$, $m_j$ is the class $j$, $m_{jk}$ is the attribute $k$ of the class $j$, and $d(z_p, m_j)$ is the distance between the object $p$ and the class $j$. Then, each class is recalculated from the expression below:

$$m_j = \frac{1}{n_j} \sum_{z_p \in C_j} z_p \qquad (4)$$

where $n_j$ is the number of objects in class $j$, and $C_j$ is the set of all objects in class $j$. The algorithm stops when the maximum number of iterations has been exceeded.
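The assignment and recalculation steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the initial centroids are passed in by the caller instead of being randomized, and an empty class keeps its previous centroid (the text does not say how empty classes are handled).

```python
import math


def distance(z, m):
    """Euclidean distance between object z and class centroid m."""
    return math.sqrt(sum((zk - mk) ** 2 for zk, mk in zip(z, m)))


def assign(objects, centroids):
    """Index of the nearest class centroid for every object."""
    return [min(range(len(centroids)), key=lambda j: distance(z, centroids[j]))
            for z in objects]


def recompute(objects, labels, centroids):
    """Replace each centroid by the mean of the objects assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [z for z, lab in zip(objects, labels) if lab == j]
        if not members:
            # Assumption: an empty class keeps its previous centroid.
            new_centroids.append(list(old))
            continue
        new_centroids.append([sum(col) / len(members) for col in zip(*members)])
    return new_centroids


def k_means(objects, centroids, max_iter=100):
    """Run a fixed number of assignment/recalculation iterations."""
    for _ in range(max_iter):
        labels = assign(objects, centroids)
        centroids = recompute(objects, labels, centroids)
    return centroids, assign(objects, centroids)
```

For example, calling `k_means` on four points forming two well-separated pairs, with one initial centroid near each pair, moves each centroid to the midpoint of its pair.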

F-Measure
The F-Measure (FM), or $F(i, j)$, is a well-known evaluation measure. FM is the harmonic mean of the recall $R(i, j)$ and the precision $P(i, j)$. FM uses the ideas of precision and recall from information retrieval [3]. Precision is an important measure of any classification system: it indicates the probability that a prediction of a given category is correct. Recall is the ratio between the number of correctly detected positive instances and the total number of positive instances. A True Positive ($T_P$) occurs when the model correctly predicts the positive class. A True Negative ($T_N$) occurs when the model correctly predicts the negative class. A False Positive ($F_P$) occurs when the model incorrectly predicts the positive class. A False Negative ($F_N$) occurs when the model incorrectly predicts the negative class. The precision, the recall and FM for each class $i$ of each cluster $j$ are calculated as:

$$P(i, j) = \frac{T_P}{T_P + F_P} \qquad (5)$$

$$R(i, j) = \frac{T_P}{T_P + F_N} \qquad (6)$$

$$F(i, j) = \frac{2 \, P(i, j) \, R(i, j)}{P(i, j) + R(i, j)} \qquad (7)$$
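As a small illustration of how precision, recall and FM follow from the four counts, consider the Python sketch below. The function name `f_measure` and the convention of returning 0.0 when a ratio is undefined are assumptions made for this example, and the counts in the usage note are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Return (precision, recall, F) computed from TP, FP and FN counts.

    A ratio with a zero denominator is reported as 0.0 (an assumed convention).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 8 true positives, 2 false positives and 2 false negatives, precision and recall are both 0.8, so their harmonic mean is also 0.8 (up to floating-point rounding).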

PSO with Clustering
The standard PSO was proposed to solve benchmark function problems [25]. To apply PSO to the data clustering problem, the objective function and the particle position are modified. A single particle represents the cluster centroids of the classes. The fitness of a particle is measured as in (8). Apart from these changes, the process of PSO with data clustering (PSOC) [14] is similar to the process of the standard PSO.

$$F(X_i) = \frac{1}{N_c} \sum_{j=1}^{N_c} \left[ \frac{\sum_{z_p \in C_{i,j}} d(z_p, m_j)}{|C_{i,j}|} \right] \qquad (8)$$

where $N_c$ is the number of classes, and $|C_{i,j}|$ is the number of objects belonging to class $j$ of particle $i$.
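A distance-based fitness of this kind can be sketched in Python as follows. The sketch assumes that the fitness is the average over classes of the mean Euclidean distance between a centroid and the objects assigned to it, which matches the symbols $N_c$ and $|C_{i,j}|$ defined above; the function name `quantization_error` and the treatment of empty classes (they contribute zero) are assumptions.

```python
import math


def quantization_error(objects, centroids):
    """Average over classes of the mean object-to-centroid distance.

    Each object is assigned to its nearest centroid. Smaller values mean
    that the objects lie closer to their centroids.
    """
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for z in objects:
        dists = [math.dist(z, m) for m in centroids]
        j = dists.index(min(dists))  # nearest class
        sums[j] += dists[j]
        counts[j] += 1
    # Assumption: an empty class contributes zero to the sum.
    per_class = [s / c for s, c in zip(sums, counts) if c > 0]
    return sum(per_class) / len(centroids)
```

Because this value depends only on distances, it can be computed without knowing the true class of any object, which is the property the proposed technique later replaces with a classification-based measure.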

PSO-clustering with the hard c-means
The process of PSO-clustering with the hard c-means (PSOCC) [21] is the same as that of PSO-clustering, except for the fitness function. Its fitness function is defined in terms of $N$, the number of objects; $k_c$, a positive constant; $J_0$, a small-valued constant; and $\lVert Z_p - m_j \rVert$, the Euclidean distance from $Z_p$ to $m_j$. The reported experimental results show that the PSOCC algorithm performs better than the KM algorithm.

The clustering algorithm based on PSO
The clustering algorithm based on PSO (CPSO) [22] is the same as PSO-clustering, except for the fitness function. Its fitness function is defined in terms of two quantities of the classification plan represented by particle $Y_i$: the maximum of the average distances within the same class, and the minimum of the distances between classes, where $m_j$ is the class $j$ and $m_i$ is the class $i$.

Particle swarm K-means optimization
The particle swarm K-means optimization algorithm (PSKO) [23] is the same as PSO-clustering, except that the fitness function is calculated according to (3). In addition, a step that recalculates the cluster centroid according to (4) is added for each particle. PSKO integrates PSO-clustering and the K-means algorithm, which gives the particles both exploration and exploitation capabilities. The reported experimental results show that the performance of PSKO is better than that of other clustering algorithms.

Proposed Work
The duty of PSO is to search for the cluster centroids. The PSO process tries to obtain a better fitness, so the fitness function guides the search and is the main factor in finding the cluster centroids. For the data clustering problem, if a cluster centroid stays in a suitable area, it can classify data correctly. On the other hand, if a cluster centroid stays in an unsuitable area, it will rarely classify data correctly. The existing techniques (PSOC, PSOCC, CPSO, and PSKO) use fitness functions that are calculated from the distance between all input data and the cluster centroids. As an effect of such a fitness function, PSO tries to minimize the distance from all input data to the cluster centroids: the more the fitness value is decreased, the better the solution is considered to be. This can move the cluster centroids into the space between two data groups, so that the cluster centroids stick together. Sticky cluster centroids are unsuitable for classifying data, because data of one group may be classified into another group, as in figure 1 (b). In fact, each cluster centroid should stay at the center of its group, where it can classify data correctly, as in figure 1 (a).

Figure 1. Panels (a) and (b) show results from searching of PSOC in two dimensions. The red point or the square point is the data of the first group, which has ten points. The blue point or the circle point is the data of the second group, which has ten points. The green point or the triangle point is the cluster centroid of the first group. The violet point or the diamond point is the cluster centroid of the second group.

In figure 1 (a), the fitness function (8) gives a value of 1.303. The cluster centroids in this figure classify 100% of the data, or 20 points, correctly. The value of the F-measure in this case is 1. So, this model classifies correctly, performs well and is suitable to be used. In figure 1 (b), the fitness function (8) gives a value of 1.074. The cluster centroids in this figure classify 65% of the data, or 13 points, correctly. The value of the F-measure in this case is 0.601. Some data in the first group are identified as the second group because they are closer to the cluster centroid of the second group than to that of the first group. So, this model classifies incorrectly and performs poorly.

With the fitness function (8), PSO selects figure 1 (b) as GBEST because the fitness of figure 1 (b) is less than the fitness of figure 1 (a). In fact, PSO should select figure 1 (a) as GBEST because it classifies better than the other. This case shows that distance may be unsuitable in some situations for calculating the fitness of PSO. Instead, the fitness function should use the results from classification. In this case, if the fitness function uses the results from classification, or the F-measure, it selects figure 1 (a) as GBEST because the amount of correctly classified data in figure 1 (a) is more than that in figure 1 (b). The F-measure is calculated from the results of classification, so it can be used as the fitness function to guide the search of PSO. Hence, this paper proposes that the fitness function of PSO uses the F-measure to solve the data clustering problem, called modification of the fitness function by F-measure for particle swarm optimization with clustering (FPSOC). Pseudo code of FPSOC is shown below:

    Initialize the position and velocity of each particle
    Initialize PBEST and GBEST
    For gen = 0 to MAXGEN
        For i = 0 to N
            Evaluate particle fitness according to formula (7)
            If F(X_i) > F(PBEST_i)
                PBEST_i = X_i
            End If
            If F(X_i) > F(GBEST)
                GBEST = X_i
            End If
            Update particle position according to formula (1) and (2)
        End For
    Next gen

where N is the number of particles, F(X_i) is the fitness of particle i, F(PBEST_i) is the fitness of PBEST of particle i, PBEST_i is PBEST of particle i, F(GBEST) is the fitness of GBEST, and MAXGEN is the maximum generation. Because the fitness is the F-measure, a larger fitness value is better.
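The FPSOC loop can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not specify: centroid $j$ of a particle is taken to stand for class $j$, objects are classified by the nearest centroid, the fitness is the unweighted mean of the per-class F-measure, a larger fitness is treated as better, and the velocity limit is half of the data range in each dimension. The function names and the small default swarm size are hypothetical.

```python
import math
import random


def classify(objects, centroids):
    """Assign each object to the class whose centroid is nearest (Euclidean)."""
    return [min(range(len(centroids)), key=lambda j: math.dist(z, centroids[j]))
            for z in objects]


def macro_f_measure(truth, predicted, n_classes):
    """Unweighted mean of the per-class F-measure (an assumed aggregation)."""
    pairs = list(zip(truth, predicted))
    total = 0.0
    for c in range(n_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn  # F = 2PR / (P + R) = 2TP / (2TP + FP + FN)
        total += 2 * tp / denom if denom else 0.0
    return total / n_classes


def fpsoc(objects, truth, n_classes, n_particles=20, max_gen=50,
          w=0.75, c1=1.45, c2=1.45, seed=0):
    """Search for cluster centroids with PSO, using the F-measure as fitness."""
    rng = random.Random(seed)
    dim = len(objects[0])
    lo = [min(z[d] for z in objects) for d in range(dim)]
    hi = [max(z[d] for z in objects) for d in range(dim)]
    v_max = [(hi[d] - lo[d]) / 2 for d in range(dim)]  # assumed velocity limit

    def copy(position):
        return [list(m) for m in position]

    # Each particle holds one candidate centroid per class.
    swarm = [[[rng.uniform(lo[d], hi[d]) for d in range(dim)]
              for _ in range(n_classes)] for _ in range(n_particles)]
    velocity = [[[0.0] * dim for _ in range(n_classes)]
                for _ in range(n_particles)]
    pbest, pbest_fit = [copy(x) for x in swarm], [-1.0] * n_particles
    gbest, gbest_fit = copy(swarm[0]), -1.0

    for _ in range(max_gen):
        for i, x in enumerate(swarm):
            fit = macro_f_measure(truth, classify(objects, x), n_classes)
            if fit > pbest_fit[i]:  # larger F-measure is better
                pbest[i], pbest_fit[i] = copy(x), fit
            if fit > gbest_fit:
                gbest, gbest_fit = copy(x), fit
            for j in range(n_classes):
                for d in range(dim):
                    vel = (w * velocity[i][j][d]
                           + c1 * rng.random() * (pbest[i][j][d] - x[j][d])
                           + c2 * rng.random() * (gbest[j][d] - x[j][d]))
                    velocity[i][j][d] = max(-v_max[d], min(v_max[d], vel))
                    x[j][d] += velocity[i][j][d]
    return gbest, gbest_fit
```

Calling `fpsoc(objects, truth, n_classes)` with a list of attribute vectors and their integer class labels returns the best centroids found and their F-measure. Unlike a distance-based fitness, this fitness needs the true class labels of the training data.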

The measures of algorithm performance
The following measures of algorithm performance are used in the experiments: the F-Measure (FM) and the average correct number (ACN), which is the average, over all runs, of the number of correct predictions made by the algorithm in the final generation. FM and ACN indicate the clustering efficiency of an algorithm: the larger FM and ACN are, the higher the quality of clustering. However, PSO is a stochastic algorithm with a random process, so FM and ACN alone may not be enough to measure its performance. This paper therefore adds the standard deviation of ACN (SD), which indicates the clustering reliability of an algorithm: the smaller SD is, the higher the reliability of clustering.
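As a small illustration, ACN and SD can be computed from the per-run counts of correct predictions as in the Python sketch below. The function name `summarize_runs` is hypothetical, the use of the population standard deviation is an assumption (the text does not say whether the population or sample form is used), and the counts in the usage note are made up for the example.

```python
import statistics


def summarize_runs(correct_counts):
    """Return (ACN, SD) for the correct-prediction counts of several runs."""
    acn = statistics.mean(correct_counts)
    # Assumption: population standard deviation over the runs.
    sd = statistics.pstdev(correct_counts)
    return acn, sd
```

For hypothetical counts of 18, 20, 19, 20 and 18 correct predictions in five runs, ACN is 19 and SD is about 0.894; an algorithm that reached 19 in every run would have the same ACN but an SD of 0.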

Parameters Setting
The parameters for all experiments are as follows: the acceleration constants $c_1$ and $c_2$ are both set to 1.45, $\omega$ = 0.75, and the population size is 100. Each experiment is repeated for 25 runs. The maximum number of iterations is 1000. The non-PSO parameters of the compared algorithms (PSOCC, CPSO, PSKO and FPSOC) are set as suggested in the original papers. The experiments are conducted on a personal computer with an AMD FX-8320 3.5 GHz CPU and 8 GB RAM, using Visual C++ 2010 as the programming language. All datasets used in the experiments are from the UCI Machine Learning Repository [24].