META-HEURISTIC OPTIMIZATION ALGORITHMS BASED FEATURE SELECTION FOR CLINICAL BREAST CANCER DIAGNOSIS

Abstract Breast cancer is the leading cause of cancer death among women worldwide. Early detection and accurate diagnosis increase the chances of choosing a successful treatment. This article presents a two-step system that first applies four swarm algorithms, namely the whale optimization algorithm, grey wolf optimizer, flower pollination algorithm, and moth flame optimization, for feature selection. Several classifiers are then applied, including support vector machines, k-nearest neighbor, and decision tree. The performance of each algorithm is evaluated in five respects: classification-based measurements, convergence, computational time, statistical measurements, and stability. The results obtained from the proposed algorithms are compared and analyzed against other algorithms published for breast cancer diagnosis. Experiments on the Wisconsin diagnosis breast cancer (WDBC) and Wisconsin prognosis breast cancer (WPBC) datasets indicate that the proposed system is effective for breast cancer data classification and feature selection.


Introduction
Breast cancer is one of the most serious problems in healthcare systems, as it is a major cause of death, especially in women. Its diagnosis and treatment therefore need to be carefully researched [1]. According to the literature and statistics, the annual number of deaths around the world due to cancers is around 7.4 million (13% of all deaths) [2]. Breast cancer is one of the five most common forms of cancer around the world [3]. Not only the patients but also their families suffer from this disease. It is therefore essential to ease the diagnosis and decision-making process of breast cancer for physicians regarding the medical treatment phases [4].

Meta-heuristic optimization algorithms are becoming increasingly popular in medical applications such as the diagnosis of breast cancer [5][6][7]. This is because (1) they rely on rather simple ideas, (2) they do not require gradient information, (3) they are easy to implement, (4) they can be applied to a wide variety of problems across different disciplines, and (5) they can bypass local optima. By mimicking biological and physical processes, nature-inspired optimization algorithms are capable of solving many optimization problems. A good balance between exploitation and exploration strongly influences the performance of any meta-heuristic algorithm. Exploitation takes advantage of the best solutions found so far, while exploration is the algorithm's capability to reach unexplored areas of the search space. This paper compares the efficiency of several recent meta-heuristic optimization algorithms for breast cancer diagnosis: the whale optimization algorithm (WOA), grey wolf optimizer (GWO), flower pollination algorithm (FPA), and moth flame optimization (MFO). It differs from related work on clinical dataset diagnosis in that it applies recent bio-inspired algorithms.

Swarm-based optimization algorithms have some strengths over evolution-based optimization algorithms. For instance, swarm-based algorithms keep information about the search space over subsequent iterations, whereas evolution-based algorithms discard this information when a new population is formed. Swarm-based algorithms also involve fewer operators than evolutionary techniques (which use mutation, selection, elitism, and crossover) and are thus easier to implement. In addition, population-based bio-inspired algorithms share some common features regardless of their nature and behavior: in these algorithms, the search process is divided into two stages, exploration and exploitation [8]. This paper focuses on bio-inspired optimization algorithms that mimic the behavior of certain animals. For a detailed description of some swarm-based algorithms, readers are referred to [9]. This paper uses recent bio-inspired algorithms from the literature as feature selection algorithms to select the relevant features from the clinical datasets. Selecting the best representative feature set is an essential task in building accurate diagnostic classification models.

The rest of the paper is organized as follows: Section 2 briefly explains the proposed algorithms from the inspiration and mathematical points of view. Section 3 presents the proposed model in detail. Section 4 discusses the datasets, evaluation metrics, and experimental results. Finally, Section 5 presents the conclusion and remarks on future work.
2 Preliminaries: Swarm optimization algorithms
2.1 Whale optimization algorithm: Inspiration analysis and mathematical model

Inspiration analysis:
Whales are impressive animals. They are considered the largest of all mammals. Whales live alone or in groups. Some species (such as killer whales) live in a family for their entire life. The humpback whale is one of the largest whales; its favorite prey is krill and small fish. The most interesting point about humpback whales is their special hunting method, called the bubble-net feeding method. These and other behaviors of humpback whales are discussed in more detail in [10].

Mathematical model:
This section presents the mathematical model of three phases: encircling prey, the spiral bubble-net feeding maneuver, and the search for prey.

1. Encircling prey
Humpback whales can recognize the location of prey and encircle it. Because the position of the optimum in the search space is not known in advance, WOA assumes that the current best candidate solution is the target prey or is close to the optimum. Once the best search agent has been defined, the other search agents try to update their positions towards it. This behavior is described by the following equations:

$$D = |C \cdot X^{*}(t) - X(t)| \tag{1}$$

$$X(t+1) = X^{*}(t) - A \cdot D \tag{2}$$

where t is the current iteration, A and C are coefficient vectors, X* is the position vector of the best solution obtained so far, X is the position vector of the search agent, and | | denotes the absolute value. The vectors A and C are computed as follows:

$$A = 2a \cdot r - a \tag{3}$$

$$C = 2 \cdot r \tag{4}$$

where a is decreased linearly from 2 to 0 over the course of iterations and r is a random vector in [0, 1]. The humpback whales attack the prey with the bubble-net method.

2. Bubble-net attacking method (exploitation phase)
The bubble-net behavior of humpback whales is modeled by two mechanisms.
Shrinking encircling mechanism: This behavior is achieved by decreasing the value of a in Eq. (3). With random values of A in [-1, 1], the new position of a search agent can lie anywhere between its original position and the position of the current best agent.
Spiral updating position: To mimic the helix-shaped movement of humpback whales, a spiral equation is created between the position of the whale and the prey:

$$X(t+1) = D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(t) \tag{5}$$

where D' = |X*(t) − X(t)| is the distance of the ith whale to the prey, b is a constant that defines the shape of the logarithmic spiral, l is a random number in [-1, 1], and · indicates element-by-element multiplication. During optimization, one of the two mechanisms is selected at each update:

$$X(t+1) = \begin{cases} X^{*}(t) - A \cdot D & \text{if } p < 0.5 \\ D' \cdot e^{bl} \cdot \cos(2\pi l) + X^{*}(t) & \text{if } p \ge 0.5 \end{cases} \tag{6}$$

where p is a random number in [0, 1].

3. Search for prey (exploration phase)
Humpback whales search for prey randomly according to the positions of each other. Random values of A greater than 1 or less than -1 force a search agent to move far away from a reference whale, so |A| > 1 emphasizes exploration and allows WOA to perform a global search. The mathematical model is as follows:

$$D = |C \cdot X_{rand} - X| \tag{7}$$

$$X(t+1) = X_{rand} - A \cdot D \tag{8}$$

where X_rand is a random position vector chosen from the current population.
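The three WOA phases can be combined into a single position update. The following is a minimal sketch rather than the authors' implementation: it assumes scalar coefficients A and C per update (the model above uses vectors), a 0.5 threshold on p, and a spiral constant b = 1; the function and parameter names are illustrative.

```python
import math
import random

def woa_update(position, best, population, a, b=1.0, rng=random):
    """Return the next position of one whale (all positions are lists of floats)."""
    if rng.random() < 0.5:
        # Assumption: one scalar A and C per update instead of per-dimension vectors.
        A = 2 * a * rng.random() - a          # A = 2a*r - a
        C = 2 * rng.random()                  # C = 2*r
        # |A| < 1: move relative to the best agent; otherwise explore around a random agent.
        ref = best if abs(A) < 1 else rng.choice(population)
        return [r - A * abs(C * r - x) for r, x in zip(ref, position)]
    # Spiral move around the best agent.
    l = rng.uniform(-1, 1)
    return [abs(bx - x) * math.exp(b * l) * math.cos(2 * math.pi * l) + bx
            for bx, x in zip(best, position)]
```

In a full run, `a` would be lowered linearly from 2 to 0 across iterations, so early updates tend to explore and late updates tend to exploit.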
2.2 Grey Wolf optimizer: Inspiration analysis and mathematical model

Inspiration analysis:
The GWO algorithm is a recent meta-heuristic algorithm proposed in [11]. Its main inspiration is the hunting technique and social leadership of grey wolves, which belong to the Canidae family. They live in a pack, where the size of the group is reported in [12]. The alpha is the leader of the pack and is responsible for decisions such as the sleeping place and hunting. The beta is the second leader after the alpha and helps the alpha in decision making. The lowest-ranking grey wolf is the omega, which must submit to all the other dominant wolves. The remaining grey wolves are called deltas; they dominate the omega. The hunting mechanism of grey wolves consists of (a) chasing, tracking, and approaching the prey; (b) harassing, encircling, and pursuing the prey; and (c) attacking the prey.

Mathematical model:
The social hierarchy of wolves is modeled as follows: the best solution is the alpha wolf (α), the second best solution is the beta wolf (β), and the third best solution is the delta wolf (δ). The remainder of the candidate solutions are assigned to the omega (ω) wolves.
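As a rough illustration of how this hierarchy drives the search, the sketch below ranks the candidate solutions and moves one wolf to the average of three positions guided by α, β, and δ. It assumes the usual GWO update with A = a(2r1 − 1) and C = 2r2 drawn independently per dimension and per leader, and it assumes that higher fitness is better; all names are illustrative.

```python
import random

def rank_leaders(population, fitness):
    """Return (alpha, beta, delta): the three best solutions, best first."""
    ranked = sorted(population, key=fitness, reverse=True)
    return ranked[0], ranked[1], ranked[2]

def gwo_update(position, alpha, beta, delta, a, rng=random):
    """Move one wolf to the mean of three leader-guided candidate positions."""
    new_position = []
    for j, x in enumerate(position):
        candidates = []
        for leader in (alpha, beta, delta):
            A = a * (2 * rng.random() - 1)     # A = a(2*r1 - 1)
            C = 2 * rng.random()               # C = 2*r2
            D = abs(C * leader[j] - x)         # distance to this leader
            candidates.append(leader[j] - A * D)
        new_position.append(sum(candidates) / 3)
    return new_position
```

With a = 0 every A is zero, so the wolf lands exactly on the mean of the three leaders; larger values of a allow steps that overshoot or move away from them.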
Encircling prey: The encircling behavior is modeled by the following equations:

$$D = |C \cdot X_{p}(t) - X(t)| \tag{9}$$

$$X(t+1) = X_{p}(t) - A \cdot D \tag{10}$$

where A and C are coefficient vectors, X_p(t) is the prey's position vector, and X is the grey wolf's position vector. The coefficient vectors are A = a(2r_1 − 1) and C = 2 · r_2, where a is linearly decreased from 2 to 0 over the course of iterations and r_1, r_2 are random vectors in [0, 1].

Search for prey (exploration): Grey wolves search for prey according to the positions of the alpha, beta, and delta. They diverge from each other to search for prey and converge to attack it. Random values of A greater than 1 or less than -1 oblige the search agent to diverge from the prey, which allows GWO to emphasize exploration and thus search globally. When |A| ≥ 1, the grey wolves are forced to diverge from the prey in the hope of finding a better one. C is another component of GWO that favors exploration. The C vector contains random values in the range [0, 2]. It provides random weights for the prey in order to stochastically emphasize (C ≥ 1) or deemphasize (C < 1) the effect of the prey in defining the distance in Eq. (9).

Attacking prey (exploitation): The value of a is linearly decreased over the iterations to model approaching the prey. GWO forces the wolves to attack the prey when the random values of A are in [-1, 1], that is, when |A| < 1.

Hunting: To simulate the hunting behavior of grey wolves, the three best solutions obtained so far are saved, and the other search agents (including the omegas) are obliged to update their positions according to the positions of these best search agents:

$$D_{\alpha} = |C_{1} \cdot X_{\alpha} - X|, \quad D_{\beta} = |C_{2} \cdot X_{\beta} - X|, \quad D_{\delta} = |C_{3} \cdot X_{\delta} - X| \tag{11}$$

$$X_{1} = X_{\alpha} - A_{1} \cdot D_{\alpha}, \quad X_{2} = X_{\beta} - A_{2} \cdot D_{\beta}, \quad X_{3} = X_{\delta} - A_{3} \cdot D_{\delta} \tag{12}$$

$$X(t+1) = \frac{X_{1} + X_{2} + X_{3}}{3} \tag{13}$$

2.3 Flower pollination algorithm: Inspiration analysis and mathematical model

Inspiration analysis:
FPA is a recent meta-heuristic optimization algorithm proposed in [13]. Its main inspiration is the pollination behavior of flowers, whose purpose is reproduction. From the point of view of biological evolution, the objective of flower pollination is the survival of the fittest and the optimal reproduction of plant species. All the factors and processes of flower pollination interact to achieve optimal reproduction of the flowering plants.

Mathematical model:
Yang [13] emulates the biological pollination process using four idealized rules: (1) self-pollination and abiotic pollination are local pollination; (2) biotic and cross-pollination are global pollination processes in which pollen-carrying pollinators perform Lévy flights; (3) flower constancy is a reproduction probability proportional to the similarity of the two flowers involved; and (4) a switch probability p ∈ [0, 1] controls the choice between local and global pollination. Based on these rules, Yang formulates global pollination as Eq. (14), which implements Rules 2 and 3, and local pollination as Eq. (15), which implements Rules 1 and 3 (local pollination and flower constancy).
In Eq. (14), X_i^{t+1} is the updated position of pollen X_i after the t-th iteration, and g is the current best solution. The parameter L (step size) is the strength of pollination. A Lévy flight is adopted because pollinators such as insects and birds can travel long distances with varying step lengths: L > 0 is drawn from a Lévy distribution according to Eq. (16).
In Eq. (16), Γ(λ) is the standard gamma function, and the Lévy distribution is valid for large steps s > 0.
The step s is defined in Eq. (17), where U and V are random numbers drawn from Gaussian distributions; U ∼ N(0, σ²) has zero mean and variance σ².
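The pollination rules above can be sketched in a few lines. This is an illustrative sketch under several assumptions: the Lévy step uses Mantegna's method with V ∼ N(0, 1) and the usual closed form for σ; the global step is written so that it moves towards g (sign and scaling conventions differ between presentations of FPA); global pollination is chosen when a uniform draw falls below p; and λ = 1.5 and all names are illustrative. The default p = 0.7 mirrors the switch probability used later in this paper.

```python
import math
import random

def levy_step(lam=1.5, rng=random):
    """Draw s = U / |V|**(1/lam) with U ~ N(0, sigma^2) and V ~ N(0, 1)."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / lam)

def pollinate(x, best, x_j, x_k, p=0.7, lam=1.5, rng=random):
    """Return the next position of flower x; x_j and x_k are two other flowers."""
    if rng.random() < p:
        # Global pollination: Levy-distributed step relative to the best solution g.
        return [xi + levy_step(lam, rng) * (g - xi) for xi, g in zip(x, best)]
    # Local pollination: random walk using two flowers of the same population.
    eps = rng.random()
    return [xi + eps * (a - b) for xi, a, b in zip(x, x_j, x_k)]
```

Note that Mantegna's ratio can be negative, so a step may also move away from g; that occasional long jump is what gives the Lévy flight its exploratory character.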
2.4 Moth flame optimization: Inspiration analysis and mathematical model

Inspiration analysis:
Moths are fancy insects that closely resemble the butterfly family. There are over 160,000 species of this insect in nature. Their lifetime has two main milestones: larva and adult. The larvae are converted to moths in cocoons. The most interesting fact about moths is their special method of navigating at night: they fly using the moonlight and rely on a mechanism called transverse orientation. In this mechanism, a moth travels long distances in a straight path by maintaining a fixed angle with respect to the moon [14].

Mathematical model:
In the MFO algorithm, moths are the candidate solutions, and the positions of the moths in the space are the problem's variables. By changing their position vectors, moths can fly in 1-D, 2-D, 3-D, or hyperdimensional space. Because MFO is a population-based algorithm, the set of moths is represented by a matrix M of size n × d, where n is the number of moths and d is the number of variables. Each row of M (the position vector of one moth) is passed to the fitness function, and the output is assigned to the corresponding moth as its fitness value (for example, OM_1 in the matrix OM for the first moth). The MFO algorithm is a three-tuple that approximates the global optimum of the optimization problem. It is defined as follows:

$$MFO = (I, P, T) \tag{18}$$

I is a function that generates a random population of moths as initial solutions M together with the corresponding fitness values OM. It is defined as I : ∅ → {M, OM}. The P function (P : M → M), which is the main function, moves the moths around the search space. It receives the matrix M and eventually returns its updated version. The T function (T : M → {true, false}) returns true if the termination criterion is satisfied and false otherwise.
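The three-tuple structure can be written as a small driver loop. The sketch below is illustrative only: it covers I and T and leaves P as a caller-supplied function `move`, because the moth update rule itself is not described in this section. The termination test is assumed to be a fixed iteration budget, and all names are hypothetical.

```python
import random

def initialize(n, d, lower, upper, fitness, rng=random):
    """I: create n random moths in d dimensions plus their fitness values (M, OM)."""
    M = [[rng.uniform(lower, upper) for _ in range(d)] for _ in range(n)]
    OM = [fitness(moth) for moth in M]
    return M, OM

def terminated(iteration, max_iter):
    """T: true once the iteration budget is exhausted (assumed criterion)."""
    return iteration >= max_iter

def run(n, d, lower, upper, fitness, move, max_iter, rng=random):
    """Generic I-P-T loop; `move` plays the role of P and maps M to an updated M."""
    M, OM = initialize(n, d, lower, upper, fitness, rng)
    iteration = 0
    while not terminated(iteration, max_iter):
        M = move(M)
        OM = [fitness(moth) for moth in M]   # re-evaluate every row of M
        iteration += 1
    return M, OM
```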

The proposed clinical breast cancer computer-aided diagnosis system
The proposed CAD system takes the clinical breast cancer dataset as input. Missing values are first filled using the median method, and then several bio-inspired feature selection algorithms are adapted to select the most representative features of the dataset. Finally, the selected features are fed to different classifiers, which are evaluated using four measurements: accuracy, recall, precision, and f-measure. Figure (1) shows the proposed system architecture.

Preprocessing phase
Preprocessing the input dataset is an important task for knowledge discovery. The dataset used is reported to contain missing values in some records. In this paper, every missing value of a given feature in a given class is replaced by the median of all known values of that feature in that class. Eq. (19) shows the median method for a missing value x_{i,j} of the j-th feature in a record of the k-th class C_k:

$$x_{i,j} = \operatorname{median}_{i' :\, x_{i'} \in C_k} \; x_{i',j} \tag{19}$$

where the median is taken over the records i' of class C_k whose value of feature j is known.
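A minimal sketch of this per-class median imputation is shown below. It assumes that missing entries are encoded as `None` and that the data are given as a list of rows with a parallel list of class labels; the function name is illustrative.

```python
from statistics import median

def impute_by_class_median(rows, labels):
    """Replace None entries with the median of the same feature within the same class."""
    filled = [list(r) for r in rows]
    n_features = len(rows[0])
    for cls in set(labels):
        members = [i for i, y in enumerate(labels) if y == cls]
        for j in range(n_features):
            known = [rows[i][j] for i in members if rows[i][j] is not None]
            if not known:
                continue  # no known value in this class; leave the gaps in place
            m = median(known)
            for i in members:
                if filled[i][j] is None:
                    filled[i][j] = m
    return filled
```

For example, a missing value in a class whose known values for that feature are 4.0 and 6.0 is filled with 5.0.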

Features selection phase
Feature selection (FS) is defined as selecting a subset of features without applying any transformation. Selecting relevant features is an important task for any classification system, because a large set of extracted features usually contains irrelevant ones. A large number of features, a problem known as "the curse of dimensionality", can strongly affect the performance and robustness of the system and is a serious challenge for existing learning methods [15]. Selecting relevant features simplifies the learned classifier and thus reduces the training time. Feature selection can be seen as an optimization problem, since the relevant features are those that optimize a certain fitness function. Several studies have addressed it in this way; some of them use the classification accuracy as the fitness function, so that the objective is to maximize the accuracy obtained with the selected features. Recently, nature-inspired meta-heuristic algorithms have become the most widely used algorithms for solving optimization problems. In this paper, four bio-inspired algorithms, WOA, MFO, FPA, and GWO, are proposed as feature selection algorithms, and their performance is compared. The algorithms are used in wrapper mode to choose the combination of features that maximizes the classification accuracy, which is defined as the fitness function. Only two parts of each algorithm are changed to turn it into a feature selector: the population initialization and the fitness function. The remaining steps, including the search for the optimal solution and the position updates, are unchanged. The parameter settings for all the proposed feature selection algorithms are: population size = 50, maximum generation = 10, number of dimensions = number of features, and switch probability, crossover, selection, and mutation rates = 0.7 for the WPBC and WDBC datasets. All algorithms also use the same fitness function, which makes the comparison between the proposed feature selection algorithms as fair as possible.
Population Initialization
In this paper, the population initialization mechanism is the same for all swarm algorithms. Each agent position (moth, wolf, whale, etc.) represents a feature subset (solution) in the search space. Each subset is a different combination of features with a different length. The indexes in each subset are randomly initialized in the range from 1 to the total number of features, which is 30 for WDBC and 32 for WPBC.
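The initialization described above can be sketched as follows. This is an assumption-laden illustration: the subset length is drawn uniformly between 1 and the number of features, indexes within a subset are distinct, and the function name is hypothetical.

```python
import random

def init_population(n_agents, n_features, rng=random):
    """Each agent is a sorted list of distinct 1-based feature indexes of random length."""
    population = []
    for _ in range(n_agents):
        size = rng.randint(1, n_features)   # subsets may differ in length
        population.append(sorted(rng.sample(range(1, n_features + 1), size)))
    return population
```

With the settings reported in this paper, a call such as `init_population(50, 30)` would create the initial population for WDBC.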
Fitness Function
The fitness function measures the quality of each search agent (solution). In each iteration, every search agent (such as a moth or a wolf) is evaluated by its ability to obtain the highest accuracy. In this paper, classification accuracy is the adopted fitness function, and K-Nearest Neighbor with k = 5 is the classifier used to evaluate each solution. The value of k was chosen by trial and error as the best-performing value on the adopted datasets. The fitness value is the average classification accuracy over 10-fold cross-validation, computed using Eq. (20):

$$F_n = \frac{1}{F} \sum_{j=1}^{F} \mathrm{Acc}_j \tag{20}$$

where F_n is the objective function, j is the fold index, F is the number of folds, and Acc_j is the classification accuracy obtained on fold j. The best position is the subset that gives the highest classification accuracy.
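A wrapper fitness of this kind can be sketched with the standard library alone. The sketch assumes Euclidean distance, interleaved (unshuffled) folds, 0-based column indexes in `subset`, and at least as many records as folds; these choices and the function names are illustrative, not the authors' MATLAB implementation.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def fitness(X, y, subset, k=5, folds=10):
    """Mean accuracy over `folds` cross-validation folds, using only columns in `subset`."""
    Xs = [[row[j] for j in subset] for row in X]
    accuracies = []
    for f in range(folds):
        test_idx = [i for i in range(len(Xs)) if i % folds == f]
        train_idx = [i for i in range(len(Xs)) if i % folds != f]
        train_X = [Xs[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, Xs[i], k) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / folds
```

Each search agent would call `fitness` with its own feature subset, so the cost of one iteration grows with the population size and the number of folds.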
In this work, a novel binary version of WOA, MFO, FPA, and GWO is proposed. In the binary version, the solution pool is in binary form, with every solution component restricted to the binary values {0, 1}. The agents are transferred from continuous to binary space using the following equations, where rand() is a random number drawn from the uniform distribution on [0, 1] and V_{i,t+1} is the updated binary position at iteration t.
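One common way to perform such a conversion is to pass each continuous component through an S-shaped transfer function and compare the result with rand(). The sketch below assumes a sigmoid transfer function, which is not confirmed by the text; it only illustrates the thresholding against rand(), and the function name is hypothetical.

```python
import math
import random

def to_binary(position, rng=random):
    """Map each continuous component to 0 or 1 by comparing a sigmoid of it with rand()."""
    bits = []
    for v in position:
        v = max(-60.0, min(60.0, v))          # clamp to avoid overflow in exp
        s = 1.0 / (1.0 + math.exp(-v))        # assumed S-shaped transfer function
        bits.append(1 if rng.random() < s else 0)
    return bits
```

A bit equal to 1 would mean that the corresponding feature is selected.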
Positions updating
The position updates of the agents are as defined in Section 2.
Termination criteria
The optimization process terminates when the best solution is found or when the maximum number of iterations is reached. In this study, the maximum number of iterations is used as the termination criterion, and it is set to 10 in all experiments.

Classification phase
In this paper, the most common classifiers in the literature are adopted: support vector machine (SVM) [16], decision tree (DT) [17], and k-nearest neighbour (KNN) [18]. These classifiers were chosen because they are the most widely used by researchers. In addition, the leave-one-out cross-validation method is used to evaluate the robustness of the proposed algorithms [19].

Clinical breast cancer datasets characteristics
The Wisconsin Breast Cancer datasets from the UCI machine learning repository are adopted for the analysis. The adopted datasets are Wisconsin diagnosis breast cancer (WDBC) and Wisconsin prognosis breast cancer (WPBC), which Wolberg has compiled since 1984 [20]. WDBC has 32 attributes, 569 records, and 2 classes, while WPBC has 34 attributes, 198 records, and 2 classes, and contains missing values in some records. Each record of these datasets represents follow-up data for one breast cancer case and consists of a classification pattern with a set of real-valued or numerical features (attributes).

Classification based Measurements
Four measures are used to evaluate the performance of the proposed approach: precision, recall, accuracy, and f-score. They are mathematically defined in [21]. In addition, five statistical measurements are adopted: the worst, best, and mean fitness values, the standard deviation (Std), and the average of selected features (ASS). They are defined as follows, where max_iter is the maximum number of iterations:

$$\text{Worst Fitness} = \min_{iter=1}^{max\_iter} F_n^{iter} \tag{24}$$
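These summary statistics are straightforward to compute from the per-iteration fitness values. The sketch below assumes that fitness is an accuracy (so the worst value is the minimum), that Std is the population standard deviation, and that ASS is the plain mean of the selected subset sizes; the exact definitions used in the paper may differ, and the names are illustrative.

```python
from statistics import mean, pstdev

def summarize(fitness_per_iter, subset_sizes):
    """Best, worst, mean and spread of fitness, plus the average number of selected features."""
    return {
        "best": max(fitness_per_iter),
        "worst": min(fitness_per_iter),     # fitness is an accuracy, so lower is worse
        "mean": mean(fitness_per_iter),
        "std": pstdev(fitness_per_iter),    # assumption: population standard deviation
        "ass": mean(subset_sizes),          # assumption: mean subset size
    }
```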

Experimental Results and Discussion
The experiments were implemented in MATLAB R2012 on a computer with a 2 GHz Intel Core processor and 2 GB of memory. The four proposed bio-inspired algorithms are compared in four aspects: classification-based measurements (i.e., accuracy, recall, precision, and f-measure), convergence, computational time, and stability.

Comparison of Classification Measurements
Table 2(a) summarizes the highest classification results obtained by the proposed algorithms, along with the number of reduced features, using the WPBC dataset. As can be seen, WOA yields the smallest feature subset and the highest accuracy, with MFO in second place. The same findings are found at  3) compares the results obtained using different SVM kernels for both the WPBC and WDBC datasets regarding f-score, accuracy, recall, and precision. As can be seen, WOA always obtains the highest results. Table (4) compares the highest results obtained by the proposed system with those of existing systems in the literature in terms of accuracy.

In [22], the authors showed that Principal Components Attribute obtained the best results for the WPBC dataset and that the Symmetric Uncert Attribute set obtained the best results for the WDBC dataset. Using the CART classifier, they obtained an overall classification accuracy of 94.72% for the WDBC dataset and 96.99% for the WPBC dataset. In [7], the authors evaluated four meta-heuristic algorithms, namely PSO, ICA, FA, and IWO, on WDBC, using a multi-layer perceptron (MLP) network for classification. Their experimental results showed that FA obtained an overall classification accuracy of 98.54%. The authors of [23] used CART with a feature selection algorithm to evaluate WDBC, and their system obtained an overall accuracy of 95.96%. In [24], the authors used neural networks, namely a multi-layer perceptron (MLP) network and learning vector quantization (LVQ), and obtained an overall accuracy of 96.8%. In [25], the authors proposed a Jordan Elman neural network approach and applied it to three breast cancer datasets, namely Wisconsin, WDBC, and WPBC, obtaining an overall accuracy of 95.06%. The authors of [26] compared different classifiers, namely Sequential Minimal Optimization (SMO), decision tree (J48), Naive Bayes (NB), Multi-Layer Perceptron (MLP), and K-Nearest Neighbor (IBK), on the Wisconsin Breast Cancer (WBC), WDBC, and WPBC datasets. Their results showed that SMO is the best classifier, with an overall accuracy of 96.99% for WBC, 97.71% for WDBC, and 77.32% for WPBC. In this study, four recent meta-heuristic algorithms are evaluated and compared, and both the WDBC and WPBC datasets are used for evaluation.

CART 93.49 [7] MLP 98.54 [27] Ensemble Decision Tree 95.96 [23] linear discreet analysis 96.8 [24] Neuron-Fuzzy 95.06 [25] Supervised Fuzzy Clustering 95.57[

Computational time analysis
Table (5) compares the results obtained by the proposed algorithms in terms of the highest fitness value and the computational time, where G represents the number of generations (population). As mentioned before, the fitness value is calculated using the KNN classifier with k = 5. The best values of the four proposed algorithms are emphasized in boldface. The highest results are obtained as the population size increases. WOA always obtains the highest results on both WPBC and WDBC, which suggests that WOA is able to find an optimal solution regardless of the population size. MFO, on the other hand, always obtains the lowest results but requires the least time in both cases. Moreover, as the population size increases, the computational time increases too.

Stability performance analysis
To test the robustness and stability of convergence of the proposed algorithms, the mean and standard deviation of the obtained fitness values are calculated over ten runs and outlined in Table (3). As can be observed, MFO has the best stability, as it has the minimum standard deviation and mean, and GWO is in second place. In most situations, stability increases as the number of generations increases. In addition, in most cases the highest stability is obtained when the number of generations equals 30.

Conclusion
In this paper, four bio-inspired algorithms are proposed as feature selection algorithms. The performance of these algorithms is evaluated and compared using two datasets, WDBC and WPBC, and three different classifiers are adopted and compared. From the experimental results, it can be concluded that the features selected using WOA obtained the highest results. MFO has the lowest computational time and is the most stable algorithm compared with the others. In addition, SVM with a quadratic kernel function obtained the highest accuracy for WDBC, while KNN with k = 5 and cosine distance obtained the highest results for WPBC. The obtained results can assist doctors in understanding the context of clinical breast cancer diagnoses and can support further data mining and machine learning applications.

Figure 1: The general architecture of the proposed clinical breast cancer computer-aided diagnosis system

Figure 2: Comparison of different swarm-based feature selection algorithms for WPBC using different kernel functions of SVM; (A) WOA feature selection results, (B) FPA feature selection results, (C) GWO feature selection results, (D) MFO feature selection results

Figure 3: Comparison of different swarm-based feature selection algorithms for WDBC using different kernel functions of SVM; (A) WOA feature selection results, (B) FPA feature selection results, (C) GWO feature selection results, (D) MFO feature selection results

Figure 4: Convergence curves of the four swarm-based feature selection algorithms; (A) WPBC and (B) WDBC

Table 2
(1) using WDBC. As observed, WOA is on top, as it obtains the highest accuracy with the fewest attributes. Table (1) compares the highest results obtained by three different classifiers, with the highest results emphasized in boldface. SVM with a quadratic kernel function obtains the highest results on the WDBC dataset, while KNN with k = 5 and cosine distance obtains the highest results on the WPBC dataset.

Table 1: The Performance of Different Classifiers

Table 2: Highest obtained results of all swarm-based feature selection (FS) algorithms, where NRA is the number of reduced attributes; (a) for WPBC and (b) for WDBC

Table 4: Comparison of existing CAD results with the obtained CAD results for (A) WDBC and (B) WPBC

Table 5: Objective Function and Computational Time for All Proposed Algorithms; (a) for WDBC and (b) for WPBC

Table 6: Performance Evaluation of Different Swarm Algorithms

To compare the convergence performance of the proposed algorithms, convergence curves for WPBC and WDBC are drawn in this section. The convergence curves for WOA, MFO, GWO, and FPA are shown in Figures 4(A) and 4(B). As can be observed, MFO converges faster than the other algorithms. GWO and WOA rank second and third, respectively, and FPA has the slowest convergence speed. However, WOA and FPA always reach the highest scores, while MFO reaches the worst.

4.3.3 Statistical measurements analysis
In this section, four measurements are used to evaluate the performance of each proposed algorithm. Table (6) outlines the best, worst, and mean fitness values obtained and the average selected features ratio over all iterations. As can be seen for WDBC and WPBC, WOA is superior in selecting the minimum number of features with good classification performance. However, FPA has the best worst value. It can also be observed that the mean, worst, and best values are close to each other for MFO, which indicates good convergence.