Feature Selection with a Local Search Strategy Based on the Forest Optimization Algorithm

Abstract: Feature selection has been widely used in data mining and machine learning. Its objective is to select a minimal subset of features according to some reasonable criteria so that the original task can be solved more quickly. In this article, a feature selection algorithm with a local search strategy based on the forest optimization algorithm, namely FSLSFOA, is proposed. The novel local search strategy in the local seeding process guarantees the quality of the feature subsets in the forest. Next, the fitness function is improved so that it considers not only the classification accuracy but also the size of the feature subset. To avoid falling into a local optimum, a novel global seeding method is introduced, which selects trees from the bottom of the candidate set and gives the algorithm more diversity. Finally, FSLSFOA is compared with four feature selection methods to verify its effectiveness. Most of its results are superior to those of the comparative methods.


Introduction
Feature selection methods are widely used in data mining and machine learning to improve the accuracy of classifiers and to accelerate the training of models with large numbers of input features [Migdady (2013); Rong, Ma, Cao et al. (2019)]. They eliminate irrelevant features and strongly correlated redundant features from the original feature set and thereby extract the valuable ones [Paisitkriangkrai, Shen and Hengel (2016); Ma, Shao, Hao et al. (2018)]. Feature selection is a difficult task mainly because of its large search space: since each feature is either chosen or not, the total number of possible solutions for a dataset with n features is 2^n [Xue, Zhang, Browne et al. (2016)]. The task becomes more challenging as n increases in many areas with advances in data collection techniques [Tzanis (2017)]. As is well known, an exhaustive search for the best feature subset of a given dataset is practically impossible in most situations, even for a moderate-sized dataset [Wipf and Nagarajan (2016)]. It cannot be excluded that quantum computing will in the future provide enough computing power to make brute-force search on today's datasets possible [Vandersypen and Leeuwenhoek (2017); Lv, Ma, Tang et al. (2016)]. By that time, however, even bigger datasets will undoubtedly be available to scientists [Stuart (2016); Zhang, Ma, Cao et al. (2015)]. An effective way to find suitable features in huge datasets is to set the search starting point randomly at first [Melian and Verdegay (2011)]. The question is then how to optimize the solution rapidly and directly, in the way gradient descent is used to optimize deep learning models [Keskar, Mudigere, Nocedal et al. (2016)]. According to current understanding, random search strategies that arrive at an approximately best solution are still the best choice for the feature selection problem [Zhu, Xu, Chen et al. (2016)].
Our previous work on feature selection produced a method called FSFOACD [Ma, Jia, Zhou et al. (2018); Rong, Ma, Cao et al. (2019)]. That method used a contribution degree strategy to improve the efficiency of the forest optimization algorithm. In other words, it used measures of relevance and redundancy to guide the search direction of the forest optimization algorithm, which improves its performance to a certain extent. However, FSFOACD still has some deficiencies. For example, the number of features selected by each tree in the forest is generated randomly in the initialization phase. If, for most trees, this random number is close to the number of features in the original dataset, the search locations of the trees will be far away from the optimal solution, which makes the algorithm inefficient at the beginning of the search. Moreover, during the local seeding process, the neighbor trees added to the forest are strictly limited by their parents. Their limited number and randomness mean that the algorithm takes a comparatively long time to reach a well-accepted feature subset. Selecting only one feature at a time cannot guarantee the quality of the feature subsets, so the search efficiency is low. In this paper, a novel feature selection method with a local search strategy based on the forest optimization algorithm, namely FSLSFOA, is proposed. Firstly, a feature subset size determination mechanism is used to initialize the forest optimization algorithm, which favors small feature subsets for the trees in the forest during the initialization phase. In this way, more trees are in an ideal position at the beginning, since they are distributed more evenly over the search space. This initialization strategy greatly improves the search efficiency of the algorithm.
Secondly, the search strategy of FSFOACD is both reused and improved. Each time a neighbor node is added, the algorithm no longer selects a single feature for evaluation. Instead, it selects a probabilistic number of high-quality features and an empirical number of low-quality features at a time. The new search strategy not only maximizes the quality of the feature subsets in the forest but also improves the search efficiency of the algorithm to a great extent. Thirdly, a new fitness function that balances the classification accuracy and the dimensionality reduction ratio is proposed, since the core purpose of feature selection is to increase the dimensionality reduction ratio on the premise of ensuring the classification accuracy. To verify the effectiveness of our algorithm, 10 widely used datasets from UCI are selected, together with three classifiers: support vector machine (SVM) [Aburomman and Reaz (2016)], decision tree (DT) [Shirani, Habibi, Besalatpour et al. (2015)] and k-nearest neighbor (KNN) [Benmahamed, Teguar and Boubakeur (2018)]. Four feature selection methods are selected for comparison: UPFS [Dadaneh, Markid and Zakerolhosseini (2016)], ANFS [Luo, Nie, Chang et al. (2018)], PSO(4-2) [Tran, Xue and Zhang (2014b)] and FSFOACD [Ma, Jia, Zhou et al. (2018)]. The experiments show that most of the results are superior to those of the selected feature selection algorithms. The rest of this paper is organized as follows. In Section 2, related work on feature selection is reviewed. Details of the proposed method are introduced in Section 3. In Section 4, the proposed algorithm is compared with other existing feature selection methods. Finally, Section 5 summarizes the present study.

Related work
Feature selection involves two conflicting objectives: maximizing the classification accuracy and minimizing the number of features [Hancer, Xue, Zhang et al. (2017)]. Therefore, feature selection can be treated as a multi-objective problem of finding a set of trade-off solutions between these two objectives [Selvam, Kumar and Siripuram (2017)]. An important problem in feature selection is feature interaction (or epistasis) [Guyon (2003)]. There can be two-way, three-way, or complex multi-way interactions among features [Gorelick and Bertram (2010)]. A feature that is weakly relevant to the target concept can significantly improve the classification accuracy if it is used together with some complementary features [Anthimopoulos, Christodoulidis, Ebner et al. (2016)]. In contrast, an individually relevant feature may become redundant when used together with other features [Sun, Goodison, Li et al. (2007)]. Based on the evaluation criteria, feature selection algorithms are generally classified into two categories: 1) filter approaches and 2) wrapper approaches [Ambusaidi, He, Nanda et al. (2016); Zawbaa, Emary, Hassanien et al. (2016)]. Their main difference is that wrapper approaches include a classification/learning algorithm in the feature subset evaluation, while filter algorithms are independent of any classification algorithm [Tran, Zhang and Xue (2016)]. Filters ignore the classification performance of the selected features whereas wrappers do not, which usually results in wrappers finding better solutions [Yang, Liu, Zhou et al. (2013)]. The choice between filters and wrappers depends on the requirements for accuracy and time efficiency [Rodriguezgaliano, Luqueespinar, Chicaolmo et al. (2018)]. Within the filter and wrapper categories, both unsupervised and supervised learning methods are adopted for feature selection. Forest Optimization Algorithm (FOA) and its improvements [Ma, Jia, Zhou et al. (2018)] are supervised methods.
The Particle Swarm Optimization (PSO) method [Tran, Xue and Zhang (2014b)] is also supervised: while choosing features, it considers the classification performance. In contrast, UPFS [Dadaneh, Markid and Zakerolhosseini (2016)] and ANFS [Luo, Nie, Chang et al. (2018)] are unsupervised methods, and the labels of the dataset are masked while these two methods are used. Unsupervised methods are the ultimate goal of feature selection, and so they occupy an important position in this field.

FOA
Forest Optimization Algorithm (FOA) is a classic supervised method proposed by Ghaemi et al. [Ghaemi and Feizi-Derakhshi (2014)]. We previously proposed the feature contribution degree to improve the efficiency of feature selection based on FOA [Ma, Jia, Zhou et al. (2018)]. FOA is an evolutionary algorithm and is therefore suitable for optimization tasks. Ghaemi et al. found that FOA can improve the classification accuracy compared with other feature selection methods [Ghaemi and Feizi-Derakhshi (2016)]. Fernández-García et al. compared FOA with other feature selection methods on academic data to predict whether students will finish their degree after completing their first year in college [Fernández-García, Iribarne, Corral et al. (2018)]. Although FOA obtains better results in the above experiments, its search efficiency is still low and its algorithmic complexity is relatively high. In addition, the random initialization of parameters at the beginning cannot ensure the best results. We therefore go on to study how to optimize the parameter initialization and the search strategy.

PSO
Feature selection is a combinatorial optimization problem, which can be solved by Particle Swarm Optimization (PSO) [Zhang, Wang, Sui et al. (2017)]. Xue et al. studied multi-objective PSO for feature selection, which achieved results comparable with existing well-known multi-objective algorithms in most cases [Xue, Zhang and Browne (2013)]. In 2014, Tran et al. extended PSO to classification problems with thousands or tens of thousands of features [Tran, Xue and Zhang (2014a)]. At the same time, Tran et al. proposed three new initialization strategies and three new personal best and global best updating mechanisms in PSO to develop feature selection approaches, one of which is named PSO(4-2) [Tran, Xue and Zhang (2014b)]. Because the particle swarm can fall into a local optimum during the search process, a mechanism is needed that lets the particles escape from it in order to optimize large datasets. In general, PSO cannot guarantee a globally optimal result.

UPFS
Unsupervised feature selection only evaluates the similarity between features and removes the redundant ones. Mitra et al. proposed an unsupervised feature selection algorithm suitable for datasets that are large in both dimension and size [Mitra, Murthy and Pal (2002)]. Hou et al. first combined embedding learning and feature ranking, and proposed joint embedding learning and sparse regression (JELSR) to perform feature selection [Hou, Nie, Li et al. (2014)]. Dadaneh et al. proposed unsupervised probabilistic feature selection using ant colony optimization (UPFS) [Dadaneh, Markid and Zakerolhosseini (2016)]. They utilized inter-feature information that captures the similarity between features, which leads the algorithm to decrease redundancy in the final set. Since ant colony optimization is a heuristic algorithm like FOA, we choose this method as a reference to verify our proposed algorithm. Luo et al. [Luo, Nie, Chang et al. (2018)] proposed an adaptive unsupervised feature selection method with structure regularization, which characterizes the intrinsic local structure by an adaptive reconstruction graph and simultaneously considers its multi-connected-components (multicluster) structure.

Methods
In order to solve some of the problems identified in the related work, a feature selection algorithm with a local search strategy based on the forest optimization algorithm, namely FSLSFOA, is proposed. It contains a number of improvements and modifications compared with FSFOACD. FSLSFOA involves five main stages: (1) Initialize trees, (2) Group features, (3) Local seeding, (4) Size limitation, (5) Global seeding.

Initialization based on feature subset size determination mechanism
Since the main purpose of feature selection is to find the minimum feature subset that is necessary and sufficient to identify the target, the optimal feature subset generally contains few features. In the initialization phase, the starting location of each tree in the forest should therefore be close to the optimal solution. In other words, the difference between the optimal feature subset size and the initialized feature subset size should not be very large. A probabilistic random function is used to determine the initial size of the feature subset. The probability is defined by Eq. (1), where f represents the number of original features of the dataset, sf represents the number of selected features and L indicates the distance between f and sf, that is, L = f − sf. P(sf) is the initial probability of selecting sf features. According to Eq. (1), P(sf) reaches its peak when sf is minimized. Next, the roulette gambling algorithm is used to determine sf. That is to say, sf is selected randomly from the range [m, M]. The value of m is set to 3, since this is an efficient and comprehensive lower bound of sf for most datasets, although it may be adjusted in practice for some extremely simple datasets. Besides, M = ε * f, where f depends entirely on the given dataset and ε is used to adjust the size of M. If ε is set close to 1, M will be close to f. This makes the search space very large, which increases the computation cost and produces many invalid feature subsets. Our goal is to select a significant subset with a small number of features. The choice of ε is discussed in the parameter setting section of our experiments. The roulette gambling algorithm is designed as follows:
Algorithm 1 Roulette Gambling Algorithm
end if
8: end for
9: return sf
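As an illustration of the size determination mechanism, the following minimal Python sketch draws sf by roulette-wheel selection. It assumes that P(sf) is proportional to L = f − sf and normalized over [m, M]; this is consistent with the statement that P(sf) peaks at the smallest sf, but it is an assumption rather than a reproduction of Eq. (1). The function names `subset_size_probabilities` and `roulette_select` are hypothetical.

```python
import random


def subset_size_probabilities(f, eps=0.4, m=3):
    """Return {sf: P(sf)} for sf in [m, M], where M = floor(eps * f).

    ASSUMPTION: P(sf) is proportional to L = f - sf and then normalized.
    The exact form of Eq. (1) may differ.
    """
    if f <= m:
        raise ValueError("f must be larger than the lower bound m")
    M = max(m, int(eps * f))
    weights = {sf: f - sf for sf in range(m, M + 1)}
    total = sum(weights.values())
    return {sf: w / total for sf, w in weights.items()}


def roulette_select(probs, rng=random):
    """Roulette-wheel selection: each key is drawn with its probability."""
    r = rng.random()          # uniform in [0, 1)
    cumulative = 0.0
    chosen = None
    for sf, p in probs.items():
        cumulative += p
        chosen = sf
        if r < cumulative:
            break
    return chosen             # last key also covers floating-point round-off


if __name__ == "__main__":
    probs = subset_size_probabilities(f=60, eps=0.4)
    print(roulette_select(probs))
```

Under this assumption, small subset sizes are drawn more often than large ones, while every size in [m, M] keeps a non-zero chance of being selected.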

Grouping strategy based on feature importance
After the initialization phase, all features are divided into a high-quality feature group and a low-quality feature group according to their importance. The purpose of this grouping is to provide a reference for the forest optimization algorithm when selecting features. The high-quality features are supplied to the forest optimization algorithm, thereby improving the quality of the trees (feature subsets) in the forest. The Pearson correlation coefficient is applied to measure the correlation between different features [Mu, Liu and Wang (2017)] because of its simplicity and because it is easy to manipulate as a univariate method [Coelho, Braga and Verleysen (2010)]. This step serves as a first, coarse-grained partitioning of the features, on which the subsequent importance measure builds. In its standard form, the coefficient is defined as follows:
$$C(i, j) = \frac{\sum_{k=1}^{f} \left(x_i(k) - \bar{x}_i\right)\left(x_j(k) - \bar{x}_j\right)}{\sqrt{\sum_{k=1}^{f} \left(x_i(k) - \bar{x}_i\right)^2}\,\sqrt{\sum_{k=1}^{f} \left(x_j(k) - \bar{x}_j\right)^2}} \tag{2}$$
where C(i, j) is the correlation coefficient of feature i and feature j, f is the number of samples, x_i(k) and x_j(k) represent the values of features i and j in the kth sample, and \bar{x}_i and \bar{x}_j represent the mean values of x_i and x_j over the f samples, respectively. According to Eq. (2), the correlation coefficient between two features measures the similarity between them. The higher the correlation coefficient is, the higher the similarity between the two features is. Conversely, a lower correlation coefficient means that the similarity between the two features is comparatively low. After the correlation coefficients have been calculated for all possible pairs of features, the importance of feature i is calculated by Eq. (3), where C(i, c) represents the correlation coefficient between feature i and the class attribute c.
The larger the value of C(i, c) is, the greater the influence of feature i on the classification result is. Conversely, the smaller the value is, the smaller the influence of the feature on the classification result will be. A high importance value means that the feature is highly correlated with the class attribute and has low redundancy. A low importance value means that the feature is weakly correlated with the class attribute and has high redundancy. To let the forest optimization algorithm select higher-quality feature subsets while exploring the optimal solution, all features are sorted in descending order of importance and divided into two groups. The first half of the features is placed in a high-quality feature group (Group_high) and the second half in a low-quality feature group (Group_low). The importance of every feature in Group_high is greater than that of the features in Group_low, and the features of both groups are sorted in descending order of importance. Fig. 1 is a schematic diagram of the feature grouping strategy.
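The grouping step can be sketched in a few lines of Python. The sketch computes Pearson correlations directly and then ranks the features; because the importance formula of Eq. (3) is not reproduced here, it uses an assumed stand-in, namely the absolute class correlation divided by one plus the mean absolute correlation with the other features. The names `pearson` and `group_features`, and the choice to give Group_high the extra feature when the count is odd, are likewise assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treated as uncorrelated
    return cov / sqrt(vx * vy)


def group_features(columns, labels):
    """Return (group_high, group_low) as lists of feature indices.

    Both lists are sorted by descending importance.
    ASSUMPTION (stand-in for Eq. (3)):
        importance_i = |C(i, c)| / (1 + mean_j |C(i, j)|),  j != i
    """
    d = len(columns)
    importance = []
    for i in range(d):
        relevance = abs(pearson(columns[i], labels))
        others = [abs(pearson(columns[i], columns[j]))
                  for j in range(d) if j != i]
        redundancy = sum(others) / len(others) if others else 0.0
        importance.append(relevance / (1.0 + redundancy))
    ranked = sorted(range(d), key=lambda i: importance[i], reverse=True)
    half = (d + 1) // 2  # assumption: Group_high gets the extra feature
    return ranked[:half], ranked[half:]


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 0, 1]
    columns = [
        [0, 0, 1, 1, 0, 1],   # identical to the class attribute
        [1, 2, 3, 4, 5, 6],
        [5, 5, 5, 5, 5, 5],   # constant, carries no information
        [1, 0, 1, 0, 1, 0],
    ]
    print(group_features(columns, labels))
```

Any importance measure that rewards class correlation and penalizes redundancy could be substituted in `group_features` without changing the sorting and splitting logic.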

Local search strategy
After Group_high and Group_low have been produced by the feature grouping strategy, the local search strategy is used to select suitable features from the two groups. This strategy aims to provide better locations from which the forest optimization algorithm can further explore optimal solutions. The local search strategy improves the search of the forest optimization algorithm mainly through "add" and "delete" operations. In short, the "add" operation selects the desired number of features, and the "delete" operation removes a certain number of features from the current position of the tree. The "add" operation adds features with high class correlation and low redundancy to the tree. The "delete" operation removes features with low class correlation and high redundancy from the tree. In the local seeding stage, a novel local search strategy is proposed that dynamically changes the position of the neighbor tree by adding and deleting as many features as possible from Group_high and Group_low each time a neighbor tree is added. The specific steps of the local search strategy are as follows. Firstly, for each tree of age 0 in the forest, the features whose value equals 1 are put into a subset T. This subset T includes all the randomly selected features at the beginning of the local search strategy. As shown in Fig. 2, if the current tree is Tree = [1, 1, 0, 1, 1, 0, 0, 1, 0, 1, age = 0], a subset T with six features is obtained, T = {f1, f2, f4, f5, f8, f10}. In this example, it is assumed that Group_high = {f5, f3, f1, f7, f9} and Group_low = {f8, f4, f6, f10, f2}, which would be calculated by the grouping strategy in Section 3.2. Secondly, T is split into two subsets, T_high and T_low, according to Group_high and Group_low. T_high contains the high-quality features in T, that is, those also included in Group_high. In contrast, T_low contains the low-quality features in T, that is, those included in Group_low.
T_high and T_low are each sorted in descending order of importance.
T_high = {f5, f1} and T_low = {f8, f4, f10, f2} can easily be inferred from Fig. 2, because f5, f1 ∈ Group_high and f8, f4, f10, f2 ∈ Group_low. Thirdly, the "add" operation is used to select the desired features, and the "delete" operation is used to remove a certain number of features from the current tree. How many features T_high and T_low should empirically contain depends on the parameters α and β, where α = λ * sf and β = (1 − λ) * sf. Here λ is specified by the user and sf is initialized by the feature subset size determination mechanism in Section 3.1. The parameter λ controls the proportion of features selected from Group_high and Group_low. The choice of λ is discussed in the parameter setting section of our experiments. If the number of high-quality features in the tree, |T_high|, is smaller than α, then α − |T_high| high-quality features are added to the current tree from the set {Group_high − T_high}, that is, features included in Group_high but not in T_high. Otherwise, if |T_high| is larger than α, add_num high-quality features are added to the current tree from the set {Group_high − T_high}. The parameter add_num is defined by Eqs. (4) and (5), where add_num and y are natural numbers in the range [1, |Group_high − T_high|], and |Group_high − T_high| indicates the number of features in the set {Group_high − T_high}.
Random(1) is a uniformly distributed random variable in the range (0, 1]. The function Sign(v) means that if v < 0, Sign(v) is set to 1, which withdraws that candidate from the competition in argmin(·).
Here v represents the value inside the function Sign(·).
The argmin(·) operator yields the add_num that minimizes the result of Sign(·).
In addition, if the number of low-quality features in the tree, |T_low|, is larger than β, then |T_low| − β low-quality features are deleted from T_low. Otherwise, if |T_low| is smaller than β, del_num features are deleted from T_low. The parameter del_num is calculated by Eq. (6), where del_num and z are natural numbers in the range [1, β − |T_low|], and Random(1) and Sign(x) are defined as in Eq. (4) and Eq. (5). After this step, the example in Fig. 2 finally yields NewTree = [1, 0, 1, 1, 1, 0, 0, 1, 0, 0, age = 0]. In addition, the age of the old tree increases to 1 in this example. Note that the "add" operation always selects the highest-quality feature, and the "delete" operation always drops the lowest-quality feature. Our strategy differs from the strategy proposed by Moradi et al. [Moradi and Gholampour (2016)], especially in the two cases |T_high| > α and |T_low| < β. In these two cases, the number of high-quality features is not pulled back to α, nor is the number of low-quality features forced to β. The calculation of add_num and del_num can be regarded as a probabilistic rule in the spirit of feedback theory.
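The "add" and "delete" operations can be traced on the example of Fig. 2 with the short Python sketch below. It covers the deterministic cases (|T_high| < α and |T_low| > β) as described; for the remaining cases, add_num and del_num are passed in by the caller as hypothetical inputs, because Eqs. (4) to (6) are not reproduced here. The function name `local_search_step`, the 0-based feature indices, and the values α = 3 and β = 2 are assumptions chosen so that the sketch reproduces the NewTree of the example.

```python
def local_search_step(tree, group_high, group_low, alpha, beta,
                      add_num=0, del_num=0):
    """Return a new 0/1 feature vector after one "add" and one "delete".

    tree        : list of 0/1 flags, one per feature (age kept by caller)
    group_high  : 0-based feature indices, sorted by descending importance
    group_low   : 0-based feature indices, sorted by descending importance
    alpha, beta : target counts of high- and low-quality features
    add_num, del_num : HYPOTHETICAL stand-ins for the values that
                       Eqs. (4)-(6) would supply in the other cases
    """
    selected = {i for i, bit in enumerate(tree) if bit == 1}
    t_high = [i for i in group_high if i in selected]  # keeps importance order
    t_low = [i for i in group_low if i in selected]

    # "add": take the best not-yet-selected high-quality features first.
    n_add = alpha - len(t_high) if len(t_high) < alpha else add_num
    candidates = [i for i in group_high if i not in selected]
    to_add = candidates[:max(0, n_add)]

    # "delete": drop the worst selected low-quality features first.
    n_del = len(t_low) - beta if len(t_low) > beta else del_num
    n_del = max(0, min(n_del, len(t_low)))
    to_delete = t_low[len(t_low) - n_del:]

    new_tree = list(tree)
    for i in to_add:
        new_tree[i] = 1
    for i in to_delete:
        new_tree[i] = 0
    return new_tree


if __name__ == "__main__":
    tree = [1, 1, 0, 1, 1, 0, 0, 1, 0, 1]
    group_high = [4, 2, 0, 6, 8]   # f5, f3, f1, f7, f9
    group_low = [7, 3, 5, 9, 1]    # f8, f4, f6, f10, f2
    # alpha = 3 and beta = 2 are assumed values, not taken from the paper.
    print(local_search_step(tree, group_high, group_low, alpha=3, beta=2))
```

With these assumed targets, f3 is added (the best feature in Group_high − T_high) and f10 and f2 are deleted (the two worst features in T_low), which gives [1, 0, 1, 1, 1, 0, 0, 1, 0, 0].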

Fitness function
The novel fitness function is defined by Eq. (7), where Accuracy(Tree) is the classification accuracy of the current tree (feature subset) on the classifier, |C| is the number of features selected in the current tree, and |f| is the total number of features of the dataset. The parameter µ weights the relative importance of the classification accuracy and the dimensionality reduction ratio of the current feature subset. Eq. (7) aims to find a balance point between the classification accuracy and the dimensionality reduction ratio, since the core purpose of feature selection is to increase the dimensionality reduction ratio on the premise of ensuring the classification accuracy. Generally, accuracy is more important than the dimensionality reduction ratio. In the parameter selection experiments, the best fitness value is achieved when µ is set to 0.8. Note that all trees in the forest are assessed by this novel fitness function instead of by accuracy alone.
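A fitness function with the described behaviour can be sketched as a weighted sum. The form below, µ · Accuracy + (1 − µ) · (1 − |C| / |f|), is an assumption that matches the description (µ = 1 reduces to accuracy alone, and smaller subsets are rewarded); it is not a reproduction of Eq. (7). The function name `fitness` is hypothetical, and accuracy is taken to lie in [0, 1].

```python
def fitness(accuracy, n_selected, n_total, mu=0.8):
    """Weighted balance of accuracy and dimensionality reduction ratio.

    ASSUMED form: mu * accuracy + (1 - mu) * (1 - n_selected / n_total).
    accuracy   : classification accuracy in [0, 1]
    n_selected : |C|, number of features selected in the tree
    n_total    : |f|, total number of features in the dataset
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    reduction_ratio = 1.0 - n_selected / n_total
    return mu * accuracy + (1.0 - mu) * reduction_ratio


if __name__ == "__main__":
    # Same accuracy, fewer features: the smaller subset scores higher.
    print(fitness(0.90, 5, 10), fitness(0.90, 3, 10))
```

Under this assumed form, two trees with equal accuracy are ranked by subset size, which is the tie-breaking behaviour the text motivates.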

Size limitation
In order to speed up the local seeding process, the parameters "life time" and "area limit" are used to control the number of trees in the forest. Trees that do not satisfy these restrictions are moved from the forest to a candidate set. In our algorithm, the candidate set works like a recycle bin that collects trees with an older age or a lower fitness value. However, candidate trees are useful in the next step, called global seeding. Note that if the tree with the highest fitness value reaches "life time", its age is reset to 0.
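The size limitation step can be sketched as follows, together with the feature reversal that the global seeding step described next applies to candidate trees. The `Tree` class and the function names are hypothetical, and two details are assumptions: a tree is treated as expired when its age exceeds "life time", and the best tree's age is reset unconditionally, as in line 10 of Algorithm 2.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tree:
    features: List[int]   # 0/1 flag per feature
    age: int
    fitness: float = 0.0


def limit_forest(forest: List[Tree], life_time: int,
                 area_limit: int) -> Tuple[List[Tree], List[Tree]]:
    """Split the forest into (kept trees, candidate set).

    Note: the age of the best tree is reset in place.
    """
    if not forest:
        return [], []
    ranked = sorted(forest, key=lambda t: t.fitness, reverse=True)
    ranked[0].age = 0                       # the best tree never expires
    alive = [t for t in ranked if t.age <= life_time]
    expired = [t for t in ranked if t.age > life_time]
    kept, overflow = alive[:area_limit], alive[area_limit:]
    return kept, expired + overflow


def reverse_tree(tree: Tree) -> Tree:
    """Global-seeding reversal: flip every feature flag and reset the age.

    The fitness of the returned tree is a placeholder and must be
    re-evaluated by the caller.
    """
    return Tree([1 - bit for bit in tree.features], age=0)


if __name__ == "__main__":
    old = Tree([1, 0, 1, 1, 1, 0, 0, 1, 0, 0], age=10)
    print(reverse_tree(old))
```

Keeping the candidate set as a plain list makes it easy to sort by fitness later and pick the weakest trees for reversal.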

Global seeding
This step gives the candidate trees a chance of rebirth. We select "flood rate" trees with the lowest fitness from the bottom of the candidate set. After a heavy rainfall, these weakest trees are carried far away by the flood: their features are completely reversed and their ages are reset. For example, a tree such as [1, 0, 1, 1, 1, 0, 0, 1, 0, 0, age = 10] is reversed into [0, 1, 0, 0, 0, 1, 1, 0, 1, 1, age = 0]. The flood drives this global seeding process, which produces many unusual new trees that are quite different from those obtained by local seeding. These new trees participate in the next iteration and effectively prevent the algorithm from falling into a local optimum. The complete pseudocode of FSLSFOA is as follows:
Algorithm 2 FSLSFOA
Input: life time, area limit, flood rate, ε, λ, µ
Output: the best feature set with the highest fitness value
STEP 1 Initialization
1: use the feature subset size determination mechanism to calculate sf, which is affected by ε
2: sf of features are initialized to 0 and 1 randomly for the "area limit" of trees created
3: each tree is represented as a (D + 1)-dimensional vector (D features plus an age, which is set to 0)
STEP 2 Feature Grouping
4: compute the importance of each feature and divide the feature set into Group_high and Group_low
5: while iterations < maximum iterations do
STEP 3 Local Seeding (use λ to empirically control the proportion of selected features)
sort all trees according to fitness value and move low fitness trees into candidate set
10: set the best tree's age to 0
STEP 6 Global Seeding
11: select "flood rate" of trees with the lowest fitness from the bottom of the candidate set
12: reverse the first D features of the tree vectors and set their age to 0
13: end while
14: return the best feature set with the highest fitness value

Experiment
Ten standard datasets and three classical classifiers are selected to conduct the experiments. In addition, our algorithm is compared with other feature selection methods: feature selection using the forest optimization algorithm based on contribution degree (FSFOACD) [Ma, Jia, Zhou et al. (2018)], unsupervised probabilistic feature selection using ant colony optimization (UPFS) [Dadaneh, Markid and Zakerolhosseini (2016)], adaptive unsupervised feature selection with structure regularization (ANFS) [Luo, Nie, Chang et al. (2018)] and feature selection using particle swarm optimization with a novel initialization and update mechanism (PSO(4-2)) [Tran, Xue and Zhang (2014b)].

Data sets
In the experiments, 10 standard datasets from the UCI machine learning repository are used: Glass, Wine, Vehicle, Hepatitis, Parkinsons, Ionosphere, Dermatology, SpamBase, Sonar and Arrhythmia. These datasets cover small, medium and large sizes, and they have been widely used in many research areas of machine learning. Tab. 1 shows the details of these datasets, such as the number of features, the number of categories, and the number of samples. Missing values are filled with the average of the corresponding feature. In addition, in real cases, the numerical ranges of different features vary greatly, so features with large value ranges dominate those with small value ranges. To overcome this problem, all of the features are normalized.
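The two preprocessing steps can be sketched in plain Python as follows. Missing values are represented by `None` and filled with the column mean, as described above. The normalization method is not specified in the text, so min-max scaling to [0, 1] is used here as an assumption; the function name `impute_and_normalize` is hypothetical, and every column is assumed to contain at least one observed value.

```python
def impute_and_normalize(rows):
    """Fill missing values with column means, then min-max scale columns.

    rows : list of samples, each a list of numbers or None (missing)
    ASSUMPTION: min-max scaling; the text only says "normalized".
    """
    n_cols = len(rows[0])
    out = [list(r) for r in rows]
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        mean = sum(present) / len(present)
        col = [mean if r[j] is None else r[j] for r in rows]
        lo, hi = min(col), max(col)
        span = hi - lo
        for i, v in enumerate(col):
            # A constant column carries no information and maps to 0.0.
            out[i][j] = (v - lo) / span if span else 0.0
    return out


if __name__ == "__main__":
    data = [[1.0, 10.0], [None, 20.0], [3.0, 30.0]]
    print(impute_and_normalize(data))
```

After scaling, a feature measured in the thousands and one measured in fractions contribute on the same scale to distance-based classifiers such as KNN.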

Classifiers
To verify the generality of our proposed algorithm, several classical classifiers are used to test the classification performance of the selected feature subset: support vector machine (SVM), decision tree (DT) and k-nearest neighbor (KNN). These classifiers are widely used in the field of data mining. We use the SMO, IBK and J48 algorithms in WEKA to implement the SVM, KNN and DT classifiers, respectively.

Evaluation index
In the experiments, the classifiers above classify the datasets using the features selected by the feature selection algorithm. The accuracy of a classifier on a dataset is the percentage of samples that it labels correctly. To measure stability, the feature selection algorithm is run 10 times under each condition, and the standard deviation is then calculated from the 10 resulting accuracies. Moreover, the fitness value, which introduces a penalty on the number of features, is used inside the model to balance the goals of feature selection and classification. In practice, Accuracy(Tree) in Eq. (7) is calculated by feeding the classifier with the selected features and the labels of the dataset. The fitness value replaces accuracy as the internal evaluation index of the model, while the external experimental results still report the more common accuracy for the convenience of other researchers.
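The evaluation index can be sketched with the Python standard library. Accuracy is computed as the percentage of correctly labeled samples, and the mean and standard deviation are taken over repeated runs. The text does not state whether the population or the sample standard deviation is reported, so the population form (`statistics.pstdev`) is used here as an assumption; the function names are hypothetical.

```python
from statistics import mean, pstdev


def accuracy(predicted, actual):
    """Percentage of samples whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def summarize_runs(accuracies):
    """Return (mean, standard deviation) over repeated runs.

    ASSUMPTION: population standard deviation; use statistics.stdev
    instead if the sample standard deviation is intended.
    """
    return mean(accuracies), pstdev(accuracies)


if __name__ == "__main__":
    print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))
    print(summarize_runs([80.0, 82.0, 84.0]))
```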

Parameter discussion
In this section, the parameters of FSLSFOA are discussed. The number of trees in the forest is initialized to 50, which is satisfactory for most of the datasets we used. The values of "life time", "flood rate" and "area limit" are set to 15, 5 and 50, respectively. "life time" controls the vitality of the forest. "flood rate" gives a chance to the trees at the bottom of the candidate set, but may also bring confusion to the forest. It is therefore recommended to set "flood rate" within 10% of "area limit", which directly affects the time cost of our algorithm. The parameters above are undoubtedly affected by the size of the actual datasets, but a performance discussion on a specific dataset means little in comparison with the average accuracy over all the datasets we used. The baseline settings of ε, λ and µ are 0.5, 0.5 and 1, respectively. Each of them is then discussed according to the average accuracy of all classifiers on all datasets. Each time, only one parameter is changed from the baseline (a controlled-variable approach), which isolates the influence of that parameter on the average accuracy. In this part, the evaluation indicator is the overall average of the 30 accuracy rates of the three classifiers on the ten datasets. Firstly, in Tab. 2, ε shows the best performance at 0.4. This result suggests that only about 40% of the features are worth selecting. In other words, feature selection is generally necessary for common datasets. Secondly, in Tab. 3, the average accuracy peaks at 87.58% when λ is set to 0.6. This indicates that the contribution of the feature grouping process is limited: selecting 60% of the features from Group_high and 40% from Group_low, rather than 100% from Group_high, achieves the best result. Thirdly, µ is used to adjust the composition of the fitness value. As shown in Tab. 4, the best value of µ is 0.8, which indicates that optimizing a fitness value that contains a 20% feature number penalty achieves an even better average accuracy than using accuracy alone as the evaluation index. Surprisingly, the feature number penalty, which aims to reduce the number of features, also brings better performance for our algorithm. We speculate that this is because the feature sets with the highest training accuracy have been overfitted.
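The one-parameter-at-a-time procedure can be sketched as a small Python helper. The `evaluate` callback stands in for a full run of the algorithm that returns the average accuracy over all classifiers and datasets; it is hypothetical, and the toy objective in the usage example is contrived purely to exercise the code and is not experimental data.

```python
def one_at_a_time_sweep(evaluate, baseline, grids):
    """Vary one parameter at a time while the others stay at baseline.

    evaluate : callable(params_dict) -> score (hypothetical callback)
    baseline : dict of baseline parameter values
    grids    : dict mapping parameter name -> list of values to try
    Returns {name: (best_value, {value: score})}.
    """
    results = {}
    for name, values in grids.items():
        scores = {}
        for v in values:
            params = dict(baseline)   # copy, so the baseline is untouched
            params[name] = v
            scores[v] = evaluate(params)
        best = max(scores, key=scores.get)
        results[name] = (best, scores)
    return results


if __name__ == "__main__":
    # Toy objective, constructed for illustration only.
    def toy(p):
        return -((p["eps"] - 0.4) ** 2
                 + (p["lam"] - 0.6) ** 2
                 + (p["mu"] - 0.8) ** 2)

    baseline = {"eps": 0.5, "lam": 0.5, "mu": 1.0}
    grid = [0.2, 0.4, 0.6, 0.8, 1.0]
    sweep = one_at_a_time_sweep(toy, baseline,
                                {"eps": grid, "lam": grid, "mu": grid})
    print({name: best for name, (best, _) in sweep.items()})
```

Because only one parameter moves at a time, this procedure does not capture interactions between ε, λ and µ; a full grid search would, at a higher computational cost.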

Results and comparisons
To verify the generality of the proposed algorithm, different classifiers are used to evaluate its performance, including KNN, DT (decision tree) and SVM. In order to obtain stable results, the classification accuracy of each feature selection algorithm is measured by 10-fold cross validation. Tabs. 5 to 7 show the mean value and standard deviation of the classification accuracy of the classifiers SMO, IBK and J48 with the feature selection algorithms UPFS, ANFS, PSO(4-2), FSFOACD and FSLSFOA over the 10 datasets. Note that the best parameter settings for FSLSFOA from the parameter discussion above are used throughout this part. From the results shown in Tab. 5 and Fig. 3, with the KNN classifier, the proposed FSLSFOA algorithm performs best on most datasets compared with the other feature selection algorithms. Only on the Vehicle dataset is FSLSFOA, with a classification accuracy of 81.21%, second to PSO(4-2), with a classification accuracy of 84.31%. In addition, the standard deviation of our algorithm is the lowest on all datasets, which indicates that our algorithm is very stable. On the Glass, Hepatitis, Sonar and Arrhythmia datasets in particular, FSLSFOA shows a significant improvement in classification accuracy over the other feature selection algorithms.
From the results shown in Tab. 6 and Fig. 4, with the SVM classifier, the proposed FSLSFOA algorithm achieves the highest average classification accuracy across the datasets compared with the other feature selection algorithms. One exception is the Wine dataset, where the accuracy of FSFOACD is 99.17% and that of FSLSFOA is 99.09%. We infer that this is due to the comparatively high standard deviation of FSLSFOA and the high accuracy attainable on this dataset. For the rest of the datasets, our local search strategy and global seeding process play a great role in improving the feature selection.
As can be seen from Tab. 7 and Fig. 5, with the DT classifier, FSLSFOA also has the highest classification accuracy on most datasets compared with the other algorithms. This benefit comes from the local search strategy. In the local seeding and global seeding stages, the local search strategy guides the forest optimization algorithm to select more features with high class correlation and low redundancy, while eliminating features with low class correlation and high redundancy. It thus guarantees the quality of the feature subsets in the forest. The probabilistic model in our local search strategy allows more variance for each tree, which gives the limited number of trees a better chance of finding the optimal solution within a limited number of iterations.
In addition, several experiments have been conducted to compare the classification accuracy of the proposed FSLSFOA with the other feature selection methods as a function of the number of selected features. Fig. 6 and Tab. 8 show the results on the Sonar and Arrhythmia datasets. In each graph, the X-axis represents the size of the selected feature subset, while the Y-axis represents the accuracy. In Figs. 6(a), 6(b) and Tabs. 8(a), 8(b), the accuracy of FSLSFOA with the SVM and DT classifiers is clearly better than that of the other feature selection algorithms. Figs. 6(c), 6(d) and Tabs. 8(c), 8(d) show similar results. Tab. 8(c) shows that when the number of selected features in the Arrhythmia dataset is set to 70, the accuracy of FSLSFOA is 63.66%. In this case, the accuracies of UPFS, ANFS, PSO(4-2) and FSFOACD are 54.12%, 55.83%, 48.14% and 61.41%, respectively. As shown in Fig. 6(d), FSLSFOA outperforms the other feature selection algorithms in the range of 10 to 70 features, but is outperformed by PSO(4-2) when the number of features exceeds 100. This deficiency deserves further research.
Based on the average accuracies in Tabs. 5 to 7, the average classification accuracy is illustrated in Fig. 7.
Obviously, our proposed FSLSFOA achieves the best results in combination with three classifiers. Among the three classifiers, DT achieves the best performance in combination with different feature selection methods. However, FSLSFOA+KNN get an accuracy of 88.81%, which is the better than SVM and DT.
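The evaluation protocol used throughout this section (10-fold cross validation of a classifier restricted to a selected feature subset, reporting the mean and standard deviation of accuracy) can be illustrated with a minimal sketch. The sketch is not the experimental code behind Tabs. 5 to 7: the function names, the 1-nearest-neighbor classifier, the fold construction and the toy data are all assumptions made for illustration.

```python
import random
import statistics

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def nearest_neighbor_predict(train_X, train_y, x, features):
    """Predict the label of x with 1-NN, using only the selected features."""
    def sq_dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in features)
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]

def cross_validated_accuracy(X, y, features, k=10, seed=0):
    """Return (mean, standard deviation) of accuracy over k folds."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            nearest_neighbor_predict(train_X, train_y, X[i], features) == y[i]
            for i in fold
        )
        scores.append(correct / len(fold))
    return statistics.mean(scores), statistics.pstdev(scores)

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
rng = random.Random(1)
X = [[c + rng.uniform(-0.2, 0.2), rng.uniform(0, 1)]
     for c in (0, 1) for _ in range(50)]
y = [c for c in (0, 1) for _ in range(50)]

mean_acc, std_acc = cross_validated_accuracy(X, y, features=[0])
print(f"subset [0]: accuracy = {mean_acc:.2%} +/- {std_acc:.2%}")
```

Calling `cross_validated_accuracy` with different `features` lists gives one mean and standard deviation per candidate subset, which is the kind of pair reported per dataset and method in the tables.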
To sum up the unsatisfactory results, FSLSFOA does not perform very well on the Vehicle and Parkinsons datasets. Besides, its standard deviation of accuracy is worse than that of FSFOACD, which is caused by the randomness in the local search strategy. Eqs. (4)-(6) not only introduce this randomness, but also improve the accuracy. After weighing the pros and cons, this mechanism is preserved. In our future work, FOA-based feature selection methods will remain the main focus. More deep learning methods can be applied to further improve the performance of our work. The proposed method should be compared with more recent studies and methods on more datasets. To make the research results more convincing, it is also necessary to apply the model to application areas where it is urgently needed. The work of Abdel-Basset M. and his team will guide our future work and is worth studying to improve our model.

Conclusion
This article proposes a feature selection algorithm with a local search strategy based on the forest optimization algorithm, namely FSLSFOA. Our local search strategy guides the forest optimization algorithm to select features with higher class correlation and lower redundancy, and to remove features with lower class correlation and higher redundancy. While searching for the optimal solution, this strategy guarantees the quality of the feature subset in the forest. Next, the fitness function is improved so that it considers not only the classification accuracy but also the size of the feature subset, and thus yields an optimal solution with a smaller feature subset. To avoid falling into a local optimum, a new global seeding method is attempted, which picks trees from the bottom of the candidate set and gives the algorithm more diversity. Finally, 10 benchmark datasets with different numbers of attributes are selected to verify the effectiveness of the proposed algorithm. The experiments are evaluated in terms of classification accuracy and the size of the selected feature subset. Most of the results are superior to those of the previous feature selection algorithms selected as comparative methods.
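To make the idea of a fitness function that rewards both accuracy and a small feature subset concrete, a minimal sketch follows. It does not reproduce the fitness function of FSLSFOA: the weighted-sum form, the function name `fitness` and the trade-off weight `alpha` are assumptions chosen for illustration only.

```python
def fitness(accuracy, n_selected, n_total, alpha=0.99):
    """Hypothetical fitness: weighted sum of accuracy and feature reduction.

    alpha is an assumed trade-off weight, not a value taken from the paper.
    """
    if not 0 < n_selected <= n_total:
        raise ValueError("n_selected must be in 1..n_total")
    # Fraction of the original features that were discarded (0 = none removed).
    reduction = 1.0 - n_selected / n_total
    return alpha * accuracy + (1.0 - alpha) * reduction

# Two candidate subsets with equal accuracy: the smaller one scores higher.
small = fitness(accuracy=0.90, n_selected=10, n_total=60)
large = fitness(accuracy=0.90, n_selected=40, n_total=60)
print(small > large)
```

With a weight close to 1, accuracy dominates the score and the subset size mainly acts as a tie-breaker between subsets of similar accuracy.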