An efficient binary Gradient-based optimizer for feature selection

Feature selection (FS) is a classic and challenging optimization task in machine learning and data mining. The gradient-based optimizer (GBO) is a recently developed population-based metaheuristic inspired by gradient-based Newton's method; it uses two main operators, the gradient search rule (GSR) and the local escaping operator (LEO), together with a set of vectors to explore the search space when solving continuous problems. This article presents a binary GBO (BGBO) algorithm for feature selection problems. Eight independent BGBO variants are proposed: eight transfer functions, divided into the two families of S-shaped and V-shaped, are evaluated to map the continuous search space to a discrete one. To verify the performance of the proposed binary GBO algorithm, 18 well-known UCI datasets and 10 high-dimensional datasets are tested and compared with other advanced FS methods. The experimental results show that one of the proposed binary GBO variants has the best overall performance and outperforms other well-known metaheuristic algorithms on the chosen performance measures.


Introduction
Mathematical Biosciences and Engineering, Volume 18, Issue 4, 3813-3854.

With the rapid development of information technology, big data has become a persistently hot topic, and the much-anticipated rise of artificial intelligence relies on it [1]. One reason big data has become such a buzzing topic is that, with the increasing computing power of computers, the datasets to be processed have grown larger and larger and contain more and more attributes, complicating machine learning tasks in data mining. If a dataset contains n features, then 2^n candidate feature subsets need to be generated and evaluated [2,3]. If n is small, the total number of feature subsets is small and the optimal subset can usually be obtained by exhaustive search. However, once n grows large enough, enumerating all feature subsets becomes computationally prohibitive. How to handle these large datasets thus becomes paramount. Because such datasets often contain unimportant, redundant, and noisy features that reduce the efficiency of the classifier, choosing the right features is the key to solving this problem [4,5]. Feature selection (FS) is a pre-processing step in the data mining process that aims to reduce the dimensionality of the data by eliminating irrelevant, redundant, or noisy features, thereby improving the efficiency of machine learning algorithms, for example their classification accuracy [6]. FS, an important technique in machine learning and data mining, has been extensively researched over the past 20 years. It has been widely applied in areas including text classification [7,8], face recognition [9,10], cancer classification [11], genetic classification [12,13], finance [14], recommendation systems [15], customer relationship management [16], cancer diagnosis [17], image classification [18], and medical technology [19]. FS methods are usually classified as filter [20,21] or wrapper [22,23] methods, depending on whether they are independent of the subsequent learning algorithm. A filter method is independent of the subsequent learning algorithm and generally uses the statistical properties of all training data to evaluate features directly; this has the advantage of being fast, but the evaluation can deviate significantly from the performance of the subsequent learning algorithm.
Wrapper uses the training accuracy of subsequent learning algorithms to evaluate a subset of features, which has the advantage of being less biased but is computationally intensive and relatively difficult for large data sets. It is based on the fact that the selected subset is ultimately used to construct the classification model so that if the features that achieve high classification performance are used directly in the construction of the classification model, a classification model with high classification performance is obtained. This method is slower than the Filter method, but the size of the optimized feature subset is much smaller, which is good for identifying key features; it is also more accurate, but less generalized and has higher time complexity.
There is another FS method which is the embedded FS method [24,25]. In the filtered and wrapped FS methods, the FS process is clearly separated from the learner training process. In contrast, embedded FS automatically performs FS during the learner training process, which is an integration of the FS process and the learner training process, and both are done in the same optimization process, i.e., FS is performed automatically during the learner training process. Embedded FS is most commonly used for L1 regularization and L2 regularization. Generally, the larger the regularization term, the simpler the model and the smaller the coefficients, when the regularization term increases to a certain level, all the feature coefficients will tend to 0. In this process, some of the feature coefficients will become 0 first, and the feature selection process is realized. Logistic regression, linear regression, and decision tree can be used as base learners for regularized feature selection, and only algorithms that can get feature coefficients or can get feature importance can be used as base learners for embedded selection.
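To make this embedded mechanism concrete, the sketch below (a minimal, self-contained illustration, not part of the original study; the function names are ours) fits a lasso-style linear regression with proximal gradient descent (ISTA). As the L1 penalty lam grows, the coefficients of irrelevant features are driven exactly to zero, which is precisely the implicit feature selection described above.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm: shrinks toward 0, clips at 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

On synthetic data where only features 0 and 2 matter, a moderate lam zeroes out the remaining coefficients exactly while keeping the informative ones (slightly shrunk), so the surviving nonzero coefficients identify the selected features.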
The FS process can be divided into supervised feature selection [26] and unsupervised feature selection [27,28], depending on whether the original data sample contains information about the pattern category or not. Supervised feature selection is the process of selecting a feature set using the relationships between features and between features and categories, given a pattern category.
proposed two binary variants of the WOA algorithm to search for the optimal feature subset for classification purposes. MM Mafarja et al. [62] used two hybridization models to design different feature selection techniques based on the whale optimization algorithm (WOA). R. K Agrawal et al. [63] proposed the quantum whale optimization algorithm (QWOA) for feature selection, a combination of the quantum concept and the whale optimization algorithm (WOA). The approach enhances the versatility and convergence of the classical WOA for feature selection and extends the prospects of nature-inspired feature selection methods based on high-performance but low-complexity wrappers. The third category of metaheuristic algorithms, physics-based methods, is inspired by the physical laws of nature and simulates those laws during the optimization process to search for the optimum. Common algorithms include Simulated Annealing (SA) [64], Gravitational Search Algorithm (GSA) [38], Lightning Search Algorithm (LSA) [65], Multi-verse Optimizer (MVO) [66], Electromagnetic Field Optimization (EFO) [67], Chemical Reaction Optimization (CRO) [68], and Henry Gas Solubility Optimization (HGSO) [69]. Meiri et al. [70] used a simulated annealing (SA) method for specifying large-scale linear regression models. Lin et al. [71] proposed a simulated annealing (SA) method for parameter determination and feature selection in SVMs, called SA-SVM. Rashedi et al. [72] introduced a binary version of the GSA algorithm. Sushama et al. [73] used a wrapper-based approach for disease prediction analysis of medical data using GSA and k-NN. Rao et al. [74] proposed a feature selection technique based on a hybrid of binary chemical reaction optimization and binary particle swarm optimization (HBCRO-BPSO) to optimize the number of selected features and improve the classification accuracy.
This method optimizes the number of features and improves the classification accuracy and computational efficiency of the ML algorithm. However, it does not attempt to handle ultra-high-dimensional FS datasets with a large number of samples. Neggaz et al. [75] proposed a novel dimensionality reduction method that uses the Henry Gas Solubility Optimization (HGSO) algorithm to select important features and improve classification accuracy. The method reaches 100% accuracy on classification problems with more than 11,000 features and is valid on both low- and high-dimensional datasets. However, HGSO also has certain limitations: since it requires multiple control parameters, its applicability may be compromised compared with other well-known methods such as GWO [36] and HHO [53].
The last type of metaheuristic algorithms, human-based approaches, is inspired by human interactions or human behavior in society. Examples include Teaching-learning-based optimization (TLBO) [76], the Imperialist Competitive Algorithm (ICA) [77], the Volleyball Premier League Algorithm (VPL) [78], and the Cultural Evolution Algorithm (CEA) [79]. Among these, Mohan Allam et al. [80] proposed a new wrapper-based feature selection method called the binary teaching-learning-based optimization (FS-BTLBO) algorithm. Mousavirad et al. [81] proposed an improved imperialist competition algorithm and applied it to the feature selection process. Keramati et al. [82] proposed a new FS method based on cultural evolution. These methods provide new solutions to the feature selection problem, but problems of insufficient accuracy and an excessive number of selected features remain.
From the work above, we can see that a great number of metaheuristics have been successfully applied to the FS problem. This body of work on different types of metaheuristic algorithms for feature selection shows that metaheuristics have promising performance in solving FS problems, especially in terms of accuracy; in addition, they can produce better results using a smaller number of features. However, some problems remain: the number of tested datasets is often small and not comprehensive enough to cover low-dimensional, high-dimensional, or different types of datasets; most metaheuristics suffer from slow convergence; and the reduction in the number of selected features on high-dimensional datasets is often insignificant. According to the No Free Lunch (NFL) theorem [83], no single metaheuristic algorithm can solve all problems: a particular algorithm may provide very promising results for one set of problems, yet be inefficient for a different set. The gradient-based optimizer is a novel gradient-based metaheuristic algorithm proposed by Iman Ahmadianfar [41] in 2020. GBO has been proposed only recently and has not been systematically applied to feature selection problems. Its gradient-based search mechanisms (GSR and LEO) make an appropriate trade-off between exploration and exploitation feasible. Therefore, in this paper, we propose a binary GBO for solving FS problems, mapping the continuous GBO into discrete form using S-shaped and V-shaped transfer functions, and apply it to high-dimensional datasets. A brief description of the GBO algorithm follows.

Gradient-based optimizer (GBO)
This metaheuristic algorithm was first proposed by Iman Ahmadianfar et al. in 2020 to solve optimization problems in engineering applications. Exploration and exploitation are the two main phases of a metaheuristic algorithm; they aim to improve its convergence speed and/or its ability to avoid local optima when searching for a target position. GBO manages to create a proper trade-off between exploration and exploitation using two main operators: the gradient search rule (GSR) and the local escaping operator (LEO). A brief introduction to this algorithm is given below.

Gradient search rule (GSR)
First, GBO introduces the GSR operator, which injects stochastic behavior into the optimization process to facilitate exploration and the avoidance of local optima. A direction of movement (DM) term is added to the GSR to perform a suitable local search and accelerate the convergence of the GBO algorithm. Based on the GSR and DM, Eq (12) is used to update the position of the current vector x_n^m.

where f1 is a uniform random number in the range [-1, 1], f2 is a random number drawn from a normal distribution with mean 0 and standard deviation 1, pr is the probability of applying the LEO, and u1, u2, and u3 are three random numbers defined in terms of rand, a random number in the range [0, 1], and mu1, a number in the range [0, 1]. The above equations can be simplified using L1, a binary parameter with a value of 0 or 1: if mu1 is less than 0.5, the value of L1 is 1; otherwise, it is 0. To determine the solution x_k^m in Eq (12), the following scheme is suggested, where rho is a random number in the range [0, 1]. Eq (22) can be simplified using L2, a binary parameter with a value of 0 or 1: if mu2 is less than 0.5, the value of L2 is 1; otherwise, it is 0. The pseudo code of the GBO algorithm is shown in Algorithm 1.

Algorithm 1. Pseudo code of the GBO algorithm (initialization, gradient search rule update, and the local escaping operator applied when rand < pr).
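The overall shape of that loop can be pictured with the following simplified sketch. It is illustrative only: the coefficients and moves below are placeholders for the published GSR and LEO equations, not a reproduction of them, and the function name is ours.

```python
import numpy as np

def gbo_sketch(f, dim, n_pop=30, n_iter=200, pr=0.5, seed=0):
    """Illustrative GBO-style loop: a GSR-like step scaled by the
    best/worst spread plus an LEO-like random jump taken with
    probability pr. NOT the exact published equations."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5.0, 5.0, (n_pop, dim))
    fit = np.array([f(x) for x in X])
    for _ in range(n_iter):
        best, worst = X[fit.argmin()], X[fit.argmax()]
        for i in range(n_pop):
            rho = rng.random()
            # GSR-like move: step size tied to the population spread
            step = rng.normal() * rho * (best - worst) / 2.0
            cand = X[i] - step + rng.random() * (best - X[i])
            if rng.random() < pr:
                # LEO-like move: jump using a randomly chosen member
                j = rng.integers(n_pop)
                cand = cand + rng.normal(size=dim) * (X[j] - X[i])
            fc = f(cand)
            if fc < fit[i]:  # greedy selection keeps the better position
                X[i], fit[i] = cand, fc
    return X[fit.argmin()], fit.min()
```

Run on a convex test function such as the sphere, the loop steadily improves the best solution while the LEO-like jumps keep some diversity in the population.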

Motivation
The GBO algorithm is a novel population-based metaheuristic search method for solving continuous problems [84]. It is derived from a gradient-based search method that uses Newton's method to explore the better regions of the search space. Two operators, the gradient search rule (GSR) and the local escaping operator (LEO), were introduced in GBO and formulated mathematically to facilitate the exploration and exploitation of the search. Newton's method serves as the search engine in the GSR to enhance the exploration and exploitation process, while the LEO helps GBO deal with complex problems. GBO shows superior performance in solving optimization problems compared to other optimization methods in the literature. Beyond these advantages, GBO has not yet been used to solve FS problems. Searching for the best subset of features in FS is challenging, especially in wrapper-based methods, because the selected subset must be evaluated by a learning algorithm (e.g., a classifier) at each optimization step. A suitable optimization method is therefore needed to reduce the number of evaluations, and in this paper we propose such a method for solving the FS problem. This is the motivation for using GBO as the search method in a wrapper-based FS process. Because the FS search space can be represented by the binary values {0, 1}, and binary arithmetic is much simpler than continuous arithmetic, we propose a binary version of GBO to solve the FS problem.

Our proposed binary GBO (BGBO)
The proposed binary GBO has two key parts: how to map continuous values to [0, 1], and how to update the position of the population. The transfer function [44] is the simplest mechanism in a binary GBO: it maps continuous values to probabilities in [0, 1], which are then rounded to 0 or 1 stochastically. It preserves the structure of GBO and the other operations that move the population's positions in the binary space. Transfer functions are divided into two main families according to their shape: S-shaped and V-shaped. Figure 1 shows these two families of transfer functions.
In the S-shaped transfer function, the vector's values are converted into probability values within the [0, 1] range, as shown in Eq (25) [85], where X_(t+1)^d represents the position of the t-th vector in the d-th dimension at the next iteration.
Next, the hyperbolic tangent (V-shaped) function [86] is another transfer function; its mathematical formulation is given below, where X_i^d represents the position in the d-th dimension of the i-th individual in the GBO algorithm.
Based on the probability values obtained from Eq (27), the current position of each vector is updated in the next iteration; using Eq (28), the search space is transformed into a binary search space. Table 1 shows the mathematical expressions of all transfer functions, covering both the S-shaped and V-shaped families: four S-shaped transfer functions (S1, S2, S3, and S4) and four V-shaped transfer functions (V1, V2, V3, and V4). With a suitable probability, these functions give the individuals in the population their best exploration and exploitation ability. At the beginning of the iterations, the exploration ratio of these functions is high relative to the exploitation ratio, and the higher the function value, the higher the probability that an individual changes its current position. These transfer functions are designed to facilitate the exploration and exploitation of the binary search space by the individuals of the population. Therefore, both families of transfer functions (S-shaped and V-shaped) are used to map continuous GBO solutions into the binary search space of "0" and "1"; the resulting methods are called BGBO_S and BGBO_V. Algorithm 2 shows the pseudo-code for this method, and the flow chart of the BGBO algorithm is shown in Figure 1.
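The two mapping rules can be made concrete with a small sketch. It uses the standard sigmoid as a representative S-shaped function and |tanh| as a representative V-shaped function (the remaining variants in Table 1 differ only in their constants); the binarization rules follow the usual convention that an S-shaped probability sets the bit directly, while a V-shaped probability flips the current bit. Function names are ours.

```python
import numpy as np

def s_shaped(x):
    """S1 transfer function: the standard sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def v_shaped(x):
    """A V-shaped transfer function: |tanh(x)|."""
    return np.abs(np.tanh(x))

def binarize_s(x, rng):
    """S-shaped rule: set each bit to 1 with probability T(x)."""
    return (rng.random(np.shape(x)) < s_shaped(x)).astype(int)

def binarize_v(x, x_bin, rng):
    """V-shaped rule: flip the current bit with probability T(x)."""
    flip = rng.random(np.shape(x)) < v_shaped(x)
    return np.where(flip, 1 - x_bin, x_bin)
```

Note the behavioral difference: the S-shaped rule forgets the current bit entirely, while the V-shaped rule leaves the bit unchanged when the continuous value is near zero, which is often credited for the stronger exploitation of V-shaped variants.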

Binary GBO for FS problems
Data mining tasks usually involve a large number of noisy and irrelevant features. These irrelevant features need to be handled, otherwise they waste data-processing resources and increase the error probability of the results, thereby increasing the difficulty of the learning task. For example, in a classification task, a large number of irrelevant features may lengthen the classifier's learning time and lower its classification accuracy. As the dimensionality of the data increases and datasets carry more and more information, handling large data becomes very difficult and the computation time cost rises.

Therefore, the features of the dataset need to be reduced effectively while the key features are kept. Datasets are usually represented by a matrix whose rows represent instances (or samples) and whose columns represent attributes (or features). Feature selection is a common, effective, and well-proven technique in data mining and machine learning. Most researchers focus on methods that achieve high accuracy with few features, and feature selection is one such method. In this section, the wrapper method is used to implement feature selection. The binary versions of the algorithm corresponding to the eight transfer functions above (BGBO_S1, BGBO_S2, BGBO_S3, BGBO_S4, BGBO_V1, BGBO_V2, BGBO_V3, and BGBO_V4) are applied to the feature selection problem.
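In a wrapper setup, each binary vector (a 0/1 feature mask) is scored by running a classifier on exactly the features it selects. Below is a minimal self-contained sketch of such an evaluator using leave-one-out k-NN; the function name and this particular validation scheme are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def knn_loo_error(X, y, mask, k=5):
    """Leave-one-out k-NN error rate on the features selected by a 0/1 mask."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 1.0  # empty subset: worst possible score
    Xs = X[:, cols]
    # pairwise squared Euclidean distances between all instances
    d2 = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point may not be its own neighbor
    errors = 0
    for i in range(len(y)):
        nn = np.argsort(d2[i])[:k]               # k nearest neighbors
        if np.bincount(y[nn]).argmax() != y[i]:  # majority vote
            errors += 1
    return errors / len(y)
```

A search method such as BGBO would call an evaluator like this once per candidate mask per iteration, which is exactly why wrapper methods are computationally expensive and why reducing the number of evaluations matters.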

Fitness function
As discussed in the previous subsections, choosing an effective search strategy is significant for FS methods. Since the proposed method is a wrapper-based approach, a learning algorithm (e.g., a classifier) must be involved in the evaluation process. In this work, the well-known k-NN classifier [87] is used as the evaluator; it classifies unlabeled instances by measuring the distance between a given unlabeled instance and its k nearest labeled instances [88]. It is based on the principle that if the majority of the k most similar samples of a given sample in feature space belong to a certain class, then that sample also belongs to that class. The classification accuracy of the selected features is incorporated into the proposed fitness function: the accuracy is better when the features in the subset are well correlated with the class, and when the number of features in the subset is smaller. High classification accuracy is one goal of the FS method; another important goal is to minimize the number of selected features, since the fewer the features in a solution, the better the solution. The fitness function is formulated as

Fitness = alpha * R(D) + beta * (M / N),

where R(D) represents the classification error rate of the classifier on the currently selected feature subset, M represents the number of currently selected features, N is the total number of features, and alpha and beta are two weight coefficients reflecting the importance of the classification rate and of the subset length, with alpha a number in [0, 1] and beta = 1 - alpha.
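The fitness above reduces to a one-liner. The sketch below uses the paper's weighting as defaults (alpha = 0.99, beta = 0.01, as set later in the experiments); the function name is ours.

```python
def fs_fitness(error_rate, n_selected, n_total, alpha=0.99, beta=0.01):
    """Fitness = alpha * R(D) + beta * (M / N); lower is better."""
    return alpha * error_rate + beta * (n_selected / n_total)
```

At equal error rates, the subset with fewer features wins, and with alpha = 0.99 the error term dominates, so the feature count acts mainly as a tie-breaker.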

Computational complexity
To better understand the implementation of the BGBO feature selection algorithm proposed in this paper, this subsection analyzes its computational complexity. The time complexity is analyzed first.
1) Population initialization O (n*d), where n is the population size and d is the dimension size (i.e., the number of features size).
2) The k-NN classifier training takes O (n*m), where n is still the population size and m is the instance size.
The space complexity is then analyzed as follows.
1) The space required for population initialization is O (n*d).
2) The space required for the k-NN classifier is O (m*s), where s is the number of features that have been selected.
From the above analysis, n and s can be ignored, so the space complexity is O (m*d).
The above analysis covers the time and space complexity of the proposed BGBO algorithm when applied to feature selection; the two complexities influence each other. Pursuing a better time complexity may worsen the space complexity, i.e., occupy more storage space; conversely, pursuing a better space complexity may worsen the time complexity, i.e., require a longer running time. Time and space complexity thus trade off against each other, and we need to strike a balance between them.

Experimental results and discussion
Detailed descriptions of the 18 benchmark datasets are shown in Table 2. These datasets, extracted from the UCI repository [89], have distinct characteristics and were used to test the proposed method. All experiments were implemented in Matlab R2017a and executed on an Intel Core i3 machine with a CPU frequency of 3.70 GHz and 8 GB of RAM. The maximum number of iterations was 100.
As a preliminary study, we examined the effect of different population sizes on the classification accuracy of the basic BGBO method, evaluating BGBO with population sizes of 10, 20, 30, and 50. The PenglungEW dataset was used for these experiments, with the classifier's average accuracy and running time as evaluation criteria. Because of the large dimensionality of this dataset, its sensitivity is high, which allows the algorithm to respond noticeably to small changes in parameters. Table 3 shows the experimental results on the PenglungEW dataset for the different population sizes and maximum numbers of iterations. The effect of varying the population size (10, 20, 30, and 50) and the number of iterations (100 and 150) on the classification accuracy for PenglungEW can be seen in Table 3: the range of variation in classification accuracy is small, increasing the population size does not always improve the results, and, similarly, increasing the number of iterations has little effect. Therefore, as a trade-off between classification accuracy and running-time overhead, we set the population size to 10 and the number of iterations to 100 in all subsequent experiments.
As mentioned in the previous subsection, there are two weighting coefficients, alpha and beta, in the fitness function; they reflect the importance of the classification error rate and of the number of selected features, respectively. To further investigate the effect of different weight coefficients on the experimental results, different combinations of alpha and beta were tested; Table 4 shows the results. In general, increasing alpha and decreasing beta improves the accuracy, and when alpha is 0.7 and beta is 0.3, the accuracy already reaches 1. Taking the fitness value and the number of selected features into account as additional criteria, the results show that alpha = 0.99 and beta = 0.01 give the best fitness value and number of selected features. To allow a fair comparison with other methods, alpha and beta are therefore set to 0.99 and 0.01 in the subsequent experiments. Additionally, k in the Euclidean-distance-based k-NN classifier used to evaluate the feature subsets was set to 5 [90]. The parameters used in BGBO are shown in Table 5. Each variant of BGBO was tested over 30 independent runs. The measurement criteria used in the comparison of BGBO are the fitness value, the classification accuracy, and the selected feature size. 1) Fitness values are obtained from the fitness function using the selected features on the benchmark datasets (the mean and standard deviation of the fitness values are calculated). 2) Classification accuracy is obtained from the classifier using the selected features on the benchmark datasets. 3) Selected feature size is the mean number of selected features.

Comparison between the BGBO based approaches
The average classification accuracy results are shown in Table 7. Among the variants, BGBO_V3 achieved the best results: it reached the highest accuracy on nine datasets and, in addition, had the single highest accuracy value of all methods on five datasets, so it is placed first in the overall ranking. BGBO_V2 achieved the highest accuracy on six datasets and ranks second overall. BGBO_V4 is third: although it also achieves the highest accuracy on six datasets and has the single highest accuracy of all methods on one dataset, it ranks lower overall than BGBO_V2. Similarly, BGBO_S3 ranks fourth and BGBO_V1 fifth, while BGBO_S1, BGBO_S4, and BGBO_S2 rank sixth, seventh, and eighth, respectively. In addition, from the standard deviations we can see that the standard deviation of BGBO_V3 is 0 on all nine datasets, which is sufficient to show that BGBO_V3 is a robust method.
In Table 8, the proposed BGBO methods are compared in terms of the average number of selected features. BGBO_V3 has the smallest average number of features on 14 datasets, while BGBO_V1 and BGBO_V4 have the smallest values on 7 datasets. BGBO_V3 and BGBO_S3 had the smallest values on 6 datasets, while BGBO_S1, BGBO_S2, and BGBO_S4 had the smallest number of features on 5, 4, and 5 datasets, respectively. As for the standard deviation values, BGBO_V3 proved to be a robust method, obtaining small deviation values on 7 of the datasets. The superiority of BGBO_V3 is also confirmed by the average fitness results shown in Table 6. In this section, BGBO_V3, which has the best performance among the eight methods introduced above, is taken as the best of the methods proposed in this work. The proposed BGBO_V3 is compared with other existing state-of-the-art binary metaheuristics: BPSO, BWOA, BDA, BBA, BGOA, BGWO, and BHHO. Table 9 shows the parameters used for all algorithms in the following experiments. Again, performance is compared and analyzed via the average fitness value, the average classification accuracy, and the average number of selected features. Table 10 shows the experimental results for the average fitness values of BGBO_V3 and the other metaheuristic-based methods. From Table 10, it can be seen that BGBO_V3 obtains the best results on 89% of the datasets (16 out of 18), indicating that it has the best performance. BDA and BBA achieved first place on the KrvskpEW and SonarEW datasets, respectively, while BPSO, BWOA, BGOA, BGWO, and BHHO are not ranked first on any of the 18 datasets. This indicates that the performance of the proposed method is significant. In addition, Table 10 shows that BGBO_V3 performs markedly better than the other methods on some datasets, such as Lymphography and IonosphereEW. Comparing the standard deviations of the various methods, BGBO_V3 is also more stable.
In terms of the final average ranking, BGBO_V3 ranked first, followed by BHHO, BDA, BWOA, BGOA, BBA, BGWO, and BPSO.

Comparison with other metaheuristic-based approaches
In Table 11, BGBO_V3 is compared with the other methods based on average classification accuracy. Among the 18 datasets, BGBO_V3 ranked first on 13, compared to 1, 1, 4, 6, 2, 3, and 7 for BPSO, BWOA, BDA, BBA, BGOA, BGWO, and BHHO, respectively. The comparison using the k-NN classifier also shows that BGBO_V3 has higher classification accuracy, and in terms of the number of first-place rankings, BGBO_V3 performed much better than the other methods. In addition, BGBO_V3 achieves an average classification accuracy of 100% on the Breast_cancer_wisconsin, Zoo, Parliment1984, Iris, Glass, Wine, Segmentation, and Vote datasets. Notably, its classification accuracy reaches 99% on both of the high-dimensional datasets PenglungEW and Coil, indicating that the proposed method handles high-dimensional data well, which is important in practical applications. The standard deviation results further show that BGBO_V3 is more stable than the other methods on some datasets. In the final ranking, BGBO_V3 is again first, followed by BHHO, BBA, BDA, BGOA, BGWO, BPSO, and BWOA.
Inspecting the results for the number of selected features in Table 12, we observe that BGBO_V3 outperforms the other algorithms, followed by BPSO, BBA, BDA, BWOA, BHHO, BGWO, and BGOA, with very competitive results. For most of the datasets, the differences between the average numbers of features selected by the algorithms are very small. However, it is worth noting that for PenglungEW, a high-dimensional dataset with 325 features, BGBO_V3 achieves the best classification accuracy of 99.78% with the smallest number of features (only 12.5667 features on average). Similarly, for Coil, the highest-dimensional dataset with 1024 features, BGBO_V3 achieves the best classification accuracy of 99.46% with the smallest number of features (only 53.4667 features on average). This shows the advantage of the method when dealing with high-dimensional datasets. In terms of the fitness function for feature selection, the fitness is optimal only when the accuracy and the number of selected features are good simultaneously. From the analysis of the experimental results in Tables 10-12, the fitness value of the proposed BGBO_V3 algorithm is superior to those of the other algorithms on most datasets. Moreover, BGBO_V3 obtains a better feature subset while maintaining high accuracy; even where other algorithms reach comparable accuracy, the feature subsets they obtain are not satisfactory. Figures 2-10 show the average convergence curves of the various algorithms on different datasets. It is clear from the convergence plots that BGBO_V3 outperforms the other methods in terms of average fitness value from the beginning of the iterations. It is worth mentioning that BGBO_V3 converges fastest on most of the datasets, achieving a better balance between exploration and exploitation.
A nonparametric Wilcoxon rank-sum test was performed at the 5% significance level to verify whether there is a statistical difference between the fitness values of BGBO_V3 and those of each comparison method. Among these comparisons, p-values greater than 0.05 occur only on the SonarEW, Semeion, and KrvskpEW datasets (against BGOA and BHHO, among the comparison methods); in all other cases the differences are statistically significant.
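For reference, the rank-sum test used above can be sketched in a few lines of pure Python using the normal approximation. This is an illustrative implementation, not the paper's code; `ranksum_p` and `_ranks` are hypothetical names, and for exact results on small samples with many ties a tie-corrected variance would be needed.

```python
import math

def _ranks(values):
    """1-based ranks, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def ranksum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = _ranks(list(x) + list(y))
    w = sum(ranks[:n1])                        # rank sum of sample x
    mean = n1 * (n1 + n2 + 1) / 2              # expected rank sum
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))    # two-sided tail probability
```

A p-value below 0.05 rejects the hypothesis that the two samples of fitness values come from the same distribution.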

Comparison on high-dimensional datasets
From the previous subsection, we know that BGBO_V3 achieves excellent results on the high-dimensional datasets PenglungEW and Coil. Therefore, to further test the performance of the proposed algorithm, this section conducts experiments on 10 high-dimensional datasets. These datasets were downloaded from the UCI repository and the repository of Arizona State University (ASU) [91]. The information (number of instances, features, and classes) for the high-dimensional datasets is shown in Table 14. Tables 15-17 show the experimental results for the average fitness value, average classification accuracy, and average number of selected features obtained by the various methods on these datasets.

We first inspect the average fitness values of all compared algorithms, shown in Table 15. The proposed BGBO_V3 achieves the best results on 80% of the datasets (8 out of 10); among the other methods, only BWOA and BBA achieve the best results, on ORL and Yale, respectively. In addition, looking at the standard deviation, BGBO_V3 achieves 0 on four datasets (arcene, PCMAC, RELATHE, and Leukemia), indicating that the method has excellent stability on high-dimensional datasets. In the final comparative ranking, BGBO_V3 is first, followed by BDA, BGWO, BBA, BPSO, BHHO, BWOA, and BGOA, in that order. Table 16 shows the average classification accuracy of all methods on the high-dimensional datasets. BGBO_V3 ranks first in average classification accuracy on 8 of the 10 high-dimensional datasets (80%), while the other methods rank first on far fewer datasets. In particular, the average classification accuracy of BGBO_V3 reaches 100% on three datasets: arcene, RELATHE, and Leukemia.
Furthermore, from the standard deviation it can be seen that BGBO_V3 achieves 0 on four datasets (arcene, PCMAC, RELATHE, and Leukemia), which indicates, in terms of both average fitness value and average accuracy, that BGBO_V3 has excellent performance as well as good stability on high-dimensional datasets. In the final ranking of average classification accuracy, BGBO_V3 is still first, while the k-NN method is ranked last among all methods. The remaining rankings are BGWO, BDA, BHHO, BBA, BPSO, BWOA, and BGOA, in that order.
Finally, Table 17 shows the average number of selected features for all methods. As the table shows, BGBO_V3 obtains the smallest number of features on all datasets. Numerically, the feature counts obtained by BGBO_V3 are far smaller than those of the other methods, in some cases by a factor of hundreds, indicating that BGBO_V3 finds much smaller feature subsets. The standard deviation of BGBO_V3 is likewise within an acceptable range. In the final ranking, BGBO_V3 is first, followed by BHHO, BDA, BBA, BPSO, BWOA, BGOA, and BGWO. Figures 11-15 show the average convergence curves of the algorithms on the high-dimensional datasets. From the convergence plots, it is clear that BGBO_V3 converges quickly from the beginning of the iterations up to about the 50th generation, with an average fitness value better than the other methods. Its convergence curve remains clearly distinguishable from generations 50 to 100, showing the advantage of the method over the others. This is due to the mechanism of the GBO algorithm itself, which balances exploration and exploitation.

Comparisons with metaheuristics in the literature
In this section, the average classification accuracy of BGBO_V3 is compared with other methods reported in the literature. These literature results were obtained under the same experimental settings and therefore serve as a useful reference. Since the datasets used in those studies only partially overlap with ours, 10 datasets are selected for comparison and analysis, covering both important low-dimensional and high-dimensional datasets. Table 18 shows the average classification accuracy of BGBO_V3 and the various methods from the literature. From Table 18, the proposed method ranks first on seven datasets, including both low-dimensional and high-dimensional ones, showing that BGBO_V3 achieves excellent results in both regimes. RBDA ranks first on four datasets, and BSSA_S3_CP on one. In the final ranking, BGBO_V3 is first, followed by RBDA, BSSA_S3_CP, BGOA-M, WOA-CM, BGSA, GA, and bGWO1, in that order. Thus, the proposed BGBO_V3 achieves better performance than the other methods in solving the FS problem.

In this paper, eight variants of BGBO are proposed using transfer functions, corresponding to eight S-shaped and V-shaped transfer functions that map the continuous search space to a discrete search space, and these variants are applied to the feature selection problem. The experimental results show that very promising results are achieved compared with currently popular wrapper-based FS methods. The primary reason is that the algorithm is very reliable and efficient, owing to the excellent performance of GBO itself, making it a strong choice for wrapper-based FS. Moreover, on challenging high-dimensional datasets, the results clearly show that the proposed BGBO solves high-dimensional problems better than the other methods.
Thus, the algorithm has excellent performance and outstanding stability on challenging problems such as high-dimensional datasets, which shows its potential for solving high-dimensional problems. Among the eight variants proposed in this paper, BGBO_V3 performs best. The main reason is that different transfer functions have curves with different slopes and hence different probabilities of changing the positions of population individuals. A reasonable probability balances the exploration and exploitation abilities of the population; a larger probability of changing an individual's position is not necessarily better. In addition, an excellent exploration mechanism in the algorithm itself is essential. In the Newton's-method-inspired BGBO, the two well-designed operators of GBO, the gradient search rule (GSR) and the local escape operator (LEO), each have their own search mechanism, and together with simple control parameters they balance exploration and exploitation well, driving individuals toward the optimum. The transfer function also deserves credit for its simple and easy-to-understand mechanism, so the combination of the algorithm and the transfer function has a clear overall advantage.
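The binarization step discussed above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the names `s_shaped`, `v_shaped`, and `binarize` are hypothetical, and the V-shaped function |x / sqrt(1 + x^2)| is the one commonly labeled V3 in the FS literature, which we assume corresponds to the paper's V3 variant. The usual update rules are shown: an S-shaped function sets a bit by probability, while a V-shaped function flips the previous bit with probability given by the transfer value.

```python
import math
import random

def s_shaped(x):
    # S1: the standard sigmoid, a common S-shaped transfer function
    return 1.0 / (1.0 + math.exp(-x))

def v_shaped(x):
    # |x / sqrt(1 + x^2)|, commonly labeled V3 in the FS literature
    return abs(x / math.sqrt(1.0 + x * x))

def binarize(position, prev_bits, family="v", rng=random.random):
    """Map a continuous GBO position vector to a bit vector.

    S-shaped: bit = 1 with probability S(x).
    V-shaped: flip the previous bit with probability V(x),
    so small |x| keeps the bit and large |x| tends to flip it.
    """
    bits = []
    for x, b in zip(position, prev_bits):
        if family == "s":
            bits.append(1 if rng() < s_shaped(x) else 0)
        else:
            bits.append(1 - b if rng() < v_shaped(x) else b)
    return bits
```

The V-shaped rule preserves an individual's current selection when its continuous position is near zero, which is one plausible explanation for the better exploration/exploitation balance observed for BGBO_V3.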

Conclusions and future works
In this paper, a binary version of the GBO algorithm is proposed to solve the FS problem, and eight variants of the binary gradient-based optimizer are constructed using transfer functions (S-shaped and V-shaped). The transfer functions map the continuous search space to the discrete search space. To benchmark the proposed algorithm, 18 standard UCI benchmark datasets are used. The experimental results show that BGBO_V3 performs best among the proposed variants. BGBO_V3 is then compared with the most popular and better-performing methods in the literature; the analysis demonstrates that BGBO_V3 is highly competitive among existing methods for solving FS problems, especially on high-dimensional datasets. Besides, the proposed algorithm has a high convergence speed, enabling it to find a small feature subset faster. Taken together, these results show that the proposed binary version of the GBO algorithm has advantages in solving FS and is worth considering when facing high-dimensional datasets.
In future work, the method can be combined with more practical discrete applications, such as the 0-1 knapsack problem, the traveling salesman problem (TSP), and task scheduling. Using other classifiers as evaluators is also a promising research direction, for example extreme learning machines, neural networks, and support vector machines. The performance of the algorithm can be further investigated by comparing different classifiers.