
A novel feature selection algorithm based on damping oscillation theory

  • Fujun Wang ,

    Roles Data curation, Software, Writing – original draft

    fujunwang0902@hotmail.com

    Affiliations School of Electronic and Information Engineering, Liaoning Technical University, Huludao, People’s Republic of China, Key Laboratory of Preparation and Application of Environmentally Friendly Materials, Chinese Ministry of Education, Jilin Normal University, Changchun, People’s Republic of China

  • Xing Wang

    Roles Methodology

    Affiliation School of Electronic and Information Engineering, Liaoning Technical University, Huludao, People’s Republic of China

Abstract

Feature selection is an important task in big data analysis and information retrieval. It reduces the number of features by removing noisy and extraneous data. In this paper, a feature subset selection algorithm based on damping oscillation theory and a support vector machine classifier is proposed. This algorithm is called the Maximum Kendall coefficient Maximum Euclidean Distance Improved Grey Wolf Optimization algorithm (MKMDIGWO). In MKMDIGWO, first, a filter model based on the Kendall coefficient and Euclidean distance is proposed, which is used to measure the correlation and redundancy of the candidate feature subset. Second, the wrapper model is an improved grey wolf optimization algorithm whose position update formula has been revised to achieve better results. Third, the filter model and the wrapper model are dynamically adjusted by damping oscillation theory to find an optimal feature subset. MKMDIGWO therefore combines the efficiency of the filter model with the high precision of the wrapper model. Experimental results on five UCI public data sets and two microarray data sets demonstrate that the MKMDIGWO algorithm achieves higher classification accuracy than four other state-of-the-art algorithms; its maximum ACC value is at least 0.5% higher than that of the other algorithms on 10 data sets.

Introduction

In recent years, large amounts of irrelevant and redundant data have accumulated during data collection and organization. To reduce computational complexity, improve the accuracy of data analysis, and reduce storage requirements, feature selection has become a primary data processing method for researchers [1–3].

The two basic types of feature selection methods are filter and wrapper. The filter method has a low time complexity in the search for the optimal feature subset [4,5], as demonstrated by the well-known MRMR algorithm [6], the GRM algorithm [3], and the ReliefF and tuned ReliefF algorithms [7].

On the other hand, the wrapper method generally achieves higher classification accuracy, and in recent years it has gained more and more attention in feature selection [8]. The wrapper method consists of two parts: a meta-heuristic search algorithm and a classifier. Meta-heuristic algorithms play an important role in the wrapper method. Common meta-heuristic algorithms include Ant Lion Optimization (ALO) [9], the Bat Algorithm (BA) [10,11], Bacteria Foraging Optimization (BFO) [12], Cuckoo Search (CS) [13,14], the Genetic Algorithm (GA) [15,16], Particle Swarm Optimization (PSO) [17,18] and Simulated Annealing (SA) [19,20]. The seven algorithms mentioned above are swarm intelligence algorithms that are often used in wrapper feature selection. In the wrapper method, the commonly used classifiers are the Support Vector Machine (SVM), Random Forest (RF), etc.

Algorithms that include both a wrapper method and a filter method are often called hybrid algorithms. Chen et al. combined the PSO algorithm with the SSA algorithm to propose the HPSO-SSM algorithm [21]. Mafarja et al. combined the WOA and SA algorithms to form the WOA-SA algorithm [22]. Hacer et al. combined the wrapper method and the filter method into a single algorithm [23]. Zheng et al. proposed a new hybrid algorithm based on WOA and MPMD [24]. Carrasco et al. proposed a new filter–wrapper feature selection method based on ranking [25]. Zhang et al. proposed a two-stage feature selection method [26]. According to how the wrapper and filter are combined, hybrid feature selection methods can be divided into two-stage and embedded methods. In the two-stage method, the filter is executed only once, so only one candidate feature subset is provided to the wrapper; the wrapper is therefore limited when searching for the optimal feature subset. In the embedded method, the close coupling of the two methods increases the complexity of the algorithm and makes it easy to fall into a local optimum.

To eliminate the uniqueness of the candidate feature subset in the two-stage method, and to prevent the embedded method from falling into a local optimum, this article employs oscillation theory and combines the filter method and the wrapper method to propose a novel algorithm called the Maximum Kendall coefficient Maximum Euclidean Distance Improved Grey Wolf Optimization algorithm (MKMDIGWO). In MKMDIGWO, the Maximum Kendall coefficient Maximum Euclidean Distance (MKMD) algorithm is the filter feature selection method, while the improved grey wolf optimization algorithm together with the support vector machine (IGWO+SVM) forms the wrapper algorithm. Under the adjustment of the oscillation theory, the MKMD algorithm provides multiple candidate feature subsets for the wrapper method, and the wrapper method searches for the optimal feature subset within these candidate subsets.

Based on the MRMR idea, the MKMD filter algorithm is proposed, which uses the Kendall coefficient to measure the correlation between features and labels and the Euclidean distance to measure the redundancy between features. Two parameters (a and b) are introduced in MKMD to improve the correlation and reduce the redundancy of the candidate feature subsets. The wrapper method uses an improved grey wolf optimization algorithm with a revised position update formula. By comparing the maximum retention count with the successive maxima of the damping oscillation, the filter method and the wrapper method alternately perform depth and breadth search on the data set and finally find the optimal feature subset. As a result, the MKMDIGWO algorithm achieves higher classification accuracy on 12 data sets than the other 4 algorithms.

The rest of the paper is arranged in this way. The second part introduces the basic concepts of damping oscillation theory, gray wolf optimization and support vector machine. The third part shows the new algorithm and specific content. The fourth part describes the experimental results and experimental analysis. The fifth part draws the conclusions of this paper.

Related work

The most basic feature selection methods are the wrapper method and the filter method [27]. As feature selection research has deepened, many researchers have merged the two basic methods to form hybrid feature selection methods. Al-Tashi et al. proposed a binary version of a hybrid of grey wolf optimization (GWO) and particle swarm optimization (PSO) [28]. Al-Wajih et al. proposed a hybrid feature selection algorithm based on GWO and Harris Hawks Optimization (HHO) [29]. Zheng et al. combined the MSMC (Maximum Spearman Minimum Covariance) and CS (Cuckoo Search) algorithms to form the MSMCCS algorithm [30].

In [28], to find the best solutions, a wrapper method based on the K-nearest neighbors classifier with the Euclidean distance metric is utilized. For performance evaluation of the proposed binary algorithm, 18 standard benchmark data sets from the UCI repository are employed. The BGWOPSO algorithm is superior to the other four algorithms in accuracy, in selecting the best features, and in computational time.

In [29], a binary hybrid of GWO and HHO was proposed, which balances exploration and exploitation through the combined binary GWO and HHO algorithms. In [30], a new filter algorithm (MSMC) was proposed. In the hybrid algorithm, MSMC is embedded in the CS algorithm, and CS selects the optimal feature subset from the candidate feature subsets provided by MSMC.

The above algorithms show good performance on the data sets for which they were designed, but their performance may degrade on other types of data sets; no single algorithm solves the feature selection problem on all data sets. Therefore, existing algorithms can be merged or improved to address it. Next, we discuss the related theories and then explain the proposed algorithm in detail.

Related theory

In this part, we introduce three important theories, namely damping oscillation theory, the grey wolf optimization algorithm and the support vector machine. These theories lay the foundation for the MKMDIGWO algorithm.

Damping oscillation theory

Damping oscillation [31,32], also known as damped vibration, refers to vibration whose amplitude gradually attenuates over time because friction, medium resistance or other energy consumption dissipates the system's energy. Whether the system is a spring oscillator or a simple pendulum, external friction and medium resistance always exist; during vibration the system must constantly overcome this external resistance and do work, consuming energy, so the amplitude gradually decreases until, after a period of time, the vibration stops completely. Vibration whose amplitude decreases with time is called damped oscillation. Because the amplitude is related to the energy of the vibration, damped oscillation is also vibration whose energy is decreasing. Damped oscillation is a non-harmonic motion, and a damped oscillation system is a dissipative system. "Damping" here refers to the characteristic of any vibration system by which the amplitude gradually decreases due to external action or to causes inherent in the system itself, together with the quantitative characterization of this behaviour.

In an ideal state, a spring device or a simple pendulum would convert energy without loss, with potential energy converted into kinetic energy and back indefinitely. In reality, however, because of external friction and medium resistance, the system must constantly overcome external resistance and consume energy during vibration, so the amplitude gradually decreases and, after a period of time, the vibration stops completely. This is the cause of damped oscillation.

The curve of the damping oscillation is shown in Fig 1.

The curve in the figure can be expressed by the formula (1).

(1)  y = A e^{-nt} \sin(\omega t + \varphi)

where A e^{-nt} represents the envelope of the attenuation curve and t represents time.

From the figure we can conclude that the amplitude gradually decreases over time. It can also be seen from the formula that the damping oscillation curve is the product of an exponential function and a sine function.
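As a quick illustration of formula (1), the short script below plots a damped oscillation together with its exponential envelope; the amplitude A, damping factor n and angular frequency ω are illustrative values, not parameters taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative constants only: A (initial amplitude), n (damping factor)
# and omega (angular frequency) are assumptions, not values from the paper.
A, n, omega = 1.0, 0.05, 0.5
t = np.linspace(0, 100, 1000)

envelope = A * np.exp(-n * t)        # A e^{-nt}, the decaying envelope
y = envelope * np.sin(omega * t)     # damped oscillation: envelope times a sine

plt.plot(t, y, label="damped oscillation")
plt.plot(t, envelope, "--", label="envelope A e^{-nt}")
plt.xlabel("t")
plt.ylabel("amplitude")
plt.legend()
plt.show()
```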

Gray wolf optimization

Continuous gray wolf optimization algorithm.

In 2014, Mirjalili et al. imitated the living habits of the grey wolf in nature and proposed the grey wolf optimization algorithm [33]. According to their characteristics, the grey wolves are divided into two parts: the leadership and the general public. The leadership is arranged by hierarchy as alpha (α), beta (β) and delta (δ), and the general public layer consists of omega (ω) wolves. According to the hunting habits of the grey wolf, the predation process of the pack is divided into four steps: searching for prey, encircling prey, attacking prey and preying on the prey. Based on the grouping and predation characteristics of the pack, in the constructed model alpha is considered the optimal solution, beta (β) the second-best solution, delta (δ) the third-best solution, and omega (ω) the remaining candidate solutions. The grey wolf hierarchy model is shown in Fig 2.

According to the process of encircling prey during hunting, the following formulas express the hunting behavior [33]:

(2)  \vec{D} = |\vec{C} \cdot \vec{X}_p(t) - \vec{X}(t)|

(3)  \vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D}

In formulas (2) and (3), t represents the t-th iteration, \vec{X}(t) represents the position of the grey wolf at the t-th iteration, and \vec{X}(t+1) represents its position at the (t+1)-th iteration. \vec{X}_p(t) indicates the position of the prey at the t-th iteration. \vec{A} and \vec{C} are coefficient vectors, calculated by formulas (4) and (5), respectively [33].

(4)  \vec{A} = 2\vec{a} \cdot \vec{r}_1 - \vec{a}

(5)  \vec{C} = 2\vec{r}_2

In these two formulas, \vec{r}_1 and \vec{r}_2 are random vectors whose components lie in [0, 1], and \vec{a} is a vector that decreases linearly from 2 to 0 as the iterations proceed.

The update formula for the optimal solution (alpha) is calculated by formulas (6) and (7), and other optimal solutions (beta and delta) are similar. In the next iteration, the effects of the three optimal values are expressed by Eq (8).

(6)  \vec{D}_\alpha = |\vec{C}_1 \cdot \vec{X}_\alpha - \vec{X}|

(7)  \vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha

(8)  \vec{X}(t+1) = (\vec{X}_1 + \vec{X}_2 + \vec{X}_3) / 3

In the continuous grey wolf optimization algorithm, the wolves start from randomly assigned positions. After many iterations, under the leadership of the three head wolves, the pack moves toward the optimal value and finally obtains it. In this iterative process, when |\vec{A}| > 1 a candidate solution moves away from the prey (exploration), and when |\vec{A}| < 1 it moves closer to the prey (exploitation). Finally, the GWO algorithm terminates according to the initially set criteria.
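For readers who prefer code, the following is a minimal sketch of the continuous GWO described above, following the standard formulation of [33]; the pack size, search bounds and sphere objective are illustrative choices rather than settings used in this paper.

```python
import numpy as np

def gwo(objective, dim, n_wolves=30, n_iter=100, lb=-1.0, ub=1.0, seed=0):
    """Minimal continuous Grey Wolf Optimizer (minimization)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_wolves, dim))      # random initial pack
    fitness = np.array([objective(x) for x in X])

    for t in range(n_iter):
        # rank the pack: alpha, beta, delta are the three best wolves
        order = np.argsort(fitness)
        alpha, beta, delta = X[order[0]].copy(), X[order[1]].copy(), X[order[2]].copy()

        a = 2 - 2 * t / n_iter                         # decreases linearly from 2 to 0
        for i in range(n_wolves):
            new_pos = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a                     # Eq. (4)
                C = 2 * r2                             # Eq. (5)
                D = np.abs(C * leader - X[i])          # Eq. (6)
                new_pos += (leader - A * D) / 3.0      # Eqs. (7)-(8): average of X1, X2, X3
            X[i] = np.clip(new_pos, lb, ub)
            fitness[i] = objective(X[i])

    best = np.argmin(fitness)
    return X[best], fitness[best]

# usage: minimise the sphere function in 5 dimensions
best_x, best_f = gwo(lambda x: float(np.sum(x ** 2)), dim=5)
```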

Discrete grey wolf optimization.

The continuous grey wolf optimization algorithm is not suitable for solving discrete problems. In 2016, Emary et al. proposed binary grey wolf optimization algorithms to solve the feature selection problem on the basis of the continuous version [34]. In that paper, two discrete grey wolf optimization algorithms, bGWO1 and bGWO2, are proposed; they are called the first discrete algorithm and the second discrete algorithm, respectively, in this paper.

In the first discrete algorithm, the position update strategy is calculated using formula (9) [34].

(9)  X_d(t+1) = X_{1d} if rand < 1/3;  X_{2d} if 1/3 ≤ rand < 2/3;  X_{3d} otherwise

In this formula, X_{1d}, X_{2d} and X_{3d} represent the positions of alpha (α), beta (β) and delta (δ) in the d-th dimension, respectively. X_d(t+1) indicates the current wolf's position in the d-th dimension at iteration (t+1), and rand is a random number in the range [0, 1]. X_{1d} in formula (9) is calculated by formula (10); the formulas for X_{2d} and X_{3d} are similar.

(10)  X_{1d} = 1 if (X_{\alpha d} + bstep_{\alpha d}) ≥ 1;  0 otherwise

In formula (10), X_{\alpha d} represents the position of the alpha (α) wolf in the d-th dimension, and bstep_{\alpha d} represents a binary step of the alpha (α) wolf in the d-th dimension, calculated by formula (11).

(11)  bstep_{\alpha d} = 1 if cstep_{\alpha d} ≥ rand;  0 otherwise

In formula (11), rand has the same meaning as in formula (9), and cstep_{\alpha d} is a continuous step value in the d-th dimension, calculated by formula (12).

(12)  cstep_{\alpha d} = \frac{1}{1 + e^{-10(A_1^d D_\alpha^d - 0.5)}}

In formula (12), A_1^d and D_\alpha^d are calculated by formulas (4) and (2), respectively.

The second discrete algorithm builds on the first by transforming Eq (8) into Eq (13) [34].

(13)  X_d(t+1) = 1 if sigmoid((X_1 + X_2 + X_3)/3) ≥ rand;  0 otherwise

In formula (13), rand is a random number, X_d(t+1) represents the binary value of the d-th dimension after the position update at iteration t, and sigmoid(a) is defined in formula (14) [34].

(14)  sigmoid(a) = \frac{1}{1 + e^{-10(a - 0.5)}}
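A compact sketch of the second discrete update (bGWO2) as described by Eqs (13) and (14) is given below; the random-generator seed and the 6-dimensional example are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    # Eq. (14): squashing function used by the second discrete algorithm (bGWO2)
    return 1.0 / (1.0 + np.exp(-10.0 * (a - 0.5)))

def bgwo2_update(x1, x2, x3):
    """Eq. (13): binarise the averaged leader guidance with a stochastic threshold."""
    avg = (np.asarray(x1) + np.asarray(x2) + np.asarray(x3)) / 3.0
    return (sigmoid(avg) >= rng.random(avg.shape)).astype(int)

# usage: three continuous leader positions in 6 dimensions -> one binary wolf
x_new = bgwo2_update(rng.random(6), rng.random(6), rng.random(6))
```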

The two discrete GWO algorithms proposed by Emary et al. are computationally and conceptually complex, which hinders their understanding and widespread use. In the third part, we modify the discrete GWO algorithm to make it easier to understand and compute.

Support vector machines

The Support Vector Machine (SVM) is a commonly used classifier, based on the SVM algorithm proposed by Vapnik in 1995 [35]. SVM shows good performance on both binary and multi-class classification problems [36].

The LIBSVM classification process involves two main steps. The first step is to select a kernel function based on the data set; the second is to train the classifier on the data set. Of the two, the first step is the more important. Four kernel functions are commonly used; their formulas are given below.

The first is a linear kernel function, expressed by formula (15).

(15)  K(x_i, x_j) = x_i^T x_j

The second is a polynomial kernel function, expressed by formula (16).

(16)  K(x_i, x_j) = (\gamma x_i^T x_j + r)^d,  \gamma > 0

The third is the RBF kernel function, which is represented by formula (17).

(17)  K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2),  \gamma > 0

The fourth is the sigmoid kernel function, represented by formula (18).

(18)  K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)

Here \gamma, r and d are kernel parameters.
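The snippet below illustrates the four kernel choices using scikit-learn's SVC, which wraps LIBSVM; the wine data set and the 10-fold scoring are used only for illustration and are not part of the paper's protocol.

```python
from sklearn import svm
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# The four LIBSVM kernel choices listed above; gamma, coef0 (r) and degree (d)
# are the kernel parameters.
kernels = [
    svm.SVC(kernel="linear"),
    svm.SVC(kernel="poly", degree=3, gamma="scale", coef0=0.0),
    svm.SVC(kernel="rbf", gamma="scale"),
    svm.SVC(kernel="sigmoid", gamma="scale", coef0=0.0),
]
for clf in kernels:
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print(clf.kernel, round(acc, 3))
```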

MKMDIGWO algorithm

In this part, we propose the MKMDIGWO algorithm. Firstly, the MKMD filtering algorithm is proposed. Secondly, an improved discrete gray wolf optimization algorithm is proposed. Finally, the proposed MKMDIGWO algorithm is described.

MKMD filter algorithm

In hybrid feature selection algorithms, filter algorithms are often used to rank features in a data set. Commonly used filter algorithms are IG, F-score, MRMR, MCMD, etc., which are either univariate or multivariate. The univariate filter algorithm considers more of the correlation between labels and features, while the multivariate filter algorithm increases the measure of redundancy between features. The well-known multivariate filter algorithm MRMR uses mutual information to measure the correlation between labels and features and the redundancy between features, the proportion of the two parts is the same. In this paper, a new filter algorithm MKMD is proposed. The Kendall Rank correlation coefficient is used to measure the correlation between labels and features, and the Euclidean distance is used to measure the redundancy between features. Then the two parameters (a and b) dynamically adjust the ratio of correlation coefficient and redundancy.

Maximum Kendall rank.

In statistical theory, the Kendall rank correlation coefficient is one of the methods widely adopted to measure the correlation between two vectors. It is named after Maurice Kendall and is often represented by the Greek letter τ (tau). The Kendall correlation coefficient ranges from −1 to 1. When τ equals 1, the two random variables have a fully consistent rank correlation; when τ is −1, they have exactly opposite rank correlations; when τ equals 0, the two random variables are independent of each other.

To calculate the Kendall correlation coefficient, there exist three ways as introduced respectively below.

The first calculation method is expressed by the formula (19). This method is applicable to the case where the same elements do not exist in the set X and Y, that is, each element in the X set is unique (the Y set is the same).

(19)  \tau_a = \frac{C - D}{N(N - 1)/2}

where C is the number of concordant (consistent) pairs of elements in X and Y (two elements form a pair), D is the number of discordant (inconsistent) pairs, and N represents the number of elements contained in X (or Y).

The second calculation method is expressed by the formula (20). This method is suitable for the case where the same element exists in the vector X or Y.

(20)  \tau_b = \frac{C - D}{\sqrt{(N_3 - N_1)(N_3 - N_2)}}

In formula (20), C and D are the same as in formula (19). N3 is calculated by formula (21). N1 and N2 are obtained from the tie structure of the sets X and Y, respectively, and their specific values are calculated by formulas (22) and (23).

(21)  N_3 = \frac{N(N - 1)}{2}

(22)  N_1 = \sum_{i=1}^{s} \frac{U_i (U_i - 1)}{2}

(23)  N_2 = \sum_{i=1}^{t} \frac{V_i (V_i - 1)}{2}

N1 is calculated as follows. Equal elements in X are grouped into small sets; the number of these small sets in X is denoted by s, and Ui represents the number of elements in the i-th small set. N2 is calculated analogously on the set Y: equal elements in Y are grouped into small sets, the number of these small sets is denoted by t, and Vi represents the number of elements in the i-th small set.

The third calculation method is expressed by the formula (24).

(24)  \tau_c = \frac{2(C - D)}{N^2 (M - 1)/M}

In the formula (24), C, D, and N are the same as in the formula (19), and M represents the smaller one of the number of rows and the number of columns in the rectangular table. Eq (24) only applies to the calculation of the correlation coefficient between the random variables X and Y represented by the table, without considering the influence of the presence of the same element in the set X or Y on the final statistical value.

Assume that the vectors X(4,1,1,1,2,3,3,4,5,6,7,3,4,1,2,2,2,2,3,3,5,5,5)’ and Y(4,4,0,4,4,4,4,0,0,4,4,2,0,0,4,6,2,4,0,0,2,2,4)’ are given. There are 7 distinct elements in vector X, of which 1–5 are repeated while 6 and 7 appear only once. According to formula (22), s = 5, U1 = 4, U2 = 5, U3 = 5, U4 = 3, U5 = 4, from which N1 = 35 is calculated. In the same way, according to formula (23), in vector Y we have t = 3, V1 = 7, V2 = 4, V3 = 11, so N2 = 82.

Consider two random variables X and Y (which can also be regarded as two sets), each with N elements. The i-th (1 ≤ i ≤ N) values of the two random variables are denoted Xi and Yi, respectively. The elements at corresponding positions in X and Y form a set of element pairs KR containing the pairs (Xi, Yi) (1 ≤ i ≤ N). When any two pairs (Xi, Yi) and (Xj, Yj) in KR have the same trend, they are considered consistent (concordant); in other words, Xi > Xj and Yi > Yj hold together, or Xi < Xj and Yi < Yj hold together. Conversely, when two pairs (Xi, Yi) and (Xj, Yj) have different trends, they are considered inconsistent (discordant); in other words, Xi > Xj while Yi < Yj, or Xi < Xj while Yi > Yj.

Since the range of Kendall coefficient values is [–1, 1], we only consider integers when measuring random variables. Therefore, we use Eq (25) to calculate Kendall coefficients.

(25)

We consider the X vector as a label and the Y vector as a feature. The trends of the two are shown in Figs 3 and 4.

Fig 3. Trend about label (X) and feature (Y) (23 elements).

https://doi.org/10.1371/journal.pone.0255307.g003

Fig 4. Trends in label X and feature Y (top 10 elements).

https://doi.org/10.1371/journal.pone.0255307.g004

We use Eq (25) to calculate the Kendall coefficient for a given vector X and Y to be 0.046. It can be seen from the figure that the trends of the two are different and the correlation between the two is very low.

Assume that the vectors X1(1,2,3,4,5,6,2,4,6,3)’ and Y1(0.6,1.8,5.4,3.9,7,10,1.85,4,9.8,5.7)’ are given; the trends of the two are shown in Fig 5. Their Kendall coefficient is 0.7683.

The three figures tell us that the correlation between X and Y is relatively weak, while the correlation between X1 and Y1 is relatively strong.
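The two worked examples can be checked numerically with scipy.stats.kendalltau, which implements the tie-corrected tau-b variant (essentially Eq (20)); the values it returns may therefore differ slightly from the paper's own Eq (25).

```python
import numpy as np
from scipy.stats import kendalltau

X = np.array([4,1,1,1,2,3,3,4,5,6,7,3,4,1,2,2,2,2,3,3,5,5,5])   # label vector from the text
Y = np.array([4,4,0,4,4,4,4,0,0,4,4,2,0,0,4,6,2,4,0,0,2,2,4])   # feature vector from the text

# tau-b for the weakly correlated pair (the text reports 0.046 with Eq (25))
tau, p_value = kendalltau(X, Y)
print(round(tau, 3))

X1 = np.array([1, 2, 3, 4, 5, 6, 2, 4, 6, 3])
Y1 = np.array([0.6, 1.8, 5.4, 3.9, 7, 10, 1.85, 4, 9.8, 5.7])
print(round(kendalltau(X1, Y1)[0], 4))   # strongly correlated pair from Fig 5
```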

There are multiple features in the data set, and the feature Fi with the highest correlation with the label is determined by formula (26).

(26)  MK = \max_{1 \le i \le n} \tau(F_i, L)

where Fi represents the i-th feature in the data set, L represents the label vector, and max denotes the maximum function, i = 1, 2, 3, …, n.

Maximum Euclidean distance.

According to the idea of Maximum Relevance Minimum Redundancy, it is necessary to consider not only the correlation between labels and features but also the compatibility between features. Among the commonly used measures of compatibility between features are mutual information, Euclidean distance, cosine angle, and so on. The Euclidean Distance is adopted in this paper to measure the redundancy between features.

Euclidean distance is the easiest to understand distance calculation method, derived from the distance formula between two points in Euclidean space.

The Euclidean distance between two points a(x1, y1) and b(x2, y2) on a two-dimensional plane is expressed by formula (27).

(27)  d(a, b) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

The Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space is expressed by the formula (28).

(28)  d(a, b) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}

The Euclidean distance between two n-dimensional vectors a(x11,x12,…,x1n) and b(x21,x22,…,x2n) is expressed by formula (29).

(29)  d(a, b) = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}

Two features are given as Fi and Fj, respectively, and the Euclidean distance is denoted as D (Fi, Fj). It is expressed by the formula (30).

(30)  D(F_i, F_j) = \sqrt{\sum_{k=1}^{m} (F_{ik} - F_{jk})^2}, where m is the number of samples

The value of the Euclidean distance reflects the redundancy between two features: the larger the distance between them, the less redundant they are with respect to each other.

Given two features F1(1,5,1,1,1,1,1,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2) and F2(1,4,2,2,1,2,2,2,1,1,2,1,2,2,2,1,2,1,1,2,1,1,1), the correlation distance D(F1, F2) = 1.1149 after data centering indicates that the redundancy is high.
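A small sketch of the distance computation of Eqs (29) and (30) on the two example features is given below; the centering step mirrors the worked example, but since the exact normalization behind the reported value 1.1149 is not restated here, the printed number may differ.

```python
import numpy as np

F1 = np.array([1,5,1,1,1,1,1,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2], dtype=float)
F2 = np.array([1,4,2,2,1,2,2,2,1,1,2,1,2,2,2,1,2,1,1,2,1,1,1], dtype=float)

# Eq (30): Euclidean distance between the two feature vectors,
# computed here after centring each feature on its mean.
d = np.linalg.norm((F1 - F1.mean()) - (F2 - F2.mean()))
print(round(d, 4))
```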

The distance between two features is expressed by Eq (30). Given a known feature, the feature in a feature set that has the maximum Euclidean distance from it is found by Eq (31).

(31)  MD_1(F_k, FSet) = \max_{F_j \in FSet} D(F_k, F_j)

where Fk represents a known feature, FSet represents a feature set, max denotes the maximum function, and Fj is the feature in FSet whose distance to Fk is greater than that of any other feature in the set. D(Fk, Fj) is calculated using Eq (30).

A collection of selected features is called a selected feature set. A set of candidate features is called a candidate feature subset. A feature is selected from the candidate features subset, and the distance between the feature and the selected feature subset is required to be a maximum. The distance between the two feature sets is represented by Eq (32).

(32)  MD(FSet_k, FSet) = \max_{F_m \in FSet} MD_1(F_m, FSet_k)

Where FSetk represents the selected feature set. Fset represents the candidate feature set. Fm represents the feature with the largest distance selected from the Fset set. MD1 (Fm,FSetk) is calculated using Eq (31).

Maximum Kendall Maximum Distance (MKMD).

The maximum Kendall (MK) method measures the correlation between labels and features, and the maximum Euclidean distance (MD) measures the redundancy between features. There are two ways to combine MK and MD into one formula: one is to multiply MK by MD, the other is to add them. To reduce the amount of calculation, we choose addition. Combining the contents of the previous two sections, the MKMD method of selecting features (Fj) one by one from the candidate feature subset is represented by formula (33).

(33)

where i represents the number of features that have already been selected before formula (33) is executed, and Fj represents the feature selected by this formula.

In (33), MK and MD always carry the same weight. However, as the number of selected features gradually increases, the redundancy between features outweighs the correlation between features and labels. Therefore, two parameters (a and b) are added to Eq (33), yielding Eq (34), to adjust the roles of correlation and redundancy.

(34)

where a and b are parameters used to adjust the effects of the maximum correlation and the maximum distance. Since the values of MK and MD differ considerably across data sets, a and b cannot both be fixed at 1; instead, a decreases from large to small while b increases from small to large. Formulas (35) and (36) give the values of a and b.

(35)(36)

where t represents the current iteration and T the total number of iterations. The trends of a and b can be seen in Fig 6: a gradually decreases from its maximum value of 1 toward 0, while b gradually increases from its minimum value of 0 toward 1, and the two are equal at t = 70. From the weights represented by a and b, the curve shows that the correlation effect of MK dominates MD when t < 70, and the effect of MD dominates when t > 70.

The pseudo code of the MKMD algorithm is shown in Algorithm 1, and the flow chart is shown in Fig 7.

Algorithm 1. MKMD algorithm pseudo code.

1 Input: the feature set Fset, parameters a and b, i = 0, j = 1

2 Output: the order of feature

3 Select feature F1 from Fset according to formula (34) (i = 0)

4 Fset = Fset-{F1} i = i+1,j = j+1

5 Select feature F2 from Fset according to formula (34) (i = 1)

6 Fset = Fset -{F2}, i = i+1,j = j+1;

7 While Fset<>Null

8 Select feature Fj from Fset according to formula (34) (i>1)

9 Fset = Fset-{Fj}

10 j = j+1

11 i = i+1

12 Wend
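The Python sketch below mirrors the greedy loop of Algorithm 1. Because the exact forms of Eqs (34)–(36) are not reproduced above, the a/b schedules and the use of the absolute Kendall coefficient are illustrative assumptions that only mimic the described behaviour (a decays and b grows as more features are selected).

```python
import numpy as np
from scipy.stats import kendalltau

def mkmd_rank(X, y, T=100):
    """Greedy MKMD-style feature ranking (a sketch, not the paper's exact formulas).

    X: (n_samples, n_features) array, y: label vector.
    """
    n_features = X.shape[1]
    # MK term: Kendall coefficient between each feature and the label
    mk = np.nan_to_num(np.array([abs(kendalltau(X[:, j], y)[0]) for j in range(n_features)]))
    remaining = list(range(n_features))
    selected = []

    while remaining:
        t = len(selected)
        a = max(0.0, 1.0 - t / T)          # correlation weight, shrinks over time (assumption)
        b = min(1.0, t / T)                # redundancy weight, grows over time (assumption)
        best_j, best_score = None, -np.inf
        for j in remaining:
            if selected:
                # MD term: maximum Euclidean distance to the already selected features, Eq (31)
                md = max(np.linalg.norm(X[:, j] - X[:, k]) for k in selected)
            else:
                md = 0.0
            score = a * mk[j] + b * md      # additive combination, in the spirit of Eq (34)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```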

Improved grey wolf optimization

In the two typical discrete GWO algorithms, updating the positions guided by the three leaders requires multiple formulas. This paper improves on the grey wolf algorithm. To calculate each position of each grey wolf, two things should be considered: the three best values currently influencing it and its position in the previous iteration. The value of the current position is therefore generated from two parts, represented by thrg(x1, x2, x3) and xpre(i, j), respectively. x1, x2 and x3 represent the position information of the three leading wolves in the discrete grey wolf algorithm, which is calculated by formula (9), while xpre(i, j) represents the position at the most recent iteration and therefore needs no separate calculation formula. The value of thrg(x1, x2, x3) is determined by formula (41), and the update formula for the current position X(i, j) is given in formula (37).

(37)

Among them, t1 and t2 are parameters, which are used to mediate the different influence weights of the two parts on the current position. For the calculation methods of t1 and t2, see Eqs (38) and (39).

(38)(39)

where x is a number taken arbitrarily within the specified range and t is a linearly decreasing function, calculated by formula (40).

(40)

Where l represents the current number of iterations and T represents the maximum number of iterations.

The first discrete algorithm uses a piecewise function to select only one of the three leaders at a time, so the other two best values have no effect. The second discrete algorithm considers all three effects, but its calculation is more complicated. The method proposed in this paper considers the effects of all three leaders while keeping the amount of calculation small. thrg(x1, x2, x3) is calculated by formula (41).

(41)

x1, x2 and x3 represent the position information of the three leading wolves, and their range is [−1, 1]. The absolute value of their sum lies in the range (0, 3), and the value of 3sin(x) also lies in the range (0, 3). rand is a random number. In the improved grey wolf algorithm, under the guidance of the three leading wolves, the optimal feature subset can be found through iteration.

MKMDIGWO algorithm

Damping oscillation adjustment.

The filter algorithms and optimization algorithms mentioned in the previous subsections can be combined in various ways, such as wrapper, filter and hybrid schemes; among these, the hybrid scheme is the most commonly used. This paper introduces the principle of damping oscillation to dynamically adjust the combination of the filter algorithm and the wrapper algorithm: the wrapper algorithm performs multiple optimizations based on the different results of the filter algorithm.

According to formula (1), a new damping oscillation formula (42) is proposed.

(42)

In the formula, t represents the number of iterations, which varies from 0 to 100, and y represents the amplitude value at each time t. The original damping oscillation uses the product of an exponential function and a sine function; the proposed function replaces the sine with a cosine because, at t = 0, the algorithm has not yet performed any optimization iterations. Since the cosine takes negative values, which are meaningless here, the absolute value of the whole result is taken. rand represents a random number ranging from 0 to 1, and the symbol ⌈ ⌉ indicates rounding up. Adding the random number means that the value of y is not fixed at each execution.

The two figures (Figs 8 and 9) show that, with or without the random number, y has six maxima during the change process. These six maxima are used to adjust the execution of the wrapper algorithm and the filter algorithm.

Fig 8. Damping oscillation function (before adding random numbers).

https://doi.org/10.1371/journal.pone.0255307.g008

Fig 9. Damping oscillation function (after adding random numbers).

https://doi.org/10.1371/journal.pone.0255307.g009
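A sketch of a damping-oscillation schedule in the spirit of Eq (42) is shown below: a random factor times a decaying envelope times |cos|, rounded up. The amplitude A, decay rate n and frequency ω are illustrative assumptions chosen so that the curve has roughly six peaks over 100 iterations, as described above; the paper's exact constants are not restated here.

```python
import numpy as np

def damping_peaks(T=100, A=30.0, n=0.02, seed=0):
    """Illustrative damping-oscillation schedule with about six maxima over T iterations."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    omega = 6 * np.pi / T                     # |cos| then peaks roughly six times in [0, T)
    y = np.ceil(rng.random(T) * A * np.exp(-n * t) * np.abs(np.cos(omega * t)))
    return y
```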

MKMDIGWO algorithm framework.

The framework of the MKMDIGWO algorithm is shown in Fig 10. First the parameters are initialized and the damping oscillation function is executed. The six maxima of the damping oscillation function (maxvalue_six) adjust the execution of the filter algorithm and the optimization algorithm. The maximum retention counter (savemax) is initialized to 100.

After these preparations are completed, the program enters the loop section, in which whether to execute the filter algorithm is determined by the relationship between the maximum retention counter (savemax) and the damping oscillation maximum (maxvalue_six). Because of the initial value, in the first loop the filter algorithm is executed and the population is initialized; the optimization algorithm is then executed to update the population positions and find the optimal value, and the value of savemax is updated.

In the other cycles after the first cycle, the relationship between the savemax and the maxvalue_six is first determined.

If the value of savemax is larger than maxvalue_six, the obtained maximum has not been updated for many iterations; the optimum found on this basis is already the best that can be reached, so there is no need to keep cycling, and the search must return to an initial state to continue. Therefore, the filter algorithm is re-executed, the population is reinitialized, and savemax is reset to 0.

If the value of savemax is smaller than the maxvalue_six, it is indicated that the optimization can be continued on this basis, and the filter algorithm does not need to be re-executed. Therefore, the maximum value of the damping oscillation (maxvalue_six) determines the execution of the filter algorithm.

In a given cycle, the highest classification accuracy obtained by the optimization algorithm is compared with the maximum of the previous cycle. If the former is greater than the latter, the maximum retention counter (savemax) is set to 1; otherwise, savemax is incremented by 1. Under the repeated adjustment of the damping oscillation, many classification accuracies are obtained, and the maximum among them is selected as the output.
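The control flow of Fig 10 can be summarized with the structural sketch below; filter_step, wrapper_step and damping_values are stand-ins for the MKMD filter, the IGWO+SVM wrapper and the Eq (42) values, passed in as arguments because their exact implementations are not reproduced here.

```python
def mkmdigwo_loop(filter_step, wrapper_step, damping_values, T=100):
    """Structural sketch of the MKMDIGWO main loop (not the paper's code)."""
    savemax = 100                               # initial value forces the filter in the first cycle
    best_acc, best_subset = 0.0, None
    population = None

    for t in range(T):
        if savemax > damping_values[t]:         # compare savemax with maxvalue_six
            population = filter_step()          # re-run MKMD and re-initialise the wolves
            savemax = 0
        acc, subset, population = wrapper_step(population)   # one IGWO+SVM optimisation step
        if acc > best_acc:                      # a new best classification accuracy was found
            best_acc, best_subset = acc, subset
            savemax = 1
        else:
            savemax += 1                        # the current maximum survived one more iteration
    return best_subset, best_acc
```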

Experimental results and analysis

In this part, we will introduce experimental results and analysis. First, the data set used in the experiment is described. Second, we introduce the settings of the parameters used by each algorithm in the experiment. Finally, the results and analysis of the experiment are introduced.

Experimental data

The experimental data are available from the UCI Machine Learning Repository and from http://www.gems-system.org/ (Gene Expression Model Selector). The proposed algorithm is evaluated on these data sets, whose details are described in Table 1. The data sets cover a variety of classification tasks, including 2, 3, 5 and 11 classes; the number of instances ranges from 102 to 1000, and the number of features ranges from 13 to 1993. From this information it can be concluded that the data sets cover a wide range.

Table 1. Description of the data set used in the experiment.

https://doi.org/10.1371/journal.pone.0255307.t001

Experimental parameter setting

The MKMDIGWO algorithm includes a filter algorithm, MKMD, and a wrapper algorithm. In the wrapper algorithm, the number of iterations is 100 and the number of wolves is 30. Although there are papers on wrapper and filter algorithms, it is difficult to find other algorithms that are exactly the same as MKMDIGWO. For a fair comparison, we reproduce several algorithms: MRMR+PSO, MRMR+GA, MRMR+BBA and MRMR+CS. In these four algorithms, the filter part uses the MID criterion of the MRMR algorithm, and the number of iterations and the population size are the same as in MKMDIGWO. The initialization parameters and implementations of the four algorithms follow references [37–41]. The parameter settings of the four algorithms are shown in Tables 2–5.

Table 2. Detailed parameter values of the mRMR+PSO algorithm.

https://doi.org/10.1371/journal.pone.0255307.t002

Table 3. Detailed parameter values of the mRMR+GA algorithm.

https://doi.org/10.1371/journal.pone.0255307.t003

Table 4. Detailed parameter values of the mRMR+BBA algorithm.

https://doi.org/10.1371/journal.pone.0255307.t004

Table 5. Detailed parameter values of the mRMR+CS algorithm.

https://doi.org/10.1371/journal.pone.0255307.t005

To make the experimental results more stable, each algorithm was executed 10 times on each data set. The classification accuracy for each data set therefore has two values: the maximum of the 10 runs and the average of the 10 runs. In this study, the SVM classifier is used to obtain the highest classification accuracy of each feature subset. The SVM classifier uses a kernel function, and the penalty parameter C and the RBF kernel parameter are selected by the grid search method.

Compared with other methods, ten-fold cross-validation is the best method for estimating classification performance [42]. Therefore, the above-mentioned algorithms are tested with 10-fold cross-validation.
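The evaluation protocol (grid search over C and the RBF parameter, then 10-fold cross-validation) can be reproduced with scikit-learn as sketched below; the wine data set and the power-of-two grid are illustrative stand-ins for the selected feature subsets and the actual search ranges.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# Grid-search C and gamma for an RBF-kernel SVM, then score the tuned
# model with 10-fold cross-validation.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [2 ** k for k in range(-5, 6)],
                "gamma": [2 ** k for k in range(-5, 6)]},
    cv=10,
)
grid.fit(X, y)
acc = cross_val_score(grid.best_estimator_, X, y, cv=10).mean()
print(grid.best_params_, round(acc, 3))
```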

Experimental results and analysis

Seven data sets and five algorithms were used in this experiment. After each algorithm runs 10 times on each data set, the maximum value of the classification accuracy and the length of the feature subset are displayed in Table 6. The average of the classification accuracy and the average of the feature subset length are shown in Table 7.

Table 6. The highest classification accuracy of each algorithm on each data set and its feature subset length.

https://doi.org/10.1371/journal.pone.0255307.t006

Table 7. Average classification accuracy and average feature subset length of each of the six algorithms on each data set.

https://doi.org/10.1371/journal.pone.0255307.t007

From the above two tables, we can see that the MKMDIGWO algorithm achieves a high average classification accuracy on the 12 data sets. On the LU, PT and WI data sets, its classification accuracy reaches 100%. On the LU and SC data sets, MKMDIGWO is at least 1% higher than the other algorithms. Except for the 11t and GE data sets, MKMDIGWO also obtains the shortest feature subset length among the compared algorithms; on the DE and PT data sets in particular, its dimensionality reduction effect is very obvious. The BGWO algorithm is a pure wrapper algorithm, so its dimensionality reduction effect on the high-dimensional data sets (11t, LU, PT, SC) is not obvious.

From Fig 11, it can be seen that the abscissa is the number of iterations, from 0 to 100, and the ordinate is the classification accuracy of each data set. The classification accuracy rate on the six data sets of DE, BD, LU, PA, PT and WI has reached more than 98%, and the accuracy rate is relatively high. The classification accuracy rate of the 11t, BP, EC and SH data sets reached more than 85%. Only the classification accuracy of the GE and SC data set reached 78% and 74%, which is lower than 80%.

Fig 11. Classification accuracy of 100 iterations over 12 data sets.

https://doi.org/10.1371/journal.pone.0255307.g011

The highest classification accuracy obtained for each data set is formed by a combination of multiple filter algorithms and optimization algorithms. In the filter algorithm, the important features appear in the candidate feature subsets by the Kendall coefficient and the Euclidean distance measurement method. In the wrapper algorithm, after several optimizations, the feature subset with higher classification accuracy is found. And the best results have been found before reaching 100 iterations.

Fig 12 shows the number of iterations on the abscissa, from 0 to 100, and the change in the feature subset length of each data set on the ordinate. The number of selected features on the PT, LU, 11t and SC data sets is very small compared with the total number of features, reaching one thousandth, two thousandths, three thousandths and three thousandths of the total, respectively. On the DE, EC, BD, PA, BP and SH data sets, the number of selected features exceeds 30% of the total. Only on the GE and WI data sets does the dimensionality reduction exceed 50%. Therefore, the proposed algorithm has a more significant dimensionality reduction effect on high-dimensional data sets.

Fig 12. Feature subset length for 100 iterations over 12 data sets.

https://doi.org/10.1371/journal.pone.0255307.g012

Figs 13–24 plot the index of the maxima of the damping oscillation function on the abscissa and the value of savemax on the ordinate. On the 12 data sets, the value of maxvalue_six reflects the effect of the different damping functions. The red line indicates the value calculated by the damping oscillation function, the black line indicates the number of times each maximum is retained, and the blue line indicates the number of iterations when the damping oscillation adjustment occurs. The figures show that the black line is always slightly larger than the red line; that is, the filter algorithm is executed when the maximum retention count just exceeds the value of the damping oscillation function.

Fig 13. Modulation of damping oscillation on the 11t data set.

https://doi.org/10.1371/journal.pone.0255307.g013

Fig 14. Modulation of damping oscillation on the BD data set.

https://doi.org/10.1371/journal.pone.0255307.g014

Fig 15. Modulation of damping oscillation on the BP data set.

https://doi.org/10.1371/journal.pone.0255307.g015

Fig 16. Modulation of damping oscillation on the DE data set.

https://doi.org/10.1371/journal.pone.0255307.g016

Fig 17. Modulation of damping oscillation on the GE data set.

https://doi.org/10.1371/journal.pone.0255307.g017

Fig 18. Modulation of damping oscillation on the LU data set.

https://doi.org/10.1371/journal.pone.0255307.g018

Fig 19. Modulation of damping oscillation on the SH data set.

https://doi.org/10.1371/journal.pone.0255307.g019

Fig 20. Modulation of damping oscillation on the EC data set.

https://doi.org/10.1371/journal.pone.0255307.g020

Fig 21. Modulation of damping oscillation on the PA data set.

https://doi.org/10.1371/journal.pone.0255307.g021

Fig 22. Modulation of damping oscillation on the PT data set.

https://doi.org/10.1371/journal.pone.0255307.g022

Fig 23. Modulation of damping oscillation on the SC data set.

https://doi.org/10.1371/journal.pone.0255307.g023

Fig 24. Modulation of damping oscillation on the WI data set.

https://doi.org/10.1371/journal.pone.0255307.g024

In the Damping Oscillation Adjustment subsection, we mentioned that the damping oscillation function has six maxima. However, the twelve graphs show that the 11t, DE, EC, LU and SH data sets use four maxima, the BD, BP, GE and PA data sets use three, the PT and SC data sets reach five, and the WI data set uses the most, six. The sum of the maxima of the damping oscillation function in each graph is always less than 100, so the number of iterations did not reach 100. During the iterations, as long as the value of ACC changes, the retention counter is reset to 1. Therefore, the number of maxima of the damping oscillation function used in the twelve graphs is less than 7.

The Wilcoxon signed-rank test was proposed by Frank Wilcoxon as a non-parametric statistical hypothesis test [43]. It is applied to compare two related samples, and from it we can decide whether the corresponding data distributions are identical. In this paper, the Wilcoxon signed-rank test is carried out with SPSS software, and the data in Table 8 are the results of applying it. In Table 8, five pairs of Wilcoxon signed-rank tests are performed on the SC data set. At the 0.05 significance level, it can be seen from Table 8 that the performance of MKMDIGWO is better than that of the other five algorithms; in other words, the classification accuracy of each run is improved.
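For reference, the same test can be run without SPSS using scipy.stats.wilcoxon; the paired accuracies below are illustrative numbers, not the values behind Table 8.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired accuracies of two algorithms over 10 runs (illustrative
# numbers only), compared with the Wilcoxon signed-rank test.
acc_mkmdigwo = np.array([0.741, 0.733, 0.752, 0.748, 0.760, 0.729, 0.755, 0.744, 0.738, 0.751])
acc_baseline = np.array([0.722, 0.715, 0.731, 0.726, 0.733, 0.701, 0.728, 0.712, 0.724, 0.719])

stat, p = wilcoxon(acc_mkmdigwo, acc_baseline)
print(p < 0.05)   # True -> the difference is significant at the 0.05 level
```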

Table 8. The comparison based on Wilcoxon signed-rank test on SC data set.

https://doi.org/10.1371/journal.pone.0255307.t008

Conclusion

This paper proposes a new feature selection algorithm named MKMDIGWO, which has two characteristics. The first is the proposed MKMD filter algorithm, which scores and sorts the features and takes the higher-scoring features as the candidate feature subset. The second is the use of the damping oscillation function to adjust the combination of the filter algorithm and the wrapper algorithm. In the MKMD algorithm, two parameters are introduced, which improves the correlation of the candidate feature subsets and reduces their redundancy, and thus the classification accuracy is improved.

The advantages of filter methods and wrapper method feature selection are combined in MKMDIGWO. In the filter algorithm, two parameters are used to reduce the overall redundancy. Due to the adjustment of the maximum value of the damping oscillator function, the filter algorithm and the optimization algorithm are closely combined. Therefore, the solution quality of the feature selection problem is improved. It is shown in the experimental results that the proposed MKMDIGWO algorithm is much better than the other five algorithms.

This paper mainly discusses feature selection methods. The dimensionality of the data set used in the experiment is not high. In order to make the feature selection method more effective, in future work, we will use the feature selection method on high-dimensional data sets.

References

  1. Armanfard N., Reilly J. P., Komeili M., “Local Feature Selection for Data Classification”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 6, pp. 1217–1. pmid:26390448
  2. Wang Y. W., Feng L. Z., “Hybrid feature selection using component co-occurrence based feature relevance measurement”, Expert Systems with Applications, vol. 102, pp. 83–99, Jul.
  3. Wang D., Nie F., Huang H., “Feature Selection via Global Redundancy Minimization”, IEEE Trans. Knowl. Data Eng., vol. 27, pp. 2743–2755, 2015.
  4. Sebban M., Nock R., “A hybrid filter/wrapper approach of feature selection using information theory”, Pattern Recognit., vol. 35, pp. 835–846, 2002.
  5. Freeman C., Dana, Basir O., “An evaluation of classifier-specific filter measure performance for feature selection”, Pattern Recognition, vol. 48, 2015. pmid:1812
  6. Peng H., Long F., Ding C., “Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 1226–1238, 2005. pmid:16119262
  7. Palma-Mendoza R. J., Rodriguez D., De-Marcos L., “Distributed ReliefF-based feature selection in Spark”, Knowledge and Information Systems, vol. 51, no. 1, pp. 1–20, Oct, 2018.
  8. Majdi M., Seyedali M., “Whale optimization approaches for wrapper feature selection”, Applied Soft Computing, vol. 62, pp. 441–453, Jan, 2018.
  9. Mirjalili S., “The ant lion optimizer”, Adv. Eng. Softw., vol. 83, pp. 80–98, May, 2015.
  10. Yang X.S., He X., “Bat algorithm: literature review and applications”, International Journal of Bio-Inspired Computation, vol. 5, no. 3, pp. 141–149, 2013.
  11. Rodrigues D., et al., “A wrapper approach for feature selection based on Bat Algorithm and Optimum-Path Forest”, Expert Syst. Appl., vol. 41, pp. 2250–2258, 2014.
  12. Passino K.M., “Biomimicry of bacterial foraging for distributed optimization and control”, IEEE Control Syst., vol. 22, pp. 52–67, 2002.
  13. Mohapatra P., Chakravarty S., Dash P.K., “An improved cuckoo search based extreme learning machine for medical data classification”, Swarm Evol. Compu., vol. 24, pp. 25–49, 2015.
  14. Yang X.S., Deb S., “Cuckoo search via Lévy flights”, World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), 2009, pp. 210–214.
  15. Tsai C.F., Eberle W., Chu C.Y., “Genetic algorithms in feature and instance selection”, Knowl.-Based Syst., vol. 39, pp. 240–247, 2013.
  16. Wang Z., Shao Y.H., Wu T.R., “A GA-based model selection for smooth twin parametric-margin support vector machine”, Pattern Recognit., vol. 46, pp. 2267–2277, 2013.
  17. Kennedy J., Eberhart R.C., “Particle swarm optimization”, in: Proceedings of the Conference on Neural Networks, IEEE, Perth, Australia, 1995, pp. 1942–1948.
  18. Vieira S.M., Mendonca L.F., et al., “Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients”, Appl. Soft Comput., vol. 13, pp. 3494–3504, 2013.
  19. Kirkpatrick S., Gelatt C.D., Vecchi M.P., “Optimization by simulated annealing”, Science, vol. 220, pp. 671–680, 1983. pmid:17813860
  20. Lin S.W., Lee Z.J., et al., “Parameter determination of support vector machine and feature selection using simulated annealing approach”, Appl. Soft Comput., vol. 8, pp. 1505–1512, 2008.
  21. Chen K., Zhou F. Y., Yuan X. F., “Hybrid Particle Swarm Optimization with Spiral-Shaped Mechanism for Feature Selection”, Expert Systems with Applications, vol. 128, pp. 140–156, 2019.
  22. Mafarja M. M., Mirjalili S., “Hybrid Whale Optimization Algorithm with Simulated Annealing for Feature Selection”, Neurocomputing, vol. 26, pp. 302–312, Oct, 2017.
  23. Hancer E., “Differential evolution for feature selection: a fuzzy wrapper–filter approach”, Soft Computing, vol. 23, pp. 5233–5248, Jul, 2019.
  24. Zheng Y., et al., “A Novel Hybrid Algorithm for Feature Selection Based on Whale Optimization Algorithm”, IEEE Access, vol. 7, pp. 14908–14923, 2019.
  25. Saul S.F., et al., “A new hybrid filter-wrapper feature selection method for clustering based on ranking”, Neurocomputing, vol. 214, pp. 866–880, Nov, 2016.
  26. Zhang X., Zhang Q., Chen M., et al., “A two-stage feature selection and intelligent fault diagnosis method for rotating machinery using hybrid filter and wrapper method”, Neurocomputing, vol. 275, pp. 2426–2439, 2018.
  27. Mundra P. A., Rajapakse J. C., “SVM-RFE With MRMR Filter for Gene Selection”, IEEE Transactions on NanoBioscience, vol. 9, no. 1, pp. 31–37, 2010. pmid:19884101
  28. Al-Tashi Q., Kadir S., Rais H. M., et al., “Binary Optimization Using Hybrid Grey Wolf Optimization for Feature Selection”, IEEE Access, vol. 7, pp. 39496–39508, 2019.
  29. Al-Wajih R., Abdulkadir S. J., Aziz N., et al., “Hybrid Binary Grey Wolf with Harris Hawks Optimizer for Feature Selection”, IEEE Access, vol. 9, no. 1, pp. 31662–31677, 2021.
  30. Zheng Y., Li Y., Wang G., et al., “A novel hybrid algorithm for feature selection”, Pers Ubiquit Comput, vol. 22, pp. 971–985, 2018. pmid:30719423
  31. Sun S., Deng H., Du H., et al., “A Compact Variable Stiffness and Damping Shock Absorber for Vehicle Suspension”, IEEE/ASME Transactions on Mechatronics, vol. 20, no. 5, pp. 2621–2629, 2015.
  32. Deng L., Sun S., Christie M. D., et al., “Experimental testing and modelling of a rotary variable stiffness and damping shock absorber using magnetorheological technology”, Journal of Intelligent Material Systems and Structures, vol. 30, no. 10, pp. 1453–1465, 2019.
  33. Mirjalili S., Mirjalili S. M., Lewis A., “Grey Wolf Optimizer”, Advances in Engineering Software, vol. 69, no. 3, pp. 46–61, 2014.
  34. Emary E., Zawbaa H. M., Hassanien A. E., “Binary grey wolf optimization approaches for feature selection”, Neurocomputing, vol. 172, pp. 371–381, 2016.
  35. Suykens J.A.K., Vandewalle J., “Least Squares Support Vector Machine Classifiers”, Neural Processing Letters, vol. 9, pp. 293–300, 1999.
  36. Hsu C.W., Lin C. J., “A comparison of methods for multiclass support vector machines”, IEEE Trans. Neural Netw., vol. 13, pp. 415–425, 2002. pmid:18244442
  37. Yang X.S., He X., “Bat algorithm: literature review and applications”, International Journal of Bio-Inspired Computation, vol. 5, pp. 141–149, 2013.
  38. Unler A., Murat A., Chinnam R. B., “mr(2)PSO: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification”, Inf. Syst., vol. 181, pp. 4625–4641, 2011.
  39. Huang C.L., Wang C.J., “A GA-based feature selection and parameters optimization for support vector machines”, Expert Syst. Appl., vol. 31, pp. 231–240, 2006.
  40. Rodrigues D., et al., “A wrapper approach for feature selection based on Bat Algorithm and Optimum-Path Forest”, Expert Syst. Appl., vol. 41, pp. 2250–2258, 2014.
  41. Lin S.-W., et al., “Parameter determination of support vector machine and feature selection using simulated annealing approach”, Appl. Soft Comput., vol. 8, no. 4, pp. 1505–1512, 2008.
  42. Braga-Neto U. M., Dougherty E. R., “Is cross-validation valid for small-sample microarray classification?”, Bioinformatics, vol. 20, no. 3, pp. 374–380, 2004. pmid:14960464
  43. Zimmerman D.W., Zumbo B.D., “Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks”, Journal of Experimental Education, vol. 62, pp. 75–86, 1993.