A novel hybrid algorithm based on Harris Hawks for tumor feature gene selection

Background. Gene expression data are often used to classify cancer genes. In such high-dimensional datasets, however, only a few feature genes are closely related to tumors. Therefore, it is important to accurately select a subset of feature genes that contribute strongly to cancer classification.
Methods. In this article, a new three-stage hybrid gene selection method is proposed that combines a variance filter, an extremely randomized tree, and the Harris Hawks algorithm (VEH). In the first stage, we evaluated each gene in the dataset with the variance filter and retained the feature genes that met the variance threshold. In the second stage, we used an extremely randomized tree to further eliminate irrelevant genes. Finally, we used the Harris Hawks algorithm to search the gene subset retained by the first two stages for the optimal feature gene subset.
Results. We evaluated the proposed method using three different classifiers on eight published microarray gene expression datasets. The results showed 100% classification accuracy for VEH in gastric cancer, acute lymphoblastic leukemia, and ovarian cancer, and an average classification accuracy of 95.33% across a variety of other cancers. Compared with other advanced feature selection algorithms, VEH has obvious advantages on many evaluation criteria.


INTRODUCTION
In data analysis, the data dimension may greatly exceed the number of samples (Diao & Vidyashankar, 2013). Commonly used methods often perform poorly on such data because they cannot avoid the curse of dimensionality (Myakalwar et al., 2015). Therefore, it is necessary to reduce such datasets through feature selection to ensure the accuracy of subsequent analysis (Douglas & Shapiro, 2021). Microarray technology can simultaneously measure the expression of a large number of cancer-related genes (Sandra, Shukla & Kolthur-Seetharam, 2020; Su et al., 2017), and the efficient selection of disease feature genes from microarray data can improve the accuracy of disease classification and help to improve the treatment of cancer (An, Wang & Wei, 2018). Because the number of measured genes is much larger than the number of cancer samples, and only a few feature genes in the gene expression data are closely related to cancers, selecting highly discriminative feature genes for cancers is a challenging task, and existing methods are not fully effective.
Genes are closely related to tumors; gene activation and mutation are among the causes of tumor occurrence. Feature selection methods for high-dimensional data fall into four standard categories: filter, embedded, wrapper, and hybrid (Dashtban & Balafar, 2017; Pashaei & Pashaei, 2021; Pfeifer et al., 2020; Tang et al., 2021; Yu & Ni, 2014). The filter method is a preprocessing method for high-dimensional data: it evaluates each gene according to specific rules and removes genes unrelated to the subsequent learning process. However, the filter method cannot analyze the mutual information between features, so the selected feature subset may not be optimal (Wang, Wang & Chang, 2016). The wrapper method uses a classification model, including a heuristic algorithm, and selects the optimal feature subset according to classification performance (Sahebi et al., 2020). In the feature selection of high-dimensional medical data, the wrapper method is usually more effective than the filter method (Fu et al., 2020). The hybrid method combines the filter, embedded, and wrapper methods, as well as improvements and extensions of these three methods (Castellanos-Garzón et al., 2018). For example, Qu et al. (2021) proposed a Harris Hawks optimizer with variable neighborhood to screen feature genes by combining the wrapper and embedded methods. Zhang et al. (2021) proposed a hybrid method that combined the Fisher score and a gradient boosting decision tree, and selected robust optimal feature gene sets across 11 high-dimensional gene expression datasets. Chuang et al. (2009), Deng et al. (2022), and Mandal et al. (2021) also adopted hybrid methods combining the filter and wrapper methods, and achieved good results on multiple open cancer datasets. This article presents a three-stage hybrid feature selection method, VEH, that combines the filter and wrapper methods. Through analysis and comparison of the experimental results, we confirmed that VEH has obvious advantages in feature gene selection performance, number of selected genes, and computation time.
The remainder of this article is organized as follows: first, we summarize the related work and algorithmic principles of the variance filter, the extremely randomized tree, and the Harris Hawks algorithm, and introduce the hybrid algorithm VEH in detail. In the Results, we compare the VEH method with 13 related feature selection algorithms on eight published cancer gene expression datasets. Finally, we summarize the experimental results.

Variance filter
Variance is an important measure of the degree of data dispersion. Hackstadt & Hess (2009) studied in detail the effect of a variance filter on microarray data analysis. In this article, we set the variance threshold to 0.05 and filtered out all feature genes whose variance was less than the threshold.
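As a minimal sketch of this first stage (assuming X is a samples-by-genes NumPy array; scikit-learn is one of the libraries used in the experiments, though this is not the authors' published code):

```python
# Stage 1 sketch: drop genes whose variance falls below the 0.05 threshold.
from sklearn.feature_selection import VarianceThreshold

def variance_filter(X, threshold=0.05):
    """Return the filtered matrix and the indices of the retained genes."""
    selector = VarianceThreshold(threshold=threshold)
    X_filtered = selector.fit_transform(X)
    kept_idx = selector.get_support(indices=True)
    return X_filtered, kept_idx
```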

Extremely randomized tree
The extremely randomized tree is an ensemble machine learning algorithm constructed from multiple decision trees (Liang et al., 2021). Extremely randomized trees have high computational efficiency and are suitable for processing high-dimensional data. They construct decision trees by randomly selecting attributes and split points at each node.
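The paper does not give the scoring code; a sketch of this second stage with scikit-learn's ExtraTreesClassifier (the ensemble size and top-k cutoff are illustrative assumptions) might look like:

```python
# Stage 2 sketch: score genes by extra-trees importance and keep the top k.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def ert_rank_genes(X, y, top_k=200, random_state=0):
    """Return indices of the top_k genes ranked by importance score."""
    ert = ExtraTreesClassifier(n_estimators=100, random_state=random_state)
    ert.fit(X, y)
    order = np.argsort(ert.feature_importances_)[::-1]  # most important first
    return order[:top_k]
```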

Harris Hawks algorithm
Heidari et al. (2019) proposed the Harris Hawks optimizer (HHO) based on the hunting behavior of Harris hawks in nature. HHO is a recent swarm intelligence optimization algorithm with strong search ability and high accuracy. In the algorithm, the prey (a rabbit) represents the solution with the best fitness in the current iteration. The whole algorithm is divided into two phases: exploration and exploitation. The exploration phase starts by initializing positions to detect habitat locations and then observing the prey. During the exploitation phase, the Harris hawks carry out four attack modes based on the energy of their prey and its possibility of escape. Figure 1 shows the HHO workflow.
During the exploration phase, the escape energy factor is $E$. When $|E| \ge 1$, the Harris hawks search the space $[lb, ub]$ randomly and use two strategies with equal probability to search for prey globally. The location update formula is shown in Eq. (1):

$$X(t+1)=\begin{cases}X_{rand}(t)-r_1\left|X_{rand}(t)-2r_2X(t)\right|, & q\ge 0.5\\\left(X_{rabbit}(t)-X_m(t)\right)-r_3\left(lb+r_4(ub-lb)\right), & q<0.5\end{cases}\tag{1}$$

where $X_{rand}$ is a randomly selected hawk position in the population; $X(t)$ is the individual's position in the current iteration; $X_{rabbit}$ is the prey position in the current iteration; $X_m$ is the average position of the population; $r_1,r_2,r_3,r_4$ are random numbers uniformly distributed in $(0,1)$; and $q$ is the conversion factor that selects between the two strategies. HHO switches between the exploration and exploitation phases through the escape energy factor, as shown in Eq. (2):

$$E=2E_0\left(1-\frac{t}{T}\right)\tag{2}$$

where $T$ is the total number of iterations and $E_0\in(-1,1)$ is the random initial energy value. When $|E|\ge 1$, a global search is performed; otherwise, the exploitation phase begins.

During the exploitation phase, when $|E|<1$, the hawks raid and catch the prey while the prey tries to avoid predation. Based on a random number $r\in(0,1)$ and on $E$, HHO selects the appropriate one of the following four attack strategies to complete the location update. $|E|$ decides the type of besiege: when $|E|\ge 0.5$, the hawks select a soft besiege strategy; otherwise, a hard besiege strategy. $r$ is the probability of the prey being captured: when $r\ge 0.5$, the hawks can capture the prey; otherwise, the hunt fails.

1. When $r\ge 0.5$ and $|E|\ge 0.5$, the prey has sufficient energy to jump and avoid predation, and the hawks exhaust the prey's energy and complete the predation with a soft besiege strategy, as shown in Eq. (3):

$$X(t+1)=\Delta X(t)-E\left|JX_{rabbit}(t)-X(t)\right|,\quad \Delta X(t)=X_{rabbit}(t)-X(t),\quad J=2(1-r_5)\tag{3}$$

where $r_5$ is a random number in $(0,1)$, $\Delta X(t)$ is the distance between the prey and the current individual, and $J$ is the jump strength of the prey.

2. When $r\ge 0.5$ and $|E|<0.5$, the prey's energy is insufficient; the hawks carry out a hard besiege strategy and strike quickly, as shown in Eq. (4):

$$X(t+1)=X_{rabbit}(t)-E\left|\Delta X(t)\right|\tag{4}$$

3. When $r<0.5$ and $|E|\ge 0.5$, the prey has enough energy to escape. The hawks then select the soft besiege with progressive rapid dives strategy, as shown in Eq. (5). This strategy has two hunting moves; when the first fails, the second is chosen:

$$Y=X_{rabbit}(t)-E\left|JX_{rabbit}(t)-X(t)\right|,\qquad Z=Y+S\times LF(D)$$
$$X(t+1)=\begin{cases}Y, & f(Y)<f(X(t))\\Z, & f(Z)<f(X(t))\end{cases}\tag{5}$$

where $S$ is a random vector, $D$ is the spatial dimension, $f$ is the fitness function, and $LF$ is the Lévy flight function, simulating the escape jumps of the prey.

4. When $r<0.5$ and $|E|<0.5$, the prey lacks energy but still has a chance to escape. The hawks then choose the hard besiege with progressive rapid dives strategy to close the distance to the prey and form an encirclement, as shown in Eq. (6):

$$Y=X_{rabbit}(t)-E\left|JX_{rabbit}(t)-X_m(t)\right|,\qquad Z=Y+S\times LF(D)$$
$$X(t+1)=\begin{cases}Y, & f(Y)<f(X(t))\\Z, & f(Z)<f(X(t))\end{cases}\tag{6}$$
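To make the phase switching concrete, the following minimal Python sketch implements the energy decay of Eq. (2) and the exploration update of Eq. (1); the function names and the NumPy random generator are illustrative choices, not part of the original method description:

```python
# Minimal sketch of HHO's energy decay (Eq. (2)) and exploration update
# (Eq. (1)); positions are real-valued vectors in [lb, ub].
import numpy as np

def escape_energy(E0, t, T):
    """Eq. (2): |E| shrinks linearly from 2|E0| toward 0 over T iterations."""
    return 2.0 * E0 * (1.0 - t / T)

def explore(X, X_rand, X_rabbit, X_m, lb, ub, rng):
    """Eq. (1): pick one of two global search strategies with equal probability."""
    r1, r2, r3, r4, q = rng.random(5)
    if q >= 0.5:
        # Strategy 1: move relative to a randomly chosen hawk.
        return X_rand - r1 * np.abs(X_rand - 2.0 * r2 * X)
    # Strategy 2: move relative to the prey and the population mean.
    return (X_rabbit - X_m) - r3 * (lb + r4 * (ub - lb))
```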

Coding rules
When the HHO algorithm searches for the optimal feature gene subset, all feasible solutions must be encoded as binary strings over the feature dimensions so that HHO can solve the discrete space optimization problem. We used 1 and 0 to represent the retention and elimination of a gene, respectively, set the value range of each feature dimension to [0, 1], and updated the value of each binary coding position by rounding.
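As a minimal illustration of this coding rule (a sketch, not the authors' code), a continuous hawk position can be mapped to a gene mask as follows:

```python
# Sketch of the binary coding rule: each dimension lies in [0, 1] and is
# rounded, so 1 retains the corresponding gene and 0 eliminates it.
import numpy as np

def to_binary_mask(position):
    """Round a continuous position in [0, 1]^D to a 0/1 gene-selection mask."""
    return np.rint(np.clip(position, 0.0, 1.0)).astype(int)

# For example, a position [0.8, 0.2, 0.7] yields the mask [1, 0, 1]:
# the first and third genes are retained, the second is eliminated.
```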

Fitness function
The fitness function evaluates the quality of individuals and determines the optimization direction of the algorithm. We used KNN classification accuracy in the fitness function of the classification problem, as shown in Eq. (7):

$$KNN_{acc}=\frac{num_c}{num_c+num_e},\qquad fitness=\alpha \times KNN_{acc}+(1-\alpha)\times\left(1-\frac{f_{num}}{F_{num}}\right)\tag{7}$$

where $KNN_{acc}$ is the classification accuracy of the KNN classifier, $num_c$ is the number of correctly classified samples, $num_e$ is the number of misclassified samples, $f_{num}$ and $F_{num}$ are the size of the feature subset and the total number of features, respectively, and $\alpha$ is an adjustment parameter (we set α = 0.99).
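A minimal sketch of this fitness computation is shown below; the tenfold cross-validation and k = 5 neighbors are illustrative assumptions, since the paper does not state how KNN_acc is estimated inside the fitness evaluation:

```python
# Sketch of the fitness in Eq. (7): trade KNN accuracy against subset size.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.99):
    """mask: 0/1 vector over genes; returns the Eq. (7) fitness value."""
    if mask.sum() == 0:
        return 0.0  # an empty gene subset cannot be evaluated
    knn_acc = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                              X[:, mask.astype(bool)], y, cv=10).mean()
    # alpha weights accuracy; the remaining term rewards small subsets.
    return alpha * knn_acc + (1.0 - alpha) * (1.0 - mask.sum() / mask.size)
```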

VEH
In this article, we propose a three-stage gene selection method, VEH, which combines a variance filter, an extremely randomized tree, and the Harris Hawks algorithm. First, we used the variance filter to select a preliminary subset of feature genes. Second, we used the extremely randomized tree to calculate an importance score for each gene and obtain an effective gene subset. Finally, we used the Harris Hawks algorithm to obtain the optimal feature gene subset. The pseudocode of VEH is shown in Algorithm 1, and Figure 2 shows the gene selection process of the VEH algorithm. On a large dataset, the running time of a wrapper method is usually several orders of magnitude higher than that of a filter or embedded method. In the hybrid method VEH, the running time is concentrated in the wrapper method HHO in the third stage, so the time complexity of the proposed method is O(N × (T + T · D + 1)).
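Composing the stages sketched in the previous sections, the overall pipeline could be wired together as in the following sketch; run_hho is a hypothetical wrapper around the HHO search loop of Algorithm 1 and is not defined in the paper:

```python
# End-to-end sketch of the three VEH stages, built from the helper
# functions sketched earlier (variance_filter, ert_rank_genes, fitness);
# run_hho is a hypothetical HHO driver returning the best binary mask.
def veh_select(X, y, top_k=200):
    X1, kept = variance_filter(X, threshold=0.05)   # stage 1: variance filter
    top = ert_rank_genes(X1, y, top_k=top_k)        # stage 2: extra-trees ranking
    X2, kept = X1[:, top], kept[top]
    best_mask = run_hho(X2, y, fitness)             # stage 3: HHO wrapper search
    return kept[best_mask.astype(bool)]             # indices of selected genes
```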

Data collection and experiment setting
In this article, eight microarray gene expression datasets were used to test the performance of each algorithm. The datasets came from public websites: http://csse.szu.edu.cn/staff/zhuzx/Datasets.Html (Qu et al., 2021) and https://github.com/Pengeace/MGRFE-GaRFE (Peng et al., 2021). Table 1 provides a detailed overview of the eight microarray datasets, including their sample size, number of genes, and number of classes. In these datasets, the number of genes ranged from 2,308 to 22,283, and the number of samples was less than 300. The datasets covered central nervous system tumors (CNS), leukemia (Leuk), diffuse large B-cell lymphoma (DLBCL), prostate cancer (Pros), gastric cancer (Gas2), acute lymphoblastic leukemia (ALL1), ovarian cancer (Ovarian), and small round blue cell tumors (SRBCT). Among these, only SRBCT was a four-class dataset; the others were binary. The class sample sizes were unbalanced except in Pros and Gas2. These datasets were therefore chosen to comprehensively test the performance of the different algorithms.

Algorithm 1: VEH Pseudocode
Inputs: initial data
Calculate the fitness values of the hawks
Set X_rabbit as the location of the rabbit (best location)
for each hawk (X_i) do
    Update the initial energy E_0 and jump strength J
    Update E using Eq. (2)
    if |E| ≥ 1 then  // Exploration phase
        Update the location vector using Eq. (1)
    if |E| < 1 then  // Exploitation phase
        if r ≥ 0.5 and |E| ≥ 0.5 then  // soft besiege
            Update the location vector using Eq. (3)
        else if r ≥ 0.5 and |E| < 0.5 then  // hard besiege
            Update the location vector using Eq. (4)
        else if r < 0.5 and |E| ≥ 0.5 then  // soft besiege with progressive rapid dives
            Update the location vector using Eq. (5)
        else if r < 0.5 and |E| < 0.5 then  // hard besiege with progressive rapid dives
            Update the location vector using Eq. (6)
Return X_rabbit

During data preprocessing, to keep the operation simple and avoid changing the mean of the variables, we used mean substitution to fill in missing values in the datasets and min-max normalization to eliminate the impact of data dimensions. All the experimental results in this article were generated on a computer equipped with a Core i7-8750 CPU running at 2.20 GHz and 16 GB of memory. All algorithms were implemented in Python 3.7.0 with two public machine learning libraries, scikit-learn and scikit-feature. We used three different classifiers, decision tree (DT), support vector machine (SVM), and logistic regression (LR), to evaluate the performance of each algorithm. The classification performance of each standard classifier was recorded after tenfold cross-validation, which randomly divides the dataset into 10 parts, nine used for training and the remaining one for testing. We compared VEH with the T-test, the Wilcoxon test, VF-ERT, variance filter-Harris Hawks (VF-HHO), extremely randomized tree-Harris Hawks (ERT-HHO), variance filter-genetic algorithm (VF-GA), extremely randomized tree-genetic algorithm (ERT-GA), variance filter-particle swarm optimization (VF-PSO), extremely randomized tree-particle swarm optimization (ERT-PSO), variance filter-crow search algorithm (VF-CSA), extremely randomized tree-crow search algorithm (ERT-CSA), variance filter-differential evolution (VF-DE), and extremely randomized tree-differential evolution (ERT-DE). Table 2 lists the specific parameter values of each algorithm and classifier.
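A minimal sketch of the preprocessing described above, using standard scikit-learn components (the exact calls are our assumption, not the authors' published code):

```python
# Sketch of the preprocessing: mean substitution for missing values,
# then min-max normalization to remove the effect of differing scales.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def preprocess(X):
    X = SimpleImputer(strategy="mean").fit_transform(X)  # fill gaps with column means
    return MinMaxScaler().fit_transform(X)               # rescale each gene to [0, 1]
```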
All experiments were run independently 10 times, and seven evaluation criteria were used to reflect the performance of each algorithm: the number of selected genes, classification accuracy (Acc), precision, recall, F1-score (f1), standard deviation (SD), and algorithm running time. Let P and N be the numbers of positive and negative samples, respectively, and define: true positive (TP), the sample is positive and the model prediction is positive; true negative (TN), the sample is negative and the model prediction is negative; false positive (FP), the sample is negative but the model prediction is positive; and false negative (FN), the sample is positive but the model prediction is negative. The four most important evaluation criteria were then calculated as follows:

$$Acc=\frac{TP+TN}{P+N},\quad precision=\frac{TP}{TP+FP},\quad recall=\frac{TP}{TP+FN},\quad f1=\frac{2\times precision\times recall}{precision+recall}$$

Because precision, recall, and f1 are defined for a single class, we assigned the same weight to each class and calculated their averages.
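These per-class criteria averaged with equal class weights (macro averaging) can be computed directly with scikit-learn, as in the following sketch (a convenience wrapper we assume for illustration):

```python
# Sketch of the evaluation criteria with equal per-class weights
# (macro averaging), computed for one cross-validation fold.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return acc, prec, rec, f1
```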

Comparison with other algorithms
In this section, we comprehensively compared the VEH method with the T-test, VF-ERT, the Wilcoxon test, VF-HHO, ERT-HHO, VF-GA, ERT-GA, VF-PSO, ERT-PSO, VF-CSA, ERT-CSA, VF-DE, and ERT-DE. We performed tenfold cross-validation 10 times on the gene subsets selected by each algorithm to obtain average results; the best performance values in each dataset are highlighted in bold. Tables 3-5 show the values of the four evaluation criteria for each algorithm on the three classifiers. Table 3 shows that, with the DT classifier, the VEH method had obvious advantages over the other methods, winning on Acc, precision, recall, and f1 7, 6, 7, and 7 times, respectively. The average Acc of VEH reached 92.42%, with 100% classification performance on the Gas2 and ALL1 datasets. As shown in Table 4, with the SVM classifier, VEH won on the four evaluation criteria 6, 5, 6, and 5 times, respectively, achieved 100% classification performance on the Gas2, ALL1, and Ovarian datasets, and reached its best average Acc of 95.33%. As shown in Table 5, with the LR classifier, VEH won on the four evaluation criteria 7, 4, 5, and 5 times, respectively; all four criteria reached 100% on the Gas2, ALL1, and Ovarian datasets, and the average Acc of 95.05% was higher than that of the other 13 methods. In summary, compared with the other 13 methods, VEH showed advantages on the four evaluation criteria across the three classifiers, especially with the DT classifier, and achieved the highest average Acc with the SVM classifier, which demonstrates that the hybrid method proposed in this article delivers an effective, improved performance.

Table 6 lists the number of genes selected by each algorithm on the eight datasets. The results show that the VEH method selected the smallest average number of genes. The VF-HHO, ERT-HHO, VF-GA, and ERT-GA methods selected fewer genes than VEH on five datasets, but as shown in Tables 3-5, VEH had significant advantages on many evaluation criteria. In addition, the number of genes selected by VEH was only 1/20 to 1/80 of that of VF-ERT, 1/3 to 1/400 of the T-test and Wilcoxon test, 1/50 to 1/100 of VF-CSA, and 1/20 to 1/50 of ERT-DE, yet VEH performed better. These results show that the hybrid VEH method can better evaluate the correlations between genes and thereby improve performance.

We also compared VEH with other algorithms from recent years. Table 7 lists the comparison results, with ''/'' indicating missing data. According to Table 7, compared with existing methods, VEH is also competitive in Acc. Table 8 lists the average running time of each algorithm on each dataset. The Wilcoxon test had the shortest running time and VF-PSO the longest. Combining these results with Tables 3-7, we found that although VF-ERT, the T-test, and the Wilcoxon test ran quickly, this came at the expense of Acc and the number of selected genes. The VF-HHO, ERT-HHO, VF-GA, ERT-GA, VF-PSO, ERT-PSO, VF-CSA, ERT-CSA, VF-DE, and ERT-DE methods had long running times, but their other performance was significantly lower than that of VEH.
This also shows that, by combining multiple methods, VEH can effectively improve performance while shortening the overall running time. Of all the evaluation criteria, Acc was the most important, so we tested the performance of the VEH method when α took different values. As shown in Table 9, the algorithm performed best when α = 0.99, so we set α = 0.99.

Gene analysis
Because the VEH method has a degree of randomness, different runs may produce gene subsets with the same performance during feature selection. To address this, we followed these principles: (1) select the subset with the highest Acc; (2) when the Acc is the same, select the smaller subset; and (3) when the Acc and subset size are the same, select the most frequently occurring subset. Table 10 lists the size of the optimal gene subset, the probe/UniProt IDs, and the Acc on the different classifiers for each dataset. We biologically characterized the best gene subsets selected in five datasets; the corresponding results are listed in Tables 11-15. Liddelow & Hoyer (2016) found that NCAM1 represents a potential drug target for many inflammatory diseases of the CNS. Clark & Stein (2020) found that CD33 can be targeted in leukemia. Tanhaei et al. (2014) found that GAPDH can be used as a valuable indicator to distinguish DLBCL. Li, Ge & Franceschi (2021) found that RUNX2 plays a key role in the development of the prostate. Hu et al. (2016) found that the EPOR pathway can promote the formation, proliferation, and migration of gastric cancer.

DISCUSSION
The purpose of VEH is to select effective feature genes from high-dimensional gene expression data. Unlike other similar methods (Bir-Jmel, Douiri & Elbernoussi, 2019; Ge et al., 2016), VEH is a three-stage hybrid method that combines three different techniques. The results in Tables 11-15 show that our method can select important tumor-related genes across multiple datasets (Endo et al., 2018; Forgione et al., 2020), and the results of other studies verify, from a medical perspective, the effectiveness and practicability of the genes selected by VEH.
