Two-stage feature selection for classification of gene expression data based on an improved Salp Swarm Algorithm

Microarray technology has developed rapidly in recent years, producing large amounts of ultra-high-dimensional gene expression data. However, because gene expression data contain very few samples relative to the number of genes, screening important genes from them is very challenging. For small-sample, high-dimensional biomedical data, this paper proposes a two-stage feature selection framework that combines wrapper, embedded and filter methods to avoid the curse of dimensionality. The proposed framework uses weighted gene co-expression network analysis (WGCNA), random forest (RF) and minimal redundancy maximal relevance (mRMR) for first-stage feature selection. In the second stage, a new gene selection method based on an improved binary Salp Swarm Algorithm is proposed, which works with machine learning methods to adaptively select feature subsets suited to each classification algorithm. Finally, classification accuracy is evaluated using six methods: LightGBM, RF, SVM, XGBoost, MLP and KNN. To verify the performance of the framework and the effectiveness of the proposed algorithm, the number of selected genes and the classification accuracy were compared with those of five other intelligent optimization algorithms. The results show that the proposed framework achieves accuracy equal to or higher than that of other advanced intelligent algorithms on 10 datasets, with accuracy above 97.6% on all 10. This indicates that the proposed method can solve feature selection problems for high-dimensional data; the framework is not tied to a particular dataset and can be applied to other fields involving feature selection.


Introduction
Because the incidence and mortality rates of cancer are very high, it is a matter of concern all over the world, and accurate cancer diagnosis remains a very difficult task [1]. Cancer is also a common research object in bioinformatics. Cancers exhibit many features carrying different information, and these features can be used to distinguish the tissue or organ of origin [2]. The rapid development of microarray technology in recent years has produced a large amount of ultra-high-dimensional gene expression data, so the use of gene expression data for cancer diagnosis has great advantages. However, because gene expression data combine a small sample size with a very high gene dimension, screening key genes from them is a very challenging task. Moreover, due to the curse of dimensionality [3], irrelevant and redundant genes increase the difficulty of model training and adversely affect the accuracy of the model.
The premise of cancer treatment is an accurate diagnosis. With the extensive development of machine learning and artificial intelligence, machine learning classification methods have gained a place in the field of cancer diagnosis, and in recent years more and more researchers [4] have used machine learning algorithms for this purpose. For classification problems, feature reduction also plays a very important role: it is very effective in preventing overfitting, reducing computational complexity, and improving model interpretability [5].
Feature selection methods, which aim to solve high-dimensional problems, can be roughly divided into three categories: filter, wrapper and embedded. The wrapper method [6] uses a feature selection algorithm to construct candidate feature subsets, trains the classification algorithm on each subset, and takes the performance of the classifier as the fitness value of the subset. The filter method [7] first calculates the correlation between each feature in the dataset and the target variable, and then filters the features by comparing the magnitude of the correlation. The embedded method [8] introduces a regularization term into the loss function of the classification method to constrain the model, and selects features according to the performance of the classification method.
Researchers have proposed many feature selection methods based on gene expression data to achieve robust feature selection and accurate cancer diagnosis [9]. L. Sun et al. [10] proposed a neighborhood rough set-based feature selection method for cancer classification of gene expression data using an uncertainty measure based on neighborhood entropy. A. Kumar et al. [11] constructed an integrated active learning approach that simplifies gene expression data using a fuzzy rough set approach; this method can improve classification accuracy when the training dataset contains limited samples. J. Lee et al. [12] proposed a new multivariate feature ranking method to improve the quality of gene selection and ultimately the accuracy of microarray data classification; they embedded the formal definition of relevance into the Markov blanket (MB) to create the new ranking method. X. Zheng and C. Zhang [13] proposed a model based on latent representation learning, which treats each gene as a feature and performs feature selection by computing the intra-association between samples of gene expression data and the relationship between features in the latent representation space, rather than by comparing the importance of features in the dataset. L. Li et al. [14] proposed a stable machine learning recursive feature elimination (StabML-RFE) strategy. They employed eight different machine learning methods and sequentially removed the least important features with recursive feature elimination (RFE). The features are then ranked, the top-ranked features are selected to form the best feature subset, and a stability measure is established to evaluate the robustness of the different feature selection techniques. The selected biomarkers are also verified by different methods.
In recent years, swarm intelligence optimization algorithms have been shown to be very powerful in feature selection because of their simplicity and global search capability. A. K. Shukla et al. [2] proposed a novel hybrid wrapper algorithm, TLBO-GSA, incorporating features of teaching-learning-based optimization (TLBO) and the Gravitational Search Algorithm (GSA). This method first selects relevant genes from the gene expression dataset using mRMR and then selects informative genes from the reduced data generated by mRMR using the proposed method. H. Wang et al. [15] proposed a new multidimensional population-based bacterial colony optimization method, referred to as BCO-MDP, for feature selection for classification. C. Shen and K. Zhang [16] constructed a two-stage feature selection framework based on the gray wolf optimization algorithm. In the first stage, an integer optimization problem was constructed by training the parameters of a multilayer perceptron with a group lasso regularization term using a modified gray wolf optimization algorithm, for the initial screening of features and determination of the number of hidden layers; in the second stage, the multilayer perceptron was run again using the results of the first stage to construct a discrete optimization problem for feature selection. C. Qu et al. [17] implemented a Harris Hawk optimization algorithm based on variable neighborhood learning for feature selection of gene expression data. It is also a two-stage framework: the first stage uses the F-score to compress the feature space, and the second stage performs feature selection using the Harris Hawk optimization algorithm based on variable neighborhood learning. A. Dabba et al. [18] combined an improved Moth-flame optimization algorithm with mutual information maximization to achieve feature selection of gene expression data. L. Sun et al. [19] constructed a feature selection algorithm combining an ant colony optimization algorithm with ReliefF to achieve feature selection for tumor classification problems. Uzma et al. [20] constructed a two-stage gene selection method, aggregating three filter methods in the first stage and then using a genetic algorithm in the second stage in combination with an unsupervised autoencoder-based method to select genes for the subsequent classification task.
From the above review, we can see that some intelligent algorithms have been used to build cancer diagnosis frameworks, but these frameworks have defects worth improving, such as a tendency to fall into local optima. Because minimizing the number of selected genes is not considered, and the maximum fitness evaluation value and the parameters have to be adjusted, the classification results are not ideal. Moreover, when genes are selected with a traditional fitness function, the performance of the classifier cannot be maximized with a small subset of features. In this study, a cancer classification framework based on as few genes as possible was established in order to accurately classify cancer gene expression data. The proposed framework uses a two-stage feature selection method for optimal gene selection, followed by machine learning classification. The accuracy of the classifier and the number of selected features are combined in the fitness evaluation, so that cancer can be identified accurately from a small gene subset.
The overall goal of this paper is to propose a feature selection framework for high-dimensional data that achieves high classification accuracy with small feature subsets. Specifically, this paper uses two-stage feature selection to classify gene expression data. Since the dimension of the data is very large, the purpose of the first stage is to remove irrelevant and redundant features while retaining as many relevant features as possible. Different feature selection algorithms have different advantages and disadvantages. mRMR is a filter feature selection algorithm. Filter methods measure the importance of features through relevant statistics, but because the selection process is independent of the learner, the selected feature subset may not yield good classification accuracy. RF feature selection is an embedded feature selection algorithm. In embedded feature selection, the selection algorithm is itself a component of the learning algorithm: a machine learning model is trained to obtain a weight coefficient for each feature (between 0 and 1). These weight coefficients represent the contribution or importance of the features to the model, but the results are strongly affected by parameter tuning. WGCNA is a systems biology method that describes the correlation patterns between genes and can be used to find highly correlated gene sets. Therefore, in the first stage, we combine these three methods to select an information-rich gene subset. The purpose of the second stage is to achieve higher classification accuracy with as few features as possible, so this paper adopts the improved binary Salp Swarm Algorithm (SSA) [21] for feature selection in the second stage. Among wrapper-based algorithms, SSA has superior global search capability and faster convergence, and its main advantages over existing optimization algorithms such as Particle Swarm Optimization (PSO) [22], Grey Wolf Optimizer (GWO) [23], Whale Optimization Algorithm (WOA) [24], and Sine Cosine Algorithm (SCA) [25] are lower computational effort and fewer control parameters.
The rest of this paper is organized as follows. The second section is the method, which introduces the feature selection algorithm and classification method used in this paper; the third section is the proposed method, which introduces the proposed improved SSA and binary feature selection; the fourth section is the empirical study, which introduces the details and parameter settings of the empirical part of this paper; the fifth section is the results and discussion, which analyzes the results of the experiments in this paper and compares them with advanced methods. The sixth section is the conclusion, a summary of the work of this paper, and future research directions.

Weighted gene co-expression network
Weighted gene co-expression network analysis (WGCNA) [26,27] is a systems biology approach used to describe the association patterns of different genes. It aims to find co-expressed gene modules and to explore the association between gene networks and phenotypes of interest, as well as the core genes in the network. WGCNA uses the information of the thousands (or nearly 10,000) most variable genes, or of all genes, to identify gene sets of interest and to perform significance association analysis with phenotypes. This has two benefits: first, it makes full use of the information; second, it converts the association between thousands of genes and phenotypes into associations between a few gene sets and phenotypes, which avoids the problem of multiple hypothesis testing and correction.
From the methodological point of view, WGCNA is divided into two parts: expression clustering analysis and phenotype association, which mainly include several steps of correlation coefficient calculation between genes, co-expression network construction, gene module identification, and module-trait association.
Step 1: Use Pearson's correlation coefficient to calculate the correlation between any two genes and construct the co-expression similarity matrix $S = [s_{ij}]$ with
$$s_{ij} = |\mathrm{cor}(x_i, x_j)|,$$
where $x_i$ and $x_j$ are the $i$-th and $j$-th genes.
Step 2: Construct the adjacency matrix $A = [a_{ij}]$ with
$$a_{ij} = s_{ij}^{\beta},$$
construct the scale-free network, and determine the power index $\beta$. Based on the adjacency matrix, construct the topological overlap matrix (TOM matrix); the TOM matrix uses $\omega_{ij}$ to represent the connectivity of the connected nodes:
$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}},$$
where $l_{ij} = \sum_{u=1}^{n} a_{iu} a_{uj}$ and $k_i = \sum_{u=1}^{n} a_{iu}$, and $n$ is the total number of genes analyzed for co-expression.
Step 3: The topological overlap is transformed into a dissimilarity matrix using $d_{ij} = 1 - \omega_{ij}$. Hierarchical clustering trees are constructed, gene modules are generated using dynamic tree cutting, and genes with similar expression patterns are clustered within the same branch to determine gene modules.
Step 4: Modules are associated with external phenotypic information of interest to find modules with high phenotypic correlation, and the genes within the modules are the selected genes.
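Steps 1 to 3 can be illustrated with a minimal pure-Python sketch. It assumes the standard unsigned WGCNA definitions (similarity $|\mathrm{cor}|$, power adjacency, topological overlap) and a soft-threshold power `beta=6`, which is only an illustrative default; in practice $\beta$ is chosen by the scale-free criterion. The function names `pearson` and `tom_dissimilarity` are hypothetical, and clustering and module-trait association (Steps 3 and 4) are not covered.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def tom_dissimilarity(genes, beta=6):
    """genes: one expression profile (list of floats) per gene.
    Returns the matrix d_ij = 1 - TOM_ij for an unsigned network."""
    n = len(genes)
    # Similarity s_ij = |cor(x_i, x_j)|, adjacency a_ij = s_ij ** beta, a_ii = 0
    a = [[0.0 if i == j else abs(pearson(genes[i], genes[j])) ** beta
          for j in range(n)] for i in range(n)]
    k = [sum(row) for row in a]  # connectivity k_i = sum_u a_iu
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a gene has zero dissimilarity to itself
            l_ij = sum(a[i][u] * a[u][j] for u in range(n))
            tom = (l_ij + a[i][j]) / (min(k[i], k[j]) + 1 - a[i][j])
            d[i][j] = 1 - tom
    return d

# Toy data: gene 0 and gene 1 are almost perfectly correlated, gene 2 is not.
genes = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 1, 3, 2]]
d = tom_dissimilarity(genes)
print(d[0][1] < d[0][2])
```

The resulting dissimilarity matrix is what a hierarchical clustering routine would take as input to form modules.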

Random Forest
Random Forest is an algorithm proposed by Leo Breiman (2001) where the model uses a collection of decision trees to perform various tasks (training, classification, and prediction of samples). The random forest can also filter features by evaluating the importance of each feature in the model, which is an embedded feature selection algorithm.
1) $m$ variables are randomly selected from the $M$ variables of the collected data set (where $m$ is less than or equal to $M$), and a decision tree is built based on these variables. 2) The above process is repeated $k$ times to construct $k$ different decision trees. 3) Each decision tree then predicts the outcome using its random variables, and all predicted outcomes are recorded, resulting in $k$ outcomes from $k$ decision trees. 4) The number of votes obtained for each prediction result is counted, and the prediction result with the highest number of votes is taken as the final prediction of the random forest algorithm.
Random forest feature selection means that the feature variables in the random forest are sorted in descending order of VI (Variable Importance), and the desired number of top-ranked features is then selected as needed.
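As a small illustration of the voting step and of importance-based selection, the following sketch assumes that per-tree predictions and per-feature importance scores have already been produced by some random forest implementation. The helper names and the toy scores are hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Final forest prediction: the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def select_top_features(importance, k):
    """Rank features by variable importance (descending) and keep the top k."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]

# Hypothetical outputs of a trained forest
votes = ["tumor", "normal", "tumor", "tumor", "normal"]
vi = {"g1": 0.02, "g2": 0.31, "g3": 0.12, "g4": 0.07}

print(majority_vote(votes))
print(select_top_features(vi, 2))  # ['g2', 'g3']
```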

Max-Relevance and Min-Redundancy
Max-Relevance and Min-Redundancy (mRMR) is a filter feature selection method proposed by H. Peng et al. [28], which can use mutual information, correlation or distance similarity scores to select features. The principle is simple: find the set of features in the original feature set that are most correlated with the final output target variable but least correlated with each other.
Max-Relevance:
$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)$$
Min-Redundancy:
$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)$$
The mRMR score is:
$$\max \Phi(D, R), \quad \Phi = D - R$$
where $I(x_i; c)$ and $I(x_i; x_j)$ are the mutual information between features and categories and between features and features, respectively, and $S$ is the subset of features.
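A minimal sketch of the difference form of mRMR (relevance minus redundancy) is given below. It assumes discretized feature values and the usual greedy incremental search, in which each new feature maximizes its relevance to the class minus its mean mutual information with the features already chosen. The function names and toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr(features, target, k):
    """Greedy mRMR. features maps name -> discrete values; target is the class list."""
    relevance = {f: mutual_information(v, target) for f, v in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        def score(f):
            if not selected:
                return relevance[f]
            # Mean mutual information with the features chosen so far
            redundancy = sum(mutual_information(features[f], features[s])
                             for s in selected) / len(selected)
            return relevance[f] - redundancy
        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected

target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g_a": [0, 0, 0, 1, 1, 1, 1, 1],   # informative
    "g_b": [0, 0, 0, 1, 1, 1, 1, 1],   # exact duplicate of g_a (redundant)
    "g_c": [0, 1, 0, 0, 1, 1, 0, 1],   # weaker, but adds new information
}
print(mrmr(features, target, 2))
```

In this toy example the duplicate `g_b` is penalized by its redundancy with `g_a`, so the weaker but complementary `g_c` is preferred as the second feature.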

Salp Swarm Algorithm
The salp population is divided into two groups: leaders and followers. The leader is the salp at the front of the salp chain, while the rest of the salps are considered followers. As these names imply, the leader leads the population and the followers follow each other (and thus the leader, directly or indirectly).
The leader's position update formula is:
$$x_j^1 = F_j \pm c_1\left((ub_j - lb_j)c_2 + lb_j\right),$$
where $x_j^1$ is the leader position in the $j$-th dimension (the first salp is chosen as the leader), $F_j$ is the position of the food source in the $j$-th dimension, $ub_j$ is the upper bound of the $j$-th dimension, $lb_j$ is the lower bound of the $j$-th dimension, and $c_1$, $c_2$, $c_3$ are random numbers; the sign is chosen according to $c_3$. $c_1$ is the most important parameter in SSA because it balances exploration and exploitation, and is defined as follows:
$$c_1 = 2e^{-\left(\frac{4t}{T}\right)^2}, \tag{9}$$
where $t$ is the current iteration and $T$ is the maximum number of iterations. The formula for updating the position of followers is:
$$x_j^i = \frac{1}{2}\left(x_j^i + x_j^{i-1}\right),$$
where $i \ge 2$ and $x_j^i$ denotes the position of the $i$-th follower in the $j$-th dimension.
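One iteration of the original SSA can be sketched as follows, assuming the standard formulation: the coefficient $c_1 = 2e^{-(4t/T)^2}$, a leader that moves around the food source, and followers that average their position with the salp ahead of them. Two details are assumptions of this sketch rather than statements from the text: the sign of the leader step switches at $c_3 = 0.5$ (as in common implementations), and positions are clipped to the search bounds. The names `c1_coefficient` and `ssa_step` are hypothetical.

```python
import math
import random

def c1_coefficient(t, t_max):
    """c1 = 2 * exp(-(4t / T)^2); decays from 2 towards 0 as t approaches T."""
    return 2 * math.exp(-((4 * t / t_max) ** 2))

def ssa_step(positions, food, lb, ub, t, t_max, rng=random):
    """One SSA iteration, updating the list of salp positions in place."""
    c1 = c1_coefficient(t, t_max)
    dim = len(food)
    # Leader (first salp) moves around the food source
    for j in range(dim):
        c2, c3 = rng.random(), rng.random()
        step = c1 * ((ub[j] - lb[j]) * c2 + lb[j])
        # Assumption: the sign switches at c3 = 0.5
        positions[0][j] = food[j] + step if c3 >= 0.5 else food[j] - step
    # Followers move to the midpoint between themselves and the salp ahead
    for i in range(1, len(positions)):
        for j in range(dim):
            positions[i][j] = 0.5 * (positions[i][j] + positions[i - 1][j])
    # Assumption: clip every coordinate to the search bounds
    for p in positions:
        for j in range(dim):
            p[j] = min(max(p[j], lb[j]), ub[j])
    return positions

salps = [[0.0, 0.0], [4.0, -4.0], [2.0, 2.0]]
ssa_step(salps, food=[1.0, 1.0], lb=[-5.0, -5.0], ub=[5.0, 5.0],
         t=1, t_max=10, rng=random.Random(1))
print(salps)
```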

Classification methods
LightGBM (Light Gradient Boosting Machine) is a framework for implementing the GBDT algorithm, which uses weak classifiers to iteratively train to obtain the optimal model, supports efficient parallel training, has faster training speed, better accuracy, and is less prone to overfitting. LightGBM is efficient and fast in processing large-scale data sets.
SVM (Support Vector Machine) is a generalized linear classifier that minimizes the empirical error and maximizes the geometric margin at the same time. SVM maps the input vectors into a higher-dimensional space and constructs a maximum-margin separating hyperplane there, maximizing the distance between the two parallel hyperplanes that bound the classes. SVM has many unique advantages in dealing with small-sample, nonlinear and high-dimensional pattern recognition problems.
XGBoost (eXtreme Gradient Boosting), also called extreme gradient boosting tree, is an implementation of the boosting algorithm that focuses on reducing bias, i.e., reducing the error of the model. Therefore, it uses multiple base learners, each of which is relatively simple, to avoid overfitting. XGBoost is suitable for structured data, and it is fast and effective in processing large-scale data sets.
MLP (Multi-Layer Perceptron) is the basic algorithm of Deep Neural Networks (DNN). Besides the input and output layers, it can have multiple hidden layers in between; the simplest MLP contains only one hidden layer, i.e., a three-layer structure. MLP has good fault tolerance and strong self-adaptive and self-learning capabilities.
KNN (K-Nearest Neighbors) uses the training data to partition the feature vector space and uses the result of the partition as the final model. An input vector of any dimension corresponds to a point in the feature space, and the output is the category label or predicted value corresponding to that feature vector. The predicted label depends only on the labels of the few training samples closest to the unknown sample. The KNN algorithm is suitable for classification of data sets with unbalanced samples.

Improved Salp Swarm Algorithm
Chaotic mapping generates chaotic numbers between 0 and 1 and can replace the random number generation of the original algorithm. Chaotic sequences can often achieve better results than randomly generated numbers in operations such as population initialization, selection, crossover, and mutation. In this paper, the PWLCM (piecewise linear chaotic map) is used to initialize the positions of the salp population. The formula is:
$$x_{k+1} = F_p(x_k) = \begin{cases} x_k / p, & 0 \le x_k < p, \\ (x_k - p) / (0.5 - p), & p \le x_k < 0.5, \\ F_p(1 - x_k), & 0.5 \le x_k < 1, \end{cases}$$
where $p \in (0, 0.5)$.
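A chaotic initialization of this kind can be sketched as below, assuming the standard piecewise linear chaotic map with control parameter $p \in (0, 0.5)$ and a linear scaling of each chaotic value into the search bounds. The starting value `x0`, the default `p`, the scaling step and the function names are illustrative assumptions.

```python
def pwlcm(x, p=0.4):
    """One step of the piecewise linear chaotic map on [0, 1), with 0 < p < 0.5."""
    if x >= 0.5:
        x = 1.0 - x  # the map is symmetric about 0.5
    if x < p:
        return x / p
    return (x - p) / (0.5 - p)

def init_population(n_salps, dim, lb, ub, x0=0.7, p=0.4):
    """Fill an n_salps x dim position matrix from one chaotic sequence."""
    x = x0
    population = []
    for _ in range(n_salps):
        row = []
        for j in range(dim):
            x = pwlcm(x, p)
            # Assumption: scale the chaotic value linearly into [lb_j, ub_j]
            row.append(lb[j] + x * (ub[j] - lb[j]))
        population.append(row)
    return population

pop = init_population(n_salps=5, dim=3, lb=[-1.0] * 3, ub=[2.0] * 3)
for row in pop:
    print([round(v, 3) for v in row])
```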
The leader's position is updated according to Eq (13). In the original method, only the first salp is selected as the leader, but with a single leader the algorithm easily falls into a local optimum, so this paper selects the first third of the salps as leaders.
The follower's position update formula is derived from Newton's laws of motion and leads to Eq (18), which this paper uses to update the followers. The specific steps are shown in Algorithm 1.

Algorithm 1: Improved Salp Swarm Algorithm
Inputs: training set, test set, population size, the maximum number of iterations
Initialization:
01: Initialize the position matrix A of the salps with the PWLCM chaotic mapping according to Eqs (11) and (12)
Optimization process:
02: While iter < Tmax_iter
03:   Calculate the fitness function value of each salp according to Eq (19)
04:   Rank the fitness function values
05:   If the optimal fitness value < the fitness value of the food position
06:     Assign the position with the optimal fitness function value to the food position
07:   Generate c1 according to Eq (9); randomly generate c2, c3
08:   The first one-third of the salps update their positions as leaders according to Eq (13)
09:   The remaining salps update their positions as followers according to Eq (18)
10:   iter = iter + 1
Output: optimal salp position
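The control flow of Algorithm 1 can be sketched as a minimization loop. This is a sketch under loud assumptions, not the authors' implementation: the improved leader and follower formulas (Eqs (13) and (18)) are not reproduced here, so the original SSA leader rule (with the sign switching at $c_3 = 0.5$) and the original follower averaging are used as stand-ins; the search bounds are scalar; positions are clipped to the bounds; and a toy sphere function replaces the classifier-based fitness. The name `issa_minimize` is hypothetical.

```python
import math
import random

def issa_minimize(fitness, dim, lb, ub, n_salps=30, max_iter=50, p=0.4, seed=0):
    rng = random.Random(seed)
    # Line 01: PWLCM chaotic initialization, scaled into [lb, ub]
    x, pop = rng.random(), []
    for _ in range(n_salps):
        row = []
        for _ in range(dim):
            x = 1.0 - x if x >= 0.5 else x
            x = x / p if x < p else (x - p) / (0.5 - p)
            row.append(lb + x * (ub - lb))
        pop.append(row)
    food, food_fit, history = None, math.inf, []
    n_leaders = max(1, n_salps // 3)  # first third of the salps lead
    for it in range(max_iter):
        # Lines 03-06: evaluate, rank, and update the food source (minimization)
        pop.sort(key=fitness)
        best_fit = fitness(pop[0])
        if best_fit < food_fit:
            food, food_fit = list(pop[0]), best_fit
        history.append(food_fit)
        # Line 07: c1 from the current iteration; c2, c3 drawn at random
        c1 = 2 * math.exp(-((4 * (it + 1) / max_iter) ** 2))
        for i in range(n_salps):
            for j in range(dim):
                if i < n_leaders:
                    # Line 08 (stand-in): original SSA leader rule
                    c2, c3 = rng.random(), rng.random()
                    step = c1 * ((ub - lb) * c2 + lb)
                    pop[i][j] = food[j] + step if c3 >= 0.5 else food[j] - step
                else:
                    # Line 09 (stand-in): original SSA follower averaging
                    pop[i][j] = 0.5 * (pop[i][j] + pop[i - 1][j])
                pop[i][j] = min(max(pop[i][j], lb), ub)
    return food, food_fit, history

sphere = lambda v: sum(t * t for t in v)
best, best_fit, history = issa_minimize(sphere, dim=3, lb=-5.0, ub=5.0)
print(best_fit <= history[0])
```

Because the food source is replaced only when a strictly better salp is found, the recorded best fitness never increases from one iteration to the next.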

Binary feature selection
The basic principle of feature selection with the SSA is to use the improved binary SSA to find an optimal binary code in which each bit corresponds to a feature: if the $i$-th bit is "1", the corresponding feature is selected and is passed to the classifier; if it is "0", the corresponding feature is not selected and does not appear in the classifier. The basic steps are:
Step 1: Encoding. A binary encoding is used: at each position of the code, "0" means the feature is not selected and "1" means the feature is selected.
Step 2: Initial population generation. $N$ initial individuals are randomly generated to form the initial population; the population size is generally set to 50 to 100.
Step 3: Fitness function. The fitness function indicates how good or bad an individual (solution) is.
Step 4: Population update. The update strategy of the population is determined by the fitness function values, and the next iteration is performed to continue the search for the optimal fitness function value.
Step 5: Termination. If the set number of iterations is reached, the best subset of genes is returned as the result of feature selection, and the algorithm ends. Otherwise, go back to Step 4 for the next generation. The specific steps are shown in Figure 1.
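The encoding in Step 1 can be illustrated with a short sketch. The text does not state how a continuous salp position is converted to bits, so the thresholding at 0.5 used here is purely an assumption, and the helper names are hypothetical.

```python
def to_mask(position, threshold=0.5):
    """Binarize a continuous position: 1 = feature selected, 0 = not selected.
    Assumption: a simple threshold; other transfer functions are possible."""
    return [1 if v > threshold else 0 for v in position]

def apply_mask(sample, mask):
    """Keep only the feature values whose bit is 1."""
    return [v for v, keep in zip(sample, mask) if keep]

position = [0.1, 0.9, 0.5, 0.7]
mask = to_mask(position)
print(mask)                                    # [0, 1, 0, 1]
print(apply_mask([5.0, 6.0, 7.0, 8.0], mask))  # [6.0, 8.0]
```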

Function of fitness
For feature selection with intelligent algorithms, the construction of the fitness function is very important. It mainly considers two aspects: the first is the number of genes, i.e., the proportion of selected features to the total number of features (the fewer the selected features, the smaller the fitness value); the second is the classification accuracy. The fitness function, shown in Eq (19), is therefore based on the classification ability of the machine learning method and the number of features:
$$\mathrm{Fitness} = \alpha (1 - Acc) + (1 - \alpha)\frac{R}{N}, \tag{19}$$
where $\alpha$ is a constant between 0 and 1, $R$ is the number of features selected in each iteration, $N$ is the total number of features, and $Acc$ is the accuracy of the classification algorithm.
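A fitness function of this kind can be sketched as a weighted sum to be minimized. The exact form is an assumption of this sketch: the classification error is weighted by a constant `alpha` and the fraction of selected features by `1 - alpha`, which matches the description that fewer features give a smaller fitness value. The default `alpha=0.99` is the weight reported in the experimental setup.

```python
def fitness(accuracy, n_selected, n_total, alpha=0.99):
    """Smaller is better: weighted classification error plus feature ratio.
    Assumed form: alpha * (1 - accuracy) + (1 - alpha) * n_selected / n_total."""
    return alpha * (1.0 - accuracy) + (1.0 - alpha) * n_selected / n_total

# Same accuracy, fewer features -> smaller (better) fitness
print(fitness(0.95, 5, 100) < fitness(0.95, 50, 100))
# With alpha = 0.99, accuracy dominates the feature count
print(fitness(0.99, 50, 100) < fitness(0.90, 5, 100))
```

With a weight this close to 1, the feature-ratio term acts mainly as a tie-breaker between subsets of similar accuracy.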

Empirical research
This section presents a two-stage feature selection framework for more accurate cancer classification and compares it with feature selection algorithms based on intelligent optimization algorithms proposed in recent years. The overall framework of this paper is shown in Figure 2 and can be summarized in the following four steps:
Step 1: Preprocessing of gene expression datasets. The 10 public datasets are preprocessed by first removing outliers and duplicate values and then normalizing the datasets.
Step 2: First-stage feature selection. To effectively filter out highly redundant and irrelevant genes, this paper combines three feature selection algorithms to identify key genes: a filter method (mRMR), an embedded method (RF importance-based feature selection), and a method that identifies common expression patterns across samples (WGCNA). The top-ranked features of the three feature selection algorithms are merged, which eliminates redundant and unimportant genes.
Step 3: Second-stage feature selection. An improved SSA is used to further compress the feature subset. The gene subset obtained in Step 2 is first binary coded; then the accuracy of the classification algorithm and the number of selected features are combined as the fitness function, and the algorithm iterates to search for the optimal subset.
Step 4: Classification. Six base classifiers, namely LightGBM, RF, SVM, XGBoost, MLP, and KNN, are used to classify the samples with the feature subset selected in Step 3.

Datasets
To verify the effectiveness and general applicability of the proposed framework, 10 cancer gene expression datasets [29] are used in this paper, of which 5 are binary classification datasets and 5 are multi-class classification datasets. Table 1 shows some basic information on the datasets, including the number of genes, the number of samples and the number of categories. In the subsequent experiments, 80% of the samples in each dataset are used for training, and the remaining samples are used as the test set.

Data preprocessing
To remove the influence of differences in scale on the results and to ensure the validity of the analysis, the data are preprocessed. First, the mean and variance of each feature are calculated and genes whose values are all 0 are eliminated. Then outliers are detected through sample clustering: as shown in Figure 3, the red line is taken as the dividing line, samples above the red line are regarded as outliers, and these samples are removed. Finally, max-min normalization is applied to the data.
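Two of these preprocessing steps, removal of all-zero genes and max-min normalization, can be sketched in plain Python as follows. Outlier detection by sample clustering is not covered. Mapping a constant non-zero gene to 0.0 is an assumption made here to avoid division by zero, and the function name is hypothetical.

```python
def preprocess(data):
    """data: list of samples, each a list of gene expression values.
    Drops genes that are 0 in every sample, then rescales each gene to [0, 1]."""
    n_genes = len(data[0])
    columns = [[row[j] for row in data] for j in range(n_genes)]
    kept = [col for col in columns if any(v != 0 for v in col)]
    scaled = []
    for col in kept:
        lo, hi = min(col), max(col)
        span = hi - lo
        # Assumption: a constant gene (span == 0) is mapped to 0.0
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled)]

samples = [[0, 1.0, 5.0],
           [0, 3.0, 5.0],
           [0, 2.0, 5.0]]
print(preprocess(samples))  # [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]]
```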

Experimental setup
Considering efficiency and computational complexity, the population size of the proposed ISSA method is set to 100, the number of iterations to 200, and the weight in the fitness function to 0.99. To avoid unfair comparisons, the population size and number of iterations of the other algorithms are kept consistent with those of ISSA. The parameter settings used are shown in Table 2.
Table 2 (recoverable entries only): inertia weight in PSO: 0.6; constant a in SCA: 2.

Comparison algorithm
The proposed algorithm is tested on 10 gene expression datasets to evaluate its effectiveness. For feature selection, it is compared with PSO, GWO, SCA, WOA and SSA, using six classification algorithms. To further evaluate the performance of the proposed method, it is also compared with advanced methods from the literature, as shown in Table 3.

Evaluation indicators
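In this paper, accuracy, precision, recall, and F1-score are used as the evaluation metrics of the classifier. The formulas are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP$, $TN$, $FP$, $FN$ are true positive, true negative, false positive, and false negative, respectively.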
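For the binary case, these four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. Returning 0.0 when a denominator is zero is a convention assumed here, the function name is hypothetical, and the counts in the example are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=8, tn=9, fp=1, fn=2)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```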

The first stage feature selection
WGCNA obtains the dissimilarity matrix from the TOM matrix and then obtains the final gene modules by hierarchical clustering with the dynamic cut method. Figure 4 shows the correlation between the modules and the categories of the gene expression datasets obtained with WGCNA. Only the Breast and CNS datasets are shown in this paper; the results for the remaining datasets are in the appendix. For each dataset, the modules with a high correlation with the category and a small p-value are selected, and gene enrichment analysis is performed on the genes inside these modules to extract the most important key genes for further feature selection. In each module, the upper value is the correlation and the value in parentheses is the p-value; red modules indicate positive correlations and blue modules indicate negative correlations. In general, genes within modules of the same color are highly similar, while genes in the gray module are those that cannot be assigned to any module. Genes with greater relevance are selected by performing gene enrichment analysis on the genes within each module. For the Breast dataset, 6 modules were selected and the remaining modules were discarded because of their weak relevance to the category; a total of 76 genes were selected from the 6 modules. The CNS dataset was processed in the same way.
In feature selection based on the random forest method, the importance ranking of the features is subject to a certain randomness. Therefore, in this paper, the random forest is first run 10 times and the Gini index of the model with the best accuracy is used as the criterion for importance ranking. Ten-fold cross-validation is then repeated five times and the error rate curve is drawn; the number of features at the position with a lower error rate is taken as the number of features to be extracted, and that number of features is then extracted according to the feature importance ranking. As shown in Figure 5, only the CNS and Leukemia_3c datasets are shown (the rest are in the appendix), and the red line marks the number of features at which the lowest point first occurs.
In mRMR score-based feature selection, the relevance between each feature and the category is calculated first, then the redundancy among features; the mRMR score of each feature is obtained as the relevance minus the redundancy, the scores are ranked, and the required number of features is selected according to the ranking. In this paper, 50 features are selected when the number of features is less than 5000, 80 features when it is less than 10,000, and 100 features when it is more than 10,000. Table 4 summarizes the number of features selected by each of the three feature selection methods and the final number of features after taking the union of the three methods, which is the initial number of features for the next stage of feature selection. Tables 5 and 6 show the classification results obtained with all features and with each feature selection method separately. Two classification methods, LightGBM and XGBoost, are reported here; the results of the other four methods are in the appendix.
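The mRMR size rule and the merging of the three first-stage gene lists can be written compactly, as in the sketch below. The text does not say what happens at exactly 5000 or 10,000 features; this sketch places those boundary cases in the larger bucket, which is an assumption. The helper names and gene identifiers are hypothetical.

```python
def mrmr_feature_count(n_features):
    """Number of genes kept by mRMR as a function of the dataset dimension."""
    if n_features < 5000:
        return 50
    if n_features < 10000:
        return 80
    return 100  # assumption: exactly 10,000 features falls in this bucket

def merge_selected(wgcna_genes, rf_genes, mrmr_genes):
    """Union of the three first-stage selections, without duplicates."""
    return sorted(set(wgcna_genes) | set(rf_genes) | set(mrmr_genes))

print(mrmr_feature_count(2000), mrmr_feature_count(7129), mrmr_feature_count(12600))
print(merge_selected(["g1", "g2"], ["g2", "g3"], ["g3", "g4"]))
```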

The second stage feature selection
To reduce the number of selected genes, further improve classification accuracy and further reduce computational effort, this paper performs the next stage of feature selection with the improved binary SSA, where the input of the method is the union of the features selected by the three feature selection algorithms in the previous section. The features are first binary coded, with 0 indicating that a feature is not selected and 1 indicating that it is selected. Since the features selected by the intelligent algorithm vary from one classifier to another, each classifier is run 10 times to verify the effectiveness of the proposed method, and each reported evaluation index is the average over the 10 runs. As shown in Table 7, six classification methods, namely LightGBM, RF, SVM, XGBoost, MLP, and KNN, are used to classify the features selected by ISSA, and the classification results are evaluated using four evaluation metrics. Compared with Tables 5 and 6, the proposed method achieves high classification accuracy with fewer features on the 10 public datasets, and performs well on datasets with far fewer samples than features. To verify the validity of the selected features, we took the features that appeared more than five times in the ten results, divided the samples into the Colon Cancer patient group and the normal group, compared the expression of these features between the two groups, and performed the Mann-Whitney test. The results are shown in Figure 6, in which only 10 features are listed; significance is marked as "*": p < 0.05; "**": p < 0.01; "***": p < 0.001; "****": p < 0.0001.
Figure 6. Gene expression profiles of a subset of selected features of colon cancer.
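The stability filter described above, keeping features that appear more than five times in ten runs, can be sketched with a simple counter. The run contents below are made-up placeholders, and the function name is hypothetical; the statistical test itself is not covered.

```python
from collections import Counter

def stable_features(runs, min_count=5):
    """Features selected in strictly more than min_count of the runs."""
    counts = Counter(f for run in runs for f in set(run))
    return sorted(f for f, c in counts.items() if c > min_count)

# Ten hypothetical runs: g1 appears 10 times, g2 6 times, g3 5 times
runs = [["g1", "g2", "g3"]] * 5 + [["g1", "g2"]] + [["g1"]] * 4
print(stable_features(runs))  # ['g1', 'g2']
```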

Comparative analysis
In recent years, using the wrapper method to improve the quality of feature subsets has become a research hotspot. On 10 public gene expression datasets, the proposed method is compared with current advanced methods, namely PSO, GWO, WOA, SCA and the original SSA. Table 8 gives the classification results when the fitness function for feature selection is constructed from the accuracy of LightGBM, and shows the mean and variance of the results of ten runs. It can be seen that ISSA achieves better or similar results. Table 9 gives the corresponding results for MLP (the results of the other four methods are in the appendix). To be fair, all experiments are performed on the features selected by WGCNA, mRMR and RF.
Figure 7 is the grouped box plot for the Lung dataset (the plots for the other nine datasets are in the appendix). The abscissa shows the six classification methods; each classification method forms a group, and each group contains the box plots of the results of ten runs of the six intelligent optimization algorithms. The plot shows that the median and quartiles of ISSA are higher than those of the other methods.
Table 10 compares the framework proposed in this paper with advanced research from recent years. The last row of the table is the work done in this paper, for which the highest accuracy among the six classification methods is reported. The top 10 rows give the performance reported in other papers on each dataset, and the first column gives the abbreviation of the method proposed in each paper. For the specific methods, please refer to Table 3. "-" indicates that the paper did not use this dataset. On all 10 datasets, this paper achieves equal or better accuracy, and the classification accuracy is above 97.6% on every dataset, which supports the research value of the proposed framework.
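The per-algorithm statistics behind the tables (mean and variance over repeated runs) and the box plots (median and quartiles) can be computed as in the following sketch. The helper name `summarize_runs` is hypothetical, the sample variance and the "inclusive" quartile method are assumptions since the text does not say which definitions were used, and the accuracies are made-up placeholders, not results from this paper.

```python
import statistics


def summarize_runs(accuracies):
    """Summary statistics for the accuracies of repeated runs of one algorithm."""
    q1, median, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {
        "mean": statistics.mean(accuracies),
        # Assumption: sample variance (n - 1 in the denominator).
        "variance": statistics.variance(accuracies),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


# Placeholder accuracies for illustration only.
placeholder_runs = [0.90, 0.92, 0.94, 0.96, 0.98]
summary = summarize_runs(placeholder_runs)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```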

Conclusions
The internal relationships within cancer gene expression datasets make cancer diagnosis challenging, so feature selection plays an important role in reducing the dimension of the data and removing irrelevant and redundant features. Because different feature selection algorithms have different advantages and disadvantages, combining different types of feature selection algorithms is a promising technique for solving feature selection problems. The two-stage framework for gene selection proposed in this paper combines embedding, filtering and wrapper methods to identify the optimal feature subset. The framework can significantly reduce the number of features while maintaining high performance indicators. ISSA introduces three improvements: first, it uses PWLCM chaotic mapping to increase the diversity of the initial population; second, it changes the number of leaders selected to avoid falling into local optima; third, it improves the follower update formula so that the optimal location, that is, the optimal subset, can be found faster. In the experiments, 10 high-dimensional benchmark datasets were used to test the performance of the method. These benchmark datasets differ in the number of genes, samples and categories, which makes them suitable for evaluating the generalization ability of the method.
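As an illustration of the first improvement, the sketch below generates a binary initial population from a piecewise linear chaotic map (PWLCM) in its commonly used form. The control parameter `p = 0.4`, the seed `x0 = 0.37`, the 0.5 binarization threshold and the function names are assumptions made for this example, not values or code taken from this paper.

```python
def pwlcm(x, p=0.4):
    """One step of the piecewise linear chaotic map on [0, 1], with 0 < p < 0.5."""
    if not 0.0 < p < 0.5:
        raise ValueError("p must lie in (0, 0.5)")
    if x >= 0.5:
        x = 1.0 - x  # the map is symmetric about x = 0.5
    if x < p:
        return x / p
    return (x - p) / (0.5 - p)


def chaotic_binary_population(n_salps, n_features, x0=0.37, p=0.4, threshold=0.5):
    """Build n_salps binary masks of length n_features from one chaotic sequence.

    Assumption: a chaotic value above `threshold` means the feature is selected.
    """
    population = []
    x = x0
    for _ in range(n_salps):
        salp = []
        for _ in range(n_features):
            x = pwlcm(x, p)
            salp.append(1 if x > threshold else 0)
        population.append(salp)
    return population


population = chaotic_binary_population(n_salps=5, n_features=12)
for salp in population:
    print(salp)
```

Because the sequence is deterministic given `x0` and `p`, the same seed reproduces the same initial population; seeds that land on fixed points of the map, such as 0, should be avoided.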
Feature selection is currently an active research topic, and new algorithms and theories continue to be proposed for it. Feature selection by swarm intelligence optimization algorithms helps machine learning methods use the most important features, which improves the performance of the learning algorithm in terms of learning speed or classification accuracy. Swarm intelligence algorithms have matured in theory and have proved to be good methods for solving practical optimization problems. The randomness of the algorithm promotes the diversity of solutions, helps avoid local optima, and helps it converge to a global optimum more quickly. However, since the algorithm is a random search algorithm, its solution quality and performance can at present only be demonstrated by numerical experiments, and the theoretical derivation is still insufficient. Swarm intelligence algorithms therefore remain an important direction of research in computer science, with broad prospects in science and engineering. In future work, we will combine the proposed algorithm with other intelligent algorithms to enhance its search ability, and use other advanced machine learning algorithms in the classification part.