A Novel Strategy for Gene Selection of Microarray Data Based on Gene-to-Class Sensitivity Information

To obtain predictive genes with lower redundancy and better interpretability, a hybrid gene selection method encoding prior information is proposed in this paper. To begin with, the prior information referred to as gene-to-class sensitivity (GCS) of all genes from microarray data is exploited by a single hidden layered feedforward neural network (SLFN). Then, to select more representative and lower redundant genes, all genes are grouped into some clusters by K-means method, and some low sensitive genes are filtered out according to their GCS values. Finally, a modified binary particle swarm optimization (BPSO) encoding the GCS information is proposed to perform further gene selection from the remainder genes. For considering the GCS information, the proposed method selects those genes highly correlated to sample classes. Thus, the low redundant gene subsets obtained by the proposed method also contribute to improve classification accuracy on microarray data. The experiments results on some open microarray data verify the effectiveness and efficiency of the proposed approach.


Introduction
One of the major applications of microarray data analysis is to perform sample classification between different disease phenotypes, for diagnostic and prognostic purposes [1][2][3][4][5]. The classification involves a wide range of algorithms such as differential gene expression analyses, clustering analyses and supervised learning [6]. Gene selection is one of the critical steps in the course of the classification of microarray data [7,8]. Selecting a useful gene subset not only decreases the computational complexity, but also increases the classification accuracy.
The methods for gene selection are broadly divided into three categories: filter, wrapper and embedded methods [9]. A filter method relies on general characteristics of the training data to select genes without involving any classifier for evaluation. Most filter methods consider each feature separately with ignoring feature dependencies, which may lead to worse classification performance when compared to other types of feature selection methods [10]. In addition to considering feature dependencies, wrapper methods take into account the interaction between feature subset search and model selection. However, wrapper methods have a higher risk of overfitting than filter ones and are very computationally intensive [11]. Embedded methods have the advantage that they include the interaction with the classification model, while being far less computationally intensive than wrapper methods [12].
In recent years, many methods combined with populationbased stochastic optimization techniques such as genetic algorithm (GA) [13] and particle swarm optimization (PSO) [14,15] have been used increasingly as an effective technique for microarray data analyses [16][17][18][19][20][21][22]. In [19,20], binary PSO (BPSO) [23] combined with filter method was applied for searching optimal gene subsets. The method in [19] simplified gene selection and obtained a higher classification accuracy compared to some similar gene selection methods based on GA, while the method in [20] could determine the appropriate number of genes and obtained high classification accuracy by support vector machine. In [21], an approach combined GA with K-nearest neighbor (KNN) method was proposed to identify genes that could jointly discriminate between different classes of samples. The GA/KNN approach could capture the correlated structure in the data and are highly repeatable in independent runs [21]. A combination of Integer-Coded GA (ICGA) and particle swarm optimization, coupled with extreme learning machine (ELM) was used to select an optimal set of genes [22]. These hybrid methods were capable of selecting a compact subset of predictive genes for sample classification. However, these methods considered only the features' relevance by evaluating their utility for achieving accurate predication or exploiting data variance and distribution, and the selected genes were usually poorly explicable.
Similar to GA, PSO searches for optima by updating population with generations. Unlike GA, PSO has no evolution operators such as crossover and mutation. PSO is easy to implement with few parameters need to adjust. Binary PSO (BPSO) algorithm is a binary version of PSO, which is suitable to solve discrete optimization problems [23].
In the gene selection process, different kinds of classification algorithms such as backpropagation (BP) [24], KNN [25] and support vector machine (SVM) [26], were used to evaluate the prediction ability of gene subsets, which may result into high computational cost in wrapper and embedded methods. The convergence performance, especially convergence rate, of the classification algorithm to evaluate the candidate gene subsets is a significant factor in the gene selection process. ELM, developed recently for a single hidden layered neural network (SLFN) [27], has good generalization performance with a fast training procedure. In ELM, the input weights are chosen randomly and the output weights are calculated analytically. Normally, ELM's performance is superior to other classifiers such as gradient-based learning algorithms and SVM for problems with larger sample sets and a smaller number of features [22]. Therefore, it is a reasonable choice to use ELM to evaluate the candidate gene subsets.
In Kmeans-PSO-ELM method [28], we used K-means method to group the initial gene pool into several clusters, and standard PSO combined with ELM was used to perform gene selection, which could obtain a compact set of informative genes. Based on the work in [28], a prior information referred to as gene-to-class sensitivity (GCS) is considered in the gene selection process to select smallest possible set of predictive genes with high interpretability in this paper. The GCS information is exploited by SLFN from microarray data. The GCS information, indicating if a gene is sensitive to sample classes, contributes to select those genes significantly correlated to the sample classes. To identify relevant genes effectively for subsequent researches such as sample classification, we partition all genes into different clusters by Kmeans method, and filter out some clusters and genes with low GCS values. However, some remainder genes, especially those in the same cluster, may have high similarity so that there is probably large with rich redundancy among these remainder genes. To identify small sets of genes that could be used for diagnostic purpose, a modified binary PSO coupling GCS information combined with ELM is used to select smallest possible gene subsets. The modified BPSO may select the representative genes from the remainder clusters to form the optimal gene subset. Moreover, the hybrid method could reduce computational cost for using ELM to evaluate the candidate gene subsets.

K-means Clustering Method
Clustering is a search for hidden patterns that may exist in datasets. It is a process of grouping data objects into disjointed clusters so that the data in each cluster are similar, yet different to the others [29]. There are two kinds of clustering algorithms: hierarchical clustering and partitioned clustering. Different from hierarchical approaches, the partitioned clustering approach divided the input data into specified in advance number of clusters by minimizing a certain cost-function [29].
K-means, a typical partitioned clustering method, is simple and generally very fast [29]. It initializes specified in advance number of centers by some initial values called seed-points. Then, K-means computes the squared distances between the input data points and centers, and assigns the input data points to the nearest centers. Each cluster is represented by an adaptively-changing center. The above process is repeated until all center positions are optimized. The standard K-means algorithm minimizes the square-error costfunction as follows: where p is a input data point in the cluster C i , i = 1,2,…, N c , and m i is the center of cluster C i (the mean of all data points in the cluster C i ).

Particle Swarm Optimization
PSO is an evolutionary computing mechanism in searching for the best solution by simulating the movement of flocking birds [14]. The population of the birds is called the swarm, and the members of the population are called the particles. Each particle  represents a possible solution to the problem. In the process of optimization, each particle flies independently in its own direction which is guided by its individual historical best position (pbest) as well as the global best position of all particles (gbest). Supposing the dimension of a searching space is D, and the swarm is S = (X 1 ,X 2, X 3 ,…,X ns ); each particle stands for a position in Ddimensional space; the position of the i-th particle in the search space can be denoted as X i = (x i1 , x i2, …, x iD ), i = 1, 2, …, ns, where ns is the swarm size. The individual historical best position of the ith particle is expressed as P i = (p i1 ,p i2, …,p iD ). The global best position of all particles is denoted as P g = (p g1 ,p g2, …,p gD ). The velocity of the i-th particle is expressed as According to [15], the basic PSO is described as: x id~xid z v id ð3Þ where c 1 , c 2 are the acceleration constants with positive values; rand() is a random number ranged from 0 to 1; w is the inertia weight.
The basic PSO algorithm is usually applied to solve problems in which the elements of the solution are continuous real number, whereas BPSO is more suitable for discrete optimization [23]. In BPSO, the velocity update formula remains unchanged as shown in Eq.(2), while the new position of the particle is calculated according to the following equation: where the function S(V i ) is a sigmoid function,

Extreme Learning Machine
In [30], a learning algorithm for SLFN called extreme learning machine (ELM) was proposed to solve the problem caused by gradient-based learning algorithms. ELM randomly chooses the input weights and hidden biases, and analytically determines the output weights of SLFN. ELM has much better generalization performance with much faster learning speed than gradient-based algorithms [31,32].
For where  The wh i = [wh i1 , wh i2 , …,wh in ] T is the input weight vector connecting the i-th hidden neuron and the input neurons, the wo i = [wo i1 , wo i2 , …,wo im ] is the output weight vector connecting the i-th hidden neuron and the output neurons, and the b i is the bias of the i-th hidden neuron.
In the course of learning, first, the input weights and the hidden biases are arbitrarily chosen and need not be adjusted at all. Second, the smallest norm least-squares solution of the Eq. (5) is obtained as follows: It was concluded that the ELM has the minimum training error and smallest norm of weights [31,32]. The smallest norm of weights tends to have the best generalization performance [31,32].
Since the solution is obtained by an analytical method and all the parameters of SLFN need not be adjusted, ELM converges much faster than gradient-based algorithm.

The Proposed Hybrid Method
Gene-to-class sensitivity information. To get a better understanding of gene-to-class sensitivity (GCS) information, the input-to-output sensitivity of a SLFN should be given first. wh ki is the connected weight from the i-th input node to the k-th hidden neuron, and wo jk is the connected weight from the j-th hidden neuron to the k-th output neuron. xx i is the i-th input component and O j is the j-th output component of the SLFN. The i-th input to j-th output sensitivity [33,34] is defined as:

Sij~1
Nsam where N sam is the number of samples. For simplicity, the hidden activation function of the SLFN is logsig function, and the output neurons are linear. Obviously, the derivative LO j Lxx i obtained from the r-th sample, XX r , can be calculated as: When the O j is the value of the j-th class and xx i represents the expression level of the i-th gene, S ij in Eq. (7) can be considered as the i-th gene to the j-th class sensitivity. The larger the S ij value is, the more sensitive to the j-th class the i-th gene is. The GCS value for the k-th gene is normalized as follows: where N O and N gn1 are the number of the output neurons in the SLFN and the number of genes in the first-level initial gene pool.
In this study, a SLFN is trained with the training dataset by ELM, and the input and output weights of the SLFN are determined. According to Eqs. (7)-(9), the GCS values of all genes are easily obtained. To obtain more accurate GCS value of each gene, the above operation is repeated 50 times and the average GCS value of each gene is calculated. *Also selected by the method in [40]. #Also selected by the method in [37]. doi:10.1371/journal.pone.0097530.t003 The Proposed Gene Selection Method In this paper, gene selection also consists of two respects, which are to identify relevant genes from each cluster and to tend to select smallest subsets from the relevant genes. To obtain compact and explicable gene subsets, GCS information (GCSI) exploited from microarray data is considered in the whole process of gene selection. The rough frame of the proposed method is shown in Figure 1.
To identify relevant genes for subsequent sample classification, firstly, the GCS values of all genes are calculated according to Eqs.(7)- (9), and all genes are clustered by K-means algorithm. Since the genes in the same cluster might have similar functions, it is possible that several genes serve as the equivalent for the subsequent sample classification. To reduce the redundancy, only the representative genes among these equivalent genes are selected for further processing. Then, if maximum GCS value of all genes in a cluster is far lower than those of other clusters, this cluster will be filtered out. Moreover, if the GCS value of a gene in the remainder cluster is lower than the mean of the GCS values of all genes in this cluster, this gene will also be filtered out. Thus, the remainder genes in the remainder clusters have comparatively high GCS values, and these relevant genes for sample classification are probably kept for further gene selection.
Although the remainder genes are relevant to data classes, there is still rich redundancy among these genes. In this study, a modified BPSO encoding GCS information is proposed to select the most compact gene subsets from the remainder genes. The detailed steps are described as follows.
Step 1: Form a first-level initial gene pool. The dataset is divided into training and testing datasets. Select 200,400 genes from all original genes by using information index to classification (IIC) method [35] on the training data. Since the original IIC method is used for two-class microarray data, we develop it for multi-class microarray data as follows: where m g j and m g k are the means of expression value of the gene g in the j-th and k-th classes, respectively, and s g j and s g k are the standard deviations of expression value of gene g in the j-th and kth classes, respectively. c is the total number of classes. From [35], the higher the value of d(g), the more classification information the gene g contains, so the gene g is more relevant to samples categories. The high classification accuracy will be obtained with Also selected by the method in [33]. Also selected by the method in [43]. doi:10.1371/journal.pone.0097530.t005 high probability by a classifier if the microarray data is projected onto the gene g whose IIC value, d(g), is high. The genes are ranked by their IIC values on the training dataset, and those genes with higher IIC values are chosen to establish the first-level gene pool.
Step 2: Establish a second-level initial gene pool based on the first-level initial gene pool. Cluster the genes in the first-level initial gene pool to predetermined number of groups with K-means method. The predetermined number of the clusters is determined by trial and error. Delete those clusters whose maximum GCS values are much smaller than other clusters. Moreover, in each remainder cluster, the genes whose GCS values are smaller than the mean GCS value of all genes in the cluster are also removed. All the remainder genes in the remainder clusters form a secondlevel initial gene pool to perform further gene selection.
Step 3: Use BPSO to select the optimal gene subsets from the second-level initial gene pool. To improve the search ability of the swarm, GCS information is encoded into BPSO for further gene selection.
Firstly, initialize a population of particles with random positions and velocities in the search space. The i-th particle X i = (x i1 , x i2 , …,x iD ) represents a candidate gene subset, and the element x ijM{0,1},(1#j#D) indicates whether the j-th gene is selected. The dimension of a particle is equal to the number of genes in the second-level initial gene pool. Since the number of the remainder clusters is comparatively small and each cluster owns its representativeness, at least one gene should be selected from each cluster in the initialization of each particle.
Secondly, update the position of each particle. To encode GCS information, two modified equations for updating the particles based on the discrete PSO in [20] are proposed as follows: where GCS(j) is the GCS value of the j-th gene, and N gn2 is the number of genes in the second-level initial gene pool. v ij (t+1) on the right- hand side of Eq.(11) is calculated by Eq.(2). The parameter, a, is a fixed number, and is set as 0.5 according to [20]. The subscripts i and j denote the i-th particle and its j-th component. The p ij is the j-th component of the historical best position of the i-th particle, and the p gj is the j-th component of the global best position of all particles. The above equations make the genes with high GCS values be selected with high probability. Thirdly, evaluate the fitness value of each particle, and update the historical best position of each particle and the global best position of all particles. To select smallest possible predictive gene subsets, the fitness function is defined as follows: According to [36], to correct for the selection bias in the gene selection process, we perform 5-fold cross validation (CV) to evaluate the selected genes. The accuracy(i) in Eq. (13) is the 5-fold cross validation accuracy on training data obtained by ELM with the candidate genes denoted by the i-th particle. The GenesNumber(i) is the number of the selected genes denoted by the i-th particle.
Finally, if the fitness value reaches the threshold value, or the maximal iterative generations are arrived, the particle with the best fitness value is output. So the optimal gene subset is obtained.
As a wrapper or embedded gene selection method, its main computational cost is the computational time of evaluating the candidate gene subset by classifier. In the proposed method, the classifier to evaluate the candidate gene subsets is ELM which is trained by an analytical method without iterations. In those gene selection methods using SVM or BP algorithm to evaluate the candidate gene subsets, the classifier is trained with thousands of iterations and the training process is very time consuming. Therefore, the computational time of the proposed method is much less than that of methods using SVM or BP to evaluate the candidate gene subset. Moreover, for better generalization performance of ELM, the final selected gene subset evaluated by ELM has high predictive ability. The proposed method combines K-means method with the modified BPSO, GCS information and ELM, so it is referred to as KMeans-GCSI-MBPSO-ELM.

Datasets
To verify the effectiveness and efficiency of the proposed gene selection method, we conduct experiments on six open microarray datasets including Leukemia, Colon, SRBCT, LUNG, Brain cancer and Lymphoma data. The detailed description of the datasets is listed in Table 1.
The Leukemia data [37] contains total 72 samples in two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), which contain 47 and 25 samples, respectively. Every sample contains 7129 gene expression values. The Leukemia data are available at http://www-genome.wi.mit.edu/ cgi-bin/cancer/datasets.cgi.
The Colon data consists of expression levels of 62 samples of which 40 samples are colon cancer samples and the remaining are normal samples. Although original expression levels for 6,000 genes are measured, 4,000 genes out of all the 6,000 genes were removed considering the reliability of measured values in the  The LUNG data [38,39] contains in total 203 samples in five classes, adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, which have 139, 21, 20, 6,17 samples, respectively. Each sample has 12600 genes. The genes with standard deviations smaller than 50 expression units were removed and a dataset with 203 samples and 3312 genes was obtained [38,39]. The data is also available at http://www.biomedcentral.com/content/supplementary/1471-2105-7-228-S4.tgz. The Brain cancer data contains 60 samples in two classes, 46 patients with classic and 14 patients with desmoplastic brain cancer. The Lymphoma data includes 58 samples where 32 patients did cured and 26 patients did not cured. Each sample in the Brain cancer and Lymphoma has 7129 genes. These two data are available at http://linus.nci.nih.gov/,brb/DataArchive_ New.html.
Since there has no guidance on how to select the population size and maximum iteration number in PSO, we determine the values of these parameters within the crossvalidation runs on the validation dataset. As for using modified BPSO to select optimal genes, the population size is 100 on the SRBCT data, while the one is 50 on the other five data; the maximum iteration number is 20 on the SRBCT, LUNG and Brain cancer data, and the one is 40 on the Leukemia, Colon and Lymphoma data. According to [15], a recommended choice for the acceleration constants c 1 and c 2 is 2, and a better decrease for the inertia weight, w, is form 1.4 to 0.5. In this study, based on the conclusions in [15] and the crossvalidation runs on the validation dataset, the initial and final inertia weight are set as 1.2 and 0.4, respectively on all data, and the acceleration constants c 1 and c 2 are both selected as 1.6 on all data.

Reduce Redundant Genes
In the first-level initial gene pool, some clusters with low maximum GCS values are deleted. In the experiments, Cluster 3 is removed on the Leukemia data, Clusters 2 and 6 are deleted on the Colon data, Clusters 1, 2 and 3 are deleted on the SRBCT data, Clusters 3 and 8 are filtered out on the LUNG data, Cluster 6 are deleted on the Brain cancer data, and Cluster 3 are filtered out on the Lymphoma data. Moreover, some genes in a remainder  cluster whose GCS values are lower than the mean GCS values of the cluster are removed. Figure 2 shows the GCS values of the genes in the remainder clusters on all data. From Figure 2, over half the genes in the remainder clusters are filtered out. The less the reserved genes are, the less the computational cost of the further gene selection is.

The Classification Ability of the Selected Gene Subsets
To verify the classification ability of the selected gene subsets, ELM is used to perform sample classification with these selected gene subsets on the six datasets. Each experiment is conducted 100 times, and the mean results are listed in Table 2.
From Table 2, with the three genes selected by the proposed approach, ELM obtains 100% testing accuracy and 5-fold CV testing accuracy on Leukemia. ELM obtains 100% testing accuracy and 5-fold CV testing accuracy with all selected gene subsets on the SRBCT data. With the about ten genes selected by the KMeans-GCSI-MBPSO-ELM method, ELM obtains high prediction accuracies on the other four data. These results indicate that the KMeans-GCSI-MBPSO-ELM method has the ability of selecting most predictive genes highly related to sample classes.
The GCS values of all reserved genes listed in Figure 2 are shown in Figure 3. It can be found that the KMeans-GCSI-MBPSO-ELM method does not always select those genes with the highest GCS values, and it also selects those critical genes with comparatively low GCS values. This is mainly because the modified BPSO considers not only the GCS values of the selected genes but also the classification ability of each gene subset.

Biological and Functional Analysis of the Selected Gene Subsets
The experiment on each microarray data is conducted 1000 times, and the top ten frequently selected genes are listed in Tables 3-8 for the six data.
From Table 3, all genes except genes X03934, Z69881 and J03473 were also selected by the methods proposed in [40] and [37]. Gene Z69881 is also selected by the proposed method in [37]. Gene U05259, a B lymphocyte antigen receptor, encodes cell surface proteins for which monoclonal antibodies have been demonstrated to be useful in distinguishing lymphoid from myeloid lineage cells [40]. Gene M63138 is the member of the peptidase C1 family involved in the pathogenesis of breast cancer and possibly Alzheimer's disease [40]. Gene X95735, an AMLrelated gene, is an adhesive plaque protein, which plays a central role in regulation of cell differentiation [40]. Form Tables 2 and 3, it can be concluded that gene X03934 is also a critical gene for sample classification.
From Table 4, genes H20709 and R22197 were also selected in [41], and genes H20709, T94579 and R88740 were also selected in [42]. A muscle index can be calculated based on an average intensity of 17 ESTs in the array that are homologous to smooth muscle genes which included gene H20709 [41]. Gene R22197 is one of the ribosomal protein cluster which are ESTs homologous to genes that appear to be related to cellular metabolism such as an ATP-synthase component and an elongation factor [41].
From Table 5, genes 796258, 629896, 1409509, 1435862, 784224, 21652 and 770394 were also both selected by the methods proposed in [33] and [43]. Genes 812105 and 42558 were also selected by the method proposed in [33]. Some genes are  over-expressed in a certain type of tumor but lack of specificity. For instance, Gene 784224 (fibroblast growth factor receptor 4) was noted to be highly expressed only in RMS and not in normal muscle, but it is also expressed in some other cancers and normal tissues [33]. This tyrosine kinase receptor is expressed during myogenesis but not in adult muscle, and is of interest because of its potential role in tumor growth and in prevention of terminal differentiation in muscle [33]. Although many gene selection methods were used on the LUNG, Brain cancer and Lymphoma data, they did not give the detailed information of the selected genes [38,39,44,45], and thus we do not know which genes listed in Tables 6-8 were also selected by these methods. Genes listed in Tables 6-8 can be a useful reference for future study of these three data.
To further verify that the proposed method is capable of selecting predictive genes, the heatmap with top ten frequently selected genes for the six data are shown in Figure 4. From Figure 4 (A), all ten genes expression levels clearly differentiate between AML and ALL. From Figure 4 Figure 4 (E-F), there has no single gene in the Brain cancer and Lymphoma data (especially in the Lymphoma data) whose expression levels are distinct between two classes. This is mainly because all genes in these two datasets have low deviation over all samples.
Principal component (PC) analysis is a linear orthogonal transform such that the coordinates of the data in the new space are uncorrelated and the most amount of variance of the original data is preserved by only a few coordinates [46]. Hence, the first few PCs explain most of the variance in the data. A plot of the first three PCs often reveals patterns in the data [47]. Here it is used to project samples of high dimensions onto a three-dimensional plot for visual display. PC analysis is applied to all samples in the six data using the top 30 frequently selected genes, and the display is shown in Figure 5. Noticeably, two and four distinct clusters emerge for the Leukemia and SRBCT data, respectively ( Figure 5(A) and (C)). As for the LUNG data ( Figure 5(D)), although several samples of different classes are overlapped, five clusters are apparent. As for the Colon data ( Figure 5(B)), four normal samples are in the cluster of the tumor samples, which were considered as the contaminated samples in [47]. From Figure 5 (E-F), a few samples of different classes are overlapped. This further indicates that the samples from different classes are not very distinct in both the Brain cancer and Lymphoma data, which was verified in [44,45].

Reproducibility of the Proposed Method
To examine the reproducibility of gene selection [21], we repeat the same proposed approach on the six data with different random seed numbers. Another 1000 subsets of optimal particles are thus obtained through an independent run. This means that the proposed method runs 1000 times in one independent run to obtain 1000 optimal gene subsets. All genes in the second-level initial gene pool are ranked according to the frequency of being selected in 1000 solutions obtained by the independent run. Figure 6 depicts the correlation between the ranks from the two independent runs. The points in Figure 6 are distributed along the line, y = x, which indicates the proposed method is highly repeatable in two independent runs, so the reproducibility of KMeans-GCSI-MBPSO-ELM is high.  Convergence Rate of the Proposed Method Figure 7 shows the 5-fold CV accuracy on the training data versus the iteration number of the modified BPSO. From Figure 7, the MBPSO finds the optimal gene subset with only five epochs on the SRBCT and LUNG data, 13, 23, 13 and 25 epochs on the Leukemia, Colon, Brain cancer and Lymphoma data, respectively. For the low redundancy of the second-level initial gene pool and encoding the GCS information into BPSO, the modified BPSO could find the optimal gene subset with fast convergence rate. Figure 7 could give a guide for selecting the maximum iteration number in the modified BPSO for different types of microarray data.

Comparison with Other Gene Selection Methods
First, to compare the KMeans-GCSI-MBPSO-ELM method to other PSO-based gene selection methods such as BPSO-ELM, KMeans-BPSO-ELM and our previous work in [28], ELM is used to perform sample classification with the selected genes obtained by the four gene selection methods. In the BPSO-ELM method, BPSO search the optimal gene subsets and ELM is used to evaluate the gene subsets. As for the KMeans-BPSO-ELM method, all genes are grouped by K-means clustering method first and BPSO combined with ELM to perform gene selection. Compared to the KMeans-GCSI-MBPSO-ELM method, KMeans-BPSO-ELM method does not consider GCS information and not reduce redundant genes in each cluster before using BPSO-ELM to select genes. Each experiment is conducted for 100 times, and the corresponding results of 5-fold CV are listed in Table 9. With the genes selected by the KMeans-GCSI-MBPSO-ELM method, ELM obtains the highest 5-fold CV classification accuracy on all data, whereas it obtains the lowest 5-fold CV classification accuracy on all data with the genes selected by the BPSO-ELM. Although all the four methods combine PSO with ELM to select informative gene subsets, the KMeans-GCSI-MBPSO-ELM method has the best performance because it uses the GCS information to reduce the redundancy of genes and select those gene subsets highly correlated to sample classes.
Then, to compare the proposed method with GS1 [38], GS2 [38], Cho's [48], F-test [49] and the method in [28], we have adopted two ways to build a classifier more than ELM using the top frequently selected genes, one is through support vector machines (SVMs) [50] and the other is through K-nearestneighbor (KNN) search [49]. As in [38], we also chose a linear kernel in SVM and chose the Euclidean distance in KNN classifier with K = 5. To be consistent with [38], each experiment is conducted 100 times on the Leukemia, SRBCT, LUNG and Colon data, and the mean 5-fold CV classification accuracies for three numbers of top ranked genes (that is, 30, 60, and 100) on the Leukemia, SRBCT, LUNG and Colon data are listed in Tables 10-13. The results of GS1, GS2, Cho's and F-test are directly quoted from [38], and thus the results of these four methods on the Colon data are not provided in Table 13. Since the number of the reserved genes for further gene selection by BPSO in the proposed method is less than 100 on the Leukemia data, the results with 100 genes for six methods are not provided in Table 10. Similarly, the results with 60 genes on the Leukemia data for the method in [28] are not provided. In all cases except the KNN classifier with top 60 frequently selected genes on the Leukemia data, the KNN and SVM classifiers achieve higher classification accuracies with the genes selected by the KMeans-GCSI-MBPSO-ELM method than with the genes selected by other methods but the method in [28]. Moreover, the standard deviations in all cases except the method in [28] on the LUNG data of the KMeans-GCSI-MBPSO-ELM method are always less than those of other gene selection methods, which shows that the proposed method is most robust in most cases. Compared to the proposed method, the KNN and SVM classifiers achieve almost high classification accuracies on the SRBCT and LUNG data with the 60 and 100 genes selected by the method in [28], which indicates the method in [28] requires a comparatively large set including those predictive genes selected by the proposed method. Finally, to compare the KMeans-GCSI-MBPSO-ELM method with MIDClass [45], SGC (based on t-test (SGC-t) and based on WMW (SGC-W)) [44] and the method in [28], we conduct the experiments on the Brain cancer and Lymphoma data. To be consistent with [44,45], the experiments are conducted with standard Leave-One-Out-Cross-Validation (LOOCV). The results of the KMeans-GCSI-MBPSO-ELM and the method in [28] is the average of 100 trials, and the results of the MIDClass and SGC methods are retrieved from [44,45]. From the results shown in Table 14, it can be found that the KMeans-GCSI-MBPSO-ELM method outperform the method in [28], MIDClass and SGC methods.

Discussion on the Selection of the Number of the Clusters
In the KMeans-GCSI-MBPSO-ELM method, the number of the clusters of the genes in the first-level initial gene pool (N c ) is determined by trial and error where the number of the clusters is determined within the crossvalidation runs on the validation dataset. However, there are some methods available such as the Elbow method [51] or Silhouette scores [52] that could be used to determine the number of the clusters for K-means method. To show the influence of the selection of the N c , we use KMeans+ Elbow-GCSI-MBPSO-ELM method to perform gene selection on six data. The only difference between the KMeans-GCSI-MBPSO-ELM and the KMeans+Elbow-GCSI-MBPSO-ELM method is the selection of the N c . In the KMeans+Elbow-GCSI-MBPSO-ELM method, the Elbow method [51] is used to determine the number of the clusters by evaluating the ratio of the between-group variance to the total variance (also known as an F-test). The comparison results between the KMeans-GCSI-MBPSO-ELM and the KMeans+Elbow-GCSI-MBPSO-ELM methods are provided in Table 15. The results in Table 15 are the average of 100 trials.
From Table 15, the KMeans-GCSI-MBPSO-ELM and the KMeans+Elbow-GCSI-MBPSO-ELM methods select the same number of the clusters on some data including the Leukemia, Colon, Brain cancer and Lymphoma data, so ELM obtains the same classification accuracies with the genes selected by these two methods on these four data. The value of the N c in the KMeans-GCSI-MBPSO-ELM method is different from that of the KMeans+Elbow-GCSI-MBPSO-ELM method on both the SRBCT and LUNG data, and ELM obtains higher classification accuracies with the genes selected by the KMeans-GCSI-MBPSO-ELM method than with the one selected by the KMeans+Elbow-GCSI-MBPSO-ELM method on both the SRBCT and LUNG data. In summary, since the selection of the N c is determined by the crossvalidation accuracies obtained by ELM on the validation dataset in the KMeans-GCSI-MBPSO-ELM method, ELM tends to obtain higher classification accuracies on the full dataset with the genes selected by the KMeans-GCSI-MBPSO-ELM method than with the one selected by the KMeans+Elbow-GCSI-MBPSO-ELM method. Figure 8 shows the relationship between the number of the clusters (N c ) and the classification accuracies on the full dataset obtained by ELM with the genes selected by the KMeans-GCSI-MBPSO-ELM method on the six data. The result in Figure 8 is the average of 100 trials. From Figure 8, the KMeans-GCSI-MBPSO-ELM method is less sensitive to the choice of the N c on the Leukemia and SRBCT data than on the other four data. The best N c for the KMeans-GCSI-MBPSO-ELM method is 5, 8, 8 and 10 on the Leukemia, Colon, SRBCT and LUNG data, respectively, which is same as the one determined by trial and error in the proposed method. The best N c is 4 and 7 on the Brain cancer data, and the best one is 6 on the Lymphoma data, while the N c is selected as 6 and 7 by trial and error in the KMeans-GCSI-MBPSO-ELM method on the Brain cancer and Lymphoma data, respectively. This difference lies in the fact that the N c in the KMeans-GCSI-MBPSO-ELM method is selected by trial and error within the crossvalidation runs on the validation dataset while the best N c obtained from Figure 8 is determined within the crossvalidation runs on the full dataset.

Conclusions
In this study, the GCS information combined with K-means clustering method was used to reduce redundant genes, and then a modified BPSO encoding the GCS information was used to perform further gene selection. The new method could select the predictive gene subsets with comparatively high GCS values. Experiment results also shown that ELM, SVM and KNN classifiers obtained high prediction accuracy with the genes selected by the proposed method. Moreover, the proposed gene selection method has high reproducibility. Since the proposed method reduced the redundancy of genes only by removing the genes with low GCS values, it might filter out a few critical genes highly related to sample classification in some cases and thus lead into worse classification accuracy. Future work will include how to solve this problem in the proposed gene selection method as well as apply the new method to more complex microarray data.