Abstract

Gene expression data comprising thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features from the large number of genes. Recently, many researchers have devoted themselves to feature selection using diverse computational intelligence methods. However, in the process of selecting informative genes, many computational methods struggle to select small subsets for cancer classification because of the huge number of genes (high dimensionality) relative to the small number of samples, as well as noisy and irrelevant genes. In this paper, we propose a new hybrid algorithm, HICATS, which combines the imperialist competition algorithm (ICA), performing a global search, with tabu search (TS), which conducts a fine-tuned local search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved superior to that of related works, including the conventional binary optimization algorithm, in terms of classification accuracy and the number of selected genes.

1. Introduction

DNA microarray technology, which can measure the expression levels of thousands of genes in biological tissue simultaneously and produce databases of cancer based on gene expression data [1], has great potential in cancer research. Because conventional diagnosis methods for cancer are inaccurate, gene expression data have been widely used to identify cancer biomarkers closely associated with cancer. Such biomarkers strongly complement traditional histopathologic evaluation, increasing the accuracy of cancer diagnosis and classification [2] and improving understanding of the pathogenesis of cancer for the discovery of new therapies. Therefore, the application of gene expression data to cancer classification, diagnosis, and treatment has gained popularity.

Due to the high dimensionality of gene expression data compared to the small number of samples, together with noisy and irrelevant genes, conventional classification methods cannot be effectively applied to gene classification because they yield poor classification accuracy. Given these inherent properties of gene data, efficient algorithms are needed to solve this problem in reasonable computational time. Therefore, many supervised machine learning algorithms, such as Bayesian networks, neural networks, and support vector machines (SVMs), combined with feature selection techniques, have been used to process gene data [3]. Gene selection is the process of selecting the smallest subset of informative genes that are most predictive of their relative classes using a classification model. The objectives of feature selection are to maximize the classifier's ability and to minimize the gene subsets needed to classify samples accurately. Optimal feature selection from gene data is an NP-hard problem. Hence, it is more effective to use metaheuristic approaches, such as nature-inspired computation, to solve it. In recent years, metaheuristic algorithms based on a global search strategy rather than a local search strategy have shown their advantages in solving combinatorial optimization problems, and a number of metaheuristic approaches have been applied to feature selection, for example, genetic algorithm (GA), particle swarm optimization (PSO), tabu search (TS), and artificial bee colony (ABC).

Metaheuristic algorithms, as random search techniques, cannot guarantee finding the optimal solution every time. Because a single metaheuristic algorithm is often trapped in an immature solution, recent research has shifted towards hybrid methods. Kabir et al. [4] introduced a new hybrid genetic algorithm incorporating a local search to fine-tune the search for feature selection. Shen et al. [5] presented a hybrid of PSO and TS for feature selection to improve classification accuracy. Next, Li et al. proposed a hybrid of PSO and GA [6]; unfortunately, the experimental results did not achieve high classification accuracy. Alshamlan et al. applied ABC to feature selection: they first combined the ABC algorithm with minimum redundancy maximum relevance (mRMR) for analyzing microarray gene expression [7] and then hybridized the ABC and GA algorithms to select genes for microarray cancer classification, aiming to integrate the advantages of both algorithms [8]. The results obtained by the ABC algorithm improved to some extent, but high accuracy could not be achieved with a small number of genes. Chuang et al. [9] introduced an improved binary PSO in which the global best particle was reset to the zero position when its fitness value did not change after three consecutive iterations. The proposed algorithm achieved 100% classification accuracy on many datasets, but with a large number of selected genes.

In this paper, we therefore concentrate on the imperialist competition algorithm (ICA), a recent population-based optimization algorithm inspired by sociopolitical behavior, to address feature selection from gene expression data. It starts with an initial population and effectively searches the solution space through specially designed operators to converge to an optimal or near-optimal solution. Although ICA has proved to be a promising search technique for optimization problems, it still tends to become trapped in local optima. Tabu search (TS), as a local search technique, can compensate for this deficiency of ICA: it avoids convergence to local optima by means of a flexible memory system including an aspiration criterion and a tabu list. Due to the local search nature of TS, its convergence speed largely depends on the initial solution, and the population-based parallelism of ICA helps TS find the promising regions of the search space very quickly. The hybrid algorithm HICATS thus effectively combines the advantages of ICA and TS and shows its superiority in feature selection.

The rest of the paper is organized as follows. Section 2 describes the related algorithms, namely, generic ICA and TS. Section 3 elaborates the proposed HICATS, including the framework, individual representation, empire initialization, colony assimilation, and fitness function evaluation. Section 4 describes the parameter settings and the experimental results on several benchmark gene datasets, including comparative results between HICATS and other variants of PSO. Finally, concluding remarks are presented in Section 5.

2.1. Generic Imperialist Competition Algorithm (ICA)

ICA is a population-based stochastic optimization technique proposed by Atashpaz-Gargari and Lucas [10]. As one of the recent metaheuristic optimization techniques, ICA is inspired by sociopolitical behavior. A review of previous studies showed that this method has not been applied to feature selection from gene expression data. Like other evolutionary algorithms, ICA begins with an initial set of solutions (countries) called the population. Each individual of the population is an array called a "country" in ICA, analogous to a "chromosome" in GA. An empire is composed of countries, each of which is either an imperialist or a colony. The most powerful countries are considered imperialists, and the colonies are assigned to each empire based on the power of its imperialist state. After the empires are generated, the colonies are assimilated by their related imperialist, which makes the colonies stronger and moves them towards promising regions. If a colony becomes better than its imperialist while moving towards it, the positions of the imperialist and that colony are exchanged. As an empire gains more power it attracts more colonies, and this imperialist competition among the empires forms the basis of ICA. The powerful imperialists are reinforced, while the weak ones are weakened and gradually collapse when they have no colonies left. Finally, the algorithm converges to the optimal solution. The flowchart of ICA is shown in Figure 1. ICA has been successfully applied in many areas: fuzzy system control, function optimization, artificial neural network training, and other application problems.
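
To make the generic ICA workflow concrete, the following minimal Python sketch runs the main ICA loop on a toy real-valued minimization problem (a sphere function); the function names, the simplified colony assignment, and the simplified competition rule are illustrative only and do not come from the original ICA implementation or from HICATS.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(x):
    # toy objective to minimize (sphere function); the paper instead maximizes a
    # classification-based fitness over binary gene subsets
    return float(np.sum(x ** 2))

def run_ica(dim=5, n_pop=30, n_imp=4, n_iter=100, beta=2.0):
    countries = rng.uniform(-10, 10, size=(n_pop, dim))
    order = np.argsort([cost(c) for c in countries])         # most powerful (lowest cost) first
    imps, cols = countries[order[:n_imp]], countries[order[n_imp:]]
    owners = rng.integers(0, n_imp, size=len(cols))           # colony-to-empire assignment (simplified: uniform)
    for _ in range(n_iter):
        for i in range(len(cols)):                            # assimilation: colonies move toward their imperialist
            imp = imps[owners[i]].copy()
            cols[i] = cols[i] + beta * rng.random(dim) * (imp - cols[i])
            if cost(cols[i]) < cost(imp):                     # exchange positions if the colony becomes stronger
                imps[owners[i]], cols[i] = cols[i].copy(), imp
        # imperialistic competition (simplified): the weakest empire loses a random colony to the strongest
        powers = [cost(imp) for imp in imps]
        weakest, strongest = int(np.argmax(powers)), int(np.argmin(powers))
        victims = np.where(owners == weakest)[0]
        if weakest != strongest and len(victims):
            owners[rng.choice(victims)] = strongest
    best = min(imps, key=cost)
    return best, cost(best)

print(run_ica())
```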

2.2. The Tabu Search Algorithm

Tabu search (TS), proposed by Glover in 1986 [11], is a well-known local search algorithm for solving a wide range of hard combinatorial optimization problems. The algorithm begins with an initial solution and evaluates its fitness value. Then an associated set of feasible solutions is obtained by applying a simple modification to the given solution; this simple, basic transformation is called a move. If the best of these neighbors is not in the tabu list, it is selected as the new current solution. The tabu list keeps track of previously explored solutions and prevents TS from revisiting them, which helps avoid falling into local optima. A move may be accepted to increase diversity even if it is worse than the current solution, and the tabu list memorizes recently visited neighbors so that they are neglected in subsequent iterations. After a subset of feasible solutions is created according to the tabu list and evaluated by the objective function, the best admissible solution is selected as the next solution. This loop stops when the stopping criteria are satisfied.
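
As a concrete illustration of these ideas, the short Python sketch below runs tabu search with a single-bit-flip neighborhood, a fixed-tenure tabu list, and an aspiration criterion on a toy binary maximization problem; the objective and all parameter values are invented for the example.

```python
import random

random.seed(1)

def objective(bits):
    # toy objective: reward bits matching a hidden target pattern (maximize)
    target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    return sum(b == t for b, t in zip(bits, target))

def tabu_search(n_bits=10, n_iter=50, tabu_tenure=5):
    current = [random.randint(0, 1) for _ in range(n_bits)]
    best, best_val = current[:], objective(current)
    tabu = []                                    # recently flipped bit positions (short-term memory)
    for _ in range(n_iter):
        moves = []
        for i in range(n_bits):                  # neighborhood: all single-bit flips
            neighbor = current[:]
            neighbor[i] ^= 1
            val = objective(neighbor)
            # aspiration criterion: a tabu move is allowed if it beats the best so far
            if i not in tabu or val > best_val:
                moves.append((val, i, neighbor))
        if not moves:
            break
        val, i, neighbor = max(moves)            # best admissible neighbor becomes the new current solution
        current = neighbor
        tabu.append(i)
        if len(tabu) > tabu_tenure:
            tabu.pop(0)                          # forget the oldest move
        if val > best_val:
            best, best_val = neighbor[:], val
    return best, best_val

print(tabu_search())
```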

3. Proposed HICATS

ICA, as a global search metaheuristic, shows advantages in solving combinatorial optimization problems; however, the diversity of the population can be greatly reduced after some generations, which may lead to premature convergence. TS, as a local search technique, can exploit the neighbors of current solutions to obtain better candidates, but it takes much time to reach the global or near-global optimum. Incorporating TS into ICA as a local improvement strategy enables the method to maintain population diversity and prevents the search from being misled into local optima. Each binary coded string (country) represents a set of genes and is evaluated by the fitness function. TS is applied to the imperialist of each empire to select a new imperialist and avoid premature convergence. The framework of the proposed algorithm HICATS is shown in Figure 2 and described as follows.

Step 1. Set the parameters of the algorithm and initialize countries with binary values 0 and 1. Evaluate each country in the population using a support vector machine (SVM) classifier; the fitness is determined by the SVM classification accuracy and the size of the selected feature subset. Then the empires are generated according to the fitness values.

Step 2. Apply TS on imperialist in each empire. Generate and evaluate the neighbors of imperialist. Select the new solution according to the tabu list and aspiration criterion to replace the old imperialist.

Step 3. Apply a learning mechanism to the colonies, analogous to the Baldwinian Learning (BL) mechanism [12]. Find the genes that differ between the imperialist and one of its colonies; then use a randomly generated learning probability to decide how many of these genes the colony adopts. This strategy makes the colonies move towards their imperialist.

Step 4. Compare the objective values between imperialist and its colonies in the same empire. Exchange the positions of imperialist and its colony when a colony is better than its imperialist.

Step 5. Calculate the total power of an empire and compare all empires; then eliminate the weakest empire when it loses all of its colonies.

Step 6. If the termination condition (the predefined max iterations) is not fulfilled, go back to Step 2. Otherwise, output the optimal solution in the current population and stop the algorithm.

It is clear that HICATS integrates two quite different search strategies for feature selection: the ICA operations explore new regions and provide good starting solutions for TS, while TS exploits the neighborhood of each imperialist for better candidates and avoids getting stuck in local optima thanks to its memory system. The evaluation function, which combines the accuracy of the SVM with the number of selected genes in the feature subset, helps HICATS find the most salient features with less redundancy. A reliable gene selection method for classification should have higher classification accuracy and contain fewer redundant genes. For better comprehensibility, each component of HICATS is described in detail in the following sections.
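
The following Python sketch ties Steps 1-6 together in one simplified loop. It is only a schematic reading of the framework: the fitness is a stand-in for the SVM-based function of Section 3.3, the assimilation and TS operators are reduced to their essentials, and Step 5 (imperialistic competition) is omitted for brevity.

```python
import random

random.seed(0)

N_GENES, N_POP, N_IMP, MAX_ITER = 50, 15, 4, 30    # toy sizes; the paper uses 15 countries and 4 imperialists
INFORMATIVE = set(range(10))                       # hypothetical "informative" genes for the toy fitness

def fitness(country):
    # stand-in for the SVM-LOOCV fitness of Section 3.3 (weights 0.8 / 0.2)
    selected = {i for i, b in enumerate(country) if b}
    acc = len(selected & INFORMATIVE) / len(INFORMATIVE)
    return 0.8 * acc + 0.2 * (N_GENES - len(selected)) / N_GENES

def ts_step(country, tabu, tenure=5):
    # Step 2: one-bit-flip neighborhood with a short tabu list (see Section 3.4)
    best_i, best_neighbor, best_val = None, None, -1.0
    for i in range(N_GENES):
        if i in tabu:
            continue
        neighbor = country[:]
        neighbor[i] ^= 1
        val = fitness(neighbor)
        if val > best_val:
            best_i, best_neighbor, best_val = i, neighbor, val
    tabu.append(best_i)
    if len(tabu) > tenure:
        tabu.pop(0)
    return best_neighbor if best_val > fitness(country) else country

def assimilate(colony, imperialist):
    # Step 3: Baldwinian-style learning -- copy a random fraction of the
    # differing bits from the imperialist into the colony (see Section 3.2)
    diff = [i for i in range(N_GENES) if imperialist[i] == 1 and colony[i] == 0]
    beta = random.random()
    for i in random.sample(diff, round(beta * len(diff))):
        colony[i] = 1
    return colony

pop = [[random.randint(0, 1) for _ in range(N_GENES)] for _ in range(N_POP)]
pop.sort(key=fitness, reverse=True)                # Step 1: evaluate and rank the countries
imps, cols = pop[:N_IMP], pop[N_IMP:]
owners = [random.randrange(N_IMP) for _ in cols]   # simplified colony assignment
tabus = [[] for _ in range(N_IMP)]

for _ in range(MAX_ITER):                          # Step 6: iterate until the budget is spent
    for k in range(N_IMP):                         # Step 2: TS refines each imperialist
        imps[k] = ts_step(imps[k], tabus[k])
    for j, col in enumerate(cols):                 # Steps 3-4: assimilation and position exchange
        cols[j] = assimilate(col[:], imps[owners[j]])
        if fitness(cols[j]) > fitness(imps[owners[j]]):
            imps[owners[j]], cols[j] = cols[j], imps[owners[j]]
    # Step 5 (imperialistic competition among empires) is omitted in this sketch

best = max(imps, key=fitness)
print(fitness(best), sum(best))
```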

3.1. Individual Representation and Empire Initialization

In this paper, we use a random approach to generate a binary coded string (country) composed of 0s and 1s whose length is equal to the dimensionality of the gene expression data. A value of 1 in the country indicates that the corresponding gene is selected, while a value of 0 indicates that the corresponding gene is discarded. To make these operations clear, consider an example. Assume that the gene data have 10 dimensions (10 features $g_1, g_2, \ldots, g_{10}$); the country is initialized with 0s and 1s, and the string template has 10 bits, equal to the dimensionality of the gene data. The string is randomly generated and contains five 1s. Hence, the five features whose bits are set to 1 are selected to form the country, as shown in Figure 3.
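
A minimal sketch of this representation, assuming NumPy and a toy 10-gene problem, is shown below; the indices of the selected genes simply correspond to the positions of the 1 bits.

```python
import numpy as np

rng = np.random.default_rng(42)

n_genes = 10                                              # toy example from the text: 10 features
country = np.zeros(n_genes, dtype=int)
country[rng.choice(n_genes, size=5, replace=False)] = 1   # exactly five bits set to 1

selected = np.flatnonzero(country)                        # indices of the genes to keep
print(country, "-> selected gene indices:", selected)
```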

After generating the population, we evaluate the countries and initialize the empires, which are composed of imperialists and colonies. The fitness value of a country is estimated by the fitness function defined in Section 3.3. In this study, assume that the initial population size is $N_{pop}$; the $N_{imp}$ most powerful countries are selected as imperialists, and the remaining $N_{col} = N_{pop} - N_{imp}$ countries are assigned to these empires as colonies according to the power of the imperialists. To assign the colonies among imperialists proportionally, the normalized fitness value of the $n$th imperialist is defined by
$$C_n = c_n - \min_{i}\{c_i\},$$
where $C_n$ and $c_n$ are the normalized fitness value and the fitness value of the $n$th imperialist, respectively. The normalized power of this imperialist is defined by
$$p_n = \frac{C_n}{\sum_{i=1}^{N_{imp}} C_i}.$$
The normalized power of an imperialist reveals the strength of this imperialist. So the initial number of colonies possessed by the $n$th empire will be
$$NC_n = \operatorname{round}(p_n \cdot N_{col}),$$
where $N_{col}$ is the total number of colonies and $NC_n$ is the initial number of colonies of the $n$th empire. To form each empire, we randomly choose $NC_n$ colonies and give them to the corresponding imperialist. Figure 1 shows the initial population of each empire, with an imperialist and its colonies drawn in the same color. It is obvious that stronger imperialists have more colonies while weaker ones have fewer; Imperialist 1 has the most colonies and forms the most powerful empire.
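
The sketch below, under the reconstruction of the formulas above, shows how the imperialists could be chosen and the colony quotas computed for a population of 15 countries and 4 imperialists; note that simple rounding may leave a colony or two unassigned, which a full implementation would have to handle.

```python
import numpy as np

def allocate_colonies(fitness_values, n_imp):
    """Sketch of imperialist selection and proportional colony allocation."""
    fit = np.asarray(fitness_values, dtype=float)
    order = np.argsort(fit)[::-1]                  # most powerful countries first
    imp_idx = order[:n_imp]
    n_col = len(fit) - n_imp
    c = fit[imp_idx] - fit[imp_idx].min()          # normalized fitness C_n
    p = c / c.sum() if c.sum() > 0 else np.full(n_imp, 1 / n_imp)   # normalized power p_n
    nc = np.round(p * n_col).astype(int)           # initial colony count NC_n per empire
    return imp_idx, nc

# toy usage: 15 countries and 4 imperialists, as in the experiment settings
fitness_values = np.random.default_rng(0).random(15)
print(allocate_colonies(fitness_values, n_imp=4))
```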

3.2. Colonies Assimilation

In HICATS, assimilation is an important operation and significantly helps the evolution of the colonies. In this paper, the idea of continuous BL is introduced into HICATS for the assimilation of colonies by their imperialist. This strategy utilizes specific differential information from the imperialist, that is, the differential information between the imperialist $IM$ and a colony $CO$, providing an effective way to learn from the excellent solution. It can be defined as follows:
$$DG = IM \ominus CO, \qquad n_s = \operatorname{round}(\beta \times |DG|),$$
where the difference operation $\ominus$ is defined bitwise by $1 \ominus 0 = 1$, $1 \ominus 1 = 0$, $0 \ominus 1 = 0$, and $0 \ominus 0 = 0$. The learning rate $\beta$ is a randomly generated real number that determines the proportion of genes selected from the differences, $\operatorname{round}(\cdot)$ rounds its argument to the closest integer, and $n_s$ is the number of selected genes. In order to reduce the dimensionality of the country, our approach adopts a randomly generated binary template whose length equals the larger of the dimensions of the imperialist and the colony. For an imperialist (IM) with five selected features and one of its colonies (CO) with six selected features in an empire, the dimension of the binary template (BT) is 6. The template applied to the colony, denoted BTF, is generated by the NOT operation on BT; because the number of IM feature genes is less than the length of BT, IMBT takes only the corresponding part of BT. Figure 4 illustrates this with a numerical example: the differential information between the imperialist and one of its colonies contains 3 different genes (the number of 1s), and according to the parameter $\beta$ two of these different genes are selected. At the same time, BT is produced by a random strategy, BTF is the NOT operation of BT, and IMBT is obtained from BT. After this process, CO retains three of its features, and the assimilated CO, combining the selected different genes from IM with the genes retained by CO, has five features and is produced by the BL operation, as shown in Figure 4.
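
The following sketch gives one plausible reading of this assimilation step in Python; the exact interplay of IMBT, COBTF, and the learned difference genes follows Figure 4 of the original work, so the combination used here (keeping the colony's genes filtered by BTF and adding the selected difference genes from IM) should be treated as an approximation.

```python
import numpy as np

rng = np.random.default_rng(3)

def assimilate(im, co, beta=None):
    # one plausible reading of the BL-based assimilation: learn a random fraction of
    # the genes where IM has 1 and CO has 0, and filter CO through the template BTF
    im, co = np.asarray(im), np.asarray(co)
    diff_idx = np.flatnonzero((im == 1) & (co == 0))     # difference rule: 1 - 0 = 1, otherwise 0
    beta = rng.random() if beta is None else beta        # learning rate
    bt = rng.integers(0, 2, size=im.size)                # random binary template BT
    btf = 1 - bt                                         # BTF = NOT(BT)
    child = co & btf                                     # genes the colony keeps
    if len(diff_idx):
        learned = rng.choice(diff_idx, size=round(beta * len(diff_idx)), replace=False)
        child[learned] = 1                               # plus the selected different genes from IM
    return child

im = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
co = np.array([0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
print(assimilate(im, co, beta=0.6))
```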

3.3. Fitness Function

The feature selection of gene expression data needs to consider both the classification accuracy and the number of selected informative genes. Hence, the fitness of a country (gene subset) $X_i$ is defined as follows:
$$\text{fitness}(X_i) = w_1 \times A(X_i) + w_2 \times \frac{M - R(X_i)}{M},$$
in which $A(X_i)$ is the leave-one-out cross-validation (LOOCV) classification accuracy of the country (gene subset) $X_i$ obtained by the SVM model, $M$ is the dimensionality of the optimization problem, in other words, the total number of genes for each sample, and $R(X_i)$ is the number of selected genes in $X_i$. The parameters $w_1$ and $w_2$ weight the importance of the classification accuracy and of the number of selected genes, respectively. Since the classification accuracy is more crucial than the number of selected genes, the parameters satisfy $w_1 + w_2 = 1$ and $w_1 > w_2$.
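
A hedged sketch of this fitness function using scikit-learn is given below; the linear kernel, the binary toy data, and the helper name `fitness` are assumptions, since the paper specifies only an SVM evaluated with LOOCV (and one-versus-the-rest for multiclass data).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def fitness(country, X, y, w1=0.8, w2=0.2):
    """Sketch of the fitness above: w1 * LOOCV accuracy of an SVM on the
    selected genes plus w2 * fraction of genes discarded."""
    mask = np.asarray(country, dtype=bool)
    if not mask.any():                              # empty subsets get the worst score
        return 0.0
    acc = cross_val_score(SVC(kernel="linear"), X[:, mask], y,
                          cv=LeaveOneOut()).mean()
    M, R = mask.size, int(mask.sum())
    return w1 * acc + w2 * (M - R) / M

# toy usage with random data standing in for a microarray matrix (samples x genes)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = rng.integers(0, 2, size=20)
country = rng.integers(0, 2, size=100)
print(fitness(country, X, y))
```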

3.4. TS-Based Local Search

In HICATS, each colony can be assimilated by its imperialist and thereby improve itself, so the whole algorithm converges quickly. However, the classical ICA easily falls into local optima. Therefore, in this paper exploitation is performed by TS to search for better solutions near the current imperialist and to escape from local optima. How the neighboring solutions and the tabu list are produced is very important in the TS algorithm. In our study, a NOT operation on one bit of the solution is used to produce the neighboring solutions. For example, if the gene expression data have 10 dimensions, a neighboring solution is obtained from the current country by flipping a single bit, as illustrated by the TS-based procedure in Figure 5.
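
A small sketch of this one-bit neighborhood generation, with a tabu set of recently flipped positions, might look as follows; the aspiration criterion that readmits tabu moves when they improve on the best solution is omitted here and handled in the earlier TS sketch.

```python
def one_bit_neighbors(country, tabu):
    """Generate the 1-bit-flip neighborhood of a country, skipping tabu positions."""
    for i in range(len(country)):
        if i in tabu:
            continue                       # position was flipped recently; skipped in this sketch
        neighbor = list(country)
        neighbor[i] ^= 1                   # NOT operation on a single bit
        yield i, neighbor

current = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical 10-dimensional country
for i, nb in one_bit_neighbors(current, tabu={2, 5}):
    print(i, nb)
```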

4. Experiment

4.1. Gene Expression Datasets and Parameter Setting

In this paper, except for SRBCT, which was obtained by continuous image analysis, the gene microarray datasets were obtained with the oligonucleotide technique. At present, there is no standard method for processing gene expression data; therefore, we designed an effective algorithm, HICATS, to perform feature selection and improve classification accuracy. The experiments use 10 gene expression datasets, which can be downloaded from http://www.gems-system.org/. The description of the datasets is listed in Table 1, which contains the dataset names and detailed descriptions, and Table 2 gives the corresponding numbers of samples, genes, and classes. These datasets include both binary-class and multiclass data containing thousands of genes.

The parameter values for HICATS are shown in Table 3. The proposed algorithm clearly has fewer parameters than binary particle swarm optimization (BPSO); hence, the influence of parameter settings on HICATS is relatively small and the algorithm is more robust. The size of the population affects both the performance of the algorithm and its computational efficiency: a large number of countries requires more computation time to complete feature selection, whereas if the number is too small, the algorithm runs quickly but its performance is not guaranteed. Therefore, intermediate values of 15 and 50 are chosen for the population size and the number of iterations, respectively. Since the population is composed of imperialists and colonies, the number of imperialists also needs to be determined. If the number of imperialists is 1, HICATS degenerates into a single-population evolutionary algorithm instead of a multisubpopulation one, while if the number of imperialists is too large, a sufficient number of colonies cannot be guaranteed. The number of imperialists is therefore set to 4 in our experiments. The parameters $w_1$ and $w_2$ were introduced in Section 3.3 together with their admissible range. In order to guarantee that $w_1$ is larger than $w_2$, the values of $w_1$ and $w_2$ are set to 0.8 and 0.2, the same parameter settings as EPSO [13].
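
For reference, the settings described above can be collected in a single configuration object; this is merely a restatement of Table 3 as reported in the text, not additional tuning advice.

```python
HICATS_PARAMS = {
    "population_size": 15,      # number of countries
    "n_imperialists": 4,        # the remaining 11 countries start as colonies
    "max_iterations": 50,
    "w1": 0.8,                  # weight of the LOOCV classification accuracy in the fitness
    "w2": 0.2,                  # weight of the gene-subset size term (w1 + w2 = 1)
}
```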

4.2. Experiment Results

In this paper, the hybrid algorithm HICATS, incorporating ICA and TS, is used to perform feature selection on gene expression data. TS is embedded in ICA to prevent the method from getting trapped in local optima, while the population-based search of ICA provides promising starting points for TS, improving performance and speeding up its convergence.

The experimental results, including the classification accuracy and the number of selected feature genes obtained by HICATS over 10 independent runs on the 10 datasets (11_Tumors, 9_Tumors, SRBCT, Leukemia 1, Leukemia 2, DLBCL, Prostate_Tumor, Lung_Cancer, Brain_Tumors 1, and Brain_Tumors 2), are shown in Tables 4, 5, and 6. The classification accuracy of HICATS reaches 100% with fewer than 10 informative genes for Leukemia 1, Leukemia 2, and DLBCL, and with fewer than 20 selected genes for SRBCT. The average classification accuracy is above 92.22% for all datasets except 9_Tumors. In other words, HICATS can efficiently select informative genes from high-dimensional, binary-class, or multiclass datasets for classification. For all best classification results, fewer than 10 genes are selected except for 9_Tumors and 11_Tumors, while for the average classification results the informative subsets also contain fewer than 10 genes except for the 9_Tumors, 11_Tumors, and SRBCT datasets. Furthermore, the standard deviation is less than 5 for all datasets except 9_Tumors and 11_Tumors. From the classification accuracy and the selected informative genes, it is clear that HICATS is an efficient algorithm for feature selection and produces a near-optimal gene subset from gene expression data.

In order to verify the effectiveness of the proposed algorithm, we first compare the performance of HICATS with that of pure ICA, using SVM as the classifier under the same experimental conditions, and then compare HICATS with other optimization algorithms on several benchmark classification datasets. The comparison results, including the optimal classification accuracy and the number of selected genes obtained by HICATS and ICA, are given in Table 7. The only difference between these two algorithms is whether the local search mechanism TS is included. It is quite clear that HICATS performs better than the original ICA on all datasets; hence, ICA combined with TS can effectively jump out of local optima and HICATS achieves better performance. Table 8 compares the results obtained by other approaches from the literature with those of the proposed method HICATS. Various methods, including non-SVM and MC-SVM classifiers, were used for comparison, and the results listed in Table 8 were taken from Chuang et al. [9]. Non-SVM methods contain the k-nearest neighbor method [9, 14, 15], backpropagation neural networks [16], and probabilistic neural networks [17]. MC-SVM methods include one-versus-one and one-versus-the-rest [18], DAG-SVM [19], the method by Weston and Watkins [20, 21], and the method by Crammer and Singer [20, 22]. Our proposed approach HICATS obtained the highest classification accuracy on all 10 benchmark datasets. The average highest classification accuracies of non-SVM, MC-SVM, and HICATS are 97.14, 93.63, and 97.81, respectively. For the Leukemia 1, Leukemia 2, SRBCT, and DLBCL datasets, the classification accuracy reaches 100%.

The average classification accuracies of HICATS and IBPSO appear to be the same; however, the numbers of genes selected by HICATS are significantly smaller than those of IBPSO listed in Table 9, because the dimension reduction mechanisms differ. The IBPSO algorithm mainly uses the value of a sigmoid function to determine whether a gene is selected. In the initial iteration, the probabilities of 0 and 1 are both 0.5 under a standard sigmoid function without any constraint or modification. In subsequent iterations these probabilities are influenced by the velocity vectors, yet they remain largely unchanged when applied to gene expression data because of its high dimensionality and large search space; as a result, roughly half of the total number of genes remain selected when the standard sigmoid function is used on high-dimensional data. Therefore, Mohamad et al. [13] introduced a modified sigmoid function to increase the probability that a bit in a particle's position is zero and thus minimize the number of selected genes. In our proposed algorithm, randomly generated binary templates are used to reduce the number of selected genes in each generation, while the assimilation mechanism lets the colonies learn many different genes from their imperialist. Hence, the convergence is very fast and the difference in the number of selected genes between HICATS and IBPSO is large.

The convergence graphs of the best and average classification accuracy obtained by HICATS for 9_Tumors, 11_Tumors, SRBCT, and DLBCL are shown in Figures 6 and 7. The best classification accuracy reaches 100% in fewer than 10 iterations for SRBCT and between 10 and 20 iterations for DLBCL. Therefore, HICATS possesses a fast convergence speed and reaches the optimal solution rapidly.

5. Conclusions

In this paper, a hybrid algorithm, HICATS, incorporating a binary imperialist competition algorithm and tabu search is used to perform feature selection, and an SVM with the one-versus-the-rest strategy serves as the evaluator of HICATS for gene expression classification problems. This work effectively combines the advantages of two different search mechanisms to obtain higher classification accuracy on gene expression data. In general, the classification performance of HICATS is as good as that of IBPSO, while HICATS is superior to IBPSO and other methods in terms of the number of selected genes. In our algorithm, to avoid premature convergence of the imperialists, a local search strategy, TS, is embedded in ICA and applied to the imperialist of each empire, exploiting the neighbors of the imperialist to speed up convergence and assist the imperialist's evolution. Experimental results show that our method effectively classifies the samples with a reduced set of feature genes. In future work, the imperialist competition algorithm will be combined with other intelligent search strategies to select informative genes.

Competing Interests

The authors declare that they have no competing interests.