An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples

Background Machine learning is a machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are involved. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications, since it improves the diagnostic accuracy of cancer and supports the successful completion of cancer treatment. Hence, machine learning techniques are widely used in cancer detection and prognosis. Methods In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed, which can handle the class imbalance problem and the high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data, applying an oversampling procedure to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then, for every sub-dataset, a base classifier is constructed separately, and finally the predictions of these base classifiers are combined using the majority voting technique, forming the MFSAC-based ensemble classifier. In addition, a number of the most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets. Results To assess the performance of the proposed MFSAC-EC model, it is applied to different high-dimensional microarray gene expression datasets for cancer sample classification.
The proposed model is compared with well-known existing models to establish its effectiveness. The experimental results show that the generalization performance/testing accuracy of the proposed classifier is significantly better than that of other well-known existing models. Apart from that, the proposed model can also identify many important attributes/biomarker genes.


INTRODUCTION
Cancer is one of the most fatal diseases around the globe (Tabares-Soto et al., 2020; Hambali, Oladele & Adewole, 2020). According to the World Health Organization, cancer is the second most deadly disease, and an estimated 9.7 million deaths around the world in 2018 occurred due to it (Hambali, Oladele & Adewole, 2020). Generally, one in every six deaths worldwide occurs due to cancer, and by 2030 the number of new cancer patients per year is expected to rise to approximately 25 million (Hambali, Oladele & Adewole, 2020; NIH, 2019). Although several advanced techniques have already been developed for the detection of cancer, the prognosis of cancer patients is, to date, very poor and the survival rate is very low (Tabares-Soto et al., 2020; Hambali, Oladele & Adewole, 2020; Kourou et al., 2015). It has already been found that, for accurate cancer sample classification or prediction, adequate information is not available from the clinical, environmental, and behavioral characteristics of patients (Kourou et al., 2015; Hambali, Oladele & Adewole, 2020; Tabares-Soto et al., 2020). Recently, analyses of different types of bio-molecular data have revealed several genetic disorders with distinct biological characteristics, which are very helpful for the early identification and prognosis of cancer and for discerning the responses to different types of treatment (Colozza et al., 2005; Greller & Tobin, 1999; Li, Xie & Liu, 2018; Liu et al., 2011; Pilling, Henderson & Gardner, 2017; Su et al., 2001; Swan et al., 2013).
With the rapid advancements in genomic, proteomic, and imaging high-throughput technologies (Colozza et al., 2005; Greller & Tobin, 1999; Li, Xie & Liu, 2018; Liu et al., 2011; Pilling, Henderson & Gardner, 2017; Su et al., 2001; Swan et al., 2013), it is now possible to accumulate a huge amount (on the order of thousands) of different bio-molecular measurements per patient. Using this information, researchers have been trying to develop more advanced techniques for the early detection and proper prognosis of cancer, and to improve cancer therapy so as to raise patients' survival rates. Lab-based approaches are not adequate for analyzing such a huge amount of information, as they are costly and time-consuming. So, computational or in-silico methods such as statistical methods, machine learning, and deep learning have been used extensively in this field.
It is a well-known fact that in cancer-causing cells, genes are either overexpressed or underexpressed (Tabares-Soto et al., 2020). So, measuring gene expression in cancer cells can provide adequate information to improve cancer diagnostic procedures.
Nowadays, different developing countries use this procedure for cancer sample detection. Using DNA microarray technology, it is possible to measure the expression levels of a large number of genes simultaneously in a single experiment/sample. The outcome of DNA microarray technology is a gene expression data matrix, which carries information about the expression levels of a huge number of genes for a limited number of samples (such as diseased patient samples and normal samples). The limited number of samples in this data matrix is due to the lack of sample availability. So, cancer sample classification based on the gene expression data matrix is one of the essential tasks in the field of cancer research (Chin et al., 2016; Dashtban & Balafar, 2017; Ding & Peng, 2005; Elyasigomari et al., 2017; Furey et al., 2000; Golub et al., 1999; Nada & Alshamlan, 2019; Tabares-Soto et al., 2020).
Using computational or in-silico approaches, the gene expression-based cancer sample classification task has been reviewed extensively in different papers (Chin et al., 2016; Dashtban & Balafar, 2017; Ding & Peng, 2005; Elyasigomari et al., 2017; Furey et al., 2000; Golub et al., 1999; Nada & Alshamlan, 2019; Tabares-Soto et al., 2020). However, the main difficulties in the sample classification task arise from several factors. First, in these datasets, a substantially small number of samples (generally on the order of hundreds) is available compared to the huge number of genes (generally on the order of thousands) (Chin et al., 2016; Hambali, Oladele & Adewole, 2020; Nada & Alshamlan, 2019). For sample classification, genes are treated as features/attributes, so the high-dimensional gene space is an overhead for most classification algorithms. Second, only a very few genes are informative (differentially expressed), while the rest are non-informative (noisy) (Chin et al., 2016; Hambali, Oladele & Adewole, 2020; Nada & Alshamlan, 2019) for sample classification and are responsible for degrading classifier performance. Gene dimension reduction, by identifying informative genes as biomarkers, can improve the classification accuracy of classifiers. Apart from improving classification accuracy, the identification of informative biomarkers (here, informative genes) has great prospects from a biomedical point of view: biomarkers are beneficial for finding the biological cause of a disorder, assessing disease risk, and developing therapeutic targets. The third problem arises from the small sample size, which creates an overfitting problem in classifier construction. Another problem that degrades classifier performance is the sample class imbalance problem, which occurs when one class (the majority class) has many more instances/samples than the other class(es) (the minority class) in a dataset.
A fairly large number of works have already been developed for sample classification, and they fall into two categories. In the first category (Chin et al., 2016; Hambali, Oladele & Adewole, 2020; Nada & Alshamlan, 2019), the major emphasis is on selecting relevant genes to reduce the feature space. Based on this reduced feature space, the predictive/classification accuracy for the samples is then measured using different existing single classification models such as naïve Bayes, support vector machine, relevance vector machine, K-nearest neighbor, decision tree, and logistic regression. As gene selection is a feature selection task, these methods are divided, according to the feature selection technique used, into (1) filter methods, (2) wrapper methods, (3) embedded methods, and (4) hybrid methods. Before describing the second category of classification methods, let us first elaborate on the first-category methods one by one.
Filter methods (Chin et al., 2016; Hambali, Oladele & Adewole, 2020; Nada & Alshamlan, 2019) select a subset of features without taking any information from any classification model. These methods select features that are differentially expressed with respect to the sample class labels. Filter methods rank individual features according to their class discrimination power based on some statistical score function, and then select a number of high-ranked features to form a reduced and relevant feature subset. Popular statistical score functions used in filter methods are Fisher's score, Signal-to-Noise Ratio (SNR), the correlation coefficient, mutual information, Relief (Das et al., 2019), etc. Filter methods are computationally simple, fast, and unbiased toward any specific classifier, as they take no knowledge from any classifier during the feature selection phase. Their drawback is that the number of selected features must be determined by trial and error.
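As a concrete illustration of a filter score function, the sketch below ranks genes by SNR for a two-class dataset. This is a minimal sketch assuming NumPy; the names (`snr_scores`, `top_k_features`, `X`, `y`) are illustrative and not taken from the paper, whose actual seven filters are summarized in Table S1.

```python
import numpy as np

def snr_scores(X, y):
    """Signal-to-Noise Ratio score per feature for a two-class dataset.

    X : (samples, features) expression matrix; y : binary class labels.
    SNR = |mean_1 - mean_2| / (std_1 + std_2), computed per feature.
    """
    classes = np.unique(y)
    assert len(classes) == 2, "SNR as defined here handles two classes"
    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    num = np.abs(X1.mean(axis=0) - X2.mean(axis=0))
    den = X1.std(axis=0) + X2.std(axis=0) + 1e-12  # guard against zero spread
    return num / den

def top_k_features(X, y, k):
    """Indices of the k highest-SNR features (a filter-style selection)."""
    return np.argsort(snr_scores(X, y))[::-1][:k]
```

A feature whose class-conditional means are well separated relative to its spread gets a high score, which is exactly the "differential expression" criterion filter methods exploit.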
Wrapper methods (Chin et al., 2016; Hambali, Oladele & Adewole, 2020; Nada & Alshamlan, 2019), on the other hand, judge the discrimination capability of a feature subset using the classification error rate or prediction accuracy of a classifier as the feature evaluation function. They select the most discriminative feature subset by minimizing the classification error rate or maximizing the classification accuracy of a classifier. Wrapper methods generally achieve better classification accuracy than filter methods because the selection of the feature subset is classifier-dependent. One drawback of these methods is that they are biased toward the classifier used; another is that they are computationally more expensive than filter methods, since generating the best feature subset for a high-dimensional dataset is an NP-complete problem. For these reasons, wrapper methods are not applicable to high-dimensional datasets.
In embedded methods (Chin et al., 2016; Hambali, Oladele & Adewole, 2020; Nada & Alshamlan, 2019), the optimal feature subset is selected through the learning procedure of a specific classifier during classifier construction; that is, feature subset selection is embedded as part of building the classifier. These methods are faster than wrapper methods but are biased toward the specific classifier: the feature selection process is specific to one classifier and is not applicable to others. They are also computationally expensive, so they too are not applicable to high-dimensional datasets. Recently, hybrid feature selection methods (Chin et al., 2016; Hambali, Oladele & Adewole, 2020; Nada & Alshamlan, 2019) have also been developed, in which methods from different categories are combined to take advantage of each of them and improve classification accuracy.
Apart from these methods, clustering techniques (Chin et al., 2016; Hambali, Oladele & Adewole, 2020) are also used for feature selection purposes. Clustering techniques divide the data space in such a manner that objects in the same cluster are similar while objects in different clusters are dissimilar. For the feature selection task, clustering methods (known as attribute clustering in the feature selection domain) (Au et al., 2005) divide the features into several distinct clusters and then reduce the feature dimension by selecting a small number of significant features from each cluster. Many unsupervised gene (attribute) clustering algorithms (Au et al., 2005; Chin et al., 2016; Hambali, Oladele & Adewole, 2020) have already been developed for this task. However, these methods fail to find informative functional groups of genes for sample classification, since no supervised information from the sample classes is considered when clustering the genes (Au et al., 2005; Chin et al., 2016; Hambali, Oladele & Adewole, 2020). So, scientists have developed a number of supervised gene (attribute) clustering algorithms (Dettling & Buhlmann, 2002; Hastie et al., 2000; Hastie et al., 2001; Maji & Das, 2012) in which genes are grouped using supervised information from the sample classes, and a reduced gene set is formed by selecting the most informative genes from each cluster.
All the above-mentioned variants deliver comparable feature selection and classification accuracy. Quite often, such classification models, with only a few genes and a limited number of training samples, can classify the majority of training samples correctly, but their generalization capability cannot be guaranteed (Bolón-Canedo, Sánchez-Maroño & Alonso-Betanzos, 2012; Ghorai et al., 2011; Nagi & Bhattacharyya, 2013; Wang, 2006; Wang, Li & Fang, 2012; Yang et al., 2010). The most important task for a medical diagnosis system is to improve the classification accuracy on unknown samples (generalization performance), which this type of classification model cannot achieve.
Apart from this problem, microarray data carry several uncertainties arising from the fabrication, hybridization, and image processing procedures of microarray technology. These uncertainties introduce various types of noise into microarray data. Due to the presence of these uncertainties, combined with the limited number of training samples, conventional machine learning approaches face challenges in developing reliable classification models.
To overcome the above-mentioned problems, it is essential to develop general and robust methods. In this regard, researchers have been motivated to develop the second category of models: the different robust ensemble classification models (Bolón-Canedo, Sánchez-Maroño & Alonso-Betanzos, 2012; Ghorai et al., 2011; Nagi & Bhattacharyya, 2013; Osareh & Bita, 2013; Wang, 2006; Wang, Li & Fang, 2012; Yang et al., 2010), which can overcome the small sample size problem and are capable of handling the uncertainties of gene expression data.
Ensemble methods (Dietterich, 2000) are a class of machine learning techniques that combine multiple base learning algorithms to produce one optimal predictive model. An ensemble classification model refers to a group of individual/base classifiers that are trained individually on the training dataset in a supervised classification system; finally, an aggregation method combines the decisions produced by the base classifiers. Ensemble classification models can alleviate the small sample size problem by applying multiple classification models to the same training data, or to bootstrapped samples (sampling with replacement) of the training data, decreasing the chance of overfitting to the training data. In this way, the training dataset is utilized more efficiently and, as a consequence, the generalization ability is improved.
Although different ensemble classification models exist in the literature, they are not capable of addressing all the above-mentioned problems of microarray data (small sample size, high-dimensional feature space, and sample class imbalance) at once.
In this regard, a new Multiple Filtering and Supervised Attribute Clustering algorithm-based ensemble classification model named MFSAC-EC is proposed here. In this model, first, a number of bootstrapped versions of the original training dataset are created. During the creation of the bootstrapped versions, an oversampling technique (Błaszczyński, Stefanowski & Idkowiak, 2013) is adopted to solve the class imbalance problem. For every bootstrapped dataset, a number of sub-datasets (each with a subset of genes) are generated using the proposed MFSAC method. MFSAC is a hybrid method combining multiple filters with a new supervised attribute clustering method. Then, for every sub-dataset, a base classifier is constructed. Finally, based on the predictions of all these base classifiers over all sub-datasets of all bootstrapped datasets, an ensemble classifier (EC) is formed using the majority voting technique.
The novelty of the proposed MFSAC-EC model is that emphasis is placed simultaneously on the high dimensionality of gene expression data, the small sample size problem, and the class imbalance problem; no existing ensemble classification model considers all of these problems at the same time. First, due to the use of the bootstrapping method with a class balancing strategy, the proposed model can handle the small sample size and overfitting problems. Second, in MFSAC, different filter methods with their own unique characteristics are used, so relevant gene subsets reflecting different characteristics are selected via the different filters to form different sub-datasets from every bootstrapped dataset. Finally, every gene subset is refined using a supervised attribute clustering algorithm. In this way, the high-dimensionality problem of gene expression data is handled. Apart from this, across the MFSAC-generated sub-datasets, the frequency of occurrence of every gene is counted and informative genes are ranked accordingly. The prediction capability of the proposed model is evaluated on different microarray datasets and compared with existing well-known models. The experimental results demonstrate the superiority of the proposed model over existing models.

MATERIALS & METHODS
The proposed MFSAC-EC model is composed of different filter score functions, a new supervised attribute clustering method, and an ensemble classification method. In the following subsections, first, a brief overview is given on different filter score functions and then the proposed MFSAC-EC model is described.

Preliminaries
In this paper, a dataset (here, a microarray gene expression dataset) is represented by a data matrix K_{U×V} with U data objects (samples) and V features (genes). The set of objects or samples is represented as E = {E_1, E_2, ..., E_s, ..., E_U}, while the set of genes is represented as G = {G_1, G_2, ..., G_t, ..., G_V}. Each sample is a V-dimensional feature vector containing V gene expression values. Similarly, every gene is a U-dimensional vector containing U sample values. Here, C_{U×1} is a class vector representing the associated class label of every sample. Each class label is taken from a set DC = {d_1, d_2, ..., d_j, ..., d_N} with N distinct class labels.
Brief overview of filter score functions used in MFSAC
The filter score functions used in the proposed MFSAC-EC model are the modified Fisher score (Gu, Li & Han, 2011), modified T-test (Zhou & Wang, 2007), Chi-square (Das et al., 2019), mutual information (Das et al., 2019), Pearson correlation coefficient (Leung & Hung, 2010), SNR (Leung & Hung, 2010), and Relief-F (Das et al., 2019). A summary of these seven filters as used in the MFSAC-EC model is given in Table S1.

Proposed MFSAC-EC model
In the proposed MFSAC-EC model, bootstrapping (sampling with replacement) combined with a class balancing procedure is first applied to the training dataset K to create D different bootstrapped versions of it. Every bootstrapped dataset with U samples is formed by randomly sampling with replacement U times from the original dataset K. After that, an oversampling procedure is applied to each minority class to achieve class balance. Oversampling increases the minority-class instances by random replication until the cardinalities of the minority and majority classes are exactly balanced in each bootstrapped dataset; consequently, each bootstrapped dataset contains more instances than the original dataset. The MFSAC method of the MFSAC-EC model, which integrates multiple filters with a new supervised attribute (gene) clustering method, is then applied to every newly created bootstrapped training dataset BK_l. The proposed MFSAC method first calculates the class relevance score of every gene in the bootstrapped training dataset using each filter score function FT_x, x = 1 to 7, mentioned above. Then, for each filter score function, a sub-dataset SD_lx with a gene subset GS_lx is created by selecting a predefined number (say P) of the most relevant genes from the full gene set G, so that |GS_lx| = P. After that, the SAC (Supervised Attribute Clustering) method is applied to the gene subset GS_lx of every sub-dataset SD_lx, producing a set of clusters CGS_lx and corresponding cluster representatives (treated as modified features). Finally, the Q most relevant cluster representatives are selected as modified features, and a reduced sub-dataset RSD_lx of the sub-dataset SD_lx is formed. How the SAC method works on GS_lx of every sub-dataset SD_lx is discussed below.
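The bootstrap-then-oversample step described above can be sketched as follows. This is a minimal sketch assuming NumPy; the function name `balanced_bootstrap` and its signature are illustrative, not from the paper, and it assumes each class survives the initial resampling.

```python
import numpy as np

def balanced_bootstrap(X, y, rng):
    """One bootstrapped dataset with minority-class oversampling.

    First draw U samples with replacement (U = original sample count),
    then replicate random minority-class instances until every class
    present matches the majority-class cardinality.
    """
    U = len(y)
    idx = rng.integers(0, U, size=U)          # sampling with replacement
    bx, by = X[idx], y[idx]
    classes, counts = np.unique(by, return_counts=True)
    target = counts.max()
    parts_x, parts_y = [bx], [by]
    for c, n in zip(classes, counts):
        if n < target:                         # minority class: oversample
            pool = np.where(by == c)[0]
            extra = rng.choice(pool, size=target - n, replace=True)
            parts_x.append(bx[extra])
            parts_y.append(by[extra])
    return np.concatenate(parts_x), np.concatenate(parts_y)
```

Because the minority classes are padded up to the majority-class size, the returned dataset is at least as large as the original one, matching the observation that each bootstrapped dataset contains more instances than the original.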
For any sub-dataset SD_lx, the SAC method starts by selecting the gene from the subset GS_lx with the highest FT_x value. Let gene G_li in GS_lx with the highest FT_x value (say FT_x(G_li, C) = A) be selected as the first member; it also becomes the initial cluster representative R (R = G_li) of the first cluster C1_GSlx, and G_li is deleted from GS_lx. In effect, G_li belongs to C1_GSlx, GS_lx = GS_lx - {G_li}, and FT_x(R, C) = A. This cluster is then grown in parallel with a cluster representative refinement process, described next. In this process, the gene (say G_lm) with the next highest FT_x value is taken from the GS_lx subset and merged with the current cluster representative R. The merging is done in two ways. First, the expression profile of G_lm is directly added to R, forming a temporary augmented representative TR+, and its FT_x value (say B1) is calculated. Second, the sign-flipped expression profile of G_lm is added to R, forming another temporary augmented representative TR-, and its FT_x value (say B2) is calculated. The augmented representative with the larger FT_x value is chosen, and if that value exceeds the FT_x value of R, then R is replaced by the chosen augmented representative. If R is modified, the gene G_lm is included in the cluster and deleted from GS_lx; in effect, G_lm belongs to C1_GSlx and GS_lx = GS_lx - {G_lm}. So, the next chosen gene is included in the current cluster only if it improves the class relevance value of the current cluster representative. The merging process is described in Fig. 1.
Here g0 represents the current cluster representative R, and its class relevance score (FT_x(R, C), here the Pearson score) is shown. Among the genes g1, g2, g3, g4, and g5, the Pearson score of g1 is the highest, so g1 is chosen for the merging process. g1 is added to R to create the temporary augmented representative TR+ = R + g1, and its sign-flipped value is added to R to form the temporary augmented representative TR- = R - g1. The Pearson score of TR+ is greater than that of TR-, so TR+ is chosen. The Pearson score of TR+ is also greater than that of R, so TR+ becomes the current cluster representative: R = TR+. This process continues for the remaining genes. Next, g3 is chosen as the gene with the next highest Pearson score. g3 and its sign-flipped value are added individually to the current cluster representative R to form TR+ = R + g3 and TR- = R - g3, respectively. In this case, the Pearson score of TR- is greater than that of TR+, so TR- is chosen. The Pearson score of TR- is then compared with that of R; here it is greater, so TR- becomes the current cluster representative: R = TR-. In this way, the cluster representative is refined. The process is repeated for every member of the GS_lx subset.
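The refinement walk illustrated above can be sketched in code. This is a simplified sketch, assuming NumPy and using the absolute Pearson correlation with the class vector as the relevance score (as in the worked example); the names `pearson_relevance` and `grow_cluster` are illustrative, not from the paper, and only a single cluster is grown.

```python
import numpy as np

def pearson_relevance(g, c):
    """Absolute Pearson correlation of a profile with the class vector."""
    return abs(np.corrcoef(g, c)[0, 1])

def grow_cluster(genes, c):
    """Grow one cluster with sign-flip-aware representative refinement.

    `genes` maps gene name -> expression vector. Candidates are visited in
    decreasing relevance order; each is added to (TR+) or sign-flipped and
    added to (TR-) the current representative R, and the better augmented
    representative replaces R only if it raises the relevance score.
    """
    names = sorted(genes, key=lambda n: pearson_relevance(genes[n], c),
                   reverse=True)
    R = genes[names[0]].astype(float)   # highest-scoring gene starts the cluster
    members = [names[0]]
    for name in names[1:]:
        tr_plus, tr_minus = R + genes[name], R - genes[name]
        best = max((tr_plus, tr_minus),
                   key=lambda v: pearson_relevance(v, c))
        if pearson_relevance(best, c) > pearson_relevance(R, c):
            R, members = best, members + [name]   # gene joins the cluster
    return R, members
```

The sign-flip branch is what lets negatively correlated (inversely expressed) genes reinforce the representative instead of cancelling it out.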
After the formation of the first cluster, its corresponding augmented representative R is assigned to AR_lx1 (that is, AR_lx1 = R), and the supervised clustering process is repeated to form the second cluster, starting with the gene (say G_lz) with the next highest FT_x value in the GS_lx subset. In this way, a set of clusters CGS_lx = {C1_GSlx, C2_GSlx, ..., Ck_GSlx, ...} and their corresponding augmented cluster representatives AR_lx = {AR_lx1, ..., AR_lxk, ...} are formed. Then the Q most powerful augmented cluster representatives are chosen (as modified features) according to their FT_x values from the generated clusters, and with these Q modified features a reduced sub-dataset RSD_lx of the sub-dataset SD_lx is formed.
In this way, for every bootstrapped version BK_l of the training dataset, seven RSD_lx sub-datasets are created, and for every RSD_lx an individual classifier is constructed using any existing classifier. Finally, an ensemble classifier (EC) is formed by combining all these classifiers across all bootstrapped versions using the majority voting technique. To classify a sample with this ensemble classifier, each base classifier votes for (classifies the sample into) a particular class, and the class receiving the highest number of votes is taken as the output class.
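The majority-voting aggregation just described can be sketched as follows. This is a minimal sketch; base classifiers are modeled as plain callables, the names are illustrative, and ties are broken by first-seen order, a detail the paper does not specify.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine base-classifier outputs for one sample by majority voting.

    `predictions` is the list of class labels emitted by the base
    classifiers; the most frequent label wins.
    """
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, samples):
    """Label each sample by majority vote across all base classifiers."""
    return [majority_vote([clf(s) for clf in classifiers])
            for s in samples]
```

In the MFSAC-EC setting, the classifier list would hold one trained base classifier per reduced sub-dataset per bootstrapped version.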

MFSAC method based informative attribute ranking
For every gene (feature/attribute), the frequency of occurrence, i.e., the total number of times it appears across all the sub-datasets generated by the MFSAC method over all bootstrapped versions, is calculated. The genes are then ordered or ranked according to this frequency of occurrence, and the top-ranked genes with the highest frequencies are considered the most informative cancer-related genes.
The block diagram of the proposed MFSAC-EC model is shown in Fig. 2, while the block diagram of the MFSAC method is shown in Fig. 3. The algorithm of the proposed model is described below.
Algorithm: MFSAC-EC
Input: A data matrix K_{U×V} (here, a gene expression data matrix) containing U data objects (here, cancer samples) and V attributes (here, genes).
Output: An ensemble classifier MFSAC-EC for classifying test samples. From the MFSAC-generated sub-datasets, informative genes are selected according to their rank, where every gene is ranked by its frequency of occurrence.
Definitions:
BK = {BK_1, ..., BK_l, ..., BK_D} is the set of bootstrapped versions of the original training dataset. In every bootstrapped dataset the number of samples may differ from the original dataset, but the number of features is the same.
C_{U×1} is a class vector representing the associated class label of every sample. For a data matrix, N distinct class labels exist, taken from a set DC = {d_1, d_2, ..., d_N}.
FT_x(G_t, C) is the x-th filter score function, which returns the class relevance value of gene G_t with respect to class vector C, for x = 1 to 7, as seven filter score functions are used here.
GS_lx is the set of top-ranked genes of G selected using the FT_x score function, and SD_lx is the corresponding sub-dataset of BK_l; SD_lx is a data matrix containing P genes.
CGS_lx = {C1_GSlx, ..., Ck_GSlx, ...} and AR_lx = {AR_lx1, ..., AR_lxk, ...} are the set of clusters and corresponding cluster representatives, respectively, generated from the subset GS_lx of SD_lx. Every AR_lxk is a vector, and TR+, TR-, and R are vectors similar to a gene vector.
For every bootstrapped dataset BK_l, a set of seven reduced sub-datasets RSD_lx, each containing the Q most relevant cluster representatives, is formed, along with a corresponding set of seven base classifiers.
Steps:
1. Create D bootstrapped versions of the training dataset K, applying oversampling to balance the classes.
2. For every bootstrapped dataset BK_l, repeat step 3.
3. For each filter score function FT_x, x = 1 to 7:
A. Compute FT_x(G_t, C) for every gene G_t in G with respect to class vector C.
B. Select the P top-ranked genes from G based on the FT_x score function and form the GS_lx gene subset with the corresponding SD_lx sub-dataset.
C. Apply the SAC method on GS_lx: repeatedly take the gene G_lj in GS_lx with the next highest FT_x value and
I. Compute the first augmented representative TR+ by adding G_lj to the current representative R, i.e., TR+ = R + G_lj.
II. Compute the second augmented representative TR- by adding the sign-flipped version of G_lj to R, i.e., TR- = R - G_lj.
III. If the better of TR+ and TR- improves the FT_x value of R, set R to it, add G_lj to the current cluster Ck_GSlx, and delete G_lj from GS_lx.
D. Select the Q most relevant cluster representatives to form the reduced sub-dataset RSD_lx and construct a base classifier on RSD_lx.
4. Combine all base classifiers of all bootstrapped datasets into the ensemble classifier MFSAC-EC using majority voting.
5. Rank every gene by its frequency of occurrence across all MFSAC-generated sub-datasets and report the top-ranked genes as the most informative.

Description and preprocessing of the datasets
The experimentation has been carried out on ten publicly available gene expression binary-class and multi-class datasets. Among these, eight are cancer datasets and two are arthritis datasets. The eight cancer datasets are Leukemia (Golub et al., 1999), Colon (Alon et al., 1999), Prostate (Singh et al., 2002), Lung (Gordon et al., 2002), RBreast (Veer et al., 2002), Breast (West et al., 2001), MLL (Armstrong et al., 2001), and SRBCT (Khan et al., 2001). To show the accuracy of the proposed model on data other than cancer datasets, two arthritis datasets, RAHC (Van der Pouw Kraan et al., 2003) and RAOA (van der Pouw Kraan et al., 2007), are also considered. A summary of the datasets is presented in Table 1.
In the Leukemia dataset (Golub et al., 1999), the gene expression data matrix is prepared using Affymetrix oligonucleotide arrays. The original dataset consists of a training dataset and a testing dataset. The training dataset consists of 38 samples (27 Acute Lymphoblastic Leukemia (ALL) and 11 Acute Myeloid Leukemia (AML)), while the test dataset consists of 34 samples (20 ALL and 14 AML), each with 7,129 probes from 6,817 genes. For the Leukemia dataset, the training and test datasets are merged here and genes with missing values are removed; finally, a dataset with 7,070 genes and 72 samples is prepared.
In the Colon cancer dataset (Alon et al., 1999), gene expression of 6,500 genes for 62 samples is measured using Affymetrix oligonucleotide arrays. Among these 62 samples, 40 are Colon cancer samples and 22 are normal samples. Among these 6,500 genes, 2,000 genes are selected based on the confidence of measured expression levels.
The Prostate cancer dataset (Singh et al., 2002) also consists of training and testing datasets. In the training dataset, among 102 samples, 50 are normal samples and 52 are prostate cancer samples. In the test dataset, among 34 samples, 25 are prostate cancer samples and 9 are normal prostate samples. The gene expression of every sample is measured with respect to 12,600 genes using Affymetrix chips. Here, the training and test datasets are merged, and a dataset with 12,600 genes and 136 samples is formed.
The Lung cancer dataset (Gordon et al., 2002) consists of 181 samples, of which 31 are malignant pleural mesothelioma and the remaining 150 are lung adenocarcinoma. Each sample is represented by 12,533 genes, and the gene expression of every sample is measured using Affymetrix human U95A oligonucleotide probe arrays. In the RBreast dataset (Veer et al., 2002), patients who developed metastases within 5 years of the initial diagnosis fall under the relapse category, and the rest under non-relapse. A total of 97 samples are provided, in which 46 patients developed distant metastases within 5 years and are considered relapse, while the remaining patients stayed healthy and are labeled non-relapse. This dataset comprises 24,481 genes, of which 293 are removed.
In the Breast cancer dataset (West et al., 2001), the gene expression of 49 samples is measured using Affymetrix HuGeneFL microarrays. Breast tumors are classified as positive or negative according to the presence or absence of estrogen receptors (ER). In this dataset, 25 samples are ER+ tumors and 24 samples are ER− tumors.
The MLL dataset (Armstrong et al., 2001) comprises a training set of 57 leukemia samples (20 ALL, 17 MLL, and 20 AML) and a test set of 4 ALL, 3 MLL, and 8 AML samples. For the MLL dataset, the training and test datasets are merged here, and finally a dataset with 12,582 genes and 72 samples is prepared.
[Figure 3 caption: SD11…SD17 are sub-datasets created after applying filter score functions; SAC is the supervised attribute clustering method applied to generate the reduced sub-datasets RSD11…RSD17. See also Table S1. Full-size DOI: 10.7717/peerj-cs.671/fig-3.]
The SRBCT dataset (Khan et al., 2001) comprises cDNA microarray gene expression data for identifying small round blue-cell tumors (SRBCT) of childhood. Its samples are divided into four classes: neuroblastoma, rhabdomyosarcoma, non-Hodgkin lymphoma, and the Ewing family of tumors. A training set of 63 SRBCT tissues and a test set of 20 SRBCT and 5 non-SRBCT samples are available; here only the training set is considered. Each tissue sample is already standardized to zero mean and unit variance across the genes.
The RAHC (Rheumatoid Arthritis versus Healthy Controls) dataset (Van der Pouw Kraan et al., 2003) comprises the gene expression of peripheral blood cells from 32 patients with RA, three patients with probable RA, and 15 age- and sex-matched healthy controls, measured on microarrays of 46,000 elements with a complexity of 26,000 unique genes.
The RAOA (Rheumatoid Arthritis versus Osteoarthritis) dataset (van der Pouw Kraan et al., 2007) includes the gene expression of thirty patients, 21 with RA and the remaining 9 with OA. The Cy5-labeled experimental cDNA and the Cy3-labeled common reference sample were pooled and hybridized to lymphochips (consisting of 18,000 cDNA spots representing genes of immunological relevance).

RESULTS
To assess the performance of the proposed MFSAC-EC model, four well-known existing classifiers, K-Nearest Neighbor (Duda, Hart & Stork, 1999), Naive Bayes (Duda, Hart & Stork, 1999), Support Vector Machine (Vapnik, 1995), and Decision Tree (C4.5) (Duda, Hart & Stork, 1999), are applied independently in this model, forming four different ensemble classification models. To demonstrate the superiority of the proposed model, it is compared with the well-known filter methods used here, with recognized gene selection methods (Ding & Peng, 2005; Au et al., 2005; Maji & Das, 2012), and with different existing ensemble classifiers (Bolón-Canedo, Sánchez-Maroño & Alonso-Betanzos, 2012; Nagi & Bhattacharyya, 2013; Osareh & Bita, 2013; Wang, 2006; Wang, Li & Fanget, 2012). To analyze performance, the methods are applied to different publicly available cancer and other disease-related gene expression datasets. The major metrics used here to evaluate the performance of the proposed classifier are cross-validation (LOOCV, fivefold, and tenfold), ROC curves, and heat maps.

Tools used
The algorithms are implemented using the Python programming language and the Scikit-learn libraries (Pedregosa et al., 2011), which are explained in Komer, Bergstra & Eliasmith (2014) for ML algorithms. The programs are executed on the online Colab platform with 12 GB RAM and an Intel(R) Xeon(R) processor, available in the "CPU" runtime type at the time of writing. Figures and tables are generated with the Matplotlib library (Hunter, 2007) and Microsoft Excel. The Python code used here is available at https://github.com/NSECResearchCD-SLB/PEERJ_MFSAC_EC.
In the following subsections, first the different metrics used here are discussed, and then the performance of the proposed MFSAC-EC model is verified with respect to these metrics. This is followed by a comparison of the classification performance of the proposed model with different existing methods in terms of tenfold cross-validation. The proposed model not only performs classification but also ranks every attribute or gene in descending order of its information content in the dataset. To show the effectiveness of this ranking procedure, the top eight genes from the Colon cancer and Leukemia cancer datasets are presented with their corresponding names, symbols, and references in significant cancer-related journals demonstrating their roles in these cancers.

Evaluation metrics
The performance of the proposed MFSAC-EC classifier is established with respect to the following measures.

Cross-validation method
The first well-known metric used here to evaluate classification performance is the k-fold cross-validation method (Wang, Li & Fanget, 2012). In k-fold cross-validation, the dataset is randomly divided into k folds; k−1 folds are used for training and one fold is used for testing. The process is repeated k times and the average classification accuracy is taken. When k equals the number of samples in the dataset, each fold contains a single sample (the training set size is one less than the number of samples and validation is done on the remaining sample); this is the leave-one-out cross-validation method (LOOCV). When k equals two, the method is known as the holdout method. It has been found that when k is set at a very small value (so each held-out fold is large and the training set is small), the accuracy estimate suffers from high bias but low variance; on the other hand, when k is set at a high value (so each held-out fold is small), the estimate has low bias but high variance. It has been found that the tenfold cross-validation method outperforms the LOOCV method (Breiman & Spector, 1992; Ambroise & McLachlan, 2002; Asyali et al., 2006), and tenfold cross-validation has also been endorsed as a better measure for classification.
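The cross-validation schemes described above can be sketched with scikit-learn. This is an illustrative snippet, not the exact experimental code of the paper: the classifier and the synthetic data are placeholders standing in for a gene expression matrix.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for a (samples x genes) expression matrix.
X, y = make_classification(n_samples=60, n_features=100, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

# Tenfold cross-validation: 9 folds train, 1 fold tests, repeated 10 times.
ten_fold = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("tenfold mean accuracy:", ten_fold.mean())

# LOOCV: k equals the number of samples, so each sample is the test case once.
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo.mean())
```

Note that `LeaveOneOut` produces one score per sample, which is why LOOCV is the most expensive scheme for large datasets.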
In training-testing random splitting, the dataset is randomly partitioned into a training set (two-thirds of the dataset) and a testing set (one-third of the dataset), repeated over 50 runs.
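One run of this random splitting can be sketched as follows, a minimal illustration using scikit-learn's `train_test_split` with stratification so that class proportions are alike in the training and test sets (the data here are synthetic placeholders):

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced placeholder data (roughly 70/30 class proportions).
X, y = make_classification(n_samples=60, n_features=100,
                           weights=[0.7, 0.3], random_state=0)

# One of the 50 runs: a stratified 2/3 train / 1/3 test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
print("train classes:", Counter(y_tr), "test classes:", Counter(y_te))
```

Repeating this with different random seeds gives the 50 runs whose accuracies are aggregated.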

ROC curve analysis
The performance of the proposed classifier on two-class datasets is also judged using Receiver Operating Characteristic (ROC) analysis (Wang, Li & Fanget, 2012), a visual method for evaluating binary classification models. Under this analysis, the following measures are considered to judge the binary classification model.

Classification accuracy (Acc) is defined as Acc = (TP + TN) / (TP + TN + FP + FN). The sensitivity (SN) or True Positive Rate (TPR) is defined as SN = TPR = TP / (TP + FN). The specificity (SP) or True Negative Rate (TNR) is defined as SP = TNR = TN / (TN + FP). The False Positive Rate (FPR) is defined as FPR = FP / (FP + TN). The Positive Predictive Value (PPV) is defined as PPV = TP / (TP + FP). The Negative Predictive Value (NPV) is defined as NPV = TN / (TN + FN). Here TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
The ROC curve is plotted with TPR along the y-axis and FPR along the x-axis. The area under the ROC curve (AUC) represents the performance of the binary classification model: a higher AUC signifies better performance in differentiating positive and negative examples. The AUC ranges from 0 to 1 (0 ≤ AUC ≤ 1).
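Given a binary confusion matrix, the measures above and the AUC can be computed as follows (an illustrative sketch with toy labels and scores, using scikit-learn only for the confusion matrix and AUC):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 0, 1, 0])          # ground-truth labels
y_score = np.array([0.9, 0.8, 0.3, 0.2, 0.4, 0.1, 0.7, 0.6])  # classifier scores
y_pred  = (y_score >= 0.5).astype(int)                 # threshold at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)   # classification accuracy
sn  = tp / (tp + fn)                    # sensitivity / TPR
sp  = tn / (tn + fp)                    # specificity / TNR
fpr = fp / (fp + tn)                    # false positive rate
ppv = tp / (tp + fp)                    # positive predictive value
npv = tn / (tn + fn)                    # negative predictive value
auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
print(acc, sn, sp, fpr, ppv, npv, auc)
```

The AUC is computed from the continuous scores, not the thresholded predictions, which is why it summarizes the whole ROC curve rather than a single operating point.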

Heat map analysis
A heat map is a data representation diagram in which the values of a variable of interest are portrayed over a data matrix. In this matrix, the values of the variable are represented across two axis variables as a grid of colored squares; the axis variables are divided into ranges, and each cell's color represents the intensity of the variable for the corresponding range of axis values.
Here, the performance of the proposed classifier for multi-class datasets is judged using Heat map representation of confusion matrix (Liu et al., 2014), where a confusion matrix is a tabular representation to visualize the performance of a classification model in terms of true positive, true negative, false positive and false negative.
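A heat map of a multi-class confusion matrix can be rendered as follows. This is a generic Matplotlib sketch, not the paper's plotting code; the toy labels stand in for the SRBCT/MLL predictions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Toy 4-class predictions standing in for a multi-class experiment.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 3, 1]
cm = confusion_matrix(y_true, y_pred)

fig, ax = plt.subplots()
im = ax.imshow(cm, cmap="viridis")    # each cell's color encodes its count
fig.colorbar(im)
ax.set_xlabel("Predicted class")
ax.set_ylabel("True class")
for (i, j), v in np.ndenumerate(cm):  # annotate every cell with its count
    ax.text(j, i, str(v), ha="center", va="center", color="white")
fig.savefig("confusion_heatmap.png")
```

The diagonal cells are the correctly classified samples; bright off-diagonal cells immediately reveal which class pairs the model confuses.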

Parameter estimation
Before running MFSAC-EC, the parameters are fixed. In MFSAC-EC, the input training dataset is bootstrapped. The proposed MFSAC-EC model is run varying the number of bootstrapped datasets (D) from 5 to 30, and the classification accuracy of the model remains more or less the same from 10 upwards. So, the number of bootstrapped datasets for every training dataset is set at 10.
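The bootstrapping step with class rebalancing can be sketched as follows. This is a minimal illustration using simple random oversampling of the minority class; the paper's exact oversampling procedure may differ, and the data shapes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_balanced(X, y, rng):
    """Draw one bootstrap sample, then oversample the minority class
    until both classes have equal cardinality."""
    n = len(y)
    idx = rng.integers(0, n, size=n)        # sample n rows with replacement
    Xb, yb = X[idx], y[idx]
    classes, counts = np.unique(yb, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    if deficit > 0:
        pool = np.flatnonzero(yb == minority)
        extra = rng.choice(pool, size=deficit, replace=True)
        Xb = np.vstack([Xb, Xb[extra]])     # duplicate minority rows
        yb = np.concatenate([yb, yb[extra]])
    return Xb, yb

X = rng.normal(size=(62, 2000))             # e.g. the Colon dataset's shape
y = np.array([1] * 40 + [0] * 22)           # 40 tumor vs 22 normal samples
D = [bootstrap_balanced(X, y, rng) for _ in range(10)]  # D = 10 bootstraps
```

Each of the 10 balanced bootstrapped datasets then feeds the MFSAC feature selection step independently.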
In the MFSAC method, initially P genes are selected by each filter method. Table 2 shows the classification accuracy of the proposed model for different values of P. From Table 2, the proposed model gives the best result at P = 100 for the RAOA and RAHC datasets, at P = 200 for the Breast cancer, Lung cancer, MLL, and SRBCT datasets, and at P = 500 for the Leukemia dataset. So, MFSAC-EC gives its best result for a P value within 100 to 500 for all datasets except Colon and Prostate, for which the best result is obtained at P = 1,500.
Here, SVM, DT (C4.5), NB, and KNN classifiers are used individually to form the different ensemble classification models. All classifiers are implemented using the Scikit-learn libraries of Python with default parameter values. For DT, the default setting uses the Gini splitting function, the "best" splitting strategy, and no height limit (so every sample reaches a leaf/class node). For SVM, the RBF kernel function is used. For KNN, the value of K (the number of nearest neighbors) is chosen from three to seven.
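The base-classifier configurations described above correspond to roughly the following scikit-learn constructions. This is an illustrative sketch (parameter names follow the scikit-learn API; K = 5 is one value from the stated 3–7 range):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

base_classifiers = {
    # Gini impurity, best-split strategy, unlimited depth
    # (the tree grows until every sample reaches a leaf).
    "DT":  DecisionTreeClassifier(criterion="gini", splitter="best",
                                  max_depth=None),
    "SVM": SVC(kernel="rbf"),                     # RBF kernel, defaults otherwise
    "NB":  GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),   # K chosen from 3..7
}
```

Each of these four is plugged in as the base classifier of a separate MFSAC-EC ensemble.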
The overall execution time of a single run of the MFSAC-EC model (considering bootstrapped dataset creation, feature selection using MFSAC, and then generating classification accuracy of test samples using LOOCV, fivefold, tenfold, and random splitting) and testing time using only tenfold are shown for different datasets in Table 3.

Classification performance of the proposed MFSAC-EC classifier
In Table 4, using the LOOCV method, the classification accuracy of the proposed MFSAC-EC model is 100% for the Leukemia, Breast, RBreast, Lung, RAOA, and RAHC datasets in all cases. For the Prostate dataset, the model does not reach 100% accuracy with any of the existing base classifiers. For MLL, Colon, and SRBCT, it also gives 100% accuracy with all types of ensemble classifiers.
Tables 5 and 6 show that, using fivefold and tenfold cross-validation, MFSAC-EC falls short of 100% accuracy only for the Colon and Prostate cancer datasets; for all other datasets it provides 100% accuracy with all types of ensemble classifiers.
To show the generalization property of the proposed ensemble classifiers, their classification accuracy is also measured repeatedly with respect to random splitting of the dataset into a training set (2/3 of the original dataset) and a test set (1/3 of the original dataset). Random splitting is done with care so that the class proportions are alike in the training and test sets. In Table 7, the classification accuracy of the above-mentioned four ensemble classifiers for different numbers of cluster representatives is shown for the different datasets, based on the best results obtained.
[Table 3 caption: Total execution time in a single run of MFSAC-EC on different datasets, including bootstrapped dataset creation, feature selection by the filter methods and the supervised attribute clustering approach, training, and testing using LOOCV, fivefold, tenfold, and random splitting (first row); execution time using only tenfold cross-validation (second row). Times are shown for the best P value.]
From Tables 4 to 7, it is observed that the classification accuracy under the LOOCV, fivefold, and tenfold cross-validation methods is higher than under random splitting of the dataset, and the overall generalization performance of the proposed classification model is also good.
The performance of the proposed model on the different two-class datasets with respect to SN, SP, PPV, NPV, and FPR is shown in Table 8. From this table, the performance of the proposed model is very good with respect to all these parameters for all two-class datasets.

In Fig. 4, ROC curves are shown for different two-class datasets. Figures 4A, 4B, and 4C show the ROC curves for Breast cancer using LOOCV, for Colon cancer using fivefold cross-validation, and for the RAHC dataset using tenfold cross-validation, respectively. The ROC curves for the Leukemia and Lung cancer datasets using LOOCV are given in Figs. S1A and S1B, respectively. For Breast cancer, Leukemia cancer, and Lung cancer, the AUC value is equal to 1.0 in every case. The ROC curves for the RAOA and RBreast cancer datasets using fivefold cross-validation are shown in Figs. S2A and S2B, respectively; for these datasets too, the prediction accuracy using fivefold cross-validation is very high according to the AUC value. In Fig. S2C, the ROC curve is shown for Prostate cancer using tenfold cross-validation.
[Table 5 caption: Classification accuracy (%) of the proposed MFSAC-EC model in terms of fivefold cross-validation for the four ensemble classifiers MFSAC-EC+NB, MFSAC-EC+KNN, MFSAC-EC+DT, and MFSAC-EC+SVM. Every ensemble classifier is run 50 times using fivefold cross-validation for every dataset, and the accuracy obtained the maximum number of times is reported.]
From the tenfold cross-validation curves, it may be seen that the AUC value is 1 for all datasets except Prostate cancer, for which the AUC value is close to 1. In Figs. 5A and 5B, heat map representations of the confusion matrix are shown for the multi-class datasets SRBCT and MLL with respect to fivefold and tenfold cross-validation, respectively. From these figures, it is clear that the predictions of the proposed model are accurate in most cases.

Comparison of MFSAC-EC model with well-known existing filter methods used in this model
In Fig. S3, the proposed MFSAC-EC model in combination with the different existing classifiers is compared with the different filter methods used in this model on the SRBCT, RAHC, Prostate, and Colon datasets in terms of tenfold cross-validation. In all cases, the performance of the proposed model is significantly better than that of all the filters.
[Table 6 caption: Classification accuracy (%) of the proposed MFSAC-EC model in terms of tenfold cross-validation for the four ensemble classifiers MFSAC-EC+NB, MFSAC-EC+KNN, MFSAC-EC+DT, and MFSAC-EC+SVM. Every ensemble classifier is run 50 times using tenfold cross-validation for every dataset, and the accuracy obtained the maximum number of times is reported.]


Comparison of MFSAC-EC model with well-known existing gene selection methods
In Fig. 6, the MFSAC-EC model with the different existing classifiers as base classifiers is compared with well-known supervised gene selection methods, namely mRMR (minimum redundancy maximum relevance framework) (Ding & Peng, 2005), MSG (mutual information based supervised gene clustering algorithm) (Maji & Das, 2012), CFS (Correlation-based Feature Selection) (Ruiz, Riquelme & Aguilar-Ruiz, 2006), and FCBF (Fast Correlation-Based Filter) (Ruiz, Riquelme & Aguilar-Ruiz, 2006), with respect to different classifiers using the tenfold cross-validation method. From these results, the proposed model outperforms the others in most cases. In Fig. 7, the MFSAC-EC model is compared with well-known unsupervised gene selection methods, namely MGSACO (Tabakhi et al., 2015), UFSACO (Tabakhi, Moradi & Akhlaghian, 2014), RSM (Lai, Reinders & Wessels, 2006), MC (Haindl et al., 2006), RRFS (Ferreira & Figueiredo, 2012), TV (Theodoridis & Koutroumbas, 2008), and LS (Liao et al., 2014), with respect to the DT, SVM, and NB classifiers using random splitting.
[Table 7 caption: Classification accuracy (%) of the proposed MFSAC-EC model in terms of random splitting for the four ensemble classifiers MFSAC-EC+NB, MFSAC-EC+KNN, MFSAC-EC+DT, and MFSAC-EC+SVM. Every ensemble classifier is run 50 times using random splitting for every dataset, and the accuracy obtained the maximum number of times is reported. For random splitting, the dataset is divided into training (2/3) and testing (1/3) parts 50 times randomly.]

Comparison of MFSAC-EC model with well-known existing ensemble classification and deep learning models

In Table 9, the proposed MFSAC-EC model using the DT classifier is compared with well-known existing ensemble classification models with respect to tenfold cross-validation. These models are PCA-based RotBoost (Osareh & Bita, 2013), ICA-based RotBoost (Osareh & Bita, 2013), AdaBoost (Osareh & Bita, 2013), Bagging (Osareh & Bita, 2013), Arcing (Osareh & Bita, 2013), Rotation Forest (Osareh & Bita, 2013), EN-NEW1 (Wang, 2006), and EN-NEW2 (Wang, 2006). From Table 9, it is clear that the proposed model using the DT classifier outperforms all of them in all cases. In Table 10, the proposed MFSAC-EC model using DT, NB, and KNN as base classifiers is compared with different existing ensemble classifiers with respect to tenfold cross-validation. These classifiers are the Bagging-based ensemble classifier (Nagi & Bhattacharyya, 2013), Boosting-based ensemble classifier (Nagi & Bhattacharyya, 2013), Stacking-based ensemble classifier (Nagi & Bhattacharyya, 2013), Heuristic breadth-first search-based ensemble classifier (HBSA) (Wang, Li & Fanget, 2012), Sd_Ens (Nagi & Bhattacharyya, 2013), and Meta_Ens (Nagi & Bhattacharyya, 2013). In Table 11, the proposed model using SVM and KNN as base classifiers is compared with auto-encoder-based deep learning models (Nabendu, Pintu & Pratyay, 2020) in terms of random splitting. Results are shown only for the datasets for which results are available in the literature; all other fields are marked as "Not Found". In all cases, the MFSAC-EC model outperforms all the well-known existing ensemble models (except for the Colon cancer dataset) and the deep learning models, which in turn validates the usefulness of the proposed model.

Biological significance analysis
The top eight genes selected by the MFSAC-EC model for Colon cancer and Leukemia are listed in Table 12. For every gene, the name and symbol of the gene as well as the Accession number of the Affymetrix chip are listed. Apart from this information, to validate those genes, biomedical literature of the genes is searched and for every gene, the corresponding reference about its role and significance for a particular disease is provided.

DISCUSSION
In this paper, a new Multiple Filtering and Supervised Attribute Clustering algorithm-based ensemble classification model named MFSAC-EC is proposed. The main motivation behind this work is to develop a machine learning-based ensemble classification model that overcomes the over-fitting problem arising from the class imbalance, small sample size, and high dimensionality of microarray gene expression datasets, and thereby enhances prediction capability. Nowadays, the use of ensemble methodology in machine learning models is increasing day by day, as it combines multiple learning algorithms and training datasets in efficient ways to improve the overall prediction accuracy of the model. Due to the combination of the predictions of multiple learning models and the use of different bootstrapped datasets, the chances of over-fitting the training data are greatly reduced in ensemble models, and as a consequence the prediction accuracy increases. One necessary condition for the superior performance of an ensemble classifier over its individual member/base classifiers is that every base classifier should be accurate and diverse (Osareh & Bita, 2013). A classifier is considered accurate if its generalization capability is high, and two classifiers satisfy the diversity property if their predictions on the same unknown samples differ from each other. The general principle of ensemble methods is to rearrange the training dataset in different ways (either by resampling or reweighting) and build an ensemble of base classifiers by applying a base classifier to every rearranged training dataset (Osareh & Bita, 2013).
In the proposed ensemble model, at first a number of bootstrapped datasets are created from the original training dataset. In every bootstrapped dataset, the class imbalance problem is solved using the oversampling method. Then, for every bootstrapped dataset, a number of sub-datasets are created using the MFSAC method (a hybrid method combining multiple filters with a new supervised attribute/gene clustering method), and for every generated sub-dataset a base classifier is constructed using any existing classification model. Finally, a new ensemble classifier (EC) is formed by combining the predictions of all those base classifiers using the majority voting scheme.
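The final majority-voting step can be sketched as follows. This is a simplified illustration: each base classifier votes on every test sample and the most frequent label wins; `base_preds` is a hypothetical array standing in for the per-classifier predictions.

```python
import numpy as np
from collections import Counter

def majority_vote(base_preds):
    """Combine predictions from all base classifiers.
    base_preds has shape (n_classifiers, n_samples); the output is the
    per-sample label that received the most votes."""
    base_preds = np.asarray(base_preds)
    return np.array([Counter(col).most_common(1)[0][0]
                     for col in base_preds.T])

# Three base classifiers voting on four test samples.
votes = [[1, 0, 1, 1],
         [1, 0, 0, 1],
         [0, 0, 1, 1]]
print(majority_vote(votes))
```

Because each base classifier is trained on a different filter-derived sub-dataset, their errors tend to be uncorrelated, which is what makes the vote stronger than any single member.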
The prediction accuracy of the proposed model is verified by applying it to high-dimensional microarray gene expression data. From Figs. 6 and 7, it is found that the classification accuracy of the MFSAC-EC model is much better than that of the well-known existing gene selection methods. From Tables 9, 10, and 11, it is also found that the proposed MFSAC-EC classification model is superior to the existing ensemble classification models in almost every case. The superior performance of the proposed model is due to the following reasons. The generation of different bootstrapped versions of the training data, together with the oversampling procedure that balances the cardinality of the majority and minority classes in every bootstrapped dataset, reduces the chances of over-fitting. Different types of filter methods are used in the MFSAC method. It has already been observed that a filter that gives better performance on one dataset may give poor results on other datasets; this is because every filter uses a separate metric, so choosing a filter for a specific dataset is a very complex task. As different filter methods are used in the MFSAC method, different sub-datasets with attributes/genes of different characteristics are formed from each dataset. This is shown using Venn diagrams in Figs. S4A and S4B, where the first twenty genes selected by each filter are shown for the Leukemia and Prostate cancer datasets. In the Leukemia dataset, the Relief measure generates a non-overlapping gene subset, while the other filter metrics produce a small number of overlapping genes across subsets. In the Prostate cancer dataset, Relief generates a non-overlapping gene subset, and the maximum number of genes is also non-overlapping in the subsets formed by the Fisher score and MI (mutual information).
From these figures, it is clear that different filter methods select different subsets of genes and form different sub-datasets, which shows the diversity of those filter methods. As a consequence, the base classifiers trained on these diverse datasets become diverse, and this diversity increases the power of the ensemble classifier. Moreover, the genes selected by the different filter methods are also good biomarkers. In Table 12, the top-ranked eight genes selected by the MFSAC-EC model are shown for the Leukemia and Colon cancer datasets. Among these, the genes MPO (column number 1,720), CST3 (column number 1,823), ZYX (column number 4,788), CTSD (column number 2,062), CD79A/MB-1 (column number 2,583), and LYZ (column number 6,738) in the Leukemia dataset are important biomarkers, as they are selected by the different filter methods mentioned in Fig. S4.
In MFSAC, at first a sub-dataset of the most relevant genes is selected by each filter method. Then, on each sub-dataset, the proposed supervised gene clustering algorithm is applied, and a reduced sub-dataset of modified attributes/features in the form of augmented cluster representatives is generated. In this method, at the time of cluster formation, genes are augmented based on their supervised information; in other words, an augmentation is accepted only where it increases the class discrimination power. Thus, effectively, the class relevance of any augmented cluster representative is greater than that of any single gene involved in the process. So, this modified sub-dataset containing a reduced feature set in the form of augmented cluster representatives is more powerful in terms of class discrimination than the sub-dataset containing only the most relevant genes. Apart from this, it is a well-known fact in gene expression data that two genes are functionally similar if they are pattern-wise similar (either positively or negatively co-expressed) (Das et al., 2016). So, at the time of the augmentation procedure, two types of augmentation are considered here: a gene is added to the current cluster representative either with its original values or with its sign-flipped values. If the current cluster representative and a gene are positively co-expressed, normal addition is used; but if they are negatively co-expressed, normal addition would hamper the aggregation, and in that case sign-flipping of the gene gives the proper result. The effect of augmentation with respect to every filter method is shown in Fig. 8. For the Breast cancer dataset, at the time of supervised cluster formation from each filter-generated subset, Fig. 8 shows each original gene with its class relevance value and the corresponding augmented gene with its class relevance. From Fig. 8, it is clear that for every filter method the class relevance score of every original gene increases after augmentation. In Fig. 8, different class labels are distinguished by different colors. Finally, for each sub-dataset with modified attributes in the form of augmented cluster representatives, a classifier is constructed using any existing classifier, and these classifiers are combined using the majority voting technique to form an ensemble classifier (EC). The use of different sub-datasets with optimal gene subsets in the form of augmented cluster representatives, and the formation of a classifier for every sub-dataset, can solve the over-fitting problem of any single classifier. This is because not all sub-datasets can consistently perform well on all types of cancer datasets (due to the inherent characteristics of the datasets), but the use of majority voting in the ensemble classifier solves or reduces this problem.
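The sign-flip augmentation idea can be illustrated as follows. This is a simplified sketch, not the authors' exact algorithm: a gene is added to the running cluster representative either as-is or sign-flipped, depending on whether the two are positively or negatively co-expressed (here measured by Pearson correlation).

```python
import numpy as np

def augment(representative, gene):
    """Add a gene to the cluster representative, sign-flipping it when
    the two are negatively co-expressed so the patterns reinforce
    rather than cancel."""
    r = np.corrcoef(representative, gene)[0, 1]
    return representative + (gene if r >= 0 else -gene)

rep  = np.array([1.0, 2.0, 3.0, 4.0])
g_up = np.array([1.1, 2.2, 2.9, 4.1])   # positively co-expressed with rep
g_dn = -g_up                             # negatively co-expressed with rep
print(augment(rep, g_up))
print(augment(rep, g_dn))                # same result: g_dn gets flipped back
```

Without the flip, adding `g_dn` directly would cancel the shared expression pattern instead of strengthening it, which is exactly the hampering effect described above.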
Another outcome of the proposed model is a ranking of informative genes for every cancer dataset. For this task, the frequency of occurrence of each gene in the augmented cluster representatives across all sub-datasets is counted, and the genes are ranked according to this count to measure their importance for a specific disease, here cancer. To establish the biological significance of the selected genes for every cancer dataset, their contribution has been confirmed by other existing studies in which they have already been reported. From these existing studies, it is clear that the selected genes are important for cancer class discrimination and are also important cancer biomarkers for molecular treatment targets.
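The frequency-based ranking can be sketched as follows (the gene lists here are purely illustrative placeholders, though the symbols are taken from Table 12):

```python
from collections import Counter

# Genes appearing in the augmented cluster representatives of each
# sub-dataset (hypothetical example memberships).
sub_dataset_genes = [
    ["MPO", "CST3", "ZYX"],
    ["MPO", "CTSD", "CST3"],
    ["MPO", "CST3", "LYZ"],
]
freq = Counter(g for genes in sub_dataset_genes for g in genes)
ranking = [g for g, _ in freq.most_common()]  # descending frequency
print(ranking[:3])
```

Genes that recur across many sub-datasets were selected by several independent filter/cluster pipelines, which is why a high count is treated as evidence of importance.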

CONCLUSIONS
Many machine learning and statistical learning-based classifiers for sample classification already exist in the literature, but these methods are prone to over-fitting due to the small sample size, class imbalance, and curse of high dimensionality of microarray data. Although some of the existing methods can mitigate these issues to quite an extent, the problems have still not been satisfactorily overcome. For this reason, a novel feature selection-based ensemble classification model named MFSAC-EC is proposed here, and it has been shown that the proposed model can handle the above-mentioned issues present in existing models. To check the performance of the proposed MFSAC-EC model, the classifier is applied to sample classification in high-dimensional microarray gene expression data, a domain where it will be beneficial to cancer research. From the experimental results, it has been found that the proposed model outperforms all other well-known existing classification models combined with different recognized feature selection methods, as well as the newly developed ensemble classifiers, for all the cancer datasets mentioned here. Apart from the classification task, the proposed model can also rank informative attributes according to their importance. Its efficiency in this task is vindicated by finding the most informative genes for the Colon cancer and Leukemia cancer datasets, which are biologically validated against other well-known existing studies. Consequently, it is clear that the selected genes are vital for sample class discrimination and are also important biomarkers for molecular treatment targets of these deadly diseases.