Robust proportional overlapping analysis for feature selection in binary classification within functional genomic experiments

In this paper, a novel feature selection method called Robust Proportional Overlapping Score (RPOS) is proposed for microarray gene expression datasets, utilizing a robust measure of dispersion, the median absolute deviation (MAD). The method identifies the most discriminative genes by considering the overlap of gene expression values between the two classes in binary classification problems. Genes with a high degree of overlap between classes are discarded, while those that discriminate between the classes are selected. The proposed method is compared with five state-of-the-art gene selection methods in terms of classification error, Brier score and sensitivity on eleven gene expression datasets. Classification of observations for the different sets of genes selected by the proposed method is carried out by three classifiers, i.e., random forest, k-nearest neighbors (k-NN) and support vector machine (SVM). Box plots and stability scores of the results are also reported. The results reveal that in most cases the proposed method outperforms the other methods.


INTRODUCTION
Feature or variable selection is the process of selecting a subset of features from a large feature space, especially in high-dimensional datasets such as microarray gene expression data, for model construction. Selecting a subset of genes/features is a necessary task in classification and regression problems. In regression, feature or gene selection is carried out to better estimate the average value of the target or response variable, whereas in classification it is used to improve classification accuracy. The motivation behind feature selection is that redundant and/or irrelevant features do not contribute to predicting the response variable and adversely affect the underlying algorithms. It is therefore necessary to select those features which are discriminative and can help simplify model construction. Moreover, a small number of features helps in reducing the training time, increasing the generalizability of the models by minimizing their variances, and mitigating the curse of dimensionality in n < p problems. Feature selection methods can be grouped into three categories, i.e., Wrapper, Embedded and Filter. The details of these methods are given below.

Wrapper methods
In Wrapper methods, all possible subsets of features in the training set are evaluated by using a predictive model. Each subset is assigned a score based on model accuracy on the hold-out (testing) set. These methods are computationally expensive, since for each feature subset a new predictive model is to be trained. An example of the wrapper method can be found in Saeys, Inza & Larrañaga (2007).

Embedded methods
These methods are somewhat similar to the Wrapper procedures. The embedded feature selection methods differ from the wrapper procedures in the sense that the former do not need to train a new model for each feature subset. In these procedures, gene/feature selection is considered a constituent of model construction. Some of the most common embedded methods include the decision tree algorithm, LASSO regression and Ridge regression. The last two methods shrink the coefficients of non-informative features to zero and almost zero, respectively. The classification tree based classifier (Breiman et al., 1984) is another example of this approach.
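The contrast between LASSO and Ridge shrinkage can be illustrated with a small sketch (Python, for illustration only; the closed forms below hold for an orthonormal design, and the coefficient values are hypothetical):

```python
# Illustration of embedded shrinkage under an orthonormal design, where
# the penalised solutions have closed forms: LASSO soft-thresholds the
# OLS coefficients (zeroing small ones exactly), while Ridge only
# rescales them. The coefficient values below are hypothetical.
def lasso_coef(ols, lam):
    # soft-threshold: sign(b) * max(|b| - lam, 0)
    return [(1 if b > 0 else -1) * max(abs(b) - lam, 0.0) for b in ols]

def ridge_coef(ols, lam):
    # proportional shrinkage: b / (1 + lam)
    return [b / (1.0 + lam) for b in ols]

ols = [2.5, 0.3, -1.8, 0.05]
print(lasso_coef(ols, 0.5))  # small coefficients become exactly zero
print(ridge_coef(ols, 0.5))  # every coefficient shrinks, none vanish
```

This is why LASSO performs feature selection as a by-product of model fitting, whereas Ridge merely dampens all coefficients.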

Filter methods
In Filter methods, feature selection is carried out by applying a statistical measure, such as the mutual information criterion (Guyon & Elisseeff, 2003), the pointwise mutual information criterion (Yang & Pedersen, 1997), the Pearson product-moment correlation or Relief-based algorithms (Urbanowicz et al., 2018), to each feature independently, or by finding the association of the feature with the target or response variable. Features are then ranked according to their relevance scores, and those with the highest scores are selected for model construction. Other examples of such methods can be seen in Ghosh et al. (2020), El-Hasnony et al. (2020), Seo & Cho (2020) and Algamal & Lee (2019).
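A minimal filter-style ranking can be sketched as follows (Python, with synthetic data; a simple absolute-correlation score stands in for the measures cited above):

```python
import numpy as np

# A minimal filter method: score each gene by the absolute Pearson
# correlation between its expression values and the binary labels,
# then keep the indices of the top-k genes. Data here are synthetic.
def filter_rank(X, y, k):
    # X: n x p expression matrix, y: 0/1 labels of length n
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    scores = np.abs(num / den)
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 50))
X[:, 3] += 5 * y             # plant a strongly class-related gene
print(filter_rank(X, y, 5))  # gene 3 should rank first
```

Note that each gene is scored independently of the others, which is what makes filter methods fast but blind to redundancy among the selected features.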
The proposed method is based on a filtering approach, where the discriminative features or genes that affect the target variable are identified by using the robust measure of dispersion, i.e., median absolute deviation (MAD) for binary class problems. Eleven benchmark gene expression datasets are used to assess the discriminative ability of genes selected by the proposed method. The performance of genes selected through the proposed method is evaluated by using different classifiers, i.e., Random Forest (RF) (Breiman, 2001), K-Nearest Neighbors (k-NN) (Cover & Hart, 1967) and Support Vector Machine (SVM) (Liao, Li & Luo, 2006).

RELATED WORK
Feature selection methods and their utility in classification analyses can be found in several studies. Dramiński et al. (2008) introduced a method called 'relative importance'. In this method, the discriminative genes are identified by constructing a large number of decision trees, where the genes that contribute most to assigning the samples/observations to their true classes are selected. Ultsch et al. (2009) proposed a method called 'PUL', in which the informative genes are selected with the help of a measure (PUL-score) based on information retrieval. A method called minimal redundancy maximal relevance (mRMR) was introduced by Ding & Peng (2005), in which genes having maximum relevance with the target class and minimum redundancy are selected. An ensemble version of Ding & Peng (2005), named 'mRMRe', was introduced by De Jay et al. (2013). The principal component analysis technique was used by Lu et al. (2011), where those genes are considered informative that correspond to the components with less variation. A similar study can be found in Talloen et al. (2007), where the factor analysis technique is used rather than principal component analysis. Comparisons of different feature selection methods can be found in Ultsch et al. (2009) and Liu et al. (2013). Identification of informative genes by calculating the p-values of statistical tests such as the Wilcoxon rank-sum test and the t-test can be found in Lausen et al. (2004). Selection of discriminative genes by exploiting impurity measures, i.e., the Gini index, max minority and information gain, can be found in Su et al. (2003). Features or genes can also be selected by analyzing the degree of overlap between the different classes for each gene. A large overlapping degree between the classes for a particular gene indicates that the gene is non-informative in classifying observations to their correct classes. A study based on the overlapping scores of genes for binary class problems can be found in Apiletti et al. (2007).
This method, named the 'painter's feature selection method', calculates the overlapping degree between the two classes for each gene by considering a single factor, i.e., the size of the overlapping area. Genes that have maximum overlapped regions are assigned higher scores. Genes are then sorted in increasing order based on their scores. This idea was further extended by Apiletti et al. (2012) by taking into account an additional factor, i.e., the number of overlapped observations in the overlapping area for each gene. Apiletti et al. (2012) calculated each gene's mask by considering the range of training expression intervals, which represents the capability of a gene to correctly classify observations into their classes without any ambiguity. In this method, a minimum subset of genes that unambiguously assigns the maximum number of training samples to their correct classes is identified by considering the gene masks and overlapping scores through a set covering approach. The final subset of discriminative genes is obtained by considering all the genes in the minimum subset together with the genes with the smallest overlapping scores. A robust version of Apiletti et al. (2012) can be found in Mahmoud et al. (2014), where the expression interval for each gene is calculated using the interquartile range. Mahmoud et al. (2014) also considered the proportion of overlapping samples (POS) in each class for each gene. Genes with lower proportional overlapping scores (POS) were considered informative. After obtaining the POS, the relative dominant class (RDC) for each gene was also calculated, which associates each gene with the class for which it has a stronger distinguishing capability. The final set of genes/features is obtained by combining the minimum gene set, obtained via gene masks, with the top-ranked genes based on their proportional overlapping scores. Li & Gu (2015) proposed a method called the more relevance less redundancy algorithm.
Another study, by Nardone, Ciaramella & Staiano (2019), introduced a two-step procedure for feature selection, where extensive experiments were performed to evaluate the performance of their proposed method on publicly available datasets from the computational biology field. A novel supervised learning technique was introduced in Bidgoli, Ebrahimpour-Komleh & Rahnamayan (2020). This method is designed particularly for multi-class problems and is an extended version of a decomposition-based multi-objective optimization approach. A feature selection method for binary classification problems was introduced by Dashtban, Balafar & Suravajhala (2018), in which the traditional bat algorithm is extended with more refined, multi-objective formulations.

METHOD
Microarray gene expression data is usually in the form of a matrix, i.e., Z = [z_ji], where Z ∈ R^(p×n) and z_ji is the observed expression value of the j-th gene for the i-th tissue sample, for j = 1, 2, …, p and i = 1, 2, …, n. Each tissue sample is categorized into one of two classes, i.e., 0 or 1. Let W ∈ R^n be the class label vector such that its i-th component w_i takes a value c, which is either 0 or 1. The number of samples/observations in microarray gene expression datasets is usually smaller than the number of features, which is also called the n < p problem. Figure 1 represents the common layout of a gene expression dataset. Observations/samples are listed in the rows, while the genes are given in the columns. The cells hold the gene expression value of each gene for the corresponding sample. Further definitions used in this paper are given below:

Overlapped region
The overlapping region between the two classes for gene j is represented by R^v_j, the intersection between the core expression intervals of the two target classes:

R^v_j = R_{j,0} ∩ R_{j,1}.

Non-outlier sample set
The non-outlier sample set, denoted by N_j, is the set of observations whose expression values lie within the core interval of their own response class:

N_j = {i : z_ji ∈ R_{j,w_i}}.

Total core interval
The total core interval for gene j, denoted by R_j, is the region between the global minimum and global maximum boundaries of both classes' core intervals:

R_j = [d_j, e_j], where d_j = min(d_{j,0}, d_{j,1}) and e_j = max(e_{j,0}, e_{j,1})

represent the lowest and highest boundaries of the core intervals R_{j,c} of the gene/feature for responses c ∈ {0, 1}, respectively.

Non-overlapped sample set
For gene j, the non-overlapping sample set is represented by O′_j, which contains the non-outlier samples of N_j whose expression values do not fall inside the overlap interval:

O′_j = {i ∈ N_j : z_ji ∉ R^v_j}.

Overlapped sample set
The overlapping sample set for gene j, denoted by O_j, consists of the non-outlier observations whose expression values fall inside the overlap interval R^v_j:

O_j = N_j \ O′_j,

where O′_j contains all the non-overlapping samples.
In the above expressions, Q1_(j,c), Q3_(j,c) and MAD_(j,c) represent the lower (first) quartile, the upper (third) quartile and the median absolute deviation, respectively, for each class c, where c is either 0 or 1.
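Under our reading of the above definitions, and assuming the class-c core interval takes the form [Q1_(j,c) − MAD_(j,c), Q3_(j,c) + MAD_(j,c)] (an assumption on our part; the paper's own equations give the exact form), the sets for a single gene can be sketched in Python as:

```python
import numpy as np

def mad(v):
    # median absolute deviation
    return np.median(np.abs(v - np.median(v)))

def core_interval(x, y, c):
    # assumed form: [Q1 - MAD, Q3 + MAD] for the class-c expression values
    xc = x[y == c]
    q1, q3 = np.percentile(xc, [25, 75])
    return q1 - mad(xc), q3 + mad(xc)

def gene_sets(x, y):
    iv = {c: core_interval(x, y, c) for c in (0, 1)}
    # overlap region R^v_j: intersection of the two core intervals
    lo, hi = max(iv[0][0], iv[1][0]), min(iv[0][1], iv[1][1])
    # N_j: samples lying inside their own class's core interval
    N = [i for i in range(len(x)) if iv[y[i]][0] <= x[i] <= iv[y[i]][1]]
    # O_j: non-outlier samples falling inside the overlap region
    O = [i for i in N if lo <= x[i] <= hi] if lo <= hi else []
    O_prime = [i for i in N if i not in O]   # non-overlapped set O'_j
    return (lo, hi), N, O, O_prime

# well-separated classes: the overlap set should come out empty
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
_, N, O, O_prime = gene_sets(x, y)
print(N, O, O_prime)   # all samples non-outliers, none overlapping
```

For a gene whose two class intervals are disjoint, O_j is empty and the gene is maximally discriminative under this construction.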

Relative dominant class (RDC)
For each gene, the relative dominant class (RDC) is calculated, which associates each feature/gene with the class it is more capable of differentiating. It is defined with respect to U_c, the set of class-c samples, i.e., U_c = {i : w_i = c}.

Proposed (RPOS) score
The proposed RPOS score for gene j is defined as:

RPOS_j = (|R^v_j| / |R_j|) × (|O_j| / |N_j|) × 4 f_0 f_1,

where |R^v_j| is the length of the overlap interval, |R_j| is the length of the total core interval, |O_j| is the total number of overlapped samples and |N_j| is the total number of non-outlier samples for gene j. Here f_c = |O_{j,c}| / |O_j|, where |O_{j,c}| represents the number of overlapped samples lying in class c. The factor 4 keeps the RPOS scores between 0 and 1, since f_0 + f_1 = 1 implies 4 f_0 f_1 ≤ 1. A smaller value of RPOS indicates that a particular gene is more informative in classifying the tissue samples to their correct classes.
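Our reading of this score, the overlap-length ratio times the overlapping-sample ratio times 4·f_0·f_1, can be transcribed as a small Python function (a reconstruction from the described terms, not the paper's verbatim equation):

```python
# RPOS score for one gene from its interval and set summaries, per our
# reconstruction: overlap-length ratio x overlapping-sample ratio x
# 4*f_0*f_1, so that scores lie in [0, 1] and smaller means better.
def rpos(overlap_len, core_len, n_overlap, n_nonoutlier, n_overlap_c0):
    if n_overlap == 0:
        return 0.0                    # no overlap: fully separable gene
    f0 = n_overlap_c0 / n_overlap     # share of overlapped samples in class 0
    f1 = 1.0 - f0
    return (overlap_len / core_len) * (n_overlap / n_nonoutlier) * 4 * f0 * f1

print(rpos(0.0, 10.0, 0, 60, 0))     # 0.0: perfectly separable gene
print(rpos(10.0, 10.0, 60, 60, 30))  # 1.0: complete, balanced overlap
```

The two extreme calls show the score's range: a gene with no overlap scores 0, while a gene whose classes overlap completely and evenly scores 1.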
The proposed method thus takes the following steps in selecting the most discriminative genes.
1. The proposed method initially identifies the minimum subset of genes via the greedy approach given in Apiletti et al. (2012). The greedy approach utilizes the gene mask matrix given in Eq. (7) and the RPOS scores in Eq. (9) to form this subset. The gene whose mask has the highest number of bits equal to 1 is included in the subset. If more than one gene has the same number of 1 bits, the one with the smaller RPOS score is selected. Using the AND operator, the gene masks of the remaining genes are updated for the selection of the second gene, and so on. This process is repeated until the desired number of genes is selected or the genes have no 1's left in their gene masks. For further details on gene selection via the greedy approach, see Apiletti et al. (2012).
2. The genes that are not selected in the minimum subset are arranged in ascending order according to their RPOS scores and relative dominant classes (RDC) in a round-robin fashion. A smaller score represents a higher discriminative ability of the gene/feature.
3. After arranging the genes in step (2), the required number of top-ranked genes is selected.
4. The final set of genes for model construction is obtained by combining the genes selected in steps (1) and (3).
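Step (1), the greedy, set-cover-style selection over gene masks, can be sketched as follows (Python; the mask bits and score values are hypothetical, and ties are broken by the smaller RPOS score as described):

```python
# Greedy minimum-subset selection over gene masks: each mask bit marks a
# training sample the gene classifies unambiguously. Each round picks
# the gene covering the most still-uncovered samples (ties broken by
# smaller RPOS), then AND-updates the remaining masks with the
# complement of the chosen gene's mask.
def greedy_min_subset(masks, scores, max_genes):
    # masks: dict gene -> list of 0/1 bits; scores: dict gene -> RPOS
    masks = {g: list(m) for g, m in masks.items()}   # work on a copy
    chosen = []
    while len(chosen) < max_genes:
        best = None
        for g, m in masks.items():
            if g in chosen:
                continue
            cover = sum(m)
            if cover == 0:
                continue
            key = (-cover, scores[g])    # most coverage, then smaller RPOS
            if best is None or key < best[0]:
                best = (key, g)
        if best is None:
            break                        # nothing left to cover
        g = best[1]
        chosen.append(g)
        won = masks[g]
        for h in masks:                  # drop newly covered samples
            masks[h] = [b & (1 - w) for b, w in zip(masks[h], won)]
    return chosen

masks = {'a': [1, 1, 1, 0], 'b': [1, 1, 0, 0], 'c': [0, 0, 0, 1]}
scores = {'a': 0.2, 'b': 0.1, 'c': 0.3}
print(greedy_min_subset(masks, scores, 10))  # picks 'a' first, then 'c'
```

In the toy example, gene 'b' is never chosen: every sample it covers is already covered by 'a', which is exactly the redundancy the mask update eliminates.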
The general workflow of the proposed (RPOS) method, along with its pseudo-code, is given in Fig. 2 and Algorithm 1, respectively.
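Steps (2) and (3), the round-robin interleaving of the two RDC-wise ranked lists, can be sketched as follows (Python; gene identifiers and scores are hypothetical):

```python
# Round-robin ranking: genes left out of the minimum subset are split by
# their relative dominant class (RDC) into two lists sorted by ascending
# RPOS, then interleaved so both classes stay represented near the top.
def round_robin(genes, rdc, score, top_k):
    # genes: gene ids not in the minimum subset; rdc maps a gene to its
    # relative dominant class (0 or 1); score maps a gene to its RPOS
    by_class = {0: [], 1: []}
    for g in genes:
        by_class[rdc[g]].append(g)
    for c in (0, 1):
        by_class[c].sort(key=lambda g: score[g])  # smaller RPOS first
    ranked, turn = [], 0
    while by_class[0] or by_class[1]:
        c = turn % 2
        if by_class[c]:
            ranked.append(by_class[c].pop(0))
        turn += 1
    return ranked[:top_k]

rdc = {'g1': 0, 'g2': 0, 'g3': 1, 'g4': 1}
score = {'g1': 0.2, 'g2': 0.1, 'g3': 0.3, 'g4': 0.05}
print(round_robin(['g1', 'g2', 'g3', 'g4'], rdc, score, 3))
```

The interleaving alternates between the best remaining class-0 gene and the best remaining class-1 gene, so the top of the ranking is not dominated by genes that discriminate only one class well.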
The proposed (RPOS) method is novel in the sense that it utilizes the median absolute deviation (MAD) for the construction of the core expression intervals of the genes' expression values. The drawback of POS in Mahmoud et al. (2014) is that the gene masks are calculated on the basis of expression intervals obtained via the interquartile range. The construction of the gene masks can therefore be affected by outliers because of the smaller breakdown point of the interquartile range, i.e., 25%. The breakdown point of MAD is 50%, which makes it less vulnerable to outliers, thereby reducing their effect while constructing the gene masks.
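The robustness argument can be checked numerically: contaminating 40% of a small sample inflates the IQR drastically while the MAD moves only slightly (Python sketch with hypothetical data):

```python
import numpy as np

# Contaminate 40% of a sample with extreme values: the IQR (breakdown
# point 25%) explodes, while the MAD (breakdown point 50%) barely moves.
def iqr(v):
    q1, q3 = np.percentile(v, [25, 75])
    return q3 - q1

def mad(v):
    return np.median(np.abs(v - np.median(v)))

clean = np.arange(1.0, 11.0)   # the values 1..10
dirty = clean.copy()
dirty[:4] = 1e6                # replace 40% of the sample with outliers
print(iqr(clean), iqr(dirty))  # the IQR is inflated enormously
print(mad(clean), mad(dirty))  # the MAD changes far less
```

With 40% contamination the IQR's upper quartile already sits among the outliers, whereas the MAD's double use of the median keeps both its location and spread estimates anchored to the clean majority.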

EXPERIMENTS AND RESULTS
This section provides a detailed description of the experiments executed for assessing the proposed method against the other methods on benchmark gene expression datasets. A common practice for investigating the efficacy of gene selection methods is to check the discriminative ability of the selected genes by using different classifiers. This is usually done by recording the classification accuracy of classifiers applied to datasets with the selected genes only, while discarding the rest of the genes. For the dataset of Golub et al. (1999), different feature/gene selection techniques given in Statnikov et al. (2005) have been used, and it has been observed that gene selection methods have a significant effect on the classifier's accuracy. This approach has been widely used in several other studies (Apiletti et al., 2012; Mahmoud et al., 2014). Before listing the results of the analyses done in this paper following the above-mentioned approach, a brief description of the datasets is given below.

Microarray gene expression datasets
In this research work, a total of 10 microarray gene expression datasets are taken as standard benchmark binary classification problems. These datasets are taken from various open sources with a varying number of genes and observations. A brief description of the benchmark datasets used in the current paper is given in Table 1. The table provides the number of samples, number of genes, class-wise distribution of samples in the data and source against each dataset.

Experimental setup
The experimental setup for the analyses done in this paper is as follows. The datasets considered in this study are divided into two mutually exclusive parts in the following manner. In the first setting, seventy percent (70%) of the observations in each dataset, randomly selected without replacement, are used as the training part, while the remaining thirty percent (30%) of the observations are used as the testing part. In the second setting, thirty percent (30%) of the observations in each dataset, randomly selected without replacement, are used as the training part, while the remaining seventy percent (70%) of the observations are used as the testing part. A split-sample analysis of 500 runs is carried out for each combination of gene selection method and classifier, using both the 70% training/30% testing and the 30% training/70% testing partitions. The classifiers considered in this study are random forest (RF), support vector machine (SVM) and k-nearest neighbours (k-NN). For random forest, the R package randomForest (Liaw & Wiener, 2002) is used with the default parameters ntree = 500, mtry = √p and nodesize = 1. For the implementation of the support vector machine, the R package kernlab (Karatzoglou et al., 2004) is used with default parameters. Similarly, for the k-nearest neighbour classifier, the R package caret (Jed Wing et al., 2019) is used with the default parameter value k = 5. Using the training part of each dataset, sets of 5, 10, 15, 20, 25 and 30 discriminative genes are selected by the different gene selection methods to train the classifiers. The gene selection methods considered in this paper are the Wilcoxon Rank Sum Test (Liao, Li & Luo, 2006), the Proportional Overlapping Score (POS) based method (Mahmoud et al., 2014), Gene Selection by Clustering (GClust) (Khan et al., 2019), Maximum Relevance Minimum Redundancy (mRmR) (Ding & Peng, 2005) and Significant Features by SVM and t-test (sigF) (Das et al., 2020).
The performance of the selected genes is investigated via the average values of the performance metrics, i.e., classification error rate, Brier score and sensitivity, computed on the testing part of each dataset.
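For completeness, the three metrics can be written out for the binary case as follows (Python sketch; the toy labels and probabilities are illustrative only):

```python
# The three evaluation metrics for a binary problem: classification
# error rate, Brier score (mean squared gap between predicted class-1
# probability and the 0/1 label) and sensitivity (true-positive rate,
# with class 1 taken as "positive"). Values below are toy data.
def error_rate(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def brier(y_true, p1):
    return sum((p - t) ** 2 for p, t in zip(p1, y_true)) / len(y_true)

def sensitivity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tp / sum(y_true)

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
p1 = [0.1, 0.6, 0.8, 0.9, 0.4]
print(error_rate(y_true, y_pred))   # 0.4
print(brier(y_true, p1))            # about 0.156
print(sensitivity(y_true, y_pred))  # 2/3
```

The Brier score rewards well-calibrated probabilities rather than hard labels, which is why it can separate methods that tie on error rate.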

RESULTS AND DISCUSSION
The results of the proposed method and the other methods included in this study are obtained for all the datasets. The results for three datasets, i.e., "TumorC", "Breast" and "Srbct", are given in Tables 2, 3 and 4. These results are based on the 70% training and 30% testing partitioning of the datasets. The results for the remaining eight datasets are given in the Supplemental File (Tables S1-S15). From Table 2, it is clear that for the "TumorC" dataset the proposed method (RPOS) performed better than all the other methods in terms of all the performance metrics considered, except for the Wilcoxon rank-sum test, which performed better for 5, 10, 15 and 20 genes on the support vector machine (SVM) classifier. From Table 3, it is evident that for the "Breast" dataset the proposed method (RPOS) outperformed the other methods on all the classifiers. Table 4 gives the results for the "Srbct" dataset, where the proposed method (RPOS) shows better results than the other methods for 5 genes on the random forest (RF) classifier. For 10 genes, the Wilcoxon rank-sum test performs better in terms of the classification error rate.
Table 3: Classification error rate, sensitivity and Brier score produced by the random forest, k-nearest neighbors and support vector machine classifiers on the Breast cancer dataset, based on genes selected by the given methods. The best result is shown in bold.
Table 4: Classification error rate, sensitivity and Brier score produced by the random forest, k-nearest neighbors and support vector machine classifiers on the Srbct dataset, based on genes selected by the given methods. The best result is shown in bold.
Boxplots of the results of the proposed method and the other methods for twenty genes are also given in Fig. 3. From the boxplots in Fig. 3, it is clear that the proposed method (RPOS) outperforms all the other methods, except for the datasets "Srbct" and "Prostate", where the proposed method RPOS and the method POS provide almost similar results. In the case of the dataset "GSE4045", the sigF method outperforms all the other methods. Similarly, in the case of the dataset "Colon", the performance of the proposed method RPOS and the method POS is similar, while the method sigF outperforms all the other methods. The proposed method RPOS outperforms the rest of the methods on the dataset "Leukaemia". Overall, the proposed method RPOS outperforms all the other methods on 5 out of 10 datasets and provides results similar to those of the method POS on 3 datasets.

To further investigate the efficiency of the proposed method RPOS and the other methods, plots of classification error rates, Brier scores and sensitivity for various numbers of genes are given in Figs. 4, 5 and 6, respectively. From Fig. 4, it is clear that for the datasets "Breast", "DLBCL" and "Lung", the classification error rate of the proposed method RPOS is lower than that of all the other methods for various numbers of genes. For the "TumorC" dataset, the classification error of the Wilcoxon rank-sum test is lower than that of all the other methods for 5, 10 and 15 genes, while it increases as the number of genes increases. For the remaining sets of genes, the proposed method RPOS performs better than all the other methods. A similar pattern of classification error rates can be seen for the dataset "Srbct". In the case of the "Leukaemia" dataset, the performance of the proposed method RPOS and the method POS for 10, 20 and 25 genes is almost similar. In contrast, for the remaining sets of genes, the proposed method RPOS performs better than the others.
To assess the performance of the proposed method RPOS and the remaining methods in terms of the Brier score, the results are shown in the plots given in Fig. 5, where it is clear that the proposed method RPOS outperforms all the other methods for various numbers of genes. Figure 6 shows plots of the sensitivity of the proposed method RPOS and the rest of the methods. It is evident from the figure that for the datasets "TumorC", "DLBCL" and "Lung", the sensitivity of the proposed method RPOS is higher than that of the rest of the methods for various numbers of genes. For the dataset "Breast", the sensitivity of the method mRmR is higher in almost all the cases, except for 20 genes, where the Wilcoxon method performs better than all the other methods. In the case of the dataset "Srbct", the POS and sigF methods give higher sensitivity than the other methods. The Wilcoxon rank-sum test outperforms the remaining methods in the case of the "Leukaemia" dataset. Overall, the proposed method RPOS outperforms all the other methods on 4 out of 7 datasets in terms of the classification error rate and provides comparable results on the remaining three datasets. In terms of the Brier score, the proposed method RPOS outperforms all the other methods on all seven datasets considered. In terms of sensitivity, the proposed method RPOS outperforms the rest of the methods on 4 out of 7 datasets, while giving comparable results on the remaining three datasets.
The primary aim of this research article was to devise a gene selection method to improve the classification performance of machine learning algorithms on high-dimensional microarray gene expression datasets. We, however, provide the indices of the top 10 genes selected by our proposed method for two of the datasets, i.e., Leukemia and Breast. This is done for readers who might want to further assess the biological significance of the genes selected by our proposed method. The indices of the genes selected for the Leukemia dataset are (15, 29, 38, 48, 312, 338, 459, 573, 760, 4847), while those for the Breast dataset are (346, 1481, 1726, 1873, 2942, 3259, 3857, 4067, 4174, 4435). Based on the top 10 genes selected by RPOS, we achieved 95.4% classification accuracy via the SVM classifier for the Leukemia dataset, and for the Breast dataset the accuracy achieved is 98.6%. Studies on the biological significance of the genes in these two datasets are available in the literature.

CONCLUSION
This paper has presented the idea of gene selection for microarray datasets via proportional overlapping analysis with the help of a more robust measure of dispersion, i.e., the median absolute deviation (MAD). The core intervals of the classes in binary class problems are constructed in a robust manner so as to minimize the effect of outliers present in gene expression datasets, and are used in conjunction with the minimum subset of genes selected via the greedy search approach. The genes having the smallest RPOS scores are considered the most discriminative, because they have no or minimal overlapping region between the binary classes. The relative dominant class (RDC) for each gene is also calculated. Genes within each relative dominant class are arranged in increasing order of their RPOS scores. This forms two mutually exclusive groups of genes based on the RDC and RPOS scores. The genes are then arranged according to the RDC and RPOS scores in a round-robin fashion to develop a gene ranking list. These ranked genes do not contain the genes selected via the greedy search approach. The final set of genes is obtained by combining the genes chosen via the greedy search approach with the topmost ranked genes in the gene ranking list. The dimension of the datasets is then reduced by including the selected genes only and discarding the rest. Classification methods, i.e., random forest, support vector machine and k-nearest neighbours, have been used to assess the performance of the proposed method in comparison with other widely used gene selection methods.
The results of the proposed method indicate that it performs better in terms of almost all the performance metrics considered, i.e., classification error rate, Brier score and sensitivity. The efficiency of the proposed method is also supported by constructing boxplots for the error rate. Furthermore, the stability of the proposed method is also assessed for various number of genes. The results show that the proposed method is more stable for varying number of genes as compared to the rest of the methods.
The reason the proposed method selects the most discriminative genes for binary classification is that the core intervals of the classes are constructed with the median absolute deviation (MAD), a more robust measure of dispersion than the interquartile range (IQR) used in Mahmoud et al. (2014). Moreover, the breakdown point of MAD is 50% while that of IQR is 25%, which makes the former less vulnerable to the outliers present in gene expression datasets.
For future work in the direction of the current study, one could use the robust measures of dispersions like Q n and S n statistics rather than median absolute deviation (MAD). This study can be extended to multiclass problems as well. Moreover, one could use this technique in situations where the response variable is continuous.
Although this method is efficient and selects the most discriminative genes, there is still the possibility that two (or more) genes selected in the final set might be similar. This could cause the problem of redundancy in the selected set. One possible way to eliminate this problem is to use the Least Absolute Shrinkage and Selection Operator (LASSO) method in conjunction with the proposed method. Another way to deal with this issue is to divide the entire set of features into a set of clusters and then apply the proposed method to each cluster (Khan et al., 2019; Shamsara & Shamsara, 2020; Sharbaf, Mosafer & Moattar, 2016). The final set of genes, in that case, will be the combination of the genes selected from all the clusters. Extending the performance assessment of the selected genes to other recent classification methods (Khan et al., 2020a; Gul et al., 2018; Khanal et al., 2020; Khan et al., 2020b) could further validate the proposed gene selection method.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
The authors received no funding for this work.

Data Availability
The following information was supplied regarding data availability: Leukemia data is available at CRAN: https://cran.r-project.org/web/packages/propOverlap/index.html. The data can be loaded in R by installing and loading the R library propOverlap and then using the command data(leukemia).
nki data is available at CRAN: https://cran.r-project.org/web/packages/penalized/index.html. The data can be loaded in R by installing and loading the R library penalized and then using the command data(nki70).
Colon data is available at: https://www.openml.org/d/1432. Breast data is available at the following:

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.562#supplemental-information.