A new feature selection algorithm based on relevance, redundancy and complementarity
Introduction
With the development of high-throughput technologies, large quantities of biological data have been produced. How to analyze these data and extract the important information from them has become one of the main focuses in biological studies. Feature extraction and feature selection are two efficient techniques for mining meaningful information from high-dimensional data. Feature extraction, such as principal component analysis and partial least squares-discriminant analysis, transforms the original feature space into a new low-dimensional space. Feature selection, such as ReliefF, Fisher Score and Lasso, reduces the original feature space to a low-dimensional feature subspace by directly removing noisy and uninformative input features [1]. Compared to feature extraction, feature selection reduces the dimensionality of the feature space without transformation and is therefore considered superior in terms of readability and biological interpretability. Hence, feature selection is widely applied to genomics and metabolomics data to identify biomarkers for disease diagnosis, early warning of malignant tumors and disease prognosis [2,3].
Feature selection can be divided into unsupervised, semi-supervised and supervised approaches. Unsupervised feature selection explores biological data without class labels and provides an effective way to discover unknown but meaningful patterns. Wolf et al. [4] proposed an unsupervised feature selection method, called Q-α, which defined feature relevance via the Laplacian spectrum and ranked features through a least-squares optimization process. Semi-supervised feature selection exploits both labeled and unlabeled data, using the labels as additional information to improve feature selection performance. Benabdeslem and Hindawi [5] proposed a semi-supervised feature selection method based on constraint selection and redundancy elimination. Supervised feature selection uses labeled data to select the feature subset.
Usually, supervised feature selection algorithms are classified into three categories: wrapper, embedded, and filter, depending on their relationship with the learning model [6]. Wrapper methods [7,8] use learning models to evaluate feature subsets by their classification accuracy. Support vector machine-recursive feature elimination (SVM-RFE) [9] is one of the classical and efficient wrapper methods. Embedded methods use learning models to guide feature selection. Regularization is an embedded technique that performs continuous shrinkage and automatically selects a feature subset [10,11]. Filter methods separate feature selection from the learning model and weigh features based on their intrinsic characteristics [12]. They are fast and can easily handle high-dimensional data. Some filters measure the redundancy among features in addition to feature importance, such as mutual information feature selection (MIFS) [13], minimal-redundancy-maximal-relevance (mRMR) [14], conditional mutual information maximization (CMIM) [15], minimum conditional relevance-minimum conditional redundancy (MCRMCR) and minimum conditional relevance-minimum intra-class redundancy (MCRMICR) [16], which aim to select the feature subset with maximum relevance to the class label and minimum redundancy. Fast correlation-based filter (FCBF) [17] and predominant group-based variable neighborhood search (PGVNS) [18] also consider feature redundancy; both use an approximate Markov blanket to identify and remove redundant features. Relationships among features are complex, however: some features exhibit synergy and complementarity. Hence, to define a powerful feature subset, some filters simultaneously take advantage of feature relevance, redundancy and complementarity.
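The relevance-redundancy trade-off used by filters such as mRMR can be sketched in a few lines. The toy data and the mean-redundancy (MID) scoring form below are illustrative assumptions, not the paper's implementation:

```python
import math
from collections import Counter

def mutual_info(a, b):
    """I(A;B) in nats for two discrete sequences of equal length."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def mrmr(features, y, k):
    """Greedy mRMR: pick the feature maximizing relevance I(f;Y)
    minus the mean mutual information with already-selected features."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(f):
            rel = mutual_info(features[f], y)
            red = (sum(mutual_info(features[f], features[s]) for s in selected)
                   / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: f2 is an exact copy of f1, f3 is essentially noise.
y = [0, 0, 0, 0, 1, 1, 1, 1]
feats = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 0],  # informative
    "f2": [0, 0, 0, 0, 1, 1, 1, 0],  # duplicate of f1 -> redundant
    "f3": [0, 1, 0, 1, 1, 0, 1, 0],  # uninformative about y
}
print(mrmr(feats, y, 2))  # ['f1', 'f3']
```

Note that the duplicate f2 is skipped in favor of a nearly uninformative feature: pure relevance-redundancy criteria penalize redundancy but remain blind to complementarity, which is the gap the methods discussed next try to close.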
Redundancy complementariness dispersion-based feature selection (RCDFS) [19] not only considers the relevance between feature and class label and pairwise inter-correlation of features but also extends traditional redundancy analysis to redundancy-complementariness analysis. Self-adaptive feature evaluation (SAFE) [20] uses feature complementarity in the search process, which penalizes redundancy and rewards complementarity based on an adaptive cost function.
Disease progression is a complex process that is usually driven not by individual molecules but by complex molecular interactions. Hence, neglecting the complementarity between features may discard important information for studying the nature of biomedical problems. Identifying interactive or complementary features is of great significance for understanding disease progression and improving prevention, diagnosis, and treatment. In this paper, we study the feature selection technique and propose a new feature selection algorithm based on feature relevance, redundancy and complementarity (FS-RRC). For a selected feature f, FS-RRC computes the complementarity of each remaining feature with f. If features complementary with f exist, FS-RRC adds the one with the largest complementarity to the selected feature subset. By combining feature relevance, redundancy and complementarity, FS-RRC can extract more of the important information in complex biological data.
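The paper's exact complementarity measure is not shown in this excerpt; a common information-theoretic stand-in is the interaction information I(f1; f2; Y), which is positive when two features predict the class better jointly than the sum of their individual contributions. A minimal sketch under that assumption, using the classic XOR case:

```python
import math
from collections import Counter

def mi(a, b):
    """I(A;B) in nats for two discrete sequences of equal length."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())

def complementarity(f1, f2, y):
    """Interaction information I(f1; f2; Y): joint relevance minus the
    sum of individual relevances. Positive -> f1 and f2 complement
    each other for predicting Y."""
    joint = list(zip(f1, f2))
    return mi(joint, y) - mi(f1, y) - mi(f2, y)

# XOR: each feature alone carries zero information about y,
# but together they determine it exactly.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(f1, f2)]
print(round(complementarity(f1, f2, y), 3))  # ln 2 ~ 0.693
```

A purely relevance-based filter would discard both XOR features (each has zero individual relevance); a complementarity-aware criterion like the one FS-RRC employs is what makes such feature pairs recoverable.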
The rest of this paper is organized as follows. Section 2 first introduces some evaluation criteria about feature relevance, redundancy and complementarity, and then FS-RRC is proposed. Section 3 gives the experimental settings and descriptions of the datasets. Section 4 shows the experimental results of FS-RRC compared with eleven well-known feature selection techniques. Finally, we discuss and conclude this paper.
Methods
The biological system is complex, and molecules cooperate and relate to each other in the process of physiological and pathological changes. Hence, the complementarity between features may also contain some meaningful information, and considering feature complementarity in addition to relevance and redundancy may induce a better biological data analysis result [19].
Experiments
In this section, FS-RRC is compared with eleven effective feature selection techniques, MIFS, mRMR, CMIM, ReliefF, FCBF, PGVNS, MCRMCR, MCRMICR, RCDFS, SAFE and SVM-RFE, on synthetic datasets and fifteen public real-world biological datasets to show its performance.
Comparison in synthetic datasets
In this section, we evaluate the performance of FS-RRC and the competitor algorithms on synthetic datasets. The results on SD1 and SD2 are given in Table 2. Bold font represents an optimal feature subset without irrelevant and redundant features. "Selected features" shows the features with a selection frequency above 75%. "Sn" and "Sp" denote the sensitivity and specificity of the feature selection method, respectively.
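On synthetic data with known ground truth, these two rates reduce to set arithmetic over the feature indices. A small helper (the counts below are hypothetical, not taken from Table 2) makes the definitions concrete:

```python
def selection_sn_sp(selected, relevant, all_features):
    """Sensitivity/specificity of a feature selection result:
    Sn = fraction of truly relevant features that were selected,
    Sp = fraction of irrelevant features that were excluded."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(all_features) - relevant
    sn = len(selected & relevant) / len(relevant)
    sp = len(irrelevant - selected) / len(irrelevant)
    return sn, sp

# Hypothetical run: 10 features, 0-2 truly relevant, method picked {0, 1, 5}.
sn, sp = selection_sn_sp({0, 1, 5}, {0, 1, 2}, range(10))
print(sn, sp)  # 2/3 of relevant features found; 6/7 of irrelevant excluded
```

An ideal method reaches Sn = Sp = 1, i.e., it recovers every relevant feature while rejecting every irrelevant or redundant one.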
All the algorithms
Discussion
Table 7 lists the characteristics of each feature selection method: selecting relevant features (SRF), eliminating redundant features (ERF), considering complementary features (CCF), avoiding high-dimensional MI estimation (AHMI), not requiring the number of selected features to be set a priori (SNP), and requiring no other parameters (OP) [42].
Table 7 shows that FS-RRC is parameter-free: it requires neither the number of selected features to be fixed a priori nor any other parameters. Additionally, FS-RRC avoids
Conclusions
In complex biological systems, molecules relate to each other, and they work together to reflect specific physiological and pathological changes. This study focuses on feature cooperation as well as feature relevance and redundancy and proposes a new feature selection algorithm FS-RRC. While removing irrelevant and redundant features, FS-RRC can select complementary features. The experiment on the two synthetic datasets illustrated the effectiveness of FS-RRC. The experiment on the fifteen
Author contributions
X. Lin and C. Li conceived and designed the experiments; C. Li, X. Luo, Y. Qi and Z. Gao searched the datasets and performed the experiments. C. Li drafted the manuscript and X. Lin revised the manuscript.
Declaration of competing interest
None.
References (42)
- et al., Feature selection in machine learning: a new perspective, Neurocomputing (2018)
- et al., A support vector machine-recursive feature elimination feature selection method based on artificial contrast variables and mutual information, J. Chromatogr. B (2012)
- et al., Toward optimal feature and time segment selection by divergence method for EEG signals classification, Comput. Biol. Med. (2018)
- et al., Wrappers for feature subset selection, Artif. Intell. (1997)
- et al., Wrapper-based gene selection with Markov blanket, Comput. Biol. Med. (2017)
- et al., The L1/2 regularization network Cox model for analysis of genomic data, Comput. Biol. Med. (2018)
- et al., Efficient feature selection filters for high-dimensional data, Pattern Recogn. Lett. (2012)
- et al., High-dimensional feature selection via feature grouping: a variable neighborhood search approach, Inf. Sci. (2016)
- et al., Feature selection with redundancy-complementariness dispersion, Knowl. Base Syst. (2015)
- et al., Feature selection: a data perspective, ACM Comput. Surv. (2018)