A new feature selection method based on symmetrical uncertainty and interaction gain
Introduction
Feature selection is an efficient data-analysis technique in machine learning and data mining, and it plays an important role in many fields, such as bioinformatics (Saeys et al., 2007). Removing noise and irrelevant data from biological data and identifying the features that best reflect the nature of a biological problem are critical for disease diagnosis and for studying disease mechanisms.
According to how they are combined with the classification model, feature selection algorithms can be divided into three categories: filter, embedded, and wrapper (Saeys et al., 2007). Mutual Information Feature Selection (MIFS) (Battiti, 1994), Minimal-Redundancy-Maximal-Relevance (mRMR) (Peng et al., 2005), Conditional Mutual Information Maximization (CMIM) (Fleuret, 2004), Fast Correlation-Based Filter (FCBF) (Yu and Liu, 2004), Predominant Group based Variable Neighborhood Search (PGVNS) (García-Torres et al., 2016) and distance-based ReliefF (Robnik-Sikonja and Kononenko, 2003) are filter techniques, which evaluate features according to their intrinsic characteristics (Ebrahimpour et al., 2018). MIFS, mRMR, CMIM, FCBF and PGVNS aim to select the subset of features with maximum relevance to the target class and minimum redundancy among the selected features. ReliefF ranks features according to their ability to discriminate samples with different class labels and to cluster those with the same class label (Robnik-Sikonja and Kononenko, 2003). Support Vector Machine-Recursive Feature Elimination (SVM-RFE) is an efficient wrapper method that iteratively removes the features with the lowest weights computed by the SVM learning model (Guyon et al., 2002).
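The wrapper strategy mentioned above can be illustrated with scikit-learn's generic RFE implementation. This is a minimal sketch on synthetic data, not a reproduction of any experiment from the paper; the dataset sizes and the choice of 5 retained features are illustrative assumptions.

```python
# Sketch of SVM-RFE: a linear SVM supplies per-feature weights, and
# RFE drops the lowest-weight feature each round. Illustrative only.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.datasets import make_classification

# Synthetic data: 100 samples, 20 features, 5 of them informative.
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

# Eliminate one feature per iteration until 5 remain.
selector = RFE(SVC(kernel="linear"), n_features_to_select=5, step=1)
selector.fit(X, y)

print(selector.support_.sum())        # number of selected features
print(np.where(selector.support_)[0]) # indices of the kept features
```

The same loop works with any estimator exposing feature weights; the SVM variant is the one Guyon et al. (2002) proposed.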
Feature selection algorithms can also be organized into univariate and multivariate techniques (Saeys et al., 2007; Tabakhi and Moradi, 2015; Lai et al., 2006). Univariate feature selection algorithms, such as the t-test (Jafari and Azuaje, 2006) and the Laplacian score (Belkin and Niyogi, 2003), attract the most attention, especially in gene microarray analysis, because of their efficiency (Saeys et al., 2007). Multivariate feature selection algorithms try to capture feature-feature dependencies; however, many multivariate methods, such as the aforementioned MIFS, mRMR, CMIM, FCBF and PGVNS, can only detect low-order dependencies (Vinh et al., 2016). Deep Feature Selection (DFS) (Li et al., 2015) is a typical multivariate feature selection technique that employs deep neural networks to learn high-order feature dependencies and identify informative features. Compared with univariate algorithms, multivariate feature selection algorithms generally lead to more accurate classifiers because they take dependencies among variables into consideration (Saeys et al., 2007).
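A univariate filter such as the t-test mentioned above scores each feature in isolation. The sketch below, on synthetic data with one deliberately class-dependent feature, shows how such a ranking is computed; the data and effect size are assumptions for illustration.

```python
# Univariate filter ranking via a two-sample t-test: each feature is
# scored independently by |t|; higher means more class-relevant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# 40 samples x 5 features; only feature 0 differs between classes.
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 5))
X[y == 1, 0] += 3.0  # inject a class-dependent mean shift

scores = [abs(stats.ttest_ind(X[y == 0, j], X[y == 1, j]).statistic)
          for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]
print(ranking[0])  # the shifted feature ranks first
```

Because each feature is scored alone, such a filter is fast but, as the text notes, it cannot see dependencies between features.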
There are complex dependencies among features in biological data, such as relevance and redundancy. Apart from relevance and redundancy, there is also feature interaction (Jakulin and Bratko, 2003). Interactive features are combinations of features in which each candidate feature provides information that the others cannot; that is, the features draw on each other's strengths and work together to achieve high relevance with the class. Promoters and enhancers in genetic data are a typical pair of interactive features (Shlyueva et al., 2014): a promoter and an enhancer together decide the expression of their target gene(s).
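Feature interaction of the kind described above is commonly quantified by the interaction gain IG(X; Y; C) = I(X, Y; C) - I(X; C) - I(Y; C) (Jakulin and Bratko, 2003). The sketch below computes it from empirical counts on discrete variables; an XOR-style pair, where each feature alone is uninformative but the two together determine the class, yields a strictly positive gain. The data are a toy assumption.

```python
# Interaction gain IG(X; Y; C) = I(X, Y; C) - I(X; C) - I(Y; C),
# estimated from empirical frequencies of discrete variables.
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_info(a, b):
    # I(A; B) = H(A) + H(B) - H(A, B)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def interaction_gain(x, y, c):
    return mutual_info(list(zip(x, y)), c) \
        - mutual_info(x, c) - mutual_info(y, c)

# XOR: neither x nor y alone predicts c, but their pair does.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
c = [0, 1, 1, 0]
print(round(interaction_gain(x, y, c), 3))  # 1.0 (one full bit gained)
```

A positive gain means the pair carries information about the class beyond what the two features provide separately, which is exactly the promoter/enhancer situation described in the text.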
Many feature selection algorithms focus only on identifying relevant features and removing redundant ones. But features may interact with each other and work together to reflect the nature of the problem. Take biological data as an example: since an organism is a complex system whose physiological and pathological changes are usually driven by molecular interactions, identifying interactive features is of great significance for the prevention, diagnosis and treatment of many diseases. Therefore, this study integrates feature relevance and feature interaction to measure feature importance and proposes a new feature selection algorithm based on interaction gain and the recursive feature elimination strategy (IG-RFE).
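The relevance side of this combination is the symmetrical uncertainty named in the paper's title, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), a normalization of mutual information to [0, 1]. The standalone sketch below implements it for discrete features; the example values are assumptions.

```python
# Symmetrical uncertainty SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)):
# mutual information rescaled to [0, 1] so features with different
# numbers of values are comparable. Sketch for discrete data.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - entropy(list(zip(x, y)))  # I(X; Y)
    return 2 * mi / (hx + hy) if hx + hy > 0 else 0.0

feature = [0, 0, 1, 1]
label   = [0, 0, 1, 1]  # feature determines the label exactly
print(symmetrical_uncertainty(feature, label))  # 1.0
```

SU(X, C) = 1 means the feature fully determines the class, and 0 means independence, which makes it a convenient relevance term to combine with the interaction-gain term.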
The rest of this paper is organized as follows: Section 2 first introduces evaluation criteria for feature relevance and feature interaction and then presents the IG-RFE algorithm. In Section 3, IG-RFE is compared with seven well-known feature selection techniques to show its effectiveness. Section 4 concludes the paper.
Methods
In complex biological systems, molecules interact with each other and work together to produce physiological and pathological changes. Neglecting feature-feature interactions in data analysis may discard useful information and affect the analysis results (Chen et al., 2015).
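Putting the pieces together, the elimination loop sketched here scores every remaining feature by its relevance to the class plus its average interaction gain with the other surviving features, then drops the weakest, repeating until the target size is reached. The exact weighting used by IG-RFE is not given in this excerpt, so the score below is an illustrative stand-in, and the toy data are assumptions.

```python
# Hedged skeleton of an interaction-aware recursive elimination loop:
# score = SU(feature, class) + mean interaction gain with survivors.
import math
from collections import Counter

def entropy(vals):
    n = len(vals)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(vals).values())

def mi(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def su(a, b):
    h = entropy(a) + entropy(b)
    return 2 * mi(a, b) / h if h > 0 else 0.0

def interaction_gain(a, b, c):
    return mi(list(zip(a, b)), c) - mi(a, c) - mi(b, c)

def ig_rfe(features, c, keep):
    """features: dict name -> list of discrete values; keep: target size."""
    selected = dict(features)
    while len(selected) > keep:
        scores = {}
        for name, f in selected.items():
            others = [g for n2, g in selected.items() if n2 != name]
            inter = (sum(interaction_gain(f, g, c) for g in others)
                     / len(others)) if others else 0.0
            scores[name] = su(f, c) + inter  # relevance + interaction
        selected.pop(min(scores, key=scores.get))  # drop the weakest
    return sorted(selected)

# Toy data: f1 and f2 interact (XOR) to determine c; f3 is noise.
c  = [0, 1, 1, 0, 0, 1, 1, 0]
f1 = [0, 0, 1, 1, 0, 0, 1, 1]
f2 = [0, 1, 0, 1, 0, 1, 0, 1]
f3 = [0, 0, 0, 0, 1, 1, 1, 1]
print(ig_rfe({"f1": f1, "f2": f2, "f3": f3}, c, keep=2))  # ['f1', 'f2']
```

Note that a purely relevance-based filter would score f1, f2 and f3 identically here (all have zero individual mutual information with c); only the interaction term separates the interacting pair from the noise feature.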
Experiments and discussion
In this section, empirical results are presented that compare IG-RFE with seven well-known feature selection algorithms on eleven public biological datasets.
Conclusions
In this paper we propose IG-RFE, a feature selection method based on symmetrical uncertainty and interaction gain, which weights features from two aspects: the relevance between each feature and the class, and the interactions among features. A recursive feature elimination strategy is employed to iteratively remove less important features from the current feature set. Like other feature selection methods, IG-RFE cannot reduce the explicit label-skewness diagnostic bias on class-imbalanced data. But the
Declaration of Competing Interest
None.
Acknowledgment
The study has been supported by the National Natural Science Foundation of China (21375011).
References (41)
- et al. Comparative metabolomics of estrogen receptor positive and estrogen receptor negative breast cancer: alterations in glutamine and beta-alanine metabolism. J. Proteomics (2013)
- et al. Feature selection with redundancy-complementariness dispersion. Knowl. Based Syst. (2015)
- et al. CCFS: a cooperating coevolution technique for large scale feature selection on microarray datasets. Comput. Biol. Chem. (2018)
- et al. High-dimensional feature selection via feature grouping: a variable neighborhood search approach. Inf. Sci. (2016)
- et al. Stable feature selection for biomarker discovery. Comput. Biol. Chem. (2010)
- et al. Random subspace method for multivariate feature selection. Pattern Recognit. Lett. (2006)
- et al. Analysis of connection networks among miRNAs differentially expressed in early gastric cancer for disclosing some biological features of disease development. Gene (2014)
- et al. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. (2009)
- et al. Gems: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int. J. Med. Inform. (2005)
- et al. Relevance-redundancy feature selection based on ant colony optimization. Pattern Recognit. (2015)
- Can high-order dependencies improve mutual information based feature selection? Pattern Recognit.
- A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure. Inf. Sci.
- A novel feature selection method considering feature interaction. Pattern Recognit.
- Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature
- UCI Repository of Machine Learning Datasets
- Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw.
- Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput.
- Elements of Information Theory
- An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods
- Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence