Research Article
A new feature selection method based on symmetrical uncertainty and interaction gain

https://doi.org/10.1016/j.compbiolchem.2019.107149

Highlights

  • Proposing a novel feature evaluation criterion taking feature relevance and feature interaction into consideration.

  • Employing Interaction Gain to examine the interaction among features.

  • Combining the proposed feature evaluation criterion with the Recursive Feature Elimination (RFE) technique to identify an informative feature subset.

  • Experimental results on eleven public data sets showed that the proposed method evaluates features more accurately and stably than the compared methods.

Abstract

Identifying important information in complex biological data is of great significance in biological studies. It is known that the physiological and pathological changes in an organism are usually influenced by molecular interactions. Analyzing biological data by fusing the evaluation of individual molecules with that of molecular interactions can yield a more accurate and comprehensive understanding of the organism. This study proposes an Interaction Gain - Recursive Feature Elimination (IG-RFE) method, which evaluates feature importance by combining the relevance between each feature and the class label with the interactions among features. Symmetrical uncertainty is adopted to measure the relevance between a feature and the class label. The average normalized interaction gain among feature f, every other feature and the class label is calculated to reflect the interaction of feature f with the other features in the feature set F. Based on the combination of symmetrical uncertainty and normalized interaction gain, less important features are removed iteratively. To assess the performance of IG-RFE, it was compared with seven efficient feature selection methods, MIFS, mRMR, CMIM, ReliefF, FCBF, PGVNS and SVM-RFE, on eleven public datasets. The experimental results showed the superiority of IG-RFE in accuracy, sensitivity, specificity and stability. Hence, integrating the individual discriminative ability of features with the interactions among them can better evaluate feature importance in biological data analysis.

Introduction

Feature selection is an efficient data analysis technique in machine learning and data mining, and it plays an important role in many fields, such as bioinformatics (Saeys et al., 2007). Removing noise and irrelevant data from biological data and identifying the features that best reflect the nature of biological problems are critical for disease diagnosis and mechanism studies.

According to how they are combined with the classification model, feature selection algorithms can be divided into three categories: filter, embedded, and wrapper (Saeys et al., 2007). Mutual Information Feature Selection (MIFS) (Battiti, 1994), Minimal-Redundancy-Maximal-Relevance (mRMR) (Peng et al., 2005), Conditional Mutual Information Maximization (CMIM) (Fleuret, 2004), Fast Correlation-Based Filter (FCBF) (Yu and Liu, 2004), Predominant Group based Variable Neighborhood Search (PGVNS) (García-Torres et al., 2016) and the distance-based ReliefF (Robnik-Sikonja and Kononenko, 2003) are filter techniques, which evaluate features according to their intrinsic characteristics (Ebrahimpour et al., 2018). MIFS, mRMR, CMIM, FCBF and PGVNS aim at selecting the subset of features with the maximum relevance to the target class and minimum redundancy among the selected features. ReliefF ranks features according to their ability to discriminate samples with different class labels and to cluster those with the same class label (Robnik-Sikonja and Kononenko, 2003). Support Vector Machine - Recursive Feature Elimination (SVM-RFE) is an efficient wrapper method; it iteratively removes the features with the lowest weights computed by an SVM learning model (Guyon et al., 2002).
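
To make the wrapper loop concrete, the sketch below follows the SVM-RFE procedure described by Guyon et al. (2002): train a linear SVM, rank the remaining features by their squared weights, and discard the lowest-ranked feature each round. The use of scikit-learn and the one-feature-per-iteration step are illustrative assumptions, not details taken from this paper.

```python
# Sketch of the SVM-RFE loop (Guyon et al., 2002): train a linear SVM,
# rank features by squared weight, and drop the weakest feature each round.
# Library choice (scikit-learn) and step size are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_select):
    """Return the column indices of the n_select surviving features."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_select:
        clf = SVC(kernel="linear").fit(X[:, remaining], y)
        ranking = (clf.coef_ ** 2).sum(axis=0)   # w_i^2 per remaining feature
        remaining.pop(int(np.argmin(ranking)))   # eliminate the lowest-ranked one
    return remaining
```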

Feature selection algorithms can also be organized into univariate and multivariate techniques (Saeys et al., 2007; Tabakhi and Moradi, 2015; Lai et al., 2006). Univariate feature selection algorithms, such as the t-test (Jafari and Azuaje, 2006) and the Laplacian score (Belkin and Niyogi, 2003), attract the most attention, especially in gene microarray analysis, because of their efficiency (Saeys et al., 2007). Multivariate feature selection algorithms try to capture feature-feature dependencies, but many of them, such as the aforementioned MIFS, mRMR, CMIM, FCBF and PGVNS, are only able to detect low-order dependencies (Vinh et al., 2016). Deep Feature Selection (DFS) (Li et al., 2015) is a typical multivariate feature selection technique that employs deep neural networks to learn high-order feature dependencies and identify informative features. Compared with univariate algorithms, multivariate feature selection algorithms usually lead to more accurate classifiers because they take dependencies among variables into consideration (Saeys et al., 2007).

There are complex dependencies among features in biological data, such as relevance and redundancy. Apart from relevance and redundancy, there is also feature interaction (Jakulin and Bratko, 2003). Interactive features are combinations of features in which each candidate feature provides information that the others cannot; that is, the features complement one another and work together to achieve high relevance with the class. Promoters and enhancers in genetic data are a typical pair of interactive features (Shlyueva et al., 2014): a promoter and an enhancer together determine the expression of the target gene(s).
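
A minimal numerical illustration of feature interaction, using the standard interaction-gain definition IG(X; Y; C) = I(X, Y; C) - I(X; C) - I(Y; C) on a toy XOR-style class (the data below are invented for illustration, not taken from the paper): each feature alone carries no information about the class, yet together the two features determine it completely.

```python
# Toy illustration of feature interaction (data invented for illustration):
# for an XOR-style class, I(f1;c) = I(f2;c) = 0 bits, yet I(f1,f2;c) = 1 bit,
# so the interaction gain IG(f1;f2;c) = I(f1,f2;c) - I(f1;c) - I(f2;c) = 1 bit.
import numpy as np
from collections import Counter

def H(*cols):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    p = np.array(list(Counter(zip(*cols)).values())) / len(cols[0])
    return -np.sum(p * np.log2(p))

f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
c  = [a ^ b for a, b in zip(f1, f2)]        # class label = XOR(f1, f2)

I_f1 = H(f1) + H(c) - H(f1, c)              # 0.0: f1 alone is uninformative
I_f2 = H(f2) + H(c) - H(f2, c)              # 0.0: f2 alone is uninformative
I_12 = H(f1, f2) + H(c) - H(f1, f2, c)      # 1.0: together they determine c
print(I_12 - I_f1 - I_f2)                   # interaction gain: 1.0
```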

Many feature selection algorithms focus only on identifying relevant features and removing redundant ones, but features may interact with each other and work together to reflect the nature of the problem. Taking biological data as an example: since an organism is a complex system whose physiological and pathological changes are usually influenced by molecular interactions, identifying interactive features is of great significance for the prevention, diagnosis and treatment of many diseases. Therefore, this study integrates feature relevance and feature interaction to measure feature importance and proposes a new feature selection algorithm based on interaction gain and the recursive feature elimination strategy (IG-RFE).

The rest of this paper is organized as follows: Section 2 first introduces evaluation criteria for feature relevance and feature interaction and then presents the IG-RFE algorithm. In Section 3, IG-RFE is compared with seven well-known feature selection techniques to show its effectiveness. Section 4 concludes the paper.


Methods

In complex biological systems, molecules interact with each other and work together to express physiological and pathological changes. Neglecting feature-feature interactions in data analysis may lose useful information and affect the analysis results (Chen et al., 2015).
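
Based only on the description in the abstract, a minimal sketch of the IG-RFE idea might look as follows: score each feature by its symmetrical uncertainty with the class plus its average interaction gain with every other feature and the class, then recursively eliminate the lowest-scoring feature. The additive combination and the omission of the paper's normalization step are assumptions; this is not the authors' exact criterion.

```python
# Minimal sketch of the IG-RFE idea as summarized in the abstract (not the
# authors' exact criterion): score(f) = SU(f, C) + mean_g IG(f; g; C), then
# recursively eliminate the lowest-scoring feature. The additive combination
# and the omission of the paper's normalization step are assumptions.
import numpy as np
from collections import Counter

def entropy(*cols):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    p = np.array(list(Counter(zip(*cols)).values())) / len(cols[0])
    return -np.sum(p * np.log2(p))

def mutual_info(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)

def symmetrical_uncertainty(x, y):
    denom = entropy(x) + entropy(y)
    return 2.0 * mutual_info(x, y) / denom if denom > 0 else 0.0

def interaction_gain(x, y, c):
    """IG(x; y; c) = I(x, y; c) - I(x; c) - I(y; c)."""
    i_xy_c = entropy(x, y) + entropy(c) - entropy(x, y, c)
    return i_xy_c - mutual_info(x, c) - mutual_info(y, c)

def ig_rfe(features, c, n_select):
    """features: list of discrete feature columns; c: class labels."""
    remaining = list(range(len(features)))
    while len(remaining) > n_select:
        scores = {}
        for f in remaining:
            su = symmetrical_uncertainty(features[f], c)
            others = [g for g in remaining if g != f]
            avg_ig = (np.mean([interaction_gain(features[f], features[g], c)
                               for g in others]) if others else 0.0)
            scores[f] = su + avg_ig                    # assumed additive combination
        remaining.remove(min(scores, key=scores.get))  # drop the weakest feature
    return remaining
```

Since these scores are entropy-based, continuous features would need to be discretized beforehand, for example with the multi-interval method of Fayyad et al. (1993) cited in the references.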

Experimental and discussion

In this section, empirical results are presented comparing IG-RFE with seven well-known feature selection algorithms on eleven public biological datasets.

Conclusions

In this paper, we propose a feature selection method, IG-RFE, based on symmetrical uncertainty and interaction gain, which weights features from two aspects: the relevance between each feature and the class, and the interactions among features. A recursive feature elimination strategy is employed to iteratively remove less important features from the current feature set. Like other feature selection methods, IG-RFE cannot reduce the explicit label-skewness diagnostic bias for class-imbalanced data. But the

Declaration of Competing Interest

None.

Acknowledgment

The study has been supported by the National Natural Science Foundation of China (21375011).

References (41)

  • N.X. Vinh et al., Can high-order dependencies improve mutual information based feature selection?, Pattern Recognit. (2016)

  • Z. Wang et al., A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure, Inf. Sci. (2015)

  • Z. Zeng et al., A novel feature selection method considering feature interaction, Pattern Recognit. (2015)

  • A.A. Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature (2000)

  • A. Asuncion et al., UCI Repository of Machine Learning Datasets (2007)

  • R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Netw. (1994)

  • M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. (2003)

  • T.M. Cover et al., Elements of Information Theory (1991)

  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (2000)

  • U.M. Fayyad et al., Multi-interval discretization of continuous-valued attributes for classification learning, Proceedings of the 13th International Joint Conference on Artificial Intelligence (1993)