A new feature selection algorithm based on relevance, redundancy and complementarity

https://doi.org/10.1016/j.compbiomed.2020.103667

Highlights

  • A novel feature selection method based on relevance, redundancy and complementarity (FS-RRC) is proposed.

  • Multi-information is applied to examine the complementarity between features.

  • Experimental results show that FS-RRC can measure features more accurately and stably than the other methods.

Abstract

Identifying important information in biological data is critical for the study of disease diagnosis, drug efficacy and individualized treatment. Hence, the feature selection technique is widely applied. Many feature selection methods measure features based on relevance, redundancy and complementarity. Feature complementarity means that two features’ cooperation can provide more information than the simple summation of their individual information. In this paper, we studied the feature selection technique and proposed a new feature selection algorithm based on relevance, redundancy and complementarity (FS-RRC). When selecting the feature subset, FS-RRC evaluates not only the relevance of each feature to the class label and the redundancy among features but also the complementarity between features. If complementary features exist for a selected relevant feature, FS-RRC retains the feature with the largest complementarity to the selected feature subset. To show the performance of FS-RRC, it was compared with eleven efficient feature selection methods (MIFS, mRMR, CMIM, ReliefF, FCBF, PGVNS, MCRMCR, MCRMICR, RCDFS, SAFE and SVM-RFE) on two synthetic datasets and fifteen public biological datasets. The experimental results showed the superiority of FS-RRC in accuracy, sensitivity, specificity, stability and time complexity. Hence, integrating individual feature discriminative ability, redundancy and complementarity can define a more powerful feature subset for biological data analysis, and feature complementarity can help to study biomedical phenomena more accurately.

Introduction

With the development of high-throughput technologies, large quantities of biological data have been produced. How to analyze these data and extract the important information from them has become one of the main focuses of biological studies. Feature extraction and feature selection are two efficient techniques for mining meaningful information from high-dimensional data. Feature extraction, such as principal component analysis and partial least squares-discriminant analysis, transforms the original feature space into a new low-dimensional space. Feature selection, such as ReliefF, Fisher Score and Lasso, reduces the original feature space to a low-dimensional feature subspace by directly removing noisy and uninformative input features [1]. Compared to feature extraction, feature selection reduces the dimensionality of the feature space without transformation and is therefore considered superior in terms of readability and biological interpretability. Hence, feature selection is widely applied to genomics and metabolomics data to identify biomarkers for disease diagnosis, early warning of malignant tumors and disease prognosis [2,3].
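The distinction between the two techniques can be made concrete with a minimal sketch using standard scikit-learn calls (the data and shapes below are made up for illustration): feature extraction mixes all inputs into new components, while feature selection keeps a subset of the original columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))           # 100 samples, 50 original features
y = rng.integers(0, 2, size=100)         # binary class label

X_extracted = PCA(n_components=5).fit_transform(X)             # 5 new, mixed components
X_selected = SelectKBest(f_classif, k=5).fit_transform(X, y)   # 5 original columns

# X_extracted columns are linear mixtures of all inputs (hard to interpret
# biologically); X_selected columns are original features (directly
# interpretable, e.g., as candidate biomarkers).
```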

Feature selection methods can be divided into unsupervised, semi-supervised and supervised categories. Unsupervised feature selection explores biological data without class labels and can provide an effective way to discover unknown but meaningful results. Wolf et al. [4] proposed an unsupervised feature selection method, called Q-α, which defined feature relevance via the Laplacian spectrum and ranked features through a least squares optimization process. Semi-supervised feature selection studies both labeled and unlabeled data, using the labeled data as additional information to improve the performance of feature selection on the unlabeled data. Benabdeslem and Hindawi [5] proposed a semi-supervised feature selection method based on constraint selection and redundancy elimination. Supervised feature selection uses labeled data to select the feature subset.

Usually, supervised feature selection algorithms are classified into three categories, wrapper, embedded and filter, depending on their relationship with the learning model [6]. Wrapper methods [7,8] use learning models to evaluate feature subsets by their classification accuracy rates; support vector machine-recursive feature elimination (SVM-RFE) [9] is one of the classical and efficient wrapper methods. Embedded methods use learning models to guide feature selection; regularization is an embedding technique that performs continuous shrinkage and automatically selects a feature subset [10,11]. Filter methods separate feature selection from the learning model and weigh features based on their intrinsic characteristics [12]; they are fast and can easily handle high-dimensional data. Some filters measure the redundancy among features in addition to feature importance, such as mutual information feature selection (MIFS) [13], minimal-redundancy-maximal-relevance (mRMR) [14], conditional mutual information maximization (CMIM) [15], minimum conditional relevance-minimum conditional redundancy (MCRMCR) and minimum conditional relevance-minimum intra-class redundancy (MCRMICR) [16], which aim to select the feature subset with maximum relevance to the class label and minimum redundancy. Fast correlation-based filter (FCBF) [17] and predominant group-based variable neighborhood search (PGVNS) [18] also consider feature redundancy; they use the approximate Markov blanket to identify and remove redundant features. Feature relationships can be complex, however, and some features exhibit synergism and complementarity. Hence, to define a powerful feature subset, some filters simultaneously take advantage of feature relevance, redundancy and complementarity. Redundancy complementariness dispersion-based feature selection (RCDFS) [19] not only considers the relevance between each feature and the class label and the pairwise inter-correlation of features but also extends traditional redundancy analysis to redundancy-complementariness analysis. Self-adaptive feature evaluation (SAFE) [20] uses feature complementarity in the search process, penalizing redundancy and rewarding complementarity through an adaptive cost function.
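As a concrete illustration of this family of relevance-redundancy criteria, the sketch below implements an mRMR-style greedy score (relevance to the class minus mean redundancy with already-selected features) using standard scikit-learn mutual information estimators. It assumes integer-coded discrete features; the function name and structure are illustrative, not the implementation benchmarked in this paper.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    n_features = X.shape[1]
    # relevance of each feature to the class label
    relevance = mutual_info_classif(X, y, discrete_features=True)
    selected = [int(np.argmax(relevance))]    # start with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # mean pairwise MI with already-selected features = redundancy term
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```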

Disease progression is a complex process that is usually influenced not by individual molecules but by complex molecular interactions. Hence, neglecting the complementarity between features may lose important information for studying the nature of biomedical problems. Identifying interactive or complementary features is therefore of great significance for understanding disease progression and for prevention, diagnosis and treatment. In this paper, we studied the feature selection technique and proposed a new feature selection algorithm based on feature relevance, redundancy and complementarity (FS-RRC). For a selected feature f, FS-RRC computes the complementarity of each feature with f. If there exist features that are complementary to f, FS-RRC selects the one having the largest complementarity to the selected feature subset. By combining feature relevance, redundancy and complementarity, FS-RRC can define more important information from complex biological data.
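The notion of complementarity used above can be made concrete with interaction information, I(f; g; C) = I(f, g; C) − I(f; C) − I(g; C), which is positive exactly when two features together carry more class information than the sum of their individual contributions. The snippet below is a minimal sketch of this test for integer-coded features; the paper's multi-information estimator may differ in detail.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def interaction_information(f, g, c):
    """I(f; g; c) for nonnegative-integer-coded 1-D arrays f, g and labels c."""
    joint = f * (g.max() + 1) + g   # encode the pair (f, g) as one variable
    return (mutual_info_score(joint, c)
            - mutual_info_score(f, c)
            - mutual_info_score(g, c))

# Usage: the XOR pattern is the classic complementary pair; each feature
# alone carries zero class information, but together they determine c.
f = np.array([0, 0, 1, 1]); g = np.array([0, 1, 0, 1]); c = f ^ g
print(interaction_information(f, g, c) > 0)   # True
```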

The rest of this paper is organized as follows. Section 2 first introduces some evaluation criteria for feature relevance, redundancy and complementarity, and then FS-RRC is proposed. Section 3 gives the experimental settings and descriptions of the datasets. Section 4 shows the experimental results of FS-RRC compared with eleven well-known feature selection techniques. Finally, we discuss the results and conclude the paper.

Section snippets

Methods

Biological systems are complex: molecules cooperate and relate to each other in the course of physiological and pathological changes. Hence, the complementarity between features may also contain meaningful information, and considering feature complementarity in addition to relevance and redundancy may yield better biological data analysis results [19].
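Since the Methods section is only excerpted here, the following is a speculative sketch of a greedy loop in the spirit of the FS-RRC description given in the Introduction: rank features by relevance, and after each pick, prefer the remaining feature with the largest positive complementarity to it. The scoring details, the redundancy elimination (omitted here for brevity) and the stopping rule are all assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def interaction(f, g, c):
    # I(f; g; c) = I(f,g; c) - I(f; c) - I(g; c), for integer-coded arrays
    joint = f * (g.max() + 1) + g
    return (mutual_info_score(joint, c)
            - mutual_info_score(f, c) - mutual_info_score(g, c))

def fs_rrc_like(X, y, k):
    n = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n)])
    order = list(np.argsort(relevance)[::-1])   # most relevant first
    selected = [order.pop(0)]
    while len(selected) < k and order:
        f = selected[-1]
        # complementarity of every remaining feature with the last pick
        comps = {j: interaction(X[:, f], X[:, j], y) for j in order}
        best = max(comps, key=comps.get)
        if comps[best] <= 0:        # no complementary feature exists:
            best = order[0]         # fall back to the next most relevant one
        order.remove(best)
        selected.append(best)
    return selected
```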

Experiments

In this section, FS-RRC is compared with eleven effective feature selection techniques (MIFS, mRMR, CMIM, ReliefF, FCBF, PGVNS, MCRMCR, MCRMICR, RCDFS, SAFE and SVM-RFE) on two synthetic datasets and fifteen public real-world biological datasets to show its performance.

Comparison in synthetic datasets

In this section, we evaluate the performance of FS-RRC and the competitor algorithms on synthetic datasets. The results of the experiments on SD1 and SD2 are given in Table 2. Bold font marks an optimal feature subset without irrelevant and redundant features, “Selected features” lists the features selected with a frequency above 75%, “Sn” denotes the sensitivity of the feature selection method, and “Sp” denotes its specificity.
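For the synthetic datasets the ground-truth relevant features are known, so Sn and Sp can be computed directly on the selected subset. A small sketch under one common definition (assumed here; the paper's exact formulas are not shown in this excerpt): Sn is the fraction of truly relevant features that were selected, and Sp is the fraction of irrelevant features that were rejected.

```python
def selection_sn_sp(selected, relevant, n_features):
    """Sensitivity and specificity of a selected feature subset."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(n_features)) - relevant
    sn = len(selected & relevant) / len(relevant)    # relevant features found
    sp = len(irrelevant - selected) / len(irrelevant)  # irrelevant features rejected
    return sn, sp

# Example: 10 features, true relevant = {0, 1, 2}, method selects {0, 1, 5}
# -> Sn = 2/3, Sp = 6/7
```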

All the algorithms

Discussion

Table 7 lists the characteristics of each feature selection method: selecting relevant features (SRF), eliminating redundant features (ERF), considering complementary features (CCF), avoiding high-dimensional MI estimation (AHMI), not requiring the number of selected features to be set a priori (SNP), and requiring no other parameters (OP) [42].

Table 7 shows that FS-RRC is parameter-free: it requires neither the number of selected features to be set a priori nor any other parameters. Additionally, FS-RRC avoids

Conclusions

In complex biological systems, molecules relate to each other and work together to reflect specific physiological and pathological changes. This study focuses on feature cooperation in addition to feature relevance and redundancy and proposes a new feature selection algorithm, FS-RRC. While removing irrelevant and redundant features, FS-RRC can select complementary features. The experiment on the two synthetic datasets illustrated the effectiveness of FS-RRC. The experiment on the fifteen

Author contributions

X. Lin and C. Li conceived and designed the experiments; C. Li, X. Luo, Y. Qi and Z. Gao collected the datasets and performed the experiments; C. Li drafted the manuscript and X. Lin revised the manuscript.

Declaration of competing interest

None.

References (42)

  • J. Budczies et al., Comparative metabolomics of estrogen receptor positive and estrogen receptor negative breast cancer: alterations in glutamine and beta-alanine metabolism, J. Proteom. (2013)
  • A. Statnikov et al., GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data, Int. J. Med. Inf. (2005)
  • N. Iizuka et al., Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection, Lancet (2003)
  • D. Singh et al., Gene expression correlates of clinical prostate cancer behavior, Canc. Cell (2002)
  • Z. Wang et al., A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure, Inf. Sci. (2015)
  • L. Wolf et al., Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach, J. Mach. Learn. Res. (2005)
  • D.H. Peluffo et al., Unsupervised relevance analysis for feature extraction and selection. A distance-based approach for feature relevance
  • Y. Saeys et al., A review of feature selection techniques in bioinformatics, Bioinformatics (2007)
  • I. Guyon et al., Gene selection for cancer classification using support vector machines, Mach. Learn. (2002)
  • Y. Liang et al., Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification, BMC Bioinf. (2013)
  • R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Network. (1994)