
KNCFS: Feature selection for high-dimensional datasets based on improved random multi-subspace learning

Abstract

Feature selection has long been a focal point of research in many fields. Recent studies have applied random multi-subspace methods to extract more information from raw samples. However, this approach inadequately addresses the adverse effects of feature collinearity in high-dimensional datasets. To improve the limited ability of traditional algorithms to extract useful information from raw samples, while accounting for feature collinearity during random subspace learning, we group features using a clustering approach based on a correlation measure and then construct subspaces with lower inter-feature correlations. When integrating the feature weights obtained from all feature spaces, we introduce a weighting factor to better handle the contributions of different feature spaces. We comprehensively evaluate the proposed algorithm on ten real datasets and four synthetic datasets, comparing it with six other feature selection algorithms. Experimental results demonstrate that our algorithm, denoted KNCFS, effectively identifies relevant features and exhibits robust feature selection performance, making it particularly well suited to practical feature selection problems.

1. Introduction

In disease prediction tasks, the collected DNA microarray datasets are often high-dimensional. Searching among these genes for those that determine the occurrence of disease is challenging, as it constitutes an NP-hard problem with a complexity of O(2^d). Furthermore, high-dimensional datasets contain a significant amount of redundant and noisy features. Blindly learning these features causes the model to learn spurious correlations and degrades its performance [1, 2]. An effective way to address this challenge is to reduce data dimensionality through feature selection [3]. The objective of feature selection is to retain relevant features while discarding irrelevant ones; it not only reduces feature dimensionality but also enhances model performance.

Feature selection (FS) methods can be categorized into three primary modes: wrapper, filter, and embedded [4]. Wrapper methods typically employ heuristic search to select the features most favorable with respect to an evaluation metric [5-8]. They often use swarm-intelligence optimization to generate binary solution vectors, where 1 denotes that a feature is selected and 0 means that it is excluded from the feature subset; examples include bee colony optimization [6], particle swarm optimization [8], and whale optimization [9]. However, when dealing with high-dimensional data, these methods often struggle to complete the search within a reasonable time frame [10]. To address this issue, some filter methods search for the optimal subset by exploring the intrinsic relationships between samples and features [11-14]. For example, [13] combines the correlation coefficient with mutual information to measure the relationships between features for feature selection, and [15] uses mutual information and joint mutual information to balance the significance of two feature-correlation terms for weighted correlation-based feature selection. Because no specific classifier guides the feature selection stage, the features selected by such methods may not be optimal [16]. Embedded methods, on the other hand, view learning the optimal subset as an optimization problem. They introduce penalties or constraints into FS through the construction of an objective function and regularization terms on the feature weights [17-22], encouraging the model to select the most relevant features. For example, [23] embeds a relevance self-representation matrix into unsupervised learning to account for complete sample relevance and feature dependencies; [24] identifies relevant features by embedding indicator labels into ridge regression models; and [25] proposes an adaptive LapSVM feature selection method that embeds the acquisition of Laplacian matrices into SVM training to achieve semi-supervised learning. Compared with filter methods, embedded methods interact with a classifier and can often select the features with the highest information content [26].

Neighborhood Component Feature Selection (NCFS) [27] is an embedded feature selection method that has attracted significant attention, primarily because of its excellent performance on high-dimensional datasets. However, NCFS has a notable limitation: it learns only within the original feature space, so it extracts relatively limited information from the raw samples and fails to fully exploit the latent information in the data. In a separate study, a Random Multi-subspace Approach [28] was proposed. This approach treats ReliefF as a black box and, through multiple random partitions of the feature space, learns local weights in each subspace to enhance the sample diversity available to ReliefF. It is worth noting, however, that the experiments in [28] were limited to low-dimensional datasets, with at most 649 features. Our further investigation suggests that directly applying the random subspace approach to high-dimensional datasets offers only limited performance gains for NCFS. In high-dimensional datasets, some features can be approximately represented as linear combinations of other features, resulting in a certain degree of feature collinearity, which can reduce the model's generalization performance. Furthermore, during random partitioning of the feature space, collinear features may accidentally be assigned to the same subspace, which can lead to overfitting and consequently decrease the accuracy of feature selection.

To address the limited information NCFS captures from the original samples, we introduce an enhanced approach that simultaneously increases the diversity of the original samples and mitigates the problem of feature collinearity. Formally, we propose a method that uses a clustering algorithm to construct random subspaces, aiming to alleviate the impact of collinearity. Furthermore, after feature weight learning is completed within each feature partition, we employ a feature partition weight factor to assess the contribution of each partition to the final weight vector, rather than simple averaging. Extensive experiments on ten high-dimensional datasets and on synthetic datasets validate the effectiveness of our algorithm. The primary contributions of this paper are summarized as follows.

  • The proposed method simultaneously addresses the diversity of the random subspaces during the feature selection process and the problem of feature collinearity.
  • A feature partition weight factor is introduced to weight the importance of the features learned within each feature partition.
  • Multiple sets of experiments on synthetic and real datasets confirm the effectiveness of the proposed method. Notably, the experimental results demonstrate that accounting for feature collinearity in the random subspace approach enhances the effectiveness of feature selection.

The remainder of the paper is organised as follows. Section 2 presents the preliminaries: the NCFS algorithm is briefly introduced, the random multi-subspace method is detailed in Section 2.3, and K-means clustering is briefly reviewed in Section 2.4. Section 3 presents our method, Section 4 reports the experimental results of the new method and the comparative methods, and, finally, Section 5 draws conclusions.

2. Preliminaries

In this section, we introduce the notation and definitions of this paper in Section 2.1. Section 2.2 briefly describes the original NCFS method, Section 2.3 presents the random multi-subspace method, and Section 2.4 reviews K-means clustering.

2.1 Notation and definition

Let X = [x_1, x_2, …, x_n]^T ∈ R^{n×d} be a feature matrix, i.e., a set of n training samples with dimensionality d, and let y = [y_1, y_2, …, y_n]^T be the corresponding sample labels. In addition, X can be formalised as a feature set F = {f_1, f_2, …, f_d}, where f_i denotes the column vector formed by the i-th feature over all samples. Then, according to the definition in [28], the set family E is a feature partition of F when the following conditions hold:


  • the members of E are non-empty, pairwise disjoint, and their union is F, i.e., A ∩ B = ∅ for all A, B ∈ E with A ≠ B, and ⋃_{A∈E} A = F;
  • when A ∈ E, A is called a random subspace. For example, let D = {f1, f2, f3, f4, f5} be a set of 5 feature columns, and let its subsets be A = {f1, f3}, B = {f2, f4}, and C = {f5}; then, by definition, {A, B, C} is a feature partition of D, and each of A, B, C is a subspace (a small check of this example is sketched below).
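To make the two conditions concrete, here is a minimal Python check of the example above (feature names represented as strings; the helper name is ours, not the paper's):

```python
def is_feature_partition(subsets, full_set):
    """Return True if the subsets are non-empty, pairwise disjoint, and their union equals full_set."""
    union = set()
    for s in subsets:
        if not s or (union & s):      # empty member or overlap -> not a partition
            return False
        union |= s
    return union == full_set

D = {"f1", "f2", "f3", "f4", "f5"}
A, B, C = {"f1", "f3"}, {"f2", "f4"}, {"f5"}
print(is_feature_partition([A, B, C], D))   # True: {A, B, C} is a feature partition of D
```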

2.2 Neighborhood component feature selection

NCFS is an embedded feature selection method that utilizes a nearest-neighbour model. It measures the similarity between samples using feature-weighted distances. For each sample x_i, the algorithm measures the probability of correct classification with a probability distribution function. After summing the probabilities of all samples being classified correctly, NCFS introduces a penalty term to prevent overfitting.

The algorithm initializes the feature importance weights w as a vector with all elements set to 1. Then, based on w, it defines the weighted distance between two samples x_i and x_j as

(1) $D_w(x_i, x_j) = \sum_{l=1}^{d} w_l^2 \, | x_{il} - x_{jl} |$

where w_l denotes the weight of the l-th feature. In order to learn w based on the approximate leave-one-out classification accuracy, NCFS further defines the probability that sample x_i selects x_j as its reference point:

(2) $p_{ij} = \dfrac{\kappa\bigl(D_w(x_i, x_j)\bigr)}{\sum_{k \neq i} \kappa\bigl(D_w(x_i, x_k)\bigr)}, \qquad p_{ii} = 0$

where κ(z) = exp(-z/σ) is the kernel function and σ is the kernel width. According to the above definition, the probability that the query point x_i is correctly classified is

(3) $p_i = \sum_{j} y_{ij} \, p_{ij}$

where y_{ij} = 1 if and only if y_i = y_j, and y_{ij} = 0 otherwise. Finally, NCFS defines the objective function as

(4) $F(w) = \sum_{i=1}^{n} p_i - \lambda \sum_{l=1}^{d} w_l^2$

where λ is the regularisation parameter to be tuned. To maximise the objective function, one sets the derivative of F(w) with respect to w to zero to derive a locally optimal value of the feature weights w, then uses gradient ascent to update w until F(w) converges near its maximum, and outputs the weight vector w at that point.
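For concreteness, the following is a simplified, full-batch Python sketch of the NCFS procedure described by Eqs (1)-(4). The fixed learning rate and iteration count are our own illustrative assumptions, not the authors' exact update schedule.

```python
import numpy as np

def ncfs(X, y, sigma=1.0, lam=1.0, lr=0.1, n_iter=100):
    """Simplified NCFS sketch: learn feature weights by gradient ascent on Eq (4)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    w = np.ones(d)                                        # all features start equally important
    same = (y[:, None] == y[None, :]).astype(float)       # y_ij in Eq (3)
    diff = np.abs(X[:, None, :] - X[None, :, :])          # |x_il - x_jl|, shape (n, n, d)
    for _ in range(n_iter):
        dist = (diff * w ** 2).sum(axis=2)                # weighted distance, Eq (1)
        K = np.exp(-dist / sigma)                         # kernel values
        np.fill_diagonal(K, 0.0)                          # p_ii = 0
        p = K / K.sum(axis=1, keepdims=True)              # reference-point probabilities, Eq (2)
        p_i = (p * same).sum(axis=1)                      # probability of correct classification, Eq (3)
        # gradient of F(w) = sum_i p_i - lam * sum_l w_l^2 with respect to each w_l
        term = (p_i[:, None, None] - same[:, :, None]) * p[:, :, None] * diff
        grad = 2.0 * (term.sum(axis=(0, 1)) / sigma - lam) * w
        w = w + lr * grad                                 # gradient ascent step
    return w
```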

2.3 Random multi-subspace based learning

The Random Multi-subspace Approach, introduced in [28], partitions the original feature space into s different subspaces, each constructed from a distinct subset of features. Feature weights are then learned separately within each feature subspace. By repeatedly performing such partitions, the Random Multi-subspace Approach can extract additional information from the data, thereby enhancing the model's robustness and generalization capacity.

In the context of the Random Multi-subspace Approach, each random partition of the original feature space is referred to as a feature partition. Assuming the approach performs M random partitions of the original feature space, the i-th feature partition can be represented as

(5) $E^{(i)} = \{ P^{(i,1)}, P^{(i,2)}, \ldots, P^{(i,s)} \}$

where s is the number of random subspaces and P^{(i,j)} denotes the j-th subspace within the i-th feature partition, j ∈ {1, 2, …, s}. Here, we assume an equal number of feature subspaces within each feature partition.

For an original feature space comprising d features, a random partition can be executed by first generating a random permutation of the d features. The first s features are then assigned, one each, to the distinct subspaces, and the remaining features are allocated sequentially to the different subspaces until all features have been partitioned. Evidently, within each feature partition, every feature belongs to exactly one feature subspace.
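A minimal sketch of this partitioning step, using feature indices and round-robin assignment after a random permutation (so subspace sizes differ by at most one):

```python
import numpy as np

def random_feature_partition(d, s, rng=None):
    """Split feature indices 0..d-1 into s disjoint random subspaces of near-equal size."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(d)                  # random permutation of the d features
    return [perm[j::s] for j in range(s)]      # deal features to the s subspaces in turn

# one feature partition of 10 features into 3 subspaces
print(random_feature_partition(10, 3, rng=0))
```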

For each subspace P^{(i,j)}, local feature weights w^{(i,j)} can be computed using a feature selection method such as ReliefF. The overall feature weight of the i-th feature partition is then obtained by splicing together the local weights of its s subspaces:

(6) $w^{(i)} = \bigl[ w^{(i,1)}, w^{(i,2)}, \ldots, w^{(i,s)} \bigr]$

Assuming that each feature partition contributes equally to the final feature weight, the final feature weights are obtained by averaging the feature weights of the M feature partitions:

(7) $w = \dfrac{1}{M} \sum_{i=1}^{M} w^{(i)}$
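Eq (7) is simply an element-wise average of the M spliced weight vectors; for example, in numpy:

```python
import numpy as np

# Toy example: M = 3 feature partitions over d = 4 features;
# row i holds w^(i), the weights spliced from the s subspaces of partition i.
W = np.array([[0.9, 0.1, 0.4, 0.2],
              [0.8, 0.2, 0.5, 0.1],
              [0.7, 0.3, 0.6, 0.3]])
w_final = W.mean(axis=0)    # Eq (7): average over the M partitions
print(w_final)
```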

2.4 K-means cluster

We employ the k-means algorithm for feature clustering; it is one of the most well-known and widely used clustering methods [29] and partitions a set of samples into k clusters (the value of k must be predetermined).

Let A = {a_1, …, a_k} represent the k cluster centers. Consider z = [z_{ic}]_{d×k}, where z_{ic} is a binary variable taking values 0 or 1 that indicates whether feature f_i belongs to the c-th cluster, c ∈ {1, …, k}. The objective function of k-means is

(8) $J(A, z) = \sum_{i=1}^{d} \sum_{c=1}^{k} z_{ic} \, D^2(f_i, a_c)$

where D^2(f_i, a_c) denotes the squared Euclidean distance between feature f_i and the c-th cluster centre a_c; the Euclidean distance is a commonly used similarity measure. The k-means algorithm iteratively minimizes the objective function J(A, z), updating the cluster centers A and the membership matrix z as

(9) $a_c = \dfrac{\sum_{i=1}^{d} z_{ic} \, f_i}{\sum_{i=1}^{d} z_{ic}}$

(10) $z_{ic} = \begin{cases} 1, & \text{if } c = \arg\min_{c'} D^2(f_i, a_{c'}) \\ 0, & \text{otherwise} \end{cases}$

The algorithmic steps of k-means are as follows. Initially, k features are randomly chosen as the centers of the k clusters. Then, (1) the membership degree of each feature to each cluster center is computed according to Eq (10): a feature f_i is assigned to cluster c if a_c is its nearest cluster center. Once all features are assigned to their respective clusters, (2) the position of each cluster center is updated according to Eq (9). Steps (1) and (2) alternate until the stopping criterion is met.
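A plain sketch of this loop, mirroring Eqs (8)-(10); when clustering features as in this paper, each row of the input matrix is one (transposed) feature column:

```python
import numpy as np

def kmeans(F, k, max_iter=300, tol=1e-4, rng=None):
    """Basic k-means over the rows of F: assign to the nearest center (Eq 10), recompute centers (Eq 9)."""
    F = np.asarray(F, dtype=float)
    rng = np.random.default_rng(rng)
    centers = F[rng.choice(len(F), size=k, replace=False)]            # random initial centers
    for _ in range(max_iter):
        d2 = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) # squared Euclidean distances
        assign = d2.argmin(axis=1)                                    # Eq (10): nearest center wins
        new_centers = np.array([F[assign == c].mean(axis=0) if np.any(assign == c)
                                else centers[c] for c in range(k)])   # Eq (9): cluster means
        if np.linalg.norm(new_centers - centers) < tol:               # stop when centers settle
            break
        centers = new_centers
    return assign, centers
```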

3. The proposed method

Previous random multi-subspace weight learning methods did not take into account that, in high-dimensional datasets, some highly collinear features might accidentally be allocated to the same subspace. Such collinearity is prevalent and can lead to local overfitting, thereby reducing the accuracy of feature selection. Hence, to address feature collinearity within random subspaces while preserving diversity within the original sample space, we propose a novel approach.

Our algorithm performs M iterations. Initially, each feature is treated as equally important, and the initial weight vector w is set to a vector of all ones. In the i-th iteration, we first partition the original feature set into k clusters using K-means clustering based on a correlation measure. Subsequently, we randomly select features from each feature cluster to construct s equally sized random subspaces. Within each subspace, we employ NCFS to learn its local feature weights. These local feature weights are then integrated into a complete d-dimensional feature vector, denoted w(i). When integrating the feature weights w(i) learned in each iteration into the overall feature weight vector w, we apply an importance factor to weight them, rather than taking a simple average as in previous approaches. The general framework of the algorithm is illustrated in Fig 1. Section 3.1 presents the method for constructing random subspaces using K-means clustering, while Section 3.2 introduces the proposed weighting factor.

3.1 Use K-means to generate subspaces

To address feature collinearity within random subspaces, we cluster features according to their inter-correlations; the purpose of this step is to group features with a certain level of correlation into the same cluster. To achieve this, we use the correlation coefficient as the measure in the K-means objective function instead of the Euclidean distance, so that Eqs (8) and (10) are redefined accordingly: (11) where pea(f_i, a_c) represents the Pearson coefficient between feature f_i and the c-th cluster center a_c. The Pearson coefficient is a commonly used measure of the degree of correlation between variables, and its value ranges between -1 and 1. A Pearson coefficient closer to 0 indicates a weaker correlation between variables, while a coefficient closer to 1 (or -1) indicates a stronger correlation.
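The exact analytic form of Eq (11) is not reproduced here; one natural correlation-based dissimilarity, assumed purely for illustration, is 1 - |pea(f_i, a_c)|, which treats strongly correlated (or anti-correlated) features as close so that they land in the same cluster:

```python
import numpy as np

def pearson_dissimilarity(f, a):
    """Assumed dissimilarity for correlation-based clustering: 1 - |Pearson correlation|."""
    r = np.corrcoef(f, a)[0, 1]
    return 1.0 - abs(r)

f1 = np.array([1.0, 2.0, 3.0, 4.0])
f2 = np.array([2.1, 3.9, 6.0, 8.2])     # nearly collinear with f1
print(pearson_dissimilarity(f1, f2))    # close to 0, so f1 and f2 would share a cluster
```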

Once the feature set is divided into k clusters (K_1, …, K_k), any feature cluster K_c comprises the features

(12) $K_c = \{ f^{(c,1)}, f^{(c,2)}, \ldots, f^{(c,n_c)} \}$

where f^{(c,j)} represents the j-th feature in the c-th feature cluster and n_c is the number of features in the c-th cluster. To construct feature subspaces with low collinearity among their constituents, we generate a random permutation of length n_c for each feature cluster K_c and then sequentially assign contiguous features to each subspace; the remaining features are assigned to the different subspaces in order until all features have been partitioned. The feature cluster K_c can therefore be represented as

(13) $K_c = su^{(c,1)} \cup su^{(c,2)} \cup \ldots \cup su^{(c,s)}$

where su^{(c,j)} denotes the features assigned by K_c to the j-th subspace, and the su^{(c,j)} are pairwise disjoint. Thus, each subspace P^{(i,j)} can be represented as

(14) $P^{(i,j)} = su^{(1,j)} \cup su^{(2,j)} \cup \ldots \cup su^{(k,j)}$

That is, each feature cluster is evenly divided into s segments, which are then separately incorporated into the s subspaces. The subspaces P^{(i,j)} together constitute the i-th feature partition generated by the algorithm. Subsequently, we employ NCFS to learn local feature weights within each subspace, where w^{(i,j)} denotes the feature weights learned in the j-th subspace of the i-th feature partition (i.e., the i-th iteration of the algorithm). After computing the feature weights w^{(i,j)} for all subspaces, they are consolidated into a complete d-dimensional feature weight vector, denoted w^{(i)}.
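The following sketch shows how each feature cluster can be shuffled and spread evenly across the s subspaces, as in Eqs (13) and (14); the cluster contents are toy feature indices:

```python
import numpy as np

def subspaces_from_clusters(clusters, s, rng=None):
    """Build s subspaces with low internal correlation by slicing every cluster across all of them."""
    rng = np.random.default_rng(rng)
    subspaces = [[] for _ in range(s)]
    for feats in clusters:                      # feats: feature indices of one cluster K_c
        perm = rng.permutation(feats)           # random permutation of length n_c
        for j in range(s):
            subspaces[j].extend(perm[j::s])     # the j-th slice su^(c,j) goes to subspace j
    return [np.sort(np.array(p, dtype=int)) for p in subspaces]

clusters = [[0, 3, 7], [1, 2, 4, 8], [5, 6, 9]]  # toy output of the correlation-based clustering
print(subspaces_from_clusters(clusters, s=2, rng=0))
```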

3.2 Weighting w(i)

In prior random-subspace methods, every feature partition's contribution was assigned equal weight, so the final feature weights are an average of the weights across all partitions. We argue that the contributions of the feature partitions need not be equal. Therefore, we introduce an appraisal factor α for the feature weight vector w^{(i)} to provide a weighted assessment of every w^{(i)}. Before calculating α, we must obtain the correlation matrix R associated with the feature matrix X, whose (i, j)-th element is computed as

(15) $R_{ij} = \mathrm{pea}(f_i, f_j)$

where f_i and f_j represent the i-th and j-th feature columns of X and 'pea' denotes the Pearson coefficient. By this definition, R is a d×d symmetric matrix. Subsequently, we perform a Cholesky decomposition of R and multiply the resulting factor by a random sample drawn from a Gaussian distribution, obtaining v:

(16) $v = L g, \qquad \text{where } R = L L^{T} \text{ and } g \sim \mathcal{N}(0, I_d)$

The vector v is thus a random variable that preserves the correlation structure of the features. We then pass v through the cumulative distribution function of the standard normal distribution, which yields a vector u of cumulative probabilities distributed over [0, 1]:

(17) $u_k = \Phi(v_k), \qquad k = 1, \ldots, d$

Define π^{(i)} as the normalised w^{(i)} and let X^{(i)} be an initially empty feature set. Formally, if (π^{(i)})_k is greater than u_k, the feature f_k is added to X^{(i)}. Finally, the weighting factor α is defined as

(18) $\alpha = \mathrm{evaluate}\bigl(\mathrm{Classifier}(X^{(i)}, y)\bigr)$

Here, the 'evaluate' function computes the classification result on the feature matrix X^{(i)}, with classification accuracy (ACC) chosen as the evaluation criterion and Classifier denoting the selected classifier; we use a KNN classifier (k = 3). Therefore, at the end of the i-th iteration of the algorithm, w^{(i)} is added to the total weight vector w as

(19) $w = w + \alpha \, w^{(i)}$

where w^{(i)} denotes the feature weight vector obtained in the i-th iteration. The detailed steps of the algorithm are described in Algorithm 1; a simplified sketch of this weighting step follows.
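Below is a hedged Python sketch of the weighting step (Eqs (15)-(19)). The min-max normalisation of w^{(i)} and the use of 3-fold cross-validation accuracy inside 'evaluate' are our own assumptions; the paper only states that a 3-NN classifier scores the selected feature matrix X^{(i)}.

```python
import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def partition_weight_factor(X, y, w_i, rng=None):
    """Sketch of the appraisal factor alpha for one feature partition."""
    rng = np.random.default_rng(rng)
    R = np.corrcoef(X, rowvar=False)                          # Eq (15): pairwise Pearson matrix of the features
    L = np.linalg.cholesky(R + 1e-6 * np.eye(R.shape[0]))     # small jitter keeps R positive definite
    v = L @ rng.standard_normal(R.shape[0])                   # Eq (16): Gaussian sample with R's correlation structure
    u = norm.cdf(v)                                           # Eq (17): cumulative probabilities in [0, 1]
    pi = (w_i - w_i.min()) / (w_i.max() - w_i.min() + 1e-12)  # assumed normalisation of w^(i)
    keep = np.where(pi > u)[0]                                # features entering X^(i)
    if keep.size == 0:                                        # fallback: keep the single best-weighted feature
        keep = np.array([int(np.argmax(pi))])
    knn = KNeighborsClassifier(n_neighbors=3)                 # the KNN classifier (k = 3) named in the text
    return cross_val_score(knn, X[:, keep], y, cv=3).mean()   # Eq (18): alpha = classification accuracy

# At the end of iteration i the total weights would then be updated as in Eq (19):
# w = w + alpha * w_i
```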

Algorithm 1: K-means Neighbourhood Component Feature Selection (KNCFS)
Input: feature matrix X = (x_1, x_2, …, x_n)^T = (f_1, f_2, …, f_d); sample labels y = (y_1, y_2, …, y_n); M: number of feature partitions; s: number of subspaces per feature partition; k: number of feature clusters
Output: feature importance vector w
1  Initialisation: w = (1, 1, …, 1)_d, F = {f_1, …, f_d}.
2  for i = 1 to M do
3      set P(i,1), …, P(i,s) as s empty sets.
4      K_1, …, K_k = K-means(F, k).
5      for c = 1 to k do
6          for j = 1 to s do
7              randomly select features from K_c, add them to P(i,j), and remove the selected features from K_c.
8          if K_c is not empty then
9              for q = 1 to len(K_c) do
10                 randomly select a feature from K_c, add it to one of the subspaces P(i,1), …, P(i,s), and remove it from K_c.
11     for j = 1 to s do
12         use NCFS to compute the feature importance w(i,j) on P(i,j).
13     splice w(i,1), …, w(i,s) into a complete weight vector w(i) based on the indexing of the features.
14     compute R, v, u according to Eqs (15)-(17) and set X(i) as an empty set.
15     if (π(i))_k > u_k then
16         add f_k to X(i).
17     calculate α according to Eq (18).
18     w = w + α·w(i).
19 return w

4. Experiments

In this section, we conduct multiple sets of experiments to evaluate the performance of the proposed algorithm. First, we explore the convergence of the proposed K-means algorithm based on the Pearson coefficient. Then, we compare the method with other approaches on both synthetic and real-world datasets, and the experimental results confirm its effectiveness. Finally, we investigate the sensitivity of the algorithm to its parameters to determine the optimal parameter configuration.

4.1 Datasets

Ten real-world datasets were utilised as the primary experimental benchmarks to assess the performance of the proposed method. These datasets come from diverse domains such as facial images (pixraw10P, warpAR10P), biomedical data (lung_discrete, tumors_C, GLIOMA, TOX_171, leukemia, ALLAML), and other areas (SCADI, arcene). Table 1 provides comprehensive information on these datasets. In addition, synthetic datasets were produced as benchmarks from four small-scale datasets in the UCI repository [30]; further details about these synthetic datasets are given in Section 4.5.2. All datasets were normalized to conform to a standard distribution.
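The paper states only that the datasets were normalized to a standard distribution; a common reading is per-feature z-score standardization, sketched here with scikit-learn (this specific choice is our assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(100, 20))  # placeholder data matrix
X_std = StandardScaler().fit_transform(X)          # each feature rescaled to mean 0, variance 1
print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))
```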

4.2 Compared method

We compared the proposed KNCFS method against six other approaches. Specifically, we used chi-square as a baseline and compared KNCFS with two widely used feature selection (FS) methods [32-34], Fisher score and ReliefF. Additionally, we considered RBEFF, recognized as one of the most advanced FS methods, and NCFS, renowned for its excellent performance on high-dimensional datasets. Given the various improvements we made to NCFS, we also used RB-NCFS as a comparative method to evaluate the effectiveness of these improvements. Below, we provide a brief overview of all the comparative methods:

  • chi-square [35]: A statistical method used to select categorical variables significantly associated with the target variable.
  • fisher-score [36]: Measures the importance of features for classification tasks by comparing the between-class and within-class variance.
  • ReliefF [37]: A feature selection method based on a nearest-neighbor model that uses ReliefF scores to assess feature importance; number of nearest neighbours k = 5.
  • RBEFF [28]: A method based on random subspaces that uses ReliefF to learn local feature weights within subspaces; number of feature partitions M = 10, number of subspaces s = 10, number of nearest neighbours in ReliefF k = 5.
  • NCFS [27]: A feature selection method based on neighborhood component analysis that maximizes the leave-one-out classification accuracy to obtain feature weights; kernel width σ = 1 and regularisation parameter λ = 1.
  • RB-NCFS: A method based on random subspaces that uses NCFS to learn local feature weights within subspaces; number of feature partitions M = 10, number of subspaces s = 10, kernel width for NCFS σ = 1 and regularisation parameter λ = 1.

4.3 Compared metric

In our experiments, to validate the effectiveness of the method, we employed four classifiers to measure classification performance: Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), and K-Nearest Neighbors (KNN). Additionally, we used two standard evaluation metrics, accuracy (ACC) and F1-score, to assess the performance of the different feature selection methods. Both ACC and F1-score range between 0 and 1, with higher values indicating better performance.

  1. ACC:

(20) $\mathrm{ACC} = \dfrac{1}{n} \sum_{i=1}^{n} I\bigl(y_i = c(x_i)\bigr)$

where I(y_i = c(x_i)) = 1 if and only if y_i = c(x_i), y_i is the true label of x_i, and c(x_i) is the label predicted for sample x_i by the classifier.

  2. F1-score:

In binary classification problems, samples fall into four categories based on the actual and predicted labels: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Precision is the proportion of samples predicted as "positive" that are actually "positive", while recall is the proportion of samples actually labeled "positive" that the model correctly predicted as "positive". The two metrics are defined in Eqs (21) and (22):

(21) $\mathrm{Precision} = \dfrac{TP}{TP + FP}$

(22) $\mathrm{Recall} = \dfrac{TP}{TP + FN}$

Often we want to account for both Precision and Recall; the F1-score, another commonly used metric, is their harmonic mean and can be used to evaluate the model comprehensively. It is defined in Eq (23):

(23) $F1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

The F1-score for binary classification can be extended to multiclass problems: each class in turn is treated as the positive class and the others as negative classes, and the F1-score is then calculated according to Eq (23).
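In practice these metrics can be computed directly; the sketch below uses scikit-learn, with macro averaging assumed for the multiclass F1-score (the paper does not state which averaging scheme it uses):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 2, 0, 1]
y_pred = [0, 1, 1, 2, 0, 2]

acc = accuracy_score(y_true, y_pred)              # Eq (20)
f1 = f1_score(y_true, y_pred, average="macro")    # Eqs (21)-(23), one class vs. the rest, then averaged
print(acc, f1)
```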

4.4 Parameter settings

For KNCFS, there are three parameters to consider: the number of feature partitions M, the number of subspaces s, and the number of feature clusters k. Following prior research, the values of σ and λ in NCFS are set to {1, 1}; for M, s, and k, we explore the optimal settings within the range {5, 10, 15, 20, 25}.

Regarding the parameters for classifiers,we chose the RBF kernel for the support vector machine,set the maximum tree depth to 5 for the decision tree,and selected 3 nearest neighbors for K-nearest neighbors (KNN).The parameter configurations are summarized in Table 2.

4.5 Results

4.5.1 Convergence results.

In this paper, K-means is considered to have converged when, at the k-th iteration, J_k(A, z) - J_{k-1}(A, z) < η or k > max_iter, where we set η = 0.02 and max_iter = 300. Fig 2 shows the convergence of the K-means method on the ten datasets. We find that the algorithm converges quickly on most datasets, while on the SCADI, TOX_171 and arcene datasets the objective value oscillates within an interval and the algorithm returns a result only when the maximum number of iterations is reached.

4.5.2 Classification result.

The average classification accuracy of the seven feature selection methods is presented in Tables 3 and 4, while the F1-score results are shown in Tables 5 and 6. The last row of each table reports the number of times each method achieved the best result (win/tie). The average results achieved by SVM with different numbers of selected features are shown in Fig 3, and the average results over the four classifiers are shown in Fig 4.

Fig 3. Classification results with SVM on 10 datasets.

Selected 5–100 features.

https://doi.org/10.1371/journal.pone.0296108.g003

Fig 4. Average classification results with 4 classifiers on 10 datasets.

https://doi.org/10.1371/journal.pone.0296108.g004

Table 3. ACC of 10-fold cross validation on 10 datasets (Mean±std).

Selected 50 features.

https://doi.org/10.1371/journal.pone.0296108.t003

Table 4. ACC of 10-fold cross validation on 10 datasets (Mean±std).

Selected 100 features.

https://doi.org/10.1371/journal.pone.0296108.t004

Table 5. F1-score of 10-fold cross validation on 10 datasets (Mean±std).

Selected 50 features.

https://doi.org/10.1371/journal.pone.0296108.t005

Table 6. F1-score of 10-fold cross validation on 10 datasets (Mean±std).

Selected 100 features.

https://doi.org/10.1371/journal.pone.0296108.t006

Regarding classification accuracy, KNCFS achieved the best results 27 times when selecting 50 features and 26 times (21/5) when selecting 100 features. Fig 3 shows that, in most cases, KNCFS outperforms the other six comparative methods in terms of classification accuracy, and Fig 4 (sub-figure a) also shows that KNCFS obtains the best classification accuracy. It is worth noting that RB-NCFS (2/0 and 8/1) generally performs better than NCFS (0/0 and 1/1), because as a random subspace method it accounts for sample diversity, giving it an advantage over traditional methods. However, because it lacks a solution for feature collinearity, its performance falls short of KNCFS.

On the other hand, KNCFS obtained the best F1-score 27 times in Table 5 and 25 times (22/2) in Table 6, demonstrating a significant advantage in F1-score as well; Fig 4 (sub-figure b) likewise shows that KNCFS obtains the best F1-score. RBEFF achieved the best results twice (selecting 50 features) and three times (selecting 100 features) on the SCADI dataset because of its lower dimensionality; RBEFF is a filter-based feature selection method that performs well on small datasets. However, because it lacks a guiding algorithm in the feature selection stage and cannot consider nonlinear relationships between features, its performance on high-dimensional datasets falls short of KNCFS.

4.5.3 Success rate of feature selection.

In this subsection, we generate synthetic datasets based on four real-world datasets from the UCI repository: Caesarian (80 samples, 5 features, 2 classes), Fertility (100 samples, 10 features, 2 classes), BLOGGER (100 samples, 5 features, 2 classes) and Immunotherapy (90 samples, 7 features, 2 classes). Before starting the experiment, we consider the original features of each dataset to be relevant. We then add noise features consisting of random numbers with a mean of 0 and a variance of 5; the number of noise features varies from 50 to 500 in increments of 50, forming a set of synthetic datasets. We apply all seven methods to each synthetic dataset to learn feature importances and then rank the features by importance. For example, on the Caesarian dataset with five relevant features, we rank the feature weights and count the number of relevant features in the top five to measure the success of feature selection. The experimental results for the synthetic datasets are shown in Fig 5.
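A sketch of this synthetic-data construction, appending Gaussian noise columns (mean 0, variance 5) to a block of relevant features:

```python
import numpy as np

def add_noise_features(X, n_noise, rng=None):
    """Append n_noise irrelevant Gaussian columns (mean 0, variance 5) to the data matrix X."""
    rng = np.random.default_rng(rng)
    noise = rng.normal(loc=0.0, scale=np.sqrt(5.0), size=(X.shape[0], n_noise))
    return np.hstack([X, noise])

# e.g. a Caesarian-like block of 80 samples and 5 relevant features, plus 50..500 noise features
X = np.random.default_rng(0).normal(size=(80, 5))
for n_noise in range(50, 501, 50):
    print(add_noise_features(X, n_noise, rng=1).shape)
```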

Fig 5. Success rate of feature selection on synthetic datasets.

https://doi.org/10.1371/journal.pone.0296108.g005

On the Caesarian dataset (sub-figure a), KNCFS and Chi-square give similar results when the number of noisy features is below 200. On the Fertility dataset (sub-figure b), KNCFS slightly outperforms NCFS, and the gap between KNCFS and RBEFF widens as the number of noisy features increases. Finally, KNCFS consistently exhibits the strongest performance on the BLOGGER and Immunotherapy datasets, establishing it as a compelling contender. RB-NCFS is often less effective than NCFS, possibly because random noise features interfere with the random subspace approach and cause overfitting; this biases the capability of RB-NCFS to resolve significant features, resulting in a lower FS success rate than that of NCFS and KNCFS. KNCFS addresses the collinearity issue in random multi-subspace learning, enabling it to achieve the best feature selection success rate.

4.6 Parameter sensitivity analysis

To investigate the influence of the parameters M, s and k on the performance of the proposed method, we performed sensitivity experiments on the average accuracy of the SVM and KNN classifiers. For simplicity, we selected two representative datasets, "leukemia" and "GLIOMA", for the parameter sensitivity analysis. As shown in Figs 6 and 7, the accuracy of the algorithm decreases when k exceeds 15, with the highest accuracy achieved when k is in {5, 10}; similarly, the algorithm shows higher accuracy when s is in {5, 10}. Overall, variations in M have a relatively small effect on accuracy, so the algorithm is not very sensitive to the value of M. Finally, we set M, s and k to {10, 10, 10}.

Fig 6. Accuracy for the parameters s, M and k on the leukemia dataset with KNN and SVM classification.

https://doi.org/10.1371/journal.pone.0296108.g006

Fig 7. Accuracy for the parameters s, M and k on the GLIOMA dataset with KNN and SVM classification.

https://doi.org/10.1371/journal.pone.0296108.g007

5. Conclusion

Feature selection (FS) is an important data preprocessing technique that reduces the dimensionality of a dataset, decreases model complexity and lowers computational cost. In this paper, we propose a random multi-subspace method based on feature correlation clustering, implemented through an iterative process consisting of two key phases: a random subspace learning phase and a feature vector weighting phase. The random subspace learning phase aims to increase the diversity of samples so as to extract more information, while the feature vector weighting phase evaluates the feature partitions. We conducted numerical experiments on two types of datasets: real-world datasets and synthetic datasets with noisy features. The experimental results, compared with Chi-square, Fisher-score, ReliefF, RBEFF, NCFS and RB-NCFS, show that KNCFS is a state-of-the-art FS algorithm that effectively identifies relevant features. In this study, we used the existing feature selection method NCFS in the subspace learning phase, but more advanced feature selection methods could be used to further improve the algorithm. In addition, adopting feature selection methods capable of handling multi-labeled data, or unsupervised feature selection algorithms, could further extend the applicability of the algorithm. Feature selection remains an important area of research with many other aspects to be explored.

References

  1. Izmailov P., et al., On feature learning in the presence of spurious correlations. Advances in Neural Information Processing Systems, 2022. 35: p. 38516–38532.
  2. Zhang G., et al., Learning fair representations via rebalancing graph structure. Information Processing & Management, 2024. 61(1): p. 103570.
  3. Urbanowicz R.J., et al., Relief-based feature selection: Introduction and review. Journal of Biomedical Informatics, 2018. 85: p. 189–203. pmid:30031057
  4. Gonzalez-Lopez J., Ventura S., and Cano A., Distributed multi-label feature selection using individual mutual information measures. Knowledge-Based Systems, 2020. 188: p. 105052.
  5. Wang A., et al., Accelerating wrapper-based feature selection with K-nearest-neighbor. Knowledge-Based Systems, 2015. 83: p. 81–91.
  6. Sadeg S., et al., QBSO-FS: A reinforcement learning based bee swarm optimization metaheuristic for feature selection. In: International Work-Conference on Artificial Neural Networks. 2019. Springer.
  7. Ghimatgar H., et al., An improved feature selection algorithm based on graph clustering and ant colony optimization. Knowledge-Based Systems, 2018. 159: p. 270–285.
  8. Xue B., Zhang M., and Browne W.N., Particle swarm optimisation for feature selection in classification: Novel initialisation and updating mechanisms. Applied Soft Computing, 2014. 18: p. 261–276.
  9. Chen H., Li W., and Yang X., A whale optimization algorithm with chaos mechanism based on quasi-opposition for global optimization problems. Expert Systems with Applications, 2020. 158: p. 113612.
  10. Hu L., et al., Multi-label feature selection with shared common mode. Pattern Recognition, 2020. 104: p. 107344.
  11. Gu Q., Li Z., and Han J., Generalized Fisher score for feature selection. arXiv preprint arXiv:1202.3725, 2012.
  12. Brown G., et al., Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 2012. 13: p. 27–66.
  13. Zhou H., Wang X., and Zhu R., Feature selection based on mutual information with correlation coefficient. Applied Intelligence, 2022: p. 1–18.
  14. Doquire G. and Verleysen M., Feature selection with missing data using mutual information estimators. Neurocomputing, 2012. 90: p. 3–11.
  15. Zhang P., Gao W., and Liu G., Feature selection considering weighted relevancy. Applied Intelligence, 2018. 48: p. 4615–4625.
  16. Li J., et al., Feature selection: A data perspective. ACM Computing Surveys (CSUR), 2017. 50(6): p. 1–45.
  17. Wang Z., et al., Multi-class feature selection by exploring reliable class correlation. Knowledge-Based Systems, 2021. 230: p. 107377.
  18. Ji X., et al., Multi-label classification with weak labels by learning label correlation and label regularization. Applied Intelligence, 2023.
  19. Hu J., et al., Robust multi-label feature selection with dual-graph regularization. Knowledge-Based Systems, 2020. 203: p. 106126.
  20. Tang C., et al., Unsupervised feature selection via multiple graph fusion and feature weight learning. Science China Information Sciences, 2023. 66(5): p. 1–17.
  21. Lee C., Imrie F., and van der Schaar M., Self-supervision enhanced feature selection with correlated gates. In: International Conference on Learning Representations. 2022.
  22. Hu R., et al., Low-rank feature selection for multi-view regression. Multimedia Tools and Applications, 2017. 76: p. 17479–17495.
  23. Liu T., Hu R., and Zhu Y., Completed sample correlations and feature dependency-based unsupervised feature selection. Multimedia Tools and Applications, 2023. 82(10): p. 15305–15326.
  24. Zhang S., et al., Supervised feature selection algorithm via discriminative ridge regression. World Wide Web, 2018. 21: p. 1545–1562.
  25. Hu R., Zhang L., and Wei J., Adaptive Laplacian support vector machine for semi-supervised learning. The Computer Journal, 2021. 64(7): p. 1005–1015.
  26. Fan Y., et al., Multi-label feature selection based on label correlations and feature redundancy. Knowledge-Based Systems, 2022. 241: p. 108256.
  27. Yang W., Wang K., and Zuo W., Neighborhood component feature selection for high-dimensional data. J. Comput., 2012. 7(1): p. 161–168.
  28. Zhang B., Li Y., and Chai Z., A novel random multi-subspace based ReliefF for feature selection. Knowledge-Based Systems, 2022. 252: p. 109400.
  29. Sinaga K.P. and Yang M.-S., Unsupervised K-means clustering algorithm. IEEE Access, 2020. 8: p. 80716–80727.
  30. Asuncion A. and Newman D., UCI machine learning repository. 2007, Irvine, CA, USA.
  31. Samaria F.S. and Harter A.C., Parameterisation of a stochastic model for human face identification. In: Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision. 1994. IEEE.
  32. Bugata P. and Drotár P., Weighted nearest neighbors feature selection. Knowledge-Based Systems, 2019. 163: p. 749–761.
  33. Zare M., Azizizadeh N., and Kazemipour A., Supervised feature selection on gene expression microarray datasets using manifold learning. Chemometrics and Intelligent Laboratory Systems, 2023. 237: p. 104828.
  34. Maldonado S., Weber R., and Basak J., Simultaneous feature selection and classification using kernel-penalized support vector machines. Information Sciences, 2011. 181(1): p. 115–128.
  35. Liu H. and Setiono R., Chi2: Feature selection and discretization of numeric attributes. In: Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence. 1995. IEEE.
  36. Duda R., Hart P., and Stork D.G., Pattern Classification. Hoboken, NJ: Wiley-Interscience, 2001.
  37. Kononenko I., Estimating attributes: Analysis and extensions of RELIEF. In: European Conference on Machine Learning. 1994. Springer.