
KNCFS: Feature selection for high-dimensional datasets based on improved random multi-subspace learning

Abstract

Feature selection has long been a focal point of research in many fields. Recent studies have applied random multi-subspace methods to extract more information from raw samples. However, this approach inadequately addresses the adverse effects of feature collinearity in high-dimensional datasets. To improve the limited ability of traditional algorithms to extract useful information from raw samples, while accounting for feature collinearity during random subspace learning, we group features using a clustering approach based on a correlation measure and then construct subspaces with lower inter-feature correlations. When integrating the feature weights obtained from all feature spaces, we introduce a weighting factor to better handle the contributions of different feature spaces. We comprehensively evaluate the proposed algorithm on ten real datasets and four synthetic datasets, comparing it with six other feature selection algorithms. Experimental results demonstrate that our algorithm, denoted KNCFS, effectively identifies relevant features and exhibits robust feature selection performance, making it particularly well suited to practical feature selection problems.

1. Introduction

In disease prediction tasks, the collected DNA microarray datasets are often high-dimensional. Searching among these genes for those that determine the occurrence of disease is challenging, as it constitutes an NP-hard problem with a complexity of O(2^d). Furthermore, high-dimensional datasets contain a significant amount of redundant and noisy features. Blindly learning these features causes the model to learn spurious correlations and degrades its performance [1, 2]. An effective way to address this challenge is to reduce data dimensionality through feature selection [3]. The objective of feature selection is to retain relevant features while discarding irrelevant ones; it not only reduces feature dimensionality but also enhances model performance.

Feature selection (FS) methods can be categorized into three primary modes: wrapper, filter, and embedded [4]. Wrapper methods typically employ heuristic search to select the features most favorable with respect to an evaluation metric [5-8]. They often use swarm-intelligence optimization to generate binary solution vectors, where 1 denotes that a feature is selected and 0 means that it is excluded from the feature subset; examples include bee colony optimization [6], particle swarm optimization [8], and whale optimization [9]. However, when dealing with high-dimensional data, these methods often struggle to complete the search within a reasonable time frame [10]. To address this issue, some filter methods search for the optimal subset by exploring the intrinsic relationships between samples and features [11-14]. For example, [13] combines the correlation coefficient with mutual information to measure the relationships between features for feature selection, and [15] uses mutual information and joint mutual information to balance the significance of two feature-correlation terms for weighted correlation-based feature selection. Because no specific classifier guides the feature selection stage, the features selected by such methods may not be optimal [16]. Embedded methods, on the other hand, view learning the optimal subset as an optimization problem. They introduce penalties or constraints into FS through the construction of an objective function and regularization terms on the feature weights [17-22], encouraging the model to select the most relevant features. For example, [23] embeds a relevance self-representation matrix into unsupervised learning to account for complete sample relevance and feature dependencies; [24] identifies relevant features by embedding indicator labels into ridge regression models; and [25] proposes an adaptive LapSVM feature selection method that embeds the acquisition of Laplacian matrices into SVM training to achieve semi-supervised learning. Compared with filter methods, embedded methods interact with a classifier and can often select the features with the highest information content [26].

Neighborhood Component Feature Selection (NCFS) [27] is an embedded feature selection method that has attracted significant attention, primarily because of its excellent performance on high-dimensional datasets. However, NCFS has a notable limitation: it learns only within the original feature space, so it extracts relatively limited information from the raw samples and fails to fully exploit the latent information in the data. In a separate study, a Random Multi-subspace Approach [28] was proposed. This approach treats ReliefF as a black box and, through multiple random partitions of the feature space, learns local weights in each subspace to enhance the sample diversity available to ReliefF. It is worth noting, however, that the experiments in [28] were limited to low-dimensional datasets, with at most 649 features. Our further investigation suggests that directly applying the random subspace approach to high-dimensional datasets offers only limited performance gains for NCFS. In high-dimensional datasets, some features can be approximately represented as linear combinations of other features, resulting in a certain degree of feature collinearity, which can reduce the model's generalization performance. Furthermore, during random partitioning of the feature space, collinear features may accidentally be assigned to the same subspace, which can lead to overfitting and consequently decrease the accuracy of feature selection.

To address the limited information NCFS captures from the original samples, we introduce an enhanced approach that simultaneously increases the diversity of the original samples and mitigates the problem of feature collinearity. Formally, we propose a method that uses a clustering algorithm to construct random subspaces, aiming to alleviate the impact of collinearity. Furthermore, after feature weight learning is completed within each feature partition, we employ a feature partition weight factor to assess the contribution of each partition to the final weight vector, rather than simple averaging. Extensive experiments on ten high-dimensional datasets and on synthetic datasets validate the effectiveness of our algorithm. The primary contributions of this paper are summarized as follows.

  • The proposed method simultaneously addresses the diversity of the random subspaces during the feature selection process and the problem of feature collinearity.
  • A feature partition weight factor is introduced to weight the importance of the features learned within each feature partition.
  • Multiple sets of experiments on synthetic and real datasets confirm the effectiveness of the proposed method. Notably, the experimental results demonstrate that accounting for feature collinearity in the random subspace approach enhances the effectiveness of feature selection.

The remainder of the paper is organised as follows. Section 2 presents the preliminaries: the NCFS algorithm is briefly introduced, the random multi-subspace method is detailed in Section 2.3, and K-means clustering is briefly reviewed in Section 2.4. Section 3 presents our method, Section 4 reports the experimental results of the new method and the comparative methods, and, finally, Section 5 draws conclusions.

2. Preliminaries

In this section, we introduce the notation and definitions of this paper in Section 2.1. Section 2.2 briefly describes the original NCFS method, Section 2.3 presents the random multi-subspace method, and Section 2.4 reviews K-means clustering.

2.1 Notation and definition

Let X = [x_1, x_2, …, x_n]^T ∈ R^{n×d} be a feature matrix, i.e., a set of n training samples with dimensionality d, and let y = [y_1, y_2, …, y_n]^T be the corresponding sample labels. In addition, X can be formalised as a feature set F = {f_1, f_2, …, f_d}, where f_i denotes the column vector formed by the i-th feature over all samples. Then, according to the definition in [28], the set family E is a feature partition of F when the following conditions hold:


  • the members of E are non-empty, pairwise disjoint, and their union is F, i.e., A ∩ B = ∅ for all A, B ∈ E with A ≠ B, and ⋃_{A∈E} A = F;
  • when A ∈ E, A is called a random subspace. For example, let D = {f1, f2, f3, f4, f5} be a set of 5 feature columns, and let its subsets be A = {f1, f3}, B = {f2, f4}, and C = {f5}; then, by definition, {A, B, C} is a feature partition of D, and each of A, B, C is a subspace (a small check of this example is sketched below).
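To make the two conditions concrete, here is a minimal Python check of the example above (feature names represented as strings; the helper name is ours, not the paper's):

```python
def is_feature_partition(subsets, full_set):
    """Return True if the subsets are non-empty, pairwise disjoint, and their union equals full_set."""
    union = set()
    for s in subsets:
        if not s or (union & s):      # empty member or overlap -> not a partition
            return False
        union |= s
    return union == full_set

D = {"f1", "f2", "f3", "f4", "f5"}
A, B, C = {"f1", "f3"}, {"f2", "f4"}, {"f5"}
print(is_feature_partition([A, B, C], D))   # True: {A, B, C} is a feature partition of D
```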

2.2 Neighborhood component feature selection

NCFS is an embedded feature selection method that utilizes a nearest-neighbour model. It measures the similarity between samples using feature-weighted distances. For each sample x_i, the algorithm measures the probability of correct classification with a probability distribution function. After summing the probabilities of all samples being classified correctly, NCFS introduces a penalty term to prevent overfitting.

The algorithm initializes the feature importance weights w as a vector with all elements set to 1. Then, based on w, it defines the weighted distance between two samples x_i and x_j as

(1) $D_w(x_i, x_j) = \sum_{l=1}^{d} w_l^2 \, | x_{il} - x_{jl} |$

where w_l denotes the weight of the l-th feature. In order to learn w based on the approximate leave-one-out classification accuracy, NCFS further defines the probability that sample x_i selects x_j as its reference point:

(2) $p_{ij} = \dfrac{\kappa\bigl(D_w(x_i, x_j)\bigr)}{\sum_{k \neq i} \kappa\bigl(D_w(x_i, x_k)\bigr)}, \qquad p_{ii} = 0$

where κ(z) = exp(-z/σ) is the kernel function and σ is the kernel width. According to the above definition, the probability that the query point x_i is correctly classified is

(3) $p_i = \sum_{j} y_{ij} \, p_{ij}$

where y_{ij} = 1 if and only if y_i = y_j, and y_{ij} = 0 otherwise. Finally, NCFS defines the objective function as

(4) $F(w) = \sum_{i=1}^{n} p_i - \lambda \sum_{l=1}^{d} w_l^2$

where λ is the regularisation parameter to be tuned. To maximise the objective function, one sets the derivative of F(w) with respect to w to zero to derive a locally optimal value of the feature weights w, then uses gradient ascent to update w until F(w) converges near its maximum, and outputs the weight vector w at that point.
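For concreteness, the following is a simplified, full-batch Python sketch of the NCFS procedure described by Eqs (1)-(4). The fixed learning rate and iteration count are our own illustrative assumptions, not the authors' exact update schedule.

```python
import numpy as np

def ncfs(X, y, sigma=1.0, lam=1.0, lr=0.1, n_iter=100):
    """Simplified NCFS sketch: learn feature weights by gradient ascent on Eq (4)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    w = np.ones(d)                                        # all features start equally important
    same = (y[:, None] == y[None, :]).astype(float)       # y_ij in Eq (3)
    diff = np.abs(X[:, None, :] - X[None, :, :])          # |x_il - x_jl|, shape (n, n, d)
    for _ in range(n_iter):
        dist = (diff * w ** 2).sum(axis=2)                # weighted distance, Eq (1)
        K = np.exp(-dist / sigma)                         # kernel values
        np.fill_diagonal(K, 0.0)                          # p_ii = 0
        p = K / K.sum(axis=1, keepdims=True)              # reference-point probabilities, Eq (2)
        p_i = (p * same).sum(axis=1)                      # probability of correct classification, Eq (3)
        # gradient of F(w) = sum_i p_i - lam * sum_l w_l^2 with respect to each w_l
        term = (p_i[:, None, None] - same[:, :, None]) * p[:, :, None] * diff
        grad = 2.0 * (term.sum(axis=(0, 1)) / sigma - lam) * w
        w = w + lr * grad                                 # gradient ascent step
    return w
```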

2.3 Random multi-subspace based learning

The Random Multi-subspace Approach, introduced in [28], partitions the original feature space into s different subspaces, each constructed from a distinct subset of features. Feature weights are then learned separately within each feature subspace. By repeatedly performing such partitions, the Random Multi-subspace Approach can extract additional information from the data, thereby enhancing the model's robustness and generalization capacity.

In the context of the Random Multi-subspace Approach, each random partition of the original feature space is referred to as a feature partition. Assuming the approach performs M random partitions of the original feature space, the i-th feature partition can be represented as

(5) $E^{(i)} = \{ P^{(i,1)}, P^{(i,2)}, \ldots, P^{(i,s)} \}$

where s is the number of random subspaces and P^{(i,j)} denotes the j-th subspace within the i-th feature partition, j ∈ {1, 2, …, s}. Here, we assume an equal number of feature subspaces within each feature partition.

For an original feature space comprising d features, a random partition can be executed by first generating a random permutation of the d features. The first s features are then assigned, one each, to the distinct subspaces, and the remaining features are allocated sequentially to the different subspaces until all features have been partitioned. Evidently, within each feature partition, every feature belongs to exactly one feature subspace.
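A minimal sketch of this partitioning step, using feature indices and round-robin assignment after a random permutation (so subspace sizes differ by at most one):

```python
import numpy as np

def random_feature_partition(d, s, rng=None):
    """Split feature indices 0..d-1 into s disjoint random subspaces of near-equal size."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(d)                  # random permutation of the d features
    return [perm[j::s] for j in range(s)]      # deal features to the s subspaces in turn

# one feature partition of 10 features into 3 subspaces
print(random_feature_partition(10, 3, rng=0))
```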

For each subspace P^{(i,j)}, local feature weights w^{(i,j)} can be computed using a feature selection method such as ReliefF. The overall feature weight of the i-th feature partition is then obtained by splicing together the local weights of its s subspaces:

(6) $w^{(i)} = \bigl[ w^{(i,1)}, w^{(i,2)}, \ldots, w^{(i,s)} \bigr]$

Assuming that each feature partition contributes equally to the final feature weight, the final feature weights are obtained by averaging the feature weights of the M feature partitions:

(7) $w = \dfrac{1}{M} \sum_{i=1}^{M} w^{(i)}$
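Eq (7) is simply an element-wise average of the M spliced weight vectors; for example, in numpy:

```python
import numpy as np

# Toy example: M = 3 feature partitions over d = 4 features;
# row i holds w^(i), the weights spliced from the s subspaces of partition i.
W = np.array([[0.9, 0.1, 0.4, 0.2],
              [0.8, 0.2, 0.5, 0.1],
              [0.7, 0.3, 0.6, 0.3]])
w_final = W.mean(axis=0)    # Eq (7): average over the M partitions
print(w_final)
```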

2.4 K-means cluster

We employ the k-means algorithm for feature clustering; it is one of the most well-known and widely used clustering methods [29] and partitions a set of samples into k clusters (the value of k must be predetermined).

Let A = {a_1, …, a_k} represent the k cluster centers. Consider z = [z_{ic}]_{d×k}, where z_{ic} is a binary variable taking values 0 or 1 that indicates whether feature f_i belongs to the c-th cluster, c ∈ {1, …, k}. The objective function of k-means is

(8) $J(A, z) = \sum_{i=1}^{d} \sum_{c=1}^{k} z_{ic} \, D^2(f_i, a_c)$

where D^2(f_i, a_c) denotes the squared Euclidean distance between feature f_i and the c-th cluster centre a_c; the Euclidean distance is a commonly used similarity measure. The k-means algorithm iteratively minimizes the objective function J(A, z), updating the cluster centers A and the membership matrix z as

(9) $a_c = \dfrac{\sum_{i=1}^{d} z_{ic} \, f_i}{\sum_{i=1}^{d} z_{ic}}$

(10) $z_{ic} = \begin{cases} 1, & \text{if } c = \arg\min_{c'} D^2(f_i, a_{c'}) \\ 0, & \text{otherwise} \end{cases}$

The algorithmic steps of k-means are as follows. Initially, k features are randomly chosen as the centers of the k clusters. Then, (1) the membership degree of each feature to each cluster center is computed according to Eq (10): a feature f_i is assigned to cluster c if a_c is its nearest cluster center. Once all features are assigned to their respective clusters, (2) the position of each cluster center is updated according to Eq (9). Steps (1) and (2) alternate until the stopping criterion is met.
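A plain sketch of this loop, mirroring Eqs (8)-(10); when clustering features as in this paper, each row of the input matrix is one (transposed) feature column:

```python
import numpy as np

def kmeans(F, k, max_iter=300, tol=1e-4, rng=None):
    """Basic k-means over the rows of F: assign to the nearest center (Eq 10), recompute centers (Eq 9)."""
    F = np.asarray(F, dtype=float)
    rng = np.random.default_rng(rng)
    centers = F[rng.choice(len(F), size=k, replace=False)]            # random initial centers
    for _ in range(max_iter):
        d2 = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) # squared Euclidean distances
        assign = d2.argmin(axis=1)                                    # Eq (10): nearest center wins
        new_centers = np.array([F[assign == c].mean(axis=0) if np.any(assign == c)
                                else centers[c] for c in range(k)])   # Eq (9): cluster means
        if np.linalg.norm(new_centers - centers) < tol:               # stop when centers settle
            break
        centers = new_centers
    return assign, centers
```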

3. The proposed method

Previous random multi-subspace weight learning methods did not take into account that, in high-dimensional datasets, some highly collinear features might accidentally be allocated to the same subspace. Such collinearity is prevalent and can lead to local overfitting, thereby reducing the accuracy of feature selection. Hence, to address feature collinearity within random subspaces while preserving diversity within the original sample space, we propose a novel approach.

Our algorithm performs M iterations. Initially, each feature is treated as equally important, and the initial weight vector w is set to a vector of all ones. In the i-th iteration, we first partition the original feature set into k clusters using K-means clustering based on a correlation measure. Subsequently, we randomly select features from each feature cluster to construct s equally sized random subspaces. Within each subspace, we employ NCFS to learn its local feature weights. These local feature weights are then integrated into a complete d-dimensional feature vector, denoted w(i). When integrating the feature weights w(i) learned in each iteration into the overall feature weight vector w, we apply an importance factor to weight them, rather than taking a simple average as in previous approaches. The general framework of the algorithm is illustrated in Fig 1. Section 3.1 presents the method for constructing random subspaces using K-means clustering, while Section 3.2 introduces the proposed weighting factor.

3.1 Use K-means to generate subspaces

To address feature collinearity within random subspaces, we cluster features according to their inter-correlations; the purpose of this step is to group features with a certain level of correlation into the same cluster. To achieve this, we use the correlation coefficient as the measure in the K-means objective function instead of the Euclidean distance, so that Eqs (8) and (10) are redefined accordingly: (11) where pea(f_i, a_c) represents the Pearson coefficient between feature f_i and the c-th cluster center a_c. The Pearson coefficient is a commonly used measure of the degree of correlation between variables, and its value ranges between -1 and 1. A Pearson coefficient closer to 0 indicates a weaker correlation between variables, while a coefficient closer to 1 (or -1) indicates a stronger correlation.
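The exact analytic form of Eq (11) is not reproduced here; one natural correlation-based dissimilarity, assumed purely for illustration, is 1 - |pea(f_i, a_c)|, which treats strongly correlated (or anti-correlated) features as close so that they land in the same cluster:

```python
import numpy as np

def pearson_dissimilarity(f, a):
    """Assumed dissimilarity for correlation-based clustering: 1 - |Pearson correlation|."""
    r = np.corrcoef(f, a)[0, 1]
    return 1.0 - abs(r)

f1 = np.array([1.0, 2.0, 3.0, 4.0])
f2 = np.array([2.1, 3.9, 6.0, 8.2])     # nearly collinear with f1
print(pearson_dissimilarity(f1, f2))    # close to 0, so f1 and f2 would share a cluster
```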

Once the feature set is divided into k clusters (K_1, …, K_k), any feature cluster K_c comprises the features

(12) $K_c = \{ f^{(c,1)}, f^{(c,2)}, \ldots, f^{(c,n_c)} \}$

where f^{(c,j)} represents the j-th feature in the c-th feature cluster and n_c is the number of features in the c-th cluster. To construct feature subspaces with low collinearity among their constituents, we generate a random permutation of length n_c for each feature cluster K_c and then sequentially assign contiguous features to each subspace; the remaining features are assigned to the different subspaces in order until all features have been partitioned. The feature cluster K_c can therefore be represented as

(13) $K_c = su^{(c,1)} \cup su^{(c,2)} \cup \ldots \cup su^{(c,s)}$

where su^{(c,j)} denotes the features assigned by K_c to the j-th subspace, and the su^{(c,j)} are pairwise disjoint. Thus, each subspace P^{(i,j)} can be represented as

(14) $P^{(i,j)} = su^{(1,j)} \cup su^{(2,j)} \cup \ldots \cup su^{(k,j)}$

That is, each feature cluster is evenly divided into s segments, which are then separately incorporated into the s subspaces. The subspaces P^{(i,j)} together constitute the i-th feature partition generated by the algorithm. Subsequently, we employ NCFS to learn local feature weights within each subspace, where w^{(i,j)} denotes the feature weights learned in the j-th subspace of the i-th feature partition (i.e., the i-th iteration of the algorithm). After computing the feature weights w^{(i,j)} for all subspaces, they are consolidated into a complete d-dimensional feature weight vector, denoted w^{(i)}.
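The following sketch shows how each feature cluster can be shuffled and spread evenly across the s subspaces, as in Eqs (13) and (14); the cluster contents are toy feature indices:

```python
import numpy as np

def subspaces_from_clusters(clusters, s, rng=None):
    """Build s subspaces with low internal correlation by slicing every cluster across all of them."""
    rng = np.random.default_rng(rng)
    subspaces = [[] for _ in range(s)]
    for feats in clusters:                      # feats: feature indices of one cluster K_c
        perm = rng.permutation(feats)           # random permutation of length n_c
        for j in range(s):
            subspaces[j].extend(perm[j::s])     # the j-th slice su^(c,j) goes to subspace j
    return [np.sort(np.array(p, dtype=int)) for p in subspaces]

clusters = [[0, 3, 7], [1, 2, 4, 8], [5, 6, 9]]  # toy output of the correlation-based clustering
print(subspaces_from_clusters(clusters, s=2, rng=0))
```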

3.2 Weighting w(i)

In prior random-subspace methods, every feature partition's contribution was assigned equal weight, so the final feature weights are an average of the weights across all partitions. We argue that the contributions of the feature partitions need not be equal. Therefore, we introduce an appraisal factor α for the feature weight vector w^{(i)} to provide a weighted assessment of every w^{(i)}. Before calculating α, we must obtain the correlation matrix R associated with the feature matrix X, whose (i, j)-th element is computed as

(15) $R_{ij} = \mathrm{pea}(f_i, f_j)$

where f_i and f_j represent the i-th and j-th feature columns of X and 'pea' denotes the Pearson coefficient. By this definition, R is a d×d symmetric matrix. Subsequently, we perform a Cholesky decomposition of R and multiply the resulting factor by a random sample drawn from a Gaussian distribution, obtaining v:

(16) $v = L g, \qquad \text{where } R = L L^{T} \text{ and } g \sim \mathcal{N}(0, I_d)$

The vector v is thus a random variable that preserves the correlation structure of the features. We then pass v through the cumulative distribution function of the standard normal distribution, which yields a vector u of cumulative probabilities distributed over [0, 1]:

(17) $u_k = \Phi(v_k), \qquad k = 1, \ldots, d$

Define π^{(i)} as the normalised w^{(i)} and let X^{(i)} be an initially empty feature set. Formally, if (π^{(i)})_k is greater than u_k, the feature f_k is added to X^{(i)}. Finally, the weighting factor α is defined as

(18) $\alpha = \mathrm{evaluate}\bigl(\mathrm{Classifier}(X^{(i)}, y)\bigr)$

Here, the 'evaluate' function computes the classification result on the feature matrix X^{(i)}, with classification accuracy (ACC) chosen as the evaluation criterion and Classifier denoting the selected classifier; we use a KNN classifier (k = 3). Therefore, at the end of the i-th iteration of the algorithm, w^{(i)} is added to the total weight vector w as

(19) $w = w + \alpha \, w^{(i)}$

where w^{(i)} denotes the feature weight vector obtained in the i-th iteration. The detailed steps of the algorithm are described in Algorithm 1; a simplified sketch of this weighting step follows.
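Below is a hedged Python sketch of the weighting step (Eqs (15)-(19)). The min-max normalisation of w^{(i)} and the use of 3-fold cross-validation accuracy inside 'evaluate' are our own assumptions; the paper only states that a 3-NN classifier scores the selected feature matrix X^{(i)}.

```python
import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def partition_weight_factor(X, y, w_i, rng=None):
    """Sketch of the appraisal factor alpha for one feature partition."""
    rng = np.random.default_rng(rng)
    R = np.corrcoef(X, rowvar=False)                          # Eq (15): pairwise Pearson matrix of the features
    L = np.linalg.cholesky(R + 1e-6 * np.eye(R.shape[0]))     # small jitter keeps R positive definite
    v = L @ rng.standard_normal(R.shape[0])                   # Eq (16): Gaussian sample with R's correlation structure
    u = norm.cdf(v)                                           # Eq (17): cumulative probabilities in [0, 1]
    pi = (w_i - w_i.min()) / (w_i.max() - w_i.min() + 1e-12)  # assumed normalisation of w^(i)
    keep = np.where(pi > u)[0]                                # features entering X^(i)
    if keep.size == 0:                                        # fallback: keep the single best-weighted feature
        keep = np.array([int(np.argmax(pi))])
    knn = KNeighborsClassifier(n_neighbors=3)                 # the KNN classifier (k = 3) named in the text
    return cross_val_score(knn, X[:, keep], y, cv=3).mean()   # Eq (18): alpha = classification accuracy

# At the end of iteration i the total weights would then be updated as in Eq (19):
# w = w + alpha * w_i
```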

Algorithm 1: K-means Neighbourhood Component Feature Selection (KNCFS)
Input: feature matrix X = (x_1, x_2, …, x_n)^T = (f_1, f_2, …, f_d); sample labels y = (y_1, y_2, …, y_n); M: number of feature partitions; s: number of subspaces per feature partition; k: number of feature clusters
Output: feature importance vector w
1  Initialisation: w = (1, 1, …, 1)_d, F = {f_1, …, f_d}.
2  for i = 1 to M do
3      set P(i,1), …, P(i,s) as s empty sets.
4      K_1, …, K_k = K-means(F, k).
5      for c = 1 to k do
6          for j = 1 to s do
7              randomly select features from K_c, add them to P(i,j), and remove the selected features from K_c.
8          if K_c is not empty then
9              for q = 1 to len(K_c) do
10                 randomly select a feature from K_c, add it to one of the subspaces P(i,1), …, P(i,s), and remove it from K_c.
11     for j = 1 to s do
12         use NCFS to compute the feature importance w(i,j) on P(i,j).
13     splice w(i,1), …, w(i,s) into a complete weight vector w(i) based on the indexing of the features.
14     compute R, v, u according to Eqs (15)-(17) and set X(i) as an empty set.
15     if (π(i))_k > u_k then
16         add f_k to X(i).
17     calculate α according to Eq (18).
18     w = w + α·w(i).
19 return w

4. Experiments

In this section, we conduct multiple sets of experiments to evaluate the performance of the proposed algorithm. First, we explore the convergence of the proposed K-means algorithm based on the Pearson coefficient. Then, we compare the method with other approaches on both synthetic and real-world datasets, and the experimental results confirm its effectiveness. Finally, we investigate the sensitivity of the algorithm to its parameters to determine the optimal parameter configuration.

4.1 Datasets

Ten real-world datasets were utilised as the primary experimental benchmarks to assess the performance of the proposed method. These datasets come from diverse domains such as facial images (pixraw10P, warpAR10P), biomedical data (lung_discrete, tumors_C, GLIOMA, TOX_171, leukemia, ALLAML), and other areas (SCADI, arcene). Table 1 provides comprehensive information on these datasets. In addition, synthetic datasets were produced as benchmarks from four small-scale datasets in the UCI repository [30]; further details about these synthetic datasets are given in Section 4.5.2. All datasets were normalized to conform to a standard distribution.
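The paper states only that the datasets were normalized to a standard distribution; a common reading is per-feature z-score standardization, sketched here with scikit-learn (this specific choice is our assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(100, 20))  # placeholder data matrix
X_std = StandardScaler().fit_transform(X)          # each feature rescaled to mean 0, variance 1
print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))
```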

4.2 Compared method

We compared the proposed KNCFS method against six other approaches. Specifically, we used chi-square as a baseline and compared KNCFS with two widely used feature selection (FS) methods [32-34], Fisher score and ReliefF. Additionally, we considered RBEFF, recognized as one of the most advanced FS methods, and NCFS, renowned for its excellent performance on high-dimensional datasets. Given the various improvements we made to NCFS, we also used RB-NCFS as a comparative method to evaluate the effectiveness of these improvements. Below, we provide a brief overview of all the comparative methods:

  • chi-square [35]: A statistical method used to select categorical variables significantly associated with the target variable.
  • fisher-score [36]: Measures the importance of features for classification tasks by comparing the between-class and within-class variance.
  • ReliefF [37]: A feature selection method based on a nearest-neighbor model that uses ReliefF scores to assess feature importance; number of nearest neighbours k = 5.
  • RBEFF [28]: A method based on random subspaces that uses ReliefF to learn local feature weights within subspaces; number of feature partitions M = 10, number of subspaces s = 10, number of nearest neighbours in ReliefF k = 5.
  • NCFS [27]: A feature selection method based on neighborhood component analysis that maximizes the leave-one-out classification accuracy to obtain feature weights; kernel width σ = 1 and regularisation parameter λ = 1.
  • RB-NCFS: A method based on random subspaces that uses NCFS to learn local feature weights within subspaces; number of feature partitions M = 10, number of subspaces s = 10, kernel width for NCFS σ = 1 and regularisation parameter λ = 1.

4.3 Compared metric

In our experiments, to validate the effectiveness of the method, we employed four classifiers to measure classification performance: Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), and K-Nearest Neighbors (KNN). Additionally, we used two standard evaluation metrics, accuracy (ACC) and F1-score, to assess the performance of the different feature selection methods. Both ACC and F1-score range between 0 and 1, with higher values indicating better performance.

  1. ACC:

(20) $\mathrm{ACC} = \dfrac{1}{n} \sum_{i=1}^{n} I\bigl(y_i = c(x_i)\bigr)$

where I(y_i = c(x_i)) = 1 if and only if y_i = c(x_i), y_i is the true label of x_i, and c(x_i) is the label predicted for sample x_i by the classifier.

  2. F1-score:

In binary classification problems, samples fall into four categories based on the actual and predicted labels: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Precision is the proportion of samples predicted as "positive" that are actually "positive", while recall is the proportion of samples actually labeled "positive" that the model correctly predicted as "positive". The two metrics are defined in Eqs (21) and (22):

(21) $\mathrm{Precision} = \dfrac{TP}{TP + FP}$

(22) $\mathrm{Recall} = \dfrac{TP}{TP + FN}$

Often we want to account for both Precision and Recall; the F1-score, another commonly used metric, is their harmonic mean and can be used to evaluate the model comprehensively. It is defined in Eq (23):

(23) $F1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

The F1-score for binary classification can be extended to multiclass problems: each class in turn is treated as the positive class and the others as negative classes, and the F1-score is then calculated according to Eq (23).
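In practice these metrics can be computed directly; the sketch below uses scikit-learn, with macro averaging assumed for the multiclass F1-score (the paper does not state which averaging scheme it uses):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 2, 1, 2, 0, 1]
y_pred = [0, 1, 1, 2, 0, 2]

acc = accuracy_score(y_true, y_pred)              # Eq (20)
f1 = f1_score(y_true, y_pred, average="macro")    # Eqs (21)-(23), one class vs. the rest, then averaged
print(acc, f1)
```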

4.4 Parameter settings

For KNCFS, there are three parameters to consider: the number of feature partitions M, the number of subspaces s, and the number of feature clusters k. Following prior research, the values of σ and λ in NCFS are set to {1, 1}; for M, s, and k, we explore the optimal settings within the range {5, 10, 15, 20, 25}.

Regarding the parameters for classifiers,we chose the RBF kernel for the support vector machine,set the maximum tree depth to 5 for the decision tree,and selected 3 nearest neighbors for K-nearest neighbors (KNN).The parameter configurations are summarized in Table 2.

4.5 Results

4.5.1 Convergence results.

In this paper, K-means is considered to have converged when, at the k-th iteration, J_k(A, z) - J_{k-1}(A, z) < η or k > max_iter, where we set η = 0.02 and max_iter = 300. Fig 2 shows the convergence of the K-means method on the ten datasets. We find that the algorithm converges quickly on most datasets, while on the SCADI, TOX_171 and arcene datasets the objective value oscillates within an interval and the algorithm returns a result only when the maximum number of iterations is reached.

4.5.2 Classification result.

The average classification accuracy of the seven feature selection methods is presented in Tables 3 and 4, while the F1-score results are shown in Tables 5 and 6. The last row of each table reports the number of times each method achieved the best result (win/tie). The average results achieved by SVM with different numbers of selected features are shown in Fig 3, and the average results over the four classifiers are shown in Fig 4.

Fig 3. Classification results with SVM on 10 datasets.

Selected 5–100 features.

https://doi.org/10.1371/journal.pone.0296108.g003

Fig 4. Average classification results with 4 classifiers on 10 datasets.

https://doi.org/10.1371/journal.pone.0296108.g004

Table 3. ACC of 10-fold cross validation on 10 datasets (Mean±std).

Selected 50 features.

https://doi.org/10.1371/journal.pone.0296108.t003

Table 4. ACC of 10-fold cross validation on 10 datasets (Mean±std).

Selected 100 features.

https://doi.org/10.1371/journal.pone.0296108.t004

Table 5. F1-score of 10-fold cross validation on 10 datasets (Mean±std).

Selected 50 features.

https://doi.org/10.1371/journal.pone.0296108.t005

Table 6. F1-score of 10-fold cross validation on 10 datasets (Mean±std).

Selected 100 features.

https://doi.org/10.1371/journal.pone.0296108.t006

Regarding classification accuracy, KNCFS achieved the best results 27 times when selecting 50 features and 26 times (21/5) when selecting 100 features. Fig 3 shows that, in most cases, KNCFS outperforms the other six comparative methods in terms of classification accuracy, and Fig 4 (sub-figure a) also shows that KNCFS obtains the best classification accuracy. It is worth noting that RB-NCFS (2/0 and 8/1) generally performs better than NCFS (0/0 and 1/1), because as a random subspace method it accounts for sample diversity, giving it an advantage over traditional methods. However, because it lacks a solution for feature collinearity, its performance falls short of KNCFS.

On the other hand, KNCFS obtained the best F1-score 27 times in Table 5 and 25 times (22/2) in Table 6, demonstrating a significant advantage in F1-score as well; Fig 4 (sub-figure b) likewise shows that KNCFS obtains the best F1-score. RBEFF achieved the best results twice (selecting 50 features) and three times (selecting 100 features) on the SCADI dataset because of its lower dimensionality; RBEFF is a filter-based feature selection method that performs well on small datasets. However, because it lacks a guiding algorithm in the feature selection stage and cannot consider nonlinear relationships between features, its performance on high-dimensional datasets falls short of KNCFS.

4.5.3 Success rate of feature selection.

In this subsection, we generate synthetic datasets based on four real-world datasets from the UCI repository: Caesarian (80 samples, 5 features, 2 classes), Fertility (100 samples, 10 features, 2 classes), BLOGGER (100 samples, 5 features, 2 classes) and Immunotherapy (90 samples, 7 features, 2 classes). Before starting the experiment, we consider the original features of each dataset to be relevant. We then add noise features consisting of random numbers with a mean of 0 and a variance of 5; the number of noise features varies from 50 to 500 in increments of 50, forming a set of synthetic datasets. We apply all seven methods to each synthetic dataset to learn feature importances and then rank the features by importance. For example, on the Caesarian dataset with five relevant features, we rank the feature weights and count the number of relevant features in the top five to measure the success of feature selection. The experimental results for the synthetic datasets are shown in Fig 5.
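A sketch of this synthetic-data construction, appending Gaussian noise columns (mean 0, variance 5) to a block of relevant features:

```python
import numpy as np

def add_noise_features(X, n_noise, rng=None):
    """Append n_noise irrelevant Gaussian columns (mean 0, variance 5) to the data matrix X."""
    rng = np.random.default_rng(rng)
    noise = rng.normal(loc=0.0, scale=np.sqrt(5.0), size=(X.shape[0], n_noise))
    return np.hstack([X, noise])

# e.g. a Caesarian-like block of 80 samples and 5 relevant features, plus 50..500 noise features
X = np.random.default_rng(0).normal(size=(80, 5))
for n_noise in range(50, 501, 50):
    print(add_noise_features(X, n_noise, rng=1).shape)
```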

Fig 5. Success rate of feature selection on synthetic datasets.

https://doi.org/10.1371/journal.pone.0296108.g005

On the Caesarian dataset (sub-figure a), KNCFS and Chi-square give similar results when the number of noisy features is below 200. On the Fertility dataset (sub-figure b), KNCFS slightly outperforms NCFS, and the gap between KNCFS and RBEFF widens as the number of noisy features increases. Finally, KNCFS consistently exhibits the strongest performance on the BLOGGER and Immunotherapy datasets, establishing it as a compelling contender. RB-NCFS is often less effective than NCFS, possibly because random noise features interfere with the random subspace approach and cause overfitting; this biases the capability of RB-NCFS to resolve significant features, resulting in a lower FS success rate than that of NCFS and KNCFS. KNCFS addresses the collinearity issue in random multi-subspace learning, enabling it to achieve the best feature selection success rate.

4.6 Parameter sensitivity analysis

To investigate the influence of the parameters M, s and k on the performance of the proposed method, we performed sensitivity experiments on the average accuracy of the SVM and KNN classifiers. For simplicity, we selected two representative datasets, "leukemia" and "GLIOMA", for the parameter sensitivity analysis. As shown in Figs 6 and 7, the accuracy of the algorithm decreases when k exceeds 15, with the highest accuracy achieved when k is in {5, 10}; similarly, the algorithm shows higher accuracy when s is in {5, 10}. Overall, variations in M have a relatively small effect on accuracy, so the algorithm is not very sensitive to the value of M. Finally, we set M, s and k to {10, 10, 10}.

Fig 6. Accuracy for the parameters s, M and k on the leukemia dataset with KNN and SVM classification.

https://doi.org/10.1371/journal.pone.0296108.g006

Fig 7. Accuracy for the parameters s, M and k on the GLIOMA dataset with KNN and SVM classification.

https://doi.org/10.1371/journal.pone.0296108.g007

5. Conclusion

Feature selection (FS) is an important data preprocessing technique that reduces the dimensionality of a dataset, decreases model complexity and lowers computational cost. In this paper, we propose a random multi-subspace method based on feature correlation clustering, implemented through an iterative process consisting of two key phases: a random subspace learning phase and a feature vector weighting phase. The random subspace learning phase aims to increase the diversity of samples so as to extract more information, while the feature vector weighting phase evaluates the feature partitions. We conducted numerical experiments on two types of datasets: real-world datasets and synthetic datasets with noisy features. The experimental results, compared with Chi-square, Fisher-score, ReliefF, RBEFF, NCFS and RB-NCFS, show that KNCFS is a state-of-the-art FS algorithm that effectively identifies relevant features. In this study, we used the existing feature selection method NCFS in the subspace learning phase, but more advanced feature selection methods could be used to further improve the algorithm. In addition, adopting feature selection methods capable of handling multi-labeled data, or unsupervised feature selection algorithms, could further extend the applicability of the algorithm. Feature selection remains an important area of research with many other aspects to be explored.

References

  1. Izmailov P., et al., On feature learning in the presence of spurious correlations. Advances in Neural Information Processing Systems, 2022. 35: p. 38516–38532.
  2. Zhang G., et al., Learning fair representations via rebalancing graph structure. Information Processing & Management, 2024. 61(1): p. 103570.
  3. Urbanowicz R.J., et al., Relief-based feature selection: Introduction and review. Journal of Biomedical Informatics, 2018. 85: p. 189–203. pmid:30031057
  4. Gonzalez-Lopez J., Ventura S., and Cano A., Distributed multi-label feature selection using individual mutual information measures. Knowledge-Based Systems, 2020. 188: p. 105052.
  5. Wang A., et al., Accelerating wrapper-based feature selection with K-nearest-neighbor. Knowledge-Based Systems, 2015. 83: p. 81–91.
  6. Sadeg S., et al., QBSO-FS: A reinforcement learning based bee swarm optimization metaheuristic for feature selection. In: International Work-Conference on Artificial Neural Networks. 2019. Springer.
  7. Ghimatgar H., et al., An improved feature selection algorithm based on graph clustering and ant colony optimization. Knowledge-Based Systems, 2018. 159: p. 270–285.
  8. Xue B., Zhang M., and Browne W.N., Particle swarm optimisation for feature selection in classification: Novel initialisation and updating mechanisms. Applied Soft Computing, 2014. 18: p. 261–276.
  9. Chen H., Li W., and Yang X., A whale optimization algorithm with chaos mechanism based on quasi-opposition for global optimization problems. Expert Systems with Applications, 2020. 158: p. 113612.
  10. Hu L., et al., Multi-label feature selection with shared common mode. Pattern Recognition, 2020. 104: p. 107344.
  11. Gu Q., Li Z., and Han J., Generalized Fisher score for feature selection. arXiv preprint arXiv:1202.3725, 2012.
  12. Brown G., et al., Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 2012. 13: p. 27–66.
  13. Zhou H., Wang X., and Zhu R., Feature selection based on mutual information with correlation coefficient. Applied Intelligence, 2022: p. 1–18.
  14. Doquire G. and Verleysen M., Feature selection with missing data using mutual information estimators. Neurocomputing, 2012. 90: p. 3–11.
  15. Zhang P., Gao W., and Liu G., Feature selection considering weighted relevancy. Applied Intelligence, 2018. 48: p. 4615–4625.
  16. Li J., et al., Feature selection: A data perspective. ACM Computing Surveys (CSUR), 2017. 50(6): p. 1–45.
  17. Wang Z., et al., Multi-class feature selection by exploring reliable class correlation. Knowledge-Based Systems, 2021. 230: p. 107377.
  18. Ji X., et al., Multi-label classification with weak labels by learning label correlation and label regularization. Applied Intelligence, 2023.
  19. Hu J., et al., Robust multi-label feature selection with dual-graph regularization. Knowledge-Based Systems, 2020. 203: p. 106126.
  20. Tang C., et al., Unsupervised feature selection via multiple graph fusion and feature weight learning. Science China Information Sciences, 2023. 66(5): p. 1–17.
  21. Lee C., Imrie F., and van der Schaar M., Self-supervision enhanced feature selection with correlated gates. In: International Conference on Learning Representations. 2022.
  22. Hu R., et al., Low-rank feature selection for multi-view regression. Multimedia Tools and Applications, 2017. 76: p. 17479–17495.
  23. Liu T., Hu R., and Zhu Y., Completed sample correlations and feature dependency-based unsupervised feature selection. Multimedia Tools and Applications, 2023. 82(10): p. 15305–15326.
  24. Zhang S., et al., Supervised feature selection algorithm via discriminative ridge regression. World Wide Web, 2018. 21: p. 1545–1562.
  25. Hu R., Zhang L., and Wei J., Adaptive Laplacian support vector machine for semi-supervised learning. The Computer Journal, 2021. 64(7): p. 1005–1015.
  26. Fan Y., et al., Multi-label feature selection based on label correlations and feature redundancy. Knowledge-Based Systems, 2022. 241: p. 108256.
  27. Yang W., Wang K., and Zuo W., Neighborhood component feature selection for high-dimensional data. J. Comput., 2012. 7(1): p. 161–168.
  28. Zhang B., Li Y., and Chai Z., A novel random multi-subspace based ReliefF for feature selection. Knowledge-Based Systems, 2022. 252: p. 109400.
  29. Sinaga K.P. and Yang M.-S., Unsupervised K-means clustering algorithm. IEEE Access, 2020. 8: p. 80716–80727.
  30. Asuncion A. and Newman D., UCI machine learning repository. 2007, Irvine, CA, USA.
  31. Samaria F.S. and Harter A.C., Parameterisation of a stochastic model for human face identification. In: Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision. 1994. IEEE.
  32. Bugata P. and Drotár P., Weighted nearest neighbors feature selection. Knowledge-Based Systems, 2019. 163: p. 749–761.
  33. Zare M., Azizizadeh N., and Kazemipour A., Supervised feature selection on gene expression microarray datasets using manifold learning. Chemometrics and Intelligent Laboratory Systems, 2023. 237: p. 104828.
  34. Maldonado S., Weber R., and Basak J., Simultaneous feature selection and classification using kernel-penalized support vector machines. Information Sciences, 2011. 181(1): p. 115–128.
  35. Liu H. and Setiono R., Chi2: Feature selection and discretization of numeric attributes. In: Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence. 1995. IEEE.
  36. Duda R., Hart P., and Stork D.G., Pattern Classification. Hoboken, NJ: Wiley-Interscience, 2001.
  37. Kononenko I., Estimating attributes: Analysis and extensions of RELIEF. In: European Conference on Machine Learning. 1994. Springer.