A Bootstrap Framework for Aggregating within and between Feature Selection Methods

In the past decade, big data has become increasingly prevalent in a large number of applications. As a result, datasets suffering from noise and redundancy issues have necessitated the use of feature selection across multiple domains. However, a common concern in feature selection is that different approaches can give very different results when applied to similar datasets. Aggregating the results of different selection methods helps to resolve this concern and control the diversity of selected feature subsets. In this work, we implemented a general framework for the ensemble of multiple feature selection methods. Based on diversified datasets generated from the original set of observations, we aggregated the importance scores generated by multiple feature selection techniques using two methods: the Within Aggregation Method (WAM), which refers to aggregating importance scores within a single feature selection; and the Between Aggregation Method (BAM), which refers to aggregating importance scores between multiple feature selection methods. We applied the proposed framework on 13 real datasets with diverse performances and characteristics. The experimental evaluation showed that WAM provides an effective tool for determining the best feature selection method for a given dataset. WAM has also shown greater stability than BAM in terms of identifying important features. The computational demands of the two methods appeared to be comparable. The results of this work suggest that by applying both WAM and BAM, practitioners can gain a deeper understanding of the feature selection process.


Introduction
Over the years, feature selection has become a fundamental preprocessing tool in data mining and machine learning. Otherwise known as attribute or variable selection, feature selection refers to the process of reducing the number of input variables for predictive modeling. The objective of feature selection is three-fold: improving the prediction performance of the features, providing faster and more cost-effective features and gaining a better understanding of the underlying process that generated the data. This objective is achieved through the characterization and elimination of irrelevant and redundant features, leaving only a subset of the most useful features to be used in further analysis. As such, features represent the individual independent variables typically used as predictors within a model. Throughout this paper, we use the terms variable and feature exchangeably.
There are generally three classes of feature selection methods based on how the method interacts with the learning algorithm: filter, wrapper and embedded methods [1]. Filter methods evaluate the importance of features as a pre-processing operation to the learning algorithm and select the best feature subsets through some information metrics without direct input from the target variable. Filters are known to have high computational efficiency compared to the wrapper and embedded methods. Alternatively, the wrapper methods apply some search in the feature space and use the learning algorithm to evaluate the importance of feature subsets. Thus, wrappers are deemed to be computationally expensive and can be slow due to the need to apply the learning algorithm to each new feature subset. The embedded methods perform the feature selection internally to the learning algorithm. In this approach, a predefined importance criterion is integrated into the learning algorithm, and features which meet the set criteria are selected. Embedded methods have less computational cost than wrappers but are more likely to suffer from over-fitting.
One of the most common challenges encountered in feature selection is choosing the most suitable method for a given problem. In general, there is no single feature selection method that outperforms all others across most applications. The three different classes of feature selection methods are frequently suitable under varying conditions. For instance, the greedy randomized adaptive heuristic (GRASP) filter technique introduced in [2] provides efficient results for problems with a large number of nominal features or with simulated instances, when the bias introduced by the classification methods is minimized. In this context, an overall understanding of various types of feature selection algorithms is often needed before an appropriate method selection could be made. Often, such a choice is based on massive empirical evaluations of diverse feature selection methods ( [3,4]). Alternatively, meta-learning and ensemble methods are two widely used approaches for determining the most appropriate feature selection algorithm for a problem or circumventing it entirely.
Meta-learning predicts the most appropriate feature selection method for a given dataset. In meta-learning, the feature selection task is treated as a supervised learning problem in which the datasets are objects and the target function maps a dataset to the feature selection method that shows the highest performance based on a certain criterion. With the increase in feature selection techniques and introduction of new methods, the implementation of meta-learning can provide a valuable way of weighing the most suitable feature selection methods. However, like other learning techniques, in order to make reliable recommendations, metalearning can be limited by its need for a suitable metadatabase that is representative of the given problem domain [5]. An excellent review of meta-learning algorithms can be found in [6].
Ensemble methods were originally developed to enhance classification performance [7]. Ensemble feature selection has two main components: diversification, to create varying feature selection outputs; and aggregation, to combine the generated outputs. Diversification can be achieved by data resampling [8]. In data resampling, several randomly generated subsamples are drawn from the original dataset, and then a feature selection technique is applied on each generated subsample. The ensemble method combines the results generated by each feature selection method [9]. Bootstrap sampling is commonly used to generate the random subsamples [10][11][12]. When the amount of available data is sufficient, in [13], the partitioning of the data into non-overlapping chunks was proposed.
Although a number of aggregation techniques have been proposed in the literature, there is no clear rule to determine which one of them should be chosen for a specific feature selection task. However, the simple approach of mean-based aggregation seem to be efficient and compelling in most cases [14]. In [15], Kolde et al. (2012) proposed a novel rank aggregation method based on order statistics and applied it to gene selection. The approach detects genes (features) that are ranked consistently better than the expected behavior of uncorrelated features and assigns a significance score to each gene. In [16], Ditzler et al. (2014) developed a statistical testing framework in which a statistical test is performed for the number of times that each feature has been selected to determine whether it belongs to the relevant feature set or not.
The robustness or stability of the feature selection method is of paramount importance to reducing dimensionality and improving the performance of the learning algorithm. Stability measures the insensitivity of the feature selection method to variations in the training set. In other words, unstable feature selection methods can produce varied feature rankings when a single feature selection technique is applied to different training samples generated from the same dataset. Instability can also occur when very different feature selection techniques applied to the same dataset produce different feature rankings.
The stability of feature selection has attracted a plethora of research in machine learning and data mining communities over the last decade [8]. It has become crucial to supplement the investigation of model performance with stability analysis in order to ensure the quality of feature selection [17].
This paper introduces a general framework for a bootstrap ensemble in which feature selection results are aggregated within and between multiple feature selection methods. Multiple subsamples are generated from the original dataset by bootstrapping, and the feature selection techniques are applied on each subsample. The aggregation is thus two-fold; first, the single feature selection method is aggregated across subsamples, and second, different feature selection methods are combined into a single set. The mean-based aggregation rule is used where the generated importance scores are combined within and between the feature selection methods. Because filter feature selection methods are known for their computational efficiency, we chose four traditional filter techniques for our experimental framework, although wrappers and embedded methods are also applicable. Our experimental results with real data sets demonstrated that ensembling multiple feature selection methods improves the performance of the learning algorithm, guides the selection of the optimal feature subset and facilitates the identification of the most appropriate feature selection method for a given dataset. Furthermore, comparisons of the two folds of aggregation revealed that, on the whole, aggregating within a single feature selection method outperforms aggregating between multiple feature selection methods.
The rest of this paper is organized as follows. In Section 2, we introduce the general framework for the bootstrap aggregation within and between multiple feature selection methods. In Section 3, we describe the robustness metrics used for the evaluation of the feature selection stability. In Section 4, we present and analyze the results of our experimental work. Finally, conclusions and some insights into future work are presented in Section 5.

Bootstrap Aggregation Framework
Let us consider a dataset S ≡ (X, Y), with n observations and p features such that n, p ∈ Z >0 ; that is, X = [x ij ] n×p ∈ R n,p is the matrix of observations and Y is the target variable (i.e., the rows are the observations and the columns are the variables). Moreover, x ij denotes observation i of the feature j. The goal of this work is to reduce the number of features in the dataset X in order to predict the target variable Y. Now, let {V 1 , . . . , V p } denote the set of features (variables) in X. The dataset S is divided into a training dataset X, and a testing dataset T. Here, X = X rn ,p and T = X n− rn ,p with 0 < r < 1. For instance, considering the Philippine dataset from Table 1, Philippine has n = 5832 observations and p = 309 features. Moreover, {V 1 , . . . , V 309 } denote the set of features in the Philippine dataset. Two-thirds of the Philippine dataset is taken for training whereas one-third is used for testing. Then, the training data are denoted as X = X 3888,309 and the testing data are denoted as T = X 1944,309 . We use feature selection to reduce the 309 features in the Philippine dataset, leading to the dataset only containing the most relevant features.
Let FS 1 , . . . , FS t denote the feature selection methods used, where t ∈ Z >0 . In addition, we assume that each feature selection method FS q ∈ {FS 1 , . . . , FS t } generates a feature importance score j ∈ R for every feature V j ∈ {V 1 , . . . , V p }. Although rank aggregation could also be used for the feature selection process as an alternative to score aggregation, aggregating the ranks from different feature selection algorithms might result in ties. For instance, three feature selection algorithms might rank one feature as 2, 1 and 3 and another feature as 2, 3 and 1. The average rank for both features would be 2. A merit of this study is that it uses the importance scores to aggregate feature selection methods, as the scores have a stronger scale than the rank and can better differentiate between the features. Furthermore, averaging the actual importance scores is simple to implement and perform. For the meaningful comparison of the scores derived from different feature selection algorithms, a normalization technique has been implemented. In this section, we discuss the use of bootstrap techniques in order to combine the feature importance scores within each FS q and between FS q for all q ∈ {1, . . . , t}. For the convenience of presentation and interpretation, the following terminologies are used in the rest of the paper: - The Within Aggregation Method (WAM) refers to aggregating importance scores within a single feature selection method. - The Between Aggregation Method (BAM) refers to aggregating importance scores between different feature selection methods.
Due to the simplicity and the efficiency of the arithmetic mean [14], it is used in this paper to aggregate the feature importance scores resulting from the different feature selection methods. However, the proposed framework allows the application of other score or rank aggregation rules, including the median, geometric mean and others.

Feature Selection Based on WAM
Let us consider a training dataset X and a feature selection method FS. Let X 1 , X 2 , . . . , X m be bootstrapped samples from X, where m ∈ Z >0 . Then, we apply FS on each X s , s = 1, . . . , m, which in turn generates feature scores { s1 , . . . , sp } that correspond to the set of features {V 1 , . . . , V p }. Therefore, a score matrix L = [ sj ] ∈ R m×p is generated after applying the feature selection method FS on each bootstrap sample. In L, column j represents the FS importance scores for variable V j over the m bootstrap sample datasets. The final aggregated score for the feature V j is defined to be the mean of column j in L.
We use the notation¯ .j = ∑ m s=1 sj m to denote the aggregated scores of V j . Then, a rank vector r = (r 1 , . . . , r p ), r j ∈ {1, 2, . . . , p} is assigned to the feature set {V 1 , . . . , V p } based on the aggregated scores {¯ .1 , . . . ,¯ .p }. The feature set is then sorted from the most to the least important based on the rank vector r. Now, based on a threshold parameter, 0 < k ≤ 1, we keep only the most important 100k% of the feature set (determined by the rank vector r). The WAM approach can be used to compare the performance of different feature selection techniques based on various supervised learning methods for a given dataset. The flowchart in Figure 1 explains the WAM. The following gives the associated computational Algorithm 1.

Algorithm 1 WAM Algorithm:
Given a training dataset X with p features, a testing dataset T, a feature selection method FS, a threshold parameter k, and a learning algorithm M.

Feature Selection Based on BAM
Let us consider a training dataset X and feature selection methods {FS 1 , . . . , FS t }. Let X 1 , X 2 , . . . , X m be bootstrapped samples from X, where m ∈ Z >0 . Then, we apply FS q , q = 1, . . . , t on each X s , s = 1, . . . , m, which in turn generates feature scores { s1 , . . . , sp } (q) that correspond to the set of features {V 1 , . . . , V p }. Therefore, a score matrix L (q) = [ (q) sj ] ∈ R m×p is generated after applying the feature selection method FS q on each bootstrap sample. In L (q) , column j represents the FS q scores for variable V j over the m bootstrap sample datasets. Then, each column in the score matrix L (q) is normalized using min-max normalization; that is, After the normalization, we use the arithmetic mean to combine the normalized . That is, column j in the score matrix¯ L represents the average column j between all considered feature selection methods FS 1 , . . . , FS t . Then, the final aggregated scores, {¯ .1 , . . . ,¯ .p } , for the feature set {V 1 , . . . , V p } are the arithmetic mean of columns 1, . . . , p in¯ L. The rank vector r = (r 1 , . . . , r p ), r j ∈ {1, 2, . . . , p} is assigned to the feature set {V 1 , . . . , V p } based on the final aggregated scores. The feature set is then sorted from the most to the least important based on the rank vector r. The top 100k% features are thus retained for further analysis. The flowchart in Figure 2 explains the BAM. The following gives the associated computational Algorithm 2.

Algorithm 2 BAM Algorithm:
Given a training dataset X with p features, a testing dataset T, feature selection methods {FS 1 , . . . , FS t }, a threshold parameter k, and a learning algorithm M.
(iv) Use the arithmetic mean scores to combine the matrices in (iii) into one score matrix¯ L.

Stability Analysis
In order to measure the stability of feature rankings for each feature selection method in {FS 1 , . . . , FS t }, we implement the similarity-based approach proposed in [18]. This approach depends on the representation language of the produced feature rankings. Considering a training dataset X and a feature selection method FS, let X 1 , X 2 , . . . , X m be bootstrapped samples from X. Then, we apply FS on each X s , s = 1, . . . , m. This, in turn, produces any of the following three representations with respect to the feature set {V 1 , . . . , V p } and the sample dataset X s : A subset of features represented by an index vector w s = {w s1 , . . . , w sp }, w sj ∈ {1, 0}, where 1 indicates feature presence and 0 indicates feature absence.
Naturally, it is possible to transform any feature importance scores vector into a rank vector r by sorting the importance scores. On the other hand, a rank vector r may be converted into an index vector w by selecting the top 100k% features. The most common way to quantify the stability of a feature selection method is by simply taking the average of similarity comparisons between every pair of feature rankings derived from the different bootstrap samples as follows: where Φ( f s , f v ) is the similarity measure between a pair of feature rankings from any two training samples X s , X v (1 s, v m). Note that the feature rankings ( f s , f v ) can be represented as a pair of importance scores vectors, rank vectors or index vectors. Moreover, the multiple 2 m(m−1) stems from the fact that there are m(m−1) 2 possible pairs of feature rankings between the total m samples.
Several similarity measures have been introduced in the literature [8]. In this paper, we use some popular measures of similarity for each of the representations described above. Accordingly, we will use feature selection to produce importance score vectors { 1 , . . . m } that will be converted into rank vectors {r 1 , . . . r m } and index vectors {w 1 , . . . w m }. We implement the following: i.
Pearson's correlation coefficient: In the case of similarity between two importance score vectors ( s , v ) produced by one of the feature selection methods, the Pearson's correlation coefficient computes the similarity measure as where s is the row s in the score matrix L; that is, the feature importance scores that correspond to the set of features {V 1 , . . . , V p }, obtained from X s . Furthermore, µ s is the mean of the row vector s . Here, ii. Spearman's rank correlation coefficient: With regard to the similarity between two rank vectors (r s , r v ) produced by one of the feature selection methods, Spearman's rank correlation coefficient measures the similarity between the two rank vectors as where r s is the rank vector that corresponds to the set of features {V 1 , . . . , V p }, such that r s is derived from s . Here, Φ SRCC (r s , r v ) ∈ [−1, 1]. iii. Canberra's distance: Another measure used to quantify the similarity between two rank vectors (r s , r v ) is Canberra's distance [19]. This metric represents the absolute difference between two rank vectors as For easier interpretation, Canberra's distance is normalized by dividing by p. iv. Jaccard's index: Jaccard's index measures the similarity between two finite sets; it is taken as the size of the intersection divided by the size of the union of the two sets. Given the index vectors (w s , w v ) used to represent the two sets, Jaccard's index is given by In addition to the stability scores based on similarity, one can compute the average standard deviation of the feature importance scores across all bootstrap samples for every feature selection method. By definition, the standard deviation measures the dispersion or instability of the feature selection scores under different training bootstraps. Similar to the work in [20], we define the average standard deviation given a feature selection method FS as where c j represents column j in the standardized score matrix L. In other words, SD(c j ) is the standard deviation of the standardized FS importance scores for variable V j over the m bootstrap samples. Generally, a low average standard deviation would imply high stability, whereas a high average standard deviation would suggest lower stability.

Experimental Datasets
Without the loss of generality of the applicability of the proposed framework, the experimental evaluation in this study focuses on the binary classification conferred in the datasets illustrated in Table 1. The target variable for each dataset encompasses two classes (e.g., "males" and "females"). Here, the class refers to the group categorization of some observations under the nominal target variable Y (i.e., the value of Y). The datasets contain both numerical and nominal features with various dimensions and numbers of observations. Overall, the number of features across datasets ranges from 34 to 309, while the number of observations ranges from 351 to 6598 observations. The features/observations ratio ranges from as low as 0.007 to as high as 0.194. Furthermore, the binary class distribution spans from deeply imbalanced to perfectly balanced. On that account, these datasets provide an interesting benchmark for investigating the performance of the proposed framework and its characteristics.

Experimental Design
In this experiment, we compared the performance of the two aggregation methods: WAM and BAM. The experimental environment was Windows 10, 64-bit, 16 GB RAM, Intel(R) Xeon E-2124 (3.30 GHz). For the implementation of the proposed framework (Section 2), we selected four filter selection methods: Information Gain (IG), Symmetrical Uncertainty (SU), Minimum Redundancy Maximum Relevance (MRMR) and the Chisquared method (CS). For the Chi-squared method, numeric features were discretized based on a fixed-width binning. Each dataset was divided into training and testing datasets, where two-thirds of the dataset was used to obtain the feature rankings and one third was used for testing. For the training phase, m = 1000 bootstrap samples were used. These bootstraps were utilized to obtain final rank vectors using the aggregation of feature importance scores within each feature selection method (WAM algorithm) and between the different feature selection methods (BAM algorithm). For the testing stage, 10 different k thresholds were used, resulting in subsets containing the top {10%, 20%, . . . , 100%} of the total features. Here, k = 100% refers to the baseline model where all features were used and none of the feature selection methods were implemented. We should note here that other values of k can be chosen. The existing literature in this regard provides some guidelines on how to choose k for some specific scenarios. For example, we refer interested readers to [21][22][23][24][25]. However, it is difficult to generalize directives on the basis of specific cases, and in practice, several values of k were chosen and the classification accuracy using cross-validations or independent test data was used to evaluate the quality of the chosen subsets [26,27] In the experiment, a five-fold cross-validation procedure was implemented in the testing phase. Accordingly, we divided the testing set into five stratified samples. Thus, the five folds were selected such that the distribution of the target variable Y was approximately equal in each of the folds. For instance, given a testing data with 500 observations such that the distribution of the binary target variable was 7:3, each fold in the stratified five-fold cross-validation would be expected to have 70 observations in one class and 30 in the other.
Accordingly, stratification was used to ensure that each class was equally represented among the different folds. For every iteration, one of these five stratified samples was used as a testing set, while the remaining four samples were used to train the model. On each iteration, the testing sample was used to evaluate the performance of the selected feature subsets based on the following classification algorithms: • Logistic regression: A statistical model used to model the probability of the occurrence of a class or an event using a logistic (sigmoid) function. It is a widely used classification algorithm in machine learning. The objective of logistic regression is to analyze the relationship between the categorical dependent variable and a set of independent variables (features) in order to predict the probability of the target variable. The maximum likelihood estimation method is usually used to estimate the logistic regression coefficients. • Naive Bayes: A probabilistic classifier based on the Bayes theorem [28]. It assumes that the occurrence of each input feature is independent from other features. It can be used for both binary and multiclass classification problems. Due to its simplicity, it is a fast machine learning algorithm which can be used with large datasets. • Random Forest: An ensemble model of decision trees in which every tree is trained on a random subsample to provide class prediction. The subsamples are drawn with replacements from the training dataset. The results from all the decision trees are then averaged to yield the model prediction [29]. Random Forest is useful to prevent over-fitting, but it can be complex to implement. • Support Vector Machine (SVM): A supervised learning algorithm in which each observation is plotted as a point in p-dimensional space (with p being the number of features). SVM aims to identify the optimal hyperplane which segregates the data into separate classes. The selected hyperplanes thus maximize the distance between data points of different classes [30].
The entire experimental framework was performed using the open-source statistical programming language R. Note that features of near-zero variance were removed prior to the analysis. After the final feature subsets were selected and utilized in building each classifier across the five-folds, we evaluated the performance of the classifiers through the estimation of the area under the receiver operator curve (ROC) [31], hereafter referred to as the AUC. The AUC is a popular metric used to evaluate a model's ability to distinguish between classes. The AUC values, averaged over the five-folds in the cross-validation, are plotted against the 10 different k100% thresholds (ranging between 10-100% of features used) in order to reveal the classification performance for different feature subsets in the testing stage. The running times for WAM and BAM, respectively, are also shown in Figures (A1-A13). The average running time for WAM was 2563 seconds and that for BAM was 2579 seconds. Overall, BAM was marginally slower than WAM across all datasets because of the fact that BAM involves an additional aggregation step between the different feature selection methods. It should be emphasized that the computational costs of the proposed framework are mostly dependent on the feature selection methods used and the dataset composition. In the following two subsections, we analyze and compare the performance of the WAM and BAM algorithms in terms of their classification accuracy and their identification of optimal feature subsets (Section 4.3.1). In Section 4.3.2, we analyze and compare the stability behavior of the two algorithms. Furthermore, the association between stability and accuracy is discussed.

Classification Performance
In most datasets, we note that the ensembling of bootstrap samples and aggregation of feature importance scores within and between feature selection methods improved the baseline (k = 100% features are used) classification performance. In particular, the effectiveness of WAM was noticeable under the Naive Bayes classifier, where the AUC for most aggregated feature selection methods showed a steeper increase than in other classifiers across the six datasets. For instance, the Naive Bayes accuracy increased from around 0.79 baseline AUC to nearly 0.86 in the Scene dataset ( Figure A5) and from the baseline of 0.60 AUC to nearly 0.70 in the Fri dataset ( Figure A4). Following the removal of the least significant features, a similar increase in the performance of the logistic regression model could be observed in the Jasmine and Philippine datasets (Figures A1 and A7). At baseline performance, it seems that the Random Forest algorithm performed better than the other classifiers, with SVM achieving the lowest AUC scores in the majority of cases. Although the aggregated feature selection methods present some highly changing patterns depending on the number of features retained, the general trend showed an increase in accuracy scores. Moreover, the variability in the AUC scores between the different feature selection methods tended to be more pronounced in the high-dimensional datasets (e.g., Philippine, Jasmine, Scene, HIVA) in comparison to the low-dimensional datasets (e.g., Satellite, Ada, Splice).
Likewise, it can be observed that BAM also improved the baseline accuracy of the classification models, especially under logistic regression. For instance, BAM improved the logistic regression AUC scores from around 0.83 at baseline to over 0.86 in Jasmine ( Figure A1) and from around 0.55 at baseline to nearly 0.67 in Fri ( Figure A4).
A general observation from the figures is that the aggregated feature selection methods under the WAM outperformed the BAM. In particular, for most of the datasets, there was at least one feature selection method aggregated under WAM that produced better AUC values than BAM. For instance, it is clear that the aggregated Chi-squared (CS) method outperformed BAM in the Philippine and Scene datasets ( Figures A7 and A5, respectively). In other datasets such as Fri ( Figure A4) or Ionosphere ( Figure A8), the aggregated IG and SU tended to produce the highest AUC scores across the classifiers. The performance of BAM appeared to be comparable to that of the aggregated feature selection methods in the datasets Image, Satellite, and Spectrometer. BAM was among the best-performing methods with respect to AUC in these datasets. The figures show that, in general, the aggregated IG and CS demonstrate the greatest correlation with BAM curves.
For the selection of the optimal k100% threshold based on AUC values, the WAM and BAM produced nearly consistent results. In the Philippine, Jasmine and Musk datasets, there was a noticeable pattern in which the classification AUC dropped sharply around the 30% mark, indicating that removing more than 70% of the features results in the reduced accuracy of the trained model. Nevertheless, the exact feature reduction percentage is dependent on the dataset used. In the Scene dataset, most of the classifiers agreed on retaining around 50% of the top features. In the Splice and Optidigits datasets, the optimal k100% was 70%, while in the Spectrometer and HIVA datasets, this was almost 80% of the features.
It should be mentioned that the experimental results in this paper affirmed the fact that the performance of a given feature selection method is data-dependent. Overall, none of the feature selection methods produced the best AUC values across all datasets. For example, the aggregated MRMR showed the worst performance across multiple classifiers in the Scene, Spectrometer and Philippine datasets, whereas in the Optdigits dataset ( Figure A9), it showed improved performance when most features were removed. A similar pattern was seen for the aggregated Chi-squared feature selection method, which showed the best performance in a number of datasets (e.g., Image, Scene, Philippine) and worse performance in others (e.g., Fri, Ionosphere). The BAM curve was seen to be the middle-best performing technique. This is of course because the BAM scores are simply the aggregated averages over the individual feature selection methods. In most cases, however, there was an obvious overlap in the AUC scores between the different feature selection techniques resulting from the fact that dominant features were able to maintain similar rankings under different feature selection algorithms.
In summary, the analysis of the AUC results reveals that WAM can be used as a powerful tool to help practitioners to identify the most suitable feature selection method for a given data set. It can also guide the selection of the optimal level of feature reduction while achieving the maximum level of learning accuracy. Table 2 presents the results of the stability analysis of the four aggregated filters using WAM and BAM. Stability scores derived from the pairwise Pearson's correlation coefficient are calculated using the importance scores in the matrix L = [ sj ] ∈ R m×p resulting from applying the feature selection method FS on every bootstrap sample. The calculated Pearson's correlation coefficients are then averaged over all possible pairs. On the other hand, the averaged pairwise Spearman's rank correlation coefficients and Canberra's distance are calculated from the rank matrix R = [r sj ] ∈ R m×p , which is obtained by sorting the features from most to least important based on their importance scores L = [ sj ] ∈ R m×p . The Jaccard's index is computed from the top 25% ranked feature subsets (represented by index vectors). Finally, the average standard deviation (ASD) is computed using the importance scores, after normalization, averaged over the 1000 bootstraps.

Stability Analysis Results
The bolded scores in Table 2 represent the best stability value for each dataset. In the Jaccard's index, Spearman's, and Pearson's correlation coefficient, this refers to the highest stability score, whereas for Canberra's distance and ASD, it represents the lowest stability score. Unsurprisingly, none of the feature selection methods demonstrated the best stability behavior consistently across the 13 datasets. In other words, the stability of the aggregated feature selection methods was data-dependent. As such, even though all the stability measures agreed on a "winning" feature selection method in some datasets, it is not surprising that these methods varied throughout. For example, IG was the most stable method across all stability measures in the Ada and Philippine datasets, but Symmetrical Uncertainty was the most stable in Ionosphere and MRMR in the Spectrometer and HIVA datasets. Thus, none of the aggregated feature selection methods can be declared the most stable in every measure.
Contrasting the stability behavior of the BAM to that of the WAM, Table 2 shows that at least one feature selection method of the four feature selection methods used generally exhibited higher stability when aggregated using WAM than when aggregated using BAM. With the exception of the Image dataset, the BAM stability scores were seen to be the middle-most stable across every experimental stability measure. For the Image dataset, the BAM stability outperformed the stability of all the aggregated feature selection methods. It is noteworthy that BAM also achieved the best classification accuracy in the Image dataset (see Figure A3). Similarly, the IG and Chi-squared methods, which demonstrated the highest stability in the Fri and Scene datasets, also achieved the best classification performance in these two datasets, respectively (see Figures A4 and A5). A similar pattern can be seen with respect to the higher stability of MRMR in the Optdigits and HIVA datasets. These observations suggest that there may exist a positive association between the classification performance of a feature selection method and its stability behavior. Intuitively, this presumed association depends on the stability metrics used and also on the characteristics of the dataset and the learning algorithm-a topic that constitutes an interesting line of further research.
In summary, although BAM demonstrates a comparable stability behavior to that for each single feature selection method under WAM, it appears that, in 11 out of 12 datasets used in this paper, the stability of at least one individual method outperformed BAM. Interestingly, there appears to be a positive association between the classification performance of a feature selection method and its stability behavior. A feature selection method that outperforms others in terms of classification accuracy may also outperform them in terms of stability behavior.

Conclusions
Over the years, datasets have grown increasingly large in terms of their size and dimensionality. As a result, feature selection has become a necessary reprocessing tool in machine learning applications and the focus of a wide range of literature and research across many domains. This study has explored the full potential of a bootstrap ensemble approach for feature selection in which the ensemble aggregation is performed within and between the multiple feature selection methods.
The extensive experimental analysis of 13 different datasets selected from different domains has demonstrated that the Within Aggregation Method (WAM) is highly efficient in guiding the selection of the most suitable feature selection method for a given problem. As for reducing the dimensionality of the problem, our analysis showed that WAM and the Between Aggregation Method (BAM) were comparable in determining the optimal percentage reduction in the number of features. They are also comparable in terms of the computational costs. It important to emphasize that optimizing the feature subsets and the computational costs of the techniques depends largely on the dataset characteristics and learning algorithm implemented.
In terms of stability, the WAM demonstrated better stability behavior than BAM in most datasets (11 out of 12). Overall, the BAM stability scores fell in the middle range of the computed values on each of the score-based and rank-based stability metrics, suggesting a desirable stability behavior of the method.
The experimental analysis also showed that there exists a positive association between the classification performance of a feature selection method and its stability behavior. In other words, the feature selection method that outperforms others in terms of classification accuracy may also outperform them in terms of stability behavior. This association can, however, depend on the stability metric used, the learning algorithm and the characteristics of the dataset. Due to the extent to which the feature selection results in terms of learning accuracy and stability can depend on the data composition and learning algorithm, it is essential that both BAM and WAM methods be implemented in order to achieve a better understanding of and more useful insights into the underlying application domain.
The observed association between learning accuracy and stability indeed merits further investigation in the future. Additionally, in future work , we will extend the application of the framework to other types of feature selection methods, such as wrappers. Particular interest will be given to the application using a mix of filter and wrapper feature selection methods and establish a wider scope of comparisons to better evaluate the ensemble framework. Furthermore, datasets with diverse performance and various learning algorithms will be implemented. Another line of future research would be the application and analysis of additional aggregation methods, such as the median or geometric mean, and the comparison of them against the mean aggregation used in this paper.  Acknowledgments: The authors are grateful for the comments and suggestions by the referees and the handling Editor. Their comments and suggestions have greatly improved the paper. The authors also gratefully acknowledge that the work in this paper was supported, in part, by the Open Access Program from the American University of Sharjah.

Conflicts of Interest:
The authors declare no conflict of interest.