A framework for feature selection through boosting

A. Alsahaf et al.


Introduction
The presence of irrelevant and redundant features in a dataset can lower the performance of predictive models, due to over-fitting and the curse of dimensionality. Moreover, even if a model's performance is robust to redundancy and noise, the presence of those features has other disadvantages, such as increasing storage and computational costs and limiting the model's interpretability. Feature selection can mitigate these problems by identifying and selecting relevant features, and removing irrelevant and redundant ones.
Model interpretability has become especially pertinent with the rise of interest in explainable AI (Holzinger, 2018; Gunning, 2017). The interpretability of machine learning models is important as it enables compliance with the socially relevant requirements of those models, such as fairness, unbiasedness, privacy, trust, and reliability (Doshi-Velez & Kim, 2017).
Since modern machine learning applications contain an ever-growing number of features, interpretability and intuitive understanding of prediction outcomes have become virtually impossible without some form of dimensionality reduction. Other aids to model interpretation, like data visualization, are also made easier by reducing the number of features.
Furthermore, in certain biomedical applications, knowledge discovery is the primary task of feature selection, more so than improving the prediction outcome or increasing computational efficiency (Borboudakis & Tsamardinos, 2019). In gene expression studies, feature selection is used to discover the genetic networks associated with diseases (Tabus & Astola, 2005). In other biomarker discovery studies, the workflow relies heavily on feature selection, as the studies often begin, because of practical limitations, with raw data that has many features and few samples (Christin et al., 2013).
Another benefit of feature selection is found in applications where the acquisition of features is costly. In such cases, it is useful to identify whether a costly feature happens to be irrelevant or redundant (Early, Fienberg, & Mankoff, 2016; Bolón-Canedo & Alonso-Betanzos, 2019). For example, in medical data, a feature could be associated with a clinical test, which could be expensive, inconvenient to patients, or both.
Reducing the number of features in a dataset can be achieved with fundamentally different approaches, and many feature selection methods have therefore been proposed in the literature. The most popular taxonomy of these methods divides them into three broad categories: filter methods, wrapper methods, and embedded methods (Guyon & Elisseeff, 2003).
Compared to filters, wrapper methods select subsets with high accuracy at the expense of increased computational cost. Since an exhaustive search of all possible feature subsets is intractable for large datasets, wrapper methods make use of efficient search strategies for finding candidate subsets to evaluate with the wrapped predictor (Guyon & Elisseeff, 2003; Tang et al., 2014; El Aboudi & Benhlima, 2016). Examples of these strategies include hill-climbing, best-first, genetic algorithms (Kohavi & John, 1997), particle swarm optimization (Ibrahim, Ewees, Oliva, Abd Elaziz, & Lu, 2019), and Whale optimization (Mafarja & Mirjalili, 2018).
Boosting, a popular meta-algorithm in ensemble classification, can also be used to modify the feature search space in wrapper-based feature selection. This use of boosting, or sample re-weighting, has been explored to a limited extent in past studies (Das, 2001; Tieu & Viola, 2004; Liu, Liu, & Zhang, 2009; Barddal, Enembreck, Gomes, Bifet, & Pfahringer, 2019). In this paper, we expand on this concept and propose a greedy forward selection algorithm that we call FeatBoost. The algorithm uses embedded feature importance scores of tree ensemble models for choosing the candidate features, then uses boosting to update the importance scores, and thus the search space, after every iteration.
Compared to past studies, we contribute the following: (1) a sample weighting strategy which weights each sample according to its prediction probability, in contrast to previous methods, which up-weight all misclassified samples by the same amount; (2) a modular algorithm architecture, which decouples feature ranking from selection, overcoming inconsistencies in feature rankings and potentially increasing the robustness of the selected subsets; and (3) a sample weight reset strategy, which prevents premature stopping of the algorithm.

Article structure
In Section 4, we describe the novel elements of the proposed approach in relation to existing algorithms, and give an overview of recent, relevant developments in feature selection with tree ensembles. We give the details of the proposed algorithm in Section 5 and the experimental settings in Section 6, followed by the results in Section 7. Finally, we discuss the implications of the results and future developments in Section 8.

Notations and definitions
In this section, we introduce the notations used in the rest of the paper. Whenever possible, we unify notations describing the different methods we compare. Therefore, the notations do not necessarily reflect those used in the original works.
Each instance of feature selection is solved on a given dataset D with p features, n samples, and a discrete target output y with $n_c$ classes. A prediction of y is referred to as ŷ. The subset of selected features for the given output is denoted by X.
In methods that take as input the number of desired features to select, or the maximum number of features that can be selected, this number is denoted as $p'$. The actual number of features selected is denoted as $p^*$. If a method selects features sequentially, or produces a rank for the features in X, then $X_i$ refers to X at the i-th iteration, for $i = 1, \ldots, p^*$.
For clarity, we stress here the difference between two types of boosting that are present in the proposed algorithm. The first occurs within the ensemble model that is used to generate the feature importance rankings (if a boosting method is used for that purpose), while the second is an outer layer of boosting (or sample re-weighting) performed specifically for the feature selection process, and bears no direct functional relation to the first type. From this point onward, mentions of sample re-weighting or boosting in this paper refer to the second type, unless otherwise specified.

Background and related work
In this section, we elaborate on the use of boosting in feature selection by highlighting a number of existing methods, and how the proposed algorithm differs from them. We also consider more broadly methods that use the embedded feature importance scores of decision-tree models as bases for feature selection.

Tree-based feature importance
The embedded feature importance scores of tree-based ensembles are powerful starting points for feature selection (Tuv, Borisov, Runger, & Torkkola, 2009). This is largely due to two factors. First, those models are versatile: they are scale invariant, scalable to large datasets, and able to handle numerical, categorical, and missing data. Second, the intrinsic importance scores can be derived at no additional cost over that of model training.
In a decision tree, the importance of a feature is defined as the total value of a node-splitting criterion that the feature is responsible for (e.g. Information Gain or the Gini Index). When used in ensembles, it is commonly defined as the sum or average of the splitting criterion that a feature causes across all trees (Breiman, 2002; Louppe, Wehenkel, Sutera, & Geurts, 2013).
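As an illustration of this definition, the following minimal sketch computes the Gini impurity decrease of a single split and averages per-feature gains over an ensemble. The function names and the representation of a tree as a list of `(feature_index, gain)` pairs are assumptions made for this example; practical implementations typically also weight each gain by the fraction of samples reaching the node, which is omitted here.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))


def ensemble_importance(trees, n_features):
    """Average, over trees, of the total gain attributed to each feature.

    `trees` is a list of trees; each tree is a list of (feature_index, gain)
    pairs, one per internal node (a hypothetical, simplified representation).
    """
    totals = defaultdict(float)
    for splits in trees:
        for feature, gain in splits:
            totals[feature] += gain
    return [totals[j] / len(trees) for j in range(n_features)]


# A perfect split of a balanced two-class node removes all impurity.
perfect_gain = split_gain([0, 0, 1, 1], [0, 0], [1, 1])

# Two toy trees: feature 0 is used in both, feature 1 only in the first.
importances = ensemble_importance([[(0, 0.5), (1, 0.1)], [(0, 0.3)]], n_features=2)
```

Because the scores are simple sums over splits, two redundant features that are chosen interchangeably by different trees end up sharing the total gain between them.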
In an ensemble of trees, like random forest (Breiman, 2001), the importance scores of two or more redundant features will be spread evenly among them, due to feature sub-sampling and bootstrapping. Using those scores as a basis for feature selection in large datasets could lead to falsely selecting a feature along with many of its redundant copies (Genuer, Poggi, & Tuleau-Malot, 2010).
Other issues with tree-derived feature scores include the bias towards categorical features of high cardinality (Strobl, Boulesteix, Zeileis, & Hothorn, 2007), sensitivity to hyper-parameters (Genuer et al., 2010), and inconsistencies (Lundberg, Erion, & Lee, 2018). The latter refers to cases where the assigned importance of a feature decreases when its true impact on model performance has increased.
For the importance scores of tree models to be used reliably for feature selection, these shortcomings should be overcome. This can be partially achieved through boosting, or sample re-weighting.
Boosting algorithms, like AdaBoost and its derivatives (Freund & Schapire, 1997), rely on a sequential procedure of varying the sample weights used to train weak classifiers, typically decision trees, based on the accuracy of previous boosting rounds. This leads the classifiers in later rounds of the algorithm to focus on samples that were misclassified in earlier rounds. Then, each classifier votes to determine the final outcome of the ensemble.
Through sample re-weighting, the process of boosting also affects the feature importance scores (and rankings) produced by the classifier being boosted. This presumably happens in such a way that in later boosting rounds, some features which initially ranked poorly appear in top-ranked positions because they become effective in classifying samples which were misclassified in earlier rounds. We explain this further in the following section.

Boosting and feature selection
The effect of boosting on feature importance scores has been exploited in the past to design feature selection algorithms (Das, 2001; Tieu & Viola, 2004; Liu et al., 2009). In short, this is done by relying on a weight-sensitive feature ranking algorithm, and a sequential procedure of adding features, often one at a time. At each round, the ranking algorithm is applied to a re-weighted version of the training data, where the re-weighting is based on the classification error produced by the features selected so far. The best feature from each round, according to the feature ranking, is added to the selected subset.
Early examples of directly using boosting for feature selection include the following: Boosted Decision Stump Feature Selection (BDSFS) (Das, 2001), Boosting Image Retrieval (Tieu & Viola, 2004), and Boosted Mutual Information Feature Selection (BMIFS) (Liu et al., 2009). A recent example is Adaptive Boosting for Feature Selection (ABFS) (Barddal et al., 2019). Das (2001) used a tree stump as a base classifier, and followed the sample re-weighting strategy from AdaBoost for the purpose of feature selection. Namely, all training samples are initially given a weight of $1/n$, with n being the number of training samples. Then, at each subsequent iteration, the sample weights are given as a function of the classification error from the previous iteration. All misclassified samples are equally up-weighted according to Eq. (1). The best feature from every iteration, according to Information Gain, is selected.
In Eq. (1), err is the classification error from the previous iteration and $\omega_j^i$ is the sample weight for sample j at the i-th iteration.
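For reference, the textbook AdaBoost update (Freund & Schapire, 1997) that a scheme of this kind follows can be written with the symbols defined above. The form below is the standard one and is given as an assumption about the general shape of the update, not necessarily the exact expression used in BDSFS.

```latex
\alpha = \log\!\left(\frac{1 - \mathrm{err}}{\mathrm{err}}\right),
\qquad
\omega_j^{i} =
\begin{cases}
\omega_j^{i-1}\, e^{\alpha}, & \text{if sample } j \text{ was misclassified},\\[2pt]
\omega_j^{i-1}, & \text{otherwise},
\end{cases}
```

The weights are then typically normalized to sum to one. Every misclassified sample is multiplied by the same factor $(1 - \mathrm{err})/\mathrm{err}$, regardless of how confidently it was misclassified.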
Boosting Image Retrieval follows a similar procedure to BDSFS, adapted for the purpose of image retrieval. The BMIFS method differs in that the error used for updating sample weights is estimated from an information metric, Mutual Information, and not from a base classifier. Barddal et al. (2019) use this form of feature selection to solve the problem of feature drift in data streams.
Our approach follows a framework that is similar to the aforementioned methods, but differs from them in a number of ways. First, we use a different sample re-weighting strategy. The AdaBoost weighting strategy, used in BDSFS (Eq. (1)), re-weights all misclassified samples by the same amount, which disregards how far a sample is from being correctly predicted. To improve on this, we weight each sample in inverse relation to its prediction probability, according to Eq. (2).
In Eq. (2), $Y_c$ is a one-hot encoded matrix indicating the correct class for each sample, $P_c$ is an $n \times n_c$ matrix containing the class probabilities for each sample, obtained from a classifier, and $\omega_j^i$ is the sample weight for sample j at the i-th iteration.
For a given sample, the associated weighting term $\alpha_j$ decreases as the probability of the correct class for that sample approaches 1, and increases as it approaches zero. Therefore, samples which are far from being correctly classified are given higher weights in the next iteration.
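The behaviour described above can be sketched in a few lines. The example assumes that the weighting term is the negative log-probability of the correct class, which has the stated monotonic behaviour; this choice, the function name, and the simple sum-to-one normalization are assumptions for illustration and may differ from Eq. (2), whose normalization across iterations is not reproduced here.

```python
import math


def probability_based_weights(y_true, proba, eps=1e-12):
    """Return one weight per sample from the probability of its correct class.

    y_true : list of integer class labels
    proba  : list of rows, proba[j][c] = predicted probability of class c
             for sample j (the role of P_c; indexing by y_true plays the
             role of the one-hot matrix Y_c)
    """
    alphas = []
    for label, row in zip(y_true, proba):
        p_correct = min(max(row[label], eps), 1.0)  # clip to avoid log(0)
        alphas.append(-math.log(p_correct))         # tends to 0 as p_correct -> 1
    total = sum(alphas)
    n = len(alphas)
    if total == 0.0:
        # Every sample is predicted with certainty: fall back to equal weights.
        return [1.0 / n] * n
    return [a / total for a in alphas]


y_true = [0, 1, 1]
proba = [[0.9, 0.1],   # confidently correct
         [0.4, 0.6],   # barely correct
         [0.8, 0.2]]   # confidently wrong
weights = probability_based_weights(y_true, proba)
```

In contrast to the AdaBoost rule, the second sample receives more weight than the first even though both are classified correctly, and the third sample receives the most.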
We use a gradient boosted trees model, XGBoost, as a base learner for obtaining the feature scores (Chen & Guestrin, 2016). Despite being computationally demanding compared to individual trees, ensembles of trees are more predictive in large datasets, and their feature importance scores reflect more complex interactions. XGBoost is a powerful example of such models, and it outperforms traditional tree-ensemble models in many applications (Luckner, Topolski, & Mazurek, 2017).
Moreover, we introduce a few procedural changes, some of which were inspired by Iterative Input Selection (IIS) (Galelli & Castelletti, 2013), a tree-based forward selection method for regression problems.
For instance, in FeatBoost, we use a two-step process to select the best feature at each iteration. First, the top-ranked features are obtained from the embedded feature scores of a tree model trained on all features. Then, we use a classifier to evaluate the classification performance of the top-ranked features obtained from the embedded scores. This strategy is used in IIS, where evaluations of a regressor determine the best feature at each iteration (Galelli & Castelletti, 2013). We use this approach of model evaluation in FeatBoost by testing m of the top-ranked features at each iteration. However, instead of evaluating each of the candidate features individually as a single-input model, we append each feature to the features selected so far, and evaluate the classification accuracy of the resulting models. Namely, at the i-th iteration, we evaluate m models of order i. With this approach, the algorithm can be viewed as a stepwise greedy search in which the search space in each iteration is reduced from p features to a user-specified number of features, and in which those candidate features change through the process of boosting.
The justification for this two-step process is the following: First, choosing the top feature from the feature ranking may not be reliable in the presence of feature redundancy in large datasets. Second, using model evaluations decouples the selection from the feature ranking algorithm, making it more robust (Galelli & Castelletti, 2013). Moreover, this process solves the issue of inconsistency highlighted by Lundberg et al. (2018).
We make FeatBoost modular by allowing the model used for the evaluation step to be different from the model used for feature ranking. The decoupling of the two procedures, ranking and model evaluation, could lead to improvements in computational efficiency. This could be achieved by choosing the second model to be a computationally efficient one, as opposed to the first model, which produces the feature rankings.
Another element of IIS which we use in FeatBoost is that once a feature is selected, it is not dropped from the list of candidate features of subsequent iterations. This way, a feature may be selected twice, which translates to an automatic stopping condition for the algorithm.
The motivation for using this strategy is that future iterations on re-weighted samples will rank new features in the presence of all other features. This means that if a feature ranks higher as a result of sample re-weighting, it does so in interaction with features that were selected before, and those that might be selected after. The only difference between rankings of different iterations then comes from sample re-weighting, and not from explicitly removing features from the list of candidates once they have been selected.
We make further adjustments to the boosting process to make it more adapted to feature selection. When boosting for the purpose of classification, as in AdaBoost, it is not necessary that each classifier is trained on a highly different sample distribution. In other words, if sample weights do not change significantly after a boosting round, this will not necessarily hinder performance, as the final classification will be determined by a majority vote of all classifiers.
On the other hand, in a feature selection context like the proposed method, a relevant feature is added at each boosting round. Therefore, classification error is expected to decrease with rounds. Consequently, the difference in α (Eqs. (1) and (2)) between consecutive rounds will decrease. If the difference becomes low enough, sample weights will stop changing, and therefore the desired variation in the top-ranked features will stop or diminish, causing the algorithm to terminate prematurely.
We solve this problem with two strategies. First, we base the re-weighting of samples not on the performance of the base classifier of the current iteration, but on the relative performance between the current iteration and the preceding one, hence the normalization step of α in the second line of Eq. (2). Second, we use the following reset scheme: At any given iteration, if an existing feature is selected again, or the selected feature causes no increase in performance, we re-initialize the samples to have equal weights, and repeat the iteration with the new weights. In effect, this reboots the algorithm with a non-empty feature set, which could allow for the selection of additional useful features.

Ensemble tree models and feature selection
The simplest way to use the embedded scores of ensemble tree models for feature selection is by thresholding, or by only retaining features with non-zero scores. An alternative approach is to use a simple forward selection procedure, relying on the feature rankings to introduce one feature at a time to the selected subset if the feature causes a significant gain in performance (Genuer et al., 2010). This approach will suffer from the various inconsistencies and biases of those scores (see Section 1).
More elaborate ways of using these scores have been proposed. One such example is to select features according to their importance scores when compared to artificial features, designed to trick the ranking algorithm (Kursa et al., 2010; Tuv et al., 2009). A popular example of this approach is the Boruta method (Kursa et al., 2010). It works by creating shadow features, which are copies of the original ones whose values are shuffled across samples, then computing the feature importance scores of the original set plus shadow features using a random forest classifier. Original features that score lower than the top-scoring shadow features are deemed irrelevant, and are subsequently removed. The process is repeated until all the remaining original features are relevant. This method, by design, does not solve issues of redundancy, as it selects all relevant features (Kursa et al., 2010).
Tree-based models can also be used in combination with other approaches to improve their feature selection capabilities. Rao et al. (2019) combine gradient boosted trees with artificial bee colony algorithms for feature selection. Peker, Arslan, Şen, Çelebi, and But (2015) combine the scores from random forest with the filter method ReliefF for selecting features extracted from EEG signals.
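The shadow-feature idea can be sketched as follows. This is a simplified, single-round illustration under stated assumptions: the helper names are hypothetical, the importance scores are supplied by the caller rather than computed by a random forest, and the statistical testing over repeated runs that the actual Boruta method performs is left out.

```python
import random


def add_shadow_features(rows, seed=0):
    """Append to each row a shuffled copy of every original column."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    shadows = []
    for column in columns:
        shuffled = column[:]
        rng.shuffle(shuffled)  # breaks any relation with the target
        shadows.append(shuffled)
    return [list(row) + [shadow[i] for shadow in shadows]
            for i, row in enumerate(rows)]


def keep_features_beating_shadows(importances, n_original):
    """Keep original features scoring above the best shadow feature.

    `importances` holds the scores of the original features first,
    followed by the scores of their shadow copies.
    """
    best_shadow = max(importances[n_original:])
    return [j for j in range(n_original) if importances[j] > best_shadow]


rows = [[1, 10], [2, 20], [3, 30]]
extended = add_shadow_features(rows)

# Hypothetical scores: two original features, then their two shadows.
kept = keep_features_beating_shadows([0.5, 0.1, 0.2, 0.15], n_original=2)
```

Note that two redundant copies of a relevant feature would both beat the shadows, which is why this style of selection does not address redundancy.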

Methodology
For a given dataset D with p features, n training samples, and output y with $n_c$ classes, the FeatBoost algorithm proceeds as follows: First, the selected subset of features, X, is initialized to empty, and a user-selected tree-based classifier, $H_1$, is trained on all samples with initial weights equal to $1/n$, to produce a ranking of all features. We choose $H_1$ to be an XGBoost classifier for the remainder of the paper.
Then, a user-specified number, m, of the top-ranking features are evaluated and compared as single-input classifiers in k-fold cross-validation, using either $H_1$, or a different classifier, $H_2$.¹ The best performing feature with $H_2$, according to an appropriate metric (e.g. classification accuracy, F-score, or area under the ROC curve), is added to the selected subset X. In iterations other than the first, classifier $H_2$ is used to evaluate each of the m features appended to the features selected so far. Finally, $H_1$ is trained on all selected features, and its prediction probabilities are used to update the sample weights for the following iteration according to Eq. (2).
If the increase in accuracy of X with respect to the previous iteration is below a user-defined threshold ∊, or if a feature is selected twice, the algorithm is temporarily paused, and a sample weight reset scheme is initiated.
The reset scheme functions as follows: If at iteration i, a feature is selected which already belongs to $X_{i-1}$, or if the feature does not improve classification performance, the algorithm normally terminates, and $X_{i-1}$ is taken as the final subset. Under the reset scheme, this is temporarily overcome by resetting the sample weights to their initial values, and repeating iteration i. This will lead to the feature ranking being equal to that of the first iteration, albeit with an initial X that is not empty. This could lead the model evaluation step (with $H_2$) to select a feature that is partially redundant to one or more features in $X_{i-1}$, but that nonetheless has additional predictive value. If that occurs, the algorithm resumes its normal course from iteration i until a stopping condition is reached again: a feature is selected twice, or the selected feature does not improve accuracy. No further resets are initiated at that point. If the reset scheme does not lead to selecting a new useful feature, the algorithm stops.²
¹ Note that $H_2$ does not need to be tree-based, nor sensitive to sample weights, since it is not used in the feature ranking process.
The asymptotic time complexity of FeatBoost, computed in terms of its parameters, is given in Eq. (3).
Here, $O(H_1)$ and $O(H_2)$ are the asymptotic complexities of the chosen classifiers. For reference, the time complexity of training each tree in XGBoost is $O(n \log n)$ (Chen & Guestrin, 2016). The complexity of FeatBoost, therefore, depends highly on the choices of $H_1$ and $H_2$, and to a lesser extent on the other parameter choices.
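A complexity of this kind can be motivated by counting model fits per iteration. The count below is a back-of-the-envelope sketch, assuming that the algorithm runs for at most $p'$ iterations, that each iteration fits $H_1$ once to rank the features and once to update the sample weights, and that each of the m candidates is evaluated with $H_2$ in k-fold cross-validation; it is not necessarily the exact expression of Eq. (3).

```latex
T_{\mathrm{FeatBoost}}
  \;\approx\; p' \left[\, 2\, O(H_1) + m\, k\, O(H_2) \,\right]
  \;=\; O\!\left( p' \left[\, O(H_1) + m\, k\, O(H_2) \,\right] \right)
```

Under these assumptions a single reset adds at most one extra iteration, so it does not change the asymptotic count.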
The pseudocode of the algorithm, along with the details of the weighting and reset strategies, is given in Algorithm 1.
² A software implementation of the algorithm is available at https://github.com/amjams/FeatBoost.
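The control flow described in this section can be summarized with the following sketch. It is an illustrative reading of the text, not the reference implementation: the ranking model, the evaluation model, and the weight update are passed in as caller-supplied functions (`rank_features`, `evaluate_subset`, and `reweight` are hypothetical names), and the toy functions at the bottom exist only to exercise the loop.

```python
def featboost_sketch(n_samples, rank_features, evaluate_subset, reweight,
                     m=2, max_features=10, epsilon=1e-18):
    """Greedy forward selection driven by re-weighted feature rankings.

    rank_features(weights)      -> feature indices, best first (role of H1)
    evaluate_subset(features)   -> cross-validated score (role of H2)
    reweight(features, weights) -> new sample weights (role of Eq. (2))
    """
    uniform = [1.0 / n_samples] * n_samples
    weights = list(uniform)
    selected, best_score, reset_used = [], float("-inf"), False

    while len(selected) < max_features:
        candidates = rank_features(weights)[:m]
        # Score every candidate appended to the features selected so far;
        # a candidate that is already selected leaves the subset unchanged.
        scored = [(evaluate_subset(selected if f in selected else selected + [f]), f)
                  for f in candidates]
        score, best = max(scored, key=lambda pair: pair[0])

        if best in selected or score - best_score < epsilon:
            if reset_used:
                break  # second stopping condition: terminate
            # First stopping condition: reset the weights, repeat the iteration.
            weights, reset_used = list(uniform), True
            continue

        selected.append(best)
        best_score = score
        weights = reweight(selected, weights)
    return selected


# Toy stand-ins, for illustration only.
GAIN = {0: 0.5, 1: 0.05, 2: 0.3, 3: 0.1, 4: 0.0}


def toy_rank(weights):
    is_uniform = max(weights) - min(weights) < 1e-12
    return [0, 1, 2, 3, 4] if is_uniform else [2, 0, 1, 3, 4]


def toy_score(features):
    return sum(GAIN[f] for f in features)


def toy_reweight(features, weights):
    raised = [w * (2.0 if j == 0 else 1.0) for j, w in enumerate(weights)]
    total = sum(raised)
    return [w / total for w in raised]


selected_features = featboost_sketch(4, toy_rank, toy_score, toy_reweight)
```

In the toy run, features 0 and 2 are selected first; the third iteration re-selects an existing feature, which triggers the reset, and the ranking under equal weights then exposes feature 1 as a useful addition before the loop stops.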

Experimental settings and evaluation
In this section, we describe the data and experimental settings. Then, we briefly describe Boruta and ReliefF, the methods we compared FeatBoost with.
We applied each algorithm to 8 real datasets, and one artificial dataset, Madelon, which was designed to benchmark feature selection algorithms (Guyon et al., 2006).Table 1 contains a description of the datasets, and their dimensions.
On each dataset, we apply an l-by-k-fold cross-validated selection procedure, with l = 3, and k = 10: We split each dataset into ten equally sized folds, and apply each feature selection algorithm to the training folds separately.Then, we use the selected features to train a classifier on the training folds, and apply it to the held-out test folds, on which the classification performance is evaluated.The performance is measured in terms of the average classification accuracy of the selected subsets on the test data, and the computation time of the feature selection algorithm.We repeat the entire procedure l = 3 times with random shuffles of the sample set, for a total of 30 runs of each algorithm.A similar validation procedure is used by Song, Ni, and Wang (2013).
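The splitting scheme can be sketched as a generator of train and test index lists. The function name, the use of a seeded shuffle, and the round-robin assignment of samples to folds are assumptions for this example; the essential point is that feature selection is run on the training indices only, and the held-out fold is used solely for evaluating the selected subset.

```python
import random


def repeated_kfold_indices(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled samples into k folds of (nearly) equal size.
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds)
                         if f != held_out for i in fold]
            yield train_idx, test_idx


splits = list(repeated_kfold_indices(n_samples=50))
```

With k = 10 and three repeats this produces 30 train/test pairs, matching the 30 runs per algorithm.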
In cases where an algorithm selects more than 100 features, we only evaluate the accuracies of the first 100. Moreover, since each algorithm produces multiple subsets, we exclude those that could skew the average performance at the validation stage. Therefore, for each algorithm, we evaluate only the resulting subsets with a number of features equal to or larger than the mode of all subset sizes for that algorithm. Subsets which are larger than the mode are truncated to have a number of features equal to the mode.
We used a Nearest Neighbor classifier to validate the selected subsets. A Nearest Neighbor classifier is a sensible choice for validating feature subsets, as its performance is more likely to suffer from the inclusion of irrelevant, redundant, or noisy features, when compared to more complex classifiers (Loughrey & Cunningham, 2005). Validation results with a Gaussian Naive Bayes classifier and an XGBoost classifier with default parameters are given in Appendix A.
In addition to subset accuracy, we evaluate the performance in terms of the computation time of the algorithms, and the redundancy rate (RED) (Yamada, Jitkrittum, Sigal, Xing, & Sugiyama, 2014; Zhao, Wang, & Liu, 2010). In its definition, $\rho(f_i, f_j)$ is the Pearson correlation coefficient between features $f_i$ and $f_j$. A large value of RED(X) means that subset X contains high redundancy. Thus, lower values of RED(X) are desired in feature selection.
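A redundancy rate of this kind can be computed as sketched below. The example assumes the commonly cited form, in which the pairwise correlations over all distinct feature pairs are summed and divided by $m(m-1)$ for a subset of m features, and it takes the absolute value of each correlation; whether the absolute value is used, and the exact normalization, are assumptions here and should be checked against the cited definitions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)  # undefined for constant columns


def redundancy_rate(columns):
    """Sum of |rho| over distinct feature pairs, divided by m(m - 1)."""
    m = len(columns)
    total = sum(abs(pearson(columns[i], columns[j]))
                for i, j in combinations(range(m), 2))
    return total / (m * (m - 1))


f1 = [1, 2, 3, 4]
f2 = [2, 4, 6, 8]      # exact multiple of f1, fully redundant
f3 = [1, -1, -1, 1]    # uncorrelated with f1 and f2
red = redundancy_rate([f1, f2, f3])
```

Here one of the three pairs is perfectly correlated and the other two are uncorrelated, so the sum of absolute correlations is 1 and the rate is 1/6.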
The average computation time of each method is given in Table 2. The average redundancy rates for up to the top 10 and top 100 features selected from each method are given in Tables 3 and 4. We examine the top 10 as well as the complete top 100 features because in most cases, FeatBoost and Boruta select smaller subsets than the other methods, which are ranking algorithms that always select up to the user-defined 100 features. Comparing the redundancy rate of up to the top 10 features makes the subsets across all compared methods similar in size, and thus gives a better assessment of redundancy. Moreover, smaller subsets could better reflect the redundancy-limiting ability of the feature selection method when compared to larger sets, since the redundancy rate of the latter could be a reflection of the inherent level of redundancy in the complete dataset.
† The datasets are ordered by their domain and number of features. They were obtained from the ASU feature selection repository.

Table 4
The mean and standard deviation of the redundancy rates of up to the top 100 selected features of each algorithm.

XGBoost
Since we build FeatBoost around a specific feature importance score, one derived from an XGBoost classifier, a suitable benchmark to compare against is the same base score with a simpler threshold. For that purpose, we define the first method in the comparison to be the feature importance scores from the same classifier used for ranking in FeatBoost, with the threshold being the mean of all feature scores. That is, features with an importance score lower than the mean of all feature scores are discarded.
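The mean-value threshold amounts to a one-line filter, sketched below with a hypothetical function name and made-up scores. The same rule applies to any score-based ranking, not only to tree-based importances.

```python
def select_above_mean(importances):
    """Return indices of features scoring at least the mean, best first."""
    mean_score = sum(importances) / len(importances)
    kept = [j for j, score in enumerate(importances) if score >= mean_score]
    return sorted(kept, key=lambda j: importances[j], reverse=True)


# Hypothetical importance scores for four features (mean = 0.25).
chosen = select_above_mean([0.1, 0.5, 0.3, 0.1])
```

Unlike a fixed cut-off, the number of retained features depends on how skewed the score distribution is, which is why this baseline can select large subsets.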

Boruta
The second method we compare against is Boruta. In the comparison, we use XGBoost instead of random forest as the base ranking algorithm for Boruta. That way, we are able to achieve a fairer comparison, and determine which of the approaches makes better use of the same underlying feature scores. Moreover, since Boruta does not rank the elements of the selected subset by default, we post-rank them with an additional fitting of the classifier. This allows us to compare the performance of all methods iteratively with each added feature.

ReliefF
Finally, we compare FeatBoost to the ReliefF algorithm (Kira & Rendell, 1992). ReliefF is a powerful filter-based approach which belongs to the Relief family of feature selection methods (Urbanowicz, Meeker, La Cava, Olson, & Moore, 2018; Kira & Rendell, 1992). We use it in the comparison to represent a baseline of filter-based approaches.
Since Relief-based methods are feature ranking algorithms, a suitable threshold is needed in order to use them for feature subset selection. As with XGBoost, we used the mean value of the scores as a lower threshold.
We configured the algorithms as follows:
1. XGBoost: We configured XGBoost with 100 trees, and a maximum tree depth of 20. We set the maximum depth parameter to a high value in order to detect higher order feature interactions that might be present in some datasets (Johnson, 2009). We set the remaining parameters of the classifier to their default values.
2. FeatBoost: We used FeatBoost in two configurations. In the first, we set $H_1$ and $H_2$ to be the same XGBoost classifier used individually. In the second, we set $H_2$ as a Nearest Neighbor classifier. We set the remaining parameters as follows: k = 3, m = 50, $p' = 100$, and ∊ = $10^{-18}$.
3. Boruta: We used Boruta with the same classifier used in XGBoost, and default settings otherwise. We implemented the algorithm with the BorutaPy Python package, which we modified to be compatible with XGBoost.
4. ReliefF: We used ReliefF with a number of neighbors equal to 10 and implemented the algorithm with the Skrebate Python package.

Results and discussion
The results of the feature selection comparison are summarized in Fig. 1, which shows the average classification accuracies of the subsets selected by the compared algorithms.The average computation times of each algorithm on all datasets are shown in Table 2.
In all datasets, FeatBoost selects better performing features in the leading ranks than the feature importance score on which it is based, XGBoost.This shows that sample re-weighting and model evaluation improve the performance of the base ranking.Moreover, the automatic stopping conditions in FeatBoost lead to selecting significantly smaller subsets than the mean-value threshold that we used for XGBoost.
In most datasets, FeatBoost outperforms Boruta and ReliefF as well, reaching higher accuracies with fewer features.This is most apparent in Isolet, PCMAC, COIL20, Orl, and WarpPIE10P.In those datasets, the performance of subsets selected by FeatBoost converges significantly faster than the second best method, indicating that the relevant features are found with smaller subsets.
In terms of computation time, XGBoost is the most efficient approach across all datasets, followed by ReliefF.This is expected since the former ranks features with a single fitting of the data, while the latter is a filter method that performs no model fitting.As for FeatBoost and Boruta, we observe that the former, when configured with an XGBoost wrapped classifier, is slightly faster than Boruta, which uses the same classifier to provide its base feature ranking.On the other hand, when FeatBoost uses a NN classifier, it becomes significantly faster than Boruta.
It is worth noting that this increase in efficiency in FeatBoost, obtained by using NN instead of XGBoost as the evaluation classifier, does not sacrifice the performance of the algorithm. In fact, this configuration performs better when the validation classifier is also NN (Fig. 1). It also performs well when XGBoost is the validation classifier (Fig. A.2). This demonstrates that the modular architecture of the algorithm can take advantage of a powerful ranking algorithm, that of XGBoost, while using a simpler and more efficient evaluation classifier.
Table 3, which shows the redundancy rates of up to the top 10 selected features, indicates that FeatBoost in either of its configurations selects subsets with lower redundancy than the other compared methods. When the top 100 features in ReliefF and XGBoost are examined (Table 4), FeatBoost with an NN classifier still retains its advantage, performing best in 4 out of 9 datasets. In both cases, Boruta performs best on a single dataset: WarpPIE10P.
The algorithm's ability to reduce redundancy, at least in the top ranked features, is very promising, since this resulted only from the sample re-weighting and model evaluation procedures, and not from explicit minimization of redundancy, as is the case in other feature selection approaches (Peng, Long, & Ding, 2005;Zhao et al., 2010;Yamada et al., 2014).

Conclusion
We showed that the proposed FeatBoost algorithm, which is based on boosting, and a stage-wise greedy procedure, is able to use the feature importance scores derived from an ensemble of decision-trees to select high performing feature subsets for classification problems.The resulting subsets outperform those obtained by simple thresholding of the baseline scores.They also outperform in most cases the Boruta algorithm, which we configured to use the same feature scores as a basis, and improved with a post-ranking of the selected subsets.
The computational cost of the algorithm is sensitive to several design choices, most importantly, the ranking algorithm, the parameter m, and the wrapped classifier.We have shown that changing the latter from XGBoost to the much simpler Nearest Neighbor improved the speed of the algorithm without significantly affecting the performance of the selected features.
The proposed boosting framework, and the two-step selection procedure, circumvent the weaknesses in the importance scores of tree-based models externally, without changing how the scores are computed. The algorithm is therefore independent of the particular scores being used, or the wrapped classifier. It would be useful to investigate the use of other classifiers, and to test if other ranking algorithms, like filter-based Relief methods, would react similarly within the same algorithm.

Fig. 1. Accuracies of the subsets selected by each feature selection algorithm. The validation classifier used to compute the accuracy is Nearest Neighbor.

Fig. A.2. Accuracies of the subsets selected by each feature selection algorithm. The validation classifier used to compute the accuracy is XGBoost.

Fig. A.3. Accuracies of the subsets selected by each feature selection algorithm. The validation classifier used to compute the accuracy is Gaussian Naive Bayes.

Table 1
Descriptions of datasets † used in the comparison.

Table 3
The mean and standard deviation of the redundancy rate of up to the top 10 selected features of each algorithm.

Table 2
The mean and standard deviation of computation time in minutes of each feature selection algorithm.