Fast Genetic Algorithm for feature selection -- A qualitative approximation approach

Evolutionary Algorithms (EAs) are often challenging to apply in real-world settings since evolutionary computations involve a large number of evaluations of a typically expensive fitness function. For example, an evaluation could involve training a new machine learning model. An approximation (also known as meta-model or a surrogate) of the true function can be used in such applications to alleviate the computation cost. In this paper, we propose a two-stage surrogate-assisted evolutionary approach to address the computational issues arising from using Genetic Algorithm (GA) for feature selection in a wrapper setting for large datasets. We define 'Approximation Usefulness' to capture the necessary conditions to ensure correctness of the EA computations when an approximation is used. Based on this definition, we propose a procedure to construct a lightweight qualitative meta-model by the active selection of data instances. We then use the meta-model to carry out the feature selection task. We apply this procedure to the GA-based algorithm CHC (Cross generational elitist selection, Heterogeneous recombination and Cataclysmic mutation) to create a Qualitative approXimation variant, CHCQX. We show that CHCQX converges faster to feature subset solutions of significantly higher accuracy (as compared to CHC), particularly for large datasets with over 100K instances. We also demonstrate the applicability of the thinking behind our approach more broadly to Swarm Intelligence (SI), another branch of the Evolutionary Computation (EC) paradigm, with results of PSOQX, a qualitative approximation adaptation of the Particle Swarm Optimization (PSO) method. A GitHub repository with the complete implementation is available.


Introduction
Feature Selection (FS) and Instance Selection (IS) are two well-known data mining techniques used to identify subsets of the most informative features and instances for a given learning task. Feature selection and instance selection primarily aim to achieve two goals: (a) reduce computational complexity by using fewer features and instances for model training; (b) improve generalization performance and model accuracy by reducing overfitting. In practice, these tasks are often performed in a greedy manner, since finding the best solution is often intractable, and even meta-heuristic optimizations are prohibitively slow. In this paper, we propose a method that uses active sampling to create an approximate, fast meta-model.
The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
The complete implementation is available at https://github.com/Ghaith81/Fast-Genetic-Algorithm-For-Feature-Selection.

Genetic Algorithm (GA), pioneered by Holland et al. (1992), is a bio-inspired method widely used to solve complex optimization problems. GA has been shown to outperform classical non-evolutionary methods like Sequential Floating Search (Kudo & Sklansky, 2000) and Greedy-like Search (Vafaie, Imam, et al., 1994) on large-scale feature selection tasks. Moreover, better instance reduction rates and higher classification accuracy were obtained under experimental conditions using an instance selection GA in comparison with several experiments using non-evolutionary methods (Cano, Herrera, & Lozano, 2003).
High computational cost is, however, a major drawback of using GA for feature selection. Typically used as a wrapper method, the process of GA involves a large number of evaluations that are computationally heavy, particularly on data sets containing a large number of instances. As a result, research employing GA for feature selection has been mostly limited to data sets with a small number of instances, i.e., less than 1000 (Xue, Zhang, Browne, & Yao, 2015).

https://doi.org/10.1016/j.eswa.2022.118528 Received 19 November 2021; Received in revised form 10 August 2022; Accepted 10 August 2022

Fig. 1. A qualitatively useful approximation for a combinatorial optimization is shown in (a). The approximation correctly identifies the maximum of the original function, even though the approximation error is large. On the other hand, the approximation in (b) offers a better quantitative approximation (values closer to the original fitness), but it leads to a false optimum.
In this work, we use a lightweight approximation of the computationally expensive GA fitness function to guide the feature selection task. We propose a novel active sampling method to construct high-quality approximations. It is based on a concept introduced first by Jin, Hüsken, Sendhoff, et al. (2003), who used ''qualitative'' to define a useful approximation. Such a meta-model ranks different individuals similarly to the original fitness function, but does not necessarily reproduce the exact values. An example of a useful qualitative approximation constructed using our method is shown in Fig. 1(a). The values of the approximate meta-model are consistently and significantly lower than those of the original function, yet it preserves the qualitative properties of the original function in terms of the relative fitness of different solutions. Therefore, the meta-model can be useful to lead the evolutionary optimization irrespective of quantitative measures such as mean error. We define ''Approximation Usefulness'' and use the expected value of the Spearman rank correlation (ρ) (Spearman, 1910, 1961) between the original function and meta-model evaluations as a quality measure of the meta-model.
An approximate classifier is trained with a subset of samples selected by our novel informed selection method. The accuracy of this surrogate model is used in the evolutionary computation of feature selection on data sets which are orders of magnitude larger than previous work reported in the literature. In our experiments, we tested the proposed method using data sets with hundreds of thousands of instances and thousands of features. We have shown empirically that our algorithm CHCQX scales better than its classical wrapper counterpart, CHC, for larger datasets. Our algorithm is shown to converge faster to better feature subset solutions for datasets larger than 10K instances and was always significantly superior for datasets larger than 100K. We have also demonstrated the applicability of this approach to another class of EC by observing similar results with an algorithm based on PSO, namely PSOQX.

Related work
The survey undertaken by Jin (2005) established two major concerns of approximating fitness evaluation in evolutionary computation. First, the approximation must ensure that the evolutionary algorithm converges to a global optimum, or near-optimum, of the original fitness function. Second, the computational cost should be reduced. A subsequent survey (Jin, 2011) highlighted the limited success achieved in applications of meta-model-based evolutionary optimization, despite the considerable growth of interest in using such meta-models within the research field.
Hybrid methods combine a filter with a wrapper in a two-stage approach. In a hybrid method, a filter is applied first to the features with the goal of reducing the search space. Only the top-ranked features are used by the meta-heuristic in the second stage. This approach of filtering low-ranked features was used by Oreski and Oreski (2014), Rani, Kumar, Jain, and Chawla (2021), Song, Zhang, Gong, and Gao (2021), Sun, Jin, Xu, and Cichocki (2021) and Tan, Fu, Zhang, and Bourgeois (2008). Such approaches however suffer from two major drawbacks. First, the reduction in search space is only applicable to features; therefore, a data set with a large number of instances would not benefit much from this filter technique. Secondly, low-ranked features might turn out to be important when combined with other features, so a filter-based approach misses potential feature interactions. Some hybrid approaches overcome this by employing local search along with the binary optimization algorithms (Chattopadhyay, Kundu, Singh, Mirjalili, & Sarkar, 2022; Ghosh, Malakar, Bhowmik, Sarkar, & Nasipuri, 2017; Kabir, Shahjahan, & Murase, 2011). It remains challenging, however, to determine how much such approaches compromise accuracy to reduce computation time without a comparison against a wrapper, as such methods are often only compared with baseline models and other hybrid and filter methods.
The final approach to feature selection is the wrapper, and our method belongs to this category. Wrapper approaches rely on the ML model to explicitly evaluate the fitness of feature subsets. The first work to address the computational cost of GA for the feature selection task using an approximate model in a wrapper setting was Brill, Brown, and Martin (1992). They proposed two key ideas: first, a simple k-Nearest Neighbours (kNN) model is used to approximate fitness evaluations of a Neural Network. Secondly, they proposed a method named ''training set sampling'', in which only a portion of the training set is used to train a model for GA evaluations. However, their method was only validated using one small data set of 30 features and 150 instances. Also, re-sampling on each generation using this training set sampling method forced more evaluations, and consequently incurred a high computational cost. A recent work by Altarabichi, Nowaczyk, Pashami and Mashhadi (2021) combined the ''training set sampling'' concept with ideas from progressive sampling to create a multi-level surrogate-assisted algorithm that outperformed a wrapper GA by both converging faster and leading to feature subset solutions of higher accuracy.
Progressive Sampling was used by Le, Van Tran, Nguyen, and Nguyen (2015) to identify the Optimal Sample Size (OSS), defined as the smallest sample size that offers the minimum achievable error for a given learning algorithm. A model trained with the OSS was used in fitness function evaluations of GA. Additionally, Le et al. proposed parallelization of the fitness function computation to reduce the runtime of GA. The algorithm was used to perform feature selection for the Named Entity Recognition (NER) task.
A coevolutionary approach to performing feature and instance selection simultaneously for data set reduction showed promising results over a wide range of data sets (Derrac, García, & Herrera, 2009). However, the largest data set used, in terms of number of instances, had only 1728 instances, whereas the largest in terms of features had only 60 features.
A GA feature selection method based on the MapReduce paradigm was proposed by Peralta et al. (2015). The algorithm decomposes the original data set into blocks of instances to learn from in the map phase; the obtained partial results are then merged into a final vector of feature weights in the reduce phase. The selected features are identified by applying a threshold to the feature weight vector. Although the work convincingly showed the usefulness of the MapReduce paradigm in reducing the computational cost, the performance of the method was only compared against baseline models trained with all features for two large data sets.

Problem formulation
In this section, we formally define the feature selection problem. We also offer a fundamental overview of using GA as a feature selection method in a wrapper setting, and of its computational challenges. In the second part of this section, we discuss approximating the fitness function using a quantitative approximation and demonstrate the major obstacles of constructing a lightweight approximation following a quantitative approach.

Feature selection using genetic algorithm
We start by defining the feature selection problem for a machine learning task. We are given a data set D of labelled pairs with dimensions (n × m), where n represents the number of instances and m is the number of features. An instance x_i can be expressed as an m-dimensional real-valued vector x_i ∈ R^m. The goal of the feature selection task is to select a new subspace R^k from R^m (where k ≤ m), while maintaining a comparable (or even better) performance to the one obtained with the original feature space R^m. An instance x_i′ after feature selection can be expressed as a k-dimensional real vector x_i′ ∈ R^k. Typically, a GA used for feature selection is initiated with a random population of individuals encoding feature subsets as chromosomes of binary strings. An individual g can be expressed as g ∈ {0, 1}^m, where 1 indicates the selection of the feature of the corresponding index, while 0 indicates exclusion. As a wrapper method, GA evaluates an individual's fitness by constructing a classification (or regression) model using the feature subset represented by this individual's chromosome.
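As a minimal illustration of the encoding (the variable names are ours, not the paper's), a chromosome is simply a binary mask over the feature columns:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))              # n = 6 instances, m = 5 features
chromosome = np.array([1, 0, 1, 1, 0])   # individual g in {0, 1}^m

# Decoding: keep only the columns whose bit is set to 1
X_selected = X[:, chromosome.astype(bool)]
print(X_selected.shape)                  # (6, 3)
```

The wrapper then trains a classifier on `X_selected` and uses its accuracy as the fitness of `chromosome`.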
The algorithm proceeds from one generation to the next by applying crossover and mutation operators to the selected individuals to produce offspring. The process resembles natural selection in that it chooses individuals with the highest fitness as the most likely to propagate to the following generation. In this work, we have used the CHC Genetic Algorithm (Whitley & Sutton, 2012) to lead our search process, as it has been shown to perform well with small populations (Eshelman, 1991) whilst being more computationally efficient. Whenever a GA is mentioned in the rest of this paper, it is always CHC-based.
Finding the optimal feature subset of high-dimensional data sets, even for a small population, requires running a large number of fitness function evaluations (for the experiments in Section 5.2 it varied from 1002 to 2216), where each evaluation is computationally expensive. The computational cost of GA is linearly dependent on the time complexity of the model used to evaluate the fitness of different feature subsets (Altarabichi, Nowaczyk et al., 2021). The time complexity of many classification algorithms as a function of the number of samples n is O(n^a), where a ≥ 1. For example, the training complexity of kNN is O(n^2) (Brill et al., 1992), while nonlinear SVM is between O(n^2) and O(n^3) (Bottou & Lin, 2007). Consequently, the complexity order of a wrapper feature selection GA using kNN is O(t · n^2), and is between O(t · n^2) and O(t · n^3) for SVM, where t is the total number of fitness function evaluations.
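To make this cost concrete, the sketch below (synthetic data and illustrative sizes, not the paper's benchmark) times a single wrapper evaluation of one feature subset with a kNN model; the per-evaluation cost grows super-linearly with the number of instances n, and a full GA run multiplies it by t such evaluations:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
mask = np.zeros(30, dtype=bool)
mask[:10] = True                          # one candidate feature subset

for n in (2000, 8000, 20000):
    start = time.perf_counter()
    clf = KNeighborsClassifier().fit(X[:n][:, mask], y[:n])
    acc = clf.score(X[:n][:, mask], y[:n])   # scoring dominates for kNN
    print(n, round(time.perf_counter() - start, 3))
```

Repeating this for every individual in every generation is what makes the wrapper GA impractical on large data sets.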
As the feature selection GA is quadratic or even cubic with respect to the number of training instances, several methods utilize models trained with smaller samples for fitness evaluations. We group these methods under three main categories: training set sampling (Altarabichi, Nowaczyk et al., 2021; Brill et al., 1992; Le et al., 2015), dividing the dataset into small pieces (Peralta et al., 2015), and instance selection (Derrac et al., 2009). Our method belongs to the last category, as we speed up the computation by reducing the number of training instances. Several other methods suggested reconstructing the training set through instance selection. Liu, Wang, Wang, Lv, and Konan (2017) identified useless instances that are unlikely to be support vectors to make the training time of SVM manageable. Saha, Sarker, Al Saud, Shatabda, and Newton (2022) selected instances in an unsupervised fashion from the centre and borders of the clusters found using the K-Means algorithm. Shaw et al. (2021) proposed an instance selection algorithm to select instances in class imbalance settings.
Our approach can be distinguished from all the aforementioned methods, which rely on quantitative measures (e.g., accuracy of the model trained with the selected instances) to evaluate the goodness of the resulting training set. Instead, we select instances that lead to the correct selection of solutions during feature selection.

The drawbacks of a quantitative approximation approach
In this section we highlight the major drawbacks of constructing an approximation of the original function following a quantitative approach. A quantitative approximation offers a small approximation error when compared to the original function. In the context of approximating the fitness function of the feature selection task, a natural choice would be a classifier C_OSS trained with the optimal sample size OSS, as C_OSS offers a close quantitative approximation of C_D trained with the complete data set D of n instances.
The first possible drawback of following a quantitative approach is evident in Fig. 1(b), in which a meta-model with a small approximation error fails to guide evolutionary computations properly and leads to a false optimum. A false optimum is defined as a point in the optimization surface that corresponds to an optimum of the approximate function, but not of the original fitness function (Jin, 2005). Convergence to such false optima in the approximate model is a major problem in surrogate-assisted evolutionary optimization (Jin, Olhofer, & Sendhoff, 2000). The severity of the problem depends, of course, on how close the real (original fitness) optimum is to the false optimum; however, this is impossible to know at the start, and even for an approximation with a small degree of error the distance can be arbitrarily large. Additionally, there are no guarantees that the OSS will correspond to a small number of instances. The performance of most ML models, especially Deep Learning (DL) but also many simpler ML models, does not converge quickly, and continues to improve significantly with more labelled data. Progressive sampling (PS) could be used to establish the relation between sample size and model accuracy using batches that incrementally grow in size. The relation can be depicted with a learning curve showing model accuracy as a function of sample size. A well-behaved learning curve usually follows an inverse power law function (Yelle, 1979). Fig. 2 shows the learning curve of the covtype data set from the UCI ML repository. The sample batch in the figure increases in size following a Geometric Schedule (Provost, Jensen, & Oates, 1999) given by the equation n_i = a^i · n_0. We may observe from Fig. 2 that the convergence point is not reached, for a simple Decision Tree classifier, even when training on more than 250 000 instances. Therefore, our goal is to identify a much smaller sample size (compared to the OSS) to train a meta-model used for the feature selection task.
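The geometric schedule n_i = a^i · n_0 can be sketched as follows (synthetic data standing in for covtype; n_0 and a are illustrative choices, not the paper's settings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=12000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

n0, a = 128, 2                       # initial batch size and geometric factor
curve = []                           # learning curve: (sample size, accuracy)
n = n0
while n <= len(X_tr):
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
    curve.append((n, clf.score(X_te, y_te)))
    n *= a                           # n_i = a^i * n_0
print(curve)
```

Plotting `curve` gives the learning curve of Fig. 2; the OSS is the smallest n at which the curve flattens.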

Method
Our method section is divided into two subsections. In the first one, we formulate the definition of ''Approximation Usefulness'' and introduce the concept of measuring the value of a qualitative approximation. In the subsequent subsection, we present our method of actively sampling instances with the purpose of creating a lightweight meta-model that, to a high degree, satisfies the ''Approximation Usefulness'' definition.

Approximation usefulness
In our work, we construct a classifier C_s trained with a particular subset of samples s, where s ⊂ D and |s| ≪ n, to carry out the FS task. Our approach exploits the idea highlighted by Jin et al. (2003) that, from an evolutionary computation perspective, the quantitative quality of the approximation is irrelevant; only the correct selection must be ensured. In our FS problem, the approximate classifier C_s must rank different feature solutions similarly to C_D to be considered a useful approximation. However, its overall accuracy can be much lower. We therefore define ''Approximation Usefulness'' to indicate the necessary conditions that must be satisfied to ensure correctness of the GA computations when an approximation is used.

Definition 1. A meta-model is sufficient to lead the evolutionary computations to the correct maximum of the fitness function if the following conditions are satisfied:

f_D(g_1) > f_D(g_2) ⟹ f_s(g_1) > f_s(g_2)   (1)

f_D(g_1) = f_D(g_2) ⟹ f_s(g_1) = f_s(g_2)   (2)

where g_1 and g_2 are two possible solutions (individuals), f_D(g) is the fitness value of individual g using the original fitness function (a classifier trained with the complete data set D of n instances), and f_s(g) is the fitness value of individual g using the approximated fitness function (a classifier trained with the subset s of instances in D).
It is important to highlight that Eqs. (1) and (2) are sufficient conditions to realize a useful approximation, irrespective of how good the approximate classifier C_s is at performing the learning task. In other words, C_s could have a low accuracy in comparison to C_D, but it can still be considered a useful approximation of C_D in the feature selection task, provided both classifiers rank different feature subsets similarly. The Spearman rank correlation was used in Jin et al. (2003) as a metric to measure the quality of meta-models. We will similarly use rank correlation as a quantitative measure of how well the qualitative approximation satisfies the Approximation Usefulness conditions.
To visualize the qualitative aspect of the problem, we plot in Fig. 3 what we define as the Approximation Usefulness Curve. Contrary to the Learning Curve we have observed in Fig. 2, the y-axis of the Approximation Usefulness plot represents the expected value of the rank correlation between evaluations done using the original function (trained with all data) and the same evaluations done using approximations (trained with progressively larger sample sizes). Intuitively, we would expect the learning algorithm to exploit more complex interactions between features as more data become available.
A meta-model trained with a very small sample of instances could rank different feature subset solutions differently than the same learning algorithm trained with all the data. As we allow the meta-model to progressively access larger samples, we would expect the meta-model to have better agreement with the original function on the rank of feature subsets.
The goal of the Approximation Usefulness Curve is, therefore, to capture the relation between training the meta-model with progressively larger samples and the agreement with the original function evaluations. We estimate the expected value of the rank correlation by randomly generating q feature subsets and evaluating these subsets using the original function, as well as meta-models trained with progressively larger samples.
The vector u = (u_1, u_2, …, u_q) represents the q randomly generated feature subsets, where u_i ∈ {0, 1}^m. Consequently, for any sample size k, we create two q-dimensional vectors o = (o_1, o_2, …, o_q) and a_k = (a_1, a_2, …, a_q), where o_i is the fitness value of u_i calculated using the original fitness function, and a_i is the fitness value of u_i calculated using an approximation trained with a sample of size k randomly drawn from D. Accordingly, we depict approximation usefulness as a function of sample size using progressive sampling: we calculate the rank correlation of the two vectors ρ(o, a_k) for values of k that follow a geometric sampling schedule.
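This computation can be sketched as follows (a minimal version on synthetic data; the helper `fitness` and the sample sizes are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)

q = 15                                    # number of random feature subsets
U = rng.random((q, X.shape[1])) < 0.5     # u_i in {0, 1}^m as boolean masks
U[U.sum(axis=1) == 0, 0] = True           # guard against empty subsets

def fitness(u, rows):
    """Validation accuracy of a tree trained on the chosen rows/features."""
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[rows][:, u], y_tr[rows])
    return clf.score(X_val[:, u], y_val)

all_rows = np.arange(len(X_tr))
o = [fitness(u, all_rows) for u in U]     # original function (all data)

for k in (64, 256, 1024):                 # geometric schedule of sample sizes
    rows = rng.choice(len(X_tr), size=k, replace=False)
    a_k = [fitness(u, rows) for u in U]
    rho, _ = spearmanr(o, a_k)            # one point of the usefulness curve
    print(k, round(rho, 2))
```

Each printed pair (k, ρ) is one point of the Approximation Usefulness Curve of Fig. 3.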
We may observe from Fig. 3 that an approximation trained with a random sample of size 32 768 (less than 10% of the available training instances) of the covtype data set already exceeds 0.95 correlation.
Such a meta-model will mostly make correct selections, despite the large difference in accuracy between the original and approximate models observed in Fig. 2; a model trained with all 348 861 training examples achieved 93.00% accuracy on the test set, in comparison to 80.60% for a model trained with 32 768 samples.

The CHCQX algorithm
Our approach is based on the hypothesis that well-informed instance selection can produce high-quality meta-models using smaller sample sizes, and that it is enough to observe a few evaluations of the original function to identify such key instances. Active selection of samples has been shown to improve model quality significantly in different research areas (Jin, 2005). We propose CHCQX, CHC with Qualitative approXimation, a two-stage surrogate-assisted evolutionary algorithm for feature selection. In our algorithm (the full pseudocode is in Algorithm 3), we break the optimization problem of feature selection into two parts. In the first part, we use active selection of samples to construct a high-quality lightweight meta-model. Once a good meta-model is constructed, we use it to solve the second optimization problem, i.e., finding the best possible feature subset.

CHCQX active sampling phase
Formally, this optimization problem aims to find a small subset of instances s from the data set D (where |s| ≪ n) which maximizes the expected correlation between C_s and C_D. A meta-model trained with s is expected to rank feature subsets similarly to the original function. The data set D′ after instance selection can be expressed as a matrix of dimensions (|s| × m). CHCQX uses an instance selection GA to solve this optimization problem. The instance selection GA is initialized with a population of individuals r ∈ {0, 1}^n, where 1 indicates the selection of the instance of the corresponding index.
To measure the quality of a candidate meta-model during the instance selection phase, we first randomly generate q feature subsets. The randomly generated solutions serve as snapshots of the optimization surface of the original function. Intuitively, CHCQX tries to construct a meta-model with a fitness landscape that is both aligned and highly correlated with the optimization surface of the original function, as in Fig. 1(a).
As we did in Section 4.1, we denote u = (u_1, u_2, …, u_q) as the vector of randomly generated feature subsets, where u_i ∈ {0, 1}^m, i ∈ {1, 2, …, q}. We calculate the q-dimensional vector o = (o_1, o_2, …, o_q), where o_i is the fitness value of the corresponding feature subset calculated using the original fitness function. Accordingly, the CHCQX instance selection tries to find the instance subset that maximizes the expected rank correlation, while minimizing the number of selected instances, according to the following fitness function:

f(g) = (1 − ρ(o, a_g)) + |s_g| / n,  with  ρ(o, a_g) = 1 − (6 Σ_i d_i²) / (q(q² − 1))   (3)

where a_g = (a_1, a_2, …, a_q) is a q-dimensional vector of fitness values of the corresponding randomly generated feature subsets calculated using a classifier trained with the instances identified by individual g, s_g is the set of instances selected by g, and d_i is the difference in rank of u_i (the ith feature subset) between a_g and o. Values of ρ(o, a_g) fall in the range +1 to −1, where the maximum value of +1 indicates an optimal approximation, with Eqs. (1) and (2) satisfied.
By minimizing the fitness function defined in (3), the instance selection of CHCQX optimizes for a higher-quality meta-model using the left term of the fitness function, with the smallest number of instances (as captured by the right term). The fitness of the original function (an individual trained with all n available instances in D), according to (3), is equal to f(g) = (1 − 1) + n/n = 1. A high-quality candidate approximation trained with a small number of instances will have a fitness below 1, approaching 0.
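A hedged sketch of the fitness in Eq. (3), on synthetic data and with our own helper names: the candidate individual is a boolean mask over training instances, and evaluating the mask that selects all n instances reproduces the original function's fitness of (1 − 1) + n/n = 1.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=2000, n_features=15, n_informative=6,
                           random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=1)

q = 10
U = rng.random((q, X.shape[1])) < 0.5     # q random feature subsets
U[U.sum(axis=1) == 0, 0] = True

def evaluate(u, rows):
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[rows][:, u], y_tr[rows])
    return clf.score(X_val[:, u], y_val)

all_rows = np.arange(len(X_tr))
o = np.array([evaluate(u, all_rows) for u in U])   # original-function snapshot

def instance_fitness(inst_mask):
    """Eq. (3): (1 - rho(o, a_g)) + |s_g| / n, minimized by the IS stage."""
    rows = np.flatnonzero(inst_mask)
    a_g = np.array([evaluate(u, rows) for u in U])
    rho, _ = spearmanr(o, a_g)
    return (1.0 - rho) + rows.size / len(X_tr)

full = instance_fitness(np.ones(len(X_tr), dtype=bool))
print(round(full, 2))
```

A GA minimizing `instance_fitness` over instance masks plays the role of the CHCQX active sampling phase.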
We denote s* as the instance subset that results from solving the first optimization problem. We use s* to train the approximate classifier C_s* that will be used next in the feature selection phase.

CHCQX feature selection phase
During the feature selection phase, our algorithm uses the approximate classifier C_s* that was constructed in the instance selection phase, together with the original fitness function C_D. We carry out the majority of feature subset evaluations using the approximation C_s*, and only after a fixed number of generations do we re-evaluate all individuals using C_D. We control how often the original function is used through a frequency hyper-parameter. The use of the approximate model together with the original fitness function is known as evolution control in evolutionary computations using approximation (Jin et al., 2000), and has been recognized for its effectiveness in preventing the approximation from converging to a false optimum (Ratle, 1998).
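The evolution control scheme can be sketched with a toy bit-counting problem (everything below is illustrative, not the paper's implementation): the surrogate is a cheap, noisy but rank-preserving stand-in for the true function, and every tenth generation the population is scored with the true function instead.

```python
import random
random.seed(0)

def original_fit(g):                 # expensive "true" fitness (toy: count ones)
    return sum(g)

def surrogate_fit(g):                # cheap approximation, rank-preserving on average
    return sum(g) + random.gauss(0, 0.1)

def step(pop, fit):
    """Toy selection: keep the better half, refill with mutated copies."""
    ranked = [g for _, g in sorted(zip(fit, pop), key=lambda p: -p[0])]
    elite = ranked[: len(pop) // 2]
    children = [[b ^ (random.random() < 0.05) for b in g] for g in elite]
    return elite + children

pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(10)]
for t in range(30):
    if (t + 1) % 10 == 0:            # evolution control: use the true function
        fit = [original_fit(g) for g in pop]
    else:                            # most generations use the surrogate
        fit = [surrogate_fit(g) for g in pop]
    pop = step(pop, fit)

best = max(original_fit(g) for g in pop)
print(best)
```

The periodic re-evaluation keeps the search anchored to the original function, which is what prevents convergence to a false optimum of the surrogate.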
The fitness function of the second optimization problem is given by the equation: f(g) = Acc_val(C_s*, g), where Acc_val is the validation set accuracy of the classifier C_s* when trained with the feature subset g. We denote g* as the feature subset that results from solving the second optimization problem; the feature subset g* is the final solution of CHCQX. The pseudocode of the instance selection and feature selection stages of CHCQX can be found in Algorithm 1 and Algorithm 2. We explain the fundamental steps of the CHC algorithm as it is used in both stages:

1. Initialization: The initial population P_0 is generated randomly according to the hyper-parameters that identify the number of individuals in the population and the independent probability of having a 1 in each bit of the string.

2. Reproduction selection: On each generation t, the parents are selected randomly for reproduction from the population P_t, but an incest prevention mechanism is applied to prevent similar parents from mating. Similarity is identified by measuring the Hamming distance between the pair of parents, and only pairs which differ from each other by more than a given threshold are allowed to mate.

3. Heterogeneous recombination: The HUX operator is used to generate offspring P′_t by copying all bits matched in both parents, and then copying half of the differing bits from each parent to the resulting offspring.

4. Cross-generation elitist selection: The algorithm selects the best individuals from the current generation P_t and the offspring P′_t to be passed to the following generation. This cross-generation elitist strategy ensures that the best solutions found so far always survive. If the stop criterion is met after this step, the algorithm simply exits and returns the best solution found. Alternatively, the process moves to step 5 (cataclysmic mutation) if the progress stagnates for several generations; otherwise, the algorithm goes back to step 2, and steps 2-4 are repeated iteratively.
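Step 3 (HUX) can be sketched as follows (a minimal illustration with our own function name; the actual CHC implementation is in the paper's repository):

```python
import random
random.seed(42)

def hux(parent_a, parent_b):
    """HUX crossover: copy all matching bits, then swap exactly half of the
    differing bits between the two offspring."""
    child_a, child_b = list(parent_a), list(parent_b)
    diff = [i for i in range(len(parent_a)) if parent_a[i] != parent_b[i]]
    for i in random.sample(diff, len(diff) // 2):
        child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

a = [1, 1, 0, 0, 1, 0]
b = [1, 0, 1, 0, 0, 1]
print(hux(a, b))
```

Because exactly half of the differing bits are exchanged, the offspring are maximally distant from both parents among HUX-style recombinations, which sustains diversity in CHC's small populations.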

Results and discussion
This section describes the experimental design we have used to evaluate our method, along with the results obtained in several experiments. The first experiment aimed to evaluate CHCQX against a traditional CHC, an algorithm that uses all the available training data for feature selection. We also report the results of another proposed algorithm, PSOQX, and compare it to PSO. In the second and third experiments we conducted a sensitivity analysis of the main hyper-parameters of the algorithm; namely, we evaluated the effect of varying the population size and the evolution control frequency on CHCQX performance. Finally, we provide an amortized analysis of the time complexity of the CHCQX algorithm and compare it to CHC.

Experimental setup
Our experiments involve 13 data sets from the UCI Machine Learning Database Repository. We have included 6 small-sized datasets of less than 10K instances, 4 medium-sized datasets between 10K and 100K, and 3 large datasets with more than 100K instances. The objective was to evaluate the effectiveness of our approach for datasets of varying sizes. Table 1 provides a summary of the number of instances, features, and classes of all data sets used in our experiments.
Each data set is divided into training (60%), validation (20%), and testing (20%) splits. We have used a Decision Tree classifier in all experiments with the default algorithm settings of the Python library sklearn (https://scikit-learn.org/stable/). A unified approach of preprocessing is adopted for all data sets, including categorical feature encoding, imputation of missing values, and shuffling of instances. Accuracy of the model is the metric we used for evaluation in all experiments. All reported accuracies are the ones realized on the testing set.
Algorithm 1 takes as input the number of individuals in the population, the maximum number of generations, and the probability of each instance to be selected (independently); it creates a vector of q randomly generated feature subsets, evaluates each of them, and initializes the generation counter to zero.

In describing the experiments, we will use the following terminology to refer to different classifiers:

- Baseline DT: The Decision Tree classifier that is trained on all available instances without performing any feature selection.
- CHC: The Decision Tree classifier that is trained on all available instances in the training set, after performing feature selection using CHC with the original fitness function. The population size of CHC is 50 in all experiments unless explicitly stated otherwise, while the other parameters are set to the recommended settings suggested in the original paper (Eshelman, 1991). The diversity threshold is set to L/4, where L is the length of the individual (number of features), while the divergence rate is 0.35.
- PSO: The Decision Tree classifier that is trained on all available instances in the training set, after performing feature selection using PSO with the original fitness function. We have used the global version of PSO with a topology connecting all particles to one another. The following options are used: {c1: 1.49618, c2: 1.49618, w: 0.7298}, while the number of particles is set to 50 in all experiments.
- CHCQX: The Decision Tree classifier that is trained on all available instances in the training set after performing feature selection using CHCQX. This algorithm uses the same settings of the base optimizer as the CHC baseline. The hyper-parameters specific to CHCQX are set to 20 and 10, respectively.
- PSOQX: The Decision Tree classifier that is trained on all available instances in the training set after performing feature selection using PSOQX. This algorithm uses the same settings of the base optimizer as the PSO baseline. The hyper-parameters specific to PSOQX are also set to 20 and 10, respectively.
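For illustration, a minimal global-best PSO update with these coefficient settings (w = 0.7298, c1 = c2 = 1.49618, 50 particles, global topology) can be sketched as below; the continuous sphere function stands in for the real fitness, and all names are our own assumptions, not the paper's implementation:

```python
import numpy as np

def pso_minimize(fitness, dim=5, n_particles=50, iters=100,
                 w=0.7298, c1=1.49618, c2=1.49618, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    # Global topology: every particle is informed by a single global best.
    gbest = pbest[pbest_fit.argmin()].copy()
    gbest_fit = pbest_fit.min()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Canonical velocity update with inertia w and coefficients c1, c2.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        if pbest_fit.min() < gbest_fit:
            gbest = pbest[pbest_fit.argmin()].copy()
            gbest_fit = pbest_fit.min()
    return gbest, gbest_fit

best, best_fit = pso_minimize(lambda x: float((x ** 2).sum()))
```
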

Experiment I: CHCQX vs. CHC and PSOQX vs. PSO
In this experiment, we compared our algorithm against a classical wrapper feature selection method. The objective was to demonstrate the effectiveness of CHCQX using datasets of varying sizes in terms of the number of instances and features. To validate whether CHCQX converges faster than CHC, we ran CHCQX to convergence (defined as 10 consecutive generations with no improvement of the best solution in the population) and allowed CHC to run for the same CPU time. Due to the stochastic nature of the process, we ran 10 repetitions for each dataset and report the median and standard deviation of the runs.
The results in Table 2 show that, as expected, in comparison to a baseline without any feature selection, all feature selection methods improved the performance significantly, based on a paired Student's t-test. It must be noted, however, that our algorithms CHCQX and PSOQX were so successful in reducing the overfitting of the baseline Decision Tree that the DT model after feature selection managed to exceed the performance of a Random Forest model trained with all features (an ensemble of 11 Decision Trees) for a number of datasets (abalone, adult, dota2Train, diabetic and covtype).

[Algorithm pseudocode fragment. Inputs: the number of individuals in the instance selection and feature selection populations, the maximum number of generations during instance selection and during feature selection, the probability of each instance and of each feature to be selected (independently), and the frequency of using the original function.]
The results of Table 2 show no advantage of using CHCQX for small datasets with less than 10K instances. This is quite natural and can be explained by the overhead of the algorithm, which caused a deteriorating effect in a number of datasets (semeion, abalone and qsar). However, the advantage of using our algorithm is already noticeable for medium size datasets (between 10K and 100K), as CHCQX performs generally better than CHC, and is significantly better according to the paired t-test for the bank-full dataset. This trend continues, and CHCQX is always significantly better than CHC for all three large datasets (diabetic, census-income and covtype), i.e., those with more than 100K instances. This demonstrates the usefulness of our approach for the purpose it was designed for, namely as the datasets scale larger in size.
We show similar results with our proposed PSO-based algorithm, as we compare PSOQX against PSO following the same procedure. The results of Table 2 show PSOQX is generally no better than PSO for small sized datasets, while being generally better for medium sized ones, and always significantly better for large datasets. This demonstrates that our procedure of constructing high quality meta-models is useful when used independently from the underlying optimizer.
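The paired significance testing used throughout these comparisons can be reproduced with scipy's paired t-test; the accuracy values below are hypothetical placeholders for 10 paired runs, not results from the paper:

```python
from scipy.stats import ttest_rel

# Hypothetical accuracies from 10 paired runs (same data splits per pair).
acc_chcqx = [0.81, 0.82, 0.80, 0.83, 0.81, 0.82, 0.84, 0.80, 0.82, 0.83]
acc_chc   = [0.78, 0.80, 0.76, 0.81, 0.79, 0.78, 0.82, 0.77, 0.80, 0.79]

# Paired t-test on the per-run differences.
stat, p_value = ttest_rel(acc_chcqx, acc_chc)
significant = p_value < 0.05  # threshold used for the bold entries in Table 2
```
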

Experiment II: Effect of population size on CHCQX and CHC
The objective of this experiment was to analyse the effect of increasing the population size for both CHC and our proposed algorithm CHCQX. We vary the population size hyper-parameter to values equal to (50%, 100%, 200%, 400%) of the individual length and measure the impact on convergence time and fitness for both algorithms in 10 repetitions. Unlike the previous experiment, this time we let both algorithms run to convergence (defined as 10 consecutive generations with no improvement of the best individual). We studied the effect using two medium size datasets, adult and bank-full, and one large dataset, census-income.
As expected, we observe in Fig. 4 that the convergence time of both algorithms increases as we run the optimization with larger populations. It is clear, however, that CHCQX converges faster than CHC in almost all settings. In general, the fitness of both algorithms improves as we increase the population size. Also, agreeing with the results of Xu and Gao (1997), we observed a reduced possibility of premature convergence with larger populations. This was particularly apparent for CHC (as indicated by lower variation in the quality of final solutions). However, increasing the population size beyond 200% of the individual size does not seem to offer any significant improvement for either of the algorithms. This observation is consistent with results from the literature suggesting that a large population size is not always helpful (Chen, Tang, Chen, & Yao, 2012), while other studies recommend using CHC with small populations of around 50 individuals (Whitley, 1994). We recommend, accordingly, working with population size values between 100% and 200% of the individual size for both CHC and CHCQX. A fixed value of 50 individuals performed consistently well for problems of varying complexity across our experiments and was therefore chosen as the default population size setting for CHCQX.

Experiment III: Effect of evolution control frequency on CHCQX
In this experiment, we analyse our evolution control strategy by varying the value of the frequency hyper-parameter. Intuitively, a trade-off exists between less frequent controls (which risk allowing the meta-model to mislead the optimization to a false optimum) and more frequent controls (which require increased computation time). We studied this trade-off by varying the value of the f hyper-parameter among {5, 10, 20, 40}. The value of f represents the number of generations that must pass before the original function is used during feature selection; a high value indicates less frequent use. We use the same three datasets as in the previous experiment.
We show in Fig. 5 the downside of less frequent use of evolution control for the value of (f = 40) in the adult dataset and (f = 20, f = 40) for the bank-full dataset. In these settings, the meta-model often converges to false optima, as is evident from the significant performance drop in comparison to the more controlled settings (f = 5, f = 10). Interestingly, the less controlled settings perform generally better for the census-income dataset; this observation could be explained by the fidelity of the approximate model. Intuitively, a high-quality meta-model requires less frequent evolution control (Jin, 2005). We recommend setting the value of f between 5 and 10, as this range provided the most well-balanced performance during our experiments. Ideally, a solution would be to abandon the fixed control strategy and follow an approach that adjusts the frequency of evolution control adaptively, based on the fidelity of the meta-model (Jin, Olhofer, & Sendhoff, 2001); however, this is work for the future and outside of the scope of this paper.
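The fixed control strategy discussed above amounts to a simple scheduling rule: use the surrogate every generation, except every f-th generation, when the true fitness is used. A minimal sketch (all names are our own, with stub fitness functions for illustration):

```python
def evolve_with_control(population, surrogate_fitness, true_fitness,
                        evolve_step, f=10, max_gens=100):
    """Evolve mostly on the surrogate; every f generations, use the true function."""
    history = []  # records on which generations the true function was used
    for gen in range(1, max_gens + 1):
        fitness_fn = true_fitness if gen % f == 0 else surrogate_fitness
        population = evolve_step(population, fitness_fn)
        history.append(fitness_fn is true_fitness)
    return population, history

# Toy check with stub functions: f = 10 over 100 generations
# means the expensive true function is called on 10 generations only.
pop, hist = evolve_with_control(
    population=[0] * 4,
    surrogate_fitness=lambda x: x,
    true_fitness=lambda x: x,
    evolve_step=lambda pop, fit: pop,
)
```

A smaller f tightens control (more true evaluations, more cost); a larger f trusts the surrogate longer, risking convergence to a false optimum, exactly the trade-off studied in this experiment.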

Amortized analysis of the CHCQX algorithm
As reduction of computation time is the main objective of our algorithm, we carried out an amortized analysis of the time complexity of CHCQX in comparison to CHC. We show that the amortized cost of one generation of CHCQX using the algorithm's default settings is in the worst case smaller than its counterpart of CHC, when the two algorithms run for more than 13 generations. We used the aggregate method to determine the upper bound of the worst case total run-time cost of r generations of evolution, then calculated the amortized cost of one generation for each method.
In our analysis, both algorithms use a Decision Tree classifier with complexity O(m · n²) (Su & Zhang, 2006), where m represents the number of instances and n is the number of features. The analysis could be directly extended to any other induction algorithm with a different complexity.
The time complexity of CHC is linearly dependent on the complexity of the induction algorithm. This is true for situations in which fitness evaluations consume almost all of the run-time of the algorithm, as the time complexity of running the evolutionary operators like crossover and mutation is negligible in comparison. The time complexity of g generations of CHC with a Decision Tree can thus be expressed as O(g · e · m · n²), where e is the number of fitness function evaluations per generation. The amortized cost of one generation of evolution using CHC is simply: T_CHC = e · m · n². The time complexity of CHCQX consists of the time required to construct the meta-model plus the time of feature selection, so the amortized cost of one generation of CHCQX over r generations is: T_CHCQX(r)/r = (T_IS + T_FS(r))/r, where T_IS is the run-time of the instance selection stage and T_FS(r) is the run-time of r generations of feature selection. The time complexity of one generation of CHCQX is variable, as the algorithm mostly uses the computationally cheap meta-model to carry out fitness evaluations, and only uses the true fitness function occasionally. The instance selection of CHCQX involves evaluating a fixed number of randomly selected feature subsets using the original function; the number of evaluations is a hyper-parameter of the algorithm, denoted q here, with a default value of 10. The total time complexity of this operation is accordingly q · m · n².
CHCQX uses an instance selection GA to select samples to construct a meta-model that offers the best trade-off between the highest correlation (agreement with the original function) and the smallest sample of instances. The overall run-time of the instance selection GA is hard to predict due to the stochastic nature of GA, but we estimate the worst case time of this stage based on our selection of hyper-parameters. We used a population of 4 individuals and carried out evolution for a maximum of 10 generations; we also use a ''no change'' counter to stop instance selection early if fitness does not improve for a number of generations (we set this value to 3). The worst case total number of evaluations is accordingly 40. We initialize the starting population with individuals with no more than m/2 selected instances. As we used CHC to carry out instance selection, the number of selected instances never exceeds m/2 during evolution, due to the averaging effect of the HUX crossover operator of CHC. We provide the proof in Lemma 1.
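The averaging effect of HUX that Lemma 1 formalizes rests on a simple conservation property: swapping bits between two offspring cannot change their combined number of 1s. A minimal sketch of HUX (our own implementation, for illustration) makes this checkable:

```python
import random

def hux(p1, p2, rng):
    """Half Uniform Crossover: copy matched bits, swap exactly half of the differing bits."""
    c1, c2 = list(p1), list(p2)
    diff = [i for i in range(len(p1)) if p1[i] != p2[i]]
    rng.shuffle(diff)
    for i in diff[: len(diff) // 2]:  # exchange half of the differing positions
        c1[i], c2[i] = p2[i], p1[i]
    return c1, c2

rng = random.Random(0)
p1 = [rng.randint(0, 1) for _ in range(40)]
p2 = [rng.randint(0, 1) for _ in range(40)]
c1, c2 = hux(p1, p2, rng)

# The total number of 1s is conserved across the offspring pair, so each
# offspring carries the parents' average number of 1s in expectation.
conserved = sum(c1) + sum(c2) == sum(p1) + sum(p2)
```
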
Lemma 1. The HUX crossover in binary optimization produces offspring whose number of 1s is equal to the average of their parents' numbers of 1s.

Proof. Given two binary individuals u1, u2 ∈ {0, 1}^L, where L indicates the length of the string, define k1 and k2 as the total number of 1s in u1 and u2, respectively. The HUX operator copies all bits matched in both parents, and then exchanges exactly half of the differing bits between the parents. Since the exchange only moves bits between the two offspring, the total number of 1s is conserved, i.e., k'1 + k'2 = k1 + k2; and since the exchanged positions are chosen uniformly at random among the differing bits, each offspring receives, in expectation, half of the 1s among the differing bits. We can accordingly calculate k'1, k'2 for offspring u'1, u'2 as: k'1 = k'2 = (k1 + k2)/2. □

Based on Lemma 1, we can guarantee that the instance selection phase will never produce individuals larger than the maximum of the initial population (with m/2 instances). Accordingly, the worst case time for the instance selection phase of CHCQX is: T_IS ≤ q · m · n² + 40 · q · (m/2) · n². During feature selection, CHCQX carries out fitness evaluations mostly using the meta-model; only every f generations (by default, f = 10) does the algorithm re-evaluate all individuals in the population using the original function. Therefore, the total time of carrying out r generations of feature selection using CHCQX is: T_FS(r) = r · e · m' · n² + (r/f) · e · m · n², where m' is the number of instances in the meta-model. The number of instances in the meta-model is in the worst case m' = m/2, based on Lemma 1. Therefore, the worst case total run-time of the feature selection phase is: T_FS(r) ≤ r · e · (m/2) · n² + (r/f) · e · m · n². The total cost of r generations of CHCQX is: T_CHCQX(r) = T_IS + T_FS(r). By using the algorithm hyper-parameter values of (q = 10, f = 10), the amortized cost of one generation of CHCQX is: T_CHCQX(r)/r = T_IS/r + 0.6 · e · m · n². Naturally, the amortized overhead cost of the instance selection stage, represented by the term T_IS/r, is inversely proportional to the total number of generations during feature selection. For problems that can be solved with a small number of generations, the overhead cost outweighs the benefit of the algorithm. For the default hyper-parameters of CHCQX, and a population of 50 individuals (e = 50), one can show that for r = 13 generations: T_CHCQX(r = 13)/13 < T_CHC.
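The comparison can be made concrete with a small calculator. The instance-selection bound used below follows our reading of the worst-case expressions above and is an illustrative assumption, not the paper's exact constant; the qualitative behaviour (overhead dominates for small r, CHCQX cheaper for large r since 1/2 + 1/f = 0.6 < 1) holds regardless:

```python
def amortized_chc(m, n, e):
    """Amortized per-generation cost of CHC (in abstract cost units):
    e true fitness evaluations per generation, each costing m * n**2."""
    return e * m * n**2

def amortized_chcqx(r, m, n, e, f=10, q=10):
    """Amortized per-generation cost of CHC_QX under worst-case assumptions:
    the meta-model is trained on at most m/2 instances, a full re-evaluation
    happens every f generations, and the instance-selection overhead T_IS
    (an assumed bound: q full evaluations plus at most 40 GA evaluations on
    m/2 instances) is spread over r generations."""
    t_is = q * m * n**2 + 40 * q * (m / 2) * n**2
    per_gen = e * (m / 2) * n**2 + (e * m * n**2) / f
    return t_is / r + per_gen

m, n, e = 100_000, 50, 50
# For r = 1 the instance-selection overhead dominates; for large r,
# CHC_QX needs only 0.6 of CHC's per-generation evaluation budget.
slower_at_1 = amortized_chcqx(1, m, n, e) > amortized_chc(m, n, e)
cheaper_at_100 = amortized_chcqx(100, m, n, e) < amortized_chc(m, n, e)
```
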

Conclusion
In this paper, we have proposed a two-stage surrogate-assisted solution to the computational problem of using GA for feature selection, by constructing a meta-model for fitness evaluations following a qualitative approximation approach. We defined the term ''Approximation Usefulness'' and used the expected value of rank correlation to quantify the correctness of evolutionary selections and the quality of constructed meta-models.
According to our experiments, an Approximation Usefulness Curve follows an inverse power law function similar to the Learning Curve. In the left part of the curve, the quality of a meta-model improves rapidly with more training data, until it reaches a stage in which adding more data improves the quality of the meta-model very slowly, and eventually the improvement stops altogether.
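The inverse power law shape can be illustrated by fitting a curve of the form a − b · x^(−c) to usefulness scores over a geometric schedule of sample sizes; the data and parameter values below are fabricated purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(x, a, b, c):
    """Usefulness(n) = a - b * n**(-c): rises steeply, then plateaus at a."""
    return a - b * np.power(x, -c)

# Synthetic rank-correlation scores for a geometric schedule of sample sizes
# (generated from the model itself plus small noise, for illustration).
sizes = np.array([32, 64, 128, 256, 512, 1024, 2048, 4096], dtype=float)
scores = (inverse_power_law(sizes, 0.95, 1.8, 0.55)
          + np.random.default_rng(0).normal(0, 0.005, sizes.size))

params, _ = curve_fit(inverse_power_law, sizes, scores,
                      p0=(1.0, 1.0, 0.5), maxfev=10_000)
a_hat, b_hat, c_hat = params  # a_hat estimates the plateau of the curve
```

The fitted plateau a_hat is the point past which adding more instances to the meta-model yields diminishing returns.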
We carried out an amortized analysis of the computation time of CHCQX and showed that the amortized cost of one generation of our algorithm is in the worst case smaller than its counterpart in CHC, as long as the two algorithms run for at least 13 generations. This analysis is supported by our empirical results, where CHCQX demonstrated better scalability as datasets grow larger (in terms of number of instances). We further validated our findings by also creating a variant of the PSO algorithm, PSOQX, and demonstrated similar results using a different meta-heuristic.
It must be noted that although CHCQX can be used with any learning algorithm, we have deliberately used Decision Tree as the baseline learning algorithm in all the experiments we performed. Given that Decision Tree induction is based on a greedy top-down approach of splitting the data based on an impurity measure, non-informative features will not be selected for the top nodes of the tree. The implicit feature selection of Decision Tree makes it a challenging choice for our experiments. We have observed in our results how feature selection using our meta-models consistently and significantly improved the baseline performance of Decision Tree. Our results confirm the superiority of the global search of GA in comparison to the local or greedy search of a traditional Decision Tree. Interested readers could refer to Barros, Basgalupp, De Carvalho, and Freitas (2011) to learn more about how an evolutionary approach might overcome the shortcomings of a greedy search in a traditional Decision Tree.

Limitations
The instance selection stage of CHCQX involves evaluating a fixed number of randomly generated feature subsets using the original function. Naturally, we would expect the instance selection to produce higher quality meta-models given a higher number of solutions evaluated using the original function. Clearly, the more evaluations we observe from the optimization surface of the original function, the easier it gets to produce a qualitatively similar meta-model. The obvious trade-off is the one between computational cost and quality of the meta-model. As the main goal of this work is to make the evolutionary process time efficient, ideally we would prefer making the smallest number of evaluations using the expensive original function. Additionally, we have used a simple uniform approach with a fixed probability to control the variability in the number of selected features within the fixed solutions. However, some recent studies are realizing improved diversity and performance using low-discrepancy sequences (Bangyal, Hameed, Alosaimi, & Alyami, 2021). The impact of the initialization method on the final outcome of our algorithm could be investigated further.
We have learned from the amortized analysis that the run-time of one generation of CHCQX is more computationally efficient than a classical wrapper. However, the instance selection phase of CHCQX is still computationally expensive, as it involves several evaluations of the original function. A better understanding of the basis of instance selection in CHCQX could lead us to redefine the instance selection fitness function to be more computationally efficient. As the process of active selection of instances in CHCQX is realized using a GA, it is challenging to explain why certain instances are selected by our algorithm. We think future work could either evaluate the characteristics of the selected instances (e.g., distance to the decision boundary), or analyse statistical measures of the selected sample subset (e.g., the Kolmogorov-Smirnov test or KL Divergence). An explanation of the basis of instance selection is not only useful for the insights; it could potentially obviate the need to perform evaluations using the original function.
We have not considered the class imbalance case, as it is not the focus of this paper. In all experiments, accuracy was used to evaluate the fitness of feature subsets. This metric could naturally be replaced with one that accounts for the imbalanced case, e.g., average recall. We believe our instance selection method would still handle the imbalanced case, as the fitness function is designed to construct a meta-model which is aligned with a model trained using all instances. Instance subsets that neglect some minority class will lead to meta-models with poor correlation with the original function. If imbalance is an issue, a stratified sampling approach could ensure that randomly initialized instance solutions will not ignore the minority class.
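The stratified sampling remedy can be sketched with sklearn's stratify option; synthetic imbalanced data is used here for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)  # 5% minority class

# stratify=y preserves the class ratio in every split, so a sampled
# instance subset cannot ignore the minority class entirely.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
minority_ratio_train = y_tr.mean()
```
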

Future work
In this paper, our method of constructing qualitative meta-models is specifically applied to the feature selection problem. An extension of this work would be to apply the same algorithm to other, related optimization problems, for example the hyper-parameter tuning of Machine Learning models using GA. The process of identifying the best hyper-parameter combinations shares many of the same computational challenges as the feature selection task. However, we should highlight that the fundamental notion that allows CHCQX to work for the feature selection problem might not hold for hyper-parameter tuning. A model trained with a small number of samples would rank different feature subsets similarly to a model trained with all available data; intuitively, both models will struggle to improve generalization performance using ''bad'' features. This notion, however, might not hold for the hyper-parameter tuning problem and needs to be investigated thoroughly. For example, intuitively, working with larger data sets permits the construction of more complex models without overfitting.
In the future, we also intend to study the differences in the characteristics of the optimization surfaces between the original function and the meta-model. It is possible that this makes a difference: if, for example, the approximation is a lot less smooth than the original (or, on the contrary, very flat), the optimization procedure may struggle; even if the maximum is correct, it may be harder (or easier) to find. However, we do not address this question within the scope of this work, relying only on Eqs. (1) and (2) to measure the quality of meta-models.
PSOQX demonstrated a capability to perform feature selection for data sets with high dimensional features (e.g., qsar with 1024 features). This capability makes PSOQX useful for Deep Learning applications with high dimensional input (e.g., images). Future work is planned to apply and evaluate PSOQX using such data sets and learning algorithms.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 2 .
Fig. 2. The Learning Curve of the covtype data set using a Decision Tree model, and a Geometric Sampling Schedule {32, 64, 128, . . ., 262 144}. The last point to the right represents training with all the available 348 861 training set instances.

Fig. 3 .
Fig. 3. The ''Approximation Usefulness'' curve of the covtype data set using a Decision Tree model, and a Geometric Sampling Schedule {32, 64, 128, . . ., 262 144}. The points correspond to the rank correlation scores between the original classifier and an approximation trained using the corresponding sample size, sampled randomly.
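The doubling schedule used in both curves can be generated with a few lines (names are our own):

```python
def geometric_schedule(start=32, limit=262_144, factor=2):
    """Doubling schedule of sample sizes: 32, 64, 128, ..., 262144."""
    sizes, n = [], start
    while n <= limit:
        sizes.append(n)
        n *= factor
    return sizes

schedule = geometric_schedule()
```
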

Algorithm 3:
CHCQX Feature Selection. Input: the number of controlled individuals.

Fig. 4 .
Fig. 4. The effect of varying population size on convergence time and accuracy for CHCQX and CHC. The results of the datasets are presented in row order: adult (top), bank-full (middle) and census-income (bottom).
T_CHCQX(r) = T_IS + T_FS(r), where T_IS is the run-time of the instance selection stage, and T_FS(r) is the run-time of the feature selection stage.

Fig. 5 .
Fig. 5. The effect of varying evolution control frequency on convergence time and accuracy for CHCQX. The results of the datasets are presented in row order: adult (top), bank-full (middle) and census-income (bottom).
Proposed the main idea of the work, Development of the work, Methodology, Experiments, Writing - original draft. Sławomir Nowaczyk: Development of the work, Methodology, Experiments, Writing - original draft. Sepideh Pashami: Development of the work, Methodology, Experiments, Writing - original draft. Peyman Sheikholharam Mashhadi: Development of the work, Methodology, Experiments, Writing - original draft.
Mutation is not used in the recombination stage of CHC; it is only used whenever convergence is reached based on the diversity threshold. A restart is initiated to reintroduce diversity into the population when progress stagnates for several generations: the best individual is used as a template to generate the new population by mutating 35% of its bits, and the process goes back to step 2 after the restart. Refer to Algorithm 3 for the full pseudocode of CHCQX feature selection. The main steps of Algorithm 3 are: 1. Instance selection: This step represents the first stage of the CHCQX algorithm and is carried out to identify the training instances of the meta-model. 2. Feature selection using the meta-model: Feature selection is carried out using the meta-model for a fixed number of generations according to the f value. 3. Re-evaluations using the original function: Every f generations, all individuals in the feature selection population are re-evaluated using the original function. 4. Stop criterion: Once the stop criterion is met, the algorithm returns the best found feature subset; otherwise, we go back to step 2.
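The cataclysmic restart described above can be sketched as follows (our own minimal implementation; variable names are assumptions):

```python
import random

def cataclysmic_restart(best, pop_size, divergence_rate=0.35, rng=None):
    """Rebuild the population from the best individual, flipping 35% of its
    bits in each copy; the best individual itself is kept unchanged (elitism)."""
    rng = rng or random.Random(0)
    n = len(best)
    n_flips = int(divergence_rate * n)
    population = [list(best)]  # keep the template individual intact
    for _ in range(pop_size - 1):
        clone = list(best)
        for i in rng.sample(range(n), n_flips):  # distinct positions
            clone[i] = 1 - clone[i]              # flip bit
        population.append(clone)
    return population

best = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
pop = cataclysmic_restart(best, pop_size=50)
```

With a length-20 template and a 0.35 divergence rate, each regenerated individual differs from the template in exactly int(0.35 * 20) = 7 positions.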
Initialize a population of instance individuals.

Table 2
The median and standard deviation results of 10 runs; bold indicates significance in the comparisons of CHCQX and PSOQX against CHC and PSO, respectively.