A comprehensive learning-based swarm optimization approach for feature selection in gene expression data

Gene expression data analysis is challenging due to the high dimensionality and complexity of the data. Feature selection, which identifies relevant genes, is a common preprocessing step. We propose a Comprehensive Learning-Based Swarm Optimization (CLBSO) approach for feature selection in gene expression data. CLBSO leverages the strengths of ants and grasshoppers to efficiently explore the high-dimensional search space. Ants perform local search and leave pheromone trails to guide the swarm, while grasshoppers use their ability to jump long distances to explore new regions and avoid local optima. The proposed approach was evaluated on several publicly available gene expression datasets and compared with state-of-the-art feature selection methods. CLBSO achieved an average accuracy improvement of 15% over the original high-dimensional data and outperformed other feature selection methods by up to 10%. For instance, in the Pancreatic cancer dataset, CLBSO achieved 97.2% accuracy, significantly higher than XGBoost-MOGA's 84.0%. Convergence analysis showed CLBSO required fewer iterations to reach optimal solutions. Statistical analysis confirmed significant performance improvements, and stability analysis demonstrated consistent gene subset selection across different runs. These findings highlight the robustness and efficacy of CLBSO in handling complex gene expression datasets, making it a valuable tool for enhancing classification tasks in bioinformatics.


Introduction
The progression of microarray technologies has been notably swift in the era following the completion of the Human Genome Project, allowing for simultaneous analysis of the expression levels of numerous genes. This advancement, while significant, introduces complexities due to the inherently high-dimensional nature of gene expression data, combined with the typically small sample sizes used in experiments [1]. Addressing these complexities is crucial for effective biomarker discovery, accurate cancer diagnosis, and the precise differentiation of tumor types, which are central challenges in the post-analysis stages of microarray studies. The implementation of an advanced feature selection mechanism is essential in mitigating these challenges by reducing the data's dimensionality [2].
Feature selection in the context of microarray analysis, also known as gene or variable reduction, aims to identify a critical subset of informative features. This is achieved by removing irrelevant or redundant data from the initial set, thereby focusing on the features of utmost relevance. Such a process is instrumental in uncovering potential biomarkers for diseases and facilitating the construction of effective disease classifiers, particularly in the realm of oncology [3]. There exists a plethora of feature selection methods, broadly categorized into filter, wrapper, embedded, and hybrid approaches. Filters evaluate features based on metrics such as distance, information theory, consistency, and dependency, without depending on any classifier [4][5][6]. Wrapper methods, on the other hand, select features based on the predictive accuracy of specific classifier models, often achieving better results than filter methods, albeit at the expense of computational efficiency. Embedded methods integrate feature selection into the model training process, representing a hybrid of the previous two. Finally, hybrid methods combine the initial screening power of filter methods with the model-specific optimizations of wrapper methods [7].
The outcomes of feature selection are generally twofold: the ranking of features according to their importance or the selection of a subset of features. While ranking methods list features by their significance, subset selection methods provide a definitive group of features for further analysis. The stability of feature selection processes is pivotal, especially when dealing with high-dimensional data like that from microarray studies. This stability refers to the ability of the feature selection process to produce consistent results under different data conditions, which is essential for the reliable identification of genetic markers and the development of accurate disease classifiers [8]. Efforts to improve feature selection stability have led to the development of methods focusing on sample weighting, group-based selection, and ensemble approaches [9].
The Comprehensive Learning-Based Swarm Optimization (CLBSO) method introduced in this study represents a novel approach to feature selection in gene expression analysis. Unlike group feature selection and ensemble methods, CLBSO leverages the natural foraging behaviors of ants and grasshoppers to navigate the complex, high-dimensional search space effectively, adapting to data variability without the need for pre-clustering or algorithm amalgamation.
This research makes several significant contributions:
• The proposition and elaboration of the CLBSO algorithm, inspired by the foraging patterns of ants and grasshoppers, achieving an effective balance between local and global search capabilities and improving the feature selection process in gene expression datasets.
• The introduction of a novel adaptive weighting strategy, enhancing the algorithm's ability to dynamically modify the impact of the local and global search phases, which in turn aids in more accurately pinpointing relevant features for disease classification.
• Comprehensive testing of the CLBSO algorithm across various cancer gene expression profiles, demonstrating its superior performance in metrics such as accuracy and F-measure against other leading optimization techniques such as XGBoost-MOGA, ISSA, BCOOT, and SBCSO.
• An in-depth analysis of the CLBSO's performance, highlighting its consistency in identifying compact, significant feature sets across different cancer types, thereby supporting a deeper understanding of cancer mechanisms and the development of targeted treatments.
• A demonstration of the versatility of the CLBSO algorithm, illustrating its potential application beyond gene expression to feature selection scenarios in diverse fields.
The remainder of this paper is structured as follows: Section 2 reviews relevant literature; Section 3 details the CLBSO methodology; Section 4 describes the implementation; Section 5 presents experimental results; Section 6 concludes the study.

Related works
Current studies underline the significance of implementing advanced learning mechanisms for in-depth analysis of gene characteristics, which subsequently enhance classification efficacy [10]. The application of evolutionary strategies like Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), and Genetic Algorithms (GA) has been prominent in the gene selection arena. Additionally, methodologies incorporating Artificial Neural Networks (ANNs), Fuzzy Logic Systems (FLS), and the Hybrid Stem Cell (HSC) algorithm have demonstrated efficacy in classification tasks [11][12][13][14][15]. Particularly, GA has garnered recognition for its efficiency in sifting through a myriad of potential solutions to identify the most appropriate gene subsets. Furthermore, the realm of gene selection has witnessed the ascendancy of swarm optimization strategies, offering a robust mechanism for dimensionality reduction through the principles of swarm intelligence, culminating in optimal solutions [16][17][18][19].
The challenge posed by a substantial feature space, typically laden with irrelevant or redundant genes, necessitates the adoption of gene selection for enhanced classification outcomes in both machine learning and the medical sciences. Recent trends point towards the integration of hybrid machine learning techniques, particularly metaheuristic optimization, for the discernment of pertinent and informative genes. Innovations such as a hybrid multi-objective cuckoo search complemented by evolutionary operators have demonstrated superiority in gene selection, particularly across high-dimensional cancer microarray datasets [20][21][22][23]. Likewise, hybridized harmony search optimization approaches have shown promise in addressing feature selection within high-dimensional data classification, surpassing other established algorithms in efficiency [24][25][26].
The domain of swarm intelligence optimization algorithms has been acknowledged for its substantial contribution to feature selection, attributed to their extensive global search capabilities and inherent simplicity. Notable developments include the integration of teaching-learning-based optimization (TLBO) and gravitational search algorithms (GSA) into a cohesive hybrid wrapper algorithm [27]. This innovative approach combines mRMR for initial gene relevance determination with a subsequent selection of informative genes through a refined method [28]. Further advancements have been made with the introduction of the multidimensional population-based bacterial colony optimization (BCO-MDP) for classification-oriented feature selection [29], alongside methodologies leveraging the gray wolf optimization algorithm and the Harris Hawk optimization algorithm, each tailored for nuanced feature selection in gene expression analysis. Recent explorations into feature selection have also incorporated improved moth-flame optimization algorithms and the synergistic use of ant colony optimization with ReliefF for enhanced tumor classification [30,31].
Additionally, there have been significant advancements in machine learning techniques for disease prediction and classification, as seen in recent studies focusing on DNA sequence classification [32,33]. These studies highlight the growing role of machine learning and deep learning in enhancing diagnostic accuracy and treatment outcomes across various medical applications.

Research gap
Despite the extensive array of feature selection methodologies documented in the literature, a discernible gap persists, particularly regarding a method that seamlessly combines efficiency and effectiveness for gene expression data analysis. Common challenges encountered by existing techniques include prohibitive computational demands, susceptibility to overfitting, and difficulties in navigating high-dimensional datasets. Moreover, there is a notable deficiency in the integration of varied optimization methods to harness their combined strengths for a more comprehensive exploration of the search space.
Addressing these concerns, the proposed Comprehensive Learning-Based Swarm Optimization (CLBSO) approach innovatively employs ant and grasshopper behaviors for advanced navigation of the complex, high-dimensional search landscape, adjusting adeptly to data changes. This method uniquely combines ants for meticulous local searches with grasshoppers for expansive global searches, ensuring thorough exploration and avoidance of local optimum pitfalls. The approach further incorporates a learning strategy, enhancing the swarm's adaptability to data shifts and thereby refining the overall effectiveness of the selection process. The CLBSO methodology aims to bridge the current research void by presenting a balanced, efficient, and effective solution for gene expression data analysis, capitalizing on the synergistic potential of ants and grasshoppers to systematically explore and adapt to the evolving search landscape.

Problem definition
Given a gene expression dataset D consisting of m samples and n features (genes), the objective is to find a subset of features S ⊆ {1, 2, …, n} that maximizes a specific criterion, such as classification accuracy, while minimizing the number of selected features.
Formally, the problem can be defined as shown in equation (1) below:

$\max_{S \subseteq \{1, 2, \dots, n\}} f(S)$ (1)

where f(S) is a fitness function that measures the quality of the selected feature subset S. The fitness function aims to maximize the classification accuracy.
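To make the combinatorial nature of equation (1) concrete, the sketch below solves it by brute-force enumeration of every candidate subset. The function name and the toy fitness are illustrative assumptions; the exponential cost of this exact approach is precisely why a heuristic such as CLBSO is needed when n is in the tens of thousands.

```python
from itertools import combinations

def best_subset_bruteforce(n, fitness):
    """Exhaustive version of equation (1): score every subset
    S of {0, ..., n-1} and keep the maximizer. Feasible only for
    tiny n -- there are 2^n candidate subsets."""
    best, best_score = frozenset(), fitness(frozenset())
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            score = fitness(frozenset(subset))
            if score > best_score:
                best, best_score = frozenset(subset), score
    return best, best_score
```

With a toy fitness that rewards two "informative" features and lightly penalizes subset size, the search recovers exactly those two features.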

Proposed CLBSO algorithm
The proposed Comprehensive Learning-Based Swarm Optimization (CLBSO) method leverages swarm intelligence principles to tackle optimization challenges characterized by intricate and dynamic environments. The architecture of the proposed CLBSO is shown in Fig. 1. It integrates the distinct yet synergistic strategies of ants and grasshoppers to create a robust optimization framework capable of adapting to changes within its operational landscape. The CLBSO algorithm unfolds through four key stages: 1) Initialization Phase, 2) Local Search Phase, 3) Global Search Phase, and 4) Adaptation or Comprehensive Learning Phase, as depicted in Fig. 2. The following subsections elucidate each stage within the CLBSO algorithm's workflow.
In contrast to traditional approaches, the CLBSO leverages a single swarm population that incorporates the searching mechanisms of both ants and grasshoppers. This innovative structure allows each potential solution within the swarm to exhibit characteristics of both an "ant" and a "grasshopper," thereby enhancing the diversity and adaptability of the search process. The synergy between ant-like and grasshopper-like behaviors within the CLBSO framework is encapsulated in the unified optimization strategy discussed in the following sections.

Initialization phase
Metaheuristic feature selection algorithms typically employ binary encoding to represent the solution space effectively, facilitating both the representation of feature subsets and the simplification of algorithmic complexity. This binary approach assigns a binary vector of length n to each solution, where n denotes the total number of available features. In this vector, each bit corresponds to a specific feature: a "1" indicates the inclusion of the feature in the selected subset, while a "0" denotes its exclusion.
In this research, we adhere to this binary representation strategy to maximize the inherent advantages of the algorithm. During the Initialization Phase, the CLBSO algorithm generates an initial population of solutions, represented as binary strings, which correspond to various feature subset configurations within the gene expression data context. This initial generation process is randomized to ensure a diverse starting point for the optimization journey. Each individual's binary string reflects its proposed feature subset, and the corresponding objective function values are computed to assess the quality and efficacy of each proposed solution.
(a) Generate each individual's bits randomly as shown in equation (2):

$x_{ij}^{(0)} = \begin{cases} 1, & \text{if } r_{ij} < p \\ 0, & \text{otherwise} \end{cases}$ (2)

where $r_{ij} \sim U(0, 1)$ and p is the initial probability of selecting a feature. (b) Calculate the objective function value f(x) shown in equation (3) for each individual. A commonly used objective function for feature selection problems is a combination of classification performance (e.g., accuracy) and a penalty term that encourages smaller feature subsets. One such objective function can be defined as follows:

$f(x) = w_1 \cdot \mathrm{accuracy}(x) - w_2 \cdot \frac{\mathrm{num\_selected\_features}(x)}{n}$ (3)

where x represents an individual in the population (i.e., a binary string encoding a feature subset), accuracy(x) is the classification accuracy of a chosen classifier trained on the selected features, num_selected_features(x) is the number of selected features in the subset, n is the total number of features in the gene expression dataset, and $w_1$ and $w_2$ are weighting factors that balance the importance of classification performance and feature subset size. The classifier used is SVM.
During the Initialization Phase, the objective function value f(x) is calculated for each individual in the population. This value will be used to guide the search process in the Local Search Phase, Global Search Phase, and Adaptation Phase.
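The two initialization steps can be sketched as follows. The helper names, the default weights w1 = 0.9 and w2 = 0.1, and the stand-in accuracy callback are assumptions for illustration; the paper's implementation trains an SVM to obtain accuracy(x).

```python
import random

def initialize_population(N, n, p=0.5, seed=0):
    """Generate N random binary strings of length n; each bit is set
    to 1 with probability p, mirroring equation (2)."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(n)]
            for _ in range(N)]

def fitness(x, accuracy_fn, w1=0.9, w2=0.1):
    """Objective in the spirit of equation (3): reward classification
    accuracy, penalize subset size. accuracy_fn stands in for training
    and scoring a classifier on the features flagged by x."""
    return w1 * accuracy_fn(x) - w2 * (sum(x) / len(x))

# Toy usage with a constant stand-in accuracy of 0.8.
population = initialize_population(N=4, n=10)
scores = [fitness(x, accuracy_fn=lambda _: 0.8) for x in population]
```

Fixing the random seed makes the initial population reproducible across runs, which is convenient for the stability analysis the paper reports.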

Local search phase
During the Local Search Phase within the CLBSO framework, the algorithm employs the 'ant' mechanism to conduct an intensive search within the local vicinity of the existing population. This stage is crucial for thoroughly exploring promising areas within the search space, enabling the algorithm to refine and enhance potential solutions.
(a) Pheromone Matrix Update: Central to this phase is the concept of a pheromone matrix, which symbolizes the features' appeal to the ants. After each cycle, the matrix's values are adjusted to reflect the solution qualities encountered. For any given feature j, the pheromone level adjustment is articulated as shown in equation (4):

$\tau_j(t+1) = (1 - \rho) \cdot \tau_j(t) + \Delta\tau_j$ (4)

where $\rho$ symbolizes the rate of pheromone fading, existing in the interval (0, 1), and $\Delta\tau_j$ signifies the pheromone increment, directly tied to the performance of solutions incorporating feature j.
(b) Ant-led Local Modification: In this step, each entity $x_i$ within the swarm is transformed by an ant, which probabilistically toggles features guided by the prevailing pheromone levels. The likelihood of a feature j being toggled for the entity $x_i$ is delineated as shown in equation (5):

$P_j(x_i) = \frac{\tau_j^{\alpha}}{\tau_j^{\alpha} + \tau_j^{-\beta}}$ (5)

Here, $\alpha$ and $\beta$ serve to balance the influence of the pheromone concentration against its inverse, shaping the decision-making process.
(c) Solution Refinement: Subsequent to the ants' local search endeavors, the algorithm updates the solutions. An enhanced solution, demonstrating superior objective function performance over its predecessor, supplants the latter.
(d) Objective Function Reevaluation: The phase concludes with a recalibration of the objective function values for the newly adjusted solutions, in accordance with the criteria established during the Initialization Phase.
By fostering detailed exploration within close proximity of the extant solutions, the Local Search Phase ensures the algorithm's proficiency in exploiting accessible regions within the search space. Following the completion of this phase, the algorithm transitions into the Global Search Phase. Here, the 'grasshopper' components engage in Levy flights, aiming to explore broader, uncharted territories of the search domain.
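A minimal sketch of the ant mechanism, assuming the evaporate-and-deposit update of equation (4) and a toggle probability shaped like equation (5); the function names and the symmetric defaults α = β = 1 are illustrative assumptions, not the authors' exact code.

```python
import random

def update_pheromone(tau, delta, rho=0.1):
    """Equation (4): evaporate each feature's pheromone at rate rho,
    then add the deposit delta earned by good solutions."""
    return [(1 - rho) * t + d for t, d in zip(tau, delta)]

def ant_toggle(x, tau, alpha=1.0, beta=1.0, rng=None):
    """Toggle the bits of solution x with a probability driven by the
    pheromone level, in the spirit of equation (5); the exact
    probability form here is an assumption."""
    rng = rng or random.Random(0)
    y = list(x)
    for j, t in enumerate(tau):
        prob = t ** alpha / (t ** alpha + t ** (-beta))
        if rng.random() < prob:
            y[j] = 1 - y[j]  # flip inclusion of feature j
    return y
```

With a uniform pheromone level of 1.0, each feature is toggled with probability 0.5, so early iterations behave like a random local perturbation and become more directed as pheromone accumulates.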

Global search phase
The Global Search Phase in the CLBSO framework is characterized by the 'grasshopper' elements employing Levy flights for extensive search activities. This phase is designed to propel the algorithm beyond local optima by venturing into unexplored territories of the search space through significant, randomized leaps, drawing from the Levy distribution principle.
(a) Implementation of Levy Flight: Each grasshopper, represented as $x_i$ within the swarm, is relocated according to equation (6) shown below:

$x_i' = x_i + s \cdot L(\lambda)$ (6)

Here, s denotes a positive step size, and $L(\lambda)$ signifies a function governed by the Levy distribution, characterized by its probability density function shown in equation (7) below:

$L(x; \lambda) = \frac{1}{\pi} \cdot \frac{\lambda}{\lambda^2 + x^2}$ (7)

In this context, $\lambda$ acts as a scale factor that modulates the distribution's spread. (b) Conversion to Binary Format: Given the binary nature of our search domain, it is imperative to transform $x_i'$ into a binary format. This transition is achieved through equation (8) as follows:

$x_{ij}' = \begin{cases} 1, & \text{if } x_{ij}' > T \\ 0, & \text{otherwise} \end{cases}$ (8)

where $x_{ij}'$ denotes the j-th component of the new position $x_i'$, with T serving as the demarcation threshold. (c) Solution Enhancement: The algorithm adopts the new position $x_i'$ over the old $x_i$ if the former exhibits superior performance based on the objective function assessment. (d) Recalculation of Objective Function: Following the update, a fresh computation of the objective function values is undertaken for the newly adjusted solutions, aligning with the parameters set forth in the Initialization Phase.
By facilitating exploration across broader and potentially more promising areas of the search space, the Global Search Phase crucially prevents the algorithm from succumbing to local optimum traps. Subsequent to this phase, the CLBSO algorithm transitions into the Adaptation or Comprehensive Learning Phase, which tailors the search mechanism adaptively, informed by accumulated historical insights.
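The grasshopper move can be sketched as below. The Mantegna sampling routine is an assumption (the paper specifies only the general form of the Levy density), while the step size s and threshold T follow the parameter settings reported later in the experimental setup.

```python
import math
import random

def levy_step(lam=1.5, rng=None):
    """Draw one heavy-tailed step using the Mantegna approximation of
    a Levy-stable distribution (an assumption; any Levy sampler
    would serve equation (6))."""
    rng = rng or random.Random()
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam
                * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / lam)

def grasshopper_move(x, s=0.1, T=0.5, rng=None):
    """Equations (6) and (8): leap each dimension by a Levy-scaled
    step, then threshold at T back to a binary position."""
    rng = rng or random.Random(0)
    return [1 if xi + s * levy_step(rng=rng) > T else 0 for xi in x]
```

Because the Levy distribution is heavy-tailed, most steps are small (refining the current region) while occasional very large steps relocate the grasshopper to distant, unexplored regions, which is the escape-from-local-optima behavior this phase relies on.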

Comprehensive learning phase
The Comprehensive Learning Phase in the CLBSO framework is where the algorithm dynamically refines its exploration and exploitation strategies by integrating the methodologies of both ants and grasshoppers, contingent upon their respective successes in the current search context. The performance of each group is assessed based on their individual contributions towards discovering the optimal solution to date.
(a) Performance Evaluation of Ants and Grasshoppers: For each feature j within the search domain, the effectiveness of ants ($P_A(j)$) and grasshoppers ($P_G(j)$) is determined through equations (9) and (10) shown below:

$P_A(j) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_A(x_i) \, x_{ij}}{f_{\max}}$ (9)

$P_G(j) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_G(x_i) \, x_{ij}}{f_{\max}}$ (10)

Here, N denotes the total solution count within the population, $f_A(x_i)$ and $f_G(x_i)$ represent the objective function scores attributed to ants and grasshoppers within the context of solution $x_i$, respectively, and $f_{\max}$ is the peak objective function value identified across all present solutions.
(b) Adaptive Weight Calculation: The adaptive weights for ants ($w_A$) and grasshoppers ($w_G$) are derived from their respective performance metrics as shown in equations (11) and (12):

$w_A = \frac{1}{n} \sum_{j=1}^{n} P_A(j)$ (11)

$w_G = \frac{1}{n} \sum_{j=1}^{n} P_G(j)$ (12)
with n representing the dimensionality of the feature space.
(c) Solution Adaptation via Adaptive Weights: Solutions are then adjusted to reflect the synergized influence of the ants' and grasshoppers' search mechanisms, as modulated by the calculated adaptive weights, using equation (13):

$x_{ij}^{t+1} = w_A \cdot A_{ij} + w_G \cdot G_{ij}$ (13)

where $x_{ij}^{t+1}$ is the newly adjusted value for feature j in solution $x_i$ for the next iteration t+1, with $A_{ij}$ and $G_{ij}$ denoting the respective updates from ants and grasshoppers for that feature. (d) Objective Function Recalculation: Following these adjustments, the objective function values are reevaluated for the newly updated solutions, adhering to the foundational metrics established in the Initialization Phase.
By amalgamating the distinct capabilities of ants for meticulous local scrutiny and grasshoppers for expansive global ventures, the Comprehensive Learning Phase equips the CLBSO algorithm with a refined mechanism for navigating the search space. This innovative approach significantly augments the algorithm's capacity for a more harmonized exploration and exploitation, thereby enhancing its overall efficacy in high-dimensional space analysis. The pseudocode of the proposed CLBSO is shown in Algorithm 1.
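A hedged sketch of the adaptation step: the normalization inside adaptive_weights and the thresholded blend are assumptions consistent with the descriptions of equations (11) through (13), not the authors' exact formulas.

```python
def adaptive_weights(perf_ants, perf_grasshoppers):
    """Normalize the two groups' aggregate performance scores into
    weights summing to 1, in the spirit of equations (11)-(12);
    the normalization choice is an assumption."""
    total = perf_ants + perf_grasshoppers
    if total == 0:
        return 0.5, 0.5  # no signal yet: weight both groups equally
    return perf_ants / total, perf_grasshoppers / total

def blend_solution(ant_bits, gh_bits, w_a, w_g, T=0.5):
    """Equation (13) plus a binary threshold: combine the two groups'
    proposed updates per feature, then re-binarize at T."""
    return [1 if w_a * a + w_g * g > T else 0
            for a, g in zip(ant_bits, gh_bits)]
```

The weights shift the swarm toward whichever behavior has recently produced better solutions, so exploitation dominates once the ants' refinements pay off and exploration dominates while the grasshoppers keep finding better regions.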

Experimental setup and evaluation
All experiments were conducted on the high-performance hardware available in our lab: an NVIDIA A100 GPU, an Intel i9 processor, 64 GB of RAM, and 1 TB of SSD storage, running Ubuntu 18.04. The A100 GPU significantly accelerated the deep learning tasks, the Intel i9 processor and 64 GB of RAM supported smooth execution of the various computational processes, and the fast SSD storage enabled efficient handling of the large datasets. Together, these resources provided the computational power required for the study.

Datasets
The proposed CLBSO algorithm is evaluated using seven benchmark cancer datasets sourced from the Curated Microarray Database (CuMiDa) repository [34]. The datasets encompass diverse cancer types and exhibit varying sample and gene quantities, thereby providing a comprehensive evaluation of the algorithm's feature selection efficacy. Table 1 displays the specifics of the datasets. The Pancreatic dataset (GSE16515) contains 54,676 genes across 52 total samples, divided into 41 training and 11 testing samples, with 36 cancer samples and 16 normal samples. This dataset particularly focuses on identifying the expression differences of the FKBP5 gene between pancreatic tumor and normal samples, noting higher FKBP5 expression in normal samples. The Liver dataset (GSE22405) includes 22,284 genes with a total of 48 samples, split into 38 for training and 10 for testing, equally divided between 24 cancer and 24 normal samples. It involves gene expression analysis in primary hepatocarcinoma tissue, with data sourced from the National Cancer Institute, NIH. The Lung dataset (GSE63459) features 24,527 genes and comprises 65 samples, 52 for training and 13 for testing, with 32 cancer and 33 normal samples. This dataset is noted for its mRNA expression data for Stage I Lung Adenocarcinoma and adjacent non-tumor tissues, characterized by genome-wide DNA methylation profiling. The Bladder dataset (GSE31189) contains 54,676 genes with a total of 92 samples, 74 for training and 18 for testing, including 52 cancer and 40 normal samples. This dataset focuses on differential gene expression analysis in exfoliated human urothelia from patients with bladder disease, validated using quantitative PCR. The Renal dataset (GSE66270) consists of 54,676 genes across 143 samples, divided into 115 training and 28 testing samples, with 71 cancer and 72 normal samples. It investigates gene expression profiling in human kidney cancer and benign tissues to understand the mechanisms of ccRCC progression and metastasis. The Gastric dataset (GSE19826) includes 54,676 genes with 27 total samples, 22 for training and 5 for testing, with 12 cancer and 15 normal samples. This dataset details global gene expression between Chinese gastric cancer and adjacent non-cancer tissues, identifying key differential expression genes. Finally, the Colorectal dataset (GSE75548) comprises 48,108 genes with 12 total samples, split into 10 for training and 2 for testing, with 6 cancer and 6 normal samples. It focuses on genome-wide methylation analysis and gene expression profiling of rectal cancer and paired normal samples, identifying 36 genes with inverse correlations between methylation and expression levels. Each dataset provides a unique perspective on cancer gene expression, with varying sample sizes and gene counts, facilitating diverse research opportunities in cancer classification, prognosis, and treatment strategies. The detailed descriptions highlight the specific focus and methodology of each dataset, underscoring their relevance and importance in the field of oncological research.

Dataset splitting strategy
To ensure the robustness and generalizability of the CLBSO algorithm, each dataset was divided into three separate subsets: training, validation, and testing. This splitting was done as follows:
• Training Set: Used for training the classifiers and selecting the optimal feature subsets. This set constitutes 60% of the total dataset.
• Validation Set: Employed during the feature selection process to tune hyperparameters and avoid overfitting. This set represents 20% of the dataset.
• Testing Set: Used for evaluating the final performance of the trained model with the selected features. This set makes up the remaining 20% of the dataset.
By employing this three-way split, we ensure that the feature selection process and the final model evaluation are based on completely separate data, mitigating the risk of biased performance results that can occur with k-fold cross-validation alone. Cross-validation is a commonly employed method in machine learning for evaluating the effectiveness of a model. This technique involves partitioning the dataset into several subsets and subsequently training the model on each of these subsets. In general, k-fold cross-validation is utilized, wherein the dataset is partitioned into k folds of equal size. The model is trained on k-1 folds, and the remaining fold is used for testing; this process is repeated k times. The model's performance is estimated by calculating the mean performance across the k iterations. The present study employs 10-fold cross-validation, whereby the dataset is partitioned into 10 mutually exclusive and equally sized subsets (folds), each comprising roughly 5 observations (with a 3:2 proportion of cancer to normal samples in each fold). Combining these strategies ensures a thorough evaluation of the CLBSO algorithm's performance, providing robust and unbiased results. The effectiveness of the CLBSO algorithm is assessed through a series of experiments across the seven cancer datasets. Initially, we conduct experiments without applying feature selection to establish a performance baseline. Subsequently, we apply the CLBSO feature selection method and evaluate its impact on classifier performance using the previously described dataset splits (training, validation, and testing sets).
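The splitting strategy above can be sketched as follows; the function names are illustrative assumptions, and only the 60/20/20 proportions and the 10-fold partitioning come from the text (a production version would also stratify by class label).

```python
import random

def three_way_split(indices, seed=0):
    """Shuffle sample indices and split them 60/20/20 into
    train/validation/test subsets, as described above."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    n = len(idx)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

def k_folds(indices, k=10):
    """Partition indices into k near-equal folds for cross-validation;
    each fold serves once as the held-out set."""
    return [indices[i::k] for i in range(k)]
```

Seeding the shuffle keeps the split reproducible, so repeated runs of the feature selector see identical training data.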

Parameters used for analysis
In this section, we detail the hyperparameters utilized in the proposed CLBSO algorithm and the dataset splitting strategy. These parameters were carefully selected to balance computational efficiency and the algorithm's performance. Table 2 lists these parameters along with their respective values and rationales.
The number of folds in cross-validation (k) was set to 10 to ensure a robust evaluation of the model and mitigate the risk of overfitting. A population size (N) of 50 was chosen to strike a balance between computational efficiency and the algorithm's ability to thoroughly explore the search space. The number of iterations (t_max) was set to 100 to allow the algorithm sufficient time to converge to an optimal solution while keeping the computational cost manageable. The initial probability (p) of selecting a feature was set to 0.5 to ensure a balanced initial selection of features.
In the context of genetic algorithms, a crossover rate of 0.8 and a mutation rate of 0.05 were used to promote effective exploration and maintain the integrity of promising solutions. The inertia weight (w) was set to 0.7, and the cognitive (c1) and social (c2) coefficients were both set to 1.5 to balance exploration and exploitation in Particle Swarm Optimization. The pheromone decay rate (ρ) was set to 0.1 to balance between retaining useful information and allowing for exploration in Ant Colony Optimization. The Levy flight step size (s) was set to 0.1 to facilitate effective exploration in the Global Search Phase, and a threshold (T) of 0.5 was used to convert continuous values to binary for feature selection.
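For reference, the reported hyperparameters can be collected into a single configuration mapping; the dictionary layout and key names are illustrative assumptions, while the values come from the settings described above.

```python
# Hyperparameters from Table 2, gathered into one configuration mapping.
CLBSO_PARAMS = {
    "k_folds": 10,            # cross-validation folds (k)
    "population_size": 50,    # swarm size (N)
    "max_iterations": 100,    # iteration budget (t_max)
    "init_probability": 0.5,  # initial chance a feature is selected (p)
    "crossover_rate": 0.8,    # GA comparison setting
    "mutation_rate": 0.05,    # GA comparison setting
    "inertia_weight": 0.7,    # PSO comparison setting (w)
    "cognitive_coeff": 1.5,   # PSO comparison setting (c1)
    "social_coeff": 1.5,      # PSO comparison setting (c2)
    "pheromone_decay": 0.1,   # ant evaporation rate (rho)
    "levy_step_size": 0.1,    # grasshopper step size (s)
    "binary_threshold": 0.5,  # continuous-to-binary cutoff (T)
}
```

Keeping all settings in one mapping makes it straightforward to log the exact configuration alongside each experimental run.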

Feature selection using CLBSO
This section presents a comprehensive evaluation of the CLBSO approach for feature selection through a series of experimental assessments. We begin by evaluating the efficacy of the CLBSO methodology across seven distinct datasets, initially without the application of feature selection. This evaluation sets the stage for a comparative analysis against results obtained post-application of the CLBSO feature selection process. Performance metrics such as Accuracy (Acc), Precision (Prec), Recall (Rec), and F-measure (F1) are compiled in Table 3 for an array of classifiers including Support Vector Machine (SVM), Multilayer Perceptron (MLP), Decision Tree (DT), Naïve Bayes (NB), Random Forest (RF), and k-Nearest Neighbors (KNN) across various cancer datasets, both with and without the feature selection intervention.

Evaluation without FS
Analysis of the data presented in Table 3 indicates diverse classifier performances across the cancer datasets in the absence of feature selection. Specifically, DT and KNN classifiers show variable stability, with KNN presenting notably lower accuracies within the Liver and Gastric datasets, and DT underperforming in the Lung dataset scenario. Meanwhile, SVM, MLP, and NB classifiers demonstrate intermediate levels of efficacy across different datasets. The imperative for implementing feature selection becomes evident through these observations for several reasons: (a) Disparity in classifier effectiveness: The variability in classifier performance across different cancer datasets, as shown in the table, suggests that classifiers may be adversely affected by the presence of redundant or irrelevant features. Implementing feature selection can mitigate this by discarding such features, thereby enhancing classifier efficiency.
(b) Model complexity and overfitting risks: Classifiers like Decision Trees and Random Forests are prone to overfitting, especially when dealing with data abundant in features. By adopting feature selection, we can decrease the dimensionality of datasets and simplify the models, leading to better generalization and enhanced performance on unseen data.
(c) Enhancement of interpretability: For complex cancer datasets, navigating a vast feature space can be daunting, complicating the understanding of results and key patterns. Feature selection pinpoints the most informative features, clarifying data interpretation and elucidating relationships between features and outcomes.
(d) Boosting computational efficiency: Training classifiers on high-dimensional datasets can be resource-intensive and time-prohibitive. Feature selection streamlines this by diminishing the computational load and shortening training durations for classifiers.

Evaluation with FS using CLBSO
Table 3 illustrates the significant performance improvements achieved through the integration of CLBSO-based feature selection, which positively affects most classifiers and datasets. Notably, the Random Forest (RF) classifier demonstrates enhanced accuracy, topping the charts across all seven datasets analyzed. The previously inconsistent k-Nearest Neighbors (KNN) algorithm now shows marked improvements in precision, most notably on the Pancreatic, Liver, and Renal cancer datasets. Decision Tree (DT) classifiers have also seen a rise in performance, particularly on the Lung dataset, upon the integration of feature selection via CLBSO. The improvement is not limited to these classifiers: SVM, MLP, and NB also exhibit elevated performance, attesting to the CLBSO method's broad applicability and effectiveness in boosting classifier outcomes, particularly in cancer classification tasks. Fig. 3 illustrates the significant accuracy improvements across classifiers after the application of feature selection using CLBSO. This evidence confirms that CLBSO-based feature selection improves classifier accuracy across the board, emphasizing the method's role in identifying the most relevant features for each specific cancer dataset. Classifiers such as SVM and RF show particularly noticeable gains, signifying their sensitivity to the quality and relevance of features, especially in complex datasets such as those for Pancreatic, Liver, and Lung cancers.
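The with/without-FS comparison above follows the usual wrapper-style pattern: the same classifier is scored once on the full feature set and once on the selected subset. A hedged sketch, with a random mask standing in for the subset that CLBSO would actually return:

```python
# Sketch of the wrapper-style evaluation behind Table 3. The random mask
# below is a placeholder for CLBSO's selected gene subset, not its output.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=300, n_informative=15,
                           random_state=1)

def subset_accuracy(mask):
    """Mean cross-validated RF accuracy restricted to the genes in `mask`."""
    clf = RandomForestClassifier(random_state=1)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

full_mask = np.ones(X.shape[1], dtype=bool)      # no feature selection
rng = np.random.default_rng(1)
selected = rng.random(X.shape[1]) < 0.1          # stand-in for a CLBSO subset

acc_full = subset_accuracy(full_mask)
acc_fs = subset_accuracy(selected)
```

With a real selector, `acc_fs - acc_full` is the per-dataset improvement reported in the table.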

Evaluation with traditional FS methods
This comparative analysis places the CLBSO technique alongside contemporary swarm intelligence algorithms such as Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), Grasshopper Optimization Algorithm (GOA), and the Firefly Algorithm (FF), across all datasets. Tables 4, 5, and 6 delineate the comparative performance, showcasing CLBSO's ability to consistently identify smaller yet optimal feature subsets, illustrating its precision in filtering out irrelevant or redundant features, thereby streamlining the classification models and reducing complexity. In contrast, ACO and PSO, while robust, tend to identify larger sets of optimal features, potentially indicating less precise feature discernment compared to CLBSO. This could stem from their inherent search strategies or the lack of an advanced learning mechanism akin to CLBSO's approach. GOA and FF improve on ACO and PSO but still do not match the efficiency and effectiveness of CLBSO, highlighting the importance of a comprehensive learning component in feature selection algorithms for complex data such as gene expression. The effectiveness of each classifier when paired with different optimization strategies is evaluated using Accuracy (Acc) and F-measure (Fm), where the F-measure serves as a balanced metric between precision and recall. Notably, the combination of CLBSO with RF emerges as particularly potent on the Pancreatic Cancer dataset, exhibiting exemplary accuracy and F-measure rates. This trend of CLBSO enhancing classifier performance continues across datasets, underscoring the optimization method's versatility and effectiveness. The comprehensive analysis reveals that the RF classifier, when optimized with CLBSO, consistently delivers robust performance across diverse cancer datasets. While other optimization techniques such as ACO, PSO, GOA, and FF show promise, they generally do not outperform CLBSO. These findings affirm the potency of the CLBSO optimization technique, particularly when combined with RF, as a formidable strategy for cancer dataset classification.

Evaluation with existing methods
Table 7 delineates the comparative analysis between the CLBSO approach and established methodologies: eXtreme Gradient Boosting combined with a Multi-objective Optimization Genetic Algorithm (XGBoost-MOGA) [35], the Improved Salp Swarm Algorithm (ISSA) [36], the Binary COOT (BCOOT) optimization technique [37], and the Self-adaptive Binary Cat Swarm Optimization (SBCSO) method [38]. The comparison centers on two key metrics, Accuracy (Acc) and F-measure (Fm), where higher values denote better algorithmic performance. Within this framework, the proposed CLBSO methodology consistently outperforms the established algorithms across a range of cancer datasets, registering the highest accuracy for the Pancreatic, Liver, Lung, Bladder, Renal, Gastric, and Colorectal cancer types. XGBoost-MOGA typically ranks second, followed by SBCSO, ISSA, and BCOOT. This assessment underscores the robust capability of the CLBSO approach for the precise classification of gene expression data in oncological studies.

Statistical analysis
Subsequent statistical evaluations, presented in Table 8, contrast the CLBSO algorithm against four prevailing methodologies (XGBoost-MOGA, ISSA, BCOOT, and SBCSO) across all considered cancer datasets. The analysis employs t-tests to determine the statistical significance of the accuracy differences between the methods on each dataset. The consistently superior performance of the CLBSO algorithm over the alternatives is statistically substantiated across all dataset types.
The derived p-values fall beneath the conventional significance threshold of 0.05, confirming that the performance differences are statistically significant; the t-test outcomes underscore the robustness and efficacy of the CLBSO approach for gene selection in oncological dataset analysis.
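A Table 8-style comparison can be reproduced with a paired t-test over per-run accuracies; the numbers below are illustrative placeholders, not the paper's results:

```python
# Paired t-test sketch for one dataset: CLBSO per-run accuracies vs. one
# baseline's. Values are hypothetical, chosen only to illustrate the test.
from scipy import stats

clbso_acc    = [0.97, 0.96, 0.97, 0.95, 0.96]
baseline_acc = [0.84, 0.85, 0.83, 0.86, 0.84]

# ttest_rel pairs the runs; H0: both methods have equal mean accuracy.
t_stat, p_value = stats.ttest_rel(clbso_acc, baseline_acc)
significant = p_value < 0.05
```

A paired (rather than independent) test is appropriate when both methods are evaluated on the same cross-validation folds or runs.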

Stability analysis
The Jaccard Index is employed to determine the stability of CLBSO by quantifying the degree of similarity between two sets. It can be applied to the gene selection problem by evaluating the similarity between gene subsets produced by distinct runs of the CLBSO algorithm on a given dataset. The Jaccard Index is calculated as J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are two gene subsets, |A ∩ B| is the number of genes common to both subsets, and |A ∪ B| is the total number of unique genes in the two subsets combined. The Jaccard Index ranges from 0 to 1, with 0 indicating no similarity and 1 indicating complete similarity between the two gene subsets. Table 9 showcases the Jaccard Index values across five separate runs of the CLBSO approach, alongside the computed average Jaccard Index. This analysis underscores the algorithm's consistent performance over the spectrum of analyzed cancer datasets. The observed average Jaccard Index ranges from 0.79 to 0.85, indicating significant consistency in the selection of gene subsets across different executions of the algorithm. The Liver dataset stands out with an average Jaccard Index of 0.85, showcasing notable stability in the gene subset selection process facilitated by the CLBSO method. Although the Gastric dataset registers the lowest average at 0.79, it still reflects a commendable level of selection stability. As depicted in Fig. 4, the CLBSO algorithm maintains a robust stability profile across all evaluated datasets, as reflected by the high Jaccard Index values. The Liver dataset evidences the highest stability, marked by an average index of 0.85, with the Bladder and Renal datasets following suit. Although the Gastric and Colorectal datasets show marginally reduced stability, they nonetheless hold average indices above 0.79, underscoring the dependable consistency of the CLBSO algorithm in selecting relevant feature subsets across iterations. These findings affirm the stability and reliability of the CLBSO approach in consistently identifying gene subsets, a crucial attribute for ensuring accurate and reproducible results in cancer diagnosis and prognostic analyses.
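The stability computation is straightforward to sketch: pairwise Jaccard similarity over the gene subsets from several runs, then the average (the gene names below are hypothetical examples):

```python
# Jaccard-based stability of a selector across runs, as in Table 9.
from itertools import combinations

def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| for two gene subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical gene subsets from five runs on one dataset.
runs = [
    {"TP53", "KRAS", "BRCA1", "MYC"},
    {"TP53", "KRAS", "BRCA1", "EGFR"},
    {"TP53", "KRAS", "MYC", "EGFR"},
    {"TP53", "KRAS", "BRCA1", "MYC"},
    {"TP53", "BRCA1", "MYC", "EGFR"},
]

# Average over all pairs of runs; 1.0 would mean identical subsets every run.
pairwise = [jaccard(a, b) for a, b in combinations(runs, 2)]
avg_stability = sum(pairwise) / len(pairwise)
```

Averaging over all run pairs (rather than against a single reference run) avoids privileging any one execution.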

Convergence analysis
This section compares the convergence efficiency of the CLBSO algorithm against recognized optimization counterparts: XGBoost-MOGA, ISSA, BCOOT, and SBCSO. The focus is on assessing how swiftly and effectively the proposed CLBSO strategy converges to the optimal solution, highlighting the impact of its integrated learning phase in guiding the search towards the most suitable outcome with enhanced speed.
The evaluation of the algorithms' convergence rates involves plotting the average fitness values against the iteration count for each method. Additionally, the average number of iterations required for each algorithm to reach a specified solution quality or fitness level is quantified. Table 10 presents the results of this convergence evaluation, detailing the average iteration counts the algorithms need to achieve the set fitness benchmarks across the datasets. Quicker convergence to optimal solutions, reflected by a reduced count of required iterations, is indicative of an algorithm's efficiency. The results in Table 10 show that CLBSO converges faster than the other algorithms on all datasets, indicating that the comprehensive learning phase effectively guides the search process, resulting in efficient exploration and exploitation of the solution space. In comparison, the other algorithms (XGBoost-MOGA, ISSA, BCOOT, and SBCSO) require more iterations to reach the same level of solution quality, implying a slower convergence rate and demonstrating the advantage of the comprehensive learning phase in the proposed CLBSO algorithm. Fig. 5 shows the convergence analysis of the different algorithms across the cancer datasets. The analysis of Fig. 5 reveals:
• Varied convergence trends across algorithms, with CLBSO generally requiring fewer iterations for convergence, indicating its efficiency.
• The behavior of each algorithm, including XGBoost-MOGA, ISSA, BCOOT, and SBCSO, varies across the cancer datasets, highlighting their adaptability and specificity to the data types.
• Certain algorithms demonstrate consistent performance across all datasets, whereas others show variable trends, suggesting differences in their optimization strategies and robustness.
• The line charts offer an insightful comparison of algorithmic efficiency in convergence, underscoring the potential of CLBSO in gene expression data analysis for computational biology and bioinformatics.
These visualized findings provide an in-depth comparative analysis of the convergence behaviors of various algorithms, emphasizing the strengths and uniqueness of the CLBSO algorithm in handling diverse cancer datasets.
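The iteration counts reported in Table 10 can be derived from each algorithm's best-fitness trace; a minimal sketch (the traces below are illustrative, not the paper's data):

```python
# Convergence measurement sketch: given a per-iteration best-fitness trace,
# count iterations needed to reach a target fitness. Fewer = faster.
def iterations_to_reach(trace, target):
    """1-based index of the first iteration whose fitness >= target."""
    for i, fitness in enumerate(trace, start=1):
        if fitness >= target:
            return i
    return None  # benchmark never reached within the budget

# Illustrative maximization traces for two algorithms on one dataset.
traces = {
    "CLBSO": [0.70, 0.85, 0.93, 0.96, 0.97],
    "ISSA":  [0.60, 0.72, 0.80, 0.88, 0.93],
}
counts = {name: iterations_to_reach(t, 0.90) for name, t in traces.items()}
```

Averaging these counts over repeated runs per dataset yields one cell of a Table 10-style summary.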
stability. This improvement underscores the algorithm's capability to enhance model performance by reducing dimensionality and eliminating redundant features. The stability analysis using the Jaccard Index confirms that CLBSO consistently selects similar feature subsets across different runs, ensuring reliable and reproducible results. This robustness is essential for applications in cancer diagnosis and prognosis, where consistency in feature selection can lead to better clinical outcomes. The CLBSO algorithm's adaptability to different datasets and its potential to integrate with other types of omics data suggest broader applicability in bioinformatics and computational biology. Future work should explore its utility in multi-omics data integration and other complex biological datasets. While the selected features require further biological validation, the results indicate that CLBSO can identify key genes relevant to cancer progression and treatment. Collaborations with biologists and medical researchers will be vital to translate these findings into actionable clinical insights. Despite these limitations, the proposed CLBSO algorithm represents a significant advancement in feature selection for gene expression data, balancing computational efficiency with high performance. Future research should address the identified limitations, explore its application to broader datasets, and enhance its interpretability for biological significance.

Conclusion
This study presented a novel Comprehensive Learning-Based Swarm Optimization (CLBSO) approach designed specifically for feature selection in gene expression data analysis. The CLBSO algorithm uniquely integrates the capabilities of ant and grasshopper swarms, creating an enhanced optimization method well suited to complex, high-dimensional search spaces. A key feature of the CLBSO methodology is its comprehensive learning phase, which effectively balances exploration and exploitation, leading to significant improvements in feature selection efficiency. Empirical evaluations demonstrate the effectiveness of the CLBSO algorithm in enhancing the performance metrics of various classifiers across multiple gene expression datasets, with the Random Forest (RF) classifier in particular showing consistent and superior performance. For instance, on the Pancreatic cancer dataset, CLBSO achieved an accuracy of 97.2%, significantly higher than XGBoost-MOGA's 84.0%. Overall, CLBSO achieved an average accuracy improvement of 15% over the original high-dimensional datasets and outperformed other feature selection methods by up to 10%. Comparative analyses indicate that CLBSO outperforms traditional swarm intelligence algorithms such as Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), the Grasshopper Optimization Algorithm (GOA), and the Firefly (FF) algorithm, as well as contemporary methods such as eXtreme Gradient Boosting with a Multi-objective Optimization Genetic Algorithm (XGBoost-MOGA), the Improved Salp Swarm Algorithm (ISSA), Binary COOT (BCOOT), and Self-adaptive Binary Cat Swarm Optimization (SBCSO). The stability of the proposed algorithm, assessed using the Jaccard Index, confirms its robustness and reliability in consistently identifying gene subsets, highlighting its applicability to cancer classification and prognosis. Additionally, the convergence efficiency of CLBSO surpasses that of the other methods, underscoring the importance of its comprehensive learning phase in achieving optimal solutions more rapidly. The ablation study further emphasizes the critical role of this learning phase, with its absence leading to significant performance declines.
In future work, we aim to expand the CLBSO framework to address multi-objective optimization problems, thereby increasing its applicability to real-world scenarios. Additionally, we plan to explore the integration of gene expression data with other omics data types, such as proteomics and metabolomics, to provide a more comprehensive biological understanding and improve the accuracy of disease diagnosis and prognosis.

Fig. 1 .
Fig. 1. Architecture of the proposed Comprehensive Learning-Based Swarm Optimization (CLBSO) model. The diagram details the Initialization Phase, Local Search Phase, Global Search Phase, and Comprehensive Learning Phase, illustrating the flow and interactions within the algorithm.

Fig. 3 .
Fig. 3. Accuracy comparison for different classifiers on multiple cancer datasets, with and without feature selection using CLBSO. Each panel represents a specific classifier: (a) SVM, (b) MLP, (c) DT, (d) NB, (e) RF, and (f) KNN. The charts illustrate the improvements in accuracy achieved through the application of CLBSO across various cancer datasets, showing the effectiveness of feature selection in enhancing classification performance.

Fig. 4 .
Fig. 4. Stability of CLBSO using the Jaccard Index across various cancer datasets. Each line represents the Jaccard Index values for different runs and the average value for each dataset.

Fig. 5 .
Fig. 5. Convergence analysis of different algorithms across cancer datasets. Each panel represents the mean number of iterations required for convergence by a specific algorithm: (a) CLBSO (proposed), (b) XGBoost-MOGA, (c) ISSA, (d) BCOOT, and (e) SBCSO. The comparison across various cancer datasets illustrates the efficiency of each algorithm in achieving convergence, highlighting the performance differences in terms of computational iterations.

Table 1
Benchmark cancer datasets from the CuMiDa repository.

Table 2
Parameters used for analysis.

Table 3
Performance evaluation results for various classifiers on multiple cancer datasets without FS and with FS using CLBSO.

Table 4
The number of optimal feature subsets obtained through various methods.

Table 5
Performance evaluation results of the proposed CLBSO algorithm with other swarm algorithms for Pancreatic, Liver, Lung and Bladder cancer datasets.

Table 6
Performance evaluation results of the proposed CLBSO algorithm with other swarm algorithms for Renal, Gastric and Colorectal Cancer datasets.

Table 7
Comparison of proposed CLBSO with existing methods.

Table 8
Comparison of t-test and p-values of proposed CLBSO against existing approaches for all the datasets.

Table 9
Stability of CLBSO using Jaccard Index.

Table 10
Convergence analysis of CLBSO and other algorithms.