Metaheuristic integrated machine learning classification of colon cancer using STFT LASSO and EHO feature extraction from microarray gene expressions

The microarray gene expression data poses a tremendous challenge due to their curse of dimensionality problem. The sheer volume of features far surpasses available samples, leading to overfitting and reduced classification accuracy. Thus the dimensionality of microarray gene expression data must be reduced with efficient feature extraction methods to reduce the volume of data and extract meaningful information to enhance the classification accuracy and interpretability. In this research, we discover the uniqueness of applying STFT (Short Term Fourier Transform), LASSO (Least Absolute Shrinkage and Selection Operator), and EHO (Elephant Herding Optimisation) for extracting significant features from lung cancer and reducing the dimensionality of the microarray gene expression database. The classification of lung cancer is performed using the following classifiers: Gaussian Mixture Model (GMM), Particle Swarm Optimization (PSO) with GMM, Detrended Fluctuation Analysis (DFA), Naive Bayes classifier (NBC), Firefly with GMM, Support Vector Machine with Radial Basis Kernel (SVM-RBF) and Flower Pollination Optimization (FPO) with GMM. The EHO feature extraction with the FPO-GMM classifier attained the highest accuracy in the range of 96.77, with an F1 score of 97.5, MCC of 0.92 and Kappa of 0.92. The reported results underline the significance of utilizing STFT, LASSO, and EHO for feature extraction in reducing the dimensionality of microarray gene expression data. These methodologies also help in improved and early diagnosis of lung cancer with enhanced classification accuracy and interpretability.


Review of previous work
There are various research works performed in the literature based on dimensionality reduction and feature extraction methods on microarray gene data for various cancer databases.The selection of methodology to perform dimensionality reduction and feature extraction is significant in reducing the data volume and extracting relevant features and patterns.This selection aids the upcoming classification phase to improve the classification accuracy and reduce the overfitting problems.The diverse feature extraction/dimensionality techniques with classification, pronounced in the literature are provided in Table 1.
Here are certain limitations and research directions observed in the diverse methodologies listed in Table 1.In Liu et al. 16 , low accuracy is obtained to more number of redundant features.Islam et al. 17 and Xiao et al. 18 reported high accuracies due to the applied deep learning methods.However, deep learning methods incurs high computational complexity.The proposed methods by You et al. 19 are not suitable for datasets that are highly nonlinear in nature.Bonev et al. 20 and Xu et al. 21showcased efficient feature extraction techniques but there is only limited exploration of diverse classifiers.Torkey et al. 22 performed survival tests instead of classification and obtained good accuracies in the range of 98%.Abdulla et al. 23 have proposed a cost sensitive feature selection method with accuracy in the range of 95%.Li et al. 24 explored a variety of regression techniques, and reported 84% of accuracy due to the minimum search space exploration due to LASSO and LDA regressions.Zhang et al. 25 achieved an accuracy of 90% but the methodology is sensitive to outliers and missing of sensitive information while handling nonlinear data.Subhajit et al. 26 and Nursabillilah et al. 27 attained an accuracy of 95%, but slight traces of overfitting are reported by the adopted metaheuristic methods.Wang et al. 28 and Aziz et al. 29 proposed methods that reported accuracies over 90%, but the methodologies adopted in these research works are not suitable for highly nonlinear datasets.Yaqoob et al. 30 proposed Sine Cosine and Cuckoo Search Algorithm for feature selection and classification of Breast cancer.Low number of prominent genes (30) is selected to attain a classification accuracy of 99%.For colon cancer classification, Joshi et al. 31 utilized Cuckoo Search and Spider Monkey Optimization as feature selection.The reported accuracy of the research consisting of 2000 genes is only www.nature.com/scientificreports/92%.Along with these benchmark literature works, the research also follows some of the insights provided by Arowolo et al. 32 that investigate various feature Extraction methods to classify Colon Cancer from microarray Data.
Based on the insights gathered from existing literature, this study proposes the utilization of STFT (Short Term Fourier Transform), LASSO (Least Absolute Shrinkage and Selection Operator), and EHO (Elephant Herding Optimization) as feature extraction and dimensionality reduction techniques.The research also compares the traditional machine learning classifiers such as DFA, NBC, GMM, and SVM (RBF) with integrated metaheuristic classifiers including PSO-GMM, Firefly-GMM, and Flower Pollination Optimization with GMM.The materials and methodology are discussed in the upcoming section.

Dataset
In this research, we have used a publicly accessible dataset to classify cases of colon cancer, provided by Alon et al. 33 .The dataset contains 2000 genes.There are 62 samples total; Class 1 represents the tumour class with 40 samples, and Class 2 represents the healthy class with 22 samples.Table 2 summarises the dataset's details in tabular form.

Workflow of the research
The microarray gene expression data generation involves numerous steps, starting with obtaining an RNA sample from the suspected colon cancer area cells.As explained by Sakyi et al. 34 , isolating RNA uses techniques like phenol-chloroform extraction.After obtaining the RNA sample, a process called reverse transcription is performed to convert RNA into complementary DNA (cDNA).The cDNA will be labelled with fluorescent dyes and other markers.The labelled cDNA is hybridized onto microarray slides containing thousands of known DNA sequences that correspond to genes of suspected colon cancer area.After hybridization, the microarrays are scanned to detect the fluorescence signals, which indicate the abundance of each gene's mRNA in the original sample.At last, the raw fluorescence intensity data are processed and analyzed to generate a microarray gene expression matrix.The rows of the microarray gene expression matrix represent a gene and each column represents a sample.The microarray gene expression matrix provides valuable insights into the expression levels of thousands of genes simultaneously.It helps to identify significant patterns and differences in gene expression across diverse conditions and experimental groups.The major steps involved in the generation of microarray gene expression data along with the overall workflow of the research work are illustrated in Fig. 1.In the next section, the details about various feature extraction methods adopted in the research are discussed.

Feature extraction methods
The feature extraction in this research is performed by utilizing the methods detailed in STFT (Gupta et al. 35 ), LASSO (Tibshirani et al. 36 ) and EHO feature extraction techniques (Wang et al. 37 ).The STFT is nothing but a windowed Fourier transform that is used to capture time-varying frequency components in gene expression data.It enables the analysis of dynamic gene expression patterns over time intervals.LASSO employs regularization to simultaneously perform feature selection and data shrinkage.This technique effectively identifies the subset of genes with significant expression levels and mitigates the overfitting problems during microarray data analysis.On the other hand, EHO is a metaheuristic approach that is inspired by the herding behaviour of elephants.This nature-inspired extracts and also optimizes the features in the gene expression data, providing unique insights into gene interactions.

Feature extraction using STFT
A frequency domain study of the data over a brief time span is called the Short-Time Fourier Transform (STFT).By extracting pertinent and helpful features, the STFT can reduce the data dimension and extract relevant features when applied to microarray gene data.Gupta et al. 38 have used STFT to do QRS Complex Detection.According to the authors, STFT is helpful in giving researchers a time-frequency representation of the data, enabling them to examine how variations in gene expression levels occur over various frequency components and over time.With STFT, a localised representation of frequency content over time can be obtained by identifying particular time intervals where particular frequency components are prominent.This STFT feature is helpful in finding genes that show temporal patterns active during particular time intervals for microarray gene expression research.Additionally, by identifying the crucial genes or gene clusters connected to particular frequency components, STFT lowers dimensionality.The Blackman window 39 that reduces spectral leakage serves as the windowing function for STFT calculations.Therefore, by identifying the frequency patterns and underlying correlations in the dataset, STFT can offer biological insights.The computation of STFT is expressed as Here, x [n] represents the input data having length N with n = 0, 1, 2, … N − 1.The STFT window is represented by w[n] having length M with m = 0, 1, 2, … M − 1, j = √ (− 1) denotes the complex number.The Blackman window is given by the expression, In essence, STFT would be able to extract features from the microarray gene expression data and capture significant time-varying patterns that allow the detection of dynamic gene expression changes crucial for understanding inherent biological processes.However, STFT often exhibits its inability to effectively handle highdimensional data and multicollinearity.This can lead to inadequate performance or overfitting issues.Here we discuss the significance of LASSO which can overcome these issues by imposing a penalty on the absolute size of regression coefficients.
(1)  40 .It works by minimizing the sum of squared residuals while imposing a constraint on the sum of the total values of the regression constants.The estimated regression coefficients in LASSO can be characterized as follows: Here 'n' represents the number of observations in the dataset and 'p' represents the number of predictor variables or features in the dataset.The variable x i is the observed value in the ith observation, y ij is the predicted value in the ith observation and jth prediction, β j represents the regression coefficient associated with the jth predictor.The regularisation values are represented by, a dimensional vector that is also designated as the penalty term.The variable contains the LASSO estimation values for the slope coefficients.It is reasonable to assume that the response variable has a mean of 0 without sacrificing generalizability.Also, the covariates act as predictors having a variance of 1 and a standardised mean of 0. When the L1 regularization is applied to the constants, a feature emerges: the coefficients tend to approach zero as λ grows, and for sufficiently large λ, some are precisely reduced to zero.Because of this special quality, the LASSO can choose models and produce predictable results.Thus, LASSO has effectively performed the feature selection by reducing the risk of overfitting in microarray gene expression data analysis.However, LASSO tends to select only a subset of features, potentially discarding relevant predictors that are correlated with the selected ones.Here, an Elephant Herding Optimization (EHO) can address this limitation by offering a nature-inspired optimization approach that explores a wider range of feature www.nature.com/scientificreports/combinations, potentially capturing synergistic interactions among genes and providing a more comprehensive understanding of the underlying biological mechanisms in microarray gene expression data.

Feature extraction using elephant heard optimization
EHO is grounded on the food searching behaviour of elephants, is introduced in Wang et al. 41 .In the EHO metaheuristic algorithm, the location of each elephant i is iteratively updated as follows.
Here p best is the global best and ϒ indicates the control variable, ϒ ∈ [0, 1], rand 1 refers to the random number, rand 1 ∈ [0, 1].The global best, p best is updated iteratively in the following manner.
where δ ∈ [0, 1] is another control parameter.In addition, the worst position is changed according to the following equation.
Here p min and p max mentions the minimum and maximum position values available in the solution space.In this way using the EHO's exploration of the solution space beyond sparse representations helps in mitigating the risk of overlooking to the important features, but at the same time optimizing the model performance to avoid overfitting problems during classification.In the next section, a statistical analysis of the extracted features obtained from STFT, LASSO, and EHO is analysed in detail.

Statistical analysis of the extracted features using STFT, LASSO, and EHO
To ensure that the feature extracted data reduces the dimensionality and retains the significant characteristics of the original microarray data, we delve into a comprehensive statistical analysis.For both the normal and cancerous classes, a comparison of the extracted features is performed through various statistical parameters: Mean, Variance, Skewness, Kurtosis, Pearson Correlation Coefficient (PCC), and Canonical Correlation Analysis (CCA).The analysis presented in Table 3, gives a clear picture of whether the adopted feature extraction methods can preserve the crucial properties of the microarray genes data within each class.The analysis serves as the validation for the subsequent classification performance and interpretations.
The statistical parameter 'Mean' is the average value of the set of data values in the dataset.It represents the average expression levels of genes across different samples.Due to the detailed frequency information and mathematical characteristics (without scaling and normalization), STFT provides very high mean values of 6.43E + 03 and 7.29E + 03 compared to LASSO and EHO.The mean values of LASSO and EHO are low because these are normalized methods with sparsity and compression.Variance measures the spread or dispersion of the values in the dataset.In gene expression, variance indicates how much individual expression levels deviate from the mean expression level.Similar to the mean values, the variance of the STFT data is very high when compared to LASSO and STFT methods.High variance suggests that the expression levels vary widely across samples, while low variance indicates that the expression levels are similar across samples.
Skewness measures the asymmetry of the distribution of values compared to normal distribution in the dataset.A positive skewness indicates that the distribution is skewed to the right (i.e., the tail of the distribution extends more to the right), while a negative skewness indicates that the distribution is skewed to the left.From Table 3, the STFT is a right-skewed distribution, meaning there are more samples with higher expression levels (4)  PCC measures the linear relationship between two variables.In gene expression analysis, PCC is used to quantify the degree of linear association between the expression levels of two genes across different samples.A value close to 1 indicates a strong positive correlation, a value close to − 1 indicates a strong negative correlation and a value close to 0 indicates no linear correlation.For both the normal and cancer classes, STFT feature extraction shows a weak positive linear relationship, LASSO exhibits a weak negative linear relationship, and EHO shows no linear relationship.The CCA is a multivariate statistical technique used to analyze the relationship between two sets of variables.In gene expression analysis, CCA is used to identify linear relationships between two cancer classes.CCA finds linear combinations of the features that maximize the correlation between the two sets of cancer classes.A positive correlation among the classes is observed in STFT, and LASSO methods, whereas a weak positive correlation is reported in the EHO feature extraction method.The CCA is thus a crucial statistical parameter for understanding the associations between different cancer class feature that can help to identify patterns and underlying biological processes in the data.All these observations can bevisualized in the Fig. 2 which shows the data distribution of STFT, LASSO and EHO feature extraction methods.
Up next, we compare the two class groups in our dataset using the Viloin plot shown in Fig. 3 to compare the distributions of numeric data across the two groups.A Violin plot that combines aspects of a box plot (median, quartiles, and outliers) with a kernel density plot (which shows the distribution of the data).The width of the violin at any given point represents the probability density of the data at that value.Therefore, wider sections indicate a higher probability density, while narrower sections indicate a lower probability density.
The violin plots are the same for STFT and this indicates that the distribution of features extracted using STFT is similar across both classes (normal and cancer).This suggests that STFT may sometimes fail to capture the differences in gene expression patterns between normal and cancer samples.The spread of the violin plot with more width for LASSO suggests that there is more variability in the distribution of features extracted using LASSO across both normal and cancer classes.This implies that LASSO is capable of capturing a wider range of gene expression patterns.Also, the normal class has more violin length compared to the cancer class reports that there may be more variability in the normal samples compared to the cancer samples for the features extracted using the LASSO method.In EHO, there is a spread in the width of the violin in certain regions suggesting variability in the distribution of features extracted using EHO across both normal and cancer classes.The non-overlapping of violin lengths portrays the significant differences in the distribution of features extracted using EHO between the two classes.This reveals that EHO is effective in capturing distinct gene expression patterns associated with normal and cancer samples.Thus, from all these statistical parameters it is clear that the extracted features are extremely relevant and significant for the upcoming cancer classification step.

The classification approach
Once the features are extracted, the colon cancer classification is performed using machine learning and metaheuristic classifiers.In our research, we utilize a diverse set of seven classifiers to cast a wide net for the most suitable approach.The machine learning classifiers utilised are GMM, DFA, NBC, and SVM (RBF).These are probabilistic modelling used to identify clusters of similar gene expressions and patterns within the data.Further to uncover the potentially hidden dynamics inside the data, we employ the metaheuristic integrated machine learning classification namely PSO-GMM, Firefly-GMM, and FPO-GMM.By utilizing a diverse range of algorithms, we aim to capture the multifaceted nature of the data and identify the most accurate and robust approach for colon cancer classification.

Gaussian mixture model (GMM)
The Gaussian Mixture Model is a popular unsupervised learning technique that groups together similar data points to perform tasks like image classification and pattern recognition.A set of Gaussian distributions are joined linearly in the PDF (Probability Density Function) of GMM, which makes the classification of the data easier.So the mixture classifier creates a probability distribution from microarray gene expression levels for both the classes with the help of a combination of Gaussian distributions.After probability distribution, the class prediction is based on the highest probability value (Bayes' theorem).For every class 'c' , and feature 'f ' , GMM assumes that the data is drawn from a mixture of 'G' Gaussian coefficients.This can be expressed in the following way: Here M cg , ϒ cg , C• cg indicates coefficient of mixture, mean vector, and covariance matrix respectively, under mixture component 'g' in class 'c' .The M cg signify the ratio of each constituent in the class.From training data, M cg , ϒ cg , C• cg parameters are gained for each constituent in the class with 'N' as the total number of samples, total classes 'C' , and i = 0,1,2,…N.
Also, P(y = c) indicating prior probability is also obtained for every class 'c' .P(y = c) is given as: Here 'D' represents the dimensionality of the data, that is nothing but the number of features.

Particle swarm optimization-GMM (PSO-GMM)
PSO utilizes the collective intelligence of a "swarm" to optimize the classification performance.Consider a flock of birds searching for food, constantly refining their flight paths based on individual discoveries and interactions.PSO can effectively perform classification by searching for the most clustered and hidden subset of genes from the large pool of gene expression data.Moreover, PSO can handle the complex and non-linear relationships among feature-extracted gene expressions.A combination of PSO with GMM (PSO-GMM) can intricate relationships that might be missed by static GMM.The technique potentially provides a more accurate cluster identification and classification.Here's the crux of PSO-GMM classification, Nair et al. 42 .
where v i (t) is the velocity of particle i at iteration t, w is the inertia weight, c 1 and c 2 are constriction coefficients, r 1 and r 2 are random numbers between 0 and 1, pbest i is the best position found by particle i so far, gbest is the best position found by any particle in the swarm, x i (t) is the current position of particle i.
After application of PSO, the search space will be optimized and data distribution will be changed.Now, PSO-GMM follows the expressions from Eqs. (8-13) to perform the final classification.PSO-GMM is a methodology that can unlock hidden patterns in gene expression data and improves classification accuracy by avoiding overfitting issues.

Firefly GMM
The Firefly Algorithm (FA) is a nature-inspired metaheuristic that mimics the flashing behaviour of fireflies to solve optimization problems.Consider a firefly flitting through the night, its brightness representing its "fitness" in finding mates or food.So, brighter fireflies attract others and guide them towards better locations.Using this behaviour, FA can segregate most of the relevant features and escape from the local optima of the microarray data.This exploration and search mechanism of the solution space results in reliable and robust classification outcomes.Here's the simplified representation of the attraction between fireflies: where I i is the attractiveness, I i 0 is the initial attractiveness, γ is the light absorption coefficient, r ij is the distance between fireflies i and firefly j.Fireflies move towards brighter neighbours, gradually refining their positions towards better solutions.The movement of a firefly i towards a brighter firefly j is determined by the attractiveness and randomness given by  www.nature.com/scientificreports/Here, x i t+1 is the new position of firefly i at time step t + 1, x i t is the current position of firefly i at time step t, β is the attraction coefficient, α is the randomization parameter, and rand() ∈ [0, 1], is a generated random number.
The collaboration of Firefly with GMM can potentially uncover intricate relationships in gene expression data that might be missed by standard GMM.Firefly combined with GMM can also overcome the limitations of the dimensionality problem.The Firefly GMM thus gives a clustered and segregated data distribution to GMM for classification by varying the mean, variance and other statistical parameters of the extracted features.

Detrended fluctuation analysis (DFA)
Detrended fluctuation analysis (DFA) efficiently solves problems involving complex and non-stationary data.DFA captures both short and long-range correlations based on how the data fluctuates around the data distribution with the help of a scaling exponent to classify the data.The scaling nature of DFA algorithm is described by the root mean square fluctuation of the integrated time series and detrended input data.For DFA, the inputs are analysed in the following way.
where B(i) and B are the ith sample of the input data and mean value of the input data respectively, thus y(k) denotes the estimated value of the integrated time series.Now, the fluctuation of integrated time-series and detrended data for a window with the scale of 'n' is determined by Here, y n (k) is the kth point on the trend computed using the predetermined window scale, and 'N' is the normalization factor.
Therefore using Detrended Fluctuation Analysis (DFA) as a classifier for microarray gene expression data can effectively capture the long-range correlations present in microarray gene expression data.Unlike traditional methods focusing on short-range correlations, DFA evaluates the scaling behaviour of fluctuations across different time scales.This capability is particularly valuable in gene expression data analysis, where genes may exhibit complex patterns of co-regulation and interactions across various biological processes and time scales.

Naive Bayes classifier (NBC)
NBC is a probabilistic classification algorithm based on the Bayes theorem and feature independence assumption.This straightforward assumption allows Naive Bayes classifiers to efficiently handle large feature spaces, making them computationally efficient and scalable for microarray data analysis.NBC starts by calculating the posterior probability of a class using the prior probability and likelihood.For a given class C and extracted features x 1 , x 2 … x n , posterior probability p(C|x 1 , x 2 , . . .., x n ) is expressed in Fan et al. 43 as: For class C, p(C) represents the prior probability, p(x 1 , x 2 , . . .., x n |C) is the likelihood, and p(x 1 , x 2 , . . .., x n ) is the evidence probability.As mentioned, in the Naive Bayes approach, the features are conditionally independent for the class.This assumption simplifies the calculation of the likelihood as follows: where p(x i |C) is the probability of feature x i of the class C. The (x i |C) is estimated from the fraction of class C training examples with the feature value x i .Then the prior probability (C) can be estimated as the fraction of training examples belonging to class C. Finally, to predict the class label for the features x i , the algorithm cal- culates the posterior probability for each class and assigns the instance to the class with the highest probability.Thus using a Naive Bayes classifier for microarray gene expression data can efficiently handle large amount of data because they assume independence between features given the class label, which greatly reduces computational complexity.

Support vector machine (radial basis function)
Support Vector Machine with a Radial Basis Function (SVM RBF) can handle complex, non-linear relationships between gene expression levels.SVM can effectively handle non-linear separability by mapping the input data into a high-dimensional feature space, where non-linear relationships become linearly separable.This is possible by constructing decision boundaries between normal and cancerous samples by handling non-linear relationships effectively.RBF is the kernel used in SVM to perform the nonlinear mapping of the input features into a higher-dimensional.The RBF kernel K RBF (x, z) that is used to compute the similarity between feature vectors in the input space is given by: ( 16) where σ is the kernel width parameter that regulates the influence of each training sample, and |x − z| is the Euclidean distance between feature extracted vectors 'x' and 'z' .The SVM RBF classification decision function is also described as a linear combination of the kernel evaluations between the input feature vector and the support vectors with the bias term, as performed by Zhang et al. 48.The decision function f RBF (x) is given by: Here y i , and α i are Lagrange multipliers and class labels respectively associated with each support vector.The K RBF (x, z) computes the computes the similarity or distance between two feature-extracted vectors x i and x .b is the bias term that shifts the decision boundary away from the origin allowing the SVM to classify data points that may not be separable by a hyperplane in the original feature space.

Flower pollination optimization (FPO) with GMM
FPO is a metaheuristic optimization algorithm inspired by the pollination behaviour of flowering plants.In nature, pollination happens through two major forms: abiotic and biotic.The biotic cross-pollination is observed in 90% of the cases where the insects are supporting the pollination process.The insects thus take long distance steps, in which the motion can be drawn from a levy distribution.The FPO starts with the discovery of solution space and subsequent movements.The insect motion and the levy distribution are expressed in the following steps as provided in Yang et al. 49 .where x i t is a pollen representing the solution vector x i at the t th iteration, and g best is the global best solution discovered so far.The step size is denoted by factor, and L( ) is the Levy flight step size denoting the success of pollination.
The Levy distribution is given by: Where Ŵ( ) the step size for is large steps where (s > 0) , drawn from a gamma distribution.However, for smaller pseudo-random steps, that correctly follow Levy distribution, the s value is drawn from two Gaussian distributions U and V (Mantegna algorithm 50 ), Here, U is drawn from a normal distribution with mean = 0 and variance = σ 2 , and V is drawn from a normal distribution with mean = 0 and variance = 1.The variance is given by The remaining 10% of the abiotic pollination observed in the plant and flower community is regarded as a Random Walk, that sometimes brings out solutions from the unexplored search space x i .
Here, x i t , and x k t are considered to be pollen transformations within the same plant species, from the same population.ε is the step for the random walk drawn from the uniform distribution [0,1].
Thus, FPO allows for a global search of the solution space, which is crucial for effectively exploring the highdimensional feature space of gene expression data.FPA identifies the most relevant genes that can contribute to the classification task and reduces the dimensionality of the feature extracted data.FPA also changes the means, covariance, and mixing coefficients of data distribution, so that GMM can perform better with the optimized data that contains complex and non-linear relationships among genes.The next section discusses the selection of classifier targets for the Adeno and Meso classes.

Selection of target
Selecting the target is a crucial step in defining the objective of the classification methodology.The clear definition of target improves the classifier model's predictive power by focusing on the most informative aspects of the data.The noise, outliers, and nonlinearity aspects of the data decide the selection of classifier targets.Moreover, the dataset used in this research is imbalanced and selecting the target must be strategized to deliver the maximum classifier performance.The binary classification performed in this research where the classification of feature extracted microarray data samples are classified into normal and colon cancer classes.Therefore two targets are selected namely T N and T C .For a normal class feature set of N elements, the class target for the Normal class is defined as: where M i is the average of the feature extracted vectors of the Normal class, and T N target follows the constraint

Training and testing of classifiers
Before moving to the final classification step, training of classifiers is performed utilizing the labelled microarray dataset to adjust the classifier model's parameters.This step enables the learning of complex patterns and relationships in the gene expression data and optimizes the classifier's performance by minimizing the discrepancy between predicted and actual outcomes.After training, testing evaluates the trained classifier's performance on an independent dataset not used during training.So the testing phase assesses the model's generalization ability and provides insights into its effectiveness on other datasets.Mean Square Error (MSE) serves as a critical evaluation metric during both the training and testing phases.In training, MSE quantifies the disparity between predicted and actual gene expression values, guiding the optimization process to minimize prediction errors.During testing, MSE provides insights into the model's predictive accuracy and its ability to generalize to unseen data.Minimizing MSE ensures that the classifier effectively captures the underlying relationships within the gene expression data, enhancing its predictive performance.The MSE is given by calculating the average squared difference between the predicted values and the actual values in a dataset.
Here 'N' is the total number of extracted features, A j represents the actual gene expression value of jth instance, and P j represents the predicted gene expression value of the jth instance.A lower MSE indicates bet- ter performance, as it signifies smaller discrepancies between predicted and actual values.Conversely, a higher MSE suggests poorer performance and potentially larger prediction errors.Thus, MSE acts as a parameter that continuously checks the performance of the classifier.
We also perform K-Fold Cross Validation as performed in Fushiki et al. 44 to validate the classifier model.In this approach, the dataset is divided into K subsets (folds), with each fold serving as a validation set while the remaining data is used for training.This process is repeated K times, ensuring that each data point is used for validation exactly once.By averaging the performance metrics across multiple validation iterations, K-Fold Cross Validation provides a more reliable estimate of the classifier's performance and helps mitigate overfitting, thus enhancing the generalization ability of the model.In this research, we have varied K value in the range 5-20, and is finally the value is fixed with 10, as higher values are providing similar results.
In this research for the binary classification problem of colon cancer data, the confusion matrix is described as shown in Table 4.It is used to describe the performance of a classification model during the training and testing phase.The confusion matrix has four possible combinations of predicted and actual class labels: True Positives (TP), True Negatives (TN), False Positives (FP), and False negatives (FN).These values help to evaluate the overall classifier performance metrics like Accuracy, F1 Score, Error Rate, MCC, and Kappa.
Table 5 shows the obtained training and testing MSE of classifier for various feature extraction methods.The training MSE is reported between 10 −01 and 10 −9 . The testing MSE is reported between 10 −5 to 10 −8 .The training process is evaluated with 2000 iterations.The FPO-GMM with STFT feature extraction attained the lowest training and testing MSE of 7.29 × 10 -09 and 6.44 × 10 -07 , respectively.For LASSO feature extraction, SVM (RBF) ( 28) Based on the observations from Table 5, the classifier parameters are selected for the various employed classifiers in this research as provided in Table 6.

Results and discussion
From the gene expression feature extracted data, 85% are used for training, and the remaining 15% are used for testing of models.The confusion matrix is a useful tool for assessing a machine learning model's performance in binary classification situations.The performance metrics for the binary classifier are based on the confusion matrix provided in Table 4. Through a comprehensive comparison across various feature extraction and metaheuristic and machine learning integrated classification algorithms, we strive to pinpoint the method that delivers the most accurate and consistent colon cancer classification based on gene expression data.

Classifier performance
As previously indicated, the confusion matrix lists the model's predictions in comparison to the actual labels of the data.By examining the values in the confusion matrix, performance metrics including accuracy, precision, F1 score, MCC, error rate, and kappa are obtained.These metrics offer valuable insights into the sensitivity, specificity, and overall effectiveness of each classification technique.The classifier performance metrics and their expression from the confusion matrix are listed in Table 7.
A classifier's accuracy is determined by how well it recognises the class labels in a dataset.It is obtained by dividing the total number of examples in the dataset by the number of successfully identified instances.A classifier's accuracy can be further gauged by its F1 score, which combines recall and precision into a single statistic.The F1 score is computed as the harmonic mean of precision and recall.The value of the F1 score ranges from 0 to 1, with 1 denoting perfect precision and recall.Thus, F1 score is more considered over the accuracy, especially in the case of imbalanced datasets.The error rate of a classifier is the proportion of misclassified instances.MCC stands for "Matthews Correlation Coefficient", which measures the quality of binary classification models.It considers true and false positives and false negatives which is particularly useful in situations where the classes are imbalanced.The kappa statistic, also known as Cohen's kappa, measures agreement between two raters or between a rater and a classifier.The agreement between the true and projected classes is measured using the kappa statistic, which takes into account the likelihood that the agreement is coincidental.The Kappa coefficient = 1 denotes perfect agreement, and any value below one can be interpreted as follows: There are five levels of agreement: poor (< 0.2), fair (0.2-0.4), moderate (0.4-0.6), good (0.6-0.8), and very good (0.8-1).The performance analysis of classifiers with STFT feature extraction is first discussed in Table 8.
From the reported results in Table 8, it is evident that FPO-GMM has performed the best among all the other classifiers taken to consideration.An accuracy of 90.32% indicates that the classifier is performing well in correctly classifying instances.The high F1 Score of 92.5% suggests a good balance between precision and recall, which is crucial for imbalanced datasets.MCC of 0.7886 is also quite high, indicating a strong correlation between predicted and true classifications.The Kappa value of 0.7886 indicates that classifier's predictions are consistent with the true classifications, considering the potential for randomness in the classification process.PSO-GMM has reported better performed better when compared to Firefly-GMM due to their differences in their optimization capabilities and convergence behaviour.PSO is likely to explore the solution space more effectively, leading to higher performance metrics compared to the Firefly Algorithm.PSO's ability to efficiently search for optimal solutions contributes to better classification accuracy and an MCC of 0.6273 reflects a stronger agreement between predicted and true classifications.In contrast, Firefly Algorithm might struggle with convergence or effective exploration, resulting in lower performance metrics across the board.SVM (RBF) achieved the second highest accuracy of 87.0967, F1 Score of 90, MCC of 0.7181, and Kappa of 0.7181 values.This is because SVM (RBF) is effective the handling the high-dimensional gene expression data and capturing the complex nonlinear relationships.GMM yielded moderate performance metrics as the assumption of Gaussian distributions led to a lower performance compared to SVMs.DFA showed lower performance compared to SVM and GMM, suggesting this classifier is not well-suited for capturing the intricate patterns in gene expression data.Naive Bayesian produced the lowest performance metrics due to the assumption of independence between features.This is because, Naive Bayes is simple and efficient and assumes independence between features, which is not true for gene expression data.The performance analysis of classifiers with LASSO feature extraction is next discussed in Table 9.
From Table 9, it is evident that SVM (RBF) showed the best performance for LASSO based feature extraction method.A high accuracy of 96.77%, F1 Score of 97.5% indicate a strong performance of the classifier in correctly classifying instances even in the class imbalance conditions.The 0.9295 is considerably high, suggesting a strong correlation between predicted and true classifications.This indicates the robustness and reliability of the classifier's predictions.The high Kappa value of 0.9295 indicates substantial agreement beyond chance, The second best results are reported from FPO-GMM Classifier with a high accuracy of 93.55% and F1 Score of 95%.This results also indicate a successful classification of microarray gene expression data with MCC and Kappa of 0.8591 suggesting a strong agreement between predicted and true classifications.If we compare with FPO-GMM and GMM, there is an improvement of 9.57% in accuracy, 7.8% in F1 Score, and 0.202 difference in MCC and Kappa.This is because FPO has truly contributed to optimizing the feature data of the Gaussian Mixture Model classifier.The FPO once again showed its ability to explore the solution space effectively.Overall, the LASSO with SVM (RBF) outperformed LASSO with FPO-GMM due to SVM's robustness in handling complex, nonlinear patterns inherent in microarray gene expression data.The performance analysis of classifiers with EHO feature extraction is next discussed through Table 10.
For the FPO-GMM classifier, a high accuracy of 96.77%, F1 Score of 97.5%, MCC of 0.9295, and Kappa of 0.9295 is reported for EHO feature extraction as provided in Table 10 values indicate successful classification of the data.The high F1 Score suggests effective handling of data imbalance, indicating balanced performance in capturing both positive and negative instances.EHO and FPO have contributed to optimizing parameters in both feature extraction and classifier tuning, respectively to enhance the model's performance.The secondbest performance is observed with the SVM-RBF classifier.This is only a slight improvement of performance in in terms of accuracy of 1.61%, F1 Score of 1.3%, MCC of 0.03, and Kappa of 0.03.The similarity in MCC and Kappa values suggests that both classifiers achieved a similar balance between true positives, true negatives, false positives, and false negatives in the predictions.However, the difference in Accuracy and F1 Score is due to the variations in precision and recall values.The slight improvement in the performance of the FPO-GMM over the SVM-RBF to be attributed to the EHO feature extraction method that distinctly distributes the data in two clusters as provided in Figs. 2 and 3.In essence, STFT yields higher accuracy due to its ability to capture wide variability in expression levels across samples, potentially enabling better discrimination between classes.LASSO and EHO are normalized and sparse representations with more consistent expression levels across samples and hence they can distinguish even the subtle differences between classes.For a smooth comparison, Fig. 4 shows the performances of various employed classifiers with their classification performance metrics.Now, the overall performance of the various feature extraction methodologies and classification methodologies are compared using MCC vs. Kappa as portrayed in Fig. 5.The graph demonstrates a strong positive linear relationship between MCC and Kappa values for the various classifiers used with different feature extraction techniques.The high R 2 value indicates that the relationship between MCC and Kappa values is highly linear and predictable.As one metric increases, the other tends to increase proportionally, suggesting that classifiers performing well in terms of Kappa also tend to perform well in terms of MCC.The linear relationship also suggests that the correlation between MCC and Kappa values are robust and not specific to any particular feature extraction method, especially for the imbalanced dataset in this research.Overall Fig. 5 is a representation of the reliable performance of the classifiers across various feature extraction techniques taken into consideration.So, overall in the classification task, the choice between classifiers depends on requirements such as computational complexity, interpretability, or preference for certain optimization techniques.Overall, FPO-GMM and SVM (RBF) approaches offer effective solutions for classification tasks with microarray gene expression data, with the FPO-GMM approach providing a competitive alternative to traditional methods of SVM-RBF classifiers.In this upcoming section, the computational complexity comparison of the integrated feature selection and classification techniques are discussed.

Computational complexity
The computational complexity of feature extraction and classification methods plays a crucial role in determining the feasibility and efficiency of the high volume of microarray gene expression data.In feature extraction, computational complexity refers to the time and memory resources required to transform raw data into a meaningful feature representation.The computational complexity in classification involves the resources needed to train and deploy classification models, as well as to make predictions on new data points.However, computational experience can vary across different hardware setups and systems.Therefore a robust understanding of the computational complexity of the utilized methods is comprehended with the Big O mathematical notation-O (n), with n indicating the size of the dataset.The computational complexity O (n) indicates that the computational effort grows linearly with the dataset size 'n' .Table 11 provides the computational Complexity of various employed methodologies for Classification.8, 9 and 10 shows that the methodology with comparatively with high computational complexity has provided better results during classification.When dealing with high-dimensional data, like microarray gene expressions, feature extraction helps to reduce the dimensionality and extract relevant information, making the subsequent classification task more manageable for complex classifiers.Also during classification, to bring out the underlying patterns and nonlinear relationship of the data, often the classifiers employed need to be more complex as the gene expression data is not easily separable in the original feature space.
Overall, feature extraction followed by classification methods with high computational complexity is preferred in situations where achieving optimal performance, handling complex data patterns, enhancing robustness to noise, and scalability are essential considerations.So when computational resources are available, these complex learning models can be used to classify microarray gene expression data.Likewise, feature extraction followed by classification methods with lower computational complexity are preferred, for real-time applications where rapid decision-making within the time and resource limit is critical.
Here is the gist of the research and adopted methodology employed in the paper.We begin by outlining the feature extraction methods employed, namely STFT (Short Term Fourier Transform), LASSO (Least Absolute Shrinkage and Selection Operator), and EHO (Elephant Herding Optimisation).We explained how these methods were used to reduce the dimensionality of the microarray gene expression data and extract meaningful features for the classification task.After that, we discussed the integrated machine learning and metaheuristic classification techniques utilized in this study.Then we provided insight into the classifiers employed, including the Gaussian Mixture Model (GMM), Particle Swarm Optimization (PSO) with GMM, Detrended Fluctuation Analysis (DFA), Naive Bayes classifier (NBC), Support Vector Machine with Radial Basis Kernel (SVM-RBF), and Flower Pollination Optimization (FPO) with GMM.After that, we elucidated how these classification techniques were applied to accurately classify colon cancer based on the extracted features.Following this, we presented the results obtained from the evaluation of our model.The performance of the classifiers was evaluated using the metrics: Accuracy, F1 score, Matthews correlation coefficient (MCC), and Kappa statistics.We also highlighted the significance of utilizing STFT, LASSO, and EHO for feature extraction in reducing the dimensionality of microarray gene expression data.The results compared with existing benchmark studies provided in Table 1 emphasize the novelty and effectiveness of our approach in enhancing classification accuracy and interpretability for colon cancer detection.The conclusions, limitations of research, and future research directions are discussed in the next section.

Conclusion
This research contributes to the ongoing fight against colon cancer by exploring novel methods for early detection and improved diagnosis.The major focus of the research is to address the curse of the dimensionality problem inherent in microarray data and to extract significant features that retain relevant patterns and intrinsic nonlinearity.An accuracy of 96.77% is reported for LASSO feature extraction with SVM-RBF Classification.Likewise, 96.77% is also reported for EHO feature extraction with FPO-GMM Classification.This indicates the balance between feature extraction and classification of both approaches employing different optimization techniques.Both methodologies worked well to discover the unseen data, indicating that the selected features are relevant.From the computational complexity analysis, it is observed that LASSO feature extraction with SVM-RBF Classification will incur less computational complexity when compared to EHO feature extraction with FPO-GMM Classification.So the choice of methodology can be made based on computational complexity, suitability, and efficiency.However, the effectiveness of metaheuristic optimization techniques, such as PSO and FPO may vary across different datasets and problem domains.The metaheuristic optimization methods are computationally intensive and often require tuning of algorithm parameters using hyperparameter tuning to further optimize and reproduce the same research results.So the future direction of the research is to utilize techniques like grid and random search to systematically explore the hyperparameter space and identify optimal settings for metaheuristic and machine learning classifiers.Also t-SNE, ReliefF methods for dimensionality reduction can be experimented to discriminate the high-dimensional data and identify underlying patterns and clusters.

Figure 1 .
Figure 1.The schematic workflow of the research starts with the Colon cancer microarray gene expression dataset extracted from RNA microarrays.Then, feature extraction is performed on the microarray gene expression dataset using STFT, LASSO and EHO techniques.Further, Colon cancer classification is performed using machine learning and metaheuristic-integrated machine learning classifiers.

3 .
7) p worst = p min + p max − p min + 1 × rand Table Statistical analysis of colon cancer micro array gene after bio inspired features.*The mean value indicates the centre point of the gene expression data cluster along with variance indicating the dispersion of data from the centre point.Skewness and Kurtosis values compare the gene data distribution with the Gaussian data distribution.PCC reveals the internal association of gene expression levels inherent in the data.CCA examines the correlation of the data across cancer classes.

Figure 2 .
Figure 2. Data distribution plots for various feature extraction methods.The data distribution after STFT feature extraction is positively skewed reporting more correlation among the adeno and meso classes.The data distribution after LASSO feature extraction illustrates two differently behaved Non-Gaussian distributions with different statistics parameter values and also shows a moderate correlation among the classes.The data distribution for EHO feature extraction depicts minimum correlation among the classes, with distinct Non-Gaussian distributions segregated to their extremities.
(11) p y = c = Number of samples with class c Total number of samples

Figure 3 .
Figure 3. Violin plots for various feature extraction methods.(a) STFT feature extraction, (b) LASSO feature extraction, (c) EHO feature extraction LASSO and EHO appear to capture more variability and distinct patterns compared to STFT.EHO shows particularly promising results in separating normal and cancer classes based on the distribution of extracted features.
1].For a Colon Cancer class feature set of M elements, the class target for the Colon Cancer class is defined as: where M i is the average of the feature extracted vectors of Colon cancer class, and T C target follows the constraint T C ∈ [0, 1].The Eucledian distance between the Targets of the binary class must also follow the constraint:Based on the Eqs.(28-30), the class targets T N and T C are chosen as 0.85 and 0.1, respectively.The classifier performance will be monitored with the help of MSE criteria.The next section discusses the training and testing of classifiers.

Figure 4 .
Figure 4. Employed classifiers with their classification performance metrics.

Figure 5 .
Figure 5. Performance of classifiers in terms of MCC vs. Kappa parameters for STFT, LASSO, and EHO feature extraction techniques.

Table 1 .
Review of previous work.

Number of genes Class 1 (cancer) Class 2 (healthy) Total samples
Feature extraction using LASSO regressionLASSO is an extension of ordinary least squares (OLS) regression introduced by Tibshirani et al.
for this particular feature compared to lower expression levels.LASSO exhibits nearly a symmetrical distribution with skewness values close to zero.EHO exhibits strong negative skewness in the case of Normal data and a light negative skewness for Cancer data, suggesting a left-skewed distribution of expression levels.Kurtosis measures the peakedness or flatness of the distribution of values in a dataset compared to a normal distribution.High kurtosis indicates a sharp peak (leptokurtic), while low kurtosis indicates a flat peak (platykurtic).The extracted features using STFT and EHO are leptokurtic, meaning the distribution has heavier tails and more extreme values than a normal distribution.LASSO is platykurtic, revealing that the distribution has lighter tails and fewer extreme values than a normal distribution.

Table 5 .
Training and testing MSE of classifiers for STFT, LASSO and EHO feature extraction techniques.Significant values are in bold.

Table 6 .
Parameter selection of the employed classifiers.

Table 7 .
Classifier performance metrics.*Po is the observed proportion of agreement, Pe is the proportion of agreement expected by chance.Firefly-GMM 67.74193548 74.35897436 0.310317338 32.25806452 0.309576837

Table 9 .
Performance analysis of classifiers in colon cancer detection from LASSO regression feature extraction.

Table 10 .
Performance analysis of classifiers in colon cancer detection from EHO feature extraction.

Table 11 .
Computational complexity of various employed methodologies for classification.