A Comparative Analysis of Optimization Algorithms for Gastrointestinal Abnormalities Recognition and Classification Based on Ensemble XcepNet23 and ResNet18 Features

Esophagitis, cancerous growths, bleeding, and ulcers are typical symptoms of gastrointestinal disorders, which account for a significant portion of human mortality. For both patients and doctors, traditional diagnostic methods can be exhausting. The major aim of this research is to propose a hybrid method that can accurately diagnose gastrointestinal tract abnormalities and promote early treatment, helping to reduce mortality. The major phases of the proposed method are: Dataset Augmentation, Preprocessing, Features Engineering (Features Extraction, Fusion, Optimization), and Classification. Image enhancement is performed using hybrid contrast stretching algorithms. Deep learning features are extracted through transfer learning from the ResNet18 model and the proposed XcepNet23 model. The obtained deep features are ensembled with the texture features. The ensemble feature vector is optimized using the Binary Dragonfly Algorithm (BDA), the Moth-Flame Optimization (MFO) algorithm, and the Particle Swarm Optimization (PSO) algorithm. In this research, two datasets (a Hybrid dataset and the Kvasir-V1 dataset), consisting of five and eight classes, respectively, are utilized. Compared to the most recent methods, the accuracy achieved by the proposed method on both datasets was superior: the Q_SVM achieved promising accuracies of 100% on the Hybrid dataset and 99.24% on the Kvasir-V1 dataset.


Introduction
Numerous researchers are recommending their methods for the precise identification and classification of gastrointestinal anomalies, skin lesions, and brain tumors. In the area of computer vision and medical image processing, this is a growing field of study. These research studies are usually based on datasets obtained through imaging technologies such as computed tomography (CT), wireless capsule endoscopy (WCE), and magnetic resonance imaging (MRI). Gastric abnormalities include esophagitis, bleeding, ulcers, and polyps. The most common gastrointestinal abnormalities that humans suffer from are bleeding, polyps, and ulcers [1]. These stomach abnormalities have become a major cause of human mortality. The main objective is to develop a reliable and efficient model that can correctly identify and categorize digestive diseases. In comparison to current techniques, the research intends to obtain improved diagnostic accuracy, facilitating early identification and efficient treatment. The objective is to identify and categorize a wide range of digestive diseases by utilizing the strength of CNN and ML algorithms. This seeks to give professionals a thorough diagnostic tool for precisely identifying diseases. The rest of the paper is arranged as follows: the literature review is in Section 2, the concise narrative of the proposed methodology is in Section 3, and the quantitative experimental results are discussed in Section 4. Section 6 presents the paper's conclusion.

Related Work
The automated detection of disease is an active area of research. Researchers have introduced numerous automated computer-aided diagnosis methods to help physicians. Most of the methods are suggested for malignant disease detection, including brain tumor segmentation and classification [15,16], glaucoma detection [17], lung cancer detection [18], and so on. The accurate detection of an infected region, such as an ulcer or bleeding, is a very challenging task. There exist different methods that researchers use for image enhancement based on gamma correction [19], image colorization [20], a geometric filter [21], Otsu thresholding [22], and the discrete Fourier transform. A large memory and normally more than 8 h are required for a WCE examination, so a real-time analysis algorithm is required to correctly analyze abnormal tissues or infected regions.
An algorithm developed for automatic bleeding region detection [23] mainly focuses on bleeding spot detection. It consists of the statistical analysis and shape analysis of the region of interest (ROI). This algorithm was tested on 30 different cases of capsule endoscopy and achieved a specificity of 97% and a sensitivity of 99%. Yuan et al. [24] came up with an approach for the detection of ulcers, polyps, and bleeding using the K-means clustering algorithm. They utilized WCE images and achieved an 88.61% average performance. The authors of [25] presented a procedure for bleeding region recognition from WCE imagery in which pixels are grouped robustly, based on intensity and area, through superpixel division instead of processing each pixel separately or through a uniform division of pixels. A novel method is used in [26] for distinguishing healthy and bleeding regions. Bleeding is a major symptom of disease. Different color models are used for the detection of bleeding, and experiments are performed using the local binary pattern (LBP). Experiments show that by using CIE XYZ, an accuracy of 96.38% is achieved with the KNN classifier. The identification of problematic images has been an obstacle for specialists to deal with. An integrated saliency measure and the Bag of Features method are suggested to tackle this problem [27]; the proposed technique provides good classification and characterization of polyps from WCE images. In [28], diverse statistical parameters are considered to accurately detect bleeding images, such as the maxima, mean, minima, median, mode, variance, kurtosis, and skewness. In [29], images of ulcers and of healthy people are categorized using texture feature extraction. The contourlet transform and log-Gabor filter are used for texture features extraction; the extracted texture features are supplied to an SVM, which achieves the highest accuracy of 94.16%. Yuan et al. [30] suggested a saliency-based practice for the detection of ulcers. The proposed method comprises two main stages: sample images of the ulcer are selected in the first step, whereas in the second step, the multilevel superpixel method is used to segment infected regions that are joined in a single map. In [31], the authors proposed an approach for separating healthy and unhealthy pixels through different color spaces. Color spaces such as RGB, HSV, and YCbCr are utilized for the extraction of texture features, and the reduced feature vector (FV) is supplied to the SVM classifier for classification. They have shown a 97.89% classification accuracy. Reed T. Sutton et al. [32] carried out research in which they compared the performance of several deep learning algorithms. The authors utilized the HyperKvasir dataset and obtained 8000 images from it. From the comparison of the CNN models, it is concluded that the DenseNet121 model achieved a maximum accuracy of 87.50%. A method for identifying stomach diseases was proposed by Nayar et al. [33]. The authors combined CNN features with an improved Genetic Algorithm (GA); feature extraction in this study is carried out with AlexNet. The computational time required for this method was 211.90 s, and the accuracy was 99.8%. Guanghua Zhang et al. [34] proposed a method for digestive tract tumor detection. In our previous study [35], a hybrid method was proposed for the classification of gastrointestinal diseases. The major steps involved in this methodology are pre-processing, texture features extraction, CNN features extraction, feature fusion, and classification. The highest classification accuracy achieved on the KVASIR dataset was 99.3%. The literature review depicts the pre-processing and features extraction steps used by most researchers for disease classification and detection. Moreover, some researchers have utilized a combination of different types of features for better classification results.
As the prediction of ulcers, healthy tissue, and bleeding from WCE images is greatly influenced by the features that are extracted, our focus in this research is on the pre-processing step and the features engineering phase.
A crucial phase that has a significant impact on prediction performance is improving the quality of the dataset, and a hybrid strategy for enhancing image quality is proposed in this study. The features engineering phase is likewise fundamental in our proposed technique, wherein two kinds of features are extracted. These features are obtained utilizing CNN models and the Local Binary Pattern (LBP) algorithm. The CNN features are extracted using the proposed CNN model (XcepNet23) and the ResNet18 model. In the final phase, machine learning algorithms are utilized for classification. The following section provides a comprehensive explanation of the proposed method.

Proposed Methodology
The automatic detection of gastrointestinal diseases is a trending area of research in computer vision and image processing. Due to the identical color of cells in normal and infected WCE images, the classification of images is a highly sensitive and important task. This section includes an in-depth description of the proposed method. This research proposes a method consisting of hybrid image processing, machine learning, and deep learning algorithms for the recognition of gastrointestinal abnormalities. This work is primarily based on five main phases, which are: (a) image augmentation; (b) hybrid pre-processing technique; (c) features extraction; (d) feature fusion and optimization; and (e) classification. The following sections provide a comprehensive explanation of these phases. The model that represents the proposed method is given in Figure 1.

Preprocessing
Preprocessing is the first phase of our method; the contrast of the image is improved to enhance the chromatic quality of the infected region. The pre-processing step has a great impact on different domains such as computer vision, medical imaging, biometrics, and surveillance. This step is performed in medical imaging due to the presence of numerous challenges such as the low contrast of the images, the presence of blur, noise artifacts, variations in illumination, color variations, and lighting effects. Image enhancement is performed to extract the most salient features and obtain accurate classification results. In this phase, different methods are utilized for contrast enhancement. The pre-processing step is further subdivided into major sub-steps, which are: data augmentation, 3D-box filtering, 3D-median filtering, RGB-to-HSI color transformation, channel extraction (H, S, I), CLAHE applied to the extracted channels (resulting in H_CLAHE, S_CLAHE, and I_CLAHE), a saturation weight map applied to the S_CLAHE, and feature engineering applied to the ensemble feature vector.

Data Augmentation
The availability of large datasets is still a challenge for researchers. Deep learning models require a sizable quantity of data for the training phase. Overfitting is a well-known issue that affects all researchers and is brought on by insufficient or small datasets. In this study, we used the data augmentation technique to expand the dataset size and balance the dataset. The research reveals that there are several image augmentation techniques, including image flipping, mirroring, and image rotation. In this paper, image rotation is utilized for the lossless augmentation of the dataset. We have rotated the original image at three angles: 90°-right, 180°-right, and 270°-right (or 90°-left). Some sample output images of data augmentation are given in Figure 2.
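The three lossless rotations described above can be sketched in a few lines. The nested-list image representation here is illustrative; in practice, a library routine from an image-processing toolbox would perform the rotations.

```python
def rotate90(img):
    """Rotate an image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """Lossless augmentation: original plus 90-, 180-, and 270-degree right rotations."""
    r90 = rotate90(img)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [img, r90, r180, r270]
```

Each rotation preserves every pixel value, so no information is lost while the dataset size is quadrupled.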

3D-Box Filtering
In this step of image preprocessing, a 3D-box filter [12] is employed on the initial image, as shown in Figure 3a. The output image of this step is shown in Figure 3b. The filter smooths the image by assigning an equal mask weight to the neighborhood pixels. It has three steps. In the first step, an RGB image I(s, d) is converted into three channels. A box filter (an all-ones mask) of size 3 × 3 is generated. A mathematical representation of the box filter is given in Equation (1),
where j ∈ H, k ∈ V, and h and v represent the rows and columns of the produced mask filter. Here, the mask size is 3 × 3 = 9, and the mask is separately applied to all channels, as depicted in Equations (2)-(4).
where R(s, d) is the red channel and G(s, d) and B(s, d) represent the green and blue channels, respectively. R_mask(s, d) represents the kernel filter for the red channel, G_mask(s, d) is the kernel filter for the green channel, and B_mask(s, d) is the kernel filter for the blue channel. Equations (5)-(7) for channel extraction are as follows.
R(s, d) = red(I(s, d)) (5)
G(s, d) = green(I(s, d)) (6)
B(s, d) = blue(I(s, d)) (7)

Updated pixel values and kernel filters are utilized to perform convolution operations. A new matrix is produced by using a null matrix of size 256 × 256 and the above-generated values of h = 3, v = 3, h_1 = 256, and v_1 = 256. The resultant new matrix is shown in Equation (8),
where j ∈ h_u, k ∈ h_c, h_u = 3, v_u = 3, h_M = h_u + h_1 − 1, and v_N = v_u + v_1 − 1. The symbols h_u and v_u represent updated values of rows and columns iterated up to 256 times. 2D images are produced by using F_c(s, d) for each channel in the last step. Concatenating them results in a new, improved picture that amplifies the contrast between the diseased area and the surrounding area, as explained in Equations (9) and (10), respectively,
where F_n(s, d) represents the result of the 3D-box filter, as given in Figure 3b.
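As a sketch of Equations (2)-(4), the following applies the 3 × 3 all-ones box mask to a single channel; the division by 9 reflects the mask size stated above. Zero padding at the image border is an assumption for illustration, not a detail taken from the paper.

```python
def box_filter_3x3(ch):
    """Apply a 3x3 box (mean) filter to one channel; out-of-bounds neighbours count as 0."""
    h, w = len(ch), len(ch[0])
    out = [[0.0] * w for _ in range(h)]
    for s in range(h):
        for d in range(w):
            total = 0.0
            for j in (-1, 0, 1):
                for k in (-1, 0, 1):
                    r, c = s + j, d + k
                    if 0 <= r < h and 0 <= c < w:
                        total += ch[r][c]
            out[s][d] = total / 9.0  # equal 1/9 weight for each of the 9 mask positions
    return out
```

Running the same routine on the red, green, and blue channels and concatenating the results reproduces the per-channel smoothing described above.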

3D-Median Filtering
It is applied to the output image produced by the 3D-box filter. Noise issues are resolved through this filter [36]. The resultant image is shown in Figure 3c.
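A minimal sketch of the per-channel median step follows; restricting each window to in-bounds pixels at the border is an assumption made for illustration.

```python
def median_filter_3x3(ch):
    """3x3 median filter: replace each pixel with the median of its in-bounds neighbourhood."""
    h, w = len(ch), len(ch[0])
    out = [[0] * w for _ in range(h)]
    for s in range(h):
        for d in range(w):
            window = [ch[s + j][d + k]
                      for j in (-1, 0, 1) for k in (-1, 0, 1)
                      if 0 <= s + j < h and 0 <= d + k < w]
            window.sort()
            out[s][d] = window[len(window) // 2]
    return out
```

Unlike the box filter, the median discards isolated impulse-noise values entirely instead of averaging them into the neighbourhood.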

HSI Color Transformation
The RGB color model is founded on the additive system obtained by combining three channels: red, green, and blue. Colors are expressed as a set of three numeric values representing the intensities of the three channels, each ranging from 0 to 255. The lowest value represents black (0, 0, 0), while the highest value represents white (255, 255, 255). Another color model based on three parameters is the HSI model, in which colors are encoded according to Hue, Saturation, and Intensity. This color system is used as an alternative to the RGB color space in some color monitors [37]. RGB-to-HSI color transformation is applied to the outcome of the previous step. In the HSI color wheel, hue is represented by an angle measure. The second channel of the HSI color model is saturation, which refers to the intensity or purity of a color, especially in relation to its brightness; it shows how much a color has been diluted with white or grey. A highly saturated color appears pure and vivid, whereas a de-saturated image looks washed out. Low saturation indicates dilution with white light, whereas high saturation means the color is pure and free from dilution. The third component is the intensity (brightness), which depends on the saturation and hue components. The mathematical equations for the conversion from RGB to the HSI color space [38] are given in Equations (11)-(14):
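Assuming Equations (11)-(14) follow the standard RGB-to-HSI conversion, a per-pixel sketch is given below. The clamping of the arccos argument is a numerical-safety assumption, not part of the formulas.

```python
import math

def rgb_to_hsi(r, g, b):
    """Convert one RGB pixel (0-255) to HSI: hue in degrees, saturation and intensity in [0, 1]."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        h = 0.0  # achromatic pixel: hue is undefined, conventionally set to 0
    else:
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        h = theta if b <= g else 360.0 - theta
    return h, s, i
```

For example, pure red maps to hue 0° with full saturation, and pure blue maps to hue 240°.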

Contrast Limited Adaptive Histogram Equalization
Numerous researchers in image processing have used CLAHE, which improves contrast by calculating distinct histograms for each region of the image. These histograms are then utilized to distribute the gray levels equally across the range of the image. After the color transformation, the H, S, and I channels are extracted, and CLAHE [39] is applied to all three extracted channels, producing three output images: H_CLAHE, S_CLAHE, and I_CLAHE. The ensemble feature vector (φ_Ensemble^CLAHE) is obtained by concatenating H_CLAHE, S_CLAHE, and I_CLAHE. After the CLAHE step, the chromatic weight map (C_W_MAP) is applied to the S_CLAHE, and an improved image (S_CLAHE_Improved) is obtained as the output of this step; the other two CLAHE outputs remain unchanged. The S_CLAHE_Improved is further utilized for texture features extraction. The φ_Ensemble^CLAHE is sharpened to improve its quality, producing the φ_Ensemble^SHARP image, which is supplied to the proposed CNN model for deep features extraction.
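To illustrate the gray-level redistribution that CLAHE builds on, the following simplified sketch performs plain global histogram equalization on one channel. It is not a full CLAHE implementation: CLAHE additionally clips each local histogram and interpolates between tiles, which this sketch omits.

```python
def equalize_channel(ch, levels=256):
    """Global histogram equalization on one channel (simplified stand-in for CLAHE)."""
    flat = [p for row in ch for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution mapped back onto the gray-level range.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    n = len(flat)
    lut = [round((c / n) * (levels - 1)) for c in cdf]
    return [[lut[p] for p in row] for row in ch]
```

Applying such an equalization to the H, S, and I channels separately mirrors the per-channel structure of the CLAHE step described above.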

Feature Extraction
Feature extraction and optimal feature selection are significant phases that are necessary to achieve the accurate detection and classification of GIT abnormalities. In this research, we have utilized the Local Binary Pattern for texture features extraction, and two CNN models are utilized for deep features extraction. Figure 4 gives an overview of the phases from preprocessing through texture features extraction, deep features extraction, and features fusion. In the preprocessing phase, we performed several steps such as data augmentation, box filtering, median filtering, and RGB-to-HSI conversion; in the preprocessing block of Figure 4, step-i represents the extracted equalized S-channel output of the preprocessing phase, and step-n represents the final output image of the preprocessing phase. The equalized S-channel image is utilized for the LBP features extraction, which is later ensembled with the extracted deep learning features. A block diagram representing the proposed features engineering process is given in Figure 4.
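The ensembling of deep and texture features can be sketched as serial concatenation of per-image feature vectors. The per-source lengths below are illustrative assumptions; for example, two 512-dimensional deep vectors plus a 59-dimensional LBP vector would be consistent with the 1083-dimensional fused vector reported later.

```python
def fuse_features(resnet_feats, xcepnet_feats, lbp_feats):
    """Serially concatenate per-image feature vectors into one fused vector per image."""
    fused = []
    for deep1, deep2, texture in zip(resnet_feats, xcepnet_feats, lbp_feats):
        fused.append(list(deep1) + list(deep2) + list(texture))
    return fused
```

Serial concatenation preserves every feature from each source, leaving redundancy removal to the subsequent optimization step.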

Local Binary Pattern
LBP [40] is an important method utilized by many researchers for the extraction of texture information. It takes a grayscale image as input; the variance and mean are then calculated for all of the pixel intensities of the image. In this research project, we have utilized the S_CLAHE for the LBP features extraction. This algorithm returned a feature vector of dimension N × 59. The mathematical formulation of LBP is given below in Equations (15) and (16):

LBP_{P,ℛ} = Σ_{p=0}^{P−1} Y(ℰ_p − ℰ_c) · 2^p (15)

Y(x) = 1 if x ≥ 0, 0 otherwise (16)

In this formulation, the number of neighborhood intensities is given by P, the radius is given by ℛ, ℰ_c represents the current pixel intensity, ℰ_p represents the neighboring pixel intensity, and Y(x) refers to the thresholding function.
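Equations (15) and (16) can be sketched for the radius-1, 8-neighbour case. The clockwise neighbour ordering below is an assumption; a different bit order only permutes the resulting codes.

```python
def lbp_code(img, r, c):
    """8-neighbour LBP code (radius 1) for pixel (r, c) of a grayscale image."""
    center = img[r][c]
    # Clockwise neighbourhood offsets starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:  # thresholding function Y(x)
            code |= 1 << bit
    return code
```

Histogramming these codes over the image (with the 59-bin uniform-pattern scheme) yields the N × 59 feature vector mentioned above.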

Deep Learning Features
Inspired by the usefulness of deep learning in the surveillance and agriculture domains, several researchers have employed deep learning models for disease recognition. Deep learning models have resulted in a great breakthrough in terms of disease detection and classification performance. A CNN model consists of multiple layers, comprising the input, convolution, batch normalization, ReLU, pooling, softmax, and classification layers. Each layer is responsible for performing a specific task. A detailed description of the CNN models utilized in this research is given in the following section:

ResNet18
There are different versions of the Residual Network (ResNetXX) architecture [41], where "XX" specifies the total number of layers. For the deep feature extraction in this study, the ResNet18 version was used. There are 72 layers in this architecture, 18 of which are the model's deep layers. The ResNet18 model is originally pre-trained on the thousand categories of the ImageNet dataset and contains 11,511,784 trainable parameters. The primary aim of this model is to enable a large number of convolution layers. Network performance becomes saturated or even degraded due to the vanishing gradient issue: when there are multiple layers, the continuous multiplication results in an ever smaller gradient, which leads to performance degradation. In the ResNet model, a new idea, the "skip connection", is presented to tackle the vanishing gradient problem. Skip connections resolve the vanishing gradient problem by reusing the activations of previous layers; they compress the network, and the network starts learning faster than before. A 70:30 ratio is utilized for training and testing, and we have utilized five-fold cross-validation. Figure 5 gives the visual representation of the layered ResNet18 model architecture.
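The 70:30 hold-out split and five-fold protocol can be sketched as index bookkeeping. The shuffling seed and the stride-based fold assignment below are illustrative choices, not the paper's exact procedure.

```python
import random

def train_test_split(items, train_ratio=0.7, seed=0):
    """Shuffle and split items into train/test partitions (70:30 by default)."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n, k=5):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val
```

Each of the five folds serves as the validation split exactly once, so every sample contributes to both training and validation across the protocol.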

Proposed XcepNet23
A smart deep learning model with 23 layers, named XcepNet23, is proposed in this research work. The layered architecture of the proposed model follows the architecture of the original Xception network model, which has a depth of 71 layers. The XcepNet23 model is designed from scratch and pre-trained on the CIFAR-100 dataset. After the pre-training step, we trained it on our enhanced dataset. We have employed five-fold cross-validation, splitting the dataset into the ratio of 70:30 for training and testing, respectively. The parameter settings utilized for training are: the sgdm optimizer, an initial learning rate of 0.01, 30 max epochs, a mini_Batch_size of 64, and shuffling performed on each epoch. The layered architecture of XcepNet23 is given in Figure 6.

Features Optimization
In the literature, we have observed that authors have utilized feature fusion to obtain a strong resultant set of features. However, the fused set may contain redundant features that reduce the performance of the method. In the last decade, methods based on metaheuristic algorithms have been viewed as the most proficient and reliable optimization strategies. These algorithms have been broadly utilized to improve performance on real-world problems. In this research, we have utilized feature optimization algorithms to extract the optimal features from the total extracted features. Optimization algorithms help to eliminate redundant and non-optimal features. Many algorithms can be used for selecting the best features. It is observed in the literature that authors have utilized BDA [2], MFO in combination with the Crow Search Algorithm (CSA) [42], PSO and CSA [43], Enhanced Crow Search and Differential Evolution, and Grasshopper [44]. Inspired by these studies, in this paper, we used three nature-inspired algorithms to optimize features: BDA, MFO, and PSO. The following sections briefly overview these optimization techniques. After the features fusion step, we have one feature vector, denoted by φ_Fused, with dimension N × 1083. The optimization algorithms (BDA, MFO, and PSO) are applied on φ_Fused to obtain the optimal set of features. The mathematical formulation of the optimization algorithms is as follows. The BDA [45] mimics the swarming pattern of dragonflies (DF). The exploration and exploitation behaviors of DA [46] are demonstrated by the association of dragonflies in avoiding enemies and finding food sources. The mathematical formulation of the dragonfly algorithm is given in Equations (17)-(25). Separation is formulated as:

S_i = − Σ_{i=1}^{N} (X − X_i) (17)

where X specifies a DF position in a space that is M-dimensional.
N and X i represent the number of neighbor individuals and the position of the neighbor individual, respectively. Alignment permits speed matching of the individual in a sub-swarm/swarm. The mathematical formulation of Alignment is as follows: where N represents the neighbor individuals and V i represents the neighbor's individual velocity. Cohesion alludes to the deviation of the ongoing individual toward the focal point of the mass of the neighbor individual.
where X i specifies the individual neighbor position. Attraction is also an important behavior; it specifies that the source of food should be the attraction for the individuals. Mathematically, the attraction behavior is formulated as: where X indicates the source food's position. Distraction indicates the situation in which an individual should not get closer to the enemy. The individual should get away from predators. The mathematical formulation of distraction is as follows: where X ℰ represents the location of the enemy. These five behaviors control the DFs' movement in DA. The step vector is calculated to update the location of each dragonfly: where signifies the current iteration, s is the separation weight, α is the alignment weight, c is the weight of cohesion, f is the food weight, e is the weight of the predator, and w is the weight of inertia. In the initial algorithm for dragonflies, the locations of the DFs are restructured using the following equation: The locations vectors of BDA are updated as follows: T (∆X) = ∆X where Rand, ∆X, X k i , and T represent the number that is randomly generated (range: 0-1), the step vector, the kth position of the ith dragonfly, and the transfer function, respectively. The ensemble feature vector φ Fused is processed using the BDA algorithm and obtains the optimized feature vector with the dimensions × 384.
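As an illustration, the binary position update of BDA can be sketched as follows. This is a minimal sketch of a single update: the V-shaped transfer function shown is the one commonly used with BDA, and the feature-selection bit vector and step values are toy inputs, not the paper's actual data.

```python
import math
import random

def transfer(dx):
    """V-shaped transfer function T(dx) = |dx / sqrt(dx^2 + 1)|, in [0, 1)."""
    return abs(dx / math.sqrt(dx * dx + 1))

def bda_update(position, step):
    """Flip each binary feature-selection bit with probability T(step_d)."""
    return [1 - x if random.random() < transfer(dx) else x
            for x, dx in zip(position, step)]

# Toy example: 4 candidate features, each bit = feature kept (1) or dropped (0).
random.seed(0)
mask = bda_update([1, 0, 1, 0], [0.5, -2.0, 0.0, 3.0])
# a zero step (third bit) gives flip probability 0, so that bit never changes
```

Repeating this update over the full 1083-dimensional mask while tracking the best fitness is what reduces the fused vector to the reported 384 features.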
In this research, we have also utilized the Moth-Flame Optimization (MFO) algorithm [47] for feature optimization. The moths follow three major steps: population creation, position updating, and updating the final number of flames. The moth population and the corresponding fitness values can be mathematically expressed as given in Equations (26)-(29):

W = [w_{11} … w_{1n}; … ; w_{z1} … w_{zn}]    (26)
OW = [OW_1, OW_2, …, OW_z]^T    (27)
F = [f_{11} … f_{1n}; … ; f_{z1} … f_{zn}]    (28)
OF = [OF_1, OF_2, …, OF_z]^T    (29)

where the matrix W represents the moth population, F represents the flames, and n and z denote the number of dimensions and the number of moths, respectively; OW and OF store the corresponding fitness values. After initialization, the moths' positions are altered to obtain the best global solution. The following functions are chosen for the optimization challenge: P generates the fitness function corresponding to the randomly generated population, P : ∅ → {W, OW}; the Q function is the main function, responsible for moving the moths around the search space, Q : W → W; and the S function returns true or false based on the termination criterion, S : W → {true, false}. The moths update their positions around the flames along a logarithmic spiral, formulated as follows:

S(W_i, F_j) = D_i · e^{cr} · cos(2πr) + F_j, with D_i = |F_j − W_i|

where D_i indicates the distance between the jth flame and the ith moth, c is a constant that defines the shape of the logarithmic spiral, and r is a random number in the range [−1, 1]. To balance exploration and exploitation, the number of flames is decreased adaptively over the iterations, as given in Equation (34):

flame_no = round(M − k · (M − 1)/T)    (34)

where M denotes the maximum number of flames, T denotes the total number of iterations, and k represents the current iteration. The selected flames are passed to Equations (26)-(29), and the fitness function is calculated. The MFO technique described above is used to process the fused feature vector φ_Fused and obtain an optimized feature vector of dimension ℳ × 437. In this article, the third optimization algorithm that we have utilized is PSO [48]. PSO is a nature-inspired algorithm derived from the survival behavior of swarms.
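Returning to MFO, the logarithmic-spiral position update and the adaptive flame count (Equation (34)) described above can be sketched as follows. This is an illustrative sketch; the spiral constant and the random parameter value are toy choices, not values reported in the paper.

```python
import math

def mfo_spiral_update(moth, flame, c=1.0, r=-0.5):
    """Logarithmic spiral: new position = D * e^(c*r) * cos(2*pi*r) + F,
    with distance D = |F - M| computed per dimension."""
    return [abs(f - m) * math.exp(c * r) * math.cos(2 * math.pi * r) + f
            for m, f in zip(moth, flame)]

def flame_count(max_flames, k, total_iters):
    """Equation (34): flames shrink linearly, round(M - k*(M - 1)/T)."""
    return round(max_flames - k * (max_flames - 1) / total_iters)

# Toy example: one 2-D moth spiraling around one flame.
new_pos = mfo_spiral_update([0.2, 0.8], [1.0, 0.0], r=-0.5)
# flame_count(10, 0, 100) starts at 10 flames; by the last iteration only 1 remains
```

The shrinking flame count is the mechanism that shifts MFO from exploration (many flames) to exploitation (a single best flame) late in the search.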
The primary goal of this algorithm is optimization, i.e., picking the best solution from the available candidate solutions. For a K-dimensional search space with k particles, the position vector of the jth particle of the swarm is represented as X_j = (x_{j1}, x_{j2}, …, x_{jK}); the previous best position of the jth particle, which returned the optimal (lowest) fitness value, is P_j = (p_{j1}, p_{j2}, …, p_{jK}); and the velocity of the jth particle is V_j = (v_{j1}, v_{j2}, …, v_{jK}). The following mathematical formulation represents the particles' updating process:

v_{jd} = v_{jd} + c1 · R() · (p_{jd} − x_{jd}) + c2 · r() · (g_d − x_{jd})    (35)
x_{jd} = x_{jd} + v_{jd}    (36)

where c1 and c2 denote the positive constants of the cognitive and social parameters, respectively, g = (g_1, …, g_K) is the best position found by the whole swarm, and R() and r() are two functions that produce pseudorandom numbers within the range [0, 1]. This algorithm returned a feature vector of ℳ × 428 dimensions.
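The PSO velocity and position update described above can be sketched as follows. This is a generic single-particle sketch: c1 = c2 = 2 is a common textbook default, not a value reported in the paper, and the toy vectors are illustrative.

```python
import random

def pso_step(x, v, pbest, gbest, c1=2.0, c2=2.0):
    """One particle update: v <- v + c1*R()*(pbest - x) + c2*r()*(gbest - x),
    then x <- x + v. R() and r() are fresh uniform draws in [0, 1]."""
    new_v = [vd + c1 * random.random() * (pd - xd)
             + c2 * random.random() * (gd - xd)
             for xd, vd, pd, gd in zip(x, v, pbest, gbest)]
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v

# Toy example: a 2-D particle starting at the origin, attracted toward (1, 1).
random.seed(42)
x, v = pso_step(x=[0.0, 0.0], v=[0.0, 0.0], pbest=[1.0, 1.0], gbest=[1.0, 1.0])
# with both attractors at 1.0, every coordinate moves in the positive direction
```

Over many iterations, pbest and gbest are refreshed whenever a particle finds a better fitness value, which is how the swarm converges on an optimal feature subset.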

Experimental Results and Analysis
The dataset consists of thousands of healthy and unhealthy images, with the unhealthy images further categorized into two classes: ulcers and bleeding. In this work, several machine learning algorithms are used to classify the data. To evaluate the findings presented in this paper, each of the six classifiers listed below is taken into consideration: Fine Tree (F_Tree) [49], Coarse Tree (C_Tree) [50], Quadratic SVM (Q_SVM) [51], Fine Gaussian SVM (F_G_SVM) [52], Coarse KNN (C_KNN) [53], and Ensemble Subspace Discriminant (E_S_Disc) [54]. Fine Tree (F_Tree) and Coarse Tree (C_Tree) are decision tree-based models with many and few leaves, suited to fine- and coarse-grained classification tasks, respectively. Quadratic SVM (Q_SVM) is a support vector machine that uses a degree-2 polynomial kernel to handle non-linearly separable data. Fine Gaussian SVM (F_G_SVM) is an SVM that employs a Gaussian (RBF) kernel with a small kernel scale, producing finely detailed decision boundaries. Coarse KNN (C_KNN) is a k-nearest neighbor model with a large number of neighbors, suited to coarse distinctions between classes. Ensemble Subspace Discriminant (E_S_Disc) is an ensemble model that combines multiple discriminant classifiers operating on different subspaces of the high-dimensional feature space. All the experiments are performed in MATLAB 2020a on a Core i5 7th-generation machine with 8 GB RAM.
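The kernels behind the two SVM variants above can be illustrated with a minimal sketch. The kernel parameters here (offset c, kernel scale) are illustrative defaults, not the toolbox's exact settings.

```python
import math

def quadratic_kernel(x, y, c=1.0):
    """Degree-2 polynomial kernel used by a quadratic SVM: (x.y + c)^2."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** 2

def gaussian_kernel(x, y, scale=1.0):
    """Gaussian (RBF) kernel used by a Gaussian SVM: exp(-||x-y||^2 / (2*scale^2)).
    A small scale (as in 'fine' Gaussian SVM) makes similarity decay quickly."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * scale ** 2))

# Toy feature vectors:
q = quadratic_kernel([1.0, 2.0], [3.0, 4.0])   # (3 + 8 + 1)^2 = 144
g = gaussian_kernel([1.0, 2.0], [1.0, 2.0])    # identical points -> 1.0
```

Both kernels implicitly map the 1083-dimensional fused features into a higher-dimensional space where a linear separator can handle non-linearly separable classes.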

Dataset
Two datasets are utilized to evaluate the proposed method in this research paper: (1) the KVASIR_V1 dataset and (2) the Hybrid_Dataset. The Hybrid_Dataset utilized in this research work consists of five classes: Bleeding, Ulcer, Esophagitis, Polyps, and Healthy. The data for this research are collected from different sources. The Ulcer, Healthy, and Bleeding images are acquired from Amna Liaqat et al. [12] (each class contains 3000 images). The Esophagitis images are taken from the Kvasir dataset [55]; using data augmentation, the volume of the esophagitis images is increased without degrading the features in the original images. The fifth class of our dataset is Polyps, whose images are collected from two datasets, Kvasir and CVC. In total, the proposed method is evaluated using 15,448 images. The second dataset considered for this research work is the KVASIR_V1 dataset, which consists of eight classes and a total of 4000 images [55]. We have used all classes of this dataset. Some sample images are given in Figure 7.
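The paper does not enumerate its augmentation operations, so the following is only a typical sketch of label-preserving augmentations (horizontal and vertical flips) applied to an image stored as a 2-D list; the actual operations used may differ.

```python
def augment(image):
    """Return the original image plus horizontally and vertically flipped copies.
    These flips enlarge the dataset without altering diagnostic texture/content."""
    hflip = [row[::-1] for row in image]   # mirror each row left-right
    vflip = image[::-1]                    # reverse row order top-bottom
    return [image, hflip, vflip]

# Toy 2x2 "image":
img = [[1, 2], [3, 4]]
variants = augment(img)   # 1 original image -> 3 training samples
```

Applying a handful of such transforms per image is how the esophagitis class is expanded to balance the five classes.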


Performance Measures
In this research, the performance of the suggested classification method is assessed using different performance measures: execution time, precision (PRE), accuracy (ACC), F1-score (F1), specificity (SPE), sensitivity (SEN), Cohen's Kappa score, and the Matthews correlation coefficient (MCC). Mathematically, these performance measures are given in Equations (37)-(43). We have utilized five-fold cross-validation to assess all of the quantitative results.
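For the binary case, the measures above can be computed directly from confusion-matrix counts, as sketched below (Cohen's Kappa is omitted for brevity; the counts are toy values).

```python
import math

def metrics(tp, tn, fp, fn):
    """Standard measures from binary confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)        # accuracy
    pre = tp / (tp + fp)                         # precision
    sen = tp / (tp + fn)                         # sensitivity / recall
    spe = tn / (tn + fp)                         # specificity
    f1 = 2 * pre * sen / (pre + sen)             # F1-score
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"ACC": acc, "PRE": pre, "SEN": sen, "SPE": spe, "F1": f1, "MCC": mcc}

# Toy counts: 95 true positives, 90 true negatives, 10 FP, 5 FN.
m = metrics(tp=95, tn=90, fp=10, fn=5)
# m["ACC"] = 0.925, m["SEN"] = 0.95, m["SPE"] = 0.90
```

For multi-class datasets such as Kvasir-V1, these quantities are computed per class (one-vs-rest) and then averaged.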


Results
Experiments are divided into five categories. The first category contains the results of the proposed methodology without feature optimization; these results are taken on the fused feature vector φ_Fused. In the second, third, and fourth experiments, the fused feature vector φ_Fused is further optimized using three different algorithms, and the optimized feature vector obtained as the output is fed to the machine learning classifiers. We have evaluated these experiments on two datasets (the description of the datasets is given in the previous section).

Similarly, the preprocessed Kvasir-V1 images are utilized for feature extraction. The feature vectors obtained from the ResNet18 model and XcepNet23 are each of ℳ × 512 dimensions, where ℳ symbolizes the total number of images in the dataset. A feature vector of ℳ × 59 is obtained through LBP. In the next step, feature fusion is performed to obtain a strong set of features: the three extracted feature vectors are ensembled to produce a fused feature vector, denoted by φ_Fused−2, of dimensions ℳ × 1083, which is utilized for classification. The highest classification accuracy achieved on the Kvasir_V1 dataset is 98.60%, obtained by the Q_SVM classifier in 41.758 s. With 98.47%, E_S_Disc achieved the second-highest accuracy, while C_KNN came in third with a 97.97% accuracy. The results of this experiment are depicted graphically in Figure 8 for both datasets, and Table 2 shows the results of this experiment.

In the second experiment, we have also computed the BDA-optimized feature vector φ_Fused_BDA on the Kvasir_V1 dataset. The ensemble feature vector obtained from the Kvasir_V1 dataset is of ℳ × 1083 dimensions and is optimized using the BDA algorithm; the optimized feature vector has ℳ × 499 dimensions. The highest classification accuracy of this experiment (Kvasir_V1 dataset) is 98.40%, achieved by the Q_SVM classifier in 22.685 s. The E_S_Disc classifier achieved the second-highest accuracy of 98.32% in 66.739 s, and C_KNN achieved the third-highest accuracy of 97.87%. F_G_SVM performed worst among all classifiers, achieving a 44.90% accuracy in 92.103 s. The visual representation of the results is given in Figure 9 for both datasets.

Table 3 presents the quantitative results of the third experiment. For the evaluation of the MFO-optimized features, we have utilized multiple classifiers. From the results, it can be observed that Q_SVM achieved a 100% accuracy in 67.941 s. Among the assessed classifiers, E_S_Disc came in second with an accuracy of 99.97%, followed by F_Tree (99.76%). The least effective model, C_Tree, achieved a classification accuracy of 85.66%. On the Kvasir_V1 dataset, the Q_SVM classifier achieved the second-highest accuracy of 98.37% in 36.044 s, and C_KNN achieved the third-highest accuracy of 97.42%. F_G_SVM performed worst among all classifiers, achieving a 36.22% accuracy in 102.24 s. The graphical representation of the results is given in Figure 10 for both datasets. The graphical representation of the results of the final experiment is given in Figure 11 for both datasets.
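The serial feature fusion used in the experiments above, which yields the 1083-dimensional ensemble vector (512 ResNet18 + 512 XcepNet23 + 59 LBP features per image), can be sketched as:

```python
def fuse(resnet18_feats, xcepnet23_feats, lbp_feats):
    """Serial (concatenation-based) fusion of per-image feature vectors:
    512 + 512 + 59 = 1083 features."""
    assert len(resnet18_feats) == 512 and len(xcepnet23_feats) == 512
    assert len(lbp_feats) == 59
    return resnet18_feats + xcepnet23_feats + lbp_feats

# Placeholder per-image vectors with the dimensions reported in the paper:
fused = fuse([0.0] * 512, [0.0] * 512, [0.0] * 59)
# len(fused) == 1083
```

Stacking the fused vector for all ℳ images produces the ℳ × 1083 matrix that the optimization algorithms then reduce to 499 (BDA), 437 (MFO), or 428 (PSO) columns.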

Discussion and Comparison with Existing Methods
The proposed approach is compared with the most recent methods in this section [1,3,19,33,35,42,43,56,57]. The authors in [3] proposed a method for gastrointestinal disease detection from WCE images and achieved a 98.40% accuracy. The methodology proposed in [35] was evaluated on three datasets and achieved 99.25%, 99.90%, and 100.0% accuracies on the Kvasir-V1, Nerthus, and CUI WAH WCE datasets, respectively. In another research study [19], two datasets were utilized for the assessment of the proposed method, namely Kvasir-V1 and CUI WAH WCE; the achieved accuracy on the CUI WAH WCE dataset was 99.80%, whereas on Kvasir-V1, an 87.80% accuracy was obtained. In [33], the authors achieved a 99.80% accuracy for WCE image classification. Khan et al. [43] utilized a hybrid dataset comprising 15,000 images and achieved a 99.50% accuracy. In [56,57], the authors proposed methods for gastrointestinal disease detection and classification, achieving 97% and 96.46% accuracies on the Kvasir-V1 dataset, respectively. Later on, the authors in [42] proposed a novel method for GIT disease classification using two optimization algorithms; the proposed technique is named Moth-Crow-based feature optimization. This approach outperformed the most recent methods on three datasets: CUI WAH WCE, Kvasir-V1, and Kvasir-V2, achieving 99.42%, 97.85%, and 97.20% accuracies, respectively. In this research, we have utilized two datasets for the evaluation of the proposed method: the Hybrid Dataset (CUI WAH WCE, Kvasir-V1, CVC-Clinic) and the Kvasir-V1 dataset. Our contribution in this research lies in the preprocessing phase, which returned a better set of features and, in turn, better performance. We have achieved a 100% accuracy on the Hybrid Dataset and a 99.24% accuracy on the Kvasir-V1 dataset in 58.140 and 23.957 s, respectively.
Table 5 provides a comparison of the results obtained by the proposed method and the existing methods. It is observed from the comparison that our proposed method has performed well on the Kvasir-V1 dataset in terms of all the performance measures when compared with the results of state-of-the-art research studies [19,42,56,57]. One of our previous studies achieved a 99.25% accuracy, which is 0.01% better than that of the present research; however, that method is more complex, and the proposed method achieved its results in less time. This small accuracy gap is a limitation of the proposed method that can be addressed in a future research study by using more advanced algorithms. In this research, we have utilized a hybrid dataset that comprises images from three different datasets. The proposed method has achieved a 100% accuracy on the hybrid dataset in 58.140 s, which is a notable achievement. It is evident from this comparison that our proposed approach has outperformed the most recent methods in terms of performance.

Conclusions
This research proposes a 23-layer deep learning model named XcepNet23 for gastrointestinal disease detection and classification. We have proposed a hybrid image preprocessing framework that is employed on the augmented dataset; the proposed hybrid contrast stretching-based image enhancement is applied to the dataset. We have extracted two types of CNN features in this research work. The first CNN feature vector is obtained from the global average pooling layer of the proposed XcepNet23 model, and the second CNN feature vector is obtained from the original augmented dataset by utilizing the fine-tuned ResNet18 model. The texture variation of the region of interest is a major challenge in the accurate recognition of gastrointestinal diseases; therefore, we have also extracted texture features using the LBP method. The obtained feature vectors are ensembled to obtain a strong set of features that incorporates the two types of CNN features and the texture features. The fused feature vector is optimized using three different nature-inspired metaheuristic optimization algorithms (BDA, MFO, and PSO). The experiments are conducted on two datasets: the Hybrid dataset and the Kvasir-V1 dataset. The evaluation of the method is performed on four different feature vectors: the original ensemble feature vector, the BDA-optimized ensemble feature vector, the MFO-optimized ensemble feature vector, and the PSO-optimized ensemble feature vector. In light of the comparison between the proposed method and previously proposed techniques, it is concluded that our method has performed better across all performance metrics. We achieved a 100% accuracy on the Hybrid Dataset and a 99.24% accuracy on the Kvasir-V1 dataset in 58.140 and 23.957 s, respectively.
One limitation of this research work is that the segmentation of gastrointestinal tract abnormalities is not performed. The most commonly observed abnormalities are bleeding, ulcers, and polyps, and their segmentation can be performed alongside disease recognition in future research work. It is difficult to detect these three types of diseases using the same methodology, as they exhibit different color, texture, and shape variations. The algorithm can also be expanded in future research to identify and categorize a wider variety of gastrointestinal disorders. This might entail gathering more labeled data for less prevalent diseases or researching transfer learning strategies to use knowledge from related disorders. Methods for deploying the approach in clinics and enabling real-time prediction can also be investigated. Moreover, in future work, this method may be improved by reducing its computational complexity.