A Predictive Model to Detect Cervical Diseases Using Convolutional Neural Network Algorithms and Digital Colposcopy Images

Cervical diseases, specifically cervical cancer (CC), are among the leading causes of death around the globe, imposing a significant challenge on scientists and healthcare providers who care for cervical disease patients. None of the existing solutions can detect the full range of cervical diseases in a way that lets experts accurately identify the early stages, owing to equipment limitations and the type of medical detection tests those solutions rely on. New technologies based on deep learning algorithms have been developed to enable more rapid and sensitive cervical cancer screening. This study proposes a predictive model that uses deep learning (DL) algorithms and colposcopy images to detect different classes of cervical diseases, including their different stages, offering the medical sector an opportunity for early-stage diagnosis of cervical diseases. Four rounds of experiments were conducted in this research to evaluate the performance of the proposed model. According to the results, the proposed model can detect the classes (stages) of cervical diseases with high accuracy. Training accuracy was above 92% in all rounds, and the highest accuracy achieved was 99% in the third experiment. In that round, the model also achieved the best test performance, with accuracy and sensitivity of 98% each. Notably, the third and fourth experiments achieved a perfect specificity value of 1.


I. INTRODUCTION
Nowadays, artificial intelligence (AI) and its extensions have demonstrated strong performance in different domains, such as natural language processing and image processing, especially since the emergence of DL. As DL is applied in many fields of research and improves results, some fields, such as medicine, demand an exact and high level of accountability and, thus, clarity. A review of the literature allows DL work to be categorized by its interpretability and by the different techniques used to investigate complex patterns. Such a categorization supports more discerning use of DL in medical research: clinicians and experts can approach these methods carefully and give them fuller consideration in the medical field [1].
DL has several uses, including helping with medical diagnostics. This encompasses, but is not limited to, biomedicine, magnetic resonance image analysis, and health informatics [2]. Segmentation, diagnosis, classification, prediction, and identification of various anatomical regions of interest (ROI) are more specialized applications of DL in the field of medicine. The development of CAD systems helps medical practitioners diagnose and understand data while minimizing human error [3]. There are two basic approaches to using DL for medical diagnostics. The first involves categorizing data and narrowing down potential outcomes (diagnosis) by connecting data to particular outcomes. The second uses physiological data, such as medical images and data from other sources, to find and diagnose tumours or other disorders. In short, DL is employed in a variety of ways in medical diagnostics [4].
Machine learning (ML) algorithms are frequently used in CAD systems with medical imagery to detect and diagnose cancer. Feature extraction is typically a crucial stage in the adoption of ML algorithms, and multiple feature extraction techniques have been researched for various imaging modalities and cancer types. These feature extraction-based techniques have significant drawbacks, which prevent the performance of CAD systems from being further enhanced. DL algorithms have wide applications in the processing of medical images, and a large number of publications emphasize their use in image classification, detection, enhancement, image generation, segmentation, and registration [5].
Early detection greatly aids the diagnosis of cancer and increases long-term survival rates. Medical imaging is a key method for identifying and diagnosing cancer early on, and it has been widely used for early cancer detection, monitoring, and follow-up after treatment. However, manually interpreting the vast number of medical images can be tiresome, time-consuming, and prone to bias and error. Computer-aided diagnosis (CAD) systems were introduced in the early 1980s to help clinicians interpret medical images more quickly.
Cervical cancer is the fourth most prevalent malignancy worldwide among women aged 15 to 44, both in terms of incidence and fatality rate. Worldwide, the majority of women with cervical cancer have either never been screened or have been screened insufficiently. According to studies, nearly 50% of patients with cervical cancer had never undergone cervical cytology, and another 10% of women who develop cervical cancer had not been screened for the disease in the preceding five years. Widespread and routine screening of the general public is therefore one of the most effective strategies for lowering the morbidity and mortality of cervical cancer. According to the research, in 2018 Asia accounted for nearly half of all new cancer cases worldwide and more than half of all cancer fatalities [6]. In the UK, almost 2,800 women are diagnosed each year and 1,000 women die from cervical cancer, with screening estimated to save up to 5,000 lives per year [7].

A. RESEARCH HIGHLIGHT & CONTRIBUTION
This subsection highlights the contributions made by the authors in this paper. The research mainly aims to provide a comprehensive review of state-of-the-art literature related to DL techniques, automated approaches to cervical disease diagnosis, and the existing solutions for detecting cervical diseases using DL algorithms and digital colposcopy images. Our contributions can be summarized as follows:
•A survey of existing solutions to diagnose CC using DL algorithms.
•A proposed predictive model using DL algorithms based on the CNN structure to enhance cervical disease (CC) detection using colposcopy images.
•A presentation of existing research gaps and open problems in previous work on detecting cervical diseases using ML methods.
•A series of experiments on the developed model are conducted to validate the results and evaluate the performance of the model in terms of detecting different stages of cervical diseases.
The rest of the paper is organized as follows. Section II states the problem addressed by our research. Section III covers the related work, where the solutions that diagnose CC by applying DL algorithms are discussed along with an analysis comparing the studies. Section IV provides the methodology, which discusses the proposed predictive model using DL algorithms based on the CNN structure to enhance cervical disease detection using colposcopy images. Section V includes the details of the implementation of the model, such as the setup, performance metrics used, and datasets. Section VI provides the results obtained from the experiments and the related discussion, and finally, Section VII concludes our experiments.

II. PROBLEM STATEMENT
This section provides the problem statement, which is extracted from the review made in this research. To address the identified problems, a CNN-based solution is proposed.
Cancers, and most specifically cervical cancer (CC), are among the leading causes of death around the globe, imposing a significant challenge on scientists and healthcare providers who care for cervical disease patients. None of the existing solutions can accurately detect the early stages of cervical diseases, owing to the limitations and the type of medical detection tests used in those solutions [8]. This study proposes a predictive model using DL algorithms to detect different classes of cervical diseases, including their early stages, and explores the impact of increasing the number of classes on the accuracy of the proposed model. This offers an opportunity for early-stage diagnosis of cervical diseases. Detection of cervical diseases relies mainly on Pap smear screening and colposcopy, two procedures that depend heavily on professionally trained specialist doctors, who are often scarce in low-resource countries and rural areas. Recent deep learning-based solutions achieve notable speed and accuracy in object detection and classification, with specific applications in health monitoring. Although several previous studies have combined deep learning-based algorithms with standard cervical disease screening tests such as colposcopy and Pap smear screening, they are not able to detect the early stages of cervical diseases because of their limitations and the type of medical detection tests used. Moreover, any adequate solution must reach a high, accepted level of accuracy across all stages of cervical precancer and cancer to support timely treatment. Therefore, there is an intense need to further improve existing deep learning-based digital solutions for timely and accurate diagnosis and detection of cervical diseases at all stages, especially the early stages, to avoid morbidity and mortality, particularly in low-income countries.

III. RELATED WORK
To reduce cancer-related fatalities, researchers around the world employ a variety of DL-based solutions to accurately diagnose CC. Alyafeai and Ghouti created a fully automated pipeline for the detection and classification of CC using Cervigram images. The proposed pipeline consists of two deep learning models: the first detects the cervix region 1,000 times more quickly, while the second classifies cervical tumors using self-extracted features. These features were further analyzed by two CNN-based models. Two Cervigram datasets were used to train and evaluate the key components of the DL pipeline. The authors reported that the proposed DL classifier achieved an area under the curve score of 0.82 at 20 times the classification speed of earlier Cervigram approaches. The method can also be deployed in mobile phone applications to improve detection efficacy. However, the proposed pipeline lacks the perceptual quality of Cervigrams needed to provide improved and more precise labelling of the cervical region of interest (ROI) [9].
Similarly, Zhang et al. classified precancerous cervical lesions using a pre-trained, densely connected CNN as a computer-assisted diagnosis technique. The proposed method was used to evaluate CIN2 or higher-level cervical lesions after preprocessing the image data (4,337 negative samples and 3,902 positive samples) with ROI isolation and data amplification. A DenseNet CNN was fine-tuned on two datasets, ImageNet and Kaggle, with the parameters of all layers updated. Different quantities of training data, random initialization (RI) training from scratch, K-fold cross-validation, and fine-tuning of a pre-trained model were examined to determine their effect on the performance of the model. Intriguingly, the results revealed 73.08% accuracy with an AUC of 0.75 on 600 test images. Nonetheless, the data enhancement and the CNN algorithm require further development to create a more effective diagnostic structure for analyzing new data on precancerous cervical lesions [10].
In contrast, Bai and colleagues developed the CNN-based cervical lesion detection network (CLDNet) model to extract deep features from colposcopy images. Specifically, they utilized a Squeeze-and-Excitation (SE) CNN to recalibrate the images' isolated features. In addition, they generated proposal boxes through a region proposal network (RPN) to emphasize a region of interest (ROI). A total of 6,536 colposcopic images were selected, with 5,095 serving as training data, from which 2,567 negative images and 2,528 positive cervical images were isolated. The results demonstrated an average precision of 92.53% for the extracted lesion region, with an average recall rate of 85.56%, positively augmenting auxiliary diagnosis [6].
Recently, a novel fuzzy reasoning model was implemented to classify cervical images following an acetic acid test in order to reduce the risk burden of CIN. Liu et al. used an automated image segmentation algorithm to derive valuable information from the acetowhite region of 505 patients before and after acetic acid tests (383 CIN-negative and 122 CIN-positive). The grayscale change, texture coarseness, and image complexity of the post-test images were analyzed. Sensitivity and specificity for the three parameters were, respectively, 80.8%, 80.9%, and 82.8% versus 82.0%, 87.4%, and 86.2%. In addition, fuzzy reasoning significantly enhanced overall sensitivity and specificity. Nevertheless, this solution cannot differentiate between low-grade and high-grade SIL cases, and the sample size was limited, which further restricted the number of analyzable features [11].
Using images of the uterine cervix, Chen and associates employed a computer-aided diagnostic system to assist in the diagnosis of cervical diseases such as HPV, CIN, and CER. Briefly, they segmented the ROI from three distinct image types (natural, acetic acid, and Lugol's iodine test) using the proposed random forest (RF) segmentation algorithm. The ROI was further characterized in seven color spaces to extract features for the classification of cervical diseases using the Boruta algorithm. The final diagnosis of the three cervical diseases was accurate 83.1% of the time; however, the non-uniform distribution and small population size were the primary limitations of the study [12].
In addition, Hu and colleagues undertook an observational investigation of a DL algorithm on a longitudinal cohort of 9,406 females in Costa Rica who underwent multiple cervical screening protocols and histopathological observation for precancer/cancer. The cervical examinations included cervicography, HPV testing, and cytology; Cervigrams were then used to evaluate cervical images in terms of detection, feature extraction, and classification using a Faster R-CNN. Automated visual analysis of the Cervigrams identified precancer and cancer cases with greater precision (AUC = 0.91) than human analysis. However, the study used a small sample size and only included CIN2 cases, as opposed to CIN3 and AIS cases. In addition, images were obtained with a film camera rather than a digital camera, and they were taken by a limited number of nurses with extensive training [13].
In contrast, Shrivastav and colleagues obtained colposcopic images of cervical cancer from anonymized patients undergoing routine examinations for cervical lesions at the outpatient department of Batra Hospital and Medical Research Institute in New Delhi. The retrieved images were processed before applying the Earth mover's distance (EMD) segmentation algorithm in R (the programming language), software capable of automatically quantifying, identifying, and classifying morphological features, sensitivity, and color intensity to expedite the diagnostic process. Comparing anonymous colposcopic images (endocervix, ectocervix, and endo-ectocervix) from image repositories powered by MobileODT and Kaggle against the quantitative values generated by the algorithm revealed high validity [14].
Guo et al. employed a combination of DL networks, namely RetinaNet, fine-tuned DL models (VGG, Inception), and transfer learning models (VGG or Inception feature extractor + SVM), to evaluate the sharpness of HPV-associated cervical images, since blur hinders accurate diagnosis of cervical lesions. They obtained 4,525 de-identified images from 1,399 females using MobileODT's EVA system and categorized them as 'Not sharp' or 'Sharp'. Based on the results, RetinaNet's sensitivity and specificity were 98% and 85%, respectively, with an accuracy of 94%, making it the model with the best overall performance among those studied [15]. In addition, another group of researchers evaluated the use of smartphones to detect cervical lesions in individuals with atypical cervical cytology. The cervixes of seventy-five females with aberrant cervical cytology were examined by specialists via smartphone or colposcopy.
The diagnostic potential of smartphones for CIN1 and CIN2 was then evaluated, and the kappa value was calculated to quantify the chance-adjusted agreement between the smartphone-based histologic observations and the colposcopic findings. The investigation revealed a significant correlation between smartphone-based histologic diagnosis and colposcopic output, with a kappa value of 0.67. The smartphone's sensitivity and specificity in diagnosing CIN1 or worse were 0.89 and 0.83, while for CIN2 or worse they were 0.92 and 0.24, respectively. Smartphones thus demonstrated high sensitivity and positive predictive value (PPV) for CIN1 detection, whereas specificity and negative predictive value (NPV) were low. PPV and NPV are also affected by disease prevalence, so the findings cannot be generalized to other populations [16].
Using Cervigram images, Elayaraja and Suganthi proposed a new method for diagnosing cervical tumors. Cervical images acquired from the Guanacaste dataset (2005) were preprocessed with the oriented local histogram technique (OLHT) to enhance edges, followed by the dual-tree complex wavelet transform (DT-CWT) to obtain high multi-resolution images. From the processed images, the desired features, including the grey level co-occurrence matrix (GLCM), local binary pattern (LBP), moment invariants, and wavelet features, were extracted. The isolated features were then used to train and validate a feed-forward back-propagation neural network that classifies cervical images as benign or malignant. The accuracy, sensitivity, and specificity were, respectively, 98.29%, 97.42%, and 99.36%. Despite these advantages, the method cannot be used to diagnose cervical melanoma using a Pap smear Cervigram [17].
Similarly, Kudva et al. validated the feasibility of classifying cervical images as malignant or non-malignant using a shallow-layer CNN. 102 females were photographed after an acetic acid (3-5%) test using an Android device, yielding 42 VIA-positive (pathologic) images and 60 VIA-negative (healthy subject) images. Eventually, 275 image patches (15 × 15 pixels) from CC patients and 409 from healthy controls were painstakingly extracted. These images were classified using a shallow CNN comprising a convolutional layer, a pooling layer, a rectified linear unit, and two fully connected layers. Overall classification accuracy was 100%; however, depending on the percentage of training data, sensitivity and specificity ranged from 61.9-71.3% and 69-77.3% for traditional machine learning, whereas for DL they ranged from 43.9-100% and 75.6-100%, respectively [18].
Zhang and coworkers also employed DL tools to classify images of cervical cancer to aid clinicians in making a more accurate diagnosis. After filtering the dataset, they experimented with 6,692 cervical images classified as type 1, type 2, and type 3. In the first phase, a CNN was used to segment cervical lesions, whereas in the second phase a neural network model similar to CapsNet was employed to classify them. The accuracy on the training set and the test set was 99.9% and 80.1%, respectively. This model exhibited overfitting due to a lack of optimization and structural adjustment of the CapsNet-cervical network [19].
Jaya and Kumar used the Adaptive Neuro-Fuzzy Inference System (ANFIS) to classify cervical lesions in order to detect cervical cancer. After collecting 50 cervical images from the Guanacaste dataset, benign (35) and malignant (15) cases were distinguished. A Fast Fourier Transform (FFT) was used to centrally align the images before extracting GLCM, trinary, and gray-level features. The ANFIS classifier then trained on and classified the extracted features with 99.36% accuracy, 97.42% sensitivity, and 99.36% specificity [20]. These outcomes were analyzed using MATLAB R2014b. In addition, numerous attempts have been made to improve the diagnostic efficacy of colposcopy images. Fragoso et al. examined three well-known automated classification models: k-Nearest Neighbors, C4.5, and Naïve Bayes. The study enrolled 200 females with positive Pap smears referred for colposcopy. The cervical region was treated with 3% acetic acid, and data were collected using MATLAB (R2009a) with an STC-N63J camera. A total of 180 images were captured from the colposcope using a green filter, and ten reference images were acquired prior to the administration of acetic acid. After central alignment of the images, the automatic models classified the colposcopic images with a sensitivity, specificity, and accuracy of 60%, 79%, and 70%, respectively. The automatic classification method must nevertheless be refined further to prevent false negative and false positive results [21]. Using colposcopy images, Miyagi et al. investigated DL as AI for the classification of cervical squamous intraepithelial lesions (SIL). Oncologists who performed biopsy and colposcopy on 330 patients diagnosed 97 with low-grade SIL and 213 with high-grade SIL. An AI classifier based on a CNN with 11 layers was constructed and trained. This study revealed that the accuracy, sensitivity, and specificity of the AI classifier versus human-based diagnosis of high-grade SIL were 0.823 and 0.797, 0.800 and 0.831, and 0.8823 and 0.7733, respectively, while the area under the receiver operating characteristic curve was 0.826 ± 0.052. These results demonstrated the superior performance of the AI classifier compared to human-based diagnosis, but they lack sufficient inference, and the current framework still requires classifier validation [22]. Table 1 summarizes the results reviewed in the related work.

IV. METHODOLOGY
A. PROPOSED MODEL
This section explains the proposed model for enhanced cervical disease detection using colposcopy images. Figure 1 illustrates the predictive model. By classifying digital colposcopic images of cervical lesions, the cervix can be categorized as normal or as one of several stages of precancer and cancer, which enhances the effective diagnosis of CC. The CNN algorithm is applied as the object detection algorithm to detect the lesion region, and the precision of the algorithm is improved by improving the feature extraction process. The VGG-16 model is a convolutional neural network (CNN) architecture. As shown in Figure 1, the VGG-16 model consists of a total of 16 layers, including 13 convolutional layers and 3 fully connected layers. The architecture is characterized by its simplicity and uniformity: all the convolutional layers have a 3 × 3 filter size and a stride of 1, and all the max pooling layers have a 2 × 2 window size and a stride of 2. This design allows the model to learn hierarchical features of increasing complexity as the input image passes through the layers.
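As an illustration, the following minimal sketch, assuming the tf.keras implementation of VGG-16 rather than the authors' own code, instantiates the standard architecture and prints its layer-by-layer structure; the default 224 × 224 × 3 ImageNet input size is an assumption of the sketch, not a value taken from this study.

```python
# A minimal sketch, assuming the tf.keras build of VGG-16 (not the
# authors' exact code). The network stacks 13 convolutional layers
# (3x3 filters, stride 1) in five blocks, each block ending in 2x2
# max pooling (stride 2), followed by 3 fully connected layers.
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet", include_top=True)
model.summary()  # prints the 16 weight layers described above
```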

B. IMAGE FEATURE EXTRACTION
Cervical precancerous lesions and invasive cancers exhibit significant abnormal morphological features that can be identified by colposcopy. Pathological features of the cervical epithelial tissue, such as colour characteristics, opacity, edge division, and tissue shape, must be observed by an expert doctor to reach a clinical diagnosis. Because of the subjective nature of the examination, the accuracy of colposcopy is highly dependent on the doctor's experience and expertise. Consequently, colposcopy has low specificity and leads to many unnecessary biopsies [22]. Cervical images observed by colposcopy during cervical disease screening are divided into five categories: normal, CIN1, CIN2, CIN3, and cancerous images.
The proposed feature extraction network aims to extract features of the lesion region in the cervical image [6]. A pre-trained model (e.g., VGG-16, ResNet50) has already learned to extract features from images, and its final fully connected layer is replaced with a new set of fully connected layers trained on the specific classification task. In this way, the pre-trained model acts as a feature extractor for the images, and the new fully connected layers learn to classify the features it extracts. Image feature extraction thus happens implicitly during the transfer learning step, as sketched below.
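To make the transfer-learning step concrete, the sketch below follows the description above using tf.keras: the pre-trained VGG-16 convolutional base is frozen and a new fully connected head is trained on the cervical classes. The 180 × 180 input size follows the preprocessing description later in the paper, while the head sizes, dropout rate, and three-class output (the Kaggle Type 1-3 setting) are illustrative assumptions, not the authors' exact settings.

```python
# A minimal transfer-learning sketch, assuming tf.keras; hyperparameters
# are illustrative, not the authors' exact configuration.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained convolutional base without its original classifier head.
conv_base = VGG16(weights="imagenet", include_top=False,
                  input_shape=(180, 180, 3))
conv_base.trainable = False  # use it as a fixed feature extractor

model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # new trainable head
    layers.Dropout(0.5),                    # regularization against overfitting
    layers.Dense(3, activation="softmax"),  # e.g., Type 1/2/3 classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```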

C. CLASSIFICATION MODULE FOR CERVICAL DISEASES IMAGES
The proposed model takes an image of arbitrary scale as input and produces a set of rectangular target proposal boxes as output.

V. IMPLEMENTATION
A. EXPERIMENT SETUP
All computer-based experiments are conducted on a workstation running the Microsoft Windows 10 operating system. The machine has an Intel Core i5 central processing unit (CPU), Intel UHD Graphics 630, and 16 GB of memory to enhance the processing speed of the DL algorithms and data manipulation. The proposed DL-based cervical disease detection model is executed in an Anaconda environment using the Spyder and Jupyter editors, the Python programming language, and ML and DL packages. Table 2 summarizes the software and hardware specifications of the experimental environment.

B. PERFORMANCE METRICS
The performance of the DL algorithm is evaluated to ensure the model works efficiently. The choice of effectiveness criteria depends on the dataset and the selected metrics. In the present study, the primary experimental dimension is the number of classes, and we explore the impact of increasing the number of classes on the other measurements. For this purpose, accuracy, specificity, and sensitivity are used as performance metrics to evaluate algorithm performance, following previous publications [3], [9].
Sensitivity and specificity are especially crucial in this study since the data come from the medical domain, which calls for a greater emphasis on the true positive rate and true negative rate. From a medical standpoint, a higher true positive rate (sensitivity) means fewer false negatives, which are of the greatest relevance in this case. The same holds for the true negative rate, which implies very few false positives. The true positive, false positive, true negative, and false negative values from the confusion matrix can also be used to compute other performance metrics.

1) CONFUSION MATRIX
One crucial instrument that assists us in assessing the effectiveness of our approach is the confusion matrix. As the name implies, it is a matrix of size n × n, where n is the number of class labels in our problem [29]. Its entries are defined as follows, and a short computational sketch follows the definitions:
•True Positive (TP): a data instance that the model correctly predicts as belonging to the positive category; both the actual and the predicted labels are positive.
•True Negative (TN): a data instance that the model correctly identifies as belonging to the negative category; both the actual and the predicted labels are negative.
•False Positive (FP): a data instance where the model predicts the label as positive when the actual label is negative.
•False Negative (FN): a data instance that the model incorrectly predicts as negative; the actual label is positive, but the predicted label is negative.
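As a hedged illustration of these definitions, the sketch below builds an n × n confusion matrix with scikit-learn and derives per-class TP, FP, FN, and TN counts from it; the labels are made-up placeholders, not the study's data.

```python
# A minimal sketch with illustrative labels (not the study's data):
# building an n x n confusion matrix and extracting per-class counts.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])  # actual class labels
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])  # predicted class labels

cm = confusion_matrix(y_true, y_pred)  # shape (n_classes, n_classes)

for k in range(cm.shape[0]):
    tp = cm[k, k]                  # class k predicted as class k
    fp = cm[:, k].sum() - tp       # other classes predicted as k
    fn = cm[k, :].sum() - tp       # class k predicted as something else
    tn = cm.sum() - tp - fp - fn   # everything else
    print(f"class {k}: TP={tp} FP={fp} FN={fn} TN={tn}")
```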

2) ACCURACY METRIC
One of the most crucial performance indicators is accuracy, which may be calculated as the proportion of correctly predicted observations to all observations, as shown below [29]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

In some cases, accuracy is insufficient to judge the performance of a model. Some models must identify positive cases more precisely, while others must identify negative cases more accurately. Consider a cancer detection model, and assume, as is typically the case, that the test data contain more cancer-negative patients than cancer-positive patients. The model must then recognize positive cases more reliably than negative ones: if it produces a false positive, the patient can still be retested with several tests to determine whether or not he or she is positive, but we cannot afford false negatives, because a patient whose positive case the model forecasts as negative may never learn that he or she has cancer. Different models therefore call for different metrics depending on the use case, and accuracy is not necessarily the best or a sufficient statistic, as the worked example below illustrates.
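The following worked example, with assumed counts rather than the study's data, makes the point concrete: on a test set of 100 patients with 95 cancer-negative and 5 cancer-positive cases, a model that predicts "negative" for everyone reaches 95% accuracy while detecting no cancer at all.

```python
# A worked example with assumed numbers: high accuracy, zero sensitivity.
tp, fn = 0, 5    # all 5 positive cases missed
tn, fp = 95, 0   # all 95 negative cases correct

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.95
sensitivity = tp / (tp + fn)                # 0.0 -- no positives detected
print(accuracy, sensitivity)
```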

3) SENSITIVITY METRIC
Sensitivity measures how well a machine learning algorithm identifies positive examples. It is also known as the true positive rate (TPR) or recall and is computed as:

Sensitivity = TP / (TP + FN)

Sensitivity quantifies how many of the positive instances the model is able to detect, which is why it is used to assess model performance on positives. A model with high sensitivity produces few false negatives, meaning it misses few of the positive examples. In other words, sensitivity assesses how well a model recognizes positive samples, which is crucial because reliable predictions require identifying all of the positive examples. Sensitivity (the true positive rate) and the false negative rate sum to one, so the larger the true positive rate, the better the model is at correctly recognizing positive cases [29].

4) SPECIFICITY METRIC
Specificity is the percentage of actual negatives that the model correctly detects, also referred to as the true negative rate (TNR):

Specificity = TN / (TN + FP)

The remaining fraction of actual negatives are misclassified as positive and are counted as false positives. Specificity (the true negative rate) and the false positive rate always sum to one. Low specificity indicates that the model is mislabeling many negative cases as positive, while high specificity indicates that the model is accurately identifying the majority of the negative cases [29]. A short sketch computing all three metrics follows.
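The sketch below, again on illustrative binary labels rather than the study's data, computes accuracy, sensitivity, and specificity from a 2 × 2 confusion matrix; the (TN, FP, FN, TP) ordering is scikit-learn's documented behaviour for ravel() in the binary case.

```python
# A minimal sketch on illustrative binary labels (not the study's data).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # TPR; note TPR + FNR = 1
specificity = tn / (tn + fp)  # TNR; note TNR + FPR = 1
print(accuracy, sensitivity, specificity)
```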

C. DATASETS
In the present study, two different Cervigram datasets are used. The following sections provide more details about the datasets.

1) INTEL&MOBILEODT DATASET
Cervical images are collected from the Intel & MobileODT Cervical Cancer Screening dataset available on Kaggle, the largest data science community, which offers powerful tools, resources, and databases for data science projects and researchers [30]. The dataset consists of a total of 6,734 labelled images covering three stages of CC: cervical intraepithelial neoplasia grade 1 (CIN1, type 1), CIN2 (type 2), and CIN3 (type 3).
In this project, 1,191 type_1, 3,567 type_2, and 1,976 type_3 images are collected from the additional-images set; 250 type_1, 781 type_2, and 450 type_3 images are collected from the train-images set; and 4,018 test images in the Intel & MobileODT Cervical Cancer Screening dataset are unlabelled. The images are distributed as shown in Table 3.

2) INTERNATIONAL AGENCY FOR RESEARCH ON CANCER
The atlas entitled Atlas of Colposcopy - Principles and Practice was developed in the context of the cervical cancer screening research studies of the International Agency for Research on Cancer (IARC) and the related provision of technical support for the regional and national scale-up of cervical cancer screening programs [31]. The IARC Colposcopy Image Bank comprises 913 images and corresponding metadata from 200 cases. The images, while de-identified, contain information related to the clinical care of patients. In this project, 89 Normal (Type-0), 78 LSIL-CIN1 (Type-1), 88 HSIL-CIN2 (Type-2), 153 HSIL-CIN3 (Type-3), 27 LSIL-HPV (Type-4), 24 Adenocarcinoma (Type-5), 39 Invasive squamous cell carcinoma (Type-6), 15 Micro-invasive squamous cell carcinoma (Type-7), and 32 Squamous cell carcinoma (Type-8) images are collected; 390 images in the IARC collection carry no specific label. The labelled images are distributed into test (30%) and train (70%) subsets as shown in Table 4, for example using a stratified split as sketched below.
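As a hedged sketch of the 70%/30% split described above, the snippet below uses scikit-learn's train_test_split with stratification so that the class proportions are preserved in both subsets, which matters for the small IARC classes; the integer image IDs and the dummy labels are illustrative placeholders, not the actual IARC metadata.

```python
# A minimal sketch of a stratified 70/30 split; IDs and labels are
# placeholders standing in for the 545 labelled IARC images.
from sklearn.model_selection import train_test_split

image_ids = list(range(545))          # placeholder image identifiers
labels = [i % 9 for i in range(545)]  # dummy labels over the 9 types

train_ids, test_ids, y_train, y_test = train_test_split(
    image_ids, labels, test_size=0.30, stratify=labels, random_state=42)

print(len(train_ids), len(test_ids))  # roughly 70% / 30%
```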

VI. RESULTS & DISCUSSION
In this section, the experimental results, applied methods, and relevant discussions are presented.

A. EXPERIMENTAL RESULTS
After analyzing the algorithms presented in the literature review section, the authors categorized the experimental work based on the related research gaps.
1) This experimental work is carried out on the images available in the Kaggle train set. The images used in this study are in JPG format and have a 3:4 aspect ratio, as shown in Figure 3. Most of the images are 2448 × 3264 or 3096 × 4128 pixels, as shown in Figure 4. The Kaggle data are used in two configurations, Experiment 1 and Experiment 2. In Experiment 1, the number of cervical disease images differs across Type 1, Type 2, and Type 3. In Experiment 2, the number of cervical disease colposcopy images is equal across Type 1, Type 2, and Type 3; for these experiments and results, we fixed the training dataset at 250 images per type [30].
2) The IARC Cervical Cancer Screening dataset is a comprehensive collection of annotated images and illustrations that details almost all common and uncommon colposcopy abnormalities. These images are also in JPG format with a 3:4 aspect ratio, and all are 800 × 600 pixels. The IARC data are used in two configurations, Experiment 3 and Experiment 4: Experiment 3 uses the original dataset [31], while Experiment 4 uses the augmented dataset described below.

B. PREPROCESSING
As seen in Figure 4, the images in the main set come in various sizes. To ensure homogeneity and fast processing, all the images in Experiments 1, 2, 3, and 4 are resized to 32 × 32 pixels. Before the images are supplied as input to a DL neural network model for training or evaluation, their pixel values must be scaled, which can be done with a preferred scaling method at training or evaluation time. In addition, some photos were captured in landscape orientation and needed to be rotated. For this reason, we rotated the images and then converted them all to 32 × 32 so that the model has enough features for training while keeping the computational load light. A sketch of this preprocessing follows.
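A minimal sketch of this step, assuming PIL and NumPy for file handling (the paper does not state its exact image I/O library), is given below.

```python
# A minimal preprocessing sketch, assuming PIL/NumPy: rotate landscape
# images to a common orientation, resize to 32x32, scale pixels to [0, 1].
import numpy as np
from PIL import Image

def preprocess(path, size=(32, 32)):
    img = Image.open(path).convert("RGB")
    if img.width > img.height:        # landscape -> rotate to portrait
        img = img.rotate(90, expand=True)
    img = img.resize(size)            # homogeneous 32x32 input
    return np.asarray(img, dtype=np.float32) / 255.0  # scale to [0, 1]
```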

C. AUGMENTATION METHODS
In Experiment 4, to increase the number of images in the dataset, the augmentation methods include rotation by 45°, horizontal flips, and vertical flips, as shown in Figure 5. When only a few colposcopy images are available, no DL model can be trained effectively with them; DL models perform inadequately when trained with modest amounts of training data.
Therefore, to enlarge the training set, Cervigram images are artificially generated using a data augmentation strategy so that the DL model can operate with a large training size. For example, Krizhevsky et al. used horizontal reflections and image translations on the training images to train the AlexNet model, increasing the size of the training set by a factor of 2048 [32]. Alyafeai and Ghouti employed different mild image manipulation techniques, such as random rotation, random cropping, and random flipping [9].
In this study, data preprocessing comprises the following steps, sketched in code below:
a. Resizing all colposcopy images to the same size, 180 × 180
b. Normalizing pixel values
c. Applying image deformations: rotation by 45°, horizontal flip, and vertical flip
d. Storing the data in NumPy format
Table 5 provides more information about the number of images in the datasets after applying data augmentation.
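A minimal sketch of steps (a)-(d), assuming Keras' ImageDataGenerator for the deformations (the paper does not name its augmentation library) and a placeholder batch in place of the real images, is shown below.

```python
# A minimal augmentation sketch; the random batch stands in for images
# already resized to 180x180 and normalized (steps a-b above).
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x = np.random.rand(32, 180, 180, 3).astype("float32")  # placeholder batch

augmenter = ImageDataGenerator(rotation_range=45,   # step (c): rotation
                               horizontal_flip=True,
                               vertical_flip=True)

augmented = next(augmenter.flow(x, batch_size=32, shuffle=False))
np.save("augmented_batch.npy", augmented)           # step (d): NumPy format
```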

D. MEASURES IN MODEL BUILDING TO PREVENT THE OVERFITTING
Several measures are implemented in this study to prevent overfitting in the classifier. First, we froze a certain number of layers in the pretrained VGG16 model (conv_base); by setting these layers' trainable attribute to False, their weights are not updated during training. This exploits the knowledge learned by the pretrained model while preventing the model from overfitting to the current dataset. In addition, a Dropout layer follows the Flatten layer. Dropout is a regularization technique that randomly sets a portion of input units to 0 during training, which reduces the interdependencies between neurons and functions as a form of ensemble learning. This prevents the model from relying too heavily on specific features, allowing it to generalize more effectively.
Freezing some of the VGG16 model's layers also reduces the number of trainable parameters, which limits the model's capacity and helps prevent overfitting, particularly when the dataset is small. We utilized the ReduceLROnPlateau callback, which monitors the validation loss and reduces the learning rate when the loss stops improving; by making smaller weight updates, this adaptive adjustment helps the model converge more effectively and avoid overfitting. The EarlyStopping callback monitors the validation loss and terminates training if the loss does not improve after a predetermined number of epochs (the patience), preventing the model from continuing to train and overfit once its performance on the validation set stops improving. Finally, the ModelCheckpoint callback stores the model weights with the lowest validation loss, so the model with the best generalization performance is retained even if training continues for a larger number of epochs. These safeguards are sketched below.
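A minimal sketch of these callbacks, assuming tf.keras and with illustrative patience values, factors, and file name rather than the authors' exact settings, is shown below.

```python
# A minimal sketch of the anti-overfitting callbacks; the numeric
# settings and checkpoint file name are illustrative assumptions.
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        ReduceLROnPlateau)

callbacks = [
    # Shrink the learning rate when validation loss stops improving.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    # Stop training once validation loss has stalled for several epochs.
    EarlyStopping(monitor="val_loss", patience=10,
                  restore_best_weights=True),
    # Keep only the weights with the lowest validation loss seen so far.
    ModelCheckpoint("best_model.h5", monitor="val_loss",
                    save_best_only=True),
]

# model.fit(train_data, validation_data=val_data,
#           epochs=100, callbacks=callbacks)
```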

E. ANALYSIS AND DISCUSSION
This study includes four experiments. In Experiment 1, which uses data from MobileODT and Kaggle, the number of images selected for the various classes in the training and test datasets (Type 1 to Type 3) varies. In Experiment 2, which uses the same MobileODT and Kaggle data, the number of images selected for the various classes in the training and test datasets (Type 1 to Type 3) is the same. In Experiment 3, which uses data from IARC, the number of original images selected for each class in the training and test datasets (Type 0 to Type 8) varies. In Experiment 4, which also uses data from IARC, the number of augmented images selected for the various classes in the training and test datasets (Type 0 to Type 8) varies.
These four experiments use different datasets with different numbers of samples. Experiments 1 and 2 use the first dataset (MobileODT and Kaggle), which has a sufficient number of images, so no augmentation is needed. Experiments 3 and 4 use the second dataset (IARC), for which we use augmentation due to the lack of images in each class. Therefore, two separate comparisons are conducted and explained so that the above-mentioned differences do not affect the analysis and the related discussion. The training and test performance of Experiments 1 and 2 is based on the data presented in Tables 6 and 8, while the training and test performance of Experiments 3 and 4 is based on the data listed in Tables 7 and 9.

1) A COMPARISON BETWEEN EXPERIMENT 1 AND EXPERIMENT 2 OF THE TRAINING STAGE
This section describes a comparison between Experiment 1 and Experiment 2; Table 6 compares the performance metrics of the two experiments, and the related discussion is presented below.

Table 6 presents the train loss and validation loss values from Experiment 1 and Experiment 2 for epochs ranging from 20 to 100 iterations, and Figure 7 illustrates the train and validation loss for both experiments across epochs. As shown in Figure 7, Experiment 1 uses a different number of images per class, while Experiment 2 uses an equal number of images in each class. In general, as the epoch count increases, both the train and validation losses decrease; this is expected, since more training helps the model learn and generalize better. Experiment 1 appears to perform better than Experiment 2 in terms of validation loss, especially at the larger epoch counts, suggesting that Experiment 1's model generalizes better to new data. For epochs 20 and 50, Experiment 2 has a lower train loss than Experiment 1, but this advantage disappears at epochs 70 and 100, where both experiments show similar train and validation losses, indicating that both learn well from the data. The validation losses are generally higher than the train losses, which is expected, as the model is optimized to minimize the training loss.

Table 6 also provides the train and validation accuracy values from Experiment 1 and Experiment 2 for epochs from 20 to 100, and Figure 8 shows the train and validation accuracy for both experiments across epochs; again, Experiment 1 uses a different number of images per class and Experiment 2 an equal number.

Finally, Table 6 presents the training loss and training accuracy values obtained from Experiment 1 and Experiment 2 for epochs from 20 to 100, and Figure 9 plots both quantities. In Experiment 1, the training loss decreases over the epochs from 0.6973 to 0.1879, while the training accuracy increases from 0.6852 to 0.9266. In Experiment 2, the training loss decreases from 0.7149 to 0.2262, while the training accuracy increases from 0.673 to 0.9063. Overall, both experiments improve in train loss and train accuracy as the epochs increase, with Experiment 1 showing slightly better performance.

2) A COMPARISON BETWEEN EXPERIMENT 3 AND EXPERIMENT 4 OF THE TRAINING STAGE
Table 7 compares the performance metrics of Experiments 3 and 4, which are described below. Table 7 presents the train and validation loss values from Experiment 3 and Experiment 4 for epochs from 20 to 100, and Figure 10 illustrates the train and validation loss for both experiments across epochs.
As shown in Figure 10, in Experiment 3 the number of selected images differs across classes and we used the original plus augmented data from IARC, while Experiment 4 used an equal number of images in each class.

Both experiments show a decreasing trend in training loss as the number of epochs increases, indicating that the models are learning and improving with more training. Experiment 3 has a lower training loss than Experiment 4 at every epoch, indicating that it performs better on the training data. The validation loss for Experiment 3 decreases significantly from epoch 20 to epoch 70 but then increases slightly at epoch 100, suggesting that the model may be starting to overfit the training data and not generalizing as well to the validation data. The validation loss for Experiment 4 is higher than for Experiment 3 at every epoch; however, it does show a decreasing trend, which suggests that it is learning to some extent. Overall, Experiment 3 performs better than Experiment 4 because the number of images in each class is larger than in Experiment 4.

Table 7 provides the train and validation accuracy values from Experiment 3 and Experiment 4 for epochs from 20 to 100, and Figure 11 shows the train and validation accuracy for both experiments across epochs. For Experiment 3, the training accuracy starts at 0.8613 at epoch 20 and increases to 0.9967 by epoch 100; the validation accuracy starts at 0.7891 at epoch 20, increases to 0.9961 at epoch 70, but then drops slightly to 0.9883 by epoch 100. For Experiment 4, the training accuracy starts at 0.5787 at epoch 20 and increases to 0.9783 by epoch 100, while the validation accuracy starts at 0.5938 at epoch 20 and increases to 0.8812 by epoch 100. Experiment 3 clearly performs better than Experiment 4 in both training and validation accuracy. Both experiments show increasing accuracy as the epochs increase, but Experiment 4 appears less stable than Experiment 3, as its validation accuracy fluctuates more over time.

Table 7 also presents the training loss and training accuracy values obtained from Experiment 3 and Experiment 4 at epochs 20, 50, 70, and 100, and Figure 12 plots both quantities. For both experiments, the training loss decreases with each additional epoch, indicating that the model keeps improving its predictions on the training data. Experiment 3 has a significantly lower train loss than Experiment 4 at all epochs, and correspondingly higher train accuracy values, indicating that the model in Experiment 3 predicts the correct output on the training data better than the model in Experiment 4. As for the effect of the epoch count, both train loss and train accuracy generally improve with more epochs in both experiments.

3) A COMPARISON BETWEEN EXPERIMENT 1 AND EXPERIMENT 2 OF THE TEST STAGE
Evaluation parameters of the classification performance, including accuracy, the area under the curve, specificity, and sensitivity of the fine-tuned CNN across the different epochs, are reported in Table 8 for Experiments 1 and 2.

According to Table 8, which compares the confusion matrices of Experiment 1 and Experiment 2, the Type 1 accuracy starts at 36% in Experiment 1 at epoch 20, gradually increases over the epochs to 79% in Experiment 2 at epoch 70, and remains constant in Experiment 2 at epoch 100. The Type 2 accuracy starts at 64% in Experiment 1 at epoch 20, gradually increases to 81% in Experiment 2 at epoch 70, and then slightly decreases to 72% in Experiment 2 at epoch 100. The Type 3 accuracy starts at 69% in Experiment 1 at epoch 20, gradually increases to 84% in Experiment 2 at epoch 70, and remains constant in Experiment 2 at epoch 100. Overall, the accuracy for all three types gradually increases over the epochs, with Type 3 achieving the highest accuracy in both experiments, although the Type 2 accuracy decreases slightly in Experiment 2 at epoch 100.

Table 8 also presents the overall accuracy values from Experiment 1 and Experiment 2 for epochs from 20 to 100. For Experiment 1, the accuracy increases through the epochs, from 0.6083 at epoch 20 to 0.8142 at epoch 100, while the ROC AUC increases from 0.7979 to 0.9183. The test loss remains relatively stable throughout, ranging from 0.7785 to 0.7921. Experiment 2's accuracy likewise increases through the epochs, from 0.649 at epoch 20 to 0.8321 at epoch 100, and its ROC AUC follows a similar trend, rising from 0.8041 at epoch 20 to 0.9727 at epoch 100. Its test loss, however, fluctuates across the epochs, with the lowest value of 0.7656 at epoch 70. Experiment 2 consistently outperforms Experiment 1 in test accuracy, with higher values in all epochs, and both experiments improve in accuracy over time.

Table 8 further reports the sensitivity and specificity values from Experiment 1 and Experiment 2 for epochs from 20 to 100. Sensitivity measures the proportion of actual positive cases correctly identified, while specificity measures the proportion of actual negative cases correctly identified. For Experiment 1, the sensitivity gradually improves from epoch 20 to epoch 70 but then drops at epoch 100, and the specificity is relatively stable from epoch 20 to epoch 70 before also dropping at epoch 100. For Experiment 2, the sensitivity gradually improves from epoch 20 to epoch 100, and the specificity improves as well, though not as consistently. Overall, Experiment 2 has better sensitivity than Experiment 1, while Experiment 1 has better specificity, and the two experiments reach their highest sensitivity and specificity at different epochs.
Comparing the F1-scores of Experiment 1 and Experiment 2, Experiment 2 leads in the early epochs. At epoch 20, Experiment 2 has a higher F1-score than Experiment 1, which scores 0.7106. At epoch 50, Experiment 2 reaches 0.7903, while Experiment 1 improves to 0.7637. At epoch 70, however, Experiment 1 surpasses Experiment 2, obtaining an F1-score of 0.8892 versus 0.8456, and it maintains a superior F1-score of 0.8645 at epoch 100, while Experiment 2 falls to 0.7703.
Analyzing the Jaccard index reveals a similar trend, with Experiment 2 ahead in the early epochs. Experiment 2's Jaccard index is 0.6505 at epoch 20, versus 0.5366 for Experiment 1. By epoch 50, Experiment 2 reaches 0.6999 while Experiment 1 improves to 0.6291, widening the gap. At epoch 70, Experiment 2's Jaccard index of 0.8119 still exceeds Experiment 1's 0.7998, but at epoch 100, Experiment 1 maintains a higher Jaccard index of 0.7616 than Experiment 2, which falls to 0.7437. In general, Experiment 2 outperforms Experiment 1 in the F1-score and the Jaccard index during the earlier epochs, while Experiment 1 takes the lead at epoch 100 in both metrics and at epoch 70 in the F1-score.

4) A COMPARISON BETWEEN EXPERIMENT 3 AND EXPERIMENT 4 OF THE TEST STAGE
Evaluation parameters of the classification performance, including accuracy, specificity, and sensitivity of the fine-tuned CNN across the different epochs, are reported in Table 9 for Experiments 3 and 4.
Test accuracy measures the percentage of correct predictions the model makes on the test dataset. Experiment 3 achieves higher accuracy than Experiment 4 across all epochs. Experiment 3 starts at 0.7875 in epoch 20, improves significantly to 0.9451 in epoch 50, and reaches a high accuracy of 0.9817, which it maintains through epoch 100. Experiment 4, on the other hand, starts at a lower 0.6029 in epoch 20 and improves to 0.7647 in epoch 50, but it fails to improve much further, peaking at 0.7941 in epoch 70 and dropping to 0.75 in epoch 100.
Test loss measures the error between the predicted and actual values on the test dataset. Experiment 3 achieves a lower test loss than Experiment 4 across all epochs, indicating better predictive performance. Experiment 3 starts at a test loss of 0.7968 in epoch 20, improves significantly to 0.2392 in epoch 50, and reaches a low of 0.0599 in epoch 100. Experiment 4 starts at a higher test loss of 0.9591 in epoch 20 and improves to 0.5783 in epoch 50, but it fails to improve further, rising to 0.6636 in epoch 70 and 0.6974 in epoch 100. Experiment 3 thus outperforms Experiment 4 in all performance metrics: it has higher accuracy and shows a significant improvement in every metric as the number of epochs increases, indicating that the model keeps learning and improving over time.

Table 9 also reports the sensitivity and specificity values from Experiment 3 and Experiment 4 for epochs from 20 to 100. For Experiment 3, the sensitivity increases with the number of epochs, reaching 0.9816 at epoch 70 and staying there at epoch 100, indicating that the model becomes more accurate at detecting positive cases as it continues to learn; its specificity is consistently perfect (1) across all epochs, indicating that the model is also highly accurate at detecting negative cases. For Experiment 4, the sensitivity also increases with the number of epochs, reaching 0.7941 at epoch 70 before dropping slightly to 0.75 at epoch 100, and its specificity is high, starting at 0.9333 at epoch 20 and increasing to 1. Experiment 3 consistently outperforms Experiment 4 across all epochs in both the F1-score and the Jaccard index: Experiment 3 obtains higher scores and improves steadily over time, whereas Experiment 4 also improves but remains at a lower performance level.

5) COMPARATIVE ANALYSIS
To benchmark the solution proposed in this study, this section compares the proposed model with existing solutions. Table 10 provides the results of other works, and Table 11 summarizes our results. Using these two tables, the remainder of this section presents a discussion based on the conducted comparison.
Comparing the published data with the data obtained in this research, the accuracy of Case A is better than that of all the experiments except Experiment 3, which has 8.34% higher accuracy than Case A. As for sensitivity, Case A has a higher value than Experiments 1, 2, and 4, while Experiment 3 has 0.1082 higher sensitivity than Case A, which is 12.38% higher. In terms of specificity, all the experiments except Experiment 2 have a higher value than Case A: Experiments 3 and 4 achieve perfect specificity, and Experiment 1 is 2.31% higher than Case A [33].
In Case B, the multiple-images method has a lower accuracy than the method used in this research: in all the experiments, the obtained accuracy is higher than that of the Case B MLR method, with the smallest and largest differences being 8.67% (Experiment 1) and 18.31% (Experiment 3), respectively. As for the Case B VGG method, only Experiment 1 has 0.29% lower accuracy; the other experiments have higher accuracy, with Experiment 3 being 8.55% higher, the largest margin among the results obtained in this research [25].
Comparing Case C with the results of this research, Case C has higher accuracy than Experiments 1, 2, and 4, while Experiment 3 is 7.88% higher. As for sensitivity, Experiment 3 is 10.29% higher than Case C, while the other experiments have lower sensitivity. In terms of AUC, Case C is higher than Experiment 1 and slightly lower than Experiments 2 and 4, while Experiment 3 is 3.07% higher than Case C [34].
Case D includes two methods, CYENET and VGG19. The CYENET method has lower accuracy than Experiment 1, while the other experiments have lower accuracy than this method; Experiment 1 is 6.36% more accurate than the Case D CYENET method, and the VGG19 method has much lower accuracy than all the experiments. As for sensitivity, the CYENET method is more sensitive than all the experiments except Experiment 3, which is 5.86% higher, while the VGG19 method has much lower sensitivity than all the experiments. The specificity obtained in this research is much higher than that of the VGG19 method, while the CYENET model has 0.32% and 7.21% higher specificity than Experiments 1 and 2, respectively; Experiments 3 and 4, however, both have 3.95% higher specificity than the Case D CYENET method [35].
VGG-16 has some advantages. In terms of computational efficiency, VGG-16 has fewer layers than VGG-19, resulting in a reduced computational burden during training and inference. Its architecture is comparatively simple, making it easier to comprehend, implement, and interpret, and it has been extensively used and evaluated in various image classification tasks, demonstrating robust generalization capabilities. On the other hand, VGG-16 has some disadvantages: as a result of its comparatively limited depth, its architecture may be unable to capture highly complex and abstract features, and its still-large number of parameters makes it susceptible to overfitting, particularly when the dataset is small. For Case D, the advantages of VGG-19 are that its deeper architecture enables it to extract more complex and abstract features from the input images, which may enhance classification performance, and that, with more layers, it can provide a richer representation of the input images, allowing it to learn more discriminative features. Its disadvantages are that the additional layers and parameters increase the computational requirements during training and inference and that, like VGG-16, it may be susceptible to overfitting, especially when the dataset is limited or when training convergence becomes difficult. We chose the VGG-16 model for our research because of its balance between computational efficiency and performance; it produced satisfactory accuracy, sensitivity, and specificity for classifying cervical cancer using colposcopy images.
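To make the trade-off concrete, the sketch below shows a typical VGG-16 transfer-learning setup of the kind described here. It is a minimal sketch, assuming TensorFlow/Keras and ImageNet weights; the four-class head, layer sizes, and dropout rate are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a VGG-16-based classifier head, assuming
# TensorFlow/Keras; the head layers are illustrative, not the
# paper's exact model.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 4            # e.g., one class per cervical disease stage
IMG_SHAPE = (224, 224, 3)  # VGG-16's standard input size

# Load the VGG-16 convolutional base pre-trained on ImageNet and
# freeze it so only the new classification head is trained.
base = VGG16(weights="imagenet", include_top=False, input_shape=IMG_SHAPE)
base.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),   # mitigates the overfitting risk noted above
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Freezing the base keeps the trainable parameter count small, which is one way the overfitting risk discussed above can be contained on a limited colposcopy dataset.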
Case E has lower accuracy than all the experiments: it is 10.24%, 12.17%, 25.55%, and 7.97% lower than Experiments 1, 2, 3, and 4, respectively. The sensitivities of Case E and Experiments 1 and 2 are close, although Case E's is slightly lower; Experiment 3's sensitivity is 46.96% higher than Case E's [10]. Experiment 3 has the highest accuracy, 0.9817, signifying the best overall classification performance among the experiments. Case F has lower accuracy than any of our experiments, and while its sensitivity is comparable to that of the other experiments, Experiments 1 through 4 all have higher specificity than Case F. Experiment 3 consistently demonstrates high sensitivity and specificity across multiple classes and exhibited the highest precision, sensitivity, and specificity levels [33].

VII. CONCLUSION
A comparison of the training and test stages of Experiments 1 and 2 shows that Experiment 1 has better train and test accuracy and also higher performance on Type 1, Type 2, and Type 3, because the amount of data considered is greater than in Experiment 2. A comparison of the training and test stages of Experiments 3 and 4 reveals that Experiment 3 has superior train and test accuracy, as well as greater performance on Type 0, Type 2, and Type 8, since more data were evaluated in Experiment 3 than in Experiment 4. Experiment 1 appears to perform better than Experiment 2 in terms of validation loss, particularly at higher epochs, indicating that the model from Experiment 1 generalizes better to new data. Experiment 2 exhibits lower training loss than Experiment 1 at Epochs 20 and 50, but this advantage fades at later epochs (70 and 100). Experiments 3 and 4 both show a decreasing trend in training loss as the number of epochs increases, demonstrating that the models keep learning and improving as training proceeds. At every epoch, Experiment 3 shows a smaller training loss than Experiment 4, indicating that it performs better on the training data. Experiment 3's validation loss falls dramatically from Epoch 20 to Epoch 70 but then rises marginally at Epoch 100.
Overall, it appears that Experiment 3 performs better than Experiment 4, since Experiment 3 has more data in each class.
Our dataset encompasses a diverse range of cases and conditions across four experiments, with different numbers of classes and different samples in each experiment, allowing us to explore the robustness of the proposed method. The evaluation metrics used in this study, namely accuracy, recall, F1-score, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC-ROC), together with the results we achieved, demonstrate the robustness of our work.
Additionally, we conducted a comparative analysis between our proposed model and existing approaches or alternative algorithms. Through this comparison, we aimed to assess the performance of our model in relation to others in the field. Our findings consistently indicate that our work exhibits superior performance and can serve as a valuable representative in practical applications, particularly in the context of healthcare and the detection of cervical diseases.
A significant limitation of this project is the number of available colposcopy images in each class of the IARC dataset; collecting more original colposcopy images may lead to a better training process and improve the performance of the proposed model. As a future direction of this research, the performance of the proposed model can be enhanced by integrating risk factors such as basic demographic information, diagnosis history, contraceptive use, and human papillomavirus (HPV) infection. In future work, techniques such as transfer learning, Faster R-CNN, VGG-19, and Inception v3 may lead to better performance in detecting cervical diseases using colposcopy images.
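As a hedged illustration of this future direction, swapping the convolutional backbone is a small change under the same training pipeline. The sketch below mirrors the VGG-16 head shown earlier but with an Inception v3 base; it assumes TensorFlow/Keras, and the head layers and class count are illustrative, not a reported configuration.

```python
# Minimal sketch of the backbone swap suggested as future work: the
# same kind of classification head as before, but on an Inception v3
# base. Assumes TensorFlow/Keras; layer choices are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

NUM_CLASSES = 4
IMG_SHAPE = (299, 299, 3)  # Inception v3's standard input size

base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=IMG_SHAPE)
base.trainable = False     # transfer learning: train the head only

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```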
NINA YOUNESZADE received the degree in urban development engineering from a university in Iran and the Master of Computer Science degree from the School of Computer Science, Taylor's University, Malaysia. Her current research interests include artificial intelligence, machine learning, image processing, big data, data analytics, data science, and deep learning.
MOHSEN MARJANI received the Ph.D. degree in computer science from the University of Malaya (UM) and the master's degree in information technology (multimedia computing) from Multimedia University (MMU), Malaysia. Since 1999, he has been teaching mathematics and IT-based subjects in different public and private institutions, including universities, pre-universities, high schools, and primary schools. He has experience collaborating with and working for several IT companies as an IT Consultant, a Senior Web and Mobile Developer, a Tech Lead, a Project Manager, and a CTO. He is currently a Senior Lecturer with the School of Computer Science, Taylor's University, Malaysia. He has published multiple research articles in refereed international journals. His research interests include big data, the IoT, data analytics, AI, machine learning, deep learning, and image processing.
SAYAN KUMAR RAY received the bachelor's degree in computer science and engineering from Gulbarga University, India, the M.Tech. degree in computer science and engineering from the University of Calcutta, India, and the Ph.D. degree in computer science from the University of Canterbury, New Zealand. He was a Design Engineer with Tait Communications, New Zealand, where he was involved in research on the LTE and 3GPP evolved packet core network. He is currently an Associate Professor with the School of Computer Science, Taylor's University, Malaysia, where he is also the Head of the School, Faculty of Innovation and Technology. His research interests include mobility and handover, 5G networks, LTE-Advanced/LTE-Unlicensed, spectrum sharing, disaster management networks, and the Internet of Things.