EARLY PREDICTION OF CERVICAL CANCER USING MACHINE LEARNING TECHNIQUES

An investigation of the problem of imbalanced class distribution, which is common in medical datasets, is also conducted. The results revealed that the LWNB and RandomForest classifiers showed the best overall performance across four different evaluation metrics. In addition, the LWNB and Logistic classifiers were the best choices for handling the problem of imbalanced class distribution, which is common in medical-diagnosis tasks. The final conclusion is that an ensemble model consisting of several classifiers, such as LWNB, RandomForest and Logistic, is the best solution for this type of problem.


INTRODUCTION
According to the Jordanian Ministry of Health (MoH) statistics, cancerous diseases are the second leading cause of death in Jordan. Globally, over the last century, nations have invested huge efforts in building a strong understanding of the pathophysiology, genetic changes and clinical presentation of different cancers, and in recruiting this knowledge to develop new methods of treatment, new screening methods and improved prognoses among cancer patients [1].
Cervical Cancer (CC) is a gynaecological malignancy that occurs mainly in middle-aged women, due to the unregulated division of cells in the cervical mucosa of the female reproductive system. Patients usually present at the clinic with chief complaints of vaginal bleeding and abnormal vaginal discharge [2].
CC almost exclusively develops in cervical cells with a pre-existing human papilloma virus infection, which induces dysplasia (premalignant abnormal cell growth) that remains latent, with no symptoms, for decades before developing into definite CC [3].
Human Papilloma Virus (HPV) causes a sexually transmitted infection that occurs mainly in individuals who have multiple sexual partners and who have not been vaccinated against carcinogenic HPV types [3]. Early detection of this dysplasia, before it develops into cancer, is the cornerstone of the fight against CC [4].
Although CC-related incidence and deaths have dramatically decreased in developed countries thanks to huge improvements in screening procedures [4], CC remains a huge challenge, especially for developing countries. It is the most deadly type of cancer in women in developing countries, which cannot overcome the problem of lacking a sufficient number of health-care professionals well trained in implementing screening for high-risk populations [4]. This signifies the importance of developing a computerized screening test using artificial-intelligence and machine-learning strategies [5].
Therefore, this paper aims to achieve the following main objectives:
1. To identify the most relevant and significant features that facilitate the early prediction of CC.
2. To determine, among a large number of classifiers belonging to different learning strategies and using several evaluation metrics, the best classifier for classifying and predicting the existence of CC.
3. To determine the best classifier for handling the problem of imbalanced class distribution, which is common in the medical-diagnostic field.
The main motivation for this research is to determine the best classification algorithms to use when attempting to predict CC, and hence to utilize these algorithms in designing and programming a tool that automates the prediction of CC.
The rest of this paper is organized as follows: Section 2 surveys the related work dedicated to the prediction of CC. Section 3 describes the main steps of the conducted research and discusses the results obtained. Section 4 concludes the paper and lists out some future research-work horizons.

RELATED WORK
Classification is one of the main supervised learning tasks in machine learning. This task aims to accurately predict the class label of an unseen instance [6]. In general, classification is divided into two main types: Single Label Classification (SLC) and Multi Label Classification (MLC) [7]. The former requires each instance or example in the dataset to be linked to exactly one class label. Therefore, class labels in SLC are always mutually exclusive [7].
The latter allows instances in the dataset to be linked or associated with one class label or more. Hence, class labels in MLC are not mutually exclusive and have some kind of correlation among them, since they share the same values of features [8].
Moreover, SLC is divided into two sub-types: Binary Classification (BC) and Multi Class Classification (MCC). The former considers datasets with two class labels only, while the latter considers datasets with more than two class labels [9]- [10].
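The distinction between these classification types can be illustrated with a minimal sketch using scikit-learn. The label sequences below are toy data for illustration only, not from this paper's dataset:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Binary Classification (BC): exactly two mutually exclusive class labels.
y_binary = np.array([0, 1, 1, 0])

# Multi Class Classification (MCC): one label per instance, more than two
# classes, e.g. the three mutually exclusive classes used later in this paper.
y_multiclass = np.array(["Normal", "LSIL", "HSIL", "Normal"])

# Multi Label Classification (MLC): an instance may carry several labels at
# once, so labels are encoded as a binary indicator matrix rather than a
# single column; the labels "a", "b", "c" here are purely illustrative.
y_multilabel = [{"a"}, {"a", "b"}, {"b", "c"}, set()]
indicator = MultiLabelBinarizer().fit_transform(y_multilabel)
# indicator has one row per instance and one column per distinct label.
```

Note how the MLC indicator matrix allows correlated, non-exclusive labels, whereas the SLC representations force a single label per row.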
Classification as a machine-learning task has been utilized in several research papers related to CC. In [11], an attempt was made to combine conventional diagnostic procedures and tests with machine learning to predict abnormal cells early, which greatly increases the chance of a complete cure of CC. The paper considered a large number of Pap-smear test images, on which deep-learning techniques were trained. The final proposed model was capable of predicting abnormal cells related to CC with an accuracy of only 74.04%.
Ilyas and Ahmad (2021) [12] attempted to increase the accuracy of predicting CC by depending on an ensemble model. Therefore, eight different classifiers from different learning approaches have been utilized in predicting CC. Their study showed the significance of depending on several classifiers compared to depending on only one classifier when attempting to predict CC. This study could be improved by considering more classifiers and more learning strategies.
In [13], an ant-colony optimization algorithm has been proposed. The proposed algorithm has been trained on a dataset collected by the University of California. Support Vector Machine (SVM) has been used as the base classifier and showed a good performance (accuracy = 95.45%) compared with other algorithms which have been trained on the same dataset. The proposed algorithm has been evaluated using only one evaluation metric (accuracy). Also, the proposed algorithm should be evaluated against a larger number of algorithms.
A recent study that aimed to predict CC using MRI images was conducted in [14]. Two main objectives were achieved in this research. The first was to propose an automatic system for the early prediction of CC using image-processing techniques. The second was to enhance the performance of pre-trained Deep Convolutional Neural Networks (DCNNs) using Transfer Learning (TL). In this paper, five classifiers were used to classify the input-image dataset into two class labels: benign or malignant. Also, five evaluation metrics were used in the evaluation phase of the five considered classifiers. Finally, according to the evaluation results, the RandomForest (RF) classifier showed a better performance than the other four classifiers.
Another research effort that utilized machine-learning techniques in the early prediction of CC can be found in [15]. This research utilized the high capabilities of machine-learning techniques in the feature-selection step and the classification step. Unfortunately, the best evaluation result of the proposed algorithm was very low (the best Area Under the Curve (AUC) score was less than 0.69) compared with other state-of-the-art algorithms.
A data-driven CC prediction model has been proposed in [16]. The proposed model not only aimed to predict CC, but also considered the problems of outliers and over-sampling. The prediction model considered only RF as a classifier. The model has been deployed through a mobile application that collects significant features related to CC and uses them in the CC prediction step. The evaluation phase of the proposed model considered several evaluation metrics, such as accuracy, precision, recall and F1-score. According to the authors themselves, one of the main shortcomings of this model is its slow performance and the need for high memory while running the mobile-application software.
An ensemble model that combines the results of three different machine-learning algorithms to predict CC from Pap-smear tests was proposed in [17]. The proposed model managed to predict CC using K-Nearest Neighbor (KNN), Support Vector Machine (SVM) and Multi-layer Perceptron (MLP) with a high accuracy rate (accuracy = 97.83%). The research concluded that machine learning has great potential to predict CC highly accurately. One of the main limitations of this research was relying only on the accuracy metric in the evaluation step while ignoring other significant evaluation metrics, such as precision, recall and F1-score.
In [18], an empirical analysis was performed to determine the best classification algorithm among three classification algorithms. The paper considered the Naïve Bayes (NB), Iterative Dichotomiser 3 (ID3) and C4.5 classifiers. The analysis was carried out using only one dataset and only accuracy as an evaluation metric. The paper concluded that NB outperformed the two other classifiers, with an accuracy of 81%.
In [19], a research model consisting of several main phases has been proposed, including data pre-processing, predictive-model selection and pseudo-code. Also, several classifiers, such as KNN, Random Forest, SVM and Logistic Regression (LR), have been evaluated using three evaluation metrics. The research concluded that Random Forest, Decision Tree and several other classifiers are significant in the CC prediction phase.

METHODOLOGY, RESULTS AND ANALYSIS
In this section, a comprehensive description regarding the methodology, results and analysis is presented. Firstly, in Section A, the research methodology is presented. Secondly, in Section B, the dataset is described. Thirdly, in Section C, the steps of feature selection and ranking are introduced. Finally, in Section D, the classifiers and evaluation metrics considered are introduced with the results obtained and their analysis.

A. Research Methodology
The methodology of this research is illustrated in Figure 1 (research-methodology main steps). As can be seen from Figure 1, the methodology consists of seven main steps. The first main step is the collection of data from hospitals and several specialized medical centres. Then, the segmentation process is performed, as explained in Section B. After that, several related features are extracted from the collected images, as illustrated in Section C. The next step constructs a single-label dataset based on the data collected in the previous steps. Then, three different feature-ranking techniques are applied to the dataset. The final steps classify the data, obtain the results and identify the best classifiers among eighteen different classifiers based on several evaluation metrics, as extensively discussed in Section D. More information regarding these main steps can be found in the following sub-sections.

B. Dataset Description
One dataset has been considered in this research, constructed through several steps. Firstly, 500 images were collected from different hospitals and specialized medical centres in Jordan. All images were captured using an automatic glass-capturing system designed specifically for this purpose, consisting of three main components: a high-resolution digital camera, a high-quality digital microscope and a personal computer. All images were captured at 100X and 400X magnification, as recommended by both pathologists and cytologists. Each image was labelled as Normal, Low-grade Squamous Intra-epithelial Lesion (LSIL) or High-grade Squamous Intra-epithelial Lesion (HSIL) by three domain experts, and the final class of the image was determined by majority vote. Figure 2 depicts a sample of the captured images: Figure 2.a represents the "Normal" class, Figure 2.b the "HSIL" class and Figure 2.c the "LSIL" class.

Secondly, a segmentation process was applied to the collected images using the Adaptive Fuzzy Moving K-means (AFMKM) clustering algorithm [20]. The main goal of applying AFMKM is to differentiate the three main parts of the CC cell image: nucleus, cytoplasm and background.

Thirdly, nine features were extracted from each CC cell image for both the nucleus and the cytoplasm parts. These features are: size, grey level, perimeter, red, green, blue, intensity1, intensity2 and saturation. Intensity1 and saturation were computed using Equations (1) and (3), respectively [21], while intensity2 was computed using Equation (2) [22].
Therefore, in total, the constructed dataset consists of eighteen features and five hundred instances. Each instance has been assigned exactly one of three class labels: Normal, LSIL or HSIL.
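Since Equations (1)-(3) are not reproduced here, the colour-feature computation for one segmented region can be sketched using the standard RGB-to-HSI conversions as a stand-in for the paper's exact formulas. The function name, the luma grey-level formula and the toy pixel values below are illustrative assumptions:

```python
import numpy as np

def region_colour_features(pixels):
    """Mean colour features for one segmented region (nucleus or cytoplasm).

    `pixels` is an (n, 3) array of RGB values in [0, 255]. The intensity and
    saturation below follow the standard RGB-to-HSI conversion, standing in
    for the paper's Equations (1)-(3), which are not reproduced here.
    """
    r, g, b = pixels[:, 0].mean(), pixels[:, 1].mean(), pixels[:, 2].mean()
    intensity = (r + g + b) / 3.0                        # standard HSI intensity
    saturation = 1.0 - 3.0 * min(r, g, b) / (r + g + b)  # standard HSI saturation
    grey = 0.299 * r + 0.587 * g + 0.114 * b             # luma grey level
    return {"red": r, "green": g, "blue": b,
            "intensity": intensity, "saturation": saturation, "grey": grey}

# Toy region of two pixels, purely for illustration.
features = region_colour_features(np.array([[200, 120, 80], [180, 110, 70]]))
```

Running the same extraction on the nucleus and the cytoplasm of each image yields the two sets of nine features (eighteen in total) described above.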
It is worth mentioning that the frequencies of the three class labels Normal, LSIL and HSIL were 376, 79 and 45, respectively. Hence, as with most medical-diagnostic datasets, the dataset considered in this research suffers from the problem of imbalanced class distribution. This fact should therefore be carefully considered when attempting to identify the best classifier for such data.

C. Feature Selection and Ranking Step
One of the main objectives of this research is to identify the best classifier to handle the CC dataset when using all features, 75% of the features and 50% of the features. Therefore, the step of feature selection and ranking is crucial to this research.
Three different techniques have been used to rank the features. These techniques are InfoGainAttributeEval [23], ClassifierAttributeEval [23] and GainRatioAttributeEval [23]. All these techniques have been trained on the considered dataset using WEKA [23]. WEKA is short for Waikato Environment for Knowledge Analysis. WEKA is an open-source software that is used widely in data analysis in the domains of data mining and machine learning.
Regarding InfoGainAttributeEval, this technique evaluates the worth of an attribute by measuring the information gain with respect to the class. The ClassifierAttributeEval technique evaluates the worth of an attribute by using a user-specified classifier. Finally, the GainRatioAttributeEval technique depends on the gain ratio to evaluate the worth of an attribute with respect to the considered class. More information regarding these attribute evaluators and other feature-ranking techniques can be found in [23]. Table 2 depicts the ranking of the features after applying the three previously mentioned ranking techniques to the considered dataset. Table 3 depicts the features of the dataset after ranking. Features have been ranked by the sum of their ranks under the three considered ranking techniques: the feature with the smallest sum is ranked first and the feature with the largest sum is ranked last. Based on Table 3, the classifiers considered in this research are trained on three versions of the CC dataset. The first version consists of all features (18 features). The second version consists of the best-ranked 75% of the features (12 features). The third version consists of the best-ranked 50% of the features (9 features). The considered classifiers are evaluated based on their performance on the three versions, using several evaluation metrics.
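The rank-aggregation step can be sketched as follows. This is only an analogue of the paper's WEKA workflow: scikit-learn's mutual-information and ANOVA F-score rankers stand in for the WEKA attribute evaluators, the dataset is synthetic, and the `ranks` helper is an illustrative assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, f_classif

# Synthetic stand-in for the 18-feature CC dataset.
X, y = make_classification(n_samples=200, n_features=18, n_informative=6,
                           random_state=0)

def ranks(scores):
    """Convert per-feature scores into ranks (rank 1 = highest score)."""
    order = np.argsort(-scores)
    r = np.empty_like(order)
    r[order] = np.arange(1, len(scores) + 1)
    return r

# Sum the ranks produced by each ranking technique, as in the paper.
rank_sum = ranks(mutual_info_classif(X, y, random_state=0)) + ranks(f_classif(X, y)[0])

# The smallest rank sum wins; keep the best-ranked 75% (12 of 18 features).
best_12 = np.argsort(rank_sum)[:12]
X_75 = X[:, np.sort(best_12)]
```

The 50%-features version is obtained the same way by keeping the nine features with the smallest rank sums.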

D. Evaluation of the Considered Classifiers
The main objective of this research is the early prediction of CC using machine-learning techniques, as accurately as possible. Therefore, many classifiers should be considered to identify the best one. Hence, eighteen different classifiers have been considered and extensively evaluated. These eighteen classifiers belong to six well-known learning strategies.
The previously mentioned classifiers have been evaluated using four different evaluation metrics: Accuracy, Precision, Recall and F1-Measure (F1-Score), computed from their standard definitions. Table 4 depicts the evaluation results of the eighteen classifiers grouped by learning strategy and using the Accuracy metric. The evaluation considers all features, 75% of the features and 50% of the features, respectively. According to Table 4, RandomForest showed the best results considering all features, 12 features and 9 features. Logistic and MultiClassClassifier showed results identical to RandomForest when considering all features. Moreover, Trees as a learning strategy showed the best result with 12 and 9 features, while the Meta learning strategy showed the best performance when considering all features.
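For reference, the four metrics follow their standard definitions: Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F1 = 2 · Precision · Recall / (Precision + Recall). A minimal per-class sketch of these definitions, using toy label sequences rather than results from this paper:

```python
def prf(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy predictions over the three class labels used in this paper.
y_true = ["Normal", "Normal", "LSIL", "HSIL", "LSIL", "Normal"]
y_pred = ["Normal", "LSIL",  "LSIL", "HSIL", "Normal", "Normal"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = {c: prf(y_true, y_pred, c) for c in ("Normal", "LSIL", "HSIL")}
```

For a multi-class dataset such as this one, the per-class scores are typically averaged across the three classes to obtain a single Precision, Recall or F1 figure; the paper does not state which averaging scheme was used.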
It is worth mentioning that NaiveBayes and NaiveBayesUpdateable showed an identical performance on the three datasets (all features' dataset, 75% of the features' dataset and 50% of the features' dataset). Table 5 depicts the evaluation results of the eighteen classifiers grouped by learning strategy and using the Precision metric. The evaluation considers using all features, 75% of the features and 50% of the features. From Table 5, it can be clearly seen that LWNB classifier showed the best performance considering the Precision metric on the three considered cases (all features, 75% of the features, 50% of the features).
Considering learning strategies, Meta as a learning strategy showed the best performance in the three considered cases. Also, the Lazy learning strategy showed a result identical to the Meta learning strategy when considering 75% of the features. It is worth mentioning that NaiveBayes and NaiveBayesUpdateable showed an identical performance on the three datasets (all features' dataset, 75% of the features' dataset and 50% of the features' dataset). Table 6 depicts the evaluation results of the eighteen classifiers grouped by learning strategy and using the Recall metric. The evaluation considers using all features, 75% of the features and 50% of the features. Regarding the best learning strategy, as can be seen from Table 6, Trees showed the best performance on the datasets with 75% and 50% of the features, while the Meta learning strategy showed the best performance on the dataset with all features.
It is worth mentioning that NaiveBayes and NaiveBayesUpdateable showed an identical performance on the three datasets (all features' dataset, 75% of the features' dataset and 50% of the features' dataset). Table 7 depicts the evaluation results of the eighteen classifiers grouped by learning strategy and using the F1-Measure (F1-Score) metric. The evaluation considers using all features, 75% of the features and 50% of the features. According to Table 7, LWNB classifier has a superior constant performance compared with the other seventeen classifiers. LWNB achieved the best results on all features' dataset, 75% of the features' dataset and 50% of the features' dataset.
Considering the learning strategy, Meta as a learning strategy showed the best performance on the dataset with all features, the dataset with 75% of the features and the dataset with 50% of the features. Also, Lazy learning strategy showed the best performance on the dataset with 50% of the features.
It is worth mentioning that NaiveBayes and NaiveBayesUpdateable showed an identical performance on the three datasets (all features' dataset, 75% of the features' dataset and 50% of the features' dataset). Table 8 summarizes the results obtained from Table 4 to Table 7 by identifying the best classifier with respect to the considered metric and the number of features being used. According to Table 8, the LWNB classifier is the best classifier among all considered classifiers, achieving the best performance six times. The RandomForest classifier is the second-best classifier, since it achieved the best performance five times. The LWNB classifier is the optimal choice when there is a need to optimize the Precision and F1-Measure metrics, while the RandomForest classifier is the best choice when there is a need to optimize the Accuracy and Recall metrics. Moreover, Logistic and MultiClassClassifier showed an excellent performance when considering all features with the Accuracy and Recall metrics. Table 9 depicts the best learning strategy with respect to the considered evaluation metric and the number of features being used; it also summarizes the results from Table 4 to Table 7. From Table 9, it is clear that Meta is the dominant learning strategy, showing the best performance across the four evaluation metrics. The Trees learning strategy is the second best and the Lazy learning strategy the third best, according to Table 9.
In general, medical datasets like the one considered in this research usually suffer from the problem of imbalanced class distribution. For example, in the CC dataset, the dominant class is the "Normal" class, with a frequency of 376, while the frequency of the "LSIL" class is 79 and that of the "HSIL" class is 45, as mentioned previously. One of the main characteristics of the optimal classifier is the ability to handle the problem of imbalanced class distribution.
Therefore, it has been decided to evaluate the eighteen classifiers considered in this research based on how accurately they can predict the least frequent, but most significant, classes (LSIL and HSIL). The True Positive (TP) rate metric has been used to accomplish this task. The TP rate measures the proportion of instances of a positive class that the classifier predicts correctly. Table 10 depicts the evaluation results of the eighteen considered classifiers using the TP rate, grouped by learning strategy. It is worth mentioning that for the TP rate, the higher the value, the better the performance of the classifier. According to Table 10, the Logistic classifier is the best classifier for predicting the class label "HSIL", with a TP rate of 0.711, while LWNB is the best classifier for predicting the class label "LSIL", with a TP rate of 0.899.
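The per-class TP rate used here corresponds to the class-wise recall, i.e. the diagonal of the confusion matrix divided by the row sums. A small sketch with toy predictions (not the paper's actual results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["Normal", "LSIL", "HSIL"]

# Toy ground truth and predictions, purely for illustration.
y_true = ["Normal"] * 4 + ["LSIL"] * 3 + ["HSIL"] * 2
y_pred = ["Normal", "Normal", "Normal", "LSIL",
          "LSIL", "LSIL", "Normal",
          "HSIL", "Normal"]

# Row i of the confusion matrix counts the true instances of labels[i];
# the diagonal holds the correctly predicted instances of each class.
cm = confusion_matrix(y_true, y_pred, labels=labels)
tp_rate = np.diag(cm) / cm.sum(axis=1)  # one TP rate per class label
```

A classifier that always predicts "Normal" would score a TP rate of 1.0 on "Normal" but 0.0 on "LSIL" and "HSIL", which is exactly why this per-class view matters for imbalanced data.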
Considering the learning strategy, Trees is the most suitable learning strategy for predicting the class label "HSIL", while Meta is the most appropriate learning strategy for predicting the class label "LSIL".
Since no single classifier dominates in dealing with the problem of imbalanced class distribution, it is highly recommended to adopt an ensemble model to overcome this serious problem. Based on the results of this research, it is recommended to include the LWNB, RandomForest and Logistic classifiers in any future proposed ensemble models.
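A sketch of such an ensemble, assuming a scikit-learn-style soft-voting combination on synthetic data. GaussianNB stands in for WEKA's LWNB, which has no direct scikit-learn counterpart, so this is only an approximation of the recommended model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class, 18-feature stand-in for the CC dataset.
X, y = make_classification(n_samples=300, n_features=18, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Soft voting averages the class probabilities of the three base classifiers,
# so a class favoured strongly by one member can still win the vote.
ensemble = VotingClassifier(
    estimators=[("nb", GaussianNB()),                        # stand-in for LWNB
                ("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft")
ensemble.fit(X_tr, y_tr)
test_accuracy = ensemble.score(X_te, y_te)
```

In a production version, a locally weighted Naive Bayes implementation (as in WEKA) would replace GaussianNB, and the ensemble would be evaluated with the per-class TP rate as well as the four overall metrics.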

CONCLUSION AND FUTURE WORK
In this paper, a dataset consisting of 500 images related to CC has been collected from different hospitals and specialized medical centres. Also, eighteen different classifiers belonging to six learning strategies have been trained on the collected dataset and evaluated. The evaluation of the classifiers considered four evaluation metrics with respect to all features in the dataset, 75% of the features and 50% of the features. The results revealed that the LWNB classifier achieved the best performance in general, with RandomForest showing the second-best performance. Also, considering the learning strategy, the Meta learning strategy showed the best overall performance compared with the other five strategies. Moreover, the Logistic and LWNB classifiers are the best choices for dealing with the problem of imbalanced class distribution, which is very common in medical-diagnostic datasets. Based on the results of this research, the main recommendation for future work is to adopt an ensemble model consisting of the LWNB, RandomForest and Logistic classifiers to achieve high performance in the early prediction of CC.