An Efficient SMOTE-Based Machine Learning Classification for Prediction & Detection of PCOS

Polycystic ovary syndrome (PCOS) is a well-known metabolic and reproductive endocrine disorder that affects up to 18% of women of reproductive age. PCOS causes irregular menstrual periods, which can lead to numerous complications such as the formation of cysts (follicles) in one or both ovaries, infertility, disturbed androgen levels, increased BMI, and other abnormalities in the levels of LH, FSH, DHEAS, fasting insulin, and fasting blood sugar. Detecting and preventing this disease as early as possible has therefore attracted research attention. Research on classifying this condition with machine learning techniques has nevertheless been somewhat limited because of data imbalance. In this research we propose a novel prediction model that combines the Synthetic Minority Oversampling Technique (SMOTE) with five machine learning algorithms, namely Logistic Regression, Random Forest, Decision Tree, Support Vector Machine, and K-Nearest Neighbor (KNN), to automate PCOS detection at an early stage. The framework of the proposed PCOS classification comprises three layers. The first layer is responsible for handling missing values and oversampling the minority class in the dataset. In the second layer, the most significant attributes are selected using Principal Component Analysis (PCA). In the third layer, the proposed model is trained, and its performance is then evaluated in terms of classification accuracy, training time (TT), F1 score, recall (sensitivity), precision, and area under the ROC curve (AUROC). The proposed algorithm outperforms its counterparts on these measures. The best model achieves accuracy, training time, F1 score, recall, precision, and AUROC of 97.11%, 0.010 sec, 98%, 98%, 98%, and 95.6%, respectively.


Introduction
Polycystic Ovary Disease (PCOD), or Polycystic Ovary Syndrome (PCOS), is a hormonal imbalance disorder found in women of reproductive age. In recent years, its occurrence has risen steadily owing to unhealthy lifestyle routines and adulterated food. Ovaries are classified into three types: normal, cystic, and polycystic (Child et al., 2001). PCOS is a complex hormonal disorder affecting roughly 1 in every 10 women of reproductive age. It manifests during adolescence and arises as a result of hormonal disturbances (Witchel et al., 2015). Fluid-filled sacs, called follicles or cysts, are present at the periphery of the ovary. A polycystic ovary (PCO) can be characterized by twelve or more follicles with a diameter of 2-9 mm (Jarrett et al., 2020). PCOS affects both the health and the quality of life of women. Its symptoms include cardiovascular diseases, failure to ovulate and infertility, late menopause, type 2 diabetes, acne, baldness, hair loss, hirsutism, obesity, anxiety, depression, and stress (Khomami et al., 2015).
Early diagnosis and treatment can control the symptoms and prevent long-term problems. A doctor can detect PCOS through ultrasonography by counting and measuring the follicles in the ovaries. However, this process takes a long time and requires good image quality and high accuracy to detect the presence of PCOS (Balen et al., 2003). Another approach to PCOS detection relies on biochemical parameters, such as the examination of hormone levels.
Since hormone examination is very expensive, other clinical parameters such as body mass index (BMI), menstrual cycle length, etc. are taken into consideration for the detection of PCOS (Ranjzad et al., 2011).
Among the many problems around us, those relating to the reproductive health of women were chosen as our area of interest because of their significance in contemporary society. An exhaustive survey of studies on PCOS and of systems supporting its diagnosis was carried out. About 5-10% of Indian women of reproductive age are affected by the multifaceted endocrine disorder called Polycystic Ovary Syndrome (PCOS) (Denny et al., 2019). It is a leading cause of anovulatory infertility and increases the risk of insulin resistance, obesity, cardiovascular disease, and psychosocial disorders (Pauli et al., 2011). The symptoms of PCOS may vary from patient to patient.
Some of them are irregular menstrual periods, acne, being overweight, an increased tendency toward infertility, intense hair fall, balding at the front of the head, and increased growth of facial hair (Ding et al., 2017). Traditionally, PCOS is suspected when the number of follicles in an ovary exceeds 12 per unit area and is visible in a radiological scan (Kenigsberg et al., 2015). In this paper, our objective is to identify the best-performing machine learning classification algorithm after applying Principal Component Analysis (PCA) and the Synthetic Minority Oversampling Technique (SMOTE). The rest of the paper is organised as follows. Section 2 describes the framework proposed for classifying the given dataset. Section 3 presents and analyses the classification results obtained by applying the proposed methodology. Section 4 concludes with the findings of the proposed work.

Proposed Framework
The data are pre-processed to handle missing values and then standardized using the standard scaler technique. Subsequently, the most significant attributes within the dataset are selected using Principal Component Analysis (PCA). Various machine learning algorithms are then applied to the dataset, and their results are evaluated. The main aim was to pin down the algorithm that classifies the given dataset best. The following sections discuss the proposed framework and its components in detail.

Datasets
The dataset used is the UCI PCOS without fertility dataset. It consists of 541 records with 41 attributes. Most published research works have used a subset of 12 attributes. The main reason for selecting these attributes is that they were considered the most decisive when predicting PCOS for a particular patient. These attributes are specifically identified by the UCI repository in publishing the dataset. Sample data from the infertility PCOS dataset are shown in Figure 1. Attribute characteristics are shown in a whisker plot (Figure 2). These attributes, along with their associated values, show how individual attributes can be associated with PCOS. Among these 12 attributes, 11 are used to predict PCOS. One attribute, "Target", serves as the output variable, whose value determines the presence or absence of PCOS. Data pre-processing is carried out on the dataset before the classification models are applied. The main pre-processing steps were as follows: the data were normalized between 0 and 1 to improve model performance; missing values were replaced with the mean of the corresponding column; and the output class "target" was converted from multiclass to binary, where 1 represents the presence of PCOS and 0 represents its absence.
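The mean imputation and 0-1 normalization steps described in this section can be sketched as follows. This is a minimal plain-Python illustration rather than the code used in the study; the helper names `impute_mean` and `min_max_scale` and the sample column values are hypothetical.

```python
from typing import List, Optional


def impute_mean(column: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_scale(column: List[float]) -> List[float]:
    """Rescale a column linearly so that its values lie in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical BMI-like column with one missing value (assumed data).
bmi = [20.0, None, 30.0, 25.0]
bmi_imputed = impute_mean(bmi)          # the missing value becomes the column mean
bmi_scaled = min_max_scale(bmi_imputed)
print(bmi_scaled)
```

Imputing before scaling keeps the filled-in value inside the observed range, so the scaled column still spans exactly 0 to 1.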

Imbalanced datasets
The given dataset contains 178 instances of the positive class (1) and 363 instances of the negative class (0), so the classes are unevenly distributed. This uneven distribution is one of the major causes of reduced accuracy in classification models. The main reason is that most machine learning models cannot learn patterns for both the positive and the negative class effectively when their numbers in the dataset are imbalanced. Moreover, because the minority class, i.e., the positive class, has fewer instances, the results produced for this class often become unreliable. Most studies in the literature do not document the contribution of the minority class to the overall classification results. One of the key contributions of the proposed work is that the unbalanced nature of the given dataset is handled adequately via the SMOTE technique. Likewise, results for the majority and minority classes are recorded separately in order to examine each class's contribution to the overall prediction results.
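The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified plain-Python illustration and not the implementation used in the study; the function name `smote`, the choice of k = 5 neighbours, and the random two-dimensional features are assumptions. Only the class counts (178 positive, 363 negative) come from the text.

```python
import random
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority: List[Sequence[float]], n_new: int,
          k: int = 5, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic samples by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: euclidean(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data: class counts from the text, random 2-D features (assumed).
data_rng = random.Random(42)
positives = [[data_rng.random(), data_rng.random()] for _ in range(178)]
n_negative = 363
new_points = smote(positives, n_new=n_negative - len(positives))
balanced_positives = positives + new_points
print(len(balanced_positives))  # 363
```

In practice, oversampling is usually applied to the training portion only, so that synthetic points do not leak into the data used for evaluation.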

Tools & Techniques:
The experiments were run with Python 3.7.6 and Jupyter Notebook 6.0.3 on the Ubuntu operating system. As the dataset was comparatively small, the required machine learning algorithms were run on a machine powered by an Intel Core i3 6th Generation processor with 8 GB RAM and 500 GB HDD storage.

Parameters tuning:
One of the crucial aspects of classification problems in machine learning is how to tune the hyperparameters of the given classification model.

Result Analysis
The experiment is carried out on the given dataset, and the related results are obtained. Stratified K-fold validation is used for each experiment so that the results are free from bias. The main aim was to avoid any bias in the results, as feature engineering often leads to the neglect of some features that may have an impact on the overall prediction results. Furthermore, the feature engineering process is often found to be very expensive. Raw data, after some preprocessing, are fed into the machine learning algorithms. Afterward, the results are obtained and compared with the existing state-of-the-art systems.
There were two datasets: one was PCOS data with infertility and the other was PCOS data with fertility. The dataset was analyzed, the algorithms were applied accordingly, and the model was trained to obtain better accuracy.
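The stratified K-fold validation mentioned in this section can be sketched as follows. This is a minimal plain-Python illustration, not the study's code; the function name `stratified_k_fold` and the choice of K = 5 are assumptions, since the text does not state the number of folds. The class counts are those reported for the dataset.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_k_fold(labels: Sequence[int], k: int = 5,
                      seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs whose test folds keep
    the class proportions of `labels` as closely as possible."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i
                       for idx in fold)
        splits.append((train, test))
    return splits


# 178 positive and 363 negative labels, as reported for the dataset.
labels = [1] * 178 + [0] * 363
splits = stratified_k_fold(labels, k=5)
for train, test in splits:
    n_pos = sum(labels[i] for i in test)
    print(len(train), len(test), n_pos)
```

Because each class is dealt across the folds separately, every test fold holds roughly one fifth of the positive records, which keeps per-fold scores comparable on an imbalanced dataset.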

Performance Measure
One widespread misconception about assessing machine learning models is that every dataset can be evaluated with the same evaluation metrics regardless of its nature. Most machine learning models tend to be evaluated in terms of accuracy. This approach often proves misleading when dealing with an imbalanced dataset. For that reason, other standard evaluation metrics are used alongside accuracy. Precision, recall, F1 measure, and the ROC curve (Dutta et al., 2021) have been applied to evaluate the proposed work. Accuracy is the number of correct predictions divided by the total number of inputs. The confusion matrix is built by counting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Sensitivity and specificity are calculated as TP/(FN + TP) and TN/(FP + TN), respectively. The receiver operating characteristic (ROC) curve is another metric that is widely used to evaluate the classification performance of a given model.
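The measures discussed in this section can be computed from the four confusion-matrix counts, as sketched below. This is an illustrative plain-Python example; the function names and the sample labels are hypothetical. Specificity is computed here as TN/(TN + FP), its standard definition, and F1 as the harmonic mean of precision and recall.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, TN, FP and FN for binary labels (1 = PCOS, 0 = no PCOS)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["TP"] += 1
        elif truth == 0 and pred == 0:
            counts["TN"] += 1
        elif truth == 0 and pred == 1:
            counts["FP"] += 1
        else:  # truth == 1 and pred == 0, given binary labels
            counts["FN"] += 1
    return counts


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall (sensitivity), specificity and F1 score."""
    c = confusion_counts(y_true, y_pred)
    tp, tn, fp, fn = c["TP"], c["TN"], c["FP"], c["FN"]
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical labels: 3 positive and 5 negative cases (assumed data).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example accuracy is 0.75 while recall is only about 0.67, which illustrates why accuracy alone can hide weak performance on the minority class.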

Comparison with existing systems
The results of the proposed work are compared with those of other state-of-the-art existing systems so that the validity of the proposed work can be confirmed. The five proposed SMOTE-based machine learning models, built without feature engineering, are compared with two systems published in recent years for the given UCI dataset. Table 4 shows the benchmark performance of the proposed model against the two existing systems. The proposed model surpasses all existing systems, with the highest accuracy of 97.11% obtained by SMOTE-based Logistic Regression for the negative class. Furthermore, as noted earlier, most existing literature focuses only on evaluating results in terms of "accuracy", which may become a misleading metric when dealing with an imbalanced dataset.

Conclusion
Polycystic Ovary Syndrome (PCOS) is one of the most common endocrine disorders in women of reproductive age. It may result in infertility and anovulation. The diagnostic criteria include clinical and metabolic parameters that are biomarkers for the disease. For this research we used 541 records from the non fertility PCOS dataset in the UCI repository. The dataset contains 178 instances of the positive class (1) and 363 instances of the negative class (0). The uneven distribution of classes within the dataset is one of the major causes of reduced accuracy in classification models. Hence, in this research we developed a novel prediction model that combines the Synthetic Minority Oversampling Technique (SMOTE) with five machine learning algorithms, namely Logistic Regression, Random Forest, Decision Tree, Support Vector Machine, and K-Nearest Neighbor (KNN), to automate PCOS detection at an early stage with a higher degree of efficiency.
Among the five SMOTE-based algorithms, the proposed SMOTE-based Logistic Regression outperformed the others and was able to classify the dataset accurately without any considerable data pre-processing. Furthermore, in terms of execution time, SMOTE-based Random Forest took remarkably little time, at 0.10 seconds, and SMOTE-based SVM and KNN both achieved the maximum area under the ROC curve.
In future, similar disease prediction systems can be developed for other diseases such as heart disease, diabetes, or cancer. Moreover, IoT technology may be integrated with the proposed model so that patients' health parameters can be monitored automatically, supporting the development of an effective healthcare system.

Declarations
Compliance with Ethical Standards: Funding: The authors did not receive any funding for this research.
Conflict of Interest: The authors declare that they have no conflict of interest. Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.