A UTOMATED D IABETES D ISEASE P REDICTION S YSTEM B ASED ON R ISK F ACTORS A SSESSMENT : T AKING C HARGE OF Y OUR H EALTH

non-diabetic. In order to choose the relevant features and identify the most informative risk factors, we combined several feature-extraction methods, such as ANalysis of Variance (ANOVA) and Recursive Feature Elimination (RFE) and we used also the features proposed by the Pima Indian Diabetes Dataset (PIDD). The results of this study provided valuable information on the comparative performance of different machine-learning models in the prediction of diabetes, as well as on the importance of the selected characteristics.


INTRODUCTION
Artificial Intelligence (AI) has been integrated into the health field to improve the efficiency and quality of care [1].It aims to create systems that can rival or even surpass human intellectual capacities.Thus, AI offers new perspectives for the optimization of medical practices and the well-being of patients.Techno-logical advancements have enabled the use of AI algorithms in many medical applications, covering areas, such as diagnosis, therapeutic decision-making, disease prediction, medical robotics, …etc.AI continues to evolve and develop, opening new perspectives to improve medical practices and patient well-being.
Currently, the health sector is facing various challenges in the diagnosis and treatment of diabetes, especially with regard to the increasing prevalence of this disease, which affects a large number of people of all ages and from various ethnic backgrounds [2]- [3].According to statistics from the World Health Organization (WHO), approximately 463 million adults had diabetes in 2019, which corresponds to a global prevalence of 9.3%.In Algeria, the prevalence of diabetes continues to increase to reach 14.4% of the population between the ages of 18 and 69 years, which is equivalent to approximately 4 million people with diabetes in 2018 [4].These statistics highlight the magnitude of the problem of diabetes and accentuate the importance of finding effective solutions for early detection, prevention and management of this disease.
In this context, early detection of diabetes can be a challenge due to the complexity of risk factors and underlying patterns in health data.Traditional screening and diagnostic approaches may have limitations in terms of accuracy and efficiency.Two cases of misdiagnosis of diabetes can be identified.The first case involves false negatives, where an individual with symptoms or risk factors for diabetes is not diagnosed as having diabetes.The second case involves false positives, where a person is misdiagnosed as diabetic when he/she does not actually have the disease.Indeed, automated diabetes-prediction systems using artificial intelligence and machine or deep learning, can analyze large amounts of health data to identify early indicators of diabetes.Our problematic consists of the following questions:  What are the most important, influential and relevant Algerian risk factors to take into account to create our diabetes-prediction system?
 Which machine-learning algorithms should be used to obtain the best prediction results?
The objective of this paper is to develop a diabetes-prediction system using machine learning, to improve early detection and prediction accuracy.We seek to exploit machine-learning models, such as Support Vector Machine (SVM), Logistic Regression (LG), Decision Trees (DTs) and K-Nearest Neighbors (KNN).In terms of feature selection, we will use statistical methods, such as ANalysis of Variance (ANOVA), Recursive Feature Elimination (RFE) and comparison with the PIDD dataset to identify the most informative variables and reduce the dimensionality of the data.Data used in this study will include clinical measures, such as fasting blood sugar, Body Mass Index (BMI), blood pressure, cholesterol level, as well as information on the lifestyle and medical history of patients.
The article is organized as follows: Section 2 provides an overview of diabetes.Section 3 presents some related works and highlights the proposed contributions.In Section 4, our proposed approach is explained in detail.This section is followed by a discussion of the obtained experimental results in Section 5. Finally, Section 6 provides the conclusion of this paper as well as perspectives.

DIABETES
Diabetes is a chronic disease caused by genetic or acquired defects in the production of insulin by the pancreas or the fact that this insulin is not active enough [5]- [6].This defect leads to an increase in blood sugar.The particularity of this disease is that it often progresses silently, without developing symptoms [7]- [8].
Insulin is a hormone produced by the pancreas that regulates blood-sugar levels in the body [9].Our body converts the food that we eat into a form of sugar called glucose and insulin transports it from the bloodstream into cells where it can be used for energy [10].Insulin also helps store glucose in the liver and muscle tissue for later use.Without enough insulin, glucose builds up in the bloodstream, leading to high blood sugar, a condition known as diabetes.Insulin therapy is a common treatment for diabetes, helping regulate blood-sugar levels by increasing glucose uptake into cells [11]- [12].There are several types of diabetes, that are: type-1 diabetes, type-2 diabetes and gestational diabetes [13]- [14].There is no specific and exclusive cause for diabetes, but there are a set of contributing factors that can affect everyone, including [15]- [16]: • Genetics: A family history of diabetes may increase the risk of developing the disease, • Overweight and obesity: Excess weight, especially around the waist, can increase the risk of developing type-2 diabetes.• Lack of physical activity: A sedentary lifestyle can increase the risk of developing type-2 diabetes, • Unhealthy diet: Consuming a diet that is high in processed foods and added sugars may increase the risk of developing type-2 diabetes.• Aging: The risk of developing type-2 diabetes increases as people age, • Certain medical conditions: Such as polycystic ovary syndrome or a history of gestational diabetes, • Certain medications: Such as steroids, some antidepressants and antipsychotics can develop diabetes, • Pancreas damage: Damage to the pancreas caused by alcohol abuse or traumatic injury can lead to diabetes.
The symptoms of diabetes can include excessive thirst, frequent urination, fatigue or lack of energy, frequent hunger, …etc.[17][18] [19].We note that there are also differences depending on the type of diabetes.Diabetes disease must be treated urgently because, it can cause severe damage to vital organs, such as the heart, kidneys, blood vessels, eyes and nerves [20][21] [22][23] [24][25] [26].

RELATED WORKS
In the literature, several works have contributed to the creation of automatic systems for diabetes detection.In the following, we will present the most relevant works.
The work presented in [27] proposed a dynamic prediction system of glycemia by the use of deeplearning multi-series.The work proposed MT-LSTM (Multi-Time-series Long Short-Term Memory) for accurate and efficient prediction of blood-glucose concentration in 112 users.The proposed model takes current readings from Continuous Glucose Monitoring (CGM) devices and a few external factors (meals, medications, insulin, physical activity and quality of sleep) as inputs and gets as output the personalized glucose concentration within the next hour.
In the work of [28], a blood-glucose prediction system has been proposed for patients with type-1 diabetes based on deep convolutional neural network (CNN).To do this, the following steps were followed: (1) Data collection from 6 patients with type-1 diabetes and wearing Continuous Glucose Monitoring to capture data every 5 minutes for 8 weeks, (2) Treatment of missing data, (3) Training of the model and (4) Evaluation.
In [29], a medical decision-support system has been proposed to predict the diabetes disease based on machine-learning and deep learning techniques.For machine learning, Support Vector Machine (SVM) and Random Forest (RF) were used and for deep learning, the authors used Convolutional Neural Network (CNN).
The objective of the study presented in [30] is to propose a predictive model for three major diabetesrelated complications in Indonesia and identify the significant risk factors related with them.To do this, the authors followed three steps, that are: data pre-processing, data-mining analysis and rule generation.The machine-learning algorithms used were: Naive Bayes Tree, C4.5 decision tree and kmeans.
The objective of the work presented in [40]  The work presented in [43] focused on the early prediction of type-2 diabetes with the use of effective techniques to reduce the mortality rate, using unsupervised deep neural networks.The authors used F1-score feature selection on the Pima Indian Dataset after the pre-processing phase to reduce noise and empty entries.
In the work of [33], a system has been proposed to explore predictive analytics for early detection of diabetes using several machine-learning algorithms, such as K-Nearest Neighbors, Logistic Regression, Support Vector Machine, Naive Bayesian, Random Forest, Decision Tree on Pima Indian Dataset which contains 768 patients from an Indian population.
In order to identify misdiagnosed type-1 diabetes patients from patients with a prior type-2 diabetes diagnosis, the XGBoost machine learning model has been used in [34].The model utilized a total of 932 variables, which were initially broken down into various metrics for prediction.Following recursive feature elimination (RFE), the model achieved an optimal balance between precision and complexity by selecting the top 250 variables.These 250 variables were then used to generate the model's outputs.
In order to predict Diabetes mellitus in Nigeria at an early stage, a new approach has been proposed in [35].To accomplish their objective, authors collect data via oral interviews with patients and performed pre-processing and transformation using label encoder and standard scaler, respectively.After that, three supervised learning algorithms (K-Nearest Neighbors, Decision Trees and Artificial Neural Networks) were applied on nine attributes in order to find the best one.
The aim of the work presented in [36] is to employ several machine-learning techniques in order to explore age adaptation in predicting the risk of Diabetes mellitus.The used machine-learning algorithms are: Linear Regression, Logistic Regression, Polynomial Regression, Neural Network, Support Vector Machines, Random Forest and XGboost.To examine the impact of age, the authors divided the dataset into six age groups, evenly distributed, where individuals aged 81 were excluded from the analysis.
The authors in [37] derived several datasets from PIMA dataset by applying various pre-processing techniques in order to implement four machine-learning algorithms for diabetes prediction.The used algorithms are: K-Nearest Neighbors, Decision Tree, Random Forest and Support Vector Machines.
Given the advantages offered by ontologies, the authors of [39] proposed an approach based on ontologies and machine-learning algorithms.After pre-processing and cleaning the PIDD dataset, six machine-learning algorithms; namely, SVM, KNN, ANN, Naive Bayes, Logistic Regression and Decision Tree, were tested in order to choose the best one.In parallel, they extracted generated rules from the Decision Tree algorithm and imported them to Protegé using the SWRLTab plugin.
A new dataset was created in [40], where new diabetes risk factors were used.The authors followed five steps that are: Dataset Collection, Data Pre-processing, Clustering, Model Creation and Evaluation.In the Clustering step, the K-means algorithm was applied on the most correlated attributes (glucose and age) in order to classify each patient into diabetic or non-diabetic.For the model creation step, the following algorithms were tested: Support Vector Classifier, Random Forest Classifier, Decision Tree Classifier, Extra Tree Classifier, Ada Boost algorithm, Perceptron, Linear Discriminant Analysis algorithm, Logistic Regression, K-Nearest Neighbour, Gaussian Naïve Bayes, Bagging algorithm and Gradient Boost Classifier.
An ontology-based machine learning was proposed in [41] in order to predict diabetes.To achieve their goal, the authors followed three phases that are: Data pre-processing, Creation of the Decision Tree by using J4.8 classifier algorithm in Weka and finally the creation of the ontology.After creating the ontology and importing the dataset, the SWRL rules were extracted from the Decision Tree for reasoning.Then, Pellet reasoner was applied to infer whether the patient is diabetic or not.
In order to predict diabetes using machine-learning algorithms on Hadoop cluster, a new architecture is proposed in [42].This architecture includes three modules that are: Diabetes Data Collection Module, Diabetes Prediction Module and Prediction Results Module.The first module is responsible for the collection of data from different hospitals.The second module is composed of different submodules that are: Features extraction, Features selection by using Information gain method and Hadoop Cluster that apply several machine-learning algorithms (Neural network, Support vector machine, Decision tree, Naive Bayes and Random forest).
In order to predict gestational Diabetes mellitus in Singapore, the authors of [43] combined coalitional game theory concepts with machine learning.Shapley values and the SHapley Additive exPlanations (SHAP) framework was combined with CatBoost tree ensembles for feature selection.Logistic Regression, Support Vector Machine, CatBoost Gradient Boosting and Multilayer Perceptron Artificial Neural Network were applied on the collected dataset.
In the same context, a machine-learning model was proposed in [44] in order to prevent the progression of gestational Diabetes mellitus to type-2 diabetes.For this purpose, Shapley values were combined with CatBoost tree ensembles to perform feature selection and Logistic Regression, Support Vector Machine, CatBoost Gradient Boosting and Artificial Neural Network were implemented and tested in order to choose the best algorithm.

Our Contribution
The objective of this work is to propose an automatic diabetes-prediction system, in order to raise patient awareness and avoid serious complications that can occur in the long term.Our system makes it possible to identify people at high risk of developing diabetes in order to put in place early preventive measures and slow down the progression of the disease, which results in the improvement of the quality of life.We can summarize our contribution in the following points: • The construction of our own dataset from scratch by collecting Algerian data.The goal behind this initiative is to collect the data that influences the diabetes disease in Algeria and to focus on the risk factors of this last.Creating our own dataset was a crucial step, because the publically available datasets don't represent all the important risk factors of diabetes; so, they do not reflect the reality of the disease.For this, we have collected all the features that concern the patient; namely, his/her demographic, anthropometric, genetic, lifestyle, medical and laboratory information.• The application of pre-processing techniques to improve the quality of the collected medical data, • Exploratory data analysis with the intention of identifying the most significant features and the correlation between them.To do that, we applied several feature-extraction techniques, such us: ANOVA, RFE and the features proposed by the PIDD dataset.• Application of several machine-learning algorithms, to compare the obtained results and choose the best among them, • Drawing possible predictive conclusions in terms of the obtained results and deciding whether the patient is diabetic or non-diabetic.

PROPOSED APPROACH
Diabetes mellitus is a major and prevalent pathology that affects many people.Several factors can increase the risk of developing Diabetes mellitus, such as age, obesity, physical inactivity, family history, lifestyle, unbalanced diet, high blood pressure, …etc.Figure 1 presents the architecture of our diabetes prediction system.
In order to predict the diabetes disease, we will follow the following steps:

Data Collection
Data collection is a crucial step in any scientific research study.In our case, we collected real data in Algeria, in the city of Oran, from diabetic and non-diabetic patients, using a questionnaire administered at the Oran University hospital.The purpose was to collect the largest amount of data on these patients.
The questionnaire was validated by experts in the field in order to guarantee its quality and relevance.
The questions asked were adapted to the studied population and allowed an in-depth analysis of the various risk factors associated with diabetes.The answers obtained were carefully processed and used to feed our diabetes-prediction system.
Our internship took place in two departments; namely, the Internal Medicine department reserved for diabetic patients and the Endocrinology department to collect data from non-diabetic patients affected by other conditions that may cause diabetes.On the other hand, we created and shared an online questionnaire via social-media networks (Facebook, Twitter, Linkedin, …etc.), to gather more data and enrich our dataset.We have implemented rigorous protocols to guarantee the quality of the collected data, in particular by ensuring the confidentiality and anonymity of patients.Table 2 shows the collected data, with an indication of the start and end period of the collection and the total number of patients taken.The general population statistics are presented in Tables 3, 4 and 5.  Blood sugar (in g/l), Total cholesterol (in g/l), Urea (in g/l), HbA1c (in percentage (%)), Triglycerides (in g/l), Creatinine (in mg/l), TGO/ASAT (in U/l), TGP/ALAT (in U/l), TSH/TSHus (in U/l), T4/FT4 (in pg/ml), T3/FT3 (in pg/ml) and Uric Acid (in mg/l).

Data Pre-processing
After describing all the data that we have collected, we will proceed to data visualization and analyze the different characteristics, then we move on to cleaning and correcting the records to obtain reliable data, ready to be used for diabetes prediction.
The goal of this phase is to improve the quality of the collected data, eliminate errors and deal with inconsistencies and missing data, to make more informed and accurate decisions.In order to prepare our data for training predictive models, we will follow the following steps:

Data Preparation
The first step is the detection and correction of the typing errors, as they are common in medical data.Next, we transformed all the categorical data into numerical data, such as Pregnant = 1 and Not-pregnant = 0 or Single = 0, Married = 1, Divorced = 2 and Widowed = 3.We also removed records that have empty values in most columns.

Feature Deletion
In this step, we decided to delete some column in order to take only the relevant features.For this, we visualized the missing values in our dataset represented by the white color in Figure 2. We notice that the data of the blood tests has a very high rate of missing values, such as FT3, FT4, THus, …etc.
After missing-data visualization, we made a percentage of the missing values for each attribute in order to categorize them into 3 groups, as follows: • Group 1: Features that have between 0% and 40% missing values, • Group 2: Features that have between 40% and 80% missing values, • Group 3: Features that have between 80% and 100% missing values.
After that, we removed group-3 attributes, as they have a large percentage of missing values.We deleted also the irrelevant features, like the city attribute (it is specific to the geographical location and may not provide a direct correlation with the diabetes-prediction study).We removed also height and weight (they have a correlated relationship with the BMI column).
Figure 2. Visualization of missing data.

Fill in Missing Values
To fill in the missing values, we replaced the missing data with the value 0 or the median value according to the attribue.The median consists of replacing missing or outlying values in the data with the median of the entire dataset.It is a robust measure of the central tendency of the data that is not influenced by extreme values or outliers.

Data Normalization
We carried out data normalization to put all the variables on the same scale.This ensures that certain variables do not dominate others when building the model.

Feature Selection
Feature selection is the process of selecting a sub-set of the most relevant features in the dataset to describe the target variable.It improves computation time, generalization performance and interpretation problems.There are several types of feature-selection techniques, in our case, we used:

Uni-varied Selection
This technique consists in evaluating each characteristic independently of the other characteristics by estimating the correlation or the dependence between its characteristics and the target variable, in order to choose the variables that have the strongest correlation or target dependence.Among these univariate selection techniques, we have the ANOVA method.
ANOVA (ANalysis of Variance) is a well-known statistical method to determine whether there is a difference in means between two groups.In our study, the ANOVA test was used to select the significant numerical characteristics in the prediction of diabetes.The F-statistic is used for feature ranking.The larger the value of the F-statistic, the better the discriminative ability of the feature [38].The F-value is calculated as follows (see formula 1): ( where SSB (sum of squares between groups) is the variation of the means of the groups from the total grand mean and SSW (sum of squares within the groups) is the sum of the squared deviations from the means of the groups and at each sighting.The degrees of freedom for the mean square between and within groups are defined by dfb and dfw, respectively.For all numerical features in the dataset, the F-value was calculated using the previous equation 1 and the features with the highest values were selected [38].
Our goal was to assess the importance of each feature in our database based on its contribution to the variance in the data, so that we could use this information to select the most important features.The best ANOVA selection is 32 out of 68 features (see Figure 3).
The visualization of the selected features and the F-score provide a visual overview of the most important variables for prediction, thereby helping better understand the influence of each feature on the final outcome (see Figure 4).

Recursive Feature Elimination (RFE)
Recursive feature elimination is a recursive procedure to select features based on model accuracy.
The metric determines the discriminative ability of the features.At each iteration, the rank score metric is calculated and lower-ranked features are eliminated.The recursive procedure is repeated until the desired number of features is reached.In this study, RFE was used as the final step in the feature-selection procedure [38].
We could use recursive feature elimination (RFE) to identify the most important features one by one.
We used this technique with the logistic regression algorithm to evaluate their performance on the selected characteristics.According to this method, the optimal number of data selections is 20 among 68 features (see Figure 5).
The visualization of features selected by RFE shows the positive and negative relationships depending on the effect of each variable on the target variable (see Figure 6).

Comparison with the Pima Indian Diabetes Dataset
The Pima Indian Diabetes Dataset (PIDD) is a well-known dataset used for diabetes prediction.It contains medical information from 768 female Pima Indians, including 9 features, such as glucose level, blood pressure, BMI and age.The target variable indicates whether the individual has diabetes (1) or not (0).This dataset is widely used in machine learning and data-mining research for predicting the risk of diabetes based on the given features.
It is important to note that this dataset is imbalanced, as there are significantly more non-diabetic samples compared to diabetic samples (500 samples belonging to non-diabetic persons and 268 samples belonging to diabetic persons).
In the dataset, six attributes contain zero values for some samples.The occurrences of zero values in these attributes are as follows: 111 for pregnancy, 5 for glucose, 35 for blood pressure, 227 for skin thickness, 374 for insulin and 11 for BMI.The presence of these zero values might be due to errors during data collection or missing data.
These zero values can potentially impact the performance of the classifier negatively.To create a reliable prediction model, it is essential to handle these missing values appropriately and fill them with suitable values.This process ensures that the model can make accurate predictions and avoid any potential biases caused by incomplete or incorrect data.
For a comparative analysis, we compared our database with the famous Pima Indians Diabetes Dataset, which is widely used in the scientific literature for the study of diabetes.This approach will allow us to evaluate and improve the study to fully understand the differences and similarities between the data collected in Algeria and the reference data in the scientific literature.So, we selected the same available features in our dataset according to PIDD.Table 6 shows a summary of the selected features according to the PIMA dataset.

Diabetes Prediction
We split the dataset into 70% for the training set and 30% for the test set.The training phase is a crucial part in the creation of any machine-learning and deep-learning model, including for diabetes prediction, in order to optimize the parameters so that the model is capable of improving performance.For the creation of the model, we used several machine-learning algorithms to evaluate their performance and choose the most robust one.This step also includes finding the best hyperparameters for each algorithm and training the model on the training data.Once the model is trained, it can be used to predict output values for new data.
The algorithms used in this step are: SVM, KNN, Decision Tree and Logistic Regression.Once the models are trained, they are saved to use them in the test phase or to deploy them to predict new cases of patients.
The next phase is testing and evaluation, which is an important step in the prediction process, because it allows to evaluate the performance of the model on the test data, it includes several steps, which are: 1) Retrieving test data, 2) Deploying the machine-learning model from the database, 3) Evaluating performance by estimating performance metrics, such as accuracy, precision, …etc.4) Analyzing the results to understand the strengths and weaknesses of the created model.This may include prediction visualizations, 5) Improving the model created from the previous analysis in order to adjust the hyperparameters again or make modifications in the previous steps.This phase can be repeated until the model reaches high performance.

EXPERIMENTATIONS
To carry out our experiments and train our prediction models, we used a single personal computer with the following characteristics (see Table 7): We have implemented several machine-learning algorithms, such as SVM, KNN, Decision Tree and Logistic Regression.Our objective was to compare these algorithms with each other and select the one that offers the best performance in terms of recall, precision, accuracy, F1-measurement, RMSE, MSE and MAE.In addition, we will examine the results obtained by each algorithm on a dataset, using 3 feature-extraction techniques; namely, Anova, RFE and comparison with the PIMA dataset.

SVM
The results obtained with the SVM algorithm are presented in Table 8, highlighting the performance of the model.After analyzing the results, we found that the use of the SVM algorithm for the prediction of diabetes does not show a large significant difference in accuracy between the use of RFE and ANOVA.The difference observed is minimal, with only 1% variation between the two techniques.

KNN
The results obtained with the KNN algorithm are presented in Table 9.From the results presented in Table 9, it is clear that using the KNN algorithm in combination with the Anova method for feature selection gives superior performance compared to other feature selection techniques.Indeed, the model achieves an accuracy of 97.60%, indicating its ability to accurately predict diabetes.

Decision Tree
The results obtained with the decision-tree algorithm are presented in Table 10.The decision-tree algorithm produced promising results in our study.When we used the features extracted by RFE, we obtained an accuracy of 97.60%.However, using features selected by ANOVA, the accuracy decreased, reaching 95.20%.

Logistic Regression
The results obtained with the logistic-regression algorithm are shown in Table 11.The results obtained demonstrate that there are no significant differences between the used featureselection methods.However, it is important to note that the RFE algorithm achieved an accuracy of 97.60% and this is the best-performing method.

Discussion
To facilitate the understanding and analysis of the results, we have divided the results into three separate tables, corresponding to each feature-selection method used in our study.This allows us to compare the performances of different methods and identify the best combination of algorithm and future-selection method for diabetes prediction.
Each table presents the key performance measures, such as precision, recall and F-measure, obtained for each combination of algorithm and feature-selection method.By examining these tables, we will be able to evaluate the performance of each selection method and determine which one offers the best results in terms of predicting diabetes.From the results of the previous tables, we can identify the best machine-learning algorithm for diabetes prediction.The most efficient algorithms are: • The KNN algorithm with the ANOVA feature-extraction method.When we use the ANOVA method, we obtain an accuracy rate of 97.60%.• The Logistic Regression algorithm performs well, using the RFE feature-extraction method.
When we apply the RFE method, we obtain an accuracy rate of 97.60%.• By using the SVM algorithm with the RFE method, we manage to obtain a remarkable accuracy rate of 97.60%, • The Decision Tree algorithm, combined with the RFE method for feature selection, provides an accuracy rate of 93.60%.
We note that during the experiments, we sought to optimize the performance of our algorithms by carrying out exhaustive tests with different parameter configurations.The objective was to find the parameters that would allow us to obtain the most efficient results.The results of each parameter configuration were carefully noted and compared.As an example, we set the number of clusters in the KNN algorithm to 7 and used Euclidean distance as a similarity measure.In the SVM algorithm, we chose kernel = 'linear'.For the decision tree, we took entropy as a division criterion.
In order to evaluate the performance of our classification models, in particular with regard to their abilities to distinguish between positive and negative classes, we plotted the ROC (Receiver Operating Characteristic) curve and the area under the curve (AUC) which are presented in Figure 8.These metrics are commonly used in the evaluation of classification models, particularly in the field of diabetes prediction, where the distinction between true positives and false positives is of critical importance.From the presented results, the AUC of our models is closer to 1.0 which is considered efficient.The limitation of this work lies in the number of collected data, since the volume of data is a critical factor in data analysis and modeling projects, especially in the medical field.Due to time, resource and data-access constraints, we had to work with a limited sample of data.This may lead to potential biases in our results, as a larger sample could have provided a more complete and representative view of the problem.

CONCLUSION
Diabetes prediction has become a growing area of interest due to the increasing prevalence of diabetes, advances in technology and the growing importance of health data.Research efforts in this area aim to develop accurate predictive tools that can aid in the prevention, early diagnosis and effective management of diabetes, thereby improving the health and well-being of individuals affected by this disease.
In this work, we implemented and evaluated several algorithms, such as SVM, KNN, Logistic Regression and Decision Tree, by comparing their performances and analyzing the obtained results.Our results demonstrated that the KNN algorithms proved to be particularly effective in the prediction of diabetes, because it took advantage of its nearest neighbor-based approach to achieve good results.It is also important to emphasize the importance of feature selection in the prediction of diabetes.We found that different selection methods influenced the performance of the algorithms, highlighting the importance of a well-considered approach, such as RFE or ANOVA, to choose the most relevant features.
In terms of future perspectives, there are several aspects to consider to improve and further develop the diabetes-prediction system: • In terms of the dataset, it would be interesting to extend our study to large data on other wilayas to reach the entire Algerian population, • Collecting other risk factors for diabetes, • Developing a useful online website or mobile application and deploying it on Play store.

Figure 3 .
Figure 3.The features selected by ANOVA.

Figure 4 .
Figure 4.The importance of the features selected by ANOVA and the F-score.

Figure 5 .
Figure 5.The features selected by recursive feature elimination (RFE).

Figure 6 .
Figure 6.The importance of the features selected by recursive feature elimination.

Figure 7
Figure 7 presents the correlation matrix of the selected features.

Figure 7 .
Figure 7. Ccorrelation matrix of the selected features.

Figure 8 .
Figure 8.The ROC and AUC results.

Table 1
presents a comparison among the presented works in terms of method and data.

Table 1 .
Comparison among the related works.

Table 4 .
Distribution of the population according to sex.

Table 5 .
Number of individuals with specific diseases.The lifestyle data comprises: sport, anxious nature, type of consumed sugar, balanced diet, consumption of fruits and vegetables, consumption of fast food, fasting rhythm, smoking and alcohol.5) Medical data: The medical data encompasses: diagnosis of diabetes during pregnancy, history of health problems, the type of disease, antihypertensive medication, hypoglycemia diagnosis, diabetes, duration, type of diabetes, treatment, health problems and complications after diabetes and the type of complication disease.6) Laboratory data: The data related to patients' blood tests includes 12 features, which are:

Table 6 .
Summary of the selected futures according to the PIMA dataset.

Table 7 .
The specifications of the used machine.

Table 8 .
Diabetes-prediction results with the SVM algorithm.

Table 9 .
Diabetes-prediction results with the KNN algorithm.

Table 10 .
Diabetes-prediction results with the decision-tree algorithm.

Table 11 .
Diabetes-prediction results with the logistic-regression algorithm.

Table 12 .
Results of the ANOVA selection method for the prediction of diabetes.

Table 13 .
Results of the RFE selection method for the prediction of diabetes.

Table 14 .
Results of the comparative method with PIMA for the prediction of diabetes.