Clinical Decision Support System for Diabetic Patients by Predicting Type 2 Diabetes Using Machine Learning Algorithms

Diabetes is one of the most serious chronic diseases and results in high blood sugar levels. Early prediction can significantly reduce the risk and severity of diabetes. In this study, different machine learning (ML) algorithms were applied to predict whether an unknown sample had diabetes or not. The main contribution of this research, however, is a clinical decision support system (CDSS) built on the prediction of type 2 diabetes using different ML algorithms. For this purpose, the publicly available Pima Indian Diabetes (PID) dataset was used. Data preprocessing, K-fold cross-validation, hyperparameter tuning, and various ML classifiers such as K-nearest neighbor (KNN), decision tree (DT), random forest (RF), Naïve Bayes (NB), support vector machine (SVM), and histogram-based gradient boosting (HBGB) were used. Several scaling methods were also used to improve the accuracy of the result. In a further step, a rule-based approach was used to increase the effectiveness of the system, after which the accuracy of DT and HBGB was above 90%. Based on this result, the CDSS was implemented so that users can enter the required input parameters through a web-based user interface and receive decision support with some analytical results for the individual patient. The implemented CDSS will help physicians and patients make decisions about diabetes diagnosis and offers real-time analysis-based suggestions to improve medical quality. For future work, if daily data of diabetic patients can be collected, a better clinical support system can be implemented for daily decision support for patients worldwide.


Introduction
A CDSS can be a blessing in the field of chronic diseases like diabetes. The volume, complexity, and dynamic behavior of clinical information are a challenge for doctors and other health professionals. A CDSS seeks to assist physicians as well as patients by providing real-time feedback regarding health conditions [1].
Diabetes is a chronic metabolic condition marked by a recurrent rise in blood glucose levels. It is a global health priority that affects 463 million people, or one out of every eleven adults. This figure is anticipated to grow to 578 million by 2030 [2]. Diabetes is caused by several different pathogenic mechanisms. These can range from autoimmune destruction of pancreatic beta cells, resulting in insulin deficiency, to abnormalities that lead to insulin resistance. Due to the deficient action of insulin on target tissues, diabetes produces abnormalities in glucose, lipid, and protein metabolism. Deficient insulin action occurs when the body does not make enough insulin and/or when tissues do not respond well enough to insulin at one or more points along the complex pathways of hormone action. In many patients, reduced insulin secretion and impaired insulin action coexist, and it is difficult to tell which abnormality, if either alone, is the primary cause of hyperglycemia [3]. Diabetes is a category of metabolic disorders marked by hyperglycemia caused by problems with insulin secretion, insulin action, or both. Chronic hyperglycemia has been linked to long-term damage, dysfunction, and failure of various organs, primarily the eyes, kidneys, nerves, heart, and blood vessels. The patient's health condition should be regularly monitored to prevent these complications. To reduce health hazards, early prediction of diabetes can be very beneficial, and a CDSS will help patients continuously observe the different parameters that control insulin levels.
Over the last twenty years, patient management in healthcare has advanced considerably [4][5][6][7]. In this research, our ultimate goal was to propose a CDSS for diabetes patients and clinicians, so accurately predicting diabetes was the first goal. To predict diabetes, different ML classification algorithms such as KNN, DT, RF, HBGB, and NB were used. Patients will have access to a user interface to predict diabetes based on input parameters. Depending on the result of the prediction system, decision support will be offered together with a comparative analysis of the input parameters.

Contributions of the Proposed Work.
The main contributions of this study are as follows: (i) The proposed system provides a diabetes prediction system as well as a decision support system that gives patients a graphical analysis. (ii) Various scaling methods were applied with different ML algorithms, yielding different levels of accuracy. (iii) A rule-based technique was applied to the dataset to improve the accuracy. (iv) The histogram-based gradient boosting algorithm achieved the highest accuracy of 92.2%.
The remainder of the paper is structured as follows. The next part of this section discusses the related work of different researchers in the same field. Section 2 covers the methodology, the model diagram of the system, the dataset description, and preprocessing. Section 3 presents the results, analysis, and discussion, and Section 4 concludes the research work.

Related Work.
The goal of Kopitar et al.'s prediction model [8] was to predict type 2 diabetes at an early stage using ML approaches and to examine whether ML-based techniques gave any benefit in the early detection of impaired fasting glucose and fasting plasma glucose level readings. This study's data came from 10 hospitals in Slovenia with 3723 participants. In the beginning, there were 111 variables, but only 59 were used because the others had missing values. The variables were then divided into four categories. For this model, five ML techniques were used: LR, Glmnet, XGBoost, RF, and LightGBM. The predictive model was validated using root mean square error. AUC and area under the precision-recall curve were employed to evaluate the system because the dataset was imbalanced. XGBoost had 88.1% accuracy.
Alaa Khaleel and Al-Bakry [9] utilized the PID dataset and three supervised ML algorithms, namely logistic regression (LR), NB, and KNN, to predict diabetes. The dataset was preprocessed with the MinMax scaler to obtain better accuracy and was partitioned in a 7 : 3 ratio for training and testing purposes. LR was reported as the best classifier for the suggested system since its precision was superior to that of the other classification algorithms.
The authors of [10] suggested an IoT and ML system that analyzes blood sugar and other essential indicators to identify diabetes early and improve diabetes management apps that aid in patient monitoring. Four sensors were used to obtain the essential clinical data. A questionnaire was also employed to collect data. After the dataset was prepared, four distinct ML algorithms were implemented. The PID dataset was utilized to compare the algorithms' accuracy with that obtained on their collected dataset. A web-based diabetes management strategy was also proposed in this research.
Kaur and Kumari [11] developed a model utilizing the PID dataset and five distinct ML methods: KNN, linear kernel SVM, radial basis kernel SVM, artificial neural network, and multifactor dimensionality reduction. Significant features of the dataset were chosen via the Boruta wrapper technique. All models were tested using several criteria, including accuracy, F1 score, recall, precision, and AUC. SVM-linear performed much better than the other models.
Rghioui et al. [12] devised a method for monitoring the blood glucose level of diabetic patients based on ML methods. The proposed system develops an algorithm based on ML techniques and big data that can analyze the data of diabetes patients and send an alert in the event of an emergency. The data were sent to the server using 5G technology. The architecture of 5G technology consists of sensors, wearable devices, a smartphone application, and a server with a database. Portable sensors monitored the patient's blood glucose level, physical activity, and temperature and transferred the data through 5G to the base station for analysis by ML algorithms. The WEKA software was utilized for this study. In addition, the proposed method helps diabetes patients forecast their future blood sugar levels.
Deepti and Dilip [13] employed three ML classification methods, DT, SVM, and NB, to predict diabetes. They also used the PID dataset in their study. Different performance measures such as F-measure, precision, recall, and the receiver operating characteristic (ROC) curve were used to evaluate the algorithms. A 10-fold cross-validation method was also applied to the dataset. The highest accuracy of 76.30% was achieved by the NB method.
In [14], Rajput et al. proposed a cloud-based mobile application framework to help rural patients monitor their type 2 diabetes through regular follow-up of daily step counts, physical activity, and daily travel history. They also indicated that lifestyle is one of the main reasons why people get diabetes, so they devised a plan to help people keep track of their lifestyle and control their diabetes, with the data observed securely by doctors and medical practitioners. Hence, this can improve the interaction between patients and doctors.
To conclude, the literature review shows that various researchers have contributed to diabetes prediction models. In addition, it indicates that most of the researchers used the PID dataset to predict diabetes using different algorithms. However, the main research gap we observed is that most of the authors working in this domain with this dataset addressed only the predictive model. This study additionally proposes a CDSS that offers a web-based system where patients can give inputs and receive recommendations and comparative analytics graphs according to the output. We have also used different scaling methods to observe the results of different ML techniques. A rule-based approach was also applied to the PID dataset, which improved the accuracy. Therefore, we consider the proposed methodology to be a valuable contribution to both this dataset and the healthcare industry as a whole.
The architecture of the proposed system is shown in Figures 1 and 2. Our model works in two stages. The first stage, in Figure 2, is the prediction system. The second stage, in Figure 1, is the model architecture of the user interface, where the user can give the required input values to obtain decision support as well as a comparative analysis, which is discussed in detail in the results section. As shown in the block diagram in Figure 2, collecting the dataset was the preliminary step in the predictive model. The PID dataset, which has 768 samples, was used in this study. As the dataset contains missing values, according to the predictive model diagram, the dataset must be preprocessed before the splitting and ML algorithm application phases. The preprocessing part of this architecture mainly consisted of handling missing values, applying different scaling methods, and applying some rules to the dataset. After that, the dataset was split into training and testing sets.
Thus, after the preprocessing stage, the process moved to the next phase, where the ML algorithms were fitted on the training dataset and then applied to the testing dataset to obtain the output of each algorithm. Finally, we obtained the output as "yes" or "no." This result is used by the second stage in Figure 1. The proposed CDSS is built on the result of the prediction system for the user's input, as shown in Figure 1: depending on the result, the user receives recommendations if the user is predicted to have diabetes, or analytics if the user is predicted not to have diabetes.

K-Nearest Neighbor.
When using KNN, the function is only approximated locally, and all computation is deferred until classification. The training samples are stored in an n-dimensional space for later analysis. KNN is an essential supervised ML method, despite its simplicity. A supervised ML algorithm uses labeled input data to develop a function that can provide an output when fresh unlabeled data are supplied [15][16][17]. The Euclidean distance between two points A(x_1, y_1) and B(x_2, y_2) is
$$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}.$$
In KNN, different k values produce different sets of neighbors for prediction. Larger k values are generally recommended, but the standard range for k is 3 to 10.
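As a minimal illustration of the distance-and-vote idea described above, the following Python sketch classifies a query point by majority vote among its k nearest training points. It is not the implementation used in this study; the function names and the toy (glucose, BMI) values are hypothetical and chosen only for demonstration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (glucose, BMI) pairs; 0 = negative, 1 = positive.
points = [(85, 22.0), (90, 24.5), (95, 23.1), (160, 35.2), (170, 33.8), (155, 36.0)]
labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(points, labels, query=(150, 34.0), k=3)
```

Because all work happens inside `knn_predict` at query time, the sketch also shows why KNN is called a lazy learner: there is no separate training step beyond storing the samples.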

Decision Tree.
Decision trees group samples by sorting them according to their feature values. DT is one of the popular classification techniques of ML. A DT consists of several branches and nodes. Each node represents an attribute of the samples that are to be classified [18][19][20]. Determining the attribute for the root node at each level is a key difficulty in the DT. This procedure is called attribute selection, and there are two widely used attribute selection methods. The main objective of the DT is to find the highest information gain and the smallest entropy. Entropy determines how a DT chooses to split data and affects the manner in which a DT generates its boundaries [21]. The formula for calculating the entropy (E) is
$$E = -\sum_{i=1}^{m} p_i \log_2 p_i,$$
where p_i = probability of class i among the m classes. Information gain is calculated as the difference between the entropy before splitting and the entropy after splitting on the given attribute:
$$\text{Information gain} = E(b) - \sum_{x=1}^{K} E(x, a),$$
where E(b) = entropy before the split; K = total number of subsets after splitting; and E(x, a) = entropy of subset x after splitting, weighted by the fraction of samples falling in that subset.
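The entropy and information gain computations described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study; weighting each subset's entropy by its share of the samples is the standard formulation and is assumed here, and the label lists are made-up examples.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a non-empty list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, subsets):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(parent)
    weighted_after = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted_after

# Hypothetical labels: 1 = diabetes positive, 0 = diabetes negative.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_split = [[1, 1, 1, 1], [0, 0, 0, 0]]   # each subset is pure
useless_split = [[1, 1, 0, 0], [1, 1, 0, 0]]   # subsets mirror the parent
```

A split that separates the classes completely recovers all of the parent's entropy as gain, whereas a split whose subsets have the same class mix as the parent gains nothing, which is why the tree prefers the attribute with the highest information gain.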

Random Forest.
The random forest (RF) approach is a DT-based ensemble method. By merging numerous overfit estimators (i.e., DTs) into an ensemble learning algorithm, RF helps to minimize the overfitting tendency of the individual trees. A classification decision is obtained from each DT. Following the principle of majority voting, the class of a sample is determined by the votes of all decision trees, and the category with the highest number of votes is picked as the final result [22,23]. For splitting the dataset, the lowest Gini index is preferred. The Gini index is calculated as
$$\text{Gini} = 1 - \sum_{i} P_i^2,$$
where P_i = probability of class i.
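The two ingredients mentioned above, the Gini index of a node and the majority vote over trees, can be illustrated with the following short Python sketch. The function names and the example inputs are hypothetical; a full random forest additionally involves bootstrap sampling and random feature selection, which are omitted here.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a non-empty list of class labels: 1 - sum(P_i ** 2)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def majority_vote(tree_predictions):
    """Final class = the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical node contents and per-tree predictions (1 = positive, 0 = negative).
mixed_node = [1, 1, 0, 0]
pure_node = [0, 0, 0, 0]
forest_output = majority_vote([1, 0, 1, 1, 0])
```

A pure node has a Gini index of 0, and an evenly mixed binary node has the maximum value of 0.5, so lower values indicate better splits.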

Dataset Description.
In this study, the PID dataset is used for implementing the prediction system, as it is a well-known and widely used benchmark dataset for predicting diabetes [26]. This dataset contains 768 samples along with nine attributes. Eight attributes are independent: age, pregnancies, glucose, blood pressure (BP), skin thickness, insulin, BMI, and diabetes pedigree function. The remaining attribute is dependent; it is the outcome feature, represented with the binary values 0 and 1, where 0 is diabetes negative and 1 is diabetes positive. Of the 768 instances, 500 tested diabetes negative, and 268 tested positive.

Dataset Preprocessing.
Preprocessing is the key to getting the preferred output from a dataset. The PID dataset has some implausible zero values for certain important features. There are a few ways to handle these zero values, such as eliminating the rows that contain them or replacing them with mean or median values. In this research, median values were used to replace the zero values where necessary. Table 1 summarizes features such as BMI, glucose, insulin, and BP, giving for each feature the number of zero values, the number of distinct values, the minimum and maximum, and the mean.
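A minimal Python sketch of the zero-replacement step is given below. It assumes that the median is computed over the non-zero entries of a column, which the text does not state explicitly, and the glucose values are hypothetical.

```python
from statistics import median

def replace_zeros_with_median(values):
    """Replace zero entries with the median of the non-zero entries.

    Assumption: zeros are excluded when computing the median, since they
    stand for missing measurements rather than real readings.
    """
    nonzero = [v for v in values if v != 0]
    fill = median(nonzero)
    return [fill if v == 0 else v for v in values]

# Hypothetical glucose column with two missing (zero) readings.
glucose = [148, 85, 0, 89, 137, 0, 78]
cleaned = replace_zeros_with_median(glucose)
```

This treatment is only appropriate for features where zero is physiologically implausible, such as glucose, BP, BMI, and insulin; a zero in the pregnancies column is a valid value and would be left unchanged.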
After replacing the zero values with median values, different scaling methods, such as MinMax scaler, standard scaler, MaxAbs scaler, robust scaler, quantile transformer, and power transformer, were applied while working with the different ML algorithms. In this research, a set of rules was also integrated with the dataset features based on the correlation of different attributes. The dataset was then split: the training set is used to train the model, and the testing set is used to test the model's correctness. In this research, 80% of the dataset is used to train the model, and 20% is used for testing purposes. The widely used K-fold cross-validation technique [27] was applied to reduce overfitting and bias in the evaluation. We used k = 5, which means the dataset was divided randomly into 5 subparts while applying the algorithms.
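The scaling, 80/20 split, and 5-fold cross-validation described above can be sketched with the Python standard library as follows. This is an illustrative sketch only: MinMax scaling is shown as one representative scaler, the helper names and the random seed are hypothetical, and applying the folds to the training portion is an assumption, since the text does not specify whether cross-validation was run on the training set or on the full dataset.

```python
import random

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (training, validation) index lists for each of the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]

# 768 samples as in the PID dataset: 80% training, 20% testing, then 5 folds.
train_idx, test_idx = train_test_split_indices(768)
splits = list(k_fold_indices(train_idx, k=5))
scaled = min_max_scale([0, 5, 10])
```

Each sample in the training portion appears in exactly one validation fold, so every sample is used for both fitting and validation across the five rounds.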

Accuracy Metrics.
In this research, precision, recall, F1 score, and accuracy are evaluated. Precision is defined as the proportion of predicted positives that are actually positive. Recall, also known as sensitivity or true positive rate, denotes the proportion of actual positives that were correctly identified. The F1 score is the harmonic mean of precision and recall. The formulas for precision, recall, F1 score, and accuracy [28,29] are as follows:
$$\text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN},$$
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
If the prediction system predicts a user as diabetes positive and it is actually positive, then it will be denoted by TP, which means true positive. TN represents true negative which means the prediction system predicts a user as diabetes negative, and it is actually negative. FP represents false positive, which means the prediction system predicts a user as diabetes positive and it is actually negative. FN represents a false negative, which means that the prediction system predicts a user as diabetes negative and it is actually positive.
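To make the four measures concrete, the following Python sketch derives TP, TN, FP, and FN from true and predicted labels and computes the metrics from them. The label vectors are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1 score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical ground truth and predictions for eight users.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
```

In this toy example there are three true positives, three true negatives, one false positive, and one false negative, so all four measures equal 0.75.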
The Working Process of the User Interface for Decision Support.
As mentioned above, this system does not only predict diabetes; it also provides decision support for patients based on the input values they enter through the web interface shown in Figure 3. These data are stored and used for prediction. After storing the data collected from the patient, the system predicts the presence of diabetes. If the patient is predicted to be diabetes positive, the system provides decision support; in negative cases, it provides a comparative graph of different parameters.

Results and Discussion
In this research, several ML approaches were applied for the classification of the PID dataset. The accuracy of the algorithms applied to the raw dataset is shown in Table 2.
Table 1 shows that the dataset contains zero values in some attributes, such as BMI, glucose, BP, and insulin, which is not possible in real life. So, the implausible zero values were replaced by the mean value of the respective column. After that, different scaling methods were applied to the dataset to improve the accuracy, and the results are shown in Table 3.
After applying the various scaling methods, the different ML classifier models were compared. According to Table 3, the KNN model provides the highest accuracy among all classifiers: 84.02% with the MinMax scaling technique when the value of k was 11 and the p value was 2. The lowest accuracy of 73.96% was observed using the DT classification method. On the other hand, the scaling methods had no impact on HBGB. However, the different scaling approaches clearly had distinct effects on the other classification algorithms, which is helpful for improving model accuracy through trial and error. In the final phase, rule-based approaches [30][31][32] were applied to the dataset. This time the zero values were replaced by the median value of the corresponding column. The rules are given in Table 4.
After applying the rules, the accuracy improved significantly compared to the previous phase, as shown in Table 5. Table 5 and Figure 4 show that HBGB provided the highest accuracy, and its other performance metrics, such as precision (Figure 5), sensitivity or recall (Figure 6), and F1 score (Figure 7), are also notably higher than those of the other algorithms used in this study.

Journal of Healthcare Engineering
For this method, the 5-fold cross-validation technique [27] and hyperparameter tuning were also applied. For the hyperparameters, the best max iter value was 100, and the best learning rate was 0.04. The DT model also provides good accuracy. On the other hand, SVM provided the lowest accuracy in this method. As HBGB showed the highest performance, this algorithm was selected to predict diabetes in our prediction system. The important features, shown in Figure 8, were also determined so that the analytics graph could be built from them [33].

Patient Who Has Diabetes.
If the diabetes prediction system indicates that the patient has diabetes according to the information provided, the CDSS will make recommendations based on this prediction. A sample of recommendations that will be displayed in the patient's user interface is given in Table 6 [34,35].

Patient Who Does Not Have Diabetes.
On the other hand, patients who are predicted negative for diabetes after filling in the required input fields on the user interface, as shown in Figure 3, will be redirected to another interface where they can view a comparative graph (Figure 9) of some critical parameters based on the important features from Figure 8, such as BMI, glucose, BP, insulin level, and skin thickness. In Figure 9, the first level denotes the patient's current value of a parameter, which is obtained from their input, the second level is the average value of that parameter for nondiabetic patients, which is computed from the dataset, and the third level is the average value of that parameter for diabetic patients. This allows patients to judge whether or not they are at risk of developing diabetes by analyzing the different parameter levels in the graph. As a result, they will have a greater understanding of their current health status, allowing them to make more informed decisions, and it will also help them to learn more about diabetes as a chronic disease. Thus, patients who are predicted as diabetic are given guidance, whereas those who are predicted as nondiabetic are given comparisons of various input parameters. From these, patients can become aware of the different parameter levels and take the necessary steps to control levels such as BMI, BP, and glucose. As a result, both diabetic and nondiabetic patients will benefit from the proposed CDSS.
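The three levels shown in the comparative graph can be computed as in the following Python sketch. It is a simplified illustration, not the system's actual code: the record layout, the field names (`glucose`, `bmi`, `outcome`), and all values are hypothetical.

```python
from statistics import mean

def group_averages(records, feature):
    """Average of `feature` for nondiabetic (0) and diabetic (1) records."""
    negative = [r[feature] for r in records if r["outcome"] == 0]
    positive = [r[feature] for r in records if r["outcome"] == 1]
    return mean(negative), mean(positive)

def comparison_levels(patient, records, features):
    """For each feature: patient's value, nondiabetic average, diabetic average."""
    levels = {}
    for feature in features:
        nondiabetic_avg, diabetic_avg = group_averages(records, feature)
        levels[feature] = {
            "patient": patient[feature],
            "nondiabetic_avg": nondiabetic_avg,
            "diabetic_avg": diabetic_avg,
        }
    return levels

# Hypothetical dataset rows and one hypothetical patient input.
records = [
    {"glucose": 100, "bmi": 25.0, "outcome": 0},
    {"glucose": 110, "bmi": 27.0, "outcome": 0},
    {"glucose": 150, "bmi": 33.0, "outcome": 1},
    {"glucose": 170, "bmi": 35.0, "outcome": 1},
]
patient = {"glucose": 120, "bmi": 29.5}
levels = comparison_levels(patient, records, ["glucose", "bmi"])
```

The resulting dictionary holds, per feature, the three bars of the graph; a patient whose value lies closer to the diabetic average than to the nondiabetic average can see at a glance which parameter needs attention.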

Regular checking
Check glucose level and also meet with the doctor for further instruction.

Glycemic targets
Maintain blood glucose level before a meal: 80 to 130 mg/dL (4.4 to 7.2 mmol/L). Two hours after the start of a meal: Less than 180 mg/dL (10 mmol/L).

Follow doctor's prescriptions
Follow the doctor's prescriptions, take medications on time, and follow other instructions.

Conclusion
In this research, an expert system is presented to help physicians as well as patients make decisions about diabetes diagnosis and to offer real-time analysis-based suggestions to improve medical quality. The conventional identification process is time-consuming, as it requires the patient to visit a diagnostic center and consult a doctor. For predicting diabetes, several ML classification techniques were applied with different scaling methods. A rule-based approach with the HBGB classifier provides the highest accuracy of 92.21%. Since the best result for our prediction system was obtained from HBGB, this algorithm can be used for the proposed CDSS. The proposed system will also be very beneficial for nondiabetic patients, as it shows a comparative analysis of the different parameters that are directly responsible for diabetes. The work can be expanded and enhanced toward the automation of diabetes analysis. In future work, feature selection techniques can be implemented to examine the effect of various feature subsets on accuracy. It is also planned to collect data from many regions across the globe to create a more accurate and broader predictive model for diabetes decisions. If daily data of diabetes patients can be collected, a clinical support system can be implemented for daily decision support for patients worldwide.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.