K-Nearest Neighbor Method for Early Detection of Diabetes Patients Based on Symptoms and Clinical Data

Abstract: Diabetes is a chronic disease that is rarely detected early and develops quickly. It can trigger other chronic conditions such as kidney failure and heart disease. Early detection is necessary so that patients can be treated before the disease becomes more severe. Various health examinations can detect diabetes, but they require a medical expert and cannot be carried out by just anyone. In addition, examination costs are often unaffordable. This research applies a data mining method, k-Nearest Neighbor (KNN), to the early detection of diabetes based on disease symptoms and patient clinical data. KNN classifies patient symptoms and clinical data into two classes, diabetes and non-diabetes, by calculating the Euclidean distance between test data and training data. The results show that a lower k-value provides higher accuracy. However, accuracy at low k-values is insufficient to draw conclusions about the performance of KNN for early diabetes detection. High accuracy at low k-values may indicate overfitting and a model that does not generalize well: with a low k-value, the model sees only one or a few neighbors, so the broader pattern of the data is not captured. A k-value that is too high, in turn, risks underfitting, because the model becomes too general and therefore unreliable. To address these issues, this research uses k-fold cross-validation, which helps to avoid overfitting in the constructed KNN model. The investigation uses k-fold=10 and k-fold=20 and examines the accuracy obtained for each combination of k and k-fold values. The higher the k-fold value, the higher the accuracy KNN produces. Conversely, the higher the k-value in KNN, the lower the accuracy. The KNN method applied in this research provides an accuracy of 98.2692%, with precision higher than recall. These findings suggest that KNN can be an effective and efficient tool for early diabetes detection.


I. INTRODUCTION
Diabetes is a chronic disease and one of the fastest-growing diseases worldwide. It is predicted to affect 693 million adults by 2045 [1] - [3]. Diabetes can threaten health [4] and trigger disorders of body function, such as kidney failure and heart disease [5], making it a disease that needs attention. According to WHO data, the number of diabetes patients increased significantly, by 314 million, between 1980 and 2014 [6], [7]. The most worrying facts emerge from low and upper-middle-income countries, which were home to more than 80% of people living with diabetes in 2013, with the number still increasing [8], [9]. According to data from the International Diabetes Federation (2019), there are around four million diabetes patients in the world aged 20-79 years, and this number is predicted to continue to increase [10], [11].
To treat diabetes effectively, both early detection and therapy are necessary. A clinical examination is needed to obtain relevant results for the early identification of diabetes. Diabetes often has a lengthy period without symptoms, so about half of all individuals with the condition remain undiagnosed. Various methods are used to detect diabetes, such as the OGTT (Oral Glucose Tolerance Test), the HbA1c examination, and the blood sugar test [12], [13]. However, this series of tests is not cheap. In the digital era, various computational methods have been proposed to solve prediction problems, including the prediction of diabetes [14].
Data mining is an approach that has proven capable of supporting disease detection systems. One previous study analyzed data mining methods with an ensemble approach to diabetes analysis and prediction, using random forest, KNN, Naïve Bayes, and J48. Its results show that KNN did not handle large datasets well: the proposed method gave better results on small data and relatively poor results on large data [15]. Another study applied classification techniques to predict diabetes mellitus using SVM, Decision Tree, and KNN. KNN was evaluated in two experiments: the first on unmodified data and the second on data that had been scaled to improve accuracy. The transformed data yielded higher accuracy than the unmodified data, which indicates that transforming the data influences the result [16].
Further research applied the KNN method to a dataset of diabetes sufferers. That research used a small dataset, which caused KNN to provide accuracy that was not good enough [17]. Delvika compared KNN with Naïve Bayes and found that KNN produced lower accuracy: KNN achieved 74.48% with k=25, whereas Naïve Bayes achieved 75.78% with k=10 [18]. Anthony compared KNN with fuzzy c-means and found that fuzzy c-means was better than KNN, with an accuracy of 96%. The dataset was obtained from observations and interviews with diabetes experts at Tanjung Health Center and comprised 120 records with 7 attributes: patient, often feeling tired, wounds that are hard to heal, blurred vision, often feeling hungry (polyphagia), a hereditary history, and a status of positive or negative, where negative indicates that no diabetes was found. Each attribute value is weighted on a scale of 0 to 3: 0 for no, 1 for rarely, 2 for often, and 3 for very often [19].

Naïve Bayes is good for data with a moderate number of features where the assumption of feature independence is acceptable. Still, this assumption is not always acceptable, and this method is very suitable for irregular data containing noise. KNN is good for structured data with features that can be normalized well and for data free from noise.

Based on these previous studies, the KNN method produces fairly good accuracy only on small datasets and poor accuracy on large datasets. This research modifies previous research by using a large dataset and applying cross-validation to optimize the accuracy produced by the KNN method. The main contribution of this research is testing and evaluating the modified KNN method with cross-validation techniques on a large dataset to increase accuracy in diabetes detection. The novelty of this research lies in using cross-validation and data transformation simultaneously on a large dataset, which is expected to provide more accurate and consistent results in early diabetes detection using the KNN method.
One data mining method that can be used to predict or detect diabetes is k-Nearest Neighbor (KNN). KNN has been proven effective for classification [20] - [23]. KNN is a simple method, has good resistance to noisy training data, and is effective in cases with large training data [24] - [26]. In this research, KNN is used as a classification technique to analyze the results of diabetes detection. Detecting diabetes requires data on the patient's symptoms and clinical data. This research uses secondary data comprising 520 instances, 16 patient data features, and one class feature. KNN classifies data based on the distance to the nearest neighbors, whose number is given by the k-value. It is therefore important to analyze the k-value to determine the setting at which KNN performs best. The distances are calculated using the Euclidean Distance formula. The accuracy of KNN indicates whether the data obtained can be used to detect diabetes. This research was conducted to analyze the performance of the k-Nearest Neighbor method in the early detection of diabetes based on symptoms and clinical data.

II. RESEARCH METHODOLOGY
This section describes the methodology used in the research. The first stage identifies the disease problem, specifically diabetes, followed by a literature study to gather insights from previous research on diabetes, data mining, the application of data mining in the health sector, and cross-validation evaluation techniques. After the literature study, data collection begins. This study uses secondary data obtained from Sylhet Diabetic Hospital, Bangladesh. After obtaining the dataset, the next stage is the analysis of KNN, which was carried out using WEKA. KNN was evaluated using cross-validation with k-fold=10. The methodology of this research is shown in Figure 1.

A. Dataset
Diabetes, especially Diabetes Mellitus (DM), is a metabolic disorder characterized by hyperglycemia due to abnormalities in insulin; high blood sugar results from relative or absolute insulin deficiency [27], [28]. Diabetes consists of a collection of conditions categorized more generally under a single diagnosis [1], [29]. DM is a disease that patients are rarely aware of, and it is often discovered only at a stage where complications have occurred. This is caused by the relatively long asymptomatic phase of diabetes [14]. Examples of symptoms that diabetes patients can experience include visual disturbances and impaired kidney function [30], [31].
This study used data obtained from Sylhet Diabetic Hospital, Bangladesh, collected by giving questionnaires directly to patients. The data set consists of 520 instances, one per patient, with 17 attributes consisting of age, gender, symptoms indicating diabetes, and the class [14]. The attributes used in this research and the values of each attribute are listed in Table I. The patients' ages range from 25 to 90 years. Patients are male and female. The other attributes take one of two values, no and yes. The class consists of class no, which represents non-diabetes, and class yes, which represents diabetes. Diabetes is detected based on the symptoms experienced by the patient. Each of the 520 patients had different symptoms, and these symptoms are used to diagnose whether or not the patient has diabetes. Examples from the data set are shown in Table II.

TABLE II EXAMPLES OF DATASET
The labels A1, A2, A3, and so on up to A17 indicate the attributes and the class of the data set. The first attribute (A1) is the patient's age. A2 is the patient's sex or gender, represented as m for men and f for women. A3 represents symptoms of polyuria, A4 represents symptoms of polydipsia, and so on. The values 0 and 1 indicate whether or not the patient experiences the symptom. For example, a value of 0 in A3 and 1 in A4 indicates that the patient does not have polyuria but has polydipsia. The values 0 and 1 apply in the same way to the other symptoms.
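As an illustration of the encoding described above, the following minimal Python sketch turns one questionnaire record into a numeric vector. The function names, the choice of 1 for m and 0 for f, and the optional min-max scaling of age to the stated range of 25 to 90 years are assumptions made for this example; the paper does not state how the input was encoded or whether age was scaled.

```python
# Illustrative sketch only: helper names and the gender coding are assumptions.

def scale_age(age, low=25, high=90):
    """Min-max scale age to [0, 1] using the age range reported for the dataset."""
    return (age - low) / (high - low)


def encode_record(age, gender, symptoms):
    """Encode one patient record as a list of numbers.

    gender   -- "m" or "f" (coded here as 1 and 0, an arbitrary choice)
    symptoms -- iterable of "yes"/"no" answers, e.g. polyuria, polydipsia, ...
    """
    gender_code = {"m": 1, "f": 0}[gender.lower()]
    symptom_codes = [1 if answer.lower() == "yes" else 0 for answer in symptoms]
    return [scale_age(age), gender_code] + symptom_codes


# A 25-year-old male patient without polyuria (A3 = 0) but with polydipsia (A4 = 1).
vector = encode_record(25, "m", ["no", "yes"])
print(vector)  # [0.0, 1, 0, 1]
```

Scaling matters for a distance-based method: an unscaled age difference of several decades would outweigh every 0/1 symptom attribute in a Euclidean distance.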

B. K-Nearest Neighbor
K-nearest neighbor (KNN) is a classification technique in which the k-value represents the number of nearest neighbors used to determine the class or group of the object being classified [32] - [34]. The distance from a data point to each of its neighbors is determined using Euclidean Distance, the distance calculation method most commonly used in KNN [35], [36]. Euclidean Distance is the most general and simplest formula for calculating distances in classification problems. It has been reported to provide higher accuracy than other distance calculation methods such as Hamming, Jaccard, and Cosine Distance [37]. Euclidean Distance is given in Equation (1):

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$

where n is the number of attributes, x is the attribute vector of the test data, y is the attribute vector of the training data, and d(x, y) is the Euclidean distance between x and y.
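The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the WEKA implementation used in the study: the function names, the toy training data, and the tie-breaking behavior (the label seen first among the nearest neighbors wins) are all choices made for this example.

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Rank training indices from nearest to farthest.
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean_distance(train_X[i], query))
    # Count labels among the k nearest; on a tie, the label met first wins.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, two binary attributes per record).
train_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
train_y = ["non-diabetes", "non-diabetes", "diabetes", "diabetes"]

prediction = knn_predict(train_X, train_y, query=[1.0, 0.9], k=3)
print(prediction)  # diabetes
```

With k=3, the two nearest records are labeled diabetes and the third is labeled non-diabetes, so the majority vote returns diabetes.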
The k-value in KNN is analyzed to determine the optimal value based on the resulting accuracy. Apart from the k-value, other considerations determine how well KNN performs in classification, such as the cross-validation process, precision, and recall. Research shows that k=10 in cross-validation gives a fairly good model [38]. In this research, the k-fold cross-validation value is therefore set to k=10.
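To make the k-fold procedure concrete, the sketch below splits 520 instance indices into folds, each of which serves once as test data while the remaining folds form the training data. It is a simplified illustration: the function name is hypothetical, and the split is neither shuffled nor stratified, which a real evaluation would normally add.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for plain, unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(520, 10))
train_idx, test_idx = folds[0]
print(len(folds), len(train_idx), len(test_idx))  # 10 468 52
```

With 520 instances, 10 folds give test folds of 52 instances each, and 20 folds give test folds of 26 instances each, so a larger number of folds leaves more data for training in every iteration.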
KNN is implemented using the WEKA application. The WEKA analysis begins by loading the data in the "Preprocess" tab by clicking "Open file". After the file is loaded, go to the "Classify" tab and, under "Classifier", click the "Choose" button. Then select the lazy classifier IBk. After choosing the classifier, the next step is to choose the evaluation technique. This study uses cross-validation with a k-fold value of 10, so under "Test options", select "Cross-validation" with Folds set to 10. Finally, click "Start".

C. Model Evaluation
KNN performance is generally assessed based on accuracy. However, accuracy cannot always serve as the benchmark for how well KNN performs, especially if the data is unbalanced. Therefore, KNN performance is measured using accuracy, precision, recall, and f-measure [39]. Accuracy is the ratio of correctly predicted data to the total data. Precision is the ratio of data correctly predicted as positive to all data predicted as positive [40]. Recall is the ratio of data correctly predicted as positive to all data that is actually positive [41]. F-measure is the harmonic mean of precision and recall and is also called the F1-Score [32]. Accuracy, Precision, and Recall are calculated using Equations (2), (3), and (4), respectively, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$
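The four measures can be computed directly from the counts of a confusion matrix. The sketch below is illustrative: the function name is hypothetical and the counts are invented for the example, not taken from the study's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

In this invented example there are fewer false positives than false negatives, so precision (40/45) is higher than recall (40/50), the same ordering the study reports for its own results.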

III. RESULT AND DISCUSSION
Accuracy testing was performed using 520 instances with 17 attributes each. The results are shown in Figure 2. The k-value test was conducted to determine the KNN setting with the best accuracy. In Figure 2, the highest accuracy of KNN is obtained at k=1. Because k=1 gives the highest accuracy, k=1 is used to test the number of folds in cross-validation. The k-fold test shows that the greater the k-fold value, the greater the accuracy. Figure 2 shows that k-fold=20 often provides better accuracy than k-fold=10 across the k-values tested in KNN. A high k-fold value can therefore increase accuracy. However, even when the k-fold value increases, accuracy still decreases as the k-value in KNN increases.

DOI: https://doi.org/10.25139/inform.v9i2.8582
As the value of k increases, accuracy decreases, as shown in Figure 2. The highest accuracy is obtained at k=1, which suggests that k=1 gives the best results. However, the high accuracy at k=1 arises because the test data is drawn from the training data, so the calculated distances, which determine the predictions, are small. Several factors explain why k=1 produces the highest accuracy. The first is adjustment to the training data. When k=1, each data point is classified exactly as its nearest neighbor, so the model fits the training data very closely and accuracy is high because each data point looks at only one neighbor in the training data. The second factor is the avoidance of classification errors. If the training data contains many examples of each class, a model with k=1 tends to avoid classification errors because each point is classified based on its nearest neighbor.

The value k=1 also affects the generalization of the model in several ways. The first is sensitivity to noise. With k=1 the model is very sensitive to noise or outliers: incorrect data points or outliers strongly influence the classification results because the decision depends on a single nearest neighbor. The second is overfitting. With k=1 the model tends to overfit and has high variability, so it performs well on training data and poorly on test data, and its performance can change with only small changes in the data. The third is the lack of generalization. A model with k=1 becomes very specific to the training data, and its ability to capture more general data patterns is very limited, so it performs poorly on new data. Therefore, the high accuracy given by k=1 cannot be used to conclude that KNN performs optimally.

More generally, a small k-value in KNN risks overfitting because models with small k-values are sensitive to noise or outliers. One data point that is wrong or not representative can cause the classification result to be wrong. Small k-values also cause high variability because the model makes classification decisions using little data. In addition, a small k-value causes the model to capture fine and specific details of the training data, including noise, rather than general data patterns. As a result, the model may perform well on training data but poorly on test data or new data. To overcome overfitting, this research selects the k-value using cross-validation, trying various k-values to determine the one that provides the best model performance on test data. In this study, k-fold=10 and k-fold=20 were used to analyze the performance of KNN. Several measures must be considered in assessing KNN performance, including the k-fold, recall, precision, and f-measure values. Therefore, testing is needed to determine the k-value that provides the best f-measure.
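The effect described above, where k=1 looks perfect when the test data comes from the training data, can be reproduced on a toy example. The sketch below is an illustration with invented data and a hypothetical helper name: it compares 1-nearest-neighbor accuracy when each point may match itself against leave-one-out accuracy, where the point being tested is held out.

```python
import math


def nearest_label(data, query, skip_index=None):
    """Return the label of the nearest point to `query` (1-NN), optionally skipping one index."""
    best_distance, best_label = None, None
    for i, (features, label) in enumerate(data):
        if i == skip_index:
            continue
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(features, query)))
        if best_distance is None or distance < best_distance:
            best_distance, best_label = distance, label
    return best_label


# Invented data: two clusters, with one mislabeled ("noisy") point in the first cluster.
data = [
    ((0.0, 0.0), "no"), ((0.0, 0.3), "no"), ((0.2, 0.1), "yes"),  # third point is noise
    ((1.0, 1.0), "yes"), ((0.9, 1.2), "yes"), ((1.2, 0.9), "yes"),
]

# Testing on the training data: every point finds itself at distance 0.
resubstitution = sum(nearest_label(data, x) == y for x, y in data) / len(data)

# Leave-one-out: the tested point is removed from the training data.
leave_one_out = sum(
    nearest_label(data, x, skip_index=i) == y for i, (x, y) in enumerate(data)
) / len(data)

print(resubstitution, leave_one_out)  # 1.0 0.5
```

The single noisy point causes both of its neighbors to be misclassified once points can no longer match themselves, which illustrates why accuracy measured on training data overstates the performance of k=1.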
Precision and recall tests are no less important. Precision is the ratio of data correctly predicted as positive to all data predicted as positive. Recall is the ratio of data correctly predicted as positive to all data that is actually positive. In this case, precision decreases with each iteration, and recall likewise decreases with each iteration. The analysis indicates that, for each k-fold setting, precision is consistently higher than recall. The results of precision and recall testing are shown in Figure 3. Figures 3(a) and 3(b) illustrate that precision consistently exceeds recall. The two figures also show that the greater the k-value, the smaller the precision and recall values, and the larger the gap between them. Another test is the f-measure, the harmonic mean of precision and recall, which reflects the balance between the two. F-measure testing is shown in Figure 4. Figure 4 shows that the f-measure at k-fold=20 is not always higher than at k-fold=10.
Subsequently, a statistical comparison was conducted to assess the difference between k-fold=10 and k-fold=20. The results showed a significance value of 0.019. Since this value is below 0.05, there is a significant difference between k-fold=10 and k-fold=20, and changes in the k-fold value influence the accuracy obtained. The values determined to be optimal, based on the evaluation of KNN's performance in classifying diabetic patients according to clinical data and symptoms, are presented in Table III. Based on these results, the KNN approach can be used to determine whether or not a patient has diabetes by analyzing the symptoms experienced and the patient's medical records. With this data mining approach, detecting diabetes at an earlier stage becomes more effective and efficient, and it has the potential to reduce the cost of tests that would otherwise not be required.

IV. CONCLUSION
The KNN method can classify diabetes based on symptoms and clinical data. In this case, recall is always lower than precision. The k-value in cross-validation also influences accuracy: a larger k-fold value is associated with higher accuracy. However, the k-value in KNN has the greater influence, and a high k-fold value does not provide better results if the k-value in KNN is large. In other words, accuracy increases with the k-fold value, whereas it is inversely related to the k-value in KNN. Further research can develop the KNN method for the early detection of diabetes patients using symptoms and clinical data.

Figure 1. The Methodology of This Research

TP (True Positive): Both the actual and predicted outcomes are positive.
TN (True Negative): Both the actual and predicted outcomes are negative.
FP (False Positive): The prediction is positive, but the actual outcome is negative.
FN (False Negative): The prediction is negative, but the actual outcome is positive.

Figure 2. Accuracy Value

The k and k-fold values play an important role in the performance of the KNN model. A small k-value can cause the model to overfit because it becomes sensitive to noise: the model captures small details in the training data, including noise. A small k-value also leads to very high variability, because the model depends entirely on one or a few nearest neighbors, so its decisions change with small changes in the training data. A large k-value, in contrast, can cause underfitting. Because decisions are based on many neighbors, the model works too generally, cannot capture finer structures in the data, and overlooks essential patterns. The advantage of a large k-value is that the more general model is more stable and handles noise better.

The number of folds also influences the estimated performance of the KNN model. A small number of folds tends to give high bias and low variance: each fold holds more data, so less data remains for training in each iteration and there is less variation among the training and test splits, which can bias the performance estimate. Validation is also faster with few folds because fewer iterations are needed, but the results are less stable. Many folds reduce this bias: each model is trained on more of the data, which gives a more accurate estimate of model performance. Many folds also increase the variance of the estimate and lengthen the validation process, because more iterations are required, although the classification results are regarded as more stable and reliable. Analyzing the performance of each combination of k and k-fold values to determine the optimal values aims to avoid overfitting and underfitting, so that the selected k-value produces an accurate and reliable model.
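The analysis of each combination of k and k-fold values can be organized as a small grid evaluation. The sketch below is a simplified illustration under stated assumptions: the data is synthetic and cleanly separated, the function names are hypothetical, and folds are formed by interleaving indices rather than by the randomized procedure an evaluation tool would apply. It shows the shape of the procedure, not the study's results.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training items nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(data, k, n_folds):
    """Accuracy of k-NN under k-fold cross-validation with interleaved folds."""
    correct = 0
    for fold in range(n_folds):
        # Item i belongs to fold i % n_folds.
        test = data[fold::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)


# Synthetic, well-separated data: 10 "no" points and 10 "yes" points.
data = ([((i * 0.1, 0.0), "no") for i in range(10)]
        + [((i * 0.1 + 2.0, 1.0), "yes") for i in range(10)])

# Evaluate every combination of neighbor count and fold count.
results = {(k, folds): cross_val_accuracy(data, k, folds)
           for k in (1, 3, 5) for folds in (5, 10)}

for (k, folds), accuracy in sorted(results.items()):
    print(f"k={k} folds={folds} accuracy={accuracy:.2f}")
```

On this toy data every configuration scores 1.00 because the two classes do not overlap; on real data the table of results would show how accuracy varies with both settings.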

Figure 3. The Results of Precision and Recall Data Testing

TABLE III OPTIMAL VALUE FROM THE RESULTS OF KNN PERFORMANCE EVALUATION