Detection of diabetes mellitus using machine learning algorithms

The growth of technology has brought sophistication to our day-to-day activities, and with it many health issues. One of the most important of these, now a common problem, is diabetes mellitus. Diabetes mellitus has affected over 246 million people worldwide, a majority of them women. The WHO reports that by 2025 this number is expected to rise to over 380 million. Prevention is better than cure. Health care data is large and complex, and predicting the disease at an early stage may save lives and allow preventive measures to control it. In this paper, we take a heterogeneous dataset and analyze the various factors affecting this disease. The machine learning algorithms used in this paper help us identify the attributes that play a major role in the diagnosis of diabetes mellitus.


INTRODUCTION
A chronic increase of the glucose level in the blood is called diabetes mellitus (DM), commonly referred to as diabetes. It is caused by the inability of the body to produce the amount of insulin required for its own needs. This may be due to impaired secretion of insulin, impaired insulin action, or both. High blood sugar levels over a prolonged period lead to renal failure, loss of vision and damage to several other tissues. The incidence of diabetes is increasing because female diabetics are able to have children. The incidence of diabetes is also higher in persons above 40 years of age.
Females, especially married ones, are at a higher risk of getting this disease. Obesity, dietary factors and heredity are other contributory factors for diabetes. Alcoholic beverages increase appetite, encourage weight gain and, when taken in excess, damage the pancreas and thereby increase the risk of diabetes. In short, DM leads to several metabolic disorders in the body. DM can be classified into several types. There are two main clinical types, as Table 1 shows.

MACHINE LEARNING ALGORITHMS
The following are a few of the machine learning algorithms used.

Data Preparation
Data cleaning and transformation of data
The attributes shown in Figure 1 contain zero or null values, although these attributes cannot be zero in practice. These values have therefore been replaced by the mean of the corresponding column.
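The replacement step can be illustrated with a minimal sketch. It assumes that the column mean is computed over the valid (non-zero, non-null) entries only, which the text does not specify; the function name `impute_zeros_with_mean` and the toy records are hypothetical and are not taken from the data set used in this paper.

```python
from statistics import mean


def impute_zeros_with_mean(rows, columns):
    """Replace zero/None entries in `columns` with the column mean.

    Assumption: the mean is taken over the valid (non-zero, non-missing)
    values only, so the placeholder zeros do not drag the mean down.
    """
    cleaned = [dict(row) for row in rows]  # copy so the input stays untouched
    for col in columns:
        valid = [r[col] for r in rows if r[col] not in (0, None)]
        if not valid:
            continue  # no valid value to estimate a mean from
        col_mean = mean(valid)
        for r in cleaned:
            if r[col] in (0, None):
                r[col] = col_mean
    return cleaned


# Toy records with made-up values (not from the paper's data set).
records = [
    {"Glucose": 100, "BMI": 30.0},
    {"Glucose": 0, "BMI": 20.0},     # zero glucose is treated as missing
    {"Glucose": 140, "BMI": None},   # null BMI is treated as missing
]
cleaned = impute_zeros_with_mean(records, ["Glucose", "BMI"])
```

If the mean were instead computed over all entries, including the zeros, the imputed values would be biased downward, which is why the sketch excludes them.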

Removing outliers
The graph below (Figure 2) shows the distribution of the different attributes in the data set. Studying the graph carefully, we can see that insulin and skin thickness have outliers.
The outliers are removed using the IQR (interquartile range) method.
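A minimal sketch of IQR-based outlier removal is given below. The multiplier k = 1.5 is the conventional choice and is an assumption here, since the text does not state the value used; the quartile interpolation method, the helper names `iqr_bounds` and `remove_outliers`, and the toy values are likewise illustrative only.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR.

    Assumption: k = 1.5 and the 'inclusive' quartile method; other
    conventions give slightly different fences.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies within the fences."""
    low, high = iqr_bounds([r[column] for r in rows], k)
    return [r for r in rows if low <= r[column] <= high]


# Toy insulin readings with one extreme value (made-up numbers).
rows = [{"Insulin": v} for v in [10, 12, 14, 16, 18, 20, 22, 24, 500]]
bounds = iqr_bounds([r["Insulin"] for r in rows])
filtered = remove_outliers(rows, "Insulin")
```

In this toy example the quartiles are 14 and 22, so the fences are 2 and 34, and only the reading of 500 is dropped.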
The algorithms used are the decision tree algorithm and the KNN algorithm. The decision tree model (Figure 3) was applied to the training dataset. The depth of the tree is 4 and the total number of nodes is 21. With the help of the decision tree, we are able to select the features that play an important role in the detection of diabetes. Figures 4 and 5 show the important features and their ranking.
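To illustrate how a decision tree can rank features, the sketch below scores each feature by the largest reduction in Gini impurity achievable with a single threshold split. This is an assumption made for illustration: the text does not state which splitting criterion or library was used, and the sketch covers only one split level, not the full depth-4 tree. The names `gini`, `best_split` and `rank_features`, and the toy values, are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, impurity_decrease) for the best split `x <= threshold`."""
    n = len(labels)
    parent = gini(labels)
    best = (None, 0.0)
    candidates = sorted(set(values))
    # Try the midpoint between each pair of consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best[1]:
            best = (t, gain)
    return best


def rank_features(columns, labels):
    """Sort features by their best single-split impurity decrease (descending)."""
    gains = {name: best_split(vals, labels)[1] for name, vals in columns.items()}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Toy data with made-up values: glucose separates the classes, blood pressure barely does.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "BloodPressure": [70, 80, 70, 80, 70, 80],
    "Glucose": [85, 90, 95, 150, 160, 170],
}
ranking = rank_features(columns, labels)
```

A full tree repeats this search recursively on each child node until the depth limit is reached, and a feature's importance is then accumulated over all the nodes that split on it.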

Comparison between Decision tree and KNN
A Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve, which plots the true positive rate against the false positive rate, gives a good result (Figure 6). The area under the curve is 0.72 for the decision tree, whereas it is 0.67 for KNN. The dashed diagonal line represents the ROC curve of a random predictor and serves as a baseline to check whether the model is useful. The confusion matrix is tabulated for both methods (Figures 7 and 8).
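The quantities used in this comparison can be sketched as follows. The code sweeps the decision threshold over the predicted scores to obtain the ROC points, integrates them with the trapezoidal rule, and counts the confusion matrix entries at a fixed threshold. The function names, the 0.5 threshold and the toy scores are assumptions for illustration and do not reproduce the results reported here.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs from sweeping the threshold over all distinct scores.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def confusion_matrix(y_true, y_pred):
    """Return the counts as a tuple (tn, fp, fn, tp)."""
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    return tn, fp, fn, tp


# Toy labels and predicted scores (made-up values).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auc(roc_points(y_true, scores))
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off
matrix = confusion_matrix(y_true, y_pred)
```

A random predictor yields points along the diagonal and an area of 0.5, so an area of 0.72 or 0.67 indicates a model that is better than chance, with the larger value indicating better separation of the two classes.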

CONCLUSIONS
The decision tree model achieved 76% accuracy. After considering various options to improve the accuracy, we were able to achieve the desired accuracy by removing outliers, categorizing the data and limiting the tree depth to 4. During this process, only a few of the eight attributes were found to play an important role: glucose, BMI, pregnancies, age and insulin. Skin thickness, diabetes pedigree function and blood pressure had a negligible effect. Hence we conclude that, of the two models compared, the decision tree performs better than KNN.