Analyzing Diabetes Datasets using Data Mining

Abstract: Data mining techniques extract critical information in various domains, for example customer relationship management (CRM), human resources (HR) and geographic information systems (GIS), but most importantly in the medical domain. There, data mining can help to minimize the risk of developing common diseases such as cancer, heart disease and diabetes. In this paper, the authors focus on data from diabetic patients. A diabetic patient's body lacks the ability to manage the glucose level in the blood, which can affect other body mechanisms. This can disturb other physiological and psychological parameters, for example through reduced weight and skin folding. These parameters may be a valuable data source for research. Diabetes mellitus is placed 4th among non-communicable diseases (NCDs) and causes 1.5 million deaths worldwide each year [1]. The growth of digital information has raised numerous challenges, especially for automated content analysis and for the use of machine learning techniques to predict non-communicable diseases such as diabetes. In this research, different classification algorithms (Naïve Bayes, MLP, J48, ZeroR, Random Forest and Regression) were applied to the data. The research aims to extract knowledge from the given data set and to generate comprehensive and intelligent results.


I. INTRODUCTION
Non-communicable diseases (NCDs), which include stroke, heart disease, cancer, chronic lung disease and diabetes, are together responsible for almost 70% of deaths worldwide, and among them Type II diabetes mellitus is the most common of all [1]. The number of patients has quadrupled since 1980. It is estimated that 422 million people worldwide have diabetes, and this figure may double in the next 20 years [2]. The ten most affected countries are India, China, USA, Indonesia, Japan, Pakistan, Russia, Brazil, Italy and Bangladesh [3]. About seven million Pakistanis have been diagnosed with Type II diabetes mellitus, and it is estimated that the figure will reach 12 million by 2035 [4]. In this situation, it is necessary to look into the facts and the risk factors involved.
This paper aims to show how the information collected in different hospitals can be put to use, since hospital procedures include assessing the patient by taking a medical history before prescribing anything. Comparing different data mining algorithms on this information may yield diagnostic details of the disease.

II. BACKGROUND
The process of data mining allows patterns to be ascertained in the provided datasets. The data mining tool chosen for this research is WEKA. WEKA is widely used for data mining and contains well-known algorithms for data pre-processing, classification, regression, clustering, association rules and visualization. It is also suited to developing new machine learning schemes [2].
In this study, different classifiers were used, including Naïve Bayes, decision trees, regression techniques and neural networks, in order to obtain the best results.

III. MATERIALS AND METHODS
The dataset was taken from the Pima Indians Diabetes Database of the National Institute of Diabetes and Digestive and Kidney Diseases. It includes the records of 768 patients, of whom 500 tested negative and 268 tested positive [5].
The description of the dataset and its nine attributes in Table 1 helps us to understand how this disease might be predicted and which algorithm is more suitable for it.
In Table 1, the first eight attributes are used as inputs, and the ninth attribute is the result, which is used as the target and is either "Positive" or "Negative".
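The division of a record into eight inputs and one target can be sketched as follows. This is only an illustration: the comma-separated layout, the helper name split_record and the sample values are assumptions, and the attribute names are the abbreviations used for Figure 1.

```python
# Abbreviated names of the eight input attributes (as listed for Figure 1).
ATTRIBUTES = ["preg", "plas", "press", "skin", "insulin", "mass", "pedi", "age"]

def split_record(line):
    """Split one comma-separated record into eight numeric inputs and a target."""
    *raw_inputs, target = line.strip().split(",")
    if len(raw_inputs) != len(ATTRIBUTES):
        raise ValueError("expected eight input values followed by the class")
    inputs = dict(zip(ATTRIBUTES, map(float, raw_inputs)))
    return inputs, target

# Illustrative values only, not taken from the paper's tables.
inputs, target = split_record("2,120,70,30,80,28.5,0.45,33,Positive")
print(inputs["plas"], target)
```

Each classifier discussed later receives the eight inputs and is evaluated against the target.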

IV. GRAPHICAL REPRESENTATION OF ATTRIBUTES
Figure 1 is a graphical representation of the original test results, shown as positive (red) and negative (blue), for the different parameters (preg, plas, press, skin, insulin, mass, pedi, age, class).

V. CLASSIFICATION ALGORITHMS AND THEIR EVALUATION
Output Prediction
The results are based on a 90% percentage split. A comparison of the first two results of the different algorithms can be seen in Table 2, where "Actual" and "Predicted" represent the original and the predicted results respectively. In Tables 3-8, the column "Error" represents the prediction error.
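A percentage split can be sketched as below. This is a minimal illustration and not WEKA's own code: the shuffling seed, the rounding rule and the function name percentage_split are assumptions. With 768 records and this rounding, 691 records are used for training and 77 for testing.

```python
import random

def percentage_split(records, train_percent=90, seed=1):
    """Shuffle the records and split them into a training and a test part."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the training size is rounded to the nearest whole record.
    train_size = round(len(shuffled) * train_percent / 100)
    return shuffled[:train_size], shuffled[train_size:]

# 768 placeholder record ids stand in for the patient records.
train, test = percentage_split(range(768))
print(len(train), len(test))
```

Every algorithm is trained on the first part and its predictions are checked on the second, so all classifiers are compared on the same unseen records.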

A. Naïve Bayes
This algorithm is named after Thomas Bayes, who proved Bayes' theorem. Naïve Bayes is suitable in our situation because it estimates how likely a person is to be prone to diabetes. The algorithm works on probability distributions.
In Table 3, a value of 0.99 in the Error column means that there is a 99% chance that the instance tests negative, which is true, and a 1% chance that it tests positive.
A "+" means that the prediction came out untrue. In the second instance, the chance of testing negative is only 67%, compared with the 99% certainty of the first instance, and the prediction proved wrong. Since 0.67 is not close to 0.99, the algorithm can be given the benefit of the doubt as to whether to predict positive or negative.
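How such class probabilities arise can be sketched with a small Gaussian Naïve Bayes classifier. The sketch assumes that each numeric attribute follows a normal distribution within each class; the toy data and the function names are hypothetical and are not taken from the paper's tables.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-attribute mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # guard against zero variance
        model[label] = (len(members) / len(rows), stats)
    return model

def posterior(model, row):
    """Return P(class | row) for every class, normalised to sum to 1."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        log_p = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            log_p += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        log_scores[label] = log_p
    top = max(log_scores.values())
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

# Hypothetical toy data: (plasma glucose, body mass index) with the class.
rows = [(85, 26.6), (89, 28.1), (95, 25.0), (183, 33.3), (168, 38.0), (197, 30.5)]
labels = ["negative"] * 3 + ["positive"] * 3
model = fit_gaussian_nb(rows, labels)
probs = posterior(model, (90, 27.0))
print(probs)
```

The two probabilities always sum to 1, so a value near 1 for one class indicates a confident prediction and a value nearer 0.5 indicates an uncertain one.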

B. ZeroR
ZeroR is the simplest classification method. It relies only on the target and ignores all other attributes.
In Table 4, it always generates the same result for every instance, about 65% for test negative and about 35% (0.352) for test positive, which means that the output never changes from one instance to the next. This algorithm is useful when the contribution of the other parameters is less significant. Some of the initial predictions based on the test split data can be seen in Table 5.
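The behaviour of ZeroR can be sketched in a few lines. The sketch uses the class counts of the full dataset (500 negative, 268 positive); the function name zero_r is an assumption. It yields roughly 0.651 and 0.349, close to the values reported in Table 4; the exact figures depend on which records fall into the training part of the split.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the majority class and the class distribution of the training labels."""
    counts = Counter(train_labels)
    majority, _ = counts.most_common(1)[0]
    total = sum(counts.values())
    distribution = {label: n / total for label, n in counts.items()}
    return majority, distribution

# Class counts of the full dataset as stated in Section III.
labels = ["negative"] * 500 + ["positive"] * 268
majority, distribution = zero_r(labels)
print(majority, round(distribution["negative"], 3), round(distribution["positive"], 3))
```

Because the input attributes are never consulted, every test instance receives the same prediction and the same pair of probabilities.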

D. Random Forest
Random Forest generates many individual classification trees. To classify a new object from an input vector, the input vector is passed down each of the trees in the forest. Each tree produces its own classification, and the forest then selects a single class from these results, as shown in Figure 2 [7].
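The voting step can be sketched as follows. This is a deliberately simplified illustration and not WEKA's implementation: each "tree" is a single split on one randomly chosen attribute, trained on a bootstrap sample, and the toy data and function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, rng):
    """Fit a one-split tree on a bootstrap sample using one random attribute."""
    sample = [rng.randrange(len(rows)) for _ in rows]
    attr = rng.randrange(len(rows[0]))
    threshold = sum(rows[i][attr] for i in sample) / len(sample)
    overall = Counter(labels[i] for i in sample).most_common(1)[0][0]
    sides = {True: Counter(), False: Counter()}
    for i in sample:
        sides[rows[i][attr] <= threshold][labels[i]] += 1
    # An empty side falls back to the majority class of the sample.
    left = sides[True].most_common(1)[0][0] if sides[True] else overall
    right = sides[False].most_common(1)[0][0] if sides[False] else overall
    return lambda row: left if row[attr] <= threshold else right

def forest_predict(trees, row):
    """Pass the row down every tree and return the class with the most votes."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0], votes

# Hypothetical toy data: (plasma glucose, body mass index) with the class.
rows = [(85, 26.6), (89, 28.1), (95, 25.0), (183, 33.3), (168, 38.0), (197, 30.5)]
labels = ["negative"] * 3 + ["positive"] * 3
rng = random.Random(0)
trees = [train_stump(rows, labels, rng) for _ in range(25)]
prediction, votes = forest_predict(trees, (90, 27.0))
print(prediction, dict(votes))
```

Individual trees may disagree because each sees a different sample and attribute; the majority vote smooths out these differences.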

E. Multilayer Perceptron
It works on how the results of different attributes are processed and interact with one another, altering their results in such a way that the final outcome is the filtered version of each node (neuron). The multilayer perceptron offers great advantages, as it is used for pattern classification, recognition, prediction and approximation (see Table 7). Figure 3 shows a network of layers, namely an input layer, a hidden layer and an output layer, consisting of input nodes (green) or "neurons", output nodes (yellow) and hidden nodes (red), some of which are visible. The nodes in the network are all sigmoid. Each connection in the network carries a value that is passed on to other nodes; each node computes a weighted sum of its inputs and passes it on until the network generates a result. The size of the hidden layer depends on the complexity of the data [8].
MLP shows the results with the lowest error rate, but it is slow compared with the other algorithms.
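The weighted sum and sigmoid activation performed by each node can be sketched as a single forward pass. The layer sizes, weights and inputs below are hypothetical and chosen only for illustration; training, which adjusts the weights and accounts for most of the processing time, is not shown.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Each node takes a weighted sum of its inputs and applies the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
            for node_weights, bias in zip(weights, biases)]

def mlp_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Pass the inputs through the hidden layer and then the output layer."""
    hidden = layer_forward(inputs, hidden_w, hidden_b)
    return layer_forward(hidden, output_w, output_b)

# Hypothetical network: 3 inputs, 2 hidden nodes, 2 output nodes.
inputs = [0.5, 0.1, 0.9]
hidden_w = [[0.4, -0.6, 0.2], [-0.3, 0.8, 0.5]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7], [-1.2, 0.7]]
output_b = [0.0, 0.0]
outputs = mlp_forward(inputs, hidden_w, hidden_b, output_w, output_b)
print(outputs)
```

The output node with the larger value gives the predicted class.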

F. J48
J48 is an implementation of the C4.5 algorithm [9]. The J48 decision tree decides which attribute is the most decisive and which is the least, and these attributes are then further divided into subtrees. It generates a binary tree and, unlike the Random Forest decision tree, it uses the concept of entropy: the difference in entropy identifies the attribute that is best placed to make the decision.
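The entropy calculation behind such a split can be sketched as follows. The sketch computes plain information gain for a numeric threshold; C4.5 additionally normalises this quantity (gain ratio), which is omitted here. The toy glucose values, the threshold and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy before the split minus the weighted entropy after it."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

# Entropy of the class attribute for 500 negative and 268 positive records.
class_labels = ["negative"] * 500 + ["positive"] * 268
print(round(entropy(class_labels), 3))  # 0.933

# Hypothetical toy attribute that separates the two classes perfectly.
glucose = [85, 89, 95, 183, 168, 197]
toy_labels = ["negative"] * 3 + ["positive"] * 3
gain = information_gain(glucose, toy_labels, threshold=120)
print(gain)
```

The attribute and threshold with the largest reduction in entropy are placed highest in the tree, which is how the precedence of the attributes emerges.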

IX. CONCLUSION AND FUTURE SCOPE
To produce effective and efficient results, it is necessary to work with different algorithms and to establish which one suits best. Although many researchers have diagnosed diabetes by applying data mining tools to patients' medical records [10][11][12][13][14][15], the research demands deeper digging in terms of domain knowledge to obtain a more effective medical diagnosis.
In terms of performance, the multilayer perceptron was found to be the most effective, as it shows the fewest errors; however, it takes considerable processing time because it requires the weights of each node to be calculated. ZeroR is useful for determining a baseline performance for the other classification methods. Naïve Bayes is also very efficient, as it gives a predominant result after each validation, but its performance is not quite impressive. J48 gives a graphical view of the precedence of the attributes, as it calculates the priority of each attribute relative to the others; it also predicts accurate results with little error, although it requires time.
The objective of comparing the algorithms on the same dataset and of analyzing and predicting the results has been achieved. In future, the authors are interested in gathering information from their own neighborhood and are keen to obtain new results that lead toward more precise and accurate prediction. More parameters (such as thirst, fatigue and frequency of urination) can also be added for improvement.

Figure 1: Weka output (negative and positive outcomes with respect to different classes).

Figure 4: Graphical representation of accuracy over the different algorithms.

Figure 5: Graphical representation of the confusion matrix over the different algorithms.

Table 1: Datasets of Diabetic Patients
Test Positive = Red; Test Negative = Blue