Classification Model of Diabetic Mellitus

Diabetes mellitus, a life-threatening disease of our time, was studied in this paper with the aim of designing a model that detects the disease, so that people can determine their diabetes status on their own, at home, without consulting a physician. The model was developed using the Minimum Covariance Determinant (MCD) classifier and has a misclassification error of 1 percent.


INTRODUCTION
Diabetes is usually a lifelong (chronic) disease in which there is a high level of sugar in the blood.
Symptoms of high blood sugar include frequent urination, increased thirst and increased hunger. Diabetes arises either because the body does not produce enough insulin or because body cells do not respond properly to the insulin that is produced. Insulin is a hormone produced in the pancreas which enables body cells to use sugar (glucose) from the carbohydrate in food for energy, or to store glucose for future use. If the body cells do not absorb the glucose, it accumulates in the blood (hyperglycemia), leading to vascular, nerve and other complications. If left untreated, diabetes can cause many complications. Acute complications include diabetic ketoacidosis and nonketotic hyperosmolar coma [1]. Serious long-term complications include cardiovascular disease, stroke, kidney failure, foot ulcers and damage to the eyes [2].
There are three main types of diabetes mellitus (DM):
- Type 1 DM results from the body's failure to produce enough insulin. This form was previously referred to as "insulin-dependent diabetes mellitus" (IDDM) or "juvenile diabetes." The cause is unknown [2].
- Type 2 DM begins with insulin resistance, a condition in which cells fail to respond to insulin properly. As the disease progresses, a lack of insulin may also develop [3]. This form was previously referred to as "non-insulin-dependent diabetes mellitus" (NIDDM) or "adult-onset diabetes." The primary cause is excessive body weight and insufficient exercise [2].
- Gestational diabetes is the third main form and occurs when pregnant women without a previous history of diabetes develop a high blood glucose level [2].
Other forms of diabetes mellitus include congenital diabetes, which is due to genetic defects of insulin secretion, cystic fibrosis-related diabetes, steroid diabetes induced by high doses of glucocorticoids, and several forms of monogenic diabetes.
Prevention and treatment involve a healthy diet, physical exercise, not using tobacco and maintaining a normal body weight. Blood pressure control and proper foot care are also important for people with the disease. Type 1 diabetes must be managed with insulin injections [2]. Type 2 diabetes may be treated with medications, with or without insulin [4]. Insulin and some oral medications can cause low blood sugar [5]. Weight loss surgery is an effective measure for people with obesity who have type 2 DM [5].
Gestational diabetes usually resolves after the birth of the baby.

MINIMUM COVARIANCE DETERMINANT (MCD) ESTIMATOR
The MCD (Minimum Covariance Determinant) based LDF and QDF are given in [10]. The classifier, which is based on a robust version of the Lawley-Hotelling test, uses the spatial median estimators of [11] and the related scatter of [12] given by [13]. [10] used the re-weighted MCD estimator of multivariate location and scale because of its good statistical properties and because the FAST-MCD algorithm computes the estimates efficiently for large data sets. For the X sample, the MCD estimator is defined as the mean $\hat{\mu}_{x,0}$ and the covariance matrix $S_{x,0}$ of the $h_x$ observations (out of $n_x$) whose covariance matrix has the lowest determinant. The quantity $h_x$ should be larger than $\lfloor (n_x - p + 1)/2 \rfloor$, where $p$ is the number of variables, and $n_x - h_x$ should be at least as large as the number of outliers in the X sample, so that the $h_x$ retained observations can all be free of contamination. With $h_x = \lfloor (n_x + p + 1)/2 \rfloor$ the MCD attains its maximum breakdown value, $\lfloor (n_x - p + 1)/2 \rfloor / n_x \approx 50\%$. The breakdown value of an estimator is defined as the largest percentage of contamination it can withstand. If one suspects less than 25% contamination in the X sample, it is advised to use $h_x \approx 0.75\,n_x$, as this yields higher finite-sample efficiency. Based on the initial estimates $\hat{\mu}_{x,0}$ and $S_{x,0}$, one can compute for each observation $x_i$ its (preliminary) robust distance
\[ RD^0_{x_i} = \sqrt{(x_i - \hat{\mu}_{x,0})^\top S_{x,0}^{-1} (x_i - \hat{\mu}_{x,0})}. \]
The squared distance asymptotically follows a chi-squared distribution with $p$ degrees of freedom, and the 97.5th percentile of that distribution serves as the cut-off. The weight 1 is assigned to $x_i$ if $RD^0_{x_i} \le \sqrt{\chi^2_{p,0.975}}$, and the weight 0 otherwise. The re-weighted MCD estimator is then obtained as the mean $\hat{\mu}_{MCD}$ and the covariance matrix $\hat{\Sigma}_{MCD}$ of those observations with weight 1. This re-weighting step increases the finite-sample efficiency of the MCD estimator considerably, whereas the breakdown value remains the same. The robust distances can also be used to flag outliers.
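The definition and the re-weighting step can be illustrated on a tiny bivariate sample. The following is a minimal sketch, not the FAST-MCD algorithm of [10]: it finds the raw MCD subset by exhaustive search (feasible only for very small samples), assumes p = 2 so that the chi-squared cut-off has a closed form, and omits the consistency factor that production implementations apply to the raw covariance. The function names and the data values are hypothetical.

```python
import math
from itertools import combinations

def mean_cov(points):
    """Sample mean and 2x2 covariance (n-1 denominator) of bivariate points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def det(cov):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2

def squared_distance(p, mean, cov):
    """Squared robust (Mahalanobis-type) distance of p from (mean, cov)."""
    sxx, sxy, syy = cov
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det(cov)

# 97.5th percentile of chi-squared with p = 2 degrees of freedom.
# Closed form holds only for p = 2, where the CDF is 1 - exp(-x / 2).
CUTOFF_P2 = -2.0 * math.log(1 - 0.975)

def reweighted_mcd(points, h):
    """Raw MCD by exhaustive search over h-subsets, then one re-weighting step."""
    best = min(combinations(range(len(points)), h),
               key=lambda idx: det(mean_cov([points[i] for i in idx])[1]))
    mean0, cov0 = mean_cov([points[i] for i in best])
    weights = [1 if squared_distance(p, mean0, cov0) <= CUTOFF_P2 else 0
               for p in points]
    kept = [p for p, w in zip(points, weights) if w == 1]
    mean_rw, cov_rw = mean_cov(kept)
    return best, weights, mean_rw, cov_rw

# Hypothetical (FBS, NFBS) readings; the last point is a planted outlier.
sample = [(4.8, 6.1), (5.1, 6.9), (5.4, 6.3), (4.6, 7.0),
          (5.0, 6.5), (5.3, 7.2), (4.9, 6.7), (12.0, 6.6)]
best, weights, mean_rw, cov_rw = reweighted_mcd(sample, h=6)
print("raw MCD subset:", best)
print("weights:", weights)
```

Here h = 6 corresponds to the 0.75n choice mentioned above. For realistic sample sizes the exhaustive search is infeasible, which is why an approximate algorithm such as FAST-MCD is used in practice.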
The re-weighted MCD estimators of multivariate location and scale, $\hat{\mu}_{MCD}$ and $\hat{\Sigma}_{MCD}$ respectively, are used to replace the parameters of Fisher's linear discriminant function.
Here $c_x$ is the cost of misclassifying a unit of group X and $q_x$ is the prior probability that an observation belongs to population $\pi_x$. Similar definitions apply for $c_y$ and $q_y = 1 - q_x$. In practice, $q_x$, $q_y$, $c_x$ and $c_y$ are often not known, so the classification threshold is set to 0 throughout this work, under the assumption that the prior probabilities are equal and that the costs of misclassification are equal for the two populations ($c_x = c_y = 1$). Let $p^D_{y/x}$ denote the probability of misclassifying an object of population X into Y, and $p^D_{x/y}$ the probability of misclassifying an object of population Y into X.
The total cost of misclassification is then $q_x c_x\, p^D_{y/x} + q_y c_y\, p^D_{x/y}$.
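For reference, the standard textbook form of Fisher's linear discriminant rule with robust plug-in estimates can be written as below. This is a sketch under stated assumptions and not necessarily the exact expression used in [10]: it assumes a single pooled scatter estimate shared by both groups, and the group-specific symbols $\hat{\mu}_{x,MCD}$ and $\hat{\mu}_{y,MCD}$ are notation introduced here for the re-weighted MCD location estimates of the X and Y samples.

```latex
D(x) = \left(\hat{\mu}_{x,MCD} - \hat{\mu}_{y,MCD}\right)^{\top}
       \hat{\Sigma}_{MCD}^{-1}
       \left( x - \tfrac{1}{2}\left(\hat{\mu}_{x,MCD} + \hat{\mu}_{y,MCD}\right) \right)
```

With the threshold set to 0, an observation is assigned to X when $D(x) > 0$ and to Y otherwise. Expanding the product shows that $D(x)$ is linear in $x$, with a constant term equal to $-\tfrac{1}{2}(\hat{\mu}_{x,MCD} - \hat{\mu}_{y,MCD})^{\top} \hat{\Sigma}_{MCD}^{-1} (\hat{\mu}_{x,MCD} + \hat{\mu}_{y,MCD})$.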

Source of Data
The data for this analysis are secondary. The fasting and non-fasting blood sugar levels (FBS and NFBS respectively) of diabetic and non-diabetic people, together with their gender, were randomly selected from the cases reported at Amaku General Hospital, Awka.
FBS tests are run very early in the morning on an empty stomach, the patient having had dinner at the normal time, while NFBS tests are run at any time of the day.

Calculation of MCD Estimator
Extensive simulation studies show that the $h_x$ observations (out of $n_x$) whose covariance matrix has the lowest determinant are those with the smallest Mahalanobis distances [14]. The first step here was therefore to calculate the Mahalanobis distance of each data point. With the help of these distances we obtained, within a few iterations, the subset whose estimators attain their maximum breakdown value. More information on the calculation of the MCD estimators can be found in [14].
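One common way to turn this observation into an iteration is the concentration step used inside FAST-MCD: fit a mean and covariance to the current subset, rank all points by their distance from that fit, and keep the h closest as the next subset. The sketch below illustrates that idea for bivariate data; it is an assumption that this resembles the procedure of [14], and the function names, starting subset and data values are hypothetical.

```python
def mean_cov(points):
    """Sample mean and 2x2 covariance (n-1 denominator) of bivariate points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def det(cov):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2

def squared_distance(p, mean, cov):
    """Squared Mahalanobis distance of p from (mean, cov)."""
    sxx, sxy, syy = cov
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det(cov)

def c_step(points, subset, h):
    """One concentration step: keep the h points closest to the current fit."""
    mean, cov = mean_cov([points[i] for i in subset])
    ranked = sorted(range(len(points)),
                    key=lambda i: squared_distance(points[i], mean, cov))
    return sorted(ranked[:h])

def concentrate(points, start, h, max_iter=50):
    """Repeat concentration steps until the subset stops changing."""
    subset = sorted(start)
    dets = [det(mean_cov([points[i] for i in subset])[1])]
    for _ in range(max_iter):
        new_subset = c_step(points, subset, h)
        if new_subset == subset:
            break
        subset = new_subset
        dets.append(det(mean_cov([points[i] for i in subset])[1]))
    return subset, dets

# Hypothetical (FBS, NFBS) readings; index 7 is a planted outlier, and the
# starting subset deliberately includes it.
sample = [(4.8, 6.1), (5.1, 6.9), (5.4, 6.3), (4.6, 7.0),
          (5.0, 6.5), (5.3, 7.2), (4.9, 6.7), (12.0, 6.6)]
final_subset, dets = concentrate(sample, start=[2, 3, 4, 5, 6, 7], h=6)
print("final subset:", final_subset)
print("determinants per iteration:", dets)
```

Each concentration step cannot increase the determinant, so the sequence printed at the end is non-increasing. The result is only a local optimum, which is why FAST-MCD restarts from many initial subsets.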

RESULTS AND FINDINGS
When the MCD discriminant procedure was applied to real-life data on diabetic and non-diabetic people, we obtained the following model for classifying a patient's diabetic status:
\[ D^*(X) = 2.1629\,\mathrm{FBS} + 0.0005\,\mathrm{age} + 2.7402\,\mathrm{NFBS} - 33.8286 \]
A person is classified as diabetic if this function is positive, and as non-diabetic otherwise. The model has a probability of misclassification of 0.01; that is, out of 100 people studied, only one was misclassified. The model shows that age is not a strong factor in determining a person's diabetic status.
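The classification rule can be applied directly, as in the minimal sketch below. The coefficients are those reported above; the function names are hypothetical, and the example readings assume blood sugar measured in mmol/L and age in years, which the text does not state explicitly.

```python
def discriminant_score(fbs, age, nfbs):
    """Value of D*(X) for one person, using the reported coefficients."""
    return 2.1629 * fbs + 0.0005 * age + 2.7402 * nfbs - 33.8286

def classify(fbs, age, nfbs):
    """Diabetic if the score is positive, non-diabetic otherwise."""
    return "diabetic" if discriminant_score(fbs, age, nfbs) > 0 else "non-diabetic"

# Hypothetical readings (units assumed: mmol/L for FBS and NFBS, years for age).
print(classify(fbs=5.0, age=40, nfbs=7.0))
print(classify(fbs=8.0, age=40, nfbs=12.0))

# Changing age by 60 years shifts the score by only 0.0005 * 60 = 0.03.
age_shift = discriminant_score(5.0, 80, 7.0) - discriminant_score(5.0, 20, 7.0)
print(round(age_shift, 4))
```

The last lines make the remark about age concrete: across a 60-year age difference the score moves by 0.03, which is small next to the contribution of a one-unit change in FBS (2.1629) or NFBS (2.7402).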

CONCLUSION
With a predictive power of 99%, the model obtained above classifies patients into their diabetic status correctly in almost all cases. Physicians at Amaku General Hospital, Awka are therefore advised that they need not collect information on age when assessing a patient's status with regard to diabetes mellitus. In addition, a person who knows their FBS and NFBS levels can determine their diabetes status without consulting a physician. The model is optimal for classifying diabetes mellitus status.