A Novel Classification Indicator of Type 1 and Type 2 Diabetes in China

Because type 1 and type 2 diabetes are treated differently, classifying the type of diabetes is extremely important, especially for the diagnosis made by clinicians. In this study, we propose a novel scheme for calculating an indicator that classifies diabetes. The scheme contains two stages: the first is a feature extraction model, in which 17 features are automatically extracted from the glucose concentration curve acquired by a continuous glucose monitoring (CGM) system; the second is a diabetes parameter regression model based on an ensemble learning algorithm named double-Class AdaBoost. A total of 1050 glucose concentration curves of type 1 and type 2 diabetics were acquired at the Department of Endocrinology in People’s Hospital of Zhengzhou University, China, and an upper threshold μ was set to 7 mmol/L, 8 mmol/L, 9 mmol/L, 10 mmol/L, and 11 mmol/L respectively, according to the WHO guideline. The experiments show that the coincidence rate between our scheme and clinical diagnosis is 90.3%. The novel indicator extends the criteria for diagnosing types of diabetes and provides doctors with a scalar for classifying type 1 and type 2 diabetes.

2) Apply the feature extraction model to obtain 17 features from the training curves of glucose concentration.
3) Build and train a classifier using the 17 features based on variants of AdaBoost.
4) Verify the classifier using the testing curves of glucose concentration.
5) Evaluate the indicator of the scheme to classify diabetes.

Methods
CGM. CGM is used to examine how the blood glucose concentration reacts to insulin, exercise, food, and other factors, and it needs to be calibrated with traditional finger-stick measurements. A CGM acquires the glucose concentration of patients on a continuous basis (every five minutes).

Feature Extraction. Feature extraction 14 is based on the morphological characteristics of signals to obtain the intrinsic features. Features usually possess some physical significance and can be extracted from complicated multi-component signals such as a time series of glucose concentration. Hence, feature extraction takes a glucose concentration signal as input and gives the features as output. The feature extraction model is illustrated in Figure 1.
The first feature is the average blood glucose over the whole day; it can be calculated by Equation (1).
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1)$$
where $x_i$ is a discrete value of blood glucose concentration and $n$ is the number of $x_i$ in a day. The subsequent six features are also averages over different periods, namely the pre-meal average and post-meal average of the three meals. All of these averages can be calculated by Equation (1).
Standard Deviation of Blood Glucose (SDBE). This feature can be calculated by Equation (2).
Here $v_1$ and $v_2$ are arrays of glucose concentration, each with 288 values for one day; they are the glucose concentrations of the same diabetic on different days.
Area Under the Curve of glucose concentration (AUC). The AUC indicates the area enclosed by the glucose concentration-time curve and a threshold (the upper threshold or the lower threshold). It is calculated as two areas: the area between the curve and the upper threshold that lies under the curve, and the area between the curve and the lower threshold that lies over the curve.
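The daily average, the standard deviation, and the two threshold areas described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `basic_features`, the lower threshold of 3.9 mmol/L, the use of the sample standard deviation, and the rectangle-rule area with a 5-minute step are all assumptions made for the example.

```python
from statistics import mean, stdev


def basic_features(curve, upper=7.0, lower=3.9, step_min=5.0):
    """Return a few summary features of one glucose curve (values in mmol/L).

    Assumptions (not from the paper): `lower` defaults to 3.9 mmol/L, the
    standard deviation is the sample SD, and areas use a rectangle rule with
    one reading every `step_min` minutes.
    """
    x = [float(v) for v in curve]
    # Area enclosed between the curve and the upper threshold (curve above it).
    auc_upper = sum(max(v - upper, 0.0) for v in x) * step_min
    # Area enclosed between the curve and the lower threshold (curve below it).
    auc_lower = sum(max(lower - v, 0.0) for v in x) * step_min
    return {
        "mean": mean(x),         # Equation (1): average over the period
        "sd": stdev(x),          # sample standard deviation (assumed form)
        "auc_upper": auc_upper,  # units: mmol/L * min
        "auc_lower": auc_lower,  # units: mmol/L * min
    }


# One day of CGM data at 5-minute intervals gives 288 readings.
day = [8.0] * 144 + [3.0] * 144
features = basic_features(day, upper=7.0)
```

The pre-meal and post-meal averages would use the same mean on the corresponding slices of the curve; how those periods are delimited is not specified here.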
Mean amplitude of plasma glucose excursions (MAGE). This feature has been studied in many papers 15. The MAGE can be calculated as follows:
Step 1. Get all extreme points in the signal;
Step 2. Find the first valid extreme point whose absolute differences from both adjacent extreme points are greater than the Standard Deviation of the ...
In our research, a dataset was built to store the features, as shown in Table 1.
Age is usually an important factor related to diabetes 16; it carries heavy weight in classifying the type of diabetes and would therefore cause under-fitting. Some other factors were not involved, such as exercise, food, and insulin or oral medicines, which are difficult to quantify because they come from different manufacturers and are difficult to homogenize. Furthermore, the main purpose of our research is to provide an easy and accessible approach to diabetes diagnosis.
AdaBoost. Boosting methods are iterative algorithms 17. AdaBoost is a boosting method that unites simple "weak" classifiers to generate a generalized model. It was proposed by Freund and Schapire for binary classification 18, and later various AdaBoost variants such as Real AdaBoost were proposed 19. AdaBoost and its variants have contributed to various real-world applications, such as face detection 20 and human detection 21. In our research, the variants Real AdaBoost 19, Gentle AdaBoost 22, and Modest AdaBoost 23 were applied to the diabetes parameter regression model of our scheme.
Diabetes parameter regression based on AdaBoost. Let $s = \{(g_1, y_1), (g_2, y_2), \ldots, (g_m, y_m)\}$ be a set of training samples with initial weights $D_1(g_i) = 1/m$, where $m$ is the number of training samples. Each $g_i$ is a vector of 17 features extracted from the CGM curve of glucose concentration, and each $y_i$ is the label of $g_i$. In our research, the DM classification is a binary classification, so we assume that the label $y_i$ equals 1 if the sample belongs to type 1 diabetes and −1 if the sample belongs to type 2 diabetes.
Step 4: Update the data weights $D_t$ according to the error to obtain the new weights $D_{t+1}$, where $z_t$ is a normalization factor.
Output: the final classifier.
c. Set the output of $h$ on each $G_j$ and calculate the normalization factor.
The second variant of AdaBoost is Modest AdaBoost, whose complete steps can be found in paper 23. Gentle AdaBoost is the most efficient boosting algorithm and has been used in cascade object detection 24. In each epoch, Gentle AdaBoost performs a weighted least-squares regression; that is, the regression function $h_t(g)$ is fit by weighted least squares of $y_i$ to $g_i$.
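To make the Gentle AdaBoost loop concrete, the following is a minimal sketch that starts from uniform weights 1/m, fits each round's regression function by weighted least squares, and renormalizes the weights after every update. It is an illustration under assumptions rather than the authors' implementation: the weak learner is a one-feature regression stump, the function names (`fit_stump`, `gentle_adaboost`, `indicator`, `classify`) are hypothetical, and treating the signed sum of the weak outputs as the scalar indicator is our assumption.

```python
import math


def fit_stump(X, y, w):
    """Weighted least-squares regression stump: (feature, threshold, left, right)."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [(yi, wi) for row, yi, wi in zip(X, y, w) if row[j] <= thr]
            right = [(yi, wi) for row, yi, wi in zip(X, y, w) if row[j] > thr]
            # The weighted mean of y on each side minimizes weighted squared error.
            a = sum(yi * wi for yi, wi in left) / sum(wi for _, wi in left)
            b = sum(yi * wi for yi, wi in right) / sum(wi for _, wi in right)
            err = (sum(wi * (yi - a) ** 2 for yi, wi in left)
                   + sum(wi * (yi - b) ** 2 for yi, wi in right))
            if err < best_err:
                best, best_err = (j, thr, a, b), err
    return best


def stump_predict(stump, row):
    j, thr, a, b = stump
    return a if row[j] <= thr else b


def gentle_adaboost(X, y, rounds=100):
    """Train on feature vectors X with labels y in {+1, -1}; return the stumps."""
    m = len(X)
    w = [1.0 / m] * m  # initial weights D_1(g_i) = 1/m
    stumps = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        if stump is None:  # no feature can be split
            break
        stumps.append(stump)
        # Reweight: samples the stump gets wrong gain weight.
        w = [wi * math.exp(-yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        z = sum(w)  # normalization factor
        w = [wi / z for wi in w]
    return stumps


def indicator(stumps, row):
    """Signed score: the sum of all weak regression outputs (assumed indicator)."""
    return sum(stump_predict(s, row) for s in stumps)


def classify(stumps, row):
    """+1 for type 1 diabetes, -1 for type 2, following the label convention."""
    return 1 if indicator(stumps, row) >= 0 else -1


# Toy data with a single feature; real inputs would be 17-feature vectors.
X = [[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]]
y = [1, 1, 1, -1, -1, -1]
model = gentle_adaboost(X, y, rounds=10)
predictions = [classify(model, row) for row in X]
```

The default of 100 rounds mirrors the 100 training iterations used in the experiments. Real AdaBoost and Modest AdaBoost differ in how the weak output and the weight update are computed and are not covered by this sketch.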

Model Evaluation.
In order to evaluate classification results, the present study applied two performance indicators: ACC (accuracy) and MCC (Matthews correlation coefficient). P and N represent the positive class and negative class respectively. T and F denote True and False respectively, as described in Table 2.
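The ACC is given by the formula
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
The MCC is given by the formula
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$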
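As a worked illustration of the two indicators, the sketch below computes ACC and MCC from the four confusion-matrix counts using their standard definitions: ACC is the fraction of correct predictions, and MCC is (TP·TN − FP·FN) divided by the square root of (TP+FP)(TP+FN)(TN+FP)(TN+FN). The function names are hypothetical, type 1 diabetes (label 1) is taken as the positive class by assumption, and returning 0 when the MCC denominator is zero is a common convention rather than something stated in the text.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem with the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


def acc(tp, tn, fp, fn):
    """Accuracy: fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1]
counts = confusion_counts(y_true, y_pred)  # TP=2, TN=2, FP=1, FN=1
accuracy = acc(*counts)
correlation = mcc(*counts)
```

Unlike ACC, MCC uses all four counts symmetrically, so it stays informative when one class is much more frequent than the other.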

Experiment and analyses.
To demonstrate the performance of the proposed indicator, 300 of the 1050 samples were used as the training set to construct a diabetes classifier, while the other 750 were used as the testing set to evaluate the classifier. In addition, 7 mmol/L, 8 mmol/L, 9 mmol/L, 10 mmol/L and 11 mmol/L were each used as the upper threshold of the glucose target range in the process of feature extraction from the glucose concentration curves monitored by CGM. The Committee Report of diabetes experts of the WHO diagnoses DM with a fasting blood glucose concentration between 6.1 mmol/L and 6.9 mmol/L and a plasma glucose of 11.1 mmol/L 2 hours post glucose-load (2 h PPG). This is the reason why 7 mmol/L, 8 mmol/L, 9 mmol/L, 10 mmol/L and 11 mmol/L were selected as the upper threshold when extracting the 17 features from the glucose signal.
The Real AdaBoost, Modest AdaBoost and Gentle AdaBoost models were applied to calculate the indicator for classifying diabetes, and the error rates are presented in Table 3. The error rate of Modest AdaBoost is 0.0970 when the upper threshold is set at 7 mmol/L and 8 mmol/L, which means that the coincidence rate between our scheme and clinical diagnosis is 90.3%.
After training for 100 iterations, the three models of Real AdaBoost, Modest AdaBoost, and Gentle AdaBoost were used to calculate the indicator for classifying diabetes. The test misjudging rate between the indicator and clinical diagnosis is illustrated in Figure 2, where the upper thresholds of Figure 2 (a)-(e) were set at 7, 8, 9, 10 and 11 mmol/L respectively. It shows that when the upper threshold was set at 7 mmol/L and 8 mmol/L, the misjudging rates of the three models were lower, and the misjudging rate of Modest AdaBoost, depicted by the line with the mark '|', is 0.0970. Furthermore, when the upper threshold was set at 10 mmol/L, the three models performed worst in diabetes classification. When the upper threshold was set at 9 mmol/L or 11 mmol/L, the misjudging rate of Real AdaBoost varies, and its largest error is greater than 0.12; therefore 9 mmol/L and 11 mmol/L are not suitable as the upper threshold. The value of the upper threshold thus affects the results of diabetes classification.
5-fold cross-validation was used to further demonstrate the accuracy of our scheme and to find the best upper threshold. After training for 100 iterations, the indicator for classifying diabetes was calculated based on Real AdaBoost, Modest AdaBoost and Gentle AdaBoost, and the test misjudging rate between the indicator and clinical diagnosis is illustrated in Figure 3. It shows that when the threshold was set at 7 mmol/L or 8 mmol/L, the performance of our scheme is better, and only a few misjudging rates were above 0.1. This indicates that the coincidence rate between the indicator calculated by our scheme and clinical diagnosis is better at these thresholds and that the indicator is useful for doctors in diagnosing diabetes.
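The 5-fold cross-validation protocol can be sketched as an index split: the samples are divided into five disjoint folds, and each fold serves once as the test set while the remaining four form the training set. This is a minimal sketch under assumptions; the function name `k_fold_indices`, the random shuffle, and the fixed seed are hypothetical choices, since the text does not state how the folds were formed.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: samples are shuffled once with a fixed seed before splitting.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


# With the 1050 curves of this study, each fold holds 210 test samples.
splits = list(k_fold_indices(1050, k=5))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

The misjudging rate for one upper threshold would then be the test error averaged over the five splits, repeated for each threshold from 7 to 11 mmol/L.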
The performance of our scheme was evaluated with the threshold set at 7, 8, 9, 10 and 11 mmol/L respectively. The results, shown in Table 4, indicate that the performance of our scheme is better when the threshold is set at 7 mmol/L or 8 mmol/L.

Discussion
Because type 1 and type 2 DM differ in epidemiology, etiology, pathogenesis and treatment, how to treat diabetes effectively in the clinic is a knotty problem 25. For a doctor, the reasonable solution is to classify the type of diabetes and suit the remedy to the case, so that the diabetes can be kept under control. In fact, there are many clinical indicators for classifying diabetes, such as the results of the Oral Glucose Tolerance Test (OGTT), INS, C-Peptide, IAA, and ICA. These tests help provide guidelines for treating diabetes, but they are incomplete and cannot precisely reflect the heterogeneity of type 1 and type 2 diabetes. Moreover, some of the original symptoms of type 2 diabetes have emerged in patients with type 1 diabetes. At present, CGM can monitor the curve representing the fluctuation of glucose concentration in patients with type 1 and type 2 diabetes 9,10, and it is one of the most successful approaches to diabetes control. In addition, 17 features can be extracted from the glucose concentration curve 13. Those features cannot directly diagnose the type of DM, but we attempt to build a novel scheme that calculates an indicator for classifying DM by using those features.
We have constructed an effective scheme, which consists of feature extraction and classification. The experimental results show that when the upper threshold μ is correctly set, the misjudging rate of classification is as low as 0.097, which is the best performance of the scheme and corresponds to a coincidence rate between our scheme and clinical diagnosis of up to 90.3%. This experiment indicates that an indicator can be extracted from the glucose concentration curve acquired by CGM and that it is helpful for doctors in classifying diabetes. In addition, further work should be considered, such as how to improve the precision of classifying diabetes and how to set a novel penalty to rectify the weight of diabetes samples according to the sampling distribution ($D_t$) of diabetes in the process of iteration. Our scheme should also be checked for data imbalance problems 26,27.