Diabetes mellitus type 2: Exploratory data analysis based on clinical reading

Abstract
Diabetes mellitus type 2 (DMT2) is a severe and complex health problem and the most common type of diabetes. It is a chronic metabolic disorder that affects the way the body metabolizes sugar: the body either resists the effects of insulin or does not produce sufficient insulin to maintain normal glucose levels. DMT2 requires a multifactorial approach to control that includes lifestyle change and pharmacotherapy. Less than ideal management increases the risk of developing complications and comorbidities such as cardiovascular disease, and carries numerous social and economic costs. That is why the studies dedicated to the pathophysiological mechanisms and the treatment of DMT2 are extremely numerous and diverse. However, there is still a lack of multivariate statistical studies using DMT2 data sets to assess different aspects of the problem, such as optimal rapid monitoring of the patients or specific separation of patients into patterns of similarity related to their health status, which could help in the preparation of databases for DMT2 patients. In this study, exploratory data analysis approaches are applied to clinical and anthropometric readings of patients with DMT2. Since multivariate statistics is a well-known method for classification, modeling and interpretation of large collections of data, the major aim of the present study was to reveal latent relations between the objects of the investigation (group of patients and control group) and the variables describing the objects (clinical and anthropometric parameters). In the proposed method, the application of hierarchical cluster analysis and principal component analysis makes it possible to identify a reduced number of parameters which appear to be the most significant discriminant parameters to distinguish between four patterns of patients with DMT2. The outcome of the study could be of use for the selection of significant tests for rapid monitoring of patients and for a more detailed approach to the health status of DMT2 patients.


Introduction
Diabetes mellitus represents a heterogeneous group of metabolic diseases characterized by hyperglycemia resulting from inadequate insulin secretion or impaired insulin effects [1]. More than 90% of all diabetic patients are diagnosed with diabetes mellitus type 2 (DMT2) [2]. Insulin resistance in muscle, liver and fat cells and relative insulin deficiency due to beta-cell dysfunction are paramount for the development of DMT2; however, other pathophysiological mechanisms including impaired incretin effects, increased glucagon secretion, increased kidney glucose reabsorption and brain insulin resistance might also be of great importance [3]. DMT2 is a chronic disease associated with multiple complications, reduced quality of life, premature mortality and a large economic burden, most directly affecting patients in low- and middle-income countries [4,5]. According to a large multinational study, the prevalence of micro- and macrovascular complications in patients with DMT2 was 53.5% and 27.2%, respectively, while 38.4% of the diabetic individuals suffered from diabetic neuropathy [6]. DMT2 and its complications were negatively associated with patients' quality of life in regard to occupation, family and sexual life as well as future perspectives [7]. The economic costs of diabetes increased by 26% from 2012 to 2017 in parallel with the increased prevalence of diabetes and the increased cost per person with diabetes [8,9].
Recently, chemometric and machine learning methods have been successfully applied to analyze metabolites and to reveal potential biomarkers in the clinical diagnosis of DMT2. Free fatty acids were identified as potential biomarkers of DMT2 by using heuristic evolving latent projections, selective ion analysis and competitive adaptive reweighted sampling [10]. Fisher linear discriminant analysis, support vector machine and decision tree algorithms were employed to analyze elements in blood samples, and it has been demonstrated that the levels of chromium and iron can serve as a valuable tool for diagnosing DMT2 [11]. The support vector machine approach was also applied to element concentrations in urine and hair samples to distinguish between DMT1 and DMT2 [12]. By using orthogonal partial least-squares discriminant analysis for metabolomic analysis of human serum samples, it was revealed that metabolomic fingerprints can identify potential biomarkers of red meat consumption and can be related to the risk of development of DMT2 [13]. The elemental analysis of diabetic toenails and a large variety of machine learning algorithms were combined for the non-invasive diagnosis of DMT2, and it was found that the levels of aluminum, cesium, nickel, vanadium and zinc in toenails can serve as an indicator for the presence or absence of DMT2 [14].
The high incidence and costs as well as the predicted increase of diabetes prevalence in the future [15] justify the efforts to apply new multidisciplinary approaches in diabetes research. Therefore, the present study aims to reveal hidden patterns and subgroups of diabetic patients through hierarchical cluster analysis and principal component analysis (PCA). The major goals of the present study are as follows:
• proper classification of DMT2 patients and members of the control group;
• reduction of the number of variables for optimization of the monitoring process of DMT2 patients;
• detection of patterns of similarity within the class of DMT2 patients; and
• determination of discriminant parameters for each identified pattern of patients.

In Table S1, the clinical parameters measured and their code names are indicated. The following parameters were taken into account:
• anthropometric data: age, duration of disease, weight, height, body mass index (BMI), calculated according to the well-known formula weight (kg)/height² (m²), and waist and hip circumference;
• blood tests for thrombocytes, thrombocyte volume to total volume and erythrocyte sedimentation rate;
• liver and muscle functional tests: alanine aminotransferase (ALAT), gamma glutamyl transferase (GGT), albumin and creatinine phosphokinase (CPK);
• kidney function data: creatinine and uric acid;
• protein profile: total protein; lipid profile: high-density lipoproteins (HDLs), low-density lipoproteins (LDLs), very low-density lipoproteins (VLDLs), cholesterol and triglycerides;
• electrolyte content: potassium ions and sodium ions;
• glucose levels (for patients checked in and checked out during a session of medical treatment in a hospital): hemoglobin A1c (HbA1c), fasting glucose, postprandial glucose, before sleep glucose and mean value of glucose.
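As a small worked illustration of the BMI formula quoted in the parameter list, the following sketch computes the index for a single made-up patient. The function name and the example values are assumptions for illustration only and are not taken from the study data.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI in kg/m^2 from weight in kilograms and height in metres."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical patient (80 kg, 1.80 m), not a record from the study.
example_bmi = body_mass_index(weight_kg=80.0, height_m=1.80)
print(round(example_bmi, 1))  # 24.7
```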

Exploratory data analysis methods
In order to search for specific relationships between the clinical parameters or between the DMT2 patients, the following methods were used for data interpretation. The input matrix has dimensions (120 × 35).

Hierarchical cluster analysis
Hierarchical cluster analysis is used to detect groups of similarity (clusters) between the objects of interest (DMT2 patients and control group) or between variables (clinical parameters) by agglomerative hierarchical clustering. The major steps in the analysis include data normalization (z-standardization) to eliminate differences in the variables' dimensions, squared Euclidean distance as the similarity measure, Ward's method of linkage, Sneath's test of cluster significance and a dendrogram plot as the graphical output [16].
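The clustering steps listed above can be sketched in Python with NumPy and SciPy. This is a minimal illustration on synthetic data, not the STATISTICA workflow used in the study: the group means, the five made-up variables and the random seed are assumptions, SciPy's Ward linkage operates on Euclidean distances between rows (so merge heights are on a different scale than in a squared-Euclidean dendrogram), and Sneath's significance test is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Synthetic stand-in for the data matrix: 20 "controls" and 100 "patients"
# described by 5 made-up variables (not the study data).
controls = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
patients = rng.normal(loc=6.0, scale=1.0, size=(100, 5))
X = np.vstack([controls, patients])

# z-standardization: every variable gets mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Agglomerative clustering of the objects (rows) with Ward's linkage.
tree = linkage(Z, method="ward")

# Cut the tree into two clusters and report the cluster sizes.
labels = fcluster(tree, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])
```

To cluster the variables instead of the objects, the transposed matrix `Z.T` can be passed to `linkage`; `scipy.cluster.hierarchy.dendrogram(tree)` draws the corresponding plot when matplotlib is available.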

PCA
PCA reduces the number of variables describing the system of objects by projecting them onto the directions of highest variance. New variables are introduced, and the coordinates of the existing variable space are replaced by new ones. These new coordinates are the so-called latent factors or principal components (PCs). Their correct interpretation is the main task since they carry specific information about new types of relationships within the original data set. Two sets of output results are considered: factor scores, giving the new coordinates of the factor space with the location of the objects, and factor loadings, informing on the relationship between the variables. Only statistically significant loadings (>0.70) are taken into account for the interpretation.
The new PCs selected for interpretation should explain a significant portion of the total variance of the system. Usually, the first principal component (PC1) explains the maximal part of the system variation, and each additional PC contributes to the variance explanation but with less significance.
A reliable interpretation normally requires enough PCs to explain over 75% of the total variation. Often, the Varimax rotated PCA solution is applied, which allows a better explanation of the system since it strengthens the role of the latent factors with higher impact on the variation explanation and diminishes the role of PCs with lower impact [17].
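The PCA steps described above (standardization, explained variance, the 75% retention rule, the 0.70 loading threshold and Varimax rotation) can be sketched with NumPy alone. The data are synthetic, and the helper names `pca_loadings` and `varimax` are hypothetical; the rotation follows the commonly used SVD-based iteration and is offered as an illustration under these assumptions, not as the routine implemented in STATISTICA.

```python
import numpy as np


def pca_loadings(X):
    """PCA on the correlation matrix: explained-variance ratios, loadings, scores."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]            # largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    loadings = eigvec * np.sqrt(np.clip(eigval, 0.0, None))
    return eigval / eigval.sum(), loadings, Z @ eigvec


def varimax(L, max_iter=100, tol=1e-8):
    """Orthogonal Varimax rotation of a loading matrix (variables x factors)."""
    p, k = L.shape
    R = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        LR = L @ R
        grad = L.T @ (LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(grad)
        R = U @ Vt
        crit = s.sum()
        if crit_old != 0.0 and crit / crit_old < 1.0 + tol:
            break
        crit_old = crit
    return L @ R


# Synthetic data: 120 objects, 6 variables driven by two made-up latent factors.
rng = np.random.default_rng(1)
n = 120
f1, f2 = rng.normal(size=(2, n))
X = np.column_stack([f + 0.3 * rng.normal(size=n) for f in (f1, f1, f1, f2, f2, f2)])

ratio, loadings, scores = pca_loadings(X)
# Smallest number of PCs whose cumulative explained variance exceeds 75%.
n_keep = int(np.searchsorted(np.cumsum(ratio), 0.75, side="right")) + 1
rotated = varimax(loadings[:, :n_keep])
significant = np.abs(rotated) > 0.70            # loadings used for interpretation
print(n_keep, np.round(ratio[:n_keep], 3))
```

Because the rotation is orthogonal, the communality of each variable (the row sum of squared loadings) is the same before and after rotation; only the distribution of the loadings over the retained factors changes.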

Partial least squares discriminant analysis (PLS-DA)
PLS-DA is a supervised linear classification method that combines the properties of partial least squares (PLS) regression with the discrimination power of a discriminant method. The PLS regression algorithm identifies latent variables with a maximum covariance with the classes [18]. The classes are coded into a dummy matrix Y, which represents the membership of each sample in binary form. The PLS2 model is then calibrated on the Y matrix [19], and the probability that a sample belongs to a specific class can be calculated on the basis of the predicted responses [20]. Thus, each modeled class can be described by a classification function reporting the coefficients that determine the linear combination of the original variables to define the classification score. Before the PLS-DA calculation, data were autoscaled. All calculations were performed with the software package STATISTICA 8.0 [21], except for the classification PLS-DA models, which were calculated with the MATLAB toolbox [22].
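The idea of regressing a binary class code on autoscaled variables can be sketched with a small NIPALS implementation in NumPy. This is a simplification under stated assumptions: for two classes a single 0/1 response column (PLS1) is used instead of the dummy matrix Y and PLS2 model described above, the class assignment uses a plain 0.5 threshold on the predicted response rather than a probability estimate, the data are synthetic, and the helper names `pls1_fit` and `pls1_predict` are hypothetical and unrelated to the MATLAB toolbox.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """NIPALS PLS1 on autoscaled X and centred y; returns scaling and coefficients."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_mean = y.mean()
    E = (X - x_mean) / x_std                    # autoscaled predictors
    f = y - y_mean                              # centred class code
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)               # weight vector (max covariance with y)
        t = E @ w                               # latent variable scores
        tt = t @ t
        p = E.T @ t / tt                        # X loadings
        qa = f @ t / tt                         # y loading
        E = E - np.outer(t, p)                  # deflate X
        f = f - qa * t                          # deflate y
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)      # coefficients in autoscaled units
    return x_mean, x_std, y_mean, coef


def pls1_predict(model, X):
    x_mean, x_std, y_mean, coef = model
    return ((X - x_mean) / x_std) @ coef + y_mean


# Synthetic data: 20 "controls" (class 0) and 100 "patients" (class 1), 6 made-up variables.
rng = np.random.default_rng(4)
controls = rng.normal(size=(20, 6))
patients = rng.normal(size=(100, 6))
patients[:, :3] += 4.0                          # three variables differ between classes
X = np.vstack([controls, patients])
y = np.r_[np.zeros(20), np.ones(100)]

model = pls1_fit(X, y, n_components=2)
coef = model[3]
predicted_class = (pls1_predict(model, X) > 0.5).astype(int)
accuracy = float((predicted_class == y).mean())
print(accuracy, np.round(coef, 3))
```

The reported accuracy is a fit on the training objects only; any real assessment would need cross-validation or an external test set. The variables with the largest absolute coefficients play the role of the discriminant indicators.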
Ethical approval: All procedures were carried out in accordance with the principles of ethics of the Declaration of Helsinki. The Institutional Ethics Committee approved the use of anonymous patients' data for the study goals.

Basic statistics
In Table 1, the basic statistical data for all 120 objects are given.
Detailed statistical data about patients and control group indicate that there are no significant differences (p > 0.05) between the averages of the two groups for indicators such as age, thrombocytes, ALAT, total protein, albumin, K, Na and the different cholesterol determinations. Statistically significant differences (p < 0.05) are observed for BMI and the parameters related to it (weight, waist and hip), GGT, CPK, creatinine, uric acid, triglycerides, erythrocyte sedimentation rate (ESR) and all glucose tests. All this could be attributed to factors related to the disease, not to demographic reasons.
The correlation analysis carried out for the DMT2 group shows that statistically significant correlations (p < 0.05) are found as follows:
• Age: all parameters related to BMI (anthropometric indicators), creatinine, uric acid, thrombocytes, Na and HbA1c;
• Duration: weight, ESR, ALAT, creatinine and uric acid;
• Weight: all BMI indicators and Na;
• Height: waist, ESR, ALAT and Na;
• BMI: waist, hip, HDL and HbA1c;
• Waist: hip, HDL and fasting glucose;
• Thrombocytes: ESR and total protein;
• ESR: creatinine, albumin and HDL;
• ALAT: GGT, total protein, albumin and LDL;
• GGT: uric acid, HDL, triglycerides, K, Na, postprandial glucose and mean glucose 1;
• Creatinine: uric acid and total protein;
• Uric acid: VLDL and triglycerides;
• Total protein: albumin, LDL and cholesterol;
• Albumin: HDL, VLDL, HbA1c, before sleep glucose and mean glucose 2;
• HDL, LDL and VLDL: cholesterol and triglycerides;
• Cholesterol: triglycerides, Na, fasting glucose 1, mean glucose 1, fasting glucose 2 and mean glucose 2;
• Triglycerides: Na, HbA1c, postprandial glucose and mean glucose 1;
• K: before sleep glucose 2 and mean glucose 2;
• Na: HbA1c.
For the control group, very few significant correlations are observed.
It could be concluded that several groups of mutually correlated clinical parameters for DMT2 patients are registered as follows: • glucose test parameters; • BMI and related anthropometric indicators; • cholesterol indicators and triglycerides; and • creatinine, uric acid and proteins. This is important preliminary information about links between the clinical and anthropometric indicators, which could be of help for the multivariate statistical data analysis.
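A pairwise correlation screening of this kind can be sketched with SciPy's `pearsonr`, keeping only the pairs with p < 0.05. The three variables below are synthetic and their names are used only as labels; the choice of the Pearson coefficient is an assumption, since the type of correlation coefficient is not stated in the text.

```python
from itertools import combinations

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n = 100  # size of the DMT2 group

# Synthetic stand-ins (not the study data): waist is built to follow weight,
# Na is generated independently of both.
weight = rng.normal(85.0, 12.0, n)
data = {
    "weight": weight,
    "waist": 0.9 * weight + rng.normal(0.0, 4.0, n),
    "Na": rng.normal(140.0, 3.0, n),
}

significant = []
for a, b in combinations(data, 2):
    r, p = pearsonr(data[a], data[b])
    if p < 0.05:  # significance level used in the text
        significant.append((a, b, round(float(r), 2)))

print(significant)
```

With 35 variables there are 595 such pairs, so some correlations are expected to pass the 0.05 level by chance; the sketch applies no multiple-testing correction.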

Classification approach to separate controls from DMT2 patients
The goal is to find specific descriptors able to distinguish between the control group and the group of DMT2 patients. The first approach used was hierarchical cluster analysis. In Figure 1, the hierarchical dendrogram for clustering of all 120 objects (20 of the control group and 100 DMT2 patients) is presented. The separation even in this simple way of grouping is satisfactory: all members of the control group are included in one single cluster. Only one DMT2 patient is wrongly classified as a member of the control group. The most statistically significant descriptors for the separation are all glucose test values, all anthropometric indicators, ESR, GGT, CPK and cholesterol.
The same data set was treated by the PLS-DA approach. In Figure 2, the separation of the objects into two very different classes (N: control group; P: patients with DMT2) is indicated. The discriminant indicators are found to be as follows: postprandial glucose test, almost all anthropometric parameters, age and, surprisingly, potassium.
It is shown that reliable classification and separation of the control group from the DMT2 patients' group are possible by the use of all clinical and anthropometric parameters used. In the next steps of the multivariate statistical analysis, an effort will be made to reduce the number of parameters used to optimize the monitoring of the patients.

Hierarchical cluster analysis of clinical parameters
In Figures 3 and 4, the hierarchical dendrograms for clustering of the clinical parameters (normalized inputs, squared Euclidean distances as similarity measure) are presented for the control group (Figure 3) and for the DMT2 group of 100 patients (Figure 4). For the control group, four major clusters are formed: the first one includes all glucose tests (except for HbA1c); the second one, dominantly the different cholesterol indicators and enzyme function indicators; the third one, protein and albumin levels along with HbA1c; and the last one, blood parameters, anthropometric indicators and BMI, electrolytes, cholesterol and triglycerides. It could be assumed that in the control group the clustering of the tested parameters is distributed mainly with respect to the body systems and functions involved: enzyme control and cholesterol deposition, protein exchange, blood status and metabolic syndrome assessment.
In Figure 4, the same type of dendrogram for the group of 100 patients with DMT2 is given.
The separation of the 34 clinical parameters (parameter "sex" is eliminated from the analysis since preliminary studies have shown that there is no specific division between male and female patients) leads to the formation of the following five clusters:
• K1 (glucose indicator cluster): all glucose tests including the HbA1c test, which is considered one of the most important indicators for long-term glycemic control;
• K2 (anthropometric indicator cluster): weight, waist, BMI and hip;
• K3 (cholesterol cluster): thrombocytes, ESR, HDL, LDL, cholesterol, VLDL and triglycerides; this cluster indicates the link between blood quality indicators and cholesterol deposition indicators;
• K4 (enzyme cluster): ALAT, gamma-glutamyl transferase (GGT), albumin, total protein, height and K; linkage between enzyme indicators and protein indicators, in which the link to K seems to be interesting;
• K5 (renal function cluster): age, Na, duration, creatinine, uric acid and CPK; logical link between age and duration of the disease as well as between the indicators for the renal function.
The clustering of the clinical parameters for the DMT2 patient group indicates the specificity of the assessment of the patients with respect to the impact of the disease on the different organs and systems of the human body. It is possible to separate a reduced set of indicators (one or two from each identified cluster) to perform a rapid assessment of the health status of the DMT2 patients.

PCA
In order to complete and confirm the results of the hierarchical clustering of the clinical parameters, PCA was additionally carried out. It could help with the interpretation of the data structure both for the control group and for patients with DMT2. Varimax rotation mode was used for both data sets (Table S2, SI). Eight latent factors explain over 80% of the total variance of the system. In general, the significant grouping of clinical parameters resembles that of hierarchical clustering: all glucose tests have high factor loadings in PC1, but HbA1c does not belong to this first principal component (see PC4); the anthropometric parameters weight, height and hip are correlated (high factor loadings in PC2). A more detailed comparison is not strictly correct, since PCA deals with eight latent factors whereas the number of clusters identified in the hierarchical clustering is four. However, PCA gives the opportunity to reduce the number of variables by selecting representative variables from each latent factor (if needed for further interpretation). PC1 (16.6% of explained variance) could be conditionally named "hyperglycemic factor", incorporating all glucose parameters with high factor loadings.
This most significant "hyperglycemic factor" (16.6%) describes the increased glucose concentrations. It might reflect the inability of beta cells to produce sufficient insulin to maintain normoglycemia, which is the cornerstone of DMT2 development [3]. PC2 (11.9% of the total variance) indicates the close links between the anthropometric parameters and could be conditionally named "anthropometric or obesity factor". The "obesity" factor (11.9%) emphasizes the important interrelations between visceral fat mass and carbohydrate dysregulation. Visceral obesity could negatively influence glycemic control by increasing insulin resistance and gluconeogenesis [23]. Therefore, the acquisition of healthy eating patterns and the maintenance of normal body weight are among the fundamental aspects of the DMT2 treatment plan [24].
High factor loadings for almost all cholesterol indicators are found in PC3 (7.7% explanation of the total variance). It is conditionally named "lipid factor".
The "lipid factor" might explain an additional 7.7% of the group variation. The atherogenic lipid profile of DMT2 patients is characterized by specific alterations including hypertriglyceridemia, decreased HDL-cholesterol levels and a preponderance of smaller, denser LDL cholesterol particles despite normal LDL cholesterol blood levels [25]. Insulin resistance might determine not only the development of hyperglycemia but also the progress of lipid abnormalities. The increased secretion of free fatty acids from the adipose tissue as well as their decreased utilization in the skeletal muscles due to insulin resistance might enhance their efflux to the liver, leading to impaired triglyceride metabolism [26]. The correction of lipid abnormalities in diabetic patients might decrease the risk for macrovascular complications and reduce the cardiovascular morbidity and mortality [27]. PC4 indicates good correlation between thrombocytes and ESR (7.4% explained variance) and the conditional name given is "inflammatory factor".
The bidirectional interrelations between the DMT2 and inflammation might explain 7.4% of the variations among diabetic patients and controls. Obesity and insulin resistance are often associated with an increased expression of various pro-inflammatory adipocytokines that might contribute to the maintenance of chronic low-grade systemic inflammation. The inflammatory response could facilitate the development of DMT2 by aggravating the insulin resistance and hyperglycemia, thus creating a vicious circle [28,29]. Since metabolic dysregulation itself maintains inflammation, the adequate treatment of the DMT2, obesity and dyslipidemia might reduce inflammation by improving the metabolic parameters [30]. The use of specific anti-inflammatory agents for reduction of insulin resistance is a matter of further research. PC5 shows high loadings for age, duration of the disease (very logical link), uric acid and creatinine (explained variance of 7.3%). It indicates the impact of the disease on the kidneys and could be conditionally named "renal function factor".
Age and renal function are important determinants of the intragroup variation (7.3%). Ageing is related to specific difficulties in diabetes care because of the pronounced heterogeneity in the health status of older adults, differences in patients' life expectancy, the presence of comorbidities, the increased risk of hypoglycemia and the inability to transfer automatically the results from anti-diabetic studies conducted on younger patients to older ones [42]. The care for patients with renal impairment faces similar problems apart from the specifically limited therapeutic options [31].
The sixth latent factor explains additional 7.1% of the total variance. Its conditional name could be "liver function factor" as it demonstrates correlation between ALAT and GGT. Additionally, it offers a specific link between the enzyme indicators with potassium that could not be explained outside the context of unreported dietary habits or concomitant treatment.
The liver function is another crucial factor that might explain an additional 7.1% of the total variance. Hepatocytes are main regulators of glucose homeostasis through the processes of glycogen storage, glycogenolysis and gluconeogenesis. Thus, hyperglycemic states are often found in patients with hepatic diseases [32]. However, the treatment of diabetes mellitus in patients with liver disorders might be a challenge because of the increased prevalence of concomitant malnutrition, alcohol abuse and increased risk of hypoglycemia, as well as possible side effects of oral antidiabetic drugs metabolized in the liver [33].
PC7 (6.5% explanation of the total variance) is a conditional "protein factor" since it reveals high factor loadings for total protein and albumin.
The last involved latent component PC8 (5.1% explanation of the total variance) indicates the specific role of CPK in the assessment of the health status of the patients.
The importance of the other two latent factors, protein and CPK levels, is probably associated with the influence of concomitant conditions and/or medications. The described traits emphasize the need for personalized complex care for individuals at increased risk of hyperglycemia, including treatment of concomitant obesity, dyslipidemia and subclinical inflammation, considering their age and their renal and liver functions.
Since PCA is a traditional method for space reduction, a further goal of the study was to select representative variables from the strongly correlated (high factor loadings) parameters of each identified latent factor. Considering the important influence of the postprandial glucose load, obesity, inflammation, and renal and liver functions on the health status of the patients with DMT2, a restricted set of main indicators was chosen: postprandial glucose 1, BMI, cholesterol, thrombocytes, creatinine, uric acid, GGT and K. These indicators represent each latent factor and account for the role of the glucose tests, anthropometric indicators, liver function, renal function, inflammation markers, lipid profile and electrolytes in the effective assessment of the health status of DMT2 patients.
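One simple way to shortlist a representative variable per latent factor is to look at the loadings above the 0.70 threshold and take the largest one in each column. The sketch below does this on a small hypothetical loading matrix; the numbers are invented for illustration and are not the values of Table S2, and the selection in the study also rested on clinical considerations that such a rule does not capture.

```python
import numpy as np

variables = ["postprandial glucose", "fasting glucose", "BMI", "waist", "cholesterol", "LDL"]

# Hypothetical rotated loadings (rows: variables, columns: PC1..PC3).
loadings = np.array([
    [0.91, 0.10, 0.05],
    [0.84, 0.12, 0.08],
    [0.15, 0.88, 0.10],
    [0.11, 0.79, 0.20],
    [0.07, 0.14, 0.90],
    [0.09, 0.05, 0.76],
])

threshold = 0.70  # loadings above this value are treated as significant
representatives = {}
for j in range(loadings.shape[1]):
    column = np.abs(loadings[:, j])
    strong = [variables[i] for i in np.flatnonzero(column > threshold)]
    representatives[f"PC{j + 1}"] = {
        "strong": strong,                            # all significant variables
        "pick": variables[int(column.argmax())],     # the single highest loading
    }

for factor, info in representatives.items():
    print(factor, info["pick"], info["strong"])
```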
In Figure 5, the hierarchical dendrogram for all 120 objects of the study (controls and DMT2 patients) is presented.
It is seen that the separation between both classes of objects is achieved. The only minor exception is that two patients with DMT2 are wrongly attributed to the control group. This is statistically completely acceptable.
One of the general objectives of the present study is to divide the DMT2 patients into groups of similarity (clusters) using a limited number of important parameters. This classification could be of use for specific observation of the health status of the different patients and, additionally, could support the identification of symptoms of diseases and complications accompanying DMT2.
In Figure 6, the hierarchical dendrogram for clustering of 100 DMT2 patients using 8 significant clinical and anthropometric indicators is shown. Four significant clusters are formed. The members of each cluster are as follows: Cluster 1 (25 members), Cluster 2 (31 members), Cluster 3 (38 members) and Cluster 4 (6 members). It could be assumed that each cluster represents patients with a specific health status pattern. In order to determine the major discriminants for each pattern of patients, the average values (as standardized values) for each parameter for each cluster were calculated. Figure 7 represents the results. Cluster 1 is characterized by the highest levels of GGT, creatinine, uric acid and K. At the same time, it indicates low BMI, cholesterol and glucose levels. It could be assumed that the 25 members of this cluster might suffer from microvascular complications, such as diabetic nephropathy, or they might have concomitant liver or renal diseases despite the relatively good glycemic control.
Cluster 2 includes DMT2 patients (quite significant number) with worsened DMT2 status indicated by the highest BMI and glucose level (the most significant discriminants for DMT2). There are no further indications for accompanying health problems.
Cluster 3 is the pattern of patients (largest number of members) with improved DMT2 status with no extreme values of the clinical and anthropometric indicators.
Cluster 4 involves a limited number of patients (only six). Although having low mean values for BMI, this pattern of patients still shows high glucose levels and, additionally, the highest cholesterol and thrombocyte levels and relatively high creatinine and uric acid levels. The DMT2 status requires significant improvement because of the increased risk of micro- and macrovascular complications and cardiovascular morbidity and mortality.
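The cluster profiles discussed above (average standardized values of each indicator within each cluster) can be computed along the following lines. The sketch uses synthetic data with 100 objects and 8 made-up indicators, SciPy's Ward linkage and a cut into at most four clusters; the shifted subgroups and the random seed are assumptions chosen only to give the clusters some structure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)

# Synthetic stand-in: 100 "patients" x 8 indicators (not the study data).
X = rng.normal(size=(100, 8))
X[:30, 0] += 3.0     # one subgroup with a raised first indicator
X[30:55, 1] += 3.0   # another subgroup with a raised second indicator

# Standardize, cluster the patients and cut the tree into at most four clusters.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
labels = fcluster(linkage(Z, method="ward"), t=4, criterion="maxclust")

# Cluster sizes and mean standardized value of every indicator per cluster.
cluster_ids, sizes = np.unique(labels, return_counts=True)
profiles = np.vstack([Z[labels == c].mean(axis=0) for c in cluster_ids])

for c, size, row in zip(cluster_ids, sizes, profiles):
    print(f"cluster {c} (n={size}):", np.round(row, 2))
```

Since the values are z-scores, a positive entry in a profile means that the cluster lies above the overall average of the patient group for that indicator and a negative entry means that it lies below; the indicators with the largest absolute entries act as the discriminants of the cluster.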
The present study succeeded in distributing the DMT2 patients into groups of similarity (patterns or phenotypes) using a reduced number of commonly tested parameters. Using the described statistical approach, four phenotypes of diabetic patients could be additionally interpreted.
Phenotype 1 was characterized by parameters suggesting impaired renal and liver functions. The BMI, cholesterol and glucose levels were relatively low, reflecting the complicated balance between optimal glycemic control, malnutrition and the avoidance of hypoglycemia [31,33]. Therefore, the therapeutic intensity and goals might be less stringent in this category. Phenotype 2 includes obese diabetic patients with poor glycemic control but no signs of concomitant health problems. The therapeutic approach in these patients should be focused on a more intense therapy plan including lifestyle changes, a healthy diet and antidiabetic drugs, with attention to optimal medication adherence. The maintenance of healthy weight as well as intensive glycemic control with the goal of achieving near-normoglycemia is crucial for the prevention of diabetic complications [34].
Phenotype 3 includes diabetic patients with optimal laboratory parameters and good control of the hyperglycemia. Thus, no therapy changes are needed in this group of individuals.
Phenotype 4 consists of a limited number of patients with unfavorable lipid and glucose values as well as increased uric acid levels despite the lack of obesity. Probably, these individuals belong to the group of so-called metabolically obese but normal weight patients, who are at increased cardiovascular risk due to increased insulin resistance, hyperglycemia and visceral fat deposition [35]. Moreover, normal-weight elderly people with metabolic disturbances have shown a higher risk of cardiovascular and all-cause mortality in comparison to obese individuals without metabolic disturbances [36]. Therefore, only the aggressive treatment of lipid abnormalities as well as persistent efforts to optimize glycemic control might preclude the development of macrovascular complications in that group of patients.

Conclusion
The application of exploratory data analysis to classify, model and interpret clinical data of DMT2 patients has many aspects: to predict DMT2 by classification methods among a large group of patients, to model the trajectories of the disease by interpretation of specific indicators, to identify metabolic and genetic biomarkers in patients with DMT2 and concomitant cardiovascular factors by chemometric approaches, to study diabetic complications, etc. [37-40].
In the present study, an effort is made to determine significant indicators out of all typical clinical and anthropometric data for DMT2 patients. The variable reduction offered (8 out of 35 variables) makes it possible to achieve the major goals of the study:
• To correctly separate the members of the control group of healthy volunteers from the patients with DMT2.
• To determine, by rapid tests, the specific health status of statistically significant patterns (clusters) of patients, which allows specific treatment and health care.
• To offer discriminant parameters for each identified specific pattern of DMT2 patients.
• To create a statistical basis for the personalized approach in the treatment of patients with DMT2 and concomitant diseases.
The present study has used intelligent data analysis to explain the variable traits of diabetic patients compared to the control group, which could reflect the differences in the pathophysiological mechanisms related to the disease. Moreover, as in other studies, different phenotypes of patients have been identified, which might require distinct therapeutic approaches and goals [41].
In conclusion, further efforts to differentiate distinct pathophysiological mechanisms and clinical subgroups through PCA might contribute to the development of a personalized approach in the management of diabetic patients.