Identification of Traditional Chinese Medicine Constitutions and Physiological Indexes Risk Factors in Metabolic Syndrome: A Data Mining Approach

Objective To find predictive indexes for metabolic syndrome (MS), a data mining method was used to identify significant physiological indexes and traditional Chinese medicine (TCM) constitutions. Methods Annual health check-up data from 2014 to 2016, including physical examination data, biochemical tests, and Constitution in Chinese Medicine Questionnaire (CCMQ) measurements, were screened according to the inclusion and exclusion criteria. A predictive matrix was established from the longitudinal data of three consecutive years. The TreeNet machine learning algorithm was applied to build a prediction model and to uncover the dependence relationships between physiological indexes, TCM constitutions, and MS. Results In model testing, the overall accuracy rate of the TreeNet prediction model was 73.23%. The top 12.31% of individuals in the test group (n=325), ranked by predicted probability of having MS, covered 23.68% of MS patients, corresponding to 0.92 times more risk of having MS than the general population (a lift of about 1.92). The 15 most important variables were listed in descending order of importance. The top 5 variables in MS prediction were the TBIL difference between 2014 and 2015 (D_TBIL), TBIL in 2014 (TBIL 2014), the LDL-C difference between 2014 and 2015 (D_LDL-C), the CCMQ score for balanced constitution in 2015 (balanced constitution 2015), and TCH in 2015 (TCH 2015). When D_TBIL was between 0 and 2, TBIL 2014 was between 10 and 15, D_LDL-C was above 19, balanced constitution 2015 was below 60, or TCH 2015 was above 5.7, the incidence of MS was higher. Furthermore, there were interactions between the balanced constitution 2015 score and TBIL 2014 or D_LDL-C in MS prediction. Conclusion Balanced constitution, TBIL, LDL-C, and TCH level can act as predictors for MS. The combination of TCM constitution and physiological indexes can give early warning of MS.


Introduction
Metabolic syndrome (MS) is a cluster of metabolic abnormalities characterized by central obesity, hypertension, hyperglycemia, and dyslipidemia [1]. The prevalence of MS is increasing rapidly worldwide [2,3]. In China, the overall standardized prevalence of MS in adults is reported to be 24.2% and is increasing year by year with rapid economic growth [4]. According to the International Collaborative Study of Cardiovascular Disease in Asia (InterASIA), the age-standardized prevalence of MS was 13.7% among adults aged 35-74 years in China between 2000 and 2001 [5]. Based on 2010 China Noncommunicable Disease Surveillance data assessed using the National Cholesterol Education Program Adult Treatment Panel III (NCEP ATP III) criteria, the prevalence of MS among participants aged ≥ 18 years was 33.9% [6].
MS is associated with an increased risk of diseases such as cardiovascular disease (CVD), type 2 diabetes mellitus (DM), and cancer [7,8]. Mottillo et al. conducted a meta-analysis of 87 studies and found that metabolic syndrome was associated with an increased risk of CVD, CVD mortality, all-cause mortality, and myocardial infarction. Even without diabetes, MS patients maintained a high cardiovascular risk [9]. Therefore, the rapid diagnosis and prevention of MS are of great significance for the prevention of CVD and type 2 DM.
Exclusion criteria: ① aged < 18 or > 80 in 2016; ② the physiological examination, the personal information questionnaire, or the CCMQ was not completed in any of the three years; ③ pregnant or breastfeeding, or had medication records within three months; ④ diagnosed with any other chronic disease in any of the three years. The longitudinal matrix (n=1625): in 2016, 166 subjects were diagnosed with MS; the other 1,459 subjects were healthy.
There are many studies on the etiology and influencing factors of MS [10,11]. Research has shown that both innate and acquired factors are involved in the occurrence and development of MS [12]. This is consistent with the understanding of disease occurrence and development in TCM constitution theory. In the constitution theory of TCM, constitution is described as an integrated, metastable, and natural characteristic of body form, physiological functions, and psychological conditions, formed on the basis of innate and acquired endowments in the course of human life [13]. In other words, constitution is a dynamic characteristic shaped by both nature and environment, and it keeps developing as a person grows [14]. Constitutions in TCM are generally classified based on physiological status (e.g., blood flow, pulse, and heartbeat) and physical status (e.g., facial appearance and body figure), as well as clinical characteristics [15,16]. Some chronic diseases have been reported to be closely related to biased constitutions [17-19]. Therefore, we hypothesized that an early trend toward imbalance of the balanced constitution and the formation of a biased constitution, together with physiological indexes, can predict the occurrence of metabolic syndrome.
Previous epidemiological studies of MS usually use multiple linear regression, logistic regression, or Cox regression models to screen risk factors. However, these methods are limited by strict requirements on data type and distribution and by multicollinearity problems.
Data mining is the process of uncovering patterns, classifications, and relationships in large datasets using methods at the intersection of machine learning, statistics, and database systems [20]. The data mining process enables data owners to better understand the dependencies between the attributes of the data samples and predict the corresponding subsystem behavior.
TreeNet is an advance in data mining proposed by Friedman [21] at Stanford University. It builds a model from hundreds of small trees, each of which depicts a small portion of the overall model. The model prediction is obtained by adding up all individual contributions [22]. TreeNet is a machine learning approach that is efficient for regression problems. It is fast, data driven, robust to outliers, and invariant to monotone transformations of variables. Therefore, TreeNet models inherit almost all of the advantages of tree-based models while overcoming their primary disadvantages [23]. In this study, it was applied to identify TCM constitutions and physiological indexes that act as predictors for MS using two health datasets, providing support for the establishment of an early warning system for MS.

Data Sources.
Data from three consecutive years were used for model building. Qualified data were screened from the health data of Hangzhou Haiqin Sanatorium and Shanghai Jambo Health Management Center. The data filtering process is shown in Figure 1.
Constitution classification was as follows: the CCMQ was used to investigate the constitution types of subjects. The assessment contains 5 aspects of measurement, including physical characteristics, psychological characteristics, reaction state, tendency to diseases, and adaptability. A total of 60 items were measured to classify a person into one or more of nine constitution types: balanced constitution (8 items), qi-deficient constitution (8 items), yang-deficient constitution (7 items), yin-deficient constitution (8 items), phlegm-dampness constitution (8 items), damp-heat constitution (6 items), stagnant blood constitution (7 items), stagnant qi constitution (7 items), and inherited special constitution (7 items).
MS identification was as follows: MS was identified according to the criteria set by the Chinese Diabetes Society (CDS): (i) overweight and/or obesity: BMI ≥ 25.0 kg/m²; (ii) hyperglycemia: FPG ≥ 6.1 mmol/L and/or 2hPG ≥ 7.8 mmol/L and/or treatment of previously diagnosed type 2 diabetes; (iii) hypertension: SBP ≥ 140 mmHg and/or DBP ≥ 90 mmHg and/or treatment of previously diagnosed hypertension; (iv) dyslipidemia: TG ≥ 1.7 mmol/L and/or low HDL-C (< 0.9 mmol/L for men, < 1.0 mmol/L for women). MS was diagnosed if any 3 or all 4 of the above conditions were met.
All the included indicators were analyzed, and the diagnostic results were confirmed by two or more doctors.
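The CDS rule described above can be written as a short check. The following is a minimal sketch only: the `Exam` record and its field names are hypothetical and do not come from the study's data dictionary, and a missing 2hPG value is simply treated as not meeting that sub-criterion.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Exam:
    """One subject's check-up values (field names are illustrative assumptions)."""
    sex: str                            # "M" or "F"
    bmi: float                          # kg/m^2
    fpg: float                          # fasting plasma glucose, mmol/L
    sbp: float                          # systolic blood pressure, mmHg
    dbp: float                          # diastolic blood pressure, mmHg
    tg: float                           # triglycerides, mmol/L
    hdl_c: float                        # HDL cholesterol, mmol/L
    two_h_pg: Optional[float] = None    # 2-hour plasma glucose, mmol/L
    treated_t2dm: bool = False          # treated, previously diagnosed type 2 diabetes
    treated_hypertension: bool = False  # treated, previously diagnosed hypertension


def cds_components(e: Exam) -> Dict[str, bool]:
    """Evaluate the four CDS components (i)-(iv) for one subject."""
    hdl_cutoff = 0.9 if e.sex == "M" else 1.0
    high_2hpg = e.two_h_pg is not None and e.two_h_pg >= 7.8
    return {
        "overweight": e.bmi >= 25.0,
        "hyperglycemia": e.fpg >= 6.1 or high_2hpg or e.treated_t2dm,
        "hypertension": e.sbp >= 140 or e.dbp >= 90 or e.treated_hypertension,
        "dyslipidemia": e.tg >= 1.7 or e.hdl_c < hdl_cutoff,
    }


def has_ms(e: Exam) -> bool:
    """MS is flagged when at least 3 of the 4 components are met."""
    return sum(cds_components(e).values()) >= 3
```

For example, a man with BMI 26.0, FPG 6.5 mmol/L, and blood pressure 145/80 mmHg meets three components and would be flagged, whereas the same values with blood pressure 130/80 mmHg meet only two. In the study itself, the diagnosis was additionally confirmed by doctors.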

Data Cleaning.
(i) The physical examination code was used as the identification number of each subject. (ii) Health data before 2014 were removed to avoid interference. (iii) Units and formats of data from different sources were standardized. (iv) Converted scores were computed for each constitution type according to the scoring criteria of the CCMQ and used for analysis. (v) Target status was as follows: subjects who were diagnosed with MS in 2016 were labeled 1, and healthy subjects were labeled 0.
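Step (iv) can be illustrated with a small sketch. The text does not state the scoring formula, so the code assumes the commonly published CCMQ rule, in which each item is rated from 1 to 5 and the converted score is (raw score − number of items) / (number of items × 4) × 100. It also assumes that any reverse-scored items have been recoded beforehand. The balanced-constitution rule (balanced ≥ 60 and all biased types < 30) is the grading criterion quoted in the Discussion. The function names and dictionary keys are hypothetical.

```python
from typing import Dict, Sequence


def converted_score(item_scores: Sequence[int]) -> float:
    """Convert raw item ratings (assumed 1-5 each) to a 0-100 scale.

    Assumed formula: (raw - n_items) / (n_items * 4) * 100.
    """
    n_items = len(item_scores)
    if n_items == 0:
        raise ValueError("at least one item score is required")
    if any(s < 1 or s > 5 for s in item_scores):
        raise ValueError("item scores are assumed to lie between 1 and 5")
    raw = sum(item_scores)
    return (raw - n_items) / (n_items * 4) * 100


def is_balanced(scores: Dict[str, float]) -> bool:
    """Balanced if the balanced score is >= 60 and every biased score is < 30."""
    biased = [v for k, v in scores.items() if k != "balanced"]
    return scores["balanced"] >= 60 and all(v < 30 for v in biased)
```

Under this assumed formula, eight items all rated 3 give a converted score of 50.0, which falls below the 60-point threshold for balanced constitution.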

Procedure of TreeNet.
In this study, the TreeNet models were constructed using TreeNet software by Salford Systems.
Model construction was as follows: the model began with a small tree grown on the original target, and the residuals of this tree were computed. A second tree was then built to predict the residuals from the first tree. Next, residuals were computed from the new two-tree model, and a third tree was grown to predict the revised residuals. This process was repeated to obtain a sequence of trees. Finally, all individual contributions were added up.
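The residual-fitting loop described above can be sketched in a few lines. This is not the Salford Systems TreeNet implementation. It is a minimal illustration under simplifying assumptions: a single predictor, two-leaf trees (stumps), squared-error residuals, and a shrinkage factor (`learning_rate`) that is an illustrative addition not mentioned in the text. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left leaf value, right leaf value)


def fit_stump(x: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Fit a two-leaf tree to the residuals by minimizing squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # drop the maximum so the right leaf is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean residual everywhere
        mean = sum(residuals) / len(residuals)
        return (x[0], mean, mean)
    return best[1:]


def boost(x: Sequence[float], y: Sequence[float],
          n_trees: int = 100, learning_rate: float = 0.1):
    """Grow trees one after another, each on the residuals of the current model."""
    pred = [0.0] * len(y)  # the first tree therefore sees the original target
    stumps: List[Stump] = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, left_value, right_value = fit_stump(x, residuals)
        stumps.append((t, left_value, right_value))
        pred = [pi + learning_rate * (left_value if xi <= t else right_value)
                for xi, pi in zip(x, pred)]
    return learning_rate, stumps


def predict(model, xi: float) -> float:
    """Add up the contributions of all trees."""
    learning_rate, stumps = model
    return learning_rate * sum(lv if xi <= t else rv for t, lv, rv in stumps)
```

On a toy series such as `x = [1, 2, 3, 4, 5, 6]` with `y = [0, 0, 0, 1, 1, 1]`, the summed prediction approaches 0 for small `x` and 1 for large `x` as trees are added, which is the additive behavior the paragraph describes.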

Model Testing.
The dataset was randomly divided into two groups (a training group and a test group). The prediction model was developed on the training group, which consisted of 1300 cases (80% of the entire dataset). Model validation was performed on the test group, which consisted of the remaining 20% of cases (325 cases).
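A random 80/20 split of this kind can be sketched as follows. The seed and the helper name are hypothetical, and the sketch does not stratify by MS status because the text does not say whether the original split did.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(ids: Sequence[int], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle subject identifiers and hold out a fraction as the test group."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = split_train_test(range(1625))
print(len(train_ids), len(test_ids))
```

With 1625 subjects this yields 1300 training cases and 325 test cases, matching the group sizes reported above.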

Accuracy of TreeNet Algorithms.
Values of the area under the receiver operating characteristic (ROC) curve (AUC) were calculated to evaluate accuracy of the TreeNet model. AUC value for TreeNet was 0.694.
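For readers unfamiliar with the measure, the AUC equals the probability that a randomly chosen MS case receives a higher predicted score than a randomly chosen non-case, with ties counted as one half. The sketch below computes it directly from that definition on toy values. The data are illustrative only and are not taken from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 (case, non-case) pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported value of 0.694 indicates moderate discrimination.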

TreeNet Model Testing and Assessment.
The confusion matrix for the TreeNet algorithm is shown in Table 2, indicating that the model has a certain degree of predictive ability. Model validation was performed using the test group, which consisted of a random 20% of the data. In the test group, 287 individuals were diagnosed without MS, of whom 219 were correctly predicted by the model, an accuracy rate of 76.31%. In addition, 19 of the 38 patients diagnosed with MS were correctly predicted by the model (50.00%). The average of the two class accuracy rates was 63.15%, and the overall accuracy rate was 73.23% (Table 1).
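The reported rates can be reproduced from the counts given above. The following sketch is a worked check rather than part of the original analysis: the false positive and false negative counts are derived by subtraction, and the lift uses the percentages stated in the abstract.

```python
# Test-group counts as reported in the text.
n_without_ms = 287
true_negatives = 219
n_with_ms = 38
true_positives = 19

# Derived by subtraction (not stated explicitly in the text).
false_positives = n_without_ms - true_negatives
false_negatives = n_with_ms - true_positives

specificity = true_negatives / n_without_ms   # accuracy among subjects without MS
sensitivity = true_positives / n_with_ms      # accuracy among subjects with MS
average_accuracy = (specificity + sensitivity) / 2
overall_accuracy = (true_negatives + true_positives) / (n_without_ms + n_with_ms)

# Lift of the highest-risk group, from the percentages in the abstract.
share_of_test_group = 12.31   # % of test subjects in the highest-risk group
share_of_ms_cases = 23.68     # % of MS patients captured by that group
lift = share_of_ms_cases / share_of_test_group

print(f"specificity      {specificity:.2%}")
print(f"sensitivity      {sensitivity:.2%}")
print(f"average accuracy {average_accuracy:.2%}")
print(f"overall accuracy {overall_accuracy:.2%}")
print(f"lift             {lift:.2f}")
```

The lift of about 1.92 is what the abstract expresses as 0.92 times more risk of MS than the general population.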
The risk of MS of test group was graded according to the model. 325 cases were divided into 10 parts (10 bins) and

Variables (Physiological Indexes or TCM Constitutions) Importance.
The TreeNet model gives stable variable importance rankings after assessing the relative importance of predictors. The 15 most important variables are listed in descending order of importance in Table 3. Taking the abscissa as the value of the physiological index and the ordinate as the influence on the target, the dependence of the incidence of MS on D_TBIL, TBIL 2014, D_LDL-C, balanced constitution 2015, and TCH 2015 is shown in Figures 2-6. The incidence of MS was higher when D_TBIL was between 0 and 2, TBIL 2014 was between 10 and 15, D_LDL-C was above 19, balanced constitution 2015 was below 60, or TCH 2015 was above 5.7.

Discussion
Over the past decades, the prevalence of metabolic syndrome has increased markedly worldwide, which may be explained by urbanization, an aging population, lifestyle change, and nutritional transition. Previous surveys indicate that metabolic syndrome has become a serious public health problem and highlight the urgent need to prevent and treat it.
Chronic diseases are usually induced by both internal and external factors, such as genetic abnormalities, imbalance of intestinal flora, carcinogens, poor diet, and physical inactivity. Therefore, it is necessary to explore health parameters correlated with chronic diseases in order to provide evidence for prediction and early diagnosis. Data mining technology has made great progress in disease prediction, diagnosis, and treatment because of its ability to analyze large pools of information and uncover unknown patterns, classifications, clusters, and relationships [24-26]. In our study, the TreeNet machine learning algorithm was used to explore changes in physiological parameters during MS development by analyzing consecutive health data of subjects from 2014 to 2016.
Our results suggest that metabolic indexes such as bilirubin and lipoproteins are important parameters for predicting the occurrence of MS.
For a long time, bilirubin was considered a waste product. As the final product of heme catabolism, bilirubin is often used as an indicator in the clinical diagnosis of hemolysis, neonatal jaundice, and liver and biliary diseases. However, further research has revealed that bilirubin is a powerful antioxidant that suppresses the inflammatory process [27]. In addition, it has been reported that bilirubin may be negatively correlated with MS [28]. This study explored physiological predictors by comparing individuals who were first diagnosed with MS in 2016 with those who were not. The results showed that the difference in TBIL between 2014 and 2015 (D_TBIL) and TBIL in 2014 (TBIL 2014) can significantly predict the occurrence of MS in 2016. In particular, when D_TBIL was 0-2 and/or TBIL 2014 was 10-15, the incidence of MS was the highest. Moreover, consistent with previous research, the incidence of MS decreased as the TBIL level increased, indicating that TBIL is a promising new predictive indicator for MS screening.
Lipid metabolism is closely related to MS. Many studies have confirmed that high-density lipoprotein, low-density lipoprotein, and cholesterol levels are correlated with MS [29]. Serum LDL-C level was reported to have a weak ability to predict MS in women in an analysis of data from a population-based cross-sectional study of representative samples of the Iranian adult population [30]. In the present study, the risk of MS in 2016 was remarkably higher in individuals whose difference in low-density lipoprotein cholesterol between 2014 and 2015 (D_LDL-C) was above 19 and/or whose total cholesterol in 2015 (TCH 2015) was above 5.7.
Body constitution in traditional Chinese medicine is the fundamental physiological component of a person, and different constitution types are susceptible to different diseases [31]. Previous studies have shown that biased constitutions are risk factors for chronic diseases. For example, yang-deficient constitution was an independent predictor of diabetic retinopathy in T2DM patients in a multiple logistic regression analysis [32]. T2DM patients with yang-deficient or phlegm-dampness constitution exhibited a significantly higher risk of albuminuria [33]. These findings suggest that the constitution types found in people with chronic diseases may provide valuable information for disease prevention and treatment [34]. Our study found consistent results. In our study, TCM ...
One notable difference from previous studies is that balanced constitution was more important than biased constitutions in disease prediction. Balanced constitution is a "strong and robust physical state". People with balanced constitution have a moderate build and a ruddy complexion and are energetic. The grading criterion for balanced constitution is a converted score for balanced constitution equal to or greater than 60 together with converted scores below 30 for all 8 biased constitutions. Traditional Chinese medicine holds that the body moves from "health" to "sickness" when internal and external causes damage its state of balance. A converted score for balanced constitution below 60 is a sign of the gradual weakening of the balanced constitution and the formation of biased constitutions. According to the prediction model in this study, when the CCMQ score for balanced constitution in 2015 (balanced constitution 2015) was lower than 60, the incidence of MS was higher. Furthermore, the interaction of CCMQ scores with other physiological indexes in the prediction of MS was analyzed.
The results indicated that balanced constitution 2015 interacted with TBIL 2014 and D_LDL-C, and that loss of balanced constitution combined with a low level of TBIL in 2014 or of D_LDL-C are important predictors of the occurrence of MS. Notably, a change in balanced constitution may be an earlier indicator of the trend of morbidity and should be taken seriously in clinical practice.
This research is the first to build a forecasting model with a data mining method to explore the predictive effect of TCM constitutions and other physiological parameters on MS incidence by analyzing consecutive health data of subjects. It provides evidence that physiological indexes and TCM constitutions can provide predictive information before the occurrence of MS. Maintaining a CCMQ score for balanced constitution above 60 points and reasonable levels of TBIL, LDL-C, and TCH may help to delay the occurrence of MS.

Data Availability
Due to the sensitive nature of the questions asked in the Constitution in Chinese Medicine Questionnaire (CCMQ), survey respondents were assured that raw data would remain confidential and would not be shared. All cleaned data used in the article are available upon request by contacting the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Yanchao Tang and Tong Zhao contributed equally to this work.