Development and Validation of Metabolic Syndrome Prediction and Classification-Pathways using Decision Trees

Purpose: The purpose of the current investigation was to create, compare, and validate sex-specific decision tree models to classify metabolic syndrome. Methods: Sex-specific Chi-Squared Automatic Interaction Detection, Exhaustive Chi-Squared Automatic Interaction Detection, and Classification and Regression Tree algorithms were run in duplicate using metabolic syndrome classification criteria, subject characteristics, and cardiovascular predictor variables from the National Health and Nutrition Examination Survey cohort data. Data from 1999-2012 were used (n=10,639; 1999-2010 cohorts for model creation and 2011-2012 cohort for model validation). Metabolic Syndrome was classified as the presence of 3 of 5 American Heart Association National Heart Lung and Blood Institute Metabolic Syndrome classification criteria. The first run was made with all predictor variables and the second run was made excluding metabolic syndrome classification predictor variables. Given that the included decision tree algorithms are non-parametric procedures, all decision tree models were compared to a logistic regression based model to provide a parametric comparison. Results: The Classification and Regression Tree algorithm outperformed all other decision tree models and logistic regression with a specificity of 0.908 and 0.952, sensitivity of 0.896 and 0.848, and misclassification error of 0.096 and 0.080 for males and females, respectively. Only one predictor variable outside of the metabolic syndrome classification criteria reached significance in the female model (age). All metabolic syndrome classification predictor variables reached significance in the male model. Waist circumference did not reach significance in the female model. Within each model, 5 female and 3 male pathways built off of <3 American Heart Association National Heart Lung and Blood Institute Metabolic Syndrome classification criteria resulted in an increased likelihood of presenting Metabolic Syndrome.
Conclusion: The proposed pathways show promise over other current metabolic syndrome classification models in identifying Metabolic Syndrome with <3 predictor variables, before current classification criteria.


Introduction
Metabolic syndrome (MetS) is a constellation of cardiometabolic predictor variables that, when present in tandem, increases the risk of cardiovascular disease (CVD) and insulin resistance [1,2]. MetS affects approximately 1 in 3 adults in the United States [3]. Given this high prevalence, proper identification of persons with MetS is imperative in order to prevent and/or modify the multiple predictor variables associated with CVD-related morbidity and mortality, as well as the associated high healthcare costs [1,2,4,5]. Furthermore, pathways for MetS classification could guide interventions by health education professionals before the onset of related morbidity and mortality. Using Decision Trees (DT) as a preliminary pre-metabolic syndrome classification criterion could improve outcomes associated with the development of MetS or could halt the progression of MetS and its related consequences [6].

Classification of metabolic syndrome
Although there have been numerous attempts to harmonize classification models for MetS, there remains a lack of consensus amongst the leading organizations, with particular disagreement over predictor variable cut-off points and over which predictor variables should be considered in making the MetS classification [1,7-9]. More recently, there has been support for MetS to be considered a premorbid condition intended to inform health educators and clinicians of the relative risk of developing CVD rather than a clinical diagnosis [6,10]. In lieu of a clinical diagnosis, MetS can provide a research framework for establishing a unified cardiometabolic pathophysiology, quantifying chronic disease risk, guiding clinical management decisions, and providing a concise methodology to inform public health and health education professionals of the relationship of clustering predictor variables [10].

Classification criteria based on the leading risk models from the National Cholesterol Education Program Adult Treatment Panel III (ATPIII), the International Diabetes Federation (IDF), the World Health Organization (WHO), and the American Heart Association National Heart Lung and Blood Institute (AHA/NHLBI) are limited in their usefulness because they classify MetS based on predictors with binary thresholds [1,2]. There currently exists limited evidence-based research that considers the severity of these MetS cardiometabolic predictor variables, their interactions with one another, and their relationship algorithm: (1) A $\chi^2$ test for independence is performed using an adjusted p-value for each predictor. (2) The predictor with the smallest adjusted p-value (i.e., most statistically significant) is split if its p-value is less than the user-specified significance split level ($\alpha_{\text{split}}$), set at 0.05; otherwise the node is not split and is then considered a terminal node.
The stopping step utilizes the following user-specified stopping rules to check if the tree growing process should stop: (1) If the current tree reached the maximum tree depth level, the tree process stops. (2) If the size of a node is less than the user-specified minimum node size, the node will not be split. (3) If the split of a node results in a child node whose node size is less than the user-specified minimum child node size value, the node will not be split. The parent node is the level where the data set divides into child nodes that can themselves become either parent nodes or end in a terminal or decision node. (4) The CHAID algorithm will continue until all the stopping rules are met.
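The stopping rules above can be sketched as a single check applied before each candidate split. This is a minimal illustration, not the SPSS implementation: the names `StoppingRules` and `may_split` are hypothetical, the depth convention (root node at depth 0) is an assumption, and the default values simply mirror the settings reported for this study's analyses (maximum depth 5, parent node 250, child node 100).

```python
from dataclasses import dataclass


@dataclass
class StoppingRules:
    """Hypothetical container for the user-specified stopping rules."""
    max_depth: int = 5          # maximum tree depth (levels below the root)
    min_parent_size: int = 250  # minimum cases a node needs before it may be split
    min_child_size: int = 100   # minimum cases allowed in any resulting child node


def may_split(depth, node_size, child_sizes, rules):
    """Return True only if none of the stopping rules blocks the split.

    depth       -- depth of the node being considered (assumed: root is 0)
    node_size   -- number of cases in the node
    child_sizes -- number of cases each proposed child node would receive
    """
    if depth >= rules.max_depth:            # rule 1: maximum depth reached
        return False
    if node_size < rules.min_parent_size:   # rule 2: node too small to split
        return False
    if any(size < rules.min_child_size for size in child_sizes):
        return False                        # rule 3: a child would be too small
    return True
```

A node for which `may_split` returns False becomes a terminal node; growth ends once this holds for every remaining node.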
Exhaustive CHAID (E-CHAID), proposed by Biggs, DeVille, and Suen, uses the basic CHAID algorithm with more computationally intensive merging and testing of predictor variables [17]. In the E-CHAID algorithm, there is no reference to any $\alpha_{\text{merge}}$ value. Rather, category merging continues until only two categories remain. Therefore, over-fitting should be carefully considered when the E-CHAID algorithm is used for large data sets with many continuous predictor variables.

Classification and regression trees
Unlike CHAID-based algorithms, the Classification and Regression Tree (CART) algorithm proposed by Breiman, Friedman, Stone, and Olshen builds purely binary trees [18]. Therefore, CART pathways are easier to understand, as parent nodes are always split into 2 child nodes that partition the data to maximize the homogeneity of each subset. In the CART procedure, the maximum tree is produced, followed by tree pruning to avoid over-fitting.
The first step in the tree growing process is to find each predictor variable's best split. In the CART algorithm, the splitting step employs a statistical calculation known as the Gini Impurity Function. This function measures how often a randomly selected case in a node would be incorrectly predicted; it is therefore used to determine the optimal binary split of the parent node into the child nodes. In the next step when the stopping rules are satisfied, the best possible split for the predictor variable is the one for which the impurity decreases the most from the parent node to the child nodes. This decrease is quantified by the Gini Improvement Measure. The parent node is split at the point where the decrease in impurity is maximized.
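As a hedged illustration of the splitting step, the sketch below computes the Gini impurity of a node with a binary response and the Gini improvement of a candidate binary split, then scans the cut points of one continuous predictor. The function names and the midpoint cut-point convention are assumptions made for this example; the pruning stage of CART is not shown.

```python
def gini_impurity(n_pos, n_neg):
    """Gini impurity of a node holding n_pos positive and n_neg negative cases."""
    n = n_pos + n_neg
    if n == 0:
        return 0.0
    p = n_pos / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def gini_improvement(parent, left, right):
    """Decrease in impurity from a parent node to its two child nodes.

    Each argument is a (n_pos, n_neg) pair; child impurities are weighted
    by the share of parent cases each child receives.
    """
    n_parent = sum(parent)
    w_left = sum(left) / n_parent
    w_right = sum(right) / n_parent
    return (gini_impurity(*parent)
            - w_left * gini_impurity(*left)
            - w_right * gini_impurity(*right))


def best_binary_split(values, labels):
    """Scan midpoints between distinct values; return (cut, improvement)."""
    parent = (sum(labels), len(labels) - sum(labels))
    best_cut, best_gain = None, 0.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left_y = [y for x, y in zip(values, labels) if x < cut]
        right_y = [y for x, y in zip(values, labels) if x >= cut]
        left = (sum(left_y), len(left_y) - sum(left_y))
        right = (sum(right_y), len(right_y) - sum(right_y))
        gain = gini_improvement(parent, left, right)
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


# Hypothetical triglyceride values (mg/dl) and MetS labels, for illustration only.
tg = [90, 110, 130, 160, 180, 210]
mets = [0, 0, 0, 1, 1, 1]
cut, gain = best_binary_split(tg, mets)
```

In this toy data set the two classes separate perfectly, so the chosen cut leaves both child nodes pure and the improvement equals the parent impurity.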

Logistic regression
Logistic Regression (LR) is a widely utilized statistical technique in binary response prediction [19]. However, LR output can be tedious to interpret and requires considerations for multicollinearity and missing values. These models are used when the response variable ($Y$) is binary, taking the value of 1 with probability of success $\pi$ or the value of 0 with probability of failure $1 - \pi$, and the predictor variables ($X_1, \dots, X_k$) are either categorical or continuous, as represented by the following equation:
$$\ln\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$$
where $\beta_0$ is a constant and $\beta_1, \dots, \beta_k$ are the coefficients of the predictor variables in the model. The LR equation, called the likelihood function, to CVD. A major limitation within these models is the dichotomous nature of predictor variable identification [6,10]. However, much like obesity, there are varied clinical implications based on the severity of the predictor variables used to define MetS, where the dichotomized cutoff points for each predictor variable might be clinically ambiguous. Furthermore, current MetS classification models lack consideration for established CVD predictor variables such as patient demographics (i.e., race/ethnicity and socioeconomic status), smoking [3], and previous cardiovascular events [11]. The creation of clinically feasible pathways for MetS classification that both stratify each predictor variable based on its severity and consider the interaction effects of predictor variable clusters could be invaluable for reducing the risk of cardiovascular morbidity and mortality [5].

Decision trees
DT methodologies have been shown to be effective tools for the classification and prediction of cardiometabolic chronic diseases such as MetS and insulin resistance [6,12-14]. However, with the exception of Miller, Fridline, Liu & Marino and Stern et al., other models have been based on international samples. To the best of our knowledge, no published pathways for MetS classification derived from DT methodologies have been built, validated, and implemented in clinical practice [6,14].
DTs are powerful classification and prediction techniques that analyze how both categorical and continuous predictor variables best combine to create pathways explaining the outcome of a given binary response variable according to statistical tests in tandem with "if-then" logic [6,14,15]. In DT algorithms, the data set is partitioned into two or more mutually exclusive subsets at each split, with the goal of producing subsets of the data that are as homogeneous as possible with respect to the response variable. This nonparametric modeling technique shows promise over traditional regression techniques in that DTs make no assumptions about the underlying data (including multicollinearity), are able to handle missing values, are easily interpreted by non-statisticians, and consider the effects of variable clusters in relation to sample subsets, unlike regression, which considers the effect of each variable within the entire sample.

Chi-squared automatic interaction detection
The Chi-Squared Automatic Interaction Detection (CHAID) algorithm proposed by Kass operates using a series of merging, splitting, and stopping steps based on user-specified criteria [16]. The merging step is applied to each predictor variable, where CHAID merges non-significant categories using the following algorithm: (1) Perform cross-tabulation of the predictor variable with the binary target variable. (2) If the predictor variable has only 2 categories, go to step 6. (3) A $\chi^2$ test for independence is performed for each pair of categories of the predictor variable in relation to the binary target variable using the $\chi^2$ distribution (df=1) with significance ($\alpha_{\text{merge}}$) set at 0.05. (4) For nonsignificant tests, identified by p-values >0.05, those paired categories are merged into a single category. For tests reaching significance, identified by p-values ≤ 0.05, the pairs are not merged. (5) If any category has fewer cases than the user-specified minimum subset size, that category is merged with the most similar other category. (6) The p-values for the merged categories are adjusted using a Bonferroni correction to control the Type I error rate.
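Steps (1) through (4) of the merging step can be sketched as follows. This is a simplified illustration under stated assumptions, not the CHAID implementation used in the study: the function names are hypothetical, the predictor is treated as nominal (any pair of categories may merge), and the minimum subset size check of step (5) and the Bonferroni adjustment of step (6) are omitted. The p-value uses the exact df=1 relationship between the chi-squared and normal distributions.

```python
import math
from itertools import combinations


def chi2_2x2(a, b):
    """Pearson chi-squared statistic and p-value (df=1) for two categories.

    a and b are (n_positive, n_negative) counts of the binary target
    within each of the two predictor categories being compared.
    """
    table = [list(a), list(b)]
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_tot)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    # For df=1, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def merge_categories(counts, alpha_merge=0.05):
    """Merge the least-different pair of categories until every remaining
    pair differs significantly or only two categories remain."""
    groups = {(label,): tuple(c) for label, c in counts.items()}
    while len(groups) > 2:
        pair, p = max(
            ((pr, chi2_2x2(groups[pr[0]], groups[pr[1]])[1])
             for pr in combinations(groups, 2)),
            key=lambda item: item[1],
        )
        if p <= alpha_merge:   # most similar pair is still significant: stop
            break
        g1, g2 = pair
        merged = tuple(x + y for x, y in zip(groups.pop(g1), groups.pop(g2)))
        groups[g1 + g2] = merged
    return groups


# Hypothetical (MetS, no MetS) counts for three categories of one predictor.
result = merge_categories({"A": (10, 90), "B": (12, 88), "C": (50, 50)})
```

Here categories A and B have nearly identical MetS proportions, so they are pooled, while C remains separate.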
The splitting step occurs following the determination of all the possible merges for each predictor variable. This step selects which predictor is to be used to "best" split the node using the following IL). Each DT analysis was run in duplicate with a minimum parent node size of 250 subjects, a minimum child node size of 100 subjects, and significance for all statistical tests within each DT set at ≤ 0.05. Maximum tree depth was user-specified at 5 levels. The NHANES cohort data were divided by sex to create sex-specific models for MetS classification, with the 2011-2012 cohort reserved for model validation. Each DT algorithm was run twice, with the first model including all possible predictor variables and the second excluding all AHA/NHLBI MetS classification criteria. Predictor variables included the AHA/NHLBI MetS classification criteria in addition to binary smoking status, American Heart Association Blood Pressure Classification, anthropometrics [height (cm), weight (kg), Body Mass Index (BMI) (kg/m²), and weight classification], marital status, socioeconomic status measured via Family Poverty to Income ratio (PIR) (a measure of adjusted family income relative to the poverty threshold), and race/ethnicity. Each DT was assessed using classification specificity, sensitivity, and classification error expressed as proportions. Sensitivity quantifies the proportion of correctly classified MetS cases and specificity gives the proportion of correctly classified non-MetS cases.
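The three assessment measures can be computed directly from the observed and predicted classifications. The sketch below is illustrative only; the function name and the example labels are hypothetical and are not taken from the study data.

```python
def classification_metrics(actual, predicted):
    """Sensitivity, specificity, and misclassification error as proportions.

    actual and predicted are equal-length sequences of 1 (MetS) and 0 (no MetS).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # MetS correctly classified
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # MetS missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-MetS correctly classified
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # non-MetS flagged as MetS
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "misclassification": (fp + fn) / len(pairs),
    }


# Hypothetical labels for ten subjects.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
```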
Within the CART algorithm, DT predictor variables were ranked by level of importance related to MetS. The best DT model was chosen and described for each node using the total proportion of MetS and no-MetS classifications and a MetS Index describing the estimated probability of MetS compared to the overall prevalence of MetS in the NHANES cohort. For both the training and validation sets, the MetS classification threshold was set at the current MetS prevalence within the NHANES cohort, in accordance with Stern et al., who used DT models to explain insulin resistance. In that study, the classification threshold of the response variable was set at the response variable's prevalence within the study cohort [14]. Instead of maintaining the 50% classification threshold for the response variable, the optimal classification cut-off point was set to maximize the sum of theoretical sensitivity and specificity, as determined from the cohort data. This decision was made to increase the number of correctly classified cases of MetS.
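One way to read the MetS Index and the prevalence-based threshold is sketched below. The function names are hypothetical, and the choice to classify a node as MetS when its probability is at or above the threshold (rather than strictly above) is an assumption. The baseline of 0.333 in the usage example is also an assumed value chosen for illustration; it is not stated in the text.

```python
def mets_index(node_probability, baseline_prevalence):
    """Ratio of a node's estimated MetS probability to the baseline prevalence.

    Values above 1 indicate a pathway with higher-than-average likelihood.
    """
    return node_probability / baseline_prevalence


def classify_node(node_probability, threshold):
    """Label a terminal node as MetS when its probability reaches the threshold.

    The threshold is the cohort prevalence instead of the conventional 0.5.
    """
    return node_probability >= threshold


# Illustrative values only: a terminal node with probability 0.655 and an
# assumed baseline prevalence of 0.333.
index = mets_index(0.655, 0.333)
is_mets = classify_node(0.655, 0.333)
```

With a 0.5 threshold, a node with probability 0.40 would be labeled non-MetS; with a prevalence-based threshold near one third it is labeled MetS, which raises sensitivity at some cost to specificity.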
Stepwise Forward Logistic Regression (LR) was performed on the predictor variables used to define MetS as a parametric classification comparison. This procedure was used to approximate the predictive power of the DT techniques. The classification threshold was set at the current prevalence of MetS within the NHANES cohort as mentioned previously. The final LR model was corrected for multicollinearity problems between the predictor variables by removing highly correlated predictor variables. Within LR, severe multicollinearity can cause instability in the model coefficients when highly correlated variables are included in the model. Variables with large amounts of missing data were excluded.
is used for estimating the regression model coefficients. The maximum likelihood estimation method uses an iterative procedure to find the model coefficients that best match the pattern of observations in the sample data. The model is interpreted by taking the exponential of the LR coefficient for each predictor variable ($e^{\beta_i}$) to determine the influence of each predictor variable on the response variable in terms of the odds ratio. To determine whether each model coefficient is statistically significant, the Wald statistic is used.
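The interpretation step can be illustrated with a short sketch. It assumes the common form of the Wald statistic, the squared ratio of a coefficient to its standard error, referred to a chi-squared distribution with 1 degree of freedom; the function names and the example coefficient are hypothetical and are not results from this study.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor: exp(beta)."""
    return math.exp(beta)


def wald_test(beta, std_error):
    """Wald chi-squared statistic (df=1) and its p-value for one coefficient."""
    z = beta / std_error
    wald = z ** 2
    # Two-sided normal tail probability, equivalent to the df=1 chi-squared tail.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return wald, p_value


# Hypothetical coefficient and standard error, for illustration only.
or_example = odds_ratio(0.693)
wald, p = wald_test(0.693, 0.25)
```

A coefficient of about 0.693 corresponds to roughly a doubling of the odds, and a coefficient of 0 corresponds to an odds ratio of 1 (no association).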

Purpose
The central hypothesis states that the decision tree pathways derived from DT algorithms using data from National Health and Nutrition Examination Survey (NHANES) cohorts would detect the presence of MetS in adults with <3 AHA/NHLBI MetS predictor variables. The current investigation had two aims. The first aim was to develop and validate sex-specific pathways for MetS classification using multiple DT derived methodologies. The second aim was to compare each DT model with and without MetS classification criteria.

Data management
The study sample was derived from National Health and Nutrition Examination Survey (NHANES) data made publicly available by the Centers for Disease Control and Prevention (CDC). This included 7 cohorts from 1999-2012 collected in 2-year intervals. The data were arranged in a column-wise format with each subject given a sequence identifier. Data management was performed with dataset merging and data subset functions using SPSS version 22 (SPSS Inc., Chicago, IL). The final sample size for inclusion in model development was n=10,639 (male: n=5,474; female: n=5,165). The current investigation was approved by the Institutional Review Board.
The inclusion criteria were based on the following parameters: age range of 18-59 years, 12-hour fasting protocol for laboratory values, abstinence from alcohol and/or tobacco use prior to laboratory testing, and a negative pregnancy exam for females. The age criterion was chosen based on Ford, Li, and Zhao [3], where the highest prevalence of MetS was exhibited after 59 years of age. This decision was made in order to create pathways to detect MetS before its onset under traditional classification criteria, given the high prevalence of MetS beyond age 59. Participants with missing data for the MetS classification criteria were excluded due to the inability/uncertainty in making a complete MetS classification. The 1999-2010 cohorts were reserved for model creation (training) and the 2011-2012 cohort was reserved for model validation. Both the training and validation sets were separated by sex. The distributions of all parameters were the same between training and validation sets. Blood pressure readings were the average of 4 blood pressure collections per subject. An indicator of cardiovascular events was based on the presence of 1 of 5 cardiovascular events, including congestive heart failure, coronary heart disease, angina, heart attack, and/or stroke.

Metabolic syndrome classification
The MetS classification was defined as the presence of 3 of 5 predictor variables based on the clinical classification model proposed by the AHA/NHLBI, see Table 1 [1].

Statistical analysis
The DT models were developed using CHAID, E-CHAID, and CART algorithm analysis using SPSS version 22 (SPSS Inc., Chicago,

Model performance
The average prevalence of MetS within the NHANES cohort was 33.1%. Subject characteristics are displayed in Table 2. The best performing models based on specificity and sensitivity for both males and females (Table 3) were the CART models considering all study parameters as contenders for inclusion. The classification error of each of the best performing models was also the lowest of the DT and LR models, at 0.096 and 0.080 for the male and female models, respectively.

Best performing female model
The first split within the DT was based on Triglycerides (TG), which is consistent with the ranked order of importance in Figure 1. The second-level splits were based on either High Density Lipoprotein Cholesterol (HDL-C) or Fasting Plasma Glucose (FPG). All MetS classification risk-factors were present in the model with the exception of Waist Circumference (WC). The only non-MetS predictor variable that the algorithm identified as statistically significant for the female cohort was age (Table 4 and Figure 2); females with age greater than 46 years were 6.3 times more likely to be classified with MetS. However, this predictor variable was present only at the lowest level within the model. Within the female cohort, all the terminal nodes with significant risk of MetS (MetS Index>1) were based on <3 MetS classification criteria. Within the female model, the terminal node with the highest likelihood of presenting with MetS using <3 AHA/NHLBI MetS classification criteria is interpreted as a female patient presenting with TG<150 mg/dl, FPG ≥ 100 mg/dl, and HDL<50 mg/dl. The probability of MetS for this pathway is 0.969, which corresponds to being 2.910 times more likely to present with MetS than the average likelihood of presenting with MetS (

Best performing male model
The first split within the DT was based on TG, which is consistent with the ranked order of importance in Figure 3. All second-level splits were based on WC. Considering the risk-factors ranked by importance, BMI was among the top predictor variables. However, this predictor variable did not appear in the model. Within the male DT, there were 3 pathways based on <3 MetS criteria that resulted in a significant increase in likelihood of MetS (MetS Index>1) (Table 5 and Figure 4). Within the male model, the terminal node with the highest likelihood of presenting with MetS using <3 AHA/NHLBI MetS classification criteria is interpreted as a male patient presenting with TG ≥ 150 mg/dl, WC<94 cm, and FPG>100 mg/dl. The probability of MetS for this pathway is 0.655, which corresponds to being 1.967 times more likely to present with MetS than the average likelihood of presenting with MetS (Table 5, Terminal Node 10).

Discussion
The purpose of the current investigation was to create, compare, and validate sex-specific DT models to classify MetS. DT models were derived using CHAID, E-CHAID, and CART algorithms with the presence of MetS as the response variable and with the MetS classification criteria, predictor variables from cardiovascular risk models, and subject characteristics as the predictor variables, whose values were obtained from 1999-2012 NHANES data [3,6,10,11,13]. MetS is classified by the presence of 3 of 5 criteria defined by AHA/NHLBI classification guidelines [1].
This study has several novel aspects. First, these models are based on a large data set that is representative of adults in the United States. Second, the pathways derived from this model show promise in accurately classifying sex-specific MetS using fewer measurements than traditional classification criteria. Third, unlike traditional MetS classification models, the pathways of the current investigation do not provide universal cutoffs for each predictor variable. Rather, these pathways consider the clustering and multilevel interactions among predictor variables to identify stepwise pathways to classify MetS. Finally, each pathway describes the likelihood of developing MetS.
In this study, the prevalence of MetS within the NHANES cohort, a representative sample of the United States adult population, was 33.1%, which approximates the 34.3% prevalence of MetS within the NHANES cohort reported by Ford, Li, & Zhao [3].
The first-level split indicates the risk-factor with the highest association with MetS. The first-level split was based on TG, which is consistent with Worachartcheewan et al., who used CART to classify MetS in a sample of Thai men and women [13]. The results of this study are also consistent with Miller, Fridline, Liu, & Marino, who used the CHAID algorithm to classify MetS in a sample of young adults using NHANES data. The best performing model in that study was built with a user-specified first-level split on WC [6]. When the first split was not user-specified, the CHAID algorithm identified TG as the first-level split. Interestingly, in that study the proposed CHAID model with the user-specified first split on WC outperformed both the CHAID algorithm without first-level split specification and the logistic regression model in overall sensitivity and classification accuracy for MetS.
There were notable differences between the male and female models. The first was that WC was present in the male model but not the female model. This might reflect the body fat distribution of women prior to menopause, which occurs at or near the age split identified by the DT model [20]. This suggests that a moderate increase in adiposity would not result in a significant increase in central adiposity. Therefore, the WC measurement might not be warranted in women. Conversely, for men, the body fat distribution would contribute to increases in central adiposity as body fat increases. This finding is consistent with Hari et al., who compared sex-specific differences between multiple MetS classification models and found that measures of central adiposity, specifically WC, were more pronounced for males than females [21]. Future investigation regarding this phenomenon is warranted considering that physicians and health professionals recommend WC measurements for both sexes.
Also notable was the close relationship between WC and BMI based on the normalized order of importance in Figures 1 and 4. Both WC and BMI have been shown to be strong proxies of visceral adiposity [22]. However, BMI only considers the relationship of weight to height and does not consider actual body composition and girth measurements. Central adiposity has been identified as a strong predictor of MetS and a strong contributor to BMI. Despres et al. demonstrated a strong correlation between BMI and WC, which suggests the interchangeability of the measures [22]. Given that WC was more significantly associated with MetS than BMI, the inclusion of WC most likely diminished the effect of BMI in the DT models. Therefore, WC seems to be a more sensitive predictor of MetS than BMI.
Also interesting in the female model was the inclusion of age, a parameter that is not a MetS classification criterion. Although this factor was not present in a high-risk MetS pathway (MetS Index>1), females aged ≥ 46 years were 6.3 times more likely to present with MetS than females within this pathway aged <46 years. One suggested explanation for the age split at 46 years relates to the cardiometabolic changes associated with menopause. However, a review by Barrett-Connor of menopause in relation to CVD risk in women delineated the direct relationship between menopause and CVD risk [23]. The methodology of the current investigation was unable to explain the inclusion of this predictor. Further investigation exploring the relationship between central adiposity and likelihood of presenting with MetS for women by age and pre-, peri-, and post-menopause is warranted. models developed in the current investigation in relation to other classification models would be the classification of MetS with fewer than 3 risk-factors and/or the identification of the MetS risk of multiple clustering combinations of predictor variables. In the female model, all of the pathways leading to increased risk of MetS were based on fewer than 3 predictor variables. However, in the male model only one pathway required fewer than 3 predictor variables for MetS classification. Clinical application of these pathways can help health educators and/or clinicians identify high-risk pathways and focus on interventions that could shift a patient to a lower-risk pathway.

Conclusions
In summary, the current investigation's findings suggest that DT-based pathways to classify MetS and the likelihood of presenting with MetS could detect MetS before other classification models. Within the female model, waist circumference measures did not reach significance as a predictor variable. However, age did reach significance for inclusion in the female model. Five of the pathways with increased likelihood of MetS in the female model were built using ≤ 2 AHA/NHLBI MetS classification criteria. Three of the pathways with increased likelihood of MetS in the male model were built using ≤ 2 AHA/NHLBI MetS classification criteria. Future research should implement and further validate these pathways using a clinical sample. There still remains no clinically established criterion for pre-metabolic syndrome. These pathways show promise in developing a preliminary pre-metabolic syndrome classification tool to guide intervention before the onset of MetS as defined by current models.