Classification and Prediction on the Effects of Nutritional Intake on Overweight/Obesity, Dyslipidemia, Hypertension and Type 2 Diabetes Mellitus Using Deep Learning Model: 4–7th Korea National Health and Nutrition Examination Survey

Few studies have used deep learning, such as a deep neural network (DNN), to classify and predict the influence of nutritional intake on overweight/obesity, dyslipidemia, hypertension and type 2 diabetes mellitus (T2DM). The present study aims to classify and predict associations between nutritional intake and the risk of overweight/obesity, dyslipidemia, hypertension and T2DM by developing a DNN model, and to compare the DNN model with the most popular machine learning models, namely logistic regression and decision tree. Subjects aged 40 to 69 years in the 4–7th (2007 through 2018) Korea National Health and Nutrition Examination Survey (KNHANES) were included. Diagnostic criteria of dyslipidemia (n = 10,731), hypertension (n = 10,991), T2DM (n = 3889) and overweight/obesity (n = 10,980) were set as dependent variables. Nutritional intakes were set as independent variables. A DNN model comprising one input layer with 7 nodes, three hidden layers with 30, 12 and 8 nodes, respectively, and one output layer with one node was implemented in the Python programming language using Keras with a TensorFlow backend. In the DNN, the binary cross-entropy loss function for binary classification was used with the Adam optimizer. To avoid overfitting, dropout was applied to each hidden layer. Structural equation modelling (SEM) was also performed to simultaneously estimate multivariate causal associations between nutritional intake and overweight/obesity, dyslipidemia, hypertension and T2DM. With five-fold cross-validation, the DNN model showed higher prediction accuracy (0.58654 for dyslipidemia, 0.79958 for hypertension, 0.80896 for T2DM and 0.62496 for overweight/obesity) than the two other machine learning models.
Prediction accuracies for dyslipidemia, hypertension, T2DM and overweight/obesity were 0.58448, 0.79929, 0.80818 and 0.62486, respectively, with logistic regression, and 0.52148, 0.66773, 0.71587 and 0.54026, respectively, with a decision tree. In this study, a DNN model with three hidden layers of 30, 12 and 8 nodes had better prediction accuracy than the two conventional machine learning models, logistic regression and decision tree.

Therefore, a strategy for primary prevention of overweight/obesity, dyslipidemia, hypertension and T2DM is vital to reduce the severe consequences of these diseases. Studies have shown associations between nutritional determinants and overweight/obesity, dyslipidemia, hypertension and T2DM, indicating that nutritional intake influences the prevalence or prevention of these conditions [22][23][24][25][26][27][28][29][30].
Python is a platform-independent programming language that supports object-oriented and/or structure-oriented approaches. It is easy to use because of its accessibility and applicability, which enable programmers to write clear and logical code [31,32].
Machine learning techniques such as logistic regression and decision trees have been used in healthcare [33]. Logistic regression is a statistical technique used to model the relationship between predictors (our independent variables) and a predicted variable (the dependent variable) when the dependent variable is binary. A decision tree is a flowchart-like diagram that shows the various outcomes of a series of decisions. It can be used as a decision-making tool for research analysis or strategy planning. A primary advantage of a decision tree is that it is easy to follow and understand.
Deep learning, a subset of machine learning, learns a limited task by itself from examples stored in a training dataset and is then applied to new cases in a test dataset. Deep learning has shown improved data processing performance, particularly in classifying, identifying and detecting targets, with excellent final accuracy of classification or prediction [34]. Deep learning algorithms include the deep neural network (DNN), deep belief network (DBN), stacked autoencoder (SAE), convolutional neural network (CNN) and recurrent neural network (RNN) [31,32]. A DNN is one of the most common deep learning models and contains multiple layers of linear and non-linear operations. It extends the standard neural network with multiple hidden layers, which allows the model to learn more complex representations of the input data. The structure of the DNN is given in Figure 1 [32,34]. In Figure 1, neurons are represented by circles. The neurons in the input layer receive values and propagate them to the neurons in the middle layer of the network, which is frequently called a "hidden layer". The weighted sums from one or more hidden layers are ultimately propagated to the output layer, which presents the final outputs of the network to the user.
A few studies have evaluated the association between diet or nutrition and cardiometabolic risk with a machine learning approach [35][36][37][38]. Although studies have developed DNN models to predict levels of low-density lipoprotein cholesterol (LDL-C) [39] or blood glucose in T2DM [40], these studies did not investigate the association between diet or nutritional intake and levels of LDL-C or blood glucose [39,40]. One study used a machine learning approach to develop a metabolic syndrome prediction model for nonobese healthy subjects from genetic and clinical data only, without evaluating the influence of diet or nutrition on metabolic syndrome [41]. Another study developed a prediction model to examine the association between dietary factors and hyperuricemia in Chinese adults using an artificial neural network (ANN) model with 14 neurons in the input layer, 3 neurons in the hidden layer and 1 neuron in the output layer [42].
In light of these earlier DNN/machine learning studies [35][36][37][38][39][40][41][42], no study has used a DNN model as an improved statistical tool to examine the association between nutritional intake and the risk of incident overweight/obesity, dyslipidemia, hypertension and T2DM.
The objectives of this study are as follows: (a) to identify the nutritional intake of the Korean general adult population using the KNHANES data set; (b) to develop a DNN model of deep learning; and (c) to classify and predict the risk of overweight/obesity, dyslipidemia, hypertension and T2DM in relation to nutritional intake using the DNN model.
The present study compared DNN with machine learning techniques such as logistic regression and decision tree.

Variable Classification
Nutritional intake, comprising food intake, energy intake, protein intake, fat intake, carbohydrate intake, sodium intake and potassium intake, was set as the independent variables (Table 1). Nutritional intake data were obtained from the dietary intake survey (food frequency questionnaire (FFQ), 24-h recall method and dietary life survey) of the 4-7th KNHANES [43]. Briefly, trained dietitians collected the KNHANES FFQ data in the participants' homes one week after the health interview and health examination. The FFQ listed 112 food and dish items in eleven food groups: rice (5), noodle and dumpling (6), bread and rice cake (8), soup and stew (12), bean, egg, meat and fish (23), vegetable, seaweed and potato (27), milk and dairy (4), fruit (13), tea and beverage (5), snack and sweets (6) and alcoholic beverage (3). Intake frequency was recorded in nine response categories: ≥3 times/d, 2 times/d, 1 time/d, 5-6 times/week, 2-4 times/week, 1 time/week, 2-3 times/month, 1 time/month and none. Participants were also asked to choose one of three portion sizes: small (0.5), medium (1.0) and large (1.5-2.0). Nutritional intake was calculated by considering the 24-h recall method and the relative frequency weighting of each food [44]. Because the FFQ consisted of different food items (names of foods and dishes), comparative or integrated analysis of the FFQ data was not advised. For this reason, we excluded implausible dietary data from the analysis. To use data collected across the 4-7th KNHANES, this study set as independent variables the items investigated in every relevant year: food, energy, protein, fat, carbohydrate, sodium and potassium. Other nutrients were not available in all survey years. For example, free sugar data are unavailable in the 4-7th KNHANES, while saturated fatty acid data exist only in the 6-7th KNHANES. For this reason, free sugar and saturated fatty acid were not considered as independent variables.
The KNHANES food and dietary intake database underwent a nutrient conversion process, as mentioned in the introduction. Data on energy and nutrient intake were obtained using the following formula: energy and nutrient intake = intake frequency × intake amount × energy and nutrient content of each food item [19]. This study used the energy and nutrient intake data obtained from this formula without any additional data conversion.
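As an illustration of the intake formula above, the following minimal Python sketch sums frequency × amount × content over a list of food items. The `FoodRecord` class, its field names and the numeric values are hypothetical and chosen only for illustration; they are not taken from the KNHANES database.

```python
from dataclasses import dataclass


@dataclass
class FoodRecord:
    """One food item; all fields are hypothetical illustrations."""
    name: str
    frequency_per_day: float    # intake frequency (times per day)
    portion: float              # intake amount as a portion multiplier, e.g. 0.5, 1.0, 1.5
    content_per_portion: float  # energy or nutrient content of one standard portion


def nutrient_intake(records):
    """Sum of frequency x amount x content over all food items."""
    return sum(
        r.frequency_per_day * r.portion * r.content_per_portion
        for r in records
    )


# Made-up example: daily energy (kcal) from two items.
records = [
    FoodRecord("rice", frequency_per_day=3.0, portion=1.0, content_per_portion=300.0),
    FoodRecord("milk", frequency_per_day=1.0, portion=0.5, content_per_portion=130.0),
]
daily_energy = nutrient_intake(records)
print(daily_energy)
```

The same function applies to any single nutrient (protein, fat, sodium and so on) by supplying that nutrient's content per portion.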
The diagnostic criteria for each disease (dyslipidemia, hypertension, T2DM and overweight/obesity) were treated as dependent variables (Table 2). Initially, we classified the diagnostic criteria of metabolic syndrome as dependent variables based on the World Health Organization (WHO) guideline with the Asia-Pacific perspective, using the 4-7th KNHANES data. However, because missing data were handled by complete-case analysis [45], the number of subjects meeting the metabolic syndrome criteria was too small to run a DNN, which is a deep learning approach that relies on big data. For this reason, we classified the dependent variables as dyslipidemia, hypertension, T2DM and overweight/obesity according to the Korean diagnostic criteria. Ultimately, we developed a highly accurate predictive model for dyslipidemia, hypertension, T2DM and overweight/obesity and presented our research results. All statistical and machine learning models are built on the foundation of data. In statistics, variables can be classified into two types of data: qualitative (categorical) and quantitative. Qualitative variables such as physical activity, smoking, gender, age and total energy intake and expenditure are also important contributors, as independent variables, to the dependent variable CVD. For simplicity of model building, we did not consider the qualitative variables, because converting categorical variables into dummy variables might yield a combinatorial explosion problem [46].
Overweight/obesity was defined as a body mass index (BMI) of 23 kg/m² or higher according to the Korean Society for the Study of Obesity [47]. Hypertension was defined as systolic blood pressure ≥140 mmHg or diastolic blood pressure ≥80 mmHg [48]. The diagnostic criteria for T2DM were as follows: fasting plasma glucose (FPG) level ≥ 126 mg/dL (7.0 mmol/L) or HbA1c ≥ 6.5% [49]. We did not include glucose levels after an oral glucose tolerance test (OGTT) because KNHANES did not provide them. According to KNHANES guidelines, subjects who were taking antihypertensive and/or antidyslipidemic medications for more than 20 days per month were defined as subjects with hypertension and/or dyslipidemia. Subjects were classified as having T2DM when receiving oral hypoglycemic medication or insulin injections. KNHANES data did not provide the specific quantity or types of medication. According to the Korean Society of Lipid and Atherosclerosis, dyslipidemia is defined as any one of the following: total cholesterol (TC) level ≥ 240 mg/dL, high-density lipoprotein cholesterol (HDL-C) level < 40 mg/dL, triglyceride (TG) level ≥ 200 mg/dL, LDL-C level ≥ 160 mg/dL, or the use of a lipid-lowering drug [50]. We excluded LDL-C from the dyslipidemia diagnosis because LDL-C levels were missing in the 4-7th KNHANES. Therefore, TC, HDL-C and TG were used as the dyslipidemia diagnostic criteria in this study.
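The diagnostic rules described above can be written as simple threshold checks. The sketch below is illustrative only: the function and parameter names are hypothetical, the medication flags are simplified to booleans, and the thresholds are copied as stated in the text (LDL-C is omitted, as in this study).

```python
def is_overweight_or_obese(bmi):
    """BMI of 23 kg/m^2 or higher."""
    return bmi >= 23.0


def has_hypertension(sbp, dbp, on_antihypertensive=False):
    """Systolic >= 140 mmHg, diastolic >= 80 mmHg (as stated in the text),
    or antihypertensive medication use (simplified to a boolean)."""
    return sbp >= 140 or dbp >= 80 or on_antihypertensive


def has_t2dm(fpg_mg_dl, hba1c_percent, on_glucose_lowering=False):
    """FPG >= 126 mg/dL, HbA1c >= 6.5 %, or oral hypoglycemic/insulin use."""
    return fpg_mg_dl >= 126 or hba1c_percent >= 6.5 or on_glucose_lowering


def has_dyslipidemia(tc, hdl_c, tg, on_lipid_lowering_drug=False):
    """TC >= 240, HDL-C < 40 or TG >= 200 mg/dL, or lipid-lowering drug use.
    LDL-C is not used, mirroring its exclusion in this study."""
    return tc >= 240 or hdl_c < 40 or tg >= 200 or on_lipid_lowering_drug


# Made-up subject used only to exercise the rules.
labels = {
    "overweight_obesity": is_overweight_or_obese(24.1),
    "hypertension": has_hypertension(128, 76),
    "t2dm": has_t2dm(131, 6.1),
    "dyslipidemia": has_dyslipidemia(198, 52, 140),
}
print(labels)
```

Each function returns a binary label, which corresponds to the binary dependent variable used for each of the four outcomes.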

Deep Learning Performance Evaluation Methods
We used TensorFlow (version 2.0), provided by Google, as the backend for Keras (version 2.3.1) to train and test a DNN model in Python (version 3.7.7). Batch normalization was performed for deep learning. The overall statistical performance evaluation method is shown in Figure 2. We employed the Adam optimizer to minimize the loss function, with a learning rate of 0.01 and a dropout probability of 0.1. The structure of the DNN model is presented in Figure 3. The DNN consisted of three hidden layers with 30, 12 and 8 nodes, respectively, and one output layer with one node. The activation function is the core of a DNN structure [51]. In this study, the rectified linear unit (ReLU) activation function was used in each hidden layer, and the sigmoid activation function for binary classification was used in the last layer. The batch size for model training was set to 20 and the number of epochs to 100.
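To make the layer structure concrete, the following is a minimal pure-Python sketch of a forward pass through a network with 7 inputs, hidden layers of 30, 12 and 8 ReLU nodes and one sigmoid output node. It is not the Keras implementation used in the study: the weights are random placeholders, and training, dropout and batch normalization are deliberately omitted. All function names are hypothetical.

```python
import math
import random

# 7 input nodes, hidden layers of 30, 12 and 8 nodes, 1 output node.
LAYER_SIZES = [7, 30, 12, 8, 1]


def init_params(sizes, seed=0):
    """Random placeholder weights and zero biases for each dense layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(params, inputs):
    """ReLU in the hidden layers, sigmoid in the output layer."""
    activations = list(inputs)
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        act = sigmoid if i == last else relu
        activations = [
            act(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]


def count_parameters(params):
    """Number of weights plus biases over all layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


params = init_params(LAYER_SIZES)
probability = forward(params, [0.5] * 7)  # seven (scaled) nutritional inputs
n_params = count_parameters(params)
print(n_params, probability)
```

With these layer sizes the network has 725 trainable weights and biases, which is small compared with the sample sizes of roughly 4000 to 11,000 subjects per outcome.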
A confusion matrix is a way of assessing the performance of a classification model. It compares the ground truth (actual values) with the values predicted by the model for the target variable. An example of a confusion matrix for binary classification is shown in Table 3, whose four quadrants are the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). One of the most commonly used classification metrics is accuracy, which is calculated from the confusion matrix as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
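The accuracy calculation can be sketched in a few lines of Python. The labels below are made up for illustration, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


# Made-up example: 8 subjects, 1 = disease present, 0 = absent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)  # 3 3 1 1 0.75
```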

Statistical Analysis
Logistic regression and decision tree analyses were performed to investigate the association between nutritional intake and the risk of overweight/obesity, dyslipidemia, hypertension and T2DM, and were compared with our DNN model. The Wald test in the context of logistic regression was used to determine whether a given nutritional intake was significant. We also performed SEM using the AMOS 22.0 program (IBM Corporation, Chicago, IL, USA) to estimate how nutritional intake simultaneously affects the risk of overweight/obesity, dyslipidemia, hypertension and T2DM, as a sub-analysis of the DNN modelling. Structural equation modelling is a relatively powerful technique for testing and evaluating multivariate causal relationships and interactions among factors [52]. The goodness of fit is the main output of the first phase of SEM data analysis. The chi-square statistic (CMIN), minimum discrepancy per degree of freedom (CMIN/DF), root mean square error of approximation (RMSEA), normed fit index (NFI), comparative fit index (CFI), Tucker-Lewis index (TLI), relative fit index (RFI), incremental fit index (IFI) and goodness of fit index (GFI) are the most important fit indices to examine. Wagenmakers 2007 [53] determined the following values for these parameters to illustrate goodness of fit: NFI, CFI, TLI, IFI, RFI and GFI should be equal to or greater than 0.9, and RMSEA should be less than 0.08.
The number of subjects with T2DM aged 40 to 69 years was 3889 when classified as data corresponding to the T2DM diagnostic criteria. This is a value that has not been processed arbitrarily. The number of subjects with T2DM aged 40-69 years was lower than that of overweight/obesity, hypertension and dyslipidemia in the 4-7th KNHANES data.
The proportion of females was higher than that of males in all four datasets. We divided the data by sex to compare differences in male and female characteristics. The KNHANES data showed that the proportion of women was higher than that of men among subjects with hypertension, T2DM, dyslipidemia and obesity, as well as among healthy adults aged 40 to 69 years. This suggests that the share of women in the male-to-female ratio increases with age, so the number of female patients with these diseases could be higher.
Intakes of carbohydrate, sodium and potassium were higher than other nutritional intakes in subjects with dyslipidemia. Energy intake was the highest, whereas protein intake was the lowest, compared with other nutritional intakes in subjects with hypertension. Intakes of energy, protein and fat were higher, whereas sodium intake was lower, compared with other nutritional intakes in subjects with T2DM. Food intake was the highest, while energy intake was the lowest, in subjects with overweight/obesity. Overall, in subjects with dyslipidemia, hypertension, T2DM and overweight/obesity, protein intake met the recommended nutrient intake (RNI) for Koreans (50 g/d for males aged 40-49 years; 60 g/d for males aged over 50 years; 50 g/d for females aged over 40 years), while sodium intake was much higher than the adequate intake (AI) (1500 mg/d for both males and females aged 40-64 years; 1300 mg/d for both males and females aged 65-74 years).
N_CHO, carbohydrate intake (g); N_EN, energy intake (Kcal); N_FAT, fat intake (g); N_INTK, food intake (g); N_K, potassium intake (mg); N_NA, sodium intake (mg); N_PROT, protein intake (g); HE_BMI, overweight/obesity; HE_chol, total cholesterol; HE_dbp, diastolic blood pressure (mean value of 2-3 blood pressure measurements); HE_glu, fasting blood glucose; HE_HbA1c, glycated hemoglobin; HE_HDL_st2, calibration of high-density lipoprotein cholesterol; HE_sbp, systolic blood pressure (mean value of 2-3 blood pressure measurements); HE_TG, triglyceride; T2DM, type 2 diabetes mellitus.
Dyslipidemia, according to the Korean Society of Lipid and Atherosclerosis, is defined as any one of the following: total cholesterol level ≥ 240 mg/dL, high-density lipoprotein cholesterol level < 40 mg/dL, triglyceride level ≥ 200 mg/dL, low-density lipoprotein cholesterol level ≥ 160 mg/dL, or the use of a lipid-lowering drug; hypertension, according to the Korean hypertension criteria, was defined as systolic blood pressure ≥140 mmHg or diastolic blood pressure ≥80 mmHg; overweight/obesity, according to the Korean Society for the Study of Obesity, was defined as a body mass index of 23 kg/m² or higher; T2DM, according to the Korean Diabetes Association (KDA), was defined as fasting plasma glucose level ≥ 126 mg/dL (7.0 mmol/L) or glycated hemoglobin ≥ 6.5%.

K-Fold Cross-Validation (K = 5)
Usually, when evaluating a machine learning model, the data set is split into training and validation (or testing) sets; the training set is used to train the model and the validation (or testing) set to test it. In this study, we used k-fold cross-validation, in which the data were first partitioned into k equally (or nearly equally) sized segments, or folds. Subsequently, k iterations of training and validation (or testing) were performed such that within each iteration a different fold of the data was held out for validation (or testing), while the remaining k − 1 folds were used for learning. Here, we used five-fold cross-validation (k = 5), in which the data were split into five folds, as shown in Figure 4. In five-fold cross-validation, we obtained the overall accuracy of the model by averaging the five per-fold performance metrics.
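The partitioning described above can be sketched without any machine learning library. In the following illustrative Python example, `k_fold_indices` and `cross_validated_score` are hypothetical helper names, and `evaluate_fold` stands for any user-supplied function that trains on the training indices and returns an accuracy on the held-out fold.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validated_score(n_samples, evaluate_fold, k=5):
    """Average the per-fold scores returned by evaluate_fold(train, validation)."""
    scores = [evaluate_fold(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Example with 23 hypothetical samples: each sample is held out exactly once.
splits = list(k_fold_indices(23, k=5))
print([len(val) for _, val in splits])
```

Each sample appears in exactly one validation fold, so every observation contributes once to the averaged accuracy.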


Accuracy Comparison between a DNN Model and Other Machine Learning Models
The models were compared by their accuracy in predicting the risk of dyslipidemia, hypertension, T2DM and overweight/obesity from nutritional intakes based on the 4-7th KNHANES (Table 5). The accuracy results for the DNN, logistic regression and decision tree in Table 5 were computed according to the accuracy formula shown in Table 3. The DNN model showed slightly higher accuracy for dyslipidemia, hypertension and T2DM than logistic regression and decision tree, although the differences in accuracy were not statistically significant. The relatively low performance figures (Table 5) are attributable to the fact that we did not evaluate an optimal model for each of the four outcomes (dyslipidemia, hypertension, T2DM and overweight/obesity); instead, we evaluated the performance of a common model for all four. Overall, the prediction accuracy of the DNN was higher than that of logistic regression and decision tree in Table 5.


Wald Test in Logistic Regression
We were also interested in examining if a significant relationship existed between nutritional intake and risk of overweight/obesity, dyslipidemia, hypertension and T2DM in the logistic model. The Wald statistics were used to test the significance of individual coefficients in the model and were calculated using the following formula: Coefficient/S.E.
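As a worked illustration of this ratio, the sketch below computes the Wald statistic and a two-sided p-value under the usual large-sample normal approximation. The coefficient and standard error are made-up numbers, not values from Table 6, and the function name is hypothetical. Some software reports the squared ratio as a chi-square statistic with one degree of freedom, which gives the same p-value.

```python
import math


def wald_test(coefficient, standard_error):
    """Return the Wald z statistic and its two-sided p-value.

    Assumes the estimate is approximately normally distributed,
    so that z = coefficient / S.E. is compared with a standard normal.
    """
    z = coefficient / standard_error
    # Two-sided tail probability of the standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Made-up example: coefficient 0.5 with standard error 0.2.
z, p = wald_test(0.5, 0.2)
significant = p < 0.05  # nominal alpha level of 0.05
print(z, p, significant)
```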
The results of the Wald tests for the KNHANES data are given in Table 6. The nominal alpha level of 0.05 was used for statistical significance. The test for the coefficient of the nutritional intake indicated that N_EN, N_INTK, N_FAT and N_NA significantly contributed to predicting dyslipidemia. N_EN, N_INTK, N_FAT and N_CHO significantly contributed to predicting hypertension. All seven predictors significantly contributed to predicting T2DM. N_EN, N_FAT, N_CHO and N_NA significantly contributed to predicting overweight/obesity. The constant had no simple practical interpretation but was generally retained in the model irrespective of its significance.

Evaluation of the Fitted Model of Structural Equation Modelling
This study classified and predicted the effect of nutrient intake on the diagnosis of overweight/obesity, dyslipidemia, hypertension and T2DM based on a DNN of deep learning in Python. However, because a DNN alone was insufficient to examine the specific correlations between nutrient intake and the diagnosis of overweight/obesity, dyslipidemia, hypertension and T2DM, these correlations were determined through SEM. The SEM model used in this study can be judged suitable because the fit indices satisfied the acceptance criteria (Table 6). The significance of the estimated parameters is shown in Table 7. The associations between nutritional intake and the risk of dyslipidemia, hypertension, T2DM and overweight/obesity are shown in the path diagrams for SEM (Figure 5). B is the unstandardized regression coefficient. β is the standardized regression coefficient, which is obtained by correcting the unstandardized regression coefficient with the standard deviation of each independent variable and the standard deviation of the dependent variable. The importance (influence) of the independent variables can be compared through the standardized coefficients, provided that the p-value is significant. S.E. is the standard error of the estimated coefficient. C.R. is the critical ratio, obtained by dividing the unstandardized coefficient by its standard error. A coefficient is statistically significant if the absolute value of C.R. exceeds 1.965.
p-value, probability value; ***, p < 0.001. B, unstandardized regression coefficient; β, standardized regression coefficient; C.R., critical ratio; S.E., standard error; T2DM, type 2 diabetes mellitus; N_CHO, carbohydrate intake (g); N_EN, energy intake (Kcal); N_FAT, fat intake (g); N_INTK, food intake (g); N_K, potassium intake (mg); N_NA, sodium intake (mg); N_PROT, protein intake (g).
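The relationship between B, β and C.R. described above can be illustrated with a short Python sketch. It assumes the common convention β = B × SD(x) / SD(y) for a single predictor; the data and the standard error are made up, and the function names are hypothetical rather than part of AMOS.

```python
import statistics


def standardized_coefficient(b, x_values, y_values):
    """beta = B * SD(x) / SD(y), using sample standard deviations."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)


def critical_ratio(b, standard_error):
    """C.R. = unstandardized coefficient divided by its standard error."""
    return b / standard_error


# Made-up data in which y is exactly 2 * x, so B = 2 and beta = 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
beta = standardized_coefficient(2.0, x, y)
cr = critical_ratio(2.0, 0.5)   # hypothetical standard error of 0.5
is_significant = abs(cr) > 1.965  # threshold as stated in the text
print(beta, cr, is_significant)
```

Because β is unit-free, it allows predictors measured in different units (for example grams and milligrams) to be compared directly.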

Discussion
The aim of this study was to classify and predict the risks of overweight/obesity, dyslipidemia, hypertension and T2DM in relation to nutritional intake using a DNN model of deep learning. To achieve this aim, we evaluated the DNN model with five-fold cross-validation in order to address the reliability and overfitting issues of the DNN model. The developed DNN model was compared with logistic regression and decision tree analyses to evaluate its accuracy.
In this study, the DNN model evaluated with five-fold cross-validation showed higher prediction accuracy than the existing statistical analysis methods, logistic regression and decision tree. To the best of our knowledge, this is the first study suggesting that a DNN model of deep learning could be a valuable approach for evaluating the predictive effect of nutritional intake on the risks of overweight/obesity, dyslipidemia, hypertension and T2DM from the KNHANES database.
Similar to our study, Panaretos et al., 2018 [36] examined the association between dietary patterns and 10-year cardiometabolic health status in 2020 subjects from the ATTICA database (a prospective cohort study conducted in the province of Attica). Machine learning techniques classified this association considerably more accurately than linear regression. In that study, dietary patterns (based on foods or nutrients) were identified with factor analysis. The k-nearest neighbors (k-NN) and random forest (RF) algorithms were tested using cardiometabolic health scores derived from age, BMI, smoking, physical activity, family history of T2DM, hypertension and hypercholesterolemia.
Choe et al., 2018 [41] developed a machine learning model to predict the effect of obesity on the prevalence of metabolic syndrome by integrating clinical, environmental and genetic factors of Koreans. Of 10,349 participants, 7502 were nonobese, and metabolic syndrome was found in 647 (8.6%) of them. Nonobese participants were grouped into a training set (n = 5251) and a test set (n = 2251). Model A used only clinical factors (age, sex, BMI, smoking, alcohol and physical activity status), while Model B added genetic information (10 single nucleotide polymorphisms) to the factors of Model A. When the two models were compared using naïve Bayes classification (NB, a type of machine learning), Model B (area under the receiver operating characteristic curve (AUC) = 0.69) performed better than Model A (AUC = 0.65). Notably, these studies [36,41] had smaller sample sizes and used machine learning models rather than deep learning models.
Recently, Lee et al., 2019 [39] demonstrated that a DNN model evaluated with five-fold cross-validation estimated LDL-C concentration better than the Friedewald and Novel methods. The DNN model, consisting of six hidden layers with 30 nodes each, took three input values (TC, HDL-C and TG) and estimated LDL-C as the output.
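For context, the Friedewald method used as a comparator in [39] is a closed-form estimate from the same three inputs rather than a learned model. A minimal sketch, assuming concentrations in mg/dL and the conventional restriction to TG below 400 mg/dL; the function name is hypothetical.

```python
def friedewald_ldl(tc, hdl, tg):
    """Estimate LDL-C (mg/dL) as TC - HDL-C - TG / 5 (Friedewald equation)."""
    if tg >= 400:
        # The equation is conventionally not applied at high triglyceride levels.
        raise ValueError("Friedewald estimate is not used for TG >= 400 mg/dL")
    return tc - hdl - tg / 5


# Hypothetical lipid panel, for illustration only.
print(friedewald_ldl(tc=200, hdl=50, tg=150))
```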
Faruqui et al., 2019 [40] demonstrated high accuracy in forecasting the next-day glucose level for T2DM management, assessed with the Clarke Error Grid and a ±10% range of the actual values. The data came from 10 overweight or obese subjects with T2DM and included daily mobile health lifestyle data such as diet, physical activity, weight and glucose concentration for over 6 months. A long short-term memory (LSTM) network, a type of recurrent neural network (RNN), was used as the deep learning model for predicting daily glucose concentration.
Previous studies applied machine learning to diet and cardiometabolic outcomes [35][36][37][38], and DNN models have been applied to LDL-C and blood glucose levels without investigating their relationship with diet [39,40]. The novelty of our study is the development of a DNN model to classify and predict the association of nutritional intake with overweight/obesity, dyslipidemia, hypertension and T2DM, which may further enhance predictive performance.
Several studies showed an association between nutritional intake [54,55] or dietary patterns [56] and metabolic syndrome. In one study, the highest quartile of saturated/monounsaturated fatty acid intake was associated with 1.27-fold higher odds (95% confidence interval (CI) 1.10-1.46; p = 0.001) of metabolic syndrome compared with the first quartile. Vitamins and trace elements were associated with an odds ratio of 0.79 (95% CI 0.70-0.89; p = 0.001) for metabolic syndrome. No association was found between polyunsaturated fatty acids and metabolic syndrome. Moderate alcohol intake and lower intakes of total saturated fatty acids and sodium were associated with a lower risk of metabolic syndrome. That study applied principal component analysis to 24-h dietary recall data from the National Health and Nutrition Examination Survey (NHANES 2001 to 2012) [54]. Iwasaki et al., 2019 [55] showed the association between nutrients and risk of metabolic syndrome using factor analysis in a Japanese population. Factor 1, a pattern of fiber, potassium and vitamins, was associated with a decreased risk of metabolic syndrome. Factor 2, a pattern of fats and fat-soluble vitamins, was associated with increased risks of metabolic syndrome, obesity and high blood pressure. Factor 3, a pattern of saturated fatty acids, calcium and vitamin B2, was associated with increased risks of metabolic syndrome, high blood pressure, elevated TG and decreased HDL-C [55]. Moreover, the highest quartile of the meat pattern was positively associated with risk of metabolic syndrome only in Korean male adults after multivariate adjustment (prevalence ratio = 1.47; 95% CI 1.00-2.15; p for trend = 0.016) [56]. Dietary intake was assessed with food frequency questionnaires [55,56]. In the present study, the association between nutritional intake and risk of overweight/obesity, hypertension, dyslipidemia and T2DM was examined using SEM as a sub-analysis, because the DNN alone could not sufficiently demonstrate the association.
The SEM showed that energy intake was the largest contributor to risk of dyslipidemia, hypertension and overweight/obesity, suggesting that reducing energy intake could help prevent overweight/obesity, dyslipidemia and hypertension.
The strengths of this study include the large number of study subjects, and use of a deep learning model and structural equation modelling.
Several limitations of this study should be acknowledged. The DNN implemented in Python could not show how nutritional intakes (the independent variables) act on the diagnoses of overweight/obesity, dyslipidemia, hypertension and T2DM (the dependent variables). SEM was performed to compensate for this limitation. In addition, the accuracy of classifying and predicting the association between nutritional intake and overweight/obesity, dyslipidemia, hypertension and T2DM may vary substantially because of underlying factors including age, gender, gut microbiota and genetic traits [57][58][59][60], which are beyond the scope of the present study.
Future studies incorporating risk factors such as physical activity, smoking, gender, age, total energy intake and expenditure, gut microbiota and genetic traits should be carried out.
Further prospective cohort studies are also warranted to assess the association between dietary patterns and risk of overweight/obesity, dyslipidemia, hypertension and T2DM.

Conclusions
From the large KNHANES dataset, the DNN model developed in this study classified and predicted the risk of overweight/obesity, hypertension, dyslipidemia and T2DM more accurately than two conventional machine learning models, logistic regression and a decision tree.
The SEM indicated that energy intake was the most influential factor in risk of dyslipidemia, hypertension and overweight/obesity.

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The authors have no authority over the data, and the data is provided upon request to the Ministry of Health and Welfare.

Conflicts of Interest:
The authors declare no conflict of interest.