Prediction of dyslipidemia using gene mutations, family history of diseases and anthropometric indicators in children and adolescents: The CASPIAN-III study

Dyslipidemia, the disorder of lipoprotein metabolism resulting in high lipid profile, is an important modifiable risk factor for coronary heart diseases. It is associated with more than four million deaths worldwide per year. Half of the children with dyslipidemia have hyperlipidemia during adulthood, and its prediction and screening are thus critical. We designed a new dyslipidemia diagnosis system. A sample of 725 subjects (age 14.66 ± 2.61 years; 48% male; dyslipidemia prevalence of 42%) was selected by multistage random cluster sampling in Iran. Single nucleotide polymorphisms (rs1801177, rs708272, rs320, rs328, rs2066718, rs2230808, rs5880, rs5128, rs2893157, rs662799, and Apolipoprotein-E2/E3/E4), anthropometric and life-style attributes, and family history of diseases were analyzed. A framework for classifying mixed-type data in imbalanced datasets was proposed. It included internal feature mapping and selection, re-sampling, an optimized group method of data handling using convex and stochastic optimizations, a new cost function for imbalanced data, and an internal validation. Its performance was assessed using hold-out and 4-fold cross-validation. Four other classifiers, namely support vector machines, decision tree, multilayer perceptron neural network, and multiple logistic regression, were also used. The average sensitivity, specificity, precision and accuracy of the proposed system were 93%, 94%, 94% and 92%, respectively, in cross-validation. It significantly outperformed the other classifiers and also showed excellent agreement and high correlation with the gold standard. A non-invasive, economical version of the algorithm, suitable for low- and middle-income countries, was also implemented. It is thus a promising new tool for the prediction of dyslipidemia.


Introduction
Strengthening the capacity of all countries for early warning and health risk reduction is one of the targets of the Sustainable Development Goal (SDG) #3. Non-communicable diseases (NCDs) have adverse human, social and economic consequences in all societies. Also, the first target of the global NCD Action Plan is "A 25% relative reduction in the overall mortality from cardiovascular diseases, cancer, diabetes, or chronic respiratory diseases" [1]. Coronary heart diseases (CHDs) are the leading cause of death and disability in many countries, including Iran [1,2]. Dyslipidemia, the disorder of lipoprotein metabolism resulting in high lipid profile, is a major risk factor of CHD [3]. It is related to more than four million deaths per year [4]. The accurate and reliable prediction of dyslipidemia is thus important in targeting SDG #3 and NCD Action Plan #1.

Computational and Structural Biotechnology Journal 16 (2018) 121-130
Metabolic risk factors including dyslipidemia are the most important determinants of emerging NCDs worldwide [5,6]. Dyslipidemia is, in fact, an important modifiable risk factor for CHD [7]. Although significant adverse health outcomes in childhood are not associated with dyslipidemia, the literature shows a link between childhood dyslipidemia and the occurrence of atherosclerosis and its follow-up in adulthood [8,9]. Not only will 40-55% of children with dyslipidemia have hyperlipidemia during adulthood [10], but subclinical atherosclerotic abnormalities, resulting in cardiovascular disease (CVD) events, also occur in childhood [11]. Prediction of and screening for dyslipidemia, an important CVD risk factor, in children and adolescents are thus critical [12].
Some studies in the literature assessed the genetic risk for dyslipidemia [13,14]. In such studies, statistically significant dyslipidemia predictors were identified, but no actual prediction (or classification) was performed. Computer-aided diagnosis (CAD), on the other hand, could use risk factors to predict whether a subject is at high risk or not. CAD, which uses data mining to interpret medical information, could improve the diagnosis accuracy [15]. CAD is in fact used as a second opinion by physicians to make the final diagnosis or prognosis decision [16-18].
Two methods were proposed in the literature to predict dyslipidemia in adults [19,20]. Wang et al. [19] analyzed 8914 subjects aged 35-78 years (with a prevalence of dyslipidemia of about 46%). The predictors age, gender, occupation, education, marital status, physical activity, individual income, waist circumference, smoking, family history of dyslipidemia, and diet were used to predict dyslipidemia (high TC or TG, or low HDL-C [21]). Artificial neural network (ANN) and multiple logistic regression (MLR) models were used, and a sensitivity, specificity, and precision of 90%, 77%, and 76%, respectively, were obtained in the hold-out (75%) internal validation.
Costanza and Paccaud [20] analyzed 2549 subjects aged 35-64 years (with a prevalence of dyslipidemia of about 43%). The predictors waist-to-hip circumference ratio (WHR), body mass index (BMI), gender, age, current cigarette smoking, and high blood pressure were used, and dyslipidemia (total serum cholesterol to high-density lipoprotein cholesterol (TC/HDL-C) ratio ≥5.0) was predicted using different data mining methods, namely linear and logistic regressions, and regression and classification trees. A sensitivity, specificity, and precision of 70%, 77%, and 69%, respectively, were obtained in the hold-out external validation.
Although the prediction methods proposed in [19,20] are simple and effective, and thus worthwhile for identifying people at high risk of dyslipidemia based on demographic, dietary, life-style, and anthropometric data, an optimal prediction is still required. Genome-based prediction of diseases has recently become a focus in bioinformatics [22]. Identifying genetic mutations could assist in choosing the optimal patient treatment. In fact, many methods exist to reveal such mutations, including next-generation sequencing and future commercially available kits [23]. Moreover, in reliable clinical systems, critical criteria regarding statistical errors, precision, and DOR (Diagnostic Odds Ratio) must be met [24]. Finally, considering ethnic differences in life-style, environmental factors and genetic background, examining gene polymorphisms associated with dyslipidemia in each ethnic group is important [13].
The purpose of our work is thus to design an accurate and reliable system for the prediction of dyslipidemia using gene mutations, family history of diseases and anthropometric indicators in a nationally representative sample of the pediatric population in the Middle East and North Africa (MENA). To the best of our knowledge, this is the first study of its kind for genome-based dyslipidemia prediction using data mining.

Study population
The third study of a school-based surveillance system known as the Childhood and Adolescence Surveillance and Prevention of Adult Non-communicable Disease (CASPIAN) was conducted in Iran as the national survey of school students with high-risk behaviors (2009-2010) [25]. The CASPIAN-III study was described in detail elsewhere [25]. Here, it is briefly described.
Among young people, long-term changes in disease patterns follow rapid modifications in lifestyle, nutrition, and physical activity. Iranian youths are experiencing such lifestyle changes, making them prone to risk factors of chronic diseases such as NCDs. Surveilling such factors is important for long-term national planning based on monitoring NCD-related risk factors from childhood to adulthood. A school-based surveillance system entitled the CASPIAN Study was implemented in Iran in 2003-2004. The surveys have been repeated every 2 years, with blood sampling for biochemical factors every 4 years.
This study was performed among 5570 students, sampled from 27 provinces of Iran. All students and their parents gave informed consent to the experimental procedure. The study was approved by the Isfahan University of Medical Sciences Panel on Medical Human Subjects and conformed to the Declaration of Helsinki.
We randomly selected 725 frozen whole blood samples for genome analysis from children and adolescents (48% male, 42% prevalence of dyslipidemia) taken from the CASPIAN-III study. This sample size was estimated based on the sample-size estimation method proposed by Hajian-Tilaki [27]. The total required sample size (N) could be estimated from the target sensitivity (Se_e) and specificity (Sp_e) using Eq. (1), where α is the significance level, Prev is the prevalence of the disease in the population and d is the precision of the estimate (i.e., the maximum marginal error). The number of subjects in the case (n_case) and control (n_controls) categories could then be estimated using Eq. (2). The parameters Se_e and Sp_e were set to 70% and 77%, respectively, based on the literature [20]. The prevalence of dyslipidemia in the Iranian population was hypothesized as about 42% [6,28], and the parameters α and d were both set to 0.05 [29]. Thus, a sample size of 725 (n_controls = 418, n_case = 307) sufficed.

[32], CETP (A373P [rs5880]) [33,34], apolipoprotein C-3 APOC3 (SstI [rs5128]) [35], apolipoprotein A-1 APOA1 (MspI [rs2893157]) [36], apolipoprotein A-5 APOA5 (C-1131T [rs662799]) [37] and apolipoprotein-E ApoE genes [38,39], appearing to relate to lipid profile disorders and (or) cardiovascular diseases, were investigated [3,40].
Subjects' peripheral blood was analyzed using the QIAamp DNA Blood Mini kit (Qiagen, Germany) and DNA was extracted following the manufacturer's protocol [41]. Corbett Rotor-Gene 6000 instruments (Corbett Research Pty Ltd, Sydney, Australia) were used for real-time PCR and high-resolution melt analysis [42]. The details of the latter analysis are given in the Supplementary material S1.
Alleles of the genotypes were analyzed. Typically, only two out of the four possible nucleotides occur, and each sample contains a pair of every autosome. Alternatively, the carrier and non-carrier genes were represented as a binary variable for each genotype. For example, for the SNP rs320, the nucleotide pairs GG and TG/GT, containing the minority nucleotide G, were considered as 'carrier', while the TT pair was set to 'non-carrier'. Thus, two feature sets (nucleotide pairs, and carrier/non-carrier variables) were considered for further analysis.
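The carrier/non-carrier coding described above can be sketched as follows. This is a minimal illustration only: the function name `encode_carrier` and the string representation of genotypes are assumptions, not part of the original pipeline.

```python
def encode_carrier(genotype: str, minor_allele: str) -> int:
    """Return 1 ('carrier') if the genotype holds at least one copy of the
    minor allele, else 0 ('non-carrier'). Hypothetical helper for illustration."""
    pair = genotype.strip().upper()
    if len(pair) != 2:
        raise ValueError(f"expected a two-letter genotype, got {genotype!r}")
    return int(minor_allele.upper() in pair)


# Example for rs320, where G is the minority nucleotide:
# TT is a non-carrier, while TG, GT and GG are carriers.
rs320_genotypes = ["TT", "TG", "GT", "GG"]
carrier_flags = [encode_carrier(g, minor_allele="G") for g in rs320_genotypes]
```

The nucleotide-pair feature set would keep the genotype itself as a categorical variable, whereas this binary flag corresponds to the second feature set.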

Other analyzed features
The anthropometric information was recorded by a team of trained health care professionals, and the examinations were conducted under a standard protocol using calibrated instruments. Weight was measured to the nearest 200 g with subjects barefoot and lightly dressed. BMI was calculated as weight (kg) divided by height squared (m²). Waist circumference (WC) was measured using a non-elastic tape to the nearest 0.2 cm at the end of expiration, at the midpoint between the top of the iliac crest and the lowest rib, in standing position [25].
The anthropometric and life-style attributes such as age, sex, hypertension (either high systolic blood pressure (SBP) (≥90th percentile for age, sex and height) or high diastolic blood pressure (DBP) (≥90th percentile for age, sex and height) [43]), abdominal obesity (defined as a waist-to-height ratio (WHtR) equal to or more than 0.5 [44]), BMI categories (underweight, normal, overweight and obese, defined using WHO growth curves [45]) and physical activity (low, moderate, and severe categories [46]), as well as the family history of diabetes, obesity, CVD, and cancer, and birth weight (<2500 g (low), 2500 g-4000 g (medium), and >4000 g (high) categories) were also included.
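The derived anthropometric features defined above can be illustrated with a short sketch. The function names are hypothetical; the thresholds (WHtR ≥ 0.5; birth weight cut-points of 2500 g and 4000 g) are those stated in the text. The BMI category itself depends on age- and sex-specific WHO growth curves and is therefore not reproduced here.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight (kg) / height squared (m^2)."""
    return weight_kg / height_m ** 2


def has_abdominal_obesity(waist_cm: float, height_cm: float) -> bool:
    """Abdominal obesity: waist-to-height ratio (WHtR) >= 0.5."""
    return waist_cm / height_cm >= 0.5


def birth_weight_category(grams: float) -> str:
    """Low (<2500 g), medium (2500-4000 g) or high (>4000 g) birth weight."""
    if grams < 2500:
        return "low"
    if grams <= 4000:
        return "medium"
    return "high"


example_bmi = body_mass_index(50.0, 1.60)
example_obesity = has_abdominal_obesity(waist_cm=82.0, height_cm=160.0)
example_birth = birth_weight_category(3200)
```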

Pre-processing
The dataset was split into estimation and validation sets (together known as the training set) and a test set (40%, 10%, and 50%, respectively, in a hold-out validation setting). The input variables were grouped based on their interval or categorical measurement scales [47]. The categorical group consisted of nominal (such as sex) and ordinal (such as birth order) variables. The interval features were then transformed using the robust Z-score measure [48,49]. In this transformation, the median and MAD (median absolute deviation) of each feature were estimated; the median was then subtracted from each feature, and the result was divided by the MAD value. Such features were then normalized between zero and one for further processing.
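As an illustration of the robust Z-score followed by zero-one normalization, a minimal sketch for one interval feature is given below. It assumes that the median, the MAD and the normalization range are all estimated on the training data and then reused for new samples; this choice, the function names, and the clipping of out-of-range values are assumptions made for the example.

```python
from statistics import median


def fit_robust_minmax(train_values):
    """Estimate median, MAD and the range of the robust Z-scores on training data."""
    med = median(train_values)
    mad = median([abs(v - med) for v in train_values]) or 1.0  # guard against MAD == 0
    z = [(v - med) / mad for v in train_values]
    return {"median": med, "mad": mad, "z_min": min(z), "z_max": max(z)}


def robust_minmax(value, params):
    """Robust Z-score of one value, rescaled to [0, 1] with the training range."""
    z = (value - params["median"]) / params["mad"]
    span = (params["z_max"] - params["z_min"]) or 1.0  # guard against a constant feature
    scaled = (z - params["z_min"]) / span
    return min(1.0, max(0.0, scaled))  # clip values outside the training range


train_feature = [1.0, 2.0, 3.0, 4.0, 100.0]  # toy data with one outlier
scaler = fit_robust_minmax(train_feature)
scaled_train = [robust_minmax(v, scaler) for v in train_feature]
```

Because the median and MAD are insensitive to the outlier, the centering and scaling constants are driven by the bulk of the data rather than by the extreme value.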
For each categorical feature, indicator variables were constructed. Each takes the value 0 or 1 to indicate the absence or presence of a category. A logit transformation was performed on each indicator variable, whose intercept and slope parameters were estimated using Maximum Likelihood Estimation (MLE) on the training set [50]. Thus, each indicator variable was expressed as a continuous value between zero and one. Such processed features are referred to as "predictors" from now on. The number of predictors was N_p.
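One way to read this mapping is as a univariate logistic model, logit P(class = 1 | x) = intercept + slope · x, fitted per indicator variable x. The sketch below rests on that assumption; the exact parameterization used in the study is not specified here. For a single 0/1 indicator the model is saturated, so the MLE can be written in closed form from the class proportions at each level, and no iterative solver is needed. The names and the clipping constant `eps` are hypothetical.

```python
import math


def fit_indicator_logit(indicator, labels, eps=1e-6):
    """Closed-form MLE of (intercept, slope) for a logistic model with one
    0/1 indicator as the only input. Proportions are clipped to (eps, 1 - eps)
    so that the logit stays finite."""
    def clipped_rate(level):
        ys = [y for x, y in zip(indicator, labels) if x == level]
        if not ys:
            raise ValueError(f"no training sample with indicator == {level}")
        return min(1.0 - eps, max(eps, sum(ys) / len(ys)))

    def logit(p):
        return math.log(p / (1.0 - p))

    intercept = logit(clipped_rate(0))
    slope = logit(clipped_rate(1)) - intercept
    return intercept, slope


def map_indicator(x, intercept, slope):
    """Map a 0/1 indicator to a continuous value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Toy training data: 1 of 4 positives when the category is absent, 3 of 4 when present.
train_indicator = [0, 0, 0, 0, 1, 1, 1, 1]
train_label = [0, 0, 0, 1, 1, 1, 1, 0]
b0, b1 = fit_indicator_logit(train_indicator, train_label)
predictor_values = [map_indicator(x, b0, b1) for x in (0, 1)]
```

In this toy case the mapped values equal the observed class proportions (0.25 and 0.75), which is the expected behavior of a saturated logistic fit.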

Optimized inductive learning
Group Method of Data Handling (GMDH), first proposed by Ivakhnenko [51,52], has been applied in many areas for data mining [53]. Inductive GMDH algorithms find interactions in data, select an optimal network structure and thus improve the performance of current algorithms [54]. Here we proposed an optimized GMDH method to predict dyslipidemia using mixed-type data.
Feature selection was performed with the I-RELIEF algorithm [55], which iteratively estimates feature weights based on their capability to discriminate between neighboring patterns in the framework of the Expectation-Maximization algorithm. Moreover, the parallel selective sampling (PSS) method was used to select data from the majority class so as to reduce the problems of imbalanced datasets [56].
Multilayered induction for the gradual increase of complexity was performed using different layers. Instead of a fixed regression polynomial, the nonlinear regression matrix (X) was formed between every pair (i, j) of predictors at the first layer, which has N_n nonlinear regression functions, where ⊙ is the element-by-element multiplication, a_i are the regression coefficients and N_1 is the number of samples in the training set. If we fix the regression coefficients, the Regularized Least Squares (RLS) solution to X^T × W ≈ B (B is a column vector with the class labels of the analyzed samples) could be estimated in closed form, where λ is the regularization parameter (set to 0.1 in our study), I is the identity matrix, and T is the matrix transpose operator. It could be easily shown that the optimal solution is the global minimum point of the RLS optimization [57]. In principle, it is possible to tune polynomial regression coefficients using a stochastic optimization [58]. Instead, we tune the regression coefficients used in the matrix X using Particle Swarm Optimization (PSO). PSO is a population-based metaheuristic method inspired by flocking birds [59]. The topology and the internal parameters of PSO were the same as in Mohebian et al. [15], except that the maximum number of iterations was set to 10 and the PSO fitness function was defined differently. At each PSO iteration, the random regression coefficients are used to calculate the matrix X for a predictor pair. Then, the parameter W is estimated on the training set. To avoid overfitting, the estimated weight W is used on the validation set to estimate the output of the analyzed pair in the validation set. The cut-off of 0.5 was then used to estimate the parameters of signal detection theory, namely True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).
Then, the parameters Sensitivity (= TP/(TP + FN)), Specificity (= TN/(TN + FP)) and Precision (= TP/(TP + FP)) are estimated, and their average is used as the fitness function. The PSO method usually converged within a few iterations due to the internal RLS optimization.
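The inner step of this procedure (fit W by RLS for a fixed matrix X, score the validation set, and compute the fitness) can be sketched as below. The sketch assumes the standard ridge closed form W = (X Xᵀ + λI)⁻¹ X B, which minimizes ||Xᵀ W − B||² + λ||W||²; the outer PSO loop and the actual nonlinear regression functions are omitted. The row layout of X (one row per regression function, one column per sample), the `>=` comparison at the cut-off, and all names are assumptions.

```python
def rls_weights(X, B, lam=0.1):
    """Solve (X X^T + lam*I) W = X B by Gauss-Jordan elimination.
    X: list of rows, one per regression function, one column per sample."""
    n = len(X)
    A = [[sum(a * b for a, b in zip(X[i], X[j])) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    rhs = [sum(x * b for x, b in zip(X[i], B)) for i in range(n)]
    M = [row + [r] for row, r in zip(A, rhs)]  # augmented matrix [A | rhs]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def predict(X, W):
    """Output X^T W for every sample (column) of X."""
    return [sum(w * row[k] for w, row in zip(W, X)) for k in range(len(X[0]))]


def fitness(scores, labels, cutoff=0.5):
    """Mean of sensitivity, specificity and precision at the given cut-off."""
    tp = tn = fp = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= cutoff else 0
        tp += pred == 1 and y == 1
        tn += pred == 0 and y == 0
        fp += pred == 1 and y == 0
        fn += pred == 0 and y == 1
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    pr = tp / (tp + fp) if tp + fp else 0.0
    return (se + sp + pr) / 3.0


# Toy example: a constant row plus one feature row.
X_train, B_train = [[1, 1, 1, 1], [0, 1, 2, 3]], [0, 0, 1, 1]
X_val, B_val = [[1, 1], [0.5, 2.5]], [0, 1]
W = rls_weights(X_train, B_train, lam=0.1)
fitness_value = fitness(predict(X_val, W), B_val)
```

Because the weights are obtained in closed form for each candidate set of regression coefficients, the stochastic search only has to explore the coefficients of X, which is consistent with the fast convergence noted above.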
The selection pressure of the network was set to 0.7 in our study. Thus, the best 70% of the pairs were selected for each layer. The approximating function of each selected pair was used as a new feature at the next layer [54]. The number of layers was estimated based on the required number of interactions. In the case of N_i interval features and N_d indicator variables, it was hypothesized as 1 + round(log2(N_i + N_d)). At the last layer, the best approximation function was used as the output of the classification system. The overall structure of the proposed prediction system was shown in the Supplementary material S2.
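The two structural rules above (number of layers and selection pressure) can be written compactly as follows. The dictionary of pair scores and the function names are hypothetical, and Python's `round` resolves exact .5 ties to the nearest even integer, which may differ from the rounding convention used in the study.

```python
import math


def number_of_layers(n_interval: int, n_indicator: int) -> int:
    """1 + round(log2(N_i + N_d))."""
    return 1 + round(math.log2(n_interval + n_indicator))


def select_pairs(pair_scores: dict, pressure: float = 0.7) -> list:
    """Keep the best fraction `pressure` of the candidate pairs, ranked by fitness."""
    keep = max(1, round(pressure * len(pair_scores)))
    ranked = sorted(pair_scores, key=pair_scores.get, reverse=True)
    return ranked[:keep]


# Toy example: 10 candidate pairs with increasing fitness.
toy_scores = {(i, i + 1): i / 10 for i in range(10)}
selected = select_pairs(toy_scores, pressure=0.7)
layers = number_of_layers(n_interval=3, n_indicator=13)
```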

State-of-the-art
In our study, other classification methods, namely multilayer perceptron (MLP), MLR and decision tree (DT), as proposed in other studies [19,20], were used for comparison. Support vector machines (SVM) were also used for comparison. An MLP, a feed-forward artificial neural network (ANN) model mapping sets of inputs onto a set of outputs [60], with one hidden layer of ten neurons and the sigmoid activation function [61] was used. An SVM, constructing a hyperplane in a high-dimensional space [62], with radial basis function (RBF) kernels was used. The soft-margin parameter and the radius of the RBF kernel were tuned using the method proposed by Wu and Wang [63]. DT, building classification models in the form of a tree structure [64], uses entropy to calculate the homogeneity of samples to build the tree. The statistical classifier C4.5 with pruning (i.e., removing redundant subtrees) was used in our study [65]. The best splitting attribute is determined at each node. MLR uses the linear regression model with the logit link function for the prediction.
After fitting the model [66] by estimating the model parameters, each case with an estimated class probability higher than 50% was classified as having dyslipidemia, and as normal otherwise. In fact, DT and MLR could select relevant features because of their internal statistical validation. For MLP and SVM, the Sequential Forward Selection (SFS) method, a bottom-up search procedure [67], was used for feature selection.
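The bottom-up nature of SFS can be illustrated with a generic sketch. The scoring callable stands for any wrapper criterion (for instance, validation accuracy of an MLP or SVM trained on the candidate subset); the stopping rule used here (stop when no candidate improves the score) and the toy scoring function are assumptions, not the settings of the study.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedy bottom-up search: start from the empty set and repeatedly add the
    feature that maximizes `score(subset)`, stopping when nothing improves."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    limit = max_features or len(remaining)
    while remaining and len(selected) < limit:
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining feature improves the criterion
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Toy criterion: each feature has a fixed usefulness, minus a penalty per feature.
usefulness = {"a": 0.3, "b": 0.2, "c": 0.0}


def toy_score(subset):
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)


chosen, chosen_score = sequential_forward_selection(["c", "b", "a"], toy_score)
```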

The performance indices for each classifier
The performance of the classifiers was determined using the hold-out method, where the dataset was split into two mutually exclusive sets (50% training and 50% test). The classifiers were then trained on the training set and tested on the test set [68]. Moreover, 4-fold cross-validation (60% estimation, 15% validation, 25% test in each analysis fold) was used to test the best classifiers to control a possible biased error estimate [67]. A variety of performance indices [15,69,70] were reported for the analyzed classifiers. Such indices, along with their definitions, were shown in Table 1, among which MCC is a single unbiased performance measure in balanced as well as imbalanced datasets [71]. It is related to the chi-square statistic and is also known as the phi-coefficient, a measure of association for two binary variables (predicted versus observed gold-standard class) that could be interpreted as the correlation coefficient between those binary variables [72]. The interpretation of the reference intervals of the indices AUC ROC [73], Kappa [74], MCC [75] and DP [69,76] was listed in Supplementary material S3.
A diagnosis system was considered clinically reliable based on its Type I and II statistical errors [77], False Discovery Rate (FDR = 1 - Precision) [78], and DOR [79] if it fulfilled all of the following conditions: a minimum Sensitivity, Specificity, Precision and DOR of 80%, 95%, 95% and 100, respectively.
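The reliability criteria can be checked directly from a confusion matrix, as in the sketch below. It uses the standard definitions DOR = (TP·TN)/(FP·FN) and MCC = (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)); the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
import math


def performance_indices(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, FDR, DOR and MCC from confusion counts."""
    se, sp, pr = tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)
    dor = (tp * tn) / (fp * fn) if fp and fn else math.inf
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": se, "specificity": sp, "precision": pr,
            "fdr": 1.0 - pr, "dor": dor, "mcc": mcc}


def is_clinically_reliable(m):
    """All four minimum conditions must hold: Se 80%, Sp 95%, Pr 95%, DOR 100."""
    return (m["sensitivity"] >= 0.80 and m["specificity"] >= 0.95
            and m["precision"] >= 0.95 and m["dor"] >= 100)


# Hypothetical counts: 100 cases and 100 controls.
borderline = performance_indices(tp=93, tn=95, fp=5, fn=7)
reliable = performance_indices(tp=96, tn=97, fp=3, fn=4)
```

With the first set of counts the DOR is about 252, yet the precision (about 94.9%) falls just below the 95% threshold, so the system would not count as fully reliable; this shows that the four conditions are not redundant.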

Comparison between different classifiers
When different classifiers are compared with the gold standard, the superiority of one method to another must be presented using a proper statistical test. Otherwise, insignificant improvements might be erroneously reported as important [70]. McNemar's test, also known as the Gillick test, was used to compare the performance of two classifiers [67,80].
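A minimal sketch of McNemar's test for two classifiers evaluated on the same test subjects is given below. It assumes the continuity-corrected chi-square form with one degree of freedom; whether the corrected or the exact variant was used in the study is not stated, so this is an illustration only.

```python
import math


def mcnemar_test(truth, pred_a, pred_b):
    """Continuity-corrected McNemar test on the discordant pairs.
    Returns (chi-square statistic, two-sided P-value)."""
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(1 for t, a, c_ in zip(truth, pred_a, pred_b) if a == t and c_ != t)
    c = sum(1 for t, a, c_ in zip(truth, pred_a, pred_b) if a != t and c_ == t)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree in correctness
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival function, 1 df
    return stat, p_value


# Toy example: classifier A is always right, classifier B errs on 10 of 20 subjects.
truth = [1] * 10 + [0] * 10
pred_a = list(truth)
pred_b = [0] * 10 + [0] * 10
statistic, p_value = mcnemar_test(truth, pred_a, pred_b)
```

Only the discordant pairs enter the statistic, which is why two classifiers with similar accuracy can still differ significantly, or not, depending on where their errors fall.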

Statistical analysis
Results are reported as mean ± standard deviation (for interval variables) and frequencies (for categorical variables). The pairwise χ² analysis was used to test for differences in allele frequencies (and nominal features) between the dyslipidemia and normal groups; when the Cochran conditions were not met, the Fisher exact test was used. The χ² analysis was also used to test deviations of the genotype frequencies from those predicted by the Hardy-Weinberg equation. P-values less than 0.05 were considered significant. The entire data processing was performed off-line using Matlab version 8.6 (The MathWorks Inc., Natick, MA, USA). The statistical analysis and calculations were performed using the SPSS statistical package, version 16.0 (SPSS Inc., Chicago, IL, USA).
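The Hardy-Weinberg check can be illustrated for a biallelic SNP as follows. The expected genotype counts are n·p², 2n·p·q and n·q², with p estimated from the observed allele counts, and the χ² statistic has one degree of freedom. The genotype counts in the example are hypothetical.

```python
import math


def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of observed genotype counts against
    Hardy-Weinberg expectations. Returns (statistic, P-value)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival function, 1 df
    return stat, p_value


# Hypothetical counts: the first sample is in equilibrium, the second lacks heterozygotes.
in_equilibrium = hardy_weinberg_chi2(25, 50, 25)
deviating = hardy_weinberg_chi2(40, 20, 40)
```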

Results
The average age of the participants was 14

, ApoE, family history of diabetes, obesity, cancer, and CVD. Set 2 included Set 1 and birth weight, age, and physical activity. We also considered Set 3, in which easily measured features were analyzed, i.e., sex, age, physical activity, birth weight, BMI category, abdominal obesity, and family history of diabetes, obesity, cancer, and CVD. The hold-out (50%) validation of the proposed method as well as the base learners DT, MLP, MLR, and SVM was performed in each feature subset, and the results of the classifiers on the test set were shown in Table 4.
In each feature subset, the proposed method significantly outperformed the base learners (DT, MLP, MLR, and SVM) (P-value < 0.05). In the third subset, none of the base learners rejected the null hypothesis of an accidental agreement. Moreover, in such classifiers, the AUC ROC was not significant (P-value < 0.05) showing

Table 1
The classification performance measures used in our study.

, age, birth weight, family history of obesity and for Set 3 were abdominal obesity, birth weight, physical activity, family history of diabetes, and BMI category. The performance of the best classifier in each subset (i.e., the proposed classifier) was further assessed using 4-fold cross-validation (Table 5). The proposed prediction system showed limited discriminant power (DP = 1.3), excellent diagnosis accuracy (AUC ROC = 0.94), excellent agreement with the gold standard (Kappa = 0.87) and high correlation with the gold standard (MCC = 0.87) on the second subset (Table 4). The average statistical power and Type I error (α) were 93% and 0.07, respectively, based on the cross-validation on the second subset (Table 5). The training time of the proposed system was 26.1 ± 2.2 s, 33.6 ± 3.0 s and 20.5 ± 3.1 s in the first, second, and third subsets, respectively. The average running time was the average of 3 runs over 363 subjects in the training set (hold-out 50%) on an Intel Core i7-6500U CPU with 8 GB of RAM.

Discussion
Identifying high-risk children based on gene polymorphisms (sets 1 and 2), in the first place, is useful for further dietary and life-style treatments and screening. Using life-style, anthropometric indicators and family history of diseases (set 3), on the other hand, could identify the high-risk population in low-income countries.

The risk factors of dyslipidemia
Although the environment is very important in the development of dyslipidemia, genetic components are also critical [81]. CETP TaqIB [rs708272] was selected by the proposed dyslipidemia prediction system in both sets 1 and 2. In the literature, genome-wide association studies (GWAS) in adults showed a high correlation between CETP and plasma lipid concentrations [82]. However, such an association is less distinct in children [33,83]. It was shown in the literature that such a mutation has a protective effect against dyslipidemia [33] and myocardial infarction (MI) [84]. This was in agreement with our findings, where the OR of CT/TT vs. CC was 0.15 (P-value < 0.001) (Table 3).
ApoE was also selected in both sets. ApoE, playing an important role in lipid metabolism, has three isoforms, Apo-e2, Apo-e3, and Apo-e4, which are encoded by three alleles of the gene. It was shown in the literature that ApoE, and particularly its e4 isoform, is associated with plasma lipid parameters and CVD risks [85,86]. Similarly, in our study, the prevalence of dyslipidemia was 85% in subjects with ApoE-e4 isoforms. Moreover, the OR of e2/e4 vs. e3 was 1.73 (P-value < 0.001) (Table 3).
ABCA1 R1587K [rs2230808] was the other selected feature in both sets 1 and 2. Several ABCA1 gene polymorphisms, including R1587K [rs2230808], have been identified. Dean et al. showed that this SNP is associated with the HDL-C concentration [87], thus affecting dyslipidemia. In our study, the OR of AG/GG vs. AA was 2.21 (P-value < 0.001) (Table 3). Thus, such polymorphisms increased the risk of dyslipidemia.

Table 2
Characteristics of the participants in the dyslipidemia and normal groups.

D9N is a predictor of CVD risk directly and through its interaction with TaqIB [30]. In fact, LPL is involved in triglyceride-rich lipoprotein metabolism and lipoprotein remodeling including HDL [88,89]. Similarly, in our study, the OR of AG/GG vs. AA was 2.59 (P-value = 0.003) (Table 3).
The family history of obesity was another common feature. Valdez et al. indicated that people who have one or more relatives with diabetes or CVD have a high risk of such problems [90]. Such diseases have common risk factors, such as obesity and dyslipidemia, sharing etiology [91]. FH of obesity, however, had a poor agreement rate with FH of diabetes in our database (Cohen's Kappa = 0.24; P-value < 0.05). FH of diabetes was selected in the first and third subsets, though. The prevalence of dyslipidemia in subjects without FH of obesity and diabetes was 43% and 41%, respectively.
Birth weight was a selected feature for subsets 2 and 3. Rodríguez Vargas et al. showed that high birth weight is not a risk factor for hypercholesterolemia or HDL and LDL-cholesterol esters, but is positive for TG [92]. In our study, the ORs of the low and high birth weight categories were more than one, but not significant (Table 3). The prevalence of dyslipidemia in the abnormal and normal birth weight groups was 45% and 41%, respectively.
CETP A373P [rs5880] was selected in the first set. Agerholm-Larsen et al. indicated that such a polymorphism is associated with decreased HDL-C [93]. Heidari-Beni et al. showed that HDL-C levels were significantly lower among those with the CETP A373P [rs5880] polymorphism [33]. In our study, the OR of CG/GG vs. CC was 4.12 (P-value < 0.001) (Table 3).
APOA5 C-1131T [rs662799] was another selected SNP in the first set. Wang et al. indicated that this polymorphism is associated with dyslipidemia and the severity of CHD [94]. In our dataset, the OR of AG/GG vs. AA was 1.93, but it was not significant due to the small sample size of carrier genotypes (P-value = 0.525) (Table 3).
Radha et al. found an association between LPL HindIII [rs320] SNP with low HDL-C and elevated TG levels [95]. Song et al. indicated a significant association between the APOC3 SstI [rs5128] polymorphism and higher levels of TG, TC, and LDL-C [35]. Albahrani et al. showed that APOA1 MspI [rs2893157] polymorphism is associated with CVD risk [36]. We did not find such an increased risk of dyslipidemia for LPL HindIII [rs320], APOC3 SstI [rs5128] and APOA1 MspI [rs2893157] SNPs. However, Odds (dyslipidemia| GG) was 1.5 in LPL HindIII [rs320] showing that this was possibly a good feature for the proposed classifier.
Due to the small sample size of AA alleles in APOA1 MspI [rs2893157] and GG alleles in APOC3 SstI [rs5128] (Table3), no significant association between such polymorphisms and the risk of dyslipidemia was found.
Anthropometric indices such as BMI and WHtR were shown to be associated with dyslipidemia in children and adolescents in the literature [96]. In our study, people with abdominal obesity had 4.76 times the risk of dyslipidemia (OR = 4.76; P-value < 0.001) compared with those without such obesity (Table 2). Moreover, overweight and obese subjects had a higher risk of dyslipidemia compared with normal BMI subjects (Table 2). In fact, WHtR and BMI were moderately correlated (r = 0.737; P-value < 0.001). WHtR was poorly correlated with TG (r = 0.257; P-value < 0.001), while BMI was poorly correlated with SBP (r = 0.248; P-value < 0.001) and TG (r (Pearson's correlation) = 0.293; P-value < 0.001). These could be the reasons why BMI and WHtR were selected by the proposed classifier on the third set.
Panagiotakos et al. showed that lipid profile disorders are correlated with physical activity [97]. In our dataset, the ORs of high and low physical activity compared with moderate activity were 0.60 (P-value < 0.001) and 2.03 (P-value < 0.001), respectively (Table 2). Physical activity was poorly correlated with HDL levels (ρ (Spearman's correlation) = 0.252; P-value < 0.001). That could support its selection on the third set. Age was selected in the second set. Age was shown to be an independent predictor of dyslipidemia in children and adolescents [26]. Although age was directly used in the second set, age and sex are indirectly required for dyslipidemia prediction on the third set, since the identification of the BMI category in children and adolescents depends on growth-curve charts that are gender and age specific [45].

Application in health policy making
The proposed automatic diagnosis of dyslipidemia on the third set is indeed an effective screening system. It used the input features of abdominal obesity, birth weight, physical activity, family history of diabetes, and BMI category. It includes therapeutic life-style change (e.g., dietary therapy, and increased physical activity), before necessary pharmacologic interventions [98]. In fact, the primary treatment for dyslipidemia in children and adolescents is such a life-style change [26]. Although the proposed system on set 3 is not a fully clinically reliable system (Type I error of 16% and FDR of 21%), it could possibly be used in low- and middle-income countries where genomics is not possible for a large population. Moreover, embedding the prediction system into a public online web interface is useful in health promotion programs [15,99], which will be the focus of our future work.

The Properties and Performance of the proposed system
The proposed system for dyslipidemia prediction in subset 2 showed promising results regarding a variety of performance indices (Tables 4 and 5). The statistical power, Type I error, FDR and DOR of the proposed system were 93%, 0.05, 7%, and 252, respectively (Table 4). Thus, the proposed system fulfilled the criteria of a clinically reliable system, except that it exceeded the maximum allowed FDR of 5% by 2%. We considered a variety of performance indices introduced in the literature (Tables 1 and 2), and also the Standards for Reporting Diagnostic Accuracy (STARD 2015) and its extensions [70,100] in reporting the results. Guarding against testing hypotheses suggested by the data (Type III er

Selecting only one kind of lipid disorder, such as a high total cholesterol/HDL-C ratio rather than dyslipidemia, could facilitate the interpretation of the results [20]. However, dyslipidemia contributes to cardio-metabolic risks in children and adolescents [102]. Moreover, in addition to cholesterol and HDL-C [103], triglyceride [104] and LDL-C [105] were shown to be important CVD risk factors. Thus, the outcome of the proposed system was dyslipidemia. We also considered the high total cholesterol/HDL-C ratio outcome in our study, and the selected features in feature set 1 were ABCA1 (R1587K [rs2230808]), CETP (A373P

Further application of the proposed classification system
The proposed dyslipidemia prediction system made use of the following properties: I) mapping the mixed-data types to interval data using the logit function, II) I-RELIEF feature selection, III) PSS random sampling for imbalanced datasets, IV) the involvement of feature interactions proposed by GMDH, V) using the nonlinear regression matrix instead of a fixed regression polynomial, VI) using inner-loop RLS instead of LS, VII) using outer-loop PSO for stochastic optimization, VIII) using estimation, validation and test sets to avoid over-fitting, IX) internal cross-validation on the training set (estimation plus validation set) to improve generalization capability, and X) a proper cost function, the mean of Se, Sp, and Pr, suitable for imbalanced datasets.
In fact, the proposed system could be regarded as a general framework for two-class classification of imbalanced mixed-type data, given that it is successfully tested on different datasets. The following datasets were used for validation of the proposed framework: Wisconsin Breast Cancer (BCW), Pima Indian Diabetes (PIM), Glass [106], and Hepatitis [107]. The performance of the proposed framework on such datasets was shown in Supplementary material S4.

Table 4
, ApoE, family history of diabetes, obesity, cancer, and CVD. Set 2 included Set 1 and birth weight, age, and physical activity. Set 3 included sex, age, physical activity, birth weight, BMI category, abdominal obesity, and family history of diabetes, obesity, cancer, and CVD. The classifiers were trained on the same training set and then validated on the test set, and the results of the classifiers on the test set were shown. * Non-significant (P-value > 0.05).

Table 5
The four-fold cross-validation results of the proposed prediction system in mean ± SD.

Final considerations
One limitation of the current study is that it was a retrospective study. Sources of error are more common in such studies than in prospective studies because of bias and possible confounders [108]. Also, the sample size must be increased so as to improve the statistical power of our diagnosis system [109]. Moreover, instead of testing a small number of pre-specified genetic regions, GWAS could be used to examine a genome-wide set of genetic variants in different individuals. For instance, more-prevalent mutations in the LDL receptor (LDLR) gene were associated with dyslipidemia such as familial hypercholesterolemia, which is associated with early severe atherosclerosis and CAD [110]. In our study, the NHLBI guideline was used to define dyslipidemia in children and adolescents. However, other standards such as the American Heart Association (AHA) guideline [111] exist. The AHA guideline has different cut-points for TG and HDL-C. It also does not have a non-HDL-C criterion. Using the AHA guideline, the class labels might change, thus affecting the proposed classification system. Finally, external validation (i.e., assessing the performance of the model on datasets from different institutions) is required in addition to an internal validation (i.e., hold-out and cross-validation) [112]. Unlike Costanza and Paccaud, who rightfully used external validation in assessing their proposed lipid-disorder prediction model [20], other studies in this field such as Wang et al. [19] and our study, as well as many studies in other data mining areas in the literature, have only traditional internal cross-validation. This is the other limitation of our study.

Conclusions
In conclusion, we proposed a computer-aided diagnosis system to predict dyslipidemia whose performance was assessed using different criteria and in different validation frameworks. It is accurate and precise and could be possibly used for screening and risk assessment in the health promotion programs for children and adolescents. The developed framework is available to interested readers upon request.