Coronary Artery Disease Diagnosis: Ranking the Significant Features Using a Random Trees Model

Heart disease is one of the most common diseases among middle-aged people. Among the many types of heart disease, coronary artery disease (CAD) is a common cardiovascular disease with a high death rate. The most popular tool for diagnosing CAD is medical imaging, e.g., angiography. However, angiography is costly and associated with a number of side effects. Hence, the purpose of this study is to increase the accuracy of coronary heart disease diagnosis by selecting significant predictive features in order of their ranking. We propose an integrated method using machine learning: the random trees (RTs) model, the C5.0 decision tree, the support vector machine (SVM), and the Chi-squared automatic interaction detection (CHAID) decision tree. The proposed method shows promising results, and the study confirms that the RTs model outperforms the other models.


Introduction
Today, we face huge amounts of data in industry as well as in organizations such as healthcare [1,4]. Since data collection and analysis are difficult, time consuming, and costly, we are always looking for ways to use data optimally in order to reach correct decisions, such as the diagnosis and testing of diseases in healthcare organizations [3]. In addition, common methods such as angiography [5,6] for testing and diagnosing disease are costly and have adverse effects on patients, so healthcare researchers try to utilize methods that avoid the high cost as well as the adverse effects of previous methods; this can be achieved with computer-aided diagnosis, i.e., machine learning. In this setting, the data mining process ranks the predictive features according to their order of priority: the subset of features is ranked from the least important to the most important, based on the different weightings that the classification models assign to the features in the simulator output.
Finally, the main purpose of this study is to obtain, among the classification models used, the most appropriate feature subset with the Random trees model, yielding the best classification set and the most accurate diagnosis of coronary heart disease. As a result, in terms of accuracy, area under the curve (AUC), and the Gini criterion for CAD diagnosis, the Random trees model is the best among the prediction models compared.
The rest of the paper is organized as follows: Section 2 describes the data mining classification methods, and Section 3 reviews related works. The proposed methodology is explained in Section 4. Section 5 presents the evaluated results of the experiment. Section 6 presents the findings of the research ("Results and Discussion"), and Section 7 concludes ("Conclusion and Future Works").

Data Mining Classification Methods
In this section, we describe the classification methods used in this study: the CHAID decision tree, the C5.0 decision tree, Random trees (RTs), and the support vector machine (SVM). Of these methods, all but the SVM are based on decision trees, so rules useful for the diagnosis of CAD can be extracted from them, especially from the RTs model.

Decision tree of CHAID
The Chi-Squared Automatic Interaction Detection (CHAID) method, proposed by Kass [16], is one of the oldest tree classification methods; it is a supervised learning method that builds a decision tree from which rules can be extracted. This classification model is a statistical method based on Chi-squared automatic interaction detection. It is a recursive partitioning method: given the input features as predictors and the target class, a Chi-squared statistic is computed between the target class and each predictive feature [17][18][19], so that the predictive features are ranked in order of priority. In this way, the most significant predictors, i.e., the feature subset with the highest weights for diagnosing CAD, are obtained. Note that the selection of a significant predictor feature is based on splitting the data samples: the samples are partitioned into smaller and smaller subdivisions until an external node, i.e., a leaf, is reached [17,20]. In general, the CHAID model includes the following steps [17][18][19]:
1. Reading predictors. The first step is to create categorical predictors out of any continuous predictors by partitioning the respective continuous distributions into a number of categories with an approximately equal number of observations. For categorical predictors, the categories (target classes) are taken as given.
2. Merging categories. The second step is to cycle through the features and, for each feature, estimate the pair of categories that is least significantly different with respect to the dependent variable. In this process, the CHAID model uses two types of statistical tests. First, for classification datasets, it uses a Chi-square (Pearson Chi-square) test.
The quantities for the Chi-square test are as follows: Nij = the observed frequency for cell (i, j) of the feature-by-class table (the sample counts); Gij = the expected frequency for cell (i, j), estimated from the row and column totals of the dataset (e.g., the training dataset); Wn = the weight associated with each data sample; Df = the degrees of freedom, i.e., the number of logically independent values that are free to vary; and C = the corresponding data sample. The Pearson Chi-square statistic is then

χ² = Σi Σj (Nij − Gij)² / Gij.

Second, for regression datasets, where the dependent variable is continuous, F-tests are used instead. If the test for a given pair of feature categories is not statistically significant, as defined by an alpha-to-merge value, then the two categories are merged and this step is repeated, i.e., the next pair of categories is found (which may now include previously merged categories). If the result for the pair of feature categories is significant, i.e., the p-value is less than the alpha-to-merge value, then optionally a Bonferroni-adjusted p-value is computed for the set of categories of the feature.
3. Selecting the split variable. The third step is to select as the split variable the predictor with the smallest adjusted p-value, i.e., the predictor that yields the most significant split. The p-value is computed as p = P(χ²Df > χ²), the tail probability of the Chi-square distribution with Df degrees of freedom for the statistic obtained in step 2. If the smallest (Bonferroni-adjusted) p-value for any predictor feature is greater than the alpha-to-split value, then no further splits are made and the node is a terminal node. This process continues until no further splits can be made, given the alpha-to-merge and alpha-to-split values.
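The Pearson Chi-square computation at the heart of step 2 can be sketched in a few lines of Python (a minimal illustration with a hypothetical contingency table; not the SPSS Modeler implementation used in this study):

```python
# Pearson chi-squared statistic for a feature-vs-class contingency table,
# as CHAID uses it to compare candidate predictors. The toy table below is
# hypothetical, not taken from the Z-Alizadeh Sani data.

def chi_squared(table):
    """table[i][j] = observed count N_ij for feature category i, class j."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # G_ij
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical 2x2 table: rows = feature present/absent, cols = CAD/Normal.
stat = chi_squared([[30, 10], [20, 40]])
```

A larger statistic (smaller p-value) marks the feature as a more significant predictor.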

Decision tree of C5.0
The evolution of decision tree models runs from ID3 [21,22] through C4.5 [23][24][25] to the C5.0 tree model [26][27][28][29], the latest version of the decision tree models developed by Ross Quinlan. The improved C5.0 decision tree is many times faster than its predecessors, and its memory usage is much lower than that of the other models mentioned. The model also improves trees by supporting boosting and bagging [25], which increases the accuracy of diagnosis. Weighting the disease features is a characteristic common to decision trees, but the C5.0 model additionally allows different features and different types of misclassification to be weighted.
One of the crucial advantages of the C5.0 model for testing features is the gain ratio, which increases the information gain, i.e., reduces the information entropy, while also reducing the bias [1,17,29]. The notation for the information entropy, information gain, and gain ratio is as follows [1,17,25]: let S be the set of training samples, and let a categorical feature K split S into n subsets.
Thus, the features selected for diagnosing CAD are those with the least information entropy and the most information gain and gain ratio. The information entropy, information gain, and gain ratio are formulated as follows. For a probability distribution P = (p1, p2, …, pi), the entropy is

Info(P) = − Σj pj log2(pj). (8)

If the values of a feature K partition S into the disjoint subsets (C1, C2, …, Ci), the entropy of the partition is

Info(K, S) = Σj (|Cj|/|S|) Info(Cj), (9)

where P is the probability distribution of the division (C1, C2, …, Ci):

P = (|C1|/|S|, |C2|/|S|, …, |Ci|/|S|). (10)

Based on formula (10), the Cj are the disjoint classes and |⋅| denotes the number of samples in a set. The value of Gain is computed as

Gain(K, S) = Info(P) − Info(K, S). (11)

Quinlan suggested using the Gain Ratio instead of Gain, where Split Info(K, C) is the information due to the division of C on the basis of the value of the categorical feature K:

Split Info(K, C) = − Σj (|Cj|/|C|) log2(|Cj|/|C|), (12)

Gain Ratio(K, C) = Gain(K, C) / Split Info(K, C). (13)

In formulas (12) and (13), (C1, C2, …, Ci) is the partition of C induced by the value of K.
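The entropy, gain, and gain-ratio computations of formulas (8)-(13) can be sketched in pure Python (an illustrative toy example; the labels and feature values below are hypothetical):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Info(P) of formula (8): -sum(p_i * log2(p_i)) over class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, feature_values):
    """Quinlan's gain ratio, formulas (11)-(13): information gain of splitting
    `labels` by `feature_values`, divided by the split information."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    info_k = sum(len(s) / n * entropy(s) for s in subsets.values())   # (9)
    gain = entropy(labels) - info_k                                   # (11)
    split_info = -sum(len(s) / n * log2(len(s) / n)
                      for s in subsets.values())                      # (12)
    return gain / split_info if split_info else 0.0                   # (13)

# A perfectly informative hypothetical split: feature 'a'/'b' separates 1s from 0s.
ratio = gain_ratio([1, 1, 0, 0], ['a', 'a', 'b', 'b'])
```

Features with the highest ratio would be preferred at each split, as in C5.0.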

Support Vector Machine
The Support Vector Machine (SVM), presented by Vapnik, is a supervised learning model based on statistical learning theory and structural risk minimization [29,30]; only the data points assigned as support vectors drive the learning and model building. The SVM model is not sensitive to the other data points; its aim is to find the best separating line, i.e., the optimal hyperplane between the two classes of samples, with the maximum possible distance to the support vectors of both classes [29][30][31][32]. The predictor feature is determined by the separator line for each predictive class. Fig. 2 shows the scheme of the support vector machine in 2-dimensional space. The margin (2/‖w‖) of the separator is the distance between the support vectors; the data samples closest to the hyperplane are the support vectors; and b represents the offset between the optimal hyperplane and the origin. Then, for each training sample (xi, yi):

w · xi + b ≥ +1 for yi = +1,
w · xi + b ≤ −1 for yi = −1. (14)
The hyperplane optimization that the SVM model solves is the following problem [29]:

min (1/2)‖w‖² subject to yi(w · xi + b) ≥ 1, i = 1, …, N. (15)

To solve the problem of formula (15), one obtains the dual of the problem using the Lagrange method. Introducing nonnegative Lagrangian coefficients αi ≥ 0, the Lagrangian Lp is defined as

Lp = (1/2)‖w‖² − Σi αi [yi(w · xi + b) − 1]. (16)

Finally, formula (16) is transformed into the following equation [29]:

LD = Σi αi − (1/2) Σi Σj αi αj yi yj (xi · xj). (17)

Equation (17) is called the dual problem, namely LD. For non-linear SVM, however, a trade-off between maximizing the margin and the misclassification error must be accepted, because no linear separating hyperplane over the data samples can be obtained. In the nonlinear case, the best solution is to transform the basic data to a higher dimension, i.e., a feature space, in which linear separation is possible. To this end, kernel functions such as the linear, polynomial, radial basis function (RBF), and sigmoid kernels are used [29]. LD for non-linearly separable data samples is obtained from equation (18):

LD = Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj), subject to 0 ≤ αi ≤ C. (18)

In (18), the parameter C is the penalty agent and determines the measure of penalty placed on an error; the value of C is selected by the user.
N is the number of data samples in (18). In this study, the radial basis function (RBF) [29] is selected as the kernel function, as shown in (19):

K(xi, xj) = exp(−γ ‖xi − xj‖²). (19)

In (19), the kernel parameter γ, with γ ≥ 0, represents the width of the RBF.
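The RBF kernel of (19) is straightforward to sketch (a minimal illustration; the γ value below is an arbitrary example, not the setting used in the experiments):

```python
from math import exp

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2), the RBF kernel of Eq. (19)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)

# The kernel of a point with itself is always 1; distant points decay toward 0.
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
k_far = rbf_kernel([0.0, 0.0], [1.0, 1.0])
```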

Random Trees
The random trees (RTs) model is a robust predictive model that outperforms other classification models in terms of computed accuracy, data management, higher information gain while eliminating fewer features, better rule extraction, and the ability to work with larger data and more complex networks; this makes the model suitable for disease diagnosis. The model consists of multiple randomly grown trees of high depth and takes the most significant votes from a set of possible trees having K random features at each node. In other words, within the set of trees, each tree has an equal probability of being assigned. In the experiments performed on the classification of the dataset, the RTs model is more accurate than the other models because it evaluates several features and composes functions. RTs can therefore be constructed efficiently, and combining random trees over large datasets generally leads to proper models. There has been extensive research on RTs in the field of machine learning in recent years [33]. In general, Random Trees shows decisive performance compared to the single-tree classifiers presented in this study.
If we consider random trees at very high dimensions with a complex network, the procedure includes the following steps [33,34]:
1. Draw N data samples randomly from the training dataset to develop the tree.
2. At each node, take a random sample of m predictive features with m < M, where m is the number of selected features and M is the full set of features in the corresponding dataset; m is kept constant while the trees grow.
3. Use the m features selected in the previous step to generate the split, and compute the next node P using the best split path among the candidate points.
4. For aggregation, predict on the dataset by classification voting over the n trained trees.
5. To generate the final RTs model, use the most-voted features.
6. Continue the RTs process until the tree is complete and each branch ends in a leaf node.
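The voting aggregation in steps 4-5 can be illustrated with a small sketch (the per-tree votes below are hypothetical, not outputs of the trained model):

```python
from collections import Counter

def ensemble_vote(per_tree_predictions):
    """Aggregate class votes from n trained trees for each sample (step 4):
    the majority class across trees becomes the ensemble prediction."""
    n_samples = len(per_tree_predictions[0])
    return [
        Counter(tree[i] for tree in per_tree_predictions).most_common(1)[0][0]
        for i in range(n_samples)
    ]

# Hypothetical votes from three trees on four samples ('CAD' vs 'Normal').
votes = [['CAD', 'Normal', 'CAD', 'CAD'],
         ['CAD', 'CAD', 'Normal', 'CAD'],
         ['Normal', 'Normal', 'CAD', 'CAD']]
final = ensemble_vote(votes)
```

Each sample receives the class chosen by the majority of the trees, which is what makes the ensemble more stable than any single tree.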

Related works
In recent years, several studies have been conducted on the diagnosis of CAD on different datasets using data mining methods. The most up-to-date dataset that researchers have recently used in the field of heart disease is the Z-Alizadeh Sani dataset. To this end, we review recent research on the Z-Alizadeh Sani dataset [35,36].
Alizadeh Sani et al. have proposed the use of data mining methods based on ECG symptoms and characteristics in relation to the diagnosis of CAD [37]. In their research, they used the sequential minimal optimization (SMO) and Naïve Bayes algorithms, separately and in combination, to diagnose the disease. Using the 10-fold cross-validation method, the hybrid SMO-Naïve Bayes algorithm achieved an accuracy of 88.52%, higher than SMO (86.95%) and Naïve Bayes (87.22%) alone.
In another study, Alizadeh Sani et al. developed classification algorithms such as SMO, Naïve Bayes, Bagging with SMO, and neural networks for the diagnosis of CAD [12]. Confidence and information gain on CAD were also used to determine effective features. As a result, among these algorithms, the SMO algorithm with information gain had the best performance, with an accuracy of 94.08% using the 10-fold cross-validation method.
Alizadeh Sani et al. have used computational intelligence methods to diagnose CAD, separately diagnosing stenosis of the three major coronary arteries using demographic, symptom and examination, ECG, and laboratory and echo characteristics [38]. They used analytical methods to investigate the importance of the vascular stenosis characteristics. Finally, using the SVM classification model with the 10-fold cross-validation method, along with feature selection based on combined information gain and average information gain, they obtained accuracies of 86.14%, 83.17%, and 83.50% for the left anterior descending (LAD), left circumflex (LCX), and right coronary artery (RCA), respectively.
Arabasadi et al. have presented a hybrid neural network-genetic algorithm for the diagnosis of CAD [39]. In their research, the genetic and neural network algorithms were used separately and in combination to analyze the dataset; the accuracies of the neural network and the hybrid neural network-genetic algorithm using the 10-fold cross-validation method were 84.62% and 93.85%, respectively.
Alizadeh Sani et al. have applied a feature engineering algorithm with the Naïve Bayes, C4.5, and SVM classifiers for non-invasive diagnosis of CAD [36], increasing their dataset from 303 records to 500 samples. The accuracies obtained using the 10-fold cross-validation method for the Naïve Bayes, C4.5, and SVM algorithms were 86%, 89.8%, and 96.40%, respectively.
In a study conducted by Abdar et al. [40], a two-level hybrid of a genetic algorithm and NuSVM, called N2Genetic-NuSVM, was used. The two-level genetic algorithm optimizes the SVM parameters and selects features in parallel. Using their proposed method, the accuracy of CAD diagnosis was 93.08% with the 10-fold cross-validation method.

Proposed Methodology
In this section, we present the proposed methodology shown in Fig. 3. IBM SPSS Modeler version 18.0 is used to implement the classification models.

Description of the dataset
Initially, based on Fig. 3, the Z-Alizadeh Sani dataset is used in this study to diagnose CAD [35]. This dataset contains information on 303 patients with 55 features: 216 patients with CAD and the rest with normal status. The features in this dataset are divided into 4 groups: demographic, symptom and examination, electrocardiogram (ECG), and laboratory and echo features, as listed in Table 1. For distinguishing CAD from Normal, a diameter narrowing above 50% labels a patient as CAD, and its absence is stated as Normal [12].


Classifying the dataset
Using 10-fold cross-validation, the data are divided into ten subsets: nine subsets, i.e., 90%, for training the classifiers and one subset, i.e., 10%, for testing.

Preprocessing the dataset
The preprocessing step is performed after the data are classified. In general, data preprocessing is the set of operations applied to a dataset that leads to a cleaned data set. The sample values in the Z-Alizadeh Sani dataset [35] are numeric and string. The purpose of preprocessing in this study is to homogenize the data so that all values lie in the interval [0, 1]; this standard normalization operation is done using the Min-Max function. After normalizing the numbers, the string data are transformed to numeric values: given the nature of the string data, values in the interval [0, 1] are assigned to them. For example, the sex feature has the values male and female, which are transformed to zero and one, respectively.
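The Min-Max normalization described above can be sketched as follows (the age values are hypothetical examples, not records from the dataset):

```python
def min_max_normalize(values):
    """Scale a numeric feature column to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# e.g., a hypothetical age column
ages = [29, 40, 55, 70]
scaled = min_max_normalize(ages)
```

After scaling, every numeric feature lies in the same [0, 1] range, so no single feature dominates the classifiers because of its units.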

Classifying the models using 10-fold cross-validation method
For classifying the models, the 10-fold cross-validation method [41] was used: the dataset is randomly divided into K equal-sized subsets, K−1 subsets are used to train the classifiers, and the remaining subset is used to investigate the output performance at each step, repeated 10 times. The prediction models were thus evaluated with the 10-fold cross-validation method, and the criteria were averaged over the 10 folds [1,42], with 90% of the data used for training and 10% for testing. Finally, this cross-validation process was executed 10 times, and the reported results are the averages over the ten runs.
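The fold construction can be sketched in a few lines (an illustrative re-implementation of the splitting idea, not the SPSS Modeler partitioning used in the study; the seed is an arbitrary example value):

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k near-equal folds.
    Fold i serves as the test set while the remaining k-1 folds train."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# The Z-Alizadeh Sani dataset has 303 records, so each fold holds 30-31 samples.
folds = k_fold_indices(303, k=10)
```

Iterating over the folds, training on the other nine and testing on the held-out one, and averaging the resulting scores gives the reported cross-validated criteria.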

Evaluating the results
In this section, we examine the evaluation in two parts: first, evaluation based on classification criteria, including the ROC curve, Gini, Gain, Confidence, return on investment (ROI), Profit, and Response; second, evaluation based on significant predictive features.

Evaluation based on classification criteria
We used a confusion matrix [1,39,43,44], described in Table 2, to evaluate the SVM, CHAID, C5.0, and RTs classification models for the diagnosis of CAD on the Z-Alizadeh Sani dataset. From the confusion matrix, the AUC [1,45] and the Gini index [46] criteria were obtained; the comparison between the mentioned models on the AUC criterion is shown in Fig. 4 (a,b). According to Fig. 4b, the AUC values for the SVM, CHAID, C5.0, and RTs models are 80.90%, 82.30%, 83.00%, and 90.50%, respectively. The Gini values for the SVM, CHAID, C5.0, and RTs models were 61.80%, 64.60%, 66.00%, and 93.40%, respectively.
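The Gini index and the AUC are linked by the standard identity Gini = 2·AUC − 1 (an assumption here; the text does not state how the Gini index was computed). A quick check against the reported SVM, CHAID, and C5.0 values:

```python
def gini_from_auc(auc):
    """Gini coefficient implied by an ROC AUC value: Gini = 2*AUC - 1."""
    return 2 * auc - 1

# Reported AUC values (Fig. 4b) for three of the models
aucs = {'SVM': 0.809, 'CHAID': 0.823, 'C5.0': 0.830}
ginis = {m: round(gini_from_auc(a), 3) for m, a in aucs.items()}
# ginis -> {'SVM': 0.618, 'CHAID': 0.646, 'C5.0': 0.66}
```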
In addition, the Gain, Confidence, Profit, ROI, and Response criteria have been examined for evaluating the models, and comparisons between the models on these criteria are shown in Figs 5-9. According to Figs 5-9, for the diagnosis of the CAD versus the Normal class, the RTs model performs better than the other classification models in terms of the Gain, Confidence, ROI, Profit, and Response criteria.

Evaluation based on significant predictive features
One of the significant evaluations for comparing the classification models in predicting CAD versus normal is the importance of the predictive features. To this end, we have examined the models with respect to the importance ranking of the features; in fact, the models are measured according to the weights assigned to the predictor features. The weighted importance of the features for each model is shown in Tables 2-4.

Results and discussion
In the modeling process proposed in Section 4, we implemented several data mining models, including SVM, CHAID, C5.0, and RTs. The 10-fold cross-validation method was used to build these models, with the data divided into training (90 percent) and test (10 percent) subsets. The results show that the Random trees model is the best classification model: its accuracy is 91.47% using the 10-fold cross-validation method, while the accuracies of the SVM, CHAID, and C5.0 models were 69.77%, 80.62%, and 82.17%, respectively.
The accuracy is computed using the formula (TP + TN)/(TP + TN + FP + FN), where TP is true positives, TN true negatives, FP false positives, and FN false negatives [1].
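The accuracy formula above can be sketched directly, together with the companion rates derived from the same confusion matrix (the counts below are hypothetical, chosen only to illustrate the computation, not the study's actual matrix):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy and companion rates from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        'accuracy':    (tp + tn) / total,   # (TP + TN) / (TP + TN + FP + FN)
        'sensitivity': tp / (tp + fn),      # true-positive rate
        'specificity': tn / (tn + fp),      # true-negative rate
    }

# Hypothetical counts for a 303-sample test set
m = classification_metrics(tp=210, tn=70, fp=18, fn=5)
```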
A further achievement of this study is the use of criteria not found in previous studies, including Gain, Confidence, ROI, Profit, and Response; as shown in Figs 5-9, the Random trees model has the best performance on these criteria compared to the other classification models.
Finally, based on Tables 2 to 5, in each of the 4 models the Typical chest pain feature is selected as the most significant predictor. The predictor importance of the Typical chest pain feature for the random trees model is 0.98, the most significant, while the least significant, the Lymph feature, is zero; note that for the features, intervals 1 and 2 are applied in the simulator. In Table 3, Typical chest pain is again the most significant feature, with a value of 0.04, and Region RWMA is the least significant, at 0.01. According to Tables 4 and 5, Typical chest pain is the most significant feature, at 0.28 and 0.33 respectively; the least significant feature in Table 4 is EF-TTE, at zero, and the least significant value in Table 5 is 0.02. The above tables therefore confirm that the RTs model is the best model relative to the other classification models.
One of the advantages of the Random Trees model is the set of most significant rules obtained for CAD diagnosis, given in Table 6 (the top decision rules for the 'Cath' class). According to Table 6, the extracted rules for CAD are described as follows. If the conditions (BP > 110.0), (FH > 0.0), (Neut > 51.0), and (Typical Chest Pain > 0.0) hold, then CAD is present with high accuracy and a high interestingness index; otherwise, the person is normal. If the conditions (Typical Chest Pain > 0.0) and (Atypical = {N}) hold, the result is like that of the previous conditions, as it is if the conditions (Weight > 8.0) and (CR > 0.9

In recent years, several studies have been conducted on the diagnosis of CAD on different datasets using data mining methods. To this end, we review recent research on the updated Z-Alizadeh Sani dataset, as described in Table 7. The Accuracy, AUC, and Gini results for the models were obtained with the 10-fold cross-validation method, as in the previous studies compared. From Table 7 it can be concluded that the proposed method based on Random Trees outperforms the other methods in terms of the accuracy, AUC, and Gini criteria. This implies that the 40 features extracted using RTs are the most informative ones for CAD.

Conclusion and future works
In this study, a computer-aided diagnosis system was used to diagnose CAD, a common heart disease, on the Z-Alizadeh Sani dataset [35]; the system is implemented with the IBM SPSS Modeler version 18.0 tool. Although angiography is the most common tool for diagnosing heart disease, it has costs and side effects for individuals, so artificial intelligence methods, that is, machine learning techniques, can be a solution to the stated challenge. Hence, classification models including SVM, CHAID, C5.0, and Random trees were used for modeling with the 10-fold cross-validation method and were examined and evaluated on the accuracy, AUC, Gini, ROI, Profit, Confidence, Response, and Gain criteria. Based on the stated criteria, the Random trees model was found to be the best of the models: selecting the predictive features in order of their priority, the Random trees model with its 40 most significant features and an accuracy of 91.47% performs better than the other classification models. With this number of features, we obtain more information gain than with the features selected in previous works. Another achievement of this study is the set of important rules extracted for CAD diagnosis using the Random Trees model, shown in Table 6. As future work, a fuzzy intelligent system can be used in combination with artificial intelligence models to diagnose CAD on the Z-Alizadeh Sani dataset and other datasets. Another avenue for better diagnosing CAD on this and other real datasets is deep learning models, and combining deep learning approaches with a distributed design and architecture.