Testing the Probability of Heart Disease Using Classification and Regression Tree Model

INTRODUCTION
Today, medical services have come a long way in treating patients with various diseases. Diagnosing patients correctly and administering effective treatments has become quite a challenge. Poor clinical decisions may end in a patient's death, which a hospital cannot tolerate as it loses its reputation. The cost of treating a patient with a heart problem is quite high and not affordable for every patient. To achieve correct and cost-effective treatment, computer-based information and/or decision support systems can be developed to do the task. Most hospitals today use some form of hospital information system to manage their healthcare or patient data. These systems typically generate huge amounts of data in the form of numbers, text, charts, and images [1].
There have been several efforts to develop indices for estimating cardiovascular risk, since cardiovascular diseases are the main cause of morbidity and mortality in the world [2]. There is a wealth of hidden information in these data that is largely untapped. The diagnosis of diseases is a vital and intricate job in medicine [3]. The recognition of heart disease from diverse features or signs is a multi-layered problem that is not free from false assumptions and is frequently accompanied by unpredictable effects. Thus, the attempt to exploit the knowledge and experience of several specialists, together with clinical screening data of patients stored in databases, to assist the diagnosis procedure is regarded as a valuable option [4].
Knowledge of the risk factors associated with heart disease helps health care professionals to identify patients at high risk of having heart disease [5]. Statistical analysis has identified the risk factors associated with heart disease to be age, blood pressure, smoking habit, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, and lack of physical activity [6][7][8].
Data mining techniques play an important role in finding patterns and extracting knowledge from large volumes of data [9]. They are very helpful for providing better patient care and effective diagnostic capabilities [10][11][12]. Several data mining techniques are used in the diagnosis of heart disease, such as genetic algorithms, classification via clustering, direct kernel self-organizing maps, naïve Bayes, decision trees, neural networks, kernel density estimation, automatically defined groups, the bagging algorithm, and support vector machines, showing different levels of accuracy [13][14]. Among these classification algorithms, decision tree algorithms are the most commonly used because they are easy to understand and cheap to implement. They provide a modeling technique that is easy for humans to comprehend and simplifies the classification process [15][16][17].
In particular, researchers have investigated the application of the decision tree technique in the diagnosis of heart disease with considerable success. Andreeva [18] used the C4.5 decision tree in the diagnosis of heart disease, achieving an accuracy of 75.73%. Sitar-taut et al. [19] used the Weka tool to investigate naïve Bayes and J4.8 decision trees for the detection of coronary heart disease. The results showed no significant difference between naïve Bayes and decision trees in the ability to produce a correct prediction of coronary heart disease. Tu et al. [20] used the bagging algorithm in the Weka tool and compared it with the J4.8 decision tree in the diagnosis of heart disease. The bagging algorithm showed a better accuracy of 81.41%, while the decision tree showed an accuracy of 78.91%. Rajkumar and Reena [21] developed an Intelligent Heart Disease Prediction System to predict heart disease using three classifiers: decision trees, naïve Bayes, and neural networks. Kaur and Wasan [22] examined the potential use of classification data mining techniques such as decision trees, rule induction, and artificial neural networks for the diagnosis of diabetic patients. Ordonez [23] implemented an efficient search for the diagnosis of heart disease, comparing association rules with decision trees. Xing et al. [24] conducted a survey of 1000 patients, the results of which showed SVM to have 92.1% accuracy, artificial neural networks 91.0%, and decision trees 89.6%, using TNF, IL6, IL8, HICRP, MPO1, TNI2, sex, age, smoking, hypertension, diabetes, and survival as the parameters. Similarly, Chen et al. [25] compared the accuracy of SVM, neural networks, Bayesian classification, decision trees, and logistic regression. Considering 102 cases, SVM had the highest accuracy at 90.5%, followed by neural networks at 88.9%, Bayesian classification at 82.2%, decision trees at 77.9%, and logistic regression at 73.9%. Karaolis et al.
[26] developed a data mining system for the assessment of heart-event-related risk factors. They found that data mining techniques could help in identifying high- and low-risk subgroups of patients. A decision tree was used to extract rules based on the risk factors. Palaniappan and Awang [27] developed an Intelligent Heart Disease Prediction System (IHDPS) using decision trees, neural networks, and naïve Bayes. A scalable, reliable, expandable, and user-friendly web-based system was developed on the .NET platform.
As stated, the decision tree is one of the most successful data mining techniques used in the diagnosis of heart disease. Our work aims at efficient diagnosis using a reduced set of the factors that contribute most to heart disease, applying the Classification and Regression Tree (CRT) model and using the Gini index to measure the impurity of a partition or set of training tuples [28]. The model can handle high-dimensional categorical data.

CLASSIFICATION AND REGRESSION TREE MODEL
Decision tree induction is one of the classification techniques used in decision support systems and machine learning [29]. With this technique, the training data set is recursively partitioned using a depth-first (Hunt's method) or breadth-first greedy technique until each partition is pure or belongs to the same class/leaf node [30]. The model is preferred over other classification algorithms because it is an eager learning algorithm and easy to implement. Decision tree algorithms can be implemented serially or in parallel. Regardless of the implementation method adopted, most decision tree algorithms in the literature are constructed in two phases: a tree growth phase and a tree pruning phase. Tree pruning is an important part of decision tree construction, as it improves classification/prediction accuracy by ensuring that the constructed tree model does not overfit the data set [31].
The decision tree is based on a multistage or hierarchical decision scheme (tree structure). The tree is composed of a root node, a set of internal nodes, and a set of terminal nodes (leaves). Each node of the decision tree structure makes a binary decision that separates either one class or some of the classes from the remaining classes. Processing is carried out by moving down the tree until a terminal node is reached. In a decision tree, features that carry maximum information are selected for classification, while the remaining features are rejected, thereby increasing computational efficiency [32]. The top-down induction of the decision tree indicates that variables higher in the tree structure are more important [33]. In this study we focus on one type of decision tree, the CRT model, which is memory-resident, fast, and easy to implement.
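As a minimal illustration of this hierarchical scheme (not the model built in this study), the sketch below classifies a case by moving down a toy binary tree until a terminal node is reached. The features, thresholds, and node layout are illustrative assumptions.

```python
# A toy binary decision tree: each internal node tests one feature and sends a
# case to its left or right child; terminal nodes (leaves) carry a class label.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.label = left, right, label

def classify(node, case):
    """Move down the tree until a terminal node is reached, then return its label."""
    while node.label is None:
        node = node.left if case[node.feature] <= node.threshold else node.right
    return node.label

# Illustrative tree: split first on "ca" (vessels colored by fluoroscopy),
# then on "cp" (chest pain type). Thresholds are made up for the example.
tree = Node("ca", 0,
            left=Node("cp", 3, left=Node(label="absence"), right=Node(label="presence")),
            right=Node(label="presence"))

print(classify(tree, {"ca": 0, "cp": 2}))  # absence
print(classify(tree, {"ca": 2, "cp": 2}))  # presence
```

Traversal touches only the features tested along one root-to-leaf path, which is why features that carry little information never influence the decision.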
CRT is a recursive partitioning method that can be used for both regression and classification. The tree is constructed by repeatedly splitting subsets of the data set using all predictor variables to create two child nodes, beginning with the entire data set. The best predictor is chosen using one of a variety of impurity or diversity measures (Gini, twoing, ordered twoing, and least-squared deviation). The goal is to produce subsets of the data that are as homogeneous as possible with respect to the target variable [34]. In this study, we used the Gini impurity measure, which is appropriate for categorical target variables.
Gini Impurity Measure: The Gini index at node t, g(t), is defined as

g(t) = \sum_{j \neq i} p(j|t)\, p(i|t),

where i and j are categories of the target variable. The equation for the Gini index can also be written as

g(t) = 1 - \sum_{j} p^{2}(j|t).

Thus, when the cases in a node are evenly distributed across the categories, the Gini index takes its maximum value of 1 - (1/k), where k is the number of categories of the target variable [35]. When all cases in the node belong to the same category, the Gini index equals 0. If costs of misclassification are specified, the Gini index is computed as

g(t) = \sum_{j \neq i} C(i|j)\, p(j|t)\, p(i|t),

where C(i|j) is the cost of misclassifying a category j case as category i. The Gini criterion function for split s at node t is defined as

\Phi(s, t) = g(t) - p_{L}\, g(t_{L}) - p_{R}\, g(t_{R}),

where p_L is the proportion of cases in t sent to the left child node and p_R is the proportion sent to the right child node. The split is chosen to maximize the value of \Phi(s, t). This value is reported as the improvement in the tree [36].
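As a brief sketch (plain Python, not the statistical package used in this study), the Gini index of a node and the improvement \Phi(s, t) of a candidate split can be computed directly from the class labels in each node:

```python
from collections import Counter

def gini(labels):
    """Gini index g(t) = 1 - sum_j p(j|t)^2 for the cases in a node."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_improvement(parent, left, right):
    """Gini criterion phi(s, t) = g(t) - p_L * g(t_L) - p_R * g(t_R)."""
    n = len(parent)
    p_l, p_r = len(left) / n, len(right) / n
    return gini(parent) - p_l * gini(left) - p_r * gini(right)

# A pure node has impurity 0; with k = 2 evenly distributed categories the
# index reaches its maximum 1 - 1/k = 0.5.
print(gini([1, 1, 1, 1]))  # 0.0
print(gini([1, 1, 2, 2]))  # 0.5
# Splitting a maximally mixed node into two pure children recovers the
# full impurity as the improvement value.
print(split_improvement([1, 1, 2, 2], [1, 1], [2, 2]))  # 0.5
```

CRT evaluates `split_improvement` for every candidate split at a node and keeps the one that maximizes it.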

DATA COLLECTION
A total of 270 cases with 13 medical attributes were obtained from the Statlog Heart Disease database [37]; several published studies have used this data set [38][39][40][41]. The data set contains 150 patients without heart disease and 120 patients with heart disease. The diagnosis class is coded as "1" for patients with no heart disease and "2" for patients with heart disease. The attributes are abbreviated as: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, and thal. The data have no missing values for any attribute [42][43][44]. Table 1 lists the names of the attributes with their types and descriptions.
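A possible loader for such data is sketched below. The default file name "heart.dat", the space-delimited layout, and the column order (the 13 attributes followed by the class) are assumptions about a local copy of the Statlog database, not details taken from the paper.

```python
# Hedged sketch: read the Statlog heart data into a list of dicts, recoding
# the class value ("1" = no heart disease, "2" = heart disease) as a label.
ATTRS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
         "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

def load_statlog(path="heart.dat"):
    rows = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            *features, target = line.split()
            record = dict(zip(ATTRS, map(float, features)))
            record["class"] = "absence" if int(float(target)) == 1 else "presence"
            rows.append(record)
    return rows
```

With the full database this should yield 270 records, 150 labeled "absence" and 120 labeled "presence".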

RESULTS AND DISCUSSION
The CRT algorithm is used to predict the absence or presence of heart disease based on the values of all independent variables. The tree diagram (Fig. 1) shows the tree constructed with a test sample of 31 cases, a 0.05 adjustment of the probabilities, a minimum parent node size of 30, a minimum child node size of 10, and equal misclassification costs. The Gini index was selected as the splitting criterion. The tree has 10 nodes in total, of which 6 are terminal nodes; the first node placed in the tree is the root node. The depth of the tree is 3. The root node contains 17 cases with absence (54.8%) and 14 cases with presence (45.2%) of heart disease. The prior probabilities used are the observed probabilities, 0.548 and 0.452 for absence and presence of heart disease, respectively. The first discriminator, "exercise thallium scintigraphic defects", splits the root node into two child nodes: normal (node 1, n = 17) and fixed or reversible defect (node 2, n = 14). The improvement for this split is 0.136. The classifier "chest pain type" is used for normal defects (0.028 improvement), and "number of vessels colored by fluoroscopy" for fixed or reversible defects (0.031 improvement). For normal defects, "chest pain type" splits into atypical angina or non-anginal pain (terminal node 3) and typical angina or asymptomatic (node 4). For fixed or reversible defects, "number of vessels colored by fluoroscopy" produces two nodes: node 5, for which the number of major vessels colored by fluoroscopy is 0, and terminal node 6, for which the number ranges between 1 and 3. When "chest pain type" is typical angina or asymptomatic, the next discriminator is "number of vessels colored by fluoroscopy" (0.027 improvement), which splits into terminal node 7 (number of vessels equal to 0) and terminal node 8 (number of vessels between 1 and 3).
Also, when the "number of vessels colored by fluoroscopy" is 0, the last discriminator is "chest pain type" (0.019 improvement), which produces two terminal nodes: terminal node 9 for asymptomatic, and terminal node 10 for typical angina, atypical angina, or non-anginal pain. The percentages in each category and in each joint category are shown in Fig. 1 and in Table 2. The improvement measure shown in Table 2 quantifies the effect of a split on the dependent variable; it is determined by the largest difference in the proportions of the dependent variable in the child nodes [45]. Thus, an improvement of 0.136 means that "exercise thallium scintigraphic defects" contributes 13.6% to the discrimination between absence and presence of heart disease; "chest pain type" makes an additional 4.7% improvement, and "number of vessels colored by fluoroscopy" another 5.8%.
The results of CRT are summarized in Table 2; it is clear that not all predictor categories contribute to the classification process. The predictors age, sex, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate, exercise induced angina, depression induced by exercise relative to rest, and slope do not contribute to the classification tree.
The dependency viewer of the CRT model ranks the medical predictors from most significant to least significant (weakest). The normalized importance of the predictors in the classification is given in Table 3. The CRT model shows that the most significant factors influencing heart disease are chest pain type, number of vessels colored by fluoroscopy, and exercise thallium scintigraphic defects. Other significant factors include depression induced by exercise relative to rest, maximum heart rate, exercise induced angina, slope, and age. The model shows sex, resting blood pressure, serum cholesterol, and fasting blood sugar to be the weakest factors. The attribute resting electrocardiographic results is excluded from the recognition. Physicians can use this information to further analyze the strengths and weaknesses of the medical attributes associated with heart disease. The classification accuracy is the percentage predicted correctly (absence and presence) over the total sample size used. The prediction accuracy was found to be 90.3%, with a sensitivity of 82.4% and a specificity of 100.0%. The misclassification rate (resubstitution risk estimate) is 0.097 with a standard error of 0.053.
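These summary statistics follow from a 2×2 confusion table over the 31 test cases. In the sketch below, treating the absence class (17 test cases, of which 14 are assumed correctly identified) as the "positive" class is an assumption made to reproduce the reported figures, not a convention stated in the paper:

```python
# Accuracy, sensitivity, and specificity from confusion-table counts:
# tp/fn refer to the class taken as "positive", tn/fp to the other class.
def summarize(tp, fn, tn, fp):
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # correct positives / all positives
    specificity = tn / (tn + fp)   # correct negatives / all negatives
    return accuracy, sensitivity, specificity

# Assumed counts: 14 of 17 absence cases and all 14 presence cases correct.
acc, sens, spec = summarize(tp=14, fn=3, tn=14, fp=0)
print(f"accuracy={acc:.1%} sensitivity={sens:.1%} specificity={spec:.1%}")
# accuracy=90.3% sensitivity=82.4% specificity=100.0%
```

The 3 misclassified cases out of 31 likewise give the reported misclassification rate of 3/31 ≈ 0.097.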
Comparing a study with others in the literature is an indispensable part of evaluating its results. Although such a comparison cannot be generalized to all related problems, it at least indicates where the proposed system stands among the classifiers reported in the literature.
Work on the heart disease data sets dates back to 1989, when Detrano et al. [46] used logistic regression to obtain 77% prediction accuracy. The accuracies of different models on the same data sets are tabulated in Table 4 [47][48][49][50][51]. As can be seen, the highest classification accuracy among those studies was obtained by GA-AWAIS at 87.43%, while the lowest, 58.5%, was reached by InductH. As stated, the proposed CRT model achieved an accuracy of 90.3%, which is higher than the other methods. This result is all the more significant given that the average classification accuracy of classifiers applied to this problem is 75.8%. Thus, this can be seen as a hard medical classification problem, for which the CRT model has reached a considerably good result.

CONCLUSION
Diagnosis of heart disease is a significant task in medicine. This paper investigates applying the CRT model to help healthcare professionals in this task. A data set of 270 cases with thirteen attributes was used as the input of the model. Based on the proposed model, the variables in descending order of importance are chest pain type, number of vessels colored by fluoroscopy, exercise thallium scintigraphic defects, depression induced by exercise relative to rest, maximum heart rate, exercise induced angina, slope, age, sex, resting blood pressure, serum cholesterol, and fasting blood sugar. The attribute resting electrocardiographic results is excluded from the recognition. The simulation results show the capability of this model for the prognosis of heart disease with good accuracy and appropriate convergence. Compared with other methods in the literature, CRT has also obtained reasonable results. This type of research can play an important role in improving patient outcomes, reducing the cost of medicine, and enabling further advanced clinical studies. The proposed work can be further enhanced and expanded to automate heart disease prediction. Real data from health care organizations and agencies need to be collected, and all available techniques should be compared for optimum accuracy. It is hoped that deeper investigations using this method in the future will use more recent data sets.

ACKNOWLEDGEMENT
The author would like to thank Malak Fuad Subhi Al-Battah, a student in the Department of Medicine and Surgery, Faculty of Medicine, Jordan University of Science and Technology, for her insightful ideas about heart disease, which were most helpful in enhancing the article.