Diagnosing Coronary Artery Disease via Data Mining Algorithms by Considering Laboratory and Echocardiography Features

Background: Coronary artery disease (CAD) is the result of the accumulation of atheromatous plaques within the walls of the coronary arteries, which supply the myocardium with oxygen and nutrients. CAD can lead to heart attacks or strokes and is, thus, one of the most important causes of death worldwide. Angiography, an imaging modality for blood vessels, is currently the most accurate method of diagnosing artery stenosis. However, the disadvantages of this method, such as complications, costs, and possible side effects, have prompted researchers to investigate alternative solutions. Objectives: The current study aimed to use data analysis, a non-invasive and less costly method, and various data mining algorithms to predict the stenosis of arteries. Many of the people who present to hospitals with chest pain are normal and, as such, do not need angiography. The objective of this study was to identify patients who are most probably normal, using the features with the highest correlations with CAD, with a view to obviating angiography costs and complications. This method is not a substitute for angiography; rather, it would select the high-risk cases that definitely need angiography. Patients and Methods: Different features were measured and collected from potential patients in order to construct a dataset, which was later utilized for model extraction. Most of the methods proposed in the literature have not considered the stenosis of each artery separately, whereas the present study employed laboratory and echocardiographic data to diagnose the stenosis of each artery separately. The data were gathered from 303 random visitors to Rajaie Cardiovascular, Medical and Research Center. Electrocardiographic (ECG) data were studied in our previous works. The goal of this study was, therefore, to assess the accuracy of echocardiographic and laboratory features in predicting CAD patients that require angiography.
Results: Bagging and C4.5 classification algorithms were drawn upon to analyze the data, the former reaching accuracy rates of 79.54%, 61.46%, and 68.96% for the diagnosis of the stenoses of the left anterior descending (LAD), left circumflex (LCX), and right coronary artery (RCA), respectively. The accuracy for the LAD stenosis was attained via feature selection. In the current study, features effective in the stenosis of arteries were further determined, and some rules for the evaluation of triglyceride, hemoglobin, hypertension, dyslipidemia, diabetes mellitus, and ejection fraction were extracted. Conclusions: The current study presents the highest accuracy value for diagnosing the LAD stenosis in the literature.


Background
Data mining is the process of exploring hidden knowledge in a database. This science extracts patterns by combining statistics and artificial intelligence methods as well as database management techniques. Nowadays, data mining is used in a large number of fields, including marketing, fraud detection, and knowledge discovery (1). Mortality rates of diseases are much higher than those of accidents and disasters. The leading cause of death in most developing countries such as India and China is cardiovascular disease (2). Cardiovascular disease is a disease of the heart and blood vessels that creates numerous problems, many of which are related to a process called atherosclerosis. Atherosclerosis is a condition that emerges when a substance called plaque accumulates in the walls of arteries, thereby obstructing the blood flow through the vessels and possibly leading to a heart attack or stroke. Angiography is the traditional gold standard for evaluating vascular lesions. However, this diagnostic modality is complicated and costly and can have side effects.
Data mining is utilized as an alternative detection method for the prediction of coronary artery disease (CAD) with high accuracy, based on a set of features collected from the patient. Most of the studies conducted thus far have not considered the stenosis of each artery separately. In such methods, if any of the left anterior descending (LAD), left circumflex (LCX), and right coronary artery (RCA) is stenotic, the existence of CAD is concluded. Only a few studies have considered the stenosis of each artery separately. Babaoglu et al. (3) used exercise test data and neural networks to study the stenosis of each artery. They reached accuracy rates of 73%, 64.85%, and 69.39% in diagnosing the stenosis of the LAD, LCX, and RCA, respectively. Alizadehsani et al. (4) ran the Naïve Bayes and KNN algorithms on symptom and examination features as well as electrocardiographic (ECG) features and attained accuracy rates of 74.20%, 63.76%, and 68.33% in diagnosing the stenosis of the LAD, LCX, and RCA, respectively. The difference between the results of these two studies stems from the different features selected in the two papers. Srinivas et al. (5) and Ordonez et al. (6) extracted some rules for the stenosis of each artery. Sony et al. (7) found a number of rules for the stenosis of each vessel by means of a decision tree.
In addition to diagnosing the stenosis of arteries, some articles have studied the presence of CAD in individuals.
Alizadehsani et al. (8) examined the SMO and Naïve Bayes algorithms and a new ensemble method on symptom and examination features in conjunction with the ECG features and obtained an accuracy rate of 88.5% in diagnosing CAD. The C4.5 and Bagging (with C4.5 as its base classifier) algorithms were used in the present study in order to diagnose the stenosis of the LAD, LCX, and RCA arteries.

Objectives
This study aimed to draw upon data mining algorithms to diagnose the stenosis of arteries and find some rules based on the dataset features. The main goal of this study was to achieve a reliable estimate of the presence of significant CAD in patients presenting with chest pain, based on a series of readily accessible data, including history, cardiac risk factors, and echocardiographic and laboratory tests, with a view to reducing the number of patients undergoing angiography and, thus, decreasing hospital admission durations and the financial burden on health care systems.

Patients and Methods
The dataset of the current study was gathered from 303 random individuals who presented to Rajaie Cardiovascular, Medical and Research Center, Tehran, Iran, with chest pain.

C4.5
C4.5 is a classification algorithm based on decision trees and was presented to improve on the performance of simple decision trees. The difference between ID3, which is one of the earliest decision-tree algorithms, and C4.5 is the ability of the latter to manage continuous values by breaking them down into subintervals. C4.5 also uses pruning methods to improve accuracy and avoid memorizing the training data (9). In the current study, the confidence threshold for pruning and the minimum number of instances per leaf were set at 0.25 and 2, respectively.
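The handling of continuous values described above can be sketched as a search for the cut point that best separates the classes. The following is a minimal illustration only, not the implementation used in the study: it scores candidate thresholds with plain information gain (C4.5 itself uses the gain ratio), and the function names and the toy EF values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best binary split `value <= t`.

    Candidate thresholds are the midpoints between consecutive distinct values.
    Simplification: plain information gain is used instead of C4.5's gain ratio.
    """
    base = entropy(labels)
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical toy data (not from the study): ejection fraction (%) and LAD label.
ef = [25, 30, 35, 40, 50, 55, 60, 65]
lad = ["stenotic"] * 4 + ["normal"] * 4
threshold, gain = best_threshold(ef, lad)
```

On this toy data the two classes are perfectly separated, so the selected cut point falls midway between the highest stenotic value and the lowest normal value.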
The data mining algorithms used in the current study are described in this section. The default settings, as defined by the RapidMiner software, were employed in both the C4.5 and Bagging algorithms. The accuracy of these two algorithms was calculated using the ten-fold cross-validation method.

Bagging
Bagging classifies each sample based on the output of a set of diverse base classifiers. The base classifiers can be selected from C4.5, Naïve Bayes, ID3, and other data mining algorithms (10). In the present study, C4.5 was used as the base classifier, and the number of weak learners was set at 10.
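The bagging procedure described above can be sketched as bootstrap resampling followed by a majority vote. This is a minimal illustration only: to keep it short and self-contained, the base learner is a one-level decision tree (a decision stump) rather than C4.5, and the function names, the seed, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a one-level tree: the (feature, threshold) pair with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda row: left_label if row[f] <= t else right_label


def train_bagging(rows, labels, n_learners=10, seed=0):
    """Train n_learners stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    n = len(rows)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        learners.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))

    def predict(row):
        votes = Counter(learner(row) for learner in learners)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical toy data (not from the study): (EF %, age) and a stenosis label (1 = stenotic).
rows = [(25, 66), (30, 58), (35, 71), (40, 49), (50, 62), (55, 45), (60, 53), (65, 68)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model = train_bagging(rows, labels, n_learners=10)
```

Calling `model((20, 70))` returns the class voted for by most of the ten stumps. Substituting a full decision-tree learner for `train_stump` would bring the sketch closer to the configuration described in the text.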

Results
Descriptive findings about the patients' demographic and background data are presented in Tables 1 and 2. The RapidMiner tool (version 5.2) was employed. RapidMiner is a tool for running machine learning and data mining algorithms (11). The present study made use of the default settings of RapidMiner and utilized accuracy, sensitivity, and specificity to evaluate the algorithms. These quantities are described in what follows.

Confusion Matrix
Performance quantities were measured by means of the confusion matrix. Table 3 depicts a general confusion matrix, in which positive means having the disease and negative means being healthy.

Performance Measures
Sensitivity, specificity, and accuracy are described below based on the confusion matrix (12).
Sensitivity relates to the ability of the algorithm to identify positive results. In fact, it is the probability of detecting CAD, assuming that the patient actually has the disease. Specificity, on the other hand, relates to the ability of the algorithm to detect negative results. In other words, it is the probability of a negative algorithm output, given that the sample is healthy. Accuracy is the overall portion of correctly identified samples.
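The three measures described above can be written out using the standard definitions. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, and false negatives) are assumed notation for the four cells of the confusion matrix, where positive means having the disease.

```latex
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align*}
```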

Evaluation
The impact of different features on disease presence is not uniform. This impact can be measured with the Gini index, which measures the inequality between the values of a distribution. Accordingly, a higher value of the Gini index for a feature indicates a greater impact of that feature on the disease. Tables 4-6 show the Gini index per feature, presenting the impact of the features on the stenosis of the LAD, LCX, and RCA, respectively. As Table 4 demonstrates, the features with the greatest effect on the LAD stenosis were region with regional wall motion abnormality (RWMA), ejection fraction (EF), age, valvular heart disease (VHD), erythrocyte sedimentation rate (ESR), lymphocytes (Lymph.), neutrophils (Neut.), hypertension (HTN), potassium (K), white blood cells (WBC), and fasting blood sugar (FBS), in that order.
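One common way to turn the Gini index into a per-feature score is to measure how much the Gini impurity of the class label drops after grouping the samples by the feature's values. The sketch below follows that reading as an assumption; the exact computation performed by RapidMiner may differ, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_gain(feature_values, labels):
    """Drop in Gini impurity of the labels after grouping by a categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    weighted = sum(len(g) / len(labels) * gini_impurity(g) for g in groups.values())
    return gini_impurity(labels) - weighted


# Hypothetical toy data (not from the study): 1 = stenotic, 0 = normal.
lad = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "HTN": [1, 1, 1, 0, 0, 0, 1, 0],                  # matches the label exactly
    "Sex": ["M", "M", "F", "M", "M", "F", "F", "F"],  # carries no information here
}
scores = {name: gini_gain(values, lad) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # most informative first
```

A ranking of this kind can then be truncated to the top k features when a reduced feature set is needed.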
As Table 6 reveals, the effect of DM, age, Lymph., ESR, Neut., FBS, WBC, length, HTN, EF, TG, hemoglobin (HB), sex, and CR on the RCA stenosis was more than the other features.
A comparison of the three tables reveals that some features such as age, EF, ESR, Lymph., and HTN affected all the arteries significantly. Also, according to the results, diagnosing the stenosis of the LAD and RCA was easier owing to the several high-impact features that affect them. A comparison of the performance measures of the algorithms for the diagnosis of the stenosis of the three arteries is portrayed in Tables 7 and 8. As Table 7 displays, the C4.5 algorithm diagnosed the stenosis of the LAD more accurately than that of the two other arteries. Furthermore, the diagnosis of the RCA stenosis was more accurate than that of the LCX. The sensitivity for the LCX and LAD was higher than the specificity, unlike the RCA. This means that for the two former arteries, the C4.5 offered a low false negative rate. As Table 8 shows, the accuracy of the Bagging algorithm for the LAD stenosis diagnosis was higher than that for the two other arteries. In this algorithm, similar to the C4.5, the sensitivity for the LAD and LCX was higher than the specificity, unlike the RCA. The highest accuracy rates for the LAD, LCX, and RCA stenosis available in the literature belong to Babaoglu et al. (3), which are 73%, 64.85%, and 69.39%, respectively. As this table demonstrates, the accuracy for diagnosing the LAD stenosis was higher than those of the other similar studies, and the accuracy for diagnosing the LCX and RCA stenosis was almost the same as those of the other similar studies. Even a small increase in accuracy can be beneficial, since the diagnosis of artery stenosis is vital in medicine (2). In order to select the most important features, the Gini index and information gain were used. For this purpose, the features were first ranked separately according to each of these two metrics. Thereafter, the 20 most important features based on each metric were selected.
Finally, the C4.5 and Bagging algorithms were run on these two groups of selected features. The final results are illustrated in Tables 9 and 10. A comparison of Tables 7, 8, 9, and 10 indicates that while feature selection decreased the accuracy of the LAD and RCA stenosis diagnosis, it had an opposite effect on the LCX. Furthermore, the use of features selected based on information gain enhanced the accuracy of the LAD stenosis diagnosis to 79.54%, which is higher than the figures reported by previous studies. Alizadehsani et al. (4) attained the accuracy rates of 74.20%, 63.76%, and 68.33% for the LAD, LCX, and RCA stenosis diagnoses, respectively, by using symptom and examination features as well as the ECG features.
The current study extracted some rules not only to evaluate dyslipidemia (DLP), TG, HB, and some echocardiographic features such as the EF but also to diagnose HTN and DM via the RapidMiner application. In these rules, which are shown below, S and C represent Support and Confidence, respectively. Support is the proportion of the data in which all the features named in a rule occur together. A Confidence of 1 shows that whenever the left side of a rule appears, the right side definitely occurs. In the following rules, males older than 45 years and females older than 55 years are considered Old since they are more prone to CAD occurrence. This categorization is based on Braunwald's Heart Disease book (2). HB is regarded as Low when it is lower than 14 for males and 12.5 for females, and High when it is higher than 17 for males and 15 for females (2).
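The definitions of Support and Confidence given above, together with the age categorization, can be illustrated with a short sketch. The records and the rule below are hypothetical toy values, not rules or data from the study; the function names and the label "Young" for the complementary age group are also assumptions.

```python
def age_group(sex, age):
    """Old: males older than 45 and females older than 55 (cut-offs from the text)."""
    cutoff = 45 if sex == "M" else 55
    return "Old" if age > cutoff else "Young"  # "Young" is an assumed label


def support_confidence(records, antecedent, consequent):
    """Support and Confidence of the rule antecedent -> consequent.

    records: list of dicts; antecedent and consequent: dicts of required values.
    """
    def matches(record, condition):
        return all(record.get(key) == value for key, value in condition.items())

    n_left = sum(matches(r, antecedent) for r in records)
    n_both = sum(matches(r, antecedent) and matches(r, consequent) for r in records)
    support = n_both / len(records)
    confidence = n_both / n_left if n_left else 0.0
    return support, confidence


# Hypothetical toy records (not from the study).
patients = [("M", 60, True, True), ("F", 62, True, True), ("M", 50, False, False),
            ("F", 40, True, False), ("M", 30, False, False)]
records = [{"Age": age_group(sex, age), "HTN": htn, "DM": dm}
           for sex, age, htn, dm in patients]

# Hypothetical rule: Age = Old and HTN = True -> DM = True
s, c = support_confidence(records, {"Age": "Old", "HTN": True}, {"DM": True})
```

In this toy example the left side of the rule matches two of the five records and the right side holds in both of them, so the rule has a Support of 0.4 and a Confidence of 1.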

Discussion
In the current study, the C4.5 and Bagging algorithms were applied to laboratory and echocardiographic features because they are widely regarded as being among the best classification algorithms in data mining. Although laboratory and echocardiographic features play important roles in stenosis diagnosis (2), to the best of our knowledge, they have not been considered in any previous work for the diagnosis of the stenosis of the LAD, LCX, and RCA separately. Therefore, it was decided to assess the accuracy of these features in diagnosing the stenosis of each of these three arteries separately. The results indicated that EF, age, Lymph., and HTN were among the 10 features with the greatest effect on the stenosis of all of the arteries.
The accuracy rate obtained in the present study for the LAD stenosis diagnosis is higher than that of the most accurate method, proposed by Babaoglu et al. (3). Nonetheless, in the LCX and RCA stenosis diagnoses, the results are nearly the same. Finally, the current study succeeded in extracting some important rules to assess HB, HTN, TG, EF, DM, and DLP. To the best of our knowledge, the relationship between these features has not been taken into account in any previous study. These studies include the ones conducted by Srinivas et al. (5), Ordonez et al. (6), and Sony et al. (7), who extracted some rules for the stenosis of each artery separately. In the current study, the stenosis of the LAD, LCX, and RCA was examined separately. The C4.5 and Bagging algorithms were employed to analyze the dataset. To the best of our knowledge, this study presents the highest accuracy value for diagnosing the LAD stenosis in the available literature. The diagnostic accuracy may be further augmented by adding new features such as the ECG and examination features in future work.