Efficient Data-Mining Algorithm for Predicting Heart Disease Based on an Angiographic Test

Background: The computerised classification and prediction of heart disease can help medical personnel reach a fast and accurate diagnosis. This study presents an efficient classification method for predicting heart disease using a data-mining algorithm. Methods: The algorithm utilises the weighted support vector machine method for efficient classification of heart disease based on a binary response that indicates the presence or absence of heart disease as the result of an angiographic test. The optimal values of the support vector machine and the Radial Basis Function kernel parameters for the heart disease classification were determined via a 10-fold cross-validation method. The heart disease data was partitioned into training and testing sets using different percentages of the splitting ratio. Each of the training sets was used in training the classification method, while the predictive power of the method was evaluated on each of the test sets using the Monte-Carlo cross-validation resampling technique. The effect of different percentages of the splitting ratio on the method was also observed. Results: The misclassification error rate was used to compare the performance of the method with three selected machine learning methods, and it was observed that the proposed method performed best in all cases considered. Conclusion: The results illustrate that the classification algorithm presented can effectively predict the heart disease status of an individual based on the results of an angiographic test.


Introduction
Heart disease is considered a life-threatening illness because the heart is a vital organ (1). The diagnosis of heart disease is usually based on symptoms, a physical assessment and a medical evaluation, such as the coronary angiographic test (2). Several organisations have made recommendations regarding the optimal approach for identifying coronary heart disease in a patient in non-emergency settings (3). Several factors, such as smoking, level of cholesterol, obesity, hereditary issues and others, have reportedly been associated with heart disease (4). These factors, however, have different levels of association with heart disease and each is likely to be more pronounced in the angiographic diagnostic process of the patient.
The angiographic test is considered the gold standard for identifying and classifying heart disease, such as coronary artery disease (5). However, there are side effects and complications associated with this test (6).
Excluding the subjects with incomplete information reduced the sample size to 297. The response variable ($y_i = \pm 1$) is based on the result of the coronary angiographic test performed on each patient, with $y_i = 1$ indicating the presence of heart disease and $y_i = -1$ indicating the absence of heart disease. The data contained 76 attributes comprising categorical factors and metrical covariates. In terms of the response category, as classified by the angiographic test, 137 (46.1%) of the 297 patients were classified as having heart disease while the remaining 160 (53.9%) did not have the disease. Most literature, such as Latha and Jeeva (14) and Suresh and Ananda Raj (15), recommends using only 14 of the 76 attributes for prediction; after removing the missing data, there were 13 predictors on the subjects (13). These variables, and their respective categories, are presented in Table 1. The predictor variables are labelled $X_1, X_2, \ldots, X_{13}$, respectively. Table 2 presents the frequency distribution of the presence or absence of heart disease based on the angiographic result. Similarly, Table 3 and Figure 1 present some descriptive measures and box plots of the metrical covariates in the data, respectively, while Table 4 presents the frequency and percentage distribution of the categorical factors in the data.

Application of The Proposed Method on Heart Disease Data
The w-SVM algorithm assigns weights to the predictor variables in the data. As indicated in (11), the weights are first determined by obtaining the correlation between each of the predictor variables and the response variable. This is referred to as the point biserial correlation or association (16):
$$r_{X_j Y} = \frac{\bar{X}_{+1} - \bar{X}_{-1}}{s_{X_j}} \sqrt{p_{+1}\, p_{-1}}$$
where $\bar{X}_{+1}$ and $\bar{X}_{-1}$ are the mean values of the predictor variable $X_j$ for all data points in groups +1 and −1, respectively; $p_{+1}$ and $p_{-1}$ are the proportions of data points in groups +1 and −1, respectively, with $p_y = n_y / n$ for $y = -1$ or $+1$; and $s_{X_j}$ is the sample standard deviation of the $j$th feature $X_j$.
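The point biserial correlation described above can be illustrated with a minimal sketch. It assumes the common form of the statistic (difference of the two group means, divided by the standard deviation of the feature, times the square root of the product of the group proportions) and takes the "sample standard deviation" to mean the n − 1 divisor; the function name and the toy data are hypothetical and are not taken from the heart disease data.

```python
import math
import statistics


def point_biserial(x, y):
    """Point biserial correlation between a numeric feature x and labels y in {+1, -1}."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == -1]
    n = len(x)
    p_pos, p_neg = len(pos) / n, len(neg) / n
    # Assumption: "sample standard deviation" means the n - 1 divisor.
    s = statistics.stdev(x)
    return (statistics.mean(pos) - statistics.mean(neg)) / s * math.sqrt(p_pos * p_neg)


# Toy feature and labels, for illustration only.
x_toy = [1.0, 2.0, 3.0, 4.0]
y_toy = [-1, -1, 1, 1]
r_toy = point_biserial(x_toy, y_toy)
```

If the population standard deviation (`statistics.pstdev`) were used instead, the value would coincide with the ordinary Pearson correlation between the feature and the ±1 labels.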
Additionally, the associated risk of an angiogram has been attributed to cardiac and non-cardiac complications (5).
Early diagnosis of heart disease is paramount to its treatment (1), but medical practitioners typically face the challenge of establishing, in a timely manner, the presence or absence of heart disease, the kind of heart disease and the associated costs (7). Similarly, medical processes leading to the diagnosis and prediction of some kinds of disease, such as cancer, have been reported to be quite inefficient due to the risks and time involved (8).
Automating the diagnostic process in health care services by using data-mining techniques is fast gaining recognition (9). Computerised classification of binary response data, such as the heart disease data used in this study, using data-mining algorithms can help medical personnel diagnose their patients quickly and at a low cost (10). This approach can reduce the associated risks, costs and time, and can enable timely treatment and intervention (11).
This paper therefore presents an application of a modified support vector machine (SVM) called the weighted support vector machine (w-SVM) (11) method for the classification and prediction of heart disease. The w-SVM method was developed to improve the predictive power of the standard SVM (12) method using the biserial correlation between the response (presence or absence) of an angiographic test and the associated available factors on each patient.

Methods
This section explains the data set and the methodology of the classification algorithm used in this study.

Heart Disease Data Set
To assess the performance of the proposed w-SVM classifier, a secondary dataset on heart disease, which is available online and has been used in several works in the literature, such as Singh et al. and James et al. (4,13), was used in this research. The heart disease dataset is publicly available and can be obtained from the University of California Irvine (UCI) machine learning repository at http://archive.ics.uci.edu/ml/datasets/heart+Disease.
This study involved 303 patients who presented with chest pain. Of these 303 patients, six subjects were excluded from the analysis due to incomplete information. In the weight computation, $r_{X_j Y}$ is the correlation between predictor $X_j$ and the binary response $Y$.
Each weight is then multiplied by its respective predictor variable to give the new n × p data matrix Z. The traditional SVM is then applied to the new (weighted) data for the purpose of classification using the Monte-Carlo cross-validation (MCCV) method.
Different percentages of the splitting ratio were also considered for the train and test sets, respectively, as reported in (11). The w-SVM is fitted by solving the usual SVM quadratic programming problem on the weighted data, in which $\alpha_i \geq 0$ is the Lagrange multiplier of the $i$th training point.
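For reference, the textbook soft-margin SVM dual, written on the weighted points $z_i$, has the form below. This is the standard formulation and is assumed here rather than copied from (11); the cost parameter $C$ and the kernel function $K$ are notation introduced for this illustration.

```latex
\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n}
    \alpha_i \alpha_k \, y_i y_k \, K(z_i, z_k)
\quad \text{subject to} \quad
\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C .
```

Under this form, the weighting of the predictors enters only through the kernel evaluations $K(z_i, z_k)$, since the labels and constraints are unchanged.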
These correlations determine the weight for each of the predictors. A step-by-step procedure of the w-SVM can be found in (11).
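A minimal sketch of how the correlations could be turned into weights and applied to the data is given here. The normalisation rule (each absolute correlation divided by the sum of absolute correlations) is an assumption: it is not stated explicitly in the text, although it is consistent with the values quoted from Table 6, where 0.5266 maps to 0.1347 and 0.0032 maps to 0.0008, both implying a divisor between 3.9 and 4.0. The function names and the demonstration values are hypothetical.

```python
def correlation_weights(correlations):
    """Normalise absolute correlations so the weights sum to one (assumed rule)."""
    total = sum(abs(r) for r in correlations)
    return [abs(r) / total for r in correlations]


def apply_weights(X, weights):
    """Scale column j of X by weights[j], i.e. z_ij = X_ij * w_jj."""
    return [[x_ij * w for x_ij, w in zip(row, weights)] for row in X]


# Hypothetical correlations for three predictors (not values from Table 6).
r_demo = [0.5, -0.3, 0.2]
w_demo = correlation_weights(r_demo)
Z_demo = apply_weights([[10.0, 20.0, 30.0]], w_demo)
```

Scaling each column by its weight shrinks weakly correlated predictors towards zero, so they contribute less to the distances on which a kernel-based classifier relies.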
Suppose there are n sample points in the data with p predictor variables X, and each point in X belongs to one of the binary classes, $y_i = \pm 1$. The weights for the variables, as discussed in (11), are held in a weight matrix $\omega$, and each weight in $\omega$ is computed from the correlation $r_{X_j Y}$ between predictor $X_j$ and the response. As stated earlier, Z is the updated data, which was derived from the original data matrix (X) and the weight matrix such that $Z = X\omega$, with $z_{ij} = X_{ij}\,\omega_{jj}$ for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$.
The performance of the proposed w-SVM algorithm was evaluated on the test data using several performance indices. Given a 2 × 2 confusion matrix as presented in Table 5, seven performance indices were used to assess the performance of the proposed method: prediction accuracy ($A_{CC}$), misclassification error rate (MER), sensitivity ($S_e$), specificity ($S_p$), positive predictive value ($P_+$), negative predictive value ($P_-$) and Jaccard index (JI). The MER was used in comparing the w-SVM method with the three selected machine learning methods. $A_{CC}$ is the proportion of subjects that are correctly classified by the classifier.
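The seven performance indices named above can be computed from the four cells of a 2 × 2 confusion matrix, as in the following sketch. The formulas are the commonly used definitions and are assumed here, since the exact expressions and Table 5 are not reproduced in the text; the function name and the counts are hypothetical.

```python
def performance_indices(tp, fn, fp, tn):
    """Seven indices from the counts of a 2 x 2 confusion matrix."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    return {
        "ACC": acc,                  # prediction accuracy
        "MER": 1 - acc,              # misclassification error rate
        "Se": tp / (tp + fn),        # sensitivity
        "Sp": tn / (tn + fp),        # specificity
        "P+": tp / (tp + fp),        # positive predictive value
        "P-": tn / (tn + fn),        # negative predictive value
        "JI": tp / (tp + fp + fn),   # Jaccard index for the positive class
    }


# Hypothetical counts for a 60-subject test set (not results from the study).
demo = performance_indices(tp=25, fn=3, fp=2, tn=30)
```

Because MER is the complement of accuracy in this formulation, ranking classifiers by lowest MER is equivalent to ranking them by highest accuracy.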
Data pre-processing and cleaning techniques were applied to remove noisy and missing values present in the data set, respectively. The MCCV resampling technique was applied to produce balanced training and test partitions of the data set at different percentages of the splitting ratio, with each split replicated over 1,000 iterations. The w-SVM classification algorithm was developed to predict heart disease, and the performance of the algorithm was validated with the test data. Figure 2 shows the overall flow of the prediction algorithm for heart disease.
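The MCCV resampling step can be sketched as repeated random partitioning of the subject indices. This sketch assumes simple random splits without stratification by class, which may differ from the balancing used in the study; the function name, the seed and the rounding of the training-set size are illustrative choices.

```python
import random


def mccv_splits(n, train_fraction, n_iter, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random splits."""
    rng = random.Random(seed)
    n_train = round(n * train_fraction)
    indices = list(range(n))
    for _ in range(n_iter):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The first n_train shuffled indices train the model; the rest test it.
        yield shuffled[:n_train], shuffled[n_train:]


# 297 subjects, 80:20 split, 1,000 replications as described in the text.
splits = list(mccv_splits(n=297, train_fraction=0.8, n_iter=1000))
train_idx, test_idx = splits[0]
```

In use, a classifier would be fitted on each training partition and scored on the matching test partition, and the performance indices would then be averaged over the replications.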

Biserial Correlation and Weight
The results in Table 6 show the degree of relationship between the binary response variable Y, and each of the predictor variables. The respective weight of the predictors as calculated using equation (ii) is presented in the third column of the table.
As a kernel-based learning method, the w-SVM uses different kernel functions in its classification process. The well-known kernel functions are linear, radial basis function (RBF), polynomial and sigmoid (12). The choice of kernel functions under different data structures has been extensively discussed in the literature (17). Based on the report of Banjoko et al. (17), the RBF kernel is adopted for the w-SVM algorithm on the heart disease data employed here. The optimal values of the w-SVM with RBF kernel parameters for the heart disease classification were determined via a 10-fold cross-validation method.
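As a small illustration of the RBF kernel mentioned above, the sketch below evaluates the usual form K(u, v) = exp(−γ‖u − v‖²). The parameter name `gamma` and the example points are assumptions for illustration; the text does not list which parameter values were selected by the 10-fold cross-validation.

```python
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between u and v)."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)


# Hypothetical points and gamma value, for illustration only.
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
k_far = rbf_kernel([1.0, 2.0], [3.0, 2.0], gamma=0.5)
```

The kernel equals one for identical points and decays towards zero as points move apart, so rescaling a predictor by a small weight reduces its influence on the similarity between two subjects.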
The weights presented in Table 6 are used to multiply each of the predictors before applying the traditional SVM for efficient classification using different percentages of the splitting ratio for both train and test data. It should be noted that the results in Table 7 are the values obtained from the test data, which were not used in training the algorithm.

Method on the Heart Disease Prediction
The predictive performance of w-SVM and three other classification algorithms, namely Naïve Bayes (NB), Random Forest (RF) and SVM, is compared using the MER of the classifiers for all the splitting ratios considered in this study. Note that the lower the MER, the better the classification result of the classifier. The results are presented in Figure 3.

Results
This section presents the results of a heart disease classification using w-SVM.

Discussion
This study has demonstrated an efficient data-mining method for the prediction of heart disease based on the results of an angiographic test. The weighting step of the w-SVM method identified which of the predictor variables considered in this study were most and least correlated with the response variable.
The results of the respective biserial correlations between the response variable and each of the predictor variables, and the corresponding weights, are presented in Table 6. Table 6 shows that the variable thalassemia (THAL) ($X_{13}$) is the most correlated with the response variable, with a correlation value of 0.5266 and a corresponding weight value of 0.1347, while the predictor variable fasting blood sugar (FBS) ($X_6$) is the least correlated with the response variable, with a correlation value of 0.0032 and a corresponding weight value of 0.0008.
Figure 2. Flow chart of the w-SVM prediction algorithm for the heart disease data
In data-mining techniques, it is usually very important to have more training data than test data. Similarly, the maximum accuracy is 90.60% and was achieved at the splitting ratio of 90:10. This is also not encouraged, because too small a percentage of the test set may lead to overfitting the classification model. A similar pattern of results was also observed for all other performance indices at different percentages of the splitting ratios.
The proposed method performs similarly across the different splitting ratios considered in this study. As pointed out earlier, the two extreme splitting ratios in this study (95:5 and 50:50) are discouraged. Therefore, it is advisable to use the rule of thumb of 80:20 for the train and test sets, respectively, in the implementation of the proposed algorithm.
Also, the results presented in Figure 3 show the efficient performance of the w-SVM method over the three selected machine learning methods for the different splitting ratios, using the MER of each of the methods considered in this study. Observations show that the w-SVM method performs best and is a more efficient data-mining technique for predicting heart disease than the three other existing classifiers. This is indicated by the lowest MER values being attained by the w-SVM throughout the splitting ratios considered.
The implication of the proposed method having the lowest MER is that, while the proposed method will be more efficient in correctly diagnosing a patient with heart disease in this study, the other three selected methods will be less efficient because of their higher chance of wrongly diagnosing the patient. The other performance indices also attest to the quality of the proposed method.
The proposed method can reveal the importance of each of the predictor variables as they relate to heart disease. For instance, THAL has been reported in the literature as a very important factor and even as indicative that the patient has a life-threatening heart disease (24). Similarly, FBS has been reported not to have a significant effect on heart disease (25). Both findings are reflected in the weights that the proposed algorithm assigns to THAL and FBS, respectively, as shown in Table 6.
Table 7 shows the performance indices that were used to assess the w-SVM method on the heart disease data, using different percentages of the splitting ratio for the train and test sets. The results indicate that the splitting ratios considered in this study do not significantly affect the performance of w-SVM on the prediction of heart disease, as similar results were achieved using different percentages of the splitting ratios. Generally, a rule of thumb in data mining is to split the data using the splitting ratio 80:20 for train and test data, respectively. As observed in Table 7, the 80:20 splitting ratio gave a prediction accuracy of 90.53%, Sensitivity value of 90.99%, Specificity value of 90.72%, Positive predictive value of 87.99%, Negative predictive value of 90.53% and a Jaccard Index of 82.89%. The high values of the performance indices considered in the study testify to the effectiveness of the w-SVM method in the prediction of heart disease. Across the different percentages of the splitting ratio considered, the lowest prediction accuracy of the w-SVM algorithm was 88.62%, achieved at the splitting ratio 50:50.
This splitting ratio is discouraged because it leaves the training set no larger than the test set.
The results of the various performance indices for the classification of heart disease using the w-SVM, as employed in this study, are impressive and better than those of the three selected existing classifiers. Similarly, the results obtained show that the proposed method performs well compared to some past studies using the same heart disease data set. A comparison of the accuracy scores of the proposed method and of those studies on the same Cleveland Heart Disease Data is presented in Table 8. It should be noted that the result of the proposed method is the average over 1,000 runs using MCCV, which makes the results obtained more reliable than those of studies that reported their results over a single run.
Finally, although the w-SVM appears effective in predicting heart disease from the relationship between its presence or absence and the associated factors (predictors), a weakness of this study is that, however high the prediction accuracy of the w-SVM, there remains a chance of misclassifying individuals when using the associated factors considered here. This is shown by the non-zero value of the MER over the different percentages of the splitting ratios. Therefore, there is a need to investigate the clinical importance of this study in order to further validate the results.

Conclusion
This study presents an efficient data-mining algorithm for the classification and prediction of heart disease. The w-SVM algorithm was developed using the degree of association between the dichotomous response variable and each of the predictors to determine the respective weights of the predictor variables. The w-SVM provides better and more efficient results that can also assist domain experts with better planning of early diagnosis and treatment of the patient. The results show that the w-SVM algorithm can predict heart disease based on an angiographic test with more than 90% accuracy. Medical practitioners, particularly cardiologists, are advised to further investigate the results of the proposed method to support easier and quicker medical treatment and intervention for heart disease based on an angiogram.