Machine learning algorithms for predicting coronary artery disease: efforts toward an open source solution

Aim: The development of coronary artery disease (CAD), a highly prevalent disease worldwide, is influenced by several modifiable risk factors. Predictive models built using machine learning (ML) algorithms may assist clinicians in the timely detection of CAD and may improve outcomes. Materials & methods: In this study, we applied six different ML algorithms to predict the presence of CAD amongst patients listed in 'the Cleveland dataset.' The generated computer code is provided as a working open source solution, with the ultimate goal of achieving a viable clinical tool for CAD detection. Results: All six ML algorithms achieved accuracies greater than 80%, with the 'neural network' algorithm achieving accuracy greater than 93%. The recall achieved with the 'neural network' model is also the highest of the six models (0.93), indicating that predictive ML models may provide diagnostic value in CAD.

Coronary artery disease (CAD) is the most common type of heart disease, affecting millions worldwide. According to recent statistics from the American Heart Association, coronary heart disease accounted for 13% of deaths in the USA in 2018 [21]. Worldwide in 2015, CAD was found to be one of the most common causes of death, with 15.6% of all deaths resulting from the disease [1]. Because CAD is associated with several modifiable risk factors pertaining to lifestyle and intervention, timing of detection and diagnostic accuracy are especially relevant in the clinical management of patients with CAD.
Over the past several years, approaches that include machine learning (ML) have made a significant impact on the detection and diagnosis of diseases [2][3][4][5][6][7]. In general, the ML approach involves 'training' an algorithm with a control dataset for which the disease status (disease or no disease) is known, and then applying this trained algorithm to a variable dataset in order to predict the disease status of patients for whom it is not yet determined. As larger cohorts of data are introduced, the ML algorithm becomes better trained as a predictor of disease status. More accurate disease prediction with ML would empower clinicians with improved detection, diagnosis, classification, risk stratification and, ultimately, management of patients, all while potentially minimizing required clinical intervention.
The application of ML concepts to CAD has been significantly hampered by the limited availability of appropriate clinical datasets. However, one of the components of the 'UCI Heart Disease Dataset,' dubbed the 'Cleveland dataset,' is publicly available on the UCI Machine Learning Repository (see [22]). Originally intended as a teaching aid, the Cleveland dataset has been widely used for exploring ML concepts. Available since 1988, the Cleveland dataset has so far received more than one million downloads and is currently ranked as the fifth most popular dataset in the UCI repository. A Google Scholar search with the terms 'Cleveland dataset,' 'heart disease' and 'machine learning' returns little more than 300 records since 2010, of which most studies analyzed one ML method at a time (for the latest survey review, see [8]). In addition, there is no way to verify the accuracy claims made in any of these publications, as the computer code has not been made publicly available [23]. A PubMed search using the same keywords identified ten original articles and one comprehensive review article [9]. Of the ten original articles, three used the Cleveland dataset, whereas the remaining seven used proprietary datasets. None of the ten studies made the computer code publicly available. In our view, all recent studies pertaining to the application of the ML approach to CAD thus far appear exploratory rather than seeking to provide clinical assistance to healthcare practitioners in the treatment of CAD. (Also see 'Related Works' in the Supplementary materials files for a list of recently published articles that encompass ML algorithms for the detection and diagnosis of disease.)
We have now undertaken a comparative analysis by applying six different ML algorithms (models) to the UCI Cleveland dataset to predict disease outcomes. In an effort to initiate an open source ML solution for detecting CAD, we have deposited our computer code on GitHub (see [25]), making it available for other researchers to test and improve our work. We also welcome the opportunity to gain access to larger datasets to further our efforts toward an open source solution. As for the six ML algorithms used in our study, we have found that all six (logistic regression, decision tree, random forest, support-vector machine, k-nearest neighbor and neural network) perform well, with accuracies greater than ∼80%, and with the neural network's accuracy greater than 93%. We consider accuracy, recall, F1 score and the area under the receiver operating characteristic curve (AUC-ROC) as performance metrics for comparative analysis amongst the ML models.
Numerous risk factor variables contribute to the development of CAD, some of which can be controlled (modifiable) (e.g., see [10]). These include high blood pressure, high cholesterol, smoking, diabetes, obesity, lack of physical activity, unhealthy diet and stress. The risk factors that cannot be controlled (nonmodifiable) are age, sex (gender), family history and race or ethnicity. Traditional approaches assess these risk factors to predict future risk (prognosis) of cardiovascular disease [11]. However, a large number of individuals at risk of cardiovascular disease fail to be identified by these approaches, while some individuals not at risk are given preventative treatment unnecessarily (see [12,13]). Several ML algorithms can summarize the impact of individual variables on the response variable, reported as 'variables of importance' (from the model's perspective), thus aiding the building of accurate prognostic models [14]. In the present study, we extract the variables of importance in one model in an effort to demonstrate the feasibility of including a risk assessment component in the ML model. Furthermore, we have openly hosted the computer code generated in this study on GitHub, so that we may provide the setting of a workflow model that promotes public contribution toward improved risk detection of CAD using ML, as others may contribute by extending the code to larger cohort groups.
The remainder of the manuscript is organized as follows. The Dataset & preprocessing section provides information on the data source, describes the data variables, explains the data preprocessing steps and provides a high-level analysis of the data. The next section provides concise descriptions of the ML models considered in this study and the methodologies by which the model results are evaluated. The results of the modeling efforts are presented and discussed in the Results & discussion section. The concluding section outlines the limitations of the present modeling efforts and provides a workflow model framework for extending our study in the future.

Dataset & preprocessing
The dataset used in this study was downloaded from the repository maintained by the Center for Machine Learning and Intelligent Systems at the University of California, Irvine (UCI; CA, USA). The repository contains four datasets from four different hospitals. The Cleveland dataset contains fewer missing attributes than the other datasets and has more records. This dataset has fourteen variables on 303 patients. Table 1 lists all fourteen variables in the dataset, the associated datatype for each and a brief description.
During the data preprocessing steps, six rows of data with unknown values were dropped, resulting in a dataset of 297 observations; two dataset columns of factor datatype were converted to numeric datatype; the continuous variables (four, excluding age) were normalized; the column 'num' was renamed to 'hd' for clarity; and the values 1 to 4 in the heart disease (hd) column were replaced with the value 1 to form a binary classification: disease (value of 1) or no disease (value of 0).
The correlation matrix for the 14 variables in the dataset is shown in Figure 1, depicting the Pearson coefficient, r, corresponding to the association between two variables, which can range from -1 to +1. A higher absolute value of r indicates a stronger correlation between two variables. A positive r indicates a direct relationship between two variables, whereas a negative r suggests an inverse relationship. In this analysis, we considered an absolute value of 0.5 as the threshold; that is, if r is greater than 0.5 or less than -0.5, we assume those two variables are correlated. The correlation matrix shows that only a few variables correlate with coefficients greater than 0.5, demonstrating poor correlation between the variables. Such low values of r indicate that these variables are largely independent and can all be included in the machine learning model building. Additional analysis of the dataset is presented in the Supplementary material.
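The correlation screen described above can be sketched in a few lines. The study's own code is in R; the following is an illustrative pure-Python equivalent, with made-up sample values rather than actual dataset columns:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated(x, y, threshold=0.5):
    """Flag a variable pair as correlated when |r| exceeds the threshold."""
    return abs(pearson_r(x, y)) > threshold
```

Applying `correlated` to every pair of columns reproduces the screen: pairs exceeding the 0.5 threshold would warrant attention, while the rest may be treated as sufficiently independent for model building.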

Machine learning models, model building & model evaluation
In applying machine learning models, it is generally understood that no single algorithm is superior to the others [15]. In machine learning, if every instance in the dataset is given to the model with known labels (the corresponding correct outputs), as in the Cleveland dataset, then the learning is called 'supervised', in contrast to 'unsupervised' learning, in which instances are unlabeled. Below, we present the general idea of how each of the six supervised machine learning algorithms works on the dataset and any assumptions we make in each case. Each algorithm is first trained (or fitted) with a fraction of the dataset, usually known as the 'training set', and then tested on the 'test set' that is put aside as 'unseen data' for evaluating the algorithm. For a detailed description of the models we refer the reader to the excellent treatise by James et al. [16].

Logistic regression
In logistic regression, each independent variable in every instance of the dataset is multiplied by a weight and the cumulative result is passed to a sigmoid function. The sigmoid function maps real values into probability values between 0 and 1. In our present modeling, we left the threshold at the default value of 0.5, such that for probability values greater than 0.5 the model predicts the dependent variable to be 1 (the patient has CAD), and for values less than or equal to 0.5 the model predicts the dependent variable to be 0 (the patient does not have CAD).
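The prediction step just described can be sketched as follows. This is an illustrative pure-Python rendering (the study's code is in R), and the weights and feature values are hypothetical placeholders, not fitted coefficients:

```python
import math

def sigmoid(z):
    """Map a real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias=0.0, threshold=0.5):
    """Weighted sum of the variables, passed through the sigmoid;
    return 1 (CAD) when the probability exceeds the threshold."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) > threshold else 0
```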

Decision tree
A decision tree is a tree-like structure that classifies instances by sorting them based on the values of the variables. Each node in a decision tree represents a variable, and each branch represents a value that the node can assume.
Instances are classified starting at the root node and sorted based on the values of the variables. The variable that best divides the dataset becomes the root node of the tree. Internal (or split) nodes are the decision-making parts: based on a splitting criterion, they decide which subsequent node to visit. The splitting process terminates when a user-defined criterion is reached at a leaf (for the present modeling, we left this at the default value of 20). The paths from the root node to the leaf nodes represent classification rules.
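The root-to-leaf routing described above can be illustrated with a toy sketch. The tree below is hypothetical (it is not the tree fitted in the study) and serves only to show the mechanics of following split nodes to a leaf:

```python
def classify(node, instance):
    """Follow split nodes until a leaf (a bare class label) is reached."""
    while isinstance(node, dict):            # internal (split) node
        var, threshold = node["var"], node["threshold"]
        branch = "left" if instance[var] <= threshold else "right"
        node = node[branch]
    return node                              # leaf: 0 (no disease) or 1 (disease)

# Hypothetical two-level tree: split on 'cp' at the root, then on 'thalach'.
tree = {"var": "cp", "threshold": 2,
        "left": 0,
        "right": {"var": "thalach", "threshold": 150, "left": 1, "right": 0}}
```

Each root-to-leaf path here corresponds to one classification rule, e.g. "cp > 2 and thalach <= 150 implies disease."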

Random forest
Random forest is an ensemble model consisting of multiple decision trees, as in a forest. Random forest combines several classification trees, training each one on a slightly different subset of the dataset instances and splitting nodes in each tree while considering only a limited number of the variables. The final predictions of the random forest are made by aggregating the predictions of the individual trees, which enhances the prediction accuracy for unseen data. The number of trees chosen for the present modeling is 500 (ntree = 500).
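The aggregation step can be sketched minimally. For classification the trees' votes are combined by majority rather than a numeric average; the per-tree predictions below are hypothetical stand-ins, not output from the fitted forest:

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Aggregate the individual trees' class predictions by majority vote."""
    return Counter(tree_predictions).most_common(1)[0][0]
```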

Support vector machine
In a support-vector machine, each data point is plotted in an n-dimensional space, with the value of each variable being the value of a particular coordinate, and classification is performed based on the hyperplane that best differentiates the two classes. Following this, the characteristics of new instances can be used to predict the class to which each new instance should belong.
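For a linear kernel, classifying a new instance reduces to checking which side of the fitted hyperplane w · x + b the point falls on. The sketch below is illustrative; the weights and bias are hypothetical, not the hyperplane fitted in the study:

```python
def svm_predict(x, w, b):
    """Return 1 if the point lies on the positive side of the
    hyperplane defined by weights w and bias b, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0
```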
k-Nearest neighbor
k-Nearest neighbor (kNN) is one of the most basic algorithms and is nonparametric; it does not make any assumptions about the distribution of the underlying data. The algorithm is based on the principle that instances within a dataset generally exist in close proximity to other instances that have similar properties, with proximity measured by Euclidean distance.
If the instances are tagged with a classification label, then the label of an unclassified instance can be determined by observing the class of its nearest neighbors. For the present modeling, the whole cross-validation process is repeated three times (repeats = 3), each time with ten folds (number = 10), and the average of the three iterations is taken.
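The nearest-neighbor principle can be sketched in pure Python (again, the study itself used R). The labeled points below are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by the majority label among its k closest
    labeled neighbors under Euclidean distance.
    train: list of (feature_vector, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda pt: dist(pt[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical labeled instances: one cluster per class.
train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
```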

Artificial neural network
An artificial neural network (ANN) is based on the structure and functions of biological neural networks: the network learns (or changes) based on the input and output. The layers in an ANN are segregated into three classes: input units, which receive the information to be processed; output units, where the results of the processing are found; and units in between, known as hidden units. The network is first trained on a dataset of paired data to determine the input-output mapping. After training, the weights of the connections between neurons are fixed and the network is used to determine the classifications of a new set of data. For the present modeling we set the number of hidden units to three (hidden = 3) and the threshold to 0.05 (threshold = 0.05).
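A forward pass through a network with three hidden units (mirroring the hidden = 3 setting) can be sketched as follows. This is illustrative pure Python; all weights are hypothetical placeholders rather than the trained weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer of sigmoid units, followed by a sigmoid output
    unit; returns the predicted probability of disease."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
```

Training adjusts `w_hidden`, `b_hidden`, `w_out` and `b_out` to fit the input-output pairs; after training they stay fixed and `forward` is applied to unseen instances.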
To overcome our inability to use real-world data, we split the dataset into a 'training set' (70%, i.e., 208 observations) and a 'test set' (30%, i.e., 87 observations), making sure to balance the class distributions within the split (see the associated computer code on GitHub, https://github.com/aa54/CAD1). The 'training' dataset is used to train the model; the model sees and learns from this training data. The 'test' dataset is then used to provide an unbiased evaluation of the final model fit to the training dataset. In some cases, we ran multiple experiments to validate model results on different splitting ratios. The following four metrics [17] were used to evaluate the performance of the predicted models and compare them with one another. 1) Accuracy: the proportion of total dataset instances that were correctly predicted: accuracy = (true positives + true negatives)/total. 2) Recall (sensitivity): the proportion of actual positive dataset instances that were correctly predicted: sensitivity = true positives/(true positives + false negatives). 3) F1 score: a harmonic mean that combines both precision and recall. For this, we first measure the precision, the ability of the model to identify only the relevant dataset instances: precision = true positives/(true positives + false positives). The F1 score is then estimated as F1 = 2 × (precision × recall)/(precision + recall). 4) Area under the curve (AUC): an estimate of the probability that a model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. Additionally, the full area under the curve-receiver operating characteristic (AUC-ROC) curves are plotted to visually compare the models' performance.
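The first three metrics are computed directly from the confusion-matrix counts (true positives, true negatives, false positives, false negatives); a minimal sketch, with illustrative counts rather than the study's actual confusion matrices:

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all instances that were correctly predicted."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Proportion of actual positives that were correctly predicted."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Proportion of predicted positives that were actually positive."""
    return tp / (tp + fp)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```

AUC is omitted here because it is computed from ranked prediction scores rather than from the counts alone.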

Results & discussion
For our machine learning model building, we started with two basic models, logistic regression and decision tree, to predict the presence of CAD. Our intuition was that it would be easy to interpret the results of the basic models and to explain each model to a non-machine-learning audience. After analyzing the results, however, we realized that the decision tree model is prone to over-fitting, and logistic regression did not perform particularly well with this simple dataset. We then considered applying the standard SMOTE methodology [18] to create additional synthetic data. Although we achieved more accurate results with SMOTE (5-6% higher accuracy), we felt that SMOTE is not necessarily a good approach, as it creates data points based on a distance algorithm, and these synthetic data points may not be 'true' representations of patients. Therefore, we decided not to perform SMOTE on the dataset and proceeded to apply the remaining, more complex ML algorithms.
As shown in Table 2, an accuracy value higher than 0.84 is achieved with all but the decision tree model (just below 0.80). The other performance parameters, recall, F1 score and AUC, are also high. A mean value of the latter three parameters is calculated for each of the models (shown in the last column in Table 2) to judge which model performs best as a whole. We excluded accuracy in calculating the mean, as accuracy is often considered to be a misleading indicator in measuring the performance of models with biomedical datasets (e.g., see [3]).
As seen in Table 2, the performance of the neural network model is outstanding, with an accuracy of 0.9303 and a recall of 0.9380. In addition, when running multiple experiments with differently proportioned training and test sets (changing from 70:30 to either 60:40 or 80:20) with the neural network, the performance metrics were unchanged.
A high recall value indicates a lower propensity for false negatives. In disease prediction, a low recall (high frequency of false negatives) would misdiagnose patients with CAD as healthy, which may have devastating consequences. For example, a recall of 0.9380 (achieved with the neural network) implies that this algorithm will correctly identify the presence of CAD in approximately 94 out of 100 patients with CAD. Figure 2 compares the performance of the six ML algorithms using AUC-ROC. AUC-ROC curves allow one to visualize the tradeoff between the true positive rate and the false positive rate, whereas the AUC (see above) is useful for comparing multiple algorithms or hyperparameter combinations (the two are obtained by different methods). As seen in the figure, all models except the decision tree perform well; there is very little difference in the AUC-ROC curves. Nevertheless, the ROC curve corresponding to the neural network model has a relatively higher positive slope, indicating that this model gives a higher 'true positive percentage.' Combined with the results obtained for the performance metrics, it may be concluded that the neural network model is able to predict CAD more effectively than the other models.
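The probabilistic definition of AUC given earlier (the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one) can be computed directly from the model's scores. A minimal sketch, with hypothetical scores; production code would typically use a rank-based implementation instead of this O(n²) pairwise loop:

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) score pairs in which the
    positive instance is scored higher; ties count as half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```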
The computer code generated in this study, which was developed in R, has been made available on the public repository GitHub and supplementally with this manuscript. As R is a free software tool for statistical modeling with an extensive set of code libraries, collaboration on and improvement of the code generated in this study is welcomed and encouraged. The code libraries in R include advanced ML methodologies (voting, bagging, optimization, etc.), with more to be added as R evolves further as an open source tool [19].
We next extracted the 'variables of importance' in the neural network model. Understanding the relative importance of the variables may dictate which variables need to be included in the risk prediction of CAD. As shown in Figure 3, 'restecg' (resting electrocardiogram [EKG]) and sex (gender) have a relatively high importance, while the three attributes 'age', 'thalach' (maximum heart rate achieved) and 'exang' (exercise-induced angina) have the least importance. The top ten variables of importance as assigned by the neural network model, in descending order, are: resting EKG, patient gender, chest pain, slope of the peak exercise ST segment, ST depression induced by exercise relative to rest, fasting blood sugar (higher or lower than 120 mg/dl), thallium heart scan, number of major vessels, resting blood pressure and serum cholesterol.
Heinze et al. [20] proposed recommendations on the application of variable selection methods to aid modeling in the life sciences, and it is worth following these recommendations in future efforts to identify which of the variables of importance are worth studying in risk modeling.

Conclusion & future perspective
In this study, we demonstrated that ML algorithms can be applied with high accuracy and recall to detect the presence of CAD using a publicly available dataset. We also demonstrated that the neural network model outperforms the other ML models in detecting CAD. We deposited the associated computer code in the public domain (see [25]) in the hope of contributing to an open source community.
Although CAD is widely prevalent and may lead to fatal consequences, timely detection of CAD would empower clinicians to treat the modifiable risk factors associated with its progression. Using an ML approach provides the ability to predict the presence of CAD with high accuracy and recall, and thus allows practitioners to practice preventative medicine in patients with CAD in a timelier manner. However, at this initial stage, it should be noted that ML serves solely as a predictor of CAD rather than a diagnostic tool. We hope that as more datasets become available for training the algorithms, we may be able to label ML algorithms as diagnostic steps in CAD management. Because machine learning utilizes datasets of patients who have already been diagnosed, the predictive ability of an ML algorithm for CAD would improve as more data are supplied to the algorithm. We envision that a viable ML solution for predicting coronary artery disease (CAD) will evolve in three steps: (1) model exploration; (2) model refinement; and (3) monitoring and maintenance. A proposed framework for arriving at a practical ML solution for the detection of CAD is depicted in Figure 4. We believe that with our current effort, we have achieved the first step (marked 1 in Figure 4). We are hopeful that this study can help form the basis for further testing and validation of our algorithms with multiple and larger datasets. We look forward to gaining access to larger datasets for further validation and refinement (depicted as step 2 in Figure 4), with the eventual goal of providing an open source solution (step 3 in Figure 4) to aid healthcare practitioners in the detection and treatment of CAD.

Figure 1 .
Figure 1. Correlation matrix of the various parameters in the Cleveland Heart Disease Dataset. The color coding scale denotes the degree of Pearson correlation between variables, with red being negatively correlated and blue positively correlated.

Figure 2 .
Figure 2. Classifier performance comparison with receiver operating characteristic (ROC) curves obtained in applying machine learning algorithms to the Cleveland Heart Disease Dataset. (A) Area under the ROC curve (AUC) for all six models with estimated 95% confidence intervals. (B) Full ROC curves for the six models. Note the matching color code.

Figure 4 .
Figure 4. An ideal framework for developing a viable machine learning model. The dashed arrow indicates the iteration involving the machine learning algorithms presented here (also marked 1). Steps 2 and 3 (solid arrows) indicate future explorations. Modified with permission from [26].

Table 1 .
Dataset† variable names, types and descriptions.

Table 2 .
Performance metrics of the ML models applied on the Cleveland Heart Disease Dataset.