Comparing the Efficiency of Artificial Neural Network and Gene Expression Programming in Predicting Coronary Artery Disease


Introduction
Coronary artery disease is the most common cardiovascular disease [1] and the most frequent cause of death in the world [2]. In Iran it is the leading cause of death [3]. The disease results from the convergence of a number of contributing risk factors [4]. The medical literature shows that these risk factors mainly include smoking, hypertension, lipid disorders (high cholesterol, high triglycerides, high LDL, low HDL), diabetes, physical inactivity, obesity, abdominal obesity, age, sex, family history, alcohol consumption, psychological factors, menopause, high fasting blood glucose, fibrinogen, lipoprotein(a), C-reactive protein (CRP) and homocysteine [4][5][6][7][8][9][10][11]. Coronary angiography is considered the gold standard for the diagnosis of Coronary Artery Disease (CAD) [11]. Angiography, however, is an expensive and invasive procedure that carries some risks [6]. Non-invasive tests, on the other hand, may yield false negative or false positive results that could be dangerous for the patient. Hence the adoption of decision support systems, alongside the other procedures carried out before angiography, is essential to reduce false results [12]. Decision support systems, which help to solve complex problems effectively and to make proper decisions [13], have been recommended by many researchers for disease detection. These systems detect patterns in medical data, improve the decision-making process and, as a result, affect costs [14] while enhancing the quality of health care [15]. Decision support systems are built with a variety of data mining techniques. One of them, the Artificial Neural Network (ANN), is a mathematical model inspired by biological neural networks and is widely used in diagnostic systems in various fields, especially medicine [16].
Among the different data mining techniques, Gene Expression Programming (GEP) is a genotype/phenotype genetic algorithm (linear and ramified) presented as a new technique for the creation of computer programs. GEP uses linear character chromosomes composed of genes that are structurally organized in heads and tails. The chromosomes encode expression trees, which are the object of selection. The creation of these separate entities (genome and expression tree, with distinct functions) allows the algorithm to perform with an efficiency that can greatly surpass existing adaptive techniques [17].
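To make the head/tail organization concrete, the following minimal sketch decodes one linear gene into an expression tree and evaluates it. It is an illustration only, not the GENEXPRO implementation used in this study: the function set, the symbols a, b and c, the protected division and the example gene are all assumptions. The tail-length rule t = h(n - 1) + 1, where h is the head length and n the largest function arity, is the standard GEP condition that guarantees every gene decodes to a valid tree.

```python
from collections import deque

# Arity of each function symbol; every other symbol is treated as a terminal.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}


def tail_length(head_length, max_arity):
    """Tail size that guarantees every gene decodes to a complete tree."""
    return head_length * (max_arity - 1) + 1


def decode(gene):
    """Build an expression tree from a linear gene, filling it level by level."""
    symbols = iter(gene)
    root = [next(symbols), []]  # a node is [symbol, children]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
    return root  # symbols left unread at the end of the gene are non-coding


def evaluate(node, values):
    """Evaluate a decoded tree for a mapping of terminal symbols to numbers."""
    symbol, children = node
    if not children:
        return values[symbol]
    left, right = (evaluate(child, values) for child in children)
    if symbol == "+":
        return left + right
    if symbol == "-":
        return left - right
    if symbol == "*":
        return left * right
    return left / right if right != 0 else 1.0  # protected division (assumption)


# Hypothetical gene: head "+*a" (length 3) followed by tail "bcab" (length 4).
tree = decode("+*abcab")
result = evaluate(tree, {"a": 1, "b": 2, "c": 3})  # encodes (b * c) + a
```

Because the tree is read from the gene level by level, genetic operators can alter any position of the linear chromosome and the result still decodes to a syntactically valid tree; the last two symbols of the example gene are simply not expressed.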
Numerous studies have used data mining techniques to predict CAD. One study, for example, compared the performance of three techniques (logistic regression, decision tree and neural network) in predicting CAD; the multilayer perceptron neural network model, with an accuracy rate of 78.7%, was shown to be the best model [18]. In two other studies, Mobley and his colleagues built two neural network models for CAD from data sets that differed in size and risk factors, reaching accuracy rates of 89% and 72% [12,19]. Some other studies used data stored in the machine learning repository of the University of California, Irvine [20]. The results varied depending on the data mining technique used in these studies [21][22][23]. As mentioned above, the risks involved in invasive diagnostic procedures such as angiography have to be dealt with properly, and data mining techniques, with their promising outcomes, offer one way of overcoming them. This study compares GEP, as a new data mining technique, with ANN, as a conventional technique, and then presents a diagnostic model for predicting CAD.

Methods
To obtain a prediction model for CAD, the angiography database of Tehran Heart Center, with 13,228 records, was used. The database included nine risk factors: age, sex, obesity, abdominal obesity, family history, smoking, high cholesterol, diabetes and hypertension. Descriptive statistics for this database appear in Table 1. To avoid overfitting and to evaluate the generalizability of the model, the data set was split into two subsets, training (70%) and validation (30%) [24]. Modeling was then done with the GENEXPRO and MATLAB applications, based on GEP and ANN. The steps involved in GEP were as follows. First, an initial population of chromosomes (solutions) was randomly generated. Then each chromosome was expressed and its fitness value was calculated. One popular fitness function is "Hits with penalty", which is based on the number of samples that are correctly classified and penalizes models whose True Positive (TP) or True Negative (TN) count is zero even though their total number of hits is high. When a generation produced a model with higher accuracy than the models of previous generations, that model survived. If the termination condition of the algorithm (e.g. achieving the greatest fitness) was fulfilled, the best solution among the existing options was selected and the algorithm terminated; otherwise, the procedure continued by producing another generation of solutions [25]. GEP-based modeling requires some initial settings, as can be seen in Table 2. Modeling based on ANN was done using a Multilayer Perceptron (MLP) neural network. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, a quasi-Newton method, was used to train the network. This learning algorithm converges faster than the gradient descent and conjugate gradient algorithms and is one of the appropriate learning algorithms [26]. Since there is no equation for estimating parameters such as the number of neurons in the hidden layer, the layer activation functions and the error function of a neural network model, we created 100 neural network models by randomly selecting the parameter values, as can be seen in Table 3. The area under the ROC curve (AUC) was used in the current work to compare the efficiency of the models. This measure has been widely used in recent years to evaluate machine learning algorithms [27]; it has also been used in medicine as an effective way to evaluate the performance of diagnostic tests against the gold standard [28]. Following the modeling procedure based on ANN and GEP, the models obtained from each technique were compared by AUC value in order to select the best models and technique.
In feature selection, the intended variables are obtained by removing extraneous and irrelevant variables [29,30]. In line with this procedure, the stepwise backward elimination method was adopted to compare the results of ANN and GEP and to select the best possible model and technique. At each step the least important risk factor was removed and the modeling process was carried out with the remaining risk factors. This process continued until a step produced a significant change in the accuracy of the model. The Classification and Regression Tree (CART) technique was used to determine the importance of each variable. Upon completion of the modeling procedure, the best models from the different modeling stages were compared and the final model was selected.

Results and Discussion
In the first stage of modeling, which considered all the relevant risk factors, modeling was done based on GEP and ANN. In GEP a total of 52 models were produced; these models were then evaluated and the best one, with an AUC of 0.72, was selected. In ANN a total of 100 models were produced using different parameter values; these models were then evaluated and the best one, with an AUC of 0.719, was selected. The results of modeling based on the ANN and GEP techniques appear in Table 4. To compare the models based on GEP and ANN, DeLong's test was used.
The test determines whether there is a significant difference between AUC values [31]. In this study, comparison of the AUC values of the GEP and ANN models using DeLong's test showed no significant difference (p=0.789). However, the ANN model cannot be presented and interpreted in detail because it behaves as a black box. GEP-based models, by contrast, are expression trees and can therefore easily be presented, interpreted or converted to other programming languages; with these features in mind, the current study chose the GEP technique. As mentioned in the Methods section, the current study adopted a feature selection procedure to achieve a simple model. In line with this procedure, the CART technique was used to rank the risk factors in order of importance as follows: age (100%), diabetes (86%), hypertension (52%), sex (49%), high cholesterol (37%), smoking (36%), obesity (17%), and family history (13%). In the second stage of modeling, family history, as the least important risk factor, was removed and the modeling was repeated with the remaining risk factors. Twenty-four models were generated; after evaluation of these models, the best one, with an AUC of 0.700, was selected. There was little difference between the best models of the first and second stages. The modeling process therefore continued, leaving the researchers with just a few models at the third stage. After evaluation of these models, the best one, with an AUC of 0.677, was selected. This value differs only slightly from those of the previous stages. As in the previous stages, the least important risk factors were further removed and the modeling process continued with the remaining risk factors.
At the seventh stage of modeling, the AUC of the best model was significantly different from the AUC of the best model at the sixth stage. This means that the risk factors available at stage seven were no longer sufficient for further modeling, so the modeling process was abandoned at stage seven. After seven stages of modeling, the models obtained at the first and sixth stages were shown to be the best models, with the salient features listed in Table 5. As shown in Table 5, the model produced at the first stage was the best in terms of accuracy and area under the ROC curve; this model was therefore taken as the selected model. However, the model created at the sixth stage differs only negligibly from the selected model while comprising only four risk factors, which makes it simpler than the selected one. In the following steps, in order to obtain a simpler model without loss of accuracy, the selected model was fed back into the gene expression programming algorithm as input, which shortened the model size from 33 to 25 with no change in accuracy. The final model, shown as a tree diagram in Table 1, can be easily converted into any programming language. A noticeable point, following its modification, is that, in addition to being shorter, the model does not include hypertension as a risk factor. As such, the final model is composed of eight risk factors: age, sex, obesity, abdominal obesity, family history, smoking, hyperlipidemia, and diabetes. The ROC curve of this model is comparable to Figures 1 and 2.
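The staged elimination described above can be sketched as a short loop. This is a hedged illustration, not the procedure as implemented in the study: `evaluate_auc` is a hypothetical callback standing in for a complete GEP or ANN modeling run, `tolerance` is an assumed numeric threshold that replaces the significance judgment made between stages, and the toy scoring function and its numbers are invented purely so that the example runs.

```python
def backward_elimination(ranked_features, evaluate_auc, tolerance=0.03):
    """Drop the least important feature stage by stage.

    ranked_features: names ordered from most to least important.
    evaluate_auc:    callable taking a feature list and returning a
                     validation AUC (hypothetical stand-in for a modeling run).
    tolerance:       largest AUC drop between consecutive stages that is
                     still treated as negligible (assumed threshold).
    Returns the list of accepted stages as (features, auc) pairs.
    """
    features = list(ranked_features)
    history = [(tuple(features), evaluate_auc(features))]
    while len(features) > 1:
        candidate = features[:-1]  # remove the least important feature
        auc = evaluate_auc(candidate)
        if history[-1][1] - auc > tolerance:
            break  # the drop is no longer negligible: stop here
        features = candidate
        history.append((tuple(features), auc))
    return history


# Feature order follows the CART ranking reported in the text.
ranked = ["age", "diabetes", "hypertension", "sex",
          "high cholesterol", "smoking", "obesity", "family history"]

# Toy scoring function with invented numbers, for illustration only.
core = {"age", "diabetes", "hypertension", "sex"}


def toy_auc(features):
    return 0.6 + sum(0.04 if name in core else 0.002 for name in features)


stages = backward_elimination(ranked, toy_auc)
final_features = stages[-1][0]
```

With the toy scores, the loop stops once removing a further factor costs more than the tolerance, which mirrors the way the modeling process here was abandoned when a stage produced a clearly worse AUC.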
A number of studies have produced models with high accuracy rates using data sets available in the Machine Learning Repository of the University of California. These models achieve such accuracy because, in addition to CAD-related risk factors, they draw on physical examinations, electrocardiography and stress imaging tests. This shows that diagnostic tests carried out before angiography could be very effective in developing models with high accuracy. The prediction models developed in the current research and in other similar studies based on CAD risk factors vary in accuracy [12,18,32]. Limited access to sufficient risk factors in some of these studies, including the current work, has resulted in a low accuracy rate. Another important issue, in the current research and other similar studies, is that the specificity of the models is lower than their sensitivity; that is, the models are more successful in identifying patients than healthy individuals. The reason could be that these studies have not used suitable risk factors in sufficient numbers. Also, in the present research, feature selection produced a model with four risk factors (age, sex, diabetes and high blood pressure) at an accuracy rate of 73.26%, which differs only slightly from the final model's accuracy of 73.94%. According to the researchers' review, some studies used ROC curve analysis as the main measure for evaluating the proposed models, while in others the model's accuracy over the whole data set was the prime evaluation criterion. Bearing in mind that accuracy alone is not a suitable criterion for evaluating a model, the current study used ROC curve analysis as the best criterion for evaluating and generating the intended models.
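As a reminder of what the AUC criterion measures, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen patient receives a higher model score than a randomly chosen healthy subject, with ties counted as one half. The labels and scores are made-up illustration values, not data from this study, and the pairwise loop is quadratic in the number of cases, so it is meant for explanation rather than for large data sets.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = CAD present, 0 = CAD absent.
example_labels = [1, 1, 0, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
example_auc = roc_auc(example_labels, example_scores)
```

Unlike accuracy, this quantity does not depend on a single classification threshold or on the proportion of patients in the sample, which is why it is the preferred criterion when the classes are unbalanced.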

Conclusion
Comparison of the results of ANN and GEP showed no significant difference between the two models, although the latter (i.e., GEP) was easier to present and interpret and more convenient to convert into a programming language. The model obtained for coronary artery disease in this study was therefore created using the gene expression programming technique; it includes the risk factors age, sex, obesity, abdominal obesity, family history, smoking, high blood fats and diabetes. The model has an accuracy of 73.94%, a specificity of 31.43% and a sensitivity of 93.29%. The study's limited access to suitable risk factors in sufficient numbers has possibly affected the model's accuracy. Some research studies have produced models with high accuracy rates by combining physical examination, electrocardiography, imaging and stress tests with the risk factors for coronary artery disease. This shows that diagnostic tests before angiography could be very effective in obtaining more accurate models. The current research used the classification and regression tree technique and the stepwise backward elimination method for feature selection, producing a model with four risk factors (age, sex, diabetes and hypertension) and an accuracy rate of 73.26%, which differed only slightly from the final model. The model presented in this study correctly identified 390 of the 1242 subjects who were free of CAD; it failed, however, to diagnose 183 of the 2727 patients suffering from coronary artery disease. This indicates that such models still need to be developed and improved before they can make a better distinction between patients and non-patients.
Given the importance of parametric methods, such as logistic regression analysis, and the development of ensemble methods, the authors recommend new comparative studies in line with the objectives of the current work.
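The accuracy, specificity and sensitivity quoted in this Conclusion can be re-derived from the validation counts it reports. The short sketch below does that arithmetic; the function and variable names are illustrative, and it assumes that the 183 undiagnosed patients are the false negatives and that the 390 correctly identified CAD-free subjects are the true negatives.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),  # share of patients correctly flagged
        "specificity": tn / (tn + fp),  # share of healthy subjects correctly cleared
        "accuracy": (tp + tn) / total,
    }


# Counts taken from the Conclusion: 183 of 2727 patients were missed and
# 390 of 1242 CAD-free subjects were correctly identified.
diseased, missed = 2727, 183
healthy, cleared = 1242, 390

metrics = classification_metrics(
    tp=diseased - missed, fn=missed, tn=cleared, fp=healthy - cleared
)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The values obtained this way agree with the reported figures to within a few hundredths of a percentage point, and they make the imbalance explicit: most patients are detected, but more than two thirds of the CAD-free subjects are flagged as diseased.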

Authorship
Issa Mahmoud collected the data, carried out statistical analyses and interpreted the data. Hamid Moghaddasi proposed the topic, designed the study, and formulated the research problem. Samad Sajjadi revised the article critically and provided numerous insightful comments.

Funding
The authors received no specific funding from any organization or research body.