Analyzing the impact of feature selection on the accuracy of heart disease prediction

Heart disease is one of the most serious diseases affecting human life and has emerged over the last decade as one of the leading causes of mortality across the globe. Accurate and timely diagnosis is essential to prevent further harm to patients. Recent years have seen growing use of non-invasive approaches in medicine, such as techniques based on artificial intelligence. In particular, machine learning provides several algorithms and techniques that are widely used and highly useful for diagnosing heart disease accurately and in less time. However, predicting heart disease is not an easy task. The increasing size of medical datasets makes it difficult for practitioners to understand complex feature relations and make disease predictions. Accordingly, the aim of this research is to identify the most important risk factors in a high-dimensional dataset, which helps classify heart disease accurately with fewer complications. For a broader analysis, we used two heart disease datasets with various medical features. The classification results of the benchmarked models showed that relevant features have a strong impact on classification accuracy. Even with a reduced number of features, the performance of the classification models improved significantly, with a shorter training time, compared with models trained on the full feature set.


Introduction
Heart disease is rapidly increasing across the globe. According to a report published by the World Health Organization (WHO), approximately 17.90 million people died from heart disease in 2016 [1]. This figure accounts for approximately 30% of all deaths worldwide. Nearly 55% of heart patients die within the first 3 years, and the treatment costs for heart disease are around 4% of annual healthcare expenditure [2]. Given these rising figures, accurate and timely detection and treatment of this serious illness is essential for disease prevention and effective use of medical resources.
Owing to recent technological advancements, the field of medical sciences has seen remarkable improvement over time [3,4]. In particular, machine learning (ML) has been widely used in cardiovascular medicine and has established a firm place in the field [5]. The basic framework of ML is built on models that take input data (such as text or images) and, through statistical analysis and mathematical optimization, provide the desired prediction results (e.g., disease, no disease, neutral) [6]. ML models can be trained on large volumes of raw electronic medical data gathered from low-cost wearable devices to allow efficient heart disease diagnosis with fewer resources and improved accuracy [7].
During the training process, ML models require a large number of data samples to avoid overfitting [8]. However, including a large number of data features is undesirable because of the curse of dimensionality [9,10]. Medical datasets mostly contain relevant as well as redundant features. Unnecessary features do not contribute any meaningful information to the prediction task and also create noise in the description of the target (output class), which leads to prediction errors [11]. Furthermore, such features increase the complexity of ML models and make the system run slowly due to increased training time. To overcome the curse of dimensionality, only those features that are closely related to the target should be identified in the datasets and provided as inputs to ML models [12]. Relevant feature selection can improve performance by decreasing model complexity and increasing prediction accuracy, which is very important in medical diagnosis [13]. Because of these benefits, feature selection techniques are being actively used in the area of heart disease and stroke [14,15,16].
The contributions of this research are as follows:
• Two datasets of heart disease patients from different sources are used to cover a broader range of medical features.
• A correlation and interdependence study is performed between the different features in the datasets with respect to heart disease.
• The most relevant medical features for predicting heart disease are identified using a filter-based feature selection technique.
• Different ML classification models, such as Logistic Regression (LR), Decision Tree (DT), Naive Bayes (NB), Random Forest (RF) and Multi Layer Perceptron (MLP), are applied to the datasets to identify the models best suited to the problem.
• The classification models are tested on the full as well as the reduced feature set to observe the impact of feature selection on model performance.
• In the spirit of reproducible research, the code for this article is shared on GitHub.

Related Work
ML has proven to be an effective technique for assisting in heart disease diagnosis; however, the high dimensionality of datasets is a fundamental issue for ML prediction models. Feature selection is a technique used to select only the most relevant features from a dataset, that is, the features that most influence the disease outcome. Identifying the most important features in high-dimensional datasets can improve the accuracy of prediction models and hence reduce the number of medical injuries.
In [17], Zhang et al. developed an efficient feature selection technique called weighting- and ranking-based hybrid feature selection (WRHFS) to determine the risk of stroke. For the weighting and ranking of features, WRHFS used a variety of filter-based feature selection techniques such as Fisher score, information gain and standard deviation. The proposed technique selected 9 important input features out of 28 based on the knowledge provided for stroke prediction. In another study [18], the authors worked on extracting relevant risk factors from a large feature space for efficient heart disease prediction. The features were selected based on their individual ranks.

The authors used the Infinite Latent Feature Selection (ILFS) method, a probabilistic latent graph-based feature selection technique, to rank the features.
The results of the model were competitive using only half of the features from the set of 50. In [19], a feature selection model for detecting the risk of heart disease is proposed. The proposed model combined the glow-worm swarm optimization algorithm based on the standard deviation of the features to extract quality features from an electronic healthcare record (EHR) of a community hospital in Beijing. Six features, including high blood pressure, Alkaline Phosphatase (ALP), age and Lactate Dehydrogenase (LDH), were indicated as important for detecting stroke, excluding family hereditary factors. The authors of [20] focused on finding the most relevant features in EHR data to predict the early-stage risk of death from heart disease. They used the minimum redundancy maximum relevance (mRMR) and recursive feature elimination (RFE) feature selection approaches based on NB. Two medical features, Serum Creatinine and Ejection Fraction, were ranked higher than the others by both feature selection techniques. When provided as input to a prediction model, the selected features proved to be the most important, as an overall accuracy of 80% was achieved.
Singh et al. [21] proposed an efficient approach for stroke prediction using [...]. A feature selection algorithm is used in [22] to select the most significant features for detecting heart disease.
The proposed feature selection algorithm identifies 7 features out of 16 for detecting heart disease in the Cleveland heart disease dataset. The resulting features were supplied to a support vector machine (SVM) for accuracy evaluation.
The classifier achieved 88.34% accuracy using the reduced feature set, whereas only 83.34% was achieved using all dataset features. In terms of the ROC curve, the GA-SVM also performed well compared with various existing feature selection algorithms. The study in [23] proposes a new heart disease prediction model that combines ML with deep learning techniques.
The least absolute shrinkage and selection operator (LASSO) penalty method based on LinearSVC was applied as the feature selection module to generate a feature subset closely related to the target. The 12 most relevant features were chosen from a dataset obtained from Kaggle and fed to the MLP network.
According to the experimental results, the proposed model obtained an accuracy of 98.56% with 99.35% recall and 97.84% precision. In [5], an ML-based heart disease diagnosis system is proposed. Seven popular classifiers, LR, k-Nearest Neighbor (K-NN), MLP, SVM, NB, DT and RF, were used to classify heart disease patients. Three feature selection algorithms, RelieF, mRMR and LASSO, were used to select the features most highly correlated with the target class. The classification performance of the models improved in terms of accuracy and computation time when the feature selection techniques were used. The LR model showed the best accuracy of 89% when used with RelieF.
The main objective of the research in [24] was to predict heart disease using a minimal subset of features with adequate accuracy. To achieve this objective, the authors employed a two-stage feature subset retrieval technique. Three popular feature selection techniques (embedded, filter and wrapper) were used to extract a feature subset based on a Boolean process-based common "True" condition. To select a suitable prediction model, RF, SVM, K-NN, NB, XGBoost and MLP models were trained on the data. The experimental results showed that the XGBoost classifier integrated with the wrapper technique provided the best prediction results for heart disease. A comparative analysis of different classifiers was performed in [25] for the classification of heart disease with minimal attributes. ML classifiers such as NB, LR, sequential minimal optimization (SMO) and RF were trained for the accurate detection of heart disease. To obtain the optimal feature subset, RelieF, chi-squared and correlation-based feature subset evaluators were utilized. Ten features were selected from the set of 13 to train the classifiers. The SMO classifier achieved the highest accuracy of 86.468% when given the optimal feature set obtained by the chi-squared feature selection technique.
Despite their relevance, one major drawback of existing works on heart disease prediction is the lack of systematic guidance for selecting the input features used to develop prediction models, which is an important aspect of predictive performance. Previous proposals mostly chose features in an ad hoc manner without incorporating the latest medical research findings. The focus is mostly on the prediction models and their final prediction performance, whereas very little attention is paid to the correlation between different medical features and their individual importance in predicting heart disease. A few works present an analysis of medical features, but only for the purpose of heart disease detection. This research aims to address the ineffective feature selection in previous studies on heart disease prediction. Two heart disease patient datasets collected from different sources were used to cover a broader range of features related to heart disease and to identify various medical procedures. To further analyze the role of each parameter in the prediction task, we obtain the interdependence and importance of the collected set of medical features. A detailed analysis of ML models trained on both the full and the selected feature set is provided to assess the impact of feature selection techniques on prediction performance and to identify suitable classifiers for the specified problem.

Proposed Methodology
This research paper highlights the importance of feature selection in the accurate classification of heart disease. Figure 1 shows the workflow of the proposed methodology for heart disease prediction.

Datasets
In this research, two datasets, cardiovascular disease (CVD) and Framingham, were used to study the impact of different features on the occurrence of heart disease and to develop an ML-based system for heart disease detection. Using two datasets allows a broader study of medical features and of the various clinical pathways used for the detection of stroke.
The datasets were collected from different sources. They contain some main medical features such as 'age', 'hypertension', 'glucose levels', 'blood pressure' and 'cholesterol', which are closely related to the occurrence of disease and provide great flexibility for heart disease analysis. The datasets were chosen based on two criteria. The first criterion was variation in the medical procedures, so as to study the different procedures and the role of each feature in the context of heart disease. The second criterion was data availability. Datasets from different sources contain different amounts of data and different collections of features, so we chose datasets that offer a good volume of data and a degree of similarity in terms of features.

CVD
The CVD dataset is maintained by McKinsey & Company and was part of their healthcare hackathon. The dataset is accessible from a free dataset repository. It includes 29,072 patient observations with 12 data features. Eleven of them are common clinical symptoms and are treated as input features, whereas the 12th feature, 'stroke', is the target feature indicating whether a patient has had a stroke or not. The complete description of the data features of the CVD dataset is given in Table 1.

Framingham
The Framingham dataset was created during an ongoing cardiovascular study involving the residents of Framingham, Massachusetts, and is available on the Kaggle website. The dataset is mostly used in classification tasks to identify whether a patient is at risk of developing coronary heart disease (CHD) within 10 years. It contains 4,240 patient records and 15 features, where each feature indicates a risk factor. Fourteen input features were used to predict the decisional feature, i.e., the 10-year risk of CHD. Table 2 describes the data features of the Framingham dataset.

Pre-Processing
Data pre-processing is an important part of the ML life cycle, as it makes data analysis easier and increases the accuracy and speed of ML algorithms [26]. We applied some pre-processing steps as the collected datasets were having [...]. Categorical features were encoded as integers, for example smoking_status ("never smoked": 0, "formerly smoked": 1, "smokes": 2) and stroke ("yes": 1, "no": 0). [...] [28,29]. However, only a deep knowledge of the specific disease is likely to aid in selecting suitable data imputation methods. In view of this, we dropped all observations with null values from both datasets to avoid any accuracy biases.
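The two steps described here, dropping incomplete observations and encoding categorical values, can be sketched as follows. This is a minimal illustration that assumes each observation is held as a plain Python dictionary with missing values stored as None. The integer codes for smoking_status and stroke are the ones given in the text; the helper name `preprocess` and the sample records are hypothetical and are not taken from either dataset.

```python
# Integer codes as given in the text; everything else is illustrative.
SMOKING_CODES = {"never smoked": 0, "formerly smoked": 1, "smokes": 2}
STROKE_CODES = {"yes": 1, "no": 0}


def preprocess(records):
    """Drop records with any missing value, then encode categorical fields."""
    cleaned = []
    for rec in records:
        # A missing value is represented as None in this sketch.
        if any(value is None for value in rec.values()):
            continue
        encoded = dict(rec)
        encoded["smoking_status"] = SMOKING_CODES[rec["smoking_status"]]
        encoded["stroke"] = STROKE_CODES[rec["stroke"]]
        cleaned.append(encoded)
    return cleaned


# Hypothetical sample records, not taken from the CVD dataset.
sample = [
    {"age": 67, "smoking_status": "formerly smoked", "stroke": "yes"},
    {"age": 45, "smoking_status": None, "stroke": "no"},
    {"age": 52, "smoking_status": "smokes", "stroke": "no"},
]
clean = preprocess(sample)
```

Dropping incomplete rows before encoding keeps the lookup tables small, since no code has to be reserved for a missing category.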
Furthermore, looking at the class distribution, both datasets were highly unbalanced. The unbalanced nature of the datasets leads to classification errors during the training of ML models [30]. As a result, we adopted a 'Random Down-Sampling' technique to mitigate the adverse effects caused by unbalanced data. We defined two classes, referred to as the 'minority' and 'majority' classes. Patients with heart disease were placed in the minority class, whereas patients with no symptoms were placed in the majority class. In the case of the CVD dataset, 548 observations fell into the minority class and the remaining 28,524 formed the majority class. We created a balanced dataset of 1,096 observations by selecting all 548 observations from the minority class and 548 random observations from the 28,524 majority cases.
The same process was applied to the Framingham dataset, where 557 random observations were drawn from the 3,101 majority cases, giving a balanced dataset of 1,114 observations. In this way, two balanced datasets were created to study feature importance and disease classification efficiently.
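The random down-sampling step described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function name `downsample`, the record layout and the fixed seed are hypothetical, and the synthetic records only reproduce the CVD class counts reported in the text.

```python
import random


def downsample(records, label_key, seed=0):
    """Balance a binary dataset by randomly down-sampling the majority class."""
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    # Keep every minority record and an equally sized random majority sample.
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Synthetic stand-in with the CVD class counts reported in the text.
records = [{"id": i, "stroke": 1} for i in range(548)]
records += [{"id": 548 + i, "stroke": 0} for i in range(28524)]
balanced = downsample(records, "stroke")
print(len(balanced))  # 1096
```

Because the majority sample is drawn at random, a different seed gives a different balanced subset, so results obtained on it may vary slightly between runs.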

Feature Correlation Analysis
Feature correlation helps in understanding the underlying relationships between the various data features present in a dataset. It can be useful in many ways, such as determining the interdependencies between data features and how each feature affects the output feature [31]. As per medical research findings, major changes can be observed in the heart and blood vessels with aging. For example, the heart cannot beat as fast during physical activity as it could at a younger age. According to the National Heart, Lung, and Blood Institute, these age-related changes may raise a person's risk of heart disease [32]. Hypertension is an established risk factor for stroke, ischemic heart disease and renal dysfunction [33]. Hypertension raises the blood pressure above the normal range. Higher blood pressure makes the arteries less elastic and decreases the flow of oxygen and blood to the heart, which can potentially lead to heart disease. Diabetic patients are more likely to develop heart disease at an earlier stage.
High blood glucose from diabetes causes stronger contraction of the blood vessels, which leads to heart disease [34].
Over time, this process can lead to a stroke.
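A correlation study of this kind can be sketched as below. The text does not state which correlation coefficient was used, so the Pearson coefficient is an assumption here; the function names and the toy columns are hypothetical and do not contain values from the CVD or Framingham data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_with_target(columns, target):
    """Correlation of each feature column with the target, strongest first."""
    scores = {name: pearson(values, columns[target])
              for name, values in columns.items() if name != target}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy columns, not values from either dataset.
columns = {
    "age": [40, 50, 60, 70, 80, 45],
    "hypertension": [0, 0, 1, 1, 1, 0],
    "bmi": [22.0, 31.0, 25.0, 28.0, 24.0, 27.0],
    "stroke": [0, 0, 1, 1, 1, 0],
}
ranking = correlation_with_target(columns, "stroke")
```

Sorting by absolute value treats strong negative and strong positive relationships alike, which is usually what is wanted when screening features against a target.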

Feature Selection
The main motivation of this research is to select the medical features that can improve the accuracy of heart disease prediction. Feature selection is the process [...] where $N$ is the overall sample size, $S$ is the number of groups, $j_i$ is the number of observations in the $i$th group, $\bar{K}_i$ is the $i$th group sample mean, $\bar{K}$ is the overall mean of the data, and $K_{ip}$ is the $p$th observation in the $i$th out of $S$ groups.
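The symbols defined above match those of the one-way ANOVA F-statistic, a common filter-based score. Assuming that this is the score intended (the equation itself does not appear in this section), it would read

$$F = \frac{\sum_{i=1}^{S} j_i \left(\bar{K}_i - \bar{K}\right)^2 / (S - 1)}{\sum_{i=1}^{S} \sum_{p=1}^{j_i} \left(K_{ip} - \bar{K}_i\right)^2 / (N - S)}$$

The sketch below computes this score for each feature and ranks the features by it. The function names and the toy data are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for one feature split into class groups."""
    n_total = sum(len(g) for g in groups)              # N
    n_groups = len(groups)                             # S
    overall_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group variability, weighted by group size.
    between = sum(len(g) * (m - overall_mean) ** 2
                  for g, m in zip(groups, group_means)) / (n_groups - 1)
    # Within-group variability around each group mean.
    within = sum((x - m) ** 2
                 for g, m in zip(groups, group_means)
                 for x in g) / (n_total - n_groups)
    return between / within


def rank_features(columns, labels):
    """Score every feature by its F-statistic against the class labels."""
    classes = sorted(set(labels))
    scores = {}
    for name, values in columns.items():
        groups = [[v for v, y in zip(values, labels) if y == c]
                  for c in classes]
        scores[name] = anova_f(groups)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data, not values from either dataset.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "age": [41, 45, 50, 62, 68, 71],
    "bmi": [24.0, 30.0, 27.0, 26.0, 29.0, 25.0],
}
ranking = rank_features(columns, labels)
```

A larger F value means the class means of a feature differ more relative to the spread within each class, so the highest-ranked features would be kept. scikit-learn provides the same score as `sklearn.feature_selection.f_classif`, which may be more convenient in practice.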

Evaluation Metrics
We used three popular performance evaluation metrics, Accuracy, F1-score and ROC, to evaluate the performance of the ML classification models [38]. A confusion matrix is a table that helps ML practitioners describe the performance of a classification model. It consists of four counts, true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), which are used to determine the performance metrics of a classifier [...]. The true positive rate (TPR) and false positive rate (FPR) are given by

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$
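The metrics named in this section can all be derived from the four confusion-matrix counts. The sketch below uses the TPR and FPR formulas given above, together with the standard definitions of accuracy and F1-score, which are assumed here because their formulas do not appear in this section. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # identical to the true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)             # false positive rate
    return {
        "accuracy": accuracy,
        "precision": precision,
        "tpr": recall,
        "f1": f1,
        "fpr": fpr,
    }


# Hypothetical counts for a test set of 200 patients.
metrics = classification_metrics(tp=80, fp=30, tn=70, fn=20)
print(metrics["accuracy"], metrics["tpr"], metrics["fpr"])  # 0.75 0.8 0.3
```

A ROC curve is obtained by computing the (FPR, TPR) pair at each decision threshold of a classifier and plotting TPR against FPR.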

Results and Discussions
In this section, we discuss the performance of the selected classification models from different perspectives. First, we checked the performance of each model individually on both datasets with the full feature set to examine which models work well for each dataset. Second, we evaluated the performance of the models on the selected set of features to analyze the effect of the feature selection technique on the accuracy of the classifiers. Classifier performance was assessed using the Accuracy, F1-score and ROC evaluation metrics.

Classification results using full feature set
In this section, all the ML models were tested on both datasets using the full set of features to predict the binary disease outcome. We trained all the prediction models on the entire data, split into 80% training and 20% testing subsets. The overall computational speed during the training of the prediction models was 10.98 iterations per second (it/s) for the CVD dataset and 24.20 it/s for the Framingham dataset. Tables 3 and 4 show the binary classification results of the ML models in predicting heart disease for both datasets.
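The 80/20 split mentioned above can be sketched as follows. The shuffling, the fixed seed and the function name are assumptions made for illustration, and the size of the balanced CVD dataset is used only as an example input.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and split them into training and testing subsets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records, one integer per observation.
records = list(range(1096))
train, test = train_test_split(records)
print(len(train), len(test))  # 877 219
```

Keeping the test subset untouched during training is what allows the reported scores to reflect performance on unseen patients.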
Looking at the classification results listed in Table 3 [...] [39,40]. However, any data manipulation strategy in medical studies may introduce significant biases, which is why we kept all the feature values unchanged.

Classification results using reduced feature set
Given the goal of identifying the potential bio-markers and to analyze the [...] reduces the size of the feature space, but it also improves the performance of ML models in various aspects.

Conclusion and Future Works
Heart disease is among the most fatal diseases; it is rapidly increasing and has become one of the leading causes of death around the world. [...] performed with the full as well as the reduced feature sets to analyze the effect of the selected features on the prediction accuracy of various ML prediction models. Using the full feature set, the highest accuracy achieved was 0.73 for the CVD dataset and 0.66 for the Framingham heart disease dataset. With the reduced feature set, the accuracy increased to 0.75 and 0.71, respectively. The analysis showed that, even after limiting the number of features, ML models performed better than models using the full feature set.
The experimental results reveal that, by employing a feature selection technique, heart disease can be classified accurately even with a small number of features and in less time. We conclude that feature selection retains only the most important features related to heart disease, which reduces computational complexity and improves the accuracy of prediction models. In future work, we intend to enhance prediction accuracy by using a wide combination of ML and deep learning models [43] to obtain the best feasible model for heart disease diagnosis.
We will also benchmark our analysis on additional datasets, and we will try to use more than one feature selection technique to obtain feature subsets that are more closely aligned with medical studies.