A stacking classifiers model for detecting heart irregularities and predicting Cardiovascular Disease

in two levels (base level and meta level). Various heterogeneous learners are combined to produce strong model outcomes. The model obtained 92% accuracy in prediction, with a precision of 92.6%, sensitivity of 92.6%, and specificity of 91%. The performance of the model was evaluated using various metrics, including accuracy, precision, recall, F1-score, and area under the ROC curve.


Introduction
Data analytics merged with the power of Machine Learning (ML) has attracted a lot of attention across various domains due to its problem-solving ability. ML has diverse applications throughout these domains, such as speech recognition, medicine, business, social media, etc. Many breakthroughs have been driven by machine learning's use of neural networks, referred to as deep learning, a set of algorithms that enables the discovery of patterns and insights in large datasets. These techniques and frameworks can be deployed for information extraction, predictions, representation learning, outcome predictions, and de-identification.
In the healthcare sector, there are numerous areas in which ML has proven to be very beneficial. Considering the exponential growth of digital real-time information generated by the healthcare sector (e.g., Electronic Health Records (EHRs), wearable devices, diagnostic reports, etc.) [1], it is pertinent to develop smart systems to process such data. Among the risk factors for heart disease, lack of exercise, diabetes, and blood disorders are some of the contributors. Patients who may be at risk of heart disease should be referred to cardiologists to determine the best-fit treatment to prevent any undesired events as soon as possible [5]. Traditionally, health practitioners have utilized various tests, such as blood work, electrocardiogram (ECG), echocardiogram, angiography, testing for diabetes, blood pressure, etc., and then screened the results. However, this screening can be a tedious process if a doctor has to go through hundreds of such reports. Also, the diagnosis and further treatment are costly and time-consuming. To reduce the time spent manually checking the huge amount of data, we propose to incorporate ML algorithms to perform the data analytics, which would cut down the processing time significantly and allow cardiologists to spend more time preparing treatment plans.
ML algorithms are efficient at detecting patients who might be at risk from an early stage, which in turn helps to reduce the overall costs of treatment [6]. Various machine learning classifiers are used for prediction as well as regression tasks. In the healthcare industry, they need to be reliable and show good performance with respect to medical data [7]. The main objective of these algorithms is the timely detection of cardiovascular disease before severe complications arise. Misclassifying a patient with heart disease as negative has much higher consequences than misclassifying a healthy patient as having the disease [8]. Several studies report the use of machine learning-driven methods in achieving predictions and significantly reducing the cost of healthcare. With the amount of electronically available medical data, we aim to deliver a better quality of healthcare service. As the world gradually moves towards complete digitization, computing devices are used to take notes and perform documentation, so the issue of data availability is expected to diminish in a few years. Therefore, it is pertinent to develop the proper tools to address the need for analytics with ML to improve the quality of human health.
Various studies have implemented ML techniques for the diagnosis of heart diseases. Traditional classifiers have been shown to perform well with proper model generation. The performance of such classifiers can be improved by implementing various techniques [9]. In the work described in this paper, the performance of various algorithms is improved by the implementation of the stacking technique. Specifically, in the concept of stacking, training is performed in level 1 with traditional ML classifiers, and then the output is fed to the next level, also known as a meta-level [10][11][12].
In subsequent sections of this paper, we discuss the proposed workflow, pre-processing of data, model generation, and the performance evaluation of the proposed workflow followed by a conclusive discussion and future work.

Literature review
Numerous studies have demonstrated the effective application of ML models in the detection of heart diseases. The UCI Heart Disease Dataset from UCI Machine Learning Repository is open to the public and is one of the most used datasets in this research area [13]. The Statlog dataset is also widely used [14]. In the clinical detection of diseases, such ML models aim to improve accuracy and reduce the total cost of the computation. For example, Verma et al. [15] proposed a hybrid model using particle swarm optimization (PSO) and two machine learning classifiers, namely K-nearest neighbor and multi-layer perceptron (MLP), for the prediction of heart disease, which achieved a 90.28% accuracy. Aakash Chauhan et al. (2018) introduced a model that extracts data from EHRs based on association rule generation and utilizes ML association mining for frequent pattern growth in a dataset. The model helps to achieve an overall outlook of a patient's data and underlying patterns in the dataset [16].
Saqlain et al. [17] developed a model using the Fisher score algorithm for feature selection and an SVM classifier for the prediction model, which achieved an accuracy of 81.91%, sensitivity of 72.92%, and specificity of 88.68%. Latha and Jeeva [18] designed a hybrid model that implements four ML classification algorithms, namely NB, BN, RF, and MP, and incorporates various ensemble learning methods, obtaining an accuracy of 85.48%. Beunza et al. (2019) demonstrated how machine learning methods can be of great use for diagnosis with small datasets. Using R Studio for the computations, they tested various methods such as decision trees (DTs), boosted DTs, random forest, support vector machine (SVM), neural networks (NN), and logistic regression; the highest accuracy obtained was 85%, by boosted decision trees [19]. Subrat Kumar Nayak et al. (2020) emphasize the use of feature selection. Their work includes 23 datasets, one of which is the heart disease dataset. Using filter methods, a subset of 13 features was chosen, followed by 10-fold cross-validation [20]. Liyuan Gao et al. (2020) proposed sampling and substitution methods for the Bayesian hyper-parameter optimization technique and then compared various ML classifiers to detect irregularities, obtaining 94% accuracy for breast cancer and 73.40% for heart disease [21]. Ivan Miguel Pires et al. (2021) experimented with multiple classifiers, such as SVM, KNN, DT, neural networks, and the combined nomenclature (CN2) rule inducer. All the selected classifiers underwent 5-fold, 10-fold, and 20-fold cross-validation; the best accuracy score of 87.69% was obtained by DT, SVM, and SGD at 20, 10, and 5 folds, respectively [22].

Materials
Fig. 1 presents the workflow of the model, where data are initially acquired from the source, converted into a dataset, and then preprocessed. The model generation subsequently occurs, followed by analysis of the results. Each step of the model is discussed in detail in subsequent sections.

Dataset description
In this research, the UCI Heart Disease Dataset from the UCI Machine Learning Repository, an open dataset available online, was selected. It is a combination of 4 datasets collected from the Cleveland Clinic Foundation, the Long Beach Medical Center, the Hungarian Institute of Cardiology, and the University Hospital, Zurich, Switzerland.
The dataset comprises 303 instances with a total of 76 attributes. In this study, only 13 predictor attributes and one target attribute were taken into consideration. Table 1 describes the attributes of the UCI dataset, specifically 8 categorical and 6 numeric attributes. The dataset is a combination of different clinical test result data, such as serum cholesterol, fasting blood sugar, vessel count, and thalassemia detected from blood work; ST depression and the slope of the ST segment were obtained from the electrocardiogram.

Pre-processing of the dataset
In this study, the initial step of data pre-processing was outlier detection. To improve the model's performance, Z-score outlier detection was used. Based on the empirical rule, a data point is considered an outlier in a distribution when its z-score is greater than 3. A z-score, or standard score, represents how far a data point lies from the mean in units of standard deviation, indicating the variability of an attribute's value in a dataset.
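As a concrete illustration of this step, the sketch below drops rows whose z-score exceeds 3. The column name and values are illustrative stand-ins, not the actual UCI records:

```python
import pandas as pd

def remove_zscore_outliers(df, columns, threshold=3.0):
    """Drop rows where any of the given numeric columns has |z| > threshold."""
    z = (df[columns] - df[columns].mean()) / df[columns].std()
    return df[(z.abs() <= threshold).all(axis=1)]

# Illustrative data: 'chol' mimics serum cholesterol with one extreme entry.
df = pd.DataFrame({"chol": [233, 250, 204, 236, 192, 294, 263, 199, 168, 239,
                            275, 211, 283, 226, 247, 254, 218, 230, 242, 2000]})
cleaned = remove_zscore_outliers(df, ["chol"])  # the 2000 row is removed
```

Note that with very few rows a single extreme value inflates the standard deviation itself, so the z-score rule is most reliable once the sample is reasonably large.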
For categorical attributes, such as sex, chest pain (cp), resting electrocardiograph (restecg), and slope of the ST segment (slope), one-hot encoding was applied [23]. In this method, the attribute is converted into a numerically interpretable form for better adaptability and performance with machine learning algorithms.
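A minimal sketch of one-hot encoding with pandas, using toy values for the categorical attributes named above (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with the paper's categorical attributes (values illustrative).
df = pd.DataFrame({
    "sex": [1, 0, 1],
    "cp": [0, 2, 3],        # chest pain type
    "restecg": [0, 1, 0],   # resting ECG result
    "slope": [2, 1, 0],     # slope of the ST segment
})

# Each category value becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["cp", "restecg", "slope"])
```

The three cp values, two restecg values, and three slope values expand into eight indicator columns alongside the untouched sex column.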
For feature scaling, two datasets were pre-processed for preliminary analysis. For the first dataset, attribute values were standardized using the standard scaler. For the second dataset, values were normalized using the min-max scaler, which maps values into the range 0 to 1, where 0 corresponds to the minimum value of the attribute, 1 to the maximum, and all other values become decimals between 0 and 1.
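The two scalings can be sketched with scikit-learn; the single feature below is an illustrative stand-in for one of the numeric attributes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[120.0], [140.0], [160.0], [180.0]])  # e.g. resting blood pressure

mm = MinMaxScaler().fit_transform(X)   # min-max: maps min -> 0, max -> 1
ss = StandardScaler().fit_transform(X) # standard: zero mean, unit variance
```

Each scaler learns its parameters (min/max or mean/std) from the training data only; applying the same fitted scaler to the test split avoids information leakage.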

Correlation heatmap of the dataset
Visualization of the dataset is an important part of the pre-processing step. Various methods of visualization give a broad overall picture of the dataset. Graphs such as bar graphs, charts, histograms, density estimate plots, etc., provide a visual representation of the data for analysis. The correlation heatmap in Fig. 2(a) depicts how the attributes of the chosen dataset correlate with the target attribute (the attribute that denotes whether a person has heart disease or not). The matrix represents the correlation coefficients of all pairs of attributes. Heatmap visualization is a 2-dimensional representation of this matrix, where the color intensity of each cell encodes the strength of the corresponding correlation.
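The correlation matrix underlying such a heatmap can be computed directly; a minimal sketch with a few illustrative rows (not the full UCI dataset):

```python
import pandas as pd

# Toy frame; the real analysis uses all 13 predictors plus the target.
df = pd.DataFrame({
    "age":     [63, 37, 41, 56, 57],
    "thalach": [150, 187, 172, 178, 163],  # maximum heart rate achieved
    "target":  [1, 1, 1, 1, 0],
})

corr = df.corr()  # symmetric matrix of pairwise Pearson coefficients
# corr["target"] ranks each attribute's linear association with the label;
# plotting this matrix (e.g. with seaborn.heatmap) yields Fig. 2(a)-style output
```
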

Model building
After the data are pre-processed, the next step is to generate the model. The proposed work primarily focuses on the ML method of stacking, as shown in Fig. 2(b), in which various machine learning classifiers are combined in two levels, which generates higher predictive performance. Level 1, also known as the "base level" or "base learners", contains the set of traditional ML algorithms. The second level, known as the "meta level" or "meta learner", takes its input from the former level. The main advantage of stacking algorithms in two levels is the utilization of the heterogeneous nature of multiple classification algorithms. This heterogeneity is where weak learners prove to be essential because of their diverse nature. Every classifier comes with certain strengths and drawbacks, and stacking helps to combine the best scenarios from the chosen classifiers. At the base level, various classifiers fit the training data and give predictions. Then, the meta level learns the best way to combine the strengths of each classifier and produce the final optimal prediction. In addition, Logistic Regression (LR) and Naive Bayes (NB) were used in the preliminary experimental steps. As the number of classifiers used at the base level affects the overall performance, we selected 10 of the above-mentioned classifiers. Having a diverse set of base learners is essential, as they produce results based on different assumptions.
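The two-level idea can be sketched with scikit-learn's StackingClassifier. For brevity this sketch uses only three of the base learners and a synthetic stand-in for the 303-row, 13-feature dataset; all parameter choices are illustrative, not the paper's tuned configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the UCI data: 303 rows, 13 features, binary target.
X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Level 1: a (reduced) set of heterogeneous base learners.
base_learners = [
    ("rf", RandomForestClassifier(random_state=42)),
    ("knn", KNeighborsClassifier()),
    ("svc", SVC(probability=True, random_state=42)),
]

# Level 2: MLP meta-learner combines the base learners' predictions.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=MLPClassifier(max_iter=2000, random_state=42),
    cv=10,  # out-of-fold base predictions feed the meta level
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```

Internally, cv=10 ensures the meta-learner is trained on out-of-fold base-level predictions rather than predictions on data the base learners have already seen.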
After data were pre-processed, 10-fold cross-validation of the training data was performed to avoid model over-fitting. The total number of entries was divided into 10 sections, also known as folds, after reshuffling the data to avoid biased predictions. In every step of cross-validation, a particular fold was treated as test data and the rest as training data, for a total of 10 iterations. Due to the limited number of instances available in the dataset, cross-validation was used for performance comparison. This reduces the bias and variance of the performance estimate and shows more clearly how well a particular algorithm performs with the chosen dataset.
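The procedure can be sketched with scikit-learn's KFold and cross_val_score; logistic regression and synthetic data stand in here for the actual classifiers and dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=303, n_features=13, random_state=0)

# shuffle=True reshuffles the rows before splitting into the 10 folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Each fold serves once as test data; results are reported as mean ± sd.
mean_acc, std_acc = scores.mean(), scores.std()
```
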
For both datasets, averaged scores of accuracy, precision, and recall were recorded along with their standard deviations. The comparison is shown in Table 2. Specifically, 75% of the data was used for training and 25% for testing the models. The 10 ML classifiers selected as base learners were RF, MLP, KNN, ET, XGB, SVC, SGD, ADB, CART, and GBM.
For the meta-learner level, various classifiers were tested, and their performance scores are provided in Table 3. The Multi-Layer Perceptron (MLP) classifier was selected as the meta-learner: owing to its adaptive learning capability, it can be trained in real time and is well suited to non-linear data, a good fit for this classification problem. GBM and MLP performed the same in terms of accuracy, but GBM produced more false-positive predictions, as shown in Fig. 3(a) and (b). Therefore, MLP is considered a better fit for this scenario, where misclassification of a positive class is undesirable in clinical diagnosis.

Results and discussion
The above-mentioned procedure was performed on a 64-bit machine with a 4th Gen Intel i5 CPU, 8 GB DDR3 RAM, a 1 TB hard drive, and a 20 GB SSD. Python was chosen as the language for the machine learning tasks, run in a Jupyter notebook with Python 3.7.2. Considering the nature of medical data, we tested and recorded multiple performance metrics, including accuracy, precision, recall, and area under the ROC curve for evaluation [24][25][26]. As shown in Table 2, 12 different classifiers were tested on the dataset. Due to the limited number of records available in the dataset, cross-validation was performed, as it gives the model multiple folds of data for training and testing in order to avoid over-fitting as well as reduce bias. The results were then recorded as mean values.
Algorithm 1 above shows the 10-fold cross-validation process. The 10 top-performing classifiers were selected for the next step. One dataset was standardized using the standard scaler function, and the other was normalized using the min-max scaler function. As discussed in the pre-processing section, outliers were removed prior to the standardization and normalization process. Normalized data performed better than standardized data for the majority of the classifiers; therefore, normalized data were used in subsequent experiments. Due to the heterogeneous nature of ML methods, some classifiers have good precision and recall scores but fail to achieve good accuracy. After the final predictions were made, the performance metrics were calculated based on the confusion matrix and the counts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) labels. In Fig. 3(b), the confusion matrix exhibits 3 false-positive and 3 false-negative predictions.
Accuracy is the proportion of correctly classified cases. Precision is the ratio of true positives to the total number of cases classified as positive (true positives plus false positives). For any medical predictive model, good recall (also known as sensitivity) and specificity should be maintained. Recall measures how well the model correctly identifies true positive cases. Specificity is the ratio of correctly classified negatives to the actual negative cases.
The F1-score is a performance metric computed as the harmonic mean of precision and recall. The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, a plot of the true positive rate against the false positive rate, is used for binary classification problems. A higher AUC indicates that the model performs well.
Matthews Correlation Coefficient (MCC) is a performance metric that produces a high score when a model performs well across all four categories (TP, TN, FP, and FN). Table 4 displays the performance comparison of all classifiers based on the 8 performance metrics, including accuracy, precision, sensitivity or recall, specificity, F1-score, AUC-ROC, log loss value, and MCC.
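All of the above metrics follow directly from the four confusion-matrix counts. A minimal sketch; the counts used are illustrative assumptions chosen to be consistent with the 3 false positives and 3 false negatives shown in Fig. 3(b), not the paper's verified TP/TN values:

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute standard classification metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)          # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1  = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, precision, recall, specificity, f1, mcc

# Hypothetical counts for a 75-instance test split (25% of ~300 records):
acc, prec, rec, spec, f1, mcc = metrics(tp=38, tn=31, fp=3, fn=3)
```

With these assumed counts the formulas reproduce figures close to those reported (92% accuracy, ~92.7% precision and sensitivity, ~91% specificity), illustrating how each metric is derived.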
The AUC-ROC curve was also used for performance evaluation, as it depicts how well a classifier is able to distinguish between two classes [50]. In clinical scenarios, the ability to discriminate between positive and negative classes is of prime importance. Fig. 5(b) compares the ROC curve of the proposed stacked classifier with those of the other classifiers, where a higher area under the curve indicates good performance. In the graph, the stacked classifier's curve has a greater area, with an inclination towards the true positive rate axis.

Conclusion and future work
In this work, we propose an effective model that incorporates data pre-processing with outlier detection and the stacking of classifiers for predicting heart diseases. The data were first normalized to ensure that the distribution of the data is even and on a similar scale, which improves the training stability of the model and gives better performance. Herein, we used 10 different classifiers with different strengths, such as instance-based (e.g., KNN), probabilistic (e.g., NB), and ensemble (e.g., XGB and GBM) classifiers. Considering those different methods for prediction, we stacked various classifiers to take advantage of their differences in strengths. Using MLP as the meta-learner, we obtained results with 92% accuracy. The proposed stacked classifier outperformed the traditional machine learning classifiers in the overall parameter comparison, with a precision of 92.6%, sensitivity of 92.6%, and specificity of 91%. The proposed model exhibits the advantages of combining weak learners and using their heterogeneity to strengthen overall prediction results.
Heart diseases often result in poor quality of life or even death. Therefore, early treatment could help to save many lives if CVDs are predicted on time. However, it is not feasible for cardiologists to manually analyze the large amount of data acquired for a patient in order to make a timely treatment plan. Hence, primary screening by machine learning-based systems is a promising solution. Such systems need to be reliable and efficient at diagnosing patients through predictive analysis. The use of such predictive models in the healthcare sector can help save lives and reduce the number of patients with heart disease who are left undiagnosed. In this work, our method achieved good accuracy with high sensitivity in predicting patients with heart diseases.
The demonstrated high sensitivity indicates that the model produces fewer false-negative results compared to traditional approaches. In other words, our method reduces the risk that a patient with heart disease is misdiagnosed or classified as negative for the disease. Importantly, this allows the cardiologist to quickly establish the appropriate plan of action.
For future work, we plan to evaluate and test the proposed model on various other datasets. The limited amount of data, instances, and number of attributes is the main issue facing machine learning approaches. In the future, more work and research can be done if we are able to acquire a greater quantity of good-quality medical data by collaborating with hospitals and other data-producing entities.

Data availability
Data will be made available on request.