Optimized Ensemble Learning Approach with Explainable AI for Improved Heart Disease Prediction

Abstract: Recent advances in machine learning (ML) have shown great promise in detecting heart disease. However, to ensure the clinical adoption of ML models, they must not only be generalizable and robust but also transparent and explainable. Therefore, this research introduces an approach that integrates the robustness of ensemble learning algorithms with the precision of Bayesian optimization for hyperparameter tuning and the interpretability offered by Shapley additive explanations (SHAP). The ensemble classifiers considered include adaptive boosting (AdaBoost), random forest, and extreme gradient boosting (XGBoost). The experimental results on the Cleveland and Framingham datasets demonstrate that the optimized XGBoost model achieved the highest performance, with specificity and sensitivity values of 0.971 and 0.989 on the Cleveland dataset and 0.921 and 0.975 on the Framingham dataset, respectively.


Introduction
Cardiovascular diseases (CVDs) are the leading cause of death worldwide [1]. The World Health Organization (WHO) attributes over 17.9 million deaths yearly to CVDs. Among these deaths, 32% are caused by coronary heart disease (CHD). The early detection of heart disease risk is crucial for effective treatment and prevention [2,3]. In recent years, machine learning, a subset of artificial intelligence (AI), has shown promise in improving the accuracy of heart disease prediction [4-6]. Ensemble classifiers have been widely used to achieve improved performance by combining the predictions of multiple individual classifiers. However, the performance of ensemble algorithms is heavily dependent on their hyperparameters, such as the number of trees, learning rate, and depth of the trees [7,8]. Therefore, tuning these hyperparameters is crucial to enhance the model's performance.
Furthermore, ML models are often considered "black boxes" due to their lack of interpretability and transparency, i.e., the decision-making process of the models is not easily understandable to humans [9]. Recently, the SHAP technique was proposed by Lundberg and Lee [10] to achieve model interpretability, a breakthrough in Explainable AI (XAI). SHAP is an approach based on game theory that is used to explain the output of ML models by assigning each feature an importance value. The importance of SHAP values in the context of heart disease prediction lies in their ability to provide detailed insights into how each feature influences the model's prediction. This is particularly important in healthcare, where understanding the rationale behind a prediction is as crucial as the prediction's accuracy [11,12]. SHAP values assist in identifying which features are most important for a model's decision, enabling the development of more interpretable and trustworthy models. By quantifying the impact of each feature on the prediction using SHAP values, practitioners can gain a deeper understanding of the model's behavior, identify potential biases, and ensure that the model aligns with clinical knowledge and ethical standards [13,14]. Therefore, this study proposes an approach for heart disease prediction using ensemble classifiers, Bayesian optimization, and the SHAP technique. The ensemble classifiers used include random forest, XGBoost, and AdaBoost. These algorithms were selected because of their proven effectiveness in a variety of ML tasks. The Bayesian optimization technique is introduced to tune each classifier's hyperparameters to maximize performance while minimizing overfitting.
Meanwhile, the SHAP approach will be used to interpret the predictions made by the ensemble classifiers. The introduction of SHAP is aimed at gaining insight into which features are most influential in the prediction of heart disease, providing valuable information for clinicians and researchers. The proposed approach will be evaluated on two publicly available datasets containing various clinical and demographic features of patients, such as age, gender, cholesterol levels, and blood pressure. The performance of the ensemble classifiers with Bayesian optimization will be compared with that of the standard classifiers without optimization.

Background
The field of heart disease prediction using ML has been explored in the literature, with numerous studies demonstrating the performance of various ensemble methods. Ensemble classifiers have been utilized due to their high accuracy and robustness against overfitting. For instance, Yang et al. [15] and Mahesh et al. [16] employed random forest and AdaBoost, respectively, to successfully identify key predictors of cardiovascular diseases in large patient datasets. These studies indicate the potential of ensemble models to enhance heart disease prediction performance using the strengths of multiple learning algorithms.
Similarly, Mienye et al. [4] proposed an ensemble approach for heart disease prediction. The study employed the classification and regression tree (CART) algorithm to build multiple base models from randomly partitioned subsets of data. The accuracy-based weighted-aging classifier was used to combine the various base models, achieving a strong homogeneous ensemble classifier, which obtained an accuracy of 93% on the Cleveland dataset and 91% on the Framingham dataset. Gao et al. [17] developed an ensemble approach for heart disease detection. The study employed a bagging ensemble of decision trees combined with two feature selection techniques, principal component analysis and linear discriminant analysis, achieving state-of-the-art performance on the Cleveland dataset.
Furthermore, the integration of hyperparameter optimization strategies, such as Bayesian optimization, has been shown to significantly enhance model performance by tuning model parameters. Shi et al. [18] demonstrated how Bayesian optimization could be applied to XGBoost to optimize its parameters, leading to substantial improvements in predictive accuracy compared to using default settings. Additionally, interpretability in ML models has gained significant attention, and it is crucial for clinical acceptance and decision-making as it provides transparency regarding the decision-making processes of the models. Asan et al. [19] studied the impact of human trust in healthcare AI and noted that transparency is particularly important in healthcare settings, where understanding the rationale behind a model's predictions can impact patient outcomes, improve physician trust, and facilitate regulatory compliance.
Meanwhile, within the context of interpretability, the SHAP technique was recently introduced to explain the output of ML models. The work of Debjit et al. [20] highlights how SHAP values can explain the contribution of each feature to the prediction made by complex models, thereby offering insights into model behavior that are both comprehensive and understandable to clinicians. Therefore, this study proposes a hybrid approach that integrates ensemble classifiers with Bayesian optimization to develop an accurate heart disease prediction model, with SHAP values providing insights into the decision-making process of the models, making the predictions transparent and understandable to healthcare professionals.

Heart Disease Datasets
The Cleveland Heart Disease dataset comprises 14 attributes collected from patients undergoing testing for heart disease at the Cleveland Clinic Foundation [21]. The dataset includes a mix of patient demographic information, blood test results, and results from various cardiovascular tests. Each row corresponds to a patient, and the goal typically involves predicting the presence of heart disease based on these attributes. Table 1 shows a description of the dataset's features. Meanwhile, the Framingham Heart Study dataset is derived from an influential cardiovascular study initiated in 1948 in Framingham, Massachusetts [22]. It aimed to identify common factors or characteristics contributing to cardiovascular disease. The study initially enrolled 5209 persons aged 30 to 62 years and has been expanded to include subsequent generations, providing a wealth of data spanning several decades. The dataset typically utilized for cardiovascular disease prediction includes variables such as age, sex, smoking status, blood pressure, cholesterol levels, diabetes status, and body mass index (BMI), among others. These variables are used to model and predict the 10-year risk of developing CHD, offering vital insights for preventative healthcare strategies. Table 2 provides a description of the dataset.

Random Forest
Random forest (RF) is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes for classification problems [23,24]. The fundamental concept behind RF is to combine the predictions of several decision trees constructed on different subsets of the dataset to improve the model's generalization ability. Suppose we have a dataset D consisting of n instances and m attributes; the random forest algorithm builds each tree by selecting a random subset of instances and a random subset of attributes at each node to split on. This randomness ensures that the trees are diverse. For a new instance, each tree in the forest votes for that instance to belong to the most frequent class. Therefore, the random forest prediction, ŷ, can be represented as follows:

$$\hat{y} = \operatorname{mode}\{y_1, y_2, \ldots, y_T\} \tag{1}$$

where $T$ is the number of trees in the forest and $y_i$ is the prediction of the $i$th tree [25].
For regression tasks, the mean prediction of all trees is considered. The random forest is robust to overfitting, can handle high-dimensional data, and can model complex nonlinear relationships, which are often present in medical datasets [26,27]. This makes it suitable for heart disease prediction.
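To make the voting step concrete, the majority-vote aggregation above can be sketched in a few lines of Python; the per-tree votes shown are hypothetical:

```python
from collections import Counter

def random_forest_predict(tree_predictions):
    """Aggregate the T individual tree votes into the forest's prediction,
    i.e., the mode of the predicted classes."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Hypothetical votes from T = 5 trees for one patient:
# 1 = heart disease present, 0 = absent.
votes = [1, 0, 1, 1, 0]
print(random_forest_predict(votes))  # majority class -> 1
```

For regression, the same aggregation step would return the mean of the tree outputs instead of the mode.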

XGBoost
XGBoost is an advanced implementation of the gradient boosting algorithm known for its efficiency and performance [28,29]. XGBoost improves upon the traditional gradient boosting technique by introducing a more regularized model formalization to control overfitting, making it a powerful algorithm for complex predictive modeling tasks such as heart disease prediction. The core idea of XGBoost is to iteratively refine predictions through an ensemble of decision trees. Each new tree attempts to correct errors made by the previously built trees [30]. The prediction, ŷᵢ, for an instance, i, after T rounds of boosting is given by the following:

$$\hat{y}_i = \sum_{t=1}^{T} f_t(x_i) \tag{2}$$

where $f_t$ represents the $t$th tree's contribution, and $x_i$ is the feature vector of the $i$th instance. The objective function that XGBoost optimizes is as follows:

$$\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{t} \Omega\left(f_t\right) \tag{3}$$

where $l$ is a differentiable convex loss function that measures the difference between the predicted $\hat{y}_i$ and the actual $y_i$ outcomes for the $i$th instance, and $\Omega$ is the regularization term, which penalizes the complexity of the model, including the number of leaves and the magnitude of the leaf weights in tree $f_t$, to prevent overfitting. Meanwhile, to achieve optimal performance, XGBoost requires careful tuning of several hyperparameters, such as the learning rate and the depth of trees [31,32]. This process can be significantly enhanced by employing techniques like Bayesian optimization, which systematically searches the hyperparameter space by considering past evaluation results and aims to find the set of hyperparameters that minimizes the validation error.
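The additive structure of the prediction formula above can be illustrated with a deliberately simplified sketch. Real XGBoost fits regression trees to gradient statistics and applies the regularization term Ω; this toy version replaces each tree with a shrunken constant learner purely to show how successive boosting rounds reduce the remaining error:

```python
def fit_boosted_constants(y, rounds=10, eta=0.3):
    """Toy boosting loop: each round t fits a trivial learner f_t (the mean
    of the current residuals, shrunk by the learning rate eta) and adds it
    to the running prediction y_hat = sum_t f_t."""
    y_hat = [0.0] * len(y)
    learners = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, y_hat)]
        f_t = eta * sum(residuals) / len(residuals)  # shrunken constant learner
        learners.append(f_t)
        y_hat = [pi + f_t for pi in y_hat]
    return y_hat, learners

# Each round shrinks the residual by a factor of (1 - eta), so the
# prediction converges toward the target mean as rounds increase.
y_hat, _ = fit_boosted_constants([0.0, 1.0, 1.0, 0.0], rounds=50, eta=0.3)
```

The learning rate eta plays the same role as XGBoost's own learning rate: smaller values require more rounds but make each correction more conservative, which helps control overfitting.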

AdaBoost
AdaBoost is a powerful ensemble technique that combines multiple weak learners, typically decision trees, to create a strong classifier [33]. It mainly fits a sequence of weak learners (i.e., models that are slightly better than random guessing) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote to produce the final prediction [34]. This approach allows the algorithm to focus more on the instances that are harder to predict, thereby increasing the overall model's predictive performance on complex problems such as heart disease prediction. AdaBoost starts with equal weights for all instances in the dataset and iteratively adjusts these weights. After each classifier is trained, the weights are updated to increase the importance of instances that were misclassified, guiding the algorithm to focus on misclassified instances in subsequent iterations [35]. The output of the AdaBoost algorithm is a weighted sum of the weak classifiers, defined mathematically as follows:

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \tag{4}$$

where $T$ is the total number of weak classifiers, $h_t(x)$ is the prediction of the $t$th classifier, and $\alpha_t$ is the weight of the $t$th classifier, which is calculated based on its error rate [36].
The error rate of each classifier is given by the following:

$$\epsilon_t = \frac{\sum_{i=1}^{n} w_i \, I\left(y_i \neq h_t(x_i)\right)}{\sum_{i=1}^{n} w_i} \tag{5}$$

where $w_i$ is the weight of the $i$th instance, $y_i$ is the actual label, and $I$ is an indicator function that returns 1 when $y_i \neq h_t(x_i)$ and 0 otherwise.
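A single boosting round, combining the classifier weight and the instance-reweighting rule described above, can be sketched as follows. Labels are assumed to be encoded as -1/+1 (a common convention, not stated in the text), and the input weights are assumed to already sum to one:

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: compute the weighted error eps_t, the classifier
    weight alpha_t = 0.5 * ln((1 - eps_t) / eps_t), and the renormalized
    instance weights, which grow for misclassified points."""
    eps = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * math.log((1 - eps) / eps)
    new_w = [w * math.exp(-alpha * y * h)
             for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Hypothetical round: four equally weighted instances, one misclassified.
alpha, new_w = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
# The misclassified fourth instance now carries half of the total weight.
```

Because $y \cdot h = -1$ exactly when an instance is misclassified, the single expression `exp(-alpha * y * h)` both downweights correct instances and upweights incorrect ones.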

Shapley Additive Explanations
Shapley Additive Explanations values are derived from game theory and offer a unified measure of feature importance that is both theoretically sound and consistent [37]. SHAP values explain the prediction of an instance by computing the contribution of each feature to the difference between the current prediction and the average prediction across all instances [38,39]. This approach provides an interpretable and detailed decomposition of a model's output, making it a powerful technique for understanding complex ML models, including ensemble classifiers like random forest, XGBoost, and AdaBoost. The foundation of SHAP values is the Shapley value, a concept from cooperative game theory that allocates payouts to players based on their contribution to the total payout [40]. In the context of machine learning, features are considered "players", and the "payout" is the prediction output of the model. The Shapley value for feature $i$ is calculated as follows:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_x(S \cup \{i\}) - f_x(S) \right] \tag{6}$$

where $F$ is the set of all features, $S$ is a subset of features excluding $i$, $|S|$ is the cardinality of set $S$, $|F|$ is the total number of features, $f_x(S)$ is the prediction when only the features in set $S$ are used, and $f_x(S \cup \{i\})$ is the prediction when the features in $S$ along with feature $i$ are used. The SHAP value $\phi_i$ thus represents the average marginal contribution of feature $i$ over all possible combinations of features [10].
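For small feature sets, the Shapley formula above can be computed exactly by enumerating every subset. The sketch below does exactly that; the additive toy model `f` and its per-feature contributions are purely illustrative (the SHAP library uses far more efficient approximations for real models):

```python
from itertools import combinations
from math import factorial

def shapley_value(i, features, f_x):
    """Exact Shapley value of feature i: the weighted average marginal
    contribution f_x(S | {i}) - f_x(S) over all subsets S of the
    remaining features, with the combinatorial weights from the formula."""
    others = [j for j in features if j != i]
    n = len(features)
    phi = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (f_x(set(S) | {i}) - f_x(set(S)))
    return phi

# Toy additive model: each feature contributes a fixed amount when present.
contrib = {"a": 2.0, "b": -1.0, "c": 0.5}
f = lambda S: sum(contrib[j] for j in S)
print(shapley_value("a", list(contrib), f))  # -> 2.0 for an additive model
```

For an additive model each feature's Shapley value equals its fixed contribution, and the values sum to the difference between the full prediction and the empty-set baseline, which is the consistency property that makes SHAP attractive for clinical interpretation.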

Proposed Heart Disease Prediction Approach
The proposed approach for heart disease prediction integrates the robustness of ensemble learning techniques, specifically random forest, AdaBoost, and XGBoost, with Bayesian optimization to effectively optimize each model's hyperparameters. The initial phase of the methodology involves splitting the dataset into distinct training and testing sets. This separation facilitates the unbiased evaluation of each model. For each classifier, a defined hyperparameter space is established. Bayesian optimization is then applied to navigate this space efficiently, aiming to identify the optimal hyperparameters that maximize the area under the ROC curve (AUC). In this study, the Bayesian optimization process employs a 5-fold cross-validation method within the training dataset to ensure rigorous evaluation of hyperparameter sets. Here, the training dataset is divided into five smaller subsets, or folds. For each set of hyperparameters tested, the model is trained on four of these folds and validated on the remaining one. This procedure is rotated so that each fold serves as the validation data once. This cross-validation method assesses how the hyperparameters perform across different subsets of the data, thereby reducing variability and providing a more generalized performance estimate. The proposed approach is represented in Algorithm 1.
Algorithm 1: Proposed heart disease prediction approach.

Once the optimal models are trained and validated using this cross-validation approach, the model that demonstrates superior performance, measured through the highest AUC on the test set, is selected as the best model. This model then undergoes a thorough interpretability analysis using SHAP values, which elucidate how each feature contributes to individual predictions. This interpretability is crucial, particularly in a clinical setting, as it aids healthcare professionals in understanding the decision-making process of the model, thereby enhancing trust and actionable insights.
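The fold-rotation procedure used inside the optimization loop can be sketched as follows. This illustrates only the 5-fold rotation described above, not the Bayesian search itself; in practice a library routine (e.g., scikit-learn's `KFold`) would typically handle this, usually with shuffling and stratification:

```python
def five_fold_splits(n_samples, k=5):
    """Yield (train_indices, val_indices) pairs so that each of the k folds
    serves as the validation set exactly once, as in the cross-validation
    step of the hyperparameter evaluation."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # simple round-robin assignment
    for v in range(k):
        val_idx = folds[v]
        train_idx = [i for f in range(k) if f != v for i in folds[f]]
        yield train_idx, val_idx

# For each candidate hyperparameter set, a model would be trained on
# train_idx and scored (e.g., AUC) on val_idx, and the five scores averaged.
for train_idx, val_idx in five_fold_splits(10):
    pass  # fit and evaluate the candidate model here
```

Averaging the validation score over all five rotations gives the surrogate objective that the Bayesian optimizer then uses to propose the next hyperparameter set.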

Performance Evaluation Metrics
To evaluate the performance of the heart disease prediction models, it is essential to consider a variety of metrics that capture different aspects of model performance. The following metrics provide a comprehensive assessment: accuracy, sensitivity, specificity, and F1 score. Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of cases examined [4,41]. Although accuracy is intuitive, it may not provide a complete picture in imbalanced datasets where one class significantly outnumbers the other. Sensitivity, also known as recall, measures the proportion of actual positives correctly identified by the model, and it is more suitable for imbalanced classification tasks. It is particularly important in medical applications where missing a positive case (e.g., a disease) can have serious implications [42]. Meanwhile, specificity, also known as the true negative rate (TNR), measures the proportion of actual negatives correctly identified and is crucial for ensuring that the model is not overly sensitive to positives [43]. Lastly, the F1 score is the harmonic mean of precision and sensitivity, providing a balance between the two metrics. It is useful when the class distribution is uneven. They are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{7}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{8}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{9}$$

$$F1 = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{10}$$

where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively [44]. Meanwhile, precision, often referred to as the positive predictive value (PPV), measures the accuracy of the positive predictions made by a model [45]. It is defined as the proportion of true positive outcomes to the total predicted positive outcomes. It can be expressed mathematically as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$
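These definitions translate directly into code; the confusion-matrix counts used in the example are hypothetical:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics defined above from the four
    confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}

# Hypothetical test-set counts: 90 TP, 85 TN, 5 FP, 10 FN.
metrics = classification_metrics(tp=90, tn=85, fp=5, fn=10)
```

Note how sensitivity and specificity respond to different error types: the 10 false negatives lower sensitivity while leaving specificity untouched, which is exactly why both are reported alongside accuracy for medical classifiers.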

Results and Discussion
This section presents and discusses the experimental results, showing the performance of the ensemble classifiers before and after applying Bayesian optimization. First, Tables 3 and 4 show the optimal hyperparameters after applying Bayesian optimization to the Cleveland and Framingham datasets, respectively. Table 5 shows the performance of the models using the Cleveland dataset. The optimized versions of random forest, AdaBoost, and XGBoost are benchmarked against the standard random forest, AdaBoost, and XGBoost, and they are evaluated based on accuracy, specificity, sensitivity, and F-measure. Using the Cleveland dataset, the standard XGBoost significantly outperforms the other non-optimized classifiers across all metrics. Its high sensitivity (0.937) and F-measure (0.896) demonstrate its ability to accurately identify patients with heart disease and balance precision and recall effectively. Also, the optimized versions of RF, AdaBoost, and XGBoost show performance improvements. The optimized RF and optimized AdaBoost demonstrate notable increases in specificity and F-measure, indicating enhanced overall performance and reliability in accurately identifying both positive and negative cases. The optimized XGBoost achieves the highest accuracy (0.984), specificity (0.971), sensitivity (0.989), and F-measure (0.985) among all models tested. This level of performance suggests that with appropriate parameter tuning and optimization, XGBoost can offer exceptional predictive capabilities, making it highly suitable for critical applications such as heart disease prediction. Meanwhile, the performance of the classifiers on the Framingham dataset is tabulated in Table 6. The RF classifier achieved a balance across all metrics, indicating its effectiveness and potential as a comprehensive model for disease prediction. AdaBoost, with an accuracy of 0.860, achieved a lower sensitivity (0.790). Meanwhile, XGBoost was exceptional, with the highest accuracy (0.917) among the base classifiers, alongside impressive specificity (0.899) and sensitivity (0.920). Furthermore, the optimized classifiers show notable improvements over the standard classifiers. The optimized versions of RF and AdaBoost demonstrate significant increases in specificity and sensitivity. However, the optimized XGBoost obtained the highest scores across all evaluated metrics, including a near-perfect sensitivity (0.975). This indicates its exceptional capability in identifying true positive cases, an essential attribute for healthcare applications. The findings from the Framingham dataset also reveal the substantial impact of parameter optimization in enhancing the predictive performance of the ensemble classifiers.
Furthermore, with the optimized XGBoost achieving the best performance on both the Cleveland and Framingham heart study datasets, it is useful to understand which features contributed the most to the decision-making process. Therefore, Figures 1 and 2 show the XGBoost model's interpretation using the SHAP technique for the Cleveland dataset. Figure 1 shows the summary plot, indicating the impact of every feature on the classifier output for each sample in the dataset. Each point represents a SHAP value for a feature and a data point. Meanwhile, the features are listed along the y-axis, and their SHAP values are on the x-axis. The color indicates the feature value, with red being high and blue being low. From the visualizations, it can be seen that the 'ca' (i.e., number of major vessels colored by fluoroscopy) and 'thal' (i.e., thalassemia) attributes are vital factors influencing the model's predictions for heart disease using the Cleveland dataset. The 'cp' (chest pain type) and 'sex' features also have substantial impacts on the model's decisions, suggesting that they are important predictors of the outcome. However, 'fbs' (fasting blood sugar) and 'trestbps' (resting blood pressure) seem to have a much smaller effect on the model's predictions. Meanwhile, Figure 2 presents a bar chart that simplifies the previous summary by taking the mean of the absolute SHAP values for each feature across all samples, providing an aggregate measure of feature importance. From the bar chart, 'ca' is the most important feature, followed by 'cp', 'sex', and 'thal', as indicated by the length of the bars; 'fbs' (fasting blood sugar) appears to be the least important feature in terms of the average impact on the model's output.
Figures 3 and 4 show the SHAP summary plot and mean absolute SHAP values for the XGBoost model using the Framingham dataset. In this instance, the 'Age' feature has the highest mean absolute SHAP value, which indicates that it is the most important feature in terms of average impact on the classifier's output. Other significant features include 'sysBP' and 'cigsPerDay', suggesting that blood pressure and smoking habits are key indicators the model uses to predict heart disease risk. The features 'diabetes' and 'prevalentStroke' have the least impact on the model's predictions, as shown by the shorter bars. Additionally, in order to further validate the robustness of the proposed approach, a performance comparison is conducted with other methods in the literature. This comparison is shown in Table 7. From Table 7, the proposed optimized XGBoost model achieves a high accuracy of 0.984, which is comparable to the highest accuracies reported in the literature, such as the 0.987 and 0.988 achieved by [46] and [47], respectively. Additionally, the optimized XGBoost model demonstrates superior sensitivity (0.989) compared to the other methods, indicating its effectiveness in correctly identifying positive cases of heart disease. This high sensitivity is crucial for clinical applications where missing a true positive case could have severe consequences. Furthermore, the specificity of our model (0.971) highlights its capability to accurately identify negative cases, thereby reducing the likelihood of false positives and ensuring reliable performance across diverse patient populations. In comparison with other techniques, it is evident that some studies lacked detailed specificity and sensitivity metrics. For example, refs. [48,50,52,62] report accuracy values; however, the absence of specificity and sensitivity data makes it difficult to assess their overall clinical utility. Additionally, more advanced methods such as CNN [60], Bi-LSTM [47], and deep learning ensemble [59] also show high performance. Nonetheless, our optimized XGBoost model stands out by providing a balanced performance with comprehensive metrics, thus demonstrating both its robustness and reliability. Meanwhile, the implementation of XAI techniques is now a critical feature in clinical ML applications. Our study distinguishes itself not only by achieving high predictive performance but also by incorporating the SHAP technique for interpretability. This ensures that the model's decisions are transparent and understandable to clinicians, addressing a critical need for clinical adoption.

Discussion
The experimental results presented in this study indicate the effectiveness of integrating Bayesian optimization with ensemble learning techniques for heart disease prediction. By optimizing the hyperparameters of the AdaBoost, RF, and XGBoost models, this study achieved notable improvements in the performance of these models on both the Cleveland and Framingham datasets. The enhanced performance, particularly in the specificity and sensitivity metrics, reflects the importance of effective fine-tuning of model parameters, thus leading to more reliable and robust classifiers. When compared with recent studies, our optimized XGBoost model demonstrates a high accuracy of 0.984, which is comparable to the highest accuracies reported, such as 0.987 by [46] and 0.988 by [47]. The proposed model's sensitivity of 0.989 highlights its effectiveness in accurately identifying positive cases of heart disease, outperforming the methods in the other studies. The proposed model's balanced performance across the various metrics demonstrates its robustness and clinical relevance.
Furthermore, the utilization of SHAP values for interpreting the optimized XGBoost provides necessary insights into the classifier's decision-making processes. This interpretability is essential for clinical applications, where understanding the factors influencing model predictions can significantly impact patient management and treatment planning. The SHAP analysis revealed key risk factors and their contributions to the predictions. For instance, features such as 'ca' (number of major vessels colored by fluoroscopy) and 'thal' (thalassemia) were identified as significant predictors in the Cleveland dataset, aligning with clinical understandings of their roles in heart disease. Additionally, the optimized XGBoost model, which showed superior performance across both datasets, highlights the robustness of combining ensemble learning techniques with hyperparameter optimization. Its high sensitivity and specificity indicate its potential in clinical settings to accurately identify patients at risk of heart disease, thereby ensuring early intervention.

Conclusions and Future Work
This research studied the use of ensemble classifiers, Bayesian optimization, and SHAP in heart disease prediction. The aim of the study was to combine these techniques to create more accurate and interpretable models that can aid in the early detection and treatment of heart disease, ultimately saving lives and improving patient outcomes. The optimized XGBoost achieved the best results among the ensemble techniques, outperforming the optimized AdaBoost, optimized random forest, and the other standard models. Furthermore, the optimized XGBoost model obtained comparable performance with other well-performing methods in recent studies and provides balanced and interpretable predictions. This combination of high performance and explainability makes it a promising approach for heart disease prediction in clinical settings.
Future work will focus on the deployment of this model in real-world healthcare settings. The first step will involve creating a robust pipeline that integrates seamlessly with hospital databases, enabling real-time data processing and predictions. This requires collaboration with IT departments within healthcare institutions to ensure compatibility with existing electronic health record systems. Additionally, extensive validation and pilot studies will be necessary to ensure the model's reliability and effectiveness in diverse clinical environments. These pilot studies will help refine the model based on feedback from healthcare professionals and actual patient outcomes. Furthermore, future efforts will focus on expanding the model's capabilities to predict other cardiovascular diseases and conditions. We aim to enhance its predictive power and adaptability to different patient populations by continuously updating and validating the model with new data.

Table 1 .
Description of the Cleveland Heart Disease dataset.

Table 3 .
Optimal parameters for Cleveland dataset.

Table 4 .
Optimal parameters for Framingham dataset.

Table 5 .
Classifier performances on the Cleveland dataset.

Table 6 .
Classifier performances on the Framingham dataset.

Table 7 .
Comparison with other methods in recent studies.