A Systematic Review of Artificial Intelligence Models for Time-to-Event Outcome Applied in Cardiovascular Disease Risk Prediction

Artificial intelligence (AI) based predictive models for early detection of cardiovascular disease (CVD) risk are increasingly being utilised. However, AI based risk prediction models that account for right-censored data have been overlooked. This systematic review (PROSPERO protocol CRD42023492655) includes 33 studies that utilised machine learning (ML) and deep learning (DL) models for survival outcome in CVD prediction. We provided details on the employed ML and DL models, eXplainable AI (XAI) techniques, and type of included variables, with a focus on social determinants of health (SDoH) and gender-stratification. Approximately half of the studies were published in 2023 with the majority from the United States. Random Survival Forest (RSF), Survival Gradient Boosting models, and Penalised Cox models were the most frequently employed ML models. DeepSurv was the most frequently employed DL model. DL models were better at predicting CVD outcomes than ML models. Permutation-based feature importance and Shapley values were the most utilised XAI methods for explaining AI models. Moreover, only one in five studies performed gender-stratification analysis and very few incorporated the wide range of SDoH factors in their prediction model. In conclusion, the evidence indicates that RSF and DeepSurv models are currently the optimal models for predicting CVD outcomes. This study also highlights the better predictive ability of DL survival models, compared to ML models. Future research should ensure the appropriate interpretation of AI models, accounting for SDoH, and gender stratification, as gender plays a significant role in CVD occurrence.
Supplementary Information: The online version contains supplementary material available at 10.1007/s10916-024-02087-7.


Introduction
Cardiovascular diseases (CVD) cause 32% of all global deaths [1]. A confluence of environmental, genetic, social, and physiological factors leads to the development of CVD [...] of ML that uses artificial neural networks. As illustrated in Fig. 1, there are primarily four types of ML and DL [8,9]: supervised, unsupervised, semi-supervised, and reinforcement learning (RL). Supervised ML algorithms have been utilised for future risk prediction. They require training using labelled data, that is, data that contains inputs and correct corresponding outputs. Depending on the type of the outcome variable that is available for the learning phase, regression and classification algorithms (including binary, multi-class, multi-level, and imbalanced classification) are commonly employed. Regression algorithms predict continuous variables while classification algorithms determine the likelihood that a certain event will occur. Unsupervised ML algorithms (such as anomaly detection, clustering) use unlabelled data and are intended to find groups/clusters of similar characteristics without human supervision. Semi-supervised ML combines the features of supervised and unsupervised ML approaches, i.e., utilises both labelled and unlabelled data. RL algorithms interact with an environment to learn the optimal behaviour to maximise the overall reward.
ML models have been in existence since 1957. The perceptron, which laid the foundation for supervised ML models and artificial neural networks, was one of the earliest neural network models. Since then, ML has passed several important milestones: the development of decision trees in the 1960s, support vector machines (SVM) in the 1990s, random forest in 2001, DL models in the 2010s, large language models such as ChatGPT in 2022, and many others recently [10]. These models are supervised ML algorithms for classification and regression, and are applied for predicting or forecasting chronic diseases, including CVD risk [11]. However, ML and DL algorithms for survival prediction were not widely used until Random Survival Forest (RSF) was developed by Ishwaran et al. in 2008 [12]. In particular, survival AI prediction algorithms, which estimate the time until a health outcome occurs, have not received as much attention as classification and regression ML algorithms [13]. Currently, various AI models for right-censored data are gaining popularity, even though they are predominantly used for predicting cancer patient survival outcomes [14][15][16][17]. Survival forest models [16,17], the Non-Linear Cox proportional hazard (Cox PH) model (also known as the DeepSurv model), and Neural Multi-Task Logistic Regression (NMTLR) [14][15][16] are among the commonly utilised models. Some studies have also employed algorithms such as CoxTime and Cox-CC [14].
Fig. 1 Overview of machine learning and deep learning models
Numerous systematic reviews on AI-based CVD prediction have been conducted [18][19][20][21]; yet they primarily focus on classification-based models. For instance, Baashar et al.'s research assessed the effectiveness of ML and DL in CVD prediction through network meta-analysis [20], covering 17 studies from 2016 to 2021 and suggesting that DL might yield better results than ML. Nonetheless, a systematic review that succinctly summarises ML and DL models for right-censored data is still lacking. The justification for exploring AI models for right-censored data stems from the unique nature of survival outcomes. Unlike regression and classification problems, survival prediction must account for two components during model training: the follow-up time, which is continuous, and the event status, a binary outcome indicating whether a specific event, such as CVD, has occurred.
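To make the two-component outcome concrete, the following is a minimal Python sketch of how right-censored records might be represented and used. It is not taken from any included study: the `Subject` class, the field names, and the toy values are all hypothetical, and the Kaplan-Meier estimator is used here only as a familiar example of a method that keeps censored participants in the risk set until their follow-up ends.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    time: float  # follow-up time in years (the continuous component)
    event: bool  # True if CVD was observed, False if right-censored


# Hypothetical toy cohort: two observed events, three censored subjects.
cohort = [
    Subject(2.0, True),
    Subject(5.0, False),
    Subject(3.5, True),
    Subject(6.0, False),
    Subject(1.0, False),
]


def at_risk(cohort, t):
    """Subjects still under observation at time t (events or censored later)."""
    return [s for s in cohort if s.time >= t]


def kaplan_meier(cohort):
    """Return [(event_time, survival_probability), ...] for observed events."""
    survival, curve = 1.0, []
    for t in sorted({s.time for s in cohort if s.event}):
        n = len(at_risk(cohort, t))  # censored subjects count until they leave
        d = sum(1 for s in cohort if s.event and s.time == t)
        survival *= 1 - d / n
        curve.append((t, survival))
    return curve


print(kaplan_meier(cohort))
```

In this toy cohort the subject censored at 1.0 years never enters a risk set, while those censored at 5.0 and 6.0 years contribute to both denominators; a plain binary classifier would have no way to use that partial follow-up.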
Additionally, previous risk prediction models, including the latest multivariable prediction models and AI-based models mentioned above, primarily focus on standard modifiable risk factors of CVD, demographics (age and sex/gender), and lifestyle factors (particularly smoking). This means that social determinants of health (SDoH), defined as the social and environmental circumstances in which people grow, live, work, worship, and age, have been overlooked in disease prediction models [22], including CVD [18,23]. For example, only race in the PCE [3] and social deprivation in the PREVENT tool [6] are incorporated when predicting CVD. Similarly, AI-based risk prediction models consider only a limited number of SDoH variables, like race, income, and occupation [18]. SDoH are detailed in the Healthy People 2030 framework using five domains, namely, economic stability, education quality and access, social and community context, neighborhood and built environment, and healthcare access and quality [24]. Using the Healthy People framework as a foundation, our umbrella review [25] demonstrated that SDoH have a major role in the development of CVD. In general, disparities in SDoH give rise to health inequalities, which are systematic discrepancies in the opportunities people need to attain optimal health.
It is also important to focus on the explainability of the AI models to improve confidence in their application. Methods for this purpose are called eXplainable AI (XAI) techniques and, as shown in Fig. 2, they can be model-specific (use the structure of the model itself, e.g., built-in feature importance measures in ensemble models) or model-agnostic (provide post-hoc explanations, e.g., Local Interpretable Model-agnostic Explanations (LIME), Shapley Additive exPlanations (SHAP)) [26]. However, these techniques have limited application for survival ML and DL methods. New XAI techniques for survival models such as Survival SHAP (SurvSHAP), survival neural additive model (SurvNAM), and survival LIME (SurvLIME) are currently gaining attention but are only used for explaining some algorithms [27,28]. XAI, in general, aims to increase trust in a model among different stakeholders: (1) those with model expertise (e.g., ML experts, researchers); and (2) those without (clinicians, patients). However, the explainability of time-to-event AI models has been less explored.
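As one concrete instance of a model-agnostic, post-hoc explanation, the sketch below implements permutation-based feature importance, the XAI method this review found to be most used. It is a generic illustration under stated assumptions, not the procedure of any included study: `predict`, `metric`, and the toy data are hypothetical, and for a survival model the metric would typically be a censoring-aware score such as the C-index rather than the negative squared error used here.

```python
import random


def permutation_importance(predict, X, y, metric, n_repeats=20, seed=0):
    """Mean drop in `metric` (higher is better) when one column is shuffled."""
    rng = random.Random(seed)
    baseline = metric(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)  # break the link between this feature and y
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            permuted = metric(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return importances


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that larger values mean a better fit."""
    return -sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical model that uses only the first feature and ignores the second.
X = [[float(i), float((i * 7) % 5)] for i in range(10)]
y = [float(i) for i in range(10)]
importances = permutation_importance(lambda row: row[0], X, y, neg_mse)
print(importances)
```

Because the toy model ignores the second feature, shuffling it cannot change the predictions and its importance is zero, whereas shuffling the first feature degrades the score. The same loop works for any fitted model, which is what makes the approach model-agnostic.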
Fig. 2 Overview of eXplainable AI approaches
Moreover, various systematic reviews have been conducted on AI-based CVD prediction [18][19][20][21]; however, they have all focused on classification-based models (models designed for classification problems). For example, research conducted by Baashar et al. evaluated the efficacy of ML and DL in the prediction of CVD through network meta-analysis [20]. The study encompassed 17 studies spanning from 2016 to 2021, concluding that DL may offer more favorable outcomes than ML in predicting CVD.
Therefore, this systematic review aims to (1) investigate AI models for survival prediction employed in predicting CVD; (2) indicate whether XAI is applied for interpreting the models; and (3) examine whether the identified AI models account for SDoH as well as gender stratification.

Registration and Reporting
The study protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO CRD42023492655). The Preferred Reporting Items for a Systematic Review and Meta-analysis (PRISMA) statement is used for reporting [29] (Table S1).

Eligibility Criteria
Studies were deemed eligible if they intended to predict CVD outcomes using AI methods for survival prediction. There were no restrictions based on country, study design, language, or study period (Table 1). Grey literature (including conference abstracts), case reports, letters, editorials, and reviews were not eligible. AI models based on simulation and imaging/text data were ineligible as they do not use structured population-level data.

Information Sources and Search Strategy
We carried out a comprehensive search using five electronic databases, from their inception to December 21, 2023: Embase via Ovid, Scopus, Web of Science, IEEE Xplore, and Ovid Medline. Further studies were identified by a manual search using Google Scholar, and through backward and forward reference searching using Web of Science. Various terms related to CVD, AI methods, and risk prediction were utilised (Table 2), linked through Boolean and adjacency (or proximity) operators. The comprehensive search terms used in Ovid Medline are available in the supplementary file (Table S2).

Study Selection and Data Extraction
Identified records from databases were exported to Endnote Version 20 and then to ASReview [30] and Covidence [31]. Following deduplication, eligible articles at the title and abstract stage were selected using ASReview. Full-text screening was done using Covidence. Data were extracted using a data extraction sheet prepared based on the 11 CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) domains [32]. Two reviewers (ABT and HLH) selected eligible studies and undertook data extraction, resolving conflicts through discussion (full-text review: proportionate agreement = 96%, Cohen's κ = 0.92).

Assessment of Risk of Bias
To evaluate the risk of bias (RoB), we used the Prediction Model Risk of Bias Assessment Tool (PROBAST) [32] with four domains (participant selection, predictors, outcome, and analysis) and different signaling questions for each domain.
Using the PROBAST, we also assessed applicability using three domains: participant, predictors, and outcome. Two authors (ABT and HLH) assessed RoB independently and any disagreements were resolved by discussion.

Screening Result
Out of a total of 4,739 studies retrieved through database searching, 86 were eligible for a full-text review. Thirty-three studies in total, 30 studies from database searching and three studies [63][64][65] from other sources, qualified for inclusion in this study (Fig. 3). The studies that were excluded during the full-text review are provided in the supplementary file (Table S3).
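The inter-reviewer agreement statistics quoted for the full-text review (proportionate agreement and Cohen's κ) can be computed as in the sketch below. This is only an illustration of the standard formulas; the function name and the two decision lists are invented for the example and are not the review's actual screening data.

```python
from collections import Counter


def agreement_and_kappa(rater_a, rater_b):
    """Return (proportionate agreement, Cohen's kappa) for two raters."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Agreement expected by chance, from each rater's marginal frequencies.
    expected = sum(count_a[c] * count_b[c] for c in labels) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical decisions on ten full texts; the raters disagree on one.
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 7
observed, kappa = agreement_and_kappa(reviewer_1, reviewer_2)
print(round(observed, 2), round(kappa, 2))
```

Kappa is lower than raw agreement because it discounts the agreement that two raters with these marginal include/exclude rates would reach by chance alone.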

Data Synthesis
Characteristics of studies were summarised based on items from the CHARMS statement and our specific aims. If a study used more than one ML or DL algorithm, we reported the prediction performance measure for the best performing algorithm. ML and DL models were compared with each other and with the standard Cox PH model. Utilised SDoH variables, based on the Healthy People 2030 framework, were reported. XAI methods employed were described. However, due to variations in the study population, the endpoint description, the different ML and DL algorithms utilised, and the variety in the types and numbers of variables, the prediction performance of the models was not pooled (i.e., meta-analysis was not conducted).

Employed ML and DL Models
Eight ML and nine DL models were utilised, with Fig. 5 presenting the names of each model and the number of studies that employed them, and Table S5 providing further details.

Best Performing ML and DL Models
To evaluate the predictive performance, studies utilised C-index, area under the curve (AUC), and Brier score or calibration plot. In addition, some studies also explored other measures such as decision curve analysis. These performance evaluation metrics are presented in Table S5. The mean C-index (standard deviation) was 0.79 (0.069) for ML models, 0.82 (0.061) for DL models, 0.81 (0.144) for Penalised Cox, 0.80 (0.058) for RSF, 0.79 (0.032) for DeepSurv, and 0.77 (0.055) for survival Gradient Boosting Models (GBM) (Table 3).
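For orientation, the C-index reported above can be read as the share of comparable subject pairs in which the subject who experienced the event earlier was also assigned the higher predicted risk. The sketch below is a simplified pure-Python version of Harrell's C under that definition; the function name and toy values are hypothetical, and established packages handle tied event times and other edge cases more carefully than this illustration does.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C for right-censored data.

    A pair (i, j) is comparable when subject i had an observed event and
    subject j was still being followed after that time.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Hypothetical toy data: follow-up years, event indicators, predicted risks.
times = [2, 4, 6, 8]
events = [True, False, True, False]
risk_scores = [0.9, 0.4, 0.05, 0.1]
c_index = concordance_index(times, events, risk_scores)
print(c_index)  # 0.75
```

Here three of the four comparable pairs are ordered correctly; the third subject had an event at 6 years but was scored as lower risk than the fourth, who was still event-free at 8 years. A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking.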

Characteristics of the Included Studies
The largest share of studies was published in 2023 (n = 16/33; 48.5%) and originated from the United States (n = 13/32; 40.6%; one study did not report the country). Approximately 39.4% (n = 13/33) of the studies used a sample size of 10,000 or more. Two studies focused exclusively on one gender (one on men and the other on women), whilst the majority of studies analysing both genders had a higher percentage of women (50% and above) (Table S4).

Follow-up Time and Incidence of Cardiovascular Diseases
The mean or median follow-up time ranged from 4.4 years to 25.03 years in studies of community-dwelling people, and from 4.3 months to 8.05 years in studies of institutionalised people (Table S4). Various CVD outcomes, with their definitions and corresponding ICD codes detailed in Table S4, were identified (Fig. 4). The incidence of CVD outcomes ranged from 2.1% (CVD-related mortality) to 43.7% (MACE) (Table S4).
Table note: Total number of models may differ from total number of included studies, because some studies reported for men and women separately or fitted a model based on race, and studies may not report the C-index or area under the curve quantitatively. g: Area under the curve if the study did not report C-index.
Fig. 5 Identified ML and DL models. Since a single study could utilise multiple ML and/or DL models, the total number of studies presented here exceeds 33 (the total number of studies included)
Conventional Neural Network, however, without comparing with any other ML or DL models (Fig. 6 and Table S5).

Comparison of ML and DL Models
Among the eight studies [42,44,46,48,49,51,52,54] that compared ML and DL models together, DL models were better in predicting CVD risk in seven studies [42,44,46,48,49,51,54]. Four studies [46,48,49,54] compared the DeepSurv model with other ML and DL models and all four found this model to be the best for predicting CVD outcomes (Table 4 and Table S5).
candidate variable. Of these 20 studies, all except for two [53,57] incorporated at least one SDoH as a final predictor to train the model. However, only two studies [51,58] employed a wide range of SDoH variables from the Healthy People 2030 framework. The most frequently considered SDoH variables were race/ethnicity, level of education, and income (Fig. 7 and Table S5).

Model Validation
All studies internally validated their prediction model using either train-test splitting or resampling methods such as k-fold cross-validation and bootstrapping. However, only six studies [40,46,56,57,59,65] externally validated their prediction model. The commonly employed models were RSF and DeepSurv (Table S5).
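The two resampling schemes named above differ in how they set aside data for internal validation. The following sketch, which uses only the Python standard library, generates the index splits for each; the function names and the sample size of ten are illustrative assumptions, and the included studies would normally have relied on library implementations rather than code like this.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each subject is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def bootstrap_indices(n, seed=0):
    """Sample n indices with replacement; unsampled ones form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return train, out_of_bag


splits = list(k_fold_indices(10, 5))
boot_train, boot_test = bootstrap_indices(10)
print(len(splits), len(boot_train), len(boot_test))
```

In k-fold cross-validation every subject appears in exactly one test fold, whereas a bootstrap sample repeats some subjects and evaluates on those left out. Both remain internal validation because all subjects come from the development cohort; external validation requires an independent dataset.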

Number of Predictors and Feature Selection Methods
The number of candidate predictors ranged from 7 to 950. The most common category was more than 50 candidate predictors (n = 13/33; 39.4%). The number of final predictors used ranged from 3 to 613, with the largest group of studies (n = 13/33; 39.4%) incorporating 21 to 50 variables (Table S5). In 22 studies, variable selection was not performed. Of those studies that performed variable selection prior to training, two used RSF, two used LASSO-Cox, one used stepwise forward selection, and one used Elastic Net Cox. Additionally, five studies used more than one variable selection method (Table S5).
Models Accounted for SDoH Among the 18 studies that included at least one SDoH (detailed above) as their final predictor (Table S5), nine studies utilised RSF, because RSF was the best performing model in seven studies [33,38,52,55,60,62,65] or used as the only model [39,50].Two studies employed Elastic Net Cox [63] and LASSO-Cox [61] since they were the best performing models as compared to other models.Two studies [47,58] utilised survival GBM without comparing the model with other ML or DL models.Five models utilised DL models; three studies used Deep-Surv [34,40,46], two studies NMTLR [42], and one study DeepHit [51].All these studies compared their DL model Fig. 7 Number of studies that incorporated social determinants of health variables for predicting cardiovascular disease the models that were utilised most frequently were Deep-Surv, NMTLR, and DeepHit.These three DL models had better performance, compared with other DL or ML models.Permutation based feature importance and SHAP values were the predominant XAI methods utilised for explaining the models.While a variety of variables were incorporated to predict CVD, there was a noticeable lack of consideration for a wide range of SDoH variables.Additionally, prediction modeling with gender stratification was rarely explored.

Main Findings
To enhance clarity, our principal findings are depicted in Fig. 8.A variety of ML and DL models for survival prediction in CVD were identified.The popular ML methods were RSF, survival GBM, and Penalised Cox models.These three ML models also performed best at predicting time to CVD occurrence, when compared to numerous other ML models considered in the included studies.Regarding DL models, The next commonly applied ML models were survival GBM and Penalised Cox models.This may be at least partly explained by survival GBM considers non-linear interactions, has a high reported prediction accuracy, has greater ease of interpretability, and has automatic variable selection [71].The availability of packages in R and Python (effectively open-source) makes the model easily trainable and accessible.For instance, survival GBM can be efficiently trained using the newly developed Python package, scikitsurvival [69].Both the RSF model and survival GBM are ensemble models that combine the decisions from several baseline models to improve the overall performance and robustness [72].The Penalised Cox models are commonly used because they are important for penalising and provide a parsimonious model [73].They are easier to apply (having a few, maximum two, parameters to tune).Finally, a few studies also utilised survival SVM, Linear Multi-Task Logistic Regression (LMTLR), and Extra survival trees in predicting CVD.

DL Models for Survival Outcome in CVD Prediction
Different DL models were also utilised to predict CVD.The DeepSurv model was mostly utilised.The other models were the NMTLR and DeepHit.These models are also

AI Models for Survival Prediction and Year of Publication
Our systematic review revealed ML and DL models for survival prediction are increasingly gaining attention, while nearly all studies were published in 2019 or afterwards, half were published in 2023.This is not surprising since the packages in R and Python (e.g., mlr3proba package in R, and pycox, scikit-survival, PySurvival Python packages) for survival prediction using AI models became available in 2019 or later [66][67][68][69].

ML Models for Survival Outcome in CVD Prediction
The most utilised ML model was RSF.Our finding corroborate a scoping review on applications of ML in predicting survival outcomes, which identified RSF as the most frequently utilised model [13].RSF has become a well-developed and user-friendly model, since its introduction by Ishwaran et al. in 2008 [12].RSF is effective at handling complex interactions, has built-in variable importance measures, and is robust to overfitting [70].The other plausible explanation is due to the availability of numerous opensource packages in standard software such as R and Python for appropriate training of RSF [66][67][68][69].together, DL models outperformed in seven, whereas ML models excelled in only one study.This is because DL models can improve prediction by (1) enhancing discrimination and calibration, (2) leveraging large datasets effectively, and (3) autonomously learning complex representations for better risk stratification [80].

XAI Techniques Utilised
All studies included black-box ML models (except Penalised Cox models) and DL models.Black-box models are not explainable unless XAIs are utilised, which means that humans cannot understand how predictions are made [81].Despite the included studies considering the black-box models, not all studies interpreted their models.Studies that interpreted their models mostly used permutation-based feature importance followed by SHAP value.Using XAIs, studies identified key factors driving predictions and provide transparency in model decision-making.However, feature importance alone cannot ensure a responsible and effective translation of the model into clinical practice.

SDoH Variables Accounted for in ML and DL Models for Survival Outcome in CVD Prediction
All studies evaluated the standard modifiable cardiovascular risk factors.Biomarkers, imaging features, and variables related to sleep and diet were also considered.However, despite recent studies revealing the major role of SDoH in CVD [25,82], only a handful of prediction models incorporated a wide range of SDoH variables.Our findings expand on another systematic review aimed at identifying SDoH in ML based CVD prediction models, which also reported that included models did not give much emphasis to SDoH [83].In this systematic review, most studies considered certain SDoH variables, such as race, education level, and income.However, the use of specific SDoH variables such as race in deploying ML models is controversial [84,85].For example, there is a notion that race is a biological construct, rather than a social one, and the race-aware ML model deployment could perpetuate existing biases and discrimination [86,87].While we agree that poorly implemented race-conscious models might perpetuate existing biases, including race in ML models' deployment is helpful for accurate predictions and addressing racial disparities in health outcomes [87,88].Additionally, by incorporating race, models can help tailor interventions and allocate resources more effectively to communities in need.Therefore, rather than simply omitting race in the deployment of ML models, it is essential to implement race-aware models with nuanced considerations tailored to the specific context, purpose, and application of the model [88].commonly applied in oncologic studies [15,[74][75][76].Most of these models can be well-trained using the two important Python packages, PySurvival and Pycox, which have been popular since 2019 [66][67][68].Denoising autoencoder survival network, Recurrent Neural Network Long Short-Term Memory, and Deep Survival Conventional Neural Network [66,68] were also utilised by some of the studies.

Best Performing ML and DL Models for Survival Outcome in CVD Prediction
Compared to the standard Cox proportional hazards model, both ML and DL models have demonstrated superior performance, in terms of discriminative ability and calibration.Another review has also shown that ML models outperform conventional methods in predicting health outcomes [13].This may be due to the limited capability of the standard Cox model to handle high-dimensional datasets and its reliance on a linear relationship assumption, which are often not met.
The most frequently selected ML models (based on their prediction performance) were ensemble methods (RSF and survival GBM).RSF and survival GBM are ensemble models that are known to have superior prediction performance because they are drawn from several baseline learners [72].However, this finding might also be a result of RSF and survival GBM models being considered in many of the included studies.In three studies, Penalised Cox-models were also selected as the best preforming.Penalised Cox models reduce overfitting, handle multicollinearity (particularly the Elastic Net Cox), enhance interpretability, and automate variable selection by shrinking less important predictors' coefficients to zero [77].
As for DL models, in almost all studies, DeepSurv was selected as the best performing model.Our finding corroborates multiple individual studies on the survival prediction of cancer patients, demonstrating that DeepSurv surpasses alternative methods in predictive accuracy [14,15,76].DeepSurv computes complex and non-linear features without a priori selection or domain expertise and is helpful for personalised risk prediction, even better than other linear and non-linear survival methods [78].Notably, DeepSurv was also popular and, therefore, available for comparison.

Comparison of ML and DL Models for Survival Outcome in CVD Prediction
Consistent with studies that have examined both ML and DL models in the context of predicting the survival of cancer patients [76,79], our study found that DL models surpass ML models in predicting time to CVD occurrence.That is, among the eight studies that compared ML and DL models is also imperative to consider the nature of the dataset.For example, when considering longitudinal data with available follow-up time classification-based ML methods should not be used.Right-censoring should be accounted for, since excluding those who lost to follow-up, may result in a biased estimate.ML models for right-censored data have been utilised since 2008 and have recently flourished.Since 2018/19 numerous new models (particularly DL models) for right-censored data with their respective open-source coding packages have become available [66][67][68][69].While it is encouraging that survival ML and DL models are gaining more focus and the development of cutting-edge models is accelerating, their interpretability still poses a challenge.There are open-source XAI methods such as SurvSHAP and SurvLIME for interpreting ML and DL models for right-censored data [27,28].However, it is noted that models trained using the PySurvival package, for instance, are not yet supported.Therefore, it is crucial to also focus on their XAI, whether it is model-agnostic or model-specific.In this systematic review, the quality assessment tool PRO-BAST, typically used for standard prediction models, was employed.However, its application to AI-based prediction models was not direct, leading to the omission or alteration of some signaling questions to evaluate the studies' quality.Notably, PROBAST + AI tools are currently in development [98,99], but at this stage, they remain as protocols and should be made available to researchers and decisionmakers soon.
Additionally, a standardised measuring tool for most SDoH variables is lacking.SDoH are complex and specific to context and setting, necessitating tailored approaches.Taking these factors into account when measuring SDoH could aid in the creation of effective, context-specific strategies that precisely reflect the impact of SDoH on health outcomes.Inadequately designed SDoH (e.g., race)-sensitive models have the potential to exacerbate existing biases and discrimination within healthcare systems [86,87].Consequently, it is imperative to apply nuanced considerations that are specific to the context, purpose, and application of the predictive model.In this systematic review, despite having not differentiated between gender and sex, we found that a common limitation in CVD risk prediction studies is the rarity of gender-specific analysis.Future prediction studies should focus on gender-stratification while incorporating a range of SDoH in the AI prediction models for enhanced prediction and wise decision making.

Strengths and Limitations
The strengths of this systematic review are its novelty in concisely summarising the ML and DL models utilised for time to CVD outcomes, the applied interpretation

Gender Stratification in CVD Prediction
In 80% of the studies, gender-stratified prediction was overlooked despite gender playing a role in CVD presentation, diagnosis, and survival [89,90].Moreover, the role of gender is a critical determinant of CVD as it shapes one's norms, roles, social relations, and behaviors [91].Due to the challenges in distinguishing gender and sex from the studies, we used the general term "gender".Additionally, it is important to acknowledge the following when considering gender versus sex in the deployment of ML models [92,93]: (1) Viewing gender strictly as a binary biological construct fails to account for the intricate social factors that shape gender identity and expression, (2) Inferring gender solely based on biological sex characteristics can lead to discrimination against transgender and non-binary individuals.Generally, gender-stratified prediction models are beneficial for pinpointing gender-specific predictive factors for tailored and potentially more effective interventions [94].However, we recommend that gender-stratified prediction models be undertaken after meticulous attention to the representativeness of data, potential biases, and the fundamental factors driving gender disparities in health outcomes.

Model Validation
Almost all studies internally validated their models.However, a few studies did external validation.Another review also highlighted that most studies did not perform external validation of their ML models [13].Although external validation is commonly viewed as a critical step in transitioning clinical prediction models from development to implementation, it should not be seen as an automatic green light for model deployment.Moreover, there is no single recommended validation design, external validation is not always essential, and at times, multiple external validations may be required.Generally, the necessity and scope of external validations are contingent upon the intended application of the model and the justification for conducting an external validation study [95].

Implications for Clinical Practice and Recommendations
AI-based risk prediction models have an increased discrimination ability and accuracy as compared to the conventional multivariable models [96].However, there are misconceptions that ML requires large amounts of data [97].Despite ML models often benefiting from large datasets, they can still be effectively applied to smaller health-related datasets as long as the right balance between data quantity and quality is ensured and interpretability is prioritised [97].It Funding Open Access funding enabled and organized by CAUL and its Member Institutions.ABT and HLH are supported by Monash International Tuition Scholarship and Monash Graduate Scholarship.Funders played no role in the design of the study, data collection, analysis and interpretation of data, the decision to publish, and in the writing of the manuscript.Open Access funding enabled and organized by CAUL and its Member Institutions techniques, and the assessment of whether SDoH variables or gender-stratification were accounted for.However, despite this systematic review had compared ML and DL in the context of CVD and found that DL is more effective for predicting incident CVD, due to the heterogeneity of studies (e.g., in terms of population, type and number of variables incorporated), we did not do direct comparison through meta-analysis.Additionally, we note that that the most commonly used models also had the best performance.Therefore, our findings may be biased due to the availability of these models for comparison.

Conclusion
This review identified and compared the different ML and DL models for survival outcomes in CVD prediction. RSF, survival GBM, and Penalised Cox models were the most frequently used and best-performing ML methods. Among DL models, DeepSurv was the most frequently used and best-performing model. Compared to ML models, DL models had better prediction performance. In general, RSF and DeepSurv were the most popular and best-performing models, regardless of the types of variables included (e.g., SDoH) or the population (e.g., community-based, institutionalised). Permutation-based feature importance and SHAP values were the most commonly utilised XAI methods for interpreting the AI models. Despite the evidence for SDoH as predictors of CVD and for gender-disaggregated findings, only a few of the included studies considered them. To improve CVD risk prediction and inform clinicians' decision-making, future studies need to assess SDoH in addition to the traditional factors and other emerging risk factors. While men and women share many traditional risk factors for CVD, additional gender-specific risk factors and mechanisms are at play. Therefore, it is crucial to consider gender differences when predicting and managing CVD risk. Moreover, more methodological work is still required to make deep survival learning models easier to interpret, particularly as they have no built-in feature importance methods.
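Permutation-based feature importance is model-agnostic, which is why it can be applied to survival models that lack a built-in importance measure. The sketch below illustrates the idea under stated assumptions: a feature's importance is taken as the mean drop in Harrell's C-index after its values are shuffled. The `predict` function is a hypothetical stand-in for a fitted model (such as RSF or DeepSurv), and the cohort, feature names, and helper names are invented for illustration.

```python
import random


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (tied times are skipped)."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def permutation_importance(predict, rows, times, events, n_repeats=20, seed=0):
    """Mean drop in C-index when one feature's values are shuffled."""
    rng = random.Random(seed)
    baseline = concordance_index(times, events, [predict(r) for r in rows])
    importance = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)  # break the link between feature and outcome
            permuted = [dict(r, **{name: v}) for r, v in zip(rows, shuffled)]
            score = concordance_index(times, events, [predict(r) for r in permuted])
            drops.append(baseline - score)
        importance[name] = sum(drops) / n_repeats
    return importance


# Hypothetical cohort: older subjects have earlier events; "noise" is irrelevant.
rows = [{"age": a, "noise": n} for a, n in
        zip([80, 75, 70, 65, 60, 55], [3, 1, 4, 1, 5, 9])]
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 0]  # the last subject is right-censored


def predict(row):
    # Stand-in for a fitted survival model: risk depends on age only.
    return row["age"]


importance = permutation_importance(predict, rows, times, events)
print(importance)
```

Because the stand-in model ignores `noise`, shuffling it leaves the C-index unchanged, whereas shuffling `age` lowers it. The same procedure applies to any model that returns a risk score, although importance estimates can be misleading when features are strongly correlated.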
Acknowledgements For her crucial assistance in the development of our search strategy, we extend our sincere gratitude to Lorena Romero, the senior librarian of the Ian Potter Library at the Alfred Hospital in Melbourne, Victoria, Australia. We acknowledge the Wurundjeri People, who are the Traditional Custodians of the lands on which the first and senior authors predominantly work and live, and we pay our respects to their Elders, past and present.

Fig. 3 PRISMA flow diagram showing the study selection process

Fig. 4 Number of studies based on predicted cardiovascular disease outcomes. ASCVD: Atherosclerotic cardiovascular disease; CHD: Coronary heart disease; CVD: Cardiovascular disease; HF: Heart failure; MACE: Major adverse cardiovascular events. Note: Since one study can incorporate more than one outcome, the sum reported here exceeds the total number of included studies

Fig. 6 Number of studies that evaluated the prediction model and used it for their final prediction

Fig. 8 Summary of the principal findings

[...] Strategy, Literature Search and Screening, Data Extraction, Data Synthesis and Interpretation, Writing - Original Draft, Finalising the Manuscript. H.L.H: Literature Search and Screening, Data Extraction, Data Synthesis and Interpretation, Editing, Finalising the Manuscript. M.V: Conceptualisation, Data Synthesis and Interpretation, Editing, Finalising the Manuscript, Supervision. A.J.O: Data Synthesis and Interpretation, Editing, Finalising the Manuscript, Supervision. R.F.P: Conceptualisation, Data Synthesis and Interpretation, Editing, Finalising the Manuscript, Supervision.

Table 1
Key items for framing the aim, search strategy, and study inclusion and exclusion criteria

Table 2
Summary of keywords/search terms per each concept

Table 3
Descriptive statistics of predictive performance (C-index/area under the curve) by ML and DL algorithms
Columns: Machine learning and deep learning models (number of studies f)
a Includes DeepHit, Neural Multi-Task Logistic Regression, Recurrent Neural Network Long Short-Term Memory, and deep survival conventional neural network
b Includes LASSO and Elastic Net Cox models
c Includes Random Survival Forest, survival Gradient Boosting Models, and Penalised Cox models
d Includes DeepSurv, Neural Multi-Task Logistic Regression, Recurrent Neural Network Long Short-Term Memory, and Deep Survival Conventional Neural Network

Table 4
Selected models, based on their performance, among studies that compared machine learning and deep learning models together
Columns: Author and year; Selected best model; Deep learning models; Machine learning models
Feng 2022 [42]; Neural Multi-Task Logistic Regression; Neural Multi-Task Logistic Regression; Random survival forest and Linear Multi-Task Logistic Regression

Table 5
Model interpretation methods
Columns: Model interpretation technique utilised; Number of studies
Feature importance (e.g., permutation (majority), Mean Decrease Gini, mean of the minimal depth of the maximal

Table 6
Risk of bias and applicability assessment