Enhancing Preoperative Outcome Prediction: A Comparative Retrospective Case–Control Study on Machine Learning versus the International Esodata Study Group Risk Model for Predicting 90-Day Mortality in Oncologic Esophagectomy

Simple Summary Preoperative risk prediction prior to oncologic esophagectomy is crucial for assisting surgeons in accurate patient selection and patients in their informed decision making. A new risk stratification tool, the IESG prediction model, was recently introduced, categorizing patients into different risk levelsMachine learning is a subfield of artificial intelligence and may allow for a more accurate identification of patients at risk. Therefore, we evaluated the IESG risk model and compared its performance with ML models. We found that the IESG risk model provided an overall adequate risk estimation. However, ML showed better results in the accurate risk stratification of patients, demonstrating its potential as a novel and powerful approach for future patient assessment. Abstract Risk prediction prior to oncologic esophagectomy is crucial for assisting surgeons and patients in their joint informed decision making. Recently, a new risk prediction model for 90-day mortality after esophagectomy using the International Esodata Study Group (IESG) database was proposed, allowing for the preoperative assignment of patients into different risk categories. However, given the non-linear dependencies between patient- and tumor-related risk factors contributing to cumulative surgical risk, machine learning (ML) may evolve as a novel and more integrated approach for mortality prediction. We evaluated the IESG risk model and compared its performance to ML models. Multiple classifiers were trained and validated on 552 patients from two independent centers undergoing oncologic esophagectomies. The discrimination performance of each model was assessed utilizing the area under the receiver operating characteristics curve (AUROC), the area under the precision–recall curve (AUPRC), and the Matthews correlation coefficient (MCC). The 90-day mortality rate was 5.8%. We found that IESG categorization allowed for adequate group-based risk prediction. However, ML models provided better discrimination performance, reaching superior AUROCs (0.64 [0.63–0.65] vs. 0.44 [0.32–0.56]), AUPRCs (0.25 [0.24–0.27] vs. 0.11 [0.05–0.21]), and MCCs (0.27 ([0.25–0.28] vs. 0.15 [0.03–0.27]). Conclusively, ML shows promising potential to identify patients at risk prior to surgery, surpassing conventional statistics. Still, larger datasets are needed to achieve higher discrimination performances for large-scale clinical implementation in the future.


Introduction
Esophagectomy is the only potentially curative treatment for patients with advanced stages of esophageal cancer, a condition ranking as the sixth leading cause of cancer-related death [1][2][3].Notwithstanding notable advancements in surgical techniques, the risk of major postoperative complications and perioperative mortality remains substantial, reaching rates as high as 38% and 9%, respectively [4][5][6][7][8].Beyond short-term perioperative consequences, these complications significantly worsen the overall survival and quality of life of affected patients [9][10][11][12][13][14][15].Accordingly, accurate preoperative risk stratification is critical to achieve optimal patient selection and support decisions grounded in informed consent [14][15][16][17].Previous research has identified several risk factors for postoperative complications and mortality after esophagectomy, including age, comorbidities, and the type and stage of cancer [18][19][20][21][22].These factors have been used to develop comprehensive risk scores and have subsequently been embedded in risk calculators to aid surgeons and patients in their joint decision-making process [23][24][25][26].Recently, a new risk model predicting 90-day mortality after oncologic esophagectomy was introduced.Using logistic regression analysis on the multicenter International Esodata Study Group (IESG) database, the IESG risk model intends to provide an easily accessible risk stratification system, assigning patients into five distinct risk categories from very low risk to very high risk [4].However, external validation of this newly introduced risk stratification model is currently lacking.
Moreover, patient-related and disease-associated factors that contribute to overall surgical risk may be interconnected in a non-linear, higher-order complexity.Machine learning (ML) algorithms, a subdomain of artificial intelligence (AI), are a powerful means of processing complex, non-linear system dynamics, thus allowing for the identification of high-dimensional patterns that may exceed human capabilities and conventional statistical approaches in medical research [27][28][29].In support of this argument, ML prediction models have already shown promising results in forecasting long-term oncologic outcomes following upper gastrointestinal surgery [30,31].In contrast, their potential to accurately identify patients at risk prior to oncologic esophagectomy remains to be evaluated-particularly in comparison to established preoperative risk models.
The objective of this study was to validate the IESG risk model concerning 90-day mortality and to assess its performance in comparison with new approaches from the field of ML.The results may assist patients and surgeons in their preoperative risk evaluation and provide a novel approach for optimized patient selection prior to oncologic esophageal resections.

Setting and Study Population
This retrospective case-control study enrolled patients after oncologic esophagectomy between January 2009 and December 2021 from two tertiary centers affiliated with Charité University Medicine Berlin.Center 1 served as ML training cohort and center 2 as independent ML validation cohort.Only patients who underwent Ivor Lewis (IL) esophagectomy were included, while other forms of resection were not considered.The primary study endpoint was defined as 90-day mortality, with 30-day mortality as the secondary outcome.This study was conducted in accordance with the Declaration of Helsinki, and the work is reported in line with the STROCSS criteria [32].Data protection consultation was performed by the institutional data security department to safeguard

IESG Risk Model Validation
For validation of the IESG risk model, patients were stratified into five risk groups (very low to very high risk) using the parameter-weighted scoring system, as described by D'Journo et al. [4].To assess model accuracy, estimated and observed mortality were considered, and a 95% Wilson confidence interval (CI) was calculated [34].For discrimination performance, the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the Matthews correlation coefficient (MCC) were employed.The results were calculated for the full dataset as well as for the validation cohort exclusively to allow for an independent comparison between IESG model and ML performances.

ML Training and Validation
Preoperative data for ML analysis comprised 48 parameters encompassing categorical, bivariate, and numerical factors (Tables S1-S3), including preoperative standard laboratory parameters.The tumor stage was determined using either endoscopic ultrasound or computer tomography imaging techniques according to the Union Internationale Contre le Cancer (UICC, 8th edition) TNM ("tumor", "nodes", "metastases") classification of malignant tumors [35].Comorbidities and the revised Charlson Comorbidity Index (CCI) were derived retrospectively, as described by Quan et al. [36].
The training and validation cohorts were subjected to statistical analyses using Student's t-test and the chi-square test, including Yates correction, using SciPy 1.0 [37].Data preprocessing for ML model development encompassed the deletion of features with >50% missing values, high collinearity (>0.85), and single unique values.Missing values in the remaining features were imputed, the numerical features normalized, and the categorical parameters were one-hot-encoded using the Scikit-learn framework [38].
ML development and evaluation were then performed using two consecutive approaches.In the first stage, the training cohort was divided into an internal training group and a test group in a ratio of 80:20 for the analysis.In the second stage, the entire training cohort was used for model development, while the independent validation cohort served to test the model for its robustness.Model selection and hyperparameter optimization were performed in both stages, using 5-and 8-fold stratified cross-validation on the corresponding training groups.Three different feature selection methods were assessed as part of the grid search: the selection of all features, the top 25 features based on ANOVA f-values, or the features whose coefficients exceeded the median within a support vector machine.The following classifiers were trained: decision tree (DT), logistic regression (LR), linear support vector machine (l-SVM), support vector machine (SVM), gradient boosting machine (GBM), random forest (RF), and a multilayer perceptron as a neural network (NN).A schematic of the model development is provided in Figure S1.
The optimal combination of hyperparameters was subsequently used to re-train the models on 100% of the patients in the respective training group, leveraging the maximum available data to construct the model for each approach.To reduce the variance introduced by randomly initialized model weights and the partitions of individual samples, the hyperparameter optimization process was subjected to an additional 100 repetitions.Finally, each optimized model was tested on the designated validation group; hence, these were the 20% subgroup within the internal training cohort and the independent external validation cohort, respectively.Model evaluation was performed through calculation of the AUROC, AUPRC, and MCC, which were computed and visualized using the Scikit-learn framework [38].Finally, the performance results between the internal and external models were tested for significance using the Mann-Whitney-U test.

Feature Importance Analysis
To investigate the impact of each individual parameter on the constitution and performance of the trained models, feature importance calculations using Shapley additive explanations (ShAP) analysis were conducted for all classifiers averaging over 100 seeds.ShAP investigates alterations in model performance after the removal of single parameters individually, hence offering a detailed understanding of each feature's quantified contribution to the model's performance [39].

Comparison between IESG Model and ML Classifiers
To facilitate a comprehensive head-to-head comparison between the IESG and ML models, the discrimination performances of both methods were separately measured using the independent validation cohort with a subsequent comparison of the resulting AUROCs, AUPRCs, and MCCs.

Patient Baseline Characteristics
Detailed descriptions of the patient characteristics of both the training and validation cohorts are given in Table 1.A total of 552 patients met the inclusion criteria and were selected for analysis.The overall completeness of the datasets was 95.2%.The training cohort comprised 409 patients, mostly male (n = 329, 80.4%), with a mean age of 63.8 ± 10.1 years.Adenocarcinoma (AC) was the most prevalent type of cancer, accounting for 65.3% (n = 267) of patients in the training cohort.A total of 85.6% (n = 350) of the patients had undergone neoadjuvant therapy, with 51.8% (n = 212) receiving chemotherapy and 33.3% (n = 136) undergoing radiochemotherapy.According to the revised CCI, the most common comorbidity was chronic pulmonary disease, affecting 24.0% (n = 98) of the patients, followed by diabetes without chronic complications (n = 59, 14.4%).The majority of the subjects (n = 255, 62.3%) had an Eastern Cooperative Oncology Group (ECOG) stage 0 performance status.T3N+ tumors accounted for the largest proportion within the training cohort in the preoperative tumor staging.The 90-day and 30-day mortality rates were 4.6% (n = 19) and 2.4% (n = 10), respectively.The leading complications associated with 90-day mortality were sepsis, with a predominance of respiratory tract infections (n = 6; 25%) and hemorrhagic shock (n = 5; 18.6%).Of the patients who died within 90 days after surgery, 26.3% (n = 7) had already been primarily discharged from the hospital.
The validation cohort included 143 patients, with the majority of patients being male (n = 110, 76.9%, p = 0.44), with a mean age of 65.8 ± 9.9 years.The prevalence of CCI comorbidities was in line with the training cohort, with chronic pulmonary disease (n = 27, 18.9%, p = 0.26) and diabetes without chronic complications (n = 25, 17.5%, p = 0.46) as the most common preexisting conditions.The revised CCI showed no significant difference between the cohorts (p = 0.46).In accordance with the training cohort, most patients had an ECOG 0 performance status (n = 99, 69.2%, p = 0.22) and T3N+-stage tumors in the preoperative tumor staging.AC was consistently the most common histological type of cancer (n = 93, 65.0%, p = 0.07), and the majority of patients underwent neoadjuvant chemotherapy (n = 61, 42.7%, p = 0.3).The most frequent complications causing 90-day mortality were sepsis (n = 6, 46.2%) and myocardial infarction (n = 2, 15.4%).A complete overview of all patient characteristics and parameters is given in Tables S1-S3.

IESG Risk Model Evaluation
Stratification of all 552 patients into five different risk groups, as described by D'Journo et al., resulted in the following distribution: 2.4% (n = 13) of patients were categorized as very high-risk, 8.0% (n = 42) as high-risk, 19.7% (n = 109) as medium-risk, 22.6% (n = 125) as low-risk, and 47.6% (n = 263) as very low-risk patients.The expected and observed mortality rates for each group are shown in Figure 1.In the very high-risk group, the observed mortality was 23.1%, thus exceeding the reported IESG model prediction (18.2%) but meeting the 95% Wilson CI (0.08-0.50).The observed and predicted mortality within the high-risk category showed a slight overestimation (13.6% vs. 8.9%), however, again meeting the 95% CI (0.07-0.28).Among the patients classified in the medium-risk category, the observed mortality closely met the anticipated rates predicted by the IESG model (5.5% vs. 5.8%; CI: 0.03-0.11).The low-risk group showed one deceased patient, thus differing from the expected mortality rate (1.0% vs. 3.0%); however, the group was well within the corresponding CI (0.0-0.04).The majority of patients were assigned to the very low-risk group, in which the observed mortality rate of 6.1% surpassed the predicted 1.8%.Nevertheless, these outcomes again remained within the Wilson CI (0.04-0.11).A detailed categorization of the patients, according to D'Journo et al., is shown in Table S4.
comorbidities was in line with the training cohort, with chronic pulmonary disease (n = 27, 18.9%, p = 0.26) and diabetes without chronic complications (n = 25, 17.5%, p = 0.46) as the most common preexisting conditions.The revised CCI showed no significant difference between the cohorts (p = 0.46).In accordance with the training cohort, most patients had an ECOG 0 performance status (n = 99, 69.2%, p = 0.22) and T3N+-stage tumors in the preoperative tumor staging.AC was consistently the most common histological type of cancer (n = 93, 65.0%, p = 0.07), and the majority of patients underwent neoadjuvant chemotherapy (n = 61, 42.7%, p = 0.3).The most frequent complications causing 90-day mortality were sepsis (n = 6, 46.2%) and myocardial infarction (n = 2, 15.4%).A complete overview of all patient characteristics and parameters is given in Tables S1-S3.

IESG Risk Model Evaluation
Stratification of all 552 patients into five different risk groups, as described by D'Journo et al., resulted in the following distribution: 2.4% (n = 13) of patients were categorized as very high-risk, 8.0% (n = 42) as high-risk, 19.7% (n = 109) as medium-risk, 22.6% (n = 125) as low-risk, and 47.6% (n = 263) as very low-risk patients.The expected and observed mortality rates for each group are shown in Figure 1.In the very high-risk group, the observed mortality was 23.1%, thus exceeding the reported IESG model prediction (18.2%) but meeting the 95% Wilson CI (0.08-0.50).The observed and predicted mortality within the high-risk category showed a slight overestimation (13.6% vs. 8.9%), however, again meeting the 95% CI (0.07-0.28).Among the patients classified in the medium-risk category, the observed mortality closely met the anticipated rates predicted by the IESG model (5.5% vs. 5.8%; CI: 0.03-0.11).The low-risk group showed one deceased patient, thus differing from the expected mortality rate (1.0% vs. 3.0%); however, the group was well within the corresponding CI (0.0-0.04).The majority of patients were assigned to the very low-risk group, in which the observed mortality rate of 6.1% surpassed the predicted 1.8%.Nevertheless, these outcomes again remained within the Wilson CI (0.04-0.11).A detailed categorization of the patients, according to D'Journo et al., is shown in Table S4.
The models utilizing the complete training cohort for development with subsequent testing on the independent external validation cohort achieved AUROCs ranging between 0.52 (0.51-0.53;DT) and 0.64 (0.63-0.65;RF) for 90-day mortality (Figure S2a).The corresponding outcomes for the AUPRCs and MCCs fell within the range of 0.10 (0.10-0.10; l-SVM) to 0.25 (0.24-0.27;NN; Figure S2b) and 0.01 (0.00-0.01; l-SCV) to 0.27 (0.25-0.28;RF), respectively.Overall, the RF classifier demonstrated the highest AUROC and MCC results, while, notably, the NN classifier showcased the highest AUPRC.Moreover, all classifiers significantly decreased in terms of the AUROC compared to the preceding internal analysis.
In contrast, the AUPRCs (Figure 2) demonstrated a notable performance increase for the DT (p < 0.01), the LR (p < 0.01), the SVM (p < 0.01), and, most notably, the NN (p < 0.01), as shown in Table S6.In alignment with the preceding internal analysis, all metrics revealed higher discrimination performances for 90-day than for 30-day mortality prediction.A detailed summary of all model outcomes after external validation is presented in Table 2.Additional model performance metrics, including the F1 score, are provided in Table S7.Finally, the AUROC, AUPRC, and MCC were computed for the IESG risk model to allow for a direct comparison to the performance of the external ML evaluation.The IESG model performance showed an AUROC of 0.44 (0.32-0.56), an AUPRC of 0.11 (0.05-0.21), and a MCC of 0.15 (0.03-0.27), featuring relatively broad confidence intervals (Table 2).As a result, four out of seven ML classifiers (LR, NN, RF, and SVM) exhibited superior discrimination performances across all metrics compared to the IESG risk model.

Feature Importance
A heatmap displaying the normalized ShAP factor weights is presented in Figure 3, with additional details provided in Figures S3-S9.The international normalized ratio (INR) emerged as the pivotal model parameter, consistently showcasing the highest weight in six out of seven classifiers (GBM, LR, NN, RF, l-SVM, and SVM).Subsequent to INR, the CCI demonstrated high model impact, ranking among the top three parameters in five classifiers (GBM, LR, NN, RF, and SVM) and within the first ten features in the two remaining models (DT and l-SVM).Furthermore, the red blood cell count (RBC) was consistently identified as being among the five most impactful parameters in four classifiers (l-SVM, LR, NN, and SVM).Considering patient characteristics, ASA status, weight, and age emerged as the most relevant factors.In terms of plasma-based laboratory parameters, C-reactive protein, gamma-glutamyl transferase, and bilirubin were identified as highly impactful.Finally, the AUROC, AUPRC, and MCC were computed for the IESG risk model to allow for a direct comparison to the performance of the external ML evaluation.The IESG model performance showed an AUROC of 0.44 (0.32-0.56), an AUPRC of 0.11 (0.05-0.21), and a MCC of 0.15 (0.03-0.27), featuring relatively broad confidence intervals (Table 2).As a result, four out of seven ML classifiers (LR, NN, RF, and SVM) exhibited superior discrimination performances across all metrics compared to the IESG risk model.

Feature Importance
A heatmap displaying the normalized ShAP factor weights is presented in Figure 3, with additional details provided in Figures S3-S9.The international normalized ratio (INR) emerged as the pivotal model parameter, consistently showcasing the highest weight in six out of seven classifiers (GBM, LR, NN, RF, l-SVM, and SVM).Subsequent to INR, the CCI demonstrated high model impact, ranking among the top three parameters in five classifiers (GBM, LR, NN, RF, and SVM) and within the first ten features in the two remaining models (DT and l-SVM).Furthermore, the red blood cell count (RBC) was consistently identified as being among the five most impactful parameters in four classifiers (l-SVM, LR, NN, and SVM).Considering patient characteristics, ASA status, weight, and age emerged as the most relevant factors.In terms of plasma-based laboratory parameters, C-reactive protein, gamma-glutamyl transferase, and bilirubin were identified as highly impactful.

Discussion
This study aimed to validate the IESG risk model as a new risk stratification tool and to compare its performance to ML algorithms for the preoperative prediction of 90-day mortality following oncologic esophagectomy.Adequate preoperative patient selection based on accurate risk stratification is critical for improving surgical outcomes after major surgical procedures and to aid patients in their informed consent decision making.The 90-day mortality provides a comprehensive endpoint, as it considers complications and

Discussion
This study aimed to validate the IESG risk model as a new risk stratification tool and to compare its performance to ML algorithms for the preoperative prediction of 90-day Cancers 2024, 16, 3000 9 of 13 mortality following oncologic esophagectomy.Adequate preoperative patient selection based on accurate risk stratification is critical for improving surgical outcomes after major surgical procedures and to aid patients in their informed consent decision making.The 90-day mortality provides a comprehensive endpoint, as it considers complications and adverse events beyond primary discharge, thus ensuring a more holistic assessment [40].
The IESG risk model assigns patients to one of five risk categories, providing a 90-day mortality risk estimation for each group.Applied to our independent cohort, the overall patient allocation aligned closely with the previously reported IESG distribution, as <3% of the patients were of very high risk, <10% were of high risk, and 20% were of medium risk.A notable shift was observed among the very low-risk category, which was twice as high as the IESG development cohort.Conversely, the group of low-risk patients was considerably smaller.Ultimately, the IESG stratification tended to underestimate the risk for patients within the very high-, high-, and very low-risk groups, which has to be considered when providing these patients with a mortality probability prior to surgery.Notably, however, all estimations met the underlying confidence intervals.The IESG risk group stratification can, therefore, provide an intelligible and intuitive grading system for patients as part of their preoperative decision making.In contrast, its discrimination performance was lower than previously reported, revealing an AUROC of 0.44, an AUPRC of 0.11, and an MCC of 0.15 when tested on our validation cohort.
Considering potential non-linear interdependencies among demographic patient characteristics, comorbidities, and disease-related factors in shaping cumulative surgical risk, multiple ML models were trained and validated.Importantly, we highlight that AUROC alone can be inconclusive when facing high class imbalances, which are often present in medical prediction tasks.AUPRC and MCC consider precision and recall as positive and negative predictive values, thus allowing for a more elaborate model evaluation [41].Despite being trained on a dataset comprising less than 10% of the IESG cohort, the majority of ML classifiers outperformed the IESG model in discrimination performance.Notably, this still held true when comparing both approaches on the same independent validation dataset, emphasizing the potential of ML as a novel and powerful approach for surgical risk prediction in esophageal surgery.
The optimal model achieved an AUROC of 0.75 when trained on the internal cohort only, thus representing a monocentric analysis consistent with the study design of most ML analyses in the surgical field to date.Predicting major complications defined as Clavien-Dindo >IIIa, Jung et al. found AUROCs ranging between 0.6 and 0.7 in esophagectomies [42].Likewise, Zhao et al. demonstrated AUROCs between 0.65 and 0.76 in the prediction of postoperative anastomotic leakage; however, they also included postoperative parameters in their models [43].These findings, however, were not confirmed in independent cohorts, and therefore they are prone to the risk of overfitting in single-center data.In our study, subsequent external testing demonstrated a considerable AUROC decrease to 0.64, which remains critical for clinical implementation.However, we highlight that increasing the sample size by only 20% using the complete training cohort already optimized the AUPRC in various classifiers significantly, thereby exemplifying both the overall robustness of the approach and the potential for considerable improvements with larger cohorts in the future.
"Explainable AI" describes the efforts to improve the transparency and comprehensibility of AI models to foster trust among future users.In fact, non-comprehensibility has been shown to be a major barrier for surgeons in applying risk assessment tools thus far in clinical practice [44].Providing professionals with underlying feature importance can, therefore, strengthen trust and acceptance.The INR and CCI consistently emerged as the most influential factors contributing to the final models.This finding is consistent with prior clinical studies examining risk factors for perioperative mortality in esophagectomy [2,45].Particularly noteworthy is the significant impact of the preoperative laboratory variables across all classifiers.Given their accessibility as part of routine preoperative preparation, these variables should be more extensively leveraged in predictive modeling efforts.
Overall, our results suggest that high-dimensional models can improve surgical risk prediction beyond conventional analyses and may aid in identifying individual patients at high risk.Our results are in line with familiar studies in the field of colorectal and gastric surgery, where ML models trained on larger register datasets show promising preoperative prediction results, even surpassing AUROCs of 0.80 [46,47].Performing large-scale analyses on national and international register data may, therefore, evolve as the next step in esophageal cancer research.Moreover, additional parameters may further increase accuracy in the future.
The primary constraint of this study lies in its sample size.This poses a major challenge for data-driven pattern recognition analyses and the identification of intricate interrelationships with high-dimensional data, particularly in the presence of strong class imbalances.Furthermore, the retrospective design may have introduced qualitative and quantitative data implications.However, automated data extraction using previously described means was applied to ensure homogeneity and to mitigate manual errors, ultimately achieving high dataset completeness.Moreover, in this study, IL esophagectomy was performed using open, laparoscopic, and robotic-assisted strategies, potentially introducing an approachrelated effect on the outcome.While the total number of chemotherapy cycles was taken into account, potential dose reductions within the individual cycles were not.Similarly, variations in radiation dose were not considered in the radiotherapy cases, which could have influenced the outcomes.Finally, the training and validation cohorts exhibited some disparities concerning patient characteristics, most notably with a strong but not significant trend regarding 90-day mortality as the primary endpoint, thus potentially impeding comparability among the groups.Nevertheless, the ML tools evaluated for clinical implication must be sufficiently robust to produce reasonable results across diverse patient collectives.Adjusting or merging the cohorts would have severely compromised the quality of the analysis and particularly increased the risk of overfitting, which would decrease adequate model assessment.

Conclusions
The IESG risk model provides an easily accessible assessment approach to categorize patients into ascending risk groups and to provide a general risk estimation per group.However, complex models incorporating multiple dimensions can further enhance discrimination accuracy to identify patients at risk.Future studies applying non-linear models on large-scale register data are needed to evaluate the full potential of supervised and unsupervised ML analysis techniques in this field.Applying metrics that consider class imbalances as well as external validation are of high importance to accurately evaluate new models in the field of AI for upper-gastrointestinal surgery.

Supplementary Materials:
The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/cancers16173000/s1, Figure S1: Schema of the model development;  S4: IESG score distribution of the entire study cohort; Table S5: AUROC, AUPRC, and MCC after internal validation; Table S6: Difference in metric performance of all classifiers between training and validation cohorts; Table S7: Complete model performance metrics after external validation.

Figure 2 .
Figure 2. A one-sided Mann-Whitney U test was used to evaluate the difference in AUPRC between the training and validation cohorts of every classifier; * p < 0.05, significant; ** p < 0.01, highly significant.

Figure 2 .
Figure 2. A one-sided Mann-Whitney U test was used to evaluate the difference in AUPRC between the training and validation cohorts of every classifier; * p < 0.05, significant; ** p < 0.01, highly significant.

14 Figure 3 .
Figure 3. Heatmap displaying the ten most important features from ShAP value analysis for every classifier.

Figure 3 .
Figure 3. Heatmap displaying the ten most important features from ShAP value analysis for every classifier.
Figure S2: ROC (a) and PRC (b) of all classifiers after external validation; Figure S3: Normalized top ten Shap values Decision Tree, descending; Figure S4: Normalized top ten Shap values Gradient Boosting Classifier, descending; Figure S5: Normalized top ten Shap values Linear SVC, descending; Figure S6: Top ten Shap values Logistic Regression, descending; Figure S7: Normalized top ten Shap values Neural Network Regression, descending; Figure S8: Normalized top ten Shap values Random Forrest Classifier, descending; Figure S9: Normalized top ten Shap values Support Vector Classifier, descending; Tables S1-S3: Complete patient characteristics; Table

Table 1 .
Patient characteristics of the training and validation cohorts.

Table 2 .
AUROCs, AUPRCs, and MCCs for multiple ML classifiers after external validation for 90and 30-day mortality, as well as AUROC, AUPRC, and MCC discrimination performance calculated for the IESG risk model (90-day mortality only).