A machine learning approach towards assessing consistency and reproducibility: an application to graft survival across three kidney transplantation eras

Background In South Africa, between 1966 and 2014, there were three kidney transplant eras defined by evolving access to immunosuppressive therapies: Pre-CYA (before the availability of cyclosporine), CYA (when cyclosporine became available), and New-Gen (when tacrolimus and mycophenolic acid became available). As such, factors influencing kidney graft failure may vary across these eras. Therefore, evaluating the consistency and reproducibility of models developed to study these variations using machine learning (ML) algorithms could enhance our understanding of post-transplant graft survival dynamics across these three eras. Methods This study explored the effectiveness of nine ML algorithms in predicting 10-year graft survival across the three eras. We developed and internally validated these algorithms using data spanning the specified eras. The predictive performance of these algorithms was assessed using the area under the receiver operating characteristic (ROC) curve (AUC), supported by other evaluation metrics. We employed local interpretable model-agnostic explanations (LIME) to provide detailed interpretations of individual model predictions and used permutation importance to assess global feature importance within each era. Results Overall, the proportion of graft failure decreased from 41.5% in the Pre-CYA era to 15.1% in the New-Gen era. Our best-performing model demonstrated high predictive accuracy across the three eras. Notably, the ensemble models, particularly the Extra Trees model, emerged as standout performers, consistently achieving high AUC scores of 0.95, 0.95, and 0.97 across the eras. This indicates that the models achieved high consistency and reproducibility in predicting graft survival outcomes.
Among the features evaluated, recipient age and donor age were the only features that consistently influenced graft failure throughout these eras, while features such as glomerular filtration rate and recipient ethnicity showed high importance only in specific eras, resulting in relatively poor historical transportability of the best model. Conclusions Our study emphasises the importance of analysing post-kidney-transplant outcomes and identifying era-specific factors associated with graft failure. The proposed framework can serve as a foundation for future research and assist physicians in identifying patients at risk of graft failure.


Introduction
Kidney transplantation is the standard of care for the management of kidney failure, significantly enhancing quality of life and increasing longevity compared to the alternative, chronic dialysis treatment (1)(2)(3)(4)(5). Nonetheless, a kidney transplant's long-term success relies on the survival of the transplanted organ, known as graft survival. Over the years, considerable progress has been made in maintenance immunosuppression regimens to improve graft survival post-kidney transplant (6)(7)(8). In Johannesburg, three eras of post-kidney-transplant maintenance immunosuppression therapy are described: (i) from 1966 to 1983, using combined azathioprine and cortisone; (ii) from 1983 to 2000, replacing azathioprine with cyclosporine; and (iii) starting in 2001, the introduction of sirolimus, everolimus, and mycophenolate mofetil (9)(10)(11). These advancements in immunosuppressive therapy have improved graft survival rates over the years (12,13). Identifying prognostic factors contributing to graft failure could inform the post-kidney-transplant management of recipients to improve long-term graft survival.
Globally, and in South Africa, preserving the long-term survival of the graft after kidney transplant is the ultimate goal, not only for the enhanced survival benefits and improved quality of life of the recipient but also because organ donor shortages persist. In the event of graft failure, maintenance dialysis must be re-initiated and re-transplantation considered, with adverse consequences for the patient and a disproportionate increase in the cost of care (when compared with the cost of maintenance immunosuppression therapy) (14)(15)(16). Studies have identified donor-related and recipient-related factors that impact kidney transplant outcomes. More specifically, previous research in South Africa has shown the impact of donor type, delayed graft function, recipient age, and self-reported ethnicity on graft survival based on univariate and multivariate survival models (10,13,17,18). These earlier studies have contributed significantly to the understanding of transplantation outcomes in African settings. However, they are primarily based on conventional statistical methods. While traditional statistical methods can offer insights into how prognostic factors influence survival, some approaches used in these previous studies may not provide a realistic representation of real-life situations when identifying factors influencing outcome (19). Furthermore, many of these studies are constrained by complete-case analysis or the exclusion of important variables due to missing information, and none of them explored graft survival across the three eras (10,13,18). Medical research has extensively employed machine learning (ML) to enhance predictive risk assessment, resulting in more accurate predictions (20)(21)(22)(23)(24)(25). This approach can assist physicians in risk assessment by identifying patients who might be at a higher risk of graft failure following kidney transplantation. In recent years, ML models have gained increasing attention in medical research
for developing diagnostic and predictive models for medical outcomes (21,(26)(27)(28)(29). These ML models have also been successfully used in kidney transplant studies and have demonstrated good performance in predicting graft survival at different survival times (30)(31)(32)(33). For example, Moghadam and Ahmadi (34) developed a clustering method using the Red Deer Algorithm (RDA) together with other ML classification algorithms and proposed a three-stage clustering-based undersampling approach to better handle class imbalance. Topuz et al. (35) designed a method that combines the Bayesian belief network algorithm, feature selection, and multiple ML techniques to predict kidney graft survival using data from over 31,000 U.S. patients, and suggested that this approach can be applied to other transplant datasets. Fabreti-Oliveira et al. (36) employed two gradient boosting algorithms to analyse data from 627 kidney transplant patients and identified serum creatinine levels at discharge, pre-transplant weight, and age as key factors affecting early graft loss, highlighting the potential of ML for informed decision-making in transplantation. Although ML has not been utilised to pinpoint significant prognostic factors in South African transplant units, there are other knowledge gaps concerning previously developed models. Most of these models have yet to be validated outside of the study cohort in which they were developed. It has been observed that many ML models are constrained by factors such as geographical location, study interval, historical period, and methodological approach (20,28,37). Hence, these models may not generalise well when transported to patients with characteristics dissimilar to those used to develop the model.
In this study, we developed and validated ML models to predict 10-year graft survival using clinical and socio-demographic characteristics of kidney transplant recipients and their donors in Johannesburg between 1966 and 2014, covering three eras of maintenance immunosuppression regimens. Our era-based models were designed to examine the risk factors associated with each transplant era and to gauge the ML algorithms' ability to discriminate between outcome classes, namely graft failure or survival. By focusing on era-specific models, we ensured that our findings are consistent (stable and reliable) and reproducible (valid and replicable) across different historical contexts, capturing variations in risk factors and outcomes. Additionally, we tested the transportability of our models by developing them using data from one era and validating them in another, highlighting the difficulty of applying ML models across different settings and emphasising the need for tailored approaches. This research is relevant because it is region-specific, addressing challenges such as limited access to organ transplant facilities and improving kidney transplant graft survival outcomes in resource-limited settings. Finally, the developed models will serve as the foundation for future model development and external validation within and beyond the study area.
The rest of this paper is organised as follows. The subsequent section presents the design of this study, followed by descriptions of the algorithms. We then showcase the study's results and close with a comprehensive discussion of the results and avenues for future research.

Materials and methods
In this section, we offer insights into the dataset, starting with our approach to data acquisition. We then describe the methods used for data preprocessing and for developing the ML models. Next, we present LIME, an explainable-ML method used to interpret and understand the predictions made by the predictive models. Finally, we apply permutation feature importance to evaluate and rank the contribution of each feature to the models' predictions.

Transplant overview
The dataset encompasses a total of 1,738 kidney transplant records, split into three distinct eras: pre-CYA (458 entries), CYA (916 entries), and New-GEN (364 entries), as summarised in Table 1. The median recipient age was approximately 38 years, with slight variation among the eras, as New-GEN recipients tended to be somewhat older than recipients in other eras. Overall, the median age of recipients who experienced graft failure was higher than that of those who did not. The overall median donor age was 27 years. For recipients with a failed graft, the median donor age was 28 years, while for those who did not experience graft failure it was 25 years. The Mann-Whitney U test showed that recipient and donor age differed significantly between recipients who experienced graft failure and those who did not. Approximately 84% of these patients received a kidney from deceased donors. Regarding self-reported ethnicity, most recipients were White, followed by Black and other ethnicities. Primary glomerular disease emerged as the predominant cause of end-stage kidney disease, especially in the pre-CYA era, while the New-GEN era was marked by a higher prevalence of hypertension as a cause. Treatment-wise, methylprednisolone as an induction therapy was predominant in the New-GEN era. A notable number of patients experienced delayed graft function, with the CYA era having the most observed cases. The Chi-square test showed significant associations, at the 5% significance level, between graft survival status and variables including donor type, delayed graft function, acute rejection, and chronic rejection.

Data pre-processing
In this study, graft survival is the time from transplant to failure of the graft, defined as the earliest time of return to dialysis. Death with a functioning graft and the few patients lost to follow-up were censored at the time of death or the date last seen. A patient's graft was classified as "graft failure" if the graft had been recorded as failed in the database, as per the definition above; otherwise, the graft status was classified as "survived".
Information relating to patients who underwent more than one transplant was excluded from this study; in other words, the scope of this study was restricted to first graft failure. We retrieved 1,207 pre-, peri-, and post-transplantation variables from the database. The pre-transplant measures include the cause of kidney failure (KF), donor type, donor and recipient sex and blood group, recipient age, recipient self-reported ethnicity, and donor age, as shown in Table 1. The peri-transplant features are those measured during the transplant, namely estimated glomerular filtration rate and induction therapy. The post-transplantation characteristics considered in this study are delayed graft function (DGF), surgical complications, biopsy-proven acute or chronic rejection, and rejection treatment. Repeated information measured months or years post-transplant was dropped from the analysis. We also dropped features with empty records and variables relating solely to data-capturing details. Only 39 features measured across the three transplant eras were extracted for pre-processing.
Descriptive statistics and data visualisation were used to assess data quality and to understand the patterns and relationships among the study features. Overall, approximately 1% of the case records were missing in the dataset, attributable to eight features, including systolic blood pressure at transplant, diabetes at transplant, and delayed graft function. To render the data more applicable to this study, we addressed the missingness using the missForest imputation algorithm, which uses a random forest approach to predict missing values in a database. missForest is an ML technique that has shown good performance in imputing mixed data types across different fields of study (38,39) and performs better than other imputation methods.
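As an illustration of this imputation idea (missForest itself is an R package, and this is not the authors' code), a minimal Python sketch uses scikit-learn's IterativeImputer with a random forest estimator on hypothetical data containing artificial missingness:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # correlated column
mask = rng.random(X.shape) < 0.05                        # artificial missingness
X_missing = X.copy()
X_missing[mask] = np.nan

# Iteratively predict each feature's missing entries from the others
# using a random forest, in the spirit of missForest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0)
X_imputed = imputer.fit_transform(X_missing)
```

After fitting, `X_imputed` contains no missing values; the correlated fourth column is reconstructed from the others.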
Feature engineering was conducted by grouping low-frequency categories of each feature with their related category. For instance, the original donor type classes "deceased", "living related", and "living unrelated" were re-categorised as "deceased" and "living" donors. This addresses the problem of representativeness of each factor variable's categories and enables the model to learn sufficiently from each category, improving each feature's discriminative power and avoiding bias in prediction. Donor and recipient blood groups were matched to create a single variable, "donor-recipient blood group match". Similarly, donor and recipient sex were matched to generate the "donor-recipient sex match" variable.
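These recoding steps can be sketched with pandas on a small hypothetical extract (column names are illustrative, not the study's actual schema):

```python
import pandas as pd

df = pd.DataFrame({
    "donor_type": ["deceased", "living related", "living unrelated", "deceased"],
    "donor_blood_group": ["A", "O", "B", "AB"],
    "recipient_blood_group": ["A", "A", "B", "O"],
    "donor_sex": ["M", "F", "M", "F"],
    "recipient_sex": ["M", "M", "F", "F"],
})

# Collapse low-frequency donor-type categories into "living" vs "deceased".
df["donor_type"] = df["donor_type"].replace(
    {"living related": "living", "living unrelated": "living"})

# Derive the matched variables described in the text.
df["donor_recipient_blood_group_match"] = (
    df["donor_blood_group"] == df["recipient_blood_group"])
df["donor_recipient_sex_match"] = df["donor_sex"] == df["recipient_sex"]
```

The collapsed `donor_type` now has only two levels, and each matched variable is a single boolean feature.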

The kidney transplant analysis
The process of kidney transplant analysis is a multifaceted procedure with several steps, each playing a pivotal role in generating meaningful insights. It begins with data preprocessing, as presented in Figure 2. During this stage, the raw data undergoes various transformations. Techniques such as data cleaning, feature engineering, data sub-selection, imputation, and the Synthetic Minority Over-sampling Technique (SMOTE) are utilised. These methods collectively work towards refining the data, eliminating noise and irrelevant information, addressing missing values, and achieving a balanced dataset.
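The core of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of numpy (a simplified illustration, not the library implementation typically used in practice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between each base sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))              # random base sample
        nb = X_min[rng.choice(idx[j, 1:])]        # random neighbour of it
        gap = rng.random()                        # position along the segment
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 3))   # hypothetical minority-class samples
X_new = smote(X_min, n_new=40)     # oversample towards a balanced dataset
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.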
After preprocessing, the data is organised into three distinct categories: pre-CYA, CYA, and New-Gen. This division is crucial for the subsequent phases of feature selection and model construction. Three feature selection techniques were employed: One-Rule, Random Forest, and the Least Absolute Shrinkage and Selection Operator (LASSO). These techniques aid in pinpointing the features with the greatest predictive power for the target variable.
We utilised internal and external validation techniques during the study to ensure our models' robustness and generalisability. Internal validation, often termed "resampling" validation, refers to the process of evaluating the model's performance on a subset of the training data. This is typically achieved using techniques like k-fold cross-validation, where the data is partitioned into k = 10 subsets. The model is trained on k − 1 of these subsets and tested on the remaining one. This process is repeated k times, with each subset serving as the test set once. The primary advantage of internal validation is that it provides a more robust estimate of the model's performance, minimising the risk of overfitting by ensuring the model performs well across multiple, varied subsets of the training data. This stage encompasses various ML models: Logistic Regression, Extra Trees, AdaBoost, Gradient Boosting, Random Forest, Support Vector Machine, K-nearest neighbours, Neural Network, and Decision Tree.
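The 10-fold procedure can be sketched with scikit-learn on synthetic stand-in data (the class weights below only mimic an imbalanced outcome; they are not the study's actual rates):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the transplant features and 10-year outcome.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.7, 0.3], random_state=0)

# k = 10 stratified folds; each fold serves as the test set once,
# and AUC is computed on that held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(ExtraTreesClassifier(random_state=0), X, y,
                         cv=cv, scoring="roc_auc")
mean_auc = scores.mean()
```

Averaging the ten fold-wise AUCs gives the internal-validation estimate; stratification keeps the class ratio comparable across folds.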
After construction and internal validation, the models are assessed on an external validation dataset. External validation assesses the model's performance on an entirely separate dataset that it has never seen during training; this dataset is not used in any phase of the model-building process.
The essence of external validation is to gauge the model's real-world applicability and its potential performance on new, unseen data.
Once the models are built and internally validated, they can seem like black boxes, making predictions that are hard to understand. Imagine a complex model that predicts whether a kidney transplant will be successful. It considers numerous factors, such as the donor's age, the compatibility of donor and recipient, the health of the recipient, and many others. However, once the prediction is made, it is not immediately clear which factors were most influential in making it. This is problematic because clinicians and patients might need to understand the rationale behind the prediction to make informed decisions. LIME addresses this issue by approximating the complex model with a simpler, interpretable model (e.g., a logistic regression model) locally around the prediction (40). This simpler model can then be studied to understand how each feature influences the prediction. For example, the LIME explanation might reveal that the model predicted a high chance of transplant success mainly because the donor and recipient were highly compatible and the recipient was in good health.
In the subsequent section, a comprehensive discussion of the various models utilised for this research is provided, including an explanation of how each model functions. A thorough understanding of the models is indispensable for accurately interpreting the results and making well-informed decisions based on the analysis.

Machine learning classification models
In this section, we describe the ML models used for the classification task in this study.

AdaBoost
In our study, we applied AdaBoost to enhance the performance of our machine learning models. Adaptive Boosting, shortened to AdaBoost, is an ensemble learning algorithm designed to enhance the performance of ML models (41,42). According to Freund and Schapire (43), AdaBoost works by combining several simple models, known as weak learners, into a single, more accurate model. Each weak learner is trained on our dataset and contributes to the final prediction. Initially, all data points in the dataset are treated equally. As each weak learner is trained, more attention is paid to the examples that were difficult to classify correctly, so the model focuses on getting the hard cases right.
We repeated this process for multiple iterations, adjusting the importance of each data point based on the previous models' performance. Misclassified examples were given more weight, so the next weak learner would focus more on them. After several rounds, we combined the weak learners into a single strong model. Each weak learner had a say in the final prediction, but the more accurate learners had a bigger influence. By using AdaBoost, we created a model that performed better on our dataset than a single simple model, helping us achieve more accurate and reliable results.
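A minimal sketch of this re-weighting ensemble with scikit-learn, on hypothetical synthetic data (not the study's dataset or tuned settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The default weak learner is a depth-1 decision stump; each boosting
# round up-weights the examples the previous stumps misclassified, and
# the final vote weights each stump by its accuracy.
model = AdaBoostClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

A single stump alone barely beats chance here; boosting 100 of them yields a much stronger classifier.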

Extreme gradient boosting
We applied extreme gradient boosting (XGBoost) to our dataset to enhance the accuracy and efficiency of our machine learning model. XGBoost is an advanced ensemble technique that combines the predictions of multiple models to produce a more accurate final prediction (25,44). We started by dividing our dataset into training and testing sets, using the training set to build the model and the testing set to evaluate its performance. XGBoost iteratively trained a series of decision trees on the training set, with each tree focusing on correcting the errors made by the previous ones. This iterative process continuously improved the model's accuracy.
XGBoost's versatility allowed us to handle different types of prediction tasks, such as regression and classification, by using appropriate loss functions for each task. For instance, squared error loss is used for regression and logistic loss for classification. Additionally, XGBoost includes a regularisation term that penalises overly complex models to prevent overfitting; this term considers the number of terminal nodes in the trees and the scores assigned to those nodes. By applying XGBoost, we created a robust model that accurately captured the patterns in our data, significantly improving the model's performance and reliability for our prediction tasks.
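To keep the sketch dependency-light, the same additive-trees-on-residual-errors idea is illustrated below with scikit-learn's GradientBoostingClassifier rather than the xgboost library itself (a stand-in, not the authors' implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new tree fits the gradient of the logistic loss on the current
# ensemble's errors; shallow trees plus a small learning rate act as
# regularisation against overly complex models.
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                   max_depth=3, random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

XGBoost adds explicit penalties on leaf counts and leaf scores on top of this scheme, but the sequential error-correcting structure is the same.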

Random forest
We utilised the Random Forest algorithm to analyse our dataset. Random Forest is an ensemble learning method that constructs multiple decision trees to perform both classification and regression tasks (29,45,46). For classification, it predicts the class chosen by the majority of the trees; for regression, it averages the predictions of all the trees. This technique helps address the overfitting often encountered with individual decision trees. Random Forest builds each tree from a different bootstrap sample of the data, and, generally, increasing the number of trees enhances the accuracy of the model (29). Additionally, Random Forest performs automatic feature selection, which improves on the performance of traditional decision tree algorithms (47).
In applying Random Forest to our dataset, we divided the data into training and testing sets. The algorithm built numerous decision trees using different subsets of the training data. For classification tasks, the final prediction was determined by the majority vote across all the trees, while for regression tasks the average prediction was used. This approach not only improved the accuracy of our model but also made it more robust and less prone to overfitting. Random Forest's ability to automatically select relevant features further enhanced the efficiency and effectiveness of our analysis, leading to more reliable and interpretable results.
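A minimal scikit-learn sketch on hypothetical data (illustrative settings, not the study's tuned model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree sees a different bootstrap sample; the prediction is the
# majority vote across trees. oob_score reuses each tree's left-out
# samples as a built-in internal performance estimate.
model = RandomForestClassifier(n_estimators=200, oob_score=True,
                               random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
oob = model.oob_score_
```

The out-of-bag score gives a rough validation estimate without a separate hold-out, complementing the explicit train/test split.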

Decision trees
Decision Trees are a widely used supervised learning method for classification and regression tasks (48,49). They work by creating a model that predicts the value of a target variable using simple decision rules derived from the data's features. The core idea is to split the dataset into subsets based on specific criteria, ensuring that each split results in more homogeneous subsets (50). This splitting process continues until the model can make accurate predictions. Decision Trees rely on various metrics to determine the best splits, such as entropy, information gain, and Gini impurity. These metrics measure the disorder or impurity within the data and guide the tree-building process towards effective and accurate models.
We used Decision Trees to analyse our dataset by dividing it into training and testing sets. The Decision Tree algorithm built the model by learning decision rules from the training data, using metrics like entropy to determine the best splits. Entropy measures the randomness or unpredictability in the data, and information gain represents the reduction in entropy after a split. Gini impurity, another metric, quantifies how often a randomly selected item would be incorrectly classified. By applying these metrics, the Decision Tree algorithm iteratively split the data into smaller, more uniform subsets, leading to a model that could accurately predict outcomes. This method provided a clear and interpretable structure for understanding the relationships in our data, making it a valuable tool for our analysis.
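The three split criteria named above can be written out directly; a short worked example on a tiny label vector makes the definitions concrete:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(y):
    """Gini impurity: chance a randomly drawn item is mislabelled
    if labelled according to the class distribution."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def information_gain(y, left, right):
    """Reduction in entropy achieved by splitting y into left/right."""
    w_l, w_r = len(left) / len(y), len(right) / len(y)
    return entropy(y) - (w_l * entropy(left) + w_r * entropy(right))

y = np.array([0, 0, 1, 1])
# A perfect split removes all uncertainty, so the gain equals the
# parent entropy: entropy(y) = 1.0 bit, gini(y) = 0.5, gain = 1.0.
gain = information_gain(y, y[:2], y[2:])
```

The tree builder evaluates candidate splits with such a criterion and keeps the one that maximises the gain (or minimises the child impurity).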

Extra trees
Extra Trees, also known as Extremely Randomised Trees, is an ensemble method designed for supervised classification and regression tasks (51). Extra Trees introduces a higher level of randomness into the tree-building process (52). Unlike traditional tree methods that identify the optimal decision split for a given attribute, Extra Trees randomises the choice of attributes and their respective cut points. This double layer of randomness, in both attribute and split-point selection, often results in a more diversified set of base trees, which can enhance the model's generalisation capabilities.
This method can sometimes outperform more deterministic algorithms, especially in noisy settings. The Extra Trees classifier is built from decision trees, with a parameter k determining the number of features drawn at random from the feature set at each split.
To apply Extra Trees to our dataset, we divided the data into training and testing sets. The algorithm then constructed numerous decision trees using random subsets of features and random split points for each tree. Training the model with these randomly generated trees and combining their predictions produces the final outcome. By averaging the predictions of all the trees in the ensemble, Extra Trees provided a more stable and accurate model. This approach allowed us to capture complex patterns in the data and make reliable predictions, enhancing the overall performance of our analysis.
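A minimal scikit-learn sketch of this doubly randomised ensemble, again on hypothetical synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features plays the role of k: the size of the random feature
# subset considered at each node. Split thresholds are also drawn at
# random rather than optimised, giving a more diverse set of trees.
model = ExtraTreesClassifier(n_estimators=200, max_features="sqrt",
                             random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Compared with a Random Forest, the extra randomness in the thresholds typically trades a little bias for lower variance across trees.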

Logistic regression
Logistic regression was applied to our dataset to evaluate the relationship between a categorical outcome variable and multiple predictor variables (21,53,54). This method is particularly useful for binary classification tasks, where the goal is to predict one of two possible outcomes (27,55,56). In our analysis, logistic regression was used to model the likelihood of a specific outcome based on various input features.
To implement logistic regression on our dataset, we first identified the target variable and the predictor variables. The model was then trained on the data to find the best-fit logistic curve, which represents the probability of the target outcome given the predictor variables. This approach allowed us to make predictions about the categorical outcome, providing insights into how different factors influence the likelihood of the outcome in our dataset.
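A minimal scikit-learn sketch of this workflow on hypothetical data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# predict_proba evaluates the fitted logistic curve: the probability
# of the positive outcome given each instance's predictors.
probs = model.predict_proba(X_te)[:, 1]
acc = model.score(X_te, y_te)
```

The fitted coefficients (`model.coef_`) indicate the direction and strength of each predictor's association with the log-odds of the outcome.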

Support vector machine
Support Vector Machine (SVM) is a supervised learning technique used for both classification and regression tasks. Its primary goal is to find the best boundary that separates the different classes in the dataset. This boundary, known as the decision boundary, is determined by maximising the margin between the closest data points from each class, ensuring that the model can effectively distinguish between positive and negative instances (57,58). SVM is particularly effective when the data is linearly separable, meaning there is a clear dividing line between the classes.
In our dataset, SVM was applied to classify instances based on the features provided. By training the SVM model on the dataset, we identified the optimal decision boundary separating the classes. This boundary was used to predict the class labels for new data points, helping us understand how the features influence the classification outcome. The SVM model's ability to maximise the margin between classes contributed to its effectiveness in accurately classifying the data in our analysis.
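A minimal scikit-learn sketch on hypothetical data; the scaler matters because the margin is defined by distances in feature space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# C controls the trade-off between a wide margin and training errors;
# a linear kernel corresponds to the linearly separable case above.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Swapping `kernel="linear"` for `"rbf"` handles classes that are not separable by a straight boundary, at the cost of a less interpretable model.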

K-nearest neighbour
K-Nearest Neighbour (KNN) is a simple and widely used classification algorithm built on the idea that similar data points lie close to each other in the feature space. This makes it effective for tasks where the relationship between features and outcomes is difficult to describe with more complex models. KNN is commonly applied in areas like image recognition, recommendation systems, and medical diagnosis (59)(60)(61). In our dataset, we used KNN to classify data points by identifying the k closest neighbours to each point and predicting its class from the majority class among those neighbours. This approach allowed us to classify new data based on the patterns observed in the nearest existing data points.
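A minimal scikit-learn sketch on hypothetical data; standardisation is included because nearest-neighbour distances are scale-sensitive:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each test point is labelled by a majority vote over its k = 5
# nearest (standardised) training neighbours.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Small k follows local structure closely (low bias, high variance); larger k smooths the decision boundary.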

MLP neural network
The multilayer perceptron (MLP) is a popular feedforward neural network architecture used for tasks like classification and regression. It consists of layers of neurons in which each neuron is connected to all neurons in the next layer, with no connections within the same layer. During training, the network adjusts its weights and biases to minimise the difference between the predicted and actual outcomes (62,63). In our dataset, we applied an MLP to model complex relationships between the features and the target variable by learning from patterns in the data, allowing us to make accurate predictions based on these learned patterns.
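A minimal scikit-learn sketch on hypothetical data (the hidden-layer size and iteration budget below are illustrative choices, not the study's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One fully connected hidden layer; weights and biases are adjusted by
# backpropagation to minimise the loss between predicted and actual
# outcomes. Standardised inputs help the optimiser converge.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
```

Adding hidden layers or units increases capacity, at the cost of more data and regularisation needed to avoid overfitting.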

Machine learning explainability with LIME
The local interpretable model-agnostic explanations (LIME) framework is a model-agnostic technique designed to investigate an ML model's decision-making process on a per-instance basis (40,64). LIME works by perturbing the inputs of an already trained model while observing how these perturbations affect the model's predictions. This process allows LIME to create a simplified, interpretable model that approximates the behaviour of the original complex model within a local region around a specific instance.
According to (65), LIME distinguishes itself from other interpretable models by centring its focus on providing explanations for individual predictions rather than attempting to elucidate the entirety of the model's behaviour. In simpler terms, LIME adopts a localised approach instead of striving to explain the entire model. While numerous interpretable models aim to approximate the decision boundaries of an ML model globally, LIME recognises that understanding every facet of a complex model's behaviour across all instances might be impractical. The mathematical formulation of LIME can be written as:

explanation(x) = argmin_{g ∈ G} [ L(f, g, p_x) + V(g) ]

where explanation(x) represents the explanation provided by LIME for the instance x; f is the original ML model whose predictions we want to explain; g is the local surrogate model, chosen from a family of interpretable models G; L(f, g, p_x) is the loss function that quantifies the difference between the predictions of f and g for the instance x, weighted by a proximity measure p_x; V(g) is a complexity penalty term that encourages simpler explanations provided by g; and p_x is a proximity measure capturing how similar an instance z is to x, often defined via a kernel over the neighbourhood around x.
The minimisation process aims to find the best-fitting surrogate model, g, that balances prediction fidelity and interpretability.
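The formulation above can be made concrete with a toy surrogate fit (a simplified sketch of the idea, not the LIME library: a Gaussian kernel stands in for p_x, and Ridge's L2 penalty stands in for V(g)):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
f = ExtraTreesClassifier(random_state=0).fit(X, y)   # the black-box model f

def lime_explain(x, f, n_samples=2000, width=1.0, seed=0):
    """Fit a weighted linear surrogate g around a single instance x."""
    rng = np.random.default_rng(seed)
    # Perturb x to sample the local neighbourhood.
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    # Proximity kernel p_x: closer perturbations weigh more.
    weights = np.exp(-((Z - x) ** 2).sum(axis=1) / width ** 2)
    # The surrogate mimics the black box's predicted probabilities.
    target = f.predict_proba(Z)[:, 1]
    g = Ridge(alpha=1.0).fit(Z, target, sample_weight=weights)
    return g.coef_          # local, per-feature effects on the prediction

coefs = lime_explain(X[0], f)
```

The signs and magnitudes of `coefs` then read as the local explanation: which features push this particular prediction up or down near x.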

Evaluation metrics
Evaluation metrics provide quantifiable measures to assess the performance of ML models in categorising data into different classes (66)(67)(68). These metrics offer insights into a model's accuracy, precision, sensitivity, specificity, area under the curve (AUC), F1-score, and more.
Understanding true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) is crucial for this study. A TP denotes the model correctly identifying a positive result, a TN signifies the correct identification of a negative result, an FP represents incorrect positive labelling, and an FN indicates a missed positive identification. In this study, a TP corresponds to correctly predicting graft failure and a TN to correctly identifying a non-failure case; conversely, an FP misclassifies a surviving graft as failed, and an FN overlooks an actual failure.
The assessment of the model's performance in the Kidney Graft Study encompasses several key metrics. Accuracy quantifies the overall correctness of predictions, reflecting both true positive and true negative rates. Precision gauges the ratio of true positive predictions to all positive predictions, emphasising correctness in positive classifications. Sensitivity measures the model's ability to correctly identify true positives, highlighting its effectiveness in capturing actual positive cases. Specificity evaluates the model's aptitude in identifying true negatives, accentuating its proficiency in recognising actual negative cases. The area under the curve (AUC) provides a comprehensive overview of the model's discriminative power across varying thresholds; it elucidates the model's ability to rank positive instances above negative ones. The F1-score harmonises precision and sensitivity, striking a balance between the two metrics. These metrics are pivotal for evaluating the model's performance and drawing meaningful conclusions.
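As a concrete illustration of these definitions, the following sketch computes each threshold-based metric from confusion-matrix counts; the counts themselves are hypothetical and do not come from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics described in the text from
    confusion-matrix counts (AUC is rank-based and handled separately)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts: 50 correctly predicted failures, 100 correctly
# predicted non-failures, 10 false alarms, 20 missed failures.
metrics = classification_metrics(tp=50, tn=100, fp=10, fn=20)
```

Note that the F1-score, 2·TP / (2·TP + FP + FN), ignores true negatives entirely, which is why specificity is reported alongside it.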

Permutation-based feature importance
We quantified the contribution of each feature in the predictive model using permutation-based feature importance (69)(70)(71) and graphically visualised the results using a combination of bar plot and box plot. This analysis employed the Extra Trees models across the different eras, assessing feature importance by permuting each feature and measuring the resulting change in model performance. Initially, model performance was measured using all features, referred to as the "full model performance." Each feature's values were then randomly permuted, and the model's performance was reassessed. A feature was deemed "important" if permuting its values significantly increased the model's prediction error, as indicated by an increase in 1 − AUC, showing reliance on that feature. Conversely, a feature was considered "unimportant" if the permutation caused little change in 1 − AUC, suggesting the model did not rely on that feature. Model performance was evaluated using 1 − AUC as the loss function, with a larger increase indicating greater feature importance. To account for randomness in the permutation process, we computed the mean values of the loss function over 10 permutations. The bars' lengths correspond to each feature's average contribution or importance, while the box plot represents the distribution and variability of each feature's importance across different permutations. This approach quantified variability in feature importance and provided a robust ranking of feature contributions.
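The permutation procedure described above can be sketched generically as follows. This is an illustration in Python with a toy scoring model and synthetic data (both hypothetical, not the study's Extra Trees models or cohort), using 1 − AUC as the loss and averaging over 10 permutations as in the text.

```python
import random

def auc(labels, scores):
    """Rank-based AUC: the probability that a positive case is scored
    above a negative case (ties count one half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in 1 - AUC when each feature is permuted, over n_repeats."""
    rng = random.Random(seed)
    full_loss = 1 - auc(y, [predict(row) for row in X])   # "full model performance"
    importances = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            deltas.append((1 - auc(y, [predict(row) for row in X_perm])) - full_loss)
        importances.append(sum(deltas) / n_repeats)
    return importances

# Toy data: feature 0 drives the outcome, feature 1 is pure noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if row[0] > 0 else 0 for row in X]
predict = lambda row: row[0]   # stand-in for a fitted model's risk score
importances = permutation_importance(predict, X, y)
# Permuting feature 0 sharply increases 1 - AUC; permuting feature 1 changes nothing.
```

Because the toy score ignores feature 1 entirely, its permutation leaves the loss exactly unchanged, while permuting feature 0 drives the AUC towards 0.5, mirroring the "important" vs. "unimportant" distinction in the text.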

Results
This section presents the results of our experimental and model explanation studies. All experiments utilised the preprocessed data as outlined in Section 2. The experiments used the R (72) programming language, which is equipped with various statistical and graphical techniques for highly extensible machine-learning tasks. The experiments were performed on an AMD Threadripper 3990X 4.3 GHz GT 1030 2 GB PRO high-performance workstation (288 MB cache, 64 cores, 128 threads, 4.3 GHz turbo), with an MSI TRX40 PRO 10G AMD Ryzen Threadripper motherboard, a GeForce RTX 2070 8 GB GDDR6 graphics card, 3,200 MHz 64 GB gaming RAM, a 1 TB M.2 SSD with up to 3.5 GB/s speed, and a 4 TB HDD. This device provided the computational power necessary to handle the various stages of data preprocessing, model building, and evaluation involved in the analysis.

Performance evaluation of the classifiers
Table 2 presents a comprehensive overview of model performance during the pre-CYA era, complemented by the ROC curve depicted in Figure 3. These results offer valuable insights into the effectiveness of the models. The ensemble classifiers exhibited superior performance, with AUCs of 94% and above and accuracies of 86% and above. Notably, the AdaBoost model demonstrated particularly high performance across several evaluation metrics. Conversely, the Logistic regression model showed relatively lower performance in the pre-CYA era. The enhanced performance of ensemble classifiers can be attributed to their adeptness in mitigating overfitting and effectively handling noisy data. This resilience positions them as a robust choice for this particular era. Table 3 presents model performance during the CYA era. The ensemble classifiers again performed the best overall, with Extra Trees showing the highest scores in AUC (0.95), accuracy (0.84), and sensitivity (0.68). Logistic regression had the lowest scores across all metrics except for specificity, where it scored 0.87, higher than K-Nearest Neighbors and Decision Trees. Figure 4 presents the ROC curve of the classifiers in the CYA era.
Table 4 provides an evaluation of model performance in the New-Gen era. Based on the AUC score and other metrics listed in the table, Extra Trees outperforms the other models, followed by Random Forest, AdaBoost, and SVM. Logistic Regression and Decision Trees performed the worst according to the AUC score. The high specificity values across all models indicate their effectiveness in identifying true negatives. Figure 5 presents the AUC-ROC curves of the classifiers in the New-Gen era.

Model comparisons and historical transportability assessment
The primary evaluation metric shows that the ensemble classifiers, especially the Extra Trees algorithm, are the best-performing models across the three eras. We further evaluated the validity of this claim by assessing whether the observed differences in model performance are statistically significant. Statistical significance was ascertained using the Wilcoxon signed-rank test at a 5% significance level. Figure 6A shows that the Extra Trees model had a narrower range of AUC scores over the 10-fold CV than the other models, and that its distribution differs significantly from those of the non-ensemble classifiers.
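The fold-wise comparison can be sketched as an exact two-sided Wilcoxon signed-rank test on paired per-fold AUCs. The implementation below is a self-contained illustration in Python; the per-fold values are hypothetical, not the paper's results, and the exact null distribution is enumerated over all sign patterns, which assumes no zero or tied differences.

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Assumes no zero or tied differences (true for the toy data below).
    Returns the positive-rank sum T+ and the exact two-sided p-value,
    enumerated over all 2^n equally likely sign patterns under H0.
    """
    d = [a - b for a, b in zip(x, y)]
    d = [v for v in d if v != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    t_plus = sum(ranks[i] for i in range(n) if d[i] > 0)
    # Null distribution of T+: each rank carries a positive sign with prob 1/2.
    counts = {}
    for signs in product((0, 1), repeat=n):
        t = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        counts[t] = counts.get(t, 0) + 1
    total = 2 ** n
    lower = sum(c for t, c in counts.items() if t <= t_plus) / total
    upper = sum(c for t, c in counts.items() if t >= t_plus) / total
    return t_plus, min(1.0, 2 * min(lower, upper))

# Hypothetical per-fold AUCs from a 10-fold CV (illustration only).
extra_trees = [0.95, 0.94, 0.96, 0.95, 0.93, 0.96, 0.95, 0.94, 0.97, 0.95]
logistic = [0.88, 0.90, 0.895, 0.905, 0.87, 0.905, 0.90, 0.92, 0.88, 0.915]
t_plus, p_value = wilcoxon_signed_rank(extra_trees, logistic)
# Every fold favours Extra Trees here, so T+ is maximal (55) and p < 0.05.
```

Because the test uses only the signs and ranks of the paired differences, it makes no normality assumption about the fold-wise AUCs, which is why it suits small resampling comparisons such as 10-fold CV.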
Comparing the model performance distributions in the CYA and New-Gen eras (Figures 6B,C), there are statistically significant differences between Extra Trees and the other models, except for Random Forest in the CYA era. As the experimental results confirmed the reproducibility of the selected models using the re-sampling technique, we further assessed the transportability of the New-Gen model to the other eras (Figure 7). The external validation of the New-Gen Extra Trees model shows a lower discriminative power in the CYA (AUC: 0.59) and pre-CYA (AUC: 0.58) eras.

Explanation of the outcomes
The Extra Trees classifier was chosen to develop the LIME model across all three eras to ensure interpretability and transparency. In Figure 9 (Cases 1 and 2), the model identifies several factors that increase the likelihood of graft survival. These include being a non-diabetic or white recipient, being a younger recipient, receiving a kidney from a donor aged between 33 and 49 years, and not having hypertension as a cause of KF. Other factors that positively influence survival, such as delayed graft function (DGF), may not be significant, as shown in the plot. Conversely, in cases of graft failure (Figure 9, Cases 3 and 4), contributing factors include being a black recipient, being a recipient aged between 43 and 56 years, not having an inherited cause of KF, and receiving a kidney from a donor aged approximately 17 to 33 years.
Figure 10 depicts Cases 1 and 2, classified as "Survived," and Cases 3 and 4, classified as "Failed," by the Extra Trees model in the New-Gen era. The figure illustrates that specific factors, such as being a non-diabetic recipient, not receiving IVI steroids as rejection treatment, the absence of acute rejection, having a high EGFR, or experiencing surgical complications, as well as the other feature categories shown in the blue bars, positively influence graft survival in Cases 1 and 2. Conversely, in Cases 3 and 4, we observed the reverse effects of these features on graft failure.

Feature importance
For the pre-CYA era (Figure 11A), the feature importance plot shows that recipient age and EGFR are the most significant features, with recipient age demonstrating the highest importance for predicting graft failure in this era. Notably, EGFR shows a

Discussion
This study evaluates nine machine learning (ML) models to predict the 10-year risk of graft failure after kidney transplantation. Advances in medical practices, including immunosuppressive drugs and supportive therapies, have positively impacted graft survival post-transplant. Exploratory data analysis revealed a substantial improvement in graft survival rates from the pre-CYA to the New-Gen transplant eras, indicating a reduced risk of graft failure over time. Specific patient or donor characteristics influencing graft survival have also improved, leading us to hypothesise varying prognostic factors across the three transplantation eras and prompting the modelling of graft failure during each era. We aimed to develop an optimisable platform for predicting graft failure post-transplant, with adaptable methodological strategies for future studies using more recent data to identify risk factors and support clinical decision-making (34). We internally validated nine models for each transplant era, both with and without data augmentation. Given the study's relatively small sample size, we addressed potential issues related to model reproducibility and overfitting. The results indicate that all nine selected algorithms demonstrated good discrimination ability as measured by the AUC metric. Ensemble algorithms consistently outperformed the others in predicting graft failure, benefitting from the additional samples and diversity introduced by augmented data (Table A1), aligning with studies emphasising the importance of large datasets for accurate ML (73). While direct comparison with prior studies modelling era-specific graft survival in kidney transplants was not possible, our best-performing models achieved an AUC score of 97%, which is comparable to or higher than the AUC scores reported in studies modelling long-term graft survival, which ranged from 64.4% to 89.7% (30, 32, 74-77). As depicted in Table 5, the variation in these scores among studies can be attributed to several factors, including differences in data size, study periods, risk associations, and modelling strategies.
Top features influencing graft survival showed inconsistency across the three eras, except for recipient and donor age, which consistently demonstrated global importance. This highlights the variation in graft survival factors across eras. LIME models provided interpretable results for features influencing graft survival within each era, emphasising the necessity for continuous adaptation and validation of predictive models in different contexts. Incorporating interpretable ML models like LIME into clinical decision-making can lead to more informed and individualised treatment plans, improving patient outcomes and graft survival rates (40, 64).
Our study also evaluated the transportability of the New-Gen model to the other eras, revealing challenges due to changing disease severity over time (28). Differences in survival rates and risk factors across the three eras indicate that historical transportability may only be achieved if the same features consistently impact graft survival. Despite these challenges, our models exhibited reproducibility and consistency in predicting outcomes within each era, underscoring the potential of ML approaches to enhance understanding and prediction of graft survival across diverse settings.
In conclusion, this study concurrently explores graft survival across three transplant eras, providing valuable insights into post-kidney transplant outcomes. Acknowledging its limitations, including reliance on data from a single centre with a relatively small patient cohort, is crucial. While the findings may not fully capture the entire landscape or current state of kidney transplantation in South Africa, they provide a foundation for future studies. Further ML-based investigations into graft survival, utilising current data from diverse regions, are essential to deepen our understanding. The study's comprehensiveness could have been enhanced by incorporating pivotal variables which, unfortunately, were excluded due to missing data or inconsistencies in the data collection process.
Looking forward, our objective is to refine and assess the geographical applicability of the models developed within a different transplant unit in South Africa, with a primary focus on improving transportability within the same transplant era.

FIGURE 1
Outcome variable description. (A) A time series plot showing the number of transplant cases and the proportion of graft failure in the three eras over the study period. The dashed lines show the study periods for pre-CYA, CYA and New-Gen. (B) A bar plot illustrating the number of transplant cases and the distribution of graft failure across the transplant eras.

FIGURE 5
AUC-ROC of the classifiers for the New-Gen era.

Figure 8 presents the LIME results for the pre-CYA era, showcasing the influence of different factors on graft survival or failure in four randomly selected patients. In Figure 8 (Case 1), factors that negatively influenced graft survival include a low estimated glomerular filtration rate (EGFR), the presence of surgical complications, an unmatched donor-recipient blood group, and donors aged approximately 34 to 48 years or deceased donors. Conversely, being a younger or white recipient and having no inherited cause of KF contributed positively to graft survival.

FIGURE 6
Box plot of model performance evaluation on the test folds, based on the distribution of the AUC for each model. The Extra Trees model was used as the benchmark for comparison with the other models. (A) pre-CYA, (B) CYA and (C) New-Gen. Significant differences were based on the Wilcoxon signed-rank test at a 5% level of significance.

FIGURE 7
AUC curves demonstrating the historical transportability of the Extra Trees model from the New-Gen era (Derivation) to the CYA (Validation 1) and pre-CYA (Validation 2) eras.

FIGURE 8
LIME model plots explaining individual predictions for four randomly selected patients who underwent transplants in the pre-CYA era. The plots are based on the Extra Trees model and show the features that support (blue bars) or contradict (red bars) the predicted probability.

FIGURE 9
LIME model plots explaining individual predictions for four randomly selected patients who underwent transplants in the CYA era. The plots are based on the Extra Trees model and show the features that support (blue bars) or contradict (red bars) the predicted probability.

FIGURE 10
LIME model plots explaining individual predictions for four randomly selected patients who underwent transplants in the New-Gen era. The plots are based on the Extra Trees model and show the features that support (blue bars) or contradict (red bars) the predicted probability.

FIGURE 11
Permutation-based feature importance measures across eras: the plot displays the permutation-based feature importance measures for study features included in the Extra Trees models for each era: (A) pre-CYA, (B) CYA, and (C) New-Gen. Feature importance is measured using 1 − AUC as the loss function, where higher values indicate a greater impact on model performance when the feature is permuted.

TABLE 1
Demographic and clinical characteristics of study participants across the transplant eras.

TABLE 2
Performance evaluation of the models on the pre-CYA era.

TABLE 3
Performance evaluation of the models on the CYA era.
The LIME model is a valuable tool for interpreting predictions from any classifier in an understandable and interpretable manner. A feature importance bar plot was employed to elucidate predictions in the local region. Visual representations of the LIME results are presented in Figures 8, 9, and 10, where blue and red colours signify contributing factors. Specifically, blue denotes features that increase the likelihood of the predicted outcome (graft survival or failure), while red indicates features that decrease it. The length of each bar reflects the magnitude of the feature's contribution.

TABLE 4
Performance evaluation of the models on the New-Gen era.

TABLE 5
Comparison with other existing studies.