Development and validation of a digital biomarker predicting acute kidney injury following cardiac surgery on an hourly basis

Objectives To develop and validate a digital biomarker for predicting the onset of acute kidney injury (AKI) on an hourly basis up to 24 hours in advance in the intensive care unit after cardiac surgery. Methods The study analyzed data from 6056 adult patients undergoing coronary artery bypass graft and/or valve surgery between April 1, 2012, and December 31, 2018 (development phase, training, and testing) and 3572 patients between January 1, 2019, and June 30, 2022 (validation phase). The study used 2 dynamic predictive modeling approaches, namely logistic regression (LR) and bootstrap aggregated regression trees machine (BARTm), to predict AKI. The mean area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and positive and negative predictive values across all lead times before the occurrence of AKI were reported. Clinical practicality was assessed using calibration. Results Of all included patients, 8.45% and 16.66% had AKI in the development and validation phases, respectively. When applied to testing data, AKI was predicted with mean AUCs of 0.850 and 0.802 by BARTm and LR, respectively. When applied to validation data, BARTm and LR resulted in mean AUCs of 0.844 and 0.786, respectively. Conclusions This study demonstrated the successful prediction of AKI on an hourly basis up to 24 hours in advance. The digital biomarkers developed and validated in this study have the potential to assist clinicians in optimizing treatment and implementing preventive strategies for patients at risk of developing AKI after cardiac surgery in the intensive care unit.

Following cardiac surgery, up to 40% of patients can develop acute kidney injury (AKI), 1 which can contribute to a greater risk of postoperative infection, atrial fibrillation, and a more prolonged stay in the intensive care unit (ICU) and hospital. 2 Furthermore, AKI is associated with the progression of chronic kidney disease, which affects patients' long-term quality of life. 3 Because AKI is a complex, multifactorial complication, there is currently no single molecular or digital biomarker signature that is a so-called "kidney troponin." 4 At present, the most promising molecular biomarkers for AKI diagnosis are neutrophil gelatinase-associated lipocalin, interleukin-18, kidney injury molecule-1, cell-cycle arrest biomarkers, 2 and N-terminal prohormone of brain natriuretic peptide, high-sensitivity C-reactive protein, hemoglobin, and magnesium. 5 A widely used clinical test for AKI is NEPHROCHECK (NC; Astute Medical), which detects the urinary biomarkers tissue inhibitor of metalloproteinases and insulin-like growth-factor binding protein 7 to assess the risk of moderate or severe AKI. 6 However, these molecular biomarkers are expensive, requiring extra resources to gather, test, and interpret the data, which consequently affects their usability. 7 Therefore, investigating already routinely collected serum data from the ICU to develop a digital biomarker would offer an affordable and automated way to assess the risk of developing AKI.
Within the past decade, numerous dynamic predictive models have been developed with the hope of improving surgical outcomes and overall patient care, mostly to predict mortality and sepsis. 8 As AKI is a persistent and widespread problem in cardiac surgery, numerous prediction models for AKI have been developed for preoperative use to minimize patient risk before surgery. 2 However, these models mostly use demographic data, which offer very little granularity when it comes to personalized prediction. Since AKI is still underdiagnosed, especially at lower stages, 9 a dynamic, near real-time prediction model suitable for ICU use that considers the patient's physiological changes could help detect AKI hours in advance. A model is considered dynamic if a prediction is made repeatedly as time passes and the values of the predictive variables change. Using patient data collected with medical devices and stored in electronic health records enables the development of a digital biomarker that could be used as a monitoring biomarker 10 to assess the status of AKI.
Therefore, with the objective of improving risk assessment for AKI in the ICU for the cardiac population, this study aims to develop and internally validate a digital biomarker to predict the onset of AKI on an hourly basis within 25 hours of ICU admission, up to 24 hours in advance, using routinely collected clinical data.

METHODS
This study gained ethical approval from the responsible UK Health Research Authority (REC18/YH/0366, September 21, 2018). Since this is a retrospective analysis of routinely collected clinical data, the requirement for written informed consent was waived by the Institutional Review Board. This article adheres to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis guidelines. 11 The methods used in this study have been described in detail in Appendix E1 (Table E1).

Predicted Outcome
The Kidney Disease: Improving Global Outcomes clinical practice guideline 12 was used to define AKI. A retrospective diagnosis was made by dividing each serum creatinine level measured in the ICU by the preoperatively measured serum creatinine level (baseline). If this ratio was greater than or equal to 1.5, the patient was diagnosed with AKI. The timestamp at which the creatinine threshold was first met was recorded to indicate the occurrence of AKI.
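The diagnosis rule above can be sketched in a few lines. This is an illustrative Python sketch only (the study's analyses were conducted in R), and the function name and data layout are assumptions, not part of the original work:

```python
def detect_aki(baseline_creatinine, icu_measurements, threshold=1.5):
    """Return the hour at which the KDIGO creatinine criterion is first met,
    or None if it never is.

    baseline_creatinine: preoperative serum creatinine (the baseline).
    icu_measurements: chronological list of (hours_since_icu_admission, value).
    AKI is flagged when an ICU creatinine divided by baseline is >= threshold.
    """
    for hour, value in icu_measurements:
        if value / baseline_creatinine >= threshold:
            return hour  # recorded as the time of AKI onset
    return None
```

For example, with a baseline of 100 µmol/L, a series peaking at 1.35 times baseline yields no diagnosis, whereas a value reaching 1.55 times baseline is flagged at the hour it occurs.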

Setting and Datasets
This study was conducted at the Golden Jubilee National Hospital, a large cardiac center in the United Kingdom that performs more than 50% of all elective cardiothoracic surgeries for the National Health Service in Scotland. 13 Data from 2 local electronic health record databases were used: the Cardiac, Cardiology and Thoracic Health Information database, which includes static information recorded preoperatively, and the Centricity CIS Critical Care database, which includes dynamic laboratory data from the ICU. Data for patients undergoing coronary artery bypass graft (CABG), aortic valve, and combined CABG and valve surgeries between April 1, 2012, and December 31, 2018, were included for the development phase (training and testing) of the models. The patient data between January 1, 2019, and June 30, 2022, were used to internally validate the models. The final number of patients included in this study was 6056 for development and 3572 for validation. The details of how the final study populations for the development and validation phases were arrived at are shown in Figure 1.

Predictors
In total, 82 variables were used in the models, comprising 25 preoperatively recorded variables, including demographic variables (eg, sex and age), information about the surgery (eg, type and urgency of the surgery), and comorbidities relevant to cardiac surgery (eg, cardiac and renal function). From the ICU database, 13 laboratory variables and 4 medicine-related variables were included. The full list of variables included in the models, together with descriptive statistics, can be found in Table E2.

Classification Methods and Experiments
This paper presents a logistic regression (LR) model and a bootstrap aggregated regression trees machine (BARTm) model predicting the onset of AKI within 25 hours of ICU admission on an hourly basis, up to 24 hours in advance. (As part of this study, other methods were also experimented with; the details and results can be found at https://stax.strath.ac.uk/concern/theses/6969z130f.) These models were developed for hourly lead times, based on the time windows (Figure 2).
The models were developed on a complete set of training data (ie, all records containing missing values were removed). To take advantage of BARTm's ability to incorporate missing values into the prediction model, 2 experiments were undertaken in terms of incorporating missing values into the testing and validation sets.
Experiment 1: Testing and validating the models using complete data (ie, removing all records that included missing values). The results of LR and BARTm are presented.
Experiment 2: Testing and validating the models on datasets that included some missing values. Records with >40% missing values were excluded from analysis, as done elsewhere. 14 The rest of the missing values were left as is. Here, the results of BARTm are presented since this method is robust to missing data. 15
The models were developed on the training data, using 10-fold cross validation. All analyses were conducted using R, version 4.2.2 (R Foundation for Statistical Computing). The models were evaluated for each lead time using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and positive and negative predictive values. The models' performance measures across all lead times were compared using t tests, with the significance level set to .05. Calibration was also assessed by plotting the predicted versus observed probabilities for AKI.
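The per-lead-time performance measures reduce to confusion-matrix arithmetic. The following is a minimal Python sketch of those definitions (illustrative only; the study computed these in R, and the function name here is an assumption):

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, and NPV from binary labels.

    y_true: observed AKI status (1 = AKI, 0 = no AKI).
    y_pred: predicted class at a given lead time.
    Returns None for a metric whose denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        # PPV is depressed when prevalence is low, as seen in this study
        "ppv": tp / (tp + fp) if tp + fp else None,
        "npv": tn / (tn + fn) if tn + fn else None,
    }
```

Because only 8.45% of development-phase patients had AKI, even a modest false-positive rate produces many false positives relative to true positives, which is why the reported positive predictive values are low despite good AUCs.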

Patient Population and AKI
As shown in Table 1, of the 6056 patients included in the development phase, 512 (8.45%) had AKI. Of these patients, 4058 were included in the training set, where the mean age was 66.08 years, with the majority being male (73.04%). The most common procedure was CABG (57.96%). The mean hospital stay was 10.97 days, and the mean ICU stay was 38.94 hours. Overall, 49.41% of the patients had complications, and 0.62% of the patients died in the hospital. The testing set of 1998 patients did not significantly differ from the training set population.
Of the 3572 patients included in the validation dataset, 595 (16.66%) had AKI. The patients were slightly younger (mean age of 65.47 years), and the proportion of male patients was significantly greater (76.99%) in the validation dataset than in the training dataset. CABG was still the most common open-heart surgery (57.70%). Hospital stay and ICU hours were significantly different from the training set, with mean ICU hours being 48.69 and total hospital stay being 12.00 days in the validation dataset. A significantly greater proportion of patients in the validation set had complications (62.74%) and died (1.93%) compared with the training set. Detailed descriptive statistics of all variables and comparisons between the training, testing, and validation datasets can be found in Table E2. Most patients in both the development and validation datasets had AKI between 20 and 25 hours since ICU admission (Figure E1), more specifically at median hours of 16.18 (interquartile range, 24.49) in the development phase data and 19.78 (interquartile range, 23.95) in the validation data. Interestingly, patients in the validation data appeared to have the onset of AKI earlier in general than in the development dataset. This is because AKI was retrospectively diagnosed using serum creatinine measurements and, as shown in Table 2, creatinine measurements were taken more frequently in the validation dataset than in the development phase dataset.
It is important to note that there appear to be significant differences between the development and validation patient populations (Table 1 and Table E2), in terms of both characteristics and the frequency of data collection in the ICU (Table 2). The reasons for this are multifactorial and hence difficult to objectively underpin. It can be speculated that the differences could be due to changing procedures, where more straightforward patients tend to have more minimally invasive interventions, such as percutaneous coronary intervention, as opposed to riskier CABG and/or valve surgeries. 16 Changes in patient population can also occur due to policy changes in patient selection processes, as well as changes in data collection. 17 However, the frequency of data collection could also differ simply due to improvement and automation of the devices collecting the data. 18

Models Predicting Acute Kidney Injury in ICU on Hourly Basis

Models' Discrimination
For both models, performance, regardless of training, testing, or validation datasets, tended to increase as the lead time got closer to 0 (Figure 3). The reason might be that with shorter lead times more data were available for each patient, giving the algorithms more information from which to construct a model indicating the probability that the patient would have AKI. Interestingly, however, at lead times of 22 and 21 hours, the LR model had a noticeable dip in performance. This could be due to more variation being introduced to the model as more data were entered into the system at these time windows (Figure E2). The BARTm model using complete training data and complete testing data (Experiment 1) achieved the greatest mean AUC of 0.850 and the greatest mean sensitivity of 0.821 (Table 3) (mean variable importance reported in Table E3). Logistic regression from Experiment 1 had the greatest mean specificity of 0.824 (model coefficients reported in Tables E4 and E5). In terms of negative predictive value,
BARTm developed with complete training data and tested with missing values (Experiment 2) achieved a greater negative predictive value (0.800) than LR. For both models in both experiments, the positive predictive values were very low due to the low prevalence of AKI in the patient population. Based on the mean AUC, BARTm had a significantly greater performance than LR, with a mean AUC of 0.923 for training, 0.850 for testing, and 0.844 for validation data.
BARTm performed comparably well when applied to testing and validation datasets that included missing values, with mean AUCs of 0.837 and 0.838 for the testing and validation datasets, respectively. This result is very promising, because missing data in routinely collected clinical data are common, 19 and being able to apply the model to patients whose data are not complete can be extremely helpful for predicting AKI in practice.
There is a noticeable variation in sensitivity and specificity (Figure 4) from one lead time to another, especially for logistic regression between lead times of −18 and −22, again likely due to the introduction of more variation in laboratory values at these lead times. The exact performance measures for each lead time for each model and experiment can be found in Table E6.

Calibration of the Models
Unsurprisingly, the models were more confident in their predictions at lead times closer to the onset of AKI (ie, at 1 hour and 4 hours in advance) than when the prediction was made earlier (Figures E3 and E4). Furthermore, in all experiments, both models were more confident when predicting patients not to have AKI (ie, when the probability of AKI is low) than when predicting patients to have AKI. This is especially evident when looking at the BARTm model predicting AKI 24 hours in advance. The models tend to slightly overestimate the risk of AKI if the actual probability is low, and underestimate it if the actual probability is high. The mean predicted probabilities and actual proportions of patients with AKI are shown for each model at each lead time for each experiment in Table E7.
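Comparing predicted versus observed probabilities, as done for the calibration plots, amounts to binning predictions and contrasting each bin's mean prediction with its observed event rate. A minimal Python sketch of that bookkeeping (illustrative only; the function name and equal-width binning scheme are assumptions, not taken from the study):

```python
def calibration_bins(probs, outcomes, n_bins=10):
    """Bin predicted probabilities and compare them with observed rates.

    probs: predicted probabilities of AKI in [0, 1].
    outcomes: observed AKI status (1 = AKI, 0 = no AKI).
    Returns a list of (mean_predicted, observed_proportion) per non-empty bin;
    a well-calibrated model has these two values close in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    pairs = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            pairs.append((mean_pred, observed))
    return pairs
```

Overestimation at low risk shows up as bins where the mean predicted probability exceeds the observed proportion, matching the pattern described above.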

Summary of Results and Comparison with Existing Models
This study developed and validated a digital biomarker that predicts AKI in the ICU following cardiac surgery on an hourly basis (Figure 5). The best-performing model, BARTm, achieved high overall performance on testing data (mean AUC = 0.850, sensitivity = 0.821, and specificity = 0.741) and validation data (mean AUC = 0.844, sensitivity = 0.789, and specificity = 0.806). The model also predicted AKI when data included missing values, achieving mean AUCs of 0.837 for testing data and 0.838 for validation data. Even though AKI is a persistent and widespread problem in cardiac surgery, only 2 dynamic prediction models for AKI have been developed to date. 20,21 Meyer and colleagues, 20 predicting renal failure, achieved greater performance (AUC of 0.96, sensitivity of 0.94, and specificity of 0.86) than the BARTm model. Both Meyer and colleagues' 20 and Ryan and colleagues' 21 models have some limitations, such as potential overestimation of the predicted outcome due to balancing methods, 22 which could lead to poor calibration. 23 Neither of the studies reports their model's calibration, making it difficult to compare these models' applicability in clinical practice with the model developed in this study.
When we compared the digital biomarker developed as part of this study with the widely used NC urine biomarker test, BARTm achieved a noticeably better AUC, sensitivity, and specificity than the NC (AUC = 0.633, sensitivity = 0.56, and specificity = 0.64) achieved in the development study. 6 Although the NC has shown strong performance when applied to patients who undergo cardiac surgery (AUC = 0.84, sensitivity = 0.92, and specificity = 0.81), 24 its performance has not been consistent, achieving an AUC of only 0.60 in a recent study investigating off-pump CABG patients. 25 It is important to note that, due to the cost of testing molecular biomarkers, these studies validating the NC are very small, including only 50 and 90 patients, respectively.

Strengths and Limitations
Although the Kidney Disease: Improving Global Outcomes criteria are currently the most objective and accurate way to diagnose AKI, 2 they rely on serum creatinine laboratory results. Since creatinine was measured more frequently in the validation phase than in the development phase (Table 2), the hourly prediction based on more frequent creatinine measurements could improve diagnosis reliability, which could explain why the models still performed well in the validation datasets despite the validation and development phase data being significantly different in both the frequency of measurements and their values. Since this is a single-center study, it is unclear whether creatinine is measured more frequently in later years as an international standard, or whether this change took place only at the study institution. Therefore, it is also unclear whether the models could perform well in validation data where creatinine is measured less frequently.
Due to the missing values of hemoglobin in earlier years in the Cardiac, Cardiology and Thoracic Health Information database, the preoperatively measured hemoglobin variable was excluded from the analysis. As hemoglobin has been shown to be associated with kidney function, the exclusion of this variable can be perceived as a limitation of this study. However, as the models presented in this study integrate the latest laboratory information available on an hourly basis, the significance of the most recent hemoglobin level documented within the ICU outweighs the importance of the hemoglobin level recorded during the preoperative phase at the clinic. In the ICU, hemoglobin was recorded every 1 to 1.5 hours (Table 2) for 99.9% to 100% of patients (Table E2), making it a more reliable measure than preoperative hemoglobin. Although we have made use of BARTm's capability to consider incomplete data for ICU laboratory measurements, we opted not to apply data imputation methods to address missing values in the preoperative hemoglobin measurements. This decision is based on the availability of more dependable and current hemoglobin data within the ICU, and our desire to prevent potential biases that imputation methods might introduce. 26
Missing data in electronic health records are very common and are a barrier to the development of accurate and usable clinical prediction models. 19 The competitive performance of BARTm with missing values on testing (mean AUC = 0.837) and validation data (mean AUC = 0.838) is promising. Being able to use methods that can make a prediction even in the presence of missing data can be extremely beneficial, as a clinician can still be informed whether a patient is likely to develop AKI. In the future, the models should also be tested on datasets including larger proportions of missing data, as entries with more than 40% missing values were removed from the analysis.
The reduced interpretability of BARTm compared with logistic regression poses a challenge due to the lack of model coefficients. However, because the ICU is a complex, data-rich environment, putting either of these models into practical use would require the development of clinical software to apply the models to patient data.
Finally, using a local dataset may limit generalizability but ensures greater relevance of the models within this specific setting.Local care processes can vary between institutions, and policies influencing treatment and access to care can differ across countries, and therefore, external validation and recalibration are needed to support applicability to other populations. 27

Clinical Implications and Future Work
The hourly ICU digital biomarker has the potential to be developed into a clinical system that is integrated with electronic health records. Such a system could aid clinicians in risk assessment, treatment planning, and resource allocation by predicting AKI hours in advance. The work presented in this article is the first step toward developing a clinical decision support model integrated with the electronic health records in the ICU, as is done with currently used risk prediction models. Unlike the Sequential Organ Failure Assessment and Acute Physiology, Age and Chronic Health Evaluation scores, 28 the digital biomarker calculates the risk every hour, allowing clinicians to identify patients at risk of developing AKI in a timely manner, well in advance, to avoid late diagnosis and consequently worsened health outcomes for patients.
As AKI is still vastly underdiagnosed, 9 there is a need for an accurate, usable, and timely way to diagnose AKI, for which the BARTm is a strong candidate. The high sensitivity and specificity show the model's ability to recognize patients with and without AKI comparatively well. The negative predictive value staying above 0.700 for the development, testing, and validation sets shows that when the model classifies a patient as being without AKI, it is correct at least 70% of the time. Although there is room for improvement regarding false positives and false negatives, it is unknown whether this model performs better than other models in that regard, as other similar studies have not reported this information. 20,21 To improve the predictive ability of the models, the future inclusion of vital signs and molecular serum and plasma data could be beneficial. 2 Furthermore, to improve the usability and applicability of the models, other complications known to be associated with AKI, such as delirium and sepsis, could be added as additional outcomes to be predicted. Although the data from the validation phase were significantly different from the development phase, the models, interestingly, performed well at predicting AKI on the validation set based on discrimination and calibration. As mentioned earlier, although the reasons for the development and validation datasets being different are multifactorial and therefore difficult to objectively underpin, the strong performance of the models in the validation set shows their robustness to possible changes in patient population, health policies, medical devices, ICU protocols, and patient pathways, and even to possible effects of changes in patient selection due to the coronavirus disease 2019 pandemic. However, to confirm the robustness of the model and to support its generalizability before implementation into clinical practice, 27 an external validation study, an updating strategy, and a clinical support system
integrated with electronic health records are needed for widespread adoption.
In summary, this study developed a digital biomarker for hourly prediction of AKI in the ICU after cardiac surgery, demonstrating high performance.These digital biomarkers could help clinicians optimize treatments for patients who are at risk of developing AKI hours in advance.

APPENDIX E1: MODEL DEVELOPMENT METHODS
This study gained ethical approval from the responsible UK Health Research Authority (REC18/YH/0366, September 21, 2018). Since this was a retrospective analysis of routinely collected clinical data, the requirement for written informed consent was waived by the institutional review board. The article adheres to the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis guidelines. E1 In this paper, 2 models developed to predict the onset of acute kidney injury (AKI) within 25 hours of admission to the intensive care unit (ICU) on an hourly basis, up to 24 hours in advance, are presented. These models are logistic regression (LR) and bootstrap aggregated regression trees machine (BARTm). (As part of this study, other methods were also experimented with; the details and results can be found at https://stax.strath.ac.uk/concern/theses/6969z130f.)

Predicted Outcome
The Kidney Disease: Improving Global Outcomes clinical practice guideline E2 was used to define AKI. A retrospective diagnosis was made by dividing each serum creatinine level measured in the ICU by the preoperatively measured serum creatinine level (baseline), as done by Birnie and colleagues. E3 If this ratio was greater than or equal to 1.5, the patient was diagnosed with AKI. The timestamp at which the creatinine threshold was first met was recorded to indicate the occurrence of AKI. The urine output was not used to diagnose AKI, as per the Kidney Disease: Improving Global Outcomes definition, because urine output has been shown to be overly sensitive and nonspecific for the cardiac surgery population. E4

Setting and Datasets
This study was conducted at the Golden Jubilee National Hospital, a large cardiac center in the United Kingdom that performs more than 50% of all elective cardiothoracic surgeries for the National Health Service in Scotland. E5 Data from 2 local electronic health record databases were used: the Cardiac, Cardiology and Thoracic Health Information (CaTHI) database, which includes static information recorded preoperatively, and the Centricity CIS Critical Care database, which includes dynamic laboratory data from the ICU. Data for patients undergoing coronary artery bypass graft, aortic valve, and combined coronary artery bypass graft and valve surgeries between April 1, 2012, and December 31, 2018, were included for the development phase (training and testing) of the models. The patient data between January 1, 2019, and June 30, 2022, were used to internally validate the models. Only the records that occurred in the dataset for a patient for the first time (unique entries) were included in the analysis. Furthermore, patients who experienced AKI within the first hour of ICU admission were excluded, as no laboratory data were available for these patients to allow for prediction of AKI. Finally, of patients with AKI, only those who had AKI within the first 25 hours were observed, as the majority of patients were diagnosed with AKI within that time frame (as shown in the Results). Thus, the final number of patients included in this study was 6056 for development and 3572 for validation. The derivation of the training and testing datasets and how the final number of patients was arrived at are further described in the "Missing Data" section.

Predictors
In total, 82 variables were used in the models, comprising 25 preoperatively recorded variables, including demographic variables (eg, sex and age), information about the surgery (eg, type and urgency of the surgery), and comorbidities relevant to cardiac surgery (eg, cardiac, neurologic, renal, and respiratory function). From the ICU database, 13 laboratory variables and 4 medicine-related variables were included. The full list of variables included in the models, together with descriptive statistics, can be found in Table E1.
The preoperative variables were measured only once at the preoperative assessment clinic, and were therefore treated as static variables in the models.The laboratory variables were measured repeatedly, allowing for the development of an hourly prediction model.
As shown in Table 1, each laboratory variable was measured at different times, depending on the patient's needs. Furthermore, there is a significant difference in how often the laboratory variables were measured between the development data and the validation data. It is especially noticeable that in the (more recent) validation dataset, measurements of creatinine, urea, and C-reactive protein were made about every 6 hours, whereas in the development dataset, these variables were measured every 10 hours.

Missing Data
For preoperative data from the CaTHI database, patients who had not been discharged from the hospital by the time of data extraction were excluded from the analysis due to not having their final outcome recorded (ie, deceased or discharged). Patients with "salvage" priority, and those with "unknown" New York Heart Association grade, previous myocardial infarction, or hypertension history, were excluded because only a very small group of patients lacked these variables. Categorical variables with many "unknown" entries were included in the analysis and left coded as "unknown." Also, because some of the validation data were recorded during the start of the coronavirus disease 2019 (COVID-19) pandemic, and given the lack of understanding regarding the specific effects of past COVID-19 infection on surgical outcomes, patients indicated in the clinicians' notes to have had past COVID-19, or COVID-19 infection at the time of ICU admission, were excluded from the analysis (1122 patients). For numerical variables, patients with clinically infeasible values were excluded. If a numerical variable was recorded for less than 80% E6 of the patients, the variable was excluded from the analysis. The only variable excluded for that reason was the preoperative hemoglobin level.
Although the CaTHI database is an audit database, and therefore should include fewer "unknowns," the only variables that are mandatory in the CaTHI database are those needed to calculate the logistic EuroSCORE. E7 Therefore, until a study was conducted in 2016 using this database, the information regarding New York Heart Association grade, previous myocardial infarction, hypertension history, and hemoglobin was not always recorded. Furthermore, as this information is usually recorded at the preoperative assessment clinic, medical history for patients undergoing emergency surgery might not always be readily available.
The ICU data from the Centricity CIS database were checked for obviously incorrect values, which were marked as NA. If a patient had a timestamp recorded for a missing value of a variable, the previously recorded value was carried forward to the next timestamp. The only variable recorded for all ICU patients was hemoglobin, with 100% completeness. Creatinine was recorded for almost all patients, with 98.32% completeness. Since medicine doses were recorded for less than 41.00% of the patients, medicine variables were instead recorded as binary categorical variables based on whether the patient was given the medication (yes vs no).
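The carry-forward step above is last-observation-carried-forward (LOCF). A minimal Python sketch of the idea (illustrative only; the study's data preparation was done in R, and the function name is an assumption):

```python
def carry_forward(values):
    """Last-observation-carried-forward over a chronological series.

    values: list of measurements at recorded timestamps, with None marking
    a timestamp where the variable was missing. Each missing entry is
    replaced by the most recent observed value; leading missing entries
    (nothing observed yet) remain None.
    """
    filled, last = [], None
    for v in values:
        if v is None and last is not None:
            v = last  # reuse the previously recorded value
        if v is not None:
            last = v
        filled.append(v)
    return filled
```

For example, a creatinine series of [None, 1.2, None, 1.5, None] becomes [None, 1.2, 1.2, 1.5, 1.5]: values are propagated forward but never backward.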

Classification Methods and Experiments
In the development phase of the study, other classification methods were also experimented with (see Lapp E8 ); however, LR and BARTm were chosen for the validation phase of the study due to logistic regression's high interpretability and competitiveness with other classification methods E9 and BARTm's ability to incorporate missing values into its prediction. E10 Although missing data are a common problem in electronic health records, E11 incorporating classification methods that can handle missing data without imputation methods, which could lead to bias, E12 is rare. E13 BARTm is a method that can handle missing values while incorporating built-in estimates of uncertainty in the form of credible intervals, as well as previous information on covariates. This is done by sending missing data to whichever of the 2 daughter nodes increases the overall model likelihood. E14 This means that if there are 2 options (eg, left and right), then BARTm offers options for both of these paths if a record has a missing value. Hence, the direction of missingness is considered equally likely to be left or right, conditional on the splitting attribute and value. E10 It has previously been shown that BARTm is comparable in performance to random forest with missForest imputation. E15
The models were developed on a complete set of training data (ie, all records containing missing values were removed). To take advantage of being able to incorporate missing values into the prediction model, 2 experiments were undertaken in terms of incorporating missing values into the testing and validation sets.
Experiment 1: Testing and validating the models using complete data (ie, removing all records that included missing values). The results of both LR and BARTm are presented.
Experiment 2: Testing and validating the models on datasets that included some missing values. Records with >40% missing values were excluded from the analysis, as done elsewhere.E16 The remaining missing values were left as is. Here, only the results of BARTm are presented, since this method is robust to missing data.E10
For logistic regression, the "caret" R package, version 6.0.93, with method "glm" was used.E17 For BARTm, the "bartMachine" R package, E18 version 1.3.3.1, was used with its default of including missing values, as the model can accommodate these. The models were developed on the training data using 10-fold cross validation (the general code for developing the classification models is available at https://doi.org/10.15129/1ab360f7-0779-4cf3-8a9a-dae621892a51), which is the recommended approach to developing a prediction model.E1 All analyses were conducted using R, version 4.2.2.
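The model-fitting setup can be sketched as follows. The authors used the caret and bartMachine packages in R; the snippet below is an illustrative Python equivalent of the 10-fold cross-validated logistic regression component only, on synthetic stand-in data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for one lead time's training set: 4 summary features
# (eg, first/last/min/max of a laboratory value) and a binary AKI outcome.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# 10-fold cross validation, mirroring caret's trainControl/"glm" setup.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(round(scores.mean(), 2))
```

In the study, one such model was fitted per lead time (1 to 24 hours), each on its own window-specific training set.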

Data Preparation and Time Windows
The models were developed to predict AKI on an hourly basis. To facilitate this, rolling time windows were created, first to indicate the onset of the predicted outcome and second to develop prediction models for each time window before the event. In this study, hourly prediction was undertaken for AKI occurring within 25 hours of ICU stay. This timeframe was chosen because the majority of patients experienced the onset of AKI between 20 and 30 hours after ICU admission, as shown in the section "Patient Population and Acute Kidney Injury." The models predicting AKI were built for hourly lead times based on the time windows. The lead times were every hour from 1 to 24 hours ahead of the onset of AKI occurring within 25 hours of ICU admission. For example, when predicting AKI 1 hour in advance, the data were collected from admission to the ICU until 1 hour before the onset of AKI. In general terms, the model predicting AKI at lead time n used all data collected until n hours before the outcome, as illustrated in Figure 5 in the main article.
To simplify the laboratory data used in the models, for each laboratory variable, the minimum, maximum, and first and last measurements within each lead time were used. This created a more consistent set of input data for the models, which might otherwise have had to deal with variations in the number of independent variables at each stage (shown in Figure E1).E19,E20 This means that if the predicted outcome happened in time window = 6, the first, last, minimum, and maximum measurements of each variable occurring in time windows 0 to 5 were calculated. Regardless of the number of hours after admission at which AKI occurred, there would always be 4 values (first, last, minimum, and maximum) for each dynamic predictor variable.
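The summarization step above can be sketched as follows; a minimal Python/pandas illustration with hypothetical data, showing how a variable-length measurement history collapses to a fixed set of 4 features per laboratory variable.

```python
import pandas as pd

# Hypothetical hourly creatinine measurements for one patient.
obs = pd.DataFrame({
    "hour": [0, 2, 3, 5, 7],
    "creatinine": [90.0, 110.0, 105.0, 120.0, 130.0],
})

def window_features(obs, cutoff_hour, var):
    """Summarize a laboratory variable over time windows 0..cutoff_hour - 1,
    yielding a fixed-length feature set regardless of when AKI occurred."""
    window = obs[obs["hour"] < cutoff_hour].sort_values("hour")
    vals = window[var]
    return {f"{var}_first": vals.iloc[0], f"{var}_last": vals.iloc[-1],
            f"{var}_min": vals.min(), f"{var}_max": vals.max()}

# Outcome in time window 6: use measurements from windows 0 to 5 only.
print(window_features(obs, 6, "creatinine"))
# {'creatinine_first': 90.0, 'creatinine_last': 120.0,
#  'creatinine_min': 90.0, 'creatinine_max': 120.0}
```

The measurement at hour 7 is excluded because it falls after the cutoff, which is what prevents information from after the prediction time leaking into the features.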
The prediction models had a binary outcome (AKI = yes/no), but only patients with AKI = yes had a timestamp associated with the recorded outcome. Hence, an arbitrary end point in time was chosen for patients with AKI = no. Because most patients developed AKI between 20 and 25 hours after ICU admission, an arbitrary end point of 25 hours was chosen for AKI prediction.
Finally, patients who had the predicted outcome recorded within the first hour of ICU admission were removed from the analysis, as done in other predictive modeling studies.E20 This is because the hourly prediction model is intended to be used in the ICU, where it is impossible to predict an outcome that has already happened. Hence, 545 patients who had AKI on admission to the ICU were excluded from the development dataset and 309 patients from the validation dataset.
For both the logistic regression and BARTm models, all available variables were included, and the time points were treated as independent; ie, the models' predictions did not depend on previous time points.

Training, Testing, and Validation Data
To develop the models, the dataset for each lead time was divided into a training set (2/3 of the data) and a testing set (1/3 of the data). For every experiment, the models were developed using training data that did not contain any missing values. As explained previously, the models were tested and validated on both complete data and data with missing values.
Because AKI occurred at different times for patients in the ICU, the number of patients in each dataset (and consequently the number of data points used in creating the models) differed for each lead time (shown in Table E1). Since the same training data were used for both experiments, the mean number of patients in the training data was 2464 (standard deviation [SD] = 25.75). The mean number of patients in the testing data was 1212 (SD = 12.96) and 1740 (SD = 9.75) for Experiments 1 and 2, respectively. For the validation data, the mean number of patients was 1327 (SD = 21.78) for Experiment 1 and 2341 (SD = 25.18) for Experiment 2. As explained earlier, the training data were complete for both experiments, and the testing and validation datasets were complete for Experiment 1. For Experiment 2 (with missing data), the mean completeness across all lead times was 96.96% (SD = 0.04) for the testing data and 96.61% (SD = 0.06) for the validation data.

Performance Measures
The models' performance measures were calculated for the training, testing, and validation sets for each experiment and each lead time. The models were evaluated based on discrimination, ie, the area under the receiver operating characteristic curve, sensitivity, specificity, positive predictive value, and negative predictive value. The performance metrics were calculated using the optimal cutoff points, at which the sum of sensitivity and specificity was maximized. In this article, the mean and SD of each performance measure across all lead times are presented. The models' performance measures across all lead times were compared using t tests with the significance level set at .05.
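The cutoff at which the sum of sensitivity and specificity is maximized corresponds to the maximum of Youden's J statistic (TPR - FPR) along the ROC curve. A minimal Python/scikit-learn sketch of this selection, on toy predicted probabilities rather than the study's data:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy predicted AKI probabilities and true labels.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.35, 0.4, 0.5, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Optimal cutoff: maximize sensitivity + specificity, ie Youden's J = TPR - FPR.
j = tpr - fpr
best_idx = np.argmax(j)
best_cutoff = thresholds[best_idx]
sensitivity, specificity = tpr[best_idx], 1 - fpr[best_idx]
print(best_cutoff, sensitivity, specificity)
```

Sensitivity, specificity, and the predictive values reported in the article were then computed at this per-model, per-lead-time cutoff.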
To understand the applicability of the models in this specific patient population, the models' calibration was assessed using calibration plots and predicted versus observed probabilities for AKI. As the models developed in this paper were validated only internally, they were not recalibrated in the case of poor calibration, as the average of the predicted risks would then merely be made to match the event rate.E21
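The predicted-versus-observed comparison underlying a calibration plot can be sketched as below: group patients into risk deciles and compare the mean predicted probability with the observed AKI rate in each decile. This is an illustrative Python sketch on simulated risks, not the study's calibration code.

```python
import numpy as np

rng = np.random.default_rng(1)
p_pred = rng.uniform(0, 0.5, size=1000)   # simulated model-predicted AKI risks
y_obs = rng.binomial(1, p_pred)           # outcomes drawn at those risks

# Assign each patient to a risk decile, then compare mean predicted risk
# with the observed event rate per decile (a well-calibrated model matches).
deciles = np.digitize(p_pred, np.quantile(p_pred, np.linspace(0.1, 0.9, 9)))
for d in range(10):
    mask = deciles == d
    print(d, round(p_pred[mask].mean(), 3), round(y_obs[mask].mean(), 3))
```

Plotting the decile means against each other (with the 45-degree line as the ideal) gives the familiar calibration plot.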

FIGURE 4. Sensitivity and specificity for both models for each lead time, applied to training, testing, and validation datasets. AKI, Acute kidney injury; BARTm, bootstrap aggregated regression trees machine.

FIGURE E1. Distribution of when acute kidney injury is diagnosed in the ICU in the development phase (training and testing data) and validation datasets. AKI, Acute kidney injury; ICU, intensive care unit.
FIGURE E2.
FIGURE E3.
FIGURE E4.

TABLE 1. Descriptive statistics of demographic, surgery, and outcome variables based on the training, testing (development phase), and validation phases. SD, Standard deviation; BMI, body mass index; CABG, coronary artery bypass grafting; ICU, intensive care unit.

TABLE 2. Mean and standard deviation (SD) of the hours at which each laboratory variable was recorded in the development and validation datasets, where the P value signifies whether there is a statistically significant difference in measurement frequency between the development and validation phases.

Lead time before AKI | Logistic regression | BARTm
Area under the receiver operating characteristic curve (AUC) for both models for each lead time, applied to training, testing, and validation datasets. AKI, Acute kidney injury; BARTm, bootstrap aggregated regression trees machine.

TABLE 3. Mean and standard deviation (SD) of each performance measure for the training, testing, and validation data for both the BARTm and LR models.

TABLE E1. Number of patients in the training, testing, and validation datasets, with the proportion of patients with AKI and the missingness of the datasets. AKI, Acute kidney injury; SD, standard deviation.

TABLE E2. Descriptive statistics of all variables included in the models, where frequencies and percentages are shown for categorical variables and mean and standard deviation (SD) for numerical variables. The P values are derived using χ2 tests for categorical variables and Student t tests for numerical variables. BMI, Body mass index; EuroSCORE, European System for Cardiac Operative Risk Evaluation; LV, left ventricular; NYHA, New York Heart Association; MI, myocardial infarction; LMS, left main stem; PCI, percutaneous coronary intervention; CABG, coronary artery bypass grafting; ICU, intensive care unit; HCO3, bicarbonate.

TABLE E3. Mean variable importance for the top 10 variables for each lead time for the BARTm model. LMS, Left main stem; EuroSCORE, European System for Cardiac Operative Risk Evaluation; MI, myocardial infarction; PCI, percutaneous coronary intervention; BARTm, bootstrap aggregated regression trees machine.
JTCVS Open, Volume 16

TABLE E4. Model coefficients for the logistic regression models for lead times of 1 to 12 hours before AKI.

TABLE E5. Model coefficients for the logistic regression models for lead times of 13 to 24 hours before AKI.

TABLE E6. Performance measures for each model at each lead time, applied to each dataset, when predicting the onset of AKI within 25 hours, up to 24 hours in advance, in the ICU.

TABLE E7. Mean predicted probabilities for both models for each lead time and the actual proportion of patients with AKI, based on each experiment.