Clinical prediction models for acute kidney injury

Objective: To report on the currently available prediction models for the development of acute kidney injury in heterogeneous adult intensive care units.
Methods: A systematic review of clinical prediction models for acute kidney injury in adult intensive care unit populations was carried out. PubMed® was searched for publications reporting on the development of a novel prediction model, validation of an established model, or impact of an existing prediction model for early acute kidney injury diagnosis in intensive care units.
Results: We screened 583 potentially relevant articles. Among the 32 remaining articles in the first selection, only 5 met the inclusion criteria. The nonstandardized adaptations that were made to define baseline serum creatinine when the preadmission value was missing led to heterogeneous definitions of the outcome. Commonly included predictors were sepsis, age, and serum creatinine level. The final models included between 5 and 19 risk factors. The areas under the receiver operating characteristic curves to predict acute kidney injury development in the internal validation cohorts ranged from 0.78 to 0.88. Only two studies were externally validated.
Conclusion: Clinical prediction models for acute kidney injury can help in applying more timely preventive interventions to the right patients. However, in intensive care unit populations, few models have been externally validated. In addition, heterogeneous definitions for acute kidney injury and evaluation criteria and the lack of impact analysis hamper a thorough comparison of existing models. Future research is needed to validate the established models and to analyze their clinical impact before they can be applied in clinical practice.


INTRODUCTION
Critically ill patients need specific support to preserve the function of their vital organs, for which they are treated in intensive care units (ICUs). (1,2) Predicting the future health state of these patients is crucial, and clinicians try to foresee changes in their clinical situation as early as possible to adjust the treatments, prevent organ failure at an early stage and avoid unfavorable outcomes. Acute kidney injury (AKI) is one of the most prevalent organ failures in critically ill patients, affecting approximately 40% of ICU admissions. (3-6) Acute kidney injury is defined as a rapid decline in renal excretory function and is classified into three stages according to the degree of increase in serum creatinine (SCr) and/or the decline in urine output (UO). (7) However, both criteria are late and nonspecific signs of actual underlying kidney damage. For the early detection of AKI, several plasma or serum and urinary biomarkers have been identified and are available through diagnostic tests, (8-15) even though it is still unclear which patient populations would benefit from biomarker testing, what the timing of said testing should be, or what the therapeutic consequences of a positive test could be in the absence of a specific therapy for AKI. In addition, these biomarkers carry a certain financial cost.
The ICU is a data-rich environment. The typical critically ill patient, with continuous monitoring of vital signs, therapeutic devices, radiological imaging and serial laboratory measurements, yields large amounts of data. Electronic health record implementation is increasing worldwide, (16) making these data available for analysis, and as such, the ICU is an ideal environment to transform these data into valuable information for prediction. These large and complex electronic health record databases present a challenge for traditional statistical techniques, however. Automatic learning algorithms, also known as machine learning, could offer a better understanding of the complex variability and interactions among high-dimensional variables. (17) The goal of the present review is to investigate which internally validated clinical prediction models for early AKI diagnosis in the heterogeneous adult ICU are currently available.

METHODS

Search strategy
PubMed® was searched for review articles and full reports of retrospective and prospective studies that were published in English with full text availability from January 1st, 2012 to June 5th, 2019. Due to the lack of a unified definition for AKI prior to the Kidney Disease: Improving Global Outcomes (KDIGO) AKI criteria proposed in 2012, (7) we only investigated studies published over the past 7 years.
A combination of three search terms was used: "intensive care unit", "acute kidney injury", and "prediction". On the basis of a previous review, (18) we adapted these words into Medical Subject Heading (MeSH) terms and keywords in a title and abstract search (Table 1). We manually examined all the included studies. Potentially relevant articles identified by other sources and references of the retrieved literature were also included in the examination.

Inclusion criteria
Reviewers first screened all abstracts and titles of the articles that were retrieved by the abovementioned strategy. Articles recognized as relevant were further read in full if they developed a novel prediction model for early AKI diagnosis in the ICU, validated an established model for early AKI diagnosis in the ICU, or appraised the impact of an existing prediction model for early AKI diagnosis in the ICU. A prediction model for early AKI diagnosis was defined as a model using more than one risk factor to estimate the probability of AKI development.

Exclusion criteria
Studies were excluded if they were performed on patients outside the ICU setting; were performed in pediatric patients or infants; did not validate the models formally, internally or externally; (19) focused exclusively on biomarkers and did not contain a clinical prediction model for AKI; or lacked formal performance measures (at a minimum, the area under the receiver operating characteristic curve [AUROC], or c-statistic, should have been reported).

Data extraction
A data extraction form was used in accordance with previous reviews and guidelines. (18-20) Extracted items included study type (e.g., prospective, retrospective, case-control or cohort studies), population (e.g., cardiac, septic, or surgical patients), modeling method (e.g., multivariate logistic regression), AKI definition (e.g., Acute Kidney Injury Network [AKIN] or KDIGO), internal validation method (e.g., random split, cross-validation, or bootstrap), number of patients, type and number of predictors, and model performance (e.g., discrimination, calibration, or decision curve analysis). External validation results were reported when available.

Model performance
The evaluation of model performance concentrated on model discrimination, calibration, and decision curve analysis. Discrimination refers to the ability of a model to distinguish patients who develop the outcome from those who do not. One common way to report model discrimination is to visualize the receiver operating characteristic (ROC) curve and report the AUROC. (21) Calibration refers to the agreement between predicted probabilities and observed outcome proportions in the population. (22) Common methods used to evaluate model calibration include calibration in the large, calibration slope, and the Hosmer-Lemeshow test. (23) A well-calibrated model should have a Hosmer-Lemeshow test p-value larger than 0.05, a calibration slope close to one, and a calibration in the large close to zero. It is also encouraged to plot the observed outcome proportions versus predicted probabilities along with an indication of the calibration slope. (19) Decision curve analysis refers to the quantification of the net benefit of the prediction model across a range of different possible classification thresholds. (24) For a specific threshold, the prediction model with the highest net benefit will be the most clinically useful.
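As an illustration of the discrimination and calibration measures described above, the following minimal Python sketch computes the AUROC as the proportion of case/non-case pairs that the model ranks correctly, a simplified calibration in the large (observed event rate minus mean predicted risk), and the grouped predicted-versus-observed values that underlie a calibration plot. The function names and the six-patient toy data are assumptions made for illustration and are not taken from any of the reviewed studies.

```python
def auroc(y_true, y_prob):
    """Share of (case, non-case) pairs in which the case has the higher
    predicted risk; tied pairs count as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(y_true, y_prob):
    """Simplified version: observed event rate minus mean predicted risk
    (zero means the average prediction matches the observed incidence)."""
    return sum(y_true) / len(y_true) - sum(y_prob) / len(y_prob)


def calibration_groups(y_true, y_prob, n_groups=10):
    """Mean predicted risk and observed proportion per risk-ordered group,
    i.e. the points of a calibration plot."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    points = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if chunk:
            mean_pred = sum(p for p, _ in chunk) / len(chunk)
            observed = sum(y for _, y in chunk) / len(chunk)
            points.append((mean_pred, observed))
    return points


# Hypothetical toy data: 1 = developed AKI, 0 = did not.
outcomes = [0, 0, 1, 0, 1, 1]
risks = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]
print(auroc(outcomes, risks))
print(calibration_in_the_large(outcomes, risks))
print(calibration_groups(outcomes, risks, n_groups=2))
```

In this toy example, eight of the nine case/non-case pairs are ranked correctly. The formal calibration in the large and calibration slope are usually estimated by logistic recalibration, which this sketch does not implement.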

RESULTS
We identified 583 potentially relevant articles, of which 551 were excluded based on screening titles and abstracts. We further read the 32 remaining studies in full, and five were retained for detailed comparison as they satisfied the inclusion criteria (Figure 1). A comparative summary of the clinical prediction models for AKI in the ICU is shown in Table 2.
Malhotra et al. (25) conducted a prospective multicenter study in which they developed and internally as well as externally validated a risk score for early AKI diagnosis in ICU patients. First, predictor candidates within 48 hours after ICU enrollment were selected based on previous studies. Second, stepwise forward selection was used for feature selection. Third, the prediction model was built on a development cohort using multivariate logistic regression methodology with 5-fold cross-validation to predict the occurrence of all stages of AKI (KDIGO SCr criterion stages 1, 2, and 3) in the first week of ICU stay. Afterwards, they converted the coefficient-based regression prediction model to a point-based risk score ranging from zero to 21. They examined the generalizability of the risk score model by internally validating the model on a test cohort and externally validating it on an independent cohort. The AUROCs for internal and external validation were 0.79 (95% confidence interval [95%CI] 0.70 - 0.89) and 0.81 (95%CI 0.78 - 0.83), respectively. Calibration was reported with a Hosmer-Lemeshow test p-value of 0.293 for internal validation and with calibration plots for the internal and external validation cohorts. Since the Hosmer-Lemeshow test p-value was larger than 0.05 and there was agreement between the observed outcome proportions and predicted probabilities in the calibration curves, the prediction model was considered well calibrated.
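The conversion of a coefficient-based regression model to a point-based risk score can be sketched as follows. The exact procedure used in the study above is not described here, so this sketch assumes one common simplification: each coefficient is divided by the smallest coefficient and rounded to the nearest integer. The predictor names and coefficient values are hypothetical and serve only to illustrate the idea.

```python
def coefficients_to_points(coefficients):
    """Scale logistic regression coefficients by the smallest absolute
    non-zero coefficient and round to whole points (assumed method)."""
    base = min(abs(b) for b in coefficients.values() if b != 0)
    return {name: round(b / base) for name, b in coefficients.items()}


def risk_score(points, patient):
    """Sum the points of all binary risk factors present in a patient."""
    return sum(pts for name, pts in points.items() if patient.get(name))


# Hypothetical coefficients, not taken from any reviewed model.
coefs = {"predictor_a": 0.4, "predictor_b": 0.8, "predictor_c": 1.3}
points = coefficients_to_points(coefs)
patient = {"predictor_a": True, "predictor_b": False, "predictor_c": True}
print(points)
print(risk_score(points, patient))
```

Rounding makes a score easy to apply at the bedside but discards some information, so the performance of the point score would need to be checked separately from that of the underlying regression model.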
AKI Predictor, (26) an online prognostic calculator for early AKI diagnosis in the ICU, was created based on the database from a large multicenter randomized controlled trial. (27) Four prediction models were developed based on clinical information available at different time points: before ICU admission (baseline model), upon ICU admission (admission model), on the first day of ICU admission (day1 model), and 24 hours after ICU admission (day1+ model). First, candidate predictors were considered on the basis of the literature, expert opinion, and availability in the dataset. Second, feature selection was performed by using bootstrapped backward elimination analysis. Third, models were developed on a development cohort using a random forest algorithm to predict the occurrence of all stages of AKI (KDIGO SCr criterion stages 1, 2, and 3; AKI-123) and the more severe stages (KDIGO SCr criterion stages 2 and 3; AKI-23) separately in the first week of ICU stay. Fourth, the performance of the clinical model was internally validated by bootstrapping. Subsequently, models were validated on an unseen independent validation cohort, where the prognostic performance was compared against neutrophil gelatinase-associated lipocalin (NGAL) measurements obtained at ICU admission. The AUROC [...]
[...] (28) carried out a prospective multicenter observational study of the efficacy of three individual biomarkers (serum cystatin C [sCysC], urinary N-acetyl-β-D-glucosaminidase [uNAG], and the urinary albumin/creatinine ratio [uACR]) and their combinations on the early diagnosis of AKI in ICU patients. First, to predict the occurrence of all stages of AKI (KDIGO SCr criterion stages 1, 2, and 3; any AKI) and the more severe stages (KDIGO SCr criterion stages 2 and 3; severe AKI) separately in the first week of ICU stay of the patient, the authors looked into the performance of the three biomarkers separately. Second, to improve the performance of these biomarkers in AKI detection, they created three models consisting of different biomarker combinations and examined their performance. Third, the clinical variables that were shown to be significantly different between patients with and without AKI (two-sided p-value less than 0.1) in univariate analysis within 1 day after ICU admission were regarded as candidates in the final model. Fourth, a stepwise selection was utilized to select variables used in the final prediction models. Fifth, they built prediction models for the early diagnosis of AKI with multivariate logistic regression, investigated the contribution of biomarkers to the clinical prediction models, and internally validated the performance by bootstrapping. The AUROCs of the multivariate models of sCysC and uNAG for any AKI and severe AKI were 0.756 (95%CI 0.723 - 0.789) and 0.863 (95%CI 0.827 - 0.900), respectively. The AUROCs of the prediction models for any AKI and severe AKI were 0.821 (95%CI 0.792 - 0.850) and 0.908 (95%CI 0.881 - 0.934), respectively. The AUROCs of the prediction models combined with sCysC and uNAG for any AKI and severe AKI were 0.836 (95%CI 0.808 - 0.864) and 0.918 (95%CI 0.893 - 0.944), respectively.
Chiofolo et al. (29) performed a single-center retrospective study on a heterogeneous cohort of adults admitted to ICUs, with the aim of developing and validating a prediction model for AKI development of any stage (AKIN SCr and/or UO criteria without inclusion of dialysis rules). First, predictors up to 24 hours before the time of prediction with AUROC larger than 0.5 in the univariate analysis were selected. Second, selected predictors were further chosen based on their AUROC and clinical judgment. Third, a stepwise regression algorithm was conducted to confirm the predictor selection. Fourth, random forest models were built with different feature combinations. Fifth, the final model was selected based on the AUROC and early AKI detection percentage and internally validated in a cohort that was randomly split from the original data. Finally, by adjusting different classification thresholds, the prediction model could be used to detect any stage of AKI or only moderate-severe AKI. To continuously monitor ICU patients, the prediction model was designed to output the AKI development probability every 15 minutes from ICU admission to discharge. The AUROCs in the development and validation cohorts were 0.949 (95%CI 0.943 - 0.954) and 0.882 (95%CI 0.867 - 0.897), respectively. The well-calibrated performance was confirmed by a Hosmer-Lemeshow test p-value of 0.3. In addition, based on the highest net benefit of the model across a wide range of classification thresholds (approximately from 5% to 95%) in the decision curve analysis, the clinical usefulness of the model was verified.
Zimmerman et al. (30) reported a retrospective single-center study in which they developed and validated prediction models for AKI development of any stage (KDIGO SCr criteria). First, demographics, clinical data and laboratory test measurements within 1 day after ICU admission that were identified as AKI risk factors in previous studies were selected. Second, variables missing more than 20% of their values were removed, and the missing values of the remaining variables were imputed by using multivariate imputation by chained equations (MICE). Third, variables that were not significantly associated with the outcome in the univariate analysis (p-value > 0.05) were excluded. Fourth, a backward elimination method was applied to variables significantly associated with the outcome (p-value ≤ 0.05). Fifth, prediction models for AKI development within 72 hours after ICU admission were developed by using multivariate logistic regression, a random forest method, and a multilayer perceptron based on the selected variables. Sixth, the models were internally validated by ten runs of 5-fold cross-validation, and measurements were averaged over the ten runs for models with and without use of the backward elimination method. The averaged AUROC in the internal validation cohort for multivariate logistic regression, the random forest method, and the multilayer perceptron models with backward selection were 0.780, 0.772, and 0.792, while the averaged AUROC without backward elimination method were 0.783, 0.779, and 0.796, respectively.
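The internal validation scheme described above, ten runs of 5-fold cross-validation with measurements averaged over the runs, can be sketched in plain Python. The sketch only generates the train/test index splits and averages a score over them; `evaluate` is a hypothetical callback that would fit a model on the training indices and return a performance measure (for example the AUROC) on the test indices. It is not the implementation used in the study.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for each fold of each repeat.
    Every repeat reshuffles the samples before splitting them into folds."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, test


def repeated_cv_mean(n_samples, evaluate, n_folds=5, n_repeats=10, seed=0):
    """Average a performance measure over all folds of all repeats."""
    scores = [evaluate(train, test)
              for train, test in repeated_kfold(n_samples, n_folds, n_repeats, seed)]
    return sum(scores) / len(scores)


# Hypothetical callback that simply returns the share of held-out samples.
splits = list(repeated_kfold(20))
mean_share = repeated_cv_mean(20, lambda train, test: len(test) / 20)
print(len(splits), mean_share)
```

Repeating the split with different shuffles reduces the dependence of the estimate on one particular partition of the data.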

DISCUSSION
In this systematic review, we identified five studies in which clinical prediction models for the early diagnosis of AKI were built and validated in heterogeneous cohorts of ICU patients. One study focused on the assessment of performance with and without the inclusion of biomarkers in clinical prediction models. (28) AKI was defined using the KDIGO criteria in four studies, (25,26,28,30) among which three studies did not include the UO criteria, (25,26,30) and one study used the AKIN criteria without inclusion of dialysis rules. (29) Definitions of baseline SCr and methods to handle missing pre-ICU values were heterogeneous, hindering model comparison. Three models reported calibration measures. (25,26,29) Only two models were validated externally. (25,26) The number of independent predictors in the final models ranged from 5 to 19, but no single predictor was used consistently in all studies. To the best of our knowledge, to date there has been no impact analysis of these prediction models in practical clinical settings.

AKI definition and severity
Although the discussed models were mostly based on the latest unified KDIGO AKI criteria, (25,26,28,30) we found that the AKI definitions had been adapted when the necessary data were not available. First, three studies defined AKI using the KDIGO SCr criterion without including UO. (25,26,30) This definition is understandable in view of the difficulty of measuring and recording UO every hour, which requires manual data entry. Nevertheless, not using UO, which is an important component of the AKI definition, is suboptimal since the change in UO could be a more sensitive marker of renal dysfunction compared to SCr. (31) A second and perhaps even more crucial issue was the lack of a standardized method to define the baseline SCr concentration, which largely influenced the SCr change-based definition and classification of AKI. To determine the baseline SCr, one study used the day 1 minimum creatinine level, (30) while the others used preadmission SCr values. Regarding missing data in the preadmission values, two studies used admission SCr, (25,28) and the other two studies estimated the baseline SCr by back calculation from the Modification of Diet in Renal Disease (MDRD) formula with an assumed normal glomerular filtration rate. (26,29) However, none of these estimations is without problems. Studies have shown that MDRD back calculation may overestimate the incidence of AKI and admission SCr underestimates it. (32) With respect to the severity of predictive AKI outcome, two studies investigated all stages of AKI (stage 1, 2, or 3 AKI), (25,30) and the others looked into both moderate to severe AKI (stage 2 or 3 AKI) and all stages of AKI. (26,28,29) The stronger association between predictors and outcome for higher severity AKI (33) could explain the more robust performance of the models built for moderate to severe AKI.
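To make the influence of the baseline definition concrete, the following sketch back-calculates a baseline SCr from the four-variable MDRD equation and applies a simplified KDIGO SCr staging. Several elements are assumptions made for illustration: the assumed glomerular filtration rate of 75 mL/min/1.73 m², the coefficient 175 (the version of the equation for standardized creatinine assays), SCr expressed in mg/dL, and the omission of the KDIGO time windows, UO criteria and renal replacement therapy rule. The patient values are hypothetical.

```python
def mdrd_baseline_scr(age_years, female, black=False, assumed_gfr=75.0):
    """Solve eGFR = 175 * SCr**-1.154 * age**-0.203 * (0.742 if female)
    * (1.212 if black) for SCr (mg/dL) at an assumed eGFR."""
    factor = 175.0 * age_years ** -0.203
    if female:
        factor *= 0.742
    if black:
        factor *= 1.212
    return (assumed_gfr / factor) ** (-1.0 / 1.154)


def kdigo_scr_stage(scr, baseline):
    """Simplified KDIGO stage from SCr alone (mg/dL); ignores time windows,
    urine output and renal replacement therapy."""
    ratio = scr / baseline
    if not (ratio >= 1.5 or scr - baseline >= 0.3):
        return 0
    if ratio >= 3.0 or scr >= 4.0:
        return 3
    if ratio >= 2.0:
        return 2
    return 1


# Hypothetical 60-year-old man with admission SCr 1.3 and later SCr 1.5 mg/dL.
estimated = mdrd_baseline_scr(60, female=False)
print(round(estimated, 2))
print(kdigo_scr_stage(1.5, baseline=1.3))        # admission SCr as baseline
print(kdigo_scr_stage(1.5, baseline=estimated))  # MDRD back-calculated baseline
```

For this hypothetical patient, the admission value as baseline yields no AKI, whereas the back-calculated baseline yields stage 1 AKI. The same creatinine course is therefore labeled differently depending on the baseline definition.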

Validation, calibration, and decision curve analysis
To prevent overly optimistic claims about performance, one study utilized random split, (29) two used cross-validation, (25,30) and the others employed bootstrapping (26,28) for internal validation. It is recommended to use bootstrapping to estimate internal validity, as it provides stable estimates and reduces bias. (34,35) Two studies properly externally validated their models, (25,26) while the others did not perform external validation. (28-30) External validation is a necessary step to assess the generalizability of a model in a previously unseen setting. The more the models are externally validated in diverse settings, the more their generalizability can be trusted.
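One widely described form of bootstrap internal validation is the optimism correction: the model-building procedure is repeated on bootstrap samples, and the average drop in performance from each bootstrap sample to the original data is subtracted from the apparent performance. The sketch below illustrates this under simplifying assumptions. The "model" is a toy stand-in that merely selects the single predictor with the highest apparent AUROC, and the data are simulated. It does not reproduce the procedure of any reviewed study.

```python
import random


def auroc(y_true, scores):
    """Rank-based AUROC; ties between a case and a non-case count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fit(X, y):
    """Toy model development: index of the predictor column with the
    highest apparent AUROC (hypothetical stand-in for a real model)."""
    return max(range(len(X[0])), key=lambda j: auroc(y, [row[j] for row in X]))


def score(model, X, y):
    return auroc(y, [row[model] for row in X])


def optimism_corrected_auroc(X, y, n_boot=200, seed=0):
    """Return (apparent AUROC, optimism-corrected AUROC)."""
    rng = random.Random(seed)
    n = len(y)
    apparent = score(fit(X, y), X, y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:  # resample must contain both outcome classes
            continue
        model = fit(Xb, yb)   # repeat the whole development step
        optimism.append(score(model, Xb, yb) - score(model, X, y))
    return apparent, apparent - sum(optimism) / len(optimism)


# Simulated data: one informative predictor and three noise predictors.
data_rng = random.Random(1)
y = [1 if i % 3 == 0 else 0 for i in range(60)]
X = [[yi + data_rng.gauss(0, 1)] + [data_rng.gauss(0, 1) for _ in range(3)]
     for yi in y]
apparent, corrected = optimism_corrected_auroc(X, y)
print(apparent, corrected)
```

The important design point is that every development step, including variable selection, is repeated inside each bootstrap sample. Otherwise the optimism introduced by that step is not captured.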
Although all five studies reported fair discrimination, two studies did not examine performance with respect to calibration. (28,30) A prediction model is of no practical use if it achieves high discrimination but poor calibration because calibration demonstrates whether the model applies to the population examined. Furthermore, in three studies that included a calibration measure, one study used calibration in the large and the calibration slope with the calibration curve, (26) and the others used the Hosmer-Lemeshow test with the calibration curve. (25,29) Notably, even though the Hosmer-Lemeshow test is widely used, it is sensitive to sample size and incidence of the predicted outcome. (36) Therefore, caution should be exercised when assessing model calibration by the Hosmer-Lemeshow test.
Additionally, two of the five studies further conducted decision curve analysis. (26,29) Both studies showed the clinical usefulness of their models for a wide range of classification thresholds. Decision curve analysis is recommended not only because it allows the choosing of a model with the highest net benefit but also because it provides a range of clinically useful classification thresholds.
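The net benefit underlying a decision curve can be computed directly from its usual definition: the proportion of true positives minus the proportion of false positives weighted by the odds of the classification threshold. The sketch below evaluates a model and the "treat all" reference strategy over a few thresholds. The toy data and the chosen thresholds are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit = TP/n - FP/n * threshold / (1 - threshold),
    classifying a patient as high risk when predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy in which every patient is classified as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data: 1 = developed AKI, 0 = did not.
outcomes = [0, 0, 1, 0, 1, 1]
risks = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]
for t in (0.25, 0.5, 0.75):
    print(t, net_benefit(outcomes, risks, t), net_benefit_treat_all(outcomes, t))
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both the "treat all" strategy and the "treat none" strategy, whose net benefit is zero by definition.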

Study design, comparison with biomarkers, and commonly included risk predictors
Two studies were prospective, (25,28) three studies were multicentered with sample sizes greater than 500 patients, (25,26,28) and all of the studies were based on databases of mixed critically ill patients admitted to the ICU. The large sample sizes in the studies provided crucial insights into the development of prediction models for the early diagnosis of AKI in ICU populations, which complemented reviews in general hospital populations, (37) patients after liver transplantation, (38) patients after cardiac surgery, (39) and major non-cardiac surgical patients. (18) Data from multicentered studies have greater heterogeneity, which leads to models with lower chances of overfitting. The incidence of AKI ranged from 16.5% to 30.1%.
Although it met the inclusion criteria and thus included at least one prediction model, one study was designed specifically to examine the predictive performance of biomarkers instead of clinical prediction models. (28) To investigate early AKI diagnosis in ICU populations, the authors investigated three biomarkers (sCysC, uNAG, uACR) and their combinations. (28) In contrast, the remaining four studies focused on prediction models, (25,26,29,30) among which one made a comparison between prediction models with and without inclusion of a biomarker (NGAL). (26) A comparison between biomarkers, prediction models, and combined models for early AKI diagnosis in adult ICUs is presented in Table 3. In the two studies that compared biomarkers and prediction models, (26,28) the prediction models outperformed biomarkers with regard to discrimination. It is also noteworthy that after combining the biomarkers with the clinical prediction models, improved discrimination was reported in both studies, but whether the small increase in discrimination after biomarker inclusion compensates for the added costs of performing a biomarker test is debatable.
Although all models included a variable selection methodology, there was large variability across the different studies in the final variables used. Among the 47 predictive variables identified across the five prediction models, no single variable was used consistently in all studies (Table 4). Nonetheless, sepsis, age, and creatinine level were commonly included predictors (present in more than two studies).

Limitations
First, the main limitation of this systematic review is the small number of studies included due to the very strict inclusion criteria, which could have prevented us from identifying less formal but novel studies. Second, the protocol of our systematic review was not published in advance in an international database of prospectively registered systematic reviews, such as PROSPERO (https://www.crd.york.ac.uk/prospero). Nevertheless, the literature was searched in a systematic way following an appropriate methodology based on recent guidelines. (40) Third, we only used the PubMed® database to search for potentially relevant articles for this systematic review, and as such, we could have missed manuscripts that are available in other databases. Fourth, only studies published in the last seven years were retrieved. This predefined 7-year time window is justified since there were multiple AKI definitions before the publication of the KDIGO AKI criteria in 2012, and we wanted to avoid heterogeneous AKI definitions by only investigating studies completed after the publication of these unified AKI criteria. However, it should be acknowledged that the nonstandardized adaptations that were made to define baseline SCr when the preadmission value was missing still led to heterogeneous definitions of the outcome. Fifth, no recommendation of one model over the others was given, since only two out of five studies used an external validation cohort, and since no impact analysis was performed. Finally, since we did not obtain individual participant data from these studies, a meta-analysis was not possible.

CONCLUSION
Several clinical models that can predict acute kidney injury in patients admitted to the intensive care unit have been developed over the past years. Since only two of these models have been validated in an independent cohort, their clinical use remains limited. To make these models robust enough for decision support in clinical practice, three important recommendations need to be followed. First, blinded prospective clinical trials for independent validation need to be set up. Second, these validation studies should use calibration analysis and decision curve analysis, in addition to the more universally reported discrimination analysis, as evaluation criteria. Third, to make the models more generalizable, the use of the KDIGO acute kidney injury criteria and established definitions for baseline serum creatinine is recommended. It is clear that the availability of large high-resolution datasets combined with novel statistical and machine learning tools opens opportunities to develop and validate robust predictive models for acute kidney injury of potential benefit in patient care and risk stratification.