External validation of SAPS 3 and MPM0-III scores in 48,816 patients from 72 Brazilian ICUs

Background: The performance of severity-of-illness scores varies across scenarios, so these scores must be validated before being used in a specific setting or geographic region. Moreover, model calibration may deteriorate over time, and the performance of such instruments should be reassessed regularly. Therefore, we aimed to validate the SAPS 3 in a large contemporary cohort of patients admitted to Brazilian ICUs. In addition, we compared the performance of the SAPS 3 with that of the MPM0-III.
Methods: This was a retrospective cohort study of 48,816 adult patients (medical admissions = 67.9%) admitted to 72 Brazilian ICUs during 2013. We evaluated discrimination using the area under the receiver operating characteristic curve (AUROC). We applied the calibration belt to evaluate the agreement between observed and expected mortality rates (calibration).
Results: Mean SAPS 3 score was 44.3 ± 15.4 points. ICU and hospital mortality rates were 11.0 and 16.5%, respectively. We estimated predicted mortality using both the standard (SE) and the Central and South American (CSA) customized equations. Predicted mortality rates were 16.4 ± 19.3% (SAPS 3-SE), 21.7 ± 23.2% (SAPS 3-CSA) and 14.3 ± 14.0% (MPM0-III). Standardized mortality ratios (SMR) obtained for each model were 1.00 (95% CI, 0.98–1.02) for the SAPS 3-SE, 0.75 (0.74–0.77) for the SAPS 3-CSA and 1.15 (1.13–1.18) for the MPM0-III. Discrimination was better for the SAPS 3 models (AUROC = 0.85) than for the MPM0-III (AUROC = 0.80) (p < 0.001). In the calibration belt analysis, the SAPS 3-CSA overestimated mortality throughout all risk classes, while the MPM0-III underestimated it uniformly. The SAPS 3-SE did not show relevant deviations from ideal calibration.
Conclusions: In a large contemporary database, the SAPS 3-SE was accurate in predicting outcomes, supporting its use for performance evaluation and benchmarking in Brazilian ICUs.
Electronic supplementary material The online version of this article (doi:10.1186/s13613-017-0276-3) contains supplementary material, which is available to authorized users.


Background
Severity-of-illness scores have broad applicability in the intensive care setting. Although they should not be used on an individual basis, they are useful to evaluate ICU performance, to monitor it over time, to guide resource management and quality improvement, and for benchmarking purposes [1]. However, the performance of these models varies across scenarios because of differences in case mix, clinical management patterns, admission policies, and pre- and post-ICU care. Therefore, severity-of-illness scores must be validated before their use in a specific setting or geographic region.
The three most used severity-of-illness scores are the Acute Physiology and Chronic Health Evaluation (APACHE) [2], the Mortality Probability Models (MPM0-III) [3] and the Simplified Acute Physiology Score (SAPS 3) [4,5]. Among them, the SAPS 3 is the only score developed using data from patients and intensive care units (ICUs) worldwide (307 ICUs in 35 countries). Besides a general standard equation, its investigators also developed seven regional equations to estimate hospital mortality, thus allowing comparisons among ICUs on a more common level.
In 2009, the Brazilian Association of Intensive Care (Associação de Medicina Intensiva Brasileira, AMIB) chose the SAPS 3 as the severity-of-illness score recommended for performance evaluation and benchmarking in Brazilian ICUs [6]. However, to our knowledge, validation studies have reported conflicting results and were mostly single-center studies involving specific patient populations [7][8][9][10][11][12][13] or relatively small sample sizes [14][15][16]. Moreover, as the calibration of severity-of-illness scores is expected to deteriorate over time, the performance of such instruments should be reassessed on a regular basis [17]. Therefore, in the present study, we aimed to validate the SAPS 3 in a large contemporary cohort of patients admitted to Brazilian ICUs. In addition, we compared the performance of the SAPS 3 with that of the MPM0-III.

Design and setting
This was a secondary analysis of the ORCHESTRA study, a multicenter retrospective cohort study of critical care organization and outcomes in 59,693 patients admitted to 78 ICUs at 51 Brazilian hospitals during 2013 [18].

Selection of participants, data collection and definitions
Participating ICUs in the ORCHESTRA study were selected from the Brazilian Research in Intensive Care Network (BRICNet). For the purposes of the present study, we excluded ICUs exclusively admitting cardiac patients (n = 6) (Fig. 1), leaving a total of 72 ICUs at 50 hospitals. We included all consecutive patients aged ≥16 years admitted to the participating ICUs during 2013. In the ORCHESTRA study, readmissions and patients with missing core data [age, location before ICU admission, main ICU admission diagnosis, SAPS 3 score, ICU and hospital length of stay (LOS) and vital status at hospital discharge] were excluded. In the present study, besides the patients admitted to cardiac units (n = 3951), we also excluded those who did not meet both the SAPS 3 and MPM0-III eligibility criteria [patients aged <18 years (n = 358), those who underwent cardiac surgery (n = 2971), and those with acute myocardial infarction (n = 3568) or burns (n = 29)]. Therefore, a total of 48,816 patients constituted the study population.
We obtained de-identified patient data from the Epimed Monitor System® (Epimed Solutions®, Rio de Janeiro, Brazil), a commercial cloud-based registry for quality improvement, performance evaluation and benchmarking purposes. ICUs using the Epimed Monitor System® prospectively collect data in a structured electronic case report form, most typically through a trained case manager. Key data elements included demographics, admission diagnosis, location before ICU admission, comorbidities based on the Charlson Comorbidity Index [19], functional status one week before hospital admission [20], scores including the SAPS 3, the MPM0-III and the Sequential Organ Failure Assessment (SOFA) score [21], use of ICU support, ICU and hospital LOS and destination after hospital discharge. The SAPS 3 and MPM0-III scores were calculated using data from ICU admission (±1 h). As recommended, missing values were coded as the reference or "normal" category for each variable.
Estimated mortality rates using both the standard equation (SAPS 3-SE) and the one customized for Central and South American countries (SAPS 3-CSA) are provided in the system. In the present study, the primary outcome of interest was in-hospital mortality at the patient level.
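To illustrate how a SAPS 3 score is converted into the two predicted mortality rates mentioned above, the following is a minimal sketch. The coefficients are transcribed from the SAPS 3 development publication [5] as we recall them and are an assumption to be verified against that source before any real use; the function name `saps3_predicted_mortality` and the dictionary `SAPS3_EQUATIONS` are hypothetical and do not come from the registry software.

```python
import math

# Assumed coefficients (intercept, offset added to the score, slope on the log term).
# Transcribed from the SAPS 3 publication [5]; verify against the source before use.
SAPS3_EQUATIONS = {
    "SE": (-32.6659, 20.5958, 7.3068),    # standard (general) equation
    "CSA": (-64.5990, 71.0599, 13.2322),  # Central and South American equation
}


def saps3_predicted_mortality(score, equation="SE"):
    """Return the predicted probability of hospital death for a SAPS 3 score."""
    intercept, offset, slope = SAPS3_EQUATIONS[equation]
    logit = intercept + math.log(score + offset) * slope
    # Inverse logit turns the linear predictor into a probability.
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    for name in ("SE", "CSA"):
        risk = saps3_predicted_mortality(44, name)
        print(f"SAPS 3-{name}, score 44: predicted mortality {risk:.1%}")
```

Under these assumed coefficients, a score of 44 points (close to the cohort mean) yields a higher predicted risk with the CSA equation than with the standard equation, which matches the direction of the difference between the two mean predicted mortality rates reported in this study.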

Statistical analysis
We described ICU and patient characteristics using standard descriptive statistics and reported continuous variables as mean ± standard deviation or median (25th-75th percentiles, interquartile range, IQR), as appropriate. We reported categorical variables as absolute numbers (percentages).
We assessed discrimination (the ability of each model to distinguish patients who survived from those who died) by estimating the area under the receiver operating characteristic curve (AUROC). Pairwise comparisons of the AUROCs of the three scores were performed using the DeLong method [22]. We used the calibration belt, proposed by the GiViTI group [23,24], to investigate the relationship between observed and expected outcomes. In this approach, a generalized polynomial logistic function relating the outcome to the logit transformation of the predicted probability is fitted, together with its 80 and 95% confidence interval (CI) boundaries. A statistically significant deviation from the bisector (the line of perfect calibration) occurs when the 95% CI boundaries of the calibration belt do not include the bisector [23]. Calibration curves were constructed by plotting predicted mortality rates (x-axis) against observed mortality rates (y-axis). Standardized mortality ratios (SMR) with their respective 95% CIs were calculated for each model by dividing observed by predicted mortality rates. A two-tailed p value <0.05 was considered statistically significant. We performed the statistical analyses using R (http://www.r-project.org) and SPSS 21 (IBM Corp., Armonk, NY).
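The discrimination and SMR calculations described above can be sketched as follows. This is an illustration only, not the code used in the study (the analyses were run in R and SPSS). The function names `auroc` and `smr_with_ci` are hypothetical, and the toy data are invented. The AUROC uses the rank (Mann-Whitney) definition. The confidence interval treats the number of observed deaths as Poisson-distributed on the log scale, which is an assumption because the text does not state how the SMR intervals were obtained. The DeLong comparison and the calibration belt are not reproduced here.

```python
import math


def auroc(predicted, died):
    """Probability that a non-survivor has a higher predicted risk than a survivor.

    Ties between a non-survivor and a survivor count as 0.5.
    """
    pos = [p for p, d in zip(predicted, died) if d]
    neg = [p for p, d in zip(predicted, died) if not d]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def smr_with_ci(predicted, died, z=1.96):
    """SMR = observed deaths / expected deaths, with an approximate 95% CI.

    Assumption: log-scale normal approximation with Poisson variance for the
    observed deaths; the expected count is treated as fixed.
    """
    observed = sum(died)
    expected = sum(predicted)
    smr = observed / expected
    half_width = z / math.sqrt(observed)
    return smr, smr * math.exp(-half_width), smr * math.exp(half_width)


# Invented toy data: predicted risks and vital status (1 = died in hospital).
predicted = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
died = [0, 0, 1, 0, 1, 1]

auc = auroc(predicted, died)
smr, ci_low, ci_high = smr_with_ci(predicted, died)
print(f"AUROC = {auc:.3f}")
print(f"SMR = {smr:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The pairwise loop in `auroc` is quadratic in the number of patients and was chosen for readability; a rank-based implementation would be preferable for a cohort of this size.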

Results
Predicted mortality rates were 16.4 ± 19.3% (SAPS 3-SE), 21.7 ± 23.2% (SAPS 3-CSA) and 14.3 ± 14.0% (MPM0-III). Table 3 gives the performance analyses for the studied scores. In summary, the SMR was appropriate using the SAPS 3-SE, while the SAPS 3-CSA overestimated and the MPM0-III underestimated hospital mortality. Overall, discrimination was good, but higher for the SAPS 3 (Table 3). Calibration was acceptable for the SAPS 3-SE only. In the calibration belt analysis of the SAPS 3-SE, there was only minimal overprediction (below the first percentile) and underprediction (between the 8th and 14th percentiles), both within the first two risk deciles. Conversely, the SAPS 3-CSA uniformly overestimated mortality across the entire risk range, and the MPM0-III generally tended to underestimate it (Figs. 2, 3).
As most of the included ICUs were located at private hospitals, we performed subgroup analyses according to the type of hospital and in specific subgroups of patients (Additional file 1: eTable 1 and eFigures 1-8). In patients admitted to private hospitals, results were comparable to those observed for the whole study population, and the SAPS 3-SE was the only model with good performance. However, in patients admitted to public hospitals, none of the models was accurate in predicting hospital mortality. Finally, we performed additional analyses of SAPS 3 performance in all patients (n = 55,742) fulfilling only the eligibility criteria reported in the original publication of the model [4]. Discrimination (AUROC = 0.855 for both the SAPS 3-SE and SAPS 3-CSA) and calibration (Additional file 1: eFigure 3) were also appropriate. In Additional file 1: eTable 2, we provide information on patients' characteristics and outcomes for our cohort and for the one reported in the SAPS 3 study.
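As a rough consistency check, the SMRs can be approximated from the rounded rates reported in this article: observed hospital mortality of 16.5% divided by each mean predicted mortality. This is a sketch on published, rounded figures, so small differences from the tabulated SMRs (which were presumably computed on patient-level data) are expected; the variable names are hypothetical.

```python
# Rounded rates as reported in the text (percent).
observed_hospital_mortality = 16.5
mean_predicted_mortality = {
    "SAPS 3-SE": 16.4,
    "SAPS 3-CSA": 21.7,
    "MPM0-III": 14.3,
}

# SMR = observed / predicted; above 1 means the model underestimates mortality,
# below 1 means it overestimates mortality.
approx_smr = {
    model: observed_hospital_mortality / predicted
    for model, predicted in mean_predicted_mortality.items()
}

for model, value in approx_smr.items():
    direction = "underestimates" if value > 1 else "overestimates"
    print(f"{model}: approximate SMR {value:.2f} ({direction} mortality)")
```

The ratios land close to 1 for the SAPS 3-SE, well below 1 for the SAPS 3-CSA and above 1 for the MPM0-III, in line with the direction of the reported SMRs.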

Discussion
In the present study, we demonstrated that the SAPS 3-SE accurately predicted outcomes in a large contemporary cohort of Brazilian ICU patients. Conversely, the MPM0-III had relatively worse calibration and tended to significantly underestimate mortality, while the SAPS 3-CSA overestimated mortality despite reasonable discrimination. Moreover, the SAPS 3-SE provided more precise estimates, resulting in an SMR closer to 1.0. In the calibration curves, observed mortality for the SAPS 3-SE was uniformly closer to the line of ideal prediction across all risk classes.
[Calibration belt figure caption: The times the calibration belt significantly deviates from the bisector using 80 and 95% confidence levels are described in the lower right part of the plots.]
In recent years, mostly driven by official AMIB recommendations, the SAPS 3 has become the severity-of-illness score used in the vast majority of Brazilian ICUs for performance evaluation and benchmarking. However, validation studies of the SAPS 3 were performed in specific subgroups of patients or in single-center studies involving a general ICU population [7][8][9][10][11][12][13][14][15][16]. In general, both the SAPS 3-SE and SAPS 3-CSA equations were evaluated in these studies. Overall, discrimination was usually good, but calibration results varied among the studies.
In these previous studies, the SAPS 3-SE had poor calibration and usually tended to underestimate mortality [7][8][9][10]12]. The SAPS 3-SE tended to overestimate mortality in only two studies (one of them comprising patients with acute coronary syndromes), both with a relatively low mortality rate [11,16]. On the other hand, the SAPS 3-CSA accurately predicted mortality in five studies involving patients with cancer [8,9], patients with acute kidney injury [10,12] and those who underwent surgical procedures [15]. The MPM0-III, however, was inaccurate in predicting mortality in our cohort, a finding in line with almost all previous studies performed in Brazil [9,10,12,16].
A known phenomenon affects traditional calibration statistics (such as the Hosmer-Lemeshow goodness-of-fit test) in validation studies of prediction models that include many thousands of subjects: p values are often highly significant despite visually good calibration curves, very small absolute errors, and acceptable calibration slope and intercept. This occurs because, with a large sample size, statistical power is high enough to flag clinically irrelevant differences as statistically significant. At the other extreme, calibration results from small cohorts must be interpreted with caution because, even when the calibration curve, intercept and slope point to miscalibration, the p values of traditional calibration statistics may not be significant, raising concern about low study power [25]. Therefore, in small cohorts, a visibly misaligned calibration curve may coexist with a nonsignificant test simply because the sample is too small to reach statistical significance [26]. In addition, the previous studies included specific subgroups of patients, so their results may not be fully transferable to general populations of critically ill patients in other settings.
It is well known that the performance of prognostic scores (chiefly their calibration) tends to deteriorate over time. Zimmerman et al. [2] elegantly demonstrated this when reporting the development of the APACHE IV. Soares et al. [8] also documented the deterioration of calibration over time when studying the SAPS 3 in a cohort of patients with cancer admitted to the ICU over a 3-year period in Brazil. This is why the performance of prognostic scores should be reassessed periodically.
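The dependence of calibration p values on sample size can be illustrated with a simplified calculation. The sketch below does not use the Hosmer-Lemeshow test; it substitutes a simpler calibration-in-the-large z-test (observed versus expected deaths) and plugs in expected counts rather than simulated data. The scenario is hypothetical: every patient has a predicted risk of 16.5%, and true mortality is 3% higher in relative terms. The function name `expected_z_test` is an assumption made for illustration.

```python
import math


def expected_z_test(n, predicted_risk, relative_excess):
    """Expected z statistic and two-sided p value for observed vs expected deaths.

    Hypothetical scenario: all n patients share the same predicted risk and
    true mortality exceeds the prediction by `relative_excess`.
    """
    expected_deaths = n * predicted_risk
    # Expected observed count under the miscalibrated truth (no sampling noise).
    observed_deaths = expected_deaths * (1.0 + relative_excess)
    # Standard deviation of the death count if the model were perfectly calibrated.
    sd = math.sqrt(n * predicted_risk * (1.0 - predicted_risk))
    z = (observed_deaths - expected_deaths) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return z, p_value


for n in (500, 5000, 48816):
    z, p = expected_z_test(n, predicted_risk=0.165, relative_excess=0.03)
    print(f"n = {n:>6}: z = {z:.2f}, p = {p:.4f}")
```

The same small relative miscalibration is far from significance in a cohort of 500 patients and falls below the 0.05 threshold in a cohort the size of the present one, because the z statistic grows with the square root of the sample size.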
Cohort composition could also affect score performance. Compared with the original SAPS 3 development cohort, our cohort had a comparable median age, but medical patients predominated (67.9 vs. 43.5% in the SAPS 3 cohort), with lower median SAPS 3 scores (43 vs. 48 points) and lower hospital mortality (16.5 vs. 23.5%) (Additional file 1: eTable 2). Despite these case mix differences, the SAPS 3-SE model currently fits our population well, which might reflect changes in the provision of health care resulting in lower risk-adjusted mortality. In this sense, our results have potential implications for ICU performance evaluation and, more importantly, for benchmarking purposes in Brazilian ICUs. In particular, we provide robust evidence that, although the SAPS 3 remains useful in our country, the customized equation for Central and South American countries should no longer be used.
Our study has several strengths, including being, to our knowledge, the largest validation study of severity-of-illness scores in Brazil and using contemporary data from several centers countrywide. Moreover, we consider the potential for discharge bias [27] to be negligible, since the percentage of patients discharged to other hospitals and hospice care facilities was minimal.
Our study also has several limitations that should be considered when interpreting our results. First, although we evaluated a large number of Brazilian ICUs, we used a convenience sample predominantly composed of private hospitals, which may not be representative of the entire country. Second, we did not audit data collection, as we used data recorded in a registry for performance evaluation and benchmarking. Therefore, we cannot estimate the effect of missing variables on the score estimates. However, in all ICUs, data are registered by trained healthcare professionals who work as case managers. Third, we did not assess end-of-life decisions, as they are not regularly registered in the database, and therefore we were unable to account for this factor in the analysis.
In conclusion, using a large contemporary database, we demonstrated that the SAPS 3-SE was accurate in predicting outcomes, supporting its use for performance evaluation and benchmarking in Brazilian ICUs.