Performance of Severity Scores in Neurocritical Care: A Retrospective Cohort Study

Severity scores are widely used in intensive care as surrogate markers of disease severity and for benchmarking and resource allocation. They are derived from specific cohorts and should undergo validation before general use in other populations. Nevertheless, they may lose accuracy and prognostic power over time and in different settings. Therefore, continuous assessment of score performance is crucial. In this study, we evaluated the discrimination and calibration of APACHE II, SAPS 3 and SOFA for mortality in a cohort of neurosurgical and neurological patients. Although all three showed good discrimination, APACHE II was better calibrated than the others. Hence, our data suggest that APACHE II is a good prognostic model for neurocritical patients.


Introduction
Prognostic scores have been widely used in research and in intensive care medicine for disease severity evaluation, benchmarking and optimal resource allocation [1]. These models use selected clinical information to predict the probability of an outcome [2]. The Simplified Acute Physiology Score 3 (SAPS 3), the Acute Physiology and Chronic Health Evaluation II (APACHE II) and the Sequential Organ Failure Assessment (SOFA) are predictive scores used to estimate the probability of death and the severity of organ dysfunction in patients admitted to an ICU [3-9]. They are derived from developmental cohorts and may lose precision when extrapolated to other populations and accuracy over time [1,10]. Therefore, predictive models should undergo external validation and regular updating.
Neurocritical care is a subspecialty of critical care medicine that uses specialized interventions to treat severe neurological and neurosurgical injuries. This may add substantial resource consumption and raise questions about rational medical practice [11]. Thus, predictive models may play a significant role in this subset of patients.
To assess the performance of predictive models in neurocritical care, we performed an external validation analysis using a cohort of neurological and neurosurgical patients. This is relevant given the continuous advancement of care and the optimization of resource allocation in the field.

Study design and setting
This was a single-center, retrospective cohort study conducted between January 1, 2013 and December 31, 2016 in the neurological ICU of Unicamp's teaching hospital. This ICU has seven beds and is staffed by full-time intensive care specialists, nurses, assistants and physiotherapists. Pertinent data on patients' medical histories were extracted from an electronic database. All death data correspond to the ICU stay. This was an observational study, and every clinical decision was at the discretion of the attending physician. Therefore, informed consent was waived. The local and national ethics committees approved the study.

Selection of participants
During the study period, all patients aged 18 years or older with a neurological or neurosurgical diagnosis who required admission to the ICU were evaluated. Patients with an ICU stay shorter than 24 hours, those who received cardiopulmonary resuscitation before admission and those who died during this period were excluded.

Data gathering
Data were collected from an electronic database that was continuously updated with patient information. The database was protected to conceal patients' identities. Demographic, clinical and laboratory variables of interest were collected retrospectively. All scores used (SAPS 3, APACHE II and SOFA) were calculated automatically from the appropriate variables.

Statistical analysis
Statistical analyses were performed using MedCalc version 17.6 and Epi Info 7.2.1.0. Continuous variables are presented as mean with standard deviation or median with interquartile range (IQR), as indicated. Categorical variables are presented as absolute values and percentages.
Score discrimination was evaluated by calculating the area under the receiver operating characteristic curve (AUROC) with 95% confidence intervals (CI), and AUROCs were compared with the Hanley-McNeil (1983) test. Discrimination was considered excellent, very good, good, moderate or poor for AUROC values of 0.9-0.99, 0.8-0.89, 0.7-0.79, 0.6-0.69 and <0.6, respectively [12]. Calibration, that is, the agreement between observed and expected death rates across all strata of predicted probability of death, was evaluated with the Hosmer-Lemeshow goodness-of-fit C statistic [13,14]. A two-tailed p value <0.05 and 95% CIs were used for logistic regressions.
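As an illustration only, the discrimination analysis described above can be sketched in a few lines of Python. This is not the MedCalc procedure used in the study: the function names, the toy data and the correlation value r are assumptions made for the example. The standard error uses the Hanley-McNeil (1982) approximation, and the comparison of two AUROCs measured on the same patients takes the correlation r as an input that the caller must supply (Hanley and McNeil tabulate it in their 1983 paper).

```python
import math

def auroc(scores, died):
    """AUROC as the probability that a non-survivor has a higher score
    than a survivor; tied pairs count as one half."""
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUROC (Hanley-McNeil 1982 approximation)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

def compare_correlated_aucs(auc1, se1, auc2, se2, r):
    """z statistic and two-sided p value for two AUROCs from the same cases.
    r is the correlation between the two areas (assumed to be supplied)."""
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

def discrimination_label(auc):
    """Categories used in this study for AUROC values."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "very good"
    if auc >= 0.7:
        return "good"
    if auc >= 0.6:
        return "moderate"
    return "poor"

# Toy data (hypothetical): six patients, two deaths.
scores = [5, 8, 12, 20, 25, 30]
died = [0, 0, 0, 1, 0, 1]
a = auroc(scores, died)
se = hanley_mcneil_se(a, n_pos=2, n_neg=4)
print(a, discrimination_label(a), se)
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a cohort of this size; a rank-based computation would be preferable for much larger data sets. Confidence intervals built from this standard error need not match those reported by a given statistical package, which may use a different method.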

Results
A total of 909 patients fulfilled the inclusion criteria and were enrolled for further analyses. Of these, 611 (67.2%) were hospitalized for neurosurgical procedures and 298 (32.8%) for neurological conditions. The mean age was 49.28±16.51 years, the mean ICU length of stay was 7.91±10.23 days and the mean hospital length of stay was 22.26±25.37 days. There was a male predominance, with 477 cases (52.5%). Observed mortality was 6.5% (59 patients), and mean prognostic scores were 11.13±5.35, 37.51±15.02 and 2.91±2.63 for APACHE II, SAPS 3 and SOFA, respectively. Patient characteristics are shown in Table 1.
To assess the performance of prognostic scores in neurocritical patients, we evaluated the discrimination and calibration of APACHE II, SAPS 3 and SOFA in our cohort. Discrimination was good for all scores. The areas under the ROC curve were 0.832 (95% CI: 0.806 to 0.856) for APACHE II, 0.840 (95% CI: 0.814 to 0.863) for SAPS 3 and 0.808 (95% CI: 0.781 to 0.833) for SOFA (Figure 1). Pairwise comparisons with the Hanley-McNeil test [12] did not reach significance (APACHE II vs SAPS 3, P = 0.75; APACHE II vs SOFA, P = 0.39; SAPS 3 vs SOFA, P = 0.23). Next, we assessed the calibration of these scores with the Hosmer-Lemeshow goodness-of-fit C statistic. The observed p values indicate good calibration for APACHE II, whereas SAPS 3 and SOFA were poorly calibrated (Table 2).
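For readers less familiar with the calibration test, the following is a minimal Python sketch of the Hosmer-Lemeshow C statistic, under stated assumptions: patients are sorted by predicted probability of death and split into equal-sized groups, and the statistic is referred to a chi-square distribution with (groups - 2) degrees of freedom. The function names and toy data are hypothetical, and the closed-form p value below is valid only for an even number of degrees of freedom. A large p value indicates no detectable difference between observed and expected deaths, which is read as good calibration.

```python
import math

def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return (C statistic, degrees of freedom) for predicted probabilities
    of death and observed outcomes (1 = died, 0 = survived)."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected deaths in the group
        observed = sum(y for _, y in chunk)   # observed deaths in the group
        denom = expected * (1 - expected / size)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat, groups - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = x / 2
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Toy data (hypothetical): four risk strata of ten patients each, with
# observed deaths equal to the expected number in every stratum.
probs = [0.1] * 10 + [0.2] * 10 + [0.3] * 10 + [0.4] * 10
deaths = ([1] + [0] * 9) + ([1] * 2 + [0] * 8) + ([1] * 3 + [0] * 7) + ([1] * 4 + [0] * 6)
stat, df = hosmer_lemeshow(probs, deaths, groups=4)
print(stat, df, chi2_sf_even_df(stat, df))
```

With the conventional ten groups the test has eight degrees of freedom, so the even-df shortcut applies; for other group counts a general chi-square routine from a statistics library would be needed.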

Discussion
In this study, we evaluated the predictive performance of prognostic scores in a cohort of neurological and neurosurgical patients. We observed that, although all scores had good discrimination, APACHE II was better calibrated than SAPS 3 and SOFA in our setting. Prognostic models are acceptable surrogate indicators of illness severity and organ dysfunction [2,11]. Studies suggest that prediction models must undergo external validation before being extrapolated to different populations [1,15]. This is especially important if the new population differs substantially from the score's developmental cohort. To our knowledge, few studies have evaluated prediction models in a subgroup composed of neurological and neurosurgical patients. Based on our data, APACHE II can be reliably used in neurocritical patients. Moreover, although SAPS 3 and SOFA have good discrimination, both perform poorly in subsets of patients, as shown by their calibration results. Notably, previous studies have reported poor calibration of prognostic scores in many settings [2,16]. This suggests that, even when regional equations are used, validation must be performed in any new population, with adequate sample sizes if possible, and periodically. Whether accuracy should be evaluated continuously or intermittently is unknown.

Conclusion
Prognostic scores are important markers for stratification and prediction in intensive care medicine. We observed that APACHE II has good discrimination and calibration in neurocritical patients. Our results support its use in this patient population. Moreover, validation of prediction scores is crucial for model reliability and must be performed in any population before their use.