VALIDATION OF THE APACHE IV SCORE FOR ICU MORTALITY PREDICTION IN DR. SARDJITO HOSPITAL DURING THE PANDEMIC ERA

Introduction: ICU service quality must continuously improve to provide better patient care. One such improvement effort is the use of a risk prediction system that predicts ICU mortality from patient risk factors. Such a system helps healthcare services evaluate and comparatively audit intensive care, and it supports more targeted planning. APACHE IV is considered to have good validity. However, its predictive performance may change over time due to factors such as the pandemic, during which changes in the case mix may affect its predictions. Therefore, this research tests the validity of APACHE IV in the Indonesian population through Dr. Sardjito Hospital patients. The findings can inform future use, risk stratification, and ICU quality benchmarking.
Objectives: This study aims to assess the validity of the APACHE IV score in predicting ICU mortality in Dr. Sardjito Hospital for medical patients, surgical patients, and all patients combined during the pandemic.
Materials and Method: This study used retrospective data from 336 patients at Dr. Sardjito Hospital Yogyakarta from the 1st of January 2020 to the 31st of December 2020. All data required for calculating the APACHE IV score were collected, together with each patient's observed ICU mortality. The model's predictive validity was measured by assessing the discrimination and calibration of the APACHE IV score against the observed ICU mortality. Validation was also conducted separately for medical and surgical cases.
Results: APACHE IV showed good discrimination in all cases (AUC-ROC 0.819; 95% CI 0.772-0.866) but poor calibration (p = 0.023) for mortality prediction in the ICU. For medical cases, the discrimination was poor but still acceptable (AUC-ROC 0.698; 95% CI 0.614-0.782), and for surgical cases, the discrimination was good (AUC-ROC 0.848; 95% CI 0.776-0.921). Both subgroups showed good calibration (p: medical = 0.569, surgical = 0.579) in predicting mortality during the pandemic.
Conclusion: APACHE IV showed good discrimination but poor calibration for predicting mortality in all ICU patients during the pandemic era. Mortality prediction for surgical cases showed good discrimination and calibration. Medical cases, however, showed poor discrimination but good calibration.


INTRODUCTION
One of the most important services in a healthcare facility is the Intensive Care Unit (ICU). Therefore, the ICU service quality must be continuously improved to provide better patient service by evaluating the effectiveness of the treatments provided to the patients. A proper assessment of the patient's condition before treatment is needed for a valid therapy evaluation (1). A risk prediction system is one method to evaluate a patient's condition (2). The system is useful for analysing and assessing risk factors that will later be used to predict the prolonged length of stay (PLOS) and mortality rates of ICU patients.
The scoring system can also help healthcare policymakers manage human resources, time allocation, and the equipment needed by ICU patients (3). ICU LOS benchmarks can be used to evaluate the processes and policies in ICUs. Best practices are related to survival and resource allocation and can be used to monitor progress in ICU resource allocation in a multiple-hospital system (4). The Ministry of Health's technical instructions for ICU service implementation in hospitals also state that a prognostic prediction system can be used as an indicator for evaluating and monitoring ICUs and for assessing their quality of care by comparing predicted mortality with observed mortality. However, one study states that the validity of APACHE in COVID-19 patients has not yet been fully studied (5).
The addition of COVID-19 patients to the ICU setting during the pandemic may change disease patterns and severity (6), which may lead to results that differ from previous studies. Thus, this study aims to assess the performance of APACHE IV in predicting mortality in ICU patients in Dr. Sardjito Hospital, especially during the pandemic, where changes in the case mix may affect its predictive performance.

MATERIALS AND METHODS

Study Design
This is an observational retrospective cohort study. Data were collected retrospectively from the ICU of Dr. Sardjito Hospital between the 1st of January 2020 and the 31st of December 2020. The study received ethical clearance from the Medical and Health Research Ethics Committee of Universitas Gadjah Mada, Yogyakarta, Indonesia (ethical clearance no. KE/FK/1165/EC; October 26th, 2021).

Study Subjects
The subjects consist of all patients treated in the ICU at Dr. Sardjito Hospital. Samples were determined by using non-probability sampling. This study's population consists of patients who were treated in the ICU of Dr.

Result Analysis
The study's data analysis focuses on validating the APACHE IV score by assessing its calibration and discrimination. To evaluate discrimination, the area under the receiver operating characteristic (ROC) curve (AUC) was calculated with 95% confidence intervals (CIs). An AUC is considered 'good' if it is > 0.80. The Hosmer-Lemeshow goodness-of-fit test was used to evaluate the calibration of the APACHE IV score, and a p-value of >0.05 is regarded as good calibration. The data analysis was conducted using SPSS and included a descriptive analysis.
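As an illustration of the discrimination measure described above, the following is a minimal Python sketch that computes the AUC as the probability that a randomly chosen non-survivor has a higher score than a randomly chosen survivor. The study itself used SPSS; the function name `auc_mann_whitney` and the toy data are hypothetical and are not taken from the study.

```python
def auc_mann_whitney(scores, died):
    """AUC as the share of (non-survivor, survivor) pairs in which the
    non-survivor has the higher score; tied pairs count as one half."""
    pos = [s for s, d in zip(scores, died) if d == 1]  # non-survivors
    neg = [s for s, d in zip(scores, died) if d == 0]  # survivors
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for illustration only (not study data).
toy_scores = [35, 48, 52, 60, 66, 71, 85, 90, 104, 120]
toy_died = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

auc = auc_mann_whitney(toy_scores, toy_died)
is_good = auc > 0.80  # the 'good' threshold stated in the text
print(f"AUC = {auc:.3f}, good discrimination: {is_good}")
```

The pairwise definition is quadratic in the number of patients, which is acceptable for a cohort of a few hundred; statistical packages typically use an equivalent rank-based computation.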

RESULTS AND DISCUSSION

Patient Characteristics
The original data comprised 353 records from all ICU patients, of which 14 were excluded because they did not meet the inclusion criteria. Three further patients' records were excluded due to data loss. Thus, this study included and analysed data from 336 ICU patients with 40 variables per patient. Most patients were 41-60 years old (137 patients, 40.7%). This indicates that middle-aged patients dominate the data, while patients at the extremes of age, below 20 and over 80, are few in comparison. The median age was 50 and the mean was 49.65 ± 16.17 years, younger than the mean age in the original APACHE IV publication, which was 61.45 ± 0.08 years. The sample has 169 females (50.3%) and 167 males (49.7%), indicating no notable difference in numbers between genders. Next, 149 patients (44.3%) were categorized as medical patients and 187 (55.7%) as surgical patients. The data therefore contain more surgical than medical patients, the inverse of the original APACHE IV publication, where medical patients (69.2%) outnumbered surgical ones (30.8%). Moreover, the mortality rate in this study reached 30.95%, while similar studies in the same location reported lower mortality rates, such as 25.4% (9) and 25.4% (10). In addition, the original publication had a 13.5% mortality rate. Table 2 shows that the mean APACHE score is 64.27, which is higher than in the original publication, where the average APACHE score was 46.43. Patients cluster in the 41-80 score range, which covers 55.96% of the study population.
Based on Table 2, 69 patients with an APACHE score of less than 40 survived, or 97.1% of the patients in that score range. Conversely, patients with APACHE scores of more than 100 had a higher mortality rate, with 63.6% of cases resulting in death. These results suggest that a higher score is associated with increased patient mortality, although mortality at scores >100 is lower than in the preceding score range of 81-100, which has a 73.9% mortality rate.
Next, because of this discrepancy, a chi-square test was conducted to determine whether the mortality proportion differed significantly between patients with APACHE IV scores above 100 and those with scores below 100. The p-value was significant for all cases and for surgical cases. However, the p-value for medical cases was not significant.
Table 2 also shows the patients' distribution by case type and APACHE score. Surgical patients cluster in the lower score categories. Medical patients are distributed more toward the middle of the range and include more high-score cases than the surgical category. Surgical patients also tend to have better outcomes, with 86.1% of them surviving the ICU.
Figure 1 shows the discrimination of the APACHE IV score for mortality. The area under the curve (AUC) of the receiver operating characteristic (ROC) curve for mortality is 0.819 (95% confidence interval [CI] 0.772-0.866). This result indicates that the discriminative power in predicting mortality is strong in all cases. The cut-off point for mortality prediction is 67.5, with a sensitivity of 77.9% and a specificity of 74.1%. This suggests that patients with APACHE IV scores above this cut-off point are more likely to die, so the treatment of these patients must be handled with more caution.
Figure 2 shows the discrimination of the APACHE IV score for mortality in medical cases; the AUC is 0.698 (95% CI 0.614-0.782). This discrimination power is considered weak. Figure 3 shows the discrimination of the APACHE IV score for mortality in surgical cases; the AUC is 0.848 (95% CI 0.776-0.921). This discrimination power is strong.
In all cases, the APACHE IV model showed poor calibration (p<0.05) for mortality prediction. The predicted mortality is similar to the observed mortality in low-risk patients but underestimates it in high-risk patients. However, the performance of APACHE IV varies by case type. Medical and surgical cases each showed good calibration (p>0.05) for mortality prediction. The APACHE score slightly underestimated mortality in medical cases, as shown in Figure 5. For surgical cases, the accuracy of the mortality prediction varies. Overall, however, the predicted mortality for surgical cases was similar to the observed mortality, as shown in Figure 6.
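The cut-off point, sensitivity, and specificity reported for Figure 1 can be illustrated with a short Python sketch. The text does not state how the cut-off was selected; the sketch assumes it maximises Youden's index (sensitivity + specificity - 1) and that candidate cut-offs are midpoints between consecutive observed scores, which would be consistent with a half-point value such as 67.5. The function names and toy data are hypothetical.

```python
def sens_spec(scores, died, cutoff):
    """Sensitivity and specificity when a score above `cutoff` predicts death.
    Assumes the data contain at least one death and one survivor."""
    pairs = list(zip(scores, died))
    tp = sum(1 for s, d in pairs if d == 1 and s > cutoff)
    fn = sum(1 for s, d in pairs if d == 1 and s <= cutoff)
    tn = sum(1 for s, d in pairs if d == 0 and s <= cutoff)
    fp = sum(1 for s, d in pairs if d == 0 and s > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cutoff(scores, died):
    """Cut-off maximising sensitivity + specificity - 1 (assumed criterion)."""
    values = sorted(set(scores))
    # Candidate cut-offs: midpoints between consecutive observed scores.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda c: sum(sens_spec(scores, died, c)) - 1)


# Hypothetical toy data for illustration only (not study data).
toy_scores = [35, 48, 52, 60, 66, 71, 85, 90, 104, 120]
toy_died = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

cutoff = youden_cutoff(toy_scores, toy_died)
sensitivity, specificity = sens_spec(toy_scores, toy_died, cutoff)
print(f"cut-off {cutoff}: sensitivity {sensitivity:.3f}, specificity {specificity:.3f}")
```

Other selection criteria, such as the point closest to the top-left corner of the ROC plot, can give a different cut-off on the same data.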
The study shows that the APACHE IV score gives good discrimination (AUC = 0.819; 95% CI 0.772-0.866) in predicting mortality in all cases, including during the pandemic. Although the discrimination is not as good as in the original population study, it is still considered strong. This is also supported by various studies (7-9) and by a study on the Indonesian population before the pandemic (10). However, discrimination differed between surgical and medical cases. Surgical cases showed good discrimination (AUC = 0.848; 95% CI 0.776-0.921), whereas medical cases showed weak discrimination (AUC = 0.698; 95% CI 0.614-0.782).
Calibration assessed with the Hosmer-Lemeshow test shows that the APACHE IV score performs poorly in predicting mortality (χ² = 17.722, p = 0.023). On the calibration curve, the model prediction appeared to fit in the first four deciles. However, there are prediction inaccuracies starting from the fourth decile onwards, where the prediction starts to underestimate mortality. Nevertheless, calibration tests done separately on the surgical and medical populations produced different results. The p-values were 0.569 for medical and 0.579 for surgical cases, which means both show good calibration in predicting mortality. Additionally, a study in Malaysia (9) showed that APACHE IV also has poor calibration (p<0.0001). A study in Korea (8) retrospectively tested the APACHE IV, APACHE II, and SAPS 3 scores in a Korean ICU and found that all models showed good discrimination (0.80, 0.85, and 0.86, respectively) but poor calibration (p<0.05). The same study also showed that different subgroups of admission types and admission diagnoses might produce different calibration results: patients undergoing stomach cancer surgery showed good calibration (p>0.05), whereas poor calibration was seen in other surgeries (8).
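The relation between the Hosmer-Lemeshow statistic and its p-value can be sketched in Python as follows. The sketch assumes the usual grouping into ten risk deciles, so the statistic is referred to a chi-square distribution with 10 - 2 = 8 degrees of freedom; the text mentions deciles but does not state the degrees of freedom. The function names and the grouped toy data are hypothetical.

```python
import math


def hosmer_lemeshow_statistic(groups):
    """Sum over risk groups of (observed - expected)^2 / (n * p * (1 - p)).
    Each group is (n patients, observed deaths, mean predicted risk p)."""
    stat = 0.0
    for n, observed, p in groups:
        expected = n * p
        stat += (observed - expected) ** 2 / (expected * (1.0 - p))
    return stat


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of
    freedom, using the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)


# Reported statistic for all cases, with the assumed 8 degrees of freedom.
p_all_cases = chi2_upper_tail_even_df(17.722, 8)
print(f"p = {p_all_cases:.3f}")

# Hypothetical grouped data (four groups, not study data), for the statistic.
toy_groups = [(10, 1, 0.1), (10, 2, 0.2), (10, 5, 0.3), (10, 6, 0.6)]
toy_stat = hosmer_lemeshow_statistic(toy_groups)
toy_p = chi2_upper_tail_even_df(toy_stat, len(toy_groups) - 2)
```

Under the assumption of 8 degrees of freedom, the reported χ² of 17.722 gives a p-value that rounds to the reported 0.023.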
This study was conducted on patient samples obtained during the 2020 pandemic. After applying the inclusion and exclusion criteria, 336 samples out of 353 were used for this study. The proportion of patients with an APACHE IV score above 100 who died in this study was 63.6%, whereas in the original research it was 47%. In this study, mortality among patients with APACHE IV scores over 100 was lower than among patients with scores of 81-100. This difference was not statistically significant, especially in medical cases, although patients with scores above 100 would be expected to have higher mortality rates than those with lower APACHE IV scores. These findings may affect the discrimination or calibration of the APACHE IV validation, as mortality in patients with scores above 100 may not represent the real cases.
Other studies assessed only mortality prediction and did not include PLOS prediction. Most studies also compared the APACHE IV score with one or more other scoring systems (7,11,12). Research in Iran (7) found that APACHE IV has good discrimination but poor calibration for mortality prediction (AUC = 0.81; p = 0.036). Additionally, a study comparing the validity of different risk prediction models found that APACHE IV has the best discrimination and calibration (AUC = 0.745; p = 0.541) for mortality prediction compared to other predictors such as APACHE II, SAPS 3, and MPM0 III (11). Another recent study done during the pandemic compared the accuracy of the APACHE IV score with the APACHE II and Sequential Organ Failure Assessment (SOFA) scores for mortality in ICU patients with coronavirus disease. The study revealed that all scores had poor discrimination in the general population (APACHE IV 0.67 vs. APACHE II 0.63 vs. SOFA score 0.53) (13).
APACHE IV has good discrimination but lacks calibration in predicting mortality. Different outcomes may result from variations in patient characteristics, clinical practice, assurance, quality, and services provided by healthcare systems. One of the key points in this study is that the patient population is taken from a pandemic setting. During the pandemic, changes in case mix and illness severity have been noted (6,14), and these changes may have affected the predictive accuracy of risk factors. As one study notes, "Calibration may weaken over time, especially due to the effects of altered patient interventions and case-mix." (7). The accuracy of prognosis prediction was affected by differences in clinical practice between the USA and Indonesia, case-mix differences, insurance policies, step-down policies, and hospital policies relating to patients' end-of-life status.
Moreover, medical resource management was challenged during the pandemic, as a large share of medical resources was diverted to handle COVID-19, forcing other departments to adapt to the situation (12). Additionally, the lack of hospital preparedness in the early stages of the pandemic also affected the quality of hospital services, leading to patient safety problems such as delayed treatment (15). These conditions may affect the predictive accuracy of these models, as changing service quality may lead to different outcomes. Prognostic models have the potential to improve the standard of critical care in Indonesia. In the long run, medical practitioners will benefit from using a good prognostic model as a clinical decision-support tool.

CONCLUSION
APACHE IV showed good discrimination (AUC 0.819) but poor calibration in predicting mortality (p<0.05). APACHE IV also has good discrimination in predicting mortality for patients in surgical cases but has poor discrimination in medical cases. Both medical and surgical have good calibration (p>0.05).

STRENGTH AND LIMITATIONS
There are several limitations to this study. First, some parameters were absent in some patients, which may affect the final prediction scores. Second, only one year of data was collected, whereas a longer period and larger data sets might yield different results. This restriction was deliberate: we wanted to study a population whose case mix comprised ICU patients admitted during the pandemic, so the study period could not be extended beyond this one year.