A customised down-sampling machine learning approach for sepsis prediction

Objective: Sepsis is a life-threatening condition in the ICU and requires timely treatment. Despite the accuracy of existing sepsis prediction models, insufficient focus on reducing alarms could worsen alarm fatigue and desensitisation in ICUs, potentially compromising patient safety. In this retrospective study, we aim to develop an accurate, robust, and readily deployable method for ICUs, based only on vital signs and laboratory tests. Methods: Our method combines a customised down-sampling process and a dynamic sliding window with XGBoost to offer sepsis prediction. The down-sampling process was applied to the retrospective data for training the XGBoost model. During the testing stage, the dynamic sliding window and the trained XGBoost model were used to predict sepsis on the retrospective datasets, PhysioNet and FHC. Results: With the filtered data from PhysioNet, our method achieved 80.74% accuracy (77.90% sensitivity and 84.42% specificity) and 83.95% accuracy (84.82% sensitivity and 82.00% specificity) on the test sets of PhysioNet-A and PhysioNet-B, respectively. The AUC score was 0.89 for both datasets. On the FHC dataset, our method achieved 92.38% accuracy (88.37% sensitivity and 95.16% specificity) and a 0.98 AUC score on the test set. Conclusion: Our results indicate that the down-sampling process and the dynamic sliding window with XGBoost deliver robust and accurate sepsis prediction under various hospital settings. The localisation and robustness of our method can assist sepsis diagnosis in different ICU settings.


Introduction
Sepsis, characterised by life-threatening organ dysfunction, is one of the most common diseases in ICUs [1]. Its high rates of mortality and morbidity make it a global healthcare burden [2]. There were about 11 million sepsis-related deaths worldwide in 2017 [2]. The high medical cost of sepsis also places a heavy burden on the global healthcare system [2,3]. The total cost of handling sepsis patients in ICUs is approximately 38 billion U.S. dollars annually in the USA [4] and 4.6 billion U.S. dollars in China [5]. Reducing this mortality requires predicting sepsis before its onset with high specificity and sensitivity, and raising the alarm to intensivists in a timely manner, so that treatments can be provided in time.
ICU staff have been reported to suffer from alarm fatigue, leading to alarm desensitisation [15]. This can potentially impact patient safety [16]. To tackle this problem, researchers have explored machine learning methods to reduce alarm desensitisation in ICUs [15,17,16]. Many studies have contributed to classifying ICU alarms [15,17] or avoiding non-actionable alarms [18]. In addition to such post-hoc alarm reduction, a carefully designed alarm algorithm that balances sensitivity and specificity can achieve ad-hoc alarm reduction and potentially relieve alarm fatigue [19].
This project aims to develop a localised, low-frequency, and accurate approach for sepsis prediction in ICUs. Various institutions have different ICU setups, resulting in variation in data characteristics; therefore, training and deploying methods locally could benefit sepsis prediction. To relieve alarm fatigue, one potential solution is to lower the detection frequency to achieve ad-hoc alarm reduction. The approach should also be robust enough to be easily adapted and deployed locally in different institutions. Several techniques exist to handle the issue of imbalanced classes, such as up-sampling (or so-called "oversampling") and down-sampling (or so-called "undersampling") [20]. Among them, the most commonly used is random undersampling (RUS) [21,20]. To achieve ad-hoc alarm reduction, we propose a customised down-sampling process that follows the clinical routine.
Different machine learning approaches have been developed to predict sepsis or its related conditions [22], such as deep learning models [23][24][25][26][27][28] or traditional machine learning models like SVM [29,30]. Despite this progress, peer-reviewed studies have shown that challenges remain [31,22]. There is not only heterogeneity in sepsis itself [32] but also a variety of data sources, data preprocessing and feature engineering methods among ML-based approaches, making it difficult to compare methods [31,22]. For clinical application, it is also worth pointing out that although many of the reviewed models have achieved high AUC values, most studies have focused solely on the modelling and neglected assessment in ICUs, including reducing the alarm rate [22,27,[33][34][35][36]. In addition, widely adopted neural networks often require large amounts of computational resources during the development, deployment and maintenance phases, making them challenging to use in clinics.
Another frequently encountered issue is missing values in sepsis data. Previous studies often treated missing values with forward-filling, assuming that the patient's condition only changes once the next test result arrives [22,33,37]. In forward-filling, a missing value is filled with the last valid value of the given feature until the next observation. However, this method only addresses some of the missing cases, since some laboratory test values are commonly not available at the beginning of the ICU stay. According to our experiments, about 25% of the data in the sepsis-specific dataset from the PhysioNet Cardiology Challenge remains missing after forward-filling [38]. To address this issue, Liu et al. [37] developed a two-phase down-sampling and up-sampling strategy; their highest AUC was achieved by a gradient boosting model with down-sampling to 50% of the data followed by up-sampling. Doggart and Rutherford [34] proposed RUS with a Boosted Tree to offer sepsis prediction.
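For illustration, forward-filling per patient can be done in one line with pandas; a minimal sketch, in which the column names are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical hourly ICU records; column names are illustrative only.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "hour":       [0, 1, 2, 0, 1],
    "HR":         [88.0, None, 91.0, None, 76.0],
    "Lactate":    [None, 2.1, None, None, None],
})

# Forward-fill within each patient: a missing value takes the last valid
# observation of the same feature for that patient.
filled = df.sort_values(["patient_id", "hour"]).groupby("patient_id").ffill()

# Values before a feature's first observation stay missing (e.g. Lactate
# at hour 0 for patient 1); this is the residual gap discussed above.
print(filled)
```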
Thus, we developed a down-sampling process and a dynamic sliding window technique with XGBoost for predicting sepsis based on vital signs and laboratory values. The down-sampling process is used to prepare the training data, whereas the dynamic sliding window is used for deployment. With this design, we expect our method to be accurate, flexible to deploy, and adaptable to different ICU settings. These qualities can facilitate a smoother integration of our method into real-world clinical sepsis prediction.

Study design and data sources
Our study was a retrospective study with data retrieved from the PhysioNet 2019 Challenge and collected from the First Hospital of Changsha, China. We proposed a preprocessing method and XGBoost to offer sepsis prediction. We explored the potential of performing early sepsis prediction while achieving ad-hoc alarm reduction. We reported our results following the "transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD)" guideline [39].
Our first clinical datasets were retrieved from the PhysioNet/Computing in Cardiology Challenge 2019 [38] and were used to develop and evaluate our method. The challenge dataset contains vital signs, laboratory tests, diagnosis of sepsis and other general information about adult ICU patients from two hospitals: Beth Israel Deaconess Medical Center, Boston, United States (Hospital A) and Emory University Hospital, Atlanta, United States (Hospital B). The data has a consistent header, with each row corresponding to one hour's worth of data. There are 20,336 and 20,000 adult ICU patients, containing 1790 and 1142 sepsis cases, collected from Hospitals A and B, respectively. The sepsis onset is labelled and defined as the earlier time between the intensivist's suspicion and the SOFA diagnosis, following the Sepsis-3 clinical criteria [40,1,41]. The third dataset of PhysioNet, from Hospital C, was private. Therefore, only the datasets from Hospitals A and B were included in this study, denoted as "PhysioNet-A" and "PhysioNet-B" in the rest of this paper. For the FHC dataset, clinical data was collected from patients 18 years or older at the First Hospital of Changsha, China, between 2020 and 2022. The collected data contained laboratory test values and vital signs from adult ICU patients, including 69 sepsis cases and 46 non-septic cardiovascular-disease-only cases. The laboratory tests were conducted following routine clinical practice, and the results were collected on a daily basis. The laboratory instruments and measurements are listed in the supplementary material. The items of the daily laboratory tests were used as data features. The sepsis label was given by the intensivist's suspicion of onset time following the Sepsis-3 clinical criteria. The description of each column in the recorded data file is listed in Table S1 in the supplementary materials. This dataset is denoted as "FHC" in the rest of this paper. The model development and validation were analysed between April 12th, 2022 and March 13th, 2023.

Feature selection and data preprocessing
There are 8 vital signs, 26 laboratory values and 6 demographic features in the PhysioNet datasets. We excluded ICU units, hospital stays, and other hospital-administration-related features from the entire dataset, assuming that the facility in one hospital remained the same. Based on Schinkel et al. [42], we also excluded demographic information, as we wanted to build our method only on the vital signs and laboratory values. We excluded end-tidal carbon dioxide (EtCO2), as no valid data was available in PhysioNet-A. The FHC dataset has a different hospital setting compared to the PhysioNet datasets, and the collected data is organised under a longer time interval. Features in the FHC dataset comprise 5 vital signs, 31 laboratory values and 4 demographic features. As there are differences between the septic and non-septic data, features present in only one patient group were excluded at the data cleaning stage. These include blood urea nitrogen (BUN), calcium ion (Ca++), fraction of inspired oxygen (FiO2), glucose (GLU), bicarbonate (HCO3), oxygen saturation (SO2), troponin and urine output. Similar to the PhysioNet datasets, features related to demographic information were also excluded.
We proposed a customised down-sampling process and a non-overlapping dynamic sliding window process to offer ad-hoc alarm reduction and handle the imbalanced data. Specifically, the customised down-sampling process was used to prepare the training data, and the non-overlapping dynamic window was used for the hold-out testing. We applied XGBoost to the preprocessed data to offer the sepsis prediction.

Customised down-sampling process
The customised down-sampling process is introduced to reduce the impact of missing data. We first applied forward-filling to the training data before the cross-validation. The down-sampling process is shown in Fig. 1. A "data block" in this process is defined as a collection of all valid data points within a range of continuous time stamps. The time-series data for each patient is separated into multiple data blocks based on the collected feature percentage, i.e., the number of features collected divided by the total number of features. The initial data block starts at Time 0, when the patient is admitted to the ICU, and ends when the collected feature percentage surpasses an empirical threshold, which is around 80% in our experiments. The subsequent data block starts after the end of the previous one and ends when the newly collected feature percentage surpasses the same threshold. Following the same rule, blocks are created until the end of the patient's data. The available data from the last time stamp of each data block, together with its corresponding label, is treated as an instance and selected for model development. The details of this down-sampling process can be found in the supplementary material. Applying the down-sampling process to the dataset lowered the alarm rate and reduced the imbalance between sepsis and non-sepsis instances. The post-processed data is demonstrated in Section 3.1.
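A minimal sketch of the block-segmentation logic described above, assuming an hourly pandas DataFrame per patient; the 80% threshold is the empirical value from our experiments, and the per-block reset of the newly collected feature count follows our reading of the description:

```python
import pandas as pd

def down_sample(patient_df: pd.DataFrame, feature_cols, threshold: float = 0.8):
    """Segment one patient's hourly records into data blocks and keep the
    last time stamp of each block (plus its label column) as one training
    instance. `threshold` is the empirical collected-feature percentage."""
    instances, seen = [], set()
    for _, row in patient_df.iterrows():
        # Features newly observed since the current block started.
        seen |= {c for c in feature_cols if pd.notna(row[c])}
        if len(seen) / len(feature_cols) >= threshold:
            instances.append(row)  # block ends here; keep its last row
            seen = set()           # the next block starts afresh
    return pd.DataFrame(instances)

# Usage sketch: instances = down_sample(one_patient_df, vital_and_lab_cols)
```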

Dynamic sliding window
The dynamic sliding window is designed for deployment to achieve ad-hoc alarm reduction, as shown in Fig. 2. It produces data with characteristics similar to the down-sampled data from the customised down-sampling process. It continuously collects new vital signs and laboratory test values with time stamps until it meets the same empirical threshold as the customised down-sampling. In this study, the dynamic sliding window was used to prepare the hold-out testing data to mimic the hospital environment.
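A sketch of the online counterpart used at deployment, under the same assumptions as the down-sampling sketch above; the class and method names are illustrative, not from the original implementation:

```python
class DynamicSlidingWindow:
    """Online counterpart of the down-sampling process: accumulate newly
    observed features until the empirical threshold is met, then emit a
    feature vector for prediction and start a new, non-overlapping window."""

    def __init__(self, feature_cols, threshold: float = 0.8):
        self.feature_cols = list(feature_cols)
        self.threshold = threshold
        self.seen = set()    # features observed in the current window
        self.latest = {}     # most recent value of each feature so far

    def update(self, observations: dict):
        """`observations` maps feature names to values arriving at one
        time stamp. Returns a feature vector when the window closes,
        else None."""
        for name, value in observations.items():
            if name in self.feature_cols and value is not None:
                self.seen.add(name)
                self.latest[name] = value
        if len(self.seen) / len(self.feature_cols) >= self.threshold:
            self.seen = set()  # next window starts after this one closes
            return [self.latest.get(c) for c in self.feature_cols]
        return None

# Usage sketch: call window.update({"HR": 92.0, "Lactate": 2.3}) at each
# time stamp; a returned vector would be passed to the trained XGBoost model.
```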

Outcome
The outcome of this study was the sepsis prediction. We proposed the preprocessing methods to down-sample the time-series data, and XGBoost was developed and validated for sepsis prediction. Model transfer was also demonstrated by cross-institute validation, resulting in a suggestion for deployment.

Experiment design
For Experiment One, we applied our method to the PhysioNet and FHC datasets to demonstrate its performance. PhysioNet-A and PhysioNet-B were each separated randomly into ten stratified folds. Instances from the same patient were not included in both the training and test sets simultaneously. The proportion of training, validation and test sets was 8:1:1. We left one fold out as the test set, and the remaining nine folds were used to perform the cross-validation. The binary classification was carried out by XGBoost [43]. Hyperparameter tuning was performed during the cross-validation; we found that 593 trees with a maximum depth of 25 and a learning rate of 0.084 gave the best prediction. Moreover, we set the feature sample rate to 0.6 and the instance sample rate to 0.9 to reduce data bias and prevent overfitting. For the FHC dataset, we separated the data into five folds due to the size of the dataset, with a training, validation and test proportion of 3:1:1.
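The reported hyperparameters map onto the XGBoost scikit-learn interface roughly as follows; the parameter-name mapping and the grouped-stratified splitter are our assumptions, as the original implementation details are not given:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedGroupKFold

# Hyperparameters reported in the text, mapped onto the XGBoost
# scikit-learn interface.
model = XGBClassifier(
    n_estimators=593,      # number of trees
    max_depth=25,
    learning_rate=0.084,
    colsample_bytree=0.6,  # feature sample rate
    subsample=0.9,         # instance sample rate
    eval_metric="logloss",
)

# One way to realise patient-level stratified folds: grouping by a patient
# identifier keeps a patient's instances out of the training and test sets
# simultaneously.
cv = StratifiedGroupKFold(n_splits=10)
# for train_idx, test_idx in cv.split(X, y, groups=patient_ids): ...
```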
For Experiment Two, we aimed to demonstrate our method in a different ICU setting. A feature-reduced PhysioNet dataset was created for this experiment by taking the intersection of the PhysioNet features and the FHC features. The feature description is listed in Table S2 in the supplementary materials. The model prediction was evaluated under the same process as the previous experiment. Additionally, we compared the performance of XGBoost to other machine-learning models, including Naïve Bayes, SVM, Random Forest and ANN, under Experiments One and Two.
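A minimal sketch of the feature-intersection step; the column names here are hypothetical stand-ins, and the actual shared features are listed in Table S2:

```python
import pandas as pd

# Hypothetical prepared tables; the real shared features are in Table S2.
physionet_df = pd.DataFrame(columns=["HR", "Temp", "Lactate", "EtCO2"])
fhc_df       = pd.DataFrame(columns=["HR", "Temp", "Lactate", "WBC"])

common = sorted(set(physionet_df.columns) & set(fhc_df.columns))
physionet_reduced = physionet_df[common]  # feature-reduced PhysioNet
print(common)  # ['HR', 'Lactate', 'Temp']
```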
For Experiment Three, we demonstrated the cross-institutional performance of the trained model: for instance, training the model at institution X and testing it at institution Y. Instead of the training-validation-test split, we performed a ten-fold cross-validation for this experiment. Besides the migration between PhysioNet-A and PhysioNet-B, we tested the PhysioNet model on the FHC data. The experiment settings between PhysioNet-A and PhysioNet-B remained the same as the PhysioNet ones in Experiment One, whereas the experiments from PhysioNet to FHC used the settings of Experiment Two. Additionally, we separated the FHC data into two folds: one fold was mixed with the training set from PhysioNet-A or PhysioNet-B, denoted as "Mix-A" and "Mix-B" in Section 3.4, and the other was used as the test set. Therefore, we could further demonstrate the benefit of training and deploying our method locally.
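A sketch of how a "Mix" training set could be assembled, assuming the instances are held in pandas DataFrames; the tables and label column below are toy stand-ins, and the naive alternating-row split merely stands in for the actual fold assignment:

```python
import pandas as pd

# Toy stand-ins for the down-sampled, feature-reduced instance tables;
# the label column name is an assumption.
physionet_a = pd.DataFrame({"HR": [80, 95, 70], "label": [0, 1, 0]})
fhc         = pd.DataFrame({"HR": [88, 72, 99, 65], "label": [1, 0, 1, 0]})

# Split FHC into two folds: one joins the PhysioNet-A training set
# ("Mix-A"), the other is held out as the test set.
fhc_train, fhc_test = fhc.iloc[::2], fhc.iloc[1::2]
mix_a_train = pd.concat([physionet_a, fhc_train], ignore_index=True)
```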

Discriminatory performance and model calibration
From our experiment design, our study can be classified as a type 2a and type 3 prediction study [39]. Following relevant studies [22,27,33], accuracy, sensitivity, specificity and AUC were used to evaluate the model's performance across different datasets. Furthermore, we included the PPV and NPV [44] to demonstrate the proportion of true positive or negative results among all positive or negative predictions. We produced the empirical calibration curve [45] and applied Platt scaling [46] to calibrate the models validated in Experiments One and Two. The number of proportion bins was 10 for the PhysioNet datasets and 5 for the FHC dataset. From these two plots, we can demonstrate the agreement between the test predictions and the labels.
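A minimal sketch, with toy numbers, of how these metrics and the Platt scaling step can be computed with scikit-learn; in practice the scaler would be fitted on validation scores and applied to the held-out test scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 1, 0])               # toy labels
y_score = np.array([0.2, 0.4, 0.9, 0.6, 0.3, 0.1])   # toy model outputs
y_pred  = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
ppv = tp / (tp + fp)           # positive predictive value
npv = tn / (tn + fn)           # negative predictive value
auc = roc_auc_score(y_true, y_score)

# Platt scaling: a logistic regression mapping raw scores to
# calibrated probabilities.
platt = LogisticRegression().fit(y_score.reshape(-1, 1), y_true)
calibrated = platt.predict_proba(y_score.reshape(-1, 1))[:, 1]
```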

Data characteristics
After applying our preprocessing, the data was down-sampled into several key instances for binary classification. It is critical that the sepsis prediction happens before onset. The last instance before onset is summarised in Table S3 in the supplementary materials. The average time before the sepsis onset is 5.85 ± 2.53 hours and 4.95 ± 2.40 hours for PhysioNet-A and -B, respectively. These time gaps would help intensivists to apply treatment in time [38]. Specifically, the test set sizes are 457 and 324, with 200 and 147 positive samples, for PhysioNet-A and -B, respectively. The test set size for the FHC dataset is 130, with 53 positive samples.
The characteristics of the vital signs and laboratory values, given as the median and interquartile range, are listed for PhysioNet and FHC in Table 1 and Table 2, respectively. We used the Mann-Whitney U test for each feature between sepsis and non-sepsis instances to calculate the p-value. We rejected the null hypothesis when the p-value < 0.05, where the null hypothesis is that the tested feature follows a similar distribution in the sepsis and non-sepsis groups.
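A small sketch of the per-feature test, with illustrative numbers only:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Illustrative values of one feature in the two instance groups.
sepsis_hr     = np.array([95, 102, 88, 110, 99])
non_sepsis_hr = np.array([78, 85, 80, 74, 90])

stat, p_value = mannwhitneyu(sepsis_hr, non_sepsis_hr,
                             alternative="two-sided")
if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject the null of similar distributions")
```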

Exp. one: evaluation on PhysioNet and FHC datasets
In this section, we demonstrate the performance of our method on the PhysioNet and FHC datasets. We implemented our method in Anaconda Python 3.7 with the XGBoost package (v1.5.0). Our experiments were run on a laptop with an i7 CPU and 32 GB RAM. From our empirical measurement, the average training time for an XGBoost model is 11.96 seconds.
The results on the PhysioNet datasets are listed in Table 3, and the ROC curves are shown in Fig. 3. The proposed method achieves an AUC of 0.87 ± 0.01 and 0.88 ± 0.01 for PhysioNet-A and -B, respectively. The performance on the FHC dataset is listed in Table 5, with the corresponding ROC shown in Fig. 4. The test accuracy for the FHC dataset is 92.38%, with 92.58% PPV, 92.29% NPV, 88.37% sensitivity and 95.16% specificity; the AUC score is 0.98. These results on the PhysioNet and FHC datasets show that the proposed method performs well and is reliable in predicting sepsis. For the model comparison, Random Forest outperformed the other baseline models (89.47% PPV, 80.60% NPV, 83.81% accuracy, 72.34% sensitivity and 93.10% specificity) but underperformed compared to XGBoost. The calibration plot of the developed model is shown in Fig. 6. From the empirical calibration, the XGBoost risk estimates were overly extreme on PhysioNet-B. However, with Platt scaling, the XGBoost was well calibrated for the PhysioNet datasets. For the FHC dataset, the XGBoost slightly overestimated risks, which is plausible given the smaller dataset.

Exp. two: evaluation of the feature-reduced PhysioNet datasets
For the experiment on the feature-reduced PhysioNet data, the accuracy, sensitivity and specificity are listed in Table 6, and the ROC is shown in Fig. 5.
From Table 6, the accuracy between training and test is consistent. The test accuracy, sensitivity and specificity for PhysioNet-A are 76.59%, 71.21% and 83.50%, respectively. The test result for PhysioNet-B is 81.17% accuracy with 85.20% sensitivity and 72.28% specificity, while the leave-out test AUCs for PhysioNet-A and -B are 0.85 and 0.88. We observed a similar trend of sensitivity and specificity between Table 3 and Table 6. The results on the feature-reduced PhysioNet dropped slightly, and the feature reduction likely explains this decreased performance. For the model calibration, the performance of XGBoost was consistent with Experiment One according to Fig. 6; the model was well calibrated.

Exp. three: model migration experiments
In Experiment Three, we first demonstrated the model migration between PhysioNet-A and PhysioNet-B, as shown in Table 7.
We trained the XGBoost model on PhysioNet-A and tested it on PhysioNet-B, and vice versa. The accuracy of PhysioNet A-to-B is 66.16% ± 2.59 with 65.13% ± 4.43 sensitivity and 67.12% ± 5.37 specificity, dropping from the 81.65% ± 0.6 accuracy with 86.19% ± 2.11 sensitivity and 73.18% ± 4.07 specificity obtained as the training results of PhysioNet-B in Table 3. The accuracy of PhysioNet B-to-A is 67.75% ± 2.38 with 53.58% ± 2.85 sensitivity and 77.31% ± 3.68 specificity, dropping from the 80.01% ± 1.48 accuracy with 78.77% ± 2.84 sensitivity and 81.63% ± 2.33 specificity obtained as the training results of PhysioNet-A. The AUC of the evaluation on PhysioNet-B drops from 0.88 ± 0.01 to 0.66 ± 0.03, while the AUC of the evaluation on PhysioNet-A drops from 0.87 ± 0.01 to 0.65 ± 0.02.
Table 8 gives the results of migrating the trained models from PhysioNet-A and PhysioNet-B to FHC. Specifically, both models were trained with the feature-reduced data, so that the trained model could be tested on the FHC data. In general, the mixed-training-set model performed better than the model migrated directly from the feature-reduced PhysioNet datasets. Compared to Table 6, the accuracy drops from 89.98% ± 3.76 to 53.45% ± 5.60 and 59.58% ± 3.26 for the feature-reduced PhysioNet-A and Mix-A, respectively. The accuracies of the migrated models from the feature-reduced PhysioNet-B and Mix-B are 61.02% ± 5.38 and 71.67% ± 4.21, respectively. The sensitivity and specificity of the migrated model from the feature-reduced PhysioNet-A are 46.89% ± 11.21 and 57.94% ± 14.71, respectively. On the other hand, the migrated model from the feature-reduced PhysioNet-B gained 21.56% ± 9.58

Table 4
Comparison between our method and the other methods on the PhysioNet datasets. Methods without an indication of PhysioNet-A or -B were validated on a test set combining the two datasets.

Discussion
This paper proposes a customised down-sampling process and a dynamic sliding window technique with XGBoost for sepsis prediction. Our method aims to provide a robust and reliable solution with a low alarm rate, which can be readily and locally deployed at different institutions and support patient monitoring in ICUs.
Our method achieved notable results on the PhysioNet and FHC datasets, exhibiting robust performance across multiple measures and experiment designs. In Experiment One, the AUC achieved by the algorithm (0.89 for PhysioNet-A, 0.89 for PhysioNet-B, and 0.98 for the FHC data) further underscores its effectiveness. Even though the high AUC on the FHC dataset might stem from the relatively small size of the dataset, this result means that our method can successfully predict sepsis and minimise false alarms in our experimental contexts. Based on the calibration analysis in Fig. 6, our model functions as intended and is well calibrated. The training and testing results from Experiment One demonstrated that the instances generated by our proposed methods share a similar distribution. However, the link between performance consistency and distribution consistency has not yet been proven; this requires further prospective studies in actual deployment.
Compared with the existing work [37,34], the proposed customised down-sampling process and dynamic sliding window have proven to be a reasonable and reliable solution for missing data.
They allowed an XGBoost model to perform well with different features, offering more flexibility and robustness in sepsis prediction and reducing the alarm rate at the first instance. As presented in Section 3.2, we compared our method with other models in Table 4. PhysioNet-A and -B were open datasets sourced from two different hospitals, while the FHC dataset was collected only in the Asian region. Our method retained the expected PPV, accuracy and AUC in all tested hospital settings. However, we observed variations in sensitivity and specificity across the datasets, indicating diverse tendencies in how these measurements behave when applied to various data samples. PhysioNet-A and FHC show lower sensitivity and higher specificity, while PhysioNet-B tends to have higher sensitivity and lower specificity. Upon comparing Table 1 with the original PhysioNet distribution reported by Reyna et al. [38], we observed a similar pattern in the distributions, though further analysis is warranted. There is a potential risk that applying the down-sampling method could increase variance, which might consequently affect the performance of the model during validation. These findings suggest that when deploying our method with down-sampling at various institutions, the model should be retrained to adapt to the different hospital settings.
During the analytical validation on the feature-reduced PhysioNet in Experiment Two, the hospital settings of the FHC data were migrated to the PhysioNet datasets. Our proposed method maintained convincing outcomes, with reasonable drops caused by the feature reduction. This suggests that our model relies more on the data itself and retains its performance with fewer features.
To further estimate the ability to migrate and transfer, we introduced Experiment Three to evaluate a model trained on data from one institution and tested on data from another. The challenge in such scenarios is the variation in data across institutions due to different patient populations, equipment, protocols, etc. The PhysioNet A-to-B and B-to-A transfer results in Table 7 indicate that our model is limited by data characteristics or bias. Biases in the training data can lead to poor generalisation to new, unseen data. The migration experiments between the standardised datasets PhysioNet-A and -B suggest that the model still suffers in cross-institutional deployment even when datasets are curated to a common standard. This lack of effectiveness emphasises the importance of localisation, which in the context of machine learning for healthcare typically refers to training or fine-tuning a model on local data to ensure its performance for that particular setting. As shown in Table 8, we set up migration experiments with various training and testing combinations: PhysioNet-A-to-FHC, PhysioNet-B-to-FHC, Mix-A-to-FHC and Mix-B-to-FHC. The performances of PhysioNet-A-to-FHC and PhysioNet-B-to-FHC further emphasise that when using data from one setting to predict another with different curation standards, the overlapping knowledge is insufficient for accurate predictions. The improved scores with the mixed data, Mix-A and Mix-B, indicate that diverse training data, including local data, can help the model generalise better.
This set of experiments highlighted a significant challenge in model development with cross-institutional data: ensuring good generalisation. The three experiments showed that XGBoost with our proposed down-sampling methods performed well during validation within the same dataset. However, its generalisation ability is limited, as demonstrated in Experiment Three. This limitation likely derives from the heterogeneity in sepsis and the need for cross-institutional data. Additionally, it may also be caused by the fact that XGBoost is sensitive to imbalanced data, which can be mitigated by hyperparameter tuning [47]. Our results on the mixed datasets suggest that cross-institutional studies might tackle the problem of generalisation.
From our perspective, localisation or incorporating local data is crucial for sustaining the performance of our model and is recommended for model deployment.Our method for sepsis prediction is designed for easy implementation and local deployment, ensuring it remains robust across different hospital settings.

Conclusion
In this paper, we developed a well-performing method that assists sepsis diagnosis with vital signs and laboratory values to better complement clinical criteria. The down-sampling process and dynamic sliding window technique with the XGBoost model can address the differences in hospital settings. It is sturdy, maintaining high accuracy and a low alarm rate even when medical resources are scarce. Beyond lowering the frequency of alarms after they occur, our method was carefully designed to balance sensitivity and specificity, which can contribute to mitigating alarm fatigue, a problem widely acknowledged in the medical field. Compared to hourly classification models, this offers a novel strategy for sepsis prediction in ICUs. Our approach is accurate and robust and can be readily and locally deployed in different institutions.
Moving forward, we will delve deeper into our research, expanding our model's capabilities to process time-series data more effectively. We also plan to conduct a prospective study of our current method in a real ICU environment to understand the impact of our proposed down-sampling processes on the data distribution at deployment. In addition, we plan to use the MIMIC-III and MIMIC-IV databases for further evaluation, enhancing the robustness and adaptability of our model. Furthermore, to improve the prediction for high-risk patients and hierarchical analysis, we will develop a stacking machine learning algorithm grounded in our existing model. We will continue focusing on alleviating alarm fatigue by developing well-adjusted signals that better balance sensitivity and specificity in machine learning optimisation.

Fig. 3.
Fig. 3. ROC of the XGBoost model on the PhysioNet datasets. The top-row figures are the ROC of each training fold; the bottom-row figures show the testing performance. "Set A" refers to PhysioNet-A; "Set B" refers to PhysioNet-B. (For interpretation of the colours in the figure(s), the reader is referred to the web version of this article.)

Fig. 4.
Fig. 4. ROC of the XGBoost model on the First Hospital of Changsha (FHC) data. The left ROC shows the training performance; the right ROC shows the testing performance.

Fig. 5.
Fig. 5. ROC of the XGBoost model on the feature-reduced PhysioNet dataset. The top-row figures are the ROC of each training fold; the bottom-row figures show the testing performance. "Set A" refers to PhysioNet-A; "Set B" refers to PhysioNet-B.

Fig. 6.
Fig. 6. Calibration of the XGBoost model for Experiments One and Two: (a) empirical calibration plot; (b) Platt scaling. The plots in the bottom row are histograms of the testing sets; each probability range contains histogram bins positioned accordingly.
The corresponding Random Forest results are 81.06% PPV, 75.52% NPV, 77.78% accuracy, 69.48% sensitivity and 85.29% specificity; however, the Random Forest classifier again underperformed compared to XGBoost. Compared to the other methods in Table 4, our method achieved the highest AUC score with a relatively similar accuracy among all methods, indicating that it can lead to better classification outcomes, which may benefit deployment.

Table 1
Characteristics for the filtered data from PhysioNet.

Table 2
Characteristics for the filtered data from FHC.

Table 3
Method Performance on PhysioNet Datasets.

Table 5
Method Performance on FHC Dataset.

Table 6
Method Performance on the Feature-Reduced PhysioNet.

Table 7
Model Transfer between PhysioNet-A and PhysioNet-B.

Table 8
Model Transfer from PhysioNet-A and PhysioNet-B to FHC.

Our results in Table 3 indicated that our model outperformed the others. Our method achieved the best AUC score compared to the other methods on the PhysioNet datasets listed in Table 4.