Inter hospital external validation of interpretable machine learning based triage score for the emergency department using common data model

Emergency departments (EDs) are complex, and triage is a main task in the ED for prioritizing, under limited medical resources, the patients who need them most. A machine learning (ML)-based ED triage tool, the Score for Emergency Risk Prediction (SERP), was previously developed using an interpretable ML framework with data from a single center. We aimed to develop SERP with 3 Korean multicenter cohorts based on a common data model (CDM) without data sharing and to compare performance using an inter-hospital validation design. This retrospective cohort study included all adult patients who visited the EDs of 3 hospitals in Korea from 2016 to 2017. We adopted the CDM for standardized multicenter research. The outcome of interest was 2-day mortality after the patient's ED visit. We developed a SERP for each hospital using the interpretable ML framework and validated each score across hospitals. We assessed the performance of each hospital's score using several metrics, taking a data imbalance strategy into account. The study populations for the hospitals included 87,670, 83,363 and 54,423 ED visits from 2016 to 2017. The 2-day mortality rates were 0.51%, 0.56% and 0.65%. The scores were accurate in inter-hospital validation, with an AUROC of at least 0.899 (0.858–0.940). We developed a multicenter interpretable ML model using the CDM for 2-day mortality prediction and performed inter-hospital external validation, which showed sufficiently high accuracy.

The emergency department (ED) is a complex setting that requires urgent judgement for better triage 1,2. To determine a patient's condition quickly, the Korean Triage and Acuity Scale (KTAS), the New Early Warning Score and the Modified Early Warning Score have been developed by experts 3,4. However, although most of these scores require a complicated development process, they are fixed scores and suffer from low reliability and poor outcomes due to subjective assessment 5. To solve this problem, objective scores based on data and machine learning (ML) have emerged 6,7.
These ML-based models suffer from the black-box problem and a lack of external validation 8,9. Some studies of interpretable triage in the ED have used AutoScore, a framework for interpretable scoring systems [10][11][12]. However, that work was conducted with a limited population and was specific to patients admitted from the ER 11. Each hospital has a different population and different characteristics, so a hospital-specific score needs to be developed for practical application.
Another difficulty for external validation in ML research is data protection law and policy 13,14. To preserve privacy, data cannot be transferred to another hospital. To address this challenge, a common data model (CDM) can be adopted at each hospital 15. With the CDM format, multicenter research can be done without data transfer, because terminology and structure are standardized across each hospital's different electronic medical record formats and policies. Although there has been some CDM-based ML research 16,17, there has been no CDM-based interpretable machine learning research in Korea.
The aim of this study was to develop an interpretable ML score at each of 3 large hospitals in Korea and to validate it externally across hospitals, using a novel CDM-based framework.

Results
During the same study period, from 2016 to 2017, 145,371, 169,896 and 96,369 patients visited the ED at hospitals A, B and C, respectively, as shown in Fig. 1. Among them, 57,511, 86,533 and 41,946 patients were excluded because they were under 18 years of age, dead on arrival (DOA) or trauma patients. Finally, 86,670, 83,363 and 54,423 patients were used for developing the models. The 2-day mortality rates were 0.51%, 0.56% and 0.65%. The distribution of ED patients' demographics for each hospital is shown in Table 1. The cohorts included 445, 464 and 379 events, respectively (mean (SD) age 67.2 (14.3), 72.8 (14.4) and 72.5 (13.5) years; 265 (59.6%), 245 (52.8%) and 218 (57.5%) male). Among the patients who died, there were considerable differences between hospitals; in particular, the proportion with an Alert consciousness level was higher at hospital A (70.8%) than at the others (44.0% and 28.2%). Moreover, the proportion of patients classified as severe (KTAS 1 or KTAS 2) at scene was higher in hospital C (87.4%) than in the other hospitals (49.7% and 71.3%). Vital signs showed different patterns across all hospitals, especially SpO2 and BP. In terms of comorbidity history, hospital A had a much higher proportion of cancer-related patients (73.9%) than B and C (9.5% and 5%), whereas hospitals B and C had a higher proportion of chronic diseases, including diabetes (28.2% and 28%). The distributions after the synthetic minority over-sampling technique (SMOTE) and the significance of differences for each variable, together with standardized mean differences (SMD), are shown in Supplementary Tables 1-3.
Based on the variable importance from the AutoScore framework shown in Table 2 and the parsimony plot shown in Supplementary Fig. 1, we selected the top 8 variables for score generation. The features common to the three hospitals were vital signs, age and patient consciousness. Vital signs such as systolic blood pressure (SBP) and heart rate (HR) were important in hospitals A and B, whereas consciousness was most important in hospital C. SBP, HR and temperature were the top 3 contributing variables in the overall rank.
The scores for each hospital are presented in Table 3. The developed scores had different patterns across hospitals. Among the included variables, temperature and SpO2 had the highest points in hospital A (17), and patient consciousness had the highest in hospitals B (27) and C (33). In hospital B, age (13) was also a high-scoring variable, whereas systolic blood pressure (14) was dominant at hospital C. The overall score was calculated as a weighted score based on the number of patients and the performance of each institution. The score based on SMOTE is provided in Supplementary Table 4.
We evaluated each score on the other hospitals for inter-institutional external validation, using the testing cohort to evaluate the performance of each score. Table 4 reports the AUROC with 95% CI: performance was best in internal validation (0.913, 0.919 and 0.930) and dropped slightly in external validation. Overall, the evaluation showed good discrimination, with AUROC from 0.904 to 0.933. Other metrics for the original and SMOTE data are shown in Supplementary Table 5.

Discussion
In this study, we developed interpretable scores based on CDM Autoscore for ED and evaluated them at 3 tertiary hospitals in Korea for predicting 2-day mortality among patients visiting the ED. Although the hospitals have different characteristics, the scores remained accurate when externally validated at the other institutions, with an AUROC of at least 0.885 (0.842-0.942). Moreover, because the score is interpretable, it can be integrated easily into clinical practice. Each score was accurate at its own hospital, with internal validation AUROC from 0.913 to 0.930. We also identified how much accuracy is lost, and whether performance remains acceptable, when a score is applied to another institution.
To the best of our knowledge, this is the first study of interpretable machine learning using a CDM framework in the ED. Many policies and laws on data protection and data leakage have been published to protect private patient information 18,19. To address these constraints, our framework shares results without transferring any patient data. The CDM is designed to standardize the structure and vocabulary of observational health data so that reliable evidence can be produced without sharing data. This approach creates a unique opportunity to implement several existing data exploration and evidence generation tools and to participate in worldwide distributed research network studies without raw data leakage [20][21][22]. Our framework also offers extensibility and generalizability. More institutions can be added to the analysis cohort for further development and validation, because the semi-automated ETL process we developed enables CDM conversion of National Emergency Department Information System (NEDIS) data for all institutions in Korea.
An interpretable point-based score can be easily used in real practice. A study from the Netherlands published in 2023 also developed an international early warning score for predicting mortality in the ED 23. That score was consistent with our interpretable score in assigning high impact to consciousness, systolic blood pressure, temperature and SpO2, whereas old age was the most influential factor in the international score.
Another novelty of this study is that it conducted cross-external validation to assess generalizability. Patient distribution differs between institutions. In hospital C, almost all patients who died had a severe KTAS level, and consciousness was the most important variable for predicting mortality. A score therefore needs to be developed for each institution. Many previous studies have emphasized the importance of external validation for model generalizability 14,24,25. Most studies applied one model from one site to other sites 26,27, but in this study every institution built its own score, so the results can be compared across institutions.
This study has some limitations. First, it was retrospective, so the score needs to be evaluated prospectively to check its applicability. However, this score-based model is easy to integrate into the EMR because of the advantages of a point-based score. Second, we need to consider a representative score for Korea. A national-level score could be developed with National Emergency Department Information System (NEDIS) data, which cover 403 EDs.
In summary, we developed the K-SERP score for 3 hospitals in Korea using CDM Autoscore for ED and showed good cross-external validation results, with an AUROC of at least 0.899. The results can be expanded to other emergency department sites based on the CDM framework. Each score can be interpreted and applied to the clinical process easily.

Method

Study design and setting
This retrospective validation study was conducted across 3 EDs in Korea (A, B and C). A, B and C are tertiary hospitals located in a metropolitan city in Korea. The hospitals have approximately 2000, 1000 and 1000 inpatient beds, respectively. More than 80,000, 90,000 and 50,000 patients visit each ED annually. There are 16, 20 and 7 specialists working at each institution, respectively. All data were mapped to the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) for the multicenter study. This study was approved by the Samsung Medical Center Institutional Review Board (2023-02-036), and a waiver of informed consent was granted for EHR data collection and analysis because of the retrospective and de-identified nature of the data. All methods were performed in accordance with the relevant guidelines and regulations.

Selection of participants
Initially, ED patients from 2016 to 2017 were included for each hospital. Patients aged 18 years or older who presented with disease rather than trauma were included. We also excluded patients who left without being seen and patients who were dead on arrival or received cardiopulmonary resuscitation. Within each hospital, we split the data into two cohorts: a development cohort (70%) for training the interpretable ML model and a test cohort (30%) for evaluation.
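The 70/30 partition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the function name `split_development_test`, the simple random (non-stratified) shuffle and the seed value are all hypothetical, since the text does not say how the split was drawn.

```python
import random


def split_development_test(n_visits, dev_fraction=0.7, seed=0):
    """Randomly partition visit indices into development and test sets.

    Assumption: a plain random shuffle with a fixed seed; the source does
    not state whether the split was stratified by outcome.
    """
    rng = random.Random(seed)
    indices = list(range(n_visits))
    rng.shuffle(indices)
    n_dev = round(dev_fraction * n_visits)
    return indices[:n_dev], indices[n_dev:]


# Toy cohort of 1000 visits (synthetic size, for illustration only).
dev_idx, test_idx = split_development_test(1000)
```

With a 2-day mortality rate well below 1%, a stratified split may be worth considering so that both cohorts contain enough events, although that is a design choice beyond what the text reports.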

Candidate predictors
We extracted data from each hospital's electronic medical records system, in which all patient information was de-identified. Candidate input variables were those available at the stage of ED triage, including demographic characteristics such as age and gender, administrative variables including time of ED visit, and clinical variables such as severity index, consciousness and initial vital signs. Comorbidities were also obtained from hospital diagnosis records in the 5 years preceding each patient's emergency visit and compared between hospitals. They were extracted using International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) codes. The list and description of candidate predictors and comorbidities are given in Supplementary Tables 6 and 7.

Outcomes
Emergency patients with semi-acute conditions typically undergo a surgical procedure or are admitted to the intensive care unit (ICU) following emergency room treatment and given the imperative for patients to survive. Our

Common data model (CDM)
For the multicenter study, we adopted the OMOP CDM from the Observational Health Data Sciences and Informatics (OHDSI) research network 28, which provides a standardized structure and vocabularies, and mapped emergency department data to the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) and Logical Observation Identifiers Names and Codes (LOINC), as in the example shown in Supplementary Fig. 1. The extract, transform and load (ETL) process was performed with structured query language. Each item of ED care and diagnosis-related information was mapped into the proper CDM table, as shown in Fig. 2. For example, patient demographics and

CDM autoscore for ED framework
The AutoScore framework is a machine learning-based clinical score generator consisting of six modules, developed in Singapore 12. Module 1 uses a random forest to rank variables according to their importance. Module 2 transforms variables by categorizing continuous variables using quantile information to improve interpretation. Module 3 assigns a score to each variable based on logistic regression coefficients. Module 4 selects which variables are included in the scoring model. In Module 5, clinical domain knowledge is incorporated into the score, and cutoff points can be defined when categorizing continuous variables. Module 6 evaluates the performance of the score in a separate test dataset. The AutoScore framework provides a systematic and automated approach to score development, combining the discriminative advantage of machine learning with the interpretability of logistic regression. For the overall score generation, we considered weighted average scores across all institutions. For each institution i, a weight w_i was formulated in terms of N_i, AUC_i and M, where N_i was the sample size, AUC_i was the AUC value obtained on the validation set, and M was the total number of institutions. The overall score was calculated as a weighted score based on w_i.
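The weighted combination of institution-level scores can be sketched as follows. This is only an illustration under stated assumptions: the function name `overall_points`, the per-hospital point values, the normalization by the sum of the weights and the rounding to an integer are hypothetical. Only the weights themselves (0.472, 0.410 and 0.116) are taken from the Table 3 caption, and the formula that produced them is not reproduced here.

```python
def overall_points(points_by_hospital, weights):
    """Combine one category's points across hospitals as a weighted average.

    Assumptions: weights are normalized by their sum (the reported weights
    add up to 0.998) and the result is rounded to the nearest integer.
    """
    total_weight = sum(weights[h] for h in points_by_hospital)
    weighted_sum = sum(weights[h] * p for h, p in points_by_hospital.items())
    return round(weighted_sum / total_weight)


# Weights as reported in the Table 3 caption.
weights = {"A": 0.472, "B": 0.410, "C": 0.116}

# Hypothetical points for a single category of one variable.
example_points = {"A": 10, "B": 27, "C": 33}
combined = overall_points(example_points, weights)
```

Because hospital C has the smallest weight, its points move the combined value least in this sketch.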
We defined our novel framework, "CDM Autoscore for ED", as a combination of the CDM-based standardized format and the AutoScore-based interpretable framework, as shown in Fig. 3. The analysis and preparation code using the CDM format is also shared on GitHub 29.
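To make the idea behind Module 3 concrete, the conversion of logistic regression coefficients into integer points can be sketched as below. This is a simplified illustration and not the AutoScore package's code: the function name `coefficients_to_points`, the coefficient values, the choice to set each variable's lowest-risk category to zero and the scaling to a maximum total of 100 are assumptions made for the example.

```python
def coefficients_to_points(coefs, max_total=100):
    """Turn per-category log-odds coefficients into integer points.

    coefs maps variable -> {category: coefficient}. Each variable is
    shifted so its lowest-risk category scores 0, then all values are
    scaled so the highest attainable total equals max_total.
    """
    shifted = {
        var: {cat: beta - min(cats.values()) for cat, beta in cats.items()}
        for var, cats in coefs.items()
    }
    attainable = sum(max(cats.values()) for cats in shifted.values())
    if attainable == 0:
        raise ValueError("all coefficients are equal; no points to assign")
    scale = max_total / attainable
    return {
        var: {cat: round(value * scale) for cat, value in cats.items()}
        for var, cats in shifted.items()
    }


# Hypothetical coefficients, not taken from the study.
example_coefs = {
    "consciousness": {"alert": 0.0, "not_alert": 1.5},
    "sbp": {"normal": 0.0, "low": 1.0, "high": 0.5},
}
points = coefficients_to_points(example_coefs)
```

A patient's total is then the sum of the points for the categories they fall into, which is what keeps the resulting score readable at the bedside.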

Statistical analysis
Categorical features were expressed as frequencies and percentages, and continuous features were expressed as means and standard deviations. Comparisons between hospitals were performed with analysis of variance and chi-square tests at the 5% significance level. The standardized mean difference (SMD) was also calculated for comparing hospitals. Two types of validation were conducted. First, we performed intra-institutional (internal) validation of each hospital's score. Second, we performed pairwise inter-institutional validation as the external validation. The area under the receiver operating characteristic curve (AUROC) and its 95% confidence interval (CI), based on 1000 bootstrap replicates, were reported. Other metrics, including accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), were also reported. SMOTE was applied to handle the class imbalance problem. The minority class was oversampled to twice its size, and an equal number of majority-class cases was sampled to match the minority class, using a fixed random seed.
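The AUROC with a bootstrap confidence interval can be sketched as follows. This is a minimal, standard-library illustration rather than the study's code: the function names, the percentile method for the interval, the seed and the rule of redrawing resamples that contain only one class are assumptions, and the toy labels and scores are synthetic.

```python
import random


def auroc(labels, scores):
    """AUROC from the rank-sum (Mann-Whitney) statistic, with midranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for the AUROC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # assumption: redraw if only one class is present
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(round(alpha / 2 * n_boot))]
    upper = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return auroc(labels, scores), lower, upper


# Synthetic toy data: 1 = died within 2 days, scores = triage points.
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8]
point, ci_low, ci_high = bootstrap_auroc_ci(toy_labels, toy_scores)
```

For pairwise external validation, the same function would be applied to hospital A's score evaluated on the test cohorts of hospitals B and C, and so on for each pair.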

Figure 1. Flow chart of emergency department visits at each hospital from 2016 to 2017. Patients aged under 18, trauma patients and patients who were dead on arrival were excluded.

Figure 2. Table mapping for converting clinical tables to common data model tables. CDM: common data model; ED: emergency department.

Figure 3. Overall process of "CDM Autoscore for ED". Each institution conducted the extract, transform and load process to convert local data into the CDM format. Algorithms from each institution were derived using the interpretable machine learning framework and validated inter- and intra-institutionally. EMR: electronic medical records; ETL: extract, transform and load; OMOP CDM: Observational Medical Outcomes Partnership Common Data Model.

Table 1. Baseline demographics of ED triage information for each hospital from 2016 to 2017. *P-values were calculated with the t-test for numerical variables and the chi-square test for categorical variables at the 0.05 significance level. SD: standard deviation. "Other" in Route of arrival contains transfer in, referral from outpatient, other and unknown. "Other" in Mode of transport contains walk-in, public transportation, aeromedical transport, other cars, others and unknown.

Table 2. Top 14 contributing variables for each hospital. KTAS: Korean Triage and Acuity Scale.

Table 3. Scores generated for each hospital. Variables were selected from the parsimony plot shown in Supplementary Fig. 2. The overall score was calculated as a weighted score across institutions. The weights are 0.472 for Hospital A, 0.410 for Hospital B and 0.116 for Hospital C.

Table 4. Inter-hospital external validation results, reported as AUROC (95% CI) for mortality at each hospital. AUROC: area under the receiver operating characteristic curve.