Building a Machine Learning-based Ambulance Dispatch Triage Model for Emergency Medical Services

Background: In charge of dispatching ambulances, Emergency Medical Services (EMS) call center specialists often have difficulty deciding the acuity of a case given the information they can gather within a limited time. Although there are protocols to guide their decision-making, observed performance can still lack sensitivity and specificity. Machine learning models are known to capture complex and subtle relationships, and well-trained models can yield accurate predictions in a split second. Methods: In this study, we proposed a proof-of-concept approach to construct a machine learning model to better predict the acuity of emergency cases. We used more than 360,000 structured emergency call center records of cases received by the national emergency call center in Singapore from 2018 to 2020. Features were created using call records, and multiple machine learning models were trained. Results: A Random Forest model achieved the best performance, reducing the over-triage rate by an absolute margin of 15% compared to the call center specialists while maintaining a similar level of under-triage rate. Conclusions: The model has the potential to be deployed as a decision support tool for dispatchers alongside current protocols to optimize ambulance dispatch triage and the utilization of emergency ambulance resources.


Introduction
Efficient and effective Emergency Medical Services (EMS) are vital for good outcomes during pre-hospital emergencies such as stroke and cardiac arrest, where the response time and the type of response matter [1][2][3]. The traditional approach that dispatches the closest available vehicle to an emergency has been shown to be far from optimal [4,5]. With limited ambulance resources, the priority of a case is a crucial factor for dispatch decisions [6][7][8][9].
EMS systems worldwide have investigated and implemented various pre-hospital triage systems to determine the priority level of a pre-hospital emergency case. Most of these systems can be categorized into 2 groups. The Medical Priority Dispatch System and its variants, which originated in North America, assign priority levels to each case based on protocols with scripted questions put to the caller [10,11]. On the other hand, Criteria-Based Dispatch systems, which are used, for example, in Nordic and European countries, involve guidelines to determine response levels based on patient signs and symptoms collected by the dispatcher [12,13]. However, studies have shown that the accuracy of these priority dispatch systems remains an issue of concern [14,15], and there is a paucity of research to guide pre-hospital triage systems [16].
In Singapore, the Singapore Civil Defence Force (SCDF) serves as the national EMS organization, and its ambulance crews respond to more than 190,000 calls (national emergency "995" hotline) every year. With a fleet of only 84 ambulances, the need for efficient resource utilization is pressing, especially as the population continues to grow and age. At present, the SCDF uses a rule-based system containing 30 in-house protocols similar to the Criteria-Based Dispatch system. For different chief complaints, the dispatcher at the call center will ask questions based on the respective protocol and assign a Patient Acuity Category Scale (PACS) [17] for medical dispatch. The PACS is the emergency scale used nationwide in Singapore's EMS system and includes 5 levels (P1+, P1, P2, P3, and P4). P1+ and P1 are assigned to the most severe cases that are immediately life-threatening, such as cardiac arrest and head injury. P1+ cases will trigger a fire bike that beats the traffic on the road and will arrive faster than an ambulance. P2 cases are emergencies where the patients are usually unable to walk and are in some form of distress. If not attended to early, their medical status could deteriorate quickly. P3 cases are minor emergencies involving patients who have mild to moderate symptoms and are able to walk. Early intervention will still result in a better patient outcome in P3 cases. P4 cases are nonemergencies such as old injuries or chronic conditions that do not require immediate attention.
During an emergency call, the call center specialists will decide the PACS according to the patient's presenting complaints, symptoms, and the nature of the emergency, allowing a quick identification of the case acuity with no objective measurements available. After the ambulance is dispatched and arrives at the scene, paramedics will decide the PACS guided by a list of presenting complaints and provisional diagnoses, which they arrive at after their more detailed history taking, scene assessment, and physical examination of the patient. A quick overview of the PACS levels is shown in Table 1.
The purpose of the call center triage is to determine the (a) need for urgent intervention, (b) need for hospital conveyance, and (c) urgency of conveyance. This is fairly similar to the purpose of field triage by the paramedics, and hence, we seek concurrence between the 2 as much as possible. However, call center specialists, who do not have professional registration but only receive vocational training in SCDF, have a challenging job deciphering caller information and dispatching ambulance support (if necessary) within their required time limit of 1 min. Furthermore, the protocols are designed to allow quick comprehension and utilization but have several limitations. For example, the protocols may not be very specific and do not take other patient-reported information into consideration. Currently, we observed around 47% over-triage rate and 6% under-triage rate by the call center specialists between January 2018 and August 2020 over more than 300,000 cases, using the acuity reported by the paramedics at the scene as ground truth. This means that, for every 100 emergency cases, call center specialists assign a PACS more severe than the paramedics' (over-triage) in 47 cases and a less severe one (under-triage) in 6 cases. We postulated that a machine learning-based model, digesting information captured solely from the call, can improve the overall dispatching effectiveness by reducing over-triage and keeping under-triage at the same level.
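The over-triage and under-triage rates defined above can be illustrated with a minimal sketch. It assumes the rates are computed by comparing the dispatcher's PACS with the paramedics' PACS on the 5-level scale; the labels below are made up for illustration and are not study data.

```python
# PACS levels ordered from least to most severe, as described in the text.
PACS_ORDER = {"P4": 0, "P3": 1, "P2": 2, "P1": 3, "P1+": 4}


def triage_rates(dispatcher, paramedic):
    """Return (over_triage_rate, under_triage_rate) for paired PACS labels.

    The paramedic label is treated as ground truth. A dispatcher label that is
    more severe counts as over-triage; a less severe one as under-triage.
    """
    if len(dispatcher) != len(paramedic) or not dispatcher:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(dispatcher, paramedic))
    over = sum(PACS_ORDER[d] > PACS_ORDER[p] for d, p in pairs)
    under = sum(PACS_ORDER[d] < PACS_ORDER[p] for d, p in pairs)
    return over / len(pairs), under / len(pairs)


# Toy example with hypothetical labels (not study data).
dispatcher_pacs = ["P1", "P2", "P2", "P3", "P1+"]
paramedic_pacs = ["P2", "P2", "P3", "P2", "P1+"]
over_rate, under_rate = triage_rates(dispatcher_pacs, paramedic_pacs)
print(f"over-triage: {over_rate:.0%}, under-triage: {under_rate:.0%}")
```

In this toy example, 2 of 5 cases are over-triaged and 1 of 5 is under-triaged.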
There have been some studies exploring machine learning on medical emergency calls. In Copenhagen, Blomberg et al. [18] used a machine learning framework to recognize cardiac arrest in emergency calls, but the details of the machine learning framework are proprietary and, hence, were not disclosed. In their randomized clinical trial, no significant improvement in dispatchers' ability to recognize cardiac arrest was found when supported by machine learning [19]. In London, Tollinton et al. [20] used machine learning models to predict whether an unconscious and fainting patient would be conveyed to a hospital using the Medical Priority Dispatch System codes and free text notes as features. However, using conveyance as a binary marker of case severity is neither accurate nor objective.
In this proof-of-concept study, we aimed to develop a machine learning-based model for EMS ambulance dispatch triage in Singapore. The model should be easily applied to a wide range of pre-hospital emergency cases and decrease the level of over-triage without increasing the level of under-triage.

Data collection and preprocessing
All SCDF ambulance dispatch cases between January 2018 and August 2020 that were assigned a PACS of either P1+, P1, P2, P3, or P4 were included. A small number of cases with no meaningful call conversation, as well as cases that fell under a protocol of less than 0.1% prevalence, were excluded. All data were retrieved electronically from both the call center record system and the ambulance case record system. Data were linked using the unique incident number and subsequently anonymized. We used the PACS reported by paramedics at the scene as the ground truth. In particular, we grouped P1+ and P1 as "critical emergency," P2 as "normal emergency," and P3 and P4 as "nonemergency" to represent the case acuity. Critical emergency cases require the ambulance to be dispatched as soon as possible. Normal emergency cases also require assistance from an ambulance, but the arrival time can be less stringent. Meanwhile, nonemergency cases could be deprioritized when ambulance resources are strained, and future research can explore diverting these cases into alternative care pathways such as primary care, telemedicine, or outpatient visits.
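As a minimal sketch of the linkage and grouping described above, the snippet below joins call records to ambulance records on the incident number and maps the paramedics' PACS to the 3 acuity classes. The field names (`incident_no`, `pacs`, `acuity`) and the choice to drop unmatched calls are assumptions made for illustration; the actual record schemas are not given in the text.

```python
# Grouping of paramedic-reported PACS into the 3 acuity classes (from the text).
ACUITY_GROUP = {
    "P1+": "critical emergency",
    "P1": "critical emergency",
    "P2": "normal emergency",
    "P3": "nonemergency",
    "P4": "nonemergency",
}


def link_and_label(call_records, ambulance_records):
    """Join call records to ambulance records and attach the acuity label.

    Field names are hypothetical. Calls without a matching ambulance record
    or with an unrecognized PACS are dropped (an assumption of this sketch).
    """
    paramedic_pacs = {r["incident_no"]: r["pacs"] for r in ambulance_records}
    linked = []
    for call in call_records:
        pacs = paramedic_pacs.get(call["incident_no"])
        if pacs in ACUITY_GROUP:
            linked.append({**call, "acuity": ACUITY_GROUP[pacs]})
    return linked


# Toy records (hypothetical, not study data).
calls = [
    {"incident_no": "A1", "protocol": "chest pain"},
    {"incident_no": "A2", "protocol": "falls/back injury"},
    {"incident_no": "A3", "protocol": "sick person"},
]
ambulance = [
    {"incident_no": "A1", "pacs": "P1+"},
    {"incident_no": "A2", "pacs": "P3"},
]
labeled = link_and_label(calls, ambulance)
```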

Feature engineering
The medical information obtainable from an emergency call is very limited. We chose to include basic patient demographics (age and gender) and the chief medical complaint of the case as predictor variables in our model because these variables are already collected by the call center specialists, based on the caller's answers and their understanding of the situation, before the triage decision is made. For cases where the age of the patient was not available, we imputed the median age of the cohort.
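The median imputation of missing ages can be sketched as follows. This is an illustration only; the ages are made up, and representing a missing age as `None` is an assumption of the sketch.

```python
import statistics


def impute_median_age(ages):
    """Replace missing ages (None) with the median of the observed ages."""
    observed = [age for age in ages if age is not None]
    if not observed:
        raise ValueError("at least one observed age is required")
    median_age = statistics.median(observed)
    return [median_age if age is None else age for age in ages]


# Toy example (hypothetical ages, not study data).
ages = [34, None, 61, 78, None, 45]
imputed_ages = impute_median_age(ages)
```

Here the observed ages are 34, 45, 61, and 78, so both missing entries are filled with their median, 53.0.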
Because of the urgency of emergency calls, SCDF call center specialists do not note down every single word of the conversation but choose the most appropriate answers for the protocol questions, with the underlying rule-based system prompting the PACS. Such pseudo-dialogs consisting of question-answer (QA) pairs were the only available representation of the actual conversation, containing important information such as the patient's consciousness and breathing. An example of a protocol and a pseudo-dialog is shown in Fig. 1. The left-hand side shows the protocol that call center specialists need to follow when the nature of the emergency is "poisoning/ingestion." Specific questions, such as what the patient took, whether the patient is alert, and whether the patient is breathing normally, are asked to decide the PACS. The right-hand side shows the pseudo-dialog consisting of 7 QA pairs extracted from an alcohol intoxication case, where the first 4 questions were from a complaint-agnostic general protocol and the last 3 questions were from the "poisoning/ingestion" protocol. According to the answers, this case would be considered as P3.
To turn these pseudo-dialogs into model inputs, we proposed the following feature engineering approach. First, all unique QA pairs in our dataset were extracted, and each group of lexically similar questions was standardized to one question by removing the blank spaces and spelling errors. The lexical similarity of the questions was measured by Levenshtein distance [21]. Second, all QA pairs were manually screened and grouped into mutually exclusive attributes and values that are semantically meaningful. Last, features were created from these attributes with ordinal encoding and used in the predictive model. The step-by-step process of creating these features is shown in Fig. 2. For each emergency call, if the attribute was asked more than once, then the latest value was used. If the attribute was never asked, then a "not-asked" value was assigned to the attribute.
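The steps above can be sketched in a few lines. The Levenshtein function is the standard dynamic-programming edit distance; the attribute vocabulary, the QA-to-attribute mapping, and the ordinal codes are hypothetical stand-ins, since the study's real attributes came from manual screening and are not listed in the text.

```python
NOT_ASKED = "not-asked"


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical attributes; the list position of a value is its ordinal code.
ATTRIBUTE_VALUES = {
    "conscious": [NOT_ASKED, "no", "yes"],
    "breathing": [NOT_ASKED, "abnormal", "normal"],
}

# Hypothetical mapping from a standardized QA pair to (attribute, value).
QA_TO_ATTRIBUTE = {
    ("is the patient conscious?", "yes"): ("conscious", "yes"),
    ("is the patient conscious?", "no"): ("conscious", "no"),
    ("is the patient breathing normally?", "yes"): ("breathing", "normal"),
    ("is the patient breathing normally?", "no"): ("breathing", "abnormal"),
}


def encode_call(qa_pairs):
    """Turn one call's QA pairs (in chronological order) into ordinal features."""
    values = {attribute: NOT_ASKED for attribute in ATTRIBUTE_VALUES}
    for question, answer in qa_pairs:
        key = (question.strip().lower(), answer.strip().lower())
        if key in QA_TO_ATTRIBUTE:
            attribute, value = QA_TO_ATTRIBUTE[key]
            values[attribute] = value  # a later answer overwrites an earlier one
    return {a: ATTRIBUTE_VALUES[a].index(v) for a, v in values.items()}


# Toy call: consciousness asked twice, breathing never asked.
call = [("Is the patient conscious?", "No"), ("Is the patient conscious?", "Yes")]
features = encode_call(call)
distance = levenshtein("is the patient concious?", "is the patient conscious?")
```

In the toy call, the later "Yes" replaces the earlier "No", and the unasked breathing attribute keeps the placeholder code. The misspelled question differs from the standardized one by a single edit, which is the kind of near-duplicate the distance is meant to flag.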

Bayesian optimization for class weights
An under-triage would cause significantly more harm than an over-triage, which is why the existing rule-based system was designed to tend to over-triage patients. However, machine learning models, by default, do not know the implied difference between an under-triage and an over-triage, resulting in undesirably similar rates of the 2 errors. To control the under-triage rate to be similar to that of the call center specialists, we tuned the class weight of critical emergency cases using Bayesian optimization [22]. As illustrated in Fig. 3, Bayesian optimization approximates the objective function (represented by the blue curve) based on a posterior distribution of Gaussian processes (represented by the dashed line and the cyan areas) with observations (represented by the red points).
We designed the following objective function F, where f(x) represents the overall under-triage rate yielded by the machine learning model given a class weight x for critical emergency cases, and 0.0684 is the under-triage rate of the call center specialists.
After the fitting process, we could obtain an x that maximizes F so that the overall under-triage rate from the model is close to the under-triage rate of the call center specialists. We ran 30 iterations with 5 initial points and constrained x to be between 1 and 10 for each model. The class weights for normal emergency and nonemergency were kept as 1 for simplicity.
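To make the tuning target concrete, the sketch below assumes one plausible form of the objective, F(x) = -|f(x) - 0.0684|, which is maximal (zero) when the model's under-triage rate equals that of the call center specialists. This form is an assumption that is merely consistent with the description above. The function `toy_under_triage` is a made-up stand-in for f(x), not a trained model, and a coarse grid search is used here in place of Bayesian optimization so that the example has no external dependencies.

```python
TARGET_UNDER_TRIAGE = 0.0684  # under-triage rate of the specialists (from the text)


def objective(under_triage_rate):
    """Assumed objective F = -|f(x) - target|; zero when the two rates match."""
    return -abs(under_triage_rate - TARGET_UNDER_TRIAGE)


def toy_under_triage(class_weight):
    """Made-up stand-in for f(x): under-triage falls as the class weight grows.

    In the study, f(x) would be obtained by training a model with class weight
    x on critical emergency cases and measuring its under-triage rate.
    """
    return 0.15 / class_weight


def grid_search(f, lower=1.0, upper=10.0, steps=91):
    """Stand-in for Bayesian optimization: evaluate F on an evenly spaced grid."""
    candidates = [lower + i * (upper - lower) / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda x: objective(f(x)))


best_weight = grid_search(toy_under_triage)
print(f"best class weight on the toy curve: {best_weight:.1f}")
```

With a real model, each evaluation of f(x) requires a full training run, which is the usual motivation for a sample-efficient optimizer over an exhaustive grid. The `bayes_opt` library named in the experimental setup could take the place of `grid_search`, with x bounded between 1 and 10, 5 initial points, and 30 iterations.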

Models and experimental setup
We randomly generated 10 different splits from our dataset, each having 80% as training data and 20% as testing data. In each split, we experimented with different machine learning models, namely, Logistic Regression, Decision Tree [23], Extreme Gradient Boosting (XGBoost) [24], and Random Forest [25], together with the class weights derived by our Bayesian optimization to control the under-triage rates. Except for age, all attributes were treated as categorical features in the model. We compared the model predictions to the PACS assigned by the call center specialists. We reported the 4 metrics that we would like to improve on: overall over-triage rate, triage accuracy, and likelihood ratios of critical emergency cases and nonemergency cases. The mean and 95% confidence interval of the metrics were estimated from the metrics of the 10 splits. All data processing and statistical analysis were carried out in Python 3.8 using libraries including pandas, scikit-learn, and bayes_opt.
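The evaluation procedure can be sketched with the standard library alone. Two points are assumptions of this sketch rather than details given in the text: the likelihood ratio is taken to be the positive likelihood ratio (sensitivity divided by 1 minus specificity) in a one-vs-rest setting, and the 95% confidence interval uses a t-interval over the 10 split-level values (critical value 2.262 for 9 degrees of freedom). The numbers are made up.

```python
import random
import statistics


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def positive_likelihood_ratio(y_true, y_pred, positive):
    """One-vs-rest LR+ = sensitivity / (1 - specificity) for class `positive`.

    Assumes at least one positive and one negative case in y_true.
    """
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            tp += pred == positive
            fn += pred != positive
        else:
            fp += pred == positive
            tn += pred != positive
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    if false_positive_rate == 0:
        return float("inf")
    return sensitivity / false_positive_rate


def mean_ci95(values, t_critical=2.262):
    """Mean and t-based 95% CI; 2.262 is the critical value for 10 values."""
    mean = statistics.mean(values)
    half_width = t_critical * statistics.stdev(values) / len(values) ** 0.5
    return mean, mean - half_width, mean + half_width


# Ten 80/20 splits of a toy dataset with 1,000 rows.
splits = [split_indices(1000, seed=s) for s in range(10)]

# Hypothetical over-triage rates from 10 splits (not study results).
toy_rates = [0.29, 0.30, 0.31, 0.30, 0.29, 0.30, 0.31, 0.30, 0.29, 0.31]
mean_rate, ci_low, ci_high = mean_ci95(toy_rates)
```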

Results
The process to select the final cohort of 361,506 cases is shown in Fig. 4, and the demographics and most frequently asked attributes of the cohort are shown in Table 2. Among the final cohort, 191,474 (58.4%) of the patients were male, and the median age was 61 years old (interquartile range, IQR 41-77). In some cases, the age (13.0%) and the gender (9.3%) of the patient were not collected because of either nonresponse from a second or third party caller or data entry error.
The performance of our models versus the call center specialists on the training data is shown in Table 3. The Random Forest model was chosen as the final model because of its performance, achieving an accuracy of 63.7% and an over-triage rate of 29.6%, significantly outperforming current call center protocols by an absolute margin of around 15%. Both likelihood ratios were also significantly higher than those of the call center specialists.
Comparing the performance on test data shown in Table 4 with the performance on training data, we could also observe that the models performed consistently and did not suffer from overfitting.
The difference in performance between our model and the call center specialists on test data stratified by the different protocols is shown in Fig. 5. Compared to the call center specialists, the over-triage rates achieved by our model were better in most of the protocols (represented by green bars to the right of the 0% vertical line), while the under-triage rates increased or decreased, varying across different protocols (represented by red bars to the left or right of the 0% vertical line). Overall, the accuracy in most of the protocols was improved (represented by bars filled with slanted lines to the right of the 0% vertical line).
The proportion of factual response versus the feature importance is shown in Fig. 6. A factual response means that the value of the feature was extracted from the QA pairs instead of being the "not-asked" value. The closer the proportion of factual response is to 100%, the more frequently the information was asked. Age had the highest importance among all the features. Other important features, including the protocol chosen and whether the patient was conscious, bleeding, breathing, and able to speak, were as expected. On the contrary, we found that whether the patient was experiencing chest pain did not contribute as much to the model, although chest pain is one of the priority symptoms, together with breathing, consciousness, and bleeding, that would be automatically assigned P1+ in the protocol.
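As an illustration of the proportion of factual response, the sketch below computes, for one feature, the percentage of calls in which the value came from a QA pair rather than the placeholder. The rows and feature names are hypothetical.

```python
NOT_ASKED = "not-asked"


def factual_response_proportion(rows, feature):
    """Percentage of calls in which `feature` holds a value other than NOT_ASKED."""
    if not rows:
        raise ValueError("rows must not be empty")
    answered = sum(1 for row in rows if row.get(feature, NOT_ASKED) != NOT_ASKED)
    return 100.0 * answered / len(rows)


# Toy feature rows (hypothetical, not study data).
rows = [
    {"conscious": "yes", "chest_pain": NOT_ASKED},
    {"conscious": "no", "chest_pain": "yes"},
    {"conscious": "yes", "chest_pain": NOT_ASKED},
    {"conscious": NOT_ASKED, "chest_pain": NOT_ASKED},
]
conscious_pct = factual_response_proportion(rows, "conscious")
chest_pain_pct = factual_response_proportion(rows, "chest_pain")
```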

Discussions
In this study, we used emergency call center and ambulance case records to develop a machine learning-based multiclass classification model to improve ambulance dispatch triage performance for Singapore EMS. With no extra data or information compared to what is currently available to dispatchers, our model outperformed current baseline performance by an absolute margin of ~15% in terms of reducing over-triage while maintaining a similar level of under-triage. This model has the potential to be deployed as a decision support tool alongside existing protocols to optimize pre-hospital triage and utilization of emergency ambulance resources. Because EMS systems differ from country to country, our model may not be directly applicable to others. However, the importance of our work lies in demonstrating a methodology for processing EMS call center data and developing a machine learning model, which could be generalized to other EMS systems worldwide. Access to our actual protocol and data can be requested and will be subject to approval from the EMS system in Singapore.

Ambulance dispatch triage with machine learning
Recently, machine learning models have been applied to several fields in EMS, including symptom recognition [18], survival prediction [3], patient conveyance [20], paramedic documentation audit [26,27], and emergency department (ED) record linkage [28]. However, no previous study has applied machine learning in optimizing pre-hospital triage, despite the fact that the accuracy of the medical dispatching systems is concerning [15]. To our knowledge, our study is the first to address ambulance dispatch triage optimization with machine learning, and we hope that our study inspires more research in this direction. With multimodal data available for EMS in the future (e.g., real-time video, previous patient health record, and remotely monitored vital signs), machine learning and deep learning could transform EMS to a whole new level.

Hypothesis of lower feature importance of chest pain
Currently in SCDF, callers that report any priority symptoms (abnormal breathing, chest pain, decreased consciousness, or profuse (nonstop) bleeding) would be automatically assigned P1+. Our model showed that breathing, consciousness, and bleeding were indeed important indicators of the case acuity, while chest pain was less reliable. Chest pain is an exceedingly common complaint in emergency departments worldwide, and oftentimes, the most common causes are relatively benign, e.g., musculoskeletal conditions, gastrointestinal disease, or stable coronary artery disease [29]. Therefore, chest pain alone might not be specific enough in determining the acuity of the case.

Limitations
Our study has several limitations. First, this was a single-center study using retrospective data. A future prospective study implementing a decision support system using this model is required to validate our approach internally, and studies in other centers with similar EMS systems are required to validate our methodology externally. Second, we used PACS reported by the paramedics to derive the ground truth instead of the final PACS assessed in the ED because of unavailability of hospital data. Although the paramedics have more information to work with and are trained in making a sound medical assessment, the eventual PACS could still change between the time of ambulance dispatch and the time of ambulance arrival at the emergency department. However, it might be debatable whether paramedics or the ED triage nurses have a better assessment of the case acuity at that specific point in time. We also found that, in Singapore and globally, there is a lack of evidence on the consistency of triage judgment made by the field paramedics and ED triage nurses. It will be desirable for future studies to investigate this issue when the data become available. Third, the information we used in the model was limited to the current call. We have plans to include the patient's historical emergency call information and health record to provide an even better triage prediction in our future studies. Fourth, similar to retrospective analysis on hospital electronic health records, our study was a secondary analysis of call center records. Hence, the data were subject to human error, especially given the urgent nature of the call center specialists' job. As with most large cohort studies, there were also varying amounts of missing data that were excluded from analysis. Fifth, the model achieved a lower over-triage rate across most of the protocols, as shown in Fig. 5. Although we kept the overall under-triage rate the same as that of the call center specialists, a concern is that the model did not maintain the same level of under-triage rate across some of the most common protocols, such as sick person, falls/back injury, chest pain, and abdominal pain. There is always a trade-off between over-triage and under-triage, and the current protocol has an intentional bias toward not under-triaging patients. While an under-triage may directly put the patient's life in danger and should be avoided as much as possible, an over-triage may unnecessarily take up health care resources and delay the dispatch for critical emergency cases, indirectly costing lives. Thus, further cost-benefit analysis should be conducted to study a reasonable trade-off, and protocol-specific models could be explored in the future as well. Sixth, as we only had pre-hospital data, we lacked information on the precise etiology of the various conditions and the patients' overall outcomes, which would have further enriched our analysis. Last, for EMS systems that do not have machine learning capabilities, our approach will be more difficult to adopt. We suggest that such EMS systems find collaborators in local universities or work with us to develop their own models.

Conclusion
In conclusion, in this proof-of-concept study, we developed a machine learning model that can reduce over-triage rates significantly while maintaining a similar level of under-triage. These results are encouraging and show that this approach could be used in the call center to provide better ambulance dispatch triage and case acuity recommendation to optimize ambulance resource utilization. The methodology reported in this paper could also be generalized to other EMS centers to develop their own model.
and contributed to the study design, model design and experiments, and manuscript preparation. All authors have approved the final version of the manuscript submitted.
Competing interests: There are no conflicts of interest to be declared.

Fig. 1. Illustration of a pseudo-dialog derived from an existing protocol.

Fig. 2. Illustration of the process to transform the QA pairs into the final features.

Fig. 5. Difference in performance between our model and the call center specialists on test data stratified by the different protocols.

Fig. 6. Illustration of the proportion of factual response versus the feature importance derived for each feature.

Table 1. Details of the 5 levels of PACS used in SCDF.

Table 2. Characteristics of the cohort grouped by the case severity.

Table 3. Performance metrics of various models versus the baseline of call center protocols on the training data.