Machine Learning Approaches for Predicting Difficult Airway and First-Pass Success in the Emergency Department: Multicenter Prospective Observational Study

Background: The modified LEMON (look, evaluate, Mallampati, obstruction, neck mobility) criteria for predicting a difficult airway leave room for improvement, and no prediction tool exists for first-pass success in the emergency department (ED).
Objective: We applied modern machine learning approaches to predict difficult airway and first-pass success.
Methods: In a multicenter prospective study that enrolled consecutive patients who underwent tracheal intubation in 13 EDs, we developed 7 machine learning models (eg, random forest model) using routinely collected data (eg, demographics, initial airway assessment). The outcomes were difficult airway and first-pass success. Model performance was evaluated using c-statistics, calibration slopes, and association measures (eg, sensitivity) in the test set (a randomly selected 20% of the data). Performance was compared with the modified LEMON criteria for difficult airway and with a logistic regression model for first-pass success.
Results: Of 10,741 patients who underwent intubation, 543 patients (5.1%) had a difficult airway, and 7690 patients (71.6%) had first-pass success. In predicting a difficult airway, machine learning models, except for k-point nearest neighbor and multilayer perceptron, had higher discrimination ability than the modified LEMON criteria (all, P≤.001). For example, the ensemble method had the highest c-statistic (0.74 vs 0.62 with the modified LEMON criteria; P<.001). Machine learning models, except for k-point nearest neighbor and random forest models, also had higher discrimination ability for first-pass success than the reference logistic regression model. In particular, the ensemble model had the highest c-statistic (0.81 vs 0.76 with the reference regression; P<.001).
Conclusions: Machine learning models demonstrated greater ability than conventional approaches for predicting difficult airway and first-pass success in the ED.


Introduction
In the emergency department (ED), achieving successful tracheal intubation at the initial attempt (ie, first-pass success) is essential [1]. The literature has shown that repeated intubation attempts are associated with a higher rate of adverse events [2-4]. However, recent studies have reported first-pass success rates of only 74%-84% in the ED [5,6], indicating that repeated intubation attempts are still required in a substantial proportion of patients. To improve ED airway management, the development of effective risk stratification and prediction tools is instrumental.
A widely used prediction tool for difficult airway is the modified LEMON (look, evaluate, Mallampati, obstruction, neck mobility) criteria [7], which have been validated [8]. Although the criteria have good prediction ability (eg, sensitivity 86%, specificity 48% for direct laryngoscope) [8], there remains room for improvement. Moreover, no prediction tool accurately predicts first-pass success (or failure) in the ED. The recent advent of machine learning approaches has enabled clinicians and researchers to accurately predict various diseases and conditions, such as sepsis [9], acute asthma [10], and ED triage [11,12]. Compared with conventional prediction tools and regression approaches, modern machine learning approaches have several advantages, such as incorporating high-order, nonlinear interactions between predictors and mitigating overfitting [13]. Despite the clinical and research importance, no study has yet applied modern machine learning approaches to predict a difficult airway in advance of preparing for airway management or to predict first-pass success once the intubation strategy has been determined in the ED.
To address this significant knowledge gap in the literature, using data from a prospective, multicenter study of ED airway management, we aimed to develop machine learning models that accurately predict difficult airway and first-pass success and to compare their performance with conventional approaches.

Study Design, Setting, and Participants
This study analyzed data from a multicenter, prospective study of emergency airway management, the second Japanese Emergency Airway Network (JEAN-2) study. The details of the study design, setting, participants, methods of data measurement, and definitions of variables have been reported elsewhere [14]. In brief, the JEAN-2 study is a consortium of 13 academic and community EDs, including 10 level I and 3 level II equivalent trauma centers. These EDs are located in different geographic regions across Japan. The median ED census is 29,000 patients per year (range of 16,000 to 67,000 annual visits). These EDs are affiliated with an emergency medicine residency training program. Intubations are performed by attending physicians or by resident physicians under the supervision of an attending physician. In this observational study, patients were managed at the discretion of treating physicians. The institutional review board at each participating center approved the waiver of informed consent before data collection. This study used data from consecutive patients (both children and adults) who underwent ED airway management at one of the participating EDs from January 1, 2010 through December 31, 2018. Patients who underwent surgical intubations at the first attempt were excluded.

Outcomes
The outcomes of interest were difficult airway and first-pass success. According to the American Society of Anesthesiologists (ASA) guidelines, a difficult airway was defined as multiple intubation attempts by emergency physicians or anesthesiologists [15]. First-pass success was defined as intubation success at the initial attempt of each encounter [16]. Intubation success was defined as the proper placement of a tracheal tube through the vocal cords, confirmed by the use of quantitative or end-tidal CO2 monitoring [17]. An intubation attempt was defined as a single insertion of the laryngoscope past the teeth [18].

Predictors of Machine Learning Models
To develop machine learning models for the difficult airway outcome, we used the following variables that are routinely obtained in advance of the actual intubation attempt: patient demographics (age, sex, estimated height and body weight, BMI), components of the modified LEMON criteria, pre-intubation vital signs (pulse rate, systolic blood pressure, respiratory rate, oxygen saturation), and Glasgow coma scale. To develop models that predict the first-pass success outcome (once the intubation strategy has been determined), we used, in addition to the aforementioned predictors, all available intubation-related information, such as type of day (weekend/weekday), medications, intubation methods, intubation devices, intubator's post-graduate year, and intubator's specialty.

Statistical Analysis
Summary statistics were used to describe the characteristics of patients and airway management. After performing imputations for missing continuous variables (most predictors had <10% missingness; Multimedia Appendix 1) using random forest [19], we conducted predictor preprocessing, such as one-hot encoding (ie, creation of dummy variables), normalization, and standardization. The nonlinear predictors included in the developed models were age, body weight, height, BMI, and pre-intubation vital signs. In the training set (80% random sample), for each outcome, we developed 7 machine learning models: (1) logistic regression model with elastic-net (penalized logistic regression) [20], (2) random forest [21], (3) gradient boosting decision tree [22], (4) multilayer perceptron [23], (5) k-point nearest neighbor [24], (6) XGBoost [25], and (7) ensemble model (ridge regression and the random forest with an equal weight) [26]. For the difficult airway outcome, the modified LEMON criteria model was used as the reference model. For the first-pass success outcome, a (nonpenalized) logistic regression model was used as the reference model. We performed stratified 5-fold cross-validation to determine the optimal hyperparameters with the highest c-statistic (ie, the area under the receiver operating characteristic [ROC] curve).
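The data partitioning described above (a random 80%/20% split followed by stratified 5-fold cross-validation within the training set) can be sketched as follows. This is a minimal illustration in plain Python, not the study's code: the function names `random_split` and `stratified_folds`, the seeds, and the synthetic outcome vector are all assumptions made for the example.

```python
import random
from collections import defaultdict

def random_split(n, test_fraction=0.2, seed=0):
    """Randomly split indices 0..n-1 into (train_idx, test_idx)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return sorted(idx[n_test:]), sorted(idx[:n_test])

def stratified_folds(labels, indices, k=5, seed=0):
    """Assign indices to k folds, dealing each outcome class out round-robin
    so that every fold keeps roughly the same event rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

# Synthetic outcome vector (hypothetical data) with a 5% event rate,
# similar in spirit to the imbalance of the difficult airway outcome.
labels = [1] * 50 + [0] * 950
train_idx, test_idx = random_split(len(labels))
folds = stratified_folds(labels, train_idx)
events_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

In a hyperparameter search, each fold would serve once as the validation part while the other four are used for fitting, and the setting with the highest mean c-statistic across folds would be retained. Stratification matters here because, with an event rate near 5%, an unstratified fold could contain very few events.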
In the test set (the remaining 20% of the random sample), we measured the performance of reference and machine learning models. We estimated the c-statistic of each model and examined the following association measures: sensitivity, specificity, positive and negative predictive values, and positive and negative likelihood ratios. The c-statistic is the probability that, given 2 individuals (one who experiences the outcome of interest and the other who does not), the model estimates a higher probability for the first patient than for the second [27]. We determined the cut-off (threshold) for classifying predictions from the ROC curve using the Youden method [28]. For the model with the highest c-statistic among the 7 machine learning models, we computed the variable importance (ie, how strongly each of the predictors improved the c-statistic). We also examined calibration plots of the best-performing machine learning model for each of the outcomes. Data were analyzed using Python (version 3.7.3) and R (version 3.6.2). Two-sided P values <.05 were considered statistically significant.
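The performance measures named above can be made concrete with a short sketch. It computes the c-statistic directly from its pairwise definition, picks the cut-off that maximizes the Youden index (sensitivity + specificity - 1), and derives the association measures at that cut-off. The function names and the toy predictions are assumptions for illustration only; they are not the study's data or code.

```python
def c_statistic(y_true, y_prob):
    """Probability that a patient with the outcome receives a higher predicted
    probability than a patient without it (ties count as one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return concordant / (len(pos) * len(neg))

def confusion(y_true, y_prob, cutoff):
    """Counts (tp, fp, fn, tn) when probabilities >= cutoff are called positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
    return tp, fp, fn, tn

def youden_cutoff(y_true, y_prob):
    """Observed probability that maximizes J = sensitivity + specificity - 1."""
    best_j, best_cut = -1.0, None
    for cut in sorted(set(y_prob)):
        tp, fp, fn, tn = confusion(y_true, y_prob, cut)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut

def measures(y_true, y_prob, cutoff):
    """Association measures at a cut-off (assumes no cell of the 2x2 table
    makes a denominator zero)."""
    tp, fp, fn, tn = confusion(y_true, y_prob, cutoff)
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
    }

# Toy predictions for 10 hypothetical patients (4 with the outcome).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.25, 0.5, 0.7, 0.9, 0.1, 0.15, 0.2, 0.3, 0.4, 0.8]
auc = c_statistic(y_true, y_prob)
cut = youden_cutoff(y_true, y_prob)
m = measures(y_true, y_prob, cut)
```

The pairwise loop is quadratic in sample size and is shown only because it mirrors the definition; rank-based implementations give the same value far more efficiently on a test set of about 2000 patients.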

Patient Characteristics
During the 108-month study period, the JEAN-2 study recorded data for 10,816 patients (capture rate, 96%) who underwent emergency airway management at one of the 13 participating EDs. Of these, 75 patients who underwent surgical intubation at their first attempt were excluded; the remaining 10,741 patients comprised the analytic cohort. The patient characteristics, details of airway management, and intubation outcomes are shown in Table 1.
The calibration plot of the ensemble model for the difficult airway outcome (Figure 2A), which indicates how far the predicted risk from the ensemble model deviated from the actual risk, showed that the ensemble model overestimated the risk of the outcome, largely because of the class imbalance (ie, the difficult airway outcome occurred in only 5% of the sample), although there was a positive relationship between the predicted and actual risks.
Table 3 summarizes the performance of the reference model and 7 machine learning models when predicting the first-pass success outcome in the ED. Compared with the reference model, the discrimination ability of machine learning models, except for the random forest and k-point nearest neighbor models, was significantly higher (all P<.05). Among the 7 machine learning models, the ensemble model had the highest c-statistic (0.81, 95% CI 0.79-0.83; Figure 1B). Compared with the reference model, the ensemble model had a higher sensitivity (0.79, 95% CI 0.77-0.81) and specificity (0.67, 95% CI 0.65-0.69), with a PPV of 0.85 (95% CI 0.84-0.87) and NPV of 0.57 (95% CI 0.55-0.59). Compared with the reference model, which had a specificity of 0.36 (95% CI 0.34-0.38), most machine learning models had higher specificity, with the random forest model achieving a specificity of 0.70 (95% CI 0.68-0.72). In the calibration plot of the ensemble model (Figure 2B), the model-predicted probabilities matched the observed probabilities well.
Table 4 shows the variable importance of the best-performing model (the ensemble model) for each outcome. For the difficult airway prediction, the most important predictor was age, followed by any criterion met in the modified LEMON criteria and hyoid mental distance ≥3 fingers. For the first-pass success prediction, the most important predictor was the use of laryngeal pressure, followed by lifting force required for laryngeal deployment and Cormack grade of 3.
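One common way to quantify how much a predictor contributes to the c-statistic is permutation importance: shuffle one predictor at a time and record how far the c-statistic falls. The sketch below illustrates that idea only. The text does not state which importance algorithm was used, so the choice of permutation importance is an assumption, as are the function names, the toy scoring rule, and the synthetic data.

```python
import random

def c_statistic(y_true, scores):
    """Pairwise concordance between scores and a binary outcome."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    concordant = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return concordant / (len(pos) * len(neg))

def permutation_importance(predict, rows, y_true, n_repeats=20, seed=0):
    """Mean drop in c-statistic after shuffling each predictor in turn."""
    rng = random.Random(seed)
    baseline = c_statistic(y_true, [predict(r) for r in rows])
    importance = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Replace only the column under study; other predictors stay intact.
            permuted = [dict(r, **{name: v}) for r, v in zip(rows, shuffled)]
            permuted_c = c_statistic(y_true, [predict(r) for r in permuted])
            drops.append(baseline - permuted_c)
        importance[name] = sum(drops) / n_repeats
    return importance

# Hypothetical data: the outcome depends on "age"; "noise" is irrelevant.
rows = [{"age": age, "noise": (i * 7) % 10} for i, age in enumerate(range(20, 80))]
y_true = [1 if r["age"] >= 65 else 0 for r in rows]

def toy_model(row):
    """Stand-in for a fitted model; it scores patients by age alone."""
    return row["age"] / 100

imp = permutation_importance(toy_model, rows, y_true)
```

A predictor the model ignores, such as `noise` here, gets an importance of zero because shuffling it leaves every prediction unchanged, whereas shuffling `age` pushes the c-statistic toward 0.5.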

Principal Findings
In this analysis of multicenter prospective data from 10,741 ED patients, we applied modern machine learning models to predict intubation-related outcomes in the ED. Specifically, compared with conventional approaches (ie, modified LEMON criteria and nonpenalized logistic regression model), most machine learning models demonstrated superior discrimination performance when predicting both difficult airway and first-pass success outcomes. Additionally, these machine learning models achieved higher specificity when predicting these 2 outcomes. To the best of our knowledge, this is the first study to investigate the performance of modern machine learning models when predicting clinically important intubation outcomes in the ED setting.
Consistent with our findings, the following have been reported as predictors of first-pass success in the ED: patient characteristics (eg, restricted mouth opening, restricted neck extension, and swollen tongue), high Cormack grade, intubators' characteristics (eg, clinical experience and working department), the use of rapid-sequence intubation, and the use of video laryngoscope at the first attempt [6,29-31].
The importance of accurate prediction for difficult airways has been emphasized in ED airway management [8]. Although the modified LEMON criteria (and the LEMON criteria) have been validated as an indicator for difficult airways, their prediction ability is suboptimal for clinical use [7,8]. In the operating room setting, a couple of studies have reported a potential benefit of machine learning models for predicting difficult airways [32,33]. For example, in a single-center study of 80 patients, a deep learning approach using data from the patients' facial images had high discrimination ability for difficult airways, defined as multiple attempts by an intubator with at least 12 months of anesthesia experience, grade 3 or 4 laryngoscopic view, need for a second intubator, or nonelective use of an alternative airway device [32]. Our multicenter study, with a sample size many times larger than those of prior studies on this topic, builds on these earlier reports and extends them by demonstrating that modern machine learning models outperform conventional approaches for predicting intubation outcomes in the ED.
There are several potential explanations for the observed improvement in prediction ability with machine learning approaches. First, the machine learning approaches account for high-order interactions between predictors and nonlinear relationships with an outcome, which traditional modeling approaches cannot address [34]. Second, the modified LEMON criteria may be too parsimonious (ie, they use a limited number of predictors), while the applied machine learning models could use a larger number of predictors. Third, modern machine learning approaches enabled us to limit overfitting through lasso and ridge penalization (ie, the elastic net model) and cross-validation. In addition to these strengths, modern machine learning models are scalable for further improvement by integration with recently developed techniques such as image analysis of patients' faces and necks [32,35].
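For readers less familiar with penalized regression, the combination of lasso and ridge penalties in an elastic net logistic regression can be written as below. The notation is an assumption chosen for illustration (it follows the common glmnet-style parameterization, which may differ from the software settings used in this study): λ ≥ 0 controls the overall penalty strength and α ∈ [0,1] sets the mix, with α = 1 giving the lasso and α = 0 giving ridge regression.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr]
  + \lambda\Bigl[\, \alpha \lVert \beta \rVert_1 + \frac{1-\alpha}{2}\lVert \beta \rVert_2^2 \Bigr],
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}}
```

Both λ and α are hyperparameters of the kind that cross-validation is used to select. The penalty term shrinks coefficients toward zero, which is the mechanism by which such models limit overfitting.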
Although the machine learning models achieved greater predictive ability, their performance remained imperfect. This may be explained, at least partially, by the limited set of predictors (eg, lack of detailed information on the intubation competency and experience of each intubator) and data measurement errors. Additionally, one may surmise that the modified LEMON criteria are simpler and easier to use in the ED. Despite the known trade-off between parsimonious models and more complex models with a larger number of predictors, the use of modern machine learning models has advantages in the era of health information technology, including automated data entry through voice recognition, natural language processing, continuous sophistication of models through sequential extractions of electronic health records, and reinforcement learning [36,37]. Our findings and the recent advent of machine learning approaches collectively support cautious optimism that machine learning, as an assistive technology, may enhance the clinician's ability to predict patient outcomes in the ED. The resulting accurate prediction of intubation outcomes has several important implications for airway practice in the ED. For example, early identification of difficult airways should help ED providers develop individualized and optimal management strategies and prepare for rescue airways [14,38]. In addition, accurate estimation of the probability of first-pass success given the conditions (eg, the airway management strategies and intubator to be used) would not only increase the opportunity for clinical training (eg, by identifying which patients can be safely intubated by a given intubator) but also improve patient safety.
To implement our developed machine learning models, a web-based application or integrated emergency department information system is needed. The rapid development of health information technology (eg, web-based artificial intelligence application with the model) enables us to implement the developed model in real clinical settings. Furthermore, the current models can be used not only for practice but also as an educational tool. For example, in simulation-based intubation training, supervisors can evaluate the trainee's intubation strategy by indicating the predicted probability of difficult airway and first-pass success.

Limitations
Several potential limitations of this study should be noted. First, our data may be subject to self-reporting and measurement bias (eg, underreporting difficult airways). However, the study was conducted by investigators using a standardized protocol [6], which led to the high capture rate (96%) and low proportion of missingness in the predictors and outcomes (Multimedia Appendix 1). Second, we did not have detailed information on the procedural competency of each intubator, a factor that is challenging to define and measure in real-world settings. To address this issue, we used years of experience and specialty, which are readily available in most ED settings, as a proxy for competency. Third, machine learning models share a common limitation in interpretability. Fourth, because of the small proportion of children (2.8%), our models may not have optimal prediction ability in pediatric populations. Finally, our models may not be generalizable to other practice settings, although the study sample consisted of a geographically diverse patient population across Japan.

Conclusions
In summary, based on extensive multicenter, prospective data from 10,741 ED intubations, we developed modern machine learning models to predict clinically important intubation outcomes. Using routinely available data as the predictors, we found that the machine learning models had a greater ability to predict difficult airways and first-pass success than conventional approaches. Although formal validation is required, this study lends support to the application of machine learning models for the prediction of intubation-related outcomes, which may, in turn, improve airway management practice and outcomes of critically ill patients in the ED.

Conflicts of Interest
None declared.