Leveraging explainable artificial intelligence to optimize clinical decision support

Abstract Objective To develop and evaluate a data-driven process to generate suggestions for improving alert criteria using explainable artificial intelligence (XAI) approaches. Methods We extracted data on alerts generated from January 1, 2019 to December 31, 2020, at Vanderbilt University Medical Center. We developed machine learning models to predict user responses to alerts. We applied XAI techniques to generate global and local explanations. We evaluated the generated suggestions by comparing them with alerts' historical change logs and through stakeholder interviews. Suggestions that matched (or partially matched) changes already made to an alert, or that were considered clinically correct, were classified as helpful. Results The final dataset included 2 991 823 firings with 2689 features. Among the 5 machine learning models, the LightGBM model achieved the highest area under the ROC curve (AUC): 0.919 [0.918, 0.920]. We identified 96 helpful suggestions. A total of 278 807 firings (9.3%) could have been eliminated. Some of the suggestions also revealed workflow and education issues. Conclusion We developed a data-driven process to generate suggestions for improving alert criteria using XAI techniques. Our approach could identify improvements to clinical decision support (CDS) that might be overlooked or delayed in manual reviews. It also unveils a secondary purpose for XAI: to improve quality by discovering scenarios where CDS alerts are not accepted due to workflow, education, or staffing issues.


Introduction
The federal government has spent more than $34 billion on the implementation of electronic health records (EHRs) over the past decade.[1][4][5][6] Additionally, many alerts fire at inopportune times (eg, a weight-loss alert during a cardiac resuscitation) or in clinical scenarios where they are unlikely to be helpful (eg, a cholesterol screening alert for a hospice patient).[7] CDS alerts are often triggered by a limited set of criteria (eg, suggest cholesterol screening in men >35 or women >45 who have not received screening). Accounting for additional criteria (eg, excluding hospice patients from cholesterol screening) could eliminate some alert firings that are extremely unlikely to be accepted.
Researchers have attempted various approaches to identify these criteria and improve alert quality, including manual review[8][9][10][11] and collecting feedback from healthcare providers[12,13] to adjust or turn off low-response alerts. These approaches are time- and labor-intensive, which prohibits rapid improvement of alerts. Additionally, manual reviews can only consider a small number of variables at a time, making it difficult to fully understand complex clinical scenarios. Finally, clinician feedback often fails to comprehensively capture all users' perceptions, and it introduces recall bias.[14][16][17] Combining alert log data with EHR data can provide information on who overrides alerts, when, and under what circumstances, which can be used to better target alerts.[18] Therefore, an urgent need arises for an efficient and fair approach to comprehensively analyze user interactions with alerts and automatically generate suggestions to target alerts more precisely or improve clinical processes.
Explainable artificial intelligence (XAI) approaches are promising tools to address this need. XAI refers to a range of techniques designed to maintain high learning performance in AI while enabling users to understand model behavior.[19,20] XAI techniques can be broadly categorized into 2 types based on the scope of their explanations: global and local.[21] Global explanations focus on the entire model's rationale, providing a comprehensive overview of the decision-making process and its various potential results. This kind of transparency is reflected in models such as logistic regression, where the entire logic must be clear and traceable. However, the pursuit of global explanations often leads to a trade-off with model complexity and predictive power. On the other hand, local explanations focus on explaining individual decisions or predictions. XAI techniques in this category tailor explanations to specific instances, offering justifications for the model's behavior in particular scenarios. A prominent example is the Local Interpretable Model-Agnostic Explanations (LIME) technique, which provides local approximations of a model's predictions, enabling a granular understanding of its operations.[22] Building on such foundational work, newer methods like "Anchors" promise to enhance the precision of these local explanations with decision rules, guiding users through the AI's reasoning for individual cases.[23] In prior work, we developed a traditional machine learning model that could suppress 294 871 (54.1%) medication alert firings while maintaining a false-negative rate of only 0.9% (ie, 430 missed acceptances) in a test dataset, illustrating that machine learning models based on alert log data can accurately predict user responses to alerts within a single organization.[24]
However, results from machine learning alone are insufficient in that they lack transparency and are difficult to integrate into current rule-based alerts. Using XAI techniques allows for explanations of model predictions (ie, user acceptance or non-acceptance of alerts). For example, one potential explanation is that the model predicts non-acceptance of the Contraindicated-Non-steroidal anti-inflammatory drugs (NSAIDs) and Pregnancy alert when the patient is in the postpartum department. Based on this explanation, CDS experts could review the alert logic, for instance, by considering the patient's presence in the postpartum department as an exclusion criterion for the Contraindicated-NSAIDs and Pregnancy alert to improve the specificity of the alert and reduce unnecessary firings. The purpose of this study was to develop and evaluate a data-driven process that generates suggestions to improve alert criteria.

Method overview
The method overview is shown in Figure 1. It consists of 3 components: (1) a data collection step that extracts alert log data and associated variables from the EHR; (2) a model development component that applies XAI approaches to generate suggestions for improving the alert logic; and (3) a suggestion evaluation component, which includes historical change log comparison, stakeholder interviews, and analysis of current alert data. For the machine learning models used in the XAI approach, we used the area under the ROC curve (AUC) to select the optimal models. The output of the models was a set of IF-THEN rules explaining in what situations users were less likely to accept the alert. We used a set of metrics (odds ratio (OR), probability of low acceptance, decrease rate, confidence, interest, conviction, and the P value of χ²) to select rules for the evaluation step. We then converted the rules into suggestions. For example, for a breast cancer screening alert, one rule was: IF the patient is a hospice patient, THEN the user is less likely to accept the alert. The corresponding suggestion would be: Do not fire the breast cancer screening alert for hospice patients.
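To make the rule-scoring step concrete, the sketch below computes several of these metrics from simple counts, using standard association-rule formulas (confidence, lift/interest, conviction) plus the odds ratio of non-acceptance. The exact definitions used in the study are those in Table 1; the function name and the counts in the example are hypothetical.

```python
def rule_metrics(rule_firings, rule_accepts, total_firings, total_accepts):
    """Score a rule of the form 'IF condition THEN alert not accepted'.

    Standard association-rule formulas; assumes non-zero accept counts
    inside and outside the rule's subpopulation.
    """
    rule_not = rule_firings - rule_accepts
    rest_accepts = total_accepts - rule_accepts
    rest_not = (total_firings - rule_firings) - rest_accepts
    # Odds of non-acceptance inside vs outside the rule's subpopulation
    odds_ratio = (rule_not / rule_accepts) / (rest_not / rest_accepts)
    confidence = rule_not / rule_firings           # P(not accepted | condition)
    base_rate = (total_firings - total_accepts) / total_firings
    interest = confidence / base_rate              # lift over the overall rate
    conviction = (1 - base_rate) / (1 - confidence)
    return {"odds_ratio": odds_ratio, "confidence": confidence,
            "interest": interest, "conviction": conviction}

# Hypothetical counts: a hospice rule covering 1000 of 10 000 total firings
m = rule_metrics(rule_firings=1000, rule_accepts=16,
                 total_firings=10_000, total_accepts=1230)
```

A rule like the hospice example would then pass thresholds such as odds ratio > 1.25 and confidence > 0.98, and its suggestion would move on to evaluation.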

Data collection
Leveraging a previously developed taxonomy to identify features influencing user responses to alerts, we extracted data from Vanderbilt University Medical Center (VUMC)'s Epic Clarity clinical data warehouse for alerts generated from January 1, 2019 to December 31, 2020. To build a predictive model, the dataset only included data generated before the alert was displayed to the user.[24] Alerts with firing counts ≤100, or acceptance counts ≤10, were excluded. User responses were categorized as accepted or not accepted. If the user chose "Acknowledge/override Warning," "Cancel Warning," "No Action Taken," "Accept BPA (No Action Taken)," or "Cancel BPA," then the user response was classified as "not accepted." Otherwise, the user response was classified as "accepted" (ie, single order, remove order set). For the NSAIDs alert, for example, we extracted 2196 instances of the alert firing. Each row contained one alert firing and the outcome (in this case, if the NSAID was removed, the alert was classified as "accepted"; if the user overrode the alert and kept the NSAID order, it was counted as "not accepted"). For each alert firing record, we also extracted 2689 features, such as the patient's age, all of their diagnoses, lab results, etc. The same features are used for every CDS alert. Numeric variables were binned into 10 groups with equal-width intervals.[25]
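A minimal sketch of the labeling and binning described above. The response strings are quoted from the text; the binning helper is an assumed implementation of 10 equal-width groups, not the study's actual preprocessing code.

```python
# Response categories listed in the text that count as "not accepted";
# everything else (eg, single order, remove order set) counts as "accepted".
NOT_ACCEPTED = {
    "Acknowledge/override Warning", "Cancel Warning", "No Action Taken",
    "Accept BPA (No Action Taken)", "Cancel BPA",
}

def label_response(action: str) -> str:
    """Map a raw user action string to the binary outcome label."""
    return "not accepted" if action in NOT_ACCEPTED else "accepted"

def equal_width_bins(values, n_bins=10):
    """Bin numeric feature values into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against constant features
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `label_response("Cancel BPA")` yields "not accepted", and patient ages would be mapped to 10 interval indices before model training.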

Model development
Four machine learning techniques were evaluated: random forest, neural network, support vector machine, and gradient-boosted trees. These models are commonly used for medical data, and gradient-boosted tree models (LightGBM implementation) have demonstrated superior performance in recent studies.[24,26,27] To impute missing values of numerical features, we compared mean imputation, median imputation, and imputation with the most frequent value. For categorical features, we used one-hot encoding.[28] We applied 3 XAI techniques: skope-rules, boosted rule set, and Anchor. The skope-rules technique extracts rules from gradient-boosted trees, deduplicates them, and combines them based on out-of-bag precision, providing a global explanation that represents typical patterns across the entire dataset.[29] The boosted rule set, employing the principles of AdaBoost, sequentially fits a set of rules to handle data variance effectively.[30] Unlike gradient-boosted trees, which focus on reducing residuals and typically produce deeper trees, AdaBoost emphasizes correctly classifying previously misclassified instances by adjusting the weights of observations and classifiers. Lastly, the Anchor technique is used for its ability to generate precise local explanations, which represent local patterns.
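To make the AdaBoost contrast concrete, here is a toy, pure-Python illustration of the reweighting idea behind a boosted rule set: one-condition rules (decision stumps over binary features) are fitted sequentially, and instances the current rule misclassifies are up-weighted for the next round. This is a didactic sketch under assumed data structures, not the implementation used in the study.

```python
import math

def boosted_rule_set(X, y, n_rules=5):
    """Toy AdaBoost over one-condition rules (stumps on binary features).
    X: list of feature dicts; y: 1 = alert not accepted, 0 = accepted."""
    n = len(X)
    w = [1.0 / n] * n
    feats = sorted({f for row in X for f in row})
    rules = []
    for _ in range(n_rules):
        best = None
        for f in feats:
            for val in (0, 1):
                # Weighted error of "IF feature f == val THEN not accepted"
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if (row.get(f, 0) == val) != (yi == 1))
                if best is None or err < best[0]:
                    best = (err, f, val)
        err, f, val = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # rule weight
        rules.append((f, val, alpha))
        # Up-weight instances the chosen rule misclassified, then renormalize
        w = [wi * math.exp(alpha if (row.get(f, 0) == val) != (yi == 1) else -alpha)
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return rules

def predict(rules, row):
    """Weighted vote of the fitted rules: 1 = predict not accepted."""
    score = sum(alpha if row.get(f, 0) == val else -alpha
                for f, val, alpha in rules)
    return 1 if score > 0 else 0
```

On a toy dataset where only hospice patients ignore an alert, the procedure learns the rule "hospice == 1 implies not accepted" in the first round.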
We merged the rules generated by the XAI approaches[23,33] and removed duplicates. For each rule, we used a set of selection metrics to compute the values used to select candidates for the final evaluation: odds ratio (OR), probability of low acceptance, decrease rate, confidence, interest, conviction, and the P value of χ². The probability of low acceptance uses the binomial distribution with a Beta(1,1) (ie, uniform) prior to identify alerts whose acceptance rate may fall below a certain threshold. To illustrate, consider a rule for improving alerts, such as "IF the patient is a hospice patient, THEN the user is less likely to accept the breast cancer screening alert." Assuming the breast cancer screening alert was triggered in 1000 hospice patients and accepted 16 times, the excluded firings count would be 1000, leading to a subpopulation acceptance rate of 1.6%. In this scenario, the posterior distribution of the acceptance rate would be Beta(17, 985), giving a probability of 0.782 that the acceptance rate for these excluded firings would be at most 0.02, thereby supporting the adoption of the corresponding suggestion "Do not fire the breast cancer screening alert for hospice patients." This metric effectively accounts for both the number of firings (the more the better) and the number of acceptances (the fewer the better) when selecting obsolete alerts. Additional metrics were calculated as detailed in Table 1. The pros and cons of each metric are displayed in Table S1. We recognized the necessity of employing multiple metrics to evaluate the usefulness of suggestions in CDS alert optimization. Simple metrics were considered initially; however, they proved insufficient for a holistic assessment. Finally, a subset of 1000 randomly chosen suggestions was reviewed by 2 CDS experts (A.W. and A.B.M.), who examined each rule alongside its corresponding metric values, to determine the thresholds for both the beta probability distribution and the selection metrics. The thresholds were thus informed by the manual review, leveraging the expertise of our CDS experts to optimize for practical impact on CDS alert improvement.
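The posterior calculation in the hospice example can be sketched with the standard library alone: draw from the Beta posterior and estimate the probability that the subpopulation acceptance rate is at most 2%. The function name, sample count, and seed are illustrative choices, not the study's code.

```python
import random

def prob_low_acceptance(firings, acceptances, threshold=0.02,
                        n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(acceptance rate <= threshold) under a
    Beta posterior with a Beta(1, 1) (uniform) prior."""
    a = 1 + acceptances                 # posterior alpha
    b = 1 + (firings - acceptances)     # posterior beta
    rng = random.Random(seed)
    hits = sum(rng.betavariate(a, b) <= threshold for _ in range(n_samples))
    return hits / n_samples

# Worked example from the text: 1000 firings, 16 acceptances -> Beta(17, 985);
# the estimate lands near the reported 0.782.
p = prob_low_acceptance(1000, 16)
```

A closed-form version would use the regularized incomplete beta function (eg, `scipy.stats.beta.cdf`); the Monte Carlo form is used here only to stay dependency-free.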

Suggestion evaluation
We conducted a comprehensive qualitative and quantitative evaluation of the generated suggestions. The first method used in the qualitative assessment was to compare the generated suggestions with the BPAs' historical change logs to identify consistent or partially consistent modifications and record the corresponding modification dates. The change log is a locally developed tool containing all changes made to BPAs in VUMC's Epic system since 2019, generated each day by comparing the current version of each BPA record with the previous day's version and recording the differences. Generated suggestions that matched or partially matched BPA changes were classified as "helpful suggestions." For the remaining generated suggestions, we conducted stakeholder interviews aimed at assessing clinical significance through the insights of relevant healthcare professionals. Generated suggestions that did not match BPA changes but were clinically correct were also classified as "helpful suggestions." Notably, a limitation of the above methods was that generated suggestions and seemingly disparate modifications may have similar implications, posing a challenge to manual side-by-side comparisons. For example, a generated suggestion to "exclude patients with chief complaints of 'Return' for an alert about the Retinopathy of Prematurity (ROP) exam for premature infants" had the same effect as a modification to exclude all outpatients, because in the clinical setting for this alert, the chief complaint "Return" was used only for outpatients. As a complementary approach, we extracted the most recent user responses and corresponding features for alerts generated from March 1, 2023 to June 30, 2023, to assess whether alerts still fired in the contexts covered by the generated suggestions. In addition, for each BPA, we calculated the relative change in acceptance rate compared to the original acceptance rate.
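The daily change-log generation described above amounts to a field-by-field diff of each BPA record against the previous day's snapshot. A minimal sketch, with hypothetical field names and values:

```python
def diff_bpa(previous: dict, current: dict):
    """Return (field, old_value, new_value) for every field that differs
    between yesterday's and today's snapshot of a BPA record."""
    changes = []
    for field in sorted(set(previous) | set(current)):
        if previous.get(field) != current.get(field):
            changes.append((field, previous.get(field), current.get(field)))
    return changes

# Hypothetical snapshots: an exclusion criterion was added overnight
yesterday = {"name": "Breast cancer screening", "exclusions": []}
today = {"name": "Breast cancer screening", "exclusions": ["hospice"]}
changes = diff_bpa(yesterday, today)
```

Each day's non-empty diff would be appended to the log with its date, which is what makes the later suggestion-to-modification matching possible.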

Results
The final dataset included 2 991 823 firings with 2689 features, 139 BPAs, 247 648 patients, and 18 397 users. The features are listed in Table S2. The overall acceptance rate was 12.3%. Among the machine learning models, the LightGBM model achieved the highest AUC: 0.919 [0.918, 0.920]. It was selected as the optimal machine learning model for generating suggestions in Anchor. Sensitivity, precision, F1, accuracy, and AUC for each machine learning model are listed in Table 2.
Applying pre-defined thresholds (odds ratio > 1.25, probability of low acceptance > 0.5, decrease rate < 0.4, confidence > 0.98, interest > 1, conviction > 1.2, and P[χ²] < 0.01) and taking the intersection, a total of 1727 generated suggestions were selected as candidates for evaluation. In historical change log comparisons, we found that 76 of the suggestions either fully or partially matched historical changes (63 fully matched and 13 partially matched). In addition, another 20 suggestions were identified as "helpful suggestions" in stakeholder interviews. Taken together, these 96 helpful suggestions were associated with 18 BPAs. Among 2 991 823 firings, 278 807 firings (9.3%) could have been eliminated. Table 3 shows examples of generated suggestions and their corresponding comments. All modified, partially modified, or discussed generated suggestions are listed in Table S3.
If all helpful suggestions were implemented, then for each BPA, the average decrease in alert firings would be 12.3% and the average relative change in acceptance rates would be 16.9%. Table 4 shows the number of firings and acceptance rates before and after applying the helpful generated suggestions, grouped by BPA.
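The two summary quantities used here can be computed directly from before/after counts. In the sketch below, the firing totals come from the Results; the overall acceptance count is derived from the reported 12.3% rate, and the removed-acceptance count is hypothetical, since the text does not report it.

```python
def impact(firings, accepts, removed_firings, removed_accepts):
    """Fraction of firings eliminated, and the relative change in
    acceptance rate, after removing a set of low-value firings."""
    old_ar = accepts / firings
    new_ar = (accepts - removed_accepts) / (firings - removed_firings)
    return {"firing_decrease": removed_firings / firings,
            "relative_ar_change": (new_ar - old_ar) / old_ar}

# 2 991 823 firings and 278 807 eliminated are from the Results;
# 367 994 accepts is ~12.3% of firings; 5000 removed accepts is hypothetical.
overall = impact(firings=2_991_823, accepts=367_994,
                 removed_firings=278_807, removed_accepts=5_000)
```

With these inputs, `firing_decrease` reproduces the reported 9.3% overall reduction; the per-BPA averages in the text are computed alert by alert rather than from the pooled totals.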
The evaluation dataset contained 524 970 firings for 112 BPAs, with an overall acceptance rate of 12.2%. Of the 1727 suggestions generated, 702 corresponded to situations where no alerts occurred. Specifically, 425 suggestions related to retired BPAs, and the other 277 suggestions were likely to have been incorporated into subsequent modifications.

Discussion
In this study, we developed and evaluated a data-driven process to generate suggestions for improving alert criteria using XAI approaches. This approach could reduce alert firings by 9.3% after implementing the suggestions validated by CDS experts. The approach can significantly reduce, but not eliminate, human intervention, and it generates fully transparent rules in a timely manner from user interactions with alerts while potentially being more accurate than ordinary rule-based models.
The effectiveness of the XAI approach in eliminating alert firings might be underestimated due to the robust manual alert review process at VUMC. From March to September 2020, VUMC performed an intensive 6-month initiative to refine alert criteria, working with 28 clinicians.[34] In addition, VUMC holds monthly CDS governance meetings to review alerts with low acceptance rates. These regular and comprehensive manual review processes provide a valuable opportunity to compare the data-driven suggestions with those generated by manual review. However, as a result, some of the suggestions generated from the 2019 alert data had already been identified in subsequent manual reviews. Conversely, the XAI approach may eliminate a larger number of alert firings at institutions with limited resources for manual reviews.
We presented an effective approach to identify improvements in alert criteria that a human might not identify, or might identify only after a delay. Through a comparative analysis of modification timelines, we found that only 62% of the helpful suggestions were implemented by 2021. Furthermore, 16% of these helpful suggestions were implemented after 2022, and 22% were not identified by human review at all. Several reasons led to these results. When experts reviewed alerts, it was difficult to consider hundreds of possible improvements. Reviewers might not consider clinical scenarios that are rare or outside their area of practice (such as patients in the OR, patients receiving an imaging procedure, or patients on hospice). On the other hand, this data-driven process could learn from millions of user interactions with alerts to cover a much wider range of scenarios. From this perspective, one person manually reviewing an alert provides the opinion of one healthcare provider, but the XAI-based process can generate suggestions that are more comprehensive across all providers at the institution.
More importantly, our research not only has benefits within the scope of CDS, but it also aims to improve clinical processes. For example, for the alert regarding weight documentation of pediatric patients, one generated suggestion was "Do not fire when: Provider Primary Location = Vanderbilt Wilson County Adult Hospital." This suggestion was noted by a stakeholder as important for further exploring the reasons for low acceptance at this location and for the potential to reinforce education on the alert's use. It demonstrates that alert log data provides not only user acceptance of alerts, but also an opportunity to track practices and the associated implementation process. Overall, this data-driven process transforms CDS alert knowledge maintenance into a learning health system by effectively utilizing user interaction data in the clinical setting. This would enable the CDS team to learn from experience and inform clinical improvements, leading to continuous enhancements in healthcare quality and efficiency. Healthcare organizations and EHR vendors should consider developing or adopting automated methods to identify potential improvements to CDS.[35] Right now, implementing the suggestions requires manual work to update CDS logic, but EHR vendors could add tools that allow users to accept suggestions directly and automatically adjust the CDS logic accordingly. These automated methods complement other approaches to CDS improvement such as the Clickbusters process, review of user feedback, and monitoring.[34,36] Additionally, the Epic EHR system includes an automatic tool, "Tune-up," that suggests updates to minimize disruptions, focusing on features like popup and acknowledgment lockout periods, alert triggers, and provider-specific details such as type, department, and specialty. However, "Tune-up" is constrained to suggesting modifications for only one feature at a time. The XAI approach can consider combinations of different features and generate suggestions involving multiple features.

Limitations
This study has several limitations. First, we developed and evaluated the data-driven suggestion-generation process using datasets from a single medical center; exploring its capability in other healthcare systems would add more value. Second, as a retrospective study, the impact of the generated suggestions on patient outcomes and physician behaviors remains unknown.

Future work
Future work in this area includes designing an interface for CDS experts to visualize the XAI process and evaluate model-generated suggestions.A real-time and user-friendly interface could facilitate the process of improving CDS alert criteria, as described above.Another direction is to conduct a multi-site prospective study to implement suggestions and evaluate changes in clinician behavior and clinical outcomes.

Conclusion
In summary, we developed a data-driven process to generate suggestions for improving alert criteria using XAI techniques. Our approach could identify improvements to CDS that might be overlooked or delayed in manual reviews. Our study also unveils a secondary purpose for XAI: to improve quality by discovering scenarios where CDS alerts are not accepted due to workflow, education, or staffing issues. It is important for healthcare organizations and EHR vendors to integrate such automated techniques to improve CDS tools.

Table 1 .
Metric to select generated suggestions.

Table 2 .
Prediction results on the testing dataset.

Table 3 .
Examples of generated suggestions and feedback from clinicians.

Table 4 .
Comparisons of original and predicted firing counts with acceptance rates (ARs) before and after using "helpful suggestions."