Large Language Model–Based Responses to Patients’ In-Basket Messages

Key Points

Question Can generative artificial intelligence (GenAI) chatbots aid patient–health care professional (HCP) communication by creating high-quality draft responses to patient requests?

Findings In this cross-sectional study of 16 primary care physicians' opinions on the quality of GenAI- and HCP-drafted responses to patient messages, GenAI responses were rated higher than HCPs' for communication style and empathy. GenAI responses were longer, more linguistically complex, and less readable than HCP responses; they were also rated as more empathetic and contained more subjective and positive language.

Meaning In this study, primary care physicians perceived that GenAI chatbots produced responses to patient messages that were comparable in quality with those of HCPs, but because GenAI responses used complex language, they could pose problems for patients with lower health or English literacy.



Introduction
Each additional message adds more than 2 minutes of EHR time, encompassing message drafting, information searching, order placing, and documentation.1,5,10 Proposed relief strategies include reducing EHR window switching (improving user interface design), upskilling support staff, and billing for messages,1,7,10,11 a practice now permitted by the Centers for Medicare & Medicaid Services that evidence suggests reduces messaging burden.12 Using EHR-integrated generative artificial intelligence (GenAI) chatbots to automate drafting responses to patient messages could streamline workflows and thus alleviate burnout.
Addressing these challenges will enhance understanding of this technology's benefits and limitations.18 Successful adoption depends on understanding HCPs' and patients' perceptions of GenAI outputs.13,16,18,20,21,23,24 Our study addresses this gap by using private patient-HCP message-response pairs to investigate PCPs' perceptions of GenAI drafts and to explore underlying linguistic factors associated with equity and perceived empathy. We hypothesized that GenAI draft quality, assessed by PCPs on information content quality, communication style, and usability, would be equivalent to that of HCP-generated responses.

Study Setting and Participants
A convenience sample of 16 PCPs was recruited from a large urban academic health system via a listserv email to 1189 internal medicine physician email addresses. PCPs affiliated with NYU Grossman School of Medicine were eligible. Participants provided consent by accepting the request to complete the survey and could opt out at any time. Surveys were collected between September 23 and November 16, 2023.

Survey
Surveys were conducted in REDCap.36 Participants were provided a random selection of message-response pairs over 2 surveys, masked to whether the response was generated by GenAI or an HCP.
The first survey contained 5 to 8 message-response pairs and no branching logic, while the second survey contained 15 to 20 message-response pairs and branching logic (eAppendix 1 in Supplement 1).
In both surveys, participants assessed the quality of response information content and communication style using 5-point Likert scale questions (scale 1-5, with 1 indicating strongly disagree and 5 indicating strongly agree), then answered whether the draft was preferable to starting from a blank page (usable vs unusable). Branching logic followed negative responses to Likert questions (bottom 2 box) to explore PCPs' rationale, assessing for aspects like relevance and empathy. Regardless of whether a draft was considered usable, respondents selected from a list of items (eAppendix 1 in Supplement 1).
To construct the first survey, 200 random in-basket messages, including the corresponding HCP and AI-generated responses, were extracted on September 12, 2023. A total of 112 patient messages were reviewed, and 53 were excluded because they required outside context (eg, laboratory values or medication names) for adequate evaluation of the response by participants, leaving 59 patient messages paired with both HCP and GenAI responses (52.7%). For the second survey, 500 random patient messages were extracted from the data warehouse on October 12, 2023, of which 464 were reviewed and 146 were excluded due to the need for external context to properly evaluate the response, leaving 318 patient messages paired with both HCP and GenAI responses (68.5%), from which respondents' questions were randomly assigned. Not all extracted messages were reviewed because the desired sample size (determined by the effort required of our participants) was achieved beforehand. Message-response pairs were randomly sampled (with replacement) for review, so some pairs were reviewed by multiple reviewers.

Message-Response Sample Selection
The survey used in-basket message-response pairs from outpatient internal medicine departments participating in the pilot study of Generated Draft Replies (Epic), which generated responses using GPT-4 (OpenAI) through an EHR-integrated, vendor-prepared system. Pairs were randomly selected during the system's silent validation, when drafts were generated using Epic's standard prompts but not seen by HCPs. The patient message subcategory (laboratory results, medication refill requests, paperwork, and general medical advice) determined which prompt (utilizing unique instructions and patient-specific details) generated the response (eg, laboratory results messages auto-populate recent test results, while medication refill requests include the active medication list).
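The subcategory-to-prompt dispatch described above can be sketched as a simple template lookup. This is a hedged illustration only: the template strings, function name, and context keys below are invented for the example and are not Epic's actual prompts.

```python
# Hypothetical sketch of subcategory-specific prompt assembly: each message
# subcategory selects a template that auto-populates patient-specific details.
# Template text and key names are illustrative assumptions, not vendor prompts.

PROMPT_TEMPLATES = {
    "laboratory_results": (
        "Draft a reply referencing these recent results:\n{recent_results}\n\n"
        "Patient message: {message}"
    ),
    "medication_refill": (
        "Draft a reply given the active medication list:\n{medications}\n\n"
        "Patient message: {message}"
    ),
    "paperwork": "Draft a reply to this paperwork request.\n\nPatient message: {message}",
    "general_medical_advice": "Draft a reply to this medical question.\n\nPatient message: {message}",
}

def build_prompt(subcategory: str, message: str, patient_context: dict) -> str:
    """Select the subcategory's template and fill in patient-specific details."""
    template = PROMPT_TEMPLATES[subcategory]
    return template.format(message=message, **patient_context)

# Example: a laboratory-results message auto-populates recent test results.
prompt = build_prompt(
    "laboratory_results",
    "Were my cholesterol results okay?",
    {"recent_results": "LDL 96 mg/dL (2023-09-01)"},
)
```

This structure makes the benchmarking mentioned next straightforward: each subcategory's standard template can later be swapped for an engineered variant and the two compared.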
Evaluating standard prompts allows for benchmarking future prompt engineering efforts.
Inclusion criteria dictated that the first patient-initiated message between the patient and their HCP was chosen. If multiple HCP messages were sent in response, they were combined to minimize artificially incomplete responses. Responses from physicians, nurses, and frontline staff were included to reflect how patient requests are answered at many institutions.

Statistical Analysis
Statistical analysis was conducted in Python version 3.9.16 (Python Software Foundation) in May 2024. We used an a priori level of significance of P < .05 for 2-sided tests of the null hypothesis that GenAI drafts would be equal to HCP responses on our 3 main survey questions. Mann-Whitney tests, robust to outliers and nonnormal distributions,14,37,38 evaluated differences between GenAI and HCP responses for the 2 main Likert questions and the 2-way paperwork messages subgroup comparison.
Kruskal-Wallis tests compared the Likert scale means of physicians, nonphysicians, and GenAI across the 4 message subcategories.14 Independent sample t tests were used to compare differences in the proportion of GenAI vs HCP responses considered usable and all computational linguistics measures.
One-way analysis of variance was used to compare the proportions of drafts considered usable by physicians, nonphysicians, and GenAI across 3 of 4 message subcategories. P values for all secondary analyses underwent a Sidak correction39 to account for multiple comparisons.
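As a concrete illustration of the multiple-comparison adjustment above, the Sidak correction can be written directly from its definition; the p values below are made up for the example.

```python
# Minimal sketch of the Sidak correction for m secondary comparisons.
# Adjusted p value: p_adj = 1 - (1 - p)^m, compared against alpha as usual.

def sidak_adjust(p_values):
    """Return Sidak-adjusted p values for a family of m comparisons."""
    m = len(p_values)
    return [1 - (1 - p) ** m for p in p_values]

raw = [0.01, 0.04, 0.20]          # illustrative raw p values
adjusted = sidak_adjust(raw)      # [~0.0297, ~0.1153, ~0.488]
# A comparison stays significant only if its adjusted p value is below alpha.
significant = [p < 0.05 for p in adjusted]
```

The correction is slightly less conservative than Bonferroni (which divides alpha by m) and is exact when the comparisons are independent.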
Because our data are ordinal and pairs were randomly assigned, the 1-way intraclass correlation coefficient (ICC) was used to estimate interrater reliability from the double-reviewed questions.40 Linear mixed models with random effects for individual reviewer variation and fixed effects for patient message subcategory and HCP subcategory were built (eAppendix 3 in Supplement 1) to assess how these factors affected survey results.
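For readers unfamiliar with the one-way ICC, it can be computed in pure Python from the between- and within-target mean squares; the ratings below are invented for illustration, and a real analysis would use a dedicated statistics package.

```python
# Illustrative computation of the one-way ICC, ICC(1), from an ANOVA
# decomposition. Rows are double-reviewed questions; columns are raters.
# Ratings are made-up Likert scores, not study data.

def icc1(ratings):
    """One-way ICC: (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    n = len(ratings)              # number of targets (questions)
    k = len(ratings[0])           # raters per target
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    # Between-targets mean square
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Within-targets mean square
    msw = sum(
        (x - row_means[i]) ** 2 for i, r in enumerate(ratings) for x in r
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

ratings = [[4, 5], [2, 2], [3, 4], [5, 5], [1, 2]]
reliability = icc1(ratings)
```

Values near 1 indicate raters largely agree; values near 0 (or negative) indicate disagreement dominates, as was observed in this study.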
Computational linguistics methods analyzed responses' length, complexity, and sentiment, as well as the prevalence of specific content dimensions, such as positive emotion words. Such measures characterize writing styles and can anticipate readers' attitudes and behavior toward the content, including their perception of its usefulness.41 Analysis was performed in Python, with the pandas package (version 2.1.1) used to calculate word counts. Lexical diversity, or the variety of words used in a text, was assessed with the measure of textual lexical diversity, calculated using the lexical_diversity package (version 0.1.1).43,44,45 Lexical diversity reflects language proficiency; highly diverse text indicates the author is using a broad range of vocabulary to express their thoughts and ideas.46 The textstat package (version 0.7.3) calculated the Flesch-Kincaid grade level, which is derived from the average syllables per word and average words per sentence and describes an English passage's comprehensibility.47,48 Content analysis of the main response groups and empathetic subgroups utilized the latest Linguistic Inquiry and Word Count (LIWC) application, LIWC-22, the preferred application for automated text analysis in social science research.49,51,52,53 The textblob package (version 0.17.1) was used to measure sentiment and subjectivity.56
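The Flesch-Kincaid grade level mentioned above can be approximated without the textstat package using its published formula and a crude vowel-group syllable heuristic. This is a sketch only: the example texts are invented, and textstat's dictionary-based syllable counts would differ slightly.

```python
# Hedged sketch of the Flesch-Kincaid grade level:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
# Syllables are approximated by counting contiguous vowel groups, a rough
# stand-in for textstat's more careful counting.
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as the number of contiguous vowel groups."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * (len(words) / len(sentences))
        + 11.8 * (syllables / len(words))
        - 15.59
    )

# Invented examples: longer words and sentences raise the grade level,
# which is the mechanism behind GenAI drafts scoring as less readable.
simple = "Your labs look fine. Call us with questions."
complex_text = "Your laboratory evaluation demonstrates unremarkable metabolic parameters."
```

Higher grade levels correspond to text requiring more years of schooling to comprehend, which is why linguistically complex GenAI drafts may burden patients with lower health or English literacy.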
Results

Of 344 evaluated survey message-response pairs (175 GenAI drafted; 169 HCP drafted), there were 157 single-reviewed, 73 double-reviewed, 11 triple-reviewed, and 2 quadruple-reviewed message-response pairs, resulting in 117 unique HCP and 126 unique GenAI message-response pairs. Branching logic occurred only during the second survey and was available for 207 unique questions (85.2%).

Survey Results
Participant evaluations were generally positive for GenAI and HCP responses (Figure 1).

Computational Linguistics Results
GenAI responses were 38% longer than HCPs' (imposing a burden on readers' time), but the difference in mean word count was not statistically significant. GenAI responses contained a significantly higher proportion of emotion words (particularly positive ones) and a greater use of affiliative language (Table 3).

Discussion
According to the PCP respondents, and consistent with prior studies,25-28 GenAI drafts outperformed HCPs' responses on communication quality. Despite poor interrater reliability, the sensitivity analysis revealed consistent patterns of findings even after incorporating random effects for reviewers. Subsequently including fixed effects for HCP and patient message subcategories revealed that physicians were responsible for HCP responses underperforming GenAI on communication quality. This may be because physicians responded to more challenging messages than their nonphysician colleagues.11 GenAI responses matched HCP responses in information quality, indicating effective use of health care-related training data20 and patient health data within the standard prompts. This deviates from Ayers et al,26 where chatbots had 3.6 times higher quality responses to public patient messages, but still supports chatbots' utility. A crucial caveat is that intentional guardrails restrict the LLM's confidence in providing medical information22 and are designed to limit hallucinations and automation bias,14,18 but may explain why PCPs found GenAI responses more often unhelpful and irrelevant.
Future implementers of GenAI in EHR in-basket messaging should direct resources toward revising prompts related to laboratory results.
The prevalence of affiliation words, positivity, and subjectivity in GenAI drafts may explain why they were perceived as more empathetic than HCPs'. Affiliation content, such as "together" and "us," implies a partnership between the HCP and patient. Although empathy is context-sensitive, responses that PCPs perceived as empathetic contained more positive language, which may convey hopefulness and potentially better outcomes.58 GenAI could thus improve virtual communications between HCPs (physicians in particular) and patients. HCPs surprisingly did not leverage knowledge of their patients to communicate more empathetically than GenAI. In fact, PCPs cited "quicker for me" as the main reason drafts were considered usable. Although not the primary audience, PCPs must still perceive GenAI drafts as high quality before utilizing them. Patients are the ultimate recipients of drafts, and future research must assess their perceptions of GenAI responses, whose linguistic complexity may be preferred (or ignored) by physicians but burden those with low health or English literacy. Research must also explore concerns about whether GenAI perpetuates bias and health inequity across patient demographic characteristics32-34,59,60 and determine whether communication gains outweigh such risks.
This study addressed GenAI implementation challenges, including benchmarking draft and prompt quality and understanding PCPs' perceptions. A critical finding of our study was the inability of PCPs to agree with each other on what makes a draft high quality, suggesting that successful utilization of drafts by PCPs requires a personalized approach. Future research should investigate the impact of prompt refinement and personalization on end users' perceptions of draft quality.
Computational linguistics may drive more intelligent prompt engineering to enhance outputs' empathy, reduce their linguistic complexity, and improve personalization.

Limitations
This study has limitations. Generalizability may be limited due to this study's single-center focus and small sample size. The evaluated GenAI responses were not used to deliver patient care, which may limit our findings' practical applicability. Low ICC suggests a need to adjust the survey questions or instructions (although variance in reviewer responses did not affect our findings) or conduct follow-up interviews to investigate reasons for disagreement. This study did not evaluate the perceptions of patients and nonphysician HCPs who participate in outpatient messaging. We acknowledge that for some HCPs who answer patient messages, particularly nonphysicians, templates are used to draft responses rather than a blank page; the presence of templates was not assessed, and future studies should treat templated HCP responses as a separate group for comparison. Furthermore, our study did not examine whether response quality varied with patient demographics.

Conclusions
In this study, PCPs found EHR-integrated GenAI responses to private patient messages similar to HCPs' in terms of information content quality, better with respect to communication style, and similar in their usability compared with starting from scratch. While poorly rated GenAI responses lacked relevance, were less helpful, or were more verbose, they outperformed HCP responses in completeness, empathy, and professionalism. GenAI drafts acceptable to HCPs may offset the increasing workload (and diminishing well-being) they face from in-basket messages from patients. Future research should focus on optimizing GenAI responses for end users' (particularly patients') perceptions of quality, quantifying efficiency gains, and mitigating biases and hallucinations.
This blinded cross-sectional quality improvement study evaluates PCP perceptions of GenAI responses to patient messages compared with HCP-generated responses. Subgroup analyses evaluated whether response quality varied with HCP type (physicians and nonphysicians) and patient message classification (laboratory results, medication refill requests, paperwork, and general medical advice; determined by the EHR's proprietary message classification LLM [Epic]). Computational linguistics analyses compare response content to elucidate potential equity concerns and why drafts were considered empathetic. As part of an operational pilot program to implement and curate GenAI in-basket drafts most acceptable for end users, this study met NYU criteria for quality improvement work and did not undergo institutional review board review. All study procedures complied with institutional ethical standards and those set by the Declaration of Helsinki and are reported using the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines for cross-sectional studies.

Figure 1. Distribution of Health Care Professional (HCP) and Generative Artificial Intelligence (GenAI) Responses to Each Main Survey Question

SUPPLEMENT 1.
eAppendix 1. Survey Design Flow Diagram
eAppendix 2. Subgroup Analyses Stratified First by HCP Type (Physician vs Nonphysician), Then by Patient Message Subcategory
eAppendix 3. Linear Mixed Models With Random and Fixed Effects to Explore the Extent to Which Reviewer Variance or That Attributed to Patient Message (General Medical Advice, Laboratory Results, Medication Refill Requests, Paperwork) and HCP (Physician, Nonphysician) Subgroups Affected Final Results for the Main 3 Survey Questions
eAppendix 4. Select Free-Text Comments From Each Question When Other Was Chosen in the Branching Logic
eAppendix 5. Exploration of Intraclass Correlation by Subgroup

SUPPLEMENT 2. Data Sharing Statement

Table 2. Basic Computational Linguistics and Lexical Complexity Metrics per Unique Response in Each Group

Table 3. Content and Sentiment Analysis of GenAI vs HCP and Empathetic vs Nonempathetic Responses

GenAI's consistency, reflected by smaller SDs for most metrics, and its use of more emotional and affiliative language suggest PCPs may utilize structured responses that fill a gap in their typical responses.