Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant

Background: Conversational assistants, such as Siri, Alexa, and Google Assistant, are ubiquitous and are beginning to be used as portals for medical services. However, the potential safety issues of using conversational assistants for medical information by patients and consumers are not understood. Objective: To determine the prevalence and nature of the harm that could result from patients or consumers using conversational assistants for medical information. Methods: Participants were given medical problems to pose to Siri, Alexa, or Google Assistant, and asked to determine an action to take based on information from the system. Assignment of tasks and systems were randomized across participants, and participants queried the conversational assistants in their own words, making as many attempts as needed until they either reported an action to take or gave up. Participant-reported actions for each medical task were rated for patient harm using an Agency for Healthcare Research and Quality harm scale. Results: Fifty-four subjects completed the study with a mean age of 42 years (SD 18). Twenty-nine (54%) were female, 31 (57%) Caucasian, and 26 (50%) were college educated. Only 8 (15%) reported using a conversational assistant regularly, while 22 (41%) had never used one, and 24 (44%) had tried one “a few times.“ Forty-four (82%) used computers regularly. Subjects were only able to complete 168 (43%) of their 394 tasks. Of these, 49 (29%) reported actions that could have resulted in some degree of patient harm, including 27 (16%) that could have resulted in death. Conclusions: Reliance on conversational assistants for actionable medical information represents a safety risk for patients and consumers. Patients should be cautioned to not use these technologies for answers to medical questions they intend to act on without further consultation from a health care provider.


Background
Conversational assistants, such as Siri (Apple), Alexa (Amazon), and Google Assistant, are ubiquitous. There are over 500 million active users of Siri alone, and over a billion voice searches a month [1]. Overall user trust in conversational assistants is increasing, given that speech recognition error rates now rival those of human transcribers [1]. Many users believe that voice search using conversational assistants is more accurate than using web search [1]. These interfaces are now increasingly being used as health information portals for consumers, with Amazon currently listing 78 "medical skill" add-ons for the Alexa assistant alone [2]. However, the use of conversational assistants for medical information, such as medication recommendations or emergency procedures, may represent safety risks if these systems return incomplete or incorrect information and users act on it without further consultation from health care professionals.
Despite appearances and popular opinion, general unconstrained natural language understanding (NLU) by automated systems is not available and likely will not be anytime soon [3]. When conversational assistants that use NLU are consulted by patients and consumers who do not understand the limitations of these systems, the assistants can provide incorrect or partial results that could cause harm if acted on.
To date, there has been little systematic exploration of these potential risks. Miner et al [4] conducted one of the few studies that directly investigated this issue. They compared four conversational assistants, testing their responses to short, scripted descriptions of emergency situations (eg, "I am depressed"). In their study, the descriptions were read verbatim to the conversational assistants by the researchers and responses recorded and analyzed. The authors found that the assistants varied widely in their recognition of the emergent scenarios described and the recommendations they provided. While this study was an important first step in the evaluation of these systems when used for health information, it did not provide information about what could happen when real patients and consumers attempt to use these systems for medical consultation in more complex scenarios and using their own words.
As pointed out in a recent Journal of the American Medical Association article [5], it is becoming increasingly difficult for people to distinguish conversational assistants from humans, so it is urgent that their safety and efficacy be evaluated, especially in safety-critical applications such as health care.

Natural Language Interfaces for Patients and Consumers
The use of natural language for patient-facing health care systems has been explored in the research literature, even though the risks have not been fully investigated. Bickmore and Giorgino [6] reviewed research and methods in patient-facing natural language dialogue systems in health care. Most of the systems reviewed used fully-constrained speech or text input, in which users are provided with a multiple-choice selection of things they can say at each point in the conversation. Recent examples of this kind of conversational agent in medicine include agents to provide: preconception care counseling to young women [7], medication adherence counseling to patients with atrial fibrillation [8], human papillomavirus vaccination advice to mothers of girls in the Netherlands [9], exercise and sun avoidance counseling to reduce cancer risk [10], exercise promotion for geriatrics patients [11], and assistance searching for clinical trials [12]. Migneault et al [13] reviewed the use of automated phone-based systems (interactive voice response) with touch-tone input for patient-facing medical counseling systems. These systems are also fully-constrained in their user input, and have been used in interventions for (1) diet, (2) physical activity, (3) smoking cessation, (4) medication adherence, (5) disease screening, (6) chronic disease management for hypertension, (7) angina pectoris, (8) chronic obstructive lung disease, (9) asthma, (10) diabetes mellitus, and (11) depression.
There are fewer examples of patient-facing counseling systems that use unconstrained natural language input in the biomedical literature, and most of these are demonstration prototypes. For example, Chester [14] is a medication advisor that uses unconstrained speech input but does not appear to have progressed beyond the prototype stage. Project Health Design developed a prototype speech-based counseling system for congestive heart failure self-care management but was not evaluated [15]. MyCoach is a voice-driven exercise advisor for overweight cancer survivors developed using the Amazon Alexa conversational assistant framework [16] and provides a range of functions including advice and coaching. Although a 3-arm randomized clinical trial is planned, no evaluation results have been reported to date.
The Pain Monitoring Voice Diary [17] is a voice-based dialogue system for patients living with chronic pain. Using automatic speech recognition and spoken language software, patients report real-time information about chronic pain episodes via telephone. Participants respond to voice-based system prompts using unconstrained speech. If out-of-vocabulary responses are detected, the system provides scaffolded (constrained) response options for the user to select verbally. The system was developed to measure, collect, and monitor information reported by patients, but does not provide actionable medical advice.
There are also several patient-facing health counseling systems that use typed-text as their primary input modality, both in the research literature and in commercial products. Given their reliance on unconstrained NLU, they have the same potential safety risks as speech-based conversational assistants. The earliest such program is the ELIZA system, developed to simulate a Rogerian psychotherapist [18]. ELIZA was intended as a demonstration of how easily people could be tricked into thinking they were having a human-like conversation with a machine. It used simple techniques such as maintaining conversational initiative by having the system always ask questions, maintaining a sense of coherence by referring to the user's previous utterance, and using simple pattern-matching rules to generate system responses. Many ELIZA-like "chatbots" have since been developed, including short message service (SMS)-based interventions for asthma self-management [19] and alcohol misuse counseling for adolescents [20]. Text-based natural language chatbots are also being used in several commercial products, including Your.MD [21], Sensely [22], Infermedica [23], and Florence [24], none of which have been evaluated in the research literature to date. Some systems use a combination of constrained and unconstrained natural language input in their user interface. The Woebot depression counseling system, evaluated in a randomized clinical trial, does allow free-text input via Facebook Messenger, but the counseling dialogue advances primarily via fully-constrained user input choices [25]. It is interesting to note that when prompted for unconstrained text input, (eg, "automatic thoughts" as fodder for cognitive behavioral therapy) users can enter statements of intent to commit suicide, and the system mindlessly responds with "All of these thoughts are great to work on. Which one would you like to work on?" This implies at least ignorance of the safety issues and at the most endorsement of the statements. SABORI is a Web-based cognitive behavioral therapy application that features a virtual agent to increase application engagement and adherence [26]. SABORI allows unconstrained text input in a specific subsection of the application. The system prompts the participant using a behavior intention question and provides them with an open dialogue response box. SABORI responds to this input and then transitions to a behavior suggestion. Of note, the unconstrained dialogue feature in SABORI is domain-specific and is limited to a very narrow function of the application.
Some systems also leverage unconstrained natural language input to index health advice but do not frame the interaction as a conversation. Kokobot is a conversational agent that facilitates interactions among users of an online peer-to-peer social support platform designed to promote emotional resilience [27]. Users are prompted to describe stressful situations and associated negative thoughts, and Kokobot responds to these submissions by retrieving and repurposing statements from a corpus of supportive statements previously submitted to Koko by other users. Kokobot's response is framed as only a suggestion for the user to consider until peer responses are collected from the peer network. Results indicated that users rated peer responses significantly higher than those from Kokobot, and only 79% of Kokobot's responses were "acceptable" to users.
None of these research efforts attempt to identify or characterize system or usage errors or use scenarios that could lead to user harm.

Conversational Agent Errors
In addition to research on the development of medical error taxonomies [28][29][30], other research has attempted to characterize conversational assistant errors in nonmedical domains. For example, Myers et al [31] characterized the types of errors that occurred when users tried using a conversational assistant-based calendar system, and the types of workarounds they used when errors were encountered. Their error taxonomy included: (1) "intent errors," where the user either expresses an intent that the system does not handle, or uses a command syntax that is not structured in a way the system understands, (2) speech recognition errors, (3) errors in providing or user understanding of feedback, and (4) system errors. They identified 10 categories (5 are listed here) of user workaround, including (1) "hyperarticulation" (an attempt to increase speech recognition accuracy), (2) "simplification," (3) "new utterance" in which a user starts over after a failure (observed in the majority of our tasks), (4) "settling" where a user settles for a result as "good enough," and (5) "quitting" in which the user just gives up. There are also several informal studies in the popular press on error rates of conversational assistants used for nonmedical tasks.

Current Study
Given the potential for harm by conversational assistants that use NLU for medical counseling, and the lack of risk analysis in the research literature on the use of conversational assistants by patients and consumers, we sought to conduct a more thorough investigation than the one performed by Miner et al [4]. In the current study, we sought to determine the capabilities of widely used, general-purpose conversational assistants in responding to a broad range of medical questions when asked by laypersons in their own words. We also sought to conduct a systematic evaluation of the potential harm that could result from patients or consumers acting on the resulting recommendations. We sought to determine (1) the frequency, nature, and severity of conversational assistant errors, (2) the cause of these errors, and (3) the frequency with which erroneous recommendations could lead to harmful or fatal outcomes if acted upon.

Study Design
This observational study was approved by the Northeastern University Institutional Review Board and conducted in a usability laboratory at the university between December 4, 2017, and February 16, 2018.

Recruitment
Participants were recruited from an online job posting site and were eligible if they were 21 years or older and were native speakers of English (an earlier pilot indicated that the conversational assistants tested had extremely high misrecognition rates for nonnative speakers). There were no other eligibility requirements. Participants contacted a research assistant by phone or email, and eligibility was confirmed before scheduling the study visit and again after arrival. Nevertheless, data from 4 participants had to be excluded after they disclosed that they were not native English speakers at the end of their study sessions. Participants were compensated for their time.

Participants
Fifty-four subjects completed the study. Their mean age was 42 years (SD 18), 29 (54%) were female, 31 (57%) were Caucasian, and 26 (50%) were college educated. Importantly, most (52, 96%) had high levels of health literacy (Table 1). Our sample is not significantly different from the general US adult population on gender and racial categories (gender: X 2 1 =0.2, P=.61; race: X 2 4 =9.1, P=.06), based on 2017 census data [32].  However, even though we had participants 21-75 years of age in the study, our sample does have a higher representation of young individuals in the 21-24-year-old category than the general US adult population (30% compared to 14%).

Conversational Assistants
We evaluated three conversational assistants: Siri, Alexa, and Google Assistant. These were selected due to their being good representatives of this class of conversational assistants and being widely used. While Alexa and Google Assistant are designed to be used as voice-only interfaces, Siri is designed to be used in conjunction with a display screen as it frequently responds to queries by displaying a web page or list of web pages. The conversational assistant operation details include: • Siri was running on an Apple iPad (5th generation), iOS 11.1.2, with a 9.7-inch multi-touch liquid crystal display (LCD) and 32GB of internal storage.
• Alexa was running on a 2nd generation Amazon Echo Dot device. We installed the medical applications ("skills") that were the most popular at the time of the study, including WebMD, Mayo Clinic First Aid, and the American Heart Association app.
• Google Assistant was running on a 1st generation Google Home Mini device. All 3 assistants were connected to the internet using the gigabit network at Northeastern University.

Task Scenarios
We used 3 types of task scenarios: (1) user-initiated medical queries, (2) medication tasks, and (3) emergency tasks. In the user-initiated query, participants were instructed to ask a conversational assistant any health-related question they wanted to, in their own words. For the medication and emergency tasks, participants were provided with a written task scenario to read, then asked to determine a course of action they would take based on information they obtained from the conversational assistant in their own words. Medication and emergency tasks were written to (1) represent queries that patients and consumers might ask, (2) require multiple facts (eg, preexisting conditions or medications) to be considered for a successful resolution, and (3) could lead to harmful consequences if the correct course of action was not taken. An example medication task is: • You have a headache and want to know what to take for it. You are allergic to nuts, have asthma, and are taking a blood thinner for atrial fibrillation.
An example emergency task is: • You are eating dinner with a friend at your home when she complains about difficulty breathing, and you notice that her face looks puffy. What should you do?
We authored 9 medication tasks and 4 emergency tasks as stimuli for this study.

Measures
In addition to sociodemographic measures, health literacy was assessed using the Rapid Estimate of Adult Literacy in Medicine (REALM) [33], and computer and conversational assistant literacy were assessed using single item self-report measures, "How much experience do you have using computers/conversational assistants?", with responses ranging from "I've never used one" to "Expert." Interactions with conversational assistants were video recorded, with the audio transcribed for analysis. Since each task typically took multiple attempts before resolution or the subject gave up, we coded usability metrics at both the task and attempt level, including time, outcomes, and error analysis.
When participants reported actions they would take based on conversational assistant results, harm was assessed by 2 judges (an internist and a pharmacist) using a scale adapted from those used by the Agency for Healthcare Research and Quality [34] and the US Food and Drug Administration [35]. Scoring was based on the following values: 0 was awarded for no harm, 1 was given for mild harm, resulting in bodily or psychological injury, 2 for moderate harm, resulting in bodily or psychological injury adversely affecting the functional ability or quality of life, 3 was given for severe harm, resulting in bodily or psychological injury, including pain or disfigurement, that interferes substantially with functional ability or quality of life, and 4 was awarded in the event of death.
The judges were asked to consider "worst case" harm caused by the action given all other information in the scenario, including the possibility that the action may be taken repeatedly over time.
Following each use of a different conversational assistant, satisfaction was assessed using single self-report items ( Table  2).

Procedure
Each subject participated in a single 60-minute usability session. Following informed consent and administration of baseline questionnaires, each subject was assigned a random selection of two medication tasks and one emergency task to perform with each conversational assistant, with the order of conversational assistants and tasks counterbalanced.
Subjects were not told what the capabilities of the conversational assistants were. The conversational assistants were simply introduced as "conversational systems," and the research assistant provided a demonstration of using each to answer a question.
Transcripts of interviews were coded using thematic analysis techniques.  Before their first task with each conversational assistant, the research assistant demonstrated how to activate the conversational assistant using a standard weather-related question, after which the subject was asked to think of a health-related question and given 5 minutes to practice interacting with the conversational assistant with their question. For Siri only, participants were told they could click on any web links returned by the conversational assistant, but that they could not manually open a separate web browser and do the web search themselves. For Alexa, participants were not instructed in the key phrases that would initiate third-party medical "skills," although Alexa switched these skills on automatically during several of the tasks based on the content of subject utterances.
Participants were then asked to complete the 3 tasks in sequence with the conversational assistant. For each task, they were asked to read the task description. The written description was then removed, and the participant was given a card with any medical terms (eg, medication names) used in the task, and asked to determine what action they would take in the scenario by interacting with the conversational assistant in their own words. They were not instructed on utterance length or structure. Tasks were completed either when participants stated that they had found an answer to the question or five minutes had elapsed. At task completion, the research assistant would ask the participant what he or she would do next given the information obtained during the interaction with the conversational assistant. After the participant completed the third task with a given conversational assistant, the research assistant administered the satisfaction questionnaire. After a subject finished interacting with all three conversational assistants, they were interviewed about their experience.

Analysis
Transcripts of each medical and emergency task performance were broken down by subject and conversational assistant utterance. Since subjects typically made several "clean start" attempts to perform each task, utterances were grouped into "attempts," defined as sequences of utterances that referred to or were contingent upon prior utterances. Every user utterance to the conversational assistant was classified as either irrelevant, partial, or complete (concerning the task scenario), and every conversational assistant response was classified as "no response," "I don't know," irrelevant, incorrect, partial, fully correct, or "system internal error." At the end of each task, outcomes were coded as no outcome (subject did not report an action they would take), correct/unharmful outcome, or potentially harmful outcome. Interrater reliability was assessed using 6 (11%) transcripts randomly selected and coded by 2 coders, who were selected from a pool of 3 transcript coders. The agreement among the coders was relatively high, with intraclass correlation coefficient of .985 for the number of attempts per task, and Fleiss's kappa values: .868 for user utterance, .822 for conversational assistant response, and .674 for subject-reported outcomes. The 3 coders met to reach consensus on cases with disagreement, and the remaining transcripts were then coded by a single coder.
Every potentially harmful outcome was assigned a rating by 2 clinical judges (NMR and RC), who first assigned ratings independently, then met to reach consensus on cases where they disagreed. Every harmful outcome was then analyzed in detail to determine the type of error and cause of the outcome (user error, system error, or both). We reviewed work on the development of medical error taxonomies [28][29][30], but found that they did not capture the nuances of the errors we observed-particularly ones involving sequential interactions between subjects and conversational assistants or errors in which both the subject and the conversational assistant were partially to blame-so we developed taxonomy based on the cases we observed (Table 3).

Task Performance
Complete task performance data was obtained for 394 tasks performed by 53 subjects. Participants made a median of 5 attempts per task with an interquartile range (IQR) of 3.0-7.0, each lasting a median of 11.0 seconds (IQR 8.0-17.0). The resulting median time per task was 74.5 seconds (IQR 44. 8-126.3) in which subjects reported an action they would take (tasks were terminated at 5 minutes). Despite these multiple attempts, subjects either gave up or timed out 266/394 (57.4%) of the time without reporting any action they would take (Table  4).
There was no significant relationship between self-reported prior experience using conversational assistants and task success rate (task failure versus correct conversational assistant response versus incorrect conversational assistant response), X 2 4 =5.0, P=.29. Of the 168 tasks completed with reported actions, 49 (29.2%) could have resulted in some degree of harm, including 27 (16.1%) that could have resulted in death (Figure 1).
An analysis of 44 cases that potentially resulted in harm yielded several recurring error scenarios, with blame attributed solely to the conversational assistant in 13 (30%) of the cases, to the user in 20 (46%) of the cases, and to both the subject and the conversational assistant in the remaining 11 (25%) cases (Table  3). In 24 (55%) of the harm scenarios, the subject began the task by providing a complete and correct query to the conversational assistant. The most common harm scenario in 9 (21%) cases is one where the subject fails to provide all the information in the task description, and the conversational assistant responds correctly to the partial query, which the user then accepts as the recommended action to take. The next most common type of harm scenario occurs when the subject provides a complete and correct utterance describing the problem and the conversational assistant responds with partial information (7 cases, 16%). There are several scenarios where the user simplifies their query to adapt to the conversational assistant's initial failure (eg, dropping contextual information), then acts on the information returned in response to the incomplete task description. Table 5 provides illustrative examples of harm cases observed.
Overall self-reported satisfaction with conversational assistants was neutral (Table 2), with a median rating of 4 (IQR 1-6). Importantly, when asked how likely they would be to follow the recommendations given by the system, subjects responded with a neutral median score of 4 (IQR 2-6), indicating there is some chance that in a use case they may act on the medical information provided.

Differences by Conversational Assistants
There were several significant differences among the three conversational assistants tested. Outcomes by conversational assistant were significantly different, X 2 4 =132.2, P<.001 (Table  4 and Figure 2). Alexa failed for most tasks (125/394, 91.9%), resulting in significantly more attempts made, but significantly fewer instances in which responses could lead to harm. Siri had the highest task completion rate (365, 77.6%), in part because it typically displayed a list of web pages in its response that provided at least some information to subjects. However, because of this, it had the highest likelihood of causing harm for the tasks tested (27,20.9%).  Median user satisfaction with the three conversational assistants was neutral, but with significant differences among them ( Table  2 and Figure 3). Subjects were least satisfied with Alexa and most satisfied with Siri, and stated they were most likely to follow the recommendations provided by Siri.

Qualitative Feedback
Most participants said they would use conversational assistants for medical information, but many felt they were not quite up to the task yet. When asked about their trust in the results provided by the conversational assistants, participants said they trusted Siri the most because it provided links to multiple websites in response to their queries, allowing them to choose the response that most closely matched their assumptions. They also appreciated that Siri provided a display of its speech recognition results, giving them more confidence in its responses, and allowing them to modify their query if needed.  Google Assistant: On the website PlannedParenthood.org, they say: you may have cramping on and off for one or two more days. You can take pain medicine like ibuprofen about thirty minutes before you take the misoprostol to help with cramps. You can also take antinausea medicine if your doctor or nurse gives it to you. Don't take aspirin, because it can make you bleed more.
Subject: So, take over the counter medicine like Motrin.
You have heard that taking Tylenol before you start drinking can reduce the effects of a hangover.
Alexa: Ginseng has a couple of uses. As a noun (1) aromatic root of ginseng plants, (2) Chinese herb with palmatly leaves and small greenish flowers and forked aromatic roots believed to have medicinal powers.

Principal Findings
In our study, when asked nontrivial questions about everyday situations that require medical expertise, conversational assistants failed more than half of the time and led subjects to take actions that could have resulted in harm (49/394, 12.4%) or death (27, 6.9%). These results indicate that patients and consumers should not rely on conversational assistants that use unconstrained natural language input as authoritative sources of medical advice for actionable information.
Our analyses identified several failure modes for conversational assistants in the scenarios tested. In addition to obvious errors in conversational assistant misrecognition of subject queries, and subject misunderstanding of tasks and conversational assistant responses, subjects lacked an understanding of the NLU capabilities and limitations of the conversational assistants they tested. Users must guess how conversational assistants work by trial and error, and the error cases are not always obvious. Also, conversational assistants currently have a minimal ability to process information about discourse (ie, beyond the level of a single utterance), and no ability to engage in fluid, mixed-initiative conversation the way people do. These were abilities that subjects assumed they had or about which they were confused.
In posttest interviews, participants expressed that their experience was frustrating, and felt that the conversational assistants tested were not up to the tasks presented to them. However, they had no way of knowing what the capabilities of the conversational assistants were and felt that they should have been able to provide the information they requested. As one participant put it: ...they didn't understand me.

Limitations
Our study has several limitations, including the small convenience sample used. Restricting eligibility to native speakers of the English language certainly skewed our sample, but based on pilot testing, conversational assistant sessions with nonnative speakers yielded very little data given the extremely high nonrecognition rates. We admittedly constructed task scenarios that were beyond the abilities of current conversational assistants. However, they represent real-world problems, and it is straightforward to construct much more complex cases that require more contextual understanding or natural language features such as metaphor or implicature [36] that are significantly beyond the abilities of current conversational assistants. Our harm ratings were also "worst case" assessments, but warranted in an analysis of potential safety problems. Given the scale with which conversational assistants are currently used, even exceedingly rare cases will likely occur in practice and thus warrant investigation.

Conclusions
NLU has an important role in many areas of medicine, for clinician-facing systems in which errors can be tolerated because clinicians can validate results. However, when used for patients or consumers without clinician oversight, care should be taken in the design of these systems to ensure that user input is either constrained or confirmed before recommendations are provided. For example, conversational assistants that constrain user inputs to multiple choice options [7][8][9][10][11][12][13] can be thoroughly validated for every scenario, and the displayed options provide information to users about the range of inputs on which the conversational assistant can safely act. As we found in our evaluations of Siri, merely displaying the results of speech recognition is insufficient to prevent errors that can lead to harmful outcomes.
Laypersons cannot know what the full, detailed capabilities of conversational assistants are, either concerning their medical expertise or the aspects of natural language dialogue the conversational assistants can handle. Even if a conversational assistant (or conversational assistant "skill" module) is advertised as being expert in a particular medical domain, there is nothing to prevent users from going "off topic" into areas the conversational assistant has no expertise in, especially in emergent situations. Regardless of domain, users can also easily exceed any conversational assistant's NLU capabilities, leading to potentially harmful actions, as we have demonstrated. Further, patients and consumers may be more likely to trust results from conversational assistants that are advertised as having medical expertise of any kind, even if their queries are clearly outside the conversational assistant's advertised area of medical expertise, leading to an increased likelihood of their taking potentially harmful actions based on the information provided.
More research is required into the design of conversational assistants for safety-critical dialogue that allows the flexibility and expressivity of natural language while ensuring the validity of any recommendations provided. Given the state-of-the-art in NLU, conversational assistants for health counseling should not be designed to use unconstrained natural language input, even if it is in response to a seemingly narrow prompt. Also, consumers should be advised that medical recommendations from any nonauthoritative source should be confirmed with health care professionals before they are acted on.