Evaluating Smart Assistant Responses for Accuracy and Misinformation Regarding Human Papillomavirus Vaccination: Content Analysis Study

Background Almost half (46%) of Americans have used a smart assistant of some kind (eg, Apple Siri), and 25% have used a stand-alone smart assistant (eg, Amazon Echo). This positions smart assistants as potentially useful modalities for retrieving health-related information; however, the accuracy of smart assistant responses lacks rigorous evaluation. Objective This study aimed to evaluate the levels of accuracy, misinformation, and sentiment in smart assistant responses to human papillomavirus (HPV) vaccination–related questions. Methods We systematically examined responses to questions about the HPV vaccine from the following four most popular smart assistants: Apple Siri, Google Assistant, Amazon Alexa, and Microsoft Cortana. One team member posed 10 questions to each smart assistant and recorded all queries and responses. Two raters independently coded all responses (κ=0.85). We then assessed differences among the smart assistants in terms of response accuracy, presence of misinformation, and sentiment regarding the HPV vaccine. Results A total of 103 responses were obtained from the 10 questions posed across the smart assistants. Google Assistant data were excluded owing to nonresponse. Over half (n=63, 61%) of the responses of the remaining three smart assistants were accurate. We found statistically significant differences across the smart assistants (N=103, χ22=7.807, P=.02), with Cortana yielding the greatest proportion of misinformation. Siri yielded the greatest proportion of accurate responses (n=26, 72%), whereas Cortana yielded the lowest proportion of accurate responses (n=33, 54%). Most response sentiments across smart assistants were positive (n=65, 64%) or neutral (n=18, 18%), but Cortana’s responses yielded the largest proportion of negative sentiment (n=7, 12%). Conclusions Smart assistants appear to be average-quality sources for HPV vaccination information, with Alexa responding most reliably. Cortana returned the largest proportion of inaccurate responses, the most misinformation, and the greatest proportion of results with negative sentiments. More collaboration between technology companies and public health entities is necessary to improve the retrieval of accurate health information via smart assistants.


Introduction
Background Voice assistants, a form of chatbot or conversational agent often referred to colloquially as "smart assistants," are devices that respond to human voices and can be commanded to do a variety of tasks [1]. Smart assistants have existed in their most contemporary form since 2010, with the introduction of Siri, and they are used for numerous tasks such as automation (ie, integration with climate control devices and entertainment devices), retrieval of information about certain topics, and shopping [1]. Smart assistants have been integrated into smartphones, laptops, speakers, and other devices, creating a large network of consumers who frequently utilize smart assistants to acquire information on a range of topics [1].

Prevalence of Smart Assistant Use
According to the Pew Research Center, out of a total sample of 2045 Americans, nearly half (46%) reported using smart assistants in 2017 [2]. Of 4272 Americans surveyed in 2019, one-quarter (25%) had a stand-alone smart assistant device (eg, Amazon Echo, Google Home) in their homes [3]. Households that reported having a stand-alone smart assistant device (n=1067) also reported higher income (34% earned US $75,000 or more per year; 15% earned below US $30,000) [3]. The same report indicated that younger Americans frequently use stand-alone smart assistants, with 32% of users aged between 18 and 29 years and 28% of users aged between 30 and 49 years [3]. This difference may be due to varied ways of assessing use (eg, ever use and daily use). In contrast, smartphone smart assistants are more frequently adopted by those aged between 18 and 29 years (81%), whereas those aged between 45 and 60 years report the most daily active use of smartphone smart assistants [2]. These usage trends depict nuanced usage patterns wherein younger users are the most common adopters of smartphone smart assistants and older users, once they begin using smartphone smart assistants, use them more frequently than do other age groups. These varying adoption and usage behaviors provide multiple avenues for targeting different age groups through smart assistants.

Uses of Smart Assistants
More than half of Americans report that a major reason for using smart assistants is the ability to interact with their devices without using their hands [2]. One-quarter of Americans say that they use smart assistants to remotely control other devices such as heating systems, door locks, and lights [2]. Given that approximately 35% of Americans report searching online (ie, using Google or another search engine) to self-diagnose a medical condition, it seems logical to assume that individuals may turn to smart assistants for this information as well [4]. Unfortunately, smart assistants are relatively new technologies, and their responses have not been rigorously evaluated in many contexts.

Previous Assessments of Smart Assistants
In general, smart assistants have been evaluated for their response accuracy [5] and their general usability [6]. While considerable research has been conducted assessing digital health approaches, such as text message-based approaches and mobile apps for a range of health behaviors [7][8][9], few have included smart assistant responses to health-related queries. For instance, a pilot study comparing two smart assistants (Google Assistant and Apple Siri) to a standard Google Search on the topic of smoking cessation resources found that Google Assistant provided a greater number of evidence-based responses [10]. Other studies have specifically examined consumer experiences with the natural language processing of smart assistants [5].

Human Papillomavirus Prevention
To effectively assess smart assistant responses for accuracy and misinformation, human papillomavirus (HPV) has been identified as a controversial content area with a substantial evidence base. The evidence base of HPV is scientifically valid but hotly contested. HPV is the most common sexually transmitted infection and is a known cause of cervical cancer, as well as several other types of cancers [11]. While most sexually active people will contract HPV at some point in their lives, how the infection resolves varies from person to person, and a federally approved vaccine has been shown to be effective at reducing both the incidence of HPV transmission and the incidences of genital and anogenital HPV infection and cervical lesions [12]. Despite studies affirming the positive effects of the HPV vaccine while debunking inaccurate claims, there continues to be large-scale misinformation efforts (mostly conducted online through social media platforms) surrounding this issue, which are driven in part by the antivaccination movement [13][14][15][16][17][18][19][20]. These issues of misinformation have contributed to mistrust of medical professionals [21][22][23] and misunderstanding of diseases and their risks [24,25]. In January 2020, the World Health Organization (WHO) released a list of urgent global health challenges for the new decade, including stopping vaccine-preventable diseases and earning the public's trust [26]. These two challenges are closely intertwined, as trust helps to shape whether individuals rely on provider recommendations (eg, whether parents will vaccinate their children) [22,27] and how misinformation disseminated online and across social media platforms influences vaccine refusal or vaccine delay [28]. In fact, the WHO has stated that vaccine hesitancy is one of the top 10 threats to global health [29].
In this study, we attempted to answer the following research question: do responses to smart assistant queries vary between different smart assistants, with regard to accuracy, misinformation, and sentiment toward HPV vaccination?

Query Development
In order to effectively assess smart assistants' responses for accuracy and misinformation, we utilized questions from the chat-text service of Planned Parenthood (personal communication by Nicole Levitz; March 18, 2019). We chose to focus on questions around the HPV vaccine owing to the previously identified issues of accuracy and misinformation surrounding this topic [30,31]. We chose variations of 10 evidence-based questions from the Planned Parenthood system, allowing us to better evaluate responses to those questions for accuracy and misinformation. The questions are listed in Textbox 1.

Search Process
One member of the research team queried each of the four most popular smart assistants (Apple Siri, Google Assistant, Amazon Alexa, and Microsoft Cortana), which were identified in the 2019 Voice Report by Microsoft [32]. In the event that the smart assistant provided a nonresponse to a query (eg, "I don't know how to answer that"), the team member queried the smart assistant a maximum of three times to ensure that it was not an errored response due to misunderstanding the query, background noise, or some other unrelated reason.
Smart assistants provided varying numbers of results on their respective first pages for each query. Alexa and Google Assistant provided only one oral result for each query, whereas Siri provided five text results for each query. Cortana provided between three and nine text or video results for each query. Multimedia Appendix 1 displays the number of results each smart assistant provided per query. The team member, responsible for conducting the searches, recorded the first page of the results for data extraction based on previous studies indicating that users more frequently click on the top 10 results, which tend to be concentrated on the first page of most search results [33]. There is also evidence suggesting that people do not necessarily only click the first result on a page. In a 2020 study of search engine optimization, 16% of respondents in the study reported clicking on only the first result compared with 17% and 14% of respondents who reported clicking on three and five results, respectively [34]. This finding suggests that the remaining results on the first page should be extracted to best replicate actual human search behavior.

Recording Process
The team member, who posed questions to the smart assistants, used video and audio recording software to record both the queries and the smart assistant responses. We used these recordings in the data extraction and coding process. Figure 1 depicts a recording of a query posed to Cortana.

Data Extraction and Coding Process
We developed a web-based (Qualtrics) survey to aid in data extraction and coding of the smart assistant responses. Survey items were designed to extract relevant data from the smart assistant responses, including the types of responses (eg, video, web page, blog, and journal article), sources of the responses (eg, business, doctors, and health information provider), sentiments of the responses toward the HPV vaccine (eg, positive, neutral, and negative), accuracy of the responses, any incidence of misinformation in the responses, and topics discussed in the responses (eg, cancer, sexual behavior, and conspiracy theory).
We defined accuracy as a response that reflected the existing evidence. We coded accuracy using a dichotomy approach (0 for not accurate; 1 for accurate), which we applied specifically to the query (ie, accuracy of unrelated tangential content was not considered). If the response did not answer the query with the correct reply or positioned the correct reply as dubious or incorrect, we considered it "not accurate." We defined misinformation as either deliberate or accidental promotion of previously disproved or unproven beliefs, attitudes, and behaviors. We coded misinformation using a dichotomy approach (0 if it did not provide misinformation; 1 if it did provide misinformation), which we applied to the entirety of the response (not just the answer to the question posed). If any misinformation was found in the smart assistant's response, we coded it as having provided misinformation. Table 1 depicts examples of accuracy and misinformation in smart assistant responses to several queries. Since accuracy and misinformation were coded based on different aspects of the smart assistant responses, there were some cases with both an accurate response to the query and some misinformation in the result. We coded sentiment toward vaccines as one of the following four potential categories: negative (mostly negative statements), neutral (neither positive nor negative statements), positive (mostly positive statements), and ambiguous (both negative and positive statements). We applied sentiment coding to the entirety of the response.
Two independent team members extracted and coded the data with an almost perfect agreement (κ=0.85) [35]. We resolved discrepancies through discussion.
In the trials that led to the approval of Gardasil and Cervarix, these vaccines were found to provide nearly 100% protection against persistent cervical infections with HPV types 16 and 18 and the cervical cell changes that these persistent infections can cause." Does the HPV a vaccine work?
"The scientific literature is now starting to fill up with case reports and studies and articles that irrefutably show that there is a connection between this vaccine (and it's an ugly vaccine) and neurological damage." "There is no link between vaccines and autism." Does the HPV vaccine cause autism?
"The Gardasil HPV vaccine has been proved to have caused the deaths of 32 women." "Gardasil works by preventing the infection of the types of HPV that can lead to cervical cancer..." Does Gardasil work? N/A "There has never been a study that has shown that vaccines cause autism." Does Gardasil cause autism?
"Aluminum in the vaccines is toxic enough to be harmful." "Findings from many vaccine safety monitoring systems and more than 160 studies have shown that HPV vaccines have a favorable safety profile-the body of scientific evidence overwhelmingly supports their safety." Is the HPV vaccine dangerous?
"The Gardasil HPV vaccine has been proved to have caused the deaths of 32 women." "Although this information is accurate in a strictly literal sense, it is a misleading presentation of raw data that does not in itself establish a causal connection between Gardasil and the posited medical dangers." Is Gardasil dangerous? N/A "All people ages 9 to 45 can get the HPV vaccine to protect against genital warts and/or different types of HPV that can cause cancer."

Analyses
The sample consisted of 128 total data points across all four smart assistants. We excluded any nonresponse data points (eg, "I don't know how to answer that") and responses that were categorized as other (eg, responded with completely irrelevant content to the query) from the final analyses. Google Assistant provided nonresponses to every query and was removed, thus reducing our final analysis sample to 103 responses across three smart assistants. These nonresponses and other responses were removed because they did not in any way address the query posed. Multimedia Appendix 1 displays the frequencies of responses and nonresponses for each smart assistant. Descriptive frequencies and chi-square difference tests were used to examine the levels of accuracy and misinformation among the smart assistants. Table 2    In general, smart assistant responses contained positive sentiment toward vaccines, with 65 out of 103 (63.1%) responses containing primarily positive sentiment. The second most common sentiment was "neutral," which was found in 18 out of 103 (17.5%) responses. The least common sentiment was "negative," which was found in only 7 out of 103 (6.8%) responses, and all were provided by Cortana.

Principal Results
This study sought to determine whether responses to queries vary among smart assistants in terms of accuracy, misinformation, and sentiment related to HPV vaccination. Our results indicate that smart assistants responded differently when asked about this topic. Specifically, smart assistants showed variations in the level of misinformation provided in responses, as well as the provision of evidence to support claims made in their responses. Smart assistants did not differ statistically in terms of the accuracy of their responses to queries involving HPV vaccination.
To our knowledge, this study is the first to report on smart assistants and their responses to HPV vaccine-related queries. Previous studies of HPV vaccination behaviors did not focus on smart assistants, and they indicated that antivaccination messages may influence one's decisions to vaccinate themselves or their children [15]. However, given the growing number of persons and households adopting and utilizing smart assistants and the fact that smart assistants may be able to provide HPV vaccine information, it is necessary to examine whether responses delivered through smart assistants are disseminating these same antivaccination messages. This study is the first step in establishing a foundation for the types of vaccine sentiments that are present in content disseminated by smart assistants.
Importantly, our study revealed that, in general, smart assistants largely provide accurate and positive information regarding HPV vaccination (just under two-thirds of all smart assistant responses were accurate and positive), with no relevant differences across devices. We should reiterate here, however, that Google Assistant provided nonresponses to every query and, accordingly, was removed from the study. While accurate information on something as beneficial as HPV vaccination is necessary to mitigate HPV transmission and cancer incidence, just over one-third of responses contained inaccurate answers, suggesting that work is needed in this area to improve the provision of health information via smart assistants. On the other hand, misinformation was more of a concern across devices, as Cortana yielded the greatest number of responses containing misinformation (one-quarter of Cortana's responses contained misinformation).

Comparison With Prior Work
Previous studies of smart assistant responses have focused on whether the devices responded correctly or whether they understood the query [5,6]. Only a handful of studies examined smart assistant responses in the context of a health behavior [10]. This study further assessed smart assistant responses to determine whether they provide accurate evidence to specific health behavior questions. Evidence is critical in combatting misinformation, at least on social media platforms. For example, in an experimental study that exposed Facebook users to simulated misinformation and different correction mechanisms about the Zika virus, the authors found that correction can work when it provides supporting evidence or appropriate sources to accompany refutation (eg, Centers for Disease Control and Prevention) [36]. The authors reported that this finding was maintained even in those who were rated higher in conspiracy beliefs.

Limitations
This study has limitations that should be considered when evaluating its results. The devices used their specific locations when responding to queries, which may have produced less generalizable responses. Some of the smart assistants employed in the study were previously used devices and may have been influenced by prior searches of the former user. To the best of our knowledge, no private mode exists in these smart assistants, which would prevent the queries and responses from influencing future searches. The sample size was limited for several reasons, including nonresponses and other unrelated responses to queries. Specifically, Google Assistant provided no response data, which may have influenced the results. Any responses that were entirely unrelated to the queries should be explored in future studies, since this may be an indication of a larger issue with smart assistants not comprehending the queries. Based on our knowledge about natural language processing in these smart assistants, it is suggested that syntax and dictation are some of the many influencers of smart assistant responses [37], and the queries posed to each device may not have been appropriately worded to elicit the most effective responses. For example, our use of a single investigator to ask the questions could be viewed as a limitation (ie, we do not know how the smart assistants would have responded to an investigator with a different gender, tone, or accent). The investigator only queried a single example of a smart assistant (ie, only one device was used to query Alexa), which could be a limitation, as smart assistants may respond differently depending on whether a stand-alone smart assistant or a smartphone-based smart assistant is used. The underlying search engine behind each smart assistant determines the responses to queries, which could have limited the information provided by smart assistants that are not connected to large search engines such as Google. The varying number of results provided in the smart assistant responses to queries could have impacted our ability to compare across the devices; however, since we found no relevant differences between the devices with regard to accuracy or sentiment, we do not believe that more results from one device impacted our findings or conclusions.
Despite these limitations, there are several strengths that provide further evidence of the study's validity and importance. First, we utilized questions submitted to a Planned Parenthood chat service, which represent real-world issues on which consumers have previously sought additional information. Second, the questions used as queries were chosen because responses to these queries could be scored as either correct or incorrect, limiting "gray areas" in assessing accuracy of the smart assistant responses. Third, the queries were conducted by a single member of the research team to maintain consistency in language processing factors (eg, tone and syntax), and the search and coding processes were systematic to maintain consistency as well. Overall, this research furthers our understanding of how an emerging technology disseminates health information and whether such information can be classified as accurate and evidence-based. This study provides a basic framework for future evaluation of these smart assistants on which more advanced devices may be based.
Our findings paint a picture of smart assistants as newly emerged potential health promotion tools that are yet to be rigorously evaluated in this area. Smart assistants have only recently been introduced as potential health promotion tools (most notably from an executive perspective [38]) despite the growing proportion of households and individuals that use them [2,3]. The untapped potential of these devices for evidence-based information dissemination to consumers should be further explored. The potential of the devices may also be determined by their manufacturers who have, in some cases, provided platforms on which specialized topical information can be consolidated and further explored. For example, Alexa has an option for developers to create an Alexa Skill, which is essentially an application for Alexa that provides predefined information when queried specifically for that information. For example, an Alexa Skill that specifically searches in predefined evidence-based sources when queried for information on HPV vaccination could be developed. The downside to this potential approach for improving public health is that, as our results show, there is risk of disseminating misinformation through smart assistant responses, potentially reducing the positive impacts of a health promotion intervention. More needs to be done to better understand the susceptibility of these devices and their respective skills to outside influences.

Conclusions
Our results suggest that not all smart assistants are created equal with regard to utility, at least when it comes to provided evidence and misinformation in their responses. These findings should spark further research into how the proprietors of smart assistants procure the information that is disseminated to consumers of their products. Specifically, we suggest that manufacturers of these and other smart assistants collaborate with researchers to further evaluate the accuracy of smart assistants as public health tools and determine together how to disseminate information and what fact-checking assessments should be used for such information. With the high rate of HPV transmission globally and in the United States, smart assistants and their manufacturers are well positioned to deliver evidence-based health information to consumers. However, such a practice necessitates strong communication between technology companies, who may not be as focused on the accuracy or source of their most heavily promoted content, and public health entities. The focus on collaboration to address the issues of information accuracy and misinformation is paramount if we are to adequately respond to the WHO's list of urgent global health challenges for the new decade, namely stopping vaccine-preventable diseases.