Finding the Patient’s Voice Using Big Data: Analysis of Users’ Health-Related Concerns in the ChaCha Question-and-Answer Service (2009–2012)

Background: The development of effective health care and public health interventions requires a comprehensive understanding of the perceptions, concerns, and stated needs of health care consumers and the public at large. Big datasets from social media and question-and-answer services provide insight into the public’s health concerns and priorities without the financial, temporal, and spatial encumbrances of more traditional community-engagement methods and may prove a useful starting point for public-engagement health research (infodemiology).
Objective: The objective of our study was to describe user characteristics and health-related queries of the ChaCha question-and-answer platform, and discuss how these data may be used to better understand the perceptions, concerns, and stated needs of health care consumers and the public at large.
Methods: We conducted a retrospective automated textual analysis of anonymous user-generated queries submitted to ChaCha between January 2009 and November 2012. A total of 2.004 billion queries were read, of which 3.50% (70,083,796/2,004,243,249) were missing 1 or more data fields, leaving 1.934 billion complete lines of data for these analyses.
Results: Males and females submitted roughly equal numbers of health queries, but content differed by sex. Questions from females predominantly focused on pregnancy, menstruation, and vaginal health. Questions from males predominantly focused on body image, drug use, and sexuality. Adolescents aged 12–19 years submitted more queries than any other age group. Their queries were largely centered on sexual and reproductive health, and pregnancy in particular.
Conclusions: The private nature of the ChaCha service provided a perfect environment for maximum frankness among users, especially among adolescents posing sensitive health questions. Adolescents’ sexual health queries reveal knowledge gaps with ...
Priest et al. J Med Internet Res 2016;18(3):e44. http://www.jmir.org/2016/3/e44/


Introduction
The development of effective health care and public health interventions requires a comprehensive understanding of the perceptions, concerns, and stated needs of health care consumers and the public at large [1,2]. Clinical and behavioral interventions are most successful when aimed at improving outcomes that are important and relevant to patients. Interventions targeted at these patient-centered outcomes are most effectively developed when patients are engaged in the research process, particularly regarding the identification of salient problems. Funders of health care research increasingly expect proposals to include substantial evidence of attention to patient-centered outcomes through public engagement in the research process, including the process of developing and framing research questions [1-3].
There are many successful models of engaging the public in research, ranging from long-term engagement models such as community-based participatory and action research to the use of focus groups, interviews, and specific designs to elicit stakeholder feedback [4,5]. However, there are substantial challenges associated with these approaches. First, these approaches require a significant investment of time and resources, valued commodities that may not be available to researchers and their teams, nor to communities and their members [6]. Second, in traditional research, geographic constraints often limit the number and diversity of individuals who can be included in a single project. Third, most of these methods begin with an a priori research question relevant to the community but often generated by the researcher, which restricts public involvement in the framing of research priorities [7]. In order to overcome the aforementioned limitations and develop relevant and effective patient-centered health interventions, new methods of patient and public engagement are needed.
The Internet has changed the ways in which people seek out and share health-related information [8,9]. Research shows that 35% of Americans report having used the Internet, including social media platforms, to determine what medical condition they or someone they know might have [9,10]. Advances in mobile phone technology make searching the Internet for health-related issues even easier. A recent poll found that 62% of mobile phone owners have used their phone in the past year to look up information about a health condition [11]. Researchers have increasing access to anonymized data from these sites, which have thus far been used to research and disseminate information about disease and disease processes [12]. More recently, social media and other Web-based data sources have been used to facilitate early outbreak detection [13-15]. These datasets can also be used as a point of entry for public involvement in health research. Social media data provide insight into the public's health concerns and priorities without the financial, temporal, and spatial encumbrances of more traditional community-engagement methods. While these newer methods cannot replace the more traditional ones, social media methods may prove a useful starting point for public engagement in the health research enterprise.
In 2014, the Indiana University Social Network Health Research Laboratory developed a partnership with ChaCha (ChaCha Search, Inc, Carmel, IN, USA) [16], a US-based company that operates a human-guided question-and-answer service that provides free, real-time answers to any question through its website, text messaging, or mobile apps. The data provide a powerful and unique opportunity to listen to the authentic health concerns of individuals. Other Internet-based platforms also provide opportunities to assess population health concerns. Social media platforms have been widely discussed in the literature [17-20]. These platforms, while valuable, are designed for users to communicate with a broad audience of friends or the public at large (eg, Twitter, Facebook), and posts are part of social identity presentation [21]. Conversely, ChaCha queries are a private exchange between an anonymous user and anonymous human guides or a computer. The private nature of the exchange allows users to put forth questions that may be stigmatizing in other settings.
Through our partnership with ChaCha, our laboratory is examining the use of Internet-based question-and-answer services to elicit the patient's voice and develop health interventions that resonate with public concern. The purpose of this paper is to describe ChaCha user characteristics and health-related queries, and to discuss how this big dataset may be used to better understand the perceptions, concerns, and stated needs of health care consumers and the public at large.

Methods
In early 2015, we conducted an automated retrospective textual analysis of 1.9 billion anonymous queries submitted to ChaCha by 19.3 million unique users between January 2009 and November 2012. Because we analyzed only existing, de-identified data, the Indiana University Institutional Review Board determined that the study did not meet definitions of human subject research.
We aggregated queries by year in tabulated ASCII text files, in which each line contained 16 data fields representing 1 ChaCha query and 16 associated descriptors (Table 1). Each year's file was imported to a Linux machine with 64 GB of RAM. Perl scripts were used to parse and summarize the raw data for cleaning and subsequent analyses. A total of 2.004 billion queries were read, of which 3.50% (70,083,796/2,004,243,249) were missing 1 or more data fields, leaving 1.934 billion complete lines of data for these analyses.
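The completeness filter described above can be illustrated with a minimal Python sketch. This is not the authors' Perl code. It assumes that "tabulated" means tab-delimited and that a line counts as incomplete when it has the wrong number of fields or any blank field; the function name `count_complete` and the sample lines are hypothetical. The last lines recompute the reported completeness figures from the counts given in the text.

```python
import io

EXPECTED_FIELDS = 16  # each line holds 16 data fields, per the text


def count_complete(lines, expected_fields=EXPECTED_FIELDS, delimiter="\t"):
    """Return (total, incomplete) counts for an iterable of text lines."""
    total = 0
    incomplete = 0
    for line in lines:
        total += 1
        fields = line.rstrip("\n").split(delimiter)
        # Assumption: a line is incomplete if it has the wrong number of
        # fields or if any field is blank.
        if len(fields) != expected_fields or any(f.strip() == "" for f in fields):
            incomplete += 1
    return total, incomplete


# Tiny synthetic sample (not ChaCha data): two complete lines and one
# line whose fourth field is blank.
good = "\t".join(f"v{i}" for i in range(16))
bad = "\t".join("" if i == 3 else f"v{i}" for i in range(16))
sample = io.StringIO(f"{good}\n{bad}\n{good}\n")
total, incomplete = count_complete(sample)
print(total, incomplete, total - incomplete)  # 3 1 2

# Arithmetic check on the counts reported in the text.
queries_read = 2_004_243_249
queries_missing = 70_083_796
complete_lines = queries_read - queries_missing
pct_missing = round(100 * queries_missing / queries_read, 2)
print(complete_lines)  # 1934159453
print(pct_missing)     # 3.5
```

Reading line by line keeps memory use flat regardless of file size, which matters for yearly files containing hundreds of millions of queries.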

Content of Queries
All incoming queries were initially filtered by a proprietary ChaCha algorithm that identifies keywords to sort 75.45% (1,459,279,135/1,934,159,453) of queries into 12 broad categories (Table 2) that are further divided into 129 subcategories. Excluding ChaCha customer service-related questions, the queries we analyzed most commonly fell into 5 ChaCha-described categories, the fifth of which was Health. Of a total of 106 million health queries, 78.17% (83,056,248/106,254,243) were generated by users who specified their sex and age. We focus here on the subset of those queries (n=68 million) that passed a proprietary ChaCha algorithm that looks for sentence structure, interrogative words, and other factors to filter out "bad questions" that lack sufficient information to be answered.
We examined whole-sentence health queries, first those that were generated by roughly equal proportions of males and females, then those that were predominately (≥90%) submitted by females, and finally those predominately (>80%) submitted by males. Among the sex-balanced queries, questions about pregnancy were by far the most prevalent, such as the following: "How are babies made?" "Can you get pregnant on your period?" "What are the signs of pregnancy?" The only other health query frequently submitted by both males and females was about the length of time that alcohol remains in the body.
The queries submitted predominately by females focused on signs and symptoms of reproductive and urinary tract infections, ovulation, and pregnancy. The most common query was about signs and symptoms of yeast infection, followed by inquiries about how to treat, get rid of, or cure a yeast infection. Females more commonly than males asked about the menstrual cycle and its relationship to pregnancy: "When do you ovulate?" "When are you most likely to get pregnant?" "Am I pregnant?" Toxic shock syndrome was frequently mentioned by females, who wanted to know more about its symptoms. Other predominately female user queries included body image questions such as "How can you make your butt bigger?" and "How do you get rid of cellulite?", and 1 relational question: "How do you get over a guy?"
Whole-sentence queries submitted predominately by males focused on body image, particularly penis size and methods for increasing it: "Does ExtenZe work?" "How to make your penis bigger?" "How do I get a six-pack?" Marijuana was the next most-common subject of health queries submitted by males: "What is the best kind of marijuana?" "How many grams in an ounce?" "Why is marijuana illegal?" This was followed by queries related to women's anatomy and physiology: "How deep is a vagina?" "How do you get a girl pregnant?" Personal health queries focused on testicular discomfort (pain, itching), whether creatine use is safe, and physical fitness goals.
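The sex-based grouping of whole-sentence queries can be sketched as follows. The ≥90% female and >80% male thresholds come from the text. The text does not define "roughly equal proportions", so the 40%-60% band below is purely an assumption, and the function name `classify_by_sex` and the example counts are hypothetical.

```python
def classify_by_sex(female_count, male_count, balance_band=(0.40, 0.60)):
    """Label a query by the share of submissions from each sex.

    Thresholds for "predominately" follow the text (>=90% female,
    >80% male). The balance band is an assumed definition of
    "roughly equal proportions".
    """
    total = female_count + male_count
    if total == 0:
        return "no data"
    female_share = female_count / total
    male_share = male_count / total
    if female_share >= 0.90:
        return "predominately female"
    if male_share > 0.80:
        return "predominately male"
    low, high = balance_band
    if low <= female_share <= high:
        return "sex-balanced"
    return "unclassified"


# Hypothetical counts, for illustration only.
print(classify_by_sex(95, 5))   # predominately female
print(classify_by_sex(10, 90))  # predominately male
print(classify_by_sex(50, 50))  # sex-balanced
print(classify_by_sex(70, 30))  # unclassified
```

Because the two "predominately" thresholds are asymmetric and the balanced band is narrower than the gap between them, some queries fall into none of the three groups; the sketch labels these explicitly rather than forcing them into a group.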
Next we examined smaller word groups, of 2- and 3-word phrases, sorted by sex. Table 3 presents the 10 most prevalent 3-word phrases submitted by males, and Table 4 shows those submitted by females. Findings mirrored the whole-word analysis with the addition of weight-loss questions arising in queries submitted by both male and female users. Figures 2 and 3 illustrate the most prevalent 2-word phrases submitted predominately by males and females, respectively. Figure 4 shows the most prevalent 2-word phrases submitted by both males and females.
Finally, we examined patterns in queries by age groups. The most prevalent 2-word phrases in queries from users aged 13-19, 20-39, and ≥40 years are depicted in Figures 5-7, respectively. Among adolescents younger than 19 years, more females than males submitted queries, whereas among young adults aged 19-29 years, more males than females submitted queries. Age patterns were also sex-related patterns, as reflected in the most prevalent 3-word phrases (Table 5).
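Counting 2- and 3-word phrases of the kind reported above can be sketched with a sliding window over tokenized queries. The tokenization here (lowercasing and keeping only letters, digits, and apostrophes) is an assumption, since the text does not describe how the phrases were extracted; `ngrams` and `top_phrases` are hypothetical helper names. The sample queries are quoted from the Results text.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the list of n-word phrases in a query (assumed tokenization)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_phrases(queries, n, k=10):
    """Return the k most common n-word phrases across all queries."""
    counts = Counter()
    for query in queries:
        counts.update(ngrams(query, n))
    return counts.most_common(k)


queries = [
    "Can you get pregnant on your period?",
    "What are the signs of pregnancy?",
    "When are you most likely to get pregnant?",
]
print(top_phrases(queries, n=2, k=1))  # [('get pregnant', 2)]
print(ngrams("Am I pregnant?", 3))     # ['am i pregnant']
```

Running `top_phrases` separately on the queries from each sex or age group would give per-group rankings comparable in form to Tables 3-5.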

Discussion
Exploring the ways in which consumers use the Internet to seek health information can also aid Internet-based recruitment for research studies of various types, improve communication between consumers and health care providers, and inform the content and geographic scope of marketing for evidence-based interventions using Internet-accessible platforms.
To our knowledge, this is the first analysis of ChaCha data, and these initial results provide valuable methodological and content insights. Methodologically, the results of this initial query affirm our a priori assumption, and the findings of other studies examining Internet health information seeking, that big-data analytical techniques applied to these datasets allow for highly efficient identification of health concerns of users and provide substantial opportunities to develop interventions focused on patient-centered outcomes. Consider that our team analyzed 68 million health-related queries among 1.9 billion overall, generated by 19 million unique users, in less than 5 months and with a total cost of less than $15,000. Our entire team working full-time using traditional patient-engagement strategies would have been unable to generate this volume of data in our collective lifetimes, and the cost would be untenable.
Several significant content findings from this initial analysis of the ChaCha dataset are consistent with the literature regarding adolescents' use of social media (eg, Twitter) for seeking health information. The first is that the majority of health-related queries were submitted by adolescent users, which suggests that adolescents are comfortable using an anonymous text-based question-and-answer service for health information seeking, and a similar platform could be useful for interventions targeted to adolescents. The second is that adolescents' health queries reveal potential knowledge gaps that have serious, lifelong consequences. The vast majority of health questions submitted by adolescents were focused on sexual and reproductive health. They frequently asked about when and how a girl could become pregnant, the signs and symptoms of pregnancy, and the effectiveness and adverse-effect profile of birth control. There were also a large number and proportion of adolescent user-generated queries about the detection and treatment of reproductive tract infections (primarily yeast and urinary tract infections), the length of time that marijuana remains detectable in the blood or urine, weight loss, and wisdom tooth removal. The content of adolescents' queries indicates their interest in and need for real-time, anonymous answers to questions about their sexual and reproductive health.
As with most studies that analyze social media data, this study had several limitations. First, we do not know whether users were searching for their own knowledge or on behalf of a friend or family member. Second, demographic data were self-reported by anonymous users, who may have misrepresented their city, state, sex, or age. Third, our research team was not provided access to these data until 2014, rendering the data 3-6 years old at the time of analysis. As a result, it is possible that the terminology used to describe health concerns, especially among adolescents, may be slightly outdated. However, we are less focused on how people talk about health concerns than on what issues cause them enough concern to prompt health information seeking. We believe it is unlikely that the core health concerns raised by users of the ChaCha services have changed dramatically in the last 3-6 years. Importantly, had we applied traditional methods to collect these data, the time lag between collection and analysis would have been substantially longer than the 3- to 6-year gap in our study. Finally, given that this is a proprietary dataset, as are many other social media datasets, it is not convenient for other investigators to replicate this work.
While other question-and-answer services exist, and many are more popular than ChaCha, the ChaCha service has several unique features that make it appealing for patient-centered research. First, ChaCha use is completely anonymous. Users of other question-and-answer sites, such as Quora, are required to sign up for the service using potentially traceable information such as email or Facebook profile. While Quora may be a secure site, the requisite entry of identifiable information in order to use the site may limit the pool of users and the types of questions they are willing to ask. Popular search engines such as Google or Bing provide a greater sense of privacy, but they leave a searchable history, which may also promote self-censorship. Moreover, ChaCha was specifically designed as a question-and-answer service, in which users understood there was a human curating the answers on the other end of the line. This simulates the health care encounter more closely than a Web search, in which the curating is done by the information seeker.
Additional research with these and other social media data is needed to develop a deeper understanding of spatial and temporal patterns in health information seeking that can inform patient-centered research. The ChaCha service provided a perfect environment for maximum frankness, especially around sensitive health questions. Just below the surface of this massive dataset are the quietly whispered questions, both banal and extraordinary, that represent the hopes, fears, dreams, and concerns of millions of people. Without compromising their anonymity in any way, we can listen in, to improve the health and wellbeing of millions more.

Figure 1. Number of queries posted to ChaCha by user location within the United States, 2009-2012.

Figure 2. The most prevalent 2-word phrases submitted to ChaCha predominately by male users.
Figure 3. The most prevalent 2-word phrases submitted to ChaCha predominately by female users.

Figure 4. The most prevalent 2-word phrases submitted to ChaCha by both males and females.
The ability to analyze such a large volume of user-generated health information-seeking data in such a short time has the potential to fundamentally change patient-centered outcomes research. Patient-engagement strategies are at the heart of effective health outcomes research but are costly and time intensive. Big-data analytic strategies have the potential to make widespread adoption of patient-centered engagement strategies possible at a fraction of the cost.

Table 1. Description of data fields in queries submitted to the ChaCha question-and-answer service.
There were 19.3 million unique ChaCha users who submitted at least one query during the dates under study. The median user age was 17 years, and approximately 68.35% (5,431,866/7,947,118) of users were younger than age 20 years. Service use peaked in 2011, during which there were nearly 672 million queries. Monthly service use fluctuated between 10 million queries in January 2009 and a peak of approximately 60 million queries in May 2011. There were no noteworthy service use trends by month or day of the week. Users most often submitted their questions between 9 PM and 12 AM.

Table 2. Queries submitted to ChaCha: question counts by category and sex (n=1,459,279,135).

Table 3. The most prevalent 3-word phrases submitted to ChaCha by males.

Table 4. The most prevalent 3-word phrases submitted to ChaCha by females.

Table 5. Use of 3-word phrases when submitting queries to ChaCha, by sex and age.