Analysis of Patient Narratives in Disease Blogs on the Internet: An Exploratory Study of Social Pharmacovigilance

Background: Although several reports have suggested that patient-generated data from Internet sources could be used to improve drug safety and pharmacovigilance, few studies have identified such data sources in Japan. We introduce a unique Japanese data source: tōbyōki, which translates literally as “an account of a struggle with disease.”
Objective: The objective of this study was to evaluate the basic characteristics of the TOBYO database, a collection of tōbyōki blogs on the Internet, and discuss potential applications for pharmacovigilance.
Methods: We analyzed the overall gender and age distribution of the patient-generated TOBYO database and compared this with other external databases generated by health care professionals. For detailed analysis, we prepared separate datasets for blogs written by patients with depression and blogs written by patients with rheumatoid arthritis (RA), because these conditions were expected to entail subjective patient symptoms such as discomfort, insomnia, and pain. Frequently appearing medical terms were counted, and their variations were compared with those in an external adverse drug reaction (ADR) reporting database. Frequently appearing words regarding patients with depression and patients with RA were visualized using word clouds and word cooccurrence networks.
Results: As of June 4, 2016, the TOBYO database comprised 54,010 blogs representing 1405 disorders. Overall, more entries were written by female bloggers (68.8%) than by male bloggers (30.8%). The most frequently observed disorders were breast cancer (4983 blogs), depression (3556), infertility (2430), RA (1118), and panic disorder (1090). Comparison of medical terms observed in tōbyōki blogs with those in an external ADR reporting database showed that subjective and symptomatic events and general terms tended to be frequently observed in tōbyōki blogs (eg, anxiety, headache, and pain), whereas events using more technical medical terms (eg, syndrome and abnormal laboratory test result) tended to be observed frequently in the ADR database. We also confirmed the feasibility of using visualization techniques to obtain insights from unstructured text-based tōbyōki blog data. Word clouds described the characteristics of each disorder, such as “sleeping” and “anxiety” in depression and “pain” and “painful” in RA.
Conclusions: Pharmacovigilance should maintain a strong focus on patients’ actual experiences, concerns, and outcomes, and this approach can be expected to uncover hidden adverse event signals earlier and to help us understand adverse events in a patient-centered way. Patient-generated tōbyōki blogs in the TOBYO database showed unique characteristics that were different from the data in existing sources generated by health care professionals. Analysis of tōbyōki blogs would add value to the assessment of disorders with a high prevalence in women, psychiatric disorders in which subjective symptoms have important clinical meaning, refractory disorders, and other chronic disorders.
Matsuda et al. JMIR Public Health Surveill 2017 | vol. 3 | iss. 1 | e10 | http://publichealth.jmir.org/2017/1/e10/


Current Pharmacovigilance
The World Health Organization defines pharmacovigilance (PV) as the science and activities related to the detection, assessment, understanding, and prevention of adverse effects or any other drug-related problems [1]. In this era of what Edwards calls "information explosion," we must rethink PV [2] to effectively incorporate a variety of data sources while ensuring the timely decision-making that is crucial to avoiding unnecessary harm caused by adverse events (AEs) in real-world health care practice.
Current PV activities depend heavily on voluntary, spontaneous AE reports obtained from health care professionals (HCPs). It is generally accepted that one advantage of spontaneous reporting is its speed at detecting AE signals as early as possible. However, it is also acknowledged that spontaneous reports by HCPs alone may not be enough to capture all AE signals in a timely fashion. Because some symptomatic AEs can be expected to be reported only by patients who have firsthand experience of drug treatment [3], incorporating patient-generated data into PV is one of the most important challenges [4]. Several studies have suggested that self-reporting by patients is useful for catching AE signals earlier, and many countries have implemented patient AE reporting schemes [5][6][7][8]. The Japanese regulatory authority started preliminary implementation of a self-reporting system for patients in March 2012 [9,10]; however, the system is still under development and will require more time to be used effectively in a routine PV system [11].

Prior Research on Applying Internet Resources in Pharmacovigilance
Analyzing information on the Internet would add significant knowledge about public health, as shown in Eysenbach's study outlining the framework of infodemiology and infoveillance [12]. In PV, there has been recent growing interest in utilizing patient-generated Internet resources such as social media [13][14][15][16][17]. A survey conducted in 2001 and 2002 in the United States showed that the Internet is an important resource for the public; approximately 40% of respondents there obtained information on health-related topics through Internet sources [18]. In response to the increasing use of social media to share health care information, the US Food and Drug Administration announced in 2015 that they had started a collaboration with PatientsLikeMe [19], a patient networking website, to apply patient-generated data to risk management activities [20]. In Europe, the Medicines and Healthcare products Regulatory Agency in the United Kingdom started working with the WEB-RADR project in 2014 to develop a mobile phone app that helps HCPs and patients report AEs to national health care authorities [21]. The European Medicines Agency has also released guidelines on good pharmacovigilance practices, of which Module VI requires companies having the European Union marketing authorization to monitor the Internet or digital media under their management or responsibility for potential reports of suspected adverse reactions [22]. These ongoing efforts are expected to lead to important developments in PV. Like Americans and Europeans, approximately 39% of Japanese obtain health information via the Internet [23]. However, to our knowledge, no studies have explicitly identified such Japanese data sources for use in PV.

Patient-Generated Data and Study Objectives
Our motivation was to take the first step toward enhancing PV by considering the application of patient-generated data sources in Japan. In this study, we focused on the potential use of health-related disease blogs called tōbyōki. The term tōbyōki translates literally to "an account of a struggle with disease," and this form of writing predates the Internet. Although it is difficult to pinpoint when patients started writing tōbyōki, a sociological study has reported that the number of tōbyōki has been increasing in Japan since the 1970s [24]. In these diary-like accounts, patients record observations about their lives and diseases in handwritten journals. Recently, some patients have started sharing their tōbyōki as blogs on the Internet.
It has already been suggested that analyzing tōbyōki blogs is useful for understanding patients' feelings when they receive a cancer diagnosis [25], although there was no discussion on their potential use in PV. In this study, we introduce a growing database called TOBYO, which is a collection of a broad range of tōbyōki blogs on the Internet [26]. The objective of this exploratory study was to address the following questions: (1) what kinds of data elements exist in the TOBYO database? (2) what are the differences in population distribution between the TOBYO database and other external databases generated by HCPs? (3) what kinds of analytic approaches are useful to obtain insights from the TOBYO database? and (4) can the TOBYO database be useful for PV?
To achieve our objective, we conducted 2 analyses (Analysis A and Analysis B). In Analysis A, we used the whole TOBYO database to describe data elements and understand the overall characteristics of this database. In Analysis B, we used a data subset of selected disorders from the TOBYO database to explore the usefulness of the database in greater detail. Here, we focused on depressive disorders and rheumatoid arthritis (RA) because these conditions were expected to entail subjective patient symptoms such as discomfort, insomnia, and pain. Finally, we included a discussion of the potential of the TOBYO database and practical challenges from the PV perspective.

Data Source
In this study, we considered health-related tōbyōki blogs as a resource for patient-generated data. Some examples of excerpts from tōbyōki blogs are shown in Table 1. The TOBYO database consisted of a Web-based collection of tōbyōki blogs written in Japanese [26] and maintained by Initiative Inc (Tokyo, Japan). The overall flow of data in the TOBYO database is shown in Figure 1. Blogs written in Japanese were identified and extracted daily from the Internet using a proprietary crawling method. Before being registered in the TOBYO database, each tōbyōki blog was manually checked to judge whether it was a tōbyōki blog or noise, which was excluded. Each blog registered to the TOBYO database met all of the following selection criteria: (1) Language criteria: blogs written in plain Japanese language without extensive use of emoticons, symbols, or colloquial expressions were included; (2) Blogger criteria: blogs written by patients or their families were included. Blogs not written by patients or their families, such as those by manufacturers or HCPs who were providing medical care, were excluded (because such blogs generally described the HCP's records and did not contain a patient perspective); and (3) Content criteria: blogs containing at least ten pages of tōbyōki entries on patients' actual experiences were included. Blogs comprising excerpts from news media, books, health-related websites, or treatment guidelines were excluded. Blogs intended for marketing or promotion of commercial services or religious or political beliefs were also excluded.
At the time of registration in the TOBYO database, information on gender, age at onset, and the primary disorder of each patient was determined by checking the profile or introduction page of each tōbyōki blog and stored as structured data for each patient. Text-based data in tōbyōki blogs were stored as unstructured data for each patient.
Figure 1. Flow of data in the TOBYO database. (1) This study focuses on tōbyōki blogs that are publicly available on the Internet. Generally, there is a substantial volume of noise (white) unrelated to tōbyōki blogs (shaded). (2) Based on selection criteria described in Methods, filtering of tōbyōki blogs is performed manually, (3) and noise such as blogs written by companies is excluded. (4) Appropriate tōbyōki blogs are registered in the TOBYO database and stored for additional analysis.

Demographic Characteristics of the TOBYO Database
To understand the demographic characteristics of the TOBYO database, structured data elements such as gender, age at onset, and frequently mentioned primary disorders were summarized in contingency tables. We also evaluated demographic characteristics by comparing population pyramids for the TOBYO database and 2 external databases generated by HCPs. The first HCP-generated database was the Japanese Adverse Drug Event Report (JADER) database maintained by the Pharmaceuticals and Medical Devices Agency. It comprised individual case safety reports (ICSRs) about the occurrences of serious adverse drug reactions (ADRs) for drugs approved in Japan. Similar to a previous report [27], we obtained the JADER dataset updated in September 2016 and extracted all ICSRs to create a population pyramid for the JADER database. The other HCP-generated database was the Japanese health insurance claims database maintained by Japan Medical Data Center, Ltd (Tokyo, Japan). It comprised medical claims information submitted from medical institutions to health insurance organizations for both corporate employees and their dependents [28]. Using this database, we created a population pyramid by determining the number of patients who had at least one record of drug prescription or disease from January 2011 to December 2015. As an additional comparison, we used national statistical surveillance data on all citizens living in Japan and publicly available through the Japanese government's website [29].
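The tabulation behind a population pyramid can be sketched in a few lines of Python. This is an illustrative sketch only and not the code used in the study, which reports using R, JMP, and Excel. The age bands "<20", "20-34", and "35-49" follow the bands reported in the Results; the remaining bands, the record layout, and the sample patients are assumptions made for the example.

```python
from collections import Counter

# Age bands: the first three follow the Results; the others are assumed.
AGE_BANDS = [
    (0, 19, "<20"),
    (20, 34, "20-34"),
    (35, 49, "35-49"),
    (50, 64, "50-64"),
    (65, 200, ">=65"),
]

def age_band(age):
    """Map an age at onset to its band; missing ages are labeled 'unknown'."""
    if age is None:
        return "unknown"
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "unknown"

def pyramid_counts(patients):
    """Count patients per (gender, age band) cell, the input to a pyramid plot."""
    return Counter((gender, age_band(age)) for gender, age in patients)

# Hypothetical structured records: (gender, age at onset)
patients = [("F", 28), ("F", 31), ("M", 45), ("F", 52), ("M", 17), ("F", None)]
counts = pyramid_counts(patients)
```

Computing the same cell counts for each database (TOBYO, JADER, claims data, and national statistics) would give directly comparable pyramids.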

Distribution of Disorders in the TOBYO Database
To understand the distribution of primary disorders in the TOBYO database, frequently mentioned disorders were summarized. The name of each disorder was independently reviewed by 2 reviewers (ST and MS) and coded using Medical Dictionary for Regulatory Activities (MedDRA) version 19.1. MedDRA is a widely used, standardized medical terminology developed by the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use [30]. Both reviewers had at least two years of experience in processing and evaluating ICSRs.

Additional Characteristics of the TOBYO Database
We analyzed additional characteristics of tōbyōki blogs that might be useful to understand the data. Behavioral characteristics about writing tōbyōki blogs, such as the time and day of week for blog postings, were determined for all postings accompanied by relevant identifiable information. Continuity of tōbyōki blogs was calculated by counting the number of days from the first entry to the latest update for each patient.
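As a minimal sketch of the continuity calculation described above, the following Python snippet counts the days between the first and the latest entry of each blog. The record layout (blog identifier and posting date) and the sample dates are hypothetical, and the snippet is not the code used in the study.

```python
from datetime import date

def continuity_days(postings):
    """Return {blog_id: days from the first entry to the latest update}."""
    first_last = {}
    for blog_id, posted in postings:
        first, last = first_last.get(blog_id, (posted, posted))
        first_last[blog_id] = (min(first, posted), max(last, posted))
    return {b: (last - first).days for b, (first, last) in first_last.items()}

# Hypothetical postings: (blog_id, posting date)
postings = [
    ("blog_a", date(2012, 1, 1)),
    ("blog_a", date(2015, 6, 30)),
    ("blog_b", date(2016, 3, 1)),
    ("blog_b", date(2016, 3, 15)),
]
days = continuity_days(postings)

# Blogs that continued for more than 3 years (approximated here as 3 * 365 days)
long_running = [b for b, d in days.items() if d > 3 * 365]
```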

Mining Events Appearing in Tōbyōki Blogs
As depicted in Figure 2, we applied natural language processing techniques to unstructured text-based data to prepare the datasets, which were then analyzed to answer specific questions (eg, what identifying words are frequently used by a particular population?). In this study, we extracted 2 different sets of tōbyōki blogs from the TOBYO database, 1 for patients with depression and 1 for patients with RA, and we prepared separate datasets containing all unstructured text written by patients with each disorder. We then analyzed the drugs and medical events mentioned in each dataset.
To process the unstructured text, we first performed a morphological analysis using MeCab, an open-source Japanese segmentation tool [31], to break down each text into words. This preprocessing approach is commonly used to delimit words in texts that do not delimit words with spaces, which is a characteristic of the Japanese language [32]. Because tōbyōki blogs contained many entries unrelated to disease, such as those about everyday life, the data were noisy. We therefore identified the 100 most frequently mentioned drugs in each dataset (depression and RA) and extracted every sentence containing at least one of these drugs; the extracted sentences were used for subsequent analysis. This approach enabled us to focus on drug-related contexts rather than on everyday diary-like content. The same 2 reviewers (ST and MS) independently reviewed summary tables containing the 300 most frequently mentioned words in each dataset (depression and RA) to identify medical events (eg, name of symptom, diagnosis, and disorder), which were coded using MedDRA. Because original descriptions written by patients tended to have some degree of ambiguity (eg, words such as suffering, feeling down, feeling unwell), discrepancies in coding sometimes occurred between the 2 reviewers. The reviewers discussed any such discrepancies and determined a single appropriate Preferred Term in accordance with the standard guidance for MedDRA coding procedures (MedDRA Term Selection: Points to Consider [33]).
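The drug-based filtering step can be illustrated with a short Python sketch. It assumes the text has already been segmented into words (the study used MeCab for this; the sketch starts from pre-tokenized sentences), and the drug lexicon, the English tokens, and the small top_n value are hypothetical choices made for readability. It is not the code used in the study.

```python
from collections import Counter

def top_drugs(tokenized_sentences, drug_lexicon, top_n=100):
    """Return the top_n most frequently mentioned drug names."""
    counts = Counter(
        tok for sent in tokenized_sentences for tok in sent if tok in drug_lexicon
    )
    return [name for name, _ in counts.most_common(top_n)]

def drug_sentences(tokenized_sentences, drugs):
    """Keep only sentences that mention at least one of the given drugs."""
    drug_set = set(drugs)
    return [sent for sent in tokenized_sentences if drug_set.intersection(sent)]

# Hypothetical, already segmented sentences
sentences = [
    ["took", "paroxetine", "before", "bed"],
    ["went", "shopping", "today"],
    ["paroxetine", "made", "me", "sleepy"],
    ["started", "sertraline", "last", "week"],
]
lexicon = {"paroxetine", "sertraline", "fluvoxamine"}  # assumed drug name list

drugs = top_drugs(sentences, lexicon, top_n=2)
kept = drug_sentences(sentences, drugs)
```

In this toy input, the diary-like sentence about shopping is dropped, which mirrors the intent of focusing on drug-related contexts.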
In addition to identifying medical events frequently observed in tōbyōki blogs, we examined differences in the types and frequencies of events between tōbyōki blogs and existing HCP-generated data sources. For this purpose, we compared medical terms frequently observed in tōbyōki blogs (as identified earlier) and those frequently observed in the JADER database. Using the JADER database, we first produced separate tables of the 30 most frequent ADRs reported for 4 biological drugs approved for RA (adalimumab, etanercept, infliximab, and tocilizumab were selected because they were the first 4 biologics approved in Japan around 2000 and were thus expected to contain enough data for comparison) and of the 30 most frequent ADRs reported for 4 selective serotonin reuptake inhibitors approved for depression (escitalopram, fluvoxamine, paroxetine, and sertraline were selected because these were widely prescribed and also used in the previous study [14]). Then, by comparing these lists of events from tōbyōki blogs and the JADER database, we identified the words appearing in both databases and those appearing in only one database. This focused comparison based on frequently appearing events enabled us to highlight the major characteristics of these databases. This process of review and comparison was carried out independently by the 2 aforementioned reviewers (ST and MS).
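The list comparison amounts to simple set operations, as the following Python sketch shows. The term lists are illustrative placeholders loosely based on examples mentioned in this paper; they are not the actual contents of Tables 5 and 6, and in the study the comparison was performed by reviewers rather than by code.

```python
def compare_term_lists(blog_terms, adr_terms):
    """Split coded terms into those shared by both sources and those unique to one."""
    blog, adr = set(blog_terms), set(adr_terms)
    return {
        "both": sorted(blog & adr),
        "blogs_only": sorted(blog - adr),
        "adr_only": sorted(adr - blog),
    }

# Illustrative placeholder lists (not the actual table contents)
blog_terms = ["pain", "headache", "anxiety", "interstitial lung disease"]
adr_terms = ["interstitial lung disease", "abnormal laboratory test result"]

result = compare_term_lists(blog_terms, adr_terms)
```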

Visualization of Tōbyōki Blog Contents
Because visualization approaches could be useful for PV, we used all sentences containing at least one of the above 100 drugs to calculate Jaccard coefficients to measure the similarity between term pairs. Jaccard coefficients index the degree of cooccurrence between term pairs by showing how much the terms overlap. For instance, Figure 3 shows the calculation of the Jaccard coefficient for drug A and verb X [34]. Using these Jaccard coefficients, we visually represented the words associated with depression or RA in word clouds. In the word clouds, the size of each word reflected the frequency with which the word appeared in text (ie, the more frequently a word appeared, the larger the word was shown in the word cloud). The colors of each word were randomly assigned and did not have any meaning. Word clouds could be used in PV to achieve an initial, intuitive understanding of data. We also created a word cooccurrence network for patients with RA to evaluate the occurrence of words in conjunction with the names of 4 biological drugs approved for RA. Word cooccurrence network analysis could be used in PV to explore terms related to specific drugs.
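For a pair of terms, the Jaccard coefficient is the number of sentences containing both terms divided by the number of sentences containing at least one of them. The Python sketch below illustrates this definition; the sentences and the names "drug_a" and "take" are hypothetical, and the snippet is not the code used in the study.

```python
def jaccard(sentences, term_a, term_b):
    """Sentences containing both terms divided by sentences containing either."""
    has_a = {i for i, s in enumerate(sentences) if term_a in s}
    has_b = {i for i, s in enumerate(sentences) if term_b in s}
    union = has_a | has_b
    return len(has_a & has_b) / len(union) if union else 0.0

# Hypothetical sentences, each represented as a set of words
sentences = [
    {"drug_a", "take", "morning"},
    {"drug_a", "take", "pain"},
    {"drug_a", "pain"},
    {"take", "walk"},
]
coef = jaccard(sentences, "drug_a", "take")  # 2 shared / 4 total = 0.5
```

A coefficient of 1 would mean the two terms always appear together, and 0 would mean they never share a sentence; term pairs with high coefficients are the candidates for edges in a word cooccurrence network.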
Statistical software R, JMP software version 11.2.1 (SAS Institute), and Microsoft Excel were used for the analysis.

Ethics Approval
The study protocol was reviewed and approved by the nonprofit MINS Institutional Review Board [35]. The board waived informed consent because the data source did not contain personal information. In addition, we presented the data at the group level rather than at the individual level.

Demographic Characteristics of the TOBYO Database
As of June 4, 2016, the tōbyōki blogs aggregated in the TOBYO database comprised 54,010 blogs representing 1405 disorders. The blogs were started from 1994 to 2016, but more than 90% of them were started from 2005 to 2015.
As shown in Table 2, information on gender could be identified in most of the blogs (99.60%, 53,794/54,010). More blogs were written by female bloggers (68.80%, 37,161/54,010) than by male bloggers (30.80%, 16,633/54,010). Of approximately 40% of tōbyōki blogs in the TOBYO database with information on age at onset, more than half were written by people less than 50 years old. The peak age at onset was 20-34 years (24.44%, 13,201/54,010), followed by 35-49 years (16.35%, 8830/54,010) and less than 20 years (16.16%, 8730/54,010). We found apparent differences in population distribution between the TOBYO database and existing data sources such as the Japanese health insurance claims database, JADER database, and national population statistics (Figure 4). Compared with national statistics as a standard, the population in the TOBYO database tended to be younger and contained relatively more females than males. In contrast, the population of the JADER database was older with no particular gender differences between ages. The health insurance claims database did not include people older than 75 years, but data for the young to middle-aged group seemed to be abundant with no particular gender differences between age groups.

Additional Characteristics of the TOBYO Database
We also highlighted unique data elements by analyzing behavioral characteristics of writing tōbyōki blogs and found that most writers updated their blogs between 9 PM and midnight (Figure 5). No particular patterns were observed according to which days of the week blog entries were posted. More than one-third of the blogs in the TOBYO database (36.81%, 19,879/54,010) had continued for more than 3 years.

Mining Events Appearing in Tōbyōki Blogs
Comparison of depression (Table 5) and RA (Table 6) events in tōbyōki blogs and the JADER database showed apparent differences in the types and frequencies of events observed.
Subjective, symptomatic terms and general terms for patients tended to be frequently observed in tōbyōki blogs (eg, anxiety, headache, and pain), whereas more technical, medical terms (eg, syndrome and abnormal laboratory test result) tended to be observed frequently in the JADER database. Exceptionally, the fact that "interstitial lung disease" in patients with RA was observed frequently in both tōbyōki blogs and the JADER database suggested relatively high attention for this event.
Coding discrepancies occurred between the 2 reviewers for the following terms (the different suggestions from the reviewers are shown in parentheses): Adverse reaction (adverse reaction or adverse drug reaction), Psychosis (mental disorder or psychotic disorder), Paroxysmal attack (seizure-like phenomena or seizure), Emotional instability (affect lability or feeling abnormal), Suffering (sense of oppression or emotional distress), Feeling down (depressed mood or emotional distress), and Psychiatric disorder (mental disorder or psychotic disorder).
Table footnote d: Activation syndrome is a generic term used for central nervous system stimulation symptoms that are potential adverse effects caused by selective serotonin reuptake inhibitors.

Visualization of Contents in Tōbyōki Blogs
As depicted in Figures 6 and 7, "take" (as in "take medicine") was the most frequent word in the datasets for depression and RA, suggesting that extraction of tōbyōki blog content containing the 100 most frequently mentioned drugs helped focus the data. Among patients with depression (Figure 6), sleep-related terms such as "lie down," "sleep (noun)," "sleep (verb)," "sleepiness," "awakening," and "awaken" were observed, indicating that patients shared information about their disease conditions. We also found therapy-specific words such as "adverse effects," "antidepressant agent," "depression drug," and "withdrawal symptoms." Among patients with RA (Figure 7), pain-related terms such as "pain," "painful," "swelling," and "stiffness" were frequently noted, indicating that these were important words for characterizing RA.
As depicted in Figure 8, the words "rheumatism," "give relief," "pain," and "painful" were located at the center of the word cooccurrence networks of the 4 biological drugs considered in this study, meaning that these words were frequently used in association with all 4 drugs. The characteristics of each drug were also observed in the margins of the word cooccurrence networks. For example, adalimumab and etanercept, administered as subcutaneous injections, were associated with the word "self-injection," and infliximab and tocilizumab, administered as intravenous infusions, were associated with the word "infusion."

Principal Findings
Patient-generated data is likely to play a key role in improving PV [36]. In Japan, however, a system of self-reporting by patients is still being considered [10] and no patient-generated data resources have been explicitly identified. As one option for such a resource, this study evaluated the TOBYO database from the PV perspective.
In the whole TOBYO database, more blogs were written by female bloggers, and fewer blogs were written by people older than 50 years (Table 2). These findings were consistent with the results of a general survey of Internet usage in Japan [23]. Reflecting the fact that a higher percentage of tōbyōki blogs were written by women, the most frequently appearing disorders in the TOBYO database tended to have a higher prevalence in women: breast cancer, cervical cancer [37], RA [38], and panic disorder [39] (Table 3). Additional analysis of tōbyōki blogs would be more realistic for these disorders with a high prevalence in women. Our findings also suggested relevance for other frequently appearing disorders, such as psychiatric disorders with subjective symptoms that have important clinical meaning, refractory disorders, autoimmune disorders, and other chronic disorders.
As shown in Tables 5 and 6, tōbyōki blogs written by patients with depression or RA contained symptomatic, subjective terms rather than the medical diagnosis or other medical terms. This revealed a difference between tōbyōki blogs and the JADER database generated by HCPs and implied that the TOBYO database might have the advantage of enabling the analysis of patient-level outcomes that could not be captured in existing data sources. Indeed, previous research has shown that psychiatric events are difficult to identify in health care administrative databases because physicians have difficulty detecting them and patients avoid reporting the symptoms to their physicians [40]. Another interesting possibility is that even if a patient reporting system is implemented, patients may not voluntarily report events that they do not consider to be AEs, as suggested by previous research conducted on patients with Parkinson's disease [41]. In such a situation, in which patients themselves do not consider the possibility of AEs, the TOBYO database can be useful for capturing initial symptoms as AE signals.
We confirmed the feasibility of analyzing patient narratives using text mining to draw insights from tōbyōki blogs. Word clouds suggested characteristic words associated with selected conditions, such as "sleeping" and "anxiety" with depression and "pain" and "painful" with RA. This suggested that tōbyōki blogs were a useful resource for understanding characteristic information for each disorder. We were also able to identify words commonly associated with the 4 biological drugs located at the center of the word cooccurrence networks (Figure 8). The common words revealed in this study were not particularly noteworthy, but further research using the same approach with different drugs or disease areas might be useful for exploring drug safety concerns such as unknown AEs. For example, a report analyzing tweets written by Japanese patients with cancer suggested that visualizing narratives with word cooccurrence networks could be a useful approach to obtain insights from social media [42].
We noted several strengths of tōbyōki blogs as a resource for data analysis in this study. One was the ease of obtaining patient background information as summarized in Table 2. In contrast to other data sources such as Web-based discussion forums in which patient background information was inherently limited [43], tōbyōki blogs usually had a profile or introduction page from which a substantial level of information could be collected. Another strength was that most tōbyōki bloggers wrote their blogs voluntarily to record and share their experiences with others, resulting in primarily subjective descriptions of patient experiences. This first-hand, observational quality, free from obligations or interventions, might enable researchers to better understand patients' actual concerns. A third strength was that compared with common blogs or social media (even those written by patients), tōbyōki blogs might be more likely to contain analyzable information on health-related or life-related topics because serious disease and other health crises were typical motivations for starting tōbyōki blogs.

Limitations
This study had several limitations. First, because tōbyōki blogs were written by only a segment of the patient population, generalization of the findings required caution. For instance, the elderly population might be underrepresented in Internet sources [23]. In addition, as a patient's condition became more severe, it might be more difficult for them to continue writing their tōbyōki blogs. These biases should be considered when interpreting the results. Second, the insights obtained from qualitative text-mining approaches were based on some degree of subjective interpretation by researchers. For example, in word clouds, the relative size of each word reflected its frequency, which was helpful for identifying frequent or important words mentioned by many bloggers. On the other hand, because the size of each word did not reflect its clinical significance, it was possible that some smaller words might have greater clinical significance. Although word clouds have the potential to provide some insights from textual data, interpretation should be done with caution, keeping their pros and cons in mind. Third, some technical improvements would be necessary to extract more meaningful knowledge from the texts used in this study. For instance, we only considered fragmented words for analysis. By excluding phrases and other word combinations, we might have missed some important concepts or patient feelings. Additional techniques such as entity linking or named entity recognition should be considered in future studies to improve the results. Finally, because the language in social media tends to be highly informal and contain a wide variety of expressions, identification of specific concepts such as AEs and medicinal drugs from the unstructured narratives is a challenge. Although we could identify frequently appearing medical events in the TOBYO database, as shown in Tables 5 and 6, it is apparent that not all these events were AEs because we did not consider whether they had occurred before or after drug administration. Additional work is necessary to identify AEs occurring after drug administration.

Future Challenges for Social Pharmacovigilance
We also recognized future challenges for the effective use of social media data in PV. First, there is a need for official guidance or policy about the necessity of obtaining informed consent from patients and protecting privacy. Although research interest in the use of social media is growing, there is currently no consensus or guideline [44]. We think there is no need for artificial constraints such as obtaining subsequent informed consent for the use of blog data because they are already publicly available on the Internet. Regarding patients' decisions on whether to share data, a study showed that patients in the cancer community tended to think positively about sharing as long as the benefit of sharing data outweighed the risk [45], and the authors recommended that researchers should be careful to protect patient anonymity. In accordance with this recommendation, we prepared all analysis output as summarized data and not individual-level data in consideration of patients' rights to protected privacy. Second, we acknowledge that issues exist with the reliability and reproducibility of social media, particularly from the regulatory, good pharmacovigilance practice perspective. Concerns about the incorporation of false information have been noted previously [46]. Considering our study using tōbyōki blogs, we assume that the extent of this problem would not be very large because there is no conceivable incentive for maintaining a fake tōbyōki blog at this time. Selecting blogs with at least 10 pages in the screening process before registration in the TOBYO database would help to prevent the inclusion of fake blogs. Concerns about the reproducibility of analysis present a practical challenge. It is not realistic to keep a dynamic dataset that is updated every day and that may be updated retrospectively. To ensure the reproducibility of individual research, storing the final dataset as a snapshot is recommended.
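One simple way to make a stored snapshot verifiable is to record a cryptographic digest of the frozen dataset alongside the analysis output. The Python sketch below illustrates the idea under stated assumptions: the record fields are hypothetical, JSON serialization is an arbitrary choice, and nothing here describes how the TOBYO database is actually stored.

```python
import hashlib
import json

def snapshot_digest(records):
    """Return a SHA-256 digest of a frozen dataset, independent of key order."""
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical snapshot of summarized, non-identifying records
snapshot = [
    {"blog_id": "blog_a", "n_entries": 42},
    {"blog_id": "blog_b", "n_entries": 17},
]
digest = snapshot_digest(snapshot)
```

Reporting the digest with the results would let a later reader confirm that a reanalysis used exactly the same snapshot, even if the live database has since been updated.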
Finally, because the volume of data on the Internet is continuously growing, there may be a need to think about how to efficiently detect and process AE information on the Internet. One option is to improve the text-mining algorithm using dictionary-based methods by preparing an annotated corpus to recognize AEs and drugs. However, this process would be time-consuming and costly. Another option is the application of a machine learning approach by preparing a classifier algorithm that does not necessarily require the preparation of annotated corpora, and there has been a report of the application of a deep-learning technique to detect potential AEs from social media texts [47]. In summary, we need to tackle several practical and technical issues to efficiently incorporate social media resources into PV.