The PsyTAR dataset: From patients generated narratives to a corpus of adverse drug events and effectiveness of psychiatric medications

The “Psychiatric Treatment Adverse Reactions” (PsyTAR) dataset contains patients’ expressions of effectiveness and adverse drug events associated with psychiatric medications. PsyTAR was generated in four phases. In the first phase, a sample of 891 drug reviews posted by patients on an online healthcare forum, “askapatient.com”, was collected for four psychiatric drugs: Zoloft, Lexapro, Cymbalta, and Effexor XR. For each drug review, patient demographic information, duration of treatment, and satisfaction with the drug were reported. In the second phase, sentence classification, the drug reviews were split into 6009 sentences, and each sentence was labeled for the presence of Adverse Drug Reactions (ADRs), Withdrawal Symptoms (WDs), Signs/Symptoms/Illnesses (SSIs), Drug Indications (DIs), Drug Effectiveness (EF), Drug Ineffectiveness (INF), and Others (not applicable). In the third phase, entities including ADRs (4813 mentions), WDs (590 mentions), SSIs (1219 mentions), and DIs (792 mentions) were identified and extracted from the sentences. In the fourth phase, all the identified entities were mapped to the corresponding UMLS Metathesaurus concepts (916) and SNOMED CT concepts (755). In this phase, qualifiers representing the severity and persistency of ADRs, WDs, SSIs, and DIs (e.g., mild, short term) were also identified. All sentences and identified entities were linked to the original post using IDs (e.g., Zoloft.1, Effexor.29, Cymbalta.31). The PsyTAR dataset can be accessed via Online Supplement #1 under the CC BY 4.0 Data license. Updated versions of the dataset will also be accessible at https://sites.google.com/view/pharmacovigilanceinpsychiatry/home.

Value of the data
The PsyTAR dataset can be used as a benchmark to train and evaluate lexicon-based systems and machine learning algorithms that identify adverse drug events (ADEs) and measure drug effectiveness from online healthcare forums, particularly for psychiatric medications.
The PsyTAR dataset can be used to train machine learning systems (e.g., neural networks) for normalizing medical concepts in online healthcare communities by extracting the semantic links between laypersons' expressions of medical terms and standard medical vocabularies.
The PsyTAR dataset can be used to evaluate the association between different types of ADEs and patient satisfaction (attitude) toward psychiatric medications.
The PsyTAR dataset may also be used to facilitate the seamless exchange of information between patients' expressions of ADEs in personal health records (PHRs) and electronic health records (EHRs) [1].

Data
The PsyTAR sample contains 891 drug reviews collected randomly from an online healthcare forum, "askapatient.com". Fig. 1 shows the share of the sample for each of the four drugs: "Zoloft" and "Lexapro" from the SSRI (Selective Serotonin Reuptake Inhibitor) class, and "Effexor XR" and "Cymbalta" from the SNRI (Serotonin-Norepinephrine Reuptake Inhibitor) class. Fig. 2 shows the gender distribution of the sample. Across the whole sample, the average age was 37 years and the average duration of use was 18 months.
In the second phase, drug review posts were split into sentences, and the sentences were then labeled for the presence of ADRs (adverse drug reactions), WDs (withdrawal symptoms), SSIs (signs, symptoms, illnesses), DIs (drug indications), EF (drug effectiveness), and INF (drug ineffectiveness). The total number of sentences in the sample is 6009. Fig. 3 shows the frequency of sentences labeled for each of these items for the whole PsyTAR dataset and for the SSRI and SNRI classes separately.
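As a rough illustration of how sentence-level labels of this kind could be tallied by drug class (the sort of summary Fig. 3 reports), the sketch below works on a few made-up records. The field names (`post_id`, `labels`), the helper functions, and the records themselves are assumptions for illustration and do not reflect the dataset's actual schema. Only the ID format (e.g., Zoloft.1) and the drug-to-class mapping are taken from the text. The sketch also assumes that a sentence may carry more than one label.

```python
from collections import Counter, defaultdict

# Drug-to-class mapping as described in the text; the ID prefixes follow
# the examples given there (Zoloft.1, Effexor.29, Cymbalta.31).
DRUG_CLASS = {
    "Zoloft": "SSRI",
    "Lexapro": "SSRI",
    "Effexor": "SNRI",
    "Cymbalta": "SNRI",
}

LABELS = ("ADR", "WD", "SSI", "DI", "EF", "INF")


def drug_class_of(post_id: str) -> str:
    """Return the drug class for an ID such as 'Zoloft.1'."""
    drug = post_id.split(".", 1)[0]
    return DRUG_CLASS[drug]


def label_counts_by_class(records):
    """Count how many sentences carry each label, per drug class.

    A sentence with several labels contributes to several counts.
    """
    counts = defaultdict(Counter)
    for record in records:
        cls = drug_class_of(record["post_id"])
        for label in record["labels"]:
            if label not in LABELS:
                raise ValueError(f"unknown label: {label}")
            counts[cls][label] += 1
    return {cls: dict(c) for cls, c in counts.items()}


# Hypothetical records, invented for this example only.
records = [
    {"post_id": "Zoloft.1", "labels": {"ADR"}},
    {"post_id": "Zoloft.1", "labels": {"ADR", "EF"}},
    {"post_id": "Effexor.29", "labels": {"WD"}},
    {"post_id": "Cymbalta.31", "labels": set()},  # "Others": no label applies
]

summary = label_counts_by_class(records)
print(summary)
```

Keeping the post ID on every sentence record is what allows counts like these to be rolled up per review, per drug, or per class without a separate lookup table.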
In the third phase, mentions of ADRs, WDs, SSIs, and DIs were identified and extracted from the sentences, and then classified as physiological, psychological, cognitive, or functional problems. Fig. 4 shows the total frequency of identified ADRs, WDs, DIs, and SSIs broken down by entity type (physiological, psychological, cognitive, and functional problems). Fig. 5 shows the percentage of identified ADRs, WDs, DIs, and SSIs for the entire PsyTAR dataset and for each entity type separately.
In the fourth phase, all the identified entities were mapped to 918 unique UMLS concepts and 755 unique SNOMED CT concepts. Fig. 6 shows the frequency of UMLS concepts for each of ADRs, WDs, DIs, and SSIs. The 3180 unique ADRs identified in the third phase were mapped to 673 UMLS concepts, indicating the high semantic variability of patients' expressions of ADRs [1]. Fig. 7 shows the reduction in the number of identified entities achieved by mapping to the UMLS Metathesaurus concepts.
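The scale of this reduction for ADRs can be worked out directly from the two figures quoted above (3180 unique ADR expressions, 673 UMLS concepts). The short calculation below is only an arithmetic illustration; the variable names are ours.

```python
unique_adr_expressions = 3180  # unique ADRs identified in the third phase
adr_umls_concepts = 673        # UMLS concepts those ADRs were mapped to

# Average number of distinct patient expressions per UMLS concept.
expressions_per_concept = unique_adr_expressions / adr_umls_concepts

# Relative reduction in distinct items after mapping to UMLS.
reduction = 1 - adr_umls_concepts / unique_adr_expressions

print(f"{expressions_per_concept:.1f} expressions per concept")  # 4.7
print(f"{reduction:.1%} fewer distinct items after mapping")     # 78.8%
```

In other words, each UMLS concept absorbs on average between four and five distinct patient phrasings of an ADR.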
In this phase, we also identified qualifiers indicating severity and persistency of identified entities. Fig. 8 shows the frequency of identified qualifiers including "mild", "moderate", and "severe" indicating severity, and "persistent" and "not-persistent" indicating persistency of the identified entities (ADRs, WDs, DIs, SSIs).

Experimental design, materials and methods
The drug reviews were collected from a healthcare forum called "askapatient.com". We developed an Application Programming Interface (API) to collect data from this forum. The sample size was calculated using the sample size formula for qualitative studies [2]. In the next step, the drug reviews were processed to correct grammatical errors and remove personal information (e.g., websites, emails). Then, the reviews were split into sentences, and each sentence was double coded (labeled) for the presence of ADRs, WDs, DIs, SSIs, EF, and INF. The inter-annotator agreement (IAA), calculated using Kappa, was 78% for the entire dataset. In the next phase, mentions of ADRs, WDs, SSIs, and DIs were identified in the relevant sentences. Four annotators identified the boundaries of the entities by strictly following guidelines developed for the entity identification phase. The calculated IAA for entity identification was 86% for the entire dataset. In the last phase, the identified entities were mapped to the corresponding UMLS Metathesaurus concepts and SNOMED CT concepts. All of the identified concepts were reviewed for consistency. The detailed methodology for developing this dataset is discussed in a separate manuscript [1].
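For readers unfamiliar with kappa-based agreement, the sketch below shows how such a score can be computed for one label coded by two annotators. The text above says only "Kappa", so the use of Cohen's kappa for a pair of annotators is an assumption here, and the toy annotations are invented for the example. This is not the procedure or data used to obtain the reported values.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label
    # if each followed only their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented toy annotations: 1 = sentence contains an ADR, 0 = it does not.
annotator_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # kappa = 0.60
```

In this toy case the annotators agree on 8 of 10 sentences (0.8), while chance agreement is 0.5, giving a kappa of 0.6. Raw agreement alone would overstate how consistent the coding is.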