Ranking Adverse Drug Reactions With Crowdsourcing

Background There is no publicly available resource that provides the relative severity of adverse drug reactions (ADRs). Such a resource would be useful for several applications, including assessment of the risks and benefits of drugs and improvement of patient-centered care. It could also be used to triage predictions of drug adverse events. Objective The intent of the study was to rank ADRs according to severity. Methods We used Internet-based crowdsourcing to rank ADRs according to severity. We assigned 126,512 pairwise comparisons of ADRs to 2589 Amazon Mechanical Turk workers and used these comparisons to rank order 2929 ADRs. Results There is good correlation (rho=.53) between the mortality rates associated with ADRs and their rank. Our ranking highlights severe drug-ADR predictions, such as cardiovascular ADRs for raloxifene and celecoxib. It also triages genes associated with severe ADRs such as epidermal growth-factor receptor (EGFR), associated with glioblastoma multiforme, and SCN1A, associated with epilepsy. Conclusions ADR ranking lays a first stepping stone in personalized drug risk assessment. Ranking of ADRs using crowdsourcing may have useful clinical and financial implications, and should be further investigated in the context of health care decision making.

each side of the current bin). The first CR batch comprised 15,721 compared pairs and the second CR batch 29,179 pairs. Each batch was completed in approximately two to four hours, for a total of fifteen hours.
Task generation: The compared pairs in each batch were divided into worker tasks, each comprising ten pairwise ADR comparisons (see Multimedia Appendix 2 for an example). By dividing the entire set of pairs in each batch into five parts, we enabled each MTurk worker to perform between one and five such tasks (10-50 comparisons). Figure S1 shows that most workers compared 10-50 pairs of ADRs (i.e., participated in one batch). To help workers learn about ADRs expressed in unfamiliar medical terminology, the user interface placed a clickable link next to every ADR that queried the ADR name in Google. In order to identify reliable workers, each task of ten pairs included three predefined quality control pairs and seven pairs chosen randomly from the batch-generated pairs. The quality control pairs were constructed by pairing all combinations of a manually selected set of severe ADRs with a set of mild ADRs. In the QC batches we used sixteen severe and sixteen mild ADRs, resulting in 256 quality control pairs; following the initial crude ranking of the first QC batch, we increased this number to 676 quality control pairs in the CR batches. In order to minimize biases, we randomized both the position of each pair within the ten pairs of a task and the order of the two compared ADRs within each pair (including the predefined quality control pairs).
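The task layout described above can be illustrated with a minimal Python sketch. It assumes that the batch pairs are shuffled and cut into groups of seven, and that three quality control pairs are drawn at random for each task; the function names and the placeholder ADR labels are hypothetical and do not come from the study's code.

```python
import itertools
import random

def build_qc_pairs(severe, mild):
    """All severe x mild combinations; the severe ADR is the expected answer."""
    return list(itertools.product(severe, mild))

def build_tasks(batch_pairs, qc_pairs, rng, n_random=7, n_qc=3):
    """Cut the batch pairs into tasks of n_random pairs plus n_qc control pairs."""
    pairs = list(batch_pairs)
    rng.shuffle(pairs)
    tasks = []
    for start in range(0, len(pairs), n_random):
        chunk = pairs[start:start + n_random] + rng.sample(qc_pairs, n_qc)
        # Randomize the left/right order inside each pair ...
        task = [p if rng.random() < 0.5 else (p[1], p[0]) for p in chunk]
        # ... and the position of each pair within the task.
        rng.shuffle(task)
        tasks.append(task)
    return tasks

# Placeholder labels (assumption): 16 severe and 16 mild ADRs, as in the QC batches.
severe = [f"severe_{i}" for i in range(16)]
mild = [f"mild_{i}" for i in range(16)]
qc_pairs = build_qc_pairs(severe, mild)

# Hypothetical batch of 70 pairs, which yields ten tasks of ten pairs each.
batch_pairs = [(f"adr_{i}", f"adr_{i + 1}") for i in range(70)]
tasks = build_tasks(batch_pairs, qc_pairs, random.Random(0))
print(len(qc_pairs), len(tasks), len(tasks[0]))
```

With sixteen severe and sixteen mild ADRs the product gives the 256 quality control pairs mentioned in the text; 26 ADRs in each set would give the 676 pairs used in the CR batches.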
Worker filtering and statistics: Workers were required to have a satisfactory task completion record (fewer than 5% of past tasks rejected, i.e., an approval rate above 95%) and to be located in the United States, as a proxy for English proficiency. Using the unique worker identifier across tasks, we filtered out the tasks of unreliable workers, defined as those who made incorrect choices on more than 20% of the quality control pairs (3% of the 2,589 workers were filtered). Each task of ten pairwise comparisons took five minutes to complete on average and paid the worker 45 cents (half a dollar including Amazon's fee). The entire ranking required 146 person-days at a cost of $6,300.
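The reliability filter can be sketched as follows. This is an illustration only: the record layout (worker identifier, chosen ADR, expected ADR), the function name and the example answers are assumptions, and only the 20% threshold is taken from the text.

```python
from collections import defaultdict

def unreliable_workers(qc_answers, max_error_rate=0.20):
    """Return the workers whose error rate on quality control pairs exceeds the threshold.

    qc_answers: iterable of (worker_id, chosen_adr, expected_adr) records (assumed layout).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for worker_id, chosen, expected in qc_answers:
        totals[worker_id] += 1
        if chosen != expected:
            errors[worker_id] += 1
    return {w for w in totals if errors[w] / totals[w] > max_error_rate}

# Hypothetical answers: w1 gets all three control pairs right, w2 misses one of three.
qc_answers = [
    ("w1", "stroke", "stroke"), ("w1", "stroke", "stroke"), ("w1", "rash", "rash"),
    ("w2", "hiccups", "stroke"), ("w2", "rash", "rash"), ("w2", "stroke", "stroke"),
]
flagged = unreliable_workers(qc_answers)
print(flagged)
```

Because the worker identifier is stable across tasks, the error rate is accumulated over every quality control pair a worker answered rather than per task.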

Ranking ADRs
In order to rank the ADRs based on the pairwise comparisons, we used a linear program that attempts to retain as much of the workers' original rankings as possible (through minimization of the utility function) while ensuring that the ADRs obey the triangular inequality. That is, denoting "more severe" as "greater than", for each ADR triplet A, B and C, if A>B and B>C then it follows that A>C. The linear programming optimization function is formulated as:
$$\min_{x} \sum_{i=1}^{n} \sum_{j \neq i} w_{ij}\, x_{ij} \qquad (1)$$
where n = 2,929 is the total number of ADRs and the variable $x_{ij} \in [0,1]$ is the fraction of the time ADR $i$ is more severe than ADR $j$.
The $w_{ij}$ are the weights for each pair of ADRs. If the pair of ADRs $i$ and $j$ was compared by MTurk workers, $w_{ij}$ reflects this knowledge by taking a value equal to one minus the fraction of times $i$ was selected as more severe than $j$, i.e., $w_{ij} = 0$ and $w_{ji} = 1$ if all workers determined that $i$ is more severe than $j$. Thus, it places a penalty on choosing $x_{ij} < 1$, $x_{ji} > 0$. Untested pairs receive a uniform weight of 0.5.
The constraints bound each $x_{ij}$ in $[0,1]$ and enforce the triangular inequality over every ADR triplet. Finally, the score of each ADR $i$ is a "Borda count" 1, summing over all its pairwise ranks $x_{ij}$.
As the number of variables in this linear programming scheme is quadratic in n (all pairs) and the number of constraints involving triangular inequalities is cubic in n, we found it infeasible to solve for n=2,929 in terms of time and space. We therefore created 8,787 samples of size 100 (three samples per ADR), ranked each sample independently and finally created a global score for all the ADRs. In order to make each sample dense with inter-ADR comparisons, we used an iterative process: starting with each ADR i, we add the ADRs that were compared to i (first tier), then the ADRs compared to those in the first tier, and so on, until the sample reaches 100 ADRs. If the last tier brought the number to more than one hundred, we chose randomly from the last tier in order to reach exactly 100 (yielding three random samples per ADR). Each ADR appeared in 300±87 samples and on average shared at least one sample with 2,925±4 (99.9%±0.1%) of the other ADRs.
Fixing the size of all the samples has the advantage that the constraints need to be constructed only once, since the utility function (equation 1) is the only part of the linear program that depends on the worker comparison results. Computation of each sample took 58 seconds on average. Finally, the ranking of an ADR i is its average ranking across all the samples it participated in. Figure S2 shows that we obtain a stable ranking (Spearman correlation, ρ=0.97) when the number of ranked samples exceeds 1,000.
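The tiered sample construction and the final averaging step can be sketched in Python as below. This is a minimal illustration under stated assumptions: the comparison data are given as a dictionary mapping each ADR to the set of ADRs it was compared with, the per-sample result is a rank for each ADR, and all names and the toy data are hypothetical.

```python
import random

def build_sample(seed_adr, compared_with, size, rng):
    """Grow a sample tier by tier from seed_adr until it holds `size` ADRs."""
    sample = [seed_adr]
    seen = {seed_adr}
    tier = [seed_adr]
    while tier and len(sample) < size:
        # Next tier: ADRs compared to the current tier that are not yet in the sample.
        next_tier = sorted({b for a in tier for b in compared_with.get(a, ())} - seen)
        room = size - len(sample)
        if len(next_tier) > room:
            # The last tier overshoots: draw randomly from it to hit the target size.
            next_tier = rng.sample(next_tier, room)
        sample.extend(next_tier)
        seen.update(next_tier)
        tier = next_tier
    return sample

def average_ranks(sample_rankings):
    """Average each ADR's rank over all the samples in which it appears."""
    totals, counts = {}, {}
    for ranking in sample_rankings:  # each ranking maps ADR -> rank within one sample
        for adr, rank in ranking.items():
            totals[adr] = totals.get(adr, 0.0) + rank
            counts[adr] = counts.get(adr, 0) + 1
    return {adr: totals[adr] / counts[adr] for adr in totals}

# Toy comparison graph (assumption); the study used samples of size 100.
compared_with = {
    "a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "e", "f"},
    "d": {"b"}, "e": {"c"}, "f": {"c"},
}
sample = build_sample("a", compared_with, 4, random.Random(0))
print(sample)
print(average_ranks([{"a": 1, "b": 2}, {"a": 3}]))
```

Calling `build_sample` three times per seed ADR with different random draws would correspond to the three samples per ADR described in the text.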
The linear program was implemented in MATLAB using the IBM CPLEX package (version 12.6) 2.
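For reference, one way to write out the full linear program described in this section is sketched below. The objective follows the weight and variable definitions given in the text. The antisymmetry constraint and the particular algebraic form of the triangular inequality are assumptions, chosen because they are a common relaxation for rank aggregation, and the score symbol $s_i$ is introduced here only for illustration.

```latex
\[
\begin{aligned}
\min_{x}\quad & \sum_{i=1}^{n} \sum_{j \neq i} w_{ij}\, x_{ij} \\
\text{s.t.}\quad
  & x_{ij} + x_{ji} = 1            && \forall\, i \neq j \quad \text{(assumed)} \\
  & x_{ij} + x_{jk} - x_{ik} \le 1 && \forall\ \text{distinct } i, j, k \quad \text{(assumed form)} \\
  & 0 \le x_{ij} \le 1             && \forall\, i \neq j \\[4pt]
s_i \;=\; & \sum_{j \neq i} x_{ij}  && \text{(Borda-style score of ADR } i\text{)}
\end{aligned}
\]
```

Under this formulation the triangular inequality forces $x_{ik} = 1$ whenever $x_{ij} = 1$ and $x_{jk} = 1$. A sample of 100 ADRs has 100 × 99 = 9,900 variables and 100 × 99 × 98 = 970,200 ordered triplet constraints, whereas n = 2,929 would require about 8.6 million variables and on the order of 10^10 triplet constraints, which is consistent with the infeasibility reported above.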

Optimizing equivalence classes
Using independent rankings computed from each of the repeated batches, we obtained the standard deviation per ADR rank. Based on these standard deviations, we selected ADR equivalence classes using balanced one-way analysis of variance (ANOVA). In order to partition the ADRs into classes, we used a two-step greedy algorithm. In the first step, we divided the ADRs into k equal-sized bins and selected the k with the minimal p-value (k=12). In the second step, we merged the twelve equal-sized bins into six classes by randomly selecting and merging adjacent classes and computing the ANOVA p-value of the result. We tested both a gradient descent approach and a simulated annealing approach. In the gradient descent approach, we merged classes if the resulting p-value was lower than before the merge; this approach was repeated 1,000 times to avoid local minima. The simulated annealing approach was run for 1,000 iterations. Both methods converged to the same final six classes with minimal ANOVA p-value.
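The building blocks of this partitioning step can be sketched in Python. The text does not spell out exactly which observations entered the ANOVA, so the sketch assumes that each class is a contiguous group of ADR scores; it computes the one-way F statistic rather than the p-value, and the function names and toy scores are hypothetical.

```python
def equal_bins(values, k):
    """Split a sorted list into k contiguous bins of (nearly) equal size."""
    n = len(values)
    edges = [round(i * n / k) for i in range(k + 1)]
    return [values[edges[i]:edges[i + 1]] for i in range(k)]

def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def merge_adjacent(groups, i):
    """Merge bin i with bin i + 1, leaving the other bins untouched."""
    return groups[:i] + [groups[i] + groups[i + 1]] + groups[i + 2:]

# Toy scores (assumption): four well-separated clusters of three ADRs each.
scores = [1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33]
bins = equal_bins(scores, 4)
print(anova_f(bins))                     # F statistic for the four-bin partition
print(anova_f(merge_adjacent(bins, 0)))  # F statistic after merging the first two bins
```

Turning the F statistic into a p-value requires the F distribution with (k - 1, n - k) degrees of freedom, for example via `scipy.stats.f.sf`; a greedy search would then accept a merge only when that p-value decreases.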

Computing semantic similarity for ADRs
The semantic similarity between two ADRs is computed based on the hierarchical structure of HPO. As an example, vocal cord paresis has a similar semantic meaning to one of its symptoms, hoarseness. Following the method proposed by 3, we computed the information content IC(c) of each concept c in the hierarchy. Similarity between two concepts c1 and c2 in HPO with the most informative common ancestor m is computed using the measure of 4:
$$\mathrm{sim}(c_1, c_2) = 1 - \left(\mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2\,\mathrm{IC}(m)\right).$$
We combined the 3,730 mappings between HPO concepts and Unified Medical Language System (UMLS) 5
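The similarity measure above can be illustrated with a short Python sketch. The hierarchy and the information content values are invented for illustration and are not taken from HPO; the information content of each concept is supplied as an input rather than computed, and it is assumed to be scaled so that the similarity stays within [0, 1].

```python
def ancestors(concept, parents):
    """Return the concept together with all of its ancestors in the hierarchy."""
    found = {concept}
    stack = [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def similarity(c1, c2, parents, ic):
    """1 - (IC(c1) + IC(c2) - 2 * IC(m)), with m the most informative common ancestor."""
    common = ancestors(c1, parents) & ancestors(c2, parents)
    m = max(common, key=lambda c: ic[c])
    return 1 - (ic[c1] + ic[c2] - 2 * ic[m])

# Toy hierarchy (assumption): A1 and A2 are children of A; A and B are children of root.
parents = {"A1": ["A"], "A2": ["A"], "A": ["root"], "B": ["root"]}
ic = {"root": 0.0, "A": 0.25, "A1": 0.5, "A2": 0.5, "B": 0.25}

print(similarity("A1", "A2", parents, ic))  # siblings sharing the informative ancestor A
print(similarity("A1", "B", parents, ic))   # concepts that share only the root
```

Sibling concepts under a specific ancestor score higher than concepts whose only common ancestor is the root, which matches the intuition of the vocal cord paresis and hoarseness example.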

Multimedia Appendix Legends
Multimedia Appendix 1. An example of a comparison presented to an MTurk worker.
Multimedia Appendix 2. (Table S1). The MTurk workers' pairwise comparisons used to compute the ranking.
Multimedia Appendix 4. (Table S2). Ranked list of ADRs with their reported frequency.
Multimedia Appendix . The correlation between ADR semantic similarity and the mean difference in severity scores, computed for 793 ADRs.
Multimedia Appendix 6. (Table S3). Top prescribed drugs in 2013 that have novel severe ADRs in the OFFSIDES database.