Machine learning prediction of side effects for drugs in clinical trials

Summary Early and accurate detection of side effects is critical for the clinical success of drugs under development. Here, we aim to predict unknown side effects for drugs with a small number of side effects identified in randomized controlled clinical trials. Our machine learning framework, the geometric self-expressive model (GSEM), learns globally optimal self-representations for drugs and side effects from pharmacological graph networks. We show the usefulness of the GSEM on 505 therapeutically diverse drugs and 904 side effects from multiple human physiological systems. Here, we also show a data integration strategy that could be adopted to improve the ability of side effect prediction models to identify unknown side effects that might only appear after the drug enters the market.


INTRODUCTION
Side effects of drugs are typically identified through randomized controlled clinical trials. It is well known that many side effects cannot be observed during clinical trials due to limitations in sample size and time frames. Postmarketing surveillance programs, such as the Adverse Event Reporting System (AERS), were designed to assist in the identification of side effects after the drug entered the market. However, the late identification of drug side effects is known to cause high morbidity and mortality in public healthcare, 1,2 the re-assessment of drug safety through new clinical trials, 3 and the possible withdrawal of drugs from the market. 4 A wide range of computational approaches have been proposed to predict the side effects of drugs at different stages of the drug development process (see reviews by Ho et al. 5 and Boland et al. 6 ). The first group of methods is applicable during pre-clinical drug development when only chemical, biological, and pharmacological information is available. These methods exploit chemical features, 7-11 protein targets, 12 and pathway information, 13 often in combination with protein networks, 14 and, in general, they offer a modest accuracy. A second group of methods was proposed for the postmarketing phase of drug development. [15][16][17][18][19] These methods exploit the side effects collected in clinical trials and the postmarketing phase to predict other unknown side effects. Our study differs from these methods in that we assumed that only side effects identified during clinical trials are available. This represents a more challenging scenario due to information sparsity and selection bias. 20,21 Our goal is 2-fold: (1) to simulate the realistic scenarios faced by safety professionals working in clinical drug development and (2) to provide a computational tool that can assist in the early detection of side effects of drugs undergoing clinical trials.
A critical application of our approach is during the different phases of clinical trials, where computational predictions can be used as a hypotheses generator to set the direction of the risk assessment. Our approach uses a matrix completion model that we called the geometric self-expressive model (GSEM). This is based on our objective function and multiplicative learning algorithm, which learns globally optimal solutions. Our model MOTIVATION Drug side effects cause significant morbidity and mortality in healthcare. Side effects are discovered and added to the drug label during randomized controlled trials, but, due to trials' limited sample sizes, severe side effects are often discovered after the drug enters the market. An important question is whether we could use artificial intelligence to predict unknown side effects using the side effects identified during drug clinical trials. We studied this problem and developed a machine learning framework for predicting side effects for drugs undergoing clinical development. exploits known drug side effect associations and integrates graph structure information from chemical, biological, and pharmacological data. Here, we also show that predicting side effects that were identified after the drug entered the market from the information available during clinical trials is challenging. We attributed this to a distribution shift in side effect reports between clinical trials and postmarketing. This observation motivated a simple data integration technique that can be used to significantly improve the performance of GSEM at identifying side effects that might appear after the drug enters the market.

GSEM
Our starting point is the n 3 m drug side effect association matrix X, where x ij = 1 if drug i is known to induce side effect j, or x ij = 0 otherwise. Drugs can be related by their similarities in chemical structure, biological targets, and pharmacological activity. Side effects can also be related by their similarities in anatomical/ physiological phenotypes. Our method integrates drug and side effect information by learning two similarity matrices: a drug similarity matrix H˛R n3n such that XxHX and a side effect similarity matrix W such that XxXW. The GSEM generates scores for each drug-side effect pair by linearly combining these models: b X = HX + XW: (Equation 1) The first term in Equation 1 is the drug self-representation model, and the second term is the side effect self-representation model. To where w i and w j represent column vectors of W and L = D À A is the graph Laplacian with D = diagð P j a ij Þ. The constant values m i > 0 in Equation 2 weigh the importance of the smoothness constraint for the prediction. When multiple graphs are combined, the parameters m i in Equation 2 tell us about the contribution and importance of the individual graph information for the prediction model. The fourth term in Equation 2 is a penalty for diagonal elements to prevent the trivial solution W = I (the identity matrix). Typically, g [ 0 is used. The last constraint in Equation 2 is a non-negative constraint, 27 which is added here to favor interpretability of the learned W. Figure 1 depicts an overview of our GSEM. The starting point is the matrix X containing binary associations encoding the presence or absence of drug side effects. The GSEM learns the self-representation matrices H and W that minimize our loss functions in Equation 3 and 2, respectively, by employing an iterative algorithm that uses a simple multiplicative update rule (see STAR Methods). Our algorithm is inspired by the diagonally rescaled principle of non-negative matrix factorization. 27 GSEM is fast to run, and it does not require setting a learning rate or applying a projection function. Our algorithm also satisfies global guarantees of convergence given by the Karush-Kuhn-Tucker (KKT) complementary conditions (proof in Methods S2). Having learned independently H and W, we calculate b X = HX + XW. Notice that while X contains binary values ½0; 1 that correspond to our original data, b X contains real positive numbers that are our predicted scores.

Overview of evaluation
To obtain side effects identified in clinical trials, we followed the procedure in Galeano et al. 28 to retrieve side effects reported in randomized controlled studies from the Side Effect Resource (SIDER) 4.1. 21 27,610 associations were obtained for n = 505 marketed drugs and m = 904 unique side effect terms. We also collected side effects identified after the drugs entered the market from two independent sources. 6,818 side effects reported in the postmarketing section of drug leaflets were obtained from the SIDER database (SIDER postmarket set). 25,797 statistically significant side effects reported in the AERS were obtained from the OFFSIDES database 29 (OFFSIDES postmarket set). The collection of drug side effect data used in our study is shown in Figure 2A.
Our goal is to assess the performance of the GSEM at predicting unknown side effects for drugs with a small number of side effects identified in clinical trials. Therefore, only side effects identified in clinical trials were used for training the model. Figure 2B illustrates how the clinical trials' side effects were randomly split into training, validation, and testing sets.
Following previous approaches, [15][16][17][18][19] we framed our problem as a binary classification problem and used the area under the receiving operating curve (AUROC). The validation set consisted of 10% randomly held-out clinical trials side effects and randomly selected negatives of twice the number of positives. We used the validation set to tune the model hyperparameters. We then performed the evaluation by training the model with the combined training and validation sets using the optimal hyperparameters. We measure the AUROC and the area under the precision-recall curve (AUPR) on three test sets (see We compared the prediction performance of the GSEM with a representative number of side effect prediction models that can also be applied to our problem: (1) matrix factorization (MF); 16 (2) predictive pharmacosafety networks (PPNs); 15 (3) inductive matrix completion (IMC); 17 and (4) feature graph-regularized MF (FGRMF). 18 Each side effect prediction model integrates different types of complementary information about drugs and side effects. We collected and used five types of side information for our study. For drugs, we obtained the chemical structure and protein targets from DrugBank, 30 indications from the Drug Repositioning Hub, 31 and Anatomical, Therapeutic, and Chemical (ATC) classification (see STAR Methods). We used MACCS fingerprints 32 to represent chemical structure and computed Tanimoto similarity using RDKit. 33 For side effects, we obtained the Medical Dictionary for Regulatory Activities (MedDRA) terminology. To build graphs from the different side information, we calculated the adjacency matrices using similarity measures (see STAR Methods). For the ATC and MedDRA terms, we also obtained their corresponding hierarchies to calculate taxonomy similarities that have been used by previous approaches. 15,17 Evaluation of prediction performance on multiple drugs Figure 3A shows the AUROC performance of the side effect prediction models at recovering missing drug-side effect associations in the held-out test set. Following a common practice in the literature, 15 Figure 1. Geometric self-expressive model (GSEM) 27,610 associations identified on clinical trials for 505 drugs and 904 side effects were collected from the SIDER 4.1 database. The associations were arranged into an n 3 m matrix X by encoding their presence ð = 1Þ. Unknown associations were encoded with zeros ð = 0Þ. Our algorithm learns two similarity matrices that model the two pharmacological spaces of drug side effects. H (of size n 3 n) encodes similarities between drugs that are learned from drug networks built from chemical, indication, target, and taxonomy similarities. W (of size m 3 m) encodes similarities between side effects that are learned from physiological relationships between side effects. The GSEM learns independently H and W such that XxHX and XxXW. By linearly combining these models, HX + XW, we obtain b X , which models X, and where all the entries are replaced by real numbers-these are our predicted scores. Note that values replacing zero entries in X will constitute our predictions. Rows of H are drug self-representations, and columns of W are side effect self-representations. The lower illustration depicts how our model discovers a drug self-representation vector for the anti-diabetic drug metformin, and a self-representation vector for the side effect myocardial infarction (MI), such that the dot product of these vectors with the binary vector corresponding to known drugs for MI and known side effects of metformin, respectively, models the presence/absence of MI in patients on metformin. The body parts infographic vector was created by macrovector www.freepik.com.   Figure 3A). Second, if possible, one side information at a time together with X was integrated into the model to assess its contribution to the overall performance (second to fifth rows in Figure 3A). In these experiments, we run each method with the side information types proposed in the original publications (see Methods S1). Finally, if the original publications proposed a way to integrate multiple information types (more than one) in their framework, we implemented them, and their performance is shown in the last row of Figure 3A. Notice that the GSEM, as proposed in Equations 3 and 2, is a model that allows for the integration of multiple types of heterogeneous information.
On the held-out test set with other side effects identified in clinical trials, the GSEM outperforms all the competitors by 1.4%-13.3%. Even when training GSEM using the training matrix X alone, i.e., without side information, the GSEM achieves 0.940 in terms of the AUROC. This baseline performance can be slightly improved using side information for drugs and side effects. Other methods, such as PPNs 15 and IMC, 17 also show a similar trend; therefore, side information should be used when available. In addition, we observed that while the competitors' performance is more sensitive to the specific choice of side information, the performance of the GSEM displays a small variability across information types. The mean and SD AUROCs in the held-out test set are 0.9421 ± 0.0012 (GSEM) versus 0.9079 ± 0.0207 (FGRMF), 0.8405 ± 0.0026 (IMC), and 0.9239 ± 0.0212 (PPNs). GSEM also consistently outperforms the competitors in terms of the AUPR ( Figure S1).
We then tested our method in a more realistic scenario using a simulated prospective evaluation similar to the one used by Cami et al. 15 In this procedure, all side effects identified after the drugs entered the market were used as a test set (postmarket test sets in Figure 2B). Figures 3B and 3C show the prediction performance of the methods in postmarketing test sets. The GSEM outperforms the competitors by 1.5%-14.8% in the SIDER postmarket test set and by 0.7%-4.6% in the OFFSIDES postmarket test set.
Interestingly, the GSEM offers the best prediction performance in both prospective sets when combining all available side information. Following Cami et al., 15 we further asked whether the performance of the models varies for drug-or side effect-specific categories. We performed a second evaluation where we used the best-performing models of each column of Figure 3A to analyze the performance of a specific group of drugs and side effects (see STAR Methods). Figure 3D shows the AUROC performance of the models for drug-specific anatomical categories according to their primary ATC classification. For most categories, the GSEM's mean AUROC was above 0.75 in the SIDER postmarket test set-we obtained the lowest AUROC performance for nervous system drugs (0.706) and the highest performance for respiratory system drugs (0.852). In the OFFSIDES test set, the mean AUROC was above 0.55 for all the categories. The performance of the models for the side effect-specific MedDRA category of disorders are shown in Figure S2.
Distribution shifts in side effects reported before and after the drugs enter the market An important observation from Figures 3A-3C is that there is a considerable difference in AUROC performance when predicting other side effects from clinical trials (GSEM AUROC of 0.944) versus postmarketing (GSEM AUROCs of 0.728 and 0.618 in the SIDER and OFFSIDES postmarket sets, respectively). These differences cannot be explained by the specific method used or the type of side information used in the integration. The differences in prediction performance prompted us to ask whether they can be explained by a distribution shift in side effect reports before and after the drug enters the market.
To analyze differences in reporting trends, we defined the ratio of reporting frequency (RRF) as the normalized count of drugs associated with a given side effect (see STAR Methods). The RRF reflects whether a side effect has been associated with many or few drugs in our dataset. For instance, nausea, a side effect reported on most drugs, has an RRF of 1.0, while eye infection, reported only on a few drugs, has an RRF of 0.011. We contrasted the RRF of each side effect computed using clinical trial associations versus postmarketing associations. Figures 4A and 4B show that side effects reported in clinical trials and postmarketing follow a different trend. A side effect reported on a small number of drugs in clinical trials (low RRF in the x axis) can be reported on many drugs in the postmarketing phase. This trend is even more prominent in the OFFSIDES postmarket set. For comparison, the expected trend without distribution shift is shown in Figure 4C for a held-out set from clinical trials associations (Pearson, r = 0:923, p < 2.23 3 10 À308 ). Our results suggest differences in reporting trends between drug side effect associations reported in clinical trials and the postmarketing phase.
We further explored whether there are statistically significant differences in RRF values for drug anatomical classes and side effect disorder types. We grouped drugs by their main ATC classification and compared distributions of RRF values based on the known side effects reported in different sets (see STAR Methods). Figure 4D shows that for the majority of drug categories, the side effects that were reported in clinical trials tend to be biased toward frequently reported side effects except for nervous system drugs. Conversely, while the SIDER postmarket set tends to be more significant toward rarely reported side effects in clinical trials, the OFFSIDES set was more significant for frequently reported side effects. We repeated our statistical analysis by grouping side effects based on their main MedDRA category of disorders. Figure 4E shows that side effect categories are significant toward rarely reported side effects, i.e., low RRF values.
A fundamental assumption in machine learning is that the training and testing sets are drawn from the same underlying distribution. 34 Our analysis in Figure 4 shows that this is not the case for our problem. We hypothesized that the distribution shifts in side effect reports between clinical trials and postmarketing could explain the differences in prediction performance that we observed in Figures 3A-3C. It would imply a dependency between the AUROC performance and the RRF values of the side effects in the test set. To explore this dependency in more detail, we calculated AUROC values for single drugs on the SIDER postmarket test set. Figure 5 shows a correlation between prediction performance and the RRF values of the side effects we are trying to predict. A positive correlation is observed for all the methods, suggesting that each drug's  Reported side effects in OFFSIDES have even lower RRF values than those in SIDER (see Figure S3), thus explaining the differences in AUROC performance between SIDER and OFFSIDES postmarket sets in Figures 3B and 3C, and Figure S4 shows that the AUROC per drug varies by category depending on the RRF values of the side effects in the postmarketing test sets.
A data integration technique to improve prediction performance We propose a simple data integration technique to improve the prediction performance of side effect prediction models for individual drugs. Our idea is based on the observation that the effect of the distribution shift can be reduced if we integrate postmarketing data into the training matrix X. Figure 6B shows that the RRF values of specific side effects can be improved using postmarketing information in training. Figure 6A illustrates our evaluation procedure for single drugs. For a given drug x, we used its clinical trials side effects for training and its combined SIDER and OFFSIDES postmarketing side effects for testing. Then, we assessed the AUROC performance using two strategies that differ in the information used for the other drugs. The first strategy uses only side effect associations reported in clinical trials. The second strategy uses side effect associations reported in clinical trials and postmarketing. To prevent data leakage, we removed other chemically similar drugs from the training matrix X (see STAR Methods). Notice that for both strategies, we trained each method using the same set of optimal hyperparameters obtained in the validation set, as shown in Figure 2. Figures 6B and 6C shows the AUROC performance of the side effect prediction models using strategies 1 and 2. The inclusion of the postmarketing side effects for the other drugs used for training dramatically affected the prediction performance for single drugs. The mean AUROC improved from 0.604 to 0.667 for MF; 0.512 to 0.537 for IMC; 0.596 to 0.650 for FGRMF; 0.60 to 0.733 for PNN; and 0.616 to 0.746 for the GSEM. Our method shows a 13% performance improvement using strategy 2.  Article ll OPEN ACCESS

Self-representations capture biological relationships
Two properties make the GSEM an interpretable and reproducible model. First, the GSEM is interpretable because the predicted score can be explained in terms of learned similarities between drugs and side effects. Second, the GSEM's solutions are reproducible because the learned solution is a globally optimal solution of its objective function. The GSEM overcomes the common problem of machine learning models that learn different solutions even when training the same model with a different random initialization, which is persistent in deep-learning models. 35 The GSEM's predicted score for a drug i and side effect j can be written as follows: where H and W are non-negative. The first term in Equation 5 contains the learned similarities between drug i and the drugs known to cause side effect j. The second term in Equation 5 contains the learned similarities between side effect j and the side effects known to be caused by drug i. If any of the individual terms in the sum is high, the prediction score b X ij will be high because the model allows only for summation and not the subtraction of terms.
We hypothesized that the learned H can capture biological relationships between drugs. Following a similar procedure to Cheng et al., 36 we assessed whether our drug similarity measure, defined as ðH + H T Þ=2 (see STAR Methods), reflects known chemical, biological, and pharmacological relationships between drugs. To be sure that there is no information leakage, we trained the GSEM using all available clinical trials and postmarketing information (encoded in X) but without any side information (i.e., m i = a j = 0ci; j) (see STAR Methods). We found that our drug similarity based on H correlates with chemical, indication, target, and ATC taxonomy similarities ( Figure 7B). Interestingly, our drug similarity was also indicative that the drugs were pharmacologically similar (ATC taxonomy similarity above 0.05) or distinct (below 0.05). Our results suggest that the matrix H in our model could capture chemical, biological, and pharmacological relationships between drugs.
We also tested whether W could capture the anatomical/physiological relationships between side effect phenotypes, as defined by the MedDRA taxonomy similarity (see STAR Methods). We defined side effect similarities based on W as ðW + W T Þ=2 (see STAR Methods). We found that the side effect similarities based on W correlate with the MedDRA taxonomy similarity ( Figure 7B, bottom). We observed that phenotypically similar side effects tend to have similar self-representations. The similarity also indicates whether side effects are anatomically/physiologically similar (MedDRA taxonomy similarity above 0.05) or distinct (below 0.05).
To showcase how the learned matrices allow for interpretability, we explored the weights in W for two side effects: (1) myocardial infarction (MI), which has been associated with the withdrawal of many drugs from the market, 4 and (2) blurred vision. Figure 7A shows a diagram of the side effects that are more similar to MI and blurred vision based on the weights in W. We observed that MI is very similar to other vascular-related disorders, including angina pectoris, which has been shown to appear prior to MI. 37 46 drugs in our dataset are known to be associated with both angina pectoris and MI, which might explain the learned association. On the other hand, blurred vision, which is classified in MedDRA as both an eyes and nervous system disorder, is also very similar to other related conditions, including psychiatric disorders. The learned matrix W allows for a transparent inspection of how the model arrived at a given prediction. If a drug is known to induce MI, our model predicts that the drug might also induce similar side effects, as shown in Figure 7A.

DISCUSSION
Here, we introduced the GSEM, a computational approach for predicting the side effects of drugs during clinical drug development. Instead of waiting for postmarketing observational evidence to be accumulated, our framework can be used to assist drug safety professionals in the identification of side effects during drug clinical trials. To show this, we trained the models with side effects identified in clinical trials and tested them to predict side effects identified in the postmarketing phase. To our knowledge, this is the first attempt to predict the presence or absence of side effects for drugs with a small number of side effects identified in clinical trials. Our framework can be used together with our recent approach to predict the frequencies of drug side effects in patients. 28 These tools can be helpful in the early detection of rare side effects that cannot be effectively captured in small-size clinical trials.
Our analysis indicated that predicting side effects that were identified after the drugs entered the market is difficult when training only with side effects identified during clinical trials. Part of this difficulty lies in the differences in the distribution of side effects reported in clinical trials and in postmarketing. Scarcely reported side effects during clinical trials tend to be highly reported in postmarketing, thus explaining the models' difficulty at predicting them. We further studied this issue by analyzing the dependency between the number of drugs associated with a side effect (RRF value) and the prediction performance of machine learning models (see Figure 5). Our experiments showed that the prediction performance of the models heavily depended on the RRF value of the side effects we were aiming to predict. Strikingly, improving the RRF value of each side effect by adding information from postmarketing reports was more critical for improving the prediction of postmarketing side effects than the use of any drug or side effect features.
The problem of distribution shift in side effect reports is deeply connected to the intrinsic distributional properties of drug side effects. In a previous study, 28 we have shown that drug side effect reports follow a long-tailed distribution. The distribution can be summarized in a Pareto 80/30 rule, where 80% of the associations come from 30% of the side effects. 28 Unfortunately, this means that the amount of labeled information (captured by RRF), vital for machine learning models, varies per side effect, following an almost exponential distribution. It would be essential to consider the dependency between prediction performance and side effect RRF when evaluating computational models that aim to predict drug side effects. An innovative aspect of our algorithm is that it learns similarities between drugs (matrix H) and between side effects (matrix W). Our model is fundamentally different from previous side effect prediction models. A PPN 15 is a network-based method that builds topological features from the bipartite drug-side effect graph. The graph is obtained when connecting the nodes representing drugs to the set of nodes representing side effects. PPNs also integrate chemical, taxonomic, and biological features and then use a logistic regression model to predict. MF 16 is a matrix decomposition model that learns a low-dimensional feature vector for each drug and a low-dimensional feature vector for each side effect such that the dot product between the vectors models an entry in X. It amounts to a low-rank approximation of X. Similarly, FGRMF 18 uses several low-rank approximation models for each drug side information graph that are integrated into the model using the smoothness constraint. [24][25][26] The final FGRMF score is the probability given the logistic regression that combines the scores of the individual low-rank models. Finally, IMC 17 is an IMC model that integrates drugs and side effect features in the matrix decomposition model. A detailed description of the mathematical formulation of each competitor, together with their implementation and optimization, can be found in Methods S1.
GSEM builds upon the recent development of high-rank matrix completion based on self-expressive models (SEM) 38  Article ll OPEN ACCESS linear methods, 39 as well as the recent trend of deep learning on graphs. 26,40,41 SEMs represent data points, e.g., drugs, approximately as a linear combination of a few other data points. Elhamifar 38 proposed SEMs as a framework for simultaneously clustering and completing high-dimensional data lying in the union of lowdimensional subspaces. It has been shown that SEMs generalize over standard low-rank matrix completion models, 42,43 which might explain why the GSEM outperforms previous approaches that have been proposed to predict drug side effects based on low-rank matrix decomposition. [16][17][18] Self-representations naturally allow the integration of graph-based information about drugs or side effects. Our model is also related to non-negative MF (NMF). 27,44 They differ, however, in two main aspects. First, while NMF learns two low-rank matrices to represent the input data, the GSEM learns a single null-diagonal matrix that allows for a high-rank matrix. 38 Second, while the NMF objective function is non-convex, we proved that our objective function is convex and that our algorithm converges to a globally optimal solution. Our framework could be easily applied to proprietary datasets of drug side effects by following our procedure illustrated in Figure 2. The GSEM is fast to run, and its prediction performance is robust to the specific choice of hyperparameters (see our analysis in Figure S5). Applying our model for a compound undergoing clinical trials is as easy as adding the new compound information in a new row in X. We started investigating the potential of the GSEM for drug repositioning, 45 and we envision applying our algorithm to other open problems in biology, chemistry, and medicine, such as drug target prediction 46 or antiviral drug effect prediction. 47 To assist scientists working in clinical drug development in their difficult task, we provide the code to run our algorithm (https://github.com/ paccanarolab/GSEM), the predictions for the 505 drugs used in our study (supplementary dataset 4 in Galeano and Paccanaro 48 ), and the learned matrices that can help to interpret the predictions (supplementary datasets 5 and 6 in Galeano and Paccanaro 48 ).
Whenever machine learning models support high-stakes decisions, it is desirable to have inherently interpretable models. 49 We have shown that the learned matrices in our model capture biological and pharmacological relationships between drugs and physiological relationships between side effect phenotypes. However, the medical, biological, or pharmacological interpretation of the relationships requires expert biological and medical knowledge. In the supplemental information, we also discussed the differences between the interpretability capabilities of the GSEM and our latent factor model for predicting the frequencies of drug side effects 28 (see Methods S3).

Limitations of the study
We run our method only for drugs with at least five side effects identified in clinical trials. A limitation of expanding our analysis is the lack of standardized datasets that classify side effects depending on the phase of the clinical trial in which it was identified.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following: Postmarketing side effects Two test sets of postmarketing side effects were collected. The first set was obtained from the SIDER 4.1 database, 21 from which we retrieved 6,818 postmarket associations (labels 'postmarketing' in SIDER) -it corresponds to side effects reported in the postmarketing section of drug's leaflets. The second set was obtained from the OFFSIDES database, 29 from which we retrieved 25,797 ''significant'' associations -it corresponds to statistically significant postmarketing side effects reported in the Adverse Event Reporting System (AERS).

Drug-target interactions
We retrieved the known drug-target interactions from DrugBank release 5.1 30 . We mapped the drugs from SIDER to DrugBank using the PubChem IDs and the mapping provided in DrugBank. We retrieved molecular targets (section 'targets' of DrugBank) for the 505 drugs in our dataset. In total, 1,983 associations were found between the 505 drugs and 755 unique protein targets.

Chemical fingerprints
We retrieved the known drug SMILES notations from DrugBank release 5.1 30 . For the 505 drugs in our dataset, we could obtain a binary MACCS fingerprint using the open source RDKit python library. 33 MACCS are 167 bit structural key descriptors in which each bit is associated with an SMARTS pattern. 32

Drug indications
We retrieved drug indications from the Drug Repositioning Hub database 31 (accessed on 05/02/2020). Drug indications in the Drug Repositioning Hub were manually annotated. In total, 1,021 associations were found between the 505 drugs and 354 unique indications.

ATC information
We retrieved Anatomical, Therapeutic and Chemical (ATC) codes for each of the 505 drugs from the WHO proprietary dataset release 2018.
The datasets and similarity values used to implement GSEM are provided in Supplementary Dataset 1, 2 and 3.

Side effect ratio of reporting frequency (RRF)
The side effect ratio of reporting frequency is a normalized count of the number of drugs associated with a given side effect. For a given side effect j, the RRFðjÞ is defined as follow: where x ij in the entry ði; jÞ of the matrix X, n is the total number of drugs, and Z = maxf P n i x i1 ; P n i x i2 ; .; P n i x im g is the maximum number of associations for the side effects. When using only drug side effect associations from clinical trials, Z = 375.

Similarities in side information graphs
To build the graphs for drugs, we computed similarities from the side information features. Given a set of feature elements U associated with drug u (e.g. chemical fingerprints) and a set of feature elements V associated with drug v, the Jaccard similarity between u and v is given by: where j:j denotes the cardinality of the sets. The Jaccard similarity is bounded 0 % J ðu; vÞ % 1. The Jaccard similarity was used for the chemical, indication and target drug features. Three weighted and symmetric adjacency matrices A chem ; A Ind ; A DT were then obtained for each side information type. The Jaccard similarity of the chemical fingerprints is also known as the 2D Tanimoto Chemical similarity.
For the ATC side information, we followed Cami et al. 15 and calculated taxonomy similarities between drugs based on the shortest path between their set of ATC codes in the ATC hierarchy. ATC has four different levels, and each drug was annotated by its corresponding ATC codes in the lower level of the hierarchy. Given two drugs u and v, the ATC taxonomy similarity between the drugs was then calculated as follow: where ATC u and ATC u correspond to the set of ATC codes of drug u and v, respectively; SPðATC u ; ATC v Þ is a function that calculates the shortest path between the set of ATC annotations; and U is the set of drugs. In the ATC hierarchy, the smallest value of the shortest path between drugs is 2 and the largest is 8. The ATC taxonomy similarity between two drugs is a number between 0 and 1. We also obtained an adjacency matrix between drugs ðA ATC Þ based on the ATC taxonomy similarity. In total, four drug graphs were used in our model in Equation 2. The adjacency matrices for each of those graphs correspond to A chem ; A Ind ; A DT and A ATC . For side effects, we used one side information only in Equation 3. We computed the MedDRA taxonomy similarity using the MedDRA hierarchy following the same procedure used for calculating the ATC taxonomy similarity.

OPEN ACCESS
Model selection and evaluation for multiple drugs To evaluate each model for multiple drugs, we built a held-out test set by randomly sampling 10% of the known associations in X containing clinical trials side effects. The held-out test set contained 2,761 associations (positive class). To obtain the zeros for the test set (negative class), we randomly sampled twice the number of positives from the zero entries of X that were not in the test set. To set each of the model parameters, we randomly sampled 10% of the remaining entries in X, and placed them on a validation set. The negative class for the validation set was also build by the same negative sampling procedure used for the held-out test set. The validation set contained 2,484 associations and the training set contained 22,365 associations. We used the validation set for model selection. Model parameters were selected according to the Area Under the Receiver Operating Curve (AUROC) in the validation set. The details of the implementation of each model and the grid search for the model parameters is explained in Methods S1.
To assess the performance of the models in the held-out test set, we used the best set of parameters for each model and re-trained the models using all the combined training and validation sets. Then, the model was used to assess the performance in the held-out test set. To assess the performance of the model in the postmarketing test sets, we trained the model with the best set of parameters obtained from the validation set and by considering all the available data from clinical trials, that is, a total of 27,610 associations. In our evaluations for multiple drugs, we also reported the Area Under the Precision-Recall Curve (AUPR).

Performance evaluation for single drugs
When evaluating the performance of our method on single drugs, we trained the model using the following parameters a = 60; b = 0; m chem = 0:1; m Ind = 0:5; m DT = 0:01; m ATC = 5; g = 10 4 ; c = 40; d = 0:5; a MedDRA = 0:5. The procedure for each case presented in the manuscript is as follow: Evaluation by groups of drugs We trained our method using only clinical trials side effects. The performance of the model was then assessed for each drug on whether the model was able to predict the postmarketing side effects from of all the possible side effects -these correspond to the entries in a row of X that had values of zeros in training. We performed this evaluation for drugs with at least ten associations in the testing sets. We used the area under the receiving operating curve (AUROC) to measure the performance of the model. The performance was then reported by grouping drugs according to their main Anatomical, Therapeutic and Chemical (ATC) categories. Evaluation by groups of side effects We followed the same procedure described for groups of drugs. The difference is that for side effects, we assessed the performance for each side effect, by predicting postmarketing associations for a given column of X. The performance was then reported by grouping side effects based on their main MedDRA category of disorders. Evaluation when including postmarketing associations in training For each drug, we used its clinical trials side effects for training and used its SIDER and OFFSIDES postmarketing side effects for testing. For the remaining drugs in X, we also included its SIDER and OFFSIDES postmarketing associations. To prevent biases in the evaluation due to the presence of drug analogs, we removed the drugs in X that were above a Tanimoto chemical similarity threshold of 0.6 -this threshold had been used before to separate chemically similar from dissimilar drugs. 50 Multiplicative learning algorithm To minimize Equations 2 and 3 subject to the non-negative constraints W; H R 0, we developed efficient multiplicative algorithms inspired by the diagonally re-scaled principle of non-negative matrix factorization. 27,44 The algorithm consists in iteratively applying the following multiplicative update rules: where W and H are initialized as random dense matrices uniformly distributed in the range ½0; 0:01. The stopping criteria of our algorithm was based on the maximum tolerance of the relative change in the elements of W and H. The default value was TolX < 10 À 2 , that occurred typically in about 50 iterations. We proved that the iterative application of Equations 9 and 10 converges to a global optimal solution point by showing that the multiplicative learning rule satisfies the Karush-Khun-Tucker (KKT) conditions of convergence and that the objective functions are convex (Proofs in Methods S2). Self-representation similarity Given the drug self-representation matrix H, we defined the similarity between drugs as follow: The similarity between side effects was defined similarly: Interpretability procedure Following Cheng et al., 36 we analyzed whether the drug self-representation similarities, as captured by S H , capture the known chemical, biological and pharmacological relationships between drugs. For chemical relationships we used the 2D Tanimoto chemical similarity between drugs, for biological we used drug targets similarities, and for pharmacological relationship, we used the ATC Taxonomy and indications similarities. We also analyzed whether the side effect self-representation similarities, as captured by S W , reflects the physiological relationship between the side effect phenotypes. For this analysis, we used the MedDRA taxonomy similarity.
To analyze the self-representations, we trained our model without side information graphs, i.e. with the parameters a = 70;b = 0; m chem = 0;m Ind = 0;m DT = 0;m ATC = 0;g = 10 4 ;c = 30;d = 0:5;a MedDRA = 0. We trained the model using all the available data (clinical trials and postmarketing side effects), that is, using 59,497 associations. We then binned the drug and side effect self-representation similarity matrices, S W and S H , and checked the values of the side information similarities corresponding to each specific bin. The bins used were 0 À 0:05; 0:05 À 0:1; 0:1 À 0:2 and > 0:2.

QUANTIFICATION AND STATISTICAL ANALYSIS
One-tailed Wilcoxon Sum Rank Test Significance was used in the reported P-values. To analyze the significance of the RRF values for a given drug or side effect category, we adjusted the p values using the Benjamini-Hochberg method to keep the overall significance level below 0.05.