Network medicine framework for identifying drug-repurposing opportunities for COVID-19

Significance The COVID-19 pandemic has highlighted the importance of prioritizing approved drugs to treat severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections. Here, we deployed algorithms relying on artificial intelligence, network diffusion, and network proximity to rank 6,340 drugs for their expected efficacy against SARS-CoV-2. We experimentally screened 918 drugs, allowing us to evaluate the performance of the existing drug-repurposing methodologies, and used a consensus algorithm to increase the accuracy of the predictions. Finally, we screened in human cells the top-ranked drugs, identifying six drugs that reduced viral infection, four of which could be repurposed to treat COVID-19. The developed strategy has significance beyond COVID-19, allowing us to identify drug-repurposing candidates for neglected diseases.


Introduction
The disruptive nature of the COVID-19 pandemic has unveiled the need for the rapid development, testing, and deployment of new drugs and cures. Given the compressed timescales, the de novo drug development process, that lasts a decade or longer, is not feasible.
A time-efficient strategy must rely on drug repurposing (or repositioning), helping identify among the compounds approved for clinical use the few that may also have a therapeutic effect in patients with COVID-19. Yet, the lack of reliable repurposing methodologies has resulted in a winner-takes-all pattern, where more than one-third of registered clinical trials focus on hydroxychloroquine or chloroquine, siphoning away resources from testing a wider range of potentially effective drug candidates.
Drug repurposing algorithms rank drugs based on one or multiple streams of information, such as molecular profiles 1 , chemical structures 2 , adverse profiles 3 , molecular docking 4 , electronic health records 5 , pathway analysis 6 , genome wide association studies (GWAS) 6 , and network perturbations [6][7][8][9][10][11][12][13][14] . As typically only a small subset of the top candidates is validated experimentally, the true predictive power of the existing repurposing algorithms remains unknown. To quantify and compare their predictive power, different algorithms must make predictions for the same set of candidates, and the experimental validation must focus not only on the top candidates, but also on a wider list of drugs chosen independently of their predicted rank.
The COVID-19 pandemic presents both the societal imperative and the rationale to test drugs at a previously unseen scale. Hence, it offers a unique opportunity to quantify and improve the efficacy of the available predictive algorithms, while also identifying potential treatments for COVID-19. Here, we implement three recently developed network-medicine drug-repurposing algorithms that rely on artificial intelligence 14,15 , network diffusion 16 , and network proximity 10 ( Figure 1A, B). To test the validity of the predictions, we identified 918 drugs ranked by all predictive pipelines, that had been experimentally screened to inhibit viral infection and replication in cultured cells 17 . We also collected clinical trial data to capture the medical community's collective assessment of promising drug candidates. We find that the predictive power varies for the different datasets and metrics, indicating that in the absence of a priori ground truth, it is impossible to decide which algorithm to trust. We propose, however, a multimodal ensemble forecasting approach that significantly improves the accuracy and the reliability of the predictions by seeking consensus among the predictive methods 14,18 .

Network-based Drug Repurposing
Repurposing strategies often prioritize drugs approved for (other) diseases whose molecular manifestations are similar to those caused by the pathogen or disease of interest 19 . To search for diseases whose molecular mechanisms overlap with the COVID-19 disease, we first mapped the experimentally identified 20 332 host protein targets of the SARS-CoV-2 proteins (Table S4) to the human interactome [21][22][23][24] (Table S3), a collection of 332,749 pairwise binding interactions between 18,508 human proteins (SI Section 1.1). We find that 208 of the 332 viral targets form a large connected component (COVID-19 disease module hereafter, Figure 2B), indicating that the SARS-CoV-2 targets aggregate in the same network vicinity 12,25 . Next, we evaluated the networkbased overlap between proteins associated with 299 diseases 26 ( ) and the host protein targets of SARS-CoV-2 ( ) using the !" metric 26 , finding !" > 0 for all diseases, implying that the COVID-19 disease module does not directly overlap with the disease proteins associated with any single disease (Figure S1-2 and Table S7). In other words, a potential COVID-19 treatment cannot be derived from the arsenal of therapies approved for a specific disease, arguing for a network-based strategy that can identify repurposable drugs without regard for their established disease indication.
We implemented three competing network repurposing methodologies ( Figure 1B and SI Section 1.4): i) The artificial intelligence-based algorithm 14,15 maps drug protein targets and diseaseassociated proteins to points in a low-dimensional vector space, resulting in four predictive pipelines A1-A4, that rely on different drug-disease embeddings. ii) The diffusion algorithm 16 is inspired by diffusion state distance, and ranks drugs based on capturing network similarity of a drug's protein targets to the SARS-CoV-2 host protein targets. Powered by distinct statistical measures, the algorithm offers five ranking pipelines (D1-D5). iii) The proximity algorithm 10 ranks drugs based on the distance between the host protein targets of SARS-CoV2 and the closest protein targets of drugs, resulting in three predictive pipelines of which P1 relies on all drug targets; P2 ignores targets identified as drug carriers, transporters, and drug-metabolizing enzymes; and P3 relies on differentially expressed genes identified by exposing each drug to cultured cells 27 . The low correlations across the three algorithms indicate that the methods extract complementary information from the network ( Figure 2C, and SI Section 1.5).

Experimental and Clinical Validation of Drug Repurposing Pipelines
We implemented the 12 pipelines to predict the expected efficacy of 6,340 drugs in Drugbank 27 against SARS-CoV-2 and extracted and froze the predictions in the form of 12 ranked lists on April 15, 2020. All pipelines rely on the same input data and to maintain the prospective nature of the study, all subsequent analysis relies on this initial prediction list. As the different pipelines make successful predictions of a different subset of drugs, we identified 918 drugs for which all pipelines (except for P3, which predicts the smallest number of drugs) offer predictions and whose compounds were available in the Broad Institute drug repurposing library 28 (Figure 1), and used two independent datasets to quantify the predictive power of each pipeline over the same set of drugs: (1) As the first ground truth we used 918 compounds that had been experimentally screened for their efficacy against SARS-CoV-2 in VeroE6 cells, kidney epithelial cells derived from African green monkey 17 (see SI Section 2). Briefly, the VeroE6 cells were pre-incubated with the drugs (from 8 µM down to 8 nM) and then challenged with wild type SARS-CoV-2 strain USA-WA1/2020.
Of the 918 drugs, 806 had no detectable effect on viral infectivity (N drugs, 87.8% of the tested list); 35 were cytotoxic to the host cells (C drugs); 37 had a strong effect (S drugs), being active over a broad range of concentrations; and 40 had a weak effect (W drugs) on the virus ( Figure   3A, Table S10). As the prediction pipelines offer no guidance on the magnitude of the in vivo effect, we consider as positive outcomes drugs that had a strong or a weak effect on the virus (S&W, 77 drugs, Table 2), and as negative outcomes the drugs without detectable effect (N, 806 drugs).
(2) On April 15, 2020 (prediction date), we scanned clinicaltrials.gov, identifying 67 drugs in 134 clinical trials for COVID-19 (CT415 dataset, Table S12). To compare outcomes across datasets, we limit our analysis to the experimentally tested 918 drugs, considering as positive the 37 drugs in clinical trial on the E918 list, and negative the remaining 881 drugs. As the outcomes of these trials are largely unknown, validation against CT415 dataset tests each pipeline's ability to predict the pharmacological consensus of the medical community on drugs with expected potential efficacy for COVID-19 patients.
The goal of drug repurposing is to prioritize all available drugs, allowing the experimental efforts to limit their resources on the top-ranked ones. Thus, the most appropriate performance metric is the number of positive outcomes among top ranked drugs (precision at ), and the fraction of all positive outcomes among the top ranked drugs (recall at ). For the E918 dataset( Figure   4C) A2 ranks 9 S&W drugs among the top 100, followed by P1 (7 drugs) and A3 and A4 (6 drugs).
We observe similar trends for recall ( Figure 4E): the A2 pipeline ranks 11.7% of all positive drugs in the top 100, and P1 selects 9%. Finally, A1 ranks 12 drugs currently in clinical trials among the top 100 in CT415, followed by A3 (11 drugs) and A2 (10 drugs), trends that are similar for recall ( Figure 4F).
Taken together, we find that while most algorithms show statistically significant predictive power (SI Section 3.1, Tables S1-2), they have different performance on the different ground truth datasets: the AI pipeline offers strong predictive power for the drugs selected for clinical trials, while proximity offers better predictive power for the E918 experimental outcomes. While together the twelve pipelines identify 22 positive drugs among the top 100, none of the pipelines offers consistent superior performance for all outcomes, prompting us to develop a multimodal approach that can extract the joint predictive power of all pipelines.

Multimodal Approach for Drug Repurposing
Predictive models for drug repurposing are driven by finite experimental resources that limit downstream experiments to those involving a finite number ( ) of drugs. How do we identify these drugs to maximize the positive outcomes of the tested list 18 ? With no initial knowledge as to which of the # = 12 predictive pipelines offer the best predictive power, we could place equal trust in all, by selecting the top # ⁄ drugs from each pipeline (Union list). We compare this scenario with an alternative strategy that combines the predictions of the different pipelines.
A widely used approach is to calculate the average rank of each drug over the # pipelines 29 (Average Rank list). The alternative is to search for consensus ranking that maximizes the number of pairwise agreements between all pipelines 15,18 . As the optimal outcome, called the Kemeny  Figure S8). In other words, we find that CRank extracts the cumulative predictive power of all methods, matching or exceeding the predictive power of the individual pipelines across all datasets. Its persistent performance indicates that an unsupervised multimodal approach can significantly improve the hit rate over individual prediction algorithms. It also suggests that in the absence of a ground truth, the Kemeny consensus, which seeks a ranking with the smallest number of pairwise disagreements between the individual pipelines, represents an effective and theoretically principled strategy when each pipeline carries some predictive power.

Network Effects
Most computationally informed drug repurposing methods rely on chemical binding energy minimization and docking patterns, limited to compounds that bind either to viral proteins or to the host targets of the viral proteins 20 ( Figure 1C). A good example is remdesivir, a direct-acting antiviral that inhibits viral RNA polymerase 32,33 . In contrast, our pipelines identify drugs that target host proteins to induce network-based perturbations that alter the virus's ability to enter the cell or to replicate within it. An example of such host-targeting drug 34 is dexamethasone, which reduces mortality in COVID-19 patients by modulating the host immune system 35 . We find that only one of the 77 S&W drugs are known to directly target a viral protein binding target: amitriptyline, which targets SIGMAR1, the target of the NSP6 SARS-CoV-2 protein. In other words, 76 of the 77 drugs that show efficacy in our experimental screen are "network drugs", achieving their effect indirectly, by perturbing the host subcellular network. As these drugs do not target viral proteins or their host targets target, they cannot be identified using traditional binding-based methods yet are successfully prioritized by network-based methods.
Searching for common mechanistic or structural patterns that could account for the efficacy of the 77 S&W drugs, we explored their target and pathway enrichment profiles ( Figures S6-7), as well as their reported mechanisms of action, failing to identify statistically significant features shared by most S&W drugs. This failure is partly explained by the diversity of the S&W drugs ( targets. We find that compared to random expectation, the N drug targets are far from the COVID-19 module ( Figure 3C), while the S and W drug targets are closer to the COVID-19 disease module than expected by chance. The magnitude of the effect is also revealing: the S drug targets are closer than the W drug targets, suggesting that network proximity is a positive predictor of a drug's efficacy.
Taken together, our analyses suggest that S&W drugs are diverse, and lack pathway-based or mechanistic signatures that distinguish them. We do find, however, that S&W drug target the same interactome neighborhood, located in the network vicinity of the COVID-19 disease module, potentially explaining their ability to influence viral activity, and the effectiveness of network-based methodologies to identify them.

Discussion
A recent in vitro screen 36  Inspecting the CRank list and the experimental outcomes, we find two highly ranked drugs with strong outcomes, but not yet in clinical trials (Table 1) Taken together, the methodological advances presented here not only suggest potential drug candidates for COVID-19, but offer a principled algorithmic toolset to identify future treatments for diseases underserved by the cost and the timelines of conventional de novo drug discovery processes. As only 918 of the 6,340 drugs prioritized by CRank were screened, a selection driven by compound availability, many potentially efficacious FDA-approved drugs remain to be tested.
Finally, it is also possible that some drugs that lacked activity in VeroE6 cells may nevertheless show efficacy in human cells, like loratadine (rank #95, N), which inhibited viral activity in the human cell line Caco-2 39 . Ritonavir, our top-ranked drug, also showed no effect in our screen, despite the fact that over 42 clinical trials are exploring its potential efficacy in patients. In other words, some of the drugs ranked high by CRank may show efficacy, even if they are not among the 77 S&W drugs with positive outcomes.

Human Interactome and SARS-CoV-2 and Drug Targets
The human interactome was assembled from 21 public databases that compile experimentally derived proteinprotein interaction (PPI) data: 1) binary PPIs, derived from high-throughput yeast-two hybrid (Y2H) experiments All proteins were mapped to their corresponding Entrez ID (NCBI) and the proteins that could not be mapped were removed. The final interactome used in our study contains 18,505 proteins, and 327,924 interactions (Table S3). We retrieved interactions between 26 SARS-CoV-2 proteins and 332 human proteins reported by Gordon, et. al. (2020) 19 (Table S4). We retrieved drug target information from the DrugBank database 20 , which contains 24,609 interactions between 6,228 drugs and their 3,903 targets, and drug target interaction data curated from the literature for 25 drugs (Table S5). We also obtained from the DrugBank database differentially expressed genes (DEGs) identified by exposure of drugs to different cell lines (Table S6). The Largest Connected Component (LCC) of human proteins that bind to SARS-CoV-2 proteins was calculated using a degree-preserving approach 21 , which prevents the repeated selection of the same high degree nodes, setting 100 degree bins in 1,000 realizations. (Fig 2A) We evaluated gene expression in the lung by using the GTEX database 22 , considering genes with a median count lower than 5 transcripts (raw counts) as not expressed.

Disease Comorbidities
Pre-existing conditions worsen prognosis and recovery of COVID-19 patients 23 . Previous work showed that the disease relevance of human proteins targeted by a virus can predict the signs, symptoms, and diseases caused by that pathogen 24 . This prompted us to identify diseases whose molecular mechanisms overlap with cellular processes targeted by SARS-CoV-2, allowing us to predict potential comorbidity patterns [25][26][27] . We retrieved 3,173 disease-associated genes for 299 diseases 28 , finding that 110 of the 332 proteins targeted by SARS-CoV-2 are implicated in other human diseases; however, the overlap between SARS-CoV-2 targets and the pool of the disease-associated genes was not statistically significant (Fisher's exact test; FDR-BH padj -value> 0.05). We evaluated the network-based overlap between the proteins associated with each of the 299 diseases and the host protein targets of SARS-CoV-2 using the !" metric 28 , where !" < 0 signals a networkbased overlap between the SARS-CoV-2 viral targets and the gene pool associated with disease . We found that !" > 0 for each disease, indicating that COVID-19 disease module does not directly overlap with any major disease module ( Figure S1 and Table S7). The diseases closest to the COVID-19 disease module (smallest !" ) included several cardiovascular diseases and cancers, whose comorbidity in COVID-19 patients is well documented 29-31 ( Figure S2). The same metric predicted comorbidity with neurological diseases, in line with our observation that the host protein targets are expressed in the brain (Table S7).
In summary, we found that the SARS-CoV-2 host protein targets do not overlap with proteins associated with any major diseases, indicating that a potential COVID-19 treatment cannot be derived from the arsenal of therapies approved for a specific disease. These findings argue for a strategy that maps drug targets without regard to their localization within a particular disease module. However, the disease modules closest to the SARS-CoV-2 viral targets are those with noted comorbidity for COVID-19 infection, such as pulmonary and cardiovascular diseases. We also found multiple network-based evidences linking the virus to the nervous system, a less explored comorbidity, consistent with the observations that many infected patients initially lose olfactory function and taste 32 , and 36% of patients with severe infection who require hospitalization have neurological manifestations.  outcome. The farther a disease is from the center, the more distant are its disease proteins from the COVID-19 viral targets.
Disease Comorbidity. We measured the network proximity between COVID-19 targets and 299 diseases. The figure represents each disease as a circle whose radius reflects the number of disease genes associated with it 28 . The diseases closest to the center, whose names are marked, are expected to have higher comorbidity with the COVID-19 outcome.
The farther a disease is from the center, the more distant are its disease proteins from the COVID-19 viral targets
COVID-19 drug treatment recommendation task. We cast the COVID-19 treatment recommendation task as a link prediction problem on the multimodal graph. The task was to predict new edges between drug and disease nodes such that a predicted link between a drug node # and a disease node $ should indicate that drug # is a promising treatment for disease $ (e.g., COVID-19). Our graph neural network is an end-to-end trainable model for link prediction on the multimodal graph and has two main components: (1) An encoder: a graph convolutional network operating on and producing embeddings for nodes in ; and (2) A decoder: a model optimizing embeddings such that they are predictive of approved drug indications.
Overview of graph neural architecture. The neural message passing encoder took as input a graph and produced a node d-dimensional embedding # ∈ % for every drug and disease node in the graph. We used the encoder 33 that learned a message-passing algorithm 34 where # ()) ∈ %()) is the hidden state of node # in the &' layer of the neural network with ( ) being the dimensionality of this layer's representation, is an edge type, matrix -()) is an edge-type specific parameter matrix, denotes a non-linear element-wise activation function (i.e., a rectified linear unit), anddenote attention coefficients 35 . To arrive at the final embedding # ∈ % of node # , we compute its representation as # = # ()) . Next, the decoder takes node embeddings and combines them to reconstruct labeled edges in .
In particular, the decoder scores a ( # , , $ ) triplet through a function whose goal is to assign a score ( # , , $ ) representing how likely it is that drugs vi will treat disease vj (i.e., denotes an 'indication' relationship) 35 .
Training the graph neural network. During model training, we optimized model parameters using the maxmargin loss functions to encourage the model to assign higher probabilities to successful drug indications ( # , , $ ) than to random drug-disease pairs. We took an end-to-end optimization approach that jointly optimized over all trainable parameters and propagated loss function gradients through both the encoder and the decoder. To optimize the model, we trained it for a maximum of 100 epochs (training iterations) using the Adam optimizer 36 with a learning rate of 0.001. We initialized weights using the initialization described in 37 . To make the model comparable to other drug repurposing methodologies in this study, we did not integrate additional side information into node feature vectors; instead, we used one-shot indicator vectors 38 Table S8 and Table S9, respectively. The pipeline A1 searches for drugs that are in the vicinity of the COVID-19 disease by calculating the cosine distance between COVID-19 and all drugs in the decoded embedding space 41 . The decoding is based on the = 10 nearest neighboring nodes in the embedding space, with a minimum distance between nodes of = 0.25. The pipeline A2 prevents that nodes in the decoding embedding space from packing together too closely, by using = 0.8 and keeping unchanged. These constraints push the structures apart into softer, more general features, offering a better overarching view of the embedding space at the loss of the more detailed structure.
Pipeline A3 forces the decoding to concentrate on the very local structure by using N = 5, to explore a smaller neighborhood, while setting the minimum distance at a midrange point of D = 0.5. Pipeline A4 focuses on a broader view of the embedding space by setting N=10 and D = 1. Finally, to obtain lists of candidate drugs, each pipeline ranked drugs based on the pipeline-defined distances of drugs to COVID-19 ( Figure S3).
Intuitively, parameter constrained the size of the local neighborhood each pipeline looked at in the embedding space when calculating the distances, and parameter controlled how tightly the pipeline was allowed to pack the embeddings together.

Diffusion-Based Algorithms (D1-D5)
The diffusion state distance (DSD) 42 algorithm uses a graph diffusion property to derive a similarity metric for pairs of nodes that takes into account how similarly they affect the rest of the network. We calculate the expected number of times ( , ) that a random walker starting at node visits node , representing each node by the vector 42 : which describes how a perturbation initiated from that node affects other nodes in the interactome. The similarity between nodes and is provided by the L1-norm of their corresponding vector representations: Inspired by the DSD, we developed five new metrics to calculate the impact of drug targets on the SARS-CoV-2 targets . The first (Pipeline D1) is defined as: where ( , ) represents the diffusion state distance between nodes and . Since the L1-norm of two large vectors may result in loss of information 43 , we also used the metrics (Pipeline D2): and (Pipeline D3): 9: where is the Kullback-Leibler ( ) divergence between the vector representations of the nodes and .
Finally, to provide symmetric measures, we tested the metrics (Pipeline D4): and (Pipeline D5) where is the Jensen Shannon ( ) divergence between the vector representations of nodes and . All five measures assume ≠ .

Proximity Algorithm (P1-P3)
Given , the set of COVID-19 virus targets, , the set of drug targets, and ( , ), the shortest path length between nodes ∈ and ∈ in the network, we define 21 : We determined the expected distances between two randomly selected groups of proteins, matching the size and degrees of the original V and T sets. To avoid repeatedly selecting the same high degree nodes, we use degree-binning 21 . The mean %(8,7) and standard deviation %(8,7) of the reference distribution allows us to convert the absolute distance = to a relative distance %= , defined as: .
We implemented three versions of the proximity algorithm: 1) relying on all drug targets (P1); 2) ignoring drug targets identified as drug carriers, transporters, and drug-metabolizing enzymes (P2); and 3) on differentially expressed genes (DEGs) identified by exposure of each drug to cultured cells, which was obtained from DrugBank's compilation of 17,222 DEGs linked to 793 drugs in multiple cell lines (Table S6).

Explanatory Subgraphs
For each pipeline, we identified "explanatory subgraphs" to help understand the predictions made by the respective pipeline. The key idea was to summarize where in the data the pipeline looks for evidence for their predictions. Given a particular prediction, an explanatory subgraph is a small sub-network of the entire network considered by the pipeline that is most influential for the prediction and contributes most to the predictive power. For the proximity method (P), the explanatory subgraphs can be derived exactly, representing the set of nodes contributing to proximity. For the artificial intelligence-based methods (A), the subgraphs were extracted using a GNN Explainer algorithm 44 . GNNExplainer specifies an explanation as a subgraph of the entire network the GNN was trained on, such that the subgraph maximizes the mutual information with the GNN's prediction. This is achieved by formulating a mean field variational approximation and learning a realvalued graph mask, which selects the important subgraph using counterfactual reasoning. For the diffusion method, we first identified the SARS-CoV-2 targets (seeds) that have the maximum (or median, depending on the pipeline) similarity with the drug targets under consideration. Once the seeds are identified for each drug target, we extract the vector representation of the target and the corresponding seeds. Each element of these vectors corresponds to a node in the network: Each pipeline performs an element-wise comparison of these two vectors to calculate similarity values, defined as similarity terms, using: 5 " e + # log( These distance similarity terms collectively contribute to each drug's ranking score. Among all 18,446 nodes, we are only interested in those whose variations lead to the current ranking (drug prediction scores). Therefore, we applied a feature selection algorithm to eliminate the network nodes (features) that do not contribute to the predicted scores (outcomes). This task is done by training a regression tree model (DecisionTreeRegressor Important features are those with non-zero importance value as characterized by the Regressor model.
Once the important features/nodes are extracted, we search this space to identify the explanatory network of each set of drug targets. To do so, we rank the similarity terms of each target and the corresponding seeds on the space of important features and identify the nodes with the highest contribution to the similarity measure such that they satisfy the following equation: If a drug has multiple targets or if each target has multiple corresponding seeds (seeds with the same similarity to a target), the results are aggregated. The explanatory network of a target that happens to be a seed is that seed itself. Figure S4 shows the similarities and differences among the explanatory subgraphs of the different prediction pipelines. Proximity and Diffusion bases methods explore the PPI in a much vast and diverse way than the AI methods (C) Methods inside the same pipeline tend to select similar genes, the similarity of selected genes across methods is different (Jaccard Index), those genes, interestingly, also do not lie in similar neighborhood (similarity), meaning that not only do the genes not overlap across methods, but the vicinity the methods explore are also different. (D) Another measure used to understand methods similarity involved using the PCA of gene drug pairs, showing that AI methods are fairly consistent in what they observe, and similarly, P1 and P2. Diffusion methods have a higher variance in gene-drug pair predictions and have a larger spread of their module; as expected, P3 is far from other proximity measures. (Fig 2C) To investigate the complementarity among the prediction algorithms, for each drug we measured the network separation F?% between the explanatory subgraph and the drug's targets ( ), and the separation F?! between and the 332 SARS-Cov2 viral targets ( ) capturing the disease module. Each drug has twelve subgraphs, each corresponding to one of the twelve pipelines. A total of 320 drugs, for which all pipelines have predictive subgraph and separation values, are shown in Figure     compound was active at the highest dose but an EC50 value could not be calculated due to insufficient activity, the percent inhibition of infection at 25 µM was used to rank potency. Each assay was performed in duplicate in 384 well plates.

Drug-Response Classification
The classification of the drug-response outcomes was done using a drug response curve (DRC) model 45 . We used the R package drc 46 to calculate the DRCs using a log-logistic model with four parameters (hill, IC50, min, and max). Each drug-response was classified in two steps: first inspecting toxicity and later evaluating the drug effect on the inhibition of viral proliferation.
To inspect the cytotoxicity, we first estimated the model parameters using as response variable the nuclei count in the treated cells, normalized by the nuclei count in the controls. We tested the dose-response effect for all drugs using a 0 test for goodness of fit and drugs with p < 0.01 (Bonferroni correction) were defined as cytotoxicity, with the exception of drugs with toxicity only at the last dose concentration. To evaluate inhibition of viral replication, we used as response for the DRC model the number of infected cells in the treated samples normalized by the controls. For that, a drug was considered to have a dose-response effect by using a 0 test for goodness of fit (p < 0.01, Bonferroni correction), and the significant drugs were defined as Strong (S) or Weak (W) if the viral reduction was greater than 80% and 50%, respectively. The drugs that did not meet the criteria for S or W were classified as no-effect (N). Finally, we classified drugs as cytotoxic (C) if their toxicity curves were greater than their viral proliferation curves in at least half of the doses tested.

Biological Interpretation of Effective Drugs in E918 Dataset
We observed 77 drugs that showed strong (S) or weak effects (W) in the high-throughput screening. There was no drug category (ATC Classification) that was enriched among the S, W, or S&W drugs (hypergeometric test FDR-BH padj > 0.05). To search for common patterns that could explain their bioactivity, we performed hierarchical clustering on the drug target profiles, failing to find binding patterns shared by all drugs ( Figure S6).
Only four small groups of drugs are observed, documenting various degrees of shared targets ( Figure S6), three of which contain drugs from multiple categories, and one group consists of 7 nervous system-related drugs with similar target profiles. We also performed pathway enrichment analysis to identify biological processes shared across the targets of drugs with strong or weak effects. Among the 77 S&W drugs, 42 are located in three groups associated with common pathways, and 20 of these drugs are of diverse indications linked to transport and metabolism of different substrates. Eighteen are associated with pathways related to membrane receptors, most of them indicated for nervous system disorders, targeting G protein-coupled receptors such as ADRA1A, HTR2A, and HRH1 ( Figure S7). Taken together, neither the pathway nor the target analysis reveals patterns that could explain the efficacy of the 77 S&W drugs.

Performance Evaluation using ROC Curves, Precision, and Recall
We examined whether positive drugs (e.g., strong-effect drugs) were ranked high by measuring the predictive considered hit-rate based metrics to evaluate the quality of top-K drugs in each ranked list. Here, we evaluated performance at a given cut-off rank K, considering only the topmost predictions by the pipeline. In particular, we calculated the fraction of top-K ranked drugs that were positive outcomes (precision at K) and the fraction of all positive outcomes that were among the top-K ranked drugs (recall at K).
We considered three types of ground-truth information to evaluate prediction performance: 1) The outcome of the experimental screening of 918 compounds (E918 dataset, Table S10). We identified 806 no effect drugs, 40 with weak effect, and 37 with strong effect.
2) The outcome of the experimental screening of additional 74 compounds tested with a wider range of doses (0.625 -20μM, 0.2 MOI) (E74 dataset, Table S11) ( Figure S8). The E74 dataset represents a subset of 81 compounds by a medical doctor among the top 10% of all drug predictions that were available for purchase. We identified 39 no effect drugs, 10 with weak effect, and 11 with strong effect.
3) 67 drugs that, as of April 2020, were in ongoing trials for COVID-19, obtained from the ClinicalTrials.gov website ** (CT415 dataset, Table S12). ClinicalTrials.gov organizes COVID-19 specific collection of all trials. Trial records consist of information on inclusion and exclusion criteria, details on drugs being tested, the scientific team behind the study, and funding agencies. We extract drug names from clinical trials' treatment information and match their names with records on the DrugBank database 20 .
Note that some methods do not provide prediction for every drug in the full dataset. While that would make a fair comparison of the methods challenging, we note that ground-truth information described above is available for drugs predicted by all pipelines (except for P3, hence it is harder to compare this pipeline with the other 11). Finally, we note that we adopted a conservative approach by evaluating predictive performance using the rankings across all 6,340 drugs, not only 918 experimentally screened drugs. For example, it is possible to conceive that a particular topmost prediction in a pipeline represents a positive drug, however, that is impossible to know if the predicted drug was not included in experimental screening. Because of that, the reported precision and recall values represent conservative estimates of prediction performance, i.e., the values are lower than what one could obtain if the analysis was limited to only experimentally screened drugs.
To determine the significance of predictive power, we calculated the expected number of positive drugs among top-K drugs for each pipeline and compared the expected values with the observed precision and recall values.
To this end, we calculated the expected number of positive drugs by taking into account (a) the number of drugs for which ground-truth information is available, and (b) the number of drugs for which a pipeline makes predictions. We used an exact one-tailed binomial test (p-value < 0.05) to test whether a top-K list returned by a pipeline is biased towards containing more positive drugs than what we would expect on average by pure chance had the ranking be a random one. presents, in most cases, higher hit rates, precision and recalls when compared to E74.

Rank Aggregation Algorithms (RAAs)
Rank aggregation is concerned with how to combine several independently constructed rankings into one final ranking that represents a consensus ranking, i.e., a collective opinion of prediction methods that is representative of all rankings returned by the methods 48 . The classical consideration for specifying the final ranking is to maximize the number of pairwise agreements between the final ranking and each input ranking.
Unfortunately, this objective, known as the Kemeny consensus, is NP-hard to compute 48,49 , which has motivated the development of methods that either use heuristics or approximate the Kemeny optimal ranking 48,50-52 .

Average Rank Method
The Average Rank method follows the most straightforward way to integrate multiple rankings. For each drug, it calculates a simple rank average over 12 rankings returned by the pipelines to obtain the overall ranking.
While the Average Rank method is a popular ad-hoc rank aggregation strategy, many studies 53-55 , including ours, found that studying the average ranks can be a poor aggregation approach. Next, we briefly overview methods that realize more sophisticated approaches to obtain the overall ranking.

Borda Method
The Borda method 56 is one of most commonly used rank aggregation methods. Briefly, the method proceeds as follows. Given are rankings exist, + , 0 , … , ) . For each drug ∈ # , is assigned a score # ( ) equal to the number of drugs that outranks in ranking # . The Borda count ( ) of drug is then calculated as . Finally, drugs are sorted in the descending order based on their Borda counts to create a consensus ranking. Theoretically, Borda method offers a guarantee on approximating Kemeny consensus. In particular, Borda method is a 5-approximation algorithm of the Kemeny optimal ranking 51 .

Dowdall Method
The Dowdall method 57 is a modified form of the Borda method that has been widely used in political elections in many countries. Intuitively, individual pipelines make predictions for drugs, which are interpreted as preferences of the pipeline. For a pipeline, its 1 st choice gets a score of 1, its 2 nd choice get 1/2, its 3 rd choice gets 1/3, and so on. Drug with the largest total score across pipelines wins. Formally, let be given rankings, + , 0 , … , ) . For each drug ∈ # , is first assigned a score # ( ) equal to the reciprocal of drug's rank in ranking # . The total score ( ) is then calculated as ∑ # ( ) ) #G+ . Candidates are sorted in descending order based on their total score to create a consensus ranking.

CRank
The CRank algorithm 58 starts with ranked lists of drugs, -, each one arising from a different pipeline, r. Each ranked list is partitioned into equally sized groups, called bags. Each bag in ranked listhas attached importance weight -# whose initial values are all equal. CRank uses a two-stage iterative procedure to aggregate the individual rankings by taking into account uncertainty that is present across ranked lists. After initializing the aggregate ranking as a weighted average of ranked lists -, CRank alternates between the following two stages until no changes were observed in the aggregated ranking .
(1) First, it uses the current aggregated ranking to update the importance weights -# for each ranked list. For that purpose, the topranked drugs in serve as a temporary gold standard. Given bag and ranked list -, CRank updates importance weight -# based on how many drugs from the temporary gold standard appear in bag using drug for ranking , and -( ) is the rank of according to . By using an iterative approach, CRank allows for the importance of a ranked list returned by an individual pipeline not to be predetermined, i.e., a-priori fixed, and to vary across drugs. The final output is a global ranked list of drugs that represents the collective opinion of all drug repurposing prediction algorithms. In all experiments, we set the number of bags to 1,000, the size of the temporary gold standard to 0.5% of the total number of drugs in , and the maximum number of iterations to 50. In all cases, the algorithm converged, in fewer than 20 iterations 59,60 . The pipelines' ranked lists and CRank's aggregation are provided in Table S14. The Python source code implementation of CRank is available at https://github.com/mims-harvard/crank (raa.py).

Comparison of RAAs
What explains CRank's outstanding performance across all datasets? Each RAA aims to approximate the optimal Kemeny consensus, which offers the best agreement with all 12 prediction pipelines. As this consensus remains unknown (NP-hard), we cannot assess how well the different RAA methods approximate it. We do, however, have a ground-truth ranking, offered by the experimental and clinical datasets (E918 and CT415).
We assigned rank 1 to the strong drugs, rank 2 to the weak drugs, and rank 3 to the no-effect drugs, allowing us to measure the Kemeny score for each aggregated list, representing the fraction of pairwise disagreements between the respective ranked list and the experimental outcomes. For K = 100, the Kemeny score of the Average Rank method is infinite for E918, as there are no positive drugs among the top 100. In contrast, for the Borda count, we obtain a Kemeny score of KS = 0.7131, indicating that 71% of all drug pairs in the ranked list of Borda method disagrees with the ground-truth ranking in the E918 dataset. Note that the theoretical expectation for a purely random ranking is KS = 0.5, meaning that 50% of all drug pairs in the random reference are flipped, i.e., while with KS = 0.4545 Dowdall does better than random, we observe a much lower KS = 0.2679 for CRank. We measured the Kemeny score for multiple values, for both datasets (E918 and CT415), finding that for K<250 (top drugs), CRank offers the best agreement with the outcomes.  (2020)).