The clinical trials puzzle: How network effects limit drug discovery

SUMMARY
The depth of knowledge offered by post-genomic medicine has carried the promise of new drugs and cures for multiple diseases. To explore the degree to which this capability has materialized, we extract meta-data from 356,403 clinical trials spanning four decades, aiming to offer mechanistic insights into the innovation practices in drug discovery. We find that convention dominates over innovation, as over 96% of the recorded trials focus on previously tested drug targets, and the tested drugs target only 12% of the human interactome. If current patterns persist, it would take 170 years to target all druggable proteins. We uncover two fundamental network-based mechanisms that currently limit target discovery: preferential attachment, leading to the repeated exploration of previously targeted proteins; and local network effects, limiting exploration to proteins interacting with highly explored proteins. We build on these insights to develop a quantitative network-based model to enhance drug discovery in clinical trials.


INTRODUCTION
Prior to receiving approval by the Food and Drug Administration (FDA), a new drug must complete multiple phases of clinical trials to prove its efficacy and safety. The complete clinical trials pipeline for a single drug, from early safety testing to trials on large populations, takes on average six years, 1 and is estimated to cost about $1 billion USD. 2 In 2007, the FDA Amendments Act 3 required funders to publicly post clinical trial designs and results to an online repository managed by the National Library of Medicine (NLM), increasing transparency in the drug discovery process. 4 Despite well-documented compliance issues on reporting the results, 5-7 the accumulated data offer a unique lens into drug innovation practices, 8 and have allowed researchers to conduct meta-analyses on disease-specific trials, 9,10 obtain key insights into equity for patients with rare diseases, 11,12 and unveil systemic biases in patient demographics. 13,14
The choices in clinical trials, from designing the trial protocol to selecting the patient population to testing drugs for specific diseases, have direct implications for the efficacy and equity of drugs that enter the market. While advances in genomics, machine learning, 15,16 network medicine, 17,18 and pharmacology 19 present novel opportunities for drug discovery, potentially reducing the cost and time of conducting exhaustive experimental testing, 20 they may be inadequate if the discovered knowledge about drug candidates (in silico) is not actively transferred to applied settings (in vitro) and does not make its way into clinical practice. Therefore, understanding the drug exploration patterns documented by clinical trials is important to improve population health. 21,22
In this work, we offer a large-scale temporal analysis of the trajectories of drugs and their targets through clinical trials by exploring the cumulative knowledge of the clinical trials database. By combining data from various sources, including investigational and approved drugs, rare and common diseases, and proteins and their disease associations, we aim to understand the factors driving the discovery and exploration of new drugs and targets. We find that while the number of clinical trials continues to increase, the rate of novel drugs entering clinical trials has decreased since 2001, a puzzling effect potentially indicating a drug discovery winter. We also find that target selection is primarily driven by two distinct network-based mechanisms, preferential attachment and local network effects, leading to the over-exploration of certain drugs and protein targets. Our results illustrate that we currently fail to utilize the complete therapeutic potential of the human genome, prompting us to offer a data-driven pathway to unlock its potential through the human interactome, which captures the physical interactions between targets. We build a quantitative model of drug discovery that helps unveil network effects capable of boosting the identification of novel targets.

Curating clinical trials and drugs
We extracted the clinical trials data from the publicly available clinical trials portal (https://clinicaltrials.gov), documenting 356,403 trials from 1975 to 2020. We observe a rapid growth in the number of reported drug trials before the 2007 activation date of the FDA amendment that required all funders to publicly disclose all active clinical trials by that year (Figure 1A, vertical line), likely reflecting the sudden registration of all ongoing trials. Following 2007, organic growth sets in, indicating compliance with public reporting of new trials.
We conducted a multi-step data standardization process to disambiguate drug names listed on trials (see STAR Methods), enabling the identification of 5,694 drugs used in 127,432 trials (89% of drug trials). A drug is designed to bind to specific proteins in the human interactome, known as primary drug targets, responsible for the desired therapeutic effect. In some cases, drugs can also indirectly bind to other proteins, referred to as secondary drug targets. Of the 5,694 identified drugs, 2,528 (44%) drugs have associations to 2,726 drug targets (both primary and secondary) and 1,442 (25%) drugs have associations to 1,842 primary targets. We consider both primary and secondary targets, but we find that our results apply even when we limit our focus to primary targets only (see STAR Methods).
Clinical trials are divided into several phases. 23 The pre-clinical stage (Phase 0 or early Phase 1) involves a small dosage of a drug given to a few people for a short duration to measure treatment response, corresponding to 1,880 (1.5%) trials in our data. Phase 1 is the first full-scale human trial that includes close monitoring of treatment on a small number of patients, representing 26,207 trials (18%). Phase 2 requires 25 to 100 patients with a specific disease condition to test for drug efficacy, representing 37,784 (26%) trials. Phase 3 usually involves several hundred patients, where the experimental drugs are tested alongside other drugs to compare side effects and drug efficacy, representing 24,896 (17%) trials. Finally, Phase 4 often involves thousands of patients, aiming to gain additional knowledge on drug safety over time and interaction with various diseases, and consists of 21,632 (15%) trials. Some trials combined multiple phases, such as Phase 1/Phase 2 and Phase 2/Phase 3, together representing 11,381 (8%) trials in our database. Here, we focus only on drug trials in Phases 1 to 4, representing in total 110,519 (76%) trials (Figure 1B highlighted), and disregard 19,718 (13%) trials without phase information (Figure 1A, gray). Clinical trials can test multiple types of interventions, from drugs to medical devices to behavioral studies. Drugs, the most widely tested intervention, represent 40% of all trials, followed by medical devices (10%) and behavioral interventions (10%) (Figure 1C).

Drug discovery winter
The Human Genome Project (HGP), lasting from 1990 to 2001, boosted innovation and drug exploration, 24 as in this decade clinical trials tested 768 (30% of all) new drugs and [...] not previously targeted proteins (Figure 2A, bottom). This indicates a drug discovery winter that started around 2001, characterized by a large number of clinical trials that focus mainly on drugs that target proteins already targeted by other previously tested or approved drugs. Throughout the history of clinical trials, 956 drugs (17% of all), involving 1,340 targets (49% of all), have been approved by the FDA (Figure 2A inset). Yet, only 342 (35%) approved drugs test novel targets, indicating that drugs with established targets are more likely to receive approval. 25 Although 1,449 (70%) drugs and 2,076 (81%) targets have reached Phase 4, only 40% of those drugs and 51% of those targets received approval (Figure 2B). We also find that, on average, a drug experiences a 3-year lag for approval after successfully completing Phase 3 clinical trials, capturing the slow approval period, despite standard clinical development times 26 (Figure S15). Taken together, we find that clinical trials have tested only 12% of all human proteins and 22% of all druggable proteins 27 (Figure 2C). We estimate that if the current exploration patterns persist, it will likely lead to the exploration of 2,477 (13% of all) proteins by 2025, and following this rate, it would take 170 years to test all 10,648 druggable proteins (see STAR Methods).

Figure 2.
(A) We observe a slowdown in novel drugs tested since 2001, following the end of the Human Genome Project (HGP), signaling a drug discovery winter. For example, the number of drugs tested from 2011 to 2020 is considerably lower than in previous decades. We also find an increasing gap between the number of approved drugs that have new targets and approved drugs with no new targets (inset).
(B) Complementary cumulative distribution (CCDF) of the tested drugs and targets in each phase. We consider a protein and a drug in Phase 4 to have successfully completed Phases 1-3. The plot indicates that only a small proportion of drugs and targets in Phase 4 have been approved.
(C) Proportion of targets in the entire human genome in trials. We find that less than 20% of all proteins have been tested in trials. The sudden jump in the number of proteins in 2015 is due to a single publication in 2015 that found 306 targets for the drug fostamatinib (see SI).
(D) Number of yearly trials of the top targets, demonstrating the inequality of drug exploration. Some targets, like CYP3A4, ABCB1, and ABCC2 (highlighted), are the focus of multiple trials, while other targets are tested in only a few trials each year. Note that we treat multiple drug-target associations for a single protein within an individual clinical trial as a single occurrence of a clinical trial for that protein. This removes the number of drugs as a confounding factor when analyzing the number of yearly trials associated with a particular target.
(E) Number of yearly trials for drugs. A select few drugs, like levomenthol and lidocaine, are tested in several trials every year, while other drugs are rarely tested. We see the impact of COVID-19 in the rapid increase in the number of trials for hydroxychloroquine.

Previously tested proteins are repeatedly selected for future trials
Clinical trials tend to focus on a small number of previously tested proteins, leading to an uneven approach to drug discovery (Figures 2 and S12). For example, we find that CYP3A4, ABCB1, ABCC2, and SLCO1A2, proteins associated with drug metabolism and transport, 28 are involved in 72,884 (66% of all) trials, while EGFR, TNF, and TP53, proteins associated with autoimmune diseases and several neoplasms, are involved in 8,396 (8% of all) trials (Figure 2D). Similarly, we find lidocaine and levomenthol, drugs that serve as anesthetics, to be over-represented in trials (Figure 2E). The COVID-19 pandemic also had a detectable impact on trial activity: hydroxychloroquine, a dormant drug that had only a few clinical trials for over a decade, experienced a rapid increase in the number of trials in 2020 29 (Figure 2E).
A consequence of this uneven drug-target exploration is that only a small number of trials focus on new targets, new drugs, and new target combinations (Figures 3A-3C). The majority of the trials (50%) involve only previously approved drugs, while 11% of the trials test a combination of approved and experimental drugs (Figure 3D). Seeking to find the patterns responsible for this over-exploration of previously targeted proteins, we measured to what degree targets that received more attention in the past are tested in subsequent years. We find that the number of drugs that target a specific protein, $N_{\mathrm{drug}}(t)$, is well approximated by the growth law $N_{\mathrm{drug}}(t) \propto N_{\mathrm{drug}}^{\gamma}(t-1)$, where $\gamma$ is a scaling exponent (Figure S13; $\gamma_{2000} = 1.2$, $\gamma_{2010} = 1.1$, $\gamma_{2020} = 0.9$). This pattern, known as preferential attachment, is responsible for the emergence of hubs in network science 30,31 and quantifies the degree to which previously tested proteins have a cumulative advantage over other proteins.
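One common way to estimate a scaling exponent of this kind is an ordinary least-squares fit in log-log space, since the growth law implies that log N_drug(t) is linear in log N_drug(t-1) with slope gamma. The sketch below illustrates that idea only; the function name `estimate_gamma` and the synthetic counts are assumptions made for illustration, and the fitting procedure actually used for Figure S13 may differ.

```python
import math

def estimate_gamma(n_prev, n_curr):
    """Estimate gamma in N(t) ~ N(t-1)**gamma by least squares on log-log pairs.

    Proteins with zero drugs in either year are skipped because the
    logarithm is undefined there (a simplification of this sketch).
    """
    pairs = [(math.log(a), math.log(b))
             for a, b in zip(n_prev, n_curr) if a > 0 and b > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx  # slope of the log-log regression line

# Synthetic illustration only: counts generated with a known exponent of 1.2.
prev_counts = [1, 2, 4, 8, 16, 32]
curr_counts = [1.5 * k ** 1.2 for k in prev_counts]
gamma = estimate_gamma(prev_counts, curr_counts)
print(f"estimated gamma: {gamma:.3f}")
```

An exponent above 1 means that heavily targeted proteins gain drugs faster than proportionally, while an exponent below 1 indicates a weaker cumulative advantage.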

The role of human interactome in drug exploration
Some diseases can be treated by inhibiting the disease-associated proteins, but most often the effective drugs target proteins that are in the network vicinity of known disease proteins. 32 Indeed, most drugs act by modulating the activity of the sub-cellular web known as the human interactome, 33 captured by experimentally detected protein-protein interactions (PPI) (Figure 4A). As pharmaceutical scientists leverage this network topology during the development of small molecules, it prompts us to inquire whether we can harness the power of the interactome to explain the underlying patterns that define target discovery and exploration. To answer this question, we first mapped the 2,726 drug targets explored in clinical trials into the interactome, finding that 1,260 (92% of all) experimental drugs target at least one protein that has been previously targeted by another approved drug, in line with Figures 2 and 3. However, when focusing on the proteins not targeted by previously approved drugs, we find that 891 (76%) of them interact with at least one protein that is targeted by an approved drug, while 274 (23%) are two steps away from the target of an approved drug. This local network-based clustering of experimental and approved drugs is absent if we randomly select the drug targets (see STAR Methods).
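The "one step" and "two steps" counts above are shortest-path distances in the interactome from each protein to the nearest target of an approved drug. A minimal sketch of how such distances can be computed is a multi-source breadth-first search. The adjacency dictionary, the protein labels, and the function name `distance_to_nearest` are hypothetical, and this is not the authors' code.

```python
from collections import deque

def distance_to_nearest(adjacency, sources):
    """Multi-source BFS: number of hops from each protein to the closest source.

    Proteins that cannot be reached from any source are absent from the result.
    """
    dist = {s: 0 for s in sources if s in adjacency}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

# Toy interactome with hypothetical protein labels (undirected edges).
ppi = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),  # isolated protein
}
approved_targets = {"A"}
dist = distance_to_nearest(ppi, approved_targets)

one_step = [v for v, d in dist.items() if d == 1]
two_steps = [v for v, d in dist.items() if d == 2]
print(one_step, two_steps)
```

Repeating the same count on randomly chosen target sets of equal size gives a null distribution against which the observed clustering can be compared.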
We also find that proteins located farther from approved and experimental targets are rarely selected as a drug target (Figure 4A), even if they have multiple disease associations and are known to be druggable. In other words, we find a strong preference for targeting proteins that are embedded in local network neighborhoods with multiple explored targets (Figure S22). This means that a protein that interacts with other proteins that are the subject of multiple clinical trials for experimental or approved drugs is more likely to be selected as a new drug target compared to a protein located in an unexplored network neighborhood. This suggests that the protein-protein interaction network captures and potentially drives drug discovery and exploration. 34
To unlock the impact of the observed network effects, we examine the likelihood of a protein being selected as a drug target in a future clinical trial using a Generalized Linear Mixed Model (GLMM). The GLMM considers as input four features of each target: (1) disease associations, (2) number of approved drugs targeting it, (3) number of clinical trials it was involved in, and (4) number of experimental drugs targeting it (see STAR Methods). The model is used for inferential purposes, offering as output several insights on the mechanisms governing new drug-target exploration (Figure 4B; Tables S3-S5):
1. Disease-associated proteins are two times more likely to be in a clinical trial compared to proteins with no disease associations (OR: 2.2 [CI: 1.6, 3.2], p < 0.05).
These findings establish two fundamental mechanisms that drive drug exploration.
(1) Preferential attachment: The future attractiveness of a protein as a drug candidate increases as more drugs target it and more trials focus on it (increased clinical exposure). For example, for a protein that is already targeted by ten drugs, its odds of being the target of a new drug increase 8-fold, compared to a protein not targeted by a drug.
(2) Local network effects: Previously untargeted proteins located in network neighborhoods with high exploration patterns (containing multiple drug targets and clinical trials) are more likely to be selected as a new drug target compared to proteins located in network neighborhoods with fewer clinical trials and drugs.
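To make the 8-fold figure concrete, the short calculation below converts it into a per-drug odds ratio and into a change in probability. It assumes the usual logistic form, in which the log-odds grow linearly with the number of drugs; the exact model specification is given in STAR Methods and may differ. The 1% baseline probability is a hypothetical value chosen only for illustration.

```python
def per_unit_odds_ratio(total_odds_ratio, units):
    """Odds ratio per single unit, assuming log-odds are linear in the predictor."""
    return total_odds_ratio ** (1.0 / units)

def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by an odds ratio."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio
    return odds / (1.0 + odds)

# An 8-fold increase in odds at ten drugs (figure quoted in the text).
per_drug = per_unit_odds_ratio(8.0, 10)
print(f"implied odds ratio per additional drug: {per_drug:.3f}")

# Hypothetical baseline: a 1% chance of selection for an untargeted protein.
boosted = apply_odds_ratio(0.01, 8.0)
print(f"probability after an 8-fold increase in odds: {boosted:.3f}")
```

Note that an 8-fold increase in odds is not an 8-fold increase in probability, although the two are close when the baseline probability is small.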

Modeling choices in drug discovery
We build on insights (1) and (2) to introduce a network model that aims to quantitatively recreate the observed patterns in drug exploration and helps us understand how to accelerate drug discovery by exploring a wider set of druggable candidates. We begin by creating a timeline of drug discovery, accounting for the precise dates when targets became associated with drugs (Figure 5A). Using the proteins (nodes) and their interactions (links) in the PPI network as the underlying space of possible exploration, we model drug discovery through two parameters. The parameter p represents the probability that a previously tested protein is selected again for clinical trials. Hence, for p = 0, we model the scenario where we always choose untargeted proteins, while for p = 1 we always select previously tested proteins as targets. The second parameter, q, represents the probability that we choose an untargeted protein that is part of an explored neighborhood, driven by local network search (Figure 5B). Hence, for q = 0, we always select proteins from unexplored neighborhoods, while for q = 1 we select proteins from previously explored neighborhoods. Finally, to account for preferential attachment in target selection, a previously tested protein is selected again as a target proportionally to the number of drugs that have targeted it in the previous years, $P(N_{\mathrm{drug}}(t)) \propto P(N_{\mathrm{drug}}(t-1))$ (Figure S13).

Figure 3. Novelty in clinical trials
For each phase, we identified the first time a drug, a target, or a target combination was tested. We then trace the proportion of trials in each year for each phase that focus on (A) new targets, (B) new drugs, and (C) new target combinations. Across (A-C), we observe a rapid rise in trials with new targets, coinciding with the completion of the HGP (shaded). Starting in 2005, only a minimal percentage of trials across various phases are dedicated to exploring novel targets, drugs, and combinations. (D) We also observe that close to half of the trials each year test previously approved drugs, indicating high interest in drug repurposing. This may partly be motivated by patent laws that force the patent owners to find new uses for the drug compound. As a consequence, we find growing inequality, where a select list of targets of approved drugs is repeatedly in clinical trials, thus preventing broad exploration of the human genome.

Figure 4.
(A) We observe that the regions of proteins associated with FDA-approved drugs (green) and proteins associated with experimental drugs (pink) are closely located in the network. We also find large unexplored regions: blue indicates disease-associated proteins, representing 93% of all unexplored proteins, and purple indicates non-disease-associated proteins, 7% of all unexplored proteins. Nodes are sized based on the number of clinical trials, indicating that the PPI network captures and potentially drives drug discovery.
(B) Logistic model results. We show the odds ratio estimates for different variables using four different models evaluating the likelihood of a protein being selected for a new drug. Model 1 uses disease neighborhood variables: interactions with a disease-associated protein and a previously targeted protein. Model 2 considers approval features: number of approved drugs that target the protein and its network neighborhood. Model 3 utilizes the clinical trials exploration: number of clinical trials of a protein and its network neighborhood. Model 4 uses drug exploration parameters: number of experimental drugs targeting the protein and its network neighborhood. The error bars indicate the standard error of the estimates. The results are tabulated in Table S3.
The advantage of the proposed model is that we can explicitly extract the parameters p and q from the clinical trials data (Figure 5C). For example, in 2010, 295 proteins were tested in clinical trials, of which 244 (82%) were tested in previous clinical trials, and we find that of the 51 previously untargeted proteins, 45 (88%) interact with a previously tested protein, hence p = 0.82 and q = 0.88. We find that the empirically obtained (p, q) parameters are remarkably stable over time, indicating that previously tested proteins are in each year preferred at high rates ($p_{2010} = 0.82$, $p_{2015} = 0.78$, $p_{2020} = 0.91$; Figure S21). We also find that among the untargeted proteins, those interacting with other previously tested proteins are more likely to be selected ($q_{2010} = 0.88$, $q_{2015} = 0.92$, $q_{2020} = 0.87$), allowing us to quantify the stable patterns characterizing drug discovery (Figure S22). As Figure 5C shows, the empirically observed patterns are stable in the high (p, q) regime, with a slight shift over time to higher values of p and q, confirming an increasing trend to explore previously tested targets.
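The extraction of p and q described above reduces to two set operations per year: the share of tested proteins that were tested before, and the share of newly tested proteins that have at least one previously tested interaction partner. The sketch below shows this on toy data; the function name `estimate_p_q`, the adjacency dictionary, and the protein labels are hypothetical and are not taken from the paper's code.

```python
def estimate_p_q(tested_this_year, previously_tested, adjacency):
    """Empirical (p, q) for one year.

    p: fraction of this year's targets that were already tested before.
    q: fraction of this year's new targets that interact with at least one
       previously tested protein (NaN when there are no new targets).
    """
    repeated = tested_this_year & previously_tested
    new = tested_this_year - previously_tested
    near = {v for v in new if adjacency.get(v, set()) & previously_tested}
    p = len(repeated) / len(tested_this_year)
    q = len(near) / len(new) if new else float("nan")
    return p, q

# Toy example with hypothetical protein labels.
adjacency = {"D": {"A"}, "E": {"Z"}}      # interaction partners of new targets
tested_now = {"A", "B", "C", "D", "E"}    # proteins in this year's trials
tested_before = {"A", "B", "C", "X"}      # proteins tested in earlier years
p, q = estimate_p_q(tested_now, tested_before, adjacency)
print(p, q)
```

Here three of the five targets are repeats and one of the two new targets (D) has a previously tested neighbor.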
We find that for the observed $(p^*, q^*)$ values, the network model accurately reproduces the distribution of the number of drugs per target (Figure 5D; KS-distance: 0.06; p < 0.01). The model also allows us to test the relative importance of its building blocks. For example, if we remove the preferential selection of targets, the model fails to capture the drug exploration patterns (Figure S26), confirming that preferential attachment (PA) is a key ingredient of the current drug exploration strategy. The model also unveils the imperfections of the current target selection patterns: the PA strategy, which redirects attention and resources to previously tested proteins, only tests 21 new targets yearly on average. As a consequence, the same protein is explored as a target for a total of 175 (17% of all) drugs (GINI = 0.65), acting as a hub of drug discovery. Overall, the current strategy, by repeatedly targeting previously tested targets, fails to take advantage of the broader potential of the interactome to unveil potential novel targets.
To validate the model, we quantified its ability to predict drug candidates for three autoimmune diseases: rheumatoid arthritis (RA), Crohn's disease (CD), and asthma (see STAR Methods). We find that the model predicted novel candidates for these diseases with 70% accuracy (Figure S28). Further, we validated the predicted proteins through an extensive literature search, finding them to be biologically relevant (Table S6). For example, the model identified protein NLRP3 as a potential drug candidate for RA, which has been shown to reduce RA-induced inflammation in animal models. 35 These results demonstrate that a network strategy can be a useful mechanism to drive exploration toward proteins in druggable parts of the network.
Finally, we want to exploit the predictive power of the network model to explore how to incentivize a wider exploration of the human interactome as potential targets. For this, we examine two alternative exploration strategies: (1) random (R) strategy, when the newly tested proteins are randomly selected (p = 0.5); (2) network search (NS) strategy, when untargeted proteins interacting with previously targeted proteins are preferred (p = 0.05). In each case we keep q = 0.95, as indicated by the empirical data.

Figure 5.
(A) The exploration of the protein-protein interaction (PPI) network, where new proteins are selected as targets for drugs in clinical trials. For time $t_0$, we calculate the number of drugs that previously targeted each protein in the network, $N_{\mathrm{drug}}(t_0)$. At the time step $t_0$, new drugs are introduced in clinical trials for testing. We identify the targets of these drugs at the time of trial, represented using arrows, and update the number of drugs for proteins at the next time step, $N_{\mathrm{drug}}(t_1)$. Similarly, we identify the drugs introduced at time $t_1$ and their targets, and update the number of drugs that target a protein at time $t_2$. The temporal characteristics of each protein allow us to capture the drug discovery process in clinical trials.
(B) Network model. We consider the network at time step $t_0$, using the process described above, and group proteins into three categories: (i) proteins that were previously tested, (ii) proteins connected to a previously tested protein, and (iii) proteins that are not connected to a previously tested protein. With probability p, we select a previously tested protein, while with probability q we select a protein connected to a previously tested protein, and with probability 1 - q, we select a protein not connected to a previously tested protein. When choosing a previously tested protein, we sample proteins proportionally to the number of drugs that have previously targeted them, $P(N_{\mathrm{drug}})$, representing preferential attachment. In the network simulations, we select m(t) proteins (calculated from data) and update the network at the end of each time step. We describe one version of the simulation where parameters n = 3, p = 0.5, q = 0.8 are used to select the proteins B, C, and D at the next time step $t_1$.
(C) The search space of exploration. We measure the number of targets that are tested in the simulations as a function of the parameters p and q. The circles indicate the empirical choices for different years (2010, 2015, 2020).
(D-F) We show the distribution of the number of drugs per target obtained under the three different exploration strategies: (D) Preferential attachment (PA) (p = 0.95, q = 0.95), (E) Random (R) (p = 0.5, q = 0.95), and (F) Network Search (NS) (p = 0.05, q = 0.95).
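The selection rule of the network model can be sketched in a few lines of code. The version below is a simplified illustration under stated assumptions, not the authors' implementation: the interactome is a toy ring graph, the number of proteins selected per step `m` is held constant rather than taken from data, and the names `pick_protein` and `simulate` are hypothetical. When a chosen category is empty, the sketch falls back to another category, which is a design choice made here for robustness.

```python
import random

def pick_protein(adjacency, n_drug, p, q, rng):
    """Pick one protein following the (p, q) rule with preferential attachment."""
    tested = [v for v in adjacency if n_drug.get(v, 0) > 0]
    tested_set = set(tested)
    untested = [v for v in adjacency if v not in tested_set]
    near = [v for v in untested if adjacency[v] & tested_set]      # explored neighborhood
    far = [v for v in untested if not adjacency[v] & tested_set]   # unexplored neighborhood
    if tested and (rng.random() < p or not untested):
        # Preferential attachment: weight by the number of previous drugs.
        weights = [n_drug[v] for v in tested]
        return rng.choices(tested, weights=weights, k=1)[0]
    if near and (rng.random() < q or not far):
        return rng.choice(near)
    return rng.choice(far or near or tested)

def simulate(adjacency, seeds, steps, m, p, q, seed=0):
    """Run the model; drug counts are updated at the end of each time step."""
    rng = random.Random(seed)
    n_drug = {v: 0 for v in adjacency}
    for s in seeds:
        n_drug[s] = 1
    for _ in range(steps):
        chosen = [pick_protein(adjacency, n_drug, p, q, rng) for _ in range(m)]
        for v in chosen:
            n_drug[v] += 1
    return n_drug

# Toy interactome: a ring of 30 proteins, one initially targeted protein.
ring = {i: {(i - 1) % 30, (i + 1) % 30} for i in range(30)}
for label, p in (("PA", 0.95), ("R", 0.5), ("NS", 0.05)):
    counts = simulate(ring, {0}, steps=20, m=3, p=p, q=0.95)
    n_targets = sum(1 for k in counts.values() if k > 0)
    print(label, "distinct targets:", n_targets, "max drugs per target:", max(counts.values()))
```

Sweeping p and q over a grid and recording the number of distinct targets gives a toy analogue of the search space shown in Figure 5C.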
We find that the random (R) strategy selects more drug targets than currently tested (as captured by the PA strategy) (2,655 vs. 1,121), offering an opportunity to deviate from the current distribution of the number of drugs per target (Figure 5E; KS-distance: 0.22; p < 0.01). Despite the randomness of the strategy, the same protein is selected as a target for 110 (11% of all) drugs (GINI = 0.35), indicating that the R strategy also focuses repeatedly on a few network hubs, a pattern similar to the one observed in the PA strategy (175). Overall, the R strategy tests more targets than PA but still results in an over-exploration of a few proteins, and hence offers minimal improvements compared to PA (Figure S27).
In contrast, we find that the network search (NS) strategy generates a statistically different distribution of the number of drugs per target (Figure 5F; KS-distance: 0.37; p < 0.01). Most importantly, the strategy selected 4,055 targets, a more than 3-fold increase in the number of selected targets compared to the PA strategy (1,121). Of those 4,055, we find that 3,922 (96%) are new targets. Further, the NS strategy selects the same protein as a target for a maximum of 10 (1% of all) drugs (GINI = 0.06), significantly lower compared to the R (110) or PA (175) strategies.
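The strategies above are compared with two summary statistics: the Gini coefficient of the number of drugs per target, which measures how unevenly drugs are spread across targets, and the Kolmogorov-Smirnov (KS) distance between two distributions. The sketch below uses the standard textbook definitions of both quantities; the sample data are made up for illustration, and no p-value is computed.

```python
import bisect

def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sample, x):
        return bisect.bisect_right(sample, x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Made-up drugs-per-target counts: one hub-dominated, one evenly spread.
hub_dominated = [0, 0, 0, 10]
evenly_spread = [1, 1, 1, 1]
print(gini(hub_dominated), gini(evenly_spread))
print(ks_distance(hub_dominated, evenly_spread))
```

A lower Gini coefficient, as reported for the NS strategy, means that no single protein accumulates a disproportionate share of the drugs.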
Overall, our results indicate that the current practice (PA) is inefficient in terms of exploring the human interactome, focusing most resources on a small number of highly explored protein targets. In contrast, a network search approach can improve the total number of tested targets by preventing the emergence of protein hubs in drug discovery and also attract attention to potential drug candidates, ultimately resulting in a wider exploration of the human interactome. These results suggest that policy changes, such as prioritizing the approval of drugs with novel targets or targeted funding from the National Institutes of Health (NIH) toward the exploration of novel targets, could help augment existing innovation practices and significantly enhance drug discovery by re-focusing resources on a wider range of novel targets while maintaining accuracy.

DISCUSSION
A scientist's choice of an idea to pursue is influenced by a combination of the project novelty and its potential research impact. 36,37 Similarly, a pharmaceutical company's choice of a target for a new drug is influenced by its potential market value and the likelihood that the drug succeeds in clinical trials. 38 However, the high attrition rates of drugs in clinical trials, 39 difficulties with patent licensing, 40 and the growing cost of developing new molecules 41 have led to a risk-averse approach to drug discovery characterized by "small bets, big wins." 25 While this strategy, resulting in the creation of multiple drugs within the same therapeutic class, 42 increases competition and reduces drug prices, 43,44 it takes away resources from the exploration of novel drugs and targets, 45 encouraging incremental innovation and hindering progress for population health.
Our analysis of clinical trials data shows that the highest growth in drug exploration was between 1990 and 2001, likely driven by the advent of the Human Genome Project (HGP). However, in the following two decades, there was a decrease in the incentive to test novel drugs, and a disproportionate focus on approved drugs (61% of all trials). This allocation of resources ultimately slows the discovery of novel therapies. Further, drug discovery in clinical trials often prioritizes previously tested proteins (preferential attachment) and proteins connected to previously tested proteins (network effect), neglecting proteins in under-explored regions of the network, even if they have disease associations and are verified as druggable targets. To optimize target exploration in druggable regions of the network and increase the number of tested targets, it may be beneficial to reduce the emphasis on previously tested proteins and adopt a network-based search for drug candidates.
It is important to acknowledge that designing a new small molecule that engages with a specific protein may be challenging even when the protein is considered a druggable candidate. Contributing factors include limitations in experimentation, such as the absence of suitable animal models, economic constraints and market dynamics, and the inherent complexities and challenges associated with discovering effective treatments. To gain a more comprehensive understanding of target prioritization, it is important to integrate network-based strategies with other relevant data sources, such as genomic information, phenotypic data, and comprehensive analysis of clinical outputs obtained from both successful and failed trials. By embracing these recommendations and actively pursuing an integrative approach, we can foster a more robust and effective drug discovery process. This, in turn, will pave the way for the development of innovative pharmaceutical interventions that address unmet medical needs, ultimately benefiting patients and society as a whole.

STAR METHODS
Detailed methods are provided in the online version of this paper.
For each trial, the extracted meta-data include the type of trial (e.g., interventional, observational), its associated phase (e.g., Phase 1, Phase 2), status (e.g., completed, recruiting) (Figure S7), a list of conditions (e.g., asthma, rheumatoid arthritis), a list of interventions (if applicable, e.g., budesonide, inhaler), and their associated types (e.g., drug, medical device). We then retain all trials that have a "drug" type associated with any of their listed interventions. This gives a subset of 146,314 trials.
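The filtering step can be illustrated with a short sketch. The record layout below, including the field names `interventions`, `type`, and `name` and the trial identifiers, is hypothetical and chosen only for illustration; it does not reflect the actual schema of the clinical trials portal.

```python
def has_drug_intervention(trial):
    """True if any listed intervention of the trial is of type 'drug'."""
    return any(
        item.get("type", "").strip().lower() == "drug"
        for item in trial.get("interventions", [])
    )

# Hypothetical trial records (identifiers and fields are made up).
trials = [
    {"id": "T1", "phase": "Phase 2",
     "interventions": [{"type": "Drug", "name": "budesonide"},
                       {"type": "Device", "name": "inhaler"}]},
    {"id": "T2", "phase": "Phase 1",
     "interventions": [{"type": "Behavioral", "name": "exercise program"}]},
    {"id": "T3", "phase": None, "interventions": []},
]

drug_trials = [t for t in trials if has_drug_intervention(t)]
print([t["id"] for t in drug_trials])
```

A trial is kept as soon as one of its interventions is a drug, so trials that combine a drug with a device, such as T1, remain in the subset.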

Clinical trials drug data curation
The drug names listed in clinical trials are not standardized, which makes it difficult to accurately identify drug exploration. For example, the drug 'lepirudin' may be referred to as 'lepirudin recombinant', 'hirudin variant-1', or even by its associated brand name 'Refludan'. As a result, we find a total of 94,615 interventions in the clinical trials, much higher than the number of drugs identified by DrugBank. To standardize the drug names, we conduct a multi-step matching process. First, we map the intervention names to the direct names in DrugBank, giving us a total of 103,398 trials (70.6% of all drug trials) and 4,458 drugs. Next, we map the intervention names to the drug synonyms provided by DrugBank, allowing us to map an additional 7,698 trials. We also connect the drug names to the official drug product names, allowing us to map another 14,759 trials. We also map intervention names to the Wikipedia names of drugs, providing additional drug maps for 500 drugs. Finally, we apply a fuzzy match between intervention names and drug names, providing mappings for another 1,077 trials. At the end of this methodology, we are left with 127,432 trials (87.6% of all) and 5,694 drugs. We also control for placebo drugs in trials by searching for the term 'placebo' in the intervention names. We thus remove 1,171 trials on 590 drugs from our analysis.
The data curation steps then reveal 127,432 drug trials for 5,694 drugs and 2,726 targets, representing the final data used in the analysis.
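The cascade of matching steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the placeholder ID `D1`, the function names, and the 0.9 fuzzy cutoff are all assumptions, the Wikipedia step is omitted, and the fuzzy step uses Python's standard `difflib` rather than whichever matcher was actually used.

```python
from difflib import get_close_matches

def build_lookup(drugs):
    """Index lowercase primary names, synonyms, and product names by drug ID."""
    stages = {"name": {}, "synonym": {}, "product": {}}
    for drug_id, rec in drugs.items():
        stages["name"][rec["name"].lower()] = drug_id
        for s in rec.get("synonyms", []):
            stages["synonym"][s.lower()] = drug_id
        for p in rec.get("products", []):
            stages["product"][p.lower()] = drug_id
    return stages

def match_intervention(text, stages, cutoff=0.9):
    """Return (drug_id, stage) for the first stage that matches, else (None, None)."""
    key = text.strip().lower()
    # Exact matches, tried in the order described in the text.
    for stage in ("name", "synonym", "product"):
        if key in stages[stage]:
            return stages[stage][key], stage
    # Fuzzy fallback against primary names only (assumed cutoff).
    close = get_close_matches(key, list(stages["name"]), n=1, cutoff=cutoff)
    if close:
        return stages["name"][close[0]], "fuzzy"
    return None, None

# Toy record built from the lepirudin example; "D1" is a placeholder ID.
drugs = {
    "D1": {
        "name": "Lepirudin",
        "synonyms": ["Lepirudin recombinant", "Hirudin variant-1"],
        "products": ["Refludan"],
    },
}
stages = build_lookup(drugs)
print(match_intervention("Refludan", stages))
```

Recording the stage that produced each match makes it possible to report per-step counts of the kind given above.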

Druggable genes
The list of druggable genes is curated through a large-scale crowdsourcing effort that incorporates multiple data sources (e.g., Gene Ontology, OncoKB, PharmGKB). 27 The data are publicly available for free download from DGIdb (www.dgidb.org). We extracted the November 2020 version of the data for our analysis, which contains 10,648 druggable human proteins. It is important to note that classifying a drug-gene interaction as potentially druggable does not imply that a drug is ineffective (or effective) in interacting with other genes in different regions.

Protein-protein interaction network
The proteins in the cell of an organism are known to have biological interactions with other proteins in neighboring cells. This relationship between proteins can be mapped to represent a network of genes and their interactions, a well-studied mechanism in network medicine. 47 The protein interaction network comprises 18,508 nodes (proteins) and 332,646 edges (interactions).

Experimentally validated PPI network
We repeat the analysis using only experimentally validated protein interactions, a network comprising 8,876 proteins and 61,985 interactions. We find a similar result as above: targets of experimental drugs are enriched in the region of proteins targeted by approved drugs (p < 0.001; Figure S25), verifying that the network processes are not driven by potential selection biases of the PPI network.

Drug approval data
The data regarding drugs and their approval are provided by the Food and Drug Administration (FDA) and are publicly available at https://www.fda.gov/drugs/development-approval-process-drugs/drug-approvals-and-databases. The entire corpus, extracted in December 2020, contains 1,002 approved drugs. After matching the FDA data with clinical trials, we found 911 drugs, representing 90% of all approved drugs.

Disease data
The data about disease associations were extracted from DisGeNet. 48 We find 15,474 genes associated with 19,620 diseases. Since the data also list the publication reference that reported each disease association, we map the publication (PubMed) ID to the year of publication to identify the specific year in which the gene was found to be associated with a disease, allowing us to accurately recreate the exploration patterns (Figure S2).
The clinical trials data also contain the disease condition of the trial (e.g., hypertension). However, the disease names are not standardized. To address this issue, we use the same multi-step matching process used to curate the drug data to match the disease of each trial to the curated disease data in DisGeNet. 48 Specifically, we use string matching, fuzzy matching, and cosine similarity. We find that the top 25 diseases collectively account for 40% of all clinical trials (Figure S3).
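The cosine-similarity step can be illustrated with a small sketch. The text does not state how disease names were vectorized, so the character-trigram representation, the function names, and the 0.6 threshold below are assumptions chosen for illustration only.

```python
from collections import Counter
from math import sqrt

def trigrams(text):
    """Count character trigrams of a lowercased, space-padded string."""
    s = f"  {text.lower().strip()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    """Cosine similarity between the trigram count vectors of two strings."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def best_match(query, vocabulary, threshold=0.6):
    """Return the most similar vocabulary term, or None if below the threshold."""
    score, term = max((cosine(query, t), t) for t in vocabulary)
    return term if score >= threshold else None

# Hypothetical trial condition matched against a toy disease vocabulary.
print(best_match("hypertensions", ["Hypertension", "Asthma"]))
```

A threshold keeps low-similarity conditions unmatched instead of forcing them onto the nearest disease name.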

Common and rare diseases
Information about common and rare diseases was extracted from Orphanet, an online rare disease and orphan drug database (https://www.orphadata.com/). The data are indexed via ORPHAcode, which links diseases to associated genes, along with the type of association (e.g., causative, modifier, susceptibility). We then map these diseases to the DisGeNet 48 data through MeSH ID to identify gene associations with rare and common diseases (Figure S4). After mapping, we find 29,001 common diseases associated with 15,339 genes and 1,169 rare diseases associated with 9,152 genes. The data are free to download from http://www.orpha.net (accessed September 2021).

in 1999, and the third in 2000. In 2004, the first drug that targeted the protein was approved, followed by another drug approved in 2006. Similarly, the protein TPH1, which is associated with multiple mental disorders, was discovered in 1987, and its first clinical trial was in 2004, 17 years after its discovery. The second drug was tested in 2006, and the first approved drug emerged in 2007 (Figure S18A, bottom). These exploration patterns prompted us to introduce two variables to quantify recency: 1) time to first trial since discovery of a protein, and 2) time to first approval since the first trial.
We utilize Kaplan-Meier survival curves 52 to estimate the time-to-event variables. We find that the time to subsequent trials decreases if a protein is targeted by multiple drugs (Figure S18B), indicating that clinical trials are more likely to focus on recently tested targets. That is, the more drugs target the protein, the more experimental validity it receives, decreasing the time until a subsequent trial. In a similar fashion, the time to approval for targets decreases as they become associated with several approved drugs (Figure S18C); hence the time to second approval is much shorter than the time to first approval, and so on. In summary, we find that proteins experience a long wait time until their first trial as a target, but recently targeted proteins are more likely to be selected for new drugs.
Further, we find that non-disease genes enter trials rapidly after approval, but a higher proportion of disease genes eventually receive a trial (Figure S20A). Interestingly, there are no differences in the survival times of common and rare disease genes (log-rank test: 0.29, p = 0.58). Further, we find that genes associated with no diseases are less likely to be associated with an approved drug (Figure S20B). Unsurprisingly, druggable genes are more likely to be in a trial and more likely to be approved than non-druggable genes (Figures S20C and S20D).
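For readers unfamiliar with the Kaplan-Meier estimator, the product-limit calculation behind such curves can be sketched in a few lines. The function name and the toy durations are illustrative assumptions; the analysis itself presumably relied on a statistical package.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival)] at each distinct event time.

    durations: years from discovery to first trial (or to end of follow-up).
    observed:  True if the trial happened, False if the protein is censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # Product-limit step: fraction of at-risk proteins still untested.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored cases both leave the risk set
    return curve

# Toy data: five proteins, two of them never tested within follow-up.
curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
print(curve)
```

Here "survival" is the estimated probability that a protein has not yet entered a trial by a given time, so censored proteins contribute to the risk set without counting as events.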

Repeated occurrence of proteins
We model the dynamics of repeated occurrences of proteins in trials using the PWP Gap Time model, 53 a survival model for recurrent events in which the time to event resets after each sequential occurrence. Specifically, the proteins are stratified based on the clinical trial events, for example, first drug trial, second drug trial. We find that the hazard ratio (HR) for a target to be associated with a second drug increases after its first drug trial (HR: 0.82, CI: [0.73, 0.93] vs. HR: 1.22, CI: [1.03, 1.45], p < 0.01; Table S1), indicating that a protein experiences an increased likelihood of a new drug after its first drug trial. In summary, we find that proteins experience a long wait time until their first trial as a target, but recently targeted proteins receive increased attention, reducing the time until they are subsequently tested for new drugs.
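The gap-time layout that such a model consumes can be sketched as below: one row per stratum, with the clock reset after each event. The function name, the tuple layout, and the 2020 end of follow-up are assumptions for illustration; the TPH1 years are taken from the example given earlier in the text.

```python
def gap_time_rows(discovery_year, trial_years, end_year):
    """Build one row per stratum: (stratum, gap_in_years, event_observed)."""
    rows = []
    previous = discovery_year
    for k, year in enumerate(sorted(trial_years), start=1):
        # k-th new-drug trial observed; gap measured from the previous event.
        rows.append((k, year - previous, True))
        previous = year  # the clock resets after each event
    # Final open interval: at risk for the next event, censored at end_year.
    rows.append((len(trial_years) + 1, end_year - previous, False))
    return rows

# TPH1: discovered 1987, first drug trial 2004, second drug tested 2006.
print(gap_time_rows(1987, [2004, 2006], end_year=2020))
```

Stratifying on the first tuple element lets the baseline hazard differ between the first, second, and later drug trials.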

GLMM model
The data include measurements in which the same target can be used for multiple new drugs over several years, creating repeated and longitudinal observations for the same target. To model these interactions in a temporal fashion, we consider the generalized linear mixed effects model (GLMM), which accounts for fixed and random effects. We use a binomial regression with a logistic link function:

$$\mathrm{logit}\left(E[Y_i]\right) = X_i\beta + Z_iU + \gamma \qquad \text{(Equation 1)}$$

where $E[Y_i]$ represents the probability of a protein being selected as a new drug target; $X_i$ represents the explanatory variables associated with fixed effects $\beta$; $Z_i$ represents the parameter associated with random effects on $U$, quantified as (i) gene observation and (ii) year of clinical trial; and $\gamma$ represents the model residuals. We consider the following fixed effects variables:
(1) association with a common disease (binary)
(2) association with a rare disease (binary)
(3) disease associated protein in the neighborhood (binary)
(4) number of approved drugs at time t, $n^{t}_{\mathrm{approved}}$ (count)
(5) number of approved drugs in the neighborhood at time t, $nn^{t}_{\mathrm{approved}}$ (count)
(6) number of clinical trials at time t, $n^{t}_{\mathrm{ct}}$ (count)
(7) number of clinical trials in the neighborhood at time t, $nn^{t-1}_{\mathrm{ct}}$ (count)
(8) number of drugs at time t, $n^{t}_{\mathrm{drug}}$ (count)
(9) number of drugs in the neighborhood at time t, $nn^{t-1}_{\mathrm{drug}}$ (count)
The parameters of the GLMM were selected after preliminary data analysis. First, we found a clear distinction in the number of trials based on the disease type association; for example, rare diseases are rarely tested (see Figure S2). This prompts us to consider the disease associations of proteins. Second, we found that 1,260 (92%) of all drugs tend to target at least one protein targeted by an approved drug, prompting us to include drug approval parameters. Third, we found that previously tested proteins tend to be repeatedly tested in clinical trials (see Figure S10), prompting us to include the number of previously tested drugs and the number of clinical trials as parameters for our model. Finally, we found that the majority of the proteins (76%) selected for new drugs tend to interact with proteins that were previously targeted by drugs, prompting us to incorporate the exploration patterns in the local network neighborhood of the protein.
We explored four GLMM models: (a) Model 1 includes disease-related variables (association with a common disease, association with a rare disease, and disease prevalence in the local network neighborhood). (b) Model 2 considers the number of approved drugs associated with the target and the number of approved drugs associated with the target's local network. (c) Model 3 considers the role of clinical trials by capturing the number of previous clinical trials on the target and the number of previous clinical trials in the network neighborhood. (d) Model 4 considers the target disease variables, the number of experimental drugs associated with the target, and the number of experimental drugs associated with proteins in the local network neighborhood. In Models 2 to 4, we also include target-specific disease variables, allowing us to better disentangle the effects of disease association and clinical trial drug exploration. We consider all targets that were tested in at least one clinical trial in a given year as positive samples (6%), and the remaining targets as negative samples (94%). We show the results in Table S3 and the results when only considering the primary targets in Table S4.
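To make the logistic link concrete, the sketch below turns a linear predictor with two random intercepts (gene and year) into a selection probability. It is only an illustration of the link function: the coefficient values are hypothetical and are not the fitted estimates reported in Table S3.

```python
from math import exp

def inverse_logit(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + exp(-eta))

def selection_probability(x, beta, u_gene=0.0, u_year=0.0):
    """P(target selected) from fixed effects x.beta plus two random intercepts."""
    eta = sum(xi * bi for xi, bi in zip(x, beta)) + u_gene + u_year
    return inverse_logit(eta)

# Hypothetical coefficients: intercept, common-disease flag, prior drug count.
beta = [-3.0, 0.5, 0.4]
few_drugs = selection_probability([1, 1, 1], beta)
many_drugs = selection_probability([1, 1, 6], beta)
print(round(few_drugs, 3), round(many_drugs, 3))
```

With a positive coefficient on the prior drug count, the predicted probability rises with the number of drugs that have already targeted the protein, which is the qualitative pattern the models are designed to test.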
It is important to note that our model does not investigate the mechanisms behind the discovery of new proteins, nor does it help explain the interactions between proteins in the network. Instead, our focus is on utilizing the PPI network to understand the underlying processes that lead to the exploration of novel targets. It is worth mentioning that our analysis only considers binary versions of the PPI network.

Testing for interactions
To investigate the interaction between the two key identifying results in the GLMM model, we incorporate cross interaction variables that consider the association of the target with both common or rare diseases and its previous testing in clinical trials.By including these cross interaction variables, we aim to measure the combined effect of these two factors on the likelihood of a target being selected for a clinical trial.
(1) association with a common disease (binary)
(2) association with a rare disease (binary)
(3) whether the target was previously tested in a clinical trial (binary)
(4) whether the target is associated with a common disease and was also previously tested (interaction term)
(5) whether the target is associated with a rare disease and was also previously tested (interaction term)
We provide the results of our analysis in Table S5. Notably, we observe that when a target is associated with a rare disease and has undergone previous testing in a clinical trial, its likelihood of being chosen for a new clinical trial decreases. This finding provides valuable additional insights into the relationship between target disease associations and their impact on the selection of targets for clinical trials.
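The five covariates can be encoded as a single design row per target-year observation, as in the minimal sketch below. The function and column names are hypothetical and serve only to show how the interaction terms are formed as products of the binary indicators.

```python
def design_row(common, rare, tested):
    """Return the five covariates for one target-year observation."""
    return {
        "common": int(common),
        "rare": int(rare),
        "tested": int(tested),
        # Interaction terms are 1 only when both indicators are 1.
        "common_x_tested": int(common and tested),
        "rare_x_tested": int(rare and tested),
    }

# A rare-disease target that was already tested in a trial.
print(design_row(common=False, rare=True, tested=True))
```

A negative coefficient on `rare_x_tested` would correspond to the decreased likelihood described above for previously tested rare-disease targets.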

Network-based drug discovery model
We model choices in drug discovery using two parameters: p, the probability of selecting a previously tested protein, and q, the probability of selecting a protein that is part of a previously explored neighborhood. We utilize the entire search space of p and q to simulate alternative exploration strategies and examine their related benefits for drug discovery. We consider drug exploration from 2011 to 2020 in our simulations, sampling the exact number of proteins tested every year, m(t).
To test the empirical validity of the model, we utilize the resulting distribution of the number of drugs per target for each simulation. The distribution characterizes how widely proteins are selected as targets for drugs. We utilize the Kolmogorov-Smirnov distance to measure the maximum difference between the model and the empirical data. As we show in the main text, the model accurately reproduces this distribution under the preferential attachment (PA) strategy. Yet, we find that the model fails to recreate the observed patterns if we remove preferential selection of drug targets (Figure S26).
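One simulation step and the Kolmogorov-Smirnov comparison can be sketched as follows. The branching order (p first, then q among untested proteins), the uniform choice within the explored and unexplored neighborhoods, sampling with replacement, and all function names are assumptions made for this illustration; the toy network is hypothetical.

```python
import random

def simulate_step(adj, n_drug, m, p, q, rng):
    """Select m proteins for one time step; return updated drug counts."""
    tested = [v for v in adj if n_drug[v] > 0]
    weights = [n_drug[v] for v in tested]
    untested = [v for v in adj if n_drug[v] == 0]
    frontier = [v for v in untested if any(n_drug[u] > 0 for u in adj[v])]
    distant = [v for v in untested if not any(n_drug[u] > 0 for u in adj[v])]
    picks = []
    for _ in range(m):
        if tested and rng.random() < p:
            # Preferential attachment: weight by number of previous drugs.
            picks.append(rng.choices(tested, weights=weights)[0])
        elif frontier and (not distant or rng.random() < q):
            picks.append(rng.choice(frontier))  # neighbor of a tested protein
        elif distant:
            picks.append(rng.choice(distant))   # unexplored neighborhood
        else:
            picks.append(rng.choice(tested))    # nothing untested remains
    updated = dict(n_drug)  # the network is updated at the end of the step
    for v in picks:
        updated[v] += 1
    return updated

def ks_distance(sample_a, sample_b):
    """Maximum absolute gap between two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    def ecdf(sample, x):
        return sum(1 for s in sample if s <= x) / len(sample)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

# Toy path network A-B-C-D-E in which only A has been targeted so far.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
start = {"A": 2, "B": 0, "C": 0, "D": 0, "E": 0}
after = simulate_step(adj, start, m=5, p=0.05, q=0.95, rng=random.Random(0))
print(after, ks_distance(list(after.values()), list(start.values())))
```

Running the step once per year with the observed m(t), and comparing the simulated drugs-per-target values to the empirical ones with `ks_distance`, gives the kind of model check described above.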

Predicting potential drug candidates
To validate the model's ability to identify potential drug targets, we ask the model to identify drug candidates for three autoimmune diseases: rheumatoid arthritis (RA), Crohn disease (CD), and asthma. We begin by identifying disease proteins associated with each of the three diseases that were tested in previous clinical trials. Next, we search the interaction partners of these proteins and pick the untargeted proteins among them, representing proteins that are part of explored neighborhoods. Next, we use the model to select proteins through the three outlined strategies (PA, R, NS), allowing us to rank proteins based on the frequency with which they are targeted. Finally, the proteins in the network are validated as druggable, based on extensive experimental studies. We use the well-curated list of druggable proteins 27 to investigate whether each predicted protein has been verified as a potential drug target, allowing us to measure whether the exploration patterns lead to potentially druggable outcomes.
We present the prediction results across the range of p and q parameters. Across all three diseases, we find that 70% of the targets selected through the NS strategy are verified as potential drug candidates (Figure S28). Indeed, the current practice (PA) selects targets with high accuracy but does so at the cost of prioritizing previously tested targets. In contrast, we show that a network-based search process can be an effective way to improve drug discovery in under-explored regions of the interactome.
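The candidate-selection step can be sketched as below. As a simplification, this sketch ranks untargeted neighbors by how many tested disease proteins they interact with, not by selection frequency across simulations as in the text. The function names, the toy adjacency list, the placeholder protein `X1`, and the toy druggable set are assumptions; the four seed proteins come from the NLRP3 example in the target validation section.

```python
from collections import Counter

def candidate_targets(adj, disease_proteins, tested, druggable):
    """Rank untargeted neighbors of tested disease proteins.

    Returns [(protein, n_seed_neighbors, is_druggable)], most connected first.
    """
    seeds = set(disease_proteins) & set(tested)
    counts = Counter(
        nb for s in seeds for nb in adj.get(s, ()) if nb not in tested
    )
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(prot, n, prot in druggable) for prot, n in ranked]

def druggable_fraction(ranked):
    """Share of ranked candidates found on the druggable list."""
    return sum(1 for _, _, ok in ranked if ok) / len(ranked) if ranked else 0.0

# Toy neighborhood; "X1" is a placeholder protein, not a real gene symbol.
adj = {
    "ABCB1": ["NLRP3", "X1"],
    "HSP90AA1": ["NLRP3"],
    "CYP3A4": ["NLRP3", "ABCB1"],
    "NR1I2": ["NLRP3"],
}
seeds = {"ABCB1", "HSP90AA1", "CYP3A4", "NR1I2"}
ranked = candidate_targets(adj, seeds, tested=seeds, druggable={"NLRP3"})
print(ranked, druggable_fraction(ranked))
```

The fraction returned by `druggable_fraction` plays the role of the accuracy measure reported for each strategy.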

Target validation
Additionally, we conduct in-silico studies by searching the predicted results for the network search (NS) strategy. We present the list of identified proteins for RA, CD, and asthma in Table S6, along with the specific functions of each protein, provided by GeneCards. 54 The network model is able to find drug candidates in the local network neighborhood of disease-associated proteins. For example, the method selected the protein NLRP3 as a potential drug candidate for RA. NLRP3 interacts with ABCB1, HSP90AA1, CYP3A4, and NR1I2, proteins that have been associated with RA and that were previously tested in clinical trials. Indeed, mutations downstream of NLRP3 play an essential role in regulating the inflammasome, identified as a risk factor for inflammatory diseases. 55 Animal model studies verified that regulating the over-expression of this gene inhibits the maturation of interleukin-1β (IL-1β) and reduces RA-induced inflammation. 35 These results indicate that the model is able to predict potential novel drug candidates. The illustrated technique can be used to conduct in-silico testing of the model predictions for multiple diseases.

Figure 1. Clinical trials over time (A) Number of drug trials initiated over time. The rapid rise in clinical trials prior to 2007 is likely due to the 2007 FDA act that required all ongoing clinical trials to be registered on clinicaltrials.gov (purple line). We limit our analysis to phases 1 to 4 of clinical trials, and disregard combined phases and trials with unknown phase (gray). (B) Number of trials grouped by phase. We filter all known drug trials and match the drug interventions listed on the trials to known drugs (Supplementary Section 1.3). We show the final number of trials, grouped by phase, representing the corpus for our analysis (dark shade). (C) Proportion of trials and interventions by intervention type. Here we focus on drug trials, which represent roughly 40% of all clinical trials and 30% of all interventions.

Figure 2. Drugs and targets tested in clinical trials (A) Number of drugs tested in clinical trials. We observe a slowdown in novel drugs tested since 2001, following the end of the Human Genome Project (HGP), signaling a drug discovery winter. For example, the number of drugs tested from 2011 to 2020 is considerably lower than in previous decades. We also find an increasing gap between the number of approved drugs that have new targets and approved drugs with no new targets (inset). (B) Complementary cumulative distribution (CCDF) of the tested drugs and targets in each phase. We consider a protein and a drug in Phase 4 to have successfully completed Phases 1-3. The plot indicates that only a small proportion of drugs and targets in Phase 4 have been approved. (C) Proportion of targets in the entire human genome in trials. We find that less than 20% of all proteins have been tested in trials. The sudden jump in the number of proteins in 2015 is due to a single publication in 2015 that found 306 targets for the drug fostamatinib (see SI). (D) Number of yearly trials of the top targets, demonstrating the inequality of drug exploration. Some targets, like CYP3A4, ABCB1, and ABCC2 (highlighted), are the focus of multiple trials, while other targets are tested in only a few trials each year. Note that we treat the presence of multiple drug-target associations for a single protein within an individual clinical trial as a single occurrence of a clinical trial for that specific protein. By adopting this approach, we effectively remove the number of drugs as a confounding factor when analyzing the number of yearly trials associated with a particular target. (E) Number of yearly trials for drugs. A select few drugs like levomenthol and lidocaine are tested in several trials every year, while other drugs are rarely tested. We see the impact of COVID-19 in a rapid increase in the number of trials for hydroxychloroquine.

Figure 4. Networked exploration process of drug discovery (A) Protein-Protein Interaction (PPI) network. We observe that the region of proteins associated with FDA approved drugs (green) and proteins associated with experimental drugs (pink) are closely located in the network. We also find large unexplored regions: blue indicates disease associated proteins, representing 93% of all unexplored proteins, and purple indicates non-disease associated proteins, 7% of all unexplored proteins. Nodes are sized based on number of clinical trials, indicating that the PPI network captures and potentially drives drug discovery. (B) Logistic model results. We show the odds ratio estimate for different variables using four different models evaluating the likelihood of a protein to be selected for a new drug. Model 1 uses disease neighborhood variables: interactions with a disease associated protein and a previously targeted protein. Model 2 considers approval features: number of approved drugs that target the protein and its network neighborhood. Model 3 utilizes the clinical trials exploration: number of clinical trials of a protein and its network neighborhood. Model 4 uses drug exploration parameters: number of experimental drugs targeting the protein and its network neighborhood. The error bars indicate the standard error of the estimates. The results table is shown in Table S3.

Figure 5. Modeling mechanisms of drug-target discovery (A) The exploration of the protein-protein interaction (PPI) network, where new proteins are selected as targets for drugs in clinical trials. For time t0, we calculate the number of drugs that previously targeted each protein in the network, NDrug(t0). At the time step t0, new drugs are introduced in clinical trials for testing. We identify the targets of these drugs at the time of trial, represented using arrows, and update the number of drugs for proteins at the next time step, NDrug(t1). Similarly, we identify the drugs introduced at time t1 and their targets and update the number of drugs that target a protein at time t2. The temporal characteristics of each protein allow us to capture the drug discovery process in clinical trials. (B) Network model. We consider the network at time step t0, using the process described above, and group proteins into three categories: (i) proteins that were previously tested, (ii) proteins connected to a previously tested protein, and (iii) proteins that are not connected to a previously tested protein. With probability p, we select a previously tested protein, while with probability q we select a protein connected to a previously tested protein, and with probability 1 - q, we select a protein not connected to a previously tested protein. When choosing a previously tested protein, we sample proteins proportional to the number of drugs that have previously targeted it, P(NDrug), representing preferential attachment. In the network simulations, we select m(t) proteins (calculated from data) and update the network at the end of each time step. We describe one version of the simulation where parameters n = 3, p = 0.5, q = 0.8 are used to select the proteins B, C, and D at the next time step t1. (C-F) (C) The search space of exploration. We measure the number of targets that are tested in the simulations as a function of the parameters p and q. The circles indicate the empirical choices for different years (2010, 2015, 2020). We show the distribution of number of drugs per target obtained under the three different exploration strategies: (D) Preferential attachment (PA) (p = 0.95, q = 0.95), (E) Random (R) (p = 0.5, q = 0.95), and (F) Network Search (NS) (p = 0.05, q = 0.95).