Main

Given the more than 10^60 drug-like molecules that are estimated to be possible1,2, screening of 10^5.5 to 10^6.5 random molecules, as in high-throughput screens, might in principle never work. A widely mooted explanation for why it has worked3,4,5 is that screening decks are far from random, but are biased toward molecules that proteins have evolved to recognize: metabolites, natural products and the drugs that mimic them6, which we will call ‘bio-like’ molecules. The implication of this idea, which we6 and others7,8,9,10,11,12,13,14,15,16,17,18,19 have promoted, is that as chemical libraries expand they should remain biased toward bio-like molecules. While popular, this idea has never been prospectively tested.

An opportunity to do so has come with the advent of ultra-large, make-on-demand or ‘tangible’ libraries20. These virtual libraries are composed of molecules that have not been previously made, but can be readily synthesized. Since 2016, these libraries have expanded accessible molecules from 3.5 million (available from in-stock collections) to over 29 billion (https://enamine.net/compound-collections/real-compounds). While such libraries cannot be empirically screened, molecules within them can be computationally prioritized for synthesis and testing, often using molecular docking. Indeed, docking of these ultra-large, tangible libraries has revealed highly potent molecules for multiple targets, with affinities often in the mid-nanomolar and sometimes high picomolar range21,22,23,24,25, results that are typically better than those from docking the much smaller, in-stock collections. If the success of compound screens reflects library bias toward bio-like molecules, one would expect these new ultra-large libraries, and the hits emerging from them, to share the biases toward metabolites, natural products and drugs observed in the ‘in-stock’ libraries. Since we know the identity of every molecule screened in these libraries, this idea can be explicitly tested.

It also seemed interesting to consider what other factors in the tangible libraries have contributed to docking successes, and those of other library screening methods26, and what challenges may be anticipated as the libraries continue to grow27. For instance, should we expect the fit of library molecules into their receptor targets to improve as the libraries grow, and, if so, at what rate? Do the high-ranking molecules from a library screen come to be dominated by a small number of chemotypes as libraries grow, or is diversity maintained? As the libraries grow, does the likelihood of artifacts28 that exploit weaknesses in the docking scoring and sampling also grow?

In this Article, we explore how similarity to bio-like molecules has changed with library growth, how goodness-of-fit and chemical diversity of high-ranking molecules change with library growth and how we might anticipate rare but high-ranking artifacts to change with library growth. Even at this early stage in the field, the results that emerge are strong enough to suggest strategies to maximize success in ultra-large-library screening.

Results

Similarity to bio-like molecules changes with library size

An intriguing observation about high-throughput screening (HTS) decks, and about ‘in-stock’ libraries, was that they resembled bio-like molecules (metabolites, natural products and drugs) over 1,000-fold more than would be expected at random6. To investigate how this similarity to bio-like molecules changed with library size, we compared the 3.5 million in-stock library and the 3.1 billion make-on-demand library to worldwide drugs, to metabolites and to natural products (‘bio-like’ molecules). Using ECFP4 topological fingerprints, we calculated the Tanimoto similarity between each library molecule and each bio-like molecule. In this comparison, the Tanimoto coefficient (Tc) is the number of features shared between two molecules (library and bio-like) divided by the total number of features present in either. A Tc of 1 indicates that the pair are identical, while a Tc of 0.2 indicates that the similarity is low enough to be essentially meaningless. As seen previously6, the in-stock set is far more similar to bio-like molecules than expected at random (Fig. 1a, blue curve), with 10,000 in-stock molecules being identical to metabolites, natural products or drugs (Fig. 1b). Conversely, as the library grows 886-fold, from 3.5 million in-stock to 3.1 billion tangible molecules, the number with Tc values >0.8 to bio-like molecules actually decreases 2.3-fold (Fig. 1a, orange curve). Most of the growth of the tangible library comes in the region of random similarity to bio-like molecules, where the distribution peaks around a Tc of 0.25; in this region the tangible library grows by 3,000-fold relative to the in-stock library. Between the two extremes of random similarity and full identity, the similarity to bio-like molecules falls much faster for the 3.1 billion tangible library (Fig. 1a, orange curve) than it does for the 3.5 million in-stock library (Fig. 1a, blue curve). By the time essentially full identity (0.95 < Tc ≤ 1.0) with bio-like molecules is reached, only 0.000022% (700 molecules) of the make-on-demand library qualify, whereas 0.42% of the ‘in-stock’ molecules do so, a 19,000-fold decrease. Thus, although docking campaigns with the new ultra-large libraries have returned potent molecules with high hit rates21,22,23,24,25, the new libraries do not retain the strong bias to bio-like molecules that was a feature of both in-stock and HTS libraries6.

Fig. 1: Bio-like bias decreases dramatically as the screening library grows from the in-stock library to the make-on-demand library.

a, The distribution of molecules in the in-stock (blue) and make-on-demand (orange) libraries as a function of the Tanimoto similarity to their nearest neighbor in the bio-like molecule set, which contains worldwide drugs, metabolites and natural products. b, Percentage of the in-stock (blue) and make-on-demand (orange) libraries as a function of the Tanimoto similarity to their nearest neighbor in the bio-like molecule set. c, The distribution of docking-prioritized and experimentally active (blue) and nonactive (orange) molecules from five different docking campaigns as a function of the Tanimoto similarity to their nearest neighbor in the bio-like molecule set. The campaigns shown, from left to right, are against the D4 dopamine receptor, the AmpC β-lactamase and the σ2 receptor. The remaining two docking campaigns, against the melatonin receptor and the Nsp3 macrodomain, are shown in Extended Data Fig. 1.

Of course, it could be that the actual docking hits nevertheless resemble bio-like molecules even though the overall library does not. Accordingly, we plotted the similarity to bio-like molecules of large-library docking hits from five targets, including two G protein-coupled receptors (GPCRs)29,30, a third integral membrane protein and two enzymes: the D4 dopamine receptor22, AmpC β-lactamase23, the melatonin receptor25, the σ2 receptor21 and the Nsp3 macrodomain31 from SARS-CoV-2 (Fig. 1c and Extended Data Fig. 1). In all five campaigns, the docking-prioritized molecules had Tc values <0.6 to bio-like molecules, peaking at Tc values of 0.3 to 0.35, similarity values that are not much different from those expected for pairs of random molecules. There was little difference between the distribution of molecules selected for synthesis and testing (orange bars, Fig. 1c and Extended Data Fig. 1) and that of the subset found to be active on target on experimental testing (blue bars, Fig. 1c and Extended Data Fig. 1).

While similarity to bio-like molecules confers little benefit in docking hit rate, it might improve success later in drug discovery. Several investigators32 have noted that natural products, for instance, are more likely to be transporter substrates, improving permeability and exposure. ‘Transportability’ is hard to calculate, but several proxies may be used to calculate cell and organ permeability, including calculated octanol–water partition coefficient (cLogP), topological polar surface area (tPSA), numbers of rotatable bonds and formal charge. By these criteria, the ‘in-stock’ and tangible (make-on-demand) libraries differ little (Extended Data Fig. 2a–d). Even if we only compare the 61,179 bio-like molecules ‘in-stock’ to the tangible library, the same conclusion emerges (Extended Data Fig. 3a–d). We can extend this analysis to violations of Lipinski’s rule-of-five (Ro5)33 and Jorgensen’s rule-of-three (QPPCaco > 22, LogS > −5.7, potential metabolism sites <7)34. We calculated these properties for the 61,179 bio-like molecules ‘in-stock’ and compared them to the same number drawn 30 times from the lead-like tangible molecules (Extended Data Figs. 3e,f). There were actually fewer violations among the tangible molecules than among the bio-like ones. Naturally, this partly reflects the intentional lead-like35 character of the tangible molecules, reducing Ro5 violations, but since this is the set being docked it remains meaningful. Finally, where molecules deriving from ultra-large-library docking have been tested in vivo they have had favorable plasma and brain exposure on intraperitoneal and even oral dosing21,25,36,37. Thus, while we cannot rule out an advantage for bio-like molecules, the physical properties of the tangible molecules put them at no obvious disadvantage.

Docking score improves with library size

Whether more and more favorable molecules are found as the library grows will govern how far we should expand the tangible libraries. Ideally we would like to know how the affinities and hit rates of docked molecules improve with library size, but determining this would be an expensive undertaking. As a proxy, we can ask how docking score improves with library size. While docking score, with its errors and approximations, may be a weak link to likelihood of binding, we have found that it correlates with hit rate in two systems, the D4 dopamine22 and the σ2 receptors21, and it is the primary criterion by which molecules are selected in docking screens.

We docked ever larger libraries against the D4, σ2 and 5HT2A receptors to see how docking score changes with library size. We first docked 344 million, 1.4 billion and 1.7 billion molecules against the three receptors, respectively. From each of these full sets we then drew ever larger random subsets, 30 times at each size, with subset size increasing by half-logs from 10^5 to over 10^9 molecules. For each subset, we measured the scores and scaffolds of the top-ranking 5,000 molecules, divided into quartiles, and the number of molecules with scores better than a given threshold.
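The subsampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the toy normal score distribution, the toy threshold of -50 and the reduced sizes (10^2 to 10^4, top 100) are all hypothetical, chosen so the example runs in seconds. It assumes only what the text states, namely that more negative docking scores are better, that subset sizes increase by half-logs, and that each size is drawn 30 times at random.

```python
import heapq
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def half_log_sizes(min_exp: float, max_size: int) -> List[int]:
    """Subset sizes 10^min_exp, 10^(min_exp + 0.5), ... up to max_size."""
    sizes, exp = [], min_exp
    while round(10 ** exp) <= max_size:
        sizes.append(round(10 ** exp))
        exp += 0.5
    return sizes


def summarize(scores: Sequence[float], top_k: int, threshold: float) -> Dict[str, float]:
    """Best score, mean score of each quartile of the top_k, and count passing threshold."""
    top = heapq.nsmallest(top_k, scores)  # ascending order, so the best score comes first
    q = max(1, len(top) // 4)
    summary = {"best": top[0]}
    for i in range(4):
        chunk = top[i * q:(i + 1) * q] or top[-1:]
        summary[f"quartile_{i + 1}_mean"] = statistics.fmean(chunk)
    summary["n_passing"] = float(sum(s <= threshold for s in scores))
    return summary


def subsample_curve(all_scores: Sequence[float], sizes: Sequence[int], top_k: int = 5_000,
                    threshold: float = -60.0, n_repeats: int = 30,
                    seed: int = 0) -> Dict[int, Dict[str, Tuple[float, float]]]:
    """For each subset size, return (mean, standard deviation) of every summary value."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        runs = [summarize(rng.sample(all_scores, size), top_k, threshold)
                for _ in range(n_repeats)]
        curve[size] = {
            key: (statistics.fmean([r[key] for r in runs]),
                  statistics.stdev([r[key] for r in runs]))
            for key in runs[0]
        }
    return curve


# Toy stand-in for a docked library: 20,000 hypothetical scores, not real docking data.
toy_rng = random.Random(1)
toy_scores = [toy_rng.gauss(-30.0, 8.0) for _ in range(20_000)]
curve = subsample_curve(toy_scores, half_log_sizes(2.0, 10_000), top_k=100, threshold=-50.0)
for size, stats in curve.items():
    mean, sd = stats["quartile_1_mean"]
    print(f"{size:>6} molecules: top-quartile mean score {mean:.1f} +/- {sd:.1f}")
```

With real docking output, `toy_scores` would be replaced by the scores of the full docked library, and `top_k` and `threshold` by the values given in the text (5,000 molecules and the target-specific score cutoff).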

As the subsets grow from 10^5 to over 10^9 molecules, the scores of the top-ranking 5,000 molecules improve monotonically for all three targets (Fig. 2a). This improvement is roughly log-linear for all quartiles among the 5,000 molecules, excluding the very top-scoring molecule, whose score improves faster (but see below), and does not seem to saturate with library size. While the curves appear to have some negative curvature, this mostly reflects larger improvements in score from the smallest docking libraries; above 1 million molecules the rate of change appears steady for each log increase in library size. In short, as the library enlarges, the fit of the top-ranking docked molecules steadily improves without signs of saturation, at least on the log scale.

Fig. 2: Docking performance improves with library size for the D4 (left), σ2 (middle) and 5HT2A (right) receptors.

a, Change of docking score of the top-ranking 5,000 molecules as library size grows. b, Number of Bemis–Murcko scaffolds in the top-ranking 5,000 molecules as library size grows. c, Change with library size of the number of molecules with docking scores suggesting high likelihood of binding, based on experimental correlation21,22, for the D4 and σ2 receptors (the 5HT2A receptor was excluded as this experimental correlation has not been measured for it). d, Change with library size of the number of scaffolds with docking scores suggesting high likelihood of binding. All data are mean ± standard deviation. Each set was selected 30 times with random selection from the full library.

The improvement of the docking scores could reflect new scaffolds appearing in the library as it grows, or it could reflect the optimization of analogs of molecules already present. To investigate this, we analyzed the top 5,000 molecules in each library subset for Bemis–Murcko scaffolds38 (Fig. 2b). The scaffolds can be divided into two categories: singletons without analogs and scaffolds for which analogs exist. Plotting the score variations among singleton scaffolds, analogs in a group scaffold and all top-ranking 5,000 molecules, we observe that both singletons and analog clusters contribute to the improvement of docking scores as the library grows (Extended Data Fig. 4). While the proportion of analogs in the top 5,000 increases with library size, molecules from both categories contribute to score improvements up to the billion-molecule range (Extended Data Fig. 5).

One can also ask how the number of molecules with scores favorable enough that they are likely to bind experimentally changes with library size. Ordinarily this is difficult owing to the approximations and errors in docking, but, at least for the D4 and σ2 receptors, the variation of hit rate with docking score has been measured experimentally by testing about 500 molecules from across the docking scoring range21,22 (this has not been done for the 5HT2A receptor, which was thus excluded from this analysis). For both targets, this revealed a sigmoidal curve with a high hit-rate plateau; molecules that score in this plateau have a high likelihood of binding. For the D4 and σ2 receptor, the plateaus are defined by scores of ≤−60 and ≤−55 DOCK scoring units, respectively21,22. Both the number of molecules and the number of scaffolds in this favorable scoring region increase with library size (Fig. 2c,d), indicating not only that molecules that better fit the site are found, but also that more types of such molecules are found with library growth (Fig. 3).

Fig. 3: Expansion of the long tail with library size. Panels from left to right show the D4, σ2 and 5HT2A receptors, respectively.

The x axis is on a linear scale and the y axis is on a log scale. The color gradient from orange to marine represents library size increasing from small to large.

Artifacts increase with library size

An exception to the log-linear improvement of docking scores may be observed for the very best molecules from the screens (Fig. 2a, blue curves). The score of these molecules shows positive curvature with library growth and, especially in the larger libraries, diverges from that of the other top-ranking 5,000 molecules. On inspection, these are not molecules that fit the receptor uniquely well, but rather molecules that cheat the scoring function by exploiting its holes and approximations. For instance, for the D4 receptor these are molecules that are conformationally strained39, for the σ2 receptor they are molecules that have artifactually low desolvation penalties (and thus too favorable scores)21 and for the σ2 and 5HT2A receptors they are molecules with artifactual atomic partial charges and with wrong tautomers. As the libraries grow, so too does the number of these artifactual hits, and by the time we dock 1.3 billion molecules against σ2, over 98% of the top 100 ranking molecules have incorrect tautomerization. Meanwhile, beyond the top 100,000 docking hits, these artifacts almost disappear; their biggest impact is in a thin slice of the top-ranking docked molecules. (We distinguish between these artifacts, which exploit a hole in the scoring function and are rare, and molecules whose scores are too favorable owing to scoring function approximations but are within some error range of what their true scores should be. A key feature of the rare artifacts is that they crowd the top of the docking scoring list; the more common decoys are more evenly spread throughout.) Still, if one picked molecules exclusively from among the very top-ranking molecules, and was limited to a fixed number of them, the prioritized molecules could easily come to be dominated by artifacts.

Naturally, one solution to these artifactual ‘cheating’ molecules is to fix the holes in the docking scoring function. Certainly, once one finds a particular artifact one can address it. However, two characteristics of these artifacts may make this difficult in general. First, they are rare events; if they were more common, they would be discovered by the retrospective control screens that are commonly conducted before a large prospective screen40. Second, they can change from target to target. For instance, in the campaign against the dopamine D4 receptor it was conformationally strained molecules that contributed most to these artifacts, for the σ2 receptor it was molecules with artifactually low desolvation penalties21, wrong tautomers and artifactual partial atomic charges, the latter two of which also characterized the top-ranking molecules for 5HT2A. Other targets may reveal other artifacts. As rare molecules in a multi-billion-molecule library, these may be hard to anticipate.

One may, however, imagine a general strategy, free of any particular aspect of the docking scoring function, to treat the problem of rare artifacts. In doing so, it is important to consider two of their features: first, they are rare events that rise to the top, and second, especially for large-library screens, there can be hundreds of thousands of molecules that score within the plateau region where molecules may be likely to bind. For instance, in a simplifying example, assume that these rare-event artifacts occur at a rate of 0.001% of the molecules docked. In this case, the number of ‘cheating’ artifacts will increase from 10 to 10,000 as the library grows from one million to one billion molecules. Usually, one can only afford to synthesize and test a fixed number of top-ranking compounds. If that number is 100 molecules, picked from the very top-ranking docked molecules, then in docking a million-molecule library the cheating artifacts will account for only 10% of the molecules tested, but in docking a billion-molecule library they will amount to 10,000 molecules, completely dominating the top 100 ranked molecules.
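The arithmetic of this simplifying example can be checked with a few lines of Python. The function names are hypothetical, and the sketch assumes the worst case implied by the example, namely that every artifact outranks every genuine molecule.

```python
def expected_artifacts(library_size: int, artifacts_per_100k: int = 1) -> int:
    """Expected artifact count; 1 per 100,000 molecules equals the 0.001% rate."""
    return library_size * artifacts_per_100k // 100_000


def worst_case_share_of_picks(library_size: int, n_picked: int = 100,
                              artifacts_per_100k: int = 1) -> float:
    """Share of the n_picked top-ranked molecules that are artifacts,
    assuming every artifact outranks every genuine molecule."""
    n_artifacts = expected_artifacts(library_size, artifacts_per_100k)
    return min(n_artifacts, n_picked) / n_picked


for size in (1_000_000, 1_000_000_000):
    print(size, expected_artifacts(size), worst_case_share_of_picks(size))
```

For the million-molecule library this gives 10 artifacts and a share of 0.1 (10% of the 100 picks); for the billion-molecule library it gives 10,000 artifacts and a share of 1.0.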

More generally, we can model how the number of rare-event, cheating molecules will grow with library size, using a statistical distribution of these molecules versus the rest of the library, and considering different rates of occurrence. We simulate the distribution of these artifacts using both an extreme value distribution and a uniform distribution, while using a normal distribution for other library molecules. From these two distributions, we can estimate the effect of varying the artifact-to-library-molecule ratio with growing library size. Performance is evaluated by the percentage of artifacts in the top N-ranked molecules. With either distribution, artifacts begin to dominate the top-ranking list as the library grows for a given artifact-to-library-molecule ratio (Fig. 4a). If we cannot afford to synthesize and test more than a few hundred top-ranking molecules, the campaign will inevitably begin to falter as libraries rise toward 1 billion molecules.
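A minimal sketch of this kind of simulation is given below. It is not the model used for Fig. 4: the distribution parameters, the function names and the artifact ratio of 10^-3 are all hypothetical, and the library sizes are kept small so that the pure-Python example runs quickly. It assumes that more negative scores are better, draws ordinary library molecules from a normal distribution, and draws artifacts from either an extreme value (Gumbel minimum) or a uniform distribution placed in the favorable tail.

```python
import heapq
import math
import random
from typing import Iterator, Tuple

# Hypothetical score distributions (more negative = better); not taken from the paper.
LIB_MEAN, LIB_SD = -30.0, 8.0      # normal distribution for ordinary library molecules
ART_LOC, ART_SCALE = -65.0, 5.0    # Gumbel (minimum) distribution for artifacts
ART_LOW, ART_HIGH = -80.0, -60.0   # uniform distribution for artifacts


def gumbel_min(rng: random.Random, loc: float, scale: float) -> float:
    """Inverse-transform draw from an extreme value (Gumbel minimum) distribution."""
    u = max(rng.random(), 1e-300)  # keep u strictly inside (0, 1)
    return loc + scale * math.log(-math.log(u))


def artifact_fraction_in_top_n(library_size: int, artifact_ratio: float, top_n: int,
                               artifact_dist: str = "extreme", seed: int = 0) -> float:
    """Fraction of the top_n best-scoring molecules that are artifacts."""
    rng = random.Random(seed)
    n_artifacts = round(library_size * artifact_ratio)

    def scored() -> Iterator[Tuple[float, bool]]:
        # Ordinary molecules, flagged False
        for _ in range(library_size):
            yield rng.gauss(LIB_MEAN, LIB_SD), False
        # Artifacts, flagged True
        for _ in range(n_artifacts):
            if artifact_dist == "extreme":
                yield gumbel_min(rng, ART_LOC, ART_SCALE), True
            else:
                yield rng.uniform(ART_LOW, ART_HIGH), True

    top = heapq.nsmallest(top_n, scored())  # lowest (best) scores first
    if not top:
        return 0.0
    return sum(is_artifact for _, is_artifact in top) / len(top)


for dist in ("extreme", "uniform"):
    for size in (1_000, 10_000, 100_000):
        frac = artifact_fraction_in_top_n(size, artifact_ratio=1e-3, top_n=100,
                                          artifact_dist=dist)
        print(f"{dist:>8} artifacts, library {size:>7}: {100 * frac:.0f}% of top 100")
```

Because the number of picks is fixed while the artifact count scales with library size, the artifact share of the top 100 rises with library size in this toy setting, which is the qualitative behavior the text describes.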

Fig. 4: Number of artifacts increases with library size.

a, Heat maps of the percentage of artifacts in the top N (N = 100, 1,000 (1K), 10,000 (10K) and 100,000 (100K)) docked molecules. The percentage of artifacts in the top N-docked molecules for a given library size and the ratio between artifacts (A) and library molecules (Z), A/Z is colored using a linear scale ranging from 0% (white) to 100% (blue). The artifactual molecules were sampled from the extreme value distribution for the left panel and the uniform distribution for the right panel. M, million; B, billion. b, The percentage of artifactual molecules in the 100 selected molecules between the two strategies. The first strategy is just picking the top 100 molecules, colored by black bars, and the second strategy is selecting 100 molecules evenly distributed from five ranking ranges, colored by gray bars. Five ranking ranges were top 1–100, top 101–1,000, top 1,001–10,000, top 10,001–100,000 and top 100,001–1,000,000. Twenty molecules were drawn at random from each ranking tranche. This selection was repeated 20 times at random. The artifactual molecules were sampled from the extreme value distribution for the left panel and the uniform distribution for the right panel. Data shown here are mean ± standard deviation.

A general solution to this problem is simply not to prioritize the several hundred molecules to be synthesized and tested exclusively from the very top-ranked molecules. Recall that a broad range of high-ranking docked molecules—a range that grows with library size—may have roughly equal likelihood of binding, and in a docking screen of a billion molecules, the top million might have scores that differ little from each other. To explore the impact of such a rank-spreading strategy on rare-event artifacts, we defined as five rank ranges the top 1–100 molecules, the top 101–1,000 molecules, the top 1,001–10,000 molecules, the top 10,001–100,000 molecules and the top 100,001–1,000,000 molecules, picking 20 molecules from each. We plotted the percentage of rare-event artifacts among the 100 molecules picked in this rank-spreading strategy versus the same percentage among simply the top 100 molecules, as a function of library size (Fig. 4b). For a given library size, the percentage of artifacts in the rank-spreading strategy was always lower than picking them exclusively from the top 100 molecules; for larger libraries, this rank-spreading strategy decreased the number of artifacts from 100% to between 25 and 50%. Naturally, there may be other strategies that will achieve the same goal, including rescoring the top-ranked molecules with another scoring function that, while it may also suffer from rare artifacts, may not suffer from the same ones. Even here, a strategy of picking from across the high-ranking ranges may have benefit.
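The two selection strategies compared here can be sketched as follows. The tranche boundaries, the 20 picks per tranche and the 20 random repeats come from the text and the Fig. 4 caption; the function names and the toy ranked list, in which the 150 best-ranked molecules are assumed to be artifacts, are hypothetical.

```python
import random
import statistics
from typing import List, Sequence, Tuple

# Rank tranches from the text, 1-based and inclusive.
TRANCHES: List[Tuple[int, int]] = [
    (1, 100), (101, 1_000), (1_001, 10_000), (10_001, 100_000), (100_001, 1_000_000),
]


def top_n_fraction(ranked_is_artifact: Sequence[bool], n: int = 100) -> float:
    """Artifact fraction when picking only the n best-ranked molecules."""
    picked = ranked_is_artifact[:n]
    return sum(picked) / len(picked)


def rank_spread_fraction(ranked_is_artifact: Sequence[bool], rng: random.Random,
                         tranches: Sequence[Tuple[int, int]] = TRANCHES,
                         per_tranche: int = 20) -> float:
    """Artifact fraction when picking per_tranche molecules at random from each tranche."""
    picked: List[bool] = []
    for first, last in tranches:
        ranks = range(first - 1, min(last, len(ranked_is_artifact)))  # 0-based indices
        k = min(per_tranche, len(ranks))
        picked.extend(ranked_is_artifact[i] for i in rng.sample(ranks, k))
    return sum(picked) / len(picked) if picked else 0.0


def mean_rank_spread_fraction(ranked_is_artifact: Sequence[bool], n_repeats: int = 20,
                              seed: int = 0) -> float:
    """Average the rank-spreading result over repeated random selections."""
    rng = random.Random(seed)
    return statistics.fmean(
        [rank_spread_fraction(ranked_is_artifact, rng) for _ in range(n_repeats)]
    )


# Toy ranked list, best first: the 150 top-ranked molecules are flagged as artifacts.
toy_ranked = [True] * 150 + [False] * (1_000_000 - 150)
print("top 100 only:  ", top_n_fraction(toy_ranked))
print("rank-spreading:", mean_rank_spread_fraction(toy_ranked))
```

In this toy case every one of the top 100 is an artifact, whereas the rank-spreading picks can include artifacts only from the first two tranches, so their share is bounded between 20% and 40%.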

Discussion

Since 2016, readily accessible molecular libraries for virtual screening have increased from 3.5 million to over 29 billion compounds. Our ability to prioritize from this vast chemical space depends on the molecules it explores and the ability of computational methods, often docking, to prioritize true ligands from an ocean of decoys. Three main observations from this study begin to illuminate the molecules that the new libraries explore, and how docking prioritizes them. First, the billion-plus tangible library is 19,000-fold less biased toward bio-like molecules than is the 3.5 million in-stock library. Second, as the libraries grow, better-fitting molecules are found. The improvement in docking score is log-linear with library size and does not yet appear to saturate. Third, as the libraries grow so too do rare-event artifacts. While these are inconsequential for smaller, million-molecule libraries, by the time the libraries grow to a billion molecules they can dominate hit lists. A general strategy of spreading docking picks from the docking rank curve suggests a way to overcome what might be a general problem.

Not only is the 3.1 billion make-on-demand library 19,000-fold less biased toward bio-like molecules than the 3.5 million in-stock library (Fig. 1a), but thousands of experimentally tested high-ranking molecules from five docking campaigns are also dissimilar to bio-like molecules (Fig. 1c). This contradicts the idea, which we6 and others7,8,9,10,11,12,13,14,15,16,17,18,19 have advocated, that biasing a library toward metabolites, natural products and drugs increases the chance of success in screening. Instead, the tangible library is little more similar to these bio-like molecules than one would expect at random, and diverges further from them as it grows. The ‘in-stock’, bio-like and tangible molecules have similar distributions of cLogP, tPSA and rotatable bonds (Extended Data Figs. 2a–c and 3a–c); and the tangible molecules are, if anything, more compliant with Ro5 and ADMET rules than is the bio-like set. To the extent that these physical properties contribute to success in subsequent compound optimization, they differ little between the two sets of molecules. Rather than being biased toward bio-like molecules (which may be simply a historical feature of the ‘in-stock’ libraries and HTS decks6), the tangible libraries are defined by the over 200,000 intentionally diverse and stereogenic building blocks from which they are synthesized. The emphasis on the exploration of a wide range of chemotypes with high three-dimensionality ensures a diverse collection of functionalities and shapes, and it may be this feature, rather than similarity to precedented molecules, that drives the better receptor fits of molecules from these libraries.

The exploration of stereogenic, functionally congested molecules ensures that as the library grows41 more and more molecules are sampled that complement receptor sites well. Docking diverse libraries leads to a long tail of high-scoring molecules, separated from the more normal distribution of docking scores from the library. By docking a library that is 1,000-fold larger than the ‘in-stock’ libraries, which until recently dominated the field, we are essentially extending and filling in this long tail, such that it is populated with statistically relevant sampling of chemotypes (Fig. 3). As the libraries grow, docking scores improve log-linearly, and show no sign of saturation. Interestingly, a similar trend using ligand-based virtual screening has been previously reported; here too, best scores, in this case measuring three-dimensional similarity to known ligands, improve log-linearly up to 10 billion molecules26. For the docked molecules, improved scores derive both from new chemotypes fortuitously appearing in the libraries and from analogs of previously explored scaffolds that optimize fit (Extended Data Fig. 4). An inference from these trends is that, at least for now, screening larger and larger libraries will continue to improve docking results (with one caveat, discussed next), and the hits that emerge will often have analogs to support early optimization.

Counterbalancing the improvement in docking fits with library expansion is the growth in the raw number of rare-event artifacts. If the number of molecules one could synthesize and test could be scaled with library size this would not be a problem. But, with resources to synthesize and test only an essentially fixed number of molecules, these rare, high-ranking artifacts will eventually overwhelm the true positives (Fig. 4a). This outcome can be alleviated by a strategy that not only tests the very top-ranking molecules, but also selects ones from slightly lower ranks that remain high scoring (Fig. 4b). We suspect that such rare-event artifacts will occur in most types of library screens, including HTS, DNA-encoded chemical libraries or even genetic screens, and will become more pernicious with library size. Variations of this strategy may also be useful in these other areas.

Certain caveats bear airing. While bio-like molecules confer no advantage in docking hit rate, nor in physical properties, they may have advantages not directly assessed here, including being transporter substrates32. Thus far, where molecules from ultra-large-library docking have been tested in vivo they have had favorable plasma and brain exposure on intraperitoneal and even oral dosing21,25,36,37, but this remains a small set of experiments. Mechanically, the divergence of the tangible libraries from bio-like molecules has only been measured by one type of topological similarity; other metrics may show different levels of divergence. We suspect that, while this may affect the results quantitatively, qualitatively the story will remain; certainly, the number of molecules that are identical to metabolites, natural products and drugs will not change. Apropos of the cheating artifacts, we would re-emphasize that these are rare molecules that score well by finding holes in the scoring function. They are not the general run of molecules that are evaluated properly but with enough error that they rank too highly; docking continues to struggle with these, and the strategies suggested here will not address them. Finally, other, more quantitative approaches to solving the problem of rare artifacts can be considered, including rescoring to identify molecules exploiting holes in one scoring function that another function does not share.

The key observations of this work should not be obscured by these caveats. Virtual libraries are growing into a chemical space that is far less similar to bio-like molecules than are in-stock libraries. Despite this, multiple screens of these billion-molecule libraries have returned potent ligands with high success rate21,22,23,24,25, suggesting that bias toward precedented molecules might never have explained the success of large-library screens. Indeed, simulation of docking performance with library size shows that we are still in the domain where ever larger, diverse, stereogenic libraries will continue to fortuitously explore molecules with better and better fit for a target binding site. Strategies to avoid rare-event artifacts will help to ensure that docking and related techniques can continue to prioritize from this growing chemical space, finding ever-more interesting molecules.

Methods

Bio-like libraries

We used two sets of molecules from the ZINC15 database (https://zinc15.docking.org) to approximate the chemical space of bio-like molecules: the worldwide drug set (https://zinc15.docking.org/substances/subsets/world/) and the biogenic set (https://zinc15.docking.org/substances/subsets/biogenic/). The worldwide drug set contains 5,900 compounds and the biogenic set contains 168,185.

Screening libraries

The in-stock and make-on-demand libraries were used in the analysis of bio-like bias. Molecules from both libraries are within the lead-like range: cLogP ≤ 3.5 and heavy atom count ≤25. The in-stock library contained 3,539,537 molecules. The make-on-demand library contained 3,164,844,749 molecules at the time of the bio-like bias quantification and 4,941,080,527 molecules at the time of the physical property calculations.

Quantifying bio-like bias

Each molecule of the in-stock and make-on-demand libraries was in turn compared to each molecule of the bio-like library. Compounds were represented by their ChemAxon ECFP4 fingerprints (https://chemaxon.com/). The length of the fingerprints was 1,024 bits. The similarity was calculated by comparing their respective ECFP4 fingerprints with the Tanimoto coefficient. The tool to measure this similarity was deposited at https://github.com/docking-org/ChemInfTools. Related figures were made using GraphPad Prism v.9.4.
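The nearest-neighbor comparison described above can be sketched in plain Python. This is an illustration only, not the deposited ChemInfTools code: each fingerprint is represented here as a set of on-bit indices, and the function names and toy fingerprints are hypothetical. In practice the bit sets would come from a cheminformatics toolkit's ECFP4 implementation.

```python
from typing import FrozenSet, Iterable, List

Fingerprint = FrozenSet[int]  # indices of the bits set in a 1,024-bit fingerprint


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Shared on-bits divided by the number of on-bits present in either fingerprint."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def nearest_neighbor_tc(query: Fingerprint, reference: Iterable[Fingerprint]) -> float:
    """Highest Tc between one library molecule and any reference (bio-like) molecule."""
    return max((tanimoto(query, ref) for ref in reference), default=0.0)


def tc_histogram(library: Iterable[Fingerprint], reference: List[Fingerprint],
                 n_bins: int = 20) -> List[int]:
    """Count library molecules per nearest-neighbor Tc bin of width 1 / n_bins."""
    counts = [0] * n_bins
    for fp in library:
        tc = nearest_neighbor_tc(fp, reference)
        counts[min(int(tc * n_bins), n_bins - 1)] += 1  # Tc = 1.0 goes in the last bin
    return counts


# Toy fingerprints (hypothetical bit indices), not real molecules.
bio_like = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
library = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5, 6}), frozenset({100, 200})]
print([round(nearest_neighbor_tc(fp, bio_like), 2) for fp in library])
print(tc_histogram(library, bio_like))
```

The all-against-all loop is quadratic in cost, so at the scale of billions of library molecules a vectorized or chunked implementation would be needed; the sketch only shows the quantity being computed.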

Physical property calculations

cLogP, tPSA and number of rotatable bonds for each molecule from the in-stock bio-like, in-stock and make-on-demand libraries were calculated with the RDKit v.2020.09.1.0 package (https://www.rdkit.org). The net charge of the 61,179 in-stock bio-like molecules was predicted at pH 7.4 by the majormicrospecies module in ChemAxon JChem 21.13 (https://chemaxon.com/). The net charges of in-stock and make-on-demand molecules were precalculated in ZINC22; details can be found at https://cartblanche22.docking.org/tranches/3d. To evaluate the number of violations of Lipinski’s rule-of-five and Jorgensen’s rule-of-three, 61,179 molecules were randomly picked 30 times from the lead-like make-on-demand library (https://cartblanche22.docking.org/search/random). These two metrics were calculated by QikProp from the 2022-1 release of the Schrödinger suite. Related figures were made using GraphPad Prism v.9.4.

Molecular docking

The docking setups of the D4 and σ2 campaigns were reported previously21,22. The 5HT2A receptor structure with a bound ligand (unpublished) was used in the docking calculation. This unpublished structure is in an active state and is similar to the published 5HT2A active structure (PDB 6WHA), with a low Cα root mean square deviation of 0.8 Å. The atoms of lisuride were used to calculate the matching spheres in the orthosteric site. The spheres were labeled42 according to the charge–charge interaction and hydrogen-bond patterns of the lisuride ligand in the cryo-EM structure. Forty-five spheres were used in total and were grouped into seven clusters on the basis of their spatial locations in the binding site by the k-means clustering method. These labeled and clustered spheres were used to improve search efficiency and thereby speed up docking calculations. The complex structure was protonated by Epik and PROPKA at pH 7.0 in Maestro (2021 release). Partial charges of residue atoms were assigned on the basis of AMBER united atom types. The volume of the low dielectric and the desolvation volume were extended out from the surface of the receptor by 1.1 Å and 0.5 Å, respectively. Docking energy grids were precalculated with QNIFFT43 for Poisson–Boltzmann-based electrostatic potentials, AMBER force fields using CHEMGRID for van der Waals potentials44 and SOLVMAP45 for ligand desolvation.

Since the 5HT2A receptor structure used in this study is in the active state, the docking setup was evaluated for its ability to enrich known 5HT2A agonists over property-matched decoys. Decoys are molecules that share similar physical properties with known ligands but have dissimilar topology, so they are unlikely to bind to the receptor. Forty-seven known 5HT2A agonists were extracted from the IUPHAR46 and ChEMBL47 databases and 2,050 property-matched decoys were generated by the DUD-E pipeline48. Docking performance was judged by the ability to enrich the known 5HT2A agonists over the decoys on the basis of docking rank, using logAUC values. The docking setup described above achieved a logAUC value of 5. An ‘extrema’ set48 of 146,620 molecules was constructed through the DUDE-Z web server (http://tldr.docking.org) to make sure that molecules with extreme charge properties were not prioritized. With this docking setup, over 97% of the top 1,000 ranking molecules were monocations, with a high logAUC value of 27. A small ‘goldilocks’ set (2 < cLogP ≤ 3 and 300 Da < molecular weight ≤ 350 Da) of 1,161,497 molecules was also downloaded from the DUDE-Z web server (http://tldr.docking.org) to check whether known 5HT2A agonists remain among the highest scored compounds. The docking setup achieved a decent logAUC value of 25 in this control experiment.
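For readers unfamiliar with the metric, the following sketch shows one common way to compute an adjusted logAUC from docking scores. It is an illustration under stated assumptions rather than the code used in this study: the curve is integrated over log10 of the fraction of decoys found from 0.1% to 100%, the area is normalized to a percentage, and the value expected for random ranking (about 14.46%) is subtracted. The function name, the 0.1% lower bound and the tie-handling rule are choices made here.

```python
import math
from typing import Sequence


def adjusted_log_auc(ligand_scores: Sequence[float],
                     decoy_scores: Sequence[float],
                     lower_bound: float = 0.001) -> float:
    """Adjusted logAUC (%) for a ranking in which lower scores are better."""
    n_lig, n_dec = len(ligand_scores), len(decoy_scores)
    # Label each molecule and sort by score; on ties, decoys sort first (pessimistic).
    ranked = sorted([(s, 0) for s in decoy_scores] + [(s, 1) for s in ligand_scores])
    area = 0.0
    ligands_seen = 0
    decoys_seen = 0
    for _, is_ligand in ranked:
        if is_ligand:
            ligands_seen += 1
            continue
        # Each decoy moves the curve right, at constant height, from left_edge to right_edge.
        left_edge = decoys_seen / n_dec
        decoys_seen += 1
        right_edge = decoys_seen / n_dec
        if right_edge <= lower_bound:
            continue  # this step lies entirely below the integration window
        left_edge = max(left_edge, lower_bound)
        area += (ligands_seen / n_lig) * (math.log10(right_edge) - math.log10(left_edge))
    span = math.log10(1.0 / lower_bound)
    percent = 100.0 * area / span
    # Area obtained when the ligand fraction equals the decoy fraction (random ranking).
    random_percent = 100.0 * (1.0 - lower_bound) / (math.log(10) * span)
    return percent - random_percent


# Hypothetical scores: every ligand outranks every decoy, then the reverse.
perfect = adjusted_log_auc([-50.0, -49.0], [float(s) for s in range(-40, -20)])
worst = adjusted_log_auc([0.0, 1.0], [float(s) for s in range(-40, -20)])
```

Under this definition a random ranking scores about 0, a perfect ranking scores about 85.5 and a fully inverted ranking scores about −14.5, which gives a scale for reading the logAUC values quoted above.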

Using DOCK3.8, over 344 million, over 1.3 billion and over 1.6 billion library molecules from ZINC20/ZINC22 (http://zinc20.docking.org and https://cartblanche22.docking.org) were docked against the D4, σ2 and 5HT2A receptors, respectively. Each library molecule was sampled in about 2,761, 3,409 and 713 orientations and, on average, 174, 235 and 350 conformations for the D4, σ2 and 5HT2A receptors, respectively. The total calculation time was 70,705, 740,030 and 680,653 hours for the D4, σ2 and 5HT2A receptors, respectively.

Simulating docking performance with library size

To investigate how the dock scores of the top 5,000 molecules, the number of chemotypes in the top 5,000 and the number of molecules below a given dock score cutoff change with library size, we docked the full make-on-demand library against the D4, σ2 and 5HT2A receptors with the docking setups described above. To evaluate the effects of library size on these three metrics, sets of 10⁵, 3 × 10⁵, 10⁶, 3 × 10⁶, 10⁷, 3 × 10⁷, 10⁸, 3 × 10⁸ and 10⁹ (if possible) molecules were randomly picked from the full docking-ranked list and the three metrics were measured. Each set size was drawn 30 times by random selection from the full library. Chemotypes here were defined by the Bemis–Murcko scaffold analysis38. The program mitools v.2020.04.4 (https://www.molinspiration.com/) was used to calculate scaffolds for this analysis. Related figures were made using GraphPad Prism v.9.4.
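The subsampling procedure can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis code of the study: each docked molecule is assumed to be a (score, scaffold) record with the scaffold already assigned, the top-ranked scores are summarized by their mean, and the function names, toy library and parameter values are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Record = Tuple[float, str]  # (dock score, scaffold identifier)


def subsample_metrics(docked: Sequence[Record], subset_size: int, top_n: int,
                      score_cutoff: float, rng: random.Random) -> Dict[str, float]:
    """Draw one random subset and measure the three metrics on it."""
    subset = rng.sample(docked, subset_size)
    top = sorted(subset)[:top_n]  # lower (more negative) scores rank first
    return {
        "mean_top_score": sum(score for score, _ in top) / len(top),
        "chemotypes_in_top": len({scaffold for _, scaffold in top}),
        "n_below_cutoff": sum(1 for score, _ in subset if score < score_cutoff),
    }


def repeat_subsampling(docked: Sequence[Record], subset_sizes: Sequence[int],
                       top_n: int, score_cutoff: float,
                       n_repeats: int = 30, seed: int = 0) -> Dict[int, List[Dict[str, float]]]:
    """Repeat the random draw n_repeats times for every subset size that fits the library."""
    rng = random.Random(seed)
    results: Dict[int, List[Dict[str, float]]] = {}
    for size in subset_sizes:
        if size > len(docked):
            continue  # sizes larger than the docked library are skipped ("if possible")
        results[size] = [subsample_metrics(docked, size, top_n, score_cutoff, rng)
                         for _ in range(n_repeats)]
    return results


# Synthetic stand-in for a docking-ranked list: 20,000 scores and 500 scaffold labels.
toy_rng = random.Random(1)
toy_library = [(toy_rng.gauss(-30.0, 8.0), f"scaffold_{toy_rng.randrange(500)}")
               for _ in range(20_000)]
toy_results = repeat_subsampling(toy_library, [100, 1_000, 10_000, 100_000],
                                 top_n=50, score_cutoff=-50.0)
```

In the toy run the largest requested size exceeds the synthetic library and is therefore dropped, mirroring the "if possible" qualifier on the largest set size.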

Toy model for artifacts with library size

We constructed a model of how artifacts change with library size. Inputs to the model were the artifact-to-library-molecule ratio and the library size. For each given artifact-to-library-molecule ratio and library size, the distributions were sampled 20 times independently. We used the extreme value distribution (the shape parameter c = −0.1) or the uniform distribution (the mean parameter (loc) is −10 and the standard deviation parameter (scale) −20) to sample artifacts, while we used the normal distribution (the parameter loc = 5 and 0, respectively) for library molecules. The percentage of artifactual molecules was calculated for the top 100, top 1,000, top 10,000 and top 100,000. Related figures were made using GraphPad Prism v.9.4.
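A toy model of this kind might be sketched as below. This is an illustration only, not the model code of the study: artifacts are drawn from a uniform distribution and library molecules from a normal distribution, and every numerical parameter (the score ranges, the standard deviation, the ratio and the library sizes) is a hypothetical value chosen here so that the example runs quickly.

```python
import heapq
import random
from typing import Dict


def artifact_fraction_in_top(library_size: int, artifact_ratio: float, top_n: int,
                             rng: random.Random,
                             library_mean: float = 0.0, library_sd: float = 5.0,
                             artifact_low: float = -30.0,
                             artifact_high: float = -10.0) -> float:
    """Fraction of the top_n best-scoring molecules that are artifacts in one sampled library."""
    n_artifacts = round(library_size * artifact_ratio)
    n_genuine = library_size - n_artifacts
    # Each entry is (score, is_artifact); lower scores are better.
    scores = [(rng.gauss(library_mean, library_sd), False) for _ in range(n_genuine)]
    scores += [(rng.uniform(artifact_low, artifact_high), True) for _ in range(n_artifacts)]
    top = heapq.nsmallest(top_n, scores)
    return sum(1 for _, is_artifact in top if is_artifact) / len(top)


# Twenty independent samples per library size at a fixed, hypothetical artifact ratio.
model_rng = random.Random(0)
mean_fraction_by_size: Dict[int, float] = {}
for size in (1_000, 10_000, 100_000):
    fractions = [artifact_fraction_in_top(size, 1e-3, 100, model_rng) for _ in range(20)]
    mean_fraction_by_size[size] = sum(fractions) / len(fractions)
```

With these illustrative settings the share of artifacts among the top 100 rises as the library grows, because the number of artifacts scales with library size while the size of the top-ranked list stays fixed.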

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.