Evolution of commercially available compounds for HTS

Over recent years, an industry of compound suppliers has grown to provide drug discovery with screening compounds: it is estimated that there are over 16 million compounds available from these sources. Here, we review the chemical space covered by suppliers’ compound libraries (SCL) in terms of compound chemophysical properties, novelty, diversity, and quality. We examine the feasibility of compiling high-quality vendor-based libraries avoiding complicated, expensive compound management activity, and compare the resulting libraries to the ChEMBL data set. We also consider how vendors have responded to the evolving requirements for drug discovery.


Introduction
A growing body of evidence from clinical outcomes, along with scientific and technological advances over the past decades, has shaped the strategies of early-stage drug discovery [1]. High-throughput screening (HTS) has evolved since its introduction during the early 1990s. Initially, many pharmaceutical companies were screening hundreds of thousands of compounds against hundreds of targets per year. Today, HTS is often complemented with fragment-based lead discovery (FBLD) [2], encoded library technologies [3], and phenotypic approaches [4] to form a comprehensive screening toolbox and an opportunity to combine knowledge from each approach to successfully identify new lead molecules. Despite these industry-changing 'paradigm shifts', the number of new drugs approved per US$1 billion spent on research and development (R&D) has been halving every 9 years since 1950 [5], and the estimated R&D spending per new product now exceeds US$2 billion [6].
There has been much speculation in the literature and in the industry around the quality of HTS data derived from random screening, both in terms of sample purity and the physicochemical properties of HTS screening decks. Many consider the classical approaches used by James Whyte Black during the 1960s-1970s [5,7] as being a preferred alternative. However, further studies have clearly shown that HTS is a valuable part of a proven scientific toolkit [...] limited undesirable functionality: no 'PAINS', stable, no hot functionality (except covalent libraries)]; (iii) possibility of provision of analogs for hit follow-up in a time- and cost-effective manner (except for NP and metabolites); (iv) the SCL represents numerous and/or original chemotypes, as defined by Bemis-Murcko, Tanimoto, and so on; and (v) the vendor updates the catalog regularly, and is clear about pricing with transparent and prompt communication throughout the purchasing process.
However, a comprehensive analysis of the vendors fulfilling the above-mentioned criteria is limited to the information extractable from open sources, because most companies prefer not to share their analyses of various vendors. Therefore, we used cheminformatic approaches to compare the SCLs found in open platforms. As an indirect indicator of the vendor's activity in the field, we analyzed the dynamics of the reshaping and growth of their collections over a set time period.

Collection of the data and characteristics of the data sets
The starting point of the current study was the creation of the chemical space covered by purchasable screening compounds using the ZINC database†. To create this space, we standardized the SMILES for all the sets involved in our search using RDKit nodes for the KNIME analytics platform‡. This space was defined as the union of the standardized SMILES strings of all the sets prepared in this way. Duplicates were deleted from the newly created large set. After removal of duplicates, the standardized space comprised 16 902 208 unique structures, including stereoisomers (all stereochemical features mentioned by vendors were included). As illustrated by Figure 1 and Figure S1 in the supplementary information online, the impact of the vendors on the space differed significantly by the number of structures as well as by percentage of unique compounds. Of the 33 sets, eight showed a high fraction of unique compounds (80% and more): Abamachem, AnalytiCon Discovery, BCH Research, Enamine, FCH Group, Intermed, Selenachem, and UORSY; all these sets, except for AnalytiCon Discovery, contained more than 1 million molecules. Eight sets contained a medium fraction of unique compounds (40-80%), and three of these sets comprised 1 million or more molecules (Asischem, ChemBridge, and ChemDiv). Even though Princeton Biomolecular Research and Vitas-M contained 1.2 million and 1.4 million molecules, respectively, the fraction of unique compounds was <10% for both databases.
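The union and uniqueness bookkeeping described above can be illustrated with a minimal sketch. It assumes the SMILES strings have already been standardized (the study used RDKit nodes in KNIME for that step, which is not reproduced here); the vendor names and the helper `unique_fractions` are hypothetical and serve only as an illustration.

```python
from collections import Counter

def unique_fractions(vendor_sets):
    """Return the merged space and, per vendor, the fraction of its
    structures that no other vendor offers."""
    # Count in how many vendor catalogs each standardized structure appears.
    counts = Counter()
    for smiles in vendor_sets.values():
        counts.update(set(smiles))  # set() drops duplicates inside one catalog
    union = set(counts)  # the deduplicated purchasable space
    fractions = {}
    for vendor, smiles in vendor_sets.items():
        own = set(smiles)
        unique = sum(1 for s in own if counts[s] == 1)
        fractions[vendor] = unique / len(own) if own else 0.0
    return union, fractions

# Toy catalogs: made-up vendor names, SMILES assumed to be standardized.
catalogs = {
    "vendor_a": ["CCO", "CCN", "c1ccccc1", "CCO"],
    "vendor_b": ["CCO", "CC(=O)O"],
    "vendor_c": ["CCN", "CCO"],
}
space, fractions = unique_fractions(catalogs)
```

In this toy case the merged space holds four structures, and the per-vendor fractions play the role of the 'percentage of unique compounds' discussed in the text.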

Compound-level analysis (for the 16 902 208 set)
For the preliminary evaluation of the quality of the purchasable chemical space as well as the set from each vendor, ten molecular properties were chosen: MW, logP, heavy atom (HA) count, number of hydrogen bond donors (HBDs), number of hydrogen bond acceptors (HBAs), polar surface area (PSA), number of rotatable bonds (ROTB), Fsp3, number of rings, and number of aromatic rings. The mean values of these parameters are detailed in Table 1. We also compared these values with the corresponding data from our previous analysis from 2011 [44]. The data showed that, during the past 7 years, the mean values of the six parameters mentioned in our previous paper significantly shifted from drug-likeness to lead-likeness, which accords with general trends in screening library criteria. The mean MW (∆ = -26), logP (∆ = -0.67), PSA (∆ = -22.4), HBA (∆ = -1.57), and ROTB (∆ = -0.47) significantly decreased, whereas mean HBD slightly increased (∆ = +0.20). Given the impact of historical compounds from the collections of the main players in the field, which strongly affected the mean values, we compared the mean values of the compounds appearing from 2010 to 2017§; encouragingly, these results were the closest to the lead-oriented synthesis concept¶. We also compared the characteristics of the 'new compounds' set from the SCL 2010-2017 with those of the European Lead Factory** (ELF) library [45] (mean values calculated on the basis of the data from two publications [46,47]; Table 1).
In addition to the mean values, we analyzed the distribution of the aforementioned parameters for all purchasable chemical space as well as for each vendor collection (for exact information on vendors, see PhysChem.xlsx in the supplementary information online). To simplify the visualization of the distributions of each vendor compared with the space, we divided the distributions into several areas. The distributions that were difficult to assign to the areas are marked in the figures as 'outliers'. The representative examples of such simplifications are shown in Figure S2 in the supplementary information online.
For example, in reviewing the results for MW, we believe there are three general categories of suppliers: Area 1: ten distribution curves (Abama Chemicals, BCH Research, Intermed Chemicals, Selena Chemicals, ChemBridge, Enamine, FCH Group, Key Organics, Maybridge, and UORSY) have narrow peaks with maxima between 300 and 400 Da; Area 2: 18 distribution curves (Alinda Chemicals, Asinex, ChemDiv, Aronis, Asischem, Chemical Block, InterBioScreen, Life Chemicals, Otava Chemicals, Pharmeks, Princeton Biomolecular Research, Selleck Chemicals, Specs, Timtec, Tocris, Toslab, Vitas-M Laboratory, and Zelinsky Institute) have wide peaks with a vertex at 400 Da. By contrast, five curves (AnalytiCon Discovery, Alfa Chemistry, Fluorochem, MolMall, and Oakwood Chemicals) were left as is and recognized as 'outliers'. Another representative example of simplification is the distribution of HBD number given in Figure S2 in the supplementary information online. Using such an approach, distributions of all above-mentioned parameters were calculated and are shown in Figure 2.
Among the compound suppliers, AnalytiCon Discovery, Alfa Chemistry, Fluorochem, MolMall, and Oakwood Chemicals were identified as 'frequent outliers'. The main reason for this rests on the main business activity of these companies. AnalytiCon Discovery specializes in natural products and macrocycles; Fluorochem and Oakwood Chemicals are widely known as suppliers of building blocks and reagents; Alfa Chemistry is a contract research organization; and MolMall is a small collection of samples from different sources. None of these companies is a 'classical' producer of compounds for HTS. However, despite differences in the parameter distributions of each vendor, the cumulative distributions of the parameters of purchasable space have one peak, which is usual for a screening collection. An exception is the Fsp3 distribution, which has a more complex character, unlike the curves of individual vendors. In this case, old historical collections and the newly synthesized compounds have significantly different Fsp3 parameter values (Figure S3.01 in the supplementary information online). Nevertheless, the quantitative estimate of drug-likeness (QED) [48] histogram for the purchasable space revealed the quality of the compounds based on this parameter (see QED.xlsx in the supplementary information online). The maximum of the QED distribution was at 0.8-0.9 (Figure 2).
The chemical diversity of the space and vendor collections was analyzed by ECFP4-based Tanimoto similarity of each compound with its nearest neighbor (for all vendors, see Figures S3.01-3.10 in the supplementary information online). For the purchasable space, the corresponding histogram is shown in Figure 2. Its profile demonstrates a diverse set with a mean Tanimoto distance to nearest neighbor of 0.3. Notably, Tanimoto diversity for the purchasable space is worse than the data announced for the Joint European Compound Library (JECL): a mean Tanimoto distance of 0.4 to the nearest neighbor [47]. Deeper analysis of the contribution of each supplier to the joint diversity of the space showed that some sets represent completely different areas of chemical space, whereas others have a significant overlap. As an example, the AnalytiCon set has a low internal diversity but occupies a significantly different space from other vendors (median Tanimoto distance 0.18 within the set, but 0.55 against the full space). By contrast, the Vitas-M set is narrowly distributed (median Tanimoto distance 0.24 within the set, and 0.29 against the full space). The Selleck set had high internal diversity and differed from other vendors (median Tanimoto distance 0.56 within the set, and 0.46 against the full space). The corresponding histograms are shown in Figures S4.01-4.33 in the supplementary information online.
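As an illustration of the nearest-neighbor statistic used above, the following minimal sketch computes the Tanimoto distance of each compound to its most similar neighbor. It assumes that fingerprints are available as sets of on-bit indices (in the study they were ECFP4 fingerprints; the toy bit sets below are invented), and it uses a brute-force all-pairs loop that would not scale to the 16.9-million space without indexing or sampling.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def nearest_neighbor_distances(fps):
    """For each fingerprint, return 1 - similarity to its most similar other member."""
    distances = []
    for i, fp in enumerate(fps):
        best = max(tanimoto(fp, other) for j, other in enumerate(fps) if j != i)
        distances.append(1.0 - best)
    return distances

# Invented bit sets standing in for real fingerprints.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13, 14}]
nn = nearest_neighbor_distances(toy_fps)
mean_nn = sum(nn) / len(nn)
```

The 'within the set' and 'against the full space' medians quoted above correspond to running the same search with the neighbor pool restricted to one vendor or extended to the whole space, respectively.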
For the 3D-shape analysis of the purchasable space as well as vendor sets, the Plane of Best Fit (PBF)-Principal Moments of Inertia (PMI) approach was used [49]. Generation of coordinates and geometry optimization (MMFF94, 100 iterations per molecule), along with subsequent PMI and PBF calculations, were performed using RDKit. Density plots were built in R Statistics using the hexbin package; the plot for the complete space is shown in Figure 3a.

Scaffold level analysis
Bemis-Murcko loose frameworks (scaffolds) analysis [50] was used to evaluate the 2D shape and topology of the compounds in the purchasable space and each vendor collection (Figures S6.01-6.33 in the supplementary information online). This analysis gave 2 886 942 unique frameworks representing purchasable space. Cumulative scaffold frequency plots (CSFP) [51] were built for the space and vendor collections. As in the case of compound-level analysis, the main 'area' and outliers were identified. This time, UORSY appeared among the outliers; its CSFP was close to those of BindingDB and DrugBank (Figure 3b).
Equal distributions of compounds across molecular scaffolds were found in the Selleck and Tocris collections, mainly because of the main profiles of these companies: Selleck and Tocris are globally recognized suppliers of reference compounds, which are usually used as standards in different screening assays as well as in biomedical investigations. Our data are in slight disagreement with a recently published analysis of the libraries of the main players [59], but the CSFP curves obtained therein fit the 'area' in Figure 3b.
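The points of a cumulative scaffold frequency plot can be derived as in the following minimal sketch. It assumes that each compound has already been assigned a scaffold label (in the study, Bemis-Murcko loose frameworks); the labels and the function name are hypothetical.

```python
from collections import Counter

def cumulative_scaffold_frequency(scaffolds):
    """Return (fraction of scaffolds, fraction of compounds covered) points,
    starting with the most populated scaffold."""
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    total_compounds = sum(counts)
    points = []
    covered = 0
    for rank, n in enumerate(counts, start=1):
        covered += n
        points.append((rank / len(counts), covered / total_compounds))
    return points

# One scaffold label per compound (made-up labels).
labels = ["s1"] * 6 + ["s2"] * 2 + ["s3", "s4"]
curve = cumulative_scaffold_frequency(labels)
```

A collection in which compounds are spread equally across scaffolds yields a curve close to the diagonal, whereas a collection dominated by a few heavily populated scaffolds rises steeply at the start, as in this toy example where one quarter of the scaffolds covers 60% of the compounds.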

SCL changes analysis
An important factor in the choice of compound vendor is the viability of the sample resupply and further opportunity for the hit follow-up support [38]. Another is how vendors have responded to the desire for more lead-like compounds. To address these issues, we focused on companies active in this field. Promotional materials of those companies do not give a true picture; therefore, we evaluated such companies by comparing the results of analyses carried out in 2010 and in the current paper. Initially, differences in compound numbers in collections were plotted (Figure 4). Some vendors present in 2010 (AMRI, ComGenex, Tripos, ART-CHEM, Nanosyn, SALOR, IVK Laboratories, ChemStar, Ufark, and Spectrum) were absent from ZINC in 2017. Some of these companies had been sold (e.g., ComGenex§§ or Tripos¶¶), whereas others, such as AMRI and Nanosyn, provided integrated MedChem solutions using in-house libraries. Moreover, none of these vendors was an active participant in screening compound production. In 2017, 14 new vendors were present: AnalytiCon, Selleck, Tocris, MolMall, Alfa Chemistry, Aronis, Chemical Block, Alinda, Zelinsky Institute, Intermed, BCH Research, Abamachem, Selena Chemicals, and FCH Group. The libraries of the latter four contain more than 1 million unique diverse compounds with good PhysChem properties (see Cut_off_filtering.xlsx in the supplementary information online), proving their activity in the screening compound market.
At a cursory glance, the space was sufficiently diverse and covered significant PhysChem parameters for most screening campaigns; thus, it could deliver an appropriate HTS set. To verify this statement, several case studies were performed.

Case study: an 'ideal' million
Among the variety of screening paradigms that exist to identify hits [53], we chose an example comprising building a compound set to screen against a novel target with an unknown structure, with few known active chemotypes, or without existing small-molecule modulators. In this case, HTS is the method of choice for its potential to identify quality leads because it does not require information about the target. However, determining the optimal size of such a screening deck is problematic. Several studies have addressed this question but the optimal size of a screening collection [54,55] has remained undefined and varied.
The technical possibilities of modern HTS are almost unlimited. Nowadays, 384-well microtitre plates are the 'gold standard', whereas 1536-well plates are increasing in popularity, and even 3456-well microtitre plates are used in some projects. Throughputs of ≥100 000 compounds screened per day are routine in leading HTS practitioner laboratories using in vitro biochemical, functional cell-based, reporter gene, and phenotypic assays [56]. According to reports on screening campaigns, the number of compounds used in an 'all-or-nothing' screening mode ranges from 50 000 to 1 500 000 [57]: a maximum mean value of 800 000 compounds per screen was reported in 2003, whereas this number had decreased to 500 000 in 2009 [58]. Despite a low true positive hit rate (<1% in 2010 [59]), in 2018, AZ concluded that increasing success could be achieved by gaining access to as many compounds as possible [13]. Moreover, choosing the 'relevant region' of the chemical space [28] would decrease further attrition and increase the true positive hit rate [60]. Support for the trend toward campaigns that screen several million compounds comes from the multiplexing of more than one compound per well during primary HTS to increase the capacity without compromising screening quality [61]. Thus, for the first case study, we assembled a screening deck of 1 million lead-like compounds, based on 50 000 scaffolds with 20 representatives each, belonging to clusters that were as diverse as possible. We limited the number of compounds to eliminate molecular redundancy [62], but left a sufficient number of compounds per cluster to efficiently identify latent hit series and rapid preliminary structure-activity relationships (SARs), and to avoid any singletons [63]. Currently, there is controversy over the optimal number of compounds per cluster or scaffold.
The first papers discussing the issue were published in the early 2000s, although their conclusions varied from 10 [64] to 50-100 [65] compounds per scaffold. By contrast, the 'Open Scaffolds' collection from Compounds Australia was built with ≤30 SAR-meaningful compounds per scaffold (average value 28) [66]. Nevertheless, a series of 5-20 compounds was most frequently used by Pfizer [67] during plate-based diversity subset generation 2 (PBDS2). Therefore, we selected a model value of 20 compounds per scaffold, also in agreement with the opinion of Bostwick***. For comparison, we also ran the study using 50 compounds per scaffold.
To build an 'ideal million' set, we initially subjected the purchasable chemical space of 16 902 208 compounds to structural filtering against PAINS (despite recent criticism [68], the filters are routinely used) and the toxicology/reactive Eli Lilly rules [28,29], which afforded 15 968 338 compounds. Further application of the lead-likeness [69] and Ro3/75 [23] criteria resulted in two spaces with 6 544 044 and 3 705 803 compounds, respectively. Bemis-Murcko loose framework analysis of the sets gave only 39 101 and 22 162 scaffolds bearing more than 20 compounds per scaffold and 13 156 and 8006 scaffolds bearing more than 50 compounds per scaffold (Table 3). Given that the first model ideal million set (20 compounds per scaffold) would require 50 000 scaffolds and fewer than this were available from drug-like space, we targeted a 0.5 million set represented by 25 000 scaffolds with 20 compounds per scaffold and used the 6 544 044 set. From the 39 101 scaffolds of this set, we extracted the 25 000 most diverse using the MaxMin algorithm [70]. If a scaffold had more than 20 compounds in the lead-like space, we selected the 20 most diverse structures using the same MaxMin algorithm [70]. In this 'ideal half million', unique structures from all 33 suppliers were represented, although the contribution of each supplier varied significantly (Figure 5). To simplify compound management (as mentioned in the Introduction), we studied the dependence of the quality of the selected set on the number of suppliers. Based on the obtained data (Figure 5), we selected the 12, six, and three suppliers that contributed the most. The above-mentioned procedure for the 'ideal half million' selection was applied to the chemical space covered by these 12, six, and three suppliers, respectively.
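The MaxMin step used for both scaffold and compound selection can be sketched as follows. This is a minimal greedy version under stated assumptions, not the implementation of [70]: fingerprints are taken to be sets of on-bit indices, the first pick is arbitrary, and `cap_per_scaffold` is a hypothetical helper that simply skips scaffolds with fewer members than the requested quota.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)

def maxmin_pick(fps, n_pick, first=0):
    """Greedy MaxMin: repeatedly add the item farthest from everything picked so far."""
    n_pick = min(n_pick, len(fps))
    if n_pick == 0:
        return []
    picked = [first]
    # min_dist[i] holds the distance from item i to its closest picked item.
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(fps)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
    return picked

def cap_per_scaffold(members_by_scaffold, fps, per_scaffold):
    """Keep sufficiently populated scaffolds and pick their most diverse members."""
    deck = {}
    for scaffold, members in members_by_scaffold.items():
        if len(members) < per_scaffold:
            continue  # underpopulated scaffold: skipped in this sketch
        local = maxmin_pick([fps[m] for m in members], per_scaffold)
        deck[scaffold] = [members[i] for i in local]
    return deck

# Invented fingerprints and scaffold membership (indices into toy_fps).
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {20, 21}]
deck = cap_per_scaffold({"A": [0, 1, 2, 3], "B": [4]}, toy_fps, per_scaffold=2)
```

The same `maxmin_pick` routine could be applied one level up, to fingerprints of the scaffolds themselves, to mimic the extraction of the most diverse scaffolds before the per-scaffold cap is applied.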
For the 12 and six suppliers, the generated space contained 0.5 million compounds, whereas for three suppliers, the size of the space decreased to 384 520 compounds based on 19 226 scaffolds. We then compared these three spaces with the initial space from 33 suppliers at the compound and scaffold levels. Diversity at the compound level as well as QED were similar for all three spaces (Figures S7.01 and S7.02 in the supplementary information online). However, a similar analysis at the scaffold level showed a significant decrease in diversity from the 33 to the three supplier sets (Figure 7a).
The second model 'ideal million' set (50 compounds per scaffold) was collected using the above-mentioned algorithm. Similarly, for the 50 compounds per scaffold set, only an 'ideal half million' could be generated. However, in contrast to the previous analysis, this resulted in a different level of contribution from each supplier (Figure 6). We also analyzed the contribution from the top 12, six, and three suppliers. For 12 suppliers, applying the algorithm resulted in a 0.5 million compound set, whereas for six and three suppliers, the sizes of the resulting sets were 494 450 and 306 200 compounds based on 9889 and 6124 scaffolds, respectively. As in the 20 compounds per scaffold analysis, decreasing the number of suppliers did not significantly influence the Tanimoto diversity at the compound level or the QED (Figures S7.03 and S7.04 in the supplementary information online), but did significantly decrease diversity at the scaffold level (Figure 7b). In general, the comparison of the two sets (20 and 50 compounds per scaffold) showed that the 50 compounds per scaffold set was significantly less diverse at the scaffold level. Therefore, the 20 compounds per scaffold set, with the number of suppliers reduced to six or three, would be a pragmatic way to build a useful set of compounds for HTS campaigns based on compounds purchased from commercial sources.
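The set sizes quoted in this case study follow directly from the number of scaffolds multiplied by the number of compounds per scaffold. The short check below uses only figures given in the text for the lead-like space; the helper name is hypothetical, and scaffolds 'bearing more than' a given number of compounds are treated as able to supply that quota.

```python
def scaffolds_needed(deck_size, per_scaffold):
    """Number of scaffolds required to fill a deck at a fixed quota per scaffold."""
    return deck_size // per_scaffold

# Sufficiently populated scaffolds in the lead-like space (values from the text).
available = {20: 39_101, 50: 13_156}

feasible = {
    (size, quota): scaffolds_needed(size, quota) <= available[quota]
    for size in (1_000_000, 500_000)
    for quota in (20, 50)
}

# Reduced-supplier sets reported in the text: scaffolds x quota = compounds.
three_suppliers_20 = 19_226 * 20
six_suppliers_50 = 9_889 * 50
three_suppliers_50 = 6_124 * 50
```

Under these numbers, a 1 million deck is out of reach at either quota (50 000 and 20 000 scaffolds would be needed), whereas a 0.5 million deck is feasible at both (25 000 and 10 000 scaffolds).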
The last step of our investigation was to compare the results from 33, 12, six, and three suppliers (for the libraries bearing 20 compounds per scaffold). For this purpose, we utilized the recently developed Generative Topographic Mapping (GTM) [71,72] because it is considered the most efficient tool among the published methods for multiple descriptor chemical space comparison. The 1.5-million ChEMBL compound data set was used as a reference database. The four compound sets corresponded to three, six, 12, and 33 suppliers. These were mapped against the background of ChEMBL compounds, with blue zones corresponding to chemical space areas dominated by supplier compounds, versus dark-red zones containing (almost) exclusively ChEMBL compounds, after applying Bayesian normalization to compensate for the initial imbalance of set size (300 000-500 000 for supplier sets, versus 1.5-million ChEMBL compounds). Intermediate colors, from light red through yellow and green, corresponded to chemical space zones in which supplier and ChEMBL compounds mingled (increasing relative density of supplier compounds corresponding to a 'blue shift'). Three maps were built on the basis of the aforementioned principles and are shown in Figure 8.
Map #1 was based on ISIDA [73] force-field-type colored atom sequences acting as molecular descriptors. The force field types assigned to atoms (the CVFF force field typing rules were applied) were specific to their chemical environment and, therefore, this class of ISIDA fragment descriptors provides a fine-grained analysis of chemical space. The three-supplier set dominated the 'north-eastern' chemical space zone, clearly separated by a ChEMBL-dominated central part from some secondary 'islands' in both the north-western and south-eastern regions. Increasing the number of suppliers resulted in a gradual growth of overlap with the ChEMBL set, by embracing more compounds in the central area, which remained dominated by ChEMBL compounds while also starting to be populated by supplier molecules. The extent of library overlap, calculated as the Tanimoto score of the mean vectors of the supplier and ChEMBL libraries, respectively, increased from 0.28 (three suppliers) to 0.33 (six suppliers) to 0.42 (12 suppliers) and remained constant when all suppliers were considered.
Map #2 relied on ISIDA pharmacophore-type colored atom sequence count descriptors (i.e., it monitors pharmacophore pattern diversity). Therefore, it ignored the precise chemical nature of the atoms, rendered as hydrophobes, aromatics, HBA and HBD, cations, and anions, respectively. The three-supplier set provided significant coverage of the chemical space, with the only ChEMBL-dominated area close to the 'south pole' of the map. The addition of compounds from further suppliers gradually filled this initial diversity hole. The degree of library overlap was generally higher than in the more fine-grained map #1, and gradually increased from 0.51 (three suppliers) to 0.54 (six suppliers), 0.63 (12 suppliers), and 0.65 (all suppliers).
Map #3 was based on plain ISIDA atom sequence counts. Similar to map #1, it also focused on chemical constitution and connectivity patterns, but was less fine-grained than the latter; thus, the libraries overlapped strongly. On this map, the three-supplier library appeared as a core collection that gradually expanded (in particular, into the north-west and south-west regions) as compounds from further suppliers were added. Overlap degrees varied from 0.34 (three suppliers) to 0.40 (six suppliers), 0.47 (12 suppliers), and 0.49 (all suppliers).
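The overlap scores quoted for the three maps are Tanimoto scores between the mean vectors of two libraries. The text does not state which Tanimoto variant was used; the sketch below assumes the common continuous form, T(u, v) = u·v / (|u|² + |v|² - u·v), and uses invented three-component vectors in place of the real per-compound map vectors.

```python
def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def continuous_tanimoto(u, v):
    """Tanimoto score for real-valued vectors: u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(x * y for x, y in zip(u, v))
    denom = sum(x * x for x in u) + sum(y * y for y in v) - dot
    return dot / denom if denom else 1.0

# Invented per-compound vectors for two small 'libraries'.
supplier_vectors = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]]
chembl_vectors = [[0.2, 0.2, 0.6], [0.0, 0.4, 0.6]]

overlap = continuous_tanimoto(mean_vector(supplier_vectors),
                              mean_vector(chembl_vectors))
```

With this definition, identical mean vectors give a score of 1 and libraries that occupy disjoint map regions give a score of 0, so the rise in overlap as suppliers are added reflects supplier compounds moving into regions populated by ChEMBL.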

Teaser
An assessment of the properties and quality of 16 million commercially available compounds, comparing vendors' offerings and how they have evolved to meet modern physicochemical requirements. A selection of 500 000 lead-like compounds for high-throughput screening.

Concluding remarks
As HTS has matured, our understanding of what features constitute a quality hit and lead has evolved. It is generally accepted that low lipophilicity and higher Fsp3 are preferred. From our analysis, it appears that, over the past 10 years, the market has evolved to meet these demands, with new compounds from many suppliers having modern physicochemical properties. Currently, it is not possible to purchase an ideal 1-million compound set (50 000 scaffolds, minimum of 20 compounds per scaffold). However, it appears that an ideal 500 000 set can be purchased. If sample logistics is an issue, then we have shown that it is possible to purchase the 500 000 set from only six suppliers, with a 350 000 set available from just three suppliers. Many large companies have been through similar exercises and have built their screening decks accordingly. If you are considering building a screening deck ab initio, then it is possible to achieve this from purchasable space. In the interest of open innovation, we have made our data available online (www.awridian.co.uk/Resources). We are confident that, as new challenges in sample supply emerge, the marketplace will respond positively.

Dmitriy Volochnyuk
Dmitriy Volochnyuk shares his time as head of the Biologically Active Compounds Department at the Institute of Organic Chemistry of the NAS of Ukraine and as a professor in the Institute of High Technology, Kiev National University. He received his PhD in Organic Chemistry in 2005 and his DSc in organic and organometallic chemistry in 2011. He has 10+ years' experience in managing chemical outsourcing projects, having previously worked in contract research organizations. Dr Volochnyuk is an expert in fluoroorganic, organophosphorus, heterocyclic, combinatorial, and medicinal chemistry. He is also an author on over 120 scientific papers.

Sergey Ryabukhin
Sergey Ryabukhin is an associate professor in the Institute of High Technology, Kiev National Taras Shevchenko University. He was awarded his PhD by Kiev National University in 2008. He has 10+ years' experience in managing combinatorial chemistry departments as well as chemical outsourcing projects having previously worked in contract research organizations. Dr Ryabukhin is an expert in combinatorial methods in organic chemistry, organosilicon, and organoboron chemistry. He is an author on over 50 scientific papers.

Duncan B. Judd
Duncan B. Judd is a consultant at Awridian Ltd, currently working with a range of organizations including international companies. He is an accomplished medicinal chemist with extensive outsourcing experience and a 39-year proven track record with a blue-chip pharmaceutical company. Duncan has made significant contributions to numerous drug discovery projects, and is cited on many patents and publications. He has published and presented on open innovation in drug discovery, for which he is a strong advocate.