Evaluating Scalable Supervised Learning for Synthesize-on-Demand Chemical Libraries

Traditional small-molecule drug discovery is a time-consuming and costly endeavor. High-throughput chemical screening can only assess a tiny fraction of drug-like chemical space. The strong predictive power of modern machine-learning methods for virtual chemical screening enables training models on known active and inactive compounds and extrapolating to much larger chemical libraries. However, there has been limited experimental validation of these methods in practical applications on large commercially available or synthesize-on-demand chemical libraries. Through a prospective evaluation with the bacterial protein–protein interaction PriA-SSB, we demonstrate that ligand-based virtual screening can identify many active compounds in large commercial libraries. We use cross-validation to compare different types of supervised learning models and select a random forest (RF) classifier as the best model for this target. When predicting the activity of more than 8 million compounds from Aldrich Market Select, the RF substantially outperforms a naïve baseline based on chemical structure similarity. 48% of the RF’s 701 selected compounds are active. The RF model easily scales to score one billion compounds from the synthesize-on-demand Enamine REAL database. We tested 68 chemically diverse top predictions from Enamine REAL and observed 31 hits (46%), including one with an IC50 value of 1.3 μM.


A.1 Training Data Preprocessing Overview
Here we describe in detail how we preprocessed the LC and MLPCN screening data in Table 5 to create the 427,300-compound training dataset. At the time this dataset was assembled, the LC4 retest dataset was not available, so hits were defined using only the LC4 primary screen. The first step involved merging the three primary and two retest screens.
Merging these datasets yielded a total of 442,274 entries; a compound could have multiple readings. This dataset was then processed to determine the binary activity of each compound. Specifically, for each compound, the scores were grouped, and three criteria were assessed: the median % inhibition of the primary screens was ≥ 35%, the median % inhibition of the retest screens was ≥ 35%, and the compound did not match a PAINS filter.62,63 A compound was deemed active if it met all three conditions. Finally, the training dataset was constructed such that it contained one row per compound, with its binary activity label and its median inhibition score based on all recorded primary inhibition scores.

A.2 Merged Dataset Structure
Each of the original screening data files had eight columns: SMSSF ID, CDD SMILES, Batch Name, Library ID, Plate Name, Plate Well, Run Date, and PriA-SSB AS % inhibition. We created a combined dataset whose columns include the following:
14. Primary Filter: is active (denoted as 1) if the median of the primary screens for this molecule is greater than or equal to the binary threshold.
15. Retest Filter: is active (denoted as 1) if the median of the retest screens for this molecule is greater than or equal to the binary threshold.
16. PAINS Filter: is active (denoted as 1) if the molecule is not reported as a PAINS compound by the RDKit PAINS filter.

A.3 Ensuring Molecule Uniqueness
There are two types of repeated molecules for which we retain all measurements. The first is molecules that were experimentally tested more than once, which we generally refer to as replicates. Molecules can be tested more than once because they belong to multiple libraries (e.g., LC1 and MLPCN) or because they were included in both a primary and a retest screen. The other type of repeated molecule has different SMSSF IDs but the same RDKit SMILES due to salts. Molecules with the same SMSSF ID or the same RDKit SMILES are grouped together. The grouping operation assigns them the same Molecule ID and different Duplicate IDs. This allows us to compute aggregate % inhibition scores by grouping on Molecule ID.
In addition to grouping replicate measurements, we also ensure that a single experimental measurement is not erroneously recorded twice. Entries in the combined dataset should not match in all of the following columns: SMSSF ID, Plate Name, Plate Well, Run Date, and PriA-SSB AS % inhibition. If two entries match on these columns, the same data have been copied, so one entry is removed.
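The copied-measurement check can be sketched in pandas using the five columns named above; the example rows here are invented for illustration:

```python
import pandas as pd

# Hypothetical combined dataset with the five columns used for the check.
combined = pd.DataFrame({
    "SMSSF ID": ["SMSSF-1", "SMSSF-1", "SMSSF-1"],
    "Plate Name": ["P1", "P1", "P2"],
    "Plate Well": ["A01", "A01", "A01"],
    "Run Date": ["2017-01-05", "2017-01-05", "2017-02-10"],
    "PriA-SSB AS % inhibition": [41.2, 41.2, 38.7],
})

# Rows matching on all five columns are copies of one physical
# measurement; keep only the first occurrence of each.
dedup_cols = ["SMSSF ID", "Plate Name", "Plate Well",
              "Run Date", "PriA-SSB AS % inhibition"]
combined = combined.drop_duplicates(subset=dedup_cols, keep="first")

print(len(combined))  # 2: the two distinct measurements remain
```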

A.4 Combined Dataset Preprocessing Steps
The preprocessing steps for the combined dataset are as follows:
1. Read in the five merged individual measurement files.

A.5 Binary Activity Rules
Molecules can have up to four % inhibition score measurements (same Molecule ID but different Duplicate IDs). We defined the following rules for creating binary activity labels for training classifiers:
1. The median % inhibition value over all primary screens of the molecule is ≥ 35%.
2. If one or more retest screens were conducted, the median % inhibition value over all retest screens of the molecule is ≥ 35%.
3. The molecule does not match a PAINS filter per RDKit's FilterCatalog.
These rules are applied to each molecule individually by grouping on the Molecule ID.
The binary activity label is assigned to all instances of the molecule identified by Molecule ID.
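A minimal pandas sketch of these rules, assuming a long-format table with hypothetical columns `molecule_id`, `screen_type`, `inhibition`, and a precomputed boolean `pains_match` (which in the actual pipeline comes from RDKit's FilterCatalog PAINS catalog):

```python
import pandas as pd

def label_molecules(df: pd.DataFrame, threshold: float = 35.0) -> pd.Series:
    """Return one binary activity label per Molecule ID."""
    def one(group: pd.DataFrame) -> bool:
        primary = group.loc[group["screen_type"] == "primary", "inhibition"]
        retest = group.loc[group["screen_type"] == "retest", "inhibition"]
        if primary.median() < threshold:                      # rule 1
            return False
        if len(retest) > 0 and retest.median() < threshold:   # rule 2
            return False
        return not group["pains_match"].iloc[0]               # rule 3
    return df.groupby("molecule_id").apply(one)

df = pd.DataFrame({
    "molecule_id": ["m1", "m1", "m2", "m2"],
    "screen_type": ["primary", "retest", "primary", "primary"],
    "inhibition": [60.0, 55.0, 80.0, 70.0],
    "pains_match": [False, False, True, True],
})
labels = label_molecules(df)
print(labels["m1"], labels["m2"])  # m1 is active; m2 is rejected by rule 3
```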

A.6 Generating the Training Dataset
The combined dataset can have many % inhibition readings for a single molecule. The training dataset will contain a single entry for each unique molecule identified by its Molecule ID. It is generated as follows:
1. Read in the combined dataset.
2. Remove retests, leaving the primary screens. Recall that the binary activity is still recorded in the primary entries.
3. Standardize Library IDs to support stratifying by library.
4. Group by Molecule ID and compute the median % inhibition for the primary screens in the PriA-SSB AS % inhibition (Primary Median) column.

Save this as the training dataset.
Each entry in the training dataset has a unique Molecule ID and is thus regarded as a unique molecule. The method to generate folds for cross-validation starts by grouping molecules based on the Library ID column. The next step performs stratified sampling on each Library ID value, grouping it into 10 folds based on the binary activity labels. This ensures that each library is well represented in each fold and that the activity ratios are similar.
For example, if LC1 has 1,000 total molecules with 10 actives, then each of the 10 folds will have 1 LC1 active and 99 LC1 inactives.
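A sketch of this per-library stratified fold assignment, assuming one row per molecule with hypothetical `library_id` and `active` columns:

```python
import pandas as pd

def assign_folds(df: pd.DataFrame, n_folds: int = 10, seed: int = 0) -> pd.Series:
    """Assign a fold (0..n_folds-1) to each molecule, stratified by
    binary activity within each library."""
    fold = pd.Series(-1, index=df.index)
    for _, group in df.groupby(["library_id", "active"]):
        # Shuffle each (library, activity) stratum, then deal molecules
        # out round-robin so every fold receives a similar share.
        shuffled = group.sample(frac=1.0, random_state=seed)
        fold.loc[shuffled.index] = [i % n_folds for i in range(len(shuffled))]
    return fold

# Toy library: 100 molecules with 10 actives, as in the example above
# (scaled down from 1,000 molecules).
df = pd.DataFrame({
    "library_id": ["LC1"] * 100,
    "active": [True] * 10 + [False] * 90,
})
df["fold"] = assign_folds(df)
counts = df.groupby("fold")["active"].agg(["sum", "count"])
print(counts["sum"].tolist())  # each of the 10 folds gets exactly 1 active
```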

A.7 Supervised Learning Model Hyperparameters
Here we describe the best hyperparameter combinations for the RF-C, XGB-C, XGB-R, NN-C, and NN-R models. We performed grid searches over the hyperparameter values in the tables referenced below. Each combination of hyperparameters in the grid search was assigned a numeric ID. The mapping from numeric IDs to hyperparameter combinations is available in our GitHub repository.
For each model, we report the top-performing hyperparameter IDs sorted by performance.
Some training jobs took over a week to complete. We recorded results for all the jobs that terminated within a week and randomly picked additional long jobs to run to termination.
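The ID-to-combination mapping can be illustrated with a hypothetical grid; the real hyperparameter values are in the referenced tables, and the names below are illustrative only:

```python
from itertools import product

# Hypothetical RF-C grid; the actual grids are in the referenced tables.
grid = {
    "n_estimators": [100, 1000, 4000],
    "max_features": ["sqrt", "log2"],
    "class_weight": [None, "balanced"],
}

# Enumerate every combination and assign each a numeric ID, mirroring
# the ID-to-combination mapping kept in the GitHub repository.
combos = {
    i: dict(zip(grid, values))
    for i, values in enumerate(product(*grid.values()))
}
print(len(combos))  # 12 combinations (3 x 2 x 2)
print(combos[0])
```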

A.8 Ensemble Model Training
The Model-Based Ensemble and Max Vote Ensemble models are introduced in the model selection stage. Both ensemble models take N base model predictions as input and produce a single activity probability prediction as output. The Max Vote Ensemble uses a heuristic strategy that takes the maximum of the base model predictions. In contrast, the Model-Based Ensemble trains an XGB model with the base model predictions as input.
The steps for training the ensemble models (depicted in Figure S3) are as follows: • Step 1: The base models are trained on folds 0-8, with fold 8 used as the validation fold. The RF models do not require a validation set, so they do not make use of fold 8.

• Step 2: The trained base models from step 1 are then used to generate predictions for folds 0-7 and fold 9.
• Step 3: All N base model predictions from step 2 for folds 0-7 are concatenated (stacked) to construct ensemble features. The Model-Based Ensemble is then trained with these ensemble features as input and the true labels as output for folds 0-7. For the Max Vote Ensemble, no training is needed because the output is the maximum of the ensemble features.
• Step 4: All N base model predictions from step 2 for fold 9 are concatenated. The trained ensemble from step 3 is then used to generate the final fold 9 prediction with the concatenated features as input.
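The steps above can be sketched with synthetic prediction matrices. All data here are toy values, and scikit-learn's `LogisticRegression` stands in for the XGB model used in the actual Model-Based Ensemble:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Step 2 output (simulated): per-compound activity probabilities from
# N base models on the ensemble-training folds (0-7) and on fold 9.
n_models, n_train, n_test = 3, 200, 50
train_preds = rng.random((n_train, n_models))  # stacked ensemble features
test_preds = rng.random((n_test, n_models))
train_labels = (train_preds.mean(axis=1) > 0.5).astype(int)  # toy labels

# Step 3, Max Vote Ensemble: no training, take the maximum prediction.
max_vote = test_preds.max(axis=1)

# Step 3, Model-Based Ensemble: fit a model on the stacked features.
stacker = LogisticRegression().fit(train_preds, train_labels)

# Step 4: final fold-9 predictions from the concatenated features.
model_based = stacker.predict_proba(test_preds)[:, 1]

print(max_vote.shape, model_based.shape)  # both (50,)
```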
There are several ways in which our ensembling approach could be improved. In a typical stacking ensemble,76 the ensemble is trained on predictions that the base models did not train on. If the training set is denoted as A, then one common stacking method is to do K-fold cross-validation with the base models and then use the stacked K held-out fold predictions to train the ensemble. Another strategy is to split A into two subsets, one for training the base models and one for generating base model predictions to train the ensemble. However, we train the base models on a dataset A, generate predictions on A, and train the ensemble on these predictions. This can bias a parameter-based ensemble like the Model-Based Ensemble toward favoring base models that fit A better. The Max Vote Ensemble does not have this potential bias because it is parameter free.
In addition, we could have used a more flexible and thorough strategy to select the base model mixtures and excluded ensemble configurations with a single base model (Table S20).
In retrospect, we observed that including regression models in the ensembles hurt performance (Figures S1 and S2). Furthermore, the regression base model outputs were not scaled, which was likely detrimental to the Max Vote Ensemble when combining classification and regression models. The Model-Based Ensemble is less affected by the scale of the regression model outputs because its XGB model can account for the different score distributions provided by each base model.

A.9 Enamine REAL Dose-Response Curves
We generated three sets of dose-response curves: eight-point curves with four replicates for all 68 Enamine compounds, 16-point curves with eight replicates for 10 compounds, and another set of 16-point curves with six replicates for 10 compounds. The eight-point dose-response curves started with an initial concentration of 66 µM and progressively divided the concentration in half until reaching 515.6 nM. Both 16-point dose-response curves started with an initial concentration of 100 µM, then 66.7 µM, and then progressively divided the concentration in half until reaching 4.2 nM. All dose-response screens followed the same protocol used for the MLPCN library. The dose-response curve images are available in our Zenodo dataset.
We generated the 16-point dose-response curves because we were unable to reliably establish IC50 values for some compounds in the eight-point screen; the minimum concentrations tested were not low enough. When reporting IC50 values and bounds in Table 4, we used the screening data from the highest-quality dose-response curve. The second set of 16-point curves was the highest quality because those curves had 16 concentrations and were unaffected by any spatial plate effects. The first set of 16-point curves was the next highest quality because those curves also had 16 concentrations and had the most replicates. However, some of the readings at the lowest concentration in these curves were skewed by assay spatial effects at the plate edge. The eight-point curves were used only if the compound was not tested in a 16-point curve. Some of the replicates at the lowest concentration in the eight-point curves were also affected by the spatial effects.
The Collaborative Drug Discovery Vault software83 modifies the original dose-response classification criteria.82 It defines curve class 1 as a "complete curve, showing multiple points near both asymptotes" and curve class 2 as a "partial curve, showing multiple points near only one asymptote." The 1.2 and 2.2 subclasses indicate that the curve fit has R² > 0.9 and maximum activity ≤ 80%.
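As a sanity check on the serial-dilution arithmetic for the eight-point curves (66 µM halved at each step down to 515.6 nM), a short sketch:

```python
def half_dilutions(start_uM: float, n_points: int) -> list[float]:
    """Concentration series in µM, halving from the starting value."""
    return [start_uM / 2**i for i in range(n_points)]

# Eight-point series: 66 µM halved seven times ends at 0.515625 µM,
# i.e., the 515.6 nM lowest concentration reported above.
eight = half_dilutions(66.0, 8)
print([round(c, 4) for c in eight])
# [66.0, 33.0, 16.5, 8.25, 4.125, 2.0625, 1.0312, 0.5156]
```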

B.1 Base Model versus Ensemble Model Actives
In the model selection stage, fold 9 was used for performance comparison among models.
The motivation for using ensembles was to pool the predictions of various models and thus promote diversity in the top predicted compounds. We compared the prediction overlap between the ensembles and the best base models in the model selection stage (Table 1). Specifically, for each top base model, we compared the top 1% of predicted compounds from fold 9 against the top 1% predicted by the ensembles (Tables S2-S6). This examines the similarity and difference in the actives retrieved by the top base models and the ensembles.
These predictions for the Model-Based Ensemble were regenerated using a newer version of Keras. However, the performance of the regenerated ensemble and the ensemble referred to in Table 1 differs by no more than 10⁻³.

B.2 AMS Compound Activity Level versus Cluster Membership
The 1,024 AMS compounds were divided into three activity level categories based on their % inhibition values in the two replicates:
• Weak Actives: compounds with % inhibition in [35.0, 50.0) in both replicates. The value 35.0 is close to the 40th percentile in replicate 1.
• Normal Actives: compounds with % inhibition in [50.0, 72.0) in both replicates. The value 50.0 is the threshold used to define hits.
• Strong Actives: compounds with % inhibition ≥ 72.0 in both replicates. The value 72.0 is close to the 75th percentile in replicate 1.
Table S8 summarizes the AMS compound counts by activity level. Note that the sum of normal and strong actives is equal to the total hits. A Jupyter notebook in our GitHub repository shows the cluster membership versus activity level counts. We performed Fisher's exact test using the stats package in R with the following arguments: workspace=2e8, hybrid=FALSE, alternative="two.sided", conf.level=0.95, simulate.p.value=TRUE.
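The categorization rule can be sketched with a hypothetical helper that takes the two replicate % inhibition values; compounds whose replicates straddle a category boundary fall into neither category:

```python
def activity_level(rep1: float, rep2: float) -> str:
    """Categorize a compound from its two replicate % inhibition values
    using the thresholds above (35.0, 50.0, 72.0); both replicates must
    fall in the same interval."""
    lo, hi = min(rep1, rep2), max(rep1, rep2)
    if lo >= 72.0:
        return "Strong Active"
    if lo >= 50.0 and hi < 72.0:
        return "Normal Active"
    if lo >= 35.0 and hi < 50.0:
        return "Weak Active"
    return "Other"

print(activity_level(80.1, 75.3))  # Strong Active
print(activity_level(55.0, 60.2))  # Normal Active
print(activity_level(36.0, 48.9))  # Weak Active
```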

B.3 Scaling to a Billion Compounds with Enamine REAL
Using the top RF-C model from the model selection stage trained on the training set, we scored the 1,077,562,987 Enamine REAL compounds. A total of 18 jobs, each processing about 60.3 million compounds, were run on 18 independent compute nodes, with a single CPU core, 6 GB RAM, and 30 GB of disk space allocated to each job. Table S11 summarizes the processing time in hours and the number of compounds per node. The mean time was 53.21 hours with a 6.40 hour standard deviation. Because each compound is scored independently, we could easily reduce the total wall time needed to score over a billion compounds by distributing the compounds across hundreds or thousands of compute nodes. The Similarity Baseline ran under similar conditions and had a mean time of 19.48 hours with a 1.68 hour standard deviation.

Table S4 :
Top XGB-R model versus the ensembles. Set difference of actives when filtering for the top 427 predictions on fold 9 in the model selection stage.

Table S5 :
Top NN-C model versus the ensembles. Set difference of actives when filtering for the top 427 predictions on fold 9 in the model selection stage.

Table S6 :
Top NN-R model versus the ensembles. Set difference of actives when filtering for the top 427 predictions on fold 9 in the model selection stage.

Figure S1 :
Model-Based Ensemble results on fold 9. Table S20 lists the base models associated with each ensemble hyperparameter ID.

Figure S2 :
Max Vote Ensemble results on fold 9. Table S20 lists the base models associated with each ensemble hyperparameter ID.

Figure S3 :
Illustration of the ensemble training steps in the context of the model selection stage. Step 1 trains base models on folds 0-8, with fold 8 used as the validation fold. Step 2 generates predictions on folds 0-7 and fold 9. Step 3 trains the ensemble classifier on base model predictions for folds 0-7. Step 4 uses the trained ensemble to generate final predictions on fold 9.

Table S1 :
Cross-validation (CV) training, validation, and test fold splits for each run ID. A total of four runs were done for each hyperparameter setting in the hyperparameter tuning stage. Ensemble models followed a different splitting scheme.

Table S2 :
Top RF-C model versus the ensembles. Set difference of actives when filtering for the top 427 predictions on fold 9 in the model selection stage.

Table S3 :
Top XGB-C model versus the ensembles. Set difference of actives when filtering for the top 427 predictions on fold 9 in the model selection stage.

Table S7 :
Summary of cluster hit metrics for the 1,024 AMS compounds. Unique cluster (U.Cl.) hits denotes the number of clusters containing at least one hit. Novel cluster (Nov.Cl.) hits denotes the number of clusters with AMS hits but no training set actives. Taylor-Butina (TB) clustering with the specified distance thresholds is used to define clusters.

Table S8 :
The 1,024 AMS compounds grouped by activity level. Only Strong Actives and Normal Actives were considered to be hits in all primary analyses. Weak Actives fell below the % inhibition threshold.

Table S9 :
Illustration of the five AMS hits selected by RF-C that are the farthest from their nearest training set active. Tanimoto distance is 1 − Tanimoto similarity.

Table S10 :
Illustration of the five AMS hits selected by the Similarity Baseline that are the farthest from their nearest training set active. Tanimoto distance is 1 − Tanimoto similarity.

Table S11 :
Timing estimates for RF-C prediction on the Enamine REAL library consisting of 1,077,562,987 compounds. The library processing was split among 18 compute nodes.

Table S12 :
The overlap of the top K Enamine REAL compounds selected by the RF-C and Similarity Baseline models. Both models prioritized compounds from the entire Enamine REAL dataset. Then, the top K predicted compounds from both models were compared for overlap.

Table S13 :
The 40 inactive Enamine REAL compounds, which include some compounds that were called as initial hits but not confirmed in the full dose-response curves. The most similar known active compound from the training set or AMS compounds is shown along with its Tanimoto distance (which is 1 − Tanimoto similarity) to the Enamine inactive. The similarity maps from RDKit show similar substructures in green and dissimilar substructures in red. Inactives that match AlphaScreen frequent hitters filters33 are denoted with a checkmark.

Table S14 :
Summary of fold information denoting the number of compounds, number of clusters, and number of clusters not present in other folds. Compounds were clustered using Taylor-Butina at the 0.4 distance threshold.