Machine Learning-Boosted Docking Enables the Efficient Structure-Based Virtual Screening of Giga-Scale Enumerated Chemical Libraries

The emergence of ultra-large screening libraries, filled to the brim with billions of readily available compounds, poses a growing challenge for docking-based virtual screening. Machine learning (ML)-boosted strategies like the tool HASTEN combine rapid ML prediction with the brute-force docking of small fractions of such libraries to increase screening throughput and take on giga-scale libraries. In our case study of an anti-bacterial chaperone and an anti-viral kinase, we first generated a brute-force docking baseline for 1.56 billion compounds in the Enamine REAL lead-like library with the fast Glide high-throughput virtual screening protocol. With HASTEN, we observed robust recall of 90% of the true 1000 top-scoring virtual hits in both targets when docking only 1% of the entire library. This reduction of the required docking experiments by 99% significantly shortens the screening time. In the kinase target, the employment of a hydrogen bonding constraint resulted in a major proportion of unsuccessful docking attempts and hampered ML predictions. We demonstrate the optimization potential in the treatment of failed compounds when performing ML-boosted screening, and we benchmark and showcase HASTEN as a fast and robust tool in a growing arsenal of approaches to unlock the chemical space covered by giga-scale screening libraries for everyday drug discovery campaigns.

Table S6: Recalls and runtime of runs with 0.1% and 0.01% docking fraction for GAK
Table S7: Recalls of top 100, 1000, and 10 000 virtual hits for the SurA target
Table S8: Recalls of top 100, 1000, and 10 000 virtual hits for the GAK target

GAK receptor selection and method validation
Figure S4: Boxplots of Tanimoto distances of top-scoring virtual hits for SurA to their closest relative in the training dataset (i.e., the compound with the largest Tanimoto similarity among those selected on a previous iteration). The data shown are for all compounds among the true top 10 000 virtual hits that were selected by HASTEN and were obtained with the help of chemfp from the run with a docking fraction of 0.1% per iteration and the drop-failed protocol.
Figure S5: Heatmap of Pearson correlations of predicted scores on iterations 2-10 for the SurA top virtual hits (defined here by a docking score cutoff of -9.0, total: 37 818 compounds). The analyzed models were obtained with a docking fraction of 0.1% per iteration and the drop-failed protocol.
Figure S6: Recalls of the top 100 (top), 1000 (middle), and 10 000 (bottom) true virtual hits in the runs with docking fractions of 0.1% (orange) and 0.01% (blue), expressed as a function of the total number of compounds docked. The left column shows results for the SurA target and the right column for GAK. All HASTEN runs shown in this plot were done with the drop-failed protocol.
Figure S7: Recalls of the top 100 (top), 1000 (middle), and 10 000 (bottom) true virtual hits in the runs with docking fractions of 0.1% (orange) and 0.01% (blue), expressed as a function of the total runtime in minutes when predictions ran with a single Chemprop per GPU. The left column shows results for the SurA target and the right column for GAK. All HASTEN runs shown in this plot were done with the drop-failed protocol.
Figure S8: Recalls of the top 100 (top), 1000 (middle), and 10 000 (bottom) true virtual hits in the runs with docking fractions of 0.1% (orange) and 0.01% (blue), expressed as a function of the total runtime in minutes when predictions ran with 4 Chemprops per GPU. The left column shows results for the SurA target and the right column for GAK. All HASTEN runs shown in this plot were done with the drop-failed protocol.
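The recall metric reported in these figures reduces to a simple set intersection between the true top-N compounds of the brute-force baseline and the compounds actually docked during the ML-boosted run. A minimal sketch in plain Python (the function and variable names are illustrative, not from the HASTEN codebase; lower Glide docking scores are assumed to be better):

```python
def top_hit_recall(true_scores, docked_ids, top_n=1000):
    """Fraction of the true top-N virtual hits (best = lowest docking
    score) that were docked at some point during the ML-boosted run."""
    ranked = sorted(true_scores, key=true_scores.get)  # best score first
    true_top = set(ranked[:top_n])
    return len(true_top & set(docked_ids)) / top_n
```

In the benchmark setting, `true_scores` would hold the brute-force baseline scores for the full library and `docked_ids` the subset of compounds HASTEN selected for docking up to a given iteration.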

Supplementary tables
Table S1: Recalls for the target SurA with failed compounds scored as +5.0, 0.0, or dropped. For the failed score of +5.0, only the first replicate run is reported.

Generation of a custom GAK actives/decoys dataset
Known actives for the GAK protein were collected from ChEMBL.1 Any compound with a reported IC50, Ki, or Kd activity of at least 1 µM was considered a potential active, to mimic a generous selection of potential binders during a virtual screening project. To ensure that the possible enrichment was assessed in the most relevant property space, the retrieved actives were next filtered by their properties to keep only compounds that fell within the lead-like criteria of the ERLL library to be used in the screening study.
The remaining 104 lead-like actives were used to generate a set of custom decoys using DUD-E.2,3 After removal of duplicates from the decoy set, we ended up with a final dataset of 104 actives and 5600 custom decoys, which were prepared as described for the ERLL library in the main text.
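The dataset assembly described above can be sketched as a simple two-stage filter. This is an illustration only: the record field names, the application of the 1 µM threshold as an upper bound on potency, and the lead-like window shown here are assumptions, not the actual ERLL criteria (which are given in the main text):

```python
# Hypothetical lead-like window; the real criteria are those of the
# Enamine REAL lead-like (ERLL) library described in the main text.
LEADLIKE_MW = (250.0, 350.0)
LEADLIKE_CLOGP = (-1.0, 3.5)

def select_actives(records, activity_cutoff_nm=1000.0):
    """Keep compounds with a reported IC50/Ki/Kd within the activity
    cutoff and with properties inside the lead-like window."""
    kept = []
    for rec in records:
        potencies = [rec[k] for k in ("ic50_nm", "ki_nm", "kd_nm")
                     if rec.get(k) is not None]
        if not potencies or min(potencies) > activity_cutoff_nm:
            continue  # no reported measurement, or outside the cutoff
        if not (LEADLIKE_MW[0] <= rec["mw"] <= LEADLIKE_MW[1]):
            continue
        if not (LEADLIKE_CLOGP[0] <= rec["clogp"] <= LEADLIKE_CLOGP[1]):
            continue
        kept.append(rec["smiles"])
    return kept
```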

Docking performance and enrichment assessment
For the four prioritized GAK receptors (PDB IDs 5y7z, 4y8d, 4c58, and 4c59), we analyzed the screening performance and enrichment of actives over decoys using custom Python scripts to compute the following metrics: the area under the receiver operating characteristic curve (ROC, Equation 1), the area under the accumulation curve (AUAC, Equation 2), and the enrichment factor in the top 1% (EF, Equation 3). The results are reported in Table S10.

$$\mathrm{ROC} = \frac{1}{n\,(N-n)} \sum_{k=1}^{N} F_a(k)\,\bigl[F_i(k) - F_i(k-1)\bigr] \qquad (1)$$

with n: number of actives in a total of N compounds; F_a(k) and F_i(k): the number of actives and inactives at rank position k, respectively.

$$\mathrm{AUAC} = \frac{1}{n\,N} \sum_{k=1}^{N} F_a(k) \qquad (2)$$

with n: number of actives in a total of N compounds; F_a(k): the number of actives at rank position k.

$$\mathrm{EF}_{1\%} = \frac{F_a(k_{1\%})\,/\,k_{1\%}}{n\,/\,N}, \qquad k_{1\%} = \lceil 0.01\,N \rceil \qquad (3)$$
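A pure-Python sketch of these three metrics (an illustrative re-implementation, not the authors' scripts; it expects binary labels already sorted from best to worst docking score):

```python
import math

def enrichment_metrics(ranked_labels, ef_fraction=0.01):
    """Compute ROC AUC, AUAC, and EF for a score-ranked list of
    binary labels (1 = active, 0 = decoy), best-scored first."""
    N = len(ranked_labels)
    n = sum(ranked_labels)
    fa = 0            # cumulative actives F_a(k)
    sum_fa = 0        # sum of F_a(k) over all ranks (AUAC numerator)
    pairs = 0         # active-ranked-above-inactive pairs (ROC numerator)
    k_ef = math.ceil(ef_fraction * N)
    actives_in_top = 0
    for k, label in enumerate(ranked_labels, start=1):
        if label:
            fa += 1
        else:
            pairs += fa  # each inactive counts the actives ranked above it
        sum_fa += fa
        if k == k_ef:
            actives_in_top = fa
    roc = pairs / (n * (N - n))
    auac = sum_fa / (n * N)
    ef = (actives_in_top / k_ef) / (n / N)
    return roc, auac, ef
```

For a perfect ranking the ROC value is 1.0 and the EF approaches its maximum of N/n (capped by the size of the top fraction).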




Figure S2: Bar plots showing the number of failed compounds selected for docking on each HASTEN iteration for the GAK target when using a failed score of +5.0 (blue) and when excluding failed compounds from the training data (orange).
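The three failed-compound protocols compared in this work (failed score +5.0, failed score 0.0, or dropping failures) amount to a preprocessing choice when assembling the ML training set. A minimal sketch with hypothetical names (not the HASTEN implementation):

```python
def build_training_set(docking_results, policy="drop", failed_score=5.0):
    """docking_results: dict mapping SMILES -> Glide docking score,
    with None marking a failed docking attempt.

    policy "drop"   -> exclude failed compounds from the training data
    policy "impute" -> replace failures with a fixed penalty score
                       (e.g., +5.0 or 0.0 in the benchmarked runs)
    """
    if policy == "drop":
        return {smi: s for smi, s in docking_results.items()
                if s is not None}
    if policy == "impute":
        return {smi: (failed_score if s is None else s)
                for smi, s in docking_results.items()}
    raise ValueError(f"unknown policy: {policy!r}")
```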

Figure S3: Validation and test set RMSE per Chemprop training iteration for the targets SurA (left) and GAK (right) with different treatments of failed compounds. Validation set RMSE curves are shown semi-translucent and, where invisible, are overlaid by the test set RMSE curves due to highly similar RMSEs. The data are also summarized in Tables S3 and S4. RMSEs are shown for HASTEN runs utilizing a failed score of +5.0 (orange, diamonds) or 0.0 (blue, squares), and for runs that excluded failed compounds from the training data (yellow, circles). For SurA, the average validation and test set RMSE values per iteration are shown for the three replicates using a failed score of +5.0, and for GAK, the average of three replicates where failed compounds were dropped.

Figure S9: Venn diagrams illustrating the overlap in recalled compounds among the top 1000 virtual hits for the SurA target. Compound numbers are shown for each replicate R on every HASTEN iteration with a failed score of +5.0, starting from iteration 2.

Figure S10: Venn diagrams illustrating the overlap in recalled compounds among the top 1000 virtual hits for the GAK target. Compound numbers are shown for each replicate R on every HASTEN iteration with excluded failed compounds, starting from iteration 2.



Table S2: Recalls for the target GAK with failed compounds scored as +5.0, 0.0, or dropped. For the run where failed compounds were dropped, only the first replicate run is reported.

Table S3: Validation (valid.) and test set RMSE values per Chemprop training iteration for the SurA target: the three replicates with a failed score of +5.0, a failed score of 0.0, and the run where failed compounds were dropped from the training data are shown.

Table S4: Validation (valid.) and test set RMSE values per Chemprop training iteration for the GAK target: results for a failed score of +5.0, a failed score of 0.0, and the three replicate runs where failed compounds were dropped from the training data are shown.

Table S6: Recalls and runtime for the target GAK with dropped failed compounds when adding 1.56 million compounds, i.e., 0.1% training data per iteration (0.1%), and when adding 156 000 compounds, i.e., 0.01% training data per iteration (0.01%). The shorter runtime was achieved when running 4 Chemprops per GPU, the longer runtime with a single Chemprop per GPU. The run with the larger training dataset size was terminated after 10 iterations; the run with the smaller training dataset size was continued for a total of 25 iterations.

Table S7: Recalls of the top 100, 1000, and 10 000 true virtual hits according to conventional docking, obtained in three independent replicates of HASTEN for the target SurA. Failed compounds were assigned a score of +5.0.

Table S8: Recalls of the top 100, 1000, and 10 000 true virtual hits according to conventional docking, obtained in three independent replicates of HASTEN for the target GAK. Failed compounds were dropped.