We Should at Least Be Able to Design Molecules That Dock Well

Designing compounds with desired properties is a key element of the drug discovery process. However, measuring progress in the field has been challenging due to the lack of realistic retrospective benchmarks, and the large cost of prospective validation. To close this gap, we propose a benchmark based on docking, a popular computational method for assessing molecule binding to a protein. Concretely, the goal is to generate drug-like molecules that are scored highly by SMINA, a popular docking software. We observe that popular graph-based generative models fail to generate molecules with a high docking score when trained using a realistically sized training set. This suggests a limitation of the current incarnation of models for de novo drug design. Finally, we propose a simplified version of the benchmark based on a simpler scoring function, and show that the tested models are able to partially solve it. We release the benchmark as an easy to use package available at https://github.com/cieplinski-tobiasz/smina-docking-benchmark. We hope that our benchmark will serve as a stepping stone towards the goal of automatically generating promising drug candidates.


Introduction
Designing compounds with some desired chemical properties is the central challenge in the drug discovery process [Sliwoski et al., 2014, Elton et al., 2019. De novo drug design is one of the most successful computational approach that involves generating new potential ligands from scratch, which avoids enumerating explicitly the vast space of possible structures. Recently, deep learning has unlocked new progress in drug design. Promising results using deep generative models have been shown in generating soluble [Kusner et al., 2017], bioactive [Segler et al., 2018], and drug-like [Jin et al., 2018]

molecules.
A key challenge in the field of drug design is the lack of realistic benchmarks [Elton et al., 2019]. Ideally, the generated molecule by a de novo method should be tested in the wet lab for the desired property. In practice, typically, a proxy is used. For example, the octanolwater partition coefficient or bioactivity is predicted using a computational model [Segler et al., 2018, Kusner et al., 2017. However, these models are often too simplistic [Elton et al., 2019]. This is aptly summarized in Coley et al. [2019] as "The current evaluations for generative models do not reflect the complexity of real discovery problems.". In contrast to drug design, more realistic benchmarks have been used in the design of photovoltaics [Pyzer-Knapp et al., 2015] or in the design oof molecules with certain excitation energies [Sumita Figure 1: Visualization of the proposed docking-based benchmark for de-novo drug design methods. First, the proposed molecule (leftmost) is docked to the target's binding site using SMINA, a popular docking software. In the most difficult version of the benchmark, the final score is computed based on the ligand pose using the default SMINA docking score. et al., 2018], where a physical calculation was carried out to both train models, and to evaluate generated compounds.
Our main contribution is a realistic benchmark for de novo drug design. We base our benchmark on docking, a popular computational method for predicting molecule binding to a protein. Concretely, the goal is to generate molecules that are scored highly by SMINA [Koes et al., 2013]. We picked Koes et al. [2013] due to its popularity and being available under a free license. While we focus on de novo drug design, our methodology can be extended to evaluate retrospectively other approaches to designing molecules. Code to reproduce results and evaluate new models is available online at https://github.com/cieplinski-tobiasz/sminadocking-benchmark.
Our second contribution is exposing limitation of currently popular de novo drug design methods for generating bioactive molecules. When trained using a few thousands compounds, a typical training set size, the tested methods fail to generate highly active structures according to the docking software. The highest scoring molecules in most cases did not outperform top 10% molecules found in either the ZINC database or the training set. This suggest we should exercise caution when applying them in drug discovery pipelines, where we seldom have larger number of known ligands. We hope our benchmark will serve as a stepping stone to further improve these promising models.
The paper is organised as follows. We first discuss prior work and introduce our benchmark. Next, we use our benchmark to evaluate two popular models for de novo drug design. Finally, we analyse why the tested models fail on the most difficult version of the benchmark.

Docking-based benchmark
We begin by briefly discussing prior work and motivation. Next, we introduce our benchmark.

Why do we need yet another benchmark?
Standardized benchmarks are critical to measure progress in any field. Development of large-scale benchmarks such as the ImageNet was critical for the recent developments in artificial intelligence [Deng et al., 2009, Wang et al., 2018. Many new methods for de novo drug design are conceived every year, which motivates the need for a systematic and efficient way to compare them [Schneider and Clark, 2019].
De novo drug design methods are typically evaluated using proxy tasks that circumvent the need to test the generated compounds experimentally [Jin et al., 2018, You et al., 2018, Maziarka et al., 2020, Kusner et al., 2017, Gómez-Bombarelli et al., 2016. Optimizing the octanol-water partition coefficient (logP) is a common example. The logP coefficient is commonly computed using an atom-based method that involves summing contribution of individual atoms [Wildman andCrippen, 1999, Jin et al., 2018], which is available in the RDKit package [Landrum, 2016]. Due to the fact that it is easy to optimize the atom-based method by producing unrealistic molecules [Brown et al., 2019], a version that heuristically penalizes hard to synthesize compounds is used in practice [Jin et al., 2018]. This example illustrates the need to develop more realistic ways to benchmark these methods. Another example is QED score [Bickerton et al., 2012] which is designed to capture druglikeliness of a compound. Finally, some approacches use a model (e.g. a neural network) to predict bioactivity of the generated compounds Segler et al. [2018]. Similarly to logP, these two tasks are also possible to optimize while producing unrealistic molecules. This is aptly summarized in Coley et al.
[2019] as "The current evaluations for generative models do not reflect the complexity of real discovery problems." Interestingly, besides the aforementioned proxy tasks, more realistic proxy tasks are rarely used in the context of evaluating de novo drug design methods. This is in contrast to evaluation of generative models for generating photovoltaics [Pyzer-Knapp et al., 2015] or molecules with certain excitation energies [Sumita et al., 2018]. One notable exception is Aumentado-Armstrong [2018] who try to generate compounds that are active according to the DrugScore [Neudert and Klebe, 2011], and then evaluate the generated compounds using rDock [Ruiz-Carmona et al., 2014]. This lack of the overall diversity and realism in the typically used evaluation methods motivates us to propose our benchmark.

Docking-based benchmark
Our docking-based benchmark is defined by: (1) docking software that computes for a generated compound its pose in the binding site, (2) a function that scores the pose, (3) a training set of compounds with an already computed docking score.
The goal is to generate a given number of molecules that achieve the maximum possible docking score. For the sake of simplicity, we do not impose limits on the distance of the proposed compounds to training set. Thus a simple baseline is to return the training set. Finding similar compounds that have a higher docking score is already prohibitively challenging for current state-of-the-art methods. As the field progresses, our benchmark can be easily extended to account for the similarity between the generated compounds and the training set.
Finally, we would like to stress that the benchmark is not limited to de novo methods. The benchmark is applicable to any other approaches such as virtual screening. The only limitation required for a fair comparison is that docking is performed only on the supplied training set.

Instantiation
As a concrete instantiation of our docking-based benchmark, we use SMINA v. 2017.11.9 [Koes et al., 2013 due to its wide-spread use and being offered under a free license. To create the training set, we download from the ChEMBL [Gaulton et al., 2016] database molecules tested against popular drug-targets: 5-HT1B, 5-HT2B, ACM2, and CYP2D6. For 5-HT1B the final dataset consists in 1891 molecules, out of which 1148 are active (Ki < 100nm) and 743 are inactive molecules (Ki > 1000nm). We list sizes of the datasets in Table 1. We dock each molecule using default settings in SMINA to a manually selected binding site coordinate. Protein structures were downloaded from the Protein Database Bank, cleaned and prepared for docking using Schrödinger modeling package. The resulting protein structures are provided in our code repository. We describe further details on the preparation of the datasets in Appendix C.
Starting from the above, we define the following three variants of the benchmark. In the first variant, the goal is to propose molecules that achieve the smallest SMINA docking score used in score only mode, defined as follows: where all terms are computed based on the final docking pose. The first three terms measure the steric interaction between ligand and the protein. The fourth and the fifth term look for hydrophobic and hydrogen bonds between the ligand and the protein. We include in Appendix A a detailed description of all the terms.
Next, we propose two simpler variants of the benchmark based on individual terms in the SMINA scoring function. In the Repulsion task, the goal is to only minimize the repulsion component, which is defined as: is the distance between the atoms minus the sum of their van der Waals radii. Distance unit is Angstrom (10 −10 m).
The third task, Hydrogen Bond Task, is to maximize the non_dir_h_bond term: To make the results more stable, we average the score over the top 5 best-scoring binding poses. Finally, to make the benchmark more realistic, we filter the generated compounds using the Lipinski rule, and discard molecules with molecular weight lower than 100.

Novelty criterion
To avoid trivial solutions, we filter out generated compounds that are similar to the training set. To set the similarity threshold, we downloaded 10 6 compounds from the ZINC [Irwin and Shoichet, 2005] database and calculated their similarity to each of the training sets used in our benchmark. To evaluate similarity we used the ECFP [Rogers and Hahn, 2010] representation with the radius set to 2 and the number of bits of 1024. Based on the results, we simply set the similarity threshold to 0.2, which is approximately the 5th percentile of the similarity of ZINC molecules to each of the training sets.

ZINC baseline
The premise behind de novo drug discovery is that it enables access to structurally novel and potent molecules. To contextualize results in the benchmark, we included as the baseline sampling from the subset of ZINC database containing 9,204,719 molecules 1 . We selected molecules having following properties: 3D representation, standard reactivity, in-stock purchasability, ref pH, charges from -2 to 2 inclusive and used drug-like preferred subset. For each protein we have sampled a set of molecules from aforementioned ZINC subset of protein's training set size. In each task we compare to the mean score, the top 10%, and the top 1% of scores.

Diversity
To better understand the performance of each model, besides the mean score, we also evaluate the diversity of the proposed molecules. Concretely, we compute the mean Tanimoto distance between all pairs of molecules in the generated sample. We use the 1024-bit ECFP representation [Rogers and Hahn, 2010] with radius 2. The score is reported in the benchmark along with the docking score results. We observe that the optimized models narrow down to a less diverse subspace of compounds that are dissimilar to the training set. This can also be observed in the t-SNE plots of the generated compounds compared to the training set ( Figure 2). The small focused clouds of compounds generated using different optimization targets always concentrate at one side of the map. This suggests that there is similar bias of the model independent of the optimization target, which can be the ChEMBL prior of the REINVENT model. Besides that observation, we note that the generated compounds are less diverse, showing similarity only inside one generated batch (the same optimization target).

When is a task solved?
In the experiments, we compare to two baselines: (i) random compounds from ZINC as the baseline, and (ii) compounds from the training set. In each case we report the mean score, the top 10% of scores, and the top 1% of scores. We also report diversity of the results. Roughly speaking, we consider a given task solved if the generated molecules exceed the top 1% score found in the ZINC database, while achieving at least the same diversity as observed in the training set. This criterion is necessarily arbitrary. It is inspired by a natural baseline -sampling several thousands of compounds from the ZINC database.

Results and discussion
In this section, we evaluate three popular models for de novo drug design on our dockingbased benchmark.

Models
We compare three popular models for de novo drug design. Chemical Variational Autoencoder (CVAE) [Gómez-Bombarelli et al., 2018] applies Variational Autoencoder [Kingma and Welling, 2013] by representing molecules as strings of characters (using SMILES encoding). This approach was later extended by Grammar Variational Autoencoder (GVAE) [Kusner et al., 2017], which ensures that generated compounds are grammatically correct. The third model, REINVENT [Olivecrona et al., 2017], is a recurrent neural network trained using reinforcement learning for molecular optimization.

Experimental details
To generate active compounds, we follow a similar approach to one in Jin et al. [2018], disregarding the penalty for insufficient similarity. Analogous methods using sparse Gaussian Process instead of multilayer perceptron are also employed in Gómez-Bombarelli et al. [2018] and Kusner et al. [2017]. First, we fine-tune a given generative model for 5 epochs on the training set ligands, starting from weights made available by the authors 2 . All hyperparameters are set to default values used in Gómez-Bombarelli et al. [2018] and Kusner et al. [2017]. Additionally, we use the provided scores to train a multilayer perceptron (MLP) to predict the target (e.g. the SMINA scoring function) based on the latent space representation of the molecule.
For CVAE and GVAE, to generate compounds, we first take a random sample from the latent space by sampling from a Gaussian distribution with the standard deviation of 1 and the mean of 0. Starting from this point in the latent space, we take 50 gradient steps to optimize the output of the MLP. Based on this approach we generate 250 compounds from the model. For the REINVENT model, we use pretrained weights on the ChEMBL database provided by Olivecrona et al. [2017]. As there is no latent space in this model, we train a random forest model to predict the target directly from the molecule structure. We use the ECFP fingerprint to encode the molecule [Rogers and Hahn, 2010]. The reward is computed based on the random forest prediction multiplied by the QED score calculated using RDKit.  The key task is Docking Score Function in which the goal is to optimize docking score against a given protein target. Each cell reports the mean score for 250 generated molecules in each task. In the parenthesis, the internal diversity of generated molecules is reported (see text for details). The tested models tend to improve upon the mean score in the ZINC database (ZINC). However, they generally do not improve upon the top molecules from ZINC; ZINC (10%) and ZINC (1%) show the top 10% of scores and the top 1% of scores. Missing results ("-") indicate that the model failed to generate 250 molecules that satisfy drug-like filters (see text for details). Table 2 summarizes the results on all three tasks. Recall that we generally consider a given task solved if the generated molecules exceed the top 1% score found in the ZINC database, while achieving at least the same diversity as in the training set. Below we make several observations.

Results
Docking Score Function task The key task in the benchmark is Docking Score Function. We observe that CVAE and GVAE models fail to generate compounds that achieve higher docking score compared to the mean docking score in the ZINC dataset (−8.785 for 5-HT1B compared to −4.647 and −4.955 achieved by CVAE and GVAE, respectively). The REINVENT model achieves much better performance (−9.774 for 5-HT1B). However, while docking scores attained by the molecules generated by REINVENT generally outperform the mean docking in the ZINC dataset and the training set, they fall short of outperforming the top 10% molecules found in ZINC (−9.894 for 5-HT1B, with the exception of ACM2). We also draw attention to the fact that the generated molecules by REINVENT are markedly less diverse than the diversity of the training set (0.506 mean Tanimoto distance compared to 0.787 in the training set). These results suggest that generative models applied to de novo drug discovery might require substantial more data to generate well-binding compounds than is typically available for training. In the key Docking Score Function task, models generally fail to outperform the top 10% from the ZINC database. It should worry us that optimizing for docking score, which seems to be a simpler target than true biological binding affinity, is already challenging given realistically sized training sets (between 1193 and 4199 molecules).
Repulsion task Interestingly, REINVENT performs significantly worse than GVAE and CVAE on the Repulsion task. All models fail to outpeform top 10% found in the ZINC dataset. We observe markedly lower diversity of molecules generated by REINVENT compared to the training set.
Hydrogen Bonding task The Hydrogen Bonding task is the simplest, and both GVAE and REINVENT generate molecules that almost match the top 1% molecules found in the ZINC database and the training set. We again observe relatively low diversity of molecules generated by REINVENT. Generated molecules Figure 5 shows the best scoring molecules generated by REIN-VENT. We observe that optimizing each objective promotes different structural motifs. For example, the best scoring molecules in the Repulsion task are small, which intuitively enables them to easily fit into the binding pocket, achieving lower repulsion values than the top 1% molecules in the training set. Similarly, there are clear patterns visible in the top molecules of CVAE and GVAE. For example, CVAE generates macrocycles in the task of docking score optimization, while GVAE generates long chains with no cycles when optimizing the same objective. These models also create oxygen or nitrogen chains when optimizing Hydrogen Bonding, and very small molecules (often less than 3 heavy atoms) for the Repulsion task.
We noticed a moderately strong correlation between docking scores and the number of rotable bonds or molecular weight. Figures 6 and 7 show that with the increasing number of   rotable bonds or molecular weight, the docking scores improve. For the number of rotable bonds, the generated compounds are well mixed with the training data marginal distribution. On the other hand, the distribution of generated compounds is shifted towards better docking scores and smaller molecular weights in the case of the weight-to-docking-score relation. In other words, molecules achieve better docking scores at the same molecular weight after the optimization. The correlations are weaker for CYP2D6, which may be caused by a bigger binding site of this enzyme. However, the last observation about molecular weights holds.
When examining generated compounds from the chemical point of view, it should undeniably be stated that REINVENT produced the most consistent ligands with the highest possibility of desired biological activity. When different optimization approaches are considered, the best results were produced during the docking score optimization. Non-dir h-bond optimization produced compounds with sometimes a high number of moieties able to produce a hydrogen bond. In the repulsion task, the produced compounds are correct from the chemical point of view. The drug-likeliness of compounds produced by CVAE and GVAE is lower (although they still meet criteria included in the Lipiski Rule of Five), but they still can be used in the docking benchmark task.

Conclusion
As concluded by Coley et al. [2019], "the current evaluations for generative models do not reflect the complexity of real discovery problems". Motivated by this, we proposed a new, more realistic, benchmark tailored to de novo drug design, using docking score as the target to optimize. Code to evaluate new models is available at https://github.com/cieplinskitobiasz/smina-docking-benchmark.
Our results suggest that generative models applied to de novo drug discovery pipelines might require substantial more data to generate realistic compounds than is typically available for training. Despite using over 1000 compounds for training (between 1074 and 3780), the best docking scores generally do not outperform top 10% docking scores in the ZINC dataset. Docking score is only a simple proxy of the actual binding affinity, and as such it should worry us that it is already challenging to optimize.
On a more optimistic note, the tested models achieved much better performance on the simplest task in the benchmark, which is to optimize a single term in the SMINA scoring function involving the number of hydrogen bonds to the binding site. This suggests that producing compounds that optimize docking score based on the provided dataset is an attainable, albeit challenging, task. We hope our benchmark better reflects the complexity of real discovery problems and will serve as a stepping stone towards developing better de novo models for drug discovery.

B Model details
We include hyperparameters and training settings used in our models. Our code is available at https://github.com/cieplinski-tobiasz/smina-docking-benchmark.
MLP is used to predict docking score from CVAE or GVAE latent space representation of molecule. It is a simple feed forward neural network with one hidden layer. Hyperparameters of this model are listed in 3.    The REINVENT model was used with the default parameters of the original implemen-tation provided by Olivecrona et al. [2017]. The model consists of 3 GRU layers and was pretrained on the ChEMBL dataset. The RL agent was trained for 200 steps with a random forest scoring function. The hyperparameters of the random forest were chosen by a grid search, and the resulting configuration is shown in Table 6.

Parameter
Parameter number of estimators 500 criterion gini maximum number of features in splits sqrt(number of features)

C Dataset details
The compound sets were downloaded from the ChEMBL database [Gaulton et al., 2016]. All records referring to human-and rat-based records were taken into account. Compounds with Ki values below 100 nM (referred to as active compounds) and above 1000 nM (inactive ones) were taken into account. Only binding data were considered, and it was assumed that IC50 = Ki/2. The compound protonation states were generated for pH = 7.4.
The crystal structures for docking were fetched from the PDB database, the following structures were used in the study: 4IAQ for 5-HT1B, 4NC3 for 4-HT2B, 3UON for ACM2, and 3QM4 for CYP3D6. The Protein Preparation Wizard from the Schrödinger molecular modeling package was used for protein preparation for docking.